Heuristic Attention Pixel-Level Contrastive Loss for Self-supervised Visual Representation Learning

Name

Shen-Hsuan Liu (劉慎軒)

Department

Department of Computer Science and Information Engineering

Thesis title

Heuristic Attention Pixel-Level Contrastive Loss for Self-supervised Visual Representation Learning
(Original Chinese title: 基於遮罩注意力之像素級對比式自監督式學習)

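This electronic thesis is authorized for immediate open access.
Once open, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without permission, to avoid violating the law.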

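Abstract (translated from Chinese)

In deep learning, high accuracy depends not only on the design of the model architecture and training method but also on a large amount of training data. In conventional supervised learning, however, a large amount of training data implies a large number of high-quality labels, which makes training a model very costly. In recent years, researchers have therefore proposed self-supervised learning: a model is pre-trained on large amounts of easily obtained unlabeled data and then fine-tuned on a very small amount of labeled data, reaching high accuracy while reducing the cost of manual annotation.
Most recent self-supervised methods in computer vision compute a contrastive loss on features of the whole image, maximizing the similarity between features of the same image in the embedding space. This instance-level training works well for tasks that use whole-image features, such as classification, but is less suitable for tasks that depend on differences between pixels, such as object detection or instance segmentation. This thesis therefore proposes a pixel-level contrastive learning method based on mask attention, Heuristic Attention Pixel-Level Contrastive Learning (HAPiCL). An unsupervised method generates a foreground mask for each image, the mask divides the feature map produced by the encoder into foreground and background features, and a pixel-level contrastive loss is computed from the foreground and background feature vectors to improve accuracy on object detection and segmentation tasks.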
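The instance-level contrastive loss mentioned in this abstract can be illustrated with a minimal sketch. It assumes an InfoNCE-style formulation with cosine similarity and a temperature, as commonly used in the contrastive learning literature; the function names `cosine` and `info_nce`, the temperature value, and the toy vectors are illustrative assumptions, not details taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one query embedding (hypothetical helper).

    The loss is low when the query is similar to its positive (another
    view of the same image) and dissimilar to the negatives (other images).
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]  # equals -log(softmax probability of the positive)


# Toy example: the first pair is well aligned, the second is not.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this toy case `aligned` is close to zero and `misaligned` is large, which is the behavior that pulls two views of the same image together in the embedding space.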

Abstract (English)

Training a high-accuracy deep learning model depends on several factors, such as the model architecture and the training method. A large, high-quality labeled dataset is also necessary. However, collecting such a large-scale, high-quality dataset is often prohibitively expensive, which becomes a barrier to training high-accuracy models under supervised learning. Self-supervised learning has recently been proposed to address this: a deep learning model is pre-trained on an unlabeled dataset and then fine-tuned on a small labeled dataset to reach high accuracy, which alleviates the labeling cost.
In self-supervised learning, most previous works compute the contrastive loss on features extracted from the entire image. Such instance-level objectives are well suited to classification, but they are not ideal for tasks that require pixel-level information, such as object detection and instance segmentation. We therefore propose a pixel-level contrastive learning method based on mask attention, called Heuristic Attention Pixel-Level Contrastive Learning (HAPiCL). In HAPiCL, a binary mask generated by an unsupervised method splits the features of the input image into foreground and background features. During training, the model computes a pixel-level contrastive loss on these foreground and background features. This method yields better performance in object detection and instance segmentation.
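The mask-based step described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the loss defined in the thesis: it assumes two augmented views whose feature maps are flattened into lists of per-pixel vectors, that pixels of the same mask class in the other view act as positives, and that pixels of the opposite class act as negatives. The names `split_by_mask` and `masked_pixel_contrastive_loss` and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity with a small epsilon to guard against zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def split_by_mask(features, mask):
    """Split flattened per-pixel features into (foreground, background).

    `features` is a list of H*W feature vectors and `mask` is a list of
    H*W binary values (1 = foreground, 0 = background).
    """
    fg = [f for f, m in zip(features, mask) if m == 1]
    bg = [f for f, m in zip(features, mask) if m == 0]
    return fg, bg


def masked_pixel_contrastive_loss(feats_a, mask_a, feats_b, mask_b,
                                  temperature=0.3):
    """Assumed pixel-level loss: same-class pixels across views attract,
    opposite-class pixels repel. Returns the mean loss over query pixels."""
    fg_a, bg_a = split_by_mask(feats_a, mask_a)
    fg_b, bg_b = split_by_mask(feats_b, mask_b)
    losses = []
    for queries, positives, negatives in ((fg_a, fg_b, bg_b),
                                          (bg_a, bg_b, fg_b)):
        if not positives or not negatives:
            continue  # skip a class that is missing from the other view
        for q in queries:
            pos = sum(math.exp(cosine(q, p) / temperature) for p in positives)
            neg = sum(math.exp(cosine(q, n) / temperature) for n in negatives)
            losses.append(-math.log(pos / (pos + neg)))
    return sum(losses) / len(losses) if losses else 0.0


# Toy 2x2 feature maps (flattened): foreground pixels point along [1, 0].
mask = [1, 1, 0, 0]
view_a = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
view_b_consistent = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
view_b_swapped = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]

loss_consistent = masked_pixel_contrastive_loss(view_a, mask, view_b_consistent, mask)
loss_swapped = masked_pixel_contrastive_loss(view_a, mask, view_b_swapped, mask)
```

With these toy inputs, `loss_consistent` is smaller than `loss_swapped`, since the consistent view keeps foreground features aligned with foreground and background with background. A practical implementation would operate on batched tensors in a deep learning framework; plain Python lists are used here only to keep the sketch self-contained.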

Keywords

★ Deep learning
★ Self-supervised learning
★ Representation learning

Table of Contents

1. Introduction
1-1. Research Background and Motivation
1-2. Research Objectives
1-3. Thesis Organization
2. Literature Review
2-1. SimCLR
2-2. MoCo and MoCo V2
2-3. BYOL
2-4. Pixel-Level Consistency
3. Method
3-1. Unsupervised Mask Generation
3-2. Mask Cropping
3-3. Mask Pixel-Level Contrastive Loss
4. Experimental Results and Discussion
4-1. Datasets
4-2. Evaluation Methods
4-2-1. Linear Evaluation
4-2-2. Fine-tuning Procedure
4-2-3. Transfer Learning
4-3. Experimental Results
4-3-1. Linear Evaluation
4-3-2. Fine-tuning Procedure
4-3-3. Transfer Learning
4-4. Additional Experimental Results
4-4-1. Effect of Adding an Instance-Level Component
4-4-2. Comparison of Different Cropping Methods
4-4-3. Comparison of Mask Cropping Foreground Rates
4-4-4. Pre-training Results on ImageNet100
4-4-5. Comparison of ConvMLP Output Sizes
4-4-6. Comparison of Batch Sizes
5. Conclusion and Future Work
5-1. Conclusion
5-2. Future Work
6. References

Advisor

Jia-Ching Wang (王家慶)

Approval date

2022-8-15
