Abstract: | Recent advances in self-supervised learning point to its potential to replace traditional supervised learning, in particular because it addresses supervised learning's need for large amounts of labeled data and its poor generalization across tasks. Self-supervised learning pre-trains deep neural networks on easily obtainable unlabeled data and then fine-tunes them on downstream tasks, requiring far less labeled data than supervised learning. Notably, self-supervised learning has demonstrated success in many domains, including text, vision, and speech. In this thesis, we propose several novel self-supervised learning methods for visual representation learning that improve performance on multiple computer vision downstream tasks. These methods aim to generate learning targets from the input data itself. Our first method, HAPiCLR, exploits pixel-level information from an image's contextual representation, combined with a contrastive learning objective, to learn more effective image representations for downstream tasks. The second method, HARL, introduces a heuristic attention-based approach that maximizes abstract object-level embeddings in vector space, producing higher-quality semantic representations. Finally, the MVMA framework combines inputs from multiple data augmentations, exploiting both global and local information from each training sample to explore a wide range of image appearances; the resulting representations are highly robust to images at different scales, giving better generalization to downstream tasks and more efficient training. These methods significantly improve performance on tasks such as image classification, object detection, and semantic segmentation. They demonstrate the ability of self-supervised learning to extract image features, improving the effectiveness and efficiency of deep neural networks across a variety of computer vision tasks. Beyond introducing new learning algorithms, this thesis provides a comprehensive analysis of self-supervised representations, revealing the factors that distinguish different models. Overall, it presents a suite of innovative, efficient, and highly generalizable self-supervised learning methods that enable self-supervised models to generalize better to downstream tasks. ;Recent advances in self-supervised learning have shown promise as an alternative to supervised learning, particularly for addressing its critical shortcomings: the need for abundant labeled data and the inability to leverage prior knowledge and skills. Self-supervised learning involves pre-training deep neural networks on pretext tasks using easily acquirable, unlabeled data and then fine-tuning them on downstream tasks of interest, requiring less labeled data than supervised learning. Notably, self-supervised learning has demonstrated success in diverse domains, including text, vision, and speech. In this thesis, we present several novel self-supervised learning methods for visual representation learning that can improve the performance of multiple computer vision downstream tasks. These methods are designed to leverage the input data itself for generating learning targets. Our first method, HAPiCLR, leverages pixel-level information from an object's contextual representation with a contrastive learning objective, allowing it to learn more robust and efficient image representations for downstream tasks. The second method, HARL, introduces a heuristic attention-based approach that maximizes the abstract object-level embedding in vector space, resulting in higher-quality semantic representations.
Finally, the MVMA framework combines multiple augmentation pipelines, leveraging both global and local information from each training sample to explore a vast range of image appearances. This approach yields representations that are not only scale-invariant but also invariant to nuisance factors, making them more robust and efficient for downstream tasks. These methods have notably improved performance in tasks like image classification, object detection, and semantic segmentation. They demonstrate the ability of self-supervised algorithms to capture high-level image properties, thereby enhancing deep neural network efficiency in various computer vision tasks. This thesis not only introduces new learning algorithms but also provides a comprehensive analysis of self-supervised representations and the distinct factors that differentiate various models. Overall, it presents a suite of innovative, adaptable, and efficient approaches to self-supervised learning in image representation, significantly boosting the robustness and effectiveness of learned features. |