Training a high-accuracy deep learning model depends not only on the model architecture and training method but also on a large amount of training data. In conventional supervised learning, that data must carry high-quality labels, which makes training very costly and becomes a barrier to reaching high accuracy. Self-supervised learning has been proposed in recent years to alleviate this problem: a model is first pre-trained on a large amount of easily obtained unlabeled data and then fine-tuned on a very small amount of labeled data, which yields high accuracy while reducing the cost of manual annotation. In computer vision, most recent self-supervised methods compute a contrastive loss on features extracted from the entire image, maximizing the similarity between features of the same image in the embedding space. This instance-level training works well for tasks that use whole-image features, such as classification, but it is less suitable for tasks that depend on differences between pixels, such as object detection and instance segmentation. We therefore propose Heuristic Attention Pixel-Level Contrastive Learning (HAPiCL), a pixel-level contrastive learning method based on mask attention. HAPiCL generates a binary foreground mask for each image with an unsupervised method and uses the mask to split the feature map produced by the encoder into foreground and background features. During training, the model computes a pixel-level contrastive loss from the foreground and background feature vectors. This improves performance on object detection and instance segmentation.
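
The mask-guided loss described above can be illustrated with a small sketch. The abstract does not give the exact loss, so everything here is an assumption made for illustration: the function names (`masked_mean`, `info_nce`, `mask_guided_loss`), the temperature of 0.2, the InfoNCE form, and the simplification of averaging the masked pixels into one foreground and one background vector per view instead of comparing individual pixels. It should not be read as the HAPiCL implementation.

```python
import math

# Hypothetical sketch: feature maps are nested lists indexed [row][col][channel],
# masks are nested lists indexed [row][col] with 1 = foreground, 0 = background.

def masked_mean(features, mask, keep):
    """Average the feature vectors at pixels whose mask value equals `keep`."""
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m == keep:
                count += 1
                for i, v in enumerate(vec):
                    total[i] += v
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)

def info_nce(anchor, positive, negatives, temperature=0.2):
    """Negative log-softmax of the positive pair among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    top = max(logits)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]

def mask_guided_loss(feat_a, mask_a, feat_b, mask_b, temperature=0.2):
    """Assumed loss: foregrounds of two views attract, backgrounds act as
    negatives, and symmetrically for the backgrounds."""
    fg_a, bg_a = masked_mean(feat_a, mask_a, 1), masked_mean(feat_a, mask_a, 0)
    fg_b, bg_b = masked_mean(feat_b, mask_b, 1), masked_mean(feat_b, mask_b, 0)
    fg_term = info_nce(fg_a, fg_b, [bg_a, bg_b], temperature)
    bg_term = info_nce(bg_a, bg_b, [fg_a, fg_b], temperature)
    return 0.5 * (fg_term + bg_term)

# Toy 2x2 feature map with 2 channels: left column is "object", right is "scene".
feat = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 0.0], [0.0, 1.0]]]
good_mask = [[1, 0], [1, 0]]   # mask agrees with the features
bad_mask = [[0, 1], [0, 1]]    # mask with foreground and background swapped

aligned = mask_guided_loss(feat, good_mask, feat, good_mask)
swapped = mask_guided_loss(feat, good_mask, feat, bad_mask)
print(aligned, swapped)
```

In this toy setup the loss is low when both views use a mask that matches the features and high when one mask is inverted, which is the behavior a mask-guided contrastive objective is meant to reward. A real training loop would compute the same quantities on batched tensors in a deep learning framework.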