NCU Institutional Repository (中大機構典藏) — theses, past exams, journal articles, and research projects: Item 987654321/98575


    Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98575


    Title: FINE: Fine-Grained Image Understanding through Multimodal Contrastive Embedding Learning
    Authors: Lin, Tse-Hua (林澤華)
    Contributors: Department of Computer Science and Information Engineering
    Keywords: Unsupervised Learning; Contrastive Learning; Fine-Grained Clustering; Multi-modal Feature Fusion
    Date: 2025-08-16
    Issue Date: 2025-10-17 12:56:51 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: In recent years, the rapid development of deep learning techniques has led to significant progress in computer vision tasks such as classification, detection, and segmentation. However, most existing approaches still heavily rely on large amounts of annotated data. This dependency becomes a major limitation in application scenarios where annotation is costly, such as medical image analysis or industrial product inspection. To alleviate the reliance on manual labeling, unsupervised learning has gained increasing attention. Among such methods, contrastive learning, an effective unsupervised approach, has been shown to learn discriminative feature representations.
    Nevertheless, conventional contrastive learning methods face significant challenges when applied to "tiny and fine-grained" data. Such data typically consist of small targets embedded in complex backgrounds with subtle inter-class differences, making it difficult for models to extract meaningful features and thereby degrading overall performance. To address this challenge, we propose FINE: Fine-Grained Image Understanding through Multimodal Contrastive Embedding Learning, specifically designed for "tiny and fine-grained" data. Our method adopts an encoder-decoder architecture to generate auxiliary images that emphasize small target regions, thereby facilitating more effective feature extraction.
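The role of the auxiliary image can be illustrated with a toy sketch. This is not the thesis's learned encoder-decoder generator; it merely shows the intended effect, assuming a rough target mask is available: re-weight the input so the small target region dominates the background.

```python
import numpy as np

def auxiliary_image(image, mask, background_weight=0.2):
    # Toy stand-in for the learned generator: keep the target region at full
    # intensity and attenuate everything else, so the tiny target dominates.
    weights = np.where(mask > 0, 1.0, background_weight)
    return image * weights

img = np.ones((8, 8))          # uniform dummy image
mask = np.zeros((8, 8))
mask[3:5, 3:5] = 1.0           # a tiny 2x2 "target" region
aux = auxiliary_image(img, mask)
```

In the real method, the encoder-decoder learns where the target is; this sketch only makes the downstream benefit concrete: the feature extractor sees an image in which the background no longer dominates.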
    Additionally, we design a Multi-modal Contrastive Learning Feature Extract Block (MCLFE), which integrates a multi-branch feature extraction module, an attention module, and a feature fusion module. This block, together with a contrastive learning strategy, jointly optimizes the feature extractor and the clustering centers using Instance Loss (IL) and Center Loss (CL), respectively.
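As a hedged sketch of how the two losses could look (the exact formulations in the thesis may differ; this assumes an NT-Xent-style instance loss over two augmented views and a squared-distance center loss over current cluster assignments):

```python
import numpy as np

def instance_loss(z1, z2, temperature=0.5):
    # NT-Xent-style instance-level loss: each sample's embedding in view 1
    # should be most similar to the same sample's embedding in view 2.
    # z1, z2: (N, D) L2-normalized embeddings; positives sit on the diagonal.
    sim = (z1 @ z2.T) / temperature
    sim -= sim.max(axis=1, keepdims=True)                # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def center_loss(z, centers, assignments):
    # Mean squared distance of each embedding to its assigned cluster center;
    # minimizing it pulls embeddings toward compact clusters.
    return np.mean(np.sum((z - centers[assignments]) ** 2, axis=1))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)
il = instance_loss(z, z)       # identical views, as a smoke test
centers = np.stack([z[:4].mean(axis=0), z[4:].mean(axis=0)])
cl = center_loss(z, centers, np.array([0, 0, 0, 0, 1, 1, 1, 1]))
```

IL shapes the embedding space instance by instance, while CL anchors the learned clustering centers, which matches the division of labor described above.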
    In our experiments, we evaluate the proposed method on a private production-line panel dataset and four public datasets: the private panel dataset A19, the Retina fundus dataset [1], the NEU surface defect dataset [2], the MVTec AD industrial anomaly dataset [3], and the CIFAR-10 dataset [4]. On the Retina dataset, our method improves clustering accuracy by more than 25% when provided with high-quality auxiliary images, demonstrating its effectiveness in handling tiny and fine-grained features in medical imaging. On the general classification dataset, although our method improves ARI by only about 4%, it still outperforms most mainstream approaches, indicating stable and consistent clustering capability even without additional priors. On the NEU dataset, our method achieves the best performance across all three evaluation metrics, further validating its ability to identify prominent anomalies effectively.
    Moreover, we evaluate our method on two distinct defect types, structural and textural, within the MVTec AD dataset. The results show that our method outperforms all baseline approaches on structural defects, showcasing its robustness and discriminative power in clustering structural anomalies. Although its performance on textural defects is slightly inferior to the best-performing method, due to the challenges posed by texture anomalies and the limitations of the auxiliary modality, the overall results remain consistent and promising, underscoring the method's potential in diverse industrial inspection scenarios.
    Finally, on the private production-line panel dataset, our method, when combined with auxiliary images, surpasses existing methods in NMI, ARI, and ACC. While the improvement margins are modest, the performance is notably more stable, highlighting the practical applicability and potential of our approach in real-world industrial environments.
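For reference, the ACC reported above is clustering accuracy under the best one-to-one mapping between predicted cluster ids and ground-truth labels. It is usually computed with the Hungarian algorithm; the brute-force version below is an illustrative sketch that works for a small number of clusters.

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    # Clustering ACC: accuracy under the best one-to-one relabeling of the
    # predicted cluster ids. Brute force over permutations; fine for small k.
    labels = np.unique(y_true)
    clusters = np.unique(y_pred)
    best = 0.0
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        acc = np.mean([mapping[p] == t for p, t in zip(y_pred, y_true)])
        best = max(best, acc)
    return best

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])   # same partition, permuted cluster ids
acc = clustering_accuracy(y_true, y_pred)
```

Because cluster ids carry no inherent meaning, a perfect partition with permuted ids still scores 1.0, which is why ACC (rather than raw accuracy) is the standard clustering metric alongside NMI and ARI.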
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation



    All items in NCUIR are protected by copyright, with all rights reserved.
