Abstract: In recent years, the rapid development of deep learning techniques has led to significant
 progress in computer vision tasks such as classification, detection, and segmentation. However,
most existing approaches still heavily rely on large amounts of annotated data. This dependency
becomes a major limitation in application scenarios where annotation is costly, such as
 medical image analysis or industrial product inspection. To alleviate the reliance on manual
labeling, unsupervised learning has gained increasing attention. Among these methods,
contrastive learning, an effective unsupervised approach, has been shown to learn
discriminative feature representations.
Nevertheless, conventional contrastive learning methods face significant challenges when
applied to "tiny and fine-grained" data. Such data typically consist of small targets embedded
in complex backgrounds with subtle inter-class differences, making it difficult for models to
extract meaningful features and thereby degrading overall performance. To address this
challenge, we propose FINE: Fine-Grained Image Understanding through Multimodal Contrastive
Embedding Learning, specifically designed for "tiny and fine-grained" data. Our method adopts
an encoder-decoder architecture to generate auxiliary images that emphasize small target
regions, thereby facilitating more effective feature extraction.
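As a rough, minimal sketch of this idea (the channel sizes, depth, and sigmoid output are our illustrative assumptions, not the architecture used in this work), an encoder-decoder that maps an input image to a same-sized auxiliary map could look like:

import torch
import torch.nn as nn

class AuxiliaryGenerator(nn.Module):
    """Minimal encoder-decoder sketch: compresses the input image and
    reconstructs a same-sized single-channel auxiliary map that can
    emphasize small target regions. All hyperparameters here are
    assumptions for illustration, not the thesis's exact design."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            nn.Sigmoid(),  # auxiliary map with values in [0, 1]
        )

    def forward(self, x):
        # Input (N, in_ch, H, W) with H, W divisible by 4;
        # output (N, 1, H, W) highlighting candidate target regions.
        return self.decoder(self.encoder(x))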
Additionally, we design a Multi-modal Contrastive Learning Feature Extract Block (MCLFE),
which integrates a multi-branch feature extraction module, an attention module, and a feature
fusion module. This module, together with a contrastive learning strategy, jointly optimizes
the feature extractor and the clustering centers using Instance Loss (IL) and Center Loss (CL),
respectively.
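The abstract names Instance Loss (IL) and Center Loss (CL) without giving their formulas; a common instantiation, assumed here purely for illustration, pairs a SimCLR-style InfoNCE instance loss over two augmented views with a pull term toward learnable cluster centers:

import torch
import torch.nn.functional as F

def instance_loss(z1, z2, temperature=0.5):
    """InfoNCE-style instance loss over two augmented views.
    z1, z2: (N, D) embeddings of the same N images under two views.
    A standard SimCLR-style formulation, assumed here; the thesis's
    exact IL may differ."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                 # (2N, D)
    sim = z @ z.t() / temperature                  # (2N, 2N) similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))          # drop self-similarity
    # The positive of sample i is its other view at index i + n (mod 2N).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def center_loss(z, centers, assignments):
    """Pulls each embedding toward its assigned cluster center.
    centers: (K, D) learnable centers; assignments: (N,) indices.
    Again an assumed, minimal form of CL."""
    return ((z - centers[assignments]) ** 2).sum(dim=1).mean()

A weighted sum of the two terms, e.g. IL + λ·CL with a hypothetical weight λ, would then train the feature extractor and the centers jointly.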
In our experiments, we evaluate the proposed method on the private industrial panel
dataset A19 and four public datasets: the Retina fundus dataset [1], the NEU surface
defect dataset [2], the MVTec AD industrial anomaly dataset [3], and the CIFAR-10
 dataset [4]. On the Retina dataset, our method achieves over a 25% improvement in clustering
 accuracy when provided with high-quality auxiliary images, demonstrating its effectiveness in
 handling tiny and fine-grained features in medical imaging. On general classification datasets,
our method still outperforms most mainstream approaches, with an approximate 4%
improvement in ARI, indicating its stable and consistent clustering capability even without
additional priors. For the NEU dataset, our method achieves the best performance across all three
 evaluation metrics, further validating its ability to identify prominent anomalies effectively.
Moreover, we evaluate our method on two distinct defect types, structural and textural,
within the MVTec AD dataset. The results show that our method outperforms all baseline
approaches on structural defects, showcasing its robustness and discriminative power in
clustering structural anomalies. Although its performance on textural defects is slightly
inferior to the best-performing method due to the challenges posed by texture anomalies and
auxiliary-modality limitations, the overall results remain consistent and promising,
underscoring the method's potential in diverse industrial inspection scenarios.
 Finally, on the private industrial panel dataset, our method, when combined with auxiliary
 images, surpasses existing methods in terms of NMI, ARI, and ACC. While the improvement
margins are modest, the performance is notably more stable, highlighting the practical
applicability and potential of our approach in real-world industrial environments.
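For reference, the three clustering metrics reported above (NMI, ARI, ACC) can be computed with scikit-learn plus a Hungarian matching step for ACC; this is the standard recipe, independent of the thesis's own evaluation code:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best one-to-one mapping between predicted
    cluster ids and ground-truth labels, found with the Hungarian
    algorithm. Labels are assumed to be 0-indexed integers."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    rows, cols = linear_sum_assignment(-count)  # maximize matched counts
    return count[rows, cols].sum() / len(y_true)

def evaluate(y_true, y_pred):
    """Returns the three metrics reported in this work."""
    return {
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "ARI": adjusted_rand_score(y_true, y_pred),
        "ACC": clustering_accuracy(y_true, y_pred),
    }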