Zero-Shot Anomaly Segmentation for Industrial Defect Inspection via Semantic Prompt Generation


Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98534

Title:	利用語意提示生成之工業瑕疵檢測零樣本異常分割;Zero-Shot Anomaly Segmentation for Industrial Defect Inspection via Semantic Prompt Generation
Authors:	林宣彤;Lin, Hsuan-Tung
Contributors:	Department of Computer Science and Information Engineering
Keywords:	Zero-shot Anomaly Detection;Industrial Defect Detection;Large Language Models
Date:	2025-08-07
Issue Date:	2025-10-17 12:53:46 (UTC+8)
Publisher:	National Central University
Abstract:	Defect detection systems that combine image sensing with deep learning are widely deployed on industrial production lines. Although these systems improve the recognition of anomalous products, they remain constrained by the scarcity of anomalous samples, which are difficult to collect, so trained models struggle to cover diverse and highly variable defect scenarios. Zero-Shot Anomaly Detection (ZSAD) has therefore become a research focus in recent years; it aims to identify defects, including types never seen before, without relying on any anomalous samples during training. Vision-Language Models (VLMs) offer an effective route to this goal: they judge whether a region is anomalous by comparing the semantic similarity between the input image and text prompts, which improves generalization. However, most existing methods use fixed or generic text features that may not adapt to the content of a specific image, and the resulting semantic mismatch reduces the accuracy and stability of anomaly detection. To address this issue, we propose a ZSAD method that combines Large Language Models (LLMs) with VLMs.
We use Gemini in the preprocessing stage to generate constructive prompts from ground-truth masks; these prompts guide the Gemma-2-2B model within the framework to produce accurate text prompts based on the semantic content of the input image. A VLM then computes the semantic similarity between the image and the text prompts to produce an anomaly map that localizes potential defect regions. Because the prompt content is adjusted dynamically to the semantic features of each image, the method strengthens cross-modal semantic alignment relative to conventional static prompts and improves the model's attention to anomalous regions. Experiments show that on the MVTec-AD dataset the proposed method achieves the best performance on all four main evaluation metrics, demonstrating strong defect localization capability. On the VisA dataset it achieves the best PRO score, while the other metrics rank second by a small margin behind the best-performing method. These results indicate that the proposed method can reliably identify many types of defects even when the anomalous samples have not been seen before.
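The similarity-to-anomaly-map step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `anomaly_map`, the `temperature` value, the 4x4 patch grid, and the synthetic embeddings are all assumptions made for illustration, and the two-prompt softmax (one "normal" and one "anomalous" text embedding) is a scoring scheme commonly used with CLIP-style models that the abstract does not confirm. In a real pipeline the patch and text embeddings would come from the VLM encoders.

```python
import numpy as np

def anomaly_map(patch_feats, text_feats, grid_hw, temperature=0.07):
    """Score each image patch against a [normal, anomalous] prompt pair.

    patch_feats: (H*W, D) patch embeddings (assumed to come from a VLM image encoder)
    text_feats:  (2, D) text embeddings, row 0 = normal prompt, row 1 = anomalous prompt
    grid_hw:     (H, W) layout of the patch grid
    Returns an (H, W) map with the per-patch probability of the anomalous prompt.
    """
    h, w = grid_hw
    # L2-normalise so that dot products become cosine similarities.
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = (p @ t.T) / temperature            # shape (H*W, 2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)    # softmax over the two prompts
    return probs[:, 1].reshape(h, w)

# Synthetic demo (hypothetical data): every patch matches the normal prompt
# except the patch at row 1, column 2, which matches the anomalous prompt.
dim = 8
normal_dir = np.eye(dim)[0]
anomalous_dir = np.eye(dim)[1]
patches = np.tile(normal_dir, (16, 1))
patches[1 * 4 + 2] = anomalous_dir
texts = np.stack([normal_dir, anomalous_dir])

amap = anomaly_map(patches, texts, grid_hw=(4, 4))
peak = np.unravel_index(amap.argmax(), amap.shape)  # location of the highest score
```

Upsampling the patch-level map to the input resolution and thresholding it would give a segmentation mask; both steps are omitted here.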
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

