With the rapid advancement of semiconductor process technology and the continued shrinking of device feature sizes, even minor wafer defects can critically impact product yield and process stability. Traditional wafer inspection has relied mainly on manual Scanning Electron Microscopy (SEM) review and Automated Optical Inspection (AOI), and although many recent studies have applied machine learning and deep learning to defect classification, these systems generally cannot produce semantic descriptions of defects, making it difficult for engineers to analyze root causes and optimize processes efficiently. To address these limitations, this study proposes a three-stage wafer defect recognition framework based on the Vision Transformer (ViT), integrating anomaly detection, defect classification, and semantic image captioning to improve both detection performance and the readability of inspection results.

The experiments use the publicly available WM-811K wafer defect dataset; after data standardization and class imbalance handling, the classification and captioning models were trained separately. In Stage 1, the ViT model achieved 94.97% accuracy and 96.21% recall on the test set for anomaly detection, indicating high sensitivity in flagging defective wafers. In Stage 2, the ViT-GPT2 model reached 97.7% accuracy on the defect classification task, with strong precision, recall, and F1-scores across defect categories. Stage 3 evaluated the quality of the generated language: the model produced syntactically correct and semantically consistent defect captions, successfully achieving cross-modal mapping from images to text.

Overall, this study demonstrates the effectiveness and application potential of Transformer architectures for wafer defect recognition. The framework improves defect classification and anomaly detection performance and, by attaching semantic explanations to inspection results, offers a new solution for intelligent manufacturing and explainable semiconductor quality control.
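To make the three-stage architecture concrete, the following is a minimal Python sketch using the HuggingFace transformers library. It is an illustration under stated assumptions, not the thesis's implementation: it uses generic pretrained checkpoints ("google/vit-base-patch16-224-in21k" and "nlpconnect/vit-gpt2-image-captioning") rather than the study's fine-tuned weights, assumes label index 0 means "normal" in Stage 1, and substitutes a plain ViT classification head for Stage 2, whereas the abstract reports a ViT-GPT2 model for that task. The class imbalance handling and data standardization described above would be applied during fine-tuning and are not shown here.

import torch
from PIL import Image
from transformers import (
    ViTForImageClassification,
    ViTImageProcessor,
    VisionEncoderDecoderModel,
    GPT2TokenizerFast,
)

# Shared preprocessing; the thesis's own normalization of WM-811K maps may differ.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

# Stage 1: binary anomaly detection (normal vs. defective wafer map).
stage1 = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=2
)

# Stage 2: defect-pattern classification; WM-811K defines 8 defect patterns
# (Center, Donut, Edge-Loc, Edge-Ring, Loc, Random, Scratch, Near-full).
stage2 = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=8
)

# Stage 3: ViT encoder + GPT-2 decoder for defect caption generation.
captioner = VisionEncoderDecoderModel.from_pretrained(
    "nlpconnect/vit-gpt2-image-captioning"
)
tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning")


def inspect(wafer_map: Image.Image) -> dict:
    """Run one wafer-map image through the three stages."""
    pixels = processor(
        images=wafer_map.convert("RGB"), return_tensors="pt"
    ).pixel_values
    with torch.no_grad():
        # Stage 1: pass wafers judged normal through untouched
        # (label 0 = normal is an assumption, not the thesis's mapping).
        if stage1(pixel_values=pixels).logits.argmax(-1).item() == 0:
            return {"status": "normal"}
        # Stage 2: assign one of the defect-pattern classes.
        defect_id = stage2(pixel_values=pixels).logits.argmax(-1).item()
        # Stage 3: generate a natural-language description of the defect.
        caption_ids = captioner.generate(pixels, max_new_tokens=32)
    caption = tokenizer.decode(caption_ids[0], skip_special_tokens=True)
    return {"status": "defective", "defect_class": defect_id, "caption": caption}

The staged design mirrors the abstract's rationale: the cheap binary Stage 1 filters out the large majority of normal wafers (the dominant class in WM-811K), so the finer-grained classifier and the captioning decoder only run on the anomalous minority.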