NCU Institutional Repository (中大機構典藏): Item 987654321/98091


    Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98091


    Title: 開放式詞彙物件偵測之影像與文字特徵對齊度提升;Improved Image–Text Feature Alignment for Open Vocabulary Object Detection
    Authors: 王俊諺;Wang, Chun-Yen
    Contributors: 通訊工程學系 (Department of Communication Engineering)
    Keywords: 開放式詞彙物件偵測;視覺語言模型;影像文字特徵對齊;餘弦對齊損失;代理注意力;open-vocabulary object detection;vision-language models;image-text feature alignment;cosine alignment loss;agent attention
    Date: 2025-07-10
    Issue Date: 2025-10-17 12:20:16 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: In open-vocabulary object detection, models must detect object categories that do not appear in the training set, so additional textual information is supplied to the model, and the classification head classifies objects by computing the similarity between image and text features. However, existing open-vocabulary detectors still leave room for improvement in aligning image and text representations. To improve this alignment, this thesis first proposes a cosine alignment loss, which computes the cosine similarity between image embeddings and text embeddings and thereby encourages the model to produce image embeddings that are better aligned with the text. This loss works together with existing agent attention in the multi-modal fusion module: agent attention first strengthens the region representations of the image embeddings, and the image embeddings are then fused with information from the text embeddings, further improving image-text alignment. Compared with the existing YOLO-World-S (76.33M parameters), the proposed approach adds 0.44M parameters, lowers FPS by 0.057, and adds 0.47G MACs of computation. Experimental results show that, starting from pre-trained weights and fine-tuning on the OV-LVIS dataset, the overall AP improves by 0.7% and the novel-category AP_r by 1.3%; trained on the OV-COCO dataset, the proposed approach improves both AP and AP_novel by 0.3% over YOLO-World-S.
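    The two techniques named in the abstract can be sketched briefly. The following is an illustrative PyTorch reconstruction based only on the abstract's description, not the thesis's actual code; the tensor shapes, the pooling used to form agent tokens, and all function names are assumptions.

        import torch
        import torch.nn.functional as F

        # Cosine alignment loss: pull each region's image embedding toward the
        # text embedding of its matched category (hypothetical sketch; the
        # matching and loss weighting are not specified in the abstract).
        def cosine_alignment_loss(image_emb: torch.Tensor,
                                  text_emb: torch.Tensor) -> torch.Tensor:
            # image_emb: (N, D) embeddings of N matched regions.
            # text_emb:  (N, D) text embeddings of the corresponding categories.
            cos_sim = F.cosine_similarity(image_emb, text_emb, dim=-1)  # (N,)
            # Maximizing cosine similarity is minimizing (1 - similarity).
            return (1.0 - cos_sim).mean()

        # Minimal single-head agent attention: a small set of agent tokens first
        # aggregates global context from the keys/values, then each image token
        # reads that context back from the agents. The published method adds
        # depthwise convolution and bias terms, omitted here for brevity.
        def agent_attention(q, k, v, num_agents: int = 49):
            # q, k, v: (B, N, D) image-token embeddings.
            B, N, D = q.shape
            scale = D ** -0.5
            # Agent tokens pooled from the queries (one common choice).
            a = F.adaptive_avg_pool1d(q.transpose(1, 2), num_agents).transpose(1, 2)
            # 1) Agents attend over all keys and aggregate the values: (B, M, D).
            agent_v = torch.softmax((a @ k.transpose(1, 2)) * scale, dim=-1) @ v
            # 2) Image tokens attend to the agents to recover context: (B, N, D).
            return torch.softmax((q @ a.transpose(1, 2)) * scale, dim=-1) @ agent_v

    In the pipeline the abstract describes, the fused image embeddings would then be compared against the text embeddings in the classification head, with the cosine alignment loss added alongside the detector's usual training objectives.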
    Appears in Collections: [Graduate Institute of Communication Engineering] Electronic Thesis & Dissertation

    Files in This Item:

    File: index.html | Size: 0 KB | Format: HTML


    All items in NCUIR are protected by copyright, with all rights reserved.
