

    Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98091


    Title: Improved Image–Text Feature Alignment for Open Vocabulary Object Detection
    Author: Wang, Chun-Yen (王俊諺)
    Contributor: Department of Communication Engineering
    Keywords: open-vocabulary object detection; vision-language models; image-text feature alignment; cosine alignment loss; agent attention
    Date: 2025-07-10
    Upload time: 2025-10-17 12:20:16 (UTC+8)
    Publisher: National Central University
    Abstract: In open-vocabulary object detection, models must detect object categories that do not appear in the training set, so additional textual information has to be supplied to the model. The classification head classifies objects in an image by computing the similarity between image and text features. However, existing open-vocabulary detectors remain limited in how well they align image and text representations. To improve this alignment, this thesis first proposes a cosine alignment loss, which computes the cosine similarity between image embeddings and text embeddings and thereby encourages the model to produce image embeddings that are better aligned with the text embeddings. In addition, the loss works in concert with the existing agent attention in the multi-modal fusion module: the agent attention first strengthens the region representation of the image embeddings, which are then fused with information from the text embeddings, further improving image-text alignment. Compared with the existing YOLO-World-S (76.33M parameters), the proposed approach adds 0.44M parameters, lowers FPS by 0.057, and adds 0.47G MACs of computation. Experimental results show that, when fine-tuned on the OV-LVIS dataset with pre-trained weights, the proposed method improves overall AP by 0.7% and novel-category AP_r by 1.3%; when trained on the OV-COCO dataset, it improves both AP and AP_novel by 0.3% over YOLO-World-S.
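    The abstract describes two mechanisms: a cosine alignment loss over paired image/text embeddings, and a fusion order in which region (image) embeddings are first refined and then injected with text information. The PyTorch sketch below illustrates both under stated assumptions; it is not the thesis implementation. The pairing of each region embedding with its class-name text embedding, the tensor shapes, and the use of standard multi-head attention as a stand-in for agent attention are all assumptions made for illustration.

    # Minimal sketch (not the thesis code) of the two ideas in the abstract.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def cosine_alignment_loss(image_embeds: torch.Tensor,
                              text_embeds: torch.Tensor) -> torch.Tensor:
        """Encourage each region's image embedding to point in the same
        direction as its matched class-name text embedding.
        image_embeds: (N, D) embeddings of N matched object regions (assumed shape)
        text_embeds:  (N, D) text embeddings of the corresponding class names
        """
        cos_sim = F.cosine_similarity(image_embeds, text_embeds, dim=-1)  # (N,)
        return (1.0 - cos_sim).mean()  # 0 when perfectly aligned

    class ImageTextFusion(nn.Module):
        """Fusion ordering described in the abstract: first refine the region
        embeddings, then inject text information. Standard nn.MultiheadAttention
        is used here as a stand-in for agent attention."""
        def __init__(self, dim: int = 256, num_heads: int = 8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, image_embeds: torch.Tensor,
                    text_embeds: torch.Tensor) -> torch.Tensor:
            # image_embeds: (B, R, D) region embeddings; text_embeds: (B, T, D)
            refined, _ = self.self_attn(image_embeds, image_embeds, image_embeds)
            image_embeds = image_embeds + refined            # strengthen region representation
            fused, _ = self.cross_attn(image_embeds, text_embeds, text_embeds)
            return image_embeds + fused                      # fuse text information

    In this sketch the alignment loss would be added to the detector's existing losses during training, so it shapes the image embeddings without changing the inference path.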
    Appears in Collections: [Graduate Institute of Communication Engineering] Master's and Doctoral Theses


    All items in NCUIR are protected by the original copyright.

