

    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/95554


    Title: Multimodal Composed Image Retrieval Using Querying-Transformer
    Authors: 楊歷恆;Yang, Alex Li-Heng
    Contributors: Department of Computer Science and Information Engineering
    Keywords: image search;Composed Image Retrieval;deep learning;attention
    Date: 2024-07-23
    Issue Date: 2024-10-09 17:00:39 (UTC+8)
    Publisher: National Central University
    Abstract: Composed Image Retrieval (CIR) systems are crucial because they
    enable users to find specific images using both visual references and
    descriptive text, addressing the limitations of traditional text-only
    search methods. In this thesis, we propose a system that utilizes the
    Querying-Transformer (Qformer) to address the limitations of traditional
    image retrieval methods. The Qformer integrates image and text data
    through a transformer-based architecture, adeptly capturing complex
    relationships between the two modalities. By incorporating the
    Image-Text Matching (ITM) loss function, our system significantly
    enhances the accuracy of image-text matching, ensuring superior
    alignment between visual and textual representations. We also employ
    residual learning techniques within the Qformer model to preserve
    essential visual information, thereby maintaining the quality and
    features of the original images throughout the learning process.
    To confirm the efficacy of our approach, we performed experiments on
    the FashionIQ and CIRR datasets. The results show that our proposed
    system significantly outperforms existing models, achieving superior
    recall metrics across various categories. The experimental results
    demonstrate the potential of our system in practical applications,
    offering robust improvements in the precision and relevance of image
    retrieval tasks.
    Appears in Collections:[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation


    All items in NCUIR are protected by copyright, with all rights reserved.
