Master's/Doctoral Thesis Record 111526010: Detailed Information




Author: Alex Li-Heng Yang (楊歷恆)    Department: Computer Science and Information Engineering
Thesis Title: Multimodal Composed Image Retrieval Using Querying-Transformer
Related Theses
★ Dynamic Overlay Construction for Mobile Target Detection in Wireless Sensor Networks
★ A Simple Detour Strategy for Vehicle Navigation
★ Improving Localization Using Transmitter-Side Voltage
★ Constructing a Virtual Backbone over Vehicular Networks Using Vehicle Classification
★ Why Topology-based Broadcast Algorithms Do Not Work Well in Heterogeneous Wireless Networks?
★ Efficient Wireless Sensor Networks for Mobile Targets
★ A Distributed Articulation-Point-Based Topology Control Method for Wireless Ad Hoc Networks
★ A Review of Existing Web Frameworks
★ A Distributed Algorithm for Partitioning Sensor Networks into Greedy Blocks
★ Range-free Distance Measurement in Wireless Networks
★ Inferring Floor Plan from Trajectories
★ An Indoor Collaborative Pedestrian Dead Reckoning System
★ Dynamic Content Adjustment In Mobile Ad Hoc Networks
★ An Image-Based Localization System
★ A Distributed Data Compression and Collection Algorithm for Large-Scale Wireless Sensor Networks
★ Collision Analysis in Vehicular WiFi Networks
Files: Thesis available for viewing in the system after 2025-7-17
Abstract (Chinese) The importance of composed image retrieval systems lies in their ability to let users find specific images using both a visual reference and descriptive text, addressing the limitations of traditional text-only retrieval methods. In this thesis, we propose a system that uses the Querying-Transformer to overcome the limitations of conventional image retrieval methods. The Qformer integrates image and text data through a transformer-based architecture and adeptly captures the complex relationships between the two modalities. By introducing an image-text matching loss function, our system significantly improves the accuracy of image-text matching and ensures strong alignment between visual and textual representations. We also apply residual learning within the Qformer model to preserve essential visual information, maintaining the quality and features of the original image throughout the learning process.
To validate the effectiveness of our approach, we conducted experiments on the FashionIQ and CIRR datasets. The results show that the proposed system significantly outperforms existing models across various categories, achieving higher recall metrics. The experimental results demonstrate the potential of our system in practical applications, offering notable improvements in the precision and relevance of image retrieval tasks.
Abstract (English) Composed Image Retrieval (CIR) systems are crucial because they enable users to find specific images using both visual references and descriptive text, addressing the limitations of traditional text-only search methods. In this thesis, we propose a system that utilizes the Querying-Transformer (Qformer) to address the limitations of traditional image retrieval methods. The Qformer integrates image and text data through a transformer-based architecture, adeptly capturing complex relationships between the two modalities. By incorporating the Image-Text Matching (ITM) loss function, our system significantly enhances the accuracy of image-text matching, ensuring superior alignment between visual and textual representations. We also employ residual learning techniques within the Qformer model to preserve essential visual information, thereby maintaining the quality and features of the original images throughout the learning process. To confirm the efficacy of our approach, we performed experiments on the FashionIQ and CIRR datasets. The results show that our proposed system significantly outperforms existing models, achieving superior recall metrics across various categories. The experimental results demonstrate the potential of our system in practical applications, offering robust improvements in the precision and relevance of image retrieval tasks.
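The abstracts describe three mechanisms: Q-Former-style query tokens that fuse image and text features, an image-text matching (ITM) loss, and residual learning that preserves the original visual information. The following PyTorch code is only a minimal sketch of those ideas under stated assumptions; it is not the thesis implementation, and every module name, dimension, and the particular residual/fusion choice shown here is hypothetical.

# Minimal sketch (not the thesis code) of the ideas described in the abstract:
# learnable query tokens that cross-attend to image features (Q-Former style),
# a residual connection that preserves the original visual embedding, and an
# image-text matching (ITM) head trained as binary classification.
# All module names, sizes, and fusion details are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class QFormerFusionSketch(nn.Module):
    def __init__(self, dim=256, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable query tokens, as in Q-Former-style architectures.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        # Cross-attention: queries attend to image patch features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Self-attention over [queries; text tokens] to mix the two modalities.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # ITM head: binary matched / not-matched classifier.
        self.itm_head = nn.Linear(dim, 2)

    def forward(self, image_feats, text_feats):
        # image_feats: (B, P, dim) patch features; text_feats: (B, T, dim) token features.
        B = image_feats.size(0)
        q = self.queries.expand(B, -1, -1)

        # Queries gather visual information via cross-attention.
        attended, _ = self.cross_attn(q, image_feats, image_feats)
        q = self.norm1(q + attended)

        # Joint self-attention over queries and text tokens.
        joint = torch.cat([q, text_feats], dim=1)
        mixed, _ = self.self_attn(joint, joint, joint)
        joint = self.norm2(joint + mixed)
        q_out = joint[:, : q.size(1)]

        # Residual-style preservation of the original visual signal:
        # add the pooled image feature back onto the fused query embedding.
        fused = q_out.mean(dim=1) + image_feats.mean(dim=1)

        # ITM logits for binary image-text matching.
        itm_logits = self.itm_head(fused)
        return fused, itm_logits


def itm_loss(itm_logits, labels):
    # labels: 1 for matched (image, text) pairs, 0 for mismatched ones.
    return F.cross_entropy(itm_logits, labels)


if __name__ == "__main__":
    model = QFormerFusionSketch()
    img = torch.randn(4, 49, 256)   # e.g. a 7x7 patch grid of projected features
    txt = torch.randn(4, 16, 256)   # projected text token features
    labels = torch.tensor([1, 0, 1, 0])
    fused, logits = model(img, txt)
    print(fused.shape, itm_loss(logits, labels).item())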
Keywords (Chinese): ★ image search
Keywords (English): ★ Composed Image Retrieval ★ deep learning ★ attention
Table of Contents
1 Introduction
2 Related Work
2.1 Visual and Language Pre-training
2.1.1 Non-Contrastive Learning-based Models
2.1.2 Contrastive Learning-based Models
2.2 Composed Image Retrieval
2.2.1 LSTM-based Composed Image Retrieval
2.2.2 Attention Mechanism-based Composed Image Retrieval
2.2.3 BERT-based Composed Image Retrieval
2.2.4 Vision-Language Foundation Composed Image Retrieval
3 Preliminary
3.1 CLIP
3.2 CLIP4Cir
3.2.1 Combiner Network
3.3 BLIP
3.4 Qformer
3.5 Residual Learning
3.6 Position-guided Text Prompt
3.6.1 Block Tag Generation
4 Design
4.1 Motivation
4.2 Assumptions
4.3 Problem Statement
4.4 Research Challenges
4.5 Proposed System Architecture
4.5.1 Padding
4.5.2 Qformer
5 Performance
5.1 Datasets
5.2 Evaluation Metrics
5.3 Environmental Settings
5.4 Experimental Results and Analysis
5.4.1 Experimental Results on the CIRR Dataset
5.4.2 Experimental Results on the FashionIQ Dataset
5.5 Ablation Studies
5.5.1 Performance
6 Conclusion
Advisor: Min-Te Sun (孫敏德)    Date of Approval: 2024-7-23
