References
[1] Muhammad Umer Anwaar, Egor Labintcev, and Martin Kleinsteuber. Compositional
learning of image-text query for image retrieval. In Proceedings of the IEEE/CVF
Winter Conference on Applications of Computer Vision (WACV), pages 1140–1149,
January 2021.
[2] Yang Bai, Xinxing Xu, Yong Liu, Salman Khan, Fahad Khan, Wangmeng Zuo, Rick
Siow Mong Goh, and Chun-Mei Feng. Sentence-level prompts benefit composed
image retrieval. arXiv preprint arXiv:2310.05473, 2023.
[3] Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo.
Conditioned and composed image retrieval combining and partially fine-tuning
clip-based features. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 4959–4968, 2022.
[4] Yanbei Chen, Shaogang Gong, and Loris Bazzani. Image search with text feedback
by visiolinguistic attention learning. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), June 2020.
[5] Ginger Delmas, Rafael Sampaio de Rezende, Gabriela Csurka, and Diane Larlus.
Artemis: Attention-based retrieval with text-explicit matching and implicit
similarity. arXiv preprint arXiv:2203.08101, 2022.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert:
Pre-training of deep bidirectional transformers for language understanding. arXiv
preprint arXiv:1810.04805, 2018.
[7] Eric Dodds, Jack Culpepper, Simao Herdade, Yang Zhang, and Kofi Boakye.
Modality-agnostic attention fusion for visual search with text feedback. arXiv
preprint arXiv:2007.00145, 2020.
[8] Sonam Goenka, Zhaoheng Zheng, Ayush Jaiswal, Rakesh Chada, Yue Wu, Varsha
Hedau, and Pradeep Natarajan. Fashionvlp: Vision language transformer for fashion
retrieval with feedback. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 14105–14115, June 2022.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 770–778, 2016.
[10] Mehrdad Hosseinzadeh and Yang Wang. Composed query image retrieval using
locally bounded features. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 3596–3605, 2020.
[11] Surgan Jandial, Pinkesh Badjatiya, Pranit Chawla, Ayush Chopra, Mausoom Sarkar,
and Balaji Krishnamurthy. Sac: Semantic attention composition for text-conditioned
image retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications
of Computer Vision, pages 4021–4030, 2022.
[12] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V.
Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language
representation learning with noisy text supervision, 2021.
[13] Jongseok Kim, Youngjae Yu, Hoeseong Kim, and Gunhee Kim. Dual compositional
learning in interactive image retrieval. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 35, pages 1771–1779, 2021.
[14] Seungmin Lee, Dongwan Kim, and Bohyung Han. Cosmo: Content-style modulation
for image retrieval with text feedback. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 802–812, 2021.
[15] Matan Levy, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. Data roaming and
quality assessment for composed image retrieval, 2023.
[16] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping
language-image pre-training with frozen image encoders and large language models.
In International conference on machine learning, pages 19730–19742. PMLR, 2023.
[17] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping
language-image pre-training for unified vision-language understanding and
generation. In International conference on machine learning, pages 12888–12900.
PMLR, 2022.
[18] Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image
retrieval on real-life images with pre-trained vision-and-language models. In 2021
IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, October
2021.
[19] Zheyuan Liu, Weixuan Sun, Yicong Hong, Damien Teney, and Stephen Gould.
Bi-directional training for composed image retrieval via text prompt learning. In
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision,
pages 5753–5762, 2024.
[20] Zheyuan Liu, Weixuan Sun, Damien Teney, and Stephen Gould. Candidate set
re-ranking for composed image retrieval with dual multi-modal encoder. arXiv preprint
arXiv:2305.16304, 2023.
[21] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm
restarts, 2016.
[22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2017.
[23] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining
task-agnostic visiolinguistic representations for vision-and-language tasks, 2019.
[24] Ze Lu, Xudong Jiang, and Alex Kot. Deep coupled resnet for low-resolution face
recognition. IEEE Signal Processing Letters, 25(4):526–530, 2018.
[25] Xianfeng Ou, Pengcheng Yan, Yiming Zhang, Bing Tu, Guoyun Zhang, Jianhui Wu,
and Wujing Li. Moving object detection method via resnet-18 with encoder–decoder
structure in complex scenes. IEEE Access, 7:108152–108160, 2019.
[26] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville.
Film: Visual reasoning with a general conditioning layer. Proceedings of the AAAI
Conference on Artificial Intelligence, 32(1), April 2018.
[27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al.
Learning transferable visual models from natural language supervision. In
International conference on machine learning, pages 8748–8763. PMLR, 2021.
[28] Minchul Shin, Yoonjae Cho, Byungsoo Ko, and Geonmo Gu. Rtic: Residual learning
for text and image composition using graph convolutional network. arXiv preprint
arXiv:2104.03015, 2021.
[29] Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech
Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and
vision alignment model. In 2022 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR). IEEE, June 2022.
[30] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder
representations from transformers. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP). Association for
Computational Linguistics, 2019.
[31] Lucas Ventura, Antoine Yang, Cordelia Schmid, and Gül Varol. CoVR: Learning
composed video retrieval from web video captions. AAAI, 2024.
[32] Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays.
Composing text and image for image retrieval - an empirical odyssey. In Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition, pages
6439–6448, 2019.
[33] Jinpeng Wang, Pan Zhou, Mike Zheng Shou, and Shuicheng Yan. Position-guided
text prompt for vision-language pre-training. In 2023 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR). IEEE, June 2023.
[34] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen
Grauman, and Rogerio Feris. Fashion iq: A new dataset towards retrieving images by
natural language feedback, 2019.
[35] Youngjae Yu, Seunghwan Lee, Yuncheol Choi, and Gunhee Kim. Curlingnet:
Compositional learning between images and text for fashion iq data. arXiv preprint
arXiv:2003.12299, 2020.