References
Akaho, S., 2007. A kernel method for canonical correlation analysis. arXiv:cs/0609071.
Andrew, G., Arora, R., Bilmes, J., Livescu, K., 2013. Deep Canonical Correlation Analysis, in: International Conference on Machine Learning. Presented at the International Conference on Machine Learning, PMLR, pp. 1247–1255.
Bach, F.R., Jordan, M.I., 2005. A Probabilistic Interpretation of Canonical Correlation Analysis. Technical Report 688, Department of Statistics, University of California, Berkeley.
Baltrušaitis, T., Ahuja, C., Morency, L., 2019. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 423–443. https://doi.org/10.1109/TPAMI.2018.2798607
Barsalou, L.W., 2008. Grounded Cognition. Annu. Rev. Psychol. 59, 617–645. https://doi.org/10.1146/annurev.psych.59.103006.093639
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T., 2017. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 5, 135–146. https://doi.org/10.1162/tacl_a_00051
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [cs, stat].
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North. Presented at the Proceedings of the 2019 Conference of the North, Association for Computational Linguistics, Minneapolis, Minnesota, pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423
Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T., 2015. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2625–2634.
Dong, D., Wu, H., He, W., Yu, D., Wang, H., 2015. Multi-Task Learning for Multiple Language Translation, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Presented at the ACL-IJCNLP 2015, Association for Computational Linguistics, Beijing, China, pp. 1723–1732. https://doi.org/10.3115/v1/P15-1166
Feng, F., Wang, X., Li, R., 2014. Cross-modal Retrieval with Correspondence Autoencoder, in: Proceedings of the 22nd ACM International Conference on Multimedia. Presented at the MM ’14: 2014 ACM Multimedia Conference, ACM, Orlando Florida USA, pp. 7–16. https://doi.org/10.1145/2647868.2654902
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T., 2013. DeViSE: a deep visual-semantic embedding model, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13. Curran Associates Inc., Red Hook, NY, USA, pp. 2121–2129.
Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M., 2016a. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. arXiv:1606.01847 [cs].
Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M., 2016b. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Presented at the EMNLP 2016, Association for Computational Linguistics, Austin, Texas, pp. 457–468. https://doi.org/10.18653/v1/D16-1044
Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative Adversarial Networks. arXiv:1406.2661 [cs, stat].
Guo, W., Wang, J., Wang, S., 2019. Deep Multimodal Representation Learning: A Survey. IEEE Access 7, 63373–63394. https://doi.org/10.1109/ACCESS.2019.2916887
He, K., Zhang, X., Ren, S., Sun, J., 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs].
Huang, J., Kingsbury, B., 2013. Audio-visual deep learning for noise robust speech recognition, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Presented at the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7596–7599. https://doi.org/10.1109/ICASSP.2013.6639140
Huang, X., Liu, M.-Y., Belongie, S., Kautz, J., 2018a. Multimodal Unsupervised Image-to-Image Translation, in: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (Eds.), Computer Vision – ECCV 2018, Lecture Notes in Computer Science. Springer International Publishing, Cham, pp. 179–196. https://doi.org/10.1007/978-3-030-01219-9_11
Huang, X., Liu, M.-Y., Belongie, S., Kautz, J., 2018b. Multimodal Unsupervised Image-to-Image Translation. arXiv:1804.04732 [cs, stat].
Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., Dean, J., 2017. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Trans. Assoc. Comput. Linguist. 5, 339–351. https://doi.org/10.1162/tacl_a_00065
Kaiser, L., Gomez, A.N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., Uszkoreit, J., 2017. One Model To Learn Them All. ArXiv170605137 Cs Stat.
Karpathy, A., Fei-Fei, L., 2017. Deep Visual-Semantic Alignments for Generating Image Descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 39, 664–676. https://doi.org/10.1109/TPAMI.2016.2598339
Kiros, R., Salakhutdinov, R., Zemel, R.S., 2014. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv:1411.2539 [cs].
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems. pp. 1097–1105.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324. https://doi.org/10.1109/5.726791
Liong, V.E., Lu, J., Tan, Y., Zhou, J., 2017. Deep Coupled Metric Learning for Cross-Modal Matching. IEEE Trans. Multimed. 19, 1234–1244. https://doi.org/10.1109/TMM.2016.2646180
Lu, J., Yang, J., Batra, D., Parikh, D., 2017. Hierarchical Question-Image Co-Attention for Visual Question Answering. arXiv:1606.00061 [cs].
Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013a. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs].
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J., 2013b. Distributed Representations of Words and Phrases and Their Compositionality, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13. Curran Associates Inc., USA, pp. 3111–3119.
Mor, N., Wolf, L., Polyak, A., Taigman, Y., 2018. A Universal Music Translation Network. arXiv:1805.07848 [cs, stat].
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y., 2011. Multimodal Deep Learning, in: Proceedings of the 28th International Conference on Machine Learning (ICML). pp. 689–696.
Pan, Y., Mei, T., Yao, T., Li, H., Rui, Y., 2015. Jointly Modeling Embedding and Translation to Bridge Video and Language. arXiv:1505.01861 [cs].
Peng, Y., Qi, J., Yuan, Y., 2017. Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network. arXiv:1708.04776 [cs].
Pennington, J., Socher, R., Manning, C., 2014. Glove: Global Vectors for Word Representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Presented at the Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, pp. 1532–1543. https://doi.org/10.3115/v1/D14-1162
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L., 2018. Deep Contextualized Word Representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Presented at the Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, pp. 2227–2237. https://doi.org/10.18653/v1/N18-1202
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H., 2016. Generative Adversarial Text to Image Synthesis. arXiv:1605.05396 [cs].
Sachan, D., Neubig, G., 2018. Parameter Sharing Methods for Multilingual Self-Attentional Translation Models, in: Proceedings of the Third Conference on Machine Translation: Research Papers. Association for Computational Linguistics, Brussels, Belgium, pp. 261–271. https://doi.org/10.18653/v1/W18-6327
Silberer, C., Ferrari, V., Lapata, M., 2017. Visually Grounded Meaning Representations. IEEE Trans. Pattern Anal. Mach. Intell. 39, 2284–2297. https://doi.org/10.1109/TPAMI.2016.2635138
Silberer, C., Ferrari, V., Lapata, M., 2013. Models of Semantic Representation with Visual Attributes, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Presented at the ACL 2013, Association for Computational Linguistics, Sofia, Bulgaria, pp. 572–582.
Silberer, C., Lapata, M., 2014. Learning Grounded Meaning Representations with Autoencoders, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Presented at the ACL 2014, Association for Computational Linguistics, Baltimore, Maryland, pp. 721–732. https://doi.org/10.3115/v1/P14-1068
Silberer, C., Lapata, M., 2012. Grounded Models of Semantic Representation, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Presented at the CoNLL-EMNLP 2012, Association for Computational Linguistics, Jeju Island, Korea, pp. 1423–1433.
Silberer, C., Pinkal, M., 2018. Grounding Semantic Roles in Images, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Presented at the EMNLP 2018, Association for Computational Linguistics, Brussels, Belgium, pp. 2616–2626. https://doi.org/10.18653/v1/D18-1282
Simonyan, K., Zisserman, A., 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs].
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2014. Going Deeper with Convolutions. arXiv:1409.4842 [cs].
Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K., 2015. Translating Videos to Natural Language Using Deep Recurrent Neural Networks, in: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Presented at the NAACL-HLT 2015, Association for Computational Linguistics, Denver, Colorado, pp. 1494–1504. https://doi.org/10.3115/v1/N15-1173
Vinyals, O., Toshev, A., Bengio, S., Erhan, D., 2015. Show and Tell: A Neural Image Caption Generator. Presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164.
Wang, B., Yang, Y., Xu, X., Hanjalic, A., Shen, H.T., 2017. Adversarial Cross-Modal Retrieval, in: Proceedings of the 25th ACM International Conference on Multimedia. Presented at the MM ’17: ACM Multimedia Conference, ACM, Mountain View California USA, pp. 154–162. https://doi.org/10.1145/3123266.3123326
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y., 2016. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv:1502.03044 [cs].
Yan, F., Mikolajczyk, K., 2015. Deep correlation for matching images and text, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Presented at the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Boston, MA, USA, pp. 3441–3450. https://doi.org/10.1109/CVPR.2015.7298966
Zadeh, A., Chen, M., Poria, S., Cambria, E., Morency, L.-P., 2017a. Tensor Fusion Network for Multimodal Sentiment Analysis, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Presented at the EMNLP 2017, Association for Computational Linguistics, Copenhagen, Denmark, pp. 1103–1114. https://doi.org/10.18653/v1/D17-1115
Zadeh, A., Chen, M., Poria, S., Cambria, E., Morency, L.-P., 2017b. Tensor Fusion Network for Multimodal Sentiment Analysis. arXiv:1707.07250 [cs].