Master's/Doctoral Thesis 111423058 — Detailed Record




Name: 陳佳辰 (Jia-Chen Chen)    Department: Information Management
Thesis Title: MMT: Multimodal Masking Transformer for Multimodal Sentiment Analysis
Related Theses
★ An Empirical Study of Multi-Label Text Classification: Comparing Word Embedding with Traditional Techniques
★ Network Protocol Correlation Analysis Based on Graph Neural Networks
★ Learning Shared Representations Across and Within Modalities
★ Hierarchical Classification and Regression with Feature Selection
★ Symptom-Based Sentiment Analysis of Patient-Authored Diaries
★ An Attention-Based Open-Domain Dialogue System
★ Domain-Specific Tasks: Applications of Commonsense-Based BERT Models
★ Analyzing Text Sentiment Intensity Based on Differences in Social Media Users' Hardware Devices
★ On the Effectiveness of Machine Learning and Feature Engineering for Monitoring Anomalous Cryptocurrency Transactions
★ Applying LSTM Networks and Machine Learning to Metro Turnout Switches for Optimal Maintenance-Time Reminders
★ Network Traffic Classification Based on Semi-Supervised Learning
★ ERP Log Analysis: A Case Study of Company A
★ Enterprise Information Security: An Exploratory Study of Network Packet Collection, Analysis, and Network Behavior
★ Applying Data Mining Techniques to Customer Relationship Management: A Case Study of Digital Deposits at Bank C
★ On the Usability and Efficiency of Face Image Generation and Augmentation
★ Synthetic-Text Data Augmentation for Imbalanced Text Classification
Files: Full text viewable in the thesis system after 2026-08-01
Abstract (Chinese) With the advancement of multimodal techniques, the concept of multimodal sentiment analysis (MSA) has been proposed and shown to hold potential value in several applications. To strengthen the robustness of MSA models, data augmentation is one viable option. However, most current augmentation methods operate at the data level. In multimodal settings, the augmented data they produce lacks the hidden complementary information between modalities, and the flexibility of the augmentation method is constrained by the modalities themselves. We therefore propose the Multimodal Masking Transformer (MMT), an encoder-decoder network for embedding-level multimodal data augmentation, to augment existing data for MSA tasks. MMT can capture the hidden complementary information between modalities and overcome inter-modality constraints, giving the augmentation method greater flexibility. In this study, we integrate MMT with several MSA models and evaluate it against state-of-the-art embedding-level multimodal augmentation methods. We also conduct a sensitivity analysis of MMT's augmentation impact to demonstrate its effectiveness in improving MSA tasks.
Abstract (English) With the advancement of multimodal techniques, the concept of multimodal sentiment analysis (MSA) has been proposed and proven to have potential value in several applications. To enhance the robustness of MSA models, data augmentation is one of the available options. However, most current augmentation methods focus on data-level augmentation. In multimodal scenarios, such methods generate augmented data that lacks the hidden complementary information between modalities, and the flexibility of the augmentation method is constrained by the modalities themselves. Thus, we propose the Multimodal Masking Transformer (MMT), an encoder-decoder network for embedding-level multimodal augmentation, to augment the existing data for MSA tasks. MMT is capable of capturing hidden complementary information and overcoming the constraints among modalities, providing higher flexibility to the augmentation method. In this study, we integrate MMT with multiple MSA models and evaluate it against state-of-the-art embedding-level multimodal augmentation methods. In addition, we conduct a sensitivity analysis of MMT's augmentation impact to demonstrate how effectively MMT can improve MSA tasks.
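The abstract characterizes MMT as an encoder-decoder transformer that augments MSA data at the embedding level by masking and reconstructing modality embeddings. The sketch below illustrates that general idea in PyTorch, assuming an MAE-style design with random masking, a learnable mask token, and modality-type embeddings; all module names, dimensions, and hyperparameters (`MaskedMultimodalAugmenter`, `dim=128`, `mask_ratio=0.3`, and so on) are illustrative assumptions, not the thesis's actual implementation.

```python
# A minimal sketch of embedding-level multimodal augmentation via masked
# reconstruction, in the spirit of the MMT described in the abstract.
# Module names, dimensions, and the masking scheme are assumptions for
# illustration, not the thesis's actual design.
import torch
import torch.nn as nn


class MaskedMultimodalAugmenter(nn.Module):
    """Encoder-decoder transformer that reconstructs randomly masked
    modality embeddings; the reconstructions serve as augmented samples."""

    def __init__(self, dim: int = 128, n_heads: int = 4, n_layers: int = 2,
                 n_modalities: int = 3, mask_ratio: float = 0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Learnable token that replaces masked positions.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Modality-type embeddings let the model tell text/audio/video apart.
        self.modality_embed = nn.Embedding(n_modalities, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerEncoder(dec_layer, n_layers)

    def forward(self, embeddings: torch.Tensor, modality_ids: torch.Tensor):
        # embeddings: (batch, seq, dim) -- pre-extracted unimodal features
        # concatenated along the sequence axis; modality_ids: (batch, seq).
        x = embeddings + self.modality_embed(modality_ids)
        mask = torch.rand(x.shape[:2], device=x.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        hidden = self.encoder(x)
        recon = self.decoder(hidden)
        return recon, mask


def reconstruction_loss(recon, target, mask):
    # Penalize only the masked positions, so reconstructing one modality's
    # embeddings must rely on the unmasked context, including the others.
    diff = (recon - target) ** 2
    return diff[mask].mean()


if __name__ == "__main__":
    model = MaskedMultimodalAugmenter()
    feats = torch.randn(8, 30, 128)  # e.g. 10 text + 10 audio + 10 video steps
    mods = torch.arange(3).repeat_interleave(10).expand(8, -1)
    augmented, mask = model(feats, mods)
    loss = reconstruction_loss(augmented, feats, mask)
    loss.backward()
    # At augmentation time, `augmented` could be mixed into an MSA model's
    # training batch alongside the original embeddings.
```

Because the loss is computed only at masked positions, the model must infer the missing content from the surviving context across modalities; under these assumptions, that is one way an embedding-level augmenter can inject cross-modal complementary information while staying agnostic to the raw formats of the individual modalities.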
Keywords
★ sentiment analysis
★ multimodal sentiment analysis
★ multimodal data augmentation
★ emotion recognition
Table of Contents
Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
1.1. Overview
1.2. Motivation
1.3. Objectives and Contributions
1.4. Thesis Organisation
2. Related Works
2.1. Data Augmentation
2.2. Multimodal Data Augmentation
2.3. Multimodal Sentiment Analysis
2.4. Discussion
3. Methodology
3.1. Overview
3.2. Multimodal Masking Transformer (MMT)
3.2.1. MMT Components
3.2.2. Augmentation Process
3.3. Training
3.4. Task Model
3.5. Multimodal Augmentation Baselines
3.6. Datasets
3.6.1. IEMOCAP
3.6.2. MELD
3.7. Evaluation
4. Experiment
4.1. Results
4.2. Analysis
4.3. Post-Analysis
5. Conclusion
5.1. Overall Summary
5.2. Contributions
5.3. Study Limitations
5.4. Future Works
6. References
Advisor: 柯士文    Date of Review: 2024-07-30
