Graduate Thesis 107522609: Detailed Record




Name: Nanda Putri Romadhona (南妲)    Department: Computer Science and Information Engineering
Thesis Title: Code-Mixing Language Model for Sentiment Analysis in Code-Mixing Data
Related Theses
★ A Real-time Embedding Increasing for Session-based Recommendation with Graph Neural Networks
★ Modifying the Training Objective Based on the Primary Diagnosis for ICD-10 Coding of Discharge Summaries
★ A Study of Hybrid Recognition of Heart Disease Risk Factors and Their Progression in Electronic Medical Records
★ A Rapid Adoption Method Based on PowerDesigner-Compliant Requirements Analysis Deliverables
★ Question Retrieval in Community Forums
★ Unsupervised Event Type Identification in Historical Texts: A Case Study of Garrison (Weisuo) Events in the Ming Shilu (《明實錄》)
★ Applying Natural Language Processing to Analyze Character Relationships in Literary Fiction, with Interactive Visualization
★ Extracting Function-Level Linguistic Descriptions of Biological Phenotypes from Biomedical Texts: A K-Nearest-Neighbors Algorithm Inspired by Principal Component Analysis
★ Building Article Representation Vectors from a Classification System for Cross-Lingual Online Encyclopedia Linking
★ Improving Dialogue State Tracking by Incorporating Multiple Speech Recognition Results
★ A Dialogue System for Chinese Online Customer Service Assistants: A Case Study in the Telecom Domain
★ Applying Recurrent Neural Networks to Answer Questions at the Appropriate Time
★ Improving User Intent Classification with Multi-Task Learning
★ Improving the Pivot-Language Approach to Named Entity Transliteration with Transfer Learning
★ Finding Experts in Community Question-Answering Sites Using Historical Information Vectors and Topic Expertise Vectors
★ Improving User Intent Classification with the YMCL Model
  1. Access to this electronic thesis has been approved for immediate open access.
  2. Electronic full texts that have reached their open-access date are licensed only for personal, non-commercial retrieval, reading, and printing for academic research purposes.
  3. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.

Abstract (Chinese): Code-mixing is a very common phenomenon in multilingual societies and also one of the challenges for NLP tasks. This study focuses on Malaysia, where Bahasa Rojak (a mix of Malay, English, and Chinese) is used frequently in everyday speech; the goal of this study is to build a language model that can be applied to code-mixed text.
However, the amount of code-mixed (Bahasa Rojak) data available for training is small, so we also implement a data augmentation method that generates data similar to the original corpus from phrases habitually used by Malaysian youth. In this study, we propose two new pre-trained language models: the BERT + LI language model and the Mixed XLM language model. BERT + LI adds an extra input layer, which we call the language embedding, to the original mBERT architecture, with the aim of providing more information during training. Mixed XLM instead modifies the language embedding layer of the original XLM: whereas the original XLM tags the language embedding with only a single language, our Mixed XLM uses tags from multiple languages so that it fits code-mixed data. We train our language models with the Masked Language Model (MLM) objective.
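This record does not spell out the augmentation algorithm, but the description (substituting phrases that Malaysian youth habitually mix into Malay sentences) is consistent with dictionary-based random replacement. Below is a minimal Python sketch under that assumption; the toy lexicon is hypothetical and stands in for the thesis's actual phrase list.

    import random

    # Hypothetical Malay-to-English lexicon; the thesis's real list of
    # phrases popular with Malaysian youth is not reproduced here.
    MIX_LEXICON = {"filem": "movie", "sangat": "very", "bagus": "good", "suka": "like"}

    def augment(sentence: str, p: float = 0.3, seed: int = 0) -> str:
        # Replace each in-lexicon word with probability p, producing a
        # Bahasa Rojak-style code-mixed variant of the input sentence.
        rng = random.Random(seed)
        return " ".join(
            MIX_LEXICON[w] if w in MIX_LEXICON and rng.random() < p else w
            for w in sentence.split()
        )

    print(augment("filem ini sangat bagus"))  # e.g. "filem ini very bagus"

Running the function with different seeds over the same monolingual corpus yields several code-mixed variants per original sentence, which is how such a schema can grow a small Bahasa Rojak training set.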
We choose the sentiment analysis task to evaluate our language models, covering three domains: product reviews, movie reviews, and stock market comments (the financial domain). Because the financial-domain data is unlabeled, we recruited five annotators from Malaysia and Indonesia to label it manually, obtaining a Kappa value of 0.77. The Mixed XLM-based pre-trained language model we propose proves robust, effective, and efficient on the code-mixing problem. Moreover, our pre-trained language model maps the same sentence in different languages into adjacent regions of the vector space, which shows that it provides better representations.
The contributions of this study are a pre-trained language model for Malay code-mixing (Bahasa Rojak), a method for generating augmented code-mixed data similar to the original corpus, and supervised code-mixed data in the financial domain. Nevertheless, deeper research on code-mixing language models is still needed, with the goal of covering a wider variety of languages with a single language model.
Abstract (English): Code-mixing is a common phenomenon in multilingual societies and one of the more challenging NLP tasks. This study focuses on Malaysian society, which commonly mixes Malay, English, and Chinese in daily life as Bahasa Rojak. The aim of this study is to build a good language model that can handle code-mixed data. However, the amount of code-mixed (Bahasa Rojak) data available for training a language model is not enough.
We implement an augmentation schema that generates data similar to the original corpus, increasing the amount of code-mixed data by exploiting phrases that follow the usage habits of Malaysian youth. We build two new pre-trained language models, called BERT + LI and Mixed XLM. BERT + LI is based on the mBERT architecture, with a new input layer called the language embedding that provides extra information during training. Mixed XLM is based on the XLM architecture, with a modified language embedding layer: the original XLM supports only mono-tagging in the language embedding input, whereas our approach supports multi-tagging (flexible tagging) to handle all kinds of code-mixed data. We train our language models with the Masked Language Model (MLM) task.
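To make the mono-tagging versus multi-tagging distinction concrete, here is a minimal PyTorch sketch of an XLM-style input layer that accepts one language ID per token rather than one per sentence. The vocabulary size, embedding dimension, and language-ID assignment are illustrative assumptions, not the thesis's actual configuration.

    import torch
    import torch.nn as nn

    class MixedXLMEmbedding(nn.Module):
        # Original XLM: a single language ID tags the whole sentence.
        # Mixed XLM idea: every token carries its own language ID, so
        # code-mixed input can be tagged word by word.
        def __init__(self, vocab_size=30000, n_langs=3, dim=768, max_len=512):
            super().__init__()
            self.token = nn.Embedding(vocab_size, dim)
            self.position = nn.Embedding(max_len, dim)
            self.language = nn.Embedding(n_langs, dim)  # assumed: 0=Malay, 1=English, 2=Chinese

        def forward(self, token_ids, lang_ids):
            # token_ids, lang_ids: (batch, seq_len); lang_ids vary per token
            pos = torch.arange(token_ids.size(1), device=token_ids.device)
            return self.token(token_ids) + self.position(pos) + self.language(lang_ids)

    # "saya suka this movie 很好" -> hypothetical per-token tags [0, 0, 1, 1, 2]
    layer = MixedXLMEmbedding()
    ids = torch.randint(0, 30000, (1, 5))
    tags = torch.tensor([[0, 0, 1, 1, 2]])
    print(layer(ids, tags).shape)  # torch.Size([1, 5, 768])

The summed embedding then feeds a standard Transformer encoder trained with the MLM objective, as in XLM; only the granularity of the language tag changes.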
We choose the sentiment analysis (SA) task to evaluate the performance of our language models, running SA in three domains: product reviews, movie reviews, and stock market comments (the financial domain). The financial domain is available only as unlabeled data, so we label it manually with five annotators from Malaysia and Indonesia, obtaining a Kappa value of about 0.77. On all code-mixed datasets, our Mixed XLM language model outperforms the other pre-trained language models on sentiment analysis. Our proposed pre-trained language model (Mixed XLM) proves robust, effective, and efficient on the code-mixing problem. Moreover, it represents the same sentence in different languages within one adjacent vector space, which shows that it yields better representations.
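The record does not say which kappa statistic was computed; with five annotators labeling every item, Fleiss' kappa is a common choice. A sketch using statsmodels on made-up toy labels follows; the 0.77 reported above comes from the thesis's actual annotation, not from this example.

    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Toy data: rows = stock market comments, columns = 5 annotators,
    # labels 0 = negative, 1 = neutral, 2 = positive (assumed scheme).
    ratings = np.array([
        [2, 2, 2, 1, 2],
        [0, 0, 0, 0, 0],
        [1, 2, 1, 1, 1],
        [0, 0, 1, 0, 0],
    ])

    # aggregate_raters turns the (items x raters) label matrix into the
    # (items x categories) count table that fleiss_kappa expects.
    table, _ = aggregate_raters(ratings)
    print(round(fleiss_kappa(table), 2))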
Our contributions in this study are a pre-trained language model for code-mixed Malay data (Bahasa Rojak), a schema for augmenting code-mixed data so that it resembles the original corpus, and supervised code-mixed data in the financial domain. However, further training on different kinds of code-mixed data is still needed so that a single code-mixing language model can handle the many variations of code-mixing found around the world.
Keywords (Chinese): ★ Code-mixing (語碼混合)
★ Language embedding (語言向量)
★ Sentiment analysis (情緒分析)
★ Bahasa Rojak
Keywords (English):
Table of Contents
ABSTRACT
摘要 (Chinese Abstract)
ACKNOWLEDGEMENT
CONTENTS
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
1.1 Motivation
1.2 Problem Description
1.3 Thesis Organization
2. RELATED WORK
2.1 Code-Mixing
2.2 Malaysia Dataset Knowledge
2.3 Language Model
2.4 BERT
2.5 XLM-R
2.6 Sentiment Analysis for Evaluating Language Model Performance
2.7 Transfer Learning
3. METHODOLOGY
3.1 Formal Problem Definition
3.2 Data Collection
3.2.1 English Dataset
3.2.2 Malay Dataset
3.2.3 Code-Mixing Dataset
3.3 Labeling Approach
3.3.1 Manual Labeling
3.3.2 Lexicon Approach
3.3.3 Evaluation of Labeled Data
3.4 Data Preprocessing
3.5 Proposed Framework
3.5.1 Pretrained Language Model
a. BERT + LI
b. Mixed XLM
3.5.2 Transfer Learning and Fine-Tuning for Sentiment Analysis
4. EXPERIMENT AND RESULT
4.1 Dataset
4.2 Experiment Setup
4.2.1 Pretraining Details of the Language Model
4.2.2 Sentiment Analysis
4.3 Evaluation
4.3.1 Language Model
4.3.2 Sentiment Analysis
4.4 Experiment Result
4.4.1 Language Model
4.4.2 Sentiment Analysis
4.4.3 Ablation Study
5. ANALYSIS AND DISCUSSION
5.1 Performance of the Proposed Language Model
5.1.1 Robustness
5.1.2 Effectiveness and Efficiency
5.1.3 Visualization
6. CONCLUSION
6.1 Conclusion
6.2 Future Work
REFERENCES
APPENDIX 1
Advisor: 蔡宗翰    Date of Approval: 2020-12-22