

    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/85048


    Title: Code-Mixing Language Model for Sentiment Analysis in Code-Mixing Data
    Author: Romadhona, Nanda Putri (南妲)
    Contributor: Department of Computer Science and Information Engineering
    Keywords: code-mixing; language embedding; sentiment analysis; Bahasa Rojak
    Date: 2020-12-22
    Upload time: 2021-03-18 17:29:01 (UTC+8)
    Publisher: National Central University
    Abstract: Code-mixing is a common phenomenon in multilingual societies, and it is also one of the more challenging NLP tasks. This study focuses on Malaysian society, which commonly uses a mixed language of Malay, English, and Chinese, known as Bahasa Rojak, in daily life. The aim of this study is to build a language model that can handle code-mixing data. However, the amount of code-mixing (Bahasa Rojak) data available for training a language model is insufficient.
    To increase the amount of code-mixing data, we implement an augmentation scheme that generates data similar to the original by using phrases that follow the habits of Malaysian youth. We then build two new pre-trained language models, the BERT + LI Language Model and the Mixed XLM Language Model. The BERT + LI model is based on the mBERT architecture, with an additional input layer called Language Embedding that supplies extra information during training. The Mixed XLM Language Model is based on the XLM architecture, with a modified Language Embedding layer: the original XLM supports only mono-tagging (one language label per sentence) in its language embedding input, whereas our approach supports multi-tagging (flexible, per-token labels) to handle code-mixing data, as sketched below. Both models are trained with the Masked Language Model (MLM) task.
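    The following is a minimal PyTorch sketch of the per-token language tagging idea behind the Mixed XLM model, not the author's implementation; the class name, vocabulary size, dimensions, and language-ID assignments are illustrative assumptions. Each token carries its own language ID, so a single Bahasa Rojak sentence can mix Malay, English, and Chinese tags, whereas XLM's mono-tagging would force one tag on the whole sentence.

```python
# Sketch only: per-token language embeddings for code-mixed input.
# Sizes and language ids (0=Malay, 1=English, 2=Chinese) are assumptions.
import torch
import torch.nn as nn

class MixedLangEmbedding(nn.Module):
    def __init__(self, vocab_size=30000, n_langs=3, d_model=768, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)   # token embeddings
        self.pos = nn.Embedding(max_len, d_model)      # position embeddings
        self.lang = nn.Embedding(n_langs, d_model)     # language embeddings

    def forward(self, token_ids, lang_ids):
        # token_ids, lang_ids: (batch, seq_len); lang_ids may vary per token,
        # which is the multi-tagging extension over XLM's single sentence tag.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions) + self.lang(lang_ids)

# e.g. "saya suka this movie" -> Malay, Malay, English, English
emb = MixedLangEmbedding()
tokens = torch.tensor([[101, 2023, 7592, 3185]])  # toy token ids
langs  = torch.tensor([[0, 0, 1, 1]])             # one language tag per token
print(emb(tokens, langs).shape)                   # torch.Size([1, 4, 768])
```

    The summed embedding is what a Transformer encoder would consume in place of mBERT's usual token + position (+ segment) sum, so the language signal is available from the first layer onward.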
    We choose the sentiment analysis (SA) task to evaluate the performance of our language models. We perform SA in three different domains: product reviews, movie reviews, and stock market comments (the financial domain). The financial-domain data was available only unlabelled, so we labeled it manually with five annotators from Malaysia and Indonesia, obtaining a Kappa value of approximately 0.77. On all code-mixing datasets, our Mixed XLM Language Model outperforms the other pre-trained language models; the proposed model proves robust, effective, and efficient on the code-mixing problem. Moreover, our pre-trained language model represents the same sentence in different languages in one adjacent vector space, which shows that it provides a better representation.
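    The abstract does not state which agreement coefficient was used; with five annotators, Fleiss' kappa is the usual choice, so the sketch below assumes it. The function and the toy ratings are illustrative only, not the thesis data.

```python
# Minimal sketch of Fleiss' kappa for multi-annotator agreement (assumed
# variant; the thesis reports a Kappa of about 0.77 for five annotators).
import numpy as np

def fleiss_kappa(counts):
    """counts: (n_items, n_categories); counts[i, j] = number of the n
    annotators who assigned item i to category j (each row sums to n)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                    # annotators per item
    p_j = counts.sum(axis=0) / counts.sum()      # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))  # per-item agreement
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 5 annotators, 3 sentiment classes (neg/neu/pos), 4 comments.
ratings = [[5, 0, 0],
           [0, 4, 1],
           [1, 0, 4],
           [0, 5, 0]]
print(round(fleiss_kappa(ratings), 2))  # agreement above chance, e.g. 0.69
```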
    The contributions of this study are as follows: we provide a pre-trained language model for code-mixing in Malay data (Bahasa Rojak); we provide a scheme for augmenting code-mixing data so that it resembles the original data; and we produce a supervised code-mixing dataset in the financial domain. However, more in-depth training on different code-mixing data is still needed before a single code-mixing language model can handle the many variations of code-mixing found around the world.
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations

    Files in This Item:

    File        Description    Size    Format    Views
    index.html                 0Kb     HTML      140


    All items in NCUIR are protected by original copyright.
