dc.description.abstract | Code-mixing is a common phenomenon in multilingual societies and remains a challenging NLP task. This study focuses on Malaysian society, where a mixed language of Malay, English, and Chinese, known as Bahasa Rojak, is commonly used in daily life. The aim of this study is to build a language model that handles code-mixed data well. However, the amount of code-mixed (Bahasa Rojak) data available for training a language model is insufficient.
We implement an augmentation scheme that generates data similar to the original by utilizing phrases that follow the usage habits of Malaysian youth, thereby increasing the amount of code-mixed data. We build two new pre-trained language models, called the BERT + LI language model and the Mixed XLM language model. The BERT + LI architecture is based on mBERT, with a new input layer called Language Embedding that provides additional information during training. The Mixed XLM language model is based on the XLM architecture, with a modified Language Embedding component: whereas the original XLM supports only mono-tagging in the language embedding input, our approach supports multi-tagging (flexible tagging) to handle various kinds of code-mixed data, as illustrated in the sketch below. We train our language models with the Masked Language Modeling (MLM) task.
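To make the multi-tagging idea concrete, here is a minimal sketch of a per-token multi-tag language embedding layer. It assumes each token carries a multi-hot language mask and that the tagged languages' embeddings are averaged before being added to the token and position embeddings; the exact combination rule in Mixed XLM is not specified in this abstract, and all names and language IDs below are illustrative.

```python
# Sketch: multi-tag (flexible) language embeddings, assumed to be averaged
# per token. Illustrative only; not the thesis's exact implementation.
import torch
import torch.nn as nn

NUM_LANGS = 3  # illustrative IDs: 0 = Malay, 1 = English, 2 = Chinese

class MultiTagLanguageEmbedding(nn.Module):
    def __init__(self, num_langs: int, hidden_size: int):
        super().__init__()
        self.lang_emb = nn.Embedding(num_langs, hidden_size)

    def forward(self, lang_mask: torch.Tensor) -> torch.Tensor:
        # lang_mask: (batch, seq_len, num_langs) multi-hot tags per token,
        # so a code-mixed token can be tagged with more than one language.
        emb = lang_mask.float() @ self.lang_emb.weight          # (B, T, H)
        counts = lang_mask.float().sum(-1, keepdim=True).clamp(min=1.0)
        return emb / counts  # average the embeddings of the tagged languages

# Usage: add the result to token + position embeddings before the encoder.
layer = MultiTagLanguageEmbedding(NUM_LANGS, hidden_size=768)
mask = torch.zeros(1, 4, NUM_LANGS)
mask[0, 0, 0] = 1           # token tagged Malay only (mono-tagging)
mask[0, 1, [0, 1]] = 1      # token tagged both Malay and English
lang_embeddings = layer(mask)  # shape (1, 4, 768)
```

Under this assumption, mono-tagging (as in the original XLM) is simply the special case where each token's mask has exactly one nonzero entry.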
We evaluate the performance of our language models on the sentiment analysis (SA) task in three domains: product reviews, movie reviews, and stock market comments (financial domain). One of these domains is available only as unlabelled data, so we label it manually with five annotators from Malaysia and Indonesia, obtaining a kappa value of around 0.77 (an agreement computation is sketched below). On all code-mixed datasets, our Mixed XLM language model outperforms the other pre-trained language models, proving robust, effective, and efficient on the code-mixing problem. Moreover, our pre-trained language model maps the same sentence expressed in different languages to nearby points in one vector space, indicating that it learns better representations.
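As a hedged illustration of how agreement among five annotators could be measured, the sketch below computes Fleiss' kappa; the abstract does not state which kappa variant was used, so this choice and the toy data are assumptions.

```python
# Sketch: Fleiss' kappa for multiple annotators. Toy data only; the
# thesis's reported value (~0.77) is not reproduced here.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts: (n_items, n_categories); each row sums to the rater count."""
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Overall proportion of assignments per category.
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    # Per-item agreement, then mean observed and expected agreement.
    p_item = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_item.mean(), (p_cat ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 comments, 3 sentiment classes, 5 annotators per comment.
votes = np.array([[5, 0, 0],
                  [4, 1, 0],
                  [0, 5, 0],
                  [1, 0, 4]])
print(round(fleiss_kappa(votes), 3))
```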
Our contributions are as follows: we provide a pre-trained language model for code-mixed Malay data (Bahasa Rojak); we provide a scheme for augmenting code-mixed data so that it resembles the original data; and we release labelled code-mixed data in the financial domain. However, further in-depth training on other code-mixed data is still needed so that a single code-mixing language model can handle the many variations of code-mixing found around the world. | en_US |