dc.description.abstract | In recent years, the globalization of labor markets has led to a significant influx of Thai
workers into countries such as Taiwan, where they contribute across various sectors ranging from
manufacturing to services. As these individuals navigate their daily lives and professional
endeavors in a foreign land, effective communication becomes paramount for their integration,
productivity, and overall well-being. However, language barriers often present formidable
obstacles, hindering seamless interaction and understanding between Thai workers and their
Taiwanese counterparts. Additionally, resources for translating between Traditional Chinese and
Thai are scarce, making it challenging to find adequate language support tools.
To address these issues, we propose an effective training methodology for leveraging Large
Language Models (LLMs) in machine translation tasks through a two-stage fine-tuning process.
In the initial stage, we enhance the Thai-language proficiency of the Trustworthy AI Dialogue Engine by Taiwan
(TAIDE) 7-billion-parameter model by fine-tuning it on a comprehensive dataset of Thai monolingual text,
following the Advanced Language Model-based Translator (ALMA) strategy
(Xu et al., 2024 [25]). This stage focuses on building a robust foundational
understanding of the Thai language within the model.
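The first stage amounts to continued causal-language-model training on Thai text. The following minimal sketch, written with Hugging Face Transformers and Datasets, illustrates that step; the model identifier, data file name, and hyperparameters are illustrative placeholders rather than the exact configuration used in this work.

# Stage 1 (illustrative): continued fine-tuning of a 7B causal LM on Thai monolingual text.
# Model ID, data file, and hyperparameters are placeholders, not the setup used in this work.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "taide/TAIDE-LX-7B"  # placeholder identifier for the TAIDE 7B base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Plain-text Thai corpus, one passage per line (placeholder file name).
thai = load_dataset("text", data_files="thai_monolingual.txt", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

thai = thai.map(tokenize, batched=True, remove_columns=thai.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="stage1-thai",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=100,
    ),
    train_dataset=thai,
    # mlm=False yields standard next-token (causal) language-modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("stage1-thai")
tokenizer.save_pretrained("stage1-thai")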
In the subsequent stage, the TAIDE model is fine-tuned on a smaller set of high-quality
Traditional Chinese-Thai parallel data. This fine-tuning is essential for aligning the model’s
translation capabilities with the specific nuances of both languages. By utilizing Qwen2-7B-Instruct
(Alibaba Group, 2024 [1]) as the underlying model, our approach achieves substantial
improvements in translation performance. Specifically, our methodology achieves state-of-the-art
results, with average gains of over 20 BLEU, 3 COMET, and 2 COMET-Kiwi points across
the two translation directions (Chinese to Thai and Thai to Chinese). It significantly outperforms prior work and
state-of-the-art machine translation systems such as Google Translate as well as much larger models such as
GPT-3.5 Turbo, and achieves performance comparable to GPT-4o, despite our model having only 7B parameters.
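The second stage casts the high-quality parallel data as translation prompts and fine-tunes the Stage 1 checkpoint on them. The sketch below illustrates this step under the same assumptions as before; the prompt wording, data file name, and hyperparameters are placeholders, not the exact setup reported here.

# Stage 2 (illustrative): fine-tuning the Stage 1 checkpoint on high-quality
# Traditional Chinese-Thai parallel sentences rendered as translation prompts.
# Prompt template, file name, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

ckpt = "stage1-thai"  # checkpoint produced by the Stage 1 sketch above
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype="auto")

# ALMA-style translation prompt; the exact wording is an assumption.
PROMPT = "Translate this from {src} to {tgt}:\n{src}: {text}\n{tgt}: "

def build(row):
    # One Chinese-to-Thai example per pair; the reverse direction can be
    # built the same way so a single model covers both directions.
    full = PROMPT.format(src="Chinese", tgt="Thai", text=row["zh"]) + row["th"]
    return tokenizer(full, truncation=True, max_length=512)

pairs = load_dataset("json", data_files="zh_th_parallel.jsonl", split="train")
pairs = pairs.map(build, remove_columns=pairs.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="stage2-parallel",
        per_device_train_batch_size=8,
        num_train_epochs=3,
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=pairs,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()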
Our findings highlight the effectiveness of a targeted, two-stage fine-tuning process in
addressing the challenges posed by low-resource languages. By enhancing the model's proficiency
in Thai through comprehensive monolingual datasets and fine-tuning with high-quality parallel
data, we demonstrate a marked improvement in translation quality. This approach not only bridges
the linguistic gap between Traditional Chinese and Thai but also sets a precedent for future
research in machine translation involving low-resource languages. | en_US |