繁體中文 - 泰文機器翻譯;Traditional Chinese - Thai Machine Translation


Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/94410

Title:	Traditional Chinese - Thai Machine Translation (繁體中文 - 泰文機器翻譯)
Author:	莎薇;Meesawad, Wilailack
Contributor:	人工智慧國際碩士學位學程
Keywords:	Traditional Chinese;Thai;Machine Translation;Low-resource Language;Large Language Model
Date:	2024-07-25
Upload time:	2024-10-09 14:41:37 (UTC+8)
Publisher:	National Central University (國立中央大學)
Abstract:	In recent years, the globalization of labor markets has brought a large number of Thai workers to countries such as Taiwan, where they contribute to sectors ranging from manufacturing to services. As these workers go about their daily lives and jobs in a foreign country, effective communication is essential to their integration, productivity, and overall well-being. Language barriers, however, are often a serious obstacle to interaction and mutual understanding between Thai workers and their Taiwanese colleagues. In addition, translation resources for Traditional Chinese and Thai are scarce, which makes adequate language support tools hard to find. To address these issues, we propose an effective training methodology that applies Large Language Models (LLMs) to machine translation through a two-stage fine-tuning process.
In the first stage, we enhance the Thai proficiency of the 7-billion-parameter Trustworthy AI Dialogue Engine by Taiwan (TAIDE) model by refining a comprehensive dataset of one million Thai instances using the Advanced Language Model-based Translator (ALMA) strategy (developed by Xu et al. 2024 [25]). This stage focuses on giving the model a robust foundational understanding of Thai. In the second stage, the TAIDE model is fine-tuned on a smaller set of high-quality Traditional Chinese-Thai parallel data. This fine-tuning is essential for aligning the model's translation capabilities with the specific nuances of both languages.
With Qwen2-7B-Instruct (Alibaba Group, 2024 [1]) as the underlying model, our approach achieves substantial improvements in translation performance. Specifically, it achieves state-of-the-art results, with average gains of more than 20 BLEU points, 3 COMET points, and 2 COMET_KIWI points across the two translation directions (Chinese to Thai and Thai to Chinese). It significantly outperforms prior work, surpasses state-of-the-art machine translation systems such as Google Translate as well as significantly larger models such as GPT-3.5 Turbo, and achieves performance comparable to GPT-4o, even though our model has only 7B parameters. Our findings highlight the effectiveness of a targeted, two-stage fine-tuning process for low-resource languages. By strengthening the model's Thai proficiency with comprehensive monolingual data and then fine-tuning on high-quality parallel data, we demonstrate a marked improvement in translation quality. This approach not only bridges the linguistic gap between Traditional Chinese and Thai but also sets a precedent for future research on machine translation involving low-resource languages.
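The two-stage process described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the `Stage` record, the `build_prompt` template (modeled loosely on ALMA-style translation prompts), the choice to emit stage-two examples in both directions, and the sample sentence pair are all hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One fine-tuning stage; the field layout is illustrative only."""
    name: str
    data_kind: str      # "monolingual" or "parallel"
    languages: tuple
    goal: str


# The ordering follows the abstract: monolingual Thai data first,
# then a smaller set of high-quality parallel data.
PIPELINE = (
    Stage("stage1", "monolingual", ("Thai",),
          "build a foundational understanding of Thai"),
    Stage("stage2", "parallel", ("Traditional Chinese", "Thai"),
          "align translation behavior between the two languages"),
)


def build_prompt(src_lang: str, tgt_lang: str, src_text: str) -> str:
    """Hypothetical prompt template; the abstract does not give the real one."""
    return (
        f"Translate this from {src_lang} to {tgt_lang}:\n"
        f"{src_lang}: {src_text}\n"
        f"{tgt_lang}:"
    )


def stage2_examples(pairs):
    """Turn (chinese, thai) pairs into prompt/target examples.

    Emitting both directions is an assumption, made because the abstract
    reports results for Chinese to Thai and Thai to Chinese.
    """
    for zh, th in pairs:
        yield {"prompt": build_prompt("Traditional Chinese", "Thai", zh),
               "target": th}
        yield {"prompt": build_prompt("Thai", "Traditional Chinese", th),
               "target": zh}


if __name__ == "__main__":
    # Illustrative sentence pair, not taken from the thesis data.
    for example in stage2_examples([("你好", "สวัสดี")]):
        print(example["prompt"])
        print(example["target"])
```

In this sketch, stage one would consume raw Thai text for continued training, while stage two would consume the prompt/target examples produced by `stage2_examples`. The actual training loop, hyperparameters, and data are outside what the abstract specifies.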
Appears in Collections:	[人工智慧國際碩士學位學程] Theses and Dissertations

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	225

All items in NCUIR are protected by original copyright.
