Traditional Chinese - Thai Machine Translation

Name

Wilailack Meesawad (莎薇)

Department

International Master's Program in Artificial Intelligence

Thesis Title

Traditional Chinese - Thai Machine Translation (繁體中文 - 泰文機器翻譯)

Files

Full text available for browsing in the system after 2026-7-31.

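Abstract (translated from Chinese)

In recent years, the globalization of the labor market has brought large numbers of Thai
workers into countries including Taiwan, where they contribute to sectors ranging from
manufacturing to services. As they go about daily life and work in a foreign land, effective
communication is crucial to their integration, productivity, and overall well-being. Language
barriers, however, are often a major obstacle that hinders communication and understanding
between Thai workers and their Taiwanese colleagues. In addition, translation resources between
Traditional Chinese and Thai are scarce, which makes suitable language support tools hard to find.

To address these problems, we propose an effective training method that applies large language
models (LLMs) to machine translation through a two-stage fine-tuning process. In the first
stage, we improve the Thai proficiency of the 7-billion-parameter Trustworthy AI Dialogue Engine
by Taiwan (TAIDE) model by refining a Thai dataset of one million instances with the Advanced
Language Model-based Translator (ALMA) strategy (developed by Xu et al., 2024). This stage
focuses on giving the model a solid foundational understanding of Thai.

In the second stage, the TAIDE model is fine-tuned on a small dataset of high-quality
Traditional Chinese-Thai parallel data. This fine-tuning is essential for adapting the model's
translation ability to the specific nuances of the two languages. By using Qwen2-7B-Instruct
(Alibaba Group, 2024) as the base model, our method achieves a significant improvement in
translation performance. Specifically, across the two translation directions (Chinese to Thai
and Thai to Chinese) it improves on average by more than 20 BLEU points, 3 COMET points, and
2 COMET_KIWI points. It significantly exceeds prior work, surpasses state-of-the-art machine
translation systems such as Google Translate and significantly larger models such as GPT-3.5
Turbo, and reaches performance comparable to GPT-4o, even though our model has only 7 billion
parameters.

Our findings highlight the effectiveness of a targeted, two-stage fine-tuning process for
low-resource languages. By strengthening the model's Thai proficiency with comprehensive
monolingual data and then fine-tuning on high-quality parallel data, we demonstrate a marked
improvement in translation quality. This approach not only bridges the language gap between
Traditional Chinese and Thai but also sets a precedent for future machine translation research
involving low-resource languages.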

Abstract (English)

In recent years, the globalization of labor markets has led to a significant influx of Thai
workers into countries such as Taiwan, where they contribute across various sectors ranging from
manufacturing to services. As these individuals navigate their daily lives and professional
endeavors in a foreign land, effective communication becomes paramount for their integration,
productivity, and overall well-being. However, language barriers often present formidable
obstacles, hindering seamless interaction and understanding between Thai workers and their
Taiwanese counterparts. Additionally, resources for translating between Traditional Chinese and
Thai are scarce, making it challenging to find adequate language support tools.
To address these issues, we propose an effective training methodology for leveraging Large
Language Models (LLMs) in machine translation tasks through a two-stage fine-tuning process.
In the initial stage, we enhance the proficiency of the Trustworthy AI Dialogue Engine by Taiwan
(TAIDE) 7 billion parameter model in the Thai language by refining a comprehensive dataset of
Thai instances using the Advanced Language Model-based Translator (ALMA) strategy
(developed by Xu et al. 2024 [27]). This stage focuses on building a robust foundational
understanding of the Thai language within the model.
In the subsequent stage, the TAIDE model is fine-tuned on a smaller set of high-quality
Traditional Chinese-Thai parallel data. This fine-tuning is essential for aligning the model's
translation capabilities with the specific nuances of both languages. By utilizing
Qwen2-7B-Instruct (Alibaba Group, 2024 [1]) as the underlying model, our approach achieves
substantial improvements in translation performance. Specifically, our methodology achieves
state-of-the-art results, with average gains of over 20 BLEU, 3 COMET, and 2 COMET_KIWI points
across the two translation directions (Chinese to Thai and Thai to Chinese). It significantly
outperforms prior work, surpasses state-of-the-art machine translation systems such as Google
Translate and significantly larger models such as GPT-3.5 Turbo, and achieves performance
comparable to GPT-4o, despite having only 7B parameters.
Our findings highlight the effectiveness of a targeted, two-stage fine-tuning process in
addressing the challenges posed by low-resource languages. By enhancing the model's proficiency
in Thai through comprehensive monolingual datasets and fine-tuning with high-quality parallel
data, we demonstrate a marked improvement in translation quality. This approach not only bridges
the linguistic gap between Traditional Chinese and Thai but also sets a precedent for future
research in machine translation involving low-resource languages.
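
The second stage described above, fine-tuning on parallel data in both translation directions, can be illustrated with a minimal sketch of how one sentence pair might be expanded into two training examples. This is not taken from the thesis: the prompt template is only modeled loosely on the style used in ALMA-like setups, and the names `ParallelPair`, `build_prompt`, `make_stage2_examples`, and `LANG_NAMES` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical language labels; the record does not state the actual ones.
LANG_NAMES = {"zh": "Traditional Chinese", "th": "Thai"}


@dataclass(frozen=True)
class ParallelPair:
    """One aligned sentence pair from a parallel corpus."""
    zh: str
    th: str


def build_prompt(src_lang: str, tgt_lang: str, src_text: str) -> str:
    """Format a translation prompt (assumed template, not the thesis's own)."""
    src_name = LANG_NAMES[src_lang]
    tgt_name = LANG_NAMES[tgt_lang]
    return (
        f"Translate this from {src_name} to {tgt_name}:\n"
        f"{src_name}: {src_text}\n"
        f"{tgt_name}:"
    )


def make_stage2_examples(pairs):
    """Expand each pair into two examples, one per translation direction."""
    examples = []
    for pair in pairs:
        # Chinese -> Thai
        examples.append(
            {"prompt": build_prompt("zh", "th", pair.zh), "completion": " " + pair.th}
        )
        # Thai -> Chinese
        examples.append(
            {"prompt": build_prompt("th", "zh", pair.th), "completion": " " + pair.zh}
        )
    return examples


if __name__ == "__main__":
    demo = [ParallelPair(zh="你好", th="สวัสดี")]
    for example in make_stage2_examples(demo):
        print(example["prompt"] + example["completion"])
        print("---")
```

Emitting both directions from every pair is one simple way to train a single model for Chinese-to-Thai and Thai-to-Chinese translation; whether the thesis balances the directions this way is not stated in this record.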

Keywords

★ Traditional Chinese
★ Thai
★ Machine Translation
★ Low-resource Language
★ Large Language Model

Table of Contents

Abstract (Chinese)
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Chapter I Introduction
1.1 Traditional Machine Translation
1.2 Research Objective
Chapter II Related Work
2.1 Traditional Machine Translation
2.2 Generative (Decoder-Only) Models for Translation
2.3 Low-Resource Languages
2.4 Reference-Free Evaluation Metrics
Chapter III Methodology
3.1 Corpus Preparation
3.1.1 Monolingual Datasets
3.1.2 High-quality Parallel Datasets
3.2 Model Training
3.2.1 Continuous Pre-Training
3.2.2 High-Quality Data Fine-Tuning
3.3 Experimental Settings
3.3.1 Training Setup
3.3.2 Baselines
3.3.3 Evaluation Metrics
Chapter IV Experimental Results
4.1 Overview
4.2 Performance Comparison
Chapter V Analysis
5.1 Comparative Performance Across Translation Directions
5.2 Comparative Performance Across Models
5.3 Implications for Low-Resource Language Translation
5.4 Human Evaluation on The Translation Results
5.5 Comparison of The Translation Results Between Our Model vs The SoTA Model
Chapter VI Conclusion
Chapter VII Limitation and Future Work
Bibliography

References

[1] Alibaba Cloud. Qwen2. https://qwenlm.github.io/blog/qwen2/. 2024.
[2] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by
jointly learning to align and translate." arXiv preprint arXiv:1409.0473. 2014.
[3] Brown, Peter F., John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick
Jelinek, John Lafferty, Robert L. Mercer, and Paul S. Roossin. "A statistical approach to
machine translation." Computational linguistics 16, no. 2: 79-85. 1990.
[4] Csaki, Zoltan, Pian Pawakapan, Urmish Thakker, and Qiantong Xu. "Efficiently adapting
pretrained language models to new languages." 2023.
[5] Freitag, Markus, Nitika Mathur, Chi-kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian
Thompson, Tom Kocmi et al. "Results of WMT23 metrics shared task: Metrics might be
guilty but references are not innocent." In Proceedings of the Eighth Conference on
Machine Translation, pp. 578-628. 2023.
[6] Freitag, Markus, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, Eleftherios
Avramidis, Tom Kocmi, George Foster, Alon Lavie, and André FT Martins. "Results of
WMT22 metrics shared task: Stop using BLEU–neural metrics are better and more robust."
In Proceedings of the Seventh Conference on Machine Translation (WMT), pp. 46-68.
2022.
[7] Google. Gemini: A Family of Highly Capable Multimodal Models. 2024.
[8] Gunasekar, Suriya, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno,
Sivakanth Gopi, Mojan Javaheripi et al. "Textbooks are all you need." arXiv preprint
arXiv:2306.11644. 2023.
[9] Lowphansirikul, Lalita, Charin Polpanumas, Attapol T. Rutherford, and Sarana Nutanong.
"scb-mt-en-th-2020: A large english-thai parallel corpus." arXiv preprint
arXiv:2007.03541. 2020.
[10] Lu, Bo-Han, Yi-Hsuan Lin, Annie Lee, and Richard Tzong-Han Tsai. "Enhancing
Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing
Systems." In Proceedings of the 2024 Joint International Conference on Computational
Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 6077-6090.
2024.
[11] Maillard, Jean, Cynthia Gao, Elahe Kalbassi, Kaushik Ram Sadagopan, Vedanuj Goswami,
Philipp Koehn, Angela Fan, and Francisco Guzmán. "Small data, big impact: Leveraging
minimal data for effective machine translation." In Proceedings of the 61st Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2740-2756.
2023.
[12] Meta. Llama3. https://ai.meta.com/blog/meta-llama-3/. 2024.
[13] National Science and Technology Council and National Applied Research Laboratories.
TAIDE-LX-7B. https://en.taide.tw. 2024.
[14] Nguyen, Xuan-Phi, Wenxuan Zhang, Xin Li, Mahani Aljunied, Qingyu Tan, Liying Cheng,
Guanzheng Chen et al. "SeaLLMs--Large Language Models for Southeast Asia." arXiv
preprint arXiv:2312.00738. 2023.
[15] NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth
Heafield, Kevin Heffernan et al. "No language left behind: Scaling human-centered
machine translation." arXiv preprint arXiv:2207.04672. 2022.
[16] OpenAI. GPT-3.5 Turbo. https://platform.openai.com/docs/models/gpt-3-5-turbo. 2023.
[17] OpenAI. Gpt-4 technical report. 2023.
[18] OpenAI. GPT-4o. https://openai.com/index/hello-gpt-4o/. 2024.
[19] OpenThaiGPT. Released openthaigpt 7b 1.0.0-beta. https://openthaigpt.aieat.or.th/. 2023.
[20] Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "Bleu: a method for
automatic evaluation of machine translation." In Proceedings of the 40th annual meeting
of the Association for Computational Linguistics, pp. 311-318. 2002.
[21] Post, Matt. "A call for clarity in reporting BLEU scores." In Proceedings of the Third
Conference on Machine Translation: Research Papers, pp. 186-191, Brussels, Belgium.
Association for Computational Linguistics. doi: 10.18653/v1/W18-6319.
https://aclanthology.org/W18-6319. 2018.
[22] Rasley, Jeff, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. "Deepspeed:
System optimizations enable training deep learning models with over 100 billion
parameters." In Proceedings of the 26th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, pp. 3505-3506. 2020.
[23] Rei, Ricardo, José GC De Souza, Duarte Alves, Chrysoula Zerva, Ana C. Farinha, Taisiya
Glushkova, Alon Lavie, Luisa Coheur, and André FT Martins. "COMET-22: Unbabel-IST
2022 submission for the metrics shared task." In Proceedings of the Seventh Conference
on Machine Translation (WMT), pp. 578-585. 2022.
[24] Rei, Ricardo, Nuno M. Guerreiro, José Pombal, Daan van Stigt, Marcos Treviso, Luisa
Coheur, José GC de Souza, and André FT Martins. "Scaling up cometkiwi: Unbabel-ist
2023 submission for the quality estimation shared task." arXiv preprint arXiv:2309.11925.
2023.
[25] Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine
Babaei, Nikolay Bashlykov et al. "Llama 2: Open foundation and fine-tuned chat models."
arXiv preprint arXiv:2307.09288. 2023.
[26] Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun et al. "Google's neural machine translation system: Bridging the
gap between human and machine translation." arXiv preprint arXiv:1609.08144. 2016.
[27] Xu, Haoran, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. "A paradigm shift
in machine translation: Boosting translation performance of large language models." arXiv
preprint arXiv:2309.11674. 2023.
[28] Zhou, Chunting, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe
Ma et al. "Lima: Less is more for alignment." Advances in Neural Information Processing
Systems 36. 2024.

Advisor

Richard Tzong-Han Tsai (蔡宗翰)

Approval Date

2024-7-25
