References
[1] OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023. [Online].
Available: https://api.semanticscholar.org/CorpusID:257532815
[2] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Roz-
ière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lam-
ple, “LLaMA: Open and efficient foundation language models,” 2023.
[3] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov,
S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned
chat models,” arXiv preprint arXiv:2307.09288, 2023.
[4] Y. Cui, Z. Yang, and X. Yao, “Efficient and effective text encoding for Chinese
LLaMA and Alpaca,” arXiv preprint arXiv:2304.08177, 2023. [Online]. Available:
https://arxiv.org/abs/2304.08177
[5] A. Balachandran, “Tamil-LLaMA: A new Tamil language model based on LLaMA 2,”
2023.
[6] P. S. Ding, Introduction. Singapore: Springer Singapore, 2016, pp. 1–18. [Online].
Available: https://doi.org/10.1007/978-981-287-594-5_1
[7] ——, Taiwan: The Haven for Southern Min? Singapore: Springer Singapore, 2016,
pp. 55–75. [Online]. Available: https://doi.org/10.1007/978-981-287-594-5_4
[8] Y.-F. Liao, C.-Y. Chang, H.-K. Tiun, H.-L. Su, H.-L. Khoo, J. S. Tsay,
L.-K. Tan, P. Kang, T.-G. Thiann, U.-G. Iunn, J.-H. Yang, and C.-N. Liang,
“Formosa Speech Recognition Challenge 2020 and Taiwanese Across Taiwan
Corpus,” in 2020 23rd Conference of the Oriental COCOSDA International
Committee for the Co-ordination and Standardisation of Speech Databases and
Assessment Techniques (O-COCOSDA), 2020, pp. 65–70. [Online]. Available:
https://ieeexplore.ieee.org/document/9295019
[9] Y. Moslem, R. Haque, J. D. Kelleher, and A. Way, “Adaptive Machine
Translation with Large Language Models,” in Proceedings of the 24th Annual
Conference of the European Association for Machine Translation. European
Association for Machine Translation, 2023, pp. 227–237. [Online]. Available:
https://aclanthology.org/2023.eamt-1.22
[10] X. V. Lin, T. Mihaylov, M. Artetxe, T. Wang, S. Chen, D. Simig, M. Ott,
N. Goyal, S. Bhosale, J. Du, R. Pasunuru, S. Shleifer, P. S. Koura, V. Chaudhary,
B. O’Horo, J. Wang, L. Zettlemoyer, Z. Kozareva, M. Diab, V. Stoyanov, and
X. Li, “Few-shot Learning with Multilingual Generative Language Models,” in
Proceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing. Association for Computational Linguistics, 2022, pp. 9019–9052.
[Online]. Available: https://aclanthology.org/2022.emnlp-main.616
[11] W. Zhu, H. Liu, Q. Dong, J. Xu, L. Kong, J. Chen, L. Li, and S. Huang, “Multilingual
machine translation with large language models: Empirical results and analysis,”
arXiv preprint arXiv:2304.04675, 2023.
[12] B. Zhang, B. Haddow, and A. Birch, “Prompting large language model for machine
translation: A case study,” 2023.
[13] D. Vilar, M. Freitag, C. Cherry, J. Luo, V. Ratnakar, and G. Foster, “Prompting
PaLM for Translation: Assessing Strategies and Performance,” in Proceedings
of the 61st Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers). Association for Computational Linguistics, 2023, pp.
15406–15427. [Online]. Available: https://aclanthology.org/2023.acl-long.859
[14] X. García, Y. Bansal, C. Cherry, G. F. Foster, M. Krikun, F. Feng, M. Johnson,
and O. Firat, “The unreasonable effectiveness of few-shot learning for machine
translation,” arXiv preprint arXiv:2302.01398, 2023. [Online]. Available:
https://api.semanticscholar.org/CorpusID:256598283
[15] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-
Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu,
C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess,
J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei,
“Language models are few-shot learners,” in Advances in Neural Information
Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan,
and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 1877–
1901. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/
file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[16] S. Zhang, Q. Fang, Z. Zhang, Z. Ma, Y. Zhou, L. Huang, M. Bu, S. Gui, Y. Chen,
X. Chen et al., “BayLing: Bridging cross-lingual alignment and instruction following
through interactive translation for large language models,” arXiv preprint
arXiv:2306.10968, 2023.
[17] W. Yang, C. Li, J. Zhang, and C. Zong, “BigTrans: Augmenting large language
models with multilingual translation capability over 100 languages,” arXiv preprint
arXiv:2305.18098, 2023.
[18] J. Li, H. Zhou, S. Huang, S. Chen, and J. Chen, “Eliciting the translation
ability of large language models via multilingual finetuning with translation
instructions,” arXiv preprint arXiv:2305.15083, 2023. [Online]. Available:
https://api.semanticscholar.org/CorpusID:258865882
[19] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang,
S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller,
M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Train-
ing language models to follow instructions with human feedback,” 2022.
[20] A. Hendy, M. G. Abdelrehim, A. Sharaf, V. Raunak, M. Gabr, H. Matsushita,
Y. J. Kim, M. Afify, and H. H. Awadalla, “How good are GPT models at machine
translation? A comprehensive evaluation,” arXiv preprint arXiv:2302.09210, 2023.
[Online]. Available: https://api.semanticscholar.org/CorpusID:257038384
[21] W. Jiao, W. Wang, J. Huang, X. Wang, and Z. Tu, “Is ChatGPT a good translator?
Yes with GPT-4 as the engine,” arXiv preprint arXiv:2301.08745, 2023.
[22] H. Xu, Y. J. Kim, A. Sharaf, and H. H. Awadalla, “A paradigm shift in machine trans-
lation: Boosting translation performance of large language models,” arXiv preprint
arXiv:2309.11674, 2023.
[23] NLLB Team, M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield,
K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang,
G. Wenzek, A. Youngblood, B. Akula, L. Barrault, G. M. Gonzalez, P. Hansanti,
J. Hoffman, S. Jarrett, K. R. Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews,
N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzmán,
P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, and J. Wang, “No
language left behind: Scaling human-centered machine translation,” 2022.
[24] Y.-F. Liao, J. S. Tsay, P. Kang, H.-L. Khoo, L.-K. Tan, L.-C. Chang, U.-G. Iunn, H.-
L. Su, T.-G. Thiann, H.-K. Tiun, and S.-L. Liao, “Taiwanese Across Taiwan corpus
and its applications,” in 2022 25th Conference of the Oriental COCOSDA Interna-
tional Committee for the Co-ordination and Standardisation of Speech Databases
and Assessment Techniques (O-COCOSDA), 2022, pp. 1–5.
[25] S.-E. Lu, B.-H. Lu, C.-Y. Lu, and R. T.-H. Tsai, “Exploring methods for building
dialects-Mandarin code-mixing corpora: A case study in Taiwanese hokkien,” in
Findings of the Association for Computational Linguistics: EMNLP 2022. Abu
Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec.
2022, pp. 6287–6305. [Online]. Available: https://aclanthology.org/2022.findings-
emnlp.469
[26] A. Conneau and G. Lample, “Cross-lingual language model pretraining,” Advances
in neural information processing systems, vol. 32, 2019.
[27] T. Kudo and J. Richardson, “SentencePiece: A simple and language independent
subword tokenizer and detokenizer for neural text processing,” in Proceedings of the
2018 Conference on Empirical Methods in Natural Language Processing: System
Demonstrations. Brussels, Belgium: Association for Computational Linguistics,
Nov. 2018, pp. 66–71. [Online]. Available: https://aclanthology.org/D18-2012
[28] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen,
“LoRA: Low-rank adaptation of large language models,” 2021.
[29] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic
evaluation of machine translation,” in Proceedings of the 40th Annual Meeting
of the Association for Computational Linguistics. Philadelphia, Pennsylvania,
USA: Association for Computational Linguistics, Jul. 2002, pp. 311–318. [Online].
Available: https://aclanthology.org/P02-1040
[30] M. Popović, “chrF++: words helping character n-grams,” in Proceedings of the
Second Conference on Machine Translation. Copenhagen, Denmark: Association
for Computational Linguistics, Sep. 2017, pp. 612–618. [Online]. Available:
https://aclanthology.org/W17-4770
[31] T. Kocmi and C. Federmann, “Large language models are state-of-the-art evaluators
of translation quality,” in Proceedings of the 24th Annual Conference of the
European Association for Machine Translation. Tampere, Finland: European
Association for Machine Translation, Jun. 2023, pp. 193–203. [Online]. Available:
https://aclanthology.org/2023.eamt-1.19
[32] W. Zhu, Y. Lv, Q. Dong, F. Yuan, J. Xu, S. Huang, L. Kong, J. Chen, and L. Li,
“Extrapolating large language models to non-English by aligning languages,” 2023.
[33] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell,
M. Zaharia, and R. Xin. (2023) Free Dolly: Introducing the world’s first truly open
instruction-tuned LLM. [Online]. Available: https://www.databricks.com/blog/2023/
04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
[34] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and
T. B. Hashimoto, “Stanford Alpaca: An instruction-following LLaMA model,” https:
//github.com/tatsu-lab/stanford_alpaca, 2023.
[35] Y.-C. Huang, Y.-L. Hsieh, Y.-Y. Lin, T. L. Hui, H.-Y. Chu, and W.-L. Hsu. (2021)
FLUD: Expert-curated large-scale machine comprehension dataset with advanced
reasoning strategies. [Online]. Available: https://www.kistep.re.kr/arpIssue.es?act=
content_view&list_no=200&act=content_view&mid=a20802000000