Abstract: Taiwanese Hokkien faces developmental limitations in Natural Language Processing (NLP) because its corpora are scarce and its writing systems are diverse, which hinders neural NLP models that typically depend on large amounts of training data. In recent years, the rapid advancement of large language models (LLMs) has brought significant progress on many NLP tasks, but most of these gains are built on high-resource languages (HRLs) with abundant text, such as Chinese and English. This study therefore aims to address the developmental challenges of Taiwanese Hokkien in NLP.
First, we developed a Taiwanese Hokkien LLM on top of a LLaMA 2 model that had been further pre-trained on Traditional Chinese, using the limited Taiwanese Hokkien corpora available. We then trained a model capable of effective translation among Chinese, English, and Taiwanese Hokkien, a crucial step toward bridging the resource gap with HRLs. We collected approximately 78 MB of monolingual data covering four Taiwanese Hokkien writing systems, and we extended the existing Chinese–Taiwanese Hokkien parallel data to include English, producing parallel training data for multiple translation directions. We also investigated the impact of extending the model's vocabulary with Taiwanese Hokkien tokens, as well as the effect of using different Taiwanese Hokkien writing systems during the pre-training and fine-tuning stages. The results show that pre-training on all writing systems and extending the parallel data to English improved the translator's capabilities.
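For illustration, the vocabulary-extension step described above could look roughly like the following sketch, which assumes a Traditional Chinese LLaMA 2 checkpoint loaded through Hugging Face transformers; the checkpoint path and the example Taiwanese Hokkien tokens are placeholders rather than the exact ones used in this work.

# Minimal sketch: extend the tokenizer with Taiwanese Hokkien tokens and
# resize the embeddings before continued pre-training on the monolingual data.
# BASE_MODEL and new_tokens are illustrative placeholders.
from transformers import AutoTokenizer, AutoModelForCausalLM

BASE_MODEL = "path/to/traditional-chinese-llama2"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Hypothetical high-frequency Taiwanese Hokkien units (Tai-lo syllables,
# Han-ji strings, etc.) mined from the ~78 MB monolingual corpus.
new_tokens = ["tâi", "gí", "lâng", "beh", "khì", "咧", "佮"]
num_added = tokenizer.add_tokens(new_tokens)

# Give the new token ids trainable embedding rows, then continue causal-LM
# pre-training on the Taiwanese Hokkien corpus (training loop omitted here).
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")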
Next, we used this translation model to convert Chinese instruction fine-tuning datasets into Taiwanese Hokkien versions and trained a Taiwanese Hokkien chat model on them. To assess the chat model comprehensively, we constructed evaluation datasets covering several aspects. In the experiments, we compared initializing the chat model from the translation model against initializing it from the original Taiwanese Hokkien language model, and we explored cross-lingual fine-tuning that additionally incorporates English instruction-tuning data. Our findings indicate that the chat model initialized from the translation model improved on translation tasks, and that cross-lingual fine-tuning was beneficial on most generative tasks.
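As a rough illustration of the dataset-translation step, the sketch below converts one Chinese instruction example into Taiwanese Hokkien with the trained translator; the checkpoint path, prompt template, and example data are assumptions for illustration, not the exact format used in this work.

# Sketch: translate a Chinese instruction-tuning example into Taiwanese Hokkien
# with the fine-tuned translator. Checkpoint path, prompt template, and example
# data are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForCausalLM

TRANSLATOR = "path/to/hokkien-translator"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(TRANSLATOR)
model = AutoModelForCausalLM.from_pretrained(TRANSLATOR)

def zh_to_hokkien(text: str) -> str:
    """Translate a Traditional Chinese string into Taiwanese Hokkien."""
    prompt = f"將下列中文翻譯成台語:\n{text}\n台語:"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Keep only the tokens generated after the prompt.
    generated = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()

# Convert both fields of one Chinese instruction example.
zh_example = {"instruction": "請簡單介紹台灣的夜市文化。",
              "output": "夜市是台灣重要的飲食與生活文化。"}
hokkien_example = {key: zh_to_hokkien(value) for key, value in zh_example.items()}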
Our research advances Taiwanese Hokkien NLP by narrowing the resource gap with HRLs through translation and by providing resources and tools for related studies, laying a solid foundation for the future development of Taiwanese Hokkien language models.