NCU Institutional Repository (theses and dissertations, past exam papers, journal articles, research projects): Item 987654321/94410


    Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/94410


    Title: 繁體中文 - 泰文機器翻譯; Traditional Chinese - Thai Machine Translation
    Author: 莎薇; Meesawad, Wilailack
    Contributor: 人工智慧國際碩士學位學程 (International Master's Program in Artificial Intelligence)
    Keywords: Traditional Chinese; Thai; Machine Translation; Low-resource Language; Large Language Model
    Date: 2024-07-25
    Uploaded: 2024-10-09 14:41:37 (UTC+8)
    Publisher: 國立中央大學 (National Central University)
    Abstract: In recent years, the globalization of labor markets has led to a significant influx of Thai
    workers into countries such as Taiwan, where they contribute across various sectors ranging from
    manufacturing to services. As these individuals navigate their daily lives and professional
    endeavors in a foreign land, effective communication becomes paramount for their integration,
    productivity, and overall well-being. However, language barriers often present formidable
    obstacles, hindering seamless interaction and understanding between Thai workers and their
    Taiwanese counterparts. Additionally, resources for translating between Traditional Chinese and
    Thai are scarce, making it challenging to find adequate language support tools.
    To address these issues, we propose an effective training methodology for leveraging Large
    Language Models (LLMs) in machine translation tasks through a two-stage fine-tuning process.
    In the initial stage, we enhance the Thai proficiency of the 7-billion-parameter Trustworthy AI
    Dialogue Engine by Taiwan (TAIDE) model by refining a dataset of one million Thai instances
    using the Advanced Language Model-based Translator (ALMA) strategy (developed by Xu et al.
    2024 [25]). This stage focuses on building a robust foundational understanding of the Thai
    language within the model.
    In the subsequent stage, the TAIDE model is fine-tuned on a smaller set of high-quality
    Traditional Chinese-Thai parallel data. This fine-tuning is essential for aligning the model’s
    translation capabilities with the specific nuances of both languages. By utilizing
    Qwen2-7B-Instruct (Alibaba Group, 2024 [1]) as the underlying model, our approach achieves
    substantial improvements in translation performance. Specifically, our methodology achieves
    state-of-the-art results, with average gains of over 20 BLEU, 3 COMET, and 2 COMET_KIWI
    points across the two translation directions (Chinese to Thai and Thai to Chinese). It
    significantly outperforms prior work, surpasses state-of-the-art machine translation systems
    such as Google Translate and significantly larger models such as GPT-3.5 Turbo, and performs
    comparably to GPT-4o, despite having only 7B parameters.
    Our findings highlight the effectiveness of a targeted, two-stage fine-tuning process in
    addressing the challenges posed by low-resource languages. By enhancing the model's proficiency
    in Thai through comprehensive monolingual datasets and fine-tuning with high-quality parallel
    data, we demonstrate a marked improvement in translation quality. This approach not only bridges
    the linguistic gap between Traditional Chinese and Thai but also sets a precedent for future
    research in machine translation involving low-resource languages.
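
The second stage described in the abstract fine-tunes on parallel pairs for both translation directions. As a minimal sketch only, and assuming an ALMA-style prompt template (the abstract does not state the template actually used in the thesis, so `LANG_NAMES`, `build_prompt`, and `make_examples` are hypothetical names introduced here for illustration), one parallel pair could be turned into two training examples roughly as follows:

```python
# Hypothetical helper names; the prompt wording is an assumption modeled on
# ALMA-style prompts, not the template confirmed by the thesis.
LANG_NAMES = {"zh": "Traditional Chinese", "th": "Thai"}


def build_prompt(src: str, tgt: str, text: str) -> str:
    """Build a translation prompt that ends where the model should start writing."""
    src_name, tgt_name = LANG_NAMES[src], LANG_NAMES[tgt]
    return (
        f"Translate this from {src_name} to {tgt_name}:\n"
        f"{src_name}: {text}\n"
        f"{tgt_name}:"
    )


def make_examples(zh: str, th: str) -> list[dict[str, str]]:
    """Turn one parallel pair into two examples, one per translation direction."""
    return [
        {"prompt": build_prompt("zh", "th", zh), "completion": " " + th},
        {"prompt": build_prompt("th", "zh", th), "completion": " " + zh},
    ]


if __name__ == "__main__":
    # Print both directions for a single illustrative sentence pair.
    for example in make_examples("你好", "สวัสดี"):
        print(example["prompt"] + example["completion"])
        print("---")
```

Emitting both directions from each pair is one simple way to cover Chinese to Thai and Thai to Chinese with the same small parallel set; whether the thesis balances the directions this way is not stated in the abstract.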
    Appears in collection: [人工智慧國際碩士學位學程 (International Master's Program in Artificial Intelligence)] Theses and Dissertations
