NCU Institutional Repository (中大機構典藏): Item 987654321/98158


    Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98158


    Title: 通過跨語言相關詞高效初始化大型語言模型的新詞嵌入:以英文到繁體中文為例;Efficiently Initializing New Word Embeddings for Large Language Models via Interlingual Related Words: A Case Study on English to Traditional Chinese
    Authors: 蕭士凱;Hsiao, Shih-Kai
    Contributors: Department of Computer Science and Information Engineering
    Keywords: Natural Language Processing;Large Language Model;Word Embedding
    Date: 2025-05-02
    Issue Date: 2025-10-17 12:26:44 (UTC+8)
    Publisher: National Central University
    Abstract: In the realm of large language models (LLMs), the ability to handle multiple languages is becoming increasingly vital. However, many open-source models are designed primarily for English, and this focus hurts their performance in other languages, such as Traditional Chinese. A common remedy is continual pre-training on data in the target language. However, because the original model's vocabulary typically covers the target language poorly, training can be inefficient, which raises its cost. Vocabulary expansion, and in particular the efficient initialization of new word embeddings, therefore remains a significant challenge.
    To address this challenge, we introduce Cross-Lingual Semantic Initialization (CLSI), a simple yet effective strategy that transforms existing English word embeddings into good initial values for target-language word embeddings. CLSI first uses machine translation to obtain semantically related English words for each new token. It then averages the embeddings of these related words and applies the Language Offset, a vector that aligns the word embeddings with the target language, ultimately yielding the initial values of the new word embeddings. Unlike more sophisticated approaches, CLSI requires only a vocabulary mapping table: it needs no additional training to generate the initial embeddings and does not rely on external multilingual embeddings.
    Our experiments on Traditional Chinese demonstrate that CLSI converges significantly faster than random or mean-based initialization, reaches a lower training loss, and scores higher on Traditional Chinese evaluations. These results underscore the effectiveness and flexibility of CLSI.
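    The abstract's three-step recipe lends itself to a short illustration. Below is a minimal Python sketch of CLSI as described above: look up the machine-translated related English words for each new token, average their embeddings, and add the Language Offset. Every name here is hypothetical, and the offset construction shown (a mean difference over translation anchor pairs) is an assumption made for illustration; the thesis defines the actual procedure.

        import numpy as np

        def estimate_language_offset(anchor_pairs, en_embeddings, zh_embeddings):
            """One plausible Language Offset (an assumption): the mean difference
            between target-language token embeddings already in the original
            vocabulary and the embeddings of their English translations."""
            diffs = [zh_embeddings[zh] - en_embeddings[en]
                     for zh, en in anchor_pairs
                     if zh in zh_embeddings and en in en_embeddings]
            return np.mean(np.stack(diffs), axis=0)

        def clsi_initialize(related_en_words, en_embeddings, language_offset):
            """Initialize one new token embedding from its related English words,
            obtained via machine translation (the vocabulary mapping table)."""
            related = [en_embeddings[w] for w in related_en_words if w in en_embeddings]
            if not related:
                # Assumed fallback when the mapping yields no known English words:
                # fall back to the mean over all existing embeddings.
                related = list(en_embeddings.values())
            # Average the related English embeddings, then apply the Language
            # Offset to align the result with the target language.
            return np.mean(np.stack(related), axis=0) + language_offset

        # Hypothetical usage with an eight-dimensional toy embedding table.
        en_emb = {"language": np.random.randn(8), "model": np.random.randn(8)}
        offset = np.zeros(8)  # placeholder; see estimate_language_offset above
        new_vec = clsi_initialize(["language", "model"], en_emb, offset)

    Because the only external input is the vocabulary mapping table, this sketch mirrors the abstract's claim that CLSI needs no extra training and no external multilingual embeddings.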
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

    Files in This Item:

    File          Description    Size    Format
    index.html                   0Kb     HTML


    All items in NCUIR are protected by copyright, with all rights reserved.

