NCU Institutional Repository (中大機構典藏): Item 987654321/98158


    Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98158


    Title: 通過跨語言相關詞高效初始化大型語言模型的新詞嵌入:以英文到繁體中文為例;Efficiently Initializing New Word Embeddings for Large Language Models via Interlingual Related Words: A Case Study on English to Traditional Chinese
    Authors: 蕭士凱;Hsiao, Shih-Kai
    Contributors: Department of Computer Science and Information Engineering
    Keywords: Natural Language Processing;Large Language Model;Word Embedding
    Date: 2025-05-02
    Issue Date: 2025-10-17 12:26:44 (UTC+8)
    Publisher: National Central University
    Abstract: In the realm of large language models (LLMs), the ability to handle multiple languages is becoming increasingly vital. However, many open-source models are designed primarily for English, and this focus hurts their performance in other languages, such as Traditional Chinese. A common remedy is continual pre-training on data in the target language. However, because the original model's vocabulary typically covers the target language poorly, training can be inefficient, which raises its cost. Vocabulary expansion, and in particular the efficient initialization of new word embeddings, therefore remains a significant challenge.
    To address this challenge, we introduce Cross-Lingual Semantic Initialization (CLSI), a simple yet effective strategy that transforms existing English word embeddings into good initial values for target-language word embeddings. CLSI first uses machine translation to obtain semantically related English words for each new token. It then averages the embeddings of these related words and applies the Language Offset, a vector that aligns the word embeddings with the target language, ultimately yielding the initial values of the new word embeddings. Unlike more sophisticated approaches, CLSI requires only a vocabulary mapping table: it needs no additional training to generate the initial embeddings and does not rely on external multilingual embeddings.
    Our experiments on Traditional Chinese demonstrate that CLSI converges significantly faster than random or mean-based initialization, reaches a lower training loss, and scores higher on Traditional Chinese evaluations. These results underscore the effectiveness and flexibility of CLSI.
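    The abstract's three-step recipe lends itself to a short illustration. Below is a minimal Python sketch of CLSI as described above: look up the machine-translated related English words for each new token, average their embeddings, and add the Language Offset. Every name here is hypothetical, and the offset construction shown (a mean difference over translation anchor pairs) is an assumption made for illustration; the thesis defines the actual procedure.

        import numpy as np

        def estimate_language_offset(anchor_pairs, en_embeddings, zh_embeddings):
            """One plausible Language Offset (an assumption): the mean difference
            between target-language token embeddings already in the original
            vocabulary and the embeddings of their English translations."""
            diffs = [zh_embeddings[zh] - en_embeddings[en]
                     for zh, en in anchor_pairs
                     if zh in zh_embeddings and en in en_embeddings]
            return np.mean(np.stack(diffs), axis=0)

        def clsi_initialize(related_en_words, en_embeddings, language_offset):
            """Initialize one new token embedding from its related English words,
            obtained via machine translation (the vocabulary mapping table)."""
            related = [en_embeddings[w] for w in related_en_words if w in en_embeddings]
            if not related:
                # Assumed fallback when the mapping yields no known English words:
                # fall back to the mean over all existing embeddings.
                related = list(en_embeddings.values())
            # Average the related English embeddings, then apply the Language
            # Offset to align the result with the target language.
            return np.mean(np.stack(related), axis=0) + language_offset

        # Hypothetical usage with an eight-dimensional toy embedding table.
        en_emb = {"language": np.random.randn(8), "model": np.random.randn(8)}
        offset = np.zeros(8)  # placeholder; see estimate_language_offset above
        new_vec = clsi_initialize(["language", "model"], en_emb, offset)

    Because the only external input is the vocabulary mapping table, this sketch mirrors the abstract's claim that CLSI needs no extra training and no external multilingual embeddings.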
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

    Files in This Item:

    File          Description    Size    Format
    index.html                   0Kb     HTML


    All items in NCUIR are protected by copyright, with all rights reserved.

