

    Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98477


    Title: Deep Learning for Mandarin-English Code-switching Speech Recognition
    Author: Nga, Cao Hong (高紅雅)
    Contributors: Department of Computer Science and Information Engineering
    Keywords: automatic speech recognition; code-switching speech recognition; joint optimization; phone recognition; transfer learning; semi-supervised learning; mutual learning; language modeling
    Date: 2025-07-31
    Date Uploaded: 2025-10-17 12:49:46 (UTC+8)
    Publisher: National Central University
    Abstract: Code-switching, the alternating use of two or more languages within a single utterance or conversation, is a common phenomenon in multilingual communities. Although humans can easily interpret such speech using contextual cues, automatic speech recognition (ASR) systems continue to face challenges due to linguistic complexity and limited training resources. This thesis addresses Mandarin-English code-switching speech recognition (CSSR) by proposing three novel methods: Joint Optimization with Universal Phone Recognition (JOPR), Cyclic Transfer Learning (CTL), and Mutual Learning-Based Semi-Supervised Learning (MLSS).

    The first approach, JOPR, introduces a joint CTC/attention-based sequence-to-sequence (seq2seq) model with a shared encoder and two hybrid decoders: one for CSSR and the other for universal phone transcription. The auxiliary universal phone recognition task enhances the model’s language-agnostic phonetic representation, leading to improved recognition accuracy.
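    The joint objective described above can be sketched numerically. This is a minimal illustration, assuming each hybrid decoder interpolates a CTC loss and an attention loss and the two branches are then combined; the weights `lam` and `alpha` and the function name are illustrative assumptions, not values from the thesis:

    ```python
    def jopr_loss(cssr_ctc, cssr_att, phone_ctc, phone_att,
                  lam=0.3, alpha=0.7):
        """Hypothetical JOPR-style objective: each hybrid decoder mixes a
        CTC and an attention loss with weight `lam`; the CSSR branch and
        the universal-phone branch are combined with weight `alpha`."""
        cssr_branch = lam * cssr_ctc + (1 - lam) * cssr_att
        phone_branch = lam * phone_ctc + (1 - lam) * phone_att
        return alpha * cssr_branch + (1 - alpha) * phone_branch

    # toy per-utterance losses for the two decoders (lower is better)
    total = jopr_loss(cssr_ctc=2.0, cssr_att=1.5, phone_ctc=1.8, phone_att=1.2)
    ```

    The shared encoder is trained against this single weighted loss, so gradients from the phone-transcription branch shape the same representation used for CSSR.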

    The second method, CTL, leverages both monolingual and code-switching corpora through iterative pre-training and fine-tuning. The model is first trained on code-switching speech using phonetic labels, then fine-tuned on monolingual data. The learned weights are subsequently transferred to the target CSSR model for further optimization. This cyclic training process is repeated to capture complementary linguistic knowledge across language domains, enhancing the model's generalization ability.
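    The cyclic schedule can be sketched as a plain training loop; `train` is a hypothetical stand-in for one pre-training or fine-tuning stage, and the stage names are illustrative:

    ```python
    def cyclic_transfer(train, n_cycles=3):
        """Hypothetical CTL schedule: each cycle pre-trains on
        phone-labeled code-switching speech, fine-tunes on monolingual
        data, then transfers the weights to the final CSSR model."""
        weights = None  # start from scratch in the first cycle
        for _ in range(n_cycles):
            weights = train("cs_phone", weights)     # phone-labeled code-switching data
            weights = train("monolingual", weights)  # monolingual fine-tuning
            weights = train("cssr", weights)         # target CSSR model
        return weights

    # example: a stub `train` that just records the stage order
    log = []
    cyclic_transfer(lambda stage, w: log.append(stage) or stage, n_cycles=2)
    ```

    Because each cycle restarts from the previous cycle's CSSR weights rather than from scratch, knowledge from the monolingual and code-switching domains accumulates across iterations.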

    The third contribution, MLSS, proposes a semi-supervised learning framework using two networks in a mutual learning setup. Each network iteratively refines its performance using pseudo-labeled data generated by its counterpart. This method enables effective utilization of unlabeled data, reducing dependence on expensive manual transcriptions while improving recognition performance.
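    The alternating pseudo-labeling loop can be sketched as follows; the models and `retrain` here are toy stand-ins (a "model" is any callable that labels an utterance), not the thesis's actual networks:

    ```python
    def mutual_learning(model_a, model_b, retrain, unlabeled, rounds=2):
        """Hypothetical MLSS loop: the two networks alternately label
        the unlabeled pool for each other, and each is retrained on the
        pseudo-labels produced by its counterpart."""
        for _ in range(rounds):
            pseudo_for_b = [(x, model_a(x)) for x in unlabeled]  # A labels for B
            model_b = retrain(model_b, pseudo_for_b)
            pseudo_for_a = [(x, model_b(x)) for x in unlabeled]  # B labels for A
            model_a = retrain(model_a, pseudo_for_a)
        return model_a, model_b

    # toy stand-ins: a "model" maps an utterance id to a label;
    # "retraining" is a placeholder that returns the model unchanged
    def make_model(tag):
        return lambda x: f"{tag}:{x}"

    def retrain(model, pseudo):
        return model  # a real implementation would update the weights here

    a, b = mutual_learning(make_model("A"), make_model("B"), retrain, ["utt1", "utt2"])
    ```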

    Additionally, we incorporate an external language model using shallow fusion during inference to further enhance recognition accuracy. Experiments on the SEAME Mandarin-English corpus confirm that the proposed methods outperform existing baselines and state-of-the-art models in terms of Mixed Error Rate (MER) and Character Error Rate (CER).
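    Shallow fusion itself is a simple log-linear combination applied when scoring hypotheses at inference time. A minimal sketch, with an illustrative LM weight `beta` (not a value reported in the thesis):

    ```python
    def shallow_fusion(asr_logp, lm_logp, beta=0.3):
        """Fused hypothesis score: ASR log-probability plus a weighted
        external language-model log-probability."""
        return asr_logp + beta * lm_logp

    # rescoring two toy hypotheses: the LM prefers the second one,
    # flipping the ranking the ASR score alone would give
    scores = {
        "hyp1": shallow_fusion(asr_logp=-1.0, lm_logp=-2.0),
        "hyp2": shallow_fusion(asr_logp=-1.2, lm_logp=-0.5),
    }
    best = max(scores, key=scores.get)
    ```

    In beam search this fused score is applied per decoding step, letting the external LM steer the beam toward more fluent code-switched word sequences.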
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Master's and Doctoral Theses

    Files in This Item:

    File         Description    Size    Format    Views
    index.html                  0Kb     HTML      26       View/Open


    All items in NCUIR are protected by copyright, with all rights reserved.

