Thesis title: Applying Masked Language Models to Code-Switching Speech Recognition
(Masked Language Model for Code-Switching Automatic Speech Recognition)
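Abstract (translated from Chinese) In recent years, using a language model to improve the output of an end-to-end speech recognition model has become the mainstream approach in monolingual speech recognition. Compared with language models for monolingual tasks, however, code-switching language models face two problems that stem from the unusual sentence structure of code-switched text: training data is extremely scarce, and conventional model architectures struggle to learn multilingual semantic information. To address these two problems, this thesis introduces a masked language model into a code-switching speech recognition system, aiming to produce more accurate results through general language knowledge and bidirectional context. The masked language model is first pre-trained on unlabeled data in a self-supervised manner to acquire general language knowledge, and is then adapted to the code-switching speech recognition domain. In addition, because the masked language model is trained with the full bidirectional context, it also substantially improves semantic understanding and model performance. We therefore use the masked language model to build a code-switching language model and to rescore the output sequences of the end-to-end speech recognition model, improving the performance of the overall system. We propose two ways of using the masked language model: replacing the conventional causal language model, and adding it on top of the standard speech recognition system. In experiments on the code-switching corpus SEAME, the two systems achieve relative mixed error rate reductions of up to 7% and 8.4%, respectively, compared with the standard architecture, which shows that the proposed method addresses the two problems above and improves code-switching speech recognition.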
Abstract (English) In recent years, using language models to improve the output of end-to-end speech recognition models has become the mainstream approach in monolingual speech recognition. For code-switching tasks, however, the particular sentence structure means that training data for the language model is extremely scarce and that traditional model architectures have difficulty learning multilingual semantic information. To solve these two problems, this thesis introduces a masked language model into the code-switching speech recognition system, with the goal of producing more accurate results through general language knowledge and bidirectional context information. The masked language model is pre-trained on unlabeled data in a self-supervised manner to obtain general language knowledge, and the model is then transferred to the code-switching speech recognition domain for adaptation. In addition, because training the masked language model uses the complete bidirectional context, it also greatly enhances semantic understanding and model performance. We therefore take advantage of the masked language model to build a code-switching language model and to rescore the output sequences of the end-to-end speech recognition model, improving the performance of the overall system. We propose two ways of using the masked language model: replacing the traditional causal language model with it, and adding it on top of the standard speech recognition system. Experiments on the code-switching corpus SEAME show that, compared with the standard architecture, the two systems achieve relative mixed error rate reductions of up to 7% and 8.4%, respectively, which demonstrates that the proposed method addresses the problems above and enhances the performance of the code-switching speech recognition system.
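The rescoring idea summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. A tiny count-based model that predicts a token from its left and right neighbours stands in for a BERT-style masked language model. The toy corpus, the hypothesis scores, the `lm_weight` parameter, and the linear interpolation formula are all assumptions made for illustration. The sketch computes a pseudo-log-likelihood (PLL) by masking each token in turn, derives a pseudo-perplexity from it, and reorders an N-best list by interpolating the ASR score with the PLL.

```python
import math
from collections import Counter

# Toy training text (hypothetical). A real system would use a pre-trained
# masked language model adapted to code-switching data instead.
CORPUS = [
    "我 想 要 order 一 杯 coffee".split(),
    "我 想 要 一 杯 茶".split(),
    "please order 一 杯 coffee".split(),
]
VOCAB = sorted({tok for sent in CORPUS for tok in sent})
BOS, EOS = "<s>", "</s>"

# Count how often each token appears between a given (left, right) pair.
CONTEXT_COUNTS = Counter()
TRIPLE_COUNTS = Counter()
for sent in CORPUS:
    padded = [BOS] + sent + [EOS]
    for i in range(1, len(padded) - 1):
        ctx = (padded[i - 1], padded[i + 1])
        CONTEXT_COUNTS[ctx] += 1
        TRIPLE_COUNTS[(ctx, padded[i])] += 1


def masked_logprob(tokens, i):
    """log P(tokens[i] | left and right neighbour), add-one smoothed."""
    padded = [BOS] + tokens + [EOS]
    ctx = (padded[i], padded[i + 2])  # neighbours of tokens[i]
    num = TRIPLE_COUNTS[(ctx, tokens[i])] + 1
    den = CONTEXT_COUNTS[ctx] + len(VOCAB) + 1  # +1 slot for unseen tokens
    return math.log(num / den)


def pseudo_log_likelihood(tokens):
    """Sum of the log-probabilities of each token when it alone is masked."""
    return sum(masked_logprob(tokens, i) for i in range(len(tokens)))


def pseudo_perplexity(tokens):
    """exp of the negative PLL averaged over the tokens of one sentence."""
    return math.exp(-pseudo_log_likelihood(tokens) / len(tokens))


def rescore(nbest, lm_weight=0.5):
    """Sort (tokens, asr_log_score) pairs by an interpolated score, best first.

    The linear interpolation below is an assumed form, not the thesis formula.
    """
    scored = [
        (tokens, (1 - lm_weight) * asr + lm_weight * pseudo_log_likelihood(tokens))
        for tokens, asr in nbest
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up N-best list: the ASR model slightly prefers the wrong last word.
HYP_GOOD = "我 想 要 order 一 杯 coffee".split()
HYP_BAD = "我 想 要 order 一 杯 copy".split()
nbest = [(HYP_BAD, -4.0), (HYP_GOOD, -4.1)]

if __name__ == "__main__":
    for tokens, score in rescore(nbest):
        print(" ".join(tokens), round(score, 3))
```

Setting `lm_weight=0.0` reproduces the original ASR ranking, so the weight controls how much the bidirectional context is allowed to override the acoustic model.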
Keywords (Chinese) ★ 語音識別 (speech recognition)
★ 語碼轉換 (code-switching)
★ 遮罩語言模型 (masked language model)
Keywords (English) ★ Speech Recognition
★ Code-Switching
★ Masked Language Model
Table of contents
Chinese Abstract
English Abstract
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background and Objectives
1-2 Research Methods
1-3 Chapter Overview
2. Related Work
2-1 Code-Switching Language Model
2-2 Transformer
2-2-1 Model Architecture
2-2-2 Attention
2-2-3 Multi-Head Attention
2-2-4 Applications of Attention
2-2-5 Positional Encoding (PE)
2-3 Bidirectional Encoder Representations from Transformers (BERT)
2-3-1 Input and Output Representations
2-3-2 Masked Language Model (MLM) Pre-training
2-3-3 Next Sentence Prediction (NSP) Pre-training
2-3-4 Fine-tuning
2-4 Language Model Integration
2-4-1 Shallow Fusion
2-4-2 Deep Fusion
2-4-3 Cold Fusion
2-4-4 N-best List Rescoring
2-5 Language Model Structure and Fusion Analysis
3. Masked Language Model for Code-Switching Automatic Speech Recognition
3-1 System Structure
3-1-1 End-to-End Speech Recognition Model
3-1-2 Masked Language Model
3-1-3 Causal Language Model
3-2 Masked Language Model Rescoring
3-2-1 Pseudo-log-likelihood (PLL) Estimate
3-2-2 Pseudo-perplexity (PPPL) Estimate
3-2-3 Score Interpolation
4. Experiments and Results
4-1 Experiment Setup
4-1-1 Dataset
4-1-2 Experiment Details
4-1-3 Evaluation Method
4-2 End-to-End Speech Recognition Implementation
4-2-1 Model Structure
4-2-2 Model Training
4-3 Masked Language Model Implementation
4-3-1 Model Structure
4-3-2 Model Training
4-4 Causal Language Model Implementation
4-4-1 Model Structure
4-4-2 Model Training
4-5 Results and Analysis
4-5-1 Results
4-5-2 Result Analysis
4-5-3 Pseudo-perplexity and Rescoring Analysis (table missing)
5. Conclusion and Future Work
References
Advisor: Jia-Ching Wang (王家慶) Approval date: 2022-9-23
