Self-Supervised Learning Approach For Low-Resource Code-Switching Automatic Speech Recognition

Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/86715

Title:	Self-Supervised Learning Approach For Low-Resource Code-Switching Automatic Speech Recognition
Authors:	傅立燁;Fu, Li-Yeh
Contributors:	Department of Computer Science and Information Engineering
Keywords:	Self-Supervised Learning;Code-Switching;Automatic Speech Recognition
Date:	2021-08-19
Issue Date:	2021-12-07 13:09:19 (UTC+8)
Publisher:	National Central University
Abstract:	Deep learning has risen in recent years as the training data available to neural networks has grown substantially and data has become easier to collect, and end-to-end speech recognition based on deep learning has gradually replaced traditional methods as the mainstream approach. With the spread of intelligent voice assistants and related devices, demand for code-switching speech recognition systems is also growing, and research on the topic is receiving more attention. Code-switching is the use of more than one language within a single conversation, for example Chinese and English. It is common in multilingual countries: among Chinese-speaking regions, speakers in Singapore and Malaysia often mix Chinese and English, speakers in Hong Kong mix Chinese, English, and Cantonese, and speakers in Taiwan mix Chinese, Taiwanese, and English. The challenge is that a language switch can occur at any moment, so both the acoustic model and the language model must learn to model switching behavior, yet code-switching corpora are scarce, which makes model training difficult. Current mainstream methods fuse multiple acoustic models, each modeling the features of a different language, and use multi-task learning with an auxiliary language identification task to help the model learn switching behavior. Even so, code-switching recognition that is robust and stable enough for real-world use remains a challenge.
To address these problems, we propose applying self-supervised learning to the code-switching task. Self-supervised learning has been highly successful in fields such as natural language processing and computer vision. Its core idea is to pre-train a model on a large amount of unlabeled data and then transfer the model to a specific downstream task for fine-tuning, which greatly reduces the amount of labeled data needed for training and improves model performance. We therefore treat code-switching as a low-resource speech recognition task: we pre-train an acoustic encoder on a large unlabeled corpus with self-supervised learning and then fine-tune it on code-switching data. On the code-switching benchmark corpus SEAME, this approach achieves mixed error rates of 16.4% and 23.3%, the best results known to us so far. Our method greatly simplifies the network architectures of prior work, which relied on pre-training multiple monolingual encoders and decoders on monolingual corpora, and by avoiding a multi-task language identification loss it also removes the labor cost of the additional annotation. The proposed method is general and can be extended to code-switching speech recognition in other languages.
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	108
