A Corpus Crawler for Taiwanese Mandarin Audio Transcription Using Deep Speech

Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/86531

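Title:	A Corpus Crawler for Taiwanese Mandarin Audio Transcription Using Deep Speech
Author:	Wu, Chia-En (伍家恩)
Contributor:	Department of Computer Science and Information Engineering
Keywords:	Speech Recognition; Taiwanese Accent; Dataset Processing; Common Voice; Deep Speech
Date:	2021-07-26
Upload time:	2021-12-07 12:56:47 (UTC+8)
Publisher:	National Central University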
Abstract:	Speech recognition is an enabling technology for many services, such as voice input and smart assistants. As deep learning has advanced, speech recognition models and public corpus datasets have been released for many mainstream languages, such as English and Chinese Mandarin (Mandarin as spoken in China). As a consequence, speech recognition accuracy for these languages is usually much higher than for less widely covered varieties such as Taiwanese Mandarin. Taiwanese Mandarin differs from Chinese Mandarin in several ways, but the two share a very similar sentence structure. Hence, a model developed for Chinese Mandarin should also work well for Taiwanese Mandarin, provided it is retrained on an adequately large Taiwanese Mandarin corpus. In this thesis, we propose a corpus crawler that automatically collects Taiwanese Mandarin audio and the corresponding transcripts from YouTube videos. By using the closed captioning (CC) subtitles of YouTube videos, the design of the crawler is greatly simplified, which helps to improve the speed of data collection. In addition, several pre-processing steps resolve context-dependent pronunciation issues in the transcripts, including the removal of unnecessary content (for example, redundant line breaks, spaces, punctuation marks, and foreign-language text) and the identification of the correct Chinese pronunciation of Arabic numerals.
The proposed crawler was used to collect 30 hours of Taiwanese Mandarin speech data from YouTube, which were then used to train Deep Speech, a well-known speech recognition model, and improve its accuracy. The experimental results show that a linear increase in the dataset size yields a better-than-linear decrease in the average character and word error rates.
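
The transcript pre-processing described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names `read_quantity` and `normalize`, the rule that a number followed by 年 (or longer than four digits) is read digit by digit, and the choice to keep only CJK ideographs are all assumptions made for illustration.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # units for 10^0 .. 10^3


def read_quantity(n: int) -> str:
    """Read an integer in 0..9999 as a quantity, e.g. 30 -> 三十."""
    if n == 0:
        return "零"
    parts = []
    pending_zero = False
    for power in (3, 2, 1, 0):
        digit = n // 10 ** power % 10
        if digit == 0:
            # Remember an interior gap only after a nonzero digit was emitted.
            pending_zero = bool(parts)
            continue
        if pending_zero:
            parts.append("零")
            pending_zero = False
        parts.append(DIGITS[digit] + UNITS[power])
    text = "".join(parts)
    # 10..19 are read 十, 十一, ... rather than 一十, 一十一, ...
    return text[1:] if text.startswith("一十") else text


def normalize(text: str) -> str:
    """Spell out Arabic numerals, then drop everything except CJK ideographs."""

    def spell(match: re.Match) -> str:
        digits = match.group(0)
        following = text[match.end():match.end() + 1]
        # Assumed heuristic: years and long digit strings are read digit by digit.
        if following == "年" or len(digits) > 4:
            return "".join(DIGITS[int(d)] for d in digits)
        return read_quantity(int(digits))

    spelled = re.sub(r"[0-9]+", spell, text)
    # Removes line breaks, spaces, punctuation, and foreign-language text.
    return re.sub(r"[^\u4e00-\u9fff]", "", spelled)


if __name__ == "__main__":
    print(normalize("我們收集了30小時的資料，\nHello!"))
    print(normalize("2021年 7月26日"))
```

The numeral step runs before the character filter so that the character following each number is still available as context. A real pipeline would need more rules than this, for instance for decimals, percentages, and phone numbers.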
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Doctoral and Master's Theses

All items in NCUIR are protected by original copyright.
