Thesis/Dissertation 108522080: Full Metadata Record

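DC Field | Value | Language
DC.contributor | Department of Computer Science and Information Engineering | zh_TW
DC.creator | 伍家恩 | zh_TW
DC.creator | Chia-En Wu | en_US
dc.date.accessioned | 2021-7-26T07:39:07Z |
dc.date.available | 2021-7-26T07:39:07Z |
dc.date.issued | 2021 |
dc.identifier.uri | http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=108522080 |
dc.contributor.department | Department of Computer Science and Information Engineering | zh_TW
DC.description | 國立中央大學 | zh_TW
DC.description | National Central University | en_US
dc.description.abstract | As technology advances, speech recognition is being applied in more and more fields, such as voice input and smart assistants. In recent years, with the continued progress of deep learning, speech recognition models and related datasets have been released for many mainstream languages, such as English and Mainland-accented Mandarin. As a result, recognition accuracy for these mainstream languages is usually far higher than for less widely served varieties such as Taiwanese-accented Mandarin. Taiwanese-accented Mandarin differs from Mainland-accented Mandarin in many respects; only the sentence structure is relatively similar. Therefore, for a Mandarin speech recognition model developed for the Mainland accent to recognize Taiwanese-accented Mandarin correctly, a large Taiwanese-accented dataset must first be collected and used to retrain the model. In this thesis, we propose a collection system for Taiwanese-accented Mandarin speech datasets that automatically gathers Taiwanese-accented Mandarin audio files and the corresponding transcripts from YouTube videos. By relying on YouTube's CC subtitles, we greatly simplify the collection process and substantially speed up dataset collection. In addition, we design a series of pre-processing algorithms to resolve pronunciation issues in the transcript data, including removing unnecessary content (for example, redundant line breaks, spaces, punctuation marks, and foreign-language text) and identifying the correct Chinese pronunciation of Arabic numerals. We used this system to collect 30 hours of Taiwanese-accented Mandarin speech from YouTube in order to improve the accuracy of the Deep Speech speech recognition model. The final experimental results show that, as the amount of data used increases, the model's average character and word error rates decrease nonlinearly. | zh_TW
dc.description.abstract | Speech recognition is an enabling technology for many services, such as voice input and smart assistants. As deep learning techniques have developed, many speech recognition models and public corpora have been released for common languages, such as English and Chinese Mandarin. As a consequence, speech recognition accuracy for these common languages is usually much higher than for Taiwanese Mandarin. Although Taiwanese Mandarin differs from Chinese Mandarin in several ways, the two share a very similar sentence structure. Hence, models developed for Chinese Mandarin should work well for Taiwanese Mandarin, provided that a sufficiently large Taiwanese Mandarin corpus is available for training. In this thesis, we propose a corpus crawler that automatically collects Taiwanese Mandarin audio and the corresponding transcripts from YouTube videos. By using the closed-caption (CC) subtitles of YouTube videos, the design of the crawler is greatly simplified, which also makes collection faster. In addition, several pre-processing tasks are performed to resolve context-dependent pronunciation issues, including removing unnecessary content and identifying the correct pronunciation of Arabic numerals. The proposed crawler is used to collect a 30-hour Taiwanese Mandarin corpus, which is then used to train Deep Speech, a well-known speech recognition model, and improve its accuracy. The experimental results show that a linear increase in dataset size yields a better-than-linear decrease in the average character and word error rates. | en_US
DC.subject | Speech recognition | zh_TW
DC.subject | Taiwanese accent | zh_TW
DC.subject | Dataset processing | zh_TW
DC.subject | Common Voice | en_US
DC.subject | Deep Speech | en_US
DC.subject | Speech Recognition | en_US
DC.title | A Corpus Crawler for Taiwanese Mandarin Audio Transcription Using Deep Speech | en_US
dc.language.iso | en_US | en_US
DC.type | Master's/doctoral thesis | zh_TW
DC.type | thesis | en_US
DC.publisher | National Central University | en_US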
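
The abstract describes pre-processing the caption text by removing unnecessary content and choosing a Chinese reading for Arabic numerals. The following is a minimal Python sketch of that idea, not the thesis's actual algorithm. All function names are hypothetical, and the rules are assumptions made for illustration: a number directly followed by 年 (year) is read digit by digit, other numbers up to four digits are read as quantities, longer numbers fall back to digit-by-digit reading, and anything outside the basic CJK ideograph range is discarded.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # units for 10^0 .. 10^3


def read_digits(num: str) -> str:
    """Read each digit separately (assumed here for years and long numbers)."""
    return "".join(DIGITS[int(d)] for d in num)


def read_value(n: int) -> str:
    """Read an integer in 0..9999 as a quantity, e.g. 105 -> 一百零五."""
    if n == 0:
        return DIGITS[0]
    out = ""
    pending_zero = False
    for power in range(3, -1, -1):
        d = (n // 10 ** power) % 10
        if d == 0:
            # An interior gap is spoken as a single 零; leading zeros are not.
            pending_zero = bool(out)
            continue
        if pending_zero:
            out += DIGITS[0]
            pending_zero = False
        out += DIGITS[d] + UNITS[power]
    # 10..19 are read 十, 十一, ... rather than 一十, 一十一, ...
    if out.startswith("一十"):
        out = out[1:]
    return out


def normalize_caption(text: str) -> str:
    """Turn one raw caption into a transcript of Chinese characters only."""
    # Assumed rule: a number directly before 年 is a year, read digit by digit.
    text = re.sub(r"[0-9]+(?=年)", lambda m: read_digits(m.group()), text)

    # Assumed rule: remaining numbers of up to four digits are quantities.
    def quantity(m: re.Match) -> str:
        num = m.group()
        return read_value(int(num)) if len(num) <= 4 else read_digits(num)

    text = re.sub(r"[0-9]+", quantity, text)
    # Drop line breaks, spaces, punctuation, and foreign-language text by
    # keeping only characters in the basic CJK ideograph block.
    return re.sub(r"[^\u4e00-\u9fff]", "", text)


if __name__ == "__main__":
    print(normalize_caption("我在2021年\n買了 3 本book!"))
```

A real implementation would need more context rules than this sketch covers, for example decimals, percentages, phone numbers, and the use of 兩 instead of 二 before measure words.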
