A Corpus Crawler for Taiwanese Mandarin Audio Transcription Using Deep Speech

Name

Chia-En Wu (伍家恩)

Department

Department of Computer Science and Information Engineering

Thesis Title

A Corpus Crawler for Taiwanese Mandarin Audio Transcription Using Deep Speech

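This electronic thesis is authorized for immediate open access.
The open-access full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid violating the law.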

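Abstract (Chinese)

As technology advances, speech recognition is being applied in a growing number of fields, such as voice input and smart assistants. In recent years, with the continued development of deep learning, speech recognition models and related datasets have been released for many mainstream languages, such as English and Chinese-accented Mandarin. As a result, recognition accuracy for these mainstream languages is usually far higher than for less widely spoken varieties, such as Taiwanese-accented Mandarin. Taiwanese-accented Mandarin differs from Chinese-accented Mandarin in many respects, with sentence structure being the main point of similarity. Therefore, for a Mandarin speech recognition model developed for the Chinese accent to recognize Taiwanese-accented Mandarin correctly, a large Taiwanese-accented dataset must first be collected and the model retrained on it.
In this thesis, we therefore propose a collection system for Taiwanese-accented Mandarin speech datasets that automatically gathers audio files and their corresponding transcripts from YouTube videos. By relying on YouTube closed-caption (CC) subtitles, we greatly simplify the collection process and substantially increase the speed of collecting speech data. We also design a series of preprocessing algorithms to resolve pronunciation-related issues in the transcripts, including removing unnecessary content (for example, redundant line breaks, spaces, punctuation marks, and foreign-language text) and identifying the correct Mandarin pronunciation of Arabic numerals. Using this system, we collected 30 hours of Taiwanese-accented Mandarin speech from YouTube to improve the accuracy of the Deep Speech recognition model. The experimental results show that, as the amount of training data increases, the model's average word error rate decreases steadily in a nonlinear fashion.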
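
The two preprocessing steps named in the abstract (removing unnecessary content and reading Arabic numerals aloud) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the choice to keep only CJK ideographs and ASCII digits, and the limit of four digits are all assumptions made for illustration. A numeral can be read digit by digit (as in a phone number) or as a quantity; choosing between the two depends on context, which this sketch leaves to the caller.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # ones, tens, hundreds, thousands


def clean_caption(text: str) -> str:
    """Hypothetical cleanup: keep only CJK ideographs and ASCII digits.

    Line breaks, spaces, punctuation, and Latin-script words are dropped.
    """
    return re.sub(r"[^\u4e00-\u9fff0-9]", "", text)


def read_digit_by_digit(number: str) -> str:
    """Read each digit separately, e.g. for years or phone numbers."""
    return "".join(DIGITS[int(d)] for d in number)


def read_as_quantity(number: str) -> str:
    """Read a number below 10000 as a quantity (assumed limit for brevity)."""
    n = int(number)
    if n >= 10000:
        raise ValueError("this sketch only handles numbers below 10000")
    if n == 0:
        return DIGITS[0]
    digits = str(n)
    parts = []
    pending_zero = False
    for i, ch in enumerate(digits):
        d = int(ch)
        unit = UNITS[len(digits) - 1 - i]
        if d == 0:
            # An interior zero is spoken once, and only if a digit follows.
            pending_zero = bool(parts)
        else:
            if pending_zero:
                parts.append(DIGITS[0])
                pending_zero = False
            parts.append(DIGITS[d] + unit)
    result = "".join(parts)
    # 10 to 19 are read as 十, 十一, ... rather than 一十, 一十一, ...
    if result.startswith("一十"):
        result = result[1:]
    return result


if __name__ == "__main__":
    print(clean_caption("今天,  有 30 個人 hello!\n"))
    print(read_digit_by_digit("2021"))
    print(read_as_quantity("105"))
```

Keeping the two readings as separate functions leaves room for whatever context-dependent selection step sits in front of them.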

Abstract (English)

Speech recognition is an enabling technology for many services, such as voice input and smart assistants. As deep learning techniques have matured, many speech recognition models and public corpora have been released for common languages, such as English and Chinese Mandarin. As a consequence, speech recognition accuracy for these common languages is usually much higher than for Taiwanese Mandarin. Although Taiwanese Mandarin differs from Chinese Mandarin in several ways, the two share a very similar sentence structure. Hence, models developed for Chinese Mandarin should work well for Taiwanese Mandarin so long as the Taiwanese Mandarin corpus is adequately large. In this thesis, we propose a corpus crawler that automatically collects Taiwanese Mandarin audio and transcripts from YouTube videos. By using the closed-caption subtitles of YouTube videos, the design of the crawler is greatly simplified, which in turn improves its speed. In addition, several pre-processing tasks are performed to resolve the issue of context-dependent pronunciation, including removal of unnecessary content and identification of the correct pronunciation of Arabic numerals. The proposed crawler is used to collect 30 hours of Taiwanese Mandarin speech, which is then used to train and improve Deep Speech, a well-known speech recognition model. The experimental results show that a linear increase in dataset size yields a better-than-linear decrease in the average character and word error rates.
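
The character and word error rates mentioned above are conventionally defined as the edit distance between a reference transcript and a recognized hypothesis, divided by the reference length. The following Python sketch shows that standard definition; it is an illustration, not the evaluation code used in the thesis. The function names are hypothetical, and the word-level variant assumes the text has already been split into whitespace-separated words, which for Mandarin requires a separate segmentation step.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: tokens are individual characters."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: tokens are whitespace-separated words (assumption)."""
    return error_rate(reference.split(), hypothesis.split())


if __name__ == "__main__":
    print(cer("今天天氣很好", "今天天器很好"))
    print(wer("the cat sat", "the cat sat down"))
```

Because the denominator is the reference length, either rate can exceed 1.0 when the hypothesis contains many insertions.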

Keywords (Chinese)

★ Speech recognition
★ Taiwanese accent
★ Dataset processing

Keywords (English)

★ Common Voice
★ Deep Speech
★ Speech Recognition

Table of Contents

1 Introduction
2 Related Work
2.1 Public/Private Organized Speech Corpus
2.2 Customized Speech Corpus
3 Preliminary
3.1 Bidirectional Encoder Representations from Transformers (BERT)
3.2 Common Voice
3.3 Mozilla Deep Speech
3.4 FFmpeg
3.5 YouTube API
4 Design
4.1 Data Collection
4.2 Data Preprocessing
4.2.1 Removal of Unnecessary Content
4.2.2 Conversion of Arabic Numerals
4.2.3 Extraction of Voice and Text Data
4.2.4 Preparation of Common Voice Format and Dictionary
4.3 Model Training
5 Performance
5.1 Experimental Environment Configuration
5.2 Dataset Description
5.2.1 The Dataset for BERT Retraining
5.2.2 The Datasets for Deep Speech Model Training
5.3 Performance Metrics
5.3.1 Turnaround Time
5.3.2 Confusion Matrix
5.3.3 Word Error Rate and Character Error Rate
5.4 Experimental Results and Analysis
6 Conclusion
References

Advisor

Min-Te Sun (孫敏德)

Approval Date

2021-7-26