dc.description.abstract | Speech recognition is an enabling technology for many services, such as voice input and smart assistants. As deep learning techniques have advanced, many speech recognition models and public corpus datasets have been released for widely spoken languages, such as English and Chinese Mandarin. As a consequence, the accuracy of speech recognition for these languages is usually much higher than that for Taiwanese Mandarin. While Taiwanese Mandarin differs from Chinese Mandarin in several ways, the two share a very similar sentence structure. Hence, models developed for Chinese Mandarin should also work well for Taiwanese Mandarin, provided that a sufficiently large Taiwanese Mandarin corpus is available. In this thesis, we propose a corpus crawler that automatically collects Taiwanese Mandarin audio and transcript data from YouTube videos. By exploiting the closed-caption subtitles of YouTube videos, the design of the crawler is greatly simplified, which also improves its speed. In addition, several pre-processing tasks are performed to resolve the issue of context-dependent pronunciation, including the removal of unnecessary content and the identification of the correct pronunciation of Arabic numerals. The proposed crawler is used to collect a 30-hour Taiwanese Mandarin corpus, which is then used to train and improve Deep Speech, a well-known speech recognition model. The experimental results show that a linear increase in dataset size yields a better-than-linear decrease in the average character and word error rates. | en_US |