References
[1] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[2] Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006, June). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning (pp. 369-376).
[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
[4] Mohamed, A., Okhonko, D., & Zettlemoyer, L. (2019). Transformers with convolutional context for ASR. arXiv preprint arXiv:1904.11660.
[5] Dong, L., Xu, S., & Xu, B. (2018, April). Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5884-5888). IEEE.
[6] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
[7] Kim, S., Hori, T., & Watanabe, S. (2017, March). Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4835-4839). IEEE.
[8] Watanabe, S., Hori, T., Kim, S., Hershey, J. R., & Hayashi, T. (2017). Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1240-1253.
[9] Karita, S., Soplin, N. E. Y., Watanabe, S., Delcroix, M., Ogawa, A., & Nakatani, T. (2019). Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. In Proc. Interspeech 2019.
[10] Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.
[11] Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., ... & Renduchintala, A. (2018). ESPnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015.
[12] Park, D. S., Chan, W., Zhang, Y., Chiu, C. C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.
[13] Ko, T., Peddinti, V., Povey, D., & Khudanpur, S. (2015). Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association.
[14] Adams, O., Wiesner, M., Watanabe, S., & Yarowsky, D. (2019). Massively multilingual adversarial speech recognition. arXiv preprint arXiv:1904.02210.
[15] Stoian, M. C., Bansal, S., & Goldwater, S. (2020, May). Analyzing ASR pretraining for low-resource speech-to-text translation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7909-7913). IEEE.
[16] Bansal, S., Kamper, H., Livescu, K., Lopez, A., & Goldwater, S. (2018). Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. arXiv preprint arXiv:1809.01431.
[17] Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., ... & Auli, M. (2019). fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038.
[18] Bu, H., Du, J., Na, X., Wu, B., & Zheng, H. (2017, November). AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA) (pp. 1-5). IEEE.
[19] Du, J., Na, X., Liu, X., & Bu, H. (2018). AISHELL-2: Transforming Mandarin ASR research into industrial scale. arXiv preprint arXiv:1808.10583.
[20] Beijing DataTang Technology Co., Ltd. aidatatang_200zh, a free Chinese Mandarin speech corpus.
[21] Magic Data Technology Co., Ltd. (2019). MAGICDATA Mandarin Chinese Read Speech Corpus. http://www.imagicdatatech.com/index.php/home/dataopensource/data_info/id/101
[22] Primewords Information Technology Co., Ltd. (2018). Primewords Chinese Corpus Set 1. https://www.primewords.cn
[23] Surfingtech. ST-CMDS-20170001_1, Free ST Chinese Mandarin Corpus.
[24] Wang, D., & Zhang, X. (2015). THCHS-30: A free Chinese speech corpus. arXiv preprint arXiv:1512.01882.
[25] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
[26] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818-2826).
[27] Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015, April). Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5206-5210). IEEE.
[28] Zeghidour, N., Xu, Q., Liptchinsky, V., Usunier, N., Synnaeve, G., & Collobert, R. (2018). Fully convolutional speech recognition. arXiv preprint arXiv:1812.06864.
[29] Hannun, A., Lee, A., Xu, Q., & Collobert, R. (2019). Sequence-to-sequence speech recognition with time-depth separable convolutions. arXiv preprint arXiv:1904.02619.
[30] Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. (2015). Listen, attend and spell. arXiv preprint arXiv:1508.01211.
[31] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[32] Graves, A. (2012). Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.
[33] Chiu, C. C., Sainath, T. N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., ... & Jaitly, N. (2018, April). State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4774-4778). IEEE.
[34] Zeghidour, N., Usunier, N., Synnaeve, G., Collobert, R., & Dupoux, E. (2018). End-to-end speech recognition from the raw waveform. arXiv preprint arXiv:1806.07098.
[35] Chen, Y., Wang, W., & Wang, C. (2020). Semi-supervised ASR by End-to-end Self-training. arXiv preprint arXiv:2001.09128.
[36] Baskar, M. K., Watanabe, S., Astudillo, R., Hori, T., Burget, L., & Černocký, J. (2019). Self-supervised Sequence-to-sequence ASR using Unpaired Speech and Text. arXiv preprint arXiv:1905.01152.
[37] Peddinti, V., Povey, D., & Khudanpur, S. (2015). A time delay neural network architecture for efficient modeling of long temporal contexts. In Sixteenth Annual Conference of the International Speech Communication Association.
[38] Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
[39] Oord, A. V. D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., ... & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
[40] Mohamed, A. R., Dahl, G. E., & Hinton, G. (2011). Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 14-22.
[41] Sak, H., Senior, A., & Beaufays, F. (2014). Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128.
[42] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., ... & Silovsky, J. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.
[43] Povey, D., Peddinti, V., Galvez, D., Ghahremani, P., Manohar, V., Na, X., ... & Khudanpur, S. (2016, September). Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Interspeech (pp. 2751-2755).
[44] Sriram, A., Jun, H., Satheesh, S., & Coates, A. (2017). Cold fusion: Training seq2seq models together with language models. arXiv preprint arXiv:1708.06426.