References
〔1〕Graves, A., Mohamed, A. R., & Hinton, G. (2013, May). Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 6645-6649). IEEE.
〔2〕Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
〔3〕Auer, P. (Ed.). (2013). Code-switching in conversation: Language, interaction and identity. Routledge.
〔4〕Knill, K. M., Gales, M. J., Rath, S. P., Woodland, P. C., Zhang, C., & Zhang, S. X. (2013, December). Investigation of multilingual deep neural networks for spoken term detection. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (pp. 138-143). IEEE.
〔5〕Grézl, F., Egorova, E., & Karafiát, M. (2016). Study of large data resources for multilingual training and system porting. Procedia Computer Science, 81, 15-22.
〔6〕Dalmia, S., Sanabria, R., Metze, F., & Black, A. W. (2018, April). Sequence-based multi-lingual low resource speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4909-4913). IEEE.
〔7〕Cho, J., Baskar, M. K., Li, R., Wiesner, M., Mallidi, S. H., Yalta, N., ... & Hori, T. (2018, December). Multilingual sequence-to-sequence speech recognition: Architecture, transfer learning, and language modeling. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 521-527). IEEE.
〔8〕Lyu, D. C., & Lyu, R. Y. (2008). Language identification on code-switching utterances using multiple cues. In Ninth Annual Conference of the International Speech Communication Association.
〔9〕Toshniwal, S., Sainath, T. N., Weiss, R. J., Li, B., Moreno, P., Weinstein, E., & Rao, K. (2018, April). Multilingual speech recognition with a single end-to-end model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4904-4908). IEEE.
〔10〕Zeng, Z., Khassanov, Y., Pham, V. T., Xu, H., Chng, E. S., & Li, H. (2018). On the end-to-end solution to Mandarin-English code-switching speech recognition. arXiv preprint arXiv:1811.00241.
〔11〕Lyu, D. C., Tan, T. P., Chng, E. S., & Li, H. (2010). SEAME: A Mandarin-English code-switching speech corpus in South-East Asia. In Eleventh Annual Conference of the International Speech Communication Association.
〔12〕Chan, W., Jaitly, N., Le, Q., & Vinyals, O. (2016, March). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4960-4964). IEEE.
〔13〕Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006, June). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (pp. 369-376).
〔14〕Graves, A. (2012). Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.
〔15〕Kim, S., Hori, T., & Watanabe, S. (2017, March). Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4835-4839). IEEE.
〔16〕Chorowski, J., Weiss, R. J., Bengio, S., & van den Oord, A. (2019). Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12), 2041-2053.
〔17〕Chung, Y. A., & Glass, J. (2018). Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. arXiv preprint arXiv:1803.08976.
〔18〕Chung, Y. A., Wu, C. C., Shen, C. H., Lee, H. Y., & Lee, L. S. (2016). Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder. arXiv preprint arXiv:1603.00982.
〔19〕Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.
〔20〕Baevski, A., Schneider, S., & Auli, M. (2019). vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453.
〔21〕Oord, A. V. D., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
〔22〕Chung, Y. A., Hsu, W. N., Tang, H., & Glass, J. (2019). An unsupervised autoregressive model for speech representation learning. arXiv preprint arXiv:1904.03240.
〔23〕Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).
〔24〕Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
〔25〕Graves, A., Jaitly, N., & Mohamed, A. R. (2013, December). Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (pp. 273-278). IEEE.
〔26〕Graves, A., & Jaitly, N. (2014, June). Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning (pp. 1764-1772). PMLR.
〔27〕Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
〔28〕Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 1097-1105.
〔29〕Wiseman, S., & Rush, A. M. (2016). Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960.
〔30〕Zhang, Q., Lu, H., Sak, H., Tripathi, A., McDermott, E., Koo, S., & Kumar, S. (2020, May). Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7829-7833). IEEE.
〔31〕Yeh, C. F., Mahadeokar, J., Kalgaonkar, K., Wang, Y., Le, D., Jain, M., ... & Seltzer, M. L. (2019). Transformer-transducer: End-to-end speech recognition with self-attention. arXiv preprint arXiv:1910.12977.
〔32〕Tripathi, A., Kim, J., Zhang, Q., Lu, H., & Sak, H. (2020). Transformer transducer: One model unifying streaming and non-streaming speech recognition. arXiv preprint arXiv:2010.03192.
〔33〕Huang, W., Hu, W., Yeung, Y. T., & Chen, X. (2020). Conv-transformer transducer: Low latency, low frame rate, streamable end-to-end speech recognition. arXiv preprint arXiv:2008.05750.
〔34〕Lyu, D. C., Lyu, R. Y., Chiang, Y. C., & Hsu, C. N. (2006, May). Speech recognition on code-switching among the Chinese dialects. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings (Vol. 1, pp. I-I). IEEE.
〔35〕Ardila, A. (2005). Spanglish: An anglicized Spanish dialect. Hispanic Journal of Behavioral Sciences, 27(1), 60-81.
〔36〕Lyudovyk, T., & Pylypenko, V. (2014). Code-switching speech recognition for closely related languages. In Spoken Language Technologies for Under-Resourced Languages.
〔37〕Chan, J. Y., Ching, P. C., Lee, T., & Meng, H. M. (2004, December). Detection of language boundary in code-switching utterances by bi-phone probabilities. In 2004 International Symposium on Chinese Spoken Language Processing (pp. 293-296). IEEE.
〔38〕Zissman, M. A. (1996). Comparison of four approaches to automatic language identification of telephone speech. IEEE Transactions on Speech and Audio Processing, 4(1), 31-44.
〔39〕Mabokela, K. R., Manamela, M. J., & Manaileng, M. (2014). Modeling code-switching speech on under-resourced languages for language identification. In Spoken Language Technologies for Under-Resourced Languages.
〔40〕Nakayama, S., Tjandra, A., Sakti, S., & Nakamura, S. (2018, December). Speech chain for semi-supervised learning of Japanese-English code-switching ASR and TTS. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 182-189). IEEE.
〔41〕Ullah, A., & Ahmed, T. (2020). Code switching language model using monolingual training data. arXiv preprint arXiv:2012.12543.
〔42〕Yılmaz, E., Heuvel, H. V. D., & van Leeuwen, D. A. (2018). Acoustic and textual data augmentation for improved ASR of code-switching speech. arXiv preprint arXiv:1807.10945.
〔43〕Zhang, S., Liu, Y., Lei, M., Ma, B., & Xie, L. (2019). Towards language-universal Mandarin-English speech recognition. In INTERSPEECH (pp. 2170-2174).
〔44〕Zhou, X., Yılmaz, E., Long, Y., Li, Y., & Li, H. (2020). Multi-encoder-decoder transformer for code-switching speech recognition. arXiv preprint arXiv:2006.10414.
〔45〕Dalmia, S., Liu, Y., Ronanki, S., & Kirchhoff, K. (2021, June). Transformer-transducers for code-switched speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5859-5863). IEEE.
〔46〕Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
〔47〕Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32.
〔48〕Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
〔49〕Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
〔50〕Liu, A. T., Yang, S. W., Chi, P. H., Hsu, P. C., & Lee, H. Y. (2020, May). Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6419-6423). IEEE.
〔51〕Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
〔52〕Jang, E., Gu, S., & Poole, B. (2016). Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.
〔53〕Jiang, D., Lei, X., Li, W., Luo, N., Hu, Y., Zou, W., & Li, X. (2019). Improving transformer-based speech recognition using unsupervised pre-training. arXiv preprint arXiv:1910.09932.
〔54〕Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929-1958.
〔55〕Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
〔56〕Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., ... & Liu, T. (2020, November). On layer normalization in the transformer architecture. In International Conference on Machine Learning (pp. 10524-10533). PMLR.
〔57〕Park, D. S., Chan, W., Zhang, Y., Chiu, C. C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.
〔58〕Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P. E., ... & Dupoux, E. (2020, May). Libri-light: A benchmark for ASR with limited or no supervision. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7669-7673). IEEE.
〔59〕Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., ... & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 8026-8037.
〔60〕Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., ... & Auli, M. (2019). fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038.
〔61〕He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision (pp. 1026-1034).
〔62〕Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
〔63〕Zhu, X., & Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1), 1-130.
〔64〕Zhu, X. J. (2005). Semi-supervised learning literature survey. Computer Sciences Technical Report 1530, University of Wisconsin-Madison.
〔65〕Zhang, Y., Qin, J., Park, D. S., Han, W., Chiu, C. C., Pang, R., ... & Wu, Y. (2020). Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504.
〔66〕Xu, Q., Baevski, A., Likhomanenko, T., Tomasello, P., Conneau, A., Collobert, R., ... & Auli, M. (2021, June). Self-training and pre-training are complementary for speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3030-3034). IEEE.