References
[1] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989, doi: 10.1109/5.18626.
[2] Y. Li and P. Fung, “Code-Switch Language Model with Inversion Constraints for Mixed Language Speech Recognition,” in Proceedings of COLING 2012, Mumbai, India, Dec. 2012, pp. 1671–1680. Accessed: Apr. 21, 2022. [Online]. Available: https://aclanthology.org/C12-1102
[3] Y. Li and P. Fung, “Language Modeling with Functional Head Constraint for Code Switching Speech Recognition,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, Oct. 2014, pp. 907–916. doi: 10.3115/v1/D14-1098.
[4] H. Adel, N. T. Vu, F. Kraus, T. Schlippe, H. Li, and T. Schultz, “Recurrent neural network language modeling for code switching conversational speech,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 8411–8415. doi: 10.1109/ICASSP.2013.6639306.
[5] H. Adel, N. T. Vu, and T. Schultz, “Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Sofia, Bulgaria, Aug. 2013, pp. 206–211. Accessed: Apr. 21, 2022. [Online]. Available: https://aclanthology.org/P13-2037
[6] G. I. Winata, A. Madotto, C.-S. Wu, and P. Fung, “Code-Switching Language Modeling using Syntax-Aware Multi-Task Learning,” in Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Melbourne, Australia, Jul. 2018, pp. 62–67. doi: 10.18653/v1/W18-3207.
[7] M. Choudhury, K. Bali, S. Sitaram, and A. Baheti, “Curriculum Design for Code-switching: Experiments with Language Identification and Language Modeling with Deep Neural Networks,” in Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017), Kolkata, India, Dec. 2017, pp. 65–74. Accessed: Apr. 21, 2022. [Online]. Available: https://aclanthology.org/W17-7509
[8] A. Pratapa, G. Bhat, M. Choudhury, S. Sitaram, S. Dandapat, and K. Bali, “Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, Jul. 2018, pp. 1543–1553. doi: 10.18653/v1/P18-1143.
[9] S. Garg, T. Parekh, and P. Jyothi, “Code-switched Language Models Using Dual RNNs and Same-Source Pretraining,” arXiv preprint arXiv:1809.01962, Sep. 2018. Accessed: Apr. 21, 2022. [Online]. Available: http://arxiv.org/abs/1809.01962
[10] G. I. Winata, A. Madotto, C.-S. Wu, and P. Fung, “Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences,” arXiv preprint arXiv:1909.08582, Sep. 2019. Accessed: Apr. 21, 2022. [Online]. Available: http://arxiv.org/abs/1909.08582
[11] A. Vaswani et al., “Attention Is All You Need,” arXiv preprint arXiv:1706.03762, Dec. 2017. Accessed: Apr. 22, 2022. [Online]. Available: http://arxiv.org/abs/1706.03762
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv preprint arXiv:1810.04805, May 2019. Accessed: Apr. 22, 2022. [Online]. Available: http://arxiv.org/abs/1810.04805
[13] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” arXiv preprint arXiv:1712.01996, Dec. 2017. Accessed: Apr. 21, 2022. [Online]. Available: http://arxiv.org/abs/1712.01996
[14] C. Gulcehre et al., “On Using Monolingual Corpora in Neural Machine Translation,” arXiv preprint arXiv:1503.03535, Jun. 2015. Accessed: Apr. 21, 2022. [Online]. Available: http://arxiv.org/abs/1503.03535
[15] A. Sriram, H. Jun, S. Satheesh, and A. Coates, “Cold Fusion: Training Seq2Seq Models Together with Language Models,” arXiv preprint arXiv:1708.06426, Aug. 2017. Accessed: Apr. 21, 2022. [Online]. Available: http://arxiv.org/abs/1708.06426
[16] J. L. Elman, “Learning and development in neural networks: the importance of starting small,” Cognition, vol. 48, no. 1, pp. 71–99, Jul. 1993, doi: 10.1016/0010-0277(93)90058-4.
[17] L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient,” arXiv preprint arXiv:1609.05473, Aug. 2017. Accessed: Apr. 21, 2022. [Online]. Available: http://arxiv.org/abs/1609.05473
[18] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997, doi: 10.1162/neco.1997.9.8.1735.
[19] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling,” arXiv preprint arXiv:1412.3555, Dec. 2014. Accessed: Apr. 22, 2022. [Online]. Available: http://arxiv.org/abs/1412.3555
[20] Y. Wu et al., “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation,” arXiv preprint arXiv:1609.08144, Oct. 2016. Accessed: Apr. 24, 2022. [Online]. Available: http://arxiv.org/abs/1609.08144
[21] M.-T. Luong, H. Pham, and C. D. Manning, “Effective Approaches to Attention-based Neural Machine Translation,” arXiv preprint arXiv:1508.04025, Sep. 2015. Accessed: Apr. 24, 2022. [Online]. Available: http://arxiv.org/abs/1508.04025
[22] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu, “Exploring the Limits of Language Modeling,” arXiv preprint arXiv:1602.02410, Feb. 2016. Accessed: Apr. 24, 2022. [Online]. Available: http://arxiv.org/abs/1602.02410
[23] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” arXiv preprint arXiv:1409.0473, May 2016. Accessed: Apr. 24, 2022. [Online]. Available: http://arxiv.org/abs/1409.0473
[24] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional Sequence to Sequence Learning,” arXiv preprint arXiv:1705.03122, Jul. 2017. Accessed: Apr. 24, 2022. [Online]. Available: http://arxiv.org/abs/1705.03122
[25] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv preprint arXiv:1301.3781, Sep. 2013. Accessed: Apr. 25, 2022. [Online]. Available: http://arxiv.org/abs/1301.3781
[26] M. E. Peters, W. Ammar, C. Bhagavatula, and R. Power, “Semi-supervised sequence tagging with bidirectional language models,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, Jul. 2017, pp. 1756–1765. doi: 10.18653/v1/P17-1161.
[27] M. E. Peters et al., “Deep contextualized word representations,” arXiv preprint arXiv:1802.05365, Mar. 2018. Accessed: Apr. 26, 2022. [Online]. Available: http://arxiv.org/abs/1802.05365
[28] O. Melamud, J. Goldberger, and I. Dagan, “context2vec: Learning Generic Context Embedding with Bidirectional LSTM,” in Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, Aug. 2016, pp. 51–61. doi: 10.18653/v1/K16-1006.
[29] A. M. Dai and Q. V. Le, “Semi-supervised Sequence Learning,” arXiv preprint arXiv:1511.01432, Nov. 2015. Accessed: Apr. 26, 2022. [Online]. Available: http://arxiv.org/abs/1511.01432
[30] J. Howard and S. Ruder, “Universal Language Model Fine-tuning for Text Classification,” arXiv preprint arXiv:1801.06146, May 2018. Accessed: Apr. 26, 2022. [Online]. Available: http://arxiv.org/abs/1801.06146
[31] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving Language Understanding by Generative Pre-Training,” OpenAI, Tech. Rep., 2018.
[32] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, Attend and Spell,” arXiv preprint arXiv:1508.01211, Aug. 2015. Accessed: Apr. 22, 2022. [Online]. Available: http://arxiv.org/abs/1508.01211
[33] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-End Attention-based Large Vocabulary Speech Recognition,” arXiv preprint arXiv:1508.04395, Mar. 2016. Accessed: Apr. 22, 2022. [Online]. Available: http://arxiv.org/abs/1508.04395
[34] A. Graves, “Sequence Transduction with Recurrent Neural Networks,” arXiv preprint arXiv:1211.3711, Nov. 2012. Accessed: Apr. 22, 2022. [Online]. Available: http://arxiv.org/abs/1211.3711
[35] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, Jun. 2006, pp. 369–376.
[36] Z. Yang, B. Dhingra, Y. Yuan, J. Hu, W. W. Cohen, and R. Salakhutdinov, “Words or Characters? Fine-grained Gating for Reading Comprehension,” arXiv preprint arXiv:1611.01724, Sep. 2017. Accessed: Apr. 22, 2022. [Online]. Available: http://arxiv.org/abs/1611.01724
[37] R. Iyer, M. Ostendorf, and H. Gish, “Using out-of-domain data to improve in-domain language models,” IEEE Signal Process. Lett., vol. 4, no. 8, pp. 221–223, Aug. 1997, doi: 10.1109/97.611282.
[38] S. R. Gangireddy, P. Swietojanski, P. Bell, and S. Renals, “Unsupervised Adaptation of Recurrent Neural Network Language Models,” in Interspeech 2016, Sep. 2016, pp. 2333–2337. doi: 10.21437/Interspeech.2016-1342.
[39] M. Ma, M. Nirschl, F. Biadsy, and S. Kumar, “Approaches for Neural-Network Language Model Adaptation,” in Interspeech 2017, Aug. 2017, pp. 259–263. doi: 10.21437/Interspeech.2017-1310.
[40] J. Salazar, D. Liang, T. Q. Nguyen, and K. Kirchhoff, “Masked Language Model Scoring,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Jul. 2020, pp. 2699–2712. doi: 10.18653/v1/2020.acl-main.240.
[41] J. Shin, Y. Lee, and K. Jung, “Effective Sentence Scoring Method using Bidirectional Language Model for Speech Recognition,” arXiv preprint arXiv:1905.06655, May 2019. Accessed: Apr. 23, 2022. [Online]. Available: http://arxiv.org/abs/1905.06655
[42] S.-H. Chiu and B. Chen, “Innovative Bert-based Reranking Language Models for Speech Recognition,” in 2021 IEEE Spoken Language Technology Workshop (SLT), Jan. 2021, pp. 266–271. doi: 10.1109/SLT48900.2021.9383557.
[43] K. Li et al., “An Empirical Study of Transformer-Based Neural Language Model Adaptation,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 7934–7938. doi: 10.1109/ICASSP40776.2020.9053399.
[44] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A Neural Probabilistic Language Model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155, Feb. 2003.
[45] H. Schwenk, “Continuous space language models,” Comput. Speech Lang., vol. 21, no. 3, Jul. 2007.
[46] J. Park, X. Liu, M. J. F. Gales, and P. Woodland, “Improved neural network based language modelling and adaptation,” in Interspeech 2010, Sep. 2010, pp. 1041–1044. doi: 10.21437/Interspeech.2010-342.
[47] H.-S. Le, I. Oparin, A. Allauzen, J.-L. Gauvain, and F. Yvon, “Structured Output Layer Neural Network Language Models for Speech Recognition,” IEEE Trans. Audio Speech Lang. Process., vol. 21, no. 1, pp. 197–206, Jan. 2013, doi: 10.1109/TASL.2012.2215599.
[48] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in Interspeech 2010, Makuhari, Japan, Sep. 2010, pp. 1045–1048.
[49] T. Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur, “Extensions of recurrent neural network language model,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2011, pp. 5528–5531. doi: 10.1109/ICASSP.2011.5947611.
[50] M. Sundermeyer, R. Schlüter, and H. Ney, “LSTM Neural Networks for Language Modeling,” in Interspeech 2012, Portland, OR, USA, Sep. 2012, pp. 194–197.
[51] S. Merity, N. S. Keskar, and R. Socher, “Regularizing and Optimizing LSTM Language Models,” arXiv preprint arXiv:1708.02182, Aug. 2017. Accessed: Apr. 23, 2022. [Online]. Available: http://arxiv.org/abs/1708.02182
[52] M. Sundermeyer, I. Oparin, J.-L. Gauvain, B. Freiberg, R. Schlüter, and H. Ney, “Comparison of feedforward and recurrent neural network language models,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 8430–8434. doi: 10.1109/ICASSP.2013.6639310.
[53] A. Graves, G. Wayne, and I. Danihelka, “Neural Turing Machines,” arXiv preprint arXiv:1410.5401, Dec. 2014. Accessed: Apr. 23, 2022. [Online]. Available: http://arxiv.org/abs/1410.5401
[54] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” arXiv preprint arXiv:1409.0473, May 2016. Accessed: Apr. 23, 2022. [Online]. Available: http://arxiv.org/abs/1409.0473
[55] K. Irie, A. Zeyer, R. Schlüter, and H. Ney, “Language Modeling with Deep Transformers,” in Interspeech 2019, Sep. 2019, pp. 3905–3909. doi: 10.21437/Interspeech.2019-2225.
[56] Y. Shi, M. Larson, and C. Jonker, “Exploiting the succeeding words in recurrent neural network language models,” in Interspeech 2013, Aug. 2013, pp. 632–636. doi: 10.21437/Interspeech.2013-183.
[57] E. Arisoy, A. Sethy, B. Ramabhadran, and S. Chen, “Bidirectional recurrent neural network language models for automatic speech recognition,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 5421–5425. doi: 10.1109/ICASSP.2015.7179007.
[58] X. Chen, A. Ragni, X. Liu, and M. J. F. Gales, “Investigating Bidirectional Recurrent Neural Network Language Models for Speech Recognition,” in Interspeech 2017, Aug. 2017, pp. 269–273. doi: 10.21437/Interspeech.2017-513.
[59] A. Wang and K. Cho, “BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model,” arXiv preprint arXiv:1902.04094, Apr. 2019. Accessed: Apr. 29, 2022. [Online]. Available: http://arxiv.org/abs/1902.04094
[60] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” arXiv preprint arXiv:2006.11477, Oct. 2020. Accessed: May 06, 2022. [Online]. Available: http://arxiv.org/abs/2006.11477
[61] A. van den Oord, Y. Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding,” arXiv preprint arXiv:1807.03748, Jan. 2019. Accessed: May 06, 2022. [Online]. Available: http://arxiv.org/abs/1807.03748
[62] G. Lample and A. Conneau, “Cross-lingual Language Model Pretraining,” arXiv preprint arXiv:1901.07291, Jan. 2019. Accessed: May 06, 2022. [Online]. Available: http://arxiv.org/abs/1901.07291
[63] Y. Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” arXiv preprint arXiv:1907.11692, Jul. 2019. Accessed: May 06, 2022. [Online]. Available: http://arxiv.org/abs/1907.11692
[64] J. Besag, “Statistical Analysis of Non-Lattice Data,” J. R. Stat. Soc. Ser. Stat., vol. 24, no. 3, pp. 179–195, 1975, doi: 10.2307/2987782.
[65] D.-C. Lyu, T.-P. Tan, E. Chng, and H. Li, “Mandarin–English code-switching speech corpus in South-East Asia: SEAME,” Lang. Resour. Eval., vol. 49, 2015. doi: 10.1007/s10579-015-9303-x.
[66] V. Pratap et al., “wav2letter++: The Fastest Open-source Speech Recognition System,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6460–6464. doi: 10.1109/ICASSP.2019.8683535.
[67] M. Ott et al., “fairseq: A Fast, Extensible Toolkit for Sequence Modeling,” arXiv preprint arXiv:1904.01038, Apr. 2019. doi: 10.48550/arXiv.1904.01038.
[68] J. Kahn et al., “Libri-Light: A Benchmark for ASR with Limited or No Supervision,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 7669–7673. doi: 10.1109/ICASSP40776.2020.9052942.
[69] T. Wolf et al., “HuggingFace’s Transformers: State-of-the-art Natural Language Processing,” arXiv preprint arXiv:1910.03771, Jul. 2020. doi: 10.48550/arXiv.1910.03771.