Master's/Doctoral Thesis Record 108522101 — Detailed Information




Author: 賴泓榮 (Hong-Rong Lai)    Department: Computer Science and Information Engineering
Thesis Title: 基於大規模多語言語音模型於在地化語言實務應用
(Localization Language Applications Based on Large-scale Multilingual Speech Models)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded-System Implementation of Beamforming and Audio Preprocessing
★ Application and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Dictation System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Accelerated Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ Face Detection with RetinaNet
★ Trend Prediction for Financial Products
★ Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Post-Stroke Surgical Survival
Files: Full text is not available in the system (access permanently restricted)
Abstract (Chinese): In recent years, deep learning has achieved remarkable results in fields such as speech recognition and natural language processing. Neural network models possess a degree of generalization ability: by training on large amounts of data, they can learn a broader range of speech variability and thus adapt better. This not only improves speech recognition in multilingual environments, but also reduces the dependence on labeled data and simplifies system development. Through continued research and improvement toward more accurate and reliable multilingual recognition systems, this work is of particular significance for the development and preservation of regional low-resource languages, and it also helps promote cross-cultural exchange and understanding.
Abstract (English): In recent years, deep learning techniques have achieved remarkable progress in speech recognition, natural language processing, and other fields. Neural network models in deep learning demonstrate a certain level of generalization ability: through training on extensive data, they can learn a broader range of speech variation, leading to improved adaptability. Deep learning not only enhances speech recognition in multilingual environments but also reduces reliance on annotated data and simplifies system development. Continued research and improvement toward more accurate and reliable multilingual recognition systems is of significant importance for the development and preservation of regional low-resource languages, and it also facilitates cross-cultural communication and understanding.
Keywords (Chinese)
★ Multilingual speech models
★ Low-resource languages
★ Speech recognition
★ Deep learning
Keywords (English)
Table of Contents
Chinese Abstract
English Abstract
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background and Objectives
1-2 Research Methods
1-3 Chapter Summary
2. Related Work
2-1 Audio Preprocessing
2-1-1 Fourier Transform
2-1-2 Spectrogram
2-1-3 Mel Scale
2-2 Transformer
2-2-1 Model Architecture
2-2-2 Attention
2-2-3 Multi-Head Attention
2-2-4 Applications of Attention
2-2-5 Feedforward Neural Network (FNN)
2-2-6 Positional Encoding
2-3 Low-Resource Language Processing
3. Localization Language Applications Based on Large-scale Multilingual Speech Models
3-1 Dataset
3-1-1 Target Language
3-1-2 Corpus Collection
3-1-3 Corpus Cleaning
3-1-4 Corpus Building
3-2 System Architecture
3-2-1 Off-the-Shelf Architecture
3-2-2 Feature Encoder
3-2-3 Cross-Attention
3-2-4 Multilingual Training
4. Experiments and Results
4-1 Experimental Setup
4-1-1 Dataset
4-1-2 Experimental Details
4-1-3 Evaluation Method
4-2 Model Implementation
4-2-1 Model Architecture
4-2-2 Model Training
4-3 Experimental Results and Analysis
4-3-1 Results
4-3-2 Analysis of Results
5. Conclusion and Future Work
References
Advisor: 王家慶    Date of Approval: 2023-08-15
