Master's/Doctoral Thesis 111522016: Detailed Record




Name  Chu-Xin Liao (廖楚信)    Department  Computer Science and Information Engineering
Thesis Title  Leveraging Pre-trained Models and Various Types of Data to Improve Speech Translation
(Chinese title: 利用預訓練模型和多種類型的數據改進語音翻譯)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Front-End Processing
★ Application and Design of Speech Synthesis and Speaker Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Dictation System
★ Recognition and Detection of Calcaneal Fractures in CT Images Using Deep Learning and Speeded-Up Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ RetinaNet Applied to Face Detection
★ Trend Prediction for Financial Instruments
★ Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Stroke Surgery Survival
Files  [EndNote RIS format]  [BibTeX format]  [Related Articles]  [Article Citations]  [Full Record]  [Library Catalog]  View the thesis in the system (never open access)
Abstract  Speech Translation (ST) is an interdisciplinary field that combines Natural Language Processing (NLP) and speech processing, with the goal of translating speech in one language directly into speech or text in another. The technology is one of the notable achievements of modern science: it enables barrier-free communication, promotes global exchange and cooperation, and advances language education. As globalization and cross-cultural exchange accelerate, speech translation has become increasingly important across application scenarios and a focal point of research for many scholars.
Deep learning approaches to translation fall into several types: Text-to-Text, Text-to-Speech, Speech-to-Text, and Speech-to-Speech. Of these, Text-to-Text, Speech-to-Text, and Speech-to-Speech translation attract the most attention. Large language models such as GPT possess strong comprehension and generation capabilities, so Text-to-Text translation performs especially well when backed by large amounts of high-quality training data.
Speech-to-Speech translation can adopt a three-stage cascaded approach that chains an Automatic Speech Recognition (ASR) model, a Machine Translation (MT) model, and a Text-to-Speech (TTS) model in sequence, but this arrangement accentuates the drawbacks of cascaded models: error propagation and high latency. Direct Speech-to-Speech Translation models mitigate these drawbacks, yet their performance still lags behind strong cascaded models, primarily because Speech-to-Speech training data is scarce; even with data augmentation, results remain inferior to cascaded systems. Overcoming this data scarcity, or generating high-quality Speech-to-Speech data, is therefore a crucial issue. This thesis aims to strike a balance so that the model achieves both high performance and low latency.
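To make the three-stage cascade concrete, here is a minimal sketch, not the system built in this thesis: it assumes the Hugging Face transformers and soundfile Python packages, and the checkpoint choices (openai/whisper-large-v3 for ASR, Helsinki-NLP/opus-mt-zh-en for MT, and the VITS-based facebook/mms-tts-eng for TTS) are illustrative stand-ins for a Mandarin-to-English direction; the file names are hypothetical.

```python
# Minimal sketch of a 3-stage cascaded speech-to-speech translation
# pipeline (Mandarin -> English). Illustrative stand-in models only.
import numpy as np
import soundfile as sf
from transformers import pipeline

# Stage 1: ASR (Mandarin speech -> Mandarin text)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
# Stage 2: MT (Mandarin text -> English text)
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")
# Stage 3: TTS (English text -> English speech), a VITS-based stand-in
tts = pipeline("text-to-speech", model="facebook/mms-tts-eng")

def cascaded_s2st(input_wav: str, output_wav: str) -> str:
    """Chain ASR -> MT -> TTS; errors made early propagate downstream."""
    source_text = asr(input_wav)["text"]
    target_text = mt(source_text)[0]["translation_text"]
    speech = tts(target_text)  # dict with "audio" and "sampling_rate"
    sf.write(output_wav, np.squeeze(speech["audio"]), speech["sampling_rate"])
    return target_text

print(cascaded_s2st("mandarin_utterance.wav", "english_output.wav"))
```

The chain makes the abstract's two drawbacks visible: a stage-1 recognition error is translated and spoken verbatim, and end-to-end latency is the sum of three sequential inferences. A cascade like this is also a common response to data scarcity, since running it over an ASR corpus yields synthetic Speech-to-Speech training pairs.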
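For contrast, a direct-style system performs the mapping in one integrated model rather than three chained ones. The following hedged sketch uses the publicly released facebook/seamless-m4t-v2-large checkpoint purely as a readily available example of such a model, following its documented transformers usage; input/output file names are again hypothetical.

```python
# Sketch: speech-to-speech translation with a single integrated model
# instead of three chained systems. Illustrative checkpoint choice.
import soundfile as sf
import torch
from transformers import AutoProcessor, SeamlessM4Tv2Model

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

audio, sr = sf.read("mandarin_utterance.wav")  # mono waveform
assert sr == 16000, "the model expects 16 kHz input; resample first"

inputs = processor(audios=audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    # generate() returns translated speech; [0] is the waveform batch
    waveform = model.generate(**inputs, tgt_lang="eng")[0]
sf.write("english_output.wav", waveform.cpu().numpy().squeeze(), 16000)
```

A single forward pass avoids cascading errors and cuts latency, which is exactly the trade-off against the stronger but slower cascade that the abstract sets out to balance.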
Keywords ★ Automatic Speech Recognition
★ Machine Translation
★ Text-to-Speech
★ Speech Translation
Table of Contents  Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Research Methods and Chapter Overview
Chapter 2: Related Work and Literature Review
2.1 Recurrent Neural Networks (RNNs)
2.1.1 Long Short-Term Memory (LSTM)
2.2 Transformer
2.2.1 The Self-Attention Algorithm
2.2.2 Multi-Head Attention
2.2.3 Positional Encoding
2.2.4 Conformer
2.3 Speech Recognition Models
2.3.1 Wav2vec 2.0
2.3.2 HuBERT
2.3.3 W2V-BERT
2.4 Speech Synthesis Models
2.4.1 Overview of Speech Synthesis Models
2.4.2 VITS
2.4.3 VALL-E
2.4.4 Multi-band iSTFT VITS
2.5 Speech-to-Speech Translation Literature
2.5.1 Direct Speech-to-Speech Translation
Chapter 3: A Speech-to-Speech Translation Model with Multi-Stage Training and Knowledge Distillation
3.1 Stage 1 Training
3.2 Stage 2 Training
3.3 The Speech-to-Speech Translation Model
Chapter 4: Experimental Results and Discussion
4.1 Experimental Environment
4.2 Datasets
4.2.1 Stage 1 Training
4.2.2 Stage 2 Training
4.2.3 Text-to-Speech Model Training
4.3 Experiments and Discussion
4.3.1 Speech-to-Text Translation
4.3.2 Speech-to-Speech Translation
4.3.3 Ablation Studies
4.3.4 Speech Synthesis Quality Evaluation
Chapter 5: Conclusions and Future Directions
Chapter 6: References
Advisor  Jia-Ching Wang (王家慶)    Date Approved  2024-08-19