Name 廖楚信 (Chu-Xin Liao)   Department Computer Science and Information Engineering
Thesis Title Leveraging Pre-trained Models and Various Types of Data to Improve Speech Translation
(利用預訓練模型和多種類型的數據改進語音翻譯)
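Abstract (Chinese) Speech Translation (ST) lies at the intersection of Natural Language Processing (NLP) and speech processing; its goal is to translate speech in one language directly into speech or text in another. It is an important achievement of modern technology: it enables barrier-free communication, promotes global exchange and cooperation, and advances language education. As globalization and cross-cultural exchange accelerate, speech translation has become increasingly important across application scenarios and a focus of research for many scholars.
Speech-to-speech translation can use a three-stage cascaded approach that chains an Automatic Speech Recognition (ASR) model, a Machine Translation (MT) model, and a Text-to-Speech (TTS) model. Chaining three models makes the drawbacks of cascaded systems, error propagation and high latency, more pronounced. Direct Speech-to-Speech Translation models mitigate these drawbacks, but their performance lags behind strong cascaded models. The main reason is that speech-to-speech training data is scarce; even with data augmentation, results remain below those of cascaded models. Overcoming data scarcity, or generating high-quality speech-to-speech data, is therefore an important problem. This thesis aims to find a balance between the two so that a model can offer both high performance and low latency.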
Abstract (English) Speech Translation (ST) is an interdisciplinary field that combines Natural Language Processing (NLP) and speech processing, aiming to translate speech in one language directly into speech or text in another language. This technology is one of the significant achievements of modern science: it enables barrier-free communication, promotes global exchange and cooperation, and advances language education. With the acceleration of globalization and cross-cultural exchange, speech translation technology has become increasingly important in various application scenarios and a focal point of research for many scholars.
Deep learning approaches to translation tasks can be categorized by input and output modality: Text-to-Text, Text-to-Speech, Speech-to-Text, and Speech-to-Speech. Among these, Text-to-Text, Speech-to-Text, and Speech-to-Speech translation are particularly noteworthy. Large language models (such as GPT) possess exceptional comprehension and generation capabilities, which makes Text-to-Text translation particularly effective when extensive high-quality training data is available.
Speech-to-Speech translation can adopt a three-stage cascaded approach that links an Automatic Speech Recognition (ASR) model, a Machine Translation (MT) model, and a Text-to-Speech (TTS) model in sequence. Chaining three models makes the drawbacks of cascaded systems, namely error propagation and high latency, more apparent. Direct Speech-to-Speech Translation models alleviate these drawbacks, yet they still lag significantly behind well-trained cascaded models. This is primarily due to the scarcity of training data for Speech-to-Speech translation; even with data augmentation techniques, the results remain inferior to those of cascaded models. Therefore, overcoming the scarcity of data or generating high-quality Speech-to-Speech data remains a crucial issue.
This thesis aims to find a balance between the two approaches so that the model achieves both high performance and low latency.
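The two drawbacks the abstract attributes to the three-stage cascade, error propagation and high latency, can be illustrated with a toy sketch. Everything below is hypothetical and is not taken from the thesis: the `Stage` class, the `toy_asr`, `toy_mt`, and `toy_tts` stand-ins, the tiny lexicon, and the latency figures are all made up for illustration and do not represent real models or measurements.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    """One cascade stage: a name, a processing function, and an assumed latency."""
    name: str
    run: Callable[[str], str]  # hypothetical stand-in for a real model
    latency_ms: float          # made-up figure, for illustration only


def toy_asr(audio: str) -> str:
    # Pretend recognizer: strips the "audio:" tag and mishears one word.
    transcript = audio.split(":", 1)[1]
    return transcript.replace("ship", "sheep")


# Tiny made-up English-to-French lexicon for the toy translator.
TOY_LEXICON = {"the": "le", "ship": "navire", "sheep": "mouton", "arrives": "arrive"}


def toy_mt(text: str) -> str:
    # Word-by-word lookup: it faithfully translates whatever the ASR produced.
    return " ".join(TOY_LEXICON.get(word, word) for word in text.split())


def toy_tts(text: str) -> str:
    # Pretend synthesizer: tags the text as speech.
    return f"speech:{text}"


def run_cascade(stages: List[Stage], source: str) -> Tuple[str, float]:
    """Feed each stage's output into the next and sum the per-stage latencies."""
    data = source
    total_latency_ms = 0.0
    for stage in stages:
        data = stage.run(data)
        total_latency_ms += stage.latency_ms
    return data, total_latency_ms


cascade = [
    Stage("ASR", toy_asr, 300.0),
    Stage("MT", toy_mt, 100.0),
    Stage("TTS", toy_tts, 200.0),
]
output, latency = run_cascade(cascade, "audio:the ship arrives")
print(output)   # speech:le mouton arrive
print(latency)  # 600.0
```

In this sketch the recognition error made in the first stage ("ship" heard as "sheep") survives unchanged through translation and synthesis, and the end-to-end latency is the sum of all three stages. A direct model would replace the three stages with a single one, which is the trade-off the abstract describes.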
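Keywords (Chinese) ★ 自動語音辨識 (Automatic Speech Recognition)
★ 機器翻譯 (Machine Translation)
★ 文字轉語音 (Text-to-Speech)
★ 語音翻譯 (Speech Translation)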
Keywords (English) ★ Automatic Speech Recognition
★ Machine Translation
★ Text-to-Speech
★ Speech Translation
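Table of Contents Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Research Methods and Chapter Overview
Chapter 2 Related Work and Literature Review
2.1 Recurrent Neural Networks (RNNs)
2.1.1. Long Short-Term Memory (LSTM)
2.2 Transformer
2.2.1. Self-Attention Algorithm
2.2.2. Multi-head Attention
2.2.3. Positional Encoding
2.2.4. Conformer
2.3 Speech Recognition Models
2.3.1. Wav2vec 2.0
2.3.2. HuBERT
2.3.3. W2V-BERT
2.4 Speech Synthesis Models
2.4.1. Introduction to Speech Synthesis Models
2.4.2. VITS
2.4.3. VALL-E
2.4.4. Multi-band iSTFT VITS
2.5 Literature on Speech-to-Speech Translation
2.5.1. Direct Speech-to-Speech Translation
Chapter 3 Multi-Stage Training Knowledge Distillation Speech-to-Speech Translation Model
3.1 First-Stage Training
3.2 Second-Stage Training
3.3 Speech-to-Speech Translation Model
Chapter 4 Experimental Results and Discussion
4.1 Experimental Equipment
4.2 Datasets
4.2.1. First-Stage Training
4.2.2. Second-Stage Training
4.2.3. Text-to-Speech Model Training
4.3 Experiments and Discussion
4.3.1. Speech-to-Text Translation
4.3.2. Speech-to-Speech Translation
4.3.3. Ablation Study
4.3.4. Speech Synthesis Quality Evaluation
Chapter 5 Conclusion and Future Directions
Chapter 6 References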
Advisor 王家慶 (Jia-Ching Wang)   Approval Date 2024-8-19
