Abstract: | The term speech synthesis refers to artificially synthesizing speech that is nearly indistinguishable from real human talk. Early approaches, such as statistical methods, were tedious and error-prone; moreover, in a pipeline architecture, an error at one stage has a knock-on effect on every subsequent stage. In recent years, with the boom in deep learning, the technology for building Text-To-Speech (TTS) systems on deep learning architectures has matured considerably, and a wide variety of deep-network TTS applications have entered everyday life. Now that TTS can synthesize realistic speech, users are no longer satisfied with realism alone: they want to synthesize the voice of a specified speaker, speech in a specified accent, or speech carrying a particular emotion. A modern TTS system must be able to generate speech according to user preferences.

Tacotron2 is a classic deep learning architecture for TTS. It consists of an encoder that encodes the input text, an attention mechanism that converts the encoder output into decoder input, a decoder that outputs a spectrogram, and a vocoder that converts the spectrogram into audio.

Regarding the attention mechanism, besides Location Sensitive Attention, which Google's Tacotron2 implementation adopts, and Forward Attention, which is also commonly used in speech synthesis models, Monotonic Chunkwise Attention (MoChA) is another option. MoChA restricts the range of ordinary soft attention to a fixed-length chunk, in the hope of improving network accuracy. To date, however, most MoChA-related research has been applied to the field of speech recognition. This thesis implements the three attention mechanisms above on a multi-speaker Tacotron2 model. Comparing MoChA against the other two, we find that the audio produced with MoChA is not better; instead, in this setting MoChA loses its advantage of supporting streaming. |
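The chunk restriction described above can be sketched in a few lines. The following is a minimal, simplified NumPy illustration (not the thesis's implementation): it omits MoChA's monotonic stopping-probability mechanism and simply assumes a boundary position has already been chosen, then applies a soft attention (softmax) over only the last `chunk_size` encoder steps ending at that boundary. The function name and the toy energies are invented for illustration.

```python
import numpy as np

def chunkwise_soft_attention(energies, boundary, chunk_size):
    """Softmax over only the `chunk_size` encoder steps ending at `boundary`,
    instead of over the full sequence -- the chunk idea behind MoChA."""
    start = max(0, boundary - chunk_size + 1)
    window = energies[start:boundary + 1]
    weights = np.exp(window - window.max())   # numerically stable softmax
    weights /= weights.sum()
    full = np.zeros_like(energies)            # steps outside the chunk get zero weight
    full[start:boundary + 1] = weights
    return full

# Toy example: 8 encoder steps, boundary chosen at step 5, chunk of length 3,
# so only steps 3..5 receive nonzero attention weight.
e = np.array([0.1, 0.5, 0.2, 1.0, 2.0, 1.5, 0.3, 0.1])
w = chunkwise_soft_attention(e, boundary=5, chunk_size=3)
```

Because attention is computed only over a bounded window rather than the whole encoder output, this formulation is what allows MoChA-style models to operate on streaming input.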