Abstract (English)
Speech synthesis is the technique of converting text into speech. In the past, a speech synthesis system typically consisted of multiple processing stages and required domain knowledge in phonetics, acoustics, and related fields, which created a high technical barrier to entry. Thanks to advances in hardware in recent years, deep learning methods based on neural network architectures have been widely adopted by researchers. This thesis likewise applies deep learning to a text-to-speech (TTS) system: using an end-to-end speech synthesis architecture, a single neural network model is trained on paired text and audio data. This abandons the traditional pipeline, in which speech is generated by multiple models such as duration models and acoustic models, and instead uses a single end-to-end model that takes text as input and generates the target speech.
Current end-to-end speech synthesis research focuses mainly on English; however, as long as the correspondence between text and speech can be established, the approach can also be applied to other languages. This thesis uses the phonetic symbols of the Scheme of the Chinese Phonetic Alphabet (Hanyu Pinyin) in place of Chinese characters as training input, thereby achieving Chinese speech synthesis. I hope this idea can also be used to implement end-to-end speech synthesis for other non-English languages.
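The text front end described above can be sketched as follows: each Chinese character is replaced by its Hanyu Pinyin syllable (with a tone digit) before being fed to the end-to-end model. This is a minimal, illustrative sketch; the tiny lookup table and the function name `to_pinyin` are hypothetical, and a real system would use a full grapheme-to-pinyin library such as pypinyin [3].

```python
# Minimal sketch of the pinyin-based text front end (illustrative only).
# PINYIN_TABLE is a toy lookup; a real system would use a library such as
# pypinyin to handle the full character set and context-dependent readings.
PINYIN_TABLE = {
    "你": "ni3",
    "好": "hao3",
    "語": "yu3",
    "音": "yin1",
}

def to_pinyin(text: str) -> str:
    """Replace each Chinese character with its pinyin syllable (tone as a digit);
    characters not in the table are passed through unchanged."""
    return " ".join(PINYIN_TABLE.get(ch, ch) for ch in text)

print(to_pinyin("你好"))  # -> ni3 hao3
```

The resulting pinyin string, rather than the raw characters, serves as the model's input text, so the same end-to-end architecture used for English letters can be reused for Chinese.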
References
[1] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous, "Tacotron: Towards End-to-End Speech Synthesis", arXiv:1703.10135, 2017
[2] When We Talk about AI Speaking: Speech Synthesis (Zhihu article), https://zhuanlan.zhihu.com/p/45517433
[3] pypinyin package documentation, https://pypinyin.readthedocs.io/zh_CN/master
[5] Scheme of the Chinese Phonetic Alphabet (Hanyu Pinyin), Wikipedia
https://zh.wikipedia.org/wiki/%E6%B1%89%E8%AF%AD%E6%8B%BC%E9%9F%B3
[6] Databaker Technology (標貝科技) open-source Chinese standard female voice corpus
https://www.data-baker.com/open_source.html
[7] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", arXiv:1409.0473, 2014
[8] How to read alignment graph
https://github.com/keithito/tacotron/issues/144
[9] An implementation of Tacotron speech synthesis in TensorFlow.
https://github.com/keithito/tacotron
[10] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, et al., "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", arXiv:1406.1078, 2014
[11] D. W. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. ASSP, vol.32, no.2, pp.236–243, Apr. 1984.
[12] Attention Model (Zhihu article)
https://zhuanlan.zhihu.com/p/61816483
[13] Mel scale, Wikipedia
https://zh.wikipedia.org/wiki/%E6%A2%85%E5%B0%94%E5%88%BB%E5%BA%A6
[14] L1 loss function helps quick alignment (GitHub issue),
https://github.com/Rayhane-mamah/Tacotron-2/issues/336
[15] Merlin: The Neural Network (NN) based Speech Synthesis System
https://github.com/CSTR-Edinburgh/merlin
[16] International Phonetic Alphabet, Wikipedia
https://zh.wikipedia.org/wiki/%E5%9C%8B%E9%9A%9B%E9%9F%B3%E6%A8%99
[17] Tacotron hyperparameter settings reference
https://github.com/Rayhane-mamah/Tacotron-2/blob/master/hparams.py
[18] End-to-end TTS: Analyzing the Tacotron Model Structure with Code
https://www.twblogs.net/a/5c2c9479bd9eee35b3a45a51
[19] Dropout, Wikipedia
https://en.wikipedia.org/wiki/Convolutional_neural_network#Dropout
[20] Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber, "Highway Networks", arXiv:1507.06228, 2015
[21] Jonathan Shen et al. (Google, Inc.; University of California, Berkeley), "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions", arXiv:1712.05884, 2017