Abstract (English)
Speech synthesis is the technique of converting text into speech. In the past, a speech synthesis system typically consisted of multiple processing stages and required domain knowledge in phonetics, acoustics, and related fields, which created a high technical barrier to entry. Thanks to advances in hardware in recent years, deep learning methods based on neural network architectures have been widely adopted by researchers. This thesis likewise applies deep learning to a text-to-speech (TTS) system: using an end-to-end speech synthesis architecture, a single neural network model is trained on paired text and audio data. This abandons the traditional pipeline, in which speech is generated by multiple models such as duration models and acoustic models, and instead uses a single end-to-end model that takes text as input and generates the target speech.
Current end-to-end speech synthesis research focuses mainly on English; however, as long as the correspondence between text and speech can be established, the approach can also be applied to other languages. This thesis uses the phonetic symbols of the Scheme of the Chinese Phonetic Alphabet (Hanyu Pinyin) in place of Chinese characters as training input, thereby achieving Chinese speech synthesis. I hope this idea can also be used to implement end-to-end speech synthesis for other non-English languages.
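The text front end described above can be sketched as follows: each Chinese character is replaced by its Hanyu Pinyin syllable (with a tone digit) before being fed to the end-to-end model. This is a minimal, illustrative sketch; the tiny lookup table and the function name `to_pinyin` are hypothetical, and a real system would use a full grapheme-to-pinyin library such as pypinyin [3].

```python
# Minimal sketch of the pinyin-based text front end (illustrative only).
# PINYIN_TABLE is a toy lookup; a real system would use a library such as
# pypinyin to handle the full character set and context-dependent readings.
PINYIN_TABLE = {
    "你": "ni3",
    "好": "hao3",
    "語": "yu3",
    "音": "yin1",
}

def to_pinyin(text: str) -> str:
    """Replace each Chinese character with its pinyin syllable (tone as a digit);
    characters not in the table are passed through unchanged."""
    return " ".join(PINYIN_TABLE.get(ch, ch) for ch in text)

print(to_pinyin("你好"))  # -> ni3 hao3
```

The resulting pinyin string, rather than the raw characters, serves as the model's input text, so the same end-to-end architecture used for English letters can be reused for Chinese.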
References
[1] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous, "Tacotron: Towards End-to-End Speech Synthesis", arXiv:1703.10135, 2017
[2] When We Talk about AI Speaking: Speech Synthesis (Zhihu article), https://zhuanlan.zhihu.com/p/45517433
[3] pypinyin package documentation, https://pypinyin.readthedocs.io/zh_CN/master
[5] Scheme of the Chinese Phonetic Alphabet (Hanyu Pinyin), Wikipedia
https://zh.wikipedia.org/wiki/%E6%B1%89%E8%AF%AD%E6%8B%BC%E9%9F%B3
[6] Databaker Technology (標貝科技) open-source Chinese standard female voice corpus
https://www.data-baker.com/open_source.html
[7] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", arXiv:1409.0473, 2014
[8] How to read alignment graph
https://github.com/keithito/tacotron/issues/144
[9] An implementation of Tacotron speech synthesis in TensorFlow.
https://github.com/keithito/tacotron
[10] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, et al., "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", arXiv:1406.1078, 2014
[11] D. W. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. ASSP, vol.32, no.2, pp.236–243, Apr. 1984.
[12] Attention Model (Zhihu article)
https://zhuanlan.zhihu.com/p/61816483
[13] Mel scale, Wikipedia
https://zh.wikipedia.org/wiki/%E6%A2%85%E5%B0%94%E5%88%BB%E5%BA%A6
[14] L1 loss function helps quick alignment (GitHub issue),
https://github.com/Rayhane-mamah/Tacotron-2/issues/336
[15] Merlin: The Neural Network (NN) based Speech Synthesis System
https://github.com/CSTR-Edinburgh/merlin
[16] International Phonetic Alphabet, Wikipedia
https://zh.wikipedia.org/wiki/%E5%9C%8B%E9%9A%9B%E9%9F%B3%E6%A8%99
[17] Tacotron hyperparameter settings reference
https://github.com/Rayhane-mamah/Tacotron-2/blob/master/hparams.py
[18] End-to-end TTS: Analyzing the Tacotron Model Structure with Code
https://www.twblogs.net/a/5c2c9479bd9eee35b3a45a51
[19] Dropout, Wikipedia
https://en.wikipedia.org/wiki/Convolutional_neural_network#Dropout
[20] Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber, "Highway Networks", arXiv:1507.06228, 2015
[21] Jonathan Shen et al. (Google, Inc.; University of California, Berkeley), "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions", arXiv:1712.05884, 2017