Speech synthesis is the technique of converting text into speech. In the past, a speech synthesis system usually consisted of multiple processing stages and required domain knowledge in phonetics, acoustics, and related fields, which created a high technical threshold. With the advancement of hardware technology in recent years, deep learning methods based on neural network architectures have been widely adopted by researchers. This thesis applies deep learning to a text-to-speech (TTS) system using an end-to-end speech synthesis architecture: a single neural network model is trained on audio training data, replacing the traditional architecture that generates speech from multiple models, such as duration models and acoustic feature models. Only one end-to-end model is needed to take text as input and generate the target speech.
Current end-to-end speech synthesis research focuses mainly on English. However, as long as a correspondence between text and speech can be established, the same approach can be applied to other languages. This thesis transcribes Chinese text with the phonetic symbols of the Scheme for the Chinese Phonetic Alphabet (Hanyu Pinyin) and uses these symbols, rather than Chinese characters, as training material to achieve Chinese speech synthesis. I hope that this concept can also be used to implement end-to-end speech synthesis for other non-English languages.
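The idea of replacing Chinese characters with phonetic symbols can be illustrated with a minimal sketch. It is not the thesis's actual preprocessing code: the lookup table `TOY_PINYIN` and the helper names `to_pinyin_sequence` and `to_model_input` are hypothetical, the table covers only four characters, and tone-numbered pinyin is assumed as the target notation. A real system would need a complete pronunciation lexicon and would have to handle polyphonic characters, which this toy table ignores.

```python
# Toy character-to-pinyin table (hypothetical, for illustration only).
# Each syllable carries its tone as a trailing digit.
TOY_PINYIN = {
    "語": "yu3",
    "音": "yin1",
    "合": "he2",
    "成": "cheng2",
}


def to_pinyin_sequence(text, table=TOY_PINYIN, unknown="<unk>"):
    """Map each character to a tone-numbered pinyin syllable.

    Characters missing from the table are replaced by the `unknown` marker.
    """
    return [table.get(ch, unknown) for ch in text]


def to_model_input(text):
    """Join syllables with spaces, giving a string of Latin letters and tone
    digits that could be fed to a character-level end-to-end TTS model."""
    return " ".join(to_pinyin_sequence(text))


if __name__ == "__main__":
    print(to_model_input("語音合成"))  # prints: yu3 yin1 he2 cheng2
```

Because the resulting input consists only of Latin letters, digits, and spaces, a model designed for English text could in principle consume it without changes to its input layer, which is the correspondence between text and speech that the abstract relies on.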