|dc.description.abstract||Speech synthesis is the technique of converting text into speech. In the past, a speech synthesis system usually consisted of multiple processing stages and
required domain knowledge in phonetics, acoustics, and related fields, which created a high technical threshold. With recent advances in hardware technology,
deep learning methods based on neural network architectures have become widely used by researchers. This thesis applies deep learning to a text-to-speech (TTS) system using an end-to-end speech synthesis architecture. A single neural network model is trained on audio training data, abandoning the traditional architecture that generates speech from multiple components such as time models and acoustic features, so that one end-to-end model takes text as input and generates the target speech.
Current end-to-end speech synthesis research focuses mainly on English; however, as long as the correspondence between text and speech can be found, the same approach can be applied to speech synthesis in other languages. This thesis replaces Chinese phonetic transcription with the phonetic symbols from the Scheme of the Chinese Phonetic Alphabet and uses these symbols, in place of Chinese characters, as training material to achieve Chinese speech synthesis. I hope that this concept can also be used to implement end-to-end speech synthesis for other non-English languages.||en_US|