dc.description.abstract | Traditional speech synthesis methods rely mainly on statistical parametric or concatenative synthesis techniques. These methods depend on manually extracted speech features and complex algorithms, and the resulting speech lacks naturalness and emotional expressiveness, so synthesis quality is poor. Since the rise of deep learning in the 2010s, researchers have explored deep neural networks (DNNs) to improve the quality of synthesized speech. Today, deep learning models have largely replaced traditional synthesis methods and can generate speech comparable to real human voices. However, current speech synthesis models still have two drawbacks: training and inference are slow, incurring considerable time costs; and although generating natural, fluent speech is no longer a challenge, the output often lacks emotional variation and sounds monotonous.
This paper builds a speech synthesis system on an optimal-transport conditional flow matching (OT-CFM) generative model, which generates speech with high naturalness and similarity while achieving efficient training and inference. The system covers two tasks: multilingual speech synthesis and Chinese emotional speech synthesis. The multilingual system uses three datasets, Carolyn, JSUT, and the Vietnamese Voice Dataset, to support Chinese, Japanese, and Vietnamese. The Chinese emotional system uses the emotionally labeled ESD-0001 Chinese dataset together with a pre-trained wav2vec emotion-style extractor to obtain emotional features from the training speech, so the model learns to transfer the dataset's emotional styles to the generated speech. | en_US
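For context, a model of this kind is typically trained with the optimal-transport conditional flow matching objective of Lipman et al. (2023); the sketch below follows that paper's notation (sigma_min, v_t, u_t), which is an assumption here rather than something stated in the abstract. The model regresses a learned vector field onto the conditional OT vector field:

% Standard OT-CFM training loss (Lipman et al., 2023); a sketch under
% assumed notation, not necessarily the formulation used in this thesis.
\begin{align*}
  \mathcal{L}_{\mathrm{CFM}}(\theta)
    &= \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_1 \sim q,\; x \sim p_t(\cdot \mid x_1)}
       \left\| v_t(x;\theta) - u_t(x \mid x_1) \right\|^2, \\
  p_t(x \mid x_1) &= \mathcal{N}\!\left(x;\; t\,x_1,\; \bigl(1-(1-\sigma_{\min})\,t\bigr)^2 I\right), \\
  u_t(x \mid x_1) &= \frac{x_1 - (1-\sigma_{\min})\,x}{1 - (1-\sigma_{\min})\,t}.
\end{align*}

Here x_1 is a training sample (e.g., a mel spectrogram), and at inference the learned field v_t is integrated by an ODE solver in a small number of steps, which is what makes sampling fast relative to diffusion-style models.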