Abstract: Traditional speech synthesis methods rely mainly on statistical parametric or concatenative synthesis techniques. These methods depend on manually extracted acoustic features and complex algorithms, but the resulting speech lacks naturalness and emotional expressiveness, yielding poor synthesis quality. Since the rise of deep learning in the 2010s, researchers have explored deep neural networks (DNNs) to improve the quality of synthesized speech; today, deep learning models and algorithms have fully replaced traditional synthesis methods and can generate speech comparable to real human voices. However, current speech synthesis models still have the following drawbacks: training and inference remain somewhat slow, incurring considerable time costs; and although generating natural, fluent speech is no longer difficult, the output often lacks emotional variation and sounds monotonous.
This thesis builds a speech synthesis system based on an optimal-transport conditional flow matching generative model, which produces speech with high naturalness and high similarity while achieving efficient training and inference. The system covers two tasks: multilingual speech synthesis and Chinese emotional speech synthesis. The multilingual speech synthesis system uses three datasets (Carolyn, JSUT, and the Vietnamese Voice Dataset) to support Chinese, Japanese, and Vietnamese. The Chinese emotional speech synthesis system uses ESD-0001, a Chinese dataset with emotional styles, together with a pre-trained wav2vec emotional style extractor that extracts emotional features from the training speech, allowing the model to learn to transfer the emotional styles in the dataset to the generated speech.