Master's/Doctoral Thesis Record 111522032: Detailed Information




Name: Min-Xyu Jin (金珉旭)    Department: Computer Science and Information Engineering
Thesis Title: OT-CFM Based Text to Speech Systems
(基於最佳傳輸條件流匹配之語音合成系統)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Pre-processing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Spoken Narration System
★ Recognition and Detection of Calcaneal Fractures in CT Images Using Deep Learning and Accelerated Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ RetinaNet Applied to Face Detection
★ Trend Prediction for Financial Products
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ Research on End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Stroke Surgery Survival
Full Text: viewable in the library system only (access status: never open to the public)
Abstract (Chinese): Traditional speech synthesis methods relied mainly on statistical parametric speech synthesis or concatenative synthesis techniques. These approaches synthesize speech from manually extracted speech features using intricate algorithms, but the results lack naturalness and emotion, and the synthesis quality is poor. Since deep learning began to flourish in the 2010s, researchers have explored deep neural networks (DNNs) to improve the quality of synthesized speech; today, deep learning models and algorithms have entirely replaced traditional synthesis methods and can generate speech comparable to a real human voice. Current speech synthesis models nevertheless retain two drawbacks: training and inference remain somewhat slow, incurring considerable time costs; and although generating natural, fluent speech is no longer difficult, the output often lacks emotional variation and sounds monotonous.

This thesis builds a speech synthesis system on an optimal-transport conditional flow matching (OT-CFM) generative model, which produces speech of high naturalness and high speaker similarity while training and inferring efficiently; a sketch of the objective follows below. The system covers two tasks: multilingual speech synthesis and Chinese emotional speech synthesis. The multilingual system is trained on three datasets, Carolyn, JSUT, and the Vietnamese Voice Dataset, to support Chinese, Japanese, and Vietnamese. The Chinese emotional system uses ESD-0001, a Chinese dataset with emotional styles, together with a pre-trained wav2vec emotional style extractor that extracts emotional features from the training speech, so that the model learns to transfer the emotional styles in the dataset to the generated speech.
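A minimal sketch of the OT-CFM training objective, following its formulation in Lipman et al. [14] as adopted by Matcha-TTS [16]. Here x_1 is a data sample, x_0 ~ N(0, I) is Gaussian noise, t ~ U[0, 1], sigma_min is a small constant, and v_theta is the learned vector field:

    % OT-CFM: linear (optimal-transport) conditional flow from noise x_0 to data x_1
    \phi_t(x_0) = \bigl(1 - (1 - \sigma_{\min})\, t\bigr)\, x_0 + t\, x_1
    % Its time derivative is the regression target for the vector field
    u_t = \frac{d\phi_t(x_0)}{dt} = x_1 - (1 - \sigma_{\min})\, x_0
    % Training objective: match v_\theta to u_t along the conditional flow
    \mathcal{L}_{\mathrm{OT\text{-}CFM}} = \mathbb{E}_{t,\, q(x_1),\, p_0(x_0)}
        \bigl\lVert v_\theta\bigl(\phi_t(x_0),\, t\bigr) - \bigl(x_1 - (1 - \sigma_{\min})\, x_0\bigr) \bigr\rVert^2

At inference time, speech features (mel-spectrogram frames in Matcha-TTS [16]) are generated by integrating v_theta with an ODE solver from t = 0 (noise) to t = 1 (data), which is why synthesis needs only a handful of function evaluations.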
Abstract (English): Traditional speech synthesis methods mainly rely on statistical parametric speech synthesis or concatenative synthesis techniques. These methods depend on manually extracted speech features and complex algorithms to synthesize speech, but they lack naturalness and emotion, resulting in poor synthesis quality. Since the rise of deep learning in the 2010s, researchers have explored the use of deep neural networks (DNNs) to enhance the quality of synthesized speech. Today, various deep learning models and algorithms have completely replaced traditional synthesis methods, generating speech comparable to real human voices. However, current speech synthesis models still have the following drawbacks: training and inference speeds are somewhat slow, requiring considerable time costs; and while generating natural and fluent speech is no longer a challenge, the output often lacks emotional variation, resulting in monotonous speech.

This thesis constructs a speech synthesis system using an optimal transport conditional flow matching generative model, which can generate highly natural and highly similar speech while achieving efficient training and inference speeds. The speech synthesis system in this thesis covers two tasks: multilingual speech synthesis and Chinese emotional speech synthesis. The multilingual system uses three datasets, Carolyn, JSUT, and the Vietnamese Voice Dataset, to establish a speech synthesis system supporting Chinese, Japanese, and Vietnamese. The Chinese emotional speech synthesis system uses the ESD-0001 Chinese dataset with emotional styles, along with a pre-trained wav2vec emotional style extractor, to extract emotional features from the training speech, allowing the model to learn to transfer the emotional styles from the dataset to the generated speech.
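To make the style-conditioning step concrete, below is a minimal, hypothetical sketch of utterance-level style extraction with a pre-trained wav2vec 2.0 encoder via HuggingFace transformers. The checkpoint name and the mean-pooling strategy are illustrative assumptions, not the thesis's exact extractor (which, per reference [20], is an emotion-adapted wav2vec 2.0 model); how the resulting vector conditions the TTS decoder is not shown here.

    # Hypothetical sketch: extract an utterance-level emotion/style embedding
    # with a pre-trained wav2vec 2.0 encoder (HuggingFace `transformers`).
    # The checkpoint below is an assumed stand-in for the thesis's extractor.
    import torch
    import torchaudio
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    CHECKPOINT = "facebook/wav2vec2-base-960h"  # assumed stand-in checkpoint

    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
    encoder = Wav2Vec2Model.from_pretrained(CHECKPOINT).eval()

    def emotion_style_embedding(wav_path: str) -> torch.Tensor:
        """Return a fixed-size style vector by mean-pooling frame features."""
        waveform, sr = torchaudio.load(wav_path)      # (channels, samples)
        waveform = waveform.mean(dim=0)               # downmix to mono
        if sr != 16_000:                              # wav2vec 2.0 expects 16 kHz
            waveform = torchaudio.functional.resample(waveform, sr, 16_000)
        inputs = feature_extractor(waveform.numpy(), sampling_rate=16_000,
                                   return_tensors="pt")
        with torch.no_grad():
            frames = encoder(**inputs).last_hidden_state  # (1, T, hidden)
        return frames.mean(dim=1).squeeze(0)              # (hidden,)

    # Usage: condition the TTS model on this vector for style transfer, e.g.
    # style = emotion_style_embedding("esd_0001_sample.wav")

Mean-pooling the frame-level features is one simple way to obtain a fixed-size utterance embedding; the pooling choice is an assumption, not a detail confirmed by the record.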
Keywords (Chinese) ★ Deep Learning (深度學習)
★ Speech Synthesis (語音合成)
★ Flow Matching (流匹配)
Keywords (English)
Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
  1.1 Research Background and Motivation
  1.2 Research Methods
Chapter 2: Literature Review
  2.1 Transformer
    2.1.1 Self-Attention
    2.1.2 Multi-Head Attention
    2.1.3 Positional Encoding (PE)
  2.2 Conformer
  2.3 Wav2vec 2.0
  2.4 Glow-TTS
    2.4.1 Model Architecture
    2.4.2 Monotonic Alignment Search (MAS)
  2.5 HiFi-GAN
Chapter 3: Speech Synthesis System Based on Optimal-Transport Conditional Flow Matching
  3.1 Flow Matching (FM)
    3.1.1 Continuous Normalizing Flows (CNF)
    3.1.2 Flow Matching (FM)
    3.1.3 Conditional Flow Matching (CFM)
  3.2 Matcha-TTS
    3.2.1 Optimal-Transport Conditional Flow Matching (OT-CFM)
    3.2.2 Model Architecture
    3.2.3 Rotary Position Embedding (RoPE)
  3.3 Speech Synthesis System Architecture
    3.3.1 Multilingual Speech Synthesis System
    3.3.2 Chinese Emotional Speech Synthesis System
Chapter 4: Experimental Setup and Results
  4.1 Datasets
    4.1.1 Carolyn
    4.1.2 JSUT
    4.1.3 Vietnamese Voice Dataset
    4.1.4 ESD-0001
  4.2 Experimental Setup
    4.2.1 Experimental Environment
    4.2.2 Parameter Settings
  4.3 Experimental Results
    4.3.1 Evaluation Metrics
    4.3.2 Multilingual Speech Synthesis Evaluation
    4.3.3 Chinese Emotional Speech Synthesis Evaluation
Chapter 5: Conclusion and Future Work
References
References
[1] Zen, Heiga, Keiichi Tokuda, and Alan W. Black. "Statistical parametric speech synthesis." Speech Communication 51.11 (2009): 1039-1064.
[2] Wang, Yuxuan, et al. "Tacotron: Towards end-to-end speech synthesis." arXiv preprint arXiv:1703.10135 (2017).
[3] Ren, Yi, et al. "FastSpeech: Fast, robust and controllable text to speech." Advances in Neural Information Processing Systems 32 (2019).
[4] Ren, Yi, et al. "FastSpeech 2: Fast and high-quality end-to-end text to speech." arXiv preprint arXiv:2006.04558 (2020).
[5] Kim, Jaehyeon, Jungil Kong, and Juhee Son. "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech." International Conference on Machine Learning. PMLR, 2021.
[6] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
[7] Gulati, Anmol, et al. "Conformer: Convolution-augmented transformer for speech recognition." arXiv preprint arXiv:2005.08100 (2020).
[8] Baevski, Alexei, et al. "wav2vec 2.0: A framework for self-supervised learning of speech representations." Advances in Neural Information Processing Systems 33 (2020): 12449-12460.
[9] Jang, Eric, Shixiang Gu, and Ben Poole. "Categorical reparameterization with Gumbel-Softmax." arXiv preprint arXiv:1611.01144 (2016).
[10] Kim, Jaehyeon, et al. "Glow-TTS: A generative flow for text-to-speech via monotonic alignment search." Advances in Neural Information Processing Systems 33 (2020): 8067-8077.
[11] Dinh, Laurent, David Krueger, and Yoshua Bengio. "NICE: Non-linear independent components estimation." arXiv preprint arXiv:1410.8516 (2014).
[12] Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. "Density estimation using Real NVP." arXiv preprint arXiv:1605.08803 (2016).
[13] Kong, Jungil, Jaehyeon Kim, and Jaekyoung Bae. "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis." Advances in Neural Information Processing Systems 33 (2020): 17022-17033.
[14] Lipman, Yaron, et al. "Flow matching for generative modeling." arXiv preprint arXiv:2210.02747 (2022).
[15] Chen, Ricky T. Q., et al. "Neural ordinary differential equations." Advances in Neural Information Processing Systems 31 (2018).
[16] Mehta, Shivam, et al. "Matcha-TTS: A fast TTS architecture with conditional flow matching." ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.
[17] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
[18] Lee, Sang-gil, et al. "BigVGAN: A universal neural vocoder with large-scale training." arXiv preprint arXiv:2206.04658 (2022).
[19] Su, Jianlin, et al. "RoFormer: Enhanced transformer with rotary position embedding." arXiv preprint arXiv:2104.09864 (2021).
[20] Wagner, Johannes, et al. "Dawn of the transformer era in speech emotion recognition: Closing the valence gap." IEEE Transactions on Pattern Analysis and Machine Intelligence 45.9 (2023): 10745-10759.
[21] Sonobe, Ryosuke, Shinnosuke Takamichi, and Hiroshi Saruwatari. "JSUT corpus: Free large-scale Japanese speech corpus for end-to-end speech synthesis." arXiv preprint arXiv:1711.00354 (2017).
[22] Vietnamese Voice Dataset. https://github.com/CodeLinkIO/vietnamese-voice-dataset
[23] Zhou, Kun, et al. "Emotional voice conversion: Theory, databases and ESD." Speech Communication 137 (2022): 1-18.
Advisor: Jia-Ching Wang (王家慶)    Date of Approval: 2024-08-08
