References
[1] Heiga Zen, Keiichi Tokuda, and Alan W. Black. “Statistical parametric speech synthesis”. In: Speech Communication 51.11 (2009), pp. 1039–1064. ISSN: 0167-6393.
[2] Keiichi Tokuda et al. “Speech Synthesis Based on Hidden Markov Models”. In: Proceedings of the IEEE 101 (2013), pp. 1234–1252.
[3] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural Computation 9 (1997), pp. 1735–1780.
[4] Felix Alexander Gers, Jürgen Schmidhuber, and Fred Cummins. “Learning to Forget: Continual Prediction with LSTM”. In: Neural Computation 12 (2000), pp. 2451–2471.
[5] Kyunghyun Cho et al. “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation”. In: Conference on Empirical Methods in Natural Language Processing. 2014.
[6] Guo-Bing Zhou et al. “Minimal gated unit for recurrent neural networks”. In: International Journal of Automation and Computing 13 (2016), pp. 226–234.
[7] Joel Heck and Fathi M. Salem. “Simplified minimal gated unit variations for recurrent neural networks”. In: 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS) (2017), pp. 1593–1596.
[8] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. “Speech recognition with deep recurrent neural networks”. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013), pp. 6645–6649.
[9] Christophe Veaux, Junichi Yamagishi, and Simon King. “Towards Personalised Synthesised Voices for Individuals with Vocal Disabilities: Voice Banking and Reconstruction”. In: Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies. 2013.
[10] Brij Mohan Lal Srivastava et al. “Evaluating Voice Conversion-based Privacy Protection against Informed Attackers”. In: ICASSP. IEEE, 2020.
[11] Anthony John Dsouza et al. “SynthPipe: AI based Human in the Loop Video Dubbing Pipeline”. In: International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT). 2022. DOI: 10.1109/ICAECT54875.2022.9807853.
[12] Huaizhen Tang et al. “AVQVC: One-Shot Voice Conversion By Vector Quantization With Applying Contrastive Learning”. In: ICASSP. IEEE, 2022, pp. 4613–4617. DOI: 10.1109/icassp43922.2022.9746369.
[13] Da-Yi Wu and Hung-yi Lee. “One-Shot Voice Conversion by Vector Quantization”. In: ICASSP. IEEE, 2020, pp. 7734–7738. DOI: 10.1109/icassp40776.2020.9053854.
[14] Kaizhi Qian et al. “AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss”. In: International Conference on Machine Learning. 2019.
[15] Takuhiro Kaneko et al. “StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion”. In: Proc. Interspeech. 2019. DOI: 10.21437/Interspeech.2019-2236.
[16] Hirokazu Kameoka et al. “Nonparallel Voice Conversion With Augmented Classifier Star Generative Adversarial Networks”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), pp. 2982–2995. ISSN: 2329-9304. DOI: 10.1109/TASLP.2020.3036784.
[17] Yinghao Aaron Li, Ali Zare, and Nima Mesgarani. “StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding Voice Conversion”. In: Proc. Interspeech. 2021, pp. 1349–1353. DOI: 10.21437/interspeech.2021-319.
[18] Takuhiro Kaneko et al. “CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion”. In: ICASSP. IEEE, 2019.
[19] Takuhiro Kaneko et al. “CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-Spectrogram Conversion”. In: Proc. Interspeech. 2020. DOI: 10.21437/Interspeech.2020-2280.
[20] Tingle Li et al. “CVC: Contrastive Learning for Non-Parallel Voice Conversion”. In: Proc. Interspeech. 2021. DOI: 10.21437/Interspeech.2021-137.
[21] Heiga Zen, Keiichi Tokuda, and Alan W. Black. “Statistical Parametric Speech Synthesis”. In: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’07). Vol. 4. 2007, pp. IV-1229–IV-1232.
[22] Heiga Zen, Andrew Senior, and Mike Schuster. “Statistical parametric speech synthesis using deep neural networks”. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013. DOI: 10.1109/icassp.2013.6639215. URL: https://doi.org/10.1109/icassp.2013.6639215.
[23] K. Tokuda et al. “Speech parameter generation algorithms for HMM-based speech synthesis”. In: 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100). IEEE. DOI: 10.1109/icassp.2000.861820. URL: https://doi.org/10.1109/icassp.2000.861820.
[24] Yuchen Fan et al. “TTS synthesis with bidirectional LSTM based recurrent neural networks”. In: Interspeech 2014. ISCA, 2014. DOI: 10.21437/interspeech.2014-443. URL: https://doi.org/10.21437/interspeech.2014-443.
[25] Zhizheng Wu and Simon King. “Investigating gated recurrent networks for speech synthesis”. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016. DOI: 10.1109/icassp.2016.7472657. URL: https://doi.org/10.1109/icassp.2016.7472657.
[26] Viacheslav Klimkov et al. “Parameter Generation Algorithms for Text-To-Speech Synthesis with Recurrent Neural Networks”. In: 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018. DOI: 10.1109/slt.2018.8639626. URL: https://doi.org/10.1109/slt.2018.8639626.
[27] D. Childers, B. Yegnanarayana, and Ke Wu. “Voice conversion: Factors responsible for quality”. In: ICASSP. Vol. 10. IEEE, 1985, pp. 748–751.
[28] Seyed Hamidreza Mohammadi and Alexander Kain. “An overview of voice conversion systems”. In: Speech Communication 88 (2017), pp. 65–82.
[29] Berrak Sisman et al. “An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2020), pp. 132–157.
[30] Ehsan Variani et al. “Deep neural networks for small footprint text-dependent speaker verification”. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014), pp. 4052–4056.
[31] Mohammed Salah Al-Radhi, Tamás Gábor Csapó, and Géza Németh. “Continuous vocoder applied in deep neural network based voice conversion”. In: Multimedia Tools and Applications 78 (2019), pp. 33549–33572.
[32] Ding Ma et al. “Two-Stage Training Method for Japanese Electrolaryngeal Speech Enhancement Based on Sequence-to-Sequence Voice Conversion”. In: 2022 IEEE Spoken Language Technology Workshop (SLT) (2022), pp. 949–954.
[33] Tuan Vu Ho, M. Kobayashi, and Masato Akagi. “Speak Like a Professional: Increasing Speech Intelligibility by Mimicking Professional Announcer Voice with Voice Conversion”. In: Proc. Interspeech (2022).
[34] A. Kashkin, I. A. Karpukhin, and Sergei L. Shishkin. “HiFi-VC: High Quality ASR-Based Voice Conversion”. In: ArXiv abs/2203.16937 (2022).
[35] Jilong Wu et al. “Multilingual Text-To-Speech Training Using Cross Language Voice Conversion And Self-Supervised Learning Of Speech Representations”. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2022), pp. 8017–8021.
[36] Haohan Guo et al. “Improving Adversarial Waveform Generation Based Singing Voice Conversion with Harmonic Signals”. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2022), pp. 6657–6661.
[37] Firra M. Mukhneri, Inung Wijayanto, and Sugondo Hadiyoso. “Voice Conversion for Dubbing Using Linear Predictive Coding and Hidden Markov Model”. In: Journal of Southwest Jiaotong University (2020).
[38] Suresh Malodia et al. “Why Do People Use Artificial Intelligence (AI)-Enabled Voice Assistants?” In: IEEE Transactions on Engineering Management PP (2021), pp. 1–15.
[39] Susmita Bhattacharjee and Rohit Sinha. “Sensitivity Analysis of MaskCycleGAN based Voice Conversion for Enhancing Cleft Lip and Palate Speech Recognition”. In: 2022 IEEE International Conference on Signal Processing and Communications (SPCOM) (2022), pp. 1–5.
[40] Lokitha T et al. “Smart Voice Assistance for Speech disabled and Paralyzed People”. In: 2022 International Conference on Computer Communication and Informatics (ICCCI) (2022), pp. 1–5.
[41] Masanobu Abe et al. “Voice conversion through vector quantization”. In: Journal of the Acoustical Society of Japan (E) 11.2 (1990), pp. 71–76.
[42] Kiyohiro Shikano, Satoshi Nakamura, and Masanobu Abe. “Speaker adaptation and voice conversion by codebook mapping”. In: International Symposium on Circuits and Systems (ISCAS). IEEE, 1991, pp. 594–597.
[43] Elina Helander et al. “On the impact of alignment on voice conversion performance”. In: Proc. Interspeech 2008. 2008, pp. 1453–1456.
[44] Tomoki Toda, Alan W. Black, and Keiichi Tokuda. “Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory”. en. In: IEEE Transactions on Audio, Speech and Language Processing 15.8 (Nov. 2007), pp. 2222–2235. ISSN: 1558-7916. DOI: 10.1109/TASL.2007.907344. (Visited on 08/01/2022).
[45] Kazuhiro Kobayashi et al. “The NU-NAIST Voice Conversion System for the Voice Conversion Challenge 2016”. In: Proc. Interspeech 2016. 2016, pp. 1667–1671. DOI: 10.21437/Interspeech.2016-970.
[46] Elina Helander et al. “Voice conversion using partial least squares regression”. In: IEEE Transactions on Audio, Speech, and Language Processing 18.5 (2010), pp. 912–921.
[47] Zhizheng Wu et al. “Exemplar-based sparse representation with residual compensation for voice conversion”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 22.10 (2014), pp. 1506–1521.
[48] Chin-Cheng Hsu et al. “Voice conversion from non-parallel corpora using variational auto-encoder”. en. In: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). Jeju, South Korea: IEEE, Dec. 2016, pp. 1–6. ISBN: 978-988-14768-2-1. DOI: 10.1109/APSIPA.2016.7820786. (Visited on 08/01/2022).
[49] Hirokazu Kameoka et al. “ACVAE-VC: Non-Parallel Voice Conversion With Auxiliary Classifier Variational Autoencoder”. en. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.9 (Sept. 2019), pp. 1432–1443. ISSN: 2329-9290, 2329-9304. DOI: 10.1109/TASLP.2019.2917232. (Visited on 08/01/2022).
[50] Lifa Sun et al. “Phonetic posteriorgrams for many-to-one voice conversion without parallel data training”. en. In: 2016 IEEE International Conference on Multimedia and Expo (ICME). Seattle, WA, USA: IEEE, July 2016, pp. 1–6. ISBN: 978-1-4673-7258-9. DOI: 10.1109/ICME.2016.7552917. (Visited on 08/01/2022).
[51] Feng-Long Xie, Frank K. Soong, and Haifeng Li. “A KL Divergence and DNN-Based Approach to Voice Conversion without Parallel Training Sentences”. In: Proc. Interspeech 2016. 2016, pp. 287–291. DOI: 10.21437/Interspeech.2016-116.
[52] Yuki Saito et al. “Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors”. In: ICASSP. IEEE, 2018, pp. 5274–5278.
[53] Shaojin Ding and Ricardo Gutierrez-Osuna. “Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion”. In: Interspeech. 2019.
[54] Wen-Chin Huang et al. “Unsupervised Representation Disentanglement Using Cross Domain Features and Adversarial Learning in Variational Autoencoder Based Voice Conversion”. In: IEEE Transactions on Emerging Topics in Computational Intelligence 4 (2020), pp. 468–479.
[55] Kaizhi Qian et al. “F0-Consistent Many-To-Many Non-Parallel Voice Conversion Via Conditional Autoencoder”. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020), pp. 6284–6288.
[56] Seung-won Park, Doo-young Kim, and Myun-chul Joe. “Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data”. In: Interspeech. 2020.
[57] Kou Tanaka et al. “ATTS2S-VC: Sequence-to-sequence Voice Conversion with Attention and Context Preservation Mechanisms”. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019), pp. 6805–6809.
[58] Wen-Chin Huang et al. “Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining”. In: Interspeech. 2019.
[59] Hirokazu Kameoka et al. “StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks”. In: Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 266–273.
[60] Takuhiro Kaneko and Hirokazu Kameoka. “CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks”. In: 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 2100–2104.
[61] Yinghao Aaron Li, Ali Zare, and Nima Mesgarani. “StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion”. In: Interspeech. 2021.
[62] Durk P. Kingma et al. “Semi-supervised Learning with Deep Generative Models”. In: Advances in Neural Information Processing Systems. Vol. 27. Curran Associates, Inc., 2014. (Visited on 08/04/2022).
[63] Ian Goodfellow et al. “Generative Adversarial Nets”. In: Advances in Neural Information Processing Systems. Vol. 27. Curran Associates, Inc., 2014. (Visited on 08/02/2022).
[64] Michael Gutmann and Aapo Hyvärinen. “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models”. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 2010, pp. 297–304.
[65] Ting Chen et al. “A simple framework for contrastive learning of visual representations”. In: International conference on machine learning. PMLR, 2020, pp. 1597–1607.
[66] Bo-Wei Chen, Chen-Yu Chen, and Jhing-Fa Wang. “Smart Homecare Surveillance System: Behavior Identification Based on State-Transition Support Vector Machines and Sound Directivity Pattern Analysis”. In: IEEE Transactions on Systems, Man, and Cybernetics: Systems 43 (2013), pp. 1279–1289.
[67] Bo-Wei Chen et al. “Cognitive Sensors Based on Ridge Phase-Smoothing Localization and Multiregional Histograms of Oriented Gradients”. In: IEEE Transactions on Emerging Topics in Computing 7 (2019), pp. 123–134.
[68] Gavin C. Cawley and Peter D. Noakes. “LSP speech synthesis using backpropagation networks”. 1993.
[69] Rafal Józefowicz, Wojciech Zaremba, and Ilya Sutskever. “An Empirical Exploration of Recurrent Network Architectures”. In: International Conference on Machine Learning. 2015.
[70] Martin Cooke et al. “Evaluating the intelligibility benefit of speech modifications in known noise conditions”. In: Speech Communication 55 (2013), pp. 572–585.
[71] Zhizheng Wu, Oliver Watts, and Simon King. “Merlin: An Open Source Neural Network Speech Synthesis System”. In: Speech Synthesis Workshop. 2016.
[72] Robert A. J. Clark, Korin Richmond, and Simon King. “Multisyn: Open-domain unit selection for the Festival speech synthesis system”. In: Speech Communication 49 (2007), pp. 317–330.
[73] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications”. In: IEICE Transactions on Information and Systems E99-D (2016), pp. 1877–1884.
[74] Zhizheng Wu and Simon King. “Investigating gated recurrent networks for speech synthesis”. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2016), pp. 5140–5144.
[75] Xudong Mao et al. “Least Squares Generative Adversarial Networks”. In: ICCV. IEEE, 2017. (Visited on 10/02/2022).
[76] Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit. University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2019. DOI: 10.7488/ds/2645.
[77] Jaime Lorenzo-Trueba et al. The Voice Conversion Challenge 2018: database and results. University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2018.
[78] Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. “Parallel Wavegan: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram”. In: ICASSP. IEEE, 2020. DOI: 10.1109/ICASSP40776.2020.9053795.
[79] Kundan Kumar et al. “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis”. In: Advances in Neural Information Processing Systems. 2019. (Visited on 08/25/2022).
[80] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: CVPR. IEEE, 2016. DOI: 10.1109/CVPR.2016.90.
[81] Phillip Isola et al. “Image-to-Image Translation with Conditional Adversarial Networks”. In: CVPR. IEEE, 2017. DOI: 10.1109/CVPR.2017.632.
[82] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: International Conference on Learning Representations (ICLR). 2015.
[83] Philipos C. Loizou. “Speech Quality Assessment”. In: Multimedia Analysis, Processing and Communications. Springer Berlin Heidelberg, 2011. ISBN: 978-3-642-19550-1, 978-3-642-19551-8. DOI: 10.1007/978-3-642-19551-8_23. (Visited on 08/11/2022).
[84] Li Wan et al. “Generalized End-to-End Loss for Speaker Verification”. In: ICASSP. IEEE, 2018. DOI: 10.1109/ICASSP.2018.8462665.
[85] Ye Jia et al. “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis”. In: Advances in Neural Information Processing Systems. 2018.
[86] Adam Paszke et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library”. In: Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc., 2019. (Visited on 08/10/2022).
[87] Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald. SUPERSEDED - CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit. University of Edinburgh. The Centre for Speech Technology Research (CSTR), Oct. 2016. DOI: 10.7488/ds/1495. URL: https://datashare.ed.ac.uk/handle/10283/2119 (visited on 08/10/2022).