References
[1] Heiga Zen, Keiichi Tokuda, and Alan W. Black. “Statistical parametric speech synthesis”. In: Speech Communication 51.11 (2009), pp. 1039–1064. ISSN: 0167-6393.
[2] Keiichi Tokuda et al. “Speech Synthesis Based on Hidden Markov Models”. In: Proceedings of the IEEE 101 (2013), pp. 1234–1252.
[3] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural Computation 9 (1997), pp. 1735–1780.
[4] Felix Alexander Gers, Jürgen Schmidhuber, and Fred Cummins. “Learning to Forget: Continual Prediction with LSTM”. In: Neural Computation 12 (2000), pp. 2451–2471.
[5] Kyunghyun Cho et al. “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation”. In: Conference on Empirical Methods in Natural Language Processing. 2014.
[6] Guo-Bing Zhou et al. “Minimal gated unit for recurrent neural networks”. In: International Journal of Automation and Computing 13 (2016), pp. 226–234.
[7] Joel Heck and Fathi M. Salem. “Simplified minimal gated unit variations for recurrent neural networks”. In: 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS) (2017), pp. 1593–1596.
[8] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. “Speech recognition with deep recurrent neural networks”. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013), pp. 6645–6649.
[9] Christophe Veaux, Junichi Yamagishi, and Simon King. “Towards Personalised Synthesised Voices for Individuals with Vocal Disabilities: Voice Banking and Reconstruction”. In: Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies. 2013.
[10] Brij Mohan Lal Srivastava et al. “Evaluating Voice Conversion-based Privacy Protection against Informed Attackers”. In: ICASSP. IEEE, 2020.
[11] Anthony John Dsouza et al. “SynthPipe: AI based Human in the Loop Video Dubbing Pipeline”. In: International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT). 2022. DOI: 10.1109/ICAECT54875.2022.9807853.
[12] Huaizhen Tang et al. “AVQVC: One-Shot Voice Conversion By Vector Quantization With Applying Contrastive Learning”. In: ICASSP. IEEE, 2022, pp. 4613–4617. DOI: 10.1109/icassp43922.2022.9746369.
[13] Da-Yi Wu and Hung-yi Lee. “One-Shot Voice Conversion by Vector Quantization”. In: ICASSP. IEEE, 2020, pp. 7734–7738. DOI: 10.1109/icassp40776.2020.9053854.
[14] Kaizhi Qian et al. “AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss”. In: International Conference on Machine Learning. 2019.
[15] Takuhiro Kaneko et al. “StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion”. In: Proc. Interspeech. 2019. DOI: 10.21437/Interspeech.2019-2236.
[16] Hirokazu Kameoka et al. “Nonparallel Voice Conversion With Augmented Classifier Star Generative Adversarial Networks”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), pp. 2982–2995. ISSN: 2329-9304. DOI: 10.1109/TASLP.2020.3036784.
[17] Yinghao Aaron Li, Ali Zare, and Nima Mesgarani. “StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding Voice Conversion”. In: Proc. Interspeech. 2021, pp. 1349–1353. DOI: 10.21437/interspeech.2021-319.
[18] Takuhiro Kaneko et al. “CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion”. In: ICASSP. IEEE, 2019.
[19] Takuhiro Kaneko et al. “CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-Spectrogram Conversion”. In: Proc. Interspeech. 2020. DOI: 10.21437/Interspeech.2020-2280.
[20] Tingle Li et al. “CVC: Contrastive Learning for Non-Parallel Voice Conversion”. In: Proc. Interspeech. 2021. DOI: 10.21437/Interspeech.2021-137.
[21] Heiga Zen, Keiichi Tokuda, and Alan W. Black. “Statistical Parametric Speech Synthesis”. In: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’07). Vol. 4. 2007, pp. IV-1229–IV-1232.
[22] Heiga Zen, Andrew Senior, and Mike Schuster. “Statistical parametric speech synthesis using deep neural networks”. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013. DOI: 10.1109/icassp.2013.6639215. URL: https://doi.org/10.1109/icassp.2013.6639215.
[23] K. Tokuda et al. “Speech parameter generation algorithms for HMM-based speech synthesis”. In: 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100). IEEE. DOI: 10.1109/icassp.2000.861820. URL: https://doi.org/10.1109/icassp.2000.861820.
[24] Yuchen Fan et al. “TTS synthesis with bidirectional LSTM based recurrent neural networks”. In: Interspeech 2014. ISCA, 2014. DOI: 10.21437/interspeech.2014-443. URL: https://doi.org/10.21437/interspeech.2014-443.
[25] Zhizheng Wu and Simon King. “Investigating gated recurrent networks for speech synthesis”. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016. DOI: 10.1109/icassp.2016.7472657. URL: https://doi.org/10.1109/icassp.2016.7472657.
[26] Viacheslav Klimkov et al. “Parameter Generation Algorithms for Text-To-Speech Synthesis with Recurrent Neural Networks”. In: 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018. DOI: 10.1109/slt.2018.8639626. URL: https://doi.org/10.1109/slt.2018.8639626.
[27] D. Childers, B. Yegnanarayana, and Ke Wu. “Voice conversion: Factors responsible for quality”. In: ICASSP. Vol. 10. IEEE, 1985, pp. 748–751.
[28] Seyed Hamidreza Mohammadi and Alexander Kain. “An overview of voice conversion systems”. In: Speech Communication 88 (2017), pp. 65–82.
[29] Berrak Sisman et al. “An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2020), pp. 132–157.
[30] Ehsan Variani et al. “Deep neural networks for small footprint text-dependent speaker verification”. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014), pp. 4052–4056.
[31] Mohammed Salah Al-Radhi, Tamás Gábor Csapó, and Géza Németh. “Continuous vocoder applied in deep neural network based voice conversion”. In: Multimedia Tools and Applications 78 (2019), pp. 33549–33572.
[32] Ding Ma et al. “Two-Stage Training Method for Japanese Electrolaryngeal Speech Enhancement Based on Sequence-to-Sequence Voice Conversion”. In: 2022 IEEE Spoken Language Technology Workshop (SLT) (2022), pp. 949–954.
[33] Tuan Vu Ho, M. Kobayashi, and Masato Akagi. “Speak Like a Professional: Increasing Speech Intelligibility by Mimicking Professional Announcer Voice with Voice Conversion”. In: Proc. Interspeech (2022).
[34] A. Kashkin, I. A. Karpukhin, and Sergei L. Shishkin. “HiFi-VC: High Quality ASR-Based Voice Conversion”. In: ArXiv abs/2203.16937 (2022).
[35] Jilong Wu et al. “Multilingual Text-To-Speech Training Using Cross Language Voice Conversion And Self-Supervised Learning Of Speech Representations”. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2022), pp. 8017–8021.
[36] Haohan Guo et al. “Improving Adversarial Waveform Generation Based Singing Voice Conversion with Harmonic Signals”. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2022), pp. 6657–6661.
[37] Firra M. Mukhneri, Inung Wijayanto, and Sugondo Hadiyoso. “Voice Conversion for Dubbing Using Linear Predictive Coding and Hidden Markov Model”. In: Journal of Southwest Jiaotong University (2020).
[38] Suresh Malodia et al. “Why Do People Use Artificial Intelligence (AI)-Enabled Voice Assistants?” In: IEEE Transactions on Engineering Management PP (2021), pp. 1–15.
[39] Susmita Bhattacharjee and Rohit Sinha. “Sensitivity Analysis of MaskCycleGAN based Voice Conversion for Enhancing Cleft Lip and Palate Speech Recognition”. In: 2022 IEEE International Conference on Signal Processing and Communications (SPCOM) (2022), pp. 1–5.
[40] Lokitha T et al. “Smart Voice Assistance for Speech disabled and Paralyzed People”. In: 2022 International Conference on Computer Communication and Informatics (ICCCI) (2022), pp. 1–5.
[41] Masanobu Abe et al. “Voice conversion through vector quantization”. In: Journal of the Acoustical Society of Japan (E) 11.2 (1990), pp. 71–76.
[42] Kiyohiro Shikano, Satoshi Nakamura, and Masanobu Abe. “Speaker adaptation and voice conversion by codebook mapping”. In: International Symposium on Circuits and Systems (ISCAS). IEEE, 1991, pp. 594–597.
[43] Elina Helander et al. “On the impact of alignment on voice conversion performance”. In: Proc. Interspeech 2008. 2008, pp. 1453–1456.
[44] Tomoki Toda, Alan W. Black, and Keiichi Tokuda. “Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory”. en. In: IEEE Transactions on Audio, Speech and Language Processing 15.8 (Nov. 2007), pp. 2222–2235. ISSN: 1558-7916. DOI: 10.1109/TASL.2007.907344. (Visited on 08/01/2022).
[45] Kazuhiro Kobayashi et al. “The NU-NAIST Voice Conversion System for the Voice Conversion Challenge 2016”. In: Proc. Interspeech 2016. 2016, pp. 1667–1671. DOI: 10.21437/Interspeech.2016-970.
[46] Elina Helander et al. “Voice conversion using partial least squares regression”. In: IEEE Transactions on Audio, Speech, and Language Processing 18.5 (2010), pp. 912–921.
[47] Zhizheng Wu et al. “Exemplar-based sparse representation with residual compensation for voice conversion”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 22.10 (2014), pp. 1506–1521.
[48] Chin-Cheng Hsu et al. “Voice conversion from non-parallel corpora using variational auto-encoder”. en. In: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). Jeju, South Korea: IEEE, Dec. 2016, pp. 1–6. ISBN: 978-988-14768-2-1. DOI: 10.1109/APSIPA.2016.7820786. (Visited on 08/01/2022).
[49] Hirokazu Kameoka et al. “ACVAE-VC: Non-Parallel Voice Conversion With Auxiliary Classifier Variational Autoencoder”. en. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.9 (Sept. 2019), pp. 1432–1443. ISSN: 2329-9290, 2329-9304. DOI: 10.1109/TASLP.2019.2917232. (Visited on 08/01/2022).
[50] Lifa Sun et al. “Phonetic posteriorgrams for many-to-one voice conversion without parallel data training”. en. In: 2016 IEEE International Conference on Multimedia and Expo (ICME). Seattle, WA, USA: IEEE, July 2016, pp. 1–6. ISBN: 978-1-4673-7258-9. DOI: 10.1109/ICME.2016.7552917. (Visited on 08/01/2022).
[51] Feng-Long Xie, Frank K. Soong, and Haifeng Li. “A KL Divergence and DNN-Based Approach to Voice Conversion without Parallel Training Sentences”. In: Proc. Interspeech 2016. 2016, pp. 287–291. DOI: 10.21437/Interspeech.2016-116.
[52] Yuki Saito et al. “Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors”. In: ICASSP. IEEE, 2018, pp. 5274–5278.
[53] Shaojin Ding and Ricardo Gutierrez-Osuna. “Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion”. In: Interspeech. 2019.
[54] Wen-Chin Huang et al. “Unsupervised Representation Disentanglement Using Cross Domain Features and Adversarial Learning in Variational Autoencoder Based Voice Conversion”. In: IEEE Transactions on Emerging Topics in Computational Intelligence 4 (2020), pp. 468–479.
[55] Kaizhi Qian et al. “F0-Consistent Many-To-Many Non-Parallel Voice Conversion Via Conditional Autoencoder”. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020), pp. 6284–6288.
[56] Seung-won Park, Doo-young Kim, and Myun-chul Joe. “Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data”. In: Interspeech. 2020.
[57] Kou Tanaka et al. “ATTS2S-VC: Sequence-to-sequence Voice Conversion with Attention and Context Preservation Mechanisms”. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019), pp. 6805–6809.
[58] Wen-Chin Huang et al. “Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining”. In: Interspeech. 2019.
[59] Hirokazu Kameoka et al. “StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks”. In: Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 266–273.
[60] Takuhiro Kaneko and Hirokazu Kameoka. “CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks”. In: 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 2100–2104.
[61] Yinghao Aaron Li, Ali Zare, and Nima Mesgarani. “StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion”. In: Interspeech. 2021.
[62] Durk P. Kingma et al. “Semi-supervised Learning with Deep Generative Models”. In: Advances in Neural Information Processing Systems. Vol. 27. Curran Associates, Inc., 2014. (Visited on 08/04/2022).
[63] Ian Goodfellow et al. “Generative Adversarial Nets”. In: Advances in Neural Information Processing Systems. Vol. 27. Curran Associates, Inc., 2014. (Visited on 08/02/2022).
[64] Michael Gutmann and Aapo Hyvärinen. “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models”. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 2010, pp. 297–304.
[65] Ting Chen et al. “A simple framework for contrastive learning of visual representations”. In: International conference on machine learning. PMLR, 2020, pp. 1597–1607.
[66] Bo-Wei Chen, Chen-Yu Chen, and Jhing-Fa Wang. “Smart Homecare Surveillance System: Behavior Identification Based on State-Transition Support Vector Machines and Sound Directivity Pattern Analysis”. In: IEEE Transactions on Systems, Man, and Cybernetics: Systems 43 (2013), pp. 1279–1289.
[67] Bo-Wei Chen et al. “Cognitive Sensors Based on Ridge Phase-Smoothing Localization and Multiregional Histograms of Oriented Gradients”. In: IEEE Transactions on Emerging Topics in Computing 7 (2019), pp. 123–134.
[68] Gavin C. Cawley and Peter D. Noakes. “LSP speech synthesis using backpropagation networks”. 1993.
[69] Rafal Józefowicz, Wojciech Zaremba, and Ilya Sutskever. “An Empirical Exploration of Recurrent Network Architectures”. In: International Conference on Machine Learning. 2015.
[70] Martin Cooke et al. “Evaluating the intelligibility benefit of speech modifications in known noise conditions”. In: Speech Communication 55 (2013), pp. 572–585.
[71] Zhizheng Wu, Oliver Watts, and Simon King. “Merlin: An Open Source Neural Network Speech Synthesis System”. In: Speech Synthesis Workshop. 2016.
[72] Robert A. J. Clark, Korin Richmond, and Simon King. “Multisyn: Open-domain unit selection for the Festival speech synthesis system”. In: Speech Communication 49 (2007), pp. 317–330.
[73] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications”. In: IEICE Transactions on Information and Systems E99-D (2016), pp. 1877–1884.
[74] Zhizheng Wu and Simon King. “Investigating gated recurrent networks for speech synthesis”. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2016), pp. 5140–5144.
[75] Xudong Mao et al. “Least Squares Generative Adversarial Networks”. In: ICCV. IEEE, 2017. (Visited on 10/02/2022).
[76] Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit. University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2019. DOI: 10.7488/ds/2645.
[77] Jaime Lorenzo-Trueba et al. The Voice Conversion Challenge 2018: database and results. University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2018.
[78] Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. “Parallel Wavegan: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram”. In: ICASSP. IEEE, 2020. DOI: 10.1109/ICASSP40776.2020.9053795.
[79] Kundan Kumar et al. “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis”. In: Advances in Neural Information Processing Systems. 2019. (Visited on 08/25/2022).
[80] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: CVPR. IEEE, 2016. DOI: 10.1109/CVPR.2016.90.
[81] Phillip Isola et al. “Image-to-Image Translation with Conditional Adversarial Networks”. In: CVPR. IEEE, 2017. DOI: 10.1109/CVPR.2017.632.
[82] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: International Conference on Learning Representations (ICLR). 2015.
[83] Philipos C. Loizou. “Speech Quality Assessment”. In: Multimedia Analysis, Processing and Communications. Springer Berlin Heidelberg, 2011. ISBN: 978-3-642-19550-1, 978-3-642-19551-8. DOI: 10.1007/978-3-642-19551-8_23. (Visited on 08/11/2022).
[84] Li Wan et al. “Generalized End-to-End Loss for Speaker Verification”. In: ICASSP. IEEE, 2018. DOI: 10.1109/ICASSP.2018.8462665.
[85] Ye Jia et al. “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis”. In: Advances in Neural Information Processing Systems. 2018.
[86] Adam Paszke et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library”. In: Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc., 2019. (Visited on 08/10/2022).
[87] Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald. SUPERSEDED - CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit. University of Edinburgh. The Centre for Speech Technology Research (CSTR), Oct. 2016. DOI: 10.7488/ds/1495. URL: https://datashare.ed.ac.uk/handle/10283/2119 (visited on 08/10/2022).