References
[1] B. Sisman, J. Yamagishi, S. King and H. Li, “An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 132-157, 2021.
[2] B. Atal and M. Schroeder, “Predictive coding of speech signals and subjective error criteria,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 3, pp. 247-254, June 1979.
[3] P. Mermelstein, “Distance measures for speech recognition, psychological and instrumental,” Pattern Recognition and Artificial Intelligence, vol. 116, pp. 374–388, 1976.
[4] Y. Zhang, Z. Ou and M. Hasegawa-Johnson, “Improvement of Probabilistic Acoustic Tube model for speech decomposition,” 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, Florence, Italy, pp. 7929-7933, 2014.
[5] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino, and H. Banno, “TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation,” 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 3933-3936, 2008.
[6] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
[7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” Advances in Neural Information Processing Systems, NIPS 2014, Montreal, Canada, pp. 2672-2680, December 2014.
[8] C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA, pp. 1–6. IEEE, 2016.
[9] C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” Proc. Interspeech 2017, pp. 3364–3368, 2017.
[10] W. Huang, H. Hwang, Y. Peng, Y. Tsao, and H. Wang, “Voice conversion based on cross-domain features using variational auto encoders,” 2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP, pp. 51–55. IEEE, 2018.
[11] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “ACVAE-VC: Non-Parallel Voice Conversion with Auxiliary Classifier Variational Autoencoder,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1432–1443, September 2019.
[12] J. Chou, C. Yeh, H. Lee, and L. Lee, “Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations,” Proc. Interspeech 2018, pp. 501–505, 2018.
[13] A. Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., Red Hook, NY, USA, pp. 6309–6318, 2017.
[14] Y. Gao, R. Singh, and B. Raj, “Voice impersonation using generative adversarial networks,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 2506–2510, IEEE, 2018.
[15] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks,” 2018 IEEE Spoken Language Technology Workshop, SLT, pp. 266–273. IEEE, 2018.
[16] T. Kaneko and H. Kameoka, “Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks,” ArXiv, abs/1711.11293, 2017.
[17] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797, 2018.
[18] T. Kaneko and H. Kameoka, “CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks,” 2018 26th European Signal Processing Conference, EUSIPCO, Rome, Italy, pp. 2100–2104, IEEE, 2018.
[19] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “CycleGAN-VC2: Improved CycleGAN-Based Non-Parallel Voice Conversion,” 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, Brighton, UK, pp. 6820-6824, 2019.
[20] W. Huang, H. Luo, H. Hwang, C. Lo, Y. Peng, Y. Tsao, and H. Wang, “Unsupervised Representation Disentanglement Using Cross Domain Features and Adversarial Learning in Variational Autoencoder Based Voice Conversion,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 4, no. 4, pp. 468-479, Aug. 2020.
[21] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion,” Proc. Interspeech 2019, pp. 679–683, 2019.
[22] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction,” Speech Communication, vol. 27, no. 3-4, pp. 187-207, April 1999.
[23] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications,” IEICE Transactions on Information and Systems, vol. E99-D, no. 7, pp. 1877-1884, July 2016.
[24] A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” 9th ISCA Speech Synthesis Workshop, SSW2016, Sunnyvale, USA, Sep. 2016.
[25] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A Flow-based Generative Network for Speech Synthesis,” IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, UK, pp. 3617-3621, May 2019.
[26] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter-based waveform model for statistical parametric speech synthesis,” IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, UK, pp. 5916-5920, May 2019.
[27] K. Kumar, R. Kumar, T. de Boissière, L. Gestin, W.Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. Courville, “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis,” Advances in Neural Information Processing Systems, NeurIPS 2019, pp. 14881-14892, 2019.
[28] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis,” 34th International Conference on Neural Information Processing Systems, NIPS'20, Curran Associates Inc., Red Hook, NY, USA, Article 1428, pp. 17022–17033, 2020.
[29] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[30] K. Qian, Y. Zhang, S. Chang, X. Yang, and M.A. Hasegawa-Johnson, “AutoVC: Zero-shot voice style transfer with only autoencoder loss,” International Conference on Machine Learning, pp. 5210–5219, 2019.
[31] K. Qian, Z. Jin, M.A. Hasegawa-Johnson, and G.J. Mysore, “F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder,” 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 6284–6288, IEEE, 2020.
[32] K. Qian, Y. Zhang, S. Chang, D. Cox, and M.A. Hasegawa-Johnson, “Unsupervised speech decomposition via triple information bottleneck,” Proceedings of the 37th International Conference on Machine Learning, pp. 7836–7846, 2020.
[33] C.H. Chan, K. Qian, Y. Zhang, and M.A. Hasegawa-Johnson, “SpeechSplit2.0: Unsupervised speech disentanglement for voice conversion without tuning autoencoder bottlenecks,” 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 6332–6336, 2022.
[34] N. Jaitly and G. Hinton, “Vocal Tract Length Perturbation (VTLP) improves speech recognition,” ICML Workshop on Deep Learning for Audio, Speech and Language, 2013.
[35] S. Yang, M. Tantrawenith, H. Zhuang, Z. Wu, A. Sun, J. Wang, N. Cheng, H. Tang, X. Zhao, J. Wang, and H.M. Meng, “Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion,” Proc. Interspeech 2022, pp. 2553-2557, 2022.
[36] P. Cheng, W. Hao, S. Dai, J. Liu, Z. Gan, and L. Carin, “CLUB: A contrastive log-ratio upper bound of mutual information,” Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, ser. Proceedings of Machine Learning Research, vol. 119, pp. 1779–1788, PMLR, 2020.
[37] J. Chou and H. Lee, “One-shot voice conversion by separating speaker and content representations with instance normalization,” Proc. Interspeech 2019, pp. 664–668, 2019.
[38] L. Zhang, R. Li, S. Wang, L. Deng, J. Liu, Y. Ren, J. He, R. Huang, J. Zhu, X. Chen, and Z. Zhao, “M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus,” 36th Conference on Neural Information Processing Systems Datasets and Benchmarks Track, NeurIPS 2022, 2022.
[39] R. Huang, F. Chen, Y. Ren, J. Liu, C. Cui, and Z. Zhao, “Multi-Singer: Fast Multi-Singer Singing Voice Vocoder with A Large-Scale Corpus,” Proceedings of the 29th ACM International Conference on Multimedia, MM '21, Association for Computing Machinery, New York, NY, USA, pp. 3945–3954, 2021.
[40] S. Bai, J.Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” ArXiv, abs/1803.01271, 2018.