References
[1] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada, Apr. 2014.
[2] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural Discrete Representation Learning,” 31st International Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pp. 6309-6318, Dec. 2017.
[3] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, “Autoencoding Beyond Pixels Using a Learned Similarity Metric,” 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, pp. 1558-1566, Jun. 2016.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” 27th International Conference on Neural Information Processing Systems (NIPS 2014), Montreal, Canada, pp. 2672-2680, Dec. 2014.
[5] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks,” IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, pp. 2242-2251, Oct. 2017.
[6] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, pp. 8789-8797, Jun. 2018.
[7] T. Kaneko and H. Kameoka, “CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks,” 26th European Signal Processing Conference (EUSIPCO 2018), Rome, Italy, pp. 2100-2104, Sep. 2018.
[8] L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, “Phonetic Posteriorgrams for Many-to-One Voice Conversion without Parallel Data Training,” IEEE International Conference on Multimedia and Expo (ICME 2016), Seattle, WA, USA, pp. 1-6, Jul. 2016.
[9] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring Speech Representations Using a Pitch-Adaptive Time-Frequency Smoothing and an Instantaneous-Frequency-Based F0 Extraction,” Speech Communication, vol. 27, no. 3-4, pp. 187-207, Apr. 1999.
[10] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications,” IEICE Transactions on Information and Systems, vol. E99-D, no. 7, pp. 1877-1884, Jul. 2016.
[11] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” 9th ISCA Speech Synthesis Workshop (SSW 2016), Sunnyvale, CA, USA, Sep. 2016.
[12] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A Flow-Based Generative Network for Speech Synthesis,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), Brighton, UK, pp. 3617-3621, May 2019.
[13] X. Wang, S. Takaki, and J. Yamagishi, “Neural Source-Filter-Based Waveform Model for Statistical Parametric Speech Synthesis,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), Brighton, UK, pp. 5916-5920, May 2019.
[14] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. Courville, “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis,” Advances in Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, pp. 14881-14892, Dec. 2019.
[15] M. Morise, H. Kawahara, and H. Katayose, “Fast and Reliable F0 Estimation Method Based on the Period Extraction of Vocal Fold Vibration of Singing Voice and Speech,” AES 35th International Conference, London, UK, Feb. 2009.
[16] M. Morise, “CheapTrick, a Spectral Envelope Estimator for High-Quality Speech Synthesis,” Speech Communication, vol. 67, pp. 1-7, Mar. 2015.
[17] M. Morise, “PLATINUM: A Method to Extract Excitation Signals for Voice Synthesis System,” Acoustical Science and Technology, vol. 33, no. 2, pp. 123-125, Mar. 2012.
[18] P. J. Werbos, “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences,” Ph.D. thesis, Harvard University, 1974.
[19] K. Fukushima, “Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position,” Biological Cybernetics, vol. 36, no. 4, pp. 193-202, 1980.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning Applied to Document Recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, pp. 770-778, Jun. 2016.
[22] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language Modeling with Gated Convolutional Networks,” 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, pp. 933-941, Aug. 2017.
[23] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[24] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, pp. 1874-1883, Jun. 2016.
[25] M. Mirza and S. Osindero, “Conditional Generative Adversarial Nets,” arXiv:1411.1784 [cs.LG], Nov. 2014.
[26] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein Generative Adversarial Networks,” 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, pp. 214-223, Aug. 2017.
[27] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “CycleGAN-VC2: Improved CycleGAN-Based Non-parallel Voice Conversion,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), Brighton, UK, pp. 6820-6824, May 2019.
[28] J. Kim, M. Kim, H. Kang, and K. Lee, “U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation,” 8th International Conference on Learning Representations (ICLR 2020), Apr. 2020.
[29] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, pp. 5967-5976, Jul. 2017.
[30] R. Ferro, N. Obin, and A. Roebel, “CycleGAN Voice Conversion of Spectral Envelopes Using Adversarial Weights,” 28th European Signal Processing Conference (EUSIPCO 2020), Amsterdam, Netherlands, pp. 406-410, Jan. 2021.
[31] K. Zhou, B. Sisman, and H. Li, “Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data,” Odyssey 2020: The Speaker and Language Recognition Workshop, Tokyo, Japan, pp. 230-237, Nov. 2020.
[32] Z. Du, K. Zhou, B. Sisman, and H. Li, “Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN,” Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2020), Auckland, New Zealand, pp. 507-513, Dec. 2020.