References
[1] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, May 2014, pp. 4052–4056. doi: 10.1109/ICASSP.2014.6854363.
[2] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Apr. 2018, pp. 5329–5333. doi: 10.1109/ICASSP.2018.8461375.
[3] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Interspeech 2015, Sep. 2015, pp. 3214–3218. doi: 10.21437/Interspeech.2015-647.
[4] A. Vaswani et al., “Attention Is All You Need.” arXiv, Dec. 05, 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1706.03762
[5] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv, May 19, 2016. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1409.0473
[6] L. Dong, S. Xu, and B. Xu, “Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 5884–5888. doi: 10.1109/ICASSP.2018.8462506.
[7] A. van den Oord, Y. Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding.” arXiv, Jan. 22, 2019. Accessed: Jul. 16, 2022. [Online]. Available: http://arxiv.org/abs/1807.03748
[8] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Mar. 2010, pp. 297–304. Accessed: Jul. 17, 2022. [Online]. Available: https://proceedings.mlr.press/v9/gutmann10a.html
[9] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.” arXiv, Oct. 22, 2020. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/2006.11477
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv, May 24, 2019. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1810.04805
[11] E. Jang, S. Gu, and B. Poole, “Categorical Reparameterization with Gumbel-Softmax.” arXiv, Aug. 05, 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1611.01144
[12] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units.” arXiv, Jun. 14, 2021. Accessed: Jun. 19, 2022. [Online]. Available: http://arxiv.org/abs/2106.07447
[13] S. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–137, Mar. 1982, doi: 10.1109/TIT.1982.1056489.
[14] Y. Ren et al., “FastSpeech: Fast, Robust and Controllable Text to Speech.” arXiv, Nov. 20, 2019. Accessed: Jul. 01, 2022. [Online]. Available: http://arxiv.org/abs/1905.09263
[15] Y. Wang et al., “Tacotron: Towards End-to-End Speech Synthesis.” arXiv, Apr. 06, 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1703.10135
[16] J. Shen et al., “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.” arXiv, Feb. 15, 2018. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1712.05884
[17] W. Ping et al., “Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning.” arXiv, Feb. 22, 2018. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1710.07654
[18] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis.” arXiv, Oct. 23, 2020. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/2010.05646
[19] K. Kumar et al., “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis,” in Advances in Neural Information Processing Systems, 2019, vol. 32. Accessed: Jul. 17, 2022. [Online]. Available: https://papers.nips.cc/paper/2019/hash/6804c9bca0a615bdb9374d00a9fcba59-Abstract.html
[20] L.-W. Chen, H.-Y. Lee, and Y. Tsao, “Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech.” arXiv, Aug. 22, 2019. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1810.12656
[21] T. Kaneko and H. Kameoka, “CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks,” in 2018 26th European Signal Processing Conference (EUSIPCO), Sep. 2018, pp. 2100–2104. doi: 10.23919/EUSIPCO.2018.8553236.
[22] J. Serrà, S. Pascual, and C. Segura, “Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion.” arXiv, Sep. 05, 2019. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1906.00794
[23] D.-Y. Wu, Y.-H. Chen, and H. Lee, “VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net Architecture,” in Interspeech 2020, Oct. 2020, pp. 4691–4695. doi: 10.21437/Interspeech.2020-1443.
[24] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder.” arXiv, Oct. 13, 2016. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1610.04019
[25] I. J. Goodfellow et al., “Generative Adversarial Networks.” arXiv, Jun. 10, 2014. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1406.2661
[26] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes.” arXiv, May 01, 2014. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1312.6114
[27] J. Chou, C. Yeh, H. Lee, and L. Lee, “Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations.” arXiv, Jun. 24, 2018. Accessed: Jul. 14, 2022. [Online]. Available: http://arxiv.org/abs/1804.02812
[28] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks.” arXiv, Jun. 29, 2018. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1806.02169
[29] J.-C. Chou, C. Yeh, and H. Lee, “One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization,” in Interspeech 2019, Sep. 2019. doi: 10.21437/Interspeech.2019-2663.
[30] X. Huang and S. Belongie, “Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization.” arXiv, Jul. 30, 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1703.06868
[31] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, “AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss.” arXiv, Jun. 06, 2019. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1905.05879
[32] A. Polyak, L. Wolf, and Y. Taigman, “TTS Skins: Speaker Conversion via ASR.” arXiv, Jul. 26, 2020. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1904.08983
[33] S. Chen et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing.” arXiv, Jan. 24, 2022. Accessed: Jun. 19, 2022. [Online]. Available: http://arxiv.org/abs/2110.13900
[34] J. Kahn et al., “Libri-Light: A Benchmark for ASR with Limited or No Supervision,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 7669–7673. doi: 10.1109/ICASSP40776.2020.9052942.
[35] Z. Chi et al., “XLM-E: Cross-lingual Language Model Pre-training via ELECTRA.” arXiv, Apr. 19, 2022. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/2106.16138
[36] J. Kang, R. Liu, L. Li, Y. Cai, D. Wang, and T. F. Zheng, “Domain-Invariant Speaker Vector Projection by Model-Agnostic Meta-Learning,” in Interspeech 2020, Oct. 2020, pp. 3825–3829. doi: 10.21437/Interspeech.2020-2562.
[37] C. Finn, P. Abbeel, and S. Levine, “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.” arXiv, Jul. 18, 2017. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1703.03400
[38] A. Raghu, M. Raghu, S. Bengio, and O. Vinyals, “Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML.” arXiv, Feb. 12, 2020. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1909.09157
[39] Q. Qian, S. Zhu, J. Tang, R. Jin, B. Sun, and H. Li, “Robust Optimization over Multiple Domains.” arXiv, Nov. 14, 2018. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1805.07588
[40] Q. Dou, D. C. Castro, K. Kamnitsas, and B. Glocker, “Domain Generalization via Model-Agnostic Learning of Semantic Features.” arXiv, Oct. 29, 2019. doi: 10.48550/arXiv.1910.13580.
[41] Y. Ren et al., “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.” arXiv, Mar. 04, 2021. Accessed: Jul. 01, 2022. [Online]. Available: http://arxiv.org/abs/2006.04558
[42] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi,” in Interspeech 2017, Aug. 2017. doi: 10.21437/Interspeech.2017-1386.
[43] A. Suni, D. Aalto, T. Raitio, P. Alku, and M. Vainio, “Wavelets for intonation modeling in HMM speech synthesis,” in 8th ISCA Speech Synthesis Workshop (SSW8), 2013.
[44] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline.” arXiv, Sep. 16, 2017. doi: 10.48550/arXiv.1709.05522.
[45] Y. Shi, H. Bu, X. Xu, S. Zhang, and M. Li, “AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines.” arXiv, Apr. 22, 2021. doi: 10.48550/arXiv.2010.11567.
[46] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization.” arXiv, Jan. 29, 2017. doi: 10.48550/arXiv.1412.6980.
[47] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[48] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization.” arXiv, Jul. 21, 2016. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1607.06450
[49] D. Hendrycks and K. Gimpel, “Gaussian Error Linear Units (GELUs).” arXiv, Jul. 08, 2020. Accessed: Jul. 17, 2022. [Online]. Available: http://arxiv.org/abs/1606.08415
[50] L. van der Maaten and G. Hinton, “Visualizing Data using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605, 2008. Accessed: Jul. 15, 2022. [Online]. Available: http://www.cs.toronto.edu/~hinton/absps/tsne.pdf