References
[1] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Simultaneous Modeling of Spectrum, Pitch and Duration in HMM-Based Speech Synthesis,” in Proc. Eurospeech, 1999.
[2] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” in Proc. ICASSP 2000, 2000. [Online]. Available: https://ieeexplore.ieee.org/document/861820 (accessed Jul. 04, 2022).
[3] Y. Ren et al., “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech,” arXiv, arXiv:2006.04558, Mar. 2021. doi: 10.48550/arXiv.2006.04558.
[4] J. Donahue, S. Dieleman, M. Bińkowski, E. Elsen, and K. Simonyan, “End-to-End Adversarial Text-to-Speech.” arXiv, Mar. 17, 2021. Accessed: Jul. 04, 2022. [Online]. Available: http://arxiv.org/abs/2006.03575
[5] R. J. Weiss, R. Skerry-Ryan, E. Battenberg, S. Mariooryad, and D. P. Kingma, “Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2021, pp. 5679–5683. doi: 10.1109/ICASSP39728.2021.9413851.
[6] Y. Wang et al., “Tacotron: Towards End-to-End Speech Synthesis.” arXiv, Apr. 06, 2017. doi: 10.48550/arXiv.1703.10135.
[7] J. Shen et al., “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.” arXiv, Feb. 15, 2018. doi: 10.48550/arXiv.1712.05884.
[8] Y. Ren et al., “FastSpeech: Fast, Robust and Controllable Text to Speech.” arXiv, Nov. 20, 2019. doi: 10.48550/arXiv.1905.09263.
[9] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou, “Neural Speech Synthesis with Transformer Network.” arXiv, Jan. 30, 2019. doi: 10.48550/arXiv.1809.08895.
[10] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan, “Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes.” arXiv, Nov. 21, 2018. doi: 10.48550/arXiv.1811.09021.
[11] Y. Zhang et al., “Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning.” arXiv, Jul. 24, 2019. doi: 10.48550/arXiv.1907.04448.
[12] Z. Liu and B. Mak, “Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers.” arXiv, Nov. 26, 2019. doi: 10.48550/arXiv.1911.11601.
[13] J. Yang and L. He, “Towards Universal Text-to-Speech,” in Interspeech 2020, Oct. 2020, pp. 3171–3175. doi: 10.21437/Interspeech.2020-1590.
[14] Z. Cai, Y. Yang, and M. Li, “Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario.” arXiv, May 20, 2020. doi: 10.48550/arXiv.2005.10441.
[15] Y. Cao et al., “End-to-end Code-switched TTS with Mix of Monolingual Recordings,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019. doi: 10.1109/ICASSP.2019.8682927.
[16] L. Xue, W. Song, G. Xu, L. Xie, and Z. Wu, “Building a mixed-lingual neural TTS system with only monolingual data.” arXiv, Aug. 22, 2019. doi: 10.48550/arXiv.1904.06063.
[17] X. Zhou, X. Tian, G. Lee, R. Das, and H. Li, “End-to-End Code-Switching TTS with Cross-Lingual Language Model,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, p. 7618. doi: 10.1109/ICASSP40776.2020.9054722.
[18] H. Hemati and D. Borth, “Using IPA-Based Tacotron for Data Efficient Cross-Lingual Speaker Adaptation and Pronunciation Enhancement.” arXiv, Mar. 31, 2022. Accessed: Jul. 05, 2022. [Online]. Available: http://arxiv.org/abs/2011.06392
[19] S. Zhao, T. H. Nguyen, H. Wang, and B. Ma, “Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion.” arXiv, Oct. 15, 2020. Accessed: Jul. 05, 2022. [Online]. Available: http://arxiv.org/abs/2010.08136
[20] S. Nakayama, A. Tjandra, S. Sakti, and S. Nakamura, “Speech Chain for Semi-Supervised Learning of Japanese-English Code-Switching ASR and TTS,” in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, doi: 10.1109/SLT.2018.8639674.
[21] A. Vaswani et al., “Attention Is All You Need.” arXiv, Dec. 05, 2017. doi: 10.48550/arXiv.1706.03762.
[22] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv, May 19, 2016. Accessed: Jul. 04, 2022. [Online]. Available: http://arxiv.org/abs/1409.0473
[23] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” arXiv preprint arXiv:2106.07447, Jun. 2021, Accessed: Apr. 30, 2022. [Online]. Available: http://arxiv.org/abs/2106.07447
[24] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” arXiv preprint arXiv:2006.11477, Oct. 2020, Accessed: Apr. 30, 2022. [Online]. Available: http://arxiv.org/abs/2006.11477
[25] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Apr. 2018, pp. 5329–5333. doi: 10.1109/ICASSP.2018.8461375.
[26] Y. Ganin et al., “Domain-Adversarial Training of Neural Networks,” arXiv, arXiv:1505.07818, May 2016. doi: 10.48550/arXiv.1505.07818.
[27] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv preprint arXiv:1810.04805, May 2019, Accessed: Apr. 30, 2022. [Online]. Available: http://arxiv.org/abs/1810.04805
[28] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized Autoregressive Pretraining for Language Understanding,” arXiv preprint arXiv:1906.08237, Jan. 2020, Accessed: Apr. 30, 2022. [Online]. Available: http://arxiv.org/abs/1906.08237
[29] A. van den Oord, Y. Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding,” arXiv preprint arXiv:1807.03748, Jan. 2019, Accessed: Apr. 30, 2022. [Online]. Available: http://arxiv.org/abs/1807.03748
[30] C. Wang et al., “UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data,” arXiv preprint arXiv:2101.07597, Jun. 2021, Accessed: Apr. 30, 2022. [Online]. Available: http://arxiv.org/abs/2101.07597
[31] S. Chen et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” arXiv, arXiv:2110.13900, Jan. 2022. doi: 10.48550/arXiv.2110.13900.
[32] F. Wang, W. Liu, H. Liu, and J. Cheng, “Additive Margin Softmax for Face Verification,” IEEE Signal Process. Lett., vol. 25, no. 7, pp. 926–930, Jul. 2018, doi: 10.1109/LSP.2018.2822810.
[33] M. Zhao, Y. Ma, M. Liu, and M. Xu, “The SpeakIn System for VoxCeleb Speaker Recognition Challenge 2021.” arXiv, Sep. 05, 2021. Accessed: Jun. 23, 2022. [Online]. Available: http://arxiv.org/abs/2109.01989
[34] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi,” 2017. doi: 10.21437/INTERSPEECH.2017-1386.
[35] K. Shih, R. Valle, R. Badlani, A. Lancucki, W. Ping, and B. Catanzaro, “RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis,” in ICML 2021 Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.
[36] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989, doi: 10.1109/5.18626.
[37] J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search,” arXiv, arXiv:2005.11129, Oct. 2020. doi: 10.48550/arXiv.2005.11129.
[38] J. Kahn et al., “Libri-Light: A Benchmark for ASR with Limited or No Supervision,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 7669–7673. doi: 10.1109/ICASSP40776.2020.9052942.
[39] G. Chen et al., “GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio.” arXiv, Jun. 13, 2021. Accessed: Jul. 18, 2022. [Online]. Available: http://arxiv.org/abs/2106.06909
[40] C. Wang et al., “VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation.” arXiv, Jul. 27, 2021. Accessed: Jul. 18, 2022. [Online]. Available: http://arxiv.org/abs/2101.00390
[41] T. Wolf et al., “HuggingFace’s Transformers: State-of-the-art Natural Language Processing.” arXiv, Jul. 13, 2020. Accessed: Jul. 18, 2022. [Online]. Available: http://arxiv.org/abs/1910.03771
[42] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis.” arXiv, Oct. 23, 2020. doi: 10.48550/arXiv.2010.05646.
[43] Y. Jia, H. Zen, J. Shen, Y. Zhang, and Y. Wu, “PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS.” arXiv, Jun. 07, 2021. Accessed: Jul. 19, 2022. [Online]. Available: http://arxiv.org/abs/2103.15060
[44] G. Zhang et al., “Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech,” 2022.