References
[1] Zhang, Y., Xu, C., Hu, B., Zhang, C., Xiao, T., & Zhu, J. (2023). Improving End-to-End Speech Translation by Leveraging Auxiliary Speech and Text Data. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 13984-13992. https://doi.org/10.1609/aaai.v37i11.26637
[2] Popuri, S., Chen, P. J., Wang, C., Pino, J., Adi, Y., Gu, J., ... & Lee, A. (2022). Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. arXiv preprint arXiv:2204.02967.
[3] Song, K., Ren, Y., Lei, Y., Wang, C., Wei, K., Xie, L., ... & Ma, Z. (2023). StyleS2ST: Zero-shot style transfer for direct speech-to-speech translation. arXiv preprint arXiv:2305.17732.
[4] Huang, R., Liu, J., Liu, H., Ren, Y., Zhang, L., He, J., & Zhao, Z. (2022). TranSpeech: Speech-to-speech translation with bilateral perturbation. arXiv preprint arXiv:2205.12523.
[5] Lee, A., Chen, P. J., Wang, C., Gu, J., Popuri, S., Ma, X., ... & Hsu, W. N. (2021). Direct speech-to-speech translation with discrete units. arXiv preprint arXiv:2107.05604.
[6] Barrault, L., Chung, Y. A., Meglioli, M. C., Dale, D., Dong, N., Duquenne, P. A., ... & Wang, S. (2023). SeamlessM4T: Massively Multilingual & Multimodal Machine Translation. arXiv preprint arXiv:2308.11596.
[7] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
[8] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[9] Gulati, A., Qin, J., Chiu, C. C., Parmar, N., Zhang, Y., Yu, J., ... & Pang, R. (2020). Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.
[10] Bello, I., Zoph, B., Vaswani, A., Shlens, J., & Le, Q. V. (2019). Attention augmented convolutional networks. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 3286-3295).
[11] Lu, Y., Li, Z., He, D., Sun, Z., Dong, B., Qin, T., ... & Liu, T. Y. (2019). Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv preprint arXiv:1906.02762.
[12] Hsu, W. N., Bolte, B., Tsai, Y. H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451-3460.
[13] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33, 9912-9924.
[14] Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33, 12449-12460.
[15] Jang, E., Gu, S., & Poole, B. (2016). Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
[16] Chung, Y. A., Zhang, Y., Han, W., Chiu, C. C., Qin, J., Pang, R., & Wu, Y. (2021, December). W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 244-250). IEEE.
[17] Oord, A. V. D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., ... & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
[18] Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., ... & Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.
[19] Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., ... & Wu, Y. (2018, April). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4779-4783). IEEE.
[20] Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2019). FastSpeech: Fast, robust and controllable text to speech. Advances in neural information processing systems, 32.
[21] Kim, J., Kong, J., & Son, J. (2021, July). Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning (pp. 5530-5540). PMLR.
[22] Wang, C., Chen, S., Wu, Y., Zhang, Z., Zhou, L., Liu, S., ... & Wei, F. (2023). Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111.
[23] Kawamura, M., Shirahata, Y., Yamamoto, R., & Tachibana, K. (2023, June). Lightweight and high-fidelity end-to-end text-to-speech with multi-band generation and inverse short-time Fourier transform. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.
[24] Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in neural information processing systems, 33, 17022-17033.
[25] Popuri, S., Chen, P. J., Wang, C., Pino, J., Adi, Y., Gu, J., ... & Lee, A. (2022). Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. arXiv preprint arXiv:2204.02967.
[26] Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., ... & Zettlemoyer, L. (2020). Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8, 726-742.
[27] Huang, R., Liu, J., Liu, H., Ren, Y., Zhang, L., He, J., & Zhao, Z. (2022). TranSpeech: Speech-to-speech translation with bilateral perturbation. arXiv preprint arXiv:2205.12523.
[28] Wang, Y., Zhai, C., & Awadalla, H. H. (2020). Multi-task learning for multilingual neural machine translation. arXiv preprint arXiv:2010.02523.
[29] https://github.com/Helsinki-NLP/OPUS-MT-train
[30] Tiedemann, J., & Thottingal, S. (2020). OPUS-MT – Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (pp. 479–480). Lisboa, Portugal: European Association for Machine Translation.
[31] https://opus.nlpl.eu/
[32] https://huggingface.co/Helsinki-NLP/opus-mt-zh-en
[33] https://huggingface.co/facebook/w2v-bert-2.0
[34] Dong, Q., Ye, R., Wang, M., Zhou, H., Xu, S., Xu, B., & Li, L. (2021, May). Listen, understand and translate: Triple supervision decouples end-to-end speech-to-text translation. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 14, pp. 12749-12759).
[35] Xu, C., Hu, B., Li, Y., Zhang, Y., Ju, Q., Xiao, T., & Zhu, J. (2021). Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation encoders. arXiv preprint arXiv:2105.05752.
[36] Xu, C., Hu, B., Li, Y., Zhang, Y., Ju, Q., Xiao, T., & Zhu, J. (2021). Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation encoders. arXiv preprint arXiv:2105.05752.
[37] https://huggingface.co/facebook/seamless-m4t-v2-large
[38] Ito, K. (2017). The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/
[39] Post, M. (2018). A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.
[40] Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., ... & Weber, G. (2019). Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670.
[41] https://openai.com/chatgpt/
[42] https://huggingface.co/openai/whisper-large-v3
[43] Conneau, A., Ma, M., Khanuja, S., Zhang, Y., Axelrod, V., Dalmia, S., ... & Bapna, A. (2023, January). FLEURS: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT) (pp. 798-805). IEEE.