References
〔1〕Graves, A., Mohamed, A. R., & Hinton, G. (2013, May). Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 6645-6649). IEEE.
〔2〕Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
〔3〕Auer, P. (Ed.). (2013). Code-switching in conversation: Language, interaction and identity. Routledge.
〔4〕Knill, K. M., Gales, M. J., Rath, S. P., Woodland, P. C., Zhang, C., & Zhang, S. X. (2013, December). Investigation of multilingual deep neural networks for spoken term detection. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (pp. 138-143). IEEE.
〔5〕Grézl, F., Egorova, E., & Karafiát, M. (2016). Study of large data resources for multilingual training and system porting. Procedia Computer Science, 81, 15-22.
〔6〕Dalmia, S., Sanabria, R., Metze, F., & Black, A. W. (2018, April). Sequence-based multi-lingual low resource speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4909-4913). IEEE.
〔7〕Cho, J., Baskar, M. K., Li, R., Wiesner, M., Mallidi, S. H., Yalta, N., ... & Hori, T. (2018, December). Multilingual sequence-to-sequence speech recognition: Architecture, transfer learning, and language modeling. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 521-527). IEEE.
〔8〕Lyu, D. C., & Lyu, R. Y. (2008). Language identification on code-switching utterances using multiple cues. In Ninth Annual Conference of the International Speech Communication Association.
〔9〕Toshniwal, S., Sainath, T. N., Weiss, R. J., Li, B., Moreno, P., Weinstein, E., & Rao, K. (2018, April). Multilingual speech recognition with a single end-to-end model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4904-4908). IEEE.
〔10〕Zeng, Z., Khassanov, Y., Pham, V. T., Xu, H., Chng, E. S., & Li, H. (2018). On the end-to-end solution to Mandarin-English code-switching speech recognition. arXiv preprint arXiv:1811.00241.
〔11〕Lyu, D. C., Tan, T. P., Chng, E. S., & Li, H. (2010). SEAME: A Mandarin-English code-switching speech corpus in South-East Asia. In Eleventh Annual Conference of the International Speech Communication Association.
〔12〕Chan, W., Jaitly, N., Le, Q., & Vinyals, O. (2016, March). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4960-4964). IEEE.
〔13〕Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006, June). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (pp. 369-376).
〔14〕Graves, A. (2012). Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.
〔15〕Kim, S., Hori, T., & Watanabe, S. (2017, March). Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4835-4839). IEEE.
〔16〕Chorowski, J., Weiss, R. J., Bengio, S., & van den Oord, A. (2019). Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12), 2041-2053.
〔17〕Chung, Y. A., & Glass, J. (2018). Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. arXiv preprint arXiv:1803.08976.
〔18〕Chung, Y. A., Wu, C. C., Shen, C. H., Lee, H. Y., & Lee, L. S. (2016). Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder. arXiv preprint arXiv:1603.00982.
〔19〕Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.
〔20〕Baevski, A., Schneider, S., & Auli, M. (2019). vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453.
〔21〕Oord, A. V. D., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
〔22〕Chung, Y. A., Hsu, W. N., Tang, H., & Glass, J. (2019). An unsupervised autoregressive model for speech representation learning. arXiv preprint arXiv:1904.03240.
〔23〕Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).
〔24〕Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
〔25〕Graves, A., Jaitly, N., & Mohamed, A. R. (2013, December). Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (pp. 273-278). IEEE.
〔26〕Graves, A., & Jaitly, N. (2014, June). Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning (pp. 1764-1772). PMLR.
〔27〕Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
〔28〕Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 1097-1105.
〔29〕Wiseman, S., & Rush, A. M. (2016). Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960.
〔30〕Zhang, Q., Lu, H., Sak, H., Tripathi, A., McDermott, E., Koo, S., & Kumar, S. (2020, May). Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7829-7833). IEEE.
〔31〕Yeh, C. F., Mahadeokar, J., Kalgaonkar, K., Wang, Y., Le, D., Jain, M., ... & Seltzer, M. L. (2019). Transformer-transducer: End-to-end speech recognition with self-attention. arXiv preprint arXiv:1910.12977.
〔32〕Tripathi, A., Kim, J., Zhang, Q., Lu, H., & Sak, H. (2020). Transformer transducer: One model unifying streaming and non-streaming speech recognition. arXiv preprint arXiv:2010.03192.
〔33〕Huang, W., Hu, W., Yeung, Y. T., & Chen, X. (2020). Conv-transformer transducer: Low latency, low frame rate, streamable end-to-end speech recognition. arXiv preprint arXiv:2008.05750.
〔34〕Lyu, D. C., Lyu, R. Y., Chiang, Y. C., & Hsu, C. N. (2006, May). Speech recognition on code-switching among the Chinese dialects. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings (Vol. 1, pp. I-I). IEEE.
〔35〕Ardila, A. (2005). Spanglish: An anglicized Spanish dialect. Hispanic Journal of Behavioral Sciences, 27(1), 60-81.
〔36〕Lyudovyk, T., & Pylypenko, V. (2014). Code-switching speech recognition for closely related languages. In Spoken Language Technologies for Under-Resourced Languages.
〔37〕Chan, J. Y., Ching, P. C., Lee, T., & Meng, H. M. (2004, December). Detection of language boundary in code-switching utterances by bi-phone probabilities. In 2004 International Symposium on Chinese Spoken Language Processing (pp. 293-296). IEEE.
〔38〕Zissman, M. A. (1996). Comparison of four approaches to automatic language identification of telephone speech. IEEE Transactions on Speech and Audio Processing, 4(1), 31-44.
〔39〕Mabokela, K. R., Manamela, M. J., & Manaileng, M. (2014). Modeling code-switching speech on under-resourced languages for language identification. In Spoken Language Technologies for Under-Resourced Languages.
〔40〕Nakayama, S., Tjandra, A., Sakti, S., & Nakamura, S. (2018, December). Speech chain for semi-supervised learning of Japanese-English code-switching ASR and TTS. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 182-189). IEEE.
〔41〕Ullah, A., & Ahmed, T. (2020). Code switching language model using monolingual training data. arXiv preprint arXiv:2012.12543.
〔42〕Yılmaz, E., Heuvel, H. V. D., & van Leeuwen, D. A. (2018). Acoustic and textual data augmentation for improved ASR of code-switching speech. arXiv preprint arXiv:1807.10945.
〔43〕Zhang, S., Liu, Y., Lei, M., Ma, B., & Xie, L. (2019). Towards language-universal Mandarin-English speech recognition. In INTERSPEECH (pp. 2170-2174).
〔44〕Zhou, X., Yılmaz, E., Long, Y., Li, Y., & Li, H. (2020). Multi-encoder-decoder transformer for code-switching speech recognition. arXiv preprint arXiv:2006.10414.
〔45〕Dalmia, S., Liu, Y., Ronanki, S., & Kirchhoff, K. (2021, June). Transformer-transducers for code-switched speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5859-5863). IEEE.
〔46〕Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
〔47〕Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32.
〔48〕Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
〔49〕Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
〔50〕Liu, A. T., Yang, S. W., Chi, P. H., Hsu, P. C., & Lee, H. Y. (2020, May). Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6419-6423). IEEE.
〔51〕Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
〔52〕Jang, E., Gu, S., & Poole, B. (2016). Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.
〔53〕Jiang, D., Lei, X., Li, W., Luo, N., Hu, Y., Zou, W., & Li, X. (2019). Improving transformer-based speech recognition using unsupervised pre-training. arXiv preprint arXiv:1910.09932.
〔54〕Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929-1958.
〔55〕Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
〔56〕Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., ... & Liu, T. (2020, November). On layer normalization in the transformer architecture. In International Conference on Machine Learning (pp. 10524-10533). PMLR.
〔57〕Park, D. S., Chan, W., Zhang, Y., Chiu, C. C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.
〔58〕Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P. E., ... & Dupoux, E. (2020, May). Libri-light: A benchmark for ASR with limited or no supervision. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7669-7673). IEEE.
〔59〕Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., ... & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 8026-8037.
〔60〕Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., ... & Auli, M. (2019). fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038.
〔61〕He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision (pp. 1026-1034).
〔62〕Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
〔63〕Zhu, X., & Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1), 1-130.
〔64〕Zhu, X. J. (2005). Semi-supervised learning literature survey. Computer Sciences Technical Report 1530, University of Wisconsin-Madison.
〔65〕Zhang, Y., Qin, J., Park, D. S., Han, W., Chiu, C. C., Pang, R., ... & Wu, Y. (2020). Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504.
〔66〕Xu, Q., Baevski, A., Likhomanenko, T., Tomasello, P., Conneau, A., Collobert, R., ... & Auli, M. (2021, June). Self-training and pre-training are complementary for speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3030-3034). IEEE.