References
[1] M. Janke and L. Diener, "EMG-to-speech: Direct generation of speech from facial electromyographic signals," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, pp. 2375–2385, 2017.
[2] D. Gaddy and D. Klein, "Digital voicing of silent speech," in Proc. EMNLP, 2020.
[3] B. Cao, N. Sebkhi, T. Mau, O. Inan, and J. Wang, "Permanent magnetic articulograph (PMA) vs. electromagnetic articulograph (EMA) in articulation-to-speech synthesis for silent speech interface," 2019.
[4] Y.-W. Chen, K.-H. Hung, S.-Y. Chuang, J. Sherman, W.-C. Huang, X. Lu, and Y. Tsao, "EMA2S: An end-to-end multimodal articulatory-to-speech system," in Proc. ISCAS, 2021.
[5] G. Gosztolya, T. Grósz, L. Tóth, A. Markó, and T. Csapó, "Applying DNN adaptation to reduce the session dependency of ultrasound tongue imaging-based silent speech interfaces," Acta Polytechnica Hungarica, vol. 17, pp. 109–124, 2020.
[6] B. McMicken, A. Kunihiro, L. Wang, S. V. Berg, and K. Rogers, "Electropalatography in a case of congenital aglossia," Journal of Communication Disorders, Deaf Studies & Hearing Aids, vol. 2, pp. 1–7, 2014.
[7] J. Verhoeven, N. Miller, L. Daems, and C. C. Reyes-Aldasoro, “Visualisation and analysis of speech production with electropalatography,” J. Imaging, vol. 5, p. 40, 2019.
[8] J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust automatic speech recognition: A bridge to practical applications. Academic Press, October 2015.
[9] D. Michelsanti and Z.-H. Tan, "Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification," in Proc. INTERSPEECH, 2017.
[10] Y. Lai, F. Chen, S. Wang, X. Lu, Y. Tsao, and C. Lee, "A deep denoising autoencoder approach to improving the intelligibility of vocoded speech in cochlear implant simulation," IEEE Transactions on Biomedical Engineering, vol. 64, pp. 1568–1578, 2017.
[11] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 113–120, 1979.
[12] P. Scalart and J. V. Filho, “Speech enhancement based on a priori signal to noise estimation,” in Proc. ICASSP, 1996.
[13] Y. Hu and P. C. Loizou, "A subspace approach for enhancing speech corrupted by colored noise," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 4, 2003.
[14] A. Rezayee and S. Gazor, "An adaptive KLT approach for speech enhancement," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 2, pp. 87–95, 2001.
[15] P. S. Huang, S. D. Chen, P. Smaragdis, and M. A. Hasegawa-Johnson, "Singing-voice separation from monaural recordings using robust principal component analysis," in Proc. ICASSP, 2012.
[16] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Proc. INTERSPEECH, 2013.
[17] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, pp. 7–19, 2015.
[18] A. Pandey and D. Wang, "A new framework for CNN-based speech enhancement in the time domain," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, pp. 1179–1188, 2019.
[19] J. Qi, H. Hu, Y. Wang, C.-H. H. Yang, S. M. Siniscalchi, and C.-H. Lee, “Exploring deep hybrid tensor-to-vector network architectures for regression based speech enhancement,” arXiv preprint arXiv:2007.13024, 2020.
[20] S. Fu, T. Wang, Y. Tsao, X. Lu, and H. Kawai, “End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, pp. 1570–1584, 2018.
[21] H. Erdogan, J. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in Proc. ICASSP, 2015, pp. 708–712.
[22] L. Sun, J. Du, L.-R. Dai, and C.-H. Lee, "Multiple-target deep learning for LSTM-RNN based speech enhancement," in Proc. HSCMA, 2017, pp. 136–140.
[23] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang, “Audiovisual speech enhancement using multimodal deep convolutional neural networks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 117–128, 2018.
[24] S.-Y. Chuang, Y. Tsao, C.-C. Lo, and H.-M. Wang, "Lite audio-visual speech enhancement," in Proc. INTERSPEECH, 2020.
[25] C. Yu, K.-H. Hung, S.-S. Wang, Y. Tsao, and J.-W. Hung, "Time-domain multi-modal bone/air conducted speech enhancement," IEEE Signal Processing Letters, vol. 27, pp. 1035–1039, 2020.
[26] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
[27] T. N. Sainath, A.-R. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in Proc. ICASSP, 2013, pp. 8614–8618.
[28] T. Sainath, B. Kingsbury, G. Saon, H. Soltau, A.-R. Mohamed, G. E. Dahl, and B. Ramabhadran, "Deep convolutional neural networks for large-scale speech tasks," Neural Networks, vol. 64, pp. 39–48, 2015.
[29] C.-H. Yang, J. Qi, P.-Y. Chen, X. Ma, and C.-H. Lee, "Characterizing speech adversarial examples using self-attention U-Net enhancement," in Proc. ICASSP, 2020, pp. 3107–3111.
[30] K. Kinoshita, M. Delcroix, A. Ogawa, and T. Nakatani, "Text-informed speech enhancement with deep neural networks," in Proc. INTERSPEECH, 2015.
[31] D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, "Deep-learning-based audiovisual speech enhancement in presence of Lombard effect," Speech Communication, vol. 115, pp. 38–50, 2019.
[32] Y.-W. Chen, K.-H. Hung, S.-Y. Chuang, J. Sherman, X. Lu, and Y. Tsao, “A study of incorporating articulatory movement information in speech enhancement,” arXiv preprint arXiv:2011.01691, 2020.
[33] P. Atrey, M. A. Hossain, A. E. Saddik, and M. Kankanhalli, “Multimodal fusion for multimedia analysis: a survey,” Multimedia Systems, vol. 16, pp. 345–379, 2010.
[34] L. A. Gatys, A. S. Ecker, and M. Bethge, "A neural algorithm of artistic style," arXiv preprint arXiv:1508.06576, 2015.
[35] Q. Chen and V. Koltun, "Photographic image synthesis with cascaded refinement networks," in Proc. ICCV, 2017, pp. 1520–1529.
[36] L. Wang, Y. Li, and S. Lazebnik, "Learning deep structure-preserving image-text embeddings," in Proc. CVPR, 2016, pp. 5005–5013.
[37] F. Germain, Q. Chen, and V. Koltun, "Speech denoising with deep feature losses," in Proc. INTERSPEECH, 2019.
[38] J. F. Hacking, B. L. Smith, and E. M. Johnson, "Utilizing electropalatography to train palatalized versus unpalatalized consonant productions by native speakers of American English learning Russian," 2017.
[39] M.-W. Huang, "Development of Taiwan Mandarin hearing in noise test," 2005.
[40] G. Hu, "100 nonspeech environmental sounds," 2004. [Online]. Available: http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/HuCorpus.html
[41] N. Perraudin, P. Balázs, and P. Søndergaard, "A fast Griffin-Lim algorithm," in Proc. WASPAA, 2013, pp. 1–4.
[42] A. Rix, J. Beerends, M. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proc. ICASSP, vol. 2, 2001, pp. 749–752.
[43] C. Taal, R. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, pp. 2125–2136, 2011.
[44] J. Jensen and C. Taal, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, pp. 2009–2022, 2016.
[45] A. Parveen, H. Inbarani, and E. Sathishkumar, "Performance analysis of unsupervised feature selection methods," in Proc. International Conference on Computing, Communication and Applications, 2012, pp. 1–7.