References
[1] S. Boll, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 113–120, Apr. 1979.
[2] A. Yelwande, S. Kansal and A. Dixit, “Adaptive Wiener filter for speech enhancement,” 2017 International Conference on Information, Communication, Instrumentation and Control (ICICIC), 2017, pp. 1–4.
[3] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean square error short-time spectral amplitude estimator,” in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, Dec. 1984.
[4] I. Cohen and B. Berdugo, “Noise estimation by minima controlled recursive averaging for robust speech enhancement,” in IEEE Signal Processing Letters, vol. 9, no. 1, pp. 12–15, Jan. 2002.
[5] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[6] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 448–456.
[7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[8] D. Stoller, S. Ewert and S. Dixon, “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” arXiv preprint arXiv:1806.03185, 2018.
[9] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[10] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey and B. Schuller, “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in Latent Variable Analysis and Signal Separation (Lecture Notes in Computer Science), Springer, 2015, pp. 91–99.
[11] D. Wang, “On ideal binary mask as the computational goal of auditory scene analysis,” in Speech Separation by Humans and Machines. Springer, 2005, pp. 181–197.
[12] A. Narayanan and D. Wang, “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 7092–7096.
[13] H. Erdogan, J. R. Hershey, S. Watanabe and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 708–712.
[14] D. S. Williamson, Y. Wang and D. Wang, “Complex ratio masking for monaural speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 483–492, 2016.
[15] K. Paliwal, K. Wójcicki and B. Shannon, “The importance of phase in speech enhancement,” Speech Communication, vol. 53, no. 4, pp. 465–494, 2011.
[16] C. K. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun et al., “The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results,” arXiv preprint arXiv:2005.13981, 2020.
[17] Y. Xia, S. Braun, C. K. A. Reddy, H. Dubey, R. Cutler and I. Tashev, “Weighted Speech Distortion Losses for Neural-Network-Based Real-Time Speech Enhancement,” ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 871–875.
[18] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang and L. Xie, “DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement,” arXiv preprint arXiv:2008.00264, 2020.
[19] X. Hao, X. Su, R. Horaud and X. Li, “FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement,” ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6633–6637.
[20] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., “Attention u-net: Learning where to look for the pancreas,” arXiv preprint arXiv:1804.03999, 2018.
[21] S. Braun and I. Tashev, “Data augmentation and loss normalization for deep noise suppression,” arXiv preprint arXiv:2008.06412, 2020.
[22] C. H. Taal, R. C. Hendriks, R. Heusdens and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4214–4217.
[23] V. Panayotov, G. Chen, D. Povey and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 5206–5210.
[24] J. F. Gemmeke et al., “Audio Set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, Mar. 2017, pp. 776–780.
[25] J. Thiemann, N. Ito and E. Vincent, “The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings,” presented at ICA 2013, Montreal, Canada, 2013, paper 035081.
[26] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224.
[27] G. Pirker, M. Wohlmayr, S. Petrik and F. Pernkopf, “A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario,” in Proc. Interspeech, 2011.