Abstract (English) |
In recent years, smart speakers have come into full swing. Amazon's smart speaker, the Echo, successfully changed how customers use home appliances, and its voice assistant, Alexa, lets users issue commands by voice. Smart-speaker technology divides into a front end and a back end. Front-end technology runs on the device itself and includes noise reduction, speech enhancement, echo cancellation, voice activity detection, and so on; back-end technology runs on the server and includes speech recognition, semantic understanding, and related tasks. Firms invest considerable effort in all of these technologies.
In this thesis, we build on previous research and implement robust wake-word detection on an embedded system. The system combines two smart-speaker techniques: wake-word detection and noise reduction. For wake-word detection, Mel-frequency cepstral coefficients (MFCCs) are extracted from the voice signal as features and fed into a convolutional neural network, whose outputs are the probabilities of each wake-word class; these probabilities determine whether a wake word has been detected. For noise reduction, the short-time Fourier transform (STFT) of the time-frequency mixed signal is computed, its energy is fed into a recurrent neural network for training, and the network outputs a noise mask and a speech mask; applying these masks in a GEV beamformer achieves noise reduction. |
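The MFCC front end described above can be sketched in plain numpy. This is a minimal illustration of the standard pipeline (framing and windowing, power spectrum, mel filterbank, log, DCT), not the thesis implementation; the sample rate, frame length, hop, filter count, and number of coefficients below are common defaults assumed for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fbank[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[i - 1, k] = (hi - k) / max(hi - c, 1)
    return fbank

def mfcc(signal, sr=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    # 1. Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Mel filterbank energies, then log compression.
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # 4. DCT-II decorrelates the log energies; keep the first n_ceps coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_filters)))
    return log_mel @ dct.T  # shape: (n_frames, n_ceps)
```

The resulting (frames × coefficients) matrix is what a small-footprint CNN keyword spotter, as in Sainath and Parada [8], takes as its input feature map.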
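The mask-driven GEV beamformer can likewise be sketched with numpy. Assuming the RNN has already produced a speech mask and a noise mask per time-frequency bin (as in Heymann et al. [14]), the beamformer estimates mask-weighted spatial covariance matrices per frequency and takes the principal generalized eigenvector, which maximizes the output SNR. This is an illustrative sketch, not the thesis code; the array shapes and the small regularization term are assumptions.

```python
import numpy as np

def gev_beamform(stft, speech_mask, noise_mask):
    """Apply a GEV beamformer to a multichannel STFT.

    stft: complex array of shape (C, F, T) - channels, frequencies, frames.
    speech_mask, noise_mask: real arrays of shape (F, T) with values in [0, 1].
    Returns the single-channel beamformed STFT of shape (F, T).
    """
    C, F, T = stft.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Y = stft[:, f, :]  # (C, T) observations at this frequency
        # Mask-weighted spatial covariance (PSD) estimates for speech and noise.
        phi_xx = (speech_mask[f] * Y) @ Y.conj().T / max(speech_mask[f].sum(), 1e-10)
        phi_nn = (noise_mask[f] * Y) @ Y.conj().T / max(noise_mask[f].sum(), 1e-10)
        phi_nn += 1e-10 * np.eye(C)  # regularize before inversion
        # GEV criterion: principal eigenvector of phi_nn^{-1} phi_xx.
        vals, vecs = np.linalg.eig(np.linalg.inv(phi_nn) @ phi_xx)
        w = vecs[:, np.argmax(vals.real)]
        out[f] = w.conj() @ Y  # filter-and-sum across channels
    return out
```

In practice (e.g. in [14]) a blind analytic normalization step is applied afterward to reduce the arbitrary per-frequency scaling of the eigenvector; that step is omitted here for brevity.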
References |
[1] Logan, Beth. "Mel Frequency Cepstral Coefficients for Music Modeling." ISMIR. Vol. 270. 2000.
[2] LeCun, Y., Bengio, Y. and Hinton, G., 2015. Deep learning. Nature, 521(7553), pp.436-444
[3] S. Hamid Nawab and Thomas F. Quatieri, "Short-time Fourier transform," Advanced Topics in Signal Processing, Prentice-Hall, Inc., Upper Saddle River, NJ, 1987.
[4] L. C. Jain and L. R. Medsker, Recurrent Neural Networks: Design and Applications, CRC Press, Inc., Boca Raton, FL, 1999.
[5] Warsitz, Ernst, and Reinhold Haeb-Umbach. "Blind acoustic beamforming based on generalized eigenvalue decomposition." IEEE Transactions on audio, speech, and language processing 15.5 (2007): 1529-1539.
[6] Lawrence R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77 (2), p. 257–286, February 1989
[7] Forney, G. David. "The viterbi algorithm." Proceedings of the IEEE 61.3 (1973): 268-278.
[8] Sainath, Tara N., and Carolina Parada. "Convolutional neural networks for small-footprint keyword spotting." Sixteenth Annual Conference of the International Speech Communication Association. 2015.
[9] Xiao, Xiong, et al. "A study of learning based beamforming methods for speech recognition." CHiME 2016 workshop. 2016.
[10] Loss Function. [Online]. Available: https://en.wikipedia.org/wiki/Loss_function. [Accessed: 16-Aug-2018].
[11] Delay Sum Filter. [Online]. Available: http://www.labbookpages.co.uk/audio/beamforming/delaySum.html. [Accessed: 16-Aug-2018].
[12] Backpropagation. [Online]. Available: https://zh.wikipedia.org/wiki/%E5%8F%8D%E5%90%91%E4%BC%A0%E6%92%AD%E7%AE%97%E6%B3%95. [Accessed: 16-Aug-2018].
[13] S. Hochreiter and J. Schmidhuber. “Long short-term memory”. Neural Computation, vol. 9, pp. 1735–1780, 1997.
[14] Heymann, Jahn, et al. "BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge." Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015.
[15] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences of the USA, vol. 79, no. 8, pp. 2554–2558, April 1982.
[16] A. Krenker, J. Bešter and A. Kos, "Introduction to the Artificial Neural Networks," Artificial Neural Networks - Methodological Advances and Biomedical Applications, ISBN: 978-953-307-243-2, 2011.
[17] Deep Learning in a Nutshell: Sequence Learning. [Online]. Available: https://devblogs.nvidia.com/deep-learning-nutshell-sequence-learning/. [Accessed: 16-Aug-2018].
[18] Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
[19] GRU and LSTM for RNN Hidden-Layer Computation. [Online]. Available: https://wugh.github.io/posts/2016/03/cs224d-notes4-recurrent-neural-networks-continue. [Accessed: 16-Aug-2018].
[20] Logan, Beth. "Mel Frequency Cepstral Coefficients for Music Modeling." ISMIR. 2000.
[21] Mel scale. [Online]. Available: https://zh.wikipedia.org/wiki/%E6%A2%85%E5%B0%94%E5%88%BB%E5%BA%A6. [Accessed: 16-Aug-2018].
[22] Vu, Toan H., Le Dung, and Jia-Ching Wang. "Transportation Mode Detection on Mobile Devices Using Recurrent Nets." Proceedings of the 2016 ACM on Multimedia Conference. ACM, 2016.
[23] Raspberry Pi 3 Model B. [Online]. Available: https://www.raspberrypi.org/products/raspberry-pi-3-model-b. [Accessed: 16-Aug-2018].
[24] ReSpeaker 4-Mic Array for Raspberry Pi. [Online]. Available: http://wiki.seeedstudio.com/ReSpeaker_4_Mic_Array_for_Raspberry_Pi. [Accessed: 16-Aug-2018].
[25] Cortana device configuration guidelines. [Online]. Available: https://msdn.microsoft.com/zh-cn/library/windows/hardware/dn957009(v=vs.85).aspx. [Accessed: 16-Aug-2018].
[26] Warden, Pete. "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition." arXiv preprint arXiv:1804.03209 (2018).
[27] Confusion matrix. [Online]. Available: https://zh.wikipedia.org/wiki/%E6%B7%B7%E6%B7%86%E7%9F%A9%E9%98%B5. [Accessed: 16-Aug-2018].
[28] Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe, “The third ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in IEEE 2015 Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2015. |