Abstract (English) |
In recent years, smart speakers have come into full swing. Amazon's smart speaker, the Echo, successfully changed how customers use home appliances, and its voice assistant, Alexa, lets users issue commands by voice. Smart-speaker technology divides into a front end and a back end. Front-end technology runs on the device itself and includes noise reduction, speech enhancement, echo cancellation, voice activity detection, and so on; back-end technology runs on the server and includes speech recognition, semantic understanding, and related tasks. Firms invest considerable effort in all of these technologies.
In this thesis, we build on previous research and implement robust wake-word detection on an embedded system. The system combines two smart-speaker techniques: wake-word detection and noise reduction. For wake-word detection, Mel-frequency cepstral coefficients (MFCCs) are extracted from the voice signal as features and fed into a convolutional neural network, whose outputs are the probabilities of each wake-word class; these probabilities determine whether a wake word has been detected. For noise reduction, the short-time Fourier transform (STFT) of the time-frequency mixed signal is computed, its energy is fed into a recurrent neural network for training, and the network outputs a noise mask and a speech mask; applying these masks in a GEV beamformer achieves noise reduction. |
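The MFCC front end described above can be sketched in plain numpy. This is a minimal illustration of the standard pipeline (framing and windowing, power spectrum, mel filterbank, log, DCT), not the thesis implementation; the sample rate, frame length, hop, filter count, and number of coefficients below are common defaults assumed for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fbank[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[i - 1, k] = (hi - k) / max(hi - c, 1)
    return fbank

def mfcc(signal, sr=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    # 1. Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Mel filterbank energies, then log compression.
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # 4. DCT-II decorrelates the log energies; keep the first n_ceps coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_filters)))
    return log_mel @ dct.T  # shape: (n_frames, n_ceps)
```

The resulting (frames × coefficients) matrix is what a small-footprint CNN keyword spotter, as in Sainath and Parada [8], takes as its input feature map.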
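The mask-driven GEV beamformer can likewise be sketched with numpy. Assuming the RNN has already produced a speech mask and a noise mask per time-frequency bin (as in Heymann et al. [14]), the beamformer estimates mask-weighted spatial covariance matrices per frequency and takes the principal generalized eigenvector, which maximizes the output SNR. This is an illustrative sketch, not the thesis code; the array shapes and the small regularization term are assumptions.

```python
import numpy as np

def gev_beamform(stft, speech_mask, noise_mask):
    """Apply a GEV beamformer to a multichannel STFT.

    stft: complex array of shape (C, F, T) - channels, frequencies, frames.
    speech_mask, noise_mask: real arrays of shape (F, T) with values in [0, 1].
    Returns the single-channel beamformed STFT of shape (F, T).
    """
    C, F, T = stft.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Y = stft[:, f, :]  # (C, T) observations at this frequency
        # Mask-weighted spatial covariance (PSD) estimates for speech and noise.
        phi_xx = (speech_mask[f] * Y) @ Y.conj().T / max(speech_mask[f].sum(), 1e-10)
        phi_nn = (noise_mask[f] * Y) @ Y.conj().T / max(noise_mask[f].sum(), 1e-10)
        phi_nn += 1e-10 * np.eye(C)  # regularize before inversion
        # GEV criterion: principal eigenvector of phi_nn^{-1} phi_xx.
        vals, vecs = np.linalg.eig(np.linalg.inv(phi_nn) @ phi_xx)
        w = vecs[:, np.argmax(vals.real)]
        out[f] = w.conj() @ Y  # filter-and-sum across channels
    return out
```

In practice (e.g. in [14]) a blind analytic normalization step is applied afterward to reduce the arbitrary per-frequency scaling of the eigenvector; that step is omitted here for brevity.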
References |
[1] Logan, Beth. "Mel Frequency Cepstral Coefficients for Music Modeling." ISMIR. Vol. 270. 2000.
[2] LeCun, Y., Bengio, Y. and Hinton, G., 2015. Deep learning. Nature, 521(7553), pp.436-444
[3] S. Hamid Nawab and Thomas F. Quatieri, "Short-time Fourier transform," Advanced Topics in Signal Processing, Prentice-Hall, Inc., Upper Saddle River, NJ, 1987.
[4] L. C. Jain and L. R. Medsker, Recurrent Neural Networks: Design and Applications, CRC Press, Inc., Boca Raton, FL, 1999.
[5] Warsitz, Ernst, and Reinhold Haeb-Umbach. "Blind acoustic beamforming based on generalized eigenvalue decomposition." IEEE Transactions on audio, speech, and language processing 15.5 (2007): 1529-1539.
[6] Lawrence R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77 (2), p. 257–286, February 1989
[7] Forney, G. David. "The viterbi algorithm." Proceedings of the IEEE 61.3 (1973): 268-278.
[8] Sainath, Tara N., and Carolina Parada. "Convolutional neural networks for small-footprint keyword spotting." Sixteenth Annual Conference of the International Speech Communication Association. 2015.
[9] Xiao, Xiong, et al. "A study of learning based beamforming methods for speech recognition." CHiME 2016 workshop. 2016.
[10] Loss Function. [Online]. Available: https://en.wikipedia.org/wiki/Loss_function. [Accessed: 16-Aug-2018].
[11] Delay Sum Filter. [Online]. Available: http://www.labbookpages.co.uk/audio/beamforming/delaySum.html. [Accessed: 16-Aug-2018].
[12] Backpropagation. [Online]. Available: https://zh.wikipedia.org/wiki/%E5%8F%8D%E5%90%91%E4%BC%A0%E6%92%AD%E7%AE%97%E6%B3%95. [Accessed: 16-Aug-2018].
[13] S. Hochreiter and J. Schmidhuber. “Long short-term memory”. Neural Computation, vol. 9, pp. 1735–1780, 1997.
[14] Heymann, Jahn, et al. "BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge." Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015.
[15] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences of the USA, vol. 79, no. 8, pp. 2554–2558, April 1982.
[16] A. Krenker, J. Bešter and A. Kos, "Introduction to the Artificial Neural Networks," Artificial Neural Networks - Methodological Advances and Biomedical Applications, ISBN: 978-953-307-243-2, 2011.
[17] Deep Learning in a Nutshell: Sequence Learning. [Online]. Available: https://devblogs.nvidia.com/deep-learning-nutshell-sequence-learning/. [Accessed: 16-Aug-2018].
[18] Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
[19] GRU and LSTM for RNN Hidden-Layer Computation. [Online]. Available: https://wugh.github.io/posts/2016/03/cs224d-notes4-recurrent-neural-networks-continue. [Accessed: 16-Aug-2018].
[20] Logan, Beth. "Mel Frequency Cepstral Coefficients for Music Modeling." ISMIR. 2000.
[21] Mel scale. [Online]. Available: https://zh.wikipedia.org/wiki/%E6%A2%85%E5%B0%94%E5%88%BB%E5%BA%A6. [Accessed: 16-Aug-2018].
[22] Vu, Toan H., Le Dung, and Jia-Ching Wang. "Transportation Mode Detection on Mobile Devices Using Recurrent Nets." Proceedings of the 2016 ACM on Multimedia Conference. ACM, 2016.
[23] Raspberry Pi 3 Model B. [Online]. Available: https://www.raspberrypi.org/products/raspberry-pi-3-model-b. [Accessed: 16-Aug-2018].
[24] ReSpeaker 4-Mic Array for Raspberry Pi. [Online]. Available: http://wiki.seeedstudio.com/ReSpeaker_4_Mic_Array_for_Raspberry_Pi. [Accessed: 16-Aug-2018].
[25] Cortana device configuration guidelines. [Online]. Available: https://msdn.microsoft.com/zh-cn/library/windows/hardware/dn957009(v=vs.85).aspx. [Accessed: 16-Aug-2018].
[26] Warden, Pete. "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition." arXiv preprint arXiv:1804.03209 (2018).
[27] Confusion matrix. [Online]. Available: https://zh.wikipedia.org/wiki/%E6%B7%B7%E6%B7%86%E7%9F%A9%E9%98%B5. [Accessed: 16-Aug-2018].
[28] Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe, “The third ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in IEEE 2015 Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2015. |