Thesis 106522059 — Detailed Record




Name: 周郁馨 (YU-SIN JHOU)    Department: Computer Science and Information Engineering
Thesis Title: Wake-up Word Detection Using Long Short-Term Memory Network and Connectionist Temporal Classification
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Preprocessing
★ Application and Design of Speech Synthesis and Voice Conversion
★ Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Dictation System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Accelerated Robust Features
★ Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ Application of RetinaNet to Face Detection
★ Trend Prediction for Financial Products
★ Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Stroke Surgery Survival
Files: Endnote RIS / BibTeX export available. Full text: never open access.
Abstract (Chinese): With the development of deep learning, applications of artificial intelligence have become increasingly widespread, and speech recognition has made marked progress. Wake-up word detection, also called keyword spotting, is the task of locating a specific word in a continuous speech signal; deep learning can outperform traditional methods such as the hidden Markov model (HMM). Wake-up word detection systems built on deep networks, such as deep neural networks (DNN) or recurrent neural networks (RNN), are usually trained on large amounts of audio containing the specific word, so that the network learns the features of the keyword audio and then predicts whether the keyword is present in continuous audio. However, such a system can only recognize a fixed wake-up word; to change the wake-up word or add a new one, new wake-up word data must be collected and the model retrained.
This thesis implements a wake-up word detection model using a long short-term memory network (LSTM) and connectionist temporal classification (CTC). Unlike models that directly predict whether a wake-up word is present in the audio, this model uses the LSTM to predict the phonemes in the audio, uses CTC to evaluate candidate phoneme sequences, and then checks whether the phoneme sequence contains the wake-up word. Because the network is trained to predict phoneme sequences, non-wake-up-word audio can be used as training data, letting the network predict phonemes more accurately. When the wake-up word is changed, the network does not need to be retrained; only a small amount of new wake-up word data is needed to adapt it.
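The many-to-one mapping at the heart of CTC, i.e. turning a frame-wise label path into a phoneme sequence, can be sketched as follows. This is a generic illustration of CTC's collapsing rule (merge consecutive repeats, then remove blanks), not the thesis's actual implementation; the phoneme symbols are hypothetical.

```python
# Minimal sketch of the CTC collapsing rule:
# merge consecutive repeated labels, then drop the blank symbol.
BLANK = "-"  # hypothetical blank token

def ctc_collapse(frame_labels):
    """Map a frame-wise label path to its collapsed phoneme sequence."""
    collapsed = []
    prev = None
    for lab in frame_labels:
        if lab != prev:          # keep only label changes (merge repeats)
            collapsed.append(lab)
        prev = lab
    return [lab for lab in collapsed if lab != BLANK]  # remove blanks

# The frame-wise path "hh hh - ay ay - -" collapses to the phonemes "hh ay".
print(ctc_collapse(["hh", "hh", "-", "ay", "ay", "-", "-"]))  # ['hh', 'ay']
```

Note that a blank between two identical labels keeps them distinct ("a - a" collapses to "a a"), which is how CTC represents genuinely repeated phonemes.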
Abstract (English): With the development of deep learning, applications of artificial intelligence have become increasingly widespread, and the performance of speech recognition has also improved considerably. Wake-up word detection, also called keyword spotting, deals with identifying a keyword in an audio signal. Deep learning now performs better than traditional approaches such as the hidden Markov model (HMM). To build a deep-learning wake-up word model (for example, a deep neural network or a recurrent neural network), we have to use a large amount of audio containing the specific word to train the model, so that it can learn the features of the wake-up word audio and predict whether the wake-up word occurs in a continuous audio signal. However, such keyword detection systems can only detect a fixed keyword. If we want to change the keyword or add a new keyword to the system, we have to collect new keyword-specific data and retrain the model.
In this thesis, we use a long short-term memory network (LSTM) and connectionist temporal classification (CTC) as the keyword detection model. It differs from general keyword detection in that the system uses the LSTM to predict phoneme posteriors and CTC to produce the probability of the phoneme sequence. Because it predicts phoneme sequences, we can use non-keyword data as training data and let the model predict sequences more accurately. Moreover, when the wake-up word is changed, the system does not have to be retrained; only a small amount of new wake-up word data is needed to adapt it.
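The detection criterion implied above can be sketched as follows: take the frame-wise argmax of the LSTM's phoneme posteriors, collapse it with the CTC rule, and report a detection when the wake-up word's phoneme sequence appears in the result. This is a simplified sketch with toy posteriors and hypothetical phoneme indices; the thesis's system scores full CTC alignments rather than a single greedy path.

```python
import numpy as np

BLANK = 0  # index of the CTC blank symbol (hypothetical convention)

def greedy_decode(posteriors):
    """posteriors: (frames, symbols) array of per-frame phoneme probabilities.
    Greedy CTC decode: argmax per frame, merge repeats, drop blanks."""
    path = posteriors.argmax(axis=1)
    out, prev = [], None
    for p in path:
        if p != prev and p != BLANK:
            out.append(int(p))
        prev = p
    return out

def contains_keyword(decoded, keyword):
    """True if the keyword phoneme sequence occurs contiguously in decoded."""
    k = len(keyword)
    return any(decoded[i:i + k] == keyword for i in range(len(decoded) - k + 1))

# Toy posteriors for 6 frames over 4 symbols (blank, p1, p2, p3);
# suppose the wake-up word maps to the phoneme sequence [1, 2].
post = np.array([[0.9, 0.1, 0.0, 0.0],   # blank
                 [0.1, 0.8, 0.1, 0.0],   # p1
                 [0.1, 0.8, 0.1, 0.0],   # p1 (repeat, merged)
                 [0.1, 0.1, 0.8, 0.0],   # p2
                 [0.8, 0.1, 0.1, 0.0],   # blank
                 [0.1, 0.0, 0.0, 0.9]])  # p3
print(greedy_decode(post))                             # [1, 2, 3]
print(contains_keyword(greedy_decode(post), [1, 2]))   # True
```

Swapping the wake-up word here only changes the target phoneme sequence passed to `contains_keyword`, which mirrors why the phoneme-based system needs no retraining when the keyword changes.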
Keywords (Chinese) ★ 喚醒詞 (wake-up word)
★ 深度學習 (deep learning)
★ 長短期記憶網路 (long short-term memory network)
★ 連結時序分類 (connectionist temporal classification)
Keywords (English) ★ wake-up word
★ deep learning
★ long short-term memory
★ connectionist temporal classifier
Table of Contents
Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
Chapter 2  Literature Review
  2.1 HMM-based wake-up word detection
  2.2 Deep-network-based wake-up word detection
    2.2.1 Convolutional neural networks
    2.2.2 Recurrent neural networks
    2.2.3 Long short-term memory networks
    2.2.4 Gated recurrent units
  2.3 CTC-based wake-up word detection
Chapter 3  System Architecture
  3.1 System architecture design
  3.2 Feature extraction
  3.3 LSTM-CTC
    3.3.1 LSTM
    3.3.2 CTC
  3.4 Compare
Chapter 4  Experiments
  4.1 Dataset description
  4.2 Experimental environment, parameters, and network settings
  4.3 Experimental results
    4.3.1 Wake-up word evaluation
    4.3.2 Result comparison
    4.3.3 Wake-up word replacement
Chapter 5  Conclusions and Future Work
Chapter 6  References
Advisor: 王家慶    Date of Approval: 2019-08-22
