基於長短期記憶網路和連結時序分類的喚醒詞辨識;Wake-up Word Detection Using Long Short Term Memory Network and Connectionist Temporal Classification


Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/81343

Title:	基於長短期記憶網路和連結時序分類的喚醒詞辨識;Wake-up Word Detection Using Long Short Term Memory Network and Connectionist Temporal Classification
Authors:	周郁馨;JHOU, YU-SIN
Contributors:	Department of Computer Science and Information Engineering
Keywords:	wake-up word;deep learning;long short-term memory;connectionist temporal classifier
Date:	2019-08-22
Issue Date:	2019-09-03 15:45:54 (UTC+8)
Publisher:	National Central University
Abstract:	With the development of deep learning, applications of artificial intelligence have become increasingly widespread, and speech recognition performance has improved significantly. Wake-up word detection, also called keyword spotting, is the task of locating a specific word in a continuous audio signal. Deep learning outperforms traditional approaches such as the hidden Markov model (HMM) on this task. A typical deep-learning wake-up word system, built on a deep neural network (DNN) or a recurrent neural network (RNN), is trained on a large amount of audio of the specific word so that the network learns the features of the keyword audio and can then predict whether the keyword is present in a continuous audio signal. However, such systems can only detect a fixed keyword: to change the keyword or add a new one, new keyword-specific data must be collected and the model retrained. In this thesis, we use a long short-term memory (LSTM) network and connectionist temporal classification (CTC) to implement a wake-up word detection model. Unlike systems that directly predict whether the wake-up word is present, this model uses the LSTM to predict phoneme posteriors and CTC to evaluate the likelihood of candidate phoneme sequences, and then determines whether the wake-up word occurs in the phoneme sequence. Because the network is trained to predict phoneme sequences, non-keyword audio can be used as training data, which lets the network predict the phonemes in the audio more accurately. In addition, changing the wake-up word does not require retraining the network; a small amount of new wake-up word data is enough to adapt it.
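The phoneme-based pipeline described in the abstract can be illustrated with a minimal sketch. It assumes per-frame phoneme posteriors are already available (in the thesis these come from the LSTM; here they are hand-written toy values), and it substitutes simple greedy CTC decoding plus an exact subsequence match for whatever scoring rule the thesis actually uses, which the abstract does not specify. The blank index, the phoneme IDs, and the function names are all hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (not stated in the abstract)


def ctc_greedy_decode(posteriors: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Greedy CTC decoding: take the best label per frame,
    merge consecutive repeats, then drop blank symbols."""
    decoded: List[int] = []
    previous = None
    for frame in posteriors:
        # Index of the most probable label in this frame.
        best = max(range(len(frame)), key=lambda i: frame[i])
        # Emit only when the label changes and is not the blank.
        if best != previous and best != blank:
            decoded.append(best)
        previous = best
    return decoded


def contains_wake_word(decoded: Sequence[int],
                       wake_word: Sequence[int]) -> bool:
    """Return True if the wake-word phoneme IDs occur as a
    contiguous run inside the decoded phoneme sequence."""
    n = len(wake_word)
    if n == 0:
        return False
    target = list(wake_word)
    return any(list(decoded[i:i + n]) == target
               for i in range(len(decoded) - n + 1))


# Toy posteriors over 3 labels: 0 = blank, 1 and 2 = hypothetical phonemes.
frames = [
    [0.10, 0.80, 0.10],
    [0.10, 0.70, 0.20],
    [0.90, 0.05, 0.05],
    [0.20, 0.10, 0.70],
    [0.60, 0.20, 0.20],
    [0.10, 0.20, 0.70],
]
phonemes = ctc_greedy_decode(frames)
detected = contains_wake_word(phonemes, [1, 2])
```

Because the wake word is represented only as a list of phoneme IDs, switching to a different wake word in this sketch means passing a different list to `contains_wake_word`, which mirrors the abstract's point that the phoneme predictor itself does not have to be retrained.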
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

Files in This Item:

File	Description	Size	Format
index.html		0Kb	HTML	204	View/Open
