Abstract: | Smart speakers have developed rapidly in recent years. Amazon's smart speaker, Echo, has changed how consumers use home appliances, and its voice assistant, Alexa, lets users issue commands by voice alone. Smart speaker technology is divided into a front end and a back end. The front end is the device itself and covers noise reduction, speech enhancement, echo cancellation, voice activity detection, and wake word detection; the back end is the server side and covers speech recognition and semantic understanding. Vendors have invested heavily in these technologies. Building on previous research, this thesis implements a robust wake word detection embedded system that combines two key smart speaker techniques: wake word detection and noise reduction. For wake word detection, Mel-frequency cepstral coefficients (MFCC) are extracted from the audio as features and used to train a convolutional neural network, whose output probabilities for each wake word class determine whether a wake word has been detected. For noise reduction, the short-time Fourier transform (STFT) is applied to the mixed signal, and the energy of the resulting time-frequency representation is fed into a recurrent neural network, which is trained to produce a noise mask and a speech mask. These masks are then applied to a generalized eigenvalue (GEV) beamformer to achieve noise reduction. |
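
The mask-based GEV beamforming step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the function names (`masked_psd`, `gev_weights`, `apply_beamformer`), the array layout, the diagonal loading factor, and the Cholesky-whitening route to the generalized eigenproblem are all assumptions made for illustration. The speech and noise masks are taken as given, standing in for the output of the recurrent network.

```python
import numpy as np


def masked_psd(stft, mask, eps=1e-10):
    """Mask-weighted spatial covariance matrix for each frequency bin.

    stft: complex array, shape (channels, freqs, frames)
    mask: real array in [0, 1], shape (freqs, frames)
    returns: complex array, shape (freqs, channels, channels)
    """
    weighted = stft * mask[None, :, :]
    # psd[f, c, d] = sum_t mask[f, t] * Y[c, f, t] * conj(Y[d, f, t])
    psd = np.einsum("cft,dft->fcd", weighted, stft.conj())
    norm = np.maximum(mask.sum(axis=1), eps)
    return psd / norm[:, None, None]


def gev_weights(psd_speech, psd_noise, loading=1e-6):
    """Principal generalized eigenvector of (psd_speech, psd_noise) per bin.

    Solves Phi_s w = lambda * Phi_n w by whitening with the Cholesky
    factor of Phi_n (an illustrative choice, not taken from the source).
    returns: complex array, shape (freqs, channels)
    """
    n_freqs, n_ch, _ = psd_speech.shape
    weights = np.empty((n_freqs, n_ch), dtype=complex)
    for f in range(n_freqs):
        # Small diagonal loading keeps the noise covariance positive definite.
        scale = max(np.trace(psd_noise[f]).real / n_ch, 1e-12)
        phi_n = psd_noise[f] + loading * scale * np.eye(n_ch)
        chol = np.linalg.cholesky(phi_n)  # phi_n = chol @ chol^H
        chol_inv = np.linalg.inv(chol)
        whitened = chol_inv @ psd_speech[f] @ chol_inv.conj().T
        whitened = (whitened + whitened.conj().T) / 2  # enforce Hermitian symmetry
        _, vecs = np.linalg.eigh(whitened)  # eigenvalues in ascending order
        weights[f] = chol_inv.conj().T @ vecs[:, -1]  # map back: w = L^{-H} v
    return weights


def apply_beamformer(weights, stft):
    """Beamformer output y[f, t] = w[f]^H Y[:, f, t]."""
    return np.einsum("fc,cft->ft", weights.conj(), stft)
```

In use, `masked_psd` would be called once with the speech mask and once with the noise mask on the multichannel STFT, and the two results passed to `gev_weights`. The weights maximize the ratio of speech power to noise power at the beamformer output in each frequency bin; any post-filtering or scale normalization is left out of this sketch.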