快速-長短期記憶聲學模型於遠距語音辨識及喚醒關鍵字任務;Fast-LSTM Acoustic Model for Distant Speech Recognition and Wake-up-word Task

Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/74756

Title:	快速-長短期記憶聲學模型於遠距語音辨識及喚醒關鍵字任務;Fast-LSTM Acoustic Model for Distant Speech Recognition and Wake-up-word Task
Author:	特利安;Trianto, Rezki
Contributor:	Department of Computer Science and Information Engineering
Keywords:	自動語音辨識;延時類神經網路;長短期記憶;喚醒關鍵字;波束賦形;automatic speech recognition;time delay neural network;long short-term memory;wake-up-word;beamforming
Date:	2017-08-22
Upload time:	2017-10-27 14:38:25 (UTC+8)
Publisher:	National Central University
Abstract:	自動語音辨識系統近年來已廣泛地運用在人類生活的各個角落當中，其快速的發展對人類社會有著極大的影響。儘管語音辨識技術近年突飛猛進，仍然有許多方面尚待突破。因此本文嘗試提出新的方法來改善語音辨識的精準度。本文大致上可分成兩個部分: 第一個部分為本文所提出的新的辨識方法－快速長短期記憶聲學模型 (Fast-LSTM)。這個方法主要將延時類神經網路(TDNN)的優點導入各種不同的長短期記憶模型中，藉以提升模型在語音辨識上的速度。文章中我們藉由長距語音以及多聲道音頻來作為模型檢測的樣本。結果發現，與延時類神經網路與深度神經網路(DNN)比較，本文所提出的模型確實可提升語音辨識的速度，然而於精準度上不論是傳統長短期記憶法與本文所提出的快速長短期記憶法，都不及於深度神經網路來的好。本文後半部分將提及其實驗上的一些限制及待改進的部分。本文的第二個部分為快速長短期記憶聲學模型於關鍵字偵測的運用。實驗結果發現，快速長短期記憶聲學模型在關鍵字的辨識及偵測上可以比過去既有的模型減少10%的錯誤率。;Automatic speech recognition (ASR) has developed rapidly in recent years within machine learning research. ASR is now used in many everyday applications, such as smart assistants and subtitle generation. In this thesis, we propose two systems. The first is an automatic speech recognition system that uses Fast-LSTM acoustic models. It uses a TDNN architecture in the initial layers to learn short-term temporal features of the input, followed by several LSTM layers on top. The experiments use the CHiME3 dataset, which focuses on distant-talking, multi-channel audio. As the front end, a GEV beamformer supported by a BLSTM network is used to improve the quality of the speech utterances. In the experimental results, the Fast-LSTM model trains faster than the standard LSTM or DNN. However, the DNN obtains a better error rate than the LSTM or Fast-LSTM, achieving a word error rate of 4.87%. Some limitations of the training process are discussed in this thesis. The second system implements the wake-up-word task, a sub-task of speech recognition. The trained Fast-LSTM model serves as the acoustic model in a two-step classification scheme, and confidence measures for each phoneme generated from the keyword are used to detect the keyword. The system detects keywords well, producing a 10% error rate.
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Master's and Doctoral Theses
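
The abstract reports its recognition result as a word error rate (WER). For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions and are not taken from the thesis, whose own scoring tools are not described in this record.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Hypothetical helper for illustration; not the scoring tool used in the thesis.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("turn on the light", "turn on light")` returns `0.25`: one deleted word out of four reference words. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.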

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	343	View/Open

All items in NCUIR are protected by original copyright.
