References
[1] Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." IEEE Signal Processing Magazine 29.6 (2012): 82-97.
[2] Dahl, George E., et al. "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition." IEEE Transactions on Audio, Speech, and Language Processing 20.1 (2012): 30-42.
[3] Deng, Li, Geoffrey Hinton, and Brian Kingsbury. "New types of deep neural network learning for speech recognition and related applications: An overview." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
[4] Grais, Emad M., Mehmet Umut Sen, and Hakan Erdogan. "Deep neural networks for single channel source separation." Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014.
[5] Lei, Yun, et al. "A novel scheme for speaker recognition using a phonetically-aware deep neural network." Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014.
[6] Yamada, Takanori, Longbiao Wang, and Atsuhiko Kai. "Improvement of distant-talking speaker identification using bottleneck features of DNN." Interspeech. 2013.
[7] Han, Kun, Dong Yu, and Ivan Tashev. "Speech emotion recognition using deep neural network and extreme learning machine." Interspeech. 2014.
[8] Seide, Frank, Gang Li, and Dong Yu. "Conversational speech transcription using context-dependent deep neural networks." Interspeech. 2011.
[9] Anguera, Xavier, Chuck Wooters, and Javier Hernando. "Acoustic beamforming for speaker diarization of meetings." IEEE Transactions on Audio, Speech, and Language Processing 15.7 (2007): 2011-2022.
[10] Heymann, Jahn, et al. "BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge." Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015.
[11] Han, Wei, et al. "An efficient MFCC extraction method in speech recognition." Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International Symposium on. IEEE, 2006.
[12] Sainath, Tara N., et al. "Deep convolutional neural networks for LVCSR." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
[13] Sak, Haşim, Andrew Senior, and Françoise Beaufays. "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition." arXiv preprint arXiv:1402.1128 (2014).
[14] Liu, Xunying, et al. "Efficient lattice rescoring using recurrent neural network language models." Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014.
[15] Du, Jun, et al. "The USTC–iFlytek system for CHiME-4 challenge." Proc. CHiME (2016): 36-38.
[16] Chen, Guoguo, Carolina Parada, and Tara N. Sainath. "Query-by-example keyword spotting using long short-term memory networks." Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015.
[17] Ge, Fengpei, and Yonghong Yan. "Deep neural network based wake-up-word speech recognition with two-stage detection." Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017.
[18] Molau, Sirko, et al. "Computing mel-frequency cepstral coefficients on the power spectrum." Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP'01). 2001 IEEE International Conference on. Vol. 1. IEEE, 2001.
[19] Prasad, N. Vishnu, and Srinivasan Umesh. "Improved cepstral mean and variance normalization using Bayesian framework." Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013.
[20] Rath, Shakti P., et al. "Improved feature processing for deep neural networks." Interspeech. 2013.
[21] Belhumeur, Peter N., João P. Hespanha, and David J. Kriegman. "Eigenfaces vs. fisherfaces: Recognition using class specific linear projection." IEEE Transactions on Pattern Analysis and Machine Intelligence 19.7 (1997): 711-720.
[22] Gales, Mark J. F. "Maximum likelihood linear transformations for HMM-based speech recognition." Computer Speech & Language 12.2 (1998): 75-98.
[23] Van Veen, Barry D., and Kevin M. Buckley. "Beamforming: A versatile approach to spatial filtering." IEEE ASSP Magazine 5.2 (1988): 4-24.
[24] Povey, Daniel, and George Saon. "Feature and model space speaker adaptation with full covariance Gaussians." Interspeech. 2006.
[25] Ghahramani, Zoubin. "An introduction to hidden Markov models and Bayesian networks." International Journal of Pattern Recognition and Artificial Intelligence 15.01 (2001): 9-42.
[26] Mohamed, Abdel-rahman, George E. Dahl, and Geoffrey Hinton. "Acoustic modeling using deep belief networks." IEEE Transactions on Audio, Speech, and Language Processing 20.1 (2012): 14-22.
[27] Veselý, Karel, et al. "Sequence-discriminative training of deep neural networks." Interspeech. 2013.
[28] Chan, William, and Ian Lane. "Deep recurrent neural networks for acoustic modelling." arXiv preprint arXiv:1504.01482 (2015).
[29] Sak, Haşim, et al. "Fast and accurate recurrent neural network acoustic models for speech recognition." arXiv preprint arXiv:1507.06947 (2015).
[30] Pascanu, Razvan, et al. "How to construct deep recurrent neural networks." arXiv preprint arXiv:1312.6026 (2013).
[31] Graves, Alex, Navdeep Jaitly, and Abdel-rahman Mohamed. "Hybrid speech recognition with deep bidirectional LSTM." Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013.
[32] Zeyer, Albert, et al. "A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition." Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017.
[33] Greff, Klaus, et al. "LSTM: A search space odyssey." IEEE Transactions on Neural Networks and Learning Systems (2016).
[34] Peddinti, Vijayaditya, Daniel Povey, and Sanjeev Khudanpur. "A time delay neural network architecture for efficient modeling of long temporal contexts." Interspeech. 2015.
[35] Brown, Peter F., et al. "Class-based n-gram models of natural language." Computational Linguistics 18.4 (1992): 467-479.
[36] Gale, William A., and Geoffrey Sampson. "Good-Turing frequency estimation without tears." Journal of Quantitative Linguistics 2.3 (1995): 217-237.
[37] Kneser, Reinhard, and Hermann Ney. "Improved backing-off for m-gram language modeling." Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on. Vol. 1. IEEE, 1995.
[38] Sundermeyer, Martin, Ralf Schlüter, and Hermann Ney. "LSTM neural networks for language modeling." Interspeech. 2012.
[39] Mikolov, Tomáš, et al. "Extensions of recurrent neural network language model." Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011.
[40] Mikolov, Tomáš, et al. "Recurrent neural network based language model." Interspeech. Vol. 2. 2010.
[41] Chung, Euisok, Hyung-Bae Jeon, Jeon Gue Park, and Yun-Keun Lee. "Lattice rescoring for speech recognition using large scale distributed language models." 24th International Conference on Computational Linguistics. 2012.
[42] Barker, Jon, et al. "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines." Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015.
[43] Povey, Daniel, et al. "The Kaldi speech recognition toolkit." IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. No. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[44] Stolcke, Andreas. "SRILM - an extensible language modeling toolkit." Interspeech. Vol. 2002. 2002.
[45] Enarvi, Seppo, and Mikko Kurimo. "TheanoLM - an extensible toolkit for neural network language modeling." arXiv preprint arXiv:1605.00942 (2016).
[46] Vincent, Emmanuel, et al. "An analysis of environment, microphone and data simulation mismatches in robust speech recognition." Computer Speech & Language (2016).
[47] Menne, Tobias, et al. "The RWTH/UPB/FORTH system combination for the 4th CHiME challenge evaluation." The 4th International Workshop on Speech Processing in Everyday Environments, San Francisco, CA, USA. 2016.
[48] Kaldi toolkit documentation. http://kaldi-asr.org/doc, last accessed June 2017.
[49] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: A library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.