Thesis 102522123: Detailed Record




Author: Chia-yung Hsu (徐家鏞)    Department: Computer Science and Information Engineering
Thesis Title: Artificial Neural Network Incorporating Regional Information Training for Robust Speech Recognition
(類神經網路訓練結合局部資訊於強健性語音辨識之研究)
Related Theses:
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded-System Implementation of Beamforming and Audio Pre-Processing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Applications of a High-Quality Dictation System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Speeded-Up Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ Applying RetinaNet to Face Detection
★ Trend Prediction for Financial Instruments
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Market Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Post-Stroke Surgical Survival
Files: The full text cannot be browsed through the system (permanently restricted).
Abstract (Chinese): Speech is an indispensable element of human society. As technology advances, people rely on computers to handle an ever greater share of the matters, large and small, in daily life. For computers to be able to process speech data, speech recognition has therefore become an important topic.

Current speech recognition technology achieves very good results on clean digit speech, but the environments we actually live in are full of noise unrelated to the content being recognized, and as the signal-to-noise ratio (SNR) drops, the recognition rate inevitably drops with it. Finding methods that improve speech recognition in noisy environments is therefore very important for real-world applications.

In recent years, research applying neural networks to speech recognition has produced substantial results, effectively reducing the impact of environment and speaker variability on speech signals and greatly improving recognition rates; nevertheless, the recognition capability of such systems still has room for improvement. This thesis proposes a new automatic speech recognition architecture that combines Environment Clustering (EC), Mixture of Experts, and neural networks to further improve system performance. The recognition system operates in two phases, offline and online. In the offline phase, the entire training set is partitioned into several subsets according to their acoustic characteristics, and a neural network (referred to as a neural sub-network) is built for each subset. In the online phase, a GMM-gate controls the outputs of the neural sub-networks. The proposed architecture preserves the acoustic characteristics of each training subset, making the recognition system more robust. In experiments on the Aurora 2 continuous digit speech corpus, we compared the proposed architecture with a conventional recognition system built on a single neural network in terms of word error rate (WER); the average WER improved by a relative 6.86%, dropping from 5.25% to 4.89%.
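To make the offline phase concrete, the following is a minimal Python sketch (not the thesis's actual code) of one simple way to partition training utterances into acoustically similar subsets: cluster utterance-level mean feature vectors with K-means. The function name `split_training_set` and the choice of K-means over mean vectors are illustrative assumptions; the thesis also builds its EC partitions with knowledge-based and hybrid tree splits.

```python
# Sketch of the offline phase: group training utterances into acoustically
# similar subsets, one subset per expert acoustic model. Illustrative only;
# the thesis additionally uses knowledge-based and hybrid EC-tree splits.
import numpy as np
from sklearn.cluster import KMeans

def split_training_set(features_per_utt, n_subsets=4, seed=0):
    """features_per_utt: list of (n_frames, n_dims) feature arrays, one per
    utterance. Returns one list of utterance indices per subset."""
    # Summarize each utterance by its mean feature vector, a crude proxy
    # for its speaker/environment characteristics.
    utt_vectors = np.stack([f.mean(axis=0) for f in features_per_utt])
    labels = KMeans(n_clusters=n_subsets, n_init=10,
                    random_state=seed).fit_predict(utt_vectors)
    return [np.where(labels == k)[0].tolist() for k in range(n_subsets)]

# Toy usage: 20 fake utterances of 39-dim features from two conditions.
rng = np.random.default_rng(0)
utts = [rng.normal(loc=(i % 2) * 5.0, size=(100, 39)) for i in range(20)]
print(split_training_set(utts, n_subsets=2))
```

Each resulting subset would then be used to train its own neural sub-network (expert acoustic model).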
Abstract (English): Speech is an essential element of human society. With advances in science and technology, people rely on computers to handle more and more of their daily lives. To make computers capable of handling speech data, speech recognition has become an important issue.

Automatic speech recognition (ASR) can achieve good results on clean speech data, but the environments we live in are full of noise. As the SNR of the speech gets lower, recognition accuracy inevitably decreases. For this reason, finding ways to improve recognition of noisy speech is important in real life.

Recently, ASR using neural network (NN) based acoustic models (AMs) has achieved significant improvements. However, the mismatch between training and testing conditions (in both speaker and speaking environment) still confines the applicability of ASR. This thesis proposes a novel approach that combines the environment clustering (EC) and mixture of experts (MOE) algorithms (hence termed EC-MOE) to enhance the robustness of ASR against such mismatches. In the offline phase, we split the entire training set into several subsets, each characterizing a specific speaker and speaking environment, and use each subset of training data to prepare an NN-based AM. In the online phase, we use a Gaussian mixture model (GMM) gate to determine the optimal output from the multiple NN-based AMs and render the final recognition results. We evaluated the proposed EC-MOE approach on the Aurora 2 continuous digit speech recognition task. Compared to the baseline system, in which a single NN-based AM is used for recognition, the proposed approach achieves a clear relative word error rate (WER) reduction of 6.86% (from 5.25% to 4.89%).
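As one plausible reading of the online GMM-gate, the sketch below fits one GMM per training subset and, at test time, routes the utterance to the expert whose GMM assigns it the highest average log-likelihood (hard selection). The helper names `fit_gate_gmms` and `gmm_gate` and the hard-selection rule are illustrative assumptions; the thesis also considers other combination and selection schemes (linear combination, MaxL, I-SVM), so this is a sketch, not the definitive gating rule.

```python
# Sketch of the online phase: a GMM "gate" picks which expert acoustic
# model's output to use for an incoming utterance. Illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gate_gmms(subset_frames, n_components=8, seed=0):
    """subset_frames: list of (n_frames, n_dims) arrays, the pooled training
    frames of each subset. Returns one diagonal-covariance GMM per subset."""
    return [GaussianMixture(n_components=n_components, covariance_type="diag",
                            random_state=seed).fit(frames)
            for frames in subset_frames]

def gmm_gate(gmms, utt_frames, expert_outputs):
    """Hard selection: return the output of the expert whose gate GMM gives
    the utterance the highest average per-frame log-likelihood."""
    scores = [g.score(utt_frames) for g in gmms]  # score() = mean log-likelihood
    best = int(np.argmax(scores))
    return expert_outputs[best], best
```

A soft variant would instead weight each expert's frame posteriors by the gate likelihoods, trading the hard decision for a smoother interpolation, which matches the linear-combination scheme listed in the table of contents.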
Keywords (Chinese):
★ 類神經網路 (Artificial Neural Network)
★ 強健性語音辨識 (Robust Speech Recognition)
★ 環境群集 (Environment Clustering)
Keywords (English):
★ Artificial Neural Network
★ Robust Speech Recognition
★ Environment Clustering
Table of Contents:
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
  1.1. Preface
  1.2. Thesis Organization and Chapter Overview
Chapter 2: Related Research and Literature Review
  2.1. Overview of Speech Recognition
    2.1.1. ETSI Advanced Front-End Features (AFE)
    2.1.2. Gaussian Mixture Model (GMM)
    2.1.3. Hidden Markov Model (HMM)
    2.1.4. Artificial Neural Network
    2.1.5. Restricted Boltzmann Machine
    2.1.6. I-vector
    2.1.7. K-means
  2.2. Related Research
    2.2.1. Mixture of Local Experts
    2.2.2. Environment Clustering (EC)
Chapter 3: Acoustic Model Training Based on Subspace Partitioning
  3.1. Overview
  3.2. Offline EC Tree Construction
    3.2.1. Knowledge-Based One-Layer Tree (Tree-kb1)
    3.2.2. Knowledge-Based Two-Layer Tree (Tree-kb2)
    3.2.3. Knowledge-Based with Data-Driven Hybrid Tree (Tree-hybrid)
  3.3. Online Neural Network Combination and Selection
    3.3.1. Linear Combination (LC)
    3.3.2. Maximum Log-Likelihood (MaxL)
    3.3.3. I-vector with SVM (I-SVM)
    3.3.4. GMM-gate
Chapter 4: Experiments
  4.1. Experimental Setup and Environment
  4.2. Experimental Results
Chapter 5: Conclusions and Future Research Directions
References

Advisors: Jia-ching Wang (王家慶) and Yu Tsao (曹昱)    Approval Date: 2015-08-25
