Thesis Detail Record 104522606




Author: Ryandhimas Edo Zezario (李安德)    Graduating Department: Computer Science and Information Engineering
Thesis Title: Study of Robustness of DNN Acoustic Modeling Based on Multi-style Training with Speech Enhancement
(Chinese title: 基於語音增強技術之多模態訓練語料強健類神經網路聲學模型)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Preprocessing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Dictation System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Speeded-Up Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ RetinaNet Applied to Face Detection
★ Trend Prediction for Financial Products
★ A Study Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Post-Stroke Surgical Survival
Full Text: Permanently restricted (not available for public access)
Abstract: This study presents multi-style training with speech enhancement (MTSE) for acoustic modeling to achieve robust automatic speech recognition (ASR). Previous studies have confirmed that by using training data from diverse acoustic conditions (obtained either by collecting data under different recording conditions or by injecting noise into clean utterances), acoustic models based on deep neural networks (DNNs) can be trained to be more robust to adverse acoustic conditions. The MTSE approach adopts the same concept: it applies machine-learning-based and spectral-restoration-based speech enhancement to generate restored speech data, which is then used to expand the original training set. By augmenting the original training data with the enhancement-restored data, DNN-based acoustic models can capture additional structure in the input distribution and determine more accurate decision boundaries under heterogeneous conditions. The proposed MTSE approach was evaluated on the Aurora-4 (a standardized English ASR task with simulated noisy speech) and MATBN (a standardized Mandarin ASR task with real-world recorded noisy speech) datasets. Experimental results show that the proposed MTSE system yields a notable relative reduction of 9.49% in word error rate (WER) compared to the baseline system on the Aurora-4 task (from 10.01% to 9.06%), and a relative reduction of 6.15% in character error rate (CER) compared to the baseline system on the MATBN task (from 12.84% to 12.05%). These results suggest that the proposed MTSE approach is a feasible solution to the noise problem in real-world noise-robust ASR.
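The improvements quoted above are relative error-rate reductions; as a quick check of the arithmetic behind the quoted figures:

\[
\frac{10.01 - 9.06}{10.01} \approx 9.49\% \ \text{(WER, Aurora-4)}, \qquad
\frac{12.84 - 12.05}{12.84} \approx 6.15\% \ \text{(CER, MATBN)}.
\]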
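To make the augmentation step concrete, the following minimal Python sketch (not code from the thesis; all function and variable names are illustrative assumptions) shows the core MTSE idea described in the abstract: each training utterance is passed through a speech-enhancement front end, and the restored copies are pooled with the originals under the same frame labels. A toy spectral-subtraction function stands in for the enhancers actually studied in the thesis (DDAE, ELM, HELM, and the MMSE/MLSA/MAPA/GMAPA estimators).

import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, floor=0.002):
    # Toy enhancer: subtract a (known) noise magnitude estimate and
    # apply a spectral floor to avoid negative magnitudes.
    return np.maximum(noisy_mag - noise_mag, floor * noisy_mag)

def build_mtse_training_set(feats, labels, enhancer):
    # Pool the original utterances with their enhanced (restored)
    # versions. Frame labels are duplicated unchanged because the
    # enhancer operates frame by frame, so alignments still apply.
    restored = [enhancer(x) for x in feats]
    return feats + restored, labels + labels

# Toy usage: 2 utterances of 100 frames x 257 magnitude-spectrum bins.
rng = np.random.default_rng(0)
feats = [np.abs(rng.normal(size=(100, 257))) for _ in range(2)]
labels = [rng.integers(0, 40, size=100) for _ in range(2)]
noise_mag = 0.1 * np.ones(257)

x_mtse, y_mtse = build_mtse_training_set(
    feats, labels, lambda m: spectral_subtraction(m, noise_mag))
print(len(x_mtse), len(y_mtse))  # 4 and 4: originals + restored copies

Pooling the restored data with the original data, rather than replacing it, is the key design choice: the abstract attributes the gains to the acoustic model seeing both the noisy and the restored views of each utterance.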
Keywords
★ deep learning
★ deep neural networks
★ multi-style training
★ deep denoising autoencoder
★ extreme learning machine
★ hierarchical extreme learning machine
★ spectral restoration
★ automatic speech recognition
Table of Contents
ABSTRACT (CHINESE)
ABSTRACT (ENGLISH)
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
CHAPTER 2 AUTOMATIC SPEECH RECOGNITION
2.1 Acoustic Models
2.2 Deep Neural Networks
2.3 DNN-HMM Acoustic Models
2.3.1 Restricted Boltzmann Machines
2.3.2 Deep Belief Networks
CHAPTER 3 SPEECH ENHANCEMENT
3.1 Machine Learning Based Speech Enhancement
3.1.1 Deep Denoising Autoencoder
3.1.2 Extreme Learning Machine
3.1.3 Hierarchical Extreme Learning Machine
3.2 Spectral Restoration Based Speech Enhancement
3.2.1 Minimum Mean Square Error (MMSE)
3.2.2 Maximum Likelihood Spectral Amplitude (MLSA)
3.2.3 Maximum a Posteriori Spectral Amplitude (MAPA)
3.2.4 Generalized Maximum a Posteriori Spectral Amplitude (GMAPA)
CHAPTER 4 METHODOLOGY
4.1 Multi-style Training
4.2 Proposed Multi-style Training with Speech Enhancement (MTSE)
4.2.1 Original Setup
4.2.2 Extension Setup
CHAPTER 5 EXPERIMENTAL SETUP
5.1 Speech Enhancement Configuration
5.2 ASR Setup for Aurora-4
5.3 ASR Setup for MATBN
CHAPTER 6 EXPERIMENTAL RESULTS
6.1 Spectrogram Analysis
6.2 Aurora-4 ASR Results
6.2.1 Recognition with Original Test Data
6.2.2 Recognition with Restored Test Data
6.3 MATBN ASR Results
6.4 Correlation of STOI and WER
6.5 Effect of Distortion on Robust ASR
6.6 Analyzing the Performance of Diverse Training Data
CHAPTER 7 CONCLUSION
BIBLIOGRAPHY
Advisors: Jia-Ching Wang (王家慶), Yu Tsao (曹昱)    Approval Date: 2017-07-26
