Speech Recognition by Using Convolutional Neural Network

Name

Shu-Sian Yang (楊恕先)

Department

Department of Electrical Engineering

Thesis title

Speech Recognition by Using Convolutional Neural Network
(基於卷積神經網路之語音辨識)

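This electronic thesis is authorized for immediate open access.
Once released, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, so as to avoid violating the law.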

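Abstract (translated from the Chinese)

This thesis explores how deep learning can be used for speech recognition. The recognition method first extracts speech feature parameters as Mel frequency cepstral coefficients (MFCCs) and then feeds them into a convolutional neural network (CNN) for recognition.
The main difference from traditional speech recognition methods is that no acoustic model needs to be built. For Chinese, for example, this saves the time otherwise spent building and matching a large number of initial (consonant) and final (vowel) models. Once the feature parameters have been obtained through MFCCs, speech recognition can be carried out by the CNN, and the approach is not restricted to any particular language.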
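As a rough illustration of the feature-extraction step described in the abstract, the following Python sketch computes MFCCs from a raw signal using the conventional pipeline (pre-emphasis, framing, Hamming window, power spectrum, mel filterbank, logarithm, DCT). It is not taken from the thesis: the sample rate, frame length, hop size, filter count, coefficient count, pre-emphasis factor, and all function names are assumptions chosen only for the demonstration, and the naive DFT is used purely to keep the example dependency-free.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (slow, but fine for a short demo)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mels = [i * top / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mels]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):          # rising edge
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):          # falling edge, peak of 1.0 at k == mid
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank

def mfcc_frame(frame, bank, n_coeffs):
    n = len(frame)
    # Hamming window to reduce spectral leakage.
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
                for t, x in enumerate(frame)]
    spec = power_spectrum(windowed)
    # Log filterbank energies; the small constant avoids log(0).
    log_e = [math.log(sum(w * p for w, p in zip(filt, spec)) + 1e-10) for filt in bank]
    m = len(log_e)
    # Unnormalized DCT-II decorrelates the log energies.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / m) for j, e in enumerate(log_e))
            for c in range(n_coeffs)]

def mfcc(signal, sample_rate=8000, frame_len=256, hop=128,
         n_filters=20, n_coeffs=13, pre_emph=0.97):
    # Pre-emphasis boosts high frequencies before framing.
    emphasized = [signal[0]] + [signal[t] - pre_emph * signal[t - 1]
                                for t in range(1, len(signal))]
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    starts = range(0, len(emphasized) - frame_len + 1, hop)
    return [mfcc_frame(emphasized[s:s + frame_len], bank, n_coeffs) for s in starts]

# Demo input (an assumption): 1024 samples of a 440 Hz tone sampled at 8 kHz.
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(1024)]
features = mfcc(signal)
print(len(features), "frames x", len(features[0]), "coefficients")
```

The result is a frames-by-coefficients matrix, which is the kind of two-dimensional input a CNN can consume in place of an image.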

Abstract (English)

This thesis develops a method for automatic speech recognition. The method extracts speech feature parameters as Mel frequency cepstral coefficients (MFCCs) and feeds them into a convolutional neural network (CNN). The main difference between this CNN-based method and traditional speech recognition methods is that it does not require an acoustic model. For Chinese, for example, this saves the considerable time otherwise needed to build a large number of consonant and vowel models. Once the speech feature parameters have been obtained through MFCCs, recognition is performed by the CNN.
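To make the recognition step more concrete, here is a minimal Python sketch of a CNN forward pass over an MFCC-style feature matrix: convolution, ReLU, max pooling, a fully connected layer, and softmax. It is only an illustration under stated assumptions and does not reproduce the network in the thesis; the input size (7 frames by 13 coefficients), the two 3x3 kernels, the 10 output classes, the random weights, and all function names are hypothetical.

```python
import math
import random

def conv2d(image, kernel):
    """'Valid' 2-D cross-correlation, the operation most CNN libraries call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]

def relu(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; leftover rows or columns are dropped."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

def dense(vec, weights, bias):
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(features, kernels, fc_w, fc_b):
    # One feature map per kernel: convolution -> ReLU -> max pooling.
    pooled = [max_pool(relu(conv2d(features, k))) for k in kernels]
    flat = [v for fmap in pooled for row in fmap for v in row]
    return softmax(dense(flat, fc_w, fc_b))

random.seed(0)
# Stand-in for an MFCC matrix (assumed shape: 7 frames x 13 coefficients).
features = [[random.gauss(0, 1) for _ in range(13)] for _ in range(7)]
kernels = [[[random.gauss(0, 0.5) for _ in range(3)] for _ in range(3)] for _ in range(2)]
# 3x3 convolution gives 5x11 maps, 2x2 pooling gives 2x5, so 2 kernels yield 20 values.
n_flat, n_classes = 2 * 2 * 5, 10
fc_w = [[random.gauss(0, 0.1) for _ in range(n_flat)] for _ in range(n_classes)]
fc_b = [0.0] * n_classes

probs = forward(features, kernels, fc_w, fc_b)
print("predicted class:", probs.index(max(probs)))
```

With untrained random weights the predicted class is meaningless; in practice the kernels and fully connected weights would be learned by a gradient-based update rule such as SGD, AdaGrad, or Adam.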

Keywords

★ speech recognition
★ deep learning
★ neural network

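Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1-1 Research Motivation
1-2 Literature Review
1-3 Chapter Organization
Chapter 2 Speech Recognition
2-1 Preprocessing
Chapter 3 Convolutional Neural Network
3-1 Convolutional Neural Network Architecture
3-1-1 Convolutional Layer
3-1-2 Pooling Layer
3-1-3 Fully Connected Layer
3-2 Activation Functions
3-3 Weight Update
3-3-1 Stochastic Gradient Descent (SGD)
3-3-2 AdaGrad
3-3-3 Adam
Chapter 4 Experimental Results
4-1 Effect of Network Depth on Recognition
4-2 Effect of the Activation Function on Recognition
4-3 Effect of the Weight Update Method on Recognition
4-4 Neural Network Optimization
Chapter 5 Conclusions and Future Work
5-1 Conclusions
5-2 Future Work
References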

Advisor

莊堯棠

Date of approval

2019-6-27
