References
[1] 呂易宸, "Speech-based door access control system (語音門禁系統)," Master's thesis, Department of Electrical Engineering, National Central University, 2011.
[2] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, pp. 43–49, 1978.
[3] C. S. Myers and L. R. Rabiner, "A level building dynamic time warping algorithm for connected word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-29, pp. 284–297, Apr. 1981.
[4] C. Myers, L. R. Rabiner, and A. E. Rosenberg, "Performance tradeoffs in dynamic time warping algorithms for isolated word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-28, no. 6, Dec. 1980.
[5] M. Gales and S. Young, "The application of hidden Markov models in speech recognition," Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195–304, 2007.
[6] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[7] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, pp. 19–41, 2000.
[8] S. Fine, J. Navratil, and R. A. Gopinath, "A hybrid GMM/SVM approach to speaker identification," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 417–420, 2001.
[9] E. Rodriguez, B. Ruiz, A. G. Crespo, and F. Garcia, "Speech/speaker recognition using a HMM/GMM hybrid model," in Proc. First International Conference on Audio- and Video-Based Biometric Person Authentication, pp. 227–234, Apr. 2003.
[10] E. Trentin and M. Gori, "A survey of hybrid ANN/HMM models for automatic speech recognition," Neurocomputing, vol. 37, no. 1, pp. 91–126, 2001.
[11] M. A. Al-Alaoui, L. Al-Kanj, J. Azar, and E. Yaacoub, "Speech recognition using artificial neural networks and hidden Markov models," IEEE Multidisciplinary Engineering Education Magazine, vol. 3, pp. 77–86, Sep. 2008.
[12] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[13] A. Mohamed, G. E. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
[14] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in Proc. IEEE ICASSP, 2013.
[15] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
[16] J.-T. Huang, J. Li, and Y. Gong, "An analysis of convolutional neural networks for speech recognition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[17] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, "Signature verification using a Siamese time delay neural network," in Advances in Neural Information Processing Systems, 1993.
[18] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 539–546, 2005.
[19] G. Koch, R. Zemel, and R. Salakhutdinov, "Siamese neural networks for one-shot image recognition," in ICML Deep Learning Workshop, vol. 2, 2015.
[20] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, "Fully-convolutional Siamese networks for object tracking," in European Conference on Computer Vision Workshops, pp. 850–865, Springer, 2016.
[21] 王小川, "Speech Signal Processing (語音訊號處理)," 全華, 2004.
[22] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Communication, vol. 17, pp. 177–192, 1995.
[23] S. Furui, "An overview of speaker recognition technology," in Proc. Workshop on Automatic Speaker Recognition and Identification, pp. 1–9, 1994.
[24] D. Burton, "Text-dependent speaker verification using vector quantization source coding," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, pp. 133–143, 1987.
[25] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, "Score normalization for text-independent speaker verification systems," Digital Signal Processing, vol. 10, pp. 42–54, 2000.
[26] 郭又禎, "Improved Mel-frequency cepstral coefficients for keyword extraction (改良式梅爾倒頻譜參數應用於關鍵字萃取)," Master's thesis, Department of Electrical Engineering, National Central University, 2014.
[27] J. R. Deller, J. G. Proakis, and J. H. L. Hansen, "Discrete-Time Processing of Speech Signals," Wiley-IEEE Press, 1999.
[28] R. Vergin, D. O'Shaughnessy, and A. Farhat, "Generalized Mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 5, 1999.
[29] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of the Acoustical Society of America, vol. 87, pp. 1738–1752, 1990.
[30] S. Ravuri and A. Stolcke, "Recurrent neural network and LSTM models for lexical utterance classification," in Proc. Interspeech, 2015.
[31] K. Yao, G. Zweig, M.-Y. Hwang, Y. Shi, and D. Yu, "Recurrent neural networks for language understanding," in Proc. Interspeech, Lyon, France, Aug. 2013.
[32] R. Sathya and A. Abraham, "Comparison of supervised and unsupervised learning algorithms for pattern classification," International Journal of Advanced Research in Artificial Intelligence (IJARAI), vol. 2, no. 2, 2013.
[33] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: a survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[34] O. Chapelle, B. Schölkopf, and A. Zien, "Semi-Supervised Learning," MIT Press, 2007.
[35] A. Subramanya and J. Bilmes, "Semi-supervised learning with measure propagation," Journal of Machine Learning Research, 2011.
[36] J. Wu, “Introduction to Convolutional Neural Networks,” 2017.
[37] 斎藤康毅, "Deep Learning: Fundamental Theory and Implementation of Deep Learning with Python (Deep Learning: 用Python進行深度學習的基礎理論實作)," translated by 吳嘉芳, 碁峰資訊, 2017.
[38] S. Wager, S. Wang, and P. Liang, "Dropout training as adaptive regularization," in Advances in Neural Information Processing Systems 26, pp. 351–359, 2013.
[39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[40] N. Srivastava, "Improving neural networks with dropout," Master's thesis, University of Toronto, Jan. 2013.
[41] L. N. Smith, "Cyclical learning rates for training neural networks," U.S. Naval Research Laboratory, 2015.
[42] B. Y. Hsueh, W. Li, and I-C. Wu, "Stochastic gradient descent with hyperbolic-tangent decay," arXiv preprint arXiv:1806.01593, 2018.
[43] M. D. Zeiler, "ADADELTA: an adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[44] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[45] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: surpassing human-level performance on ImageNet classification," in Proc. IEEE International Conference on Computer Vision (ICCV), 2015.
[46] S. Ioffe and C. Szegedy, "Batch normalization: accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[47] J. Bjorck, C. Gomes, B. Selman, and K. Q. Weinberger, "Understanding batch normalization," arXiv preprint arXiv:1806.02375, 2018.
[48] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., "TensorFlow: large-scale machine learning on heterogeneous systems," 2015.
[49] K. Wongsuphasawat, D. Smilkov, J. Wexler, J. Wilson, D. Mané, D. Fritz, D. Krishnan, F. B. Viégas, and M. Wattenberg, "Visualizing dataflow graphs of deep learning models in TensorFlow," IEEE Transactions on Visualization and Computer Graphics, vol. 24, no. 1, pp. 1–12, 2017.
[50] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in Proc. Interspeech, 2017.
[51] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[52] A. Mohamed, G. Hinton, and G. Penn, "Understanding how deep belief networks perform acoustic modelling," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4273–4276, 2012.