References
[1] G. Hinton et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
[2] P. Welch, “The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms,” IEEE Transactions on Audio and Electroacoustics, vol. 15, no. 2, pp. 70–73, 1967.
[3] S. S. Stevens and J. Volkmann, “The relation of pitch to frequency: A revised scale,” The American Journal of Psychology, vol. 53, no. 3, pp. 329–353, 1940.
[4] B. Logan, “Mel frequency cepstral coefficients for music modeling,” in Proceedings of ISMIR, vol. 270, pp. 1–11, Oct. 2000.
[5] X. Zhou, X. Zhuang, M. Liu, H. Tang, M. Hasegawa-Johnson, and T. Huang, “HMM-based acoustic event detection with AdaBoost feature selection,” in Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007. Springer, Berlin, Germany, 2008, pp. 345–353.
[6] G. J. Zapata-Zapata et al., “On-line signature verification using Gaussian mixture models and small-sample learning strategies,” Revista Facultad de Ingeniería Universidad de Antioquia, vol. 79, pp. 86–97, 2016.
[7] G. Xuan, W. Zhang, and P. Chai, “EM algorithms of Gaussian mixture model and hidden Markov model,” in Proceedings of the 2001 International Conference on Image Processing (ICIP), vol. 1, pp. 145–148, 2001.
[8] D. Yu and L. Deng, Automatic Speech Recognition. Springer, p. 23, 2016.
[9] F. Jelinek, “Up from trigrams! The struggle for improved language models,” in Second European Conference on Speech Communication and Technology, p. 24, 1991.
[10] R. Kneser and H. Ney, “Improved backing-off for m-gram language modeling,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 181–184, 1995.
[11] M. Mohri, F. Pereira, and M. P. Riley, “Weighted finite-state transducers in speech recognition,” Computer Speech & Language, vol. 16, no. 1, pp. 69–88, 2002.
[12] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, Dec. 1943.
[13] F. A. Makinde, C. T. Ako, O. D. Orodu, and I. U. Asuquo, “Prediction of crude oil viscosity using feed-forward back-propagation neural network (FFBPNN),” Petroleum and Coal, vol. 54, pp. 120–131, 2012.
[14] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, no. 6, pp. 386–408, 1958.
[15] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[17] S. M. Witt and S. J. Young, “Phone-level pronunciation scoring and assessment for interactive language learning,” Speech Communication, vol. 30, no. 2, pp. 95–108, 2000.
[18] F. Zhang et al., “Automatic mispronunciation detection for Mandarin,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5077–5080, 2008.
[19] L. Y. Chen and J. S. R. Jang, “Automatic pronunciation scoring with score combination by learning to rank and class-normalized DP-based quantization,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 11, pp. 1737–1749, 2015.
[20] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Proceedings of Interspeech. ISCA, 2015.
[21] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE Signal Processing Society, 2011.
[22] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, 2015.
[23] Y.-C. Hsu, B. Chen et al., “Evaluation metric-related optimization methods for Mandarin mispronunciation detection,” Computational Linguistics and Chinese Language Processing, vol. 21, no. 2, pp. 55–70, 2016.
[24] W. Hu, Y. Qian, and F. K. Soong, “Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers,” Speech Communication, vol. 67, pp. 154–166, 2015.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.
[26] W.-K. Leung, X. Liu, and H. Meng, “CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8132–8136, 2019.
[27] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning (ICML). ACM, pp. 369–376, 2006.
[28] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon Technical Report N, vol. 93, p. 27403, 1993.
[29] G. Zhao, S. Sonsaat, A. O. Silpachai, I. Lucic, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, “L2-ARCTIC: A non-native English speech corpus,” in Proceedings of Interspeech, 2018.