References
[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[2] S. Mekruksavanich, A. Jitpattanakul, and N. Hnoohom, “Negative emotion recognition using deep learning for Thai language,” in 2020 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI DAMT & NCON), pp. 71–74, IEEE, 2020.
[3] D. Issa, M. F. Demirci, and A. Yazici, “Speech emotion recognition with deep convolutional neural networks,” Biomedical Signal Processing and Control, vol. 59, p. 101894, 2020.
[4] P. Jackson and S. Haq, “Surrey Audio-Visual Expressed Emotion (SAVEE) database,” University of Surrey: Guildford, UK, 2014.
[5] S. R. Livingstone and F. A. Russo, “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English,” PLoS ONE, vol. 13, no. 5, p. e0196391, 2018.
[6] C. M. Lee, S. Narayanan, and R. Pieraccini, “Recognition of negative emotions from the speech signal,” in IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU’01., pp. 240–243, IEEE, 2001.
[7] C. Vaudable and L. Devillers, “Negative emotions detection as an indicator of dialogs quality in call centers,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5109–5112, IEEE, 2012.
[8] N. Hnoohom, A. Jitpattanakul, P. Inluergsri, P. Wongbudsri, and W. Ployput, “Multi-sensor-based fall detection and activity daily living classification by using ensemble learning,” in 2018 International ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI-NCON), pp. 111–115, IEEE, 2018.
[9] N. Hnoohom, S. Mekruksavanich, and A. Jitpattanakul, “Human activity recognition using triaxial acceleration data from smartphone and ensemble learning,” in 2017 13th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), pp. 408–412, IEEE, 2017.
[10] H. Ali, M. Hariharan, S. Yaacob, and A. H. Adom, “Facial emotion recognition using empirical mode decomposition,” Expert Systems with Applications, vol. 42, no. 3, pp. 1261–1277, 2015.
[11] A. Schirmer and R. Adolphs, “Emotion perception from face, voice, and touch: comparisons and convergence,” Trends in cognitive sciences, vol. 21, no. 3, pp. 216–228, 2017.
[12] D. H. Hubel and T. N. Wiesel, “Receptive fields of single neurones in the cat’s striate cortex,” The Journal of physiology, vol. 148, no. 3, pp. 574–591, 1959.
[13] K. Fukushima and S. Miyake, “Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition,” in Competition and cooperation in neural nets, pp. 267–285, Springer, 1982.
[14] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE transactions on acoustics, speech, and signal processing, vol. 37, no. 3, pp. 328–339, 1989.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, pp. 1097–1105, 2012.
[16] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” in Advances in neural information processing systems, pp. 379–387, 2016.
[17] P. Booth, An introduction to human-computer interaction (psychology revivals). Psy- chology Press, 2014.
[18] E. R. Harper, T. Rodden, Y. Rogers, A. Sellen, et al., “Human-computer interaction in the year 2020,” 2008.
[19] E. Cambria, A. Hussain, C. Havasi, and C. Eckl, “Sentic computing: Exploitation of common sense for the development of emotion-sensitive systems,” in Development of Multimodal Interfaces: Active Listening and Synchrony, pp. 148–156, Springer, 2010.
[20] K. Patil, P. Zope, and S. Suralkar, “Emotion detection from speech using MFCC & GMM,” Int. J. Eng. Res. Technol. (IJERT), vol. 1, no. 9, 2012.
[21] A. Hassan and R. I. Damper, “Multi-class and hierarchical SVMs for emotion recognition,” 2010.
[22] Y.-L. Lin and G. Wei, “Speech emotion recognition based on HMM and SVM,” in 2005 international conference on machine learning and cybernetics, vol. 8, pp. 4898–4901, IEEE, 2005.
[23] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[24] T. L. Nwe, S. W. Foo, and L. C. De Silva, “Speech emotion recognition using hidden Markov models,” Speech communication, vol. 41, no. 4, pp. 603–623, 2003.
[25] A. Nogueiras, A. Moreno, A. Bonafonte, and J. B. Mariño, “Speech emotion recognition using hidden Markov models,” in Seventh European conference on speech communication and technology, 2001.
[26] C.-W. Hsu, C.-C. Chang, C.-J. Lin, et al., “A practical guide to support vector classification,” 2003.
[27] T. Seehapoch and S. Wongthanavasu, “Speech emotion recognition using support vector machines,” in 2013 5th international conference on Knowledge and smart technology (KST), pp. 86–91, IEEE, 2013.
[28] J. Weng, N. Ahuja, and T. S. Huang, “Cresceptron: a self-organizing neural network which grows adaptively,” in [Proceedings 1992] IJCNN International Joint Conference on Neural Networks, vol. 1, pp. 576–581, IEEE, 1992.
[29] D. Bertero and P. Fung, “A first look into a convolutional neural network for speech emotion detection,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5115–5119, IEEE, 2017.
[30] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “CREMA-D: Crowd-sourced emotional multimodal actors dataset,” IEEE transactions on affective computing, vol. 5, no. 4, pp. 377–390, 2014.
[31] K. Dupuis and M. K. Pichora-Fuller, “Recognition of emotional speech for younger and older talkers: Behavioural findings from the Toronto Emotional Speech Set,” Canadian Acoustics, vol. 39, no. 3, pp. 182–183, 2011.
[32] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,” Language resources and evaluation, vol. 42, no. 4, pp. 335–359, 2008.
[33] R. Caruana, “Multitask learning,” Machine learning, vol. 28, no. 1, pp. 41–75, 1997.