Graduate Thesis 106523029: Detailed Record




Name: Jing-Ming Chen (陳靖明)    Department: Department of Communication Engineering
Thesis Title: 結合心理特徵與情緒標籤訓練之語音情感辨識技術
(Speech Emotion Recognition Based on Joint Training by Self-Assessment Manikins and Emotion Labels)
Related Theses
★ Super-Resolution for Satellite Images Based on Regional Weighting
★ Adaptive High Dynamic Range Image Fusion Algorithm Extending the Linear Characteristics of the Exposure Curve
★ Complexity Control of H.264 Video Coding Implemented on a RISC Architecture
★ Articulation Disorder Assessment Based on Convolutional Recurrent Neural Networks
★ Few-Shot Image Segmentation Using Mask Generation with a Meta-Learning Classification Weight Transfer Network
★ 3D Human Body Model Reconstruction from Images Using Implicit Representation with an Attention Mechanism
★ Object Detection Using Adversarial Graph Neural Networks
★ 3D Face Reconstruction Based on Weakly Supervised Learning of Deformable Models
★ Low-Latency Singing Voice Conversion on Edge Devices Using Unsupervised Representation Disentanglement Learning
★ Human Pose Estimation from FMCW Radar Based on a Sequence-to-Sequence Model
★ Monocular Semantic Scene Completion Based on Multi-Level Attention Mechanisms
★ Non-Contact Real-Time Vital Sign Monitoring with a Single FMCW Radar Based on Temporal Convolutional Networks
★ Video Traffic Description and Management in Video-on-Demand Networks
★ High-Quality Voice Conversion Based on Linear Predictive Coding and Pitch-Synchronous Frame Processing
★ Tone Adjustment Based on Formant Variation Extracted through Speech Resampling
★ Optimization of Transmission Efficiency for Real-Time Fine-Granularity Scalable Video over Wireless LANs
  1. The electronic version of this thesis is approved for immediate open access.
  2. The open-access full text is licensed only for academic research: personal, non-profit retrieval, reading, and printing.
  3. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.

Abstract (Chinese) With the development of artificial intelligence, interaction between humans and machines has become increasingly frequent; chatbots and home-care systems are common human-computer interaction applications. Emotion recognition can enhance this interaction, and emotion-aware robots can also be applied in healthcare, for example to recognize patients' emotional states. We aim to use deep learning to learn the emotional features in speech signals and thereby achieve emotion recognition.
This study, "Speech Emotion Recognition Based on Joint Training by Self-Assessment Manikins and Emotion Labels," proposes combining emotional features expressed as degrees of psychological state with emotion labels to jointly train a neural network and thereby raise speech emotion recognition accuracy. The study uses both a regression model and a classification model: the regression model predicts the degree of psychological state, while the classification model recognizes the emotion label. On a dataset mixing scripted and improvised scenarios, the proposed technique achieves a recognition accuracy of 64.70%; on the improvised-only dataset, it reaches 66.34%. Compared with the same technique without psychological-state features, these figures represent improvements of 2.95% and 2.09%, respectively, showing that psychological-state features effectively aid speech emotion recognition.
Abstract (English) With the development of artificial intelligence, interaction between humans and machines has become increasingly frequent; chatbots and home-care systems are common human-computer interaction applications. Emotion recognition can improve human-machine interaction, and emotion-aware robots can also be applied in medical settings, for example to identify patients' emotional states. The objective of this work is to develop a speech emotion recognition system that learns the emotional characteristics of audio using deep learning.
In this work, we propose a speech emotion recognition system that uses both a regression model and a classification model. The proposed technique achieves an accuracy of 64.70% on the dataset containing both scripted and improvised scenes; on the improvised-only dataset, the accuracy reaches 66.34%. Compared with the same system without mental-state features, accuracy improves by 2.95% and 2.09%, respectively, so mental-state features can effectively help speech emotion recognition.
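Both abstracts describe one network trained jointly on an emotion-label classification task and a psychological-state (Self-Assessment Manikin) regression task. The snippet below is a minimal PyTorch sketch of that kind of joint training, not the thesis's exact architecture: the layer sizes, the four emotion classes, the three SAM dimensions, and the loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointCRNN(nn.Module):
    def __init__(self, n_mels=40, n_classes=4, n_sam_dims=3):
        super().__init__()
        # Convolutional front end over a log-Mel spectrogram (batch, 1, n_mels, frames)
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Bidirectional GRU over the downsampled time axis
        self.gru = nn.GRU(input_size=32 * (n_mels // 4), hidden_size=64,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 64, n_classes)   # emotion-label head
        self.regressor = nn.Linear(2 * 64, n_sam_dims)   # SAM-score head (e.g. valence/arousal/dominance)

    def forward(self, x):
        h = self.conv(x)                         # (batch, 32, n_mels/4, frames/4)
        h = h.permute(0, 3, 1, 2).flatten(2)     # (batch, time, channels * freq)
        _, h_n = self.gru(h)                     # h_n: (2, batch, 64)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)  # last hidden state of both directions
        return self.classifier(h), self.regressor(h)

# Joint loss: cross-entropy on emotion labels plus MSE on SAM scores.
# The 0.5 weighting and the random tensors are placeholders for real data.
model = JointCRNN()
spectrograms = torch.randn(8, 1, 40, 200)        # 8 utterances, 40 Mel bands, 200 frames
emotion_labels = torch.randint(0, 4, (8,))
sam_scores = torch.rand(8, 3)
logits, sam_pred = model(spectrograms)
loss = nn.CrossEntropyLoss()(logits, emotion_labels) + 0.5 * nn.MSELoss()(sam_pred, sam_scores)
loss.backward()
```

Sharing one convolutional-recurrent trunk between the two heads lets the SAM regression targets act as auxiliary supervision for the emotion classifier, which is the effect the abstracts attribute to the psychological-state features.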
Keywords (Chinese) ★ Speech emotion recognition
★ Psychological-state features
★ Deep learning
★ Convolutional recurrent neural network
Keywords (English) ★ Speech emotion recognition
★ Self-Assessment Manikin
★ Deep learning
★ Convolutional recurrent neural network
Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
1-1  Research Background
1-2  Research Motivation and Objectives
1-3  Thesis Organization
Chapter 2  Speech Emotion Recognition
2-1  Introduction to Emotion
2-1-1  PAD Emotional State Model
2-1-2  Self-Assessment Manikin
2-2  Acoustic Feature Extraction
2-2-1  Spectrogram
2-2-2  Log Mel-Scale Spectrogram
2-2-3  Mel-Frequency Cepstral Coefficients
Chapter 3  Artificial Neural Networks and Deep Learning
3-1  Artificial Neural Networks
3-1-1  Single-Layer Perceptron
3-1-2  Multilayer Perceptron and Backpropagation
3-2  Deep Learning
3-2-1  Convolutional Neural Networks
3-2-2  Recurrent Neural Networks
3-2-3  Long Short-Term Memory Networks
3-2-4  Gated Recurrent Unit Networks
3-2-5  Bidirectional Recurrent Neural Networks
Chapter 4  Proposed Architecture
4-1  Speech Preprocessing
4-2  Convolutional Recurrent Neural Network Architecture
4-3  Training-Stage Parameter Settings
Chapter 5  Experiments and Analysis
5-1  Experimental Environment and Database
5-2  Evaluation Metrics
5-3  Comparison and Analysis of Experimental Results
Chapter 6  Conclusion and Future Work
References
Advisor: Pao-Chi Chang (張寶基)    Date of Approval: 2019-07-29
