Graduate Thesis 102582012: Detailed Record




Author: Yu-Hao Chin (秦餘皞)    Graduate Department: Computer Science and Information Engineering
Title: 音樂及語音情緒模型建立之研究 (A Study on Modeling Affective Content of Music and Speech)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Preprocessing
★ Application and Design of Speech Synthesis and Speaker Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Dictation System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Speeded-Up Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ Applying RetinaNet to Face Detection
★ Trend Prediction for Financial Products
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ A Study on End-to-End Mandarin Speech Synthesis
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Investigating the Correlation between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Deterioration and Stroke Surgery Survival
Files: Full text is permanently unavailable to the public.
Abstract (Chinese) Affective computing is a popular topic in machine learning. Its goal is to use mathematical methods to model the emotional content of various sources, such as music, faces, body language, and speech, and to recognize the modeled emotion with recognition algorithms. Among these emotional sources, we focus in particular on emotion recognition from audio. Music and speech are the two major emotional sources in the audio domain, and this thesis investigates affective computing for both.
Because music recommendation and retrieval have been applied in many areas in recent years, modeling music emotion has become a popular research topic. Although many related studies have been proposed, the task remains challenging because emotion is difficult to characterize properly; moreover, emotion is subjective, which makes data collection and modeling difficult. In this thesis, we do not assume that everyone perceives the emotion of a music piece in the same way. Instead, we propose a new machine learning approach that models music emotion as a probability distribution in the valence-arousal (VA) space, which not only describes emotion more precisely but also handles its subjectivity. The procedure is as follows. First, we use kernel density estimation to model emotion as a probability density function (PDF) in the VA space. Then, to link emotion to audio features, we learn a set of linear combination coefficients from the features through several objective functions, and linearly combine the PDFs of the training data with these coefficients to form the PDF of an unseen piece. We conduct experiments on the NTUMIR and MediaEval2013 datasets with several objective functions, and the results demonstrate the effectiveness of the proposed method in predicting emotion distributions. We also show how the proposed method can be used for emotion-based music recommendation.
Previous studies have found that music emotion is influenced by multiple factors, and that within a song the singing voice and the accompaniment may convey different emotions. Most MER studies treat a song as a single signal source and extract features from it, even though a song contains various instruments and singing voices, which are rarely examined separately. This thesis therefore extracts features from the singing voice and the accompaniment separately and investigates whether doing so helps predict dynamic VA values. The procedure is as follows. First, a deep recurrent neural network (DRNN) is trained to separate the two sources. Rhythm, timbre, tonality, energy, and pitch-related features are then extracted from each source and combined to predict the emotion of the original, unseparated music. For combining the emotions of the two sources, we propose and evaluate four DRNN architectures, and different ways of combining the features are also examined. Experiments on the MediaEval2013 dataset show that the above method improves recognition accuracy.
For the speech emotion part, this thesis proposes a speech emotion verification system based on emotion variance modeling and scale-frequency maps. The system consists of two parts: feature extraction and emotion verification. In feature extraction, for each frame we use the matching pursuit algorithm to select important atoms from a Gabor dictionary, and integrate the scale and frequency of each selected atom into a scale-frequency map, which is discriminative in the important frequency bands. Sparse representation is then used to transform the scale-frequency map into sparse coefficients to enhance robustness against emotion variance. In the verification stage, two scores are computed: one is a Gaussian-modeled residual error calculated from the sparse coefficients, and the other is an emotional agreement index calculated from the same coefficients. The two scores are jointly considered to obtain the verification result. Experiments on a speech emotion database show that the proposed system achieves an equal error rate (EER) of 6.61%. Comparisons with other methods also demonstrate its effectiveness.
Abstract (English)
Affective computing (AC) is an active topic in machine learning. One of its goals is to model the affective content of sources such as music, faces, body language, and speech via mathematical methods, and a recognition algorithm is then applied to the modeled affect. Among these emotional cues, we focus on emotion recognition from audio. Music and speech are two important affect-bearing audio sources, and both are investigated in this thesis.
Computationally modeling the affective content of music has been intensively studied in recent years because of its wide applications in music retrieval and recommendation. Although significant progress has been made, this task remains challenging due to the difficulty in properly characterizing the emotion of a music piece. Music emotion perceived by people is subjective by nature and thus complicates the process of collecting the emotion annotations as well as developing the predictive model. Instead of assuming people can reach a consensus on the emotion of music, in this work we propose a novel machine learning approach that characterizes the music emotion as a probability distribution in the valence-arousal (VA) emotion space, not only tackling the subjectivity but also precisely describing the emotions of a music piece. Specifically, we represent the emotion of a music piece as a probability density function (PDF) in the VA space via kernel density estimation from human annotations. To associate emotion with the audio features extracted from music pieces, we learn the combination coefficients by optimizing some objective functions of audio features, and then predict the emotion of an unseen piece by linearly combining the PDFs of the training pieces with the coefficients. Several algorithms for learning the coefficients are studied. Evaluations on the NTUMIR and MediaEval2013 datasets validate the effectiveness of the proposed methods in predicting the probability distributions of emotion from audio features. We also demonstrate how to use the proposed approach in emotion-based music retrieval.
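The following minimal Python sketch illustrates the two steps described above, assuming a fixed evaluation grid over the VA square [-1, 1]^2: each training clip's rater annotations are turned into a PDF with kernel density estimation, and an unseen clip's PDF is predicted as a convex combination of training PDFs. The softmax-of-feature-distance weights, the 8-dimensional features, and the grid size are illustrative assumptions, not the objective functions actually learned in the thesis.

```python
# Hedged sketch: KDE over VA annotations + linear combination of training PDFs.
import numpy as np
from scipy.stats import gaussian_kde

def va_pdf(annotations, grid):
    """annotations: (n_raters, 2) VA points; grid: (m, 2) evaluation points."""
    kde = gaussian_kde(annotations.T)            # KDE in the 2-D VA space
    return kde(grid.T)                           # PDF values on the grid, shape (m,)

def predict_pdf(train_feats, train_pdfs, test_feat, temperature=1.0):
    """Combine training PDFs with feature-similarity weights (illustrative only)."""
    dist = np.linalg.norm(train_feats - test_feat, axis=1)
    w = np.exp(-dist / temperature)
    w /= w.sum()                                 # convex combination coefficients
    return w @ train_pdfs                        # weighted sum of grid PDFs

# Toy usage: 3 training clips, 10 raters each, hypothetical 8-D audio features.
rng = np.random.default_rng(0)
grid = np.stack(np.meshgrid(np.linspace(-1, 1, 50),
                            np.linspace(-1, 1, 50)), axis=-1).reshape(-1, 2)
train_pdfs = np.stack([va_pdf(rng.normal(c, 0.2, size=(10, 2)), grid)
                       for c in ([0.5, 0.5], [-0.5, 0.3], [0.0, -0.6])])
train_feats = rng.normal(size=(3, 8))
print(predict_pdf(train_feats, train_pdfs, rng.normal(size=8)).shape)  # (2500,)
```

In the thesis the combination coefficients are learned by optimizing objective functions of the audio features; the distance-based weighting above only illustrates the linear-combination mechanism.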
It has been recognized that music emotion is influenced by multiple factors, and the singing voice and accompaniment of a song may sometimes express different emotions. However, most existing work on music emotion recognition (MER) treats the audio as a single source for feature extraction, even though most songs can be separated into a singing voice and accompaniments played by various instruments. The separated sources may provide additional information that helps improve MER performance, but they are seldom explored. This study aims to fill this gap by investigating whether considering the singing voice and accompaniments separately helps predict the dynamic VA values of music. Specifically, a deep recurrent neural network (DRNN)-based singing-voice separation algorithm was applied to separate the two sources. Rhythm, timbre, tonality, energy, and pitch-related features were then extracted from both sources and combined to predict the VA values of the original, unseparated music. For combining the sources, four variations of DRNN-based approaches were proposed and evaluated, and different combinations of the features extracted from the two sources were compared. Experiments on the MediaEval2013 dataset indicate that the proposed method improves prediction performance.
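As a rough PyTorch-style illustration of the fusion idea, and not any of the four DRNN variants evaluated in the thesis, the sketch below feeds frame-level features of the separated singing voice and accompaniment into two recurrent branches and regresses per-frame valence and arousal. The 64-dimensional inputs, the use of GRU cells, and the layer sizes are assumptions.

```python
# Hedged sketch: two-branch recurrent regressor for dynamic VA prediction.
import torch
import torch.nn as nn

class TwoSourceVARegressor(nn.Module):
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.voice_rnn = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.accomp_rnn = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(2 * hidden, 2)      # -> (valence, arousal) per frame

    def forward(self, voice_feats, accomp_feats):
        v, _ = self.voice_rnn(voice_feats)        # (batch, time, hidden)
        a, _ = self.accomp_rnn(accomp_feats)      # (batch, time, hidden)
        return self.head(torch.cat([v, a], dim=-1))  # (batch, time, 2)

# Toy usage: 4 clips, 60 frames, 64-D features from each separated source.
model = TwoSourceVARegressor()
va = model(torch.randn(4, 60, 64), torch.randn(4, 60, 64))
print(va.shape)  # torch.Size([4, 60, 2])
```

Other fusion strategies, such as concatenating the two feature sets before a single recurrent stack, are equally plausible readings of the description; the point is only that the separated sources contribute complementary information.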
For the speech emotion part, this thesis develops a speech-based emotion verification system built on emotion variance modeling and discriminant scale-frequency maps. The proposed system consists of two parts: feature extraction and emotion verification. In the first part, for each sound frame, important atoms are selected from a Gabor dictionary using the matching pursuit algorithm. The scale, frequency, and magnitude of the selected atoms are used to construct a scale-frequency map, which supports auditory discriminability through the analysis of critical bands. Sparse representation is then used to transform the scale-frequency maps into sparse coefficients, enhancing robustness against emotion variance. In the second part, emotion verification, two scores are calculated. A novel sparse representation verification approach based on Gaussian-modeled residual errors generates the first score from the sparse coefficients; the second score is the Emotional Agreement Index (EAI) computed from the same coefficients. The two scores are combined to obtain the final detection result. Experiments on an emotional speech database show that the proposed approach achieves an average equal error rate (EER) as low as 6.61%. A comparison among different approaches confirms that the proposed method is effective.
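The scale-frequency map construction can be sketched as a plain matching pursuit loop over a small Gabor-style dictionary, accumulating the magnitude of each selected atom at its (scale, frequency) bin. The dictionary size, frame length, and atom definition below are illustrative and do not reproduce the thesis's exact dictionary.

```python
# Hedged sketch: matching pursuit over a Gabor-style dictionary -> scale-frequency map.
import numpy as np

def gabor_atom(n, scale, freq_bin, n_freqs):
    t = np.arange(n) - n / 2
    sigma = n / (2 ** scale)                       # coarser scale -> wider envelope
    g = np.exp(-(t / sigma) ** 2) * np.cos(np.pi * freq_bin * t / n_freqs)
    return g / (np.linalg.norm(g) + 1e-12)

def scale_frequency_map(frame, n_scales=4, n_freqs=16, n_atoms=10):
    atoms = [(s, f, gabor_atom(len(frame), s, f, n_freqs))
             for s in range(n_scales) for f in range(n_freqs)]
    residual, sfm = frame.astype(float), np.zeros((n_scales, n_freqs))
    for _ in range(n_atoms):                       # matching pursuit iterations
        corrs = [residual @ g for _, _, g in atoms]
        k = int(np.argmax(np.abs(corrs)))
        s, f, g = atoms[k]
        sfm[s, f] += abs(corrs[k])                 # accumulate atom magnitude
        residual = residual - corrs[k] * g         # subtract best-matching atom
    return sfm

frame = np.random.default_rng(1).standard_normal(256)  # stand-in for one sound frame
print(scale_frequency_map(frame).shape)            # (4, 16)
```

In the proposed system, the resulting maps are further encoded with sparse representation, and verification combines a Gaussian-modeled residual-error score with the Emotional Agreement Index; those stages are omitted from this sketch.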
Keywords (Chinese) ★ 情緒辨識 (emotion recognition)
★ 音樂 (music)
★ 語音 (speech)
★ 機器學習 (machine learning)
Keywords (English) ★ emotion recognition
★ music
★ speech
★ machine learning
Table of Contents
Abstract (Chinese)
Abstract (English)
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Affective Computing
1.2 Emotion Recognition for Music
1.3 Emotion Recognition for Speech

Chapter 2 Related Work
2.1 Previous Work on Music Emotion Recognition
2.2 Previous Work on Speech Emotion Recognition

Chapter 3 Predicting the Probability Density Function of Music Emotion Using Emotion Space Mapping
3.1 System Overview
3.2 Estimating the Probability Density Function in VA Space
3.3 Predicting the Probability Density Function from Audio Features
3.3.1 Emotion Space Mapping
3.3.2 Mapping Factor Learning
3.3.3 Dimensionality Reduction of Audio Features
3.4 Experimental Setup
3.4.1 Datasets and Audio Features
3.4.2 Evaluation on Music Emotion Recognition
3.4.3 Evaluation on Emotion-Based Music Retrieval
3.5 Experimental Results
3.5.1 Results on Music Emotion Recognition
3.5.2 Results on Emotion-Based Music Retrieval
3.5.3 Evaluation of Different Mapping Factor Learning Methods

Chapter 4 Dynamic Music Emotion Recognition from Singing Voice and Accompaniments
4.1 System Overview
4.2 DRNN for VA Prediction
4.3 Experimental Setup
4.4 Experimental Results

Chapter 5 Speech Emotion Verification Using Emotion Variance Modeling and Discriminant Scale-Frequency Maps
5.1 System Overview
5.2 Feature Extraction
5.2.1 Scale-Frequency Maps
5.2.2 Prosodic Features
5.3 Verification
5.3.1 Sparse Representation Verification
5.3.2 Emotional Agreement Index (EAI)
5.4 Experimental Results
5.4.1 Evaluation of the Number of Gabor Atoms in the Proposed System
5.4.2 Comparison of the Proposed System with Other Approaches
5.4.3 Analysis of the Relationship between the EAI and Blended Emotions

Chapter 6 Conclusion
Bibliography
Advisor: Jia-Ching Wang (王家慶)    Date of Approval: 2017-08-21