NCU Institutional Repository (中大機構典藏): Item 987654321/74748


    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/74748


    Title: A Study on Modeling Affective Content of Music and Speech (音樂及語音情緒模型建立之研究)
    Authors: CHIN, YU-HAO (秦餘皞)
    Contributors: Department of Computer Science and Information Engineering
    Keywords: emotion recognition; music; speech; machine learning
    Date: 2017-08-21
    Upload time: 2017-10-27 14:38:12 (UTC+8)
    Publisher: National Central University
    Abstract: Affective computing (AC) is an active topic in machine learning. One of its goals is to model the affective content of various sources, such as music, facial expressions, body language, and speech, with mathematical methods, and then to recognize the modeled affect with recognition algorithms. Among these emotional cues, we focus on emotion recognition from audio. Music and speech are the two major affect-bearing sources in the audio domain, and both are investigated in this dissertation.
    Computationally modeling the affective content of music has been studied intensively in recent years because of its wide applications in music retrieval and recommendation. Although significant progress has been made, the task remains challenging because the emotion of a music piece is difficult to characterize properly: perceived emotion is subjective by nature, which complicates both the collection of emotion annotations and the development of predictive models. Instead of assuming that people reach a consensus on the emotion of music, in this work we propose a novel machine learning approach that characterizes music emotion as a probability distribution in the valence-arousal (VA) emotion space, which not only accommodates this subjectivity but also describes the emotion of a piece more precisely. Specifically, we represent the emotion of a music piece as a probability density function (PDF) in the VA space via kernel density estimation over the human annotations. To associate emotion with the audio features extracted from music pieces, we learn a set of combination coefficients by optimizing objective functions of the audio features, and then predict the emotion of an unseen piece as a linear combination of the PDFs of the training pieces weighted by these coefficients. Several algorithms for learning the coefficients are studied. Evaluations on the NTUMIR and MediaEval2013 datasets validate the effectiveness of the proposed methods in predicting the probability distribution of emotion from audio features. We also demonstrate how the proposed approach can be used for emotion-based music retrieval.
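    To make the distribution-based formulation concrete, the following Python sketch estimates a VA-space PDF from per-listener annotations with kernel density estimation and predicts the PDF of an unseen piece as a non-negative linear combination of training PDFs. The grid resolution, the Gaussian kernel, and the kernel-ridge-style rule for obtaining the coefficients are illustrative assumptions, not the exact objectives evaluated in the thesis.

        # Minimal sketch of distribution-based music emotion prediction (assumptions noted above).
        import numpy as np
        from scipy.stats import gaussian_kde

        def emotion_pdf(va_annotations, grid):
            """Estimate a PDF over the valence-arousal plane from per-listener
            annotations (N x 2 array) via kernel density estimation."""
            kde = gaussian_kde(va_annotations.T)   # 2-D KDE over (valence, arousal)
            return kde(grid)                       # evaluate on the shared VA grid, shape (G,)

        # Shared evaluation grid over [-1, 1]^2, flattened to shape (2, G).
        v, a = np.meshgrid(np.linspace(-1, 1, 32), np.linspace(-1, 1, 32))
        grid = np.vstack([v.ravel(), a.ravel()])

        # P[i] = emotion_pdf(annotations_of_piece_i, grid) for each training piece (P is M x G).

        def predict_pdf(x_test, X_train, P, lam=1.0):
            """Predict the VA distribution of an unseen piece as a linear combination
            of the training PDFs in P; the coefficients come here from a ridge-style
            fit on the audio features (one possible objective)."""
            K = X_train @ X_train.T + lam * np.eye(len(X_train))  # regularized Gram matrix
            w = np.linalg.solve(K, X_train @ x_test)              # combination coefficients
            w = np.clip(w, 0.0, None)
            w /= w.sum() + 1e-12                                  # keep the result a valid PDF
            return w @ P                                          # weighted sum of training PDFs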
    It has been recognized that music emotion is influenced by multiple factors, and the singing voice and the accompaniment of a song may express different emotions. However, most existing work on music emotion recognition (MER) treated the audio of a song as a single source for feature extraction, even though it can be separated into the singing voice and accompaniments played by various instruments. The separated sources may provide additional information that helps improve MER performance, but they have seldom been explored. This study fills this gap by investigating whether considering the singing voice and the accompaniment separately helps predict the dynamic VA values of music. Specifically, a deep recurrent neural network (DRNN)-based singing-voice separation algorithm was applied to separate the two sources. Rhythm, timbre, tonality, energy, and pitch-related features were then extracted from each source and combined to predict the VA values of the original, unseparated music. Four variations of DRNN-based approaches were proposed and evaluated for combining the two sources, and different combinations of the features extracted from the sources were compared. Experiments on the MediaEval2013 dataset indicate that this method improves recognition performance.
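    The following sketch shows one plausible early-fusion variant of this idea in Python/PyTorch: per-frame features from the separated singing voice and accompaniment are concatenated and fed to a recurrent network that regresses frame-level VA values. The GRU layers, layer sizes, and fusion scheme are illustrative assumptions; the four DRNN variants actually evaluated in the thesis are not reproduced here.

        # Minimal early-fusion sketch for dynamic VA regression from two separated sources.
        import torch
        import torch.nn as nn

        class TwoSourceVARegressor(nn.Module):
            """Concatenate per-frame features of the separated singing voice and
            accompaniment, then predict (valence, arousal) for every frame."""
            def __init__(self, vocal_dim, accomp_dim, hidden=128, layers=2):
                super().__init__()
                self.rnn = nn.GRU(vocal_dim + accomp_dim, hidden,
                                  num_layers=layers, batch_first=True)
                self.head = nn.Linear(hidden, 2)   # per-frame (valence, arousal)

            def forward(self, vocal_feats, accomp_feats):
                # vocal_feats, accomp_feats: (batch, frames, *_dim)
                x = torch.cat([vocal_feats, accomp_feats], dim=-1)
                h, _ = self.rnn(x)
                return self.head(h)                # (batch, frames, 2)

        # Typical training step, assuming frame-level VA annotations va_targets:
        # model = TwoSourceVARegressor(vocal_dim=64, accomp_dim=64)
        # loss = nn.functional.mse_loss(model(vocal_feats, accomp_feats), va_targets)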
    For the speech emotion part, this dissertation develops a speech-based emotion verification system built on emotion variance modeling and discriminant scale-frequency maps. The proposed system consists of two parts: feature extraction and emotion verification. In the first part, for each sound frame, important atoms are selected from a Gabor dictionary with the Matching Pursuit algorithm. The scale, frequency, and magnitude of the selected atoms are combined into a scale-frequency map, which supports auditory discriminability through the analysis of critical bands. Next, sparse representation is used to transform the scale-frequency maps into sparse coefficients, which enhances robustness against emotion variance. In the second part, emotion verification, two scores are calculated: the first comes from a novel sparse representation verification approach based on Gaussian-modeled residual errors, and the second is the Emotional Agreement Index (EAI) computed from the same coefficients. The two scores are combined to obtain the final verification result. Experiments on a database of emotional speech show that the proposed approach achieves an Equal Error Rate (EER) as low as 6.61%. Comparisons with other approaches confirm the effectiveness of the proposed system.
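    As an illustration of the feature-extraction stage, the sketch below runs greedy matching pursuit over a small parametric Gabor dictionary for a single frame and accumulates the scale and frequency of the selected atoms into a scale-frequency map. The atom definition, dictionary parameters, and iteration count are illustrative assumptions, and the downstream sparse-representation scoring is not reproduced.

        # Minimal matching-pursuit sketch producing a scale-frequency map for one frame.
        import numpy as np

        def gabor_atom(n, scale, freq):
            """Unit-norm Gabor atom: a Gaussian window of the given scale modulating
            a cosine of the given normalized frequency (illustrative parameterization)."""
            t = np.arange(n)
            g = np.exp(-np.pi * ((t - n / 2) / scale) ** 2) * np.cos(2 * np.pi * freq * t)
            return g / (np.linalg.norm(g) + 1e-12)

        def matching_pursuit_map(frame, scales, freqs, n_iter=20):
            """Greedy matching pursuit over a Gabor dictionary; returns a map indexed
            by (scale, frequency) that accumulates the magnitude of selected atoms."""
            atoms = [(i, j, gabor_atom(len(frame), s, f))
                     for i, s in enumerate(scales) for j, f in enumerate(freqs)]
            residual = np.asarray(frame, dtype=float).copy()
            sf_map = np.zeros((len(scales), len(freqs)))
            for _ in range(n_iter):
                corr = np.array([np.dot(residual, g) for _, _, g in atoms])
                k = int(np.argmax(np.abs(corr)))     # atom best matching the residual
                i, j, g = atoms[k]
                sf_map[i, j] += abs(corr[k])         # accumulate its magnitude in the map
                residual -= corr[k] * g              # remove its contribution
            return sf_map

        # Example: sf = matching_pursuit_map(frame, scales=[16, 32, 64], freqs=[0.05, 0.1, 0.2])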
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Theses & Dissertations

    Files in This Item:

    File          Description    Size    Format    Views
    index.html                   0Kb     HTML      327


    All items in NCUIR are protected by copyright, with all rights reserved.

