強健性音訊處理研究:從訊號增強到模型學習;A Study on Robust Audio Processing: From Signal Enhancement to Model Learning

Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/74776

Title:	強健性音訊處理研究:從訊號增強到模型學習;A Study on Robust Audio Processing: From Signal Enhancement to Model Learning
Authors:	李遠山;Lee, Yuan-Shan
Contributors:	Department of Computer Science and Information Engineering
Keywords:	Compressive Sensing;Deep Recurrent Neural Network;Joint Dictionary Learning;Hierarchical Dirichlet Process
Date:	2017-08-23
Issue Date:	2017-10-27 14:39:02 (UTC+8)
Publisher:	National Central University
Abstract:	Robustness against noise is a critical characteristic of an audio recognition (AR) system. To develop a robust AR system, this dissertation proposes two front-end processing methods that reduce the effect of interfering sounds on recognition. First, to suppress the effects of background noise on the target sound, a speech enhancement method based on compressive sensing (CS) is proposed. A quasi-SNR criterion is first used to determine whether each frequency bin in the spectrogram is reliable, and a corresponding time-frequency mask is designed to perform preliminary denoising of the noisy spectrum. The mask-extracted spectral components are regarded as partial observations, and CS theory is used to reconstruct the components that are missing from them, which improves the quality of the enhanced signal. Noise components that may be introduced during spectral reconstruction are then removed by multiplying the imputed spectrum by a derived optimal gain. Second, to separate the target sound from interfering sources, a source separation method based on a complex-valued deep recurrent neural network (C-DRNN) is developed. A key aspect of the C-DRNN is that its activations and weights are complex-valued. In contrast to existing deep learning methods, it operates directly on the complex spectrum: phase estimation is integrated into the network by constructing a deep, complex-valued regression model in the time-frequency domain, so the energy and phase of the target source are estimated simultaneously. In addition, a complex-valued masking layer added to the network smooths the separated source spectra, and a complex-valued discriminative term preserves the differences between target sources.
This dissertation also develops two novel methods with different characteristics for back-end recognition. The first is a joint kernel dictionary learning (JKDL) method for sound event classification. JKDL learns a collaborative representation instead of a sparse representation, so the learned representation is "denser" than the sparse representation learned by K-SVD. Moreover, adding a classification error term to the objective function allows a linear classifier to be trained while the dictionary is learned, which improves discriminative ability and saves time. The kernel method projects the training data into a high-dimensional feature space, which further improves recognition. The second is a hierarchical Dirichlet process mixture model (HDPMM) for music emotion annotation and retrieval. Because class boundaries in the real world are not clear-cut and categories can be ambiguous or overlap, the HDPMM lets components be shared among the models of the audio categories, so the proposed emotion models better capture the relationships between real-world emotional states. Because sharing components may also cause confusion between categories, a discriminative factor based on the concept of linear discriminant analysis is added to the system to improve classification.
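The reliability-mask step of the front-end enhancement method can be illustrated with a minimal Python sketch. The abstract does not define the quasi-SNR criterion, so this sketch assumes a simple stand-in: a bin is treated as reliable when its estimated SNR, computed from the noisy power and a per-bin noise-power estimate, exceeds a threshold in dB. The names `reliability_mask`, `apply_mask`, and `threshold_db`, and the stationary noise estimate, are hypothetical and are not taken from the dissertation.

```python
import math

EPS = 1e-12  # guards against log(0) and division by zero


def reliability_mask(noisy_power, noise_power, threshold_db=0.0):
    """Binary mask: 1 where a bin's estimated SNR exceeds threshold_db.

    noisy_power: list of frames, each a list of per-bin power values.
    noise_power: per-bin noise power estimate (assumed stationary here).
    """
    mask = []
    for frame in noisy_power:
        row = []
        for p, n in zip(frame, noise_power):
            # Assumed criterion: spectral-subtraction style SNR estimate.
            speech_est = max(p - n, EPS)
            snr_db = 10.0 * math.log10(speech_est / max(n, EPS))
            row.append(1 if snr_db > threshold_db else 0)
        mask.append(row)
    return mask


def apply_mask(noisy_power, mask):
    """Keep reliable bins; mark unreliable bins as missing (None)."""
    return [
        [p if m else None for p, m in zip(frame, row)]
        for frame, row in zip(noisy_power, mask)
    ]


# Toy spectrogram: 2 frames x 3 bins, unit noise power in every bin.
noisy = [[4.0, 1.1, 9.0], [1.0, 3.0, 1.2]]
noise = [1.0, 1.0, 1.0]

mask = reliability_mask(noisy, noise)
observed = apply_mask(noisy, mask)
print(mask)
print(observed)
```

The `None` entries mark the bins that a reconstruction stage, CS-based in the dissertation, would then have to estimate from the reliable ones. That reconstruction and the optimal gain are not sketched here because the abstract gives no details about them.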
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation
