Master's/Doctoral Thesis 102582003: Detailed Record




Name: Yuan-Shan Lee (李遠山)    Department: Computer Science and Information Engineering
Thesis Title: 強健性音訊處理研究:從訊號增強到模型學習
(A Study on Robust Audio Processing: From Signal Enhancement to Model Learning)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Preprocessing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Dictation System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Speeded-Up Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ Applying RetinaNet to Face Detection
★ Trend Prediction for Financial Products
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ A Study on End-to-End Mandarin Speech Synthesis
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Survival after Stroke Surgery
Files: Browse the thesis in the online system (full text never open to the public)
Abstract (Chinese): Robustness is a critical issue for audio recognition systems. This dissertation proposes two front-end processing methods to remove the influence of interfering sounds on an audio recognition system. First, for environmental noise, a speech enhancement method that incorporates compressive sensing (CS) is proposed. A time-frequency mask performs an initial denoising of the noisy spectrum; the spectrum remaining after masking is treated as an incomplete observation, and CS techniques are introduced to estimate the information missing from the spectrum, thereby improving the quality of the enhanced signal. Furthermore, an optimal gain is derived to remove noise components that may be introduced during spectral reconstruction. Second, for interfering sound sources, a source separation method based on a complex-valued deep recurrent neural network (C-DRNN) is proposed. In contrast to existing deep learning methods, the proposed method operates directly on the complex spectrum, which allows the magnitude and phase of the target sources to be estimated simultaneously and thus improves separation performance and quality. In addition, a complex-valued masking layer is added to the deep network architecture to smooth the spectra of the separated sources, and a complex-valued discriminative term preserves the differences between target sources. For back-end recognition, this dissertation also proposes two methods with different characteristics. First, the concept of collaborative representation is introduced and a sound event recognition system based on joint kernel dictionary learning (JKDL) is proposed. By adding a classification error term to the objective function, a linear classifier is trained while the dictionary is learned, which strengthens recognition ability and saves time; the kernel method projects the training data into a high-dimensional feature space to further improve recognition. Second, considering that real-world class boundaries are not clearly defined, that is, classes may be ambiguous or overlapping, a music emotion annotation and retrieval system is proposed that exploits the component-sharing property of the hierarchical Dirichlet process mixture model (HDPMM). Because shared components may cause confusion between classes, a discriminative factor based on the concept of linear discriminant analysis is added to the system to improve classification performance.
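As a rough illustration of the CS-based front end summarized above, the following is a minimal NumPy sketch. It assumes a pre-trained overcomplete magnitude dictionary (here called dictionary), uses a simple local-SNR threshold as a stand-in for the dissertation's quasi-SNR criterion, and uses plain orthogonal matching pursuit for the sparse recovery; the function and parameter names are illustrative, and the optimized gain derived in the dissertation for removing residual reconstruction noise is not reproduced.

    import numpy as np

    def quasi_snr_mask(noisy_mag, noise_mag, threshold_db=3.0):
        """Mark a time-frequency bin as reliable when a local SNR estimate exceeds
        a threshold (a simplified stand-in for the quasi-SNR criterion)."""
        snr_db = 20.0 * np.log10(noisy_mag / (noise_mag + 1e-12) + 1e-12)
        return snr_db > threshold_db            # boolean mask, True = reliable bin

    def omp(y, A, sparsity):
        """Orthogonal matching pursuit: greedily select atoms of A to approximate y."""
        residual, support = y.copy(), []
        for _ in range(sparsity):
            support.append(int(np.argmax(np.abs(A.T @ residual))))
            atoms = A[:, support]
            coef, *_ = np.linalg.lstsq(atoms, y, rcond=None)
            residual = y - atoms @ coef
        code = np.zeros(A.shape[1])
        code[support] = coef
        return code

    def impute_frame(noisy_frame, mask_frame, dictionary, sparsity=10):
        """Treat the reliable bins as a partial observation and reconstruct the full
        spectral frame from a sparse code over the overcomplete dictionary."""
        A_obs = dictionary[mask_frame, :]       # dictionary rows seen through the mask
        y_obs = noisy_frame[mask_frame]
        code = omp(y_obs, A_obs, sparsity)
        return dictionary @ code                # imputed full-band magnitude frame

In use, each frame of the noisy magnitude spectrogram would be masked, imputed, optionally rescaled by the optimized gain, and recombined with the noisy phase before the inverse STFT.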
Abstract (English): Robustness against noise is a critical characteristic of an audio recognition (AR) system. To develop a robust AR system, this dissertation proposes two front-end processing methods. To suppress the effects of background noise on the target sound, a speech enhancement method based on compressive sensing (CS) is proposed. A quasi-SNR criterion is first utilized to determine whether a frequency bin in the spectrogram is reliable, and a corresponding mask is designed. The mask-extracted components of the spectra are regarded as a partial observation, and CS theory is used to reconstruct the components that are missing from this partial observation. The noise component can be further removed by multiplying the imputed spectrum by an optimized gain. To separate the target sound from interference, a source separation method based on a complex-valued deep recurrent neural network (C-DRNN) is developed. A key aspect of the C-DRNN is that its activations and weights are complex-valued, and phase estimation is integrated into the C-DRNN by constructing a deep, complex-valued regression model in the time-frequency domain. This dissertation also develops two novel methods for back-end recognition. The first is a joint kernel dictionary learning (JKDL) method for sound event classification. JKDL learns a collaborative representation instead of a sparse representation; the learned representation is therefore "denser" than the sparse representation learned by K-SVD. Moreover, discriminative ability is improved by adding a classification error term to the objective function. The second is a hierarchical Dirichlet process mixture model (HDPMM), whose mixture components can be shared among the models of the audio categories. The shared components allow the proposed emotion models to better capture the relationships among real-world emotional states.
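As a rough, hedged sketch of the C-DRNN building blocks named above (a complex-valued activation, a complex ratio-masking layer, and a discriminative objective), the NumPy functions below show only the general shape of the computation. The exact CReLU and C-RM definitions, the recurrent weight updates via Wirtinger calculus, and the weighting of the discriminative term follow the dissertation; the particular definitions and the gamma weight used here are illustrative assumptions.

    import numpy as np

    def complex_relu(z):
        """One common complex ReLU: rectify the real and imaginary parts separately
        (the dissertation's CReLU may be defined differently)."""
        return np.maximum(z.real, 0.0) + 1j * np.maximum(z.imag, 0.0)

    def ratio_mask_layer(s1_hat, s2_hat, mixture, eps=1e-12):
        """Soft masking: rescale the two raw network outputs so their magnitudes
        partition the mixture spectrogram, then reapply the mixture's complex values.
        This mirrors the smoothing role of the C-RM layer, not its exact formulation."""
        denom = np.abs(s1_hat) + np.abs(s2_hat) + eps
        y1 = (np.abs(s1_hat) / denom) * mixture
        y2 = (np.abs(s2_hat) / denom) * mixture
        return y1, y2

    def discriminative_loss(y1, y2, t1, t2, gamma=0.05):
        """Squared-error objective with a discriminative term that penalizes each
        estimate's similarity to the other target, preserving inter-source
        differences (gamma is a hypothetical weight)."""
        mse = np.mean(np.abs(y1 - t1) ** 2 + np.abs(y2 - t2) ** 2)
        cross = np.mean(np.abs(y1 - t2) ** 2 + np.abs(y2 - t1) ** 2)
        return mse - gamma * cross

In training, the two raw estimates would come from the recurrent network applied to the mixture's complex STFT, with gradients flowing through the masking layer and the loss above.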
Keywords (Chinese): ★ Compressive Sensing
★ Deep Recurrent Neural Network
★ Joint Dictionary Learning
★ Hierarchical Dirichlet Process
Keywords (English): ★ Compressive Sensing
★ Recurrent Neural Network
★ Joint Dictionary Learning
★ Dirichlet Process
Table of Contents
Abstract (Chinese) xi
Abstract xiii
1 Introduction 1
1.1 Motivation 1
1.2 Speech Enhancement 3
1.3 Source Separation 4
1.4 Sound Event Recognition 5
1.5 Music Emotion Recognition 7
1.6 Organization of This Dissertation 8
2 Background 9
2.1 Compressive Sensing 9
2.2 Phase-incorporated Approaches 12
2.3 Collaborative Representation 15
2.4 Class-dependent Models 17
3 Compressive Sensing-Based Speech Enhancement 19
3.1 Proposed Method 20
3.1.1 Constructing an Overcomplete Dictionary 21
3.1.2 Missing Data Mask 22
3.1.3 Estimating Missing Data by CS 23
3.2 Experimental Results 30
3.2.1 Experimental Setting 30
3.2.2 Performance Metrics 31
3.2.3 Effects of Sparsity and Size of Training Dataset 31
3.2.4 Study of Ringing Artifacts from Imputation 33
3.2.5 Effects of Error Propagation 33
3.2.6 Baseline Algorithm 34
3.2.7 Experimental Results 35
4 Complex-valued Deep Neural Network for Phase-Incorporating Monaural Source Separation 39
4.1 Complex-valued Gradients 40
4.2 Complex-valued Deep Neural Network (C-DNN) 42
4.2.1 Sparse Model Training 42
4.2.2 Complex-Valued Rectified Linear Unit 44
4.2.3 Error Back-propagation for C-DNN 44
4.3 Complex-valued Deep Recurrent Neural Network (C-DRNN) 46
4.3.1 Complex-Valued Recurrent Model 47
4.3.2 Complex-Valued Ratio Masking Layer 48
4.3.3 Incorporating Discriminative Constraint into Objective Function 50
4.3.4 Back-Propagation Through Time for C-DRNN 51
4.4 Experimental Results 54
4.4.1 Dataset and Evaluation Criteria 54
4.4.2 Baseline Methods 55
4.4.3 Experimental Settings 55
4.4.4 Comparison between SReLU and CReLU 56
4.4.5 Effect of C-RM Layer 57
4.4.6 Effect of Discriminative Training 58
4.4.7 Comparing with Baseline Methods 59
5 Sound Event Classification Using Joint Kernel Dictionary Learning 63
5.1 Joint Dictionary Learning (JDL) 64
5.1.1 Representation coding step 64
5.1.2 Dictionary learning step 65
5.1.3 Classifier updating step 65
5.2 Joint Kernel Dictionary Learning (JKDL) 66
5.2.1 Representation coding step 66
5.2.2 Coefficient dictionary learning step 67
5.2.3 Classifier updating step 67
5.3 One-versus-One Classifier Extension 68
5.4 Classification 68
5.5 Incremental JKDL 69
5.6 Experiments 70
5.6.1 Parameters Selection 73
5.6.2 Effect of Different Training Data Numbers 74
5.6.3 Effect of Different Dictionary Sizes 76
5.6.4 Comparing the results in terms of precision, recall and F-score 77
5.6.5 Incremental JKDL 78
6 Hierarchical Dirichlet Process Mixture Model for Music Emotion Recognition 81
6.1 Proposed Method 82
6.1.1 Dirichlet Process Mixture Model 82
6.1.2 Hierarchical Dirichlet Process Mixture Model 84
6.1.3 Discriminant HDPMM 88
6.2 Experimental Results 95
6.2.1 Music Emotion Annotation 97
6.2.2 Music Emotion Retrieval 99
6.2.3 Music Emotion Retrieval 100
6.2.4 Discussion of the SR Method 101
7 Conclusion 103
Bibliography 107
A Publication List 121
A.1 Selected Journal Articles 121
A.2 Selected Conference Papers 122
Advisor: Jia-Ching Wang (王家慶)    Approval Date: 2017-08-23
