Master's/Doctoral Thesis 103582008: Detailed Record




Name Sih-Huei Chen (陳思卉)   Department Computer Science and Information Engineering
Thesis Title Probabilistic Latent Variable Model for Learning Data Representation
(機率型潛在變數模型於資料表示法學習)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Preprocessing
★ Application and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Dictation System
★ Recognition and Detection of Calcaneal Fractures in CT Images Using Deep Learning and Speeded-Up Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ Applying RetinaNet to Face Detection
★ Trend Prediction for Financial Instruments
★ A Study on Integrating Deep Learning to Predict Age and Aging-Related Genes
★ A Study on End-to-End Mandarin Speech Synthesis
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Stroke Surgery Survival
Files Online full text: permanently restricted (never open)
Abstract (Chinese) This dissertation proposes three representation learning methods based on probabilistic latent variable models, targeting both discrete and continuous latent spaces. For discrete latent variables, a hierarchical representation based on the Gaussian hierarchical latent Dirichlet allocation (G-hLDA) is proposed to capture the latent characteristics of low-level features. We learn latent representations of data by developing a tree-structured mixture model that adapts its own architecture, which models the subtle differences among classes well. For continuous latent variables, this dissertation proposes two representation learning methods. First, the complex-valued Gaussian process latent variable model (CGPLVM) is proposed to learn complex-valued representations of data. Its key concept is to assume that the complex-valued data are a function of corresponding low-dimensional latent variables, where the function is drawn from a complex Gaussian process. In addition, we attempt to preserve both the global and local structures of the data while encouraging the learned representation to be discriminative; we therefore augment the original CGPLVM objective with a locality-preserving term and a discriminative term designed for complex-valued data. Second, this dissertation proposes a deep collaborative learning method based on the variational autoencoder (VAE) and the Gaussian process classifier (GPC). The GPC is incorporated into the VAE so that class information is considered while the representation is learned and the classifier is trained simultaneously. The proposed representation distinguishes data variations among classes well and increases the discriminative ability of the original VAE-based representation. The performance of the developed methods is evaluated on multimedia data, and the experimental results demonstrate their superiority, especially when only a small amount of training data is available.
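A minimal sketch of the CGPLVM generative assumption above, in standard GPLVM notation (the kernel k, noise variance σ², and trade-off weights α, β below are illustrative assumptions, not definitions taken from the thesis):

\[
\mathbf{y}_n = f(\mathbf{x}_n) + \boldsymbol{\varepsilon}_n, \qquad
f \sim \mathcal{GP}_{\mathbb{C}}\big(0,\, k(\mathbf{x}, \mathbf{x}')\big), \qquad
\boldsymbol{\varepsilon}_n \sim \mathcal{CN}(\mathbf{0},\, \sigma^{2}\mathbf{I}),
\]

where \(\mathbf{y}_n \in \mathbb{C}^{D}\) is an observed complex-valued sample and \(\mathbf{x}_n \in \mathbb{C}^{Q}\), \(Q \ll D\), is its latent representation. The augmented objective described above then takes the schematic form

\[
\max_{\mathbf{X}} \; \log p(\mathbf{Y} \mid \mathbf{X}) \;-\; \alpha\, \Omega_{\mathrm{loc}}(\mathbf{X}) \;-\; \beta\, \Omega_{\mathrm{disc}}(\mathbf{X}),
\]

with \(\Omega_{\mathrm{loc}}\) and \(\Omega_{\mathrm{disc}}\) standing for the locality-preserving and discriminative terms for complex-valued data.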
Abstract (English) The probabilistic framework has emerged as a powerful tool for representation learning. This dissertation proposes probabilistic latent variable model-based representation learning methods that involve both discrete and continuous latent spaces. For a discrete latent space, a hierarchical representation based on the Gaussian hierarchical latent Dirichlet allocation (G-hLDA) is proposed for capturing the latent characteristics of low-level features. The representation is learned by constructing an infinitely deep and branching tree-structured mixture model, which effectively models the subtle differences among classes. For a continuous latent space, a novel complex-valued latent variable model, named the complex-valued Gaussian process latent variable model (CGPLVM), is developed for discovering a compressed complex-valued representation of complex-valued data. The key concept of the CGPLVM is that complex-valued data are approximated by a low-dimensional complex-valued latent representation through a function that is drawn from a complex Gaussian process. Additionally, we attempt to preserve both global and local data structures while promoting discrimination, and present a new objective function that incorporates a locality-preserving term and a discriminative term for complex-valued data. Finally, a deep collaborative learning framework based on a variational autoencoder (VAE) and a Gaussian process (GP) is proposed to represent multimedia data with greater discriminative power than previously achieved. A Gaussian process classifier is incorporated into the VAE to guide the VAE-based representation, which distinguishes variations of data among classes and achieves the dual goals of reconstruction and classification. The developed methods are evaluated using multimedia data. The experimental results demonstrate the superior performance of the proposed methods, especially when only a small amount of training data is available.
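As a rough illustration of the collaborative learning idea in the last part of the abstract (joint reconstruction and classification on a shared latent code), the following PyTorch sketch trains a VAE whose latent code also feeds a classifier. It is a simplified stand-in, not the thesis's implementation: a linear softmax head replaces the Gaussian process classifier, and all dimensions, names (CollaborativeVAE, collaborative_loss), and loss weights are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CollaborativeVAE(nn.Module):
    """Sketch of a VAE whose latent code is shared with a classifier,
    so reconstruction and classification are trained jointly. The thesis
    uses a Gaussian process classifier; a linear head stands in here."""
    def __init__(self, x_dim=64, z_dim=8, n_classes=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, z_dim)
        self.logvar = nn.Linear(128, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(),
                                 nn.Linear(128, x_dim))
        self.clf = nn.Linear(z_dim, n_classes)  # stand-in for the GP classifier

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # Reconstruct from the sampled code; classify the latent mean
        return self.dec(z), self.clf(mu), mu, logvar

def collaborative_loss(model, x, y, beta=1.0, gamma=1.0):
    """Joint objective: reconstruction + KL regularizer + classification."""
    x_hat, logits, mu, logvar = model(x)
    recon = F.mse_loss(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    clf = F.cross_entropy(logits, y)
    return recon + beta * kl + gamma * clf

# Usage sketch on toy data
model = CollaborativeVAE()
x = torch.randn(32, 64)          # toy batch of feature vectors
y = torch.randint(0, 10, (32,))  # toy class labels
loss = collaborative_loss(model, x, y)
loss.backward()

Because the classifier's gradient flows back into the encoder, the latent space is shaped by class information while the decoder still enforces reconstruction, which is the collaborative behavior the abstract describes.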
Keywords (Chinese/English) ★ Latent Variable Model (潛在變數模型)
★ Gaussian Process (高斯過程)
★ Deep Learning (深度學習)
Table of Contents Abstract (in Chinese) xi
Abstract xiii
Acknowledgement xv
1 Introduction 1
1.1 Motivation 1
1.2 Probabilistic Latent Variable Models 2
1.2.1 Discrete Latent Variable Models 2
1.2.2 Continuous Latent Variable Models 4
1.3 Organization of This Dissertation 7
2 Preliminaries 9
2.1 Gaussian Process (GP) 9
2.2 Gaussian Process Latent Variable Model (GPLVM) 11
2.3 Variational Auto-encoder (VAE) 13
2.4 Hierarchical Latent Dirichlet Allocation (hLDA) 14
3 Hierarchical Representation Based on Bayesian Non-parametric Tree-structured Mixture Model 19
3.1 Overview 19
3.2 Related Works 21
3.3 Model 23
3.3.1 Gaussian Hierarchical Latent Dirichlet Allocation (G-hLDA) 23
3.3.2 Probabilistic Inference 25
3.3.3 Hierarchical Representation 27
3.4 Experimental Results 28
3.4.1 Database and Experimental Settings 28
3.4.2 Performance Metrics 29
3.4.3 Convergence Analysis of the Proposed Method 30
3.4.4 Effects of Depth L 30
3.4.5 Discriminative Ability of Hierarchical Representation 31
3.4.6 Comparison of Proposed Methods and Baselines 35
3.5 Discussion 35
4 Complex-Valued Gaussian Process Latent Variable Model 37
4.1 Overview 37
4.1.1 Speech Enhancement 37
4.1.2 Sound Event Recognition 38
4.2 Related Works 40
4.2.1 Speech Enhancement 40
4.2.2 Sound Event Recognition 43
4.3 CGPLVM for Speech Enhancement 47
4.3.1 Missing data masks 47
4.3.2 GPLVM-based reconstruction of STFT magnitude 48
4.3.3 Phase-incorporating reconstruction of complex-valued STFT coefficient 49
4.4 CGPLVM for Sound Event Recognition 50
4.4.1 Complex-Valued Feature Extraction 50
4.4.2 CGPLVM-Based Robust Representation 53
4.5 Experimental Results 56
4.5.1 Speech Enhancement 56
4.5.2 Sound Event Recognition 65
4.6 Discussion 68
4.6.1 Speech Enhancement 68
4.6.2 Sound Event Recognition 69
5 Supervised Guiding in Complex-Valued Gaussian Process Latent Variable Model 71
5.1 Related Works 72
5.1.1 Face Recognition 72
5.1.2 Music Emotion Recognition 73
5.2 Model 74
5.2.1 Locality-Preserving and Discriminative Constraints for Complex-valued Data 74
5.2.2 Model Inference for LPD-CGPLVM 76
5.2.3 Prediction with New Test Complex-valued Data 76
5.3 Experimental Results 77
5.3.1 Visualization on MHMC database 77
5.3.2 Robust Face Recognition 79
5.3.3 Music Emotion Recognition 82
5.4 Discussion 83
6 Deep Collaborative Learning of Variational Auto-encoder and Gaussian Process 85
6.1 Overview 85
6.2 Related Works 86
6.3 Model 87
6.3.1 Preprocessing 87
6.3.2 Collaborative Learning 88
6.3.3 Model Inference 89
6.3.4 Prediction 90
6.4 Experimental Results 90
6.4.1 Experimental Settings and Performance Metrics 90
6.4.2 Baseline Methods 91
6.4.3 Settings of Parameters 92
6.4.4 Classification of Playing Techniques under Noisy Conditions 93
6.4.5 Comparison of Proposed Method and Baselines 94
6.5 Discussion 95
7 Conclusion and Future Work 97
7.1 Summary of Contributions 97
7.2 Future Work 98
Bibliography 101
A Gibbs Sampling for G-hLDA 113
B Derivation of Objective of Complex-Valued GPLVM 117
C Publication List 119
Advisor Jia-Ching Wang (王家慶)   Approval Date 2018-08-17
