References
[1] Y. Tagawa, A. Liutkus, R. Badeau, and G. Richard, “Gaussian Processes for Underdetermined Source Separation”, IEEE Transactions on Signal Processing, vol. 59, no. 7, Jul. 2011.
[2] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Deep Learning for Monaural Speech Separation”, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014.
[3] G. Logeshwari and G. S. Anandha Mala, “A Survey on Single Channel Speech Separation”, Proc. International Conference on Advances in Communication, Network, and Computing, pp. 387–392, Feb. 2012.
[4] M. Stetter, “Regression Methods for Source Separation”, in Imaging and Modeling Cortical Population Coding Strategies, pp. 105–124, 2012.
[5] S. Park and S. Choi, “Gaussian Process Regression for Voice Activity Detection and Speech Enhancement”, Proc. IEEE International Joint Conference on Neural Networks (IJCNN), Jun. 2008.
[6] M. N. Schmidt and R. K. Olsson, “Linear regression on sparse features for single-channel speech separation”, Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2007.
[7] Y. Xu, J. Du, L. R. Dai, and C. H. Lee, “A Regression Approach to Speech Enhancement Based on Deep Neural Networks”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–17, Jan. 2015.
[8] D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, and R. Horaud, “A Variational EM Algorithm for the Separation of Moving Sound Sources”, Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2015.
[9] P. Mowlaee and R. Saeidi, “Iterative Closed-Loop Phase-Aware Single-Channel Speech Enhancement”, IEEE Signal Processing Letters, vol. 20, no. 12, Dec. 2013.
[10] R. Boloix-Tortosa, E. Arias-de-Reyna, F. J. Payan-Somet, and J. J. Murillo-Fuentes, “Complex-Valued Gaussian Processes for Regression: A Widely Non-Linear Approach”, __, Nov. 2015.
[11] T. Gerkmann, M. Krawczyk-Becker, and J. Le Roux, “Phase Processing for Single-Channel Speech Enhancement: History and Recent Advances”, IEEE Signal Processing Magazine, vol. 32, pp. 55–66, Feb. 2015.
[12] Y. K. Lee, J. G. Park, and O. W. Kwon, “Speech Enhancement Using Phase-Dependent A Priori SNR Estimator in Log-Mel Spectral Domain”, ETRI Journal, vol. 36, no. 5, pp. 721–727, Oct. 2014.
[13] V. Zue, S. Seneff, and J. Glass, “Speech database development at MIT: TIMIT and beyond”, Speech Communication, vol. 9, pp. 351–356, Aug. 1990.
[14] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, Jul. 2006.
[15] Y. B. Lin, T. Pham, Y. S. Lee, and J. C. Wang, “Monaural source separation using nonnegative matrix factorization with graph regularization constraint”, Proc. Conference on Computational Linguistics and Speech Processing, Oct. 2015.
[16] T. G. Kang, K. Kwon, J. W. Shin, and N. S. Kim, “NMF-based target source separation using deep neural network”, IEEE Signal Processing Letters, vol. 22, no. 2, pp. 229–233, Feb. 2015.
[17] J. Eggert and E. Körner, “Sparse coding and NMF”, Proc. IEEE International Joint Conference on Neural Networks, vol. 4, pp. 2529–2533, 2004.
[18] S. Araki, S. Makino, H. Sawada, and R. Mukai, “Reducing musical noise by a fine shift overlap-add method applied to source separation using time-frequency mask”, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. III-81–III-82, 2005.
[19] G. Shi and P. Aarabi, “Robust digit recognition using phase-dependent time-frequency masking”, Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, Hong Kong, Apr. 2003, pp. 684–687.
[20] A. C. Lindgren, M. T. Johnson, and R. J. Povinelli, “Speech recognition using reconstructed phase space features”, Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, Hong Kong, Apr. 2003, pp. 60–63.
[21] R. Schlüter and H. Ney, “Using phase spectrum information for improved speech recognition performance”, Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, Salt Lake City, UT, May 2001, pp. 133–136.
[22] S. R. Quackenbush, T. P. Barnwell, and M. A. Clements, Objective Measures of Speech Quality. Englewood Cliffs, NJ: Prentice Hall, 1988.
[23] A. Damianou and N. Lawrence, “Deep Gaussian processes”, Proc. AISTATS, JMLR W&CP, vol. 31, pp. 207–215, 2013.
[24] T. Pham, Y. S. Lee, Y. B. Lin, T. C. Tai, and J. C. Wang, “Single Channel Source Separation Using Sparse NMF and Graph Regularization”, Proc. ASE BD&SI 2015, Oct. 2015.
[25] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook, Nov. 15, 2012.
[26] R. Boloix-Tortosa, F. J. Payan-Somet, E. Arias-de-Reyna, and J. J. Murillo-Fuentes, “Proper Complex Gaussian Processes for Regression”, CoRR abs/1502.04868, 2015.
[27] P. Smaragdis and J. C. Brown, “Non-negative matrix factorization for polyphonic music transcription”, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 177–180, Oct. 2003.
[28] K. Paliwal, K. Wojcicki, and B. Shannon, “The importance of phase in speech enhancement”, Speech Communication, vol. 53, no. 4, pp. 465–494, Apr. 2011.
[29] M. N. Schmidt, “Speech separation using non-negative features and sparse non-negative matrix factorization”, __, 2007.
[30] L. Csató and M. Opper, “Sparse online Gaussian processes”, Neural Computation, vol. 14, pp. 641–669, 2002.
[31] M. Kuss and C. E. Rasmussen, “Assessing approximate inference for binary Gaussian process classification”, Journal of Machine Learning Research, vol. 6, pp. 1679–1704, 2005.
[32] M. E. Tipping, “Sparse Bayesian learning and the Relevance Vector Machine”, Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001.
[33] B. W. Silverman, “Some aspects of the spline smoothing approach to non-parametric regression curve fitting”, Journal of the Royal Statistical Society, Series B, vol. 47, no. 1, pp. 1–52, 1985.
[34] C. E. Rasmussen, “Reduced rank Gaussian process learning”, technical report, 2002.
[35] M. Helén and T. Virtanen, “Separation of drums from polyphonic music using nonnegative matrix factorization and support vector machine,” Proc. Eur. Signal Process. Conf., 2005.
[36] L. Benaroya, F. Bimbot, L. McDonagh, and R. Gribonval, “Non negative sparse representation for Wiener based source separation with a single sensor”, Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), pp. 613–616, 2003.
[37] S. A. Abdallah and M. D. Plumbley, “Polyphonic transcription by nonnegative sparse coding of power spectra”, Int. Conf. Music Inf. Retrieval, pp. 318–325, Oct. 2004.
[38] P. Smaragdis and J. C. Brown, “Non-negative matrix factorization for polyphonic music transcription”, IEEE Workshop on Applications of Signal Process. Audio Acoust., pp. 177–180, 2003.
[39] C. Uhle, C. Dittmar, and T. Sporer, “Extraction of drum tracks from polyphonic music using independent subspace analysis”, Proc. 4th Int. Symp. Independent Compon. Anal. Blind Signal Separation, pp. 843–848, 2003.
[40] S. Haykin and Z. Chen, “The Cocktail Party Problem”, Neural Computation, vol. 17, pp. 1875–1902, Oct. 2005.
[41] R. S. Bolia, W. T. Nelson, and R. M. Morley, “Asymmetric performance in the cocktail party effect: Implications for the design of spatial audio displays”, Human Factors, vol. 43, pp. 208–216, 2001.
[42] K. Crispien and T. Ehrenberg, “Evaluation of the cocktail party effect for multiple speech stimuli within a spatial audio display”, Journal of the Audio Engineering Society, vol. 43, pp. 932–940, 1995.
[43] M. L. Hawley, R. Y. Litovsky, and J. F. Culling, “The benefit of binaural hearing in a cocktail party: Effect of location and type of interferer”, Journal of the Acoustical Society of America, vol. 115, pp. 833–843, 2004.
[44] W. A. Yost, R. H. Dye, Jr., and S. Sheft, “A simulated cocktail party with up to three sound sources”, Perception & Psychophysics, vol. 58, pp. 1026–1036, 1996.
[45] A. Bronkhorst, “The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions”, Acustica, vol. 86, pp. 117–128, 2000.
[46] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.
[47] C. E. Rasmussen, “Gaussian processes in machine learning”, available at http://www.cs.ubc.ca/hutter/earg/papers05/rasmussen_gps_in_ml.pdf, Jan. 2011.
[48] M. Ebden, “Gaussian processes for regression: A quick introduction”, available at http://www.robots.ox.ac.uk/mebden/reports/GPtutorial.pdf, Aug. 2008.
[49] M. Gibbs and D. J. MacKay, “Efficient implementation of Gaussian processes”, Technical report, 1997.
[50] B. Huhle, T. Schairer, A. Schilling, and W. Strasser, “Learning to localize with Gaussian process regression on omnidirectional image data”, Proc. 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5208–5213, Oct. 2010.
[51] J. Ko, D. Klein, D. Fox, and D. Haehnel, “Gaussian processes and reinforcement learning for identification and control of an autonomous blimp”, Proc. 2007 IEEE International Conference on Robotics and Automation (ICRA), pp. 742–747, Apr. 2007.
[52] I. G. Mattingly, “Speech synthesis for phonetic and phonological models”, in T. A. Sebeok (Ed.), Current Trends in Linguistics, vol. 12, Mouton, The Hague, pp. 2451–2487, 1974.
[53] A. Breen, “Speech Synthesis Models: A Review”, Electronics & Communication Engineering Journal, vol. 4, pp. 19–31, 1992.
[54] M. Macon and M. Clements, “Speech Concatenation and Synthesis Using an Overlap-Add Sinusoidal Model”, Proc. ICASSP 96, pp. 361–364, 1996.
[55] R. J. McAulay and T. F. Quatieri, “Speech analysis/synthesis based on a sinusoidal representation”, IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 34, pp. 744–754, 1986.
[56] T. F. Quatieri and R. J. McAulay, “Speech transformations based on a sinusoidal representation”, Proc. Int. Conf. Acoust., Speech, Signal Processing, p. 489, 1985.
[57] T. F. Quatieri and R. J. McAulay, “Speech transformations based on a sinusoidal representation”, Proc. Int. Conf. Acoust., Speech, Signal Processing, Tampa, FL, p. 489, 1985.
[58] X. Rodet and P. Depalle, “A new additive synthesis method using inverse Fourier transform and spectral envelopes”, Proceedings of International Computer Music Conference, pp. 410-411, 1992.
[59] A. Spanias, “Speech coding: A tutorial review”, Proc. IEEE, vol. 82, pp. 1541–1582, Oct. 1994.
[60] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall, 1978.
[61] X. Serra and J. Smith, “Spectral modeling synthesis: a sound analysis/synthesis system based on a deterministic plus stochastic decomposition”, Computer Music Journal, vol. 14, no. 4, pp. 12-24, 1990.
[62] P. Depalle and X. Rodet, “Synthèse additive par FFT inverse”, Rapport Interne IRCAM, Paris, 1990.
[63] Ph. Depalle and G. Poirot, “A modular system for analysis, processing and synthesis of sound signals”, Proc. of the Int. Comp. Music Conf., Montreal, Canada, 1991.
[64] X. Rodet, P. Depalle, and G. Poirot, “Speech Analysis and Synthesis Methods Based on Spectral Envelopes and Voiced/Unvoiced Functions”, European Conference on Speech Tech., Edinburgh, U.K., Sept. 1987.
[65] M. R. Portnoff, “Time-scale modification of speech based on short-time Fourier analysis”, IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, pp. 374–390, Jun. 1981.
[66] F. Rumsey and T. McCormick, Sound and Recording: An Introduction, Elsevier, 2002.
[67] C. E. Speaks, Introduction to Sound, Singular, 1999.
[68] I. Cohen and S. Gannot, “Spectral enhancement methods”, in Springer Handbook of Speech Processing, J. Benesty, M. M. Sondhi, and Y. Huang (Eds.), Springer, pp. 873–901, 2008.
[69] J. Du and Q. Huo, “A speech enhancement approach using piecewise linear approximation of an explicit model of environmental distortions,” Proc. Interspeech, pp. 569–572, 2008.
[70] D. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform”, IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236–243, 1984.
[71] F. J. Harris, “On the use of windows for harmonic analysis with the discrete Fourier transform”, Proc. IEEE, vol. 66, pp. 51–83, Jan. 1978.
[72] S. Seneff, “System to independently modify excitation and/or spectrum of speech waveform without explicit pitch extraction”, IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP–30, pp. 566–578, Aug. 1982.
[73] F. J. Harris, “On the use of windows for harmonic analysis with the discrete Fourier transform”, Proc. IEEE, vol. 66, pp. 51–83, Jan. 1978.
[74] K. K. Paliwal and L. D. Alsteris, “On the usefulness of STFT phase spectrum in human listening tests”, Speech Communication, vol. 45, pp. 153–170, 2005.
[75] J. B. Allen and L. R. Rabiner, “A unified approach to short-time Fourier analysis and synthesis”, Proc. IEEE, vol. 65, pp. 1558–1564, 1977.
[76] X. Serra and J. O. Smith III, “Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition,” Comput. Music J., vol. 14, pp. 12–24, 1990.
[77] R. J. McAulay and T. F. Quatieri, “Phase modeling and its application to sinusoidal transform coding,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 1713–1715, Apr. 1986.
[78] B. Yegnanarayana, D. K. Saikia, and T. R. Krishnan, “Significance of group delay functions in signal reconstruction from spectral magnitude or phase”, IEEE Trans. Acoust., Speech, Signal Processing, vol. 32, no. 3, pp. 610–622, 1984.
[79] K. K. Paliwal and L. Alsteris, “Usefulness of phase spectrum in human speech perception,” in Proc. Eur. Conf. Speech Communication and Technology (Eurospeech), Geneva, Switzerland, Sep. 2003, pp. 2117–2120.
[80] D. W. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236–243, Apr. 1984.
[81] B. Bozkurt, B. Doval, C. d'Alessandro, and T. Dutoit, “Improved differential phase spectrum processing for formant tracking”, Proc. ICSLP, Jeju, Korea, Oct. 2004.