Master's/Doctoral Thesis 103522066 Detailed Record




Author: Shu-Fan Wang (王書凡)    Department: Department of Computer Science and Information Engineering
Thesis Title (Chinese): 基於複數深層類神經網路之單通道訊號源分離
Thesis Title (English): Monaural Source Separation Based on Complex-valued Deep Neural Network
Full text: not available in the system (access permanently restricted)
Abstract (Chinese): Deep neural networks (DNNs) have become a popular approach to source separation. However, almost all DNN-based separation methods use only the magnitude spectrum of the mixed signal as training data, ignoring the phase, an important piece of information implicit in the short-time Fourier transform (STFT) coefficients. Recent studies have shown that incorporating phase information can improve the perceptual quality of the separated signals. In this thesis, we therefore retain the phase of the spectrum during separation and estimate the STFT coefficients of the target sources from the input mixture, treating the task as a regression problem in the complex domain. We develop a complex-valued deep neural network to learn the nonlinear mapping from the STFT coefficients of the mixture to those of the sources: the mixture is transformed to the time-frequency domain by the STFT, and its complex STFT coefficients are fed into the complex-valued deep neural network, so that magnitude and phase information are considered jointly. In addition, we extend the cost function with reconstruction and sparsity constraints to further improve separation performance. In the experiments, the proposed method is applied to speech separation and singing voice separation.
Abstract (English): Deep neural networks (DNNs) have become a popular means of separating a target source from a mixed signal. Almost all DNN-based methods modify only the magnitude spectrum of the mixture; the phase spectrum, which is inherent in the short-time Fourier transform (STFT) coefficients of the input signal, is left unchanged. However, recent studies have revealed that incorporating phase information can improve the perceptual quality of separated sources. Accordingly, in this paper, estimating the STFT coefficients of target sources from an input mixture is regarded as a regression problem. A fully complex-valued deep neural network is developed herein to learn the nonlinear mapping from the complex-valued STFT coefficients of a mixture to those of the sources. The proposed method is applied to speech separation and singing separation.
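The abstracts describe the core idea only in prose, so the following is a minimal Python/NumPy sketch of what a complex-valued mapping from mixture STFT coefficients to source STFT coefficients, together with a cost function carrying reconstruction and sparsity terms, might look like. It is an illustrative assumption, not the thesis's actual implementation: the names (split_tanh, ComplexDenseLayer, separation_cost), the layer sizes, the split-tanh activation, and the loss weights are all hypothetical.

import numpy as np

def split_tanh(z):
    # A common complex activation: tanh applied separately to the real and imaginary parts.
    return np.tanh(z.real) + 1j * np.tanh(z.imag)

class ComplexDenseLayer:
    # Fully connected layer with complex-valued weights, bias, and activations.
    def __init__(self, n_in, n_out, rng):
        self.W = 0.01 * (rng.standard_normal((n_out, n_in))
                         + 1j * rng.standard_normal((n_out, n_in)))
        self.b = np.zeros(n_out, dtype=np.complex128)

    def forward(self, x, activate=True):
        # x: complex STFT coefficients of one mixture frame.
        z = self.W @ x + self.b
        return split_tanh(z) if activate else z

def separation_cost(estimates, targets, mixture, lam_recon=0.1, lam_sparse=0.01):
    # Regression error on complex STFT coefficients, plus a reconstruction term
    # (the estimated sources should add back up to the mixture) and a sparsity
    # penalty, echoing the constraints mentioned in the Chinese abstract.
    mse = sum(np.mean(np.abs(e - t) ** 2) for e, t in zip(estimates, targets))
    recon = np.mean(np.abs(sum(estimates) - mixture) ** 2)
    sparse = sum(np.mean(np.abs(e)) for e in estimates)
    return mse + lam_recon * recon + lam_sparse * sparse

# Toy usage: one 513-bin mixture frame mapped to two estimated source frames,
# with a linear (non-activated) output layer for the complex regression targets.
rng = np.random.default_rng(0)
mix = rng.standard_normal(513) + 1j * rng.standard_normal(513)
hidden = ComplexDenseLayer(513, 1024, rng)
out_a, out_b = ComplexDenseLayer(1024, 513, rng), ComplexDenseLayer(1024, 513, rng)
h = hidden.forward(mix)
estimates = [out_a.forward(h, activate=False), out_b.forward(h, activate=False)]

Time-domain sources would then be recovered from the estimated complex coefficients by an inverse STFT; the split-real/imaginary activation is only one of several possible choices for a complex nonlinearity.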
Keywords (Chinese): ★ Deep Learning (深層學習)
★ Blind Source Separation (盲訊號源分離)
★ Phase (相位)
Keywords (English): ★ Deep Learning
★ Blind Source Separation
★ Phase
Table of Contents
Chinese Abstract i
Abstract ii
List of Figures iii
List of Tables v
Table of Contents vi
Chapter 1  Introduction 1
1.1 Background 1
1.2 Motivation and Objectives 2
1.3 Methodology and Chapter Overview 2
Chapter 2  Deep-Learning-Based Source Separation Methods and Literature Review 4
2.1 Perceptron-Based Separation Methods 5
2.1.1 Perceptron Architecture 5
2.2 Autoencoder-Based Separation Methods 8
2.2.1 Autoencoder Architecture 9
2.2.2 Time-Series Autoencoder 13
2.3 Recurrent-Neural-Network-Based Separation Methods 15
2.3.1 Recurrent Neural Network Architecture 17
2.4 Separation Methods Combining a Complex Mask with a Real-Valued Deep Neural Network 20
2.4.1 Derivation of the Complex Mask 20
2.4.2 Application of the Complex Ideal Ratio Mask to Source Separation 21
Chapter 3  Source Separation Based on Complex-Valued Deep Neural Networks 24
3.1 Application of Complex-Valued Deep Neural Networks to Source Separation 24
3.2 Complex-Valued Deep Neural Network Architecture 25
3.3 Comparison of Complex-Valued and Real-Valued Deep Neural Networks 26
3.4 Forward-Propagation Derivation for the Complex-Valued Deep Neural Network 27
3.5 Backpropagation Derivation for the Complex-Valued Deep Neural Network 27
3.6 Discussion of Activation Functions 36
3.7 Extensions of the Cost Function 39
Chapter 4  Audio Separation Experiments with the Complex-Valued Deep Neural Network 42
4.1 Experimental Environment and Complex-Valued Deep Neural Network Setup 42
4.2 Evaluation Criteria for Source Separation 43
4.3 Experimental Procedure 44
4.4 Speech Separation Results 45
4.5 Singing Voice Separation Results 47
Chapter 5  Conclusions and Future Work 49
Chapter 6  References 50
Advisor: Jia-Ching Wang    Date of Approval: 2016-08-25
