References
[1] L. Di Persia, D. H. Milone, H. L. Rufiner, and M. Yanagida, “Perceptual evaluation of blind source separation for robust speech recognition,” Signal Processing, vol. 88, no. 10, pp. 2578-2583, 2008.
[2] K. Reindl, Y. Zheng, and W. Kellermann, “Speech enhancement for binaural hearing aids based on blind source separation,” In Proc. ISCCSP, pp. 1-6, 2010.
[3] A. Liutkus, D. Fitzgerald, Z. Rafii, B. Pardo, and L. Daudet, “Kernel additive models for source separation,” IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4298-4310, 2014.
[4] P. Smaragdis, C. Févotte, G. J. Mysore, N. Mohammadiha, and M. Hoffman, “Static and dynamic source separation using nonnegative factorizations: a unified view,” IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 66-75, 2014.
[5] T. Pham, Y. S. Lee, Y. B. Lin, T. C. Tai, and J. C. Wang, “Single channel source separation using sparse NMF and graph regularization,” In Proc. of the ASE BigData and SocialInformatics, p. 55, 2015.
[6] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Deep learning for monaural speech separation,” In Proc. ICASSP, pp. 1562-1566, 2014.
[7] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Advances in Neural Information Processing Systems, vol. 13, Cambridge, MA, USA: MIT Press, 2001.
[8] M. N. Schmidt, “Speech separation using non-negative features and sparse non-negative matrix factorization,” Elsevier, 2007.
[9] A. Liutkus, D. Fitzgerald, Z. Rafii, B. Pardo, and L. Daudet, “Kernel additive models for source separation,” IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4298-4310, 2014.
[10] M. Kim and P. Smaragdis, “Mixtures of local dictionaries for unsupervised speech enhancement,” IEEE Signal Processing Letters, vol. 22, no. 3, pp. 293-297, 2015.
[11] J. Eggert and E. Körner, “Sparse coding and NMF,” In Proc. IEEE International Joint Conference on Neural Networks, vol. 4, pp. 2529-2533, 2004.
[12] D. Cai, X. He, J. Han, and T. Huang, “Graph regularized nonnegative matrix factorization for data representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1548-1560, 2010.
[13] P. Sprechmann, A. M. Bronstein, and G. Sapiro, “Real-time online singing voice separation from monaural recordings using robust low-rank modeling,” In Proc. ISMIR, pp. 67-72, 2012.
[14] W. Xu, X. Liu, and Y. Gong, “Document clustering based on non-negative matrix factorization,” In Proc. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 267-273, 2003.
[15] V. P. Pauca, F. Shahnaz, M. W. Berry, and R. J. Plemmons, “Text mining using non-negative matrix factorizations,” In Proc. SIAM International Conference on Data Mining, 2004.
[16] P. Hoyer, “Non-negative matrix factorization with sparseness constraints,” Journal of Machine Learning Research, vol. 5, pp. 1457-1469, 2004.
[17] C. Févotte and J. Idier, “Algorithms for nonnegative matrix factorization with the beta-divergence,” Neural Computation, vol. 23, no. 9, pp. 2421-2456, 2011.
[18] J. Le Roux, F. Weninger, and J. R. Hershey, “Sparse NMF – half-baked or well done?” Mitsubishi Electric Research Laboratories Technical Report, 2015.
[19] Y. Wang and D. L. Wang, “Towards scaling up classification-based speech separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 21, no. 7, pp. 1381-1390, 2013.
[20] S. Nie, S. Liang, H. Li, X. L. Zhang, Z. L. Yang, W. J. Liu, and L. K. Dong, “Exploiting spectro-temporal structures using NMF for DNN-based supervised speech separation,” In Proc. ICASSP, pp. 469-473, 2016.
[21] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” Advances in Neural Information Processing Systems, Cambridge, MA, USA: MIT Press, 2001.
[22] T. Virtanen, “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 3, pp. 1066-1074, 2007.
[23] F. Weninger, J. Le Roux, J. R. Hershey, and S. Watanabe, “Discriminative NMF and its application to single-channel source separation,” In Proc. INTERSPEECH, pp. 865-869, 2014.
[24] P. Sprechmann, A. M. Bronstein, and G. Sapiro, “Supervised non-Euclidean sparse NMF via bilevel optimization with applications to speech enhancement,” In Proc. Hands-free Speech Communication and Microphone Arrays, pp. 11-15, 2014.
[25] T. G. Kang, K. Kwon, J. W. Shin, and N. S. Kim, “NMF-based target source separation using deep neural network,” IEEE Signal Processing Letters, vol. 22, no. 2, pp. 229-233, 2015.
[26] C. Févotte, N. Bertin, and J. L. Durrieu, “Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis,” Neural Computation, vol. 21, no. 3, pp. 793-830, 2009.
[27] X. He and P. Niyogi, “Locality preserving projections,” In Proc. Neural Information Processing Systems, MIT Press, 2004.
[28] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Singing-voice separation from monaural recordings using deep recurrent neural networks,” In Proc. ISMIR, pp. 477-482, 2014.
[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533-536, 1986.
[30] D. C. Liu and J. Nocedal, “On the limited memory BFGS method for large scale optimization,” Mathematical Programming, vol. 45, no. 1-3, pp. 503-528, 1989.
[31] Y. H. Yang, “Low-rank representation of both singing voice and music accompaniment via learned dictionaries,” In Proc. ISMIR, pp. 427-432, 2013.
[32] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” Journal of Machine Learning Research, vol. 11, pp. 19-60, 2010.
[33] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462-1469, 2006.
[34] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, “Improving music source separation based on deep neural networks through data augmentation and network blending,” In Proc. ICASSP, pp. 261-265, 2017.
[35] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
[36] A. A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel audio source separation with deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1652-1664, 2016.
[37] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint, arXiv:1502.03167, 2015.
[38] T. Pham, Y. S. Lee, Y. B. Lin, Y. H. Li, T. C. Tai, and J. C. Wang, “Single channel source separation using graph sparse NMF and adaptive dictionary learning,” Intelligent Data Analysis, vol. 21, 2017.
[39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929-1958, 2014.
[40] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” In Advances in Neural Information Processing Systems, pp. 3320-3328, 2014.
[41] B. Logan, “Mel frequency cepstral coefficients for music modeling,” In Proc. ISMIR, 2000.
[42] J. Andén and S. Mallat, “Deep scattering spectrum,” IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4114-4128, 2014.
[43] J. Bruna, P. Sprechmann, and Y. Lecun, “Source separation with scattering non-negative matrix factorization,” In Proc. ICASSP, 2015.
[44] M. Cooke, J. Barker, S. Cunningham, and X. Shao, “An audio-visual corpus for speech perception and automatic speech recognition,” Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421-2424, 2006.
[45] V. Zue, S. Seneff, and J. Glass, “Speech database development at MIT: TIMIT and beyond,” Speech Communication, vol. 9, pp. 351-356, 1990.