Thesis Record 104522066: Detailed Information




Author: Kuo Yu (俞果)    Department: Computer Science and Information Engineering
Thesis Title: Complex-Valued Deep Recurrent Neural Network for Singing Voice Separation
(全複數深度遞迴類神經網路應用於歌曲人聲分離)
Related Theses:
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Pre-Processing
★ Application and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Dictation System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Accelerated Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ RetinaNet Applied to Face Detection
★ Financial Product Trend Prediction
★ Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Predicting Alzheimer's Disease Progression and Stroke Surgery Survival Using Deep Learning
Full text: access permanently restricted (not available for online viewing)
Abstract (Chinese): Deep neural networks (DNNs) have performed remarkably in multimedia signal processing, yet most DNN-based approaches handle only real-valued data; few are designed for complex-valued data, even though such data play an important role in the multimedia domain. This thesis therefore proposes a fully complex-valued deep recurrent neural network (C-DRNN) architecture for singing voice separation. The architecture directly processes the complex-valued data produced by the short-time Fourier transform (STFT), and its weights and activation functions are computed in the complex domain. The goal of this thesis is to separate the singing voice and the instrumental accompaniment from a song. During back-propagation, the cost function is differentiated in the complex domain to obtain complex-valued gradients. The output layer is also improved by adding a complex ratio mask, which ensures that the final estimated outputs do not exceed the input values, and a discriminative term is added during training to strengthen the network's modeling capability. Finally, the proposed method is evaluated on the MIR-1K dataset for singing voice separation, and the experimental results show that it outperforms other deep neural network approaches.
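The masking constraint described above can be made concrete with a short sketch. The following is a minimal illustration, assuming one plausible form of the complex ratio mask in which the two raw network outputs are renormalized against their sum and applied to the complex mixture spectrogram, so that the two estimates add up exactly to the input; the function and variable names are illustrative only, and the exact mask formulation used in the thesis may differ.

```python
import numpy as np

def complex_ratio_mask(z1, z2, mix_stft, eps=1e-8):
    """Illustrative complex ratio masking layer for two estimated sources.

    z1, z2   : raw complex-valued network outputs (frequency bins x frames)
    mix_stft : complex-valued STFT of the input mixture (same shape)

    The masks are constructed so that the two masked estimates sum to the
    mixture, i.e. the separated outputs cannot exceed the input values.
    """
    denom = z1 + z2
    denom = np.where(np.abs(denom) < eps, eps, denom)  # guard against division by zero
    voice = (z1 / denom) * mix_stft                    # estimated singing voice
    music = (z2 / denom) * mix_stft                    # estimated accompaniment
    return voice, music

# Quick check with random complex spectra (513 bins x 100 frames).
rng = np.random.default_rng(0)
shape = (513, 100)
z1 = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
z2 = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mix = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
voice, music = complex_ratio_mask(z1, z2, mix)
print(np.allclose(voice + music, mix))  # True: the estimates reconstruct the mixture
```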
Abstract (English): Deep neural networks (DNNs) have performed impressively in the processing of multimedia signals. Most DNN-based approaches were developed to handle real-valued data; very few have been designed for complex-valued data, despite such data being essential for processing various types of multimedia signals. Accordingly, this work presents a complex-valued deep recurrent neural network (C-DRNN) for singing voice separation. The C-DRNN operates in the complex-valued short-time Fourier transform (STFT) domain. A key aspect of the C-DRNN is that its activations and weights are complex-valued. The goal herein is to reconstruct the singing voice and the background music from a mixed signal. For error back-propagation, the CR calculus is utilized to calculate the complex-valued gradients of the objective function. To reinforce model regularity, two constraints are incorporated into the cost function of the C-DRNN. The first is an additional masking layer that ensures that the sum of the separated sources equals the input mixture. The second is a discriminative term that preserves the mutual difference between the two separated sources. Finally, the proposed method is evaluated on a singing voice separation task using the MIR-1K dataset. Experimental results demonstrate that the proposed method outperforms state-of-the-art DNN-based methods.
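The two constraints mentioned in the abstract, the masking layer and the discriminative term, enter training through a complex-valued cost. Below is a minimal sketch, assuming a squared-error cost with a discriminative weight gamma (a hypothetical hyperparameter) and the CR-calculus convention in which the descent direction is the derivative of the real-valued cost with respect to the conjugated variable; the thesis's full derivation (Chapter 3) propagates these gradients through the recurrent complex-valued layers, which is not shown here.

```python
import numpy as np

def discriminative_cost(v_hat, m_hat, v_ref, m_ref, gamma=0.05):
    """Squared-error cost with a discriminative term on complex spectra.

    v_hat, m_hat : complex-valued estimates of voice and accompaniment
    v_ref, m_ref : complex-valued reference (clean) spectra
    gamma        : weight of the discriminative term (hypothetical value)

    The discriminative term penalizes closeness of each estimate to the
    *other* source's reference, keeping the two outputs mutually distinct.
    """
    def sq(x):
        return np.sum(np.abs(x) ** 2)

    return (sq(v_hat - v_ref) + sq(m_hat - m_ref)
            - gamma * (sq(v_hat - m_ref) + sq(m_hat - v_ref)))

def wirtinger_gradients(v_hat, m_hat, v_ref, m_ref, gamma=0.05):
    """CR-calculus (Wirtinger) gradients dJ/d(conj(estimate)).

    For J = sum |e|^2 with e = estimate - reference, the derivative of J
    with respect to the conjugated estimate is simply e, so each output's
    gradient combines its own error with the discriminative counter-term.
    """
    g_v = (v_hat - v_ref) - gamma * (v_hat - m_ref)
    g_m = (m_hat - m_ref) - gamma * (m_hat - v_ref)
    return g_v, g_m
```

A gradient step would then move each estimate (or, via the chain rule, the complex-valued weights feeding it) in the negative direction of these conjugate gradients, which is the steepest-descent direction for a real cost of complex variables.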
Keywords: ★ Deep Neural Network  ★ Singing Voice Separation  ★ Phase Information
Table of Contents:
Chinese Abstract i
Abstract ii
List of Figures iii
List of Tables iv
Table of Contents v
Chapter 1  Introduction 1
1.1  Research Background and Objectives 1
1.2  Research Methods and Chapter Overview 2
Chapter 2  Deep-Learning-Based Audio Source Separation Methods and Literature Review 3
2.1  Separation Based on the Multi-Layer Perceptron (MLP) 4
2.1.1  Multi-Layer Perceptron Architecture 4
2.2  Separation Based on the Auto-Encoder (AE) 7
2.2.1  Auto-Encoder Architecture 8
2.2.2  Time-Series Auto-Encoder 12
2.3  Separation Based on the Recurrent Neural Network (RNN) 14
2.3.1  Recurrent Neural Network Architecture 16
2.4  Separation Based on the Complex-Valued Deep Neural Network (C-DNN) 19
2.4.1  C-DNN Architecture 19
2.4.2  Comparison of Complex-Valued and Real-Valued Networks 20
2.4.3  Forward and Back-Propagation in the C-DNN 21
Chapter 3  Fully Complex-Valued Deep Recurrent Neural Network for Singing Voice Separation 23
3.1  C-DRNN Architecture 23
3.2  Derivation of the C-DRNN Forward Pass 24
3.3  Derivation of C-DRNN Back-Propagation 27
3.4  C-DRNN Activation Functions 31
3.5  C-DRNN Weight Initialization 32
3.6  Adaptive Learning-Rate Algorithms for the C-DRNN 33
Chapter 4  Experiments on Applying the Complex-Valued Deep Neural Network to Audio Separation 35
4.1  Experimental Environment and Network Configuration 35
4.2  Comparison of Activation Functions and Adaptive Learning Methods 37
4.3  Training Effect of the Discriminative Term 38
4.4  Comparison with Other Baseline Methods 40
Chapter 5  Conclusions and Future Research Directions 43
Chapter 6  References 44
Advisor: Jia-Ching Wang (王家慶)    Date of Approval: 2017-08-14