Thesis 106522015: Detailed Record




Author: Yao-Ting Wang (王耀霆)    Department: Computer Science and Information Engineering
Title: Lightweight End-to-End Deep Learning Model for Music Source Separation
(端到端輕量化音樂源分離深度學習模型)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Front-End Processing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Spoken Narration System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Speeded-Up Robust Features
★ Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ RetinaNet Applied to Face Detection
★ Financial Product Trend Prediction
★ A Study of Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ A Study of End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Predicting Alzheimer's Disease Progression and Stroke Surgery Survival with Deep Learning Methods
Full text: browse via the thesis system (access permanently restricted)
Abstract (Chinese) Deep neural networks (DNNs) have made rapid progress in audio processing. Most earlier approaches operate on spectral information obtained through the short-time Fourier transform (STFT), but many of them process only the real-valued part of the spectrum. In recent years, to avoid the information loss caused by leaving the complex-valued information out of consideration, end-to-end deep learning models for audio source separation that work directly on time-domain signals have been proposed. These methods, however, are large and carry many parameters, which makes them hard to use on devices with limited computing power; in addition, they generally need a long input segment to achieve good separation quality, which implies high latency and offers little benefit to applications that require low latency.
Building on previous research, this thesis proposes a lightweight end-to-end deep learning model for music source separation that reduces the number of parameters and speeds up computation, and introduces a novel decoder that further improves separation quality when the input context length is limited. Experimental results show that the proposed method achieves better separation results than prior work while using no more than 10% of the previous parameter count.
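To make the parameter-count claim concrete, here is a minimal Python sketch comparing the weight count of a standard 1D convolution with the depthwise separable 1D convolution covered in Section 2-3-1, the kind of substitution that shrinks a convolutional model. The channel counts and kernel size below are assumptions for illustration only and are not the configuration used in the thesis.

# Minimal sketch: weight count of a standard vs. a depthwise separable 1D
# convolution (biases omitted). Layer sizes are assumed, not the thesis's.

def conv1d_params(c_in: int, c_out: int, k: int) -> int:
    """Standard 1D convolution: every output channel sees every input channel."""
    return c_in * c_out * k

def separable_conv1d_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise conv (one k-tap filter per input channel) + 1x1 pointwise conv."""
    return c_in * k + c_in * c_out

if __name__ == "__main__":
    c_in, c_out, k = 256, 256, 15                   # hypothetical configuration
    std = conv1d_params(c_in, c_out, k)             # 983,040 weights
    sep = separable_conv1d_params(c_in, c_out, k)   # 69,376 weights
    print(f"standard  : {std:,}")
    print(f"separable : {sep:,}")
    print(f"ratio     : {sep / std:.1%}")           # roughly 7% of the standard layer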
Abstract (English) Deep neural networks (DNNs) have made rapid progress in the field of audio processing. In the past, most approaches used spectral information obtained via the short-time Fourier transform (STFT), but they usually dealt only with the real-valued part. In recent years, to avoid the information loss caused by disregarding the complex-valued component, deep learning models that perform audio source separation end-to-end in the time domain have gradually been proposed. However, these models are huge, i.e., their number of parameters is very large, so they are difficult to use where the computing resources of the device are limited. Moreover, they generally require a long input context to obtain good separation results, which implies high latency and is of little help to applications that require low latency.
Based on previous research, this thesis proposes a lightweight end-to-end music source separation deep learning model that reduces the number of parameters and accelerates computation, and further proposes a novel decoder that improves the separation result when the input context length is limited. The experimental results show that the proposed method obtains better results than previous work while using only 10% or fewer of the parameters.
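The record does not spell out the proposed receptive-field-invariant decoder (Section 3-2). As a hedged illustration of one building block the table of contents does name, the efficient sub-pixel convolutional network of Section 2-5 (cf. [40]), the NumPy sketch below performs 1D sub-pixel ("pixel shuffle") upsampling, trading channels for time resolution without transposed convolutions. The shapes and upsampling factor are assumed; this is not the decoder actually proposed in the thesis.

# Minimal NumPy sketch of 1D sub-pixel (pixel-shuffle) upsampling, in the
# spirit of the efficient sub-pixel CNN [40]. Shapes are assumed; illustrative only.

import numpy as np

def subpixel_upsample_1d(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange (batch, time, channels * r) into (batch, time * r, channels)."""
    b, t, cr = x.shape
    assert cr % r == 0, "channel axis must be divisible by the upsampling factor"
    c = cr // r
    x = x.reshape(b, t, c, r)        # split the channel axis into (c, r) groups
    x = x.transpose(0, 1, 3, 2)      # (b, t, r, c): r sub-steps per time step
    return x.reshape(b, t * r, c)    # interleave the r sub-steps along time

if __name__ == "__main__":
    # e.g. a feature map with 4 time steps and 6 channels, upsampled by r = 2
    x = np.arange(2 * 4 * 6, dtype=np.float32).reshape(2, 4, 6)
    y = subpixel_upsample_1d(x, r=2)
    print(x.shape, "->", y.shape)    # (2, 4, 6) -> (2, 8, 3)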
Keywords (Chinese) ★ 深度學習 (Deep Learning)
★ 語音增強 (Speech Enhancement)
★ 音源分離 (Audio Source Separation)
Keywords (English) ★ Deep Learning
★ Speech Enhancement
★ Audio Source Separation
Table of Contents
Chinese Abstract i
Abstract ii
Table of Contents iii
List of Figures v
List of Tables vii
Chapter 1  Introduction 1
1-1 Research Background and Objectives 1
1-2 Research Methods and Chapter Overview 1
Chapter 2  Related Work and Literature Review 3
2-1 Semantic Segmentation Deep Learning Models Applied to Music Source Separation 4
2-2 End-to-End Deep Learning Models for Music Source Separation 9
2-2-1 Base Architecture 9
2-2-2 Constraints on the Output 11
2-2-3 Prediction with Proper Input Context 11
2-2-4 Stereo Input/Output and Learned Upsampling 11
2-3 Deep Learning Models for Image Semantic Segmentation 12
2-3-1 Depthwise Separable Convolution 12
2-3-2 Atrous Convolution 14
2-3-3 Atrous Spatial Pyramid Pooling 15
2-3-4 Encoder-Decoder with Atrous Spatial Pyramid Pooling 16
2-4 Convolutional Block Attention Module 16
2-5 Efficient Sub-Pixel Convolutional Neural Network 18
Chapter 3  Lightweight End-to-End Deep Learning Model for Music Source Separation 20
3-1 Encoder of the Lightweight End-to-End Music Source Separation Model 20
3-2 Receptive-Field-Invariant Decoder 22
3-3 Dynamic Convolution Kernels Based on Temporal Attention 25
3-4 Architecture of the Lightweight End-to-End Music Source Separation Model 27
Chapter 4  Music Source Separation Experiments 30
4-1 Experimental Setup 30
4-2 Retraining Wave-U-Net and Discussion of Pointwise Convolution Kernel Weights 34
4-3 Comparison of Results for the Various Improvements 36
4-4 Comparison of Model Parameter Counts and Input Receptive Fields 38
4-5 Music Source Separation Results 40
Chapter 5  Conclusions and Future Research Directions 43
Chapter 6  References 44
References [1] Yi Luo, Zhuo Chen, John R. Hershey, Jonathan Le Roux, and Nima Mesgarani, “Deep clustering and conventional networks for music separation: Stronger together,” In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 61–65, 2017.
[2] Andreas Jansson, Eric J. Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. “Singing voice separation with deep U-Net convolutional networks”. In Proceedings of the International Society for Music Information Retrieval Conference, pp. 323–332, 2017.
[3] Y. Luo and N. Mesgarani, “TasNet: Surpassing ideal time-frequency masking for speech separation,” arXiv preprint arXiv:1809.07454, 2018.
[4] Daniel Stoller, Sebastian Ewert, and Simon Dixon, “Wave-U-Net: A multi-scale neural network for end-to-end source separation,” In Proceedings of the International Society for Music Information Retrieval Conference, vol. 19, pp. 334–340, 2018.
[5] C. Lea, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks: A unified approach to action segmentation,” In European Conference on Computer Vision. Springer, pp. 47–54, 2016.
[6] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[7] S. Bai, J. Z. Kolter, and V. Koltun. “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling.” arXiv:1803.01271, 2018.
[8] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” In Conference on Computer Vision and Pattern Recognition, 2015.
[9] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation.” In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, 2015.
[10] F.-R. Stöter, A. Liutkus, and N. Ito, “The 2018 signal separation evaluation campaign,” In Proc. International Conference on Latent Variable Analysis and Signal Separation, 2018.
[11] Jen-Yu Liu and Yi-Hsuan Yang. “Denoising auto-encoder with recurrent skip connections and residual regression for music source separation.” In Proc. IEEE Int. Conf. Machine Learning and Applications, pp. 773–778, 2018.
[12] Naoya Takahashi and Yuki Mitsufuji. “Multi-scale multi-band densenets for audio source separation.” In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 21–25, 2017.
[13] Naoya Takahashi, Nabarun Goswami, and Yuki Mitsufuji. “MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation.” Proc. International Workshop on Acoustic Signal Enhancement, pp. 106–110, 2018.
[14] Wang, D., and Jae Lim. “The unimportance of phase in speech enhancement.” IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 679-681, 1982.
[15] Kazama, Michiko, et al. “On the significance of phase in the short term Fourier spectrum for speech intelligibility.” The Journal of the Acoustical Society of America, pp. 1432-1439, 2010.
[16] Gerkmann, Timo, Martin Krawczyk-Becker, and Jonathan Le Roux. “Phase processing for single-channel speech enhancement: History and recent advances.” IEEE signal processing Magazine, pp. 55-66, 2015.
[17] Moon, Sang-Hyun, Bonam Kim, and In-Sung Lee. “Importance of phase information in speech enhancement.” Complex, Intelligent and Software Intensive Systems, 2010.
[18] Paliwal, Kuldip, Kamil Wójcicki, and Benjamin Shannon. “The importance of phase in speech enhancement.” speech communication, pp. 465-494, 2011.
[19] Y. Tan, J. Wang, and J. M. Zurada, “Nonlinear blind source separation using a radial basis function network,” IEEE Transactions on Neural Networks, vol. 12, pp. 134-144, 2001.
[20] S. Pascual, A. Bonafonte, and J. Serrà, “SEGAN: Speech enhancement generative adversarial network,” INTERSPEECH, 2017.
[21] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
[22] Badrinarayanan, V., Kendall, A., Cipolla, R.: “SegNet: a deep convolutional encoder-decoder architecture for image segmentation.”, 2017.
[23] Grauman, K., Darrell, T.: “The pyramid match kernel: discriminative classification with sets of image features.” In IEEE International Conference on Computer Vision, 2005.
[24] Lazebnik, S., Schmid, C., Ponce, J. “Beyond bags of features: spatial pyramid matching for recognizing natural scene categories.” In Conference on Computer Vision and Pattern Recognition, 2006.
[25] He, K., Zhang, X., Ren, S., Sun, J. “Spatial pyramid pooling in deep convolutional networks for visual recognition.” In The European Conference on Computer Vision, 2014.
[26] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J. “Pyramid scene parsing network.” In Conference on Computer Vision and Pattern Recognition, 2017.
[27] Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: “Rethinking atrous convolution for semantic image segmentation.” arXiv:1706.05587, 2017
[28] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L. “DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs.” In The IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, 834–848, 2017
[29] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. “Encoder-decoder with atrous separable convolution for semantic image segmentation.” In The European Conference on Computer Vision, 2018.
[30] Sifre, L. “Rigid-motion scattering for image classification.” Ph.D. thesis, 2014
[31] Vanhoucke, V. “Learning visual representations at scale” (invited talk). In The International Conference on Learning Representations, 2014
[32] Howard, A.G., et al. “MobileNets: efficient convolutional neural networks for mobile vision applications.” arXiv:1704.04861, 2017
[33] Vincent Dumoulin and Francesco Visin. “A guide to convolution arithmetic for deep learning.” arXiv preprint arXiv:1603.07285, 2016.
[34] A. Vaswani, N. Shazeer, N. Parmar, and J. Uszkoreit, “Attention is all you need,” arXiv Preprint, arXiv:1706.03762, 2017.
[35] J. Hu, L. Shen, and G. Sun. “Squeeze-and-excitation networks.” arXiv preprint arXiv:1709.01507, 2017.
[36] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. “CBAM: convolutional block attention module.” In The European Conference on Computer Vision, 2018.
[37] Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, and Michael Auli. “Pay less attention with lightweight and dynamic convolutions.” In Proc. of The International Conference on Learning Representations, 2019.
[38] Chollet, F. “Xception: Deep learning with depthwise separable convolutions.” In Conference on Computer Vision and Pattern Recognition, 2017.
[39] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. “Mobilenetv2: Inverted residuals and linear bottlenecks.” Conference on Computer Vision and Pattern Recognition, 2018.
[40] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” In Conference on Computer Vision and Pattern Recognition, 2016.
[41] M. S. Sajjadi, R. Vemulapalli, and M. Brown, “Frame-recurrent video super-resolution,” In Conference on Computer Vision and Pattern Recognition, 2018.
[42] T.-J. Yang, M. D. Collins, Y. Zhu, J.-J. Hwang, T. Liu, X. Zhang, V. Sze, G. Papandreou, and L.-C. Chen,“DeeperLab: Single-Shot Image Parser,” arXiv preprint arXiv:1902.05093, 2019.
[43] Maas, Andrew L, Hannun, Awni Y, and Ng, Andrew Y. “Rectifier nonlinearities improve neural network acoustic models.” In International Conference on Machine Learning, vol. 30, 2013.
[44] Xu, B.; Wang, N.; Chen, T.; and Li, M. “Empirical evaluation of rectified activations in convolutional network.” arXiv preprint arXiv:1505.00853, 2015.
[45] Antoine Liutkus, Derry Fitzgerald, and Zafar Rafii. Scalable audio separation with light kernel additive modelling. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 76–80, 2015.
[46] BSSEval v4 Evaluation tools : https://github.com/sigsep/sigsep-mus-eval , available on 2019/6/30.
[47] Raspberry Pi 3 Model B : https://www.raspberrypi.org/products/raspberry-pi-3-model-b, available on 2019/6/30.
[48] E. Vincent, R. Gribonval, and C. Fevotte. “Performance measurement in blind audio source separation.” In IEEE Transactions on Audio, Speech, and Language Processing, pp. 1462–1469, 2006.
[49] Tensorflow 1.13.1: https://github.com/tensorflow/tensorflow/releases/tag/v1.13.1, available on 2019/6/30.
[50] Diederik P. Kingma and Jimmy Ba. “Adam: A method for stochastic optimization.” In Proc. of The International Conference on Learning Representations, 2015.
[51] Alice Cohen-Hadria, Axel Roebel, and Geoffroy Peeters. “Improving Singing Voice Separation Using Deep U-Net and Wave-U-Net with Data Augmentation.” submitted to the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019.
[52] C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. Yuille, and L. Fei-Fei. “Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation.” arXiv preprint arXiv:1901.02985, 2019.
[53] Wave-U-Net: https://github.com/f90/Wave-U-Net, available on 2019/6/30.
[54] C. Dong, C. C. Loy, K. He, and X. Tang. “Image super-resolution using deep convolutional networks.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[55] Roux, J. L., Wisdom, S., Erdogan, H., and Hershey, J. R. “SDR-half-baked or well done?” In IEEE International Conference on Acoustics, Speech and Signal Processing, 2019.
Advisor: Jia-Ching Wang (王家慶)    Date of Approval: 2019-7-31