References
[1] Philipos C Loizou. Speech Enhancement: Theory and Practice. CRC Press, 2007.
[2] Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition.
Prentice-Hall, Inc., 1993.
[3] Sebastian Ewert et al. “Score-informed source separation for musical audio
recordings: An overview”. In: IEEE Signal Processing Magazine 31.3 (2014),
pp. 116–124.
[4] David Murray, Lina Stankovic, and Vladimir Stankovic. “An electrical load
measurements dataset of United Kingdom households from a two-year longitudinal study". In: Scientific Data 4.1 (2017), pp. 1–12.
[5] Yahaya Isah Shehu et al. "Sokoto Coventry Fingerprint Dataset". In: arXiv preprint
arXiv:1807.10609 (2018).
[6] Christopher J Shallue and Andrew Vanderburg. “Identifying exoplanets with
deep learning: A five-planet resonant chain around Kepler-80 and an eighth planet around Kepler-90". In: The Astronomical Journal 155.2 (2018), p. 94.
[7] Ichrak Toumi, Stefano Caldarelli, and Bruno Torrésani. “A review of blind
source separation in NMR spectroscopy". In: Progress in Nuclear Magnetic Resonance Spectroscopy 81 (2014), pp. 37–64.
[8] Tuomas Virtanen. “Speech recognition using factorial hidden Markov models
for separation in the feature space”. In: ICSLP. 2006.
[9] Simon Arberet et al. "Blind spectral-GMM estimation for underdetermined instantaneous audio source separation". In: International Conference on Independent Component Analysis and Signal Separation. Springer. 2009, pp. 751–758.
[10] Seungjin Choi et al. "Blind source separation and independent component analysis: A review". In: Neural Information Processing – Letters and Reviews 6.1 (2005), pp. 1–57.
[11] Guoning Hu and DeLiang Wang. “A tandem algorithm for pitch estimation and
voiced speech segregation”. In: TASLP 18.8 (2010), pp. 2067–2079.
[12] Ke Hu and DeLiang Wang. “An unsupervised approach to cochannel speech
separation”. In: IEEE Transactions on audio, speech, and language processing 21.1
(2012), pp. 122–131.
[13] Joseph Keshet and Samy Bengio. "Spectral Clustering for Speech Separation". 2009.
[14] Guoning Hu and DeLiang Wang. "Monaural speech segregation based on pitch
tracking and amplitude modulation". In: IEEE Transactions on Neural Networks
15.5 (2004), pp. 1135–1150.
[15] Daniel D Lee and H Sebastian Seung. "Algorithms for non-negative matrix factorization". In: Advances in Neural Information Processing Systems. 2001, pp. 556–562.
[16] Tuomas Virtanen. “Monaural sound source separation by nonnegative matrix
factorization with temporal continuity and sparseness criteria". In: IEEE Transactions on Audio, Speech, and Language Processing 15.3 (2007), pp. 1066–1074.
[17] Paris Smaragdis. “Convolutive speech bases and their application to supervised
speech separation”. In: IEEE Transactions on Audio, Speech, and Language Processing 15.1 (2006), pp. 1–12.
[18] Umut Şimşekli, Jonathan Le Roux, and John R Hershey. "Non-negative source-filter dynamical system for speech enhancement". In: ICASSP. IEEE. 2014, pp. 6206–6210.
[19] Tomas Kounovsky and Jiri Malek. “Single channel speech enhancement using
convolutional neural network”. In: ECMSM. IEEE. 2017, pp. 1–5.
[20] Geoffrey E Hinton and Ruslan R Salakhutdinov. “Reducing the dimensionality
of data with neural networks". In: Science 313.5786 (2006), pp. 504–507.
[21] Shahla Parveen and Phil Green. "Speech enhancement with missing data techniques using recurrent neural networks". In: ICASSP. Vol. 1. IEEE. 2004, pp. I–733.
[22] Xugang Lu et al. “Speech enhancement based on deep denoising autoencoder.”
In: Interspeech. 2013, pp. 436–440.
[23] Yong Xu et al. “A regression approach to speech enhancement based on deep
neural networks”. In: IEEE/ACM TASLP 23.1 (2014), pp. 7–19.
[24] John R Hershey et al. “Deep clustering: Discriminative embeddings for segmentation and separation”. In: ICASSP. IEEE. 2016, pp. 31–35.
[25] Zhong-Qiu Wang, Jonathan Le Roux, and John R Hershey. “Alternative objective functions for deep clustering”. In: ICASSP. IEEE. 2018, pp. 686–690.
[26] Yusuf Isik et al. “Single-channel multi-speaker separation using deep clustering”. In: arXiv preprint arXiv:1607.02173 (2016).
[27] Zhong-Qiu Wang et al. “End-to-end speech separation with unfolded iterative
phase reconstruction”. In: arXiv preprint arXiv:1804.10204 (2018).
[28] Yanliang Jin et al. “Multi-Head Self-Attention-Based Deep Clustering for
Single-Channel Speech Separation”. In: IEEE Access 8 (2020), pp. 100013–100021.
[29] Zhuo Chen, Yi Luo, and Nima Mesgarani. "Deep attractor network for single-microphone speaker separation". In: ICASSP. IEEE. 2017, pp. 246–250.
[30] Dong Yu et al. "Permutation invariant training of deep models for speaker-independent multi-talker speech separation". In: ICASSP. IEEE. 2017, pp. 241–245.
[31] Morten Kolbæk et al. "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.10 (2017), pp. 1901–1913.
[32] Hakan Erdogan et al. "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks". In: ICASSP. IEEE. 2015, pp. 708–712.
[33] Donald S Williamson, Yuxuan Wang, and DeLiang Wang. "Complex ratio masking for monaural speech separation". In: IEEE/ACM TASLP 24.3 (2015), pp. 483–492.
[34] Yusuf Isik et al. "Single-Channel Multi-Speaker Separation Using Deep Clustering".
In: Interspeech. 2016.
[35] Yi Luo et al. "Deep clustering and conventional networks for music separation:
Stronger together”. In: ICASSP. IEEE. 2017, pp. 61–65.
[36] Yi Luo and Nima Mesgarani. "TasNet: Time-domain audio separation network for real-time, single-channel speech separation". In: ICASSP. IEEE. 2018, pp. 696–700.
[37] Yi Luo and Nima Mesgarani. "Conv-TasNet: Surpassing ideal time–frequency
magnitude masking for speech separation”. In: IEEE/ACM TASLP 27.8 (2019),
pp. 1256–1266.
[38] J. Wang et al. “Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect”. In: AAAI. 2021.
[39] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling". In: arXiv preprint arXiv:1803.01271 (2018).
[40] Max W. Y. Lam et al. "Mixup-breakdown: A consistency training method for improving generalization of speech separation models". In: ICASSP. IEEE. 2020.
[41] Neil Zeghidour and David Grangier. “Wavesplit: End-to-end speech separation
by speaker clustering”. In: arXiv preprint arXiv:2002.08933 (2020).
[42] Ziqiang Shi et al. "FurcaNet: An end-to-end deep gated convolutional, long short-term memory, deep neural networks for single channel speech separation". In: arXiv preprint arXiv:1902.00651 (2019).
[43] Yi Luo, Zhuo Chen, and Takuya Yoshioka. "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation". In: ICASSP. IEEE. 2020, pp. 46–50.
[44] Jingjing Chen, Qirong Mao, and Dong Liu. "Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation". In: arXiv preprint arXiv:2007.13975 (2020).
[45] Cem Subakan et al. “Attention is all you need in speech separation”. In: ICASSP.
IEEE. 2021, pp. 21–25.
[46] Yi Luo, Cong Han, and Nima Mesgarani. "Ultra-lightweight speech separation via group communication". In: ICASSP. IEEE. 2021.
[47] Yi Luo, Cong Han, and Nima Mesgarani. “Group Communication With Context Codec for Lightweight Source Separation”. In: IEEE/ACM TASLP 29 (2021),
pp. 1752–1761.
[48] Max W. Y. Lam et al. "Sandglasset: A Light Multi-Granularity Self-Attentive Network for Time-Domain Speech Separation". In: ICASSP. IEEE. 2021, pp. 5759–5763.
[49] Alexandre Défossez et al. “Music source separation in the waveform domain”.
In: arXiv preprint arXiv:1911.13254 (2019).
[50] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional networks for biomedical image segmentation". In: MICCAI. Springer. 2015, pp. 234–241.
[51] Laurent Sifre and Stéphane Mallat. "Rigid-motion scattering for image classification". In: arXiv preprint arXiv:1403.1687 (2014).
[52] Andrew G. Howard et al. "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications". In: arXiv preprint arXiv:1704.04861 (2017).
[53] Mark Sandler et al. "MobileNetV2: Inverted residuals and linear bottlenecks".
In: CVPR. 2018, pp. 4510–4520.
[54] Andrew Howard et al. "Searching for MobileNetV3". In: ICCV. 2019, pp. 1314–1324.
[55] Liang-Chieh Chen et al. "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs". In: TPAMI
40.4 (2017), pp. 834–848.
[56] Liang-Chieh Chen et al. “Rethinking atrous convolution for semantic image
segmentation”. In: arXiv preprint arXiv:1706.05587 (2017).
[57] Liang-Chieh Chen et al. “Encoder-decoder with atrous separable convolution
for semantic image segmentation”. In: ECCV. 2018, pp. 801–818.
[58] ITU-T Recommendation. “Perceptual evaluation of speech quality (PESQ): An
objective method for end-to-end speech quality assessment of narrow-band
telephone networks and speech codecs". In: Rec. ITU-T P.862 (2001).
[59] Cees H Taal et al. “An algorithm for intelligibility prediction of time–frequency
weighted noisy speech”. In: IEEE Transactions on Audio, Speech, and Language
Processing 19.7 (2011), pp. 2125–2136.
[60] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte. "Performance measurement in blind audio source separation". In: IEEE Transactions on Audio, Speech, and Language Processing 14.4 (2006), pp. 1462–1469.
[61] Trausti Kristjansson, Hagai Attias, and John Hershey. “Single microphone
source separation using high resolution signal reconstruction". In: ICASSP. Vol. 2. IEEE. 2004, pp. ii–817.
[62] Aarthi M Reddy and Bhiksha Raj. "A minimum mean squared error estimator for single channel speaker separation". In: ICSLP. 2004.
[63] Ameya N Deoras and Mark Hasegawa-Johnson. "A factorial HMM approach to
simultaneous recognition of isolated digits spoken by multiple talkers on one
audio channel”. In: ICASSP. Vol. 1. IEEE. 2004, pp. I–861.
[64] Stanley Smith Stevens, John Volkmann, and Edwin Broomell Newman. “A scale
for the measurement of the psychological magnitude pitch". In: The Journal of the Acoustical Society of America 8.3 (1937), pp. 185–190.
[65] Brian R Glasberg and Brian CJ Moore. “Derivation of auditory filter shapes from
notched-noise data". In: Hearing Research 47.1-2 (1990), pp. 103–138.
[66] Yuxuan Wang, Arun Narayanan, and DeLiang Wang. “On training targets for
supervised speech separation". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 22.12 (2014), pp. 1849–1858.
[67] Arun Narayanan and DeLiang Wang. “Ideal ratio mask estimation using deep
neural networks for robust speech recognition". In: ICASSP. IEEE. 2013, pp. 7092–7096.
[68] Mikkel N Schmidt and Rasmus K Olsson. “Single-channel speech separation
using sparse non-negative matrix factorization". In: ICSLP. 2006.
[69] Emad M Grais, Mehmet Umut Sen, and Hakan Erdogan. "Deep neural networks for single channel source separation". In: ICASSP. IEEE. 2014, pp. 3734–3738.
[70] Emad M Grais et al. “Single-channel audio source separation using deep neural
network ensembles”. In: Audio Engineering Society Convention 140. Audio Engineering Society. 2016.
[71] Po-Sen Huang et al. “Joint optimization of masks and deep recurrent neural
networks for monaural source separation". In: IEEE/ACM TASLP 23.12 (2015),
pp. 2136–2147.
[72] Meng Li et al. “Multi-layer Attention Mechanism Based Speech Separation
Model”. In: IEEE 19th ICCT. 2019, pp. 506–509.
[73] Sam Roweis. "One microphone source separation". In: Advances in Neural Information Processing Systems 13 (2000).
[74] Ozgur Yilmaz and Scott Rickard. “Blind separation of speech mixtures via
time-frequency masking". In: IEEE Transactions on Signal Processing 52.7 (2004),
pp. 1830–1847.
[75] Li Li and Hirokazu Kameoka. "Deep clustering with gated convolutional networks". In: ICASSP. IEEE. 2018, pp. 16–20.
[76] Yuzhou Liu and DeLiang Wang. "Causal Deep CASA for Monaural Talker-Independent Speaker Separation". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), pp. 2109–2118.
[77] Yann N Dauphin et al. "Language modeling with gated convolutional networks". In: International Conference on Machine Learning. 2017, pp. 933–941.
[78] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. "Dynamic routing between capsules". In: Advances in Neural Information Processing Systems. 2017, pp. 3856–3866.
[79] Peter Kabal. “TSP speech database”. In: McGill University, Database Version 1.0
(2002), pp. 09–02.
[80] John S Garofolo. "TIMIT acoustic-phonetic continuous speech corpus". In: Linguistic Data Consortium (1993).
[81] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014).
[82] Shrikant Venkataramani, Cem Subakan, and Paris Smaragdis. "Neural network alternatives to convolutive audio models for source separation". In: MLSP. IEEE. 2017, pp. 1–6.
[83] A. Gang, P. Biyani, and A. Soni. “Towards Automated Single Channel Source
Separation using Neural Networks”. In: arXiv preprint arXiv:1806.08086 (2018).
[84] Po-Sen Huang et al. "Deep learning for monaural speech separation". In: ICASSP.
IEEE. 2014, pp. 1562–1566.
[85] Y. Sun et al. “Monaural source separation based on adaptive discriminative criterion in neural networks”. In: DSP. IEEE. 2017, pp. 1–5.
[86] S. Qin et al. “Graph Convolution-Based Deep Clustering for Speech Separation”. In: IEEE Access 8 (2020), pp. 82571–82580.
[87] Yannan Wang et al. “A gender mixture detection approach to unsupervised single-channel speech separation based on deep neural networks”. In:
IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.7 (2017),
pp. 1535–1546.
[88] Yuzhou Liu and DeLiang Wang. "A CASA approach to deep learning based speaker-independent co-channel speech separation". In: ICASSP. IEEE. 2018, pp. 5399–5403.
[89] Ali Feizollah et al. “Comparative study of k-means and mini batch k-means
clustering algorithms in Android malware detection using network traffic analysis". In: 2014 International Symposium on Biometrics and Security Technologies (ISBAST). IEEE. 2014, pp. 193–197.
[90] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. Tech. rep. Stanford, 2006.
[91] Christian Buchta et al. "Spherical k-means clustering". In: Journal of Statistical Software 50.10 (2012), pp. 1–22.
[92] Sujan Kumar Roy, Aaron Nicolson, and Kuldip K Paliwal. "DeepLPC-MHANet: Multi-Head Self-Attention for Augmented Kalman Filter-Based Speech Enhancement". In: IEEE Access 9 (2021), pp. 70516–70530.
[93] Kuldip Paliwal, Kamil Wójcicki, and Belinda Schwerin. “Single-channel speech
enhancement using spectral subtraction in the short-time modulation domain”.
In: Speech Communication 52.5 (2010), pp. 450–475.
[94] Jimmy Ba and Rich Caruana. "Do Deep Nets Really Need to be Deep?" In: NIPS.
2014.
[95] Geoffrey E Hinton, Oriol Vinyals, and Jeff Dean. "Distilling the Knowledge in a Neural Network". In: arXiv preprint arXiv:1503.02531 (2015).
[96] Emilio Parisotto, Jimmy Ba, and Ruslan Salakhutdinov. "Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning". In: arXiv preprint arXiv:1511.06342 (2016).
[97] Ying Zhang et al. “Deep mutual learning”. In: CVPR. 2018, pp. 4320–4328.
[98] Yifang Yin et al. "Enhanced Audio Tagging via Multi- to Single-Modal Teacher-Student Mutual Learning". In: AAAI. Vol. 35. 12. 2021, pp. 10709–10717.
[99] Ryo Masumura et al. "End-to-End Automatic Speech Recognition with Deep Mutual Learning". In: APSIPA. IEEE. 2020, pp. 632–637.
[100] Jonathan Le Roux et al. "SDR – half-baked or well done?" In: ICASSP. IEEE. 2019,
pp. 626–630.
[101] Ryo Aihara et al. “Teacher-student deep clustering for low-delay single channel
speech separation”. In: ICASSP. IEEE. 2019, pp. 690–694.
[102] Jianping Gou et al. “Knowledge distillation: A survey”. In: IJCV 129.6 (2021),
pp. 1789–1819.
[103] Duc-Quang Vu, Ngan Le, and Jia-Ching Wang. "Teaching Yourself: A Self-Knowledge Distillation Approach to Action Recognition". In: IEEE Access 9
(2021), pp. 105711–105723.
[104] Duc-Quang Vu, Jia-Ching Wang, et al. "A Novel Self-Knowledge Distillation
Approach with Siamese Representation Learning for Action Recognition”. In:
VCIP. IEEE. 2021, pp. 1–5.
[105] Yi Luo, Zhuo Chen, and Nima Mesgarani. "Speaker-independent speech separation with deep attractor network". In: IEEE/ACM TASLP 26.4 (2018), pp. 787–796.
[106] Liwen Zhang et al. "FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks". In: MMM. Springer. 2020, pp. 653–665.
[107] Shrikant Venkataramani, Jonah Casebeer, and Paris Smaragdis. "End-to-end source separation with adaptive front-ends". In: ACSSC. IEEE. 2018, pp. 684–688.
[108] Efthymios Tzinis, Zhepei Wang, and Paris Smaragdis. "Sudo rm -rf: Efficient
networks for universal audio source separation”. In: MLSP. IEEE. 2020, pp. 1–6.
[109] Kristen Grauman and Trevor Darrell. “The pyramid match kernel: Discriminative classification with sets of image features”. In: ICCV. Vol. 2. IEEE. 2005,
pp. 1458–1465.
[110] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. "Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories”. In: CVPR. Vol. 2. IEEE. 2006,
pp. 2169–2178.
[111] John Garofolo et al. Continuous Speech Recognition (CSR-I) Wall Street Journal (WSJ0) news, complete. Linguistic Data Consortium, Philadelphia (1993).
[112] Efthymios Tzinis et al. “Two-step sound source separation: Training on learned
latent targets". In: ICASSP. IEEE. 2020, pp. 31–35.