Abstract: This dissertation addresses single-channel speech separation by exploiting deep neural networks (DNNs). We pursue three different directions. First, we approach single-channel source separation in the time-frequency domain, where embedding-based models such as deep clustering have achieved ground-breaking success. Inspired by deep clustering, we develop a new framework, namely Encoder Squash-norm Deep Clustering (ESDC). The results show that the proposed framework significantly improves the performance of single-channel acoustic decomposition in comparison with current techniques, including deep clustering, the deep extractor network (DENet), the deep attractor network (DANet), and several updated versions of deep clustering. Second, we propose monaural acoustic decomposition in the time domain based on the dual-path recurrent neural network (DPRNN), whose inter-segment and intra-segment architecture delivers cutting-edge performance and can model exceedingly long sequences. On top of this architecture, we introduce a new training scheme, selective mutual learning (SML), in which two DPRNNs exchange knowledge and learn from one another.
In particular, each network is guided by the high-confidence predictions of its peer, while the low-confidence predictions are disregarded. According to the experimental findings, selective mutual learning greatly outperforms other training methods such as independent training, knowledge distillation, and standard mutual learning with the same model design. Finally, we introduce a lightweight yet effective network for speech separation, namely SeliNet. SeliNet is a one-dimensional convolutional architecture that employs bottleneck modules and atrous temporal pyramid pooling. The experimental results show that SeliNet obtains state-of-the-art (SOTA) performance while maintaining a small number of floating-point operations (FLOPs) and a small model size.
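The confidence-gated guidance in selective mutual learning can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the per-frame confidence score `peer_conf`, the threshold `tau`, and the mimicry weight `alpha` are hypothetical, and plain MSE stands in for the actual separation objective (e.g. SI-SNR).

```python
import numpy as np

def selective_mutual_loss(pred, peer_pred, peer_conf, target, tau=0.5, alpha=0.1):
    """Sketch of selective mutual learning for one of the two networks.

    The network is trained on its own task loss (here plain MSE against the
    target) plus a mimicry term toward its peer, but the mimicry term is kept
    only on frames where the peer's confidence exceeds the threshold `tau`;
    low-confidence peer predictions are disregarded.
    """
    task_loss = np.mean((pred - target) ** 2)
    mask = (peer_conf > tau).astype(float)   # 1.0 on high-confidence frames
    if mask.sum() == 0:
        return task_loss                     # no confident peer frames: task loss only
    mimic = np.sum(mask * (pred - peer_pred) ** 2) / mask.sum()
    return task_loss + alpha * mimic
```

In full mutual learning both networks are updated this way symmetrically, each using the other's masked predictions as an extra teaching signal.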
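Atrous temporal pyramid pooling, as used in SeliNet, applies parallel dilated temporal convolutions with increasing dilation rates and combines their outputs, so the network aggregates several temporal receptive fields at once. A minimal single-channel NumPy sketch follows; the 3-tap kernels, the rates `(1, 2, 4, 8)`, and stacking as the combination step are illustrative assumptions, not SeliNet's exact configuration.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """'Same'-padded 1-D convolution of signal x with a dilated 3-tap kernel w.

    Output[t] = w[0]*x[t-d] + w[1]*x[t] + w[2]*x[t+d], with zero padding,
    so larger dilations widen the receptive field without extra parameters.
    """
    pad = dilation
    xp = np.pad(x, pad)
    return (w[0] * xp[: len(x)]
            + w[1] * xp[pad: pad + len(x)]
            + w[2] * xp[2 * pad: 2 * pad + len(x)])

def atrous_temporal_pyramid(x, weights, rates=(1, 2, 4, 8)):
    """Run parallel dilated convs at each rate and stack them as output channels."""
    return np.stack([dilated_conv1d(x, w, r) for w, r in zip(weights, rates)])
```

In a real network each branch would be a learned multi-channel convolution and the stacked outputs would typically be fused by a pointwise (1x1) convolution.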