On the limitations of diffusion-based speech enhancement models and an adaptive selection strategy


Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/97185

Title:	On the limitations of diffusion-based speech enhancement models and an adaptive selection strategy
Authors:	林源煜;Yu, Lim Yuan
Contributors:	International Graduate Program in Artificial Intelligence
Keywords:	Speech Enhancement;Diffusion Model;Audio Spectrogram Transformer;DNSMOS;Spectral Entropy
Date:	2025-07-28
Issue Date:	2025-10-17 10:56:25 (UTC+8)
Publisher:	National Central University
Abstract:	Diffusion probabilistic models have emerged as the state of the art in speech enhancement (SE), capable of generating high-fidelity audio. However, their practical application is often hindered by significant performance variability across models and acoustic conditions. A single, universally optimal model rarely exists, and the input signal characteristics that determine whether a given enhancement approach succeeds or fails remain poorly understood. This thesis addresses these challenges by proposing a novel two-stage intelligent model recommendation system that dynamically selects the most suitable SE model for a given noisy input. To enable this, we first introduce a set of spectral features based on cross-entropy and KL divergence, which are shown to be statistically significant in characterizing enhancement difficulty and in identifying the conditions under which each model performs best. The proposed recommender system employs a "gatekeeper-expert" architecture to manage the severe class imbalance inherent in the model selection task. The system is trained on a comprehensive evaluation of three leading diffusion models: SGMSE+, StoRM, and CDiffuSE. Extensive experiments demonstrate that fine-tuned pre-trained backbones, such as EfficientNet-B0 and the Audio Spectrogram Transformer (AST), achieve high classification accuracy on the recommendation task. Ablation studies confirm that a hybrid input, combining Mel-spectrograms with the proposed spectral features, further improves performance. Crucially, end-to-end evaluation shows that the recommendation-driven approach achieves superior or highly competitive average speech enhancement quality (as measured by DNSMOS) compared with universally applying any single baseline model. This work provides not only a practical solution for optimizing SE pipelines but also a deeper analytical framework for understanding the interplay between signal characteristics and the performance of diffusion-based generative models.
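Illustrative note: the abstract does not define how the cross-entropy and KL-divergence features are computed. The sketch below shows only the textbook quantities, under the assumption that a frame's power spectrum is normalized into a probability distribution and compared against a reference distribution. The function names, the choice of reference, and the smoothing constant `eps` are hypothetical and are not taken from the thesis.

```python
import math


def to_distribution(power, eps=1e-12):
    """Normalize a non-negative power spectrum so its bins sum to 1.

    eps is an assumed smoothing constant that keeps every bin strictly
    positive, so the logarithms below are always defined.
    """
    total = sum(power) + eps * len(power)
    return [(value + eps) / total for value in power]


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_i p_i * log(q_i), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """KL divergence D(p || q) = sum_i p_i * log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical 4-bin example: one frame compared against a flat reference.
frame = to_distribution([4.0, 2.0, 1.0, 1.0])
reference = to_distribution([1.0, 1.0, 1.0, 1.0])

print("entropy:", entropy(frame))
print("cross-entropy:", cross_entropy(frame, reference))
print("KL divergence:", kl_divergence(frame, reference))
```

The three quantities are linked by the identity H(p, q) = H(p) + D(p || q), so the cross-entropy against a fixed reference grows with both the spread of the spectrum and its mismatch from that reference.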
Appears in Collections:	[International Graduate Program in Artificial Intelligence] Electronic Thesis & Dissertation

