Diffusion probabilistic models have emerged as the new state of the art in speech enhancement (SE), capable of generating high-fidelity audio. However, their practical application is often hindered by significant performance variability across different models and acoustic conditions. A single, universally optimal model rarely exists, and there is limited understanding of the input signal characteristics that dictate the success or failure of a given enhancement approach.
This dissertation addresses these challenges by proposing a novel, two-stage intelligent model recommendation system that dynamically selects the most suitable SE model for a given noisy input. To enable this, we first introduce a set of spectral features based on Cross-Entropy and KL-Divergence, which we show to be statistically significant both in characterizing enhancement difficulty and in identifying model-specific operational strengths.
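As a rough illustration of what Cross-Entropy and KL-Divergence features over a spectrum can look like, the following minimal sketch normalizes the power spectrum of a single frame into a probability distribution and compares it with a reference distribution. The abstract does not specify which distributions are compared, so the flat reference spectrum, the frame length, the sample rate, and all function names here are assumptions made for illustration only, not the feature definitions used in this work.

```python
import numpy as np


def spectral_distribution(frame, n_fft=512, eps=1e-12):
    """Turn the power spectrum of one frame into a probability distribution."""
    power = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2
    power = power + eps  # avoid log(0) in empty bins
    return power / power.sum()


def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k (in nats)."""
    return float(-np.sum(p * np.log(q)))


def kl_divergence(p, q):
    """D_KL(p || q) = sum_k p_k log(p_k / q_k) (in nats)."""
    return float(np.sum(p * np.log(p / q)))


# Toy input: a 440 Hz tone in white noise (assumed 16 kHz sample rate).
rng = np.random.default_rng(0)
t = np.arange(512) / 16000.0
noisy = np.sin(2 * np.pi * 440.0 * t) + 0.5 * rng.standard_normal(512)

p = spectral_distribution(noisy)
# Assumed reference: a flat spectrum. The actual reference is not given here.
q = np.full_like(p, 1.0 / p.size)

ce = cross_entropy(p, q)
kl = kl_divergence(p, q)
entropy_p = cross_entropy(p, p)  # H(p, p) is the Shannon entropy of p
```

The two quantities are linked by the identity H(p, q) = H(p) + D_KL(p || q), so for a fixed reference they differ only by the entropy of the input spectrum.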
Our proposed recommender system employs a "gatekeeper-expert" architecture to effectively manage the severe class imbalance inherent in the model selection task. The system is trained on a comprehensive evaluation of three leading diffusion models: SGMSE+, StoRM, and CDiffuSE. Extensive experiments demonstrate that fine-tuned pre-trained backbones, such as EfficientNet-B0 and AST, achieve high classification accuracy for the recommendation task. Ablation studies validate that a hybrid input, combining Mel-spectrograms with our proposed spectral features, further improves performance.
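One plausible reading of a "gatekeeper-expert" design can be sketched as follows: a first-stage binary classifier decides whether the majority-class model is adequate, and only the remaining inputs are passed to a second-stage classifier that chooses among the minority classes. The class layout, the choice of SGMSE+ as the majority model, the toy decision rules, and every identifier below are hypothetical; the trained classifiers described in this dissertation are replaced here by simple stand-in functions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Features = Sequence[float]


@dataclass
class TwoStageRecommender:
    """Hypothetical two-stage router; not the dissertation's implementation."""

    # Stage 1: returns True when the majority-class model should be used.
    gatekeeper: Callable[[Features], bool]
    # Stage 2: chooses among the remaining (minority-class) models.
    expert: Callable[[Features], str]
    # Assumption: which model forms the majority class is not stated.
    majority_model: str = "SGMSE+"

    def recommend(self, features: Features) -> str:
        if self.gatekeeper(features):
            return self.majority_model
        return self.expert(features)


# Stand-in rules with made-up thresholds, used only to exercise the routing.
recommender = TwoStageRecommender(
    gatekeeper=lambda f: f[0] < 0.5,
    expert=lambda f: "StoRM" if f[1] > 0.0 else "CDiffuSE",
)

choices = [
    recommender.recommend([0.1, 1.0]),
    recommender.recommend([0.9, 1.0]),
    recommender.recommend([0.9, -1.0]),
]
```

Splitting the decision this way lets the second stage be trained only on the cases where the majority model is not chosen, so the rare classes are not swamped by the dominant one.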
Crucially, the end-to-end evaluation shows that the recommendation-driven approach achieves a superior or highly competitive average speech enhancement quality (as measured by DNSMOS) compared to universally applying any single baseline model. This work provides not only a practical solution for optimizing SE pipelines but also a deeper analytical framework for understanding the interplay between signal characteristics and the performance of diffusion-based generative models.