On the limitations of diffusion-based speech enhancement models and an adaptive selection strategy


Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/97185

Title:	On the limitations of diffusion-based speech enhancement models and an adaptive selection strategy
Author:	林源煜;Yu, Lim Yuan
Contributor:	International Master's Program in Artificial Intelligence (人工智慧國際碩士學位學程)
Keywords:	Speech Enhancement;Diffusion Model;Audio Spectrogram Transformer;DNSMOS;Spectral Entropy
Date:	2025-07-28
Upload time:	2025-10-17 10:56:25 (UTC+8)
Publisher:	National Central University (國立中央大學)
Abstract:	Diffusion probabilistic models have emerged as the new state of the art in speech enhancement (SE), capable of generating high-fidelity audio. However, their practical application is often hindered by significant performance variability across different models and acoustic conditions. A single, universally optimal model rarely exists, and there is limited understanding of which input signal characteristics dictate the success or failure of a given enhancement approach. This dissertation addresses these challenges by proposing a novel, two-stage intelligent model recommendation system designed to dynamically select the most suitable SE model for a given noisy input. To enable this, we first introduce a set of spectral features based on Cross-Entropy and KL-Divergence, which are shown to be statistically significant in characterizing enhancement difficulty and identifying model-specific operational strengths. Our proposed recommender system employs a "gatekeeper-expert" architecture to effectively manage the severe class imbalance inherent in the model selection task. The system is trained on a comprehensive evaluation of three leading diffusion models: SGMSE+, StoRM, and CDiffuSE.
Extensive experiments demonstrate that fine-tuned pre-trained backbones, such as EfficientNet-B0 and the Audio Spectrogram Transformer (AST), achieve high classification accuracy on the recommendation task. Ablation studies validate that a hybrid input, combining Mel-spectrograms with our proposed spectral features, further improves performance. Crucially, the end-to-end evaluation shows that the recommendation-driven approach achieves superior or highly competitive average speech enhancement quality (as measured by DNSMOS) compared to universally applying any single baseline model. This work provides not only a practical solution for optimizing SE pipelines but also a deeper analytical framework for understanding the interplay between signal characteristics and the performance of diffusion-based generative models.
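
The abstract does not say how the Cross-Entropy and KL-Divergence features are computed or how the two stages are wired together, so the following is only a minimal sketch under stated assumptions, not the thesis's implementation. It assumes each power spectrum is normalized into a probability distribution and compared against a reference spectrum (the choice of reference is hypothetical), and that the gatekeeper is a binary decision that separates the majority-class model from the rest, with the expert choosing among the remaining models. The names `normalize`, `spectral_features`, `recommend`, and `majority_model` are illustrative and do not come from the source.

```python
import math
from typing import Callable, Dict, List, Sequence

EPS = 1e-12  # small floor so that empty frequency bins never produce log(0)


def normalize(power: Sequence[float]) -> List[float]:
    """Turn a non-negative power spectrum into a probability distribution."""
    total = sum(power) + EPS * len(power)
    return [(value + EPS) / total for value in power]


def cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    """H(p, q) = -sum_k p_k * log(q_k), in nats."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q))


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) = sum_k p_k * log(p_k / q_k), in nats."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))


def spectral_features(
    frame_power: Sequence[float], reference_power: Sequence[float]
) -> Dict[str, float]:
    """Assumed feature definition: compare one spectrum against a reference."""
    p = normalize(frame_power)
    q = normalize(reference_power)
    return {
        "cross_entropy": cross_entropy(p, q),
        "kl_divergence": kl_divergence(p, q),
    }


def recommend(
    features: Dict[str, float],
    gatekeeper: Callable[[Dict[str, float]], bool],
    expert: Callable[[Dict[str, float]], str],
    majority_model: str,
) -> str:
    """Assumed two-stage routing.

    Stage 1 (gatekeeper): decide whether the majority-class model suffices.
    Stage 2 (expert): otherwise pick among the remaining candidate models.
    """
    if gatekeeper(features):
        return majority_model
    return expert(features)
```

In the thesis the two stages are fine-tuned networks operating on Mel-spectrograms combined with the spectral features; here plain callables stand in for them so that the routing logic can be read in isolation. Splitting the decision this way lets the first stage handle the dominant class on its own, which is one common way to cope with the kind of class imbalance the abstract mentions.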
Appears in collections:	[International Master's Program in Artificial Intelligence] Theses and Dissertations

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	214

All items in NCUIR are protected by original copyright.
