

    Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/97185


    Title: On the limitations of diffusion-based speech enhancement models and an adaptive selection strategy
    Author: 林源煜; Yu, Lim Yuan
    Contributor: International Master's Program in Artificial Intelligence
    Keywords: Speech Enhancement; Diffusion model; Audio Spectrogram Transformer; DNSMOS; Spectral Entropy
    Date: 2025-07-28
    Upload time: 2025-10-17 10:56:25 (UTC+8)
    Publisher: National Central University
    Abstract: Diffusion probabilistic models have emerged as a new state-of-the-art in speech enhancement (SE), capable of generating high-fidelity audio. However, their practical application is often hindered by significant performance variability across different models and acoustic conditions. A single, universally optimal model rarely exists, and there is a limited understanding of the input signal characteristics that dictate the success or failure of a given enhancement approach.

    This thesis addresses these challenges by proposing a novel, two-stage intelligent model recommendation system designed to dynamically select the most suitable SE model for a given noisy input. To enable this, we first introduce a set of spectral features based on Cross-Entropy and KL-Divergence, which are shown to be statistically significant in characterizing enhancement difficulty and identifying model-specific operational strengths.
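The abstract does not say which distributions the Cross-Entropy and KL-Divergence features compare. As a minimal sketch, assuming each power spectrum is normalized into a probability distribution and a frame is compared against some reference spectrum (the names `frame_power`, `reference_power`, and `spectral_features` are hypothetical, not taken from the thesis), the three quantities could be computed like this:

```python
import math

EPS = 1e-12  # small floor so log() stays finite for empty frequency bins


def to_distribution(power_spectrum):
    """Normalize non-negative spectral power values so they sum to 1."""
    floored = [max(v, 0.0) + EPS for v in power_spectrum]
    total = sum(floored)
    return [v / total for v in floored]


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def spectral_features(frame_power, reference_power):
    """Assumed feature set: compare one frame's spectrum with a reference spectrum."""
    if len(frame_power) != len(reference_power):
        raise ValueError("spectra must have the same number of bins")
    p = to_distribution(frame_power)
    q = to_distribution(reference_power)
    return {
        "entropy": entropy(p),
        "cross_entropy": cross_entropy(p, q),
        "kl_divergence": kl_divergence(p, q),
    }
```

The three values are linked by the identity H(p, q) = H(p) + KL(p || q), so the cross-entropy grows both when the frame itself is spectrally flat and when it departs from the reference.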

    Our proposed recommender system employs a "gatekeeper-expert" architecture to effectively manage the severe class imbalance inherent in the model selection task. The system is trained on a comprehensive evaluation of three leading diffusion models: SGMSE+, StoRM, and CDiffuSE. Extensive experiments demonstrate that fine-tuned pre-trained backbones, such as EfficientNet-B0 and AST, achieve high classification accuracy for the recommendation task. Ablation studies validate that a hybrid input, combining Mel-spectrograms with our proposed spectral features, further improves performance.
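One common way to read a "gatekeeper-expert" design is as a two-stage router: a binary gatekeeper first decides whether the majority-class model is good enough, and only the remaining inputs go to an expert that chooses among the other models. The sketch below assumes that reading; the function name, the callable interfaces, and the choice of default model are illustrative assumptions, not the thesis's implementation.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def recommend_model(
    features: Features,
    gatekeeper: Callable[[Features], bool],
    expert: Callable[[Features], str],
    default_model: str,
) -> str:
    """Two-stage selection (assumed structure).

    Stage 1: the gatekeeper returns True when the default (majority-class)
             model is expected to be the best choice for this input.
    Stage 2: otherwise the expert picks one of the remaining models.
    """
    if gatekeeper(features):
        return default_model
    return expert(features)
```

Splitting the decision this way lets the expert train only on the inputs that the gatekeeper rejects, which is typically a far more balanced subset than the full dataset. In practice both callables would wrap fine-tuned classifiers; which model serves as the default has to come from the observed class distribution.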

    Crucially, the end-to-end evaluation shows that the recommendation-driven approach achieves a superior or highly competitive average speech enhancement quality (as measured by DNSMOS) compared to universally applying any single baseline model. This work provides not only a practical solution for optimizing SE pipelines but also a deeper analytical framework for understanding the interplay between signal characteristics and the performance of diffusion-based generative models.
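The end-to-end comparison described above can be sketched as follows, assuming a per-utterance quality score (for example DNSMOS) is already available for every candidate model. The function name and data layout are hypothetical, and the "oracle" row is an added reference point (always picking the best model per utterance) that the abstract does not mention.

```python
from statistics import mean


def evaluate_selection(scores, picks):
    """Compare a per-utterance model selection against fixed baselines.

    scores: list of dicts, one per utterance, mapping model name -> quality score.
    picks:  list of model names, the recommender's choice for each utterance.
    """
    if not scores or len(scores) != len(picks):
        raise ValueError("scores and picks must be non-empty and equally long")
    models = list(scores[0].keys())
    # Baselines: apply the same single model to every utterance.
    report = {f"always_{m}": mean(s[m] for s in scores) for m in models}
    # Recommendation-driven approach: score of the chosen model per utterance.
    report["recommender"] = mean(s[p] for s, p in zip(scores, picks))
    # Upper bound: the best available model for every utterance.
    report["oracle"] = mean(max(s.values()) for s in scores)
    return report
```

The recommender is "superior" in the abstract's sense when its mean exceeds every `always_*` entry, and the gap to `oracle` shows how much headroom a better selector could still recover.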
    Appears in collections: [International Master's Program in Artificial Intelligence] Theses and Dissertations

    Files in this item:

    File          Description  Size  Format  Views
    index.html                 0Kb   HTML    2


    All items in NCUIR are protected by original copyright.
