Thesis Record 112527003: Detailed Information




Name: 林源煜 (Lim Yuan Yu)    Graduate Program: International Master's Program in Artificial Intelligence
Thesis Title: On the limitations of diffusion-based speech enhancement models and an adaptive selection strategy
Related Theses
★ Predicting users' personal information and personality traits from web browsing history
★ Predicting changes in users' browsing behavior before special holidays via multi-target matrix factorization
★ Predicting the distribution and volume of traffic demand: AR-LSTMs models based on multiple attention mechanisms
★ A study on dynamic multi-model fusion analysis
★ Extending clickstreams: analyzing user behaviors missing from clickstreams
★ Associated Learning: decomposing end-to-end backpropagation with autoencoders and target propagation
★ A click-prediction model fusing multi-model ranking
★ Analyzing intentional, unintentional, and missing user behaviors in web logs
★ Adjusting word vectors with synonym and antonym information using a non-directional sequence encoder based on self-attention
★ Exploring when to use deep learning versus simple models for click-through-rate prediction
★ Fault detection for air-quality sensors: an anomaly-detection framework based on deep spatio-temporal graph models
★ An empirical study of the effect of thesaurus-adjusted word vectors on downstream natural-language tasks
★ Detecting hypernym-hyponym relations between words using auxiliary sentences and BERT
★ A semi-supervised model combining spatio-temporal data, applied to anomaly detection for PM2.5 air-pollution sensors
★ Decomposing end-to-end backpropagation with SCPL
★ Training neural networks by adjusting DropConnect drop probabilities according to weight gradient magnitudes
  1. This electronic thesis is approved for immediate open access.
  2. The open-access electronic full text is licensed to users only for personal, non-profit academic research: retrieval, reading, and printing.
  3. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.

Abstract (English) Diffusion probabilistic models have emerged as a new state-of-the-art in speech enhancement (SE), capable of generating high-fidelity audio. However, their practical application is often hindered by significant performance variability across different models and acoustic conditions. A single, universally optimal model rarely exists, and there is a limited understanding of the input signal characteristics that dictate the success or failure of a given enhancement approach.

This dissertation addresses these challenges by proposing a novel, two-stage intelligent model recommendation system designed to dynamically select the most suitable SE model for a given noisy input. To enable this, we first introduce a set of spectral features based on Cross-Entropy and KL-Divergence, which are shown to be statistically significant in characterizing enhancement difficulty and identifying model-specific operational strengths.
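The thesis builds its features from CE and KL matrices (Section 3.2.4); the exact construction is not reproduced in this record, but the two underlying quantities can be sketched as follows. The pairing of spectrum frames, the 257-bin spectrum size, and the normalization scheme are illustrative assumptions, not the thesis's definitions:

```python
import numpy as np

def spectral_distribution(magnitudes, eps=1e-12):
    """Normalize a magnitude-spectrum frame into a probability distribution."""
    p = magnitudes + eps  # avoid log(0) downstream
    return p / p.sum()

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_k p_k log q_k."""
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    """KL divergence D_KL(p || q) = sum_k p_k log(p_k / q_k)."""
    return float(np.sum(p * np.log(p / q)))

# Compare two hypothetical STFT frames of a noisy utterance.
rng = np.random.default_rng(0)
frame_a = spectral_distribution(rng.random(257))  # 257 bins: illustrative only
frame_b = spectral_distribution(rng.random(257))

ce = cross_entropy(frame_a, frame_b)
kl = kl_divergence(frame_a, frame_b)
entropy_a = cross_entropy(frame_a, frame_a)  # H(p, p) is the spectral entropy of p
# Sanity check of the standard identity H(p, q) = H(p) + D_KL(p || q):
assert np.isclose(ce, entropy_a + kl)
```

Sweeping such pairwise comparisons over frames yields the CE and KL matrices from which scalar features can then be pooled, which appears to be the spirit of Section 3.2.4.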

Our proposed recommender system employs a "gatekeeper-expert" architecture to effectively manage the severe class imbalance inherent in the model selection task. The system is trained on a comprehensive evaluation of three leading diffusion models: SGMSE+, StoRM, and CDiffuSE. Extensive experiments demonstrate that fine-tuned pre-trained backbones, such as EfficientNet-B0 and AST, achieve high classification accuracy for the recommendation task. Ablation studies validate that a hybrid input, combining Mel-spectrograms with our proposed spectral features, further improves performance.
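The control flow of a two-stage gatekeeper-expert recommender can be sketched minimally as below. The decision rules, feature thresholds, and the choice of SGMSE+ as the majority-class model are placeholders for illustration; the thesis's actual classifiers are fine-tuned EfficientNet-B0 and AST networks:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class TwoStageRecommender:
    """Gatekeeper-expert sketch: the gatekeeper handles the dominant class,
    the expert disambiguates only the remaining minority classes."""
    gatekeeper: Callable[[Sequence[float]], bool]  # True -> majority-class model
    expert: Callable[[Sequence[float]], str]       # picks among minority models
    majority_model: str = "SGMSE+"                 # assumed majority class (placeholder)

    def recommend(self, features: Sequence[float]) -> str:
        if self.gatekeeper(features):
            return self.majority_model
        return self.expert(features)

# Toy stand-ins for the trained classifiers:
recommender = TwoStageRecommender(
    gatekeeper=lambda f: f[0] > 0.5,  # e.g. "low enhancement difficulty"
    expert=lambda f: "StoRM" if f[1] > 0.0 else "CDiffuSE",
)
print(recommender.recommend([0.7, 0.1]))  # gatekeeper accepts -> SGMSE+
print(recommender.recommend([0.2, 0.4]))  # gatekeeper rejects -> StoRM
```

Splitting the decision this way lets each stage train on a better-balanced problem than a single flat multi-class classifier would face.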

Crucially, the end-to-end evaluation shows that the recommendation-driven approach achieves a superior or highly competitive average speech enhancement quality (as measured by DNSMOS) compared to universally applying any single baseline model. This work provides not only a practical solution for optimizing SE pipelines but also a deeper analytical framework for understanding the interplay between signal characteristics and the performance of diffusion-based generative models.
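The reason per-sample selection can beat every fixed baseline is purely arithmetic: the per-sample maximum over several score columns always averages at least as high as any single column. A toy sketch with made-up DNSMOS-like scores (these numbers are illustrative, not results from the thesis):

```python
import statistics

# Hypothetical per-sample DNSMOS scores for three models on four samples.
dnsmos = {
    "SGMSE+":   [3.1, 2.4, 3.0, 2.2],
    "StoRM":    [2.9, 2.8, 2.6, 2.5],
    "CDiffuSE": [2.5, 2.6, 3.2, 2.1],
}
n = 4

# Universal baselines: apply one model to every sample.
baseline_means = {m: statistics.mean(s) for m, s in dnsmos.items()}

# Oracle selection: best model per sample, the upper bound that a
# recommender approaches as its classification accuracy improves.
oracle_mean = statistics.mean(
    max(dnsmos[m][i] for m in dnsmos) for i in range(n)
)

assert oracle_mean >= max(baseline_means.values())
```

A practical recommender lands between the best fixed baseline and this oracle, which is why even imperfect classification accuracy can yield a superior average score.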
Keywords (English) ★ Speech Enhancement
★ Diffusion model
★ Audio Spectrogram Transformer
★ DNSMOS
★ Spectral Entropy
Table of Contents
1 Introduction
  1.1 Overview of Speech Enhancement
  1.2 Problem Statement
  1.3 Research Questions
  1.4 Contributions
  1.5 Dissertation Outline
2 Background and Literature Review
  2.1 Evolution of Speech Enhancement
    2.1.1 Traditional Speech Enhancement Techniques
    2.1.2 Deep Learning-based Speech Enhancement
  2.2 Diffusion Probabilistic Models
    2.2.1 Discrete-Time Models (DDPMs)
    2.2.2 Continuous-Time Models (SDEs)
  2.3 Baseline Diffusion-based Speech Enhancement Models
    2.3.1 SGMSE+: The SDE-based Conditional Approach
    2.3.2 CDiffuSE: The DDPM-based Conditional Approach
    2.3.3 StoRM: The Two-Stage Regenerative Approach
  2.4 Related Spectral Representation Techniques
    2.4.1 Spectral Entropy
  2.5 Evaluation Metrics for Speech Enhancement
    2.5.1 Intrusive Metrics
    2.5.2 Non-Intrusive Metrics
3 Methodology
  3.1 Experimental Design for the Recommender System
    3.1.1 Enhancement Model Recommendation System
    3.1.2 Enhancement Model Recommender Architectures
    3.1.3 Proposed Two-Stage "Gatekeeper-Expert" Recommender System
  3.2 Proposed Spectral Feature Extraction
    3.2.1 A Priori Rationale for Feature Design
    3.2.2 Cross-Entropy (CE)
    3.2.3 Kullback-Leibler (KL) Divergence
    3.2.4 Feature Calculation from CE and KL Matrices
  3.3 Model Architectures for Analytical Experiments
    3.3.1 Details of Pre-trained Models
4 Results and Discussions
  4.1 Dataset and Experiment Design
    4.1.1 Datasets
  4.2 Preliminary Observation: Comparative Performance of Speech Enhancement Models
    4.2.1 Categorization of Enhancement Outcomes
    4.2.2 Analysis of Top-Performing Models per Sample
    4.2.3 Feature Characteristics for Top-Performing Models and Universally Challenging Cases
  4.3 Performance of the Enhancement Model Recommender System
  4.4 Impact of Recommendation-Driven Enhancement on Speech Quality
  4.5 Ablation Studies and Discussions
    4.5.1 Analysis of Spectral Features in Characterizing Outcome
    4.5.2 Analysis of Recommender Design via Ablation Studies
    4.5.3 Feature Analysis on Misclassified Samples
5 Discussion
  5.1 The Role and Significance of the Proposed Spectral Features
    5.1.1 Characterizing Enhancement Difficulty and Failure Modes
    5.1.2 Identifying Model-Specific Strengths
    5.1.3 Value as a Complementary Input for Advanced Classifiers
  5.2 Potential for Overfitting
  5.3 Robustness to Distribution Shift
6 Conclusion and Future Works
  6.1 Conclusion
  6.2 Future Works
    6.2.1 Improvement of Diffusion Model Architectures for Enhancement
    6.2.2 Improved Noise Dataset Collection and Characterization
Bibliography
A Implementation
References
[1] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2007.
[2] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Latent Variable Analysis and Signal Separation: 12th International Conference, LVA/ICA 2015, Proceedings 12, pp. 91–99, Springer, 2015.
[3] X. Li, Y. Li, Y. Dong, S. Xu, Z. Zhang, D. Wang, and S. Xiong, "Bidirectional LSTM network with ordered neurons for speech enhancement," in Interspeech, pp. 2702–2706, 2020.
[4] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, "Temporal convolutional networks for action segmentation and detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165, 2017.
[5] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[6] A. Pandey and D. Wang, "TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6875–6879, IEEE, 2019.
[7] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
[8] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, "DiffWave: A versatile diffusion model for audio synthesis," arXiv preprint arXiv:2009.09761, 2020.
[9] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, Proceedings, Part III 18, pp. 234–241, Springer, 2015.
[10] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, "Score-based generative modeling through stochastic differential equations," arXiv preprint arXiv:2011.13456, 2020.
[11] B. D. Anderson, "Reverse-time diffusion equation models," Stochastic Processes and their Applications, vol. 12, no. 3, pp. 313–326, 1982.
[12] P. Vincent, "A connection between score matching and denoising autoencoders," Neural Computation, vol. 23, no. 7, pp. 1661–1674, 2011.
[13] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, "Speech enhancement and dereverberation with diffusion-based generative models," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023.
[14] Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, "Conditional diffusion probabilistic model for speech enhancement," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7402–7406, IEEE, 2022.
[15] J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, "StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2724–2737, 2023.
[16] H. Misra, S. Ikbal, H. Bourlard, and H. Hermansky, "Spectral entropy based feature for robust ASR," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. I-193, IEEE, 2004.
[17] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 749–752, 2001.
[18] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time–frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[19] C. K. A. Reddy, V. Gopal, and R. Cutler, "DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6493–6497, 2021.
[20] C. K. A. Reddy, V. Gopal, and R. Cutler, "DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 886–890, 2022.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[22] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning, pp. 6105–6114, PMLR, 2019.
[23] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.
[24] Y. Gong, Y.-A. Chung, and J. Glass, "AST: Audio Spectrogram Transformer," arXiv preprint arXiv:2104.01778, 2021.
[25] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[26] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
[27] C. K. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan, and J. Gehrke, "A scalable noisy speech dataset and online subjective test framework," Proc. Interspeech 2019, pp. 1816–1820, 2019.
[28] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, "CREMA-D: Crowd-sourced emotional multimodal actors dataset," IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014.
[29] K. Ito and L. Johnson, "The LJ Speech dataset." https://keithito.com/LJ-Speech-Dataset/, 2017.
[30] C. V. Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust text-to-speech," in 9th ISCA Speech Synthesis Workshop, pp. 159–165, 2016.
[31] C. Veaux, J. Yamagishi, and S. King, "The Voice Bank corpus: Design, collection and data analysis of a large regional accent speech database," in 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), pp. 1–4, 2013.
[32] J. Thiemann, N. Ito, and E. Vincent, "The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings," in Proceedings of Meetings on Acoustics, vol. 19, p. 035081, Acoustical Society of America, 2013.
[33] C. K. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, et al., "The Interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results," arXiv preprint arXiv:2005.13981, 2020.
Advisor: 陳弘軒 (Hung-Hsuan Chen)    Review Date: 2025-07-28
