Audio Deepfake Detection via a Dual-Branch Network with Layer-Aware Routing and Temporal Pooling


Please use this permanent URL to cite or link to this document: https://ir.lib.ncu.edu.tw/handle/987654321/98339

Title:	Audio Deepfake Detection via a Dual-Branch Network with Layer-Aware Routing and Temporal Pooling
Author:	方國霖; FANG, GUO-LIN
Contributor:	Department of Information Management
Keywords:	Audio deepfake detection; self-supervised speech representation; XLS-R; cross-lingual generalization; realistic spoofed speech; Layer-Aware Routing; Temporal Pooling
Date:	2025-07-23
Upload time:	2025-10-17 12:39:07 (UTC+8)
Publisher:	National Central University
Abstract:	With the rapid advancement of speech synthesis technologies, audio deepfakes have emerged as a significant threat to speech security and identity verification systems. Existing detection models often generalize poorly, especially when encountering unseen attacks, cross-lingual data, or real-world conditions. In this study, we propose a dual-branch architecture for audio deepfake detection based on pretrained XLS-R representations. The model integrates two core modules, Layer-Aware Routing and Temporal Attention Pooling, to capture spoofing artifacts across both the layer-wise and temporal dimensions, thereby improving its robustness to heterogeneous data. We conduct comprehensive evaluations on seven datasets, covering English and Mandarin speech, clean lab conditions and in-the-wild recordings with background noise, as well as a range of modern TTS attack methods. Results show that our model, even without any data augmentation, performs on par with the best current baseline on the challenging in-the-wild test set. Furthermore, it outperforms all baselines in cross-domain and cross-lingual evaluations, achieving the lowest Equal Error Rates (EERs).
To further assess real-world robustness, we construct a high-fidelity dataset using state-of-the-art TTS systems such as XTTS, GPT-SoVITS, and Kokoro, and investigate the phenomenon of "fake-to-real confusion," where high-quality synthetic speech becomes acoustically and prosodically indistinguishable from genuine audio, leading to false negatives. Through layer selection analysis and feature space visualization, we demonstrate the model's preference for lower-layer acoustic features and its ability to recognize subtle temporal inconsistencies in spoofed audio.
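The abstract reports results as Equal Error Rates (EERs). As a reading aid, the following is a minimal sketch of how an EER can be approximated from detector scores. It is not taken from the thesis: the function name `equal_error_rate`, the score convention (higher means more likely bona fide), and the label encoding (1 = bona fide, 0 = spoof) are all assumptions made for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER from detector scores.

    Assumed conventions (illustrative, not from the thesis):
    labels use 1 for bona fide and 0 for spoofed audio, and a
    higher score means the detector considers the clip more
    likely to be bona fide.
    """
    bona = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    if not bona or not spoof:
        raise ValueError("need at least one bona fide and one spoofed score")

    # Candidate thresholds: every observed score, plus +inf so that
    # "reject everything" is also considered.
    thresholds = sorted(set(scores)) + [float("inf")]

    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False acceptance: spoofed clips scored at or above the threshold.
        far = sum(s >= t for s in spoof) / len(spoof)
        # False rejection: bona fide clips scored below the threshold.
        frr = sum(s < t for s in bona) / len(bona)
        gap = abs(far - frr)
        if gap < best_gap:
            # Take the midpoint at the threshold where the two rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Perfectly separated toy scores give an EER of 0.0.
print(equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 0.0
```

With few trials the two error rates may never cross exactly, so this sketch returns their midpoint at the closest threshold. Evaluation toolkits typically interpolate along the ROC curve instead, which can give slightly different values.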
Appears in collections:	[Graduate Institute of Information Management] Master's and Doctoral Theses

