    Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98339


    Title: Audio Deepfake Detection via a Dual-Branch Network with Layer-Aware Routing and Temporal Pooling
    Author: 方國霖; FANG, GUO-LIN
    Contributor: Department of Information Management
    Keywords: Audio deepfake detection; self-supervised speech representation; XLS-R; cross-lingual generalization; realistic fake speech (偽語音真實化); Layer-Aware Routing; Temporal Pooling
    Date: 2025-07-23
    Upload time: 2025-10-17 12:39:07 (UTC+8)
    Publisher: National Central University
    Abstract: With the rapid advancement of speech synthesis technologies, audio deepfakes have emerged as a significant threat to speech security and identity verification systems. Existing detection models often generalize poorly, especially when they encounter unseen attacks, cross-lingual data, or real-world conditions. In this study, we propose a dual-branch architecture for audio deepfake detection based on pretrained XLS-R representations. The model integrates two core modules, Layer-Aware Routing and Temporal Attention Pooling, to capture spoofing artifacts along both the layer (hierarchical) and temporal dimensions, thereby improving its robustness to heterogeneous data.

    We conduct comprehensive evaluations on seven datasets (the Chinese-language abstract states six), covering English and Mandarin speech, clean lab conditions and in-the-wild recordings with background noise, and a range of modern TTS attack methods. Results show that our model, even without any data augmentation, performs on par with the best current baseline on the challenging in-the-wild test set. Furthermore, it outperforms all baselines in cross-domain and cross-lingual evaluations, achieving the lowest Equal Error Rates (EERs). To further assess real-world robustness, we construct a high-fidelity dataset using state-of-the-art TTS systems such as XTTS, GPT-SoVITS, and Kokoro, and investigate the phenomenon of "fake-to-real confusion," where high-quality synthetic speech becomes acoustically and prosodically nearly indistinguishable from genuine audio, leading the detector to misclassify fake audio as real. Through layer selection analysis and feature space visualization, we demonstrate the model's preference for lower-layer acoustic features and its ability to recognize subtle temporal inconsistencies in spoofed audio.
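The abstract reports results as Equal Error Rate (EER), the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The following is a minimal sketch of how an EER can be approximated from detector scores. It is not the thesis's evaluation code: the function name `equal_error_rate` and the convention that a higher score means "more likely genuine" are assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER from two lists of detector scores.

    Assumption (not from the thesis): a higher score means the
    detector considers the utterance more likely to be genuine.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("need at least one score from each class")

    # Candidate thresholds: every observed score, plus one above all
    # scores so that the "reject everything" operating point is included.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)

    best_gap = float("inf")
    eer = 1.0
    for t in thresholds:
        # An utterance is accepted as genuine when its score >= t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest
            # and report their average as the EER estimate.
            best_gap = gap
            eer = (far + frr) / 2
    return eer


if __name__ == "__main__":
    genuine = [0.9, 0.6, 0.4, 0.8]   # made-up scores for illustration
    spoofed = [0.1, 0.5, 0.3, 0.7]
    print(f"EER = {equal_error_rate(genuine, spoofed):.2%}")
```

With finite score lists the two error rates rarely cross exactly, so this sketch averages them at the closest threshold; evaluation toolkits often interpolate along the ROC curve instead, which can give slightly different values.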
    Appears in collection: [Graduate Institute of Information Management] Theses and Dissertations
