NCU Institutional Repository (中大機構典藏): Item 987654321/98498


    Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98498


    Title: A Multimodal Deepfake Detection Method Based on Visual and Audio-Visual Analysis (基於可視化與影音分析的深偽多模態偵測方法)
    Authors: Hsieh, Yi-Huan (謝宜歡)
    Contributors: Department of Computer Science and Information Engineering (資訊工程學系)
    Keywords: Deepfake detection; Voice forgery detection; Multimodal analysis
    Date: 2025-08-04
    Upload time: 2025-10-17 12:51:02 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: In recent years, with the advancement of deep learning technologies, Deepfakes have become capable of generating highly realistic images and audio, posing significant challenges to social security and digital forensics. Current detection methods mostly rely on deep learning models, but due to the black-box nature of neural networks, they often lack interpretability. Moreover, forgery techniques have evolved from unimodal approaches to multimodal forms that synchronize both image and audio, rendering traditional single-feature detection methods increasingly ineffective. Therefore, this study aims to enhance the interpretability and multimodal analysis capabilities of Deepfake detection, improving both overall accuracy and practical applicability.
    This research first introduces the Grad-CAM technique to visualize the attention regions of image-based detection models and finds that these models often focus on the mouth area for classification. Building on this, we integrate lip-reading-to-text and speech-to-text models to compare their textual outputs and evaluate audiovisual consistency using the Word Error Rate (WER). Results show that WER increases significantly when the image and audio are inconsistent. Although the lip-reading model's performance remains limited, resulting in a high standard deviation, the approach can still effectively analyze forged features and identify suspicious words or phrases.
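    As a rough, self-contained sketch (not code from the thesis), the audiovisual-consistency check described above amounts to computing the Word Error Rate between a speech-to-text transcript and a lip-reading transcript. The transcripts below are hypothetical placeholders; a real pipeline would obtain them from the respective models.

        def wer(reference: str, hypothesis: str) -> float:
            """Word Error Rate: word-level edit distance divided by reference length."""
            ref, hyp = reference.split(), hypothesis.split()
            # Levenshtein distance over word tokens via dynamic programming.
            d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                d[i][0] = i                                # i deletions
            for j in range(len(hyp) + 1):
                d[0][j] = j                                # j insertions
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1,         # deletion
                                  d[i][j - 1] + 1,         # insertion
                                  d[i - 1][j - 1] + sub)   # substitution or match
            return d[len(ref)][len(hyp)] / max(len(ref), 1)

        # Hypothetical outputs: a genuine clip should score low; a dubbed or
        # face-swapped clip with mismatched audio should score high.
        speech_text = "the meeting is scheduled for monday morning"   # speech-to-text
        lip_text    = "the meeting is scheduled for friday evening"   # lip reading
        print(f"WER = {wer(speech_text, lip_text):.2f}")              # 0.29 here

    In practice, a threshold on this score would flag suspicious clips, with the caveat noted above that noisy lip-reading output inflates the variance.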
    In parallel, an audio forgery detection model was developed that uses deep learning to classify spectrograms of audio data, achieving an accuracy of over 95%. Furthermore, the spectrogram features reveal clear differences between fake and real audio, which helps improve the model's discrimination capability.
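    A minimal sketch, assuming PyTorch, of the kind of spectrogram classifier described above; the architecture, input size, and two-class head are illustrative assumptions rather than the thesis's actual model, and the reported 95% accuracy depends on the authors' data and training setup.

        import torch
        import torch.nn as nn

        class SpectrogramClassifier(nn.Module):
            """Toy CNN labeling a 1x128x128 log-mel spectrogram as real or fake."""
            def __init__(self):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1),        # global pooling -> (N, 64, 1, 1)
                )
                self.classifier = nn.Linear(64, 2)  # logits: [real, fake]

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                return self.classifier(self.features(x).flatten(1))

        model = SpectrogramClassifier()
        batch = torch.randn(4, 1, 128, 128)   # stand-in for real spectrogram batches
        print(model(batch).shape)             # torch.Size([4, 2])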
    Finally, this study compares the performance of image-based, audio-based, and multimodal detection approaches and evaluates the models' ability to detect various Deepfake techniques, such as face swapping, expression manipulation, and voice cloning. The results demonstrate that the proposed multimodal detection framework offers both interpretability and practicality, providing an innovative solution for Deepfake detection and digital media forensics.
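    One straightforward way to realize the multimodal combination described above is score-level (late) fusion: each detector emits a fake score, and a weighted average makes the final call. This fusion scheme, its weights, and its threshold are illustrative assumptions, not values taken from the thesis.

        def fuse_scores(image_score: float, audio_score: float, wer_score: float,
                        weights=(0.4, 0.4, 0.2), threshold=0.5) -> bool:
            """Late fusion of per-modality fake scores; True means 'flag as Deepfake'.

            image_score, audio_score: fake probabilities in [0, 1] from the two models.
            wer_score: lip-reading vs. speech-transcript WER, clipped to [0, 1],
            used as an audiovisual-inconsistency signal.
            """
            wer_score = min(max(wer_score, 0.0), 1.0)
            fused = (weights[0] * image_score
                     + weights[1] * audio_score
                     + weights[2] * wer_score)
            return fused >= threshold

        # Example: a visually convincing fake with mismatched audio is still flagged.
        print(fuse_scores(image_score=0.35, audio_score=0.80, wer_score=0.90))  # True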
    Appears in collections: [Graduate Institute of Computer Science and Information Engineering (資訊工程研究所)] Master's and doctoral theses

    Files in this item:

    File          Description    Size    Format    Views
    index.html                   0Kb     HTML      38


    All items in NCUIR are protected by copyright, with all rights reserved.
