Vision-Language Model–Based Approach for Violence Detection in Video Surveillance (基於視覺－語言模型之影像監控暴力行為偵測方法)

Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/99169

Title:	Vision-Language Model–Based Approach for Violence Detection in Video Surveillance (基於視覺－語言模型之影像監控暴力行為偵測方法)
Author:	Duy, Mai Manh (麥曼德)
Contributor:	International Master's Program in Artificial Intelligence (人工智慧國際碩士學位學程)
Keywords:	Vision language model; Violence detection; Zero-shot classification
Date:	2026-01-27
Upload time:	2026-03-06 18:15:19 (UTC+8)
Publisher:	National Central University (國立中央大學)
Abstract:	This study develops and evaluates a violence detection system based on a Vision-Language Model (VLM). Detecting violent actions in video is important for public safety and surveillance, but real-world footage is often low in quality and contains complex scenes. The study therefore focuses on improving VLM performance for violence detection under real-world conditions. First, a baseline model is implemented following settings common in previous work. Second, a zero-shot VLM is applied without additional training to evaluate its practical performance. Third, the VLM is fine-tuned on labeled video data to adapt it to the violence detection task; during fine-tuning, the model learns visual representations better suited to recognizing violent actions. Performance is evaluated with the standard metrics of accuracy, precision, recall, and F1-score, and all experiments are conducted under the same conditions to ensure a fair comparison. The results show that the fine-tuned VLM achieves higher accuracy and F1-score than both the baseline and the zero-shot approach, which indicates that fine-tuning helps the model capture visual patterns related to violence. The zero-shot model, which is flexible and requires no additional training, still performs acceptably in real-world scenarios, only slightly below the fine-tuned model. Overall, the proposed approach is effective and robust, showing strong potential for practical use in public safety and surveillance systems.
Appears in Collections:	[International Master's Program in Artificial Intelligence] Theses and Dissertations
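
The abstract names accuracy, precision, recall, and F1-score as the evaluation metrics. As a minimal sketch of how these four values relate for a binary violent/non-violent classifier, the following Python computes them from true and predicted labels. The function name `binary_metrics`, the label convention (1 = violent, 0 = non-violent), and the sample labels are assumptions made for illustration; they are not taken from the thesis and do not reproduce its results.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Assumed convention (hypothetical): 1 = violent clip, 0 = non-violent clip.
    Any ratio with a zero denominator is reported as 0.0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # violent, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-violent, not flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed violent clip

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return BinaryMetrics(accuracy, precision, recall, f1)


if __name__ == "__main__":
    # Illustrative labels only; not data from the thesis.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(y_true, y_pred))
```

In a surveillance setting, recall reflects how many violent clips are caught and precision reflects how many alerts are genuine, which is why F1 is often reported alongside accuracy when the two classes are imbalanced.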

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	17

All items in NCUIR are protected by original copyright.
