

    Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/99169


    Title: Vision-Language Model–Based Approach for Violence Detection in Video Surveillance
    Author: Duy, Mai Manh
    Contributor: International Master Program in Artificial Intelligence
    Keywords: Vision language model; Violence detection; Zero-shot classification
    Date: 2026-01-27
    Uploaded: 2026-03-06 18:15:19 (UTC+8)
    Publisher: National Central University
    Abstract: This study develops and evaluates a violence detection system based on a Vision-Language Model (VLM). Detecting violent actions in video is important for public safety and surveillance, but real-world footage often suffers from low quality and complex scenes, so this study focuses on improving VLM performance for real-world violence detection. First, a baseline model is implemented following common settings from previous work. Second, a zero-shot VLM is applied without additional training to evaluate its out-of-the-box performance. Third, the VLM is fine-tuned on labeled video data to better adapt it to the violence detection task; during fine-tuning, the model learns visual representations better suited to recognizing violent actions. Performance is evaluated with standard metrics: accuracy, precision, recall, and F1-score, and all experiments are conducted under the same conditions to ensure a fair comparison. Results show that the fine-tuned VLM achieves higher accuracy and F1-score than both the baseline and the zero-shot approach, indicating that fine-tuning helps the model better capture visual patterns related to violence. Although the zero-shot model requires no training, its performance in real-world scenarios remains acceptable, only slightly below that of the fine-tuned model. Overall, the proposed approach is effective and robust, showing strong potential for practical use in public safety and surveillance systems.
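    As an illustration of the evaluation described in the abstract (not code from the thesis itself), the sketch below computes the four reported metrics for a binary violent/non-violent classifier; the labels and predictions are hypothetical examples.

    ```python
    # Illustrative sketch: accuracy, precision, recall, and F1 for a
    # binary violence classifier (1 = violent, 0 = non-violent).
    # The sample labels below are invented for demonstration only.

    def binary_metrics(y_true, y_pred):
        """Return (accuracy, precision, recall, f1) for binary labels."""
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        accuracy = (tp + tn) / len(y_true)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        return accuracy, precision, recall, f1

    # Hypothetical video-level ground truth and model predictions.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 1, 0, 0]
    acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
    ```

    In practice a library such as scikit-learn would typically be used for these metrics; the pure-Python version above only makes the definitions explicit.
    
    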
    Appears in Collections: [International Master Program in Artificial Intelligence] Theses & Dissertations

    Files in This Item:

    File: index.html (0 Kb, HTML, 17 views)


    All items in NCUIR are protected by copyright, with all rights reserved.

