This study develops and evaluates a violence detection system based on a vision-language model (VLM). Detecting violent actions in video is important for public safety and surveillance, but real-world footage is often low in quality and contains complex scenes. The study therefore focuses on improving VLM performance for violence detection under real-world conditions. First, a baseline model is implemented following settings commonly used in prior work. Second, a zero-shot VLM is applied without any additional training to assess its practical performance. Third, the VLM is fine-tuned on labeled video data so that it adapts to the violence detection task; during fine-tuning, the model learns visual representations better suited to recognizing violent actions. Performance is measured with standard metrics: accuracy, precision, recall, and F1-score. All experiments are run under the same conditions to ensure a fair comparison between methods. The results show that the fine-tuned VLM achieves higher accuracy and F1-score than both the baseline model and the zero-shot approach, which indicates that fine-tuning helps the model capture visual patterns related to violence. The zero-shot model, which is flexible and requires no training, still performs at an acceptable level in real-world scenarios, only slightly below the fine-tuned model. Overall, the proposed approach is effective and robust, and it shows strong potential for practical use in public safety and surveillance systems.
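As an illustration of the four evaluation metrics named above, the following is a minimal sketch of how they could be computed for binary violence labels (1 = violent, 0 = non-violent). The function name `binary_metrics`, the `BinaryMetrics` container, and the example labels are assumptions made for illustration only; they are not taken from the study, and the example values are not experimental results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> BinaryMetrics:
    """Compute accuracy, precision, recall, and F1 with class 1 as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Fall back to 0.0 when a denominator is zero (e.g. no positive predictions).
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return BinaryMetrics(accuracy, precision, recall, f1)


# Hypothetical labels for ten clips (illustrative only, not data from the study).
example_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
example_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
example_metrics = binary_metrics(example_true, example_pred)
print(example_metrics)
```

Reporting precision and recall alongside accuracy matters in surveillance settings, where violent clips are typically much rarer than non-violent ones and accuracy alone can look high even when most violent events are missed.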