Master's and Doctoral Thesis Record 112527606 (Detailed Information)




Author: Mai Manh Duy (麥曼德)   Department: International Master's Program in Artificial Intelligence
Thesis Title: Vision-Language Model-Based Approach for Violence Detection in Video Surveillance
(Chinese title: 基於視覺-語言模型之影像監控暴力行為偵測方法)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process ★ Prediction of Stock Opening Price Movements
★ Embedded-System Implementation of Beamforming and Audio Preprocessing ★ Applications and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System ★ Design and Application of a High-Quality Dictation System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Speeded-Up Robust Features ★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ Applying RetinaNet to Face Detection ★ Trend Prediction for Financial Products
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes ★ A Study on End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture ★ Deep Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends ★ Emotional Speech Analysis Based on Convolutional Neural Networks
Full Text: available in the university's system after 2031-01-30 (embargoed)
Abstract: This study aims to develop and evaluate a violence detection system based on a Vision-Language Model (VLM). Detecting violent actions in videos is important for public safety and surveillance, but real-world videos often have low quality and complex scenes. This study therefore focuses on improving VLM performance for real-world violence detection. First, a baseline model is implemented following common settings from previous work. Second, a zero-shot VLM is applied without additional training to evaluate its practical performance. Third, the VLM is fine-tuned on labeled video data to better adapt it to the violence detection task; during fine-tuning, the model learns visual representations better suited to recognizing violent actions. Performance is measured with standard metrics: accuracy, precision, recall, and F1-score, and all experiments are conducted under the same conditions to ensure a fair comparison. Results show that the fine-tuned VLM achieves higher accuracy and F1-score than both the baseline and the zero-shot approaches, indicating that fine-tuning helps the model better capture visual patterns related to violence. Although the zero-shot model requires no training and is highly flexible, its performance in real-world scenarios remains acceptable and only slightly below that of the fine-tuned model. Overall, the proposed approach is effective and robust, showing strong potential for practical use in public safety and surveillance systems.
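The zero-shot setting described in the abstract can be illustrated with a minimal sketch in the style of CLIP (reference 28 below): sampled video frames are scored against natural-language class prompts, with no task-specific training. The checkpoint name and prompt wording here are illustrative assumptions, not the thesis's actual configuration.

    # Zero-shot sketch: score one video frame against two class prompts.
    # The checkpoint and prompts are assumptions for illustration only.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    prompts = ["a video frame showing a violent fight",
               "a video frame showing a peaceful everyday scene"]

    def classify_frame(frame: Image.Image) -> str:
        inputs = processor(text=prompts, images=frame,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_image  # shape: (1, 2)
        probs = logits.softmax(dim=-1)[0]
        return "violent" if probs[0] > probs[1] else "non-violent"

A video-level decision could then aggregate frame-level predictions, for example by majority vote over uniformly sampled frames.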
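The fine-tuning stage corresponds to QLoRA (Section 3.4 of the table of contents; reference 13 below), which freezes a 4-bit-quantized base model and trains only small low-rank adapter matrices. The following is a minimal configuration sketch using the Hugging Face peft and bitsandbytes integrations; the checkpoint name and hyperparameters are assumptions, not the thesis's reported setup.

    # QLoRA-style sketch: 4-bit NF4 base weights, trainable LoRA adapters only.
    # Checkpoint name and hyperparameters are illustrative assumptions.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # NF4 data type from the QLoRA paper
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "vlm-base-checkpoint",                  # hypothetical model name
        quantization_config=bnb,
    )
    model = prepare_model_for_kbit_training(model)

    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],  # attention projections
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()          # only adapter weights train

Because the frozen base weights are stored in 4-bit precision, this style of fine-tuning fits on a single GPU, which is the main practical appeal of QLoRA over full fine-tuning.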
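For reference, the four evaluation metrics named in the abstract have standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

    \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
    \text{Precision} = \frac{TP}{TP + FP}
    \text{Recall} = \frac{TP}{TP + FN}
    F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}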
Keywords (English):
★ Vision language model
★ Violence detection
★ Zero-shot classification
Table of Contents
1 INTRODUCTION
2 RELATED WORKS
2.1 Vision-based approaches
2.2 Audio-based approaches
2.3 Signal-based approaches
2.4 Vision language models
2.5 Dataset
3 PRELIMINARY
3.1 Vision Language Model
3.1.1 Encoder
3.1.2 Projector
3.1.3 Large Language Model Responsibility
3.2 Vision Language Model for Video
3.3 Fine-tuning techniques
3.3.1 Full Fine-tuning
3.3.2 Parameter-Efficient Fine-Tuning (PEFT)
3.4 Quantized Low-Rank Adaptation (QLoRA)
3.5 Zero-Shot Learning
3.6 Prompt
3.7 Pretrained Model
4 METHODOLOGY
4.1 Motivation
4.2 Proposed Approach
4.2.1 Method Workflow
4.2.2 Prompt Design
4.2.3 Data Preprocessing
4.2.4 Fine-Tuning Strategy
4.3 Baseline Approach
4.4 Zero-shot VLM Approach
5 RESULTS
5.1 Experimental Setup
5.1.1 Environment
5.2 Evaluation Metrics
5.3 Experimental Results
6 DISCUSSION AND CONCLUSION
6.1 Discussion
6.2 Conclusions
6.3 Future Work
References:
1. World Health Organization. Youth Violence. Available online: https://www.who.int/news-room/fact-sheets/detail/youth-violence
2. H. Sheng, K. Yao, and S. Goel. Surveilling Surveillance: Estimating the Prevalence of Surveillance Cameras with Street View Data. arXiv preprint arXiv:2105.01764, 2021. https://arxiv.org/abs/2105.01764
3. H. Pan et al., "Fight Detection Based on Pedestrian Pose Estimation," 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, 2018, pp. 1-5, doi: 10.1109/CISP-BMEI.2018.8633057
4. V. M. Baskaran, R. Sutopo, J. Lim, J. M.-Y. Lim and K. Wong, "Reimagining Violent Action Detection with Human-Object Interaction," 2024 IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Niagara Falls, ON, Canada, 2024, pp. 1-7, doi: 10.1109/AVSS61716.2024.10672610
5. Ullah, F. U. M., Ullah, A., Muhammad, K., Haq, I. U., & Baik, S. W. (2019). Violence Detection Using Spatiotemporal Features with 3D Convolutional Neural Network. Sensors, 19(11), 2472. https://doi.org/10.3390/s19112472
6. Ali Bakhshi, Joaquín García-Gómez, Roberto Gil-Pita, Stephan Chalup, Violence Detection in Real-Life Audio Signals Using Lightweight Deep Neural Networks, Procedia Computer Science, Volume 222, 2023, Pages 244-251, ISSN 1877-0509, https://doi.org/10.1016/j.procs.2023.08.162
7. Kannan, A., & Kouzani, A. Z. (2024). Violence Detection Using Wi-Fi and 5G/6G Sensing Technologies: A Review. Electronics, 13(14), 2765. https://doi.org/10.3390/electronics13142765
8. W.-F. Pang, Q.-H. He, Y.-J. Hu and Y.-X. Li, "Violence Detection in Videos Based on Fusing Visual and Audio Information," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 2260-2264, doi: 10.1109/ICASSP39728.2021.9413686
9. M. Patel. Real-Time Violence Detection Using CNN-LSTM. arXiv preprint arXiv:2107.07578, 2021. https://arxiv.org/abs/2107.07578
10. M. Cheng, K. Cai, and M. Li. RWF-2000: An Open Large Scale Video Database for Violence Detection. arXiv preprint arXiv:1911.05913, 2020. https://arxiv.org/abs/1911.05913
11. V. B. Parthasarathy, A. Zafar, A. Khan, and A. Shahid. The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities. arXiv preprint arXiv:2408.13296, 2024. https://arxiv.org/abs/2408.13296
12. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685, 2021. https://arxiv.org/abs/2106.09685
13. T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient Fine-tuning of Quantized LLMs. arXiv preprint arXiv:2305.14314, 2023. https://arxiv.org/abs/2305.14314
14. Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-Shot Learning – A Comprehensive Evaluation of the Good, the Bad and the Ugly. arXiv preprint arXiv:1707.00600, 2020. https://arxiv.org/abs/1707.00600
15. T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large Language Models are Zero-Shot Reasoners. arXiv preprint arXiv:2205.11916, 2023. https://arxiv.org/abs/2205.11916
16. X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, et al. Pre-Trained Models: Past, Present and Future. arXiv preprint arXiv:2106.07139, 2021. https://arxiv.org/abs/2106.07139
17. R. Esfandiarpoor, C. Menghini, and S. H. Bach. If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions. arXiv preprint arXiv:2403.16442, 2024. https://arxiv.org/abs/2403.16442
18. K. O’Shea and R. Nash. An Introduction to Convolutional Neural Networks. arXiv preprint arXiv:1511.08458, 2015. https://arxiv.org/abs/1511.08458
19. S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," in Neural Computation, vol. 9, no. 8, pp. 1735-1780, 15 Nov. 1997, doi: 10.1162/neco.1997.9.8.1735
20. K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385, 2015. https://arxiv.org/abs/1512.03385
21. T. Hassner, Y. Itcher and O. Kliper-Gross, "Violent flows: Real-time detection of violent crowd behavior," 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 2012, pp. 1-6, doi: 10.1109/CVPRW.2012.6239348
22. D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. arXiv preprint arXiv:1412.0767, 2015. https://arxiv.org/abs/1412.0767
23. J. Zhang, J. Huang, S. Jin and S. Lu, "Vision-Language Models for Vision Tasks: A Survey," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 8, pp. 5625-5644, Aug. 2024, doi: 10.1109/TPAMI.2024.3369699
24. Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, Mark Ibrahim, Melissa Hall, Yunyang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, Hu Xu, Xiaoqing Ellen Tan, Megan Richards, Samuel Lavoie, Pietro Astolfi, Reyhane Askari Hemmat, Jun Chen, Kushal Tirumala, Rim Assouel, Mazda Moayeri, Arjang Talattof, Kamalika Chaudhuri, Zechun Liu, Xilun Chen, Quentin Garrido, Karen Ullrich, Aishwarya Agrawal, Kate Saenko, Asli Celikyilmaz, and Vikas Chandra. An Introduction to Vision-Language Modeling. arXiv preprint arXiv:2405.17247, 2024. https://arxiv.org/abs/2405.17247
25. H. Zhang, X. Li, and L. Bing. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. arXiv preprint arXiv:2306.02858, 2023. https://arxiv.org/abs/2306.02858
26. H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, H. Yang, Y. Sun, C. Deng, H. Xu, Z. Xie, and C. Ruan. DeepSeek-VL: Towards Real-World Vision-Language Understanding. arXiv preprint arXiv:2403.05525, 2024. https://arxiv.org/abs/2403.05525
27. S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin. Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923, 2025. https://arxiv.org/abs/2502.13923
28. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv preprint arXiv:2103.00020, 2021. https://arxiv.org/abs/2103.00020
29. H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual Instruction Tuning. arXiv preprint arXiv:2304.08485, 2023. https://arxiv.org/abs/2304.08485
30. B. Lin, Y. Ye, B. Zhu, J. Cui, M. Ning, P. Jin, and L. Yuan. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. arXiv preprint arXiv:2311.10122, 2024. https://arxiv.org/abs/2311.10122
31. J. Devlin, M. W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, June 2019. doi:10.18653/v1/N19-1423
32. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929, 2021. https://arxiv.org/abs/2010.11929
33. B. Zhu, B. Lin, M. Ning, Y. Yan, J. Cui, H. Wang, Y. Pang, W. Jiang, J. Zhang, Z. Li, W. Zhang, Z. Li, W. Liu, and L. Yuan. LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. arXiv preprint arXiv:2310.01852, 2024. https://arxiv.org/abs/2310.01852
34. Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592, 2023.
35. Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Andreas Krause, et al., editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 19730-19742. PMLR, 23-29 Jul 2023. https://proceedings.mlr.press/v202/li23q.html
36. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361, 2020. https://arxiv.org/abs/2001.08361
Advisor: Jia-Ching Wang (王家慶)   Approval Date: 2026-01-27
