References
1. World Health Organization. Youth Violence. Available online: https://www.who.int/news-room/fact-sheets/detail/youth-violence
2. H. Sheng, K. Yao, and S. Goel. Surveilling Surveillance: Estimating the Prevalence of Surveillance Cameras with Street View Data. arXiv preprint arXiv:2105.01764, 2021. https://arxiv.org/abs/2105.01764
3. H. Pan et al., "Fight Detection Based on Pedestrian Pose Estimation," 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, 2018, pp. 1-5, doi: 10.1109/CISP-BMEI.2018.8633057
4. V. M. Baskaran, R. Sutopo, J. Lim, J. M.-Y. Lim, and K. Wong, "Reimagining Violent Action Detection with Human-Object Interaction," 2024 IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Niagara Falls, ON, Canada, 2024, pp. 1-7, doi: 10.1109/AVSS61716.2024.10672610
5. F. U. M. Ullah, A. Ullah, K. Muhammad, I. U. Haq, and S. W. Baik. Violence Detection Using Spatiotemporal Features with 3D Convolutional Neural Network. Sensors, vol. 19, no. 11, Art. no. 2472, 2019. https://doi.org/10.3390/s19112472
6. A. Bakhshi, J. García-Gómez, R. Gil-Pita, and S. Chalup. Violence Detection in Real-Life Audio Signals Using Lightweight Deep Neural Networks. Procedia Computer Science, vol. 222, pp. 244-251, 2023. https://doi.org/10.1016/j.procs.2023.08.162
7. A. Kannan and A. Z. Kouzani. Violence Detection Using Wi-Fi and 5G/6G Sensing Technologies: A Review. Electronics, vol. 13, no. 14, Art. no. 2765, 2024. https://doi.org/10.3390/electronics13142765
8. W.-F. Pang, Q.-H. He, Y.-J. Hu, and Y.-X. Li, "Violence Detection in Videos Based on Fusing Visual and Audio Information," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 2260-2264, doi: 10.1109/ICASSP39728.2021.9413686
9. M. Patel. Real-Time Violence Detection Using CNN-LSTM. arXiv preprint arXiv:2107.07578, 2021. https://arxiv.org/abs/2107.07578
10. M. Cheng, K. Cai, and M. Li. RWF-2000: An Open Large Scale Video Database for Violence Detection. arXiv preprint arXiv:1911.05913, 2020. https://arxiv.org/abs/1911.05913
11. V. B. Parthasarathy, A. Zafar, A. Khan, and A. Shahid. The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities. arXiv preprint arXiv:2408.13296, 2024. https://arxiv.org/abs/2408.13296
12. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685, 2021. https://arxiv.org/abs/2106.09685
13. T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient Fine-tuning of Quantized LLMs. arXiv preprint arXiv:2305.14314, 2023. https://arxiv.org/abs/2305.14314
14. Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-Shot Learning – A Comprehensive Evaluation of the Good, the Bad and the Ugly. arXiv preprint arXiv:1707.00600, 2020. https://arxiv.org/abs/1707.00600
15. T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large Language Models are Zero-Shot Reasoners. arXiv preprint arXiv:2205.11916, 2023. https://arxiv.org/abs/2205.11916
16. X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, et al. Pre-Trained Models: Past, Present and Future. arXiv preprint arXiv:2106.07139, 2021. https://arxiv.org/abs/2106.07139
17. R. Esfandiarpoor, C. Menghini, and S. H. Bach. If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions. arXiv preprint arXiv:2403.16442, 2024. https://arxiv.org/abs/2403.16442
18. K. O’Shea and R. Nash. An Introduction to Convolutional Neural Networks. arXiv preprint arXiv:1511.08458, 2015. https://arxiv.org/abs/1511.08458
19. S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," in Neural Computation, vol. 9, no. 8, pp. 1735-1780, 15 Nov. 1997, doi: 10.1162/neco.1997.9.8.1735
20. K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385, 2015. https://arxiv.org/abs/1512.03385
21. T. Hassner, Y. Itcher and O. Kliper-Gross, "Violent flows: Real-time detection of violent crowd behavior," 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 2012, pp. 1-6, doi: 10.1109/CVPRW.2012.6239348
22. D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. arXiv preprint arXiv:1412.0767, 2015. https://arxiv.org/abs/1412.0767
23. J. Zhang, J. Huang, S. Jin and S. Lu, "Vision-Language Models for Vision Tasks: A Survey," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 8, pp. 5625-5644, Aug. 2024, doi: 10.1109/TPAMI.2024.3369699
24. F. Bordes, R. Y. Pang, A. Ajay, A. C. Li, A. Bardes, S. Petryk, O. Mañas, Z. Lin, A. Mahmoud, B. Jayaraman, M. Ibrahim, M. Hall, Y. Xiong, et al. An Introduction to Vision-Language Modeling. arXiv preprint arXiv:2405.17247, 2024. https://arxiv.org/abs/2405.17247
25. H. Zhang, X. Li, and L. Bing. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. arXiv preprint arXiv:2306.02858, 2023. https://arxiv.org/abs/2306.02858
26. H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, H. Yang, Y. Sun, C. Deng, H. Xu, Z. Xie, and C. Ruan. DeepSeek-VL: Towards Real-World Vision-Language Understanding. arXiv preprint arXiv:2403.05525, 2024. https://arxiv.org/abs/2403.05525
27. S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin. Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923, 2025. https://arxiv.org/abs/2502.13923
28. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv preprint arXiv:2103.00020, 2021. https://arxiv.org/abs/2103.00020
29. H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual Instruction Tuning. arXiv preprint arXiv:2304.08485, 2023. https://arxiv.org/abs/2304.08485
30. B. Lin, Y. Ye, B. Zhu, J. Cui, M. Ning, P. Jin, and L. Yuan. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. arXiv preprint arXiv:2311.10122, 2024. https://arxiv.org/abs/2311.10122
31. J. Devlin, M. W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, June 2019. doi:10.18653/v1/N19-1423
32. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929, 2021. https://arxiv.org/abs/2010.11929
33. B. Zhu, B. Lin, M. Ning, Y. Yan, J. Cui, H. Wang, Y. Pang, W. Jiang, J. Zhang, Z. Li, W. Zhang, Z. Li, W. Liu, and L. Yuan. LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. arXiv preprint arXiv:2310.01852, 2024. https://arxiv.org/abs/2310.01852
34. D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592, 2023. https://arxiv.org/abs/2304.10592
35. J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the 40th International Conference on Machine Learning (ICML), PMLR 202:19730–19742, 2023. https://proceedings.mlr.press/v202/li23q.html
36. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361, 2020. https://arxiv.org/abs/2001.08361