Master's/Doctoral Thesis 110522119 — Detailed Record




Name: Pei-Rong Wu (吳佩蓉)    Department: Computer Science and Information Engineering
Thesis Title: Attention-Guided Crowd Counting and Individual Localization
Related Theses
★ Dynamic Overlay Construction for Mobile Target Detection in Wireless Sensor Networks ★ A Simple Detour Strategy for Vehicle Navigation
★ Improving Localization Using Transmitter-Side Voltage ★ Constructing a Virtual Backbone in Vehicular Networks via Vehicle Classification
★ Why Topology-based Broadcast Algorithms Do Not Work Well in Heterogeneous Wireless Networks? ★ Efficient Wireless Sensor Networks for Mobile Targets
★ An Articulation-Point-Based Distributed Topology Control Method for Wireless Ad Hoc Networks ★ A Review of Existing Web Frameworks
★ A Distributed Algorithm for Partitioning Sensor Networks into Greedy Blocks ★ Range-free Distance Measurement in Wireless Networks
★ Inferring Floor Plan from Trajectories ★ An Indoor Collaborative Pedestrian Dead Reckoning System
★ Dynamic Content Adjustment In Mobile Ad Hoc Networks ★ An Image-based Localization System
★ Distributed Data Compression and Collection Algorithms for Large-Scale Wireless Sensor Networks ★ Collision Analysis in Vehicular WiFi Networks
Files: Full text viewable in the system (available after 2024-7-7)
Abstract (Chinese): Combining crowd counting with individual locations enables a comprehensive crowd analysis, yielding a deeper understanding of the crowd's structure and behavior. Many existing research works address crowd counting and individual localization; however, most of them do not use a point-based framework. By leveraging a point-based framework, we propose a system named Attention-Guided Crowd Counting and Individual Localization (AGCCIL), which aims to predict the number of people in an image and obtain the head coordinates. To achieve more accurate counting and localization results, AGCCIL integrates ConvNeXt, a Context Extraction Module, and an Attention Guidance Module. In addition, AGCCIL incorporates Depthwise Separable Convolution to prevent overfitting. Finally, we conduct experiments on the ShanghaiTech datasets to evaluate the performance of AGCCIL and compare it with state-of-the-art work. The experimental results show that AGCCIL outperforms the state-of-the-art methods in both crowd counting and individual localization, reducing the MAE by 3% relative to the state of the art.
Abstract (English): Crowd counting combined with individual locations allows a thorough crowd analysis, which enables a deeper understanding of the structure and behavior of the crowd. There are many existing research works on crowd counting and individual localization; however, most of them do not utilize a point-based framework. By leveraging a point-based framework, we propose a system, called Attention-Guided Crowd Counting and Individual Localization (AGCCIL), that aims to predict the number of people in an image and obtain the coordinates of their heads. To achieve more accurate counting and localization results, AGCCIL integrates ConvNeXt, a Context Extraction Module, and an Attention Guidance Module. In addition, AGCCIL incorporates Depthwise Separable Convolution to prevent overfitting. Finally, we conduct experiments on the ShanghaiTech datasets to evaluate the performance of AGCCIL and compare it with the state-of-the-art work. Experimental results demonstrate that AGCCIL outperforms the state-of-the-art method in crowd counting and individual localization, reducing its MAE by as much as 3%.
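The abstract credits Depthwise Separable Convolution with curbing overfitting, and the thesis later compares parameter counts (Section 5.5.2). A minimal sketch of the parameter-count arithmetic behind that design choice; the layer sizes here are illustrative, not the thesis's actual configuration:

```python
# Parameter-count comparison: standard vs. depthwise separable convolution.
# A k x k standard convolution from c_in to c_out channels needs k*k*c_in*c_out
# weights; the depthwise separable factorization needs only k*k*c_in (one
# spatial filter per input channel) plus c_in*c_out (a 1x1 pointwise mix).

def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def dsc_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise separable convolution (biases ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing channels
    return depthwise + pointwise

if __name__ == "__main__":
    k, c_in, c_out = 3, 256, 256  # hypothetical sizes for illustration
    std, dsc = conv_params(k, c_in, c_out), dsc_params(k, c_in, c_out)
    print(std, dsc, round(std / dsc, 1))  # 589824 67840 8.7
```

With fewer free weights for the same channel mapping, the separable layer has less capacity to memorize the training set, which is one common rationale for its regularizing effect.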
Keywords (Chinese) ★ Crowd counting
★ Attention mechanism
★ Point estimation
Keywords (English) ★ Crowd counting
★ Attention
★ Point estimate
Table of Contents 1 Introduction 1
2 Related Work 5
2.1 Detection-based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Entirety-based Detection Methods . . . . . . . . . . . . . . . . . . . 5
2.1.2 Parts-based Detection Methods . . . . . . . . . . . . . . . . . . . . 5
2.2 Regression Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 CNN-based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.1 Single-Column Architecture Methods . . . . . . . . . . . . . . . . . 7
2.3.2 Multi-Column Architecture Methods . . . . . . . . . . . . . . . . . 7
2.3.3 Hybrid Architecture Methods . . . . . . . . . . . . . . . . . . . . . 8
3 Preliminary 9
3.1 ConvNeXt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.1 Macro Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.2 Group Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.3 Inverted Bottleneck . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Depthwise Separable Convolution . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.1 Depthwise Convolution . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.2 Pointwise Convolution . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Residual Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 Feature Pyramid Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Attention-guided Context Feature Pyramid Network . . . . . . . . . . . . . 14
3.5.1 Context Extraction Module . . . . . . . . . . . . . . . . . . . . . . 14
3.5.2 Attention-guided Module . . . . . . . . . . . . . . . . . . . . . . . . 15
3.6 Point to Point Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.6.1 Predicting Point Coordinates . . . . . . . . . . . . . . . . . . . . . 18
3.6.2 Matching Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4 Design 20
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Research Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4 Proposed System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.4.1 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4.2 Backbone Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4.3 Lightweight Neck Module . . . . . . . . . . . . . . . . . . . . . . . 23
4.4.4 Lightweight Head Module . . . . . . . . . . . . . . . . . . . . . . . 25
5 Performance 28
5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.3 Training and Testing Environment . . . . . . . . . . . . . . . . . . . . . . 30
5.4 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.5 Experimental Results and Analysis . . . . . . . . . . . . . . . . . . . . . . 31
5.5.1 Performance Comparison of Different Models . . . . . . . . . . . . . 31
5.5.2 Comparison of The Number of Parameters . . . . . . . . . . . . . . 32
5.5.3 Effect of Reference Point Layout . . . . . . . . . . . . . . . . . . . . 33
5.5.4 Effect of Different Strides . . . . . . . . . . . . . . . . . . . . . . . 33
5.6 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6 Conclusion 36
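Section 3.6.2 covers the matching strategy of the point-based framework, which pairs predicted head points one-to-one with ground-truth annotations (the Hungarian algorithm in the literature the thesis builds on), and Section 5.2 reports counting error as MAE. A toy sketch of both ideas, using brute-force enumeration in place of the Hungarian algorithm and Euclidean distance as the matching cost; all coordinates and counts are invented:

```python
# Toy one-to-one matching of predicted head points to ground-truth points,
# plus the MAE counting metric. Brute force stands in for the Hungarian
# algorithm, so keep the point sets tiny. All numbers are invented.
from itertools import permutations
from math import dist

def match_points(preds, gts):
    """Return the minimum-total-distance assignment as (pred_idx, gt_idx) pairs."""
    assert len(preds) >= len(gts), "every ground-truth point must get a match"
    best_cost, best_pairs = float("inf"), None
    for perm in permutations(range(len(preds)), len(gts)):
        pairs = list(zip(perm, range(len(gts))))
        cost = sum(dist(preds[p], gts[g]) for p, g in pairs)
        if cost < best_cost:
            best_cost, best_pairs = cost, pairs
    return best_pairs, best_cost

def mae(pred_counts, gt_counts):
    """Mean Absolute Error of per-image people counts."""
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / len(pred_counts)

if __name__ == "__main__":
    preds = [(10, 10), (52, 48), (90, 12)]  # predicted head coordinates
    gts = [(50, 50), (11, 9)]               # annotated head coordinates
    pairs, cost = match_points(preds, gts)
    print(pairs)                # [(1, 0), (0, 1)]: each GT gets its nearest pred
    print(mae([3, 7], [2, 7]))  # 0.5
```

In the real framework the matched pairs drive the localization loss while unmatched predictions are penalized as false positives; the brute-force search here is exponential and only serves to make the assignment objective concrete.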
Advisor: Min-Te Sun (孫敏德)    Date of Approval: 2023-7-13
