Thesis 108522041 — Detailed Record




Name: Wen-Yu Li (李文郁)    Department: Computer Science and Information Engineering
Thesis Title: Cross Correlation and Attention Module Based Object Tracking and Convolutional Layer Quantization for Embedded Hardware
(基於交叉關聯與注意力模組之物件追蹤與嵌入式硬體之卷積層量化)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Front-End Processing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Applications of a High-Quality Spoken-Description System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Accelerated Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ RetinaNet Applied to Face Detection
★ Trend Prediction for Financial Products
★ Integrating Deep Learning Methods for Age Prediction and Research on Aging Genes
★ End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Surgical Survival After Stroke
Files: Full text available via the system only (access: never open)
Abstract (Chinese) The development of deep learning, supported by high-performance computing, has given rise to a variety of tasks in image processing, and many applications have followed. Among these, object tracking is a popular and challenging task. By combining an object detection module with a tracking algorithm, object positions are selected in continuously input images and labeled with object identities, which is the main goal of object tracking. In addition, because deep learning involves heavy computation, model deployment has attracted much attention. To run vision projects on edge devices, research on integer quantization of models and its integration with embedded hardware has become a central topic of deep learning applications.
This thesis studies object tracking and the quantization of image classification. For the tracking task, a one-stage system is adopted that jointly trains object detection and a tracking embedding branch. This not only improves training and inference speed; the model is also simpler than a two-stage design without losing accuracy. Beyond the one-stage tracking architecture, cross-correlation balance and attention modules are added, and the modules and network structure are moderately adjusted, in an attempt to overcome the multi-task training problems of end-to-end networks and to strengthen the identity embedding branch of object tracking. For integer quantization of image classification, a custom FPGA convolution unit is combined with a quantized object classification model to demonstrate the feasibility of implementing the model on embedded hardware.
The tracking experiments evaluate the performance of the object tracking system and differences in identity recognition using different network combinations, adjusted identity embedding dimensions, and different backbone networks. The results show that the combination of cross-correlation balance and attention improves identity recognition in object tracking. Appropriate identity embedding dimensions not only reduce computation but also improve tracking performance. In the backbone replacement experiments, comparing DLA-34 with ResNet-101 shows that DLA-34 has both fewer parameters and better accuracy.
In the hardware quantization experiments, after integrating the quantized convolution function with the FPGA, object classification inference results were obtained with the VGG architecture, demonstrating the importance of quantization for image inference.
Abstract (English) The development of deep learning, supported by high-performance computing, has produced a variety of tasks in the image processing area, and many applications have been proposed one after another. Among these tasks, object tracking is a popular topic and also one of the most challenging. By combining object detection and tracking algorithm modules, the position of each object is selected in the continuous input images and its identity is labeled, which achieves the main purpose of object tracking. In addition, the heavy computation of deep learning and the deployment of models have also attracted attention. To deploy vision projects smoothly on edge devices, integrating integer quantization research for models with embedded hardware has become a main topic of deep learning applications.
This thesis discusses object tracking and the quantization of image classification. For the tracking task, we adopt a one-stage system that combines object detection with the embedding branch of multi-object tracking in multi-task training. This structure not only improves training and inference speed; the model is also simpler and no less accurate than a two-stage architecture. Moreover, the adopted detection structure describes object locations with center points, which improves detection accuracy over previous bounding-box-based solutions. In addition to the one-stage tracking architecture, this thesis adds cross-correlation balance and attention modules and moderately adjusts the modules and network structure, attempting to overcome the multi-task training problem and to strengthen the identity embedding branch for object tracking. For integer quantization of image classification, we combine a custom FPGA convolution unit with a quantized object classification model to demonstrate the feasibility of implementation on embedded hardware.
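The center-point detection idea mentioned above (in the style of CenterNet / "Objects as Points" [3]) can be illustrated with a minimal sketch: the detector head outputs a per-pixel confidence heatmap, and object centers are read off as local maxima above a threshold. The heatmap values and threshold below are hypothetical illustration data, not taken from the thesis.

```python
def decode_centers(heatmap, threshold=0.5):
    """Pick heatmap cells that are strict local maxima (3x3 neighborhood)
    and above the confidence threshold -- a simplified stand-in for
    CenterNet-style center-point decoding."""
    h, w = len(heatmap), len(heatmap[0])
    centers = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v < threshold:
                continue
            neighbors = [heatmap[ny][nx]
                         for ny in range(max(0, y - 1), min(h, y + 2))
                         for nx in range(max(0, x - 1), min(w, x + 2))
                         if (ny, nx) != (y, x)]
            if all(v > n for n in neighbors):  # strict local maximum
                centers.append((x, y, v))
    return centers

# Hypothetical 5x5 heatmap with two peaks
hm = [
    [0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.2, 0.1, 0.1],
    [0.1, 0.2, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7, 0.2],
    [0.1, 0.1, 0.1, 0.2, 0.1],
]
print(decode_centers(hm))  # -> [(1, 1, 0.9), (3, 3, 0.7)]
```

In the real architecture the bounding-box size and offset are regressed at each peak location by separate heads; the sketch only shows how center points replace anchor-box enumeration.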
The tracking experiments evaluate the performance of the object tracking system and differences in identity recognition by using different network combinations, adjusting the identity embedding dimension, and replacing the backbone network. The results show that the combination of cross-correlation balance and the attention mechanism improves identity recognition in object tracking. They also show that suitable identity embedding dimension settings not only reduce the amount of computation but also improve tracking performance. Finally, in the backbone replacement experiment, comparing DLA-34 with ResNet-101 shows that DLA-34 has fewer parameters and better accuracy.
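The identity embedding branch discussed above associates detections across frames by comparing embedding vectors. As a hedged illustration (not the thesis's actual matching code, which the full text would describe), the sketch below greedily matches track and detection embeddings by cosine similarity; the vectors and threshold are invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_match(tracks, detections, threshold=0.8):
    """Greedily assign each detection embedding to the most similar
    unmatched track embedding; below-threshold pairs are left unmatched
    (in a full tracker they would start new tracks)."""
    pairs = sorted(
        ((cosine(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True)
    used_t, used_d, matches = set(), set(), {}
    for sim, ti, di in pairs:
        if sim < threshold or ti in used_t or di in used_d:
            continue
        matches[di] = ti
        used_t.add(ti)
        used_d.add(di)
    return matches  # detection index -> track index

tracks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]         # existing identities
detections = [[0.1, 0.95, 0.0], [0.98, 0.05, 0.1]]  # new-frame embeddings
print(greedy_match(tracks, detections))  # -> {0: 1, 1: 0}
```

Production trackers such as DeepSORT [7] and JDE [8] instead solve the assignment optimally with the Hungarian algorithm [34] and fuse motion cues from a Kalman filter [33]; the greedy version above only shows why a lower-dimensional, well-separated embedding makes this association cheaper and more reliable.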
In the hardware quantization experiments, object classification inference results based on different VGG models were obtained, demonstrating the importance of quantization for image inference.
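The integer quantization described here maps floating-point weights and activations to 8-bit integers so that the FPGA convolution unit can work with fixed-point arithmetic. As a minimal, hedged sketch (the scale choice and tensor values are illustrative assumptions, not the thesis's exact scheme), the code below implements symmetric per-tensor int8 quantization in the style of the TensorRT talk cited as [41]:

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: map floats into
    [-127, 127] using a single scale derived from the max magnitude."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to approximate floats."""
    return [x * scale for x in q]

weights = [0.60, -0.30, 0.10, -1.00]   # hypothetical layer weights
q, scale = quantize_int8(weights)
print(q)  # -> [76, -38, 13, -127]
print(dequantize(q, scale))  # approximate reconstruction of the weights
```

The quantization error is the gap between `weights` and the dequantized values; keeping it small per layer is what lets an int8 convolution engine on the FPGA reproduce the floating-point model's classification results.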
Keywords (Chinese) ★ Object Detection (物件偵測)
★ Object Tracking (物件追蹤)
★ Embedded Hardware (嵌入式硬體)
★ Network Quantization (網路量化)
Keywords (English) ★ Object Detection
★ Object Tracking
★ Embedded Hardware
★ Network Quantization
Table of Contents
Chinese Abstract i
Abstract ii
Table of Contents iv
List of Figures viii
List of Tables x
Chapter 1  Introduction 1
1-1 Research Background 1
1-2 Research Motivation and Objectives 2
1-3 Research Methods and Chapter Overview 3
Chapter 2  Related Work 5
2-1 Convolutional Neural Networks 5
2-1-1 Convolutional Layer 5
2-1-2 Pooling Layer 6
2-1-3 Activation Functions 7
2-1-4 Fully Connected Layer 8
2-2 Neural Network Architectures 9
2-2-1 Residual Networks 9
2-2-2 Fully Convolutional Networks 12
2-2-3 Dilated Convolution 13
2-2-4 Deformable Convolutional Networks 14
2-2-5 Deep Layer Aggregation 15
2-3 Object Detection Architectures 18
2-3-1 YOLO 18
2-3-2 R-CNN 20
2-3-3 CenterNet 21
2-4 Object Tracking Architectures 24
2-4-1 Centroid Tracker 24
2-4-2 SORT 25
2-4-3 DeepSORT 28
2-4-4 JDE 29
2-5 Embedded Hardware Environment 32
2-5-1 Field-Programmable Gate Arrays 32
2-5-2 Hardware/Software Co-Design 33
Chapter 3  Object Tracking and Balancing Architecture 35
3-1 Network Overview 35
3-2 Network Architecture 35
3-2-1 Network Architecture Overview 35
3-2-2 Encoder-Decoder Network 36
3-2-3 Cross-Correlation Network Module 37
3-2-4 Convolutional Attention Module 40
3-2-5 Network Branches 42
3-3 Network Training 44
3-3-1 Pre-Training and Training Process 44
3-3-2 Loss Functions 44
3-3-3 Training Techniques 46
Chapter 4  Hardware/Software Design of the Convolutional Layer 49
4-1 System Overview 49
4-2 System Architecture 49
4-2-1 Convolutional Layer Module 49
4-2-2 Processing Element Architecture 52
4-2-3 SoC System Architecture 54
4-3 Network Training and Quantization 56
4-3-1 Network Training 57
4-3-2 Network Quantization 57
Chapter 5  Experimental Design and Results (I) 59
5-1 Experimental Environment Setup 59
5-2 Dataset Overview 59
5-2-1 CrowdHuman 60
5-2-2 2D MOT15 60
5-2-3 MOT16 and MOT17 61
5-2-4 MOT20 61
5-2-5 Caltech Pedestrian 62
5-2-6 CityPersons 62
5-2-7 CUHK-SYSU 62
5-2-8 PRW 63
5-2-9 ETHZ 63
5-3 Evaluation Metrics 63
5-4 Experimental Procedure 65
5-5 Experimental Results 65
5-5-1 Cross-Correlation and Residual Modules 65
5-5-2 Cross-Correlation and Attention Residual Modules 68
5-5-3 Dimension Comparison for the Identity Embedding Branch 70
5-5-4 Tracking Performance Comparison Across Backbone Networks 71
Chapter 6  Experimental Design and Results (II) 73
6-1 Experimental Environment 73
6-1-1 Hardware Environment 73
6-1-2 Software Environment 74
6-2 Dataset Overview 75
6-3 Experimental Procedure 76
6-4 Experimental Results 76
6-4-1 Inference with the Quantized VGG16 Architecture 76
6-4-2 Inference with the Quantized VGG19 Architecture 77
Chapter 7  Conclusion 79
References 80
References
[1] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 779-788, doi: 10.1109/CVPR.2016.91.
[2] R. Girshick, "Fast R-CNN," Proceedings of the IEEE International Conference on Computer Vision (ICCV), Dec. 2015, pp. 1440-1448, doi: 10.1109/ICCV.2015.169.
[3] X. Zhou, D. Wang and P. Krähenbühl, "Objects as Points," arXiv:1904.07850 [cs], Apr. 2019, Accessed: Apr. 20, 2021. [Online]. Available: https://arxiv.org/abs/1904.07850.
[4] A. Rosebrock, "Simple object tracking with OpenCV," PyImageSearch. Accessed: May 5, 2021. [Online]. Available: https://www.pyimagesearch.com/2018/07/23/simple-object-tracking-with-opencv/.
[5] A. Bewley, Z. Ge, L. Ott, F. Ramos and B. Upcroft, "Simple Online and Realtime Tracking," IEEE International Conference on Image Processing (ICIP), pp. 3464-3468, 2016, doi: 10.1109/ICIP.2016.7533003.
[6] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," Advances in Neural Information Processing Systems (NIPS), 2015.
[7] N. Wojke, A. Bewley and D. Paulus, "Simple Online and Realtime Tracking with a Deep Association Metric," IEEE International Conference on Image Processing (ICIP), pp. 3645-3649, 2017, doi: 10.1109/ICIP.2017.8296962.
[8] Z. Wang, L. Zheng, Y. Liu, Y. Li and S. Wang, "Towards Real-Time Multi-Object Tracking," European Conference on Computer Vision (ECCV), pp. 107-122, Nov. 2020.
[9] Y. Zhang, C. Wang, X. Wang, W. Zeng and W. Liu, "FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking," arXiv:2004.01888 [cs], Apr. 2020, Accessed: Apr. 20, 2021. [Online]. Available: https://arxiv.org/abs/2004.01888.
[10] C. Liang, Z. Zhang, Y. Lu, X. Zhou, B. Li, X. Ye and J. Zou, "Rethinking the competition between detection and ReID in Multi-Object Tracking," arXiv:2010.12138 [cs], Oct. 2020, Accessed: Apr. 20, 2021. [Online]. Available: https://arxiv.org/abs/2010.12138.
[11] S. Woo, J. Park, J. Lee and I. S. Kweon, "CBAM: Convolutional Block Attention Module," Proceedings of the European Conference on Computer Vision (ECCV), Sep. 2018.
[12] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998, doi: 10.1109/5.726791.
[13] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems 25 (NIPS 2012), pp. 1097-1105, 2012.
[14] M. D. Zeiler and R. Fergus, "Visualizing and Understanding Convolutional Networks," arXiv:1311.2901 [cs], Nov. 2013, Accessed: Apr. 20, 2021. [Online]. Available: https://arxiv.org/abs/1311.2901.
[15] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv:1409.1556 [cs], Sep. 2014, Accessed: Apr. 20, 2021. [Online]. Available: https://arxiv.org/abs/1409.1556.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, "Going Deeper with Convolutions," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1-9, doi: 10.1109/CVPR.2015.7298594.
[17] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
[18] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," arXiv:1502.03167v3 [cs], Feb. 2015, Accessed: Apr. 25, 2021. [Online]. Available: https://arxiv.org/abs/1502.03167.
[19] X. Glorot, A. Bordes and Y. Bengio, "Deep Sparse Rectifier Neural Networks," in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 15, 2011.
[20] G. Huang, Z. Liu, L. van der Maaten and K. Q. Weinberger, "Densely Connected Convolutional Networks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 2261-2269, doi: 10.1109/CVPR.2017.243.
[21] J. Hu, L. Shen, S. Albanie, G. Sun and E. Wu, "Squeeze-and-Excitation Networks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 7132-7141, doi: 10.1109/CVPR.2018.00745.
[22] J. Long, E. Shelhamer and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 3431-3440, doi: 10.1109/CVPR.2015.7298965.
[23] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan and S. Belongie, "Feature Pyramid Networks for Object Detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 936-944, doi: 10.1109/CVPR.2017.106.
[24] F. Yu and V. Koltun, "Multi-Scale Context Aggregation by Dilated Convolutions," International Conference on Learning Representations (ICLR), 2016.
[25] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu and Y. Wei, "Deformable Convolutional Networks," Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 764-773, doi: 10.1109/ICCV.2017.89.
[26] F. Yu, D. Wang, E. Shelhamer and T. Darrell, "Deep Layer Aggregation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 2403-2412, doi: 10.1109/CVPR.2018.00255.
[27] J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 6517-6525, doi: 10.1109/CVPR.2017.690.
[28] J. Deng, W. Dong, R. Socher, L. Li, K. Li and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2009, pp. 248-255, doi: 10.1109/CVPR.2009.5206848.
[29] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv:1804.02767 [cs], Apr. 2018, Accessed: Apr. 25, 2021. [Online]. Available: https://arxiv.org/abs/1804.02767.
[30] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2014, pp. 580-587, doi: 10.1109/CVPR.2014.81.
[31] T. Lin, P. Goyal, R. Girshick, K. He and P. Dollár, "Focal Loss for Dense Object Detection," Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 2999-3007, doi: 10.1109/ICCV.2017.324.
[32] A. Newell, K. Yang and J. Deng, "Stacked Hourglass Networks for Human Pose Estimation," arXiv:1603.06937 [cs], Mar. 2016, Accessed: Apr. 20, 2021. [Online]. Available: https://arxiv.org/abs/1603.06937.
[33] R. Kalman, "A New Approach to Linear Filtering and Prediction Problems," Journal of Basic Engineering, vol. 82, no. Series D, pp. 35-45, 1960.
[34] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, pp. 83-97, 1955.
[35] F. Pickett and T. O′Neal, "Zynq-7000," Xilinx Wiki. Accessed: May 10, 2021. [Online]. Available: https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/189530183/Zynq-7000.
[36] "Vivado Simulation Flow," Xilinx. Accessed: May 10, 2021. [Online]. Available: https://www.xilinx.com/products/design-tools/vivado/simulation.html.
[37] "PetaLinux Tools," Xilinx. Accessed: May 10, 2021. [Online]. Available: https://www.xilinx.com/products/design-tools/embedded-software/petalinux-sdk.html.
[38] T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick and P. Dollár, "Microsoft COCO: Common Objects in Context," arXiv:1405.0312 [cs], May 2014, Accessed: Apr. 27, 2021. [Online]. Available: https://arxiv.org/abs/1405.0312.
[39] S. Shao, Z. Zhao, B. Li, T. Xiao, G. Yu, X. Zhang and J. Sun, "CrowdHuman: A Benchmark for Detecting Human in a Crowd," arXiv:1805.00123 [cs], Apr. 2018, Accessed: Apr. 27, 2021. [Online]. Available: https://arxiv.org/abs/1805.00123.
[40] A. Kendall, Y. Gal and R. Cipolla, "Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp. 7482-7491, doi: 10.1109/CVPR.2018.00781.
[41] Szymon Migacz, "8-bit Inference with TensorRT," NVIDIA, 8 May 2017. Accessed: May 10, 2021. [Online]. Available: https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf.
[42] L. Leal-Taixé, A. Milan, I. Reid, S. Roth and K. Schindler, "MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking," arXiv:1504.01942 [cs], Apr. 2015, Accessed: Apr. 27, 2021. [Online]. Available: https://arxiv.org/abs/1504.01942.
[43] A. Milan, L. Leal-Taixé, I. Reid, S. Roth and K. Schindler, "MOT16: A Benchmark for Multi-Object Tracking," arXiv:1603.00831 [cs], Mar. 2016, Accessed: Apr. 27, 2021. [Online]. Available: https://arxiv.org/abs/1603.00831.
[44] P. Dendorfer, H. Rezatofighi, A. Milan, J. Shi, D. Cremers, I. Reid, S. Roth, K. Schindler and L. Leal-Taixé, "MOT20: A benchmark for multi object tracking in crowded scenes," arXiv:2003.09003 [cs], Mar. 2020, Accessed: Apr. 27, 2021. [Online]. Available: https://arxiv.org/abs/2003.09003.
[45] P. Dollar, C. Wojek, B. Schiele and P. Perona, "Pedestrian Detection: An Evaluation of the State of the Art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743-761, Apr. 2012, doi: 10.1109/TPAMI.2011.155.
[46] S. Zhang, R. Benenson, B. Schiele, "CityPersons: A Diverse Dataset for Pedestrian Detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 4457-4465, doi: 10.1109/CVPR.2017.474.
[47] T. Xiao, S. Li, B. Wang, L. Lin and X. Wang, "End-to-End Deep Learning for Person Search," arXiv:1604.01850v1 [cs], Apr. 2016, Accessed: Apr. 27, 2021. [Online]. Available: https://arxiv.org/abs/1604.01850v1.
[48] L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang and Q. Tian, "Person Re-identification in the Wild," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 3346-3355, doi: 10.1109/CVPR.2017.357.
[49] A. Ess, B. Leibe, K. Schindler and L. van Gool, "A Mobile Vision System for Robust Multi-Person Tracking," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2008, pp. 1-8, doi: 10.1109/CVPR.2008.4587581.
Advisor: Jia-Ching Wang (王家慶)    Approval Date: 2021-08-26
