Graduate Thesis 109522066: Detailed Record




Author: 林香岑 (Hsiang-Tsen Lin)    Department: Computer Science and Information Engineering
Thesis Title: Transformer-Based Multiple Pedestrian Tracking (基於Transformer架構的多行人追蹤)
Related Theses
★ A Grouping Mechanism Based on Social Relationships in edX Online Discussion Boards
★ A 3D Visualized Facebook Interaction System Built with Kinect
★ A Kinect-Based Assessment System for Smart Classrooms
★ An Intelligent Metropolitan Route-Planning Mechanism for Mobile Device Applications
★ Dynamic Texture Transfer Based on Analyzing Key-Motion Correlations
★ A Seam-Carving System that Preserves Straight-Line Structures in Images
★ A Community Recommendation Mechanism Built on an Open Online Community Learning Environment
★ System Design of an Interactive Situated Learning Environment for English as a Foreign Language
★ An Emotional Color-Transfer Mechanism with Skin-Color Preservation
★ A Gesture Recognition Framework for Virtual Keyboards
★ Error Analysis of Fractional-Power Grey Generating Prediction Models and Development of a Computer Toolbox
★ Real-Time Human Skeleton Motion Construction Using Inertial Sensors
★ Real-Time 3D Modeling Based on Multiple Cameras
★ A Grouping Mechanism for Genetic Algorithms Based on Complementarity and Social Network Analysis
★ A Virtual Musical Instrument Performance System with Real-Time Hand Tracking
★ A Real-Time Virtual Musical Instrument Performance System Based on Neural Networks
  1. The author has agreed to make this electronic thesis openly available immediately.
  2. Once open access takes effect, the full text is licensed to users only for personal, non-profit retrieval, reading, and printing for academic research.
  3. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.

Abstract (Chinese): Multiple object tracking is a very active research topic in machine learning; its greatest challenge is keeping the system stable when pedestrians overlap. Most solutions extract the appearance or motion features of objects in the preceding frames to correlate the similarity between an object in the current frame and the objects in the preceding frames, and then match all objects in a post-processing step using the Hungarian algorithm.
In early 2021, MIT, Facebook, Google, and others brought the Transformer, an architecture popular in natural language processing, into object tracking, and its accuracy surpassed existing models, setting off a rush of research. Although the introduction of the Transformer sparked a wave of enthusiasm and drew many to the architecture, the large training datasets and memory it requires remain a headache for many researchers and scholars.
In mid-2021, the first end-to-end Transformer-based model was published. Although its accuracy is not the best, its architecture is simple: the data association that previously had to be designed by hand is folded into the architecture, reducing the errors of manually designed functions and presenting the pipeline more intuitively. The end-to-end design, however, tightly couples object detection with the subsequent object tracking. In this thesis we propose a new idea: use the output of a YOLOv5 model as input data to assist the Transformer and improve model stability during training. This not only speeds up model convergence but also allows fewer stacked Transformer layers, lowering GPU memory requirements so that users with a single GPU can train the model with ease.
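As a concrete illustration of the association step described in the first paragraph above, here is a minimal sketch (not code from the thesis) that matches current-frame detections to existing tracks by solving a Hungarian assignment over a 1 − IoU cost matrix with SciPy:

```python
# Minimal sketch of the matching step described in the abstract:
# detections are assigned to tracks via the Hungarian algorithm
# on a 1 - IoU cost matrix. Hypothetical illustration only.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(track_boxes, det_boxes, iou_threshold=0.3):
    """Match current-frame detections to existing tracks.

    Returns (track_index, detection_index) pairs; pairs whose IoU
    falls below the threshold are rejected so those detections can
    spawn new tracks instead.
    """
    if not track_boxes or not det_boxes:
        return []
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes]
                     for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols)
            if 1.0 - cost[r, c] >= iou_threshold]
```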
Abstract (English): Multiple object tracking is a popular research topic in the field of machine learning. Its biggest challenge is maintaining stability when objects overlap. Most solutions extract appearance or motion features from the previous frames to correlate the similarity between objects in the current frame and those in the previous frames, and then match the objects using the Hungarian algorithm.
At the beginning of 2021, MIT, Facebook, Google, and others brought the Transformer, a popular architecture from natural language processing, into object tracking. Although this sparked a wave of enthusiasm and attracted many to the architecture, the large training datasets and memory the Transformer requires are unfriendly to researchers and scholars.
In mid-2021, the first end-to-end model built on the Transformer was published. Although its accuracy is not the best, its architecture is simple: the data association that used to be designed by hand is included in the architecture, which reduces the need for manual design and presents the pipeline more intuitively. However, because of the end-to-end design, object detection and object tracking strongly constrain each other. In this thesis, we propose a new idea: use the output of a YOLOv5 model as input data to assist the Transformer and improve model stability during training. This not only speeds up convergence but also reduces the number of stacked Transformer layers, lowering the GPU memory required so that users with a single GPU can train the model easily.
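As a rough illustration of the detector-assisted design the abstract proposes, the sketch below embeds YOLOv5 detections as decoder queries for a shallow Transformer. All module and parameter names (e.g. `DetectionGuidedDecoder`, the 5-dimensional box encoding) are assumptions made for illustration, not the thesis's actual implementation:

```python
# Illustrative sketch (assumed names and shapes, not the thesis's model):
# YOLOv5 boxes are embedded and used as decoder queries so the Transformer
# starts from detector evidence instead of learning it from scratch.
import torch
import torch.nn as nn

class DetectionGuidedDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=2):
        super().__init__()
        # Fewer stacked layers than a from-scratch DETR-style decoder,
        # which is where the claimed GPU-memory saving would come from.
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        # Embed each YOLOv5 box (x, y, w, h, confidence) into a query vector.
        self.box_embed = nn.Linear(5, d_model)
        self.track_head = nn.Linear(d_model, 4)  # refined box per query

    def forward(self, image_features, yolo_boxes):
        # image_features: (B, HW, d_model) encoder memory from a CNN backbone
        # yolo_boxes:     (B, N, 5) detections from a frozen YOLOv5 model
        queries = self.box_embed(yolo_boxes)             # detection-guided queries
        decoded = self.decoder(queries, image_features)  # cross-attend to image
        return self.track_head(decoded)                  # per-object outputs
```

In this sketch the decoder only refines boxes; the point is that detection-guided queries let the decoder stay shallow (two layers here), consistent with the abstract's claim of faster convergence and lower memory use.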
Keywords (Chinese) ★ Multiple object tracking
Table of Contents
1. Introduction
2. Related Work
2.1. Object Detection
2.1.1. R-CNN
2.1.2. YOLO
2.1.3. DETR
2.2. Multiple Object Tracking
2.2.1. Algorithm Based
2.2.2. Detection Based
2.2.3. Graph Neural Network Based
2.2.4. Transformer Based
3. Method
3.1. Pipeline
3.2. Preprocessing
3.3. Architecture
4. Experiment
4.1. Datasets
4.2. Implementation Details
4.3. Experimental Results
5. Conclusion
6. References
References
[1] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," in Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[2] Ross Girshick, "Fast R-CNN," in International Conference on Computer Vision (ICCV), 2015.
[3] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1137-1149, 2017.
[4] Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[5] Joseph Redmon, Ali Farhadi, "YOLO9000: Better, Faster, Stronger," in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[6] Joseph Redmon, Ali Farhadi, "YOLOv3: An Incremental Improvement," in arXiv:1804.02767, 2018.
[7] Cui Gao, Qiang Cai, Shaofeng Ming, "YOLOv4 Object Detection Algorithm with Efficient Channel Attention Mechanism," in International Conference on Mechanical, Control and Computer Engineering (ICMCCE), 2020.
[8] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko, "End-to-End Object Detection with Transformers," in European Conference on Computer Vision (ECCV), 2020.
[9] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai, "Deformable DETR: Deformable Transformers for End-to-End Object Detection," in The International Conference on Learning Representations (ICLR), 2021.
[10] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, Yichen Wei, "Deformable Convolutional Networks", in IEEE International Conference on Computer Vision (ICCV), 2017.
[11] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, Ben Upcroft, "Simple Online and Realtime Tracking," in International Conference on Image Processing (ICIP), 2016.
[12] Nicolai Wojke, Alex Bewley, Dietrich Paulus, "Simple Online and Realtime Tracking with a Deep Association Metric," in International Conference on Image Processing (ICIP), 2017.
[13] Zhongdao Wang, Liang Zheng, Yixuan Liu, Yali Li, Shengjin Wang, "Towards Real-Time Multi-Object Tracking," in European Conference on Computer Vision (ECCV), 2020.
[14] Jiahe Li, Xu Gao, Tingting Jiang, "Graph Network for Multiple Object Tracking," in Winter Conference on Applications of Computer Vision (WACV), 2020.
[15] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, Christoph Feichtenhofer, "TrackFormer: Multi-Object Tracking with Transformers," in Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[16] Fangao Zeng, Bin Dong, Yuang Zhang, Tiancai Wang, Xiangyu Zhang, Yichen Wei, "MOTR: End-to-End Multiple-Object Tracking with Transformer," in arXiv:2105.03247, 2021.
[17] Chenglin Yang, Yilin Wang, Jianming Zhang, He Zhang, Zijun Wei, Zhe Lin, Alan Yuille, "Lite Vision Transformer with Enhanced Self-Attention," in arXiv:2112.10809, 2021.
[18] Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan, "VOLO: Vision Outlooker for Visual Recognition," in arXiv:2106.13112, 2021.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, "Deep Residual Learning for Image Recognition," in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence Zitnick, "Microsoft COCO: Common Objects in Context," in European Conference on Computer Vision (ECCV), 2014.
Advisor: 施國琛 (Timothy K. Shih)    Date of Approval: 2022-08-08
