Thesis 109582601: Detailed Record




Author: Pimpa Cheewaprakobkit    Department: Computer Science and Information Engineering
Thesis Title (Chinese): 基於注意力和記憶動態融合的單物件追蹤方法
Thesis Title (English): Advancing Single Object Tracking based on Fusion of Attention and Memory Dynamics
Related Theses
★ Accelerated Point Cloud Rendering and Density Enhancement via Depth Completion: An Improved ?3???? SLAM System Implementation
  1. This electronic thesis is approved for immediate open access.
  2. The open-access full text is licensed to users for personal, non-commercial retrieval, reading, and printing for academic research purposes only.
  3. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.

Abstract (Chinese): Deep neural networks have revolutionized the field of computer vision and brought major advances to single object tracking. These networks, however, still struggle to handle appearance changes and occlusions of the target in dynamic environments, and maintaining consistent tracking over long periods, particularly in the presence of similar-looking background objects, remains a significant challenge. The core difficulty of single object tracking lies in the frequent appearance changes the target undergoes throughout a video sequence: variations in aspect ratio, scale, and pose can substantially affect a tracker's stability, while occlusion by other objects and cluttered backgrounds further complicate the task of keeping a consistent track.
To address these challenges, this dissertation proposes a tracking architecture that combines a temporal convolutional network (TCN), an attention mechanism, and a spatial-temporal memory network. The TCN component plays a key role by capturing temporal dependencies within the video sequence, allowing the model to learn how the object's appearance evolves over time and making it more adaptable to short-term appearance changes. The attention mechanism brings a two-fold benefit: it lets the model focus on the most relevant regions of the frame given the current context, which lowers computational complexity and is especially helpful with cluttered backgrounds or multiple similar objects, and it directs the model toward the informative features that are critical for tracking the target. The final component, the spatial-temporal memory network, exploits long-term memory by storing historical information about the target, including its appearance and motion patterns; this stored information serves as a reference for the tracker, allowing it to better adapt to target deformation and occlusion. By effectively combining these three components, the proposed architecture aims to achieve tracking performance superior to existing methods.
The effectiveness of our approach is validated through extensive evaluations on several benchmark datasets, including GOT-10K, OTB2015, UAV123, and VOT2018. Our model achieves an average overlap (AO) of 67.5% on GOT-10K, a success score (AUC) of 72.1% on OTB2015, a success score (AUC) of 65.8% on UAV123, and an accuracy of 59.0% on VOT2018.
The results demonstrate the strong tracking capability of the proposed method on single object tracking tasks and its potential to address the challenges of appearance variation and long-term tracking. By combining attention and memory dynamics, this research provides a robust and flexible solution that improves tracking accuracy and stability in complex real-world scenes, advancing the development of tracking systems.
Abstract (English): Deep neural networks have revolutionized the field of computer vision, leading to significant advancements in single object tracking tasks. However, these networks still encounter challenges in handling dynamic environments where target objects undergo appearance changes and occlusions. Additionally, maintaining consistent tracking across extended periods, especially when faced with similar-looking background objects, presents a significant challenge. The core difficulty in single object tracking arises from the frequent variations a target's appearance can undergo throughout the video sequence. These variations, such as changes in aspect ratio, scale, and pose, can significantly impact the robustness of trackers. Additionally, occlusions by other objects and cluttered backgrounds further complicate the process of maintaining a consistent track.
To address these challenges, this dissertation proposes a novel tracking architecture that leverages the combined strengths of a temporal convolutional network (TCN), an attention mechanism, and a spatial-temporal memory network. The TCN component plays a critical role by capturing temporal dependencies within the video sequence. This enables the model to learn how an object's appearance evolves over time, resulting in greater resilience to short-term appearance changes. Incorporating an attention mechanism offers a two-fold benefit. Firstly, it reduces the computational complexity of the model by enabling it to focus on the most relevant regions of the frame based on the current context. This is particularly advantageous in scenarios with cluttered backgrounds or multiple similar objects present. Secondly, the attention mechanism directs the model's focus towards informative features that are critical for tracking the target object. The final component, the spatial-temporal memory network, leverages the power of long-term memory. This network stores historical information about the target object, including its appearance and motion patterns. This stored information serves as a reference point for the tracker, allowing it to better adapt to target deformations and occlusions. By effectively combining these three elements, our proposed architecture aims to achieve superior tracking performance compared to existing methods.
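The abstract describes the three components only at a high level. As a rough illustration, the following Python (PyTorch) sketch shows one way a dilated causal TCN block, a cross-attention read over a spatial-temporal memory, and a fusion step could be wired together; all module names, channel sizes, and the concatenation-based fusion are assumptions made for illustration, not the dissertation's actual implementation.

# Minimal sketch (assumptions, not the dissertation's code): a dilated causal
# TCN block, a memory-read cross-attention module, and a simple fusion head.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalTCNBlock(nn.Module):
    """Dilated causal 1-D convolution over the time axis with a residual connection."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-only padding keeps the conv causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x):                      # x: (B, C, T)
        y = F.pad(x, (self.pad, 0))            # pad only the past side
        y = F.relu(self.norm(self.conv(y)))
        return x + y                           # residual connection


class MemoryReadAttention(nn.Module):
    """Cross-attention: search-frame features query stored memory features."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, query_feat, memory_feat):    # (B, Nq, C), (B, Nm, C)
        out, _ = self.attn(query_feat, memory_feat, memory_feat)
        return query_feat + out                    # residual fusion of memory context


class FusionTrackerHead(nn.Module):
    """Fuses TCN-smoothed temporal features with a memory-attention read."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.tcn = nn.Sequential(
            CausalTCNBlock(channels, dilation=1),
            CausalTCNBlock(channels, dilation=2),
            CausalTCNBlock(channels, dilation=4),
        )
        self.memory_read = MemoryReadAttention(channels)
        self.fuse = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, search_feat, memory_feat):
        # search_feat: (B, T, C) per-frame search features; memory_feat: (B, Nm, C)
        temporal = self.tcn(search_feat.transpose(1, 2)).transpose(1, 2)   # (B, T, C)
        attended = self.memory_read(search_feat, memory_feat)              # (B, T, C)
        fused = torch.cat([temporal, attended], dim=-1).transpose(1, 2)    # (B, 2C, T)
        return self.fuse(fused).transpose(1, 2)                            # (B, T, C)


if __name__ == "__main__":
    head = FusionTrackerHead(channels=256)
    search = torch.randn(2, 8, 256)    # 8 search frames of 256-d backbone features
    memory = torch.randn(2, 32, 256)   # 32 stored memory entries
    print(head(search, memory).shape)  # torch.Size([2, 8, 256])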
The effectiveness of our approach is validated through extensive evaluations on several benchmark datasets, including GOT-10K, OTB2015, UAV123, and VOT2018. Our model achieves a state-of-the-art average overlap (AO) of 67.5% on the GOT-10K dataset, a 72.1% success score (AUC) on OTB2015, a 65.8% success score (AUC) on UAV123, and a 59.0% accuracy on the VOT2018 dataset.
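For reference, the reported numbers follow the usual benchmark conventions: AO is typically the mean intersection-over-union (IoU) between predicted and ground-truth boxes over all frames, and the success score (AUC) is the area under the success-rate curve as the IoU threshold sweeps from 0 to 1. The short Python sketch below shows how such metrics are commonly computed; it is an illustrative approximation, not the official GOT-10k/OTB/VOT toolkit code, and the (x, y, w, h) box format is an assumption.

# Illustrative metric computation (assumptions, not the official toolkits).
import numpy as np


def iou(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Per-frame IoU for boxes given as (x, y, w, h), each of shape (N, 4)."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-9)


def average_overlap(pred: np.ndarray, gt: np.ndarray) -> float:
    """GOT-10k style AO: mean IoU over every frame."""
    return float(iou(pred, gt).mean())


def success_auc(pred: np.ndarray, gt: np.ndarray, steps: int = 21) -> float:
    """OTB/UAV123 style success score: mean success rate over IoU thresholds in [0, 1]."""
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0.0, 1.0, steps)
    return float(np.mean([(overlaps > t).mean() for t in thresholds]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.uniform(10, 50, size=(100, 4))
    pred = gt + rng.normal(0, 2, size=(100, 4))   # jittered predictions
    print(f"AO={average_overlap(pred, gt):.3f}, AUC={success_auc(pred, gt):.3f}")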
The results highlight the superior tracking capabilities of our proposed approach in single object tracking tasks, demonstrating its potential to address the challenges posed by appearance variations and prolonged tracking scenarios. This research contributes to the advancement of tracking systems by offering a robust and adaptive solution that combines attention and memory dynamics to enhance tracking accuracy and robustness in complex real-world scenarios.
Keywords: ★ Temporal Convolutional Network
★ attention mechanism
★ spatial-temporal memory
★ single object tracking
Table of Contents

ABSTRACT (CHINESE)
ABSTRACT (ENGLISH)
ACKNOWLEDGMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1. INTRODUCTION
1.1 Background
1.2 Objective of Research
1.3 Dissertation Outline
CHAPTER 2. LITERATURE REVIEW
2.1 Single Object Tracking (SOT)
2.2 Siamese Networks
2.3 Siamese Fully Convolutional (SiamFC)
2.4 Siamese-RPN
2.5 Memory Networks
2.6 Background Suppression
2.7 Temporal Convolutional Network (TCN)
CHAPTER 3. PROPOSED METHOD
3.1 Backbone Network
3.1.1 Historical branch
3.1.2 Search branch
3.1.3 GoogLeNet architecture
3.2 Temporal Convolutional Network and Attention Mechanism
3.2.1 Causal convolutions
3.2.2 Dilated convolutions
3.2.3 Residual Connections
3.3 Attention Mechanism
3.4 Spatial-Temporal Memory Network
3.5 Prediction Network
3.5.1 The Regression Branch
3.5.2 The Classification Branch
3.5.3 Loss Function
CHAPTER 4. EXPERIMENTAL RESULTS
4.1 Training Dataset
4.2 Comparison
CHAPTER 5. CONCLUSION AND FUTURE WORKS
5.1 Conclusion
5.2 Future Works
REFERENCES
Advisors: Timothy K. Shih (施國琛), Chih-Yang Lin (林智揚)    Approval Date: 2024-05-30
