Master's/Doctoral Thesis 110521025: Detailed Record




Name: Chun-Lin Lee (李俊霖)    Graduate Department: Department of Electrical Engineering
Thesis Title: SiamCATR: An Efficient and Accurate Visual Tracking via Cross-Attention Transformer and Channel-Attention Feature Fusion Network Based on Siamese Network
Related Theses
★ Low-Memory Hardware Design for Real-Time SIFT Feature Extraction
★ Access Control System with Real-Time Face Detection and Face Recognition
★ An Autonomous Vehicle with Real-Time Automatic Following
★ Lossless Compression Algorithm and Implementation for Multi-Lead ECG Signals
★ Offline Customizable Voice and Speaker Wake-Word System with Embedded Implementation
★ Wafer Map Defect Classification and Embedded System Implementation
★ Speech Densely Connected Convolutional Network for Small-Footprint Keyword Spotting
★ G2LGAN: Data Augmentation on Imbalanced Datasets for Wafer Map Defect Classification
★ Algorithm Design Techniques for Compensating Finite Precision of Multiplierless Digital Filters
★ Design and Implementation of a Programmable Viterbi Decoder
★ Low-Cost Vector Rotator IP Design Based on Extended Elementary-Angle CORDIC
★ Analysis and Architecture Design of a JPEG2000 Still Image Coding System
★ Low-Power Turbo Code Decoder for Communication Systems
★ Platform-Based Design for Multimedia Communications
★ Design and Implementation of a Digital Watermarking System for MPEG Encoders
★ Algorithm Development for Video Error Concealment and Its Data Reuse Considerations
Files: Full text viewable in the system after 2027-08-31 (under embargo until then)
Abstract
Visual object tracking has long been an important topic in computer vision, with wide applications in fields such as autonomous driving, surveillance systems, and drones. Its core purpose is to accurately track a specified target through a continuous image sequence and to maintain stable tracking even under partial occlusion, illumination changes, fast motion, or complex backgrounds.
With the rapid development of deep learning, visual object tracking networks have evolved from traditional feature-matching methods to deep neural networks that extract rich features for tracking. More recently, following the success of Vision Transformer models across a range of tasks, the performance of visual object tracking networks has improved significantly. However, along with the gains in accuracy, the parameter counts and computational loads of these models have also grown substantially. Since practical visual tracking applications are often deployed on edge devices with limited hardware resources, real-time tracking becomes a major challenge, and achieving an efficient, lightweight design while preserving accuracy is a demanding research direction. In this thesis, we propose a hybrid model that combines convolutional neural networks (CNNs) and the Transformer architecture, named SiamCATR. We introduce a Transformer-based cross-attention mechanism to strengthen the model's representation of similar features across feature maps. To fuse features effectively, we also incorporate a channel-attention depthwise cross-correlation, so that the target's features are fully combined in every feature channel; together, these modules form an efficient feature fusion network. We conducted experiments on multiple visual object tracking datasets. The results show that, compared with current efficient and lightweight network architectures, our proposed architecture achieves the best accuracy while meeting real-time tracking requirements, demonstrating that our model is highly competitive for visual object tracking tasks.
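The two fusion ideas named in the abstract can be illustrated with a short, self-contained sketch. This is not the thesis implementation: all module names, tensor sizes, and hyperparameters (e.g., reduction=4, 8 attention heads, 256 channels) are illustrative assumptions. It only shows the general shape of an SE-style channel-attention depthwise cross-correlation and a Transformer-style cross-attention block, assuming a PyTorch setting.

```python
# Minimal sketch (not the thesis code) of channel-attention depthwise
# cross-correlation and cross-attention fusion; names/sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def depthwise_xcorr(search: torch.Tensor, template: torch.Tensor) -> torch.Tensor:
    """Depthwise cross-correlation: each template channel is slid over the
    matching channel of the search-region features (grouped-conv trick)."""
    b, c, h, w = search.shape
    search = search.reshape(1, b * c, h, w)
    kernel = template.reshape(b * c, 1, template.shape[2], template.shape[3])
    resp = F.conv2d(search, kernel, groups=b * c)
    return resp.reshape(b, c, resp.shape[-2], resp.shape[-1])


class ChannelAttentionXCorr(nn.Module):
    """Depthwise cross-correlation followed by SE-style channel attention,
    so each correlation response channel is reweighted before prediction."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, search, template):
        resp = depthwise_xcorr(search, template)   # (B, C, H', W')
        weights = self.fc(resp.mean(dim=(2, 3)))   # squeeze over space, excite
        return resp * weights[:, :, None, None]    # per-channel reweighting


class CrossAttentionFusion(nn.Module):
    """Cross-attention in the spirit of the abstract: search-region tokens
    (queries) attend to template tokens (keys/values)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search_tokens, template_tokens):
        out, _ = self.attn(search_tokens, template_tokens, template_tokens)
        return self.norm(search_tokens + out)      # residual + LayerNorm


if __name__ == "__main__":
    z = torch.randn(2, 256, 8, 8)     # template features (illustrative size)
    x = torch.randn(2, 256, 16, 16)   # search-region features
    print(ChannelAttentionXCorr(256)(x, z).shape)  # torch.Size([2, 256, 9, 9])

    xa = CrossAttentionFusion(256)(
        x.flatten(2).transpose(1, 2),  # (B, 256 tokens, dim 256)
        z.flatten(2).transpose(1, 2),  # (B, 64 tokens, dim 256)
    )
    print(xa.shape)                    # torch.Size([2, 256, 256])
```

In trackers of this family, the reweighted response map would then feed classification and regression heads; the sketch stops at the fused features.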
Keywords (Chinese) ★ Single-Object Visual Tracking
★ Neural Network Model
Keywords (English) ★ Single Visual Object Tracking
★ CNN-Transformer Architecture
Table of Contents
Abstract (Chinese)
Abstract (English)
List of Tables
List of Figures
1. Introduction
1.1 Research Background and Motivation
2. Literature Review
2.1 Deep and Efficient Lightweight Backbone Networks
2.2 Deep-Learning-Based Visual Object Tracking Algorithms
2.3 Deep-Learning-Based Efficient and Lightweight Visual Object Tracking Algorithms
3. Efficient Visual Tracking Neural Network Architecture
3.1 Design Motivation and Concept
3.2 Overview of the Neural Network Architecture
3.3 Feature Extraction Backbone Network
3.4 Feature Fusion Network
3.5 Prediction Subnetwork Module
3.6 Training Loss Functions
4. An Autonomous Following Nursing Cart with Depth-Estimation Ranging and Smart Wake-Up
4.1 Design Motivation and Background
4.2 Overview of the Autonomous Following Nursing Cart System
4.3 Architecture Combining the Depth Estimation Network with the Object Tracking Network
5. Experimental Results and Discussion
5.1 Datasets
5.2 Implementation Details
5.3 Evaluation Metrics
5.4 Comparison Results: SiamCATR
5.5 Ablation Studies
5.6 Experimental Results and Analysis: Person-Tracking System for the Following Nursing Cart
5.7 Discussion
6. Conclusion
References
Advisor: Tsung-Han Tsai (蔡宗漢)    Date of Approval: 2024-08-13