References
〔1〕 L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fully-convolutional siamese networks for object tracking,” in Proc. European Conference on Computer Vision, pp. 850-865, Oct. 2016.
〔2〕 B. Chen, P. Li, L. Bai, L. Qiao, Q. Shen, B. Li, W. Gan, W. Wu, and W. Ouyang, “Backbone is all your need: A simplified architecture for visual object tracking,” in Proc. European Conference on Computer Vision, pp. 375-392, Oct. 2022.
〔3〕 B. Ye, H. Chang, B. Ma, S. Shan, and X. Chen, “Joint feature learning and relation modeling for tracking: A one-stream framework,” in Proc. European Conference on Computer Vision, pp. 341-357, Oct. 2022.
〔4〕 X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 8126-8135, June 2021.
〔5〕 A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. International Conference on Learning Representations, May 2021.
〔6〕 Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE International Conference on Computer Vision, pp. 10012-10022, Oct. 2021.
〔7〕 Y. Cui, C. Jiang, L. Wang, and G. Wu, “MixFormer: End-to-end tracking with iterative mixed attention,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 13608-13618, June 2022.
〔8〕 Y. Cui, C. Jiang, G. Wu, and L. Wang, “MixFormer: End-to-end tracking with iterative mixed attention,” arXiv preprint arXiv:2302.02814, Feb. 2023.
〔9〕 Z. Tu, H. Talebi, H. Zhang, F. Yang, P. Milanfar, A. Bovik, and Y. Li, “Maxvit: Multi-axis vision transformer,” in Proc. European Conference on Computer Vision, pp. 459-479, Oct. 2022.
〔10〕 J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, June 2009.
〔11〕 B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 447-456, June 2015.
〔12〕 A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Conference on Neural Information Processing Systems, pp. 6000-6010, Dec. 2017.
〔13〕 J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171-4186, June 2019.
〔14〕 Y. Wang, Y. Hou, H. Wang, Z. Miao, S. Wu, H. Sun, Q. Chen, Y. Xia, C. Chi, G. Zhao, Z. Liu, X. Xie, H. A. Sun, W. Deng, Q. Zhang, and M. Yang, “A neural corpus indexer for document retrieval,” in Proc. Conference on Neural Information Processing Systems, Nov. 2022.
〔15〕 L. Dong, S. Xu, and B. Xu, “Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5884-5888, Apr. 2018.
〔16〕 N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. European Conference on Computer Vision, pp. 213-229, Aug. 2020.
〔17〕 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, June 2016.
〔18〕 J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” in Proc. Conference on Neural Information Processing Systems-Deep Learning Symposium, Dec. 2016.
〔19〕 L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414-2423, June 2016.
〔20〕 A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Conference on Neural Information Processing Systems, Dec. 2012.
〔21〕 B. Yan, H. Peng, J. Fu, D. Wang, and H. Lu, “Learning spatio-temporal transformer for visual tracking,” in Proc. IEEE International Conference on Computer Vision, pp. 10448-10457, Oct. 2021.
〔22〕 T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. Hinton, “Pix2seq: A language modeling framework for object detection,” in Proc. International Conference on Learning Representations, Apr. 2022.
〔23〕 Y. Xiao, Y. Zhang, and P. Ni, “Ensemble long short-term tracking with ConvNeXt and transformer,” in Proc. IEEE International Conference on Image, Vision and Computing, pp. 688-693, Nov. 2022.
〔24〕 Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 11976-11986, June 2022.
〔25〕 S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders,” arXiv preprint arXiv:2301.00808, Jan. 2023.
〔26〕 S. Chan, Y. Wang, J. Tao, X. Zhou, J. Tao, and Q. Shao, “MLPT: Multilayer perceptron based tracking,” in Proc. IEEE International Conference on Systems, Man, and Cybernetics, pp. 1936-1941, Oct. 2022.
〔27〕 I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, M. Lucic, and A. Dosovitskiy, “Mlp-mixer: An all-mlp architecture for vision,” in Proc. Conference on Neural Information Processing Systems, Dec. 2021.
〔28〕 W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and S. Yan, “Metaformer is actually what you need for vision,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 10819-10829, June 2022.
〔29〕 W. Yu, C. Si, P. Zhou, M. Luo, Y. Zhou, J. Feng, S. Yan, and X. Wang, “Metaformer baselines for vision,” arXiv preprint arXiv:2210.13452, Oct. 2022.
〔30〕 L. Lin, H. Fan, Y. Xu, and H. Ling, “Swintrack: A simple and strong baseline for transformer tracking,” in Proc. Conference on Neural Information Processing Systems, Nov. 2022.
〔31〕 K. He, C. Zhang, S. Xie, Z. Li, and Z. Wang, “Target-aware tracking with long-term context attention,” in Proc. of the AAAI Conference on Artificial Intelligence, Feb. 2023.
〔32〕 H. Zhang, Y. Wang, F. Dayoub, and N. Sunderhauf, “Varifocalnet: An iou-aware dense object detector,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 8514-8523, June 2021.
〔33〕 H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 658-666, June 2019.
〔34〕 W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in Proc. IEEE International Conference on Computer Vision, pp. 568-578, Oct. 2021.
〔35〕 K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 16000-16009, June 2022.
〔36〕 J.-P. Lan, Z.-Q. Cheng, J.-Y. He, C. Li, B. Luo, X. Bao, W. Xiang, Y. Geng, and X. Xie, “ProContEXT: Exploring progressive context transformer for tracking,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, June 2023.
〔37〕 Z. Song, R. Luo, J. Yu, Y.-P. P. Chen, and W. Yang, “Compact transformer tracker with correlative masked modeling,” in Proc. of the AAAI Conference on Artificial Intelligence, Feb. 2023.
〔38〕 T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, Aug. 2017.
〔39〕 H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of image transformers,” in Proc. International Conference on Learning Representations, Apr. 2022.
〔40〕 Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei, “Beit v2: Masked image modeling with vector-quantized visual tokenizers,” in Proc. International Conference on Learning Representations, May 2023.
〔41〕 W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, O. K. Mohammed, S. Singhal, S. Som, and F. Wei, “Image as a foreign language: Beit pretraining for all vision and vision-language tasks,” arXiv preprint arXiv:2208.10442, Aug. 2022.
〔42〕 S. Gao, C. Zhou, and J. Zhang, “Generalized relation modeling for transformer tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 18686-18695, June 2023.
〔43〕 C. Feichtenhofer, H. Fan, Y. Li, and K. He, “Masked autoencoders as spatiotemporal learners,” in Proc. Conference on Neural Information Processing Systems, pp. 35946-35958, Nov. 2022.
〔44〕 Z. Tong, Y. Song, J. Wang, and L. Wang, “Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training,” in Proc. Conference on Neural Information Processing Systems, Nov. 2022.
〔45〕 L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, and Y. Qiao, “VideoMAE V2: Scaling video masked autoencoders with dual masking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 14549-14560, June 2023.
〔46〕 Q. Wu, T. Yang, Z. Liu, B. Wu, Y. Shan, and A. B. Chan, “DropMAE: Masked autoencoders with spatial-attention dropout for tracking tasks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 14561-14571, June 2023.
〔47〕 H. Touvron, A. Vedaldi, M. Douze, and H. Jégou, “Fixing the train-test resolution discrepancy,” in Proc. Conference on Neural Information Processing Systems, Dec. 2019.
〔48〕 H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, “Cvt: Introducing convolutions to vision transformers,” in Proc. IEEE International Conference on Computer Vision, pp. 22-31, Oct. 2021.
〔49〕 P. Gao, T. Ma, H. Li, J. Dai, and Y. Qiao, “Convmae: Masked convolution meets masked autoencoders,” in Proc. Conference on Neural Information Processing Systems, Nov. 2022.
〔50〕 Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision transformer backbones for object detection,” in Proc. European Conference on Computer Vision, pp. 280-296, Oct. 2022.
〔51〕 H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in Proc. International Conference on Machine Learning, pp. 10347-10357, July 2021.
〔52〕 X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised vision transformers,” in Proc. IEEE International Conference on Computer Vision, pp. 9640-9649, Oct. 2021.
〔53〕 N. Mu, A. Kirillov, D. Wagner, and S. Xie, “Slip: Self-supervision meets language-image pre-training,” in Proc. European Conference on Computer Vision, pp. 529-544, Oct. 2022.
〔54〕 A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. International Conference on Machine Learning, pp. 8748-8763, July 2021.
〔55〕 G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708, July 2017.
〔56〕 A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby, “Big transfer (bit): General visual representation learning,” in Proc. European Conference on Computer Vision, pp. 491-507, Aug. 2020.
〔57〕 T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. Girshick, “Early convolutions help transformers see better,” in Proc. Conference on Neural Information Processing Systems, Dec. 2021.
〔58〕 I. Radosavovic, J. Johnson, S. Xie, W.-Y. Lo, and P. Dollár, “On network design spaces for visual recognition,” in Proc. IEEE International Conference on Computer Vision, pp. 1882-1890, Oct. 2019.
〔59〕 I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, “Designing network design spaces,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 10428-10436, June 2020.
〔60〕 M. Naseer, K. Ranasinghe, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, “Intriguing properties of vision transformers,” in Proc. Conference on Neural Information Processing Systems, pp. 23296-23308, Dec. 2021.
〔61〕 Y. Fang, W. Wang, B. Xie, Q. Sun, L. Wu, X. Wang, T. Huang, X. Wang, and Y. Cao, “Eva: Exploring the limits of masked visual representation learning at scale,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 19358-19369, June 2023.
〔62〕 A. Hassani, S. Walton, J. Li, S. Li, and H. Shi, “Neighborhood attention transformer,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 6185-6194, June 2023.
〔63〕 A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Proc. Conference on Neural Information Processing Systems, Dec. 2019.
〔64〕 A. Hassani and H. Shi, “Dilated neighborhood attention transformer,” arXiv preprint arXiv:2209.15001, Sep. 2022.
〔65〕 M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820-2828, June 2019.
〔66〕 M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in Proc. International Conference on Machine Learning, pp. 6105-6114, June 2019.
〔67〕 M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520, June 2018.
〔68〕 X. Chu, Z. Tian, B. Zhang, X. Wang, and C. Shen, “Conditional positional encodings for vision transformers,” in Proc. International Conference on Learning Representations, May 2023.
〔69〕 T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proc. European Conference on Computer Vision, pp. 740-755, Sep. 2014.
〔70〕 J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132-7141, June 2018.
〔71〕 C. Yang, S. Qiao, Q. Yu, X. Yuan, Y. Zhu, A. Yuille, H. Adam, and L.-C. Chen, “Moat: Alternating mobile convolution and attention brings strong vision models,” in Proc. International Conference on Learning Representations, May 2023.
〔72〕 R. Wightman, “Pytorch image models,” https://github.com/rwightman/pytorch-image-models, 2019.
〔73〕 M. Muller, A. Bibi, S. Giancola, S. Alsubaihi, and B. Ghanem, “Trackingnet: A large-scale dataset and benchmark for object tracking in the wild,” in Proc. European Conference on Computer Vision, pp. 300-317, Sep. 2018.
〔74〕 L. Huang, X. Zhao, and K. Huang, “Got-10k: A large high-diversity benchmark for generic object tracking in the wild,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 5, pp. 1562-1577, May 2021.
〔75〕 H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling, “Lasot: A high-quality benchmark for large-scale single object tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 5374-5383, June 2019.
〔76〕 T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117-2125, July 2017.
〔77〕 H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” in Proc. International Conference on Learning Representations, Apr. 2018.
〔78〕 E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke, “Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 5296-5305, July 2017.
〔79〕 M. Mueller, N. Smith, and B. Ghanem, “Context-aware correlation filter tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1396-1404, July 2017.
〔80〕 Y. Liang, Q. Li, and F. Long, “Global dilated attention and target focusing network for robust tracking,” in Proc. of the AAAI Conference on Artificial Intelligence, Feb. 2023.
〔81〕 L. Zhou, Z. Zhou, K. Mao, and Z. He, “Joint visual grounding and tracking with natural language specification,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 23151-23160, June 2023.
〔82〕 Y. Cui, T. Song, G. Wu, and L. Wang, “MixFormerV2: Efficient fully transformer tracking,” arXiv preprint arXiv:2305.15896, May 2023.
〔83〕 J. Wang, D. Chen, Z. Wu, C. Luo, X. Dai, L. Yuan, and Y.-G. Jiang, “OmniTracker: Unifying object tracking by tracking-with-detection,” arXiv preprint arXiv:2303.12079, Mar. 2023.
〔84〕 Z. Xie, Z. Geng, J. Hu, Z. Zhang, H. Hu, and Y. Cao, “Revealing the dark secrets of masked image modeling,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 14475-14485, June 2023.
〔85〕 X. Chen, H. Peng, D. Wang, H. Lu, and H. Hu, “SeqTrack: Sequence to sequence learning for visual object tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 14572-14581, June 2023.
〔86〕 B. Yan, Y. Jiang, J. Wu, D. Wang, P. Luo, Z. Yuan, and H. Lu, “Universal instance perception as object discovery and retrieval,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 15325-15336, June 2023.
〔87〕 X. Wei, Y. Bai, Y. Zheng, D. Shi, and Y. Gong, “Autoregressive visual tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 9697-9706, June 2023.
〔88〕 F. Xie, L. Chu, J. Li, Y. Lu, and C. Ma, “VideoTrack: Learning to track objects via video transformer,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 22826-22835, June 2023.
〔89〕 H. Zhao, D. Wang, and H. Lu, “Representation learning for visual object tracking by masked appearance transfer,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 18696-18705, June 2023.