References
〔1〕 L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fully-convolutional siamese networks for object tracking,” in Proc. European Conference on Computer Vision, pp. 850-865, Oct. 2016.
〔2〕 B. Chen, P. Li, L. Bai, L. Qiao, Q. Shen, B. Li, W. Gan, W. Wu, and W. Ouyang, “Backbone is all your need: A simplified architecture for visual object tracking,” in Proc. European Conference on Computer Vision, pp. 375-392, Oct. 2022.
〔3〕 B. Ye, H. Chang, B. Ma, S. Shan, and X. Chen, “Joint feature learning and relation modeling for tracking: A one-stream framework,” in Proc. European Conference on Computer Vision, pp. 341-357, Oct. 2022.
〔4〕 X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 8126-8135, June 2021.
〔5〕 A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. International Conference on Learning Representations, May 2021.
〔6〕 Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE International Conference on Computer Vision, pp. 10012-10022, Oct. 2021.
〔7〕 Y. Cui, C. Jiang, L. Wang, and G. Wu, “MixFormer: End-to-end tracking with iterative mixed attention,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 13608-13618, June 2022.
〔8〕 Y. Cui, C. Jiang, G. Wu, and L. Wang, “MixFormer: End-to-end tracking with iterative mixed attention,” arXiv preprint arXiv:2302.02814, Feb. 2023.
〔9〕 Z. Tu, H. Talebi, H. Zhang, F. Yang, P. Milanfar, A. Bovik, and Y. Li, “Maxvit: Multi-axis vision transformer,” in Proc. European Conference on Computer Vision, pp. 459-479, Oct. 2022.
〔10〕 J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, June 2009.
〔11〕 B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 447-456, June 2015.
〔12〕 A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Conference on Neural Information Processing Systems, pp. 6000-6010, Dec. 2017.
〔13〕 J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171-4186, June 2019.
〔14〕 Y. Wang, Y. Hou, H. Wang, Z. Miao, S. Wu, H. Sun, Q. Chen, Y. Xia, C. Chi, G. Zhao, Z. Liu, X. Xie, H. A. Sun, W. Deng, Q. Zhang, and M. Yang, “A neural corpus indexer for document retrieval,” in Proc. Conference on Neural Information Processing Systems, Nov. 2022.
〔15〕 L. Dong, S. Xu, and B. Xu, “Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5884-5888, Apr. 2018.
〔16〕 N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. European Conference on Computer Vision, pp. 213-229, Aug. 2020.
〔17〕 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, June 2016.
〔18〕 J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” in Proc. Conference on Neural Information Processing Systems-Deep Learning Symposium, Dec. 2016.
〔19〕 L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414-2423, June 2016.
〔20〕 A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Conference on Neural Information Processing Systems, Dec. 2012.
〔21〕 B. Yan, H. Peng, J. Fu, D. Wang, and H. Lu, “Learning spatio-temporal transformer for visual tracking,” in Proc. IEEE International Conference on Computer Vision, pp. 10448-10457, Oct. 2021.
〔22〕 T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. Hinton, “Pix2seq: A language modeling framework for object detection,” in Proc. International Conference on Learning Representations, Apr. 2022.
〔23〕 Y. Xiao, Y. Zhang, and P. Ni, “Ensemble long short-term tracking with ConvNeXt and transformer,” in Proc. IEEE International Conference on Image, Vision and Computing, pp. 688-693, Nov. 2022.
〔24〕 Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 11976-11986, June 2022.
〔25〕 S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders,” arXiv preprint arXiv:2301.00808, Jan. 2023.
〔26〕 S. Chan, Y. Wang, J. Tao, X. Zhou, J. Tao, and Q. Shao, “MLPT: Multilayer perceptron based tracking,” in Proc. IEEE International Conference on Systems, Man, and Cybernetics, pp. 1936-1941, Oct. 2022.
〔27〕 I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, M. Lucic, and A. Dosovitskiy, “Mlp-mixer: An all-mlp architecture for vision,” in Proc. Conference on Neural Information Processing Systems, Dec. 2021.
〔28〕 W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and S. Yan, “Metaformer is actually what you need for vision,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 10819-10829, June 2022.
〔29〕 W. Yu, C. Si, P. Zhou, M. Luo, Y. Zhou, J. Feng, S. Yan, and X. Wang, “Metaformer baselines for vision,” arXiv preprint arXiv:2210.13452, Oct. 2022.
〔30〕 L. Lin, H. Fan, Y. Xu, and H. Ling, “Swintrack: A simple and strong baseline for transformer tracking,” in Proc. Conference on Neural Information Processing Systems, Nov. 2022.
〔31〕 K. He, C. Zhang, S. Xie, Z. Li, and Z. Wang, “Target-aware tracking with long-term context attention,” in Proc. of the AAAI Conference on Artificial Intelligence, Feb. 2023.
〔32〕 H. Zhang, Y. Wang, F. Dayoub, and N. Sunderhauf, “Varifocalnet: An iou-aware dense object detector,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 8514-8523, June 2021.
〔33〕 H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 658-666, June 2019.
〔34〕 W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in Proc. IEEE International Conference on Computer Vision, pp. 568-578, Oct. 2021.
〔35〕 K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 16000-16009, June 2022.
〔36〕 J.-P. Lan, Z.-Q. Cheng, J.-Y. He, C. Li, B. Luo, X. Bao, W. Xiang, Y. Geng, and X. Xie, “ProContEXT: Exploring progressive context transformer for tracking,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, June 2023.
〔37〕 Z. Song, R. Luo, J. Yu, Y.-P. P. Chen, and W. Yang, “Compact transformer tracker with correlative masked modeling,” in Proc. of the AAAI Conference on Artificial Intelligence, Feb. 2023.
〔38〕 T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, Aug. 2017.
〔39〕 H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of image transformers,” in Proc. International Conference on Learning Representations, Apr. 2022.
〔40〕 Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei, “Beit v2: Masked image modeling with vector-quantized visual tokenizers,” in Proc. International Conference on Learning Representations, May 2023.
〔41〕 W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, O. K. Mohammed, S. Singhal, S. Som, and F. Wei, “Image as a foreign language: Beit pretraining for all vision and vision-language tasks,” arXiv preprint arXiv:2208.10442, Aug. 2022.
〔42〕 S. Gao, C. Zhou, and J. Zhang, “Generalized relation modeling for transformer tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 18686-18695, June 2023.
〔43〕 C. Feichtenhofer, H. Fan, Y. Li, and K. He, “Masked autoencoders as spatiotemporal learners,” in Proc. Conference on Neural Information Processing Systems, pp. 35946-35958, Nov. 2022.
〔44〕 Z. Tong, Y. Song, J. Wang, and L. Wang, “Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training,” in Proc. Conference on Neural Information Processing Systems, Nov. 2022.
〔45〕 L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, and Y. Qiao, “VideoMAE V2: Scaling video masked autoencoders with dual masking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 14549-14560, June 2023.
〔46〕 Q. Wu, T. Yang, Z. Liu, B. Wu, Y. Shan, and A. B. Chan, “DropMAE: Masked autoencoders with spatial-attention dropout for tracking tasks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 14561-14571, June 2023.
〔47〕 H. Touvron, A. Vedaldi, M. Douze, and H. Jégou, “Fixing the train-test resolution discrepancy,” in Proc. Conference on Neural Information Processing Systems, Dec. 2019.
〔48〕 H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, “Cvt: Introducing convolutions to vision transformers,” in Proc. IEEE International Conference on Computer Vision, pp. 22-31, Oct. 2021.
〔49〕 P. Gao, T. Ma, H. Li, J. Dai, and Y. Qiao, “Convmae: Masked convolution meets masked autoencoders,” in Proc. Conference on Neural Information Processing Systems, Nov. 2022.
〔50〕 Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision transformer backbones for object detection,” in Proc. European Conference on Computer Vision, pp. 280-296, Oct. 2022.
〔51〕 H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in Proc. International Conference on Machine Learning, pp. 10347-10357, July 2021.
〔52〕 X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised vision transformers,” in Proc. IEEE International Conference on Computer Vision, pp. 9640-9649, Oct. 2021.
〔53〕 N. Mu, A. Kirillov, D. Wagner, and S. Xie, “Slip: Self-supervision meets language-image pre-training,” in Proc. European Conference on Computer Vision, pp. 529-544, Oct. 2022.
〔54〕 A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. International Conference on Machine Learning, pp. 8748-8763, July 2021.
〔55〕 G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708, July 2017.
〔56〕 A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby, “Big transfer (bit): General visual representation learning,” in Proc. European Conference on Computer Vision, pp. 491-507, Aug. 2020.
〔57〕 T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. Girshick, “Early convolutions help transformers see better,” in Proc. Conference on Neural Information Processing Systems, Dec. 2021.
〔58〕 I. Radosavovic, J. Johnson, S. Xie, W.-Y. Lo, and P. Dollár, “On network design spaces for visual recognition,” in Proc. IEEE International Conference on Computer Vision, pp. 1882-1890, Oct. 2019.
〔59〕 I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, “Designing network design spaces,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 10428-10436, June 2020.
〔60〕 M. Naseer, K. Ranasinghe, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, “Intriguing properties of vision transformers,” in Proc. Conference on Neural Information Processing Systems, pp. 23296-23308, Dec. 2021.
〔61〕 Y. Fang, W. Wang, B. Xie, Q. Sun, L. Wu, X. Wang, T. Huang, X. Wang, and Y. Cao, “Eva: Exploring the limits of masked visual representation learning at scale,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 19358-19369, June 2023.
〔62〕 A. Hassani, S. Walton, J. Li, S. Li, and H. Shi, “Neighborhood attention transformer,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 6185-6194, June 2023.
〔63〕 A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Proc. Conference on Neural Information Processing Systems, Dec. 2019.
〔64〕 A. Hassani and H. Shi, “Dilated neighborhood attention transformer,” arXiv preprint arXiv:2209.15001, Sep. 2022.
〔65〕 M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820-2828, June 2019.
〔66〕 M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in Proc. International Conference on Machine Learning, pp. 6105-6114, June 2019.
〔67〕 M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520, June 2018.
〔68〕 X. Chu, Z. Tian, B. Zhang, X. Wang, and C. Shen, “Conditional positional encodings for vision transformers,” in Proc. International Conference on Learning Representations, May 2023.
〔69〕 T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proc. European Conference on Computer Vision, pp. 740-755, Sep. 2014.
〔70〕 J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132-7141, June 2018.
〔71〕 C. Yang, S. Qiao, Q. Yu, X. Yuan, Y. Zhu, A. Yuille, H. Adam, and L.-C. Chen, “Moat: Alternating mobile convolution and attention brings strong vision models,” in Proc. International Conference on Learning Representations, May 2023.
〔72〕 R. Wightman, “Pytorch image models,” https://github.com/rwightman/pytorch-image-models, 2019.
〔73〕 M. Muller, A. Bibi, S. Giancola, S. Alsubaihi, and B. Ghanem, “Trackingnet: A large-scale dataset and benchmark for object tracking in the wild,” in Proc. European Conference on Computer Vision, pp. 300-317, Sep. 2018.
〔74〕 L. Huang, X. Zhao, and K. Huang, “Got-10k: A large high-diversity benchmark for generic object tracking in the wild,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 5, pp. 1562-1577, May 2021.
〔75〕 H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling, “Lasot: A high-quality benchmark for large-scale single object tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 5374-5383, June 2019.
〔76〕 T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117-2125, July 2017.
〔77〕 H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” in Proc. International Conference on Learning Representations, Apr. 2018.
〔78〕 E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke, “Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 5296-5305, July 2017.
〔79〕 M. Mueller, N. Smith, and B. Ghanem, “Context-aware correlation filter tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1396-1404, July 2017.
〔80〕 Y. Liang, Q. Li, and F. Long, “Global dilated attention and target focusing network for robust tracking,” in Proc. of the AAAI Conference on Artificial Intelligence, Feb. 2023.
〔81〕 L. Zhou, Z. Zhou, K. Mao, and Z. He, “Joint visual grounding and tracking with natural language specification,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 23151-23160, June 2023.
〔82〕 Y. Cui, T. Song, G. Wu, and L. Wang, “MixFormerV2: Efficient fully transformer tracking,” arXiv preprint arXiv:2305.15896, May 2023.
〔83〕 J. Wang, D. Chen, Z. Wu, C. Luo, X. Dai, L. Yuan, and Y.-G. Jiang, “OmniTracker: Unifying object tracking by tracking-with-detection,” arXiv preprint arXiv:2303.12079, Mar. 2023.
〔84〕 Z. Xie, Z. Geng, J. Hu, Z. Zhang, H. Hu, and Y. Cao, “Revealing the dark secrets of masked image modeling,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 14475-14485, June 2023.
〔85〕 X. Chen, H. Peng, D. Wang, H. Lu, and H. Hu, “SeqTrack: Sequence to sequence learning for visual object tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 14572-14581, June 2023.
〔86〕 B. Yan, Y. Jiang, J. Wu, D. Wang, P. Luo, Z. Yuan, and H. Lu, “Universal instance perception as object discovery and retrieval,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 15325-15336, June 2023.
〔87〕 X. Wei, Y. Bai, Y. Zheng, D. Shi, and Y. Gong, “Autoregressive visual tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 9697-9706, June 2023.
〔88〕 F. Xie, L. Chu, J. Li, Y. Lu, and C. Ma, “VideoTrack: Learning to track objects via video transformer,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 22826-22835, June 2023.
〔89〕 H. Zhao, D. Wang, and H. Lu, “Representation learning for visual object tracking by masked appearance transfer,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 18696-18705, June 2023.