 Z. Dai, H. Liu, Q. V. Le, and M. Tan, “CoAtNet: marrying convolution and attention for all data sizes,” arXiv:2106.04803.
 S. M. Pizer, E. P. Amburn, J. D. Austin, R. Cromartie, A. Geselowitz, T. Greer, B. H. Romeny, J. B. Zimmerman, and K. Zuiderveld, “Adaptive histogram equalization and its variations,” Computer Vision, Graphics, and Image Processing, vol.39, no.3, pp.355-368, 1987.
 E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le, “AutoAugment: learning augmentation strategies from data,” arXiv:1805.09501v3.
 E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “RandAugment: practical automated data augmentation with a reduced search space,” arXiv:1909.13719v2.
 Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol.86, no.11, pp.2278-2324, Nov. 1998.
 A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Neural Information Processing Systems (NIPS), Lake Tahoe, Nevada, Dec.3-8, 2012, pp.1097-1105.
 K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556.
 C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, MA, Jun.7-12, 2015, pp.1-9.
 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, Jun.27-30, 2016, pp.770-778.
 M. Lin, Q. Chen, and S. Yan, “Network in network,” in Proc. Int. Conf. Learn. Represent. (ICLR), Banff, Canada, Apr.14-16, 2014, pp.274-278.
 F. N. Iandola, S. Han, W. Moskewicz, K. Ashraf, W. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” arXiv:1602.07360.
 A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861.
 F. Chollet, “Xception: deep learning with depthwise separable convolutions,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii, Jul.22-25, 2017, pp.1800-1807.
 X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: an extremely efficient convolutional neural network for mobile devices,” arXiv:1707.01083.
 G. Huang, Z. Liu, L. V. D. Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii, Jul.22-25, 2017, pp.4700-4708.
 M. Guo, T. Xu, J. Liu, Z. Liu, P. Jiang, T. Mu, S. Zhang, R. R. Martin, M. Cheng, and S. Hu, “Attention mechanisms in computer vision: a survey,” arXiv:2111.07624.
 J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” arXiv:1709.01507v4.
 V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” arXiv:1406.6247.
 A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Neural Information Processing Systems (NIPS), Long Beach, CA, Dec.4-9, 2017, pp.5998-6008.
 A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. Int. Conf. Learn. Represent. (ICLR), Vienna, Austria, May 3-7, 2021, pp.1-21.
 S. Woo, J. Park, J. Lee, and I. S. Kweon, “CBAM: convolutional block attention module,” arXiv:1807.06521v2.
 J. Li, J. Wang, Q. Tian, W. Gao, and S. Zhang, “Global-local temporal representations for video person re-identification,” in Proc. of IEEE/CVF Int. Conf. on Computer Vision (ICCV), Seoul, Korea, Oct.27-Nov.2, 2019, pp.3958-3967.
 S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol.9, no.8, pp.1735-1780, 1997.
 Y. Fu, X. Wang, Y. Wei, and T. Huang, “STA: spatial-temporal attention for large-scale video-based person re-identification,” in Proc. of AAAI Conf. on Artificial Intelligence, Honolulu, Hawaii, Jan.27-Feb.1, 2019, vol.33, pp.8287-8294.
 X. Li, W. Wang, X. Hu, and J. Yang, “Selective kernel networks,” in Proc. of IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, Jun.16-20, 2019, pp.510-519.
 M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. C. Chen, “MobileNetV2: inverted residuals and linear bottlenecks,” arXiv:1801.04381.
 D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv:1606.08415.
 T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” arXiv:1612.03144.
 J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and J. Yang, “Deformable convolutional networks,” in Proc. of IEEE/CVF Int. Conf. on Computer Vision (ICCV), Venice, Italy, Oct.22-29, 2017, pp.764-773.
 X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable ConvNets v2: more deformable, better results,” in Proc. of IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, Jun.16-20, 2019, pp.9308-9316.
 E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden, “Pyramid methods in image processing,” RCA Engineer, vol.29, no.6, pp.33-41, 1984.
 W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg, “SSD: single shot MultiBox detector,” arXiv:1512.02325.
 D. G. Lowe, “Object recognition from local scale-invariant features,” in Proc. of IEEE/CVF Int. Conf. on Computer Vision (ICCV), Kerkyra, Greece, Sep.20-25, 1999, pp.1150-1157.
 Y. L. Boureau, J. Ponce, and Y. LeCun, “A theoretical analysis of feature pooling in visual recognition,” in Proc. International Conference on Machine Learning (ICML), Haifa, Israel, Jun.21-24, 2010, pp.111-118.
 Q. Li, S. Jin, and J. Yan, “Mimicking very efficient network for object detection,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii, Jul.22-25, 2017, pp.6356-6364.
 Z. Zhang and M. R. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” in Proc. of Neural Information Processing Systems (NIPS), Palais des Congrès de Montréal, Montréal, Dec.2-8, 2018, pp.8778-8788.
 I. Loshchilov and F. Hutter, “SGDR: stochastic gradient descent with warm restarts,” arXiv:1608.03983.
 D. P. Kingma and J. L. Ba, “Adam: a method for stochastic optimization,” arXiv:1412.6980.
 I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv:1711.05101.