References
[1] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. of Neural Information Processing Systems (NIPS), Harrahs and Harveys, Lake Tahoe, NV, Dec. 3-8, 2012, pp. 1106-1114.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv:1706.03762.
[3] Z. Dai, H. Liu, Q. V. Le, and M. Tan, “CoAtNet: marrying convolution and attention for all data sizes,” arXiv:2106.04803.
[4] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861.
[5] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-excitation networks,” arXiv:1709.01507v4.
[6] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, Jul. 21-26, 2017, pp. 2117-2125.
[7] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556.
[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv:1409.4842.
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385.
[10] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” arXiv:1608.06993.
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: transformers for image recognition at scale,” arXiv:2010.11929.
[12] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, “CvT: introducing convolutions to vision transformers,” arXiv:2103.15808.
[13] L. Yuan, Q. Hou, Z. Jiang, J. Feng, and S. Yan, “VOLO: vision outlooker for visual recognition,” arXiv:2106.13112.
[14] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “CBAM: convolutional block attention module,” arXiv:1807.06521v2.
[15] Y. Liu, Z. Shao, Y. Teng, and N. Hoffmann, “NAM: normalization-based attention module,” arXiv:2111.12419v1.
[16] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: inverted residuals and linear bottlenecks,” arXiv:1801.04381v4.
[17] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, “Revisiting unreasonable effectiveness of data in deep learning era,” in Proc. of IEEE Int. Conf. on Computer Vision (ICCV), Venice, Italy, Oct. 22-29, 2017, pp. 843-852.
[18] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv:1606.08415v4.
[19] A. F. Agarap, “Deep learning using rectified linear units (ReLU),” arXiv:1803.08375v2.
[20] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam, “Searching for MobileNetV3,” arXiv:1905.02244.
[21] M. Zeiler, D. Krishnan, G. Taylor, and R. Fergus, “Deconvolutional networks,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, Jun. 13-18, 2010, pp. 2528-2535.
[22] Z. Zhang and M. R. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” in Proc. of Neural Information Processing Systems (NeurIPS), Palais des Congrès de Montréal, Montréal, Canada, Dec. 2-8, 2018, pp. 8778-8788.
[23] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” arXiv:1512.04150.
[24] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: visual explanations from deep networks via gradient-based localization,” arXiv:1610.02391.
[25] K. Zuiderveld, “Contrast limited adaptive histogram equalization,” in Graphics Gems IV, Academic Press, Amsterdam, 1994, Ch. 5, pp. 474-485.
[26] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv:1711.05101v3.
[27] D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv:1412.6980v9.
[28] M. J. Zhao, N. Edakunni, A. Pocock, and G. Brown, “Beyond Fano’s inequality: bounds on the optimal F-score, BER, and cost-sensitive risk and their implications,” Journal of Machine Learning Research, vol. 14, pp. 1033-1090, 2013.
[29] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “ImageNet: a large-scale hierarchical image database,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Miami, FL, Jun. 20-25, 2009, pp. 248-255.
[30] T. Ridnik, E. B. Baruch, A. Noy, and L. Zelnik-Manor, “ImageNet-21K pretraining for the masses,” arXiv:2104.10972.
[31] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” arXiv:1803.02155.
[32] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-XL: attentive language models beyond a fixed-length context,” arXiv:1901.02860.
[33] P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens, “Stand-alone self-attention in vision models,” arXiv:1906.05909.
[34] Y.-H. H. Tsai, S. Bai, M. Yamada, L.-P. Morency, and R. Salakhutdinov, “Transformer dissection: a unified understanding of transformer’s attention via the lens of kernel,” arXiv:1908.11775.
[35] B. Graham, A. El-Nouby, H. Touvron, P. Stock, A. Joulin, H. Jégou, and M. Douze, “LeViT: a vision transformer in ConvNet’s clothing for faster inference,” arXiv:2104.01136.
[36] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F. E. H. Tay, J. Feng, and S. Yan, “Tokens-to-token ViT: training vision transformers from scratch on ImageNet,” arXiv:2101.11986.
[37] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” arXiv:1810.04805.
[38] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, and A. Askell, “Language models are few-shot learners,” arXiv:2005.14165.