Graduate Thesis 109552017: Detailed Record




Name: Yu-Sheng Chang (張育陞)
Graduate Department: Department of Computer Science and Information Engineering (In-service Master Program)
Thesis Title: Scene Text Detection Based on Attention ConvNeXt
Related Theses
★ Dynamic Overlay Construction for Mobile Target Detection in Wireless Sensor Networks
★ A Simple Detour Strategy for Vehicle Navigation
★ Improving Localization Using Transmitter-side Voltage
★ Constructing a Virtual Backbone in Vehicular Networks Using Vehicle Classification
★ Why Topology-based Broadcast Algorithms Do Not Work Well in Heterogeneous Wireless Networks?
★ Efficient Wireless Sensor Networks for Mobile Targets
★ A Distributed Articulation-Point-Based Topology Control Method for Wireless Ad Hoc Networks
★ A Review of Existing Web Frameworks
★ A Distributed Algorithm for Partitioning a Sensor Network into Greedy Blocks
★ Range-free Distance Measurement in Wireless Networks
★ Inferring Floor Plan from Trajectories
★ An Indoor Collaborative Pedestrian Dead Reckoning System
★ Dynamic Content Adjustment in Mobile Ad Hoc Networks
★ An Image-based Localization System
★ A Distributed Data Compression and Collection Algorithm for Large-scale Wireless Sensor Networks
★ Collision Analysis in Vehicular WiFi Networks
Files: Full text available through the thesis system after 2026-01-25.
Abstract (Chinese) The wide range of applications of scene text detection has made it a prominent area of research. In real-world scenes, however, scene text detection faces great challenges due to complexities such as diverse backgrounds, varied text styles, irregular arrangements, and image blur. In this study, we propose a scene text detection system. In this system, we introduce ConvNeXt V2 Tiny as the backbone architecture with the aim of improving performance. In addition, we introduce an attention mechanism, incorporate normalization methods, and modify the activation function to improve accuracy and training stability. In our experiments, the system is evaluated on three public datasets, MSRA-TD500, Total-Text, and SCUT-CTW1500, each used to assess the model's performance on a different type of text region. The experimental results show that, compared with the baseline model, our system achieves a significant improvement in performance and outperforms state-of-the-art systems with fewer parameters.
Abstract (English) The widespread applications of scene text detection have propelled it into the spotlight as a prominent area of research. However, scene text detection remains a formidable challenge in real-world scenarios, given the complexities arising from diverse backgrounds, text styles, irregular arrangements, and image blurriness. In this research, we propose a scene text detection system. In this system, we introduce ConvNeXt V2 Tiny as the backbone architecture, with the aim of enhancing performance. Additionally, we introduce attention mechanisms, incorporate normalization methods, and modify activation functions to improve accuracy and training stability. In our experiments, the system is evaluated on three public datasets: MSRA-TD500, Total-Text, and SCUT-CTW1500. Each of these datasets is used to assess the performance of the model on a different type of text region. The experimental results indicate that our system achieves a notable improvement over the baseline model and outperforms state-of-the-art systems with fewer parameters.
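The record itself contains no code, but the abstracts name the system's building blocks: a ConvNeXt V2 Tiny backbone, added attention, extra normalization, and a modified activation function; the table of contents points to the Sigmoid-weighted Linear Unit, SiLU(x) = x · sigmoid(x), and to the Convolutional Block Attention Module (CBAM). Below is a minimal PyTorch sketch of a CBAM-style attention block with SiLU in the channel MLP, purely to illustrate the kind of attention the abstract refers to; the thesis's actual integration points, reduction ratio, and kernel size are not given in this record, so all specifics here are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Weights 'what' to attend to: per-channel gates from pooled statistics."""
    def __init__(self, channels: int, reduction: int = 16):
        # reduction=16 is the CBAM paper's default, not confirmed by this record
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.SiLU(),  # SiLU here is an assumption; the record only says the activation was modified
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))  # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))   # global max pooling branch
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Weights 'where' to attend to: a single-channel spatial gate."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)  # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)   # channel-wise max map
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as in the CBAM paper."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sa(self.ca(x))

if __name__ == "__main__":
    # Hypothetical stage-1 feature map of a ConvNeXt V2 Tiny backbone (96 channels).
    feat = torch.randn(1, 96, 160, 160)
    print(CBAM(96)(feat).shape)  # torch.Size([1, 96, 160, 160])
```

A natural place to attach such a block is after each backbone stage, before the neck of a segmentation-based detector such as FAST, but where the thesis actually inserts it is not recoverable from this record.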
Keywords (Chinese)
★ Scene Text Detection
★ Deep Learning
★ Attention Mechanism
Keywords (English)
★ Scene Text Detection
★ deep learning
★ attention
Table of Contents
1 Introduction
2 Related Work
 2.1 Backbone Architectures
  2.1.1 Convolutional Neural Networks
  2.1.2 Convolutional Neural Networks with Attention Mechanisms
  2.1.3 Vision Transformers
 2.2 Text Detection
  2.2.1 Regression-based Scene Text Detection
  2.2.2 Connected Component-based Scene Text Detection
  2.2.3 Segmentation-based Scene Text Detection
  2.2.4 Transformer-based Scene Text Detection
3 Preliminary
 3.1 Sigmoid-weighted Linear Unit
 3.2 Data Augmentation
 3.3 ConvNeXt V2
 3.4 Weight Standardization and Batch Channel Normalization
 3.5 Convolutional Block Attention Module
 3.6 Asymmetric Convolutional Network
 3.7 Faster Arbitrarily-Shaped Text Detector
4 Design
 4.1 Motivation
 4.2 Problem Statement
 4.3 Proposed System Architecture
  4.3.1 Data Augmentation
  4.3.2 Model
   4.3.2.1 Backbone Part
   4.3.2.2 Detection Part
5 Performance
 5.1 Datasets
 5.2 Evaluation Metrics
 5.3 Experimental Setup
 5.4 Experimental Results and Analysis
  5.4.1 Scenes with long text lines
  5.4.2 Scenes with curved text lines
  5.4.3 Scenes with curved text and small text lines
 5.5 Ablation Studies
6 Conclusions
References
Advisor: Min-Te Sun (孫敏德)    Approval Date: 2024-01-26