Multi-Scale Feature Fusion Network Combined with Self-Attention Module for Scene Text Detection

Name

Li-Chun Ho (何立群)

Department

Department of Computer Science and Information Engineering

Thesis title

Multi-Scale Feature Fusion Network Combined with Self-Attention Module for Scene Text Detection
(Original title: 結合自我注意力模塊的多尺度特徵融合網路用於場景文字偵測)

Files

Full text available for viewing in the system after 2026-8-1.

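Abstract (translated from the Chinese)

Scene text detection has seen breakthrough progress in recent years and has many different applications, such as text detection in documents and license plate recognition in parking lots. However, many problems remain in detecting arbitrarily shaped scene text such as signboards and notice boards; for example, many methods cannot completely mark curved text, nor can they effectively separate adjacent text. We therefore propose a more effective model that fuses and uses features more effectively and detects scene text of arbitrary shape. Our predictions are based on the central region of the text, and post-processing expands the predicted probability map to obtain the full text region. We propose a Multi-Scale Feature Fusion Network to extract and fuse features more effectively. It contains the Multi-Scale Attention Module (MSAM), which incorporates the Self-Attention Module (SAM) and refines features more effectively, and a Self-Attention Head (SAH) finally predicts the text probability map. Experiments confirm the effectiveness of this method, which obtains an F-score of 87.4 on the Total-Text dataset.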

Abstract (English)

Research on scene text detection has made breakthroughs in recent years and has many applications, such as document text detection and license plate recognition in parking lots. However, detecting arbitrarily shaped scene text, such as text on signboards and billboards, remains problematic: many methods cannot fully mark curved text, nor can they effectively separate adjacent text instances. We therefore propose a more effective model that fuses and utilizes features better and detects scene text of arbitrary shape. The model predicts the central region of each text instance, and post-processing expands the predicted probability map to recover the entire text region. We propose a Multi-Scale Feature Fusion Network to extract and fuse features more effectively; it includes Multi-Scale Attention Modules (MSAMs) combined with Self-Attention Modules (SAMs), which refine the features. Finally, a Self-Attention Head (SAH) predicts the text probability map. Experiments confirm the effectiveness of the method, which achieves an F-score of 87.4 on the Total-Text dataset.
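
As a rough illustration of the post-processing idea described in the abstract (predict a central region of the text, then expand it to cover the whole text region), the sketch below thresholds a probability map and grows the resulting mask with a plain 8-neighbor dilation. This is an assumption made for illustration only: the abstract does not say which expansion algorithm the thesis uses, and the function name `expand_center_region`, the threshold of 0.5, and the iteration count are all hypothetical.

```python
def expand_center_region(prob_map, threshold=0.5, iterations=1):
    """Binarize a center-region probability map and grow it outward.

    Hypothetical helper: a simple 8-neighbor dilation stands in for
    whatever expansion step the thesis actually applies.
    """
    height, width = len(prob_map), len(prob_map[0])
    # Step 1: keep pixels whose predicted probability passes the threshold.
    mask = [[p >= threshold for p in row] for row in prob_map]
    # Step 2: grow the mask by one pixel in every direction per iteration.
    for _ in range(iterations):
        grown = [[False] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                if not mask[y][x]:
                    continue
                for ny in range(max(0, y - 1), min(height, y + 2)):
                    for nx in range(max(0, x - 1), min(width, x + 2)):
                        grown[ny][nx] = True
        mask = grown
    return mask


# Toy 7x7 map with a thin horizontal "text center" of three pixels.
prob = [[0.0] * 7 for _ in range(7)]
for col in (2, 3, 4):
    prob[3][col] = 0.9

region = expand_center_region(prob, threshold=0.5, iterations=1)
print(sum(map(sum, region)))  # 15: the 1x3 center grows to a 3x5 block
```

Predicting a shrunken center and expanding it afterwards is one way to keep adjacent text instances apart, since their centers do not touch even when the full regions do. In practice the amount of expansion would need to match how the center regions were shrunk during training.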

Keywords (Chinese)

★ Self-Attention Module (自我注意力模塊)
★ Multi-Scale Network (多尺度網路)
★ Scene Text Detection (場景文字偵測)

Keywords (English)

★ Self-Attention Module
★ Multi-Scale Network
★ Scene Text Detection

Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgements
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Scene Text Detection Models
2.1.1 Bounding-Box-Based Detection Methods
2.1.2 Pixel-Level Detection Methods
2.1.3 Hybrid Prediction Methods
2.1.4 Kernel-Based Detection Methods
2.2 Backbone Networks
2.3 Self-Attention Mechanism
2.4 Multi-Scale Networks
2.5 Lightweight Methods
Chapter 3 Method and Architecture
3.1 Model Architecture
3.2 Backbone
3.3 Multi-Scale Feature Fusion Network
3.3.1 Multi-Scale Attention Module V1 (MSAM V1)
3.3.2 Multi-Scale Attention Module V2 (MSAM V2)
3.3.3 Convolution Block
3.4 Self-Attention Head (SAH)
3.5 Loss Functions and Post Processing
Chapter 4 Experimental Results
4.1 Datasets and Evaluation Methods
4.2 Development Environment
4.3 Ablation Study
4.4 Experimental Data
Chapter 5 Conclusion and Future Work
References

Advisors

Kuo-Chin Fan (范國清), Jun-Wei Hsieh (謝君偉)

Approval date

2022-7-28
