Character Segmentation in Scene-Text Images Based on Weakly Supervised Learning

Name

Li-Zhu Chen (陳莉筑)

Department

Department of Computer Science and Information Engineering

Thesis title

Character Segmentation in Scene-Text Images Based on Weakly Supervised Learning
(Original title: 基於弱監督式學習之自然場景文字字元分割)

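Related theses

★ Implementation of a Qt-Based Cross-Platform Wireless Heart-Rate Analysis System
★ An Extra-Message Transmission Mechanism for VoIP
★ Detection of Transition Effects Associated with Highlights in Sports Videos
★ Vector-Quantization-Based Video/Image Content Authentication
★ A Baseball Highlight Extraction System Based on Transition-Effect Detection and Content Analysis
★ Image and Video Content Authentication Based on Visual Feature Extraction
★ Detecting and Tracking Foreground Objects in Moving Surveillance Footage Using Dynamic Background Compensation
★ Adaptive Digital Watermarking for H.264/AVC Video Content Authentication
★ A Baseball Highlight Clip Extraction and Classification System
★ A Real-Time Multi-Camera Tracking System Using H.264/AVC Features
★ A Highway Preceding-Vehicle Detection Mechanism Using the Implicit Shape Model
★ A Video Copy Detection Mechanism Based on Temporal- and Spatial-Domain Feature Extraction
★ Driving-Video Coding Combining Digital Watermarking and Region-of-Interest Bit-Rate Control
★ An H.264/AVC Video Encryption/Decryption and Digital Watermarking Mechanism for Digital Rights Management
★ A News Video Analysis System Based on Text and Anchorperson Detection
★ A Digital-Watermarking-Based H.264/AVC Video Content Verification Mechanism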

Files

Full text viewable in the system (open after 2025-7-25)

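Abstract (Chinese)

In recent years, deep-learning-based research on natural scene-text detection has flourished, generally targeting word-level detection with good results. However, fonts vary widely, the backgrounds of test images are increasingly complex, and text may be blocked by occluding objects. Especially when scene text runs in diverse orientations, accurate word detection is hard to achieve, which also hurts the accuracy of the subsequent text recognition stage. This study proposes a pixel-level character detection network that attempts to solve the difficulty of detecting irregularly oriented words by detecting characters instead. Character detection lets the detection boxes fit the text edges more tightly, reducing the influence of complex backgrounds on the detection network; subsequent text recognition may then be able to use a lighter-weight recognition network, reducing the resources and time needed for training. The main challenge of character detection is that existing scene-text detection datasets all use word-level annotations, because manual character-level annotation is time-consuming and laborious. We address the lack of character annotations in the training set by generating a large amount of synthetic data close to real scenes, and we combine this with weakly supervised learning to train the model on real images that carry word-level annotations. For these real data without character annotations, we iteratively update the results so that the network automatically learns to detect more reliable character positions, improving model performance. In addition, to cope with the lack of character-annotated test data, we propose a new evaluation method for character detection. Experimental results show that our method outperforms other character detection models on the ICDAR2017, Total-Text, and CTW-1500 datasets. We also apply the same approach to training Chinese character detection to verify the feasibility of the proposed method for content in other languages.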

Abstract (English)

Deep-learning-based scene-text detection has advanced rapidly in recent years. Most work targets word-level detection and has produced promising results. However, fonts vary widely, the backgrounds of test images tend to be complex, and text may be partly occluded. When scene text also appears in diverse orientations, accurate word-level detection becomes difficult, which in turn degrades the accuracy of subsequent text recognition. To address the difficulty of detecting irregularly oriented words, this thesis proposes a pixel-level character detection network. Detecting individual characters lets the detection boxes adhere more closely to the text boundaries, reducing the negative influence of complex backgrounds on the detection network. Subsequent text recognition may then use a lighter-weight recognition network, reducing the resources and time required for training. The main challenge in character detection is that existing scene-text detection datasets provide word-level annotations only, because character-level annotation is laborious and time-consuming. To compensate for the missing character annotations, we generate a large volume of synthetic data that closely resembles real-world scenes, and we use weakly supervised learning to train on real images that carry only word-level annotations. For these real images, we iteratively update the detection results so that the network automatically learns more reliable character positions, which improves the accuracy of the model. Because test datasets with character-level annotations are also lacking, we propose a new evaluation method for character detection. Experimental results show that our method outperforms other character detection models on the ICDAR2017, Total-Text, and CTW-1500 datasets. We also apply the same approach to train a character detector for another language (Chinese) to validate the feasibility of the proposed method.
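
The abstract describes the weakly supervised step only at a high level: real images carry word-level labels, and character positions are refined iteratively. As a rough illustration of that kind of update, the sketch below shows one common pattern, not the thesis's actual procedure. It assumes axis-aligned boxes, horizontal words, a transcription whose length gives the expected character count, and a hypothetical confidence threshold `min_score`; all names (`WordLabel`, `split_evenly`, `update_pseudo_labels`) are invented for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Axis-aligned box (x_min, y_min, x_max, y_max). A simplifying assumption:
# real scene text is often rotated or curved.
Box = Tuple[float, float, float, float]


@dataclass
class WordLabel:
    box: Box   # word-level box taken from the real dataset's annotation
    text: str  # transcription; its length is the expected character count


def split_evenly(word: WordLabel) -> List[Box]:
    """Fallback guess: cut the word box into len(text) equal-width boxes."""
    n = len(word.text)
    if n == 0:
        return []
    x0, y0, x1, y1 = word.box
    step = (x1 - x0) / n
    return [(x0 + i * step, y0, x0 + (i + 1) * step, y1) for i in range(n)]


def center_inside(char_box: Box, word_box: Box) -> bool:
    """True if the centre of char_box lies within word_box."""
    cx = (char_box[0] + char_box[2]) / 2
    cy = (char_box[1] + char_box[3]) / 2
    return word_box[0] <= cx <= word_box[2] and word_box[1] <= cy <= word_box[3]


def update_pseudo_labels(
    word: WordLabel,
    predictions: List[Tuple[Box, float]],
    min_score: float = 0.5,  # hypothetical threshold, not from the thesis
) -> Tuple[List[Box], bool]:
    """One update round for one word.

    Keeps the model's character boxes when their count matches the
    transcription length; otherwise falls back to an even split.
    Returns (character boxes, whether they came from the model).
    """
    kept = [
        box
        for box, score in predictions
        if score >= min_score and center_inside(box, word.box)
    ]
    if word.text and len(kept) == len(word.text):
        return sorted(kept, key=lambda b: b[0]), True
    return split_evenly(word), False
```

In a training loop, a function like `update_pseudo_labels` would be called for every annotated word after each round of prediction, so that the character labels used in the next round come from the model wherever its output is consistent with the word-level annotation.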

Keywords (Chinese)

★ Deep learning
★ Semantic segmentation
★ Arbitrarily oriented text localization
★ Weakly supervised learning

Keywords (English)

★ Deep learning
★ Semantic segmentation
★ Arbitrary-orientation text localization
★ Weakly supervised learning

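Table of Contents

Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1. Introduction
1.1. Motivation
1.2. Contributions
1.3. Thesis Organization
Chapter 2. Related Work
2.1. Traditional Text Detection
2.2. Deep Learning Methods
2.2.1. Object Detection
2.2.2. Semantic Segmentation
2.3. Text Detection Tasks
2.3.1. Word Detection
2.3.2. Character Detection
2.4. Weakly Supervised Learning
2.5. Datasets
Chapter 3. Proposed Method
3.1. Datasets and Annotation
3.1.1. Generating the Synthetic Dataset
3.1.2. Annotation Scheme
3.2. Network Architecture
3.2.1. Backbone
3.2.2. Pipeline
3.3. Weakly Supervised Learning Method
3.3.1. Pseudo-Label Update Procedure
3.3.2. Simulating Character Positions
3.4. Loss Function
3.5. Post-processing
Chapter 4. Experimental Results
4.1. Training Details
4.2. Evaluation Method
4.3. Results on Validation Datasets
4.3.1. ICDAR2017
4.3.2. Total-Text
4.3.3. CTW-1500
4.3.4. Taiwan Street-View Image Dataset
4.3.5. Comparison of Different Training-Image Ratios
4.4. Result
Chapter 5. Conclusion and Future Work
5.1. Conclusion
5.2. Future Work
References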

Advisor

Po-Chyi Su (蘇柏齊)

Approval date

2023-7-25
