Master's/Doctoral Thesis 111522030: Detailed Record




Name  Meng-Chieh Lee (李孟潔)    Graduate Department  Computer Science and Information Engineering
Thesis Title  Scene-Text Segmentation and Recognition via Character Spacing Detection
(Chinese title: 基於字元間隙偵測之自然場景文字分割與辨識)
Related Theses
★ Implementation of a Qt-Based Cross-Platform Wireless Heart Rate Analysis System
★ A Mechanism for Transmitting Additional Messages over VoIP
★ Detection of Transition Effects Related to Sports Highlights
★ Video/Image Content Authentication Based on Vector Quantization
★ A Baseball Highlight Extraction System Based on Transition-Effect Detection and Content Analysis
★ Image and Video Content Authentication Based on Visual Feature Extraction
★ Foreground Object Detection and Tracking in Moving Surveillance Video Using Dynamic Background Compensation
★ Adaptive Digital Watermarking for H.264/AVC Video Content Authentication
★ A Baseball Highlight Extraction and Classification System
★ A Real-Time Multi-Camera Tracking System Using H.264/AVC Features
★ Highway Preceding-Vehicle Detection Using Implicit Shape Models
★ Video Copy Detection Based on Temporal and Spatial Feature Extraction
★ In-Vehicle Video Coding Combining Digital Watermarking and Region-of-Interest Bit-Rate Control
★ H.264/AVC Video Encryption/Decryption and Digital Watermarking for Digital Rights Management
★ A News Video Analysis System Based on Text and Anchorperson Detection
★ H.264/AVC Video Content Authentication Based on Digital Watermarking
Files  View the thesis in the system (full text available after 2026-08-06)
Abstract (Chinese)  Scene text includes street signs, shop signboards, notice boards, and product packaging; reliably detecting and recognizing such text enables a variety of promising applications. Scene text may appear in cluttered street views or on uneven surfaces and is easily affected by lighting changes, reflections, perspective distortion, or occlusion, so accurately detecting and recognizing text in natural scene images is not easy. Mainstream approaches employ deep learning models with word-level annotations to drive subsequent word segmentation, text detection, and recognition; they usually require more data and larger models to cope with the diversity of words. Moreover, the frequent co-occurrence of different scripts makes annotation and recognition harder. Considering model training cost and the need for multilingual text detection, this study proposes a text detection model that targets character gaps to help localize multilingual characters in natural scenes: character centers are determined from the detected gaps, a nearest-neighbor algorithm then draws character bounding boxes, and a lighter-weight model can perform character recognition within those boxes. The main obstacle to detecting character gaps is that most existing datasets are annotated at the word level and lack character or character-gap labels. This study therefore first generates a synthetic dataset that approximates natural scenes and contains both character bounding boxes and character-gap bounding boxes, and then applies weakly supervised learning on real datasets with word-level labels to adapt the model, so that fine-tuning and iterative updates let it localize character gaps, and hence character positions, more accurately. Experimental results show that locating character centers by detecting character gaps is feasible on multilingual text datasets.
Abstract (English)  Scene text refers to text appearing on street signs, shop signs, notices, product packaging, and the like; reliably detecting and recognizing it benefits a variety of potential applications. Text in natural scenes may appear in complex street views or on uneven backgrounds, and its detection and recognition are easily affected by lighting changes, reflections, perspective distortion, or occlusion. Common approaches adopt deep learning models with word-level labels to facilitate subsequent word segmentation, text detection, and recognition; they usually require more data and larger models to handle the diversity of words. In addition, multilingual text appears quite often, and labeling it in a unified manner is not trivial.
Considering the cost of model training and the need to detect multilingual text, this study proposes using character gaps, or spacings, as detection targets to assist the segmentation of multilingual characters. Character gaps are detected to locate character centers, a nearest-neighbor algorithm then draws character bounding boxes, and a lighter model can perform single-character recognition within them. The challenge in detecting character gaps is that most current datasets are labeled at the word level and lack labels for characters or character gaps. We therefore build a synthetic image dataset that mimics natural scenes and contains character bounding boxes and character-gap boxes. Combined with weakly supervised learning on real datasets with word labels, this approach allows the model to be fine-tuned and iteratively updated to locate character gaps more accurately. Experimental results show that detecting character gaps to locate characters is feasible on multilingual datasets.
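To make the gap-to-box pipeline concrete, the following is a minimal Python sketch of the post-processing the abstract describes: character centers are taken as midpoints between consecutive detected gaps, and each character box extends halfway toward its nearest neighboring center. The full thesis is embargoed, so the function names and the 1-D simplification here are hypothetical illustrations, not the author's implementation, which operates on 2-D segmentation maps.

    import numpy as np

    def centers_from_gaps(gap_xs):
        # Character centers as midpoints between consecutive gap positions
        # along one text line; the word's left and right edges count as gaps.
        gap_xs = np.sort(np.asarray(gap_xs, dtype=float))
        return (gap_xs[:-1] + gap_xs[1:]) / 2.0

    def boxes_from_centers(centers, top, bottom):
        # Nearest-neighbor box drawing: each character box spans halfway
        # toward the closest neighboring center on either side.
        boxes = []
        for i, c in enumerate(centers):
            others = np.delete(centers, i)
            half = np.min(np.abs(others - c)) / 2.0 if others.size else 1.0
            boxes.append((c - half, top, c + half, bottom))
        return boxes

    # usage: gaps detected at x = 0, 24, 52, 75, 100 on a line with y in [10, 40]
    centers = centers_from_gaps([0, 24, 52, 75, 100])   # -> [12. 38. 63.5 87.5]
    print(boxes_from_centers(centers, 10.0, 40.0))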
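The weakly supervised update the abstract outlines, in which the current model's gap predictions on word-labeled real images are kept as pseudo-labels only when they look consistent with the word annotation, could be sketched as below. The stand-in linear "model", the rising-edge gap counter, and the count-ratio confidence score are assumptions made for illustration; only the accept-and-retrain loop mirrors the strategy the abstract describes.

    import numpy as np

    rng = np.random.default_rng(0)

    def predict_gap_profile(patch, weights):
        # Stand-in "model": per-column gap probability from a linear scorer.
        # The real model is a deep segmentation network; this stub only
        # makes the loop below executable.
        return 1.0 / (1.0 + np.exp(-(patch @ weights)))

    def pseudo_label_round(patches, char_counts, weights, conf=0.6):
        # One refresh round: predict gap profiles on word-level crops and
        # keep a prediction as a pseudo-label only if the character count
        # it implies is close to the count given by the word annotation.
        accepted = []
        for patch, n_true in zip(patches, char_counts):
            prof = predict_gap_profile(patch, weights)
            gaps = int(np.sum((prof[1:] >= 0.5) & (prof[:-1] < 0.5)))
            n_pred = gaps + 1            # k interior gaps -> k + 1 characters
            score = min(n_pred, n_true) / max(n_pred, n_true)
            if score >= conf:
                accepted.append((patch, (prof >= 0.5).astype(float), score))
        return accepted   # fed back as training targets, weighted by score

    # toy usage: three word crops, 32 columns x 4 features each
    patches = [rng.normal(size=(32, 4)) for _ in range(3)]
    print(len(pseudo_label_round(patches, [4, 5, 3], rng.normal(size=4))))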
Keywords (Chinese) ★ 深度學習 (deep learning)
★ 語義分割 (semantic segmentation)
★ 自然場景文字定位 (scene text localization)
★ 多國語言文字定位 (multilingual text localization)
★ 字元辨識 (character recognition)
★ 弱監督式學習 (weakly supervised learning)
Keywords (English) ★ Deep learning
★ semantic segmentation
★ scene text localization
★ multilingual text localization
★ character recognition
★ weakly supervised learning
Table of Contents  Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
1.1. Research Motivation
1.2. Contributions
1.3. Thesis Organization
2. Related Work
2.1. Object Detection Methods
2.2. Semantic Segmentation Methods
2.3. Character Detection
2.4. Weakly Supervised Learning
2.5. Datasets
3. Proposed Method
3.1. Datasets and Annotation
3.1.1. Synthetic Dataset
3.1.2. Annotation Scheme
3.2. Network Architecture
3.2.1. Backbone
3.2.2. Pipeline
3.3. Weakly Supervised Learning Method
3.3.1. Pseudo-Label Update Strategy
3.3.2. Simulating Character Gaps
3.4. Loss Function
3.5. Post-Processing
3.5.1. Locating Character Centers and Segmenting Characters
3.5.2. Multi-Scale Trials
4. Experimental Results
4.1. Development Environment
4.2. Training Details
4.3. Evaluation Methods
4.3.1. Alphanumeric Character Recognition
4.3.2. Character Count Evaluation
4.4. Results on Validation Datasets
4.4.1. ICDAR2017: Alphanumeric
4.4.2. ICDAR2017: Multilingual
4.4.3. Total-Text
4.5. Multi-Scale Ablation Study
4.6. Character Segmentation Results
5. Conclusion and Future Work
5.1. Conclusion
5.2. Future Work
References
Advisor  Po-Chyi Su (蘇柏齊)    Date of Approval  2024-08-07