References
[1] E. Borovikov, “A survey of modern optical character recognition techniques,” arXiv:1412.4183, 2014. [Online]. Available: https://arxiv.org/abs/1412.4183
[2] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, “PhotoOCR: Reading Text in Uncontrolled Conditions,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2013, pp. 785–792.
[3] X. Chen, L. Jin, Y. Zhu, C. Luo, and T. Wang, “Text recognition in the wild: A survey,” ACM Computing Surveys (CSUR), vol. 54, no. 2, pp. 1–35, 2021.
[4] S. Long, X. He, and C. Yao, “Scene text detection and recognition: The deep learning era,” International Journal of Computer Vision, vol. 129, no. 1, pp. 161–184, 2021.
[5] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao, “Detecting text in natural image with connectionist text proposal network,” in Proc. European Conference on Computer Vision (ECCV), 2016, pp. 56–72.
[6] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, “EAST: An efficient and accurate scene text detector,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2642–2651.
[7] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, “Character region awareness for text detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9365–9374.
[8] B. Shi, X. Bai, and S. Belongie, “Detecting oriented text in natural images by linking segments,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3482–3490.
[9] C. Zhang, B. Liang, Z. Huang, M. En, J. Han, E. Ding, and X. Ding, “Look more than once: An accurate detector for text of arbitrary shapes,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10544–10553.
[10] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao, “TextSnake: A flexible representation for detecting text of arbitrary shapes,” in Proc. European Conference on Computer Vision (ECCV), 2018, pp. 20–36.
[11] E. Xie, Y. Zang, S. Shao, G. Yu, C. Yao, and G. Li, “Scene text detection with supervised pyramid context network,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 1, pp. 9038–9045, 2019.
[12] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, and S. Shao, “Shape robust text detection with progressive scale expansion network,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9328–9337.
[13] W. Wang, E. Xie, X. Song, Y. Zang, W. Wang, T. Lu, G. Yu, and C. Shen, “Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network,” in Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 8439–8448.
[14] M. Liao, Z. Wan, C. Yao, K. Chen, and X. Bai, “Real-time scene text detection with differentiable binarization,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 11474–11481, 2020.
[15] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2298–2304, 2017.
[16] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai, “Robust Scene Text Recognition with Automatic Rectification,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4168–4176.
[17] C.-Y. Lee and S. Osindero, “Recursive Recurrent Nets with Attention Modeling for OCR in the Wild,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2231–2239.
[18] J. Wang and X. Hu, “Gated recurrent convolution neural network for OCR,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2017, pp. 335–344.
[19] W. Liu, C. Chen, K.-Y. K. Wong, Z. Su, and J. Han, “STAR-Net: A SpaTial Attention Residue Network for Scene Text Recognition,” in Proc. British Machine Vision Conference (BMVC), 2016, pp. 43.1–43.13.
[20] F. Borisyuk, A. Gordo, and V. Sivakumar, “Rosetta: Large scale system for text detection and recognition in images,” in Proc. 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 71–79.
[21] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee, “What is wrong with scene text recognition model comparisons? Dataset and model analysis,” in Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 4714–4722.
[22] R. Smith, “An Overview of the Tesseract OCR Engine,” in Proc. Ninth International Conference on Document Analysis and Recognition (ICDAR), 2007, pp. 629–633.
[23] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. International Conference on Learning Representations (ICLR), 2015, pp. 1–14.
[24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[25] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[26] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proc. International Conference on Machine Learning (ICML), 2006, pp. 369–376.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv:1706.03762, 2017. [Online]. Available: https://arxiv.org/abs/1706.03762
[28] J. Lee, S. Park, J. Baek, S. J. Oh, S. Kim, and H. Lee, “On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 2326–2335.
[29] D. Yu, X. Li, C. Zhang, J. Han, J. Liu, and E. Ding, “Towards Accurate Scene Text Recognition With Semantic Reasoning Networks,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 12110–12119.
[30] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv:2010.11929, 2020. [Online]. Available: https://arxiv.org/abs/2010.11929
[31] R. Atienza, “Vision transformer for fast and efficient scene text recognition,” in Proc. International Conference on Document Analysis and Recognition (ICDAR), 2021, pp. 319–334.
[32] Y. Du, Z. Chen, C. Jia, X. Yin, T. Zheng, C. Li, Y. Du, and Y.-G. Jiang, “SVTR: Scene text recognition with a single visual model,” in Proc. Thirty-First International Joint Conference on Artificial Intelligence (IJCAI), 2022, pp. 884–890.
[33] S. Fang, H. Xie, Y. Wang, Z. Mao, and Y. Zhang, “Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 7094–7103.
[34] M. Mora, O. Adelakun, S. Galvan-Cruz, and F. Wang, “Impacts of IDEF0-Based Models on the Usefulness, Learning, and Value Metrics of Scrum and XP Project Management Guides,” Engineering Management Journal, vol. 34, no. 4, pp. 574–590, 2022.
[35] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2017–2025.
[36] G. Xu, Y. Meng, X. Qiu, Z. Yu, and X. Wu, “Sentiment analysis of comment texts based on BiLSTM,” IEEE Access, vol. 7, pp. 51522–51532, 2019.
[37] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019, pp. 4171–4186.
[38] C.-H. Chen, M.-Y. Lin, and X.-C. Guo, “High-level modeling and synthesis of smart sensor networks for Industrial Internet of Things,” Computers & Electrical Engineering, vol. 61, pp. 48–66, 2017.
[39] R. Julius, T. Trenner, A. Fay, J. Neidig, and X. L. Hoang, “A meta-model based environment for GRAFCET specifications,” in Proc. IEEE International Systems Conference (SysCon), 2019, pp. 1–7.
[40] Y.-C. Chen, Y.-C. Chang, Y.-C. Chang, and Y.-R. Yeh, “Traditional Chinese synthetic datasets verified with labeled data for scene text recognition,” arXiv:2111.13327, 2021. [Online]. Available: https://arxiv.org/abs/2111.13327
[41] Y. Sun, Z. Ni, C.-K. Chng, Y. Liu, C. Luo, C. C. Ng, J. Han, E. Ding, J. Liu, D. Karatzas, et al., “ICDAR 2019 Competition on Large-scale Street View Text with Partial Labeling–RRC-LSVT,” in Proc. International Conference on Document Analysis and Recognition (ICDAR), 2019, pp. 1557–1562.