NCU Institutional Repository (中大機構典藏): Item 987654321/93010
RC Version 7.0 © Powered By DSPACE, MIT. Enhanced by NTU Library IR team.


    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/93010


    Title: Traditional Chinese Scene Text Recognition based on Transformer Architecture (基於Transformer架構之繁體中文場景文字辨識系統)
    Authors: Tsai, Wei-Ting (蔡維庭)
    Contributors: Executive Master Program of Computer Science and Information Engineering
    Keywords: Traditional Chinese recognition; Transformer architecture; scene text recognition
    Date: 2023-06-27
    Issue Date: 2024-09-19 16:38:24 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: In the task of text recognition in Traditional Chinese scenes, the system must process both the image and text modalities. Because Traditional Chinese has complex character structures and a large character set, accurate recognition typically demands complex recognition models and system architectures, which in turn require significant computational resources. To enable real-time Traditional Chinese recognition on edge devices with limited hardware, this research proposes a recognition system with a dynamically adjustable architecture. The system consists of a recognition subsystem and a correction subsystem: the recognition subsystem incorporates the lightweight recognition model SVTR, while the correction subsystem is built around a bidirectional cloze language model. The two subsystems are designed on the Transformer encoder and decoder architectures, respectively. Through attention mechanisms and multiple down-sampling operations, the output features attend to information at different scales: local features capture character structure and strokes, while global features capture semantic relations between characters. The model architecture can therefore be simplified, reducing the number of parameters. During training, we separate the gradient propagation of the two models to ensure that each can operate independently.
    In the inference phase, the system adjusts its configuration to the scale of the hardware environment: the recognition subsystem, which has fewer parameters, runs on hardware-limited machines, while the full system including the correction subsystem is deployed on servers with greater computational resources. Experimental results show that the recognition subsystem has a parameter size of only 11.45 MB and achieves 71% accuracy; integrating the correction subsystem improves accuracy to 77%.
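The deployment strategy described in the abstract, loading only the lightweight recognition subsystem on constrained edge devices and the full pipeline on better-provisioned servers, can be sketched as below. This is an illustrative sketch only: the function name, memory threshold, and the correction model's size are assumptions, not figures or code from the thesis (only the 11.45 MB recognizer size and the 71%/77% accuracies are reported).

```python
# Hypothetical sketch of the dynamically adjustable configuration described in
# the abstract: on machines with little memory, deploy only the lightweight
# SVTR-based recognition subsystem; on larger servers, also deploy the
# bidirectional cloze correction subsystem. The correction model size and the
# selection logic are illustrative assumptions.

RECOGNIZER_SIZE_MB = 11.45   # reported parameter size of the recognition subsystem

def select_configuration(available_memory_mb: float,
                         correction_model_size_mb: float = 300.0) -> list:
    """Return the list of subsystems to deploy for a given memory budget."""
    config = []
    if available_memory_mb >= RECOGNIZER_SIZE_MB:
        config.append("recognition")   # needed for any output at all
    if available_memory_mb >= RECOGNIZER_SIZE_MB + correction_model_size_mb:
        config.append("correction")    # raises accuracy from ~71% to ~77%
    return config

# Edge device with a tight budget: recognition only.
print(select_configuration(64.0))     # → ['recognition']
# Server with ample memory: full pipeline.
print(select_configuration(4096.0))   # → ['recognition', 'correction']
```

Because the two subsystems were trained with separated gradient propagation, the recognizer remains a complete, independently usable model in the edge-only configuration; the correction stage is purely additive.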
    Appears in Collections:[Executive Master of Computer Science and Information Engineering] Electronic Thesis & Dissertation

    Files in This Item:

    File: index.html (0 Kb, HTML, View/Open)


    All items in NCUIR are protected by copyright, with all rights reserved.

