基於注意力殘差網路之繁體中文街景文字辨識;Traditional Chinese Scene Text Recognition based on Attention-Residual Network


Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/83789

Title:	基於注意力殘差網路之繁體中文街景文字辨識;Traditional Chinese Scene Text Recognition based on Attention-Residual Network
Authors:	蘇冠宇;Su, Kung-Yu
Contributors:	Graduate Institute of Software Engineering
Keywords:	computer vision;deep learning;scene text detection;Traditional Chinese character recognition;scene text recognition;synthetic data
Date:	2020-07-29
Issue Date:	2020-09-02 17:06:28 (UTC+8)
Publisher:	National Central University (國立中央大學)
Abstract:	Text in natural scenes, especially street-view signboards, often conveys rich information, and recognizing it automatically would benefit many applications. Although optical character recognition of scanned documents is mature, scene text recognition remains challenging because of varied fonts and text sizes, inconsistent lighting, differing text orientations, background noise, camera angle, and image distortion. This research develops a Traditional Chinese text recognition scheme for street scenes based on deep learning. Constructing a suitable training dataset is an essential step that significantly affects recognition performance, yet Traditional Chinese training data is scarce, and the very large character set makes it hard to collect photos evenly across characters, so even a sufficiently large collection suffers from data imbalance. The proposed scheme therefore builds a synthetic, automatically labeled dataset from several Traditional Chinese fonts, which simulates the complex text variation seen on streets and avoids manual labeling errors. To bring the synthetic images closer to real street text and make the model more reliable, the training data is diversified by adjusting brightness, applying geometric transformations, and adding outlines to characters. Detection and recognition follow a two-stage approach. First, a DeepLab model locates single characters and text lines by semantic segmentation, and a Spatial Transformer Network (STN) rectifies the skewed characters found in the detection stage to ease feature extraction during recognition. Next, a ResNet50 modified with an attention mechanism (the proposed attention-residual network) improves accuracy in this large-scale classification task. Finally, the recognized characters are combined using the detected text lines, then verified and corrected by cross-checking the user's GPS location against place information from the Google Place API. Experimental results show that the proposed scheme effectively detects and recognizes Traditional Chinese text in street scenes and outperforms Line OCR and Google Vision on complex street scenes.
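The abstract does not say how recognized text is matched against place information, so the following is only a minimal sketch of the final correction step under stated assumptions. It assumes the comparison is a plain string-similarity match using Python's standard `difflib`; `correct_with_places`, `nearby_places`, and the `min_ratio` threshold are hypothetical names and values chosen for illustration, and the candidate list stands in for place names that a location lookup might return.

```python
from difflib import SequenceMatcher


def correct_with_places(recognized, nearby_places, min_ratio=0.6):
    """Return the nearby place name most similar to the recognized text.

    Hypothetical helper: `nearby_places` is assumed to be a list of place
    names near the user's location. If no candidate reaches `min_ratio`
    (an arbitrary illustrative threshold), the recognized text is returned
    unchanged.
    """
    best_name, best_ratio = recognized, 0.0
    for name in nearby_places:
        # ratio() is 2 * matched_chars / total_chars, in the range [0, 1].
        ratio = SequenceMatcher(None, recognized, name).ratio()
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name if best_ratio >= min_ratio else recognized


# Example: one misrecognized character in a five-character shop name.
candidates = ["星巴克咖啡", "麥當勞", "全家便利商店"]
corrected = correct_with_places("星巴克咖排", candidates)
```

A threshold keeps the step conservative: text that resembles no nearby place is left as the recognizer produced it rather than being forced onto an unrelated name.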
Appears in Collections:	[Software Engineer] Electronic Thesis & Dissertation

