Master's/Doctoral Thesis 107525010: Detailed Record




Name: Kung-Yu Su (蘇冠宇)   Department: Graduate Institute of Software Engineering
Thesis title: Traditional Chinese Scene Text Recognition Based on an Attention-Residual Network
(基於注意力殘差網路之繁體中文街景文字辨識)
Related theses
★ Implementation of a Cross-Platform Wireless Heart-Rate Analysis System Based on Qt
★ An Extra-Message Transmission Mechanism for VoIP
★ Detection of Transition Effects Related to Highlights in Sports Videos
★ Video/Image Content Authentication Based on Vector Quantization
★ A Baseball Highlight Extraction System Based on Transition-Effect Detection and Content Analysis
★ Image/Video Content Authentication Based on Visual Feature Extraction
★ Detecting and Tracking Foreground Objects in Moving Surveillance Videos Using Dynamic Background Compensation
★ Adaptive Digital Watermarking for H.264/AVC Video Content Authentication
★ A Baseball Highlight Extraction and Classification System
★ A Real-Time Multi-Camera Tracking System Using H.264/AVC Features
★ Preceding-Vehicle Detection on Highways Using Implicit Shape Models
★ Video Copy Detection Based on Temporal and Spatial Feature Extraction
★ Vehicular Video Coding Combining Digital Watermarking and Region-of-Interest Rate Control
★ H.264/AVC Video Encryption/Decryption and Digital Watermarking for Digital Rights Management
★ A News Video Analysis System Based on Text and Anchorperson Detection
★ H.264/AVC Video Content Authentication Based on Digital Watermarking
Access rights:
  1. The author has agreed to release the electronic full text immediately.
  2. The released electronic full text is licensed only for retrieval, reading, and printing by individual users for non-profit academic research.
  3. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.

Abstract (Chinese) Text on street-view signboards often conveys rich information, and recognizing it with visual techniques would benefit many applications. Although optical character recognition for scanned documents is quite mature, scene text recognition remains a very challenging task. Beyond the greater variety of fonts, character sizes, and shooting angles, Traditional Chinese training data are still scarce: it is difficult to collect photographs evenly across the many Chinese characters, and even a sufficiently large collection would face a data imbalance problem. This research therefore renders high-quality training images with automatic labels from several Traditional Chinese fonts, simulating the complex text variations seen in street views while avoiding the errors that manual labeling may introduce. We also investigate how to make the synthesized character images closer to real street text, diversifying the training data through brightness adjustment, geometric transformations, and added character outlines to strengthen model robustness. Detection and recognition follow a two-stage pipeline. First, a DeepLab model locates the regions of individual characters and text lines in the street view via semantic segmentation; a Spatial Transformer Network (STN) then rectifies the skewed characters framed in the detection stage to ease feature extraction in the recognition stage. We modify the ResNet50 model with an attention mechanism to improve its accuracy on this large-scale classification task. Finally, the model's output text is cross-checked against place information from the Google Place API using the user's GPS position, verifying and correcting the recognized characters. Experimental results show that the proposed scheme effectively detects and recognizes Traditional Chinese street-view text and outperforms Line OCR and Google Vision on complex street scenes.
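The augmentation described above (brightness adjustment, geometric transformation, and added character outlines) can be sketched in a minimal, framework-free form. The value ranges, the nearest-neighbour shear, and the mid-grey contour below are illustrative assumptions, not the thesis's exact settings:

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Brightness jitter plus a random horizontal shear for one glyph
    image (H x W, grayscale, values in [0, 1])."""
    # Photometric: scale all pixels by a random brightness factor.
    out = np.clip(img * rng.uniform(0.6, 1.4), 0.0, 1.0)
    # Geometric: shift each row sideways in proportion to its distance
    # from the vertical centre (a nearest-neighbour shear).
    h, w = out.shape
    shear = rng.uniform(-0.3, 0.3)
    sheared = np.zeros_like(out)
    for y in range(h):
        offset = int(round(shear * (y - h / 2)))
        src = np.clip(np.arange(w) - offset, 0, w - 1)
        sheared[y] = out[y, src]
    return sheared

def add_outline(glyph: np.ndarray, thickness: int = 1) -> np.ndarray:
    """Draw a contour around the glyph by dilating its binary mask and
    painting the ring that the dilation adds."""
    mask = glyph > 0.5
    dilated = mask.copy()
    for _ in range(thickness):
        grown = dilated.copy()
        grown[1:, :] |= dilated[:-1, :]
        grown[:-1, :] |= dilated[1:, :]
        grown[:, 1:] |= dilated[:, :-1]
        grown[:, :-1] |= dilated[:, 1:]
        dilated = grown
    ring = dilated & ~mask
    out = glyph.copy()
    out[ring] = 0.5  # mid-grey contour pixels around the strokes
    return out
```

Applying several such randomized transforms to each rendered glyph yields many distinct training samples per character, which is one way to mitigate the data imbalance noted above.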
Abstract (English) Text in natural scenes, especially street views, usually carries rich information related to the images. Although recognition of scanned documents has been well studied, scene text recognition is still a challenging task due to variable fonts, inconsistent lighting, different text orientations, background noise, camera shooting angles, and possible image distortions. This research aims at developing an effective Traditional Chinese recognition scheme for street views based on deep learning techniques. Constructing a suitable training dataset is an essential step that significantly affects recognition performance; the large alphabet of Chinese characters is a particular issue, since collecting corresponding images for every character leads to the so-called data imbalance problem. In the proposed scheme, a synthetic dataset with automatic labeling is constructed using several fonts and data augmentation. In an investigated image, the potential regions of characters and text lines are first located. The possibly skewed single-character crops are then rectified by a spatial transformer network to improve subsequent recognition. Next, the proposed attention-residual network improves recognition accuracy in this large-scale classification task. Finally, the recognized characters are combined along the detected text lines and corrected with place information from the Google Place API based on the user's location. Experimental results show that the proposed scheme correctly extracts text from the investigated images and outperforms Line OCR and Google Vision in complex street scenes.
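The attention-residual idea mentioned above can be illustrated with a minimal channel-attention residual unit in the spirit of squeeze-and-excitation [30]. The shapes, the single bottleneck MLP, and using the input itself as the residual branch are illustrative assumptions, not the thesis's exact ResNet50 modification:

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention_residual(x: np.ndarray,
                               w1: np.ndarray,
                               w2: np.ndarray) -> np.ndarray:
    """One residual unit whose branch is re-weighted per channel.
    x: feature map of shape (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    # Squeeze: global average pooling collapses the spatial dimensions.
    z = x.mean(axis=(1, 2))                       # (C,)
    # Excite: bottleneck MLP with ReLU, then a sigmoid gate per channel.
    gate = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))  # (C,), values in (0, 1)
    # Scale the branch by its gate and add the identity shortcut.
    # (Here the branch is x itself; in a real ResNet block it would be
    # the conv-BN-ReLU output computed from x.)
    return x + gate[:, None, None] * x
```

In a full network such a gate would sit inside each residual block, letting the model emphasize the feature channels most discriminative among the thousands of Traditional Chinese character classes.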
Keywords (Chinese) ★ computer vision
★ deep learning
★ street-view text detection
★ Traditional Chinese character recognition
Keywords (English) ★ scene text recognition
★ scene text detection
★ synthetic data
Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Thesis Organization
Chapter 2 Related Work
2.1 Deep Learning Networks
2.2 Deep Learning for Text Detection and Recognition
2.3 Synthetic Training Sets
Chapter 3 Proposed Method
3.1 Street-View Text Detection Network
3.1.1 DeepLab V3+ [25]
3.1.2 Training Data for the Detection Networks
3.1.3 Detection Results of the Text-Line and Single-Character Networks
3.2 Traditional Chinese Street-View Recognition Network
3.2.1 Class Selection
3.2.2 Generating Synthetic Base Images
3.2.3 Data Augmentation
3.2.4 Network Design
3.2.5 Implementation Details
Chapter 4 Experimental Results
4.1 Development Environment
4.2 Convolutional Network Settings
4.3 Real Street-View Test Set
4.3.1 Results on the Real Test Set
4.4 Comparison with Commercial Software
4.4.1 Artistic Fonts
4.4.2 Skewed Text
4.4.3 Occluded Text
4.4.4 Complex Street Scenes
4.4.5 Skewed Indoor Text
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References
References
[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks”, In Advances in neural information processing systems, 2012.
[2] Simonyan, Karen and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition”, In International Conference on Learning Representations, 2015.
[3] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, “Rethinking the Inception Architecture for Computer Vision”, In IEEE conference on computer vision and pattern recognition, 2016.
[4] Sergey Ioffe and Christian Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, In International Conference on Machine Learning, 2015.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image recognition”, In IEEE conference on computer vision and pattern recognition, 2016.
[6] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, “Densely Connected Convolutional Networks”, In IEEE conference on computer vision and pattern recognition, 2017.
[7] Ross Girshick, Jeff Donahue, Trevor Darrell and Jitendra Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation”, In IEEE conference on computer vision and pattern recognition, 2014.
[8] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. “Selective search for object recognition”, International journal of computer vision, 2013.
[9] Suykens, Johan A. K., and Joos Vandewalle. “Least squares support vector machine classifiers”, Neural Processing Letters, 1999.
[10] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. “Faster r-cnn: Towards real-time object detection with region proposal networks”, In Advances in neural information processing systems, 2015.
[11] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. “You only look once: Unified, real-time object detection”, In IEEE conference on computer vision and pattern recognition, 2016.
[12] Jonathan Long, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation”, In IEEE conference on computer vision and pattern recognition, 2015.
[13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. “Mask r-cnn.”, In Proceedings of the IEEE international conference on computer vision, 2017.
[14] Zhi Tian, Weilin Huang, Tong He, Pan He and Yu Qiao, “Detecting Text in Natural Image with Connectionist Text Proposal Network”, In European Conference on Computer Vision, 2016.
[15] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory”, In Neural Computation, 1997.
[16] Minghui Liao, Pengyuan Lyu, Minghang He, Cong Yao, Wenhao Wu and Xiang Bai, “Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes”, In European Conference on Computer Vision, 2018.
[17] Baoguang Shi, Xiang Bai, and Cong Yao. “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition”, In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[18] Alex Graves, Santiago Fernández, Faustino Gomez and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks”, In International Conference on Machine Learning, 2006.
[19] Fenfen Sheng, Zhineng Chen, and Bo Xu, “NRTR: A No-Recurrence Sequence-to-Sequence Model for Scene Text Recognition”, In International Conference on Document Analysis and Recognition, 2019.
[20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin, “Attention is All you Need”, In Advances in neural information processing systems, 2017.
[21] Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh and Hwalsuk Lee, “What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis”, In Proceedings of the IEEE international conference on computer vision, 2019.
[21] Max Jaderberg, Karen Simonyan, Andrea Vedaldi and Andrew Zisserman, “Reading Text in the Wild with Convolutional Neural Networks”, International journal of computer vision, 2016.
[22] Ankush Gupta, Andrea Vedaldi and Andrew Zisserman, “Synthetic Data for Text Localisation in Natural Images”, In IEEE conference on computer vision and pattern recognition, 2016.
[23] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun and Hwalsuk Lee, “Character Region Awareness for Text Detection”, In IEEE conference on computer vision and pattern recognition, 2019.
[24] Tai-Ling Yuan, Zhe Zhu, Kun Xu, Cheng-Jun Li and Shi-Min Hu, “Chinese Text in the Wild”, In IEEE conference on computer vision and pattern recognition, 2018.
[25] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff and Hartwig Adam, “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation”, In European Conference on Computer Vision, 2018.
[26] Yu, Fisher, and Vladlen Koltun. “Multi-scale context aggregation by dilated convolutions”, In International Conference on Learning Representations, 2016.
[27] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy and Alan L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs”, In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[28] “Understanding Chinese Character Codes” (認識中文字元碼), http://idv.sinica.edu.tw/bear/charcodes/Section05.htm
[29] Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. “Spatial transformer networks”, In Advances in neural information processing systems, 2015.
[30] Hu, Jie, Li Shen, and Gang Sun. “Squeeze-and-excitation networks”, In IEEE conference on computer vision and pattern recognition, 2018.
[31] Glorot, Xavier, and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neural networks”, In International Conference on Artificial Intelligence and Statistics, 2010.
[32] Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization”, In International Conference on Learning Representations, 2015.
[33] Lee, Junyeop, Sungrae Park, Jeonghun Baek, Seong Joon Oh, Seonghyeon Kim and Hwalsuk Lee. “On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention”, In IEEE conference on computer vision and pattern recognition workshops, 2020.
[34] Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. “Robust scene text recognition with automatic rectification”, In IEEE conference on computer vision and pattern recognition, 2016.
[35] Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. “Aster: An attentional scene text recognizer with flexible rectification”, In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[36] Wei Liu, Chaofeng Chen, and Kwan-Yee K. Wong. “Charnet: A character-aware neural network for distorted scene text recognition”, In AAAI Conference on Artificial Intelligence, 2018.
[37] Wei Liu, Chaofeng Chen, Kwan-Yee K. Wong, Zhizhong Su, and Junyu Han. “Star-net: A spatial attention residue network for scene text recognition”, In British Machine Vision Conference, 2016.
[38] Yunze Gao, Yingying Chen, Jinqiao Wang, Zhen Lei, XiaoYu Zhang, and Hanqing Lu. “Recurrent calibration network for irregular text recognition”, In IEEE conference on computer vision and pattern recognition, 2018.
[39] Zhanzhan Cheng, Yangliu Xu, Fan Bai, Yi Niu, Shiliang Pu, and Shuigeng Zhou. “AON: Towards arbitrarily-oriented text recognition”, In IEEE conference on computer vision and pattern recognition, 2018.
[40] Hui Li, Peng Wang, Chunhua Shen, and Guyu Zhang. “Show, attend and read: A simple and strong baseline for irregular text recognition”, In AAAI Conference on Artificial Intelligence, 2019.
[41] Xiao Yang, Dafang He, Zihan Zhou, Daniel Kifer, and C Lee Giles. “Learning to read irregular text with attention mechanisms”, In International Joint Conferences on Artificial Intelligence, 2017. 
[42] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel and Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, In International Conference on Machine Learning, 2015.
[43] Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate”, In International Conference on Learning Representations, 2015.
[44] Minh-Thang Luong, Hieu Pham, Christopher D. Manning, “Effective Approaches to Attention-based Neural Machine Translation”, In Empirical Methods in Natural Language Processing, 2015.
[45] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong and R. Young, “ICDAR 2003 Robust Reading Competitions”, In International Conference on Document Analysis and Recognition, 2003.
[46] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazàn Almazàn and Lluís Pere de las Heras, “ICDAR 2013 Robust Reading Competition”, In International Conference on Document Analysis and Recognition, 2013.
[47] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, Faisal Shafait, Seiichi Uchida and Ernest Valveny, “ICDAR 2015 competition on Robust Reading”, In International Conference on Document Analysis and Recognition, 2015.
[48] Raul Gomez, Baoguang Shi, Lluis Gomez, Lukas Neumann, Andreas Veit, Jiri Matas, Serge Belongie and Dimosthenis Karatzas, “ICDAR2017 Robust Reading Challenge on COCO-Text”, In International Conference on Document Analysis and Recognition, 2017.
Advisor: Po-Chyi Su (蘇柏齊)   Approval date: 2020-07-29

For questions about this thesis, please contact the Extension Services Division of the National Central University Library, TEL: (03)422-7151 ext. 57407, or by e-mail.