Abstract (English) |
Optical Character Recognition (OCR) is a major challenge in Computer Vision. The task has grown harder over time, from recognizing English letters, digits, and a few symbols in specific fonts to detecting and recognizing text in the wild. Within text detection and recognition, Chinese text is more difficult than English: there are far more Chinese characters than English letters, and their shapes are far more complex. Moreover, unlike English, Chinese can be written both left to right and top to bottom, which makes Chinese text detection and recognition considerably harder. Training an OCR model requires a large amount of labeled data, covering both the position of each character and its identity, and more complex scenes demand even more labeled data. We therefore focus on a simpler task: detecting and recognizing Chinese text in scanned documents. Unlike text in the wild, text blocks in scanned files are highly structured, so a simple network suffices to achieve good detection results. We then only need to separate the detected region into individual lines and feed each line into the text recognizer. Finally, by combining the recognition results with the detected positions, we obtain all the text in the scanned file. These results could further enable applications such as document classification. |
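The pipeline described above can be sketched in a few lines. This is a minimal illustration only: every function here is a hypothetical stand-in operating on a toy page representation, where a real system would plug in the detection network, a line-segmentation step, and the line recognizer (e.g. a CRNN with CTC decoding).

```python
from typing import Dict, List, Tuple

# A detected text block: (x, y, width, height) in page coordinates.
Box = Tuple[int, int, int, int]

def detect_text_regions(page: List[Dict]) -> List[Box]:
    """Stand-in for the text-detection network: one box per text block."""
    return [region["box"] for region in page]

def split_into_lines(page: List[Dict], box: Box) -> List[Dict]:
    """Stand-in for line segmentation inside a detected region."""
    for region in page:
        if region["box"] == box:
            return region["lines"]
    return []

def recognize_line(line: Dict) -> str:
    """Stand-in for the line recognizer (e.g. a CRNN + CTC model)."""
    return line["text"]

def ocr_scanned_page(page: List[Dict]) -> List[Tuple[Box, str]]:
    """Detect regions, split each into lines, recognize each line,
    and pair the recognized text with its detected position."""
    results = []
    for box in detect_text_regions(page):
        lines = split_into_lines(page, box)
        text = "\n".join(recognize_line(line) for line in lines)
        results.append((box, text))
    return results
```

With the toy representation above, `ocr_scanned_page` returns a list of `(box, text)` pairs, which is exactly the combined output (position plus content) that downstream applications such as document classification would consume.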
References |
[1] I. Sutskever, O. Vinyals, Q. V. Le, “Sequence to Sequence Learning with Neural Networks,” Advances in Neural Information Processing Systems, 2014.
[2] R. O’Reilly, “Biologically Plausible Error-driven Learning using Local Activation Differences: The Generalized Recirculation Algorithm,” Neural Computation, 8:5, 895-938, 1996.
[3] Hamza Mahmood, “Activation Functions in Neural Networks,” [Online], Available: https://towardsdatascience.com/activation-functions-in-neural-networks-83ff7f46a6bd, [Accessed: 23-Jul-2019].
[4] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov 1998.
[5] A. Krizhevsky, I. Sutskever, G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Proceedings of the 25th International Conference on Neural Information Processing Systems, vol. 1, pp. 1097-1105, Dec 2012.
[6] K. Simonyan, A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv:1409.1556 [cs.CV], 2014.
[7] C. Szegedy et al., “Going deeper with convolutions,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 1-9.
[8] S. Ioffe, C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” arXiv:1502.03167, 2015.
[9] C. Szegedy et al., “Rethinking the Inception Architecture for Computer Vision,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818-2826, 2016.
[10] C. Szegedy, S. Ioffe, V. Vanhoucke, “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning,” CoRR, abs/1602.07261, 2016.
[11] K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778.
[12] J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the National Academy of Sciences of the USA, vol. 79, no. 8, pp. 2554-2558, April 1982.
[13] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[14] Christopher Olah, “Understanding LSTM Networks,” [Online], Available: https://colah.github.io/posts/2015-08-Understanding-LSTMs/, [Accessed: 25-Jul-2019].
[15] Alex Graves, Jürgen Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, Volume 18, Issues 5–6, 2005, Pages 602-610, ISSN 0893-6080.
[16] A. Graves, A. Mohamed and G. Hinton, “Speech recognition with deep recurrent neural networks,” 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645-6649, 2013.
[17] R. Girshick, J. Donahue, T. Darrell and J. Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580-587, 2014.
[18] R. Girshick, “Fast R-CNN,” 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 2015, pp. 1440-1448.
[20] K. He, G. Gkioxari, P. Dollár and R. Girshick, “Mask R-CNN,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980-2988, 2017.
[21] “The First OCR System: "GISMO",” [Online], Available: http://www.historyofinformation.com/detail.php?entryid=885, [Accessed: 20-Aug-2019].
[22] R. G. Casey and G. Nagy, “Recognition of Printed Chinese Characters,” IEEE Transactions on Electronic Computers, vol. EC-15, pp. 91-101, 1966.
[23] “Tesseract Ocr,” [Online], Available: https://github.com/tesseract-ocr/, [Accessed: 20-Aug-2019].
[24] Z. Zuo, B. Shuai, G. Wang, X. Liu, X. Wang, B. Wang and Y. Chen, “Convolutional recurrent neural networks: Learning spatial dependencies for image representation,” 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 18-26, 2015.
[25] A. Graves, S. Fernández, F. Gomez and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006.
[26] Z. Tian, W. Huang, T. He, P. He and Y. Qiao, “Detecting Text in Natural Image with Connectionist Text Proposal Network,” European Conference on Computer Vision (ECCV), pp. 56-72, 2016.
|