Thesis 109522143 Detailed Record




Name Ming-Cheng Peng (彭明正)   Graduate Department Computer Science and Information Engineering
Title Progressive Rectification Network and Spelling Error Correction Language Model Based Scene Text Recognition
(基於漸進式修正網路與拼寫錯誤修正語言模型之場景文字辨識)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Pre-processing
★ Application and Design of Speech Synthesis and Voice Conversion
★ Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Dictation System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Accelerated Robust Features
★ Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ RetinaNet Applied to Face Detection
★ Financial Product Trend Prediction
★ A Study of Integrating Deep Learning Methods to Predict Age and Aging Genes
★ End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep Learning Based ETF Trend Prediction
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Stroke Surgery Survival
Files Full-text access: permanently restricted (never open)
Abstract (Chinese) Scene text recognition has rapidly become a popular research topic owing to its wide range of applications. Unlike general text recognition, scene text frequently involves complex backgrounds, irregular orientations, occluded characters, and blurred images, so scene text recognition must be considerably more robust to image diversity and image quality degradation than general text recognition.
In recent years, with the development of deep learning, many methods have attempted the scene text recognition task. For humans, however, recognizing text is not a purely visual judgment: semantic knowledge is also considered to produce more reasonable results. To bring deep learning models closer to the human reading process, a growing number of methods focus on making the model learn richer semantic information. Most of the existing literature, however, studies English datasets, and applying those methods directly to Chinese datasets may be unsuitable. This thesis therefore proposes a deep learning model better suited to Chinese text recognition: we add a language model and pre-train it on additional text data for spelling error correction, giving our scene text recognition architecture stronger semantic reasoning ability. We further propose a progressive rectification network that replaces the rectification network most commonly used in the existing literature [1] and handles irregularly oriented text better.
Experiments show that the proposed method outperforms the two classic scene text recognition architectures [1, 2], which are frequently used as baselines in other work, as well as the two recently proposed methods [3, 4]. An ablation study further examines the effectiveness of each component of the model. We believe the proposed method is better suited to the Chinese text recognition task.
Abstract (English) Scene text recognition has quickly become a hot research topic due to its wide range of applications. Unlike general text recognition, scene text often exhibits complex backgrounds, irregular orientations, character occlusion, and image blur, so scene text recognition must cope with far greater image diversity and quality degradation than general text recognition.
In recent years, with the development of deep learning technology, many methods have attempted to solve the scene text recognition task. For humans, however, text recognition is not judged from vision alone; semantic knowledge is also considered to produce more reasonable results. To bring deep learning models closer to human reading, more and more methods in recent years have turned to making the model learn richer semantic information. However, most of the existing literature uses English datasets, and applying those studies directly to Chinese datasets may not be suitable. In view of this, this thesis proposes a deep learning model better suited to Chinese scene text recognition. We add a language model and pre-train it on additional text data for spelling error correction, giving our scene text recognition model stronger semantic reasoning capability. In addition, we propose a progressive rectification network that replaces the rectification network most commonly used in the existing literature [1] and enables the model to better handle text with irregular orientations.
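The spelling-error-correction pre-training described above requires (noisy, clean) text pairs. A minimal sketch of how such pairs might be generated is shown below; the corruption probability and the toy confusion set are illustrative assumptions, not the data recipe actually used in this thesis.

```python
import random

def corrupt(text, confusion, p=0.15, seed=None):
    """Build a noisy copy of `text` by replacing characters with
    similar-looking or similar-sounding ones from a confusion set.
    The resulting (noisy, clean) pair can train a correction model."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in confusion and rng.random() < p:
            out.append(rng.choice(confusion[ch]))
        else:
            out.append(ch)
    return "".join(out)

# Toy confusion set (illustrative only); real sets are far larger.
confusion = {"在": ["再"], "做": ["作"]}
clean = "我在做功課"
noisy = corrupt(clean, confusion, p=1.0, seed=0)  # → "我再作功課"
pair = (noisy, clean)
```

During pre-training, the language model reads the noisy side and is supervised to emit the clean side, which teaches it to repair implausible character sequences.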
In the experiments, we show that the proposed method outperforms the two classic scene text recognition methods [1, 2], which are frequently used as baselines in the literature, as well as the two recently proposed methods [3, 4]. In the ablation study, we further examine the effectiveness of each part of the model. We believe the proposed method is better suited to Chinese scene text recognition.
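The vision-and-language fusion step mentioned above can be illustrated with a simple learned elementwise gate, similar in spirit to the fusion used in [3]; the feature dimension, gate form, and random (untrained) weights below are assumptions for illustration, not the thesis's actual fusion module.

```python
import numpy as np

def gated_fusion(v, l, W, b):
    """Fuse a vision feature v and a language feature l:
    g = sigmoid([v; l] W + b), output = g * v + (1 - g) * l.
    Each output element is a convex combination of the two features."""
    z = np.concatenate([v, l], axis=-1)
    g = 1.0 / (1.0 + np.exp(-(z @ W + b)))   # elementwise gate in (0, 1)
    return g * v + (1.0 - g) * l

rng = np.random.default_rng(0)
d = 4                                        # toy feature dimension
v = rng.standard_normal(d)                   # vision branch output
l = rng.standard_normal(d)                   # language branch output
W = rng.standard_normal((2 * d, d))          # untrained weights, shapes only
out = gated_fusion(v, l, W, np.zeros(d))
```

Because the gate stays strictly between 0 and 1, the fused feature always lies between the vision and language features elementwise, letting training decide per dimension which branch to trust.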
Keywords (Chinese) ★ 深度學習 (Deep Learning)
★ 文字辨識 (Text Recognition)
★ 語言模型 (Language Model)
★ 拼寫錯誤修正 (Spelling Error Correction)
Keywords (English) ★ Deep Learning
★ Text Recognition
★ Language Model
★ Spelling Error Correction
Table of Contents Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Motivation and Objectives
1.3 Methodology and Chapter Overview
Chapter 2 Related Work
2.1 Convolutional Neural Networks
2.1.1 Convolutional Layer
2.1.2 Pooling Layer
2.1.3 Activation Function
2.1.4 Fully Connected Layer
2.2 Common Convolutional Neural Network Architectures
2.2.1 Residual Network [15]
2.3 Recurrent Neural Networks
2.3.1 Long Short-Term Memory (LSTM) [18]
2.3.2 Bidirectional Long Short-Term Memory (Bi-LSTM)
2.4 Transformer
2.4.1 Self-Attention
2.4.2 Multi-Head Attention
2.4.3 Positional Encoding
2.5 Bidirectional Encoder Representations from Transformers (BERT) [20]
2.5.1 BERT Input
2.5.2 Masked Language Model (MLM)
2.5.3 Next Sentence Prediction (NSP)
2.5.4 Fine-Tuning BERT
Chapter 3 Scene Text Recognition Literature
3.1 CRNN [2]
3.1.1 Convolutional Layers
3.1.2 Recurrent Layers
3.1.3 Transcription Layer
3.2 ASTER [1]
3.2.1 Rectification Network
3.2.2 Recognition Network
Chapter 4 Proposed Method
4.1 Vision Model
4.1.1 Progressive Rectification Network
4.1.2 ResNet-34
4.1.3 Transformer Encoder
4.1.4 Attention Mechanism
4.2 Language Model
4.3 Vision and Language Fusion
4.4 Network Training and Hyperparameters
4.4.1 Vision Model Pre-training and Hyperparameters
4.4.2 Language Model Pre-training and Hyperparameters
4.4.3 Fine-tuning and Hyperparameters
Chapter 5 Experimental Details and Results
5.1 Hardware and Software Versions
5.2 Datasets
5.2.1 ReCTS [32], LSVT [37]
5.2.2 RCTW [38]
5.2.3 TPS Transformation Data Augmentation
5.2.4 ReCTS Single-Character Synthesis Data Augmentation
5.2.5 CLMAD [35]
5.3 Comparison with Other Methods
5.4 Ablation Study
5.4.1 Vision and Language Model Pre-training
5.4.2 Progressive Rectification Network
5.4.3 Language Model Effectiveness
5.4.4 BERT with Error Correction Data Pre-training
Chapter 6 Conclusion and Future Work
Chapter 7 References
References [1] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, "ASTER: An attentional scene text recognizer with flexible rectification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 9, pp. 2035-2048, 2018.
[2] B. Shi, X. Bai, and C. Yao, "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2298-2304, 2016.
[3] S. Fang, H. Xie, Y. Wang, Z. Mao, and Y. Zhang, "Read like humans: autonomous, bidirectional and iterative language modeling for scene text recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7098-7107.
[4] Y. Wang, H. Xie, S. Fang, J. Wang, S. Zhu, and Y. Zhang, "From two to one: A new scene text recognizer with visual language modeling network," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14194-14203.
[5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[6] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013: IEEE, pp. 6645-6649.
[7] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369-376.
[8] Z. Qiao, Y. Zhou, D. Yang, Y. Zhou, and W. Wang, "Seed: Semantics enhanced encoder-decoder framework for scene text recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13528-13537.
[9] D. Yu et al., "Towards accurate scene text recognition with semantic reasoning networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12113-12122.
[10] A. Vaswani et al., "Attention is all you need," Advances in neural information processing systems, vol. 30, 2017.
[11] M. Jaderberg, K. Simonyan, and A. Zisserman, "Spatial transformer networks," Advances in neural information processing systems, vol. 28, 2015.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in neural information processing systems, vol. 25, 2012.
[13] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[14] C. Szegedy et al., "Going deeper with convolutions," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1-9.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770-778.
[16] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132-7141.
[17] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International conference on machine learning, 2015: PMLR, pp. 448-456.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[19] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in International conference on machine learning, 2013: PMLR, pp. 1310-1318.
[20] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[21] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[22] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[23] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532-1543.
[24] S. Ilić, E. Marrese-Taylor, J. A. Balazs, and Y. Matsuo, "Deep contextualized word representations for detecting sarcasm and irony," arXiv preprint arXiv:1809.09795, 2018.
[25] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," 2018.
[26] Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[27] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th international conference on Machine learning, 2008, pp. 1096-1103.
[28] F. L. Bookstein, "Thin-plate splines and the atlas problem for biomedical images," in Biennial international conference on information processing in medical imaging, 1991: Springer, pp. 326-342.
[29] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[30] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International conference on machine learning, 2019: PMLR, pp. 6105-6114.
[31] S. Zhang, H. Huang, J. Liu, and H. Li, "Spelling error correction with soft-masked BERT," arXiv preprint arXiv:2005.07421, 2020.
[32] R. Zhang et al., "ICDAR 2019 robust reading challenge on reading Chinese text on signboard," in 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019: IEEE, pp. 1577-1581.
[33] M. D. Zeiler, "Adadelta: an adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[34] S.-H. Wu, C.-L. Liu, and L.-H. Lee, "Chinese spelling check evaluation at SIGHAN Bake-off 2013," in SIGHAN@IJCNLP, 2013: CiteSeer, pp. 35-42.
[35] Y. Bai, J. Tao, J. Yi, Z. Wen, and C. Fan, "CLMAD: A Chinese language model adaptation dataset," in 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2018: IEEE, pp. 275-279.
[36] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
[37] Y. Sun et al., "ICDAR 2019 competition on large-scale street view text with partial labeling-RRC-LSVT," in 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019: IEEE, pp. 1557-1562.
[38] B. Shi et al., "ICDAR2017 competition on reading Chinese text in the wild (RCTW-17)," in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017, vol. 1: IEEE, pp. 1429-1434.
Advisor Jia-Ching Wang (王家慶)   Approval Date 2022-09-23
