Abstract: | Scene text recognition has quickly become a popular research topic because of its wide range of applications. Unlike general text recognition, scene text often involves complex backgrounds, irregular orientations, occluded characters, and blurred images, so a scene text recognizer must cope with greater image diversity and lower image quality than a general text recognizer. With the development of deep learning in recent years, many methods have been proposed for scene text recognition. Humans, however, do not recognize text from visual appearance alone; they also draw on semantic knowledge to arrive at a more plausible reading. To bring deep learning models closer to the way humans read, a growing number of recent methods focus on helping the model learn richer semantic information. Most of the existing literature, however, conducts its research on English datasets, and applying these methods directly to Chinese datasets may not be appropriate. In view of this, this paper proposes a deep learning model better suited to Chinese scene text recognition. We add a language model and use additional text data for spelling error correction pre-training, which gives our scene text recognition architecture better semantic reasoning ability.
In addition, we propose a progressive rectification network that replaces the rectification network most commonly used in the existing literature [1] and enables the model to better handle text with irregular orientations. In the experiments, we show that the proposed method outperforms two classic scene text recognition architectures [1, 2] that are frequently used as baselines in other work, as well as two more recently proposed methods [3, 4]. In the ablation study, we also examine the effectiveness of each component of the model. We believe the proposed method is better suited to the Chinese scene text recognition task. |