| Abstract: | In the past, the success or failure of optical character recognition (OCR) was closely tied to feature extraction: if the important features were not extracted effectively, recognition results fell short of expectations. With improvements in hardware and computing power, deep learning has become a popular field in recent years. Its strength lies in extracting features automatically, which in theory allows it to find good features that improve OCR accuracy. According to IBM estimates, about US$2.5 trillion is spent worldwide each year on manually keying non-digital documents stored on traditional media into digital form. If the OCR recognition rate could be raised to a usable standard, considerable time and cost would be saved. No tool today reaches a recognition rate of 100%, because input images vary widely: noise from scanners and cameras, complex layouts, text and background colors, icons of various sizes, and different languages and fonts all strongly affect the results. The purpose of this study is to find an effective way to improve the recognition rate of OCR software. The images used for recognition are webpage screenshots, that is, images that are free of noise and need no geometric correction. Because computer fonts are TrueType fonts, screenshots of the same page may still differ from one screen to another. In our tests, Google Vision achieved the best recognition rate, but it is a cloud service, and many factory machines are restricted to an internal network with no outside connection. We therefore chose the open-source Tesseract 4.0. The experiments show that applying Tesseract 4.0 directly to color webpages yields a very low recognition rate, whereas preprocessing the images raises it substantially. Training a separate model for each page, by contrast, does not effectively improve the recognition rate, because webpage layouts are complex and font sizes are not fixed. Since Tesseract 4.0 is based on LSTM, its output degrades whenever text of different sizes is treated as a single line. |
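The abstract does not say which preprocessing steps were applied, so the following is only a minimal sketch of one common approach for color screenshots, assuming grayscale conversion, Otsu thresholding, and a polarity check so that text ends up dark on a light background. The function names (`to_gray`, `otsu_threshold`, `binarize_dark_on_light`) are hypothetical, and the polarity check assumes that the background covers more pixels than the text.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def to_gray(pixels: List[List[RGB]]) -> List[List[int]]:
    """Convert RGB pixels to 8-bit luminance using ITU-R BT.601 weights."""
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for r, g, b in row]
        for row in pixels
    ]


def otsu_threshold(gray: List[List[int]]) -> int:
    """Return the threshold t that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize_dark_on_light(pixels: List[List[RGB]]) -> List[List[int]]:
    """Binarize to 0/255 and invert if needed so the text is dark on light.

    Assumption (not from the source): the background is the majority class.
    """
    gray = to_gray(pixels)
    t = otsu_threshold(gray)
    binary = [[255 if v > t else 0 for v in row] for row in gray]
    flat = [v for row in binary for v in row]
    if flat.count(0) > len(flat) / 2:
        # Mostly black means light text on a dark background: flip it.
        binary = [[255 - v for v in row] for row in binary]
    return binary


if __name__ == "__main__":
    # Toy 4x4 "screenshot": white text pixels on a dark blue background.
    bg, fg = (0, 0, 128), (255, 255, 255)
    image = [[bg] * 4 for _ in range(4)]
    for r, c in [(1, 1), (1, 2), (2, 1)]:
        image[r][c] = fg
    for row in binarize_dark_on_light(image):
        print(row)
```

In practice the binarized image would be saved and handed to the OCR engine. Whether this particular combination of steps matches the preprocessing used in the study is an assumption.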