    Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/81076


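    Title: Evaluating and Improving Tesseract for Optical Character Recognition of Color Web Pages (評估與改進Tesseract運用於彩色網頁的光學字元辨識)
    Author: Chen, Yi-Cheng (陳奕誠)
    Contributor: Department of Computer Science and Information Engineering
    Keywords: Optical character recognition; Tesseract
    Date: 2019-07-12
    Upload time: 2019-09-03 15:33:10 (UTC+8)
    Publisher: National Central University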
    Abstract: In the past, the success of optical character recognition (OCR) depended heavily on feature extraction: if the important features were not extracted effectively, the recognition results fell short of expectations. With improvements in hardware and computing power, deep learning has become a popular field in recent years. Its strength lies in extracting features automatically, so in principle it can find good features that improve OCR accuracy.
    According to IBM estimates, about $2.5 trillion is spent worldwide each year on manually keying non-digital documents stored on traditional media into digital form. If the OCR recognition rate could be raised to a usable standard, much of that time and cost could be saved. No tool currently achieves a 100% recognition rate, because input images vary widely: noise from scanning or camera capture, complex layouts, text and background colors, icons of various sizes, and different languages and fonts all strongly affect the results.
    The purpose of this study is to find an effective way to improve the recognition rate of OCR software. The input images are web page screenshots, that is, images that are free of noise and already rectified. Because computer fonts are TrueType fonts, screenshots of the same page may differ from one screen to another. In testing, Google Vision achieved the best recognition rate, but it is a cloud service, and many factory machines are restricted to an internal network with no outside connection, so the open-source Tesseract 4.0 was chosen instead. The experiments show that applying Tesseract 4.0 directly to color web pages yields a very low recognition rate, but that preprocessing the images raises it substantially. Training Tesseract separately for each page, by contrast, does not effectively improve the recognition rate, because web page layouts are complex and font sizes vary. Since Tesseract 4.0 is based on LSTM, text of different sizes that is judged to be on the same line degrades its recognition results.
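
    The abstract does not say which preprocessing steps were used. As a minimal sketch only, assuming a common approach (grayscale conversion, Otsu thresholding, and polarity normalization so that text always ends up dark on a light background), preprocessing a color screenshot before OCR could look like the following. All function names are hypothetical, and the code works on plain nested lists of RGB tuples rather than on any real image library.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def to_gray(image: List[List[Pixel]]) -> List[List[int]]:
    """Convert RGB pixels to 8-bit luminance (ITU-R BT.601 weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in image]


def otsu_threshold(gray: List[List[int]]) -> int:
    """Return the threshold that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize_dark_on_light(image: List[List[Pixel]]) -> List[List[int]]:
    """Binarize to black text (0) on a white background (255).

    Assumption: the background is whichever class covers more pixels,
    so a light-on-dark page (e.g. white text on a blue banner) is inverted.
    """
    gray = to_gray(image)
    t = otsu_threshold(gray)
    dark = sum(v <= t for row in gray for v in row)
    total = sum(len(row) for row in gray)
    background_is_dark = dark > total / 2
    # A pixel is text when it belongs to the minority (non-background) class.
    return [[0 if (v > t) == background_is_dark else 255 for v in row]
            for row in gray]


if __name__ == "__main__":
    white, navy = (255, 255, 255), (0, 0, 128)
    # Hypothetical 4x4 screenshot: two white "text" pixels on a navy banner.
    banner = [[navy] * 4 for _ in range(4)]
    banner[1][1] = white
    banner[1][2] = white
    for row in binarize_dark_on_light(banner):
        print(row)
```

    The resulting black-and-white image would then be saved and passed to the OCR engine; that step is omitted here because it depends on the image library and Tesseract bindings in use.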
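    Appears in collections: [Graduate Institute of Computer Science and Information Engineering] Master's and Doctoral Theses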

    Files in this item:

    File        Description  Size  Format  Views
    index.html               0Kb   HTML    188


    All items in NCUIR are protected by their original copyright.
