A Study of Intelligent Text Recognition and Extraction Techniques for Specific Engineering Chart Images


Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/93576

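Title:	A Study of Intelligent Text Recognition and Extraction Techniques for Specific Engineering Chart Images
Author:	陳冠帆;Chen, Kuan-Fan
Contributor:	Department of Mechanical Engineering, In-service Master Program
Keywords:	Text recognition; Table extraction; Information extraction; Morphological operations; Optical Character Recognition
Date:	2023-06-28
Upload time:	2024-09-19 17:20:25 (UTC+8)
Publisher:	National Central University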
Abstract:	This study investigates a fast method that uses morphological operations and optical character recognition (OCR) to extract the text in each cell of a specific type of engineering chart image and record the results. The method targets one specific chart format; applying it to other formats requires adjusting the parameters that encode that chart's rules. The implementation is written in Python. In preprocessing, Otsu thresholding binarizes the image, and morphological operations locate the cells. Text recognition uses the Tesseract-OCR package in three stages: (1) fully automatic page segmentation with the pre-trained English model, (2) word segmentation with a retrained English model, and (3) character segmentation with a retrained English model. Finally, regular expressions combined with exhaustive enumeration correct errors and content that violates the rules. The experiments show that although the pre-trained English model provided with Tesseract-OCR recognizes long strings very well, it is error-prone on the single words and single characters found in cells: running the three stages with the pre-trained English model yields an accuracy of only 14.65%. The English model retrained on a dataset built from the specific engineering chart images recognizes words and characters in cells better, raising the accuracy to 58.04%. In post-processing, listing every error and rule-violating string according to the chart's rules and replacing it with the correct characters brings the accuracy to 100%.
Appears in collections:	[Department of Mechanical Engineering, In-service Master Program] Theses and Dissertations
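
The abstract's first and last steps (Otsu binarization and rule-based post-correction) can be illustrated with a minimal standard-library sketch. It is not the thesis's implementation: the cell rule `CELL_RULE`, the confusion table `CONFUSIONS`, and all function names are hypothetical, and the exhaustive search over character substitutions is only one possible reading of "regular expressions combined with exhaustive enumeration". The morphological cell extraction and the three Tesseract-OCR stages are not reproduced here.

```python
import itertools
import re


def otsu_threshold(pixels):
    """Return the Otsu threshold for a sequence of 8-bit grayscale values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Map each pixel to 1 if it is brighter than the threshold, else 0."""
    return [1 if p > t else 0 for p in pixels]


# Hypothetical cell rule: two capital letters, a hyphen, four digits.
CELL_RULE = re.compile(r"[A-Z]{2}-\d{4}")

# Hypothetical table of characters that OCR commonly confuses.
CONFUSIONS = {"O": "0", "0": "O", "I": "1", "1": "I", "l": "1",
              "S": "5", "5": "S", "B": "8", "8": "B"}


def correct_cell(text):
    """Return text if it satisfies CELL_RULE; otherwise try every combination
    of confusable-character substitutions and return the first variant that
    does, or None if no variant matches."""
    if CELL_RULE.fullmatch(text):
        return text
    options = [(ch, CONFUSIONS[ch]) if ch in CONFUSIONS else (ch,) for ch in text]
    for candidate in itertools.product(*options):
        fixed = "".join(candidate)
        if CELL_RULE.fullmatch(fixed):
            return fixed
    return None
```

For example, `correct_cell("AB-1O2S")` tries the substitutions O to 0 and S to 5 and accepts the first variant that satisfies the rule. When several variants satisfy the rule, this sketch simply returns the first one found, so a real rule set would need to be strict enough to make the correction unambiguous.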


