Name: Tai-Hung Chen (陳泰宏)
Department: Electrical Engineering
Thesis title: Recognition and Postprocessing of Chinese Business Cards (中文商業名片辨識及後處理)
- This electronic thesis is authorized for immediate open access.
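Abstract (translated from Chinese) Business cards convey a great deal of important information. To use this information more efficiently, it must be extracted automatically and stored in an electronic database; a system that does this is called a business card recognition system. In general, business card recognition consists of three steps. First, a preprocessing stage processes the card image and extracts the text on the card. The second step analyzes the card layout. The last is a postprocessing stage, which uses linguistic methods such as semantics to improve the recognition rate of the system.
This thesis studies the recognition of Chinese business cards. We assume that the characters on the card have already been extracted and that the card layout has already been analyzed. Because the characters on business cards are very small and vary widely in font, OCR achieves a low recognition rate on business cards; our goal is to improve on this.
In our method, HMMs are used to recognize the characters on Chinese business cards: a left-to-right HMM recognizes each character and outputs the top ten candidate characters. In the postprocessing stage, a language model is then used to improve the recognition result. The Viterbi algorithm is applied to the correction step, using bigrams as linguistic information to search for the correct character among the top ten candidates; the resulting best character sequence is the improved output of the postprocessing stage.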
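The recognition step described above, a left-to-right HMM per character that yields a ranked candidate list, can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: it assumes discrete observation symbols, plain (non-log) probabilities suitable only for short sequences, and a hypothetical model layout with `stay` and `emit` fields; the function names `forward_left_right` and `top_candidates` are likewise made up for this example.

```python
import heapq


def forward_left_right(obs, stay, emit):
    """Likelihood P(obs | model) of a left-to-right HMM (forward algorithm).

    Assumed layout (hypothetical, for illustration only):
      stay[i]         = P(state i -> state i); 1 - stay[i] = P(state i -> i+1)
      emit[i][symbol] = P(symbol | state i)
    The model starts in state 0 and must end in the last state.
    """
    n = len(stay)
    alpha = [0.0] * n
    alpha[0] = emit[0].get(obs[0], 0.0)
    for symbol in obs[1:]:
        new_alpha = [0.0] * n
        for j in range(n):
            # A left-to-right model reaches state j only from j or j - 1.
            p = alpha[j] * stay[j]
            if j > 0:
                p += alpha[j - 1] * (1.0 - stay[j - 1])
            new_alpha[j] = p * emit[j].get(symbol, 0.0)
        alpha = new_alpha
    return alpha[-1]


def top_candidates(obs, models, n=10):
    """Return the n characters whose models score obs highest."""
    scored = (
        (forward_left_right(obs, m["stay"], m["emit"]), ch)
        for ch, m in models.items()
    )
    return [ch for _, ch in heapq.nlargest(n, scored)]
```

With one such model per character class, `top_candidates(obs, models)` would return the ranked list that the postprocessing stage searches. A practical system would work in log space to avoid underflow on long observation sequences.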
Our experiments recognize the company field and the address field of Chinese business cards. The data used to train the bigram table and the HMMs come from a telephone directory; the address fields of 100 cards and the company fields of 30 cards are used for testing. The experimental results confirm that the proposed method is effective.
Abstract (English) Business cards convey significant personal information. To use this information effectively, it is necessary to extract it automatically and build an electronic business card database. A system that does this is called a business card recognition system. In general, a business card recognition system has three stages. First, a preprocessing stage performs image processing and extracts character images. The second stage is card layout analysis. The last stage, called postprocessing, usually adopts linguistic knowledge to increase the recognition rate of business card processing.
The goal of this thesis is to study the recognition of Chinese business cards. We assume that the characters have been extracted and that the card layout has been analyzed. Our aim is to improve the low recognition rate of OCR on business cards, which arises because the characters vary greatly in font and are too small to be recognized reliably.
In our approach, a Hidden Markov Model (HMM) is adopted to recognize the characters on Chinese business cards. A left-to-right model outputs the top 10 candidates as its recognition result. A postprocessing stage follows to improve the recognition result. In this stage, the Viterbi algorithm uses bigrams as linguistic information to search the top-10 candidates. The resulting optimal character sequence is the improved output of postprocessing.
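The postprocessing search described above can be illustrated with a small sketch: a Viterbi pass over the per-position candidate lists, scored by bigram probabilities. This is an illustration under stated assumptions rather than the thesis code. The function name `viterbi_bigram`, the dictionary layout of `bigram` and `unigram`, and the `floor` probability for unseen pairs are all hypothetical, and the sketch scores paths with the language model only, whereas the actual system may also weight the HMM recognition scores.

```python
import math


def viterbi_bigram(candidates, bigram, unigram, floor=1e-6):
    """Pick the most probable character sequence from candidate lists.

    candidates: non-empty list of positions, each a list of candidate
                characters (for example, the top 10 from the recognizer).
    bigram[(prev, cur)] = P(cur | prev); unigram[c] = P(c).
    Entries missing from either table get the probability `floor`.
    """
    # best[c] = (log-probability of the best path ending in c, that path)
    best = {
        c: (math.log(unigram.get(c, floor)), [c]) for c in candidates[0]
    }
    for position in candidates[1:]:
        new_best = {}
        for cur in position:
            # Keep only the best-scoring predecessor for each candidate.
            score, path = max(
                (prev_score + math.log(bigram.get((prev, cur), floor)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[cur] = (score, path + [cur])
        best = new_best
    return "".join(max(best.values())[1])
```

For example, given the candidate lists `[["台", "合"], ["北", "比"], ["市", "布"]]` and a bigram table that favors 台→北 and 北→市, the search selects the sequence 台北市 even when the recognizer's first choice at a position is wrong. With k candidates per position and n positions the cost is O(n·k²), which stays small for k = 10.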
Our experiments address the recognition of the address and company items on business cards. The bigram table and the Hidden Markov Models are trained with data from a telephone directory. 100 address items and 30 company items are used for testing. The experimental results confirm the validity of the proposed method.
Keywords (Chinese) ★ Hidden Markov model
★ semantics
★ postprocessing
★ Chinese
★ recognition
★ business card
★ Viterbi algorithm
★ language model
Keywords (English) ★ language
★ linguistic
★ Viterbi
★ OCR
★ HMM
★ card
Table of contents Abstract in Chinese
Abstract in English
Contents
List of Figures
Chapter 1 Introduction
1.1 Motivation
1.2 Survey of Related Works
1.3 System Description and Assumptions
1.4 Thesis Organization
Chapter 2 The Hidden Markov Model for Chinese Character Recognition
2.1 The Hidden Markov Model
2.1.1 Feature Extraction
2.1.2 Noise Deletion
2.1.3 The HMM Structure
2.2 Training
2.3 Recognition
Chapter 3 Postprocessing
3.1 The Language Model
3.2 Bigram Table
3.3 Top-10 Candidate Table
3.4 Viterbi Algorithm
3.5 Detection and Correction of Lost Candidate
Chapter 4 Experimental Results
4.1 Company Item in Chinese Business Card
4.2 Address Item in Chinese Business Card
Chapter 5 Conclusions and Future Research
5.1 Conclusions
5.2 Future Research
References
Advisor: Yau-Tarng Juang (莊堯棠) Approval date: 2000-7-10