Thesis/Dissertation 985202043: Detailed Record




Name Chia-Yi Huang (黃嘉毅)   Department Department of Computer Science and Information Engineering
Thesis title Extraction of Chinese Postal Addresses and Associated Information from General Web Pages
(中文郵政地址與鄰近相關資訊擷取之研究)
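Access rights
  1. The author has consented to immediate open access to this electronic thesis.
  2. Open-access electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
  3. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.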

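Abstract (translated from Chinese) Addresses are among the most frequently used pieces of information in daily life. People often look up the addresses of physical stores, schools, or organizations on the Web and then use a map-marking service to confirm the actual location. However, not every website provides both an address and a map-marking function. The goal of this research is therefore to design a service that automatically extracts Chinese addresses from Web pages and combines them with map marking, plotting the extracted addresses together with their associated information on a map to give users a simple and convenient map-annotation service.
Our system has two parts. In the first part, Web pages are tokenized with two methods, single-Chinese-character splitting and Yahoo Chinese word segmentation. Conditional random fields (CRFs) are then trained with two labeling schemes, BIEO and IO, to produce an address extraction model, and input pages are run through this model in the testing process to extract addresses. In the second part, the extracted addresses serve as the basis for extracting address-related information from the page and for finding the boundaries of address blocks that contain both an address and its associated information. Experiments show that address extraction reaches an F-measure of about 0.99 when measured over the total set of addresses across all pages, and an average F-measure of about 0.97 when measured per individual page. Associated information is extracted correctly for 92% of the data.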
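To make the two labeling schemes named in the abstract concrete, here is a minimal Python sketch of character-level BIEO and IO labeling. It assumes the usual reading of the tags (B = first character of an address, I = inside, E = last character, O = outside). The function name `tag_characters`, the span representation, and the sample string are illustrative assumptions and are not taken from the thesis. CRF training and feature extraction are not shown.

```python
def tag_characters(text, address_spans, scheme="BIEO"):
    """Label every character of `text` for sequence-labeling training.

    `address_spans` is a list of (start, end) index pairs, end exclusive,
    marking where addresses occur (a hypothetical annotation format).
    """
    labels = ["O"] * len(text)  # everything is "outside" by default
    for start, end in address_spans:
        for i in range(start, end):
            if scheme == "IO":
                labels[i] = "I"   # IO only separates inside from outside
            elif i == start:
                labels[i] = "B"   # first character (also used for 1-char spans)
            elif i == end - 1:
                labels[i] = "E"   # last character of the address
            else:
                labels[i] = "I"   # interior character
    return list(zip(text, labels))


if __name__ == "__main__":
    sample = "地址:中壢市中大路300號"   # made-up example line
    for ch, tag in tag_characters(sample, [(3, 13)], scheme="BIEO"):
        print(ch, tag)
```

BIEO marks address boundaries explicitly, while IO leaves them to be inferred from the transition between I and O, which is one plausible reason for comparing the two schemes.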
Abstract (English) Address information is closely linked to people's daily lives. People often need to look up the addresses of shopping malls, schools, and organizations, and then use a map-marking service to locate them. However, not all Web pages provide both addresses and a map-marking facility. This research therefore designs a mechanism that automatically extracts Chinese addresses from Web pages and combines it with map marking, so that the extracted addresses and their related information are marked on a map. The service gives users a convenient and easy way to use map-marking information.
Our system is divided into two steps. The first step uses conditional random fields (CRFs) to train an address extraction model; input pages go through the model's testing process, which outputs address segments. The second step uses the extracted addresses as landmarks to extract related information and to find the correct boundaries of address blocks. In the experiments, the F-measure of CRF-based address extraction reaches 0.9914, and the accuracy of address block boundary identification is 0.9212.
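The F-measure figures above can be computed in two ways: pooled over all addresses on all pages, or averaged over per-page scores. The sketch below shows both, assuming the standard definition F = 2PR / (P + R). The function names and the per-page count tuples are hypothetical, and the thesis's exact matching criteria for a "correct" address are not specified in this record.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pooled_f(pages):
    """Sum the counts over all pages first, then compute one F-measure."""
    tp = sum(p[0] for p in pages)
    fp = sum(p[1] for p in pages)
    fn = sum(p[2] for p in pages)
    return f_measure(tp, fp, fn)


def per_page_average_f(pages):
    """Compute an F-measure for each page, then take the unweighted mean."""
    return sum(f_measure(*p) for p in pages) / len(pages)


if __name__ == "__main__":
    # (tp, fp, fn) per page; made-up numbers for illustration only
    pages = [(9, 1, 0), (1, 0, 1)]
    print(pooled_f(pages), per_page_average_f(pages))
```

Pages with few addresses weigh as much as large pages in the per-page average, so the two scores can differ even on the same predictions.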
Keywords ★ associated information extraction
★ conditional random fields
★ address extraction
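Table of contents
Abstract (Chinese)
Abstract (English)
List of Figures
List of Tables
1. Introduction
1.1. Motivation
1.2. Background
1.3. Chapter Overview
2. Related Work
2.1. Pattern-Based Method
2.2. Machine Learning Method
2.3. Related Work on Web Information Extraction
3. Address Extraction
3.1. Word Segmentation and Character Tokenization
3.2. Candidate String Segmentation
3.3. Feature Extraction
3.4. Learning Module
3.4.1. Conditional Random Fields
3.4.2. Training and Testing Processes
3.5. Address Extraction
3.5.1. Maximal Scoring Subsequence
4. Associated Information Extraction
4.1. Motivation for Extraction
4.2. Extraction Method
4.2.1. Block Segmentation
4.2.2. Separator Identification
5. Experimental Results and Analysis
5.1. Experimental Data and Evaluation Method
5.2. Address Extraction Experiments
5.3. Associated Information Extraction Experiments
6. Conclusion and Future Work
7. References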
Advisor Chia-Hui Chang (張嘉惠)   Approval date 2011-8-27

For questions about this thesis, please contact the Extension Services Section of the National Central University Library, TEL: (03) 422-7151 ext. 57407, or by e-mail.