A Study of Information Extraction from Conference Announcement Websites

Name

Shu-Han Hu (胡姝涵)

Department

In-service Master's Program, Department of Computer Science and Information Engineering

Thesis Title

A Study of Information Extraction from Conference Announcement Websites
(Conference Information Extraction: Segmentation Base Approach)

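Related Theses

★ A Study of Identifying Schedule-Invitation Emails and Extracting Irregular Time Expressions
★ Design of the NCUFree Campus Wireless Network Platform and Development of Application Services
★ Design and Implementation of an Extraction System for Semi-structured Data on the Internet
★ Mining and Applications of Non-simple Browsing Paths
★ Improvements to Association Rule Mining for Incremental Data
★ Applying the Chi-square Test of Independence to Associative Classification
★ Design and Study of a Chinese Information Extraction System
★ Visualization of Non-numerical Data and Clustering That Combines Subjective and Objective Criteria
★ A Study of Associated Word Groups in Document Summarization
★ Cleaning Web Pages: Page Segmentation and Data Region Extraction
★ Design and Study of a Question Answering System Using Sentence Classification and Ranking
★ Efficient Mining of Compact Frequent Sequential Event Patterns in Temporal Databases
★ Axis Arrangement of Star Coordinates for Cluster Visualization
★ A Study of Automatically Generating Web Crawlers from Browsing History
★ A Study of Template and Data Analysis for Dynamic Web Pages
★ A Study of Automated Data Integration for Homogeneous Web Pages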

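This electronic thesis is approved for immediate open access.
Electronic full texts that have reached their release date are licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid breaking the law.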

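Abstract (translated from Chinese)

As information technology advances, the speed and convenience of the Internet have led web pages to gradually replace paper as the main way of presenting data. However, the richness and variety of web presentation make extracting useful information effectively a major challenge. Information extraction takes unstructured data and, by organizing and filtering it, integrates it into structured data from which useful information can be retrieved. The most direct design is to hand-write an extractor for each website, tailored to that site. But a site's format can change at any time, and sites built by different authors use different formats, so every change or new site means modifying or writing another extraction program, which is very uneconomical. The main goal in designing an information extraction program is therefore to handle different website formats automatically. Automated information extraction relies on machine learning: the computer learns knowledge and extraction rules from past experience, so that it can extract correct information by itself.
　　This thesis targets international conference announcement websites and extracts conference information posted by different announcers, including the conference name, conference location, conference date, and paper acceptance date. Conference announcements are mostly plain text. They are written by many different posters for bulletin-style sites, and the content is usually expressed in brief, informal language with no fixed structure, which makes integrating and extracting the information difficult. To extract the correct information effectively, this thesis applies machine learning so that the computer learns to extract conference information from announcements by different posters automatically, and it obtains fairly good results.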
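To make the abstract's point about hand-written extractors concrete, here is a minimal sketch of the manual approach it describes as uneconomical. Everything in it is an assumption for illustration: the `ConferenceRecord` class, the field names, the regular expressions, and both announcement texts are made up and do not come from the thesis. It shows patterns written for one layout filling the four target fields, and the same patterns finding nothing once a poster words the announcement differently.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConferenceRecord:
    """The four target fields named in the abstract (hypothetical container)."""
    name: Optional[str] = None
    location: Optional[str] = None
    date: Optional[str] = None
    acceptance_date: Optional[str] = None


# Hand-written patterns for ONE made-up announcement layout.
# A different poster's wording would need different patterns.
PATTERNS = {
    "name": re.compile(r"^Call for Papers:\s*(.+)$", re.MULTILINE),
    "location": re.compile(r"^Location:\s*(.+)$", re.MULTILINE),
    "date": re.compile(r"^Dates?:\s*(.+)$", re.MULTILINE),
    "acceptance_date": re.compile(
        r"^Notification of acceptance:\s*(.+)$", re.MULTILINE
    ),
}


def extract(text: str) -> ConferenceRecord:
    """Fill each field from the first line that matches its pattern."""
    record = ConferenceRecord()
    for field_name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            setattr(record, field_name, match.group(1).strip())
    return record


# Made-up announcement that follows the layout the patterns expect.
ANNOUNCEMENT = """Call for Papers: Example Workshop on Web Mining
Location: Taipei, Taiwan
Date: December 1-3, 2006
Notification of acceptance: September 15, 2006
"""

# The same information in free-form wording: no pattern applies.
REWORDED = """We are pleased to announce the Example Workshop on Web Mining,
to be held in Taipei, Taiwan on December 1-3, 2006.
"""

if __name__ == "__main__":
    print(extract(ANNOUNCEMENT))
    print(extract(REWORDED))
```

The second call returns a record with every field left empty, which is the maintenance cost the abstract gives as the motivation for learning extraction rules instead of writing them by hand.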

Abstract (English)

With the progress of information technology, web pages are rapidly replacing traditional sheets of paper. The versatility and abundant content of web pages make extracting useful information far more difficult than before. Information extraction technology lets us extract such information from unstructured data through a series of processes such as arrangement, distillation, and integration. The most straightforward approach is to construct an extraction system manually according to the characteristics of each individual website, but because page structures can change and designers' personal styles differ, this is not cost-effective. Automated extraction is therefore the most desirable goal.
This thesis focuses on extracting conference information, such as conference names, locations, dates, and paper acceptance dates, from DB World and international conference web pages. Because bulletin-type conference web pages are text-rich and are written informally by different individuals without any structural harmonization, integration and extraction are difficult. The system, which is built on machine learning techniques, is validated to perform well in extracting specific fields from pages across websites.
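The thesis title and outline mention a segmentation approach based on sliding windows, but this record does not give the window sizes, features, or learner the thesis uses. The following is only a generic sketch of the idea, under stated assumptions: the tokenizer, the `CUE_WORDS` set, the feature names, and the sample sentence are all hypothetical. It enumerates candidate token windows from a sentence and computes simple features for each one, which a trained classifier (for example Naïve Bayes or an SVM, both listed in the outline) could then score.

```python
import re
from typing import Dict, Iterator, List

# Hypothetical cue words that often appear in conference names.
CUE_WORDS = {"conference", "workshop", "symposium", "international"}


def tokenize(text: str) -> List[str]:
    """Split into alphanumeric runs and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def sliding_windows(
    tokens: List[str], min_len: int, max_len: int
) -> Iterator[List[str]]:
    """Yield every contiguous token window whose length is in [min_len, max_len]."""
    for start in range(len(tokens)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            yield tokens[start:end]


def window_features(window: List[str]) -> Dict[str, int]:
    """Features a classifier could use to decide whether a window is a name."""
    return {
        "length": len(window),
        "cue_words": sum(t.lower() in CUE_WORDS for t in window),
        "capitalized": sum(t[:1].isupper() for t in window),
    }


if __name__ == "__main__":
    # Made-up sentence, not taken from the thesis data.
    sentence = "The Example Workshop on Web Mining will be held in Taipei"
    tokens = tokenize(sentence)
    for window in sliding_windows(tokens, min_len=2, max_len=3):
        print(" ".join(window), window_features(window))
```

In a real system the windows would be labeled as name or non-name in training data and the features passed to the learner; the hand-picked cue words here only stand in for whatever the learner would discover.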

Keywords

★ Information Extraction
★ Machine Learning

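Table of Contents

Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Design Overview
1.3 Thesis Organization
Chapter 2 Related Work and Techniques
2.1 SRV System
2.2 Rapier System
2.3 STALKER System
2.4 GATE ANNIE
2.5 Naïve Bayes Classifier
2.6 SVM
2.7 FOIL Algorithm
Chapter 3 Design and Implementation
3.1 Conference Name
3.1.1 Conference Name Segmentation (Sliding Windows)
3.1.2 Conference Name Tokenization
3.2 Conference Location, Conference Date, and Paper Acceptance Date
3.2.1 Conference Location, Conference Date, and Paper Acceptance Date: Tokenization
3.2.2 Conference Location, Conference Date, and Paper Acceptance Date: Contextual Rule
Chapter 4 Experiments and Discussion
4.1 Conference Name Results
4.2 Conference Location Results
4.3 Conference Date Results
4.4 Paper Acceptance Date Results
4.5 Discussion
Chapter 5 Conclusion and Future Work
References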

Advisor

Chia-Hui Chang (張嘉惠)

Approval Date

2006-7-24
