Thesis 90522059: Detailed Record


Name Shih-Chien Kuo (郭釋謙)   Department Department of Computer Science and Information Engineering
Thesis Title On-Line Extraction Rule Analysis
(線上擷取規則分析)
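Files
  1. The access permission for this electronic thesis is: approved for immediate open access.
  2. The open-access electronic full text is licensed to users solely for academic research, for personal, non-profit searching, reading, and printing.
  3. Please comply with the relevant provisions of the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.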

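Abstract (Chinese) With the growth of the Internet, more and more information is presented in HTML, with useful and useless information mixed together, so users often spend a great deal of time searching for data. An information extraction system addresses this by presenting its input data in a structured form, which in turn makes it possible to integrate data and build richer search engines.
The most direct way to build an information extraction system is to hand-write a data-extraction wrapper program for each Web site. However, because a site's format may change at any time, generating extraction programs quickly and automatically is the greatest challenge in designing such a system.
Since 1997, wrapper induction methods have been proposed: the user labels example Web pages to tell the system what to extract, the system generates extraction rules, and those rules are then used to extract the site's information. Although this labeling-based approach achieves good extraction rates, producing the rules requires a very tedious labeling effort and is therefore inconvenient for users, so reducing user labeling is a major design challenge. Existing systems that need no user labeling, such as IEPAD, can only handle pages that contain multiple records and offer no solution for single-record pages. In view of this, this thesis proposes an effective method for building an automated information extraction system that lets users extract complete data without tedious labeling, and that handles both single-record and multiple-record Web pages.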
Abstract (English) The vast amount of online information available has led to renewed interest in information extraction (IE) systems that analyze input documents to produce a structured representation of selected information from the documents. However, the design of an IE system differs greatly according to its input, which ranges from unrestricted free text to semi-structured Web documents. This thesis extends an automatic pattern discovery approach called IEPAD to the rapid generation of IE systems that can extract structured data from semi-structured Web documents. In this novel framework, extraction rules can be trained not only from a multiple-record Web page but also from multiple single-record Web pages (called singular pages). Above all, this framework requires none of the annotation labor that many machine-learning-based approaches depend on. Evaluation results show a high level of system performance.
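The idea of training an extraction rule from several singular pages without labeling can be illustrated with a minimal sketch. This is not the thesis's algorithm (the record does not describe it in detail); it simply assumes that all example pages share one token structure, treats tokens that are identical across pages as template, and treats text that varies as data slots. The function names `tokenize`, `induce_rule`, and `extract` are hypothetical.

```python
import re

# A token is either one HTML tag or one run of text between tags.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into tag tokens and non-empty, stripped text tokens."""
    tokens = []
    for raw in TOKEN_RE.findall(html):
        raw = raw.strip()
        if raw:
            tokens.append(raw)
    return tokens


def is_tag(token):
    return token.startswith("<") and token.endswith(">")


def induce_rule(pages):
    """Build a rule from unlabeled singular pages.

    Assumption (for illustration only): every page has the same tags in
    the same order and the same number of text tokens. Constant tokens
    are kept as template; varying text becomes a slot, marked as None.
    """
    token_lists = [tokenize(page) for page in pages]
    length = len(token_lists[0])
    if any(len(tokens) != length for tokens in token_lists):
        raise ValueError("pages do not share one token structure")
    rule = []
    for column in zip(*token_lists):
        if len(set(column)) == 1:
            rule.append(column[0])  # identical on every page: template
        elif any(is_tag(token) for token in column):
            raise ValueError("tag structure differs between pages")
        else:
            rule.append(None)  # text differs between pages: data slot
    return rule


def extract(rule, page):
    """Apply a rule to a new page and return the slot values in order."""
    tokens = tokenize(page)
    if len(tokens) != len(rule):
        raise ValueError("page does not match the rule")
    fields = []
    for expected, token in zip(rule, tokens):
        if expected is None:
            fields.append(token)
        elif expected != token:
            raise ValueError("template mismatch at token: " + token)
    return fields


training_pages = [
    "<html><h1>Book A</h1><p>Price: </p><b>10</b></html>",
    "<html><h1>Book B</h1><p>Price: </p><b>12</b></html>",
]
rule = induce_rule(training_pages)
fields = extract(rule, "<html><h1>Book C</h1><p>Price: </p><b>15</b></html>")
```

Here the constant text "Price:" stays in the template because it is the same on both training pages, while the title and the price become slots. Real pages rarely align token for token, which is why a practical system needs generalization and alignment steps beyond this sketch.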
Keywords (Chinese) ★ Information Integration (資訊整合)
★ Information Retrieval (資料檢索)
★ Information Extraction (資訊擷取)
Keywords (English) ★ Information Integration
★ Information Extraction
★ Information Retrieval
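Table of Contents Chapter 1 Introduction
Chapter 2 Related Work
2.1 Information Extraction Systems Requiring User Labeling
2.2 Information Extraction Systems Requiring No Labeling
2.3 WYSIWYG Information Extraction Systems
Chapter 3 System Architecture
3.1 Example
3.2 Target Region Selection (Enclosing)
3.3 Generalization
3.4 Specifying Detailed Data
3.5 Multiple Enclosing
3.6 Extraction Rules
Chapter 4 Extractor
Chapter 5 Experimental Results and Discussion
5.1 Extracting Multiple-Record Pages
5.2 Extracting Singular Pages
Chapter 6 Conclusion and Future Work
References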
References [1] N. Ashish and C. Knoblock. Wrapper generation for semi-structured internet sources. SIGMOD Record, 26(4):8–15, 1997.
[2] R. Baumgartner, S. Flesca, and G. Gottlob. Supervised wrapper generation with Lixto. In Proceedings of VLDB Demo, 2001.
[3] C.-H. Chang and S.-C. Lui. IEPAD: Information extraction based on pattern discovery. In Proceedings of the 10th International Conference on World Wide Web, pages 681–688, Hong Kong, May 2–6, 2001.
[4] B. Chidlovskii, J. Ragetli, and M. de Rijke. Automatic wrapper generation for web search engines. In Proceedings of the 1st International Conference on Web-Age Information Management (WAIM’2000), LNCS Series, Shanghai, China, 2000.
[5] D. Embley, Y. Jiang, and Y.-K. Ng. Record-boundary discovery in web documents. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’99), pages 467–478, Philadelphia, PA, 1999.
[6] D. Freitag. Information extraction from HTML: Application of a general machine learning approach. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 517–523, 1998.
[7] C.-N. Hsu and C.-C. Chang. Finite-state transducers for semi-structured text mining. In Proceedings of IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, pages 38–49, Stockholm, Sweden, 1999.
[8] C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521–538, 1998.
[9] I. Muslea, S. Minton, and C. Knoblock. A hierarchical approach to wrapper induction. In Proceedings of the 3rd International Conference on Autonomous Agents (Agents'99), Seattle, WA, USA, 1999.
[10] G. Huck, P. Fankhauser, K. Aberer, and E.J. Neuhold. Jedi: Extracting and synthesizing information from the web. In Proc. of COOPIS, 1998.
[11] C. Knoblock, S. Minton, J. Ambite, et al. Modeling web sources for information integration. In Proceedings of the 15th National Conference on Artificial Intelligence and Tenth Innovative Applications of Artificial Intelligence Conference, pages 211–218, Wisconsin, USA, 1998.
[12] N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper induction for information extraction. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI), pages 729–737, Japan, 1997.
[13] W.-Y. Lin and W. Lam. Learning to extract hierarchical information from semi-structured documents. In Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, pages 250–257, VA, USA, 2000.
[14] L. Liu, C. Pu, and W. Han. XWRAP: An XML-enabled wrapper construction system for web information sources. In Proceedings of ICDE, 2000.
[15] W. May, R. Himmeroder, G. Lausen, and B. Ludascher. A unified framework for wrapping, mediating and restructuring information from the web. In Proc. of WWWCM, 1999.
[16] A. Sahuguet and F. Azavant. Building light-weight wrappers for legacy web data-sources using W4F. In Proceedings of VLDB, 1999.
[17] A. Sahuguet and F. Azavant. Building intelligent web applications using lightweight wrappers. Data and Knowledge Engineering, 36(3):283–316, 2001.
[18] S. Soderland. Learning to extract text-based information from the world wide web. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, pages 233–272, CA, USA, 1997.
[19] S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, 1999.
[20] G. Gonnet, R. Baeza-Yates, and T. Snider. New indices for text: PAT trees and PAT arrays. In W. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, chapter 5, pages 66–82. Prentice Hall, Englewood Cliffs, NJ, USA, 1992.
[21] World Wide Web Consortium (W3C), http://www.w3c.org
Advisor Chia-Hui Chang (張嘉惠) Approval Date 2003-7-15

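For questions about this thesis, please contact the Extension Services Section of the National Central University Library, TEL: (03)422-7151 ext. 57407, or by e-mail.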