On-Line Extraction Rule Analysis (線上擷取規則分析)

Name

郭釋謙 (Shih-Chien Kuo)

Department

Department of Computer Science and Information Engineering (資訊工程學系)

Thesis title

On-Line Extraction Rule Analysis
(線上擷取規則分析)

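Related theses

★ A Study on Identifying Schedule-Invitation E-mails and Extracting Irregular Time Expressions
★ Design of the NCUFree Campus Wireless Network Platform and Development of Application Services
★ Design and Implementation of a Semi-Structured Data Extraction System for the Internet
★ Mining and Applications of Non-Simple Browsing Paths
★ Improvements to Association Rule Mining on Incremental Data
★ Applying the Chi-Square Test of Independence to Associative Classification
★ Design and Study of a Chinese Information Extraction System
★ Visualization of Non-Numerical Data and Clustering That Combines Subjective and Objective Criteria
★ A Study of Associated Word Groups in Document Summarization
★ Cleaning Web Pages: Page Segmentation and Data-Region Extraction
★ Design and Study of a Question Answering System Using Sentence Classification and Ranking
★ Efficient Mining of Compact Frequent Serial Episode Patterns in Temporal Databases
★ Axis Arrangement in Star Coordinates Applied to Cluster Visualization
★ A Study on Automatically Generating Web Crawlers from Browsing History
★ A Study of Template and Data Analysis for Dynamic Web Pages
★ A Study on Automating Data Integration across Homogeneous Web Pages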

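The author has approved this electronic thesis for immediate open access.
Once open access takes effect, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, so as to avoid breaking the law.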

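Abstract (translated from the Chinese)

With the growth of the Internet, more and more information is presented as HTML, with useful and useless information mixed together, so users often spend a great deal of time searching for data. An information extraction system addresses this by presenting the input data in structured form, which makes it possible to integrate the data and build richer search engines.
The most direct way to build an information extraction system is to hand-write a wrapper (an extraction program) for each Web site. Because a site's format may change at any time, however, the biggest challenge in designing an extraction system is generating the extraction program quickly and automatically.
Wrapper induction methods have been proposed since 1997: the user labels example pages to tell the system what to extract, the system generates extraction rules, and the rules are then used to extract information from the site. Approaches based on labeled example pages achieve good extraction rates, but the rules can be generated only after very laborious labeling, which makes them inconvenient for users. Reducing the labeling required of users is therefore a major challenge in system design. Current systems that need no user labeling, such as IEPAD, handle only pages that contain multiple records and offer no solution for single-record pages. This thesis therefore proposes an effective method for building an automated information extraction system that lets users extract the data completely without laborious labeling and that handles both single-record and multiple-record pages.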

Abstract (English)

The vast amount of online information available has led to renewed interest in information extraction (IE) systems, which analyze input documents to produce a structured representation of selected information from those documents. However, the design of an IE system differs greatly according to its input, which ranges from unrestricted free text to semi-structured Web documents. This thesis extends an automatic pattern discovery approach called IEPAD to the rapid generation of IE systems that can extract structured data from semi-structured Web documents. In this framework, extraction rules can be trained not only from a multiple-record Web page but also from multiple single-record Web pages (called singular pages). Above all, the framework requires none of the annotation labor that many machine-learning-based approaches need. Evaluation results show a high level of system performance.
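To make the idea of label-free pattern discovery on a multiple-record page concrete, here is a minimal Python sketch. It is an illustration only, not the algorithm of IEPAD or of this thesis: it abstracts a page into a sequence of tag tokens and then searches by brute force for the contiguous repeated pattern that covers the most tokens. The function names (`tokenize`, `best_tandem_repeat`, `extract_records`) and the sample page are hypothetical, and a real system would need a far more robust HTML parser and a more efficient repeat-finding structure.

```python
import re

# A tag (opening or closing) or a run of text between tags.
TOKEN_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)")


def tokenize(html):
    """Abstract a page into (token, value) pairs.

    Tags become tokens such as "<LI>" or "</LI>" with value None;
    non-blank text becomes the token "TEXT" with the stripped text as value.
    """
    pairs = []
    for match in TOKEN_RE.finditer(html):
        closing, name, text = match.groups()
        if name:
            pairs.append((f"<{closing}{name.upper()}>", None))
        elif text and text.strip():
            pairs.append(("TEXT", text.strip()))
    return pairs


def best_tandem_repeat(tokens, min_len=2):
    """Return (start, period, repetitions) of the contiguous repeat
    that covers the most tokens, or (0, 0, 0) if nothing repeats.

    Brute force (roughly cubic time): acceptable for a sketch only.
    """
    best_cover, best = 0, (0, 0, 0)
    n = len(tokens)
    for period in range(min_len, n // 2 + 1):
        for start in range(n - 2 * period + 1):
            pattern = tokens[start:start + period]
            reps = 1
            while tokens[start + reps * period:
                         start + (reps + 1) * period] == pattern:
                reps += 1
            if reps >= 2 and reps * period > best_cover:
                best_cover, best = reps * period, (start, period, reps)
    return best


def extract_records(html):
    """Split the page at the discovered repeat and return the text
    fields of each repetition as one record."""
    pairs = tokenize(html)
    start, period, reps = best_tandem_repeat([tok for tok, _ in pairs])
    records = []
    for r in range(reps):
        chunk = pairs[start + r * period:start + (r + 1) * period]
        records.append([value for tok, value in chunk if tok == "TEXT"])
    return records


# Hypothetical multiple-record page: three records with the same layout.
PAGE = "<ul><li><b>A</b> 1</li><li><b>B</b> 2</li><li><b>C</b> 3</li></ul>"
print(extract_records(PAGE))
```

On a singular page that holds only one record there is no repeat to find, so `extract_records` returns an empty list. That is the gap the abstract describes, which the thesis addresses by learning rules from several single-record pages instead.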

Keywords

★ Information Integration (資訊整合)
★ Information Retrieval (資料檢索)
★ Information Extraction (資訊擷取)

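Table of contents

Chapter 1 Introduction
Chapter 2 Related Work
2.1 Information Extraction Systems That Require User Labeling
2.2 Information Extraction Systems That Require No Labeling
2.3 WYSIWYG Information Extraction Systems
Chapter 3 System Architecture
3.1 Example
3.2 Enclosing the Target Region
3.3 Generalization
3.4 Specifying Fine-Grained Data
3.5 Multiple Enclosing
3.6 Extraction Rules
Chapter 4 Extractor
Chapter 5 Experimental Results and Discussion
5.1 Extracting Multiple-Record Pages
5.2 Extracting Singular Pages
Chapter 6 Conclusion and Future Work
References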

References

[1] N. Ashish and C. Knoblock. Wrapper generation for semi-structured internet sources. SIGMOD Record, 26(4):8–15, 1997.
[2] R. Baumgartner, S. Flesca, and G. Gottlob. Supervised wrapper generation with Lixto. In Proceedings of VLDB Demo, 2001.
[3] C.-H. Chang and S.-C. Lui. IEPAD: Information extraction based on pattern discovery. In Proceedings of the 10th International Conference on World Wide Web, pages 681–688, Hong Kong, May 2–6 2001.
[4] B. Chidlovskii, J. Ragetli, and M. Rijke. Automatic wrapper generation for web search engines. In Proceedings of the 1st International Conference on Web-Age Information Management (WAIM’2000), LNCS Series, Shanghai, China, 2000.
[5] D. Embley, Y. Jiang, and Y.-K. Ng. Record-boundary discovery in web documents. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’99), pages 467–478, Philadelphia, PA, 1999.
[6] D. Freitag. Information extraction from HTML: Application of a general machine learning approach. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 517–523, 1998.
[7] C.-N. Hsu and C.-C. Chang. Finite-state transducers for semi-structured text mining. In Proceedings of IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, pages 38–49, Stockholm, Sweden, 1999.
[8] C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521–538, 1998.
[9] I. Muslea, S. Minton, and C. Knoblock. A hierarchical information from semi-structured documents. In Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, pages 250–257, VA, USA, 2000.
[10] G. Huck, P. Fankhauser, K. Aberer, and E.J. Neuhold. Jedi: Extracting and synthesizing information from the web. In Proc. of COOPIS, 1998.
[11] C. Knoblock, S. Minton, J. Ambite, et al. Modeling web sources for information integration. In Proceedings of the 15th National Conference on Artificial Intelligence and Tenth Innovative Applications of Artificial Intelligence Conference, pages 211–218, Wisconsin, USA, 1998.
[12] N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper induction for information extraction. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI), pages 729–737, Japan, 1997.
[13] W.-Y. Lin and W. Lam. Learning to extract hierarchical information from semi-structured documents. In Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, pages 250–257, VA, USA, 2000.
[14] L. Liu, C. Pu, and W. Han. XWRAP: An XML-enabled wrapper construction system for web information sources. In Proceedings of ICDE, 2000.
[15] W. May, R. Himmeroder, G. Lausen, and B. Ludascher. A unified framework for wrapping, mediating and restructuring information from the web. In Proc. of WWWCM, 1999.
[16] A. Sahuguet and F. Azavant. Building light-weight wrappers for legacy web data-sources using W4F. In Proceedings of VLDB, 1999.
[17] A. Sahuguet and F. Azavant. Building intelligent web applications using lightweight wrappers. Data and Knowledge Engineering, 36(3):283–316, 2001.
[18] S. Soderland. Learning to extract text-based information from the world wide web. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, pages 233–272, CA, USA, 1997.
[19] S. Soderland. Learning information extraction rules for semi-structured and free text. Journal of Machine Learning, 34(1-3):233–272, 1999.
[20] G. Gonnet, R. Baeza-Yates, and T. Snider. New indices for text: PAT trees and PAT arrays. In B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, chapter 5, pages 66–82. Prentice Hall, Englewood Cliffs, NJ, USA, 1992.
[21] World Wide Web Consortium (W3C), http://www.w3c.org

Advisor

張嘉惠 (Chia-Hui Chang)

Approval date

2003-07-15