Page-level Wrapper Verification based on Structure, Semantic and Schema

Name

Yen-ling Lin (林衍伶)

Department

Computer Science and Information Engineering

Thesis Title

Page-level Wrapper Verification based on Structure, Semantic and Schema
(非監督式網頁層次包覆程式之驗證-一個基於樣板、語意及結構描述驗證之擷取程式)

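This electronic thesis is authorized for immediate open access.
The open-access full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please observe the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid breaking the law.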

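Abstract (Chinese)

Over the past decade, many studies in web data extraction have proposed unsupervised information extraction methods. Work on verifying and maintaining the resulting extractors is comparatively scarce, and some even doubt whether unsupervised extraction programs exist at all. In this thesis we propose a novel method that addresses both problems together: it implements the extractor of an unsupervised information extraction system and, at the same time, verifies that the extractor is still valid. First, we exploit the strict rules of XML to check roughly whether the XML data description of a new page matches that of the existing pages. Next, we compare the content of each leaf node in the DOM tree with the content found at the same path in the old pages. Finally, we convert the basic nodes and the templates at the leaf nodes into a finite state machine and use it to verify whether the order of the basic nodes and leaf-node templates in the new page obeys the schema. If a new page passes these stages, its data can be extracted at the same time; otherwise, the original wrapper no longer applies. Our method is therefore not only a wrapper verifier but also an extractor for page-level unsupervised information extraction. We also test how many input pages unsupervised information extraction needs. With only two input pages, 94% of the pages pass the extractor's template verification but only 40% pass schema verification. With three input pages, 95% pass template verification and the share passing schema verification rises to 81%.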
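The second stage described above (comparing each leaf node's content with the content at the same DOM path in the old pages) might be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the function names `leaf_paths`, `kind`, and `compare_leaves` are hypothetical, pages are assumed to be already tidied into well-formed XML, and the coarse "content kind" test stands in for whatever semantic comparison the thesis actually uses.

```python
import xml.etree.ElementTree as ET


def leaf_paths(xml_text):
    """Map each leaf node's tag path (e.g. 'html/body/h1') to its text values."""
    root = ET.fromstring(xml_text)
    result = {}

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        children = list(node)
        if not children:
            # A node without child elements is treated as a leaf.
            result.setdefault(path, []).append((node.text or "").strip())
        for child in children:
            walk(child, path)

    walk(root, "")
    return result


def kind(text):
    """Hypothetical coarse content type; an assumption, not the thesis's measure."""
    if text == "":
        return "empty"
    if text.replace(".", "", 1).isdigit():
        return "number"
    return "text"


def compare_leaves(old_xml, new_xml):
    """Return leaf paths of the new page that are unknown in the old page,
    or whose content kind was never seen at that path in the old page."""
    old = leaf_paths(old_xml)
    new = leaf_paths(new_xml)
    mismatches = []
    for path, values in new.items():
        if path not in old:
            mismatches.append(path)
            continue
        seen = {kind(v) for v in old[path]}
        if any(kind(v) not in seen for v in values):
            mismatches.append(path)
    return mismatches


old_page = "<html><body><h1>Title</h1><span>12.5</span></body></html>"
new_page = "<html><body><h1>Other</h1><span>n/a</span></body></html>"
print(compare_leaves(old_page, new_page))
```

An empty result would mean that every leaf of the new page sits at a known path with a previously seen kind of content. A non-empty result lists the paths where the old wrapper may no longer apply.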

Abstract (English)

Unsupervised information extraction has been studied extensively over the past decade, but the verification of the wrappers it produces has received far less attention. This thesis focuses on wrapper verification for unsupervised information extraction. We propose a novel method that addresses two problems together, wrapper verification and extractor construction, for the unsupervised information extraction system FiVaTech. We first use XML validation to roughly verify the template of new web pages fetched at a later time. Next, we compare the content at each path of the parsed DOM tree with the contents at the same path in the existing pages. Finally, we use a finite state machine built from the basic nodes and the leaf-node (terminal) templates to verify that their order complies with the existing schema. If a new web page passes these stages, its data is extracted at the same time. Our approach therefore acts not only as a verifier but also as an extractor for page-level unsupervised information extraction. With the verifier, we show that at least three input pages are required for 95% and 81% of the pages to pass template and schema verification, respectively.
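The final stage, checking the order of a page's nodes against the schema with a finite state machine, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the schema is given as a hypothetical flat list of `(symbol, modifier)` pairs with modifiers `"1"` (exactly once), `"?"` (optional), `"+"` (one or more) and `"*"` (zero or more), and `build_fsm` and `accepts` are invented names. The construction used in the thesis itself is not given in this record.

```python
REPEATABLE = ("+", "*")  # the item may occur more than once
SKIPPABLE = ("?", "*")   # the item may be absent


def build_fsm(schema):
    """Turn a list of (symbol, modifier) pairs into a nondeterministic FSM.

    State i means "the first i schema items have been handled".
    Returns (transitions, accepting_states).
    """
    n = len(schema)
    transitions = {}  # (state, symbol) -> set of next states
    for state in range(n + 1):
        # A repeatable item that was just consumed may be consumed again.
        if state > 0 and schema[state - 1][1] in REPEATABLE:
            transitions.setdefault((state, schema[state - 1][0]), set()).add(state)
        # Move forward, stepping over items that may be absent.
        for j in range(state, n):
            symbol, modifier = schema[j]
            transitions.setdefault((state, symbol), set()).add(j + 1)
            if modifier not in SKIPPABLE:
                break
    # A state is accepting if every remaining item may be absent.
    accepting = set()
    for state in range(n, -1, -1):
        accepting.add(state)
        if state == 0 or schema[state - 1][1] not in SKIPPABLE:
            break
    return transitions, accepting


def accepts(fsm, symbols):
    """Run the FSM over a page's symbol sequence and report whether it fits."""
    transitions, accepting = fsm
    current = {0}
    for symbol in symbols:
        current = set().union(
            *(transitions.get((s, symbol), set()) for s in current)
        )
        if not current:
            return False  # no transition: the order violates the schema
    return bool(current & accepting)


# Hypothetical schema: one title, an optional author, one or more prices.
fsm = build_fsm([("title", "1"), ("author", "?"), ("price", "+")])
print(accepts(fsm, ["title", "author", "price", "price"]))
print(accepts(fsm, ["author", "title", "price"]))
```

Tracking a set of current states keeps the sketch correct even when the same symbol appears at several positions of the schema, at the cost of a slightly slower run than a fully determinized machine.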

Keywords (Chinese)

★ Unsupervised wrapper (非監督式擷取器)
★ Wrapper verification (擷取器之驗證)

Keywords (English)

★ Unsupervised wrapper induction
★ Wrapper verification

Table of Contents

Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Work
Chapter 3 Preliminary
Chapter 4 Three-staged Wrapper Verification
4.1 Template Verification
4.2 Semantic Comparison
4.3 Schema Verification
4.3.1 Transform into the Finite State Machine
4.3.2 Verifying the new pages using the Finite State Machine
Chapter 5 Experiments
5.1 Synthetic page verification
5.2 Real-world page verification
5.3 Performance analysis on efficiency
Chapter 6 Conclusion and Future Work
References

Advisor

Chia-hui Chang (張嘉惠)

Approval Date

2010-7-27
