Verification of Unsupervised Page-Level Wrappers: An Extractor Based on Template, Semantic, and Schema Verification; Page-level Wrapper Verification based on Structure, Semantic and Schema

Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/44715

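Title:	Verification of Unsupervised Page-Level Wrappers: An Extractor Based on Template, Semantic, and Schema Verification; Page-level Wrapper Verification based on Structure, Semantic and Schema
Author:	林衍伶; Yen-ling Lin
Contributor:	Graduate Institute of Computer Science and Information Engineering
Keywords:	unsupervised extractor; wrapper verification; unsupervised wrapper induction
Date:	2010-07-27
Uploaded:	2010-12-09 13:53:32 (UTC+8)
Publisher:	National Central University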
Abstract:	Over the past decade, many unsupervised information extraction methods have been proposed for web data extraction. Far less attention has been paid to wrapper verification and maintenance, and some researchers even doubt that extraction programs for unsupervised approaches exist. This thesis proposes a novel method that addresses both problems for an unsupervised information extraction system, FiVaTech: it implements the extractor and verifies the wrapper's validity at the same time. First, we exploit the strict rules of XML validation to roughly check whether a newly crawled page conforms to the XML data description of the existing pages. Next, we compare the content of each leaf node in the parsed DOM tree with the content found at the same path in the existing pages. Finally, we convert the basic nodes and the leaf-node templates into a finite state machine and use it to verify that their order in the new page complies with the existing schema. If a new page passes all three stages, its data is extracted at the same time; otherwise, the original wrapper no longer applies. The approach therefore acts as both a wrapper verifier and an extractor for page-level unsupervised information extraction. We also measure how many input pages unsupervised extraction needs: with only two input pages, 94% of pages pass template verification but only 40% pass schema verification; with three input pages, 95% pass template verification and 81% pass schema verification.
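
The three verification stages described in the abstract can be illustrated with a small sketch. This is not the thesis's or FiVaTech's implementation; it is a minimal approximation under stated assumptions. The Python standard library has no XML Schema validator, so stage 1 is replaced by a check that every root-to-leaf tag path in the new page was seen in the known pages. Stage 2 compares only a coarse content type (number, text, empty) per path. Stage 3 uses a simple automaton whose transitions are the consecutive leaf-path pairs observed in the known pages. The names `leaves`, `kind`, and `PageVerifier` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def leaves(xml_text):
    """Return (path, text) for every leaf element, in document order."""
    root = ET.fromstring(xml_text)
    out = []

    def walk(node, path):
        path = f"{path}/{node.tag}"
        children = list(node)
        if not children:
            out.append((path, (node.text or "").strip()))
        for child in children:
            walk(child, path)

    walk(root, "")
    return out


def kind(text):
    """Coarse content type; a hypothetical stand-in for content comparison."""
    if not text:
        return "empty"
    if re.fullmatch(r"[\d.,]+", text):
        return "number"
    return "text"


class PageVerifier:
    """Hypothetical verifier learned from pages the wrapper already handles."""

    START, END = "^", "$"

    def __init__(self, known_pages):
        self.kinds = {}            # leaf path -> content kinds seen there
        self.transitions = set()   # allowed (state, next_state) pairs
        for page in known_pages:
            seq = leaves(page)
            for path, text in seq:
                self.kinds.setdefault(path, set()).add(kind(text))
            symbols = [self.START] + [p for p, _ in seq] + [self.END]
            self.transitions.update(zip(symbols, symbols[1:]))

    def verify(self, page):
        """Return (name of the failed stage or None, extracted leaf data)."""
        try:
            seq = leaves(page)
        except ET.ParseError:
            return "template", []
        # Stage 1 (assumed stand-in for XML validation): only known paths.
        if any(path not in self.kinds for path, _ in seq):
            return "template", []
        # Stage 2: content at each path must look like earlier content there.
        if any(kind(text) not in self.kinds[path] for path, text in seq):
            return "content", []
        # Stage 3: the order of leaf paths must be accepted by the automaton.
        symbols = [self.START] + [p for p, _ in seq] + [self.END]
        if any(t not in self.transitions for t in zip(symbols, symbols[1:])):
            return "schema", []
        return None, seq


page1 = "<html><body><h1>Book A</h1><ul><li>12.50</li></ul></body></html>"
page2 = ("<html><body><h1>Book B</h1>"
         "<ul><li>8.00</li><li>9.75</li></ul></body></html>")
verifier = PageVerifier([page1, page2])
new_page = "<html><body><h1>Book C</h1><ul><li>3.20</li><li>4</li></ul></body></html>"
failed_stage, data = verifier.verify(new_page)
```

In this toy setup, a verifier built from `page1` alone rejects `page2` at the schema stage, because a repeated `li` was never observed. That loosely mirrors the reported effect that schema verification benefits from more input pages, although the thesis's figures come from its own system and data, not from this sketch.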
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Master's and Doctoral Theses

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	669

All items in NCUIR are protected by the original copyright.