Page-level Wrapper Verification based on Structure, Semantic and Schema

DC Field	Value	Language
DC.contributor	資訊工程學系	zh_TW
DC.creator	林衍伶	zh_TW
DC.creator	Yen-ling Lin	en_US
dc.date.accessioned	2010-07-27T07:39:07Z
dc.date.available	2010-07-27T07:39:07Z
dc.date.issued	2010
dc.identifier.uri	http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=975202029
dc.contributor.department	資訊工程學系	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
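dc.description.abstract	Over the past decade, many studies in web data extraction have proposed unsupervised information extraction methods. Wrapper verification and maintenance, however, have received far less attention, and some even doubt whether unsupervised extraction programs exist at all. In this thesis, we propose a novel method that addresses both problems at once: it implements the extractor of an unsupervised information extraction system and, at the same time, verifies that the extractor is still valid. First, we exploit the strict rules of XML to roughly check whether the XML data description of a new page matches that of the existing pages. Next, we compare the content of each leaf node in the DOM tree with the content found at the same path in the old pages. Finally, we convert the basic nodes and the templates at the leaf nodes into a finite state machine and use it to verify that the order of basic nodes and leaf-node templates in the new page obeys the schema. If a new page passes these stages, its data can be extracted at the same time; otherwise, the original wrapper no longer applies. Our method is thus not only a wrapper verifier but also an extractor for page-level unsupervised information extraction. We also test how many input pages unsupervised information extraction needs: with only two input pages, 94% of pages pass the extractor's template verification but only 40% pass schema verification; with three input pages, 95% pass template verification and the share passing schema verification rises to 81%.	zh_TW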
dc.description.abstract	Unsupervised information extraction has been studied extensively over the past decade, but its wrapper verification has received little attention. In this paper, we propose a novel method that addresses two problems at once for the unsupervised information extraction system FiVaTech: wrapper verification and construction of the extractor. We first use XML validation to roughly verify the template of new web pages fetched at a later time. Next, we compare the content at each path of the parsed DOM tree with the content at the same path in the existing pages. Finally, we use a finite state machine built from basic nodes and leaf-node templates to verify that their order complies with the existing schema. If a new web page passes these stages, its data is extracted at the same time. Our approach therefore acts as both a verifier and an extractor for page-level unsupervised information extraction. With the verifier, we show that at least three input pages are required for 95% and 81% of the pages to pass template and schema verification, respectively.	en_US
DC.subject	非監督式擷取器	zh_TW
DC.subject	擷取器之驗證	zh_TW
DC.subject	Unsupervised wrapper induction	en_US
DC.subject	wrapper verification	en_US
DC.title	非監督式網頁層次包覆程式之驗證-一個基於樣板、語意及結構描述驗證之擷取程式	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	Page-level Wrapper Verification based on Structure, Semantic and Schema	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US
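
The three verification stages described in the abstract can be illustrated with a minimal Python sketch. It is not the FiVaTech implementation: the helper names (`leaf_paths`, `kind`, `structure_ok`, `content_ok`, `order_ok`, `verify_and_extract`), the toy pages, the crude number/text content signature, and the hand-written transition table are all assumptions made for illustration. A leaf-path subset check stands in for XML validation, and the finite state machine encodes a hypothetical schema of repeated (title, price) pairs.

```python
import xml.etree.ElementTree as ET

def leaf_paths(xml_text):
    """Return (path, text) pairs for every leaf element, in document order."""
    root = ET.fromstring(xml_text)
    out = []

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        children = list(node)
        if not children:
            out.append((path, (node.text or "").strip()))
        for child in children:
            walk(child, path)

    walk(root, "")
    return out

def kind(text):
    """Crude content signature (an assumption, not the thesis's measure)."""
    return "number" if text.replace(".", "", 1).isdigit() else "text"

def structure_ok(old, new):
    """Stage 1 stand-in: the new page uses only leaf paths seen before."""
    return {p for p, _ in new} <= {p for p, _ in old}

def content_ok(old, new):
    """Stage 2 stand-in: content at each path has a kind seen at that path."""
    seen = {}
    for p, t in old:
        seen.setdefault(p, set()).add(kind(t))
    return all(kind(t) in seen.get(p, set()) for p, t in new)

# Stage 3 stand-in: hypothetical DFA for the schema (title, price)+
TRANSITIONS = {
    ("start", "/ul/li/b"): "title",
    ("title", "/ul/li/i"): "price",
    ("price", "/ul/li/b"): "title",
}
ACCEPTING = {"price"}

def order_ok(new):
    """Run the leaf paths through the DFA; reject on any missing transition."""
    state = "start"
    for path, _ in new:
        state = TRANSITIONS.get((state, path))
        if state is None:
            return False
    return state in ACCEPTING

def verify_and_extract(old_xml, new_xml):
    """Return the extracted leaf values if all stages pass, else None."""
    old, new = leaf_paths(old_xml), leaf_paths(new_xml)
    if structure_ok(old, new) and content_ok(old, new) and order_ok(new):
        return [t for _, t in new]
    return None

# Toy pages (invented for this sketch)
OLD = "<ul><li><b>Book A</b><i>12.50</i></li></ul>"
GOOD = "<ul><li><b>Book B</b><i>8</i></li><li><b>Book C</b><i>30.00</i></li></ul>"
BAD = "<ul><li><i>8</i><b>Book B</b></li></ul>"
```

In this sketch, `verify_and_extract(OLD, GOOD)` returns the leaf values of the new page, while `verify_and_extract(OLD, BAD)` returns `None` because the price precedes the title and the state machine has no matching transition, which mirrors the abstract's point that a page failing any stage signals the wrapper no longer applies.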

Thesis 975202029: full metadata record