Thesis/Dissertation 975202029: Full Metadata Record

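DC element | Value | Language
DC.contributor | Department of Computer Science and Information Engineering | zh_TW
DC.creator | 林衍伶 | zh_TW
DC.creator | Yen-ling Lin | en_US
dc.date.accessioned | 2010-7-27T07:39:07Z
dc.date.available | 2010-7-27T07:39:07Z
dc.date.issued | 2010
dc.identifier.uri | http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=975202029
dc.contributor.department | Department of Computer Science and Information Engineering | zh_TW
DC.description | National Central University | zh_TW
DC.description | National Central University | en_US
dc.description.abstract | Over the past decade, many studies in web data extraction have proposed different unsupervised information extraction methods. Far less work, however, has addressed the verification and maintenance of extractors, and some are even skeptical about whether unsupervised information extraction programs exist at all. In this thesis we therefore propose a novel method that handles both problems at once: it implements the extractor of an unsupervised information extraction system and, at the same time, verifies that the extractor is still valid. First, we exploit the strict rules of XML to roughly check whether a new page conforms to the XML data description of the existing pages. Next, we compare the content of each leaf node in the DOM tree with the content found at the same path in the existing pages. Finally, we convert the basic nodes and the leaf-node templates into a finite state machine and use it to verify whether the order of basic nodes and leaf-node templates in the new page complies with the schema. If a new page passes all of these stages, its data can be extracted at the same time; otherwise, the original wrapper no longer applies. Our method is thus not only a wrapper verifier but also an extractor for page-level unsupervised information extraction. We also measure how many input pages unsupervised information extraction needs: with only two input pages, 94% of pages pass the extractor's template verification but only 40% pass schema verification; with three input pages, 95% pass template verification and the share passing schema verification rises to 81%. | zh_TW
dc.description.abstract | Unsupervised information extraction has been studied extensively over the past decade, but its wrapper verification has received little attention. In this paper, we propose a novel method that addresses two problems together, wrapper verification and extractor implementation, for the unsupervised information extraction system FiVaTech. We first use XML validation to roughly verify the template of new web pages fetched at a later time. Next, we compare the content at each path of the parsed DOM tree with the contents at the same path in the existing pages. Finally, we use a finite state machine built from basic nodes and leaf-node (terminal) templates to verify whether their order complies with the existing schema. If a new web page passes these stages, its data is extracted at the same time. Our approach therefore acts not only as a verifier but also as an extractor for page-level unsupervised information extraction. Using the verifier, we show that at least three input pages are needed for 95% and 81% of pages to pass template and schema verification, respectively. | en_US
DC.subject | Unsupervised extractor | zh_TW
DC.subject | Extractor verification | zh_TW
DC.subject | Unsupervised wrapper induction | en_US
DC.subject | wrapper verification | en_US
DC.title | Verification of Unsupervised Page-level Wrappers: An Extractor Based on Template, Semantic, and Schema Verification | zh_TW
dc.language.iso | zh-TW | zh-TW
DC.title | Page-level Wrapper Verification based on Structure, Semantic and Schema | en_US
DC.type | Thesis/Dissertation | zh_TW
DC.type | thesis | en_US
DC.publisher | National Central University | en_US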
