Schema Matching for Unsupervised Wrapper Maintenance (非監督式包覆程式維護之綱要對映)

Please use this permanent URL to cite or link to this document: https://ir.lib.ncu.edu.tw/handle/987654321/48532

Title:	Schema Matching for Unsupervised Wrapper Maintenance (非監督式包覆程式維護之綱要對映)
Author:	Chi-I Kuan (官直毅)
Contributor:	Graduate Institute of Computer Science and Information Engineering (資訊工程研究所)
Keywords:	Wrapper Maintenance; Data Integration; Schema Matching
Date:	2011-08-29
Upload time:	2012-01-05 14:57:19 (UTC+8)
Abstract:	A wrapper is a program that extracts specific data from web pages. Users access the data they need through a wrapper and pass it through an information integration step to turn it into useful information, which can then feed an integrated web service or a data analysis system. Website developers, however, often modify their sites for a variety of reasons, which breaks the original wrapper so that the same program can no longer extract the data; the programmer is then left to rewrite or modify the wrapper. For this reason, unsupervised wrapper induction has been widely discussed in recent years. It exploits the regularity of dynamic web pages to generate an extraction module for a website and then extracts data automatically with that module, so a wrapper does not have to be rewritten each time. The problem that unsupervised wrapper induction may encounter lies in maintenance: when a website changes over time, the data extracted at time t and at time t’ can differ greatly in both schema and instances. How to identify the related information and integrate the two data sets is the problem this thesis investigates. Using the instance and structure information generated by FiVatech (the unsupervised wrapper induction tool used in this work), the schemas obtained at time t and time t’ can be matched by exploiting the high correlation between the structure information the schemas provide and the instance content. The thesis selects corresponding attributes in an instance step and a structure step. The instance step comprises data type identification, a search for identical records, and a search for candidate corresponding attributes based on the similarity of instance information. The structure step proposes several types of structural similarity measures and combines them to reflect the structural characteristics of the data, from which the corresponding attributes are selected. Because instance similarity captures the features of each attribute and structural similarity captures the relationships among attributes, the system needs no training data and matches attributes automatically, with satisfactory performance across domains. The F-measure of attribute matching reaches 92% in the Book domain, 95% in the Job domain, 86% in the Hotel domain, and 84% in the CarBuyer domain, which is the hardest to match. Overall, structural similarity does help attribute matching.
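The instance step and the F-measure evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's method: Jaccard overlap of column values, the greedy one-to-one assignment, the 0.5 threshold, the column names, and the toy data are all assumptions made here for illustration.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two value sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_attributes(
    old: Dict[str, List[str]],
    new: Dict[str, List[str]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the thesis
) -> Dict[str, str]:
    """Greedily pair columns of the old schema with columns of the new one."""
    scored = sorted(
        (
            (jaccard(set(old_vals), set(new_vals)), old_name, new_name)
            for old_name, old_vals in old.items()
            for new_name, new_vals in new.items()
        ),
        reverse=True,  # highest similarity first
    )
    mapping: Dict[str, str] = {}
    used_new: Set[str] = set()
    for sim, old_name, new_name in scored:
        if sim < threshold:
            break  # every remaining pair is below the threshold
        if old_name in mapping or new_name in used_new:
            continue  # keep the matching one-to-one
        mapping[old_name] = new_name
        used_new.add(new_name)
    return mapping


def f_measure(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Harmonic mean of precision and recall over attribute pairs."""
    true_positives = len(set(predicted.items()) & set(gold.items()))
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data: the same three book records extracted at time t and at time t'.
# Column order and record order changed, and the prices were updated.
schema_t = {
    "A1": ["Dune", "Emma", "Ulysses"],
    "A2": ["Herbert", "Austen", "Joyce"],
    "A3": ["$9.99", "$5.00", "$12.50"],
}
schema_t_prime = {
    "B1": ["Austen", "Joyce", "Herbert"],
    "B2": ["$10.99", "$5.50", "$12.00"],
    "B3": ["Emma", "Dune", "Ulysses"],
}
gold = {"A1": "B3", "A2": "B1", "A3": "B2"}  # hand-written reference mapping

mapping = match_attributes(schema_t, schema_t_prime)
score = f_measure(mapping, gold)
print(mapping, round(score, 2))
```

In this toy case the title and author columns are matched, but the price column is not, because its values changed between the two snapshots. That is the kind of gap that value overlap alone leaves open, and which the data type identification and structural similarity described in the abstract would presumably help close.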
Appears in collections:	[Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations
