Thesis/Dissertation 975202035 Detailed Record


Name Chi-I Kuan (官直毅)   Department Computer Science and Information Engineering
Thesis title Schema Matching for Unsupervised Wrapper Maintenance
(非監督式包覆程式維護之綱要對映)
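Access rights
  1. The author has approved this electronic thesis for immediate open access.
  2. Once open, the electronic full text is licensed only for personal, non-profit searching, reading, and printing for academic research purposes.
  3. Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.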

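Abstract (Chinese) A wrapper is a program that extracts specific data from web pages. Users can access specific data through a wrapper, turn that data into useful information through information integration, and then provide an integrated web service or data analysis system.
However, website developers often modify their sites for various reasons, which breaks the original wrapper so that the same program can no longer extract the information. The developer then has no choice but to rewrite or modify the extraction program. For this reason, unsupervised wrapper induction has been widely discussed in recent years. It exploits the regularity of dynamic web pages to generate an extraction module for a website and uses that module to extract data automatically, so the wrapper does not have to be rewritten each time.
The maintenance problem that unsupervised wrapper induction may face is this: when a website is modified over time, the information extracted at time t and at time t’ can differ greatly in both schema and instances. How to integrate these data is the problem this thesis investigates in depth.
Once the schemas at time t and time t’ have been obtained, the two schemas can be matched by exploiting the strong correlation between the structure information that the schemas provide and the instance content. This thesis selects matching attributes in an instance step and a structure step. The instance step comprises identifying data types, finding identical records, and using the similarity of instance information to find candidate matching attributes. The structure step proposes several types of structural similarity measures and then combines them to reflect the structural characteristics of the data, from which the matching attributes are selected.
Instance similarity captures the characteristics of each attribute, and structural similarity captures the relationships among attributes. As a result, no training data is needed, the system can match attributes automatically, and performance is satisfactory across domains. The F-measure of attribute matching reaches 92% in the Book domain, 95% in the Job domain, 86% in the Hotel domain, and 84% in the CarBuyer domain, which is the hardest to match. Overall, structural similarity does help attribute matching.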
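The F-measure figures above are conventionally computed as the harmonic mean of precision and recall over attribute correspondences. The following is a minimal sketch, assuming the balanced F1 variant (the abstract does not state the weighting); the attribute names and correspondences are hypothetical and are not taken from the thesis.

```python
def f_measure(predicted, gold):
    """Balanced F-measure (F1) over sets of attribute correspondences."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)  # share of predicted pairs that are right
    recall = correct / len(gold)          # share of true pairs that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical correspondences between the schema at time t and at time t'.
gold = {("title", "book_name"), ("author", "writer"), ("price", "cost")}
predicted = {("title", "book_name"), ("author", "writer"), ("price", "isbn")}
score = f_measure(predicted, gold)
```

Here two of the three predicted pairs are correct and two of the three true pairs are found, so precision, recall, and the resulting F-measure are all 2/3.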
Abstract (English) A wrapper is a program that extracts specific data from web pages. Researchers can access specific data through a wrapper and use information integration to turn the data into useful information, which can then support integrated web services or data analysis systems.
However, site developers often modify their websites for various reasons, which breaks the original wrapper so that it can no longer extract data. In this situation, the developer can only rewrite or modify the original wrapper. For this reason, unsupervised wrapper induction has been widely discussed in recent years. It automatically builds an extraction module from the regularity of dynamic web pages and extracts data with that module, so programmers do not need to write a wrapper for a specific website every time.
The problem that unsupervised wrapper induction may encounter is maintenance. If a website changes over time, we obtain two sets of extracted data, one at time t and one at time t’. Our goal is to identify the related information and integrate the two. We use the instance and structure information generated by FiVaTech (the unsupervised wrapper induction tool we used) to match the corresponding attributes.
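The idea of matching attributes by combining instance and structure information can be illustrated with a small sketch. It is not the thesis's algorithm: the Jaccard instance measure, the tag-path structural measure, the equal weighting, the threshold, and the greedy one-to-one assignment are all assumptions chosen for illustration, and the schemas are hypothetical.

```python
from difflib import SequenceMatcher


def instance_sim(values_a, values_b):
    """Jaccard overlap of two attributes' value sets (an assumed measure)."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def path_sim(path_a, path_b):
    """Similarity of two root-to-node tag paths (an assumed structural measure)."""
    return SequenceMatcher(None, path_a, path_b).ratio()


def match_attributes(old, new, weight=0.5, threshold=0.3):
    """Greedy one-to-one matching on a weighted sum of both similarities."""
    scored = []
    for name_a, (vals_a, path_a) in old.items():
        for name_b, (vals_b, path_b) in new.items():
            s = (weight * instance_sim(vals_a, vals_b)
                 + (1 - weight) * path_sim(path_a, path_b))
            scored.append((s, name_a, name_b))
    scored.sort(reverse=True)  # best-scoring pairs first
    matches, used_a, used_b = {}, set(), set()
    for s, name_a, name_b in scored:
        if s >= threshold and name_a not in used_a and name_b not in used_b:
            matches[name_a] = name_b
            used_a.add(name_a)
            used_b.add(name_b)
    return matches


# Hypothetical schemas: attribute -> (instance values, tag path).
schema_t = {
    "A1": (["Dune", "Emma"], ("html", "body", "table", "tr", "td")),
    "A2": (["$9.99", "$5.00"], ("html", "body", "table", "tr", "td", "b")),
}
schema_t2 = {
    "B1": (["$9.99", "$7.50"], ("html", "body", "div", "ul", "li", "b")),
    "B2": (["Dune", "Emma", "Ulysses"], ("html", "body", "div", "ul", "li")),
}
matches = match_attributes(schema_t, schema_t2)
```

In this toy case the page layout changed from a table to a list, yet shared instance values and partially shared paths still pair A1 with B2 and A2 with B1, with no training data involved.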
Keywords (Chinese) ★ 包覆程式的維護 (Wrapper Maintenance)
★ 資訊整合 (Information Integration)
★ 綱要對映 (Schema Matching)
Keywords (English) ★ Wrapper Maintenance
★ Data Integration
★ Schema Matching
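Table of contents Chinese Abstract
English Abstract
Table of Contents
List of Figures
1. Introduction
2. Related Work
2.1. Background
2.2. Schema Matching
2.2.1. Types of Schema Matching
2.2.2. Dual Correlation Mining (DCM)
2.2.3. On-the-fly Data Integration of Homogeneous Web Data
2.2.4. Combining Schema and Instance Information
2.2.5. Improving XML Schema Matching Performance Using Prufer Sequences
3. Preliminary
3.1. FiVaTech
3.2. Notation
4. System Architecture
4.1. Instance Information
4.1.1. Data Types
4.1.2. Finding Identical Record Pairs
4.1.3. Selecting Candidate Attributes
4.2. Structure-Level Information
4.2.1. Node Order Similarity
4.2.2. Adjacent Node Similarity
4.2.3. Path Similarity
4.2.4. Parent Node Type Similarity
4.3. Combining Instance-Level and Structure-Level Similarity
5. Experimental Results
5.1. Evaluation Method and Experimental Design
5.2. Threshold for Finding Identical Record Pairs
5.3. Performance of Instance Similarity in Different Domains
5.4. Effect of the Various Structural Similarities
5.5. Performance of Combining Instance and Structural Similarity
6. Conclusion and Future Work
7. References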
Advisor Chia-Hui Chang (張嘉惠)   Approval date 2011-8-29

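For questions about this thesis, please contact the Extension Services Division of the National Central University Library, TEL: (03)422-7151 ext. 57407, or by e-mail.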