A Machine Learning Based Approach to Web Extraction from Template Pages


Name

Chih-Hao Chang (張志豪)

Department

In-service Master Program, Department of Computer Science and Information Engineering

Thesis Title

A Machine Learning Based Approach to Web Extraction from Template Pages


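This electronic thesis is authorized for immediate open access.
Once open, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes.
Please observe the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.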

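Abstract (translated from Chinese)

The Internet holds a vast amount of data, and the structured data served by
the Deep Web is more valuable than the information available on the Surface
Web. However, the query interface that the Deep Web exposes to people through
the Common Gateway Interface (CGI) is not suited to being read by programs.
For information integration, extracting the desired data from query result
pages has therefore been a challenge for more than a decade. The techniques
have evolved from supervised to unsupervised data extraction, and from
data-rich section extraction to page-level data extraction. Unsupervised
methods mainly use pages with similar structure to infer, in reverse, the HTML
template and data model that generated them. Because the same HTML tag may
present different kinds of information, the main difficulty of automatic
inference is deciding whether two occurrences of the same HTML tag carry
different meanings. This thesis applies machine learning to decide whether two
HTML tags in a Document Object Model (DOM) tree are peer nodes, thereby
improving the accuracy with which the unsupervised method FiVaTech infers a
page's template and schema. The classifier uses three kinds of features: HTML
tag information, visual information, and text content information. We also
compare the images of HTML tags as rendered in the browser to assist template
detection. Experiments show that a J48 classifier recognizes peer nodes with
about 90% accuracy and improves schema accuracy by 20%, which indicates that
the approach is feasible.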
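The three feature families named in the abstract can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `Node` class, the `pair_features` function, and the specific features are hypothetical and are not the attributes used in the thesis, which the abstract does not list.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a rendered DOM node (not FiVaTech's data model)."""
    tag: str     # HTML tag name, e.g. "td"
    path: tuple  # tag names from the root down to this node
    text: str    # text content under the node
    box: tuple   # rendered (x, y, width, height) in pixels


def ratio(a: float, b: float) -> float:
    """Similarity of two non-negative numbers in [0, 1]; 1.0 means equal."""
    if a == b:
        return 1.0
    return min(a, b) / max(a, b)


def pair_features(a: Node, b: Node) -> dict:
    """Build one feature vector for a candidate peer-node pair."""
    return {
        # HTML tag / DOM tree information
        "same_tag": a.tag == b.tag,
        "same_path": a.path == b.path,
        "depth_diff": abs(len(a.path) - len(b.path)),
        # Visual information
        "same_left_edge": a.box[0] == b.box[0],
        "width_ratio": ratio(a.box[2], b.box[2]),
        "height_ratio": ratio(a.box[3], b.box[3]),
        # Text content information
        "text_len_ratio": ratio(len(a.text), len(b.text)),
        "both_numeric": a.text.strip().isdigit() and b.text.strip().isdigit(),
    }


# Two made-up table cells from consecutive rows of the same result table.
path = ("html", "body", "table", "tr", "td")
row1 = Node("td", path, "Data Mining", (10, 100, 200, 20))
row2 = Node("td", path, "Web Mining", (10, 120, 200, 20))
features = pair_features(row1, row2)
```

A classifier would take one such vector per tag pair and output whether the pair is a peer-node pair. Ratios are used here rather than raw pixel or character counts so that the features stay comparable across pages with different layouts, which is a design choice of this sketch only.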

Abstract (English)

A huge amount of information on the World Wide Web has a
structured HTML form because the pages are generated dynamically from
databases and share the same template. This paper proposes FiVaTech2, a
page-level web data extraction system that automatically extracts the
schema and templates from such template-based web pages. FiVaTech2 is
an extension of our previous page-level web data extraction system,
FiVaTech. FiVaTech2 uses a machine learning (ML) based method that
compares pairs of HTML tags to estimate how likely they are to be peer
nodes in the web pages. We use the J48 decision tree classifier as the
ML technique and also use image comparison to assist template
detection. Each HTML tag in a web page has several features, which
fall into three types: visual information, DOM tree
information, and HTML tag contents. Our experiments show
encouraging results on the test pages when combinations of the three
types of tag features are used. They also show that FiVaTech2
performs better and is more efficient than FiVaTech.
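As a rough illustration of how a decision tree picks its tests, the sketch below fits a one-level tree (a decision stump) by information gain. It is not J48 and not the thesis's implementation: J48 is Weka's decision tree learner, which builds a full multi-level tree and prunes it. The function names, the three features, and the six training rows are all made up for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature_index, threshold) with the highest information gain."""
    base = entropy(labels)
    best = (0.0, None, None)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = base - remainder
            if gain > best[0]:
                best = (gain, f, t)
    return best


def train_stump(rows, labels):
    """Fit a one-level decision tree: a single threshold test on one feature."""
    _, f, t = best_split(rows, labels)
    if f is None:
        raise ValueError("no feature separates the training labels")
    left = [y for row, y in zip(rows, labels) if row[f] <= t]
    right = [y for row, y in zip(rows, labels) if row[f] > t]
    return f, t, Counter(left).most_common(1)[0][0], Counter(right).most_common(1)[0][0]


def predict(stump, row):
    """Classify one feature vector with a trained stump."""
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label


# Made-up tag-pair vectors: [width_ratio, text_len_ratio, depth_diff].
rows = [
    [1.00, 0.9, 0], [0.95, 0.4, 0], [1.00, 0.7, 0],
    [0.30, 0.8, 0], [0.50, 0.5, 1], [0.20, 0.9, 2],
]
labels = ["peer", "peer", "peer", "not_peer", "not_peer", "not_peer"]
stump = train_stump(rows, labels)
```

On this toy data the stump tests the first feature, because a threshold between 0.5 and 0.95 separates the two classes completely. A full tree learner would apply the same selection step recursively to each resulting subset.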

Keywords

★ machine learning

Table of Contents

FiVaTech2
A Machine Learning Based Approach to Web Data Extraction from Template Pages
Contents
List of Figures
List of Tables
1 Introduction
2 Related Works
2.1 Approaches using DOM tree information
2.2 Approaches using visual information
2.3 Approaches using token occurrence frequency
2.4 Machine Learning Tool
3 FiVaTech
3.1 Definitions
3.2 FiVaTech Tree Merging
3.3 Schema Detection
4 FiVaTech2
4.1 Filtering Out Template Blocks in the Input DOM Trees
4.2 Peer Nodes Recognition
4.2.1 Decorative tag comparison
4.2.2 Decision Tree for Peer Node Recognition
4.2.3 Decision Tree Attributes
4.2.4 Accuracy by Different HTML Tag Tree Levels
4.2.5 Accuracy by Different Data Regions
4.2.6 Accuracy by Number of Training Sites
4.2.7 Accuracy by Number of Training Node Pairs
5 Experiments
6 Conclusions and Future Works
References

Advisor

Chia-Hui Chang (張嘉惠)

Approval Date

2010-07-26
