Exploiting Dynamic Encoding and Multiple Pages for Record Boundary Detection and Data Extraction

Name

Ming-chuan Chen (陳明權)

Department

Department of Computer Science and Information Engineering

Thesis title

Exploiting Dynamic Encoding and Multiple Pages for Record Boundary Detection and Data Extraction
(應用動態編碼於多頁面網頁之記錄邊界偵測與資訊擷取)

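Related theses

★ A Study on Identifying Schedule Invitation E-mails and Extracting Irregular Time Expressions
★ Design of the NCUFree Campus Wireless Network Platform and Development of Application Services
★ Design and Implementation of a Semi-structured Data Extraction System for the Internet
★ Mining and Applications of Non-simple Traversal Paths
★ Improvements to Incremental Association Rule Mining
★ Applying the Chi-square Test of Independence to Associative Classification
★ Design and Study of a Chinese Data Extraction System
★ Visualization of Non-numerical Data and Clustering that Combines Subjective and Objective Views
★ A Study of Associated Word Groups in Document Summarization
★ Cleaning Web Pages: Page Segmentation and Data Region Extraction
★ Design and Study of a Question Answering System Using Sentence Classification and Ranking
★ Efficient Mining of Compact Frequent Continuous Event Patterns in Temporal Databases
★ Axis Arrangement in Star Coordinates for Cluster Visualization
★ A Study on Automatically Generating Web Crawlers from Browsing History
★ A Study on Template and Data Analysis of Dynamic Web Pages
★ A Study on Automating Data Integration across Homogeneous Web Pages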

Files

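This electronic thesis is authorized for immediate open access.
Once released, the electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes.
Please observe the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.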

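Abstract (Chinese)

Record boundary detection is a key step in wrapper induction: its quality directly affects the subsequent alignment and the final accuracy. Previous methods mostly compute the similarity between blocks within a single page, which offers little information, and computing similarity over tree structures also raises the computational cost. In this thesis we consult multiple pages from the same website and identify the parts they share and the parts that differ, which supplies the information a single page lacks. To limit the extra computation that multiple pages bring, the system analyzes mainly the leaf nodes of the DOM tree, which make up only about 30 percent of all nodes. Based on how leaf nodes are distributed across the pages, we propose dynamic encoding, which abstracts leaf nodes to highlight the regularity of records so that repeated pattern mining performs better. Finally, for record boundary detection we propose the concept of landmarks: starting from the landmarks present in each record, the system traverses the tree structure to infer the corresponding record boundaries. In the experiments, we use well-known datasets and compare against several earlier systems, and the approach achieves good accuracy throughout.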
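As a rough illustration of the multi-page idea in the abstract, the Python sketch below collects text leaf nodes with their tag paths, builds a per-page occurrence vector for each leaf, and abstracts every leaf that does not occur on all pages to its tag path. The record does not specify the thesis's actual encoding rule, so the names `LeafCollector`, `occurrence_vectors`, and `encode`, the treatment of only text nodes as leaves, and the "occurs on every page means template" criterion are all assumptions made for this example.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; they are not pushed on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class LeafCollector(HTMLParser):
    """Collect text leaves as (tag_path, text) pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.leaves = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def collect_leaves(html):
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    return parser.leaves


def occurrence_vectors(pages):
    """Map each leaf to a tuple of its occurrence counts, one count per page."""
    per_page = [Counter(collect_leaves(p)) for p in pages]
    all_leaves = set().union(*per_page)
    return {leaf: tuple(c[leaf] for c in per_page) for leaf in all_leaves}


def encode(pages):
    """Assumed rule: leaves seen on every page keep their text, others become tag paths."""
    vectors = occurrence_vectors(pages)
    encoded = []
    for page in pages:
        seq = []
        for path, text in collect_leaves(page):
            is_template = all(n > 0 for n in vectors[(path, text)])
            seq.append(f"{path}:{text}" if is_template else path)
        encoded.append(seq)
    return encoded
```

For two pages that share the label `Price` but list different values, `encode` would keep the `Price` leaves as distinct symbols and collapse the values to their tag path, which makes the repeated record structure easier to see.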

Abstract (English)

Record boundary detection plays an important role in wrapper induction, and its quality directly affects the precision of alignment and extraction. Previous approaches usually compute similarity between blocks or measure tree similarity within a single page.
In this paper, we analyze multiple pages generated by the same website. By exploring the common and differing parts of these pages, we overcome the weakness of single-page approaches. Because the computational load increases with the number of pages, the proposed approach focuses only on the leaf nodes of the DOM tree, which account for about 30 percent of all nodes. We propose dynamic encoding, which abstracts leaf nodes and emphasizes the regularity of data records. With dynamic encoding, we reduce the number of repeated patterns discovered. Finally, we propose the idea of a landmark, which is located in a data record, and detect record boundaries by segmenting the DOM tree. In the experiments, we evaluate the efficiency of our approach and compare its effectiveness with that of other systems.
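The last two steps of the abstract, repeated pattern discovery and landmark-based segmentation, can be pictured on a flat sequence of encoded leaf symbols. The sketch below is a deliberate simplification: it works on a list rather than on the DOM tree, counts overlapping occurrences, and its function names (`repeated_patterns`, `split_at_landmark`) and symbol format are hypothetical rather than taken from the thesis.

```python
from collections import defaultdict


def repeated_patterns(seq, min_len=2, min_count=2):
    """Return {pattern: start positions} for contiguous patterns seen at least min_count times."""
    found = {}
    for n in range(min_len, len(seq) // min_count + 1):
        positions = defaultdict(list)
        for i in range(len(seq) - n + 1):
            positions[tuple(seq[i:i + n])].append(i)
        for pattern, starts in positions.items():
            if len(starts) >= min_count:
                found[pattern] = starts
    return found


def split_at_landmark(seq, landmark):
    """Cut the sequence into records, each starting at an occurrence of the landmark symbol."""
    starts = [i for i, token in enumerate(seq) if token == landmark]
    bounds = starts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # Illustrative symbols only: "b:Price" stands for a template leaf used as a landmark.
    symbols = ["h", "b:Price", "span", "a", "b:Price", "span", "a", "f"]
    print(repeated_patterns(symbols))
    print(split_at_landmark(symbols, "b:Price"))
```

In this naive version the last record absorbs any trailing symbols; the thesis's own boundary detection operates on the DOM tree, which this sketch does not attempt to reproduce.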

Keywords (Chinese)

★ Record boundary detection (記錄範圍偵測)
★ Dynamic encoding (動態編碼)
★ Information extraction (資訊擷取)

Keywords (English)

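Thesis contents

Table of Contents
Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
2. Related Work
3. Methodology
3.1 Preprocessing
3.2 Dynamic Encoding
3.3 Record Boundary Detection
3.3.1 Landmark Detection
3.3.2 Repeated Pattern Mining
3.3.3 Record Boundary Detection Algorithm
3.4 Record Boundary Refinement
3.4.1 Removing YCAs with Too Few Nodes
3.4.2 Removing YCAs with Unbalanced Node Counts
3.4.3 Keeping YCAs Mutually Independent
3.4.4 Recovering Missed Records
4. Experiments
4.1 Efficiency Evaluation
4.2 Evaluation of Record Boundary Detection Results
4.3 Evaluation of the Improvement from Landmarks
5. Conclusion and Future Work
References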

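List of Figures
Figure 1. Example of a list-style web page
Figure 2. System flowchart
Figure 3. Example web page
Figure 4. Source code of the example web page
Figure 5. Document Object Model tree of the example web page
Figure 6. Illustration of node merging
Figure 7. Example web page for patterns
Figure 8. Set of instances corresponding to a pattern
Figure 9. YCA and FDA
Figure 10. Record boundary detection algorithm
Figure 11. Formula for node-count balance
Figure 12. Case where YCAs are in a containment relationship
Figure 13. Case with the same YCA node and FDAs at the same level
Figure 14. Case with the same YCA node and FDAs at different levels
Figure 15. Virtual FDA
Figure 16. Example of record refinement
Figure 17. Example of a table-style web page
Figure 18. Example of alignment results
Figure 19. Evaluation formulas
Figure 20. Execution time versus number of leaf nodes
Figure 21. Number of analyzed pages versus detection performance (TBDW)
Figure 22. Number of analyzed pages versus detection performance (ViNTs)
Figure 23. Comparison of flowcharts
Figure 24. Comparison with and without record boundary detection
Figure 25. Comparison with and without landmark detection
Figure 26. Evaluation of three strategies on websites without landmarks
Figure 27. Web page with structural problems caused by missing end tags
Figure 28. Multiple record regions containing the same landmark
Figure 29. Web pages with nested structure and nested content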

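List of Tables
Table 1. Occurrence vectors for the example web pages
Table 2. Statistics on the proportion of leaf nodes
Table 3. Overall comparison of execution efficiency (TBDW)
Table 4. Overall comparison of execution efficiency (ViNTs)
Table 5. Evaluation of this system's record region detection results
Table 6. Evaluation of this system's record boundary detection results
Table 7. Comparison of this system with other methods
Table 8. Evaluation of the improvement from landmarks
Table 9. Number of landmarks and number of patterns
Table 10. Websites where landmarks make a noticeable difference (RPM and LD+RPM)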

Advisor

Chia-hui Chang (張嘉惠)

Approval date

2014-8-21
