Clustering of Template Page for Data Extraction


Name

Jia-Ru Wu (吳佳儒)

Department

In-service Master Program, Department of Computer Science and Information Engineering

Thesis Title

Clustering of Template Page for Data Extraction
(樣板網頁結構自動分群)



Access rights: the electronic full text of this thesis is approved for immediate open access.

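Abstract (translated from Chinese)

In web data extraction, the diversity of web content and the complexity of page structure make it a persistent challenge to extract data automatically from pages built on many different templates.
Web data extraction systems fall into two main categories, record level and page level. Both take pages that share a single template as input and perform data extraction or schema induction; research on clustering pages by template is comparatively rare.
This thesis proposes a function that automatically clusters web pages by the similarity of their structure, which simplifies extraction across different page templates. Using a set of designed page features, it implements both unsupervised and supervised clustering and compares their performance. Although the overall clustering results are not ideal, on the target cluster the unsupervised approach reaches a precision of 99% and a recall of 78%, and the supervised approach reaches a precision of 97% and a recall above 80%.
Finally, the clustering result can be combined with the Page-level Information Extraction System (UWIDE) to produce a complete page schema and extract the required POI-related information, which can then be used to build and grow a database and improve the efficiency and quality of related value-added services.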
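The idea of grouping pages by structural similarity can be illustrated with a small sketch. The abstract does not say which features or which distance the thesis uses, so everything here is an assumption made for illustration: each page is represented by its set of root-to-node tag paths, pages are compared with Jaccard distance, and clusters are merged by average-link agglomerative clustering until the closest pair is farther apart than a hypothetical `threshold`. The names `tag_paths`, `jaccard_distance`, and `cluster_pages` are invented for this example.

```python
from html.parser import HTMLParser
from itertools import combinations

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in one page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural feature set (tag paths) of an HTML string."""
    parser = TagPathCollector()
    parser.feed(html)
    parser.close()
    return parser.paths


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def cluster_pages(pages, threshold=0.5):
    """Average-link agglomerative clustering over tag-path sets.

    Returns a list of clusters, each a list of indices into `pages`.
    Merging stops once the closest pair of clusters exceeds `threshold`.
    """
    feats = [tag_paths(p) for p in pages]
    clusters = [[i] for i in range(len(pages))]

    def linkage(c1, c2):
        total = sum(jaccard_distance(feats[i], feats[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[x], clusters[y]) > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters
```

Two list-style pages that share the same tag paths end up in one cluster, while a table-style detail page stays separate. A distance threshold is only one way to stop merging; the table of contents indicates that the thesis treats the choice of cluster count as a separate problem.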

Abstract (English)

In the field of web data extraction, the diversity of web content and the complexity of page structure have made it a considerable challenge to extract data automatically from web pages built on different templates. Web data extraction systems are mainly divided into two categories, record level and page level. Both take web pages of the same template as input and perform data extraction or schema induction. Research on clustering web pages by template is rare.
This thesis proposes a method that clusters web pages automatically by the similarity of their structure, which simplifies data extraction from pages with different templates. We apply both unsupervised and supervised clustering based on our designed features and compare their performance. Although the overall clustering performance is not as good as expected, unsupervised clustering reaches a precision of 99% and a recall of approximately 78% on the target cluster, and supervised clustering reaches a precision of 97% and a recall of more than 80%.
Finally, with this clustering result we can generate a complete web page schema and extract POI-related information via the Page-Level Information Extraction System (UWIDE). The extracted data can be accumulated into databases to enhance the efficiency and quality of related value-added services.
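The precision and recall figures reported for the target cluster can be made concrete with a short sketch. How the thesis maps a predicted cluster to the target group is not stated in the abstract; the function below assumes the predicted cluster holding the most target pages is the target cluster. The name `target_cluster_scores` and the label `"poi"` in the usage note are hypothetical.

```python
from collections import Counter


def target_cluster_scores(cluster_ids, gold_labels, target):
    """Return (precision, recall) for the cluster that best covers `target`.

    cluster_ids: predicted cluster id for each page.
    gold_labels: true template label for each page.
    target: the gold label of interest.
    """
    # Count how many target pages fall into each predicted cluster.
    target_hits = Counter(
        c for c, g in zip(cluster_ids, gold_labels) if g == target
    )
    if not target_hits:
        raise ValueError("no page carries the target label")

    # Assumption: the cluster with the most target pages is the target cluster.
    best_cluster, true_pos = target_hits.most_common(1)[0]
    cluster_size = sum(1 for c in cluster_ids if c == best_cluster)
    total_target = sum(target_hits.values())

    precision = true_pos / cluster_size   # purity of the chosen cluster
    recall = true_pos / total_target      # share of target pages it captures
    return precision, recall
```

For example, with `cluster_ids = [0, 0, 0, 1, 2]` and `gold_labels = ["poi", "poi", "poi", "poi", "other"]`, cluster 0 contains only target pages but misses one of the four, giving a precision of 1.0 and a recall of 0.75. This is the same pattern the abstract reports: high precision with lower recall.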

Keywords

★ Feature selection
★ Template page extraction
★ Hierarchical clustering
★ Unsupervised clustering

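Table of Contents

Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
2. Related Work
2.1. Web Data Extraction Techniques
2.2. Web Page Clustering Research
2.2.1 Feature Selection
2.2.2 Clustering Algorithms
3. System Architecture
3.1. Web Page Clustering Strategy
3.1.1 Feature Design and Page Representation
3.1.2 Determining the Number of Clusters
3.2. Schema Extraction
4. Experimental Methods and Results
4.1. Evaluation Method
4.2. Unsupervised Clustering
4.2.1 Feature Selection
4.2.2 Clustering Experiments
4.3. Supervised Clustering
4.3.1 Clustering Experiments
4.4. Target Cluster Performance
5. Conclusion
References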


Advisor

Chia-Hui Chang (張嘉惠)

Approval Date

2018-7-23
