Toward Efficient Unsupervised Web Data Extraction: From Unsupervised to Self-Trained Wrappers

Name

Naufal Said (時福仁)

Department

Department of Computer Science and Information Engineering

Thesis Title

Toward Efficient Unsupervised Web Data Extraction: From Unsupervised to Self-Trained Wrappers

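This electronic thesis is approved for immediate open access.
Once open, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China: do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.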

Abstract

Web data extraction is a key component of many business intelligence tasks, such as data transformation, exchange, analysis, and interpretation. Many wrapper induction approaches have been proposed, whether manual, supervised, or unsupervised. However, most research focuses on extraction effectiveness, and little attention has been paid to extraction efficiency. In this thesis, we argue that wrapper generation for unsupervised web data extraction is as important as supervised wrapper induction, because a generated wrapper can extract data more efficiently, without repeating the sophisticated analysis. We therefore treat unsupervised data extraction as an oracle machine that generates annotated training examples, and we consider two methods of wrapper generation: schema-guided finite-state machine (FSM) approaches and data-driven machine learning (ML) approaches. The experimental results show that FSM wrappers perform well even with little training data, while ML-based models are more efficient at test time but require more training pages to reach the same effectiveness. Furthermore, FSM wrappers can serve as a filter that reduces the number of training pages and improves the learning curve of ML-based wrappers.
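The pipeline described in the abstract (an unsupervised extractor acting as an oracle that labels training pages, from which a finite-state wrapper is built) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: `oracle_label`, `symbol`, `train_fsm`, and `extract` are hypothetical names, the oracle is replaced by a toy rule, and pages are assumed to be pre-tokenized into tags and text.

```python
from collections import defaultdict

def oracle_label(tokens):
    """Toy stand-in for the unsupervised extractor (an assumption):
    text right after <b> is a 'title', text right after <i> is a 'price'."""
    labels, prev = [], None
    for tok in tokens:
        if tok.startswith("<"):
            labels.append("TAG")
        elif prev == "<b>":
            labels.append("title")
        elif prev == "<i>":
            labels.append("price")
        else:
            labels.append("other")
        prev = tok
    return labels

def symbol(tok):
    """Abstract a token: tags keep their name, all text collapses to TEXT."""
    return tok if tok.startswith("<") else "TEXT"

def train_fsm(pages):
    """Learn transitions (state, symbol) -> next state from oracle-labelled pages."""
    delta = {}
    for tokens in pages:
        state = "START"
        for tok, label in zip(tokens, oracle_label(tokens)):
            nxt = tok if label == "TAG" else label
            delta[(state, symbol(tok))] = nxt
            state = nxt
    return delta

def extract(delta, tokens):
    """Run the FSM over a new page; return None if a transition is missing."""
    state, out = "START", defaultdict(list)
    for tok in tokens:
        state = delta.get((state, symbol(tok)))
        if state is None:
            return None  # the wrapper does not cover this page
        if state in ("title", "price"):
            out[state].append(tok)
    return dict(out)

delta = train_fsm([["<b>", "Dune", "</b>", "<i>", "$9", "</i>"]])
print(extract(delta, ["<b>", "Emma", "</b>", "<i>", "$7", "</i>"]))
print(extract(delta, ["<p>", "x"]))
```

Once trained, the wrapper only follows table lookups, so no page analysis is repeated at extraction time. In this sketch a `None` result marks a page the wrapper does not yet cover; sending such pages back to the oracle is one plausible way to select further training pages, though that is an assumption here and not a description of the thesis's procedure.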

Keywords

★ Information Systems
★ Data Extraction and Integration
★ Deep Web
★ Wrappers (data mining)
★ ETL
★ Data Exchange

Table of Contents

Acknowledgements
Chinese Abstract
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Contribution
Chapter 2 Literature Review
2.1 Wrapper
2.1.1 Wrapper Induction
2.1.2 Automated Data Extraction
2.1.3 Wrapper Maintenance
2.2 Finite State Machine
2.3 Active Learning
Chapter 3 Proposed Method
3.1 Training Phase: FSM Construction
3.2 Testing Phase: Universal Wrapper
Chapter 4 Experiment
4.1 Dataset
4.2 Evaluation
4.3 Baseline
4.3.1 KNN and SVM
4.3.2 CRF Suite
4.3.3 CNN-based Neural Networks
4.4 Result & Analysis
4.4.1 Small dataset: EXALG+TEX
4.4.2 SWDE dataset
4.4.3 Active page selection
Chapter 5 Conclusion
References

Advisor

Chia-Hui Chang (張嘉惠)

Approval Date

2020-07-23
