Abstract (English) |
The Web is now the primary means of obtaining information, and much of that information resides in the deep web. In web data extraction, the page-level approach is a more comprehensive solution than the record-level approach because it generates a more complete page schema, one that can extract all the data on a page.
However, most research on web data extraction focuses on algorithms for schema induction or extraction rather than on end-user services. This thesis therefore presents an ETL (extract-transform-load) system with an automated crawler, based on unsupervised extraction. Users can extract web data and output it (e.g., as an API endpoint or a static export) through a user-friendly GUI, without any programming. We hope this work simplifies management of the entire complex process and makes web data extraction convenient for the general public. |
References |
[1] J.-L. Ding, C.-H. Chang, "Page-level Information Extraction System", Master's thesis 102525015, 2015.
[2] O. Y. Yuliana, C.-H. Chang, "DCADE: Divide and Conquer Alignment with Dynamic Encoding for Full Page Data Extraction", under review, ICDM 2018.
[3] A. Arasu, H. Garcia-Molina, "Extracting structured data from Web pages", in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, 2003.
[4] M. Kayed, C.-H. Chang, "FiVaTech: Page-Level Web Data Extraction from Template Pages", IEEE Transactions on Knowledge and Data Engineering, vol. 22, pp. 249-263, 2010.
[5] H. A. Sleiman, R. Corchuelo, "TEX: An efficient and effective unsupervised Web information extractor", Knowledge-Based Systems, vol. 39, pp. 109-123, 2013.
[6] S. Zheng, R. Song, J.-R. Wen, C. L. Giles, "Efficient Record-Level Wrapper Induction", CIKM'09, November 2–6, 2009.
[7] M. Geel, T. Church, M. C. Norrie, "Sift: An End-User Tool for Gathering Web Content on the Go", DocEng'12, September 4–7, 2012.
[8] J. Stárka, L. Holubová, M. Nečaský, "Strigil: A Framework for Data Extraction in Semi-Structured Web Documents", iiWAS 2013.
[9] Import.io, http://import.io
[10] Dexi.io, https://dexi.io
[11] CAPTCHA, Wikipedia, https://en.wikipedia.org/wiki/CAPTCHA
[12] Puppeteer, https://pptr.dev
[13] MongoDB, https://www.mongodb.com
[14] Y.-K. Lai, C.-H. Chang, "Design and Implementation of Mobile Web Creator with Componentized Template", unpublished. |