Web Data ETL System with Unsupervised Extraction

Name

Yu-An Chou (周昱安)

Department

Graduate Institute of Software Engineering

Thesis Title

Web Data ETL System with Unsupervised Extraction
(非監督式網頁資料擷取、轉置、載入與輸出系統)

Files

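This electronic thesis is authorized for immediate open access.
Electronic full texts that have reached their open-access date are licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid violating the law.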

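Abstract (Chinese)

The Web has become one of the primary and largest channels through which people obtain information, and Deep Web content in particular has high reuse value. In the field of web data extraction, page-level approaches, compared with record-level approaches, can generate a complete page schema for pages that share the same template, covering the extraction needs of all data on the page; they can therefore be regarded as the more complete solution for data extraction.
In addition, most research on web data extraction focuses only on algorithms for data extraction and schema induction, without further integrating data transformation and output services to extend the use of the extracted results. This study therefore builds on an unsupervised web data extraction system and implements a data transformation management system with an automated crawler. Through an intuitive graphical interface, users can run automated crawls without writing code and adjust and output the results as needed (e.g., API endpoint, static export), realizing an ETL service of extraction (Extract), transformation (Transform), and loading (Load). We hope to manage the entire complex process systematically and to bring applications of this field to general users.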

Abstract (English)

The Web is one of the most important and largest sources of information today, especially the deep web. In web data extraction, the page-level approach is a more comprehensive solution than the record-level approach because it can generate a more complete page schema for extracting all the data on a page.

Moreover, most research on web data extraction focuses on algorithms for schema induction or extraction rather than on end-user services. This thesis therefore provides an ETL (extract-transform-load) system with an automated crawler, built on unsupervised extraction. Users can extract and output web data (e.g., via an API endpoint or static export) through a user-friendly GUI, without any programming. We hope this work simplifies management of the entire complex process and brings convenient web data extraction to the general public.
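
The extract, transform, load flow named in the abstract can be illustrated with a small sketch. This is not the thesis implementation: the sample HTML, the `RowExtractor` class, and the `extract`, `transform`, and `load` helpers are all hypothetical, and a hand-written table parser stands in for the unsupervised schema induction that the thesis system performs. The column names `title` and `price` are likewise assumed for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample page; a real system would crawl template-generated pages.
SAMPLE_HTML = """
<table>
  <tr><td>Book A</td><td> 120 </td></tr>
  <tr><td>Book B</td><td> 95 </td></tr>
</table>
"""


class RowExtractor(HTMLParser):
    """Extract step: collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only text inside a table cell is kept.
        if self._in_cell:
            self._row[-1] += data


def extract(html):
    parser = RowExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def transform(rows):
    """Transform step: name the columns (assumed names) and convert types."""
    return [{"title": t.strip(), "price": int(p.strip())} for t, p in rows]


def load(records):
    """Load/output step: serialize the records as a static JSON export."""
    return json.dumps(records, ensure_ascii=False)


records = transform(extract(SAMPLE_HTML))
exported = load(records)
```

Keeping the three steps as separate functions mirrors the idea of adjusting the transformation or swapping the output target (for example, serving `records` from an API endpoint instead of exporting a file) without touching the extraction step.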

Keywords (Chinese)

★ unsupervised web data extraction
★ automated crawler

Keywords (English)

★ unsupervised web data extraction
★ automated crawler
★ ETL

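Table of Contents

Chinese Abstract
Abstract
1. Introduction
2. Related Work
2.1 Unsupervised Web Data Extraction Systems
2.2 Application-Oriented Data Extraction Research
2.3 Commercial Web Data Extraction Services
3. System Architecture and Design
3.1 Extractor Manager
3.2 Crawler
3.3 HTML Preprocessor
3.4 Extraction Handler
3.5 Output Service
3.6 Scheduler
3.7 Extractor Database
3.8 Integration with Other Systems
4. Experiments and Discussion
4.1 Qualitative Analysis
4.2 Quantitative Analysis of User Ratings
5. Conclusion
6. References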

Advisor

Chia-Hui Chang (張嘉惠)

Approval Date

2018-8-20
