Name: Jhong-li Ding (丁中立)
Department: Graduate Institute of Software Engineering
Thesis title: Page-level Information Extraction System (網頁層級資料擷取系統)
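- This electronic thesis is authorized for immediate open access.
- Electronic full texts that have reached their open-access date are licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
- Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid breaking the law.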
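Abstract (Chinese) In the field of web data extraction, the question of how to automatically extract data from web pages with differing structures has been studied for a decade. However, because the content of today's web pages is diverse and their structure is complex, every existing method has its limitations. Combined with the demand for extracting data from large numbers of pages, this means that web data extraction research still faces considerable challenges.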
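Web data extraction systems fall into two main categories: record level and page level. Although page-level extraction yields more complete page information than record-level extraction, the complexity of the problem and the difficulty of implementation leave room for improvement in both the effectiveness and the efficiency of the systems proposed so far. In addition, existing systems all require users to have an IT background and do not provide a simple, friendly graphical user interface (GUI).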
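In this thesis, we propose a page-level information extraction system built on the architecture of the page-level system proposed by M.-C. Chen and T.-S. Chen. It provides a simple, friendly GUI with which users can quickly extract the web information they need. We also improve the training procedure to raise the system's extraction performance. The experiments show that the improved training procedure does not degrade the already good results on list pages, and that on detail pages precision rises by 33.08% and recall by 32.4%. In the overall comparison, the improved system achieves the highest recall. For accuracy, the experiments show that the improved system, using only its default module parameter values, already has a much higher overall accuracy than TEX. With manually tuned module parameters, overall accuracy rises further to 98.8%, which is 27% higher than TEX.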
Abstract (English) The problem of web data extraction has been studied for more than ten years. Because of the structural complexity and diversity of web pages, existing research is limited to record-level data extraction. Besides, the demand for extracting data from large numbers of web pages makes it a challenging task for researchers.
Although the web data extracted by a page-level approach is more complete than that extracted by a record-level approach, very little research focuses on this task because of the difficulty and complexity of the problem. In addition, existing web data extraction systems require users with an IT background, because they do not provide a friendly GUI.
In this thesis, we present a page-level web data extraction system based on the work of M.-C. Chen and T.-S. Chen. We provide a friendly GUI for users and improve the training procedure of the schema induction process. The experimental results show that performance on list-page websites remains high, while on detail pages precision increases by 33.08% and recall by 32.4%. In addition, the improved system achieves the highest recall among the compared systems. For accuracy, our system outperforms TEX even with the default thresholds. If we adjust the model thresholds, overall accuracy improves from 94.5% to 98.8%, which is 27% higher than TEX.
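The precision and recall figures quoted in the abstract can be read through the standard definitions of the two metrics. The following is a minimal sketch, assuming extracted and ground-truth data items are compared as sets of (page, field, value) tuples with exact matching. That matching rule, the function name `precision_recall`, and the sample data are all hypothetical and chosen for illustration; this record does not describe the thesis's actual evaluation procedure.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall(extracted: Iterable[Hashable],
                     expected: Iterable[Hashable]) -> Tuple[float, float]:
    """Return (precision, recall) for extracted items against ground truth.

    Items are compared by exact equality; this matching rule is an
    assumption for illustration, not the thesis's evaluation protocol.
    """
    extracted_set = set(extracted)
    expected_set = set(expected)
    true_positives = len(extracted_set & expected_set)
    # Guard against division by zero when either side is empty.
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(expected_set) if expected_set else 0.0
    return precision, recall


# Hypothetical detail-page example: (page, field, value) tuples.
truth = {("p1", "title", "A"), ("p1", "price", "10"),
         ("p2", "title", "B"), ("p2", "price", "20")}
output = {("p1", "title", "A"), ("p1", "price", "10"),
          ("p2", "title", "B"), ("p2", "price", "99")}

p, r = precision_recall(output, truth)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.75
```

In this toy case three of the four extracted items are correct and three of the four expected items are found, so both metrics are 0.75. A system that extracts fewer items but makes no errors would trade recall for precision.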
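Keywords (Chinese) ★ Data extraction (資料擷取)
★ Landmark (地標)
★ User interface (使用者介面)
Keywords (English)
Table of contents
Chinese Abstract
Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1. Introduction
2. Related Work
2.1 Unsupervised Page-Level Extraction Systems
2.2 Page-Level Extraction Rule Verification Systems
3. System Architecture and Methodology
3.1 Full Page Alignment
3.2 Dynamic Encoding Extension
3.2.1 Triggering Dynamic Encoding Extension
3.2.2 Target Path Search Algorithm
3.3 Low Density Combine
3.3.1 Triggering Low Density Combine
3.3.2 Low Density Combine Algorithm
4. Graphical Interface of the Extraction System
New Project
Schema Training
Data Table View
Application Programming Interface (API)
Parameter Settings
5. Experiments
5.1 Performance Comparison of the Training Procedure Improvements
5.1.1 List Pages
5.1.2 Detail Pages
5.2 Performance Comparison Among Data Extraction Systems
5.2.1 From the Perspective of Base Nodes
5.2.2 From the Perspective of Data Nodes
6. Conclusion
References
Appendix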
References
[1] A. Arasu, H. Garcia-Molina, "Extracting structured data from Web pages", presented at the Proceedings of the 2003 ACM SIGMOD international conference on Management of data, San Diego, California, 2003.
[2] R. Baumgartner, S. Flesca, G. Gottlob, "Visual Web Information Extraction with Lixto", 27th International Conference on Very Large Data Bases, 2001.
[3] M. Bronzi, V. Crescenzi, P. Merialdo, P. Papotti, "Extraction and Integration of Partially Overlapping Web Sources", 39th International Conference on Very Large Data Bases, 2013.
[4] C.-H. Chang, M. Kayed, M. Ramzy Girgis, K. Shaalan, "A Survey of Web Information Extraction Systems", Knowledge and Data Engineering, IEEE Transactions on, Vol 18(10), pp.1411-1428, 2006.
[5] C.-H. Chang, S.-C. Lui, "IEPAD: information extraction based on pattern discovery", presented at the Proceedings of the 10th international conference on World Wide Web, Hong Kong, Hong Kong, 2001.
[6] C.-H. Chang, Y.-L. Lin, K.-C. Lin, and M. Kayed, "Page-Level Wrapper Verification for Unsupervised Web Data Extraction", in Web Information Systems Engineering, 2013.
[7] M.-C. Chen, T.-S. Chen, C.-H. Chang, "應用路徑資訊輔助樣板探勘於網頁層級之資料擷取研究" (Applying Path Information to Assist Template Mining for Page-Level Data Extraction), Conference on Technologies and Applications of Artificial Intelligence, 2013.
[8] T.-S. Chen, M.-C. Chen, C.-H. Chang, "基於頁面層級之快速網頁資料擷取與綱要驗證" (Fast Page-Level Web Data Extraction and Schema Verification), Conference on Technologies and Applications of Artificial Intelligence, 2014.
[9] V. Crescenzi, G. Mecca, and P. Merialdo, "RoadRunner: Towards Automatic Data Extraction from Large Web Sites", presented at the Proceedings of the 27th International Conference on Very Large Data Bases, 2001.
[10] M. Kayed, C.-H. Chang, "FiVaTech: Page-Level Web Data Extraction from Template Pages", IEEE Transactions on Knowledge and Data Engineering, vol. 22, pp. 249-263, 2010.
[11] A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, J. S. Teixeira, "A Brief Survey of Web Data Extraction Tools", SIGMOD, Vol. 31, pp. 84-93, 2002.
[12] S. Lingam, S. Elbaum, "Supporting End-Users in the Creation of Dependable Web Clips", World Wide Web, 2007.
[13] H. A. Sleiman and R. Corchuelo, "TEX: An efficient and effective unsupervised Web information extractor", Know.-Based Syst., vol. 39, pp. 109-123, 2013.
[14] S. Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Mach. Learn., Vol 34(1-3), pp. 233-272, 1999.
[15] J. Wong, J. I. Hong, "Making Mashups with Marmite: Towards End-User Programming for the Web", CHI, 2007.
Advisor: Chia-hui Chang (張嘉惠)
Approval date: 2015-7-29