Web Page Classification and Cleaning for Information Extraction

Name

Jen-Yu Liu (劉仁宇)

Department

In-service Master's Program, Department of Computer Science and Information Engineering

Thesis title

Web Page Classification and Cleaning for Information Extraction
(以網頁識別及清理改善資料擷取的研究)

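The author has authorized immediate open access to this electronic thesis.
Full text that has reached its open-access date is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.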

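Abstract (Chinese)

With Internet use now widespread and the volume of available information growing steadily, the main difficulty users face is not how much information exists but whether the extracted data meets their actual needs. Web content extraction commonly runs into two difficulties. First, irrelevant data appears outside the target region. Second, a small amount of noise is mixed into the target region itself, which reduces extraction accuracy. In addition, the target content itself cannot always be identified completely, because there is no strict grammar or boundary between terms.
For this reason, this thesis aims to improve extraction accuracy through web page cleaning. We use an SVM classifier together with page cleaning to produce the input pages for extraction, and we use the SoftMealy extractor, which generates extraction rules with a rule-induction algorithm. Based on this idea we propose CBIE (Cleaning Based Information Extraction). In our experiments, we take conference websites listed on DBWorld whose accepted-paper announcement dates are known, identify the pages that list accepted papers, clean those pages, and extract the paper titles and authors. The results show a considerable improvement and demonstrate that the page-cleaning idea is feasible.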
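
To make the page-cleaning step concrete, here is a minimal sketch in Python. It is not the thesis's implementation: it assumes that "noise" means text inside a hypothetical set of tags (`NOISE_TAGS`) plus text lines shorter than a hypothetical `min_words` threshold, and it approximates line detection by breaking on common block-level tags. The names `PageCleaner` and `clean_page` are illustrative only.

```python
from html.parser import HTMLParser

# Hypothetical noise tags; the thesis's actual noise filter is not described here.
NOISE_TAGS = {"script", "style", "nav", "header", "footer"}
# Tags treated as line breaks, a rough stand-in for line detection.
BLOCK_TAGS = {"p", "div", "li", "br", "tr", "h1", "h2", "h3"}


class PageCleaner(HTMLParser):
    """Collect visible text line by line, skipping text inside noise tags."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0   # > 0 while inside any noise tag
        self.lines = [""]      # text lines collected so far

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in BLOCK_TAGS:
            self.lines.append("")  # start a new line

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1
        elif tag in BLOCK_TAGS:
            self.lines.append("")  # close the current line

    def handle_data(self, data):
        if self.noise_depth == 0:
            self.lines[-1] += data


def clean_page(html, min_words=3):
    """Return the text lines that survive both filters."""
    parser = PageCleaner()
    parser.feed(html)
    parser.close()
    segments = (" ".join(line.split()) for line in parser.lines)
    # Drop empty and very short lines (assumed to be headings or link labels).
    return [s for s in segments if len(s.split()) >= min_words]
```

Calling `clean_page` on a page whose navigation bar and footer surround a list of accepted papers returns only the list entries, which is the kind of input an extractor would then receive. A real cleaner would need a learned or tuned notion of noise rather than a fixed tag list.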

Abstract (English)

With the popularization of the Internet, the challenge users face is not the sheer quantity of information but the difficulty of extracting the information they want from web pages. In web information extraction, researchers face at least two difficulties that can reduce the precision and accuracy of the results. The first is irrelevant data that appears outside the target areas. The second is noise mixed in with the desired content inside the target areas. In addition, the desired content may not be identified completely because it lacks clear separators.
The purpose of this thesis is to address these difficulties in web information extraction by incorporating page cleaning techniques. We use a Support Vector Machine (SVM) to train a classifier for page cleaning. The cleaned pages are then used to generate extraction rules with SoftMealy. The proposed approach, called CBIE (Cleaning Based Information Extraction), was applied to extracting paper titles and authors from accepted-paper pages identified on conference websites. The results show that extractor performance was higher on the cleaned pages than on the original web pages.
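
As a rough illustration of the extraction step, the sketch below runs a two-state finite-state transducer over the tokens of one cleaned line and emits a title and an author list. It is only a sketch under stated assumptions: SoftMealy induces its transition rules from labeled examples, whereas here the rules are hand-written, and the line format "Title, by Author and Author" as well as the function name `extract_record` are hypothetical.

```python
def extract_record(line):
    """Split one line into a title and authors with a TITLE -> AUTHORS transducer."""
    state = "TITLE"
    title_tokens, authors, current = [], [], []
    for token in line.split():
        if state == "TITLE":
            # Hand-written separator rule: the token "by" ends the title.
            if token.lower() == "by":
                state = "AUTHORS"
            else:
                title_tokens.append(token)
        else:
            # In AUTHORS, a trailing comma or the word "and" ends one name.
            ends_name = token.endswith(",")
            word = token.rstrip(",")
            if word.lower() == "and":
                ends_name, word = True, ""
            if word:
                current.append(word)
            if ends_name and current:
                authors.append(" ".join(current))
                current = []
    if current:
        authors.append(" ".join(current))
    title = " ".join(title_tokens).rstrip(",")
    return {"title": title, "authors": authors}
```

The hand-written rule fails on titles that themselves contain the word "by", which is one reason a learned extractor with contextual rules is preferable to fixed separators.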

Keywords (Chinese)

★ 資料擷取 (information extraction)
★ 機器學習 (machine learning)

Keywords (English)

★ Information extraction
★ machine learning

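Table of contents

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Problem Analysis
1.5 Research Methods and Techniques
1.6 Thesis Organization
Chapter 2 Literature Review and Techniques
2.1 Information Extraction Techniques for Semi-structured Documents
2.1.1 IEPAD
2.1.2 Embley
2.1.3 WIEN
2.1.4 STALKER
2.1.5 SoftMealy
2.2 Related Techniques
2.2.1 Finite-State Transducer (FST)
2.2.2 DOM Tree
2.2.3 Line Segment Detection (LSD)
2.2.4 Support Vector Machine (SVM)
Chapter 3 System Architecture and Implementation
3.1 SVM Web Page Classification Model
3.2 Page Cleaning Module
3.2.1 DOM Tree Block Detection
3.2.2 Line Segment Detection (LSD)
3.2.3 Noise Filter
3.3 Attribute Labeling / Learning, Recognition, and Extraction
Chapter 4 Experiments
4.1 Preliminary Notes on the Experimental Test Data
4.2 Experimental Results and Discussion
4.2.1 Experimental Results
4.2.2 Discussion
Chapter 5 Conclusion
References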

References

1. Chia-Hui Chang and Chun-Nan Hsu. Automatic Extraction of Information Blocks Using PAT Trees. In Proceedings of the 1999 National Computer Symposium (NCS-1999), Tamkang University, Tamsui, Taiwan, Dec 1999.
2. Chia-Hui Chang, Shao-Chen Lui, and Yen-Chin Wu. Applying Pattern Mining to Web Information Extraction. In Proceedings of the 5th Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD-2001), pp. 4-16, Hong Kong, Apr 2001.
3. Chia-Hui Chang and Shao-Chen Lui. IEPAD: Information Extraction based on Pattern Discovery. In Proceedings of the 10th International Conference on World Wide Web (WWW10), pp. 595-609, Hong Kong, May 2001.
4. D. Embley, Y. Jiang, and Y.-K. Ng. Record-boundary discovery in web documents. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'99), pp. 467-478, Philadelphia, PA, 1999.
5. N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper Induction for information extraction. In Proceedings of the 15th International Joint Conference on AI (IJCAI-97), pp. 729-737, 1997.
6. N. Kushmerick. Wrapper Induction: Efficiency and expressiveness. In Proceedings of the AAAI-98 Workshop on Artificial Intelligence and Information Integration, pp. 15-68, AAAI Press, Menlo Park, California, 1998.
7. I. Muslea, S. Minton, and C. Knoblock. STALKER: learning extraction rules for semi-structured, Web-based information sources. In Proceedings of the AAAI-98 Workshop on AI and Information Integration, Technical Report WS-98-01, AAAI Press, Menlo Park, California, 1998.
8. I. Muslea, S. Minton, and C. Knoblock. A hierarchical approach to wrapper induction. In Proceedings of the 3rd International Conference on Autonomous Agents (Agents-99), pp. 190-197, Seattle, Washington, 1999.
9. Chun-Nan Hsu and Ming-Tzung Dung. Generating finite-state transducers for semi-structured data. Journal of Information Systems, Special Issue on Semi-structured Data, Volume 23, pp. 521-537, Aug 1998.
10. Chun-Nan Hsu and Chien-Chi Chang. Finite-state transducers for semi-structured text mining. In Proceedings of the IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, pp. 38-49, Stockholm, Sweden, 1999.
11. Chun-Nan Hsu. Initial Results on Wrapping Semi-structured Web Pages with Finite-State Transducers and Contextual Rules. 1998.
12. Dan DiPasquo. Using HTML Formatting to Aid in Natural Language Processing on the World Wide Web. Senior Honors Thesis, School of Computer Science, Carnegie Mellon University, June 1998.
13. Alfred V. Aho. Algorithms for finding patterns in strings. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, pp. 255-300, Elsevier, 1990.
14. Boris Chidlovskii, Jon Ragetli, and Maarten de Rijke. Wrapper Generation via Grammar Induction. ECML 2000, 11th European Conference on Machine Learning, January 7, 2000.
15. Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.
16. Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
17. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
18. Document Object Model (DOM) Level 2 Traversal and Range Specification, Version 1.0. W3C Recommendation, 13 November 2000.
19. SVM - Support Vector Machines, http://www.dtreg.com/svm.htm
20. Chih-Jen Lin. LIBSVM, http://www.csie.ntu.edu.tw/%7Ecjlin/libsvm/index.html
21. Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods.

Advisor

Chia-Hui Chang (張嘉惠)

Approval date

2006-7-24
