Web Cleaning: Page Segmentation and Data-rich Section Mining

Name

Hong-Ru Lee (李泓儒)

Department

Department of Computer Science and Information Engineering (資訊工程學系)

Thesis Title

Web Cleaning: Page Segmentation and Data-rich Section Mining
(淨化網頁：網頁區塊化以及資料區域擷取)

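The author has agreed to make this electronic thesis openly available immediately.
Once open, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without permission.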

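Abstract (Chinese)

Web pages are the primary way of presenting large online databases. Besides the main data-rich section that a page is meant to display, a page contains many other parts, such as navigation links, advertisements, decorative images and text, and copyright notices. Each part serves its own function. Dividing a whole page into small unit blocks, each with an independent function, is useful in many applications, such as network transmission and storage, and browsing on PDAs and mobile phones.
Most users are probably interested only in the data-rich section that carries the page's actual content. The other parts make browsing more convenient for people but are very hard for computer programs to handle. Information retrieval, information extraction, classification, and clustering of web pages all become more difficult because the content of a page serves more than one topic or purpose.
This thesis proposes an effective method, the PSDSM algorithm, which divides a whole web page into many small, self-contained blocks that each serve a single function, and then finds among these blocks the data-rich section that carries the page's actual content. The method has two parts. First, page segmentation: the repeated structures of a web page are used to group parts with similar content into the same block. Second, data-rich section mining: the contents of corresponding blocks in two web pages are compared to decide which block is the data block.
Experimental results show that, across many different types of websites, the data-rich sections extracted by this method match the actual content users are interested in almost perfectly. In addition, when the method is applied to the web information extraction system IEPAD and to web page classification, the extracted data-rich sections help in both. Finally, comparing the amount of data in the data-rich section with that of the whole page shows that, for some websites, the data volume is reduced by as much as 75%, which should benefit network transmission.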
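
The page segmentation step described above can be illustrated with a minimal sketch. It is not the thesis's PSDSM implementation: the `Node` class, the `signature` encoding, and the `segment` function are hypothetical names introduced here, and the rule "consecutive siblings with an identical tag structure form one block" is an assumed simplification of the repeated-structure idea.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import groupby


@dataclass
class Node:
    """A simplified DOM node: a tag name plus child nodes (text is ignored)."""
    tag: str
    children: list[Node] = field(default_factory=list)


def signature(node: Node) -> str:
    """Encode the tag structure of a subtree as a string (assumed encoding)."""
    inner = "".join(signature(child) for child in node.children)
    return f"{node.tag}({inner})"


def segment(parent: Node) -> list[list[Node]]:
    """Group consecutive children that share a signature into one block."""
    return [list(run) for _, run in groupby(parent.children, key=signature)]


# Hypothetical page body: a navigation div, three identical rows, a footer.
body = Node("body", [
    Node("div", [Node("a")]),
    Node("tr", [Node("td")]),
    Node("tr", [Node("td")]),
    Node("tr", [Node("td")]),
    Node("p"),
])
blocks = segment(body)
block_sizes = [len(block) for block in blocks]
```

Here the three structurally identical rows fall into a single block, while the navigation and footer nodes each form their own block. A real page would need a tolerance for near-identical structures, which this sketch does not attempt.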

Abstract (English)

Web pages are the main way of presenting large amounts of online data. A web page often contains many segments, including the main content of the page (which we call the “data-rich section”), navigation bars, advertisements, copyright and privacy notices, and decorative images and extraneous links. Each segment has its own function. Dividing web pages into independent segments has many applications, for example network caching and browsing on cell phones and PDAs.
Many people are interested only in the main content (the data-rich section) of a page. The other segments help human browsing, but these “human-oriented” segments are difficult for computer programs to parse. Because these segments give a page more than one purpose, they can seriously harm web data mining.
We propose the PSDSM algorithm, which segments a web page into single-purpose, independent blocks and identifies the data-rich section. Our approach has two parts. First, we use the repeated structures of a web page to segment it. Second, we identify the data-rich section by block comparison.
Experimental results show that the data-rich sections mined by our PSDSM algorithm almost match the actual content that users are interested in. The algorithm also benefits web information extraction (IEPAD) and web page classification. Keeping only the data-rich section effectively reduces the size of a web page, which lowers network transfer cost.
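
The block comparison step can be sketched as follows. This is an illustration under stated assumptions, not the thesis's method: pages are assumed to be already segmented into blocks keyed by a shared block id, and the score (textual difference multiplied by block length) is a hypothetical choice motivated by the two factors the thesis outline names, block difference across pages and block size. The names `data_rich_block` and `size_reduction` are introduced here for illustration.

```python
from difflib import SequenceMatcher


def data_rich_block(page_a: dict[str, str], page_b: dict[str, str]) -> str:
    """Return the id of the block that differs most between two pages.

    Template blocks (navigation, footer) are nearly identical across pages
    of one site, so they score close to zero. Raises ValueError if the two
    pages share no block ids.
    """
    def score(block_id: str) -> float:
        a, b = page_a[block_id], page_b[block_id]
        difference = 1.0 - SequenceMatcher(None, a, b).ratio()
        return difference * max(len(a), len(b))  # assumed scoring rule

    shared = page_a.keys() & page_b.keys()
    return max(shared, key=score)


def size_reduction(page: dict[str, str], block_id: str) -> float:
    """Fraction of the page text removed by keeping only one block."""
    total = sum(len(text) for text in page.values())
    return 1.0 - len(page[block_id]) / total


# Hypothetical pages from the same site, already split into blocks.
page_a = {
    "nav": "Home | Products | Support | Contact",
    "main": "Review of a digital camera with a 5 megapixel sensor and 3x zoom",
    "footer": "Copyright 2004. All rights reserved.",
}
page_b = {
    "nav": "Home | Products | Support | Contact",
    "main": "Specification of a laser printer, 20 pages per minute, duplex",
    "footer": "Copyright 2004. All rights reserved.",
}
chosen = data_rich_block(page_a, page_b)
saved = size_reduction(page_a, chosen)
```

In this example only the `main` block differs between the two pages, so it is selected, and `size_reduction` reports how much of the page text the other blocks accounted for.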

Keywords (Chinese)

★ page segmentation (網頁區塊化)
★ data-rich section (資料區域)

Keywords (English)

★ page segmentation
★ data-rich section

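Table of Contents

Chapter 1 Introduction
1.1 Problem Definition
1.2 Contributions
1.3 Thesis Organization
Chapter 2 Applications and Motivation
2.1 Applications
2.2 IE Systems
2.3 Motivation and Research Direction
Chapter 3 Related Work
3.1 Segmenting Web Pages Using Tag Properties and Visual Cues
3.2 Segmenting Web Pages and Comparing Pages to Find the Data-rich Section
3.3 Extracting the Data-rich Section Directly from the DOM Tree
3.3.1 Single-Page Extraction
3.3.2 Multi-Page Extraction
Chapter 4 The PSDSM Algorithm
4.1 Page Segmentation
4.1.1 Repeated Blocks
4.1.2 Problems with Considering Only Repeated Blocks
4.1.3 Subtree Structure Encoding
4.1.4 Re-finding Repeated Blocks
4.1.5 Overall Page Segmentation Algorithm
4.2 Data-rich Section Mining
4.2.1 Differences Between Blocks Across Web Pages
4.2.2 Importance of Block Size
4.2.3 Data-rich Section Extraction: Multi-Record Pages
4.2.4 Data-rich Section Extraction: Single-Record Pages
Chapter 5 Experimental Results
5.1 Accuracy of the Data-rich Section
5.2 Application: IEPAD
5.3 Application: Classification Algorithms
5.4 Reduction in Web Page Data
Chapter 6 Conclusion and Future Work
References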

Advisor

Chia-Hui Chang (張嘉惠)

Approval Date

2004-7-14