Generation of Web page Fetchers from Navigation Records

Name

Li-fang Chang (張立帆)

Department

Department of Computer Science and Information Engineering

Thesis Title

Generation of Web page Fetchers from Navigation Records

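This electronic thesis is authorized for immediate open access.
Once open, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid breaking the law.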

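Abstract

In today's age of information explosion, the World Wide Web holds an immeasurable amount of data of every kind. Extracting and integrating these data quickly and effectively into useful information or knowledge has become a popular research topic in recent years. Most documents circulating on the Web are HTML, a semi-structured language designed for human browsing, which makes them ill-suited to analysis and comparison. If information extraction techniques can convert HTML pages into structured data, stored in a unified database or as XML documents, applications such as price comparison across shopping sites or news collection benefit greatly. How to filter, collect, extract, and integrate HTML pages on the Web has therefore become an important research problem.
This thesis divides information extraction research into two techniques: Web page fetching and data extraction. Data extraction has been studied for a long time, and both supervised and unsupervised data extraction systems have contributed greatly to extracting data from Web pages. Most of this work, however, focuses on how to extract data from pages and neglects how to fetch the pages that need extraction. The number of such pages is large, so fetching them manually one by one is inefficient. Moreover, most of them are generated from the same page template, so browsing or fetching them repeats the same actions. Some studies therefore let users build their own navigation model for fetching pages, but this requires users to first learn the navigation model defined by the system, which is an unnatural way for them to work.
The Web page fetching system proposed in this thesis uses the IE browser as the browsing environment. The user browses some of the pages to be fetched in the usual way, while the system records the visited pages and the browsing actions. From this navigation record the system builds a model of the user's navigation, and an executor then fetches the required pages. In addition, because pages may contain client-side programs, the executor also fetches pages by using the IE browser to simulate the user's browsing process, so that the client-side programs in a page are executed while the page is fetched.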
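The record, model, and execute pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's algorithm, which this record does not describe. Everything here is an assumption made for illustration: the function names `tokenize`, `build_template`, and `run_fetcher` are hypothetical, the "model" is reduced to a URL template inferred from a few visited URLs, and `fetch` is a caller-supplied stand-in for the IE-based executor, so no client-side programs are actually run.

```python
import re
from typing import Callable, Dict, Iterable, List

PLACEHOLDER = "{}"  # marks the part of the URL that varies between visits


def tokenize(url: str) -> List[str]:
    """Split a URL into alternating runs of digits and non-digits."""
    return re.findall(r"\d+|\D+", url)


def build_template(visited: List[str]) -> str:
    """Generalize recorded URLs that differ in exactly one token.

    Hypothetical stand-in for a navigation model builder: the recorded
    URLs play the role of the navigation record.
    """
    token_lists = [tokenize(u) for u in visited]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("visited URLs do not share one structure")
    parts = [
        column[0] if len(set(column)) == 1 else PLACEHOLDER
        for column in zip(*token_lists)
    ]
    if parts.count(PLACEHOLDER) != 1:
        raise ValueError("expected exactly one varying part")
    return "".join(parts)


def run_fetcher(
    template: str,
    values: Iterable[object],
    fetch: Callable[[str], str],
) -> Dict[str, str]:
    """Fill the template with each value and fetch the resulting page.

    `fetch` is an assumption: any callable mapping a URL to page content.
    """
    pages: Dict[str, str] = {}
    for value in values:
        url = template.replace(PLACEHOLDER, str(value), 1)
        pages[url] = fetch(url)
    return pages


if __name__ == "__main__":
    # The user "browses" two example pages; the rest are fetched automatically.
    record = [
        "http://example.com/list?page=1",
        "http://example.com/list?page=2",
    ]
    template = build_template(record)
    pages = run_fetcher(template, range(3, 6), lambda url: f"<html>{url}</html>")
    for url in pages:
        print(url)
```

Passing `fetch` in as a parameter keeps the sketch free of network access and leaves room for a browser-driven implementation, which is what the abstract says the real executor uses.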

Keywords

★ Navigation Record
★ Web page Fetcher

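Table of Contents

List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Work
2.1 Fetching from a Single Web Site
2.2 Web Page Fetching Systems for Multiple Web Sites
Chapter 3 System Architecture and Algorithms
3.1 Navigation Recorder
3.2 Navigation Model Builder
3.3 Web Page Fetching Executor
Chapter 4 Experiments
4.1 Edit Distance Threshold Experiment for Similar Pages
4.2 Form Queries and Fetching Multi-Record Pages
4.3 Fetching Category Directory Pages
4.4 Summary
Chapter 5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
References
Appendix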

Advisor

Chia-hui Chang (張嘉惠)

Approval Date

2005-07-13
