A Study of Template and Data Analysis for Dynamic Web Pages


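Author

Ji-Hao Li (李季壕)

Department

Department of Computer Science and Information Engineering

Thesis Title

A Study of Template and Data Analysis for Dynamic Web Pages (動態網頁之樣版與資料分析研究)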
(Differentiating Templates and Data Values from Semi-Structured Web Pages)



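The author has agreed to make this electronic thesis openly accessible immediately.
Once open, the electronic full text is licensed only for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China: do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.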

Abstract (translated from Chinese)

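With the rapid growth of the World Wide Web, more and more companies and individual users publish their information and data on the Web. The popularity of web services such as online bookstores, online shopping, and portal sites has further driven explosive growth in Web usage. Such sites usually provide a search interface for querying their data: for example, a CGI program searches the site's database, embeds the data relevant to the user's query into a fixed page template, and returns the resulting page to the user. We call pages generated in this way dynamic web pages (Dynamic HTML).
These pages are often cluttered with advertisements from various sources and with information unrelated to the user's query. To filter out this noise and simplify data collection, research on information extraction emerged, with the aim of reducing the tedious work of data collection through extraction systems. Such systems matter greatly to people doing information integration: to integrate data from different websites, they must first manually extract the content of each site and store it in Excel or a database before the back-end information integration step can begin and an integrated web service or data analysis can be offered. Moreover, website developers often modify their sites as requirements change, so integration work that was already complete has to be repeated: the pages must be re-extracted, re-analyzed, and re-integrated.
For this reason, many automatic web page extraction systems have recently been widely discussed. From a set of dynamic pages collected from a website, they compare the pages against one another to produce an extraction module for that site's dynamic page data, called a wrapper, and then use that module to extract data from the site's dynamic pages. This approach not only simplifies the work of web information integrators but also speeds up integration across different types of websites.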
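The page-comparison idea described above can be illustrated with a small sketch. It is not EXALG and not the method of this thesis; it only assumes the occurrence-vector and equivalence-class notions named in the keywords and table of contents. Pages are first generated by embedding records into a fixed template, and tokens are then grouped by how often they occur in each page. All names here (PAGE, ROW, render, tokenize, equivalence_classes, template_tokens) are hypothetical and chosen for illustration only.

```python
import re
from collections import defaultdict
from string import Template

# Hypothetical fixed layout standing in for a site's page template.
PAGE = Template("<html><body><h1>Results</h1>$rows</body></html>")
ROW = Template("<li><b>$title</b> <i>$price</i></li>")


def render(records):
    """Embed query results into the fixed template, as a CGI program might."""
    rows = "".join(ROW.substitute(title=t, price=p) for t, p in records)
    return PAGE.substitute(rows=rows)


def tokenize(page):
    """Split a page into HTML tags and whitespace-separated words."""
    return re.findall(r"</?\w+>|[^<>\s]+", page)


def equivalence_classes(pages):
    """Group tokens that share the same occurrence vector.

    The occurrence vector of a token lists how many times it appears
    in each input page.
    """
    counts = defaultdict(lambda: [0] * len(pages))
    for i, page in enumerate(pages):
        for tok in tokenize(page):
            counts[tok][i] += 1
    classes = defaultdict(set)
    for tok, vec in counts.items():
        classes[tuple(vec)].add(tok)
    return dict(classes)


def template_tokens(pages):
    """Simplifying assumption: tokens present in every page are template."""
    found = set()
    for vec, toks in equivalence_classes(pages).items():
        if all(n > 0 for n in vec):
            found |= toks
    return found


# Three pages from the same template with one, two, and three records.
pages = [
    render([("Dune", "9")]),
    render([("Emma", "7"), ("Ulysses", "12")]),
    render([("Hamlet", "5"), ("Faust", "8"), ("Walden", "6")]),
]

for vector, tokens in sorted(equivalence_classes(pages).items()):
    print(vector, sorted(tokens))
```

In this toy input the page-level markup shares one occurrence vector and the per-record markup shares another, while each data value is missing from at least one page. Real pages are far less tidy, which is where the problems listed in Chapter 4 of the table of contents (inconsistent paired tags, identical occurrence vectors, template text nodes, false positives and negatives) come in.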

Keywords (Chinese)

★ Dynamic web pages (動態網頁)
★ Template (樣版)
★ Equivalence class (等價類)

Keywords (English)

★ dToken
★ equivalence class
★ EXALG

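Table of Contents

Chapter 1 Introduction
1.1 Thesis Organization
Chapter 2 Related Work
2.1 Record-Level Extraction Systems
2.2 Page-Level Extraction Systems
2.3 Site-Level Extraction Systems
2.4 Discussion of Extraction Systems
Chapter 3 EXALG: A Page-Level Information Extraction System
3.1 Preliminary Definitions
3.2 Page Generation Model
3.3 Problem Definition
3.4 Introduction to the EXALG Algorithm
Chapter 4 Potential Problems in EXALG and Solutions
4.1 Potential Problems in EXALG
4.1.1 Differentiating Inconsistently Paired Tags
4.1.2 The Ambiguity Problem: Multiple Equivalence Classes with the Same Occurrence Vector
4.1.3 Template Text Nodes with Multiple Occurrence Vectors
4.1.4 Impact of False Positive and False Negative Equivalence Classes
4.2 Our Solutions
4.2.1 Differentiating Paired Tags
4.2.2 Selecting among Equivalence Classes with the Same Occurrence Vector
4.2.3 Differentiating Text-Node Templates
4.2.4 Handling False Positives and False Negatives
Chapter 5 Experiments and Discussion
5.1 Experimental Evaluation
5.2 Results and Discussion
Chapter 6 Conclusion and Future Work
References
Appendix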


Advisor

Chia-Hui Chang (張嘉惠)

Approval Date

2005-7-13
