Template and Schema Guided Wrapper Verification based on CSP and Best State Sequence

Name

Kuan-Chen Lin (林冠辰)

Department

In-service Master's Program, Department of Computer Science and Information Engineering

Thesis title

Template and Schema Guided Wrapper Verification based on CSP and Best State Sequence
(基於CSP與最佳狀態序列之擷取程式驗證)

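Related theses

★ A Study on Recognizing Schedule Invitation Emails and Extracting Irregular Time Expressions
★ Design of the NCUFree Campus Wireless Network Platform and Development of Application Services
★ Design and Implementation of a Semi-structured Data Extraction System for the Internet
★ Mining and Application of Non-simple Browsing Paths
★ Improvements to Association Rule Mining on Incremental Data
★ Applying the Chi-square Test of Independence to Associative Classification
★ Design and Study of a Chinese Information Extraction System
★ Visualization of Non-numerical Data and Clustering that Combines Subjective and Objective Criteria
★ A Study of Associated Word Groups in Document Summarization
★ Web Page Cleaning: Page Segmentation and Data Region Extraction
★ Design and Study of a Question Answering System Using Sentence Classification and Ranking
★ Efficient Mining of Compact Frequent Continuous Event Patterns in Temporal Databases
★ Axis Arrangement of Star Coordinates Applied to Cluster Visualization
★ A Study on Automatically Generating Web Crawlers from Browsing History
★ A Study of Template and Data Analysis for Dynamic Web Pages
★ A Study on Automating Data Integration across Homogeneous Web Pages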

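This electronic thesis is authorized for immediate open access.
For full texts that have reached their open-access date, users are licensed only to search, read, and print them for personal, non-profit academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, so as to avoid breaking the law.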

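Abstract (Chinese)

This is an age of information explosion in which almost any information can be obtained over the Web. Consequently, recent research on web data extraction has proposed a variety of unsupervised extraction methods that extract data quickly and effectively for downstream value-added applications. However, the rapidly changing nature of the Web also poses a challenge for unsupervised wrappers. Generating extraction rules requires fairly complex computation and a considerable amount of time, so it is impractical to regenerate the rules every time data is extracted. Wrapper verification and maintenance mechanisms are therefore becoming increasingly important.
A wrapper parses page content and produces a page template (Template) and a data structure (Schema) with which to extract data. The main purpose of wrapper verification is to ensure that the template and schema generated at time t still apply to data extraction at time t'. In the XML DOM tree of a web page, data content resides in the leaf nodes, so a finite state machine model can be used to check whether the leaf-node transition pattern of a page is consistent with the template and schema from time t. This thesis attempts to simplify the construction of the finite state machine model so that it is faster and more effective, uses CSP (constraint satisfaction problems) to quickly reduce the number of candidate state nodes and speed up model verification, and combines this with the probability of state sequences (Sequence Probability) to check the validity of the template and schema. Experiments evaluate the efficiency and effectiveness of the approach.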
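The leaf-node check described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it assumes each page has already been reduced to a sequence of leaf-node labels, and the function names `build_fsm` and `accepts`, along with the labels `title`, `price`, and `desc`, are hypothetical. The machine simply records which label may start a page, which may end it, and which label-to-label transitions were observed at time t.

```python
from collections import defaultdict


def build_fsm(training_sequences):
    """Learn start labels, end labels, and observed transitions
    from leaf-label sequences of pages seen at time t."""
    starts, ends = set(), set()
    transitions = defaultdict(set)
    for seq in training_sequences:
        if not seq:
            continue  # skip pages with no leaf nodes
        starts.add(seq[0])
        ends.add(seq[-1])
        for current, following in zip(seq, seq[1:]):
            transitions[current].add(following)
    return starts, ends, dict(transitions)


def accepts(fsm, seq):
    """Return True if a new page's leaf-label sequence only uses
    start labels, end labels, and transitions seen during training."""
    starts, ends, transitions = fsm
    if not seq or seq[0] not in starts or seq[-1] not in ends:
        return False
    return all(
        following in transitions.get(current, set())
        for current, following in zip(seq, seq[1:])
    )


# Hypothetical leaf-label sequences from two pages captured at time t.
fsm = build_fsm([
    ["title", "price", "desc"],
    ["title", "price", "price", "desc"],
])
print(accepts(fsm, ["title", "price", "price", "price", "desc"]))  # True
print(accepts(fsm, ["title", "desc"]))                             # False
```

A page at time t' that yields `False` would be a candidate for wrapper maintenance. A real system would derive the labels from the template and schema rather than from hand-written strings.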

Abstract (English)

Wrapper induction is a complex process that takes a considerable amount of time, so re-inducing a wrapper on every extraction is inefficient, particularly for Web sites with sophisticated designs. Wrapper verification and maintenance have therefore become important research topics. This thesis focuses on wrapper verification for unsupervised information extraction. It takes the leaf nodes of the DOM tree as the source of state transitions and builds an FSM (Finite State Machine) for schema verification. If a new page passes verification, it is considered to have the same data layout.
This thesis attempts to simplify the construction of the finite state machine model and proposes Schema Guided Wrapper Verification based on CSP (Constraint Satisfaction Problems), which reduces the number of candidate states to speed up validation. The proposed approach not only improves validation efficiency but also finds a better Best State Sequence, which improves the accuracy of data extraction.
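To make the two ideas in this abstract concrete, here is a small Python sketch under stated assumptions; it is not the thesis's algorithm. It assumes each leaf node comes with a set of candidate states, that the only constraints are allowed transitions between adjacent leaves, and that a sequence is scored as the product of per-leaf similarity and transition probability. The names `prune_candidates`, `best_state_sequence`, `node_sim`, and `trans_prob`, and the states `T`, `P`, `D`, `X`, are hypothetical.

```python
def prune_candidates(domains, allowed):
    """CSP-style pruning on a chain: drop any candidate state that has no
    compatible neighbour. A forward pass then a backward pass suffice
    because constraints only link adjacent leaves."""
    domains = [set(d) for d in domains]
    for i in range(1, len(domains)):  # each state needs a predecessor
        domains[i] = {s for s in domains[i]
                      if any((p, s) in allowed for p in domains[i - 1])}
    for i in range(len(domains) - 2, -1, -1):  # and a successor
        domains[i] = {s for s in domains[i]
                      if any((s, n) in allowed for n in domains[i + 1])}
    return domains


def best_state_sequence(domains, trans_prob, node_sim):
    """Dynamic programming over the pruned candidates. Returns
    (score, path) for the highest-scoring sequence, or None."""
    if not domains:
        return None
    best = {s: (node_sim[0].get(s, 0.0), [s]) for s in domains[0]}
    for i in range(1, len(domains)):
        nxt = {}
        for s in domains[i]:
            options = [
                (score * trans_prob[(p, s)] * node_sim[i].get(s, 0.0), path + [s])
                for p, (score, path) in best.items() if (p, s) in trans_prob
            ]
            if options:
                nxt[s] = max(options, key=lambda o: o[0])
        best = nxt
    return max(best.values(), key=lambda o: o[0]) if best else None


# Hypothetical transition probabilities and per-leaf similarities.
trans_prob = {("T", "P"): 1.0, ("P", "P"): 0.4, ("P", "D"): 0.6}
domains = [{"T"}, {"P", "D"}, {"P", "D"}, {"D", "X"}]
node_sim = [{"T": 0.9}, {"P": 0.8}, {"P": 0.7}, {"D": 0.9}]

pruned = prune_candidates(domains, set(trans_prob))
print(pruned)  # [{'T'}, {'P'}, {'P'}, {'D'}]
print(best_state_sequence(pruned, trans_prob, node_sim)[1])  # ['T', 'P', 'P', 'D']
```

If pruning empties any leaf's candidate set, no consistent sequence exists and `best_state_sequence` returns `None`, which would correspond to the page failing verification in this sketch.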

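Keywords (Chinese)

★ wrapper verification (擷取程式驗證)
★ finite state machine (有限狀態機)
★ constraint satisfaction problem (限制滿足問題)
★ web data extraction (網頁資料擷取)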

Keywords (English)

★ wrapper verification
★ Finite State Machine
★ Constraint Satisfaction Problems
★ web data extraction

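Table of contents

Chinese Abstract
English Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
1. Introduction
2. Related Work
2.1 Introduction to Wrappers
2.2 Verification Mechanisms
3. Verification of Unsupervised Web Wrappers
3.1 Verification Method and Workflow
3.2 A Verification Model Ordered by Leaf Nodes
3.2.1 Building a Finite State Machine over Leaf Nodes
3.2.2 Verifying New Pages as a Constraint Satisfaction Problem
3.3 Best State Sequence
3.3.1 Leaf Node Content Similarity (NodeSim)
3.3.2 State Transition Probability (Transition probability)
3.3.3 State Sequence Fitness
4. Experimental Results
4.1 Experimental Method
4.2 Efficiency Evaluation of Finite State Machine Construction
4.3 Efficiency Evaluation of CSP-based Page Verification
4.4 Effectiveness Evaluation of State Sequences
5. Conclusion
References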

Advisor

Chia-Hui Chang (張嘉惠)

Approval date

2013-1-29
