Research and Development of an Automatic Generation Technique for Web Information Extraction Programs

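Name

劉榮修 (Jung-Hsiu Liu)

Department

Department of Information Management

Thesis Title

Research and Development of an Automatic Generation Technique for Web Information Extraction Programs
(An automatic wrapper generation for web information extraction)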

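This electronic thesis is authorized for immediate open access.
Once open, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.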

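Abstract (translated from Chinese)

The Internet is an enormous repository of information, rich in data, and it has given rise to research in information retrieval, information extraction, information integration, and information mining. Today, web information is mostly extracted by extraction programs (wrappers), and in recent years many studies have addressed the design of wrapper generation. Based on this literature, this study divides wrapper generation methods into four categories: automatic analysis learning, sample-based inductive learning, manual rule construction, and assisted rule construction. Each approach has its strengths and weaknesses. Taken together, the common drawbacks are a narrow applicable domain, the need to build samples as a basis for learning, or the need to construct extraction rules by hand. The goal of this study is to address these drawbacks. It designs an interactive interface that generates extraction rules automatically, and it uses the web page's tag tree structure to represent the location of information in various page formats, which widens the range of page formats the approach can handle. It also provides an intuitive interface that makes it easier and simpler for users to complete the extraction settings. Finally, the system is evaluated against a system that likewise provides interface assistance, to show that its design is more capable and more convenient to use, and it is compared with the WIEN system to verify its effectiveness and usability.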

Abstract (English)

The World Wide Web holds a huge amount of information, and web information extraction is an important topic. However, existing research shows several drawbacks: narrow applicable domains, the cost of learning from samples, and handcrafted rules. We therefore present an approach to generating wrappers for web information extraction. Our contributions are as follows: (1) an interactive interface that generates extraction rules automatically without any samples; (2) extraction rules that apply to many kinds of web page formats. Finally, we run the system on several web sites to test the applicability of our wrapper generation approach.
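
The abstract describes locating information by its position in the page's tag tree. The following is a minimal Python sketch of that general idea, not the thesis's implementation: this record does not give the actual rule format. The names `TagTreeBuilder`, `path_of`, and `select`, the `None` wildcard index, and the sample page are all illustrative assumptions.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree (hypothetical structure for this sketch)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TagTreeBuilder(HTMLParser):
    """Builds a tree of Node objects from HTML markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def path_of(node):
    """Structural path from the root: (tag, index among same-tag siblings) pairs."""
    steps = []
    while node.parent is not None:
        same = [c for c in node.parent.children if c.tag == node.tag]
        steps.append((node.tag, same.index(node)))
        node = node.parent
    return list(reversed(steps))


def select(root, path):
    """Return all nodes matching the path; an index of None matches every sibling."""
    nodes = [root]
    for tag, index in path:
        matched = []
        for node in nodes:
            same = [c for c in node.children if c.tag == tag]
            if index is None:
                matched.extend(same)
            elif index < len(same):
                matched.append(same[index])
        nodes = matched
    return nodes


# Made-up sample page: a two-row table of item names and values.
PAGE = ("<html><body><table>"
        "<tr><td>A</td><td>10</td></tr>"
        "<tr><td>B</td><td>20</td></tr>"
        "</table></body></html>")

root = build_tree(PAGE)
# A path-style rule: the second cell of every table row.
rule = [("html", 0), ("body", 0), ("table", 0), ("tr", None), ("td", 1)]
print([node.text for node in select(root, rule)])  # ['10', '20']
```

In this sketch a rule is just a path through the tree, so the same rule keeps working as long as the page layout stays the same while the cell contents change. A user-facing tool could obtain such a path by calling `path_of` on whichever node the user picks.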

Keywords

★ wrapper
★ web information extraction
★ automatic generation

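Thesis Table of Contents

Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
Section 1 Research Background and Motivation
1.1.1 Abundance of Information on the Internet
1.1.2 Extracting Web Information
Section 2 Research Objectives
1.2.1 Automating Extraction Rule Generation
1.2.2 Broad Range of Applicable Web Page Types
1.2.3 Simple Operation
Section 3 System Design Method
1.3.1 Method for Building the Tree Structure
1.3.2 Method for Constructing the User Interface
1.3.3 Method for Automatically Generating Extraction Rules
Section 4 Research Results
Section 5 Thesis Structure
Chapter 2 Related Work
Section 1 Overview of the Information Extraction Field
2.1.1 Information Extraction
2.1.2 Traditional Information Extraction Methods
2.1.3 Document Types in Traditional Information Extraction
2.1.4 Web Information Extraction
Section 2 Research on Wrapper Generation
2.2.1 Automatic Analysis Learning
2.2.2 Sample-Based Inductive Learning
2.2.3 Manual Rule Construction
2.2.4 Assisted Rule Construction
2.2.5 Overall Comparison
Chapter 3 System Design
Section 1 Overview of the Research Architecture
3.1.1 System Operation Architecture
3.1.2 Research Design Workflow
Section 2 Design of the Web Page Tag Tree Diagram
3.2.1 Defining the Tree Structure
3.2.2 Building the Web Page Tag Tree Diagram
Section 3 Tree Structure Post-Processing
3.3.1 HTML Tag Classification
3.3.2 Pruning the Tree Structure
3.3.3 Simplifying the Tree Structure
Section 4 User Interface Design
3.4.1 Designing the Interface from the Tree Structure
3.4.2 Narrowing the Information Range Step by Step
3.4.3 Structural Path Location of the Information Range
Section 5 Extraction Rule Design
3.5.1 Extraction Rules for General Information
3.5.2 Extraction Rules for Other Information
3.5.3 Application: Downloading and Saving Images/Hyperlinks
3.5.4 Application: Multi-Page Information Extraction
Chapter 4 System Implementation
Section 1 System Implementation Architecture
4.1.1 System Architecture Modules
4.1.2 System Development Environment
Section 2 Implementation of the Extraction Rule Generation Unit
4.2.1 Web Page Processing Module
4.2.2 Extraction Range Selection Module
4.2.3 Extraction Condition Setting Module
Section 3 Implementation of the Extraction Program Unit
4.3.1 Extraction Task Module
4.3.2 Scheduling Module
4.3.3 Information Presentation Module
Chapter 5 System Usage Examples and Evaluation
Section 1 System Usage Examples
5.1.1 Extraction Example: Table-Type Pages
5.1.2 Extraction Example: Other Page Types
5.1.3 Extraction Example: Multi-Page Information
Section 2 System Evaluation
5.2.1 Evaluation Against the Bright System
5.2.2 Evaluation Against the WIEN System
5.2.3 Evaluation on Real Web Sites
Chapter 6 Conclusion
Section 1 Research Conclusions
Section 2 Research Contributions
Section 3 Future Research Directions
References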

Advisor

陳奕明 (Yi-Ming Chen)

Approval Date

2002-7-11
