A Novel Wrapper with the On-Line Learning Capability

Name

Chen-Ko Huang (黃陳科)

Department

Department of Computer Science and Information Engineering

Thesis Title

A Novel Wrapper with the On-Line Learning Capability
(具線上學習功能之新型擷取程式)

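This electronic thesis is authorized for immediate open access.
The open-access full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.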

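Abstract (Chinese)

With the growth of the Internet, a great deal of information is stored in databases and presented through web pages. These pages are currently generated by Common Gateway Interface (CGI) programs, and pages produced by the same CGI program follow a fixed pattern. This thesis exploits that pattern to recover the underlying data record by record; the pattern is called an extraction rule. A program that uses extraction rules to pull information back out of the database behind a web page is called an extraction program, or wrapper. A wrapper extracts information from a web source and stores it in a user-defined format so that the processed data can be integrated further. Because the volume of information on the Internet is overwhelming, a learnable information extraction system that generates wrappers automatically makes it easier to integrate web information and spares the user from excessive manual marking. In other words, the information extraction system must generate extraction rules from the content to be extracted in the training pages and pass them to the wrapper. With these considerations in mind, this thesis develops a new signal-based method that finds correlation features between the user-marked example and the web page. The method, called the "histogram and boundary tag-based correlation coefficient," is used to implement the whole extraction system, yielding a wrapper that generates extraction rules adapted to the diversity of web pages and that is capable of on-line learning.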
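The idea of an extraction rule applied by a wrapper can be illustrated with a minimal Python sketch. It is not the thesis's method: it assumes a simple rule format (a record separator plus a left and right delimiter per field, in the spirit of delimiter-based wrappers), and the names `ExtractionRule`, `apply_rule`, `PAGE`, and `RULE` are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical rule format: field name -> (left delimiter, right delimiter).
ExtractionRule = Dict[str, Tuple[str, str]]


def apply_rule(page: str, record_sep: str, rule: ExtractionRule) -> List[Dict[str, str]]:
    """Split the page into records, then pull each field out by its delimiters."""
    records = []
    # Text before the first separator is page header, so skip it.
    for chunk in page.split(record_sep)[1:]:
        record = {}
        for field, (left, right) in rule.items():
            pattern = re.escape(left) + r"(.*?)" + re.escape(right)
            match = re.search(pattern, chunk, re.S)
            if match:
                record[field] = match.group(1).strip()
        if record:
            records.append(record)
    return records


# A toy page in which every record is produced from the same template.
PAGE = (
    "<html><body><ul>"
    "<li><b>Wrapper Induction</b> <i>1997</i></li>"
    "<li><b>STALKER</b> <i>1998</i></li>"
    "</ul></body></html>"
)
RULE = {"title": ("<b>", "</b>"), "year": ("<i>", "</i>")}

rows = apply_rule(PAGE, "<li>", RULE)
for row in rows:
    print(row)
```

Because both records come from the same template, one rule recovers both of them; a learning system of the kind described above would produce such a rule from a marked example instead of having it written by hand.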

Abstract (English)

With the Internet now widespread, a great amount of information is stored in databases that are accessible through web pages. At present, most web pages are generated by Common Gateway Interface (CGI) programs, so pages produced by the same program follow fixed rules. We can therefore use these rules, known as ‘extraction rules’, to extract the information on a web page. A program that extracts information from web pages on the basis of extraction rules is called a ‘wrapper’.
A wrapper not only extracts the information presented on a web page but also transforms and saves it in a user-defined format, which allows the information to be processed further. Given the overwhelming scale of information on the Internet, an information extraction system with learning capability can integrate the information on web pages and let the user build a wrapper automatically with simple template marking. In other words, the information extraction system must derive extraction rules from the training pages and pass them to the wrapper. With these considerations in mind, we develop a new signal-based method called the “histogram and boundary tag-based correlation coefficient.” The method discovers correlation features between the template marked by the user and the web page, and is used to implement the extraction system. The resulting wrapper has on-line learning capability and generates extraction rules that can cope with diverse web pages.
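The abstract does not spell out how the correlation coefficient is computed, so the following Python sketch is only one plausible reading, under stated assumptions: the page and the marked example are turned into tag sequences (the "signal"), a fixed-width window slides over the page, windows whose first and last tags match the example's boundary tags are kept, and each is scored by the Pearson correlation between tag histograms. The functions `tag_sequence`, `pearson`, and `find_records`, and the 0.9 threshold, are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

TAG = re.compile(r"<\s*(/?\w+)[^>]*>")


def tag_sequence(html: str) -> List[str]:
    """Turn HTML into a 'signal': the ordered list of lower-cased tag names."""
    return [m.group(1).lower() for m in TAG.finditer(html)]


def pearson(a: Counter, b: Counter, vocab: List[str]) -> float:
    """Pearson correlation between two tag histograms over a shared vocabulary."""
    xs = [a[t] for t in vocab]
    ys = [b[t] for t in vocab]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a flat histogram
    return sxy / math.sqrt(sxx * syy)


def find_records(page: str, example: str, threshold: float = 0.9) -> List[Tuple[int, float]]:
    """Return (tag index, score) for windows that resemble the marked example."""
    page_tags, ex_tags = tag_sequence(page), tag_sequence(example)
    vocab = sorted(set(page_tags) | set(ex_tags))
    ex_hist = Counter(ex_tags)
    first, last, width = ex_tags[0], ex_tags[-1], len(ex_tags)
    hits = []
    for start in range(len(page_tags) - width + 1):
        window = page_tags[start:start + width]
        # Boundary-tag check: the window must open and close like the example.
        if window[0] != first or window[-1] != last:
            continue
        score = pearson(ex_hist, Counter(window), vocab)
        if score >= threshold:
            hits.append((start, score))
    return hits


PAGE = (
    "<html><body><ul>"
    "<li><b>Wrapper Induction</b> <i>1997</i></li>"
    "<li><b>STALKER</b> <i>1998</i></li>"
    "</ul></body></html>"
)
EXAMPLE = "<li><b>Wrapper Induction</b> <i>1997</i></li>"

hits = find_records(PAGE, EXAMPLE)
print(hits)
```

The sketch uses a fixed window width, so it only finds records with the same number of tags as the example; handling records of varying length or tags the user forgot to mark would need additional logic that is not attempted here.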

Keywords (Chinese)

★ extraction rule (擷取規則)
★ wrapper (包覆程式)
★ extraction program (擷取程式)

Keywords (English)

★ extraction rule
★ wrapper

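Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Problem Analysis
1.5 Thesis Organization
Chapter 2 Related Work
2.1 The WIEN Extraction System
2.2 The STALKER Extraction System
2.3 The SoftMealy Extraction System
2.4 The Embley Extraction System
2.5 Summary
Chapter 3 System Architecture
3.1 Overall Architecture
3.2 Training Pages
3.2.1 Preprocessing
3.2.2 Histogram and Boundary Tag-Based Correlation Coefficient
3.2.3 Self-Calibrating Mechanism
3.2.4 Example-Adding Mechanism
3.3 Test Pages
3.3.1 On-Line Learning Mechanism
3.4 Attribute Mapping
3.5 Single-Record Page Processing
Chapter 4 System Overview and Experimental Results
4.1 System Overview
4.2 Experimental Results
4.2.1 Multiple-Record Pages
4.2.2 Single-Record Pages
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Research Directions
References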
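List of Figures
Figure 3.1 Overall architecture
Figure 3.2 Converting the original web page into a signal
Figure 3.3 Converting the marked example into a signal
Figure 3.4 Flowchart of the histogram and boundary tag-based correlation coefficient
Figure 3.5 Histogram statistics of the marked example
Figure 3.6 Histogram-based correlation coefficient between the web page and the example
Figure 3.7 Illustration of the histogram and boundary tag-based correlation coefficient
Figure 3.8 Google search results page
Figure 3.9 Original web page converted into a signal
Figure 3.10 Example marked by the user
Figure 3.11 Marked example converted into a signal
Figure 3.12 Tag order of the user-marked example
Figure 3.13 Data window extracted from the original web page
Figure 3.14 Tags that the user may omit when marking an example
Figure 3.15 Flowchart of the on-line learning mechanism
Figure 3.16 An example record marked by the user on the SpringerLink site
Figure 3.17 Attributes of the user-marked example
Figure 3.18 Flowchart of single-record page processing
Figure 4.1 Parameter settings
Figure 4.2 Setting the parameters
Figure 4.3 The xsd file
Figure 4.4 Attributes corresponding to the xsd file
Figure 4.5 Example-marking option
Figure 4.6 The user marks the first record as the example
Figure 4.7 Attribute-marking screen
Figure 4.8 Training-page option in web page extraction
Figure 4.9 All information found in the training page
Figure 4.10 Test-page option
Figure 4.11 Test-page selection screen
Figure 4.12 One record displayed on a SpringerLink page
Figure 4.13 Two records from CiteSeer
Figure 4.14 Two records from the ACM site
Figure 4.15 One record from the ACM site
Figure 4.16 Example marked by the user on the Google site
Figure 4.17 Another type of record on the Google site
Figure 4.18 One record from the MSN site
Figure 4.19 Information on a single-record page from the Yahoo auction site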
List of Tables
Table 4.1 Extraction rates for multiple-record sites
Table 4.2 Extraction rates for single-record sites

Advisor

Mu-Chun Su (蘇木春)

Approval Date

2005-7-14
