A Novel Wrapper with the On-Line Learning Capability (具線上學習功能之新型擷取程式)

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/8987

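Title:	A Novel Wrapper with the On-Line Learning Capability (具線上學習功能之新型擷取程式)
Author:	Chen-Ko Huang (黃陳科)
Contributor:	Graduate Institute of Computer Science and Information Engineering
Keywords:	extraction rule; wrapper; extraction program
Date:	2005-07-05
Upload time:	2009-09-22 11:38:54 (UTC+8)
Publisher:	National Central University Library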
Abstract:	With the growth of the Internet, a large amount of information is now stored in databases and presented through web pages. Most such pages are generated by Common Gateway Interface (CGI) programs, so pages produced by the same CGI program follow a fixed set of rules. These rules, known as extraction rules, can be applied in reverse to extract the underlying data record by record. A program that uses extraction rules to extract information from web pages is called a wrapper. A wrapper not only extracts information from web sources but also stores it in a user-defined format, so that the processed data can be integrated further. Given the overwhelming volume of information on the Internet, a learnable information extraction system that generates wrappers automatically makes it easier to integrate web information and spares the user from tedious labeling. In other words, the information extraction system must derive extraction rules from the content to be extracted on the training pages and pass them to the wrapper. With these considerations in mind, this thesis develops a new signal-based method, called the "histogram and boundary tag-based correlation coefficient", which finds correlation features between the user-marked examples and the web page. The method is used to implement the complete extraction system: it generates extraction rules that cope with the diversity of web pages and yields a wrapper with on-line learning capability.
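The idea of an extraction rule applied by a wrapper can be illustrated with a minimal Python sketch. It is not the thesis's method: the "histogram and boundary tag-based correlation coefficient" is not specified in the abstract and is not reproduced here. The sketch assumes a much simpler rule form, a pair of boundary tags (the tag immediately before and the tag immediately after a user-marked example). The names `Wrapper`, `learn`, `extract`, and `tokenize`, as well as the sample pages, are hypothetical. "On-line learning" is approximated by adding one rule at a time as the user marks further examples.

```python
import re

# A capturing group makes re.split keep the tags as separate tokens.
TAG = re.compile(r"(<[^>]+>)")


def tokenize(page: str) -> list[str]:
    """Split a page into tag tokens and stripped, non-empty text tokens."""
    return [t.strip() for t in TAG.split(page) if t.strip()]


class Wrapper:
    """Hypothetical wrapper whose rules are (left tag, right tag) pairs."""

    def __init__(self) -> None:
        self.rules: set[tuple[str, str]] = set()

    def learn(self, page: str, example: str) -> None:
        """Add the boundary tags around one user-marked example as a rule."""
        tokens = tokenize(page)
        i = tokens.index(example)  # raises ValueError if the text is absent
        if i == 0 or i == len(tokens) - 1:
            raise ValueError("example is not enclosed by tags on both sides")
        self.rules.add((tokens[i - 1], tokens[i + 1]))

    def extract(self, page: str) -> list[str]:
        """Return every text token whose neighbouring tags match a rule."""
        tokens = tokenize(page)
        return [
            tok
            for prev, tok, nxt in zip(tokens, tokens[1:], tokens[2:])
            if (prev, nxt) in self.rules and not TAG.fullmatch(tok)
        ]


# Two pages assumed to come from the same generating program.
train_page = "<ul><li><b>Alice</b> 30</li><li><b>Bob</b> 25</li></ul>"
new_page = "<ul><li><b>Carol</b> 41</li><li><b>Dave</b> 19</li></ul>"

wrapper = Wrapper()
wrapper.learn(train_page, "Alice")       # the user marks one name
names = wrapper.extract(new_page)        # ['Carol', 'Dave']

wrapper.learn(train_page, "30")          # later, the user also marks an age
fields = wrapper.extract(new_page)       # ['Carol', '41', 'Dave', '19']
```

Because both pages share the same tag structure, a rule learned from a single marked example on the training page carries over to the new page. A realistic system would need a far more robust notion of similarity between marked examples and page content, which is the problem the thesis addresses.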
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations

All items in NCUIR are protected by the original copyright.
