樣板網頁結構自動分群

DC Field	Value	Language
DC.contributor	資訊工程學系在職專班	zh_TW
DC.creator	吳佳儒	zh_TW
DC.creator	Jia-Ru Wu	en_US
dc.date.accessioned	2018-07-23T07:39:07Z
dc.date.available	2018-07-23T07:39:07Z
dc.date.issued	2018
dc.identifier.uri	http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=101552002
dc.contributor.department	資訊工程學系在職專班	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
dc.description.abstract	在網頁資料擷取(Web Data Extraction)的領域中，由於網頁內容多樣及架構的複雜性，要如何自動從各式不同樣板的網頁中擷取出資料，這類型的研究一直面臨相當大的挑戰。網頁資料擷取系統主要分為記錄層級(Record Level)和頁面層級(Page Level)兩大類別，兩者是接受相同樣板的網頁，進行資料擷取或是綱要推導，針對不同網頁樣板來進行分群之研究較為少見。本篇論文提出一個依照網頁結構之相似程度來自動分群的功能，簡化不同網頁樣板之間擷取的問題，針對所設計的網頁特徵來實作非監督式分群與監督式分群，並比較其分群之效能。雖從整體分群效果中來看不甚理想，但於目標群結果可達到在非監督式分群時之精確率 99%，召回率 78%，監督式分群時之精確率 97%，召回率超過 80%。最後，此分群結果可再結合Page-level Information Extraction System (UWIDE) 系統，產生完整的頁面綱要及擷取出所需 POI 相關資訊，進而建立及累積資料庫，以提升相關加值服務之效率及品質。	zh_TW
dc.description.abstract	In the field of Web Data Extraction, the diversity of web content and the complexity of web page structure make it a considerable challenge to extract data automatically from web pages built on different templates. Web data extraction systems fall into two main categories, Record Level and Page Level. Both take web pages of the same template as input and perform data extraction or schema induction; research on clustering web pages of different templates is rare. This thesis proposes a method that automatically clusters web pages by the similarity of their structure, which simplifies the problem of extracting data across different page templates. We implement both unsupervised and supervised clustering on our designed page features and compare their performance. Although the overall clustering performance is not as good as expected, for the target cluster unsupervised clustering reaches a precision of 99% and a recall of approximately 78%, and supervised clustering reaches a precision of 97% and a recall of more than 80%. Finally, the clustering result can be combined with the Page-Level Information Extraction System (UWIDE) to generate a complete page schema and extract the required POI-related information, which can then be accumulated in databases to improve the efficiency and quality of related value-added services.	en_US
DC.subject	特徵挑選	zh_TW
DC.subject	樣板網頁擷取	zh_TW
DC.subject	階層式分群	zh_TW
DC.subject	非監督式分群	zh_TW
DC.title	樣板網頁結構自動分群	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	Clustering of Template Page for Data Extraction	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Thesis 101552002: full metadata record