Thesis/Dissertation 101552002: Full Metadata Record

DC Field Value Language
DC.contributorIn-service Master's Program, Department of Computer Science and Information Engineering (資訊工程學系在職專班)zh_TW
DC.creator吳佳儒zh_TW
DC.creatorJia-Ru Wuen_US
dc.date.accessioned2018-07-23T07:39:07Z
dc.date.available2018-07-23T07:39:07Z
dc.date.issued2018
dc.identifier.urihttp://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=101552002
dc.contributor.departmentIn-service Master's Program, Department of Computer Science and Information Engineering (資訊工程學系在職專班)zh_TW
DC.description國立中央大學zh_TW
DC.descriptionNational Central Universityen_US
dc.description.abstract在網頁資料擷取(Web Data Extraction)的領域中,由於網頁內容多樣及架構的複雜性,要如何自動從各式不同樣板的網頁中擷取出資料,這類型的研究一直面臨相當大的挑戰。 網頁資料擷取系統主要分為記錄層級(Record Level)和頁面層級(Page Level)兩大類別,兩者是接受相同樣板的網頁,進行資料擷取或是綱要推導,針對不同網頁樣板來進行分群之研究較為少見。 本篇論文提出一個依照網頁結構之相似程度來自動分群的功能,簡化不同網頁樣板之間擷取的問題,針對所設計的網頁特徵來實作非監督式分群與監督式分群,並比較其分群之效能。雖從整體分群效果中來看不甚理想,但於目標群結果可達到在非監督式分群時之精確率 99%,召回率 78%,監督式分群時之精確率 97%,召回率超過 80%。 最後,此分群結果可再結合Page-level Information Extraction System (UWIDE) 系統,產生完整的頁面綱要及擷取出所需 POI 相關資訊,進而建立及累積資料庫,以提升相關加值服務之效率及品質。zh_TW
dc.description.abstractIn the field of Web data extraction, the diversity of web content and the complexity of page structure make it a considerable challenge to extract data automatically from web pages built on different templates. Web data extraction systems fall into two main categories, record-level and page-level. Both take web pages of the same template as input and perform data extraction or schema induction; research on clustering web pages of different templates is comparatively rare. This thesis proposes a method that automatically clusters web pages by the similarity of their structure, which simplifies the problem of extracting data across different page templates. Using a set of page features designed for this purpose, we implement both unsupervised and supervised clustering and compare their performance. Although the overall clustering performance falls short of expectations, for the target cluster unsupervised clustering reaches a precision of 99% and a recall of approximately 78%, while supervised clustering reaches a precision of 97% and a recall of more than 80%. Finally, the clustering result can be combined with the Page-Level Information Extraction System (UWIDE) to generate a complete page schema and extract the required POI-related information, which can then be accumulated in a database to improve the efficiency and quality of related value-added services.en_US
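DC.subjectFeature selection (特徵挑選)zh_TW
DC.subjectTemplate web page extraction (樣板網頁擷取)zh_TW
DC.subjectHierarchical clustering (階層式分群)zh_TW
DC.subjectUnsupervised clustering (非監督式分群)zh_TW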
DC.title樣板網頁結構自動分群zh_TW
dc.language.isozh-TWzh-TW
DC.titleClustering of Template Page for Data Extractionen_US
DC.type博碩士論文zh_TW
DC.typethesisen_US
DC.publisherNational Central Universityen_US
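
The abstract describes grouping pages by structural similarity and reporting precision and recall for a target cluster. The following is a minimal illustrative sketch of that idea, not the thesis's implementation. The record does not list the actual page features, so the feature used here (the set of root-to-element tag paths), the Jaccard similarity, the 0.5 threshold, and all function names are assumptions. The grouping step is single-link agglomerative clustering cut at a fixed threshold, which is equivalent to taking connected components of the similarity graph.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not stay on the open-tag stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collect the set of root-to-element tag paths of an HTML page (assumed feature)."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the most recent matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    return collector.paths


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.5):
    """Single-link clustering cut at `threshold`; returns one cluster label per page."""
    feats = [tag_paths(p) for p in pages]
    parent = list(range(len(pages)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            if jaccard(feats[i], feats[j]) >= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(pages))]


def precision_recall(labels, is_target, target_cluster):
    """Precision and recall of one cluster against ground-truth target flags."""
    in_cluster = [label == target_cluster for label in labels]
    true_pos = sum(c and t for c, t in zip(in_cluster, is_target))
    predicted = sum(in_cluster)
    actual = sum(is_target)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall
```

Given a list of HTML strings, `cluster_pages` returns a label per page, and `precision_recall` scores whichever cluster is chosen as the target against hand-labeled flags. A real system would likely need richer features and a tuned threshold.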
