Multipage schema induction based on single-page data extraction and schema matching

DC field	Value	Language
DC.contributor	資訊工程學系	zh_TW
DC.creator	張智鈞	zh_TW
DC.creator	Chih-Chun Chang	en_US
dc.date.accessioned	2022-09-01T07:39:07Z
dc.date.available	2022-09-01T07:39:07Z
dc.date.issued	2022
dc.identifier.uri	http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=109522040
dc.contributor.department	資訊工程學系	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
dc.description.abstract	網際網路(WWW)是現代資訊傳播的主流媒體，許多應用服務仰賴網頁資料擷取(Web Data Extraction)技術支援資訊整合服務。雖然過去已有不少非監督式資料擷取方法被提出，但是考慮單網頁的多筆資料的擷取方法(如MDR)僅能處理本頁中的記錄集(RecordSet)，無法顧慮整體結構；而多網頁的對齊方法(如DCADE)雖能透過多頁資料辨識樣版與資料的區分，但是對於記錄集的辨識往往不夠韌性(Robust)，對於複雜網站往往無法完成擷取任務。本研究結合兩種方法的優點，先採用MDR對個別網頁進行資料集擷取，再將多網頁資料集擷取結果進行記錄集匹配(Recordset Matching)、行對齊(Column Alignment) 和非記錄集匹配(NonRecordset Matching)三項子任務。其中在記錄集的部分，我們利用了BERT sentence representation計算資料集中每筆資料的表示法，再搭配綱要匹配(Schema Matching)達到了記錄集匹配；同時應用KNN、SVM分類器完成行對齊任務;在非紀錄的部分則是利用DCADE多網頁數據擷取方法對於非記錄集進行多頁屬性對齊能力來完成；最終我們合併兩項結果，達到多分頁數據擷取。除了ExAlg、WEIR資料集之外，我們也提供了一個網站最新消息公告資料集(Announcement List Website, ALW)，用來測試網站最新資訊或公告列表的自動資料效果。實驗結果顯示，我們提出的方法DEVOSM (Data Extraction via On-the-fly Schema Matching) ，在ExAlg、WEIR和ALW資料集上改善了55.6%、60%及33.7%的記錄集擷取率，顯示所提方法的有效性。	zh_TW
dc.description.abstract	The World Wide Web (WWW) is the mainstream medium of modern information dissemination, and many application services rely on web data extraction to support information integration. Although many unsupervised data extraction methods have been proposed, single-page methods that extract multiple records (such as MDR) can only process the record sets within one page and cannot take the overall structure into account. Multi-page alignment methods (such as DCADE) can distinguish templates from data by comparing multiple pages, but their identification of record sets is often not robust enough, and they often fail to complete the extraction task on complex websites. This study combines the advantages of the two approaches: MDR is first used to extract data sets from each individual web page, and the data sets extracted from multiple pages then go through three subtasks: Recordset Matching, Column Alignment, and NonRecordset Matching. For record sets, we use BERT sentence representations to compute a representation of each record in a data set and then apply schema matching to achieve Recordset Matching; KNN and SVM classifiers are used to complete the Column Alignment task. For the non-record-set part, we rely on the multi-page attribute alignment capability of the DCADE multi-page data extraction method. Finally, we merge the two results to achieve multi-page data extraction. In addition to the ExAlg and WEIR datasets, we also provide an Announcement List Website (ALW) dataset, which is used to test automatic data extraction from a website's latest-news or announcement lists. Experimental results show that the proposed method, DEVOSM (Data Extraction via On-the-fly Schema Matching), improves the record set extraction rate by 55.6%, 60%, and 33.7% on the ExAlg, WEIR, and ALW datasets, respectively, demonstrating the effectiveness of the proposed method.	en_US
DC.subject	網頁數據擷取	zh_TW
DC.subject	綱要匹配	zh_TW
DC.subject	Web data extraction	en_US
DC.subject	Schema matching	en_US
DC.subject	ETL	en_US
DC.title	基於單網頁資料提取及綱要匹配之多網頁資料提取	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	Multipage schema induction based on single-page data extraction and schema matching	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Thesis 109522040: full metadata record