Multipage schema induction based on single-page data extraction and schema matching


Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/90003

Title:	Multipage schema induction based on single-page data extraction and schema matching
Author:	Chang, Chih-Chun (張智鈞)
Contributor:	Department of Computer Science and Information Engineering
Keywords:	Web data extraction; Schema matching; ETL
Date:	2022-09-01
Upload time:	2022-10-04 12:07:16 (UTC+8)
Publisher:	National Central University
Abstract:	The World Wide Web (WWW) is the dominant medium for modern information dissemination, and many applications rely on web data extraction to support information integration services. Many unsupervised extraction methods have been proposed. However, single-page methods that extract multiple records (such as MDR) can only process the record sets within the current page and cannot take the overall structure into account. Multi-page alignment methods (such as DCADE) can separate template from data by comparing multiple pages, but their record set identification is often not robust enough, so they frequently fail to complete the extraction task on complex websites. This study combines the strengths of both approaches: MDR first extracts record sets from each individual page, and the per-page results are then processed in three subtasks: record set matching, column alignment, and non-record-set matching. For record sets, we use BERT sentence representations to compute a representation of each record and combine them with schema matching to perform record set matching, and we apply KNN and SVM classifiers to the column alignment task. For non-record-set data, we rely on the multi-page attribute alignment capability of the DCADE multi-page extraction method. Finally, we merge the two results to achieve multi-page data extraction. In addition to the ExAlg and WEIR datasets, we provide a new dataset, Announcement List Website (ALW), for evaluating automatic extraction of the latest news and announcement lists on websites. Experimental results show that the proposed method, DEVOSM (Data Extraction via On-the-fly Schema Matching), improves the record set extraction rate by 55.6%, 60%, and 33.7% on the ExAlg, WEIR, and ALW datasets, respectively, demonstrating its effectiveness.
Appears in collections:	[Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	75

All items in NCUIR are protected by the original copyright.
