Thesis/Dissertation 109522040: Full Metadata Record

DC Field | Value | Language
DC.contributor | Department of Computer Science and Information Engineering | zh_TW
DC.creator | 張智鈞 | zh_TW
DC.creator | Chih-Chun Chang | en_US
dc.date.accessioned | 2022-09-01T07:39:07Z
dc.date.available | 2022-09-01T07:39:07Z
dc.date.issued | 2022
dc.identifier.uri | http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=109522040
dc.contributor.department | Department of Computer Science and Information Engineering | zh_TW
DC.description | 國立中央大學 | zh_TW
DC.description | National Central University | en_US
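dc.description.abstract | The World Wide Web (WWW) is the mainstream medium of modern information dissemination, and many application services rely on Web Data Extraction techniques to support information integration. Although many unsupervised data extraction methods have been proposed, single-page methods that extract multiple records (such as MDR) can handle only the record sets (RecordSet) within the current page and cannot take the overall structure into account. Multi-page alignment methods (such as DCADE) can separate template from data by comparing several pages, but their identification of record sets is often not robust, and they frequently fail to complete the extraction task on complex websites. This study combines the strengths of both approaches: MDR is first applied to extract data record sets from each individual page, and the multi-page extraction results then go through three subtasks: Recordset Matching, Column Alignment, and NonRecordset Matching. For record sets, we use BERT sentence representation to compute a representation of every record in a set and combine it with Schema Matching to achieve record set matching; KNN and SVM classifiers are applied to the column alignment task. For content outside record sets, we rely on the ability of the DCADE multi-page data extraction method to align attributes across pages. Finally, the two results are merged to achieve multi-page data extraction. In addition to the ExAlg and WEIR datasets, we also provide the Announcement List Website (ALW) dataset, which is used to test automatic data extraction from websites' latest-news and announcement lists. Experimental results show that the proposed method, DEVOSM (Data Extraction via On-the-fly Schema Matching), improves the record set extraction rate by 55.6%, 60%, and 33.7% on the ExAlg, WEIR, and ALW datasets, respectively, demonstrating its effectiveness. | zh_TW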
dc.description.abstract | The World Wide Web (WWW) is the mainstream medium of modern information dissemination. Many application services rely on Web Data Extraction technology to support information integration services. Although many unsupervised data extraction methods have been proposed in the past, single-page multi-record extraction methods (such as MDR) can process only the RecordSet in the current page and cannot consider the overall structure. Multi-page alignment methods (such as DCADE) can distinguish templates from data by using multiple pages, but their identification of record sets is often not robust enough, and they often cannot complete the extraction task for complex websites. This study combines the advantages of the two methods: MDR is used to extract data record sets from individual web pages, and the record sets extracted from multiple web pages are then processed in three subtasks: Recordset Matching, Column Alignment, and NonRecordset Matching. In the record set part, we use BERT sentence representation to compute the representation of each record in a record set, and then use Schema Matching to achieve record set matching; at the same time, KNN and SVM classifiers are used to complete the Column Alignment task. The non-record-set part is handled by the DCADE multi-page data extraction method, which aligns attributes of non-record sets across pages. Finally, we combine the two results to achieve multi-page data extraction. In addition to the ExAlg and WEIR datasets, we also provide the Announcement List Website (ALW) dataset, which is used to test automatic data extraction from websites' latest-news or announcement lists. Experimental results show that our proposed method, DEVOSM (Data Extraction via On-the-fly Schema Matching), improved the record set extraction rate by 55.6%, 60%, and 33.7% on the ExAlg, WEIR, and ALW datasets, respectively, showing the effectiveness of the proposed method. | en_US
DC.subject | Web data extraction | zh_TW
DC.subject | Schema matching | zh_TW
DC.subject | Web data extraction | en_US
DC.subject | Schema matching | en_US
DC.subject | ETL | en_US
DC.title | Multi-page data extraction based on single-page data extraction and schema matching | zh_TW
dc.language.iso | zh-TW | zh-TW
DC.title | Multipage schema induction based on single-page data extraction and schema matching | en_US
DC.type | Thesis/Dissertation | zh_TW
DC.type | thesis | en_US
DC.publisher | National Central University | en_US
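
The record set matching step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: a bag-of-words count vector stands in for the BERT sentence representation, a greedy cosine-similarity pairing stands in for the schema matching step, and the function names, the 0.2 threshold, and the two sample pages are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a sentence encoder (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def centroid(records):
    """Sum the record vectors of one record set (cosine ignores the scale)."""
    total = Counter()
    for record in records:
        total.update(embed(record))
    return total


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_record_sets(page_a, page_b, threshold=0.2):
    """Greedily pair record sets across two pages, most similar first.

    Each page maps a (hypothetical) record-set name to its list of records,
    as a single-page extractor might return them.
    """
    cents_a = {name: centroid(recs) for name, recs in page_a.items()}
    cents_b = {name: centroid(recs) for name, recs in page_b.items()}
    scored = sorted(
        ((cosine(cents_a[x], cents_b[y]), x, y) for x in cents_a for y in cents_b),
        reverse=True,
    )
    pairs, used_a, used_b = [], set(), set()
    for score, x, y in scored:
        if score < threshold:
            break  # remaining candidates are too dissimilar to match
        if x in used_a or y in used_b:
            continue  # keep the matching one-to-one
        pairs.append((x, y))
        used_a.add(x)
        used_b.add(y)
    return pairs


# Hypothetical record sets extracted from two pages of the same site.
page_1 = {
    "news": ["2022-08-01 Library closed for holiday",
             "2022-08-15 New database trial announcement"],
    "nav": ["Home About Contact", "Home Services Contact"],
}
page_2 = {
    "list": ["2022-09-01 Library hours announcement",
             "2022-09-03 New exhibition opening"],
    "menu": ["Home About Contact"],
}
matches = match_record_sets(page_1, page_2)
print(matches)
```

Replacing `embed` with a real sentence encoder would change only the vector type. The greedy pairing is the simplest one-to-one strategy and was chosen here for brevity.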
