應用動態編碼及分治對齊算法之免標記樣版網頁完整綱要推導研究;Annotation-Free Induction of Full Schema from Template Web Pages with Dynamic Encoding


Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/81066

Title:	應用動態編碼及分治對齊算法之免標記樣版網頁完整綱要推導研究;Annotation-Free Induction of Full Schema from Template Web Pages with Dynamic Encoding
Authors:	陳燕琴;Yuliana, Oviliani Yenty
Contributors:	Department of Computer Science and Information Engineering
Keywords:	Deep web data extraction;Divide-conquer alignment;Dynamic encoding;Full-schema induction;Multiple template pages
Date:	2019-07-06
Issue Date:	2019-09-03 15:32:32 (UTC+8)
Publisher:	National Central University
Abstract:	Automatic data extraction from template pages is an essential task for data integration and analysis. Most research focuses on data extraction from list pages. The problem of data alignment for singleton pages, each of which contains the detailed information of a single item, is less often addressed and is more challenging. In the first work, we propose a novel Divide-and-Conquer Alignment algorithm (DCA) that works on leaf nodes from the DOM trees of singleton pages. The idea is to detect mandatory templates via the longest increasing subsequence of the landmark equivalence class leaf nodes and to recursively apply the same procedure to each segment divided by the mandatory templates. DCA aligns each segment efficiently and handles multi-order attribute-value pairs effectively with a two-pass procedure. On selected items, DCA outperforms TEX and WEIR by 2% and 12%, respectively. The improvement is more pronounced in full schema evaluation, with an F1 measure of 0.95 (DCA) versus 0.63 (TEX) on 26 websites from TEX and ExAlg.
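The anchor-detection idea in the paragraph above can be illustrated with a minimal sketch. It is not the thesis's DCA implementation: it assumes only two pages, each given as a list of leaf-node texts, and it treats any text that occurs exactly once in both pages as a landmark candidate. The function names (longest_increasing_subsequence, landmark_anchors, split_segments) are hypothetical. The longest increasing subsequence keeps only the candidates whose order agrees in both pages, and the gaps between them are the segments to which the same procedure could be applied again.

```python
from bisect import bisect_left
from collections import Counter


def longest_increasing_subsequence(values):
    """Return indices of one strictly increasing subsequence of maximal length."""
    tails = []      # tails[k]: index of the smallest tail of a run of length k + 1
    tail_vals = []  # values at those indices, kept sorted for bisect
    prev = [-1] * len(values)
    for i, v in enumerate(values):
        k = bisect_left(tail_vals, v)
        if k == len(tails):
            tails.append(i)
            tail_vals.append(v)
        else:
            tails[k] = i
            tail_vals[k] = v
        prev[i] = tails[k - 1] if k > 0 else -1
    # Walk the predecessor links back from the end of the longest run.
    out = []
    i = tails[-1] if tails else -1
    while i != -1:
        out.append(i)
        i = prev[i]
    return out[::-1]


def landmark_anchors(page_a, page_b):
    """Pair texts unique in both pages (an assumed landmark rule), in consistent order."""
    count_a, count_b = Counter(page_a), Counter(page_b)
    pos_b = {text: j for j, text in enumerate(page_b)}
    candidates = [(i, pos_b[text]) for i, text in enumerate(page_a)
                  if count_a[text] == 1 and count_b[text] == 1]
    keep = longest_increasing_subsequence([j for _, j in candidates])
    return [candidates[k] for k in keep]


def split_segments(page_a, page_b, anchors):
    """Return the pairs of leaf sub-sequences lying between consecutive anchors."""
    bounds = [(-1, -1)] + anchors + [(len(page_a), len(page_b))]
    return [(page_a[i0 + 1:i1], page_b[j0 + 1:j1])
            for (i0, j0), (i1, j1) in zip(bounds, bounds[1:])]


# Made-up leaf sequences from two detail pages that share a template.
page_a = ["Title", "Dune", "Author", "Herbert", "Price", "$9"]
page_b = ["Title", "Emma", "Author", "Austen", "Format", "Paperback", "Price", "$7"]
anchors = landmark_anchors(page_a, page_b)
segments = split_segments(page_a, page_b, anchors)
```

In this toy input the labels "Title", "Author", and "Price" become anchors, and the segment between "Author" and "Price" still contains the optional "Format" pair on one page only, which is the kind of segment a recursive pass would have to align.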
In the second work, we propose DCADE (Divide-and-Conquer Alignment with Dynamic Encoding), an unsupervised method for full schema web data extraction from either multiple list pages or singleton pages that share the same template. We define the Content Equivalence Class and the Typeset Equivalence Class based on leaf node content. We then combine HTML attributes (id and class) in the paths for various levels of encoding, so that the proposed algorithm can align leaf nodes by exploring patterns at various levels, from specific to general. We conducted experiments on 49 real-world websites used in TEX and ExAlg. The proposed DCADE achieved a 0.962 F1 measure for non-recordset data extraction (FD) and a 0.936 F1 measure for recordset data extraction (FS), outperforming other page-level web data extraction methods, namely DCA (FD=0.660), TEX (FD=0.454 and FS=0.549), RoadRunner (FD=0.396 and FS=0.330), and UWIDE (FD=0.260 and FS=0.081).
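The idea of encoding paths at several levels, from specific to general, can be sketched as follows. The abstract does not say which levels DCADE actually uses, so the three levels here (tag with id and class, tag with class, tag only) are an assumption made for illustration, as are the names Step, encode_path, and first_matching_level.

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    """One element on the DOM path from the root down to a leaf node."""
    tag: str
    id: Optional[str] = None
    cls: Optional[str] = None


# Assumed encoding levels, ordered from most specific to most general.
LEVELS = ("tag+id+class", "tag+class", "tag")


def encode_path(path, level):
    """Encode a path as a string, using fewer attributes as the level grows."""
    parts = []
    for step in path:
        label = step.tag
        if level == 0 and step.id:
            label += "#" + step.id
        if level <= 1 and step.cls:
            label += "." + step.cls
        parts.append(label)
    return "/".join(parts)


def first_matching_level(path_a, path_b):
    """Return the most specific level at which both paths encode identically, or None."""
    for level in range(len(LEVELS)):
        if encode_path(path_a, level) == encode_path(path_b, level):
            return level
    return None


# Made-up paths: two price leaves whose containers differ only in a per-item id.
price_a = [Step("div", id="item-101", cls="product"), Step("span", cls="price")]
price_b = [Step("div", id="item-202", cls="product"), Step("span", cls="price")]
level = first_matching_level(price_a, price_b)
```

Here the two leaves differ at the most specific level because of the per-item id, but agree once the id is dropped, so an aligner that relaxes the encoding step by step could still match them.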
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

