Automatic data extraction from template pages is an essential task for data integration and analysis. Most research focuses on data extraction from list pages. The problem of data alignment for singleton pages, which contain detailed information about a single item, is less addressed and more challenging. In the first work, we propose a novel Divide-and-Conquer Alignment algorithm (DCA) that works on leaf nodes from the DOM trees of singleton pages. The idea is to detect mandatory templates via the longest increasing subsequence of the landmark equivalence class leaf nodes and to recursively apply the same procedure to each segment divided by the mandatory templates. DCA aligns each segment efficiently and handles multi-order attribute-value pairs effectively with a two-pass procedure. On selected items, DCA outperforms TEX and WEIR by 2% and 12%, respectively. The improvement is more pronounced in full-schema evaluation, where DCA reaches an F1 measure of 0.95 versus 0.63 for TEX on 26 websites from TEX and ExAlg.
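The divide-and-conquer idea can be illustrated with a minimal sketch. It is not the DCA implementation: it assumes only two pages, treats a leaf text as a landmark when it occurs exactly once in each page (a simplification of the landmark equivalence class), and omits the two-pass handling of multi-order attribute-value pairs. The function names `longest_increasing_subsequence` and `align` are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def longest_increasing_subsequence(values):
    """Return indices of one longest strictly increasing subsequence."""
    tails, tail_vals = [], []      # tails[k]: index ending the best run of length k+1
    prev = [-1] * len(values)      # back-pointers for reconstruction
    for i, v in enumerate(values):
        k = bisect_left(tail_vals, v)
        if k == len(tails):
            tails.append(i)
            tail_vals.append(v)
        else:
            tails[k] = i
            tail_vals[k] = v
        prev[i] = tails[k - 1] if k > 0 else -1
    out = []
    i = tails[-1] if tails else -1
    while i != -1:
        out.append(i)
        i = prev[i]
    return out[::-1]


def align(a, b):
    """Align two leaf-text sequences (assumed simplification: two pages only)."""
    count_a, count_b = Counter(a), Counter(b)
    # Landmark candidates: text that occurs exactly once in both sequences.
    pos_b = {t: j for j, t in enumerate(b) if count_b[t] == 1}
    pairs = [(i, pos_b[t]) for i, t in enumerate(a)
             if count_a[t] == 1 and t in pos_b]
    if not pairs:
        # No landmark left: the whole segment is treated as data.
        return [("data", a, b)] if (a or b) else []
    # Keep only landmarks whose order agrees in both pages.
    keep = longest_increasing_subsequence([j for _, j in pairs])
    rows, start_a, start_b = [], 0, 0
    for k in keep:
        i, j = pairs[k]
        rows += align(a[start_a:i], b[start_b:j])   # recurse on the segment
        rows.append(("template", a[i]))
        start_a, start_b = i + 1, j + 1
    rows += align(a[start_a:], b[start_b:])
    return rows


page_a = ["Title", "Dune", "Author", "Frank Herbert", "Price", "$9"]
page_b = ["Title", "Emma", "Author", "Jane Austen", "Price", "$7"]
rows = align(page_a, page_b)
```

In this toy input, "Title", "Author", and "Price" act as mandatory templates, and the segments between them are returned as data. The recursion matters because a text that repeats across the whole page can become unique, and therefore usable as a landmark, inside a smaller segment.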
In the second work, we propose an unsupervised full-schema web data extraction method, Divide-and-Conquer Alignment with Dynamic Encoding (DCADE), which works on either multiple list pages or singleton pages that share the same template. We define the Content Equivalence Class and the Typeset Equivalence Class based on leaf node content. We then combine HTML attributes (id and class) in the paths for various levels of encoding, so that the proposed algorithm can align leaf nodes by exploring patterns at various levels, from specific to general. We conducted experiments on 49 real-world websites used in TEX and ExAlg. The proposed DCADE achieved an F1 measure of 0.962 for non-recordset data extraction (FD) and 0.936 for recordset data extraction (FS), outperforming other page-level web data extraction methods, including DCA (FD=0.660), TEX (FD=0.454 and FS=0.549), RoadRunner (FD=0.396 and FS=0.330), and UWIDE (FD=0.260 and FS=0.081).
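One way to picture encoding at levels from specific to general is the sketch below. The three levels (tag with class and id, tag with class, tag only), the path representation as `(tag, id, class)` tuples, and the names `encode_path` and `most_specific_match` are all assumptions made for illustration; the abstract does not specify how DCADE defines its levels.

```python
def encode_path(path, level):
    """Encode a DOM path at an assumed level of specificity.

    level 0: tag + class + id (most specific)
    level 1: tag + class
    level 2: tag only (most general)
    """
    parts = []
    for tag, node_id, node_class in path:
        part = tag
        if level <= 1 and node_class:
            part += "." + node_class
        if level == 0 and node_id:
            part += "#" + node_id
        parts.append(part)
    return "/".join(parts)


def most_specific_match(path_a, path_b, levels=(0, 1, 2)):
    """Return the first (most specific) level at which two paths encode equally."""
    for level in levels:
        if encode_path(path_a, level) == encode_path(path_b, level):
            return level
    return None


# Two record rows whose ids differ but whose classes agree.
row_1 = [("div", "item-1", "row"), ("span", None, "price")]
row_2 = [("div", "item-2", "row"), ("span", None, "price")]
level = most_specific_match(row_1, row_2)
```

Here the two paths differ at the most specific level because of their ids but agree once ids are dropped, which is the kind of fallback from specific to general that the encoding levels are meant to allow.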