Abstract: | In traditional Web data extraction services, collecting large volumes of announcement-style data (such as news or event pages) usually requires users to mark pagination links by hand in the extraction system. On sites with many paginated pages, users therefore spend a great deal of time "teaching the machine how to switch pages", which prevents effective large-scale data extraction. This study recasts the problem as a sequence labeling problem from the NLP field and proposes a neural-network-based sequence labeling method, PRNSM, which also uses HTML attribute information that most Web page labeling research does not use, to label the links in a page as "PAGE", "NEXT", or "OTHER". The model achieves an average macro F1 of 0.818 when trained and tested on a single language. We also demonstrate its multilingual performance through zero-shot experiments, reaching an average macro F1 of 0.774 on the DE, RU, ZH, JA, and KO test sets. Finally, we combine this work with an unsupervised data extraction system to build a large-scale automated data extraction system; in a real-world large-scale event extraction application, it automatically generated 196 data APIs from 402 websites, an API creation rate close to 0.5.;Information extraction, transformation, and loading (ETL) tools are important for big data analysis and value-added applications, especially when the information comes from the Web. Typical Web scraping systems allow users to specify where to fetch a page and what information or data to extract from it. Although these commercial services already provide a friendly graphical user interface (GUI) for guiding the system to the target pages of each data source, such systems are not scalable because users have to create crawlers one by one. In this paper we consider the problem of pagination recognition, which aims to automate the process of telling the system how to find similar pages by locating the next-page link and the list of page links from any starting URL. We propose a neural sequence model that labels each clickable link in a page with one of three tags: "NEXT", "PAGE", or "OTHER", where the first two guide the system to pages similar to the seed URL. For multilingual support, we exploit the query and keywords in the links, and use LASER for anchor text embedding. The experimental results show that the proposed model, called PRNSM (Pagination Recognition Neural Sequence Model), achieves an average macro F1 score of 0.774 on 6 datasets including EN, DE, RU, ZH, JA, and KO.
In a practical deployment for event extraction, we are able to automatically create 196 data APIs from 402 given event source URLs. |
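The macro F1 figures quoted in the abstract average the per-tag F1 over the three tags with equal weight, so the rare "NEXT" tag counts as much as the frequent "OTHER" tag. The following is a minimal sketch of that metric only, not the authors' evaluation code; the function name `macro_f1` and the example link sequence are hypothetical and chosen for illustration.

```python
from collections import Counter

# Tag set taken from the abstract.
TAGS = ("NEXT", "PAGE", "OTHER")


def macro_f1(gold, pred, tags=TAGS):
    """Return the unweighted mean of per-tag F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct label for this link
        else:
            fp[p] += 1      # predicted tag was wrong
            fn[g] += 1      # gold tag was missed
    scores = []
    for tag in tags:
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the tag never occurs.
        scores.append(2 * tp[tag] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for the six links of one page (illustration only).
gold = ["OTHER", "PAGE", "PAGE", "PAGE", "NEXT", "OTHER"]
pred = ["OTHER", "PAGE", "PAGE", "OTHER", "NEXT", "OTHER"]
score = macro_f1(gold, pred)
```

In this made-up example one "PAGE" link is mislabeled as "OTHER", giving per-tag F1 scores of 1.0 (NEXT), 0.8 (PAGE), and 0.8 (OTHER). By the same kind of arithmetic, the API creation rate in the abstract is 196 / 402 ≈ 0.488.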