Large Scale Web Data API Creation via Automatic Pagination Recognition - A Case Study on Event Extraction

DC Field	Value	Language
DC.contributor	軟體工程研究所	zh_TW
DC.creator	吳承儒	zh_TW
DC.creator	Cheng-Ju Wu	en_US
dc.date.accessioned	2021-08-04T07:39:07Z
dc.date.available	2021-08-04T07:39:07Z
dc.date.issued	2021
dc.identifier.uri	http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=108525003
dc.contributor.department	軟體工程研究所	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
dc.description.abstract	在傳統網頁擷取(Web Data Extraction)服務中，若碰到需要大量公告式資料(如:新聞、活動頁面等等)的情況，往往會需要透過使用者手動在網頁擷取系統上做分頁標記，因此在遇到分頁資料量龐大的網站時，使用者會耗費大量的時間在「教導機器如何切換網頁」，導致無法有效地進行大規模的資料擷取。本研究將會把這個問題轉換成 NLP 領域中的序列標記(Sequence Labeling)問題，提供了基於神經網路的序列標記方法 - PRNSM，並結合了大多數網頁標記研究不會使用的 HTML Attribute 資訊，將網頁中的分頁標記成「PAGE」、「NEXT」以及「OTHER」，並在單一語言訓練、測試上面得到 0.818 的平均 Macro F1，另外我們也透過零樣本實驗展示模型在多語言的效能，在測試資料集 DE, RU, ZH, JA, KO 的零樣本實驗中達到了 0.774 的平均 Macro F1，最後我們將研究成果結合非監督式資料擷取系統(Unsupervised Data Extraction System)，建立大規模自動化資料擷取系統，在大規模活動擷取的實際應用中，我們能從 402 個網站中自動產生出 196 個資料 API，達到接近 0.5 的 API 建立率。	zh_TW
dc.description.abstract	Information extraction, transformation and loading (ETL) tools are important for big data analysis and value-added applications, especially when the information comes from the Web. Typical Web scraping systems let users specify where to fetch a page and what information or data to extract from it. Although these commercial services already provide a friendly graphical user interface (GUI) for guiding the system to the target pages of each data source, such systems are not scalable because users have to create crawlers one by one. In this paper we consider the problem of pagination recognition, which aims to automate the process of telling the system how to find similar pages by locating the next-page link and the list of page links from any starting URL. We propose a neural sequence model that labels each clickable link in a page with one of three tags: "NEXT", "PAGE" or "OTHER", where the first two guide the system to pages similar to the seed URL. For multilingual support, we exploit the query and keywords in the links as well as LASER for anchor text embedding. The experimental results show that the proposed model, called PRNSM (Pagination Recognition Neural Sequence Model), achieves an average macro F1 score of 0.774 on 6 datasets including EN, DE, RU, ZH, JA, and KO. In terms of practical deployment on event extraction, we are able to automatically create 196 data APIs from 402 given event source URLs.	en_US
DC.subject	ETL	zh_TW
DC.subject	分頁預測	zh_TW
DC.subject	序列標記	zh_TW
DC.subject	自動化爬蟲系統	zh_TW
DC.subject	ETL	en_US
DC.subject	Pagination prediction	en_US
DC.subject	Sequence labeling	en_US
DC.subject	Automated crawler system	en_US
DC.title	基於自動分頁預測之大規模資料應用程式介面建置 - 以活動擷取為例	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	Large Scale Web Data API Creation via Automatic Pagination Recognition - A Case Study on Event Extraction	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Thesis 108525003: full metadata record