Thesis 108525003: Full Metadata Record

DC field | Value | Language
DC.contributor | 軟體工程研究所 | zh_TW
DC.creator | 吳承儒 | zh_TW
DC.creator | Cheng-Ju Wu | en_US
dc.date.accessioned | 2021-8-4T07:39:07Z
dc.date.available | 2021-8-4T07:39:07Z
dc.date.issued | 2021
dc.identifier.uri | http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=108525003
dc.contributor.department | 軟體工程研究所 | zh_TW
DC.description | 國立中央大學 | zh_TW
DC.description | National Central University | en_US
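dc.description.abstract | In traditional web data extraction services, collecting large volumes of announcement-style data (such as news or event pages) often requires users to manually annotate pagination in the extraction system. For websites with a large number of pages, users therefore spend considerable time "teaching the machine how to switch pages", which prevents effective large-scale data extraction. This study recasts the problem as a sequence labeling problem from the NLP field and proposes PRNSM, a neural-network-based sequence labeling method that also incorporates HTML attribute information, which most web page labeling studies do not use. It labels the pagination links in a web page as "PAGE", "NEXT", or "OTHER" and achieves an average Macro F1 of 0.818 when trained and tested on a single language. We also demonstrate the model's multilingual performance through zero-shot experiments, reaching an average Macro F1 of 0.774 on the DE, RU, ZH, JA, and KO test datasets. Finally, we combine this work with an unsupervised data extraction system to build a large-scale automated data extraction system. In a practical large-scale event extraction application, we automatically generate 196 data APIs from 402 websites, an API creation rate close to 0.5. | zh_TW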
dc.description.abstract | Information extraction, transformation, and loading (ETL) tools are important for big data analysis and value-added applications, especially when the information comes from the Web. Typical Web scraping systems let users specify where to fetch a page and what information or data to extract from it. Although these commercial services already provide a friendly graphical user interface (GUI) for guiding the system to the target pages of each data source, such systems are not scalable because users have to create crawlers one by one. In this paper we consider the problem of pagination recognition, which aims to automate the process of telling the system how to find similar pages by locating the next-page link and the list of page links from any starting URL. We propose a neural sequence model that labels each clickable link in a page with one of three tags: "NEXT", "PAGE", or "OTHER", where the first two guide the system to pages similar to the seed URL. For multilingual support, we exploit the query and keywords in the links and use LASER for anchor text embedding. The experimental results show that the proposed model, called PRNSM (Pagination Recognition Neural Sequence Model), achieves an average macro F1 score of 0.774 on 6 datasets including EN, DE, RU, ZH, JA, and KO. In a practical deployment on event extraction, we are able to automatically create 196 data APIs from 402 given event source URLs. | en_US
DC.subject | ETL | zh_TW
DC.subject | 分頁預測 | zh_TW
DC.subject | 序列標記 | zh_TW
DC.subject | 自動化爬蟲系統 | zh_TW
DC.subject | ETL | en_US
DC.subject | Pagination prediction | en_US
DC.subject | Sequence labeling | en_US
DC.subject | Automated crawler system | en_US
DC.title | 基於自動分頁預測之大規模資料應用程式介面建置 - 以活動擷取為例 | zh_TW
dc.language.iso | zh-TW | zh-TW
DC.title | Large Scale Web Data API Creation via Automatic Pagination Recognition - A Case Study on Event Extraction | en_US
DC.type | 博碩士論文 | zh_TW
DC.type | thesis | en_US
DC.publisher | National Central University | en_US
