Large Scale Web Data API Creation via Automatic Pagination Recognition - A Case Study on Event Extraction

Name

Cheng-Ju Wu (吳承儒)

Department

Graduate Institute of Software Engineering

Thesis Title

Large Scale Web Data API Creation via Automatic Pagination Recognition - A Case Study on Event Extraction
(基於自動分頁預測之大規模資料應用程式介面建置 - 以活動擷取為例)

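Related Theses

★ A Study on Recognizing Event-Invitation Emails and Extracting Irregular Time Expressions
★ Design of the NCUFree Campus Wireless Network Platform and Development of Application Services
★ Design and Implementation of a Semi-Structured Data Extraction System for the Internet
★ Mining and Applications of Non-Simple Browsing Paths
★ Improvements to Association Rule Mining on Incremental Data
★ Applying the Chi-Square Test of Independence to Associative Classification
★ Design and Study of a Chinese Information Extraction System
★ Visualization of Non-Numerical Data and Clustering That Combines Subjective and Objective Views
★ A Study of Associated Word Groups in Document Summarization
★ Purifying Web Pages: Page Segmentation and Data Region Extraction
★ Design and Study of a Question Answering System Using Sentence Classification and Ranking
★ Efficient Mining of Compact Frequent Sequential Event Patterns in Temporal Databases
★ Axis Arrangement of Star Coordinates Applied to Cluster Visualization
★ A Study on Automatically Generating Web Crawlers from Browsing History
★ A Study of Template and Data Analysis for Dynamic Web Pages
★ A Study on Automating Data Integration of Homogeneous Web Pages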

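This electronic thesis is authorized for immediate open access.
Electronic full texts that have reached their open-access date are licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China: do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid breaking the law.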

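Abstract (Chinese)

In traditional Web data extraction services, collecting large volumes of announcement-style data (such as news or event pages) usually requires users to mark pagination manually in the extraction system. For websites with many paginated pages, users therefore spend a great deal of time "teaching the machine how to switch pages", which prevents effective large-scale data extraction. This study recasts the problem as a sequence labeling problem from the NLP field and proposes a neural-network-based sequence labeling method, PRNSM, which also incorporates HTML attribute information that most Web page labeling studies do not use. The model labels pagination links in a page as "PAGE", "NEXT", or "OTHER" and achieves an average macro F1 of 0.818 when trained and tested on a single language. We also demonstrate the model's multilingual performance through zero-shot experiments, reaching an average macro F1 of 0.774 on the DE, RU, ZH, JA, and KO test sets. Finally, we combine this work with an unsupervised data extraction system to build a large-scale automated data extraction system. In a practical application to large-scale event extraction, we automatically generate 196 data APIs from 402 websites, an API creation rate close to 0.5.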
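The two figures quoted above, macro F1 and the API creation rate, can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the thesis's evaluation code: the function name `macro_f1` and the toy label sequences are assumptions made for illustration, and only the counts 196 and 402 come from the abstract.

```python
from collections import Counter

# The three tags named in the abstract.
LABELS = ("PAGE", "NEXT", "OTHER")


def macro_f1(gold, pred, labels=LABELS):
    """Return the unweighted mean of the per-label F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


# Toy link-label sequences (invented data, not from the thesis).
gold = ["OTHER", "PAGE", "PAGE", "NEXT", "OTHER", "OTHER"]
pred = ["OTHER", "PAGE", "OTHER", "NEXT", "OTHER", "PAGE"]
print(round(macro_f1(gold, pred), 3))

# API creation rate from the abstract: 196 APIs out of 402 websites.
api_creation_rate = 196 / 402
print(round(api_creation_rate, 3))
```

Because macro averaging weights every label equally, the rare "NEXT" and "PAGE" links count as much as the far more common "OTHER" links, which is presumably why the metric suits this task.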

Abstract (English)

Information extraction, transformation, and loading (ETL) tools are important for big data analysis and value-added applications, especially when the information comes from the Web. Typical Web scraping systems let users specify where to fetch a page and what data to extract from it. Although these commercial services already provide a friendly graphical user interface (GUI) for guiding the system to the target pages of each data source, they are not scalable because users have to create crawlers one by one. In this paper we consider the problem of pagination recognition, which aims to automate the process of telling the system how to find similar pages by locating the next-page link and the list of page links from any starting URL. We propose a neural sequence model that labels each clickable link in a page with one of three tags, "NEXT", "PAGE", or "OTHER", where the first two guide the system to pages similar to the seed URL. For multilingual support, we exploit the query and keywords in the links and use LASER for anchor text embedding. The experimental results show that the proposed model, called PRNSM (Pagination Recognition Neural Sequence Model), achieves an average macro F1 score of 0.774 on 6 datasets: EN, DE, RU, ZH, JA, and KO. In a practical deployment on event extraction, we automatically create 196 data APIs from 402 given event source URLs.
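To make the problem setup concrete, the sketch below turns a page into the ordered sequence of clickable links that a sequence labeler would tag as "NEXT", "PAGE", or "OTHER". It is an illustration under stated assumptions, not PRNSM itself: the class name `LinkSequenceParser`, the feature names, and the sample HTML are all hypothetical, and the neural model and LASER embeddings described in the abstract are not shown.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse

# Elements that never have a closing tag, so they must not go on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class LinkSequenceParser(HTMLParser):
    """Collect <a href> elements in document order with simple features."""

    def __init__(self):
        super().__init__()
        self.links = []      # one feature dict per link, in page order
        self._current = None  # link currently being read, if any
        self._stack = []      # open elements as (tag, attrs) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if tag == "a" and href:
            parent_tag = self._stack[-1][0] if self._stack else ""
            self._current = {
                "href": href,
                "text": "",
                "class": attrs.get("class") or "",
                "rel": attrs.get("rel") or "",
                # Query parameter names such as "page" hint at pagination.
                "query_keys": sorted(parse_qs(urlparse(href).query)),
                "parent_tag": parent_tag,
            }
        if tag not in VOID:
            self._stack.append((tag, attrs))

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


# Hypothetical page fragment used only for this example.
SAMPLE = """
<ul class="pagination">
  <li><a href="/events?page=1" class="active">1</a></li>
  <li><a href="/events?page=2">2</a></li>
  <li><a href="/events?page=2" rel="next">Next</a></li>
</ul>
<a href="/about">About us</a>
"""

parser = LinkSequenceParser()
parser.feed(SAMPLE)
for link in parser.links:
    print(link["text"], link["query_keys"], link["parent_tag"])
```

Keeping the links in document order matters because neighbouring links carry context: a run of numeric anchors followed by a "Next" anchor is much stronger evidence of pagination than any single link taken alone.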

Keywords (Chinese)

★ ETL
★ 分頁預測 (pagination prediction)
★ 序列標記 (sequence labeling)
★ 自動化爬蟲系統 (automated crawler system)

Keywords (English)

★ ETL
★ Pagination prediction
★ Sequence labeling
★ Automated crawler system

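Table of Contents

Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
1. Introduction
2. Related Work
2.1 Unsupervised Information Extraction Systems
2.2 Commercial Data Extraction Services
2.3 Pagination Tag Detection
2.4 Sequence Labeling
2.5 Multilingual Sentence Embeddings
2.6 Web Page Node Representation
3. Pagination Tag Detection
3.1 Problem Definition
3.2 Proposed Method
3.2.1 Parent Node Information
3.2.2 Web Page Attribute Embedding
3.2.3 Text Content Embedding
3.2.4 Sequence Representation Layer
3.2.5 Tag Prediction Layer
3.2.6 Training Objective
3.3 Training Analysis
3.3.1 Datasets
3.3.2 Experimental Settings
3.3.3 Experimental Results
3.3.4 Model Development Experiments
4. Case Study - Event Extraction
4.1 Multiple Message Splitting
4.2 Experimental Study
4.2.1 Dataset
4.2.2 Final Results
5. Conclusion
References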

Advisor

Chia-Hui Chang (張嘉惠)

Approval Date

2021-8-4
