Abstract (English) |
Information extraction, transformation, and loading (ETL) tools are important for big data analysis and value-added applications, especially when the information comes from the Web. Typical Web scraping systems let users specify where to fetch a page and what data to extract from it. Although commercial services already provide a friendly graphical user interface (GUI) for guiding the system to the target pages of each data source, such systems do not scale, because users have to create crawlers one by one. In this paper, we consider the problem of pagination recognition, which aims to automate the process of telling the system how to find similar pages by locating the next-page link and the list of page links from any starting URL. We propose a neural sequence model that labels each clickable link in a page with one of three tags: "NEXT", "PAGE", or "OTHER", where the first two guide the system to pages similar to the seed URL. For multilingual support, we exploit the query and keywords in the links and use LASER for anchor text embedding. The experimental results show that the proposed model, called PRNSM (Pagination Recognition Neural Sequence Model), achieves an average macro-F1 score of 0.774 on six datasets covering EN, DE, RU, ZH, JA, and KO. In a practical deployment for event extraction, we were able to automatically create 196 data APIs from 402 given event source URLs. |
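As a rough illustration of the input such a sequence labeler might consume, the sketch below collects the clickable links of a page in document order and derives a few URL-based features (query keys and path tokens) alongside the anchor text. It is only a minimal sketch under stated assumptions: the names `LinkCollector`, `link_features`, and `extract_sequence`, the feature set, and the example URL are hypothetical and are not taken from PRNSM. The neural model and the LASER embedding step are not shown.

```python
import re
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs for each <a href=...> in document order."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def link_features(base_url, href, text):
    """Hypothetical per-link features: anchor text, query keys, URL path tokens."""
    url = urlparse(urljoin(base_url, href))
    return {
        "text": text,
        "query_keys": sorted(parse_qs(url.query)),
        "path_tokens": [t for t in re.split(r"[^A-Za-z0-9]+", url.path) if t],
    }


def extract_sequence(base_url, html):
    """Turn a page into the ordered link sequence a tagger would label."""
    parser = LinkCollector()
    parser.feed(html)
    return [link_features(base_url, h, t) for h, t in parser.links]


# Assumed toy page: a numbered page link, a "next page" link, and an unrelated link.
html = (
    '<a href="/news?page=2">2</a>'
    '<a href="/news?page=2">下一頁</a>'
    '<a href="/about">About</a>'
)
seq = extract_sequence("https://example.com/news", html)
for item in seq:
    print(item)
```

A tagger would then assign one of NEXT, PAGE, or OTHER to each element of `seq`. Keeping the links in document order matters here, since pagination links tend to appear next to one another.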