Abstract (English) |
The World Wide Web (WWW) is the mainstream medium for modern information dissemination, and many application services rely on web data extraction to support information integration. Many unsupervised data extraction methods have been proposed, but each family has limits. Multi-record extraction methods that consider a single page (such as MDR) can only process the record sets within that page and cannot take the overall structure into account. Multi-page alignment methods (such as DCADE) can distinguish templates from data by comparing multiple pages, but their identification of record sets is often not robust enough, so they frequently fail to complete the extraction task on complex websites. This study combines the advantages of the two approaches. MDR first extracts record sets from individual web pages, and the record sets from multiple pages are then handled by three sub-tasks, namely Recordset Matching, Column Alignment, and NonRecordset Matching. For the record-set part, we use BERT sentence representations to compute a representation of each data item in a record set and then apply schema matching to match record sets across pages; KNN and SVM classifiers are used to complete the Column Alignment task. For the non-record-set part, we use the DCADE multi-page extraction method to align attributes across pages. Finally, we combine the two results to achieve multi-page data extraction. In addition to the ExAlg and WEIR datasets, we provide a dataset of website news and announcement lists (Announcement List Website, ALW) for testing automatic extraction from latest-news or announcement list pages. Experimental results show that the proposed method, DEVOSM (Data Extraction via On-the-fly Schema Matching), improves the record set retrieval rate by 55.6%, 60%, and 33.7% on the ExAlg, WEIR, and ALW datasets, respectively, demonstrating its effectiveness. |
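The Recordset Matching step described in the abstract can be illustrated with a minimal sketch. It is not the DEVOSM implementation. It assumes each record has already been turned into a vector (short hand-written toy vectors stand in for BERT sentence representations here), it summarizes each record set by the centroid of its vectors, and it pairs record sets across two pages with a simple greedy one-to-one matcher in place of whatever schema matcher the thesis actually uses. The names `match_record_sets`, `centroid`, and `threshold`, and the sample page data, are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def match_record_sets(page_a, page_b, threshold=0.8):
    """Greedily pair record sets of two pages by centroid similarity.

    page_a, page_b: dict mapping a record-set name to a list of record
    vectors (assumed stand-ins for sentence embeddings).
    Returns a dict {name_in_page_a: name_in_page_b}.
    """
    cent_a = {name: centroid(vecs) for name, vecs in page_a.items()}
    cent_b = {name: centroid(vecs) for name, vecs in page_b.items()}
    # Score every cross-page pair, best first.
    scored = sorted(
        ((cosine(ca, cb), a, b)
         for a, ca in cent_a.items()
         for b, cb in cent_b.items()),
        reverse=True,
    )
    used_a, used_b, matches = set(), set(), {}
    for score, a, b in scored:
        if score < threshold:
            break  # remaining pairs are too dissimilar to match
        if a in used_a or b in used_b:
            continue  # keep the matching one-to-one
        matches[a] = b
        used_a.add(a)
        used_b.add(b)
    return matches


# Toy data: two pages of the same site, record sets listed in different order.
page_1 = {
    "news_list": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "nav_menu": [[0.0, 0.1, 0.9]],
}
page_2 = {
    "menu": [[0.1, 0.0, 1.0]],
    "announcements": [[1.0, 0.1, 0.1], [0.9, 0.0, 0.0]],
}
matches = match_record_sets(page_1, page_2)
print(matches)
```

A greedy matcher is used only to keep the sketch short; an optimal assignment over the same similarity matrix would be a natural alternative.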