Abstract (English) |
Since its inception, the Internet has become not only the main platform for application development but also the most important channel through which people obtain information. Large numbers of web crawlers are built to collect information from the Internet so that it can be integrated into value-added information services. According to statistics from the Internet security companies Imperva and Barracuda, half of all Internet traffic comes from bots. To defend against malicious bots, web page architectures are becoming increasingly complex, and JavaScript-based development techniques have changed the way pages embed and present data. This poses a considerable challenge for building value-added web application services: for example, the content of a page may be updated dynamically while its URL stays unchanged. How to crawl websites of this type is the subject of this study.
To obtain the information on dynamic web pages, this research developed a system, built as a Chrome extension, that simulates the user's click process. The extension records the user's clicks and input so that the user's browsing operations can be replayed and the web data captured. It helps users crawl web page data without writing code and provides scheduled automatic crawling. This addresses the problem of downloading dynamic pages from highly interactive and single-page websites and makes the extracted data available for reuse. In tests on 75 dynamic web pages taken from automatic page detection failures, government URL links, and Alexa's statistics, 70 were crawled successfully, a success rate of 93.33%. |
References |
[1] Thoma Bravo. Imperva. https://www.imperva.com/blog/bad-bot-report-2021-the-pandemic-of-the-internet/, 2002.
[2] Berislav Kucan. helpnetsecurity. https://www.helpnetsecurity.com/2021/09/07/bad-bots-internet-traffic/, 1998.
[3] Google. Chrome extension. https://chrome.google.com/webstore/category/extensions, 2009.
[4] Cheng-Ju Wu. Large-scale web data api creation via automatic pagination recognition - a case study on announcement monitoring. Master's thesis, National Central University, Taoyuan, Taiwan, 2021.
[5] Yu-An Chou. Web data etl system with unsupervised extraction. Master's thesis, National Central University, Taoyuan, Taiwan, 2018.
[6] S. Chaudhari, R. Aparna, V. G. Tekkur, G. L. Pavan, and S. R. Karki. Ingredient/recipe algorithm using web mining and web scraping for smart chef. In 2020 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), pages 1-3, Bangalore, India, 2020. IEEE.
[7] K. Sundaramoorthy, R. Durga, and S. Nagadarshini. Newsone - an aggregation system for news using web scraping method. In 2017 International Conference on Technical Advancements in Computers and Communications (ICTACC), pages 1-4, Melmaruvathur, India, 2017. IEEE.
[8] L. R. Julian and F. Natalia. The use of web scraping in computer parts and assembly price comparison. In 2015 3rd International Conference on New Media (CONMEDIA), pages 2-4, Tangerang, Indonesia, 2015. IEEE.
[9] Oviliani Y. Yuliana and Chia-Hui Chang. Dcade: divide and conquer alignment with dynamic encoding for full page data extraction. Applied Intelligence, pages 1-25, July 2019.
[10] wikipedia. Ajax. https://en.wikipedia.org/wiki/Ajax_(programming), 1999.
[11] wikipedia. Xpath. https://en.wikipedia.org/wiki/XPath, 1998.
[12] wikipedia. Css. https://en.wikipedia.org/wiki/CSS, 1996.
[13] wikipedia. Http. https://en.wikipedia.org/wiki/HTTP, 1996.
[14] Shore Group Associates. shoregrpleaderboard. https://www.shoregrp.com/blog/top-free-no-code-web-scraping-tools, 2006.
[15] Hevo. Hevo. https://hevodata.com/learn/8-best-web-scraping-tools/, 2017.
[16] webhose.io. webhose.io. https://webhose.io/, 2015.
[17] Gil Elbaz. commoncrawl. https://commoncrawl.org/, 2011.
[18] Shore Group Associates. shoregrp. https://www.shoregrp.com/, 2006.
[19] Proxy Crawl. scraperapi. https://www.scraperapi.com/, 2017.
[20] Zyte (formerly Scrapinghub). Scrapy. https://scrapy.org/, 2008.
[21] Web Scraper. Web scraper. https://webscraper.io/, 2017.
[22] Octoparse. Octoparse. https://www.octoparse.com/, 2016.
[23] Simplescraper. Simplescraper. https://simplescraper.io/, 2017.
[24] ParseHub. Parsehub. https://www.parsehub.com/, 2013.
[25] Dexi.io. Dexi.io. https://www.dexi.io/, 2015.
[26] Mozenda. Mozenda. https://www.mozenda.com/, 2007.
[27] Content Grabber. Content grabber. https://contentgrabber.com/Manual/understanding_the_concept.htm, 2020.
[28] Import.io. Import.io. https://www.import.io/, 2013.
[29] Enlyft. enlyftselenium. https://enlyft.com/tech/products/selenium, 2021.
[30] pcloudy. pcloudy. https://www.pcloudy.com/blogs/best-selenium-python-frameworks-for-test-automation-in-2021/, 2021.
[31] Holger Krekel. pytest. https://docs.pytest.org/en/6.2.x/, 2004.
[32] Pekka Klarck and Janne Harkonen. Robot framework. https://robotframework.org/, 2008.
[33] Benno Rice. behave. https://behave.readthedocs.io/en/stable/, 2012.
[34] Steve Purcell. pyunit. http://pyunit.sourceforge.net/, 2001.
[35] Jason Pellerin. nose2. https://docs.nose2.io/en/latest/, 2010.
[36] Jhong li Ding. Page-level information extraction system. Master's thesis, National Central University, Taoyuan, Taiwan, 2015.
[37] Bing Liu, Robert Grossman, and Yanhong Zhai. Mining data records in web pages. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 601-606, New York, 2003. ACM.
[38] Chia-Hui Chang, Tian-Sheng Chen, Ming-Chuan Chen, and Jhung-Li Ding. Efficient page-level data extraction via schema induction and verification. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 478-490, Switzerland, 2016. Springer.
[39] Elsevier. sciencedirect. https://www.sciencedirect.com/, 1997.
[40] Alexa Internet. Alexa. https://www.alexa.com/topsites, 1996. |