Generation of dynamic web crawler via browser simulator (基於網頁瀏覽模擬器之動態爬蟲程式生成研究)

Name

Hsun Liao (廖勳)

Department

Department of Computer Science and Information Engineering, In-service Master's Program

Thesis title

Generation of dynamic web crawler via browser simulator - Decoupling of crawling and extraction for WebETL tool construction
(基於網頁瀏覽模擬器之動態爬蟲程式生成研究)

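Related theses

★ Recognition of schedule-invitation emails and extraction of irregular time expressions	★ Design of the NCUFree campus wireless network platform and development of application services
★ Design and implementation of a semi-structured data extraction system for the Internet	★ Mining and applications of non-simple browsing paths
★ Improvements to association rule mining on incremental data	★ Applying the chi-square test of independence to associative classification
★ Design and study of a Chinese information extraction system	★ Visualization of non-numeric data and clustering that combines subjective and objective criteria
★ A study of associated word groups in document summarization	★ Cleaning web pages: page segmentation and data region extraction
★ Design and study of a question answering system using sentence classification and ranking	★ Efficient mining of compact frequent continuous event patterns in temporal databases
★ Axis arrangement of star coordinates applied to cluster visualization	★ Automatic generation of web crawlers from browsing histories
★ Template and data analysis of dynamic web pages	★ Automated integration of data from homogeneous web pages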

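This electronic thesis is authorized for immediate open access.
Electronic full texts that have reached their open-access date are licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, so as to avoid breaking the law.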

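Abstract (translated from the Chinese)

The Internet has become not only the main platform for application development but also the primary channel through which people obtain information. Large numbers of web crawlers are built to collect information from the Web so that it can be integrated into value-added information services. According to statistics from the cybersecurity companies Imperva and Barracuda, half of all Internet traffic comes from bots. To defend against attacks by malicious bots, web page architectures have grown increasingly complex, and the use of JavaScript has changed how pages embed and present data. This is a considerable challenge for building value-added web applications; for example, a page's content may be updated dynamically while its URL stays the same. How to crawl websites of this kind is the subject of this thesis.

To obtain data from dynamic web pages, this study develops, as a Chrome extension, a system that simulates the user's click flow. The extension records the user's clicks and inputs so that the user's browsing operations can be replayed and the page data captured. It helps users capture web page data without writing code and supports scheduled automatic crawling. It improves how the WebETL System handles dynamic page downloads on highly interactive and single-page websites, achieving data extraction and reuse. Of 75 dynamic web pages drawn from pages where automatic pagination detection failed, government website links, and popular sites ranked by Alexa, 70 were crawled successfully, a success rate of 93.33%.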

Abstract (English)

The Internet has become not only the main platform for application development but also the most important channel through which people obtain information. A large number of web crawlers are built to collect information from the Internet in order to integrate it and provide value-added information services. According to statistics from the Internet security companies Imperva and Barracuda, half of all Internet traffic comes from bots. To prevent attacks from malicious bots, web page architectures are becoming more and more complex, and the use of JavaScript has changed the way web pages embed and present data. This is a considerable challenge for building value-added web application services; for example, the content of a page may be updated dynamically while its URL stays unchanged. How to crawl websites of this type is the subject of this thesis.

To obtain data from dynamic web pages, this research develops a system, built as a Chrome extension, that simulates the user's click process. The extension records the user's clicks and input so that the user's operations during browsing can be reproduced and the web data captured. It helps users crawl web page data without writing code and provides scheduled automatic crawling. For the problem of downloading dynamic pages from highly interactive and single-page websites, it achieves the goal of data extraction and reuse. Of 75 dynamic web pages drawn from pages where automatic pagination detection failed, government website links, and popular sites ranked by Alexa, 70 were crawled successfully, a success rate of 93.33%.
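
The record-and-replay idea described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation, which is a Chrome extension; it is a minimal Python model under stated assumptions. The names `Action`, `FakePage`, and `replay`, the selectors, and the example URL are all hypothetical, and `FakePage` is an in-memory stand-in for a real browser page whose content changes while its URL does not.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Action:
    """One recorded user step: a click or a text input on an element."""
    kind: str                     # "click" or "input"
    selector: str                 # how the element is located (hypothetical CSS selector)
    value: Optional[str] = None   # text typed by the user, only for "input" steps


class FakePage:
    """Hypothetical in-memory stand-in for a dynamic page with a fixed URL."""

    def __init__(self) -> None:
        self.url = "https://example.org/search"
        self.fields: Dict[str, str] = {}
        self.content = "<p>no results yet</p>"

    def type_into(self, selector: str, text: str) -> None:
        self.fields[selector] = text

    def click(self, selector: str) -> None:
        # Clicking the search button rewrites the content in place,
        # the way a script-driven page would, without changing the URL.
        if selector == "#search-button":
            query = self.fields.get("#query", "")
            self.content = f"<ul><li>result for {query}</li></ul>"


def replay(script: List[Action], page: FakePage,
           overrides: Optional[Dict[str, str]] = None) -> str:
    """Re-run recorded steps on the page and return the resulting content.

    `overrides` maps a selector to a replacement input value, so one
    recording can be reused with different inputs.
    """
    overrides = overrides or {}
    for action in script:
        if action.kind == "input":
            text = overrides.get(action.selector, action.value or "")
            page.type_into(action.selector, text)
        elif action.kind == "click":
            page.click(action.selector)
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
    return page.content


# A recording as it might be captured from one browsing session.
recorded = [
    Action("input", "#query", "web crawler"),
    Action("click", "#search-button"),
]

print(replay(recorded, FakePage()))
print(replay(recorded, FakePage(), overrides={"#query": "WebETL"}))
```

Keeping the recording as plain data, separate from the code that replays it, is what would allow the same script to be scheduled repeatedly or rerun with substituted inputs; in a real tool the `FakePage` calls would be replaced by browser automation.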

Keywords (translated from the Chinese)

★ dynamic web page
★ no-code
★ web scraping

Keywords (English)

★ dynamic web page
★ no code
★ web scraper

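Table of contents

Chinese Abstract
Abstract
Contents
List of Figures
List of Tables
1. Introduction
2. Related Work
2.1 Data-Providing Services
2.2 Customized Services
2.3 Developer Packages or Services
2.4 Web Scraping Tools
2.5 Choosing a Scripting Language for User Operation Flows
3. WebETL Robot System Architecture
3.1 Design Concept
3.2 Browser Selection
3.3 User Interface Design
3.4 Element Specification Method
3.5 Websites with Special Structures
3.6 Integration with the WebETL System
3.7 Input Value Substitution
4. Experiments and Discussion
4.1 Data Sources
4.2 Crawling Process
4.3 Crawling Results
4.4 Error Analysis
5. Conclusion and Future Work
References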

Advisor

Chia-Hui Chang (張嘉惠)

Approval date

2021-12-22