NCU Institutional Repository (中大機構典藏), offering theses and dissertations, past exam papers, journal articles, and research projects for download: Item 987654321/88321


    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/88321


    Title: 基於網頁瀏覽模擬器之動態爬蟲程式生成研究;Generation of dynamic web crawler via browser simulator - Decoupling of crawling and extraction for WebETL tool construction
    Authors: 廖勳;Liao, Hsun
    Contributors: Executive Master of Computer Science and Information Engineering (資訊工程學系在職專班)
    Keywords: dynamic web page; no code; web scraper
    Date: 2021-12-22
    Issue Date: 2022-07-13 22:46:39 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: The Internet has become not only the main platform for application development but also the primary channel through which people obtain information. Large numbers of web crawlers are built to collect information from the Web so that it can be integrated into value-added information services. According to statistics from the Internet security companies Imperva and Barracuda, half of all Internet traffic comes from bots. To defend against malicious bots, web page architectures have grown increasingly complex, and JavaScript is used to change how pages embed and present data. This poses a considerable challenge for building value-added web application services; for example, page content may be updated dynamically while the URL stays the same. How to crawl websites of this kind is the subject of this thesis.

    To obtain data from dynamic web pages, this research develops a system, implemented as a Chrome extension, that simulates a user's click flow. The extension records the user's clicks and inputs so that the user's browsing operations can be replayed and the page data captured. This lets users scrape web page data without writing code and schedule regular automatic crawls. The work improves the WebETL System by addressing the problem of downloading dynamic pages from highly interactive and single-page websites, achieving data extraction and reuse. In an evaluation on 75 dynamic web pages drawn from cases where automatic pagination detection failed, government website links, and popular websites ranked by Alexa, 70 were crawled successfully, a success rate of 93.33%.
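The record-and-replay idea in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's Chrome extension: `Action`, `FakePage`, `replay`, the selectors `#search` and `#next`, and the example URL are all hypothetical names chosen for illustration. The sketch only shows why replaying recorded clicks and inputs can reach content that a URL-based crawler would miss, since the page content changes while the URL stays fixed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    """One recorded user step (hypothetical format): a click or a text input."""
    kind: str       # "click" or "input"
    selector: str   # CSS selector of the element the user touched
    value: str = "" # typed text, used only for "input" actions


@dataclass
class FakePage:
    """Toy stand-in for a dynamic page: content changes, the URL does not."""
    url: str = "https://example.test/list"
    page_no: int = 1
    query: str = ""

    def apply(self, action: Action) -> None:
        # Simulate the page's JavaScript reacting to a user event.
        if action.kind == "input" and action.selector == "#search":
            self.query = action.value
            self.page_no = 1
        elif action.kind == "click" and action.selector == "#next":
            self.page_no += 1
        else:
            raise ValueError(f"unsupported action: {action}")

    def rows(self) -> List[str]:
        # The data currently rendered on the page (two rows per view).
        return [f"{self.query}-p{self.page_no}-r{i}" for i in (1, 2)]


def replay(actions: List[Action], page: FakePage) -> List[str]:
    """Replay the recorded actions in order, extracting rows after each one."""
    collected: List[str] = []
    for action in actions:
        page.apply(action)
        collected.extend(page.rows())
    return collected


# A recorded session: type a query, then click "next" twice.
recorded = [
    Action("input", "#search", "crawler"),
    Action("click", "#next"),
    Action("click", "#next"),
]
page = FakePage()
rows = replay(recorded, page)
```

Keeping the recorded actions as plain data, separate from the extraction step, mirrors the decoupling of crawling and extraction named in the thesis title: the same recording could be replayed on a schedule without the user writing any code.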
    Appears in Collections:[Executive Master of Computer Science and Information Engineering] Electronic Thesis & Dissertation



    All items in NCUIR are protected by copyright, with all rights reserved.

