蒐集直播串流資訊之自動化爬蟲系統;Automatic Crawling System for Collecting Live Streaming Information


Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/83836

Title:	蒐集直播串流資訊之自動化爬蟲系統;Automatic Crawling System for Collecting Live Streaming Information
Authors:	郭維勳;Kuo, Wei-Xun
Contributors:	Department of Communication Engineering
Keywords:	Dynamic Web Crawler;Live Streaming Crawler;DOM Crawler;AJAX Crawler;Live Streaming Platform Crawler
Date:	2020-07-30
Issue Date:	2020-09-02 17:11:16 (UTC+8)
Publisher:	National Central University
Abstract:	With the development of computer networks and mobile communication technologies, bandwidth is now sufficient to support multimedia applications. People have become accustomed to watching video on 3C (computer, communication, and consumer electronics) products, and the audience for cable and traditional broadcast television has gradually declined. Live broadcasting was once available only from radio and television stations, but it has become a means of spreading information that is readily available to anyone. Since 2016 the live streaming industry has flourished: viewers can interact with streamers in real time from anywhere, and many merchants sell products through live streams, giving rise to the emerging "live-stream e-commerce" industry. Live streaming is growing explosively.
The Wayback Machine preserves hundreds of millions of historical records of webpages worldwide. Many websites shut down because of poor management or other reasons, and most of them can still be found in the Wayback Machine. Newer websites, however, are built with dynamic content technologies, so the Wayback Machine can capture only a small portion of their content. Despite the rise of live streaming, no historical database properly collects information from live streaming platforms, so this study proposes an automated content crawler system for live streaming platforms. Collecting a platform's channel information completely currently requires a crawler engineer to design a dedicated crawler for each platform. The larger the live streaming market becomes, the more new platforms want a share of it: new platforms will keep appearing, and existing platforms will keep updating to improve the user experience. To address these problems, this study designs an automated crawler system for live streaming platform information whose crawlers keep operating automatically as new platforms emerge and existing platforms are redesigned.
The proposed system comprises three types of crawlers: an API crawler, an AJAX crawler, and a DOM crawler. The system selects the most suitable type according to the platform's webpage structure. The API crawler applies when the platform provides an API service; the crawler program is written according to the API documentation, and this part is done manually. The AJAX crawler captures the HTTP requests the platform issues when loading data, then filters them and examines their parameters to obtain the request URL for the dynamic content. The DOM crawler fetches the platform's webpage, converts it into a DOM tree, identifies the repeated live stream blocks, and extracts channel information from those blocks.
Of the three, the API and AJAX crawlers perform best, because each retrieval sends only a lightweight HTTP request. The DOM crawler is the most general, but it must run a browser and operate it to obtain live streaming information, so its performance is the worst. Nevertheless, it can successfully crawl the information of most live streaming platforms.
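The DOM crawler step described in the abstract (convert the page to a DOM tree, find the repeated live stream blocks, extract channel information from them) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say how repeated blocks are judged, so the heuristic here (the largest group of sibling elements sharing the same tag and `class`) is an assumption, as are the names `TreeBuilder`, `find_repeated_blocks`, `extract_channels`, the `min_repeats` threshold, and the sample markup. A real DOM crawler would take its HTML from a browser-rendered page rather than a static string.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text = []  # text directly inside this element

    def signature(self):
        # Assumed similarity key: siblings with the same tag and class
        # attribute are treated as instances of the same block.
        return (self.tag, self.attrs.get("class") or "")


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def iter_nodes(node):
    """Yield node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_repeated_blocks(root, min_repeats=3):
    """Return the largest group of same-signature siblings in the tree."""
    best = []
    for parent in iter_nodes(root):
        counts = Counter(child.signature() for child in parent.children)
        if not counts:
            continue
        sig, n = counts.most_common(1)[0]
        if n >= min_repeats and n > len(best):
            best = [c for c in parent.children if c.signature() == sig]
    return best


def extract_channel(block):
    """Collect the first link and all text fragments found inside one block."""
    link = next((n.attrs["href"] for n in iter_nodes(block)
                 if n.tag == "a" and n.attrs.get("href")), None)
    text = [t for n in iter_nodes(block) for t in n.text]
    return {"link": link, "text": text}


def extract_channels(html, min_repeats=3):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    blocks = find_repeated_blocks(builder.root, min_repeats)
    return [extract_channel(b) for b in blocks]


# Hypothetical markup standing in for a rendered channel listing page.
SAMPLE_HTML = """
<html><body>
  <nav><a href="/">Home</a><a href="/about">About</a></nav>
  <div id="list">
    <div class="card"><a href="/alice"><span class="title">Alice plays chess</span><span class="viewers">120</span></a></div>
    <div class="card"><a href="/bob"><span class="title">Bob cooks</span><span class="viewers">45</span></a></div>
    <div class="card"><a href="/carol"><span class="title">Carol sings</span><span class="viewers">300</span></a></div>
    <div class="ad">Sponsored</div>
  </div>
</body></html>
"""

if __name__ == "__main__":
    for channel in extract_channels(SAMPLE_HTML):
        print(channel)
```

In this sample the three `card` elements form the largest repeated group, so the two navigation links and the single `ad` block are ignored. The sketch does not need to know any platform-specific selector in advance, which is the property that lets a DOM-based approach generalize across platforms, at the cost of rendering the full page first.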
Appears in Collections:	[Graduate Institute of Communication Engineering] Electronic Thesis & Dissertation

Files in This Item:

File	Description	Size	Format
index.html		0Kb	HTML	254
