dc.description.abstract | With advances in computer networking and radio access technologies, bandwidth is now sufficient to support multimedia applications. People are accustomed to using 3C products to watch video and access media, and the market for cable and traditional TV has gradually declined. Live broadcasts were once available only through radio or TV, but with the development of technology, live streaming has become one of the ways anyone can spread information.
Since 2016, the live streaming industry has flourished. Wherever they are, viewers can interact with a live streaming host in real time through a live streaming platform. Many merchants sell products through live streaming, which has given rise to the emerging industry of "e-commerce over live streaming". Live streaming is showing explosive growth.
The “Wayback Machine” keeps hundreds of millions of historical records of webpages worldwide. Many webpages close due to poor management or other reasons. As web technology has developed, however, most websites now rely on dynamic content, so the “Wayback Machine” can capture only a small part of their content.
Despite the popularity of live streaming, no historical database collects the information on live streaming platforms completely, so this study proposes an automated content crawler system for live streaming platforms. To collect the channel information of a platform completely, a crawler engineer must normally write a dedicated crawler program for each platform. The growing economic market of the live streaming industry attracts new platforms that want a share of it: new platforms appear all the time, and existing platforms are updated constantly to improve the user experience. To address these problems, this study designs an automated information crawler system for live streaming platforms that can adapt the crawler program automatically to new platforms and to revisions of existing ones.
The automated crawler system proposed in this study is divided into three types of crawlers: the API crawler, the AJAX crawler, and the DOM crawler. The system selects the most suitable type according to the webpage structure of the platform. The API crawler applies only when the live streaming platform provides API services; the crawler program is then written according to the API documentation, and this part is done manually. The AJAX crawler captures the HTTP requests that the live streaming platform issues to load its data, then filters them and analyzes their parameters to obtain the request URL for the dynamic content. The DOM crawler loads the webpage of the live streaming platform, converts it into a DOM tree, identifies the repeated live streaming blocks, and then extracts the live streaming channel information from those blocks.
The API crawler and the AJAX crawler perform best, because each data retrieval sends only a lightweight HTTP request. The DOM crawler is the most versatile: it must launch a browser and obtain the live streaming information through it, so its performance is the worst, but it can successfully crawl the information of most live streaming platforms. | en_US |
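The DOM crawler step described in the abstract (build a DOM tree, find the repeated blocks, extract channel information) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the system's actual implementation: it parses static HTML with Python's standard `html.parser` instead of driving a browser, it treats sibling elements that share a tag and `class` attribute as "repeated blocks", and the names `Node`, `TreeBuilder`, `find_repeated_blocks`, `extract_channel`, `crawl_dom`, and the sample markup are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have children (assumed subset, enough for this sketch).
VOID_TAGS = {"img", "br", "input", "meta", "link", "hr"}


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text = ""

    def signature(self):
        # Heuristic (assumption): siblings with equal tag and class are "the same block".
        return (self.tag, self.attrs.get("class", ""))

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML text."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then step out of it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def find_repeated_blocks(root, min_repeat=3):
    """Return the largest group of same-signature siblings in the tree."""
    best = []
    for parent in root.walk():
        counts = Counter(child.signature() for child in parent.children)
        if not counts:
            continue
        sig, n = counts.most_common(1)[0]
        if n >= min_repeat and n > len(best):
            best = [c for c in parent.children if c.signature() == sig]
    return best


def extract_channel(block):
    """Take the first link and all visible text inside one block."""
    url, texts = None, []
    for node in block.walk():
        if node.tag == "a" and url is None:
            url = node.attrs.get("href")
        if node.text:
            texts.append(node.text)
    return {"url": url, "title": " ".join(texts)}


def crawl_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    return [extract_channel(b) for b in find_repeated_blocks(builder.root)]


# Hypothetical listing page used only to exercise the sketch.
SAMPLE_HTML = """
<html><body>
<nav><a href="/home">Home</a><a href="/about">About</a></nav>
<div id="list">
  <div class="card"><a href="/ch/1"><span>Alice</span></a></div>
  <div class="card"><a href="/ch/2"><span>Bob</span></a></div>
  <div class="card"><a href="/ch/3"><span>Carol</span></a></div>
</div>
</body></html>
"""
```

Calling `crawl_dom(SAMPLE_HTML)` returns one dictionary per `card` block, each with the block's first link and its text. A real platform page would need the rendered DOM from a browser, as the abstract notes, and probably a more robust similarity measure than tag plus class.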