

    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/71139


    Title: GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources (地理網路爬蟲:具擴充及擴展性之地理網路資源爬行架構)
    Authors: Chang, Hao (張皓)
    Contributors: Department of Civil Engineering
    Keywords: Geospatial Web; Resource discovery; Web crawler; Open Geospatial Consortium
    Date: 2016-07-26
    Issue Date: 2016-10-13 12:08:32 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: With the advance of World-Wide Web (WWW) technology, people can easily share content on the Web, including geospatial data and web services. In the Web 1.0 era, users could only passively receive information published by a few organizations and administrators; in the Web 2.0 era, every user can publish data and web services, becoming a data provider. As geospatial resources are published at an ever-increasing speed, "big geospatial data management" issues have started to attract attention: geospatial data exhibits the three Vs of big data, namely Volume, Velocity, and Variety. Among these issues, this research focuses on discovering distributed geospatial resources. Because resources are scattered across the globally distributed WWW (i.e., the Geospatial Web, or GeoWeb), users face difficulties in finding the resources they need. Just as the WWW has Web search engines to address web resource discovery, we envision that the GeoWeb also requires GeoWeb search engines so that users can efficiently find GeoWeb resources.
    To realize a GeoWeb search engine, one of the first steps is to proactively discover GeoWeb resources on the WWW. Hence, in this study, we propose the GeoWeb Crawler, an extensible Web crawling framework that can find various types of GeoWeb resources, such as Open Geospatial Consortium (OGC) web services, including the Sensor Observation Service (SOS), Web Map Service (WMS), Web Map Tile Service (WMTS), Web Feature Service (WFS), Web Coverage Service (WCS), Web Processing Service (WPS), and Catalogue Service for the Web (CSW), as well as Keyhole Markup Language (KML) files and ESRI Shapefiles. In addition, to improve the performance of the GeoWeb Crawler, we apply the distributed-computing concept so that the framework scales horizontally with ease; crawling with 8 machines yielded roughly a 13-fold performance improvement. Furthermore, while regular web crawlers are designed to discover resources by following hyperlinks, the GeoWeb Crawler supports customized connectors to find resources hidden behind open or proprietary web services. For 10 targeted open-standard-based resource types and 3 non-open-standard-based resource types, the GeoWeb Crawler discovered 7,351 geospatial services and 194,003 datasets, which is 3.8 to 47.5 times more than what users can find with existing approaches. The crawling-level distribution of the discovered resources indicates that using Google search results as crawling seeds is indeed effective for discovering geospatial resources; however, the deeper we crawl, the less the additional effort pays off. Based on the proposed solution, we built a GeoWeb search engine prototype, GeoHub. According to the experimental results, the proposed GeoWeb Crawler framework is extensible and scalable enough to provide a comprehensive index of the GeoWeb.
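    The abstract notes that the crawler identifies OGC web services among ordinary web pages. The thesis text here does not include the implementation, but a minimal sketch of one plausible building block is a classifier that inspects a candidate endpoint's GetCapabilities response and guesses the service type from the XML root element. All function and variable names below are illustrative assumptions, not the thesis's actual code:

    ```python
    import xml.etree.ElementTree as ET

    # Root-element local names that OGC GetCapabilities (and KML) documents
    # commonly use; OWS-Common-based services (WMTS, WCS, WPS, SOS, CSW)
    # often share a generic "Capabilities" root and need namespace checks
    # in a fuller implementation.
    _OGC_ROOT_TAGS = {
        "WMS_Capabilities": "WMS",
        "WMT_MS_Capabilities": "WMS",  # legacy WMS 1.1.x root element
        "WFS_Capabilities": "WFS",
        "Capabilities": "OWS-Common service (WMTS/WCS/WPS/SOS/CSW)",
        "kml": "KML",
    }

    def classify_capabilities(xml_text):
        """Guess the OGC service type of a fetched document by inspecting
        the root element's local name; return None for non-XML content."""
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            return None
        local_name = root.tag.rsplit("}", 1)[-1]  # strip the XML namespace
        return _OGC_ROOT_TAGS.get(local_name)
    ```

    In use, a crawler would append a query string such as `?service=WMS&request=GetCapabilities` to each candidate URL, fetch the response, and run a classifier like this on the body; distinguishing the OWS-Common-based services would additionally require checking the root element's namespace or service metadata.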
    Appears in Collections: [Graduate Institute of Civil Engineering] Master's and Doctoral Theses


    All items in NCUIR are protected by copyright, with all rights reserved.
