基於Web之商家景點擷取與資料庫建置;Points of Interest Extraction from Unstructured Web


Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/68852

Title:	基於Web之商家景點擷取與資料庫建置;Points of Interest Extraction from Unstructured Web
Authors:	高霆耀;Kao,Ting-yao
Contributors:	Department of Computer Science and Information Engineering
Keywords:	electronic map; web crawler; information extraction; POI database
Date:	2015-07-29
Issue Date:	2015-09-23 14:45:05 (UTC+8)
Publisher:	National Central University
Abstract:	With the spread of mobile devices, local search has become a popular new service. To be complete, a local search service must let users accurately find nearby points of interest (POIs) such as restaurants, hotels, bus stops, karaoke parlors, libraries, and pharmacies, which requires a comprehensive POI database. As the Internet has grown, people have begun sharing travel experiences and information about the POIs they have visited on blogs, social networks, and check-in posts, and more and more businesses and organizations describe themselves in detail on their own official websites. As such pages multiply, the Web has become the largest source of POI information. In this thesis, we propose a system that builds a POI database from Web data. The system has two parts. The first is a query-based crawler that collects address-bearing pages (ABPs) from the Web, since an address is a good indicator of a POI; these pages contain many POIs along with descriptive information that can be used for retrieval. The second is a POI extraction system. Using conditional random fields (CRFs) as the learning algorithm, we train a Chinese postal address recognition model and a Chinese organization name recognition model. The extraction system applies these two models to find every address and organization name in an ABP, pairs each address with an organization name to form a POI, and finally extracts the associated information for each POI to build a complete POI record.
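The pairing step named in the abstract can be illustrated with a minimal sketch. The abstract does not say how addresses and names are matched, so the nearest-by-character-offset rule below is an assumption made purely for illustration, not the thesis's method. The `Mention` type, the `pair_pois` function, and the sample data are all hypothetical, and the mentions stand in for the output of the two CRF recognizers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A recognized span, as the two recognizers might emit it (hypothetical type)."""
    kind: str    # "ORG" for an organization name, "ADDR" for an address
    text: str
    offset: int  # character offset of the span within the page text


def pair_pois(mentions):
    """Pair each address with the organization name closest to it by offset.

    Assumption: proximity in the page text signals that a name and an
    address belong together. The thesis may use a different rule.
    """
    orgs = [m for m in mentions if m.kind == "ORG"]
    pois = []
    for addr in (m for m in mentions if m.kind == "ADDR"):
        if not orgs:
            break  # no candidate names on this page
        nearest = min(orgs, key=lambda o: abs(o.offset - addr.offset))
        pois.append((nearest.text, addr.text))
    return pois


# Made-up mentions standing in for recognizer output on one page.
mentions = [
    Mention("ORG", "Cafe A", 10),
    Mention("ADDR", "No. 1, Example Rd.", 30),
    Mention("ORG", "Bookstore B", 200),
    Mention("ADDR", "No. 99, Sample St.", 230),
]
pois = pair_pois(mentions)
```

Each address is paired independently, so one organization name may be matched to several addresses; whether that is desirable (for example, a chain with several branches) depends on the page layout.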
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

