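Abstract: | As smart mobile devices spread rapidly, services for looking up POI (Point of Interest) information such as stores and places have become an everyday need, and providing them requires a massive POI database. Over time, the POI data in such a database is no longer guaranteed to be current, and a user who receives wrong information wastes valuable time. Keeping the POI database up to date has therefore become a key issue. We aim to update the database continuously and identify POIs that have ceased operation, so that correct POI information can be provided. A POI database sourced from the yellow pages is too large to verify and update manually in an effective way, whereas the government has a large amount of open data that is jointly maintained by many businesses. Among these data, 「全國營業(稅籍)登記資料集」 (the national business (tax) registration dataset) and 「公司解散登記清冊」 (the register of dissolved companies) are usable for our purpose. However, the open datasets use a data format different from that of a typical POI database and must be handled with care. In addition, the web offers abundant data: information such as web page update dates and the volume of mentions online can be used to train a verification model that detects potentially outdated POIs in the database. In this thesis, the goal of our system is to detect outdated POIs in the database within a feasible time. The method has two parts. The first uses government open data: POIs shared by the POI database and the open data are identified so that their status can be updated directly. The second uses web information to train a POI expiry verification model that detects outdated POIs in the database. Experimental results show that using Google Maps information, the time since the most recent news of the POI, whether the POI still appears on the official website, and words describing an outdated POI achieves an F-measure of 0.758; feature combination raises the F-measure to 0.91, an improvement of 0.201 over the model of Chuang et al. ;With the increasing use of mobile phones, searching for POI (Point of Interest) information, such as stores and addresses, has become part of people's daily life. Providing such services requires a massive POI database. However, the POI information in such a database may change as time passes, and it is annoying for users to get wrong information. How to keep the POI database up to date by continuously identifying outdated POIs and updating the database has become a key issue. As the POI database grows, it becomes difficult to verify the data manually in an effective way. The government, however, publishes open data on businesses, e.g. “全國營業(稅籍)登記資料集” and “公司解散登記清冊”. These data should be used carefully, since the data format of the open datasets differs from that of the general POI databases used in many services. On the other hand, rich information is available on the web. Using web information such as the date on which a web page was updated and the volume of mentions of a POI on the web, we can train a verification model to detect POIs in the database that may be outdated. In this paper, our goal is to detect outdated POIs in the database within a feasible time. The approach is divided into two parts. The first part uses open government data.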
The second part uses web information to train a model that detects outdated POIs in the database. Experiments show that our approach achieves an F-measure of 0.758 using Google Maps information, the time elapsed between today and the most recent publishing date, whether the POI appears on the official website, and words describing outdated POIs. The best performance, obtained by feature combination, reaches an F-measure of 0.91, which is 0.201 higher than that of Chuang et al. |