Abstract (English) |
With the growing popularity of mobile devices, local search has become a popular service. A complete local search service, however, has to provide users with nearby POIs (points of interest) such as stores, gas stations, parking lots, bus stops, and drugstores, and it therefore needs a comprehensive POI database behind it. In recent years, the Web has become the largest source of POI data. People share their travel experiences and information about the POIs they have visited on social networks, on blogs, and in check-in posts, and many companies and organizations publish their business information on their own websites. These webpages contain a large number of POIs.
In this paper, we propose a POI database construction system that draws on the immense volume of data on the Web. The system consists of two parts: a query-based crawler and a POI extraction system. The query-based crawler collects address-bearing pages (ABPs) from the Web, since an address is a good indicator of a POI. For the POI extraction system, we use conditional random fields (CRFs) to train a Chinese postal address recognition model and a Chinese organization name recognition model. The POI extraction system applies these two CRF models to extract addresses and POI names from the ABPs and pairs each address with a POI name to form a POI. Finally, it extracts the information associated with each POI to build a complete POI record.
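The pairing step can be illustrated with a minimal sketch. It assumes the two recognizers have already produced labeled spans with character offsets, and it pairs each address with the nearest name span. The abstract does not state the actual pairing rule, so the nearest-offset heuristic, the `Span` type, and the `pair_pois` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    kind: str   # "ADDR" or "NAME" (assumed output labels of the two recognizers)
    start: int  # character offset of the span within the page text
    text: str


def pair_pois(spans):
    """Pair each address span with the closest name span by character offset.

    Assumption: proximity on the page is used as the pairing signal; the
    system described in the abstract may use a different criterion.
    """
    names = [s for s in spans if s.kind == "NAME"]
    pairs = []
    for addr in (s for s in spans if s.kind == "ADDR"):
        if not names:
            break  # no candidate names on this page
        nearest = min(names, key=lambda n: abs(n.start - addr.start))
        pairs.append((nearest.text, addr.text))
    return pairs


# Toy input standing in for recognizer output on one address-bearing page.
spans = [
    Span("NAME", 0, "Store A"),
    Span("ADDR", 12, "No. 1, First Rd."),
    Span("NAME", 200, "Store B"),
    Span("ADDR", 215, "No. 2, Second Rd."),
]
pairs = pair_pois(spans)
```

Each resulting pair is a (name, address) tuple that a later step could enrich with associated information such as phone numbers or opening hours.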
|
References |
[1] D. Ahlers, Business entity retrieval and data provision for yellow pages by local search. Integrating IR technologies for professional search, ECIR, 2013.
[2] D. Ahlers and S. Boll, Location-based Web search. The Geospatial Web, 55-66, Springer, 2007.
[3] S. Chakrabarti, M. Van den Berg and B. Dom, Focused crawling: a new approach to topic specific Web resource discovery, WWW, 1999.
[4] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27:1-27:27, 2011.
[5] C.-H. Chang, C.-Y. Huang and Y.-S. Su, Chinese Postal Address and Associated Information Extraction, The 26th Annual Conference of the Japanese Society for Artificial Intelligence, 2012.
[6] J. Cho and H. Garcia-Molina, The evolution of the Web and implications for an incremental crawler, VLDB ’00 Proceedings of the 26th International Conference on Very Large Data Bases, 200-209, 2000.
[7] H.-M. Chuang, C.-H. Chang, Verification of POI and Location Pairs via Weakly Labeled Data. WWW 2015 Workshop, May 18–22, 2015.
[8] J. Foley, M. Bendersky and V. Josifovski, Learning to extract local events from the Web, SIGIR, Chile, August 9-13, 2015.
[9] Y. He, D. Xin, V. Ganti, S. Rajaraman and N. Shah, Crawling deep Web entity pages, International Conference on Web Search and Data Mining, 2013.
[10] Y.-Y. Huang, C.-L. Chou, C.-H. Chang, Web NER Model Generator Tool based on Google Snippets, submitted for publication, 2015.
[11] C. B. Jones and R. S. Purves, Geographical information retrieval, International Journal of Geographical Information Science, 22(3), 219–228, 2008.
[12] Y.-Y. Lin, C.-H. Chang, 網頁商家名稱擷取與地址配對之研究 (Store Name Extraction and Name-Address Matching on the Web) [In Chinese]. ROCLING, 2014.
[13] Y. Ling, J. Yang and L. He, Chinese organization name recognition based on multiple features, in Pacific Asia conference on Intelligence and Security Informatics, 136-144, 2012.
[14] M. Najork and J. L. Wiener, Breadth-first crawling yields high-quality pages, Proceedings of the 10th international conference on World Wide Web, 114-118, 2001.
[15] C. Olston and M. Najork, Web crawling. Foundations and Trends in Information Retrieval, 4(3), 175-246, 2010.
[16] M. Sanderson and J. Kohler, Analyzing geographic queries, in Workshop on Geographic Information Retrieval (SIGIR), Sheffield, UK, 2004.
[17] P. Serdyukov, V. Murdock and R. van Zwol, Placing Flickr photos on a map. SIGIR, MA, USA, 2009.
[18] V. Shkapenyuk and T. Suel, Design and implementation of a high-performance distributed Web crawler, Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, USA, 2002.
[19] A. Popescu and A. Shabou, Towards precise POI localization with social media. ACM Multimedia Conference, Catalunya, Spain, Oct. 21-25, 2013.
[20] S. Zhang and X. Wang, Automatic Recognition of Chinese Organization Name Based on Conditional Random Fields, Natural Language Processing and Knowledge Engineering, Sheffield, 229-233, 2007.
|