Store Name Extraction and Name-Address Matching for Geographic Information Retrieval

Name

Yu-Yang Lin (林育暘)

Department

Department of Computer Science and Information Engineering

Thesis Title

Store Name Extraction and Name-Address Matching for Geographic Information Retrieval
(Original Chinese title: 以商家名稱萃取與地址配對協助地理資訊檢索之研究)

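Related Theses

★ Recognition of Itinerary Invitation Emails and Extraction of Irregular Time Expressions
★ Design of the NCUFree Campus Wireless Network Platform and Development of Application Services
★ Design and Implementation of a Semi-Structured Web Data Extraction System
★ Mining and Applications of Non-Simple Browsing Paths
★ Improvements to Association Rule Mining on Incremental Data
★ Applying the Chi-Square Test of Independence to Associative Classification
★ Design and Study of a Chinese Information Extraction System
★ Visualization of Non-Numeric Data and Clustering That Combines Subjective and Objective Criteria
★ A Study of Associated Word Groups in Document Summarization
★ Cleaning Web Pages: Page Segmentation and Data Region Extraction
★ Design and Study of a Question Answering System Using Sentence Classification and Ranking
★ Efficient Mining of Compact Frequent Consecutive Event Patterns in Temporal Databases
★ Axis Arrangement in Star Coordinates for Cluster Visualization
★ Automatic Generation of Web Crawling Programs from Browsing History
★ Template and Data Analysis of Dynamic Web Pages
★ Automating Data Integration across Homogeneous Web Pages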

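This electronic thesis is authorized for immediate open access.
The open-access electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.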

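Abstract (Chinese)

Mobility is one of the trends of 2014. According to an IDC survey report, tablet shipments exceeded those of personal computers for the first time in Q4 2013, while smartphones, in both shipments and market share, had long since far surpassed all other devices combined. Location-based services (LBS) play a crucial role in this trend: as devices go mobile, a large volume of query needs arises, such as route navigation and searching for nearby restaurants or gas stations. To serve users broadly, a location-based service usually needs a complete POI (Point of Interest) database, and the Web as a whole is the largest source of such information. The data originate from website administrators, crowdsourcing, or information shared by individual users, and include addresses, names, phone numbers, and reviews. Although various methods now exist for extracting address-related information, they often cannot obtain an unambiguous POI name, which severely limits information retrieval.
In this thesis, we propose a method for store name recognition. By collecting Web pages that contain addresses and recognizing the named entities in them, we build a database that associates store names with addresses. This improves retrieval of address-related information, so that users on mobile devices can enter a store name or keyword directly to look up an address. For store named entity recognition, the thesis identifies naming characteristics shared by store and organization names and adds them as features to a CRF model, supplementing the N-gram and part-of-speech features.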
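The feature design described above can be illustrated with a minimal sketch. It assumes word-segmented, POS-tagged input and builds one feature dictionary per token, the kind of layout that CRF toolkits such as sklearn-crfsuite accept. The abstract does not say which naming characteristics the thesis uses, so the suffix lexicon `STORE_SUFFIXES`, the feature names, and the POS tags in the example are hypothetical stand-ins, not the author's feature set.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon of endings that often close a store or organization
# name. The thesis abstract does not list its actual naming cues.
STORE_SUFFIXES = ("餐廳", "咖啡", "公司", "商行", "診所", "店")

Token = Tuple[str, str]  # (word, part-of-speech tag)


def token_features(tokens: List[Token], i: int) -> Dict[str, object]:
    """Build the feature dictionary for the token at position i."""
    word, pos = tokens[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "word": word,  # unigram feature
        "pos": pos,    # part-of-speech feature
        # Assumed "common naming characteristic": a store-like suffix.
        "ends_with_store_suffix": word.endswith(STORE_SUFFIXES),
    }
    if i > 0:
        prev_word, prev_pos = tokens[i - 1]
        feats["prev_word"] = prev_word
        feats["prev_pos"] = prev_pos
        feats["bigram"] = prev_word + "|" + word  # word bigram feature
    else:
        feats["BOS"] = True  # beginning of sequence
    if i < len(tokens) - 1:
        feats["next_word"] = tokens[i + 1][0]
    else:
        feats["EOS"] = True  # end of sequence
    return feats


def sentence_features(tokens: List[Token]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token in one sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    example = [("鼎泰豐", "Nb"), ("餐廳", "Na"), ("位於", "VCL")]
    for feats in sentence_features(example):
        print(feats)
```

Keeping the naming cue as a separate boolean feature makes it easy to switch on and off when measuring how much it adds beyond the N-gram and part-of-speech features.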

Abstract (English)

Mobile devices are the trend of 2014. According to an IDC report, unit shipments of tablets exceeded those of PCs for the first time in Q4 2013. Smartphones have already surpassed other devices in both unit shipments and market share. LBS (Location-based Service) plays an important role in this trend. Because devices are mobile, many query demands have arisen, for example navigation and searching for restaurants or gas stations. An LBS usually needs a POI (Point of Interest) database to support it. The Web is the largest data source; these data come from website managers, crowdsourcing, and information shared by individuals, and include addresses, names, phone numbers, and comments. Many methods now exist for extracting address-associated information, but they usually face the challenge of extracting the POI name, which limits information retrieval.
Our system can be divided into three parts: Taiwan address normalization, Store Named Entity Recognition (Store NER), and Address-StoreNE matching. With the system, users can search for store names on a mobile device and immediately get information such as the address, telephone number, and comments. For Store NER, our research proposes characteristics common to store and organization names. We add these characteristics as features to the CRF model to improve the recognition results.
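As a rough illustration of how the normalization and matching parts could fit together, the sketch below normalizes an address string and pairs each address with the nearest recognized store name by character offset. Both steps are simplifying assumptions: the normalization rules (NFKC folding, whitespace removal, writing 台 as 臺) and the nearest-offset heuristic are hypothetical and are not the page-type-specific algorithms the thesis develops. The NER step is assumed to have already produced the name spans.

```python
import unicodedata
from typing import List, Optional, Tuple

Span = Tuple[int, str]  # (character offset in the page text, surface string)


def normalize_address(raw: str) -> str:
    """Apply a few assumed normalization rules to a Taiwan address."""
    # NFKC folds full-width digits and spaces into their ASCII forms.
    text = unicodedata.normalize("NFKC", raw)
    # Drop all whitespace inside the address.
    text = "".join(text.split())
    # Assumed convention: write the variant character 台 as 臺.
    return text.replace("台", "臺")


def match_addresses(
    names: List[Span], addresses: List[Span]
) -> List[Tuple[str, Optional[str]]]:
    """Pair each address with the store name closest to it on the page."""
    pairs: List[Tuple[str, Optional[str]]] = []
    for addr_pos, addr in addresses:
        # Nearest name by absolute offset distance; None when no name exists.
        nearest = min(names, key=lambda n: abs(n[0] - addr_pos), default=None)
        name = nearest[1] if nearest is not None else None
        pairs.append((normalize_address(addr), name))
    return pairs


if __name__ == "__main__":
    names = [(0, "A咖啡"), (50, "B書店")]
    addresses = [(8, "台北市１號"), (60, "台中市２號")]
    for address, name in match_addresses(names, addresses):
        print(address, name)
```

A single distance heuristic like this would likely fail on list-style pages where many names and addresses interleave, which is one reason to treat page types separately.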

Keywords (Chinese)

★ Conditional random fields
★ Natural language processing
★ Geographic information retrieval

Keywords (English)

★ CRF
★ NLP
★ GIR

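Table of Contents

Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
1.1 Research Motivation
1.2 Research Background
1.3 Chapter Overview
2. Related Work
2.1 Crawlers and Geographic Information Retrieval
2.2 Extraction of Addresses and Associated Information
2.3 Chinese Organization Named Entity Recognition
3. Store Name Extraction and Address Matching
3.1 Store Named Entity Recognition
3.1.1 Automatic Labeling and Preprocessing
3.1.2 Training Data Preparation
3.2 Address-StoreNE Match
3.3 Taiwan Address Normalization
4. Experiments
4.1 Store Name Recognition Rate
4.2 Address-Store Name Matching Accuracy
5. Conclusion and Future Work
6. References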

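List of Figures
Figure 1 Device shipments and forecasts published by IDC in 2013
Figure 2 Automatic labeling workflow for address-bearing pages
Figure 3 Automatic labeling workflow for search result snippets
Figure 4 Four categories of Web pages: (top left) natural-language page example, (top right) footer-information page example, (bottom left) list page example, (bottom right) detail page example
Figure 5 Matching algorithm for natural-language pages and footer-information pages
Figure 6 Matching example for a detail page
Figure 7 Algorithm that uses TEX to find unrecognized store names on detail pages
Figure 8 Matching algorithm for detail pages
Figure 9 Matching algorithm for list pages
Figure 10 Effect of training data size on each performance metric for address-bearing pages
Figure 11 Improvement in NER on address-bearing pages from Full Labeling
Figure 12 Effect of training data size on each performance metric for address-bearing pages
Figure 13 Effect of each feature on performance for address-bearing pages (UniLabeling & FullLabeling)
Figure 14 Effect of noise and training data size on performance, with snippets as the data source
Figure 15 Example of merging the outputs of multiple models
Figure 16 Store name recognition results of the system (training samples: 4,398)
Figure 17 Effect of NER on Match across SnippetFullLabeling models trained with different amounts of data
Figure 18 Matching accuracy of the 3-Model on full Web pages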

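List of Tables
Table 1 Device shipments, market share, and forecasts published by IDC in September 2013
Table 2 Example sequence using Start/End labeling
Table 3 Original features used in this study
Table 4 Training corpus and test data with address-bearing pages as the data source
Table 5 Training and test data with Search Result Snippets as the data source
Table 6 Detailed statistics of the ABP (address-bearing pages) training data
Table 7 Performance of address-bearing-page training data on the ABR test data (with 30000 training examples and full labeling)
Table 8 Detailed statistics of the Search Result Snippets training data
Table 9 NER cross-testing
Table 10 Match cross-testing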

Advisor

Chia-Hui Chang (張嘉惠)

Approval Date

2014-8-21
