On Large-Scale Multi-Label Classification for POI Tagging

Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/74784

Title:	On Large-Scale Multi-Label Classification for POI Tagging
Authors:	楊鎧謙;Yang, Kai-Qian
Contributors:	Department of Computer Science and Information Engineering
Keywords:	Machine Learning;Multi-Label Classification;Unbalanced Data;Point of Interest
Date:	2017-08-24
Issue Date:	2017-10-27 14:39:15 (UTC+8)
Publisher:	National Central University
Abstract:	Smart handheld devices have spread rapidly in recent years, to the point where nearly everyone carries one. Advances in transportation also mean that people travel more often and therefore visit unfamiliar places more often. Finding a point of interest (POI) in an unfamiliar environment is difficult, so users need an electronic map system to query. Name search alone is not enough, because users may not know the exact name of a point; they may only want to find points of a particular type. A good electronic map therefore needs to offer category search, which requires classifying every point in the system. Because the system holds many points and each point belongs to one or more categories, this is a large-scale multi-label classification problem. Map data can be categorized in many ways; we adopt the categories defined by Chinese Yellow Pages (中華黃頁), which has two levels: 29 categories at level 1 and 1,287 at level 2. With this many categories and data points, the usual approach requires training many classifiers, which greatly increases training and testing time. We reduce the dimensionality of the label space to speed up both. Experiments show that a hybrid KDE+SVM model trains and tests almost twice as fast as standard SVM classification. It reaches a Micro-F1 of 0.813 on the 29 level-1 categories and 0.718 on the level-2 categories, only slightly below the SVM's 0.842 at level 1 and 0.783 at level 2. Because the data are imbalanced, we also compared Reweighting and Downsampling as ways to improve performance, but the results show that neither method has much effect on large-scale data.
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

