dc.description.abstract | The class imbalance problem is an important issue in data mining. It occurs when the number of samples representing one class is much smaller than the numbers representing the other classes. A classification model built from an imbalanced dataset is likely to misclassify most minority-class samples as the majority class, because training maximizes the overall accuracy rate. The problem arises in many real-world applications, such as fault diagnosis, medical diagnosis, and face recognition.
One of the most popular types of solution is data sampling: for example, under-sampling the majority class or over-sampling the minority class to balance the dataset. Under-sampling balances the class distribution by eliminating majority-class samples, but it may discard useful data. Conversely, over-sampling replicates minority-class samples, but it can increase the likelihood of overfitting.
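As a minimal sketch (not the thesis's own implementation), the two baseline strategies described above can be expressed as follows; the function names and the use of a fixed random seed are illustrative assumptions:

```python
import random

def random_undersample(majority, minority, seed=0):
    """Balance classes by randomly discarding majority-class samples
    until both classes have the same size (may lose useful data)."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))
    return kept, minority

def random_oversample(majority, minority, seed=0):
    """Balance classes by randomly replicating minority-class samples
    until both classes have the same size (risks overfitting)."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra
```

Note the trade-off the abstract mentions is visible in the code: under-sampling drops majority samples outright, while over-sampling only duplicates existing minority samples rather than adding new information.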
Therefore, we propose several resampling methods based on the k-means clustering technique. To decrease the probability of uneven resampling, we select representative samples to replace the majority-class samples in the training dataset.
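One way to sketch this idea, under the assumption that the number of clusters k is set to the minority-class size and that each cluster center is represented by its nearest real majority-class sample (the thesis's exact settings may differ), is:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means: returns the k cluster centers of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def cluster_undersample(majority, minority, seed=0):
    """Replace the majority class with k representative samples:
    the real majority samples nearest to the k-means cluster centers,
    where k equals the minority-class size."""
    k = len(minority)
    centers = kmeans(majority, k, seed=seed)
    # for each center, pick the closest actual majority sample
    d = np.linalg.norm(majority[:, None, :] - centers[None, :, :], axis=2)
    reps = majority[d.argmin(axis=0)]
    return reps, minority
```

Because the representatives are chosen per cluster, they tend to cover the majority class more evenly than random under-sampling, which is the motivation stated above.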
Our experiments use 44 small-scale class-imbalanced datasets and two large-scale datasets to build five types of classification models: C4.5, SVM, MLP, k-NN (k=5), and Naïve Bayes. In addition, a classifier ensemble algorithm is also employed. We compare AUC results across different resampling techniques, different models, and different numbers of clusters. We also divide the imbalance ratio into three intervals. We aim to find the best configuration of our experiments and compare it with methods from the literature. The experimental results show that combining the MLP classifier with clustering-based under-sampling using the nearest neighbors of the cluster centers performs best in terms of AUC over both the small-scale and large-scale datasets. | en_US |
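For reference, the AUC metric used throughout the evaluation can be computed directly from classifier scores as the probability that a random positive sample is ranked above a random negative one (the Mann-Whitney formulation); this sketch is illustrative, not the thesis's evaluation code:

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a positive-class score exceeds a
    negative-class score, counting ties as 0.5 (Mann-Whitney U / (n_pos * n_neg))."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

Unlike the accuracy rate criticized above, this measure is insensitive to the class ratio, which is why it is the natural choice for imbalanced datasets.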