    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/72058


    Title: Clustering-Based Under-sampling in Class Imbalanced Data (分群式取樣法於類別不平衡問題之研究)
    Author: Jhang, Jing-Shang (張景翔)
    Contributor: Department of Information Management
    Keywords: class imbalance; data mining; classification; clustering
    Date: 2016-07-01
    Upload time: 2016-10-13 14:23:59 (UTC+8)
    Publisher: National Central University
    Abstract: The class imbalance problem is an important issue in data mining. It occurs when the number of samples representing one class in the training set is much smaller than that of the other classes; a classification model built from such data tends to misclassify minority-class samples into the majority class in pursuit of a high overall accuracy rate. The problem is increasingly common in real-world applications such as medical diagnosis, fault detection, and face recognition.
      Solutions fall into data-level, algorithm-level, and cost-sensitive approaches, of which data-level preprocessing is the most common: the class distribution is balanced either by under-sampling the majority class or by over-sampling the minority class. Both have drawbacks. Under-sampling may discard useful data; over-sampling may introduce noisy samples, raises the cost of training the classifier because of the added samples, and increases the likelihood of overfitting.
      This thesis proposes several resampling methods based on the k-means clustering algorithm that preprocess the majority-class samples in the training set. Clustering is used to select representative samples to replace the original majority-class data, balancing the number of samples across classes while reducing the probability of uneven sampling.
      The experiments use 44 small class-imbalanced datasets and two large-scale datasets with five classifiers (C4.5, SVM, MLP, k-NN with k=5, and Naïve Bayes), combined with classifier ensemble algorithms. We compare AUC results across different clustering-based resampling strategies, different classifiers, and different settings of the number of clusters k, and over three intervals of the imbalance ratio, in order to find the best configuration of clustering-based sampling and compare it against the traditional and ensemble methods in the literature. The results show that, among all combinations, preprocessing by the nearest neighbors of the cluster centers combined with the MLP classifier is the best choice: its overall AUC is the highest and most stable on both small and large-scale datasets.
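The clustering-based under-sampling pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's actual implementation: it uses scikit-learn, sets the number of clusters equal to the minority-class size, keeps for each cluster the real majority sample nearest to its center (the "nearest neighbors of the cluster centers" variant the abstract favors), and pairs the balanced set with an MLP scored by AUC; the function name `cluster_undersample` and the toy data are my own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def cluster_undersample(X_maj, n_keep, random_state=0):
    """Reduce the majority class to at most n_keep representative samples:
    cluster it into n_keep groups with k-means, then keep the real sample
    closest to each cluster center."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=random_state).fit(X_maj)
    kept = []
    for c in range(n_keep):
        members = np.where(km.labels_ == c)[0]
        if members.size:  # k-means can occasionally leave a cluster empty
            dist = np.linalg.norm(X_maj[members] - km.cluster_centers_[c], axis=1)
            kept.append(members[np.argmin(dist)])
    return X_maj[kept]

# Toy imbalanced dataset: roughly a 9:1 majority/minority ratio.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Balance the training set: shrink class 0 to the size of class 1.
X_maj, X_min = X_tr[y_tr == 0], X_tr[y_tr == 1]
X_maj_red = cluster_undersample(X_maj, len(X_min))
X_bal = np.vstack([X_maj_red, X_min])
y_bal = np.hstack([np.zeros(len(X_maj_red)), np.ones(len(X_min))])

# Train the classifier the study found best (MLP) and evaluate by AUC,
# the metric used throughout the experiments.
clf = MLPClassifier(max_iter=500, random_state=0).fit(X_bal, y_bal)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

Because each retained point is an actual majority sample (not a synthetic centroid), the reduced set stays on the data manifold while still covering the majority class's regions, which is what lets this scheme avoid the uneven sampling that random under-sampling can produce.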
    Appears in Collections: [Graduate Institute of Information Management] Master's and Doctoral Theses

