Master's/Doctoral Thesis 103423011: Detailed Record




Author: Jing-Shang Jhang (張景翔)    Department: Information Management
Thesis Title: Clustering-Based Under-sampling in Class Imbalanced Data
(分群式取樣法於類別不平衡問題之研究)
Related Theses
★ Building a Sales Forecasting Model for Commercial Multifunction Printers Using Data Mining Techniques
★ Applying Data Mining Techniques to Resource Allocation Prediction: A Case Study of a Computer OEM Support Unit
★ Applying Data Mining Techniques to Flight Delay Analysis in the Airline Industry: A Case Study of Company C
★ Security Control of New Products in the Global Supply Chain: A Case Study of Company C
★ Data Mining in the Semiconductor Laser Industry: A Case Study of Company A
★ Applying Data Mining Techniques to Predicting Warehouse Dwell Time of Air Export Cargo: A Case Study of Company A
★ Optimizing YouBike Rebalancing Operations Using Data Mining Classification Techniques
★ The Effect of Feature Selection on Different Data Types
★ Data Mining for B2B Corporate Websites: A Case Study of Company T
★ Customer Investment Analysis and Recommendation for Financial Derivatives: Integrating Clustering and Association Rule Techniques
★ Building a Computer-Aided Diagnosis Model for Liver Ultrasound Images Using Convolutional Neural Networks
★ An Identity Recognition System Based on Convolutional Neural Networks
★ Comparative Error Analysis of Power-Consumption Imputation Methods in Energy Management Systems
★ Development of an Employee Sentiment Analysis and Management System
★ Data Cleaning for the Class Imbalance Problem: A Machine Learning Perspective
★ Applying Data Mining Techniques to Passenger Self-Service Check-in Analysis: A Case Study of Airline C
  1. The electronic full text of this thesis is approved for immediate open access.
  2. The open-access electronic full text is licensed only for personal, non-profit retrieval, reading, and printing for academic research purposes.
  3. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.

Abstract (Chinese) The class imbalance problem has long been an important issue in data mining. It arises when the number of samples in one class of the training set is far smaller than that in the other classes: a classification model built from such data tends to misclassify minority-class samples into the majority class in pursuit of a high overall accuracy rate. The problem is also increasingly common in the real world, in domains such as medical diagnosis, fault detection, and face recognition.
  Solutions fall into data-level, algorithm-level, and cost-sensitive approaches, among which data-level preprocessing is the most common; it balances the per-class sample counts by under-sampling the majority class or over-sampling the minority class. Both approaches have drawbacks: under-sampling may delete valuable data, while over-sampling may introduce noisy samples, and the added samples raise the time cost of training the classifier and easily cause overfitting.
  This thesis proposes clustering-based sampling methods built on the k-means clustering algorithm to preprocess the majority-class samples in the training set. The main purpose of the clustering is to select representative samples to replace the original data, balancing the sample counts across classes while reducing the chance that sampling misrepresents the data distribution.
  The experiments use 44 small datasets and 2 large-scale datasets with five classifiers (C4.5, SVM, MLP, k-NN with k=5, and Naïve Bayes), together with ensemble learning algorithms. We compare different clustering-based sampling strategies, classifiers, and settings of the number of clusters k, and analyze AUC results over three imbalance-ratio intervals, in order to find the best configuration of clustering-based sampling and to compare it with the conventional and ensemble methods in the literature. The results show that, among all combinations, preprocessing by the nearest neighbors of the cluster centers combined with the MLP classifier is the best choice: its overall AUC is the highest and the most stable on both small and large datasets.
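  The clustering-based under-sampling step described above can be sketched briefly. The following Python sketch assumes scikit-learn and NumPy; the thesis does not publish its implementation, so the function and variable names here are illustrative, not the author's own code. It clusters the majority class into as many groups as there are minority samples and keeps the real sample nearest to each cluster center:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

def cluster_undersample(X_majority, n_minority, random_state=0):
    # Group the majority class into n_minority clusters so that the
    # reduced majority class matches the minority class in size.
    km = KMeans(n_clusters=n_minority, n_init=10,
                random_state=random_state).fit(X_majority)
    # Keep the real majority sample nearest to each cluster center
    # (the "nearest neighbors of the cluster centers" strategy that
    # performed best in the thesis); duplicates are dropped, so the
    # result may occasionally be slightly smaller than n_minority.
    nearest = pairwise_distances_argmin(km.cluster_centers_, X_majority)
    return X_majority[np.unique(nearest)]

The reduced majority-class samples are then combined with all minority-class samples to form a balanced training set; the other sampling strategies compared in the thesis differ mainly in how samples are selected from each cluster.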
Abstract (English) The class imbalance problem is an important issue in data mining. It occurs when the number of samples representing one class is much smaller than that of the other classes. A classification model built from a class-imbalanced dataset is likely to misclassify most minority-class samples into the majority class because it maximizes the overall accuracy rate. The problem is present in many real-world applications, such as fault diagnosis, medical diagnosis, and face recognition.
  One of the most popular types of solutions is data sampling: under-sampling the majority class or over-sampling the minority class to balance the dataset. Under-sampling balances the class distribution by eliminating majority-class samples, but it may discard useful data. Conversely, over-sampling replicates minority-class samples, but it increases the likelihood of overfitting.
  Therefore, we propose several resampling methods based on the k-means clustering technique. To decrease the probability of uneven resampling, we select representative samples to replace the majority-class samples in the training dataset.
  Our experiments use 44 small class-imbalanced datasets and two large-scale datasets to build five types of classification models: C4.5, SVM, MLP, k-NN (k=5), and Naïve Bayes. A classifier ensemble algorithm is also employed. We compare AUC results across different resampling techniques, different models, and different numbers of clusters, and additionally divide the imbalance ratio into three intervals. The goal is to find the best configuration of our experiments and compare it with methods from the literature. The experimental results show that combining the MLP classifier with clustering-based under-sampling using the nearest neighbors of the cluster centers performs best in terms of AUC over both small and large-scale datasets.
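  Under the same assumptions (scikit-learn; the parameter values are illustrative defaults, not the thesis's exact experimental settings), the evaluation protocol of imbalance ratio plus AUC can be sketched as follows for a binary dataset with 0/1 labels:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

def evaluate_auc(X, y, random_state=0):
    # A stratified split preserves the class ratio in both partitions.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state)
    # Imbalance ratio = majority-class size / minority-class size;
    # the thesis groups datasets into three intervals of this ratio.
    counts = np.bincount(y_tr)
    ir = counts.max() / counts.min()
    # (A resampling step such as cluster_undersample would be applied
    # to the majority rows of the training partition at this point.)
    clf = MLPClassifier(max_iter=500, random_state=random_state)
    clf.fit(X_tr, y_tr)
    # AUC is threshold-independent, so it remains informative even
    # when plain accuracy is dominated by the majority class.
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return ir, auc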
Keywords (Chinese) ★ class imbalance
★ data mining
★ classification
★ clustering
Keywords (English) ★ class imbalance
★ data mining
★ classification
★ clustering
Table of Contents
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
1-4 Thesis Organization
2. Literature Review
2-1 The Class Imbalance Problem
2-2 Approaches to the Class Imbalance Problem
2-2-1 Data-Level Methods
2-2-2 Algorithm-Level Methods
2-2-3 Cost-Sensitive Methods
2-3 Machine Learning Algorithms
2-3-1 Single Classifiers
2-3-2 Ensemble Learning Algorithms
2-4 Evaluation Metrics
2-5 Methods Compared from the Literature
3. Research Method
3-1 Research Framework
3-2 Data Preprocessing
3-2-1 The k-means Clustering Algorithm
3-2-2 Clustering-Based Sampling
3-3 Comparison of Related Frameworks
4. Experimental Results
4-1 Experimental Setup
4-1-1 Hardware and Software
4-1-2 Datasets
4-2 Method Validation
4-3 Results of Experiment 1
4-3-1 Analysis Based on the C4.5 Decision Tree
4-3-2 Comparison Across Classifiers
4-3-3 Sensitivity Analysis
4-3-4 Analysis by Imbalance Ratio
4-4 Results of Experiment 2
4-5 Summary of Experiments
5. Conclusion
5-1 Conclusions and Contributions
5-2 Research Limitations and Future Work
References
Advisor: Chih-Fong Tsai (蔡志豐)    Approval Date: 2016-7-1
