Master's and Doctoral Thesis 105423004: Detailed Record




Name: 潘怡瑩 (Yi-Ying Pan)    Department: Information Management
Thesis Title: Clustering-based Data Preprocessing Approach for the Class Imbalance Problem
  1. This electronic thesis is authorized for immediate open access.
  2. The open-access electronic full text is licensed only for personal, non-profit retrieval, reading, and printing for the purpose of academic research.
  3. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.

Abstract (Chinese) The class imbalance problem has long been an important and frequently occurring issue in data mining. It arises when the number of samples in one class of a dataset far exceeds that in another, producing a skewed distribution. In pursuit of high overall classification accuracy, ordinary classifiers then build prediction models that tend to misclassify minority-class samples as the majority class, so that no good classification rules can be established for the minority class. This phenomenon is increasingly common in the real world; class imbalance frequently occurs in domains such as medical diagnosis, fault detection, and face recognition.
This thesis proposes a clustering-based preprocessing approach that uses clustering techniques to divide the majority class into several subclasses, forming multiclass data. This framework effectively reduces the class imbalance ratio of the training dataset, shortens classifier training time, and improves classification accuracy.
Experiments are conducted on 44 small KEEL datasets and 8 high-dimensional NASA datasets. Within the proposed framework, two clustering techniques (Affinity Propagation and K-means) are used, each combined with five classifiers (C4.5, MLP, Naïve Bayes, SVM, and k-NN with k=5) as well as ensemble learning to build classification models. The AUC (Area Under Curve) results of different clustering methods, classifiers, and cluster-number settings are compared to find the best configuration of the clustering-based preprocessing framework, which is then compared in accuracy against traditional methods and ensemble learning methods from the literature. Finally, the KEEL results show that the k-NN (k=5) algorithm is the best choice whether paired with Affinity Propagation or K-means (K=5), while the NASA results show that the clustering-based preprocessing framework also outperforms the literature on high-dimensional datasets.
Abstract (English) The class imbalance problem is an important issue in data mining. It occurs when the number of samples in one class is much larger than in the other classes. Traditional classifiers tend to misclassify most samples of the minority class into the majority class to maximize overall accuracy, which makes it hard to establish good classification rules for the minority class. The class imbalance problem often occurs in real-world applications such as fault diagnosis, medical diagnosis, and face recognition.
To deal with the class imbalance problem, a clustering-based data preprocessing approach is proposed, in which two clustering techniques, affinity propagation and K-means, are used individually to divide the majority class into several subclasses, resulting in multiclass data. This approach can effectively reduce the class imbalance ratio of the training dataset, shorten classifier training time, and improve classification performance.
Our experiments use forty-four small class-imbalanced datasets from KEEL and eight high-dimensional datasets from NASA to build five types of classification models: C4.5, MLP, Naïve Bayes, SVM, and k-NN (k=5). In addition, we also employ a classifier ensemble algorithm. We compare AUC results across different clustering techniques, different classification models, and different numbers of K-means clusters in order to find the best configuration of the proposed approach, and we compare it with methods from the literature. Finally, the experimental results on the KEEL datasets show that the k-NN (k=5) algorithm is the best choice regardless of whether affinity propagation or K-means (K=5) is used; the results on the NASA datasets show that the proposed approach is superior to the literature methods on high-dimensional datasets.
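As a concrete illustration of the preprocessing idea described above, the following is a minimal Python sketch using scikit-learn. The synthetic dataset, the 9:1 imbalance ratio, and the fixed choice of K=5 are illustrative assumptions for this sketch only, not the thesis's actual experimental pipeline, which uses the KEEL and NASA datasets and also evaluates Affinity Propagation and several other classifiers.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class imbalanced data (roughly 9:1), standing in for a KEEL dataset.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Split the majority class (label 0) into K subclasses with K-means, turning the
# binary training labels into multiclass labels 0..K-1 (majority subclasses) and
# K (minority). sklearn.cluster.AffinityPropagation could be substituted here.
K = 5
maj = y_tr == 0
sub = KMeans(n_clusters=K, n_init=10, random_state=42).fit_predict(X_tr[maj])
y_multi = np.full_like(y_tr, K)   # minority samples share the single label K
y_multi[maj] = sub                # majority samples relabeled by subclass

# Train k-NN (k=5) on the multiclass training set.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_multi)

# Map predictions back to binary: the minority score is the predicted
# probability of label K, so the ordinary binary AUC can be computed.
proba = clf.predict_proba(X_te)
minority_score = proba[:, list(clf.classes_).index(K)]
print("AUC:", roc_auc_score(y_te, minority_score))

In this sketch only the training labels are relabeled; mapping the multiclass predictions back to a single minority-class score keeps the evaluation binary, so an ordinary AUC of the kind compared in the abstract can still be computed.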
Keywords (Chinese) ★ class imbalance (類別不平衡)
★ data mining (資料探勘)
★ classification (分類)
★ clustering (分群)
Keywords (English) ★ class imbalance
★ data mining
★ clustering
★ classification
Table of Contents Chinese Abstract i
Abstract ii
List of Figures v
List of Tables vii
1. Introduction 1
1-1 Research Background 1
1-2 Research Motivation 2
1-3 Research Objectives 4
1-4 Thesis Organization 5
2. Literature Review 7
2-1 The Class Imbalance Problem 7
2-2 Approaches to the Class Imbalance Problem 9
2-2-1 Data Level 9
2-2-2 Algorithm Level 12
2-2-3 Cost-Sensitive Methods 12
2-3 Machine Learning Algorithms 13
2-3-1 Unsupervised Learning Algorithms 13
2-3-2 Supervised Learning Algorithms 19
2-3-3 Ensemble Learning 25
2-4 Comparison of Related Work 27
3. Research Methodology 33
3-1 Experimental Framework 33
3-2 CBRM Preprocessing Framework 35
3-3 Pseudo-code 38
4. Experimental Results 39
4-1 Experimental Setup 39
4-1-1 Hardware and Software 39
4-1-2 Datasets 39
4-1-3 Parameter Settings and Methods 42
4-2 Results of Experiment 1 43
4-2-1 Results Using Affinity Propagation Clustering 43
4-2-2 Results Using K-means Clustering 54
4-3 Results of Experiment 2 59
4-4 Summary of Experiments 61
5. Conclusion 63
5-1 Conclusions and Contributions 63
5-2 Future Research Directions and Suggestions 65
References 67
Appendix 1 73
Advisors: 蔡志豐、蘇坤良    Date of Approval: 2018-6-21
