Master's Thesis 109423035: Detailed Record




Author: Hsin-Yi Lin (林欣儀)    Department: Information Management
Thesis title: Oversampling Ensembles in Class-Imbalanced and High-Dimensional Data
(過採樣集成法於類別不平衡與高維度資料之研究)
Related theses:
★ The Effect of Feature Selection on Data Discretization
★ The Effect of Instance Selection and Data Discretization on Classifier Performance
★ A Comparison of Single and Ensemble Feature Selection Methods on High-Dimensional Data
Files: browse the thesis in the repository (full text available after 2024-5-6)
Abstract (Chinese): Through data analysis, enterprises can plan future operations and make decisions based on the results, so the importance and applicability of data keep growing. Raw data, however, often exhibit class imbalance and high dimensionality, two problems that frequently arise in fields such as finance and healthcare. Class imbalance biases prediction, making the model focus on the majority class while ignoring the minority class; high-dimensional datasets, with their excessive number of features, increase computational complexity and reduce prediction accuracy.
After reviewing the literature on class imbalance and high dimensionality, this thesis proposes a new method for the class imbalance problem, the oversampling ensemble, which combines three common SMOTE variants: polynom-fit-SMOTE, ProWSyn, and SMOTE-IPF. Two ensemble schemes are studied, the parallel ensemble and the serial ensemble; the parallel ensemble includes four methods for selecting the generated samples: Random, Center, Cluster Random, and Cluster Center. Experiments on 58 KEEL datasets show that the parallel ensemble significantly outperforms the single algorithms, with Center and Cluster Center performing best. For datasets that are both class-imbalanced and high-dimensional, the oversampling ensemble is combined with information gain (IG) and with embedded decision-tree feature selection; experiments on 15 OpenML datasets show that the method outperforms the single algorithms, with different variants suited to different imbalance ratios and feature counts.
Abstract (English): In the field of data analysis, enterprises can plan future operations and make crucial decisions based on analytical results, so data and its applications have become increasingly important. However, raw datasets often exhibit class imbalance and high dimensionality, problems that frequently occur in fields such as finance and medicine. Class imbalance biases prediction, making the prediction model focus mainly on the majority class instead of the minority one. High-dimensional datasets, in turn, increase computational complexity and reduce prediction accuracy because of redundant features.
In this thesis, we propose a new method called the oversampling ensemble to address the class imbalance problem. Three well-known SMOTE variants, polynom-fit-SMOTE, ProWSyn, and SMOTE-IPF, are investigated. The ensemble approaches comprise parallel and serial ensembles, where the parallel ensembles include four data combination methods: Random, Center, Cluster Random, and Cluster Center. Experimental results on 58 KEEL datasets show that the parallel ensembles outperform the baseline and the single oversampling algorithms, with the Center and Cluster Center methods performing best. For datasets that are both class-imbalanced and high-dimensional, parallel ensembles are combined separately with information gain and with embedded decision-tree feature selection on 15 OpenML datasets; the ensemble method again surpasses the baseline and the single algorithms. In addition, appropriate methods are recommended for different imbalance ratios and numbers of features.
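The parallel oversampling ensemble with "Center" selection described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: a toy SMOTE-style interpolator stands in for polynom-fit-SMOTE, ProWSyn, and SMOTE-IPF, and the names `smote_like` and `parallel_ensemble_center` are hypothetical.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def smote_like(X_min, n_new, rng=rng):
    """Toy SMOTE-style interpolation between random minority pairs."""
    n = len(X_min)
    out = []
    for _ in range(n_new):
        i = rng.integers(n)
        j = rng.integers(n - 1)
        j = j + 1 if j >= i else j          # any other minority point
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(out)

def parallel_ensemble_center(X, y, minority=1, n_samplers=3):
    """Pool candidates from several oversamplers run in parallel, keep
    those nearest the minority centroid ("Center" selection), rebalance."""
    X_min = X[y == minority]
    n_needed = int((y != minority).sum() - (y == minority).sum())
    # Each stand-in sampler proposes a full candidate set.
    pool = np.vstack([smote_like(X_min, n_needed) for _ in range(n_samplers)])
    dist = np.linalg.norm(pool - X_min.mean(axis=0), axis=1)
    keep = pool[np.argsort(dist)[:n_needed]]
    return (np.vstack([X, keep]),
            np.concatenate([y, np.full(len(keep), minority)]))

# Toy imbalanced data: 24 majority vs. 6 minority points.
X = rng.normal(size=(30, 2))
y = np.array([0] * 24 + [1] * 6)
X_bal, y_bal = parallel_ensemble_center(X, y)
print(Counter(y_bal.tolist()))   # both classes now have 24 samples
```

The Random, Cluster Random, and Cluster Center variants named in the abstract would replace only the distance-based selection step with a different rule for choosing from the pooled candidates.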
Keywords (Chinese) ★ class imbalance
★ high dimensionality
★ feature selection
★ ensemble learning
Keywords (English) ★ class imbalance
★ high dimension
★ feature selection
★ ensemble learning
Table of contents:
Abstract (Chinese) I
Abstract (English) II
Acknowledgements III
Table of Contents IV
List of Figures VI
List of Tables VII
1. Introduction 1
1.1 Research Background 1
1.2 Research Motivation 2
1.3 Research Objectives 3
1.4 Research Structure 5
2. Literature Review 7
2.1 The Class Imbalance Problem 7
2.1.1 Data Level 7
2.1.2 Algorithm Level 9
2.2 Ensemble Learning 10
2.3 Feature Selection 12
3. Research Method 16
3.1 Experimental Framework 16
3.2 Datasets 18
3.2.1 Experiment 1 18
3.2.2 Experiment 2 20
3.3 Experiment 1: Parallel ensemble vs. Serial ensemble 21
3.3.1 Parallel ensemble pre-test 21
3.3.2 Parallel ensemble 24
3.3.3 Serial ensemble pre-test 30
3.3.4 Serial ensemble 31
3.4 Experiment 2: IG vs. Embedded DT with the Oversampling ensemble 33
4. Experimental Results 35
4.1 Experimental Setup 35
4.2 Experiment 1: Parallel ensemble 36
4.2.1 Baseline and single algorithms 37
4.2.2 EO2 vs. EO3 39
4.2.3 Discussion 42
4.3 Experiment 1: Serial ensemble 47
4.4 Experiment 2 50
4.4.1 Experiment 2: EO2 Center + IG 50
4.4.2 Experiment 2: EO2 Center + Embedded DT 51
4.4.3 Discussion 52
5. Conclusion 53
5.1 Conclusions and Contributions 53
5.2 Future Research Directions and Suggestions 54
References 55
Appendix 1: Parallel ensemble pre-test results 62
Appendix 2: Parallel ensemble Decision Tree results 67
Appendix 3: Parallel ensemble SVM results 71
Appendix 4: Serial ensemble pre-test results 75
Appendix 5: Serial ensemble DT results 77
Appendix 6: Serial ensemble SVM results 78
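The two feature-selection routes named in the abstract can be sketched with scikit-learn, assuming information gain is approximated by mutual information and the embedded route uses a decision tree's impurity-based importances. The generated dataset and the keep-5 cutoff are illustrative, not the thesis's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

# Illustrative imbalanced, moderately wide dataset (not a thesis dataset).
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# Filter route: information gain, approximated here by mutual information.
ig = mutual_info_classif(X, y, random_state=0)
top_ig = np.argsort(ig)[::-1][:5]            # 5 highest-scoring features

# Embedded route: impurity-based importances from a fitted decision tree.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
top_dt = np.argsort(tree.feature_importances_)[::-1][:5]

print(sorted(top_ig.tolist()), sorted(top_dt.tolist()))
```

Either index set would then be used to reduce the feature space before oversampling and classification, mirroring the IG and Embedded DT pipelines of Experiment 2.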
Advisors: Chih-Fong Tsai, Kuen-Liang Su (蔡志豐, 蘇坤良)    Date of approval: 2022-7-13

For thesis-related questions, please contact the Promotion Services Division, National Central University Library, TEL: (03)422-7151 ext. 57407, or by e-mail.