Abstract (English) |
This study addresses class imbalance in information security through binary and five-class machine learning classification experiments. It compares four classifiers (ANN, KNN, RF, and SVM) across data categories and evaluates a range of resampling techniques: oversampling (Random Oversampling, SMOTE, Borderline-SMOTE, ADASYN), undersampling (ENN, Tomek Links), and hybrid methods (SMOTE-ENN, SMOTE-Tomek Links). On imbalanced datasets, the choice of model and resampling strategy is crucial for reducing the Type II error rate. For binary classification, the study used information security logs from Company A, labeled as 'harmful' or 'harmless'. Under class imbalance, reducing Type II errors, which misclassify actual security risks as non-threatening, is of utmost importance. ANN + Random Oversampling achieved the lowest Type II error rate, 9.09%, a substantial reduction from the rates obtained on the original data (ANN: 81%, KNN: 54%, RF: 24%, SVM: 45%). For five-class classification, the study used the widely used KDD99 dataset, first grouping its 22 attack types into four major categories, which together with normal traffic give the five classes. On this extremely imbalanced dataset (especially for category 4 (R2L) and category 5 (U2R)), the classifiers differed markedly in performance. Predictive performance for category 5 improved notably after oversampling, with the ANN + SMOTE-ENN combination showing the most pronounced improvement. The analysis also indicated that reducing the Type II error rate for minority classes may increase the error rate for majority classes, which highlights the complexity of class imbalance and the importance of selecting a suitable resampling strategy. |
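As an illustration of the two ideas the abstract relies on, the following minimal sketch computes a Type II error rate (harmful samples predicted as harmless, divided by all harmful samples) and performs plain random oversampling. It is not the study's implementation: the function names `type_ii_error_rate` and `random_oversample`, the label encoding (1 = harmful, 0 = harmless), and the toy data are all assumptions made for this example.

```python
import random
from collections import Counter


def type_ii_error_rate(y_true, y_pred, positive=1):
    """Return FN / (FN + TP): the share of truly positive samples that were missed."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    if fn + tp == 0:
        raise ValueError("y_true contains no positive samples")
    return fn / (fn + tp)


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of each smaller class until all classes
    are as large as the largest one. Originals are kept unchanged."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lbl in zip(X, y) if lbl == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy labels: 1 = harmful, 0 = harmless.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(type_ii_error_rate(y_true, y_pred))  # 2 of 4 harmful samples missed -> 0.5

# Hypothetical imbalanced training set: four harmless samples, one harmful.
X_res, y_res = random_oversample([[0], [1], [2], [3], [4]], [0, 0, 0, 0, 1])
print(Counter(y_res))
```

In a setup like this, resampling would normally be applied to the training split only, so that the Type II error rate is measured on data with the original class distribution.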