Addressing Class Imbalance in Information Security: Comparative Analysis of Undersampling, Oversampling, and Hybrid Approaches


Name

Ling-Teng Tseng (曾令騰)

Department

Department of Information Management, In-service Master's Program

Thesis Title

Addressing Class Imbalance in Information Security: Comparative Analysis of Undersampling, Oversampling, and Hybrid Approaches
(資訊安全中的類別不平衡:欠採樣、過採樣和混合方法的比較研究)

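Related Theses

★ Building a Sales Forecasting Model for Commercial Multifunction Printers Using Data Mining Techniques
★ A Study on Applying Data Mining Techniques to Resource Allocation Forecasting: The Case of a Support Unit at a Computer Contract Manufacturer
★ Applying Data Mining Techniques to Flight Delay Analysis in the Airline Industry: The Case of Company C
★ Security Control of New Products in a Global Supply Chain: The Case of Company C
★ Applying Data Mining in the Semiconductor Laser Industry: The Case of Company A
★ Applying Data Mining Techniques to Predict Warehouse Storage Time for Air Export Cargo: The Case of Company A
★ Optimizing YouBike Rebalancing Operations Using Data Mining Classification Techniques
★ The Effect of Feature Selection on Different Data Types
★ A Study Applying Data Mining to B2B Corporate Websites: The Case of Company T
★ Customer Investment Analysis and Recommendations for Financial Derivatives: Integrating Clustering and Association Rule Techniques
★ Building an Assisted Diagnosis Model for Liver Ultrasound Images Using Convolutional Neural Networks
★ An Identity Recognition System Based on Convolutional Neural Networks
★ A Comparative Analysis of Error Rates of Electricity Data Imputation Methods in Energy Management Systems
★ Development of an Employee Sentiment Analysis and Management System
★ Data Cleaning for the Class Imbalance Problem: A Machine Learning Perspective
★ Applying Data Mining Techniques to the Analysis of Passenger Self-Service Check-in: The Case of Airline C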

Files


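This electronic thesis is authorized for immediate open access.
Once open access takes effect, the electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid violating the law.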

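Abstract (Chinese, translated)

This study focuses on class imbalance in information security, centering on binary and five-class machine learning experiments. It analyzes how different classifiers (ANN, KNN, RF, and SVM) perform on data with different numbers of classes and explores several data processing techniques: oversampling (Random Oversampling, SMOTE, Borderline SMOTE, ADASYN), undersampling (ENN, Tomek Links), and hybrid methods (SMOTE-ENN, SMOTE-Tomek Links). When handling class-imbalanced datasets, choosing an appropriate model and data processing strategy is crucial for lowering the Type II error rate. Fewer Type II errors mean better recognition of the minority class, which is critical in applications such as medical diagnosis and information security. The binary experiments use the information security logs of the case company, Company A, in which log entries are labeled "harmful" or "harmless". Under class imbalance, the most important goal in security risk detection is to reduce Type II errors, that is, cases where a real security risk is judged to be no risk. ANN + Random Oversampling produced the lowest Type II error rate, 9.09%, far below the rates on the original data (ANN: 81%, KNN: 54%, RF: 24%, SVM: 45%). The five-class experiments use the well-known KDD99 network intrusion detection dataset, with preprocessing that first maps the 22 attack types to four major attack categories. Performance on the extremely imbalanced classes, category 4 (R2L) and category 5 (U2R), differed markedly across classifiers. In particular, prediction for category 5 improved markedly after oversampling techniques were applied, with the ANN + SMOTE-ENN combination (a hybrid method that includes SMOTE oversampling) showing the clearest gain. The analysis also shows that lowering the Type II error rate of minority classes may raise the error rate of majority classes, which illustrates the complexity of the class imbalance problem and underlines the importance of choosing a suitable data processing strategy.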
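The Type II error rate that the abstract emphasizes is the false negative rate: of the samples that truly belong to the class of interest, the fraction predicted as something else. The following is a minimal sketch of that calculation, not the thesis's own code. The function name `type2_error_rate` and the toy labels are hypothetical, chosen only to show why plain accuracy can hide missed threats on imbalanced data.

```python
def type2_error_rate(y_true, y_pred, positive):
    """Return FN / (FN + TP) for the class named by `positive`.

    Treating one class as positive and all others as negative lets the
    same function serve binary and multi-class (one-vs-rest) settings.
    """
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    if fn + tp == 0:
        raise ValueError("y_true contains no samples of the positive class")
    return fn / (fn + tp)


# Hypothetical toy data (NOT from the thesis): 10 harmful, 90 harmless entries.
y_true = ["harmful"] * 10 + ["harmless"] * 90
# The imagined classifier misses 2 of the 10 harmful entries.
y_pred = ["harmless"] * 2 + ["harmful"] * 8 + ["harmless"] * 90

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
miss_rate = type2_error_rate(y_true, y_pred, positive="harmful")

print(f"accuracy = {accuracy:.2%}")
print(f"Type II error rate (harmful) = {miss_rate:.2%}")
```

In this toy case the classifier is right on 98 of 100 entries, yet it misses 2 of the 10 harmful ones, so the Type II error rate for the harmful class is 20%. For the five-class setting, the same function can be called once per class (for example with the label used for U2R) to obtain a per-class rate.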

Abstract (English)

This study addresses class imbalance in information security through binary and five-class machine learning classification experiments. It compares four classifiers (ANN, KNN, RF, and SVM) in combination with oversampling (Random Oversampling, SMOTE, Borderline SMOTE, ADASYN), undersampling (ENN, Tomek Links), and hybrid methods (SMOTE-ENN, SMOTE-Tomek Links). When a dataset is imbalanced, choosing an appropriate model and data processing strategy is crucial for reducing the Type II error rate. For binary classification, the study uses information security logs from Company A, with log entries labeled "harmful" or "harmless". Under class imbalance, the priority is to reduce Type II errors, in which an actual security risk is classified as harmless. ANN + Random Oversampling achieved the lowest Type II error rate, 9.09%, well below the rates obtained on the original data (ANN: 81%, KNN: 54%, RF: 24%, SVM: 45%). For five-class classification, the study uses the well-known KDD99 dataset, first grouping its 22 attack types into four major attack categories. The classifiers performed very differently on the extremely imbalanced classes, category 4 (R2L) and category 5 (U2R). Predictive performance for category 5 improved markedly after oversampling techniques were applied, with the ANN + SMOTE-ENN combination showing the most pronounced improvement. The analysis also indicated that reducing the Type II error rate for minority classes can increase the error rate for majority classes, which highlights the complexity of the class imbalance problem and the importance of selecting a suitable data processing strategy.
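Random Oversampling, the technique behind the best binary result reported in the abstract, balances classes by duplicating randomly chosen minority samples until every class matches the majority count. The sketch below illustrates the idea using only the Python standard library. It is an illustration under assumptions, not the thesis's implementation: the function name `random_oversample`, the `seed` parameter, and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of each smaller class until
    all classes have as many samples as the largest class."""
    rng = random.Random(seed)  # fixed seed for a reproducible sketch
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Indices of the original samples that carry this label.
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(idx)  # sample with replacement
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data (NOT from the thesis): 8 harmless, 2 harmful rows.
X = [[i] for i in range(10)]
y = ["harmless"] * 8 + ["harmful"] * 2

X_res, y_res = random_oversample(X, y)
print(Counter(y))      # class counts before resampling
print(Counter(y_res))  # class counts after resampling
```

After resampling, both classes have 8 samples, and every added row is an exact copy of one of the two original harmful rows. Resampling is normally applied to the training split only, so that evaluation still reflects the original class distribution. Libraries such as imbalanced-learn offer a ready-made `RandomOverSampler`; which implementation the thesis used is not stated on this page.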

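Keywords (Chinese, translated)

★ Information Security
★ Class Imbalance
★ Binary Classification
★ Multi-class Classification
★ Data Resampling Techniques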

Keywords (English)

★ Information Security
★ Class Imbalance
★ Binary
★ Five-class
★ Data Resampling

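Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
Chapter 2 Literature Review
2.1 Class-Imbalanced Datasets
2.1.1 Impact of the Class Imbalance Problem
2.1.2 Methods for Addressing Class Imbalance
2.2 Evaluation Metrics
2.3 Review of Class Imbalance Literature in Information Security
Chapter 3 Research Methodology
3.1 Data Sources and Classification
3.1.1 Data Sources
3.1.2 Classifiers
3.2 Data Preprocessing
3.3 Research Design and Framework
3.4 Experimental Environment
3.5 Model Evaluation
Chapter 4 Experimental Results and Analysis
4.1 Binary Classification Experiments and Results
4.1.1 Binary Classification: Original Data Experiment
4.1.2 Binary Classification: Oversampling Experiment
4.1.3 Binary Classification: Undersampling Experiment
4.1.4 Binary Classification: Hybrid Oversampling and Undersampling Experiment
4.1.5 Summary of Binary Classification Experiments
4.2 Five-Class Classification Experiments and Results
4.2.1 Five-Class Classification: Original Data Experiment
4.2.2 Five-Class Classification: Oversampling Experiment
4.2.3 Five-Class Classification: Undersampling Experiment
4.2.4 Five-Class Classification: Hybrid Oversampling and Undersampling Experiment
4.2.5 Summary of Five-Class Classification Experiments
Chapter 5 Conclusions and Recommendations
5.1 Research Conclusions and Contributions
5.2 Research Limitations
5.3 Future Research Directions
References
Appendix: Five-Class Experimental Data
Five-Class: Original Data
Five-Class: Random Oversampling
Five-Class: SMOTE
Five-Class: Borderline SMOTE
Five-Class: ADASYN
Five-Class: ENN
Five-Class: Tomek Links
Five-Class: SMOTE-ENN
Five-Class: SMOTE-Tomek Links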


Advisor

Chih-Fong Tsai (蔡志豐)

Approval Date

2024-05-14
