Addressing Class Imbalance in Information Security: Comparative Analysis of Undersampling, Oversampling, and Hybrid Approaches

To cite or link to this item, please use the permanent URL: http://ir.lib.ncu.edu.tw/handle/987654321/95429

Title:	Addressing Class Imbalance in Information Security: Comparative Analysis of Undersampling, Oversampling, and Hybrid Approaches
Author:	TSENG, LING-TENG (曾令騰)
Contributor:	In-service Master's Program, Department of Information Management
Keywords:	Information Security; Class Imbalance; Binary; Five-class; Data Resampling
Date:	2024-05-14
Upload time:	2024-10-09 16:51:05 (UTC+8)
Publisher:	National Central University
Abstract:	This study addresses class imbalance in information security through binary and five-class machine learning classification experiments. It compares four classifiers (ANN, KNN, RF, SVM) under a range of data resampling techniques: oversampling (Random Oversampling, SMOTE, Borderline SMOTE, ADASYN), undersampling (ENN, Tomek Links), and hybrid methods (SMOTE-ENN, SMOTE-Tomek Links). On imbalanced datasets, the choice of model and resampling strategy is crucial for reducing the Type II error rate. A lower Type II error rate means better recognition of the minority class, which is critical in applications such as medical diagnosis and information security. For binary classification, the study used information security logs from an anonymized case company, Company A, with each log entry labeled "harmful" or "harmless". Under class imbalance, the most important goal in security risk detection is to reduce Type II errors, that is, actual security risks misclassified as harmless. In the experiments, ANN + Random Oversampling achieved the lowest Type II error rate, 9.09%, a substantial reduction from the Type II error rates on the original data (ANN: 81%, KNN: 54%, RF: 24%, SVM: 45%).
For five-class classification, the study used the well-known KDD99 network intrusion detection dataset, first preprocessing its 22 attack types into four major attack categories (which, together with normal traffic, make up the five classes). On this extremely imbalanced dataset, especially for category 4 (R2L) and category 5 (U2R), performance differed significantly among the classifiers. Predictive performance for category 5 improved significantly after oversampling techniques were applied, and the ANN + SMOTE-ENN combination showed the most pronounced improvement for that category. The analysis also indicated that reducing the Type II error rate for minority classes may increase the error rate for majority classes, which highlights the complexity of class imbalance and the importance of selecting a suitable resampling strategy.
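The two ideas the abstract relies on most, random oversampling and the Type II error rate, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names `random_oversample` and `type_ii_error_rate` and the toy data are hypothetical, and the sketch assumes the Type II error rate is computed as FN / (FN + TP) over the truly harmful samples.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class matches the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def type_ii_error_rate(y_true, y_pred, positive=1):
    """Share of truly positive samples predicted as anything else:
    FN / (FN + TP)."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        raise ValueError("no positive samples in y_true")
    false_negatives = sum(1 for p in preds_for_positives if p != positive)
    return false_negatives / len(preds_for_positives)


# Toy data (hypothetical): 1 = harmful, 0 = harmless.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8], [0.9], [5.0]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have 9 samples
```

In practice, resampling of this kind is normally applied to the training split only, so that the Type II error rate is measured on data with the original class distribution.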
Appears in collections:	[In-service Master's Program, Department of Information Management] Theses and Dissertations

All items in NCUIR are protected by the original copyright.