Addressing Class Imbalance in Information Security: Comparative Analysis of Undersampling, Oversampling, and Hybrid Approaches

DC Field	Value	Language
DC.contributor	資訊管理學系在職專班	zh_TW
DC.creator	曾令騰	zh_TW
DC.creator	LING-TENG TSENG	en_US
dc.date.accessioned	2024-05-14T07:39:07Z
dc.date.available	2024-05-14T07:39:07Z
dc.date.issued	2024
dc.identifier.uri	http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=111453001
dc.contributor.department	資訊管理學系在職專班	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
dc.description.abstract	本研究專注於資訊安全領域中類別不平衡的問題，著重於二分類與五分類的機器學習實驗。透過分析不同分類器（包括ANN、KNN、RF、SVM）在處理不同類別數據時的效能，探索了多種數據處理技術包括過採樣（Random Oversampling、SMOTE、Borderline SMOTE、ADASYN）、欠採樣（ENN、Tomek Links）和混合方法（SMOTE-ENN、SMOTE-Tomek Links）。在處理類別不平衡的數據集時，選擇合適的模型和數據處理策略對於降低型二錯誤率至關重要。減少型二錯誤意味著提高了對少數類的識別能力，這對於許多應用來說，如醫療診斷、資訊安全等，是極其關鍵的。二分類資料使用個案A公司的資訊安全Log，日誌資料被分類為「有危害」和「無危害」兩種類型，在類別不平衡的情況下，資安風險中最重要的就是減少型二錯誤，也就是明明有資安風險卻被判別為無資安風險，實驗結果在ANN + Random Oversampling有著最低的型二錯誤率9.09%，相較於原始資料的型二錯誤率(ANN :81% 、KNN: 54% 、RF: 24% 、SVM :45%)降低許多。五分類使用著名的KDD99網路入侵偵測資料集，先做前處理把22種攻擊類型轉為四大類攻擊，其中極度不平衡的數據集（類別四(R2L)和類別五(U2R)），在不同的分類器上處理的表現有顯著差異。特別是在使用過採樣技術後，對於類別五的預測性能有顯著提升，其中ANN + SMOTE-ENN組合對於類別五的性能提升最為明顯，此外分析還顯示，在降低少數類別的型二錯誤率時可能會提高多數類別的錯誤率，顯示了處理類別不平衡問題的複雜性，並強調了選擇合適的數據處理策略的重要性。	zh_TW
dc.description.abstract	This study focuses on class imbalance in the field of information security, with machine learning experiments on binary and five-class classification. By analyzing the performance of different classifiers (ANN, KNN, RF, SVM) on data with different numbers of classes, it explores a range of data processing techniques: oversampling (Random Oversampling, SMOTE, Borderline SMOTE, ADASYN), undersampling (ENN, Tomek Links), and hybrid methods (SMOTE-ENN, SMOTE-Tomek Links). Selecting appropriate models and data processing strategies is crucial for reducing Type II error rates when dealing with imbalanced datasets. For binary classification, the study used information security logs from Company A, categorizing the log data as 'harmful' or 'harmless'. Under class imbalance, the most important goal in security risk detection is reducing Type II errors, in which an actual security risk is misclassified as non-threatening. The experimental results showed that ANN + Random Oversampling achieved the lowest Type II error rate of 9.09%, a substantial reduction compared to the Type II error rates on the original data (ANN: 81%, KNN: 54%, RF: 24%, SVM: 45%). For the five-class classification, the study used the well-known KDD99 network intrusion detection dataset, first grouping the 22 attack types into four major categories. On this extremely imbalanced dataset (especially categories 4 (R2L) and 5 (U2R)), performance differed significantly among the classifiers. Notably, predictive performance for category 5 improved significantly after applying oversampling techniques, with the ANN + SMOTE-ENN combination showing the most pronounced improvement. Furthermore, the analysis indicated that reducing the Type II error rate for minority classes may increase the error rate for majority classes, highlighting the complexity of addressing class imbalance and underscoring the importance of selecting suitable data processing strategies.	en_US
DC.subject	資訊安全	zh_TW
DC.subject	類別不平衡	zh_TW
DC.subject	二分類	zh_TW
DC.subject	多分類	zh_TW
DC.subject	數據重採樣技術	zh_TW
DC.subject	Information Security	en_US
DC.subject	Class Imbalance	en_US
DC.subject	Binary	en_US
DC.subject	Five-class	en_US
DC.subject	Data Resampling	en_US
DC.title	資訊安全中的類別不平衡:欠採樣、過採樣和混合方法的比較研究	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	Addressing Class Imbalance in Information Security: Comparative Analysis of Undersampling, Oversampling, and Hybrid Approaches	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Thesis 111453001 full metadata record