Thesis Title The Effectiveness Evaluation of Outlier Detection in Improving the Predictions of Imbalanced Classes
(異常值偵測對增進類別不平衡預測的效能評估)
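Abstract (Chinese) This study explores the application of outlier detection techniques to class-imbalanced datasets and evaluates how combining them with over-sampling affects model predictive performance. Outliers are detected and removed separately in the minority class and in the majority class, after which SMOTE (Synthetic Minority Over-sampling TEchnique) is applied to balance the number of samples in the two classes. Through experimental analysis, the study compares over-sampling after outlier handling with direct over-sampling and analyzes the effect of outlier detection on model predictive performance.
For the experimental design, seven binary class-imbalanced datasets from the KEEL-Dataset Repository (Knowledge Extraction based on Evolutionary Learning-Dataset Repository) were selected, together with four representative outlier detection methods of different types: LOF (Local Outlier Factor), iForest (Isolation Forest), MCD (Minimum Covariance Determinant), and OCSVM (One-Class Support Vector Machine). Three classifiers were used: SVM (Support Vector Machine), Random Forest, and LightGBM. The experiments observe how model predictive performance changes when outliers are removed from the minority class and from the majority class, respectively, and SMOTE is then used to over-sample the dataset until the classes are balanced.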
Abstract (English) This study explores the application of outlier detection techniques to imbalanced datasets and evaluates how combining these techniques with over-sampling affects model classification performance. Outliers are detected and removed from the minority and majority classes separately, and SMOTE (Synthetic Minority Over-sampling TEchnique) is then applied to balance the class samples. Through experimental analysis, the study compares over-sampling after outlier removal with direct over-sampling and analyzes the impact of outlier detection on model classification performance.
Seven binary imbalanced datasets from the KEEL-Dataset Repository were selected for the experiments. Four outlier detection methods were tested: LOF (Local Outlier Factor), iForest (Isolation Forest), MCD (Minimum Covariance Determinant), and OCSVM (One-Class Support Vector Machine). Three classifiers were used: SVM (Support Vector Machine), Random Forest, and LightGBM. The study observed the impact on model performance after removing outliers from the majority and minority classes and then using SMOTE to balance the datasets.
The experimental results showed that removing outliers from the minority class did not improve model performance and even caused a decline. In contrast, removing outliers from the majority class had a positive impact, with LOF providing the best improvement. These findings suggest that for addressing class imbalance, detecting and removing outliers from the majority class combined with SMOTE over-sampling is an effective strategy to improve model classification performance.
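The strategy the results favour (remove majority-class outliers, then over-sample the minority class) can be sketched in plain Python. This is an illustrative sketch only, not the thesis's implementation: the helper names (`knn`, `lof_scores`, `remove_outliers`, `smote`), the `k` and `drop_fraction` values, and the toy data are all assumptions, and the LOF version ignores distance ties for brevity.

```python
import math
import random


def knn(points, i, k):
    """Return (distance, index) pairs for the k nearest neighbours of points[i]."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    return dists[:k]


def lof_scores(points, k=3):
    """Local Outlier Factor per point; values well above 1 suggest an outlier."""
    neigh = [knn(points, i, k) for i in range(len(points))]
    k_dist = [n[-1][0] for n in neigh]  # distance to the k-th neighbour
    lrd = []  # local reachability density
    for n in neigh:
        # reach-dist(a, b) = max(k-distance(b), d(a, b))
        reach = [max(k_dist[j], d) for d, j in n]
        lrd.append(len(reach) / max(sum(reach), 1e-12))
    # LOF = mean density of the neighbours divided by the point's own density
    return [
        sum(lrd[j] for _, j in n) / (len(n) * lrd[i])
        for i, n in enumerate(neigh)
    ]


def remove_outliers(points, drop_fraction=0.1, k=3):
    """Drop the highest-LOF points; drop_fraction is an assumed setting."""
    scores = lof_scores(points, k)
    n_keep = len(points) - int(len(points) * drop_fraction)
    keep = sorted(range(len(points)), key=lambda i: scores[i])[:n_keep]
    return [points[i] for i in sorted(keep)]


def smote(minority, n_new, k=3, seed=0):
    """SMOTE-style over-sampling: interpolate towards a random near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        _, j = rng.choice(knn(minority, i, k))
        gap = rng.random()  # position along the segment between the two points
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j]))
        )
    return synthetic


# Toy data (hypothetical): one majority cluster with a planted outlier.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)] + [(10.0, 10.0)]
minority = [(rng.gauss(4, 0.5), rng.gauss(4, 0.5)) for _ in range(6)]

clean_majority = remove_outliers(majority, drop_fraction=0.1)
synthetic = smote(minority, len(clean_majority) - len(minority))
balanced_minority = minority + synthetic

print(len(clean_majority), len(balanced_minority))  # 28 28
```

A classifier such as SVM, Random Forest, or LightGBM would then be trained on `clean_majority` together with `balanced_minority`. The same skeleton applies to the other detectors by swapping `lof_scores` for a different scoring function.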
Keywords ★ Machine learning
★ Class imbalance
★ Outlier detection
★ Over-sampling
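Table of Contents Abstract (Chinese)
Abstract (English)
Acknowledgments
Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
2. Literature Review
2-1 Over-sampling Techniques
2-2 Combined Application of Outlier Detection and Over-sampling
2-3 Outlier Detection Techniques
2-3-1 LOF (Local Outlier Factor)
2-3-2 iForest (Isolation Forest)
2-3-3 MCD (Minimum Covariance Determinant)
2-3-4 OCSVM (One-Class Support Vector Machine)
2-4 Classifiers
2-4-1 SVM (Support Vector Machine)
2-4-2 Random Forest
2-4-3 LightGBM (Light Gradient-Boosting Machine)
3. Research Methodology
3-1 Research Datasets
3-2 Data Preprocessing
3-3 Training and Test Set Split
3-4 Experimental Parameter Settings and Methods
3-5 Outlier Detection Targets and Procedure
3-6 Evaluation Metrics
3-7 Effect of Removing Minority-Class Outliers on Over-sampling
3-8 Effect of Removing Majority-Class Outliers on Over-sampling
4. Experimental Results and Analysis
4-1 Effect of Over-sampling on Model Predictive Performance
4-2 Effect of Removing Minority-Class Outliers on Over-sampling
4-2-1 Per-Dataset Analysis of Outlier Detection
4-2-2 Average Predictive Performance Analysis of Outlier Detection
4-3 Effect of Removing Majority-Class Outliers on Over-sampling
4-3-1 Impact of Outlier Detection on Each Classifier
4-3-2 Analysis of the Best Outlier Detection Method
4-3-3 Best Classifier
4-3-4 Best-Performing Combination Compared with SMOTEENN
4-3-5 Analysis of Effects at Different Imbalance Ratios
4-4 Effect of Removing Outliers from Both Classes on Over-sampling
5. Conclusion
5-1 Conclusions and Contributions
5-2 Future Research and Suggestions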
Advisor Kuen-Liang Sue (蘇坤良) Approval Date 2024-7-29
