Master's/Doctoral Thesis 109423007 — Detailed Record




Author 楊博翰 (Po-Han Yang)   Department Information Management
Title 異常值偵測對增進類別不平衡預測的效能評估
(The Effectiveness Evaluation of Outlier Detection in Improving the Predictions of Imbalanced Classes)
Full text: available in the repository system after 2029-06-30
Abstract (Chinese) This study investigates the application of outlier detection techniques to class-imbalanced datasets and evaluates how combining them with over-sampling affects model predictive performance. Outliers are detected and removed separately from the minority class and the majority class, after which SMOTE (Synthetic Minority Over-sampling Technique) is applied to balance the number of samples in the two classes. Through experimental analysis, the study compares outlier handling followed by over-sampling against direct over-sampling, and analyzes the impact of outlier detection on model predictive performance.
For the experimental design, seven binary class-imbalanced datasets from the KEEL-Dataset Repository (Knowledge Extraction based on Evolutionary Learning-Dataset Repository) were selected, together with four representative outlier detection methods: LOF (Local Outlier Factor), iForest (Isolation Forest), MCD (Minimum Covariance Determinant), and OCSVM (One-Class Support Vector Machine). Three classifiers were used: SVM (Support Vector Machine), Random Forest, and LightGBM. The experiments observe how removing outliers from the minority class or the majority class, and then over-sampling the dataset to balance with SMOTE, affects model predictive performance.
The experimental results show that removing minority-class outliers not only fails to improve predictive performance but actually degrades it. In contrast, removing majority-class outliers improves performance, with LOF providing the largest gain. These findings indicate that, when addressing class imbalance, detecting and removing outliers from the majority class and combining this with SMOTE over-sampling is an effective strategy for improving model predictive performance.
Abstract (English) This study explores the application of outlier detection techniques in handling imbalanced datasets and evaluates the impact of combining these techniques with over-sampling on model classification performance. The research focuses on detecting and removing outliers from both minority and majority classes, followed by over-sampling using SMOTE (Synthetic Minority Over-sampling Technique) to balance the class samples. Through experimental analysis, this study compares the effects of outlier processing and direct over-sampling, analyzing the impact of outlier detection on model classification performance.
Seven binary imbalanced datasets from the KEEL-Dataset Repository were selected for the experiments. Four outlier detection methods were tested: LOF (Local Outlier Factor), iForest (Isolation Forest), MCD (Minimum Covariance Determinant), and OCSVM (One-Class Support Vector Machine). Three classifiers were used: SVM (Support Vector Machine), Random Forest, and LightGBM. The study observed the impact on model performance after removing outliers from the majority and minority classes and then using SMOTE to balance the datasets.
The experimental results showed that removing outliers from the minority class did not improve model performance and even caused a decline. In contrast, removing outliers from the majority class had a positive impact, with LOF providing the best improvement. These findings suggest that for addressing class imbalance, detecting and removing outliers from the majority class combined with SMOTE over-sampling is an effective strategy to improve model classification performance.
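The winning strategy above (detect and remove majority-class outliers with LOF, then over-sample the minority class to balance, then train a classifier) can be sketched as follows. This is a minimal illustration on synthetic data, not the thesis's experimental code: the dataset, parameter values, and the hand-rolled SMOTE-style interpolation are all assumptions made for the sketch (the study itself used KEEL datasets and the standard SMOTE implementation).

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced data (assumed: 2 features, ~10:1 class ratio).
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),   # majority (label 0)
               rng.normal(2.5, 0.8, size=(50, 2))])   # minority (label 1)
y = np.array([0] * 500 + [1] * 50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Step 1: remove outliers from the MAJORITY class only, using LOF.
maj = X_tr[y_tr == 0]
keep = LocalOutlierFactor(n_neighbors=20).fit_predict(maj) == 1
maj_clean = maj[keep]

# Step 2: SMOTE-style over-sampling of the minority class:
# interpolate between each sampled minority point and one of its
# 5 nearest minority neighbours (hand-rolled sketch, not imblearn).
mino = X_tr[y_tr == 1]
n_new = len(maj_clean) - len(mino)
_, idx = NearestNeighbors(n_neighbors=6).fit(mino).kneighbors(mino)
base = rng.integers(0, len(mino), n_new)
neigh = idx[base, rng.integers(1, 6, n_new)]  # column 0 is the point itself
gap = rng.random((n_new, 1))
synth = mino[base] + gap * (mino[neigh] - mino[base])

# Step 3: train a classifier on the cleaned, balanced training set.
X_bal = np.vstack([maj_clean, mino, synth])
y_bal = np.array([0] * len(maj_clean) + [1] * (len(mino) + n_new))
clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
f1 = f1_score(y_te, clf.predict(X_te))
print("Minority-class F1 on the held-out test set:", round(f1, 3))
```

The key design point mirrors the thesis finding: LOF is applied only to the majority class before balancing, so noisy majority points near the class boundary are discarded rather than being reinforced by the subsequent over-sampling.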
Keywords ★ Machine learning
★ Class imbalance
★ Outlier detection
★ Over-sampling
★ SMOTE
Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
2. Literature Review
2-1 Over-sampling Techniques
2-2 Combining Outlier Detection with Over-sampling
2-3 Outlier Detection Techniques
2-3-1 LOF (Local Outlier Factor)
2-3-2 iForest (Isolation Forest)
2-3-3 MCD (Minimum Covariance Determinant)
2-3-4 OCSVM (One-Class Support Vector Machine)
2-4 Classifiers
2-4-1 SVM (Support Vector Machine)
2-4-2 Random Forest
2-4-3 LightGBM (Light Gradient-Boosting Machine)
3. Research Method
3-1 Datasets
3-2 Data Preprocessing
3-3 Training/Test Set Split
3-4 Experimental Parameter Settings and Methods
3-5 Outlier Detection Targets and Procedure
3-6 Evaluation Metrics
3-7 Effect of Removing Minority-Class Outliers on Over-sampling
3-8 Effect of Removing Majority-Class Outliers on Over-sampling
4. Experimental Results and Analysis
4-1 Effect of Over-sampling on Model Predictive Performance
4-2 Effect of Removing Minority-Class Outliers on Over-sampling
4-2-1 Per-Dataset Analysis of Outlier Detection
4-2-2 Average Predictive Performance of Outlier Detection
4-3 Effect of Removing Majority-Class Outliers on Over-sampling
4-3-1 Impact of Outlier Detection on Each Classifier
4-3-2 Best Outlier Detection Method
4-3-3 Best Classifier
4-3-4 Best-Performing Combination vs. SMOTEENN
4-3-5 Analysis across Different Imbalance Ratios
4-4 Effect of Removing Outliers from Both Classes on Over-sampling
5. Conclusion
5-1 Conclusions and Contributions
5-2 Future Research and Suggestions
Advisor 蘇坤良 (Kuen-Liang Sue)   Approval Date 2024-07-29
