NCU Institutional Repository: Item 987654321/98208


    Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98208


    Title: Research on Data-Level, Algorithm-Level, and Hybrid Methods for Class Imbalance Datasets
    Authors: Yu, Yan-Zhen (余晏禎)
    Contributors: Department of Information Management
    Keywords: Class Imbalance; Resampling; Instance Selection; Cost-sensitive Learning; Ensemble Learning
    Date: 2025-06-20
    Issue Date: 2025-10-17 12:29:36 (UTC+8)
    Publisher: National Central University
    Abstract: Class imbalance is very common in real-world data, producing skewed distributions that are often accompanied by overlapping samples, small sample sizes, and small disjuncts. To address it, previous studies have proposed data-level, algorithm-level, and hybrid strategies; however, most existing work examines only one strategy or a combination of two, and, particularly with respect to handling noisy samples, no prior study has analyzed how all three strategies interact with noise handling.
    This study focuses on binary class-imbalanced data and designs two experimental frameworks. It combines existing imbalance-handling strategies with instance selection methods, paired with both single and ensemble classifiers, to examine whether outlier filtering applied after data balancing can further improve classification performance. It also investigates a hybrid method that integrates data-level and algorithm-level methods with ensemble classifiers, and further analyzes the effect of adding instance selection on top of this combination. Four instance selection methods (ENN, DROP3, IPF, CVCF) and three resampling techniques (SMOTE, ClusterCentroids, SMOTE-Tomek Links) are adopted, with cost-sensitive learning introduced at the algorithm level. Five classifiers (SVM, KNN, CART, RF, XGBoost) are used, and experiments are conducted on 43 binary class-imbalanced datasets provided by the KEEL repository under 5-fold cross-validation.
    Results show that the hybrid method combining data-level and algorithm-level strategies with ensemble classifiers delivers stable classification performance and practical potential: pairing resampling with cost-sensitive learning yields the best results, with hybrid sampling showing the most pronounced improvement. On AUC, adding instance selection further improves overall performance. SMOTE-Tomek Links with cost-sensitive learning and the RF classifier performs best on F1-Score, while SMOTE-Tomek Links combined with ENN and the KNN classifier, with or without cost-sensitive learning, achieves the best AUC.
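    As a rough illustration of the hybrid pipeline summarized above, the following Python sketch (not part of the thesis) pairs SMOTE-Tomek Links resampling with a cost-sensitive Random Forest and evaluates it with stratified 5-fold cross-validation on a synthetic placeholder dataset. It assumes the imbalanced-learn and scikit-learn APIs; the dataset parameters are illustrative rather than drawn from the KEEL benchmarks.

        # Hypothetical sketch (not from the thesis): one hybrid configuration,
        # SMOTE-Tomek Links resampling (data level) plus a cost-sensitive
        # Random Forest (algorithm level), scored with F1 and AUC under
        # stratified 5-fold cross-validation.
        import numpy as np
        from imblearn.combine import SMOTETomek
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import f1_score, roc_auc_score
        from sklearn.model_selection import StratifiedKFold

        # Placeholder binary dataset with a roughly 9:1 class ratio; the thesis
        # instead uses 43 KEEL binary class-imbalanced datasets.
        X, y = make_classification(n_samples=1000, n_features=20,
                                   weights=[0.9, 0.1], random_state=42)

        f1_scores, auc_scores = [], []
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        for train_idx, test_idx in cv.split(X, y):
            X_train, y_train = X[train_idx], y[train_idx]
            X_test, y_test = X[test_idx], y[test_idx]

            # Data level: oversample with SMOTE, then remove Tomek links.
            X_res, y_res = SMOTETomek(random_state=42).fit_resample(X_train, y_train)

            # Algorithm level: cost-sensitive learning via class weights in an
            # ensemble classifier.
            clf = RandomForestClassifier(n_estimators=100,
                                         class_weight="balanced",
                                         random_state=42)
            clf.fit(X_res, y_res)

            f1_scores.append(f1_score(y_test, clf.predict(X_test)))
            auc_scores.append(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

        print(f"F1-Score: {np.mean(f1_scores):.3f}, AUC: {np.mean(auc_scores):.3f}")

    The instance selection step studied in the thesis could be sketched analogously, for example by passing the resampled training folds through imblearn.under_sampling.EditedNearestNeighbours before fitting the classifier.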
    Appears in Collections: [Graduate Institute of Information Management] Electronic Thesis & Dissertation

    Files in This Item:

    File          Description    Size    Format
    index.html    -              0Kb     HTML


    All items in NCUIR are protected by copyright, with all rights reserved.

