Abstract: Imbalanced data is common in real-world applications such as equipment failure prediction and medical diagnosis. Because traditional machine learning models tend to favor the majority class, improving a classifier's ability to recognize the minority class has become a key challenge. Existing strategies for handling class imbalance fall into three broad categories: data-level, algorithm-level, and hybrid approaches. At the data level, however, the literature still lacks an in-depth exploration of how ensemble learning can be incorporated into data preprocessing; likewise, few studies have applied ensemble learning to the selection of multiple classifiers. To address these gaps, this study investigates the impact of ensemble learning on classification performance in both data preprocessing and classifier construction. Using 42 imbalanced datasets from the KEEL repository, two sets of experiments were designed: (1) twelve distinct data preprocessing workflows were constructed by combining three resampling algorithms (SMOTE, Cluster Centroids, and SMOTEENN) with four instance selection algorithms (ENN, DROP3, IPF, and CVCF); these workflows, covering both single and ensemble-based preprocessing approaches, were compared to determine the most effective preprocessing strategy for imbalanced data; (2) six dynamic selection algorithms (OLA, MLA, MCB, DES-KNN, KNORA-U, and DES-P) were integrated for multiple classifier construction to evaluate the synergistic effect of combining data-level and classifier-level ensembles. Experimental results show that a multi-intersection resampling approach enhances the diversity and quality of the training data and thereby improves classification performance; among all classifiers, Random Forest performed best overall. Regarding integration strategies, applying SMOTE followed by ENN, and combining SVM, CART, and KNN with the dynamic selection technique KNORA-U, achieved the highest AUC (0.863). For tasks prioritizing minority-class prediction, the recommended strategy is to apply IPF followed by a union of resampling approaches, combined with SVM, KNN, and Random Forest (or XGBoost), along with KNORA-U; this configuration achieved the best F1-measure (0.739). The final integration strategy can be selected according to the application scenario and predictive objective.