Abstract: Imbalanced data is common in real-world applications such as equipment failure prediction and medical diagnosis. Because traditional machine learning models tend to favor the majority class, improving a classifier's ability to recognize the minority class has become a key challenge. Existing strategies for handling class imbalance fall into three broad categories: data-level, algorithm-level, and hybrid approaches. At the data level, however, the literature still lacks an in-depth exploration of how ensemble learning can be incorporated into data preprocessing; likewise, few studies have applied ensemble learning to the selection of multiple classifiers. To address these gaps, this study investigates the impact of ensemble learning on classification performance in both data preprocessing and classifier construction. Using 42 imbalanced datasets from the KEEL repository, two sets of experiments were designed: (1) twelve distinct data preprocessing workflows were constructed by combining three resampling algorithms (SMOTE, Cluster Centroids, and SMOTEENN) with four instance selection algorithms (ENN, DROP3, IPF, and CVCF); these workflows, covering both single and ensemble-based preprocessing approaches, were compared to determine the most effective preprocessing strategy for imbalanced data; (2) six dynamic selection algorithms (OLA, MLA, MCB, DES-KNN, KNORA-U, and DES-P) were integrated for multiple classifier construction to evaluate the synergistic effect of combining data-level and classifier-level ensembles. Experimental results show that a multi-intersection resampling approach enhances the diversity and quality of the training data and thereby improves classification performance; among all classifiers, Random Forest performed best overall. Regarding integration strategies, applying SMOTE followed by ENN, and combining SVM, CART, and KNN with the dynamic selection technique KNORA-U, achieved the highest AUC (0.863). For tasks prioritizing minority-class prediction, the recommended strategy is to apply IPF followed by a union of resampling approaches, combined with SVM, KNN, and Random Forest (or XGBoost), along with KNORA-U; this configuration achieved the best F1-measure (0.739). The final integration strategy can be selected according to the application scenario and predictive objective.