Thesis 111423038: Full Metadata Record

DC Field Language
DC.contributor資訊管理學系zh_TW
DC.creator陳映彤zh_TW
DC.creatorYing-Tung Chenen_US
dc.date.accessioned2024-07-09T07:39:07Z
dc.date.available2024-07-09T07:39:07Z
dc.date.issued2024
dc.identifier.urihttp://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=111423038
dc.contributor.department資訊管理學系zh_TW
DC.description國立中央大學zh_TW
DC.descriptionNational Central Universityen_US
dc.description.abstract現實世界的資料經常存在著類別不平衡(Class Imbalance)問題,在二元分類中,類別不平衡指的是兩類資料中其中一類的樣本數大於另一類的樣本數,使資料呈現偏態分布(Skewed Distribution)的情況,偏態分布的資料集通常有著樣本重疊(Overlapping)、樣本數少(Small Sample Size)、樣本分離(Small Disjuncts)特性,需要進行資料前處理才能有效地訓練模型,若不加以處理,可能導致分類器在預測時偏向於大類別資料,忽視小類別資料,而在醫療診斷、異常檢測、破產預測等許多領域,通常小類別資料更具有價值。 因此,本論文提出了一個基於分群的創新混合採樣CBHS(Cluster-Based Hybrid Sampling)方法,採用兩種不同的分群方法,針對小類資料進行分群,找出散落在資料空間中小類子群集,根據分群結果,結合兩種不同的增加少數法和兩種不同的減少多數法策略進行資料前處理,以降低大類資料與小類資料之間的類別不平衡比率,並採用三種不同分類器進行模型的訓練。欲探討CBHS方法是否能更有效處理偏態分布的三種特性,提升最後的分類效果,以及探討不同策略與分群方法的最佳選擇。 本論文使用來自KEEL網站的40個二元類別不平衡資料集進行實驗,以五折交叉驗證作為實驗驗證的方法,並採用ROC曲線下面積(Area Under Curve, AUC)作為模型的衡量指標。實驗結果顯示,CBHS方法在分類準確率(AUC)上優於Baseline方法,能有效解決偏態分布資料的樣本重疊、樣本數少及樣本分離特性,更好的解決類別不平衡問題。此外,將三種分類器中AUC最高的CBHS方法進行分類器端集成,則可進一步提升分類效果,其中VOTE(AP (SWO, LM)+RF)方法的表現最為優異。zh_TW
dc.description.abstractReal-world data often exhibit class imbalance. In binary classification, class imbalance means that one class has significantly more samples than the other, which produces a skewed distribution. Datasets with skewed distributions typically exhibit class overlap, small sample sizes, and small disjuncts, so they require preprocessing before a model can be trained effectively. Without such handling, a classifier may be biased toward the majority class and ignore the minority class, even though in many fields, such as medical diagnosis, anomaly detection, and bankruptcy prediction, the minority class is usually the more valuable one. This thesis therefore proposes a novel cluster-based hybrid sampling (CBHS) approach. CBHS applies two different clustering methods to the minority class to identify minority sub-clusters scattered across the data space. Based on the clustering results, it combines two over-sampling strategies with two under-sampling strategies during preprocessing to reduce the imbalance ratio between the majority and minority classes, and three different classifiers are then trained. The aim is to examine whether CBHS handles the three characteristics of skewed distributions more effectively and improves classification performance, and to determine the best choice of strategies and clustering methods. The experiments use 40 binary class-imbalanced datasets from the KEEL website, with 5-fold cross-validation as the validation method and the area under the ROC curve (AUC) as the evaluation metric. The results show that CBHS achieves a higher AUC than the Baseline method, which indicates that it handles class overlap, small sample sizes, and small disjuncts effectively and addresses the class imbalance problem better. Furthermore, combining the CBHS variants with the highest AUC across the three classifiers into a classifier-level ensemble further improves AUC, with the VOTE (AP (SWO, LM) + RF) method performing best.en_US
DC.subject資料探勘zh_TW
DC.subject機器學習zh_TW
DC.subject類別不平衡zh_TW
DC.subject資料重採樣zh_TW
DC.subjectdata miningen_US
DC.subjectmachine learningen_US
DC.subjectclass imbalanceen_US
DC.subjectdata resamplingen_US
DC.title一個基於分群之創新混合採樣法於類別不平衡資料集之應用zh_TW
dc.language.isozh-TWzh-TW
DC.titleA Novel Cluster-Based Hybrid Sampling Approach for Class Imbalanced Datasetsen_US
DC.type博碩士論文zh_TW
DC.typethesisen_US
DC.publisherNational Central Universityen_US
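
The abstract describes the general shape of cluster-based hybrid sampling: cluster the minority class, over-sample within the resulting sub-clusters, and under-sample the majority class to lower the imbalance ratio. The sketch below illustrates that general idea only and is not the thesis's CBHS implementation. The record does not name the clustering or sampling techniques, so plain k-means, SMOTE-like interpolation, random under-sampling, and a 1:1 target ratio are all assumptions, as are the function names `kmeans` and `hybrid_resample` and the toy data.

```python
import random
from collections import defaultdict


def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means (an assumed stand-in); returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # Move each center to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def hybrid_resample(majority, minority, k=2, seed=0):
    """Cluster the minority class, over-sample inside clusters, under-sample the majority."""
    rng = random.Random(seed)
    # Assumption: both classes are brought to the midpoint of their original sizes.
    target = (len(majority) + len(minority)) // 2
    clusters = defaultdict(list)
    for point, label in zip(minority, kmeans(minority, k, rng)):
        clusters[label].append(point)
    groups = list(clusters.values())

    new_minority = list(minority)
    while len(new_minority) < target:
        # SMOTE-like step: interpolate between two points of the same sub-cluster,
        # so synthetic samples never bridge two separate minority groups.
        group = rng.choice(groups)
        a, b = rng.choice(group), rng.choice(group)
        t = rng.random()
        new_minority.append(tuple(x + t * (y - x) for x, y in zip(a, b)))

    # Random under-sampling of the majority class (assumed strategy).
    new_majority = rng.sample(majority, min(target, len(majority)))
    return new_majority, new_minority


# Toy data: 90 majority points around the origin, 10 minority points in two small groups.
data_rng = random.Random(42)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(90)]
minority = ([(data_rng.gauss(3, 0.3), data_rng.gauss(3, 0.3)) for _ in range(5)]
            + [(data_rng.gauss(-3, 0.3), data_rng.gauss(-3, 0.3)) for _ in range(5)])

maj, mino = hybrid_resample(majority, minority)
print("imbalance ratio before:", len(majority) / len(minority))
print("imbalance ratio after: ", len(maj) / len(mino))
```

On this toy data the imbalance ratio drops from 9.0 to 1.0. Interpolating only within a sub-cluster is the design choice that relates to the small-disjuncts issue mentioned in the abstract; how the thesis actually handles it, and which classifiers and ensembles it trains afterwards, is not specified in this record.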
