Thesis/Dissertation 103423011: Complete Metadata Record

DC field: value [language]
dc.contributor: Department of Information Management [zh_TW]
dc.creator: 張景翔 [zh_TW]
dc.creator: Jing-Shang Jhang [en_US]
dc.date.accessioned: 2016-07-01T07:39:07Z
dc.date.available: 2016-07-01T07:39:07Z
dc.date.issued: 2016
dc.identifier.uri: http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=103423011
dc.contributor.department: Department of Information Management [zh_TW]
dc.description: National Central University [zh_TW]
dc.description: National Central University [en_US]
dc.description.abstract: The class imbalance problem has long been an important issue in data mining. It occurs when the number of samples in one class of the training dataset is far smaller than in the other classes: a classification model built from such data tends to misclassify minority-class samples into the majority class in pursuit of high overall accuracy. The problem is increasingly common in real-world domains such as medical diagnosis, fault detection, and face recognition. Existing solutions include data-level, algorithm-level, and cost-sensitive approaches, of which data-level preprocessing is the most common: the class sizes are balanced either by under-sampling the majority class or by over-sampling the minority class. Both have drawbacks. Under-sampling may discard valuable data; over-sampling may introduce noisy samples, raises the time cost of training the classifier because of the added samples, and easily causes overfitting. This thesis proposes cluster-based sampling methods built on the k-means clustering algorithm that preprocess the majority-class samples of the training dataset. The purpose of clustering is to select representative samples to replace the original data, balancing the number of samples across classes while reducing the probability of uneven sampling. The experiments cover 44 small datasets and 2 large datasets with five classifiers (C4.5, SVM, MLP, k-NN with k=5, and Naïve Bayes) combined with ensemble learning. We compare different cluster-based sampling strategies, classifiers, and settings of the cluster number k, and analyze AUC results over three imbalance-ratio intervals to find the best configuration under cluster-based sampling, comparing it with traditional methods and ensemble methods from the literature. The results show that, among all combinations, preprocessing with the nearest neighbors of the cluster centers combined with the MLP classifier is the best choice: on both small and large datasets it yields the best and most stable overall AUC. [zh_TW]
dc.description.abstract: The class imbalance problem is an important issue in data mining. This problem occurs when the number of samples representing one class is far smaller than the numbers for the other classes. A classification model built from a class-imbalanced dataset is likely to misclassify most minority-class samples into the majority class in order to maximize overall accuracy. The problem is present in many real-world applications, such as fault diagnosis, medical diagnosis, and face recognition. One of the most popular families of solutions is data sampling: under-sampling the majority class or over-sampling the minority class to balance the dataset. Under-sampling balances the class distribution by eliminating majority-class samples, but it may discard useful data. Conversely, over-sampling replicates minority-class samples, which raises training cost and increases the likelihood of overfitting. We therefore propose several resampling methods based on the k-means clustering technique. In order to decrease the probability of uneven resampling, we select representative samples to replace the majority-class samples in the training dataset. Our experiments use 44 small class-imbalanced datasets and two large-scale datasets to build five types of classification models: C4.5, SVM, MLP, k-NN (k=5), and Naïve Bayes. In addition, a classifier ensemble algorithm is employed. The research compares AUC results between different resampling techniques, different models, and different numbers of clusters, and additionally divides the imbalance ratio into three intervals. We identify the best configuration of our experiments and compare it with methods from the literature. The experimental results show that combining the MLP classifier with clustering-based under-sampling using the nearest neighbors of the cluster centers performs best in terms of AUC on both small and large-scale datasets. [en_US]
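The clustering-based under-sampling pipeline described in the abstract can be sketched as follows. This is a minimal illustration assuming scikit-learn, with synthetic data and illustrative parameter choices; the thesis uses its own 44 datasets, classifier settings, and cluster counts, so everything below is an assumption for demonstration only. The majority class is clustered with k-means, each cluster centroid is replaced by its nearest real sample, and the resulting balanced set trains an MLP evaluated by AUC:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances_argmin, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def cluster_undersample(X_maj, n_clusters, seed=0):
    """Reduce the majority class to the nearest real sample of each k-means centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_maj)
    # for each centroid, find the index of the closest actual majority sample
    idx = pairwise_distances_argmin(km.cluster_centers_, X_maj)
    return X_maj[idx]

# synthetic imbalanced data (class 1 is the minority) -- illustrative only
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_maj, X_min = X_tr[y_tr == 0], X_tr[y_tr == 1]
# set the cluster count k to the minority-class size so the classes end up balanced
X_maj_red = cluster_undersample(X_maj, n_clusters=len(X_min))

X_bal = np.vstack([X_maj_red, X_min])
y_bal = np.hstack([np.zeros(len(X_maj_red)), np.ones(len(X_min))])

# MLP was the best-performing classifier in the thesis; settings here are arbitrary
clf = MLPClassifier(max_iter=1000, random_state=0).fit(X_bal, y_bal)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC on held-out data: {auc:.3f}")
```

Using the nearest real neighbor of each centroid (rather than the synthetic centroid itself) keeps the reduced majority set inside the original data distribution, which is the design choice the abstract highlights.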
dc.subject: class imbalance [zh_TW]
dc.subject: data mining [zh_TW]
dc.subject: classification [zh_TW]
dc.subject: clustering [zh_TW]
dc.subject: class imbalance [en_US]
dc.subject: data mining [en_US]
dc.subject: classification [en_US]
dc.subject: clustering [en_US]
dc.title: A Study on Clustering-Based Sampling for the Class Imbalance Problem [zh_TW]
dc.language.iso: zh-TW [zh-TW]
dc.title: Clustering-Based Under-sampling in Class Imbalanced Data [en_US]
dc.type: Thesis/Dissertation [zh_TW]
dc.type: thesis [en_US]
dc.publisher: National Central University [en_US]
