

    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/85062


    Title: Instance Selection Methods in Multi-Class Classification Datasets: All versus All, One versus All, and One versus One
    Author: Liao, Pei-Qi (廖珮祺)
    Contributors: Department of Information Management
    Keywords: Data pre-processing; Instance selection; Feature selection; Multi-class dataset; Data mining
    Date: 2021-02-19
    Upload time: 2021-03-18 17:32:32 (UTC+8)
    Publisher: National Central University
    Abstract: The era of big data has arrived. When these data are turned into useful information without proper pre-processing, the noise they contain may degrade the predictive ability of the trained model. Previous research has shown that instance selection methods can effectively pick out the representative instances in a dataset, improving the performance and accuracy of the model. However, that work rarely discusses whether, when the dataset is multi-class, different processing strategies can make instance selection more effective. This thesis therefore examines how first applying the multi-class processing methods proposed in this study to a multi-class dataset, and then performing instance selection, affects model building.
      This study proposes three multi-class processing schemes: All versus All (AvA), One versus All (OvA), and One versus One (OvO). Each is combined with three instance selection methods: the Instance-Based learning algorithm 3 (IB3), the Decremental Reduction Optimization Procedure 3 (DROP3), and a Genetic Algorithm (GA), with Support Vector Machines (SVM) and the k-nearest neighbors algorithm (KNN) as classifiers, to evaluate which combination builds the best model. In the second stage of the experiments, a feature selection method is added to examine how feature selection, combined with instance selection after multi-class processing, affects the resulting model.
      The experiments use 20 multi-class datasets of different types from UCI and KEEL, run through the different combinations of multi-class processing and instance selection methods. The results show that the OvO processing scheme combined with the DROP3 instance selection algorithm, using KNN as the classifier, achieves the best average performance: compared with a KNN baseline built without instance selection, the AUC improves by 6.6%.
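    The following is a minimal, illustrative sketch (not the implementation used in the thesis), assuming scikit-learn and one of its bundled demo datasets. It shows the general shape of the pipeline the abstract describes: a One versus One decomposition of the training data, an instance-selection step applied to each pairwise subset (here a hypothetical select_instances placeholder rather than DROP3 or IB3), and a KNN classifier evaluated with a macro-averaged AUC.

      import numpy as np
      from itertools import combinations
      from sklearn.datasets import load_wine
      from sklearn.model_selection import train_test_split
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.metrics import roc_auc_score

      def select_instances(X, y):
          # Hypothetical placeholder for an instance selection method such as IB3,
          # DROP3, or a GA; a real method would keep only representative instances.
          return np.ones(len(X), dtype=bool)

      X, y = load_wine(return_X_y=True)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

      # One versus One decomposition: run instance selection on every pair of classes
      # separately, then keep the union of the selected training instances.
      keep = np.zeros(len(X_tr), dtype=bool)
      for a, b in combinations(np.unique(y_tr), 2):
          pair_idx = np.where((y_tr == a) | (y_tr == b))[0]
          pair_keep = select_instances(X_tr[pair_idx], y_tr[pair_idx])
          keep[pair_idx[pair_keep]] = True

      # Train KNN on the reduced training set and score it with a macro-averaged AUC.
      knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr[keep], y_tr[keep])
      proba = knn.predict_proba(X_te)
      print("macro AUC:", roc_auc_score(y_te, proba, multi_class="ovo", average="macro"))

    The 6.6% AUC improvement reported above comes from the thesis's own experiments on the 20 UCI and KEEL datasets; this sketch only shows where a real selection method would plug in.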
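    As one concrete example of the selection methods named above, the sketch below outlines a simple genetic-algorithm-based instance selector. It is a toy version under stated assumptions, not the GA configuration used in the thesis: each chromosome is a binary keep/drop mask over the training instances, and fitness is the validation accuracy of a 1-NN classifier trained on the kept instances. A function like this could stand in for the select_instances placeholder in the previous sketch.

      import numpy as np
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.model_selection import train_test_split

      def ga_select_instances(X, y, pop_size=20, generations=30, mutation_rate=0.02, seed=0):
          # Toy GA for instance selection: chromosomes are boolean masks over the
          # training instances; fitness is 1-NN accuracy on a held-out validation split.
          rng = np.random.default_rng(seed)
          X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=seed)
          n = len(X_tr)
          pop = rng.integers(0, 2, size=(pop_size, n)).astype(bool)

          def fitness(mask):
              if mask.sum() < 2 or len(np.unique(y_tr[mask])) < 2:
                  return 0.0  # degenerate subsets cannot train a useful classifier
              knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr[mask], y_tr[mask])
              return knn.score(X_val, y_val)

          for _ in range(generations):
              scores = np.array([fitness(m) for m in pop])
              # Binary tournament selection: the fitter of two random individuals survives.
              idx = rng.integers(0, pop_size, size=(pop_size, 2))
              parents = pop[np.where(scores[idx[:, 0]] >= scores[idx[:, 1]], idx[:, 0], idx[:, 1])]
              # One-point crossover between consecutive parents.
              children = parents.copy()
              for i in range(0, pop_size - 1, 2):
                  cut = rng.integers(1, n)
                  children[i, cut:], children[i + 1, cut:] = parents[i + 1, cut:], parents[i, cut:]
              # Bit-flip mutation.
              children ^= rng.random(children.shape) < mutation_rate
              pop = children

          best = pop[np.argmax([fitness(m) for m in pop])]
          return X_tr[best], y_tr[best]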
    Appears in Collections: [Graduate Institute of Information Management] Theses & Dissertations

    Files in This Item:

    File          Description    Size    Format    Views
    index.html                   0 KB    HTML      208


    All items in NCUIR are protected by original copyright.
