Abstract: | In the era of big data, turning raw data into useful information requires proper pre-processing; without it, noisy data can degrade the predictive ability of the trained model. Previous research has shown that instance selection methods can effectively pick out the representative data in a dataset and thereby improve model performance and accuracy. However, these studies rarely discuss whether, for multi-class datasets, different processing methods can improve the effectiveness of instance selection. This thesis therefore examines how applying the multi-class classification processing methods proposed here before instance selection affects model building. This study proposes three multi-class classification processing methods for multi-class datasets: All versus All (AvA), One versus All (OvA), and One versus One (OvO). They are paired with three instance selection methods: Instance-based learning algorithm 3 (IB3), Decremental reduction optimization procedure 3 (DROP3), and Genetic algorithm (GA).
Support vector machine (SVM) and the k-nearest neighbors classification algorithm (KNN) serve as classifiers to evaluate which combination yields the best trained model. In the second stage of the experiment, feature selection is added to examine how feature selection, combined with instance selection after multi-class classification processing, affects model building. This study uses 20 multi-class datasets of different types from UCI and KEEL and tests different combinations of multi-class classification processing and instance selection methods. The experimental results show that the combination of OvO processing with the DROP3 instance selection algorithm, using KNN as the classifier, obtains the best average results. Compared with the baseline KNN model built without instance selection, the AUC improves by 6.6%. |
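
The idea of applying OvO processing before instance selection can be illustrated with a minimal sketch. It is not the thesis implementation, and several parts are assumptions made only for illustration: Wilson-style edited nearest neighbor stands in for DROP3, an instance is dropped if any pairwise problem it takes part in rejects it (the abstract does not say how the pairwise results are recombined), and the names `knn_predict`, `edited_nn`, `ovo_select`, and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math

def knn_predict(train, query, k=3):
    """Majority label among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def edited_nn(subset, k=3):
    """Stand-in selector (Wilson editing, NOT DROP3): return the indices of
    points whose label agrees with the k-NN vote of the remaining points."""
    kept = set()
    for i, x, y in subset:
        others = [(p, lab) for j, p, lab in subset if j != i]
        if knn_predict(others, x, k) == y:
            kept.add(i)
    return kept

def ovo_select(data, k=3):
    """Assumed OvO scheme: run the selector on every pair of classes and
    drop any instance that at least one pairwise problem rejects."""
    indexed = list(enumerate(data))
    classes = sorted({label for _, label in data})
    removed = set()
    for a, b in combinations(classes, 2):
        pair = [(i, x, y) for i, (x, y) in indexed if y in (a, b)]
        removed |= {i for i, _, _ in pair} - edited_nn(pair, k)
    return [item for i, item in indexed if i not in removed]

# Toy data: three clean clusters plus one mislabeled point inside cluster "b".
data = (
    [((x, y), "a") for x in (0.0, 1.0) for y in (0.0, 1.0)]
    + [((x, y), "b") for x in (10.0, 11.0) for y in (0.0, 1.0)]
    + [((x, y), "c") for x in (0.0, 1.0) for y in (10.0, 11.0)]
    + [((10.5, 0.5), "a")]  # noisy instance
)
reduced = ovo_select(data)

query = (10.4, 0.4)
print(len(data), len(reduced))
print(knn_predict(data, query, k=1), knn_predict(reduced, query, k=1))
```

On this toy data the mislabeled point is rejected in the a-versus-b problem, so a 1-NN classifier built on the reduced set is no longer misled by it. An OvA variant would differ only in how the subproblems are formed: each class against the union of all others.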