Abstract: In the real world, data are rarely as "clean" as we assume, so data pre-processing is needed to ensure data quality. High-dimensional data may contain irrelevant or redundant features, as well as many continuous numeric attributes that are hard to understand and interpret; using such data directly can sharply degrade a model's predictive performance. Previous research shows that discretization, which converts numeric attributes into categorical ones, simplifies and smooths the data, reduces noise, helps avoid overfitting, and can improve model accuracy and efficiency. Feature selection is another pre-processing technique widely used in practice: it reduces computational complexity, extracts representative features, and improves prediction accuracy.
Few studies, however, have examined combining discretization and feature selection in the pre-processing pipeline. This thesis therefore investigates the optimal combination and ordering of these two pre-processing steps. The experiments use three popular feature selection methods: GA (Genetic Algorithm), DT (C4.5 decision tree), and PCA (Principal Components Analysis), together with four discretization methods: EWD (equal-width discretization), EFD (equal-frequency discretization), MDLP (Minimum Description Length Principle), and ChiMerge. The data come from 10 UCI datasets, with dimensionality ranging from 8 to 90 and classification problems containing 2 to 28 classes; methods are compared by the average accuracy of C5.0 and SVM classifiers. The empirical results show that MDLP is the best-performing discretization method, and that applying feature selection before discretization yields higher average accuracy than the reverse order. Regardless of which classifier is used (C5.0 or SVM), the most recommended combination in this thesis is C4.5 feature selection followed by MDLP discretization, which reaches an average classification accuracy of 80.1%.
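To make the two unsupervised discretization baselines named above concrete, a minimal sketch follows; the function names and the toy data are illustrative, not taken from the thesis, and real experiments would use a library implementation:

```python
import numpy as np

def equal_width_bins(values, k):
    """EWD: split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    edges = [lo + width * i for i in range(1, k)]  # k - 1 interior cut points
    # bin index = number of cut points the value meets or exceeds
    return [sum(v >= e for e in edges) for v in values]

def equal_frequency_bins(values, k):
    """EFD: place cut points at quantiles so each bin holds ~len(values)/k samples."""
    qs = np.quantile(values, [i / k for i in range(1, k)])
    return [int(sum(v > q for q in qs)) for v in values]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(equal_width_bins(data, 3))      # the outlier pushes most values into bin 0
print(equal_frequency_bins(data, 3))  # bin sizes stay balanced despite the outlier
```

The toy data illustrate why the choice matters: a single outlier stretches the equal-width intervals so that nearly all samples share one bin, while equal-frequency binning keeps the bins balanced.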