Feature selection is an important process in pattern recognition applications. Its purpose is to avoid degrading a classifier's performance: the removed features should be redundant, irrelevant, or of the least possible use. However, there is no prior study that compares different feature selection methods across different data types, such as categorical, numerical, and mixed-type datasets, in terms of classification performance. Therefore, in this thesis, three major feature selection methods are chosen: Information Gain (IG), Genetic Algorithm (GA), and Decision Tree (DT). The research aim is to compare the classification accuracy obtained with these feature selection methods over different types of datasets. We demonstrate the capability of the approach by extensive experiments on 40 real-world datasets from the UCI repository. In addition, six classification techniques are compared, including Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Decision Tree (DT), Artificial Neural Network (ANN), AdaBoost, and Bagging.
The experimental results show that the need for feature selection over categorical datasets is not strong, although bagging-based KNN and DT can improve performance. For mixed-type and numerical datasets, GA and DT perform better. In particular, if an MLP is used, there is no need to perform feature selection for numerical datasets. We demonstrate that different feature selection methods can increase the accuracy of some classification models.
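As a minimal illustration of the kind of comparison described above, the following sketch contrasts a classifier's cross-validated accuracy with and without a filter-style feature selection step. It assumes scikit-learn (not necessarily the thesis's actual toolchain) and uses `mutual_info_classif` as a stand-in for Information Gain on a numerical UCI dataset.

```python
# Hypothetical sketch: accuracy with vs. without an Information-Gain-style
# filter, using scikit-learn. mutual_info_classif estimates the mutual
# information between each feature and the class label, which plays the
# role of IG here; the dataset and parameter choices are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# A numerical UCI dataset with 30 features.
X, y = load_breast_cancer(return_X_y=True)

# Baseline: Decision Tree trained on all features.
baseline = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=5
).mean()

# Filter approach: keep the 10 features with the highest estimated
# information gain, then train the same classifier.
pipe = make_pipeline(
    SelectKBest(mutual_info_classif, k=10),
    DecisionTreeClassifier(random_state=0),
)
filtered = cross_val_score(pipe, X, y, cv=5).mean()

print(f"all features: {baseline:.3f}, top-10 by IG: {filtered:.3f}")
```

Wrapper (GA) and embedded (DT-based) selection would replace the `SelectKBest` step, but the surrounding evaluation loop stays the same, which is what makes the cross-method, cross-dataset comparison feasible.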