Abstract (English)
With the progress of information technology, people benefit from efficient data collection and its related applications. In addition, as the number and size of online databases grow rapidly, the ability to retrieve useful information from these large databases effectively and efficiently is becoming more important. This has become the central research issue of data mining.
Data mining is the process of applying a variety of statistical analysis or machine learning techniques to large amounts of data in order to extract the hidden values of features and their relevance to various applications. It helps people derive novel knowledge from past experience, so that they can make decisions or forecast trends. However, the retrieval process involves some problems that should be considered, such as missing values.
Missing values can be briefly defined as attribute values that are absent from a chosen dataset. For example, when registering on a website, users have to fill in several fields sequentially, such as "Name" and "Birthday". However, for various reasons, such as data entry errors or deliberate concealment of information, some values may be lost during this process, leaving the data incomplete or erroneous. Moreover, missing values can reduce the efficiency and accuracy of data mining results. Therefore, people apply various methods to impute missing values, and supervised learning algorithms are one of the common approaches to the missing value imputation problem.
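To make the idea concrete, the following is a minimal sketch of supervised imputation: the incomplete attribute is treated as the prediction target, a model is trained on the complete records, and its predictions fill in the missing entries. The use of scikit-learn and a decision tree here is an illustrative assumption, not the thesis's actual implementation.

```python
# Minimal sketch (not the thesis code): impute a missing attribute by
# treating it as the target of a supervised learner trained on the
# complete rows. scikit-learn is an assumed library choice.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy dataset: each row is a record; the last column has a missing entry.
X = np.array([
    [25, 50000, 0],       # complete row
    [32, 64000, 1],       # complete row
    [41, 58000, 0],       # complete row
    [29, 61000, np.nan],  # missing value to impute
])

features, target = X[:, :2], X[:, 2]   # views into X
complete = ~np.isnan(target)

# Train on complete rows, then predict the missing entries in place.
model = DecisionTreeClassifier().fit(features[complete], target[complete])
target[~complete] = model.predict(features[~complete])
print(X)  # the np.nan entry is now a predicted class label
```

The same pattern generalizes to any of the learners compared below: only the model class changes, while the "incomplete attribute as target" framing stays the same.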
In this thesis, I conduct experiments to compare the efficiency and accuracy of five well-known supervised learning algorithms, namely Bayes, SVM, MLP, CART, and k-NN, over categorical, numerical, and mixed types of datasets. This allows us to identify which imputation method performs better for which data type and at which missing rate. The experimental results show that the CART method is the best choice for missing value imputation: it not only requires relatively less imputation time, but also enables the classifier to provide higher classification accuracy.
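The comparison protocol described above can be sketched as follows: inject missing values at a chosen rate, impute them with each candidate method, and record both the imputation time and the downstream classification accuracy. This sketch covers only two of the five methods and assumes scikit-learn and the Iris dataset purely for illustration; the thesis's actual datasets, tools, and parameters may differ.

```python
# Hedged sketch of the experimental comparison: missing-rate injection,
# per-method imputation timing, and downstream classification accuracy.
import time
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor      # CART-style imputer
from sklearn.neighbors import KNeighborsRegressor   # k-NN imputer
from sklearn.svm import SVC                         # downstream classifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

missing_rate = 0.2                      # e.g. corrupt 20% of one attribute
col = 0                                 # attribute to corrupt
mask = rng.random(len(X)) < missing_rate

for name, imputer in [("CART", DecisionTreeRegressor()),
                      ("k-NN", KNeighborsRegressor())]:
    Xi = X.copy()
    start = time.perf_counter()
    other = np.delete(Xi, col, axis=1)              # remaining attributes
    imputer.fit(other[~mask], Xi[~mask, col])       # learn from complete rows
    Xi[mask, col] = imputer.predict(other[mask])    # fill missing entries
    elapsed = time.perf_counter() - start

    # Downstream accuracy: train/test a classifier on the imputed data.
    Xtr, Xte, ytr, yte = train_test_split(Xi, y, random_state=0)
    acc = SVC().fit(Xtr, ytr).score(Xte, yte)
    print(f"{name}: imputation time {elapsed:.4f}s, accuracy {acc:.3f}")
```

Sweeping `missing_rate` over several values and repeating the loop for all five learners yields the kind of comparison summarized in the abstract's conclusion.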