Abstract (English)
In recent years, advances in computing power and storage capacity have led many researchers to study data mining and big data, seeking to extract value from massive datasets and to develop innovative applications, such as using classifiers to determine the categories of articles. When building a classifier, more representative training data yields better results, so training samples are typically selected from the dataset and labeled manually by experts. However, hiring experts is costly and their output is limited, so the selected samples must be as representative as possible to maximize the utility of the training data. In other words, the purpose of this study is how to select the best training data from an unlabeled dataset under a constraint on the number of samples.
This study focuses on using unsupervised learning to select samples under a constraint on the number of samples. In this thesis, we first remove the outliers from the dataset, and then apply K-Means to find training data that cover all typical types in the dataset. Next, we apply Balanced K-Means within each K-Means cluster, allocating sub-clusters in proportion to the cluster's share of the dataset. Finally, we pick the "centroid" of each sub-cluster as the best training data and have experts label it. The selected training data are then used to build five different classifiers, and the classification performance of these classifiers measures the quality of the selected data. In other words, if the classifiers built from the selected data perform well, the proposed method can select the best training data under the sample-size constraint.
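For illustration, a minimal sketch of this selection pipeline is given below, assuming scikit-learn. The outlier detector (LocalOutlierFactor) and the use of plain KMeans in place of Balanced K-Means are stand-ins, not the exact components used in this thesis, and the proportional budget split is likewise an assumption.

```python
# Minimal sketch of the sample-selection pipeline described above.
# Assumptions (not from the thesis): LocalOutlierFactor stands in for the
# outlier-removal step, and plain KMeans approximates Balanced K-Means for
# the proportional sub-clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor

def select_training_samples(X, k, budget, random_state=0):
    """Return indices of roughly `budget` representative points from unlabeled data X."""
    # 1) Remove outliers (illustrative detector; the thesis may use a different one).
    inlier_mask = LocalOutlierFactor(n_neighbors=20).fit_predict(X) == 1
    idx = np.where(inlier_mask)[0]
    X_in = X[idx]

    # 2) Coarse K-Means to capture the typical types in the dataset.
    coarse = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_in)

    selected = []
    for c in range(k):
        members = np.where(coarse.labels_ == c)[0]
        # 3) Allocate samples to this cluster in proportion to its size.
        n_sub = max(1, round(budget * len(members) / len(X_in)))
        n_sub = min(n_sub, len(members))
        # 4) Sub-cluster (plain KMeans here; the thesis uses Balanced K-Means) and
        #    take the point nearest each sub-centroid as a "centroid" sample.
        sub = KMeans(n_clusters=n_sub, n_init=10,
                     random_state=random_state).fit(X_in[members])
        for centre in sub.cluster_centers_:
            d = np.linalg.norm(X_in[members] - centre, axis=1)
            selected.append(idx[members[np.argmin(d)]])
    return np.unique(np.array(selected))
```

The returned indices point at the unlabeled pool; the corresponding samples would then be handed to experts for labeling.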
Finally, the experimental results show that the proposed method performs well with KNN, Naïve Bayes, SVM, and MLP, but not with Random Forest. From this result, we find that a classifier that is not built on the concepts of space and distance achieves lower classification performance, because it does not match the design concept of our method. On the other hand, the proposed method can select the best training data under the sample-size constraint when the classifier considers all of the attributes.
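A hedged sketch of how the selected, expert-labeled samples could be used to compare the five classifiers named above is shown here; the hyperparameters and the accuracy metric are illustrative assumptions, not the thesis's exact settings.

```python
# Train each of the five classifier types on the selected samples and report
# test accuracy; default scikit-learn settings are used for illustration only.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def evaluate_selected_samples(X_train, y_train, X_test, y_test):
    """Fit each classifier on the expert-labeled selection and score it on held-out data."""
    models = {
        "KNN": KNeighborsClassifier(),
        "Naive Bayes": GaussianNB(),
        "SVM": SVC(),
        "MLP": MLPClassifier(max_iter=1000),
        "Random Forest": RandomForestClassifier(),
    }
    return {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
            for name, m in models.items()}
```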