Abstract (English) |
With advances in technology and the rise of big data, the importance of "information" has become increasingly recognized. Many scholars have therefore entered the field of data mining, hoping to uncover the value hidden in large volumes of data and to devise innovative uses for it, such as using classifiers to determine the categories of articles. For a classifier, however, more comprehensive training data generally yields better results. When building a classifier, the data must be labeled manually, and because articles and paragraphs vary in length, the labeling cost varies widely.
This study uses unsupervised learning to select samples, assigning each data item a selection cost so that the total cost of the final selection stays within a limit. In this thesis, the Bisecting K-means and Hierarchical Clustering algorithms are used to select data in two ways: best points, and best points under cost considerations. The selected training data are then used to build five different classifiers, in order to measure the classification performance of classifiers built from the selected data.
Finally, the experimental results show that, compared with random selection, each of the five classifiers shows advantages in different areas when built from the selected data. The method proposed in this thesis can select higher-quality, more representative data from unlabeled data without exceeding the budget. If these data are then handed to experts for labeling, the labeling cost drops significantly and a better result is obtained. |
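The idea of choosing "best points under cost considerations" can be illustrated with a minimal sketch. This is not the thesis's actual procedure, which the abstract does not spell out. It assumes that cluster labels have already been produced by some clustering step (for example Bisecting K-means or hierarchical clustering), that a "best point" is the one nearest its cluster centroid in Euclidean distance, and that clusters are visited in turn until the budget is used up. The names `centroid` and `select_under_budget` are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def select_under_budget(points, costs, labels, budget):
    """Greedy sketch (an assumption, not the thesis's algorithm).

    Visits clusters in turn; in each cluster, takes the point nearest the
    centroid whose labeling cost still fits in the remaining budget.
    Returns the chosen indices and the total cost spent.
    """
    # Group point indices by their (precomputed) cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    # Rank each cluster's members from most to least central.
    ranked = {}
    for lab, members in clusters.items():
        c = centroid([points[i] for i in members])
        ranked[lab] = sorted(members, key=lambda i: math.dist(points[i], c))

    chosen, spent = [], 0.0
    progress = True
    while progress:
        progress = False
        for lab in sorted(ranked):
            queue = ranked[lab]
            while queue:
                i = queue.pop(0)
                # The budget only shrinks, so a point that does not fit now
                # never fits later and can be discarded.
                if spent + costs[i] <= budget:
                    chosen.append(i)
                    spent += costs[i]
                    progress = True
                    break
    return chosen, spent

# Toy data: two clusters of three points; the cost stands in for text length.
points = [(0, 0), (0, 2), (0, 1), (10, 10), (10, 12), (10, 11)]
labels = [0, 0, 0, 1, 1, 1]
costs = [1, 1, 5, 2, 2, 2]
chosen, spent = select_under_budget(points, costs, labels, budget=4)
```

In this toy run the most central point of the first cluster costs more than the whole budget, so the sketch falls back to the next most central point that fits. Dropping the budget check would give the plain "best points" variant.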