Abstract (English) |
Cluster analysis is an important technique in machine learning. Clustering groups units with similar characteristics so that useful or implicit information can be learned from them. However, current mainstream clustering algorithms must analyze the whole dataset to obtain the best parameters, which makes large-scale datasets difficult to process.
This study proposes a distributed correlation-based clustering mechanism based on unsupervised learning. It rests on the observation that if neighboring data points in the same group are similar, they can be linked to further data points until a complete cluster is formed. During processing, a large-scale dataset can be split and distributed to multiple computers, which calculate the correlation between any two pieces of data in parallel; the results are then filtered and aggregated into clusters.
This study uses 2D graphics, Go (Weiqi) game analysis, and medical data as experimental data, with similarity calculations developed for each data type. The experimental results demonstrate the ability of this clustering mechanism to handle large-scale datasets. The mechanism offers advantages such as good execution performance, accuracy, variability, applicability, and ease of use. |
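The pipeline summarized in the abstract (split the pairwise comparisons, compute correlations in parallel, filter, then aggregate) can be illustrated with a minimal sketch. This is not the study's implementation. It assumes Euclidean distance as the similarity measure, a fixed `threshold` as the filter, and a union-find merge as the aggregation step, and it uses a thread pool as a stand-in for multiple computers. The names `related_pairs` and `cluster` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
import math


def related_pairs(points, pairs, threshold):
    """Filter step: keep index pairs whose Euclidean distance is within
    the threshold (assumed similarity test, not taken from the study)."""
    return [(i, j) for i, j in pairs
            if math.dist(points[i], points[j]) <= threshold]


def cluster(points, threshold, workers=4):
    """Group points by linking every pair that passes the filter."""
    # Split all index pairs into one chunk per worker. Threads stand in
    # for the separate computers described in the abstract.
    pairs = list(combinations(range(len(points)), 2))
    chunks = [pairs[k::workers] for k in range(workers)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda chunk: related_pairs(points, chunk, threshold), chunks)
        edges = [edge for part in results for edge in part]

    # Aggregation step: merge linked points with union-find (assumed).
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    groups = {}
    for idx in range(len(points)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


points = [(0, 0), (0, 1), (1, 1), (10, 10), (10, 11)]
clusters = cluster(points, threshold=1.5)
```

Because each chunk of pairs is evaluated independently, the filter step needs no shared state, which is what would allow the chunks to be sent to different machines. Only the final merge has to see all the surviving pairs.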