摘要(英) |
Data clustering is used in many domains widely. For example: data mining, document retrieval, image segmentation, pattern classification, etc. Traditional clustering algorithms are usually used for small-scale data analysis. At present, we usually have to deal with the large data, which cannot be dealt with in single computer. To solve these problems, many researchers attempt to design efficient parallel clustering algorithms for huge data.
In this paper we focus on Information-Theoretic Co-clustering (ITCC) which is a simultaneous clustering of the rows and columns based on mutual information between the clustered random variables subject to constraints on the number of row and column clusters. ITCC is widely used in many domains, such as text mining, social recommendation system, and bio-informatics, etc.
We propose a Parallel Information-Theoretic Co-Clustering (PITCC) algorithm based on MapReduce. Because we need to analyze huge data, we develop our algorithm on cloud computing platform based on Hadoop. MapReduce is a programming model which has been widely embraced by both academia and industry because of high scalability and easy use. We use the movie recommendation contest “CAMRa2011” dataset for our experiments, and evaluate our experiment results in terms of speedup, sizeup and scaleup. The experimental results demonstrate that the proposed algorithm is very powerful and efficient, and it can process large datasets on commodity hardware.
|
參考文獻 |
[1] Hadoop. http://hadoop.apache.org/core/.
[2] HBase. http://hadoop.apache.org/hbase/.
[3] Tom Write, “Hadoop: The Definitive Guide, 2nd Edition,” O’’Reilly (2011).
[4] Borthakur, D., “The Hadoop Distributed File System: Architecture and Design” (2007).
[5] Ghemawat, S., Gobioff, H., Leung, S. “The Google File System.” Symposium on Operating Systems Principles (2003).
[6] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. “Bigtable: A distributed storage system for structured data,” Operating Systems Design and Implementation (OSDI 2006).
[7] Dean, J., Ghemawat, S. “MapReduce: Simplified Data Processing on Large Clusters,” Operating Systems Design and Implementation (OSDI 2004).
[8] Dean, J., Ghemawat, S. “MapReduce: Simplified Data Processing on Large Clusters,” Communications of The ACM (2008).
[9] Jimmy Lin and Chris Dyer, “Data-Intensive Text Processing with MapReduce,” Morgan & Claypool Publishers (2010).
[10] Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., Kozyrakis, C., “Evaluating MapReduce for Multi-core and Multiprocessor Systems.” High-Performance Computer Architecture (HPCA 2007).
[11] Lammel, R. “Google’s MapReduce Programming Model - Revisited.” Science of Computer Programming (2008).
[12] Weizhong Zhao, Huifang Ma, and Qing He, “Parallel K-Means Clustering Based on MapReduce,” CloudCom. Lecture Notes in Computer Science (LNCS 2009).
[13] MacQueen, J. “Some Methods for Classification and Analysis of Multivariate Observations,” 5th Berkeley Symp. Math. Statist, Prob. (1967).
[14] Xu, X., Jager, J., Kriegel, H.P “A Fast Parallel Clustering Algorithm for Large Spatial Databases,” Data Mining and Knowledge Discovery (KDD 1999).
[15] Xin Yue Yang, Zhen Liu, and Yan Fu, “MapReduce as a Programming Model for Association Rules Algorithm on Hadoop,” Information Sciences and Interaction Sciences (ICIS 2010).
[16] I. S. Dhillon, S. Mallela, and D. S. Modha. “Information theoretic Co-clustering,” Knowledge Discovery and Data Mining Conference (KDD 2003).
[17] Y. Cheng and G.M. Church. “Biclustering of expression data,” American Association for Articial Intelligence (AAAI 2000).
[18] Ramanathan, V., “Parallelizing an Information Theoretic Co-clustering Algorithm Using a Cloud Middleware,” International Conference on Data Mining Workshops (ICDMW 2010).
[19] Spiros Papadimitriou, Jimeng Sun., “DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining,” IEEE International Conference on Data Mining (ICDM 2008).
[20] H. Li and N. Abe. “Word clustering and disambiguation based on co-occurence data,” the Association for Computational Linguistics (COLING-ACL 1998).
[21] D. Agarwal and S. Merugu, “Predictive discrete latent factor models for large scale dyadic data,” Knowledge Discovery and Data Mining Conference (KDD 2007).
[22] D. Chakrabarti, S. Papadimitriou, D. Modha, and C. Faloutsos. “Fully automatic cross-associations,” Knowledge Discovery and Data Mining Conference (KDD 2004).
[23] H. Cho, I. Dhillon, Y. Guan, and S. Sra, “Minimum sum-squared residue co-clustering of gene expression data,” SIAM International Conference on Data Mining (SDM 2004).
[24] S. C. Madeira, and A. L. Oliveira, “Biclustering algorithms for biological data analysis: A survey,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB 2004), 1.
[25] http://www.emc.com/leadership/programs/digital-universe.htm
|