Graduate Thesis 995202096: Detailed Record




Author: Shi-Shan Chen (陳詩姍)    Department: Computer Science and Information Engineering
Thesis Title: 基於圖形的平行化最小生成樹分群演算法
(A Parallel Minimal-Spanning-Tree-Based Clustering Algorithm with Self-Detectability of the Best Number of Clusters)
Electronic full-text access:
  1. The author has agreed to immediate open access for this electronic thesis.
  2. The open-access electronic full text is licensed only for personal, non-profit retrieval, reading, and printing for the purpose of academic research.
  3. Please observe the relevant provisions of the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.

Abstract (Chinese): With the rapid development of technology, large volumes of data are generated at every moment of daily life and wait to be processed and consolidated. When facing such large datasets, the required computation time and storage space grow dramatically. In order to obtain the truly needed information within reasonable time and space, this study proposes a graph-based parallel minimum-spanning-tree clustering algorithm that automatically detects the best number of clusters in the data and, by distributing the data in parallel, effectively reduces the execution time and storage space for large datasets. On the premise of distributing the data evenly for computation, this study proposes two computing architectures implemented with the Message Passing Interface (MPI) library, which allow the minimum spanning tree to be computed in a parallel, distributed manner and apply the graph-based clustering method (GBC method) to cluster the data. The two methods are Parallel Clustering based on Partitions of Local Minimal-spanning-trees (Para-CPLM) and Parallel Clustering based on a Global Minimal-spanning-tree (Para-CGM).
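To make the graph-based idea concrete, the following is a minimal, hedged sketch of MST-based clustering: build a minimum spanning tree over the points and cut edges to split the tree into clusters. The thesis's GBC method has its own criterion for self-detecting the best number of clusters; the "largest weight gap" rule used here is only a stand-in for it, and all function and variable names in the sketch are ours.

```python
# Illustrative sketch of MST-based clustering (not the thesis code).
# A simple "largest gap between sorted edge weights" heuristic stands in
# for the GBC method's actual cluster-count criterion.
import numpy as np

def prim_mst(points):
    """Return MST edges (i, j, weight) over points using Prim's algorithm, O(n^2)."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()            # cheapest known edge from the tree to each vertex
    parent = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        v = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[v]), v, float(best[v])))
        in_tree[v] = True
        closer = dist[v] < best
        best[closer] = dist[v][closer]
        parent[closer] = v
    return edges

def mst_clusters(points):
    """Cut the MST at the largest weight gap to split the data into clusters."""
    edges = sorted(prim_mst(points), key=lambda e: e[2])
    weights = [w for _, _, w in edges]
    cut_at = int(np.argmax(np.diff(weights))) + 1   # keep edges below the largest gap
    kept = edges[:cut_at]
    # Union-find over the kept edges yields the cluster labels.
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j, _ in kept:
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
    print(len(set(mst_clusters(data))), "clusters detected")  # expected: 2
```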
Para-CPLM distributes all points of the graph evenly among multiple processes: the data domain is cut along its dimensions into several blocks, a local minimum spanning tree is built within each block, the local trees are then connected to one another by their shortest distances, and the GBC method is applied for a first round of clustering. Afterwards, the distances between all clusters are examined to re-find the shortest inter-cluster distances that still satisfy the MST conditions. Para-CGM, in contrast, uses a preset threshold to filter out data that does not need to be stored, and distributes the filtered data to the processes for parallel computation, thereby reducing both computation time and storage space.
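The storage-reduction idea behind Para-CGM can be illustrated with a small sketch that keeps only similarity-matrix entries whose distance falls below a preset threshold, storing them as a sparse edge list that each process could compute for its own slice of rows. This shows the filtering concept only; the thesis's threshold-selection strategy is described in Chapter 4, and the function names and the example threshold value below are assumptions.

```python
# Illustrative sketch (not the thesis code): filter the similarity matrix with a
# preset threshold so that only nearby pairs are stored as sparse edges.
import numpy as np

def filtered_edges(points, row_range, threshold):
    """Compute one slice of the pairwise-distance matrix (rows in `row_range`)
    and keep only entries below `threshold`, returned as (i, j, distance) tuples."""
    edges = []
    for i in row_range:
        d = np.linalg.norm(points - points[i], axis=1)
        for j in np.nonzero((d < threshold) & (np.arange(len(points)) > i))[0]:
            edges.append((i, int(j), float(d[j])))
    return edges

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.random((500, 2))
    # In an MPI setting each process would handle a disjoint row_range;
    # a single slice is shown here, with an arbitrary example threshold of 0.1.
    kept = filtered_edges(pts, range(0, 250), threshold=0.1)
    print(f"kept {len(kept)} of {500 * 499 // 2} possible pairs")
```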
The experimental results show that Para-CPLM improves the execution time very significantly compared with the GBC method, while producing clustering results that are mostly identical to it. Even with sufficient computing resources, Para-CGM does not speed up as effectively as Para-CPLM when the number of computing resources (e.g., CPUs) increases, but it is still faster than the GBC method and produces exactly the same clustering results.
Abstract (English): Recently, the efficiency of clustering algorithms for large-scale data analysis has become a major issue in many application domains. First, the computation time and storage space increase dramatically when processing a large amount of data. Second, it is hard to determine the number of clusters automatically. In order to cluster datasets within reasonable computation time and storage space, we propose a parallel computing strategy using the concepts of graph-based clustering and granular computing. The proposed strategy automatically determines the best number of clusters of a large dataset, and effectively reduces the computation time and storage space requirements for a large amount of data. Based on the proposed strategy, we devise two clustering algorithms and implement them with the Message Passing Interface (MPI). Both algorithms utilize the minimum-spanning-tree structure to determine the best number of clusters automatically. The first algorithm is called Para-CPLM (Parallel Clustering based on Partitions of Local Minimal-spanning-trees), and the second is called Para-CGM (Parallel Clustering based on a Global Minimal-spanning-tree).
The Para-CPLM partitions the data domain into several blocks (hyper-rectangles) according to the dimensions of the dataset, uses a parallel method to distribute the data uniformly across the blocks, and then establishes a local minimal spanning tree in each block. After each local minimal spanning tree is established, it combines the local trees according to the closest Euclidean distances and applies the GBC method to cluster them. After this first clustering, it checks the distances between the clusters and finds the closest Euclidean distance among them that conforms to the rules of a minimal spanning tree. The Para-CGM constructs a global minimal spanning tree in parallel and applies the GBC method to find the best number of clusters; it relies on a preset threshold to filter the data so that they can be processed efficiently.
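As an illustration of this strategy (not the thesis implementation), the hedged mpi4py sketch below partitions the data domain into blocks along one dimension, builds a local minimum spanning tree in each process, and lets the root process stitch neighbouring blocks together with the shortest link between them; the final GBC-style edge cuts are omitted. The file name, function and variable names, and the use of SciPy's minimum_spanning_tree are assumptions.

```python
# Hedged sketch of the Para-CPLM idea with mpi4py (illustrative only).
# Run with e.g.:  mpiexec -n 4 python para_cplm_sketch.py
import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import minimum_spanning_tree
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Root generates (or loads) the data and slices the domain along the first dimension.
if rank == 0:
    rng = np.random.default_rng(1)
    data = rng.random((2000, 2))
    order = np.argsort(data[:, 0])                 # sort by x so blocks are hyper-rectangles
    blocks = np.array_split(data[order], size)     # one block of points per process
else:
    blocks = None
local = comm.scatter(blocks, root=0)               # each rank receives its own block

# Each rank builds its local MST (edge list as index pairs within its block).
w = distance_matrix(local, local)
mst = minimum_spanning_tree(w).tocoo()
local_edges = list(zip(mst.row.tolist(), mst.col.tolist(), mst.data.tolist()))

# Root gathers the local trees and stitches neighbouring blocks together with the
# single shortest link between them (a simplified stand-in for the merging step).
all_blocks = comm.gather(local, root=0)
all_edges = comm.gather(local_edges, root=0)
if rank == 0:
    bridges = []
    for b in range(size - 1):
        d = distance_matrix(all_blocks[b], all_blocks[b + 1])
        i, j = np.unravel_index(np.argmin(d), d.shape)
        bridges.append((b, int(i), b + 1, int(j), float(d[i, j])))
    total_local = sum(len(e) for e in all_edges)
    print(f"{total_local} local MST edges + {len(bridges)} bridge edges "
          f"form the combined spanning structure; GBC-style edge cuts would follow.")
```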
From our experimental results, the Para-CPLM has a significantly shorter execution time than the GBC method and produces almost the same clustering results. In contrast, the Para-CGM shows less improvement in execution time than the Para-CPLM when enough computing resources are available to the Para-CPLM. However, it still outperforms the GBC method and produces exactly the same clustering results.
Keywords (Chinese) ★ MPI
★ parallel computing
★ clustering algorithm
★ minimum spanning tree
Keywords (English) ★ MPI
★ Minimum Spanning Tree
★ Clustering Algorithm
★ Parallel Computing
Table of Contents    Abstract (Chinese) I
Abstract (English) III
Acknowledgements V
Table of Contents VI
List of Figures VIII
List of Tables IX
Chapter 1  Introduction 1
1-1 Clustering Algorithms 2
1-2 Parallel Computing 3
1-3 Graph-Based Clustering Method (GBC Method) 5
1-4 Research Objectives 5
1-5 Research Contributions 7
1-6 Thesis Organization 7
Chapter 2  Background 8
2-1 MST (Minimum Spanning Tree) 8
2-2 MPI (Message Passing Interface) 12
2-3 Graph-Based Clustering Method (GBC Method) 14
Chapter 3  Parallel Clustering based on Partitions of Local Minimal-spanning-trees (Para-CPLM) 18
3-1 Local Minimal-Spanning-Trees Algorithm 20
3-1-1 Dividing 20
3-1-2 Merging 21
3-2 Clustering 23
3-3 Modifying and Clustering 24
Chapter 4  Parallel Clustering based on a Global Minimal-spanning-tree (Para-CGM) 28
4-1 Parallel Similarity Matrix Computation 30
4-1-1 Strategy for Distributing the Similarity Matrix Computation 30
4-1-2 Threshold Determination Strategy in the Para-CGM Architecture 32
4-1-3 Strategy for Reducing the Storage Space of the Similarity Matrix 33
Chapter 5  Experimental Results 38
5-1 Experimental Environment and Datasets 38
5-2 Experimental Results and Analysis 41
5-2-1 Comparison of the Para-CPLM and Para-CGM Results 42
5-2-2 Special Cases in the Para-CPLM Results 47
5-3 Multi-Dimensional Partitioning in Para-CPLM and Its Performance 55
5-4 Time Analysis of the GBC Method, Para-CPLM, and Para-CGM 57
Chapter 6  Conclusions and Future Work 60
References 61
Advisor: Wei-Jen Wang (王尉任)    Date of Approval: 2012-07-24
