摘要: | 隨著科技的快速發展,人們的生活中無時無刻皆在產生大筆資料,並等待資料處理及統整完成。而在面臨大筆資料量時,所需的計算時間及儲存空間也大幅提升。為了能夠在合理的時間與空間中有效的得到真正所需要的資訊,本研究提出了基於圖形的平行化最小生成樹分群演算法,可自動偵測出資料的最佳群數並透過平行分配資料量有效的改善在大筆資料時的執行時間及儲存空間。本研究以將資料平均分配計算為前提,提出以Message Passing Interface(MPI)函式庫實作的兩種計算架構,能夠讓最小生成樹平行分散計算並套用基於圖形的分群方式演算法(graph-based clustering method, GBC method)將資料進行分群。這兩種方法分別為基於區域最小生成樹群的平行化自動分群演算法(Parallel Clustering based on Partitions of Local Minimal-spanning-trees, Para-CPLM)以及基於全域最小生成樹的平行化自動分群演算法(Parallel Clustering based on a Global Minimal-spanning-tree, Para-CGM)。Para-CPLM利用平行化的方式,將圖形上的所有點數平均分給多個process,分配方法為依據圖形的各維度作切割成多個區塊後再於每個區塊建立區域最小生成樹,然後將各區域的最小生成樹以最短距離連接起來並套用GBC method進行第一次分群,之後再檢查所有群數之間的距離,重新找出各群之間最短的距離且符合MST的條件。Para-CGM則是藉由一個預先設定的臨界值,過濾不必要儲存的資料,並使用平行的方式分配過濾後的資料給各processes計算,達到減少計算時間以及儲存空間的目的。實驗結果顯示出,若拿Para-CPLM與GBC method相比,則Para-CPLM在執行時間上有相當顯著的改善,且分群結果和GBC method大多相同。而在足夠的計算資源下,Para-CGM在時間的改善上,則不像Para-CPLM能夠隨著計算資源數目 (如: CPU數目) 的增加而有效的加速,但仍比GBC method的執行時間來得快,且分群結果和GBC method完全一樣。Recently, efficiency of clustering algorithms for large data analysis has become a major issue in many application domains. First, the computation time and storage space increases dramatically while processing a large amount of data. Second, it is hard to determine the number of clusters automatically. In order to cluster datasets within rational computation time and storage space, we propose a parallel computing strategy using the concept of graph-based clustering and granular computing. The proposed strategy automatically determines the best number of clusters of a large datasets, and effectively reduces the computation time and storage space requirements, given a large amount of data. Based on the proposed strategy, we devise two clustering algorithms, and implement them in Message Passing Interface (MPI). Both the algorithms utilize the minimum spanning tree structure to determine the best number of clusters automatically. The first algorithm is called Para-CPLM (Parallel Clustering based on Partitions of Local Minimal-spanning-trees), while the second algorithm is called Para-CGM (Parallel Clustering based on a Global Minimal-spanning-tree). The Para-CPLM partitions the data domain into several blocks (hyper-rectangles) according to the dimensions of the datasets, utilizes a parallel method to uniformly distribute all datasets to the blocks, and then establishes a local minimal-spanning-tree in each block. After each local minimal-spanning-tree is established, it combines those local minimal-spanning-trees according to the closest Euclidean Distance and then applies the GBC method to cluster the local minimal-spanning trees. After the first clustering, it checks the distances between each cluster, and finds the closest Euclidean Distance that conforms to the rules of the minimal-spanning-tree among them. The Para-CGM constructs a global minimal-spanning-tree in parallel, and applies the GBC method to find the best number of clusters. It requires a given threshold number to process the data efficiently.From our experimental results, the Para-CPLM has significantly shorter execution time and almost the same clustering results when compared with the GBC method. On the contrary, the Para-CGM has less improvement on execution time than the Para-CPLM, given that there are enough computing resources for the Para-CPLM. However, it still outperforms the GBC method and produces the same clustering results. |