適用於大資料集高效率的分散式階層分群演算法

以作者查詢圖書館館藏

、以作者查詢臺灣博碩士

、以作者查詢全國書目

、勘誤回報

、線上人數：15

、訪客IP：3.138.199.50

姓名

黃安慶(An-Cing Huang) 查詢紙本館藏

畢業系所

資訊工程學系

論文名稱

適用於大資料集高效率的分散式階層分群演算法
(An Efficient Distributed Hierarchical-Clustering Algorithm for Large Scale Data Set)

相關論文

★ 以伸展樹為基礎的Android Binder Driver	★ 應用增量式學習於多種農作物判釋之研究
★ 應用分類重建學習偵測航照圖幅中的新穎坵塊	★ 一個建立在平行工作系統上的動態全球計算平台
★ 用權重參照計數演算法執行主動物件垃圾收集	★ 一個動態負載平衡之最大可能性估算計算架構
★ 利用多項系統負載資訊進行動態P2P系統重組的策略研究	★ 基於Hadoop系統的雲端應用程式特徵擷取與計算監測架構
★ 適用於大型動態分散式系統的調適性計算模型	★ 一個提供彈性虛擬資料中心的雲端服務平台
★ 雲端彈性虛擬機房服務平台之資源控管中心	★ 一個適用於自動供應雲端系統的動態調適計算架構
★ 線性相關工作與非相關工作的探索式排程策略	★ 混合雲端環境上的多重代理人動態調適計算管理架構
★ 基於圖形的平行化最小生成樹分群演算法	★ 基於密度的超立方體覆蓋之啟發式演算法

檔案

[Endnote RIS 格式]

[Bibtex 格式]

[相關文章]

[文章引用]

[完整記錄]

[館藏目錄]

[檢視]

[下載]

本電子論文使用權限為同意立即開放。
已達開放權限電子全文僅授權使用者為學術研究之目的，進行個人非營利性質之檢索、閱讀、列印。
請遵守中華民國著作權法之相關規定，切勿任意重製、散佈、改作、轉貼、播送，以免觸法。

摘要(中)

隨著資訊科技的進步，各領域所需要處理的資料量漸漸龐大到單一電腦無法處理的規模，以階層式分群演算法來說，由於其在執行時必須儲存非常大量的資料，因此在處理的資料量大時會面臨許多問題。
因此，本研究提出了將階層式分群演算法平行分散至多台電腦執行的的計算架構，藉由一預先設定的臨界值，過濾不必要儲存的資料，並將原來的階層結構拆解成為許多可獨立進行階層式分群演算法的子群集。最後將這些子群集以平行運算的的方式來加速階層式分群演算法的執行。依據這個計算模式，本研究以Message Passing Interface (MPI)函式庫實作出能夠讓階層式分群演算法平行分散的計算架構。
我們所提出的計算架構的主要優點是能夠大幅減少階層式分群演算法所需要的儲存空間與執行時間。個別應用可以依據自己的需求對於本文發展之程式去做出適度的修改，即可將本架構套用在該應用上。實驗結果也顯示出，本研究所提出之階層分群演算法的平行分散架構，於執行時間與儲存空間上皆有相當的改善，很適合發展到多種不同的應用之上。

摘要(英)

Clustering of different kinds of groups is a common and important technique in any research area. Clustering algorithms usually focus on a small dataset which can be analyzed by a single machine. However, as new hardware and techniques are developed for collecting data, the size of datasets can grow to an extremely large scale in many domains, such as astronomy, high energy physics, and aircraft engine diagnostics. However, The time complexity of hierarchical clustering algorithms are polynomial time between O(N2) to O(N3). This means that the computation cost of the algorithms will grow very fast as the size of input data become large. Therefore, the hierarchical clustering algorithms cannot be used directly in this situation because they can’t guarantee that the users will get the results back in a bounded amount of time.
This research focuses on how to make the hierarchical clustering algorithm process in parallel. The traditional hierarchical clustering algorithm is an unsupervised learning algorithm which doesn’’t need to label data in advance or assign the number of clusters. These characteristics make it become adaptable and capable to process many kinds of data. The goal of our research is to use a parallel computing architecture to improve the speed of execution and minimize the storage space needed of traditional hierarchical clustering algorithms, and refining the process of hierarchical clustering algorithms. We propose a Parallelized Hierarchical Clustering Algorithm, which provides a modified Hierarchical Agglomerative Algorithm that can be adapted to the distributed environment. This algorithm can process a grouping in a parallel way, and reduce both data computation load and transmission rate when facing a large-size data.

關鍵字(中)

★ 階層式分群演算法
★ 平行計算
★ 分散式計算

關鍵字(英)

★ Parallel Computing
★ MPI
★ Hierarchical Clustering
★ Distributed System

論文目次

摘要 i
ABSTRACT ii
目錄 iii
圖目錄 v
表目錄 vii
第一章緒論 1
1-1 分群演算法 (Clustering Algorithm) 2
1-2 階層式分群演算法 3
1-3 平行計算(Parallel Computing) 4
1-4 研究目標 5
1-5 研究貢獻 6
1-6 論文文章架構 6
第二章相關研究 7
2-1 階層式分群演算法 7
2-2 分割式分群演算法 12
2-3 分割式分群演算法與階層式分群演算法的比較 15
2-4 Disjoint Set Forests 演算法 15
2-4-1 Disjoint Set Forests 演算法最佳化 16
2-5 相關提供平行化運算的開發平台 18
2-5-1 Hadoop 18
2-5-1 Message Passing Interface 21
2-5-2 Hadoop 與MPICH2 的比較 23
第三章系統架構 25
3-1 分散式Similarity Matrix 計算 27
3-1-1 分散Similarity Matrix 計算的策略 28
3-1-2 減少Similarity Matrix 儲存空間的策略 29
3-2 使用 Disjoint Set Algorithm to Find Disjoin Set 31
3-2-1 Disjoint Sets 間的Similarity Matrix 32
3-2-2 Parallels 策略 33
3-3 Clustering 34
3-4 Algorithm Complexity 35
3-4-1 Time Complexity 35
3-4-2 Space Complexity Analysis 36
第四章實驗環境與實驗結果 38
4-1 實驗環境 38
4-1-1 實驗資料集 39
4-1-2 實驗設計 40
4-2 實驗結果與分析 40
4-2-1 臨界值切割策略與資料分佈 40
4-2-2 使用平行的方式計算Similarity Matrix 42
4-2-3 Threshold 與 Disjoint set 44
4-2-4 計算Disjoint Set 間的Similarity Matrix 46
4-2-5 分散式Clustering 47
4-3 綜合實驗結果 48
第五章結論與未來展望 49
參考文獻 51

參考文獻

[1] E. West, D.N. Barron, J. Dowsett, and J.N. Newton, “Hierarchies and cliques in the social networks of health care professionals: implications for the design of dissemination strategies.”
[2] N.A. Rawashdeh, S.T. Love, and K.D. Donohue, “Hierarchical Image Segmen-tation by Structural Content,” Journal of Software, vol. 3, 2008, p. 41.
[3] F.S. Khan, R.M. Anwer, and O. Torgersson, “Using Data Clustering in Oral Medicine,” International Journal of Biological and Medical Sciences, vol. 4, 2009.
[4] P. Bendjoya and V. Zappala, “Asteroid family identification,” Asteroids III. Univ. of Arizona Press, Tucson, 2002, pp. 613–618.
[5] MPI: A Message-Passing Interface Standard Version 2.1. - http://www.mcs.anl.gov/research/projects/mpich2/, .
[6] B.C. Fung, K. Wang, and M. Ester, “Hierarchical document clustering,” The Encyclopedia of Data Warehousing and Mining, John Wang (ed.), Idea Group, 2005.
[7] Welcome to Apache Hadoop! - http://hadoop.apache.org/, .
[8] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidyalingam S. Sunderam, “PVM: Parallel Virtual Machine,” Nov. 1994.
[9] IEEE Computer Society, “High performance fortran,” vol. 1, 1993, p. 25.
[10] OpenMP.org . - http://openmp.org/wp/, .
[11] Open MPI: Open Source High Performance Computing. - http://www.open-mpi.org/, .
[12] CUDA Zone. - http://www.nvidia.com/object/cuda_home_new.html, .
[13] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze, Introduc-tion to Information Retrieval.
[14] W.H. Day and H. Edelsbrunner, “Efficient algorithms for agglomerative hie-rarchical clustering methods,” Journal of classification, vol. 1, 1984, pp. 7–24.
[15] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, Introduction to algorithm - Disjoint Set Forests algorithm.
[16] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the ACM, vol. 51, 2008, pp. 107–113.
[17] S. Ghemawat, H. Gobioff, and S.T. Leung, “The Google file system,” ACM SIGOPS Operating Systems Review, vol. 37, 2003, p. 43.
[18] F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R.E. Gruber, “Bigtable: A distributed storage system for structured data,” ACM Transactions on Computer Systems (TOCS), vol. 26, 2008, p. 4.
[19] K.M. Yu, C.Y. Lin, H.Y. Wang, C.Y. Tang, and J. Zhou, “Parallel Branch-and-Bound Approach with MPI Technology in Inferring Chemical Compounds with Path Frequency.”
[20] Citrix XenServer: Efficient Server Virtualization Software .
[21] SALSA Programming Language. - http://wcl.cs.rpi.edu/salsa/ .

指導教授

王尉任(Wei-Jen Wang)

審核日期

2010-7-28

推文