A Distributed Correlation Based Mechanism for Adaptive and Divergent Purposed Clustering

Name

Kuei-Sheng Lee (李桂昇)

Department

Department of Computer Science and Information Engineering

Thesis Title

A Distributed Correlation Based Mechanism for Adaptive and Divergent Purposed Clustering
(適用於多特性多用途的分散式關連分群機制)

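The author has agreed to make this electronic thesis openly accessible immediately.
The open-access full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid violating the law.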

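Abstract (translated from Chinese)

Cluster analysis has long been an important technique in machine learning. Clustering groups units with similar characteristics together, revealing useful or hidden information in the data. However, today's mainstream clustering algorithms must analyze the entire dataset to obtain their optimal parameters, which makes them hard to apply to large-scale data.
This study proposes a distributed, correlation-based clustering mechanism that uses unsupervised learning. It assumes that adjacent data points within the same cluster are pairwise similar, so that this property can be used to link to further data points until they form a complete cluster. When processing data, a large dataset can be split up and distributed across multiple computers, which compute the correlation between any two data items in parallel; the results are then filtered and aggregated into clusters.
The implementation in this study uses two-dimensional graphics, Go game analysis, and medical data as experimental data, with a similarity calculation defined for each data type. The experimental results demonstrate the mechanism's ability to handle large-scale data, together with good execution performance, accuracy, applicability, and ease of use.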

Abstract (English)

Cluster analysis is an important technique in machine learning. Clustering groups units with similar characteristics so that useful or implicit information can be learned from them. However, current mainstream clustering algorithms need to analyze the whole dataset to obtain their best parameters, which makes large-scale datasets difficult to process.
This study proposes a distributed, correlation-based clustering mechanism based on unsupervised learning. It assumes that neighboring data points in the same cluster are pairwise similar, so each point can be related to further data points until a complete cluster is formed. To process the data, a large-scale dataset can be split up and distributed to multiple computers, which calculate the correlation between any two pieces of data in parallel; the results are then filtered and aggregated into clusters.
This study uses 2D graphics, Go (Weiqi) game analysis, and medical data as experimental data, with a similarity calculation developed for each data type. The experimental results show that the clustering mechanism can handle large-scale datasets, and that it offers good execution performance, accuracy, variability, applicability, and ease of use.
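
The idea summarized in the abstract (compute pairwise relations in parallel, then chain related points into clusters) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The function names, the use of Manhattan distance as the similarity measure, the fixed `threshold` that stands in for the filtering step, and the thread pool that stands in for multiple computers are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def manhattan(a, b):
    """Hypothetical similarity measure: Manhattan distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def related_pairs(points, index_pairs, threshold):
    """Keep only the index pairs whose points lie within `threshold` of each other."""
    return [(i, j) for i, j in index_pairs
            if manhattan(points[i], points[j]) <= threshold]


def cluster(points, threshold, workers=4):
    """Group points by chaining pairwise relations (connected components)."""
    pairs = list(combinations(range(len(points)), 2))
    # Split the pair list into one chunk per worker; each chunk is independent,
    # so in a real deployment the chunks could go to separate machines.
    chunks = [pairs[k::workers] for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: related_pairs(points, c, threshold), chunks)
        links = [pair for part in results for pair in part]

    # Merge linked points with a union-find structure.
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in links:
        parent[find(i)] = find(j)

    # Collect point indices by their root to form the final clusters.
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, with `pts = [(0, 0), (1, 0), (2, 0), (10, 10), (11, 10), (50, 50)]`, calling `cluster(pts, threshold=1)` returns `[[0, 1, 2], [3, 4], [5]]`. Points 0 and 2 are not directly within the threshold of each other, but both are related to point 1, so the chain places all three in one cluster.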

Keywords (translated from Chinese)

★ Large-scale data
★ Clustering algorithm
★ Distributed system
★ Machine learning

Keywords (English)

★ Big Data
★ Clustering
★ Distributed system
★ Machine learning

Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgements
List of Figures
List of Tables
Contents
1. Introduction
1.1. Research Background
1.2. Research Objectives
1.3. Structure
2. Related Research
2.1. K-Means++ & Mini Batch K-Means
2.2. Mean Shift
2.3. Gaussian Mixture Model
2.4. DBSCAN
2.5. Hierarchical Clustering
2.6. BIRCH
3. Research Methodology
3.1. Steps for Using the Clustering Mechanism
3.1.1. Pre-processing of the Dataset
3.1.2. Indexing Data Points
3.1.3. Calculating the Center of Gravity of the Dataset
3.1.4. Configuring Data Points by Center of Gravity
3.1.5. Calculating the Correlation between Data Points
3.1.6. Filtering Data by Cluster Characteristics
3.1.7. Merging Data Points into a Cluster
3.2. Changes in the Use of this Clustering Mechanism
4. Experimental Results
4.1. Two-dimensional Graphics Clustering
4.2. Analysis of Medical Data
4.3. Analysis of Go Board Positions
4.3.1. Analyzing the Enclosed Positions
4.3.2. Analyzing the Connection of “Liberty”
5. Conclusion and Future Prospects
References

Advisor

Meng-Feng Tsai (蔡孟峰)

Approval Date

2021-01-18
