Thesis 965202060: Detailed Record




Name: 唐正憲 (Cheng-Hsien Tang)    Department: Computer Science and Information Engineering
Thesis Title: 以點對點技術為基礎之整合性資訊管理及分析系統 (A Peer-to-Peer-Based Integrated Information Management and Analysis System)
English Title: Peer-to-Peer-Based Big Data Management and Analysis in Astronomical Applications
Related Theses
★ Applying Self-Organizing Map and Back-Propagation Networks to Mining Potential Customers in a Telecommunications Database
★ Enterprise E-mail Classification Based on Social Network Features
★ Temporal Behavior Analysis of Mobile Network Users
★ Mining Multi-Level Influence Propagation in Social Networks
★ A Study of Data-Locality Strategies for Different Big-Data Astronomical Applications on Distributed Cloud Platforms
★ Exploring Knowledge of Peer-to-Peer Network Environments with Data Warehousing Techniques
★ Self-Derived Mining of Multi-Level FP-trees from Transaction Databases
★ Constructing Passive Storage-Capacity Migration Policies for Lifecycle Management Systems
★ Applying Service Mining to the Discovery of Composite Services
★ Improving Intrusion Detection Systems with Frequent Event Sequences in Weighted Suffix Trees
★ Efficient Processing of Continuous Aggregate Queries on Data Warehouses
★ Intrusion Detection Systems Using Function-Based System Call Sequences
★ Efficient Multi-Dimensional and Multi-Level Association Rule Mining on Data Cubes
★ Course Recommendation Based on Community Relations and Weights in E-Learning
★ Identifying Inactive Users in Social Network Services
★ Finding Sequences with Similar Variations in Astronomical Observation Records Using Hierarchical Weighted Suffix Trees
Files: Electronic full text not released (access permanently restricted)
Abstract (Chinese): Advances in observation technologies, together with falling hardware prices, have produced explosive growth in both the precision and the volume of astronomical observations. This growth places an enormous burden on data management and analysis, and research workflows based on manual processing and human inspection are no longer practical for today's extremely large data sets. The major challenges for astronomical research are: 1. reducing the costs of data maintenance and processing; 2. searching for and extracting the data a user needs from many different kinds of massive observation records within a reasonable time; 3. improving traditional analysis methods, or designing new algorithms, for large-scale data and distributed storage environments; and 4. providing a flexible architecture that adapts quickly to different situations and lowers development difficulty.

Although several distributed architectures are already available to scientists as development tools, system design is not the scientists' specialty. As a result, a comprehensive management and analysis solution tailored to the astronomy domain is still missing.

In response to the characteristics of astronomical data, this study designs an integrated system with the following features: 1. a system architecture that scales to different data volumes and resource levels, combined with a peer-to-peer information management system offering automated software and hardware management and high fault tolerance; 2. fast indexing tailored to astronomical data formats; 3. decomposition of system functions into independent units, so the system can quickly adapt to different requirements; and 4. example classification and clustering algorithms for astronomical data built on this architecture. The system consists of three sub-systems: a peer-to-peer-based large-scale data management system that provides fast data search and automated management; a large-scale data classification system that provides machine-learning-style analysis of massive data; and a large-scale data clustering system that uses a message-passing algorithm to build hierarchical clusters over massive data in a distributed environment. Together, these sub-systems give users the tools needed to manage and analyze large data collections, let them customize the system for different needs, and allow the system to be deployed quickly. A simplified sketch of the clustering workflow follows below.
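As a rough illustration of that clustering workflow (a sketch only, not the thesis implementation): the all-pairs distance computation can be split across peers, whose partial results are gathered and then agglomerated. The toy 2-D data, the two simulated workers, and the single-linkage merge rule below are assumptions made for this example; the thesis distributes these steps with message passing and disjoint sets (see Sections 7.2.2 to 7.2.6 in the table of contents).

```python
# Minimal, single-process sketch: partition the all-pairs distance work across
# simulated "peers", merge their replies, then agglomerate with single linkage.
from itertools import combinations
import math

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (9.0, 0.1)]  # toy data

def worker(pairs):
    """One 'peer' computes distances for its share of the all-pairs workload."""
    return {(i, j): math.dist(points[i], points[j]) for i, j in pairs}

# Split the all-pairs workload between two simulated peers and merge replies.
all_pairs = list(combinations(range(len(points)), 2))
dist = {}
for chunk in (all_pairs[::2], all_pairs[1::2]):
    dist.update(worker(chunk))

# Naive single-linkage agglomeration over the gathered distances.
clusters = [{i} for i in range(len(points))]
while len(clusters) > 2:
    a, b = min(
        combinations(range(len(clusters)), 2),
        key=lambda ab: min(dist[tuple(sorted((i, j)))]
                           for i in clusters[ab[0]] for j in clusters[ab[1]]),
    )
    clusters[a] |= clusters[b]
    del clusters[b]
print(clusters)   # [{0, 1}, {2, 3, 4}] for the toy data above
```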
Abstract (English): Advances in information technology have produced scientific observations of such quality and volume that they demand far more storage space and processing power than ever before, while massively increasing the cost of the corresponding management and analytical processes. It has therefore become impractical to process terabytes of data with traditional approaches. From the perspective of astronomical data processing, the most important challenges are: 1. to maintain large amounts of data with lower cost and overhead; 2. to locate and extract the desired data from a huge data pool in a reasonable time; 3. to develop new analysis methods for large-scale data in distributed environments; and 4. to adopt a flexible architecture that adapts to different situations quickly and decreases development overhead. Even though existing distributed computing techniques, such as grid and cloud technologies, give scientists better access to powerful computing resources, the development of big-data management and analysis software still lags far behind, and this predicament keeps the connected computing resources from being utilized efficiently. To address this problem, we propose an integrated, efficient information management and analysis system for astronomical data processing. This study therefore focuses on the design of a management system together with distributed classification and clustering methods for efficient data analysis in various astronomical applications.

The proposed system can be viewed as an integrated platform that supports the management and analysis of large data collections. It consists of one data management sub-system and two analytical sub-systems. The first is the Peer-to-Peer-Based Management System (P2PBMS), which adapts the Chord design to construct a scalable platform for fast data retrieval and management. The second is the Similarity Classification System (SCS), which uses a decentralized Multiple Classifier System (MCS) framework to provide fast and stable classification in a distributed environment using multiple classifiers. The last is the Distributed Hierarchical Clustering System (DHCS), which uses a distributed message-passing algorithm to efficiently compute a hierarchical clustering of a given set of astronomical data.
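To make the P2PBMS description more concrete, here is a minimal sketch of Chord-style key placement: peers and data keys are hashed onto one identifier ring, and each key is served by its successor peer. The identifier size M, the peer addresses, and the local Ring class are illustrative assumptions; a real Chord overlay resolves the successor in O(log N) hops through finger tables rather than a sorted list on one machine, and the thesis layers join, delete, query, load-balance, and backup operations (Chapters 4 and 5 of the table of contents) on top of such placement.

```python
# Sketch of Chord-style consistent hashing onto an identifier ring.
import hashlib
from bisect import bisect_left

M = 16                                    # assumed identifier space of 2**M slots

def ring_id(key: str) -> int:
    """Hash a peer address or data key onto the identifier ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** M)

class Ring:
    """Local stand-in for the overlay: maps each key to its successor peer."""
    def __init__(self, peers):
        self.ids = sorted(ring_id(p) for p in peers)
        self.peer_of = {ring_id(p): p for p in peers}

    def successor(self, key: str) -> str:
        # first peer whose id is >= the key's id, wrapping around the ring
        idx = bisect_left(self.ids, ring_id(key)) % len(self.ids)
        return self.peer_of[self.ids[idx]]

ring = Ring(["peer-a:4000", "peer-b:4000", "peer-c:4000"])
print(ring.successor("sky-object-12345"))   # peer responsible for this key
```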

The proposed integrated system supports large-scale data management and analysis for astronomical data processing. With the three sub-systems, it provides the analytical tools and combination frameworks needed to handle different kinds of complex analysis tasks, while the unit-based structure decreases the overhead of customizing the system for different purposes.
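The unit-based idea can be sketched as follows: each capability is an independent unit behind a common call signature, and a deployment composes only the units it needs. The unit names (index_unit, filter_unit), the fake index key, and the toy catalog records below are hypothetical, not taken from the thesis.

```python
# Hedged illustration of composing independent, swappable processing units.
from typing import Callable, Iterable, List

Unit = Callable[[Iterable[dict]], Iterable[dict]]

def index_unit(records):
    # hypothetical stand-in for attaching a spatial index key to each record
    return ({**r, "index_key": hash((round(r["ra"], 2), round(r["dec"], 2)))}
            for r in records)

def filter_unit(records):
    # hypothetical stand-in for a query/selection unit
    return (r for r in records if r["mag"] < 20.0)

def pipeline(units: List[Unit], records):
    # compose the chosen units into one workflow
    for unit in units:
        records = unit(records)
    return list(records)

catalog = [{"ra": 10.1, "dec": -5.2, "mag": 18.3},
           {"ra": 200.4, "dec": 33.0, "mag": 21.7}]
print(pipeline([index_unit, filter_unit], catalog))
```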
Keywords (Chinese) ★ Peer-to-Peer
★ Data Clustering
★ Data Classification
★ Distributed Systems
Keywords (English) ★ Peer-to-Peer
★ Hierarchical Clustering
★ Data Classification
★ Distributed Computing
Table of Contents
Abstract (Chinese).......................................i
ABSTRACT ..............................................iii
Contents ................................................v
List of Figures........................................vii
Chapter 1 Introduction...................................1
Chapter 2 Related Works ................................10
2.1 Peer-to-Peer System . . . . . . . . . . . . . . . . 10
2.2 Hierarchical Triangular Mesh . . . . . . . . . . . .12
2.3 Data Classification . . . . . . . . . . . . . . . . 14
2.4 Data Clustering . . . . . . . . . . . . . . . . . . 15
2.5 Multiple Classifier Systems . . . . . . . . . . . . 17
2.6 All-Pairs Problem . . . . . . . . . . . . . . . . . 17
2.7 Parallel and Distributed Hierarchical Clustering . .18
Chapter 3 System Architecture ..........................22
3.1 Multi-Layer Ring System . . . . . . . . . . . . . . 22
3.2 System Stack for a Peer . . . . . . . . . . . . . . 23
3.3 Unit Based Decomposition . . . . . . . . . . . . . .25
Chapter 4 Support Operations for All Layers.............29
4.1 ID Transformation . . . . . . . . . . . . . . . . . 29
4.2 Join Operation . . . . . . . . . . . . . . . . . . .30
4.3 Delete Operation . . . . . . . . . . . . . . . . . .32
4.4 Crash Management . . . . . . . . . . . . . . . . . .32
Chapter 5 Support Operations for Storage Layers ........33
5.1 Data Insertion . . . . . . . . . . . . . . . . . . .33
5.2 Data Query . . . . . . . . . . . . . . . . . . . . .33
5.3 Load Balance . . . . . . . . . . . . . . . . . . . .36
5.4 Backup . . . . . . . . . . . . . . . . . . . . . . .40
Chapter 6 Support Operations for Index Layers ..........41
6.1 Improved Hierarchical Triangular Mesh . . . . . . . 41
6.2 Tree Structure . . . . . . . . . . . . . . . . . . .42
6.3 Range Update . . . . . . . . . . . . . . . . . . . .43
6.4 Tree Balance . . . . . . . . . . . . . . . . . . . .43
Chapter 7 Support Operations for Computing Layers.......46
7.1 Similarity Classification System . . . . . . . . . .46
7.2 Distributed Hierarchical Clustering System . . . . .49
7.2.1 Proposed Method . . . . . . . . . . . . . . . . . 49
7.2.2 Computing Distances of All Pairs of Data Items . .51
7.2.3 Reducing Space Cost . . . . . . . . . . . . . . . 51
7.2.4 Constructing Disjoint Sets . . . . . . . . . . . .53
7.2.5 Computing Distances of Disjoint Sets . . . . . . .54
7.2.6 Hierarchical Clustering . . . . . . . . . . . . . 54
7.2.7 Incremental Update . . . . . . . . . . . . . . . .56
7.2.8 Fast Incremental Update . . . . . . . . . . . . . 63
Chapter 8 Conclusions and Future Work...................65
Reference ..............................................67
Advisor: 蔡孟峰 (Meng-Feng Tsai)    Date of Approval: 2014-07-22