Incorporating Astronomical Catalog by using Cross-Matching Algorithm with MapReduce

Name

謝佳昕 (Jia-Shin Shie)

Department

Department of Computer Science and Information Engineering

Thesis Title

Incorporating Astronomical Catalog by using Cross-Matching Algorithm with MapReduce
(以MapReduce進行交叉驗證整合大量天文資料)

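This electronic thesis is authorized for immediate open access.
The open-access full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the text without authorization, to avoid breaking the law.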

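Abstract (translated from Chinese)

As technology advances, the telescopes used for astronomical observation have become more powerful, capturing more information and producing larger volumes of data. This growth makes research harder for astronomers. This thesis therefore proposes using cross-matching to integrate large volumes of astronomical data so that researchers can quickly find the data they need.
Cross-matching is a common method for finding useful information in large volumes of astronomical data. Running cross-matching on a single machine has been very inefficient, so this thesis builds a distributed cloud environment with OpenStack and Hadoop and implements cross-matching as a distributed algorithm, integrating large astronomical datasets more efficiently. A distributed file system and a distributed database serve as storage, which makes the whole system more reliable and scalable.
The experiments are designed around two storage methods, HDFS and HBase. They compare the execution speed of the single-machine program with that of the distributed program; the computation time on physical machines with that on virtual nodes in the cloud environment, using the same number of nodes; the computation time across different numbers of nodes; and the computation time of the two storage methods. The thesis also provides a visual user interface for quickly finding the required data.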
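
To make the single-machine baseline concrete, the following is a minimal sketch of a brute-force cross-match between two small catalogs. It is not the thesis's code: the function names, the `(id, ra, dec)` tuple layout in degrees, and the match radius are all assumptions chosen for illustration.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(catalog_a, catalog_b, radius_deg):
    """Compare every source in A with every source in B (hypothetical baseline).

    Each catalog is a list of (id, ra_deg, dec_deg) tuples. The cost grows as
    len(catalog_a) * len(catalog_b), which is why this approach becomes
    impractical on one machine as the catalogs grow.
    """
    matches = []
    for id_a, ra_a, dec_a in catalog_a:
        for id_b, ra_b, dec_b in catalog_b:
            if angular_separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg:
                matches.append((id_a, id_b))
    return matches
```

The quadratic number of comparisons in `cross_match` is the cost that a distributed implementation tries to divide among nodes.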

Abstract (English)

Cross-matching is a common way to find useful information across different star catalogs. Hardware today is more powerful than before, and the data obtained through astronomical telescopes are becoming much larger, so a single machine can no longer handle the astronomical data. In this thesis, we use OpenStack to build a cloud computing environment, Hadoop as the distributed system, and HDFS and HBase as distributed storage, and we implement cross-matching with the MapReduce framework. In addition, because HBase supports random access, we build an incremental mechanism that lets users add new astronomical data whenever they want. In the experiments, we use transient data as test data to compare the running time of a single machine with that of the distributed system, and the running time of physical machines with that of virtual machines using the same number of nodes. The results show that the virtual machines are faster than the physical machines. Furthermore, we create 12 physical nodes in the cloud environment to observe the running time for different numbers of nodes. In theory, the program should run faster as more nodes are used. In practice, the speeds with 10 nodes and 12 nodes are very similar.
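
As an illustration of how cross-matching can be phrased in MapReduce terms, here is a minimal pure-Python sketch. It does not use the Hadoop API and is not the thesis's implementation: the declination-zone partitioning, the `ZONE_HEIGHT_DEG` value, and all function names are assumptions. The map step keys each source by zone, an in-memory dictionary stands in for the shuffle, and the reduce step compares only the sources that share a zone.

```python
import math
from collections import defaultdict

ZONE_HEIGHT_DEG = 0.5  # hypothetical zone height; must be >= the match radius


def separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def zone_of(dec):
    """Index of the declination zone that contains `dec`."""
    return int(math.floor((dec + 90.0) / ZONE_HEIGHT_DEG))


def map_source(catalog_name, source):
    """Map step: emit (zone, record) pairs for one source.

    Catalog "A" sources go to their own zone only. Catalog "B" sources are
    also copied to both neighbouring zones, so a pair that straddles a zone
    boundary is still compared, and is compared in exactly one zone.
    """
    src_id, ra, dec = source
    zone = zone_of(dec)
    zones = [zone] if catalog_name == "A" else [zone - 1, zone, zone + 1]
    for z in zones:
        yield z, (catalog_name, src_id, ra, dec)


def reduce_zone(records, radius_deg):
    """Reduce step: brute-force comparison of A and B sources within one zone."""
    a_side = [r for r in records if r[0] == "A"]
    b_side = [r for r in records if r[0] == "B"]
    for _, id_a, ra_a, dec_a in a_side:
        for _, id_b, ra_b, dec_b in b_side:
            if separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg:
                yield id_a, id_b


def run_job(catalog_a, catalog_b, radius_deg):
    """Run map, shuffle, and reduce in memory and return sorted matches."""
    if radius_deg > ZONE_HEIGHT_DEG:
        raise ValueError("match radius must not exceed the zone height")
    groups = defaultdict(list)  # stands in for the framework's shuffle phase
    for name, catalog in (("A", catalog_a), ("B", catalog_b)):
        for source in catalog:
            for zone, record in map_source(name, source):
                groups[zone].append(record)
    matches = []
    for zone in sorted(groups):
        matches.extend(reduce_zone(groups[zone], radius_deg))
    return sorted(matches)
```

Because each zone is reduced independently, the zones could be spread across worker nodes. Zones with many sources take longer than sparse ones, so the work is not guaranteed to divide evenly.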

Keywords

★ Big Data
★ Cloud computing
★ Distributed system
★ Cross-Matching

Table of Contents

Abstract (Chinese)
Abstract
Acknowledgments
Table of Contents
List of Figures
1. Introduction
1-1 Research Background
1-2 Research Motivation and Objectives
1-3 Chapter Overview
2. Literature Review
2-1 Transient Astronomical Event
2-2 OpenStack
2-3 Hadoop
2-4 MapReduce
2-5 NoSQL
3. System Architecture
3-1 Cloud Computing Platform
3-2 HDFS File System
3-3 HBase Database
3-4 Cross-Matching
3-5 Clustering Stage
3-6 Cross Matching Stage
4. Methodology
4-1 Clustering Stage
4-1-1 Data Reduction and Grouping
4-2 Cross Matching Stage
4-2-1 Cross-Matching
4-3 Adding New Observation Data
4-4 Visual Query Interface
5. Experiments
5-1 Clustering Stage Execution Time
5-1-1 Clustering Stage on HDFS
5-1-2 Clustering Stage on HBase
5-1-3 Clustering Stage: HDFS versus HBase
5-2 Cross Matching Stage Execution Time
5-2-1 Cross Matching Stage on HDFS
5-2-2 Cross Matching Stage on HBase
5-2-3 Cross Matching Stage: HDFS versus HBase
5-3 Adding New Observation Data
6. Conclusion
References

Advisor

蔡孟峰 (Meng-Feng Tsai)

Approval Date

2016-7-20
