Distributed Astronomy Sequential Pattern Analysis System Using Hadoop Platform with Weighted Suffix Tree

Name

Yun-Hang Tsai (蔡昀翰)

Department

Department of Computer Science and Information Engineering

Thesis Title

Distributed Astronomy Sequential Pattern Analysis System Using Hadoop Platform with Weighted Suffix Tree
(基於Hadoop平台之分散式權重式字尾樹暨天文時序性資料分析系統)

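Related Theses

★ Applying Self-Organizing Map and Back-Propagation Networks to Mine Potential Customers from Telecommunication Databases
★ Enterprise E-mail Classification Based on Social Network Features
★ Temporal Behavior Analysis of Mobile Network Users
★ A Study on Mining Multi-Level Influence Propagation in Social Networks
★ An Integrated Information Management and Analysis System Based on Peer-to-Peer Technology
★ A Study of Data Locality Strategies for Different Large-Scale Astronomical Applications on Distributed Cloud Platforms
★ A Study on Applying Data Warehousing Techniques to Discover Knowledge in Peer-to-Peer Network Environments
★ Mining Multi-Level FP-trees from Transaction Databases by Self-Derivation
★ A Study on Building a Passive Storage-Capacity Migration Policy for Lifecycle Management Systems
★ A Study on Applying Service Mining to Discover Composite Services
★ Exploiting Frequent Episodes in Weighted Suffix Tree to Improve Intrusion Detection System
★ Efficient Processing of Continuous Aggregate Queries on Data Warehouses
★ Intrusion Detection System Using Function-Based System Call Sequences
★ Efficient Mining of Multi-Dimensional and Multi-Level Association Rules on Data Cubes
★ Course Recommendation Based on Community Relations and Weights in E-Learning
★ Finding Inactive Users in Social Network Services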

Files

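This electronic thesis is authorized for immediate open access.
Once open, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.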

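Abstract (Chinese)

As technology has advanced, the volume of data observed by the Panoramic Survey Telescope And Rapid Response System (Pan-STARRS) has grown accordingly, and falling storage costs have allowed astronomers to store large amounts of detailed observation data.
The elements of the collected astronomical data are ordered in time, which traditional methods handle poorly. We therefore adopt the suffix tree as the prototype of our data structure, giving astronomers a fast and efficient way to query stellar data and, after analysis, returning information on stars similar to the query.
A suffix tree consumes a very large amount of memory, and the amount of astronomical data is enormous. Together, these two factors exceed what a single machine can handle. We therefore build a Hadoop cloud platform on the open source OpenStack system to form a distributed environment and process the data in a distributed manner, improving overall system performance.
Processing large amounts of astronomical data on a distributed system reduces the manual effort spent on data processing and clearly improves efficiency, giving researchers an effective way to handle large volumes of observation data in the future. We also hope to apply this system architecture to the analysis of all time-sequential data.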
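The suffix-tree query idea described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the thesis: each observation sequence has already been discretized into a string of symbols, the "weight" of a node is simply the number of times its subsequence occurs, and an uncompressed suffix trie stands in for a compressed suffix tree. The names `Node`, `WeightedSuffixTrie`, `insert`, and `find` are hypothetical.

```python
class Node:
    """One trie node: child links, an occurrence count, and owning sequence ids."""

    def __init__(self):
        self.children = {}
        self.weight = 0        # assumption: weight = occurrences of this subsequence
        self.sources = set()   # ids of the sequences that contain this subsequence


class WeightedSuffixTrie:
    """Uncompressed suffix trie with per-node counts (illustrative only)."""

    def __init__(self):
        self.root = Node()

    def insert(self, seq_id, symbols):
        """Insert every suffix of `symbols`, bumping weights along each path."""
        for start in range(len(symbols)):
            node = self.root
            for sym in symbols[start:]:
                node = node.children.setdefault(sym, Node())
                node.weight += 1
                node.sources.add(seq_id)

    def find(self, pattern):
        """Return (occurrence count, source ids) for a contiguous pattern."""
        node = self.root
        for sym in pattern:
            node = node.children.get(sym)
            if node is None:
                return 0, set()
        return node.weight, set(node.sources)


trie = WeightedSuffixTrie()
trie.insert("star-1", "abcab")   # hypothetical discretized light curves
trie.insert("star-2", "bcabc")
count, stars = trie.find("ab")   # count == 3, stars == {"star-1", "star-2"}
```

A lookup costs time proportional to the pattern length, not to the amount of stored data, which is what makes this family of structures attractive for sequence queries. The price is memory: this naive trie stores every suffix explicitly, which hints at why a single machine can be overwhelmed.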

Abstract (English)

With technological advances and the ongoing construction of observatories for the Pan-STARRS project, the volume of observation data has grown explosively. Falling storage costs also allow astronomical researchers to store large amounts of detailed observation data.
The elements of the collected astronomical data are time-sequential, and traditional methods have difficulty handling such data. We therefore use the suffix tree as the prototype of the system's structure to provide astronomical researchers with a fast and efficient data query system. After the analysis finishes, the system can also return approximate patterns.
The suffix tree data structure consumes a great deal of memory, and the amount of astronomical data is very large. Together, these two factors overload a single machine. We therefore use the open source OpenStack system to build a Hadoop cloud platform that forms a distributed environment, so that astronomical data can be processed in a distributed manner and the performance of the system improves.
Processing large amounts of astronomical data on a distributed system reduces the cost of manual data processing and significantly improves efficiency. This gives astronomical researchers a workable solution when they face large volumes of observation data in the future. We hope to use this system architecture to analyze other time-sequential data as well.
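The distributed processing described above can be sketched as a partition, local query, and merge pattern. This is a plain-Python illustration, not Hadoop code and not the MapReduce design from the thesis. The function names, the hash-based sharding, and the linear scan that stands in for each worker's local suffix tree are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def partition(sequences, num_workers):
    """Map-like step: assign each sequence to a worker by hashing its id."""
    shards = defaultdict(dict)
    for seq_id, symbols in sequences.items():
        shard = zlib.crc32(seq_id.encode("utf-8")) % num_workers
        shards[shard][seq_id] = symbols
    return shards


def local_query(shard, pattern):
    """Runs on one worker: count pattern occurrences in its own sequences.

    A real worker would consult its local index; a scan keeps the sketch short.
    """
    hits = {}
    width = len(pattern)
    for seq_id, symbols in shard.items():
        count = sum(
            1
            for i in range(len(symbols) - width + 1)
            if symbols[i:i + width] == pattern
        )
        if count:
            hits[seq_id] = count
    return hits


def merge(partials):
    """Reduce-like step: combine per-worker hits into one result."""
    merged = {}
    for partial in partials:
        merged.update(partial)   # sequence ids are disjoint across shards
    return merged


def distributed_query(sequences, pattern, num_workers=4):
    shards = partition(sequences, num_workers)
    return merge(local_query(shard, pattern) for shard in shards.values())


# Hypothetical discretized sequences
data = {"star-1": "abcab", "star-2": "bcabc", "star-3": "ccc"}
result = distributed_query(data, "ab", num_workers=2)
```

Because each sequence lives on exactly one shard, the merged answer does not depend on the number of workers. Only the memory held by each worker and the work done per worker change, which is the property the abstract relies on.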

Keywords

★ Pan-STARRS
★ Distributed System
★ Data Mining
★ Weighted Suffix Tree

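Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
1 Introduction
1-1 Research Motivation and Objectives
1-2 Research Background
1-3 Thesis Organization
2 Literature Review
2-1 Pan-STARRS
2-2 Variable Stars
2-3 OpenStack
2-4 Apache Hadoop
2-5 The R Language and the Data Preprocessing System
2-6 Data Mining
2-7 Weighted Suffix Tree
3 System Architecture and Workflow
3-1 System Architecture
3-2 System Workflow
4 Research Methods
4-1 Data Preprocessing System
4-1-1 The R Language
4-1-2 Introduction to Data Preprocessing
4-2 Weighted Suffix Tree
4-2-1 Introduction to the Weighted Suffix Tree
4-2-2 Suffix Tree Construction
4-3 Distributed System
4-3-1 System Architecture
4-3-2 OpenStack Architecture
4-3-3 Hadoop Architecture
4-3-4 MapReduce Design
4-4 Distributed Suffix Tree Applications
4-4-1 Building the Suffix Tree and Adding Data
4-4-2 Querying Exactly Matching Sequence Data
4-4-3 Querying Partially Matching Sequence Data
4-4-4 Querying Similar Sequence Data
5 Experimental Results and Discussion
5-1 Experimental Environment and Data Sets
5-2 Exact-Match Query Experiment
5-3 Partial-Match Query Experiment
5-4 Similar-Result Query Experiment
5-5 Overall Comparison of the Query Experiments
6 Conclusion
7 References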

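List of Figures
Figure 1 OpenStack architecture
Figure 2 System architecture
Figure 3 System flowchart
Figure 4 Gamma-ray burst data before and after processing
Figure 5 Variable star light curve before and after processing
Figure 6 Data after preprocessing
Figure 7 Preprocessing example: identical sequence values
Figure 8 Preprocessing example: identical sequence values with length information added
Figure 9 Preprocessing example: identical sequence values with length information added and generalized
Figure 10 Suffix tree example
Figure 11 Overall system architecture
Figure 12 OpenStack system architecture
Figure 13 Hadoop system architecture
Figure 14 Map design example
Figure 15 Reduce design example
Figure 16 Query classification
Figure 17 Simplified code for building the suffix tree and adding data
Figure 18 Simplified code for querying exactly matching sequence data
Figure 19 Example of querying exactly matching sequence data
Figure 20 Simplified code for querying partially matching sequence data
Figure 21 Example of querying partially matching sequence data
Figure 22 Simplified code for querying similar sequence data
Figure 23 Example of querying similar sequence data
Figure 24 Query time for exactly matching sequences
Figure 25 Query time for partially matching sequences
Figure 26 Query time for similar sequences
Figure 27 Memory usage comparison
Figure 28 Query time comparison on a single machine
Figure 29 Query time comparison on four machines
Figure 30 Query time comparison on ten machines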

Advisor

Meng-Feng Tsai (蔡孟峰)

Approval Date

2015-7-28
