Large Scale Sequential Pattern Mining based on Distributed Hierarchical Suffix Tree

Name

Li-Ding Su (蘇立鼎)

Department

Department of Computer Science and Information Engineering

Thesis Title

Large Scale Sequential Pattern Mining based on Distributed Hierarchical Suffix Tree
(基於分散式階層化字尾樹之大量序列資料探勘)

Abstract

Astronomy occupies an important place among the sciences. As observation techniques and hardware have improved in recent years, astronomers can carry out increasingly diverse analyses, while the data recorded by astronomical telescopes has accumulated steadily and has grown to petabyte (PB) scale Big Data. A single machine cannot handle a volume this large, so distributed computing is needed to shorten processing and analysis time.
This thesis proposes a suffix tree system, built on the distributed architecture of Hadoop, that helps astronomers classify variable stars. The system is designed with two distributed computing frameworks, MapReduce and Spark. In the construction stage, it converts a large number of sequences describing how each star's brightness changes over time into a suffix tree, stores the tree in the distributed file system, and supports appending subsequent observation data. The properties of the suffix tree let users run efficient queries. In addition, the query stage introduces a hierarchical concept that adjusts the granularity of the data in the tree. This lets the system find similar sequences that differ only because of observation or calculation errors, and also offer broader, more macroscopic queries that suit different classification schemes. Astronomers can therefore choose the granularity that matches their needs when classifying stars and quickly find the IDs of stars with identical or similar characteristics.
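
The construction and hierarchical query stages described in the abstract can be illustrated with a minimal single-machine sketch. It is not the thesis's implementation: it uses an uncompressed suffix trie instead of a distributed suffix tree, and the integer brightness-change symbols, the `coarsen` merging rule, the `SuffixTrie` class, and the star IDs are all assumptions made for illustration.

```python
def coarsen(symbol: int, level: int) -> int:
    """Merge fine-grained brightness-change symbols into coarser bins.

    Hypothetical rule: level 0 keeps the symbol unchanged; at level k the
    magnitude is ceil-divided by 2**k, so a nonzero change never collapses
    into "no change" (0). The sign is preserved.
    """
    if level == 0:
        return symbol
    width = 2 ** level
    sign = (symbol > 0) - (symbol < 0)
    return sign * ((abs(symbol) + width - 1) // width)


class SuffixTrie:
    """Naive suffix trie (quadratic space) that records star IDs per node."""

    def __init__(self, level: int = 0):
        self.level = level
        self.root = {"children": {}, "ids": set()}

    def add(self, star_id: str, sequence: list) -> None:
        """Insert every suffix of the coarsened sequence, tagging nodes with star_id."""
        symbols = [coarsen(s, self.level) for s in sequence]
        for start in range(len(symbols)):
            node = self.root
            for sym in symbols[start:]:
                node = node["children"].setdefault(
                    sym, {"children": {}, "ids": set()}
                )
                node["ids"].add(star_id)

    def query(self, pattern: list) -> set:
        """Return IDs of stars whose coarsened sequence contains the coarsened pattern."""
        node = self.root
        for sym in (coarsen(s, self.level) for s in pattern):
            node = node["children"].get(sym)
            if node is None:
                return set()
        return set(node["ids"])


# Hypothetical discretized light curves: each integer is a brightness-change bin.
light_curves = {
    "star-A": [1, 3, -2, 0, 1, 3],
    "star-B": [2, 4, -1, 0],
    "star-C": [-1, -3, 2],
}

fine = SuffixTrie(level=0)    # exact symbols
coarse = SuffixTrie(level=1)  # adjacent magnitude bins merged
for star_id, seq in light_curves.items():
    fine.add(star_id, seq)
    coarse.add(star_id, seq)

print(sorted(fine.query([1, 3, -2])))    # ['star-A']
print(sorted(coarse.query([1, 3, -2])))  # ['star-A', 'star-B']
```

At the fine level only star-A contains the pattern exactly. At the coarser level, star-B's slightly different values fall into the same bins and it is returned as well, which mirrors the idea of matching sequences that differ only through small observation or calculation errors.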

Keywords

★ Distributed System
★ Distributed Computing
★ Data Mining
★ Hierarchical Suffix Tree

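Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
1 Introduction
1-1 Research Background and Motivation
1-2 Research Objectives
1-3 Thesis Organization
2 Literature Review
2-1 Pan-STARRS
2-2 Variable Stars
2-3 Hadoop
2-4 MapReduce
2-5 Spark
2-6 Suffix Tree
2-7 Hierarchical Suffix Tree
3 System Architecture and Workflow
3-1 System Environment and Platform Architecture
3-2 Preprocessing Stage
3-3 Construction Stage
3-4 Query Stage
3-5 Appending Subsequent Data to the Tree
4 Methodology
4-1 Data Preprocessing
4-2 Distributed Suffix Tree
4-2-1 Construction of the Distributed Suffix Tree
4-2-2 Encoding of the Suffix Tree
4-3 Hierarchical Suffix Tree
4-3-1 Suffix Tree Level-Conversion Algorithm
4-3-2 Distributed Hierarchical Suffix Tree Query
4-4 Appending Subsequent Observation Data
5 Experiments
5-1 Experimental Environment and Datasets
5-2 System Construction Stage Experiments
5-2-1 Construction-Stage Execution Time: Single-Machine vs. Distributed Environment
5-2-2 Construction-Stage Execution Time: Comparison of the Two Distributed Frameworks
5-3 System Query Stage Experiments
5-3-1 Query-Stage Execution Time without Level Conversion
5-3-2 Query-Stage Execution Time at Level 3
5-3-3 Query-Stage Execution Time at Level 2
5-3-4 Query-Stage Execution Time at Level 1
5-4 Observation Data Appending Experiment
6 Conclusion
7 References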

Advisor

Meng-Feng Tsai (蔡孟峰)

Approval Date

2017-07-18
