Thesis 101522063: Detailed Record




Author: Shiau-Rung Tsui (崔孝戎)    Department: Computer Science and Information Engineering
Thesis Title: Parallel Decision Tree Construction Using Attribute-Key Classification (平行式關鍵屬性區別決策樹演算法)
Related Theses
★ A Splay-Tree-Based Android Binder Driver
★ Applying Incremental Learning to Multi-Crop Classification
★ Detecting Novel Land Parcels in Aerial Images Using Classification-Reconstruction Learning
★ A Dynamic Global Computing Platform Built on a Parallel Job System
★ Garbage Collection of Active Objects Using a Weighted Reference Counting Algorithm
★ A Dynamically Load-Balanced Computing Framework for Maximum Likelihood Estimation
★ Strategies for Dynamic P2P System Reorganization Based on Multiple System Load Metrics
★ A Hadoop-Based Framework for Cloud Application Feature Extraction and Computation Monitoring
★ An Adaptive Computing Model for Large Dynamic Distributed Systems
★ A Cloud Service Platform for Elastic Virtual Data Centers
★ A Resource Management Center for an Elastic Virtual Machine-Room Cloud Service Platform
★ A Dynamically Adaptive Computing Framework for Auto-Provisioning Cloud Systems
★ Heuristic Scheduling Strategies for Linearly Dependent and Independent Jobs
★ An Efficient Distributed Hierarchical Clustering Algorithm for Large Datasets
★ A Multi-Agent Dynamic Adaptive Computing Management Framework for Hybrid Cloud Environments
★ A Graph-Based Parallel Minimum Spanning Tree Clustering Algorithm
  1. The electronic full text of this thesis has been authorized for immediate open access.
  2. The open-access electronic full text is licensed only for personal, non-profit retrieval, reading, and printing for academic research purposes.
  3. Please observe the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the content without authorization.

Abstract (Chinese) The goal of data mining is to discover cluster patterns in data. Decision-tree algorithms such as C4.5, which build a tree to generate decision rules, are a form of supervised learning and are now widely applied across many domains. The way a decision tree is constructed affects its classification performance. Most existing methods fall into two categories: the first searches for the support of attribute-to-class correspondences, and the second uses information gain as the criterion for choosing the split at each level. However, during data splitting these methods not only weaken the dependencies among data attributes but also produce classes that are hard to distinguish, increasing the chance of error. This study therefore proposes the key-attribute-discriminating decision tree algorithm: at the cost of slightly longer analysis time, it automatically detects, among all attributes, the one that best discriminates the clustered classes, and effectively improves classification accuracy by regrouping the data. It also effectively reduces the execution time and storage costs of constructing a decision tree over data with a large number of records.
In addition, this study implements parallel versions of the proposed algorithm. One implementation uses the Message Passing Interface (MPI) library to partition and schedule both the data and the tasks, so that the computation can finish in as short a time as possible. The other implements the partitioning and scheduling of computing tasks with Hadoop MapReduce, so that each task can process data attributes concurrently. We use these two frameworks to analyze and compare the suitability of the algorithm. The experimental results show that the proposed algorithm outperforms well-known existing classification algorithms in both execution time and accuracy on average. For data with a large number of records, our Hadoop implementation builds the tree more efficiently in the first half of the process, while the MPI implementation is faster in the second half; both are suitable for analyzing data with a large number of records.
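As one way to picture the task scheduling just described, here is a minimal Python sketch of a greedy workload balancer that always hands the next task to the least-loaded host. The thesis credits an EDF-based heuristic; this simplified earliest-finish-time variant, with made-up names and estimated task costs, is only meant to show the balancing idea, not the actual implementation.

    import heapq

    def assign_tasks(task_costs, num_hosts):
        """Greedily hand each task to the host that currently finishes earliest.

        `task_costs` are estimated per-task run times; both the estimates and
        this earliest-finish-time rule are illustrative stand-ins for the
        EDF-based heuristic the thesis describes.
        """
        heap = [(0.0, host) for host in range(num_hosts)]  # (current load, host id)
        heapq.heapify(heap)
        assignment = {host: [] for host in range(num_hosts)}
        # Placing the largest tasks first keeps the greedy result well balanced.
        for task, cost in sorted(enumerate(task_costs), key=lambda t: -t[1]):
            load, host = heapq.heappop(heap)
            assignment[host].append(task)
            heapq.heappush(heap, (load + cost, host))
        return assignment

    # Example: five tasks with uneven costs spread across two hosts.
    print(assign_tasks([5, 3, 8, 2, 7], 2))  # {0: [2, 1, 3], 1: [4, 0]}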
Abstract (English) In machine learning, a classifier is able to assign new observed instances to a set of categories, on the basis of a training set of instances whose category membership is known. Decision-tree-based algorithms are popular classifiers in many application domains because they are simple to understand and interpret. To construct a decision tree, a typical decision-tree-based algorithm chooses the attribute of the training dataset that most effectively splits the training dataset into subsets enriched in one class or the other, and then iteratively splits the subsets until the instances of each subset belong to one class. There are two popular approaches to splitting the training dataset into subsets: searching the support confidence of classes and using the information gain. A common weakness of the two splitting methods is that they may not achieve good performance for a dataset with several correlated attributes. To overcome this problem, we propose a new decision-tree algorithm based on the key values of an attribute. In our algorithm, each key value of an attribute is used to split off a subset whose instances all have the same key value; the intersection of every pair of the key-value subsets is empty; the remaining instances that are hard to distinguish are put in one subset. The proposed algorithm automatically detects the best attribute that can distinguish different class sets, instead of classes, and uses the attribute to split the dataset level by level. The proposed algorithm improves the correctness of data classification and reduces the space required to store the nodes of a decision tree. The experimental results show that the proposed algorithm is generally better in classification accuracy than existing decision-tree-based classifiers such as C4.5. Moreover, it is generally better in classification accuracy than other types of classification algorithms, such as SVM, logistic regression, and Naïve Bayes.
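To make the key-value splitting rule concrete, here is a minimal Python sketch under our reading of the abstract: attribute values whose instances all share one class become disjoint key-value subsets, and the hard-to-distinguish remainder stays in a single subset. The function names and the selection criterion (key-value coverage) are assumptions, not the thesis's actual code.

    from collections import defaultdict

    def split_by_key_values(dataset, attribute):
        """Partition `dataset` on `attribute`.

        Values whose instances all share one class become disjoint key-value
        subsets; instances with ambiguous values form the single remainder
        subset, to be split again at the next level of the tree.
        `dataset` is a list of (features: dict, label) pairs.
        """
        by_value = defaultdict(list)
        for features, label in dataset:
            by_value[features[attribute]].append((features, label))

        key_subsets, remainder = {}, []
        for value, subset in by_value.items():
            labels = {label for _, label in subset}
            if len(labels) == 1:   # a "key" value: it pins down a single class
                key_subsets[value] = subset
            else:                  # hard to distinguish: defer to deeper levels
                remainder.extend(subset)
        return key_subsets, remainder

    def pick_best_attribute(dataset, attributes):
        """Pick the attribute whose key values cover the most instances
        (an assumed stand-in for the thesis's selection criterion)."""
        def coverage(attribute):
            key_subsets, _ = split_by_key_values(dataset, attribute)
            return sum(len(subset) for subset in key_subsets.values())
        return max(attributes, key=coverage)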
To handle big data, we also propose two new parallel algorithms based on the proposed decision-tree algorithm: one is implemented in MPI and the other in Hadoop MapReduce. In the MPI implementation, we design a heuristic workload balancer based on the EDF scheduling algorithm to balance the workload among all computing hosts, in order to shorten the total execution time. In the Hadoop implementation, we use an attribute-based parallelization strategy to shorten the total execution time. Both parallel implementations show good scalability in our experiments. The MPI implementation generally has a shorter execution time than the Hadoop implementation in our experiments. However, the Hadoop implementation outperforms the MPI implementation on datasets with a relatively large number of attributes.
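To illustrate the attribute-based strategy, the following plain-Python sketch mimics one MapReduce round: each map task scores a single attribute, and the reduce step keeps the best one. The coverage-based score and all function names are illustrative assumptions, not the thesis's actual Hadoop job.

    from operator import itemgetter

    def map_score_attribute(dataset, attribute):
        """Map task: score one attribute by the fraction of instances whose
        attribute value maps to a single class (an assumed diversity measure)."""
        value_labels = {}
        for features, label in dataset:
            value_labels.setdefault(features[attribute], set()).add(label)
        covered = sum(1 for features, _ in dataset
                      if len(value_labels[features[attribute]]) == 1)
        return attribute, covered / len(dataset)

    def reduce_pick_best(scored_attributes):
        """Reduce step: keep the highest-scoring attribute."""
        return max(scored_attributes, key=itemgetter(1))

    # On Hadoop each attribute would be scored by its own map task on a
    # separate node; plain function calls stand in for that here.
    data = [({"color": "red", "size": "S"}, "A"),
            ({"color": "red", "size": "M"}, "A"),
            ({"color": "blue", "size": "M"}, "B")]
    scores = [map_score_attribute(data, attr) for attr in ("color", "size")]
    print(reduce_pick_best(scores))  # ('color', 1.0)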
Keywords (Chinese) ★ Data mining
★ Classifier
★ Decision tree algorithm
★ MapReduce
★ MPI
Keywords (English) ★ Data Mining
★ Classification
★ Decision Tree Algorithm
★ MapReduce
★ MPI
Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1-1 Data Mining
1-2 Motivation
1-3 A Generalized Decision Logic Language for Granular Computing Fuzzy Systems
1-4 Contributions
1-5 Thesis Organization
Chapter 2: Background
2-1 Decision Tree Algorithms
2-2 C4.5
2-2-1 C4.5 Nodes and Tree-Building Method
2-2-2 C4.5 Handling of Continuous Values
2-3 CART (Classification and Regression Tree)
2-4 Inducing Decision Trees with an Ant Colony Optimization Algorithm
Chapter 3: Research Framework
3-1 The Key-Attribute-Discriminating Decision Tree Algorithm
3-2 Handling of Numeric Values in the Decision Tree
Chapter 4: MPI-Based Parallel Key-Attribute Decision Tree Algorithm
4-1 MPI-Based Parallel Key-Attribute Decision Tree Algorithm
4-1-1 MPI (Message Passing Interface)
4-1-2 Diversity Calculation
4-1-3 Data Partition
4-1-4 Workload Balancer for Task-Based Parallelization
4-1-5 Node-Merging Policy
4-2 Hadoop-Based Parallel Key-Attribute Decision Tree Algorithm
4-2-1 Hadoop
4-2-2 Attribute-Based Parallelization
4-2-3 Workload Balancer for Attribute-Based Parallelization
4-3 Discussion
Chapter 5: Experiment Design and Result Analysis
5-1 Experiment Background
5-2 System Environment Setup
5-2-1 Single-Machine Tests
5-2-2 MPICH2
5-2-3 Hadoop
5-3 Environment Test Data Analysis and Comparison
5-4 Test Data Analysis and Comparison
Chapter 6: Conclusions and Future Work
References
References
M. Chi, R. Feng, L. Bruzzone, “Classification of hyperspectral remote-sensing data with primal SVM for small-sized training dataset problem,” Advances in Space Research, Vol. 41, Issue 11, pp. 1793-1799, 2008.
D.C. Li, Y.H. Fang, “An algorithm to cluster data for efficient classification of support vector machines,” Expert Systems with Applications, Vol. 34, Issue 3, pp. 2013-2018, 2008.
M.R. Saleh, M.T. Martín-Valdivia, A. Montejo-Ráez, L.A. Ureña-López, “Experiments with SVM to classify opinions in different domains,” Expert Systems with Applications, Vol. 38, Issue 12, pp. 14799-14804, 2011.
E. Youn, M.K. Jeong, “Class dependent feature scaling method using naive Bayes classifier for text datamining,” Pattern Recognition Letters, Vol. 30, Issue 5, pp. 477-485, 2009.
S.H. Lu, D.A. Chiang, H.C. Keh, H.H. Huang, “Chinese text classification by the Naïve Bayes Classifier and the associative classifier with multiple confidence threshold values,” Knowledge-Based Systems, Vol. 23, Issue 6, pp. 598-604, 2010.
D. Soria, J.M. Garibaldi, F. Ambrogi, E.M. Biganzoli, I.O. Ellis, “A ‘non-parametric’ version of the naive Bayes classifier,” Knowledge-Based Systems, Vol. 24, Issue 6, pp. 775-784, 2011.
S. Mukherjee, N. Sharma, “Intrusion Detection using Naive Bayes Classifier with Feature Reduction,” Procedia Technology, Vol. 4, pp. 119-128, 2012.
J.R. Quinlan, “Induction of decision trees,” Machine Learning, Vol. 1, Issue 1, pp. 81-106, 1986.
J.R. Quinlan, “C4.5: Programs for Machine Learning,” Morgan Kaufmann, 1993.
J.R. Quinlan, “Improved use of continuous attributes in C4.5,” Journal of Artificial Intelligence Research, Vol. 4, pp. 77-90, 1996.
D.R. Carvalho, A.A. Freitas, “A hybrid decision tree/genetic algorithm method for data mining,” Information Sciences, Vol. 163, Issues 1–3, pp. 13-35, 2004.
L. Rokach, O. Maimon, “Top-down induction of decision trees classifiers: a survey,” IEEE Transactions on Systems, Man and Cybernetics, Vol. 35, Issue 4, pp. 476–487, 2005.
F.E.B. Otero, A.A. Freitas, C.G. Johnson, “Inducing decision trees with an ant colony optimization algorithm,” Applied Soft Computing, Vol. 12, Issue 11, pp. 3615-3626, 2012.
“Welcome to Apache Hadoop!,” http://hadoop.apache.org/.
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, “From Data Mining to Knowledge Discovery in Databases,” AI Magazine, Vol. 17, No. 3, pp. 37-54, 1996.
ACM SIGKDD, "Data Mining Curriculum: A Proposal," 2006.
C. Clifton, “Encyclopedia Britannica: Definition of Data Mining,” 2010.
I. Bratko, S. Muggleton, “Applications of Inductive Logic Programming,” Communications of the ACM, Vol. 38, No. 11, pp. 65-70, 1995.
J. Han, M. Kamber, “Data Mining: Concepts and Techniques,” Morgan Kaufmann, 2001.
S. Russell, P. Norvig, “Artificial Intelligence: A Modern Approach,” Prentice Hall, 2003.
L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, “Classification and Regression Trees,” Wadsworth, 1984.
Y.Y. Yao, “A Generalized Decision Logic Language for Granular Computing Fuzzy Systems,” FUZZ-IEEE’02 in The 2002 IEEE World Congress on Computational Intelligence, pp. 1092-1097, 2002.
A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, “PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Network Parallel Computing,” MIT Press, 1994.
“OpenMP.org,” http://openmp.org/wp/.
“Open MPI: Open Source High Performance Computing,” http://www.open-mpi.org/.
M. E. Zosel, “High performance Fortran: an overview,” Compcon Spring, Digest of Papers, pp. 132-136, 1993.
“CUDA Zone,” http://www.nvidia.com/object/cuda_home_new.html.
R. Khare, D. Cutting, K. Sitaker, A. Rifkin, “Nutch: A flexible and scalable open-source web search engine,” Oregon State University, 2004.
J. Dean, S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the ACM, Vol. 51, No. 1, pp. 107-114, 2008.
S. Ghemawat, H. Gobioff, S.T. Leung, “The Google file system,” ACM SIGOPS Operating Systems Review, Vol. 37, No. 5, pp. 29-43, 2003.
Apache Hadoop project. (a). Hadoop MapReduce.
Apache Hadoop project. (b). HBase.
Apache Hadoop project. (c). HDFS.
Apache Hadoop project. (d). Hive.
Apache Hadoop project. (e). Pig.
“UCI machine learning repository,” http://archive.ics.uci.edu/ml/index.html.
“Keel: Knowledge extraction based on evolutionary learning,” http://sci2s.ugr.es/keel/index.php.
Advisor: Wei-Jen Wang (王尉任)    Date of Approval: 2013-8-23
