Abstract: | Data mining aims to discover cluster patterns in data. Decision-tree algorithms such as C4.5, which build a tree to generate decision rules, are supervised learning methods that are now widely applied across many domains. How a decision tree is constructed affects its performance. Most existing methods fall into two categories: the first searches for the support of attributes with respect to classes, and the second uses information gain as the criterion for choosing the split at each level of the tree. However, during the splitting process these methods not only weaken the dependencies among attributes but also produce subsets whose classes are hard to distinguish, increasing the chance of misclassification. This study therefore proposes a key-attribute-discriminating decision-tree algorithm that, at the cost of slightly longer analysis time, automatically detects the attribute that best distinguishes the cluster classes among all attributes of the data and, by re-classifying the data, effectively improves classification accuracy as well as the execution time and storage cost of building a decision tree on large-scale datasets. In addition, this study implements parallel versions of the proposed algorithm: one uses the Message Passing Interface (MPI) library to partition and schedule the data and tasks so that the computation finishes in as short a time as possible, and the other uses Hadoop MapReduce to partition and schedule the computation tasks so that each task can process data attributes concurrently. We analyze and compare the suitability of the algorithm on these two architectures. Experimental results show that, on average, the proposed algorithm outperforms well-known existing classification algorithms in both execution time and accuracy. For large-scale datasets, our Hadoop implementation builds the tree faster in the first half of the process, while the MPI implementation is faster in the second half; both are suitable for analyzing data with a large number of records.

In machine learning, a classifier is able to assign newly observed instances to a set of categories on the basis of a training set of instances whose category membership is known. Decision-tree-based algorithms are popular classifiers in many application domains because they are simple to understand and interpret. To construct a decision tree, a typical decision-tree-based algorithm chooses the attribute of the training dataset that most effectively splits it into subsets enriched in one class or another, and then iteratively splits the subsets until the instances of each subset belong to one class. There are two popular approaches to splitting the training dataset into subsets: searching the support confidence of classes and using the information gain. A common weakness of these two splitting methods is that they may not achieve good performance on a dataset with several correlated attributes. To overcome this problem, we propose a new decision-tree algorithm based on the key values of an attribute. In our algorithm, each key value of an attribute is used to split off a subset whose instances all share that key value; the intersection of every pair of key-value subsets is empty, and the remaining instances that are hard to distinguish are put into one subset.
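The key-value splitting idea described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: it assumes a value of an attribute counts as a "key value" when every instance carrying that value belongs to a single class, so each key value yields its own pairwise-disjoint subset and the ambiguous instances form one remainder subset. All names and the data layout are illustrative.

```python
from collections import defaultdict

def split_by_key_values(instances, attr):
    """Split a dataset on the key values of one attribute (illustrative sketch).

    Each instance is a (features_dict, class_label) pair. A value of `attr`
    is treated as a key value when every instance carrying that value has
    the same class; each key value yields its own subset (so any two
    key-value subsets are disjoint), and the remaining hard-to-distinguish
    instances are collected into a single remainder subset.
    """
    by_value = defaultdict(list)
    for inst in instances:
        by_value[inst[0][attr]].append(inst)

    key_subsets, remainder = {}, []
    for value, group in by_value.items():
        classes = {label for _, label in group}
        if len(classes) == 1:        # value perfectly determines the class
            key_subsets[value] = group
        else:                        # ambiguous value -> remainder subset
            remainder.extend(group)
    return key_subsets, remainder

data = [({"color": "red"}, "A"), ({"color": "red"}, "A"),
        ({"color": "blue"}, "B"), ({"color": "green"}, "A"),
        ({"color": "green"}, "B")]
keys, rest = split_by_key_values(data, "color")
# "red" and "blue" are key values; the two "green" instances are the remainder
```

In a full algorithm, this split would be applied level by level, recursing on the remainder subset with the next best attribute.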
The proposed algorithm automatically detects the best attribute that can distinguish different class sets, instead of individual classes, and uses that attribute to split the dataset level by level. The proposed algorithm improves the correctness of data classification and reduces the space required to store the nodes of a decision tree. The experimental results show that the proposed algorithm is generally better in classification accuracy than existing decision-tree-based classifiers such as C4.5. Moreover, it is generally more accurate than other types of classification algorithms, such as SVM, logistic regression, and Naïve Bayes. To handle big data, we also propose two new parallel algorithms based on the proposed decision-tree algorithm: one is implemented in MPI and the other in Hadoop MapReduce. In the MPI implementation, we design a heuristic workload balancer based on the EDF (Earliest Deadline First) scheduling algorithm to balance the workload among all computing hosts and thereby shorten the total execution time. In the Hadoop implementation, we use an attribute-based parallelization strategy to shorten the total execution time. Both parallel implementations show good scalability in our experiments. The MPI implementation generally has shorter execution time than the Hadoop implementation; however, the Hadoop implementation outperforms the MPI implementation on datasets with a relatively large number of attributes. |
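The EDF-inspired workload balancing mentioned for the MPI implementation can be sketched as follows. This is a hedged illustration, not the thesis's actual balancer: it assumes each task carries a deadline and an estimated cost, considers tasks in earliest-deadline-first order, and greedily assigns each to the currently least-loaded host. The names and the cost model are assumptions introduced for this sketch.

```python
import heapq

def edf_balance(tasks, n_hosts):
    """Heuristic workload balancing sketch inspired by EDF scheduling.

    `tasks` is a list of (deadline, cost) pairs. Tasks are taken in
    earliest-deadline-first order, and each is assigned to the host with
    the smallest accumulated load, approximating a balanced total
    execution time across all computing hosts.
    """
    hosts = [(0.0, h) for h in range(n_hosts)]   # (accumulated load, host id)
    heapq.heapify(hosts)
    assignment = {h: [] for h in range(n_hosts)}
    for deadline, cost in sorted(tasks):         # earliest deadline first
        load, h = heapq.heappop(hosts)           # least-loaded host so far
        assignment[h].append((deadline, cost))
        heapq.heappush(hosts, (load + cost, h))
    return assignment

tasks = [(3, 5.0), (1, 2.0), (2, 4.0), (5, 1.0)]
plan = edf_balance(tasks, 2)
# host 0 receives the tasks with deadlines 1 and 3; host 1 those with 2 and 5
```

The greedy least-loaded-host rule is a common load-balancing heuristic; an actual MPI implementation would additionally account for communication cost when partitioning data among hosts.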