Abstract: Among the many data classification techniques, the decision tree is a highly popular classification method, mainly because the rules mined by a decision tree classifier are comparatively easy to read. Traditional decision tree research has focused mostly on maximizing classification accuracy and minimizing classification error. In practice, however, applying decision trees to real managerial settings often involves special considerations and constraints; if no algorithm is designed for those constraints, the resulting trees cannot be applied. Below we introduce three practical situations, each a constraint that may arise when decision trees are applied in practice. Because no suitable algorithm currently exists for building decision trees in these situations, they motivate the three-year research of this project.

(1) Situations where the size of the decision tree must be limited: The decision tree is a common classification technique because it is easy to understand and computationally efficient. However, noise and special cases in the training data often make the tree large and heavily branched, producing too many rules to understand and apply, which reduces the usability of decision trees. This research therefore limits the number of leaf nodes of the tree to k, controlling the number of rules produced while pursuing the highest accuracy within the user-specified leaf-node budget. We will develop a new algorithm that merges branches of the decision tree with the agglomerative method of hierarchical clustering and restricts the tree to a binary tree so that the number of nodes can be controlled; the research will then carry out experiments on real data.

(2) Situations where time and accuracy must be considered as dual objectives: In many applications, data carry time stamps indicating when each item was obtained. For example, suppose we predict the likely stock price one month ahead from past stock-related indicators. These indicators clearly become available at different times, and the earlier we can make the one-month-ahead prediction, the more profit-taking actions remain open to us, whereas the later the prediction, the fewer the opportunities. When building decision trees from such time-stamped data, the traditional approach is to build a time-constrained tree: a time threshold is set, for example that the one-month-ahead price must be predicted at least one week in advance, and a tree is built under that constraint to maximize classification accuracy. Past research could therefore only search for good solutions at a limited number of time points rather than seek the overall optimum. This research differs from previous work in that we treat time as one objective and classification accuracy as the other, and we aim to propose a decision tree construction method that optimizes both objectives simultaneously.

(3) Situations where decision trees must be built from rules: Standard decision tree algorithms build trees directly from data; this practice is long established and widely accepted. In some special situations, however, the difficulty is that we have rules but no data, so a tree cannot be built from data. We could of course apply the rules directly for classification, but rule-based classification has drawbacks: a test instance often matches no rule, or matches several rules whose predicted classes conflict, so a complex conflict-resolution mechanism must be built in addition, and such mechanisms, as well as the meaning of the conflicts themselves, are often hard for decision makers to understand. In contrast, decision trees are simple, easy to use, easy to understand, and free of classification conflicts. For these reasons, this research attempts to propose a method for building decision trees from rules.

One important type of knowledge that can be obtained from data mining is the decision tree (DT), which is constructed from existing data to classify future data. Since DTs are human-interpretable, well organized, computationally inexpensive, and capable of dealing with noisy data, the technique has become a powerful and popular classification tool in various domains. Basically, the aim of decision tree learning is to construct a tree model that maximizes classification accuracy or minimizes classification error. However, in some practical applications that involve special constraints or considerations, traditional decision tree building algorithms fail to build usable decision trees for classification. This research gap indicates the need to design new algorithms that can construct decision trees in these situations. Over the next three years, this project aims to propose three decision tree construction algorithms for three such practical situations.

(1) Building decision trees with no more than K nodes: The decision tree is one of the most popular classification technologies. However, a weakness of decision trees is that if the tree size is not restricted, the tree produced by a construction algorithm may be so large that it becomes difficult to understand and use. In light of this problem, this research intends to build decision trees with maximum classification accuracy under the constraint that the number of nodes in a tree is no more than K. The first year of this project will therefore propose a new tree building algorithm that does so by employing the agglomerative clustering approach to combine the branches of the decision tree.
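As a point of reference for this first-year goal, the following is a minimal sketch, assuming Python and scikit-learn (neither is specified in this proposal), of how a user-supplied leaf-node budget K can already be expressed in an off-the-shelf learner. The agglomerative branch-merging algorithm proposed here is a different, to-be-developed approach; the dataset and the value K = 4 are purely illustrative.

    # Illustrative only: limit tree size with scikit-learn's max_leaf_nodes,
    # then report the resulting leaf count and test accuracy.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    K = 4  # user-specified upper bound on the number of leaf nodes
    clf = DecisionTreeClassifier(max_leaf_nodes=K, random_state=0)
    clf.fit(X_train, y_train)

    print("leaves:", clf.get_n_leaves())                      # never exceeds K
    print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))

This off-the-shelf constraint grows the tree greedily; the project instead proposes to merge branches after the fact, which is why a new algorithm is needed.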
(2) Building decision trees with two objectives: In many real-world applications, classifying an object involves more than one objective, e.g., time and classification accuracy. For example, if our target is to predict the stock price one month ahead, then two objectives, time and accuracy, are involved. On one hand, it is better to make the prediction as early as possible; on the other hand, the prediction should be as accurate as possible. Although some previous studies have incorporated the time factor into decision trees, those approaches maximize classification accuracy under a time constraint. To our knowledge, no previous research has considered how to optimize the two objectives, time and accuracy, simultaneously. This research therefore intends to develop a new decision tree construction algorithm that optimizes both objectives at the same time.

(3) Building decision trees from rules: In practice, we may encounter situations in which we have rules but no data. For example, because of security concerns no one may be allowed to read the data, yet the rules discovered from the data may be disclosed for use. Another example is when the rules are collected from experts and we hope to build a decision tree model from these expert rules. To the best of our knowledge, the issue of building a decision tree from rules rather than from data has seldom been addressed in existing research. Therefore, in the third year we will develop an algorithm that can build decision trees from rules; a toy sketch of this problem setting appears at the end of this abstract.

Research period: 10008 ~ 10107
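To make the third-year setting concrete, below is a toy sketch in Python (a language choice not made in the proposal) that builds a small decision tree directly from a hypothetical rule set, using a naive greedy attribute choice. It only illustrates the problem of turning rules into a conflict-free tree; it is not the algorithm this project will develop, and the rules and attribute names are invented for illustration.

    # Toy sketch: build a decision tree from rules (condition dict, class label),
    # not from data. Not the project's proposed algorithm.
    from collections import Counter

    RULES = [  # hypothetical expert rules
        ({"outlook": "sunny", "humidity": "high"}, "no"),
        ({"outlook": "sunny", "humidity": "normal"}, "yes"),
        ({"outlook": "overcast"}, "yes"),
        ({"outlook": "rain", "wind": "strong"}, "no"),
        ({"outlook": "rain", "wind": "weak"}, "yes"),
    ]

    def build_tree(rules):
        labels = [label for _, label in rules]
        attrs = {a for conds, _ in rules for a in conds}
        # Leaf when all remaining rules agree or no attribute is left to test.
        if len(set(labels)) == 1 or not attrs:
            return Counter(labels).most_common(1)[0][0]  # majority class
        # Greedy choice: split on the attribute mentioned by the most rules.
        attr = max(attrs, key=lambda a: sum(a in conds for conds, _ in rules))
        node = {"attr": attr, "children": {}}
        for v in {conds[attr] for conds, _ in rules if attr in conds}:
            # Keep rules consistent with attr == v; drop the satisfied test.
            subset = [({k: x for k, x in conds.items() if k != attr}, label)
                      for conds, label in rules if conds.get(attr, v) == v]
            node["children"][v] = build_tree(subset)
        return node

    def classify(tree, example):
        # Walk internal nodes until a leaf (class label string) is reached.
        # An attribute value unseen in the rules would raise a KeyError here.
        while isinstance(tree, dict):
            tree = tree["children"][example[tree["attr"]]]
        return tree

    tree = build_tree(RULES)
    print(classify(tree, {"outlook": "rain", "wind": "weak"}))  # -> "yes"

Because every test instance follows exactly one root-to-leaf path, the tree form avoids the unmatched-rule and conflicting-rule problems described in situation (3).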