Conjecturable Rules Discovery by Clustering-Classification Hybrid Approach

Name

許武先 (Wu-hsien Hsu)

Department

Department of Information Management

Thesis Title

Conjecturable Rules Discovery by Clustering-Classification Hybrid Approach
(結合分類分群技術建立推測法則之研究)

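Related Theses

★ A Study of Business Intelligence in the Retail Industry	★ Building an Anomaly Detection System for Fixed-Line Telephone Calls
★ Applying Data Mining Techniques to the Analysis of School Grades and Scholastic Ability Test Results: The Case of a Vocational High School Food and Beverage Management Program	★ Using Data Mining Techniques to Improve Wealth Management Effectiveness: A Case Study of a Bank
★ Evaluation and Analysis of Wafer Manufacturing Yield Models: The Case of a Domestic DRAM Fab	★ A Study of Business Intelligence Analysis Applied to Student Grades
★ Constructing a Prediction Model of Academic Achievement for Upper-Grade Elementary School Students Using Data Mining Techniques	★ Building a Motorcycle Loan Risk Assessment Model Using Data Mining Techniques: The Case of Company A
★ Applying Performance Indicator Evaluation to Improve R&D Design Quality Assurance	★ Improving Hiring Quality with Machine Learning Based on Text Résumés and Personality Traits
★ A General Framework Based on a Relational Genetic Algorithm for Set Partitioning Problems with Constraint Handling	★ Generalized Knowledge Discovery in Relational Databases
★ Decision Tree Construction Considering Delays in Acquiring Attribute Values	★ A Method for Finding Preference Graphs from Sequence Data: Application to the Group Ranking Problem
★ Finding Consensus Groups with a Partitional Clustering Algorithm to Solve Group Decision Problems	★ A Novel Ordered Consensus Group Approach Applied to Group Decision Problems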

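The author has agreed to immediate open access for this electronic thesis.
Once the full text is open, users are authorized only to search, read, and print it for personal, non-profit academic research purposes.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without permission, to avoid violating the law.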

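Abstract (Chinese)

The main purpose of data mining is to uncover hidden or unknown knowledge. Classification techniques analyze training data that carry class labels and build rules that can later be used to classify new data. If the dataset has no known class labels, however, classification cannot be applied. Clustering techniques, in contrast, divide unlabeled data into several groups according to the similarity between data points; because the members of each cluster have highly similar attribute values, each cluster can be regarded as a concept. Although clustering can divide unlabeled data into a specific number of concepts, it does not, unlike classification, retain the rules behind the grouping for future conjecture.
Here, "conjecture" means analyzing two different sets of attributes in a dataset that is unfamiliar or that cannot supply class labels, in the hope of uncovering the relationship between the two attribute sets and thereby establishing conjecturable rules.
This study extends earlier work and proposes a new method to discover recessive rules and improve conjecture accuracy. It uses classification to build a decision tree that serves as the set of conjecturable rules, and uses clustering to overcome the lack of labels. Applying fuzzy theory and outlier handling yields significant results both in discovering recessive rules and in improving accuracy. Experimental results show that the proposed method can effectively establish conjecturable rules, and that the rules it discovers can make up for the shortcomings of earlier methods.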

Abstract (English)

Discovering hidden or unknown knowledge is the major theme of most data mining studies. In this dissertation, we propose a new approach to discovering conjecturable rules, which categorize the observations of a data set into classes of similar attribute values instead of classes with crisp labels. The proposed approach builds on the two most developed data mining techniques: classification and clustering.
Classification is the problem of identifying the sub-population to which new observations belong. The decision follows a set of rules discovered from a training set of observations whose sub-population is known. The technique is known as supervised learning, i.e., pre-defined labels are necessary for the process. The result is a set of rules able to predict which label a new observation belongs to. However, when the dataset contains no labels, this technique cannot be applied. Clustering, on the other hand, is the process of grouping a set of objects into classes of similar objects. It requires no pre-defined labels and is known as unsupervised learning. Yet no rule is preserved after the process for future prediction.
The objective of this dissertation is to discover conjecturable rules from datasets that have no predefined class labels. Furthermore, the technique extends our two previous studies with fuzzy concepts and outlier handling. Thus recessive conjecturable rules can be discovered and accuracy is improved. The proposed technique combines the convenience of unsupervised learning with the predictive ability of decision trees. The experimental results show that the proposed approach is capable of discovering conjecturable rules as well as recessive rules. A sensitivity analysis is also given for practitioners' reference.
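The general idea described in the abstract, clustering unlabeled records on one attribute set and then learning a tree-style rule on another attribute set that predicts the cluster, can be illustrated with a minimal sketch. This is not the dissertation's TASC or FuzzClu_Tree algorithm: the data, the function names kmeans_1d and best_stump, and the use of a crisp 1-D k-means with a one-level decision tree are all assumptions made purely for illustration.

```python
from collections import Counter

def kmeans_1d(values, k=2, iters=20):
    """Tiny 1-D k-means (illustrative only); returns a cluster index per value."""
    srt = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign every value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def best_stump(xs, labels):
    """One-level decision tree: the threshold on x that best predicts the cluster."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate rules of the form "x <= t"
        left = [lab for x, lab in zip(xs, labels) if x <= t]
        right = [lab for x, lab in zip(xs, labels) if x > t]
        l_lab, l_hits = Counter(left).most_common(1)[0]
        r_lab, r_hits = Counter(right).most_common(1)[0]
        acc = (l_hits + r_hits) / len(xs)
        if best is None or acc > best[0]:
            best = (acc, t, l_lab, r_lab)
    return best

# Hypothetical records: (age, monthly spending). There are no class labels.
records = [(22, 110), (25, 95), (27, 120), (31, 105),
           (45, 480), (52, 510), (58, 495), (63, 530)]
ages = [a for a, _ in records]
spending = [s for _, s in records]

# Step 1: cluster on the second attribute set (spending) to obtain "concepts".
labels = kmeans_1d(spending, k=2)

# Step 2: learn a rule on the first attribute set (age) that predicts the concept.
acc, threshold, low_cluster, high_cluster = best_stump(ages, labels)
print(f"if age <= {threshold}: cluster {low_cluster}, else: cluster {high_cluster} "
      f"(training accuracy {acc:.2f})")
```

The learned threshold plays the role of a rule that is kept after clustering, so a new record with a known age but unknown spending can be assigned to a spending concept. A fuzzy variant would replace the hard cluster assignment with membership degrees, which this sketch does not attempt.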

Keywords (Chinese)

★ Data Mining
★ Classification
★ Clustering
★ Conjecturable Rules
★ Decision Tree
★ Numerical Analysis
★ Fuzzy Theory

Keywords (English)

★ Data Mining
★ Cluster Analysis
★ Conceptual Cluste

Table of Contents

Chinese Abstract
Abstract
Acknowledgements
Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Organization of the Dissertation
Chapter 2 Related Works
2.1 Data Mining
2.2 Classification
2.3 Cluster Analysis
2.4 Fuzzy Clustering
2.5 Conjecturable Rules Discovery
Chapter 3 Recapturing: Conjecturable Rules Discovery
3.1 TASC
3.1.1 Problem Definition
3.1.2 TASC Algorithm
3.1.2.1 Two Measures of Fitness
3.1.2.2 Minimum Entropy Partitioning (MEP)
3.1.2.3 Equal-Width Binary Partitioning (EWP)
3.1.2.4 Equal-Depth Binary Partitioning (EDP)
3.2 Tree-based Clustering
3.2.1 Problem Definition
3.2.2 Attributes
3.2.3 Clus-Tree
3.2.4 k-nearest-neighbors Graph
3.2.5 Similarity Function
3.2.6 Satisfactory Vector
3.2.7 Tree-based Clustering Algorithm
3.2.7.1 Parameters
3.2.7.2 Clus-Tree Algorithm
3.3 Discussion on Previous Studies
Chapter 4 Fuzzy Tree-based Clustering
4.1 Problem Definition
4.2 FuzzClu_Tree Algorithm
4.3 Performance Evaluation and the Result
4.3.1 Performance Evaluation and Sensitivity Analysis
4.3.2 Real Dataset Result
4.3.3 Comparison with an Alternative Method
4.3.4 Discussion
Chapter 5 Conclusions and Implications
5.1 Conclusions
5.2 Implications for Academic Researchers
5.3 Implications for Business Practitioners
5.4 Future Works
References
Appendix: Synthetic Data Generation

Advisor

陳彥良 (Yen-liang Chen)

Approval Date

2011-6-30
