Abstract (English)
Clustering analysis is an important task in data mining. Clustering techniques group similar objects together, which helps in organizing and managing them. However, most of these techniques share two shortcomings: (1) they cannot predict the cluster of a new object, and (2) they cannot give a clear semantic description of each cluster.
In [Liu et al., 3], a decision tree called CLTree is built, based on decision trees from classification, to represent a clustering result. That technique uses the same attribute set both for partitioning the dataset and for constructing the decision tree. In practice, however, these two attribute sets may differ. [Lee, 1] proposed an improved technique, the Semantic Tree, which allows different attribute sets for clustering and for partitioning, giving the technique a wider range of applications.
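The core idea shared by these tree-based techniques is to fit a split on cluster labels so that each cluster receives a readable rule. The following is a minimal sketch of that idea using a single-attribute decision stump; it is a hypothetical simplification for illustration, not the actual CLTree or Semantic Tree construction described in [3] and [1].

```python
# Sketch: derive a readable rule that separates two clusters by
# choosing the single-attribute threshold with the fewest
# misclassifications (a one-level decision stump).

def best_split(values, labels):
    """Return (threshold, errors) for the split that best
    separates the cluster labels on one numeric attribute."""
    pairs = sorted(zip(values, labels))
    best = (None, len(labels) + 1)
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        # Assign each side its majority label; count the misfits.
        err = (len(left) - max(left.count(l) for l in set(left)) +
               len(right) - max(right.count(l) for l in set(right)))
        if err < best[1]:
            best = (t, err)
    return best

t, err = best_split([1.0, 1.2, 0.9, 5.1, 5.3], [0, 0, 0, 1, 1])
# Yields a rule of the form: "attribute <= t -> cluster 0, else cluster 1".
```

A full tree repeats this search recursively over all attributes, so each leaf corresponds to a cluster and the path to it reads as a semantic description.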
A drawback of the above two techniques is that both are density-based, i.e., they apply only to numerical attributes. This is fatal when we want to cluster categorical datasets. In this thesis, we develop a new technique based on the k-nearest-neighbor graph, which allows both numerical and categorical attributes. The technique retains the convenience of unsupervised learning while providing the predictive ability of decision trees.
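To make the mixed-attribute setting concrete, the following sketch builds a k-nearest-neighbor graph over records with both numerical and categorical fields. The distance measure here (range-normalized difference for numeric fields plus a 0/1 mismatch for categorical ones, a Gower-style measure) and the brute-force construction are assumptions for illustration; the thesis's exact measure and algorithm may differ.

```python
# Sketch: k-nearest-neighbor graph over mixed-attribute records.

def mixed_distance(a, b, numeric_idx, ranges):
    """Average per-attribute distance: normalized absolute
    difference for numeric fields, 0/1 mismatch otherwise."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_idx:
            r = ranges[i]
            total += abs(x - y) / r if r else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)

def knn_graph(records, k, numeric_idx):
    # Pre-compute the value range of each numeric attribute
    # so numeric differences fall in [0, 1].
    ranges = {}
    for i in numeric_idx:
        col = [r[i] for r in records]
        ranges[i] = max(col) - min(col)
    graph = {}
    for i, a in enumerate(records):
        dists = [(mixed_distance(a, b, numeric_idx, ranges), j)
                 for j, b in enumerate(records) if j != i]
        dists.sort()
        graph[i] = [j for _, j in dists[:k]]  # k nearest neighbors
    return graph

data = [(1.0, 'red'), (1.1, 'red'), (5.0, 'blue'), (5.2, 'blue')]
g = knn_graph(data, k=1, numeric_idx={0})
```

Clusters then correspond to densely connected regions of this graph, so no density estimate over a purely numeric space is needed.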
References
[1] Lee (李育璇), A Clustering Algorithm Capable of Semantic Description (in Chinese), Master's thesis, Graduate Institute of Information Management, National Central University, June 2003.
[2] A.K. Jain, M.N. Murty, and P.J. Flynn, Data clustering: a review, ACM Computing Surveys, 31(3):264-323, 1999.
[3] B. Liu, Y. Xia, and P. Yu, Clustering through decision tree construction, In SIGMOD-00, 2000.
[4] C.H. Cheng, A.W. Fu, and Y. Zhang, Entropy-based subspace clustering for mining numerical data, KDD-99, 84-93, 1999.
[5] F. Giannotti, C. Gozzi and G. Manco, Clustering Transactional Data, SEBD 2001.
[6] G. Karypis, E.-H. Han, and V. Kumar, CHAMELEON: Hierarchical clustering using dynamic modeling, IEEE Computer, 1999.
[7] G. Salton, Automatic text processing: the transformation, analysis and retrieval of information by computer, Addison Wesley, 1989.
[8] H. Ralambondrainy, A Conceptual Version of the K-Means Algorithm, Pattern Recognition Letters, 16, pp.1147-1157, 1995.
[9] S. Hirano, X. Sun, and S. Tsumoto, Comparison of clustering methods for clinical databases, Information Sciences, 159(3-4):155-165, February 2004.
[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
[11] J.R. Quinlan, C4.5 : Programs for Machine Learning, Morgan Kaufmann, 1993.
[12] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, Clustering algorithms and validity measures, In Proceedings of the Thirteenth International Conference on Scientific and Statistical Database Management (SSDBM'01), pp. 3-22, 2001.
[13] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-Interscience, 2002.
[14] P. Berkhin, Survey of clustering data mining techniques, Technical Report, Accrue Software, 2002.
[15] S. Guha, R. Rastogi, and K. Shim, CURE: An efficient clustering algorithm for large databases, Information Systems, 2001.
[16] S. Guha, R. Rastogi, and K. Shim, ROCK: A clustering algorithm for categorical attributes, Information Systems, 2000.