Center-based clustering with the string data


Author

Jia-Wun Syu (許佳雯)

Department

Graduate Institute of Industrial Management

Thesis title

Center-based clustering with the string data

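This electronic thesis is authorized for immediate open access.
Once released, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.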

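Abstract (translated from Chinese)

Clustering has been widely discussed and applied in many studies. Its goal is to maximize the similarity of data within a cluster and to minimize the similarity of data between clusters. Many data types are used in clustering. Numerical data is currently the most common, while string data is discussed and used far less. Yet string data appears in everyday life in many forms, such as the production flow of a product, the repair procedure of a part, and the order in which the symptoms of a disease appear. Compared with other data types, string data must additionally account for order, so this study proposes a feasible clustering approach for this data type.
When numerical or categorical data were clustered in the past, the data mostly had the same dimensionality, and many scholars have fully defined the clustering process for that case. String data, however, includes items of varying dimensionality, that is, of unequal length. For example, product 1 must pass through machines A, B, and C in order, while product 2 must pass through machines B, C, D, and A. How to measure the similarity of string data without disturbing the order is therefore an important issue. In this study we use two methods, the Edit distance and the Simple matching distance, to measure similarity. Existing approaches to clustering string data mostly use hierarchical methods, such as Tian et al. (1996), Dinu and Sgarro (2006), and Tseng (2013). This thesis instead uses a non-hierarchical method as the basis for clustering and measures the similarity of string data by finding the center of each cluster.
Many scholars have proposed different non-hierarchical clustering algorithms, among which center-based algorithms are comparatively more efficient. The study therefore builds its model on the concept of the center used in two non-hierarchical methods, K-means and K-modes. Each has some advantages that help achieve the goal of clustering.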

Abstract (English)

Clustering has been studied and applied in many research fields. Its goal is for objects in the same cluster to be highly similar while objects in different clusters are dissimilar. Clustering handles many data types, of which numerical data is the most commonly used. String data has so far received little attention, yet it has enormous potential for application, for example in part repair processes, product manufacturing processes, and the order in which disease symptoms occur. Compared with other data types, string data has two unavoidable elements to consider: the characters and their order. In this study we therefore propose a viable method for clustering string data.
Most past studies deal with objects of the same dimensionality, a case for which many scholars have fully defined the clustering process. In string data, however, most objects have different dimensionality, that is, their lengths are not equal. For example, product 1 is processed on machines A, B, and C, while product 2 is processed on machines B, C, D, and A. How to measure similarity without disturbing the order of the string data is therefore an important issue. In our study, we apply the Edit distance and the Simple matching distance to measure the dissimilarity of string data. At present, string data is mostly handled with hierarchical clustering methods, such as Tian et al. (1996), Dinu and Sgarro (2006), and Tseng (2013). Our study is instead based on non-hierarchical clustering.
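As an illustration of the two dissimilarity measures named above, the following minimal Python sketch compares the two machine routes from the example. It is not taken from the thesis: the function names are hypothetical, the Edit distance is the standard Levenshtein form with unit costs, and the treatment of unequal lengths in the Simple matching distance (every unmatched trailing position counts as one mismatch) is an assumption made only for this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of unit-cost insertions,
    deletions, and substitutions that turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))          # distances from empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                          # deleting the first i items of a
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simple_matching_distance(a, b):
    """Position-by-position mismatch count. ASSUMPTION: positions that
    exist in only one sequence each count as one mismatch."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


if __name__ == "__main__":
    product_1 = "ABC"    # machines A, B, C
    product_2 = "BCDA"   # machines B, C, D, A
    print(edit_distance(product_1, product_2))             # 3
    print(simple_matching_distance(product_1, product_2))  # 4
```

The Edit distance can realign the shared sub-route B, C (delete A, then insert D and A), whereas the positional comparison sees a mismatch at every position, which is why the two measures disagree on this pair.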
Compared with other types of clustering algorithms, center-based algorithms are very efficient. We therefore propose a new model that combines the center concepts of K-means and K-modes to achieve the goal of clustering string data.
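The abstract does not spell out the proposed model, so the following Python sketch only illustrates the general idea of center-based clustering on strings. It is an assumption-laden stand-in, not the thesis's algorithm: it alternates assignment and center update as K-means does, but takes each center to be the cluster member with the smallest total Edit distance to the other members (a medoid). The names `medoid` and `cluster_strings` and the choice of the first k distinct strings as initial centers are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def medoid(members, dist):
    """Member with the smallest total distance to all members.
    ASSUMPTION: used here as the 'center' of a cluster of strings."""
    return min(members, key=lambda c: sum(dist(c, s) for s in members))


def cluster_strings(strings, k, dist=edit_distance, max_iter=20):
    """K-means-style loop: assign each string to its nearest center,
    then recompute each center, until the centers stop changing."""
    centers = list(dict.fromkeys(strings))[:k]  # first k distinct strings
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for s in strings:
            nearest = min(range(len(centers)),
                          key=lambda i: dist(s, centers[i]))
            clusters[nearest].append(s)
        # Keep the old center if a cluster ends up empty.
        new_centers = [medoid(c, dist) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    routes = ["ABC", "ABD", "ABCD", "XYZ", "XYW", "XZ"]  # made-up routes
    centers, clusters = cluster_strings(routes, k=2)
    print(centers)
    print(clusters)
```

Restricting centers to actual members avoids having to define an average of strings of different lengths, at the cost of computing pairwise distances inside each cluster. As with K-means, the result depends on the initial centers.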

Keywords (Chinese)

★ 集群分析 (cluster analysis)
★ 字串型資料 (string data)
★ 相似度衡量 (similarity measure)

Keywords (English)

★ Cluster analysis
★ String data type
★ Similarity measure

Table of Contents

Abstract (Chinese)
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1-1 Background and Motivation
1-2 Research Objectives
1-3 Research Methodology
1-4 Research Framework
Chapter 2 Literature Review
2-1 Cluster analysis
2-2 Group Technology
2-3 Center-based clustering algorithms
2-3-1 K-means algorithm
2-3-2 K-modes algorithm
2-3-3 Parameters of Center-based algorithms
2-4 Dissimilarity measure for string data
2-4-1 The Levenshtein Minimum Edit Distance
2-4-2 The Simple matching distance
2-5 Cluster validation
Chapter 3 Methodology
3-1 Measuring dissimilarity in Edit distance
3-2 Measuring dissimilarity in Simple matching distance
Chapter 4 Numerical Example
4-1 The Nair and Narendran (1998) problem
4-1-1 The Edit distance Clustering for Nair and Narendran (1998) problem
4-1-2 The Simple matching distance Clustering for Nair and Narendran (1998) problem
4-2 The Harhalakis et al. (1990) problem
4-2-1 The Edit distance Clustering for Harhalakis et al. (1990) problem
4-2-2 The Simple matching distance Clustering for Harhalakis et al. (1990) problem
4-3 The Tseng (2013) problem
4-3-1 The Edit distance Clustering for Tseng (2013) problem
4-3-2 The Simple matching distance Clustering for Tseng (2013) problem
Chapter 5 Conclusion and Further Research
References
Appendix A. Case 1 of the Edit distance Clustering for the Nair and Narendran (1998) problem
Appendix B. Case 1 of the Simple matching distance Clustering for the Nair and Narendran (1998) problem

References

1. Akutsu, T., “A relation between edit distance for ordered trees and edit distance for Euler strings”, Information Processing Letters, vol. 100, pp. 105-109, 2006.
2. Altuntas, S., Selim, H., “Facility layout using weighted association rule-based data mining algorithms: Evaluation with simulation”, Expert Systems with Applications, vol. 39, pp. 3-13, 2012.
3. Arai, K., Barakbah, A. R., “Hierarchical K-means: an algorithm for centroids initialization for K-means”, vol. 36, no. 1, pp. 25-31, 2007.
4. Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Perez, J. M., Perona, I., “An extensive comparative study of cluster validity indices”, Pattern Recognition, vol. 46, pp.243-256, 2013.
5. Ball, G., Hall, D., “ISODATA, a novel method of data analysis and pattern classification”, Technical report NTIS AD 699616. Stanford Research Institute, Stanford, CA.
6. Berry, J. A., Linoff, G., Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, second edition, Wiley Publishing, Inc., Indiana, 2004.
7. Bradley, P.S., Fayyad U. M., “Refining initial points for K-means clustering ”
8. Chaturvedi, A., Green, P. E., Carroll, J. D., “K-modes clustering”, Journal of Classification, vol. 18, no. 1, pp. 35-55, 2001.
9. Chen C.H., Lan, G.C., T.p., Lin Y.K., “Mining high coherent association rules with consideration of support”, Expert Systems with Applications, vol. 40, pp. 6531-6537, 2013.
10. Christopher, D. M., Prabhakar, R., Hinrich S., Introduction to Information Retrieval, Cambridge University Press, 2009.
11. Cohen W.W., Ravikumar P., Fienberg S. E., “A comparison of string distance metrics for Name-matching tasks”, American Association Artificial Intelligence, 2003.

12. Dinu, L. P., Sgarro, A., “A Low-complexity Distance for DNA Strings,” Fundamenta Informaticae, vol. 73, no. 3, pp. 361–372, 2006.
13. Fu, K. S., Syntactic methods in pattern recognition and machine learning, Taiwan,1968
14. Fu K.S., Lu S.Y., “A Clustering Procedure for Syntactic Patterns”, IEEE, vol. 7, no. 10, pp734-742, 1977.
15. Gan, G., Ma, C., Wu, J., Data clustering: theory, algorithms, and applications, American statistical association, 2007.
16. Grupe, F. H., Owrang, M. M., “Data Base Mining Discovering New Knowledge and Competitive Advantage”, Information Systems Management, Vol. 12, No. 4, pp. 26-31, 1995.
17. Hawkins, C. P., Murphy, M. L. and Anderson, N. H., “Effects of canopy, substrate composition, and gradient on the structure of macroinvertebrate communities in Cascade Range streams of Oregon”, Ecology 63(6), pp.1840-1856, 1982.
18. Heragu, S.S., “Group Technology and Cellular Manufacturing”, IEEE Transactions on Systems, vol. 24, no.2, 1994.
19. Heragu, S.S. and Kakuturi, S.R., “Grouping and placement of machine cells”, IIE Transactions, vol. 29, 1997.
20. Huang, Z., “Extensions to the k-means algorithm for clustering large data set with categorical values”, Data mining and knowledge discovery, vol.2, pp.283-304, 1998.
21. Jain, A. K, “Data clustering: 50 years beyond K-means”, Pattern recognition letters, vol.31, pp.651-666, 2010.
22. Jain, A. K., Dubes, R. C., Algorithms for Clustering Data, Prentice-Hall, Inc., 1988.
23. Jain, A.K, Murty, M.N., Flynn, P.J., “Data clustering: a review”, ACM computing surveys, vol.31, no.3, pp.264-323, 1999.
24. Khan S. S, Kant, S., “Computation of initial modes for K-modes clustering algorithm using evidence accumulation”,
25. Khan S.S., Ahmad A., “Cluster center initialization algorithm for K-modes clustering”, Expert Systems with Applications, vol.40, pp44-56, 2013.
26. Kim, S. R., Park, K., “A dynamic edit distance table”, Journal of Discrete Algorithms, pp. 303-312, 2004.
27. Lange, T., Roth C., Braun, M.L., Buhmann J.M., “Stability-Based Validation of Clustering Solutions”, Neural Computation, vol. 16, pp. 1299-1323, 2004.
28. Leonard, K. J., “The development of a rule based expert system model for fraud alert in consumer credit”, European journal of operational research, vol.80, pp.350-356, 1995.
29. MacQueen, J., “Some methods for classification and analysis of multivariate observations”, In: Fifth Berkeley Symposium on Mathematics. Statistics and Probability. University of California Press, pp. 281-297, 1967
30. Mardia, K.V., Kent,J.T., and Bibby, J.M., Multivariate Analysis, Academic Press, 1979.
31. Marzal, A., Vidal, E., “Computation of Normalized Edit Distance and Applications”, IEEE Transactions on pattern analysis and machine intelligence, vol. 15, No. 9, 1993.
32. Mutingi, M. and Onwubolu, G.C., “Integrated cellular manufacturing system design and layout using group genetic algorithms”, Manufacturing system.
33. Nair, G. J. and Narendran, T. T., “CASE: a clustering algorithm for cell formation with sequence data”, International Journal of Production Research, Vol. 36, pp.157-179, 1998.
34. Onwubolu, G.C. and Mutingi, M., “A genetic algorithm approach to cellular manufacturing systems”, Computers & Industrial Engineering, Vol. 39, 125-144, 2001.
35. Pavlock, B., Davenport, C., McDaniel, A., Casey, J., Varol, C., “Address Verification and Standardization Based on Edit Distance and Soundex”, International Advanced Technologies Symposium, pp.16-18, 2011.
36. Popa, A., McDowell, J.J., “The effect of Hamming distances in a computational model of selection by consequences”, Behavioral processes, vol.82, pp.428-434, 2010.
37. Rai, H., Yadav, A., “Iris recognition using combined support vector machine and Hamming distance approach”, Expert Systems with Applications, vol. 41, pp. 588-593, 2014.
38. Teymourian, E., Mahdavi, I. and Kayvanfar, V., “A new cell formation model using sequence data and handling cost factors”, International Conference on Industrial Engineering and Operations Management, Kuala Lumpur, Malaysia, January 22-24, 2011.
39. Tian, T. Z., Ramakrishnan, R., Livny, M., “BIRCH: an efficient data clustering method for very large databases,” SIGMOD Rec., vol. 25, no. 2, pp. 103–114, 1996.
40. Wemmerlov, U. and Hyer, N.L, “Procedures for the part Family/Machine group identification problem in cellular manufacturing”, Journal of operations management, vol. 6, no. 2, pp.125-148, 1986.
41. William, J. F., Gregory, P. S., Christopher, J. M., “Knowledge discovery in databases: an overview”, AI Magazine, Vol. 13, No.3, 1992.
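42. Tseng (曾固鈺), “Cluster analysis of target groups by process similarity: the case of self-repaired parts at an aircraft engine maintenance plant” (in Chinese), master's thesis, National Central University, 2013.
43. 盧錦隆, “Algorithms for gene sequence alignment” (in Chinese), Institute of Biology, National Chiao Tung University, Science Development (科學發展), no. 396, December 2004.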

Advisor

曾富祥

Approval date

2014-6-25
