Abstract (English)
In recent years, advances in computing power and storage capacity have led many researchers to study data mining and big data, seeking to extract value from massive datasets and to develop innovative applications, such as using classifiers to determine the categories of articles. When building a classifier, more representative training data yields better results, so training samples are typically selected from the dataset and labeled manually by experts. However, hiring experts is costly and their output is limited, so the selected samples must be as representative as possible to maximize the utility of the training data. In other words, the purpose of this study is how to select the best training data from an unlabeled dataset under a constraint on the number of samples.
This study focuses on using unsupervised learning to select samples under a constraint on the number of samples. In this thesis, we first remove the outliers from the dataset, and then apply K-Means to find training data that cover all typical types in the dataset. Next, we apply Balanced K-Means within each K-Means cluster, allocating sub-clusters in proportion to the cluster's share of the dataset. Finally, we pick the "centroid" of each sub-cluster as the best training data and have experts label it. The selected training data are then used to build five different classifiers, and the classification performance of these classifiers measures the quality of the selected data. In other words, if the classifiers built from the selected data perform well, the proposed method can select the best training data under the sample-size constraint.
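For illustration, a minimal sketch of this selection pipeline is given below, assuming scikit-learn. The outlier detector (LocalOutlierFactor) and the use of plain KMeans in place of Balanced K-Means are stand-ins, not the exact components used in this thesis, and the proportional budget split is likewise an assumption.

```python
# Minimal sketch of the sample-selection pipeline described above.
# Assumptions (not from the thesis): LocalOutlierFactor stands in for the
# outlier-removal step, and plain KMeans approximates Balanced K-Means for
# the proportional sub-clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor

def select_training_samples(X, k, budget, random_state=0):
    """Return indices of roughly `budget` representative points from unlabeled data X."""
    # 1) Remove outliers (illustrative detector; the thesis may use a different one).
    inlier_mask = LocalOutlierFactor(n_neighbors=20).fit_predict(X) == 1
    idx = np.where(inlier_mask)[0]
    X_in = X[idx]

    # 2) Coarse K-Means to capture the typical types in the dataset.
    coarse = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_in)

    selected = []
    for c in range(k):
        members = np.where(coarse.labels_ == c)[0]
        # 3) Allocate samples to this cluster in proportion to its size.
        n_sub = max(1, round(budget * len(members) / len(X_in)))
        n_sub = min(n_sub, len(members))
        # 4) Sub-cluster (plain KMeans here; the thesis uses Balanced K-Means) and
        #    take the point nearest each sub-centroid as a "centroid" sample.
        sub = KMeans(n_clusters=n_sub, n_init=10,
                     random_state=random_state).fit(X_in[members])
        for centre in sub.cluster_centers_:
            d = np.linalg.norm(X_in[members] - centre, axis=1)
            selected.append(idx[members[np.argmin(d)]])
    return np.unique(np.array(selected))
```

The returned indices point at the unlabeled pool; the corresponding samples would then be handed to experts for labeling.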
Finally, the experimental results show that the proposed method performs well with KNN, Naïve Bayes, SVM, and MLP, but not with Random Forest. From this result, we find that a classifier that is not built on the concepts of space and distance achieves lower classification performance, because it does not match the design concept of our method. On the other hand, the proposed method can select the best training data under the sample-size constraint when the classifier considers all of the attributes.
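A hedged sketch of how the selected, expert-labeled samples could be used to compare the five classifiers named above is shown here; the hyperparameters and the accuracy metric are illustrative assumptions, not the thesis's exact settings.

```python
# Train each of the five classifier types on the selected samples and report
# test accuracy; default scikit-learn settings are used for illustration only.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def evaluate_selected_samples(X_train, y_train, X_test, y_test):
    """Fit each classifier on the expert-labeled selection and score it on held-out data."""
    models = {
        "KNN": KNeighborsClassifier(),
        "Naive Bayes": GaussianNB(),
        "SVM": SVC(),
        "MLP": MLPClassifier(max_iter=1000),
        "Random Forest": RandomForestClassifier(),
    }
    return {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
            for name, m in models.items()}
```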