A Study of Sample Selection Using Unsupervised Learning under Cost Constraints

Name

張詒鈞 (Yi-Jun Zhang)

Department

Department of Information Management

Thesis title

A Study of Sample Selection Using Unsupervised Learning under Cost Constraints

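This electronic thesis is authorized for immediate open access.
Once open, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.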

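Abstract (translated from the Chinese)

With advances in technology and the rise of big data, the importance and usefulness of data have become increasingly recognized. Many researchers have therefore turned to data mining, hoping to uncover the value hidden in large volumes of data and to build applications on it, such as using a classifier to predict the category of an article. For a classifier, the more representative the training data are of the whole data set, the better the training result. When a classifier is built, the training data are labeled manually, but because articles vary in length, the cost of labeling is not the same for every instance.
This study focuses on sample selection by unsupervised learning under a cost constraint. In the experiments, each instance is assigned a selection cost, and the total cost of the training data finally selected is limited. The thesis uses two algorithms, Bisecting K-means and Hierarchical Clustering, and two selection methods, the best point and the cost-aware best point. The selected training data are used to build models with five different classifiers in order to evaluate the classification results obtained from the selected data.
The experimental results show that, for each of the five classifiers, at least one of the proposed methods yields a better model than random selection. With the proposed methods, more representative instances can be selected, under a cost constraint, from data that do not yet have class labels. If these instances are then given to experts for labeling, a better classification model can be trained and the labeling cost can be reduced substantially.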

Abstract (English)

With the progress of technology and the tide of big data, the importance of "information" has gradually come to be valued. Many scholars have therefore turned to the field of data mining, hoping to find the value behind large volumes of data and to develop innovative applications, such as using classifiers to determine the categories of articles. For a classifier, however, more representative training data lead to better results. When a classifier is built, the training data are labeled manually, and because articles and paragraphs vary in length, the labeling cost differs from instance to instance.
This study focuses on using unsupervised learning to select samples when each instance is given a selection cost and the total cost of the final selection is limited. The thesis uses two algorithms, Bisecting K-means and Hierarchical Clustering, and selects data in two ways: best points and best points under cost considerations. The selected training data are then modeled with five different classifiers to measure the classification performance of the classifiers built from the selected data.
Finally, the experimental results show that, compared with random selection, each of the five classifiers performs better with at least one of the proposed methods. The method proposed in this thesis can select more representative data from unlabeled data without exceeding the budget. If these data are handed to experts for labeling, a better classification model can be trained and the labeling cost drops significantly.
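
The selection procedure outlined in the abstracts can be illustrated with a small sketch. It is not the thesis's implementation: the record does not define "best point" or "best point under cost considerations", so the sketch assumes that the best point is the cluster member nearest the cluster centroid, and that the cost-aware variant ranks members by the hypothetical score `(1 + distance) * cost`. The function names, the choice to split the largest cluster first, and the rule of taking one instance per cluster are likewise assumptions made for illustration.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def two_means(points, rng, iters=20):
    """Split points into two groups of local indices with plain 2-means."""
    centers = rng.sample(points, 2)
    groups = [[], []]
    for _ in range(iters):
        groups = [[], []]
        for i, p in enumerate(points):
            nearer = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            groups[nearer].append(i)
        if not groups[0] or not groups[1]:
            break  # degenerate split; the caller decides what to do
        centers = [centroid([points[i] for i in g]) for g in groups]
    return groups


def bisecting_kmeans(points, k, seed=0):
    """Repeatedly bisect the largest cluster until k clusters exist (assumed rule)."""
    rng = random.Random(seed)
    clusters = [list(range(len(points)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()
        if len(biggest) < 2:
            clusters.append(biggest)
            break
        left, right = two_means([points[i] for i in biggest], rng)
        if not left or not right:
            clusters.append(biggest)
            break
        clusters.append([biggest[i] for i in left])
        clusters.append([biggest[i] for i in right])
    return clusters


def select_within_budget(points, costs, budget, k, cost_aware=False, seed=0):
    """Pick at most one affordable instance per cluster without exceeding budget."""
    chosen, spent = [], 0.0
    for members in bisecting_kmeans(points, k, seed):
        center = centroid([points[i] for i in members])

        def score(i):
            d = math.dist(points[i], center)
            # Hypothetical cost-aware score; the thesis's own formula is not in this record.
            return (1.0 + d) * costs[i] if cost_aware else d

        for i in sorted(members, key=score):
            if spent + costs[i] <= budget:
                chosen.append(i)
                spent += costs[i]
                break
    return chosen, spent
```

Under these assumptions, an instance near the centroid that is too expensive for the remaining budget is skipped in favour of the next-best affordable member of the same cluster. The selected indices would then be sent for labeling and used to train a classifier.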

Keywords (Chinese)

★ 文件分類 (document classification)
★ 非監督式樣本選取 (unsupervised instance selection)
★ 成本 (cost)

Keywords (English)

★ Document classification
★ Unsupervised instance selection
★ Cost

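Table of Contents

Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1. Introduction
1.1 Background and Motivation
1.2 Scenario Description
1.3 Research Objectives
Chapter 2. Literature Review
2.1 Instance Selection in Supervised Learning
2.1.1 Wrapper
2.1.2 Filter
2.1.3 Differences between This Study and Supervised Instance Selection
2.2 Instance Selection in Semi-supervised Learning
2.2.1 Self-Training
2.2.2 Co-training
2.2.3 Tri-Training
2.2.4 Differences between This Study and Semi-supervised Instance Selection
Chapter 3. Research Framework
3.1 Research Overview
3.2 Research Flowchart
3.3 Cost Definition
3.4 Centroid Definition
3.4.1 Best Point
3.4.2 Cost-Aware Best Point
3.5 Cost Accumulation
3.6 Algorithm Procedures
3.6.1 Bisecting K-means
3.6.2 Agglomerative Hierarchical Clustering
Chapter 4. Experimental Results
4.1 Data Sets
4.2 Data Preprocessing
4.3 Data Preparation
4.4 Evaluation Metrics
4.5 Experimental Results
4.5.1 Results for Each Method
4.5.2 Effect of the Cost Factor
Chapter 5. Conclusions and Suggestions
5.1 Conclusions
5.2 Future Work
References
Appendix 1. Average Results per Sampling Run for the KNN Classifier
Appendix 2. Average Results per Sampling Run for the Naïve Bayes Classifier
Appendix 3. Average Results per Sampling Run for the Random Forest Classifier
Appendix 4. Average Experimental Results per Sampling Run for the SVM Classifier
Appendix 5. Average Experimental Results per Sampling Run for the MLP Classifier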

Advisor

陳彥良 (Yen-Liang Chen)

Approval date

2017-8-17
