Master's and Doctoral Theses: Detailed Record for 107423050
Author: Chun-Kai Tseng (曾俊凱)    Department: Information Management
Thesis title (Chinese): 單一類別分類方法於不平衡資料集-搭配遺漏值填補和樣本選取方法
Thesis title (English): One-class classification on imbalanced datasets with missing value imputation and instance selection
Related theses:
★ A study of one-class classification on class-imbalanced datasets, combining feature selection and ensemble learning
★ Applying text mining to stock price prediction: exploring the relationship between traditional machine learning, deep learning techniques, and different financial news sources
★ A study of hybrid preprocessing for class imbalance problems, combining machine learning and generative adversarial networks
★ A study of single and parallel ensemble feature selection methods for multi-class imbalance problems
  1. The author has approved this electronic thesis for immediate open access.
  2. Open-access electronic full texts are licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research.
  3. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the content without authorization.

Abstract (Chinese): Imbalanced datasets are an important issue in practical data analysis, underlying problems in domains such as credit card fraud detection, medical diagnosis, and network attack classification. Faced with an imbalanced dataset, we can apply different data preprocessing steps or different classification methods to obtain better classification results. One-class classification is known in other fields as outlier detection or novelty detection; this thesis applies one-class classification methods, namely the one-class support vector machine (One-Class SVM), Isolation Forest, and Local Outlier Factor (LOF), to binary classification problems on imbalanced datasets. It further considers the case of incomplete data: missing values are simulated at rates of 10% to 50% and imputed with Classification and Regression Trees (CART) so that the data approaches the original, raising the classifiers' accuracy. Instance selection methods, namely the Instance-Based learning algorithm (IB3), the Decremental Reduction Optimization Procedure (DROP3), and a Genetic Algorithm (GA), are also applied to the noisy instances in the imbalanced data that hinder the classifiers, in the hope of removing impurities from the dataset, reducing the time cost of model training, and retaining sufficiently informative instances.
As a baseline, this thesis applies the one-class classification methods to the complete imbalanced data and compares every experiment against it. It examines missing value imputation combined with one-class classification, asks which instance selection method raises one-class classification accuracy, and finally studies the order in which missing value simulation, instance selection, and imputation are performed, showing that an improved workflow can raise classifier accuracy. The experiments show that after missing value imputation the classification accuracy on imbalanced data is close to the baseline; that instance selection can raise classification accuracy, with the reduction rate directly affecting it; and finally that, when imputation and instance selection are combined, processing the complete and incomplete portions of the data separately improves classification accuracy, while, if stable accuracy is preferred, simulating and imputing missing values on the complete data and then applying instance selection performs best.
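As a concrete illustration of the three one-class classifiers named in the abstract, the following minimal Python sketch trains each on the majority (normal) class only and evaluates it on a held-out mix of both classes. It assumes scikit-learn and a synthetic imbalanced dataset; the thesis's actual datasets, parameter settings, and evaluation protocol are not reproduced here.

# Minimal sketch (not the thesis's exact setup): fit One-Class SVM,
# Isolation Forest, and Local Outlier Factor on majority-class data only,
# then label held-out samples as normal (0) or anomalous (1).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Synthetic imbalanced binary data: roughly 95% majority (0) / 5% minority (1).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
X_train_normal = X_train[y_train == 0]  # one-class training sees only the majority class

models = {
    "One-Class SVM": OneClassSVM(kernel="rbf", nu=0.05),
    "Isolation Forest": IsolationForest(random_state=42),
    "Local Outlier Factor": LocalOutlierFactor(novelty=True),  # novelty=True allows predict() on new data
}

for name, model in models.items():
    model.fit(X_train_normal)
    # scikit-learn returns +1 for inliers and -1 for outliers; map to 0/1 labels.
    pred = np.where(model.predict(X_test) == 1, 0, 1)
    print(f"{name}: accuracy = {accuracy_score(y_test, pred):.3f}")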
Abstract (English): Imbalanced datasets are an important part of practical data analysis, arising in problems such as credit card fraud, medical diagnosis classification, and network attack detection. Faced with imbalanced datasets, we can adopt different data preprocessing or classification methods to achieve better classification results. This thesis applies one-class classification methods, such as the One-Class SVM, Isolation Forest, and Local Outlier Factor, to binary classification problems on imbalanced datasets. To further explore the case of missing data, missing values are simulated at rates of 10% to 50% and imputed with methods such as CART in order to increase classification accuracy. Instance selection methods such as IB3, DROP3, and GA are also applied to the imbalanced data, with the aim of removing noisy instances, reducing the cost of model training, and retaining sufficiently informative instances.
The experiments examine missing value imputation combined with one-class classification, ask which instance selection method improves accuracy, and study the order in which missing value simulation, instance selection, and imputation are performed. The results show that after the missing values are imputed, classification accuracy is close to that obtained on the complete data; that instance selection can increase classification accuracy, with the reduction rate directly affecting it; and finally that, when missing value imputation is combined with instance selection, separating the incomplete data from the complete data improves classification accuracy, while, when stable accuracy is preferred, simulating and imputing missing values on the complete data and then applying instance selection performs well.
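The missing-data step described in both abstracts can be sketched in the same spirit: inject missing values completely at random at rates of 10% to 50%, then impute them with a tree-based model. In this hedged sketch, scikit-learn's IterativeImputer with a DecisionTreeRegressor stands in for the thesis's CART imputation; the thesis's exact imputation procedure is not specified here.

# Minimal sketch: simulate 10%-50% missing-completely-at-random values,
# then impute them with a CART-style decision tree (a stand-in for the
# thesis's CART imputation, not its exact procedure).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X, _ = make_classification(n_samples=500, n_features=10, random_state=42)

for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
    # Inject missing values completely at random at the given rate.
    X_missing = X.copy()
    mask = rng.random(X.shape) < rate
    X_missing[mask] = np.nan

    # Impute each feature from the others with a decision-tree regressor.
    imputer = IterativeImputer(estimator=DecisionTreeRegressor(random_state=42),
                               random_state=42)
    X_imputed = imputer.fit_transform(X_missing)

    # Measure how close the imputed values come to the original data.
    rmse = np.sqrt(np.mean((X_imputed[mask] - X[mask]) ** 2))
    print(f"missing rate {rate:.0%}: imputation RMSE = {rmse:.3f}")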
Keywords (Chinese): ★ Imbalanced datasets
★ One-class classification
★ Missing value imputation
★ Instance selection
Keywords (English): ★ Imbalanced data sets
★ One-Class Classification
★ Missing value imputation
★ Instance selection
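Among the instance selection methods the thesis compares, the genetic algorithm variant can be illustrated with a generic sketch: each chromosome is a boolean keep/drop mask over the training instances, and fitness rewards the validation accuracy of a 1-NN classifier plus a small bonus for discarding instances. All settings below (population size, fitness weights, 1-NN evaluation) are assumptions for illustration, not the thesis's GA configuration.

# Generic GA instance selection sketch (illustrative settings only):
# evolve a boolean keep/drop mask over the training set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def fitness(mask):
    # Degenerate subsets (too few instances or a single class) score zero.
    if mask.sum() < 2 or len(np.unique(y_tr[mask])) < 2:
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr[mask], y_tr[mask])
    acc = knn.score(X_val, y_val)
    reduction = 1.0 - mask.mean()        # fraction of instances removed
    return 0.9 * acc + 0.1 * reduction   # accuracy-weighted fitness

pop = rng.random((30, len(X_tr))) < 0.5  # population of 30 random masks
for generation in range(40):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]      # keep the 10 fittest masks
    children = []
    while len(children) < len(pop) - len(parents):
        a, b = parents[rng.integers(10, size=2)]
        cross = rng.random(len(X_tr)) < 0.5      # uniform crossover
        child = np.where(cross, a, b)
        child ^= rng.random(len(X_tr)) < 0.01    # bit-flip mutation
        children.append(child)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print(f"kept {best.sum()}/{len(X_tr)} instances "
      f"(reduction rate {1 - best.mean():.0%}), fitness {fitness(best):.3f}")

The reduction rate reported at the end corresponds to the instance-reduction ("篩檢") rate that the abstract says directly affects classification accuracy.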
Table of Contents:
Abstract (Chinese) i
Abstract (English) ii
Table of Contents iii
List of Tables v
List of Figures v
List of Appendix Tables vi
1. Introduction 1
1-1 Research Background 1
1-2 Research Motivation 2
1-3 Research Objectives 3
1-4 Thesis Organization 3
2. Literature Review 5
2-1 Imbalanced Datasets 5
2-2 One-Class Classification (OCC) 7
2-2-1 One-Class SVM (OCSVM) 9
2-2-2 Isolation Forest (iForest) 11
2-2-3 Local Outlier Factor (LOF) 13
2-3 Missing Data 15
2-3-1 Missing Value Imputation Procedures and Methods 16
2-4 Instance Selection Methods 17
3. Research Method and Design 19
3-1 Experimental Framework and Preparation 19
3-2 Experiment 1 22
3-3 Experiment 2 23
3-4 Experiment 3-1 24
3-5 Experiment 3-2 25
3-6 Evaluation Criteria 26
4. Experimental Results 27
4-1 Results of Experiment 1 27
4-2 Results of Experiment 2 28
4-3 Results of Experiment 3-1 30
4-4 Results of Experiment 3-2 33
4-5 Summary of Experimental Results 36
5. Conclusion 42
5-1 Summary 42
5-2 Research Contributions and Future Work 43
References 45
Appendix 1: Detailed Classification Accuracy Data 44
1-1 Classification accuracy at 10%-50% missing rates (MI) 44
1-2 Classification accuracy with instance selection (IS) 49
1-3 Classification accuracy at 10%-50% missing rates combined with instance selection 52
Appendix 2: Detailed Instance Selection Reduction Rate Data 70
References:
1. Khan, S.S. and M.G. Madden, One-class classification: taxonomy of study and review of techniques. The Knowledge Engineering Review, 2014. 29(3): p. 345-374.
2. Puri, A. and M. Gupta, Review on Missing Value Imputation Techniques in Data Mining. IJSRCSEIT, 2017. 2(7).
3. Olvera-López, J.A., et al., A review of instance selection methods. Artificial Intelligence Review, 2010. 34(2): p. 133-143.
4. Haixiang, G., et al., Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 2017. 73: p. 220-239.
5. Hempstalk, K. and E. Frank, Discriminating Against New Classes: One-class versus Multi-class Classification, in AI 2008: Advances in Artificial Intelligence. 2008. p. 325-336.
6. Olvera-López, J.A., et al., A review of instance selection methods. Artificial Intelligence Review, 2010. 34(2): p. 133-143.
7. Tan, A.C., D. Gilbert, and Y. Deville, Multi-class protein fold classification using a new ensemble machine learning approach. Genome Informatics, 2003. 14: p. 206-217.
8. Abe, N., B. Zadrozny, and J. Langford. An iterative method for multi-class cost-sensitive learning. in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. 2004.
9. Zhou, Z.-H. and X.-Y. Liu, Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on knowledge and data engineering, 2005. 18(1): p. 63-77.
10. Chen, K., B.-L. Lu, and J.T. Kwok. Efficient classification of multi-label and imbalanced data using min-max modular classifiers. in The 2006 IEEE International Joint Conference on Neural Network Proceedings. 2006. IEEE.
11. Sun, Y., M.S. Kamel, and Y. Wang. Boosting for learning multiple classes with imbalanced class distribution. in Sixth International Conference on Data Mining (ICDM′06). 2006. IEEE.
12. Zhou, Z.H. and X.Y. Liu, On multi‐class cost‐sensitive learning. Computational Intelligence, 2010. 26(3): p. 232-257.
13. He, H. and E.A. Garcia, Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 2009. 21(9): p. 1263-1284.
14. Weiss, G.M., Mining with rarity: a unifying framework. ACM Sigkdd Explorations Newsletter, 2004. 6(1): p. 7-19.
15. Kotsiantis, S., D. Kanellopoulos, and P. Pintelas, Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, 2006. 30(1): p. 25-36.
16. Chawla, N.V., et al., SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 2002. 16: p. 321-357.
17. Bekkar, M. and T.A. Alitouche, Imbalanced Data Learning Approaches Review. International Journal of Data Mining & Knowledge Management Process, 2013. 3(4): p. 15-33.
18. Japkowicz, N., Learning from Imbalanced Data Sets: A Comparison of Various Strategies, in AAAI. 2000.
19. Drummond, C. and R.C. Holte, C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling, in Workshop on Learning from Imbalanced Datasets II, ICML. 2003: Washington DC.
20. Chawla, N.V., et al., SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 2002. 16: p. 321-357.
21. Wah, Y.B., et al., Handling imbalanced dataset using SVM and k-NN approach. 2016.
22. Khan, S.S. and M.G. Madden, A Survey of Recent Trends in One Class Classification. AICS 2009, 2010: p. 188-197.
23. Breunig, M.M., et al., LOF: Identifying Density-Based Local Outliers, in ACM SIGMOD 2000.
24. Scholkopf, B., et al., Support Vector Method for Novelty Detection. Advances in Neural Information Processing Systems, 2000.
25. Tax, D.M.J. and R.P.W. Duin, Support Vector Data Description. Machine Learning, 2004. 54: p. 45-66.
26. Liu, F.T., K.M. Ting, and Z.-H. Zhou, Isolation-based Anomaly Detection. ACM Transactions on Knowledge Discovery from Data, 2012. 5.
27. Shin, H.J., D.-H. Eom, and S.-S. Kim, One-class support vector machines—an application in machine fault detection and classification. Computers & Industrial Engineering, 2005. 48(2): p. 395-408.
28. Lin, W.-C. and C.-F. Tsai, Missing value imputation: a review and analysis of the literature (2006–2017). Artificial Intelligence Review, 2019.
29. Strike, K., K.E. Emam, and N. Madhavji, Software Cost Estimation with Incomplete Data. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2001. 27(10).
30. Raymond, M.R. and D.M. Roberts, A comparison of methods for treating incomplete data in selection research. Educational and Psychological Measurement, 1987.
31. Silva-Ramirez, E.L., et al., Missing value imputation on missing completely at random data using multilayer perceptrons. Neural Networks, 2011. 24(1): p. 121-129.
32. Pelckmans, K., et al., Handling missing values in support vector machine classifiers. Neural Networks, 2005. 18(5-6): p. 684-692.
33. Farhangfar, A., L. Kurgan, and J. Dy, Impact of imputation of missing values on classification error for discrete data. Pattern Recognition, 2008. 41(12): p. 3692-3705.
34. Acuña, E. and C. Rodriguez, The treatment of missing values and its effect on classifier accuracy, in Classification, Clustering, and Data Mining Applications: Proceedings of the Meeting of the International Federation of Classification Societies (IFCS). 2004.
35. Burgette, L.F. and J.P. Reiter, Multiple imputation for missing data via sequential regression trees. American journal of epidemiology, 2010. 172(9): p. 1070-1076.
36. Shah, A.D., et al., Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American journal of epidemiology, 2014. 179(6): p. 764-774.
37. Doove, L.L., S. Van Buuren, and E. Dusseldorp, Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 2014. 72: p. 92-104.
38. Breiman, L., et al., Classification and regression trees. 1984: CRC press.
39. Wilson, D.R. and T.R. Martinez, Reduction Techniques for Instance-Based Learning Algorithms. Machine Learning, 2000. 38(3): p. 257-286.
40. Tsai, C.-F. and F.-Y. Chang, Combining instance selection for better missing value imputation. Journal of Systems and Software, 2016. 122: p. 63-71.
41. Cover, T. and P. Hart, Nearest neighbor pattern classification. IEEE transactions on information theory, 1967. 13(1): p. 21-27.
42. Wilson, D.L., Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, 1972(3): p. 408-421.
43. Aha, D.W., D. Kibler, and M. Albert, Instance-Based Learning Algorithms. Machine Learning, 1991. 6: p. 37-66.
44. Tsai, C.-F., W. Eberle, and C.-Y. Chu, Genetic algorithms in feature and instance selection. Knowledge-Based Systems, 2013. 39: p. 240-247.
45. Woods, K.S., et al., Comparative evaluation of pattern recognition techniques for detection of microcalcifications in mammography, in State of The Art in Digital Mammographic Image Analysis. 1994, World Scientific. p. 213-231.
46. Wang, K. and S. Stolfo, One-class training for masquerade detection. 2003.
47. Devi, D., S.K. Biswas, and B. Purkayastha, Learning in presence of class imbalance and class overlapping by using one-class SVM and undersampling technique. Connection Science, 2019. 31(2): p. 105-142.
Advisors: Chih-Fong Tsai and Kuen-Liang Sue (蔡志豐, 蘇坤良)    Date of Approval: 2020-07-21
