單一類別分類方法於不平衡資料集－搭配遺漏值填補和樣本選取方法;One-class classification on imbalanced datasets with missing value imputation and instance selection

Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/84043

Title:	單一類別分類方法於不平衡資料集－搭配遺漏值填補和樣本選取方法;One-class classification on imbalanced datasets with missing value imputation and instance selection
Author:	曾俊凱;Tseng, Chun-Kai
Contributor:	Department of Information Management
Keywords:	不平衡資料集;單一類別分類方法;遺漏值填補;樣本選取方法;Imbalance data sets;One-Class Classification;Missing value imputation;Instance selection
Date:	2020-07-21
Upload time:	2020-09-02 17:58:00 (UTC+8)
Publisher:	National Central University
Abstract:	Imbalanced data sets are central to many practical data analysis problems, such as credit card fraud detection, medical diagnosis, and network attack classification. When facing imbalanced data, different data preprocessing steps or classification methods can be used to obtain better classification results. One-class classification, known in other fields as outlier detection or novelty detection, is applied in this thesis to binary classification problems on imbalanced data sets, using One-Class SVM, Isolation Forest, and Local Outlier Factor. The thesis further examines the case of incomplete data: missing values are simulated at rates of 10% to 50% and imputed with methods such as Classification and Regression Trees (CART), so that the imputed data approximate the original data and classification accuracy improves. Instance selection methods, namely the Instance-Based algorithm (IB3), Decremental Reduction Optimization Procedure (DROP3), and Genetic Algorithm (GA), are also applied to remove noisy instances that harm the classifiers, with the aim of reducing noise in the data set, shortening model training time, and retaining sufficiently informative instances. The baseline applies the one-class classifiers to the complete imbalanced data and is compared against each experiment. The experiments examine how missing value imputation interacts with one-class classification, which instance selection method improves one-class classification accuracy, and how the order of simulating missing values, instance selection, and imputation affects classifier accuracy.
The results show that, after missing value imputation, classification accuracy on the imbalanced data stays close to that of the baseline; that instance selection can increase classification accuracy, with the reduction rate directly affecting it; and that, when missing value handling and instance selection are combined, a process that handles complete and incomplete data separately can improve classification accuracy. When stable accuracy is the priority, simulating and imputing missing values on the complete data, together with instance selection, performs better.
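The pipeline outlined in the abstract (simulate missing values, impute with a tree-based method, then train one-class classifiers) can be sketched roughly as follows. This is an illustrative sketch only, not the thesis code: it assumes scikit-learn, a synthetic data set from `make_classification`, scikit-learn's `IterativeImputer` with a `DecisionTreeRegressor` as a stand-in for CART imputation, and one-class models fitted on majority-class rows only. The names `simulate_missing` and `run_sketch`, and all parameter values, are hypothetical. Instance selection (IB3, DROP3, GA) is not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import train_test_split


def simulate_missing(X, rate, rng):
    """Return a copy of X with roughly `rate` of its cells set to NaN at random."""
    X_missing = X.copy()
    mask = rng.random(X.shape) < rate
    X_missing[mask] = np.nan
    return X_missing


def run_sketch(missing_rate=0.3, seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic imbalanced binary data (about 5% minority); a stand-in, not the thesis data.
    X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                               weights=[0.95, 0.05], random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)

    # Simulate missing cells in the training data, then impute with a tree regressor
    # (assumed here as an approximation of CART-based imputation).
    X_train_missing = simulate_missing(X_train, missing_rate, rng)
    imputer = IterativeImputer(
        estimator=DecisionTreeRegressor(max_depth=5, random_state=seed),
        max_iter=5, random_state=seed)
    X_train_imputed = imputer.fit_transform(X_train_missing)

    # Assumption: each one-class model is fitted on majority-class rows only.
    X_major = X_train_imputed[y_train == 0]
    models = {
        "One-Class SVM": OneClassSVM(gamma="scale", nu=0.05),
        "Isolation Forest": IsolationForest(contamination=0.05, random_state=seed),
        "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, novelty=True),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_major)
        pred = model.predict(X_test)       # +1 = inlier, -1 = outlier
        y_pred = (pred == -1).astype(int)  # treat outliers as the minority class
        scores[name] = float((y_pred == y_test).mean())
    return X_train_imputed, scores


if __name__ == "__main__":
    _, accuracy = run_sketch(missing_rate=0.3)
    for name, value in accuracy.items():
        print(f"{name}: accuracy = {value:.3f}")
```

Calling `run_sketch` with `missing_rate` set from 0.1 to 0.5 would mirror the 10% to 50% range mentioned in the abstract. Plain accuracy is used here because the abstract reports classification accuracy; on heavily imbalanced data it can be dominated by the majority class.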
Appears in Collections:	[Graduate Institute of Information Management] Master's and Doctoral Theses

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	170	View/Open

All items in NCUIR are protected by original copyright.
