Missing Values in Class Imbalanced Datasets

DC field	Value	Language
DC.contributor	Department of Information Management	zh_TW
DC.creator	黃閔慈	zh_TW
DC.creator	Min-Tzu Hwang	en_US
dc.date.accessioned	2022-07-15T07:39:07Z
dc.date.available	2022-07-15T07:39:07Z
dc.date.issued	2022
dc.identifier.uri	http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=109423034
dc.contributor.department	Department of Information Management	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
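dc.description.abstract	Real-world datasets inevitably contain missing values. Without proper handling, most conventional classifiers cannot run directly on such data and may even produce erroneous data mining results. Class imbalance is likewise an important issue in data mining: conventional classifiers assume balanced classes and therefore tend to ignore minority-class samples, so models predict minority-class samples poorly. Given how common missing values and class imbalance are, a growing number of recent studies examine how to handle class-imbalanced datasets with missing values, yet few address the case in which one class, particularly the minority class, carries more missing values. In addition, although both missing value imputation and data-level approaches may alter the original data distribution, the existing literature has not examined how the order of imputation and data-level processing affects imputation performance and subsequent classification accuracy when a class-imbalanced dataset contains missing values. This study therefore simulates missingness in the minority-class training data and compares three processing procedures combined with six imputation methods and SMOTE. Besides swapping the order of the steps, the impute-first procedure is extended with a variant that uses only the originally complete minority-class data as the basis for generating new minority samples. The experiments show that, at most missing rates, the order of the processing procedure makes a significant difference in root mean square error (RMSE). Classification accuracy depends on the missing rate: at low missing rates (10~50%), the impute-first procedure still classifies significantly better, whereas at high missing rates (70~90%), the impute-first procedure that uses only complete minority-class data as the basis for new minority samples achieves significantly better classification accuracy with the random forest classifier. The study also recommends, for different needs, the better-performing procedures and accompanying imputation methods in the high and low missing-rate ranges, for future researchers' reference.	zh_TW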
dc.description.abstract	Missing values are nearly unavoidable in real-world datasets. Without appropriate treatment, most conventional machine learning models cannot handle missing values directly and may even produce false results. Class imbalance is another critical issue in machine learning and data mining: conventional models tend to ignore the minority class, which degrades classification performance. Because both problems are so common, a growing number of recent studies explore how to deal with class-imbalanced data with missing values. However, few studies have discussed the situation in which a certain class, especially the minority class, has more missing values than the majority class. Moreover, although both missing value imputation and data-level approaches may change the distribution of the original data, few studies discuss the order in which imputation and data-level approaches should be applied to class-imbalanced data with missing values. To this end, this study compares the performance of three processing procedures combined with six missing value imputation methods and SMOTE when the minority class of the training data has more missing values. Besides changing the order of the two steps, we also propose using only the complete subset of the minority-class training data as the basis for creating synthetic minority samples. The experimental results show that, at most missing rates, the processing procedures differ significantly in RMSE. Classification performance varies with the missing rate. When the missing rate is less than or equal to 50%, the impute-first procedure has better classification performance; when the missing rate is higher than 50%, the impute-first procedure that uses only the complete minority-class subset as the basis for synthetic samples performs significantly better with the random forest classifier. Furthermore, we recommend well-performing combinations of processing procedure and imputation method at different levels of missing rate, for future researchers' reference.	en_US
DC.subject	Missing value imputation	zh_TW
DC.subject	Class imbalance	zh_TW
DC.subject	Oversampling	zh_TW
DC.subject	Class-specific missingness	zh_TW
DC.title	遺漏值於類別不平衡問題之研究	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	Missing Values in Class Imbalanced Datasets	en_US
DC.type	Master's/doctoral thesis	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US
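
The two impute-first procedures described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation. It rests on several assumptions: mean imputation stands in for the six imputation methods (which the abstract does not name), `smote_like` is a simplified stdlib-only interpolation in the spirit of SMOTE, the imputed rows are kept in the training set in both variants, and all function names and the toy data are hypothetical. The oversample-first procedure is left out because the abstract does not say how oversampling treats incomplete rows.

```python
import math
import random


def mean_impute(rows):
    """Fill None entries with the column mean of the observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def smote_like(base, n_new, k=2, seed=0):
    """Simplified SMOTE-style oversampling (illustrative only).

    Each synthetic point lies on the segment between a randomly chosen
    base sample and one of its k nearest neighbours within `base`.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(base)
        neighbours = sorted(
            (p for p in base if p is not x), key=lambda p: math.dist(x, p)
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def impute_then_oversample(minority, n_new):
    """Impute first, then oversample from all (imputed) minority rows."""
    imputed = mean_impute(minority)
    return imputed + smote_like(imputed, n_new)


def impute_then_oversample_complete_only(minority, n_new):
    """Impute first, but build synthetic samples only from rows that
    were complete before imputation (assumed reading of the abstract)."""
    imputed = mean_impute(minority)
    complete = [r for r in minority if None not in r]
    return imputed + smote_like(complete, n_new)


# Hypothetical minority-class training rows; None marks a missing value.
minority = [[1.0, 2.0], [2.0, None], [3.0, 4.0], [None, 6.0], [5.0, 5.0]]
all_based = impute_then_oversample(minority, n_new=4)
complete_based = impute_then_oversample_complete_only(minority, n_new=4)
```

In the second variant, imputed values never serve as interpolation endpoints, so imputation error is not carried into the synthetic samples. This is one plausible reason why that variant could help at high missing rates, though the abstract reports only the outcome and not the mechanism.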

Thesis 109423034: full metadata record