Name 戴郁庭 (Yu-Ting Dai)   Department Information Management
Thesis Title 以動態時間校正進行類別不平衡資料之遺漏值處理
(Missing value imputation for class imbalance data: a dynamic warping approach)
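Abstract (Chinese) In a world awash with data, more and more companies hope to use data to strengthen their competitiveness. In practice, however, class imbalance and missing values remain important problems. Class imbalance arises frequently in fields such as medical diagnosis and bankruptcy prediction: the majority class has more samples than the minority class, so the data are skewed. Because a general-purpose classifier is built to maximize overall classification accuracy, the resulting prediction model is influenced by the skewed distribution and tends to misclassify samples as the majority class. Moreover, when the scarce minority-class data contain missing values, even fewer usable data points remain.
This thesis takes dynamic time warping (DTW) as its core and imputes missing values in a way that differs from earlier approaches, using the characteristics of DTW to address missing data in minority-class samples. The method does not require complete rows as references for imputation, so the experiments simulate missing rates of 10%, 30%, 50%, 70%, and 90% in the minority-class data.
The experiments use 17 KEEL datasets and two classifiers (SVM and decision tree) to build classification models, and compare the AUC (area under the curve) obtained with different imputation methods. The results on the KEEL datasets show that, compared with K-NN imputation, DTW-based imputation still performs well at missing rates of 50% to 90%.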
Abstract (English) In a world full of data, more and more companies want to use data to improve their competitiveness. However, class imbalance and missing values have long been important problems in real-world data. Class-imbalanced datasets occur in many fields, such as medical diagnosis and bankruptcy prediction. In such a dataset the majority class has more samples than the minority class, so the data distribution is skewed. Because general-purpose classifiers aim for high overall classification accuracy, the prediction models they produce are biased by the skewed distribution and tend to misclassify samples as the majority class. If the scarce minority-class data also contain missing values, even fewer usable data points remain.
This thesis uses dynamic time warping (DTW) as the core of a missing value imputation method. The characteristics of DTW are exploited to handle missing data in the minority class, which has few samples. The method does not require complete data rows as references for imputation. The experiments therefore simulate missing rates of 10%, 30%, 50%, 70%, and 90% in the minority-class data.
The experiments use 17 KEEL datasets and two classifiers (SVM and decision tree), and compare the AUC (area under the curve) obtained with different imputation methods. The results show that DTW-based imputation performs well at missing rates of 50% to 90%, where it outperforms K-NN imputation.
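The abstract does not spell out how DTW drives the imputation, so the following is only a minimal sketch under stated assumptions, not the thesis's procedure. It shows the classic DTW distance with an absolute-difference local cost, and a hypothetical helper `impute_row` that treats the observed values of each row as a sequence and copies missing entries from the donor row with the smallest DTW distance. Because DTW can compare sequences of different lengths, the donor rows need not be complete, which is the property the abstract relies on.

```python
import math

def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences (absolute-difference cost)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def observed(row):
    """Keep only the non-missing values of a row (missing entries are None)."""
    return [v for v in row if v is not None]

def impute_row(target, donors):
    """Hypothetical helper (an assumption, not from the thesis): fill None entries
    of `target` from the donor row closest in DTW distance over observed values.
    An entry stays None if the chosen donor is also missing that column."""
    obs_t = observed(target)
    best = min(donors, key=lambda r: dtw_distance(obs_t, observed(r)))
    return [best[k] if v is None else v for k, v in enumerate(target)]

filled = impute_row([1.0, None, 3.0], [[1.0, 2.0, 3.0], [5.0, 6.0, None]])
```

In this toy call the first donor is the closer one, so its second value fills the gap in the target row.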
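The simulated missing rates can be illustrated with a short sketch. It assumes that values are removed cell by cell, uniformly at random, from minority-class rows only; the abstract does not state the exact masking scheme, and the function name `mask_minority` and its parameters are hypothetical.

```python
import random

def mask_minority(X, y, minority_label, rate, seed=0):
    """Return a copy of X in which a fraction `rate` of the cells belonging to
    minority-class rows is replaced by None (assumed cell-wise random masking)."""
    rng = random.Random(seed)
    # Every (row, column) position that belongs to a minority-class sample
    cells = [(i, j)
             for i, row in enumerate(X) if y[i] == minority_label
             for j in range(len(row))]
    n_missing = round(rate * len(cells))
    chosen = set(rng.sample(cells, n_missing))
    return [[None if (i, j) in chosen else v for j, v in enumerate(row)]
            for i, row in enumerate(X)]

# Toy data: 4 majority rows (label 0) and 2 minority rows (label 1), 5 features each
X = [[float(i + j) for j in range(5)] for i in range(6)]
y = [0, 0, 0, 0, 1, 1]
masked = {rate: mask_minority(X, y, minority_label=1, rate=rate)
          for rate in (0.1, 0.3, 0.5, 0.7, 0.9)}
```

Each masked copy could then be imputed, passed to a classifier, and scored by AUC, which is the comparison the abstract describes.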
Keywords (Chinese) ★ class imbalance
★ missing value
★ imputation method
★ dynamic time warping
Keywords (English) ★ class imbalance
★ data mining
★ missing value
★ imputation
★ dynamic time warping
