Missing value imputation for class imbalance data: a dynamic warping approach

Name

Yu-Ting Dai (戴郁庭)

Department

Department of Information Management

Thesis title

Missing value imputation for class imbalance data: a dynamic warping approach
(以動態時間校正進行類別不平衡資料之遺漏值處理)

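Related theses

★ Building a sales forecasting model for commercial multifunction printers using data mining techniques
★ Applying data mining techniques to resource allocation forecasting: a case study of a support unit at a computer contract manufacturer
★ Applying data mining techniques to flight delay analysis in the airline industry: a case study of Company C
★ Security control of new products in a global supply chain: a case study of Company C
★ Applying data mining to the semiconductor laser industry: a case study of Company A
★ Applying data mining techniques to predicting warehouse storage time for air export cargo: a case study of Company A
★ Optimizing YouBike rebalancing operations with data mining classification techniques
★ The effect of feature selection on different data types
★ Applying data mining to the study of B2B corporate websites: a case study of Company T
★ Customer investment analysis and recommendations for financial derivatives: integrating clustering and association rule techniques
★ Building a computer-aided classification model for liver ultrasound images using convolutional neural networks
★ An identity recognition system based on convolutional neural networks
★ A comparative analysis of the error rates of electric energy data imputation methods in energy management systems
★ Development of an employee sentiment analysis and management system
★ Data cleaning for the class imbalance problem: a machine learning perspective
★ Applying data mining techniques to the analysis of passenger self-service check-in: a case study of Airline C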

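This electronic thesis is authorized for immediate open access.
Once open access takes effect, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.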

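Abstract (translated from Chinese)

In a world awash with data, more and more companies hope to use data to strengthen their competitiveness. In practice, however, class imbalance and missing values remain important problems. Class imbalance arises frequently in fields such as medical diagnosis and bankruptcy prediction: the majority class contains more samples than the minority class, so the data follow a skewed distribution. A prediction model built with a standard classifier, which aims for high classification accuracy, is affected by this skew and tends to misclassify samples as the majority class. Moreover, when values are missing from the already scarce minority-class data, the usable data points become scarcer still.
This thesis takes dynamic time warping (DTW) as its core concept and imputes missing values in a way that differs from earlier approaches, exploiting the properties of DTW to handle missing data in minority-class samples. The method does not require complete data rows as a reference for imputation, so the experiments simulate missing rates of 10%, 30%, 50%, 70%, and 90% in the minority-class data.
The experiments use 17 KEEL datasets and two classifiers (SVM and decision tree) to build classification models, and compare the AUC (area under the curve) obtained with different imputation methods. The results on the KEEL datasets show that, compared with K-NN imputation, DTW-based imputation still performs well at missing rates of 50% to 90%.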
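
For orientation, a K-NN imputation baseline of the kind mentioned above could look roughly like the following sketch. It is not the thesis's implementation: the function name `knn_impute`, the use of complete rows as donors, Euclidean distance over the observed columns, and averaging the neighbours' values are all assumptions made for illustration.

```python
import math


def knn_impute(rows, k=3):
    """Fill None cells using the k nearest complete rows (hypothetical baseline).

    Assumption: only rows without any missing value may act as donors, and the
    distance is Euclidean over the columns observed in the incomplete row.
    """
    donors = [r for r in rows if None not in r]
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        observed = [c for c, v in enumerate(row) if v is not None]
        missing = [c for c, v in enumerate(row) if v is None]
        # Nothing to fill, nothing to compare on, or no donor available.
        if not missing or not observed or not donors:
            continue
        ranked = sorted(
            donors,
            key=lambda d: math.sqrt(sum((row[c] - d[c]) ** 2 for c in observed)),
        )
        nearest = ranked[:k]
        for c in missing:
            filled[i][c] = sum(d[c] for d in nearest) / len(nearest)
    return filled


if __name__ == "__main__":
    data = [[1.0, 2.0], [1.1, 2.2], [9.0, 18.0], [1.0, None]]
    print(knn_impute(data, k=2))
```

Under these assumptions the baseline leaves a cell unfilled when no complete donor row exists, which is one way the scarcity of complete minority-class rows at high missing rates can hurt a donor-based method.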

Abstract (English)

In a world full of data, more and more companies want to use that data to improve their competitiveness. However, class imbalance and missing values have long been important problems in real-world data. Class-imbalanced datasets often occur in fields such as medical diagnosis and bankruptcy prediction. In such a dataset, the majority class has more samples than the minority class, so the data distribution is skewed. Because a general-purpose classifier is trained to maximize classification accuracy, the resulting prediction model is affected by the skewed distribution and tends to misclassify samples as the majority class. If the precious minority class also contains missing data, the available data become even scarcer.
In this thesis, dynamic time warping (DTW) is the core of the missing value imputation method. The properties of DTW are used to address missing data in the minority class, which contains few samples. The method does not require complete data samples as a reference for imputation. Therefore, the experiments simulate missing rates of 10%, 30%, 50%, 70%, and 90% in the minority-class data.
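
The two ingredients named in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's algorithm: `dtw_distance` is the classic DTW recurrence, while `mask_minority` and `observed` are hypothetical helpers. Removing cells completely at random, and simply skipping missing entries before comparing two samples, are assumptions about how such an experiment might be set up.

```python
import math
import random


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def observed(row):
    """Keep only the non-missing values of a sample (assumed preprocessing)."""
    return [v for v in row if v is not None]


def mask_minority(rows, labels, minority_label, rate, seed=0):
    """Return a copy of rows with a fraction `rate` of minority-class cells set to None.

    Assumption: cells are removed completely at random within the minority class.
    """
    rng = random.Random(seed)
    cells = [
        (r, c)
        for r, row in enumerate(rows)
        if labels[r] == minority_label
        for c in range(len(row))
    ]
    masked = [list(row) for row in rows]
    for r, c in rng.sample(cells, round(rate * len(cells))):
        masked[r][c] = None
    return masked


if __name__ == "__main__":
    rows = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [9.0, 8.0, 7.0], [8.0, 7.0, 6.0]]
    labels = [0, 0, 1, 1]
    for rate in (0.1, 0.3, 0.5, 0.7, 0.9):
        masked = mask_minority(rows, labels, minority_label=1, rate=rate)
        # DTW accepts sequences of unequal length, so incomplete rows stay comparable.
        print(rate, dtw_distance(observed(masked[2]), observed(masked[3])))
```

Because DTW aligns sequences of different lengths, two incomplete samples can still be compared on whatever values remain. In this sketch the distance is infinite when exactly one sample has no observed values, and 0 when both are empty, so a fully missing row would need separate handling.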
The experiments use 17 KEEL datasets and two classifiers (SVM and decision tree), and compare the AUC (area under the curve) obtained with different imputation methods. The results show that DTW-based imputation performs well at missing rates of 50% to 90%, where it outperforms KNN imputation.
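
As a rough illustration of why AUC rather than plain accuracy is a natural metric for imbalanced data, the sketch below computes AUC from its pairwise definition. The function name `auc_score` and the toy data are illustrative assumptions. The thesis does not state how its AUC values were computed.

```python
def auc_score(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # 95 majority (0) and 5 minority (1) samples, every one scored identically:
    # always predicting the majority class is 95% accurate, yet the AUC is 0.5.
    y = [0] * 95 + [1] * 5
    constant_scores = [0.0] * 100
    accuracy = sum(1 for label in y if label == 0) / len(y)
    print(accuracy, auc_score(y, constant_scores))
```

The quadratic loop keeps the definition visible. For large datasets a rank-based computation would be the usual choice.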

Keywords (translated from Chinese)

★ class imbalance
★ missing value
★ imputation method
★ dynamic time warping

Keywords (English)

★ class imbalance
★ data mining
★ missing value
★ imputation
★ dynamic time warping

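Table of contents

Abstract (Chinese)
Abstract (English)
List of figures
List of tables
1. Introduction
1-1 Research background
1-2 Research motivation
1-3 Research objectives
1-4 Thesis organization
2. Literature review
2-1 The class imbalance problem
2-2 Literature on solving the class imbalance problem
2-2-1 Data level
2-3 The missing value problem
2-3-1 Missing completely at random (MCAR)
2-3-2 Missing at random (MAR)
2-3-3 Missing not at random (MNAR)
2-4 Missing value imputation methods
2-4-1 Case deletion
2-4-2 Single imputation
2-4-3 K-nearest neighbor (KNN)
2-5 Dynamic Time Warping
3. Research method
3-1 Research framework
3-2 Experimental datasets
3-3 Imputation with the DTW algorithm
3-3-1 All samples can be imputed and no exceptions occur
3-3-2 Exceptions occur
3-3-3 All values in the sample to be imputed are missing
4. Experimental results
4-1 Experimental setup
4-1-1 Hardware and software
4-2 Experimental results and summary
4-2-1 Results using Support Vector Machines
4-2-2 Summary using Support Vector Machines
4-2-3 Results using Decision Tree
4-2-4 Summary using Decision Tree
4-3 Discussion
4-3-1 Discussion 1: effect of lacking complete data samples
4-3-2 Discussion 2: effect of the imbalance ratio on imputation
5. Conclusion
5-1 Conclusions and contributions
5-2 Future research directions and suggestions
References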

Advisor

Chih-Fong Tsai (蔡志豐)

Approval date

2019-7-1
