Thesis 102423002 — Detailed Record




Author: Yun-Jie Li (李昀潔)   Department: Information Management
Thesis title: The Effect of Instance Selection on Missing Value Imputation
Related theses
★ Building a sales forecasting model for commercial multifunction printers using data mining techniques
★ Applying data mining techniques to resource allocation prediction: the case of a computer OEM support unit
★ Applying data mining to flight delay analysis in the airline industry: the case of Company C
★ Safety control of new products in the global supply chain: the case of Company C
★ Data mining in the semiconductor laser industry: the case of Company A
★ Applying data mining to predicting warehouse storage time of air-freight export cargo: the case of Company A
★ Optimizing YouBike redistribution operations using data mining classification techniques
★ The effect of feature selection on different data types
★ Data mining applied to B2B corporate websites: the case of Company T
★ Customer investment analysis and recommendations for financial derivatives: integrating clustering and association rule techniques
★ Building a computer-aided diagnostic model for liver ultrasound images using convolutional neural networks
★ An identity recognition system based on convolutional neural networks
★ Comparative error-rate analysis of power imputation methods in energy management systems
★ Development of an employee sentiment analysis and management system
★ Data cleaning for class imbalance problems: a machine learning perspective
★ Applying data mining to passenger self-service check-in analysis: the case of Airline C
  1. This electronic thesis is authorized for immediate open access.
  2. The open-access full text is licensed to users solely for academic research: personal, non-profit retrieval, reading, and printing.
  3. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.

Abstract (Chinese) The missing value problem is pervasive in data mining: whether caused by data-entry errors or incorrect data formats, missing values prevent the available data from being used effectively to build a suitable classification model. Imputation methods were developed for this problem; they analyze the existing data to estimate suitable replacement values, which in turn provide appropriate data for model construction.
However, the existing data may not support effective imputation, because they often suffer from problems such as noise, redundancy, or the presence of many unrepresentative instances. To make better use of the available data, instance selection methods address these problems by keeping only representative instances. In other words, instance selection applies a set of selection criteria to produce a reduced dataset composed of representative instances; the imputation method then works on this reduced dataset, so that the problems in the raw data do not degrade the imputation results.
This thesis examines the effect of instance selection on missing value imputation. The experiments use 33 datasets from the UCI repository, grouped into three types (categorical, mixed, and numerical). Three instance selection methods are chosen, IB3 (Instance-Based learning), DROP3 (Decremental Reduction Optimization Procedure), and GA (Genetic Algorithm), together with three imputation methods, KNNI (K-Nearest Neighbor Imputation), SVM (Support Vector Machine), and MLP (Multilayer Perceptron). We examine which of the nine combinations performs best under which conditions, and whether the combined methods outperform imputation alone.
Based on the results, for numerical datasets we recommend the pipeline of instance selection followed by imputation over imputation alone. Among the instance selection methods, DROP3 is more suitable for numerical and mixed datasets, but is not recommended for categorical datasets with a large number of classes. Between GA and IB3, we recommend GA, since our experiments show that its instance selection performance is more stable than IB3's.
Abstract (English) In data mining, the collected datasets are usually incomplete; that is, they contain some missing attribute values. It is difficult to develop an effective learning model from incomplete datasets. In the literature, missing value imputation has been proposed to handle incomplete datasets. Its aim is to provide estimations for the missing values based on the (observed) complete data samples.
However, some of the complete data may contain some noisy information, which can be regarded as outliers. If these noisy data were used for missing value imputation, the quality of the imputation results would be affected. To solve this problem, we propose to perform instance selection over the complete data before the imputation step. The aim of instance selection is to filter out some unrepresentative data from a given dataset. Therefore, this research focuses on examining the effect of performing instance selection on missing value imputation.
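The two-step idea of filtering the complete data before it is used for imputation can be sketched in code. As a minimal stand-in for the IB3/DROP3/GA selectors studied in this thesis, the hypothetical `enn_filter` below drops any instance whose class label disagrees with the majority of its k nearest neighbours (an ENN-style editing rule, not the thesis's actual selectors):

```python
import numpy as np

def enn_filter(X, y, k=3):
    """ENN-style instance selection: keep an instance only if the
    majority of its k nearest neighbours share its class label.
    A simple stand-in for IB3/DROP3/GA, not the thesis's method."""
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)  # distance to every instance
        d[i] = np.inf                         # exclude the instance itself
        nn = np.argsort(d)[:k]                # indices of k nearest neighbours
        # keep instance i if most neighbours agree with its label
        if np.sum(y[nn] == y[i]) >= (k + 1) // 2:
            keep.append(i)
    return np.array(keep)
```

The instances surviving the filter form the reduced, representative reference set that the subsequent imputation step draws on.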
The experimental setup is based on 33 UCI datasets, which are composed of categorical, numerical, and mixed types of data. Three instance selection methods, IB3 (Instance-based learning), DROP3 (Decremental Reduction Optimization Procedure), and GA (Genetic Algorithm), are used for comparison. Similarly, three imputation methods, KNNI (K-Nearest Neighbor Imputation), SVM (Support Vector Machine), and MLP (Multilayer Perceptron), are also employed individually. The comparative results allow us to understand which combination of instance selection and imputation methods performs best, and whether combining instance selection with missing value imputation is a better choice than performing missing value imputation alone on incomplete datasets.
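Of the three imputation methods compared, KNNI is the simplest to illustrate. The sketch below is a minimal numeric-only version (the name `knn_impute` and the mean-based fill are illustrative assumptions, not the thesis's exact implementation): each missing entry is filled with the mean of that feature over the k complete instances that are nearest on the observed features.

```python
import numpy as np

def knn_impute(complete, incomplete, k=3):
    """Minimal KNNI sketch for numeric data: fill each NaN with the
    mean of that feature over the k complete rows nearest to the
    incomplete row, measured on its observed features only."""
    filled = incomplete.copy()
    for row in filled:                 # each row is a view into `filled`
        obs = ~np.isnan(row)           # mask of observed features
        # distance to every complete instance on the observed features
        d = np.linalg.norm(complete[:, obs] - row[obs], axis=1)
        nn = np.argsort(d)[:k]         # k nearest complete instances
        # fill the missing features with the neighbours' feature means
        row[~obs] = complete[nn][:, ~obs].mean(axis=0)
    return filled
```

When instance selection is applied first, `complete` would be the reduced reference set rather than all complete samples.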
According to the results of this research, we suggest that combining instance selection methods with imputation methods is more suitable than imputation alone for numerical datasets. In particular, the DROP3 instance selection method is more suitable for numerical and mixed datasets, but is not recommended for categorical datasets, especially when the number of features is large. Of the other two instance selection methods, the GA method provides more stable reduction performance than IB3.
Keywords ★ Data Mining
★ Instance Selection Methods
★ Imputation Methods
★ Machine Learning
★ Classification
Table of Contents
Chinese Abstract
Abstract
Acknowledgements
Content
List of Tables
List of Figures
Chapter 1. Introduction
1.1 Background
1.2 Motivation
1.3 Purpose
Chapter 2. Literature
2.1 Overview
2.2 Instance Selection
2.2.1. Instance-based Learning Algorithm
2.2.2. Decremental Reduction Optimization Procedure
2.2.3. Genetic Algorithm
2.3 Imputation
2.3.1. k-Nearest Neighbor Imputation
2.3.2. Support Vector Machine
2.3.3. MultiLayer Perceptron
Chapter 3. Methodology
3.1 Overview
3.2 Data set
3.3 Imputation
3.4 Procedure
Chapter 4. Experiment
4.1 Overview
4.2 Categorical data type
4.3 Mixed data type
4.4 Numerical data type
Chapter 5. Conclusion and Discussion
5.1 Summary of the Research
5.2 Future Research Directions
Chapter 6. References
Appendix
A.1 Categorical Detail
A.1.1. Accuracy
A.1.2. Reduction Rate
A.2 Mixed Detail
A.2.1. Accuracy
A.2.2. Reduction Rate
A.3 Numerical Detail
A.3.1. Accuracy
A.3.2. Reduction Rate

Advisor: Chih-Fong Tsai (蔡志豐)   Approval date: 2015-06-22
