Master's and Doctoral Theses: Detailed Record for Thesis 103423019




Author: Wei-Jou Lee (李韋柔)    Department: Information Management
Thesis Title: The Impact of Feature Selection Pre-processing on Missing Value Imputation
Related Theses
★ Building a sales forecasting model for commercial multifunction printers using data mining techniques
★ Applying data mining techniques to resource allocation prediction: a case study of a computer OEM support unit
★ Applying data mining techniques to flight delay analysis in the airline industry: a case study of Company C
★ Safety control of new products in the global supply chain: a case study of Company C
★ Data mining in the semiconductor laser industry: a case study of Company A
★ Applying data mining techniques to predicting warehouse dwell time of air-export cargo: a case study of Company A
★ Optimizing YouBike rebalancing operations with data mining classification techniques
★ The effect of feature attribute selection on different data types
★ Data mining for B2B corporate websites: a case study of Company T
★ Customer investment analysis and recommendations for financial derivatives: integrating clustering and association rule techniques
★ Building a computer-aided classification model for liver ultrasound images using convolutional neural networks
★ An identity recognition system based on convolutional neural networks
★ A comparative error-rate analysis of electric-power imputation methods in energy management systems
★ Development of an employee sentiment analysis and management system
★ Data cleaning for the class imbalance problem: a machine learning perspective
★ Applying data mining techniques to passenger self-service check-in analysis: a case study of Airline C
  1. The author has agreed to make this electronic thesis available immediately.
  2. Once released, the electronic full text is licensed only for personal, non-profit retrieval, reading, and printing for the purpose of academic research.
  3. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.

Abstract (Chinese) The large volumes of raw data collected from the real world inevitably contain incomplete or poor-quality records, which reduce both the effectiveness and the efficiency of data mining. Poor data mining performance has two causes: missing data, and data containing too many redundant or unrepresentative features. Such data provide no valuable information; they degrade the overall experimental results, lower the overall efficiency of the experiments, and raise their cost. The missing value problem is pervasive in data mining, with possible causes including data entry errors and incorrect data formats. Imputation methods were developed for this problem: they take the complete records as observations and use them to predict the missing values in the incomplete records. Feature selection filters redundant and unrepresentative features out of the data. This study combines feature selection techniques with imputation methods, and mainly examines the suitability of first refining the data through feature selection and then imputing the missing values.
To investigate the effect of feature selection on imputation, this thesis collects 12 complete datasets from the UCI repository, covering three data types (categorical, mixed, and numerical). To bring the experiments closer to real-world conditions, missing rates of 10%, 20%, 30%, 40%, and 50% are simulated as baselines, and the changes and trends under these five rates are examined. Three feature selection techniques are chosen: the Genetic Algorithm (GA), Decision Tree (DT), and Information Gain (IG); and three imputation methods: the Multilayer Perceptron (MLP), Support Vector Machine (SVM), and K-Nearest Neighbor Imputation (KNNI). These are used to examine under which conditions a given pairing of feature selection and imputation is the best or most suitable combination, and to compare the classification accuracy of the combined methods against imputation alone. The work comprises two studies: Study 1 handles the initial datasets, while Study 2 validates the findings of Study 1 on high-dimensional datasets.
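As a sketch of how the missing-rate simulation above might look in practice, the following Python snippet injects values missing completely at random (MCAR, the mechanism of Section 2-1-1) into a complete numeric dataset; the helper name inject_mcar, the MCAR choice, and all concrete numbers are illustrative assumptions, not the thesis's exact procedure.

```python
import numpy as np

def inject_mcar(X, missing_rate, seed=None):
    """Return a copy of X with roughly `missing_rate` of its cells set to NaN."""
    rng = np.random.default_rng(seed)
    X = X.astype(float)                        # NaN requires a float array (astype copies)
    mask = rng.random(X.shape) < missing_rate  # drop each cell independently
    X[mask] = np.nan
    return X

# Example: a complete 100x5 dataset with a simulated 30% missing rate.
X_complete = np.random.default_rng(0).normal(size=(100, 5))
X_missing = inject_mcar(X_complete, 0.30, seed=1)
print(np.isnan(X_missing).mean())  # close to 0.30
```

Repeating this at the five rates above produces the incomplete training sets that the imputation methods are then asked to reconstruct.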
According to the results of Study 1, on mixed datasets, feature selection with GA followed by imputation with IBk is informative and has a positive effect; on numerical datasets, feature selection with IG at a 65% retention rate followed by any of the MLP, IBk, or SVM imputation methods has a positive effect. Categorical datasets achieve their best accuracy with direct classification, so we recommend skipping the feature-selection-then-imputation procedure for them. We also found that at a 10% missing rate, performing feature selection before imputation outperforms both direct imputation and direct classification. Study 2 yields two findings: first, on high-dimensional datasets the best-performing combination is DT feature selection followed by MLP or IBk imputation; second, on high-dimensional datasets the combined methods perform better when the retention rate is below 40%.
Abstract (English) Real-life data collections frequently contain missing values, and the presence of missing values in a dataset can degrade the performance of data mining algorithms. The techniques commonly used for handling missing data are based on imputation methods. Poor-quality data negatively influence predictive accuracy: their causes include not only missing values but also redundant and unrepresentative features. Such data degrade the performance of data mining algorithms and increase the cost of research. To solve this problem, we propose performing feature selection over the complete data before the imputation step; the aim of feature selection is to filter unrepresentative features out of a given dataset. This research therefore focuses on identifying the best combination of feature selection and imputation methods.
The experimental setup is based on 12 UCI datasets composed of categorical, numerical, and mixed data types. The experiments simulate 10%, 20%, 30%, 40%, and 50% missing rates for each training dataset. Three feature selection methods are compared: the Genetic Algorithm (GA), Decision Tree (DT), and Information Gain (IG). Similarly, three imputation methods, K-Nearest Neighbor Imputation (KNNI), the Support Vector Machine (SVM), and the Multilayer Perceptron (MLP), are each employed individually. The comparative results show which combination of feature selection and imputation methods performs best, and whether combining feature selection with missing value imputation is a better choice than performing missing value imputation alone on incomplete datasets.
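A minimal sketch of the feature-selection-before-imputation pipeline, under stated assumptions: scikit-learn's mutual information scorer stands in for Information Gain, KNNImputer for KNNI, and an SVM classifier scores the result; the dataset, the 65% retention rate, the 30% missing rate, and k = 5 are illustrative choices, not the thesis's Weka configuration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: select the top 65% of features while the training data is still complete.
selector = SelectPercentile(mutual_info_classif, percentile=65).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

# Step 2: simulate 30% MCAR missingness on the reduced training data.
rng = np.random.default_rng(0)
X_tr_miss = X_tr_sel.copy()
X_tr_miss[rng.random(X_tr_miss.shape) < 0.30] = np.nan

# Step 3: fill the missing cells with k-nearest-neighbour imputation.
X_tr_imp = KNNImputer(n_neighbors=5).fit_transform(X_tr_miss)

# Step 4: train the classifier on the imputed data and score it on the
# (complete, feature-reduced) test set.
clf = SVC().fit(X_tr_imp, y_tr)
print("test accuracy:", clf.score(X_te_sel, y_te))
```

Swapping the score function or the imputer sketches the other combinations compared in the thesis (GA or DT for selection; MLP or SVM for imputation).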
According to the results of this research, the combination of the GA feature selection method and the IBk imputation method is significantly better than imputation alone on mixed datasets. Combining the IG feature selection method, retaining 65% of the features, with any of the selected imputation methods yields significantly better classification accuracy on numerical datasets. Performing missing value imputation alone is the better choice on categorical datasets. Performing feature selection before the imputation step gives the best classification accuracy at a 10% missing rate. On high-dimensional datasets, combining the DT feature selection method with the IBk or MLP imputation methods produces the best classification accuracy.
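The significance claims above are assessed with paired-sample t-tests (Sections 4-1-8 and 4-2-4). The following is a minimal sketch of such a test with hypothetical accuracy numbers, not the thesis's results.

```python
from scipy import stats

# Hypothetical per-dataset accuracies: a combined method vs. imputation alone,
# paired because both are evaluated on the same datasets.
acc_combined = [0.84, 0.79, 0.91, 0.76, 0.88]
acc_imputation_only = [0.81, 0.77, 0.90, 0.72, 0.85]

t_stat, p_value = stats.ttest_rel(acc_combined, acc_imputation_only)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # p < 0.05 suggests a significant difference
```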
Keywords ★ Data Mining
★ Feature Selection
★ Imputation Methods
★ Machine Learning
★ Classification
Table of Contents
Chapter 1: Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
1-4 Thesis Organization
Chapter 2: Literature Review
2-1 Missing Values
2-1-1 Missing Completely at Random
2-1-2 Missing at Random
2-1-3 Missing Not at Random
2-2 Missing Value Handling
2-2-1 Statistical Methods
2-2-2 Machine Learning Methods
2-3 Feature Selection
2-3-1 Wrappers
2-3-2 Embedded
2-3-3 Filters
Chapter 3: Research Methodology
3-1 Experimental Framework
3-2 Datasets
3-4 Study 1
3-4-1 Research Procedure
3-5 Study 2
Chapter 4: Experimental Results
4-1 Study 1: Experimental Results
4-1-1 Comparison of Information Gain Feature Subset Sizes
4-1-2 SVM Classification Accuracy Results on Categorical Data
4-1-3 SVM Classification Accuracy Results on Mixed Data
4-1-4 SVM Classification Accuracy Results on Numeric Data
4-1-5 Best-Performing Method for Each Dataset
4-1-6 Average Classification Accuracy of Each Dataset at Different Missing Rates
4-1-7 Feature Selection Accuracy Comparison on the Initial Datasets
4-1-8 Paired-Sample t-Test Results
4-2 Study 2: Experimental Results
4-2-1 Average Classification Accuracy Results on High-Dimensional Datasets
4-2-2 Comparison of Other High-Dimensional Datasets
4-2-3 Average Classification Accuracy over All High-Dimensional Datasets
4-2-4 Paired-Sample t-Test Results
Chapter 5: Conclusions and Future Research Directions
5-1 Conclusions and Contributions
5-2 Future Prospects
References
Appendix 1: Feature Selection Results
Appendix 2: Weka Parameter Descriptions

Advisor: Chih-Fong Tsai (蔡志豐)    Date of Approval: 2016-7-5
