Data Pre-processing: A Study of the Integration of Imputation and Instance Selection

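Name

張復喻 (Fu-yu Chang)

Department

Department of Information Management

Thesis Title

Data Pre-processing: A Study of the Integration of Imputation and Instance Selection
(A Study of Data Pre-process: the Integration of Imputation and Instance Selection)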

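Related Theses

★ Building a Sales Forecasting Model for Commercial Multifunction Printers Using Data Mining Techniques
★ Applying Data Mining Techniques to Resource Allocation Forecasting: A Case Study of a Support Unit at a Computer Contract Manufacturer
★ Applying Data Mining Techniques to Flight Delay Analysis in the Airline Industry: A Case Study of Company C
★ Security Control of New Products in the Global Supply Chain: A Case Study of Company C
★ Data Mining Applied to the Semiconductor Laser Industry: A Case Study of Company A
★ Applying Data Mining Techniques to Predict Warehouse Storage Time of Air Export Cargo: A Case Study of Company A
★ Optimizing YouBike Rebalancing Operations Using Data Mining Classification Techniques
★ The Effect of Feature Selection on Different Data Types
★ Data Mining Applied to the Study of B2B Corporate Websites: A Case Study of Company T
★ Customer Investment Analysis and Recommendation for Financial Derivatives: Integrating Clustering and Association Rule Techniques
★ Building an Assisted Diagnosis Model for Liver Ultrasound Images Using Convolutional Neural Networks
★ An Identity Recognition System Based on Convolutional Neural Networks
★ A Comparative Analysis of Error Rates of Electricity Data Imputation Methods in Energy Management Systems
★ Development of an Employee Sentiment Analysis and Management System
★ Data Cleaning for the Class Imbalance Problem: A Machine Learning Perspective
★ Applying Data Mining Techniques to the Analysis of Passenger Self-Service Check-in: A Case Study of Airline C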

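This electronic thesis is authorized for immediate open access.
The open-access electronic full text is licensed to users solely for personal, non-profit retrieval, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.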

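Abstract (Chinese)

Missing or abnormal samples in a data set affect the data mining process and reduce the correctness of the mining results. Data pre-processing before data mining is therefore necessary. Data pre-processing handles or filters the missing or abnormal samples in a data set, and its most commonly used methods are "imputation" and "instance selection".
Imputation analyzes the "complete data samples" in a data set, estimates a plausible value, and fills it into the empty field. Although many studies have proposed new imputation techniques, they overlook the "complete data samples" that the imputation process relies on. If these complete data samples contain outliers, the imputation process is adversely affected. This study therefore proposes performing instance selection on the complete data samples before imputation: the abnormal data are filtered out, and the reduced samples serve as the reference samples for imputation, which makes the imputation results more reliable (Process 2). In addition, from the viewpoint of instance selection, a data set may still contain redundant (duplicate) or abnormal values after imputation. This study therefore also proposes performing instance selection after imputation to filter out unnecessary data and retain representative data, thereby improving data mining accuracy (Process 1). To obtain an even more compact and representative data set, a second round of instance selection is applied to Process 1 and Process 2, which yields Process 3 and Process 4.
This study uses 31 different data sets of three main types (numerical, categorical, and mixed), with missing rates from 10% to 50% at 10% intervals. Finally, a decision tree model is constructed to derive decision rules that relate data set characteristics (such as number of samples, number of dimensions, number of classes, and data type) and missing rate, to help data analysts determine when to use which data pre-processing process.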

Abstract (English)

In practice, collected data usually contain missing values and noise, which are likely to degrade data mining performance. As a result, a data pre-processing step is necessary before data mining. The aim of data pre-processing is to deal with missing values and filter out noisy data. In particular, “imputation” and “instance selection” are two common solutions for data pre-processing.
The aim of imputation is to provide estimates for missing values by reasoning from the observed data (i.e., complete data). Although various missing value imputation algorithms have been proposed in the literature, the values produced by most imputation algorithms rely heavily on the complete (training) data. Therefore, if some of the complete data contain noise, the noise directly affects the quality of the imputation and of the data mining results. In this thesis, four integration processes are proposed. In Process 2, instance selection is executed first to remove noisy (complete) data from the training set, and imputation is then performed based on the reduced training set. Conversely, in Process 1, imputation is employed first to produce a complete training set, and instance selection is then performed to filter out noisy data from this set. In order to retain more representative data, instance selection is performed again over the outputs of Processes 1 and 2 (Process 3 and Process 4).
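The ordering of the four processes described above can be sketched in a few lines of Python. This is a minimal standard-library illustration under stated assumptions, not the thesis's implementation: the table of contents names kNNI and DROP3, but for brevity the sketch substitutes a simplified form of Wilson's edited nearest neighbor (ENN) rule for DROP3, and every function name and the toy data are hypothetical.

```python
import math

def distance(a, b, cols):
    """Euclidean distance between two feature lists over the given columns."""
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in cols))

def knn_impute(complete, incomplete, k=3):
    """Fill each None with the column mean of the k nearest complete rows.
    Distance is computed on the observed columns of the incomplete row."""
    filled = []
    for features, label in incomplete:
        observed = [c for c, v in enumerate(features) if v is not None]
        nearest = sorted(complete,
                         key=lambda r: distance(r[0], features, observed))[:k]
        new = [v if v is not None
               else sum(r[0][c] for r in nearest) / len(nearest)
               for c, v in enumerate(features)]
        filled.append((new, label))
    return filled

def enn_select(rows, k=3):
    """Simplified ENN (stand-in for DROP3): keep a row only if a strict
    majority of its k nearest neighbours carry the same label."""
    cols = range(len(rows[0][0]))
    kept = []
    for i, (features, label) in enumerate(rows):
        others = rows[:i] + rows[i + 1:]
        nearest = sorted(others,
                         key=lambda r: distance(r[0], features, cols))[:k]
        votes = sum(1 for r in nearest if r[1] == label)
        if votes * 2 > len(nearest):
            kept.append((features, label))
    return kept

def process_1(complete, incomplete, k=3):
    """Impute first, then run instance selection on the completed set."""
    return enn_select(complete + knn_impute(complete, incomplete, k), k)

def process_2(complete, incomplete, k=3):
    """Select among the complete rows first, then impute from the reduced set."""
    reduced = enn_select(complete, k)
    return reduced + knn_impute(reduced, incomplete, k)

def process_3(complete, incomplete, k=3):
    """Process 1 followed by a second round of instance selection."""
    return enn_select(process_1(complete, incomplete, k), k)

def process_4(complete, incomplete, k=3):
    """Process 2 followed by a second round of instance selection."""
    return enn_select(process_2(complete, incomplete, k), k)

# Toy data: two clusters plus one mislabeled (noisy) complete row.
NOISY = ([1.0, 1.1], "b")
complete = [
    ([1.0, 1.0], "a"), ([1.2, 0.9], "a"), ([0.9, 1.1], "a"), ([1.1, 1.2], "a"),
    ([5.0, 5.0], "b"), ([5.2, 4.9], "b"), ([4.9, 5.1], "b"), ([5.1, 5.2], "b"),
    NOISY,
]
incomplete = [([1.0, None], "a"), ([None, 5.0], "b")]

for name, proc in [("Process 1", process_1), ("Process 2", process_2),
                   ("Process 3", process_3), ("Process 4", process_4)]:
    print(name, len(proc(complete, incomplete)), "rows kept")
```

In Process 1 the mislabeled row is still among the neighbours used for imputation and is removed only afterwards, whereas in Process 2 it is filtered out before any value is estimated. That difference in ordering is what the thesis sets out to compare.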
The experiments are based on 31 different data sets, which contain categorical, numerical, and mixed types of data, with missing rates per data set from 10% to 50% at 10% intervals. A decision tree model is then constructed to extract useful rules that recommend which integration process to use under which conditions (no. of samples, no. of attributes, no. of classes, type of data set, missing rate).
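The missing-rate settings above (10% to 50% in 10% steps) could be simulated along the following lines. This is only a sketch: the abstract does not say how missing cells are chosen, so removing feature cells uniformly at random is an assumption here, and `inject_missing`, `MISSING_RATES`, and the toy table are hypothetical names introduced for illustration.

```python
import random

def inject_missing(rows, rate, seed=0):
    """Return a copy of `rows` in which a fraction `rate` of all feature
    cells is replaced by None (cells chosen uniformly at random; assumption)."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(len(rows)) for c in range(len(rows[0]))]
    n_missing = round(rate * len(cells))
    masked = [list(row) for row in rows]  # copy so the input stays intact
    for r, c in rng.sample(cells, n_missing):
        masked[r][c] = None
    return masked

MISSING_RATES = [0.1, 0.2, 0.3, 0.4, 0.5]

# Toy table with 25 rows x 4 columns = 100 feature cells.
data = [[float(r * 4 + c) for c in range(4)] for r in range(25)]
variants = {rate: inject_missing(data, rate) for rate in MISSING_RATES}

for rate, table in variants.items():
    n_none = sum(v is None for row in table for v in row)
    print(f"missing rate {rate:.0%}: {n_none} of 100 cells removed")
```

Each variant would then be passed through the baseline and the four integration processes, so that results can be compared at every missing rate.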

Keywords (Chinese)

★ missing value (遺漏值)
★ data mining (資料探勘)
★ data pre-processing (資料前處理)
★ imputation (補值法)
★ instance selection (樣本選取法)

Keywords (English)

★ missing value
★ data mining
★ data pre-process
★ imputation
★ instance selection

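Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 Missing Values
2.2 Handling Missing Values
2.2.1 Prevention
2.2.2 Deletion Methods
2.2.3 Dummy Variable Method
2.2.4 Imputation
2.2.5 k-Nearest Neighbor Imputation (kNNI)
2.3 Instance Selection
2.3.1 Introduction to Instance Selection
2.3.2 DROP3
2.4 Discussion
Chapter 3 Research Method
3.1 Experimental Framework
3.2 Baseline Process
3.3 Experimental Processes
3.3.1 Process 1
3.3.2 Process 2
3.3.3 Process 3
3.3.4 Process 4
Chapter 4 Experimental Results
4.1 Data Sets
4.2 Comparison of Results from Different Perspectives
4.2.1 Different Missing Rates
4.2.2 Different Data Types
4.3 Extracting Decision Rules
4.3.1 Decision Rules for Process 1
4.3.2 Decision Rules for Process 2
4.3.3 Decision Rules for Process 3
4.3.4 Decision Rules for Process 4
Chapter 5 Conclusion
5.1 Conclusions and Contributions
5.2 Future Research Directions and Suggestions
References
Appendix 1
Appendix 2
Appendix 3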

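List of Figures
Figure 1-1 Example of missing values in data
Figure 1-2 Example of handling missing values in data
Figure 1-3 Thesis organization
Figure 2-1 Example of handling missing values: comparison of listwise deletion and pairwise deletion
Figure 2-2 Example of missing values in numerical data
Figure 2-3 kNNI example
Figure 2-4 Illustration of instance selection
Figure 2-5 DROP1 algorithm
Figure 2-6 Data processing example: instance selection followed by imputation
Figure 2-7 Data processing example: imputation followed by instance selection
Figure 3-1 Baseline flowchart
Figure 3-2 Process for simulating missing values
Figure 3-3 Single imputation and multiple imputation
Figure 3-4 Single imputation
Figure 3-5 Multiple imputation
Figure 3-6 Process 1
Figure 3-7 Process 2
Figure 3-8 Process 3
Figure 3-9 Process 4
Figure 4-1 Illustration of the UCI website
Figure 4-2 Classification results of the four processes (KNN)
Figure 4-3 Classification results of the four processes (SVM)
Figure 4-4 Classification results of the 13 numerical data sets (KNN)
Figure 4-5 Classification results of the 13 numerical data sets (SVM)
Figure 4-6 Classification results of the numerical data sets
Figure 4-7 Classification results of the 9 categorical data sets (KNN)
Figure 4-8 Classification results of the 9 categorical data sets (SVM)
Figure 4-9 Classification results of the categorical data sets
Figure 4-10 Classification results of the 9 mixed data sets (KNN)
Figure 4-11 Classification results of the 9 mixed data sets (SVM)
Figure 4-12 Classification results of the mixed data sets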

Advisor

蔡志豐 (Chih-fong Tsai)

Approval Date

2014-7-1