Abstract: | With the rapid development of information technology, missing values have become common across many kinds of datasets. They undermine data completeness and adversely affect subsequent data analysis and decision-making, so handling them is an important and challenging task in data preprocessing. The most common approach is imputation. Common methods include single imputation, which covers both statistical and machine learning techniques, and the multiple imputation methods that scholars have adopted in recent years. However, few studies have compared multiple imputation with machine learning imputation, particularly across different types of datasets and different missing rates. In addition, although ensemble learning has been shown to improve model prediction accuracy, its application to missing value imputation remains under-researched. This thesis therefore analyzes the performance of single and multiple imputation under different models and explores the use of ensemble learning for missing value imputation. The study used 25 UCI datasets covering numerical, categorical, and mixed data types and simulated missing rates from 10% to 50%. Five machine learning algorithms were evaluated under both single imputation and multiple imputation (MICE), and two MICE-based ensemble imputation methods, a hybrid strategy and a parallel strategy, were proposed. 
Imputation effectiveness was assessed using SVM classification accuracy, RMSE, MAPE, and categorical hit ratio. Results showed that on numerical and categorical datasets, multiple imputation clearly outperformed single imputation on every metric. On mixed datasets, the other multiple imputation methods slightly underperformed single imputation, with random forest the only exception. Across the comparison of multiple and single imputation, random forest was the best method overall. Ensemble imputation experiments indicated that both the hybrid and parallel strategies improved all metrics, though the order in which models are applied in hybrid imputation noticeably affected the results and needs attention in practice. Finally, we provide recommended multiple and ensemble imputation strategies for different needs, as a reference for future researchers. |