Abstract: | In recent years, with the rapid growth of artificial intelligence, many industries have invested in research that uses their existing data to develop intelligent applications suited to their own business. In practice, however, data tends to be naturally skewed and unevenly distributed because of various human and environmental factors. This class imbalance problem is widespread across industries and domains, degrades the models behind related applications, and has become an important practical concern. This study therefore applies the data-level oversampling method SMOTE (Synthetic Minority Over-sampling Technique) together with the supervised discretization methods ChiMerge and MDLP to examine how the combination and order of data preprocessing steps affect binary class imbalance problems.
In addition, to better understand how different resampling methods perform on class imbalance problems, this study includes several resampling methods, namely SMOTE with different sampling strategies, the undersampling method Tomek Links, and a hybrid of the two, and uses them to investigate how the combination and order of preprocessing steps affect multiclass imbalance problems. Using binary and multiclass datasets from the UCI and KEEL websites, the study compares single and combined preprocessing methods on both binary and multiclass imbalance problems in order to clarify when each method is applicable and to offer effective solutions and recommendations. According to the experimental results, for binary class imbalance problems this study recommends the combined approach of discretizing features with MDLP first and then balancing the dataset with SMOTE, which improves the classification performance of SVM, C4.5, and RF. For multiclass imbalance problems, when time cost is not a concern, it recommends resampling first and then discretizing with ChiMerge, which yields more robust and accurate results. If the efficiency of data processing and model computation is a priority, it recommends resampling first and then discretizing with MDLP, which obtains fairly accurate results efficiently. |