dc.description.abstract | In recent years, with the boom in artificial intelligence, more people have taken the initiative to develop intelligent applications from their existing data, hoping to create successful products that are suitable for their business. In practice, however, data tends to be skewed or biased because of various human or environmental factors. The class imbalance problem exists widely across industries and domains, and it degrades the intelligent models used in related applications, so it has become an important practical concern. This study explores the benefits and effects of combining data preprocessing steps in different orders to address binary class imbalance problems. The preprocessing steps include the oversampling technique known as the Synthetic Minority Over-sampling Technique (SMOTE) and supervised discretization methods such as ChiMerge and MDLP. Additionally, to gain a deeper understanding of how different resampling methods perform on class imbalance problems, this study introduces a wider range of resampling methods, including SMOTE with different sampling strategies, an undersampling method called Tomek Links, and a hybrid method combining the two. The study further investigates the impact of different preprocessing combinations and orders on multiclass imbalance problems.
This study uses binary and multiclass datasets from the UCI and KEEL websites to compare the effects of single and mixed preprocessing methods on binary and multiclass imbalance problems, thereby clarifying the applicability of the different preprocessing methods and providing effective solutions and recommendations. According to the experimental results, for binary class imbalance problems this study recommends the mixed method of first discretizing the data features with MDLP and then balancing the datasets with SMOTE, which improves the classification performance of SVM, C4.5, and RF. For multiclass imbalance problems where time cost is not a concern, it recommends the mixed method of first balancing the datasets with resampling methods and then discretizing the data features with ChiMerge, which yields more robust and accurate experimental results. If data processing and model computation efficiency are a high priority, it recommends the mixed method of first balancing the datasets with resampling methods and then discretizing the data features with MDLP, which efficiently yields fairly accurate experimental results. | en_US |