dc.description.abstract | In recent years, personal devices and embedded systems have become prevalent, and high-dimensional data is collected from all over the world through the Internet; a single dataset may reach several petabytes. This abundance of data creates many new business opportunities, but its high dimensionality also troubles many companies: the sheer volume of data demands more storage space, and data mining models built on high-dimensional data take a long time to train and may learn poorly. To avoid these problems, feature selection, a technique commonly used in data preprocessing, can be applied to reduce the data dimension. Feature selection is therefore the focus of this study, which explores the best feature selection methods for different high-dimensional datasets. Ensemble feature selection in high dimension, low sample size datasets: Parallel and serial combination approaches
In classification problems, most current research on feature selection targets binary classification, but multi-class classification problems must also be dealt with in the real world. In the literature on multi-class feature selection, few methods apply all three types of feature selection (filter, wrapper, and embedded), and no parallel ensemble feature selection technique has been combined with single multi-class feature selection methods.
This study applies single feature selection methods of all three types to ten high-dimensional imbalanced datasets: six filter methods, five wrapper methods, and four embedded methods. To address class imbalance, SMOTE is applied at the data level to balance the samples, and the average accuracy, average area under the ROC curve (AUC), and computing time are recorded.
The experimental results suggest using the SMOTE method for multi-class imbalanced datasets. Furthermore, with embedded feature selection and the SVM classifier, applying SMOTE first and then selecting features, the Lasso+XGBoost combination yields the highest average prediction accuracy; with the same classifier and embedded selection, selecting features first and then applying SMOTE, the union of Lasso+RandomForest+XGBoost yields the highest average area under the ROC curve. | en_US |
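The SMOTE-first serial combination described above can be sketched as follows. This is a minimal illustration under assumed inputs, not the study's implementation: the data is synthetic, the SMOTE step is a hand-rolled nearest-neighbour interpolation rather than the imbalanced-learn library, and scikit-learn's GradientBoostingClassifier stands in for XGBoost to avoid an extra dependency.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def smote_like(X_min, n_new, k=5):
    """Minimal SMOTE-style oversampling: synthesize n_new points by
    interpolating between a random minority sample and one of its
    k nearest minority-class neighbours."""
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # exclude self-matches
    nn = np.argsort(dist, axis=1)[:, :k]           # k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                   # interpolation factor
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Hypothetical imbalanced 3-class data (not the study's ten datasets).
X, y = make_classification(n_samples=300, n_features=50, n_informative=8,
                           n_classes=3, n_clusters_per_class=1,
                           weights=[0.7, 0.2, 0.1], random_state=0)

# Step 1: SMOTE first -- bring every minority class up to the majority size.
classes, counts = np.unique(y, return_counts=True)
parts_X, parts_y = [X], [y]
for c in classes:
    deficit = counts.max() - (y == c).sum()
    if deficit > 0:
        parts_X.append(smote_like(X[y == c], deficit))
        parts_y.append(np.full(deficit, c))
X_bal, y_bal = np.vstack(parts_X), np.concatenate(parts_y)

# Step 2: then select features -- union of two embedded selectors
# (Lasso coefficients and tree-based importances).
lasso_sel = SelectFromModel(Lasso(alpha=0.01)).fit(X_bal, y_bal)
tree_sel = SelectFromModel(GradientBoostingClassifier(random_state=0)).fit(X_bal, y_bal)
mask = lasso_sel.get_support() | tree_sel.get_support()

# Step 3: train the final SVM on the selected feature subset.
clf = SVC().fit(X_bal[:, mask], y_bal)
```

Reversing steps 1 and 2 gives the select-first-then-SMOTE variant that the abstract reports as best by average AUC.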