dc.description.abstract | Real-world data often suffer from quality problems such as noise, irrelevant attributes, and extreme volume. Without data pre-processing, models trained on such data are unlikely to be effective. Feature selection is a common pre-processing method: it removes redundant and irrelevant features, leaving only representative ones. Ensemble feature selection applies multiple different feature selection algorithms and combines their selected feature subsets through various aggregation methods; it can improve the robustness of single feature selection and may even improve classification accuracy. Most existing research, however, adopts single feature selection, and few studies examine ensemble feature selection. The aim of this thesis is therefore to compare the performance of single and ensemble feature selection on high-dimensional data and to find better combinations of feature selection methods.
The experiments use three different types of feature selection algorithms: GA (Genetic Algorithm), DT (Decision Tree), and PCA (Principal Component Analysis). For ensemble feature selection, the concepts of sequential and parallel ensembles from ensemble learning are applied to form sequential ensemble feature selection and parallel ensemble feature selection, respectively. Finally, classification accuracy, F1-score, and execution time are examined to evaluate the feature selection methods.
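The two ensemble strategies can be sketched as follows. This is an illustrative outline only, not the thesis's actual implementation: simple variance and correlation filters stand in for GA and C4.5, and the data, function names, and subset sizes are all hypothetical.

```python
import numpy as np

# Hypothetical toy data standing in for one of the benchmark datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def variance_filter(X, k=10):
    """Indices of the k highest-variance features (stand-in selector)."""
    return np.argsort(np.var(X, axis=0))[-k:]

def correlation_filter(X, y, k=10):
    """Indices of the k features most correlated with the label (stand-in selector)."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(corr)[-k:]

# Parallel ensemble: run the selectors independently and aggregate their
# feature subsets by set union (analogous to C4.5 ∪ GA in the text).
parallel_idx = np.union1d(variance_filter(X), correlation_filter(X, y))

# Sequential ensemble: the output of one stage feeds the next
# (analogous to GA + PCA: select features first, then project with PCA).
X_stage1 = X[:, correlation_filter(X, y)]
X_centered = X_stage1 - X_stage1.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_stage2 = X_centered @ Vt[:5].T  # keep the first 5 principal components
```

The parallel route keeps every feature any selector voted for, while the sequential route progressively shrinks the representation, which is why the two can differ in both accuracy and execution time.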
Based on 20 public datasets with dimensionality ranging from 44 to 19,993, the experimental results show that sequential and parallel ensemble feature selection outperform single feature selection and are the best-performing methods on most datasets. The best combination in sequential ensemble feature selection is GA+PCA, and the best combination in parallel ensemble feature selection is C4.5∪GA. | en_US |