dc.description.abstract | Advances in information technology, the spread of wearable and mobile devices, and the growth of Internet communications have made data easier to collect. Professional fields of all kinds analyze the collected data for research, with the aim of applying it widely in business development and in promoting human well-being. The most prominent and fastest-growing application areas are financial technology and smart healthcare.
With the arrival of the big data era, data science has become a prominent topic. This study therefore examines in depth how information technology can be applied in the medical field. Data mining is used to uncover latent knowledge and new findings, with the aim of identifying the method best suited to the prediction task, and machine learning experiments are run to improve the prediction results. The experiments use breast cancer data drawn from public medical data sets: one large data set and one small one. Different methods are applied for feature selection and for handling class imbalance, and support vector machines and Random Forest are used to build the models. The prediction models are evaluated with 5-fold cross-validation, among other experiments, and the better-performing model is then selected.
The experimental results show that, for the large KDD CUP data set, preprocessing the data before building a Random Forest model yields the better AUC of 0.951. The most effective preprocessing order is to perform feature selection first, choosing the most suitable key features, and then to handle class imbalance. For the small UCI data set, a Random Forest model built without any preprocessing still achieves the best AUC of 0.994. It can therefore be inferred that the small data set performs better because it has clearer feature attributes and more evenly distributed samples.
This research suggests that, in future work, when a large data set has high dimensionality and an uneven class distribution, preprocessing the data first leads to a better-performing model; when the dimensionality is low and the class distribution is relatively even, a model can be built more quickly and still perform well. | en_US |
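The evaluation procedure named in the abstract (5-fold cross-validation scored by AUC on class-imbalanced data) can be illustrated with a minimal sketch. This is not the study's code: it uses only the Python standard library, synthetic one-feature data, and a trivial midpoint scorer as a stand-in for the SVM and Random Forest models. The names `stratified_kfold`, `auc`, `cross_validated_auc`, `fit_midpoint`, and `score_midpoint` are hypothetical, and stratifying the folds is an assumption, since the abstract does not say how the folds were drawn.

```python
import random


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train, test) index lists that keep the class ratio in every fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)  # deal each class round-robin into the folds
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(X, y, fit, score, k=5):
    """Mean AUC over k stratified folds; `fit` and `score` are supplied by the caller."""
    fold_aucs = []
    for train, test in stratified_kfold(y, k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        fold_scores = [score(model, X[i]) for i in test]
        fold_aucs.append(auc(fold_scores, [y[i] for i in test]))
    return sum(fold_aucs) / len(fold_aucs)


# Toy imbalanced data (assumption, not the study's data): 10 positives, 40 negatives.
rng = random.Random(42)
y = [1] * 10 + [0] * 40
X = [rng.gauss(3.0 if label == 1 else 0.0, 1.0) for label in y]


def fit_midpoint(xs, ys):
    """Stand-in model: the midpoint between the two class means."""
    pos = [x for x, label in zip(xs, ys) if label == 1]
    neg = [x for x, label in zip(xs, ys) if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def score_midpoint(threshold, x):
    """Higher score means more likely positive."""
    return x - threshold


mean_auc = cross_validated_auc(X, y, fit_midpoint, score_midpoint)
print(f"mean 5-fold AUC: {mean_auc:.3f}")
```

Because `cross_validated_auc` takes the model as a pair of callables, a real classifier and any preprocessing step could be swapped in without changing the fold or scoring logic. Fitting the preprocessing inside `fit`, on the training folds only, keeps the held-out fold unseen.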