Abstract (English) |
With advances in information technology, the spread of wearable and mobile devices, and the growth of Internet communications, collecting data has become easier. Professional fields of all kinds analyze the collected data further, aiming at wide use in business development and the promotion of human well-being. Two of the most prominent and fastest-growing application areas are financial technology and smart healthcare.
With the arrival of the big data era, data science has become a major topic. This study therefore focuses on the medical field and examines it in depth with the help of information technology. Data mining techniques are used to uncover latent knowledge and new findings, with the aim of identifying the method best suited to the prediction target, and machine learning experiments are run to improve the prediction results. The experiments use breast cancer data sets drawn from public medical data: one large data set and one small data set. Different methods are applied to handle feature selection and class imbalance, and support vector machines (SVM) and Random Forest are used to construct the models. The prediction models are evaluated with 5-fold cross-validation, among other experiments. Finally, the better-performing prediction model is selected.
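The evaluation protocol named above (5-fold cross-validation scored by AUC) can be illustrated with a minimal sketch. This is not the thesis's code: the function names `auc_score` and `stratified_kfold` are hypothetical, stratified splitting is an assumption (the abstract only says "5-fold"), and only the Python standard library is used.

```python
import random
from collections import defaultdict


def auc_score(labels, scores):
    """AUC as the probability that a random positive sample is scored
    higher than a random negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each class is dealt out
    round-robin so every fold keeps roughly the same class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test
```

In use, a model would be fitted on each `train` index set, scored on the matching `test` set with `auc_score`, and the five AUC values averaged. The pairwise AUC loop is quadratic in the number of samples, so a rank-based formula would be preferable for a large data set.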
The experimental results show that, for the large KDD Cup data set, preprocessing the data first and then building the model with Random Forest yields a better AUC of 0.951. The best preprocessing approach is to perform feature selection first, keeping the more suitable key features, and then to handle class imbalance. For the small UCI data set, the results show that Random Forest achieves the best AUC of 0.994 even when the data is not preprocessed. It can therefore be inferred that the small data set has clearer feature attributes and more evenly distributed samples, both of which contribute to better performance.
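The preprocessing order reported for the large data set (feature selection first, class imbalance handling second) can be sketched as follows. The abstract does not say which selection or rebalancing methods were used, so both steps here are stand-ins chosen for brevity: a simple mean-difference filter score and random oversampling of the minority class. All function names are hypothetical, and the sketch assumes binary labels 0/1 with both classes present.

```python
import random
from statistics import mean, pstdev


def select_top_k(X, y, k):
    """Rank features by |mean(positives) - mean(negatives)| / overall std
    and return the column indices of the k highest-scoring features."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        col = [row[j] for row in X]
        pos = [v for v, label in zip(col, y) if label == 1]
        neg = [v for v, label in zip(col, y) if label == 0]
        spread = pstdev(col) or 1.0  # constant feature: avoid dividing by zero
        scores.append(abs(mean(pos) - mean(neg)) / spread)
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]


def preprocess(X_train, y_train, k):
    """Step 1: feature selection. Step 2: rebalancing on the reduced features."""
    cols = select_top_k(X_train, y_train, k)
    X_reduced = [[row[j] for j in cols] for row in X_train]
    X_balanced, y_balanced = random_oversample(X_reduced, y_train)
    return cols, X_balanced, y_balanced
```

Applying `preprocess` only to the training portion of each cross-validation fold, and reusing the returned `cols` to reduce the held-out portion, keeps duplicated minority rows out of the test data.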
This research suggests that, in future work, when a large data set has high dimensionality and an uneven class distribution, preprocessing the data first can lead to a better-performing model; when dimensionality is low and the class distribution is relatively even, a model can be built more quickly and still achieve good performance. |