Abstract (English) |
As science and technology have advanced, people's diets and lifestyles have changed, and so have the diseases they suffer from. In Taiwan, 18,536 people died of cancer in 1990; by 2020 the figure had risen to 50,161, about 2.7 times the 1990 level. Over the same period, deaths from breast cancer rose from 619 to 2,655, a 4.29-fold increase that far exceeds the growth in overall cancer deaths. This situation can be improved, however. When breast cancer is treated early (stage 0 or 1), the survival rate can exceed 95%, which shows the importance of early detection and early treatment. If accurate breast cancer analysis data can be provided for reference, medical staff can diagnose the disease and give appropriate treatment, improving patients' survival rate.
This study proposes a method for breast cancer data analysis and prediction that combines multiple data preprocessing steps with classification algorithms. The data are preprocessed with normalization, discretization, and the Synthetic Minority Over-sampling Technique (SMOTE); support vector machine, K-nearest neighbor, decision tree, and random forest algorithms are then used to build prediction models evaluated with five-fold cross-validation. These models are compared with models built using the corresponding single preprocessing step, to observe how the interaction of multiple preprocessing steps affects the prediction model.
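The combination of normalization and SMOTE described above can be illustrated with a minimal sketch. It is not the study's implementation: the min-max form of normalization, the toy feature values, and the helper names `min_max_normalize` and `smote` are assumptions made for illustration, and only the Python standard library is used.

```python
import math
import random


def min_max_normalize(rows):
    """Rescale every feature column to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples, each interpolated between
    a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy data: five majority samples (label 0), three minority (label 1).
X = [[1.0, 200.0], [2.0, 180.0], [1.5, 220.0], [3.0, 210.0], [2.5, 190.0],
     [9.0, 900.0], [8.0, 950.0], [8.5, 1000.0]]
y = [0, 0, 0, 0, 0, 1, 1, 1]

X_norm = min_max_normalize(X)  # normalize first so distances are comparable
minority = [x for x, label in zip(X_norm, y) if label == 1]
new_points = smote(minority, n_new=y.count(0) - len(minority), k=2)

X_balanced = X_norm + new_points
y_balanced = y + [1] * len(new_points)
print(y_balanced.count(0), y_balanced.count(1))
```

Normalizing before oversampling keeps features with large numeric ranges from dominating the nearest-neighbour search inside SMOTE. In a cross-validated experiment, oversampling is normally applied to the training folds only, so that synthetic samples never appear in the data used for scoring.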
The experiments used two data sets: a large X-ray image data set from KDD and a small fine needle aspiration (FNA) image data set from UCI. Different preprocessing steps were applied together, and models were then built with each algorithm. The experiments found that, for every prediction model, normalization combined with SMOTE improved the AUC more than either preprocessing step alone, with the support vector machine showing the largest AUC improvement. These results indicate that when a support vector machine is used for prediction on the X-ray image data set, which is severely class-imbalanced, normalization combined with SMOTE yields a model with better predictive performance. On the FNA image data set, which is only slightly class-imbalanced, normalization combined with SMOTE also brings an improvement, but the effect is small. |
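The models are compared by AUC, and a small sketch can show how that metric behaves on class-imbalanced labels. The labels, scores, and the helper name `roc_auc` below are hypothetical and are not taken from the study; the function uses the rank interpretation of AUC, the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy(labels, scores, threshold=0.5):
    """Share of samples classified correctly at the given threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


# Hypothetical imbalanced labels: eight negatives, two positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that always predicts "negative" looks accurate but has no ranking skill.
constant_scores = [0.0] * len(labels)
print(accuracy(labels, constant_scores), roc_auc(labels, constant_scores))

# A model that ranks most positives above most negatives.
model_scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.25, 0.35, 0.9, 0.38]
print(accuracy(labels, model_scores), roc_auc(labels, model_scores))
```

The constant model reaches 80% accuracy on these labels while its AUC stays at 0.5, which is one common reason AUC is preferred over accuracy when classes are imbalanced.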