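Interaction Effects of Data Normalization, Discretization, and Data Balancing (A Case Study of Binary Class-Imbalanced Datasets for Breast Cancer Prediction)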

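Name

蔡瑞文 (Rui-Wen Cai)

Department

In-service Master Program, Department of Information Management

Thesis Title

Interaction Effects of Data Normalization, Discretization, and Data Balancing (A Case Study of Binary Class-Imbalanced Datasets for Breast Cancer Prediction)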

File

Browse the thesis in the system (open after 2025-7-1)

Abstract (Chinese)

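As technology advances, people's diets and lifestyles change, and the diseases they suffer from change with them. In Taiwan, 18,536 people died of cancer in 1990; by 2020 the figure had risen to 50,161, an overall increase of 2.7 times. Deaths from breast cancer rose from 619 to 2,655, a 4.29-fold increase that is well above the increase for cancer overall. This situation can be improved: the survival rate for breast cancer treated at an early stage (stage 0 or 1) exceeds 95%, which shows the importance of early detection and early treatment. If accurate analysis of breast cancer data can be provided to medical staff for reference, they can identify the disease early and give appropriate treatment, improving the survival rate of breast cancer patients.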

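This study proposes a method for analyzing and predicting breast cancer data that applies multiple preprocessing steps before model building. The data are preprocessed with normalization, discretization, and the Synthetic Minority Over-sampling Technique (SMOTE); prediction models are then built with support vector machine, k-nearest neighbor, decision tree, and random forest algorithms under five-fold cross-validation. These models are compared with models built using the corresponding single preprocessing step, to observe how the interaction among multiple preprocessing steps affects the prediction models.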
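As a rough illustration of two of the preprocessing steps named above, the following sketch shows min-max normalization and equal-width discretization of a single feature column. It is not the thesis's implementation: the abstract does not say which normalization or discretization variant was used, so both choices here are assumptions, and the function names and sample values are hypothetical.

```python
from typing import List


def min_max_normalize(column: List[float]) -> List[float]:
    """Rescale one feature column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def equal_width_discretize(column: List[float], bins: int = 3) -> List[int]:
    """Map each value to a bin index 0..bins-1 using equal-width intervals."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0 for _ in column]
    width = (hi - lo) / bins
    # min(..., bins - 1) keeps the column maximum inside the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in column]


# Hypothetical feature values, for illustration only.
radius = [10.0, 12.5, 15.0, 20.0]
normalized = min_max_normalize(radius)
binned = equal_width_discretize(radius, bins=2)
```

Normalization keeps the relative spacing of the values, while discretization keeps only which interval each value falls into, which is one reason the two steps can interact differently with a distance-based step such as SMOTE.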

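The experiments use a large X-ray image dataset from KDD and a small fine needle aspiration (FNA) image dataset from UCI, applying different preprocessing steps together and building a model with each algorithm. In every prediction model, normalization combined with SMOTE improved AUC more than each single preprocessing step on its own, and the support vector machine showed the largest AUC gain. The experiments indicate that when a support vector machine is used to predict on the X-ray image dataset, which is severely class-imbalanced, applying normalization and SMOTE first yields a model with better predictive value. For the FNA image dataset, which is only mildly class-imbalanced, normalization with SMOTE gave some improvement, but the difference was not pronounced.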

Abstract (English)

With the advancement of science and technology, people's diets and lifestyles have changed, and consequently the diseases they suffer from have also changed. In Taiwan, 18,536 people died of cancer in 1990. By 2020, that number had increased to 50,161, an overall increase of 2.7 times. Among them, the number of deaths due to breast cancer increased from 619 to 2,655, a 4.29-fold increase, which is much higher than the increase in overall cancer deaths. However, this situation can be improved. The survival rate of breast cancer treated early (stage 0 or 1) can exceed 95%, showing the importance of early detection and early treatment. If accurate analysis data on breast cancer can be provided for medical staff's reference, medical staff can identify the disease at an early stage and give appropriate treatment to improve the survival rate of breast cancer patients.

This study proposes a method for breast cancer data analysis and prediction that combines multiple data preprocessing steps with machine learning algorithms. Normalization, discretization, and the Synthetic Minority Over-sampling Technique (SMOTE) are applied as preprocessing, and then support vector machine, K-nearest neighbor, decision tree, and random forest algorithms are used to construct prediction models with five-fold cross-validation. The resulting models are compared with models constructed using the corresponding single preprocessing step to observe how the interaction of multiple preprocessing steps affects the prediction model.
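To make the SMOTE step concrete, here is a minimal sketch of its core idea: a synthetic minority example is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbors. This is a simplified illustration, not the implementation used in the thesis. It picks base samples at random rather than looping over every minority sample, and the function name, parameters, and toy data are all hypothetical.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def smote(minority: List[Point], n_new: int, k: int = 5, seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic minority samples by neighbor interpolation."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other samples
    synthetic: List[List[float]] = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class on the corners of the unit square (illustrative only).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

Because the neighbor search is distance-based, features on very different scales can dominate it, which is one plausible reason normalizing before SMOTE could matter. One design question in any cross-validated experiment is whether oversampling is applied only to the training folds; the abstract does not state which was done.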

In this study, KDD's large X-ray image dataset and UCI's small fine needle aspiration (FNA) image dataset were used for experiments, applying different data preprocessing steps together and building models with each algorithm. The experiments found that, in each prediction model, normalization combined with SMOTE improved AUC more than the individual preprocessing steps, and that the support vector machine showed the largest AUC improvement. The experiments also show that when the support vector machine predicts on the X-ray image dataset, which has severe class imbalance, applying normalization and SMOTE preprocessing first yields a model with better predictive value. For the fine needle aspiration (FNA) image dataset, which has only slight class imbalance, normalization with SMOTE brings some improvement, but the effect is small.
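The results above are reported in terms of AUC. As a small sketch of what that metric measures, the code below computes AUC as the fraction of positive-negative pairs in which the positive example receives the higher score, counting ties as half. The function name, labels, and scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
from typing import List


def auc_score(labels: List[int], scores: List[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = malignant) and classifier scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
auc = auc_score(labels, scores)  # 5 of 6 pairs are ordered correctly
```

Unlike plain accuracy, this pairwise view does not reward a model for always predicting the majority class, which makes it a common choice for class-imbalanced data.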

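Keywords (Chinese)

★ Normalization
★ Discretization
★ Synthetic Minority Over-sampling Technique
★ Interaction effects of data preprocessing
★ Machine learning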

Keywords (English)

★ Normalization
★ Discretization
★ Synthetic Minority Over-sampling Technique
★ Data Pre-processing Interaction Effects
★ Machine Learning

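Table of Contents

Abstract (Chinese)
Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 Breast Cancer Characteristics and Factors
2.2 Machine Learning Techniques
2.2.1 Supervised Learning
2.2.2 Support Vector Machine (SVM)
2.2.3 K-Nearest Neighbor Algorithm (K-NN)
2.2.4 Decision Tree (DT)
2.2.5 Random Forest (RF)
2.3 Preprocessing
2.3.1 Normalization
2.3.2 Discretization
2.3.3 Synthetic Minority Over-sampling Technique (SMOTE)
2.4 Review and Discussion of Related Literature
Chapter 3 Research Method
3.1 Research Framework
3.2 Data Mining Software
3.3 Experimental Datasets
3.4 Preprocessing Procedure and Dataset Partitioning
3.5 Prediction Model Evaluation
Chapter 4 Experimental Results
4.1 KDD Breast Cancer (2008)
4.2 Breast Cancer Wisconsin (Diagnostic)
4.3 Model Performance Evaluation
Chapter 5 Conclusions and Recommendations
5.1 Conclusions
5.2 Future Research Directions and Recommendations
5.3 Research Limitations
References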

Advisor

蔡志豐 (Chih-Fong Tsai)

Approval Date

2022-4-12
