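Interaction Effects of Data Normalization, Discretization, and Data Balancing (A Case Study of Class-Imbalanced Binary Classification Datasets for Breast Cancer Prediction)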


Please use this permanent URL to cite or link to this document: https://ir.lib.ncu.edu.tw/handle/987654321/88342

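Title:	Interaction Effects of Data Normalization, Discretization, and Data Balancing (A Case Study of Class-Imbalanced Binary Classification Datasets for Breast Cancer Prediction)
Author:	Cai, Rui-Wen (蔡瑞文)
Contributor:	In-service Master's Program, Department of Information Management
Keywords:	Normalization; Discretization; Synthetic Minority Over-sampling Technique; Data Pre-processing Interaction Effects; Machine Learning
Date:	2022-04-12
Uploaded:	2022-07-13 23:14:50 (UTC+8)
Publisher:	National Central University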
Abstract:	As technology advances, people's diets and lifestyles change, and so do the diseases they develop. In Taiwan, deaths from cancer rose from 18,536 in 1990 to 50,161 in 2020, a 2.7-fold increase overall. Over the same period, deaths from breast cancer rose from 619 to 2,655, a 4.29-fold increase, well above the multiple for cancer overall. This situation can be improved: the survival rate for breast cancer treated early (stages 0 and 1) exceeds 95%, which shows the importance of early detection and early treatment. If accurate breast cancer analysis data can be provided to medical staff for reference, they can identify the disease at an early stage and give appropriate treatment, improving patient survival.
This study proposes a method for breast cancer data analysis and prediction that combines multiple data preprocessing steps with machine learning algorithms. The data are preprocessed with normalization, discretization, and the Synthetic Minority Over-sampling Technique (SMOTE); prediction models are then built with support vector machine, k-nearest neighbor, decision tree, and random forest algorithms under five-fold cross-validation. These models are compared with models built using the corresponding single preprocessing step, to observe how the interaction of multiple preprocessing steps affects the prediction model. The experiments use a large X-ray image dataset from KDD and a small fine needle aspiration (FNA) image dataset from UCI. Across the prediction models, normalization combined with SMOTE improved AUC more than the individual preprocessing steps applied alone, and the support vector machine showed the largest AUC gain. The experiments indicate that when a support vector machine is used on the X-ray image dataset, which is severely class-imbalanced, applying normalization and SMOTE first yields a model with better predictive value. For the FNA image dataset, which is only mildly class-imbalanced, normalization with SMOTE brings some improvement, but the difference is not pronounced.
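As a rough illustration of the combined preprocessing named in the abstract (normalization followed by SMOTE), the following is a minimal sketch using only the Python standard library. It is not the thesis's implementation: the function names (`min_max_normalize`, `smote`, `normalize_then_smote`), the choice of min-max scaling, the value `k=3`, and the toy data are all assumptions made for illustration. The classifiers, five-fold cross-validation, and AUC evaluation are left out.

```python
import math
import random


def min_max_normalize(rows):
    """Rescale every feature column to [0, 1]; constant columns map to 0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours are searched among minority samples only.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def normalize_then_smote(X, y, minority_label=1, k=3, seed=0):
    """Normalize first, then oversample the minority class until the two
    classes have the same number of samples."""
    X_norm = min_max_normalize(X)
    minority = [x for x, label in zip(X_norm, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k,
                       random.Random(seed))
    return X_norm + new_points, list(y) + [minority_label] * len(new_points)


# Hypothetical toy data: six majority samples (label 0), two minority (label 1).
X = [[1.0, 200.0], [2.0, 180.0], [1.5, 220.0], [3.0, 210.0],
     [2.5, 190.0], [1.2, 205.0], [8.0, 900.0], [9.0, 950.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = normalize_then_smote(X, y)
print(len(X_bal), y_bal.count(0), y_bal.count(1))
```

Normalizing before SMOTE matters because SMOTE picks neighbours by distance, so an unscaled feature with a large range would dominate the neighbour search. In a cross-validation setting, oversampling is commonly applied to the training folds only so that synthetic points do not leak into the evaluation fold; the abstract does not say how the thesis handled this.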
Appears in Collections:	[In-service Master's Program, Department of Information Management] Theses and Dissertations

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	341

All items in NCUIR are protected by original copyright.
