A Study on Combining Feature Selection and Sample Synthesis Methods for Breast Cancer Prediction

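Name

黃綵彤 (Tsai-Tung Huang)

Department

In-service Master Program, Department of Information Management

Thesis title

A Study on Combining Feature Selection and Sample Synthesis Methods for Breast Cancer Prediction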

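Related theses

★ Building a Sales Forecasting Model for Commercial Multifunction Printers Using Data Mining Techniques
★ Applying Data Mining Techniques to Resource Allocation Forecasting: A Case Study of a Support Unit at a Computer Contract Manufacturer
★ Applying Data Mining Techniques to Flight Delay Analysis in the Airline Industry: A Case Study of Company C
★ Security Control of New Products in a Global Supply Chain: A Case Study of Company C
★ Applying Data Mining in the Semiconductor Laser Industry: A Case Study of Company A
★ Applying Data Mining Techniques to Predict Warehouse Storage Time of Air Export Cargo: A Case Study of Company A
★ Optimizing YouBike Rebalancing Operations Using Data Mining Classification Techniques
★ The Effect of Feature Selection on Different Data Types
★ Applying Data Mining to the Study of B2B Corporate Websites: A Case Study of Company T
★ Customer Investment Analysis and Recommendations for Financial Derivatives: Integrating Clustering and Association Rule Techniques
★ Building an Assisted Classification Model for Liver Ultrasound Images Using Convolutional Neural Networks
★ An Identity Recognition System Based on Convolutional Neural Networks
★ A Comparative Analysis of Error Rates of Electric Energy Data Imputation Methods in Energy Management Systems
★ Development of an Employee Sentiment Analysis and Management System
★ Data Cleaning for the Class Imbalance Problem: A Machine Learning Perspective
★ Applying Data Mining Techniques to the Analysis of Passenger Self-Service Check-in: A Case Study of Airline C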

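This electronic thesis is authorized for immediate open access.
The open-access full text is licensed only for users' personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.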

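Abstract (translated from Chinese)

With the advance of information technology, the spread of wearable mobile devices and equipment, and the growth of Internet communications, collecting data has become increasingly easy. Every professional field analyzes and studies the data it collects and applies the results widely to business development and to the improvement of human well-being. The fields where these applications are most prominent and fastest growing are financial technology and smart healthcare.
Now that the big data era has fully arrived, data science has become a popular topic. This study focuses on the medical domain. It uses data mining techniques to uncover latent knowledge and new findings and to identify the method best suited to the objective, and it uses machine learning techniques in prediction experiments to find the configuration with the best predictive performance. The experiments use breast cancer data sets drawn from public medical data sets, one large and one small. Different methods are applied for feature selection and for handling class imbalance, and models are built with support vector machines and random forests. Algorithm performance is evaluated with 5-fold cross-validation, and the better prediction model is then selected.
The results show that, for the large KDD CUP data set, preprocessing the data and then training a random forest gives the better AUC, reaching 0.951. The best preprocessing order is to apply feature selection first, picking the more suitable key features, and then to handle class imbalance. For the small UCI data set, building a model directly with Random Forest, without any preprocessing, already gives an AUC of 0.994. This suggests that the small data set performs better because its feature attributes are well defined and its samples are more evenly distributed. The study indicates that for large, high-dimensional data sets with uneven class distributions, preprocessing the data first can be expected to yield a better-performing model, whereas for low-dimensional data sets with more balanced classes, a model can be built more quickly and still achieve good results.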
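
As a rough illustration of the first preprocessing step named in the abstract, the sketch below ranks features by the absolute Pearson correlation between each column and the class label and keeps the strongest ones. This is only one simple filter method chosen for illustration. The abstract does not say which feature selection techniques the thesis uses, and the names `pearson` and `select_top_features` are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_top_features(rows: Sequence[Sequence[float]],
                        labels: Sequence[int],
                        top: int) -> List[int]:
    """Return the column indices of the `top` features most correlated with the label."""
    n_features = len(rows[0])
    scored = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scored.append((abs(pearson(column, labels)), j))
    # Strongest correlation first; ties are broken by the lower column index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return sorted(j for _, j in scored[:top])


# Toy data (assumed for illustration): column 0 tracks the label,
# column 1 is constant, column 2 is unrelated to the label.
rows = [[0.1, 5.0, 1.0], [0.2, 5.0, 0.0], [0.9, 5.0, 1.0], [1.0, 5.0, 0.0]]
labels = [0, 0, 1, 1]
print(select_top_features(rows, labels, top=1))
```

In the order the abstract reports as best, a reduced feature set like this would be produced before any minority-class samples are synthesized, so that synthetic samples are interpolated only over the retained features.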

Abstract (English)

With the development of information technology, the spread of wearable mobile devices, and advances in Internet communications, collecting data has become easier. Every professional field analyzes and studies the data it collects, with the aim of applying the results widely to business development and the promotion of human well-being. The most prominent and fastest-growing areas of application are financial technology and smart healthcare.
With the arrival of the big data era, data science has become a hot topic. This study therefore examines the medical field in depth. Data mining techniques are used to uncover potential knowledge and new findings and to identify the method best suited to the target. Machine learning techniques are also used in experiments to find the configuration with the best predictive performance. The experiments and analysis use breast cancer data sets drawn from public medical data sets, one large and one small. Different methods are applied for feature selection and for handling class imbalance, and models are built with support vector machines and Random Forest. Algorithm performance is evaluated with 5-fold cross-validation, and the better prediction model is then selected.
The experimental results show that, for the large KDD CUP data set, preprocessing the data and then training a Random Forest model gives the better AUC, 0.951. The best preprocessing order is to apply feature selection first, choosing the more suitable key features, and then to handle class imbalance. For the small UCI data set, a Random Forest model built without any preprocessing already achieves an AUC of 0.994. It can therefore be inferred that the small data set performs better because its features are clearly defined and its samples are more evenly distributed.
These results suggest that, when a large data set is high-dimensional and its classes are unevenly distributed, preprocessing the data first can lead to a better-performing model, whereas when the dimensionality is low and the classes are relatively balanced, a model can be built more quickly and still perform well.
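
The evaluation protocol named in the abstract, 5-fold cross-validation scored by AUC, can be sketched with the standard library alone. The code below is a minimal illustration under stated assumptions, not the thesis's implementation. The abstract does not say whether the folds were stratified, so stratification is an assumption here, and the function names `auc_score` and `stratified_folds` are hypothetical.

```python
import random
from typing import List, Sequence


def auc_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def stratified_folds(labels: Sequence[int], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds that keep the class ratio roughly equal."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)  # deal indices round-robin
    return folds


# Three of the four positive/negative pairs are ranked correctly.
print(auc_score([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75

# Each fold serves once as the test set; the other four form the training set.
labels = [0] * 10 + [1] * 5
for test_idx in stratified_folds(labels, k=5):
    train_idx = [i for i in range(len(labels)) if i not in test_idx]
    print(len(train_idx), len(test_idx))
```

When oversampling is part of the pipeline, a common precaution is to synthesize minority samples from the training folds only, so that the test fold contains real samples alone. Whether the thesis followed this practice is not stated in the abstract.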

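Keywords (translated from Chinese)

★ Feature selection
★ Class imbalance
★ Synthetic Minority Over-sampling Technique
★ Support vector machine
★ Random forest

Keywords (English)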

★ feature selection
★ class imbalance
★ SMOTE
★ SVM
★ Random Forest
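
For readers unfamiliar with the SMOTE keyword, the sketch below shows its core idea: each synthetic minority sample is placed at a random point on the line segment between a real minority sample and one of its nearest minority neighbours. It is a simplified illustration using only the standard library. The function name `smote` and its parameters are hypothetical, and unlike the original algorithm, which generates a fixed number of synthetic samples per minority sample, this version picks the base sample at random.

```python
import math
import random
from typing import List, Sequence

Point = List[float]


def smote(minority: Sequence[Sequence[float]],
          n_new: int,
          k: int = 5,
          seed: int = 0) -> List[Point]:
    """Create n_new synthetic minority samples by interpolating toward near neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic: List[Point] = []
    for _ in range(n_new):
        base = rng.randrange(len(minority))
        x = minority[base]
        # Indices of the other minority samples, nearest to x first.
        others = [j for j in range(len(minority)) if j != base]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment from x to the neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy minority class (assumed for illustration).
minority = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
for point in smote(minority, n_new=3, k=2):
    print(point)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies inside the region the minority class already occupies.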

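Table of contents

Acknowledgments
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Structure
Chapter 2 Literature Review
2.1 Breast Cancer Mortality Risk Factors
2.2 Machine Learning Techniques
2.2.1 Supervised Learning
2.2.2 Support Vector Machines
2.2.3 Random Forests
2.3 Preprocessing
2.3.1 Feature Selection
2.3.2 Class Imbalance
2.4 Review and Comparison of Related Literature
2.5 Summary
Chapter 3 Research Method
3.1 Experimental Data Sets
3.2 Data Preprocessing
3.3 Parameter Settings for the Techniques Used
3.4 Prediction Model Evaluation
Chapter 4 Experimental Results
4.1 Description of the Large KDD CUP Breast Cancer Data Set
4.2 Description of the Small UCI Breast Cancer Data Set
4.3 Overall Model Performance Analysis and Evaluation
Chapter 5 Conclusions and Recommendations
5.1 Conclusions
5.2 Research Limitations
5.3 Future Research Directions
References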

Advisor

蔡志豐 (Chih-Fong Tsai)

Approval date

2021-7-12
