Abstract: | With advances in information technology, the spread of wearable and mobile devices, and ubiquitous Internet connectivity, collecting data has become increasingly easy. Every professional field now analyzes the data it collects and applies the results to business development and to improving human well-being; the two areas where such applications are most prominent and fastest growing are financial technology and smart healthcare. In the era of big data, data science has become a major topic. This study focuses on the medical domain: it uses data mining techniques to uncover latent knowledge and identify the methods best suited to the target task, and it uses machine learning experiments to find the approach with the best predictive performance. The experiments use two public breast cancer data sets that differ in size, one large and one small. Different methods are applied for feature selection and for handling class imbalance, models are built with support vector machines and Random Forest, and the resulting prediction models are evaluated with 5-fold cross-validation in order to select the better model. On the large KDD CUP data set, preprocessing the data and then training a Random Forest model gives the best AUC, 0.951; the best preprocessing order is to perform feature selection first, choosing the most relevant features, and then handle class imbalance. On the small UCI data set, a Random Forest model built directly without any preprocessing already achieves an AUC of 0.994, which suggests that the small data set performs well because its features are well defined and its samples are more evenly distributed across classes. These results indicate that, for large data sets with high dimensionality and an uneven class distribution, preprocessing the data first can be expected to yield a better-performing model, whereas for low-dimensional data sets with a relatively balanced class distribution, a model can be built quickly without preprocessing and still perform well. |
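The evaluation protocol named in the abstract (5-fold cross-validation scored by AUC on class-imbalanced data) can be illustrated with a minimal sketch in standard-library Python. The function names `auc_score` and `stratified_folds`, the use of stratified splitting, and the toy labels are all assumptions made for illustration; the abstract does not say how the folds were drawn or which software was used.

```python
import random
from collections import defaultdict


def auc_score(labels, scores):
    """Return AUC: the chance a random positive scores above a random negative.

    Ties count as half a win. Labels are assumed to be 0 (negative) or 1 (positive).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class ratios in each fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets its share of each class.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy imbalanced labels (assumed for illustration): 5 positives, 95 negatives.
toy_labels = [1] * 5 + [0] * 95
folds = stratified_folds(toy_labels, k=5)
for fold_number, test_indices in enumerate(folds, start=1):
    positives = sum(toy_labels[i] for i in test_indices)
    print(f"fold {fold_number}: {len(test_indices)} samples, {positives} positive")
```

In each of the five rounds, one fold would serve as the test set and the remaining four as training data, with `auc_score` applied to the model's scores on the held-out fold. A common precaution, not stated in the abstract, is to fit feature selection and any class-imbalance resampling on the training folds only, so that the held-out fold does not influence preprocessing.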