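Integrating Clustering and Classification Machine Learning Methods to Build a Second Primary Cancer Prediction Model in Lung Cancer Survivors (整合聚類與分類機器學習方法建立原發性肺癌二次癌症預測模型)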


Name	Chang-He Cai (蔡昌赫)	Department	Information Management
Thesis title	Integrating Clustering and Classification Machine Learning Methods to Build a Second Primary Cancer Prediction Model in Lung Cancer Survivors (整合聚類與分類機器學習方法建立原發性肺癌二次癌症預測模型)
Files	Full text can be browsed in the system (open after 2025-7-15)
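Abstract (translated from Chinese)	Lung cancer accounts for the largest share of cancer deaths worldwide. In 2020, about 2.1 million people worldwide were diagnosed with lung cancer and about 1.8 million died of it, accounting for 11.4% of new cancer cases and 18% of cancer deaths worldwide, respectively; its mortality has ranked first in the world for many years. With improvements in cancer diagnostic tools and treatments, the survival time of lung cancer patients has increased significantly. Correspondingly, the number of lung cancer patients who develop a second primary cancer has shown a clear upward trend over the past decade, making the assessment of second primary cancer risk in lung cancer patients an important issue. This study used six machine learning algorithms (logistic regression, random forest, support vector machine, extreme gradient boosting, single-layer feed-forward neural network, and a stacking model) and the registry data of patients with the ten major cancers seen at Chang Gung Memorial Hospital from 2004 to 2018 to build a prediction model for second primary cancer after lung cancer. The model trained with extreme gradient boosting was chosen as the final model, with a mean AUROC of 0.755 and a standard deviation of 0.037. The study also applied unsupervised cluster analysis to examine heterogeneity among lung cancer patients: the cluster assignments were added as a new feature, combined with the original features, and used to train supervised machine learning models. The results showed no significant difference in performance between the models that incorporated the clustering results and the other models. A SHAP-based feature importance analysis showed that, among the important features, surgical resection of the primary site, unknown malignant pleural effusion status, and combined stage I were positively associated with the risk of second primary cancer, whereas unknown status of visceral pleura or elastic layer involvement and combined stage IV were negatively associated with it. Finally, the final prediction model was deployed with the shiny package in R to build a clinical decision support system for predicting second primary cancer after lung cancer, for physicians' reference.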
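The abstract reports model performance as a mean AUROC with a standard deviation. A minimal Python sketch of how such a summary can be computed is given below. It is not the thesis's R code: the function names `auroc` and `summarize_folds` are hypothetical, the pairwise (Mann-Whitney) definition of AUROC is a standard one, and the use of the sample standard deviation is an assumption, since the record does not say which estimator was used.

```python
from statistics import mean, stdev


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize_folds(fold_aurocs):
    """Return (mean, sample standard deviation) over per-fold AUROC values.

    Assumption: the sample (n - 1) estimator; the source does not specify.
    """
    return mean(fold_aurocs), stdev(fold_aurocs)
```

Each validation fold or repeat would contribute one `auroc` value, and `summarize_folds` would then give the pair of numbers reported for a model.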
Abstract (English)	Lung cancer is the leading cause of cancer death worldwide. In 2020, about 2.1 million people worldwide were diagnosed with lung cancer and about 1.8 million died of it, accounting for 11.4% of all new cancer cases and 18% of all cancer deaths, respectively. Its mortality has ranked first in the world for many years. With improvements in cancer diagnostic tools and treatment methods, the survival rate of lung cancer has increased significantly. Correspondingly, the number of lung cancer survivors who develop a second primary cancer has also increased significantly over the past decade, and the assessment of second primary cancer risk has become an important topic. This study used six machine learning algorithms (logistic regression, random forest, support vector machine, extreme gradient boosting (XGBoost), single-layer feed-forward neural network, and a stacking model) and cancer registry data from Chang Gung Memorial Hospital covering 2004 to 2018 to build a prediction model for second primary cancer in lung cancer survivors. The models trained with XGBoost reached an average AUROC of 0.755 with a standard deviation of 0.037. The study also used unsupervised cluster analysis to examine the heterogeneity of lung cancer patients, combining the two approaches by adding the clustering results as new features for supervised model training. The results showed that the predictive models that incorporated the clustering results did not differ significantly in performance from the other models.
After analyzing feature importance with SHAP, we found that, among the important features, surgical resection of the primary site, unknown malignant pleural effusion status, and combined stage I were positively correlated with the risk of second primary cancer, whereas unknown status of visceral pleura or elastic layer involvement and combined stage IV were negatively correlated with it. Finally, the final prediction model was deployed with the shiny package in R as a clinical decision support system for second primary cancer prediction in lung cancer, for reference by physicians.
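The idea of feeding unsupervised cluster assignments into a supervised model as an extra feature can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the thesis's implementation (which was written in R): the record does not name the clustering algorithm, so plain k-means is swapped in here, the initial centers are simply the first `k` rows to keep the example deterministic, and the names `kmeans_labels` and `add_cluster_feature` are hypothetical.

```python
def kmeans_labels(rows, k, iters=20):
    """Assign each numeric row to one of k clusters with plain k-means.

    Assumption: centers start at the first k rows (deterministic, for illustration).
    """
    centers = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centers[c])))
            for r in rows
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def add_cluster_feature(rows, k):
    """Return copies of the rows with the cluster id appended as a new last column."""
    labels = kmeans_labels(rows, k)
    return [list(r) + [lab] for r, lab in zip(rows, labels)]


# Toy data: two well-separated groups of patients described by two numeric features.
toy_rows = [[0.0, 0.0], [10.0, 10.0], [0.1, 0.2], [9.8, 10.1]]
augmented = add_cluster_feature(toy_rows, k=2)
```

The augmented rows would then be passed to any classifier in place of the original features. Because a cluster id is categorical rather than ordinal, one-hot encoding it before training is usually a reasonable choice.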
Keywords	★ Lung cancer ★ Second primary cancer ★ Cluster analysis ★ Machine learning ★ Predictive model
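Table of contents
	Abstract (Chinese)
	Abstract (English)
	Table of Contents
	List of Tables
	List of Figures
	Chapter 1 Introduction
		1.1 Research Background
			1.1.1 Overview of Lung Cancer Epidemiology
			1.1.2 Overview of Second Primary Cancer Epidemiology
			1.1.3 Machine Learning: Classification Algorithms
			1.1.4 Machine Learning: Cluster Analysis Algorithms
		1.2 Research Motivation
		1.3 Research Objectives
	Chapter 2 Literature Review
		2.1 Risk Factors Related to Lung Cancer and Second Primary Cancer
		2.2 Second Primary Cancer Prediction Models
	Chapter 3 Methods
		3.1 Data Source
		3.2 Data Preprocessing
		3.3 Feature Encoding and Transformation
			3.3.1 Sex
			3.3.2 Tumor Size
			3.3.3 Number of Regional Lymph Nodes Involved
			3.3.4 Combined Stage
			3.3.5 Surgical Margins of the Primary Site
			3.3.6 Grade/Differentiation
			3.3.7 Surgical Resection of the Primary Site
			3.3.8 Radiation Therapy
			3.3.9 Systemic Therapy
			3.3.10 Body Mass Index
			3.3.11 Smoking Behavior
			3.3.12 Betel Nut Chewing Behavior
			3.3.13 Alcohol Drinking Behavior
			3.3.14 Cancer Site-Specific Factors
		3.4 Defining the Control and Case Groups
		3.5 Missing Value Handling
		3.6 Encoding Algorithms
		3.7 Feature Selection
		3.8 Cluster Analysis
		3.9 Building the Prediction Models
		3.10 Evaluating Model Performance
		3.11 Model Training and Validation Workflow
		3.12 Model Interpretation
		3.13 Clinical Decision Support System Development
		3.14 Statistical Methods
	Chapter 4 Results
		4.1 Data Cleaning
		4.2 Patient Characteristics
		4.3 Dataset Comparison
			4.3.1 Distribution Comparison Between Non-Small Cell Lung Cancer Data and All Lung Cancer Data
			4.3.2 Data Distributions Under Different Missing Value Handling Methods
			4.3.3 Comparison of Data Distributions Across Clusterings
		4.4 Model Performance
		4.5 Distribution of Prediction Results
		4.6 Important Features
		4.7 Decision Support System for Assessing Second Primary Cancer Risk in Lung Cancer Patients
	Chapter 5 Discussion
	Chapter 6 Conclusion
	References
	Appendix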
Advisors	Jyh-Cheng Hsu (許智誠), Yi-Ju Tseng (曾意儒)	Approval date	2022-7-16

Master's and doctoral theses: detailed record for thesis 109423602