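Investigation of Renal Cell Carcinoma Subtypes Using Explainable Machine Learning Methods and Reverse Phase Protein Array-Based Multi-Omics Data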


Name	余嘉俊 (Yu Ka Chun)	Department	Computer Science and Information Engineering (資訊工程學系)
Thesis Title	Investigation of Renal Cell Carcinoma Subtypes Using Explainable Machine Learning Methods and Reverse Phase Protein Array-Based Multi-Omics Data (利用可解釋性機器學習方法與基於反相蛋白陣列的多體學資料探究腎細胞癌亞型)
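Access	This electronic thesis is authorized for immediate open access. The open-access full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.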
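Abstract (translated from Chinese)	Kidney cancer is a major global public health problem, with more than 400,000 new cases and about 180,000 deaths each year. Renal cell carcinoma (RCC) accounts for more than 90% of cases and mainly comprises chromophobe renal cell carcinoma (Kidney Chromophobe, KICH), clear cell renal cell carcinoma (Kidney Renal Clear Cell Carcinoma, KIRC), and papillary renal cell carcinoma (Kidney Renal Papillary Cell Carcinoma, KIRP). Because prognosis and treatment differ by subtype, accurately distinguishing the subtypes and understanding their similarities and differences benefits the development of precision medicine. Reverse Phase Protein Array (RPPA) is a high-throughput proteomics technology that quantifies many proteins from very small samples, with the advantages of high sensitivity, rapid processing, and detection of post-translational modifications; in-depth analysis of such data can advance cancer research and the quantification of related biomarkers. However, studies that use RPPA-based multi-omics data to examine RCC subtypes remain scarce, even though multi-omics analysis helps clarify subtype mechanisms and advance treatment. This study therefore examines RPPA and multi-omics data for RCC subtype classification, using five classification models, Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and eXtreme Gradient Boosting (XGB), on single-omics, RPPA-based dual-omics, and RPPA-based multi-omics data to evaluate the importance of RPPA for classifying kidney cancer subtypes. The results show that RPPA significantly improves classification accuracy. In particular, the XGB model improved markedly when using mutation and RPPA data, showing the important contribution of RPPA to RCC subtype classification. In addition, we introduce new evaluation methods: the Adjusted Weighted Accuracy Score (AW-ACC SCORE), for comparing how critical each omics type is to a specific task, and the Adjusted Weighted Mean Absolute Shapley Importance (AWMSHAP), for assessing feature importance. These methods identified important proteins such as INPP4B, PIK3CA, NDRG1, and CASP7. These proteins may be related to subtype classification, are significantly associated with different RCC subtypes, and may influence tumor biology and clinical prognosis. The results show that combining RPPA, multi-omics data, and machine learning has potential for classifying RCC subtypes, and the identified proteins show the importance of explainability in machine learning models for building clinical trust and promoting the clinical translation of research findings.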
Abstract (English)	Kidney cancer is a major global health issue, with over 400,000 new cases and about 180,000 deaths annually. Renal cell carcinoma (RCC) accounts for over 90% of these cases and includes chromophobe RCC (KICH), clear cell RCC (KIRC), and papillary RCC (KIRP). Each subtype has a distinct prognosis and treatment; thus, accurately distinguishing between subtypes and understanding their differences is crucial for developing precision medicine. Reverse Phase Protein Array (RPPA) is a high-throughput proteomics technology that can quantitatively analyze multiple proteins using minimal sample amounts, offering high sensitivity, rapid processing, and the ability to detect post-translational modifications. In-depth analysis of RPPA data can advance cancer research and the quantification of related biomarkers. However, studies exploring RCC subtypes using RPPA-based multi-omics data are still lacking, although analyzing multi-omics data can enhance our understanding of subtype mechanisms and promote therapeutic development. This study investigated the classification of RCC subtypes using RPPA and multi-omics data. We employed five classification models, Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and eXtreme Gradient Boosting (XGB), on single-omics, RPPA-based dual-omics, and RPPA-based multi-omics datasets to evaluate the importance of RPPA in RCC subtype classification. The results indicate that RPPA significantly enhances classification accuracy. Notably, the XGB model demonstrated substantial performance improvement when utilizing mutation and RPPA data, underscoring the critical role of RPPA in RCC subtype classification.
Furthermore, we introduce novel evaluation methods: the Adjusted Weighted Accuracy Score (AW-ACC SCORE), for comparing the importance of omics types in specific tasks, and the Adjusted Weighted Mean Absolute Shapley Importance (AWMSHAP), for assessing feature importance. These methods identified key proteins such as INPP4B, PIK3CA, NDRG1, and CASP7, which may be associated with subtype classification. These proteins are associated with different subtypes and may influence tumor behavior and clinical outcomes. Our findings indicate the potential of combining RPPA, multi-omics data, and machine learning for precise RCC subtype classification. The identified proteins highlight the importance of explainability in machine learning models for building clinical trust and facilitating the translation of research findings into clinical practice.
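The abstract names AWMSHAP but this record does not define its adjustment or weighting, so neither is reproduced here. As a rough illustration of the quantity it builds on, the sketch below computes the plain (unadjusted) mean absolute Shapley value per feature by exact enumeration. The toy scoring function, the sample values, and the all-zero baseline are hypothetical and are not taken from the thesis; features left out of a coalition are simply set to their baseline value, which is one common convention among several.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample `x` relative to `baseline`.

    Features outside a coalition are replaced by their baseline value
    (an assumed convention; other value functions are possible).
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                with_i = [x[j] if (j in coalition or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in coalition else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def mean_abs_shap(predict, samples, baseline):
    """Global importance: mean of |Shapley value| per feature over samples."""
    totals = [0.0] * len(baseline)
    for x in samples:
        for i, value in enumerate(shapley_values(predict, x, baseline)):
            totals[i] += abs(value)
    return [t / len(samples) for t in totals]


def toy_score(features):
    """Hypothetical model: a linear score over three made-up features."""
    a, b, c = features
    return 2.0 * a - 1.0 * b + 0.0 * c


samples = [[1.0, 0.0, 5.0], [0.0, 2.0, -3.0], [3.0, 1.0, 0.0]]
baseline = [0.0, 0.0, 0.0]
importance = mean_abs_shap(toy_score, samples, baseline)
print(importance)
```

For a linear model, each Shapley value reduces to the coefficient times the feature's offset from the baseline, so the third feature (coefficient zero) receives zero importance. Exact enumeration grows exponentially with the number of features; for real omics data an approximation from a dedicated library would be needed.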
Keywords	★ Cancer Subtype Classification ★ Multi-Omics ★ Interpretable Machine Learning
Table of Contents	Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background
  1.2 Related Works
  1.3 Motivation and Goal
Chapter 2 Materials and Methods
  2.1 Data Acquisition and Pre-Processing
  2.2 Decision Tree
  2.3 Random Forest
  2.4 Support Vector Machine
  2.5 K-Nearest Neighbors
  2.6 eXtreme Gradient Boosting
  2.7 Feature Importance Assessment Techniques
  2.8 Statistical Methods
  2.9 Evaluation Metrics
Chapter 3 Results
  3.1 Overview of the Datasets
    3.1.1 Datasets Visualization
    3.1.2 Principal Components Analysis of RPPA-based Multi-Omics
    3.1.3 Multivariate Linear Regression Analysis
  3.2 Performance of Classification
    3.2.1 Performance of Classification using Single-Omics
    3.2.2 RPPA-based Dual-Omics Improved Model Performance
    3.2.3 Impact of Multi-Omics Integration on Model Performance
  3.3 Feature Importance and Feature Selection of RPPA
    3.3.1 Analysis of Feature Importance in High-Accuracy Models
    3.3.2 Feature Selection Based on the Overall Ranking from the Five Highly Explainable Feature Importance Methods
  3.4 Association between Clinical Characteristics and RPPA
Chapter 4 Discussion and Conclusion
References
Appendix
Advisor	鍾佳儒 (Chia-Ru Chu)	Approval Date	2024-08-21

Master's and Doctoral Theses: Detailed Record for 111522041