dc.description.abstract | Machine learning is a powerful and efficient tool that has been widely applied across professional fields, significantly improving the accuracy and execution efficiency of tasks in the era of big data. In the biomedical field, where large sample sizes must be processed, machine learning shows substantial potential. This study applies several machine learning-based feature selection methods to investigate the association between variations at individual genetic loci and four side effects of breast cancer treatment: osteoporosis, peripheral neuropathy, abnormal endometrial thickness and white blood cell count. Unlike mainstream research, which explores the association between specific illnesses and genetic loci, our research focuses on the association between treatment side effects and genetic loci, with the aim of supporting patient safety and the selection of appropriate treatment methods.
Through multi-stage analysis, we identified genetic variant loci highly correlated with each of the four side effects. For the first stage of feature selection, we compared several statistical tests (chi-square, Fisher’s exact, Spearman’s rank correlation, and Kendall’s tau), using a p-value < 0.05 as the threshold. For the second stage, we evaluated different machine learning classifiers (Random Forest, XGBoost, neural network) as feature selectors. We compared the accuracy and learning curves of each feature selector in depth and analyzed feature importance and the impact of genetic loci on the prediction of side effects.
Our approach narrowed more than 150,000 independent genetic loci down to approximately 100 key loci for each side effect, significantly improving the accuracy of predicting medication side effects. For further validation, we conducted genotype analysis of HLA, which is closely related to drug metabolism and immunity. | en_US |