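Abstract (Chinese) Combining microarray experiments with computational analysis is an emerging technology in cancer research. The expression levels of tens of thousands of genes are used to predict whether particular characteristics of a cancer are present, and even to derive rules that help explain the causes of the cancer and how it acts, and to develop drug regimens that suppress it. The approach applies not only to cancer but to any disease that is not yet understood. Gene selection is an important step in analyzing microarray data, because it shows which genes are discriminative for the disease and take part in key regulation. However, studying this problem with computational techniques such as numerical analysis, machine learning, and data mining runs into two problems: the excessively high dimensionality of the features, and overfitting of the trained model.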
Abstract (English) Gene selection can help in analyzing microarray gene expression data. However, it is very difficult to obtain satisfactory classification results with machine learning techniques because of the curse of dimensionality and overfitting: the number of features is very large while the number of samples is small. We therefore design a system flow that attempts to avoid both problems and then selects a small set of significant biomarker genes for diagnosis, so that samples can be classified correctly. We also test the system on several microarray datasets; its good performance on them suggests that it is useful and reliable.
Table of Contents Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Goal
Chapter 2 Related Works
2.1 Other gene selection methods
2.2 WEKA
2.3 KEGG
Chapter 3 System Flow
3.1 Data input
3.2 Gene Selection
3.2.1 Resampling
3.2.2 Tree gathering
3.2.3 Gene selecting
3.3 Classification
Chapter 4 Materials
4.1 Public datasets
4.2 NTU hospital data
Chapter 5 Results
5.1 The performance for public datasets
5.2 The performance for NTU hospital data
5.2.1 Metastasis diagnosis
5.2.2 Her2-positive diagnosis
Chapter 6 Discussion
References
Appendix
