Surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) and matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry are techniques currently used to identify cancer biomarkers. This work uses an ovarian cancer dataset generated by SELDI-TOF from the National Cancer Institute, USA, and an oral cancer dataset generated by MALDI-TOF from the Proteomics Center of Chang Gung University, Taiwan. Samples in both datasets are divided into a control group and a cancer patient group. The aim of this work is to reduce the high dimensionality of the mass spectra and to extract significant peak-features for further study. Baseline subtraction, peak detection, spectra alignment and normalization are used for feature extraction. The Kolmogorov-Smirnov (KS) test, logistic regression and random forest are used for feature selection. Once the discriminatory peak-features have been selected, three classification methods are applied to classify the two classes of the ovarian cancer dataset. The 50 and the 100 most discriminatory peak-features were each used for classification with 1000 replications of randomized 10-fold proportional cross-validation. The regression tree with bagging, k-nearest neighbor and support vector machine (SVM) classifiers all yielded good sensitivity, specificity, accuracy and precision. We also developed a correlation-based query system to identify highly correlated peaks in the cancer and non-cancer groups.
The proposed analysis pipeline can provide a relatively small peak-feature set that is discriminatory enough for classification and correlation-based studies.
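
As an illustration of two steps named above, the following minimal Python sketch ranks peak-features by the two-sample Kolmogorov-Smirnov statistic and computes the four reported evaluation measures from predicted labels. It is not the implementation used in this work: the function names (`ks_statistic`, `rank_peaks_by_ks`, `classification_metrics`), the 0/1 label coding (0 for control, 1 for cancer) and the toy intensities are assumptions made only for this example.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The maximum gap is always attained at one of the observed values.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def rank_peaks_by_ks(spectra, labels, top_k):
    """Return the indices of the top_k peaks with the largest KS statistic.

    spectra: one row per sample, each row a list of peak intensities.
    labels:  0 (control) or 1 (cancer) for each row (assumed coding).
    """
    scores = []
    for j in range(len(spectra[0])):
        control = [row[j] for row, y in zip(spectra, labels) if y == 0]
        cancer = [row[j] for row, y in zip(spectra, labels) if y == 1]
        scores.append((ks_statistic(control, cancer), j))
    # Largest statistic first; ties are broken by the lower peak index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:top_k]]


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and precision for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
    }


if __name__ == "__main__":
    # Toy data (made up): 6 samples, 3 peaks; only peak 0 separates the groups.
    spectra = [
        [1.0, 5.0, 0.2], [1.1, 5.2, 0.9], [0.9, 4.9, 0.4],
        [3.0, 5.1, 0.3], [3.2, 5.0, 0.8], [2.9, 5.3, 0.5],
    ]
    labels = [0, 0, 0, 1, 1, 1]
    print(rank_peaks_by_ks(spectra, labels, top_k=1))
    print(classification_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

In a full pipeline, the ranking would be computed on the training folds only within each cross-validation replication, so that the selected peaks do not leak information from the held-out samples.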