Name: Man-chien Kuo (郭嫚茜)
Department: Institute of Systems Biology and Bioinformatics
Thesis title: Prediction of protein glycosylation sites by using support vector machines (使用支持向量機預測蛋白質醣基化位置)
Abstract
Protein glycosylation is an important post-translational modification (PTM) that affects many protein properties, such as structure, biological activity, and protein-protein interaction. Because experimental analysis is difficult and requires extensive verification work, several studies in recent years have proposed computational approaches for identifying protein glycosylation sites. The models in these studies mainly use the distribution of amino acids surrounding the glycosylation site as features, and each earlier prediction tool targets only a specific type of glycosylation. We therefore developed a method that predicts the sites of three glycosylation types (O-linked, N-linked, and C-linked) using support vector machines (SVMs). The features are dipeptides (amino acid pairs) combined with accessible surface area (ASA), dipeptides alone, and amino acids and dipeptides occurring in specific regions. The prediction accuracy is 95% and 91% for O-linked glycosylation on serine and threonine, respectively, 96% for N-linked glycosylation on asparagine, and 95% for C-linked glycosylation on tryptophan. The method is implemented in GSI, a web server that predicts O-linked, N-linked, and C-linked glycosylation sites.
Keywords
★ support vector machines (支持向量機)
★ post-translational modification
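The abstract describes features built from the residues surrounding each candidate site. The following is a minimal sketch of that idea, not the thesis's actual implementation: it cuts a symmetric window around every occurrence of a target residue and encodes the window as dipeptide frequencies. The function names, the padding symbol `X`, the window half-width, and the example sequence are all assumptions made for illustration; the thesis's own encoding may differ in detail.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs
PAD = "X"  # hypothetical padding symbol for positions beyond the sequence ends


def extract_windows(sequence, residue, half_width):
    """Return (1-based position, window) for every occurrence of `residue`,
    with the candidate site in the middle of a symmetric window."""
    padded = PAD * half_width + sequence + PAD * half_width
    windows = []
    for i, aa in enumerate(sequence):
        if aa == residue:
            # Index i in `sequence` sits at i + half_width in `padded`,
            # so this slice is centered on the candidate site.
            windows.append((i + 1, padded[i:i + 2 * half_width + 1]))
    return windows


def dipeptide_features(window):
    """Encode a window as 400 frequencies of adjacent residue pairs."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for a, b in zip(window, window[1:]):
        pair = a + b
        if pair in counts:  # pairs containing the padding symbol are skipped
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


# Made-up example sequence; serine (S) is the candidate residue.
seq = "MKTSAASPWNGTLLS"
for pos, win in extract_windows(seq, "S", half_width=3):
    vec = dipeptide_features(win)
    print(pos, win, len(vec))
```

Vectors of this kind, one per window and labeled as glycosylated or not, are what a classifier such as an SVM would be trained on. Combining them with ASA scores or region information, as the abstract mentions, would mean appending further values to each vector.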
Table of Contents
Chapter 1 Introduction
1.1 Background
1.1.1 Post-translational modification (PTM)
1.1.2 Glycosylation
Chapter 2 Related works
2.1 Prediction of glycosylation tools
2.1.1 Prediction of O-linked glycosylation tool
2.1.2 Prediction of N-linked glycosylation tool
2.1.3 Prediction of C-linked glycosylation tool
2.1.4 Other prediction of glycosylation tool
2.2 Comparison of current prediction tools
Chapter 3 Materials and Methods
3.1 System Flow
3.2 dbPTM dataset
3.3 Data construction
3.4 Feature construction
3.4.1 0/1 system
3.4.2 Dipeptide encoding
3.4.3 Tripeptide encoding
3.4.4 Secondary structure encoding
3.4.5 ASA encoding
3.4.6 Region encoding
3.5 Support Vector Machine (SVM)
3.6 Performance evaluation
Chapter 4 Results
4.1 Prediction performance
4.2 Comparison with previous work
4.3 Independent test set in previous prediction tools and ours
4.4 Web interface
Chapter 5 Discussion
List of Figures
Figure 1. The structure of O-linked glycosylation. The oligosaccharide is attached to the hydroxyl group of a serine or threonine residue.
Figure 2. The structure of N-linked glycosylation. The oligosaccharide is attached to asparagine.
Figure 3. The structure of C-linked glycosylation. The α-mannopyranosyl residue is attached to the indole C2 of tryptophan via a C-C link.
Figure 4. The structure of GPI anchors. The hydrophobic phosphatidylinositol group is linked to a residue at or near the C-terminus of a protein through a carbohydrate-containing linker.
Figure 5. The system flow of constructing prediction models
Figure 6. The process of truncating the protein sequence into region windows with the glycosylation or non-glycosylation site in the middle.
Figure 7. The process of dipeptide encoding
Figure 8. The process of tripeptide encoding
Figure 9. The process of secondary structure encoding
Figure 10. The calculation of ASA scores combined with dipeptide.
Figure 11. Comparison of the difference in ASA scores between positive and negative datasets for serine residues in O-linked glycosylation
Figure 12. Comparison of the difference in ASA scores between positive and negative datasets for threonine residues in O-linked glycosylation
Figure 13. Comparison of the difference in ASA scores between positive and negative datasets for N-linked glycosylation
Figure 14. Comparison of the difference in ASA scores between positive and negative datasets for C-linked glycosylation
Figure 15. The performance of serine residue in O-linked glycosylation prediction models
Figure 16. The performance of threonine residue in O-linked glycosylation prediction models
Figure 17. The performance of N-linked glycosylation prediction models
Figure 18. The performance of C-linked glycosylation prediction models
Figure 19. The interface of the GSI web server, which is available at http://bioinfo.gene.idv.tw/.
Figure 20. The GSI web interface with an example input
Figure 21. The results of each type of potentially glycosylated amino acid sites and the distribution of ASA scores surrounding them
Figure 22. The list of prediction results for the protein sequences and the ASA scores of each site.
List of Tables
Table 1. Comparison of current prediction tools
Table 2. Number of positive and negative datasets in our study for O-linked, N-linked and C-linked glycosylation considered
Table 3. The number of positive and negative datasets for Serine in O-linked glycosylation for different symmetrical window size and ratio of positive and negative datasets
Table 4. The number of positive and negative datasets for Serine in O-linked glycosylation for different symmetrical window size and ratio of positive and negative datasets
Table 5. The number of positive and negative datasets for C-linked glycosylation for different symmetrical window size
Table 6. The number of positive and negative datasets for N-linked glycosylation for different symmetrical window size
Table 7. The various ratios of positive and negative datasets on serine residues in O-linked glycosylation based on 0/1 system encoding
Table 8. The various ratios of positive and negative datasets on threonine residues in O-linked glycosylation based on 0/1 system encoding
Table 9. The results of serine residue in O-linked glycosylation using different features
Table 10. The results of threonine residue in O-linked glycosylation using different features
Table 11. The results in N-linked glycosylation from different models
Table 12. The performance of C-linked glycosylation from different models
Table 13. Best models of four types of glycosylation
Table 14. Comparison of using our training datasets on serine in O-linked glycosylation to test previous prediction tools
Table 15. Comparison of using our training datasets on threonine residues in O-linked glycosylation to test the other prediction tools
Table 16. Comparison of the training datasets for serine within other prediction tools to test our and their own prediction models
Table 17. Comparison of the training datasets for threonine within other prediction tools to test our and their own prediction models
Table 18. Comparison of proposed accuracy with other prediction tools on N-linked glycosylation
Table 19. Comparison of the training datasets for asparagine residue within other prediction tools to test our and their prediction tools
Table 20. Comparison of proposed accuracy with other prediction tools on C-linked glycosylation
Table 21. Comparison of the training datasets for tryptophan residues within other prediction tools to test our and their own prediction models
Table 22. Comparison of using independent test sets with current prediction tools and ours on serine residues in O-linked glycosylation
Table 23. Comparison of using independent test sets with current prediction tools and ours on threonine residues in O-linked glycosylation
Table 24. Comparison of using independent test sets with current prediction tools and ours on asparagine residues in N-linked glycosylation
Table 25. Comparison of using independent test sets with current prediction tools and ours on tryptophan residues in C-linked glycosylation
Table 26. Using datasets of serine in O-linked glycosylation of CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross test previous prediction tools
Table 27. Using datasets of threonine residues in O-linked glycosylation of CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross test previous prediction tools
Table 28. Using datasets of asparagine residues in N-linked glycosylation of CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross test previous prediction tools
Table 29. Using datasets of tryptophan residues in C-linked glycosylation of CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross test previous prediction tools
Table 30. The comparison of different glycosylation datasets between previous prediction tools and ours
References
1. Hart GW: Glycosylation. Current opinion in cell biology 1992, 4(6):1017-1023.
2. Hounsell EF, Davies MJ, Renouf DV: O-linked protein glycosylation structure and function. Glycoconjugate journal 1996, 13(1):19-26.
3. Stanley P: Glycosylation engineering. Glycobiology 1992, 2(2):99-107.
4. Jenkins N, Parekh RB, James DC: Getting the glycosylation right: implications for the biotechnology industry. Nature biotechnology 1996, 14(8):975-981.
5. Mann M, Jensen ON: Proteomic analysis of post-translational modifications. Nature biotechnology 2003, 21(3):255-261.
6. Walsh CT, Garneau-Tsodikova S, Gatto GJ, Jr.: Protein posttranslational modifications: the chemistry of proteome diversifications. Angewandte Chemie (International ed. in English) 2005, 44(45):7342-7372.
7. Blom N, Sicheritz-Ponten T, Gupta R, Gammeltoft S, Brunak S: Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 2004, 4(6):1633-1649.
8. Asker N, Baeckstrom D, Axelsson MA, Carlstedt I, Hansson GC: The human MUC2 mucin apoprotein appears to dimerize before O-glycosylation and shares epitopes with the 'insoluble' mucin of rat small intestine. The Biochemical journal 1995, 308(Pt 3):873-880.
9. Peters BP, Krzesicki RF, Perini F, Ruddon RW: O-glycosylation of the alpha-subunit does not limit the assembly of chorionic gonadotropin alpha beta dimer in human malignant and nonmalignant trophoblast cells. Endocrinology 1989, 124(4):1602-1612.
10. Chen YZ, Tang YR, Sheng ZY, Zhang Z: Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs. BMC bioinformatics 2008, 9:101.
11. Hanisch FG: O-glycosylation of the mucin type. Biological chemistry 2001, 382(2):143-149.
12. Helenius A, Aebi M: Roles of N-linked glycans in the endoplasmic reticulum. Annual review of biochemistry 2004, 73:1019-1049.
13. Gavel Y, von Heijne G: Sequence differences between glycosylated and non-glycosylated Asn-X-Thr/Ser acceptor sites: implications for protein engineering. Protein engineering 1990, 3(5):433-442.
14. Rudd PM, Elliott T, Cresswell P, Wilson IA, Dwek RA: Glycosylation and the immune system. Science (New York, NY) 2001, 291(5512):2370-2376.
15. Hofsteenge J, Blommers M, Hess D, Furmanek A, Miroshnichenko O: The four terminal components of the complement system are C-mannosylated on multiple tryptophan residues. The Journal of biological chemistry 1999, 274(46):32786-32794.
16. Doucey MA, Hess D, Cacan R, Hofsteenge J: Protein C-mannosylation is enzyme-catalysed and uses dolichyl-phosphate-mannose as a precursor. Molecular biology of the cell 1998, 9(2):291-300.
17. Perez-Vilar J, Randell SH, Boucher RC: C-Mannosylation of MUC5AC and MUC5B Cys subdomains. Glycobiology 2004, 14(4):325-337.
18. Ihara Y, Manabe S, Kanda M, Kawano H, Nakayama T, Sekine I, Kondo T, Ito Y: Increased expression of protein C-mannosylation in the aortic vessels of diabetic Zucker rats. Glycobiology 2005, 15(4):383-392.
19. Julenius K: NetCGlyc 1.0: prediction of mammalian C-mannosylation sites. Glycobiology 2007, 17(8):868-876.
20. Kinoshita T, Ohishi K, Takeda J: GPI-anchor synthesis in mammalian cells: genes, their products, and a deficiency. Journal of biochemistry 1997, 122(2):251-257.
21. Caragea C, Sinapov J, Silvescu A, Dobbs D, Honavar V: Glycosylation site prediction using ensembles of Support Vector Machine classifiers. BMC bioinformatics 2007, 8:438.
22. Eisenhaber B, Bork P, Eisenhaber F: Prediction of potential GPI-modification sites in proprotein sequences. Journal of molecular biology 1999, 292(3):741-758.
23. Presnell SR, Cohen FE: Artificial neural networks for pattern recognition in biochemical sequences. Annual review of biophysics and biomolecular structure 1993, 22:283-298.
24. Burges CJC: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 1998, 2(2):121-167.
25. Hansen JE, Lund O, Tolstrup N, Gooley AA, Williams KL, Brunak S: NetOglyc: prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility. Glycoconjugate journal 1998, 15(2):115-130.
26. Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A: The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Human mutation 2004, 23(5):464-470.
27. Gupta R, Birch H, Rapacki K, Brunak S, Hansen JE: O-GLYCBASE version 4.0: a revised database of O-glycosylated proteins. Nucleic acids research 1999, 27(1):370-372.
28. Julenius K, Molgaard A, Gupta R, Brunak S: Prediction, conservation analysis, and structural characterization of mammalian mucin-type O-glycosylation sites. Glycobiology 2005, 15(2):153-164.
29. Li S, Liu B, Zeng R, Cai Y, Li Y: Predicting O-glycosylation sites in mammalian proteins by using SVMs. Computational biology and chemistry 2006, 30(3):203-208.
30. Gupta R, Jung E: NetNGlyc: Prediction of N-glycosylation sites in human proteins. In.: Accessed; 2005.
31. Ahmad S, Gromiha MM, Sarai A: RVP-net: online prediction of real valued accessible surface area of proteins from single sequences. Bioinformatics (Oxford, England) 2003, 19(14):1849-1851.
32. Lee TY, Huang HD, Hung JH, Huang HY, Yang YS, Wang TH: dbPTM: an information repository of protein post-translational modification. Nucleic acids research 2006, 34(Database issue):D622-627.
33. McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction server. Bioinformatics (Oxford, England) 2000, 16(4):404-405.
34. Grzymislawski M, Derc K, Sobieska M, Wiktorowicz K: Microheterogeneity of acute phase proteins in patients with ulcerative colitis. World J Gastroenterol 2006, 12(32):5191-5195.
35. Bhasin M, Raghava GP: ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic acids research 2004, 32(Web Server issue):W414-419.
36. Gould SJ, Keller GA, Hosken N, Wilkinson J, Subramani S: A conserved tripeptide sorts proteins to peroxisomes. The Journal of cell biology 1989, 108(5):1657-1664.
37. Richmond TJ: Solvent accessible surface area and excluded volume in proteins. Analytical equations for overlapping spheres and implications for the hydrophobic effect. Journal of molecular biology 1984, 178(1):63-89.
38. Hua S, Sun Z: Support vector machine approach for protein subcellular localization prediction. Bioinformatics (Oxford, England) 2001, 17(8):721-728.
39. Chang CC, Lin CJ: LIBSVM: a library for support vector machines. In., vol. 80; 2001: 604–611.
40. Song J, Burrage K, Yuan Z, Huber T: Prediction of cis/trans isomerization in proteins using PSI-BLAST profiles and secondary structure information. BMC bioinformatics 2006, 7:124.
41. Witten I, Frank E: Data Mining: Practical Machine Learning Tools and Techniques: Morgan Kaufmann; 2005.
(Jorng-tzong Horng, Li-ching Wu)
Approval date: 2008-7-9