Extracting protein/gene names from the biological literatures

Author

Yu-Chang Cheng (鄭煜璋)

Department

Department of Computer Science and Information Engineering (資訊工程學系)

Thesis Title

從生物文件中萃取出蛋白質或基因之名稱
(Extracting protein/gene names from the biological literatures)

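This electronic thesis is authorized for immediate open access.
The open-access full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.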

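Abstract (translated from the Chinese)

Biotechnology has advanced steadily in recent years, and large-scale experiments now produce a considerable volume of data and documents. Extracting useful information from these documents, which are written in natural language (such as English), so that the extracted data can be analyzed further is becoming increasingly important.
Whether the goal is to understand the interactions among the components of an organism or to annotate biological substances, the first step is to enable a computer to recognize the names of the substances of interest in the documents. This study identifies all protein names in biological documents. We propose a system that recognizes the names of proteins or genes. The system relies mainly on hand-crafted rules, supplemented by a machine learning mechanism that improves its performance. On Yapex, a well-known corpus in this research field, the system reaches an F-score of 73.8%.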

Abstract (English)

New high-throughput technologies have accelerated the accumulation of data about genes and proteins. However, much of this data is stored as natural language text. Because the volume of literature is enormous, it becomes harder for researchers to process the data further and integrate it into more complete and useful knowledge. Automatic literature mining has therefore become increasingly important in recent years.
The first step in extracting knowledge from natural language text is to extract the named entities; the relations between those entities can then be constructed. Here we propose a new system that extracts named entities, specifically those referring to proteins or genes, from biological literature such as MEDLINE abstracts. The system is mainly rule-based and is combined with an SVM machine learning module to improve performance. It achieves an F-score of 73.8% on the Yapex corpus.
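For readers unfamiliar with the metric, the following minimal sketch shows how an F-score can be computed for entity extraction. It assumes strict exact-match scoring over (start, end) character offsets, which may differ from the evaluation criterion the thesis actually uses; the function name `f_score` and the example spans are hypothetical and are not taken from the thesis or from the Yapex corpus.

```python
def f_score(predicted, gold, beta=1.0):
    """Return (precision, recall, F) for two collections of entity spans.

    Assumption: a predicted span counts as correct only if it matches
    a gold span exactly (same start and end offsets).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical (start, end) offsets of protein names in one abstract.
gold_spans = {(0, 4), (10, 18), (25, 31), (40, 47)}
predicted_spans = {(0, 4), (10, 18), (26, 31)}

precision, recall, f1 = f_score(predicted_spans, gold_spans)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

With `beta=1.0` the result is the harmonic mean of precision and recall, so a system cannot score well by favoring one at the expense of the other. In this example the span (26, 31) misses the gold span (25, 31) by one character and therefore counts as an error under the exact-match assumption.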

Keywords (translated from the Chinese)

★ Natural Language Processing
★ Text Mining

Keywords (English)

★ Biomedical Named Entity Extraction
★ Natural Language Processing
★ Text Mining

Table of Contents

List of Figures
List of Tables
Chapter 1. Introduction
1.1 Motivation
1.2 Research Goal
Chapter 2. Related Work
2.1 Dictionary-based methods
2.2 Rule-based methods
2.3 Machine learning methods
2.4 Corpora
2.5 Results of early works
Chapter 3. Methods
3.1 System overview
3.2 Tokenization and POS tagging
3.3 Token selector
3.3.1 Selection rules
3.3.2 Filtering rules
3.3.3 SVM module
3.4 Extending module
3.4.1 Left extending
3.4.2 Right extending
3.5 Post filter
3.6 Abbreviation recovery
Chapter 4. Results
4.1 Evaluation Criterion
4.2 Results in Yapex Corpus
Chapter 5. Conclusion
5.1 Discussion
5.2 Future works
Reference

Advisor

Chin-Wen Ho (何錦文)

Approval Date

2005-7-18