Building Vector Space Model for Patent Data to Support Cross-Language Retrieval

Name

Yu-Ting Chiu (邱裕婷)

Department

Department of Information Management

Thesis title

Building Vector Space Model for Patent Data to Support Cross-Language Retrieval
(建立專利資料之向量空間模型以支援跨語言檢索)

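The author has authorized immediate open access to this electronic thesis.
The open-access full text is licensed only for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China. To avoid breaking the law, do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.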

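Abstract (Chinese)

Documents are unstructured data containing text and diagrams, and most carry no class labels. The vector space model (VSM) is a common way to represent documents, but traditional approaches have two problems. First, when selecting important terms as the basis features of the vectors, they consider only whether a term is the most discriminative within one specific document collection. Second, when applied to documents with class labels, they assess a term's ability to discriminate between classes only over a flat label structure.
To address these two problems, this research sets three objectives. Objective 1: design a new method that, when selecting the most representative features, considers how each feature relates to the hierarchical class labels. Objective 2: design a new method, the IPC-based vector model (IPC: International Patent Classification), that uses features other than terms so that the resulting vector model represents documents more effectively. Objective 3: refine the IPC-based vector model so that it applies in multilingual settings, giving it broader extended uses.
For Objective 1, an experiment tested whether taking the hierarchical relations among class labels into account selects terms with greater discriminative and representative power. The results show that vector features chosen by proportional selection yield higher coverage, whereas features chosen by weighted-sum selection yield higher accuracy. For Objective 2, another experiment tested whether using IPC codes as the vector basis improves performance. The results indicate that the IPC-based indexing term selection method achieves higher accuracy and satisfaction. Finally, for Objective 3, an experiment tested the performance of the cross-language patent document matching method. The experimental and evaluation results show that the IPC-based concept bridge outperforms traditional methods.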
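As a rough illustration of the traditional term-based vector space model that the abstract takes as its starting point, the following minimal sketch builds tf-idf vectors and compares documents by cosine similarity. The weighting formula, the function names, and the toy tokens are assumptions chosen for illustration; they are not taken from the thesis.

```python
import math
from collections import Counter


def build_tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: relative term frequency times log inverse document frequency.
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy "patents", already tokenized.
docs = [
    ["battery", "electrode", "cell"],
    ["battery", "anode", "cell"],
    ["antenna", "signal", "receiver"],
]
vecs = build_tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # the two documents share "battery" and "cell"
sim_unrelated = cosine(vecs[0], vecs[2])  # no shared terms
print(sim_related, sim_unrelated)
```

In this baseline every distinct term is its own dimension and term weights depend only on the one collection at hand, which is the limitation the first objective targets.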

Abstract (English)

Documents are unstructured data containing text and diagrams, and most of them carry no class label. Vector space model (VSM) methods are commonly used to represent documents, but the traditional methods have two problems. First, when selecting important terms as the features that form a vector basis, they consider a term's discrimination ability only within one specific set of documents. Second, when a term occurs in documents with class labels, they consider its discrimination ability among the class labels only in a flat structure.
To address these problems, this research pursues three objectives. First, a new approach is designed to select the most representative features (i.e., terms) to form a VSM while taking hierarchical class labels into account. Second, a new method is designed to build an IPC-based VSM that uses features other than terms to represent documents more effectively. Third, the IPC-based VSM is refined to work in a multi-language setting as an extended usage.
For the first objective, an experiment tested whether considering the hierarchical relations among class labels can sift out terms with higher representative and discriminative ability for representing patent documents. The results reveal that a VSM whose features are chosen by proportional selection has higher coverage, whereas a VSM whose features are chosen by weighted-sum selection has higher accuracy. For the second objective, another experiment examined whether using IPC codes as the indexing vocabulary improves the performance of retrieving similar documents. The results indicate that the IPC-based indexing vocabulary selection method achieves higher accuracy and is more satisfactory. Finally, the experiment for the third objective tested the performance of the proposed solution for cross-language patent document matching. The experimental and evaluation results demonstrate that the proposed IPC-based concept bridge outperforms the traditional methods.
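The idea of matching patents across languages through a shared IPC space can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the term-to-IPC weights are estimated here as simple co-occurrence shares, and the function names (`learn_term_to_ipc`, `to_concept_vector`) and toy data are hypothetical. The thesis's actual weighting and vector transformation may differ.

```python
import math
from collections import defaultdict


def learn_term_to_ipc(labeled_docs):
    """Estimate term -> {ipc_code: weight} from (tokens, ipc_codes) pairs of ONE language.

    Assumption: a term's weight for a code is the share of its occurrences under that code.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for tokens, ipc_codes in labeled_docs:
        for term in tokens:
            for code in ipc_codes:
                counts[term][code] += 1.0
    weights = {}
    for term, per_code in counts.items():
        total = sum(per_code.values())
        weights[term] = {code: c / total for code, c in per_code.items()}
    return weights


def to_concept_vector(tokens, term_to_ipc):
    """Project a tokenized document into IPC space by summing its terms' code weights."""
    vec = defaultdict(float)
    for term in tokens:
        for code, w in term_to_ipc.get(term, {}).items():
            vec[code] += w
    return dict(vec)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical training patents per language, each labeled with IPC subclasses.
english = [(["battery", "electrode"], ["H01M"]), (["antenna", "signal"], ["H01Q"])]
chinese = [(["電池", "電極"], ["H01M"]), (["天線", "訊號"], ["H01Q"])]
en_map = learn_term_to_ipc(english)
zh_map = learn_term_to_ipc(chinese)

# Documents in different languages become comparable once both live in IPC space.
query = to_concept_vector(["battery", "electrode"], en_map)
sim_same_topic = cosine(query, to_concept_vector(["電池", "電極"], zh_map))
sim_other_topic = cosine(query, to_concept_vector(["天線", "訊號"], zh_map))
print(sim_same_topic, sim_other_topic)
```

Because IPC codes are language-independent, no dictionary or machine translation step appears in this sketch: each language only needs its own labeled patents to learn a mapping from terms to codes.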

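Keywords (Chinese)

★ 跨語言專利比對 (cross-language patent matching)
★ 專利探勘與檢索 (patent mining and retrieval)
★ 階層式類別標籤 (hierarchical class label)
★ 向量空間模型 (vector space model)
★ 特徵選取 (feature selection)

Keywords (English)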

★ cross-language patent matching
★ patent mining and retrieval
★ hierarchical class label
★ vector space model
★ feature selection

Table of Contents

Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1. Introduction
1.1. Research background and motivation
1.2. Objective I: Selecting representative features via class hierarchy
1.3. Objective II: Designing patent representation via non-term features
1.4. Objective III: Refining patent representation for multi-language usage
Chapter 2. Literature Review
2.1. Patent documents
2.2. Patent mining
2.3. Vector space model
2.4. Compound noun
2.5. Cross-language information retrieval and document matching
2.6. Cross-language patent matching (CLPM)
Chapter 3. Patent representation considering class hierarchy
3.1. Problem definition
3.2. Hierarchical feature selection (HFS) algorithm
3.3. Experiment and evaluation
Chapter 4. IPC-based patent representation via features other than terms
4.1. Collect patent documents
4.2. Text preprocessing
4.3. Generate category*term vectors
4.4. Generate term*category vector
4.5. Generate document*category vector
4.6. Experimental result and evaluation
4.6.1. Data collection and text preprocessing
4.6.2. The comparing methods for vector generation
4.6.3. Experimental results and evaluation
Chapter 5. IPC-based concept bridge for cross-language usage
5.1. Collect patent documents
5.2. Perform data preprocessing
5.3. Build document*keyword vectors
5.4. Transform to document*concept vectors
5.5. Construct a cross-language mediator (IPC-based concept bridge)
5.6. Similarity computation
5.7. Experiment and evaluation
5.7.1. Data collection and preprocessing
5.7.2. Vector transformation
5.7.3. Comparing methods for vector generation
5.7.4. Experimental results and evaluation
Chapter 6. Conclusion
References

Advisor

Yen-Liang Chen (陳彥良)

Approval date

2011-10-14
