A Formal Concept Analysis-Based Document Representation and its Application on Document Clustering

Name

Chin-Yi Cheng (鄭敬譯)

Department

Department of Information Management

Thesis Title

A Formal Concept Analysis-Based Document Representation and its Application on Document Clustering
(Original title: 以形式概念分析為基礎之文件向量模型建立方式及其於文件分群之應用)

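Related Theses

★ A Decision Support System for Formulating SMS Rules to Prevent Credit Card Fraud
★ A Comparison of the Effectiveness of Different Retrieval Strategies
★ An Investigation of Factors Influencing the Knowledge-Sharing Process
★ Construction and Evaluation of a Retrieval Agent System with Sharing Functions
★ A Study of Computer Attitudes and Learning Self-Efficacy among Juvenile Offenders
★ A Study of Software Metrics Issues Using the AHP Method
★ Optimizing an Intrusion Rule Base
★ A Study on Improving the Efficiency and Quality of Business Information Retrieval
★ Using the Analytic Hierarchy Process to Assess Key Factors in the Banking Industry's Adoption of Enterprise Application Integration (EAI) Systems
★ Applying Genetic Algorithms to Find Near-Optimal Layouts of Forced-Convection Devices in Cluster Computer Rooms
★ The Development of a CASE Tool with Knowledge Management Functions
★ A Fast Search Index Tree Based on the PAT Tree
★ A Compound-Noun-Based Approach to Building Document Concepts
★ Using User Interest Profiles to Examine the Importance of Adjective Position in Review Classification
★ Using Semi-Structured Information and User Feedback to Help Users Filter Web Document Search Results
★ A Study on Building a Vector Space Model from Feature-Opinion Pairs for User Review Classification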

Files

Browse the thesis in the system (never open to the public)

Abstract (Chinese)

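With the growing reach of the Internet, more and more text-based information is becoming available. To help people quickly find the information they need, techniques such as information retrieval, document classification, and document clustering have been developed. Many of these techniques are based on the so-called vector model, in which a document or query string is represented as a vector whose dimensions are individual terms and whose values are the frequencies with which those terms appear in the document or query string. Such term-based vector representations ignore the conceptual relationships between terms, such as synonyms, hypernyms, and hyponyms, that might help improve the effectiveness of these techniques. To develop an automated technique for extracting conceptual term relationships, this study applies formal concept analysis to automatically build a term relationship structure for a document set, and develops a document vector representation that uses this structure to represent documents as vectors whose dimensions are concepts. To evaluate its effectiveness in related applications, we use document clustering as the evaluation application.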
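The term-based vector representation described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the whitespace tokenizer, the function name `term_frequency_vectors`, and the two sample documents are all assumptions made for the example.

```python
from collections import Counter


def term_frequency_vectors(documents):
    """Represent each document as a vector of raw term frequencies.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # One dimension per distinct term, in a fixed (alphabetical) order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Illustrative documents (not taken from the thesis).
docs = ["car engine repair", "automobile engine repair repair"]
vocabulary, vectors = term_frequency_vectors(docs)
print(vocabulary)  # ['automobile', 'car', 'engine', 'repair']
print(vectors)     # [[0, 1, 1, 1], [1, 0, 1, 2]]
```

In this sketch "car" and "automobile" occupy separate dimensions, so the two documents share no weight on that concept even though the terms are synonyms. This is the limitation of term-based vectors that the abstract points out.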

Abstract (English)

With the continual improvement in Internet-related technology, more and more information, especially text-based information, is becoming available online. To help people find the information they need, techniques such as information retrieval, document classification, and document clustering have been developed. Most of these techniques draw upon Salton’s vector space model (VSM), in which documents or query strings are represented by vectors. Most implementations based on the VSM use the individual terms extracted from the documents or query strings as the dimensions of the vectors, and the frequencies with which the terms appear in the documents or query strings as the values of those dimensions. These implementations, the so-called bag-of-terms methods, ignore conceptual relationships between terms, such as synonyms, hypernyms, and hyponyms, that have been shown to improve the effectiveness of information retrieval, document classification, and document clustering. To address the problem of automatically constructing a thesaurus for a given document set, in this study we apply formal concept analysis (FCA) to construct a term ontology that captures the hierarchical conceptual relationships, together with synonym-like relationships, in the document set. We also develop a document representation method that applies the ontology to represent documents as concept-based vectors. To evaluate the usability and effectiveness of our method, we use document clustering as the application for evaluating the generated concept-based vectors.
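As a rough illustration of the FCA step, the sketch below enumerates the formal concepts of a tiny document-term context and then builds one vector per document with one dimension per concept. It is a sketch under stated assumptions, not the method developed in the thesis: the brute-force enumeration, the binary membership weighting, the names `formal_concepts` and `concept_vectors`, and the sample context are all hypothetical.

```python
from itertools import combinations


def formal_concepts(context):
    """Enumerate all formal concepts (extent, intent) of a small context.

    context maps each object (document id) to its set of attributes (terms).
    Brute force over object subsets, so only suitable for toy examples.
    """
    objects = sorted(context)
    all_attributes = set().union(*context.values())

    def common_attributes(objs):
        # Attributes shared by every object in objs (all attributes if empty).
        shared = set(all_attributes)
        for obj in objs:
            shared &= context[obj]
        return shared

    def objects_having(attrs):
        # Objects that carry every attribute in attrs.
        return {obj for obj in objects if attrs <= context[obj]}

    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset)
            extent = objects_having(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def concept_vectors(context, concepts):
    """Binary concept-based vectors: 1 if the document is in the concept's extent.

    The binary weighting is an assumption made for illustration only.
    """
    ordered = sorted(concepts, key=lambda c: (len(c[1]), sorted(c[1])))
    return {
        doc: [1 if doc in extent else 0 for extent, _ in ordered]
        for doc in sorted(context)
    }


# Illustrative context (not taken from the thesis).
context = {
    "d1": {"vehicle", "car"},
    "d2": {"vehicle", "truck"},
    "d3": {"vehicle", "car", "truck"},
}
concepts = formal_concepts(context)
doc_vectors = concept_vectors(context, concepts)
print(len(concepts))      # 4
print(doc_vectors["d1"])  # [1, 1, 0, 0]
```

In this toy context the concept whose intent is {vehicle} covers all three documents and sits above the more specific {vehicle, car} and {vehicle, truck} concepts, which gives a small example of the hierarchical relationships a concept lattice can expose.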

Keywords (Chinese)

★ conceptual relationship
★ document clustering
★ formal concept
★ information retrieval
★ document vector

Keywords (English)

★ vector space model
★ information retrieval
★ document clustering
★ Formal concept analysis
★ conceptual relationship

Table of Contents

Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
1. Introduction
2. Related Work
2.1 Conceptual term relationships
2.2 Applications of the conceptual term relationships
2.2.1 Manually built thesauri
2.2.2 Automatically constructed thesauri
2.3 Document representation
3. Proposed method
3.1 Formal concept analysis
3.2 Term ontology
3.2.1 Document preprocessing
3.2.2 Term ontology construction
3.3 Document representation by concept vector
4. Evaluation
4.1 Experimental system
4.2 Document sets
4.3 Evaluation method
4.4 Concept-based vector generation
4.5 Evaluation results
4.6 Discussion and limitations
4.7 Runtime performance evaluation
5. Conclusion and future work
References

References

1.Agrawal, R., and Srikant, R. (1994), “Fast Algorithms for Mining Association Rules”, In J.B. Bocca, M. Jarke, and C. Zaniolo (eds.), Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94), September 12-15, Santiago de Chile, Chile, pp. 487-499.
2.Baghel, R. and Dhir, R. (2010), “A Frequent Concepts Based Document Clustering Algorithm”, International Journal of Computer Applications, Vol. 4, No. 5, pp. 6-12.
3.Baziz, M., Boughanem, M., and Aussenac-Gilles, N. (2005), “Conceptual indexing based on document content representation information”, Lecture Notes in Computer Science, Vol. 3507/2005, pp. 2021-2043.
4.Beil, F., Ester, M., and Xu, X.W. (2002), “Frequent term-based text clustering”, In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, Alberta, Canada, pp. 436-442.
5.Bhogal, J., Macfarlane, A., and Smith, P. (2007), “A review of ontology based query expansion”, Information Processing and Management, Vol. 43, No. 4, pp. 866-886.
6.Bisson, G., Nedellec, C., and Canamero, L. (2000), “Designing clustering methods for ontology building – The Mo’K Workbench”, In Proceedings of the ECAI Ontology Learning Workshop, August 25, Berlin, Germany, available at: http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-31/GBisson_7.pdf.
7.Birkhoff, G. (1967), Lattice Theory, American Mathematical Society, Providence, RI.
8.Chen, C.L., Tseng, F.C.S., and Liang, T. (2010), “Mining fuzzy frequent itemsets for hierarchical document clustering”, Information Processing and Management, Vol. 46, No. 2, pp. 193-211.
9.Caraballo, S.A. (1999), “Automatic construction of a hypernym-labeled noun hierarchy from text”, In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), College Park, Maryland, pp. 120-126.
10.Carpineto, C., and Romano, G. (1996), “A lattice conceptual clustering system and its application to browsing retrieval”, Machine Learning, Vol. 24, No. 2, pp. 95-122.
11.Carpineto, C., and Romano, G. (2004), “Exploiting the potential of concept lattices for information retrieval with CREDO”, Journal of Universal Computer Science, Vol. 10, No. 8, pp. 985-1013.
12.Chen, H., and Ng, T. (1995), “An algorithmic approach to concept exploration in a large knowledge network (automatic thesaurus consultation): Symbolic branch and bound search versus connectionist Hopfield net activation”, Journal of the American Society for Information Science, Vol. 46, No. 5, pp. 348-369.
13.Chen, H., Yim, T., and Frey, D. (1995), “Automatic thesaurus generation for an electronic community system”, Journal of the American Society for Information Science, Vol. 46, No. 3, pp. 175-193.
14.Cheung, K.S.K., and Vogel, D. (2005), “Complexity reduction in lattice-based information retrieval”, Information Retrieval, Vol. 8, No. 2, pp. 285-299.
15.Chu, W.W., Liu, Z., and Mao, W. (2002), “Textual document indexing and retrieval via knowledge sources and data mining”, In Proceedings of the Communication of the Institute of Information and Computing Machinery (CIICM), available at: http://www.cobase.cs.ucla.edu/tech-docs/wenlei/ciicm02.pdf.
16.Cimiano, P., Hotho, A., and Staab, S. (2005), “Learning Concept Hierarchies from Text Corpora Using Formal Concept Analysis”, Journal of Artificial Intelligence Research, Vol. 24, pp. 305-339.
17.Cole, R., and Eklund, P. (2001), “Browsing semi-structured web texts using Formal Concept Analysis”, Lecture Notes in Computer Science, Vol. 2120/2001, pp. 319-332.
18.De Buenaga, M., Gómez, J.M., and Díaz, B. (2000), “Using WordNet to complement training information in text categorization”, In Recent Advances in Natural Language Processing II: Selected Papers from RANLP’97, Current Issues in Linguistic Theory (CILT), Vol. 189, John Benjamins, pp. 353-364.
19.Edith, H., Rene, A.G., Carrasco-Ochoa, J.A., and Martínez-Trinidad, J.F. (2006), “Document clustering based on maximal frequent sequences”, Lecture Notes in Computer Science, Vol. 4139/2006, pp. 257-267.
20.Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, G. (2002), “Placing search in context: the concept revisited”, ACM Transactions on Information Systems (TOIS), Vol. 20, No. 1, pp. 116-131.
21.Faure, D., and Nedellec, C. (1998), “A corpus-based conceptual clustering method for verb frames and ontology”, In P. Velardi (ed.), Proceedings of the LREC Workshop on Adapting Lexical and Corpus Resources to Sublanguages and Applications, May 26, Granada, Spain, pp. 5-12.
22.Formica, A. (2006), “Ontology-based concept similarity in Formal Concept Analysis”, Information Sciences, Vol. 176, No. 18, pp. 2624-2641.
23.Formica, A. (2008), “Concept similarity in Formal Concept Analysis: An information content approach”, Knowledge-based systems, Vol. 21, No. 1, pp. 80-87.
24.Friedman, J.H. (1994), “An Overview of Predictive Learning and Function Approximation”, In V. Cherkassky, J.H. Friedman, and H. Wechsler (eds.), From Statistics to Neural Networks: Theory and Pattern Recognition Applications (NATO ASI Series / Computer and Systems Sciences), Springer, Germany, available at: http://www.mosuma.net/teach/ci6124/friedman1994.pdf.
25.Fung, B., Wang, K., and Ester, M. (2003), “Hierarchical document clustering using frequent itemsets”, In Proceedings of the Third SIAM International Conference on Data Mining, May 1-3, San Francisco, CA, available at: http://www.cs.sfu.ca/~bfung/personal/pub/FungMSc_FreqDocCluster.pdf.
26.Ganter, B., and Wille, R. (1999), Formal Concept Analysis: Mathematical Foundations, Springer, Berlin.
27.Gonzalo, J., Verdejo, F., Chugur, I., and Cigarran, J. (1998), “Indexing with WordNet synsets can improve text retrieval”, In Proceedings of the COLING/ACL'98 Workshop on Usage of WordNet for NLP, Montreal, pp. 38-44.
28.Green, S.J. (1997), “Building hypertext links in newspaper articles using semantic similarity”, In Proceedings of the Third Workshop on Applications of Natural Language to Information Systems (NLDB ‘97), Vancouver, Canada, pp. 178-190.
29.Green, S.J. (1999), “Building hypertext links by computing semantic similarity”, IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 5, pp. 713-730.
30.Grossman, D.A., and Frieder, O. (2004), Information Retrieval: Algorithms and Heuristics, Springer.
31.Grootjen, F.A., and van der Weide, T.P. (2006), “Conceptual query expansion”, Data and Knowledge Engineering, Vol. 56, No. 2, pp. 174-193.
32.Hearst, M. (1992), “Automatic acquisition of hyponyms from large text corpora”, In Proceedings of the 14th conference on Computational linguistics, August 23-28, 1992, Nantes, France, pp. 539-545.
33.Hindle, D. (1990), “Noun classification from predicate-argument structures”, In Proceedings of the 14th International Conference on Computational Linguistics, August 20-25, Helsinki, Finland, available at: http://acl.ldc.upenn.edu/P/P90/P90-1034.pdf.
34.Hotho, A., and Stumme, G. (2002), “Conceptual Clustering of Text Clusters”, In Proceedings of FGML Workshop Special Interest Group of German Informatics Society FGML, pp. 37-45.
35.Hotho, A., Staab, S., and Stumme, G. (2003), “Wordnet improves text document clustering”, In Proceedings of the SIGIR 2003 Semantic Web Workshop, pp. 541-544.
36.Houston, A.L., Chen, H., Schatz, B.R., Hubbard, S.M., Sewell, R.R., and Ng, T.D. (2000), “Exploring the use of concept spaces to improve medical information retrieval”, Decision Support Systems, Vol. 30, No. 2, pp. 171-186.
37.Huang, X., Huang, Y.R., and Wen, M. (2005), “A dual index model for contextual information retrieval”, In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 15-19, Salvador, Brazil, pp. 613-614.
38.Hung, C. and Wermter, S. (2004), “Neural network based document clustering using WordNet ontologies”, International Journal of Hybrid Intelligent Systems, Vol. 1, No. 3-4, pp. 127-142.
39.Hung, C., Wermter, S., and Smith, P. (2004), “Hybrid neural document clustering using guided self-organization and WordNet”, IEEE Intelligent Systems, Vol. 19, No. 2, pp. 68-77.
40.Kim, M., and Compton, P. (2001), “Formal Concept Analysis for Domain-specific Document retrieval systems”, Lecture Notes in Computer Science, Vol. 2256/2001, pp. 73-88.
41.Kovacs, L. (2002), “Document clustering based on concept lattice”, IEEE International Conference on Systems, Man and Cybernetics, October 6-9, available at: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1175673.
42.Lee, C.S., Kao, Y.F., Kuo, Y.H., and Wang, M.H. (2007), “Automated ontology construction for unstructured text documents”, Data and Knowledge Engineering, Vol. 60, No. 3, pp. 547-566.
43.Li, Y.J., Chung, S.M., and Holt, J.D. (2008), “Text document clustering based on frequent word meaning sequences”, Data and Knowledge Engineering, Vol. 64, No. 1, pp. 381-404.
44.Lindig, C. (2000), “Fast concept analysis”, In G. Stumme (ed.), Working with conceptual structures-contributions to ICCS 2000, Springer-Verlag, Aachen, Germany, pp. 152-161.
45.Liu, Z., and Chu, W.W. (2007), “Knowledge-based query expansion to support scenario-specific retrieval of medical free text”, Information Retrieval, Vol. 10, No. 2, pp. 173-202.
46.Mandala, R., Tokunaga, T., and Tanaka, H. (1999), “Combining multiple evidence from different types of thesaurus for query expansion”, In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 15-19, Berkeley, CA, USA, pp. 191-197.
47.Mandala, R., Tokunaga, T., and Tanaka, H. (2000), “Query expansion using heterogeneous thesauri”, Information Processing and Management, Vol. 36, No. 3, pp. 361-378.
48.Miller, G. (1995), “WordNet: A Lexical Database for English”, Communications of the ACM, Vol. 38, No. 11, pp. 39-41.
49.Minker, J., Wilson, G., and Zimmerman, B. (1972), “An evaluation of query expansion by the addition of clustered terms for a document retrieval system”, Information Storage and Retrieval, Vol. 8, No. 6, pp. 329-348.
50.Moldovan, D., and Novischi, A. (2004), “Word sense disambiguation of WordNet glosses”, Computer Speech and Language, Vol. 18, No. 3, pp. 301-317.
51.Morita, K., Atlam, E.S., Fuketa, M., Tsuda, K., Oono, M., and Aoe, J.I. (2004), “Word classification and hierarchy using co-occurrence word information”, Information Processing and Management, Vol. 40, No. 6, pp. 957-972.
52.Navigli, R., and Velardi, P. (2003), “An analysis of ontology-based query expansion strategies”, In Proceedings of the 14th European Conference on Machine Learning (ECML 2003), Workshop on Adaptive Text Extraction and Mining (ATEM 2003), September 22, Cavtat-Dubrovnik, Croatia, pp. 42-49.
53.Peat, H., and Willett, P. (1991), “The limitations of term co-occurrence data for query expansion in document retrieval systems”, Journal of the American Society for Information Science, Vol. 42, No. 5, pp. 378-383.
54.Pereira, F., Tishby, N., and Lee, L. (1993), “Distributional clustering of English words”, In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 183-190.
55.Petersen, W. (2004), “A set-theoretical approach for the induction of inheritance hierarchies”, Electronic Notes in Theoretical Computer Science, Vol. 53, pp. 296-308.
56.Qiu, Y., and Frei, H.P. (1993), “Concept based query expansion”, In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, June 27-July 01, Pittsburgh, PA, USA, pp. 160-169.
57.Rajapakse, R.K., and Denham, M. (2006), “Text retrieval with more realistic concept matching and reinforcement learning”, Information Processing and Management, Vol. 42, No. 5, pp. 1260-1275.
58.Recupero, D.R. (2007), “A new unsupervised method for document clustering by using WordNet lexical and conceptual relations”, Information Retrieval, Vol. 10, No. 6, pp. 563-579.
59.Rosso, P., Ferretti, E., Jimenez, D., and Vidal, V. (2004), “Text categorization and information retrieval using WordNet senses”, In Proceedings of the Second International WordNet Conference (GWC), January 20-23, Brno, Czech Republic, available at: http://www.fi.muni.cz/gwc2004/proc/110.pdf.
60.Salton, G. (1971), The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall.
61.Salton, G., Yang, C.S. and Wong, A. (1975), “A vector-space model for automatic indexing”, Communications of the ACM, Vol. 18, No. 11, pp. 613-620.
62.Schatz, B.R., Johnson, E.H., Cochrane, P.A., and Chen, H. (1996), “Interactive term suggestion for users of digital libraries: using subject thesauri and co-occurrence lists for information retrieval”, In Proceedings of the First ACM International Conference on Digital Libraries, March 20-23, Bethesda, MD, USA, pp. 126-133.
63.Sedding, J., and Kazakov, D. (2004), “WordNet-based text document clustering”, In Proceedings of the Third Workshop on Robust Methods in Analysis of Natural Language Data (COLING 2004), August 29, Geneva, Switzerland, pp. 104-113.
64.Seo, H., Chung, H., Rim, H., Myaeng, S., and Kim, S. (2004), “Unsupervised word sense disambiguation using WordNet relatives”, Computer Speech and Language, Vol. 18, No. 3, pp. 253-273.
65.Smeaton, A., and van Rijsbergen, C.J. (1983), “The retrieval effects of query expansion on a feedback document retrieval system”, Computer Journal, Vol. 26, No. 3, pp. 239-246.
66.Sporleder, C. (2002), “A Galois lattice based approach to lexical inheritance hierarchy learning”, In Proceedings of the ECAI 2002 Workshop on Machine Learning and Natural Language Processing for Ontology Engineering, July 22-2, Lyon, France, available at: http://www-sop.inria.fr/acacia/WORKSHOPS/ECAI2002-OLT/Proceedings/Sporleder.pdf.
67.Steinbach, M., Karypis, G., and Kumar, V. (2000), “A comparison of document clustering techniques”, In Proceedings of the KDD Workshop on Text Mining, August 20, Boston, MA, available at: http://www.cs.cmu.edu/~dunja/KDDpapers/Steinbach_IR.pdf.
68.Tho, Q.T., Hui, S.C., Fong, A.C.M., and Cao, T.H. (2006), “Automatic fuzzy ontology generation for semantic web”, IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 6, pp. 842-856.
69.Vechtomova, O., Robertson, S., and Jones, S. (2003), “Query expansion with long-span collocates”, Information Retrieval, Vol. 6, No. 2, pp. 251-272.
70.Voorhees, E. (1993), “Using WordNet to disambiguate word senses for text retrieval”, In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, PA, USA, June 27-July 01, 1993, pp. 171-180.
71.Voorhees, E. (1994), “Query expansion using lexical-semantic relations”, In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, July 03-06, 1994, pp. 61-69.
72.Weng, S.S., Tsai, H.J., Liu, S.C., and Hsu, C.H. (2006), “Ontology construction for information classification”, Expert Systems with Applications, Vol. 31, No. 1, pp. 1-12.
73.Wille, R. (1982), “Restructuring Lattice Theory: an Approach Based on Hierarchies of Concepts”, In I. Rival (ed.), Ordered Sets, D. Reidel Publishing Company, Dordrecht-Boston, pp. 445-470.
74.Xu, J., and Croft, W.B. (2000), “Improving effectiveness of information retrieval with local context analysis”, ACM Transactions on Information Systems (TOIS), Vol. 18, No. 1, pp. 79-112.
75.Zhang, W., Yoshida, T., Tang, X., and Wang, Q. (2010), “Text clustering using frequent itemsets”, Knowledge-Based Systems, Vol. 23, No. 5, pp. 379-388.
76.Zheng, H.T., Kang, B.Y. and Kim, H.G. (2009), “Exploiting noun phrases and semantic relationships for text document clustering”, Information Sciences, Vol. 179, No. 13, pp. 2249-2262.

Advisor

Shihchieh Chou (周世傑)

Approval Date

2011-7-12