The Construction of Document Concept Based on Compound Nouns

Author

Ju-yuan Shih (施儒淵)

Department

Department of Information Management

Thesis title

The construction of document concept based on compound nouns
(以複合名詞為基礎之文件概念建立方式)

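Related theses

★ A decision support system for creating SMS rules to prevent credit card fraud
★ A comparison of the effectiveness of different retrieval strategies
★ An investigation of factors influencing the knowledge-sharing process
★ Construction and evaluation of a retrieval agent system with sharing functions
★ A study of computer attitudes and learning self-efficacy among juvenile offenders
★ A study of software metrics issues using the AHP method
★ Optimizing an intrusion rule base
★ A study on improving the efficiency and quality of business information extraction
★ Using the analytic hierarchy process to assess key factors in the banking industry's adoption of enterprise application integration (EAI) systems
★ A study applying genetic algorithms to find near-optimal layouts of forced-convection devices in cluster computer rooms
★ The Development of a CASE Tool with Knowledge Management Functions
★ A fast search index tree developed on the basis of the PAT tree
★ Using user interest profiles to examine the importance of adjective position in review classification
★ Using semi-structured information and user feedback to help users filter web document search results
★ A study on classifying user reviews with a vector space model built from feature-opinion pairs
★ Applying term characteristics of semi-structured documents from user feedback to document retrieval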

Files

Full text in the repository system: never open to public access

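Abstract (Chinese)

With advances in information technology, the volume of digital data and documents has multiplied. Without information technology to help users search for documents, finding them would become a heavy burden. Having a computer system discriminate between documents automatically is therefore a good way to lighten that burden, and such systems usually use the similarity between documents as the basis for discrimination.
In Information Retrieval (IR), many studies use TF-IDF to represent term weights and build a Vector Space Model from those terms to compute document similarity. In practice, however, compound nouns are common, so a representation whose unit is the single term may fail to capture the compound nouns in a document. In addition, several different words in a document often express the same concept, so representing documents by terms may cause documents that describe the same concept to be judged unrelated merely because they use different words.
This study proposes using compound nouns for concept extraction and computing document similarity with a Vector Space Model whose dimensions are concepts. The compound nouns in a document are identified first; concepts are then extracted with both terms and compound nouns as units; a Vector Space Model is generated with the extracted concepts as dimensions; and document similarity is then compared. Finally, experiments verify that the Vector Space Model with concepts as dimensions is more accurate in document similarity comparison than the Vector Space Model with TF-IDF terms as dimensions.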
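The term-based baseline that the abstract compares against can be illustrated with a minimal sketch. It assumes the textbook TF-IDF weighting (relative term frequency times log inverse document frequency) and cosine similarity; the record does not state which variant the thesis used, and the function names and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: relative term frequency times log(N / df).
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus: the first two documents describe the same
# thing with different words; the third is unrelated.
docs = [
    "car engine repair".split(),
    "automobile motor repair".split(),
    "stock market report".split(),
]
vecs = tf_idf_vectors(docs)
sim_same_topic = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```

In this toy corpus the first two documents share only the word "repair", so their similarity stays low even though they describe the same thing, which is the weakness of term dimensions that the abstract points out.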

Abstract (English)

With advances in information technology, the volume of digital documents and materials has grown rapidly. Without technological support, searching for information would require great human effort. To reduce that effort, systems that discriminate between documents automatically have been developed and applied; such systems usually rely on the similarity between documents. In Information Retrieval, much research uses TF-IDF to weight the terms of a document, forms a Vector Space Model from those terms, and computes document similarity on that model. This approach has two weaknesses. First, documents contain compound nouns as well as single terms, and a term-level representation may fail to capture them. Second, different terms are often used to express the same concept, so documents that describe the same concept may be judged unrelated. This study proposes a method that forms the Vector Space Model with concepts extracted from documents. The method first extracts concepts from the terms and compound nouns of the documents, and then builds a Vector Space Model with these concepts as dimensions. Experimental results show that the concept-based approach outperforms the TF-IDF term-based approach in the accuracy of document similarity computation.
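The two steps named in the abstract can be sketched roughly as follows. This is an illustration under strong assumptions, not the thesis's procedure: compound nouns are formed by simply joining runs of adjacent nouns in already POS-tagged text, and concept extraction is replaced by a hand-written lookup table (`CONCEPT_OF`). The tag set, the table, and all function names are hypothetical.

```python
import math
from collections import Counter

def merge_compound_nouns(tagged):
    """Join each run of adjacent nouns (tag "N") into one unit."""
    units, run = [], []
    for word, tag in tagged:
        if tag == "N":
            run.append(word)
        else:
            if run:
                units.append(" ".join(run))
                run = []
            units.append(word)
    if run:
        units.append(" ".join(run))
    return units

# Hypothetical stand-in for concept extraction: terms and compound
# nouns that express the same idea map to one concept label.
CONCEPT_OF = {
    "car engine": "vehicle_engine",
    "automobile motor": "vehicle_engine",
    "repair": "maintenance",
    "fix": "maintenance",
    "stock market": "finance",
}

def concept_vector(tagged):
    """Represent a document by concept counts (concepts as dimensions)."""
    units = merge_compound_nouns(tagged)
    return Counter(CONCEPT_OF[u] for u in units if u in CONCEPT_OF)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * v.get(c, 0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy documents, already tokenized and tagged (N = noun, V = verb, D = determiner).
doc_a = [("car", "N"), ("engine", "N"), ("needs", "V"), ("repair", "N")]
doc_b = [("fix", "V"), ("the", "D"), ("automobile", "N"), ("motor", "N")]
doc_c = [("stock", "N"), ("market", "N"), ("fell", "V")]

sim_ab = cosine(concept_vector(doc_a), concept_vector(doc_b))
sim_ac = cosine(concept_vector(doc_a), concept_vector(doc_c))
```

Here `doc_a` and `doc_b` share no surface word, yet both map to the same two concepts and so receive a high similarity, while `doc_c` maps to a different concept. A real system would need a POS tagger and a learned or resource-based concept mapping in place of the hand-written table.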

Keywords (Chinese)

★ Information Retrieval
★ Concept Extraction
★ TF-IDF
★ Compound Nouns
★ Vector Space Model

Keywords (English)

★ Information Retrieval
★ TF-IDF
★ Vector Space Model
★ compound nouns
★ Concept Extraction

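Table of contents

Abstract (Chinese)
Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1-1 Research Background and Motivation
1-2 Research Objectives
1-3 Research Scope and Limitations
1-4 Research Process
1-5 Thesis Structure
Chapter 2 Literature Review
2-1 Research on Compound Noun Formation
2-2 Research on Concept Extraction
Chapter 3 System Design
3-1 Document Preprocessing
3-2 Compound Noun Formation
3-3 Concept Extraction
3-4 Document Conceptualization
3-5 Document Similarity Computation
Chapter 4 Experimental Analysis
4-1 Data Set
4-2 Evaluation Method
4-3 Experimental Results
Chapter 5 Conclusion
References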


Advisor

Shih-chieh Chou (周世傑)

Approval date

2009-7-14
