An Approach to Aid Document Clustering Based on Word Connectivity

Name

Chiao-Hsin Chang (張巧欣)

Department

Department of Information Management

Thesis title

An Approach to Aid Document Clustering Based on Word Connectivity
(一個應用字詞連結度協助文件分群之方法)

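Related theses

★ A Decision Support System for Creating SMS Rules to Prevent Credit Card Fraud
★ A Comparison of the Effectiveness of Different Retrieval Strategies
★ An Investigation of Factors Influencing the Knowledge-Sharing Process
★ Construction and Evaluation of a Retrieval Agent System with Sharing Functions
★ A Study of Computer Attitudes and Learning Self-Efficacy among Juvenile Offenders
★ A Study of Software Metrics Issues Using the AHP Method
★ Optimizing an Intrusion Rule Base
★ A Study on Improving the Efficiency and Quality of Business Information Extraction
★ Measuring Key Factors in the Banking Industry's Adoption of Enterprise Application Integration (EAI) Systems Using the Analytic Hierarchy Process
★ Applying Genetic Algorithms to Find Near-Optimal Layouts of Forced-Convection Devices in Cluster Computer Rooms
★ The Development of a CASE Tool with Knowledge Management Functions
★ A Fast Search Index Tree Based on the PAT Tree
★ A Compound-Noun-Based Approach to Building Document Concepts
★ Using User Interest Profiles to Examine the Importance of Adjective Position in Review Classification
★ Helping Users Filter Web Document Search Results through Semi-Structured Information and User Feedback
★ A Study of User Review Classification with a Vector Space Model Built from Feature-Opinion Pairs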

Full text: not publicly available (permanently restricted)

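Abstract (translated from Chinese)

With the growth of the Internet, the volume of information has increased rapidly and
information overload has become increasingly severe. To manage this volume efficiently,
data must be processed appropriately so that users can organize it and reach genuinely
useful information faster. Traditional document clustering mainly uses the weights of
terms in documents as the basis of a vector space model, which faces several challenges:
with large datasets, the high-dimensional sparse matrix is computationally expensive and
performs poorly; terms are treated as independent, so relationships between terms in the
text cannot be captured; and not all terms are equally important. This study proposes a
method that analyzes the connectivity between words to form word clusters, and uses these
word clusters to aid document clustering. First, keywords carrying more information are
extracted from the dataset as the basis for the word clusters. Next, keywords are merged
according to their average connectivity scores to form word clusters, which are then used
to represent documents for clustering. Experimental results show that the proposed method
improves clustering performance and better expresses the relationships of terms within
the dataset and among the terms themselves.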
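
The keyword-merging step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the connectivity score, so this example assumes a Jaccard co-occurrence measure over documents and an average-link merge with a stopping threshold. The function names (`connectivity`, `average_link`, `merge_keywords`) and the `threshold` parameter are hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def connectivity(docs):
    """Assumed connectivity: Jaccard overlap of the document sets of two words."""
    postings = {}
    for i, doc in enumerate(docs):
        for word in set(doc):
            postings.setdefault(word, set()).add(i)
    conn = {}
    # Keys are stored as (a, b) with a < b so lookups can be normalized.
    for a, b in combinations(sorted(postings), 2):
        shared = len(postings[a] & postings[b])
        if shared:
            conn[(a, b)] = shared / len(postings[a] | postings[b])
    return conn


def average_link(c1, c2, conn):
    """Mean connectivity over all cross-cluster word pairs."""
    total = sum(conn.get((min(a, b), max(a, b)), 0.0) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def merge_keywords(keywords, conn, threshold):
    """Greedily merge the best-connected pair of clusters until none reaches threshold."""
    clusters = [frozenset([k]) for k in keywords]
    while len(clusters) > 1:
        score, i, j = max(
            (average_link(clusters[i], clusters[j], conn), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if score < threshold:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

Each document is passed as a list of already preprocessed tokens. The pairwise search is quadratic in the number of clusters at every step, which is tolerable only because the method works on a reduced keyword set rather than the full vocabulary.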

Abstract (English)

The World Wide Web continues to grow at a remarkable pace, producing a rapidly increasing number of documents. As information overload becomes more serious than ever, developing new methods for managing this information is an important issue. Most document clustering algorithms represent documents in the vector space model, which considers all dimensions (terms) when measuring similarity. This model has several weaknesses. First, computation is costly in high-dimensional settings. Second, it treats terms as independent and equally important. In this paper, we propose a method to aid document clustering. We first analyze the degree of word connectivity, then group keywords into keyword clusters. Finally, each document is scored against the keyword clusters and assigned to the keyword cluster with the highest score. Our experimental results show that the proposed approach improves clustering performance.
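
The final assignment step, in which each document goes to its highest-scoring keyword cluster, might look like the following sketch. The abstract does not give the scoring function, so this example assumes the score is the fraction of a cluster's keywords that appear in the document. `cluster_score` and `assign_documents` are hypothetical names, and the keyword clusters are assumed to be non-empty sets produced by an earlier step.

```python
def cluster_score(doc_tokens, keyword_cluster):
    """Assumed score: fraction of the cluster's keywords present in the document."""
    present = set(doc_tokens) & set(keyword_cluster)
    return len(present) / len(keyword_cluster)


def assign_documents(docs, keyword_clusters):
    """Return, for each document, the index of its best-scoring cluster.

    A document that shares no keyword with any cluster is labeled None.
    """
    labels = []
    for doc in docs:
        scores = [cluster_score(doc, kc) for kc in keyword_clusters]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(best if scores[best] > 0 else None)
    return labels
```

Because each document is compared with a handful of keyword clusters instead of the full term space, the comparison avoids the high-dimensional sparse vectors that the abstract identifies as a weakness of the vector space model. Ties are broken in favor of the lower cluster index, which is an arbitrary choice made for this sketch.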

Keywords (Chinese)

★ Document clustering (文件分群)
★ Vector space model (向量空間模型)
★ Connectivity (連結度)
★ Word cluster (字詞群集)

Keywords (English)

★ Document Clustering
★ Vector Space Model
★ Word Connectivity
★ Keyword Cluster

Table of contents

1. Introduction
1-1 Research Background and Motivation
1-2 Research Objectives
1-3 Research Scope and Limitations
1-4 Thesis Organization
2. Literature Review
2-1 Document Representation
2-2 Clustering-Related Research
2-2-1 K-means
2-2-2 Hierarchical Clustering
2-2-3 Density-Based Clustering
2-2-4 Frequent Itemsets Based Clustering
2-2-5 Using Topic Keyword Clusters for Document Clustering
2-3 Association Rule Mining
2-4 Feature Term Extraction
2-4-1 Document Frequency Threshold
2-4-2 Information Gain (IG)
2-4-3 Chi-Square Test (CHI)
2-4-4 Mutual Information (MI)
3. Research Method
3-1 Document Preprocessing
3-1-1 Removing Stopwords
3-1-2 Removing Non-Word Symbols
3-1-3 Part-of-Speech Tagging
3-1-4 Stemming
3-2 Finding Keywords by Word Connectivity
3-3 Forming Word Clusters from Keywords
3-4 Assigning Documents to Word Clusters
4. Evaluation and Analysis of Experimental Results
4-1 Experimental Datasets
4-2 Evaluation Metrics
4-3 Experimental Design
4-4 Experimental Results
4-4-1 Results on the Smaller Dataset
4-4-2 Results on the Larger Dataset
4-5 Discussion and Analysis
4-5-1 Cases Where the Method Is Limited
4-5-2 Cases Where the Method Performs Better
5. Conclusion
5-1 Conclusions and Contributions
5-2 Future Research Directions
References

Advisor

Shih-Chieh Chou (周世傑)

Approval date

2013-7-16