Web Text Clustering with Dynamic Theme

Author

洪秉儒 (BING-RU HONG)

Department

Department of Business Administration

Thesis Title

Web Text Clustering with Dynamic Theme
(動態主題截取在網路文件分群之應用)

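Related Theses

★ Interactive Recommendation on Social Networking Sites and the Effect of User Behavior on Its Effectiveness
★ An AHP Analysis of the Weighting of Supplier Selection Criteria for Major Server Brands
★ An AHP Study of Key Factors in Choosing Operations Center Locations in the Smartphone Industry
★ Evaluating Operating Performance in the Solar Photovoltaic Industry Using Data Envelopment Analysis
★ Constructing a Model for Comparing National Competitiveness in the Solar Cell Industry
★ Exploring the Relationship Between Business Cycle Indicators and Import/Export Values Using Sequence Mining
★ The Effect of ERP Project Team Composition on Performance
★ Recommending Journal Articles to Suitable Subject Categories
★ Analysis and Comparison of Brand Stories: The Case of the Traditional-Flavor Food Industry
★ Comparing the Factors That Attract Consumer Purchases at Starbucks and Cama Using Means-End Chains
★ The Entrepreneurial Value of Creative Shops: The Cases of Chifeng Street and Minsheng Community
★ Predicting Changes in Corporate Long-Term and Short-Term Borrowing with Leading Indicators
★ Key Factors in Selecting Keyboard Suppliers for Gaming Laptops Using the Analytic Hierarchy Process
★ The Effect of Trust on Knowledge Sharing from the Perspective of Reciprocity and Altruism
★ Recommending Movies with Neural Networks by Combining Personality Traits and Poster Dominant Colors
★ The Relationship Between Data Visualization Charts and Topics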

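This electronic thesis is authorized for immediate open access.
The open-access full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid breaking the law.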

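Abstract (Chinese)

The growth of the Internet has driven the rapid development of information technology and networking, and the World Wide Web has made the exchange of information between people fast and boundless. As the number of Internet users grows, however, the problem users face is no longer how to obtain information from vast amounts of data but how to manage and filter it. Because people generally lack the time to analyze and digest large volumes of information item by item, extracting the needed information from such data quickly and effectively is an important issue. Data mining research has developed many techniques for deriving useful information from large volumes of data, and document clustering is an important one among them. Previous document clustering methods have mostly focused on document summaries (multi-document or single-document) and on analyzing the term dimensions of documents, finding a small number of important keywords with which to cluster documents by content. This study targets a website's popular goods and services and determines their categories and names from each document's click count K(d,i) and the term information provided by the document summaries. First, the document summaries undergo Chinese word processing, and hierarchical agglomerative clustering with Ward's method is used to analyze term attributes and determine the number of document themes K. Then, based on the associations among the terms in the web documents and the click counts of the terms that users recognize, the TTM (Temporal Text Mining) Cross-Collection Mixture Model is applied: dynamic theme extraction gathers stable term information through probability distributions, yielding theme features that serve as the basis for document classification. A series of experiments evaluated by F-measure and error rate shows that the algorithm is efficient and improves the accuracy of the clustering results.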
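The theme-extraction step can be illustrated with a minimal sketch. It is not the Cross-Collection Mixture Model used in the thesis; it is a simpler, single-collection mixture of k unigram themes plus a fixed background model, fitted with EM. The function name `em_themes`, the parameter `lambda_b`, and the toy documents are assumptions made for illustration, and click counts are not modeled here.

```python
import math
import random

def em_themes(docs, k, lambda_b=0.5, iters=50, seed=0):
    """Fit k word distributions ("themes") to bag-of-words documents with EM.

    docs: list of {word: count} dicts (each must be non-empty).
    Assumed model (a simplification, not the thesis's exact model):
        p(w | d) = lambda_b * bg(w) + (1 - lambda_b) * sum_j pi[d][j] * theta[j][w]
    Returns (theta, pi, history), where history holds the log-likelihood
    measured at the start of each iteration.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    total = sum(c for d in docs for c in d.values())
    # Fixed background model: relative word frequency over the whole collection.
    bg = {w: sum(d.get(w, 0) for d in docs) / total for w in vocab}

    # Random, strictly positive initial theme word distributions.
    theta = []
    for _ in range(k):
        raw = {w: rng.random() + 0.1 for w in vocab}
        s = sum(raw.values())
        theta.append({w: v / s for w, v in raw.items()})
    # Uniform initial theme weights for every document.
    pi = [[1.0 / k] * k for _ in docs]

    history = []
    for _ in range(iters):
        theta_acc = [{w: 0.0 for w in vocab} for _ in range(k)]
        pi_acc = [[0.0] * k for _ in docs]
        loglik = 0.0
        for d, doc in enumerate(docs):
            for w, c in doc.items():
                mix = [pi[d][j] * theta[j][w] for j in range(k)]
                p_w = lambda_b * bg[w] + (1 - lambda_b) * sum(mix)
                loglik += c * math.log(p_w)
                for j in range(k):
                    # E-step: posterior that this word came from theme j
                    # (and not from the background).
                    z = (1 - lambda_b) * mix[j] / p_w
                    theta_acc[j][w] += c * z
                    pi_acc[d][j] += c * z
        # M-step: renormalise the expected counts.
        for j in range(k):
            s = sum(theta_acc[j].values())
            theta[j] = {w: v / s for w, v in theta_acc[j].items()}
        for d in range(len(docs)):
            s = sum(pi_acc[d])
            pi[d] = [v / s for v in pi_acc[d]]
        history.append(loglik)
    return theta, pi, history

# Toy usage: four tiny "document summaries" as word counts (made-up data).
toy_docs = [
    {"phone": 3, "screen": 2},
    {"phone": 2, "battery": 2},
    {"hotel": 3, "room": 2},
    {"hotel": 1, "spa": 2},
]
themes, weights, _ = em_themes(toy_docs, k=2)
for row in weights:
    print([round(v, 3) for v in row])
```

Each row of `weights` is a document's distribution over themes. Vectors of this kind are what a later clustering step, such as K-means, could take as input.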

Abstract (English)

The Internet has driven the development of information technology and communication protocols, and the World Wide Web (WWW), an information-sharing model built on top of the Internet, has become a popular platform for information exchange that is not confined by time or space. Because many practical search engines are available on the Internet, WWW users no longer face the problem of how to obtain information from vast amounts of data, but rather how to manage and filter it. People generally do not have enough time to analyze such immense data, so data mining, a technology for quickly and effectively extracting requested information from large data sets, is an important topic. Data mining research has developed many technologies for filtering useful information out of large volumes of data, and document clustering is one of the important ones. There are two approaches to document clustering: one clusters on document metadata and the other on document content. Most previous content-based clustering algorithms focus on document summaries (of single or multiple documents) and on word-vector analysis of documents, finding a few important keywords with which to cluster the documents. In this study, we categorize and name popular goods and services according to their access counts K(d,i) and the words provided by the abstracts of those goods and services. First, we parse the Chinese words in the abstract documents for the goods or services and apply the hierarchical agglomerative clustering method (Ward's method) to analyze the properties of the words, group them into themes, and decide the number of themes K. Second, we adopt the TTM (Temporal Text Mining) Cross-Collection Mixture Model to extract dynamic themes, gathering stable words by probability distribution to serve as the vectors for document clustering. This study thus proposes a novel approach to document clustering.
The approach is based on the correlation among the words in document contents, the popularity (access count) of the words that users recognize, and the extracted dynamic themes, which serve as the features for document clustering. A series of experiments evaluated by F-measure and error rate shows that the algorithm is effective and can improve the accuracy of clustering results.
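The two evaluation measures named in the abstract can be sketched as follows. The abstract does not give their exact definitions, so this is an assumption: the F-measure below is the common class-weighted clustering F-measure (for each true class, take the best F1 over all clusters), and the error rate is taken to be the share of documents that disagree with the majority class of their cluster. The function names are hypothetical.

```python
from collections import Counter

def clustering_f_measure(labels, clusters):
    """Class-weighted F-measure: sum over classes of (n_i / n) * max_j F1(i, j)."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(labels, clusters))  # documents of class i in cluster j
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total

def majority_error_rate(labels, clusters):
    """Assumed definition: fraction of documents not in their cluster's majority class."""
    overlap = Counter(zip(labels, clusters))
    correct = 0
    for clu in set(clusters):
        correct += max(n for (cls, c), n in overlap.items() if c == clu)
    return 1 - correct / len(labels)

# Toy usage with made-up labels: six documents, two true classes, two clusters.
true_labels = ["a", "a", "a", "b", "b", "b"]
assigned = [0, 0, 1, 1, 1, 1]
print(clustering_f_measure(true_labels, assigned))
print(majority_error_rate(true_labels, assigned))
```

A higher F-measure and a lower error rate both indicate clusters that agree more closely with the reference categories.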

Keywords (Chinese)

★ Document clustering (文件分群)
★ Dynamic text mining (動態文字探勘)
★ Theme extraction (主題截取)

Keywords (English)

★ Document Clustering
★ Temporal Text Mining
★ Extracting Theme

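Table of Contents

Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
Section 1 Research Background and Motivation
Section 2 Research Objectives
Section 3 Thesis Organization
Chapter 2 Literature Review
Section 1 Clustering
Section 2 Related Work on Document Clustering
Section 3 Temporal Text Mining (TTM)
Chapter 3 System Design
Section 1 Preprocessing Documents
Section 2 Identify Theme number/Initial themes
Section 3 Processing Cross-Collection Mixture Model
Section 4 Parameters Estimation with EM Algorithm
Section 5 Apply K-means with Theme attribute to Cluster Documents
Chapter 4 Empirical Analysis
Section 1 Data Description and Preprocessing
Section 2 Experimental Simulation and Result Analysis
Section 3 Verification and Comparison of Experimental Results
Chapter 5 Conclusions and Suggestions for Future Research
Section 1 Conclusions
Section 2 Suggestions for Future Research
References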

Advisor

許秉瑜(Ping-Yu Hsu)

Approval Date

2011-1-25
