Web Text Clustering with Dynamic Theme (動態主題截取在網路文件分群之應用)

DC field	Value	Language
DC.contributor	企業管理學系	zh_TW
DC.creator	洪秉儒	zh_TW
DC.creator	BING-RU HONG	en_US
dc.date.accessioned	2011-01-25T07:39:07Z
dc.date.available	2011-01-25T07:39:07Z
dc.date.issued	2011
dc.identifier.uri	http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=984201050
dc.contributor.department	企業管理學系	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
dc.description.abstract	網際網路(Internet)的成長促成了資訊科技與網路的蓬勃發展，在全球資訊網(World Wide Web)的推波助瀾下，人與人間資訊的交流快速且無遠弗屆。然而隨著網際網路使用人口的增加，使用者面對的問題不再是如何從龐大的資料中獲取資訊，而是如何管理與過濾這些資訊。由於人們普遍缺乏足夠的時間一一分析並消化吸收大量的資訊，於是如何從這些龐大的資料中，快速且有效地整理出所需要的資訊，是一個非常重要的議題。資料探勘研究領域發展了許多技術從大量資料中分析出有用的資訊，而文件分群是其中重要的技術。以往的文件分群方式，大多著眼於文件內容摘要(多文件摘要、單文件摘要)及文件中的字彙維度分析，找出少數且重要的關鍵字，來進行內容文件分群。本研究針對網站的熱門商品及服務，依照文件的點閱次數K(d,i)及文件內容摘要所提供的字詞資訊，決定熱門商品及服務的分類別及命名。首先針對文件內容摘要進行中文字詞處理，經由階層式聚合分群法－沃德法分析字詞屬性，來決定文件主題(theme)數目K，以網路的文件內容所包含的字詞之間的關聯性、使用者辨識字詞資訊之點閱次數、使用TTM ( Temporal Text Mining) Cross-Collection Mixture Model，利用動態主題截取處理，以機率分配方式搜集穩定的字詞資訊，取得文件的主題特徵做為文件分類的依據，經由一連串實驗過程 (F-measure、error rate)，來說明本演算法具有效率性，並且改善分群結果的精確性。	zh_TW
dc.description.abstract	The Internet has driven the development of information technology and communication protocols, and the World Wide Web (WWW), an information-sharing model built on top of the Internet, has become a popular platform for information exchange that is not confined by time or space. Because many capable search engines are now available on the Internet, WWW users no longer face the problem of how to obtain information from vast amounts of data, but rather how to manage and filter it. People generally do not have enough time to analyze such immense volumes of data, so quickly and effectively extracting the required information from them is an important issue. Data mining research has developed many techniques for extracting useful information from large volumes of data, and document clustering is one of the most important. Document clustering methods fall into two groups: those based on document metadata and those based on document content. Most previous content-based methods focus on document summaries (single-document or multi-document) and on word-vector analysis, finding a small number of important keywords with which to cluster the documents. In this study, we categorize and name popular goods and services according to their access counts K(d,i) and the words provided by their abstracts. First, we segment the Chinese words in the abstracts of the goods or services and apply a hierarchical agglomerative clustering method, Ward's method, to analyze the properties of the words, group them into themes, and decide the number of themes K. Second, we adopt the TTM (Temporal Text Mining) Cross-Collection Mixture Model to extract dynamic themes and gather stable words by probability distribution, which serve as the feature vectors for document clustering. This study thus proposes a novel approach to document clustering. The approach is based on the correlation among the words in document contents, the popularity (access count) of the words that users recognize, and the extracted dynamic themes, which together serve as the features for document clustering. A series of experiments evaluated by F-measure and error rate shows that the algorithm is effective and improves the accuracy of the clustering results.	en_US
DC.subject	文件分群	zh_TW
DC.subject	動態文字探勘	zh_TW
DC.subject	主題截取	zh_TW
DC.subject	Document Clustering	en_US
DC.subject	Temporal Text Mining	en_US
DC.subject	Extracting Theme	en_US
DC.title	動態主題截取在網路文件分群之應用	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	Web Text Clustering with Dynamic Theme	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Thesis/dissertation 984201050: full metadata record