Google文字關聯在多領域文件分類上的應用; Using Google’s Keyword Relation in Multi-Domain Document Classification


Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/49029

Title:	Google文字關聯在多領域文件分類上的應用; Using Google’s Keyword Relation in Multi-Domain Document Classification
Author:	陳棅易; Ping-I Chen
Contributor:	Graduate Institute of Information Management (資訊管理研究所)
Keywords:	keyword sequence; information retrieval; classification; similarity distance
Date:	2011-12-19
Upload time:	2012-01-05 15:13:36 (UTC+8)
Abstract:	Classifying information automatically and efficiently has become increasingly important. Search engines let us gather knowledge of all kinds to support decision making, and document classification systems help manage the resulting knowledge repository. Traditional document classification first requires downloading every document to a local computer, then extracts candidate keywords by computing keyword importance to form each document's representative sequence, and finally classifies the documents with a vector comparison algorithm. Such systems need a keyword vector, often thousands of words long, to represent each knowledge domain, so the computational complexity of classification is high. Users, however, now routinely browse documents and web pages from many different knowledge domains. Training keywords for every domain in order to extract representative sequences for cross-domain classification wastes resources and is inefficient, and the dimensionality of each domain's vector grows very large as information is continually updated and extended, consuming large amounts of computation and storage.
This thesis describes a keyword network called the “Word AdHoc Network” (WANET), built on our own improved version of the GCD algorithm, in which the importance of each word is computed from the ratio of the numbers of web pages that Google returns for the keywords. A sequence extraction algorithm then selects the K most representative keywords in the network (K ≤ 4) as the document's representative sequence. Because these sequences are so short, traditional vector comparison algorithms do not apply, so we also propose a search-engine-based similarity measure, “Google Purity,” which calculates the similarity between extracted keyword sequences so that similar documents are classified together. Because every algorithm in the system is computed from search-engine page counts, the system can classify information across different knowledge domains in real time, without any pre-established keyword repository.
Our experiments confirm that classification precision is very high when the specialized terms in the documents to be classified are seldom used in other domains. The system's only weakness is that its frequent Google queries make execution slower than traditional vector comparison such as the cosine method. However, no training set has to be collected in advance, and each domain's vector stays at no more than four keywords (a 4-gram) instead of growing without bound as documents are added, so the proposed method is more efficient than the traditional approach in the long run. We believe further refinement can improve both precision and computational efficiency, helping users organize the information they have studied, and that the same algorithms could find useful information to recommend to users in real time as supplementary reading. The system can thus improve the efficiency of document classification and make it more usable in Web-based information management.
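The abstract does not give the formulas behind WANET or Google Purity, so the following is only a minimal sketch of the general idea of scoring keywords and comparing short keyword sequences from page counts. It substitutes the published Normalized Google Distance for the thesis's own measures, and the page counts, `TOTAL_PAGES`, and the helper names `hits`, `top_k_keywords`, and `sequence_distance` are all hypothetical.

```python
import math
from itertools import product

# Hypothetical page counts, invented purely for illustration.
# A real system would obtain them by querying a search engine.
TOTAL_PAGES = 1_000_000  # assumed size of the search index
PAGE_COUNTS = {
    frozenset({"protein"}): 20_000,
    frozenset({"genome"}): 10_000,
    frozenset({"enzyme"}): 8_000,
    frozenset({"router"}): 12_000,
    frozenset({"packet"}): 15_000,
    frozenset({"protein", "genome"}): 4_000,
    frozenset({"protein", "enzyme"}): 4_000,
    frozenset({"genome", "enzyme"}): 2_000,
    frozenset({"router", "packet"}): 6_000,
    frozenset({"protein", "router"}): 20,
    frozenset({"protein", "packet"}): 10,
    frozenset({"genome", "router"}): 10,
    frozenset({"genome", "packet"}): 10,
}


def hits(*terms):
    """Number of pages containing all the given terms (0 if unknown)."""
    return PAGE_COUNTS.get(frozenset(terms), 0)


def ngd(x, y):
    """Normalized Google Distance; smaller values mean more related terms."""
    if x == y:
        return 0.0
    fx, fy, fxy = hits(x), hits(y), hits(x, y)
    if min(fx, fy, fxy) == 0:
        return math.inf  # the terms never co-occur in the sample data
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(TOTAL_PAGES) - min(lx, ly))


def top_k_keywords(candidates, k=4):
    """Keep the k candidates with the highest co-occurrence ratio.

    The score is a stand-in chosen for this sketch, not the thesis's formula.
    """
    def score(word):
        own = hits(word)
        if own == 0:
            return 0.0
        shared = sum(hits(word, other) for other in candidates if other != word)
        return shared / own

    return sorted(candidates, key=score, reverse=True)[:k]


def sequence_distance(seq_a, seq_b):
    """Mean NGD over all cross pairs of two short keyword sequences."""
    pairs = list(product(seq_a, seq_b))
    return sum(ngd(a, b) for a, b in pairs) / len(pairs)


bio_a = ["protein", "genome"]
bio_b = ["enzyme", "genome"]
net = ["router", "packet"]
print(sequence_distance(bio_a, bio_b))  # same-domain sequences
print(sequence_distance(bio_a, net))    # cross-domain sequences
print(top_k_keywords(["protein", "genome", "enzyme", "router", "packet"], k=4))
```

With these invented counts, the two biology sequences come out closer to each other than either is to the networking sequence, which is the behavior a page-count-based measure needs in order to group short keyword sequences by domain. Each call to `hits` stands for one search query, which illustrates why query volume dominates the running time of such a system.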
Appears in collections:	[Graduate Institute of Information Management] Master's and Doctoral Theses

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	707

All items in NCUIR are protected by original copyright.
