Using Google’s Keyword Relation in Multi-Domain Document Classification (Google文字關聯在多領域文件分類上的應用)

DC Field	Value	Language
DC.contributor	資訊管理學系	zh_TW
DC.creator	陳棅易	zh_TW
DC.creator	Ping-I Chen	en_US
dc.date.accessioned	2011-12-19T07:39:07Z
dc.date.available	2011-12-19T07:39:07Z
dc.date.issued	2011
dc.identifier.uri	http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=974403002
dc.contributor.department	資訊管理學系	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
dc.description.abstract	傳統的文件分類需先將文件都下載到電腦上，接著透過關鍵字重要性計算將潛在關鍵字抽取出來做為文件之代表序列，最後利用文件向量比對演算法進行分類。但是，在網路資訊發展越趨成熟的年代，使用者常常透過網頁瀏覽多種不同領域知識的文件或網頁。若要針對各領域訓練出關鍵字以抽取出代表性序列達到跨領域知識分類的目的，將會造成極大的資源浪費也缺乏效率。而且各領域的序列維度也將會因資訊的無限更新與擴充，而變得極為龐大需要耗費大量運算與儲存資源。本篇論文介紹使用我們自行改良之GCD演算法為基礎，透過每個關鍵字在Google中所擁有網頁數的比率來計算文字的重要性來組成一個關鍵字網路(WANET)。接著利用序列攫取演算法找出文字網路中最具代表性的K個關鍵字 (K≦4)做為代表性序列。由於我們的代表性序列太短，因此傳統的向量比對演算法無法適用在此環境。因此，我們也利用搜尋引擎為基礎的概念做出Google Purity measurement演算法做為向量比對的依據。本系統由於所有演算法都是以搜尋引擎的網頁數值來做為計算依據，所以可達成即時跨領域分類的目的。我們也透過實驗證實了若欲分類的文件包含的專業詞彙較少被其他領域引用的狀態下，可以達到極高的分類精準度。我們系統唯一的缺點在於對Google Query次數太頻繁導致整體執行效率較傳統的向量比對方式差，但是由於我們不需要預先蒐集訓練集，向量也不會跟著文件增加而一直無限制成長。所以長期來看我們提出的方法會比傳統作法有效率。我們相信未來可透過更進一步的改良，使得整體精準度與計算效率能有效提升，將能更加使使用者能有效的整理學習過的資訊，亦能透過相同的演算法找出有用的資訊即時推薦給使用者做為輔助閱讀的依據。	zh_TW
dc.description.abstract	Classifying information automatically and efficiently has become increasingly important in recent years. Search engines let us collect all kinds of knowledge to improve the quality of decision making, and document classification systems help manage the resulting knowledge repository. Such systems typically need to construct a keyword vector, often containing thousands of words, to represent each knowledge domain, so the computational complexity of the classification algorithm is very high. Users must also download all the documents before the keywords can be extracted and the documents classified. In this thesis, we describe a new algorithm called “Word AdHoc Network” and use it to extract the most important sequence of keywords for each document. Each keyword sequence consists of no more than four keywords. We also use a new similarity measurement algorithm, called “Google Purity,” to calculate the similarity between the extracted keyword sequences and group similar documents together. With this system, we can classify information from different knowledge domains at the same time, and all processing runs in real time without any pre-established keyword repository. Our experiments show that classification is highly accurate when the specialized terms in a document are rarely used in other domains. The only weakness of our system is that its execution time is longer than that of the cosine method. However, no time is spent selecting training data, and the vector for each domain never grows beyond four keywords. This new system can improve the efficiency of document classification and make it more usable in Web-based information management.	en_US
DC.subject	文字向量序列	zh_TW
DC.subject	文字檢索	zh_TW
DC.subject	文件分類	zh_TW
DC.subject	相似度比對	zh_TW
DC.subject	keyword sequence	en_US
DC.subject	information retrieval	en_US
DC.subject	classification	en_US
DC.subject	similarity distance	en_US
DC.title	Google文字關聯在多領域文件分類上的應用	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	Using Google’s Keyword Relation in Multi-Domain Document Classification	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Full metadata record for thesis 974403002