Thesis 974403002: Full Metadata Record

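DC field | Value | Language
DC.contributor | Department of Information Management | zh_TW
DC.creator | 陳棅易 | zh_TW
DC.creator | Ping-I Chen | en_US
dc.date.accessioned | 2011-12-19T07:39:07Z |
dc.date.available | 2011-12-19T07:39:07Z |
dc.date.issued | 2011 |
dc.identifier.uri | http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=974403002 |
dc.contributor.department | Department of Information Management | zh_TW
DC.description | National Central University | zh_TW
DC.description | National Central University | en_US
dc.description.abstract | Traditional document classification requires first downloading all the documents to a computer, then extracting candidate keywords through keyword-importance calculations to serve as each document's representative sequence, and finally classifying the documents with a document-vector comparison algorithm. However, as information on the Web matures, users often browse documents or web pages from many different knowledge domains. Training keywords for each domain in order to extract representative sequences for cross-domain classification wastes considerable resources and is inefficient. Moreover, because information is continually updated and expanded, the sequence dimension of each domain becomes extremely large and consumes substantial computation and storage resources. This thesis introduces an approach based on our own improved GCD algorithm, which computes the importance of words from the ratio of the number of web pages Google holds for each keyword and uses it to build a keyword network (WANET). A sequence extraction algorithm then finds the K most representative keywords (K ≤ 4) in the word network to serve as the representative sequence. Because our representative sequences are so short, traditional vector comparison algorithms do not apply in this setting. We therefore also developed the Google Purity measurement algorithm, likewise built on the search engine, as the basis for vector comparison. Because every algorithm in the system computes from search-engine page counts, the system can classify across domains in real time. Our experiments also confirm that very high classification precision can be achieved when the specialized terms in the documents to be classified are rarely used in other domains. The only drawback of our system is that it queries Google so frequently that its overall execution efficiency is worse than that of traditional vector comparison. However, because we need not collect a training set in advance and the vectors do not grow without bound as documents are added, our method is more efficient than the traditional approach in the long run. We believe that further improvements can raise overall precision and computational efficiency, helping users organize the information they have studied more effectively, and that the same algorithms can identify useful information and recommend it to users in real time as a reading aid. | zh_TW
dc.description.abstract | Classifying information automatically and efficiently has become increasingly important in recent years. We can collect all kinds of knowledge from search engines to improve the quality of decision making, and use document classification systems to manage the knowledge repository. Document classification systems typically need to construct a keyword vector, often containing thousands of words, to represent each knowledge domain, so the computational complexity of the classification algorithm is very high. Users also need to download all the documents before the keywords can be extracted and the documents classified. In this thesis, we describe a new algorithm called “Word AdHoc Network” and use it to extract the most important keyword sequence for each document. Each keyword sequence is composed of no more than four keywords. We also use a new similarity measurement algorithm, called “Google Purity,” to calculate the similarity between the extracted keyword sequences and group similar documents together. With this system, we can classify information from different knowledge domains at the same time, and everything runs in real time without any pre-established keyword repository. Our experiments show that the classification results are very accurate and useful. The only weakness of our system is that its execution time is longer than that of the cosine method. However, it saves the time otherwise spent choosing training data, and the vector for each domain remains only four keywords long. This new system can improve the efficiency of document classification and make it more usable in Web-based information management. | en_US
DC.subject | keyword vector sequence | zh_TW
DC.subject | text retrieval | zh_TW
DC.subject | document classification | zh_TW
DC.subject | similarity matching | zh_TW
DC.subject | keyword sequence | en_US
DC.subject | information retrieval | en_US
DC.subject | classification | en_US
DC.subject | similarity distance | en_US
DC.title | Application of Google Word Relations to Multi-Domain Document Classification | zh_TW
dc.language.iso | zh-TW | zh-TW
DC.title | Using Google’s Keyword Relation in Multi-Domain Document Classification | en_US
DC.type | Master's and doctoral thesis | zh_TW
DC.type | thesis | en_US
DC.publisher | National Central University | en_US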
