Searching Documents with Composite Chinese Word Segmentations (應用合併斷詞搜尋中文文件之研究)

Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/25844

Title:	Searching Documents with Composite Chinese Word Segmentations (應用合併斷詞搜尋中文文件之研究)
Author:	Chun-Wei Lin (林俊偉)
Contributor:	Graduate Institute of Business Administration
Keywords:	Keyword Search; Composite Chinese Word Segmentation; Chinese Word Segmentation
Date:	2010-01-22
Upload time:	2010-06-11 16:49:31 (UTC+8)
Publisher:	National Central University Library
Abstract:	In natural language, the word is the basic unit. Because a document is made up of many words, those words must be separated before any further analysis can be done. Chinese text, unlike English, has no spaces to mark the boundaries between words, so Chinese words must first be identified through word segmentation. The Chinese Word Segmentation System of Academia Sinica (CKIP) is the most representative system for this task. Its design, however, is limited by the vocabulary in its lexicon: for words the lexicon does not contain (called unknown words), it cannot produce an accurate segmentation. The purpose of this research is to address the unknown-word problem and raise the accuracy of the segmentation system, so that subsequent keyword selection yields keywords that represent the meaning of a document and can be used for information retrieval. Taking the output of the existing Chinese word segmentation system as its basis, this research merges segmentation results to reassemble complete words and restore their original meaning, compensating for unknown words that go undetected once they have been split apart. Keywords are then extracted from both the original segmentation results and the merged words. The results show that the precision and recall of retrieval with merged words are far better than those of the original segmentation system and of Google search, which also confirms the effectiveness of the system.
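The merging step and the evaluation measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the merging rule, so everything here is an assumption made for illustration: the function names (`composite_candidates`, `repeated_composites`, `precision_recall`), the choice to join only adjacent tokens, the `max_parts` and `min_count` parameters, and the sample segmenter output are all hypothetical and are not taken from the thesis or from CKIP.

```python
from collections import Counter


def composite_candidates(tokens, max_parts=2):
    """Join runs of 2..max_parts adjacent tokens into candidate composite words.

    Assumption: only adjacent tokens are merged; the thesis may use another rule.
    """
    out = []
    for n in range(2, max_parts + 1):
        for i in range(len(tokens) - n + 1):
            out.append("".join(tokens[i:i + n]))
    return out


def repeated_composites(sentences, max_parts=2, min_count=2):
    """Keep merged candidates that occur at least min_count times (hypothetical filter)."""
    counts = Counter()
    for tokens in sentences:
        counts.update(composite_candidates(tokens, max_parts))
    return {word: c for word, c in counts.items() if c >= min_count}


def precision_recall(retrieved, relevant):
    """Standard precision and recall over sets of retrieved and relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical segmenter output in which an unknown word has been split apart.
segmented = [
    ["歐", "巴馬", "訪問", "台灣"],
    ["歐", "巴馬", "發表", "演說"],
]
merged = repeated_composites(segmented)
```

In this made-up input the split pieces "歐" and "巴馬" recur together in both sentences, so the frequency filter keeps the merged form while discarding one-off joins such as "訪問台灣". A real system would need a stronger filter than raw repetition.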
Appears in collections:	[Graduate Institute of Business Administration] Theses and Dissertations

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	758

All items in NCUIR are protected by their original copyright.
