Searching Documents with Composite Chinese Word Segmentations

Name

Chun-Wei Lin (林俊偉)

Department

Department of Business Administration

Thesis Title

Searching Documents with Composite Chinese Word Segmentations
(應用合併斷詞搜尋中文文件之研究)

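Abstract (translated from Chinese)

In natural language, the word is the basic unit. Because a document is made up of many words, those words must be separated before they can be used in later research. Chinese text, unlike English, has no spaces marking the boundaries between words, so Chinese words must first be delimited through word segmentation before any further analysis. The Chinese word segmentation system of Academia Sinica is the most representative system for this task, but its design is limited by the vocabulary in its lexicon: for words the lexicon does not contain (called unknown words), it cannot produce accurate segmentation results.
The purpose of this study is to address the unknown-word problem and improve the accuracy of the segmentation system, so that subsequent keyword selection yields keywords that represent the meaning of a document's content and can be used for information retrieval. Taking the output of the current Chinese word segmentation system as a basis, the study merges segmentation results to piece together complete words and restore their original meaning, compensating for the fact that unknown words cannot be discovered once they have been split apart. Keywords are then extracted from both the original segmentation results and the words produced by merging. The results show that retrieval with merged words achieves far better precision and recall than both the original segmentation system's results and Google search results, which also verifies the effectiveness of the system.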
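
The merging idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the thesis's actual merging rules, so everything here is an assumption made for illustration: the function names `composite_candidates` and `frequent_composites`, the `max_len` and `min_count` parameters, and the rule that a merged term is kept only if it recurs across documents. The input stands in for the output of a segmenter that has split an out-of-lexicon word into pieces.

```python
from collections import Counter


def composite_candidates(tokens, max_len=3):
    """Join every run of 2..max_len adjacent tokens into one candidate term."""
    candidates = []
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append("".join(tokens[i:i + n]))
    return candidates


def frequent_composites(segmented_docs, min_count=2, max_len=3):
    """Keep merged candidates that occur at least min_count times overall.

    Assumption: recurrence is used here as a simple filter for plausible
    words; the thesis may use a different criterion.
    """
    counts = Counter()
    for tokens in segmented_docs:
        counts.update(composite_candidates(tokens, max_len))
    return {term for term, count in counts.items() if count >= min_count}


# Hypothetical segmenter output in which the word "部落格" (blog)
# was split into "部落" + "格" because it is missing from the lexicon.
docs = [
    ["我", "寫", "部落", "格"],
    ["部落", "格", "很", "有趣"],
]
merged_terms = frequent_composites(docs, min_count=2)
print(merged_terms)
```

In this toy input, the only merged candidate that appears in both documents is the rejoined word, so it is the only term that survives the frequency filter. Keywords could then be selected from the union of the original tokens and these merged terms.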

Abstract (English)

In natural language, the word is the most basic element. Because an article is made up of many words, we must separate those words before doing any further research. Chinese text has no spaces marking word boundaries as English text does, so Chinese words must first be separated through word segmentation before further analysis. For Chinese word segmentation, the CKIP system from Academia Sinica is the most prominent. However, its design is constrained by the words in its lexicon, so words not included in it (called unidentified words) can hardly be segmented correctly.
The purpose of this research is to solve the unidentified-word problem: to raise the accuracy of the word segmentation system so that subsequent keyword search processing yields keywords that truly represent the whole article. Starting from the output of the Chinese word segmentation system, I combined the segmentation results to piece together complete words and recover their original meaning, compensating for unidentified words that go undiscovered once they are split apart. I then extracted keywords from both the original segmentation results and the combined words. The results showed that the precision and recall of composite Chinese word segmentation are much better than those of CKIP and Google, which also verifies the validity of the system.
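
Precision and recall, the two measures the abstract reports, can be computed as in the short sketch below. It uses the standard information-retrieval definitions; the function name `precision_recall` and the document identifiers are hypothetical, and the numbers are a toy example rather than results from the thesis.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy query: four documents returned, three documents truly relevant,
# two of the returned documents are among the relevant ones.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(f"precision={p:.2f} recall={r:.2f}")
```

Comparing systems on both measures matters because returning more documents tends to raise recall while lowering precision.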

Keywords (Chinese)

★ 關鍵字檢索 (keyword search)
★ 合併字詞 (merged words)
★ 中文斷詞 (Chinese word segmentation)

Keywords (English)

★ Keyword Search
★ Composite Chinese Word Segmentation
★ Chinese Word Segmentation

Table of Contents

Chinese Abstract
Abstract
Contents
List of Figures
List of Tables
1. Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Research Framework
2. Literature Review
2.1 Text Mining
2.2 Word Segmentation and Keyword Selection
2.3 Problem Discussion
3. System Design
3.1 System Workflow
3.2 Explanation of the System Workflow
4. System Validation
4.1 Evaluation Method
4.2 Experimental Results and Analysis
5. Conclusions and Future Research Topics
References

Advisor

Ping-Yu Hsu (許秉瑜)

Approval Date

2010-01-22
