結合頂層與底層句法資訊之中文詞相依性分析; Incorporating top-level and bottom-level information for Chinese word dependency analysis


Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/9152

Title:	結合頂層與底層句法資訊之中文詞相依性分析; Incorporating top-level and bottom-level information for Chinese word dependency analysis
Author:	吳毓傑; Yu-Chieh Wu
Contributor:	Graduate Institute of Computer Science and Information Engineering (資訊工程研究所)
Keywords:	part-of-speech tagging; word segmentation; Chinese dependency parsing; machine learning
Date:	2007-11-12
Upload time:	2009-09-22 11:42:06 (UTC+8)
Publisher:	National Central University Library
Abstract:	This thesis proposes a unified framework for Chinese dependency parsing that also covers word segmentation and part-of-speech (POS) tagging. We first discuss Chinese word segmentation and POS tagging and show how to cast them as a conventional sequential chunk-labeling task. To train the sequential labeler, we investigate several widely used and effective classification algorithms. In our experiments the best single method, CRF, is more accurate than the other labelers, but it is slow and its cost grows quadratically with the number of labels, which makes Chinese POS tagging impractical. To overcome this, we propose a two-pass sequential chunk-labeling model that combines CRF with SVM: it keeps the accuracy of CRF and relies on the speed and accuracy of SVM to make up for CRF's inefficiency. The experiments show that the two-pass learner clearly outperforms the single-pass methods, including CRF (96.2 vs. 95.9 in F-measure). On the well-known Chinese word segmentation benchmark corpus (SIGHAN-3), our method also achieves nearly the best results. With two-pass Chinese POS tagging, words and their POS labels can be segmented and labeled automatically, and we use these auto-segmented words as the input to dependency parsing. To improve parsing performance, our parser integrates both top-level (top-down) and bottom-level (bottom-up) syntactic information. We also compare against current state-of-the-art dependency parsers. The experimental results show that our method is not only more accurate but also needs much less training and testing time than the other approaches. In addition, we design an approximate K-best reranking method to improve both the overall dependency parse and the word segmentation results. Its advantage is that the modules can be trained independently, with no change to the training procedure, while the global parse is still taken into account at test time through K-best selection.
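
The conversion of word segmentation and POS tagging into sequential chunk labeling, mentioned in the abstract, can be illustrated with a minimal sketch. It assumes a character-level B/I scheme in which each character is labeled with its position in the word plus the word's POS tag. The abstract does not state the exact label set used in the thesis, so the scheme, the function names `words_to_char_labels` and `char_labels_to_words`, and the example tags (PN, VV, NN) are all illustrative assumptions.

```python
from typing import List, Optional, Tuple

TaggedWord = Tuple[str, str]    # (word, POS tag)
LabeledChar = Tuple[str, str]   # (character, chunk label such as "B-NN")


def words_to_char_labels(tagged_words: List[TaggedWord]) -> List[LabeledChar]:
    """Encode segmented, POS-tagged words as one chunk label per character.

    Assumed scheme: the first character of a word gets "B-<POS>",
    every following character of the same word gets "I-<POS>".
    """
    labeled: List[LabeledChar] = []
    for word, pos in tagged_words:
        for i, ch in enumerate(word):
            prefix = "B" if i == 0 else "I"
            labeled.append((ch, f"{prefix}-{pos}"))
    return labeled


def char_labels_to_words(labeled: List[LabeledChar]) -> List[TaggedWord]:
    """Decode per-character chunk labels back into (word, POS) pairs."""
    words: List[TaggedWord] = []
    chars: List[str] = []
    current_pos: Optional[str] = None
    for ch, label in labeled:
        prefix, pos = label.split("-", 1)
        # A "B" label, or an "I" label whose POS differs from the open
        # word, closes the current word and starts a new one.
        if prefix == "B" or pos != current_pos:
            if chars:
                words.append(("".join(chars), current_pos))
            chars, current_pos = [ch], pos
        else:
            chars.append(ch)
    if chars:
        words.append(("".join(chars), current_pos))
    return words


sentence = [("我", "PN"), ("喜歡", "VV"), ("音樂", "NN")]
char_labels = words_to_char_labels(sentence)
recovered = char_labels_to_words(char_labels)
```

Under this assumed scheme every POS tag contributes two labels, so the label set is twice the size of the tag set. That is one way to see why a labeler whose cost grows quadratically with the number of labels becomes expensive for joint segmentation and tagging.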
Appears in collections:	[Graduate Institute of Computer Science and Information Engineering] Master's and doctoral theses

All items in NCUIR are protected by the original copyright.
