建構中文語言統計模型及其在數位內容上的應用;Design and Construction of a Chinese Statistical Language Model and Its Applications for Digital Content

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/49243

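Title:	建構中文語言統計模型及其在數位內容上的應用;Design and Construction of a Chinese Statistical Language Model and Its Applications for Digital Content
Authors:	衛友賢;簡瑛瑛;陳孟彰
Contributors:	Graduate Institute of Learning and Instruction (學習與教學研究所)
Keywords:	Research field: Information Science -- Software
Date:	2011-08-01
Upload time:	2012-01-17 18:02:37 (UTC+8)
Publisher:	National Science Council, Executive Yuan (行政院國家科學委員會)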
Abstract:	The proposed project addresses a pressing issue in the development of digital archives: the bottleneck created by limitations in current machine-tractable language models, and the restrictions these limitations impose on the value that can be derived from archives of digital texts. The project focuses specifically on Chinese digital texts and proposes creating a new-generation machine-tractable language model of Chinese that overcomes the key limitations of current language models.

The value of archived digital texts is limited directly by our capacity to compute those texts, and the most fundamental resource determining our capacity to compute texts in a language is a machine-tractable model of that language. One of the main achievements of our recent research has been the design and creation of a novel machine-tractable language model for English that overcomes crucial limitations of current state-of-the-art models (Tsao and Wible 2009; Wible and Tsao 2010; Wible and Tsao 2011). The purpose of the proposed project is to take these proof-of-concept results and apply the underlying design principles to a novel language model for Chinese, one that will open new possibilities for deriving value from Chinese digital archives.

The key innovation in the type of language model we have created lies in the kinds of knowledge structures that constitute the model and in how we create these structures. Crucially, while the language structures generated under our approach are highly language specific (so that, for example, the structures of the Chinese model will differ fundamentally from those of the English model), the method we use to generate them is not language specific at all, but completely general. The knowledge structures emerge bottom-up from word associations computed statistically over large corpora of the target language and from the indexing of both the syntagmatic and the paradigmatic relations of these words to each other. We have named our language model type StringNet (Wible and Tsao 2010, 2011), and we call the proposed Chinese model Chinese StringNet.

The range of potential applications and contributions is wide. Uniquely, StringNet is structured as a web of syntagmatic and paradigmatic relations among lexemes rather than as a flat list of them. It therefore serves as a knowledge source that spans the traditional divide between general grammatical patterns or rules on the one hand and idiosyncratic lexical behaviors and idioms on the other. In this respect, the Chinese StringNet to be created under the proposed project will fill a crucial gap in lexicographic resources for Chinese. The applications that Chinese StringNet can support include new breakthroughs in the discovery and representation of lexicographic knowledge; the application of such knowledge to the teaching and learning of Chinese as a second language, for example automatic error detection and correction (Tsao and Wible 2009); the creation of Chinese learning tools that can be embedded in learners' web browsers and accompany them during browsing to process Chinese webpage content in real time, as we have already done for English and English learning (Wible et al. 2004; Wible 2008; Wible, Liu, and Tsao 2011, inter alia); Chinese document classification; and many other areas.

Project period: 10008 ~ 10112 (ROC calendar; August 2011 to December 2012)
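Relation:	Science and Technology Policy Research and Information Center, National Applied Research Laboratories (財團法人國家實驗研究院科技政策研究與資訊中心)
Appears in collections:	[Graduate Institute of Learning and Instruction] Research Projects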

Files in this item:

File	Description	Size	Format	View count
index.html		0Kb	HTML	543

All items in NCUIR are protected by original copyright.
