A New Breed of Machine Tractable Language Model for Digital Language Learning (易延伸的語言模型之設計及其在數位語言學習之應用)

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/49241

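Title:	A New Breed of Machine Tractable Language Model for Digital Language Learning (易延伸的語言模型之設計及其在數位語言學習之應用)
Authors:	衛友賢;曹乃龍;陳孟彰
Contributors:	Graduate Institute of Learning and Instruction
Keywords:	Research field: Science education
Date:	2011-08-01
Upload time:	2012-01-17 18:02:28 (UTC+8)
Publisher:	National Science Council, Executive Yuan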
Abstract:	The proposed project addresses one of the central limitations currently constraining digital language learning: a lack of sophisticated digital knowledge resources to support language learning. We focus on a domain of language knowledge that we have targeted for the past seven years, multiword expressions (MWEs), and propose to construct a corpus-derived lexico-grammatical knowledgebase of such expressions for English. The knowledgebase is called StringNet. We target MWEs because of the persistent and widely acknowledged challenges they pose both for computational linguistics (Sag et al 2002; Baldwin et al 2007; Zhang et al 2006; inter alia) and for second language education (Wray 2002; Pawley and Syder 1984; Lewis 2002; Nattinger and DeCarrico 1992; inter alia). An important characteristic of our research and of the design of StringNet is that they address these two aspects of MWEs (computational and educational) within one coherent framework, as two sides of the same coin. Current computational language models are based on n-grams, that is, sequences of two, three, four or more words. N-grams are flat and represent word combinations along the syntagmatic dimension only. For StringNet we have created the novel notion of the hybrid n-gram, which introduces the paradigmatic dimension to the language model by allowing part-of-speech categories to occur within n-grams alongside words. Thus, StringNet contains not only ‘consider himself lucky’ but also the more general ‘consider [pnx] lucky’, in which the reflexive pronoun category [pnx] marks the middle slot as open to any reflexive pronoun. StringNet then cross-indexes all the hybrid n-grams that it extracts from the BNC, so that ‘consider [pnx] lucky’ is indexed to ‘consider himself lucky’ and to ‘consider herself lucky’, indicating the subordinate/superordinate relation holding between them. 
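As an illustration of the hybrid n-gram idea, the following is a minimal sketch and not the project's actual extraction code. It assumes the input is already tokenized and part-of-speech tagged; the function names `hybrid_ngrams` and `build_index` are hypothetical, and the tags `vvb` and `aj0` are borrowed from the BNC tagset purely for illustration (only `[pnx]` appears in the abstract). The sketch enumerates every word-or-category variant of each n-gram and does no frequency filtering or pruning.

```python
from collections import defaultdict
from itertools import product


def hybrid_ngrams(tagged):
    """Yield every hybrid variant of one tagged n-gram.

    `tagged` is a list of (word, pos) pairs. Each slot is realised either
    as the word itself or as its bracketed part-of-speech category.
    """
    choices = [(word, f"[{pos}]") for word, pos in tagged]
    for combo in product(*choices):
        yield " ".join(combo)


def build_index(tagged_ngrams):
    """Map each abstract pattern to the set of concrete strings it covers."""
    index = defaultdict(set)
    for tagged in tagged_ngrams:
        concrete = " ".join(word for word, _ in tagged)
        for pattern in hybrid_ngrams(tagged):
            if pattern != concrete:  # skip the purely lexical variant
                index[pattern].add(concrete)
    return index


# Toy input (assumption): two tagged trigrams standing in for corpus data.
corpus = [
    [("consider", "vvb"), ("himself", "pnx"), ("lucky", "aj0")],
    [("consider", "vvb"), ("herself", "pnx"), ("lucky", "aj0")],
]
index = build_index(corpus)
print(sorted(index["consider [pnx] lucky"]))
```

Here the pattern ‘consider [pnx] lucky’ ends up indexed to both concrete strings, which is the superordinate/subordinate link the abstract describes. A fuller version would also link patterns to one another (for example ‘consider [pnx] [aj0]’ above ‘consider [pnx] lucky’) rather than only to concrete strings.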
Unlike other language models, then, StringNet is not a list but a cross-indexed web of lexical patterns ranging from specific to abstract (from ‘it’s the thought that counts’ to ‘it’s the [noun] that [verb]’). StringNet Navigator will provide a web interface allowing users not only to submit query words but also to navigate through the relations among the patterns returned as search results. A test-of-concept version of StringNet has been successfully created and its results published, and it has been acknowledged for ‘advancing the field’ in this difficult area. In the year since the online beta version of this test-of-concept was made available, it has received queries from 30-40 countries every month. The project PI (Wible) has been invited to contribute an article to the prestigious SSCI journal Annual Review of Applied Linguistics on the theme of this project: MWEs and digital language learning. The present three-year project proposes to create a mature version of StringNet based on the successful test-of-concept beta version and to develop and implement a range of applications to support second language education. In the first year, the mature version of StringNet will be extracted from the full 100,000,000-word British National Corpus (BNC), rather than the 6,000,000-word sample used for the test-of-concept version. In subsequent years, corpus resources for StringNet will be expanded beyond the BNC to Google Books, Wikipedia and other clean corpora. 
With respect to applications, the project will (1) develop APIs that make StringNet accessible from any website by means of plug-ins; (2) produce StringNet Builder, a web service that can generate a StringNet knowledgebase for any corpus that a user submits; (3) extract domain-specific lexical patterns from e-textbooks for particular fields and apply the results to digitally supported English for Academic Purposes; (4) create Query Doctor, a tool that uses edit distance techniques and the knowledge structures of StringNet to detect and correct errors in multiword queries before they are sent to Google or other search engines, thus addressing the common and dangerous practice of using Google as an error checker; (5) develop word similarity measures that distinguish confusable words for learners and teachers; and (6) create exercise wizards that generate discovery exercises and cloze tests from StringNet search results. The breakthroughs represented by StringNet’s novel knowledge structures will provide fertile territory for cutting-edge research for the coming decade and beyond. Research period: 10008 ~ 10107 (ROC calendar year and month, i.e. August 2011 to July 2012)
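The Query Doctor idea in item (4) of the abstract can be sketched roughly as follows. This is an assumption-laden illustration, not the project's design: it uses a plain token-level Levenshtein distance, a hypothetical toy lexicon (`LEXICON`) to decide which words can fill a category slot, and hypothetical helper names (`token_edit_distance`, `matches`, `closest_pattern`). The abstract states only that edit distance and StringNet's knowledge structures are combined.

```python
def token_edit_distance(query, pattern, matches):
    """Levenshtein distance over tokens.

    `matches(q, p)` decides whether query token q satisfies pattern token p,
    so a category slot such as [pnx] can match any word in that category.
    """
    prev = list(range(len(pattern) + 1))
    for i, q in enumerate(query, start=1):
        curr = [i]
        for j, p in enumerate(pattern, start=1):
            cost = 0 if matches(q, p) else 1
            curr.append(min(prev[j] + 1,          # drop a query token
                            curr[j - 1] + 1,      # add a pattern token
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical toy lexicon: which words may fill which category slot.
LEXICON = {"[pnx]": {"himself", "herself", "myself", "yourself"}}


def matches(token, slot):
    """A token matches an identical word or a slot whose category lists it."""
    return token == slot or token in LEXICON.get(slot, ())


def closest_pattern(query, patterns):
    """Return the known pattern with the smallest token edit distance."""
    tokens = query.split()
    return min(patterns,
               key=lambda p: token_edit_distance(tokens, p.split(), matches))


# Toy pattern inventory (assumption) taken from the examples in the abstract.
patterns = ["consider [pnx] lucky", "it's the thought that counts"]
print(closest_pattern("consider to himself lucky", patterns))
```

In this toy run the erroneous query is one deletion away from ‘consider [pnx] lucky’, so that pattern is chosen as the likely intended form. A practical tool would also need pattern frequencies and a way to turn the chosen pattern back into a corrected query string, neither of which is shown here.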
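Relation:	Science and Technology Policy Research and Information Center, National Applied Research Laboratories
Appears in collections:	[Graduate Institute of Learning and Instruction] Research Projects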

All items in NCUIR are protected by their original copyright.