Design of an Easily Extensible Language Model and Its Application to Digital Language Learning; A New Breed of Machine Tractable Language Model for Digital Language Learning

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/63241

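Title:	Design of an Easily Extensible Language Model and Its Application to Digital Language Learning; A New Breed of Machine Tractable Language Model for Digital Language Learning
Authors:	衛友賢;陳孟彰
Contributors:	Graduate Institute of Learning and Instruction, National Central University
Keywords:	Language; Science Education
Date:	2013-12-01
Upload time:	2014-03-17 14:24:29 (UTC+8)
Publisher:	National Science Council, Executive Yuan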
Abstract:	Research period: 10208~10307 (ROC calendar: 2013-08 to 2014-07); The proposed project addresses one of the central limitations currently constraining digital language learning: the lack of sophisticated digital knowledge resources to support it. We focus on a domain of language knowledge that we have targeted for the past seven years, multiword expressions (MWEs), and propose to construct a corpus-derived lexico-grammatical knowledgebase of such expressions for English, called StringNet. We target MWEs because of the persistent and widely acknowledged challenges they pose both for computational linguistics (Sag et al. 2002; Baldwin et al. 2007; Zhang et al. 2006; inter alia) and for second language education (Wray 2002; Pawley and Syder 1984; Lewis 2002; Nattinger and DeCarrico 1992; inter alia). An important characteristic of our research and of the design of StringNet is that they address these two aspects of MWEs (computational and educational) within one coherent framework, as two sides of the same coin. Current computational language models are based on n-grams, that is, sequences of two, three, four or more words. N-grams are flat and represent word combinations along the syntagmatic dimension only. For StringNet we have created the novel notion of the hybrid n-gram, which introduces the paradigmatic dimension to the language model by allowing part-of-speech categories to occur within n-grams alongside words. Thus StringNet contains not only ‘consider himself lucky’ but also the more general ‘consider [pnx] lucky’, in which the reflexive pronoun category [pnx] shows the substitutability of the middle slot. StringNet then cross-indexes all the hybrid n-grams that it extracts from the British National Corpus (BNC), so ‘consider [pnx] lucky’ is indexed to ‘consider himself lucky’ and to ‘consider herself lucky’, indicating the subordinate/superordinate relation holding between them.
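The hybrid n-gram idea can be illustrated with a minimal sketch. This is not the StringNet implementation: the function names `hybrid_ngrams` and `cross_index`, the toy hand-tagged data, and the `verb` and `adj` labels are all assumptions made for illustration (only `[pnx]` comes from the abstract). The sketch also simplifies in two ways: it links each abstract pattern only to fully lexical instances, not to intermediate patterns, and it keeps patterns made up entirely of categories.

```python
from collections import defaultdict
from itertools import product


def hybrid_ngrams(tagged):
    """Return all hybrid n-grams for one tagged n-gram.

    `tagged` is a list of (word, pos) pairs. Each slot is either kept as
    the word or abstracted to its bracketed POS category.
    """
    slots = [(word, f"[{pos}]") for word, pos in tagged]
    return [" ".join(choice) for choice in product(*slots)]


def cross_index(tagged_ngrams):
    """Map each abstracted pattern to the fully lexical n-grams it covers."""
    index = defaultdict(set)
    for tagged in tagged_ngrams:
        surface = " ".join(word for word, _ in tagged)
        for pattern in hybrid_ngrams(tagged):
            if pattern != surface:  # skip the all-words form itself
                index[pattern].add(surface)
    return index


# Hand-tagged toy data (hypothetical); a real system would use a POS tagger.
corpus = [
    [("consider", "verb"), ("himself", "pnx"), ("lucky", "adj")],
    [("consider", "verb"), ("herself", "pnx"), ("lucky", "adj")],
]
index = cross_index(corpus)
print(sorted(index["consider [pnx] lucky"]))
```

With this toy input, looking up `consider [pnx] lucky` returns both ‘consider herself lucky’ and ‘consider himself lucky’, which mirrors the superordinate/subordinate link described above. Each trigram yields 2^3 = 8 hybrid forms, so a full-corpus version would need pruning by frequency or similar criteria.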
Unlike other language models, then, StringNet is not a list but a cross-indexed web of lexical patterns ranging from specific to abstract (from ‘it’s the thought that counts’ to ‘it’s the [noun] that [verb]’). StringNet Navigator will provide a web interface allowing users not only to submit query words but also to navigate through the relations among the patterns returned as search results. The test-of-concept version of StringNet has been successfully created and its results published, and it has been acknowledged for ‘advancing the field’ in this difficult area. The online beta of this test-of-concept version has received queries from 30-40 countries every month in the year since it was made available. The project PI (Wible) has been invited to contribute an article to the prestigious SSCI journal Annual Review of Applied Linguistics on the theme of this project: MWEs and digital language learning. The present three-year project proposes to create a mature version of StringNet based on the successful test-of-concept beta and to develop and implement a range of applications to support second language education. In the first year, the mature version of StringNet will be extracted from the full 100,000,000-word British National Corpus (BNC), rather than the 6,000,000-word sample used for the test-of-concept version. In subsequent years, corpus resources for StringNet will be expanded beyond BNC to Google Books, Wikipedia and other clean corpora.
With respect to applications, the project will (1) develop APIs that make StringNet accessible from any website by means of plug-ins; (2) produce StringNet Builder, a web service that can generate a StringNet knowledgebase for any corpus that a user submits; (3) extract domain-specific lexical patterns from e-textbooks for particular fields and apply the results to digitally supported English for Academic Purposes; (4) create Query Doctor, a tool that uses edit distance techniques and the knowledge structures of StringNet to detect and correct errors in multiword queries submitted to Google or other search engines, thus addressing the common and dangerous practice of using Google as an error checker; (5) develop word similarity measures that distinguish confusable words for learners and teachers; and (6) create exercise wizards that generate discovery exercises and cloze exams from StringNet search results. The breakthroughs represented by StringNet’s novel knowledge structures will provide fertile territory for cutting-edge research for the coming decade and beyond.
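The edit distance idea behind Query Doctor, item (4), might look roughly like the sketch below. The abstract does not describe the tool's actual algorithm, so this is only an assumption: a standard word-level Levenshtein distance used to find the attested expression nearest to a learner's query. The names `token_edit_distance` and `suggest` and the tiny `attested` list are hypothetical.

```python
def token_edit_distance(a, b):
    """Levenshtein distance over word tokens (insert, delete, substitute = 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def suggest(query, attested):
    """Return the attested n-gram closest to the query and its distance."""
    tokens = query.lower().split()
    best = min(attested, key=lambda s: token_edit_distance(tokens, s.split()))
    return best, token_edit_distance(tokens, best.split())


# Hypothetical stand-in for expressions attested in a corpus.
attested = ["consider himself lucky", "it's the thought that counts"]
print(suggest("consider himself as lucky", attested))
```

Here the query ‘consider himself as lucky’ is one deletion away from ‘consider himself lucky’, so that expression is returned with distance 1. A version that used StringNet's knowledge structures would presumably also let a category slot such as `[pnx]` match any word of that category; this sketch compares literal words only.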
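Relation:	Science and Technology Policy Research and Information Center, National Applied Research Laboratories
Appears in collections:	[Graduate Institute of Learning and Instruction] Research Projects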

All items in NCUIR are protected by original copyright.
