具有標點符號之端到端串流語音識別於多任務學習;End-to-End Streaming Speech Recognition with Punctuation Marks for Multi-task Learning

Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/84075

Title:	具有標點符號之端到端串流語音識別於多任務學習;End-to-End Streaming Speech Recognition with Punctuation Marks for Multi-task Learning
Authors:	陳柏凱;Chen, Po-Kai
Contributors:	Department of Computer Science and Information Engineering
Keywords:	多任務學習;端到端;串流語音識別;標點符號預測;multi-task learning;end-to-end;streaming speech recognition;punctuation prediction
Date:	2020-07-29
Issue Date:	2020-09-02 18:01:34 (UTC+8)
Publisher:	National Central University
Abstract:	In recent years, speech recognition research has gradually shifted toward end-to-end models, which simplify the overall modeling pipeline. The 2015 paper "Listen, Attend and Spell" applied the Seq-to-Seq architecture and the Attention mechanism to end-to-end speech recognition for the first time, establishing the form of current end-to-end speech recognition models. Unfortunately, Attention is built on full-sequence modeling, so it cannot recognize partial sequences and therefore cannot be applied well in streaming scenarios. To address this problem of recognizing partial sequences with Attention, this thesis uses a Layer-level Time Limited Attention Mask (L-TLAM), which improves the model's ability to model incomplete sequences and mitigates the excessive indirect attention that arises from stacking network layers, with the goal of better streaming speech recognition. Punctuation marks are an integral part of text, indicating pauses, tone, and the nature and function of words. However, the corpora generally used to train speech recognition do not provide punctuation annotations, so a speech recognition system cannot directly produce results that include punctuation.
In the second part of this thesis, to integrate punctuation marking into speech recognition training, we train the main speech recognition task on the Transducer architecture and use Multi-task Learning to share the Transducer's language-model component, the Predictor, between two tasks: 1) Context Representation for Acoustic Model and 2) Punctuation Prediction. The first task provides the textual context information that the ASR task needs. The second task provides the textual semantic information needed to predict punctuation. Finally, the thesis also experiments with adding a Language Model task to improve the Predictor's semantic understanding, with the aim of improving the accuracy of both speech recognition and punctuation prediction.
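
The abstract's point about "indirect attention" in stacked layers can be illustrated with a small sketch. This record does not define L-TLAM, so the code below is not the thesis's method: it is a generic time-limited attention mask, and the function names, window sizes, and the choice of letting only the lowest layer look ahead are all assumptions made for illustration. It shows that when every layer may look a few frames ahead, the lookahead accumulates across layers, whereas setting the window per layer keeps the total bounded.

```python
def time_limited_mask(num_frames, left, right):
    """mask[t][s] is True when query frame t may attend to key frame s.

    Hypothetical helper: frame t sees frames t-left .. t+right only.
    """
    return [[-left <= s - t <= right for s in range(num_frames)]
            for t in range(num_frames)]


def compose(upper, lower):
    """Frames reachable through `upper` applied on top of `lower`."""
    n = len(upper)
    return [[any(upper[t][k] and lower[k][s] for k in range(n))
             for s in range(n)]
            for t in range(n)]


def effective_lookahead(masks, frame=0):
    """Furthest future offset that can influence `frame` after all layers.

    `masks` is ordered from the lowest layer to the highest layer.
    """
    reach = masks[0]
    for mask in masks[1:]:
        reach = compose(mask, reach)
    return max(s - frame for s, ok in enumerate(reach[frame]) if ok)


# Assumed setup: 12 frames, 3 stacked attention layers.
# Same window in every layer: indirect attention accumulates.
uniform = [time_limited_mask(12, left=2, right=2) for _ in range(3)]

# Window chosen per layer: only the lowest layer looks ahead.
layered = [time_limited_mask(12, left=2, right=2)] + [
    time_limited_mask(12, left=2, right=0) for _ in range(2)
]

print(effective_lookahead(uniform))  # 6 frames of total lookahead
print(effective_lookahead(layered))  # 2 frames of total lookahead
```

In a streaming setting the total lookahead translates into latency, which is why the window is worth controlling layer by layer rather than once for the whole network.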
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

Files in This Item:

File	Description	Size	Format
index.html		0Kb	HTML	124	View/Open
