Abstract: | In recent years, speech recognition research has shifted toward end-to-end models, which simplify the overall pipeline. The 2015 paper "Listen, Attend and Spell" was the first to apply the Seq-to-Seq architecture and the Attention mechanism to end-to-end speech recognition, establishing the form of today's end-to-end speech recognition models. Unfortunately, Attention models the full sequence, so it cannot recognize partial sequences and is therefore poorly suited to streaming scenarios. To address partial-sequence recognition with Attention, this thesis uses a Layer-level Time Limited Attention Mask (L-TLAM), which improves the model's ability to model incomplete sequences and reduces the excessive indirect attention that arises from stacking network layers, yielding better streaming speech recognition. Punctuation marks are an integral part of text, indicating pauses, tone, and the nature and function of words. However, the corpora generally used to train speech recognition systems carry no punctuation annotations, so a speech recognizer cannot directly output punctuated results.
In the second part of this thesis, to integrate punctuation prediction into speech recognition training, we train the main speech recognition task on the Transducer architecture and use Multi-task Learning to share the Transducer's language-model component, the Predictor, between two tasks: 1) Context Representation for Acoustic Model and 2) Punctuation Prediction. The first task supplies the textual context information needed for ASR. The second supplies the textual semantic information needed to predict punctuation. Finally, the thesis also experiments with adding a Language Model task, aiming to improve the Predictor's semantic understanding and, in turn, the accuracy of both speech recognition and punctuation prediction. |
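The "indirect attention" effect mentioned in the abstract can be illustrated with a small sketch. This is a generic time-restricted attention mask, not the thesis's L-TLAM, whose exact definition the abstract does not give. The function names and the `left`/`right` context parameters are assumptions made for illustration. The sketch shows that when every layer applies the same limited mask, the look-ahead of the stacked network still grows with depth, because a frame can reach future frames through intermediate frames.

```python
def time_limited_mask(num_frames, left, right):
    """mask[i][j] is True when frame i may attend directly to frame j.

    Hypothetical helper: each frame sees `left` past and `right` future frames.
    """
    return [[(i - left) <= j <= (i + right) for j in range(num_frames)]
            for i in range(num_frames)]


def compose(a, b):
    """Boolean matrix product: i reaches j if i reaches some k that attends to j."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def effective_reach(mask, num_layers):
    """Frames reachable (directly or indirectly) after stacking `num_layers` layers."""
    reach = mask
    for _ in range(num_layers - 1):
        reach = compose(reach, mask)
    return reach


def max_lookahead(reach):
    """Largest number of future frames any frame depends on."""
    return max(j - i
               for i, row in enumerate(reach)
               for j, allowed in enumerate(row) if allowed)


if __name__ == "__main__":
    mask = time_limited_mask(num_frames=20, left=3, right=1)
    print("per-layer look-ahead:", max_lookahead(mask))
    print("look-ahead after 4 layers:", max_lookahead(effective_reach(mask, 4)))
```

With a per-layer look-ahead of one frame, four stacked layers depend on four future frames. That accumulated latency is the kind of indirect dependence a layer-level mask design would need to control in a streaming setting.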