

    Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/84075


    Title: End-to-End Streaming Speech Recognition with Punctuation Marks for Multi-task Learning
    Author: Chen, Po-Kai (陳柏凱)
    Contributor: Department of Computer Science and Information Engineering
    Keywords: multi-task learning; end-to-end; streaming speech recognition; punctuation prediction
    Date: 2020-07-29
    Upload time: 2020-09-02 18:01:34 (UTC+8)
    Publisher: National Central University
    Abstract: In recent years, speech recognition research has shifted toward end-to-end models, which simplify the overall modeling pipeline. The 2015 paper "Listen, Attend and Spell" first applied the sequence-to-sequence architecture with an attention mechanism to end-to-end speech recognition, establishing the template for today's end-to-end models. Unfortunately, attention is based on full-sequence modeling: it cannot operate on partial sequences and therefore does not transfer cleanly to streaming applications. To address this limitation, this thesis uses a Layer-level Time-Limited Attention Mask (L-TLAM), which improves the model's ability to represent incomplete sequences and mitigates the excessive indirect attention introduced by stacking network layers, yielding better streaming speech recognition.
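The abstract does not give the exact L-TLAM formulation, but the general idea of a time-limited attention mask can be sketched as a banded boolean matrix: each query frame may attend only to a bounded window of past (and optionally a few future) frames, so stacked layers keep a finite receptive field suitable for streaming. The function name and window sizes below are hypothetical illustrations, not the thesis implementation.

```python
def time_limited_attention_mask(seq_len, left_context, right_context=0):
    """Boolean attention mask restricting each frame to a local window.

    Entry [q][k] is True iff query position q may attend to key position k,
    i.e. k lies in [q - left_context, q + right_context]. With
    right_context=0 the mask is strictly causal, which is what a
    streaming recognizer needs. (Hypothetical sketch, not L-TLAM itself.)
    """
    return [[-left_context <= k - q <= right_context for k in range(seq_len)]
            for q in range(seq_len)]

# Example: 6 frames, each attending to itself and 2 frames of left context.
mask = time_limited_attention_mask(6, left_context=2)
```

In a Transformer-style encoder, such a mask would be applied per layer before the attention softmax; limiting the window at every layer (rather than only at the input) is what bounds the indirect attention that accumulates through stacked layers.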
    Punctuation marks are an integral part of textual information, indicating pauses, tone, and the nature and function of words. However, the corpora generally used to train speech recognizers provide no punctuation annotations, so recognition results cannot directly include punctuation. In the second part of this thesis, in order to integrate punctuation prediction into speech recognition training, we train the main ASR task on the Transducer architecture and use multi-task learning to share the Transducer's language-model Predictor between two tasks: 1) context representation for the acoustic model and 2) punctuation prediction. The first task supplies the textual context the ASR task requires; the second supplies the textual semantics needed to predict punctuation. Finally, the thesis also introduces a language-model task to improve the Predictor's semantic understanding, further improving the accuracy of both speech recognition and punctuation prediction.
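The parameter-sharing arrangement described above can be illustrated structurally: one Predictor module is called by both the ASR branch (feeding the joint network) and the punctuation branch, so gradients from both tasks update the same parameters. This is a toy structural sketch with placeholder names and an echo "representation", not the thesis code.

```python
class SharedPredictor:
    """Toy stand-in for the Transducer label Predictor (names hypothetical).

    Both task branches call the same instance, so in a real model its
    parameters would receive gradients from both losses. Here the
    "representation" is simply an echo of the token ids.
    """
    def __init__(self):
        self.call_count = 0  # track that both branches reuse this module

    def __call__(self, token_ids):
        self.call_count += 1
        return list(token_ids)

def asr_branch(predictor, token_ids):
    # Task 1: context representation handed to the Transducer joint network.
    return predictor(token_ids)

def punctuation_branch(predictor, token_ids):
    # Task 2: the same shared representation reused to classify punctuation.
    return predictor(token_ids)

predictor = SharedPredictor()
asr_repr = asr_branch(predictor, [7, 3, 5])
punct_repr = punctuation_branch(predictor, [7, 3, 5])
```

A third, optional language-model branch would call the same `predictor` again, matching the thesis's final experiment of adding an LM task to strengthen the Predictor's semantics.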
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Master's and Doctoral Theses

    Files in This Item:

    File: index.html | Size: 0Kb | Format: HTML | Views: 124 | View/Open


    All items in NCUIR are protected by copyright.

