Master's/Doctoral Thesis 107522063: Detailed Record




Name: Po-Kai Chen (陳柏凱)    Department: Computer Science and Information Engineering
Thesis Title: End-to-End Streaming Speech Recognition with Punctuation Marks for Multi-task Learning
(具有標點符號之端到端串流語音識別於多任務學習)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Pre-processing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Dictation System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Accelerated Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ RetinaNet Applied to Face Detection
★ Financial Product Trend Prediction
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ End-to-End Speech Synthesis for Mandarin
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based ETF Trend Prediction
★ Investigating the Correlation between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Stroke Surgery Survival
Full text access: view in the repository system (never to be released)
Abstract (Chinese) In recent years, speech recognition research has gradually shifted toward end-to-end models, which simplify the overall modeling pipeline. The 2015 paper "Listen, Attend and Spell" was the first to apply the sequence-to-sequence architecture and the attention mechanism to end-to-end speech recognition, establishing the form of today's end-to-end models. Unfortunately, attention operates over the full sequence, so it cannot recognize partial sequences and therefore cannot be applied cleanly to streaming scenarios. To address this limitation, this thesis uses a Layer-level Time-Limited Attention Mask (L-TLAM), which improves the model's ability to handle incomplete sequences and mitigates the excessive indirect attention produced by stacked layers, yielding better streaming speech recognition.
Punctuation marks are an integral part of text; they indicate pauses, tone, and the nature and function of words. However, the corpora commonly used to train speech recognition systems are not annotated with punctuation, so speech recognition cannot directly produce punctuated transcripts. In the second part of this thesis, to integrate punctuation prediction into speech recognition training, we train the main speech recognition task on the Transducer architecture and use multi-task learning to share the Transducer's language-model-like Predictor across two tasks: 1) context representation for the acoustic model and 2) punctuation prediction. The first task supplies the textual context needed by ASR; the second supplies the textual semantic information needed to predict punctuation. Finally, this thesis also introduces a language-model task to strengthen the Predictor's semantic understanding and thereby improve the accuracy of both speech recognition and punctuation prediction.
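To make the masking idea concrete, here is a minimal PyTorch sketch of a time-limited self-attention mask in the spirit of L-TLAM; the helper name `time_limited_attention_mask`, the window sizes, and the use of `torch.nn.MultiheadAttention` are illustrative assumptions rather than the thesis' actual configuration.

```python
# Sketch only: a per-layer attention mask that restricts each query frame to a
# fixed window of past/future frames, so stacked layers do not accumulate an
# unbounded indirect receptive field. Window sizes are made-up examples.
import torch

def time_limited_attention_mask(seq_len: int, left: int, right: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask; True marks key positions that are blocked."""
    idx = torch.arange(seq_len)
    dist = idx.unsqueeze(0) - idx.unsqueeze(1)      # dist[i, j] = j - i
    return (dist < -left) | (dist > right)          # visible iff i-left <= j <= i+right

# Each layer can get its own (hypothetical) window, e.g. a small look-ahead.
mask = time_limited_attention_mask(seq_len=100, left=20, right=2)
attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
x = torch.randn(1, 100, 256)                        # (batch, time, feature)
y, _ = attn(x, x, x, attn_mask=mask)                # positions outside the window are ignored
```

Keeping the per-layer window bounded keeps the look-ahead of the whole encoder stack bounded, which is what allows it to run on a still-growing (streaming) input.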
Abstract (English) In recent years, speech recognition research has gradually turned to end-to-end models, simplifying the overall modeling pipeline. The 2015 paper "Listen, Attend and Spell" first applied the sequence-to-sequence architecture and the attention mechanism to end-to-end speech recognition, laying the foundation for current end-to-end models. Unfortunately, attention is based on full-sequence modeling, so it cannot recognize partial sequences and therefore cannot be used cleanly in streaming applications. To address this problem, this thesis uses a Layer-level Time-Limited Attention Mask (L-TLAM), which improves the model's ability to model incomplete sequences and alleviates the excessive indirect attention caused by stacked layers, achieving better streaming speech recognition.
Punctuation marks are an integral part of textual information, used to indicate pauses, tone, and the nature and function of words. However, the corpora generally used to train speech recognition do not provide punctuation annotations, so speech recognition tasks cannot directly produce punctuated results. In the second part of this thesis, to integrate punctuation prediction into speech recognition training, we train the main speech recognition task on the Transducer architecture and use multi-task learning to share the language-model Predictor of the Transducer across two tasks: 1) context representation for the acoustic model and 2) punctuation prediction. The first task provides the textual context required by ASR; the second provides the textual semantic information needed to predict punctuation. Finally, the thesis also introduces a language-model task to improve the Predictor's semantic comprehension, thereby improving the accuracy of both speech recognition and punctuation prediction.
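To illustrate how a single Predictor can serve both tasks, the following is a minimal PyTorch sketch of a Transducer-style model whose prediction network feeds both the ASR joint network and a punctuation classifier. The class name `SharedPredictorMultiTask`, the layer types, and all sizes are hypothetical placeholders, not the thesis' actual architecture.

```python
# Sketch only: a shared prediction network ("Predictor") reused by two heads,
# (1) the Transducer joint network for ASR and (2) a per-token punctuation head.
import torch
import torch.nn as nn

class SharedPredictorMultiTask(nn.Module):
    def __init__(self, vocab_size=4000, punct_classes=5, hidden=320, feat_dim=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)  # shared language-model-like module
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)  # stand-in acoustic encoder
        self.joint = nn.Linear(2 * hidden, vocab_size + 1)          # +1 for the blank symbol
        self.punct_head = nn.Linear(hidden, punct_classes)          # punctuation classes

    def forward(self, feats, tokens):
        enc, _ = self.encoder(feats)                    # (B, T, H) acoustic representation
        pred, _ = self.predictor(self.embed(tokens))    # (B, U, H) textual context representation
        # Transducer joint: combine every (t, u) pair of encoder/predictor states.
        joint_in = torch.cat(
            [enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1),
             pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)], dim=-1)
        asr_logits = self.joint(joint_in)               # (B, T, U, vocab+1), for an RNN-T loss
        punct_logits = self.punct_head(pred)            # (B, U, punct_classes), per-token punctuation
        return asr_logits, punct_logits

# Example shapes: 80-dim filterbank frames and a short token prefix.
model = SharedPredictorMultiTask()
asr_logits, punct_logits = model(torch.randn(2, 50, 80), torch.randint(0, 4000, (2, 10)))
```

In such a setup the ASR logits would typically be trained with an RNN-T (Transducer) loss and the punctuation logits with a cross-entropy loss, combined as a weighted multi-task objective; an additional language-model loss on the Predictor output could be added in the same way.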
Keywords (Chinese) ★ 多任務學習 (multi-task learning)
★ 端到端 (end-to-end)
★ 串流語音識別 (streaming speech recognition)
★ 標點符號預測 (punctuation prediction)
Keywords (English) ★ multi-task learning
★ end-to-end
★ streaming speech recognition
★ punctuation prediction
Table of Contents
Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Background
1.2 Motivation and Objectives
1.3 Methodology and Chapter Overview
Chapter 2: Related Work and Literature Review
2.1 Sequence Model: Long Short-Term Memory (LSTM)
2.1.1 Bidirectional LSTM Recurrent Neural Networks
2.1.2 Deep LSTM Recurrent Neural Networks
2.2 Self-Attention Network (SAN)
2.2.1 Self-Attention Algorithm
2.2.2 Positional Encoding Algorithm
2.2.3 Complexity Analysis of Sequence Models
2.3 Loss Function: Connectionist Temporal Classification (CTC)
2.3.1 CTC Training Algorithm
2.3.2 CTC Prefix Beam Search Algorithm
2.4 Sequence Transducer
2.4.1 Transducer Algorithm
2.4.2 Transducer Pre-training Method
2.5 VGG Transformer
2.5.1 VGG Transformer Model Architecture
2.5.2 Experimental Results
Chapter 3: End-to-End Streaming Speech Recognition with Punctuation Marks for Multi-task Learning
3.1 Self-Attention-Based Transducer
3.2 Layer-level Time-Limited Attention Mask (L-TLAM)
3.3 VGG Context-Based Positional Encoding
3.4 Streaming Punctuation Prediction
3.5 End-to-End Streaming Speech Recognition with Punctuation Marks for Multi-task Learning
Chapter 4: Experimental Results and Discussion
4.1 Experimental Setup
4.2 Datasets
4.2.1 Speech Recognition Dataset
4.2.2 Text Dataset
4.3 Experiments and Discussion
4.3.1 Streaming Speech Recognition
4.3.2 Streaming Punctuation Prediction
4.3.3 End-to-End Streaming Speech Recognition with Punctuation Marks for Multi-task Learning
4.3.4 Speech Recognition Results with Punctuation Marks
4.3.5 Performance of the Multi-task Learning Fusion Model
Chapter 5: Conclusions and Future Work
Chapter 6: References
References
[1] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition,” IEEE Trans. Audio Speech Lang. Process., vol. 20, no. 1, pp. 30–42, Jan. 2012, doi: 10.1109/TASL.2011.2134090.
[2] A. Graves, “Sequence Transduction with Recurrent Neural Networks,” arXiv:1211.3711 [cs, stat], Nov. 2012, Accessed: May 25, 2020. [Online]. Available: http://arxiv.org/abs/1211.3711.
[3] A. Hannun et al., “Deep Speech: Scaling up end-to-end speech recognition,” arXiv:1412.5567 [cs], Dec. 2014, Accessed: May 25, 2020. [Online]. Available: http://arxiv.org/abs/1412.5567.
[4] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, Attend and Spell,” arXiv:1508.01211 [cs, stat], Aug. 2015, Accessed: May 25, 2020. [Online]. Available: http://arxiv.org/abs/1508.01211.
[5] A. Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762 [cs], Dec. 2017, Accessed: May 25, 2020. [Online]. Available: http://arxiv.org/abs/1706.03762.
[6] A. Mohamed, D. Okhonko, and L. Zettlemoyer, “Transformers with convolutional context for ASR,” arXiv:1904.11660 [cs], Mar. 2020, Accessed: May 25, 2020. [Online]. Available: http://arxiv.org/abs/1904.11660.
[7] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006, pp. 369–376.
[8] L. Dong, S. Xu, and B. Xu, “Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Apr. 2018, pp. 5884–5888, doi: 10.1109/ICASSP.2018.8462506.
[9] Y. He et al., “Streaming End-to-end Speech Recognition for Mobile Devices,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, May 2019, pp. 6381–6385, doi: 10.1109/ICASSP.2019.8682336.
[10] N. Moritz, T. Hori, and J. L. Roux, “Streaming automatic speech recognition with the transformer model,” arXiv:2001.02674 [cs, eess, stat], Mar. 2020, Accessed: May 25, 2020. [Online]. Available: http://arxiv.org/abs/2001.02674.
[11] K. Kim et al., “Attention based on-device streaming speech recognition with large speech corpus,” arXiv:2001.00577 [cs, eess], Jan. 2020, Accessed: May 25, 2020. [Online]. Available: http://arxiv.org/abs/2001.00577.
[12] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/Attention Architecture for End-to-End Speech Recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, Dec. 2017, doi: 10.1109/JSTSP.2017.2763455.
[13] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” arXiv:1409.0473 [cs, stat], May 2016, Accessed: May 30, 2020. [Online]. Available: http://arxiv.org/abs/1409.0473.
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 [cs], May 2019, Accessed: May 30, 2020. [Online]. Available: http://arxiv.org/abs/1810.04805.
[15] A. Y. Hannun, A. L. Maas, D. Jurafsky, and A. Y. Ng, “First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs,” arXiv:1408.2873 [cs], Dec. 2014, Accessed: Jun. 03, 2020. [Online]. Available: http://arxiv.org/abs/1408.2873.
[16] K. Rao, H. Sak, and R. Prabhavalkar, “Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer,” arXiv:1801.00841 [cs, eess], Jan. 2018, Accessed: Jun. 06, 2020. [Online]. Available: http://arxiv.org/abs/1801.00841.
[17] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-Attention with Relative Position Representations,” arXiv:1803.02155 [cs], Apr. 2018, Accessed: Jun. 11, 2020. [Online]. Available: http://arxiv.org/abs/1803.02155.
[18] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline,” arXiv:1709.05522 [cs], Sep. 2017, Accessed: Jun. 17, 2020. [Online]. Available: http://arxiv.org/abs/1709.05522.
[19] D. Povey et al., “The Kaldi Speech Recognition Toolkit,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.
Advisor: Jia-Ching Wang (王家慶)    Date of Approval: 2020-07-29
