Speech Recognition via Attention Mechanism on Raspberry Pi

Name

申自強 (Tzu-Chiang Shen)

Department

In-service Master Program, Department of Computer Science and Information Engineering

Thesis Title

Speech Recognition via Attention Mechanism on Raspberry Pi

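This electronic thesis is authorized for immediate open access.
Electronic full texts that have reached their open-access date are licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.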

Abstract

Speech recognition serves as a new form of computer interface. It enables voice assistants (e.g., Alexa and Siri), which help us with many services, such as obtaining daily information and setting up driving navigation systems. Speech recognition has been extensively studied since the early 1990s. However, as more and more portable embedded devices (e.g., navigation systems and language translators) appear on the market, there is a need for offline speech recognition on low-computation devices. In this research, we focus on applying an encoder-decoder neural network to a low-power device such as the Raspberry Pi. In contrast to Alexa and Siri, which require recorded voice to be transmitted to expensive servers for computation and inference, we build a speech recognition model that infers speech samples entirely locally. Our model uses a CNN as the encoder and an LSTM or GRU with an attention mechanism as the decoder. In addition, TensorFlow Lite is adopted to deploy the model to the Raspberry Pi for speech inference. The experimental results indicate that using the attention mechanism on the Raspberry Pi improves the model's recall on isolated words by about 2% to 5%. Because of the limited computing power of the low-power device, inference times on the Raspberry Pi are very long.
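
To make the attention step described in the abstract concrete, the following is a minimal sketch in plain Python. It assumes a dot-product scoring function and toy two-dimensional frame features. Both are illustrative assumptions and not the thesis's actual configuration, and the names `softmax` and `dot_attention` are hypothetical helpers. The decoder state is scored against every encoder output, the scores are normalized with a softmax, and the context vector is the weighted sum of the encoder outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(query, encoder_states):
    """Return (context, weights) for dot-product attention.

    query          -- decoder state, a list of d floats
    encoder_states -- T encoder outputs, each a list of d floats
    The dot-product score is an assumption made for this sketch.
    """
    # One alignment score per encoder time step.
    scores = [sum(q * h for q, h in zip(query, state))
              for state in encoder_states]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the encoder outputs.
    dim = len(query)
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights


# Toy input: three encoder frames with made-up 2-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
context, weights = dot_attention(decoder_state, frames)
```

In this toy input the first and third frames align equally well with the decoder state, so they receive equal weights that are larger than the weight of the second frame. In a model like the one the abstract describes, the context vector would be fed to the LSTM or GRU decoder at each step.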

Keywords

★ Speech Recognition
★ Attention Mechanism
★ Raspberry Pi
★ Isolated Word
★ Wake-Up-Word

Table of Contents

1 Introduction
2 Related Work
2.1 Statistics-based
2.2 Deep learning-based
3 Preliminary
3.1 Convolutional Neural Networks
3.2 Recurrent Neural Networks
3.3 Attention Mechanism
4 Design
4.1 Data Collection
4.2 Feature Engineering
4.2.1 Overview of Sound Features
4.2.2 Time-Frequency-Domain Analysis
4.2.3 Outliers Handling
4.3 Encoder Model
4.3.1 CNN Module
4.3.2 LSTM Module with Attention Mechanism
4.4 Decoder Model
4.5 Low-Power Inference on Raspberry Pi
5 Performance
5.1 Performance Metrics
5.2 Experimental Environment
5.3 Experimental Results
6 Conclusions
Reference

Advisor

孫敏德 (Min-Te (Peter) Sun)

Approval Date

2022-09-23
