Speech Densely Connected Convolutional Networks for Small-Footprint Keyword Spotting

Name

Xin-Hui Lin (林欣慧)

Department

Department of Electrical Engineering

Thesis Title

Speech Densely Connected Convolutional Networks for Small-Footprint Keyword Spotting

Files

Full text: never open to the public

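Abstract (Chinese)

As society places growing emphasis on human-computer interaction, voice assistants that use speech recognition to activate or control devices are becoming increasingly common. However, many voice assistant users have concerns about personal privacy, because most voice assistants on the market capture voice commands and upload them to the cloud for processing, and the service provider temporarily stores these voice clips. To address this problem, on-device keyword spotting is an important task in voice-based human-computer interaction. To ensure a high level of privacy, recognition must run on devices with limited performance, so the goal of this task is to maximize accuracy within a limited cost.
This thesis describes the application of Densely Connected Convolutional Networks (DenseNet) to keyword spotting. To reduce the number of parameters, we replace standard convolution with group convolution and depthwise separable convolution. To raise accuracy, we add Squeeze-and-Excitation Networks (SENet) to strengthen the weights of important features. To study how different convolutions affect DenseNet, we build three models, SpDenseNet-G, SpDenseNet-D, and SpDenseNet-L, and produce a compact variant of each.
We validate the networks on the Google Speech Commands Dataset. The proposed networks are more accurate than other networks while using fewer parameters and FLOPs: SpDenseNet-D reaches 96.3% accuracy with 122.63K parameters and 142.7M FLOPs, using only about 46% of the parameters and about 10% of the FLOPs of the benchmark paper. In addition, the compact variants we build by changing the depth and width of the network also outperform the compact variants in other papers. SpDenseNet-L-narrow reaches 93.6% accuracy with 9.27K parameters and 3.47M FLOPs. Compared with the benchmark paper, our compact model improves accuracy by 3.5% while using only about 47% of the parameters and about 48% of the FLOPs.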
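
The parameter saving that the abstract attributes to group convolution and depthwise separable convolution can be illustrated with the usual weight-count formulas. The sketch below is not taken from the thesis: the function names, the channel counts, the 3x3 kernel, and the group count are all assumptions chosen for illustration, and bias terms are ignored.

```python
# Weight counts for one convolution layer (bias terms ignored).
# All names and sizes here are illustrative assumptions, not values from the thesis.

def standard_conv_params(c_in, c_out, k_h, k_w):
    """Standard convolution: every output channel sees every input channel."""
    return k_h * k_w * c_in * c_out


def group_conv_params(c_in, c_out, k_h, k_w, groups):
    """Group convolution: each output channel sees only c_in / groups input channels."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    return k_h * k_w * (c_in // groups) * c_out


def depthwise_separable_params(c_in, c_out, k_h, k_w):
    """Depthwise separable convolution: one k_h x k_w filter per input channel,
    followed by a 1x1 pointwise convolution that mixes channels."""
    depthwise = k_h * k_w * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 64 output channels, 3x3 kernel.
print("standard:           ", standard_conv_params(64, 64, 3, 3))
print("group (4 groups):   ", group_conv_params(64, 64, 3, 3, groups=4))
print("depthwise separable:", depthwise_separable_params(64, 64, 3, 3))
```

For a fixed layer shape, group convolution divides the weight count by the number of groups, while the depthwise separable form replaces the product of kernel size and output channels with their sum per input channel.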

Abstract (English)

As human-computer interaction grows in importance, voice assistants that use speech recognition to activate or control devices are becoming more common. However, many voice assistant users are concerned about personal privacy, because most voice assistants on the market capture each voice command and upload it to the cloud for processing, and the service provider temporarily stores these voice clips. To address this problem, on-device keyword spotting is an important task in voice-based human-computer interaction. For high privacy, the recognition task must run at the edge on resource-limited devices, so the goal is to maximize accuracy within a limited cost.
This thesis discusses the application of Densely Connected Convolutional Networks (DenseNet) to the keyword spotting task. To make the model smaller, we replace standard convolution with group convolution and depthwise separable convolution. To increase accuracy, we add Squeeze-and-Excitation Networks (SENet) to strengthen the weights of important features. To investigate the effect of different convolutions on DenseNet, we built three models, SpDenseNet-G, SpDenseNet-D, and SpDenseNet-L, and generated a compact variant of each.
We validated the networks using the Google Speech Commands Dataset. The proposed networks are more accurate than other networks while using fewer parameters and floating-point operations (FLOPs). SpDenseNet-D achieves 96.3% accuracy with 122.63K trainable parameters and 142.7M FLOPs, using only about 52% of the parameters and about 12% of the FLOPs of the benchmark paper. In addition, we varied the depth and width of the network to build compact variants, which also outperform the compact variants in other papers. SpDenseNet-L-narrow achieves 93.6% accuracy with 9.27K trainable parameters and 3.47M FLOPs. Compared with the benchmark paper, our compact model improves accuracy by 3.5% while using only about 47% of the parameters and about 48% of the FLOPs.
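
The abstract says SENet is added to strengthen the weights of important features. A minimal pure-Python sketch of the general squeeze-and-excitation idea follows: average each channel to a scalar, pass the result through a small two-layer bottleneck, and rescale each channel by the resulting gate. This is an assumption-laden illustration, not the thesis implementation; the function names, the nested-list tensor layout, and the weights are all hypothetical.

```python
import math

# Feature maps are nested lists shaped [channels][height][width].
# Names, layout, and weights are illustrative assumptions, not the thesis code.


def squeeze(feature_maps):
    """Squeeze: global average pooling, giving one scalar per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]


def dense(vec, weights, bias):
    """Fully connected layer; weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def excite(pooled, w1, b1, w2, b2):
    """Excitation: reduce -> ReLU -> expand -> sigmoid, giving one gate in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in dense(pooled, w1, b1)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w2, b2)]


def se_block(feature_maps, w1, b1, w2, b2):
    """Rescale every channel of feature_maps by its learned gate."""
    gates = excite(squeeze(feature_maps), w1, b1, w2, b2)
    return [
        [[gate * x for x in row] for row in channel]
        for channel, gate in zip(feature_maps, gates)
    ]


# Two 2x2 channels, reduced to one hidden unit and expanded back to two gates.
x = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
w1, b1 = [[0.1, -0.2]], [0.0]        # 2 channels -> 1 hidden unit
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]  # 1 hidden unit -> 2 gates
print(se_block(x, w1, b1, w2, b2))
```

In a trained network the two dense layers are learned together with the convolutions, so channels that carry useful features receive gates closer to 1.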

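Keywords (Chinese)

★ keyword spotting
★ densely connected convolutional networks
★ group convolution
★ depthwise separable convolution
★ deep learning

Keywords (English)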

★ keyword spotting
★ DenseNet
★ group convolution
★ depthwise separable convolution
★ SENet
★ deep learning

Table of Contents

Chinese Abstract
Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background and Motivation
1-2 Thesis Organization
2. Literature Review
2-1 Keyword Spotting
2-2 Densely Connected Convolutional Networks
2-3 Group Convolution and Depthwise Separable Convolution
2-4 Squeeze-and-Excitation Networks
3. Network Model Design and Experiments
3-1 Dataset
3-2 Data Preprocessing
3-3 Keyword Spotting Network Model Design
3-4 Training Strategy
4. Results and Discussion
4-1 Experimental Results
4-2 Discussion
5. Conclusion
References

Advisor

Tsung-Han Tsai (蔡宗漢)

Approval Date

2021-11-29
