語音密集連接卷積網路應用於小尺寸關鍵詞偵測;Speech Densely Connected Convolutional Networks for Small-Footprint Keyword Spotting


Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/86932

Title:	語音密集連接卷積網路應用於小尺寸關鍵詞偵測;Speech Densely Connected Convolutional Networks for Small-Footprint Keyword Spotting
Author:	林欣慧;Lin, Xin-Hui
Contributor:	Department of Electrical Engineering
Keywords:	keyword spotting;DenseNet;group convolution;depthwise separable convolution;SENet;deep learning
Date:	2021-11-29
Upload time:	2021-12-07 13:27:23 (UTC+8)
Publisher:	National Central University
Abstract:	As human-computer interaction grows in importance, voice assistants that use speech recognition to drive or control devices are becoming more common. However, many users of voice assistants are concerned about personal privacy, because most assistants on the market capture the spoken command and upload it to the cloud for processing, where the audio clips are temporarily stored by the service provider. On-device keyword spotting addresses this problem and is therefore an important task in voice-based human-computer interaction. To preserve privacy, recognition must run on resource-constrained edge devices, so the goal is to maximize accuracy within a limited cost budget. This thesis describes the application of Densely Connected Convolutional Networks (DenseNet) to the keyword spotting task. To reduce the number of parameters, we replace standard convolution with group convolution and depthwise separable convolution. To increase accuracy, we add Squeeze-and-Excitation Networks (SENet) to strengthen the weights of important features.
To investigate the effect of different convolutions on DenseNet, we built three models, SpDenseNet-G, SpDenseNet-D, and SpDenseNet-L, and generated compact variants of each. We validated the networks on the Google Speech Commands Dataset. The proposed networks achieve higher accuracy than other networks while using fewer parameters and floating-point operations (FLOPs). SpDenseNet-D reaches 96.3% accuracy with 122.63K trainable parameters and 142.7M FLOPs; compared to the benchmark paper, it uses only about 52% of the parameters and about 12% of the FLOPs (the Chinese-language abstract of this record states about 46% and about 10%). In addition, we varied the depth and width of the network to build compact variants, which also outperform the compact variants in other papers. SpDenseNet-L-narrow reaches 93.6% accuracy with 9.27K trainable parameters and 3.47M FLOPs. Compared to the benchmark paper, our compact model improves accuracy by 3.5% while using only about 47% of the parameters and about 48% of the FLOPs.
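The parameter savings that the abstract attributes to group convolution and depthwise separable convolution can be illustrated with a small weight-count calculation. The sketch below is not taken from the thesis: the function names and the layer size (64 input channels, 64 output channels, 3x3 kernel, 4 groups) are hypothetical values chosen only for illustration, and biases are ignored.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k 2-D convolution, bias omitted.

    With groups > 1, each output channel only sees c_in / groups
    input channels, which divides the weight count by `groups`.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = conv_params(c_in, c_in, k, groups=c_in)  # one k x k filter per channel
    pointwise = conv_params(c_in, c_out, 1)              # mixes channels
    return depthwise + pointwise


# Hypothetical layer size, not a configuration from the thesis.
c_in, c_out, k = 64, 64, 3

standard = conv_params(c_in, c_out, k)
grouped = conv_params(c_in, c_out, k, groups=4)
separable = depthwise_separable_params(c_in, c_out, k)

print(f"standard convolution:            {standard} weights")
print(f"group convolution (4 groups):    {grouped} weights")
print(f"depthwise separable convolution: {separable} weights")
```

For this hypothetical layer, the grouped version needs a quarter of the weights of the standard convolution, and the depthwise separable version needs fewer still. The actual savings in the SpDenseNet models depend on their layer configurations, which this record does not list.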
Appears in Collections:	[Graduate Institute of Electrical Engineering] Theses and Dissertations

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	153

All items in NCUIR are protected by original copyright.
