As human-computer interaction grows in importance, voice assistants that use speech recognition to drive or control devices are becoming increasingly common. However, many voice assistant users are concerned about personal privacy, because most voice assistants on the market capture voice commands and upload them to the cloud for processing, where the clips are temporarily stored by the service provider. On-device keyword spotting addresses this problem and is therefore an important task in voice-based human-computer interaction. To preserve privacy, recognition must run on resource-constrained devices, so the goal of this task is to maximize accuracy within a limited cost. This paper describes the application of Densely Connected Convolutional Networks (DenseNet) to the keyword spotting task. To reduce the number of parameters, we replace standard convolutions with group convolutions and depthwise separable convolutions. To increase accuracy, we add Squeeze-and-Excitation Networks (SENet) to strengthen the weights of important features.
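The squeeze-and-excitation step mentioned above can be illustrated with a minimal pure-Python sketch. This is a generic SE block, not the thesis's implementation: the function name `squeeze_excite`, the weight matrices `w1` and `w2`, the omission of biases, and the tiny tensor sizes are all assumptions chosen for readability.

```python
import math

def squeeze_excite(feature_maps, w1, w2):
    """Reweight channels. feature_maps is a list of C channels, each a 2D list.

    w1 has shape (C // r, C) and w2 has shape (C, C // r); biases are omitted
    in this sketch (an assumption, not taken from the thesis).
    """
    # Squeeze: global average pooling yields one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation, layer 1: bottleneck fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, layer 2: expand back to C channels, sigmoid gate in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(feature_maps, gates)]
    return scaled, gates

# Hypothetical toy input: 2 channels of size 2x2, reduction ratio r = 2.
toy = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
toy_scaled, toy_gates = squeeze_excite(toy, w1=[[1.0, -1.0]], w2=[[1.0], [-1.0]])
```

In a trained network the gates differ per channel, so informative channels are amplified relative to the others; the toy weights here are only meant to exercise the three steps.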
To investigate the effect of different convolutions on DenseNet, we built three models, SpDenseNet-G, SpDenseNet-D, and SpDenseNet-L, and generated compact variants of each. We validated the networks on the Google Speech Commands Dataset. Our proposed networks achieve higher accuracy than other networks while using fewer parameters and floating-point operations (FLOPs). SpDenseNet-D achieves 96.3% accuracy with 122.63K trainable parameters and 142.7M FLOPs; compared with the benchmark paper, it uses only about 52% of the parameters and about 12% of the FLOPs. In addition, we varied the depth and width of the network to build compact variants, which also outperform the compact variants in other papers. SpDenseNet-L-narrow achieves 93.6% accuracy with 9.27K trainable parameters and 3.47M FLOPs. Compared with the benchmark paper, our compact model improves accuracy by 3.5% while using only about 47% of the parameters and about 48% of the FLOPs.
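To give a rough sense of why group and depthwise separable convolutions reduce the parameter count, the sketch below counts the weights of a single layer under each scheme. The helper names and the layer size (64 input channels, 64 output channels, 3x3 kernel, 4 groups) are hypothetical and are not taken from the SpDenseNet architecture; biases are ignored.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution split into `groups` groups (no bias)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in // groups input channels.
    return (c_in // groups) * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a pointwise 1 x 1 convolution."""
    depthwise = conv_params(c_in, c_in, k, groups=c_in)  # one filter per channel
    pointwise = conv_params(c_in, c_out, 1)              # mixes channels
    return depthwise + pointwise

# Hypothetical layer: 64 -> 64 channels, 3x3 kernel.
standard = conv_params(64, 64, 3)                  # 36864 weights
grouped = conv_params(64, 64, 3, groups=4)         # 9216 weights
separable = depthwise_separable_params(64, 64, 3)  # 4672 weights
```

For this assumed layer, grouping with 4 groups cuts the weights by a factor of 4, and the depthwise separable form cuts them by roughly a factor of 8. The savings in a full network depend on its actual layer sizes.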