dc.description.abstract | In recent years, deep learning technology becomes more popular because of the improvement of GPU and the advent of big data. The deep learning has brought revolutionary promotion in various fields. Most traditional algorithms are replaced by deep learning technologies such as basic pre-image processing, image segmentation, face recognition, speech recognition, etc. That shows the rise of the neural network has led to the reform of artificial intelligence. However, the neural network is limited by the power consumption and cost of the GPU, its products are extremely expensive. Due to the large amount of computation of the neural network, the neural network has to be used with the hardware accelerator for real-time computing. The problem of the computation of the neural network has promoted a lot of research for convolution network accelerator digital circuit hardware design.
This paper proposed a design and implementation of a separable convolution accelerator based on HWCK data scheduling. It can be used to accelerate the deep separable convolution model by the design of the deepwise convolution, pointwise convolution, and the batch normalization. The proposed system can be through the SoC design to let the PS (Processing System) and PL (Programmable Logic) communicated with each other by using the AXI4 bus protocol, so our proposed design can be used when the CPU needs to accelerate the neural network. This HWCK data scheduling method can be reconfigured by the allocated memory and the bandwidth resource on the DDR4 and can be easily extended our design when the bandwidth and memory are sufficient. To reduce the weight parameter of the neural network, the data are calculated and stored with a 16bits fixed-point. The memory access is carried out with the architecture of ping-pong memory, it can transmit the data through the AXI4 bus protocol. The while hardware design architecture can be implemented on the Xilinx ZCU106 development board. The SoC design which using a precompiled driver to communicate operating systems and external resources, and control the design of the neural network acceleration module on FPGA. The higher program language to quickly reconfigure the network schedule, it can improve the hardware reconfigurable ability. This hardware architecture can reach 222FPS and 60.8GOPS by running FaceNet. The energy consumption on the Xilinx ZCU106 board is 8.82W, it has 6.89GOP/s/W performance. | en_US |