Master's/Doctoral Thesis 108521065: Complete Metadata Record

DC Field                      Value                                                            Language
dc.contributor                Department of Electrical Engineering                             zh_TW
dc.creator                    呂季修                                                           zh_TW
dc.creator                    Chi-Hsiu, Lu                                                     en_US
dc.date.accessioned           2022-08-03T07:39:07Z
dc.date.available             2022-08-03T07:39:07Z
dc.date.issued                2022
dc.identifier.uri             http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=108521065
dc.contributor.department     Department of Electrical Engineering                             zh_TW
dc.description                國立中央大學                                                     zh_TW
dc.description                National Central University                                      en_US
dc.description.abstract       In recent years, with advances in GPUs and the arrival of the big-data era, deep learning has replaced earlier algorithms and brought revolutionary progress to many fields, such as face detection, face recognition, image segmentation, and speech recognition. However, the power consumption and cost of GPUs make it difficult to run computation-heavy neural networks on edge devices, which has motivated much recent research on lightweight neural networks and hardware acceleration with digital circuits. Convolution layers are the most computationally expensive part of the inference stage of a convolutional neural network (CNN), and many architectures have been designed to process them efficiently. Among these designs, the highly pipelined systolic array can effectively accelerate general matrix-matrix multiplication (GEMM). To process convolution in GEMM form, however, every convolution layer requires Image-to-Column (IM2COL) preprocessing, which demands more memory space and repeated memory accesses. This thesis proposes an extended architecture, consisting of the proposed IM2COL circuit and a systolic array, that maximizes data reuse. The goal is to solve the high memory-access problem of GEMM-based CNN accelerators through a memory-reduction method and a matching hardware architecture design. We design a data-reorganizing Transform Unit (TU) for the GEMM unit to reduce the memory accesses caused by the redundant data that IM2COL generates, and we introduce a stripe-based dataflow to reduce the memory requirement of the proposed TU. With proper data reuse, the TU saves roughly 87% of memory accesses. A prototype of the proposed accelerator with 1024 multiply-accumulate (MAC) units reaches 512 GOPS, and the degree of parallelism can be configured for different hardware resources and performance targets.  zh_TW
dc.description.abstract       In recent years, with the advancement of GPUs and the advent of the big-data era, deep learning has replaced previous algorithms, bringing revolutionary progress to various fields such as face detection, face recognition, image segmentation, and speech recognition. However, limited by the power consumption and cost of GPUs, it is difficult to execute computationally intensive neural networks on edge devices. This has prompted a great deal of recent research on lightweight neural networks and hardware acceleration with digital circuits. Convolution layers are the most computationally expensive part of a convolutional neural network (CNN) at the inference stage, and many architectures have been designed to process them efficiently. Among those designs, systolic arrays with highly pipelined structures can accelerate general matrix-matrix multiplication (GEMM) efficiently. However, to process convolution in the form of a GEMM, each layer requires Image-to-Column (IM2COL) preprocessing, which needs larger internal memory and repeated memory accesses. This thesis proposes an extended architecture that combines the proposed IM2COL circuits with systolic arrays to maximize data reuse; the aim is to solve the high memory-access problem of GEMM-based CNN accelerators through a memory-reduction method and hardware architecture design. We design a data-reorganization Transform Unit (TU) for the GEMM unit to reduce the memory accesses caused by the redundant data generated by IM2COL. In addition, we introduce a stripe-based dataflow to reduce the memory requirement of the proposed TU. With proper data reuse, the TU can save around 87% of the memory accesses. A prototype of the proposed accelerator comprising 1024 multiply-accumulate (MAC) units can achieve 512 GOPS, and the parallelization can be configured according to the available hardware resources and performance constraints. (An illustrative IM2COL-as-GEMM sketch follows this record.)  en_US
dc.subject                    CNN accelerator                                                  zh_TW
dc.subject                    memory access optimization                                       zh_TW
dc.subject                    matrix multiplication architecture                               zh_TW
dc.title                      A CNN accelerator with memory access optimization based on a matrix multiplication architecture  zh_TW
dc.language.iso               zh-TW                                                            zh-TW
dc.title                      Memory access optimization for matrix multiplication architecture based CNN accelerator  en_US
dc.type                       Master's / doctoral thesis                                       zh_TW
dc.type                       thesis                                                           en_US
dc.publisher                  National Central University                                      en_US
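The abstract above reduces convolution to a GEMM through IM2COL and then targets the redundant memory traffic that IM2COL creates. The following NumPy sketch is only an illustration of that reduction, assuming made-up names (im2col, conv_as_gemm), shapes, and stride; it is not the thesis's hardware design. It also shows where the redundancy comes from: with a 3x3 kernel and stride 1, each input pixel is copied into up to nine columns, a duplication comparable in magnitude to the roughly 87% memory-access saving the Transform Unit is reported to achieve by reusing data instead of re-fetching it.

import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unfold a (C, H, W) input into a (C*kh*kw, out_h*out_w) matrix.

    Neighbouring columns repeat the overlapping pixels of adjacent
    receptive fields; this duplication is the redundant data that the
    proposed Transform Unit avoids re-reading from memory.
    """
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    col = 0
    for i in range(0, h - kh + 1, stride):
        for j in range(0, w - kw + 1, stride):
            cols[:, col] = x[:, i:i + kh, j:j + kw].ravel()
            col += 1
    return cols, out_h, out_w

def conv_as_gemm(x, weights, stride=1):
    """Express a convolution layer as one GEMM, the operation a systolic array accelerates."""
    m, c, kh, kw = weights.shape                    # M filters of size C x kh x kw
    cols, out_h, out_w = im2col(x, kh, kw, stride)  # (K, N) with K = C*kh*kw
    w_mat = weights.reshape(m, c * kh * kw)         # (M, K)
    out = w_mat @ cols                              # GEMM: (M, K) x (K, N) -> (M, N)
    return out.reshape(m, out_h, out_w)

x = np.random.rand(3, 8, 8).astype(np.float32)      # example C=3, 8x8 feature map
w = np.random.rand(16, 3, 3, 3).astype(np.float32)  # 16 filters with 3x3 kernels
print(conv_as_gemm(x, w).shape)                     # (16, 6, 6)

A software implementation like this must materialize the full cols matrix, which is what inflates the memory footprint and access count; the IM2COL circuit and stripe-based dataflow described in the abstract instead generate and reuse that data on the fly.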
