In recent years, with the advancement of GPUs and the advent of the big-data era, deep learning has displaced earlier algorithms, bringing revolutionary progress to fields such as face detection, face recognition, image segmentation, and speech recognition. However, the power consumption and cost of GPUs make it difficult to execute computationally intensive neural networks on edge devices. This has prompted much recent research on light-weighting neural networks and on hardware acceleration with digital circuits. Convolution layers are the most computationally expensive part of a convolutional neural network (CNN) at the inference stage, and many architectures have been designed to handle them efficiently. Among these designs, systolic arrays with highly pipelined structures can accelerate general matrix-matrix multiplication (GEMM) efficiently. However, to process convolution as a GEMM, each layer requires Image-to-Column (IM2COL) preprocessing, which demands larger internal memory and repeated memory accesses. This paper proposes an extended architecture that combines the proposed IM2COL circuits with systolic arrays to maximize data reuse. The goal is to solve the problem of high memory access in GEMM-based CNN accelerators through a memory-reduction method and a corresponding hardware architecture. We design a data-reorganization transformation unit (TU) for the GEMM unit to reduce the memory accesses caused by the redundant data generated by IM2COL. In addition, we introduce a stripe-based dataflow to reduce the memory requirement of the proposed TU. With proper data reuse, the TU saves around 87% of the memory accesses. A prototype of the proposed accelerator comprising 1024 multiply-accumulate (MAC) units achieves 512 GOPS. The degree of parallelization can be configured according to the available hardware resources and performance constraints.
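To make the IM2COL redundancy concrete, the following is a minimal NumPy sketch, not the paper's hardware circuit or dataflow; the function names `im2col` and `conv2d_gemm` are illustrative. It shows how unfolding the input turns convolution into a single GEMM, and how much the unfolded matrix inflates the input, which is the memory-access overhead the TU is designed to remove.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unfold an input feature map (C, H, W) into columns so that
    convolution becomes one matrix-matrix multiplication (GEMM).
    Each column holds one receptive field; with stride 1, every
    interior input pixel is replicated up to kh*kw times, which is
    the redundancy the paper's transformation unit (TU) targets."""
    c, h, w = x.shape
    oh = (h - kh) // stride + 1
    ow = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[:, i * ow + j] = patch.ravel()
    return cols

def conv2d_gemm(x, weights, stride=1):
    """Convolution as GEMM: (M, C*kh*kw) @ (C*kh*kw, oh*ow)."""
    m, c, kh, kw = weights.shape
    cols = im2col(x, kh, kw, stride)
    oh = (x.shape[1] - kh) // stride + 1
    ow = (x.shape[2] - kw) // stride + 1
    return (weights.reshape(m, -1) @ cols).reshape(m, oh, ow)

# Example: a 3x3 kernel over a 64-channel 64x64 map (illustrative
# sizes, not from the paper).
x = np.random.rand(64, 64, 64).astype(np.float32)
w = np.random.rand(128, 64, 3, 3).astype(np.float32)
cols = im2col(x, 3, 3)
print(cols.nbytes / x.nbytes)   # ~8.4x data expansion from IM2COL
y = conv2d_gemm(x, w)
print(y.shape)                  # (128, 62, 62)
```

As the printed ratio suggests, naive IM2COL with a 3x3 kernel and stride 1 re-reads each interior pixel up to 9 times, so fetching each pixel once and reusing it across all overlapping windows can eliminate up to 8/9 (about 89%) of the accesses, the same order of magnitude as the roughly 87% saving reported above.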