Master's/Doctoral Thesis 108521065: Complete Metadata Record

DC Field                      Value                                                            Language
dc.contributor                Department of Electrical Engineering                             zh_TW
dc.creator                    呂季修                                                           zh_TW
dc.creator                    Chi-Hsiu, Lu                                                     en_US
dc.date.accessioned           2022-08-03T07:39:07Z
dc.date.available             2022-08-03T07:39:07Z
dc.date.issued                2022
dc.identifier.uri             http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=108521065
dc.contributor.department     Department of Electrical Engineering                             zh_TW
dc.description                國立中央大學                                                     zh_TW
dc.description                National Central University                                      en_US
dc.description.abstract       In recent years, with advances in GPUs and the arrival of the big-data era, deep learning has replaced earlier algorithms and brought revolutionary progress to many fields, such as face detection, face recognition, image segmentation, and speech recognition. However, the power consumption and cost of GPUs make it difficult to run computation-heavy neural networks on edge devices, which has motivated much recent research on lightweight neural networks and hardware acceleration with digital circuits. Convolution layers are the most computationally expensive part of the inference stage of a convolutional neural network (CNN), and many architectures have been designed to process them efficiently. Among these designs, the highly pipelined systolic array can effectively accelerate general matrix-matrix multiplication (GEMM). To process convolution in GEMM form, however, every convolution layer requires Image-to-Column (IM2COL) preprocessing, which demands more memory space and repeated memory accesses. This thesis proposes an extended architecture, consisting of the proposed IM2COL circuit and a systolic array, that maximizes data reuse. The goal is to solve the high memory-access problem of GEMM-based CNN accelerators through a memory-reduction method and a matching hardware architecture design. We design a data-reorganizing Transform Unit (TU) for the GEMM unit to reduce the memory accesses caused by the redundant data that IM2COL generates, and we introduce a stripe-based dataflow to reduce the memory requirement of the proposed TU. With proper data reuse, the TU saves roughly 87% of memory accesses. A prototype of the proposed accelerator with 1024 multiply-accumulate (MAC) units reaches 512 GOPS, and the degree of parallelism can be configured for different hardware resources and performance targets.  zh_TW
dc.description.abstract       In recent years, with the advancement of GPUs and the advent of the big-data era, deep learning has replaced previous algorithms, bringing revolutionary progress to various fields such as face detection, face recognition, image segmentation, and speech recognition. However, limited by the power consumption and cost of GPUs, it is difficult to execute computationally intensive neural networks on edge devices. This has prompted a great deal of recent research on lightweight neural networks and hardware acceleration with digital circuits. Convolution layers are the most computationally expensive part of a convolutional neural network (CNN) at the inference stage, and many architectures have been designed to process them efficiently. Among those designs, systolic arrays with highly pipelined structures can accelerate general matrix-matrix multiplication (GEMM) efficiently. However, to process convolution in the form of a GEMM, each layer requires Image-to-Column (IM2COL) preprocessing, which needs larger internal memory and repeated memory accesses. This thesis proposes an extended architecture that combines the proposed IM2COL circuits with systolic arrays to maximize data reuse; the aim is to solve the high memory-access problem of GEMM-based CNN accelerators through a memory-reduction method and hardware architecture design. We design a data-reorganization Transform Unit (TU) for the GEMM unit to reduce the memory accesses caused by the redundant data generated by IM2COL. In addition, we introduce a stripe-based dataflow to reduce the memory requirement of the proposed TU. With proper data reuse, the TU can save around 87% of the memory accesses. A prototype of the proposed accelerator comprising 1024 multiply-accumulate (MAC) units can achieve 512 GOPS, and the parallelization can be configured according to the available hardware resources and performance constraints. (An illustrative IM2COL-as-GEMM sketch follows this record.)  en_US
dc.subject                    CNN accelerator                                                  zh_TW
dc.subject                    memory access optimization                                       zh_TW
dc.subject                    matrix multiplication architecture                               zh_TW
dc.title                      A CNN accelerator with memory access optimization based on a matrix multiplication architecture  zh_TW
dc.language.iso               zh-TW                                                            zh-TW
dc.title                      Memory access optimization for matrix multiplication architecture based CNN accelerator  en_US
dc.type                       Master's / doctoral thesis                                       zh_TW
dc.type                       thesis                                                           en_US
dc.publisher                  National Central University                                      en_US
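The abstract above reduces convolution to a GEMM through IM2COL and then targets the redundant memory traffic that IM2COL creates. The following NumPy sketch is only an illustration of that reduction, assuming made-up names (im2col, conv_as_gemm), shapes, and stride; it is not the thesis's hardware design. It also shows where the redundancy comes from: with a 3x3 kernel and stride 1, each input pixel is copied into up to nine columns, a duplication comparable in magnitude to the roughly 87% memory-access saving the Transform Unit is reported to achieve by reusing data instead of re-fetching it.

import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unfold a (C, H, W) input into a (C*kh*kw, out_h*out_w) matrix.

    Neighbouring columns repeat the overlapping pixels of adjacent
    receptive fields; this duplication is the redundant data that the
    proposed Transform Unit avoids re-reading from memory.
    """
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    col = 0
    for i in range(0, h - kh + 1, stride):
        for j in range(0, w - kw + 1, stride):
            cols[:, col] = x[:, i:i + kh, j:j + kw].ravel()
            col += 1
    return cols, out_h, out_w

def conv_as_gemm(x, weights, stride=1):
    """Express a convolution layer as one GEMM, the operation a systolic array accelerates."""
    m, c, kh, kw = weights.shape                    # M filters of size C x kh x kw
    cols, out_h, out_w = im2col(x, kh, kw, stride)  # (K, N) with K = C*kh*kw
    w_mat = weights.reshape(m, c * kh * kw)         # (M, K)
    out = w_mat @ cols                              # GEMM: (M, K) x (K, N) -> (M, N)
    return out.reshape(m, out_h, out_w)

x = np.random.rand(3, 8, 8).astype(np.float32)      # example C=3, 8x8 feature map
w = np.random.rand(16, 3, 3, 3).astype(np.float32)  # 16 filters with 3x3 kernels
print(conv_as_gemm(x, w).shape)                     # (16, 6, 6)

A software implementation like this must materialize the full cols matrix, which is what inflates the memory footprint and access count; the IM2COL circuit and stripe-based dataflow described in the abstract instead generate and reuse that data on the fly.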
