Thesis 108521065: Detailed Record




Name: 呂季修 (Chi-Hsiu Lu)    Department: Electrical Engineering
Thesis Title: Memory access optimization for matrix multiplication architecture based CNN accelerator
Related Theses:
★ Low-memory hardware design for real-time SIFT feature-point extraction
★ Real-time face detection and face recognition for an access control system
★ Autonomous cart with real-time automatic following
★ Lossless compression algorithm for multi-lead ECG signals and its implementation
★ Offline customizable voice wake-word and speaker system with embedded implementation
★ Wafer map defect classification and its embedded system implementation
★ Densely connected convolutional networks for speech applied to small-footprint keyword spotting
★ G2LGAN: data augmentation of imbalanced datasets for wafer map defect classification
★ Algorithmic design techniques for compensating the finite precision of multiplierless digital filters
★ Design and implementation of a programmable Viterbi decoder
★ Low-cost vector rotator IP design based on extended elementary-angle CORDIC
★ Analysis and architecture design of a JPEG2000 still-image coding system
★ Low-power turbo decoder for communication systems
★ Platform-based design for multimedia communications
★ Design and implementation of a digital watermarking system for MPEG encoders
★ Algorithm development for video error concealment and its data-reuse considerations
Files: Full text available in the university's system after 2024-08-31.
Abstract (Chinese): In recent years, with the progress of GPUs and the arrival of the big-data era, deep learning has replaced earlier algorithms and brought revolutionary advances to many fields, such as face detection, face recognition, image segmentation, and speech recognition. However, the power consumption and cost of GPUs make it difficult to run computationally demanding neural networks on edge devices. This has prompted considerable recent research on lightweight neural networks and hardware acceleration with digital circuits.
Convolution layers are the most computationally expensive part of the inference stage of a convolutional neural network (CNN), and many studies have designed architectures to process them efficiently. Among these designs, the systolic array, with its highly pipelined structure, can effectively accelerate general matrix-matrix multiplication (GEMM). However, to process convolution in GEMM form, every convolution layer requires Image-to-Column (IM2COL) preprocessing, which demands more memory space and repeated memory accesses. This thesis proposes an extended architecture, consisting of the proposed IM2COL circuit and a systolic array, that maximizes data reuse. The aim is to solve the high-memory-access problem of GEMM-based CNN accelerators through a memory reduction method and hardware architecture design. We design a data-reorganizing Transform Unit (TU) for the GEMM unit to reduce the memory accesses caused by the redundant data that IM2COL generates, and we introduce a stripe-based dataflow to reduce the memory requirement of the proposed TU. With proper data reuse, the TU saves about 87% of memory accesses. The prototype of the proposed accelerator consists of 1024 multiply-accumulate (MAC) units and reaches 512 GOPS; the degree of parallelism can be configured for different hardware resources and performance requirements.
Abstract (English): In recent years, with the advancement of GPUs and the advent of the big-data era, deep learning has replaced previous algorithms, bringing revolutionary progress to various fields such as face detection, face recognition, image segmentation, and speech recognition. However, limited by the power consumption and cost of GPUs, it is difficult to execute computationally intensive neural networks on edge devices. This has prompted much research in recent years on lightweight neural network design and hardware acceleration with digital circuits.
Convolution layers are the most computationally expensive part of a convolutional neural network (CNN) at the inference stage, and many architectures have been designed to handle them efficiently. Among those designs, systolic arrays with highly pipelined structures can accelerate general matrix-matrix multiplication (GEMM) efficiently. However, to process convolution in GEMM form, each layer requires Image-to-Column (IM2COL) preprocessing, which needs larger internal memory and repeated memory accesses. This thesis proposes an extended architecture that includes the proposed IM2COL circuits and systolic arrays to maximize data reuse. The aim is to solve the high-memory-access problem of GEMM-based CNN accelerators through a memory reduction method and hardware architecture design. We design a data-reorganization Transform Unit (TU) for the GEMM unit to reduce the memory accesses for the redundant data generated by IM2COL. In addition, we introduce a stripe-based dataflow to reduce the memory requirement of the proposed TU. With proper data reuse, the TU can save around 87% of memory accesses. The prototype of our proposed accelerator, comprising 1024 multiply-accumulate (MAC) units, can achieve 512 GOPS. The parallelization can be configured according to the available hardware resources and performance constraints.
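To make the IM2COL redundancy concrete, the following is a minimal NumPy sketch (illustrative only; the single input channel, single 3x3 filter, stride 1, and toy array sizes are assumptions for exposition, not the thesis's configuration). It shows how IM2COL inflates a feature map before a systolic array consumes it as a GEMM:

import numpy as np

def im2col(x, k):
    # Flatten every k-by-k sliding window (stride 1, no padding) into one row.
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((out_h * out_w, k * k))
    for i in range(out_h):
        for j in range(out_w):
            # Adjacent windows overlap in k*(k-1) pixels, so most input
            # values are copied into several rows: the IM2COL redundancy.
            cols[i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

x = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 feature map
w = np.arange(9, dtype=float).reshape(3, 3)    # toy 3x3 kernel
cols = im2col(x, 3)                            # shape (16, 9)

# Convolution as GEMM: one filter gives a matrix-vector product;
# with F filters the right-hand operand becomes a (9, F) matrix.
y_gemm = (cols @ w.ravel()).reshape(4, 4)

# Direct sliding-window convolution, for cross-checking.
y_ref = np.array([[(x[i:i + 3, j:j + 3] * w).sum() for j in range(4)]
                  for i in range(4)])
assert np.allclose(y_gemm, y_ref)

Here a 6x6 map (36 values) becomes a 16x9 matrix (144 values), a 4x blow-up that approaches K^2 = 9x for large feature maps; this duplicated traffic is what a unit like the proposed TU avoids by reorganizing and reusing the overlapping window data on chip.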
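As a rough sanity check on the headline numbers (a back-of-the-envelope sketch; counting one MAC as two operations per cycle and assuming a 3x3, stride-1 convolution and a 250 MHz clock, none of which this record states):

\[ 1024 \ \text{MACs} \times 2 \ \tfrac{\text{ops}}{\text{cycle}} \times 250 \ \text{MHz} = 512 \ \text{GOPS} \]

For a $K \times K$ kernel at stride 1, IM2COL copies each interior input pixel into up to $K^2$ rows, so on-chip reuse can remove up to $1 - 1/K^2 = 1 - 1/9 \approx 88.9\%$ of the input fetches when $K = 3$, which is consistent with the roughly 87% memory-access saving reported for the TU.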
Keywords (Chinese): ★ CNN accelerator
★ Memory access optimization
★ Matrix multiplication architecture
Keywords (English):
Table of Contents: Abstract (Chinese) I
Abstract (English) II
1. Introduction 1
1.1. Research Background 1
1.2. Research Motivation 3
1.3. Research Contributions 4
1.4. Thesis Organization 4
2. Literature Review 5
2.1. Single Instruction Multiple Data (SIMD) 5
2.2. Systolic Array Architecture 8
2.3. Challenges of the Systolic Array Architecture 10
3. Hardware Architecture Design 15
3.1. Overall Hardware Architecture 15
3.2. Data Preprocessing 17
3.3. Transform Unit (TU) 19
3.4. Stripe-Based Method 21
3.5. Systolic Array Architecture 23
3.6. Dataflow of the Hardware Architecture 24
4. Hardware Implementation Results 25
4.1. Analysis of Memory Access Counts 25
4.2. Hardware Synthesis Results 27
4.3. Comparison with Related Accelerators 28
4.4. Results on an FPGA Development Board 31
5. Conclusion 35
References 36
Advisor: 蔡宗漢 (Tsung-Han Tsai)    Date of Approval: 2022-08-03