Master's/Doctoral Thesis 106521032: Detailed Record




Name: Yi Tien (田繹)    Department: Electrical Engineering
Thesis Title: Design of a Reconfigurable Deep Neural Network Accelerator
Related Theses
★ Low-Power Design and Test Techniques for Ternary Content Addressable Memories
★ Interconnect Verification Algorithms for Random Access Memories
★ Infrastructure IP for Reducing Test Data Volume in Systems-on-Chip
★ Test Algorithms for Neighborhood Pattern-Sensitive Faults in Content Addressable Memories
★ Testing of Crosstalk Faults on Address and Data Buses of Embedded Memories
★ Self-Repair Techniques for Single-Port and Multi-Port Memories in Systems-on-Chip
★ Infrastructure IP for Repairing Embedded Memories
★ A Platform for Evaluating and Verifying Redundancy Analysis of Self-Repairable Memories
★ Low-Power Ternary Content Addressable Memory Design Using Double-Stacked Multiply-Accumulate Match Lines
★ Design of a Self-Testable and Cost-Efficient Memory-Based Fast Fourier Transform Processor
★ Design of Low-Power and Self-Repairable Ternary Content Addressable Memories
★ Diagnosis Methods for Multi-Core Systems-on-Chip
★ Infrastructure IP for Testing and Repairing RAMs in Networks-on-Chip
★ Efficient Diagnosis and Repair Techniques for Drowsy SRAMs
★ Efficient Diagnostic Data Compression and Design-for-Testability Schemes for Embedded Memories
★ Efficient Yield and Reliability Enhancement Techniques for Random Access Memories
Files: Full text viewable in the repository after 2025-8-20.
Abstract (Chinese) Deep convolutional neural networks (DCNNs) are widely used in artificial intelligence applications such as object recognition and image classification. Modern DCNNs involve large amounts of computation and data, so accelerators are employed to execute DCNN computations and meet the performance requirements of different applications. In this thesis, we propose an architecture exploration method for a DCNN inference system in which a dynamic random access memory (DRAM) stores the data and an accelerator executes the computation. The method defines the accelerator architecture by minimizing the difference between data transfer time and computation time. The accelerator consists of several clusters of processing elements (PEs), a reconfigurable memory unit, and a controller. A switch connects each cluster of the PE array to the reconfigurable memory unit. The reconfigurable memory is composed of three static random access memories, each of which can be resized to fit the memory requirements of different convolutional layers. The configurations of the PE arrays and the reconfigurable memory are determined by a sublayer-based parameter decision flow. Compared with existing works, the proposed accelerator improves hardware utilization by 4.2% for convolutional layers and by 17.4% for the whole DCNN. Based on the proposed reconfigurable architecture, we implemented an accelerator for MobileNet V1 inference on a Xilinx ZCU-102 evaluation board; it contains 1092 KB of SRAM and four clusters of PE arrays, each with 8 PEs. Experimental results show that the accelerator achieves 144 GOPS and an inference rate of 40.1 images per second at an operating frequency of 150 MHz.
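As a quick sanity check, the throughput figures quoted in the abstract can be unpacked with a few lines of arithmetic. This is a rough estimate, not a result from the thesis: the MobileNet V1 operation count (about 1.14 GOPs per 224x224 inference) is the figure commonly cited from the MobileNet paper, and the derived per-PE MAC count and utilization hold only under the stated assumptions.

```python
# Back-of-the-envelope check of the throughput figures in the abstract.
# Assumption (not from this record): MobileNet V1 (224x224, width 1.0)
# needs ~0.57 G MACs = ~1.14 GOPs per inference.

PEAK_GOPS = 144                    # peak throughput reported in the abstract
CLOCK_MHZ = 150                    # operating frequency reported in the abstract
CLUSTERS, PES_PER_CLUSTER = 4, 8   # 4 clusters of 8 PEs = 32 PEs
FPS = 40.1                         # reported inference rate
MOBILENET_GOPS_PER_FRAME = 1.14    # assumed workload size (see note above)

ops_per_cycle = PEAK_GOPS * 1e9 / (CLOCK_MHZ * 1e6)          # 960 ops/cycle
macs_per_cycle = ops_per_cycle / 2                           # 1 MAC = 2 ops -> 480
macs_per_pe = macs_per_cycle / (CLUSTERS * PES_PER_CLUSTER)  # 15 MACs per PE

effective_gops = FPS * MOBILENET_GOPS_PER_FRAME  # ~45.7 GOPS sustained
utilization = effective_gops / PEAK_GOPS         # ~32% of peak

print(f"{ops_per_cycle:.0f} ops/cycle, {macs_per_pe:.0f} MACs per PE per cycle")
print(f"~{effective_gops:.1f} GOPS effective, ~{utilization:.0%} of peak")
```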
Abstract (English) Deep convolutional neural networks (DCNNs) are widely used in artificial intelligence applications, e.g., object recognition and image classification. A modern DCNN model usually involves a huge amount of computation and data, so an accelerator is usually designed to execute the DCNN computation and meet the performance requirements of the application.
In this thesis, we consider a DCNN inference system that uses a DRAM to store data and an accelerator to execute the computation. An architecture exploration method, based on minimizing the difference between the DRAM data access time and the computation time, is proposed to define the accelerator architecture. The accelerator consists of multiple clusters of processing elements (PEs), a reconfigurable memory unit, and a controller. Each cluster of PEs is connected to the reconfigurable memory unit through a switch box. The reconfigurable memory unit consists of three static random access memories whose sizes can be dynamically changed to fit the requirements of different convolutional layers. The configurations of the PE array and the reconfigurable memory are determined by a sublayer-based parameter decision flow, which yields 4.2% and 17.4% higher hardware resource utilization for convolutional layers and the whole DCNN model, respectively, compared with existing works. We implement the MobileNet V1 model on a Xilinx ZCU-102 evaluation board using the proposed reconfigurable accelerator architecture, with 1092 KB of SRAM and four PE clusters of 8 PEs each. Experimental results show that 144 GOPS and 40.1 FPS are achieved at a 150 MHz clock rate.
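The exploration method described above picks an accelerator configuration so that DRAM access time and computation time are as close as possible. The sketch below illustrates only that idea; the bandwidth figure, cost model, and candidate tiling/PE ranges are hypothetical placeholders, not the thesis's actual sublayer-based decision flow.

```python
# Minimal sketch of the exploration idea from the abstract: choose tiling and
# PE-count factors that minimize the gap between DRAM transfer time and
# computation time for a layer. All numbers and the cost model are hypothetical.
from itertools import product

DRAM_BW_BYTES_PER_S = 4e9  # assumed effective DRAM bandwidth
CLOCK_HZ = 150e6           # clock rate from the abstract
BYTES_PER_WORD = 2         # assumed 16-bit activations/weights

def explore(layer_macs, layer_words, tile_options, pe_options):
    """Return the (tile, pes) pair minimizing |DRAM time - compute time|."""
    best, best_gap = None, float("inf")
    for tile, pes in product(tile_options, pe_options):
        # Simple assumed model: larger tiles reduce DRAM refetch traffic.
        dram_time = (layer_words * BYTES_PER_WORD / tile) / DRAM_BW_BYTES_PER_S
        compute_time = layer_macs / (pes * CLOCK_HZ)  # 1 MAC per PE per cycle
        gap = abs(dram_time - compute_time)
        if gap < best_gap:
            best, best_gap = (tile, pes), gap
    return best

# Hypothetical layer: 100 M MACs and 20 M words of off-chip traffic.
print(explore(100e6, 20e6, tile_options=[1, 2, 4, 8], pe_options=[8, 16, 32, 64]))
```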
Keywords (Chinese) ★ Hardware Accelerator
★ Deep Neural Network
★ Reconfigurable
Keywords (English) ★ Hardware Accelerator
★ Deep Neural Network
★ Reconfigurable
★ FPGA
Table of Contents
1 Introduction 1
1.1 Deep Neural Network 1
1.2 Deep Convolutional Neural Network Accelerator Architecture 4
1.3 Previous Work 5
1.3.1 Single Instruction Multiple Data Stream DNN Accelerator Architecture 5
1.3.2 Systolic Array DNN Accelerator Architecture 6
1.4 Motivation 6
1.5 Contribution 7
1.6 Thesis Organization 7
2 Architecture Exploration of Reconfigurable DNN Accelerator 8
2.1 DNN Inference System 8
2.2 Roofline Model Analysis and Sublayer-Based Pipeline Inference Flow 9
2.3 For Loop Analysis of Convolution Operation 12
2.4 Sublayer-Based Parameters Decision Flow 14
2.4.1 Tiling Factors Analysis 15
2.4.2 Loop Unrolling Factors Analysis 20
2.4.3 On-Chip Memory Analysis 21
2.5 Analysis Results 24
3 Proposed Reconfigurable DNN Accelerator Architecture 30
3.1 Architecture of Reconfigurable DNN Accelerator 30
3.2 Micro Instruction Set 31
3.3 On-chip Interconnections and On-Chip Data Reuse 34
3.3.1 Input Feature Map Data Reuse Strategy 35
3.3.2 Weight Data Reuse Strategy 35
3.3.3 Output Feature Map Data Flow 37
3.4 PE Clusters 38
3.5 Subsampling and Fully-Connected Layers 40
3.6 Analysis Results 41
4 A Case Study of MobileNet V1 44
4.1 MobileNet V1 44
4.2 Validation Platform 48
4.2.1 Platform Resources of Xilinx ZCU-102 Evaluation Board 48
4.2.2 AXI-4 Interface Protocol 49
4.3 Implementation Details 49
4.3.1 VIVADO Block Diagram 49
4.3.2 PE Array Clusters and Reconfigurable On-Chip Memory 51
4.4 Implementation Results 54
5 Conclusion and Future Work 60
5.1 Conclusion 60
5.2 Future Work 61
Advisor: Jin-Fu Li (李進福)    Review Date: 2020-8-20
