Master's/Doctoral Thesis 106521034 — Detailed Record




Name: Sheng-Tse Tai (戴勝澤)    Department: Electrical Engineering
Thesis Title: 應用於深度類神經網路加速系統之層融合能耗減低技術透過最小化動態隨機記憶體存取
(Layer-Fusing Energy Reduction Techniques for Deep Neural Network Acceleration Systems by Minimizing DRAM Access)
Related Theses
★ 應用於三元內容定址記憶體之低功率設計與測試技術
★ 用於隨機存取記憶體的接線驗證演算法
★ 用於降低系統晶片內測試資料之基礎矽智產
★ 內容定址記憶體之鄰近區域樣型敏感瑕疵測試演算法
★ 內嵌式記憶體中位址及資料匯流排之串音瑕疵測試
★ 用於系統晶片中單埠與多埠記憶體之自我修復技術
★ 用於修復嵌入式記憶體之基礎矽智產
★ 自我修復記憶體之備份分析評估與驗證平台
★ 使用雙倍疊乘累加命中線之低功率三元內容定址記憶體設計
★ 可自我測試且具成本效益之記憶體式快速傅利葉轉換處理器設計
★ 低功率與可自我修復之三元內容定址記憶體設計
★ 多核心系統晶片之診斷方法
★ 應用於網路晶片上隨機存取記憶體測試及修復之基礎矽智產
★ 應用於貪睡靜態記憶體之有效診斷與修復技術
★ 應用於內嵌式記憶體之高效率診斷性資料壓縮與可測性方案
★ 應用於隨機存取記憶體之有效良率及可靠度提升技術
Files: Full text viewable in the thesis system after 2026-01-28.
Abstract (Chinese) 近年來,深度神經網絡(DNN)已被廣泛使用於人工智能應用上。DNN 加速系統通常會使用動態隨機存取記憶體 (DRAM) 來儲存資料,而運算會由一個加速器負責。然而,存取 DRAM 所消耗的能量通常占了 DNN 加速系統的大部分能量。在本文中,我們提出一個適應性融合層方法 (ALFA),藉由最小化 DRAM 存取的數量來降低整個加速系統的能量消耗。ALFA 在給定的融合層中的每一層適應性地最大化重複利用輸入特徵圖 (input feature map)、權重 (weight) 或輸出特徵圖 (output feature map),來找到能夠有最小 DRAM 存取數量的組合。分析結果顯示,如果加速器中的記憶體 (on-chip buffer size) 為 128 KB 且用於融合 AlexNet 的第 1 層到第 4 層時,ALFA 可以比 [1] 中報告的方法減少 27% 的 DRAM 存取數量。此外,我們還提出了系統化的方法來決定一個 DNN 模型中有多少層需要用 ALFA 來融合。分析結果顯示,如果加速器中的記憶體 (on-chip buffer size) 為 128 KB 且應用於模型 VGG16 上,所提出的方法相較於採用 [2] 中報告的方法可減少 34% 的 DRAM 存取數量。我們設計了一個可以支援 ALFA 運算的加速器,並使用台積電 40nm CMOS standard cell library 合成。加速器在頻率為 200 MHz 時使用 256 個乘法器和 256 個加法器,可達到峰值性能 (peak performance) 102.4 GOPS。另外,合成結果顯示加速器的功耗和面積成本在頻率為 200 MHz 時分別為 195 mW 和 5.214 mm²。
Abstract (English) Deep neural networks (DNNs) have been widely used for artificial intelligence applications. A DNN acceleration system typically consists of a dynamic random access memory (DRAM) for data buffering and an accelerator for the computation. However, DRAM accesses typically consume a significant portion of the energy of the DNN acceleration system. In this thesis, we propose an adaptive layer-fusing approach (ALFA) that reduces the energy consumption of DRAM by minimizing the number of DRAM accesses. The ALFA adaptively maximizes the reuse of the input feature map, weights, and output feature map in every layer of the given fused layers. Analysis results show that the ALFA achieves a 27% reduction in DRAM accesses compared with the approach reported in [1] when a 128 KB on-chip buffer is used for fusing convolution layers 1 to 4 of AlexNet. We also propose a systematic method to determine the number of layers fused by the ALFA for a DNN model. Analysis results show that the proposed method with the ALFA achieves a 34% reduction in DRAM accesses compared with the approach reported in [2] when a 128 KB on-chip buffer is used for VGG16. An accelerator supporting the ALFA is designed and synthesized using the TSMC 40 nm CMOS standard cell library. The accelerator achieves a peak performance of 102.4 GOPS with 256 multipliers and 256 adders at 200 MHz. Synthesis results show that the power consumption and area cost of the accelerator at 200 MHz are 195 mW and 5.214 mm², respectively.
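To make the quantity being minimized concrete, the following Python sketch contrasts the DRAM traffic of running a short chain of convolution layers layer by layer against fusing the whole chain so that the intermediate feature maps never leave the on-chip buffer. This is only a simplified illustration, not the thesis's ALFA: the layer shapes, the 8-bit data assumption, and the idealized fusion model (no tiling, no halo re-computation, intermediates assumed to fit on chip) are assumptions added here for illustration.

from dataclasses import dataclass

@dataclass
class ConvLayer:
    in_ch: int   # input channels
    out_ch: int  # output channels
    out_h: int   # output feature-map height
    out_w: int   # output feature-map width
    k: int       # square kernel size

    # Stride 1 and "same" padding are assumed, so the input feature map
    # has the same height/width as the output feature map.
    def ifmap_bytes(self, bytes_per_elem=1):
        return self.in_ch * self.out_h * self.out_w * bytes_per_elem

    def ofmap_bytes(self, bytes_per_elem=1):
        return self.out_ch * self.out_h * self.out_w * bytes_per_elem

    def weight_bytes(self, bytes_per_elem=1):
        return self.in_ch * self.out_ch * self.k * self.k * bytes_per_elem

def dram_traffic_layer_by_layer(layers):
    # Every layer reads its input feature map and weights from DRAM and
    # writes its output feature map back, so every intermediate feature
    # map crosses the DRAM interface twice (once out, once back in).
    return sum(l.ifmap_bytes() + l.weight_bytes() + l.ofmap_bytes()
               for l in layers)

def dram_traffic_fused(layers):
    # Idealized fusion of the whole chain: only the first input feature
    # map is read, only the last output feature map is written, and the
    # weights are read once; intermediate feature maps are assumed to
    # stay in the on-chip buffer.
    return (layers[0].ifmap_bytes() + layers[-1].ofmap_bytes()
            + sum(l.weight_bytes() for l in layers))

if __name__ == "__main__":
    # Hypothetical 3x3 convolution chain, loosely VGG-like, 8-bit data.
    chain = [
        ConvLayer(in_ch=3,  out_ch=64, out_h=224, out_w=224, k=3),
        ConvLayer(in_ch=64, out_ch=64, out_h=224, out_w=224, k=3),
    ]
    base = dram_traffic_layer_by_layer(chain)
    fused = dram_traffic_fused(chain)
    print(f"layer-by-layer: {base / 1e6:.2f} MB")
    print(f"fused chain:    {fused / 1e6:.2f} MB "
          f"({100 * (1 - fused / base):.1f}% fewer DRAM bytes)")

In this simplified model the intermediate feature maps dominate the traffic, which is why fusing adjacent layers pays off; the ALFA described above additionally works under a fixed on-chip buffer budget (e.g., 128 KB) and adaptively chooses, per layer of the fused group, whether to maximize reuse of the input feature map, the weights, or the output feature map. As a sanity check on the accelerator figures quoted above, 256 multipliers and 256 adders each performing one operation per cycle give 512 operations per cycle, and 512 × 200 MHz = 102.4 GOPS.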
Keywords (Chinese) ★ 類神經網路加速系統 (neural network acceleration system)
★ 最小化動態隨機記憶體存取 (minimizing DRAM access)
Keywords (English) ★ acceleration system
★ Minimize DRAM Access
Table of Contents
1 Introduction
1.1 Deep Convolution Neural Network
1.2 DRAM Access Minimization Techniques
1.3 Motivation
1.4 Thesis Contribution
1.5 Thesis Organization
2 Proposed Adaptive Layer-Fusing Approach
2.1 Background
2.1.1 Convolution Loops and Optimization Techniques
2.1.2 Single Layer DRAM Access Optimization and Loop Orders
2.2 Tiling Factor Relationship
2.2.1 2D Tiling Factor Relationship
2.2.2 3D Tiling Factor Relationship
2.3 Adaptive Layer-Fusing Approach (ALFA)
2.3.1 Overview
2.3.2 Pair-up Relationship
2.3.3 Intermediate Data Reduced by Fusing Adjacent Layers
2.3.4 Fusing Pooling Layers
2.4 Analysis Results and Comparison
2.4.1 Comparison with [3]
2.4.2 Comparison with [1]
2.5 Summary
3 A Systematic Method to Determine How to Do Layer-Fusing for an Entire Model
3.1 Proposed Method
3.1.1 Overall Flow
3.1.2 Details of the Proposed Method
3.2 Supporting Different Types of Models
3.3 Analysis Results
3.4 Summary
4 Design of an Accelerator Supporting Layer-Fusing
4.1 Accelerator Overview
4.1.1 Computation Core
4.1.2 Global Buffer
4.1.3 Controller
4.2 Computation Core and Computation Flow Mapping
4.3 Results
4.4 Summary
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
References
[1] H.-N. Wu and C.-T. Huang, “Data Locality Optimization of Depthwise Separable Convolutions
for CNN Inference Accelerators,” in Design, Automation & Test in Europe Conference
& Exhibition (DATE), 2019, pp. 120–125.
[2] Q. Sun, T. Chen, J. Miao, and B. Yu, “Power-driven DNN dataflow optimization on FPGA,”
in International Conference on Computer-Aided Design (ICCAD), 2019, pp. 1–7.
[3] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN accelerators,” in International
Symposium on Microarchitecture (MICRO), 2016, pp. 1–12.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in
Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp.
770–778.
[5] X. Yang, M. Gao, J. Pu, A. Nayak, Q. Liu, S. E. Bell, J. O. Setter, K. Cao, H. Ha, C. Kozyrakis
et al., “DNN dataflow choice is overrated,” arXiv preprint arXiv:1809.04070, 2018.
[6] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and
H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,”
arXiv preprint arXiv:1704.04861, 2017.
[7] Y. LeCun et al., “LeNet-5, convolutional neural networks,” URL: http://yann.lecun.com/exdb/lenet, vol. 20, p. 5, 2015.
[8] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection
with region proposal networks,” in Advances in neural information processing systems, 2015,
pp. 91–99.
[9] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time
object detection,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2016, pp. 779–788.
[10] M. J. Shafiee, B. Chywl, F. Li, and A. Wong, “Fast YOLO: A fast you only look once system
for real-time embedded object detection in video,” arXiv preprint arXiv:1709.05943, 2017.
[11] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted
residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2018, pp. 4510–4520.
[12] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “Mnasnet:
Platform-aware neural architecture search for mobile,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2019, pp. 2820–2828.
[13] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: A small-footprint
high-throughput accelerator for ubiquitous machine-learning,” ACM SIGARCH
Computer Architecture News, vol. 42, no. 1, pp. 269–284, 2014.
[14] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al.,
“Dadiannao: A machine-learning supercomputer,” in International Symposium on Microarchitecture,
2014, pp. 609–622.
[15] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “Shidiannao:
Shifting vision processing closer to the sensor,” in Proceedings of the International
Symposium on Computer Architecture, 2015, pp. 92–104.
[16] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-X:
An accelerator for sparse neural networks,” in International Symposium on Microarchitecture
(MICRO), 2016, pp. 1–12.
[17] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: efficient inference
engine on compressed deep neural network,” ACM SIGARCH Computer Architecture
News, vol. 44, no. 3, pp. 243–254, 2016.
[18] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable
accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits,
vol. 52, no. 1, pp. 127–138, 2016.
[19] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia,
N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing
unit,” in Proceedings of the International Symposium on Computer Architecture, 2017, pp.
1–12.
[20] L. Du, Y. Du, Y. Li, J. Su, Y.-C. Kuan, C.-C. Liu, and M.-C. F. Chang, “A reconfigurable
streaming deep convolutional neural network accelerator for Internet of Things,” IEEE Transactions
on Circuits and Systems I: Regular Papers, vol. 65, no. 1, pp. 198–208, 2017.
[21] S. Yin, P. Ouyang, S. Tang, F. Tu, X. Li, S. Zheng, T. Lu, J. Gu, L. Liu, and S. Wei, “A high
energy efficient reconfigurable hybrid neural network processor for deep learning applications,”
IEEE Journal of Solid-State Circuits, vol. 53, no. 4, pp. 968–982, 2017.
[22] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “Flexflow: A flexible dataflow accelerator
architecture for convolutional neural networks,” in International Symposium on High
Performance Computer Architecture (HPCA), 2017, pp. 553–564.
[23] Y. Huan, J. Xu, L. Zheng, H. Tenhunen, and Z. Zou, “A 3D tiled low power accelerator for
convolutional neural network,” in International Symposium on Circuits and Systems (ISCAS),
2018, pp. 1–5.
[24] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, “UNPU: An energy-efficient deep
neural network accelerator with fully variable weight bit precision,” IEEE Journal of Solid-
State Circuits, vol. 54, no. 1, pp. 173–185, 2018.
[25] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator for emerging
deep neural networks on mobile devices,” IEEE Journal on Emerging and Selected Topics in
Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019.
[26] M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis, “Tangram: Optimized coarse-grained
dataflow for scalable NN accelerators,” in Proceedings of the Twenty-Fourth International
Conference on Architectural Support for Programming Languages and Operating
Systems, 2019, pp. 807–820.
[27] S. Cass, “Taking AI to the edge: Google’s TPU now comes in a maker-friendly package,”
IEEE Spectrum, vol. 56, no. 5, pp. 16–17, 2019.
[28] B. Moons and M. Verhelst, “DVAFS: Dynamic-Voltage-Accuracy-Frequency-Scaling Applied
to Scalable Convolutional Neural Network Acceleration,” in System-Scenario-based
Design Principles and Applications, 2020, pp. 99–111.
[29] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P.
Graf, “A massively parallel coprocessor for convolutional neural networks,” in International
Conference on Application-specific Systems, Architectures and Processors, 2009, pp. 53–60.
[30] L. Song, F. Chen, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “AccPar: Tensor Partitioning for Heterogeneous
Deep Learning Accelerators,” in International Symposium on High Performance
Computer Architecture (HPCA), 2020, pp. 342–355.
[31] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator
design for deep convolutional neural networks,” in Proceedings of ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, 2015, pp. 161–170.
[32] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Optimizing the convolution operation to accelerate
deep neural networks on FPGA,” Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 26, no. 7, pp. 1354–1367, 2018.
[33] ——, “Automatic compilation of diverse CNNs onto high-performance FPGA accelerators,”
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
[34] M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in International
Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, pp. 10–14.
[35] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow
for convolutional neural networks,” ACM SIGARCH Computer Architecture News, vol. 44,
no. 3, pp. 367–379, 2016.
[36] J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu, and X. Li, “SmartShuttle: Optimizing off-chip
memory accesses for deep learning accelerators,” in Design, Automation & Test in Europe
Conference & Exhibition (DATE), 2018, pp. 343–348.
[37] X. Yang, J. Pu, B. B. Rister, N. Bhagdikar, S. Richardson, S. Kvatinsky, J. Ragan-Kelley,
A. Pedram, and M. Horowitz, “A systematic approach to blocking convolutional neural networks,”
arXiv preprint arXiv:1606.04209, 2016.
[38] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional
neural networks,” in Advances in neural information processing systems, 2012, pp.
1097–1105.
[39] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image
recognition,” arXiv preprint arXiv:1409.1556, 2014.
[40] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance
model for multicore architectures,” Communications of the ACM, no. 4, pp. 65–76, 2009.
[41] G. Ofenbeck, R. Steinmann, V. Caparros, D. G. Spampinato, and M. Püschel, “Applying
the roofline model,” in International Symposium on Performance Analysis of Systems and
Software (ISPASS), 2014, pp. 76–85.
[42] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, “DNNBuilder: an
automated tool for building high-performance DNN hardware accelerators for FPGAs,” in
International Conference on Computer-Aided Design (ICCAD), 2018, pp. 1–8.
[43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich, “Going Deeper With Convolutions,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June 2015.
[44] H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. Krishna, “Understanding
reuse, performance, and hardware cost of dnn dataflow: A data-centric approach,” in
Proceedings of the International Symposium on Microarchitecture, 2019, pp. 754–768.
[45] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, “Scale-sim: Systolic CNN
accelerator simulator,” arXiv preprint arXiv:1811.02883, 2018.
[46] K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis, and M. Horowitz,
“Towards energy-proportional datacenter memory with mobile DRAM,” in Annual International
Symposium on Computer Architecture (ISCA), 2012, pp. 37–48.
[47] I. G. Thakkar and S. Pasricha, “3D-ProWiz: An energy-efficient and optically-interfaced
3D DRAM architecture with reduced data access overhead,” Transactions on Multi-Scale
Computing Systems, pp. 168–184, 2015.
Advisor: Jin-Fu Li (李進福)    Date of Approval: 2021-01-28