References
[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning Applied to Document Recognition,” Proceedings of the IEEE, pp. 2278–2324, 1998.
[2] Y. S. Shao, J. Clemons, R. Venkatesan, B. Zimmer, M. Fojtik, N. Jiang, B. Keller, A. Klinefelter, N. Pinckney, P. Raina, S. G. Tell, Y. Zhang, W. J. Dally, J. Emer, C. T. Gray, B. Khailany, and S. W. Keckler, “Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture,” in Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019, pp. 14–27.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[4] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” CoRR, 2017.
[5] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826.
[6] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in International Conference on Learning Representations (ICLR), 2015, pp. 1–14.
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going Deeper with Convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[8] A. van den Oord, S. Dieleman, and B. Schrauwen, “Deep Content-Based Music Recommendation,” in Proceedings of International Conference on Neural Information Processing Systems (NIPS), 2013, pp. 2643–2651.
[9] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” in Proceedings of the ISCA Speech Synthesis Workshop (SSW), 2016, p. 125.
[10] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis, “Forecasting Stock Prices from the Limit Order Book Using Convolutional Neural Networks,” in IEEE Conference on Business Informatics (CBI), 2017, pp. 7–12.
[11] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), pp. 211–252, 2015.
[12] K. Yamaguchi, K. Sakamoto, T. Akabane, and Y. Fujimoto, “A Neural Network for Speaker-Independent Isolated Word Recognition,” in Proceedings of International Conference on Spoken Language Processing (ICSLP), 1990, pp. 1077–1080.
[13] D. Ciresan, U. Meier, J. Masci, and J. Schmidhuber, “Multi-Column Deep Neural Network for Traffic Sign Classification,” Neural Networks, pp. 333–338, 2012.
[14] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing,” in ACM/IEEE International Symposium on Computer Architecture (ISCA), 2016, pp. 1–13.
[15] Y. H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” in ACM/IEEE International Symposium on Computer Architecture (ISCA), 2016, pp. 367–379.
[16] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting Vision Processing Closer to the Sensor,” in ACM/IEEE International Symposium on Computer Architecture (ISCA), 2015, pp. 92–104.
[17] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” in Proceedings of International Symposium on Computer Architecture (ISCA), 2016, pp. 243–254.
[18] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks,” in ACM/IEEE International Symposium on Computer Architecture (ISCA), 2017, pp. 27–40.
[19] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-X: An Accelerator for Sparse Neural Networks,” in Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–12.
[20] N. Beck, S. White, M. Paraschou, and S. Naffziger, “Zeppelin: An SoC for Multichip Architectures,” in IEEE International Solid-State Circuits Conference (ISSCC), 2018, pp. 40–42.
[21] R. Hwang, T. Kim, Y. Kwon, and M. Rhu, “Centaur: A Chiplet-Based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations,” in ACM/IEEE International Symposium on Computer Architecture (ISCA), 2020, pp. 968–981.
[22] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory,” in Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017, pp. 751–764.
[23] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, “Timeloop: A Systematic Approach to DNN Accelerator Evaluation,” in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019, pp. 304–315.
[24] H. Kwon, A. Samajdar, and T. Krishna, “MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects,” in Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018, pp. 461–475.
[25] M. Shahshahani, P. Goswami, and D. Bhatia, “Memory Optimization Techniques for FPGA-Based CNN Implementations,” in IEEE Dallas Circuits and Systems Conference (DCAS), 2018, pp. 1–6.
[26] Y. J. Lin and T. S. Chang, “Data and Hardware Efficient Design for Convolutional Neural Network,” IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I), pp. 1642–1651, 2018.
[27] D. A. Patterson, “Latency Lags Bandwidth,” Communications of the ACM (Commun. ACM), pp. 71–75, 2004.
[28] C. Stapper, “Defect Density Distribution for LSI Yield Calculations,” IEEE Transactions on Electron Devices (T-ED), pp. 655–657, 1973.
[29] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks,” in Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2015, pp. 161–170.
[30] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, “Deep Convolutional Neural Network Architecture with Reconfigurable Computation Patterns,” IEEE Transactions on Very Large Scale Integration Systems (T-VLSIS), pp. 2220–2233, 2017.
[31] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong, “Polyhedral-Based Data Reuse Optimization for Configurable Computing,” in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2013, pp. 29–38.
[32] D. Stow, Y. Xie, T. Siddiqua, and G. H. Loh, “Cost-Effective Design of Scalable High-Performance Systems Using Active and Passive Interposers,” in IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2017, pp. 728–735.
[33] J. A. Cunningham, “The Use and Evaluation of Yield Models in Integrated Circuit Manufacturing,” IEEE Transactions on Semiconductor Manufacturing (T-SM), pp. 60–71, 1990.
[34] M.-S. Lin, C.-C. Tsai, C.-H. Hsieh, W.-H. Huang, Y.-C. Chen, S.-C. Yang, C.-M. Fu, H.-J. Zhan, J.-Y. Chien, S.-Y. Li, Y.-H. Chen, C.-C. Kuo, S.-P. Tai, and K. Yamada, “A 16nm 256-bit Wide 89.6GByte/s Total Bandwidth In-Package Interconnect with 0.3V Swing and 0.062pJ/bit Power in InFO Package,” in IEEE Hot Chips Symposium (HCS), 2016, pp. 1–32.
[35] A. Shokrollahi, D. Carnelli, J. Fox, K. Hofstra, B. Holden, A. Hormati, P. Hunt, M. Johnston, J. Keay, S. Pesenti, R. Simpson, D. Stauffer, A. Stewart, G. Surace, A. Tajalli, O. T. Amiri, A. Tschank, R. Ulrich, C. Walter, F. Licciardello, Y. Mogentale, and A. Singh, “A Pin-Efficient 20.83Gb/s/wire 0.94pJ/bit Forwarded Clock CNRZ-5-Coded SerDes up to 12mm for MCM Packages in 28nm CMOS,” in IEEE International Solid-State Circuits Conference (ISSCC), 2016, pp. 182–183.
[36] J. M. Wilson, W. J. Turner, J. W. Poulton, B. Zimmer, X. Chen, S. S. Kudva, S. Song, S. G. Tell, N. Nedovic, W. Zhao, S. R. Sudhakaran, C. T. Gray, and W. J. Dally, “A 1.17pJ/b 25Gb/s/pin Ground-Referenced Single-Ended Serial Link for Off- and On-Package Communication in 16nm CMOS Using a Process- and Temperature-Adaptive Voltage Regulator,” in IEEE International Solid-State Circuits Conference (ISSCC), 2018, pp. 276–278.
[37] “ImageNet,” https://www.imagenet.org/challenges/LSVRC/index.php.
[38] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2012, pp. 1097–1105.
[39] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in Proceedings of International Conference on International Conference on Machine Learning (ICML), 2015, pp. 448–456.
[40] “Zynq-7000 SoC Technical Reference Manual,” https://docs.xilinx.com/v/u/en-US/ug585-Zynq-7000-TRM.
[41] “Zynq UltraScale+ Device Technical Reference Manual,” https://docs.xilinx.com/r/en-US/ug1085-zynq-ultrascale-trm.
[42] “Zynq-7000 SoC Data Sheet: Overview,” https://docs.xilinx.com/v/u/en-US/ds190-Zynq-7000-Overview.
[43] “Zynq UltraScale+ MPSoC Data Sheet: Overview,” https://docs.xilinx.com/v/u/en-US/ds891-zynq-ultrascale-plus-overview.