Master's/Doctoral Thesis 108521048: Detailed Record




Name: Ding-Bang Lin (林定邦)    Department: Department of Electrical Engineering
Thesis Title: Design and Implementation of Low-Power, Energy-Efficient Neural Network Training Hardware Accelerators Based on Brain Floating-Point Computing and Sparsity Aware
(基於腦浮點運算及稀疏性考量之低功耗高能效神經網路訓練硬體架構設計與實作)
Related Theses
★ Low-memory hardware design for real-time SIFT feature extraction
★ Real-time face detection and recognition for an access control system
★ Autonomous vehicle with real-time automatic following
★ Lossless compression algorithm and implementation for multi-lead ECG signals
★ Offline customizable voice and speaker wake-word system with embedded implementation
★ Wafer map defect classification and embedded system implementation
★ Speech densely connected convolutional network for small-footprint keyword spotting
★ G2LGAN: Data augmentation for imbalanced datasets applied to wafer map defect classification
★ Algorithm design techniques for compensating the finite precision of multiplierless digital filters
★ Design and implementation of a programmable Viterbi decoder
★ Low-cost vector rotator IP design based on extended elementary-angle CORDIC
★ Analysis and architecture design of a JPEG2000 still-image coding system
★ Low-power turbo decoder for communication systems
★ Platform-based design for multimedia communication
★ Design and implementation of a digital watermarking system for MPEG encoders
★ Algorithm development for video error concealment with data reuse considerations
Files: Full text available in the online system after 2024-08-31.
Abstract (Chinese) In recent years, with technological progress and the arrival of the big-data era, deep learning has brought revolutionary advances to many fields. Techniques such as basic image pre-processing, image enhancement, face recognition, and speech recognition have gradually replaced traditional algorithms, showing that the rise of neural networks has driven the transformation of artificial intelligence in these areas. However, the high cost of GPUs makes such products expensive, and their large power consumption also leads to low energy-efficiency figures when running neural network inference. Because neural network algorithms involve an enormous amount of computation, they must be paired with accelerator hardware for real-time operation, which has motivated considerable research in recent years on accelerated digital circuit hardware design for deep neural networks.
In this thesis, we propose a high-performance, highly flexible training processor, which we name EESA. The proposed training processor features low power consumption, high throughput, and high energy efficiency. EESA exploits the sparsity of neuron activations after the activation function to reduce the number of memory accesses and the memory storage space, realizing an efficient training accelerator. The proposed processor uses a novel reconfigurable computing architecture that maintains high performance during both forward propagation (FP) and backward propagation (BP). The processor is implemented in TSMC 40 nm process technology, operates at 294 MHz, and the whole chip consumes 87.12 mW at a core voltage of 0.9 V. All numerical computation on the chip uses the 16-bit brain floating-point precision format, and the processor ultimately achieves a high energy efficiency of 1.72 TOPS/W.
Abstract (English) In recent years, with advances in technology and the arrival of the big-data era, deep learning has brought revolutionary progress to many fields. Techniques such as image pre-processing, image enhancement, face recognition, and speech recognition are gradually replacing traditional algorithms, showing that the rise of neural networks has driven the transformation of artificial intelligence in these areas. However, GPUs are costly, which makes GPU-based products expensive, and their high power consumption results in low energy efficiency when performing neural network inference. Because neural network algorithms are computationally intensive, they require hardware acceleration for real-time computation, which has motivated extensive research in recent years on digital hardware accelerators for deep neural networks.
In this thesis, we propose an efficient and flexible training processor called EESA. The proposed training processor features low power consumption, high throughput, and high energy efficiency. EESA exploits the sparsity of neuron activations to reduce both the number of memory accesses and the required storage space, enabling an efficient training accelerator. The processor uses a novel reconfigurable computing architecture that maintains high performance during both the forward-propagation (FP) and backward-propagation (BP) passes. It is implemented in a TSMC 40 nm process, operates at 294 MHz, and consumes 87.12 mW at a core voltage of 0.9 V. Using the 16-bit brain floating-point precision format for all computation, the processor achieves an energy efficiency of 1.72 TOPS/W.
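The 16-bit brain floating-point (bfloat16) format used throughout the chip keeps the sign bit, the full 8-bit exponent, and only the top 7 mantissa bits of an IEEE-754 float32, so a bfloat16 value is essentially the upper 16 bits of its float32 encoding. The C sketch below illustrates that conversion with round-to-nearest-even on the discarded lower half; it is only an illustration of the number format, not of the arithmetic units implemented in the thesis.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* bfloat16 keeps the sign bit, all 8 exponent bits, and the top 7 mantissa
     * bits of an IEEE-754 float32, i.e. the upper 16 bits of the float32 word.
     * Conversion here uses round-to-nearest-even on the truncated lower half;
     * NaN handling is omitted for brevity. */
    static uint16_t f32_to_bf16(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);              /* reinterpret float bits */
        uint32_t round = 0x7FFFu + ((bits >> 16) & 1u);
        return (uint16_t)((bits + round) >> 16);     /* keep the upper 16 bits */
    }

    static float bf16_to_f32(uint16_t h)
    {
        uint32_t bits = (uint32_t)h << 16;           /* zero-fill lost mantissa */
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }

    int main(void)
    {
        float x = 3.14159265f;
        uint16_t h = f32_to_bf16(x);
        printf("%.7f -> 0x%04X -> %.7f\n", x, h, bf16_to_f32(h));
        return 0;
    }

Keeping the full 8-bit exponent preserves the dynamic range of float32, which is why bfloat16 is widely used for training even though only 7 mantissa bits remain.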
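The abstract's claim that activation sparsity cuts memory traffic and storage can be illustrated with a simple bitmap compression scheme: activations zeroed by ReLU cost only a single mask bit, and only non-zero values are written to memory. The C sketch below is a hypothetical illustration of this general idea (similar in spirit to sparse feature-map schemes such as NullHop [18]); the thesis's own sparsity encoder/decoder is described in Section 3.4 and may differ in format and granularity.

    #include <stdint.h>
    #include <stdio.h>

    /* Compress one group of 16 activations into a 16-bit non-zero bitmap plus
     * the packed non-zero values, so each zero activation costs one bit instead
     * of a full 16-bit word.  This is a hypothetical illustration, not the
     * thesis's exact encoding. */
    static int compress16(const uint16_t act[16], uint16_t *bitmap, uint16_t packed[16])
    {
        int n = 0;
        *bitmap = 0;
        for (int i = 0; i < 16; i++) {
            if (act[i] != 0) {
                *bitmap |= (uint16_t)(1u << i);   /* mark the non-zero position  */
                packed[n++] = act[i];             /* store only non-zero values  */
            }
        }
        return n;                                  /* number of value words kept */
    }

    static void decompress16(uint16_t bitmap, const uint16_t packed[16], uint16_t act[16])
    {
        int n = 0;
        for (int i = 0; i < 16; i++)
            act[i] = (bitmap & (1u << i)) ? packed[n++] : 0;
    }

    int main(void)
    {
        /* e.g. bfloat16-encoded activations after ReLU, mostly zero */
        uint16_t act[16] = {0, 0x3F80, 0, 0, 0x4000, 0, 0, 0,
                            0, 0,      0, 0, 0x4040, 0, 0, 0};
        uint16_t bitmap, packed[16], restored[16];
        int n = compress16(act, &bitmap, packed);
        decompress16(bitmap, packed, restored);
        printf("stored %d of 16 words (bitmap 0x%04X)\n", n + 1, bitmap);
        return 0;
    }

In this toy example the 16-word group is stored as one bitmap word plus three value words, showing how higher activation sparsity directly reduces both on-chip storage and the number of memory accesses.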
Keywords (Chinese) ★ Fully connected layer
★ AI accelerator
★ Memory optimization
★ Sparsity
★ Low power
Keywords (English) ★ Fully Connected Layers
★ AI Accelerator
★ Optimized Memory Access
★ Sparsity
★ Low Power
Table of Contents
Abstract (Chinese)
Abstract (English)
1. Introduction
1.1. Research Background
1.2. Research Motivation
1.3. Research Contributions
1.4. Thesis Organization
2. Literature Review
2.1. Neural Networks
2.2. Forward and Backward Propagation of Fully Connected Layers
2.3. Hardware Accelerators
2.4. Floating-Point Formats
3. Hardware Architecture Design
3.1. System Hardware Block Diagram
3.2. Memory Access Optimization for Delta Computation
3.3. Configuration Register Unit
3.4. Data Sparsity Encoder and Decoder Units
3.5. Control Unit Module
3.6. Memory Module
3.7. PE Array Module
3.8. Dataflow of Forward and Backward Propagation
3.9. Softmax Module
3.10. Loss Function Module
3.11. Weight Update Module
4. Hardware Implementation Results
4.1. FPGA Verification Results
4.2. VLSI Implementation
4.3. Chip Power and Area Analysis
4.4. Comparison with Related Works
5. Conclusion
References
References
[1] S. S. Farfade, M. Saberian, and L.-J. Li, “Multi-view face detection using deep convolutional neural networks,” in Proceedings of the 2015 ACM International Conference on Multimedia Retrieval (ICMR), 2015, pp. 643–650, doi: 10.1145/2671188.2749408.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Dec. 2016, vol. 2016-December, pp. 779–788, doi: 10.1109/CVPR.2016.91.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[4] NEC face recognition and temperature sensing solution (NEC人臉辨識溫度感測方案). [Online]. Available: https://www.ankecare.com/article/797-20279.
[5] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. Int. Conf. on Learning Representations, San Diego, CA, 2015.
[6] K. He et al., “Deep residual learning for image recognition,” arXiv:1512.03385, 2015.
[7] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, 2016, pp. 262–263, doi: 10.1109/ISSCC.2016.7418007.
[8] S. Han et al., “EIE: Efficient inference engine on compressed deep neural network,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 243–254.
[9] Z. Yuan et al., “Sticker: A 0.41-62.1 TOPS/W 8bit neural network processor with multi-sparsity compatible convolution arrays and online tuning acceleration for fully connected layers,” in Proc. IEEE Symp. VLSI Circuits, Jun. 2018, pp. 33–34.
[10] D. Masters and C. Luschi, “Revisiting small batch training for deep neural networks,” CoRR, vol. abs/1804.07612, 2018. [Online]. Available: http://arxiv.org/abs/1804.07612.
[11] M. A. Hussain and T. H. Tsai, “Memory Access Optimization for On-Chip Transfer Learning,” IEEE Trans. Circuits Syst. I Regul. Pap., vol. 68, no. 4, pp. 1507–1519, Feb. 2021, doi: 10.1109/TCSI.2021.3055281.
[12] M. Long, Y. Cao, Z. Cao, J. Wang, and M. I. Jordan, “Transferable Representation Learning with Deep Adaptation Networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 12, pp. 3071–3085, 2019, doi: 10.1109/TPAMI.2018.2868685.
[13] A. Gepperth and S. A. Gondal, “Incremental learning with deep neural networks using a test-time oracle,” in Proceedings - European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2018, no. April, pp. 37–42.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 1998.
[15] “Why is so much memory needed for deep neural networks?” [Online]. Available: https://www.graphcore.ai/posts/why-is-so-much-memory-needed-for-deep-neural-networks. [Accessed: 13-Jan-2020].
[16] “TensorFlow.” [Online]. Available: https://www.tensorflow.org/. [Accessed: 13-Jan-2020].
[17] “PyTorch.” [Online]. Available: https://pytorch.org/. [Accessed: 13-Jan-2020].
[18] A. Aimar et al., “NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps,” IEEE Trans. Neural Networks Learn. Syst., vol. 30, no. 3, pp. 644–656, 2019, doi: 10.1109/TNNLS.2018.2852335.
[19] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015
[20] S. Choi, J. Sim, M. Kang, Y. Choi, H. Kim, and L. S. Kim, “An Energy-Efficient Deep Convolutional Neural Network Training Accelerator for in Situ Personalization on Smart Devices,” IEEE J. Solid-State Circuits, vol. 55, no. 10, pp. 2691–2702, Oct. 2020.
[21] D. Han, J. Lee, J. Lee, and H. J. Yoo, “A Low-Power Deep Neural Network Online Learning Processor for Real-Time Object Tracking Application,” IEEE Trans. Circuits Syst. I Regul. Pap., vol. 66, no. 5, pp. 1794–1804, May 2019, doi: 10.1109/TCSI.2018.2880363.
[22] X. Chen, C. Gao, T. Delbruck, and S.-C. Liu, “EILE: Efficient Incremental Learning on the Edge,” in 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), Jun. 2021, pp. 1–4, doi: 10.1109/AICAS51828.2021.9458554.
[23] IEEE 754, Wikipedia. [Online]. Available: https://zh.wikipedia.org/wiki/IEEE_754.
[24] A. Agrawal et al., "DLFloat: A 16-b Floating Point Format Designed for Deep Learning Training and Inference," 2019 IEEE 26th Symposium on Computer Arithmetic (ARITH), 2019, pp. 92-95, doi: 10.1109/ARITH.2019.00023.
[25] U. Köster et al., “Flexpoint: An adaptive numerical format for efficient training of deep neural networks,” in NIPS, 2017.
[26] J. L. Gustafson and I. Yonemoto, “Beating floating point at its own game: Posit arithmetic,” Supercomputing Frontiers and Innovations, vol. 4, no. 2, 2017.
[27] S. Wang and P. Kanwar, “BFloat16: The secret to high performance on Cloud TPUs,” Google Cloud Blog, Aug. 2019. [Online]. Available: https://reurl.cc/43Z7qY.
[28] A. F. Agarap, “Deep learning using rectified linear units (ReLU),” arXiv preprint arXiv:1803.08375, 2018.
[29] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov and L. -C. Chen, "MobileNetV2: Inverted Residuals and Linear Bottlenecks," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510-4520, doi: 10.1109/CVPR.2018.00474.
[30] C. Chen, H. Ding, H. Peng, H. Zhu, Y. Wang, and C. J. R. Shi, “OCEAN: An On-Chip Incremental-Learning Enhanced Artificial Neural Network Processor with Multiple Gated-Recurrent-Unit Accelerators,” IEEE J. Emerg. Sel. Top. Circuits Syst., vol. 8, no. 3, pp. 519–530, 2018, doi: 10.1109/JETCAS.2018.2852780.
[31] C.-H. Lu, Y.-C. Wu, and C.-H. Yang, “A 2.25 TOPS/W Fully-Integrated Deep CNN Learning Processor with On-Chip Training,” in IEEE Asian Solid-State Circuits Conference (A-SSCC), Apr. 2019, pp. 65–68, doi: 10.1109/a-sscc47793.2019.9056967.
[32] N. N. Schraudolph, “A fast, compact approximation of the exponential function,” Neural Computation, vol. 11, no. 4, pp. 853-862, 1999.
[33] D. Kim, J. Kung, and S. Mukhopadhyay, “A power-aware digital multilayer perceptron accelerator with on-chip training based on approximate computing,” IEEE Trans. Emerg. Top. Comput., vol. 5, no. 2, pp. 164–178, Apr. 2017, doi: 10.1109/TETC.2017.2673548.
[34] Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998, doi: 10.1109/5.726791.
[35] D. Kalamkar et al., “A study of BFLOAT16 for deep learning training,” arXiv preprint arXiv:1905.12322, 2019.
[36] C. S. Turner, “A fast binary logarithm algorithm,” IEEE Signal Process. Mag., vol. 27, no. 5, 2010, doi: 10.1109/MSP.2010.937503.
Advisor: Tsung-Han Tsai (蔡宗漢)    Review Date: 2022-08-03
