References
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] X. Zhang, J. Zou, K. He, and J. Sun, “Accelerating very deep convolutional networks for classification and detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 1943–1955, 2016.
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 770–778.
[6] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” Computing Research Repository (CoRR), vol. abs/1704.04861, 2017.
[7] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014, pp. 269–284.
[8] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting vision processing closer to the sensor,” ACM SIGARCH Computer Architecture News, vol. 43, no. 3, pp. 92–104, Jun. 2015.
[9] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[10] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of ACM/IEEE International Symposium on Computer Architecture (ISCA), 2017, pp. 1–12.
[11] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2015, pp. 161–170.
[12] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi, “Design space exploration of FPGA-based deep convolutional neural networks,” in Proceedings of Asia and South Pacific Design Automation Conference (ASP-DAC), Jan. 2016, pp. 575–580.
[13] N. P. Jouppi, C. Young, N. Patil, D. A. Patterson, G. Agrawal, R. Bajwa et al., “In-datacenter performance analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017, pp. 1–12.
[14] G. Li, S. K. S. Hari, M. Sullivan, T. Tsai, K. Pattabiraman, J. Emer, and S. W. Keckler, “Understanding error propagation in deep learning neural network (DNN) accelerators and applications,” in SC17: International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, pp. 1–12.
[15] J. J. Zhang, K. Basu, and S. Garg, “Fault-tolerant systolic array based accelerators for deep neural network execution,” IEEE Design & Test, vol. 36, no. 5, pp. 44–53, 2019.
[16] C. Liu, C. Chu, D. Xu, Y. Wang, Q. Wang, H. Li, X. Li, and K.-T. Cheng, “HyCA: A hybrid computing architecture for fault-tolerant deep learning,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 41, no. 10, pp. 3400–3413, 2022.
[17] R. Baumann, “Radiation-induced soft errors in advanced semiconductor technologies,” IEEE Transactions on Device and Materials Reliability, vol. 5, no. 3, pp. 305–316, 2005.
[18] K. Kang, S. Gangwal, S. P. Park, and K. Roy, “NBTI induced performance degradation in logic and memory circuits: How effectively can we approach a reliability solution?” in 2008 Asia and South Pacific Design Automation Conference, 2008, pp. 726–731.
[19] J. J. Zhang, T. Gu, K. Basu, and S. Garg, “Analyzing and mitigating the impact of permanent faults on a systolic array based neural network accelerator,” in 2018 IEEE 36th VLSI Test Symposium (VTS), 2018, pp. 1–6.
[20] M. A. Hanif and M. Shafique, “SalvageDNN: Salvaging deep neural network accelerators with permanent faults through saliency-driven fault-aware mapping,” Philosophical Transactions of the Royal Society A, vol. 378, 2019.
[21] L. H. Hoang, M. A. Hanif, and M. Shafique, “FT-ClipAct: Resilience analysis of deep neural networks and improving their fault tolerance using clipped activation,” CoRR, vol. abs/1912.00941, 2019.
[22] G. B. Hacene, F. Leduc-Primeau, A. B. Soussia, V. Gripon, and F. Gagnon, “Training modern deep neural networks for memory-fault robustness,” CoRR, vol. abs/1911.10287, 2019.
[23] R. E. Lyons and W. Vanderkulk, “The use of triple-modular redundancy to improve computer reliability,” IBM Journal of Research and Development, vol. 6, no. 2, pp. 200–209, 1962.
[24] T. G. Bertoa, G. Gambardella, N. J. Fraser, M. Blott, and J. McAllister, “Fault-tolerant neural network accelerators with selective TMR,” IEEE Design & Test, vol. 40, no. 2, pp. 67–74, 2023.
[25] J. Zhang, K. Rangineni, Z. Ghodsi, and S. Garg, “ThUnderVolt: Enabling aggressive voltage underscaling and timing error resilience for energy efficient deep learning accelerators,” in Proceedings of the 55th Annual Design Automation Conference, ser. DAC ’18. New York, NY, USA: Association for Computing Machinery, 2018.
[26] P. N. Whatmough, S. K. Lee, D. Brooks, and G.-Y. Wei, “DNN engine: A 28-nm timing-error tolerant sparse deep neural network processor for IoT applications,” IEEE Journal of Solid-State Circuits, vol. 53, no. 9, pp. 2722–2731, 2018.
[27] R. W. Hamming, “Error detecting and error correcting codes,” The Bell System Technical Journal, vol. 29, no. 2, pp. 147–160, 1950.
[28] M. Y. Hsiao, “A class of optimal minimum odd-weight-column SEC-DED codes,” IBM Journal of Research and Development, vol. 14, no. 4, pp. 395–401, 1970.
[29] M. Patel, J. S. Kim, H. Hassan, and O. Mutlu, “Understanding and modeling on-die error correction in modern DRAM: An experimental study using real devices,” in 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2019, pp. 13–25.
[30] K.-H. Huang and J. A. Abraham, “Algorithm-based fault tolerance for matrix operations,” IEEE Transactions on Computers, vol. C-33, no. 6, pp. 518–528, 1984.
[31] K. Cho, I. Lee, H. Lim, and S. Kang, “Efficient systolic-array redundancy architecture for offline/online repair,” Electronics, vol. 9, no. 2, 2020.
[32] I. Takanami and T. Horita, “A built-in circuit for self-repairing mesh-connected processor arrays by direct spare replacement,” in 2012 IEEE 18th Pacific Rim International Symposium on Dependable Computing, 2012, pp. 96–104.
[33] K. Zhao, S. Di, S. Li, X. Liang, Y. Zhai, J. Chen, K. Ouyang, F. Cappello, and Z. Chen, “FT-CNN: Algorithm-based fault tolerance for convolutional neural networks,” IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 7, pp. 1677–1689, 2021.
[34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[35] Z. Xu and J. Abraham, “Safety design of a convolutional neural network accelerator with error localization and correction,” in 2019 IEEE International Test Conference (ITC), 2019, pp. 1–10.
[36] E. Ozen and A. Orailoglu, “Low-cost error detection in deep neural network accelerators with linear algorithmic checksums,” Journal of Electronic Testing, vol. 36, no. 6, pp. 703–718, 2020.
[37] M. Safarpour, R. Inanlou, and O. Silvén, “Algorithm level error detection in low voltage systolic array,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 2, pp. 569–573, 2022.
[38] T. Marty, T. Yuki, and S. Derrien, “Safe overclocking for CNN accelerators through algorithm-level error detection,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 12, pp. 4777–4790, 2020.
[39] D. Filippas, N. Margomenos, N. Mitianoudis, C. Nicopoulos, and G. Dimitrakopoulos, “Low-cost online convolution checksum checker,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 2, pp. 201–212, 2022.
[40] Z. Xu and J. Abraham, “Design of a safe convolutional neural network accelerator,” in 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2019, pp. 247–252.
[41] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[42] L. Deng, “The MNIST database of handwritten digit images for machine learning research [Best of the Web],” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
[43] Y. Zhao, K. Wang, and A. Louri, “FSA: An efficient fault-tolerant systolic array-based DNN accelerator architecture,” in 2022 IEEE 40th International Conference on Computer Design (ICCD), 2022, pp. 545–552.
[44] A. Saleh, J. Serrano, and J. Patel, “Reliability of scrubbing recovery-techniques for memory systems,” IEEE Transactions on Reliability, vol. 39, no. 1, pp. 114–122, 1990.