Abstract (English)
Currently, the von Neumann architecture (VNA) is the fundamental structure of computer systems: a Central Processing Unit (CPU) and memory connected by data and control channels. The CPU executes instructions stored in memory, while memory holds both instructions and data. For data-intensive applications such as image classification, speech recognition, and natural language processing, however, large amounts of data must be transferred between memory and the computing cores, giving rise to the von Neumann bottleneck: the limited communication bandwidth between the CPU and memory forces the CPU to wait for memory responses, limiting overall system performance.
To address the von Neumann bottleneck, attention has shifted toward Computing-in-Memory (CIM), widely seen as a promising solution. CIM moves computation into the memory itself, so that computation and data storage occur in the same place; this reduces CPU-memory communication demands and improves system efficiency and performance. Many CIM architectures have been proposed to accelerate AI computation. Broadly, CIM falls into two categories: analog computing and digital computing. In recent years, analog CIM has received widespread attention for its inherent advantages in parallelism and energy efficiency, so our work focuses on analog CIM architectures. Among the available memory technologies, SRAM (Static Random-Access Memory) and RRAM (Resistive Random-Access Memory) stand out as popular choices.
SRAM-based CIM architectures have proven successful thanks to their mature and stable device technology, delivering efficient and reliable computation. However, the relatively large cell area and low storage density of SRAM increase the required chip area. In contrast, RRAM-based CIM architectures offer high density, low power consumption, non-volatility, and seamless integration with CMOS processes, but they suffer from variations in process yield that give rise to various types of faults. Both CIM approaches significantly improve computational speed, yet each has its own advantages and disadvantages.
To fully leverage the advantages of both, we propose a novel hybrid SRAM-RRAM CIM architecture that computes directly, in place, on weights stored in the memory array, enabled by a specially designed peripheral circuit that integrates the SRAM and RRAM structures. In addition, we introduce a novel weight allocation scheme, termed the Weight Storage Strategy (WSS), which distributes weight bits across the two memory arrays according to their significance. Because the Most Significant Bits (MSBs) of a weight have the greater impact on computation results, we store them in the relatively stable SRAM array, while the Least Significant Bits (LSBs), which typically comprise more bits but are less critical, are stored in the denser RRAM array. Experimental results demonstrate that our architecture reduces area, leakage power, and energy consumption by approximately 35%, 40%, and 50%, respectively, compared with an 8T-SRAM-based CIM architecture, while its reliability is about 32% and 18% better than an RRAM-based architecture when evaluated on the MNIST and hand detection datasets.
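The WSS bit partitioning described above can be sketched in a few lines of Python. This is an illustrative sketch only, under assumed parameters: 8-bit unsigned quantized weights with a 3-bit MSB field mapped to the SRAM array and a 5-bit LSB field mapped to the RRAM array (the actual bit split in the architecture may differ), and `split_weight`/`hybrid_mac` are hypothetical helper names, not part of the proposed hardware.

```python
def split_weight(w, total_bits=8, msb_bits=3):
    """Split an unsigned quantized weight into an MSB field
    (stored in the SRAM array) and an LSB field (stored in the
    RRAM array)."""
    assert 0 <= w < (1 << total_bits)
    lsb_bits = total_bits - msb_bits
    return w >> lsb_bits, w & ((1 << lsb_bits) - 1)

def hybrid_mac(xs, ws, total_bits=8, msb_bits=3):
    """Model the dot product as two partial MACs, one per array,
    recombined with a shift: x.w = (x.msb << lsb_bits) + x.lsb."""
    lsb_bits = total_bits - msb_bits
    acc_sram = acc_rram = 0
    for x, w in zip(xs, ws):
        msb, lsb = split_weight(w, total_bits, msb_bits)
        acc_sram += x * msb   # partial MAC over MSBs (SRAM array)
        acc_rram += x * lsb   # partial MAC over LSBs (RRAM array)
    return (acc_sram << lsb_bits) + acc_rram
```

The recombination step shows why the split is lossless for MAC operations: each array accumulates its own partial sum, and the MSB partial sum is simply scaled by 2^(lsb_bits) before being added to the LSB partial sum, so a fault in an RRAM (LSB) cell perturbs the result far less than a fault in an SRAM (MSB) cell would.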
References
[1] Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137-1149.
[2] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770-778).
[3] Dominique, F., Odile, M., & Irina, I. (2017). New paradigm in speech recognition: Deep neural networks, the ContNomina project supported. French National Research Agency (ANR), 270.
[4] Boroumand, A., et al. (2018). Google workloads for consumer devices: Mitigating data movement bottlenecks. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (pp. 316-331).
[5] Shulaker, M., Hills, G., Park, R., Howe, R. T., Saraswat, K., Wong, H.-S. P., & Mitra, S. (2017). Three-dimensional integration of nanotechnologies for computing and data storage on a single chip. Nature.
[6] Shukla, S., et al. (2018). A scalable multi-teraops core for AI training and inference. IEEE Solid-State Circuits Letters, 1(12), 217–220.
[7] Sun, W., et al. (2023). A survey of computing-in-memory processor: From circuit to application. IEEE Open Journal of the Solid-State Circuits Society.
[8] Chih, Y.-D., et al. (2021). An 89TOPS/W and 16.3TOPS/mm² all-digital SRAM-based full-precision compute-in-memory macro in 22nm for machine-learning edge applications. In Proceedings of IEEE International Solid-State Circuits Conference (ISSCC).
[9] Lee, C.-F., et al. (2022). A 12nm 121-TOPS/W 41.6-TOPS/mm² all digital full precision SRAM-based compute-in-memory with configurable bit-width for AI edge applications. In Proceedings of IEEE Symposium on VLSI Technology and Circuits.
[10] Su, J.-W., et al. (2021). A 28nm 384kb 6T-SRAM computation-in-memory macro with 8b precision for AI edge chips. In 2021 IEEE International Solid-State Circuits Conference (ISSCC) (pp. 250-252).
[11] Ali, M., Jaiswal, A., Kodge, S., Agrawal, A., Chakraborty, I., & Roy, K. (2020). IMAC: In-memory multi-bit multiplication and accumulation in 6T SRAM array. IEEE Transactions on Circuits and Systems I: Regular Papers, 67(8), 2521-2531.
[12] Jiang, Z., Yin, S., Seo, J.-S., & Seok, M. (2020). C3SRAM: An in-memory-computing SRAM macro based on robust capacitive coupling computing mechanism. IEEE Journal of Solid-State Circuits, 55(7), 1888-1897.
[13] Mittal, S., Verma, G., Kaushik, B., & Khanday, F. A. (2021). A survey of SRAM-based in-memory computing techniques and applications. Journal of Systems Architecture, 119.
[14] Nguyen, V. T., Kim, J.-S., & Lee, J.-W. (2021). 10T SRAM computing-in-memory macros for binary and multibit MAC operation of DNN edge processors. IEEE Access, 9, 71262-71276.
[15] Liu, R., Mahalanabis, D., Barnaby, H. J., & Yu, S. (2015). Investigation of single-bit and multiple-bit upsets in oxide RRAM-based 1T1R and crossbar memory arrays. IEEE Transactions on Nuclear Science, 62(5), 2294-2301.
[16] Pedretti, G., & Ielmini, D. (2021). In-memory computing with resistive memory circuits: Status and outlook. Electronics, 10(1063).
[17] Zhang, S., Zhang, G. L., Li, B., Li, H. H., & Schlichtmann, U. (2020). Lifetime enhancement for RRAM-based computing-in-memory engine considering aging and thermal effects. In 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS) (pp. 11-15).
[18] Zhang, G. L., Li, B., Zhu, Y., Zhang, S., Wang, T., Shi, Y., Ho, T.-Y., Li, H. (H.), & Schlichtmann, U. (2020). Reliable and robust RRAM-based neuromorphic computing. In Proceedings of the 2020 on Great Lakes Symposium on VLSI (GLSVLSI ′20) (pp. 33-38). Association for Computing Machinery.
[19] Radhakrishnan, G., Yoon, Y., & Sachdev, M. (2020). Monitoring aging defects in STT-MRAMs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 39(12), 4645-4656.
[20] Na, T., Kang, S. H., & Jung, S.-O. (2021). STT-MRAM sensing: A review. IEEE Transactions on Circuits and Systems II: Express Briefs, 68(1), 12-18.
[21] He, Z., Angizi, S., & Fan, D. (2017). Exploring STT-MRAM based in-memory computing paradigm with application of image edge extraction. In 2017 IEEE International Conference on Computer Design (ICCD) (pp. 439-446).
[22] Radhakrishnan, G., Yoon, Y., & Sachdev, M. (2019). A parametric DFT scheme for STT-MRAMs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 27(7), 1685-1696.
[23] Lian, X., & Wang, L. (2022). Boolean logic function realized by phase-change blade type random access memory. IEEE Transactions on Electron Devices, 69(4).
[24] Jiao, F., Chen, B., Li, K., Wang, L., Zeng, X., & Rao, F. (2020). Monatomic 2D phase-change memory for precise neuromorphic computing. Applied Materials Today, 20.
[25] Wang, J., et al. (2019). A compute SRAM with bit-serial integer/floating-point operations for programmable in-memory vector acceleration. In IEEE ISSCC Digest of Technical Papers (pp. 224–226).
[26] Si, X., et al. (2019). A twin-8T SRAM computation-in-memory macro for multiple-bit CNN-based machine learning. In IEEE ISSCC Digest of Technical Papers (pp. 396–398).
[27] Chen, W. H., et al. (2018). A 65 nm 1 Mb nonvolatile computing-in-memory ReRAM macro with sub-16 ns multiply-and-accumulate for binary DNN AI edge processors. In IEEE ISSCC Digest of Technical Papers (pp. 494–496).
[28] Xue, C.-X., et al. (2020). Embedded 1-Mb ReRAM-based computing-in-memory macro with multibit input and weight for CNN-based AI edge processors. IEEE Journal of Solid-State Circuits, 55(1), 203–215.
[29] Rios, M., et al. (2021). Running efficiently CNNs on the edge thanks to hybrid SRAM-RRAM in-memory computing. In IEEE/ACM Design, Automation and Test in Europe Conference and Exhibition (DATE).
[30] Xia, L., Huangfu, W., Tang, T., et al. (2018). Stuck-at fault tolerance in RRAM computing systems. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 8(1), 102-115.
[31] Chen, C.-Y., et al. (2015). RRAM defect modeling and failure analysis based on march test and a novel squeeze-search scheme. IEEE Transactions on Computers, 64(1), 180–190.
[32] Rios, M., et al. (2021). Running efficiently CNNs on the edge thanks to hybrid SRAM-RRAM in-memory computing. In IEEE/ACM Design, Automation and Test in Europe Conference and Exhibition (DATE).
[33] Jaiswal, A., Chakraborty, I., Agrawal, A., & Roy, K. (2019). 8T SRAM cell as a multibit dot-product engine for beyond von Neumann computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 27(11), 2556–2567.
[34] Yu, S. (2018). Neuro-inspired computing with emerging nonvolatile memorys. Proceedings of the IEEE, 106(2), 260–285.
[35] Jiang, Z., et al. (2014). Verilog-A compact model for oxide-based resistive random access memory (RRAM). In 2014 International Conference on Simulation of Semiconductor Processes and Devices (SISPAD).
[36] Dong, X., Xu, C., Xie, Y., & Jouppi, N. P. (2012). NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 31(7), 994-1007.
[37] Qian, C., Zhang, M., Nie, Y., Lu, S., & Cao, H. (2023). A survey of bit-flip attacks on deep neural network and corresponding defense methods. Electronics, 12(853).
[38] Cai, Y., et al. (2018). Long live TIME: Improving lifetime for training-in-memory engines by structured gradient sparsification. In Proceedings of the 55th ACM/ESDA/IEEE Design Automation Conference (DAC).