Abstract (English)
Currently, the von Neumann architecture (VNA) is the fundamental structure of computer systems: a Central Processing Unit (CPU) and memory connected by data and control channels. The CPU executes instructions stored in memory, while memory holds both instructions and data. For data-intensive applications such as image classification, speech recognition, and natural language processing, however, large amounts of data must be transferred between memory and the computing cores, giving rise to the von Neumann bottleneck: the limited communication bandwidth between the CPU and memory forces the CPU to wait for memory responses, limiting overall system performance.
To address the von Neumann bottleneck, attention has shifted toward Computing-in-Memory (CIM), widely seen as a promising solution. CIM moves computation into the memory itself, so that computation and data storage occur in the same place; this reduces CPU-memory communication demands and improves system efficiency and performance. Many CIM architectures have been proposed to accelerate AI computation. Broadly, CIM falls into two categories: analog computing and digital computing. In recent years, analog CIM has received widespread attention for its inherent advantages in parallelism and energy efficiency, so our work focuses on analog CIM architectures. Among the available memory technologies, SRAM (Static Random-Access Memory) and RRAM (Resistive Random-Access Memory) stand out as popular choices.
SRAM-based CIM architectures have proven successful thanks to their mature and stable device technology, delivering efficient and reliable computation. However, the relatively large cell area and low storage density of SRAM increase the required chip area. In contrast, RRAM-based CIM architectures offer high density, low power consumption, non-volatility, and seamless integration with CMOS processes, but they suffer from variations in process yield that give rise to various types of faults. Both CIM approaches significantly improve computational speed, yet each has its own advantages and disadvantages.
To fully leverage the advantages of both, we propose a novel hybrid SRAM-RRAM CIM architecture that computes directly, in place, on weights stored in the memory array, enabled by a specially designed peripheral circuit that integrates the SRAM and RRAM structures. In addition, we introduce a novel weight allocation scheme, termed the Weight Storage Strategy (WSS), which distributes weight bits across the two memory arrays according to their significance. Because the Most Significant Bits (MSBs) of a weight have the greater impact on computation results, we store them in the relatively stable SRAM array, while the Least Significant Bits (LSBs), which typically comprise more bits but are less critical, are stored in the denser RRAM array. Experimental results demonstrate that our architecture reduces area, leakage power, and energy consumption by approximately 35%, 40%, and 50%, respectively, compared with an 8T-SRAM-based CIM architecture, while its reliability is about 32% and 18% better than an RRAM-based architecture when evaluated on the MNIST and hand detection datasets.
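The WSS bit partitioning described above can be sketched in a few lines of Python. This is an illustrative sketch only, under assumed parameters: 8-bit unsigned quantized weights with a 3-bit MSB field mapped to the SRAM array and a 5-bit LSB field mapped to the RRAM array (the actual bit split in the architecture may differ), and `split_weight`/`hybrid_mac` are hypothetical helper names, not part of the proposed hardware.

```python
def split_weight(w, total_bits=8, msb_bits=3):
    """Split an unsigned quantized weight into an MSB field
    (stored in the SRAM array) and an LSB field (stored in the
    RRAM array)."""
    assert 0 <= w < (1 << total_bits)
    lsb_bits = total_bits - msb_bits
    return w >> lsb_bits, w & ((1 << lsb_bits) - 1)

def hybrid_mac(xs, ws, total_bits=8, msb_bits=3):
    """Model the dot product as two partial MACs, one per array,
    recombined with a shift: x.w = (x.msb << lsb_bits) + x.lsb."""
    lsb_bits = total_bits - msb_bits
    acc_sram = acc_rram = 0
    for x, w in zip(xs, ws):
        msb, lsb = split_weight(w, total_bits, msb_bits)
        acc_sram += x * msb   # partial MAC over MSBs (SRAM array)
        acc_rram += x * lsb   # partial MAC over LSBs (RRAM array)
    return (acc_sram << lsb_bits) + acc_rram
```

The recombination step shows why the split is lossless for MAC operations: each array accumulates its own partial sum, and the MSB partial sum is simply scaled by 2^(lsb_bits) before being added to the LSB partial sum, so a fault in an RRAM (LSB) cell perturbs the result far less than a fault in an SRAM (MSB) cell would.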
References
[1] Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137-1149.
[2] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770-778).
[3] Dominique, F., Odile, M., & Irina, I. (2017). New paradigm in speech recognition: Deep neural networks, the ContNomina project supported. French National Research Agency (ANR), 270.
[4] Boroumand, A., et al. (2018). Google workloads for consumer devices: Mitigating data movement bottlenecks. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (pp. 316-331).
[5] Shulaker, M., Hills, G., Park, R., Howe, R. T., Saraswat, K., Wong, H.-S. P., & Mitra, S. (2017). Three-dimensional integration of nanotechnologies for computing and data storage on a single chip. Nature.
[6] Shukla, S., et al. (2018). A scalable multi-teraops core for AI training and inference. IEEE Solid-State Circuits Letters, 1(12), 217–220.
[7] Sun, W., et al. (2023). A survey of computing-in-memory processor: From circuit to application. IEEE Open Journal of the Solid-State Circuits Society.
[8] Chih, Y.-D., et al. (2021). An 89TOPS/W and 16.3TOPS/mm² all-digital SRAM-based full-precision compute-in-memory macro in 22nm for machine-learning edge applications. In Proceedings of IEEE International Solid-State Circuits Conference (ISSCC).
[9] Lee, C.-F., et al. (2022). A 12nm 121-TOPS/W 41.6-TOPS/mm² all digital full precision SRAM-based compute-in-memory with configurable bit-width for AI edge applications. In Proceedings of IEEE Symposium on VLSI Technology and Circuits.
[10] Su, J.-W., et al. (2021). A 28nm 384kb 6T-SRAM computation-in-memory macro with 8b precision for AI edge chips. In 2021 IEEE International Solid-State Circuits Conference (ISSCC) (pp. 250-252).
[11] Ali, M., Jaiswal, A., Kodge, S., Agrawal, A., Chakraborty, I., & Roy, K. (2020). IMAC: In-memory multi-bit multiplication and accumulation in 6T SRAM array. IEEE Transactions on Circuits and Systems I: Regular Papers, 67(8), 2521-2531.
[12] Jiang, Z., Yin, S., Seo, J.-S., & Seok, M. (2020). C3SRAM: An in-memory-computing SRAM macro based on robust capacitive coupling computing mechanism. IEEE Journal of Solid-State Circuits, 55(7), 1888-1897.
[13] Mittal, S., Verma, G., Kaushik, B., & Khanday, F. A. (2021). A survey of SRAM-based in-memory computing techniques and applications. Journal of Systems Architecture, 119.
[14] Nguyen, V. T., Kim, J.-S., & Lee, J.-W. (2021). 10T SRAM computing-in-memory macros for binary and multibit MAC operation of DNN edge processors. IEEE Access, 9, 71262-71276.
[15] Liu, R., Mahalanabis, D., Barnaby, H. J., & Yu, S. (2015). Investigation of single-bit and multiple-bit upsets in oxide RRAM-based 1T1R and crossbar memory arrays. IEEE Transactions on Nuclear Science, 62(5), 2294-2301.
[16] Pedretti, G., & Ielmini, D. (2021). In-memory computing with resistive memory circuits: Status and outlook. Electronics, 10(1063).
[17] Zhang, S., Zhang, G. L., Li, B., Li, H. H., & Schlichtmann, U. (2020). Lifetime enhancement for RRAM-based computing-in-memory engine considering aging and thermal effects. In 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS) (pp. 11-15).
[18] Zhang, G. L., Li, B., Zhu, Y., Zhang, S., Wang, T., Shi, Y., Ho, T.-Y., Li, H. (H.), & Schlichtmann, U. (2020). Reliable and robust RRAM-based neuromorphic computing. In Proceedings of the 2020 on Great Lakes Symposium on VLSI (GLSVLSI ′20) (pp. 33-38). Association for Computing Machinery.
[19] Radhakrishnan, G., Yoon, Y., & Sachdev, M. (2020). Monitoring aging defects in STT-MRAMs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 39(12), 4645-4656.
[20] Na, T., Kang, S. H., & Jung, S.-O. (2021). STT-MRAM sensing: A review. IEEE Transactions on Circuits and Systems II: Express Briefs, 68(1), 12-18.
[21] He, Z., Angizi, S., & Fan, D. (2017). Exploring STT-MRAM based in-memory computing paradigm with application of image edge extraction. In 2017 IEEE International Conference on Computer Design (ICCD) (pp. 439-446).
[22] Radhakrishnan, G., Yoon, Y., & Sachdev, M. (2019). A parametric DFT scheme for STT-MRAMs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 27(7), 1685-1696.
[23] Lian, X., & Wang, L. (2022). Boolean logic function realized by phase-change blade type random access memory. IEEE Transactions on Electron Devices, 69(4).
[24] Jiao, F., Chen, B., Li, K., Wang, L., Zeng, X., & Rao, F. (2020). Monatomic 2D phase-change memory for precise neuromorphic computing. Applied Materials Today, 20.
[25] Wang, J., et al. (2019). A compute SRAM with bit-serial integer/floating-point operations for programmable in-memory vector acceleration. In IEEE ISSCC Digest of Technical Papers (pp. 224–226).
[26] Si, X., et al. (2019). A twin-8T SRAM computation-in-memory macro for multiple-bit CNN-based machine learning. In IEEE ISSCC Digest of Technical Papers (pp. 396–398).
[27] Chen, W. H., et al. (2018). A 65 nm 1 Mb nonvolatile computing-in-memory ReRAM macro with sub-16 ns multiply-and-accumulate for binary DNN AI edge processors. In IEEE ISSCC Digest of Technical Papers (pp. 494–496).
[28] Xue, C.-X., et al. (2020). Embedded 1-Mb ReRAM-based computing-in-memory macro with multibit input and weight for CNN-based AI edge processors. IEEE Journal of Solid-State Circuits, 55(1), 203–215.
[29] Rios, M., et al. (2021). Running efficiently CNNs on the edge thanks to hybrid SRAM-RRAM in-memory computing. In IEEE/ACM Design, Automation and Test in Europe Conference and Exhibition (DATE).
[30] Xia, L., Huangfu, W., Tang, T., et al. (2018). Stuck-at fault tolerance in RRAM computing systems. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 8(1), 102-115.
[31] Chen, C.-Y., et al. (2015). RRAM defect modeling and failure analysis based on march test and a novel squeeze-search scheme. IEEE Transactions on Computers, 64(1), 180–190.
[32] Rios, M., et al. (2021). Running efficiently CNNs on the edge thanks to hybrid SRAM-RRAM in-memory computing. In IEEE/ACM Design, Automation and Test in Europe Conference and Exhibition (DATE).
[33] Jaiswal, A., Chakraborty, I., Agrawal, A., & Roy, K. (2019). 8T SRAM cell as a multibit dot-product engine for beyond von Neumann computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 27(11), 2556–2567.
[34] Yu, S. (2018). Neuro-inspired computing with emerging nonvolatile memorys. Proceedings of the IEEE, 106(2), 260–285.
[35] Jiang, Z., et al. (2014). Verilog-A compact model for oxide-based resistive random access memory (RRAM). In 2014 International Conference on Simulation of Semiconductor Processes and Devices (SISPAD).
[36] Dong, X., Xu, C., Xie, Y., & Jouppi, N. P. (2012). NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 31(7), 994-1007.
[37] Qian, C., Zhang, M., Nie, Y., Lu, S., & Cao, H. (2023). A survey of bit-flip attacks on deep neural network and corresponding defense methods. Electronics, 12(853).
[38] Cai, Y., et al. (2018). Long live TIME: Improving lifetime for training-in-memory engines by structured gradient sparsification. In Proceedings of the 55th ACM/ESDA/IEEE Design Automation Conference (DAC).