Enhancing Computational Throughput and Power Efficiency in Compressor-Based MAC via Carry-Out Replacement Architecture

Name

Li-Ling Tang (唐儷綾)

Department

Department of Electrical Engineering

Thesis Title

Enhancing Computational Throughput and Power Efficiency in Compressor-Based MAC via Carry-Out Replacement Architecture
(透過基於壓縮器的MAC進位輸出替換架構提升系統計算吞吐量與能效)

Files

Full text available for browsing in the system after 2030-1-16

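Abstract (Chinese)

As neural networks grow more complex, multiply-accumulate (MAC) operations play a key role in computational efficiency and energy consumption. Compressor-based MAC architectures are known for high performance and low power, but the propagation of carry-out (cout) signals between the multiplier and accumulator stages often causes latency bottlenecks and reduced throughput, which limits their potential.
To address these challenges, we propose the Carry-Out Replacement Architecture, a new MAC design that integrates segmented processing with an optimized accumulator structure. By adding registers that temporarily hold the cout signals, the architecture defers carry accumulation to a later stage and thereby shortens the critical path. This segmentation strategy balances delay across pipeline stages while keeping the overhead of the extra registers to a minimum. In addition, we adopt a Register Integration Strategy that selectively optimizes specific cout signals to improve efficiency under a given delay constraint.
Experimental results confirm the effectiveness of the Carry-Out Replacement Architecture, with clear improvements across several performance metrics. The Carry-Out Replacement Architecture alone improves performance per watt (TOPS/W) by 5.45% over the state-of-the-art compressor-based MAC architecture. Combined with the Register Integration Strategy, the improvement in TOPS/W reaches up to 19.58%. Compared with the baseline design, the architecture also reduces power consumption by 8.77% and area by 13.41%. These results show that the Carry-Out Replacement Architecture adapts well to high-performance, MAC-intensive applications and offers a robust solution for next-generation neural network accelerators.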

Abstract (English)

As neural networks grow increasingly complex, Multiply-Accumulate (MAC) operations become crucial to computational efficiency and energy performance. Compressor-based MAC architectures are known for their high speed and low power consumption, but their potential is often limited by the propagation of carry-out (cout) signals between the multiplier and accumulator stages, which creates latency bottlenecks and reduces throughput.
To overcome these challenges, we present the Carry-Out Replacement Architecture, a novel MAC design that integrates segmented processing and an optimized accumulator structure. By adding registers that temporarily store cout values, the Carry-Out Replacement Architecture defers their accumulation to subsequent stages, which shortens the critical path. This segmentation strategy balances delay across pipeline stages while minimizing the overhead introduced by the additional registers. Additionally, a Register Integration Strategy selectively optimizes specific cout signals, improving efficiency under a given delay constraint.
Experimental results validate the effectiveness of the Carry-Out Replacement Architecture across multiple performance metrics. Without the Register Integration Strategy, the Carry-Out Replacement Architecture achieves a 5.45% increase in TOPS/W compared to the optimized compressor-based MAC architecture. When the Register Integration Strategy is applied, the improvement in TOPS/W reaches up to 19.58%. Furthermore, the architecture achieves an 8.77% reduction in power consumption and a 13.41% reduction in area compared to the baseline design. These results highlight the adaptability and efficiency of the Carry-Out Replacement Architecture for high-performance, MAC-intensive applications, making it a robust solution for next-generation neural network accelerators.
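
The deferred carry idea described in the abstract can be illustrated with a small behavioral model. The sketch below is not the thesis's circuit: it assumes unsigned operands, a single split of the accumulator into a low and a high segment, and one carry register, and all names (`mac_deferred_carry`, `seg_bits`, `total_bits`, `cout_reg`) are hypothetical. It only shows that holding the low segment's carry-out in a register and adding it to the high segment one cycle later, with a final flush, gives the same sum as an ordinary accumulator.

```python
def mac_deferred_carry(pairs, seg_bits=8, total_bits=32):
    """Behavioral sketch of a segmented accumulator with a deferred carry.

    Assumptions (not taken from the thesis): unsigned operands, one
    low/high split at seg_bits, and a single one-bit carry register.
    """
    mask_total = (1 << total_bits) - 1
    mask_lo = (1 << seg_bits) - 1
    mask_hi = (1 << (total_bits - seg_bits)) - 1

    acc_lo = 0    # low segment of the accumulator
    acc_hi = 0    # high segment of the accumulator
    cout_reg = 0  # carry-out of the low segment, registered for one cycle

    for a, b in pairs:
        product = (a * b) & mask_total
        p_lo = product & mask_lo
        p_hi = product >> seg_bits

        lo_sum = acc_lo + p_lo
        # The high segment consumes the carry registered in the previous
        # cycle, so it never waits on this cycle's low-segment addition.
        acc_hi = (acc_hi + p_hi + cout_reg) & mask_hi
        acc_lo = lo_sum & mask_lo
        cout_reg = lo_sum >> seg_bits  # 0 or 1, held for the next cycle

    # Flush: add the last pending carry once the input stream ends.
    acc_hi = (acc_hi + cout_reg) & mask_hi
    return (acc_hi << seg_bits) | acc_lo


if __name__ == "__main__":
    data = [(15, 17), (15, 17), (15, 17), (200, 200)]
    expected = sum(a * b for a, b in data) & 0xFFFFFFFF
    print(mac_deferred_carry(data) == expected)
```

In this model the two segment additions in a cycle are independent of each other, which is the software analogue of a shorter critical path. The cost is one extra register and one flush step at the end of the accumulation.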

Keywords (Chinese)

★ Compressor-based MAC
★ Critical path
★ Throughput
★ High speed
★ Low power

Keywords (English)

Table of Contents

Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
Table of Figures
Table of Tables
Chapter 1 Introduction
1.1 AI Accelerator
1.2 Multiply-Accumulator (MAC)
1.3 Compressor-Based MAC
1.4 Contributions
Chapter 2 Preliminaries
2.1 Compressor-Based MAC
2.2 Components of Compressor-Based MAC
2.2.1 Partial Product Generation
2.2.2 Partial Product Reduction
2.2.3 Final Adder
2.2.4 Accumulator
2.3 Optimized Compressor-Based MAC
2.4 Latency Issues in MAC
Chapter 3 Proposed MAC Architecture
3.1 Carry-Out Replacement Architecture
3.2 Register Integration Strategy
3.3 Example of Register Integration Strategy
Chapter 4 Experimental Results
4.1 Timing Results
4.2 Comparison of Area Overhead
4.3 Comparison of Power Consumption
4.4 Throughputs
Chapter 5 Conclusions
References

Advisor

Jing-Yang Jou (周景揚)

Approval Date

2025-1-17
