A Precision Reconfigurable Process Element Design for Quantized Deep Neural Networks: A Hierarchical Approach

Name

徐麒惟 (Chi-Wei Hsu)

Department

Department of Electrical Engineering

Thesis Title

A Precision Reconfigurable Process Element Design for Quantized Deep Neural Networks: A Hierarchical Approach

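Related Theses

★ An Automatic Construction Method of Wave Digital Filter Structures for Analog Circuit Emulation
★ An Analog Circuit Simulation Platform Using Wave Digital Filters and Nonlinear MOS Models
★ A Nonlinear Transistor Model for Analog Emulators under the Wave Digital Filter Architecture
★ A Node-Preserving Method for Power Network Model Reduction in IR-Drop Analysis
★ A Study on Improving the Identification of Second-Order Threshold Functions with Guided Second-Order Weight Extraction
★ A Fixed-Point Implementation Method for Wave Digital Filter Architectures in Analog Circuit Emulators
★ A Layout Automation Tool Based on Basic Analog Circuit Structures
★ An Analog Layout Migration Technique That Preserves Design Style and Routing Behavior
★ A Method for Automatically Identifying Digital Blocks in Mixed-Signal Circuits
★ A Study of SRAM Power Models for In-Memory Computing
★ An Analog Layout Placement Refinement Method Considering Routability and Shallow Trench Isolation Effects
★ An Efficient Edge AI Accelerator Design: A Reconfigurable Architecture for Depthwise Separable Convolution
★ Generation and Simulation of Wave Digital Filter Architectures for Analog Circuit Emulation
★ A Hardware Optimization Method for Wave Digital Filters in Analog Circuit Emulators
★ A Method for Automatically Identifying Building Blocks and RLC Components in Mixed-Signal Circuits
★ An FPGA Table Reduction Technique for Analog Circuit Emulators Implemented with Wave Digital Filters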

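The author has authorized immediate open access to this electronic thesis.
The open-access full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid violating the law.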

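Abstract (Chinese)

Convolutional Neural Networks (CNNs) are developing rapidly and are used mainly in image recognition, self-driving cars, object detection, and similar applications. When applying a CNN, accuracy and data size are the two key metrics for assessing performance and computational efficiency. Conventional CNNs mostly compute with 32-bit floating-point numbers to maintain high accuracy. However, 32-bit floating-point arithmetic requires 32-bit multiply-accumulate (MAC) units, which not only creates a bottleneck in computational efficiency but also raises power consumption substantially. Researchers have therefore been working on ways to reduce the data size and thereby speed up computation. Quantization is one such method: it reduces the data size, gaining speed and lowering computational complexity, without degrading accuracy too much. Within a CNN, the number of bits required differs from layer to layer, so to strike a better balance between computational efficiency and accuracy, different bit widths are used in different layers. This calls for a precision-adjustable processing element (PE) that supports operations of different bit widths, such as 8bits x 8bits, 8bits x 4bits, 4bits x 4bits, and 2bits x 2bits. The architecture we propose is hierarchical, which removes redundant hardware from the design and reduces the overall chip area. To raise the computation speed, our 8bits x 8bits PE supports two levels of parallelization. In the experiments we use a 90nm process. The results show that, compared with prior work, our 2bits x 2bits PE reduces area by 57.5% to 68%, and that the parallelized architecture lets the 8bits x 8bits PE run at a speed comparable to that of the 4bits x 4bits PE.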
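To make the accuracy versus data-size trade-off described above concrete, the following is a minimal sketch of uniform symmetric quantization at 8, 4, and 2 bits. It is an illustration only, not the quantization scheme used in the thesis, which this record does not specify. The function names `quantize`, `dequantize`, and `max_error` and the sample weights are hypothetical.

```python
def quantize(values, bits):
    """Map floats to signed integer codes in [-qmax, qmax] (assumed symmetric scheme)."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


def max_error(values, bits):
    """Largest absolute error introduced by quantizing `values` to `bits` bits."""
    codes, scale = quantize(values, bits)
    approx = dequantize(codes, scale)
    return max(abs(a - v) for a, v in zip(approx, values))


if __name__ == "__main__":
    weights = [0.82, -0.41, 0.05, -1.27, 0.63]   # made-up example weights
    for bits in (8, 4, 2):
        codes, _ = quantize(weights, bits)
        print(bits, codes, max_error(weights, bits))
```

Fewer bits give smaller codes and cheaper arithmetic but a larger reconstruction error, which is why a layer-by-layer choice of bit width can be attractive.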

Abstract (English)

In deep learning, Convolutional Neural Networks (CNNs) have achieved significant success in many fields, such as visual imagery analysis and self-driving cars. Data size and accuracy are the two major metrics for judging how efficient and effective a system's computations are. Conventional CNN models frequently use 32-bit data to maintain high accuracy. However, performing large numbers of 32-bit multiply-and-accumulate (MAC) operations demands significant computing effort and power. Researchers have therefore developed various methods to reduce data size and speed up calculation. Quantization is one such technique: it reduces the number of bits in the data, and with it the computational complexity, at the cost of some accuracy loss. To provide a better trade-off between computing effort and accuracy, different bit widths may be applied to different layers within a CNN model, so a flexible processing element (PE) that supports operations of different bit widths is in demand. In this work, we propose a hierarchy-based reconfigurable PE structure that supports 8bits x 8bits, 8bits x 4bits, 4bits x 4bits, and 2bits x 2bits operations. The hierarchical structure avoids redundant hardware in the design. To improve calculation speed, our 8bits x 8bits PE uses a two-stage pipeline. Experimental results with 90nm technology show that the 2bits x 2bits PE saves 57.5% to 60% of the area compared with a Precision-Scalable accelerator. In the 8bits x 8bits PE, the two-stage pipeline maintains almost the same calculation speed as the 4bits x 4bits PE.
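The hierarchical idea in the abstract, where wider multiplications are composed from narrower ones, can be sketched in software as follows. This is a rough arithmetic illustration under stated assumptions, not the thesis's circuit: operands are assumed unsigned, the pipeline stages are not modeled, and the function names (`mul2x2`, `mul4x4`, `mul8x4`, `mul8x8`, `split`) are hypothetical.

```python
def split(x, half_bits):
    """Split x into (high, low) parts, where low holds the lowest `half_bits` bits."""
    mask = (1 << half_bits) - 1
    return x >> half_bits, x & mask


def mul2x2(a, b):
    """Leaf of the hierarchy: 2-bit x 2-bit unsigned multiply."""
    assert 0 <= a < 4 and 0 <= b < 4
    return a * b


def mul4x4(a, b):
    """4-bit x 4-bit multiply built from four 2x2 partial products."""
    a_hi, a_lo = split(a, 2)
    b_hi, b_lo = split(b, 2)
    return ((mul2x2(a_hi, b_hi) << 4)
            + ((mul2x2(a_hi, b_lo) + mul2x2(a_lo, b_hi)) << 2)
            + mul2x2(a_lo, b_lo))


def mul8x4(a, b):
    """8-bit x 4-bit multiply built from two 4x4 multiplies."""
    a_hi, a_lo = split(a, 4)
    return (mul4x4(a_hi, b) << 4) + mul4x4(a_lo, b)


def mul8x8(a, b):
    """8-bit x 8-bit multiply built from two 8x4 multiplies."""
    b_hi, b_lo = split(b, 4)
    return (mul8x4(a, b_hi) << 4) + mul8x4(a, b_lo)


if __name__ == "__main__":
    print(mul8x8(200, 123), 200 * 123)   # both values should match
```

Because each level reuses the level below it, the same leaf multipliers serve every supported precision; in hardware, this kind of reuse is one way to avoid duplicating multiplier logic per bit width.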

Keywords (Chinese)

★ Quantized neural networks
★ Processing element
★ Reconfigurable design

Keywords (English)

★ Quantized Neural Networks (QNN)
★ Processing Element (PE)
★ Reconfigurable Design

Table of Contents

Chinese Abstract
Abstract
Acknowledgements
Table of Contents
Table of Figures
Table of Tables
Chapter 1 Introduction
Chapter 2 Background and Related Works
2.1 Convolution Neural Networks (CNNs)
2.2 Quantized Neural Networks (QNNs)
2.3 Reconfigurable Processing Element design
2.3.1 Bit Fusion accelerators
2.3.2 Precision-Scalable accelerators
Chapter 3 A Hierarchical-based Reconfigurable Processing Element Design
3.1 Problem formulation
3.2 Overall architecture
3.3 2bits x 2bits PE architecture
3.4 4bits x 4bits PE architecture
3.5 8bits x 8bits PE architecture
3.6 Short Summary
Chapter 4 Experimental Results
4.1 Processing Element synthesis and simulation
4.2 Throughput of each configuration
4.3 Comparison of power efficiency and area of each configuration
Chapter 5 Conclusions
References

Advisor

周景揚 (Jing-Yang Jou)

Approval Date

2021-10-26