Thesis 107581601 Detailed Record




Name: Muhammad Awais Hussain (歐海珊)    Department: Department of Electrical Engineering
Thesis Title: Edge-optimized Incremental Learning for Deep Neural Networks (深度神經網絡的邊緣優化增量學習)
Related Theses
★ Low-Memory Hardware Design for Real-Time SIFT Feature Extraction
★ Real-Time Face Detection and Recognition for an Access Control System
★ A Self-Driving Vehicle with Real-Time Automatic Following Capability
★ Lossless Compression Algorithm and Implementation for Multi-Lead ECG Signals
★ Offline Customizable Voice and Speaker Wake-Word System with Embedded Implementation
★ Wafer Map Defect Classification and Embedded System Implementation
★ Speech Densely Connected Convolutional Network for Small-Footprint Keyword Spotting
★ G2LGAN: Data Augmentation for Imbalanced Datasets Applied to Wafer Map Defect Classification
★ Algorithm Design Techniques for Compensating the Finite Precision of Multiplierless Digital Filters
★ Design and Implementation of a Programmable Viterbi Decoder
★ Low-Cost Vector Rotator Silicon IP Design Based on Extended Elementary-Angle CORDIC
★ Analysis and Architecture Design of a JPEG2000 Still-Image Coding System
★ Low-Power Turbo Code Decoder for Communication Systems
★ Platform-Based Design for Multimedia Communication
★ Design and Implementation of a Digital Watermarking System for MPEG Encoders
★ Algorithm Development for Video Error Concealment with Data Reuse Considerations
  1. The author has agreed to make the electronic full text of this thesis open access immediately.
  2. The released electronic full text is licensed only for personal, non-commercial retrieval, reading, and printing for the purpose of academic research.
  3. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast the content without authorization.

Abstract: Incremental learning techniques aim to increase the capability of a pre-trained Deep Neural Network (DNN) model by adding new classes to it. However, DNNs suffer from catastrophic forgetting during the incremental learning process. Existing incremental learning techniques require either samples from previous training data or complex model architectures to reduce catastrophic forgetting. This leads to high design complexity and memory requirements, which make incremental learning algorithms infeasible on edge devices with limited memory and computation resources. In this thesis, an On-Chip Incremental Learning (OCIL) accelerator is therefore presented: a hardware-software co-design for energy-efficient, high-speed incremental learning in DNNs at the edge. OCIL features a novel and simple incremental learning algorithm, Learning with Sharing (LwS), which continuously learns new classes in the DNN model with minimal catastrophic forgetting. LwS preserves the knowledge of existing data classes and adds new classes without storing samples from previous classes, and it outperforms state-of-the-art techniques in accuracy on the Cifar-100, Caltech-101, and UCSD Birds datasets. LwS requires a large amount of data movement to train the Fully Connected (FC) layers during incremental learning, so for an energy-efficient OCIL design this data movement is minimized by a novel optimized memory access method for FC-layer training. The method exploits data reuse during the backpropagation of FC layers and achieves a 1.45x-15.5x reduction in memory accesses for different FC layers of multiple DNN models. Owing to this method, OCIL processes FC layers in the backpropagation stage with a throughput comparable to that of forward propagation. Moreover, the optimized memory access method for error/delta calculation unifies the dataflow of the forward and backward passes, eliminating the need for separate Processing Elements (PEs) and complex data controllers for the backward pass. OCIL has been implemented in a 40-nm process technology and operates at a 225 MHz clock rate, consuming 168.8 mW at 0.9 V. The accelerator achieves an area efficiency of 14.9 GOPs/mm2 and an energy efficiency of 682.1 GOPs/W for 32-bit fixed-point numbers.
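The abstract describes LwS only at a high level. As a rough illustration of the class-incremental setting it targets (adding new output classes to a pre-trained network without storing samples of earlier classes), the Python sketch below freezes a shared feature extractor and trains one small fully connected head per added group of classes. The network split, layer names, and training loop are illustrative assumptions, not the actual LwS procedure from the thesis.

# Illustrative sketch only (assumed structure, not the exact LwS algorithm):
# class-incremental extension of a pre-trained network. The shared, frozen
# feature layers are reused for every task; only a small new FC head is
# trained per added group of classes, so no samples from previous classes
# need to be stored.
import torch
import torch.nn as nn

class IncrementalClassifier(nn.Module):
    def __init__(self, feature_extractor: nn.Module, feature_dim: int):
        super().__init__()
        self.features = feature_extractor        # pre-trained backbone, frozen
        for p in self.features.parameters():
            p.requires_grad = False
        self.heads = nn.ModuleList()             # one FC head per added task
        self.feature_dim = feature_dim

    def add_task(self, num_new_classes: int):
        """Attach a new FC head for a newly added group of classes."""
        self.heads.append(nn.Linear(self.feature_dim, num_new_classes))

    def forward(self, x):
        z = self.features(x)
        # Old heads are never retrained, so predictions on old classes are
        # unchanged; this is the intuition for limiting catastrophic forgetting.
        return torch.cat([head(z) for head in self.heads], dim=1)

def train_new_task(model, loader, epochs=5, lr=1e-3):
    """Train only the most recently added head on the new-task data."""
    head = model.heads[-1]
    optimizer = torch.optim.Adam(head.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:                      # y is indexed within the new task
            optimizer.zero_grad()
            logits = head(model.features(x))
            criterion(logits, y).backward()
            optimizer.step()

Because the shared layers and earlier heads stay fixed, this toy setup needs no replay buffer; the actual LwS training mechanism and inference method are described in Sections 2.3.2 and 2.3.3 of the thesis.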
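The 1.45x-15.5x memory access reduction comes from reusing the delta (error) values of an FC layer while computing delta_in = W^T @ delta_out and the weight gradients during backpropagation. The toy cost model below is not the access-counting analysis from the thesis; it only illustrates, under stated assumptions about a naive schedule versus an on-chip delta buffer, why keeping deltas on chip cuts off-chip traffic for a large square FC layer.

# Toy cost model (my own assumptions, not the access-counting analysis from
# the thesis): off-chip accesses needed to compute the hidden-layer error
# delta_in = W^T @ delta_out for one FC layer with n_out outputs and n_in inputs.

def accesses_without_reuse(n_in: int, n_out: int) -> int:
    # Naive schedule: walk W column by column and re-fetch the matching
    # delta_out element from off-chip memory for every multiply-accumulate.
    weight_reads = n_in * n_out
    delta_reads = n_in * n_out
    delta_in_writes = n_in
    return weight_reads + delta_reads + delta_in_writes

def accesses_with_delta_reuse(n_in: int, n_out: int, buffer_elems: int) -> int:
    # Delta-reuse schedule: load delta_out into an on-chip buffer tile by tile
    # and reuse each buffered value across all n_in columns of that tile.
    tiles = -(-n_out // buffer_elems)            # ceiling division
    weight_reads = n_in * n_out
    delta_reads = n_out                          # each delta_out fetched once
    partial_sum_traffic = 2 * n_in * (tiles - 1) # spill/refill partial delta_in sums
    delta_in_writes = n_in
    return weight_reads + delta_reads + partial_sum_traffic + delta_in_writes

if __name__ == "__main__":
    n_in, n_out = 4096, 4096                     # e.g. a large square FC layer
    base = accesses_without_reuse(n_in, n_out)
    opt = accesses_with_delta_reuse(n_in, n_out, buffer_elems=512)
    print(f"access ratio under these assumptions: {base / opt:.2f}x")

In this toy model the weight traffic dominates, so the saving stays near 2x; the thesis's 1.45x-15.5x range comes from its own analysis of the full FC-layer training flow across layers of different shapes, which this sketch does not model.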
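For context on the reported implementation figures, and assuming that energy efficiency means peak throughput divided by power and area efficiency means peak throughput divided by core area (the abstract does not spell out these definitions), the implied peak throughput and core area can be back-calculated as follows.

# Back-calculation from the figures quoted in the abstract. The definitions
# assumed here (energy efficiency = peak throughput / power, area efficiency
# = peak throughput / core area) are assumptions, not statements from the thesis.
power_w = 168.8e-3               # 168.8 mW at 0.9 V, 225 MHz
energy_eff_ops_per_w = 682.1e9   # 682.1 GOPs/W
area_eff_ops_per_mm2 = 14.9e9    # 14.9 GOPs/mm^2

implied_throughput_gops = energy_eff_ops_per_w * power_w / 1e9
implied_area_mm2 = (energy_eff_ops_per_w * power_w) / area_eff_ops_per_mm2
print(f"implied peak throughput: {implied_throughput_gops:.1f} GOPs")   # ~115.1
print(f"implied core area: {implied_area_mm2:.2f} mm^2")                # ~7.73

Under these assumed definitions, the quoted power, energy efficiency, and area efficiency are mutually consistent with a peak throughput of roughly 115 GOPs.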
Keywords ★ Digital Hardware Design
★ Deep Neural Network (DNN)
★ Incremental Learning
★ ASIC Implementation
Table of Contents
1 Introduction
1.1 Motivation of On-Chip Incremental Learning
1.2 Thesis Contribution
1.3 Thesis Organization
2 Learning with Sharing (LwS)
2.1 Incremental Learning
2.1.1 Motivation and Definition
2.1.2 Types of Incremental Learning Algorithms
2.1.3 Applications
2.2 Problem Overview
2.2.1 Previous Works on Incremental Learning Algorithms
2.2.2 A Short Summary of Problems
2.3 Learning with Sharing
2.3.1 Motivation for the LwS Architecture
2.3.2 Training Mechanism
2.3.3 Inference Method
2.4 Experiment Results and Discussions
2.4.1 Data Processing Framework for Audio Data
2.4.2 DNN Model Selection
2.4.3 Environment Settings
2.4.4 Dataset Selection
2.4.5 Dataset Configuration
2.4.6 Baseline Methods
2.4.7 Cifar-100 Results
2.4.8 Comparison with Other Algorithms
2.4.9 Average Accuracy Loss
2.4.10 Caltech-101 Results
2.4.11 CUBS-200-2011 Results
2.4.12 CUBS-200-2011 and TB+29 Results
2.4.13 Memory Requirements Analysis for Incremental Learning on Image Datasets
2.4.14 Comparison with the Partial Network Sharing (PNS) Method
2.4.15 Complexity Analysis
2.4.16 Embedded System Performance Analysis
2.4.17 Attention Maps
2.4.18 Limitations of LwS
2.5 Conclusions
3 Memory Access Optimization for On-Chip Learning
3.1 The Challenges in Training DNNs
3.1.1 Why Is There a Need for Training Process Optimization?
3.2 Memory Access Requirements Using the Traditional Method
3.3 An Optimized Memory Access Method
3.3.1 Delta Reuse Opportunity
3.3.2 A Generic Equation for Updating the Weights in the Hidden FC Layers
3.3.3 Reduction in Memory Accesses
3.3.4 Extra Storage Requirements for Delta Values
3.3.5 Memory Access Overhead
3.3.6 Delta Reuse Factor
3.4 Experiment Results and Discussion
3.4.1 Memory Access Comparison between the Original and Proposed Methods
3.4.2 Energy Consumption Reduction
3.4.3 Increase in Number of Parameters
3.4.4 Delta Reuse for Different Layers of DNNs
3.4.5 Comparison with Other Methods
3.4.6 Verification on Hardware Platforms
3.5 Conclusions
4 On-Chip Incremental Learning (OCIL) Accelerator
4.1 Problems in Performing Incremental Learning On-Chip
4.1.1 Algorithmic Problems
4.1.2 Large and Irregular Memory Accesses in the Backward Pass
4.2 Previous Hardware Accelerators
4.3 Architecture Overview
4.4 Data Processing Flow for the Forward and Backward Passes
4.5 Local Data Buffer
4.6 Memory Access Optimization for Delta/Error Generation
4.7 Unified PE Architecture for the Forward and Backward Passes
4.7.1 PE Utilization in the Forward and Backward Passes
4.7.2 Optimized Weight Loading Method for a Unified PE Architecture
4.8 Weight Update
4.9 Output Classifier
4.10 Loss Function
4.11 Softmax Implementation
4.11.1 Exponent Design
4.11.2 Division Unit Design
4.12 Implementation Results
4.12.1 Incremental Learning Algorithm Evaluation
4.12.2 Comparison with Related Works
4.13 Conclusions
5 Conclusions and Future Directions
5.1 Summary of Contributions
5.2 Future Work
6 References
Advisor: Tsung-Han Tsai (蔡宗漢)    Approval Date: 2023-01-16
