Master's/Doctoral Thesis 106521038 — Detailed Record




Name: Chin-Wei Hsu (許晉瑋)    Department: Electrical Engineering
Thesis title: Design and Implementation of a Separable Convolution Accelerator Based on HWCK Data Scheduling
(基於HWCK資料排程之分離式卷積加速器設計與實現)
Related theses
★ Low-Memory Hardware Design for Real-Time SIFT Feature Extraction
★ A Real-Time Face Detection and Face Recognition Access Control System
★ An Autonomous Vehicle with Real-Time Automatic Following
★ Lossless Compression Algorithm and Implementation for Multi-Lead ECG Signals
★ Offline Custom Voice and Speaker Wake-Word System with Embedded Implementation
★ Wafer Map Defect Classification and Embedded System Implementation
★ Algorithm Design Techniques for Compensating the Finite Precision of Multiplierless Digital Filters
★ Design and Implementation of a Programmable Viterbi Decoder
★ Low-Cost Vector Rotator IP Design Based on Extended Elementary-Angle CORDIC
★ Analysis and Architecture Design of a JPEG2000 Still-Image Coding System
★ A Low-Power Turbo Code Decoder for Communication Systems
★ Platform-Based Design for Multimedia Communication
★ Design and Implementation of a Digital Watermarking System for MPEG Encoders
★ Algorithm Development for Video Error Concealment with Data-Reuse Considerations
★ A Low-Power MPEG Layer III Decoder Architecture Design
★ Platform-Based Design of an AAC Decoder with High-Quality Inverse Quantization
Files: full text available in the system after 2022-8-1
Abstract (Chinese) In recent years, with the advance of GPUs and the arrival of the big-data era, deep learning has brought revolutionary progress to many fields. From basic image pre-processing and image segmentation to face recognition and speech recognition, it has gradually replaced traditional algorithms, showing that the rise of neural networks has driven sweeping changes in artificial intelligence. However, GPU-based products are extremely expensive because of their power consumption and cost, and the enormous computational load of neural network algorithms demands hardware acceleration for real-time operation. This has motivated considerable recent research on digital circuit designs that accelerate convolutional networks.
This thesis proposes a separable convolution hardware architecture based on HWCK data scheduling. Depthwise convolution, pointwise convolution, and batch normalization hardware modules are designed to accelerate depthwise separable convolution models. Through an SoC design, the PS (Processing System) and PL (Programmable Logic) communicate over the AXI4 bus protocol, so the CPU can use our neural network modules on the FPGA to accelerate inference. The HWCK data scheduling method can be reconfigured according to the allocated memory bandwidth and on-chip memory resources; when both are sufficient, the design scales easily. To reduce the storage for network weights, all data are computed and stored in 16-bit fixed point, on-chip memory is accessed through a ping-pong buffer architecture, and data are exchanged with the CPU over the AXI4 bus. The whole architecture is implemented on the Xilinx ZCU106 development board. With the SoC design, precompiled drivers bridge the operating system and external resources while controlling the designed neural network acceleration modules, and a high-level programming language quickly reconfigures the acceleration schedule, improving reconfigurability and allowing the design to run on a variety of embedded platforms. Running FaceNet, the architecture reaches 222 FPS and 60.8 GOPS while consuming only 8.82 W on the ZCU106, an efficiency of 6.89 GOP/s/W.
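The depthwise-plus-pointwise factorization the abstract describes can be sketched in NumPy. The tensor sizes and the 3×3 kernel below are illustrative assumptions, not values taken from the thesis; the multiply-count comparison at the end shows why the separable form is cheaper than a standard convolution:

```python
import numpy as np

def depthwise_conv(x, w):
    """Depthwise 3x3 convolution, stride 1, no padding.
    x: (H, W, C) input; w: (3, 3, C), one filter per channel."""
    H, W, C = x.shape
    out = np.zeros((H - 2, W - 2, C))
    for i in range(H - 2):
        for j in range(W - 2):
            # Each channel is filtered independently (no cross-channel sum).
            out[i, j] = np.sum(x[i:i+3, j:j+3] * w, axis=(0, 1))
    return out

def pointwise_conv(x, w):
    """Pointwise 1x1 convolution. x: (H, W, C), w: (C, K) -> (H, W, K)."""
    return x @ w

x = np.random.rand(8, 8, 16)   # H=8, W=8, C=16 (illustrative sizes)
dw = np.random.rand(3, 3, 16)
pw = np.random.rand(16, 32)    # K=32 output channels
y = pointwise_conv(depthwise_conv(x, dw), pw)
assert y.shape == (6, 6, 32)

# Multiplies: separable = HW*(9C + CK) vs. standard 3x3 conv = HW*9CK.
sep = 6 * 6 * (9 * 16 + 16 * 32)
std = 6 * 6 * (9 * 16 * 32)
print(f"separable/standard multiplies: {sep/std:.3f}")  # about 0.14
```

The ratio works out to 1/K + 1/9, which is why separable models such as MobileNet and FaceNet backbones map well onto bandwidth-limited accelerators.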
Abstract (English) In recent years, deep learning has become popular thanks to improvements in GPUs and the advent of big data, and it has brought revolutionary advances to many fields. Most traditional algorithms have been replaced by deep learning techniques in tasks such as image pre-processing, image segmentation, face recognition, and speech recognition, showing that the rise of neural networks has driven a transformation of artificial intelligence. However, neural networks are limited by the power consumption and cost of GPUs, whose products are extremely expensive; moreover, their large computational load means they must be paired with hardware accelerators for real-time computing. This has motivated a great deal of research on digital circuit designs for convolutional network accelerators.
This thesis presents the design and implementation of a separable convolution accelerator based on HWCK data scheduling. It accelerates depthwise separable convolution models through dedicated depthwise convolution, pointwise convolution, and batch normalization modules. In the SoC design, the PS (Processing System) and PL (Programmable Logic) communicate over the AXI4 bus protocol, so the CPU can invoke the proposed design whenever it needs to accelerate a neural network. The HWCK data scheduling method can be reconfigured according to the memory and bandwidth resources allocated on the DDR4, and the design can be extended easily when bandwidth and memory are sufficient. To reduce the weight storage of the network, all data are computed and stored in 16-bit fixed point. Memory access uses a ping-pong buffer architecture, and data are transferred over the AXI4 bus protocol. The whole hardware architecture is implemented on the Xilinx ZCU106 development board. The SoC design uses precompiled drivers to bridge the operating system and external resources and to control the neural network acceleration modules on the FPGA, while a high-level programming language quickly reconfigures the network schedule, improving the hardware's reconfigurability. Running FaceNet, this architecture reaches 222 FPS and 60.8 GOPS; power consumption on the ZCU106 board is 8.82 W, for an efficiency of 6.89 GOP/s/W.
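The numbers quoted above are self-consistent (60.8 GOPS at 8.82 W is about 6.89 GOP/s/W), and the 16-bit fixed-point storage can be sketched briefly. The abstract states only "16-bit fixed point", so the Q8.8 integer/fraction split below is an assumption for illustration:

```python
import numpy as np

# Efficiency check: throughput / power matches the quoted 6.89 GOP/s/W.
assert abs(60.8 / 8.82 - 6.89) < 0.01

FRAC_BITS = 8  # assumed Q8.8 split; the thesis states only "16-bit fixed point"

def to_fixed(x):
    """Quantize floats to signed 16-bit fixed point with saturation."""
    return np.clip(np.round(x * (1 << FRAC_BITS)), -32768, 32767).astype(np.int16)

def to_float(q):
    """Recover the floating-point value of a fixed-point word."""
    return q.astype(np.float64) / (1 << FRAC_BITS)

w = np.array([0.5, -1.25, 3.001])
q = to_fixed(w)
# Rounding error is bounded by half an LSB, i.e. 2**-(FRAC_BITS + 1).
assert np.all(np.abs(to_float(q) - w) <= 2.0 ** -(FRAC_BITS + 1))
```

Storing weights and activations in 16 bits halves memory traffic relative to 32-bit floats, which is what makes the ping-pong buffering and AXI4 transfers described above feasible within the board's bandwidth budget.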
Keywords (Chinese) ★ Hardware accelerator
★ Deep learning
★ Field-programmable gate array
★ System on a chip
Keywords (English) ★ Hardware accelerator
★ Deep learning
★ Field Programmable Gate Array
★ System on a Chip
Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgments
1. Introduction
1.1. Research Background and Motivation
1.2. Thesis Organization
2. Literature Review
2.1. Hardware Accelerators
2.2. Face Recognition
3. Network Model Selection and Results
3.1. Network Architecture Selection
3.2. Datasets
3.3. Training Strategy and Results
4. Hardware Architecture Design
4.1. Hardware Block Diagram of the Overall System
4.2. Separable Hardware Acceleration Control Module
4.3. Memory Transfer Control Module
4.4. HWCK Data Scheduling Method
4.5. Depthwise Convolution Hardware Module (Depthwise CNN Module)
4.6. Pointwise Convolution Hardware Module (Pointwise CNN Module)
5. Hardware Implementation Results
5.1. Depthwise Convolution Simulation Results
5.2. Pointwise Convolution Simulation Results
5.3. Synthesis Results of the Separable Convolution Architecture
5.4. FPGA Implementation Results
5.5. Chip Design Specifications
6. Conclusion
References
Advisor: Tsung-Han Tsai (蔡宗漢)    Review date: 2020-7-20

For thesis-related questions, please contact the Promotion Services Division, National Central University Library, TEL: (03)422-7151 ext. 57407, or by e-mail. — Privacy Policy Statement