Thesis 106521043: Detailed Record




Name: Yuan-Chen Ho (何元禎)    Department: Electrical Engineering
Thesis Title: Implementation of Hand Gesture Recognition with Deep Neural Network and Its Hardware Architecture Design
(以深度神經網路實現手勢辨識及其硬體架構設計)
Related Theses
★ Low-memory hardware design for real-time SIFT feature extraction
★ Real-time face detection and recognition for an access control system
★ An autonomous vehicle with real-time automatic following
★ Lossless compression algorithm and implementation for multi-lead ECG signals
★ Offline user-defined wake-word and speaker recognition system with embedded implementation
★ Wafer map defect classification with embedded system implementation
★ Speech densely connected convolutional networks for small-footprint keyword spotting
★ G2LGAN: data augmentation for imbalanced datasets applied to wafer map defect classification
★ Algorithm design techniques for compensating the finite precision of multiplierless digital filters
★ Design and implementation of a programmable Viterbi decoder
★ Low-cost vector rotator IP design based on extended elementary-angle CORDIC
★ Analysis and architecture design of a JPEG2000 still-image coding system
★ Low-power turbo decoder for communication systems
★ Platform-based design for multimedia communication
★ Design and implementation of a digital watermarking system for MPEG encoders
★ Algorithm development for video error concealment and its data-reuse considerations
File access:
  1. This electronic thesis is approved for immediate open access.
  2. Users of the open-access electronic full text are licensed to search, read, and print it only for personal, non-profit academic research.
  3. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.

Abstract (Chinese) In recent years, research on deep learning has grown increasingly broad, from basic image pre-processing and image segmentation to object recognition and semantic analysis, gradually replacing traditional image-processing algorithms. Traditional hand gesture recognition algorithms must rely on depth information to recognize gestures correctly in complex scenes, and even then recognize them poorly; the depth information in turn requires a depth camera or a dual-CMOS camera, which is inconvenient for ordinary users. This thesis therefore proposes a hand gesture recognition method based on a deep neural network, together with its hardware architecture design, which can recognize gestures in complex scenes with only a single CMOS camera. The work consists of two parts: training the neural network model and implementing the hardware architecture. For the network training part, a depthwise separable convolution architecture is used to build the model. During training, the model is divided into a hand segmentation submodel and a gesture classification submodel; the segmentation submodel is trained first and serves as an attention model that raises the recognition rate of the classification submodel. During inference, only the classification submodel and part of the segmentation submodel are needed to recognize gestures, avoiding the full model and thus reducing the required weight parameters and computation.
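As a concrete illustration of the two ideas above, the following is a minimal PyTorch sketch of a depthwise separable convolution block and of a segmentation branch whose predicted mask gates the features fed to the gesture classifier. The layer sizes, module names, and gating scheme are illustrative assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

class GestureNet(nn.Module):
    """Hypothetical two-headed model: segmentation mask gates the classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            DepthwiseSeparableConv(3, 32, stride=2),
            DepthwiseSeparableConv(32, 64),
        )
        # Segmentation head: predicts a soft hand mask (trained first).
        self.seg_head = nn.Conv2d(64, 1, 1)
        # Classification head: consumes the mask-gated features.
        self.cls_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))

    def forward(self, x):
        feat = self.backbone(x)
        mask = torch.sigmoid(self.seg_head(feat))  # soft hand mask
        return self.cls_head(feat * mask)          # attention-gated features
```

In this sketch, only the backbone, the small segmentation head, and the classifier run at inference time, which mirrors the stated goal of skipping most of the segmentation submodel after training.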
For the hardware implementation part, we design depthwise convolution, pointwise convolution, batch normalization, and max-pooling modules to accelerate the depthwise separable convolution model. Feature data is buffered in on-chip memory and then transferred to off-chip memory through DMA to reduce on-chip memory usage. To further shrink the memory needed for weight parameters and feature data, all data is computed and stored as 16-bit fixed-point values. A ping-pong memory architecture is also planned to maximize on-chip memory utilization and reduce the number of off-chip memory accesses. The whole system is implemented on a Xilinx ZCU106 development board: a CMOS camera feeds images into the FPGA, and after gesture recognition the original image and the result are output over HDMI and displayed on a monitor. The processing speed reaches 52.6 FPS with a throughput of 65.6 GOPS.
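To make the 16-bit fixed-point scheme concrete, here is a small NumPy sketch of quantization and multiplication. The Q8.8 split (8 integer bits, 8 fractional bits) is an assumption for illustration only; the abstract does not state the exact integer/fraction allocation.

```python
import numpy as np

FRAC_BITS = 8        # assumed number of fractional bits (Q8.8)
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    """Quantize float data to int16 with saturation."""
    q = np.round(np.asarray(x, dtype=np.float64) * SCALE)
    return np.clip(q, -32768, 32767).astype(np.int16)

def fixed_mul(a, b):
    """Multiply two Q8.8 numbers: widen to int32, then rescale back."""
    return ((a.astype(np.int32) * b.astype(np.int32)) >> FRAC_BITS).astype(np.int16)

w = to_fixed([0.75, -1.5])
x = to_fixed([2.0, 0.25])
print(fixed_mul(w, x) / SCALE)   # approx [ 1.5  -0.375]
```

The same widen-multiply-rescale pattern applies inside the convolution modules, which is why 16-bit storage roughly halves the memory footprint compared with 32-bit floating point.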
Abstract (English) Research in deep learning has expanded rapidly in recent years, from image pre-processing and image segmentation to object recognition, semantic analysis, and beyond, gradually replacing traditional algorithms. Traditional hand gesture recognition algorithms must rely on depth information to recognize gestures correctly against complex backgrounds, and even then the recognition rate is poor. Depth information must be obtained with a depth camera or a dual CMOS camera, which is inconvenient for the common user due to its high price. Therefore, this thesis proposes a hand gesture recognition method based on a deep neural network, together with an implementation of its hardware architecture. It needs only a single CMOS camera and can recognize hand gestures against complex backgrounds. The research divides into two parts: the design of the neural network model and the implementation of the hardware architecture. In the network design part, depthwise separable convolutions are used to build the model, which splits into a segmentation submodel and a classification submodel. Training the segmentation submodel first, as an attention model, improves the recognition rate of the classification submodel. In the inference stage, hand gesture recognition is performed using only the classification submodel and part of the segmentation submodel, avoiding the whole model and thereby reducing the amount of weights and computation.
In the hardware implementation part, this work designs depthwise convolution, pointwise convolution, batch normalization, and max-pooling modules to accelerate the depthwise separable convolution. The design buffers feature data in on-chip memory and transfers it to off-chip memory through DMA to reduce on-chip memory usage. Weights and feature data are computed and stored as 16-bit fixed-point values, reducing the memory they require. This work also implements a ping-pong memory to maximize on-chip memory utilization and reduce the number of off-chip memory accesses. The whole system is implemented on the Xilinx ZCU106 development board: a CMOS camera sends images into the FPGA, and after a gesture is recognized, the original image and the recognition result are output through HDMI and displayed on a monitor. The implemented system achieves a frame rate of 52.6 FPS and a throughput of 65.6 GOPS.
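The ping-pong (double) buffering idea mentioned above can be sketched in a few lines: while the compute engine reads one buffer, the DMA fills the other, and the roles swap for each tile. Below is a minimal Python sketch with hypothetical dma_load/compute callables; in hardware the two operations run concurrently, and the sequential loop here only shows the buffer alternation.

```python
def process_tiles(tiles, dma_load, compute):
    """Hypothetical ping-pong buffering loop (sequential model of the overlap)."""
    if not tiles:
        return []
    buf = [None, None]            # two on-chip buffers: "ping" and "pong"
    buf[0] = dma_load(tiles[0])   # prefetch the first tile
    results = []
    for i in range(len(tiles)):
        cur = i % 2
        nxt = (i + 1) % 2
        if i + 1 < len(tiles):
            buf[nxt] = dma_load(tiles[i + 1])  # DMA fills the idle buffer...
        results.append(compute(buf[cur]))      # ...while compute uses the other
    return results
```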
Keywords
★ Hand gesture recognition
★ Deep neural network
★ Depthwise separable convolution
★ Hardware accelerator
★ FPGA
Table of Contents
Abstract (Chinese) I
Abstract (English) II
1. Introduction 1
1.1. Research Background and Motivation 1
1.2. Thesis Organization 4
2. Literature Review 5
2.1. Hand Segmentation 5
2.2. Hand Gesture Recognition 12
2.3. Hardware Accelerators 16
3. Network Model Design and Results 21
3.1. Datasets 21
3.2. Data Augmentation Strategy 23
3.3. Hand Segmentation Network Design 24
3.4. Training Strategy 28
3.5. Hand Segmentation Results 29
3.6. Hand Gesture Recognition Network Design 31
3.7. Hand Gesture Recognition Results 34
4. Hardware Architecture Design 36
4.1. System Block Diagram 36
4.2. Quantization Strategy 38
4.3. Depthwise Convolution Module (DEPTHWISE CNN) 38
4.4. Pointwise Convolution Module (POINTWISE CNN) 40
4.5. Other Modules 42
4.6. Memory Planning 43
5. Hardware Implementation Results 45
5.1. Depthwise Convolution Simulation Results 45
5.2. Pointwise Convolution Simulation Results 46
5.3. Full-System Simulation Results 47
5.4. FPGA Synthesis Results 48
6. Conclusion 50
References 51
References
[1] P. Kumar, S. S. Rautaray, and A. Agrawal, “Hand data glove: A new generation real-time mouse for human-computer interaction,” in International Conference on Recent Advances in Information Technology (RAIT), pp. 750-755, 2012.
[2] J. P. Wachs, M. Kolsch, H. Stern, and Y. Edan, “Vision-based hand-gesture applications,” Commun. ACM, vol. 54, no. 2, pp. 60-71, Feb. 2011.
[3] S. S. Rautaray and A. Agrawal, “Vision-based hand gesture recognition for human computer interaction: a survey,” Artificial Intelligence Review, vol. 43, no. 1, pp. 1-54, 2015.
[4] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273-297, 1995.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, pp. 91-99, 2015.
[7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
[8] T.-H. Tsai, C.-C. Huang, and K.-L. Zhang, “Embedded virtual mouse system by using hand gesture recognition,” in IEEE International Conference on Consumer Electronics - Taiwan (ICCE-TW), pp. 352-353, June 2015.
[9] W. Wang and J. Pan, “Hand segmentation using skin color and background information,” in International Conference on Machine Learning and Cybernetics, Xi'an, pp. 1487-1492, 2012.
[10] M. Van den Bergh and L. Van Gool, “Combining RGB and ToF cameras for real-time 3D hand gesture interaction,” in IEEE Workshop on Applications of Computer Vision (WACV), Kona, HI, pp. 66-72, 2011.
[11] S. Bilal, R. Akmeliawati, M. J. E. Salami, A. A. Shafie, and E. M. Bouhabba, “A hybrid method using Haar-like and skin-color algorithm for hand posture detection, recognition and tracking,” in IEEE International Conference on Mechatronics and Automation, Xi'an, pp. 934-939, 2010.
[12] J. Guo, J. Cheng, J. Pang, and Y. Guo, “Real-time hand detection based on multi-stage HOG-SVM classifier,” in IEEE International Conference on Image Processing, Melbourne, VIC, pp. 4108-4111, 2013.
[13] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
[14] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv:1706.05587, 2017.
[15] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” arXiv:1802.02611, 2018.
[16] G. Lin, A. Milan, C. Shen, and I. Reid, “RefineNet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation,” arXiv:1611.06612, 2016.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385, 2015.
[18] P. K. Pisharady, P. Vadakkepat, and A. P. Loh, “Attention based detection and recognition of hand postures against complex backgrounds,” International Journal of Computer Vision, vol. 101, no. 3, pp. 403-419, 2013.
[19] G. Plouffe and A.-M. Cretu, “Static and dynamic hand gesture recognition in depth data using dynamic time warping,” IEEE Transactions on Instrumentation and Measurement, vol. 65, no. 2, pp. 305-316, 2016.
[20] G. Marin, F. Dominio, and P. Zanuttigh, “Hand gesture recognition with Leap Motion and Kinect devices,” in IEEE International Conference on Image Processing (ICIP), pp. 1565-1569, 2014.
[21] P. Barros, S. Magg, C. Weber, and S. Wermter, “A multichannel convolutional neural network for hand posture recognition,” in International Conference on Artificial Neural Networks, pp. 403-410, 2014.
[22] P. Narayana, J. R. Beveridge, and B. A. Draper, “Gesture recognition: Focus on the hands,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5235-5244, 2018.
[23] A. Dadashzadeh, A. T. Targhi, M. Tahmasbi, and M. Mirmehdi, “HGR-Net: A fusion network for hand gesture segmentation and recognition,” arXiv:1806.05653, 2018.
[24] M. Matilainen, P. Sangi, J. Holappa, and O. Silven, “OUHANDS database for hand detection and pose recognition,” in International Conference on Image Processing Theory, Tools and Applications (IPTA), pp. 1-5, 2016.
[25] HGR1 dataset. http://sun.aei.polsl.pl/~mkawulok/gestures/
[26] Y. Ma, N. Suda, Y. Cao, J.-S. Seo, and S. Vrudhula, “Scalable and modularized RTL compilation of convolutional neural networks onto FPGA,” in International Conference on Field Programmable Logic and Applications (FPL), pp. 1-8, 2016.
[27] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Feb. 2015.
[28] K. Guo et al., “Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 35-47, Jan. 2018.
[29] L. Sifre and S. Mallat, “Rigid-motion scattering for texture classification,” arXiv:1403.1687, Mar. 2014.
[30] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, Apr. 2017.
[31] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” arXiv:1801.04381, Apr. 2018.
[32] J. Su et al., “Redundancy-reduced MobileNet acceleration on reconfigurable logic for ImageNet classification,” in Applied Reconfigurable Computing: Architectures, Tools, and Applications, pp. 16-28, 2018.
[33] L. Bai, Y. Zhao, and X. Huang, “A CNN accelerator on FPGA using depthwise separable convolution,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 65, no. 10, pp. 1415-1419, Oct. 2018.
[34] B. Liu et al., “An FPGA-based CNN accelerator integrating depthwise separable convolution,” Electronics, vol. 8, no. 3, p. 281, 2019.
[35] S. Bambach, S. Lee, D. Crandall, and C. Yu, “Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions,” in IEEE International Conference on Computer Vision (ICCV), 2015.
[36] A. U. Khan and A. Borji, “Analysis of hand segmentation in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[37] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303-338, 2010.
[38] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[39] UltraScale Architecture-Based FPGAs Memory IP v1.4, LogiCORE IP Product Guide PG150, Xilinx, May 22, 2019.
Advisor: Tsung-Han Tsai (蔡宗漢)    Date of Approval: 2019-08-01