Abstract (English) |
Hearing-impaired people rely on sign languages to communicate with each other but may have difficulty interacting with people who do not understand sign languages. Because sign languages are visual languages, computer vision approaches to sign language recognition are usually considered a feasible way to bridge the gap. However, sign language recognition is a complex task that requires classifying hand shapes, hand motions, and facial expressions. Because the hands are the most important elements, detecting and classifying hand gestures should be the first step. This research therefore focuses on hand feature extraction and gesture recognition for Taiwan Sign Language (TSL) videos.
First, we built a synthetic dataset using Unity3D. Synthetic data reduces the effort of manual labeling and avoids possible labeling errors, so a large dataset with high-quality labels can be obtained. The dataset is generated by varying hand shapes, colors, and orientations. The background images are also varied to increase the robustness of the model, and motion blur is added to make the synthetic data look closer to real cases. We compare three feature extraction methods: bounding boxes, semantic segmentation generated by ResNeSt+Detectron2, and heatmaps generated by EfficientDet. The bounding boxes are selected for the subsequent gesture recognition. We also use Unity3D to create several basic TSL sign gestures, and then use ResNeSt for classification and recognition.
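As an illustration of the motion-blur step mentioned above, the following is a minimal sketch in Python with NumPy. It is not the Unity3D pipeline used in this research: the function name `motion_blur`, the horizontal box kernel, and the kernel length of 9 pixels are all assumptions chosen only to show how a rendered frame can be smeared along one direction.

```python
import numpy as np


def motion_blur(image: np.ndarray, length: int = 9) -> np.ndarray:
    """Apply a horizontal box motion blur of `length` pixels to an H x W x C image.

    Hypothetical helper for illustration; the kernel shape and length are assumptions.
    """
    if length < 1 or length % 2 == 0:
        raise ValueError("length must be a positive odd integer")
    pad = length // 2
    # Replicate edge pixels so the output keeps the input width.
    padded = np.pad(
        image.astype(np.float64), ((0, 0), (pad, pad), (0, 0)), mode="edge"
    )
    width = image.shape[1]
    # Average `length` horizontally shifted copies (a 1 x length box kernel).
    return sum(padded[:, k:k + width, :] for k in range(length)) / length


# Stand-in for one rendered frame: a white square on a black background.
frame = np.zeros((32, 32, 3), dtype=np.float64)
frame[12:20, 12:20, :] = 1.0
blurred_frame = motion_blur(frame, length=9)
```

A box kernel is the simplest choice here. A real rendering pipeline would more likely blur along the actual direction of hand motion, which this sketch does not attempt.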
Experimental results demonstrate that the synthetic dataset can effectively help train suitable models for hand feature extraction and gesture recognition in TSL videos. |