Skeleton-Based Continuous Sign Language Action Detection and Recognition

Name

張檳妮 (Binni Zhang)

Department

Department of Computer Science and Information Engineering

Thesis Title

基於人體骨骼圖的手語動作偵測與辨識
(Skeleton-Based Continuous Sign Language Action Detection and Recognition)

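This electronic thesis is authorized for immediate open access.
The open-access electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid breaking the law.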

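Abstract (Chinese, translated)

Research on applying deep learning to image recognition has advanced enormously over the past decade. Deep learning methods are now used routinely across many areas of everyday life to analyze visual data automatically and extract the information needed, and most of these efforts have produced good results. As computing hardware has become more powerful and computational methods have been steadily optimized, this line of work has expanded from still images to video.
People with hearing or speech impairments often face considerable inconvenience in daily life, particularly when communicating with people who have no such impairments. Throughout the development of deep learning and image recognition, researchers have tried to use computers to help them communicate freely with others in the sign language they already use. This thesis combines deep learning and video processing to study sign language action detection and recognition: we build a deep learning network that processes sign language video and translates it into text. The network operates in roughly three steps.
The first step is skeleton feature extraction. Videos of human motion usually contain much information unrelated to the motion itself, such as the background, clothing, or hairstyle. To keep this irrelevant information from affecting the result, we first use the OpenPose module to extract a human skeleton graph from every frame of the video, retaining only motion-related information. We then apply a specialized convolutional network, the graph convolutional network (GCN), to the skeleton graphs to extract motion features. The second step is segmenting candidate action segments. Given the per-frame motion features, we use small convolutional networks to locate the start and end time points of actions and combine them with a preliminary judgment of whether the frames between those points contain an action, which yields a set of candidate action segments from the full video. The third step is classifying these candidate segments and removing temporally overlapping candidates by non-maximum suppression, which gives the final action detection and recognition results.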
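As a rough illustration of the graph convolution step described above, the following sketch applies one graph-convolution layer to the joint coordinates of a single frame. It is not the network used in the thesis: the 5-joint toy skeleton, the edge list, the weight values, and the symmetric normalization with self-loops (one common formulation of graph convolution) are all assumptions chosen for illustration. A real pipeline would take its joints and connectivity from the OpenPose output and learn the weights.

```python
import math

# Hypothetical 5-joint toy skeleton (NOT the OpenPose joint layout).
# Each pair is an undirected bone between two joint indices.
EDGES = [(0, 1), (1, 2), (1, 3), (3, 4)]
NUM_JOINTS = 5


def normalized_adjacency(num_joints, edges):
    """Return D^-1/2 (A + I) D^-1/2 as a nested list (assumed normalization)."""
    # Start from the identity so every joint also aggregates its own feature.
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]


def matmul(x, y):
    """Plain nested-list matrix product of x (n x k) and y (k x m)."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def gcn_layer(a_hat, features, weights):
    """One graph convolution: ReLU(A_hat @ X @ W)."""
    out = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in out]


# One frame: an (x, y) coordinate per joint (made-up values).
frame = [[0.0, 0.0], [0.0, 1.0], [-1.0, 1.5], [1.0, 1.5], [1.5, 2.5]]
# Arbitrary 2 -> 3 channel weights; in practice these are learned.
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]

a_hat = normalized_adjacency(NUM_JOINTS, EDGES)
hidden = gcn_layer(a_hat, frame, weights)  # 5 joints x 3 channels
```

Each output row mixes a joint's own coordinates with those of the joints it is connected to, which is how the layer can encode pose structure without seeing any background pixels.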
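In our experiments we use the CSLR (Chinese Sign Language Recognition Dataset) dataset, which consists of 2D RGB sign language videos, each less than 10 seconds long, with the signer facing the recording device. We select 15 continuous sign language sentences and classify the 31 words they contain. The three modules are trained by alternately freezing part of the parameters. After training, we compared our results with those of other sign language recognition networks evaluated on the same dataset; our sentence accuracy reached 84.5%.
This thesis has two main features. First, it represents actions with human skeleton keypoint graphs and extracts features with graph convolution, which filters out background features unrelated to the action. Second, it uses small convolutional networks to detect the start and end time points of actions, which requires little computation and allows detected actions of flexible length.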

Abstract (English)

The application of deep learning to image recognition has made tremendous progress in the past decade. People have begun to use deep learning methods frequently in various fields of social life to analyze visual data automatically and obtain the required information, and most of these attempts have achieved very good results. With the improvement of computer hardware and the continuous optimization of computing methods, such research has expanded from processing image data to processing video data.
People with hearing or speech impairments often encounter many inconveniences in social life, especially when communicating with people who do not have such impairments. As deep learning and image recognition have developed, people have been trying to help them through computers so that they can use their customary sign language to communicate freely with others. This thesis combines deep learning and video processing to study sign language action detection and recognition; it builds a deep learning network that processes sign language video and translates it into text. To achieve this goal, the network we built works in three steps.
The first step is skeleton feature extraction. Human motion video often contains a lot of information unrelated to the movement, for example background, clothing, or hairstyle. To exclude the influence of this irrelevant information on the result, we first use the OpenPose module to extract the human skeleton from each frame of the video, so that only movement-related information is retained. Afterwards, a special convolutional network, the graph convolutional network (GCN), is applied to the skeleton graph to extract motion features. The second step is to segment candidate action clips. After obtaining the motion features of each frame, we use small convolutional networks to find the start and end time points of actions and, combined with a preliminary judgment of whether the span between those points is an action, segment candidate action clips from the whole video. The third step is to classify these candidate clips and remove those overlapping in time by non-maximum suppression to obtain the final action detection and recognition results.
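The second and third steps can be pictured with a small sketch that pairs likely start and end frames into candidate clips and then applies temporal non-maximum suppression. This is only an illustration under stated assumptions: the abstract does not give the scoring rule, so the product of start probability, end probability, and mean in-clip action probability is an assumed score, and the function names, thresholds, and probability values below are hypothetical. In the thesis the per-frame probabilities would come from the small convolutional networks.

```python
def boundary_proposals(start_prob, end_prob, action_prob,
                       threshold=0.5, max_len=50):
    """Pair candidate start/end frames into (start, end, score) clips."""
    starts = [t for t, p in enumerate(start_prob) if p >= threshold]
    ends = [t for t, p in enumerate(end_prob) if p >= threshold]
    proposals = []
    for s in starts:
        for e in ends:
            if s < e <= s + max_len:
                inside = action_prob[s:e + 1]
                actionness = sum(inside) / len(inside)
                # Assumed scoring rule: boundary confidence times actionness.
                score = start_prob[s] * end_prob[e] * actionness
                proposals.append((s, e, score))
    return proposals


def temporal_iou(a, b):
    """Intersection over union of two clips treated as intervals [s, e]."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(proposals, iou_threshold=0.5):
    """Keep the best-scoring clips, dropping those that overlap a kept one."""
    kept = []
    for cand in sorted(proposals, key=lambda p: p[2], reverse=True):
        if all(temporal_iou(cand, k) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept


# Made-up per-frame probabilities for a 10-frame video.
start_prob = [0.9, 0.2, 0.1, 0.1, 0.1, 0.8, 0.1, 0.1, 0.1, 0.1]
end_prob = [0.1, 0.1, 0.1, 0.7, 0.9, 0.1, 0.1, 0.1, 0.9, 0.1]
action_prob = [0.9, 0.9, 0.9, 0.9, 0.8, 0.9, 0.9, 0.9, 0.9, 0.1]

candidates = boundary_proposals(start_prob, end_prob, action_prob, max_len=5)
detections = temporal_nms(candidates)
spans = [(s, e) for s, e, _ in detections]  # [(0, 4), (5, 8)]
```

In this toy input the clips (0, 3) and (0, 4) overlap heavily, so only the higher-scoring (0, 4) survives, while the non-overlapping (5, 8) is kept as a second detection. Each surviving clip would then carry the class label assigned by the classification step.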
In the experiments, we use the CSLR (Chinese Sign Language Recognition Dataset) dataset, which consists of 2D RGB sign language videos. Each video is less than 10 seconds long, and the signer faces the recording device. We take 15 continuous sign language sentences and classify the 31 words in them. The three modules were trained by alternately fixing some of the parameters. After training, we compared our results with those of other sign language recognition networks that used the same dataset. Our sentence accuracy reached 84.5%.
This thesis has two main features. First, human skeleton keypoint graphs are used to represent actions and graph convolution is used to extract features, so that background features unrelated to the actions are filtered out. Second, small convolutional networks are used to detect the start and end time points of actions, which requires less computation and allows the detected actions to have flexible lengths.

Keywords (Chinese)

★ action detection (動作偵測)
★ action classification (動作分類)

Keywords (English)

★ action detection
★ action recognition

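Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1. Research Motivation
1.2. System Architecture
1.3. Thesis Organization
Chapter 2 Related Work
2.1. Skeleton-Based Gesture Recognition
2.2. Graph Convolution
2.3. Sign Language Recognition
Chapter 3 Overall Network Architecture
3.1. Skeleton Feature Extraction Module
3.2. Candidate Segment Segmentation Module
3.3. Action Classification Module
Chapter 4 Experimental Results and Discussion
4.1. Experimental Equipment
4.2. Convolutional Neural Network Training
4.3. Evaluation Criteria and Experimental Results
Chapter 5 Conclusion and Future Work
5.1. Conclusion
5.2. Future Work
References
Appendix 1 Dataset Categories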

Advisor

曾定章 (Din-Chang Tseng)

Approval Date

2020-7-29
