dc.description.abstract | The application of deep learning to image recognition has made tremendous progress over the past decade. Deep learning methods are now routinely used in many areas of social life to analyze visual data automatically and extract the required information, and most of these attempts have achieved very good results. With improvements in computer hardware and the continuous optimization of algorithms, such research has expanded from processing image data to processing video data.
People with hearing impairment and speech disorders often face many inconveniences in social life, especially when communicating with people who do not have such impairments. As deep learning and image recognition have developed, researchers have tried to use computers to help them, so that they can communicate freely in the sign language they use every day with people who do not know it. This thesis combines deep learning and video processing techniques to study sign language motion detection and recognition: it builds a deep learning network that processes sign language video and translates it into text. To achieve this goal, the network we built works in three steps.
The first step is skeleton feature extraction. Human motion images often contain a great deal of information that is unrelated to movement, such as the background, clothing, or hairstyle. To exclude the influence of this irrelevant information on the result, we first use the OpenPose module to extract the human skeleton from each frame of the video, so that only movement-related information is retained. A graph convolutional network (GCN) is then used to extract motion features from the skeleton graph. The second step is to segment suspected motion fragments. After obtaining the motion features of each frame, we use a small convolutional network to find the start and end time points of a motion and, combined with a preliminary judgment of whether a motion occurs between those points, segment candidate motion clips from the whole video. The third step is to classify these candidate motion fragments and to remove fragments that overlap in time by non-maximum suppression, which yields the final motion detection and recognition results.
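To make the two concrete algorithmic pieces of this pipeline easier to follow, the sketch below shows (a) one graph-convolution step over skeleton keypoints and (b) temporal non-maximum suppression over candidate segments. It is only a minimal illustration, not the thesis implementation: the toy 5-joint skeleton, feature sizes, and the IoU threshold are assumptions made for the example.

```python
# Minimal illustrative sketch (not the thesis code): one graph-convolution step
# over skeleton keypoints, and temporal NMS over candidate action segments.
import numpy as np

def gcn_layer(X, A, W):
    """One graph-convolution step: relu(D^{-1/2} (A + I) D^{-1/2} X W).

    X: (num_joints, in_features) per-frame keypoint features (e.g. x, y, confidence)
    A: (num_joints, num_joints) skeleton adjacency (1 where two joints are linked)
    W: (in_features, out_features) learnable weights
    """
    A_hat = A + np.eye(A.shape[0])            # add self-connections
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))  # symmetric normalization
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W, 0.0)

def temporal_nms(segments, iou_threshold=0.5):
    """Keep the highest-scoring segments, dropping those that overlap them in time.

    segments: list of (start_frame, end_frame, score) tuples
    """
    kept = []
    for start, end, score in sorted(segments, key=lambda s: s[2], reverse=True):
        overlaps = False
        for k_start, k_end, _ in kept:
            inter = max(0.0, min(end, k_end) - max(start, k_start))
            union = (end - start) + (k_end - k_start) - inter
            if union > 0 and inter / union > iou_threshold:
                overlaps = True
                break
        if not overlaps:
            kept.append((start, end, score))
    return kept

if __name__ == "__main__":
    # Toy 5-joint "skeleton": head-neck, neck-left hand, neck-right hand, neck-hip
    A = np.zeros((5, 5))
    for i, j in [(0, 1), (1, 2), (1, 3), (1, 4)]:
        A[i, j] = A[j, i] = 1
    X = np.random.rand(5, 3)          # (x, y, confidence) per joint
    W = np.random.rand(3, 8)          # project to 8 motion features
    print(gcn_layer(X, A, W).shape)   # (5, 8)
    print(temporal_nms([(10, 40, 0.9), (15, 45, 0.6), (60, 90, 0.8)]))
```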
In our experiments, we use the CSLR (Chinese Sign Language Recognition) dataset, a collection of 2D RGB sign language videos. Each video is less than 10 seconds long, and the signers face the recording device. We select 15 consecutive sign language sentences, and the 31 words they contain, as classification targets. The three modules of the network were trained by alternately fixing the parameters of some modules while updating the others. After training was completed, we compared our results with other sign language recognition networks evaluated on the same dataset; our sentence accuracy reached 84.5%.
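The abstract states only that the three modules were trained by alternately fixing some parameters; the following is a hypothetical PyTorch-style sketch of such an alternating schedule. The module classes and names are placeholders standing in for the actual feature extractor, boundary detector, and classifier, not the thesis implementation.

```python
# Hypothetical sketch of alternating training: freeze all modules except the
# one currently being updated. Modules here are illustrative placeholders.
import torch
import torch.nn as nn

feature_net = nn.Linear(16, 32)    # stand-in for the GCN feature extractor
boundary_net = nn.Linear(32, 2)    # stand-in for the start/end detector
classifier = nn.Linear(32, 31)     # stand-in for the 31-word classifier
modules = [feature_net, boundary_net, classifier]

def set_trainable(module, trainable):
    # Fixing parameters = turning off their gradients.
    for p in module.parameters():
        p.requires_grad = trainable

for active in modules:
    # Freeze every module except the one trained in this stage.
    for m in modules:
        set_trainable(m, m is active)
    optimizer = torch.optim.Adam(
        (p for p in active.parameters() if p.requires_grad), lr=1e-3)
    # ... run the usual forward/backward loop for this stage here ...
```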
This work has two main features. First, actions are represented by maps of human skeleton key points, and graph convolution is used to extract features, so that background features unrelated to the action are filtered out. Second, we use small convolutional networks to detect the start and end time points of an action, which requires little computation and allows actions of flexible length to be found. | en_US |