dc.description.abstract | The application of deep learning to image recognition has made tremendous progress over the past decade. Deep learning methods are now routinely used in many areas of social life to analyze visual data automatically and extract the required information, and most of these attempts have achieved very good results. With improvements in computer hardware and the continuous optimization of algorithms, such research has expanded from processing image data to processing video data.
People with hearing impairment and speech disorders often face many inconveniences in social life, especially when communicating with people who do not have such impairments. As deep learning and image recognition have developed, researchers have tried to use computers to help them, so that they can communicate freely in the sign language they use every day with people who do not know it. This thesis combines deep learning and video processing techniques to study sign language motion detection and recognition: it builds a deep learning network that processes sign language video and translates it into text. To achieve this goal, the network we built works in three steps.
The first step is skeleton feature extraction. Human motion images often contain a great deal of information that is unrelated to movement, such as the background, clothing, or hairstyle. To exclude the influence of this irrelevant information on the result, we first use the OpenPose module to extract the human skeleton from each frame of the video, so that only movement-related information is retained. A graph convolutional network (GCN) is then used to extract motion features from the skeleton graph. The second step is to segment suspected motion fragments. After obtaining the motion features of each frame, we use a small convolutional network to find the start and end time points of a motion and, combined with a preliminary judgment of whether a motion occurs between those points, segment candidate motion clips from the whole video. The third step is to classify these candidate motion fragments and to remove fragments that overlap in time by non-maximum suppression, which yields the final motion detection and recognition results.
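To make the two concrete algorithmic pieces of this pipeline easier to follow, the sketch below shows (a) one graph-convolution step over skeleton keypoints and (b) temporal non-maximum suppression over candidate segments. It is only a minimal illustration, not the thesis implementation: the toy 5-joint skeleton, feature sizes, and the IoU threshold are assumptions made for the example.

```python
# Minimal illustrative sketch (not the thesis code): one graph-convolution step
# over skeleton keypoints, and temporal NMS over candidate action segments.
import numpy as np

def gcn_layer(X, A, W):
    """One graph-convolution step: relu(D^{-1/2} (A + I) D^{-1/2} X W).

    X: (num_joints, in_features) per-frame keypoint features (e.g. x, y, confidence)
    A: (num_joints, num_joints) skeleton adjacency (1 where two joints are linked)
    W: (in_features, out_features) learnable weights
    """
    A_hat = A + np.eye(A.shape[0])            # add self-connections
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))  # symmetric normalization
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W, 0.0)

def temporal_nms(segments, iou_threshold=0.5):
    """Keep the highest-scoring segments, dropping those that overlap them in time.

    segments: list of (start_frame, end_frame, score) tuples
    """
    kept = []
    for start, end, score in sorted(segments, key=lambda s: s[2], reverse=True):
        overlaps = False
        for k_start, k_end, _ in kept:
            inter = max(0.0, min(end, k_end) - max(start, k_start))
            union = (end - start) + (k_end - k_start) - inter
            if union > 0 and inter / union > iou_threshold:
                overlaps = True
                break
        if not overlaps:
            kept.append((start, end, score))
    return kept

if __name__ == "__main__":
    # Toy 5-joint "skeleton": head-neck, neck-left hand, neck-right hand, neck-hip
    A = np.zeros((5, 5))
    for i, j in [(0, 1), (1, 2), (1, 3), (1, 4)]:
        A[i, j] = A[j, i] = 1
    X = np.random.rand(5, 3)          # (x, y, confidence) per joint
    W = np.random.rand(3, 8)          # project to 8 motion features
    print(gcn_layer(X, A, W).shape)   # (5, 8)
    print(temporal_nms([(10, 40, 0.9), (15, 45, 0.6), (60, 90, 0.8)]))
```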
In our experiments, we use the CSLR (Chinese Sign Language Recognition) dataset, a collection of 2D RGB sign language videos. Each video is less than 10 seconds long, and the signers face the recording device. We select 15 consecutive sign language sentences, and the 31 words they contain, as classification targets. The three modules of the network were trained by alternately fixing the parameters of some modules while updating the others. After training was completed, we compared our results with other sign language recognition networks evaluated on the same dataset; our sentence accuracy reached 84.5%.
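The abstract states only that the three modules were trained by alternately fixing some parameters; the following is a hypothetical PyTorch-style sketch of such an alternating schedule. The module classes and names are placeholders standing in for the actual feature extractor, boundary detector, and classifier, not the thesis implementation.

```python
# Hypothetical sketch of alternating training: freeze all modules except the
# one currently being updated. Modules here are illustrative placeholders.
import torch
import torch.nn as nn

feature_net = nn.Linear(16, 32)    # stand-in for the GCN feature extractor
boundary_net = nn.Linear(32, 2)    # stand-in for the start/end detector
classifier = nn.Linear(32, 31)     # stand-in for the 31-word classifier
modules = [feature_net, boundary_net, classifier]

def set_trainable(module, trainable):
    # Fixing parameters = turning off their gradients.
    for p in module.parameters():
        p.requires_grad = trainable

for active in modules:
    # Freeze every module except the one trained in this stage.
    for m in modules:
        set_trainable(m, m is active)
    optimizer = torch.optim.Adam(
        (p for p in active.parameters() if p.requires_grad), lr=1e-3)
    # ... run the usual forward/backward loop for this stage here ...
```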
This work has two main features. First, actions are represented by maps of human skeleton key points, and graph convolution is used to extract features, so that background features unrelated to the action are filtered out. Second, we use small convolutional networks to detect the start and end time points of an action, which requires little computation and allows actions of flexible length to be found. | en_US |