Gesture Recognition System for Hearing and Speech Impaired
Hearing individuals often find it difficult to communicate with deaf or hearing-impaired people, because they commonly do not understand the gestures a deaf or hearing-impaired person performs. This project aims to build software that reduces the communication gap between disabled and non-disabled people, allowing them to communicate with each other effectively. With the help of the Kinect, and keeping in mind the restrictions and limitations the Kinect imposes, we were able to build such a system. The project uses the Dynamic Time Warping (DTW) algorithm to compare a performed gesture with the ones stored in our gesture dictionary, and it is able to recognize different gestures with an accuracy of 91%. The gestures used are from Pakistani Sign Language; a word can be stored in different languages if the gesture is common across countries. A gesture is translated into text, and the text is finally converted into spoken language. Accuracy is affected by the distance of the user from the Kinect, and only one user at a time can perform a gesture. The basic goal of the project is to give hearing- or speech-impaired people new hope, in the shape of a user-friendly system that helps them communicate effectively.
The proposed system consists of a series of steps for the detection of a particular gesture. A Microsoft Kinect is used to capture a stream of input frames from the user. This stream is then processed and stored in memory for gesture recognition. The individual steps are (1) receiving the joints of interest from the Kinect video stream, (2) normalizing the skeleton frame data, (3) building a linked list of the normalized data, and (4) storing or detecting the gesture. The process has two modes: recording mode and translation mode. In recording mode the user adds gestures to the dictionary, while in translation mode the user performs a gesture and that gesture is compared against the gestures stored in the dictionary. Most sign language gestures are performed above the body's hip bone and below the head, so to start and end a sign language gesture the user must perform a specific predefined gesture. Our system requires the user to place both hands below the hip bone, with the distance between the hands below 0.5 m, to start or end a sign language gesture. We call this gesture the "Recording Translation Gesture" (RTG). After the RTG is performed, the following sequence of steps follows:
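The RTG check described above can be sketched as a simple per-frame test. The function and joint names below are illustrative; we assume joint positions arrive as (x, y, z) tuples in metres with the y-axis pointing up, as Kinect skeleton data conventionally does:

```python
import math

def is_rtg(hand_left, hand_right, hip_center):
    """Return True when the frame matches the Recording Translation Gesture.

    A frame counts as the RTG when both hands are below the hip joint
    (smaller y in a y-up frame) and the distance between the hands is
    under 0.5 m.
    """
    hands_below_hip = (hand_left[1] < hip_center[1]
                       and hand_right[1] < hip_center[1])
    hand_distance = math.dist(hand_left, hand_right)
    return hands_below_hip and hand_distance < 0.5
```

In practice such a test would be required to hold for several consecutive frames before toggling recording, to avoid triggering on a single noisy skeleton frame.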
1. Joints of Interest
The Kinect is able to detect and track human body joints, and the Microsoft Kinect SDK provides functions to get the Cartesian coordinates of those joints. We use these coordinates to store the movements performed by the user in order to record the performed gesture. The Microsoft Kinect SDK exposes 20 joints of the human body; in our case we use only those joints required for detecting sign language gestures. For the RTG we require the hip joint and the center joint. For sign language gestures, eight joints are required: head (H), spine (S), elbow right (ER), elbow left (EL), hand right (HR), hand left (HL), wrist right (WR) and wrist left (WL). We store and track the coordinates of these joints and then normalize them.
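Filtering the full 20-joint skeleton down to the eight joints of interest can be sketched as below. The joint names are illustrative stand-ins; the real Kinect SDK exposes joints through its own JointType enumeration:

```python
# The eight joints used for sign-language gestures (names are
# illustrative; the Kinect SDK uses its own JointType identifiers).
SIGN_LANGUAGE_JOINTS = {
    "Head", "Spine",
    "ElbowRight", "ElbowLeft",
    "HandRight", "HandLeft",
    "WristRight", "WristLeft",
}

def joints_of_interest(skeleton_frame):
    """Keep only the joints needed for sign-language gestures.

    skeleton_frame is assumed to map joint names to (x, y, z) tuples.
    """
    return {name: pos for name, pos in skeleton_frame.items()
            if name in SIGN_LANGUAGE_JOINTS}
```

Discarding the other twelve joints per frame reduces both the memory held per gesture and the cost of the later frame-by-frame comparison.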
2. Normalization
Every user's height and dimensions are different, and this has a huge impact on the performance of the system, because the X, Y and Z coordinates of each user's joints may differ. Differences can also arise from the user's varying position relative to the Kinect. Ideally a user should stand 6 feet from the Kinect, directly in front of the camera, but this is not always the case: the user can be at any angle and at any distance from the Kinect. Normalizing the data is therefore necessary to increase the accuracy of gesture recognition. The Cartesian coordinates retrieved from the Microsoft Kinect SDK are relative to the Kinect, so we first shift the origin of the coordinates from the Kinect to a point in the human body. The coordinates, captured in the Cartesian coordinate system, are then converted to a three-dimensional representation known as the spherical coordinate system, because coordinates in that form are easier to normalize: regardless of the user's size, only the distance from the origin varies (unlike all three coordinates in the Cartesian system), while the angles remain constant. We then only need to normalize this distance, as explained later. The spherical system consists of three attributes: (1) the distance of the point from the origin, (2) a polar angle, and (3) an azimuth angle.
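The origin shift and the Cartesian-to-spherical conversion can be sketched as follows. The choice of the spine joint as the body-centred origin is an illustrative assumption:

```python
import math

def shift_origin(joint, origin):
    """Express a joint position relative to a body-centred origin
    (e.g. the spine joint) instead of the Kinect sensor."""
    return tuple(j - o for j, o in zip(joint, origin))

def to_spherical(x, y, z):
    """Convert a Cartesian point to spherical coordinates.

    Returns (r, theta, phi): r is the distance from the origin,
    theta the polar angle measured from the positive z-axis, and
    phi the azimuth angle in the x-y plane.
    """
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r) if r > 0 else 0.0
    phi = math.atan2(y, x)
    return r, theta, phi
```

Only r then depends on the user's size and distance from the sensor, so it alone needs scaling, while theta and phi describe the pose independently of body dimensions.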
3. Temporary Storage
Once normalization is complete, the data is stored in memory. A linked list of normalized skeleton frames is maintained until the gesture is completed: the coordinates of the joints are stored in private fields of gesture-class objects, and those objects are linked together. The ending RTG marks the end of a gesture. When the complete system starts, the dictionary is loaded into memory as a two-dimensional linked list of objects: the gestures are linked vertically, while each individual gesture is connected to a list of objects containing the joint values, the spherical coordinates, and their normalized coordinates for each frame. A linked list was chosen to hold the dictionary in memory because of the large size of the text file storing the gestures.
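The two-dimensional linked list described above can be sketched with two node classes. The class and field names are illustrative, not taken from the project's source:

```python
class FrameNode:
    """One normalized skeleton frame in a gesture's frame list.

    joints maps a joint name to its normalized spherical coordinates
    (r, theta, phi).
    """
    def __init__(self, joints):
        self.joints = joints
        self.next = None

class GestureNode:
    """One dictionary entry: a word plus its chain of frames.

    Gestures are linked 'vertically' via next_gesture, while each
    gesture's frames are linked 'horizontally' via FrameNode.next.
    """
    def __init__(self, word):
        self.word = word
        self.first_frame = None
        self.last_frame = None
        self.next_gesture = None

    def append_frame(self, joints):
        node = FrameNode(joints)
        if self.last_frame is None:
            self.first_frame = node
        else:
            self.last_frame.next = node
        self.last_frame = node
```

Appending through a tail pointer keeps frame insertion O(1) during recording, and the vertical chain lets the translator walk the dictionary one gesture at a time without loading anything twice.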
At this point, the system's mode determines what happens next: depending on whether the system is in recording mode or translation mode, the corresponding step executes. In recording mode, once the ending RTG is performed, the normalized skeleton frame linked list is written to the gesture dictionary. The gesture dictionary is a text file in which joint coordinates are stored separated by commas, and each gesture is enclosed in braces to differentiate one gesture from the next.
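A serializer for that text format might look like the sketch below. The exact layout is an assumption: the source only states that coordinates are comma-separated and each gesture is enclosed in braces, so the word-on-the-opening-brace and one-frame-per-line choices here are illustrative:

```python
def serialize_gesture(word, frames):
    """Write one gesture in the dictionary's text format (assumed layout).

    frames is a list of frames; each frame is a list of (r, theta, phi)
    tuples, one per joint of interest, flattened to comma-separated
    values on a single line.
    """
    lines = [f"{{{word}"]
    for frame in frames:
        flat = [f"{value:.4f}" for joint in frame for value in joint]
        lines.append(",".join(flat))
    lines.append("}")
    return "\n".join(lines)
```

The matching loader would simply invert this: split the file on closing braces, read the word from the opening line, and rebuild one linked-list gesture per block.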
If the mode is translation, the DTW algorithm is applied to compare the gesture stored in the linked list with each gesture stored in the gesture dictionary, and the corresponding matching gesture is returned. The dictionary is searched linearly, and each gesture is compared through the DTW algorithm. Dynamic Time Warping is a general algorithm for comparing sequences of different lengths.
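The classic dynamic-programming form of DTW, together with the linear dictionary search, can be sketched as follows. The frame-distance function is left pluggable, since the source does not specify how two skeleton frames are compared:

```python
def dtw_distance(seq_a, seq_b, dist):
    """Classic Dynamic Time Warping over two sequences.

    seq_a and seq_b are lists of frames of possibly different lengths;
    dist(a, b) returns the cost of matching frame a with frame b. The
    return value is the minimum cumulative alignment cost, so a smaller
    result means a closer match.
    """
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # dp[i][j] = cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # insertion
                                  dp[i][j - 1],      # deletion
                                  dp[i - 1][j - 1])  # match
    return dp[n][m]

def best_match(performed, dictionary, dist):
    """Linearly search (word, frames) entries for the closest gesture."""
    return min(dictionary,
               key=lambda entry: dtw_distance(performed, entry[1], dist))
```

Because DTW warps the time axis, a gesture performed slowly still aligns with a quickly recorded dictionary entry, which is why it suits comparisons between gesture recordings of different lengths.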
Z. Halim and G. Abbas, "A Kinect-Based Sign Language Hand Gesture Recognition System for Hearing- and Speech-Impaired: A Pilot Study of Pakistani Sign Language," Assistive Technology, Vol. 27, No. 1, 2015. [ISSN: 1040-0435, Thomson Reuters JCR 2014, Impact Factor 1.300, Taylor & Francis]
H. Saad, M. Parvez, B. Shah, S. Ashraf, Z. Halim and G. Abbas, "Dynamic Time Wrapping based Gesture Recognition," International Conference on Robotics & Emerging Allied Technologies in Engineering (iCREATE), Islamabad, Pakistan, April 22-24, 2014.