scispace - formally typeset
Search or ask a question
Journal ArticleDOI

Robust Part-Based Hand Gesture Recognition Using Kinect Sensor

01 Aug 2013-IEEE Transactions on Multimedia (IEEE)-Vol. 15, Iss: 5, pp 1110-1120
TL;DR: A novel distance metric, Finger-Earth Mover's Distance (FEMD), is proposed, which only matches the finger parts while not the whole hand, it can better distinguish the hand gestures of slight differences.
Abstract: The recently developed depth sensors, e.g., the Kinect sensor, have provided new opportunities for human-computer interaction (HCI). Although great progress has been made by leveraging the Kinect sensor, e.g., in human body tracking, face recognition and human action recognition, robust hand gesture recognition remains an open problem. Compared to the entire human body, the hand is a smaller object with more complex articulations and more easily affected by segmentation errors. It is thus a very challenging problem to recognize hand gestures. This paper focuses on building a robust part-based hand gesture recognition system using Kinect sensor. To handle the noisy hand shapes obtained from the Kinect sensor, we propose a novel distance metric, Finger-Earth Mover's Distance (FEMD), to measure the dissimilarity between hand shapes. As it only matches the finger parts while not the whole hand, it can better distinguish the hand gestures of slight differences. The extensive experiments demonstrate that our hand gesture recognition system is accurate (a 93.2% mean accuracy on a challenging 10-gesture dataset), efficient (average 0.0750 s per frame), robust to hand articulations, distortions and orientation or scale changes, and can work in uncontrolled environments (cluttered backgrounds and lighting conditions). The superiority of our system is further demonstrated in two real-life HCI applications.

Content maybe subject to copyright    Report

Citations
More filters
Journal ArticleDOI
TL;DR: This paper presents a survey of some recent works on hand gesture recognition using 3D depth sensors, and reviews the commercial depth sensors and public data sets that are widely used in this field.
Abstract: Three-dimensional hand gesture recognition has attracted increasing research interests in computer vision, pattern recognition, and human-computer interaction. The emerging depth sensors greatly inspired various hand gesture recognition approaches and applications, which were severely limited in the 2D domain with conventional cameras. This paper presents a survey of some recent works on hand gesture recognition using 3D depth sensors. We first review the commercial depth sensors and public data sets that are widely used in this field. Then, we review the state-of-the-art research for 3D hand gesture recognition in four aspects: 1) 3D hand modeling; 2) static hand gesture recognition; 3) hand trajectory gesture recognition; and 4) continuous hand gesture recognition. While the emphasis is on 3D hand gesture recognition approaches, the related applications and typical systems are also briefly summarized for practitioners.

291 citations


Cites methods from "Robust Part-Based Hand Gesture Reco..."

  • ...After the hand detection and tracking, either stat ic hand gesture recognition [68]–[70] or hybrid hand gesture recognition [71, 72] can be applied....

    [...]

Book ChapterDOI
08 Sep 2018
TL;DR: A weakly-supervised method, adaptating from fully-annotated synthetic dataset toWeakly-labeled real-world dataset with the aid of a depth regularizer, which generates depth maps from predicted 3D pose and serves as weak supervision for3D pose regression.
Abstract: Compared with depth-based 3D hand pose estimation, it is more challenging to infer 3D hand pose from monocular RGB images, due to substantial depth ambiguity and the difficulty of obtaining fully-annotated training data. Different from existing learning-based monocular RGB-input approaches that require accurate 3D annotations for training, we propose to leverage the depth images that can be easily obtained from commodity RGB-D cameras during training, while during testing we take only RGB inputs for 3D joint predictions. In this way, we alleviate the burden of the costly 3D annotations in real-world dataset. Particularly, we propose a weakly-supervised method, adaptating from fully-annotated synthetic dataset to weakly-labeled real-world dataset with the aid of a depth regularizer, which generates depth maps from predicted 3D pose and serves as weak supervision for 3D pose regression. Extensive experiments on benchmark datasets validate the effectiveness of the proposed depth regularizer in both weakly-supervised and fully-supervised settings.

288 citations


Additional excerpts

  • ...Inspired by the great improvement of CNN-based 3D hand pose estimation from depth images[24], deep learning has also been adopted in some recent works on monocular RGB-based applications [46, 18]....

    [...]

Journal ArticleDOI
TL;DR: A novel distance metric, superpixel earth mover's distance (SP-EMD), is proposed to measure the dissimilarity between the hand gestures, which is robust to distortion and articulation, but also invariant to scaling, translation and rotation with proper preprocessing.
Abstract: This paper presents a new superpixel-based hand gesture recognition system based on a novel superpixel earth mover's distance metric, together with Kinect depth camera The depth and skeleton information from Kinect are effectively utilized to produce markerless hand extraction The hand shapes, corresponding textures and depths are represented in the form of superpixels, which effectively retain the overall shapes and color of the gestures to be recognized Based on this representation, a novel distance metric, superpixel earth mover's distance (SP-EMD), is proposed to measure the dissimilarity between the hand gestures This measurement is not only robust to distortion and articulation, but also invariant to scaling, translation and rotation with proper preprocessing The effectiveness of the proposed distance metric and recognition algorithm are illustrated by extensive experiments with our own gesture dataset as well as two other public datasets Simulation results show that the proposed system is able to achieve high mean accuracy and fast recognition speed Its superiority is further demonstrated by comparisons with other conventional techniques and two real-life applications

271 citations


Cites background or methods from "Robust Part-Based Hand Gesture Reco..."

  • ...We now evaluate and compare the proposed hand gesture recognition system with various state-of-the-art recognition algorithms including Shape Context [27], Skeleton Matching [26], FEMD [12], Random Forest (RF) [32], HOG [24] and H3DF [23], using three different real world datasets, namely our joint color-depth hand gesture dataset, NTU hand digit dataset [12] and American Sign Language (ASL) finger spelling dataset [32]....

    [...]

  • ...Instead of representing the hand shape in contour [12], [27] or skeleton [26], we propose to use superpixels to simplify the hand shape but retain as much information as possible....

    [...]

  • ...Comparisons With Other Methods: To further illustrate the advantage of our system, we first compare it with other three state-of-the-art recognition algorithms, Shape Context [27], Skeleton Matching [26] and FEMD [12] on our dataset....

    [...]

  • ...Comparing with previous distance measures such as FEMD, shape context distance and path similarity, the proposed SP-EMD metric achieves better performance for hand gesture recognition....

    [...]

  • ...It is worth noting that FEMD is particularly designed for depth-camera based hand gesture recognition....

    [...]

Proceedings ArticleDOI
27 Jun 2016
TL;DR: This work proposes to first project the query depth image onto three orthogonal planes and utilize these multi-view projections to regress for 2D heat-maps which estimate the joint positions on each plane to produce final 3D hand pose estimation with learned pose priors.
Abstract: Articulated hand pose estimation plays an important role in human-computer interaction. Despite the recent progress, the accuracy of existing methods is still not satisfactory, partially due to the difficulty of embedded highdimensional and non-linear regression problem. Different from the existing discriminative methods that regress for the hand pose with a single depth image, we propose to first project the query depth image onto three orthogonal planes and utilize these multi-view projections to regress for 2D heat-maps which estimate the joint positions on each plane. These multi-view heat-maps are then fused to produce final 3D hand pose estimation with learned pose priors. Experiments show that the proposed method largely outperforms state-of-the-art on a challenging dataset. Moreover, a cross-dataset experiment also demonstrates the good generalization ability of the proposed method.

266 citations


Cites methods from "Robust Part-Based Hand Gesture Reco..."

  • ...Index Terms—3D hand pose estimation, Convolutional neural networks, Multi-view CNNs F...

    [...]

Journal ArticleDOI
TL;DR: A comprehensive survey on Kinect applications, and the latest research and development on motion recognition using data captured by the Kinect sensor, and a classification of motion recognition techniques to highlight the different approaches used in human motion recognition.
Abstract: Microsoft Kinect, a low-cost motion sensing device, enables users to interact with computers or game consoles naturally through gestures and spoken commands without any other peripheral equipment. As such, it has commanded intense interests in research and development on the Kinect technology. In this paper, we present, a comprehensive survey on Kinect applications, and the latest research and development on motion recognition using data captured by the Kinect sensor. On the applications front, we review the applications of the Kinect technology in a variety of areas, including healthcare, education and performing arts, robotics, sign language recognition, retail services, workplace safety training, as well as 3D reconstructions. On the technology front, we provide an overview of the main features of both versions of the Kinect sensor together with the depth sensing technologies used, and review literatures on human motion recognition techniques used in Kinect applications. We provide a classification of motion recognition techniques to highlight the different approaches used in human motion recognition. Furthermore, we compile a list of publicly available Kinect datasets. These datasets are valuable resources for researchers to investigate better methods for human motion recognition and lower-level computer vision tasks such as segmentation, object detection and human pose estimation.

261 citations

References
More filters
Journal ArticleDOI
TL;DR: This paper presents work on computing shape models that are computationally fast and invariant basic transformations like translation, scaling and rotation, and proposes shape detection using a feature called shape context, which is descriptive of the shape of the object.
Abstract: We present a novel approach to measuring similarity between shapes and exploit it for object recognition. In our framework, the measurement of similarity is preceded by: (1) solving for correspondences between points on the two shapes; (2) using the correspondences to estimate an aligning transform. In order to solve the correspondence problem, we attach a descriptor, the shape context, to each point. The shape context at a reference point captures the distribution of the remaining points relative to it, thus offering a globally discriminative characterization. Corresponding points on two similar shapes will have similar shape contexts, enabling us to solve for correspondences as an optimal assignment problem. Given the point correspondences, we estimate the transformation that best aligns the two shapes; regularized thin-plate splines provide a flexible class of transformation maps for this purpose. The dissimilarity between the two shapes is computed as a sum of matching errors between corresponding points, together with a term measuring the magnitude of the aligning transform. We treat recognition in a nearest-neighbor classification framework as the problem of finding the stored prototype shape that is maximally similar to that in the image. Results are presented for silhouettes, trademarks, handwritten digits, and the COIL data set.

6,693 citations


"Robust Part-Based Hand Gesture Reco..." refers background or methods in this paper

  • ...Unfortunately, classic shape recognition methods, such as correspondence-based shape matching algorithms [13], [14] and skeleton matching methods [3], [15], cannot robustly recognize shape contour with severe distortions....

    [...]

  • ...as shape contexts [13] and inner-distance [14], they are not...

    [...]

  • ...The confusion matrix of hand gesture recognition using Shape Context [13]....

    [...]

  • ...We compare our algorithm with shape contexts [13], and skeleton path similarity [3] in Section IV-B4 and show our superiority in hand gesture recognition....

    [...]

  • ...We compare it with the traditional correspondence-based matching algorithm, Shape Context [13] and the skeleton-based matching algorithm, Path Similarity [3]....

    [...]

BookDOI
01 Jan 2001
TL;DR: This book presents the first comprehensive treatment of Monte Carlo techniques, including convergence results and applications to tracking, guidance, automated target recognition, aircraft navigation, robot navigation, econometrics, financial modeling, neural networks, optimal control, optimal filtering, communications, reinforcement learning, signal enhancement, model averaging and selection.
Abstract: Monte Carlo methods are revolutionizing the on-line analysis of data in fields as diverse as financial modeling, target tracking and computer vision. These methods, appearing under the names of bootstrap filters, condensation, optimal Monte Carlo filters, particle filters and survival of the fittest, have made it possible to solve numerically many complex, non-standard problems that were previously intractable. This book presents the first comprehensive treatment of these techniques, including convergence results and applications to tracking, guidance, automated target recognition, aircraft navigation, robot navigation, econometrics, financial modeling, neural networks, optimal control, optimal filtering, communications, reinforcement learning, signal enhancement, model averaging and selection, computer vision, semiconductor design, population biology, dynamic Bayesian networks, and time series analysis. This will be of great value to students, researchers and practitioners, who have some basic knowledge of probability. Arnaud Doucet received the Ph. D. degree from the University of Paris-XI Orsay in 1997. From 1998 to 2000, he conducted research at the Signal Processing Group of Cambridge University, UK. He is currently an assistant professor at the Department of Electrical Engineering of Melbourne University, Australia. His research interests include Bayesian statistics, dynamic models and Monte Carlo methods. Nando de Freitas obtained a Ph.D. degree in information engineering from Cambridge University in 1999. He is presently a research associate with the artificial intelligence group of the University of California at Berkeley. His main research interests are in Bayesian statistics and the application of on-line and batch Monte Carlo methods to machine learning. Neil Gordon obtained a Ph.D. in Statistics from Imperial College, University of London in 1993. He is with the Pattern and Information Processing group at the Defence Evaluation and Research Agency in the United Kingdom. His research interests are in time series, statistical data analysis, and pattern recognition with a particular emphasis on target tracking and missile guidance.

6,574 citations

Journal ArticleDOI
TL;DR: This paper investigates the properties of a metric between two distributions, the Earth Mover's Distance (EMD), for content-based image retrieval, and compares the retrieval performance of the EMD with that of other distances.
Abstract: We investigate the properties of a metric between two distributions, the Earth Mover's Distance (EMD), for content-based image retrieval. The EMD is based on the minimal cost that must be paid to transform one distribution into the other, in a precise sense, and was first proposed for certain vision problems by Peleg, Werman, and Rom. For image retrieval, we combine this idea with a representation scheme for distributions that is based on vector quantization. This combination leads to an image comparison framework that often accounts for perceptual similarity better than other previously proposed methods. The EMD is based on a solution to the transportation problem from linear optimization, for which efficient algorithms are available, and also allows naturally for partial matching. It is more robust than histogram matching techniques, in that it can operate on variable-length representations of the distributions that avoid quantization and other binning problems typical of histograms. When used to compare distributions with the same overall mass, the EMD is a true metric. In this paper we focus on applications to color and texture, and we compare the retrieval performance of the EMD with that of other distances.

4,593 citations


"Robust Part-Based Hand Gesture Reco..." refers background in this paper

  • ...2) Finger-Earth Mover’s Distance: In [33], Rubner et al....

    [...]

  • ...When , FEMD becomes the EMD metric [33]....

    [...]

Proceedings ArticleDOI
20 Jun 2011
TL;DR: This work takes an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem, and generates confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes.
Abstract: We propose a new method to quickly and accurately predict 3D positions of body joints from a single depth image, using no temporal information. We take an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Our large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, etc. Finally we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes. The system runs at 200 frames per second on consumer hardware. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state of the art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.

3,579 citations

Journal ArticleDOI
TL;DR: Almost 300 key theoretical and empirical contributions in the current decade related to image retrieval and automatic image annotation are surveyed, and the spawning of related subfields are discussed, to discuss the adaptation of existing image retrieval techniques to build systems that can be useful in the real world.
Abstract: We have witnessed great interest and a wealth of promise in content-based image retrieval as an emerging technology. While the last decade laid foundation to such promise, it also paved the way for a large number of new techniques and systems, got many new people involved, and triggered stronger association of weakly related fields. In this article, we survey almost 300 key theoretical and empirical contributions in the current decade related to image retrieval and automatic image annotation, and in the process discuss the spawning of related subfields. We also discuss significant challenges involved in the adaptation of existing image retrieval techniques to build systems that can be useful in the real world. In retrospect of what has been achieved so far, we also conjecture what the future may hold for image retrieval research.

3,433 citations


"Robust Part-Based Hand Gesture Reco..." refers background in this paper

  • ...EMD is widely used in many problems such as content-based image retrieval and pattern recognition [34], [35]....

    [...]