Proceedings ArticleDOI

Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests

01 Dec 2013-pp 3224-3231
TL;DR: The Semi-supervised Transductive Regression (STR) forest is proposed, which learns the relationship between a small, sparsely labelled realistic dataset and a large synthetic dataset, together with a novel data-driven, pseudo-kinematic technique to refine noisy or occluded joints.
Abstract: This paper presents the first semi-supervised transductive algorithm for real-time articulated hand pose estimation. Noisy data and occlusions are the major challenges of articulated hand pose estimation. In addition, the discrepancies between realistic and synthetic pose data undermine the performance of existing approaches that use synthetic data extensively in training. We therefore propose the Semi-supervised Transductive Regression (STR) forest which learns the relationship between a small, sparsely labelled realistic dataset and a large synthetic dataset. We also design a novel data-driven, pseudo-kinematic technique to refine noisy or occluded joints. Our contributions include: (i) capturing the benefits of both realistic and synthetic data via transductive learning, (ii) showing that accuracy can be improved by considering unlabelled data, and (iii) introducing a pseudo-kinematic technique to refine articulations efficiently. Experimental results show not only the promising performance of our method with respect to noise and occlusions, but also its superiority over state-of-the-art methods in accuracy, robustness and speed.


Citations
Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations

Proceedings ArticleDOI
23 Jun 2014
TL;DR: A hybrid method that combines gradient-based and stochastic optimization is proposed to achieve fast convergence and good accuracy; to the authors' knowledge, the resulting system is the first to achieve such robustness, accuracy, and speed simultaneously.
Abstract: We present a realtime hand tracking system using a depth sensor. It tracks a fully articulated hand under large viewpoints in realtime (25 FPS on a desktop without using a GPU) and with high accuracy (error below 10 mm). To our knowledge, it is the first system that achieves such robustness, accuracy, and speed simultaneously, as verified on challenging real data. Our system is made of several novel techniques. We model a hand simply using a number of spheres and define a fast cost function. Those are critical for realtime performance. We propose a hybrid method that combines gradient based and stochastic optimization methods to achieve fast convergence and good accuracy. We present new finger detection and hand initialization methods that greatly enhance the robustness of tracking.

517 citations


Cites background from "Real-Time Articulated Hand Pose Est..."

  • ...Other realtime and robust systems are limited in recognizing discrete hand gestures only [31, 5, 6, 29] without optimization, supporting a small number of DOFs [22], or under a fixed viewpoint [12]....

    [...]

Proceedings ArticleDOI
18 Apr 2015
TL;DR: A new real-time hand tracking system based on a single depth camera that can accurately reconstruct complex hand poses across a variety of subjects and is highly flexible, dramatically improving upon previous approaches which have focused on front-facing close-range scenarios.
Abstract: We present a new real-time hand tracking system based on a single depth camera. The system can accurately reconstruct complex hand poses across a variety of subjects. It also allows for robust tracking, rapidly recovering from any temporary failures. Most uniquely, our tracker is highly flexible, dramatically improving upon previous approaches which have focused on front-facing close-range scenarios. This flexibility opens up new possibilities for human-computer interaction with examples including tracking at distances from tens of centimeters through to several meters (for controlling the TV at a distance), supporting tracking using a moving depth camera (for mobile scenarios), and arbitrary camera placements (for VR headsets). These features are achieved through a new pipeline that combines a multi-layered discriminative reinitialization strategy for per-frame pose estimation, followed by a generative model-fitting stage. We provide extensive technical details and a detailed qualitative and quantitative analysis.

466 citations


Cites background from "Real-Time Articulated Hand Pose Est..."

  • ...[27, 26] extend this work demonstrating more complex poses at 25Hz....

    [...]

Proceedings ArticleDOI
23 Jun 2014
TL;DR: The Latent Regression Forest is presented, a novel framework for real-time, 3D hand pose estimation from a single depth image and shows that the LRF out-performs state-of-the-art methods in both accuracy and efficiency.
Abstract: In this paper we present the Latent Regression Forest (LRF), a novel framework for real-time, 3D hand pose estimation from a single depth image. In contrast to prior forest-based methods, which take dense pixels as input, classify them independently and then estimate joint positions afterwards, our method can be considered as a structured coarse-to-fine search, starting from the centre of mass of a point cloud until locating all the skeletal joints. The searching process is guided by a learnt Latent Tree Model which reflects the hierarchical topology of the hand. Our main contributions can be summarised as follows: (i) Learning the topology of the hand in an unsupervised, data-driven manner. (ii) A new forest-based, discriminative framework for structured search in images, as well as an error regression step to avoid error accumulation. (iii) A new multi-view hand pose dataset containing 180K annotated images from 10 different subjects. Our experiments show that the LRF out-performs state-of-the-art methods in both accuracy and efficiency.

424 citations


Cites background or methods from "Real-Time Articulated Hand Pose Est..."

  • ...3) A new multi-view hand pose dataset: We present a new hand pose dataset containing 180K fully 3D annotated depth images from 10 different subjects....

    [...]

  • ...Without such procedures, highly unlikely or even impossible poses can be produced as output....

    [...]

Proceedings ArticleDOI
07 Jun 2015
TL;DR: 3D pose-indexed features that generalize the previous 2D parameterized features and achieve better invariance to 3D transformations and a principled hierarchical regression that is adapted to the articulated object structure are introduced.
Abstract: We extend the previous 2D cascaded object pose regression work [9] in two aspects so that it works better for 3D articulated objects. Our first contribution is 3D pose-indexed features that generalize the previous 2D parameterized features and achieve better invariance to 3D transformations. Our second contribution is a principled hierarchical regression that is adapted to the articulated object structure. It is therefore more accurate and faster. Comprehensive experiments verify the state-of-the-art accuracy and efficiency of the proposed approach on the challenging 3D hand pose estimation problem, on a public dataset and our new dataset.

422 citations


Cites background or methods from "Real-Time Articulated Hand Pose Est..."

  • ...6 FPS in [6], 12 FPS in [39], 25 FPS in [10], 62....

    [...]

  • ...Previous techniques (pre-clustering of hand pose in [6] and using an augmented cost function with a viewpoint classification term in [10]) are simple and can only perform coarse viewpoint estimation....

    [...]

  • ...holistic regression Many methods [6, 39, 10, 22] estimate hand joints individually by following the per-pixel classification approaches for human body pose recognition [18, 29]....

    [...]

  • ...This framework has been applied to facial landmark localization [8], human body pose estimation [33] and hand pose estimation [6, 10]....

    [...]

  • ..., the center of the depth patch [34] or any pixel under consideration [6, 10, 22], and z(u) is its depth....

    [...]

References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the splitting, and the ideas are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, aaa, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

79,257 citations


"Real-Time Articulated Hand Pose Est..." refers methods in this paper

  • ...Viewpoint classification term Qa: Traditional information gain is used to evaluate the classification performance of all the viewpoint labels a in dataset L [4]....

    [...]
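
The "traditional information gain" mentioned in the snippet above can be illustrated with a small sketch. This is the generic textbook definition (parent entropy minus the size-weighted entropy of the two children), not the exact Qa term of the paper, whose full formula is not reproduced on this page. The label names and the candidate splits are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical viewpoint labels reaching a tree node (illustrative only).
labels = ["front", "front", "side", "side"]

# A split that separates the two viewpoints perfectly.
gain_perfect = information_gain(labels, ["front", "front"], ["side", "side"])

# A split that leaves both children as mixed as the parent.
gain_useless = information_gain(labels, ["front", "side"], ["front", "side"])
```

A split candidate with a higher gain separates the viewpoint labels better; a forest would typically evaluate many random candidates per node and keep the best one.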

Journal ArticleDOI
TL;DR: The relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift are discussed.
Abstract: A major assumption in many machine learning and data mining algorithms is that the training and future data must be in the same feature space and have the same distribution. However, in many real-world applications, this assumption may not hold. For example, we sometimes have a classification task in one domain of interest, but we only have sufficient training data in another domain of interest, where the latter data may be in a different feature space or follow a different data distribution. In such cases, knowledge transfer, if done successfully, would greatly improve the performance of learning by avoiding much expensive data-labeling efforts. In recent years, transfer learning has emerged as a new learning framework to address this problem. This survey focuses on categorizing and reviewing the current progress on transfer learning for classification, regression, and clustering problems. In this survey, we discuss the relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift. We also explore some potential future issues in transfer learning research.

18,616 citations


"Real-Time Articulated Hand Pose Est..." refers background in this paper

  • ...It has seen various successful applications [21], still it has not been applied in articulated pose estimation....

    [...]

  • ...This process is known as transductive transfer learning [21]: A transductive model learns from a source domain, e....

    [...]

Proceedings ArticleDOI
20 Jun 2011
TL;DR: This work takes an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem, and generates confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes.
Abstract: We propose a new method to quickly and accurately predict 3D positions of body joints from a single depth image, using no temporal information. We take an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Our large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, etc. Finally we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes. The system runs at 200 frames per second on consumer hardware. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state of the art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.

3,579 citations

Journal ArticleDOI
TL;DR: This work takes an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem, and generates confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes.
Abstract: We propose a new method to quickly and accurately predict human pose (the 3D positions of body joints) from a single depth image, without depending on information from preceding frames. Our approach is strongly rooted in current object recognition strategies. By designing an intermediate representation in terms of body parts, the difficult pose estimation problem is transformed into a simpler per-pixel classification problem, for which efficient machine learning techniques exist. By using computer graphics to synthesize a very large dataset of training image pairs, one can train a classifier that estimates body part labels from test images invariant to pose, body shape, clothing, and other irrelevances. Finally, we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes. The system runs in under 5ms on the Xbox 360. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state-of-the-art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.

3,034 citations


"Real-Time Articulated Hand Pose Est..." refers background or methods in this paper

  • ...Discriminative approaches learn a mapping from visual features to the target parameter space, such as joint labels [24] or joint coordinates [12]....

    [...]

  • ...While latest depth sensor technology has enabled body pose estimation in real-time [2, 24, 12, 26], hand pose estimation still requires improvement....

    [...]

  • ...Performances of algorithms are measured by their pixel-wise classification accuracy per joint, similar to [24], hence only Qp, Qv, Qt and Qu were utilised in this experiment....

    [...]

  • ...Although discriminative methods have proved successful in real-time body pose estimation from depth sensors [24, 12, 2, 26], they are less common than model-based approaches with respect to hand pose estimation....

    [...]

  • ...The size of a patch is 64 × 64 which is comparable to the patches in [24]....

    [...]

Proceedings ArticleDOI
01 Jan 2011
TL;DR: A novel solution to the problem of recovering and tracking the 3D position, orientation and full articulation of a human hand from markerless visual observations obtained by a Kinect sensor is presented.
Abstract: We present a novel solution to the problem of recovering and tracking the 3D position, orientation and full articulation of a human hand from markerless visual observations obtained by a Kinect sensor. We treat this as an optimization problem, seeking for the hand model parameters that minimize the discrepancy between the appearance and 3D structure of hypothesized instances of a hand model and actual hand observations. This optimization problem is effectively solved using a variant of Particle Swarm Optimization (PSO). The proposed method does not require special markers and/or a complex image acquisition setup. Being model based, it provides continuous solutions to the problem of tracking hand articulations. Extensive experiments with a prototype GPU-based implementation of the proposed method demonstrate that accurate and robust 3D tracking of hand articulations can be achieved in near real-time (15Hz).
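
To make the optimization idea in the abstract above concrete, here is a minimal sketch of a generic Particle Swarm Optimization loop. It is a textbook PSO, not the variant used in that work, and the cost function is a toy stand-in: a real tracker would measure the discrepancy between a rendered hand model and the observed depth data. All names, parameter values, and the three-dimensional "pose" are assumptions made for illustration.

```python
import random


def pso_minimize(cost, dim, bounds, n_particles=30, n_iters=200,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Generic PSO: returns (best_position, best_cost) found by the swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    p_best = [p[:] for p in pos]             # best position seen by each particle
    p_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=lambda i: p_cost[i])
    g_best, g_cost = p_best[g][:], p_cost[g]  # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Blend momentum, pull towards own best, pull towards swarm best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (p_best[i][d] - pos[i][d])
                             + c_global * r2 * (g_best[d] - pos[i][d]))
                # Move and clamp to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            c = cost(pos[i])
            if c < p_cost[i]:
                p_best[i], p_cost[i] = pos[i][:], c
                if c < g_cost:
                    g_best, g_cost = pos[i][:], c
    return g_best, g_cost


# Hypothetical target "pose" parameters; stands in for the observed hand.
TARGET = [0.3, -0.2, 0.5]


def toy_discrepancy(params):
    """Toy cost: squared distance between hypothesised and target parameters."""
    return sum((p - t) ** 2 for p, t in zip(params, TARGET))


best_params, final_cost = pso_minimize(toy_discrepancy, dim=3, bounds=(-1.0, 1.0))
```

Because PSO only needs cost evaluations and no gradients, it suits discrepancy measures computed by rendering, at the price of many evaluations per frame.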

1,009 citations


"Real-Time Articulated Hand Pose Est..." refers background or methods in this paper

  • ...Existing state-of-the-arts resort to synthetic data [16], or model-based optimisation [8, 15]....

    [...]

  • ...Kinematics: Inverse kinematics is a standard technique in model-based and tracking approaches for both body [28, 22] and hand pose estimation [8, 15, 25]....

    [...]