Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests

doi:10.1109/ICCV.2013.400

Proceedings Article•DOI•

Danhang Tang¹, Tsz-Ho Yu², Tae-Kyun Kim¹

Imperial College London¹, University of Cambridge²

01 Dec 2013, pp. 3224-3231

TL;DR: The Semi-supervised Transductive Regression (STR) forest is proposed, which learns the relationship between a small, sparsely labelled realistic dataset and a large synthetic dataset, together with a novel data-driven, pseudo-kinematic technique to refine noisy or occluded joints.

Abstract: This paper presents the first semi-supervised transductive algorithm for real-time articulated hand pose estimation. Noisy data and occlusions are the major challenges of articulated hand pose estimation. In addition, the discrepancies between realistic and synthetic pose data undermine the performance of existing approaches that use synthetic data extensively in training. We therefore propose the Semi-supervised Transductive Regression (STR) forest, which learns the relationship between a small, sparsely labelled realistic dataset and a large synthetic dataset. We also design a novel data-driven, pseudo-kinematic technique to refine noisy or occluded joints. Our contributions include: (i) capturing the benefits of both realistic and synthetic data via transductive learning, (ii) showing that accuracy can be improved by considering unlabelled data, and (iii) introducing a pseudo-kinematic technique to refine articulations efficiently. Experimental results show not only the promising performance of our method with respect to noise and occlusions, but also its superiority over state-of-the-art methods in accuracy, robustness and speed.

Citations

Proceedings Article•DOI•

Going deeper with convolutions

Christian Szegedy¹, Wei Liu², Yangqing Jia¹, Pierre Sermanet¹, Scott Reed³, Dragomir Anguelov¹, Dumitru Erhan¹, Vincent Vanhoucke¹, Andrew Rabinovich

Google¹, University of North Carolina at Chapel Hill², University of Michigan³

07 Jun 2015

TL;DR: Inception, proposed in this paper, is a deep convolutional neural network architecture that achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations

Proceedings Article•DOI•

Realtime and Robust Hand Tracking from Depth

Chen Qian¹, Xiao Sun², Yichen Wei², Xiaoou Tang¹, Jian Sun²

The Chinese University of Hong Kong¹, Microsoft²

23 Jun 2014

TL;DR: A realtime hand tracking system using a depth sensor is presented. It uses a hybrid method that combines gradient-based and stochastic optimization to achieve fast convergence and good accuracy and is, to the authors' knowledge, the first system that achieves such robustness, accuracy, and speed simultaneously.

Abstract: We present a realtime hand tracking system using a depth sensor. It tracks a fully articulated hand under large viewpoints in realtime (25 FPS on a desktop without using a GPU) and with high accuracy (error below 10 mm). To our knowledge, it is the first system that achieves such robustness, accuracy, and speed simultaneously, as verified on challenging real data. Our system is made of several novel techniques. We model a hand simply using a number of spheres and define a fast cost function. Those are critical for realtime performance. We propose a hybrid method that combines gradient based and stochastic optimization methods to achieve fast convergence and good accuracy. We present new finger detection and hand initialization methods that greatly enhance the robustness of tracking.

517 citations

Cites background from "Real-Time Articulated Hand Pose Est..."

...Other realtime and robust systems are limited in recognizing discrete hand gestures only [31, 5, 6, 29] without optimization, supporting a small number of DOFs [22], or under a fixed viewpoint [12]....
[...]

Proceedings Article•DOI•

Accurate, Robust, and Flexible Real-time Hand Tracking

Toby Sharp¹, Cem Keskin¹, Duncan Robertson¹, Jonathan Taylor¹, Jamie Shotton¹, David Kim¹, Christoph Rhemann¹, Ido Leichter¹, Alon Vinnikov¹, Yichen Wei¹, Daniel Freedman¹, Pushmeet Kohli¹, Eyal Krupka¹, Andrew Fitzgibbon¹, Shahram Izadi¹

Microsoft¹

18 Apr 2015

TL;DR: A new real-time hand tracking system based on a single depth camera that can accurately reconstruct complex hand poses across a variety of subjects and is highly flexible, dramatically improving upon previous approaches which have focused on front-facing close-range scenarios.

Abstract: We present a new real-time hand tracking system based on a single depth camera. The system can accurately reconstruct complex hand poses across a variety of subjects. It also allows for robust tracking, rapidly recovering from any temporary failures. Most uniquely, our tracker is highly flexible, dramatically improving upon previous approaches which have focused on front-facing close-range scenarios. This flexibility opens up new possibilities for human-computer interaction with examples including tracking at distances from tens of centimeters through to several meters (for controlling the TV at a distance), supporting tracking using a moving depth camera (for mobile scenarios), and arbitrary camera placements (for VR headsets). These features are achieved through a new pipeline that combines a multi-layered discriminative reinitialization strategy for per-frame pose estimation, followed by a generative model-fitting stage. We provide extensive technical details and a detailed qualitative and quantitative analysis.

466 citations

Cites background from "Real-Time Articulated Hand Pose Est..."

...[27, 26] extend this work demonstrating more complex poses at 25Hz....
[...]

Proceedings Article•DOI•

Latent Regression Forest: Structured Estimation of 3D Articulated Hand Posture

Danhang Tang¹, Hyung Jin Chang¹, Alykhan Tejani¹, Tae-Kyun Kim¹

Imperial College London¹

23 Jun 2014

TL;DR: The Latent Regression Forest (LRF), a novel framework for real-time 3D hand pose estimation from a single depth image, is presented; experiments show that the LRF outperforms state-of-the-art methods in both accuracy and efficiency.

Abstract: In this paper we present the Latent Regression Forest (LRF), a novel framework for real-time, 3D hand pose estimation from a single depth image. In contrast to prior forest-based methods, which take dense pixels as input, classify them independently and then estimate joint positions afterwards, our method can be considered as a structured coarse-to-fine search, starting from the centre of mass of a point cloud until locating all the skeletal joints. The searching process is guided by a learnt Latent Tree Model which reflects the hierarchical topology of the hand. Our main contributions can be summarised as follows: (i) Learning the topology of the hand in an unsupervised, data-driven manner. (ii) A new forest-based, discriminative framework for structured search in images, as well as an error regression step to avoid error accumulation. (iii) A new multi-view hand pose dataset containing 180K annotated images from 10 different subjects. Our experiments show that the LRF out-performs state-of-the-art methods in both accuracy and efficiency.

424 citations

Cites background or methods from "Real-Time Articulated Hand Pose Est..."

...3) A new multi-view hand pose dataset: We present a new hand pose dataset containing 180K fully 3D annotated depth images from 10 different subjects....
[...]
...Without such procedures, highly unlikely or even impossible poses can be produced as output....
[...]

Proceedings Article•DOI•

Cascaded hand pose regression

Xiao Sun¹, Yichen Wei², Shuang Liang³, Xiaoou Tang¹, Jian Sun²

The Chinese University of Hong Kong¹, Microsoft², Tongji University³

07 Jun 2015

TL;DR: 3D pose-indexed features that generalize the previous 2D parameterized features and achieve better invariance to 3D transformations and a principled hierarchical regression that is adapted to the articulated object structure are introduced.

Abstract: We extend the previous 2D cascaded object pose regression work [9] in two aspects so that it works better for 3D articulated objects. Our first contribution is 3D pose-indexed features that generalize the previous 2D parameterized features and achieve better invariance to 3D transformations. Our second contribution is a principled hierarchical regression that is adapted to the articulated object structure. It is therefore more accurate and faster. Comprehensive experiments verify the state-of-the-art accuracy and efficiency of the proposed approach on the challenging 3D hand pose estimation problem, on a public dataset and our new dataset.

422 citations

Cites background or methods from "Real-Time Articulated Hand Pose Est..."

...6 FPS in [6], 12 FPS in [39], 25 FPS in [10], 62....
[...]
...Previous techniques (pre-clustering of hand pose in [6] and using an augmented cost function with a viewpoint classification term in [10]) are simple and can only perform coarse viewpoint estimation....
[...]
...holistic regression Many methods [6, 39, 10, 22] estimate hand joints individually by following the per-pixel classification approaches for human body pose recognition [18, 29]....
[...]
...This framework has been applied to facial landmark localization [8], human body pose estimation [33] and hand pose estimation [6, 10]....
[...]
..., the center of the depth patch [34] or any pixel under consideration [6, 10, 22], and z(u) is its depth....
[...]

References

Journal Article•DOI•

Random Forests

Leo Breiman¹

University of California, Berkeley¹

01 Oct 2001

TL;DR: Internal estimates monitor error, strength, and correlation, and these are used to show the response to increasing the number of features used in the splitting; the ideas are also applicable to regression.

Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, aaa, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

79,257 citations

"Real-Time Articulated Hand Pose Est..." refers methods in this paper

...Viewpoint classification term Qa: Traditional information gain is used to evaluate the classification performance of all the viewpoint labels a in dataset L [4]....
[...]
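
The excerpt above mentions "traditional information gain" as the viewpoint classification term. As a rough illustration only, the sketch below computes the information gain of a binary split over discrete viewpoint labels. The function names, the toy labels, and the boolean split masks are assumptions made for this example; they are not taken from the paper, and the paper's full split criterion combines further terms that are not modelled here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, go_left):
    """Entropy reduction obtained by splitting `labels` with the mask `go_left`."""
    left = [y for y, g in zip(labels, go_left) if g]
    right = [y for y, g in zip(labels, go_left) if not g]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


# Hypothetical viewpoint labels for four training samples.
viewpoints = ["front", "front", "side", "side"]

# A split that separates the two viewpoints, and one that mixes them.
gain_perfect = information_gain(viewpoints, [True, True, False, False])
gain_useless = information_gain(viewpoints, [True, False, True, False])

print(gain_perfect, gain_useless)
```

A split that separates the viewpoints cleanly recovers the full entropy of the parent node, while a split that leaves both children as mixed as the parent gains nothing, which is why a forest would prefer the former when ranking candidate splits.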

Journal Article•DOI•

A Survey on Transfer Learning

Sinno Jialin Pan¹, Qiang Yang¹

Hong Kong University of Science and Technology¹

01 Oct 2010, IEEE Transactions on Knowledge and Data Engineering

TL;DR: The relationship between transfer learning and other related machine learning techniques, such as domain adaptation, multitask learning, sample selection bias, and covariate shift, is discussed.

Abstract: A major assumption in many machine learning and data mining algorithms is that the training and future data must be in the same feature space and have the same distribution. However, in many real-world applications, this assumption may not hold. For example, we sometimes have a classification task in one domain of interest, but we only have sufficient training data in another domain of interest, where the latter data may be in a different feature space or follow a different data distribution. In such cases, knowledge transfer, if done successfully, would greatly improve the performance of learning by avoiding much expensive data-labeling efforts. In recent years, transfer learning has emerged as a new learning framework to address this problem. This survey focuses on categorizing and reviewing the current progress on transfer learning for classification, regression, and clustering problems. In this survey, we discuss the relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift. We also explore some potential future issues in transfer learning research.

18,616 citations

"Real-Time Articulated Hand Pose Est..." refers background in this paper

...It has seen various successful applications [21], still it has not been applied in articulated pose estimation....
[...]
...This process is known as transductive transfer learning [21]: A transductive model learns from a source domain, e....
[...]

Proceedings Article•DOI•

Real-time human pose recognition in parts from single depth images

Jamie Shotton¹, Andrew Fitzgibbon¹, Mat Cook¹, Toby Sharp¹, Mark J. Finocchio¹, Richard E. Moore¹, Alex Aben-Athar Kipman¹, Andrew Blake¹

Microsoft¹

20 Jun 2011

TL;DR: This work takes an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem, and generates confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes.

Abstract: We propose a new method to quickly and accurately predict 3D positions of body joints from a single depth image, using no temporal information. We take an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Our large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, etc. Finally we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes. The system runs at 200 frames per second on consumer hardware. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state of the art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.

3,579 citations

Journal Article•DOI•

Real-time human pose recognition in parts from single depth images

Jamie Shotton¹, Toby Sharp¹, Alex Aben-Athar Kipman, Andrew Fitzgibbon¹, Mark J. Finocchio, Andrew Blake¹, Mat Cook¹, Richard Moore²

Microsoft¹, Ericsson²

01 Jan 2013, Communications of the ACM

Abstract: We propose a new method to quickly and accurately predict human pose (the 3D positions of body joints) from a single depth image, without depending on information from preceding frames. Our approach is strongly rooted in current object recognition strategies. By designing an intermediate representation in terms of body parts, the difficult pose estimation problem is transformed into a simpler per-pixel classification problem, for which efficient machine learning techniques exist. By using computer graphics to synthesize a very large dataset of training image pairs, one can train a classifier that estimates body part labels from test images invariant to pose, body shape, clothing, and other irrelevances. Finally, we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes. The system runs in under 5 ms on the Xbox 360. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state-of-the-art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.

3,034 citations

"Real-Time Articulated Hand Pose Est..." refers background or methods in this paper

...Discriminative approaches learn a mapping from visual features to the target parameter space, such as joint labels [24] or joint coordinates [12]....
[...]
...While latest depth sensor technology has enabled body pose estimation in real-time [2, 24, 12, 26], hand pose estimation still requires improvement....
[...]
...Performances of algorithms are measured by their pixel-wise classification accuracy per joint, similar to [24], hence only Qp, Qv, Qt and Qu were utilised in this experiment....
[...]
...Although discriminative methods have proved successful in real-time body pose estimation from depth sensors [24, 12, 2, 26], they are less common than model-based approaches with respect to hand pose estimation....
[...]
...The size of a patch is 64×64 which is comparable to the patches in [24]....
[...]

Proceedings Article•DOI•

Efficient Model-based 3D Tracking of Hand Articulations using Kinect

Iason Oikonomidis¹, Nikolaos Kyriazis², Antonis A. Argyros²

University of Crete¹, Foundation for Research & Technology – Hellas²

01 Jan 2011

TL;DR: A novel solution to the problem of recovering and tracking the 3D position, orientation and full articulation of a human hand from markerless visual observations obtained by a Kinect sensor is presented.

Abstract: We present a novel solution to the problem of recovering and tracking the 3D position, orientation and full articulation of a human hand from markerless visual observations obtained by a Kinect sensor. We treat this as an optimization problem, seeking the hand model parameters that minimize the discrepancy between the appearance and 3D structure of hypothesized instances of a hand model and actual hand observations. This optimization problem is effectively solved using a variant of Particle Swarm Optimization (PSO). The proposed method does not require special markers and/or a complex image acquisition setup. Being model based, it provides continuous solutions to the problem of tracking hand articulations. Extensive experiments with a prototype GPU-based implementation of the proposed method demonstrate that accurate and robust 3D tracking of hand articulations can be achieved in near real-time (15Hz).
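
The abstract above frames hand tracking as minimising a discrepancy over hand model parameters with a variant of PSO. The sketch below is a textbook global-best PSO, not the variant used by the authors, and it minimises a toy quadratic stand-in rather than a real appearance and depth discrepancy. The names `pso_minimize` and `toy_discrepancy`, the inertia and acceleration coefficients, the bounds, and the three-parameter target are all assumptions chosen for illustration.

```python
import random


def pso_minimize(cost, dim, n_particles=30, n_iters=200, bounds=(-5.0, 5.0), seed=0):
    """Textbook global-best PSO; returns (best_position, best_cost)."""
    rng = random.Random(seed)
    lo, hi = bounds
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive and social weights (assumed values)

    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]          # each particle's best position so far
    best_val = [cost(p) for p in pos]
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]  # swarm-wide best so far

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Move the particle and clamp it to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = cost(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


def toy_discrepancy(params):
    """Hypothetical stand-in cost: squared distance to a made-up parameter vector."""
    target = [1.0, -2.0, 0.5]
    return sum((p - t) ** 2 for p, t in zip(params, target))


best_params, best_cost = pso_minimize(toy_discrepancy, dim=3)
print(best_params, best_cost)
```

In a real tracker the cost would compare a rendered hand hypothesis against the observed depth data, and evaluating it would dominate the runtime. That is consistent with the abstract's mention of a GPU-based implementation, although this sketch makes no attempt to model it.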

1,009 citations

"Real-Time Articulated Hand Pose Est..." refers background or methods in this paper

...Existing state-of-the-arts resort to synthetic data [16], or model-based optimisation [8, 15]....
[...]
...Kinematics Inverse kinematics is a standard technique in model-based and tracking approaches for both body [28, 22] and hand poses estimation [8, 15, 25]....
[...]