Int J Comput Vis (2013) 101:437–458
DOI 10.1007/s11263-012-0549-0
Random Forests for Real Time 3D Face Analysis
Gabriele Fanelli · Matthias Dantone · Juergen Gall · Andrea Fossati · Luc Van Gool
Received: 5 December 2011 / Accepted: 16 July 2012 / Published online: 1 August 2012
© Springer Science+Business Media, LLC 2012
Abstract We present a random forest-based framework for real time head pose estimation from depth images and extend it to localize a set of facial features in 3D. Our algorithm takes a voting approach, where each patch extracted from the depth image can directly cast a vote for the head pose or each of the facial features. Our system proves capable of handling large rotations, partial occlusions, and the noisy depth data acquired using commercial sensors. Moreover, the algorithm works on each frame independently and achieves real time performance without resorting to parallel computations on a GPU. We present extensive experiments on publicly available, challenging datasets and present a new annotated head pose database recorded using a Microsoft Kinect.
G. Fanelli (✉) · M. Dantone · A. Fossati · L. Van Gool
Computer Vision Laboratory, ETH Zurich, Sternwartstrasse 7,
8092 Zurich, Switzerland
e-mail: fanelli@vision.ee.ethz.ch
M. Dantone
e-mail: dantone@vision.ee.ethz.ch
A. Fossati
e-mail: fossati@vision.ee.ethz.ch
L. Van Gool
e-mail: luc.vangool@esat.kuleuven.be
J. Gall
Perceiving Systems Department, Max Planck Institute for
Intelligent Systems, Spemannstrasse 41, 72076 Tübingen,
Germany
e-mail: juergen.gall@tue.mpg.de
L. Van Gool
Department of Electrical Engineering/IBBT, K.U. Leuven,
Kasteelpark Arenberg 10, 3001 Heverlee, Belgium
Keywords Random forests · Head pose estimation · 3D
facial features detection · Real time
1 Introduction
Despite recent advances, people still interact with machines through devices like keyboards and mice, which are not part of natural human-human communication. As people interact by means of many channels, including body posture and facial expressions, an important step towards more natural interfaces is the visual analysis of the user's movements by the machine. Besides the interpretation of full body movements, as done by systems like the Kinect for gaming, new interfaces would highly benefit from automatic analysis of facial movements, as addressed in this paper.
Recent work has mainly focused on the analysis of standard images or videos; see the survey of Murphy-Chutorian and Trivedi (2009) for an overview of head pose estimation from video. The use of 2D imagery is very challenging though, not least because of the lack of texture in some facial regions. On the other hand, depth-sensing devices have recently become affordable (e.g., Microsoft Kinect or Asus Xtion) and in some cases also accurate (e.g., Weise et al. 2007).
The newly available depth cue is key for solving many of
the problems inherent to 2D video data. Yet, 3D imagery has
mainly been leveraged for face tracking (Breidt et al. 2011;
Cai et al. 2010; Weise et al. 2009a, 2011), often leaving open
issues of drift and (re-)initialization. Tracking-by-detection,
on the other hand, detects the face or its features in each
frame, thereby providing increased robustness.
A typical approach to 3D head pose estimation involves localizing a specific facial feature point (e.g., one not affected by facial deformations, like the nose) and determining the head orientation (e.g., as Euler angles). When 3D data is used, most methods rely on geometry to localize prominent facial points like the nose tip (Lu and Jain 2006; Chang et al. 2006; Sun and Yin 2008; Breitenstein et al. 2008, 2009), thus becoming sensitive to its occlusion. Moreover, the available algorithms are either not real time, rely on some assumption for initialization like starting with a frontal pose, or cannot handle large rotations.
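To make the Euler-angle representation concrete, here is a minimal sketch, not taken from the paper (whose exact angle conventions are specified later), of recovering yaw, pitch, and roll from a 3x3 rotation matrix under an assumed Z-Y-X convention; all names are illustrative.

```python
import numpy as np

def euler_zyx(R):
    """Recover (yaw, pitch, roll) in degrees from a rotation matrix,
    assuming R = Rz(yaw) @ Ry(pitch) @ Rx(roll). Near pitch = +/-90 deg
    (gimbal lock) the yaw/roll split becomes ill-conditioned."""
    yaw = np.arctan2(R[1, 0], R[0, 0])
    pitch = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))
    roll = np.arctan2(R[2, 1], R[2, 2])
    return np.degrees([yaw, pitch, roll])

def rot_zyx(yaw, pitch, roll):
    """Compose the matrix back from degrees, for a round-trip check."""
    a, b, c = np.radians([yaw, pitch, roll])
    Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
    Ry = np.array([[np.cos(b), 0, np.sin(b)], [0, 1, 0], [-np.sin(b), 0, np.cos(b)]])
    Rx = np.array([[1, 0, 0], [0, np.cos(c), -np.sin(c)], [0, np.sin(c), np.cos(c)]])
    return Rz @ Ry @ Rx

print(euler_zyx(rot_zyx(30.0, -20.0, 10.0)))  # -> [30. -20. 10.]
```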
We introduce a voting framework where patches extracted from the whole depth image can contribute to the estimation task. As in the Implicit Shape Model (Leibe et al. 2008), the intuition is that patches belonging to different parts of the image carry valuable information about global properties of the object that generated them, such as its pose. Since all patches can vote for the localization of a specific point of the object, that point can be detected even when it is occluded.
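As a toy illustration of this intuition (not the paper's implementation; the regressor output below is simulated), each patch adds a learned 3D offset to its own position, and the consensus of the resulting vote cloud localizes the target point even when patches on the point itself are missing:

```python
import numpy as np

def cast_votes(patch_centers, predicted_offsets):
    """Each patch votes for a target point (e.g., a facial feature) by
    adding its predicted 3D offset to its own center position."""
    return patch_centers + predicted_offsets

rng = np.random.default_rng(0)
target = np.array([10.0, -20.0, 600.0])          # ground-truth point (mm)
centers = rng.uniform(-100, 100, size=(50, 3))   # patches all over the head
offsets = (target - centers) + rng.normal(0, 3.0, size=(50, 3))  # noisy "regressor"
print(np.median(cast_votes(centers, offsets), axis=0))  # close to `target`
```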
We propose to use random regression forests for real time head pose estimation and facial feature localization from depth images. Random forests (RFs) (Breiman 2001) have been successful in semantic segmentation (Shotton et al. 2008), keypoint recognition (Lepetit et al. 2005), object detection (Gall and Lempitsky 2009; Gall et al. 2011), action recognition (Yao et al. 2010; Gall et al. 2011), and real time human pose estimation (Shotton et al. 2011; Girshick et al. 2011). They are well suited for time-critical applications, being very fast at both train and test time; they lend themselves to parallelization (Sharp 2008) and are inherently multi-class. The proposed method does not rely on specific hardware and can easily trade off accuracy for speed. We estimate the desired, continuous parameters directly from the depth data, through a learnt mapping from depth to parameter values. Our system works in real time, without manual initialization. In our experiments, we show that it also works for unseen faces and that it can handle large pose changes, variations in facial hair, and partial occlusions due to glasses, hands, or missing parts in the 3D reconstruction. It does not rely on specific features like the nose tip.
Random forests show their power when using large datasets, on which they can be trained efficiently. Because the accuracy of a regressor depends on the amount of annotated training data, the acquisition and labeling of a training set are key issues. Depending on the expected scenario, we either synthetically generate annotated depth images by rendering a face model undergoing large rotations, or record real sequences using a consumer depth sensor, automatically annotating them using state-of-the-art tracking methods.
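A minimal sketch of the synthetic route, assuming only a point cloud as a stand-in for the rendered face model (the paper's pipeline renders a proper face mesh): rotating the points by a known angle and z-buffering them yields a depth image whose pose label comes for free.

```python
import numpy as np

def synth_depth(points, yaw_deg, size=64, scale=0.25):
    """Rotate a point cloud about the vertical axis and z-buffer it into
    an orthographic depth map; the yaw angle is the ground-truth label."""
    a = np.radians(yaw_deg)
    R = np.array([[np.cos(a), 0.0, np.sin(a)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(a), 0.0, np.cos(a)]])
    p = points @ R.T
    u = (p[:, 0] * scale + size / 2).astype(int)
    v = (p[:, 1] * scale + size / 2).astype(int)
    ok = (u >= 0) & (u < size) & (v >= 0) & (v < size)
    depth = np.full((size, size), np.inf)
    for x, y, z in zip(u[ok], v[ok], p[ok, 2]):
        depth[y, x] = min(depth[y, x], z)   # keep the nearest surface
    return depth

# One labeled training pair per sampled rotation:
cloud = np.random.default_rng(1).normal(0, 40, size=(5000, 3))
dataset = [(synth_depth(cloud, yaw), yaw) for yaw in range(-90, 91, 15)]
```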
A preliminary version of this work was published in Fanelli et al. (2011a), where we introduced the use of random regression forests for real time head pose estimation from high quality range scans. In Fanelli et al. (2011b), we extended the forest to cope with depth images where the whole body can be visible, i.e., discriminating depth patches that belong to a head and only using those to predict the pose, jointly solving the classification and regression problems involved. In this work, we provide a thorough experimental evaluation and extend the random forest to localize several facial landmarks on the range scans.
2 Related Work
After a brief overview of the random forest literature, we
present an analysis of related works on head pose estimation
and facial features localization.
2.1 Random Forests
Random forests (Breiman 2001) have become a popular method in computer vision (Gall and Lempitsky 2009; Gall et al. 2011; Shotton et al. 2011; Girshick et al. 2011) given their capability to handle large training datasets, their high generalization power and speed, and their relative ease of implementation. Recent works showed the power of random forests in mapping image features to votes in a generalized Hough space (Gall and Lempitsky 2009) or, in a regression framework, to real-valued functions (Criminisi et al. 2010). In the context of real time pose estimation, multi-class random forests have been proposed for the real time determination of head pose from 2D video data (Huang et al. 2010). In Shotton et al. (2011), random forests have been used for real time body pose estimation from depth data. Each input depth pixel is first assigned to a specific body part, using a classification forest trained on a very large synthetic dataset. After this step, the location of the body joints is inferred through a local mode-finding approach based on mean shift.
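A compact sketch of such a mode-finding step on a cloud of 3D votes, with a Gaussian kernel and a hand-picked bandwidth (illustrative values, not those of Shotton et al.):

```python
import numpy as np

def mean_shift_mode(votes, bandwidth=20.0, iters=30, tol=1e-3):
    """Iteratively shift an estimate to the weighted mean of the votes,
    converging to a local density mode that is robust to outlier votes."""
    x = np.median(votes, axis=0)            # robust starting point
    for _ in range(iters):
        w = np.exp(-np.sum((votes - x) ** 2, axis=1) / (2 * bandwidth ** 2))
        x_new = (votes * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```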
In Girshick et al. (2011), it has been shown that the body pose can be estimated more efficiently by using regression instead of classification forests. Inspired by these works (Gall and Lempitsky 2009; Criminisi et al. 2010), we have shown that regression forests can be used for real time head pose estimation from depth data (Fanelli et al. 2011a, 2011b).
A detailed introduction to decision forests and their applications in computer vision can be found in Criminisi et al. (2011).
2.2 Head Pose Estimation
With applications ranging from image normalization for recognition to driver drowsiness detection, automatic head pose estimation is an important problem. Several approaches have been proposed in the literature (Murphy-Chutorian and Trivedi 2009); before introducing 3D approaches, which are more relevant for this paper, we present a brief overview of works that take 2D images as input. Methods based on 2D images can be subdivided into appearance-based and feature-based classes, depending on whether they analyze the face as a whole or instead rely on the localization of some specific facial features.
2D Appearance-Based Methods These methods usually discretize the head pose space and learn separate detectors for subsets of poses (Jones and Viola 2003). Chen et al. (2003) and Balasubramanian et al. (2007) present head pose estimation systems with a specific focus on the mapping from the high-dimensional space of facial appearance to the lower-dimensional manifold of head poses. The latter paper considers face images with varying poses as lying on a smooth low-dimensional manifold in a high-dimensional feature space. The proposed Biased Manifold Embedding uses the pose angle information of the face images to compute a biased neighborhood of each point in the feature space, prior to determining the low-dimensional embedding. In the same vein, Osadchy et al. (2005) instead use a convolutional network to learn the mapping, achieving real time performance for the face detection problem while also providing an estimate of the head pose. A very popular family of methods uses statistical models of the face shape and appearance, like Active Appearance Models (AAMs) (Cootes et al. 2001), multi-view AAMs (Ramnath et al. 2008), and 3D Morphable Models (Blanz and Vetter 1999; Storer et al. 2009). Such methods usually focus on tracking facial features rather than estimating the head pose, however. In this context, Martins and Batista (2008) coupled an Active Appearance Model with the POSIT algorithm for head pose tracking.
2D Feature-Based Methods These methods rely on some specific facial features being visible, and are therefore sensitive to occlusions and to large head rotations. Vatahska et al. (2007) use a face detector to roughly classify the pose as frontal, left, or right profile. After this, they detect the eyes and nose tip using AdaBoost classifiers, and the detections are fed into a neural network which estimates the head orientation. Similarly, Whitehill et al. (2008) present a discriminative approach to frame-by-frame head pose estimation. Their algorithm relies on the detection of the nose tip and both eyes, thereby limiting the recognizable poses to the ones where both eyes are visible. Morency et al. (2008) propose a probabilistic framework called Generalized Adaptive View-based Appearance Model integrating frame-by-frame head pose estimation, differential registration, and keyframe tracking.
3D Methods In general, approaches relying solely on 2D images are sensitive to illumination changes and to the lack of distinctive features. Moreover, the annotation of head poses from 2D images is intrinsically problematic. Since 3D sensing devices have become available, computer vision researchers have started to leverage the additional depth information to overcome some of the inherent limitations of image-based methods. Some of the recent works thus use depth as the primary cue (Breitenstein et al. 2008) or in addition to 2D images (Cai et al. 2010; Morency et al. 2003; Seemann et al. 2004).
Seemann et al. (2004) presented a neural network-based system fusing skin color histograms and depth information. It tracks at 10 fps but requires the face to be detected in a frontal pose in the first frame of the sequence. The approach of Mian et al. (2006) uses head pose estimation only as a pre-processing step to face recognition, and the low reported average errors are only calculated on subjects belonging to the training set. Still in a tracking framework, Morency et al. (2003) instead use an intensity and a depth input image to build a prior model of the face using 3D view-based eigenspaces. Then, they use this model to compute the absolute difference in pose for each new frame. The pose range is limited and manual cropping is necessary. In Cai et al. (2010), a 3D face model is aligned to an RGB-depth input stream for tracking features across frames, taking into account the very noisy nature of depth measurements coming from commercial sensors.
Considering instead pure detectors on a frame-by-frame basis, Lu and Jain (2006) create hypotheses for the nose position in range images based on directional maxima. For verification, they compute the nose profile using PCA and a curvature-based shape index. Breitenstein et al. (2008) presented a real time system working on range scans provided by the scanner of Weise et al. (2007). Their system can handle large pose variations, facial expressions, and partial occlusions, as long as the nose remains visible. The method relies on several candidate nose positions, suggested by a geometric descriptor. Such hypotheses are all evaluated in parallel on a GPU, which compares them to renderings of a generic template with different orientations, finally selecting the orientation which minimizes a predefined cost function. Real time performance is only met thanks to the parallel GPU computations. Unfortunately, GPUs are power-hungry and might not be available in many scenarios where portability is important, e.g., for mobile robots. Breitenstein et al. also collected a dataset of over 10K annotated range scans of heads. The subjects, both male and female, with and without glasses, were recorded using the scanner of Weise et al. (2007) while turning their heads around, trying to span all possible yaw and pitch rotation angles they could. The scans were automatically annotated by tracking each sequence using ICP in combination with a personalized face template. The same authors also extended their system to use lower quality depth images from a stereo system (Breitenstein et al. 2009). Yet, the main shortcomings of the original method remain.
2.3 Facial Features Localization
2D Facial Features Facial feature detection from standard images is a well studied problem, often performed as preprocessing for face recognition. Previous contributions can be classified into two categories, depending on whether they use global or local features. Holistic methods, e.g., Active Appearance Models (Cootes et al. 2001, 2002; Matthews and Baker 2003), use the entire facial texture to fit a generative model to a test image. They are usually affected by lighting changes and a bias towards the average face. The complexity of the modeling is an additional issue. Moreover, these methods perform poorly on unseen identities (Gross et al. 2005) and cannot handle low-resolution images well.
In recent years, there has been a shift towards methods based on independent local feature detectors (Valstar et al. 2010; Amberg and Vetter 2011; Belhumeur et al. 2011). Such detectors are discriminative models of image patches centered around the facial landmarks; they are often ambiguous because the limited support region cannot cope with the large appearance variations present in the training samples. To improve accuracy and reduce the influence of wrong detections, global models of the facial feature configuration, like pictorial structures (Felzenszwalb and Huttenlocher 2005; Everingham et al. 2006) or Active Shape Models (Cristinacce and Cootes 2008), are needed.
3D Facial Features Similar to the 2D case, methods focusing on facial feature localization from range data can be subdivided into categories using global or local information. Among the former class, the authors of Mpiperis et al. (2008) deform a bi-linear face model to match a scan of an unseen face in different expressions. Yet, the paper's focus is not on the localization of facial feature points, and real time performance is not achieved. Also Kakadiaris et al. (2007) non-rigidly align an annotated model to face meshes. Constraints need to be imposed on the initial face orientation, however. Using high quality range scans, Weise et al. (2009a) presented a real time system capable of tracking facial motion in detail, but using personalized templates. The same approach has been extended to robustly track head pose and facial deformations using RGB-depth streams provided by commercial sensors like the Kinect (Weise et al. 2011).
Most works that try to directly localize specific feature points from 3D data take advantage of surface curvatures. For example, Sun and Yin (2008), Segundo et al. (2010), and Chang et al. (2006) all use curvature to roughly localize the inner corners of the eyes. Such an approach is very sensitive to missing depth data, particularly for the regions around the inner eye corners, frequently occluded by shadows. Also Mehryar et al. (2010) use surface curvatures, first extracting ridge and valley points, which are then clustered. The clusters are refined using a geometric model imposing a set of distance and angle constraints on the arrangement of candidate landmarks. Colbry et al. (2005) use curvature in conjunction with the Shape Index proposed by Dorai and Jain (1997) to locate facial feature points from range scans of faces. The reported execution time of this anchor point detector is 15 seconds per frame. Wang et al. (2002) use point signatures (Chua and Jarvis 1997) and Gabor filters to detect some facial feature points from 3D and 2D data. The method needs all desired landmarks to be visible, thus restricting the range of head poses while being sensitive to occlusions. Yu et al. (2008) use genetic algorithms to combine several weak classifiers into a 3D facial landmark detector. Ju et al. (2009) detect the nose tip and the eyes using binary neural networks, and propose a 3D shape descriptor invariant to pose and expression.
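Several of the curvature-based detectors above build on a shape index computed from the principal curvatures. A small sketch follows, in the [0, 1] parameterization usually attributed to Dorai and Jain (1997); which end of the scale denotes caps versus cups depends on the curvature sign convention in use.

```python
import numpy as np

def shape_index(k1, k2, eps=1e-9):
    """Shape index from principal curvatures: a scale-free descriptor of
    local surface type, mapping cup/cap-like points to the two extremes
    and saddles to 0.5, independent of how strongly the surface bends."""
    hi, lo = np.maximum(k1, k2), np.minimum(k1, k2)   # enforce k1 >= k2
    return 0.5 - (1.0 / np.pi) * np.arctan((hi + lo) / (hi - lo + eps))

print(shape_index(0.2, 0.01))   # ridge-like value (~0.23)
print(shape_index(0.1, -0.1))   # saddle -> 0.5
```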
Zhao et al. (2011) propose a 3D Statistical Facial Feature Model (SFAM), which models both the global variations in the morphology of the face and the local structures around the landmarks. The low reported errors for the localization of 15 points in scans of neutral faces come at the expense of processing time: over 10 minutes are needed to process one facial scan. In Nair and Cavallaro (2009), fitting the proposed PCA shape model, containing only the upper facial features, i.e., without the mouth, takes on average 2 minutes per face.

In general, prior work on facial feature localization from 3D data is either sensitive to occlusions (especially of the nose), requires prior knowledge of feature map thresholds, cannot handle large rotations, or does not run in real time.
3 Random Forest Framework for Face Analysis
In Sect. 3.1 we first summarize a general random forest
framework (Breiman 2001), then give specific details for
face analysis based on depth data in Sects. 3.2 and 3.3.
3.1 Random Forest
Decision trees (Breiman et al. 1984) can map complex input spaces into simpler, discrete or continuous output spaces, depending on whether they are used for classification or regression purposes. A tree splits the original problem into smaller ones, solvable with simple predictors, thus achieving complex, highly non-linear mappings in a very simple manner. A non-leaf node in the tree contains a binary test, guiding a data sample towards the left or the right child node. The tests are chosen in a supervised-learning framework, and training a tree boils down to selecting the tests which cluster the training data so as to allow good predictions using simple models.
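As a concrete but hypothetical instance of such a binary test on depth patches, one common choice compares mean depth over two sub-rectangles of the patch against a threshold; routing then walks the patch to a leaf. The paper's actual tests are defined in its later sections; this sketch only fixes the idea.

```python
import numpy as np

def binary_test(patch, rect1, rect2, tau):
    """Route left (True) if the difference of mean depths over two
    sub-rectangles (top, left, height, width) is below threshold tau."""
    def mean_depth(r):
        t, l, h, w = r
        return patch[t:t + h, l:l + w].mean()
    return mean_depth(rect1) - mean_depth(rect2) < tau

def route_to_leaf(patch, node):
    """Walk a patch down a tree stored as nested dicts to its leaf model."""
    while 'leaf' not in node:
        side = 'left' if binary_test(patch, *node['test']) else 'right'
        node = node[side]
    return node['leaf']

# Toy one-split tree; real leaves would store the simple predictors.
tree = {'test': ((0, 0, 8, 8), (8, 8, 8, 8), 0.0),
        'left': {'leaf': 'predictor A'}, 'right': {'leaf': 'predictor B'}}
print(route_to_leaf(np.random.default_rng(2).random((16, 16)), tree))
```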
Random forests are collections of decision trees, each trained on a randomly sampled subset of the available data; this reduces over-fitting in comparison to trees trained on the whole dataset, as shown by Breiman (2001). Randomness is introduced by the subset of training examples provided to each tree.
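A minimal bagging sketch of this idea, using scikit-learn regression trees as stand-ins for the custom trees described in the rest of this section; subset size and tree count are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_forest(X, y, n_trees=10, subset=0.5, seed=0):
    """Train each tree on its own random subset of the data; averaging
    the decorrelated trees reduces over-fitting versus one full tree."""
    rng = np.random.default_rng(seed)
    n = int(subset * len(X))
    forest = []
    for t in range(n_trees):
        idx = rng.choice(len(X), size=n, replace=False)
        forest.append(DecisionTreeRegressor(random_state=t).fit(X[idx], y[idx]))
    return forest

def forest_predict(forest, X):
    """Average the per-tree regressions into the ensemble estimate."""
    return np.mean([tree.predict(X) for tree in forest], axis=0)

X = np.random.default_rng(3).random((200, 5))
y = X[:, 0] * 2.0 - X[:, 1]
print(forest_predict(train_forest(X, y), X[:5]))
```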

References

Besl, P. J., & McKay, N. D. (1992). A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2), 239–256.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth.
Frequently Asked Questions
Q1. What are the contributions mentioned in the paper "Random forests for real time 3D face analysis"?

The authors present a random forest-based framework for real time head pose estimation from depth images and extend it to localize a set of facial features in 3D. Their algorithm takes a voting approach, where each patch extracted from the depth image can directly cast a vote for the head pose or each of the facial features. The authors present extensive experiments on publicly available, challenging datasets and present a new annotated head pose database recorded using a Microsoft Kinect.

In their future work, the authors intend to train on full upper body models instead of isolated faces in order to better handle hair and other non-face body parts.

Selected excerpts from the paper:

The 3D morphable model of Paysan et al. (2009) was used, together with graph-based non-rigid ICP (Li et al. 2009), to adapt the generic face template to the point cloud.

In order for the forest to be more scale-invariant, the size of the patches can be made dependent on the depth (e.g., at the patch center); however, in this work the authors assume the faces to be within a relatively narrow range of distances from the sensor.

Most errors occur around the mouth region due to the large deformations and the noisy reconstruction of the teeth and oral cavity.

Their head pose estimation system does not assume any initialization phase nor person-specific training, and works on a frame-by-frame basis.

Because the database does not contain a uniform distribution of head poses, but has a sharp peak around the frontal face configuration, as can be noted from Fig. 21, the authors bin the space of yaw and pitch angles and cap the number of images for each bin.

That means that all leaves in a tree contain a probability p(c = 1|P) = 1, and thus all patches extracted from the depth image will be allowed to vote, no matter their appearance.

The plot shows that a minimum size for the patches is critical, since small patches cannot capture enough information to reliably predict the head pose.