Int J Comput Vis (2013) 101:437–458
DOI 10.1007/s11263-012-0549-0
Random Forests for Real Time 3D Face Analysis
Gabriele Fanelli · Matthias Dantone · Juergen Gall · Andrea Fossati · Luc Van Gool
Received: 5 December 2011 / Accepted: 16 July 2012 / Published online: 1 August 2012
© Springer Science+Business Media, LLC 2012
Abstract We present a random forest-based framework for real time head pose estimation from depth images and extend it to localize a set of facial features in 3D. Our algorithm takes a voting approach, where each patch extracted from the depth image can directly cast a vote for the head pose or each of the facial features. Our system proves capable of handling large rotations, partial occlusions, and the noisy depth data acquired using commercial sensors. Moreover, the algorithm works on each frame independently and achieves real time performance without resorting to parallel computations on a GPU. We present extensive experiments on publicly available, challenging datasets and present a new annotated head pose database recorded using a Microsoft Kinect.
G. Fanelli (✉) · M. Dantone · A. Fossati · L. Van Gool
Computer Vision Laboratory, ETH Zurich, Sternwartstrasse 7,
8092 Zurich, Switzerland
e-mail: fanelli@vision.ee.ethz.ch
M. Dantone
e-mail: dantone@vision.ee.ethz.ch
A. Fossati
e-mail: fossati@vision.ee.ethz.ch
L. Van Gool
e-mail: luc.vangool@esat.kuleuven.be
J. Gall
Perceiving Systems Department, Max Planck Institute for
Intelligent Systems, Spemannstrasse 41, 72076 Tübingen,
Germany
e-mail: juergen.gall@tue.mpg.de
L. Van Gool
Department of Electrical Engineering/IBBT, K.U. Leuven,
Kasteelpark Arenberg 10, 3001 Heverlee, Belgium
Keywords Random forests · Head pose estimation · 3D
facial features detection · Real time
1 Introduction
Despite recent advances, people still interact with machines through devices like keyboards and mice, which are not part of natural human-human communication. As people interact by means of many channels, including body posture and facial expressions, an important step towards more natural interfaces is the visual analysis of the user's movements by the machine. Besides the interpretation of full body movements, as done by systems like the Kinect for gaming, new interfaces would highly benefit from automatic analysis of facial movements, as addressed in this paper.
Recent work has mainly focused on the analysis of standard images or videos; see the survey of Murphy-Chutorian and Trivedi (2009) for an overview of head pose estimation from video. The use of 2D imagery is very challenging though, not least because of the lack of texture in some facial regions. On the other hand, depth-sensing devices have recently become affordable (e.g., Microsoft Kinect or Asus Xtion) and in some cases also accurate (e.g., Weise et al. 2007).
The newly available depth cue is key for solving many of
the problems inherent to 2D video data. Yet, 3D imagery has
mainly been leveraged for face tracking (Breidt et al. 2011;
Cai et al. 2010; Weise et al. 2009a, 2011), often leaving open
issues of drift and (re-)initialization. Tracking-by-detection,
on the other hand, detects the face or its features in each
frame, thereby providing increased robustness.
A typical approach to 3D head pose estimation involves localizing a specific facial feature point (e.g., one not affected by facial deformations, like the nose) and determining the head orientation (e.g., as Euler angles). When 3D data is used, most methods rely on geometry to localize prominent facial points like the nose tip (Lu and Jain 2006; Chang et al. 2006; Sun and Yin 2008; Breitenstein et al. 2008, 2009), thus becoming sensitive to its occlusion. Moreover, the available algorithms are either not real time, rely on some assumption for initialization like starting with a frontal pose, or cannot handle large rotations.
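To make the Euler-angle representation concrete, here is a minimal sketch, not taken from the paper (whose exact angle conventions are specified later), of recovering yaw, pitch, and roll from a 3x3 rotation matrix under an assumed Z-Y-X convention; all names are illustrative.

```python
import numpy as np

def euler_zyx(R):
    """Recover (yaw, pitch, roll) in degrees from a rotation matrix,
    assuming R = Rz(yaw) @ Ry(pitch) @ Rx(roll). Near pitch = +/-90 deg
    (gimbal lock) the yaw/roll split becomes ill-conditioned."""
    yaw = np.arctan2(R[1, 0], R[0, 0])
    pitch = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))
    roll = np.arctan2(R[2, 1], R[2, 2])
    return np.degrees([yaw, pitch, roll])

def rot_zyx(yaw, pitch, roll):
    """Compose the matrix back from degrees, for a round-trip check."""
    a, b, c = np.radians([yaw, pitch, roll])
    Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
    Ry = np.array([[np.cos(b), 0, np.sin(b)], [0, 1, 0], [-np.sin(b), 0, np.cos(b)]])
    Rx = np.array([[1, 0, 0], [0, np.cos(c), -np.sin(c)], [0, np.sin(c), np.cos(c)]])
    return Rz @ Ry @ Rx

print(euler_zyx(rot_zyx(30.0, -20.0, 10.0)))  # -> [30. -20. 10.]
```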
We introduce a voting framework where patches extracted from the whole depth image can contribute to the estimation task. As in the Implicit Shape Model (Leibe et al. 2008), the intuition is that patches belonging to different parts of the image carry valuable information about global properties of the object that generated them, such as its pose. Since all patches can vote for the localization of a specific point of the object, that point can be detected even when it is occluded.
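As a toy illustration of this intuition (not the paper's implementation; the regressor output below is simulated), each patch adds a learned 3D offset to its own position, and the consensus of the resulting vote cloud localizes the target point even when patches on the point itself are missing:

```python
import numpy as np

def cast_votes(patch_centers, predicted_offsets):
    """Each patch votes for a target point (e.g., a facial feature) by
    adding its predicted 3D offset to its own center position."""
    return patch_centers + predicted_offsets

rng = np.random.default_rng(0)
target = np.array([10.0, -20.0, 600.0])          # ground-truth point (mm)
centers = rng.uniform(-100, 100, size=(50, 3))   # patches all over the head
offsets = (target - centers) + rng.normal(0, 3.0, size=(50, 3))  # noisy "regressor"
print(np.median(cast_votes(centers, offsets), axis=0))  # close to `target`
```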
We propose to use random regression forests for real time head pose estimation and facial feature localization from depth images. Random forests (RFs) (Breiman 2001) have been successful in semantic segmentation (Shotton et al. 2008), keypoint recognition (Lepetit et al. 2005), object detection (Gall and Lempitsky 2009; Gall et al. 2011), action recognition (Yao et al. 2010; Gall et al. 2011), and real time human pose estimation (Shotton et al. 2011; Girshick et al. 2011). They are well suited for time-critical applications, being very fast at both train and test time; they lend themselves to parallelization (Sharp 2008) and are inherently multi-class. The proposed method does not rely on specific hardware and can easily trade off accuracy for speed. We estimate the desired, continuous parameters directly from the depth data, through a learnt mapping from depth to parameter values. Our system works in real time, without manual initialization. In our experiments, we show that it also works for unseen faces and that it can handle large pose changes, variations in facial hair, and partial occlusions due to glasses, hands, or missing parts in the 3D reconstruction. It does not rely on specific features like the nose tip.
Random forests show their power when using large datasets, on which they can be trained efficiently. Because the accuracy of a regressor depends on the amount of annotated training data, the acquisition and labeling of a training set are key issues. Depending on the expected scenario, we either synthetically generate annotated depth images by rendering a face model undergoing large rotations, or record real sequences using a consumer depth sensor, automatically annotating them using state-of-the-art tracking methods.
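A minimal sketch of the synthetic route, assuming only a point cloud as a stand-in for the rendered face model (the paper's pipeline renders a proper face mesh): rotating the points by a known angle and z-buffering them yields a depth image whose pose label comes for free.

```python
import numpy as np

def synth_depth(points, yaw_deg, size=64, scale=0.25):
    """Rotate a point cloud about the vertical axis and z-buffer it into
    an orthographic depth map; the yaw angle is the ground-truth label."""
    a = np.radians(yaw_deg)
    R = np.array([[np.cos(a), 0.0, np.sin(a)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(a), 0.0, np.cos(a)]])
    p = points @ R.T
    u = (p[:, 0] * scale + size / 2).astype(int)
    v = (p[:, 1] * scale + size / 2).astype(int)
    ok = (u >= 0) & (u < size) & (v >= 0) & (v < size)
    depth = np.full((size, size), np.inf)
    for x, y, z in zip(u[ok], v[ok], p[ok, 2]):
        depth[y, x] = min(depth[y, x], z)   # keep the nearest surface
    return depth

# One labeled training pair per sampled rotation:
cloud = np.random.default_rng(1).normal(0, 40, size=(5000, 3))
dataset = [(synth_depth(cloud, yaw), yaw) for yaw in range(-90, 91, 15)]
```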
A preliminary version of this work was published in Fanelli et al. (2011a), where we introduced the use of random regression forests for real time head pose estimation from high quality range scans. In Fanelli et al. (2011b), we extended the forest to cope with depth images where the whole body can be visible, i.e., discriminating depth patches that belong to a head and only using those to predict the pose, jointly solving the classification and regression problems involved. In this work, we provide a thorough experimental evaluation and extend the random forest to localize several facial landmarks on the range scans.
2 Related Work
After a brief overview of the random forest literature, we
present an analysis of related works on head pose estimation
and facial features localization.
2.1 Random Forests
Random forests (Breiman 2001) have become a popular method in computer vision (Gall and Lempitsky 2009; Gall et al. 2011; Shotton et al. 2011; Girshick et al. 2011) given their capability to handle large training datasets, their high generalization power and speed, and their relative ease of implementation. Recent works showed the power of random forests in mapping image features to votes in a generalized Hough space (Gall and Lempitsky 2009) or, in a regression framework, to real-valued functions (Criminisi et al. 2010). In the context of real time pose estimation, multi-class random forests have been proposed for the real time determination of head pose from 2D video data (Huang et al. 2010). In Shotton et al. (2011), random forests have been used for real time body pose estimation from depth data. Each input depth pixel is first assigned to a specific body part, using a classification forest trained on a very large synthetic dataset. After this step, the location of the body joints is inferred through a local mode-finding approach based on mean shift.
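A compact sketch of such a mode-finding step on a cloud of 3D votes, with a Gaussian kernel and a hand-picked bandwidth (illustrative values, not those of Shotton et al.):

```python
import numpy as np

def mean_shift_mode(votes, bandwidth=20.0, iters=30, tol=1e-3):
    """Iteratively shift an estimate to the weighted mean of the votes,
    converging to a local density mode that is robust to outlier votes."""
    x = np.median(votes, axis=0)            # robust starting point
    for _ in range(iters):
        w = np.exp(-np.sum((votes - x) ** 2, axis=1) / (2 * bandwidth ** 2))
        x_new = (votes * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```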
In Girshick et al. (2011), it has been shown that the body pose can be estimated more efficiently by using regression instead of classification forests. Inspired by these works (Gall and Lempitsky 2009; Criminisi et al. 2010), we have shown that regression forests can be used for real time head pose estimation from depth data (Fanelli et al. 2011a, 2011b).
A detailed introduction to decision forests and their applications in computer vision can be found in Criminisi et al. (2011).
2.2 Head Pose Estimation
With applications ranging from image normalization for recognition to driver drowsiness detection, automatic head pose estimation is an important problem. Several approaches have been proposed in the literature (Murphy-Chutorian and Trivedi 2009); before introducing 3D approaches, which are more relevant for this paper, we present a brief overview of works that take 2D images as input. Methods based on 2D images can be subdivided into appearance-based and feature-based classes, depending on whether they analyze the face as a whole or instead rely on the localization of some specific facial features.
2D Appearance-Based Methods These methods usually discretize the head pose space and learn separate detectors for subsets of poses (Jones and Viola 2003). Chen et al. (2003) and Balasubramanian et al. (2007) present head pose estimation systems with a specific focus on the mapping from the high-dimensional space of facial appearance to the lower-dimensional manifold of head poses. The latter paper considers face images with varying poses as lying on a smooth low-dimensional manifold in a high-dimensional feature space. The proposed Biased Manifold Embedding uses the pose angle information of the face images to compute a biased neighborhood of each point in the feature space, prior to determining the low-dimensional embedding. In the same vein, Osadchy et al. (2005) instead use a convolutional network to learn the mapping, achieving real time performance for the face detection problem while also providing an estimate of the head pose. A very popular family of methods uses statistical models of the face shape and appearance, like Active Appearance Models (AAMs) (Cootes et al. 2001), multi-view AAMs (Ramnath et al. 2008), and 3D Morphable Models (Blanz and Vetter 1999; Storer et al. 2009). Such methods usually focus on tracking facial features rather than estimating the head pose, however. In this context, Martins and Batista (2008) coupled an Active Appearance Model with the POSIT algorithm for head pose tracking.
2D Feature-Based Methods These methods rely on some specific facial features being visible, and are therefore sensitive to occlusions and to large head rotations. Vatahska et al. (2007) use a face detector to roughly classify the pose as frontal, left, or right profile. After this, they detect the eyes and nose tip using AdaBoost classifiers, and the detections are fed into a neural network which estimates the head orientation. Similarly, Whitehill et al. (2008) present a discriminative approach to frame-by-frame head pose estimation. Their algorithm relies on the detection of the nose tip and both eyes, thereby limiting the recognizable poses to the ones where both eyes are visible. Morency et al. (2008) propose a probabilistic framework called Generalized Adaptive View-based Appearance Model integrating frame-by-frame head pose estimation, differential registration, and keyframe tracking.
3D Methods In general, approaches relying solely on 2D images are sensitive to illumination changes and to the lack of distinctive features. Moreover, the annotation of head poses from 2D images is intrinsically problematic. Since 3D sensing devices have become available, computer vision researchers have started to leverage the additional depth information to overcome some of the inherent limitations of image-based methods. Some of the recent works thus use depth as the primary cue (Breitenstein et al. 2008) or in addition to 2D images (Cai et al. 2010; Morency et al. 2003; Seemann et al. 2004).
Seemann et al. (2004) presented a neural network-based system fusing skin color histograms and depth information. It tracks at 10 fps but requires the face to be detected in a frontal pose in the first frame of the sequence. The approach of Mian et al. (2006) uses head pose estimation only as a pre-processing step to face recognition, and the low reported average errors are only calculated on subjects belonging to the training set. Still in a tracking framework, Morency et al. (2003) instead use an intensity and a depth input image to build a prior model of the face using 3D view-based eigenspaces. Then, they use this model to compute the absolute difference in pose for each new frame. The pose range is limited and manual cropping is necessary. In Cai et al. (2010), a 3D face model is aligned to an RGB-depth input stream for tracking features across frames, taking into account the very noisy nature of depth measurements coming from commercial sensors.
Considering instead pure detectors on a frame-by-frame basis, Lu and Jain (2006) create hypotheses for the nose position in range images based on directional maxima. For verification, they compute the nose profile using PCA and a curvature-based shape index. Breitenstein et al. (2008) presented a real time system working on range scans provided by the scanner of Weise et al. (2007). Their system can handle large pose variations, facial expressions, and partial occlusions, as long as the nose remains visible. The method relies on several candidate nose positions, suggested by a geometric descriptor. Such hypotheses are all evaluated in parallel on a GPU, which compares them to renderings of a generic template with different orientations, finally selecting the orientation which minimizes a predefined cost function. Real time performance is only met thanks to the parallel GPU computations. Unfortunately, GPUs are power-hungry and might not be available in many scenarios where portability is important, e.g., for mobile robots. Breitenstein et al. also collected a dataset of over 10K annotated range scans of heads. The subjects, both male and female, with and without glasses, were recorded using the scanner of Weise et al. (2007) while turning their heads around, trying to span all possible yaw and pitch rotation angles they could. The scans were automatically annotated by tracking each sequence using ICP in combination with a personalized face template. The same authors also extended their system to use lower quality depth images from a stereo system (Breitenstein et al. 2009). Yet, the main shortcomings of the original method remain.
2.3 Facial Features Localization
2D Facial Features Facial feature detection from standard images is a well studied problem, often performed as preprocessing for face recognition. Previous contributions can be classified into two categories, depending on whether they use global or local features. Holistic methods, e.g., Active Appearance Models (Cootes et al. 2001, 2002; Matthews and Baker 2003), use the entire facial texture to fit a generative model to a test image. They are usually affected by lighting changes and a bias towards the average face. The complexity of the modeling is an additional issue. Moreover, these methods perform poorly on unseen identities (Gross et al. 2005) and cannot handle low-resolution images well.
In recent years, there has been a shift towards methods based on independent local feature detectors (Valstar et al. 2010; Amberg and Vetter 2011; Belhumeur et al. 2011). Such detectors are discriminative models of image patches centered around the facial landmarks; they are often ambiguous because the limited support region cannot cope with the large appearance variations present in the training samples. To improve accuracy and reduce the influence of wrong detections, global models of the facial feature configuration, like pictorial structures (Felzenszwalb and Huttenlocher 2005; Everingham et al. 2006) or Active Shape Models (Cristinacce and Cootes 2008), are needed.
3D Facial Features Similar to the 2D case, methods focusing on facial feature localization from range data can be subdivided into categories using global or local information. Among the former class, the authors of Mpiperis et al. (2008) deform a bi-linear face model to match a scan of an unseen face in different expressions. Yet, the paper's focus is not on the localization of facial feature points, and real time performance is not achieved. Also Kakadiaris et al. (2007) non-rigidly align an annotated model to face meshes. Constraints need to be imposed on the initial face orientation, however. Using high quality range scans, Weise et al. (2009a) presented a real time system capable of tracking facial motion in detail, but using personalized templates. The same approach has been extended to robustly track head pose and facial deformations using RGB-depth streams provided by commercial sensors like the Kinect (Weise et al. 2011).
Most works that try to directly localize specific feature points from 3D data take advantage of surface curvatures. For example, Sun and Yin (2008), Segundo et al. (2010), and Chang et al. (2006) all use curvature to roughly localize the inner corners of the eyes. Such an approach is very sensitive to missing depth data, particularly for the regions around the inner eye corners, frequently occluded by shadows. Also Mehryar et al. (2010) use surface curvatures, first extracting ridge and valley points, which are then clustered. The clusters are refined using a geometric model imposing a set of distance and angle constraints on the arrangement of candidate landmarks. Colbry et al. (2005) use curvature in conjunction with the Shape Index proposed by Dorai and Jain (1997) to locate facial feature points from range scans of faces. The reported execution time of this anchor point detector is 15 seconds per frame. Wang et al. (2002) use point signatures (Chua and Jarvis 1997) and Gabor filters to detect some facial feature points from 3D and 2D data. The method needs all desired landmarks to be visible, thus restricting the range of head poses while being sensitive to occlusions. Yu et al. (2008) use genetic algorithms to combine several weak classifiers into a 3D facial landmark detector. Ju et al. (2009) detect the nose tip and the eyes using binary neural networks, and propose a 3D shape descriptor invariant to pose and expression.
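Several of the curvature-based detectors above build on a shape index computed from the principal curvatures. A small sketch follows, in the [0, 1] parameterization usually attributed to Dorai and Jain (1997); which end of the scale denotes caps versus cups depends on the curvature sign convention in use.

```python
import numpy as np

def shape_index(k1, k2, eps=1e-9):
    """Shape index from principal curvatures: a scale-free descriptor of
    local surface type, mapping cup/cap-like points to the two extremes
    and saddles to 0.5, independent of how strongly the surface bends."""
    hi, lo = np.maximum(k1, k2), np.minimum(k1, k2)   # enforce k1 >= k2
    return 0.5 - (1.0 / np.pi) * np.arctan((hi + lo) / (hi - lo + eps))

print(shape_index(0.2, 0.01))   # ridge-like value (~0.23)
print(shape_index(0.1, -0.1))   # saddle -> 0.5
```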
Zhao et al. (2011) propose a 3D Statistical Facial Feature Model (SFAM), which models both the global variations in the morphology of the face and the local structures around the landmarks. The low reported errors for the localization of 15 points in scans of neutral faces come at the expense of processing time: over 10 minutes are needed to process one facial scan. In Nair and Cavallaro (2009), fitting the proposed PCA shape model, containing only the upper facial features, i.e., without the mouth, takes on average 2 minutes per face.

In general, prior work on facial feature localization from 3D data is either sensitive to occlusions (especially of the nose), requires prior knowledge of feature map thresholds, cannot handle large rotations, or does not run in real time.
3 Random Forest Framework for Face Analysis
In Sect. 3.1 we first summarize a general random forest
framework (Breiman 2001), then give specific details for
face analysis based on depth data in Sects. 3.2 and 3.3.
3.1 Random Forest
Decision trees (Breiman et al. 1984) can map complex input spaces into simpler, discrete or continuous output spaces, depending on whether they are used for classification or regression purposes. A tree splits the original problem into smaller ones, solvable with simple predictors, thus achieving complex, highly non-linear mappings in a very simple manner. A non-leaf node in the tree contains a binary test, guiding a data sample towards the left or the right child node. The tests are chosen in a supervised-learning framework, and training a tree boils down to selecting the tests which cluster the training data so as to allow good predictions using simple models.
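As a concrete but hypothetical instance of such a binary test on depth patches, one common choice compares mean depth over two sub-rectangles of the patch against a threshold; routing then walks the patch to a leaf. The paper's actual tests are defined in its later sections; this sketch only fixes the idea.

```python
import numpy as np

def binary_test(patch, rect1, rect2, tau):
    """Route left (True) if the difference of mean depths over two
    sub-rectangles (top, left, height, width) is below threshold tau."""
    def mean_depth(r):
        t, l, h, w = r
        return patch[t:t + h, l:l + w].mean()
    return mean_depth(rect1) - mean_depth(rect2) < tau

def route_to_leaf(patch, node):
    """Walk a patch down a tree stored as nested dicts to its leaf model."""
    while 'leaf' not in node:
        side = 'left' if binary_test(patch, *node['test']) else 'right'
        node = node[side]
    return node['leaf']

# Toy one-split tree; real leaves would store the simple predictors.
tree = {'test': ((0, 0, 8, 8), (8, 8, 8, 8), 0.0),
        'left': {'leaf': 'predictor A'}, 'right': {'leaf': 'predictor B'}}
print(route_to_leaf(np.random.default_rng(2).random((16, 16)), tree))
```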
Random forests are collections of decision trees, each trained on a randomly sampled subset of the available data; this reduces over-fitting in comparison to trees trained on the whole dataset, as shown by Breiman (2001). Randomness is introduced by the subset of training examples provided to each tree.
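A minimal bagging sketch of this idea, using scikit-learn regression trees as stand-ins for the custom trees described in the rest of this section; subset size and tree count are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_forest(X, y, n_trees=10, subset=0.5, seed=0):
    """Train each tree on its own random subset of the data; averaging
    the decorrelated trees reduces over-fitting versus one full tree."""
    rng = np.random.default_rng(seed)
    n = int(subset * len(X))
    forest = []
    for t in range(n_trees):
        idx = rng.choice(len(X), size=n, replace=False)
        forest.append(DecisionTreeRegressor(random_state=t).fit(X[idx], y[idx]))
    return forest

def forest_predict(forest, X):
    """Average the per-tree regressions into the ensemble estimate."""
    return np.mean([tree.predict(X) for tree in forest], axis=0)

X = np.random.default_rng(3).random((200, 5))
y = X[:, 0] * 2.0 - X[:, 1]
print(forest_predict(train_forest(X, y), X[:5]))
```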

References

Besl, P. J., & McKay, N. D. (1992). A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2), 239–256.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth.
Frequently Asked Questions
Q1. What are the contributions mentioned in the paper "Random forests for real time 3D face analysis"?

The authors present a random forest-based framework for real time head pose estimation from depth images and extend it to localize a set of facial features in 3D. Their algorithm takes a voting approach, where each patch extracted from the depth image can directly cast a vote for the head pose or each of the facial features. The authors present extensive experiments on publicly available, challenging datasets and present a new annotated head pose database recorded using a Microsoft Kinect.

In their future work, the authors intend to train on full upper body models instead of isolated faces in order to better handle hair and other non-face body parts.

Selected excerpts from the paper:

The 3D morphable model of Paysan et al. (2009) was used, together with graph-based non-rigid ICP (Li et al. 2009), to adapt the generic face template to the point cloud.

In order for the forest to be more scale-invariant, the size of the patches can be made dependent on the depth (e.g., at the patch center); however, in this work the authors assume the faces to be within a relatively narrow range of distances from the sensor.

Most errors occur around the mouth region due to the large deformations and the noisy reconstruction of the teeth and oral cavity.

Their head pose estimation system does not assume any initialization phase nor person-specific training, and works on a frame-by-frame basis.

Because the database does not contain a uniform distribution of head poses, but has a sharp peak around the frontal face configuration, as can be noted from Fig. 21, the authors bin the space of yaw and pitch angles and cap the number of images for each bin.

That means that all leaves in a tree contain a probability p(c = 1|P) = 1, and thus all patches extracted from the depth image will be allowed to vote, no matter their appearance.

The plot shows that a minimum size for the patches is critical, since small patches cannot capture enough information to reliably predict the head pose.