
OpenFace: an open source facial behavior analysis toolkit
Tadas Baltrušaitis
Tadas.Baltrusaitis@cl.cam.ac.uk
Peter Robinson
Peter.Robinson@cl.cam.ac.uk
Louis-Philippe Morency
morency@cs.cmu.edu
Abstract
Over the past few years, there has been an increased
interest in automatic facial behavior analysis and under-
standing. We present OpenFace, an open source tool
intended for computer vision and machine learning re-
searchers, the affective computing community, and people in-
terested in building interactive applications based on facial
behavior analysis. OpenFace is the first open source tool
capable of facial landmark detection, head pose estima-
tion, facial action unit recognition, and eye-gaze estimation.
The computer vision algorithms which represent the core of
OpenFace demonstrate state-of-the-art results in all of the
above mentioned tasks. Furthermore, our tool is capable of
real-time performance and is able to run from a simple web-
cam without any specialist hardware. Finally, OpenFace
allows for easy integration with other applications and de-
vices through a lightweight messaging system.
1. Introduction
Over the past few years, there has been an increased in-
terest in machine understanding and recognition of affective
and cognitive mental states, and the interpretation of social sig-
nals, especially based on facial expression and, more broadly,
facial behavior [18, 51, 39]. As the face is a very important
channel of nonverbal communication [20, 18], facial behav-
ior analysis has been used in different applications to facil-
itate human-computer interaction [10, 43, 48, 66]. More
recently, there have been a number of developments demon-
strating the feasibility of automated facial behavior analysis
systems for better understanding of medical conditions such
as depression [25] and post-traumatic stress disorder [53].
Other uses of automatic facial behavior analysis include au-
tomotive industries [14], education [42, 26], and entertain-
ment [47].
In our work we define facial behavior as consisting of:
facial landmark motion, head pose (orientation and mo-
tion), facial expressions, and eye gaze. Each of these modal-
ities plays an important role in human behavior, both individually and together. For example, automatic detection and analysis of facial Action Units (AUs) [19] is an important building block in nonverbal behavior and emotion recognition systems [18, 51]. This includes detecting both the presence and the intensity of AUs, allowing us to analyse their occurrence, co-occurrence and dynamics. In addition to AUs, head pose and gesture also play an important role in emotion and social signal perception and expression [56, 1, 29]. Finally, gaze direction is important when evaluating things like attentiveness, social skills and mental health, as well as the intensity of emotions [35].

Figure 1: OpenFace is an open source framework that implements state-of-the-art facial behavior analysis algorithms including: facial landmark detection, head pose tracking, eye gaze and facial Action Unit estimation.
Over the past years there has been a huge amount of
progress in facial behavior understanding [18, 51, 39].
However, there is still no open source system available to
the research community that can perform all of the above-mentioned tasks (see Table 1). There is a large gap between state-of-the-art algorithms and freely available toolkits. This is especially true if real-time performance is required - a necessity for interactive systems.
Tool Approach Landmark Head pose AU Gaze Train Fit Binary Real-time
COFW[13] RCPR[13] X X X X
FaceTracker CLM[50] X X X X X
dlib [34] [32] X X X X
DRMF[4] DRMF[4] X X X X
Chehra [5] X X X X
GNDPM GNDPM[58] X X
PO-CR[57] PO-CR [57] X X
Menpo [3] AAM, CLM, SDM (1) X X X (2)
CFAN [67] [67] X X X
[65] Reg. For [65] X X X X X X
TCDCN CNN [70] X X X X
EyeTab [63] X N/A X X X
Intraface SDM [64] X X ? (3) X
OKAO ? X X X X X
FACET ? X X X X X
Affdex ? X X X X X
Tree DPM [71] [71] X X X
LEAR LEAR [40] X X X
TAUD TAUD [31] X X
OpenFace [7, 6] X X X X X X X X
Table 1: Comparison of facial behavior analysis tools. We do not consider fitting code to be available if the only code provided is a wrapper around a compiled executable. Note that most tools only provide binary versions (executables) rather than the model training and fitting source code. (1) The implementation differs from the originally proposed one in the features used; (2) the algorithms implemented are capable of real-time performance, but the tool does not provide it; (3) the executable is no longer available on the author's website.
Furthermore, even though there exist a number of ap-
proaches for tackling each individual problem, very few of
them are available in source code form and would require a
significant amount of effort to re-implement. In some cases
exact re-implementation is virtually impossible due to lack
of details in papers. Examples of often omitted details in-
clude: values of hyper-parameters, data normalization and
cleaning procedures, exact training protocol, model initial-
ization and re-initialization procedures, and optimization
techniques to make systems real-time. These details are of-
ten as important as the algorithms themselves when build-
ing systems that work on real-world data. Source code is
a great way of providing such details. Finally, even the ap-
proaches that claim to provide code often only provide
a thin wrapper around a compiled binary, making it impos-
sible to know what is actually being computed internally.
OpenFace is not only the first open source tool for facial behavior analysis, it also demonstrates state-of-the-art performance in facial landmark detection, head pose tracking, AU recognition and eye gaze estimation, and it is able to perform all of these tasks together in real-time. The main contributions of OpenFace are: 1) it implements and extends state-of-the-art algorithms; 2) it is an open source tool that includes model training code; 3) it comes with ready-to-use trained models; 4) it is capable of real-time performance without the need for a GPU; 5) it includes a messaging system that makes real-time interactive applications easy to implement; 6) it is available as a graphical user interface (for Windows) and as a command line tool (for Ubuntu, Mac OS X and Windows).
Our work is intended to bridge that gap between existing state-of-the-art research and easy-to-use out-of-the-box solutions for facial behavior analysis. We believe our tool will stimulate the community by lowering the bar of entry into the field and enabling new and interesting applications (project page: https://www.cl.cam.ac.uk/research/rainbow/projects/openface/).
First, we present a brief outline of the recent advances in
face analysis tools (section 2). Then we move on to describe
our facial behavior analysis pipeline (section 3). We follow
with a description of a large number of experiments that assess
our framework (section 4). Finally, we provide a brief de-
scription of the interface provided by OpenFace (section 5).
2. Previous work
A full review of work in facial landmark detection, head
pose, eye gaze, and action unit estimation is outside the
scope of this paper; we refer the reader to recent reviews of the field [17, 18, 30, 46, 51, 61]. We instead provide an overview of available tools for accomplishing the individual facial behavior analysis tasks. For a summary of available tools see Table 1.

Figure 2: OpenFace facial behavior analysis pipeline, including: facial landmark detection, head pose and eye gaze estimation, and facial action unit recognition. The outputs from all of these systems (indicated in red) can be saved to disk or sent over a network.
Facial landmark detection - there exists a broad selec-
tion of freely available tools to perform facial landmark de-
tection in images or videos. However, very few of the ap-
proaches provide the source code and instead only provide
executable binaries. This makes the reproduction of experi-
ments on different training sets or using different landmark
annotation schemes difficult. Furthermore, binaries only al-
low for certain predefined functionality and are often not
cross-platform, making real-time integration into systems
that rely on landmark detection almost impossible.
Although several exceptions exist that provide both
training and testing code [3, 71], those approaches do not
allow for real-time landmark tracking in videos - an impor-
tant requirement for interactive systems.
Head pose estimation has not received the same amount
of interest as facial landmark detection. An earlier exam-
ple of a dedicated head pose estimation tool is the Watson sys-
tem, which is an implementation of the Generalized Adap-
tive View-based Appearance Model [45]. There also exist
several frameworks that allow for head pose estimation us-
ing depth data [21]; however, they cannot work with webcams.
While some facial landmark detectors include head pose es-
timation capabilities [4, 5], most ignore this problem.
AU recognition - there are very few freely available
tools for action unit recognition. However, there are a number of commercial systems that, amongst other functionality, perform Action Unit recognition: FACET (http://www.emotient.com/products/), Affdex (http://www.affectiva.com/solutions/affdex/), and OKAO (https://www.omron.com/ecb/products/mobile/). However, the drawback of such systems is the sometimes prohibitive cost, unknown algorithms, and often unknown training data. Furthermore, some tools are inconvenient to use by being restricted to a single machine (due to MAC address locking or requiring USB dongles). Finally, and most importantly, a commercial product may be discontinued, leading to impossible to reproduce results due to lack of product transparency (this is illustrated by the recent unavailability of FACET).
Gaze estimation - there are a number of tools and commercial systems for eye-gaze estimation; however, the majority of them require specialist hardware such as infrared cameras or head mounted cameras [30, 37, 54]. Although there exist a couple of systems available for webcam-based gaze estimation [72, 24, 63], they struggle in real-world scenarios and some require cumbersome manual calibration steps.
In contrast to other available tools, OpenFace provides
both training and testing code allowing for easy repro-
ducibility of experiments. Furthermore, our system shows
state-of-the-art results on in-the-wild data and does not re-
quire any specialist hardware or person specific calibration.
Finally, our system runs in real-time with all of the facial
behavior analysis modules working together.
3. OpenFace pipeline
In this section we outline the core technologies used by
OpenFace for facial behavior analysis (see Figure 2 for a
summary). First, we provide an explanation of how we de-
tect and track facial landmarks, together with a hierarchical
model extension to an existing algorithm. We then provide
an outline of how these features are used for head pose es-
timation and eye gaze tracking. Finally, we describe our
Facial Action Unit intensity and presence detection system,
which includes a novel person calibration extension to an
existing model.
3.1. Facial landmark detection and tracking
OpenFace uses the recently proposed Conditional Lo-
cal Neural Fields (CLNF) [8] for facial landmark detection
and tracking. CLNF is an instance of a Constrained Local
Model (CLM) [16] that uses more advanced patch experts and an optimization function. The two main components of CLNF are: a Point Distribution Model (PDM), which captures landmark shape variations, and patch experts, which capture local appearance variations of each landmark. For more details about the algorithm refer to Baltrušaitis et al. [8].

Figure 3: Sample registrations on the 300-W and MPIIGaze datasets.
3.1.1 Model novelties
The originally proposed CLNF model performs the detec-
tion of all 68 facial landmarks together. We extend this
model by training separate sets of point distribution and
patch expert models for eyes, lips and eyebrows. We later
fit the landmarks detected with the individual models to a
joint PDM.
Tracking a face over a long period of time may lead to
drift or the person may leave the scene. In order to deal
with this, we employ a face validation step. We use a simple
three-layer convolutional neural network (CNN) that, given
a face aligned using a piecewise affine warp, is trained to
predict the expected landmark detection error. We train the
CNN on the LFPW [11] and Helen [36] training sets with
correct and randomly offset landmark locations. If the val-
idation step fails when tracking a face in a video, we know
that our model needs to be reset.
For landmark detection in difficult in-the-wild im-
ages, we use multiple initialization hypotheses at different
orientations and pick the model with the best converged
likelihood. This slows down the approach, but makes it
more accurate.
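The strategy can be summarised with the short sketch below; `fit_clnf` is a hypothetical fitting routine standing in for the CLNF optimisation, and the listed orientations are illustrative values rather than the exact hypotheses used by OpenFace.

```python
def detect_landmarks_multi_hypothesis(image, face_box, fit_clnf):
    """Fit the model from several initial orientations and keep the fit
    with the best converged likelihood. `fit_clnf` is a hypothetical
    routine standing in for the CLNF optimisation."""
    # Illustrative yaw/pitch initialisations in radians; the exact set of
    # hypotheses used by OpenFace may differ.
    hypotheses = [(0.0, 0.0, 0.0),
                  (0.0, 0.6, 0.0), (0.0, -0.6, 0.0),    # look left / right
                  (0.6, 0.0, 0.0), (-0.6, 0.0, 0.0)]    # look up / down

    best_landmarks, best_likelihood = None, float("-inf")
    for rotation in hypotheses:
        landmarks, likelihood = fit_clnf(image, face_box, init_rotation=rotation)
        if likelihood > best_likelihood:
            best_landmarks, best_likelihood = landmarks, likelihood
    return best_landmarks, best_likelihood
```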
3.1.2 Implementation details
The PDM used in OpenFace was trained on two datasets -
LFPW [11] and Helen [36] training sets. This resulted in a
model with 34 non-rigid and 6 rigid shape parameters.
For training the CLNF patch experts we used: Multi-PIE
[27], LFPW [11] and Helen [36] training sets. We trained a
separate set of patch experts for seven views and four scales
(leading to 28 sets in total). Having multi-scale patch experts allows us to be accurate on both lower and higher resolution face images. We found that optimal results are achieved when the face is at least 100px across. Training on different views allows us to track faces with out-of-plane motion and to model self-occlusion caused by head rotation.

Figure 4: Sample gaze estimations on video sequences; green lines represent the estimated eye gaze vectors.
To initialize our CLNF model we use the face detector
found in the dlib library [33, 34]. We learned a simple
linear mapping from the bounding box provided by dlib
detector to the one surrounding the 68 facial landmarks.
When tracking landmarks in videos we initialize the CLNF
model based on landmark detections in the previous frame. If
our CNN validation module reports that tracking failed, we
reinitialize the model using the dlib face detector.
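The bounding-box correction can be learned with ordinary least squares; the sketch below is one assumed formulation (per-coordinate offsets proportional to the detector box size), not necessarily the exact mapping shipped with OpenFace.

```python
import numpy as np

def learn_box_mapping(det_boxes, lmk_boxes):
    """det_boxes, lmk_boxes: (N, 4) arrays of [x, y, width, height].
    Fits corrections of the form x' = x + a*w + b, etc., by least squares."""
    det_boxes = np.asarray(det_boxes, dtype=float)
    lmk_boxes = np.asarray(lmk_boxes, dtype=float)
    params = []
    for i in range(4):
        # Scale the correction by the detector box size (w for x/w, h for y/h).
        size = det_boxes[:, 2] if i in (0, 2) else det_boxes[:, 3]
        A = np.column_stack([size, np.ones(len(det_boxes))])
        target = lmk_boxes[:, i] - det_boxes[:, i]
        coeff, *_ = np.linalg.lstsq(A, target, rcond=None)
        params.append(coeff)  # (a, b) for this coordinate
    return params

def apply_box_mapping(det_box, params):
    x, y, w, h = det_box
    sizes = [w, h, w, h]
    return [v + a * s + b for v, s, (a, b) in zip(det_box, sizes, params)]
```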
OpenFace also allows for detection of multiple faces in
an image and tracking of multiple faces in videos. For
videos this is achieved by keeping track of active face
tracks and a simple logic module that checks for people
leaving and entering the frame.
3.2. Head pose estimation
Our model is able to extract head pose (translation and
orientation) information in addition to facial landmark de-
tection. We are able to do this, as CLNF internally uses a 3D
representation of facial landmarks and projects them to the
image using an orthographic camera projection. This allows us
to accurately estimate the head pose, once the landmarks are
detected, by solving the PnP problem.
For accurate head pose estimation OpenFace needs to
be provided with the camera calibration parameters (focal
length and principal point). In their absence OpenFace uses
a rough estimate based on image size.
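As a rough illustration of this step, the sketch below solves the PnP problem with OpenCV; taking the focal length from the image width when no calibration is given is a common heuristic and stands in for the estimate OpenFace actually uses.

```python
import cv2
import numpy as np

def estimate_head_pose(landmarks_2d, landmarks_3d, image_size,
                       focal_length=None, principal_point=None):
    """landmarks_2d: (N, 2) detected image points; landmarks_3d: (N, 3)
    corresponding 3D points of the shape model (any metric unit)."""
    h, w = image_size
    if focal_length is None:
        focal_length = w  # rough guess when the camera is uncalibrated
    if principal_point is None:
        principal_point = (w / 2.0, h / 2.0)

    camera_matrix = np.array([[focal_length, 0, principal_point[0]],
                              [0, focal_length, principal_point[1]],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros(4)  # assume no lens distortion

    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(landmarks_3d, dtype=np.float64),
        np.asarray(landmarks_2d, dtype=np.float64),
        camera_matrix, dist_coeffs)
    return ok, rvec, tvec  # rotation (Rodrigues vector) and translation
```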
3.3. Eye gaze estimation
The CLNF framework is a general deformable shape regis-
tration approach, so we use it to detect eye-region landmarks
as well. This includes the eyelids, iris and pupil. We used
the SynthesEyes training dataset [62] to train the PDM and
CLNF patch experts. This model achieves state-of-the-art
results on the eye-region registration task [62]. Some sample
registrations can be seen in Figure 3.
Once the location of the eye and the pupil are detected
using our CLNF model, we use that information to compute
the eye gaze vector individually for each eye. We fire a ray
from the camera origin through the center of the pupil in the
image plane and compute its intersection with the eye-ball
sphere. This gives us the pupil location in 3D camera coor-
dinates. The vector from the 3D eyeball center to the pupil
location is our estimated gaze vector. This is a fast and ac-
curate method for person independent eye-gaze estimation
in webcam images. See Figure 4 for sample gaze estimates.
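A minimal sketch of the ray-sphere computation is shown below; the eyeball radius, the pinhole back-projection of the pupil and the camera-at-origin convention are assumptions made for the example.

```python
import numpy as np

def estimate_gaze_vector(pupil_px, eyeball_center_3d, focal_length,
                         principal_point, eyeball_radius_mm=12.0):
    """pupil_px: 2D pupil centre in the image; eyeball_center_3d: eyeball
    centre in camera coordinates (e.g. from the fitted eye model)."""
    # Direction of the ray from the camera origin through the pupil pixel.
    ray = np.array([(pupil_px[0] - principal_point[0]) / focal_length,
                    (pupil_px[1] - principal_point[1]) / focal_length,
                    1.0])
    ray /= np.linalg.norm(ray)

    # Intersect the ray p = t * ray with the eyeball sphere ||p - c||^2 = r^2,
    # i.e. solve the quadratic t^2 + b*t + const = 0 (||ray|| = 1).
    c = np.asarray(eyeball_center_3d, dtype=float)
    b = -2.0 * ray.dot(c)
    const = c.dot(c) - eyeball_radius_mm ** 2
    disc = b * b - 4.0 * const
    if disc < 0:
        return None  # ray misses the sphere; fall back or re-detect
    t = (-b - np.sqrt(disc)) / 2.0       # nearer intersection = visible pupil
    pupil_3d = t * ray

    gaze = pupil_3d - c                  # from eyeball centre to 3D pupil
    return gaze / np.linalg.norm(gaze)
```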
3.4. Action Unit detection
The OpenFace AU intensity and presence detection module
is based on a recent state-of-the-art AU recognition frame-
work [7, 59]. It is a direct implementation with a couple
of changes that adapt it to work better on natural video se-
quences from unseen datasets. A more detailed explanation
of the system can be found in Baltrušaitis et al. [7]. In
the following section we describe our extensions to the ap-
proach and the implementation details.
3.4.1 Model novelties
In natural interactions people are not expressive very often
[2]. This observation allows us to safely assume that most
of the time the lowest intensity (and in turn prediction) of
each action unit over a long video recording of a person
should be zero. However, the existing AU predictors tend
to sometimes under- or over-estimate AU values for a par-
ticular person; see Figure 5 for an illustration of this.

Figure 5: Prediction of AU12 on the DISFA dataset [7]. Notice how the prediction is always offset by a constant value.
To correct for such prediction errors, we take the lowest
n-th percentile (learned on validation data) of the predictions
on a specific person and subtract it from all of the predic-
tions. We call this approach person calibration. Such a
correction can be easily implemented in an online system as
well by keeping a histogram of previous predictions. This
extension only applies to AU intensity prediction.
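The offline form of this calibration amounts to a per-AU percentile subtraction, as in the sketch below; the percentile value shown is a placeholder for the one learned on validation data, and an online variant would instead maintain a running histogram of predictions.

```python
import numpy as np

def person_calibrate(au_intensities, percentile=5.0):
    """au_intensities: (num_frames, num_aus) raw AU intensity predictions
    for one person. Subtracts each AU's low percentile so that the
    person's least expressive frames map to (close to) zero."""
    preds = np.asarray(au_intensities, dtype=float)
    baseline = np.percentile(preds, percentile, axis=0)   # per-AU offset
    calibrated = np.clip(preds - baseline, 0.0, None)     # keep non-negative
    return calibrated, baseline
```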
AU Full name Prediction
AU1 Inner brow raiser I
AU2 Outer brow raiser I
AU4 Brow lowerer I
AU5 Upper lid raiser I
AU6 Cheek raiser I
AU7 Lid tightener P
AU9 Nose wrinkler I
AU10 Upper lip raiser I
AU12 Lip corner puller I
AU14 Dimpler I
AU15 Lip corner depressor I
AU17 Chin raiser I
AU20 Lip stretcher I
AU23 Lip tightener P
AU25 Lips part I
AU26 Jaw drop I
AU28 Lip suck P
AU45 Blink P
Table 2: List of AUs in OpenFace. I - intensity, P - presence.
Another extension we propose is to combine AU pres-
ence and intensity training datasets. Some datasets only
contain labels for action unit presence (SEMAINE [44] and
BP4D) and others contain labels for their intensities (DISFA
[41] and BP4D [69]). This makes the training on combined
datasets not straightforward. We use the distance to the hy-
perplane of the trained SVM model as a feature for an SVR
regressor. This allows us to train a single predictor using
both AU presence and intensity datasets.
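The sketch below illustrates one way such a combined predictor could be assembled with linear models; the particular estimators and the single-feature formulation are assumptions for the example rather than the exact OpenFace training code.

```python
from sklearn.svm import LinearSVC, LinearSVR

def train_combined_au_predictor(X_presence, y_presence, X_intensity, y_intensity):
    """Train a presence classifier, then use its signed distance to the
    hyperplane as the feature of an intensity regressor."""
    svm = LinearSVC(C=1.0).fit(X_presence, y_presence)   # presence labels {0, 1}

    # Signed distance to the hyperplane for the intensity-labelled data.
    dist = svm.decision_function(X_intensity).reshape(-1, 1)
    svr = LinearSVR(C=1.0).fit(dist, y_intensity)         # intensities, e.g. 0-5
    return svm, svr

def predict_au_intensity(svm, svr, X):
    return svr.predict(svm.decision_function(X).reshape(-1, 1))
```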
3.4.2 Implementation details
In order to extract facial appearance features we used a sim-
ilarity transform from the currently detected landmarks to a
representation of frontal landmarks from a neutral expres-
sion. This results in a 112 × 112 pixel image of the face
with a 45 pixel interpupillary distance (similar to Baltrušaitis
et al. [7]).
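A rough sketch of such an alignment is given below; deriving the similarity transform from just the two eye centres, and the chosen eye position in the output image, are simplifications assumed for the example (OpenFace computes the transform from the detected landmarks and a neutral reference shape).

```python
import cv2
import numpy as np

def align_face(image, left_eye, right_eye, out_size=112, ipd_px=45):
    """Warp the face so the eyes are level, the interpupillary distance is
    ipd_px pixels, and the eye midpoint sits at a fixed output position."""
    left_eye = np.asarray(left_eye, dtype=float)
    right_eye = np.asarray(right_eye, dtype=float)

    delta = right_eye - left_eye
    angle = np.degrees(np.arctan2(delta[1], delta[0]))   # in-plane rotation
    scale = ipd_px / np.linalg.norm(delta)               # scale to 45 px IPD
    eyes_mid = (left_eye + right_eye) / 2.0

    # Rotate/scale about the eye midpoint, then translate it to a fixed spot
    # (the 0.35 vertical placement is an arbitrary illustrative choice).
    M = cv2.getRotationMatrix2D(tuple(eyes_mid), angle, scale)
    M[0, 2] += out_size / 2.0 - eyes_mid[0]
    M[1, 2] += out_size * 0.35 - eyes_mid[1]
    return cv2.warpAffine(image, M, (out_size, out_size))
```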
We extract Histogram of Oriented Gradients (HOG)
features, as proposed by Felzenszwalb et al. [23], from the
aligned face. We use blocks of 2 × 2 cells, of 8 × 8 pix-
els, leading to 12 × 12 blocks of 31-dimensional histograms
(a 4464-dimensional vector describing the face). In order
to reduce the feature dimensionality we use a PCA model
trained on a number of facial expression datasets: CK+
[38], DISFA [41], AVEC 2011 [52], FERA 2011 [60], and
FERA 2015 [59]. Applying PCA to images (sub-sampling
from peak and neutral expressions) and keeping 95% of ex-
plained variability leads to a reduced basis of 1391 dimen-
sions. This allows for a generic basis, more suitable to un-
seen datasets.
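The dimensionality bookkeeping and the PCA reduction can be reproduced as in the sketch below, assuming the 4464-dimensional HOG vectors have already been extracted from the aligned faces of the datasets listed above.

```python
import numpy as np
from sklearn.decomposition import PCA

# 12 x 12 blocks of 31-dimensional histograms -> 4464-dimensional descriptor.
hog_dim = 12 * 12 * 31
assert hog_dim == 4464

def fit_appearance_basis(hog_vectors, variance_kept=0.95):
    """hog_vectors: (num_faces, 4464) HOG descriptors sub-sampled from
    peak and neutral expressions of several datasets. Keeping 95% of the
    variance yields roughly a 1391-dimensional basis in the paper."""
    pca = PCA(n_components=variance_kept, svd_solver="full")
    reduced = pca.fit_transform(np.asarray(hog_vectors, dtype=float))
    return pca, reduced
```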
