
Automatic analysis of facial expressions: the state of the art

TL;DR: The capability of the human visual system with respect to these problems is discussed, and it is meant to serve as an ultimate goal and a guide for determining recommendations for development of an automatic facial expression analyzer.
Abstract: Humans detect and interpret faces and facial expressions in a scene with little or no effort. Still, development of an automated system that accomplishes this task is rather difficult. There are several related problems: detection of an image segment as a face, extraction of the facial expression information, and classification of the expression (e.g., in emotion categories). A system that performs these operations accurately and in real time would form a big step in achieving a human-like interaction between man and machine. The paper surveys the past work in solving these problems. The capability of the human visual system with respect to these problems is discussed, too. It is meant to serve as an ultimate goal and a guide for determining recommendations for development of an automatic facial expression analyzer.

Summary

1 INTRODUCTION

  • The main characteristics of human communication are: multiplicity and multimodality of communication channels.
  • The characteristics of an ideal automated system for facial expression analysis are given in Section 3.

2 FACIAL EXPRESSION ANALYSIS

  • The authors' aim is to explore the issues in design and implementation of a system that could perform automated facial expression analysis.
  • In general, three main steps can be distinguished in tackling the problem.
  • The face model features are the features used to represent (model) the face.
  • The applied face representation and the kind of input images determine the choice of mechanisms for automatic extraction of facial expression information.
  • A good reference point is the functionality of the human visual system.

2.1 Face Detection

  • For most works in automatic facial expression analysis, the conditions under which a facial image or image sequence is obtained are controlled.
  • Determining the exact location of the face in a digitized facial image is a more complex problem.
  • First, the scale and the orientation of the face can vary from image to image.
  • The presence of noise and occlusion makes the problem even more difficult.
  • The presence of the features and their geometrical relationship with each other appears to be more important than the details of the features [5].

2.2 Facial Expression Data Extraction

  • After the presence of a face has been detected in the observed scene, the next step is to extract the information about the encountered facial expression in an automatic way.
  • One of the fundamental issues about the facial expression analysis is the representation of the visual information that an examined face might reveal [102].
  • The results of Johansson's point-light display experiments [1], [5], gave a clue to this problem.
  • The face can be also modeled using a so-called hybrid approach, which typifies a combination of analytic and holistic approaches to face representation.
  • This rules out a search for fixed patterns in the images.

2.3 Facial Expression Classification

  • After the face and its appearance have been perceived, the next step of an automated expression analyzer is to "identify" the facial expression conveyed by the face.
  • The Facial Action Coding System (FACS) [21] is probably the most known study on facial activity.
  • Second, classification of facial expressions into multiple emotion categories should be feasible (e.g., raised eyebrows and smiling mouth is a blend of surprise and happiness, Fig. 1).
  • First, the system should be capable of analyzing any subject, male or female of any age and ethnicity.
  • While the human mechanisms for face detection are very robust, the same is not the case for interpretation of facial expressions.

3 AN IDEAL SYSTEM FOR FACIAL EXPRESSION ANALYSIS

  • Before developing an automated system for facial expression analysis, one should decide on its functionality.
  • It may not be possible to incorporate all features of the human visual system into an automated system, and some features may even be undesirable, but it can certainly serve as a reference point.
  • Yet, actual implementation and integration of these stages into a system are constrained by the system's application domain.
  • An ideal system should be able to perform analysis of all visually distinguishable facial expressions.
  • Yet, according to the descriptions of these prototypic expressions given by Ekman and Friesen [20], the left hand side facial expression shown in Fig. 1 belongs "more" to the surprise than to the happiness class.

4 AUTOMATIC FACIAL EXPRESSION ANALYSIS

  • For its utility in application domains of human behavior interpretation and multimodal/media HCI, automatic facial expression analysis has attracted the interest of many computer vision researchers.
  • Before surveying these works in detail, the authors are giving a short overview of the systems for facial expression analysis proposed in the period of 1991 to 1995.
  • Therefore, these properties (i.e., columns 15 and 20) have been excluded from Table 2 (. stands for "yes," X stands for "no," and - represents a missing entry).
  • These systems primarily concern facial expression animation and do not attempt to classify the observed facial expression either in terms of facial actions or in terms of emotion categories.

4.1 Face Detection

  • For most of the work in automatic facial expression analysis, the conditions under which an image is obtained are controlled.
  • The camera is either mounted on a helmet-like device worn by the subject (e.g., [62], [59]) or placed in such a way that the image has the face in frontal view.
  • Hence, the presence of the face in the scene is ensured and some global location of the face in the scene is known a priori.
  • In the second, analytic approach, the face is detected by detecting some important facial features first (e.g., the irises and the nostrils).

4.1.1 Face Detection in Facial Images

  • To represent the face, Huang and Huang [32] apply a point distribution model (PDM).
  • The face should be without facial hair and glasses, no rigid head motion may be encountered and illumination variations must be linear for the system to work correctly.
  • To localize the contour of the face, they use an algorithm based on the HSV color model, which is similar to the algorithm based on the relative RGB model [103].
  • Once the irises are identified, the overall location of the face is determined by using relative locations of the facial features in the face.
  • Yoneyama et al. [104] use an analytic approach to face detection too.
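
The HSV-based skin-color localization mentioned above relies on skin tones occupying a fairly compact region of hue-saturation space. The sketch below illustrates only the general idea; the threshold values, the morphological clean-up, and the function name are illustrative assumptions, not the algorithm of [103] or of the surveyed system.

```python
import cv2
import numpy as np

def localize_face_hsv(bgr_image):
    """Rough face localization by skin-color thresholding in HSV space (sketch)."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Assumed skin-tone range; not the bounds used by the surveyed systems.
    lower = np.array([0, 40, 60], dtype=np.uint8)
    upper = np.array([25, 255, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)
    # Remove small speckles, then keep the largest connected skin region.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)  # OpenCV 4.x
    if not contours:
        return None
    face = max(contours, key=cv2.contourArea)
    return cv2.boundingRect(face)  # (x, y, w, h) of the face candidate
```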

4.1.2 Face Detection in Arbitrary Images

  • Two of the works surveyed in this paper perform automatic face detection in an arbitrary scene.
  • Hong et al. [30] utilize the PersonSpotter system [81] in order to perform a realtime tracking of the head.
  • The box bounding the head is used then as the image to which an initial labeled graph is fitted.
  • By inspecting the local maxima of the disparity histogram, image regions confined to a certain disparity interval are selected.
  • Essa and Pentland [24] use the eigenspace method of Pentland et al. [65] to locate faces in an arbitrary scene.
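
The eigenspace method referenced in the last bullet scores image windows by how well they can be reconstructed from a small set of eigenfaces (the "distance from face space"). Below is a minimal sketch of that idea, assuming a collection of aligned, vectorized training faces; it is not the actual implementation of Essa and Pentland [24] or Pentland et al. [65].

```python
import numpy as np

def build_face_space(train_faces, k=20):
    """Learn an eigenface subspace from vectorized training faces of shape (n_samples, n_pixels)."""
    mean = train_faces.mean(axis=0)
    centered = train_faces - mean
    # Principal components of the training set (eigenvectors of the covariance matrix).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]                      # k eigenfaces

def distance_from_face_space(window, mean, eigenfaces):
    """Reconstruction error of one image window in the face space.

    Small values indicate face-like windows; scanning windows over the image
    and thresholding this value yields candidate face locations.
    """
    x = window.ravel().astype(float) - mean
    coeffs = eigenfaces @ x                  # project onto the subspace
    reconstruction = eigenfaces.T @ coeffs   # back-project into image space
    return float(np.linalg.norm(x - reconstruction))
```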

4.2 Facial Expression Data Extraction

  • After the presence of a face is detected in the observed scene, the next step is to extract the information about the shown facial expression.
  • Both the applied face representation and the kind of input images affect the choice of the approach to facial expression data extraction.
  • The face representations used by the surveyed systems are listed in Table 5.
  • Template-based methods fit a holistic face model to the input image or track it in the input image sequence.
  • The methods utilized by the surveyed systems are listed in Table 6.

4.2.1 Facial Data Extraction from Static Images: Template-Based Methods

  • As shown in Table 3, several surveyed systems can be classified as methods for facial expression analysis from static images.
  • To build their model they used facial images that were manually labeled with 122 points localized around the facial features.
  • Hong et al. use wavelets of five different frequencies and eight different orientations.
  • Padgett and Cottrell [61] also use a holistic face representation, but they do not deal with facial expression information extraction in an automatic way.
  • Hence, the method will fail to recognize any facial appearance change that involves a horizontal movement of the facial features.
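
The labeled-graph representation used by Hong et al. attaches to each facial point a "jet" of Gabor-wavelet responses at several frequencies and orientations. The following is a hedged sketch of such a jet; the five frequencies, the kernel size, and the function names are illustrative assumptions rather than the parameters reported in [30].

```python
import numpy as np

def gabor_kernel(frequency, theta, sigma=3.0, size=21):
    """Complex Gabor kernel at a given spatial frequency and orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_theta = x * np.cos(theta) + y * np.sin(theta)      # rotate into the filter orientation
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.exp(2j * np.pi * frequency * x_theta)
    return envelope * carrier

def gabor_jet(gray_image, point, frequencies=(0.05, 0.08, 0.12, 0.18, 0.27),
              n_orientations=8, size=21):
    """Vector of Gabor response magnitudes at one facial point (5 x 8 = 40 values)."""
    r, c = point
    half = size // 2
    patch = gray_image[r - half:r + half + 1, c - half:c + half + 1].astype(float)
    jet = []
    for f in frequencies:
        for k in range(n_orientations):
            kernel = gabor_kernel(f, theta=k * np.pi / n_orientations, size=size)
            jet.append(abs(np.sum(patch * kernel)))      # magnitude of the filter response
    return np.array(jet)
```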

4.2.2 Facial Data Extraction from Static Images: Feature-Based Methods

  • The second category of the surveyed methods for automatic facial expression analysis from static images uses an analytic approach to face representation (Table 3, Table 5) and applies a feature-based method for expression information extraction from an input image.
  • In their later work [42], they utilize a CCD camera in monochrome mode to obtain a set of brightness distributions of 13 vertical lines crossing the FCPs.
  • Pantic and Rothkrantz [62] utilize a point-based model composed of two 2D facial views, the frontal and the side view.
  • Then, the best of the acquired results is chosen.
  • These data are used further for expression emotional classification.
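
To make the brightness-distribution idea of [42] concrete, the sketch below samples gray values along vertical lines that cross the facial characteristic points and concatenates them into a single feature vector; with 13 lines and 18 samples per line this would give the 234 inputs of the network mentioned in Section 4.3.2. The sampling span and the per-line normalization are assumptions.

```python
import numpy as np

def brightness_profiles(gray_image, fcp_columns, top_row, bottom_row, samples_per_line=18):
    """Brightness samples along vertical lines crossing the facial characteristic points (sketch)."""
    rows = np.linspace(top_row, bottom_row, samples_per_line).astype(int)
    profiles = []
    for col in fcp_columns:                                  # one vertical line per FCP column
        line = gray_image[rows, int(col)].astype(float)
        line = (line - line.mean()) / (line.std() + 1e-8)    # normalize each line's brightness
        profiles.append(line)
    return np.concatenate(profiles)                          # e.g. 13 x 18 = 234 values
```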

4.2.3 Facial Data Extraction from Image Sequences: Template-Based Methods

  • A first category of the surveyed approaches to automatic facial expression analysis from image sequences uses a holistic or a hybrid approach to face representation (Table 3, Table 5) and applies a template-based method for facial expression information extraction from an input image sequence.
  • First, they applied the eigenspace method [65] to automatically track the face in the scene (Section 4.1.2) and extract the positions of the eyes, nose, and mouth.
  • Essa and Pentland use the optical flow computation method proposed by Simoncelli [80].
  • The flow covariances between different frames are stored and used in the subsequent processing.
  • To fit the Potential Net to a normalized facial image (see Section 4.1.1), they compute first the edge image by applying a differential filter.

4.2.4 Facial Data Extraction from Image Sequences: Feature-Based Methods

  • Only one of the surveyed methods for automatic facial expression analysis from image sequences utilizes an analytic face representation (Table 3, Table 5) and applies a feature-based method for facial expression information extraction.
  • Cohn et al. [10] use a model of facial landmark points localized around the facial features, hand-marked with a mouse device in the first frame of an examined image sequence.
  • In the rest of the frames, a hierarchical optical flow method [49] is used to track the optical flow in 13 x 13 windows surrounding the landmark points (a sketch of such tracking follows after this list).
  • The displacement of each landmark point is calculated by subtracting its normalized position in the first frame from its current normalized position (all frames of an input sequence are manually normalized).
  • The face should be without facial hair/ glasses, no rigid head motion may be encountered, the first frame should be an expressionless face, and the facial landmark points should be marked in the first frame for the method to work correctly.
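
A minimal sketch of landmark tracking in the spirit of Cohn et al. [10]: landmarks marked in the first frame are propagated through the sequence with pyramidal Lucas-Kanade optical flow over 13 x 13 windows (standing in here for the hierarchical method of [49]), and displacements are measured against the first frame. The frames are assumed to be grayscale and already normalized, as in the surveyed system.

```python
import cv2
import numpy as np

def track_landmarks(frames, initial_points):
    """Track hand-marked facial landmarks through a grayscale image sequence.

    frames: list of grayscale frames; initial_points: (N, 2) array of (x, y)
    landmarks marked in the first frame.  Returns, for each later frame, the
    displacement of every landmark relative to its first-frame position.
    """
    p0 = initial_points.astype(np.float32).reshape(-1, 1, 2)
    reference = p0.copy()
    displacements = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        p1, status, _err = cv2.calcOpticalFlowPyrLK(
            prev, curr, p0, None, winSize=(13, 13), maxLevel=3)
        p0 = p1                                              # tracked points seed the next step
        displacements.append((p1 - reference).reshape(-1, 2))
    return displacements
```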

4.3 Facial Expression Classification

  • The last step of facial expression analysis is to classify (identify, interpret) the facial display conveyed by the face.
  • The applied methods for expression classification in terms of facial actions are summarized in Table 7.
  • If a template-based classification method is applied, the encountered facial expression is compared to the templates defined for each expression category.
  • Most of the neural-network-based classification methods utilized by the surveyed systems perform facial expression classification into a single category.
  • The authors are doing so because the overall characteristics of these methods better fit the overall properties of the template-based expression classification approaches.

4.3.1 Expression Classification from Static Images: Template-Based Methods

  • A first category of the surveyed methods for automatic expression analysis from static images applies a template-based method for expression classification.
  • The personalized galleries of nine people have been utilized, where each gallery contained 28 images (four images per expression).
  • The achieved recognition rate was 89 percent in the case of the familiar subjects and 73 percent in the case of unknown persons.
  • In order to perform emotional classification of the observed facial expression, Huang and Huang [32] perform an intermediate step by calculating 10 Action Parameters (APs, Fig. 12).
  • An input LG vector is classified by being projected along the discriminant vectors calculated for each independently trained binary classifier.
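
Template-based classification, in its simplest form, compares the feature vector extracted from the observed expression with stored example vectors for each category (for instance, the images of a personalized gallery) and returns the closest category. The sketch below is a generic nearest-template classifier under that assumption, not the matching procedure of any particular surveyed system.

```python
import numpy as np

def classify_by_template(feature_vector, templates):
    """Nearest-template classification.

    templates: dict mapping an expression label to a list of stored feature
    vectors.  Returns the label of the closest template and its distance.
    """
    best_label, best_distance = None, np.inf
    for label, examples in templates.items():
        for template in examples:
            d = np.linalg.norm(feature_vector - template)
            if d < best_distance:
                best_label, best_distance = label, d
    return best_label, best_distance
```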

4.3.2 Expression Classification from Static Images: Neural-Network-Based Methods

  • A second category of the surveyed methods for automatic facial expression analysis from static images applies a neural network for facial expression classification.
  • For classification of expression into one of six basic emotion categories, Hara and Kobayashi [42] apply a 234 x 50 x 6 back-propagation neural network (a sketch of such a network follows after this list).
  • The average recognition rate was 85 percent.
  • This process has been repeated for each of the 10 segments and the results of all 10 trained networks have been averaged.
  • The difference between a distance measured in an examined image and the same distance measured in an expressionless face of the same person was normalized.
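
For orientation, here is a sketch of a 234 x 50 x 6 feed-forward network of the kind mentioned above: 234 inputs (e.g., the brightness samples of Section 4.2.2), 50 hidden units, and one output per basic emotion category. Only the forward pass is shown; in the surveyed work the weights are learned by back-propagation, and the activation functions and output normalization used here are assumptions.

```python
import numpy as np

class SmallMLP:
    """Sketch of a 234-50-6 feed-forward network (forward pass only)."""

    def __init__(self, n_in=234, n_hidden=50, n_out=6, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        hidden = np.tanh(x @ self.w1 + self.b1)     # 50 hidden units
        scores = hidden @ self.w2 + self.b2         # one score per emotion category
        exp = np.exp(scores - scores.max())
        return exp / exp.sum()                      # normalized per-category outputs

# Usage: probs = SmallMLP().forward(feature_vector_of_234_values);
# the emotion with the highest output is taken as the recognized category.
```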

4.3.3 Expression Classification from Static Images: Rule-Based Methods

  • Just one of the surveyed methods for automatic facial expression analysis from static images applies a rule-based approach to expression classification.
  • From the localized contours of the facial features, the model features (Fig. 7) are extracted.
  • Based on the knowledge acquired from FACS [21], the production rules classify the calculated model deformation into the appropriate AUs-classes (total number of classes is 31).
  • Classification of an input facial dual-view into multiple emotion categories is performed by comparing the AU-coded description of the shown facial expression to AU-coded descriptions of six basic emotional expressions, which have been acquired from the linguistic descriptions given by Ekman [22].
  • The dual-views used for testing of the system have been recorded under constant illumination and none of the subjects had a moustache, a beard, or wore glasses.
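
A hedged sketch of the rule-based scheme described above: production rules map model-feature deformations to AUs, and the resulting AU-coded description is compared with AU-coded descriptions of the basic emotional expressions to obtain a (possibly multiple) emotion labeling. The rules, thresholds, and prototype AU sets below are illustrative placeholders, not the actual rule base of [62] or verbatim FACS definitions.

```python
def deformations_to_aus(deform):
    """Map model-feature deformations (e.g. {'inner_brow_raise': 0.7}) to a set of AUs."""
    aus = set()
    if deform.get("inner_brow_raise", 0.0) > 0.5:
        aus.add("AU1")
    if deform.get("outer_brow_raise", 0.0) > 0.5:
        aus.add("AU2")
    if deform.get("lip_corner_pull", 0.0) > 0.5:
        aus.add("AU12")
    if deform.get("jaw_drop", 0.0) > 0.5:
        aus.add("AU26")
    return aus

# Illustrative AU-coded emotion prototypes (placeholders, not quoted from [22]).
EMOTION_PROTOTYPES = {
    "surprise":  {"AU1", "AU2", "AU26"},
    "happiness": {"AU12"},
}

def score_emotions(aus):
    """Quantified match between the observed AU set and each emotion prototype."""
    return {emotion: len(aus & proto) / len(proto)
            for emotion, proto in EMOTION_PROTOTYPES.items()}
```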

4.3.4 Expression Classification from Image Sequences: Template-Based Methods

  • The first category of the surveyed methods for automatic facial expression analysis from facial image sequences applies a template-based method for expression classification.
  • Image sequences (504) containing 872 facial actions displayed by 100 subjects have been used.
  • The method was tested on image sequences shown by the same subjects.
  • The category of an expression is decided by determining the minimal distance between the actual trajectory of FEFPs and the trajectories defined by the models.
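
The trajectory-based decision in the last bullet can be pictured as follows: the tracked FEFP trajectory of the input sequence is compared with one stored model trajectory per expression category, and the category at minimal distance wins. The distance measure and the assumption that trajectories are resampled to equal length are simplifications for illustration.

```python
import numpy as np

def trajectory_distance(actual, model):
    """Mean point-wise distance between two FEFP trajectories of shape (n_frames, n_points, 2)."""
    return float(np.mean(np.linalg.norm(actual - model, axis=-1)))

def classify_trajectory(actual, model_trajectories):
    """Pick the expression category whose model trajectory is closest to the observed one."""
    distances = {label: trajectory_distance(actual, model)
                 for label, model in model_trajectories.items()}
    return min(distances, key=distances.get), distances
```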

4.3.5 Expression Classification from Image Sequences: Rule-Based Methods

  • Just one of the surveyed methods for automatic facial expression analysis from image sequences applies a rule-based approach to expression classification.
  • The motion parameters (e.g., translation and divergence) are used to derive the midlevel predicates that describe the motion of the facial features.
  • For each of the six basic emotional expressions, they developed a model represented by a set of rules for detecting the beginning and ending of the expression.
  • The rules are applied to the predicates of the midlevel representation.
  • The achieved recognition rate was 88 percent.

5 DISCUSSION

  • The authors have explored and compared a number of different recently presented approaches to facial expression detection and classification in static images and image sequences.
  • The number of the surveyed systems is rather large and the reader might be interested in the results of the performed comparison in terms of the best performances.
  • Yet, the authors deliberately didn't make an attempt to label some of the surveyed systems as being better than some other systems presented in the literature.
  • The authors believe that a well-defined and commonly used single database of testing images (image sequences) is the necessary prerequisite for "ranking" the performances of the proposed systems in an objective manner.

5.1 Detection of the Face and Its Features

  • Most of the currently existing systems for facial expression analysis assume that the presence of a face in the scene is ensured.
  • In many instances, the systems do not utilize a camera setting that will ascertain the correctness of that assumption.
  • In addition, in many instances strong assumptions are made to make the problem of facial expression analysis more tractable (Table 6).
  • Thus, if a fixed camera acquires the images, the system should be capable of dealing with rigid head motions.
  • Yet, only the method proposed by Essa and Pentland [24] deals with the facial images of faces with facial hair and/or eyeglasses.

5.2 Facial Expression Classification

  • In general, the existing expression analyzers perform a singular classification of the examined expression into one of the basic emotion categories proposed by Ekman and Friesen [20].
  • Defining interpretation categories into which any facial expression can be classified is one of the key challenges in the design of a realistic facial expression analyzer.
  • In addition, each person has his/her own maximal intensity of displaying a particular facial action.
  • Also, none of the surveyed systems can distinguish all 44 AUs defined in FACS.

6 CONCLUSION

  • Analysis of facial expressions is an intriguing problem which humans solve with quite an apparent ease.
  • Capability of the human visual system in solving these problems has been discussed.
  • Also, all of the proposed approaches to automatic expression analysis perform only facial expression classification into the basic emotion categories defined by Ekman and Friesen [20].
  • Furthermore, some of the surveyed methods have been tested only on the set of images used for training.
  • The authors hesitate to believe that those systems are person-independent, which, in turn, should be a basic property of a behavioral science research tool or of an advanced HCI.


Automatic Analysis of Facial Expressions:
The State of the Art
Maja Pantic, Student Member, IEEE, and Leon J.M. Rothkrantz
Abstract: Humans detect and interpret faces and facial expressions in a scene with little or no effort. Still, development of an
automated system that accomplishes this task is rather difficult. There are several related problems: detection of an image segment as
a face, extraction of the facial expression information, and classification of the expression (e.g., in emotion categories). A system that
performs these operations accurately and in real time would form a big step in achieving a human-like interaction between man and
machine. This paper surveys the past work in solving these problems. The capability of the human visual system with respect to these
problems is discussed, too. It is meant to serve as an ultimate goal and a guide for determining recommendations for development of
an automatic facial expression analyzer.
Index Terms: Face detection, facial expression information extraction, facial action encoding, facial expression emotional
classification.
1 INTRODUCTION
As pointed out by Bruce [6], Takeuchi and Nagao [84],
and Hara and Kobayashi [28], human face-to-face
communication is an ideal model for designing a multi-
modal/media human-computer interface (HCI). The main
characteristics of human communication are: multiplicity
and multimodality of communication channels. A channel
is a communication medium while a modality is a sense
used to perceive signals from the outside world. Examples
of human communication channels are: auditory channel
that carries speech, auditory channel that carries vocal
intonation, visual channel that carries facial expressions,
and visual channel that carries body movements. The
senses of sight, hearing, and touch are examples of
modalities. In usual face-to-face communication, many
channels are used and different modalities are activated.
As a result, communication becomes highly flexible and
robust. Failure of one channel is recovered by another
channel and a message in one channel can be explained by
another channel. This is how a multimedia/modal HCI
should be developed for facilitating robust, natural,
efficient, and effective man-machine interaction.
Relatively few existing works combine different mod-
alities into a single system for human communicative
reaction analysis. Examples are the works of Chen et al.
[9] and De Silva et al. [15] who studied the effects of a
combined detection of facial and vocal expressions of
emotions. So far, the majority of studies treat various
human communication channels separately, as indicated by
Nakatsu [58]. Examples for the presented systems are:
emotional interpretation of human voices [35], [66], [68],
[90], emotion recognition by physiological signals pattern
recognition [67], detection and interpretation of hand
gestures [64], recognition of body movements [29], [97],
[46], and facial expression analysis (this survey).
The terms "face-to-face" and "interface" indicate that the
face plays an essential role in interpersonal communication.
The face is the means to identify other members of the
species, to interpret what has been said by the means of
lipreading, and to understand someone's emotional state
and intentions on the basis of the shown facial expression.
Personality, attractiveness, age, and gender can also be seen
from someone's face. Considerable research in social
psychology has also shown that facial expressions help
coordinate conversation [4], [82], and have considerably
more effect on whether a listener feels liked or disliked than
the speaker's spoken words [55]. Mehrabian indicated that
the verbal part (i.e., spoken words) of a message contributes
only for 7 percent to the effect of the message as a whole, the
vocal part (e.g., voice intonation) contributes for 38 percent,
while facial expression of the speaker contributes for
55 percent to the effect of the spoken message [55]. This
implies that the facial expressions form the major modality
in human communication.
Recent advances in image analysis and pattern recogni-
tion open up the possibility of automatic detection and
classification of emotional and conversational facial signals.
Automating facial expression analysis could bring facial
expressions into man-machine interaction as a new mod-
ality and make the interaction tighter and more efficient.
Such a system could also make classification of facial
expressions widely accessible as a tool for research in
behavioral science and medicine. The goal of this paper is to
survey the work done in automating facial expression
analysis in facial images and image sequences. Section 2
identifies three basic problems related to facial expression
analysis. These problems are: face detection in a facial
image or image sequence, facial expression data extraction,
and facial expression classification. The capability of the
human visual system with respect to these problems is
described. It defines, in some way, the expectations for an
automated system. The characteristics of an ideal auto-
mated system for facial expression analysis are given in
Section 3. Section 4 surveys the techniques presented in the
literature in the past decade for facial expression analysis by
a computer. Their characteristics are summarized in respect
to the requirements posed on the design of an ideal facial
expression analyzer. We do not attempt to provide an
exhaustive review of the past work in each of the problems
related to automatic facial expression analysis. Here, we
selectively discuss systems which deal with each of these
problems. Possible directions for future research are
discussed in Section 5. Section 6 concludes the paper.
2 FACIAL EXPRESSION ANALYSIS
Our aim is to explore the issues in design and implementa-
tion of a system that could perform automated facial
expression analysis. In general, three main steps can be
distinguished in tackling the problem. First, before a facial
expression can be analyzed, the face must be detected in a
scene. Next is to devise mechanisms for extracting the facial
expression information from the observed facial image or
image sequence. In the case of static images, the process of
extracting the facial expression information is referred to as
localizing the face and its features in the scene. In the case of
facial image sequences, this process is referred to as tracking
the face and its features in the scene. At this point, a clear
distinction should be made between two terms, namely,
facial features and face model features. The facial features are
the prominent features of the face: eyebrows, eyes, nose,
mouth, and chin. The face model features are the features
used to represent (model) the face. The face can be
represented in various ways, e.g., as a whole unit (holistic
representation), as a set of features (analytic representation)
or as a combination of these (hybrid approach). The applied
face representation and the kind of input images determine
the choice of mechanisms for automatic extraction of facial
expression information. The final step is to define some set
of categories, which we want to use for facial expression
classification and/or facial expression interpretation, and to
devise the mechanism of categorization.
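
To make this three-step decomposition concrete, the skeleton below strings the steps together; the detector, extractor, and classifier names are hypothetical placeholders that stand for any of the concrete techniques surveyed in Section 4, not components defined in this paper.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class ExpressionAnalysis:
    face_region: Any                 # where the face was detected
    face_model: Any                  # extracted face model features
    categories: Dict[str, float]     # quantified classification result

def analyze_expression(image, detector, extractor, classifier):
    """Skeleton of the three-step process described above (placeholder components)."""
    face_region = detector(image)                  # 1. face detection
    face_model = extractor(image, face_region)     # 2. facial expression data extraction
    categories = classifier(face_model)            # 3. facial expression classification
    return ExpressionAnalysis(face_region, face_model, categories)
```
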
Before an automated facial expression analyzer is built, one
should decide on the system's functionality. A good reference
point is the functionality of the human visual system. After all,
it is the best known facial expression analyzer. This section
discusses the three basic problems related to the process of
facial expression analysis as well as the capability of the
human visual system with respect to these.
2.1 Face Detection
For most works in automatic facial expression analysis, the
conditions under which a facial image or image sequence is
obtained are controlled. Usually, the image has the face in
frontal view. Hence, the presence of a face in the scene is
ensured and some global location of the face in the scene is
known a priori. However, determining the exact location of
the face in a digitized facial image is a more complex
problem. First, the scale and the orientation of the face can
vary from image to image. If the mugshots are taken with a
fixed camera, faces can occur in images at various sizes and
angles due to the movements of the observed person. Thus,
it is difficult to search for a fixed pattern (template) in the
image. The presence of noise and occlusion makes the
problem even more difficult.
Humans detect a facial pattern by casual inspection of
the scene. We detect faces effortlessly in a wide range of
conditions, under bad lighting conditions or from a great
distance. It is generally believed that two-gray-level
images of 100 to 200 pixels form a lower limit for detection
of a face by a human observer [75], [8]. Another character-
istic of the human visual system is that a face is perceived as
a whole, not as a collection of the facial features. The
presence of the features and their geometrical relationship
with each other appears to be more important than the
details of the features [5]. When a face is partially occluded
(e.g., by a hand), we perceive a whole face, as if our
perceptual system fills in the missing parts. This is very
difficult (if possible at all) to achieve by a computer.
2.2 Facial Expression Data Extraction
After the presence of a face has been detected in the
observed scene, the next step is to extract the information
about the encountered facial expression in an automatic
way. If the extraction cannot be performed automatically, a
fully automatic facial expression analyzer cannot be
developed. Both, the applied face representation and the
kind of input images affect the choice of the approach to
facial expression information extraction.
One of the fundamental issues about the facial expres-
sion analysis is the representation of the visual information
that an examined face might reveal [102]. The results of
Johansson's point-light display experiments [1], [5], gave a
clue to this problem. The experiments suggest that the
visual properties of the face, regarding the information
about the shown facial expression, could be made clear by
describing the movements of points belonging to the facial
features (eyebrows, eyes, and mouth) and then by analyzing
the relationships between those movements. This triggered
the researchers of vision-based facial gesture analysis to
make different attempts to define point-based visual
properties of facial expressions. Various analytic face
representations resulted, in which the face is modeled as a
set of facial points (e.g., Fig. 6, [42], Fig. 7, [61]) or as a set of
templates fitted to the facial features such as the eyes and
the mouth. In another approach to face representation
(holistic approach), the face is represented as a whole unit.
A 3D wire-frame with a mapped texture (e.g., [86]) and a
spatio-temporal model of facial image motion (e.g., Fig. 8,
[2]) are typical examples of the holistic approaches to face
representation. The face can be also modeled using a so-
called hybrid approach, which typifies a combination of
analytic and holistic approaches to face representation. In
this approach, a set of facial points is usually used to
determine an initial position of a template that models the
face (e.g., Fig. 10, [40]).
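
The three styles of face representation can be pictured as simple data structures: an analytic model stores a set of labeled fiducial points, a holistic model stores the face as one unit (an appearance patch or a motion field), and a hybrid model uses points to anchor a holistic template. The field names below are illustrative assumptions, not structures defined in the surveyed systems.

```python
from dataclasses import dataclass
from typing import Dict, Tuple
import numpy as np

@dataclass
class AnalyticFaceModel:
    """Analytic representation: the face as a set of labeled fiducial points."""
    points: Dict[str, Tuple[float, float]]   # e.g. {"left_eye_inner": (x, y), ...}

@dataclass
class HolisticFaceModel:
    """Holistic representation: the face as a whole unit (image patch or flow field)."""
    appearance: np.ndarray

@dataclass
class HybridFaceModel:
    """Hybrid representation: fiducial points used to place a holistic template."""
    points: Dict[str, Tuple[float, float]]
    template: np.ndarray
```
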
Irrespective of the kind of face model applied,
attempts must be made to model and then extract the
information about the displayed facial expression without
losing any (or much) of that information. Several factors
make this task complex. The first is the presence of facial
hair, glasses, etc., which obscure the facial features. Another
problem is the variation in size and orientation of the face in
input images. This rules out a search for fixed patterns in the
images. Finally, noise and occlusion are always present to
some extent.
As indicated by Ellis [23], human encoding of the visual
stimulus (face and its expression) may be in the form of a
primal sketch and may be hardwired. However, not much
else is known in terms of the nature of internal representa-
tion of a face in the human brain.
2.3 Facial Expression Classification
After the face and its appearance have been perceived, the
next step of an automated expression analyzer is to
"identify" the facial expression conveyed by the face. A
fundamental issue about the facial expression classification
is to define a set of categories we want to deal with. A
related issue is to devise mechanisms of categorization.
Facial expressions can be classified in various ways: in
terms of facial actions that cause an expression, in terms of
some nonprototypic expressions such as "raised brows" or
in terms of some prototypic expressions such as emotional
expressions.
The Facial Action Coding System (FACS) [21] is probably
the most known study on facial activity. It is a system that
has been developed to facilitate objective measurement of
facial activity for behavioral science investigations of the
face. FACS is designed for human observers to detect
independent subtle changes in facial appearance caused by
contractions of the facial muscles. In a form of rules, FACS
provides a linguistic description of all possible, visually
detectable, facial changes in terms of 44 so-called Action
Units (AUs). Using these rules, a trained human FACS
coder decomposes a shown expression into the specific AUs
that describe the expression. Automating FACS would
make it widely accessible as a research tool in the
behavioral science, which is furthermore the theoretical
basis of multimodal/media user interfaces. This triggered
researchers of computer vision field to take different
approaches in tackling the problem. Still, explicit attempts
to automate the facial action coding as applied to automated
FACS encoding are few (see [16] or [17] for a review of the
existing methods as well as Table 7 of this survey).
Most of the studies on automated expression analysis
perform an emotional classification. As indicated by
Fridlund et al. [25], the most known and the most
commonly used study on emotional classification of facial
expressions is the cross-cultural study on existence of
"universal categories of emotional expressions." Ekman
defined six such categories, referred to as the basic emotions:
happiness, sadness, surprise, fear, anger, and disgust [19].
He described each basic emotion in terms of a facial
expression that uniquely characterizes that emotion. In the
past years, many questions arose around this study. Are the
basic emotional expressions indeed universal [33], [22], or
are they merely a stressing of the verbal communication
and have no relation with an actual emotional state [76],
[77], [26]? Also, it is not at all certain that each facial
expression able to be displayed on the face can be classified
under the six basic emotion categories. Nevertheless, most
of the studies on vision-based facial expression analysis rely
on Ekman's emotional categorization of facial expressions.
The problem of automating facial expression emotional
classification is difficult to handle for a number of reasons.
First, Ekman's description of the six prototypic facial
expressions of emotion is linguistic (and, thus, ambiguous).
There is no uniquely defined description either in terms of
facial actions or in terms of some other universally defined
facial codes. Hence, the validation and the verification of
the classification scheme to be used are difficult and crucial
tasks. Second, classification of facial expressions into
multiple emotion categories should be feasible (e.g., raised
eyebrows and smiling mouth is a blend of surprise and
happiness, Fig. 1). Still, there is no psychological scrutiny on
this topic.
Three more issues are related to facial expression
classification in general. First, the system should be capable
of analyzing any subject, male or female of any age and
ethnicity. In other words, the classification mechanism may
not depend on physiognomic variability of the observed
person. On the other hand, each person has his/her own
maximal intensity of displaying a particular facial expres-
sion. Therefore, if the obtained classification is to be
quantified (e.g., to achieve a quantified encoding of facial
actions or a quantified emotional labeling of blended
expressions), systems which can start with a generic
expression classification and then adapt to a particular
individual have an advantage. Second, it is important to
realize that the interpretation of the body language is
situation-dependent [75]. Nevertheless, the information
about the context in which a facial expression appears is
very difficult to obtain in an automatic way. This issue has
not been handled by the currently existing systems. Finally,
there is now a growing psychological research that argues
that timing of facial expressions is a critical factor in the
interpretation of expressions [1], [5], [34]. For the research-
ers of automated vision-based expression analysis, this
suggests moving towards a real-time whole-face analysis of
facial expression dynamics.
While the human mechanisms for face detection are very
robust, the same is not the case for interpretation of facial
expressions. It is often very difficult to determine the exact
nature of the expression on a person's face. According to
Bassili [1], a trained observer can correctly classify faces
showing six basic emotions with an average of 87 percent.
This ratio varies depending on several factors: the famil-
iarity with the face, the familiarity with the personality of
the observed person, the general experience with different
types of expressions, the attention given to the face and the
nonvisual cues (e.g., the context in which an expression
appears). It is interesting to note that the appearance of the
upper face features plays a more important role in face
interpretation as opposed to lower face features [22].

Fig. 1. Expressions of blended emotions (surprise-happiness).
3 AN IDEAL SYSTEM FOR FACIAL EXPRESSION ANALYSIS
Before developing an automated system for facial expres-
sion analysis, one should decide on its functionality. A good
reference point is the best known facial expression
analyzer, the human visual system. It may not be possible
to incorporate all features of the human visual system into
an automated system, and some features may even be
undesirable, but it can certainly serve as a reference point.
A first requirement that should be posed on developing
an ideal automated facial expression analyzer is that all of
the stages of the facial expression analysis are to be
performed automatically, namely, face detection, facial
expression information extraction, and facial expression
classification. Yet, actual implementation and integration of
these stages into a system are constrained by the system's
application domain. For instance, if the system is to be used
as a tool for research in behavioral science, real-time
performance is not an essential property of the system.
On the other hand, this is crucial if the system would form a
part of an advanced user-interface. Long delays make the
interaction desynchronized and less efficient. Also, having
an explanation facility that would elucidate facial action
encoding performed by the system might be useful if the
system is employed to train human experts in using FACS.
However, such facility is superfluous if the system is to be
employed in videoconferencing or as a stress-monitoring
tool. In this paper, we are mainly concerned with two major
application domains of an automated facial expression
analyzer, namely, behavioral science research and multi-
modal/media HCI. In this section, we propose an ideal
automated facial expression analyzer (Table 1) which could
be employed in those application domains and has the
properties of the human visual system.
Considering the potential applications of an automated
facial expression analyzer, which involve continuous ob-
servation of a subject in a time interval, facial image
acquisition should proceed in an automatic way. In order to
be universal, the system should be capable of analyzing
subjects of both sexes, of any age and any ethnicity. Also, no
constraints should be set on the appearance of the observed
subjects. The system should perform robustly despite
changes in lighting conditions and distractions like
glasses, changes in hair style, and facial hair like moustache,
beard and grown-together eyebrows. Similarly to the
human visual system, an ideal system would "fill in"
missing parts of the observed face and "perceive" a whole
face even when a part of it is occluded (e.g., by hand). In
most real-life situations, complete immovability of the
observed subject cannot be assumed. Hence, the system
should be able to deal with rigid head motions. Ideally, the
system would perform robust facial expression analysis
despite large changes in viewing conditions; it would be
capable of dealing with a whole range of head movements,
from frontal view to profile view acquired by a fixed
camera. This could be achieved by employing several fixed
cameras for acquiring different facial views of the examined
face (such as frontal view, and right and left profile views)
and then approximating the actual view by interpolation
among the acquired views. Having no constraints set on the
rigid head motions of the subject can also be achieved by
having a camera mounted on the subject's head and placed
in front of his/her face.
An ideal system should perform robust automatic face
detection and facial expression information extraction in the
acquired images or image sequences. Considering the state-
of-the-art in image processing techniques, inaccurate, noisy,
and missing data could be expected. An ideal system
should be capable of dealing with these inaccuracies. In
addition, certainty of the extracted facial expression
information should be taken into account.
An ideal system should be able to perform analysis of all
visually distinguishable facial expressions. Well-defined
face representation is a prerequisite for achieving this. The
face representation should be such that a particular
alteration of the face model uniquely reveals a particular
facial expression. In general, an ideal system should be able
to distinguish:
1. all possible facial expressions (a reference point is
a total of 44 facial actions defined in FACS [21]
whose combinations form the complete set of facial
expressions),
2. any bilateral or unilateral facial change [21], and
3. facial expressions with a similar facial appearance
(e.g., upward pull of the upper lip and nose
wrinkling which also causes the upward pull of
the upper lip [21]).

TABLE 1
Properties of an Ideal Analyzer

In practice, it may not be possible to define a face model
that can satisfy both, to reflect each and every change in
facial appearance and whose features are detectable in a
facial image or image sequence. Still, the set of distinct facial
expressions that the system can distinguish should be as
copious as possible.
If the system is to be used for behavioral science research
purposes it should perform facial expression recognition as
applied to automated FACS encoding. As explained by
Bartlett et al. [16], [17], this means that it should accomplish
multiple quantified expression classification in terms of
44 AUs defined in FACS. If the system is to be used as a part
of an advanced multimodal/media HCI, the system should
be able to interpret the shown facial expressions (e.g., in
terms of emotions). Since psychological researchers dis-
agree on existence of universal categories of emotional
facial displays, an ideal system should be able to adapt the
classification mechanism according to the user's subjective
interpretation of expressions, e.g., as suggested in [40]. Also,
it is definitely not the case that each and every facial
expression able to be displayed on the face can be classified
into one and only one emotion class. Think about blended
emotional displays such as raised eyebrow and smiling
mouth (Fig. 1). This expression might be classified in two
emotion categories defined by Ekman and Friesen
[20], surprise and happiness. Yet, according to the descrip-
tions of these prototypic expressions given by Ekman and
Friesen [20], the left hand side facial expression shown in
Fig. 1 belongs "more" to the surprise than to the
happiness class. For instance, in the left hand side image
the "percentage" of shown surprise is higher than the
"percentage" of shown happiness while those percentages
are approximately the same in the case of the right hand
side image. In order to obtain an accurate categorization, an
ideal analyzer should perform quantified classification of
facial expression into multiple emotion categories.
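
One way to picture the quantified, multiple-category labeling argued for above is to normalize per-emotion evidence into percentages, so that a blend such as the one in Fig. 1 is reported as mostly surprise with some happiness rather than as a single label. The scoring function below is an illustrative sketch, not a method proposed in the paper.

```python
def quantify_blend(category_scores):
    """Turn nonnegative per-emotion evidence values into percentages (sketch)."""
    total = sum(category_scores.values()) or 1.0
    return {emotion: 100.0 * score / total
            for emotion, score in category_scores.items()}

# For the left-hand image of Fig. 1 one would expect something like
# quantify_blend({"surprise": 0.7, "happiness": 0.3})
# -> {"surprise": 70.0, "happiness": 30.0}, i.e. "more" surprise than happiness.
```
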
4 AUTOMATIC FACIAL EXPRESSION ANALYSIS
For its utility in application domains of human behavior
interpretation and multimodal/media HCI, automatic facial
expression analysis has attracted the interest of many
computer vision researchers. Since the mid 1970s, different
approaches have been proposed for facial expression analysis from
either static facial images or image sequences. In 1992,
Samal and Iyengar [79] gave an overview of the early
works. This paper explores and compares approaches to
automatic facial expression analysis that have been devel-
oped recently, i.e., in the late 1990s. Before surveying these
works in detail, we are giving a short overview of the
systems for facial expression analysis proposed in the
period of 1991 to 1995.
Table 2 summarizes the features of these systems in
respect to the requirements posed on design of an ideal
facial expression analyzer. None of these systems performs
a quantified expression classification in terms of facial
actions. Also, except the system proposed by Moses et al.
[57], no system listed in Table 2 performs in real-time.
Therefore, these properties (i.e., columns 15 and 20) have
been excluded from Table 2 (. stands for "yes," X stands for
"no," and - represents a missing entry). A missing entry
either means that it is not reported on the issue or that the
issue is not applicable to the system in question. An
inapplicable issue, for instance, is the issue of dealing with
rigid head motions and inaccurate facial data in the cases
where the input data were hand measured (e.g., [40]). Some
of the methods listed in Table 2 do not perform automatic
facial data extraction (see column 8); the others achieve this
by using facial motion analysis [52], [100], [101], [73], [57].
Except the method proposed by Kearney and McKenzie
[40], which performs facial expression classification in
TABLE 2
Early Methods for Automatic Facial Expression Analysis
