
Automatic analysis of facial expressions: the state of the art

TL;DR: The capability of the human visual system with respect to these problems is discussed, and it is meant to serve as an ultimate goal and a guide for determining recommendations for development of an automatic facial expression analyzer.
Abstract: Humans detect and interpret faces and facial expressions in a scene with little or no effort. Still, development of an automated system that accomplishes this task is rather difficult. There are several related problems: detection of an image segment as a face, extraction of the facial expression information, and classification of the expression (e.g., in emotion categories). A system that performs these operations accurately and in real time would form a big step in achieving a human-like interaction between man and machine. The paper surveys the past work in solving these problems. The capability of the human visual system with respect to these problems is discussed, too. It is meant to serve as an ultimate goal and a guide for determining recommendations for development of an automatic facial expression analyzer.

Summary

1 INTRODUCTION

  • The main characteristics of human communication are: multiplicity and multimodality of communication channels.
  • The characteristics of an ideal automated system for facial expression analysis are given in Section 3.

2 FACIAL EXPRESSION ANALYSIS

  • The authors' aim is to explore the issues in design and implementation of a system that could perform automated facial expression analysis.
  • In general, three main steps can be distinguished in tackling the problem.
  • The face model features are the features used to represent (model) the face.
  • The applied face representation and the kind of input images determine the choice of mechanisms for automatic extraction of facial expression information.
  • A good reference point is the functionality of the human visual system.

2.1 Face Detection

  • For most works in automatic facial expression analysis, the conditions under which a facial image or image sequence is obtained are controlled.
  • Determining the exact location of the face in a digitized facial image is a more complex problem.
  • First, the scale and the orientation of the face can vary from image to image.
  • The presence of noise and occlusion makes the problem even more difficult.
  • The presence of the features and their geometrical relationship with each other appears to be more important than the details of the features [5].

2.2 Facial Expression Data Extraction

  • After the presence of a face has been detected in the observed scene, the next step is to extract the information about the encountered facial expression in an automatic way.
  • One of the fundamental issues about the facial expression analysis is the representation of the visual information that an examined face might reveal [102].
  • The results of Johansson's point-light display experiments [1], [5], gave a clue to this problem.
  • The face can be also modeled using a so-called hybrid approach, which typifies a combination of analytic and holistic approaches to face representation.
  • This rules out a search for fixed patterns in the images.

2.3 Facial Expression Classification

  • After the face and its appearance have been perceived, the next step of an automated expression analyzer is to "identify" the facial expression conveyed by the face.
  • The Facial Action Coding System (FACS) [21] is probably the most known study on facial activity.
  • Second, classification of facial expressions into multiple emotion categories should be feasible (e.g., raised eyebrows and smiling mouth is a blend of surprise and happiness, Fig. 1).
  • First, the system should be capable of analyzing any subject, male or female of any age and ethnicity.
  • While the human mechanisms for face detection are very robust, the same is not the case for interpretation of facial expressions.

3 AN IDEAL SYSTEM FOR FACIAL EXPRESSION ANALYSIS

  • Before developing an automated system for facial expression analysis, one should decide on its functionality.
  • It may not be possible to incorporate all features of the human visual system into an automated system, and some features may even be undesirable, but it can certainly serve as a reference point.
  • Yet, actual implementation and integration of these stages into a system are constrained by the system's application domain.
  • An ideal system should be able to perform analysis of all visually distinguishable facial expressions.
  • Yet, according to the descriptions of these prototypic expressions given by Ekman and Friesen [20], the left hand side facial expression shown in Fig. 1 belongs "more" to the surprise than to the happiness class.

4 AUTOMATIC FACIAL EXPRESSION ANALYSIS

  • For its utility in application domains of human behavior interpretation and multimodal/media HCI, automatic facial expression analysis has attracted the interest of many computer vision researchers.
  • Before surveying these works in detail, the authors are giving a short overview of the systems for facial expression analysis proposed in the period of 1991 to 1995.
  • Therefore, these properties (i.e., columns 15 and 20) have been excluded from Table 2 (. stands for "yes," X stands for "no," and - represents a missing entry).
  • These systems primarily concern facial expression animation and do not attempt to classify the observed facial expression either in terms of facial actions or in terms of emotion categories.

4.1 Face Detection

  • For most of the work in automatic facial expression analysis, the conditions under which an image is obtained are controlled.
  • The camera is either mounted on a helmet-like device worn by the subject (e.g., [62], [59]) or placed in such a way that the image has the face in frontal view.
  • Hence, the presence of the face in the scene is ensured and some global location of the face in the scene is known a priori.
  • In the second, analytic approach, the face is detected by detecting some important facial features first (e.g., the irises and the nostrils).

4.1.1 Face Detection in Facial Images

  • To represent the face, Huang and Huang [32] apply a point distribution model (PDM).
  • The face should be without facial hair and glasses, no rigid head motion may be encountered and illumination variations must be linear for the system to work correctly.
  • To localize the contour of the face, they use an algorithm based on the HSV color model, which is similar to the algorithm based on the relative RGB model [103].
  • Once the irises are identified, the overall location of the face is determined by using relative locations of the facial features in the face.
  • Yoneyama et al. [104] use an analytic approach to face detection too.
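
The HSV-based skin-color localization mentioned above relies on skin tones occupying a fairly compact region of hue-saturation space. The sketch below illustrates only the general idea; the threshold values, the morphological clean-up, and the function name are illustrative assumptions, not the algorithm of [103] or of the surveyed system.

```python
import cv2
import numpy as np

def localize_face_hsv(bgr_image):
    """Rough face localization by skin-color thresholding in HSV space (sketch)."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Assumed skin-tone range; not the bounds used by the surveyed systems.
    lower = np.array([0, 40, 60], dtype=np.uint8)
    upper = np.array([25, 255, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)
    # Remove small speckles, then keep the largest connected skin region.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)  # OpenCV 4.x
    if not contours:
        return None
    face = max(contours, key=cv2.contourArea)
    return cv2.boundingRect(face)  # (x, y, w, h) of the face candidate
```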

4.1.2 Face Detection in Arbitrary Images

  • Two of the works surveyed in this paper perform automatic face detection in an arbitrary scene.
  • Hong et al. [30] utilize the PersonSpotter system [81] in order to perform a realtime tracking of the head.
  • The box bounding the head is used then as the image to which an initial labeled graph is fitted.
  • By inspecting the local maxima of the disparity histogram, image regions confined to a certain disparity interval are selected.
  • Essa and Pentland [24] use the eigenspace method of Pentland et al. [65] to locate faces in an arbitrary scene.
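
The eigenspace method referenced in the last bullet scores image windows by how well they can be reconstructed from a small set of eigenfaces (the "distance from face space"). Below is a minimal sketch of that idea, assuming a collection of aligned, vectorized training faces; it is not the actual implementation of Essa and Pentland [24] or Pentland et al. [65].

```python
import numpy as np

def build_face_space(train_faces, k=20):
    """Learn an eigenface subspace from vectorized training faces of shape (n_samples, n_pixels)."""
    mean = train_faces.mean(axis=0)
    centered = train_faces - mean
    # Principal components of the training set (eigenvectors of the covariance matrix).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]                      # k eigenfaces

def distance_from_face_space(window, mean, eigenfaces):
    """Reconstruction error of one image window in the face space.

    Small values indicate face-like windows; scanning windows over the image
    and thresholding this value yields candidate face locations.
    """
    x = window.ravel().astype(float) - mean
    coeffs = eigenfaces @ x                  # project onto the subspace
    reconstruction = eigenfaces.T @ coeffs   # back-project into image space
    return float(np.linalg.norm(x - reconstruction))
```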

4.2 Facial Expression Data Extraction

  • After the presence of a face is detected in the observed scene, the next step is to extract the information about the shown facial expression.
  • Both the applied face representation and the kind of input images affect the choice of the approach to facial expression data extraction.
  • The face representations used by the surveyed systems are listed in Table 5.
  • Template-based methods fit a holistic face model to the input image or track it in the input image sequence.
  • The methods utilized by the surveyed systems are listed in Table 6.

4.2.1 Facial Data Extraction from Static Images: Template-Based Methods

  • As shown in Table 3, several surveyed systems can be classified as methods for facial expression analysis from static images.
  • To build their model they used facial images that were manually labeled with 122 points localized around the facial features.
  • Hong et al. use wavelets of five different frequencies and eight different orientations.
  • Padgett and Cottrell [61] also use a holistic face representation, but they do not deal with facial expression information extraction in an automatic way.
  • Hence, the method will fail to recognize any facial appearance change that involves a horizontal movement of the facial features.
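
The labeled-graph representation used by Hong et al. attaches to each facial point a "jet" of Gabor-wavelet responses at several frequencies and orientations. The following is a hedged sketch of such a jet; the five frequencies, the kernel size, and the function names are illustrative assumptions rather than the parameters reported in [30].

```python
import numpy as np

def gabor_kernel(frequency, theta, sigma=3.0, size=21):
    """Complex Gabor kernel at a given spatial frequency and orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_theta = x * np.cos(theta) + y * np.sin(theta)      # rotate into the filter orientation
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.exp(2j * np.pi * frequency * x_theta)
    return envelope * carrier

def gabor_jet(gray_image, point, frequencies=(0.05, 0.08, 0.12, 0.18, 0.27),
              n_orientations=8, size=21):
    """Vector of Gabor response magnitudes at one facial point (5 x 8 = 40 values)."""
    r, c = point
    half = size // 2
    patch = gray_image[r - half:r + half + 1, c - half:c + half + 1].astype(float)
    jet = []
    for f in frequencies:
        for k in range(n_orientations):
            kernel = gabor_kernel(f, theta=k * np.pi / n_orientations, size=size)
            jet.append(abs(np.sum(patch * kernel)))      # magnitude of the filter response
    return np.array(jet)
```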

4.2.2 Facial Data Extraction from Static Images: Feature-Based Methods

  • The second category of the surveyed methods for automatic facial expression analysis from static images uses an analytic approach to face representation (Table 3, Table 5) and applies a feature-based method for expression information extraction from an input image.
  • In their later work [42], they utilize a CCD camera in monochrome mode to obtain a set of brightness distributions of 13 vertical lines crossing the FCPs.
  • Pantic and Rothkrantz [62] utilize a point-based model composed of two 2D facial views, the frontal and the side view.
  • Then, the best of the acquired results is chosen.
  • These data are used further for expression emotional classification.
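
To make the brightness-distribution idea of [42] concrete, the sketch below samples gray values along vertical lines that cross the facial characteristic points and concatenates them into a single feature vector; with 13 lines and 18 samples per line this would give the 234 inputs of the network mentioned in Section 4.3.2. The sampling span and the per-line normalization are assumptions.

```python
import numpy as np

def brightness_profiles(gray_image, fcp_columns, top_row, bottom_row, samples_per_line=18):
    """Brightness samples along vertical lines crossing the facial characteristic points (sketch)."""
    rows = np.linspace(top_row, bottom_row, samples_per_line).astype(int)
    profiles = []
    for col in fcp_columns:                                  # one vertical line per FCP column
        line = gray_image[rows, int(col)].astype(float)
        line = (line - line.mean()) / (line.std() + 1e-8)    # normalize each line's brightness
        profiles.append(line)
    return np.concatenate(profiles)                          # e.g. 13 x 18 = 234 values
```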

4.2.3 Facial Data Extraction from Image Sequences: Template-Based Methods

  • A first category of the surveyed approaches to automatic facial expression analysis from image sequences uses a holistic or a hybrid approach to face representation (Table 3, Table 5) and applies a template-based method for facial expression information extraction from an input image sequence.
  • First, they applied the eigenspace method [65] to automatically track the face in the scene (Section 4.1.2) and extract the positions of the eyes, nose, and mouth.
  • Essa and Pentland use the optical flow computation method proposed by Simoncelli [80].
  • The flow covariances between different frames are stored and used in the subsequent processing.
  • To fit the Potential Net to a normalized facial image (see Section 4.1.1), they compute first the edge image by applying a differential filter.

4.2.4 Facial Data Extraction from Image Sequences: Feature-Based Methods

  • Only one of the surveyed methods for automatic facial expression analysis from image sequences utilizes an analytic face representation (Table 3, Table 5) and applies a feature-based method for facial expression information extraction.
  • Cohn et al. [10] use a model of facial landmark points localized around the facial features, hand-marked with a mouse device in the first frame of an examined image sequence.
  • In the rest of the frames, a hierarchical optical flow method [49] is used to track the optical flow in 13 x 13 windows surrounding the landmark points (a sketch of such tracking follows after this list).
  • The displacement of each landmark point is calculated by subtracting its normalized position in the first frame from its current normalized position (all frames of an input sequence are manually normalized).
  • The face should be without facial hair/ glasses, no rigid head motion may be encountered, the first frame should be an expressionless face, and the facial landmark points should be marked in the first frame for the method to work correctly.
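
A minimal sketch of landmark tracking in the spirit of Cohn et al. [10]: landmarks marked in the first frame are propagated through the sequence with pyramidal Lucas-Kanade optical flow over 13 x 13 windows (standing in here for the hierarchical method of [49]), and displacements are measured against the first frame. The frames are assumed to be grayscale and already normalized, as in the surveyed system.

```python
import cv2
import numpy as np

def track_landmarks(frames, initial_points):
    """Track hand-marked facial landmarks through a grayscale image sequence.

    frames: list of grayscale frames; initial_points: (N, 2) array of (x, y)
    landmarks marked in the first frame.  Returns, for each later frame, the
    displacement of every landmark relative to its first-frame position.
    """
    p0 = initial_points.astype(np.float32).reshape(-1, 1, 2)
    reference = p0.copy()
    displacements = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        p1, status, _err = cv2.calcOpticalFlowPyrLK(
            prev, curr, p0, None, winSize=(13, 13), maxLevel=3)
        p0 = p1                                              # tracked points seed the next step
        displacements.append((p1 - reference).reshape(-1, 2))
    return displacements
```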

4.3 Facial Expression Classification

  • The last step of facial expression analysis is to classify (identify, interpret) the facial display conveyed by the face.
  • The applied methods for expression classification in terms of facial actions are summarized in Table 7.
  • If a template-based classification method is applied, the encountered facial expression is compared to the templates defined for each expression category.
  • Most of the neural-network-based classification methods utilized by the surveyed systems perform facial expression classification into a single category.
  • The authors are doing so because the overall characteristics of these methods better fit the overall properties of the template-based expression classification approaches.

4.3.1 Expression Classification from Static Images: Template-Based Methods

  • A first category of the surveyed methods for automatic expression analysis from static images applies a template-based method for expression classification.
  • The personalized galleries of nine people have been utilized, where each gallery contained 28 images (four images per expression).
  • The achieved recognition rate was 89 percent in the case of the familiar subjects and 73 percent in the case of unknown persons.
  • In order to perform emotional classification of the observed facial expression, Huang and Huang [32] perform an intermediate step by calculating 10 Action Parameters (APs, Fig. 12).
  • An input LG vector is classified by being projected along the discriminant vectors calculated for each independently trained binary classifier.
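
Template-based classification, in its simplest form, compares the feature vector extracted from the observed expression with stored example vectors for each category (for instance, the images of a personalized gallery) and returns the closest category. The sketch below is a generic nearest-template classifier under that assumption, not the matching procedure of any particular surveyed system.

```python
import numpy as np

def classify_by_template(feature_vector, templates):
    """Nearest-template classification.

    templates: dict mapping an expression label to a list of stored feature
    vectors.  Returns the label of the closest template and its distance.
    """
    best_label, best_distance = None, np.inf
    for label, examples in templates.items():
        for template in examples:
            d = np.linalg.norm(feature_vector - template)
            if d < best_distance:
                best_label, best_distance = label, d
    return best_label, best_distance
```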

4.3.2 Expression Classification from Static Images: Neural-Network-Based Methods

  • A second category of the surveyed methods for automatic facial expression analysis from static images applies a neural network for facial expression classification.
  • For classification of expression into one of six basic emotion categories, Hara and Kobayashi [42] apply a 234 x 50 x 6 back-propagation neural network (a sketch of such a network follows after this list).
  • The average recognition rate was 85 percent.
  • This process has been repeated for each of the 10 segments and the results of all 10 trained networks have been averaged.
  • The difference between a distance measured in an examined image and the same distance measured in an expressionless face of the same person was normalized.
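
For orientation, here is a sketch of a 234 x 50 x 6 feed-forward network of the kind mentioned above: 234 inputs (e.g., the brightness samples of Section 4.2.2), 50 hidden units, and one output per basic emotion category. Only the forward pass is shown; in the surveyed work the weights are learned by back-propagation, and the activation functions and output normalization used here are assumptions.

```python
import numpy as np

class SmallMLP:
    """Sketch of a 234-50-6 feed-forward network (forward pass only)."""

    def __init__(self, n_in=234, n_hidden=50, n_out=6, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        hidden = np.tanh(x @ self.w1 + self.b1)     # 50 hidden units
        scores = hidden @ self.w2 + self.b2         # one score per emotion category
        exp = np.exp(scores - scores.max())
        return exp / exp.sum()                      # normalized per-category outputs

# Usage: probs = SmallMLP().forward(feature_vector_of_234_values);
# the emotion with the highest output is taken as the recognized category.
```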

4.3.3 Expression Classification from Static Images: Rule-Based Methods

  • Just one of the surveyed methods for automatic facial expression analysis from static images applies a rule-based approach to expression classification.
  • From the localized contours of the facial features, the model features (Fig. 7) are extracted.
  • Based on the knowledge acquired from FACS [21], the production rules classify the calculated model deformation into the appropriate AUs-classes (total number of classes is 31).
  • Classification of an input facial dual-view into multiple emotion categories is performed by comparing the AU-coded description of the shown facial expression to AU-coded descriptions of six basic emotional expressions, which have been acquired from the linguistic descriptions given by Ekman [22].
  • The dual-views used for testing of the system have been recorded under constant illumination and none of the subjects had a moustache, a beard, or wore glasses.
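
A hedged sketch of the rule-based scheme described above: production rules map model-feature deformations to AUs, and the resulting AU-coded description is compared with AU-coded descriptions of the basic emotional expressions to obtain a (possibly multiple) emotion labeling. The rules, thresholds, and prototype AU sets below are illustrative placeholders, not the actual rule base of [62] or verbatim FACS definitions.

```python
def deformations_to_aus(deform):
    """Map model-feature deformations (e.g. {'inner_brow_raise': 0.7}) to a set of AUs."""
    aus = set()
    if deform.get("inner_brow_raise", 0.0) > 0.5:
        aus.add("AU1")
    if deform.get("outer_brow_raise", 0.0) > 0.5:
        aus.add("AU2")
    if deform.get("lip_corner_pull", 0.0) > 0.5:
        aus.add("AU12")
    if deform.get("jaw_drop", 0.0) > 0.5:
        aus.add("AU26")
    return aus

# Illustrative AU-coded emotion prototypes (placeholders, not quoted from [22]).
EMOTION_PROTOTYPES = {
    "surprise":  {"AU1", "AU2", "AU26"},
    "happiness": {"AU12"},
}

def score_emotions(aus):
    """Quantified match between the observed AU set and each emotion prototype."""
    return {emotion: len(aus & proto) / len(proto)
            for emotion, proto in EMOTION_PROTOTYPES.items()}
```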

4.3.4 Expression Classification from Image Sequences: Template-Based Methods

  • The first category of the surveyed methods for automatic facial expression analysis from facial image sequences applies a template-based method for expression classification.
  • Image sequences (504) containing 872 facial actions displayed by 100 subjects have been used.
  • The method was tested on image sequences shown by the same subjects.
  • The category of an expression is decided by determining the minimal distance between the actual trajectory of FEFPs and the trajectories defined by the models.
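
The trajectory-based decision in the last bullet can be pictured as follows: the tracked FEFP trajectory of the input sequence is compared with one stored model trajectory per expression category, and the category at minimal distance wins. The distance measure and the assumption that trajectories are resampled to equal length are simplifications for illustration.

```python
import numpy as np

def trajectory_distance(actual, model):
    """Mean point-wise distance between two FEFP trajectories of shape (n_frames, n_points, 2)."""
    return float(np.mean(np.linalg.norm(actual - model, axis=-1)))

def classify_trajectory(actual, model_trajectories):
    """Pick the expression category whose model trajectory is closest to the observed one."""
    distances = {label: trajectory_distance(actual, model)
                 for label, model in model_trajectories.items()}
    return min(distances, key=distances.get), distances
```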

4.3.5 Expression Classification from Image Sequences: Rule-Based Methods

  • Just one of the surveyed methods for automatic facial expression analysis from image sequences applies a rule-based approach to expression classification.
  • The motion parameters (e.g., translation and divergence) are used to derive the midlevel predicates that describe the motion of the facial features.
  • For each of the six basic emotional expressions, they developed a model represented by a set of rules for detecting the beginning and ending of the expression.
  • The rules are applied to the predicates of the midlevel representation.
  • The achieved recognition rate was 88 percent.

5 DISCUSSION

  • The authors have explored and compared a number of different recently presented approaches to facial expression detection and classification in static images and image sequences.
  • The number of the surveyed systems is rather large and the reader might be interested in the results of the performed comparison in terms of the best performances.
  • Yet, the authors deliberately didn't make an attempt to label some of the surveyed systems as being better than some other systems presented in the literature.
  • The authors believe that a well-defined and commonly used single database of testing images (image sequences) is the necessary prerequisite for "ranking" the performances of the proposed systems in an objective manner.

5.1 Detection of the Face and Its Features

  • Most of the currently existing systems for facial expression analysis assume that the presence of a face in the scene is ensured.
  • In many instances, the systems do not utilize a camera setting that will ascertain the correctness of that assumption.
  • In addition, in many instances strong assumptions are made to make the problem of facial expression analysis more tractable (Table 6).
  • Thus, if a fixed camera acquires the images, the system should be capable of dealing with rigid head motions.
  • Yet, only the method proposed by Essa and Pentland [24] deals with the facial images of faces with facial hair and/or eyeglasses.

5.2 Facial Expression Classification

  • In general, the existing expression analyzers perform a singular classification of the examined expression into one of the basic emotion categories proposed by Ekman and Friesen [20].
  • Defining interpretation categories into which any facial expression can be classified is one of the key challenges in the design of a realistic facial expression analyzer.
  • In addition, each person has his/her own maximal intensity of displaying a particular facial action.
  • Also, none of the surveyed systems can distinguish all 44 AUs defined in FACS.

6 CONCLUSION

  • Analysis of facial expressions is an intriguing problem which humans solve with quite an apparent ease.
  • Capability of the human visual system in solving these problems has been discussed.
  • Also, all of the proposed approaches to automatic expression analysis perform only facial expression classification into the basic emotion categories defined by Ekman and Friesen [20].
  • Furthermore, some of the surveyed methods have been tested only on the set of images used for training.
  • The authors hesitate to believe that those systems are person-independent, which, in turn, should be a basic property of a behavioral science research tool or of an advanced HCI.


Automatic Analysis of Facial Expressions:
The State of the Art
Maja Pantic, Student Member, IEEE, and Leon J.M. Rothkrantz
Abstract: Humans detect and interpret faces and facial expressions in a scene with little or no effort. Still, development of an
automated system that accomplishes this task is rather difficult. There are several related problems: detection of an image segment as
a face, extraction of the facial expression information, and classification of the expression (e.g., in emotion categories). A system that
performs these operations accurately and in real time would form a big step in achieving a human-like interaction between man and
machine. This paper surveys the past work in solving these problems. The capability of the human visual system with respect to these
problems is discussed, too. It is meant to serve as an ultimate goal and a guide for determining recommendations for development of
an automatic facial expression analyzer.
Index Terms: Face detection, facial expression information extraction, facial action encoding, facial expression emotional
classification.
1 INTRODUCTION
As pointed out by Bruce [6], Takeuchi and Nagao [84],
and Hara and Kobayashi [28], human face-to-face
communication is an ideal model for designing a multi-
modal/media human-computer interface (HCI). The main
characteristics of human communication are: multiplicity
and multimodality of communication channels. A channel
is a communication medium while a modality is a sense
used to perceive signals from the outside world. Examples
of human communication channels are: auditory channel
that carries speech, auditory channel that carries vocal
intonation, visual channel that carries facial expressions,
and visual channel that carries body movements. The
senses of sight, hearing, and touch are examples of
modalities. In usual face-to-face communication, many
channels are used and different modalities are activated.
As a result, communication becomes highly flexible and
robust. Failure of one channel is recovered by another
channel and a message in one channel can be explained by
another channel. This is how a multimedia/modal HCI
should be developed for facilitating robust, natural,
efficient, and effective man-machine interaction.
Relatively few existing works combine different mod-
alities into a single system for human communicative
reaction analysis. Examples are the works of Chen et al.
[9] and De Silva et al. [15] who studied the effects of a
combined detection of facial and vocal expressions of
emotions. So far, the majority of studies treat various
human communication channels separately, as indicated by
Nakatsu [58]. Examples for the presented systems are:
emotional interpretation of human voices [35], [66], [68],
[90], emotion recognition by physiological signals pattern
recognition [67], detection and interpretation of hand
gestures [64], recognition of body movements [29], [97],
[46], and facial expression analysis (this survey).
The terms "face-to-face" and "interface" indicate that the
face plays an essential role in interpersonal communication.
The face is the means to identify other members of the
species, to interpret what has been said by the means of
lipreading, and to understand someone's emotional state
and intentions on the basis of the shown facial expression.
Personality, attractiveness, age, and gender can also be seen
from someone's face. Considerable research in social
psychology has also shown that facial expressions help
coordinate conversation [4], [82], and have considerably
more effect on whether a listener feels liked or disliked than
the speaker's spoken words [55]. Mehrabian indicated that
the verbal part (i.e., spoken words) of a message contributes
only for 7 percent to the effect of the message as a whole, the
vocal part (e.g., voice intonation) contributes for 38 percent,
while facial expression of the speaker contributes for
55 percent to the effect of the spoken message [55]. This
implies that the facial expressions form the major modality
in human communication.
Recent advances in image analysis and pattern recogni-
tion open up the possibility of automatic detection and
classification of emotional and conversational facial signals.
Automating facial expression analysis could bring facial
expressions into man-machine interaction as a new mod-
ality and make the interaction tighter and more efficient.
Such a system could also make classification of facial
expressions widely accessible as a tool for research in
behavioral science and medicine. The goal of this paper is to
survey the work done in automating facial expression
analysis in facial images and image sequences. Section 2
identifies three basic problems related to facial expression
analysis. These problems are: face detection in a facial
image or image sequence, facial expression data extraction,
and facial expression classification. The capability of the
human visual system with respect to these problems is
described. It defines, in some way, the expectations for an
automated system. The characteristics of an ideal auto-
mated system for facial expression analysis are given in
Section 3. Section 4 surveys the techniques presented in the
literature in the past decade for facial expression analysis by
a computer. Their characteristics are summarized in respect
to the requirements posed on the design of an ideal facial
expression analyzer. We do not attempt to provide an
exhaustive review of the past work in each of the problems
related to automatic facial expression analysis. Here, we
selectively discuss systems which deal with each of these
problems. Possible directions for future research are
discussed in Section 5. Section 6 concludes the paper.
2 FACIAL EXPRESSION ANALYSIS
Our aim is to explore the issues in design and implementa-
tion of a system that could perform automated facial
expression analysis. In general, three main steps can be
distinguished in tackling the problem. First, before a facial
expression can be analyzed, the face must be detected in a
scene. Next is to devise mechanisms for extracting the facial
expression information from the observed facial image or
image sequence. In the case of static images, the process of
extracting the facial expression information is referred to as
localizing the face and its features in the scene. In the case of
facial image sequences, this process is referred to as tracking
the face and its features in the scene. At this point, a clear
distinction should be made between two terms, namely,
facial features and face model features. The facial features are
the prominent features of the face: eyebrows, eyes, nose,
mouth, and chin. The face model features are the features
used to represent (model) the face. The face can be
represented in various ways, e.g., as a whole unit (holistic
representation), as a set of features (analytic representation)
or as a combination of these (hybrid approach). The applied
face representation and the kind of input images determine
the choice of mechanisms for automatic extraction of facial
expression information. The final step is to define some set
of categories, which we want to use for facial expression
classification and/or facial expression interpretation, and to
devise the mechanism of categorization.
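
To make this three-step decomposition concrete, the skeleton below strings the steps together; the detector, extractor, and classifier names are hypothetical placeholders that stand for any of the concrete techniques surveyed in Section 4, not components defined in this paper.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class ExpressionAnalysis:
    face_region: Any                 # where the face was detected
    face_model: Any                  # extracted face model features
    categories: Dict[str, float]     # quantified classification result

def analyze_expression(image, detector, extractor, classifier):
    """Skeleton of the three-step process described above (placeholder components)."""
    face_region = detector(image)                  # 1. face detection
    face_model = extractor(image, face_region)     # 2. facial expression data extraction
    categories = classifier(face_model)            # 3. facial expression classification
    return ExpressionAnalysis(face_region, face_model, categories)
```
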
Before an automated facial expression analyzer is built, one
should decide on the system's functionality. A good reference
point is the functionality of the human visual system. After all,
it is the best known facial expression analyzer. This section
discusses the three basic problems related to the process of
facial expression analysis as well as the capability of the
human visual system with respect to these.
2.1 Face Detection
For most works in automatic facial expression analysis, the
conditions under which a facial image or image sequence is
obtained are controlled. Usually, the image has the face in
frontal view. Hence, the presence of a face in the scene is
ensured and some global location of the face in the scene is
known a priori. However, determining the exact location of
the face in a digitized facial image is a more complex
problem. First, the scale and the orientation of the face can
vary from image to image. If the mugshots are taken with a
fixed camera, faces can occur in images at various sizes and
angles due to the movements of the observed person. Thus,
it is difficult to search for a fixed pattern (template) in the
image. The presence of noise and occlusion makes the
problem even more difficult.
Humans detect a facial pattern by casual inspection of
the scene. We detect faces effortlessly in a wide range of
conditions, under bad lighting conditions or from a great
distance. It is generally believed that two-gray-level
images of 100 to 200 pixels form a lower limit for detection
of a face by a human observer [75], [8]. Another character-
istic of the human visual system is that a face is perceived as
a whole, not as a collection of the facial features. The
presence of the features and their geometrical relationship
with each other appears to be more important than the
details of the features [5]. When a face is partially occluded
(e.g., by a hand), we perceive a whole face, as if our
perceptual system fills in the missing parts. This is very
difficult (if possible at all) to achieve by a computer.
2.2 Facial Expression Data Extraction
After the presence of a face has been detected in the
observed scene, the next step is to extract the information
about the encountered facial expression in an automatic
way. If the extraction cannot be performed automatically, a
fully automatic facial expression analyzer cannot be
developed. Both, the applied face representation and the
kind of input images affect the choice of the approach to
facial expression information extraction.
One of the fundamental issues about the facial expres-
sion analysis is the representation of the visual information
that an examined face might reveal [102]. The results of
Johansson's point-light display experiments [1], [5], gave a
clue to this problem. The experiments suggest that the
visual properties of the face, regarding the information
about the shown facial expression, could be made clear by
describing the movements of points belonging to the facial
features (eyebrows, eyes, and mouth) and then by analyzing
the relationships between those movements. This triggered
the researchers of vision-based facial gesture analysis to
make different attempts to define point-based visual
properties of facial expressions. Various analytic face
representations resulted, in which the face is modeled as a
set of facial points (e.g., Fig. 6, [42], Fig. 7, [61]) or as a set of
templates fitted to the facial features such as the eyes and
the mouth. In another approach to face representation
(holistic approach), the face is represented as a whole unit.
A 3D wire-frame with a mapped texture (e.g., [86]) and a
spatio-temporal model of facial image motion (e.g., Fig. 8,
[2]) are typical examples of the holistic approaches to face
representation. The face can be also modeled using a so-
called hybrid approach, which typifies a combination of
analytic and holistic approaches to face representation. In
this approach, a set of facial points is usually used to
determine an initial position of a template that models the
face (e.g., Fig. 10, [40]).
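
The three styles of face representation can be pictured as simple data structures: an analytic model stores a set of labeled fiducial points, a holistic model stores the face as one unit (an appearance patch or a motion field), and a hybrid model uses points to anchor a holistic template. The field names below are illustrative assumptions, not structures defined in the surveyed systems.

```python
from dataclasses import dataclass
from typing import Dict, Tuple
import numpy as np

@dataclass
class AnalyticFaceModel:
    """Analytic representation: the face as a set of labeled fiducial points."""
    points: Dict[str, Tuple[float, float]]   # e.g. {"left_eye_inner": (x, y), ...}

@dataclass
class HolisticFaceModel:
    """Holistic representation: the face as a whole unit (image patch or flow field)."""
    appearance: np.ndarray

@dataclass
class HybridFaceModel:
    """Hybrid representation: fiducial points used to place a holistic template."""
    points: Dict[str, Tuple[float, float]]
    template: np.ndarray
```
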
Irrespective of the kind of face model applied,
attempts must be made to model and then extract the
information about the displayed facial expression without
losing any (or much) of that information. Several factors
make this task complex. The first is the presence of facial
hair, glasses, etc., which obscure the facial features. Another
problem is the variation in size and orientation of the face in
input images. This rules out a search for fixed patterns in the
images. Finally, noise and occlusion are always present to
some extent.
As indicated by Ellis [23], human encoding of the visual
stimulus (face and its expression) may be in the form of a
primal sketch and may be hardwired. However, not much
else is known in terms of the nature of internal representa-
tion of a face in the human brain.
2.3 Facial Expression Classification
After the face and its appearance have been perceived, the
next step of an automated expression analyzer is to
"identify" the facial expression conveyed by the face. A
fundamental issue about the facial expression classification
is to define a set of categories we want to deal with. A
related issue is to devise mechanisms of categorization.
Facial expressions can be classified in various ways: in
terms of facial actions that cause an expression, in terms of
some nonprototypic expressions such as "raised brows" or
in terms of some prototypic expressions such as emotional
expressions.
The Facial Action Coding System (FACS) [21] is probably
the most known study on facial activity. It is a system that
has been developed to facilitate objective measurement of
facial activity for behavioral science investigations of the
face. FACS is designed for human observers to detect
independent subtle changes in facial appearance caused by
contractions of the facial muscles. In a form of rules, FACS
provides a linguistic description of all possible, visually
detectable, facial changes in terms of 44 so-called Action
Units (AUs). Using these rules, a trained human FACS
coder decomposes a shown expression into the specific AUs
that describe the expression. Automating FACS would
make it widely accessible as a research tool in the
behavioral science, which is furthermore the theoretical
basis of multimodal/media user interfaces. This triggered
researchers of computer vision field to take different
approaches in tackling the problem. Still, explicit attempts
to automate the facial action coding as applied to automated
FACS encoding are few (see [16] or [17] for a review of the
existing methods as well as Table 7 of this survey).
Most of the studies on automated expression analysis
perform an emotional classification. As indicated by
Fridlund et al. [25], the most known and the most
commonly used study on emotional classification of facial
expressions is the cross-cultural study on existence of
"universal categories of emotional expressions." Ekman
defined six such categories, referred to as the basic emotions:
happiness, sadness, surprise, fear, anger, and disgust [19].
He described each basic emotion in terms of a facial
expression that uniquely characterizes that emotion. In the
past years, many questions arose around this study. Are the
basic emotional expressions indeed universal [33], [22], or
are they merely a stressing of the verbal communication
and have no relation with an actual emotional state [76],
[77], [26]? Also, it is not at all certain that each facial
expression able to be displayed on the face can be classified
under the six basic emotion categories. Nevertheless, most
of the studies on vision-based facial expression analysis rely
on Ekman's emotional categorization of facial expressions.
The problem of automating facial expression emotional
classification is difficult to handle for a number of reasons.
First, Ekman's description of the six prototypic facial
expressions of emotion is linguistic (and, thus, ambiguous).
There is no uniquely defined description either in terms of
facial actions or in terms of some other universally defined
facial codes. Hence, the validation and the verification of
the classification scheme to be used are difficult and crucial
tasks. Second, classification of facial expressions into
multiple emotion categories should be feasible (e.g., raised
eyebrows and smiling mouth is a blend of surprise and
happiness, Fig. 1). Still, there is no psychological scrutiny on
this topic.
Three more issues are related to facial expression
classification in general. First, the system should be capable
of analyzing any subject, male or female of any age and
ethnicity. In other words, the classification mechanism may
not depend on physiognomic variability of the observed
person. On the other hand, each person has his/her own
maximal intensity of displaying a particular facial expres-
sion. Therefore, if the obtained classification is to be
quantified (e.g., to achieve a quantified encoding of facial
actions or a quantified emotional labeling of blended
expressions), systems which can start with a generic
expression classification and then adapt to a particular
individual have an advantage. Second, it is important to
realize that the interpretation of the body language is
situation-dependent [75]. Nevertheless, the information
about the context in which a facial expression appears is
very difficult to obtain in an automatic way. This issue has
not been handled by the currently existing systems. Finally,
there is now a growing psychological research that argues
that timing of facial expressions is a critical factor in the
interpretation of expressions [1], [5], [34]. For the research-
ers of automated vision-based expression analysis, this
suggests moving towards a real-time whole-face analysis of
facial expression dynamics.
While the human mechanisms for face detection are very
robust, the same is not the case for interpretation of facial
expressions. It is often very difficult to determine the exact
nature of the expression on a person's face. According to
Bassili [1], a trained observer can correctly classify faces
showing six basic emotions with an average of 87 percent.
This ratio varies depending on several factors: the famil-
iarity with the face, the familiarity with the personality of
the observed person, the general experience with different
types of expressions, the attention given to the face and the
nonvisual cues (e.g., the context in which an expression
appears). It is interesting to note that the appearance of the
upper face features plays a more important role in face
interpretation as opposed to lower face features [22].

Fig. 1. Expressions of blended emotions (surprise-happiness).
3 AN IDEAL SYSTEM FOR FACIAL EXPRESSION ANALYSIS
Before developing an automated system for facial expres-
sion analysis, one should decide on its functionality. A good
reference point is the best known facial expression
analyzer, the human visual system. It may not be possible
to incorporate all features of the human visual system into
an automated system, and some features may even be
undesirable, but it can certainly serve as a reference point.
A first requirement that should be posed on developing
an ideal automated facial expression analyzer is that all of
the stages of the facial expression analysis are to be
performed automatically, namely, face detection, facial
expression information extraction, and facial expression
classification. Yet, actual implementation and integration of
these stages into a system are constrained by the system's
application domain. For instance, if the system is to be used
as a tool for research in behavioral science, real-time
performance is not an essential property of the system.
On the other hand, this is crucial if the system would form a
part of an advanced user-interface. Long delays make the
interaction desynchronized and less efficient. Also, having
an explanation facility that would elucidate facial action
encoding performed by the system might be useful if the
system is employed to train human experts in using FACS.
However, such facility is superfluous if the system is to be
employed in videoconferencing or as a stress-monitoring
tool. In this paper, we are mainly concerned with two major
application domains of an automated facial expression
analyzer, namely, behavioral science research and multi-
modal/media HCI. In this section, we propose an ideal
automated facial expression analyzer (Table 1) which could
be employed in those application domains and has the
properties of the human visual system.
Considering the potential applications of an automated
facial expression analyzer, which involve continuous ob-
servation of a subject in a time interval, facial image
acquisition should proceed in an automatic way. In order to
be universal, the system should be capable of analyzing
subjects of both sexes, of any age and any ethnicity. Also, no
constraints should be set on the appearance of the observed
subjects. The system should perform robustly despite
changes in lighting conditions and distractions like
glasses, changes in hair style, and facial hair like moustache,
beard and grown-together eyebrows. Similarly to the
human visual system, an ideal system would "fill in"
missing parts of the observed face and "perceive" a whole
face even when a part of it is occluded (e.g., by hand). In
most real-life situations, complete immovability of the
observed subject cannot be assumed. Hence, the system
should be able to deal with rigid head motions. Ideally, the
system would perform robust facial expression analysis
despite large changes in viewing conditions; it would be
capable of dealing with a whole range of head movements,
from frontal view to profile view acquired by a fixed
camera. This could be achieved by employing several fixed
cameras for acquiring different facial views of the examined
face (such as frontal view, and right and left profile views)
and then approximating the actual view by interpolation
among the acquired views. Having no constraints set on the
rigid head motions of the subject can also be achieved by
having a camera mounted on the subject's head and placed
in front of his/her face.
An ideal system should perform robust automatic face
detection and facial expression information extraction in the
acquired images or image sequences. Considering the state-
of-the-art in image processing techniques, inaccurate, noisy,
and missing data could be expected. An ideal system
should be capable of dealing with these inaccuracies. In
addition, certainty of the extracted facial expression
information should be taken into account.
An ideal system should be able to perform analysis of all
visually distinguishable facial expressions. Well-defined
face representation is a prerequisite for achieving this. The
face representation should be such that a particular
alteration of the face model uniquely reveals a particular
facial expression. In general, an ideal system should be able
to distinguish:
1. all possible facial expressions (a reference point is
a total of 44 facial actions defined in FACS [21]
whose combinations form the complete set of facial
expressions),
2. any bilateral or unilateral facial change [21], and
3. facial expressions with a similar facial appearance
(e.g., upward pull of the upper lip and nose
wrinkling which also causes the upward pull of
the upper lip [21]).

TABLE 1
Properties of an Ideal Analyzer

In practice, it may not be possible to define a face model
that can satisfy both, to reflect each and every change in
facial appearance and whose features are detectable in a
facial image or image sequence. Still, the set of distinct facial
expressions that the system can distinguish should be as
copious as possible.
If the system is to be used for behavioral science research
purposes it should perform facial expression recognition as
applied to automated FACS encoding. As explained by
Bartlett et al. [16], [17], this means that it should accomplish
multiple quantified expression classification in terms of
44 AUs defined in FACS. If the system is to be used as a part
of an advanced multimodal/media HCI, the system should
be able to interpret the shown facial expressions (e.g., in
terms of emotions). Since psychological researchers dis-
agree on existence of universal categories of emotional
facial displays, an ideal system should be able to adapt the
classification mechanism according to the user's subjective
interpretation of expressions, e.g., as suggested in [40]. Also,
it is definitely not the case that each and every facial
expression able to be displayed on the face can be classified
into one and only one emotion class. Think about blended
emotional displays such as raised eyebrow and smiling
mouth (Fig. 1). This expression might be classified in two
emotion categories defined by Ekman and Friesen
[20], surprise and happiness. Yet, according to the descrip-
tions of these prototypic expressions given by Ekman and
Friesen [20], the left hand side facial expression shown in
Fig. 1 belongs "more" to the surprise than to the
happiness class. For instance, in the left hand side image
the "percentage" of shown surprise is higher than the
"percentage" of shown happiness while those percentages
are approximately the same in the case of the right hand
side image. In order to obtain an accurate categorization, an
ideal analyzer should perform quantified classification of
facial expression into multiple emotion categories.
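
One way to picture the quantified, multiple-category labeling argued for above is to normalize per-emotion evidence into percentages, so that a blend such as the one in Fig. 1 is reported as mostly surprise with some happiness rather than as a single label. The scoring function below is an illustrative sketch, not a method proposed in the paper.

```python
def quantify_blend(category_scores):
    """Turn nonnegative per-emotion evidence values into percentages (sketch)."""
    total = sum(category_scores.values()) or 1.0
    return {emotion: 100.0 * score / total
            for emotion, score in category_scores.items()}

# For the left-hand image of Fig. 1 one would expect something like
# quantify_blend({"surprise": 0.7, "happiness": 0.3})
# -> {"surprise": 70.0, "happiness": 30.0}, i.e. "more" surprise than happiness.
```
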
4 AUTOMATIC FACIAL EXPRESSION ANALYSIS
For its utility in application domains of human behavior
interpretation and multimodal/media HCI, automatic facial
expression analysis has attracted the interest of many
computer vision researchers. Since the mid 1970s, different
approaches have been proposed for facial expression analysis from
either static facial images or image sequences. In 1992,
Samal and Iyengar [79] gave an overview of the early
works. This paper explores and compares approaches to
automatic facial expression analysis that have been devel-
oped recently, i.e., in the late 1990s. Before surveying these
works in detail, we are giving a short overview of the
systems for facial expression analysis proposed in the
period of 1991 to 1995.
Table 2 summarizes the features of these systems in
respect to the requirements posed on design of an ideal
facial expression analyzer. None of these systems performs
a quantified expression classification in terms of facial
actions. Also, except the system proposed by Moses et al.
[57], no system listed in Table 2 performs in real-time.
Therefore, these properties (i.e., columns 15 and 20) have
been excluded from Table 2 (. stands for "yes," X stands for
"no," and - represents a missing entry). A missing entry
either means that it is not reported on the issue or that the
issue is not applicable to the system in question. An
inapplicable issue, for instance, is the issue of dealing with
rigid head motions and inaccurate facial data in the cases
where the input data were hand measured (e.g., [40]). Some
of the methods listed in Table 2 do not perform automatic
facial data extraction (see column 8); the others achieve this
by using facial motion analysis [52], [100], [101], [73], [57].
Except the method proposed by Kearney and McKenzie
[40], which performs facial expression classification in
TABLE 2
Early Methods for Automatic Facial Expression Analysis
