
Detecting Faces in Images: A Survey
Ming-Hsuan Yang, Member, IEEE, David J. Kriegman, Senior Member, IEEE, and
Narendra Ahuja, Fellow, IEEE
Abstract: Images containing faces are essential to intelligent vision-based human computer interaction, and research efforts in face
processing include face recognition, face tracking, pose estimation, and expression recognition. However, many reported methods
assume that the faces in an image or an image sequence have been identified and localized. To build fully automated systems that
analyze the information contained in face images, robust and efficient face detection algorithms are required. Given a single image, the
goal of face detection is to identify all image regions which contain a face regardless of its three-dimensional position, orientation, and
the lighting conditions. Such a problem is challenging because faces are nonrigid and have a high degree of variability in size, shape,
color, and texture. Numerous techniques have been developed to detect faces in a single image, and the purpose of this paper is to
categorize and evaluate these algorithms. We also discuss relevant issues such as data collection, evaluation metrics, and
benchmarking. After analyzing these algorithms and identifying their limitations, we conclude with several promising directions for
future research.
Index Terms: Face detection, face recognition, computer vision, object recognition, view-based recognition, statistical pattern
recognition, machine learning.
. M.-H. Yang is with Honda Fundamental Research Labs, 800 California Street, Mountain View, CA 94041. E-mail: myang@hra.com.
. D.J. Kriegman is with the Department of Computer Science and Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801. E-mail: kriegman@uiuc.edu.
. N. Ahuja is with the Department of Electrical and Computer Engineering and Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801. E-mail: ahuja@vision.ai.uiuc.edu.
Manuscript received 5 May 2000; revised 15 Jan. 2001; accepted 7 Mar. 2001. Recommended for acceptance by K. Bowyer.
For information on obtaining reprints of this article, please send e-mail to tpami@computer.org, and reference IEEECS Log Number 112058.
1 INTRODUCTION
With the ubiquity of new information technology and
media, more effective and friendly methods for
human computer interaction (HCI) are being developed
which do not rely on traditional devices such as keyboards,
mice, and displays. Furthermore, the ever decreasing price/
performance ratio of computing coupled with recent
decreases in video image acquisition cost imply that
computer vision systems can be deployed in desktop and
embedded systems [111], [112], [113]. The rapidly expand-
ing research in face processing is based on the premise that
information about a user's identity, state, and intent can be
extracted from images, and that computers can then react
accordingly, e.g., by observing a person's facial expression.
In the last five years, face and facial expression recognition
have attracted much attention though they have been
studied for more than 20 years by psychophysicists,
neuroscientists, and engineers. Many research demonstra-
tions and commercial applications have been developed
from these efforts. A first step of any face processing system
is detecting the locations in images where faces are present.
However, face detection from a single image is a challen-
ging task because of variability in scale, location, orientation
(up-right, rotated), and pose (frontal, profile). Facial
expression, occlusion, and lighting conditions also change
the overall appearance of faces.
We now give a definition of face detection: Given an
arbitrary image, the goal of face detection is to determine
whether or not there are any faces in the image and, if
present, return the image location and extent of each face.
The challenges associated with face detection can be
attributed to the following factors:
. Pose. The images of a face vary due to the relative
camera-face pose (frontal, 45 degree, profile, upside
down), and some facial features such as an eye or the
nose may become partially or wholly occluded.
. Presence or absence of structural components.
Facial features such as beards, mustaches, and
glasses may or may not be present and there is a
great deal of variability among these components
including shape, color, and size.
. Facial expression. The appearance of faces is
directly affected by a person's facial expression.
. Occlusion. Faces may be partially occluded by other
objects. In an image with a group of people, some
faces may partially occlude other faces.
. Image orientation. Face images directly vary for
different rotations about the camera's optical axis.
. Imaging conditions. When the image is formed,
factors such as lighting (spectra, source distribution
and intensity) and camera characteristics (sensor
response, lenses) affect the appearance of a face.
There are many closely related problems of face
detection. Face localization aims to determine the image
position of a single face; this is a simplified detection
problem with the assumption that an input image contains
only one face [85], [103]. The goal of facial feature detection is
to detect the presence and location of features, such as eyes,
nose, nostrils, eyebrow, mouth, lips, ears, etc., with the
assumption that there is only one face in an image [28], [54].
Face recognition or face identification compares an input image
(probe) against a database (gallery) and reports a match, if
any [163], [133], [18]. The purpose of face authentication is to
verify the claim of the identity of an individual in an input
image [158], [82], while face tracking methods continuously
estimate the location and possibly the orientation of a face
in an image sequence in real time [30], [39], [33]. Facial
expression recognition concerns identifying the affective
states (happy, sad, disgusted, etc.) of humans [40], [35].
Evidently, face detection is the first step in any automated
system which solves the above problems. It is worth
mentioning that many papers use the term "face detection,"
but the methods and the experimental results only show
that a single face is localized in an input image. In this
paper, we differentiate face detection from face localization
since the latter is a simplified problem of the former.
Meanwhile, we focus on face detection methods rather than
tracking methods.
While numerous methods have been proposed to detect
faces in a single intensity or color image, we are
unaware of any surveys on this particular topic. A survey of
early face recognition methods before 1991 was written by
Samal and Iyengar [133]. Chellappa et al. wrote a more recent
survey on face recognition and some detection methods [18].
Among the face detection methods, the ones based on
learning algorithms have attracted much attention recently
and have demonstrated excellent results. Since these data-
driven methods rely heavily on the training sets, we also
discuss several databases suitable for this task. A related
and important problem is how to evaluate the performance
of the proposed detection methods. Many recent face
detection papers compare the performance of several
methods, usually in terms of detection and false alarm
rates. It is also worth noticing that many metrics have been
adopted to evaluate algorithms, such as learning time,
execution time, the number of samples required in training,
and the ratio between detection rates and false alarms.
Evaluation becomes more difficult when researchers use
different definitions for detection and false alarm rates. In
this paper, detection rate is defined as the ratio between the
number of faces correctly detected and the number of faces
determined by a human. An image region identified as a
face by a classifier is considered to be correctly detected if
the image region covers more than a certain percentage of a
face in the image (See Section 3.3 for details). In general,
detectors can make two types of errors: false negatives in
which faces are missed resulting in low detection rates and
false positives in which an image region is declared to be
a face, but it is not. A fair evaluation should take these factors
into consideration since one can tune the parameters of
one's method to increase the detection rates while also
increasing the number of false detections. In this paper, we
discuss the benchmarking data sets and the related issues in
a fair evaluation.
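To make this evaluation protocol concrete, the following is a minimal sketch of scoring one image's detections against human-annotated face regions. It is illustrative only: the (x1, y1, x2, y2) box format, the 80 percent coverage threshold, and the one-match-per-face rule are assumptions of this sketch, not the protocol of any surveyed paper.

```python
def coverage(det, gt):
    """Fraction of the ground-truth face box covered by the detection.

    Boxes are (x1, y1, x2, y2) tuples; the format is an assumption
    of this sketch.
    """
    ix1, iy1 = max(det[0], gt[0]), max(det[1], gt[1])
    ix2, iy2 = min(det[2], gt[2]), min(det[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / gt_area if gt_area > 0 else 0.0

def evaluate(detections, ground_truth, min_coverage=0.8):
    """Return (detection_rate, false_positives) for one image.

    A detection counts as correct if it covers at least min_coverage
    of some annotated face; each face may be matched at most once.
    """
    matched = set()
    false_positives = 0
    for det in detections:
        hit = None
        for i, gt in enumerate(ground_truth):
            if i not in matched and coverage(det, gt) >= min_coverage:
                hit = i
                break
        if hit is None:
            false_positives += 1
        else:
            matched.add(hit)
    rate = len(matched) / len(ground_truth) if ground_truth else 1.0
    return rate, false_positives
```

As the text notes, reporting the detection rate alone is misleading; the same tuning that raises it also raises the false-positive count, so both numbers must be reported together.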
With over 150 reported approaches to face detection, the
research in face detection has broader implications for
computer vision research on object recognition. Nearly all
model-based or appearance-based approaches to 3D object
recognition have been limited to rigid objects while
attempting to robustly perform identification over a broad
range of camera locations and illumination conditions. Face
detection can be viewed as a two-class recognition problem
in which an image region is classified as being a "face" or
"nonface." Consequently, face detection is one of the few
attempts to recognize from images (not abstract representa-
tions) a class of objects for which there is a great deal of
within-class variability (described previously). It is also one
of the few classes of objects for which this variability has
been captured using large training sets of images and, so,
some of the detection techniques may be applicable to a
much broader class of recognition problems.
Face detection also provides interesting challenges to the
underlying pattern classification and learning techniques.
When a raw or filtered image is considered as input to a
pattern classifier, the dimension of the feature space is
extremely large (i.e., the number of pixels in normalized
training images). The classes of face and nonface images are
decidedly characterized by multimodal distribution func-
tions and effective decision boundaries are likely to be
nonlinear in the image space. To be effective, either classifiers
must be able to extrapolate from a modest number of training
samples or be efficient when dealing with a very large
number of these high-dimensional training samples.
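As a concrete illustration of this classification view, the sketch below scans fixed-size windows over an image pyramid and hands each flattened window to a two-class face/nonface classifier. The 19 x 19 window, the scan stride, and the classify stub are assumptions made for illustration; the appearance-based detectors reviewed in Section 2 differ precisely in how that classifier is built.

```python
import numpy as np

def classify(window_vector):
    """Stand-in for a trained face/nonface classifier (e.g., a neural
    network or SVM over the pixel vector); always answers nonface here."""
    return False

def detect(image, win=19, scale_step=1.2, stride=2):
    """Scan win x win windows over a coarse image pyramid.

    Each window is flattened into a win*win-dimensional vector,
    the high-dimensional classifier input discussed above.
    Returns boxes (x1, y1, x2, y2) in original image coordinates.
    """
    image = np.asarray(image, dtype=float)
    faces, scale = [], 1.0
    img = image
    while min(img.shape) >= win:
        for y in range(0, img.shape[0] - win + 1, stride):
            for x in range(0, img.shape[1] - win + 1, stride):
                if classify(img[y:y + win, x:x + win].ravel()):
                    faces.append((int(x * scale), int(y * scale),
                                  int((x + win) * scale),
                                  int((y + win) * scale)))
        scale *= scale_step
        h, w = int(image.shape[0] / scale), int(image.shape[1] / scale)
        if min(h, w) < win:
            break
        # Nearest-neighbor downsampling keeps the sketch dependency-free.
        ys = (np.arange(h) * scale).astype(int)
        xs = (np.arange(w) * scale).astype(int)
        img = image[np.ix_(ys, xs)]
    return faces
```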
With an aim to give a comprehensive and critical survey
of current face detection methods, this paper is organized as
follows: In Section 2, we give a detailed review of
techniques to detect faces in a single image. Benchmarking
databases and evaluation criteria are discussed in Section 3.
We conclude this paper with a discussion of several
promising directions for face detection in Section 4.
Though we report error rates for each method when
available, tests are often done on unique data sets and, so,
comparisons are often difficult. We indicate those methods
that have been evaluated with a publicly available test set. It
can be assumed that a unique data set was used if we do not
indicate the name of the test set.
2 DETECTING FACES IN A SINGLE IMAGE
In this section, we review existing techniques to detect faces
from a single intensity or color image. We classify single
image detection methods into four categories; some
methods clearly overlap category boundaries and are
discussed at the end of this section.
1. Knowledge-based methods. These rule-based meth-
ods encode human knowledge of what constitutes a
typical face. Usually, the rules capture the relation-
ships between facial features. These methods are
designed mainly for face localization.
2. Feature invariant approaches. These algorithms aim
to find structural features that exist even when the
pose, viewpoint, or lighting conditions vary, and
then use these to locate faces. These methods are
designed mainly for face localization.
3. Template matching methods. Several standard pat-
terns of a face are stored to describe the face as a whole
or the facial features separately. The correlations
between an input image and the stored patterns are
computed for detection. These methods have been
used for both face localization and detection.
4. Appearance-based methods. In contrast to template
matching, the models (or templates) are learned from

a set of training images which should capture the
representative variability of facial appearance. These
learned models are then used for detection. These
methods are designed mainly for face detection.
Table 1 summarizes algorithms and representative
works for face detection in a single image within these
four categories. Below, we discuss the motivation and
general approach of each category. This is followed by a
review of specific methods including a discussion of their
pros and cons. We suggest ways to further improve these
methods in Section 4.
2.1 Knowledge-Based Top-Down Methods
In this approach, face detection methods are developed
based on the rules derived from the researcher's knowledge
of human faces. It is easy to come up with simple rules to
describe the features of a face and their relationships. For
example, a face often appears in an image with two eyes
that are symmetric to each other, a nose, and a mouth. The
relationships between features can be represented by their
relative distances and positions. Facial features in an input
image are extracted first, and face candidates are identified
based on the coded rules. A verification process is usually
applied to reduce false detections.
One problem with this approach is the difficulty in
translating human knowledge into rules. If the rules are
detailed (i.e., strict), they may fail to detect faces that do not
pass all the rules. If the rules are too general, they may give
many false positives. Moreover, it is difficult to extend this
approach to detect faces in different poses since it is
challenging to enumerate all the possible cases. On the other
hand, heuristics about faces work well in detecting frontal
faces in uncluttered scenes.
Yang and Huang used a hierarchical knowledge-based
method to detect faces [170]. Their system consists of three
levels of rules. At the highest level, all possible face
candidates are found by scanning a window over the input
image and applying a set of rules at each location. The rules
at a higher level are general descriptions of what a face
looks like while the rules at lower levels rely on details of
facial features. A multiresolution hierarchy of images is
created by averaging and subsampling, and an example is
shown in Fig. 1. Examples of the coded rules used to locate
face candidates in the lowest resolution include: "the center
part of the face (the dark shaded parts in Fig. 2) has four
cells with a basically uniform intensity," "the upper round
part of a face (the light shaded parts in Fig. 2) has a basically
uniform intensity," and "the difference between the average
TABLE 1
Categorization of Methods for Face Detection in a Single Image
Fig. 1. (a) n = 1, original image. (b) n = 4. (c) n = 8. (d) n = 16. Original and corresponding low resolution images. Each square cell consists of
n × n pixels in which the intensity of each pixel is replaced by the average intensity of the pixels in that cell.

gray values of the center part and the upper round part is
significant." The lowest resolution (Level 1) image is
searched for face candidates and these are further processed
at finer resolutions. At Level 2, local histogram equalization
is performed on the face candidates received from Level 1,
followed by edge detection. Surviving candidate regions are
then examined at Level 3 with another set of rules that
respond to facial features such as the eyes and mouth.
Evaluated on a test set of 60 images, this system located
faces in 50 of the test images, while false alarms appeared in
28 images. One attractive feature of this
method is that a coarse-to-fine or focus-of-attention strategy
is used to reduce the required computation. Although it
does not result in a high detection rate, the ideas of using a
multiresolution hierarchy and rules to guide searches have
been used in later face detection works [81].
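The mosaic operation behind Fig. 1 is simple to state in code. The sketch below block-averages an image so that each n x n cell carries its mean intensity; it reproduces only the hierarchy construction, not Yang and Huang's rule sets.

```python
import numpy as np

def mosaic(image, n):
    """Replace each n x n cell with its average intensity (cf. Fig. 1)."""
    h, w = image.shape
    h, w = h - h % n, w - w % n          # crop to a multiple of n
    cells = image[:h, :w].reshape(h // n, n, w // n, n)
    means = cells.mean(axis=(1, 3))      # one mean per cell
    return np.repeat(np.repeat(means, n, axis=0), n, axis=1)

# Levels shown in Fig. 1: the original image plus n = 4, 8, 16 mosaics.
# pyramid = [image] + [mosaic(image, n) for n in (4, 8, 16)]
```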
Kotropoulos and Pitas [81] presented a rule-based
localization method which is similar to [71] and [170]. First,
facial features are located with a projection method that
Kanade successfully used to locate the boundary of a face [71].
Let $I(x, y)$ be the intensity value of an $m \times n$ image at position $(x, y)$; the horizontal and vertical projections of the image are defined as $HI(x) = \sum_{y=1}^{n} I(x, y)$ and $VI(y) = \sum_{x=1}^{m} I(x, y)$.
The horizontal profile of an input image is obtained first, and
then the two local minima, determined by detecting abrupt
changes in HI, are said to correspond to the left and right side
of the head. Similarly, the vertical profile is obtained and the
local minima are determined for the locations of mouth lips,
nose tip, and eyes. These detected features constitute a facial
candidate. Fig. 3a shows one example where the boundaries
of the face correspond to the local minimum where abrupt
intensity changes occur. Subsequently, eyebrow/eyes, nos-
trils/nose, and the mouth detection rules are used to validate
these candidates. The proposed method has been tested using
a set of faces in frontal views extracted from the European
ACTS M2VTS (MultiModal Verification for Teleservices and
Security applications) database [116] which contains video
sequences of 37 different people. Each image sequence
contains only one face in a uniform background. Their
method provides correct face candidates in all tests. The
detection rate is 86.5 percent if successful detection is defined
as correctly identifying all facial features. Fig. 3b shows one
example in which it becomes difficult to locate a face in a
complex background using the horizontal and vertical
profiles. Furthermore, this method cannot readily detect
multiple faces as illustrated in Fig. 3c. Essentially, the
projection method can be effective if the window over
which it operates is suitably located to avoid misleading
interference.
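A minimal sketch of the projection step: compute HI and VI as defined above and flag sharp drops as candidate boundaries. Treating an "abrupt change" as a large negative first difference is an assumption of this sketch; [81] is not reduced to exactly this rule.

```python
import numpy as np

def profiles(image):
    """Horizontal and vertical projections of an intensity image.

    HI(x) sums each column over y; VI(y) sums each row over x,
    matching the definitions in the text (image indexed [y, x]).
    """
    HI = image.sum(axis=0)   # one value per x
    VI = image.sum(axis=1)   # one value per y
    return HI, VI

def abrupt_minima(profile, rel_thresh=0.2):
    """Indices where the profile drops sharply, taken as candidate
    face boundaries; the threshold rule is a stand-in for the
    detection of abrupt changes described in the text."""
    d = np.diff(profile.astype(float))
    thresh = rel_thresh * np.abs(d).max()
    return np.where(d < -thresh)[0]
```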
2.2 Bottom-Up Feature-Based Methods
In contrast to the knowledge-based top-down approach,
researchers have been trying to find invariant features of
faces for detection. The underlying assumption is based on
the observation that humans can effortlessly detect faces
and objects in different poses and lighting conditions and,
so, there must exist properties or features which are
invariant over these variabilities. Numerous methods have
been proposed to first detect facial features and then to infer
the presence of a face. Facial features such as eyebrows,
eyes, nose, mouth, and hair-line are commonly extracted
using edge detectors. Based on the extracted features, a
statistical model is built to describe their relationships and
to verify the existence of a face. One problem with these
feature-based algorithms is that the image features can be
severely corrupted due to illumination, noise, and occlu-
sion. Feature boundaries can be weakened for faces, while
shadows can cause numerous strong edges which together
render perceptual grouping algorithms useless.
2.2.1 Facial Features
Sirohey proposed a localization method to segment a face
from a cluttered background for face identification [145]. It
uses an edge map (Canny detector [15]) and heuristics to
remove and group edges so that only the ones on the face
contour are preserved. An ellipse is then fit to the boundary
between the head region and the background. This algorithm
achieves 80 percent accuracy on a database of 48 images with
cluttered backgrounds. Instead of using edges, Chetverikov
Fig. 2. A typical face used in knowledge-based top-down methods:
Rules are coded based on human knowledge about the characteristics
(e.g., intensity distribution and difference) of the facial regions [170].
Fig. 3. (a) and (b) n = 8. (c) n = 4. Horizontal and vertical profiles. It is feasible to detect a single face by searching for the peaks in horizontal and
vertical profiles. However, the same method has difficulty detecting faces in complex backgrounds or multiple faces as shown in (b) and (c).

and Lerch presented a simple face detection method using
blobs and streaks (linear sequences of similarly oriented
edges) [20]. Their face model consists of two dark blobs and
three light blobs to represent eyes, cheekbones, and nose. The
model uses streaks to represent the outlines of the faces,
eyebrows, and lips. Two triangular configurations are
utilized to encode the spatial relationship among the blobs.
A low resolution Laplacian image is generated to facilitate
blob detection. Next, the image is scanned to find specific
triangular occurrences as candidates. A face is detected if
streaks are identified around a candidate.
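A minimal sketch of the blob-detection stage, assuming a block-averaged low-resolution image and a discrete Laplacian: strong positive responses mark dark blobs (local intensity minima) and strong negative responses mark light blobs. The cell size and threshold are illustrative assumptions; the triangular-configuration matching and streak detection are omitted.

```python
import numpy as np
from scipy import ndimage

def blob_candidates(image, cell=8, thresh=10.0):
    """Dark and light blob candidates from a low-resolution Laplacian."""
    # Low-resolution image: average over cell x cell blocks.
    h, w = image.shape
    h, w = h - h % cell, w - w % cell
    small = (image[:h, :w]
             .reshape(h // cell, cell, w // cell, cell)
             .mean(axis=(1, 3)))
    lap = ndimage.laplace(small)
    dark = np.argwhere(lap > thresh)    # intensity minima: eyes, etc.
    light = np.argwhere(lap < -thresh)  # intensity maxima: cheekbones, nose
    return dark, light
```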
Graf et al. developed a method to locate facial features
and faces in gray scale images [54]. After band pass
filtering, morphological operations are applied to enhance
regions with high intensity that have certain shapes (e.g.,
eyes). The histogram of the processed image typically
exhibits a prominent peak. Based on the peak value and its
width, adaptive threshold values are selected in order to
generate two binarized images. Connected components are
identified in both binarized images to identify the areas of
candidate facial features. Combinations of such areas are
then evaluated with classifiers, to determine whether and
where a face is present. Their method has been tested with
head-shoulder images of 40 individuals and with five video
sequences where each sequence consists of 100 to
200 frames. However, it is not clear how morphological
operations are performed and how the candidate facial
features are combined to locate a face.
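Since the paper leaves the morphological details open, the following sketch covers only the histogram-driven thresholding: locate the dominant peak, estimate its width as the span above half the peak height, and binarize at two thresholds around it. Both the width estimate and the threshold placement are assumptions of this sketch.

```python
import numpy as np

def adaptive_binarize(image):
    """Two binarized images from thresholds placed around the dominant
    histogram peak (cf. Graf et al. [54]); threshold placement here is
    an illustrative assumption, not the rule used in the paper."""
    hist, edges = np.histogram(image, bins=256, range=(0, 256))
    p = int(np.argmax(hist))
    half = hist[p] / 2.0
    lo, hi = p, p
    while lo > 0 and hist[lo] > half:    # walk down to half maximum
        lo -= 1
    while hi < 255 and hist[hi] > half:  # walk up to half maximum
        hi += 1
    t_low, t_high = edges[lo], edges[hi + 1]
    # Connected components of each binary image then become the
    # candidate facial-feature areas described in the text.
    return image > t_low, image > t_high
```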
Leung et al. developed a probabilistic method to locate a
face in a cluttered scene based on local feature detectors and
random graph matching [87]. Their motivation is to formulate
the face localization problem as a search problem in which the
goal is to find the arrangement of certain facial features that is
most likely to be a face pattern. Five features (two eyes, two
nostrils, and nose/lip junction) are used to describe a typical
face. For any pair of facial features of the same type (e.g., left-
eye, right-eye pair), their relative distance is computed, and
over an ensemble of images the distances are modeled by a
Gaussian distribution. A facial template is defined by
averaging the responses to a set of multiorientation, multi-
scale Gaussian derivative filters (at the pixels inside the facial
feature) over a number of faces in a data set. Given a test
image, candidate facial features are identified by matching
the filter response at each pixel against a template vector of
responses (similar to correlation in spirit). The top two feature
candidates with the strongest response are selected to search
for the other facial features. Since the facial features cannot
appear in arbitrary arrangements, the expected locations of
the other features are estimated using a statistical model of
mutual distances. Furthermore, the covariance of the esti-
mates can be computed. Thus, the expected feature locations
can be estimated with high probability. Constellations are
then formed only from candidates that lie inside the
appropriate locations, and the most face-like constellation is
determined. Finding the best constellation is formulated as a
random graph matching problem in which the nodes of the
graph correspond to features on a face, and the arcs represent
the distances between different features. Ranking of
constellations is based on a probability density function that
a constellation corresponds to a face versus the probability it
was generated by an alternative mechanism (i.e., nonface).
They used a set of 150 images for experiments in which a face
is considered correctly detected if any constellation correctly
locates three or more features on the faces. This system is able
to achieve a correct localization rate of 86 percent.
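A minimal sketch of the distance-modeling component of this approach: fit a Gaussian to one pairwise distance (say, left eye to right eye) over training faces, then score an observed pair by its log-likelihood under that model. The function names are illustrative, and the full constellation search via random graph matching is omitted.

```python
import numpy as np

def fit_pair_distance(points_a, points_b):
    """Gaussian (mean, std) of the distance between two feature types
    (e.g., left-eye and right-eye locations) over an ensemble of faces."""
    d = np.linalg.norm(np.asarray(points_a, float) -
                       np.asarray(points_b, float), axis=1)
    return d.mean(), d.std()

def pair_log_likelihood(p, q, mu, sigma):
    """Log Gaussian density of the observed distance; constellations
    can be ranked by summing such terms over their feature pairs."""
    d = np.linalg.norm(np.asarray(p, float) - np.asarray(q, float))
    return (-0.5 * ((d - mu) / sigma) ** 2
            - np.log(sigma * np.sqrt(2.0 * np.pi)))
```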
Instead of using mutual distances to describe the
relationships between facial features in constellations, an
alternative method for modeling faces was also proposed
by Leung et al. [13], [88]. The representation and
ranking of the constellations is accomplished using the
statistical theory of shape, developed by Kendall [75] and
Mardia and Dryden [95]. The shape statistics are a joint probability density function over $N$ feature points, represented by $(x_i, y_i)$ for the $i$th feature, under the assumption that the original feature points are positioned in the plane according to a general $2N$-dimensional Gaussian distribution. They applied the same maximum-likelihood (ML)
method to determine the location of a face. One advantage
of these methods is that partially occluded faces can be
located. However, it is unclear whether these methods can
be adapted to detect multiple faces effectively in a scene.
In [177], [178], Yow and Cipolla presented a feature-
based method that uses a large amount of evidence from the
visual image together with contextual evidence. The first stage
applies a second derivative Gaussian filter, elongated at an
aspect ratio of three to one, to a raw image. Interest points,
detected at the local maxima in the filter response, indicate
the possible locations of facial features. The second stage
examines the edges around these interest points and groups
them into regions. The perceptual grouping of edges is
based on their proximity and similarity in orientation and
strength. Measurements of a region's characteristics, such as
edge length, edge strength, and intensity variance, are
computed and stored in a feature vector. From the training
data of facial features, the mean and covariance matrix of
each facial feature vector are computed. An image region
becomes a valid facial feature candidate if the Mahalanobis
distance between the corresponding feature vectors is
below a threshold. The labeled features are further grouped
based on model knowledge of where they should occur
with respect to each other. Each facial feature and grouping
is then evaluated using a Bayesian network. One attractive
aspect is that this method can detect faces at different
orientations and poses. The overall detection rate on a test
set of 110 images of faces with different scales, orientations,
and viewpoints is 85 percent [179]. However, the reported
false detection rate is 28 percent and the implementation is
only effective for faces larger than 60 × 60 pixels. Subse-
quently, this approach has been enhanced with active
contour models [22], [179]. Fig. 4 summarizes their feature-
based face detection method.
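The Mahalanobis gating step follows directly from the description above. In this sketch the feature vector holds the region measurements named in the text (edge length, edge strength, intensity variance); the distance threshold of 3 is an assumption.

```python
import numpy as np

def fit_feature_model(training_vectors):
    """Mean and covariance of a facial feature's measurement vectors
    (edge length, edge strength, intensity variance, ...) computed
    from training data, as described in the text."""
    X = np.asarray(training_vectors, dtype=float)
    return X.mean(axis=0), np.cov(X, rowvar=False)

def is_candidate(vec, mean, cov, thresh=3.0):
    """Accept a region as a facial-feature candidate if the Mahalanobis
    distance between its feature vector and the learned model is below
    a threshold (the value 3.0 is an illustrative assumption)."""
    diff = np.asarray(vec, dtype=float) - mean
    d2 = diff @ np.linalg.inv(cov) @ diff
    return np.sqrt(d2) < thresh
```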
Takacs and Wechsler described a biologically motivated
face localization method based on a model of retinal feature
extraction and small oscillatory eye movements [157]. Their
algorithm operates on the conspicuity map or region of
interest, with a retina lattice modeled after the magnocel-
lular ganglion cells in the human vision system. The first
phase computes a coarse scan of the image to estimate the
location of the face, based on the filter responses of
receptive fields. Each receptive field consists of a number
