
Detecting Faces in Images: A Survey
Ming-Hsuan Yang, Member, IEEE, David J. Kriegman, Senior Member, IEEE, and
Narendra Ahuja, Fellow, IEEE
Abstract: Images containing faces are essential to intelligent vision-based human computer interaction, and research efforts in face
processing include face recognition, face tracking, pose estimation, and expression recognition. However, many reported methods
assume that the faces in an image or an image sequence have been identified and localized. To build fully automated systems that
analyze the information contained in face images, robust and efficient face detection algorithms are required. Given a single image, the
goal of face detection is to identify all image regions which contain a face regardless of its three-dimensional position, orientation, and
the lighting conditions. Such a problem is challenging because faces are nonrigid and have a high degree of variability in size, shape,
color, and texture. Numerous techniques have been developed to detect faces in a single image, and the purpose of this paper is to
categorize and evaluate these algorithms. We also discuss relevant issues such as data collection, evaluation metrics, and
benchmarking. After analyzing these algorithms and identifying their limitations, we conclude with several promising directions for
future research.
Index Terms: Face detection, face recognition, computer vision, object recognition, view-based recognition, statistical pattern
recognition, machine learning.
. M.-H. Yang is with Honda Fundamental Research Labs, 800 California Street, Mountain View, CA 94041. E-mail: myang@hra.com.
. D.J. Kriegman is with the Department of Computer Science and Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801. E-mail: kriegman@uiuc.edu.
. N. Ahuja is with the Department of Electrical and Computer Engineering and Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801. E-mail: ahuja@vision.ai.uiuc.edu.
Manuscript received 5 May 2000; revised 15 Jan. 2001; accepted 7 Mar. 2001. Recommended for acceptance by K. Bowyer.
For information on obtaining reprints of this article, please send e-mail to tpami@computer.org, and reference IEEECS Log Number 112058.
1 INTRODUCTION
With the ubiquity of new information technology and
media, more effective and friendly methods for
human computer interaction (HCI) are being developed
which do not rely on traditional devices such as keyboards,
mice, and displays. Furthermore, the ever decreasing price/
performance ratio of computing coupled with recent
decreases in video image acquisition cost imply that
computer vision systems can be deployed in desktop and
embedded systems [111], [112], [113]. The rapidly expand-
ing research in face processing is based on the premise that
information about a user's identity, state, and intent can be
extracted from images, and that computers can then react
accordingly, e.g., by observing a person's facial expression.
In the last five years, face and facial expression recognition
have attracted much attention though they have been
studied for more than 20 years by psychophysicists,
neuroscientists, and engineers. Many research demonstra-
tions and commercial applications have been developed
from these efforts. A first step of any face processing system
is detecting the locations in images where faces are present.
However, face detection from a single image is a challen-
ging task because of variability in scale, location, orientation
(up-right, rotated), and pose (frontal, profile). Facial
expression, occlusion, and lighting conditions also change
the overall appearance of faces.
We now give a definition of face detection: Given an
arbitrary image, the goal of face detection is to determine
whether or not there are any faces in the image and, if
present, return the image location and extent of each face.
The challenges associated with face detection can be
attributed to the following factors:
. Pose. The images of a face vary due to the relative
camera-face pose (frontal, 45 degree, profile, upside
down), and some facial features such as an eye or the
nose may become partially or wholly occluded.
. Presence or absence of structural components.
Facial features such as beards, mustaches, and
glasses may or may not be present and there is a
great deal of variability among these components
including shape, color, and size.
. Facial expression. The appearance of faces is
directly affected by a person's facial expression.
. Occlusion. Faces may be partially occluded by other
objects. In an image with a group of people, some
faces may partially occlude other faces.
. Image orientation. Face images directly vary for
different rotations about the camera's optical axis.
. Imaging conditions. When the image is formed,
factors such as lighting (spectra, source distribution
and intensity) and camera characteristics (sensor
response, lenses) affect the appearance of a face.
There are many closely related problems of face
detection. Face localization aims to determine the image
position of a single face; this is a simplified detection
problem with the assumption that an input image contains
only one face [85], [103]. The goal of facial feature detection is
to detect the presence and location of features, such as eyes,
nose, nostrils, eyebrow, mouth, lips, ears, etc., with the
assumption that there is only one face in an image [28], [54].
Face recognition or face identification compares an input image
(probe) against a database (gallery) and reports a match, if
any [163], [133], [18]. The purpose of face authentication is to
verify the claim of the identity of an individual in an input
image [158], [82], while face tracking methods continuously
estimate the location and possibly the orientation of a face
in an image sequence in real time [30], [39], [33]. Facial
expression recognition concerns identifying the affective
states (happy, sad, disgusted, etc.) of humans [40], [35].
Evidently, face detection is the first step in any automated
system which solves the above problems. It is worth
mentioning that many papers use the term "face detection,"
but the methods and the experimental results only show
that a single face is localized in an input image. In this
paper, we differentiate face detection from face localization
since the latter is a simplified problem of the former.
Meanwhile, we focus on face detection methods rather than
tracking methods.
While numerous methods have been proposed to detect
faces in a single intensity or color image, we are
unaware of any surveys on this particular topic. A survey of
early face recognition methods before 1991 was written by
Samal and Iyengar [133]. Chellappa et al. wrote a more recent
survey on face recognition and some detection methods [18].
Among the face detection methods, the ones based on
learning algorithms have attracted much attention recently
and have demonstrated excellent results. Since these data-
driven methods rely heavily on the training sets, we also
discuss several databases suitable for this task. A related
and important problem is how to evaluate the performance
of the proposed detection methods. Many recent face
detection papers compare the performance of several
methods, usually in terms of detection and false alarm
rates. It is also worth noticing that many metrics have been
adopted to evaluate algorithms, such as learning time,
execution time, the number of samples required in training,
and the ratio between detection rates and false alarms.
Evaluation becomes more difficult when researchers use
different definitions for detection and false alarm rates. In
this paper, detection rate is defined as the ratio between the
number of faces correctly detected and the number of faces
determined by a human. An image region identified as a
face by a classifier is considered to be correctly detected if
the image region covers more than a certain percentage of a
face in the image (See Section 3.3 for details). In general,
detectors can make two types of errors: false negatives in
which faces are missed resulting in low detection rates and
false positives in which an image region is declared to be
a face, but it is not. A fair evaluation should take these factors
into consideration since one can tune the parameters of
one's method to increase the detection rates while also
increasing the number of false detections. In this paper, we
discuss the benchmarking data sets and the related issues in
a fair evaluation.
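To make this evaluation protocol concrete, the following is a minimal sketch of scoring one image's detections against human-annotated face regions. It is illustrative only: the (x1, y1, x2, y2) box format, the 80 percent coverage threshold, and the one-match-per-face rule are assumptions of this sketch, not the protocol of any surveyed paper.

```python
def coverage(det, gt):
    """Fraction of the ground-truth face box covered by the detection.

    Boxes are (x1, y1, x2, y2) tuples; the format is an assumption
    of this sketch.
    """
    ix1, iy1 = max(det[0], gt[0]), max(det[1], gt[1])
    ix2, iy2 = min(det[2], gt[2]), min(det[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / gt_area if gt_area > 0 else 0.0

def evaluate(detections, ground_truth, min_coverage=0.8):
    """Return (detection_rate, false_positives) for one image.

    A detection counts as correct if it covers at least min_coverage
    of some annotated face; each face may be matched at most once.
    """
    matched = set()
    false_positives = 0
    for det in detections:
        hit = None
        for i, gt in enumerate(ground_truth):
            if i not in matched and coverage(det, gt) >= min_coverage:
                hit = i
                break
        if hit is None:
            false_positives += 1
        else:
            matched.add(hit)
    rate = len(matched) / len(ground_truth) if ground_truth else 1.0
    return rate, false_positives
```

As the text notes, reporting the detection rate alone is misleading; the same tuning that raises it also raises the false-positive count, so both numbers must be reported together.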
With over 150 reported approaches to face detection, the
research in face detection has broader implications for
computer vision research on object recognition. Nearly all
model-based or appearance-based approaches to 3D object
recognition have been limited to rigid objects while
attempting to robustly perform identification over a broad
range of camera locations and illumination conditions. Face
detection can be viewed as a two-class recognition problem
in which an image region is classified as being a "face" or
"nonface." Consequently, face detection is one of the few
attempts to recognize from images (not abstract representa-
tions) a class of objects for which there is a great deal of
within-class variability (described previously). It is also one
of the few classes of objects for which this variability has
been captured using large training sets of images and, so,
some of the detection techniques may be applicable to a
much broader class of recognition problems.
Face detection also provides interesting challenges to the
underlying pattern classification and learning techniques.
When a raw or filtered image is considered as input to a
pattern classifier, the dimension of the feature space is
extremely large (i.e., the number of pixels in normalized
training images). The classes of face and nonface images are
decidedly characterized by multimodal distribution func-
tions and effective decision boundaries are likely to be
nonlinear in the image space. To be effective, either classifiers
must be able to extrapolate from a modest number of training
samples or be efficient when dealing with a very large
number of these high-dimensional training samples.
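As a concrete illustration of this classification view, the sketch below scans fixed-size windows over an image pyramid and hands each flattened window to a two-class face/nonface classifier. The 19 x 19 window, the scan stride, and the classify stub are assumptions made for illustration; the appearance-based detectors reviewed in Section 2 differ precisely in how that classifier is built.

```python
import numpy as np

def classify(window_vector):
    """Stand-in for a trained face/nonface classifier (e.g., a neural
    network or SVM over the pixel vector); always answers nonface here."""
    return False

def detect(image, win=19, scale_step=1.2, stride=2):
    """Scan win x win windows over a coarse image pyramid.

    Each window is flattened into a win*win-dimensional vector,
    the high-dimensional classifier input discussed above.
    Returns boxes (x1, y1, x2, y2) in original image coordinates.
    """
    image = np.asarray(image, dtype=float)
    faces, scale = [], 1.0
    img = image
    while min(img.shape) >= win:
        for y in range(0, img.shape[0] - win + 1, stride):
            for x in range(0, img.shape[1] - win + 1, stride):
                if classify(img[y:y + win, x:x + win].ravel()):
                    faces.append((int(x * scale), int(y * scale),
                                  int((x + win) * scale),
                                  int((y + win) * scale)))
        scale *= scale_step
        h, w = int(image.shape[0] / scale), int(image.shape[1] / scale)
        if min(h, w) < win:
            break
        # Nearest-neighbor downsampling keeps the sketch dependency-free.
        ys = (np.arange(h) * scale).astype(int)
        xs = (np.arange(w) * scale).astype(int)
        img = image[np.ix_(ys, xs)]
    return faces
```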
With an aim to give a comprehensive and critical survey
of current face detection methods, this paper is organized as
follows: In Section 2, we give a detailed review of
techniques to detect faces in a single image. Benchmarking
databases and evaluation criteria are discussed in Section 3.
We conclude this paper with a discussion of several
promising directions for face detection in Section 4.
Though we report error rates for each method when
available, tests are often done on unique data sets and, so,
comparisons are often difficult. We indicate those methods
that have been evaluated with a publicly available test set. It
can be assumed that a unique data set was used if we do not
indicate the name of the test set.
2 DETECTING FACES IN A SINGLE IMAGE
In this section, we review existing techniques to detect faces
from a single intensity or color image. We classify single
image detection methods into four categories; some
methods clearly overlap category boundaries and are
discussed at the end of this section.
1. Knowledge-based methods. These rule-based meth-
ods encode human knowledge of what constitutes a
typical face. Usually, the rules capture the relation-
ships between facial features. These methods are
designed mainly for face localization.
2. Feature invariant approaches. These algorithms aim
to find structural features that exist even when the
pose, viewpoint, or lighting conditions vary, and
then use these to locate faces. These methods are
designed mainly for face localization.
3. Template matching methods. Several standard pat-
terns of a face are stored to describe the face as a whole
or the facial features separately. The correlations
between an input image and the stored patterns are
computed for detection. These methods have been
used for both face localization and detection.
4. Appearance-based methods. In contrast to template
matching, the models (or templates) are learned from

a set of training images which should capture the
representative variability of facial appearance. These
learned models are then used for detection. These
methods are designed mainly for face detection.
Table 1 summarizes algorithms and representative
works for face detection in a single image within these
four categories. Below, we discuss the motivation and
general approach of each category. This is followed by a
review of specific methods including a discussion of their
pros and cons. We suggest ways to further improve these
methods in Section 4.
2.1 Knowledge-Based Top-Down Methods
In this approach, face detection methods are developed
based on the rules derived from the researcher's knowledge
of human faces. It is easy to come up with simple rules to
describe the features of a face and their relationships. For
example, a face often appears in an image with two eyes
that are symmetric to each other, a nose, and a mouth. The
relationships between features can be represented by their
relative distances and positions. Facial features in an input
image are extracted first, and face candidates are identified
based on the coded rules. A verification process is usually
applied to reduce false detections.
One problem with this approach is the difficulty in
translating human knowledge into rules. If the rules are
detailed (i.e., strict), they may fail to detect faces that do not
pass all the rules. If the rules are too general, they may give
many false positives. Moreover, it is difficult to extend this
approach to detect faces in different poses since it is
challenging to enumerate all the possible cases. On the other
hand, heuristics about faces work well in detecting frontal
faces in uncluttered scenes.
Yang and Huang used a hierarchical knowledge-based
method to detect faces [170]. Their system consists of three
levels of rules. At the highest level, all possible face
candidates are found by scanning a window over the input
image and applying a set of rules at each location. The rules
at a higher level are general descriptions of what a face
looks like while the rules at lower levels rely on details of
facial features. A multiresolution hierarchy of images is
created by averaging and subsampling, and an example is
shown in Fig. 1. Examples of the coded rules used to locate
face candidates in the lowest resolution include: "the center
part of the face (the dark shaded parts in Fig. 2) has four
cells with a basically uniform intensity," "the upper round
part of a face (the light shaded parts in Fig. 2) has a basically
uniform intensity," and "the difference between the average
TABLE 1
Categorization of Methods for Face Detection in a Single Image
Fig. 1. (a) n = 1, original image. (b) n = 4. (c) n = 8. (d) n = 16. Original and corresponding low resolution images. Each square cell consists of
n × n pixels in which the intensity of each pixel is replaced by the average intensity of the pixels in that cell.

gray values of the center part and the upper round part is
significant." The lowest resolution (Level 1) image is
searched for face candidates and these are further processed
at finer resolutions. At Level 2, local histogram equalization
is performed on the face candidates received from Level 1,
followed by edge detection. Surviving candidate regions are
then examined at Level 3 with another set of rules that
respond to facial features such as the eyes and mouth.
Evaluated on a test set of 60 images, this system located
faces in 50 of the test images, while false alarms appeared in
28 images. One attractive feature of this
method is that a coarse-to-fine or focus-of-attention strategy
is used to reduce the required computation. Although it
does not result in a high detection rate, the ideas of using a
multiresolution hierarchy and rules to guide searches have
been used in later face detection works [81].
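The mosaic operation behind Fig. 1 is simple to state in code. The sketch below block-averages an image so that each n x n cell carries its mean intensity; it reproduces only the hierarchy construction, not Yang and Huang's rule sets.

```python
import numpy as np

def mosaic(image, n):
    """Replace each n x n cell with its average intensity (cf. Fig. 1)."""
    h, w = image.shape
    h, w = h - h % n, w - w % n          # crop to a multiple of n
    cells = image[:h, :w].reshape(h // n, n, w // n, n)
    means = cells.mean(axis=(1, 3))      # one mean per cell
    return np.repeat(np.repeat(means, n, axis=0), n, axis=1)

# Levels shown in Fig. 1: the original image plus n = 4, 8, 16 mosaics.
# pyramid = [image] + [mosaic(image, n) for n in (4, 8, 16)]
```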
Kotropoulos and Pitas [81] presented a rule-based
localization method which is similar to [71] and [170]. First,
facial features are located with a projection method that
Kanade successfully used to locate the boundary of a face [71].
Let $I(x, y)$ be the intensity value of an $m \times n$ image at position $(x, y)$; the horizontal and vertical projections of the image are defined as $HI(x) = \sum_{y=1}^{n} I(x, y)$ and $VI(y) = \sum_{x=1}^{m} I(x, y)$.
The horizontal profile of an input image is obtained first, and
then the two local minima, determined by detecting abrupt
changes in HI, are said to correspond to the left and right side
of the head. Similarly, the vertical profile is obtained and the
local minima are determined for the locations of mouth lips,
nose tip, and eyes. These detected features constitute a facial
candidate. Fig. 3a shows one example where the boundaries
of the face correspond to the local minimum where abrupt
intensity changes occur. Subsequently, eyebrow/eyes, nos-
trils/nose, and the mouth detection rules are used to validate
these candidates. The proposed method has been tested using
a set of faces in frontal views extracted from the European
ACTS M2VTS (MultiModal Verification for Teleservices and
Security applications) database [116] which contains video
sequences of 37 different people. Each image sequence
contains only one face in a uniform background. Their
method provides correct face candidates in all tests. The
detection rate is 86.5 percent if successful detection is defined
as correctly identifying all facial features. Fig. 3b shows one
example in which it becomes difficult to locate a face in a
complex background using the horizontal and vertical
profiles. Furthermore, this method cannot readily detect
multiple faces as illustrated in Fig. 3c. Essentially, the
projection method can be effective if the window over
which it operates is suitably located to avoid misleading
interference.
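A minimal sketch of the projection step: compute HI and VI as defined above and flag sharp drops as candidate boundaries. Treating an "abrupt change" as a large negative first difference is an assumption of this sketch; [81] is not reduced to exactly this rule.

```python
import numpy as np

def profiles(image):
    """Horizontal and vertical projections of an intensity image.

    HI(x) sums each column over y; VI(y) sums each row over x,
    matching the definitions in the text (image indexed [y, x]).
    """
    HI = image.sum(axis=0)   # one value per x
    VI = image.sum(axis=1)   # one value per y
    return HI, VI

def abrupt_minima(profile, rel_thresh=0.2):
    """Indices where the profile drops sharply, taken as candidate
    face boundaries; the threshold rule is a stand-in for the
    detection of abrupt changes described in the text."""
    d = np.diff(profile.astype(float))
    thresh = rel_thresh * np.abs(d).max()
    return np.where(d < -thresh)[0]
```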
2.2 Bottom-Up Feature-Based Methods
In contrast to the knowledge-based top-down approach,
researchers have been trying to find invariant features of
faces for detection. The underlying assumption is based on
the observation that humans can effortlessly detect faces
and objects in different poses and lighting conditions and,
so, there must exist properties or features which are
invariant over these variabilities. Numerous methods have
been proposed to first detect facial features and then to infer
the presence of a face. Facial features such as eyebrows,
eyes, nose, mouth, and hair-line are commonly extracted
using edge detectors. Based on the extracted features, a
statistical model is built to describe their relationships and
to verify the existence of a face. One problem with these
feature-based algorithms is that the image features can be
severely corrupted due to illumination, noise, and occlu-
sion. Feature boundaries can be weakened for faces, while
shadows can cause numerous strong edges which together
render perceptual grouping algorithms useless.
2.2.1 Facial Features
Sirohey proposed a localization method to segment a face
from a cluttered background for face identification [145]. It
uses an edge map (Canny detector [15]) and heuristics to
remove and group edges so that only the ones on the face
contour are preserved. An ellipse is then fit to the boundary
between the head region and the background. This algorithm
achieves 80 percent accuracy on a database of 48 images with
cluttered backgrounds. Instead of using edges, Chetverikov
Fig. 2. A typical face used in knowledge-based top-down methods:
Rules are coded based on human knowledge about the characteristics
(e.g., intensity distribution and difference) of the facial regions [170].
Fig. 3. (a) and (b) n = 8. (c) n = 4. Horizontal and vertical profiles. It is feasible to detect a single face by searching for the peaks in horizontal and
vertical profiles. However, the same method has difficulty detecting faces in complex backgrounds or multiple faces as shown in (b) and (c).

and Lerch presented a simple face detection method using
blobs and streaks (linear sequences of similarly oriented
edges) [20]. Their face model consists of two dark blobs and
three light blobs to represent eyes, cheekbones, and nose. The
model uses streaks to represent the outlines of the faces,
eyebrows, and lips. Two triangular configurations are
utilized to encode the spatial relationship among the blobs.
A low resolution Laplacian image is generated to facilitate
blob detection. Next, the image is scanned to find specific
triangular occurrences as candidates. A face is detected if
streaks are identified around a candidate.
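A minimal sketch of the blob-detection stage, assuming a block-averaged low-resolution image and a discrete Laplacian: strong positive responses mark dark blobs (local intensity minima) and strong negative responses mark light blobs. The cell size and threshold are illustrative assumptions; the triangular-configuration matching and streak detection are omitted.

```python
import numpy as np
from scipy import ndimage

def blob_candidates(image, cell=8, thresh=10.0):
    """Dark and light blob candidates from a low-resolution Laplacian."""
    # Low-resolution image: average over cell x cell blocks.
    h, w = image.shape
    h, w = h - h % cell, w - w % cell
    small = (image[:h, :w]
             .reshape(h // cell, cell, w // cell, cell)
             .mean(axis=(1, 3)))
    lap = ndimage.laplace(small)
    dark = np.argwhere(lap > thresh)    # intensity minima: eyes, etc.
    light = np.argwhere(lap < -thresh)  # intensity maxima: cheekbones, nose
    return dark, light
```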
Graf et al. developed a method to locate facial features
and faces in gray scale images [54]. After band pass
filtering, morphological operations are applied to enhance
regions with high intensity that have certain shapes (e.g.,
eyes). The histogram of the processed image typically
exhibits a prominent peak. Based on the peak value and its
width, adaptive threshold values are selected in order to
generate two binarized images. Connected components are
identified in both binarized images to identify the areas of
candidate facial features. Combinations of such areas are
then evaluated with classifiers, to determine whether and
where a face is present. Their method has been tested with
head-shoulder images of 40 individuals and with five video
sequences where each sequence consists of 100 to
200 frames. However, it is not clear how morphological
operations are performed and how the candidate facial
features are combined to locate a face.
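Since the paper leaves the morphological details open, the following sketch covers only the histogram-driven thresholding: locate the dominant peak, estimate its width as the span above half the peak height, and binarize at two thresholds around it. Both the width estimate and the threshold placement are assumptions of this sketch.

```python
import numpy as np

def adaptive_binarize(image):
    """Two binarized images from thresholds placed around the dominant
    histogram peak (cf. Graf et al. [54]); threshold placement here is
    an illustrative assumption, not the rule used in the paper."""
    hist, edges = np.histogram(image, bins=256, range=(0, 256))
    p = int(np.argmax(hist))
    half = hist[p] / 2.0
    lo, hi = p, p
    while lo > 0 and hist[lo] > half:    # walk down to half maximum
        lo -= 1
    while hi < 255 and hist[hi] > half:  # walk up to half maximum
        hi += 1
    t_low, t_high = edges[lo], edges[hi + 1]
    # Connected components of each binary image then become the
    # candidate facial-feature areas described in the text.
    return image > t_low, image > t_high
```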
Leung et al. developed a probabilistic method to locate a
face in a cluttered scene based on local feature detectors and
random graph matching [87]. Their motivation is to formulate
the face localization problem as a search problem in which the
goal is to find the arrangement of certain facial features that is
most likely to be a face pattern. Five features (two eyes, two
nostrils, and nose/lip junction) are used to describe a typical
face. For any pair of facial features of the same type (e.g., left-
eye, right-eye pair), their relative distance is computed, and
over an ensemble of images the distances are modeled by a
Gaussian distribution. A facial template is defined by
averaging the responses to a set of multiorientation, multi-
scale Gaussian derivative filters (at the pixels inside the facial
feature) over a number of faces in a data set. Given a test
image, candidate facial features are identified by matching
the filter response at each pixel against a template vector of
responses (similar to correlation in spirit). The top two feature
candidates with the strongest response are selected to search
for the other facial features. Since the facial features cannot
appear in arbitrary arrangements, the expected locations of
the other features are estimated using a statistical model of
mutual distances. Furthermore, the covariance of the esti-
mates can be computed. Thus, the expected feature locations
can be estimated with high probability. Constellations are
then formed only from candidates that lie inside the
appropriate locations, and the most face-like constellation is
determined. Finding the best constellation is formulated as a
random graph matching problem in which the nodes of the
graph correspond to features on a face, and the arcs represent
the distances between different features. Ranking of
constellations is based on a probability density function that
a constellation corresponds to a face versus the probability it
was generated by an alternative mechanism (i.e., nonface).
They used a set of 150 images for experiments in which a face
is considered correctly detected if any constellation correctly
locates three or more features on the faces. This system is able
to achieve a correct localization rate of 86 percent.
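A minimal sketch of the distance-modeling component of this approach: fit a Gaussian to one pairwise distance (say, left eye to right eye) over training faces, then score an observed pair by its log-likelihood under that model. The function names are illustrative, and the full constellation search via random graph matching is omitted.

```python
import numpy as np

def fit_pair_distance(points_a, points_b):
    """Gaussian (mean, std) of the distance between two feature types
    (e.g., left-eye and right-eye locations) over an ensemble of faces."""
    d = np.linalg.norm(np.asarray(points_a, float) -
                       np.asarray(points_b, float), axis=1)
    return d.mean(), d.std()

def pair_log_likelihood(p, q, mu, sigma):
    """Log Gaussian density of the observed distance; constellations
    can be ranked by summing such terms over their feature pairs."""
    d = np.linalg.norm(np.asarray(p, float) - np.asarray(q, float))
    return (-0.5 * ((d - mu) / sigma) ** 2
            - np.log(sigma * np.sqrt(2.0 * np.pi)))
```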
Instead of using mutual distances to describe the
relationships between facial features in constellations, an
alternative method for modeling faces was also proposed
by Leung et al. [13], [88]. The representation and
ranking of the constellations is accomplished using the
statistical theory of shape, developed by Kendall [75] and
Mardia and Dryden [95]. The shape statistics are a joint probability density function over $N$ feature points, represented by $(x_i, y_i)$ for the $i$th feature, under the assumption that the original feature points are positioned in the plane according to a general $2N$-dimensional Gaussian distribution. They applied the same maximum-likelihood (ML)
method to determine the location of a face. One advantage
of these methods is that partially occluded faces can be
located. However, it is unclear whether these methods can
be adapted to detect multiple faces effectively in a scene.
In [177], [178], Yow and Cipolla presented a feature-
based method that uses a large amount of evidence from the
visual image together with contextual evidence. The first stage
applies a second derivative Gaussian filter, elongated at an
aspect ratio of three to one, to a raw image. Interest points,
detected at the local maxima in the filter response, indicate
the possible locations of facial features. The second stage
examines the edges around these interest points and groups
them into regions. The perceptual grouping of edges is
based on their proximity and similarity in orientation and
strength. Measurements of a region's characteristics, such as
edge length, edge strength, and intensity variance, are
computed and stored in a feature vector. From the training
data of facial features, the mean and covariance matrix of
each facial feature vector are computed. An image region
becomes a valid facial feature candidate if the Mahalanobis
distance between the corresponding feature vectors is
below a threshold. The labeled features are further grouped
based on model knowledge of where they should occur
with respect to each other. Each facial feature and grouping
is then evaluated using a Bayesian network. One attractive
aspect is that this method can detect faces at different
orientations and poses. The overall detection rate on a test
set of 110 images of faces with different scales, orientations,
and viewpoints is 85 percent [179]. However, the reported
false detection rate is 28 percent and the implementation is
only effective for faces larger than 60 × 60 pixels. Subse-
quently, this approach has been enhanced with active
contour models [22], [179]. Fig. 4 summarizes their feature-
based face detection method.
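The Mahalanobis gating step follows directly from the description above. In this sketch the feature vector holds the region measurements named in the text (edge length, edge strength, intensity variance); the distance threshold of 3 is an assumption.

```python
import numpy as np

def fit_feature_model(training_vectors):
    """Mean and covariance of a facial feature's measurement vectors
    (edge length, edge strength, intensity variance, ...) computed
    from training data, as described in the text."""
    X = np.asarray(training_vectors, dtype=float)
    return X.mean(axis=0), np.cov(X, rowvar=False)

def is_candidate(vec, mean, cov, thresh=3.0):
    """Accept a region as a facial-feature candidate if the Mahalanobis
    distance between its feature vector and the learned model is below
    a threshold (the value 3.0 is an illustrative assumption)."""
    diff = np.asarray(vec, dtype=float) - mean
    d2 = diff @ np.linalg.inv(cov) @ diff
    return np.sqrt(d2) < thresh
```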
Takacs and Wechsler described a biologically motivated
face localization method based on a model of retinal feature
extraction and small oscillatory eye movements [157]. Their
algorithm operates on the conspicuity map or region of
interest, with a retina lattice modeled after the magnocel-
lular ganglion cells in the human vision system. The first
phase computes a coarse scan of the image to estimate the
location of the face, based on the filter responses of
receptive fields. Each receptive field consists of a number
