
Combination of Feature Extraction Methods for SVM Pedestrian Detection

TL;DR: A components-based learning approach is proposed in order to better deal with pedestrian variability, illumination conditions, partial occlusions, and rotations, and the results suggest a combination of feature extraction methods as an essential clue for enhanced detection performance.
Abstract: This paper describes a comprehensive combination of feature extraction methods for vision-based pedestrian detection in Intelligent Transportation Systems. The basic components of pedestrians are first located in the image and then combined with a support-vector-machine-based classifier. This poses the problem of pedestrian detection in real cluttered road images. Candidate pedestrians are located using a subtractive clustering attention mechanism based on stereo vision. A components-based learning approach is proposed in order to better deal with pedestrian variability, illumination conditions, partial occlusions, and rotations. Extensive comparisons have been carried out using different feature extraction methods as a key to image understanding in real traffic conditions. A database containing thousands of pedestrian samples extracted from real traffic images has been created for learning purposes at either daytime or nighttime. The results achieved to date show interesting conclusions that suggest a combination of feature extraction methods as an essential clue for enhanced detection performance.

Summary (4 min read)

I. INTRODUCTION

  • THIS PAPER describes a comprehensive combination of feature extraction methods for vision-based pedestrian detection in Intelligent Transportation Systems (ITS).
  • The use of infrared cameras is quite an expensive option that makes mass production an intractable problem nowadays, especially for the case of stereo vision systems where two cameras are needed.
  • Some authors have demonstrated that the recognition of pedestrians by components is more effective than the recognition of the entire body [10] , [21] .
  • For this purpose, several feature extraction methods have been implemented, compared, and combined.
  • The implementation and comparative results achieved to date are presented and discussed in Section VI.

II. CANDIDATE SELECTION

  • An efficient candidate selection mechanism is a crucial factor in the global performance of the pedestrian detection system.
  • In addition, the computation of accurate disparity maps requires fine grain texture images in order to avoid noise generation.
  • This implies managing very little information to detect obstacles, which may work well for big object detection, such as vehicles [26] , but might not be enough for small thin object detection, such as pedestrians.
  • Conversely, the authors propose a candidate selection method based on the direct computation of the 3-D coordinates of relevant points in the scene.
  • A major advantage is that outliers can be easily filtered out in 3-D space, which makes the method less sensitive to noise.

A. Three-Dimensional Computation of Relevant Points

  • The 3-D representation of relevant points in the scene is computed in two stages.
  • Features such as heads, arms, and legs are distinguishable, when visible, and are not heavily affected by different colors or clothes.
  • The matching computational cost is further reduced in two ways.
  • An increase in the window size causes the performance to degrade due to occlusion regions and smoothing of disparity values across boundaries.
  • Finally, an XZ map (bird's eye view of the 3-D scene) is filtered following a neighborhood criterion.

B. Subtractive Clustering

  • Data clustering techniques are related to the partitioning of a data set into several groups in such a way that the similarity within a group is larger than that among groups.
  • Objects in the 3-D space are roughly modeled by means of Gaussian functions.
  • This point is selected as the cluster center at the current iteration of the algorithm.
  • After applying subtractive clustering to a set of input data, each cluster finally represents a candidate.
  • 4) Densities are corrected according to (5).

C. Multicandidate (MC) Generation

  • In practice, a multiple candidate selection strategy has been implemented.
  • Accordingly, several candidates are generated for each candidate cluster by slightly shifting the original candidate bounding box in the u and v axes in the image plane.
  • A major benefit derived from the MC approach is the fact that the classification performance of pedestrians at long distance increases.
  • Fig. 3 depicts typical images from their test sequences.
  • The number below the bounding box represents range.

A. Component-Based Approach

  • There are some important aspects that need to be addressed when constructing a classifier, such as the global classification structure and the use of single or multiple cascaded classifiers.
  • The first decision to make implies the development of a holistic classifier against a component-based approach.
  • The component-based approach suggests the division of the candidate body into several parts over which features are computed.
  • Thus, the first subregion is located in the zone where the head would be.
  • This subregion is particularly useful to recognize stationary pedestrians.

B. Combination of Feature Extraction Methods

  • The choice of the most appropriate features for pedestrian characterization remains a challenging problem nowadays since recognition performance depends crucially on the features that are used to represent pedestrians.
  • There seems then to be an optimal feature extraction method for each candidate subregion.
  • The comparison among the results achieved in the four experiments yields the final combination of features used in this paper: head-NTU; arms-Histogram; legs-HON; between-the-legs-NTU.
  • The increase in performance due to the use of the proposed optimal combination of feature extraction methods is illustrated in Section VI.

A. Training Strategy

  • The first step in the design of the training strategy is to create representative databases for learning and testing.
  • The following considerations must be taken into account when creating the training and test sets.
  • The ratio between positive and negative samples has to be set to an appropriate value.
  • It is clear that daytime and nighttime samples must be kept separate in order to create multiple specialized classifiers.
  • Pedestrians intersecting the vehicle trajectory from the sides are usually easier to recognize since their legs are clearly visible and distinguishable.

B. Classifier Structure

  • In the first stage of the classifier, features computed over each individual fixed subregion are fed to the input of individual SVM classifiers.
  • Thus, there are six individual SVM classifiers corresponding to the six candidate subregions.
  • Two different methods have been tested to carry out this operation (a minimal sketch of both appears after this list).
  • The second method that has been tested to implement the second stage of the classifier relies on the use of another SVM classifier.
  • Additionally, an optimal kernel selection for the SVM classifiers has been performed.
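
A minimal sketch of this two-stage structure is given below: six per-subregion SVMs are trained, and their signed distances to the separating hyperplane are combined either by the simple-distance criterion (sum plus threshold) or by a second SVM trained on the six first-stage outputs. The stand-in feature data, kernels, and threshold value are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of the two-stage classifier structure (illustrative only).
# Assumes per-subregion feature vectors are already extracted; the random
# data below merely stands in for real pedestrian/non-pedestrian features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_subregions, n_train, n_feat = 6, 200, 64

# Stand-in training data: one feature matrix per subregion, shared labels.
X_parts = [rng.normal(size=(n_train, n_feat)) for _ in range(n_subregions)]
y = rng.integers(0, 2, size=n_train) * 2 - 1          # +1 pedestrian, -1 otherwise

# First stage: one SVM per candidate subregion.
stage1 = [SVC(kernel="rbf").fit(Xp, y) for Xp in X_parts]

def first_stage_outputs(X_parts_sample):
    """Signed distances to the separating hyperplane, one per subregion."""
    return np.array([clf.decision_function(Xp)
                     for clf, Xp in zip(stage1, X_parts_sample)]).T   # (n, 6)

# Second stage, option 1: simple-distance criterion (sum plus threshold).
def classify_sum(X_parts_sample, threshold=0.0):       # threshold is an assumption
    return np.where(first_stage_outputs(X_parts_sample).sum(axis=1) > threshold, 1, -1)

# Second stage, option 2: another SVM trained on the six first-stage outputs.
stage2 = SVC(kernel="linear").fit(first_stage_outputs(X_parts), y)

def classify_2svm(X_parts_sample):
    return stage2.predict(first_stage_outputs(X_parts_sample))

print(classify_sum(X_parts)[:5], classify_2svm(X_parts)[:5])
```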

V. MULTIFRAME VALIDATION AND TRACKING

  • Once candidates are validated by the SVM classifier, a tracking stage takes place.
  • For this purpose, detection results are temporally accumulated.
  • The multiframe validation and tracking algorithm relies on Kalman filter theory to provide spatial estimates of detected pedestrians and Bayesian probability to provide an estimate of pedestrian detection certainty over time (a minimal sketch follows this list).
  • A pedestrian entering the pretracking stage must be validated in several iterations before entering the tracking stage.
  • Once a precandidate is validated, pretracking stops, and tracking starts.
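
As a rough illustration of this stage, the sketch below combines a constant-velocity Kalman filter for the candidate's position with a Bayesian update of detection certainty over consecutive frames. The state model, noise covariances, likelihood values, and frame period are assumptions for illustration; the paper's exact pretracking and tracking formulation is not reproduced here.

```python
# Minimal sketch of multiframe validation: a constant-velocity Kalman filter
# for the candidate's (x, z) position plus a Bayesian update of detection
# certainty. All numeric parameters below are illustrative assumptions.
import numpy as np

dt = 0.05                                    # assumed frame period (20 frames/s)
F = np.block([[np.eye(2), dt * np.eye(2)],   # state: [x, z, vx, vz]
              [np.zeros((2, 2)), np.eye(2)]])
H = np.hstack([np.eye(2), np.zeros((2, 2))])       # only position is measured
Q = 0.01 * np.eye(4)                               # process noise (assumed)
R = 0.10 * np.eye(2)                               # measurement noise (assumed)

def kalman_step(x, P, z):
    x, P = F @ x, F @ P @ F.T + Q                  # predict
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x = x + K @ (z - H @ x)                        # update with measurement z
    P = (np.eye(4) - K @ H) @ P
    return x, P

def bayes_update(p, detected, p_det=0.9, p_false=0.2):
    """Posterior probability that the track is a pedestrian after one frame."""
    like_ped = p_det if detected else 1.0 - p_det
    like_bg = p_false if detected else 1.0 - p_false
    return like_ped * p / (like_ped * p + like_bg * (1.0 - p))

x, P, prob = np.array([0.0, 10.0, 0.0, 0.0]), np.eye(4), 0.5
for frame in range(10):
    z = np.array([0.0, 10.0 - 0.05 * frame])       # synthetic measurement
    x, P = kalman_step(x, P, z)
    prob = bayes_update(prob, detected=True)
print(x[:2], round(prob, 3))                       # position estimate, certainty
```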

VI. EXPERIMENTAL RESULTS

  • The system was implemented on a Pentium IV PC at 2.4 GHz running the Knoppix GNU/Linux Operating System and Libsvm libraries [35] .
  • Using 320 × 240 pixel images, the complete algorithm runs at an average rate of 20 frames/s, depending on the number of pedestrians being tracked and their position.
  • Accordingly, negative samples (nonpedestrian samples) in the training sets were neither randomly nor manually selected.
  • In the following sections, the results are compared and assessed using DR under certain FPRs.
  • The selection of the FPR value has been made to show performance in representative points where differences between curves can be optimally appreciated.

A. Holistic versus Component-Based

  • A first comparison is made in order to state the best performing approach among the holistic and component-based options.
  • In particular, the training and test sets were designed to contain 10 000 and 3670 samples, respectively.
  • As depicted in Fig. 5 , the performance of the holistic approach for all feature extraction methods is largely improved in the component-based approach.
  • The Haar Wavelet is again below those figures.
  • This shows that breaking the pedestrian into smaller pieces and specifically training the SVM for these pieces reduces the variability and lets the SVM generalize the models much better.

B. Combination of Optimal Features

  • These results can further be improved by combining different feature extraction methods for different candidate subregions.
  • The best performing features for each subregion are combined in a second classifier instead of applying the same feature extractor to all six subregions.
  • The authors used the same training and test sets as in Section VI-A. Fig. 6 (a)-(f) shows the ROC curves for each separate subregion after computing the seven predefined features.
  • As concluded in Section III-B, the selection of optimal features for each subregion is carried out as follows: head-NTU, arms-Histogram, legs-HON, between-the-legs-NTU.
  • These results improve on the Canny detector, the best performing single feature extractor (under the conditions of the experiment described in Section VI-A), which exhibits a DR of 95% at an FPR of 2%.

C. Influence of the Second-Stage Classifier

  • Another comparison has been studied in order to analyze the influence of the second-stage classifier that combines the information delivered by the six specifically trained SVM models.
  • In the first approach, the authors have used a simple-distance criterion (i.e., distance to the hyperplane separating pedestrians from nonpedestrians) that computes the addition of the six first-stage SVM outputs and then decides the classification by setting a threshold.
  • Another option has been tested by training a two-stage SVM (2-SVM).
  • Once again, the same training and test sets as in Section VI-A were used in this experiment.
  • The results achieved to date show that the simple-distance criterion clearly outperforms the 2-SVM classifier, as depicted in Fig. 7(b) , where a comparison between both methods is shown when optimal feature extraction methods are applied.

D. Effect of Illumination Conditions and Candidate Size

  • The need of separate training sets for day, night, and different candidate sizes is analyzed in this section.
  • The purpose of this experiment is to analyze the performance of nighttime classification using a global daytime classifier.
  • Fig. 8(b) shows that nighttime pedestrian detection is not accurate when training is carried out using daytime samples (DR is between 23% and 70% at an FPR of 10%).
  • In the next experiment, three different SVM classifiers were trained using sets DS, DL, and G, respectively.
  • The results are illustrated in Fig. 9(a) .

E. Effect of Bounding Box Accuracy

  • The accuracy exhibited in bounding candidates is limited, and in fact, a multiple-hypothesis generation for each detected candidate is encouraged to boost classifier performance, as described in Section II-C. Fig. 10 (a) depicts the performance obtained after testing a set of badly bounded samples using a classifier trained on badly bounded samples.
  • All methods exhibit much worse figures since none of the proposed extractors succeed in providing a DR above 83% (for the case of HON, which is the best performing one) at an FPR of 5%.
  • The analysis of these results suggests that choosing the optimal feature extraction methods just in terms of DR and FPR can lead, in practice, to a decrease in recognition performance.
  • Additionally, an MC generation stage has been developed in order to generate several candidates for each originally selected hypothesis to at least assure some well-fitted candidates that match the samples used for training.

F. Global Performance

  • Some of the sequences were acquired in urban environments and others in nonurban areas.
  • The purpose of this evaluation is to assess the combined operation of the attention mechanism and the SVM-based classifier, including the MC generation strategy, and a multiframe validation stage using Kalman filtering.
  • Similarly, the DR is 93.24% in urban environments, where ten pedestrians were missed by the system.
  • Concerning nonurban environments, three pedestrians were missed by the system in 72 min of operation.
  • As happens in urban environments, false alarms are caused by real objects.


Combination of Feature Extraction Methods
for SVM Pedestrian Detection
Ignacio Parra Alonso, David Fernández Llorca, Miguel Ángel Sotelo, Member, IEEE,
Luis M. Bergasa, Associate Member, IEEE, Pedro Revenga de Toro, Jesús Nuevo,
Manuel Ocaña, and Miguel Ángel García Garrido
Abstract—This paper describes a comprehensive combination
of feature extraction methods for vision-based pedestrian detec-
tion in Intelligent Transportation Systems. The basic components
of pedestrians are first located in the image and then combined
with a support-vector-machine-based classifier. This poses the
problem of pedestrian detection in real cluttered road images.
Candidate pedestrians are located using a subtractive clustering
attention mechanism based on stereo vision. A components-based
learning approach is proposed in order to better deal with pedes-
trian variability, illumination conditions, partial occlusions, and
rotations. Extensive comparisons have been carried out using dif-
ferent feature extraction methods as a key to image understanding
in real traffic conditions. A database containing thousands of
pedestrian samples extracted from real traffic images has been
created for learning purposes at either daytime or nighttime. The
results achieved to date show interesting conclusions that suggest
a combination of feature extraction methods as an essential clue
for enhanced detection performance.
Index Terms—Features combination, pedestrian detection,
stereo vision, subtractive clustering, support vector machine
(SVM) classifier.
I. INTRODUCTION

THIS PAPER describes a comprehensive combination of
feature extraction methods for vision-based pedestrian
detection in Intelligent Transportation Systems (ITS). Vision-
based pedestrian detection is a challenging problem in real
traffic scenarios since pedestrian detection must perform ro-
bustly under variable illumination conditions, variable rotated
positions and pose, and even if some of the pedestrian parts or
limbs are partially occluded. An additional difficulty is given
by the fact that the camera is installed on a fast-moving vehicle.
Manuscript received February 20, 2006; revised July 12, 2006, October 3,
2006, and December 11, 2006. This work was supported in part by the
Spanish Ministry of Education and Science under Grants DPI2002-04064-
C05-04 and DPI2005-07980-C03-02 and in part by the Spanish Ministry of
Public Works under Grant FOM2002-002. The Associate Editor for this paper
was N. Papanikolopoulos.
The authors are with the Department of Electronics, Escuela Politécnica
Superior, University of Alcalá, Madrid 28801, Spain (e-mail: parra@depeca.
uah.es; llorca@depeca.uah.es; miguel.sotelo@uah.es; bergasa@depeca.uah.es;
revenga@depeca.uah.es; jnuevo@depeca.uah.es; mocana@depeca.uah.es;
garrido@depeca.uah.es).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TITS.2007.894194
As a consequence of this, the background is no longer static,
and pedestrians significantly vary in scale. This makes the
problem of pedestrian detection for ITS quite different from that
of detecting and tracking people in the context of surveillance
applications, where the cameras are fixed and the background
is stationary.
To ease the pedestrian recognition task in vision-based sys-
tems, a candidate selection mechanism is normally applied.
The selection of candidates can be implemented by performing
an object segmentation in either a 3-D scene or a 2-D image
plane. Not many authors have tackled the problem of monocular
pedestrian recognition [1]–[3]. The advantages of the monocu-
lar solution are well known. It constitutes a cheap solution that
makes mass production a viable option for car manufacturers.
Monocular systems are less demanding from the computational
point of view and ease the calibration maintenance process. On
the contrary, the main problem with candidate selection mecha-
nisms in monocular systems is that, on average, they are bound
to yield a large amount of candidates per frame in order to
ensure a low false negative ratio (i.e., the number of pedestrians
that are not selected by the attention mechanism). Another
problem in monocular systems is the fact that depth cues are lost
unless some constraints are applied, such as the flat terrain as-
sumption, which is not always applicable. These problems can
be easily overcome by using stereo vision systems, although
other problems arise such as the need to maintain calibration
and the high computational cost required to implement dense
algorithms.
In this paper, we present a full solution for pedestrian detec-
tion at daytime, which is also applicable, although constrained,
to nighttime driving. Other systems already exist for pedestrian
detection using infrared images [4]–[6] and infrared stereo [7].
Nighttime detection is usually carried out using infrared cam-
eras as long as they provide better visibility at night and under
adverse weather conditions. However, the use of infrared cam-
eras is quite an expensive option that makes mass production an
intractable problem nowadays, especially for the case of stereo
vision systems where two cameras are needed. They provide
images that strongly depend on both weather conditions and the
season of the year. Additionally, infrared cameras (considered
as a monocular system) do not provide depth information and
need periodic recalibration (normally once a year). In prin-
ciple, the algorithm described in this paper has been tested
using cameras in the visible spectrum. Nonetheless, as soon
as the technology for night-vision camera production becomes

cheaper, the results could easily be extended to a stereo night-
vision system.
Concerning the various approaches proposed in the literature,
most of them are based on shape analysis. Some authors
use feature-based techniques, such as recognition by vertical
linear features, symmetry, and human templates [2], [8], Haar
wavelet representation [9], [10], hierarchical shape templates
on Chamfer distance [3], [11], correlation with probabilistic
human templates [12], sparse Gabor filters and support vector
machines (SVMs) [13], graph kernels [14], motion analysis
[15], [16], and principal component analysis [17]. Neural-
network-based classifiers [18] and convolutional neural net-
works [19] are also considered by some authors. In [4], an
interesting discussion is presented about the use of binary or
gray-level images as well as the use of the so-called hotspots
in infrared images versus the use of the whole candidate region
containing both the human body and the road. Using single or
multiple classifiers is another topic of study. As experimentally
demonstrated in this paper and supported by other authors [1],
[4], [20], the option of multiple classifiers is definitely needed.
Another crucial factor, which is not well documented in the
literature, is the effect of pedestrian bounding box accuracy.
Candidate selection mechanisms tend to produce pedestrian
candidates that are not exactly similar to the pedestrian ex-
amples that were used for training in the sense that online
candidates extracted by the attention mechanism may contain
some part of the ground or may cut the pedestrians’ feet,
arms, or heads. This results in significant differences between
candidates and examples. As a consequence, a decrease in
Detection Rate (DR) takes place. The use of multiple classifiers
can also provide a means to cope with day and nighttime scenes,
variable pose, and nonentire pedestrians (when they are very
close to the cameras). In sum, a single classifier cannot be
expected to robustly deal with the whole classification problem.
In the last years, SVMs have been widely used by many
researchers [1], [9], [10], [20], [21] as they provide a supervised
learning approach for object recognition as well as a separation
between two classes of objects. This is particularly useful for
the case of pedestrian recognition. Combinations of shape and
motion are used as an alternative to improve the classifier
robustness [1], [22]. Some authors have demonstrated that the
recognition of pedestrians by components is more effective than
the recognition of the entire body [10], [21]. In our approach,
the basic components of pedestrians are first located in the
image and then combined with an SVM-based classifier. The
pedestrian searching space is reduced in an intelligent manner
to increase the performance of the detection module. Accord-
ingly, road lane markings are detected and used as the main
guidelines that drive the pedestrian searching process. The area
contained by the limits of the lanes determines the zone of the
real 3-D scene from which pedestrians are searched. In the case
where no lane markings are detected, a basic area of interest is
used instead, covering the front part ahead of the ego-vehicle.
A description of the lane marking detection system is provided
in [23]. The authors have also developed lane tracking systems
for unmarked roads [24], [25] in the past. Nonetheless, a key
problem is to find out the most discriminating features in order
to significantly represent pedestrians. For this purpose, several
feature extraction methods have been implemented, compared,
and combined. While a large amount of effort in the literature
is dedicated to developing more powerful learning machines,
the choice of the most appropriate features for pedestrian
characterization remains a challenging problem nowadays to
such an extent that it is still uncertain how the human brain
performs pedestrian recognition using visual information. An
extensive study of feature extraction methods is therefore a
worthwhile topic for a more comprehensive approach to image
understanding.
The rest of the paper is organized as follows: Section II
provides a description of the candidate selection mecha-
nism. Section III describes the component-based approach and
the optimal combination of feature extraction methods. In
Section IV, the SVM-based pedestrian classification system is
presented. In Section V, the multiframe validation and track-
ing system is described. The implementation and compara-
tive results achieved to date are presented and discussed in
Section VI. Finally, Section VII summarizes the conclusions
and future work.
II. CANDIDATE SELECTION
An efficient candidate selection mechanism is a crucial
factor in the global performance of the pedestrian detection
system. The candidate selection method must assure that no
misdetection occurs. Candidates, which are usually described
by a bounding box in the image plane, must be detected
as precisely as possible since the detection accuracy has a
remarkable effect on the performance of the recognition stage,
as demonstrated in Section VI. In order to extract information
from the 3-D scene, most authors use disparity map techniques
[18] as well as segmentation based on v-disparity [20], [26].
The use of disparity-based techniques is likely to yield useful
results in open roadways. However, depth disparity clues are
unlikely to be useful for segmenting out pedestrians in city
traffic due to the heavy disparity clutter. We disregarded this
option because of the disadvantages associated with disparity
computation algorithms, since the image pair has to be rectified
prior to the disparity map generation to ensure good corre-
spondence matching. In addition, the computation of accurate
disparity maps requires fine grain texture images in order to
avoid noise generation. Otherwise, disparity-based methods
are prone to produce many outliers that affect the segmentation
process. Concerning the v-disparity image, the information
for performing generic obstacles detection is defined with a
vertical line. This implies managing very little information to
detect obstacles, which may work well for big object detection,
such as vehicles [26], but might not be enough for small thin
object detection, such as pedestrians. Conversely, we propose
a candidate selection method based on the direct computation
of the 3-D coordinates of relevant points in the scene. Accord-
ingly, a nondense 3-D geometrical representation is created
and used for candidate segmentation purposes. This kind of
representation allows for robust object segmentation whenever
the number of relevant points in the image is high enough. A
major advantage is that outliers can be easily filtered out in 3-D
space, which makes the method less sensitive to noise.

Fig. 1. (Left) Two-dimensional points overlayed on left image. (Right) Three-dimensional coordinates of detected pixels.
A. Three-Dimensional Computation of Relevant Points
The 3-D representation of relevant points in the scene is
computed in two stages. In the first stage, the intensities of
the left and right images are normalized, and the radial and
tangential distortions are compensated for. Relevant points in
the image are extracted using a well-known Canny algorithm
with adaptive thresholds. Features such as heads, arms, and
legs are distinguishable, when visible, and are not heavily
affected by different colors or clothes. In the second stage, a 3-D
map is created after solving the correspondence problem. The
matching computational cost is further reduced in two ways.
First, the matching searching area is greatly decreased by using
the parameters of the fundamental matrix. Second, pixels in the
right image are considered for matching only if they are also
relevant points. Otherwise, they are discarded, and correlations
are not computed for that pixel. Computation time is abruptly
decreased while maintaining similar detection results. Among
the wide spectrum of matching techniques that can be used
to solve the correspondence problem, we implemented the
Zero Mean Normalized Cross Correlation [27] because of its
robustness. The Normalized Cross Correlation between two
image windows can be computed as follows:
$$\mathrm{ZMNCC}(p, p') = \frac{\displaystyle\sum_{i=-n}^{n}\sum_{j=-n}^{n} A \cdot B}{\sqrt{\displaystyle\sum_{i=-n}^{n}\sum_{j=-n}^{n} A^{2} \cdot \sum_{i=-n}^{n}\sum_{j=-n}^{n} B^{2}}} \qquad (1)$$

where A and B are defined by

$$A = I(x+i,\, y+j) - \bar{I}(x, y) \qquad (2)$$

$$B = I'(x'+i,\, y'+j) - \bar{I}'(x', y') \qquad (3)$$

where I(x, y) is the intensity level of the pixel with coordinates (x, y), and $\bar{I}(x, y)$ is the average intensity of a (2n + 1) × (2n + 1) window centered around that point. As the window
size decreases, the discriminatory power of the area-based
criterion is decreased, and some local maxima appear in the
searching regions. An increase in the window size causes the
performance to degrade due to occlusion regions and smoothing
of disparity values across boundaries. According to the previous
statements, a filtering criterion is needed in order to provide
outlier rejection. First, a selection of 3-D points within the
pedestrian searching area is carried out. Second, road surface
points as well as high points (points with a Y coordinate above
2 m) are removed. Finally, an XZ map (bird's eye view of the
3-D scene) is filtered following a neighborhood criterion. As
depicted in Fig. 1, the appearance of pedestrians in 3-D space
is represented by a uniformly distributed set of points.
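
A minimal sketch of the ZMNCC criterion in (1)-(3) is shown below. The window size n, the synthetic images, and the example pixel coordinates are arbitrary choices for illustration.

```python
# Minimal sketch of the ZMNCC area-based matching criterion of (1)-(3).
# Window size n and the test pixels are illustrative only.
import numpy as np

def zmncc(left, right, p, p_prime, n=5):
    """Zero Mean Normalized Cross Correlation between two (2n+1)x(2n+1) windows."""
    (x, y), (xp, yp) = p, p_prime
    A = left[y - n:y + n + 1, x - n:x + n + 1].astype(float)
    B = right[yp - n:yp + n + 1, xp - n:xp + n + 1].astype(float)
    A -= A.mean()                        # subtract the window means (zero mean)
    B -= B.mean()
    denom = np.sqrt((A ** 2).sum() * (B ** 2).sum())
    return (A * B).sum() / denom if denom > 0 else 0.0

left = np.random.default_rng(1).integers(0, 256, (240, 320))
right = np.roll(left, -4, axis=1)        # synthetic 4-pixel disparity
print(zmncc(left, right, p=(100, 120), p_prime=(96, 120)))
```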
B. Subtractive Clustering
Data clustering techniques are related to the partitioning of
a data set into several groups in such a way that the similarity
within a group is larger than that among groups. Normally, the
number of clusters is known beforehand. This is the case of
K-means-based algorithms. In this paper, the number of clusters
is considered unknown since no a priori estimate about the
number of pedestrians in the scene can be reasonably made. The
effects of outliers have to be reduced or completely removed,
and it is necessary to define specific space characteristics in order
to group different pedestrians in the scene. For these reasons,
a Subtractive Clustering method [28] is proposed, which is a
well-known approach in the field of Fuzzy Model Identification
Systems. Clustering is carried out in 3-D space based on a
density measure of data points. The idea is to find high-density
regions in 3-D space. Objects in the 3-D space are roughly
modeled by means of Gaussian functions. It implies that, in
principle, each Gaussian distribution represents a single object
in 3-D space. Nonetheless, objects that get too close to each
other can be modeled by the system as a single one and, thus,
represented by a single Gaussian distribution. The complete
representation is the addition of all Gaussian distributions found
in the 3-D reconstructed scene. Accordingly, the parameters of
the Gaussian functions are adapted by the clustering algorithm
to best represent the 3-D coordinates of the detected pixels. The
3-D coordinates of all detected pixels are then considered as
candidate cluster centers. Thus, each point p_i with coordinates
(x_i, y_i, z_i) is potentially a cluster center whose 3-D spatial
distribution D_i is given by the following equation:

$$D_i = \sum_{j=1}^{N} \exp\left( -\frac{(x_i - x_j)^2}{(r_{ax}/2)^2} - \frac{(y_i - y_j)^2}{(r_{ay}/2)^2} - \frac{(z_i - z_j)^2}{(r_{az}/2)^2} \right) \qquad (4)$$
where N represents the number of 3-D points contained in a
neighborhood defined by radii r_ax, r_ay, and r_az. Cluster shape can then be tuned by properly selecting the parameters r_ax, r_ay, and r_az. As can be observed, candidates p_i surrounded by a large number of points within the defined neighborhood will exhibit a high value of D_i. Points located at a distance well above the radius defined by (r_ax, r_ay, r_az) will have almost no influence over the value of D_i. Equation (4) is computed for all 3-D points measured by the stereovision algorithm. Let p_cl = (x_cl, y_cl, z_cl) represent the point exhibiting the maximum density, denoted by D_cl. This point is selected as the cluster center at the current iteration of the algorithm. The densities of all points D_i are corrected based on p_cl and D_cl. For this purpose, the subtraction represented as
$$D_i = D_i - D_{cl}\,\exp\left( -\frac{(x_i - x_{cl})^2}{(r_{bx}/2)^2} - \frac{(y_i - y_{cl})^2}{(r_{by}/2)^2} - \frac{(z_i - z_{cl})^2}{(r_{bz}/2)^2} \right) \qquad (5)$$
is computed for all points, where the parameters (r_bx, r_by, r_bz) define the neighborhood where the correction of point densities will have the largest influence. Normally, the parameters (r_bx, r_by, r_bz) are larger than (r_ax, r_ay, r_az) in order to prevent closely spaced cluster centers. Typically, r_bx = 1.5 r_ax, r_by = 1.5 r_ay, and r_bz = 1.5 r_az. In this paper, these parameters have been set to r_ax = r_az = 1 m, r_ay = 1.5 m, r_bx = r_bz = 1.5 m, and r_by = 2.25 m. After the subtraction process, the density corresponding to the cluster center p_cl gets strongly decreased. Similarly, densities corresponding to points in the neighborhood of p_cl also get decreased by an amount that is a function of the distance to p_cl. All these points are associated with the first cluster computed by the algorithm, which is represented by its center p_cl, and will have almost no effect in the next step of the subtractive clustering. After the correction of densities, a new cluster center p_cl,new is selected, which corresponds to the new density maximum D_cl,new, and the process is repeated whenever the condition expressed as

$$\text{if } U_{rel} > \frac{D_{cl}}{D_{cl,new}} \ \text{ and } \ D_{cl,new} > U_{min} \;\Rightarrow\; \text{new cluster} \qquad (6)$$

is met, where U_rel and U_min are experimentally tuned parameters that permit the establishment of a termination condition based on the relation between the previous cluster density and the new one, as well as a minimum value of the density function. In this paper, this parameter has been set to U_min = 40. The process is repeated until the termination condition given by (6) is not met. After applying subtractive clustering to a set of input data, each cluster finally represents a candidate. The algorithm can be summarized as follows.
1) The parameters (r_ax, r_ay, r_az) and (r_bx, r_by, r_bz) are initialized.
2) The densities of all points are computed using (4).
3) The point p_cl that exhibits the highest density value D_cl is selected as a cluster center.
4) Densities are corrected according to (5).
5) A new maximum density D_cl,new is computed.
6) If the condition given by (6) is met, a new cluster is considered, which is represented by its center p_cl,new, and the algorithm is resumed from Point 4. Otherwise, the algorithm is stopped.
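
The following sketch illustrates steps 1)-6) on a synthetic 3-D point cloud. The radii and U_min follow the values quoted above; U_rel is not given numerically in this text, so the value used here is a placeholder assumption, as are the synthetic data.

```python
# Minimal sketch of the subtractive clustering loop (steps 1-6). The radii and
# U_MIN follow the values quoted in the text; U_REL is a placeholder assumption.
import numpy as np

r_a = np.array([1.0, 1.5, 1.0])        # (r_ax, r_ay, r_az) in meters
r_b = 1.5 * r_a                        # (r_bx, r_by, r_bz)
U_MIN, U_REL = 40.0, 10.0              # U_REL value is an assumption

def densities(points, radii):
    """Density measure (4): Gaussian-weighted neighbor count for each point."""
    diff = points[:, None, :] - points[None, :, :]           # pairwise differences
    return np.exp(-(diff ** 2 / (radii / 2.0) ** 2).sum(axis=2)).sum(axis=1)

def subtractive_clustering(points):
    D = densities(points, r_a)                                # step 2
    centers, D_prev = [], None
    while True:
        idx = int(np.argmax(D))                               # steps 3 / 5
        D_new = D[idx]
        if D_prev is not None and not (U_REL > D_prev / D_new and D_new > U_MIN):
            break                                             # condition (6) fails
        centers.append(points[idx])
        # Step 4: subtract the influence of the new center from all densities.
        diff = points - points[idx]
        D = D - D_new * np.exp(-(diff ** 2 / (r_b / 2.0) ** 2).sum(axis=1))
        D_prev = D_new
    return np.array(centers)

# Two synthetic "pedestrians" plus background noise, in (x, y, z) meters.
rng = np.random.default_rng(0)
cloud = np.vstack([rng.normal([2, 1, 10], 0.2, (80, 3)),
                   rng.normal([-1, 1, 15], 0.2, (80, 3)),
                   rng.uniform(-5, 5, (20, 3))])
print(subtractive_clustering(cloud))
```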
Pedestrian candidates are then considered as the 2-D region
of interest (ROI) defined by the projection in the image plane of
the 3-D candidate regions. The number of candidates is bound
to change depending on traffic conditions, since some cars
can be considered as candidates by the subtractive clustering
algorithm.
C. Multicandidate (MC) Generation
In practice, a multiple candidate selection strategy has been
implemented. The purpose is to produce several candidates
around each selected cluster in an attempt to compensate for
the effect of the candidate bounding box accuracy in the recog-
nition step. Accordingly, several candidates are generated for
each candidate cluster by slightly shifting the original candidate
bounding box in the u and v axes in the image plane. The candi-
date selection method yields generic obstacles with a 3-D shape
that is similar to that of pedestrians. The 2-D candidates are then
produced by projecting the 3-D points over the left image and
computing their bounding box. Two bounding box limits are
defined, i.e., for the maximum and minimum values of width
and height, respectively, taking into account people taller than
2 m or shorter than 1 m. The 3-D candidate position is given
by the stereo-based candidate selection approach (subtractive
clustering), which provides the 3-D cluster center coordinates.
Nonetheless, the 2-D bounding box corresponding to a 3-D
candidate might not perfectly match the candidate appearance
in the image plane due to several effects: body parts that are
partially occluded or camouflaged with the background, 3-D
objects that have been subtracted together with a pedestrian
(for example, pedestrians beside traffic signals, trees, cars, etc.),
low contrast pedestrians represented by a low number of 3-D
points, etc. These badly bounded pedestrians will be classified
as nonpedestrians if the positive samples used to train the
classifier are well fitted. Let us note that this problem also
appears with 2-D candidate selection mechanisms [1] with the
additional drawback of losing the actual pedestrian depth.
Two strategies are proposed to solve the “bounding accuracy
effect.” The first one consists of training the classifier with
additional badly fitted pedestrians in an attempt to absorb either
the extra information due to large bounding boxes containing
part of the background or the loss of information due to small
bounding boxes in which part of the pedestrian is not visible. In
other words, the positive samples yielded by the candidate se-
lection method are included in the training set. For that purpose,

Fig. 2. MC generation approach. (a) Oversized and downsized windows.
(b) Spatial centers for each window. (c) Fifteen candidates are generated.
it is necessary to execute the candidate selection process with
offline validation to distinguish pedestrians from nonpedes-
trians. In [1] and [10], the same procedure is only applied to
nonpedestrian samples. The second strategy consists of per-
forming an MC generation for every extracted candidate, trying
to hit the target and add redundancy. Three window sizes are de-
fined: 1) the window size generated by the candidate selection
method; 2) a 20% oversized window; and 3) a 20% down-
sized one. These three windows are shifted five pixels in each
direction: top, down, left, and right. Thus, a total of 15 MCs are
generated for each original candidate, as depicted in Fig. 2.
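
A minimal sketch of this MC generation step is shown below; the example bounding box is arbitrary, and boxes are assumed to be given as a top-left corner plus width and height in image-plane pixels.

```python
# Minimal sketch of the multicandidate (MC) generation: the original bounding
# box, a 20% oversized and a 20% downsized version, each at the unshifted
# position and shifted 5 pixels up, down, left, and right (3 x 5 = 15 windows).
def mc_generation(box, scale=0.20, shift=5):
    """box = (u, v, w, h): top-left corner and size in the image plane."""
    u, v, w, h = box
    candidates = []
    for s in (1.0, 1.0 + scale, 1.0 - scale):                  # three window sizes
        ws, hs = round(w * s), round(h * s)
        uc, vc = u + (w - ws) // 2, v + (h - hs) // 2          # keep the box centered
        for du, dv in ((0, 0), (0, -shift), (0, shift), (-shift, 0), (shift, 0)):
            candidates.append((uc + du, vc + dv, ws, hs))
    return candidates

boxes = mc_generation((120, 60, 24, 72))
print(len(boxes), boxes[:3])        # 15 candidate windows per original candidate
```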
A majority criterion is followed in order to validate a pedes-
trian. Thus, the MC strategy yields a pedestrian if more than five
candidates are classified as pedestrians. This number has been defined
after extensive experiments. On average, the candidate selection
mechanism generates six windows per frame, which yields a
total of 90 candidates per frame after the MC process. In case
the number of candidates generated by the attention mechanism
increases abruptly, the MC approach might become impractical.
A major benefit derived from the MC approach is the fact that
the classification performance of pedestrians at long distance
increases. Fig. 3 depicts typical images from our test sequences.
The number below the bounding box represents range. The
rightmost image shows a motorcyclist that is detected as a
pedestrian (false positive). In the leftmost image, two kids are
properly detected, and their range is correctly measured.
III. FEATURE EXTRACTION
The optimal selection of discriminant features is an issue
of the greatest importance in a pedestrian detection system
considering the large variability problem that has to be solved
in real scenarios. A set of features must be extracted and fed to
a pedestrian recognition system.
A. Component-Based Approach
There are some important aspects that need to be addressed
when constructing a classifier, such as the global classification
structure and the use of single or multiple cascaded classifiers.
These issues are strongly connected to the way features are
extracted. The first decision to make implies the development
of a holistic classifier against a component-based approach.
In the first option, features are extracted from the complete
candidate described by a bounding box in the image plane.
The component-based approach suggests the division of the
candidate body into several parts over which features are
computed. Each pedestrian body part is then independently
learned by a specialized classifier in the first learning stage. The
outputs provided by individual classifiers, which correspond
to individual body parts, can be integrated in a second stage
that provides the final classification output. In Section IV, two
possible methods for developing a second-stage classifier are
described. As long as a sufficient number of body parts or limbs
are visible in the image, the component-based approach can
still manage to provide correct classification results. This allows
for the detection of partially occluded pedestrians whenever the
contributions of the pedestrian visible parts are reliable enough
to compensate for the missing ones.
After extensive trials, we propose a total of six different
subregions for each candidate ROI, which has been rescaled
to a size of 24 × 72 pixels. This solution constitutes a tradeoff
between exhaustive subregion decomposition and the holistic
approach. The optimal location of the six subregions, which are
empirically achieved after hundreds of trials, has been chosen
in an attempt to detect coherent pedestrian features, as depicted
in Fig. 4. Thus, the first subregion is located in the zone where
the head would be. The arms and legs are covered by the
second, third, fourth, and fifth regions, respectively. An addi-
tional region is defined between the legs, which covers an area
that provides relevant information about the pedestrian pose.
This subregion is particularly useful to recognize stationary
pedestrians.
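
As an illustration only, the sketch below lays out six such subregions on the 24 × 72 rescaled candidate. The pixel coordinates are placeholder assumptions, since the empirically chosen extents are not given in this text.

```python
# Illustrative layout of the six candidate subregions on the 24 x 72 rescaled
# candidate. The subregion locations are fixed empirically in the paper, but the
# pixel coordinates below are placeholder assumptions.
import numpy as np

SUBREGIONS = {                       # (u, v, width, height) in the 24 x 72 ROI
    "head":             (6,  0, 12, 16),
    "left_arm":         (0, 14, 10, 26),
    "right_arm":        (14, 14, 10, 26),
    "left_leg":         (0, 40, 11, 32),
    "right_leg":        (13, 40, 11, 32),
    "between_the_legs": (8, 40,  8, 32),
}

def crop(roi, box):
    """Extract one subregion from a 72 x 24 (rows x cols) candidate image."""
    u, v, w, h = box
    return roi[v:v + h, u:u + w]

roi = np.zeros((72, 24), dtype=np.uint8)          # stand-in candidate image
parts = {name: crop(roi, box) for name, box in SUBREGIONS.items()}
print({name: part.shape for name, part in parts.items()})
```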
B. Combination of Feature Extraction Methods
The choice of the most appropriate features for pedestrian
characterization remains a challenging problem nowadays since
recognition performance depends crucially on the features that
are used to represent pedestrians. In the first intuitive approach,
some features seem to be more suitable than others for repre-
senting certain parts of the human body. Thus, legs and arms are
long elements that tend to produce straight lines in the image,
while the torso and head are completely different parts, which
are not so easy to recognize. This statement, although based on
intuition, suggests the combination of several feature extraction
methods for the different subregions into which a candidate is
divided. Accordingly, we have tested a set of seven different
feature extraction methods. The selection of features was made
based on intuition, previous work carried out by other authors,
and our own previous work on other applications. The proposed
features are briefly described in the following lines.
Canny image: The Canny edge detector [29] computes
the image gradient, highlighting regions with high spatial
derivatives. The computations of edges significantly re-
duce the amount of data that needs to be managed and filter
out useless information while preserving shape properties
in the image. The result obtained after applying a Canny
filter to the ROI is directly applied to the input of the
classifier. The Canny-based feature vector is the same size
as the candidate image, i.e., 24 × 72.
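
A minimal sketch of this feature, assuming OpenCV is available, is shown below. The fixed Canny thresholds are placeholders standing in for the adaptive thresholds mentioned in Section II-A.

```python
# Minimal sketch of the Canny-based feature: the edge image of the rescaled
# 24 x 72 candidate, flattened and fed to the classifier. The fixed thresholds
# below are placeholders; the paper uses adaptive thresholds.
import cv2
import numpy as np

def canny_feature(candidate_bgr):
    gray = cv2.cvtColor(candidate_bgr, cv2.COLOR_BGR2GRAY)
    roi = cv2.resize(gray, (24, 72))                   # (width, height)
    edges = cv2.Canny(roi, 50, 150)                    # placeholder thresholds
    return (edges.astype(np.float32) / 255.0).ravel()  # 24 * 72 = 1728 values

feature = canny_feature(np.zeros((120, 40, 3), dtype=np.uint8))
print(feature.shape)                                   # (1728,)
```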
Haar wavelets, which were originally proposed for pedes-
trian recognition in [9]: In this paper, only the verti-
cal features have been considered. This yields a feature


References

  • C.-C. Chang and C.-J. Lin, "LIBSVM: A Library for Support Vector Machines."
  • N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2005.
  • J. Canny, "A Computational Approach to Edge Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, 1986.
  • C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 1998.
  • R. M. Haralick, "Statistical and Structural Approaches to Texture," Proceedings of the IEEE, 1979.

