
People Detection with Heterogeneous Features and Explicit Optimization on Computation Time

TL;DR: A novel people detector that employs discrete optimization for feature selection using binary integer programming to mine heterogeneous features taking both detection performance and computation time explicitly into consideration is presented.
Abstract: In this paper we present a novel people detector that employs discrete optimization for feature selection. Specifically, we use binary integer programming to mine heterogeneous features taking both detection performance and computation time explicitly into consideration. The final trained detector exhibits low Miss Rates with a significant boost in frame rate. For example, it achieves a 2.6% lower Miss Rate at $10^{-4}$ FPPW compared to the Dalal and Triggs HOG detector, with a 9.22x speed improvement.

Summary (2 min read)

Introduction

  • In the modern era, computer vision is playing a significant role in automated object perception; one such thriving role is automated people detection.
  • Given a heterogeneous pool of features, the final detector can be built in different ways.
  • Both [14] and [15] adopt a heuristic rule: they use a homogeneous family of features they deem cheap at the initial stages, and homogeneous complex features at the later ones.
  • Finally, trend (4) builds the detector via an explicit computation time vs detection trade-off.
  • Second, the paper presents a thorough evaluation of the proposed person detector–using both proprietary and public datasets–with detailed analysis of its performance compared to alternative approaches and the state-of-the-art.

A. Features

  • Five different feature families are considered, namely: Haar like, CS-LBP, CSS, EOH, and HOG.
  • Here, the extended set proposed by Lienhart and Maydt [20], which includes tilted variants, is used.
  • Then, considering a rectangular region within the human template, a histogram with 16 bins is computed to signify one feature of this family.
  • The template window is divided into overlapping blocks and a 36 dimensional histogram of oriented gradients is computed for each block, just like [4].
  • The HOG feature pool is generated by considering all possible positions, widths, and heights of the rectangular region.

B. Pareto-front extraction

  • Given the full set of features, F, along with their trained associated weak learners, each characterized by three parameters, TPR, FPR, and computation time (τ), pareto-front analysis is used to find the optimal solutions that make up the pareto optimal set, i.e., the solutions that cannot be improved in one objective function without deteriorating their performance in at least one of the rest.
  • The subset of features that are pareto optimal with respect to TPR, FPR, and computation time, denoted F̃, is extracted and passed on to the discrete optimization step.

C. Feature selection and cascade classifier learning

  • The final and decisive feature selection step is performed by the BIP optimizer and is discussed in § III.
  • Finally, the nodal strong classifier, H(·), is built with discrete AdaBoost by using the F̂ feature set.
  • Once this is done, all negative training samples in the dataset are tested with it.
  • The BIP decision variables are the following.
  • Constraint (4) expresses that the stipulated TPRk of true positives, obtained with the selected classifiers, has to be reached.

A. Evaluation Criteria

  • For detector performance evaluation, the authors use two approaches: (1) The per window approach, whereby a Detection Error Trade-off (DET) curve with Miss Rate versus False Positives Per Window (FPPW) is generated by using cropped positive and negative windows; and (2) the per image approach which shows Miss Rate versus False Positives per Image (FPPI).
  • The first curve is used to compare experimental variants of the proposed framework with respect to Dalal and Triggs HOG [4] (aspect 1), and the second is used to determine how their best approach plays out compared to the different techniques in the literature (aspect 2).
  • To summarize the performance, the Miss Rate at $10^{-4}$ FPPW and the log-average miss rate are used in the first and second approaches respectively.
  • Another criterion that is taken into account is the average computation time.
  • For a cascade detector the average computation time for a given candidate window is affected by the FPR of each node.

B. Dataset

  • For evaluation, two different datasets are considered: the Ladybug dataset, which is a proprietary dataset compiled from an indoor laboratory environment using the Ladybug2 spherical camera; and the INRIA public dataset [4], a publicly available dataset most predominantly used for benchmarking people detectors in the literature.
  • A detailed description is not provided here due to space considerations, but table II summarizes the actual data used for training and testing purposes.
  • On the INRIA dataset cropped windows are used for training.
  • For testing, both cropped windows and full images are used for a per window and full image evaluation respectively.

C. Training

  • Each cascade node training is governed by two provided parameters: the nodal TPRk and FPRk for node k (TPRk is always 1.0).
  • Once the pertinent features are selected, the corresponding weak learners are re-trained using the combined training and validation set within the discrete AdaBoost to build the per node final strong classifier, i.e., H(·).
  • Similar results obtained for the INRIA dataset are shown in figure 3 and summarized in table IV.
  • But, this also contributes to its superior detection performance, over BIP+AdaBoost(Fix), throughout the FPPW range shown in figure 3.


HAL Id: hal-01059551
https://hal.archives-ouvertes.fr/hal-01059551
Submitted on 1 Sep 2014
To cite this version:
Alhayat Ali Mekonnen, Frédéric Lerasle, Ariane Herbulot, Cyril Briand. People Detection with Heterogeneous Features and Explicit Optimization on Computation Time. 22nd International Conference on Pattern Recognition, Aug 2014, Stockholm, Sweden. hal-01059551

A. A. Mekonnen, F. Lerasle, A. Herbulot, and C. Briand
CNRS, LAAS, 7 avenue du Colonel Roche, F-31400 Toulouse, France
Univ de Toulouse, UPS, LAAS, F-31400 Toulouse, France
Email: {alhayat-ali.mekonnen, cyril.briand, frederic.lerasle, ariane.herbulot}@laas.fr
Abstract—In this paper we present a novel people detector that
employs discrete optimization for feature selection. Specifically,
we use binary integer programming to mine heterogeneous
features taking both detection performance and computation time
explicitly into consideration. The final trained detector exhibits
low Miss Rates with a significant boost in frame rate. For example, it achieves a 2.6% lower Miss Rate at $10^{-4}$ FPPW compared to the Dalal and Triggs HOG detector, with a 9.22x speed improvement.
I. Introduction
In the modern era, computer vision is playing a significant role in automated object perception; one such thriving role is automated people detection. Visual people detection, i.e., people detection using visual cameras, is the most prominent mode employed in the literature as cameras are cheap, versatile, and provide rich color and texture information. It is indispensable primarily in surveillance systems, human-machine interaction, robotics, automotive industry, image/video indexing, etc. Evidently, it is also one of the challenging tasks in computer vision due to variations in people's appearance, background clutter, illumination, sensor motion, and so forth. In recent years astounding progress has been made by the scientific community [1], [2], but there is still room for improvement.
One important discipline where applications of visual people detection are proliferating is robotics. In robotic systems that entail people perception, the aforementioned challenges are further exacerbated by real-time requirements, limited computational resources, and sensor motion. A mobile robot needs to be reactive during navigation/interaction in human occupied environments. Thus, its people detection module, which is one component of an entire functioning system, should be fast. The advent of powerful camera systems in the robotic community that provide high resolution omnidirectional images, e.g., the Ladybug series [3] from Point Grey, stresses this point further, urging the need to give extra focus on computation time during detector design.
In this work, we try to give explicit consideration to computation time during detector design. Generally speaking, balancing computation time and detection performance is challenging; the best detection results are obtained using complex features and descriptors which are computationally expensive. As an example, Histogram of Oriented Gradients (HOG) [4] is the most discriminant feature thus far, but it is also computationally expensive compared to simple features like Haar variants [5]. Furthermore, most detectors that improve over HOG either use complex human models, e.g., parts based models [6], or consider a heterogeneous pool of features, e.g., [7], [8], both of which contribute to added computation time unless explicit computation considerations are made. In line with this, we present a person detector that uses a heterogeneous pool of features and makes an explicit computation time vs detection trade-off optimization to build a performant detector that leads to a significant gain in computation time while maintaining competitive detection performance.
Related Works: The entire literature in visual people detection is overwhelming and a discussion on the different techniques is beyond the scope of this paper (please refer to [1], [2] for extensive surveys). We will focus on approaches that use a heterogeneous pool of features with the sliding-window detection paradigm. The best results in visual people detection are obtained using a heterogeneous pool of features [1], [2]. Heterogeneous features help capture complementary information useful to handle various detection challenges. For example, Wojek et al. [8] used Haar, HOG, and shape context features. They presented a comparative result obtained using boosting techniques and SVMs as classifiers and demonstrated that the combination of different features successfully outperformed individual variants and even the state-of-the-art at the time. Walk et al. [7] also clearly showed they obtained the best detection results when concatenating HOG, Histogram Of Flow [9], and Color Self Similarity (CSS) features all together, rather than individual features or a subset of them. Similar conclusions were made by Schwartz et al. [10], using HOG, color frequency, and co-occurrence features, and by Hussain and Triggs [11], using HOG, Local Binary Pattern (LBP), and Local Ternary Pattern (LTP) features.
Given a heterogeneous pool of features, different ways can be used to build the final detector. Four main trends can be observed in the literature: (1) Direct concatenation [7], [8], in which the different features are concatenated to make one high dimensional feature vector and an SVM is used afterwards for classification. This is computationally costly owing to the complex feature and SVM weights applied in sliding window detection. [11], [10] used dimensionality reduction techniques after concatenation, which improved detection performance but not detection speed. (2) Direct boosting [12], [8], [13], where an ensemble classifier is learned using the entire heterogeneous pool of features. The problem here is that in boosting, on each iteration, the feature with the least weighted classification error is added to the ensemble irrespective of its computation time. This favors complex features, resulting in a computationally costly detector. (3) Coarse-to-fine hierarchical arrangement [14], [15], where a cascade is constructed using cheap features at the initial stages and using complex features at later stages. This approach is quite advantageous and tries to find a balance between detection performance and speed. The concern is, how to decide which features to use at the different stages systematically? Both [14], [15] adopt a heuristic based rule and use a homogeneous family of features they deemed cheap at the initial stages, and homogeneous complex features at the latter. Finally, (4) via a computation time vs detection trade-off. This notion has been considered by the works of Wu and Nevatia [16], Jourdheuil et al. [17], and Mekonnen et al. [18]. In all cases, they defined a criterion composed of feature detection performance and computation time in a multiplicative manner. But, considering a multiplicative factor masks the contributions from the respective objectives and is not guaranteed to be optimal.

Fig. 1: Feature selection and classifier learning framework used at each node of a cascade. Blocks, in order: Feature Extraction, Weak Learner Training, Pareto-Front Selection, Binary Integer Programming, Discrete AdaBoost. The training samples $\{(x_i, y_i)\}_{i \in N}$ yield the feature set $F$, which is characterized by TPR, FPR, and $\tau$, reduced to $\tilde{F}$ and then to $\hat{F}$, from which $H(\cdot)$ is built.
Our proposed framework falls in the fourth category; but, it can also be considered as a variant of coarse-to-fine hierarchy in which the exact features to use at each cascade node are selected automatically via an optimization step. We use five frequently used heterogeneous features, namely: Haar-like features [5], Edge Orientation Histogram (EOH) [13], CSS [7], Center Surround Local Binary Patterns (CS-LBP) [19], and HOG [4] in a classical cascaded boosting configuration [5] with an added explicit optimization step based on Binary Integer Programming (BIP) to select a subset of features that have the least combined computation time and achieve a stipulated detection performance.
Contributions: This paper claims to make two important
contributions. First, it presents a BIP formulation to mine
heterogeneous features taking both detection performance and
computation time into consideration. The authors assert this
optimization applied to heterogeneous features is unique in
the literature and makes a key contribution. Second, the
paper presents a thorough evaluation of the proposed per-
son detector–using both proprietary and public datasets–with
detailed analysis of its performance compared to alternative
approaches and the state-of-the-art.
II. Framework
The objective in this work is to develop a people detection framework based on heterogeneous features that capture different facets of persons in an image. Our proposed detector training framework takes the discriminative power of each feature and its associated computation time into consideration explicitly to select, and subsequently use, a subset of features that fulfill the required detection performance and have the minimum cumulative computation time.
As detection speed is one of our design focuses, we adopt the acclaimed Viola and Jones [5] attentional cascade detector configuration in a sliding window paradigm. To train a strong classifier at each node of the cascade, the framework depicted in figure 1 is employed. For a given set of positive and negative training samples (a total of $n$ samples denoted as $\{(x_i, y_i)\}_{i \in \{1,...,n\}}$): First, the features described in § II-A are extracted, resulting in the feature set $F$. For each feature a unique weak learner is trained using the examples provided and is used to characterize the discriminating power of the feature in terms of True Positive Rate (TPR) and False Positive Rate (FPR). Next, pareto-front analysis is used to select a subset of features, $\tilde{F}$, taking their TPR, FPR, and computation time into account. This step is necessary to reduce the overwhelming total number of features to a tractable size for discrete optimization. Then, binary integer optimization, presented in § III, is used to retain a subset of features, $\hat{F}$, that have the required performance, i.e., the stipulated detection performance at minimum computation time. Finally, a nodal strong classifier $H(\cdot)$ is trained using the retained feature set $\hat{F}$ with discrete AdaBoost. Specific design choice motivations and brief descriptions of each block are presented below.
A. Features
Five different feature families are considered, namely: Haar like, CS-LBP, CSS, EOH, and HOG. This choice is motivated by two aspects: (1) their frequent use in the literature for person detection, and (2) their complementary nature. EOH and HOG capture edge distributions, CSS focuses on color symmetry, Haar-like and CS-LBP on intensity and texture variations. The feature pool of each family is extracted from a 128 × 64 pixels human template window.
Haar like: Here, the extended set proposed by Lienhart and
Maydt [20] which includes tilted variants, is used. The pool
is generated by extracting feature values at all positions and
scales in the template window with the extended Haar set.
CS-LBP: Computes a per pixel CS-LBP [19] value by taking and modulating the intensity difference of center symmetric pixels for all the neighboring pixels. For each pixel, we use a 3 × 3 pixel region, which results in a scalar integer between 0 and 15 (16 possible values). Then, considering a rectangular region within the human template, a histogram with 16 bins is computed to signify one feature of this family. For all possible positions and scales of the rectangular region a distinct feature (which is a histogram) is computed and added to the CS-LBP feature pool.
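The CS-LBP computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual center-symmetric formulation in which the four opposite neighbor pairs of a 3 × 3 region are compared against a threshold, giving a 4-bit code. The function names, the neighbor ordering, and the `threshold` parameter are assumptions introduced here.

```python
def cs_lbp_code(patch, threshold=0.0):
    """CS-LBP code (0..15) of the centre pixel of a 3x3 patch (3 rows of 3)."""
    # The 8 neighbours in circular order, starting at the right-hand one.
    ring = [patch[1][2], patch[0][2], patch[0][1], patch[0][0],
            patch[1][0], patch[2][0], patch[2][1], patch[2][2]]
    code = 0
    for k in range(4):
        # Compare each neighbour with the diametrically opposite one.
        if ring[k] - ring[k + 4] > threshold:
            code |= 1 << k
    return code


def cs_lbp_histogram(image, top, left, height, width, threshold=0.0):
    """16-bin histogram of CS-LBP codes over one rectangular region.

    Border pixels of the image are skipped because they lack a full
    3x3 neighbourhood.
    """
    hist = [0] * 16
    for r in range(max(top, 1), min(top + height, len(image) - 1)):
        for c in range(max(left, 1), min(left + width, len(image[0]) - 1)):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[cs_lbp_code(patch, threshold)] += 1
    return hist
```

Each call to `cs_lbp_histogram` with a different rectangle would give one member of the CS-LBP feature pool.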
CSS: Color self similarity, proposed by Walk et al. [7], captures pairwise similarities of spatially localized color distributions and can be used to capture the left and right symmetry of persons' clothing (upper body and lower body). The computation starts by subdividing the given template into non-overlapping regions called blocks. For each block a 3 × 3 × 3 HSV color histogram is constructed. Then, the similarity of each block with the rest of the blocks is determined via histogram intersection. Instead of concatenating all computed similarities like Walk et al. [7], we define a single CSS feature to be a vector of scalar values that are obtained by intersecting the histogram of one block with the rest of the blocks. The CSS feature pool is then determined by computing this vector for all blocks. By dividing the template into blocks of 8 × 8 pixels, a total of 128 feature vectors, each with 127 dimensions, are obtained.
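A short sketch of how the per-block CSS vectors could be assembled once the block histograms are available. It assumes the 27-bin (3 × 3 × 3) HSV histograms have already been computed for each block; the function names are hypothetical, and the histogram extraction itself is left out.

```python
def histogram_intersection(h1, h2):
    """Similarity of two histograms: the sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def css_feature_pool(block_histograms):
    """One CSS feature per block: its intersections with all other blocks."""
    pool = []
    for i, h_i in enumerate(block_histograms):
        pool.append([histogram_intersection(h_i, h_j)
                     for j, h_j in enumerate(block_histograms) if j != i])
    return pool


# A 128 x 64 template split into 8 x 8 pixel blocks gives 16 * 8 = 128 blocks.
n_blocks = (128 // 8) * (64 // 8)
dummy_histograms = [[1] * 27 for _ in range(n_blocks)]  # placeholder data
pool = css_feature_pool(dummy_histograms)
```

With 128 blocks, `pool` holds 128 vectors of 127 values each, which matches the pool size stated above.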
EOH: This feature pool is generated exactly as described by Geronimo et al. [13]: an edge orientation histogram is computed, followed by ratios of the magnitudes of two bins to get a single scalar feature value; this is done for all positions and scales of rectangular subregions for histogram computation within the template window.
HOG: The HOG feature pool is constructed as follows: the template window is divided into overlapping blocks and a 36 dimensional histogram of oriented gradients is computed for each block, just like [4]. But, rather than concatenating all block histograms to make one high dimensional feature, we consider concatenating a subset spanning a rectangular region. The HOG feature pool is generated by considering all possible positions, widths, and heights of the rectangular region. The features range from a 36 dimensional vector (a single block) to a 3780 dimensional one (all blocks in the template).
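As a sanity check on the size of this pool, the rectangular sub-regions can be enumerated directly. The sketch below assumes the block layout of [4], i.e., 16 × 16 pixel blocks placed on an 8 pixel stride in a 64 × 128 window; the text only says "just like [4]", so these values and the function name are assumptions.

```python
def hog_feature_pool(win_w=64, win_h=128, block=16, stride=8, block_dim=36):
    """Enumerate every rectangular group of blocks in the template window.

    Returns tuples (top, left, height, width, dimension), with positions
    and sizes expressed in block units.
    """
    cols = (win_w - block) // stride + 1  # blocks per row
    rows = (win_h - block) // stride + 1  # blocks per column
    pool = []
    for top in range(rows):
        for left in range(cols):
            for h in range(1, rows - top + 1):
                for w in range(1, cols - left + 1):
                    pool.append((top, left, h, w, h * w * block_dim))
    return pool


pool = hog_feature_pool()
dims = [entry[4] for entry in pool]
print(len(pool), min(dims), max(dims))
```

Under these assumptions the grid has 7 × 15 blocks, and the enumeration gives 3,360 features with dimensions from 36 to 3780, in agreement with the HOG row of Table I and the range quoted above.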
Table I summarizes the total number of features, the scaled maximum and minimum feature computation time ($\tau_{max}$ and $\tau_{min}$), and the exact weak learner used in each feature family. For the CS-LBP family, Linear Discriminant Analysis combined with a decision tree (which is trained after re-projection) is preferred, as an SVM leads to an overwhelming training period (due to the high number of CS-LBP features).
TABLE I: Feature pool summary. Time is reported relatively as a multiple of the smallest feature computation time, u = 0.0535 µs.

Feature Type   No. of features   τ_min      τ_max       Weak Learner
Haar like      672,406           1.0u       3.48u       Decision Tree
EOH            712,960           4.83u      317.75u     Decision Tree
CS-LBP         59,520            15.45u     393.64u     LDA + Decision Tree
CSS            128               1017.94u   1017.94u    SVM
HOG            3,360             489.72u    51420.56u   SVM
B. Pareto-front extraction
Given the full set of features, $F$, along with their trained associated weak learners, each characterized by three parameters, TPR, FPR, and computation time ($\tau$), pareto-front analysis is used to find the optimal solutions that make up the pareto optimal set, i.e., the solutions that cannot be improved in one objective function without deteriorating their performance in at least one of the rest. The subset of features that are pareto optimal with respect to TPR, FPR, and computation time, denoted $\tilde{F}$, is extracted and passed on to the discrete optimization step.
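The pareto-optimal set defined above can be illustrated with a simple dominance test. This is a minimal sketch with hypothetical function names, assuming each feature is summarized by a tuple (TPR, FPR, τ) where a higher TPR and a lower FPR and τ are better. The pairwise scan is quadratic in the number of features, so it only illustrates the definition and is not meant for pools of the size listed in Table I.

```python
def dominates(a, b):
    """True if a = (tpr, fpr, tau) is at least as good as b in all three
    objectives and strictly better in at least one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(features):
    """Indices of the features that no other feature dominates."""
    return [i for i, a in enumerate(features)
            if not any(dominates(b, a)
                       for j, b in enumerate(features) if j != i)]


features = [
    (0.90, 0.10, 1.0),  # kept
    (0.80, 0.20, 2.0),  # dominated by the first entry
    (0.95, 0.30, 5.0),  # kept: best TPR, although slow
    (0.90, 0.10, 3.0),  # dominated: same rates as the first, but slower
]
front = pareto_front(features)
```

Here `front` contains the indices 0 and 2.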
C. Feature selection and cascade classifier learning
The final and decisive feature selection step is performed by the BIP optimizer and is discussed in § III. This module provides the set $\hat{F}$. Finally, the nodal strong classifier, $H(\cdot)$, is built with discrete AdaBoost by using the $\hat{F}$ feature set.
The complete classifier used for detection, however, con-
tains multiple nodes forming a cascade. The cascade construc-
tion starts with all positive training samples and a subset of
the negative training samples (equivalent to the positive ones)
to learn the set of relevant features and classifiers for the
initial cascade node. Once this is done, all negative training
samples in the dataset are tested with it. All those that get
classified correctly are rejected while all those labeled as
positive samples (false positives) are retained along with the
positive samples for training the following nodes. This step is
repeated until all negative training samples are exhausted. This data mining technique makes it possible to use a vast number of negative training samples.
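The cascade construction loop described above can be sketched as follows. It is a simplified illustration under stated assumptions: `train_node` is a hypothetical callable that stands in for the whole per-node pipeline of figure 1 and returns a classifier `node(x) -> bool`; the `max_nodes` guard is an addition made here so the loop always terminates.

```python
def build_cascade(positives, negatives, train_node, max_nodes=50):
    """Grow a cascade by mining the negatives that survive each node."""
    cascade = []
    remaining = list(negatives)
    while remaining and len(cascade) < max_nodes:
        # Train on all positives and as many negatives as there are positives.
        subset = remaining[:len(positives)]
        node = train_node(positives, subset)
        cascade.append(node)
        # Correctly rejected negatives are dropped; false positives are
        # retained for training the following nodes.
        remaining = [x for x in remaining if node(x)]
    return cascade
```

In this sketch the positives are passed unchanged to every node, which matches a nodal TPR of 1.0.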
III. Discrete optimization feature selection
The BIP based feature selection applied to heterogeneous features makes the core of this work's contribution. The detailed optimization formulation to select a subset of features that fulfill a stipulated nodal $FPR_k$ and $TPR_k$ with the minimum combined computation time possible is provided as follows ($k$ denotes the node index):
Definition of parameters: The following is the list of parameters used in the optimization specification (applies to a cascade node $k$). $B = \{0, 1\}$ denotes the binary set.
  • $N = \{1, ..., n\}$: set of training sample indexes with $n \in \mathbb{Z}$; a total of $n$ training samples indexed by $i$;
  • $M = \{1, ..., m\}$: set of weak learner indexes with $m \in \mathbb{Z}$; a total of $m$ weak learners indexed by $j$;
  • $y^{+} \in B^{n}$, $y^{+} = \{y_i^{+}\}_{i \in N}$, and $y^{-} \in B^{n}$, $y^{-} = \{y_i^{-}\}_{i \in N}$, where $y_i^{+} = 1$ if $i$ is positive and $0$ otherwise, and $y_i^{-} = 1$ if $i$ is negative and $0$ otherwise;
  • $H \in B^{n \times m}$, where $H = \{h_{i,j}\}_{i \in N, j \in M}$ with $h_{i,j} \in \{0, 1\}$: $h_{i,j} = 1$ if weak learner $j$ detects sample $i$ as positive, and $0$ otherwise;
  • $TPR_k \in [0, 1]$: minimum true positive rate set at the considered node ($k$) of the cascade;
  • $FPR_k \in [0, 1]$: maximum false positive rate at the node;
  • $T \in \mathbb{R}^{m}$, with $T = \{\tau_j\}_{j \in M}$: computation time of weak learner $j$.
Decision Variables: In BIP, the decision variables are restricted to binary values, values from the set $B = \{0, 1\}$. The BIP decision variables are the following.
  • $v \in B^{m}$, $v = \{v_j\}_{j \in M}$, $v_j \in \{0, 1\}$: $v_j = 1$ if weak learner $j$ is selected, else $v_j = 0$;
  • $t \in B^{n}$, $t_i \in \{0, 1\}$: $t_i = 1$ if a positive sample $i$ has been detected as positive (true positive) by at least one selected weak learner, else $t_i = 0$;
  • $f \in B^{n}$, $f_i \in \{0, 1\}$: $f_i = 1$ if a negative sample $i$ has been detected as positive (false positive) by at least one selected classifier, else $f_i = 0$.
Let vector $p$, $p = \{p_i\}_{i \in N} = Hv$, denote the total number of weak learners that have labeled each training sample $i$ as positive.
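Objective Function and Constraints:
$$\min \; T^{T} v \qquad (1)$$
$$\text{s.t.} \quad t_i \le y_i^{+} \cdot p_i \quad \forall i \qquad (2)$$
$$f_i \ge y_i^{-} \cdot h_{i,j} \cdot v_j \quad \forall (i, j) \qquad (3)$$
$$\|t\|_1 \ge \|y^{+}\|_1 \cdot TPR_k \qquad (4)$$
$$\|f\|_1 \le \|y^{-}\|_1 \cdot FPR_k \qquad (5)$$
$$v \in B^{m}; \quad t = \{t_i\}_{i \in N}, \; f = \{f_i\}_{i \in N}; \quad t, f \in B^{n} \qquad (6)$$
where $\|\cdot\|_1$ is the $\ell_1$ norm.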

The objective function (1) aims at minimizing the computation time. Constraints (2)-(5) express that a given rate of detection quality has to be reached (depending on the number of true and false positives). Constraint (2) links the $v_j$ and $t_i$ variables (via $p_i$) so that $t_i = 0$ if positive image $i$ is not correctly detected by at least one selected classifier. Constraint (3) links the $v_j$ and $f_i$ variables so that $f_i = 1$ if a negative image $i$ has been recognized as positive by at least one selected classifier. Constraint (4) expresses that the stipulated $TPR_k$ of true positives, obtained with the selected classifiers, has to be reached. Similarly, constraint (5) expresses that the stipulated $FPR_k$ of false positives, obtained with the selected classifiers, must not be exceeded. In this formulation, there are a total of $(n \cdot (m + 1) + 2)$ constraints over $(m + 2n)$ binary variables in the BIP, which could be huge for large $n$ and $m$ values. The final subset of features $\hat{F}$ corresponds to only the selected features, i.e., the non zero entries of $v$; since each feature indexed by $j$ is associated with a unique weak learner $h_j$, $\hat{F}$ also represents the subset of weak learners retained.
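To make the formulation concrete, the sketch below solves a toy instance of (1)-(5) by exhaustively enumerating every selection vector v. This brute-force search is swapped in purely for illustration; it is not a BIP solver, it is not the authors' method, and it is only usable for a handful of weak learners. The function name and the toy data are assumptions. For a fixed v, the sketch takes the largest t allowed by (2) and the smallest f allowed by (3), which is sufficient to decide whether (4) and (5) can be met.

```python
from itertools import product


def solve_bip_bruteforce(H, y_pos, y_neg, tau, tpr_k, fpr_k):
    """Return (v, cost) minimising tau . v subject to the rate constraints."""
    n, m = len(H), len(tau)
    n_pos, n_neg = sum(y_pos), sum(y_neg)
    best_v, best_cost = None, None
    for v in product((0, 1), repeat=m):
        # p_i = (Hv)_i: number of selected learners that fire on sample i.
        p = [sum(H[i][j] * v[j] for j in range(m)) for i in range(n)]
        t = [1 if y_pos[i] and p[i] > 0 else 0 for i in range(n)]  # from (2)
        f = [1 if y_neg[i] and p[i] > 0 else 0 for i in range(n)]  # from (3)
        if sum(t) < n_pos * tpr_k:  # constraint (4) violated
            continue
        if sum(f) > n_neg * fpr_k:  # constraint (5) violated
            continue
        cost = sum(tau[j] * v[j] for j in range(m))  # objective (1)
        if best_cost is None or cost < best_cost:
            best_v, best_cost = v, cost
    return best_v, best_cost


# Toy instance: samples 0-1 are positive, 2-3 negative; three weak learners.
H = [[1, 0, 1],
     [0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
y_pos, y_neg = [1, 1, 0, 0], [0, 0, 1, 1]
tau = [1.0, 2.0, 2.5]
loose = solve_bip_bruteforce(H, y_pos, y_neg, tau, tpr_k=1.0, fpr_k=0.5)
strict = solve_bip_bruteforce(H, y_pos, y_neg, tau, tpr_k=1.0, fpr_k=0.0)
```

With a nodal FPR of 0.5 the single learner 2 suffices (cost 2.5), whereas a nodal FPR of 0 forces the pair of learners 0 and 1 (cost 3.0), which shows how the constraints, rather than the individual costs, drive the selection.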
IV. Experiments and Results
In this section the different experiments carried out to investigate the performance of the proposed framework and the obtained results, along with commentaries, are presented. The evaluation is focused on the following two aspects:
(1) Feature selection strategy evaluation: Here, the aim is to
analyze the pros and cons of using BIP over other simpler
alternatives. The proposed BIP based feature selection and
classifier learning strategy, labeled as BIP+AdaBoost, is com-
pared with two other modes. First, Pareto+AdaBoost which
discards the BIP block in the framework and directly trains a
nodal strong classifier with discrete AdaBoost using the features
retained by the pareto-front extraction block. And second,
Random+AdaBoost which directly builds a nodal classifier
using randomly sampled features from the total feature pool
(proportional to each feature pool family size) and AdaBoost.
(2) General comparative evaluation with the state-of-the-art:
In this part, the performance of the trained BIP+AdaBoost is
compared with the prominent approaches in the literature.
A. Evaluation Criteria
For detector performance evaluation, we use two approaches: (1) The per window approach, whereby a Detection Error Trade-off (DET) curve with Miss Rate versus False Positives Per Window (FPPW) is generated by using cropped positive and negative windows; and (2) the per image approach, which shows Miss Rate versus False Positives per Image (FPPI). The first curve is used to compare experimental variants of the proposed framework with respect to Dalal and Triggs HOG [4] (aspect 1), and the second is used to determine how our best approach plays out compared to the different techniques in the literature (aspect 2). To summarize the performance, the Miss Rate at $10^{-4}$ FPPW and the log-average miss rate are used in the first and second approaches respectively.
Another criterion that is taken into account is the average computation time. For a cascade detector the average computation time for a given candidate window is affected by the FPR of each node. Let $K$ be the total number of nodes in the cascade, $FPR_k$ be the false positive rate and $\tau_k$ be the total computation time of the $k$-th node during detection. Assuming the nodal FPR characteristics hold on a generic input image, the average time spent on a test candidate window, $T_{av}$, can be estimated as $T_{av} = \tau_0 + \sum_{k=1}^{K-1} \left( \prod_{z=0}^{k-1} FPR_z \right) \tau_k$. Using Dalal and Triggs [4] detector, which takes $\zeta_{HOG}$ per window, as a reference, the Average Speed Up (ASU) over it is determined as $ASU = \zeta_{HOG} / T_{av}$. Consequently, the ASU values reported henceforth are with respect to Dalal and Triggs detector.
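The estimate of the average per-window time and the resulting speed up can be written out in a few lines. This is a minimal sketch with hypothetical function names and made-up numbers; nodes are indexed from 0 to K-1 as in the formula above.

```python
def average_window_time(node_times, node_fprs):
    """T_av = tau_0 + sum_{k>=1} (prod_{z<k} FPR_z) * tau_k."""
    t_av = 0.0
    reach_fraction = 1.0  # fraction of windows that reach the current node
    for tau_k, fpr_k in zip(node_times, node_fprs):
        t_av += reach_fraction * tau_k
        reach_fraction *= fpr_k
    return t_av


def average_speed_up(zeta_hog, node_times, node_fprs):
    """ASU relative to a reference detector costing zeta_hog per window."""
    return zeta_hog / average_window_time(node_times, node_fprs)


# Illustrative values only: three nodes of growing cost, each passing 50%.
t_av = average_window_time([1.0, 10.0, 100.0], [0.5, 0.5, 0.5])
asu = average_speed_up(310.0, [1.0, 10.0, 100.0], [0.5, 0.5, 0.5])
```

For these made-up values, `t_av` is 1 + 0.5 · 10 + 0.25 · 100 = 31 and `asu` is 10. The example shows why cheap early nodes matter: the expensive third node contributes only a quarter of its cost to the average.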
B. Dataset
For evaluation, two different datasets are considered: the Ladybug dataset (footnote 1), which is a proprietary dataset compiled from an indoor laboratory environment using the Ladybug2 spherical camera; and the INRIA public dataset [4], a publicly available dataset most predominantly used for benchmarking people detectors in the literature. A detailed description is not provided here due to space considerations, but table II summarizes the actual data used for training and testing purposes. The Ladybug dataset is used for training and testing the framework using cropped windows. On the INRIA dataset cropped windows are used for training. For testing, both cropped windows and full images are used for a per window and full image evaluation respectively. In both datasets, the cropped negative windows are uniformly sampled from the provided person free full images.
TABLE II: Summary of the different datasets used for training and testing.

Dataset                Training                  Test
                       pos win.   neg win.       pos win.   neg win.       full images
Ladybug (footnote 1)   1,990      488,992        1,000      319,653        -
INRIA [4]              2,416      2.55 × 10^6    1,132      2.00 × 10^6    288
C. Training
Each cascade node training (learning) is governed by two provided parameters: the nodal $TPR_k$ and $FPR_k$ for node $k$ ($TPR_k$ is always 1.0). The training is done so the final trained nodal classifier conforms to these stipulated performance requirements. Each cascade node is built using a subset of the total negative training samples and all positive samples. This set is initially divided into a 60% training and a 40% validation set. The weak learners are trained using the 60% training set. Then, TPR and FPR values corresponding to each weak learner are determined based on the validation set. All subsequent computations, i.e., pareto-front analysis and feature selection via BIP, are performed using the weak learners' performance measured on the validation set. Once the pertinent features are selected, the corresponding weak learners are re-trained using the combined training and validation set within the discrete AdaBoost to build the per node final strong classifier, i.e., $H(\cdot)$. The complete cascaded classifier is then learned as explained in § II-C. For the associated weak learners, decision trees of depth 2, 3, and 3 are used for Haar like, EOH, and CS-LBP features, respectively, after a detection performance vs over-fitting trade-off analysis on a validation set.
D. Results and Discussions (footnote 2)
Ladybug Dataset: The main results obtained with the Ladybug dataset are depicted in figure 2 and summarized in
Footnote 1: Please see http://homepages.laas.fr/ aamekonn/ladybug dataset/
Footnote 2: All figures in this section are best viewed in color.

Citations
01 Jan 2014
TL;DR: A very long-term multi-disciplinary research programme addressing inadequacies in current AI, cognitive science, robotics, psychology, neuroscience, philosophy of mathematics and philosophy of mind.
Abstract: AI and robotics have many impressive successes, yet there remain huge chasms between artificial systems and forms of natural intelligence in humans and other animals. Fashionable “paradigms” offering definitive answers come and go (sometimes reappearing with new labels). Yet no AI or robotic systems come close to modelling or replicating the development from helpless infant over a decade or two to a competent adult. Human and animal developmental trajectories vastly outstrip, in depth and breadth of achievement, products of artificial learning systems, although some AI products demonstrate super-human competences in restricted domains. I’ll outline a very long-term multi-disciplinary research programme addressing these and other inadequacies in current AI, cognitive science, robotics, psychology, neuroscience, philosophy of mathematics and philosophy of mind. The project builds on past work by actively seeking gaps in what we already understand, and by looking for very different clues and challenges: the Meta-Morphogenesis project, partly inspired by Turing’s work on morphogenesis, outlined here: http://www.cs.bham.ac.uk/research/projects/cogaff/misc/meta-morphogenesis.html

17 citations

26 Nov 2014
TL;DR: The situation assessment aspect of Romeo2, a unique project aiming to bring multi-modal and multi-layered perception on a single system and targeting for a unified theoretical and functional framework for a robot companion for everyday life, is presented.
Abstract: For a socially intelligent robot, different levels of situation assessment are required, ranging from basic processing of sensor input to high-level analysis of semantics and intention. However, the attempt to combine them all prompts new research challenges and the need of a coherent framework and architecture. This paper presents the situation assessment aspect of Romeo2, a unique project aiming to bring multi-modal and multi-layered perception on a single system and targeting for a unified theoretical and functional framework for a robot companion for everyday life. It also discusses some of the innovation potentials, which the combination of these various perception abilities adds into the robot's socio-cognitive capabilities.

12 citations


Cites background from "People Detection with Heterogeneous..."

  • ...Interested readers can find the details in documentation of the system [18] and in dedicated publications for individual components, such as [4], [11], [19], [15], [3], [24], [17], [23], etc....

    [...]

Proceedings ArticleDOI
01 Jan 2015
TL;DR: This work finds that the normed gradients, designed for generic objectness estimation, are also able to rapidly generate high quality object windows for a single category object, and proposes an efficient method to produce candidate windows that are highly likely to contain a person.
Abstract: In this work, we study a real-time human detection method for mobile devices using window proposals. We find that the normed gradients, designed for generic objectness estimation, are also able to rapidly generate high quality object windows for a single-category object. We also notice that fusing the normed gradients with additional color feature improves the performance of objectness estimation for the single-category object. Based on these observations, we propose an efficient method, which we call personness estimation, to produce candidate windows that are highly likely to contain a person. The produced candidate windows are used to search over feature maps of an image so that a human detection method can achieve high detection performance within a short period of time. We further present how personness estimation can be efficiently combined into part-based human detection. Our experiments indicate that the proposed method is directly applicable to mobile devices, and allows real-time human detection.

4 citations


Cites background from "People Detection with Heterogeneous..."

  • ...Human detection has been extensively researched under various names such as pedestrian detection [6], people detection [11, 18], and head-shoulder detection [17] in recent years....

    [...]

Proceedings ArticleDOI
12 Mar 2018
TL;DR: A novel framework for learning a soft-cascade detector with explicit computation time considerations is presented and it is confirmed that a faster cascade detector can be learned while maintaining similar detection performances.
Abstract: This paper presents a novel framework for learning a soft-cascade detector with explicit computation time considerations. Classically, training techniques for soft-cascade detectors select a set of weak classifiers and their respective thresholds, solely to achieve the desired detection performance without any regard to the detector response time. Nevertheless, since computation time performance is of utmost importance in many time-constrained applications, this work divulges an optimization approach that aims to minimize the mean cascade response time, given a desired detection performance, fixed beforehand. The resulting problem is NP-Hard, therefore finding an optimal threshold vector can be very time-consuming, especially when building a soft-cascade detector of long length. An efficient local search procedure is presented that deals with long-length detectors. Our evaluations on two challenging public datasets confirm that a faster cascade detector can be learned while maintaining similar detection performances.

2 citations


Additional excerpts

  • ..., [23, 20]....

    [...]

Proceedings ArticleDOI
23 Feb 2016
TL;DR: This paper proves the NP-hardness of the problem and proposes a mathematical model that takes benefit from several dominance properties, which are put into evidence, and shows that it can provide a faster cascade detector, while maintaining the same detection performances.
Abstract: In this paper, the problem of minimizing the mean response-time of a soft-cascade detector is addressed. A soft-cascade detector is a machine learning tool used in applications that need to recognize the presence of certain types of object instances in images. Classical soft-cascade learning methods select the weak classifiers that compose the cascade, as well as the classification thresholds applied at each cascade level, so that a desired detection performance is reached. They usually do not take into account its mean response-time, which is also of importance in time-constrained applications. To overcome that, we consider the threshold selection problem aiming to minimize the computation time needed to detect a target object in an image (i.e., by classifying a set of samples). We prove the NP-hardness of the problem and propose a mathematical model that takes benefit from several dominance properties, which are put into evidence. On the basis of computational experiments, we show that we can provide a faster cascade detector, while maintaining the same detection performances.

2 citations


Cites background or methods from "People Detection with Heterogeneous..."

  • ...Computation time of a cascade classifier can be predominantly decreased in two ways: (1) By using a feature selection mechanism with computation time consideration so that cheap features are used in the initial stages of the cascade and costly, but more discriminatory, ones at later stages; for example, the Binary Integer Programming (BIP) based feature selection framework proposed in (Mekonnen et al., 2014) and the ad-hoc weighted computation time based feature selection approach in (Jourdheuil et al....

    [...]

  • ...…of the cascade and costly, but more discriminatory, ones at later stages; for example, the Binary Integer Programming (BIP) based feature selection framework proposed in (Mekonnen et al., 2014) and the ad-hoc weighted computation time based feature selection approach in (Jourdheuil et al., 2012)....

    [...]
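
The excerpt above argues that placing cheap features in the early cascade stages and costly ones later reduces computation time. A minimal sketch of that reasoning follows, assuming a simple rejection cascade in which each stage passes a fixed fraction of the windows it receives. The function name, the stage costs, and the pass rates are hypothetical illustration values, not figures from either paper, and treating pass rates as independent of stage order is a simplification.

```python
def mean_cascade_time(stage_costs, pass_rates):
    """Expected per-window cost of a rejection cascade.

    stage_costs[k]: time to evaluate stage k on one window (arbitrary units).
    pass_rates[k]:  fraction of windows reaching stage k that stage k
                    forwards to the next stage (assumed order-independent).
    """
    expected = 0.0
    reach = 1.0  # fraction of all windows that reach the current stage
    for cost, rate in zip(stage_costs, pass_rates):
        expected += reach * cost  # only windows that got this far pay the cost
        reach *= rate             # the rest are rejected and cost nothing more
    return expected


# Hypothetical two-stage cascade: one cheap stage (cost 1) and one costly stage (cost 10).
cheap_first = mean_cascade_time([1.0, 10.0], [0.2, 0.1])
costly_first = mean_cascade_time([10.0, 1.0], [0.1, 0.2])
print(cheap_first, costly_first)
```

With these made-up numbers the cheap-first ordering costs about 3 units per window on average, against about 10.1 units for the costly-first ordering, because most windows are rejected before the expensive stage is ever evaluated.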

References
Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations


"People Detection with Heterogeneous..." refers methods in this paper

  • ...We use five frequently used heterogeneous features, namely: Haar-like features [5], Edge Orientation Histogram (EOH) [13], CSS [7], Center Surround Local Binary Patterns (CS-LBP) [19], and HOG [4] in a classical cascaded boosting configuration [5] with an added explicit optimization step based on Binary Integer Programming (BIP) to select a subset of features that have the least combined computation time and achieve a stipulated detection performance....

    [...]

  • ...Ladybug¹ 1,990 488,992 1,000 319,653 – INRIA [4] 2,416 2....

    [...]

  • ...At lower FPPI values, less than 0.1 FPPW, the BIP variant consistently supersedes Dalal and Triggs HOG....

    [...]

  • ...Using Dalal and Triggs [4] detector, which takes $\zeta_{HOG}$ per window, as a reference, the Average Speed Up (ASU) over it is determined as $\mathrm{ASU} = \zeta_{HOG} / T_{av}$....

    [...]

  • ...As an example, Histogram of Oriented Gradients (HOG) [4] is the most discriminant feature thus far, but it is also computationally expensive compared to simple features like Haar variants [5]....

    [...]
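
The Average Speed Up quoted in the excerpts above can be illustrated with a short sketch. It reads the excerpt's formula as the ratio of the reference per-window time ζ_HOG to the detector's average per-window time T_av, and it assumes T_av is the plain mean over a set of measured windows. The function name and all timing values are hypothetical.

```python
def average_speed_up(zeta_hog, window_times):
    """ASU = zeta_hog / T_av.

    zeta_hog:     per-window time of the reference HOG detector.
    window_times: per-window times of the evaluated detector; T_av is
                  assumed here to be their arithmetic mean.
    """
    if not window_times:
        raise ValueError("need at least one per-window timing")
    t_av = sum(window_times) / len(window_times)
    return zeta_hog / t_av


# Hypothetical timings in arbitrary units: most windows exit the cascade early.
asu = average_speed_up(9.0, [0.5, 0.5, 2.0])
print(asu)
```

An ASU above 1 means the evaluated detector spends less time per window on average than the reference detector.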

Journal ArticleDOI
TL;DR: A face detection framework that is capable of processing images extremely rapidly while achieving high detection rates is described; implemented on a conventional desktop, face detection proceeds at 15 frames per second.
Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the “Integral Image” which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.

13,037 citations

Journal ArticleDOI
TL;DR: An extensive evaluation of the state of the art in a unified framework of monocular pedestrian detection using sixteen pretrained state-of-the-art detectors across six data sets and proposes a refined per-frame evaluation methodology.
Abstract: Pedestrian detection is a key problem in computer vision, with several applications that have the potential to positively impact quality of life. In recent years, the number of approaches to detecting pedestrians in monocular images has grown steadily. However, multiple data sets and widely varying evaluation protocols are used, making direct comparisons difficult. To address these shortcomings, we perform an extensive evaluation of the state of the art in a unified framework. We make three primary contributions: 1) We put together a large, well-annotated, and realistic monocular pedestrian detection data set and study the statistics of the size, position, and occlusion patterns of pedestrians in urban scenes, 2) we propose a refined per-frame evaluation methodology that allows us to carry out probing and informative comparisons, including measuring performance in relation to scale and occlusion, and 3) we evaluate the performance of sixteen pretrained state-of-the-art detectors across six data sets. Our study allows us to assess the state of the art and provides a framework for gauging future efforts. Our experiments show that despite significant progress, performance still has much room for improvement. In particular, detection is disappointing at low resolutions and for partially occluded pedestrians.

3,170 citations


"People Detection with Heterogeneous..." refers background or methods in this paper

  • ...The best results in visual people detection are obtained using heterogeneous pool of features [1], [2]....

    [...]

  • ...In recent years astounding progress have been made by the scientific community [1], [2], but there is still room for improvement....

    [...]

  • ...Related Works: The entire literature in visual people detection is overwhelming and a discussion on the different techniques is beyond the scope of this paper (please refer to [1], [2] for extensive surveys)....

    [...]

  • ...are taken from [1]; the reader is referred to this survey for explanation of each detector (as space does not permit here)....

    [...]

  • ...To generate these results, a Pairwise Max non-maximal suppression [1] with an overlap threshold of 0....

    [...]

Proceedings ArticleDOI
Rainer Lienhart, J. Maydt
10 Dec 2002
TL;DR: This paper introduces a novel set of rotated Haar-like features that significantly enrich the simple features of the Viola et al. scheme, which is based on a boosted cascade of simple feature classifiers.
Abstract: Recently Viola et al. [2001] have introduced a rapid object detection scheme based on a boosted cascade of simple feature classifiers. In this paper we introduce a novel set of rotated Haar-like features. These novel features significantly enrich the simple features of Viola et al. and can also be calculated efficiently. With these new rotated features our sample face detector shows off on average a 10% lower false alarm rate at a given hit rate. We also present a novel post optimization procedure for a given boosted cascade improving on average the false alarm rate further by 12.5%.

3,133 citations


"People Detection with Heterogeneous..." refers methods in this paper

  • ...[20] R. Lienhart and J. Maydt, “An extended set of haar-like features for rapid object detection,” in Proc....

    [...]

  • ...Haar like: Here, the extended set proposed by Lienhart and Maydt [20] which includes tilted variants, is used....

    [...]

Proceedings ArticleDOI
23 Jun 2008
TL;DR: A discriminatively trained, multiscale, deformable part model for object detection, which achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge and outperforms the best results in the 2007 challenge in ten out of twenty categories.
Abstract: This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.

2,893 citations


"People Detection with Heterogeneous..." refers background in this paper

  • ..., parts based models [6], or consider various heterogeneous pool of features, e....

    [...]

Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "People detection with heterogeneous features and explicit optimization on computation time" ?

In this paper the authors present a novel people detector that employs discrete optimization for feature selection. Specifically, the authors use binary integer programming to mine heterogeneous features taking both detection performance and computation time explicitly into consideration.
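
To make the idea of feature selection by binary integer programming concrete, here is a toy sketch. It is not the authors' formulation: it assumes each candidate feature has a single computation cost and a single additive "detection gain" score, and it picks the binary selection vector with the least total cost whose total gain meets a target. The function name, costs, gains, and target are all hypothetical, and exhaustive enumeration stands in for a real BIP solver, which is only feasible for a handful of features.

```python
from itertools import product


def select_features(costs, gains, min_gain):
    """Toy BIP: minimize sum(costs[i] * x[i]) subject to
    sum(gains[i] * x[i]) >= min_gain, with every x[i] in {0, 1}.

    Solved by brute force over all 2**n selections (illustration only).
    Returns (best_selection, best_cost), or (None, inf) if infeasible.
    """
    best_x, best_cost = None, float("inf")
    for x in product((0, 1), repeat=len(costs)):
        gain = sum(g for g, xi in zip(gains, x) if xi)
        cost = sum(c for c, xi in zip(costs, x) if xi)
        if gain >= min_gain and cost < best_cost:
            best_x, best_cost = x, cost
    return best_x, best_cost


# Hypothetical pool: two cheap, weak features and one costly, strong feature.
costs = [1, 2, 8]
gains = [3, 4, 9]
selection, total_cost = select_features(costs, gains, min_gain=7)
print(selection, total_cost)
```

In this made-up instance the two cheap features together meet the target at a total cost of 3, so they are preferred over the single strong feature at a cost of 8, which mirrors the trade-off between detection performance and computation time described in the answer above.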