Fast PRISM: Branch and Bound Hough Transform for Object Class Detection

Abstract
This paper addresses the task of efficient object class detection by means of the Hough transform. This approach has been made popular by the Implicit Shape Model (ISM) and has been adopted many times. Although ISM exhibits robust detection performance, its probabilistic formulation is unsatisfactory. The PRincipled Implicit Shape Model (PRISM) overcomes these problems by interpreting Hough voting as a dual implementation of linear sliding-window detection. It thereby gives a sound justification to the voting procedure and imposes minimal constraints. We demonstrate PRISM's flexibility by two complementary implementations: a generatively trained Gaussian Mixture Model as well as a discriminatively trained histogram approach. Both systems achieve state-of-the-art performance. Detections are found by gradient-based or branch and bound search, respectively. The latter greatly benefits from PRISM's feature-centric view. It thereby avoids the unfavourable memory trade-off and any on-line pre-processing of the original Efficient Subwindow Search (ESS). Moreover, our approach takes account of the features' scale value while ESS does not. Finally, we show how to avoid soft-matching and spatial pyramid descriptors during detection without losing their positive effect. This makes algorithms simpler and faster. Both are possible if the object model is properly regularised and we discuss a modification of SVMs which allows for doing so.


ETH Library
Fast PRISM
Branch and Bound Hough Transform for Object Class
Detection
Journal Article
Author(s):
Lehmann, Alain; Leibe, Bastian; Van Gool, Luc
Publication date:
2011-09
Permanent link:
https://doi.org/10.3929/ethz-b-000027008
Rights / license:
In Copyright - Non-Commercial Use Permitted
Originally published in:
International Journal of Computer Vision 94(2), https://doi.org/10.1007/s11263-010-0342-x

Int J Comput Vis (2011) 94:175–197
DOI 10.1007/s11263-010-0342-x
Fast PRISM:
Branch and Bound Hough Transform for Object Class Detection
Alain Lehmann · Bastian Leibe · Luc Van Gool
Received: 21 September 2009 / Accepted: 9 April 2010 / Published online: 28 April 2010
© Springer Science+Business Media, LLC 2010
A. Lehmann (✉) · L. Van Gool
Computer Vision Laboratory, ETH Zurich, Zurich, Switzerland
e-mail: lehmann@vision.ee.ethz.ch
L. Van Gool
e-mail: vangool@vision.ee.ethz.ch
B. Leibe
UMIC Research Centre, RWTH Aachen, Aachen, Germany
e-mail: leibe@umic.rwth-aachen.de
L. Van Gool
ESAT-PSI/IBBT, KU Leuven, Leuven, Belgium

Abstract This paper addresses the task of efficient object class detection by means of the Hough transform. This approach has been made popular by the Implicit Shape Model (ISM) and has been adopted many times. Although ISM exhibits robust detection performance, its probabilistic formulation is unsatisfactory. The PRincipled Implicit Shape Model (PRISM) overcomes these problems by interpreting Hough voting as a dual implementation of linear sliding-window detection. It thereby gives a sound justification to the voting procedure and imposes minimal constraints. We demonstrate PRISM’s flexibility by two complementary implementations: a generatively trained Gaussian Mixture Model as well as a discriminatively trained histogram approach. Both systems achieve state-of-the-art performance. Detections are found by gradient-based or branch and bound search, respectively. The latter greatly benefits from PRISM’s feature-centric view. It thereby avoids the unfavourable memory trade-off and any on-line pre-processing of the original Efficient Subwindow Search (ESS). Moreover, our approach takes account of the features’ scale value while ESS does not. Finally, we show how to avoid soft-matching and spatial pyramid descriptors during detection without losing their positive effect. This makes algorithms simpler and faster. Both are possible if the object model is properly regularised and we discuss a modification of SVMs which allows for doing so.
Keywords Object detection · Hough transform ·
Sliding-window · Branch and bound · Soft-matching ·
Spatial pyramid histograms
1 Introduction
Object detection is the problem of joint localisation and cat-
egorisation of objects in images. It involves two tasks: learn-
ing an accurate object model and the actual search for ob-
jects, i.e., applying the model to new images. While learning
accurate models is not time critical (as it is done off-line),
speed is of prime importance during detection. As a matter
of fact, the structure of the model has a direct impact on the
efficiency of the detection. Therefore, object detection in-
volves an additional, third task: modelling the problem such
that it allows for efficient search. This task, which precedes
both others, and the subsequent acceleration of object de-
tection is the focus of this paper. A central aspect towards
this goal is to move as much computation as possible to the
off-line training stage where runtime is not critical.
Most state-of-the-art object detectors are based on ei-
ther the sliding-window paradigm (Viola and Jones 2004;
Schneiderman and Kanade 2004; Dalal and Triggs 2005;
Ferrari et al. 2007; Felzenszwalb et al. 2008; Maji et al.
2008) or the Hough transform (Ballard 1981; Opelt et al.
2006; Leibe et al. 2008; Liebelt et al. 2008; Maji and Ma-
lik 2009). Sliding-window considers all possible sub-images
of an image and a classifier decides whether they contain an
object of interest or not. For reasons of efficiency, mostly lin-
ear classifiers are used, although fast non-linear approaches
have been proposed recently (Maji et al. 2008). Moreover,
advanced search schemes based on branch and bound have
been designed to overcome the computationally expensive
exhaustive search (Lampert et al. 2009). In this work we fo-
cus on the second aforementioned paradigm, i.e. the Hough-
transform, and show that it can also—and even more—
benefit from branch and bound search.
The Hough transform was originally introduced for line
detection, while the Generalised Hough Transform (Ballard
1981) presented modifications for finding predefined shapes
other than lines. More recently, the Implicit Shape Model
(ISM) (Leibe et al. 2008) has shown how the underlying idea
can be extended to object category detection from local fea-
tures. It is this extended form which we refer to in this paper.
In ISM, processing starts with local feature extraction and
each feature subsequently casts probabilistic votes for pos-
sible object positions. The final hypothesis score is obtained
by marginalising over all these votes. In our opinion, such
bottom-up voting schemes seem to be more natural than the
exhaustive sliding-window search paradigm.
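
The bottom-up voting idea described above can be sketched in a few lines. This is only a minimal illustration under simplifying assumptions, not the ISM implementation: positions are 1-D integers, and the codebook entries, offsets and weights in `codebook_votes` are hypothetical values chosen for the example.

```python
import numpy as np

def hough_vote(features, codebook_votes, width):
    """Accumulate votes for 1-D object centre positions.

    features: list of (codeword, x) pairs extracted from the image.
    codebook_votes: dict codeword -> list of (offset, weight), learned off-line.
    """
    accumulator = np.zeros(width)
    for codeword, x in features:
        for offset, weight in codebook_votes.get(codeword, []):
            centre = x + offset
            if 0 <= centre < width:          # ignore votes that leave the image
                accumulator[centre] += weight
    return accumulator

# Hypothetical codebook: a wheel lies 3 px left or right of the centre, a roof on it.
codebook_votes = {
    "wheel": [(-3, 0.5), (3, 0.5)],
    "roof": [(0, 1.0)],
}
features = [("wheel", 2), ("wheel", 8), ("roof", 5)]

acc = hough_vote(features, codebook_votes, width=12)
best = int(np.argmax(acc))                   # hypothesis with the highest vote sum
```

The hypothesis score is simply the sum of the votes that land on it, which is the summation whose justification the following paragraphs discuss.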
Indeed, object class detectors based on ISM’s idea have
become increasingly popular (Liebelt et al. 2008; Opelt et
al. 2006; Maji and Malik 2009; Gall and Lempitsky 2009;
Chum and Zisserman 2007). However, we argue that the
commonly used ISM formalism is unsatisfactory from a
probabilistic point of view. In particular, the summation over
feature likelihoods is explained by marginalisation. How-
ever, the extracted features co-exist and are not mutually
exclusive. Thus, marginalising over them is meaningless
which will be detailed in Sect. 2.3. Nevertheless, ISM em-
pirically demonstrates robust detection performance. Fur-
thermore, the summation is crucial for the voting para-
digm. Hence, the question is: “How else can this summa-
tion be justified?”. We set out to give a sound answer to
this question by formulating Hough-based object detection
as a dual implementation of linear sliding-window detec-
tion (Lehmann et al. 2009b). As a result, PRISM brings to-
gether the advantages of both paradigms. On the one hand,
the sliding-window reasoning resolves ISM’s deficiencies,
i.e., PRISM gives sound justification for the voting proce-
dure and also allows for discriminative (i.e., negative) vot-
ing weights (which was not possible before). This contribu-
tion plays out on the level of object modelling and motivates
the name PRISM: PRincipled Implicit Shape Model. On the
other hand, we will show that the feature-centric view of the
Hough-transform leads to computational advantages at de-
tection time.
The central aspect of our PRISM framework (Lehmann
et al. 2009b) is to consider the sliding-window and the
Hough paradigms as two sides of the same coin. This du-
ality is exploited to define the object score from a sliding-
window point of view, while the actual evaluation follows
the Hough transform. The core concept which allows the fusion
of the two paradigms is a (visual) object footprint. It
represents all features in a canonical reference frame which
is defined through invariants. This compensates for (geo-
metric) transformations of the object. Object hypotheses are
then scored by comparing their footprint to a linear object
model. The latter is compulsory for the Hough transform,
but most sliding-window detectors do likewise for reasons of
efficiency. Contrary to spatial histogram-based approaches
(Dalal and Triggs 2005; Lampert et al. 2009), we keep track
of each individual feature. This leads to a feature-centric
score which is crucial for a Hough transform like algo-
rithm. Moreover, it is this feature-centric score which can
be exploited to improve the runtime properties of Efficient
Subwindow Search (Lampert et al. 2009). In particular, we
show significant memory savings and avoid any on-line pre-
processing.
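
To make the "two sides of the same coin" statement concrete, the following sketch evaluates the same linear model in both ways and checks that the scores agree. It is a toy setting under stated assumptions: 1-D positions, no scale, a discretised model `W` indexed by visual word and relative position, and random data. None of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_bins, width = 4, 5, 20            # vocabulary size, relative-position bins, image width
W = rng.normal(size=(n_words, n_bins))       # linear object model; weights may be negative
features = [(int(rng.integers(n_words)), int(rng.integers(width))) for _ in range(30)]
half = n_bins // 2                           # relative positions -half..+half lie inside the window

def sliding_window_scores():
    """Object-centric view: build each hypothesis' footprint, then take <footprint, W>."""
    scores = np.zeros(width)
    for centre in range(width):
        footprint = np.zeros_like(W)
        for word, x in features:
            rel = x - centre
            if -half <= rel <= half:
                footprint[word, rel + half] += 1.0
        scores[centre] = np.sum(footprint * W)
    return scores

def hough_scores():
    """Feature-centric view: every feature votes with the matching model weight."""
    scores = np.zeros(width)
    for word, x in features:
        for rel in range(-half, half + 1):
            centre = x - rel
            if 0 <= centre < width:
                scores[centre] += W[word, rel + half]
    return scores

same = np.allclose(sliding_window_scores(), hough_scores())
```

Both functions visit the same (feature, hypothesis) pairs and add the same weights; only the loop order differs. The feature-centric order never materialises a footprint histogram, and that is the property the later branch and bound search exploits.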
Efficient Subwindow Search (ESS) (Lampert et al. 2008,
2009) is an elegant technique to overcome exhaustive search
of sliding-window systems. It thereby allows for sub-linear
search time. Central to this approach is the use of bounds
on sets of hypotheses, embedded in a branch and bound
search scheme. Such branch and bound algorithms have already
proven their effectiveness in earlier work on geometric
matching (Keysers et al. 2007; Breuel 1992, 2002). However,
ESS also has its downsides, which are mainly due
to the integral image representation used during detection.
More precisely, ESS uses two integral images per spatial bin
of its (histogram) object model. These are very memory de-
manding as they scale with the input image size. Hence, a
single-class detector with 10 × 10 histogram bins (as presented
in Lampert et al. 2009, Fig. 8, but without spatial
pyramid) already consumes on the order of 235 MB of memory
for moderately sized 640 × 480 images (footnote 1). Due to this
memory issue, ESS uses only 2D integral images although
features actually reside in a 3D scale space. Hence, ESS
chooses to ignore the feature scale. This limits the modelling
capabilities, i.e., small- and large-scale features cannot be
distinguished. Furthermore, this may cause a bias towards
larger-scale detections as we will discuss later. Depending on the
application, the problem may be less the memory usage, but
more the fact that memory has to be filled with data immedi-
ately prior to the actual search. This on-line pre-processing
step may cancel the sub-linear runtime. In any case, as the
number of bins scales with the number of classes, ESS will
not scale well to large images and many classes. Adapt-
ing the ESS idea to the Hough-inspired PRISM framework
avoids all of these problems. In particular, we present a sys-
tem which assigns different weights to small/large-scale fea-
tures. Therefore, it has no bias towards larger detections.
Footnote 1: 2 integral images · 640 × 480 pixels · 100 bins · 4 bytes ≈ 235 MB.
Using the 10 level pyramid increases memory usage to about 900 MB. As
we argue later, the pyramid proposed in Lampert et al. (2009) is not
needed at detection time as it causes model regularisation which can
be integrated into the training procedure.
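
The footnote's estimate can be reproduced with a short calculation. The function name and its parameters are illustrative only; the numbers (two integral images per bin, 4 bytes per entry, 10 × 10 bins, 640 × 480 pixels) are the ones quoted in the footnote.

```python
def ess_integral_image_bytes(width, height, n_bins, images_per_bin=2, bytes_per_entry=4):
    """Memory needed for the per-bin integral images, following the footnote's arithmetic."""
    return images_per_bin * width * height * n_bins * bytes_per_entry

n_bytes = ess_integral_image_bytes(640, 480, n_bins=10 * 10)
n_mib = n_bytes / 2**20      # about 234.4 MiB, in line with the quoted ~235 MB

# The cost grows linearly with the image area and with the number of bins,
# so doubling both image dimensions quadruples the memory.
n_bytes_large = ess_integral_image_bytes(1280, 960, n_bins=10 * 10)
```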
Fig. 1 (Color online) High-level illustration of PRISM’s main con-
cepts. The goal of object detection is to recover a semantic description
of a scene (e.g. “a car at position λ”) while we only have access to
a visual description, i.e., the image pixels. Both visual and seman-
tic description are scene-dependent and cannot be accessed during
training (as they do not exist then). Our object footprint couples them
and allows for computing a scene-independent object description for
each object hypothesis. This footprint compensates for (geometric)
object transformations by means of invariants. This generalises the
usual “windowing” of sliding-window detectors. The explicitly de-
fined invariants induce an invariant space which is scene-independent.
This space thus emerges during training which makes learning possi-
ble. The footprint’s coupling is exploited during detection to combine
the input (i.e., the visual description) with knowledge (i.e., the object
model) to, eventually, infer the semantics of the scene
Furthermore, no on-line pre-computation is needed. Hence,
detection directly starts with the adaptive branch and bound
search. Both memory usage and runtime are sub-linear in the
size of the search space and linear in the number of features.
In our opinion, both dependencies are very natural, which
we will discuss.
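
The role of bounds on sets of hypotheses can be illustrated with a generic best-first branch and bound search. This is a deliberately simplified sketch, not the bound used by ESS or by PRISM: hypotheses are integer positions on a line, each vote is a hypothetical `(position, weight)` pair, and the upper bound for an interval simply sums the positive votes inside it.

```python
import heapq

def score(votes, x):
    """Exact score of a single hypothesis: sum of all vote weights landing on x."""
    return sum(w for pos, w in votes if pos == x)

def upper_bound(votes, lo, hi):
    """Optimistic bound for every hypothesis in [lo, hi]: count only positive votes."""
    if lo == hi:
        return score(votes, lo)              # the bound is exact for a single hypothesis
    return sum(w for pos, w in votes if lo <= pos <= hi and w > 0)

def branch_and_bound(votes, lo, hi):
    """Best-first search over intervals; returns (best position, best score)."""
    heap = [(-upper_bound(votes, lo, hi), lo, hi)]
    while heap:
        neg_bound, lo, hi = heapq.heappop(heap)
        if lo == hi:                         # the most promising set is a single hypothesis
            return lo, -neg_bound
        mid = (lo + hi) // 2
        for a, b in ((lo, mid), (mid + 1, hi)):
            heapq.heappush(heap, (-upper_bound(votes, a, b), a, b))

votes = [(3, 1.0), (3, 0.5), (7, 2.0), (7, -1.5), (12, 1.2), (12, -0.1)]
best_x, best_score = branch_and_bound(votes, 0, 15)
brute_force = max(score(votes, x) for x in range(16))
```

Because the bound never underestimates the best score inside an interval and is exact for single hypotheses, the first single hypothesis popped from the queue is a global maximum. Intervals with weak bounds are never refined, which is where the savings over exhaustive search come from.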
A rather complementary contribution of our work tack-
les the common practice of soft-matching and spatial pyra-
mid descriptors. These techniques improve detection qual-
ity, but lead to additional cost at detection time. The for-
mer has been acknowledged by many authors (Grauman
and Darrell 2005; Lazebnik et al. 2006; Ferrari et al. 2007;
Ommer and Buhmann 2007; Leibe et al. 2008; Philbin et
al. 2008), while this paper shows how to avoid the latter.
We argue that both techniques cause model regularisation.
This is a concept of learning and does not belong to the de-
tection stage. We demonstrate that soft-matching is indeed
not needed during detection if the model is properly reg-
ularised. Moreover, we integrate spatial pyramids directly
into Support Vector Machines (SVMs) by modifying their
usual $L_2$-norm regularisation. This makes the connection to
regularisation explicit. Interestingly, the modified problem
can be solved efficiently in the primal form (Chapelle 2007;
Sandler et al. 2008). As a result, fast nearest-neighbour
matching and flat histograms are sufficient during detection.
This makes detection algorithms simpler (without sacrific-
ing quality) and faster, which is the focus of this paper.
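
As a rough illustration of what modifying the $L_2$-norm regulariser can mean, the sketch below replaces the plain penalty $\|w\|^2$ by a quadratic form $w^\top R\, w$ that also penalises differences between neighbouring bins. The specific choice $R = \alpha I + \beta D^\top D$, the 1-D bin layout and the values of `alpha` and `beta` are assumptions made for this example and are not taken from the text above.

```python
import numpy as np

def smoothness_regulariser(n, alpha, beta):
    """R = alpha * I + beta * D^T D, with D the forward-difference operator on n bins."""
    D = np.diff(np.eye(n), axis=0)           # shape (n - 1, n): row i is e_{i+1} - e_i
    return alpha * np.eye(n) + beta * D.T @ D

def penalty(w, R):
    """Quadratic regularisation term w^T R w."""
    return float(w @ R @ w)

n = 8
R = smoothness_regulariser(n, alpha=1.0, beta=0.5)
smooth = np.ones(n) / np.sqrt(n)                      # constant weights, unit L2 norm
rough = np.array([1, -1] * (n // 2)) / np.sqrt(n)     # alternating weights, unit L2 norm

p_smooth = penalty(smooth, R)    # only the alpha term contributes
p_rough = penalty(rough, R)      # the difference term adds a large extra cost
```

Both weight vectors have the same $L_2$ norm, so a standard SVM regulariser would not distinguish them. The modified term prefers the spatially smooth model, which is the kind of effect that soft-matching and spatial pyramids otherwise introduce at detection time.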
In summary, this paper tackles the task of object mod-
elling and efficient search. It combines elements of what
is currently the state-of-the-art in both sliding-window and
Hough-based object detection, as it is indebted to both ESS
(Lampert et al. 2009) and ISM (Leibe et al. 2008). It puts
ISM on a well-grounded mathematical footing and shows
that the ESS principle can also be applied in a different,
feature-centric manner. Depending on the parameters of the
problem at hand (such as the number of classes, number of
extracted features, image size, etc.), one may be served bet-
ter by taking this alternative view. Moreover, we will actu-
ally show that the traditional ESS bound is encompassed by
our framework. This said, and as a disclaimer, the paper is
not about learning better model parameters and, hence, not
about improving object detection rates.
The structure of the paper is as follows: The PRISM
framework (Lehmann et al. 2009b) is introduced and dis-
cussed in Sect. 2. Two implementations of this framework
are presented in Sects. 3 and 4, respectively. Both algorithms
achieve state-of-the-art performance and are complemen-
tary to each other, i.e., the two sections can be read inde-
pendently. The former builds on Gaussian Mixture Models,
which allow for efficient gradient based search. The com-
bination of PRISM with branch and bound (Lehmann et al.
2009a) is demonstrated in Sect. 4 along with a detailed dis-
cussion and a comparison to ESS. Section 5 describes how
soft-matching and spatial pyramids can be moved to the
training stage, thereby allowing for fast NN-matching and
flat histograms during recognition. Section 6 gives conclud-
ing remarks.
2 The PRincipled Implicit Shape Model (PRISM)
Object detection can be formulated as a search problem:
Given a newly observed image I and a trained object
Fig. 2 Computations in the object detection pipeline. Clearly, fea-
ture extraction (FE) and branch and bound (BNB) search depend on
the newly observed image. Hence, they cannot be avoided in the
detection stage. However, PRISM’s feature-centric view allows for
pre-computing integral images (II) off-line during training, whereas
ESS needs to compute them during detection. This is a clear advantage
as speed is of prime importance during detection, while the off-line
training stage is not time critical. Furthermore, spatial pyramid de-
scriptors (PYR) can be avoided during detection by using SVMs with
a modified regularisation term (RSVM)
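model $W$, the goal is to find the best hypothesis
$$\lambda^{*} = \underbrace{\operatorname*{argmax}_{\lambda \in \Lambda}}_{\text{searching}} \; \underbrace{S}_{\text{modelling}}\bigl(\lambda \mid I, \underbrace{W}_{\text{learning}}\bigr), \tag{1}$$
where $S$ is a score function and $\Lambda$ is the search space of all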
potential object hypotheses. Furthermore, this formulation
reveals the three tasks involved, i.e., modelling, learning,
and searching. In this section, we are concerned with the
modelling problem, i.e., the design of the score function S.
In general, there are no restrictions on the score function.
However, the structure of S plays an important role when it
comes to defining efficient search algorithms. As the search
space is large, quickly finding the best-scoring hypothesis is
of great importance. We believe that feature-centric scores
(as introduced in ISM (Leibe et al. 2008)) offer a powerful
approach. In particular, it allows us to perform certain pre-
computations off-line during training (c.f. Fig. 2). In such a
framework, local features are matched to a visual vocabulary
and cast votes for possible object positions (the score being
the sum of votes). However, ISM’s probabilistic model has
various shortcomings and also does not allow for discrimi-
native voting weights, i.e., votes which can be negative and,
thus, penalise certain hypotheses.
In the sequel, we will present the PRISM framework
which resolves these problems. We pick up the reasoning of
the sliding-window paradigm, but derive a feature-centric,
Hough-like score in Sect. 2.1. A high-level illustration of
PRISM’s main concepts is shown in Fig. 1. The duality
of the sliding-window and the Hough paradigms is highlighted
in Sect. 2.2. The shortcomings of ISM and related
work are discussed in Sects. 2.3 and 2.4, respectively. Com-
ments on multiple object detection follow in Sect. 2.5.
2.1 Feature-Centric Score Function
The central element of PRISM (Lehmann et al. 2009b) is a
footprint φ(λ,I) for a given object hypothesis λ extracted
from the image I . This footprint maps all detected features
into a canonical reference frame. In our implementation, it
describes an object independently of its location and scale
in the image. Intuitively, it crops out a sub-image, which is
the key idea of the sliding-window paradigm (c.f. Fig. 1).
Unlike general (non-linear) sliding-window scores, PRISM
uses a linear score function. Such a linear model is compul-
sory for the Hough transform. Hence, the hypothesis score
is computed by the inner product
S(λ|I,W) =φ(λ,I),W (2)
of the footprint φ and a weight function W ,
i.e., the object
model. Classical sliding-window detectors (Dalal and Triggs
2005; Felzenszwalb et al. 2008; Lampert et al. 2009) repre-
sent all features in an object-centric coordinate frame. More-
over, the relative feature position is often discretised which
leads to a histogram representation (Dalal and Triggs 2005)
for φ and W . Contrary to previous work, we focus on the
definition of the mapping function φ and avoid the discreti-
sation. This allows us to switch from the sliding-window to a
feature-driven, Hough-transform algorithm. This change of
views is central to our framework and is the main reason for
the favourable properties of our branch and bound algorithm
(in Sect. 4).
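
A small sketch may help to see what a footprint that is independent of location and scale buys in practice. It assumes 1-D positions, one plausible object-centric mapping (relative position divided by object size, plus the log scale ratio) and a hypothetical weight function `toy_weight`. All three are illustrative choices, not the authors' model.

```python
import math

def invariant(feature, hypothesis):
    """Map a feature (x, s) into the object-centric frame of a hypothesis (x, s)."""
    fx, fs = feature
    lx, ls = hypothesis
    return ((fx - lx) / ls, math.log(fs / ls))

def score(features, hypothesis, weight):
    """Linear, feature-centric score: one weight lookup per feature, then a sum."""
    return sum(weight(word, *invariant(f, hypothesis)) for word, f in features)

def toy_weight(word, rel_x, log_rel_s):
    """Hypothetical object model W: word 0 supports, word 1 penalises a hypothesis."""
    return (1.0 if word == 0 else -0.5) * math.exp(-(rel_x ** 2 + log_rel_s ** 2))

features = [(0, (12.0, 2.0)), (1, (15.0, 1.0)), (0, (9.0, 4.0))]
hypothesis = (10.0, 4.0)

def transform(x, s, shift=7.0, zoom=3.0):
    """Move and zoom the whole scene: positions and scales change together."""
    return (zoom * x + shift, zoom * s)

moved_features = [(word, transform(*f)) for word, f in features]
moved_hypothesis = transform(*hypothesis)

s_original = score(features, hypothesis, toy_weight)
s_moved = score(moved_features, moved_hypothesis, toy_weight)
```

The two scores agree up to rounding because the mapped coordinates do not change when object and features are shifted and zoomed together. Note also that the score is a plain sum over features with no histogram in between, and that a negative weight (word 1 here) lowers the score of a hypothesis.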
Image-Object Invariants. An important aspect of the foot-
print is that it relates objects and features in an invariant
way. In order to define invariant expressions, we first have to
specify the image representation and hypothesis parametrisation.
For the image representation we consider a set of features $F$.
Each feature is characterised by a descriptor, position, and
scale, i.e., $f = (f_c, f_x, f_y, f_s)$. In this work, we
use local features (Mikolajczyk and Schmid 2004) and $f_c$
is an index to the best-matching visual word in a learned
vocabulary. For the sake of completeness, a feature may be
modulated with a factor $f_m > 0$ which accounts for e.g. the
quality of the match to the vocabulary. In the sequel, we will
refer to this factor as the feature mass. As object hypothesis
parametrisation we use $\lambda = (\lambda_x, \lambda_y, \lambda_s)$, i.e., the object’s
position and size, respectively. This is equivalent to a bounding
box with fixed aspect ratio. Finally, possible mappings
of these scene-dependent variables $(\lambda, f)$ into a translation-
and scale-invariant space are e.g.
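$$I_f = \left[ \frac{\lambda_x - f_x}{f_s},\; \log\frac{\lambda_s}{f_s} \right] \quad \text{or} \quad I_\lambda = \left[ \frac{f_x - \lambda_x}{\lambda_s},\; \log\frac{f_s}{\lambda_s} \right], \tag{3}$$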
where the y-coordinate is analogous to x and is dropped for
brevity’s sake. Using the logarithm accounts for the multi-
plicative nature of the scale ratio and will be helpful later
in Sect. 3.1. $I_f$ considers a feature-centric coordinate frame,
