What are the contributions of "Fast PRISM: Branch and Bound Hough Transform for Object Class Detection"?

This paper addresses the task of efficient object class detection by means of the Hough transform. The authors demonstrate PRISM’s flexibility by two complementary implementations: a generatively trained Gaussian Mixture Model and a discriminatively trained histogram approach. Moreover, their approach takes account of the features’ scale value while ESS does not. Finally, the authors show how to avoid soft-matching and spatial pyramid descriptors during detection without losing their positive effect. Both are possible if the object model is properly regularised, and the authors discuss a modification of SVMs which allows for doing so.

What future work is proposed in "Fast PRISM: Branch and Bound Hough Transform for Object Class Detection"?

Future work will aim at deepening the understanding of the approximate bound and improving the splitting strategy of the branch and bound algorithm.

By how much can the authors reduce the computation time without reducing the accuracy?

The authors see that for values up to β = 0.1 and 0.15, respectively, the computation time (ignoring feature extraction) can be reduced by about a third without decrease in accuracy.

What is the disadvantage of kernel density estimators?

The disadvantage of kernel density estimators is their strong dependence on the training data (in terms of storage and computation time) which is unfavourable for large training sets.

What is the finite-differences implementation of the regularisation matrix?

The regularisation matrix defined in the paper can be derived as a finite-differences implementation of $\int_{\mathbb{R}^3} \alpha \|W\|^2 + \beta \|\nabla W\|^2 \, dV$ (18), where the integration is over the invariant space and ∇ denotes the gradient operator.

How can the authors efficiently process the maximum queries of (9)?

A possible method to efficiently process the maximum queries of (9) is by means of integral images (Sect. 4.3, “Maximum Query Using Integral Images”).

What is the hypothesis score for the sliding-window paradigm?

The hypothesis score is computed by the inner product S(λ|I,W) = ⟨φ(λ,I), W⟩ (2) of the footprint φ and a weight function W, i.e., the object model.

What is the criterion for a natural scale-invariant convergence?

A natural scale-invariant convergence criterion is to stop whenever the extent of the current hypothesis set in each dimension is less than a threshold (e.g. 0.01).

What is the common setup for learning?

This is a common setup (Dalal and Triggs 2005; Felzenszwalb et al. 2008; Lampert et al. 2009) where learning can be accomplished using discriminative methods (e.g. linear SVMs).

(Open Access) Fast PRISM: Branch and Bound Hough Transform for Object Class Detection (2011) | Alain Lehmann

Q: How do the SVMs integrate spatial pyramids?

The authors integrate spatial pyramids directly into Support Vector Machines (SVMs) by modifying their usual L2-norm regularisation.

ETH Library

Fast PRISM: Branch and Bound Hough Transform for Object Class Detection

Journal Article

Author(s):

Lehmann, Alain; Leibe, Bastian; Van Gool, Luc

Publication date:

2011-09

Permanent link:

https://doi.org/10.3929/ethz-b-000027008

Rights / license:

In Copyright - Non-Commercial Use Permitted

Originally published in:

International Journal of Computer Vision 94(2), https://doi.org/10.1007/s11263-010-0342-x


Int J Comput Vis (2011) 94:175–197

DOI 10.1007/s11263-010-0342-x

Fast PRISM: Branch and Bound Hough Transform for Object Class Detection

Alain Lehmann · Bastian Leibe · Luc Van Gool

Received: 21 September 2009 / Accepted: 9 April 2010 / Published online: 28 April 2010

A. Lehmann · L. Van Gool
Computer Vision Laboratory, ETH Zurich, Zurich, Switzerland
e-mail: lehmann@vision.ee.ethz.ch

L. Van Gool
e-mail: vangool@vision.ee.ethz.ch

B. Leibe
UMIC Research Centre, RWTH Aachen, Aachen, Germany
e-mail: leibe@umic.rwth-aachen.de

L. Van Gool
ESAT-PSI/IBBT, KU Leuven, Leuven, Belgium

Abstract This paper addresses the task of efficient object class detection by means of the Hough transform. This approach has been made popular by the Implicit Shape Model (ISM) and has been adopted many times. Although ISM exhibits robust detection performance, its probabilistic formulation is unsatisfactory. The PRincipled Implicit Shape Model (PRISM) overcomes these problems by interpreting Hough voting as a dual implementation of linear sliding-window detection. It thereby gives a sound justification to the voting procedure and imposes minimal constraints. We demonstrate PRISM’s flexibility by two complementary implementations: a generatively trained Gaussian Mixture Model as well as a discriminatively trained histogram approach. Both systems achieve state-of-the-art performance. Detections are found by gradient-based or branch and bound search, respectively. The latter greatly benefits from PRISM’s feature-centric view. It thereby avoids the unfavourable memory trade-off and any on-line pre-processing of the original Efficient Subwindow Search (ESS). Moreover, our approach takes account of the features’ scale value while ESS does not. Finally, we show how to avoid soft-matching and spatial pyramid descriptors during detection without losing their positive effect. This makes algorithms simpler and faster. Both are possible if the object model is properly regularised and we discuss a modification of SVMs which allows for doing so.

Keywords Object detection · Hough transform · Sliding-window · Branch and bound · Soft-matching · Spatial pyramid histograms

1 Introduction

Object detection is the problem of joint localisation and categorisation of objects in images. It involves two tasks: learning an accurate object model and the actual search for objects, i.e., applying the model to new images. While learning accurate models is not time critical (as it is done off-line), speed is of prime importance during detection. As a matter of fact, the structure of the model has a direct impact on the efficiency of the detection. Therefore, object detection involves an additional, third task: modelling the problem such that it allows for efficient search. This task, which precedes both others, and the subsequent acceleration of object detection is the focus of this paper. A central aspect towards this goal is to move as much computation as possible to the off-line training stage where runtime is not critical.

Most state-of-the-art object detectors are based on either the sliding-window paradigm (Viola and Jones 2004; Schneiderman and Kanade 2004; Dalal and Triggs 2005; Ferrari et al. 2007; Felzenszwalb et al. 2008; Maji et al. 2008) or the Hough transform (Ballard 1981; Opelt et al. 2006; Leibe et al. 2008; Liebelt et al. 2008; Maji and Malik 2009). Sliding-window considers all possible sub-images of an image and a classifier decides whether they contain an object of interest or not. For reasons of efficiency, mostly linear classifiers are used, although fast non-linear approaches have been proposed recently (Maji et al. 2008). Moreover, advanced search schemes based on branch and bound have been designed to overcome the computationally expensive exhaustive search (Lampert et al. 2009). In this work we focus on the second aforementioned paradigm, i.e. the Hough transform, and show that it can also (and even more) benefit from branch and bound search.

The Hough transform was originally introduced for line detection, while the Generalised Hough Transform (Ballard 1981) presented modifications for finding predefined shapes other than lines. More recently, the Implicit Shape Model (ISM) (Leibe et al. 2008) has shown how the underlying idea can be extended to object category detection from local features. It is this extended form which we refer to in this paper. In ISM, processing starts with local feature extraction and each feature subsequently casts probabilistic votes for possible object positions. The final hypothesis score is obtained by marginalising over all these votes. In our opinion, such bottom-up voting schemes seem to be more natural than the exhaustive sliding-window search paradigm.

Indeed, object class detectors based on ISM’s idea have become increasingly popular (Liebelt et al. 2008; Opelt et al. 2006; Maji and Malik 2009; Gall and Lempitsky 2009; Chum and Zisserman 2007). However, we argue that the commonly used ISM formalism is unsatisfactory from a probabilistic point of view. In particular, the summation over feature likelihoods is explained by marginalisation. However, the extracted features co-exist and are not mutually exclusive. Thus, marginalising over them is meaningless, which will be detailed in Sect. 2.3. Nevertheless, ISM empirically demonstrates robust detection performance. Furthermore, the summation is crucial for the voting paradigm. Hence, the question is: “How else can this summation be justified?”. We set out to give a sound answer to this question by formulating Hough-based object detection as a dual implementation of linear sliding-window detection (Lehmann et al. 2009b). As a result, PRISM brings together the advantages of both paradigms. On the one hand, the sliding-window reasoning resolves ISM’s deficiencies, i.e., PRISM gives sound justification for the voting procedure and also allows for discriminative (i.e., negative) voting weights (which was not possible before). This contribution plays out on the level of object modelling and motivates the name PRISM: PRincipled Implicit Shape Model. On the other hand, we will show that the feature-centric view of the Hough transform leads to computational advantages at detection time.

The central aspect of our PRISM framework (Lehmann et al. 2009b) is to consider the sliding-window and the Hough paradigms as two sides of the same coin. This duality is exploited to define the object score from a sliding-window point of view, while the actual evaluation follows the Hough transform. The core concept which allows the fusion of the two paradigms is a (visual) object footprint. It represents all features in a canonical reference frame which is defined through invariants. This compensates for (geometric) transformations of the object. Object hypotheses are then scored by comparing their footprint to a linear object model. The latter is compulsory for the Hough transform, but most sliding-window detectors do likewise for reasons of efficiency. Contrary to spatial histogram-based approaches (Dalal and Triggs 2005; Lampert et al. 2009), we keep track of each individual feature. This leads to a feature-centric score which is crucial for a Hough-transform-like algorithm. Moreover, it is this feature-centric score which can be exploited to improve the runtime properties of Efficient Subwindow Search (Lampert et al. 2009). In particular, we show significant memory savings and avoid any on-line pre-processing.

Fig. 1 (Color online) High-level illustration of PRISM’s main concepts. The goal of object detection is to recover a semantic description of a scene (e.g. “a car at position λ”) while we only have access to a visual description, i.e., the image pixels. Both visual and semantic description are scene-dependent and cannot be accessed during training (as they do not exist then). Our object footprint couples them and allows for computing a scene-independent object description for each object hypothesis. This footprint compensates for (geometric) object transformations by means of invariants. This generalises the usual “windowing” of sliding-window detectors. The explicitly defined invariants induce an invariant space which is scene-independent. This space thus emerges during training which makes learning possible. The footprint’s coupling is exploited during detection to combine the input (i.e., the visual description) with knowledge (i.e., the object model) to, eventually, infer the semantics of the scene

Efficient Subwindow Search (ESS) (Lampert et al. 2008, 2009) is an elegant technique to overcome exhaustive search of sliding-window systems. It thereby allows for sub-linear search time. Central to this approach is the use of bounds on sets of hypotheses, embedded in a branch and bound search scheme. Such branch and bound algorithms have already proven their effectiveness in earlier work on geometric matching (Keysers et al. 2007; Breuel 1992, 2002). However, ESS also has its downsides, which are mainly due to the integral image representation used during detection. More precisely, ESS uses two integral images per spatial bin of its (histogram) object model. These are very memory demanding as they scale with the input image size. Hence, a single-class detector with 10 × 10 histogram bins (as presented in Lampert et al. 2009, Fig. 8, but without spatial pyramid) already consumes on the order of 235 MB memory for moderately sized 640 × 480 images.¹ Due to this memory issue, ESS uses only 2D integral images although features actually reside in a 3D scale space. Hence, ESS chooses to ignore the feature scale. This limits the modelling capabilities, i.e., small/large-scale features cannot be distinguished. Furthermore, this may cause a bias towards larger-scale detections as we will discuss later. Depending on the application, the problem may be less the memory usage, but more the fact that memory has to be filled with data immediately prior to the actual search. This on-line pre-processing step may cancel the sub-linear runtime. In any case, as the number of bins scales with the number of classes, ESS will not scale well to large images and many classes. Adapting the ESS idea to the Hough-inspired PRISM framework avoids all of these problems. In particular, we present a system which assigns different weights to small/large-scale features. Therefore, it has no bias towards larger detections. Furthermore, no on-line pre-computation is needed. Hence, detection directly starts with the adaptive branch and bound search. Both memory usage and runtime are sub-linear in the size of the search space and linear in the number of features. In our opinion, both dependencies are very natural, which we will discuss.

¹ 2 integral images · 640 × 480 pixels · 100 bins · 4 bytes ≈ 235 MB. Using the 10 level pyramid increases memory usage to about 900 MB. As we argue later, the pyramid proposed in Lampert et al. (2009) is not needed at detection time as it causes model regularisation which can be integrated into the training procedure.
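The branch and bound principle referred to above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy 1D search space of discrete positions, a hypothetical per-feature vote table `votes[f][pos]` (entries may be negative), and, as the bound for a set of hypotheses, the sum of each feature's best vote within that set. All names are illustrative.

```python
import heapq

def score(votes, pos):
    """Feature-centric score: sum of all feature votes for hypothesis `pos`."""
    return sum(v[pos] for v in votes)

def upper_bound(votes, lo, hi):
    """Bound for hypotheses lo..hi-1: each feature contributes its best vote."""
    return sum(max(v[lo:hi]) for v in votes)

def branch_and_bound(votes):
    """Best-first search for the highest-scoring position (toy 1D version)."""
    n = len(votes[0])
    # heapq is a min-heap, so bounds are stored negated.
    heap = [(-upper_bound(votes, 0, n), 0, n)]
    while heap:
        neg_bound, lo, hi = heapq.heappop(heap)
        if hi - lo == 1:
            # For a single hypothesis the bound equals the exact score,
            # and no remaining set can contain a better one.
            return lo, -neg_bound
        mid = (lo + hi) // 2
        for a, b in ((lo, mid), (mid, hi)):
            heapq.heappush(heap, (-upper_bound(votes, a, b), a, b))

# Hypothetical vote tables for three features over eight positions.
votes = [
    [0, 1, 3, 1, 0, 0, 0, 0],
    [0, 0, 2, 2, 0, -1, 0, 0],
    [1, 0, 0, 1, 1, 0, 0, -2],
]
best_pos, best_score = branch_and_bound(votes)
```

In this toy run the interval covering positions 4 to 7 is never split, because its bound is lower than the exact score of the winning position. That pruning is what allows such a search to avoid visiting every hypothesis. The bound here is recomputed by scanning the tables; the sketch does not attempt the efficient bound evaluation discussed in the paper.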

A rather complementary contribution of our work tackles the common practice of soft-matching and spatial pyramid descriptors. These techniques improve detection quality, but lead to additional cost at detection time. The former has been acknowledged by many authors (Grauman and Darrell 2005; Lazebnik et al. 2006; Ferrari et al. 2007; Ommer and Buhmann 2007; Leibe et al. 2008; Philbin et al. 2008), while this paper shows how to avoid the latter. We argue that both techniques cause model regularisation. This is a concept of learning and does not belong to the detection stage. We demonstrate that soft-matching is indeed not needed during detection if the model is properly regularised. Moreover, we integrate spatial pyramids directly into Support Vector Machines (SVMs) by modifying their usual L2-norm regularisation. This makes the connection to regularisation explicit. Interestingly, the modified problem can be solved efficiently in the primal form (Chapelle 2007; Sandler et al. 2008). As a result, fast nearest-neighbour matching and flat histograms are sufficient during detection. This makes detection algorithms simpler (without sacrificing quality) and faster, which is the focus of this paper.
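To make the link between spatial smoothing and regularisation concrete, here is a minimal sketch. It is not the authors' SVM formulation: it assumes a 1D weight vector and a hypothetical quadratic penalty of the form α‖w‖² + β‖Dw‖², where D takes finite differences between neighbouring bins. The function name and the parameters `alpha` and `beta` are illustrative. With `beta = 0` the penalty reduces to a plain squared L2 norm.

```python
def smoothness_penalty(w, alpha=1.0, beta=1.0):
    """Return alpha * sum(w_i^2) + beta * sum((w_{i+1} - w_i)^2)."""
    l2_term = sum(x * x for x in w)
    diff_term = sum((b - a) ** 2 for a, b in zip(w, w[1:]))
    return alpha * l2_term + beta * diff_term

smooth = [1.0, 1.0, 1.0, 1.0]    # neighbouring bins agree
spiky = [1.0, -1.0, 1.0, -1.0]   # same L2 norm, but neighbours disagree

plain_smooth = smoothness_penalty(smooth, beta=0.0)
plain_spiky = smoothness_penalty(spiky, beta=0.0)
reg_smooth = smoothness_penalty(smooth)
reg_spiky = smoothness_penalty(spiky)
```

The plain L2 penalty cannot tell the two weight vectors apart, whereas the added difference term penalises the spiky one. A learner using such a penalty is therefore pushed towards spatially smooth models at training time, which is the kind of effect the text attributes to regularisation rather than to the detection stage.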

In summary, this paper tackles the task of object modelling and efficient search. It combines elements of what is currently the state-of-the-art in both sliding-window and Hough-based object detection, as it is indebted to both ESS (Lampert et al. 2009) and ISM (Leibe et al. 2008). It puts ISM on a well-grounded mathematical footing and shows that the ESS principle can also be applied in a different, feature-centric manner. Depending on the parameters of the problem at hand (such as the number of classes, number of extracted features, image size, etc.), one may be served better by taking this alternative view. Moreover, we will actually show that the traditional ESS bound is encompassed by our framework. This said, and as a disclaimer, the paper is not about learning better model parameters and, hence, not about improving object detection rates.

The structure of the paper is as follows: The PRISM framework (Lehmann et al. 2009b) is introduced and discussed in Sect. 2. Two implementations of this framework are presented in Sects. 3 and 4, respectively. Both algorithms achieve state-of-the-art performance and are complementary to each other, i.e., the two sections can be read independently. The former builds on Gaussian Mixture Models, which allow for efficient gradient-based search. The combination of PRISM with branch and bound (Lehmann et al. 2009a) is demonstrated in Sect. 4 along with a detailed discussion and a comparison to ESS. Section 5 describes how soft-matching and spatial pyramids can be moved to the training stage, thereby allowing for fast NN-matching and flat histograms during recognition. Section 6 gives concluding remarks.

2 The PRincipled Implicit Shape Model (PRISM)

Fig. 2 Computations in the object detection pipeline. Clearly, feature extraction (FE) and branch and bound (BNB) search depend on the newly observed image. Hence, they cannot be avoided in the detection stage. However, PRISM’s feature-centric view allows for pre-computing integral images (II) off-line during training, whereas ESS needs to compute them during detection. This is a clear advantage as speed is of prime importance during detection, while the off-line training stage is not time critical. Furthermore, spatial pyramid descriptors (PYR) can be avoided during detection by using SVMs with a modified regularisation term (RSVM)

Object detection can be formulated as a search problem: Given a newly observed image I and a trained object model W, the goal is to find the best hypothesis

$$\lambda^{*} = \underbrace{\arg\max_{\lambda \in \Lambda}}_{\text{searching}} \; \underbrace{S\bigl(\lambda \mid I, \underbrace{W}_{\text{learning}}\bigr)}_{\text{modelling}}, \qquad (1)$$

where S is a score function and Λ is the search space of all potential object hypotheses. Furthermore, this formulation reveals the three tasks involved, i.e., modelling, learning, and searching. In this section, we are concerned with the modelling problem, i.e., the design of the score function S. In general, there are no restrictions on the score function. However, the structure of S plays an important role when it comes to defining efficient search algorithms. As the search space is large, quickly finding the best-scoring hypothesis is of great importance. We believe that feature-centric scores (as introduced in ISM (Leibe et al. 2008)) offer a powerful approach. In particular, it allows us to perform certain pre-computations off-line during training (cf. Fig. 2). In such a framework, local features are matched to a visual vocabulary and cast votes for possible object positions (the score being the sum of votes). However, ISM’s probabilistic model has various shortcomings and also does not allow for discriminative voting weights, i.e., votes which can be negative and, thus, penalise certain hypotheses.

In the sequel, we will present the PRISM framework which resolves these problems. We pick up the reasoning of the sliding-window paradigm, but derive a feature-centric, Hough-like score in Sect. 2.1. A high-level illustration of PRISM’s main concepts is shown in Fig. 1. The duality of the sliding-window and the Hough paradigms is highlighted in Sect. 2.2. The shortcomings of ISM and related work are discussed in Sects. 2.3 and 2.4, respectively. Comments on multiple object detection follow in Sect. 2.5.

2.1 Feature-Centric Score Function

The central element of PRISM (Lehmann et al. 2009b) is a footprint φ(λ, I) for a given object hypothesis λ extracted from the image I. This footprint maps all detected features into a canonical reference frame. In our implementation, it describes an object independently of its location and scale in the image. Intuitively, it crops out a sub-image, which is the key idea of the sliding-window paradigm (cf. Fig. 1). Unlike general (non-linear) sliding-window scores, PRISM uses a linear score function. Such a linear model is compulsory for the Hough transform. Hence, the hypothesis score is computed by the inner product

$$S(\lambda \mid I, W) = \langle \phi(\lambda, I), W \rangle \qquad (2)$$

of the footprint φ and a weight function W, i.e., the object model. Classical sliding-window detectors (Dalal and Triggs 2005; Felzenszwalb et al. 2008; Lampert et al. 2009) represent all features in an object-centric coordinate frame. Moreover, the relative feature position is often discretised which leads to a histogram representation (Dalal and Triggs 2005) for φ and W. Contrary to previous work, we focus on the definition of the mapping function φ and avoid the discretisation. This allows us to switch from the sliding-window to a feature-driven, Hough-transform algorithm. This change of views is central to our framework and is the main reason for the favourable properties of our branch and bound algorithm (in Sect. 4).
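The change of views can be checked on a toy example. The sketch below is an illustration under simplifying assumptions rather than PRISM itself. Positions are 1D integers, scale is ignored, and the footprint is a discretised table of (visual word, relative position) counts, which is exactly the discretisation the paper avoids. The table `W`, the list `features` and all function names are hypothetical. The score is computed once as the inner product of footprint and model, and once by letting every feature look up its own weight.

```python
def footprint(features, lam, num_words, radius):
    """Object-centric table phi[word][offset] counting features near `lam`."""
    phi = [[0] * (2 * radius + 1) for _ in range(num_words)]
    for word, x in features:
        offset = x - lam
        if -radius <= offset <= radius:
            phi[word][offset + radius] += 1
    return phi

def sliding_window_score(features, lam, W, radius):
    """Inner product of the footprint with the weight table W."""
    phi = footprint(features, lam, len(W), radius)
    return sum(p * w
               for row_p, row_w in zip(phi, W)
               for p, w in zip(row_p, row_w))

def feature_centric_score(features, lam, W, radius):
    """Hough-like evaluation: every feature contributes one weight."""
    total = 0
    for word, x in features:
        offset = x - lam
        if -radius <= offset <= radius:
            total += W[word][offset + radius]
    return total

# Hypothetical model: 2 visual words, offsets -1, 0, +1; weights may be negative.
W = [[2, 0, -1],
     [-1, 3, 1]]
features = [(0, 4), (1, 5), (1, 6), (0, 9)]  # (visual word, position)

scores_sw = [sliding_window_score(features, lam, W, 1) for lam in range(10)]
scores_fc = [feature_centric_score(features, lam, W, 1) for lam in range(10)]
```

Both lists are identical because the score is linear in the footprint: summing weights over table cells weighted by counts is the same as summing one weight per feature. The feature-centric form never builds the table, so its cost depends on the number of features rather than on the number of model bins.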

Image-Object Invariants. An important aspect of the footprint is that it relates objects and features in an invariant way. In order to define invariant expressions, we first have to specify the image representation and hypothesis parametrisation. For the image representation we consider a set of features F. Each feature is characterised by a descriptor, position, and scale, i.e., $f = (f_c, f_x, f_y, f_s)$. In this work, we use local features (Mikolajczyk and Schmid 2004) and $f_c$ is an index to the best-matching visual word in a learned vocabulary. For the sake of completeness, a feature may be modulated with a factor $f_m > 0$ which accounts for e.g. the quality of the match to the vocabulary. In the sequel, we will refer to this factor as the feature mass. As object hypothesis parametrisation we use $\lambda = (\lambda_x, \lambda_y, \lambda_s)$, i.e., the object’s position and size, respectively. This is equivalent to a bounding box with fixed aspect ratio. Finally, possible mappings of these scene-dependent variables (λ, f) into a translation- and scale-invariant space are e.g.

$$I_f = \left( \frac{\lambda_x - f_x}{f_s},\; \log\frac{\lambda_s}{f_s} \right) \quad\text{or}\quad I_\lambda = \left( \frac{f_x - \lambda_x}{\lambda_s},\; \log\frac{f_s}{\lambda_s} \right) \qquad (3)$$

where the y-coordinate is analogous to x and is dropped for brevity’s sake. Using the logarithm accounts for the multiplicative nature of the scale ratio and will be helpful later in Sect. 3.1. $I_f$ considers a feature-centric coordinate frame,
