
A Sparse Object Category Model for Efficient Learning and Complete Recognition

Rob Fergus¹, Pietro Perona², and Andrew Zisserman¹

¹ Dept. of Engineering Science, University of Oxford, Parks Road, Oxford OX1 3PJ, U.K.
{fergus,az}@robots.ox.ac.uk

² Dept. of Electrical Engineering, California Institute of Technology, MC 136–93, Pasadena, CA 91125, U.S.A.
perona@vision.caltech.edu
Abstract
We present a “parts and structure” model for object category recognition that can be learnt efficiently and in a weakly-supervised manner: the model is learnt from example images containing category instances, without requiring segmentation from background clutter.

The model is a sparse representation of the object, and consists of a star topology configuration of parts modeling the output of a variety of feature detectors. The optimal choice of feature types (whose repertoire includes interest points, curves and regions) is made automatically.

In recognition, the model may be applied efficiently in a complete manner, bypassing the need for feature detectors, to give the globally optimal match within a query image. The approach is demonstrated on a wide variety of categories, and delivers both successful classification and localization of the object within the image.
1 Introduction
A variety of models and methods exist for representing, learning and recognizing object categories in images. Many of these are variations on the “Parts and Structure” model introduced by Fischler and Elschlager [10], though the modern instantiations use scale-invariant image fragments [1–3, 12, 15, 20, 21]. The constellation model [3, 8, 21] was the first to convincingly demonstrate that models could be learnt from weakly-supervised, unsegmented training images (i.e. the only supervision information was that the image contained an instance of the object category, but not the location of the instance in the image). Various types of categories could be modeled, including those specified by tight spatial configurations (such as cars) and those specified by tight appearance exemplars (such as spotted cats). The model was translation and scale invariant both in learning and in recognition.
However, the constellation model of [8] has some serious shortcomings, namely: (i) The joint nature of the shape model results in an exponential explosion in computational cost, limiting the number of parts and regions per image that can be handled. For N feature detections and P model parts, the complexity of both learning and recognition is O(N^P); (ii) Since only 20–30 regions per image and 6 parts are permitted by this complexity, the model can only learn from an extremely sparse representation of the image. Good performance is therefore highly dependent on the consistent firing of the feature detector; (iii) Only one type of feature detector (a region operator) was used, making the model very sensitive to the nature of the class. If the distinctive features of the category happen, say, to be edge-based, then relying on a region-based detector is likely to give poor results (though this limitation was overcome in later work [9]); (iv) The model has many parameters, resulting in over-fitting unless a large number of training images (typically 200+) are used.
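To make shortcoming (i) concrete, the arithmetic below contrasts the joint model's O(N^P) hypothesis count with the O(N^2 P) count of the star model introduced later in the paper. The function names are illustrative only; the figures for N and P follow the text (20–30 regions and 6 parts for the full model).

```python
def full_model_cost(n_features: int, n_parts: int) -> int:
    """Assignment hypotheses for a fully-connected (joint) shape model: N^P."""
    return n_features ** n_parts

def star_model_cost(n_features: int, n_parts: int) -> int:
    """Assignment hypotheses when each part depends only on a landmark: N^2 * P."""
    return n_features ** 2 * n_parts

# The full model is already expensive at the limits quoted in the text...
print(full_model_cost(20, 6))    # 64000000
print(star_model_cost(20, 6))    # 2400
# ...while the star model stays tractable with far more features and parts.
print(star_model_cost(200, 12))  # 480000
```

This is why the star model can afford to increase both the number of parts and the number of detections per image.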
Other models and methods have since been developed which have achieved superior performance to the constellation model on at least a subset of the object categories modeled in [8]. These models range from bag-of-words models (where the words are vector-quantized invariant descriptors) with no spatial organization [5, 18], through to fragment-based models [2, 15] with particular spatial configurations. The methods utilize a range of machine learning approaches: EM, SVMs and AdaBoost.
In this paper we propose a heterogeneous star model (HSM) which maintains the simple training requirements of the constellation model and, like the constellation model, gives a localization for the recognized object. The model is translation and scale invariant both in learning and in recognition. There are three main areas of innovation: (i) both in learning and in recognition it has a lower complexity than the constellation model, enabling both the number of parts and the number of detected features to be increased substantially; (ii) it is heterogeneous and is able to make the optimal selection of feature types (here from a pool of three, including curves). This enables it to better model objects with significant intra-class variation in appearance but less variation in outline (for example a guitar), or vice versa; (iii) the recognition stage can use feature detectors or can be complete in the manner of Felzenszwalb and Huttenlocher [6]. In the latter case there is no actual detection stage; rather, the model itself defines the areas of most relevance using a matched filter. This complete search overcomes many false negatives due to feature drop-out, and also poor localizations due to small feature displacement and scale errors.
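At its simplest, the matched filter mentioned above is a cross-correlation of a part template with the image, evaluated densely at every offset rather than only at detected features. The sketch below is an unnormalized, naive illustration (the paper's filter is derived from the learnt part appearance model, not a raw template):

```python
import numpy as np

def matched_filter_response(image: np.ndarray, template: np.ndarray) -> np.ndarray:
    """
    Dense response map from sliding a part template over the image:
    plain cross-correlation at every valid offset. Illustrates how a model
    can propose its own candidate part locations instead of relying on the
    sparse output of a feature detector.
    """
    th, tw = template.shape
    H, W = image.shape
    out = np.empty((H - th + 1, W - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + th, x:x + tw] * template)
    return out

# The response peaks where the image patch best matches the template.
img = np.zeros((5, 5)); img[2:4, 2:4] = 1.0
resp = matched_filter_response(img, np.ones((2, 2)))
print(np.unravel_index(np.argmax(resp), resp.shape))  # (2, 2)
```

In practice a dense search like this is what removes the false negatives caused by detector drop-out: every location is scored, so no candidate is missed.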
2 Approach
We describe here the structure of the heterogeneous star model, how it is learnt from
training data, and how it is applied to test data for recognition.

2.1 Star model
As in the constellation model of [8], our model has P parts and parameters θ. From each image i, we extract N features with locations X_i, scales S_i and descriptors D_i. In learning, the aim is to find the value of θ that maximizes the log-likelihood over all images:

    \sum_i \log p(X_i, D_i, S_i \mid \theta)    (1)

Since N >> P, we introduce an assignment variable, h, to assign features to parts in the model. The log-likelihood is obtained by marginalizing over h:

    \sum_i \log \sum_h p(X_i, D_i, S_i, h \mid \theta)    (2)
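The marginalization over h in Eqn. 2 is a sum of joint densities, which is normally carried out in log-space for numerical stability. A toy sketch (the log-joint values are hypothetical, not from a real model) using the standard log-sum-exp trick:

```python
import numpy as np

def log_marginal(log_joint_per_hypothesis: np.ndarray) -> float:
    """log sum_h p(X, D, S, h | theta), given log p(...) for each hypothesis h."""
    m = log_joint_per_hypothesis.max()              # shift for stability
    return float(m + np.log(np.sum(np.exp(log_joint_per_hypothesis - m))))

# Toy example: three hypotheses, each with joint probability 0.01,
# should marginalize to log(0.03).
log_joints = np.log(np.array([0.01, 0.01, 0.01]))
print(np.isclose(log_marginal(log_joints), np.log(0.03)))  # True
```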
In the constellation model, the joint density is factored as:

    p(X_i, D_i, S_i, h \mid \theta) =
      \underbrace{p(D_i \mid h, \theta)}_{\text{Appearance}} \,
      \underbrace{p(X_i \mid S_i, h, \theta)}_{\text{Rel. Locations}} \,
      \underbrace{p(S_i \mid h, \theta)}_{\text{Rel. Scale}} \,
      \underbrace{p(h \mid \theta)}_{\text{Occlusion}}    (3)
In [8], the appearance model for each part is assumed independent, but the relative location of the model parts is represented by a joint Gaussian density. While this provides the most thorough description, it makes the location of all parts dependent on one another. Consequently, the EM-based learning scheme, which entails marginalizing over p(h | X_i, D_i, S_i, θ), becomes an O(N^P) operation. We propose here a simplified configuration model in which the location of each model part is conditioned on the location of a landmark part. Under this model the non-landmark parts are independent of one another given the landmark. In graphical model terms, this is a tree of depth one, with the landmark part being the root node. We call this the “star” model. A similar model, where the reference frame acts as a landmark, is used by Lowe [16] and was studied in a probabilistic framework by Moreels et al. [17]. Figure 1 illustrates the differences between the full and star models.

Fig. 1. (a) Fully-connected six-part shape model. Each node is a model part while the edges represent the dependencies between parts. (b) A six-part “star” model. The former has complexity O(N^P) while the latter has complexity O(N^2 P), which may be further improved in recognition to O(NP) by the use of distance transforms [6].

In the star model the joint probability of the configuration aspect of the model may be factored as:
    p(X \mid S, h, \theta) = p(x_L \mid h_L) \prod_{j \neq L} p(x_j \mid x_L, s_L, h_j, \theta_j)    (4)
where x_j is the position of part j and L is the landmark part. We adopt a Gaussian model for p(x_j | x_L, s_L, h_j, θ_j) which depends only on the relative position and scale between each part and the landmark. The reduced dependencies of this model mean that the marginalization in Eqn. 2 is O(N^2 P), in theory allowing us to cope with a larger N and P in learning and recognition.

In practical terms, we achieve translation invariance by subtracting the location of the landmark part from the non-landmark ones. Scale invariance is achieved by dividing the locations of the non-landmark parts by the locally measured scale of the landmark part.
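To make the O(N^2 P) claim concrete, the sketch below implements the search implied by Eqn. 4: each of the N candidate landmarks normalizes the detections for translation and scale, and each non-landmark part then independently picks its best of the N detections. All names, and the isotropic-Gaussian and uniform-landmark-prior choices, are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log-density of an isotropic 2-D Gaussian with covariance var * I."""
    return -np.sum((x - mean) ** 2) / (2 * var) - np.log(2 * np.pi * var)

def best_star_match(locations, scales, part_means, part_var=1.0):
    """
    O(N^2 * P) search for the best star-model configuration.
    locations: (N, 2) detected feature positions; scales: (N,) feature scales;
    part_means: (P-1, 2) mean positions of the non-landmark parts in the
    landmark-normalized frame. The landmark prior p(x_L | h_L) is taken as
    uniform and dropped. Returns the best total relative-location log-score.
    """
    best = -np.inf
    for L in range(len(locations)):           # N candidate landmarks
        # Translation/scale invariance as in the text: subtract the landmark's
        # location and divide by its locally measured scale.
        rel = (locations - locations[L]) / scales[L]
        score = sum(
            max(gaussian_logpdf(rel[j], mean, part_var)
                for j in range(len(locations)) if j != L)
            for mean in part_means)           # each part: best of N detections
        best = max(best, score)
    return best

# Detections that exactly realize the model score (P-1) times the Gaussian's
# peak log-density, here -2 * log(2*pi) for two non-landmark parts.
locations = np.array([[0., 0.], [1., 0.], [0., 1.]])
print(np.isclose(best_star_match(locations, np.ones(3),
                                 np.array([[1., 0.], [0., 1.]])),
                 -2 * np.log(2 * np.pi)))  # True
```

Note the key structural point: given the landmark, each part's maximization is independent, which is exactly what the full joint Gaussian of [8] forbids.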
It is useful to examine what has been lost in the star compared to the constellation
model of [8]. In the star model any of the leaf (i.e. non-landmark) parts can be occluded,
but (as discussed below) we impose the condition that the landmark part must always
be present. With small N this can lead to a model with artificially high variance, but
as N increases this ceases to be a problem (since the landmark is increasingly likely to
actually be detected). In the constellation model any or several parts can be occluded.
This is a powerful feature: not only does it make the model robust to the inadequacies
of the region detector but it also assists the convergence properties of the model by
enabling a subset of the parts to be fitted rather than all simultaneously.
The star model does have other benefits, though: it has fewer parameters, so it can be trained on fewer images without over-fitting.
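The caption of Fig. 1 notes that recognition can be reduced from O(N^2 P) to O(NP) via the distance transforms of [6]. The sketch below is a direct, quadratic-time transcription of the definition of the generalized squared-distance transform, to show what is being computed; the contribution of [6] is evaluating the same lower envelope in linear time.

```python
import numpy as np

def distance_transform_1d(f: np.ndarray) -> np.ndarray:
    """
    Generalized squared-distance transform on a 1-D grid:
        D(p) = min_q ((p - q)^2 + f(q)).
    Here f(q) would encode the (negated) appearance score of a part at grid
    position q; D spreads each score quadratically, so the best part placement
    for every landmark position can be read off without an explicit search.
    O(n^2) by construction; [6] computes it in O(n).
    """
    n = len(f)
    q = np.arange(n)
    return np.array([float(np.min((p - q) ** 2 + f)) for p in range(n)])

# A single strong response at position 0 decays quadratically with distance.
print(distance_transform_1d(np.array([0.0, 100.0, 100.0, 100.0])))
# [0. 1. 4. 9.]
```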
2.2 Heterogeneous features
By constraining the model to operate, in both learning and recognition, on the sparse outputs of a feature detector, good performance becomes highly dependent on the detector finding parts of the object that are characteristic and distinctive of the class. The majority of feature-based approaches rely on region detectors such as Kadir and Brady or multi-scale Harris [11, 13], which favour interest points or circular regions. However, for certain classes such as bottles or mugs, the outline of the object is more informative than the textured regions of the interior. Curves have been used to a limited extent in previous models for object categories: for example, both Fergus et al. [9] and Jurie & Schmid [12] introduce curves as a feature type. However, in both cases the model was constrained to be homogeneous, i.e. consisting only of curves. Here the models can utilize a combination of different feature detectors, the optimal selection being made automatically. This makes the scheme far more tolerant of the type of category to be learnt.
In our scheme, we have a choice of three feature types: Kadir & Brady, multi-scale Harris, and curves. Figure 2 shows examples of these three operators on two sample airplane images. The detectors were chosen since they are somewhat complementary in their properties: Kadir & Brady favours circular regions; multi-scale Harris prefers interest points; and curves locate the outline of the object.

Fig. 2. Output of three different feature detectors on two airplane images. (a) Curves. (b) Kadir
& Brady. (c) Multi-scale Harris.
To be able to learn different combinations of features we use the same representation for all types. Inspired by the performance of PCA-SIFT in region matching [14], we utilize a gradient-based PCA approach, in contrast to the intensity-based PCA approach of [8]. Both the region operators give a location and scale for each feature. Each feature is cropped from the image (using a square mask), rescaled to a k × k patch, has its gradient computed, and is then normalized to remove intensity differences. Note that we do not perform any orientation normalization as in [14]. The outcome is a vector of length 2k^2, with the first k^2 elements representing the x derivatives and the second k^2 the y derivatives. The derivatives are computed by symmetric finite differences (cropping to avoid edge effects).

The normalized gradient-patch is then projected into a fixed PCA basis³ of d dimensions. Two additional measurements are made for each gradient-patch: its unnormalized energy, and the reconstruction error between the point in the PCA basis and the original gradient-patch. Each region is thus represented by a vector of length d + 2.
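A sketch of the d + 2 descriptor pipeline just described. The PCA basis below is a random orthonormal stand-in for the fixed basis of footnote 3, k and d are illustrative values (not the paper's), and np.gradient stands in for symmetric finite differences with edge cropping.

```python
import numpy as np

def gradient_patch_descriptor(patch: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """
    patch: k x k intensity patch (already cropped and rescaled);
    basis: (d, 2*k*k) matrix with orthonormal rows (the PCA basis).
    Returns a length d+2 vector: d PCA coefficients, the unnormalized
    gradient energy, and the PCA reconstruction error.
    """
    dy, dx = np.gradient(patch.astype(float))       # image gradients
    g = np.concatenate([dx.ravel(), dy.ravel()])    # length 2k^2: x then y
    energy = float(np.sum(g ** 2))                  # unnormalized energy
    norm = np.linalg.norm(g)
    if norm > 0:
        g = g / norm                                # remove intensity differences
    coeffs = basis @ g                              # project into PCA basis
    recon_err = float(np.linalg.norm(g - basis.T @ coeffs))
    return np.concatenate([coeffs, [energy], [recon_err]])

# Stand-in basis: d orthonormal directions in the 2k^2-dim gradient space.
rng = np.random.default_rng(0)
k, d = 11, 15
q, _ = np.linalg.qr(rng.standard_normal((2 * k * k, d)))
desc = gradient_patch_descriptor(rng.standard_normal((k, k)), q.T)
print(desc.shape)  # (17,)
```

The two appended scalars matter: the energy recovers the contrast discarded by normalization, and the reconstruction error measures how well the patch is explained by the basis at all.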
Curve features are extracted in the same manner as [9]: a Canny edge detector is run over the image; the edgels are grouped into chains; each chain is then broken at its bitangent points to give a curve. Since a chain may have multiple bitangent points, each chain may result in multiple curves (which may overlap in portions). Curves which are very straight tend to be uninformative and are discarded.

The curves are then represented in the same way as the regions. Each curve’s location is taken as its centroid, with the scale being its length. The region around the curve is then cropped from the image and processed in the manner described above. We use the curve as a feature detector, modeling the textured region around the curve, rather

³ The fixed basis was computed from patches extracted using all Kadir and Brady regions found on all the training images of Motorbikes; Faces; Airplanes; Cars (Rear); Leopards and Caltech background.
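A sketch of the geometric side of the curve representation: centroid as location, arc length as scale. The paper does not specify its straightness test, so the chord-to-arc-length ratio below is an assumed stand-in for discarding very straight curves.

```python
import numpy as np

def curve_location_and_scale(points: np.ndarray):
    """points: (n, 2) ordered edgel chain. Location = centroid, scale = arc length."""
    seg = np.diff(points, axis=0)
    length = float(np.sum(np.hypot(seg[:, 0], seg[:, 1])))
    return points.mean(axis=0), length

def is_too_straight(points: np.ndarray, thresh: float = 0.98) -> bool:
    """
    Assumed straightness criterion (not from the paper): a chord-to-arc-length
    ratio near 1 means the curve is nearly a straight segment.
    """
    _, arc = curve_location_and_scale(points)
    chord = float(np.linalg.norm(points[-1] - points[0]))
    return arc == 0 or chord / arc > thresh

# A semicircle is kept; a straight segment is discarded.
t = np.linspace(0, np.pi, 50)
semicircle = np.stack([np.cos(t), np.sin(t)], axis=1)
line = np.stack([t, t], axis=1)
print(is_too_straight(semicircle), is_too_straight(line))  # False True
```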

References

- C. Harris and M. Stephens. A combined corner and edge detector. Proc. Alvey Vision Conference, 1988.
- G. Csurka, C. R. Dance, L. Fan, J. Willamowski and C. Bray. Visual categorization with bags of keypoints. ECCV Workshop on Statistical Learning in Computer Vision, 2004.
- Y. Ke and R. Sukthankar. PCA-SIFT: a more distinctive representation for local image descriptors. Proc. CVPR, 2004.
- P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 2005.
- R. Fergus, P. Perona and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. Proc. CVPR, 2003.