
A Sparse Object Category Model for Efficient Learning and Complete Recognition

Rob Fergus¹, Pietro Perona², and Andrew Zisserman¹

¹ Dept. of Engineering Science, University of Oxford, Parks Road, Oxford OX1 3PJ, U.K.
{fergus,az}@robots.ox.ac.uk

² Dept. of Electrical Engineering, California Institute of Technology, MC 136–93, Pasadena, CA 91125, U.S.A.
perona@vision.caltech.edu
Abstract
We present a “parts and structure” model for object category recognition that can be learnt efficiently and in a weakly-supervised manner: the model is learnt from example images containing category instances, without requiring segmentation from background clutter.

The model is a sparse representation of the object, and consists of a star topology configuration of parts modeling the output of a variety of feature detectors. The optimal choice of feature types (whose repertoire includes interest points, curves and regions) is made automatically.

In recognition, the model may be applied efficiently in a complete manner, bypassing the need for feature detectors, to give the globally optimal match within a query image. The approach is demonstrated on a wide variety of categories, and delivers both successful classification and localization of the object within the image.
1 Introduction
A variety of models and methods exist for representing, learning and recognizing object categories in images. Many of these are variations on the “Parts and Structure” model introduced by Fischler and Elschlager [10], though the modern instantiations use scale-invariant image fragments [1–3, 12, 15, 20, 21]. The constellation model [3, 8, 21] was the first to convincingly demonstrate that models could be learnt from weakly-supervised, unsegmented training images (i.e. the only supervision information was that the image contained an instance of the object category, but not the location of the instance in the image). Various types of categories could be modeled, including those specified by tight spatial configurations (such as cars) and those specified by tight appearance exemplars (such as spotted cats). The model was translation and scale invariant both in learning and in recognition.
However, the constellation model of [8] has some serious shortcomings, namely: (i) The joint nature of the shape model results in an exponential explosion in computational cost, limiting the number of parts and regions per image that can be handled. For N feature detections and P model parts, the complexity of both learning and recognition is O(N^P); (ii) Since only 20–30 regions per image and 6 parts are permitted by this complexity, the model can only learn from an extremely sparse representation of the image. Good performance is therefore highly dependent on the consistent firing of the feature detector; (iii) Only one type of feature detector (a region operator) was used, making the model very sensitive to the nature of the class. If the distinctive features of the category happen, say, to be edge-based, then relying on a region-based detector is likely to give poor results (though this limitation was overcome in later work [9]); (iv) The model has many parameters, resulting in over-fitting unless a large number of training images (typically 200+) are used.
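To make shortcoming (i) concrete, the arithmetic below contrasts the joint model's O(N^P) hypothesis count with the O(N^2 P) count of the star model introduced later in the paper. The function names are illustrative only; the figures for N and P follow the text (20–30 regions and 6 parts for the full model).

```python
def full_model_cost(n_features: int, n_parts: int) -> int:
    """Assignment hypotheses for a fully-connected (joint) shape model: N^P."""
    return n_features ** n_parts

def star_model_cost(n_features: int, n_parts: int) -> int:
    """Assignment hypotheses when each part depends only on a landmark: N^2 * P."""
    return n_features ** 2 * n_parts

# The full model is already expensive at the limits quoted in the text...
print(full_model_cost(20, 6))    # 64000000
print(star_model_cost(20, 6))    # 2400
# ...while the star model stays tractable with far more features and parts.
print(star_model_cost(200, 12))  # 480000
```

This is why the star model can afford to increase both the number of parts and the number of detections per image.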
Other models and methods have since been developed which have achieved superior performance to the constellation model on at least a subset of the object categories modeled in [8]. These models range from bag-of-words models (where the words are vector-quantized invariant descriptors) with no spatial organization [5, 18], through to fragment-based models [2, 15] with particular spatial configurations. The methods utilize a range of machine learning approaches: EM, SVMs and AdaBoost.
In this paper we propose a heterogeneous star model (HSM) which maintains the simple training requirements of the constellation model and, like the constellation model, gives a localization for the recognized object. The model is translation and scale invariant both in learning and in recognition. There are three main areas of innovation: (i) both in learning and in recognition it has a lower complexity than the constellation model, enabling both the number of parts and the number of detected features to be increased substantially; (ii) it is heterogeneous and is able to make the optimal selection of feature types (here from a pool of three, including curves). This enables it to better model objects with significant intra-class variation in appearance but less variation in outline (for example a guitar), or vice versa; (iii) the recognition stage can use feature detectors or can be complete in the manner of Felzenszwalb and Huttenlocher [6]. In the latter case there is no actual detection stage; rather, the model itself defines the areas of most relevance using a matched filter. This complete search overcomes many false negatives due to feature drop-out, and also poor localizations due to small feature displacement and scale errors.
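At its simplest, the matched filter mentioned above is a cross-correlation of a part template with the image, evaluated densely at every offset rather than only at detected features. The sketch below is an unnormalized, naive illustration (the paper's filter is derived from the learnt part appearance model, not a raw template):

```python
import numpy as np

def matched_filter_response(image: np.ndarray, template: np.ndarray) -> np.ndarray:
    """
    Dense response map from sliding a part template over the image:
    plain cross-correlation at every valid offset. Illustrates how a model
    can propose its own candidate part locations instead of relying on the
    sparse output of a feature detector.
    """
    th, tw = template.shape
    H, W = image.shape
    out = np.empty((H - th + 1, W - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + th, x:x + tw] * template)
    return out

# The response peaks where the image patch best matches the template.
img = np.zeros((5, 5)); img[2:4, 2:4] = 1.0
resp = matched_filter_response(img, np.ones((2, 2)))
print(np.unravel_index(np.argmax(resp), resp.shape))  # (2, 2)
```

In practice a dense search like this is what removes the false negatives caused by detector drop-out: every location is scored, so no candidate is missed.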
2 Approach
We describe here the structure of the heterogeneous star model, how it is learnt from
training data, and how it is applied to test data for recognition.

2.1 Star model
As in the constellation model of [8], our model has P parts and parameters θ. From each image i, we extract N features with locations X_i, scales S_i and descriptors D_i. In learning, the aim is to find the value of θ that maximizes the log-likelihood over all images:

    \sum_i \log p(X_i, D_i, S_i \mid \theta)    (1)

Since N >> P, we introduce an assignment variable, h, to assign features to parts in the model. The log-likelihood is obtained by marginalizing over h:

    \sum_i \log \sum_h p(X_i, D_i, S_i, h \mid \theta)    (2)
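The marginalization over h in Eqn. 2 is a sum of joint densities, which is normally carried out in log-space for numerical stability. A toy sketch (the log-joint values are hypothetical, not from a real model) using the standard log-sum-exp trick:

```python
import numpy as np

def log_marginal(log_joint_per_hypothesis: np.ndarray) -> float:
    """log sum_h p(X, D, S, h | theta), given log p(...) for each hypothesis h."""
    m = log_joint_per_hypothesis.max()              # shift for stability
    return float(m + np.log(np.sum(np.exp(log_joint_per_hypothesis - m))))

# Toy example: three hypotheses, each with joint probability 0.01,
# should marginalize to log(0.03).
log_joints = np.log(np.array([0.01, 0.01, 0.01]))
print(np.isclose(log_marginal(log_joints), np.log(0.03)))  # True
```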
In the constellation model, the joint density is factored as:

    p(X_i, D_i, S_i, h \mid \theta) =
      \underbrace{p(D_i \mid h, \theta)}_{\text{Appearance}} \,
      \underbrace{p(X_i \mid S_i, h, \theta)}_{\text{Rel. Locations}} \,
      \underbrace{p(S_i \mid h, \theta)}_{\text{Rel. Scale}} \,
      \underbrace{p(h \mid \theta)}_{\text{Occlusion}}    (3)
In [8], the appearance model for each part is assumed independent, but the relative location of the model parts is represented by a joint Gaussian density. While this provides the most thorough description, it makes the location of all parts dependent on one another. Consequently, the EM-based learning scheme, which entails marginalizing over p(h | X_i, D_i, S_i, θ), becomes an O(N^P) operation. We propose here a simplified configuration model in which the location of each model part is conditioned on the location of a landmark part. Under this model the non-landmark parts are independent of one another given the landmark. In graphical model terms, this is a tree of depth one, with the landmark part being the root node. We call this the “star” model. A similar model, where the reference frame acts as a landmark, is used by Lowe [16] and was studied in a probabilistic framework by Moreels et al. [17]. Figure 1 illustrates the differences between the full and star models.

Fig. 1. (a) Fully-connected six-part shape model. Each node is a model part while the edges represent the dependencies between parts. (b) A six-part “star” model. The former has complexity O(N^P) while the latter has complexity O(N^2 P), which may be further improved in recognition to O(NP) by the use of distance transforms [6].

In the star model the joint probability of the configuration aspect of the model may be factored as:
    p(X \mid S, h, \theta) = p(x_L \mid h_L) \prod_{j \neq L} p(x_j \mid x_L, s_L, h_j, \theta_j)    (4)
where x_j is the position of part j and L is the landmark part. We adopt a Gaussian model for p(x_j | x_L, s_L, h_j, θ_j) which depends only on the relative position and scale between each part and the landmark. The reduced dependencies of this model mean that the marginalization in Eqn. 2 is O(N^2 P), in theory allowing us to cope with a larger N and P in learning and recognition.

In practical terms, we achieve translation invariance by subtracting the location of the landmark part from the non-landmark ones. Scale invariance is achieved by dividing the locations of the non-landmark parts by the locally measured scale of the landmark part.
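To make the O(N^2 P) claim concrete, the sketch below implements the search implied by Eqn. 4: each of the N candidate landmarks normalizes the detections for translation and scale, and each non-landmark part then independently picks its best of the N detections. All names, and the isotropic-Gaussian and uniform-landmark-prior choices, are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log-density of an isotropic 2-D Gaussian with covariance var * I."""
    return -np.sum((x - mean) ** 2) / (2 * var) - np.log(2 * np.pi * var)

def best_star_match(locations, scales, part_means, part_var=1.0):
    """
    O(N^2 * P) search for the best star-model configuration.
    locations: (N, 2) detected feature positions; scales: (N,) feature scales;
    part_means: (P-1, 2) mean positions of the non-landmark parts in the
    landmark-normalized frame. The landmark prior p(x_L | h_L) is taken as
    uniform and dropped. Returns the best total relative-location log-score.
    """
    best = -np.inf
    for L in range(len(locations)):           # N candidate landmarks
        # Translation/scale invariance as in the text: subtract the landmark's
        # location and divide by its locally measured scale.
        rel = (locations - locations[L]) / scales[L]
        score = sum(
            max(gaussian_logpdf(rel[j], mean, part_var)
                for j in range(len(locations)) if j != L)
            for mean in part_means)           # each part: best of N detections
        best = max(best, score)
    return best

# Detections that exactly realize the model score (P-1) times the Gaussian's
# peak log-density, here -2 * log(2*pi) for two non-landmark parts.
locations = np.array([[0., 0.], [1., 0.], [0., 1.]])
print(np.isclose(best_star_match(locations, np.ones(3),
                                 np.array([[1., 0.], [0., 1.]])),
                 -2 * np.log(2 * np.pi)))  # True
```

Note the key structural point: given the landmark, each part's maximization is independent, which is exactly what the full joint Gaussian of [8] forbids.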
It is useful to examine what has been lost in the star compared to the constellation
model of [8]. In the star model any of the leaf (i.e. non-landmark) parts can be occluded,
but (as discussed below) we impose the condition that the landmark part must always
be present. With small N this can lead to a model with artificially high variance, but
as N increases this ceases to be a problem (since the landmark is increasingly likely to
actually be detected). In the constellation model any or several parts can be occluded.
This is a powerful feature: not only does it make the model robust to the inadequacies
of the region detector but it also assists the convergence properties of the model by
enabling a subset of the parts to be fitted rather than all simultaneously.
The star model does have other benefits, though: it has fewer parameters, so it can be trained on fewer images without over-fitting.
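The caption of Fig. 1 notes that recognition can be reduced from O(N^2 P) to O(NP) via the distance transforms of [6]. The sketch below is a direct, quadratic-time transcription of the definition of the generalized squared-distance transform, to show what is being computed; the contribution of [6] is evaluating the same lower envelope in linear time.

```python
import numpy as np

def distance_transform_1d(f: np.ndarray) -> np.ndarray:
    """
    Generalized squared-distance transform on a 1-D grid:
        D(p) = min_q ((p - q)^2 + f(q)).
    Here f(q) would encode the (negated) appearance score of a part at grid
    position q; D spreads each score quadratically, so the best part placement
    for every landmark position can be read off without an explicit search.
    O(n^2) by construction; [6] computes it in O(n).
    """
    n = len(f)
    q = np.arange(n)
    return np.array([float(np.min((p - q) ** 2 + f)) for p in range(n)])

# A single strong response at position 0 decays quadratically with distance.
print(distance_transform_1d(np.array([0.0, 100.0, 100.0, 100.0])))
# [0. 1. 4. 9.]
```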
2.2 Heterogeneous features
By constraining the model to operate, in both learning and recognition, on the sparse outputs of a feature detector, good performance becomes highly dependent on the detector finding parts of the object that are characteristic and distinctive of the class. The majority of feature-based approaches rely on region detectors such as Kadir and Brady or multi-scale Harris [11, 13], which favour interest points or circular regions. However, for certain classes such as bottles or mugs, the outline of the object is more informative than the textured regions of the interior. Curves have been used to a limited extent in previous models for object categories: for example, both Fergus et al. [9] and Jurie & Schmid [12] introduce curves as a feature type. However, in both cases the model was constrained to be homogeneous, i.e. consisting only of curves. Here the models can utilize a combination of different feature detectors, the optimal selection being made automatically. This makes the scheme far more tolerant of the type of category to be learnt.
In our scheme, we have a choice of three feature types: Kadir & Brady, multi-scale Harris, and curves. Figure 2 shows examples of these three operators on two sample airplane images. The detectors were chosen since they are somewhat complementary in their properties: Kadir & Brady favours circular regions; multi-scale Harris prefers interest points; and curves locate the outline of the object.

Fig. 2. Output of three different feature detectors on two airplane images. (a) Curves. (b) Kadir
& Brady. (c) Multi-scale Harris.
To be able to learn different combinations of features we use the same representation for all types. Inspired by the performance of PCA-SIFT in region matching [14], we utilize a gradient-based PCA approach, in contrast to the intensity-based PCA approach of [8]. Both the region operators give a location and scale for each feature. Each feature is cropped from the image (using a square mask), rescaled to a k × k patch, has its gradient computed, and is then normalized to remove intensity differences. Note that we do not perform any orientation normalization as in [14]. The outcome is a vector of length 2k^2, with the first k^2 elements representing the x derivatives and the second k^2 the y derivatives. The derivatives are computed by symmetric finite differences (cropping to avoid edge effects).

The normalized gradient-patch is then projected into a fixed PCA basis³ of d dimensions. Two additional measurements are made for each gradient-patch: its unnormalized energy, and the reconstruction error between the point in the PCA basis and the original gradient-patch. Each region is thus represented by a vector of length d + 2.
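A sketch of the d + 2 descriptor pipeline just described. The PCA basis below is a random orthonormal stand-in for the fixed basis of footnote 3, k and d are illustrative values (not the paper's), and np.gradient stands in for symmetric finite differences with edge cropping.

```python
import numpy as np

def gradient_patch_descriptor(patch: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """
    patch: k x k intensity patch (already cropped and rescaled);
    basis: (d, 2*k*k) matrix with orthonormal rows (the PCA basis).
    Returns a length d+2 vector: d PCA coefficients, the unnormalized
    gradient energy, and the PCA reconstruction error.
    """
    dy, dx = np.gradient(patch.astype(float))       # image gradients
    g = np.concatenate([dx.ravel(), dy.ravel()])    # length 2k^2: x then y
    energy = float(np.sum(g ** 2))                  # unnormalized energy
    norm = np.linalg.norm(g)
    if norm > 0:
        g = g / norm                                # remove intensity differences
    coeffs = basis @ g                              # project into PCA basis
    recon_err = float(np.linalg.norm(g - basis.T @ coeffs))
    return np.concatenate([coeffs, [energy], [recon_err]])

# Stand-in basis: d orthonormal directions in the 2k^2-dim gradient space.
rng = np.random.default_rng(0)
k, d = 11, 15
q, _ = np.linalg.qr(rng.standard_normal((2 * k * k, d)))
desc = gradient_patch_descriptor(rng.standard_normal((k, k)), q.T)
print(desc.shape)  # (17,)
```

The two appended scalars matter: the energy recovers the contrast discarded by normalization, and the reconstruction error measures how well the patch is explained by the basis at all.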
Curve features are extracted in the same manner as [9]: a Canny edge detector is run over the image; the edgels are grouped into chains; each chain is then broken at its bitangent points to give a curve. Since a chain may have multiple bitangent points, each chain may result in multiple curves (which may overlap in portions). Curves which are very straight tend to be uninformative and are discarded.

The curves are then represented in the same way as the regions. Each curve’s location is taken as its centroid, with the scale being its length. The region around the curve is then cropped from the image and processed in the manner described above. We use the curve as a feature detector, modeling the textured region around the curve, rather

³ The fixed basis was computed from patches extracted using all Kadir and Brady regions found on all the training images of Motorbikes; Faces; Airplanes; Cars (Rear); Leopards and Caltech background.
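A sketch of the geometric side of the curve representation: centroid as location, arc length as scale. The paper does not specify its straightness test, so the chord-to-arc-length ratio below is an assumed stand-in for discarding very straight curves.

```python
import numpy as np

def curve_location_and_scale(points: np.ndarray):
    """points: (n, 2) ordered edgel chain. Location = centroid, scale = arc length."""
    seg = np.diff(points, axis=0)
    length = float(np.sum(np.hypot(seg[:, 0], seg[:, 1])))
    return points.mean(axis=0), length

def is_too_straight(points: np.ndarray, thresh: float = 0.98) -> bool:
    """
    Assumed straightness criterion (not from the paper): a chord-to-arc-length
    ratio near 1 means the curve is nearly a straight segment.
    """
    _, arc = curve_location_and_scale(points)
    chord = float(np.linalg.norm(points[-1] - points[0]))
    return arc == 0 or chord / arc > thresh

# A semicircle is kept; a straight segment is discarded.
t = np.linspace(0, np.pi, 50)
semicircle = np.stack([np.cos(t), np.sin(t)], axis=1)
line = np.stack([t, t], axis=1)
print(is_too_straight(semicircle), is_too_straight(line))  # False True
```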

References

- C. Harris and M. Stephens. A combined corner and edge detector. Proc. Alvey Vision Conference, 1988.
- G. Csurka, C. R. Dance, L. Fan, J. Willamowski and C. Bray. Visual categorization with bags of keypoints. ECCV Workshop on Statistical Learning in Computer Vision, 2004.
- Y. Ke and R. Sukthankar. PCA-SIFT: a more distinctive representation for local image descriptors. Proc. CVPR, 2004.
- P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 2005.
- R. Fergus, P. Perona and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. Proc. CVPR, 2003.