A Comparison of Features in Parts-Based
Object Recognition Hierarchies
Stephan Hasler, Heiko Wersing, and Edgar Körner
Honda Research Institute Europe GmbH
D-63073 Offenbach/Germany
stephan.hasler@honda-ri.de
Abstract. Parts-based recognition has been suggested for generalizing
from few training views in categorization scenarios. In this paper we
present the results of a comparative investigation of different feature
types with regard to their suitability for category discrimination.
Patches of gray-scale images were compared with SIFT descriptors and
patches from the high-level output of a feedforward hierarchy related to
the ventral visual pathway. We discuss the conceptual differences, the
resulting performance, and the consequences for hierarchical models of
visual recognition.
1 Introduction
The human brain employs different kinds of interrelated representations and
processes to recognize objects, depending on the familiarity of the object and
the required level of recognition, which is defined by the current task. There is
evidence that for identifying highly familiar objects, like faces, holistic templates
are used that emphasize the spatial layout of the object’s parts but neglect
details of the parts themselves. This holistic prototypical representation requires
a lot of experience and coding capacity and therefore cannot be used for all
the objects encountered in everyday life. A more compact representation can be
obtained by handling objects as combinations of shared parts. There is considerable
biological motivation for such a representation. The experiments of Tanaka [1] revealed
that there are high-level areas in the primate ventral visual pathway that predict
the presence of a large set of features of intermediate complexity, generalizing
over small variations and being invariant to retinotopic position and scale.
The combinatorial use of those features was shown by Tsunoda [2]. He observed
that complex objects simultaneously activate different spots in those areas and
that this activation is caused by the constituent parts. A parts-based
representation is especially efficient for storing and categorizing novel objects, because
the largest variance in unseen views of an object can be expected in the position
and arrangement of parts, while each part of an object will be visible under a
large variety of 3D object transformations.
In the computer vision literature there is a similar distinction between holistic and
parts-based approaches, depending on how feature responses are aggregated over
the image. Parts can be local features of any kind. A response of a part detector
at several positions in an image means that the part might be present several
times, not that the probability is higher that the part is present at all. Therefore
each peak in the multimodal response map is handled as a possible instance
of the part. In contrast to this, holistic approaches contain a layer that simply
accumulates the real-valued responses of single features of the previous layer
over the whole image. This is only comparable to the biological definition if the
configurational information is kept.
J. Marques de Sá et al. (Eds.): ICANN 2007, Part II, LNCS 4669, pp. 210–219, 2007.
© Springer-Verlag Berlin Heidelberg 2007
Approaches with strong biological motivation are presented in [3,4]. Here
hierarchies of feature layers are used, as in the ventral visual pathway, to
combine specificity and invariance of features. There are cells that are either
sensitive to a specific pattern of activation in lower layers, in this way
increasing the feature’s complexity, or that pool the responses of similar features,
thereby generalizing over small variations. The output layer of the feedforward hierarchy
proposed in [3] contains several topographically organized feature maps which
are used directly by the final classifier. Following the above definition this is a
holistic approach. The similar hierarchy of [4] employs in its highest feature
layer a spatial max-pooling over each feature map of the previous layer, which
makes it a parts-based approach. In [4], multimodal response characteristics and
the position of the parts are neglected.
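The difference between accumulating responses over the image and spatial
max-pooling can be illustrated with a minimal sketch. The array layout
(features x height x width), the function names and the toy values are
assumptions made for illustration only and are not taken from [3] or [4].

```python
import numpy as np

def holistic_sum(response_maps):
    """Accumulate the real-valued responses of each feature over the whole image."""
    return response_maps.sum(axis=(1, 2))

def spatial_max_pool(response_maps):
    """Keep only the strongest response of each feature map; position is discarded."""
    return response_maps.max(axis=(1, 2))

# Toy data (hypothetical values): 2 feature maps of size 3x3.
maps = np.zeros((2, 3, 3))
maps[0, 0, 0] = 0.9   # feature 0: a single strong peak
maps[1] = 0.2         # feature 1: a weak response everywhere

summed = holistic_sum(maps)      # many weak responses add up for feature 1
pooled = spatial_max_pool(maps)  # the single strong peak wins for feature 0
```

In this toy case the summed activation favors the feature with many weak
responses, while max-pooling favors the feature with one strong peak.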
Most other approaches work more directly on the images. Typical holistic
approaches apply histograms: in [5] the responses to local features are
simply summed, and in [6] it is counted how often a response lies in a certain
range. In other holistic methods the receptive fields of the features cover the
whole image. For example, in [7] features obtained by principal component analysis
(PCA) on gray-scale images were used to classify faces. These features, so-called
eigenfaces, show a very global activation and do not reflect parts of a face. In
contrast to PCA, other methods produce so-called parts-based features, like the
nonnegative matrix factorization (NMF) proposed in [8] or a similar scheme
proposed in [9] that yields more class-specific features. Although during training the
receptive field of each feature covers the whole image, it learns to reconstruct
a certain localized region that contains the same part in many training views
(e.g. parts of normalized frontal views of faces). But usually those features are
used in a holistic manner, meaning that they are extracted at a single position
in the test image and in this way are only sensitive to the rigid constellation of
parts that was present during training. This limits the ability to generalize
over geometric transformations, which is a particular drawback when using
few training examples in an unnormalized setting. Holistic approaches also
perform poorly in the presence of clutter and occlusion and often require extensive
preprocessing such as localization and segmentation.
Other parts-based recognition approaches also use the maximum activation
of each feature, like the highest layer in [4]. In [10] the features are fragments of
gray-scale images. The response of a feature is binary and obtained by
thresholding the maximum activation in the image. The approach selects features
based on the maximization of mutual information for a single class. This yields
fragments of intermediate complexity. An image is classified by comparing
binary activation vectors to stored representatives in a nearest-neighbor fashion.
Other approaches make use of the position and treat each peak in the response
map as a possible part instance. In the scale invariant feature transform (SIFT)
approach in [11] gradient histograms are extracted for small patches around
interest points (see Fig. 1c). Each such patch descriptor is compared against a
large repertoire of stored descriptors, where the best match votes for the presence
of an object at a certain position, scale and rotation. The votes are combined
using a Generalized Hough Transform and the maximally activated hypothesis
is chosen. A similar scheme is proposed in [12]. Here image patches are used as
features and the algorithm is capable of producing a segmentation mask for the
object hypothesis that can be used for a further refinement process. In the
bag-of-keypoints approaches, e.g. [13], it is counted how often parts are detected in
an image. In contrast to holistic histogram-based approaches the presence of a
part is the result of a strong local competition of parts. Therefore it is more a
counting of symbol-type information than a summation over real-valued
signal-type responses. Parts-based recognition can be used to localize and recognize
objects at the same time and works well in the presence of clutter and occlusion.
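The counting of symbol-type information in bag-of-keypoints approaches can
be sketched as follows. This is only an illustration under simplifying
assumptions: the local competition is modeled as a nearest-codeword assignment
with squared Euclidean distance, and the function name, codebook and
descriptors are hypothetical rather than taken from [13].

```python
import numpy as np

def bag_of_keypoints(descriptors, codebook):
    """Count how often each codebook part wins the competition for a descriptor.

    descriptors: array of shape (n_descriptors, dim)
    codebook:    array of shape (n_parts, dim)
    """
    # Squared Euclidean distance between every descriptor and every part.
    diff = descriptors[:, None, :] - codebook[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    # Winner-take-all: each descriptor is assigned to exactly one part.
    winners = dist2.argmin(axis=1)
    return np.bincount(winners, minlength=len(codebook))

# Toy example with two parts and three local descriptors (made-up values).
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
descriptors = np.array([[0.1, 0.0], [0.9, 1.2], [1.0, 0.8]])
counts = bag_of_keypoints(descriptors, codebook)
```

Each descriptor contributes exactly one count to its winning part, regardless
of how strong the match is, which is what distinguishes this scheme from a
summation over real-valued responses.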
In Sect. 2 we first comment on the task we want to solve and the nature of
the features required for this. Then we describe the investigated feature types
and our feature selection strategy. We give results for a categorization problem
in Sect. 3 and present our conclusions in Sect. 4.
2 Analytic Features
To generalize from few training examples, parts-based recognition follows the
notion that similar combinations of parts are specific for a certain category over
a wide range of variations. In this work we investigate how suitable different
feature types are for this purpose and how much effort is needed in terms of the
number of features used. As has been argued in [10], it is beneficial that a single
part can be detected in many views of one category, while being absent in other
categories. So we need a reasonable feature selection strategy that evaluates
which and how many views of a certain category a feature can separate from
other categories and, based on those results, chooses the subset of features that in
combination describes the whole scenario best. For simple categories a single
feature can separate many views and therefore only few features are necessary to
represent the whole category. For categories with more variation more features
have to be selected to cover the whole appearance. This dynamic distribution of
resources is necessary to make best use of the limited number of features.
How well certain local descriptors can be re-detected under different image
transformations, such as scale, rotation and viewpoint changes, was investigated in
[14]. Although this is a desired quality, it does not necessarily say anything about
the usefulness in object recognition tasks. To underline that the desired features
should be meaningful, i.e. offer a compromise between specificity and generality
at low cost, and to avoid confusion with approaches that learn parts-based
features, we will use the term analytic features.
We decided to compare patches of gray-scale images, for their simplicity, SIFT
descriptors, for their known invariance, and patches of the output of the
feedforward hierarchy in [3], because of its biological background.
A SIFT descriptor as proposed in [11] describes a gray-scale patch of 16x16
pixels using a grid of 4x4 gradient histograms (see Fig. 1c). Each histogram in
the grid is made up of eight orientation bins. The magnitude of the gradient at a
certain pixel is distributed in a bilinear fashion over the neighboring histograms
(in general four), where the orientation of the gradient determines the bin. The
gradient magnitudes are scaled with a Gaussian that is centered on the patch, in
this way reducing the influence of border pixels. Prior to the calculation of the
histogram grid a single histogram with a higher number of orientation bins is
computed for the whole patch. The maximally activated bin in this histogram is
used to normalize the rotation of the patch in advance. Finally the energy of the
whole descriptor is normalized to obtain invariance to illumination. In contrast
to [11], we do not extract SIFT descriptors at a small number of interesting
keypoints, but at all locations where at least a minimum of structure is present.
In this way only uniform, dark background is neglected, and on the category
scenario in Fig. 3 on average one third of all descriptors are kept. We reduce the
number of descriptors for each image by applying a k-means algorithm with
200 components. A similar clustering step was also done in [15] to improve the
generalization performance of the otherwise very specific SIFT descriptors.
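The structure of such a descriptor can be made concrete with a strongly
simplified sketch. It is not the method of [11]: each pixel is assigned to a
single cell and a single orientation bin instead of being distributed
bilinearly, the rotation normalization is omitted, and the function name and
the Gaussian width sigma are assumptions chosen for illustration.

```python
import numpy as np

def sift_like_descriptor(patch, sigma=8.0):
    """Simplified 4x4x8 gradient-histogram descriptor of a 16x16 patch.

    Simplifications (assumptions): hard assignment to cells and bins,
    no rotation normalization, hypothetical Gaussian width sigma.
    """
    patch = np.asarray(patch, dtype=float)
    assert patch.shape == (16, 16)

    # Gradient along rows (gy) and columns (gx), magnitude and orientation.
    gy, gx = np.gradient(patch)
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % (2 * np.pi)

    # Gaussian centered on the patch reduces the influence of border pixels.
    coords = np.arange(16) - 7.5
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    magnitude = magnitude * np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))

    # Eight orientation bins; each pixel votes into its 4x4-pixel cell.
    bins = np.floor(orientation / (2 * np.pi) * 8).astype(int) % 8
    rows, cols = np.indices((16, 16))
    descriptor = np.zeros((4, 4, 8))
    np.add.at(descriptor, (rows // 4, cols // 4, bins), magnitude)

    # Normalize the energy of the whole descriptor.
    descriptor = descriptor.ravel()
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor

# Usage: a horizontal intensity ramp has gradients of orientation 0 only.
ramp = np.tile(np.arange(16.0), (16, 1))
d = sift_like_descriptor(ramp)  # 4 * 4 * 8 = 128 values
```

Because of the final normalization, scaling the patch contrast by a constant
factor leaves the descriptor unchanged in this sketch.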
For the gray-scale patches we use the same patch size as for the SIFT
approach, and the influence of the pixels is likewise weighted with a Gaussian
that is centered on the patch.
The feedforward hierarchy proposed in [3] is shown in Fig. 1a. The S1-layer
computes the magnitudes of the responses to four differently oriented Gabor filters.
This activation is pooled to a lower resolution in the C1-layer, which performs a
local OR-operation. The 50 features used in S2 are trained to efficiently reconstruct
a large set of random 4x4x4 C1-patches from natural images and are therefore
sensitive to local patterns in C1. Layer C2 performs a further pooling operation
and is the output of the hierarchy. Columns of 2x2 pixels are cut from the
C2-layer as shown in Fig. 1b and used as feature candidates. Because of the two
pooling layers, which offer a small degree of invariance to translation, a column of
2x2 pixels in C2 corresponds roughly to a patch of 16x16 pixels in the gray-scale
image.
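Cutting columns from the C2-layer can be sketched as below. The sketch assumes
that the C2 output is available as an array of shape (height, width, 50) and
that columns are taken at every position with a stride of one pixel; the map
size of 16x16, the stride and the function name are assumptions, and random
values stand in for real C2 activations.

```python
import numpy as np

def extract_columns(c2, size=2):
    """Cut all size x size columns through the feature maps of a C2 output.

    c2 has shape (height, width, n_features). Each column is flattened into
    a vector of length size * size * n_features.
    """
    height, width, _ = c2.shape
    columns = [
        c2[y:y + size, x:x + size, :].ravel()
        for y in range(height - size + 1)
        for x in range(width - size + 1)
    ]
    return np.stack(columns)

# Hypothetical C2 output: 16x16 positions, 50 feature maps, random values.
c2 = np.random.default_rng(0).random((16, 16, 50))
cols = extract_columns(c2)  # one row per position, 2 * 2 * 50 = 200 values each
```

With these assumed sizes there are 15 x 15 = 225 column positions per image.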
Fig. 1. a) Feedforward hierarchy in [3]: input image I, S1 (4 Gabors), C1 (pooling), S2 (50 combinations), C2 (pooling). b) Columns of the C2-layer (50 planes) are used as local features. c) A SIFT descriptor [11] of a patch of 16x16 pixels is a grid of gradient histograms, each with 8 orientations.

We will refer to the parts-based approaches as GRAY-P, SIFT-P, and C2-P.
For SIFT-P each image $i$ is described by the $J = 4 \times 4 \times 8 = 128$ dimensional
representatives $\mathbf{p}_{in}$, $n = 1 \ldots 200$, of the 200 k-means clusters. For GRAY-P the
$\mathbf{p}_{in}$ are the patches of image $i$ at all distinct positions $n$ ($J = 16 \times 16 = 256$).
Similarly, for C2-P each $\mathbf{p}_{in}$ is a column through the feature maps of image
$i$ at a distinct position $n$ as shown in Fig. 1b ($J = 2 \times 2 \times 50 = 200$). The $\mathbf{p}_{in}$
show a large variety. Therefore we will use all $\mathbf{p}_{in}$ directly as feature candidates
$\mathbf{w}_m$, where $m$ is an index over all combinations of $i$ and $n$, and select a subset
of those candidates with a strategy that is described later. The response $r_{mi}$ of
feature $\mathbf{w}_m$ on the image $i$ is given by:

$$ r_{mi} = \max_n \left( G(\mathbf{w}_m, \mathbf{p}_{in}) \right). \tag{1} $$
For GRAY-P the normalized cross-correlation

$$ G(\mathbf{w}_m, \mathbf{p}_{in}) = \frac{\sum_{j=1}^{J} h_j \,(w_m^j - \bar{w}_m)(p_{in}^j - \bar{p}_{in})}{\sqrt{\sum_{j} h_j \,(w_m^j - \bar{w}_m)^2}\,\sqrt{\sum_{j} h_j \,(p_{in}^j - \bar{p}_{in})^2}} $$

is used, where $\bar{w}_m$ and $\bar{p}_{in}$ are the means of the vectors
$\mathbf{w}_m$ and $\mathbf{p}_{in}$, respectively, and $h_j$ is a weighting which decreases the influence
of border pixels with a Gaussian. For C2-P the negative Euclidean distance
$G(\mathbf{w}_m, \mathbf{p}_{in}) = -\sqrt{\sum_{j=1}^{J} (w_m^j - p_{in}^j)^2}$ shows better performance because of the
sparseness in this layer. The similarity between SIFT descriptors is given by their
dot product $G(\mathbf{w}_m, \mathbf{p}_{in}) = \sum_{j=1}^{J} w_m^j \, p_{in}^j$. The maximum activation per image is
chosen as response and spatial information is neglected.
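The three similarity measures and the response of (1) can be written down as
a short sketch. The function names are hypothetical, the patches are assumed
to be flattened NumPy vectors, and the handling of a zero denominator in the
cross-correlation is an assumption that the text does not specify.

```python
import numpy as np

def ncc_weighted(w, p, h):
    """Normalized cross-correlation with pixel weighting h (GRAY-P)."""
    wc = w - w.mean()
    pc = p - p.mean()
    denom = np.sqrt(np.sum(h * wc ** 2)) * np.sqrt(np.sum(h * pc ** 2))
    # Assumption: a structureless patch (zero denominator) gets similarity 0.
    return float(np.sum(h * wc * pc) / denom) if denom > 0 else 0.0

def neg_euclidean(w, p):
    """Negative Euclidean distance (C2-P)."""
    return -float(np.sqrt(np.sum((w - p) ** 2)))

def dot_product(w, p):
    """Dot product of two descriptors (SIFT-P)."""
    return float(np.dot(w, p))

def response(w, patches, similarity):
    """r_mi = max_n G(w_m, p_in): the best match over all patches of one image."""
    return max(similarity(w, p) for p in patches)

# Usage with made-up vectors: the identical patch gives the maximal response.
w = np.array([1.0, 2.0, 3.0, 4.0])
patches = [w + 1.0, w.copy()]
r = response(w, patches, neg_euclidean)
```

Taking the maximum over all patches discards where in the image the best match
occurred, which is the sense in which spatial information is neglected.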
Reflecting the remarks on feature selection given above, we decided to use
the following strategy: First we determine which views of a certain category
each individual candidate feature $\mathbf{w}_m$ can separate. To this end we compute the
response $r_{mi}$ for every training image with (1). Then the minimal threshold $t_m$
is chosen that guarantees that all images with $r_{mi}$ above or equal to $t_m$ belong
to the same category (see Fig. 2):

$$ t_m = \min \left\{ t \;\middle|\; \forall\, i, j \text{ with } r_{mi} \ge t,\ r_{mj} \ge t : \; l_i = l_j \right\}. \tag{2} $$

Here $l_i$ denotes the category label of image $i$. The images separated by the
threshold are assigned a constant score $s_{mi} = k$ with respect to the feature $\mathbf{w}_m$.
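The threshold selection of (2) can be sketched for a single candidate feature
as follows. Two points are assumptions of this sketch and not statements of
the text: the threshold is searched only among the observed response values,
and images that are not separated receive a score of zero. The function name
and the example values are hypothetical.

```python
import numpy as np

def select_threshold(responses, labels):
    """Smallest observed response t such that all images with r >= t share one label.

    Returns (t, mask), where mask marks the separated images. If even the
    images with the highest response are of mixed category, returns
    (None, all-False mask).
    """
    responses = np.asarray(responses, dtype=float)
    labels = np.asarray(labels)
    best_t = None
    # Lower the threshold step by step until a second category would be included.
    for t in np.unique(responses)[::-1]:
        above = responses >= t
        if np.unique(labels[above]).size > 1:
            break
        best_t = float(t)
    if best_t is None:
        return None, np.zeros(len(responses), dtype=bool)
    return best_t, responses >= best_t

# Made-up responses of one feature on five training images.
responses = [0.9, 0.8, 0.7, 0.6, 0.3]
labels = ["car", "car", "cup", "car", "cup"]
t_m, separated = select_threshold(responses, labels)

k = 1.0                                # constant score, value chosen arbitrarily
scores = np.where(separated, k, 0.0)   # assumption: zero for all other images
```

In this example the feature separates the two car views with the highest
responses; the third car view lies below a cup view and is therefore not
separated by this feature.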

References
Distinctive Image Features from Scale-Invariant Keypoints
Eigenfaces for recognition
Learning the parts of objects by non-negative matrix factorization (D. D. Lee)
A performance evaluation of local descriptors