What are the contributions mentioned in the paper "A comparison of features in parts-based object recognition hierarchies" ?

In this paper the authors present the results of a comparative investigation of different feature types with regard to their suitability for category discrimination. The authors discuss the conceptual differences, resulting performance and consequences for hierarchical models of visual recognition.

What are the future works in "A comparison of features in parts-based object recognition hierarchies" ?

Besides the normalization of rotation for SIFT, it would be interesting to investigate other reasons for the differences in performance in future work. Since both approaches have not been applied to scenarios with multiple categories, the authors hope that their comparative study provides further helpful insight into parts-based 3D object recognition.

(Open Access) A comparison of features in parts-based object recognition hierarchies (2007) | Stephan Hasler

A Comparison of Features in Parts-Based Object Recognition Hierarchies

Stephan Hasler, Heiko Wersing, and Edgar Körner

Honda Research Institute Europe GmbH

D-63073 Offenbach/Germany

stephan.hasler@honda-ri.de

Abstract. Parts-based recognition has been suggested for generalizing from few training views in categorization scenarios. In this paper we present the results of a comparative investigation of different feature types with regard to their suitability for category discrimination. Patches of gray-scale images were compared with SIFT descriptors and patches from the high-level output of a feedforward hierarchy related to the ventral visual pathway. We discuss the conceptual differences, resulting performance and consequences for hierarchical models of visual recognition.

1 Introduction

The human brain employs different kinds of interrelated representations and processes to recognize objects, depending on the familiarity of the object and the required level of recognition, which is defined by the current task. There is evidence that for identifying highly familiar objects, like faces, holistic templates are used that emphasize the spatial layout of the object's parts but neglect details of the parts themselves. This holistic prototypical representation requires a lot of experience and coding capacity and therefore cannot be used for all the objects of everyday life. A more compact representation can be obtained when handling objects as combinations of shared parts. There is ample biological motivation for such a representation. The experiments of Tanaka [1] revealed that there are high-level areas in the primate ventral visual pathway that predict the presence of a large set of features of intermediate complexity, generalizing over small variations and being invariant to retinotopic position and scale. The combinatorial use of those features was shown by Tsunoda [2], who observed that complex objects simultaneously activate different spots in those areas and that this activation is caused by the constituent parts. A parts-based representation is especially efficient for storing and categorizing novel objects, because the largest variance in unseen views of an object can be expected in the position and arrangement of parts, while each part of an object will be visible under a large variety of 3D object transformations.

In the computer vision literature there is a similar distinction between holistic and parts-based approaches, depending on how feature responses are aggregated over the image. Parts can be local features of any kind. The response of a part detector at different positions in an image means that the part might be present several times, but not that the probability is higher that the part is present at all. So each peak in the multimodal response map is handled as a possible instance of the part. In contrast to this, holistic approaches contain a layer that simply accumulates the real-valued response of single features of the previous layer over the whole image. This is only comparable to the biological definition if the configurational information is kept.

J. Marques de Sá et al. (Eds.): ICANN 2007, Part II, LNCS 4669, pp. 210–219, 2007.

© Springer-Verlag Berlin Heidelberg 2007

Approaches with strong biological motivation are presented in [3,4]. Here hierarchies of feature layers are used, as in the ventral visual pathway, to combine specificity and invariance of features. There are cells that are either sensitive to a specific pattern of activation in lower layers, in this way increasing the feature's complexity, or that pool the responses of similar features, thus generalizing over small variations. The output layer of the feedforward hierarchy proposed in [3] contains several topographically organized feature maps which are used directly by the final classifier. Following the above definition this is a holistic approach. The similar hierarchy of [4] employs in the highest feature layer a spatial max-pooling over each feature map in the previous layer, which makes it a parts-based approach. Multimodal response characteristics and the position of the parts are neglected.
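The distinction between holistic and parts-based aggregation can be made concrete with a small sketch. It is only an illustration under assumptions: the toy response map, the 3x3 peak neighborhood, and the function names `holistic_sum`, `parts_max` and `parts_peaks` are hypothetical and not taken from any of the cited approaches.

```python
import numpy as np

def holistic_sum(response_map):
    """Holistic aggregation: accumulate the real-valued response over the image."""
    return float(response_map.sum())

def parts_max(response_map):
    """Parts-based aggregation by spatial max-pooling (position is discarded)."""
    return float(response_map.max())

def parts_peaks(response_map, threshold):
    """Treat every 3x3 local maximum above `threshold` as a possible part instance."""
    h, w = response_map.shape
    # Pad with -inf so that border pixels can also be local maxima.
    padded = np.pad(response_map, 1, mode="constant", constant_values=-np.inf)
    peaks = []
    for y in range(h):
        for x in range(w):
            value = response_map[y, x]
            if value >= threshold and value == padded[y:y + 3, x:x + 3].max():
                peaks.append((y, x))
    return peaks

# Toy response map of a single part detector with two separate activations.
response_map = np.zeros((5, 5))
response_map[1, 1] = 0.9
response_map[3, 3] = 0.8

print(holistic_sum(response_map))        # both activations add up
print(parts_max(response_map))           # only the strongest activation counts
print(parts_peaks(response_map, 0.5))    # each peak is a candidate part instance
```

In this toy case the holistic sum cannot tell two medium activations from one strong one, whereas the peak list keeps them apart as two possible instances.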

Most other approaches work more directly on the images. Very typical holistic approaches apply histograms: in [5], for example, the responses to local features are simply summed, and in [6] it is counted how often a response lies in a certain range. In other holistic methods the receptive fields of the features cover the whole image. In [7], for example, features obtained by principal component analysis (PCA) on gray-scale images were used to classify faces. These features, so-called eigenfaces, show a very global activation and do not reflect parts of a face. In contrast to PCA, other methods produce so-called parts-based features, like the nonnegative matrix factorization (NMF) proposed in [8] or a similar scheme proposed in [9] yielding more class-specific features. Although during training the receptive field of each feature covers the whole image, it learns to reconstruct a certain localized region that contains the same part in many training views (e.g. parts of normalized frontal views of faces). But usually those features are used in a holistic manner, meaning that they are extracted at a single position in the test image and in this way are only sensitive to the rigid constellation of parts that was present during training. This limits the possibilities to generalize over geometric transformations, which is especially a drawback when using few training examples in an unnormalized setting. Holistic approaches also perform poorly in the presence of clutter and occlusion and often require extensive preprocessing such as localization and segmentation.

Other parts-based recognition approaches also use the maximum activation of each feature, like the highest layer in [4]. In [10] the features are fragments of gray-scale images. The response of a feature is binary and obtained by thresholding the maximum activation in the image. The approach selects features based on the maximization of mutual information for a single class. This yields fragments of intermediate complexity. An image is classified by comparing binary activation vectors to stored representatives in a nearest-neighbor fashion. Other approaches make use of the position and treat each peak in the response map as a possible part instance. In the scale invariant feature transform (SIFT) approach in [11], gradient histograms are extracted for small patches around interesting points (see Fig. 1c). Each such patch descriptor is compared against a large repertoire of stored descriptors, where the best match votes for the presence of an object at a certain position, scale and rotation. The votes are combined using a Generalized Hough Transform and the maximally activated hypothesis is chosen. A similar scheme is proposed in [12]. Here image patches are used as features and the algorithm is capable of producing a segmentation mask for the object hypothesis that can be used for a further refinement process. In the bags of keypoints approaches, e.g. [13], it is counted how often parts are detected in an image. In contrast to holistic histogram-based approaches, the presence of a part is the result of a strong local competition of parts. Therefore it is more a counting of symbol-type information than a summation over real-valued signal-type responses. Parts-based recognition can be used to localize and recognize objects at the same time and works well in the presence of clutter and occlusion.

In Sect. 2 we first comment on the task we want to solve and the nature of the features required for this. Then we describe the investigated feature types and our feature selection strategy. We give results for a categorization problem in Sect. 3 and present our conclusions in Sect. 4.

2 Analytic Features

To generalize from few training examples, parts-based recognition follows the notion that similar combinations of parts are specific for a certain category over a wide range of variations. In this work we investigate how suitable different feature types are for this purpose and how much effort is needed in terms of the number of features used. As has been argued in [10], it is beneficial that a single part can be detected in many views of one category, while being absent in other categories. So we need a reasonable feature selection strategy that evaluates which and how many views of a certain category a feature can separate from other categories and, based on those results, chooses the subset of features that in combination can describe the whole scenario best. For simple categories a single feature can separate many views and therefore only few features are necessary to represent the whole category. For categories with more variation more features have to be selected to cover the whole appearance. This dynamic distribution of resources is necessary to make best use of the limited number of features.

How well certain local descriptors can be re-detected under different image transformations, such as scale, rotation and viewpoint changes, was investigated in [14]. Although this is a desired quality, it does not necessarily say anything about the usefulness in object recognition tasks. To underline that the desired features should be meaningful, i.e. offer a compromise between specificity and generality at low costs, and to avoid confusion with approaches that learn parts-based features, we will use the term analytic features.

We decided to compare patches of gray-scale images, for their simplicity, SIFT descriptors, for their known invariance, and patches of the output of the feedforward hierarchy in [3], because of the biological background.

A SIFT descriptor as proposed in [11] describes a gray-scale patch of 16x16 pixels using a grid of 4x4 gradient histograms (see Fig. 1c). Each histogram in the grid is made up of eight orientation bins. The magnitude of the gradient at a certain pixel is distributed in a bilinear fashion over the neighboring histograms (in general four), where the orientation of the gradient determines the bin. The gradient magnitudes are scaled with a Gaussian that is centered on the patch, in this way reducing the influence of border pixels. Prior to the calculation of the histogram grid a single histogram with a higher number of orientation bins is computed for the whole patch. The maximally activated bin in this histogram is used to normalize the rotation of the patch in advance. Finally the energy of the whole descriptor is normalized to obtain invariance to illumination. In contrast to [11], we do not extract SIFT descriptors at a small number of interesting keypoints, but for all locations where at least a minimum of structure is present. In this way only uniform, dark background is neglected, and on the category scenario in Fig. 3 on average one third of all descriptors is kept. We reduce the number of descriptors for each image by applying a k-means algorithm with 200 components. A similar clustering step was also done in [15] to improve the generalization performance of the otherwise very specific SIFT descriptors.
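The structure of the descriptor (4x4 grid, eight orientation bins, Gaussian weighting, energy normalization) can be sketched in a few lines. This is a deliberately simplified illustration and not the implementation of [11]: it assigns each pixel to exactly one cell and one bin instead of distributing the magnitude bilinearly, it skips the rotation normalization, and the Gaussian width `sigma` as well as the function name `sift_like_descriptor` are assumptions.

```python
import numpy as np

def sift_like_descriptor(patch, n_cells=4, n_bins=8, sigma=8.0):
    """Simplified 4x4x8 gradient-histogram descriptor of a square patch.

    Assumptions of this sketch: hard assignment to cells and bins (no
    bilinear distribution) and no rotation normalization.
    """
    patch = np.asarray(patch, dtype=float)
    size = patch.shape[0]            # 16 for a 16x16 patch
    cell = size // n_cells           # 4 pixels per grid cell

    # Gradient magnitude and orientation in [0, 2*pi) at every pixel.
    gy, gx = np.gradient(patch)
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % (2 * np.pi)

    # A Gaussian centered on the patch reduces the influence of border pixels.
    coords = np.arange(size) - (size - 1) / 2.0
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    magnitude = magnitude * np.exp(-(xx**2 + yy**2) / (2 * sigma**2))

    # The orientation selects the bin; clip guards against rounding to n_bins.
    bins = np.minimum((orientation / (2 * np.pi) * n_bins).astype(int), n_bins - 1)

    descriptor = np.zeros((n_cells, n_cells, n_bins))
    for y in range(size):
        for x in range(size):
            descriptor[y // cell, x // cell, bins[y, x]] += magnitude[y, x]

    # Normalize the energy of the whole descriptor.
    descriptor = descriptor.ravel()
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor

rng = np.random.default_rng(0)
patch = rng.random((16, 16))         # hypothetical gray-scale patch
descriptor = sift_like_descriptor(patch)
print(descriptor.shape)              # 4 * 4 * 8 entries
```

A uniform patch has no gradients and therefore yields the zero vector, which matches the idea of discarding locations without a minimum of structure.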

For the gray-scale patches we decided to use the same patch size as for the SIFT approach, and the influence of the pixels is also weighted with a Gaussian that is centered on the patch.

The feedforward hierarchy proposed in [3] is shown in Fig. 1a. The S1-layer computes the magnitudes of the response to four differently oriented Gabor filters. This activation is pooled to a lower resolution in the C1-layer, which performs a local OR-operation. The 50 features used in S2 are trained to efficiently reconstruct a large set of random 4x4x4 C1-patches from natural images and are therefore sensitive to local patterns in C1. Layer C2 performs a further pooling operation and is the output of the hierarchy. Columns of 2x2 pixels are cut from the C2-layer as shown in Fig. 1b and used as feature candidates. Because of the two pooling layers, which offer a small degree of invariance to translation, a column of 2x2 pixels in C2 corresponds roughly to a patch of 16x16 pixels in the gray-scale image.

Fig. 1. a) Feedforward hierarchy in [3]. b) Columns of C2-layer are used as local features. c) SIFT descriptor [11] is a grid of gradient histograms, each with 8 orientations. [Labels in the figure: Input, 4 Gabors, Pooling, 50 Combinations, Pooling, 16x16 Pixels, 50 Planes]

We will refer to the parts-based approaches as GRAY-P, SIFT-P, and C2-P. For SIFT-P each image $i$ is described by the $J = 4 \times 4 \times 8 = 128$ dimensional representatives of the 200 k-means clusters $\mathbf{p}_i^n$, $n = 1 \ldots 200$. For GRAY-P the $\mathbf{p}_i^n$ are the patches of image $i$ at all distinct positions $n$ ($J = 16 \times 16 = 256$). Similar to this, for C2 each $\mathbf{p}_i^n$ is a column through the feature maps of image $i$ at a distinct position $n$ as shown in Fig. 1b ($J = 2 \times 2 \times 50 = 200$). The $\mathbf{p}_i^n$ show a large variety. Therefore we will use all $\mathbf{p}_i^n$ directly as feature candidates $\mathbf{w}_m$, where $m$ is an index over all combinations of $i$ and $n$, and select a subset of those candidates with a strategy that is described later. The response $r_{mi}$ of feature $\mathbf{w}_m$ on the image $i$ is given by:

$$r_{mi} = \max_n \left( G(\mathbf{w}_m, \mathbf{p}_i^n) \right). \qquad (1)$$
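How the candidate vectors for GRAY-P and C2-P arise, and why their dimensions are 256 and 200, can be sketched as follows. The input sizes (a 32x32 image, 8x8 C2 maps), the stride of one pixel and the function names are assumptions made for this illustration only; the paper does not specify them in this section.

```python
import numpy as np

def gray_patches(image, size=16, stride=1):
    """All size x size patches of a gray-scale image, flattened to J = size * size."""
    h, w = image.shape
    return np.array([
        image[y:y + size, x:x + size].ravel()
        for y in range(0, h - size + 1, stride)
        for x in range(0, w - size + 1, stride)
    ])

def c2_columns(c2_maps, size=2):
    """All size x size columns through feature maps of shape (planes, H, W)."""
    planes, h, w = c2_maps.shape
    return np.array([
        c2_maps[:, y:y + size, x:x + size].ravel()
        for y in range(h - size + 1)
        for x in range(w - size + 1)
    ])

rng = np.random.default_rng(0)
image = rng.random((32, 32))         # hypothetical gray-scale input
c2_maps = rng.random((50, 8, 8))     # hypothetical C2 output with 50 planes

print(gray_patches(image).shape)     # (289, 256): 17 * 17 positions, J = 256
print(c2_columns(c2_maps).shape)     # (49, 200): 7 * 7 positions, J = 200
```

Each row of the returned arrays plays the role of one candidate vector at one position of one image.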

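For GRAY-P

$$G(\mathbf{w}_m, \mathbf{p}_i^n) = \frac{\sum_{j=1}^{J} h_j \, (w_m^j - \bar{w}_m)(p_i^{nj} - \bar{p}_i^n)}{\sqrt{\sum_{j=1}^{J} h_j \, (w_m^j - \bar{w}_m)^2 \; \sum_{j=1}^{J} h_j \, (p_i^{nj} - \bar{p}_i^n)^2}}$$

is used, which is the normalized cross-correlation, where $\bar{w}_m$ and $\bar{p}_i^n$ are the means of vector $\mathbf{w}_m$ and $\mathbf{p}_i^n$ respectively, and $h_j$ is a weighting which decreases the influence of border pixels with a Gaussian. For C2-P the negative Euclidean distance

$$G(\mathbf{w}_m, \mathbf{p}_i^n) = -\sqrt{\sum_{j=1}^{J} (w_m^j - p_i^{nj})^2}$$

shows better performance because of the sparseness in this layer. The similarity between SIFT descriptors is given by their dot product

$$G(\mathbf{w}_m, \mathbf{p}_i^n) = \sum_{j=1}^{J} w_m^j \, p_i^{nj}.$$

The maximum activation per image is chosen as response and spatial information is neglected.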
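The three similarity measures and the max-pooled response of Eq. (1) can be written down compactly. The sketch below is an interpretation under assumptions: the Gaussian weighting is applied inside all three sums of the cross-correlation, the means are unweighted, and the Gaussian width `sigma` and all function names are hypothetical.

```python
import numpy as np

def gaussian_weights(size=16, sigma=4.0):
    """Weights h_j that decrease toward the border of a size x size patch."""
    coords = np.arange(size) - (size - 1) / 2.0
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    return np.exp(-(xx**2 + yy**2) / (2 * sigma**2)).ravel()

def weighted_ncc(w, p, h):
    """GRAY-P: Gaussian-weighted normalized cross-correlation."""
    dw, dp = w - w.mean(), p - p.mean()
    denom = np.sqrt(np.sum(h * dw**2) * np.sum(h * dp**2))
    return float(np.sum(h * dw * dp) / denom) if denom > 0 else 0.0

def neg_euclidean(w, p):
    """C2-P: negative Euclidean distance."""
    return -float(np.linalg.norm(w - p))

def dot_product(w, p):
    """SIFT-P: dot product of two descriptors."""
    return float(np.dot(w, p))

def response(w, parts, similarity):
    """Eq. (1): maximum similarity of feature w over all parts of one image."""
    return max(similarity(w, p) for p in parts)

rng = np.random.default_rng(0)
parts = rng.random((10, 256))        # hypothetical patches of one image
w = parts[3].copy()                  # a candidate that occurs in this image
h = gaussian_weights()

r_gray = response(w, parts, lambda a, b: weighted_ncc(a, b, h))
r_c2 = response(w, parts, neg_euclidean)
print(r_gray, r_c2)                  # both reach their maximum for an exact match
```

Because only the maximum over all positions is kept, the response says whether a part is present somewhere in the image but not where.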

Reflecting the remarks on feature selection given above, we decided to use the following strategy: First we determine which views of a certain category each individual candidate feature $\mathbf{w}_m$ can separate. To this end we compute the response $r_{mi}$ for every training image with (1). Then the minimal threshold $t_m$ is chosen that guarantees that all images with $r_{mi}$ above or equal to $t_m$ belong to the same category (see Fig. 2):

$$t_m = \min \left\{ t \;\middle|\; \forall\, i, j \text{ with } r_{mi} \geq t,\ r_{mj} \geq t : \; l_i = l_j \right\}. \qquad (2)$$

Here $l_i$ denotes the category label of image $i$. The images separated by the threshold are assigned a constant score $s_{mi} = k$ with respect to the feature $\mathbf{w}_m$
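The threshold selection of Eq. (2) can be sketched for a single candidate feature as follows. The sketch restricts the threshold to observed response values and returns `None` when even the strongest response is shared by images of different categories; both choices, the example data and the function name `separation_threshold` are assumptions of this illustration.

```python
import numpy as np

def separation_threshold(responses, labels):
    """Smallest observed response t such that all images with response >= t
    carry the same category label; None if no such value exists."""
    responses = np.asarray(responses, dtype=float)
    labels = np.asarray(labels)
    best = None
    # Lower the threshold step by step until a second category would be included.
    for t in sorted(set(responses.tolist()), reverse=True):
        selected = labels[responses >= t]
        if len(set(selected.tolist())) > 1:
            break
        best = t
    return best

# Hypothetical responses of one feature on five training images.
responses = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = ["car", "car", "cup", "car", "cup"]

t = separation_threshold(responses, labels)
separated = np.asarray(responses) >= t
print(t)                  # 0.8
print(separated)          # the first two images are separated by this feature
```

In this example the feature separates two of the three "car" views; the third one responds more weakly than a "cup" view and would have to be covered by another feature.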


References

Distinctive Image Features from Scale-Invariant Keypoints

Eigenfaces for recognition

Learning the parts of objects by non-negative matrix factorization

Learning parts of objects by non-negative matrix factorization

A performance evaluation of local descriptors
