Book Chapter

A comparison of features in parts-based object recognition hierarchies

Stephan Hasler, Heiko Wersing, Edgar Körner

09 Sep 2007, pp. 210-219

TL;DR: A comparative investigation of different feature types with regard to their suitability for category discrimination, in which patches of gray-scale images were compared with SIFT descriptors and patches from the high-level output of a feedforward hierarchy related to the ventral visual pathway.


Abstract: Parts-based recognition has been suggested for generalizing from few training views in categorization scenarios. In this paper we present the results of a comparative investigation of different feature types with regard to their suitability for category discrimination. So patches of gray-scale images were compared with SIFT descriptors and patches from the high-level output of a feedforward hierarchy related to the ventral visual pathway. We discuss the conceptual differences, resulting performance and consequences for hierarchical models of visual recognition.


Summary (2 min read)


1 Introduction

The human brain employs different kinds of interrelated representations and processes to recognize objects, depending on the familiarity of the object and the required level of recognition, which is defined by the current task.
A parts-based representation is especially efficient for storing and categorizing novel objects, because the largest variance in unseen views of an object can be expected in the position and arrangement of parts, while each part of an object will be visible under a large variety of 3D object transformations.
Here hierarchies of feature layers are used, like in the ventral visual pathway, where they combine specificity and invariance of features.
In other holistic methods the receptive fields of the features cover the whole image.
The approach selects features based on the maximization of mutual information for a single class.

2 Analytic Features

To generalize from few training examples, parts-based recognition follows the notion that similar combinations of parts are specific for a certain category over a wide range of variations.
So the authors need a reasonable feature selection strategy that evaluates which and how many views of a certain category a feature can separate from other categories and, based on those results, choose the subset of features that in combination can describe the whole scenario best.
The maximum activated bin in this histogram is used to normalize the rotation of the patch in advance.
A similar cluster step was also done in [15] to improve the generalization performance of the otherwise very specific SIFT descriptors.
The feedforward hierarchy proposed in [3] is shown in Fig. 1a.

3 Results

The authors tested the performance of the different feature types on the categorization scenario shown in Fig.
For the different tests the authors then varied the number of used features and the number of training views that were used by a single layer perceptron (SLP), as the final classifier.
C2-H is similar to C2-P and GRAY-P, and takes the lead when using a large number of views.
The performance of SIFT-P and GRAY-P on cups(7) is very poor and does not improve with more training views.
This is especially true for categories where the rotation in depth looks like rotation in plane (bottle(2), brush(4), phone(9), tool(10)).

4 Conclusion

The authors evaluated the performance of different types of local feature when used in parts-based recognition.
The biologically motivated feedforward hierarchy in [3] is powerful in holistic recognition with a sufficient number of training examples, but the patches from the output layer are too general and therefore show weak performance in parts-based recognition.
First features are used that extract the magnitudes for 8 different local gradient directions.
This could be beneficial for both feature types.
The most related work in the direction of analytic features was done in [16], where Ullman introduced invariance over viewpoint in his fragments approach, or in the work of Dorko et al. in [15], where highly informative clusters of SIFT descriptors are used.


Figures (6)

Fig. 2. Feature selection scheme. For visualization the images are sorted by their response $r_{mi}$. The threshold $t_m$ separates views of a single category (here ducks) from all other images. To these views a score $s_{mi} = k$ is assigned.

Fig. 1. a) Feedforward hierarchy in [3]. b) Columns of the C2-layer are used as local features. c) The SIFT descriptor [11] is a grid of gradient histograms, each with 8 orientations.

Fig. 5. Error rates depending on number of training views for different approaches

Fig. 6. Error rates of individual categories depending on number of training views

Fig. 4. a) First 75 top ranked features for parts-based approaches. For C2 the corresponding patch of the original gray-scale image and for SIFT the patch the descriptor of which is most similar to the selected k-means component is shown. b) Error rates depending on number of features for parts-based approaches.

Fig. 3. Category scenario. Each category contains nine objects. Five are used for training and four for testing. Only two objects of both groups are shown here.


A Comparison of Features in Parts-Based Object Recognition Hierarchies

Stephan Hasler, Heiko Wersing, and Edgar Körner

Honda Research Institute Europe GmbH
D-63073 Offenbach/Germany
stephan.hasler@honda-ri.de


1 Introduction

The human brain employs different kinds of interrelated representations and processes to recognize objects, depending on the familiarity of the object and the required level of recognition, which is defined by the current task. There is evidence that for identifying highly familiar objects, like faces, holistic templates are used that emphasize the spatial layout of the object's parts but neglect details of the parts themselves. This holistic prototypical representation requires a lot of experience and coding capacity and therefore cannot be used for all the objects in everyday life. A more compact representation can be obtained when handling objects as combinations of shared parts. There is various biological motivation for such a representation. The experiments of Tanaka [1] revealed that there are high-level areas in the primate ventral visual pathway that predict the presence of a large set of features with intermediate complexity, generalizing over small variations and being invariant to retinotopical position and scale. The combinatorial use of those features was shown by Tsunoda [2]. He observed that complex objects simultaneously activate different spots in those areas and that this activation is caused by the constituent parts. A parts-based representation is especially efficient for storing and categorizing novel objects, because the largest variance in unseen views of an object can be expected in the position and arrangement of parts, while each part of an object will be visible under a large variety of 3D object transformations.

In computer vision literature there is a similar distinction into holistic and parts-based approaches, depending on how feature responses are aggregated over the image. Parts can be local features of any kind. The response of a part detector at different positions in an image means that the part might be present several times but not that the probability is higher that the part is present at all. So each peak in the multimodal response map is handled as a possible instance of the part. In contrast to this, holistic approaches contain a layer that simply accumulates the real-valued response of single features of the previous layer over the whole image. This is only comparable to the biological definition if the configurational information is kept.

J. Marques de Sá et al. (Eds.): ICANN 2007, Part II, LNCS 4669, pp. 210–219, 2007. © Springer-Verlag Berlin Heidelberg 2007

Approaches with strong biological motivation are presented in [3,4]. Here hierarchies of feature layers are used, like in the ventral visual pathway, where they combine specificity and invariance of features. So there are cells that are either sensitive to a specific pattern of activation in lower layers, in this way increasing the feature's complexity, or that pool the responses of similar features, so generalizing over small variations. The output layer of the feedforward hierarchy proposed in [3] contains several topographically organized feature maps which are used directly by the final classifier. Following the above definition this is a holistic approach. The similar hierarchy of [4] employs in the highest feature layer a spatial max-pooling over each feature map in the previous layer, which makes it a parts-based approach. Multimodal response characteristics and the position of the parts are neglected.
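The difference between the two aggregation schemes can be illustrated with a small sketch. It is not taken from [3] or [4]; the response map values and the function names `holistic_sum` and `spatial_max` are made up for illustration.

```python
# Hypothetical 3x3 response map of a single feature detector (values are made up).
response_map = [
    [0.1, 0.0, 0.2],
    [0.0, 0.9, 0.1],
    [0.8, 0.0, 0.0],
]


def holistic_sum(rmap):
    """Accumulate the real-valued responses of one feature over the whole image."""
    return sum(sum(row) for row in rmap)


def spatial_max(rmap):
    """Spatial max-pooling: keep only the strongest response of the feature.

    The position of the peak and any further peaks (here the 0.8) are dropped.
    """
    return max(max(row) for row in rmap)


print(holistic_sum(response_map))
print(spatial_max(response_map))
```

In this toy map the second peak contributes to the sum but has no effect on the max-pooled value, which is what neglecting multimodal response characteristics amounts to.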

Most other approaches work more directly on the images. Very typical holistic approaches apply histograms, so e.g. in [5] the responses to local features are simply summed and in [6] it is counted how often a response lies in a certain range. In other holistic methods the receptive fields of the features cover the whole image. So e.g. in [7] features obtained by principal component analysis (PCA) on gray-scale images were used to classify faces. These features, so-called eigenfaces, show a very global activation and do not reflect parts of a face. In contrast to PCA, other methods produce so-called parts-based features, like the nonnegative matrix factorization (NMF) proposed in [8] or a similar scheme proposed in [9] yielding more class-specific features. Although during training the receptive field of each feature covers the whole image, it learns to reconstruct a certain localized region that contains the same part in many training views (e.g. parts of normalized frontal views of faces). But usually those features are used in a holistic manner, meaning that they are extracted at a single position in the test image and in this way are only sensitive to the rigid constellation of parts that was present during training. This limits the possibilities to generalize over geometric transformations, which is especially a drawback when using few training examples in an unnormalized setting. Also the holistic approaches perform poorly in the presence of clutter and occlusion and often require extensive preprocessing such as localization and segmentation.

Other parts-based recognition approaches also use the maximum activation of each feature, like the highest layer in [4]. In [10] the features are fragments of gray-scale images. The response of a feature is binary and obtained by thresholding the maximum activation in the image. The approach selects features based on the maximization of mutual information for a single class. This yields fragments of intermediate complexity. An image is classified by comparing binary activation vectors to stored representatives in a nearest neighbor fashion. Other approaches make use of the position and treat each peak in the response map as a possible part instance. In the scale invariant feature transform (SIFT) approach in [11] gradient-histograms are extracted for small patches around interesting points (see Fig. 1c). Each such patch descriptor is compared against a large repertoire of stored descriptors, where the best match votes for the presence of an object at a certain position, scale and rotation. The votes are combined using a Generalized Hough Transform and the maximally activated hypothesis is chosen. A similar scheme is proposed in [12]. Here image patches are used as features and the algorithm is capable of producing a segmentation mask for the object hypothesis that can be used for a further refinement process. In the bags of keypoints approaches, e.g. [13], it is counted how often parts are detected in an image. In contrast to holistic histogram-based approaches the presence of a part is the result of a strong local competition of parts. Therefore it is more a counting of symbol-type information than a summation over real-valued signal-type responses. Parts-based recognition can be used to localize and recognize objects at the same time and works well in the presence of clutter and occlusion.
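The selection criterion described above for [10] (binary responses obtained by thresholding, ranked by mutual information with a single class) can be sketched as follows. This is only an illustration of the general idea, not the procedure of [10]: the helper names `binarize` and `mutual_information`, the `>=` comparison and the use of bits as unit are assumptions.

```python
import math


def binarize(max_activations, threshold):
    """Binary feature response per image: is the maximum activation above threshold?"""
    return [a >= threshold for a in max_activations]


def mutual_information(feature_on, in_class):
    """Mutual information I(F;C) in bits between a binary feature and a binary class.

    feature_on[i] says whether the feature fired on image i,
    in_class[i] says whether image i belongs to the class of interest.
    """
    n = len(feature_on)
    mi = 0.0
    for f in (False, True):
        for c in (False, True):
            joint = sum(1 for a, b in zip(feature_on, in_class) if a == f and b == c) / n
            p_f = sum(1 for a in feature_on if a == f) / n
            p_c = sum(1 for b in in_class if b == c) / n
            if joint > 0:
                mi += joint * math.log2(joint / (p_f * p_c))
    return mi
```

A feature that fires on exactly the images of the class yields the maximal value for that class prior, while a feature that fires independently of the class yields zero.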

In Sect. 2 we first comment on the task we want to solve and the nature of the features required for this. Then we describe the investigated feature types and our feature selection strategy. We give results for a categorization problem in Sect. 3 and present our conclusions in Sect. 4.

2 Analytic Features

To generalize from few training examples, parts-based recognition follows the notion that similar combinations of parts are specific for a certain category over a wide range of variations. In this work we investigate how suitable different feature types are for this purpose and how much effort is needed in terms of the number of used features. As has been argued in [10], it is beneficial that a single part can be detected in many views of one category, while being absent in other categories. So we need a reasonable feature selection strategy that evaluates which and how many views of a certain category a feature can separate from other categories and, based on those results, chooses the subset of features that in combination can describe the whole scenario best. For simple categories a single feature can separate many views and therefore only few features are necessary to represent the whole category. For categories with more variation more features have to be selected to cover the whole appearance. This dynamic distribution of resources is necessary to make best use of the limited number of features.

How well certain local descriptors can be re-detected under different image transformations, such as scale, rotation and viewpoint changes, was investigated in [14]. Although this is a desired quality, it does not necessarily state something on the usefulness in object recognition tasks. To underline that the desired features should be meaningful, i.e. offer a compromise between specificity and generality at low costs, and to avoid confusion with approaches that learn parts-based features, we will use the term analytic features.

We decided to compare patches of gray-scale images, for their simplicity, SIFT descriptors, for their known invariance, and patches of the output of the feedforward hierarchy in [3], because of the biological background.

A SIFT descriptor as proposed in [11] describes a gray-scale patch of 16x16 pixels using a grid of 4x4 gradient-histograms (see Fig. 1c). Each histogram in the grid is made up of eight orientation bins. The magnitude of the gradient at a certain pixel is distributed in a bilinear fashion over the neighboring histograms (in general four), where the orientation of the gradient determines the bin. The gradient magnitudes are scaled with a Gaussian that is centered on the patch, in this way reducing the influence of border pixels. Prior to the calculation of the histogram grid a single histogram with a higher number of orientation bins is computed for the whole patch. The maximum activated bin in this histogram is used to normalize the rotation of the patch in advance. Finally the energy of the whole descriptor is normalized to obtain invariance to illumination. In contrast to [11], we do not extract SIFT descriptors at a small number of interesting keypoints, but for all locations where at least a minimum of structure is present. In this way only uniform, dark background is neglected and on the category scenario in Fig. 3 on average one third of all descriptors is kept. We reduce the number of descriptors for each image by applying a k-means algorithm with 200 components. A similar cluster step was also done in [15] to improve the generalization performance of the otherwise very specific SIFT descriptors.
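The histogram grid described above can be sketched in a few lines. The following is a deliberately simplified illustration, not the descriptor of [11]: each gradient is assigned to a single histogram and bin (no bilinear distribution), the rotation normalization step is omitted, and the Gaussian width `sigma` as well as the function name `sift_like_descriptor` are assumptions.

```python
import math


def sift_like_descriptor(patch, sigma=8.0):
    """Simplified 4x4x8 gradient-histogram descriptor of a square gray-scale patch.

    patch is a list of rows (16x16 expected). Simplifications compared to the
    text: hard assignment to one cell and one bin, no rotation normalization.
    sigma is an assumed width of the Gaussian weighting.
    """
    size = len(patch)
    cells, bins = 4, 8
    cell = size // cells
    desc = [0.0] * (cells * cells * bins)
    center = (size - 1) / 2.0
    for y in range(1, size - 1):
        for x in range(1, size - 1):
            # Central differences as gradient estimate.
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(dx, dy)
            if mag == 0.0:
                continue
            # Gaussian centered on the patch reduces the influence of border pixels.
            weight = math.exp(-((x - center) ** 2 + (y - center) ** 2) / (2 * sigma ** 2))
            angle = math.atan2(dy, dx) % (2 * math.pi)
            b = int(angle / (2 * math.pi) * bins) % bins
            idx = ((y // cell) * cells + (x // cell)) * bins + b
            desc[idx] += weight * mag
    # Normalize the energy of the whole descriptor.
    norm = math.sqrt(sum(v * v for v in desc))
    return [v / norm for v in desc] if norm > 0 else desc
```

For a 16x16 patch the result has 4 × 4 × 8 = 128 entries; a completely uniform patch produces the all-zero vector, which corresponds to the structureless locations that are discarded in the text.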

For the gray-scale patches we decided to use the same patch size as for the SIFT approach and the influence of the pixels is also weighted with a Gaussian that is centered on the patch.

The feedforward hierarchy proposed in [3] is shown in Fig. 1a. The S1-layer computes the magnitudes of the response to four differently oriented Gabor filters. This activation is pooled to a lower resolution in the C1-layer performing a local OR-operation. The 50 features used in S2 are trained so as to efficiently reconstruct a large set of random 4x4x4 C1-patches from natural images and are therefore sensitive to local patterns in C1. Layer C2 performs a further pooling operation and is the output of the hierarchy. Columns of 2x2 pixels are cut from the C2-layer as shown in Fig. 1b and used as feature candidates. Because of the two pooling layers, which offer a small degree of invariance to translation, a column of 2x2 pixels in C2 corresponds roughly to a patch of 16x16 pixels in the gray-scale image.
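Cutting columns from the C2-layer can be sketched as below. The nested-list layout `feature_maps[plane][row][col]`, the function name `c2_columns` and the step of one C2 pixel between neighboring columns are assumptions made for illustration; the hierarchy of [3] itself is not reproduced here.

```python
def c2_columns(feature_maps, size=2):
    """Cut size x size columns through all planes of a stack of feature maps.

    feature_maps[k][y][x] is the activation of plane k at position (y, x).
    Every column is flattened to a vector with size * size * planes entries.
    Assumption: columns are taken at every position (step of one pixel).
    """
    planes = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    columns = []
    for y in range(height - size + 1):
        for x in range(width - size + 1):
            column = [
                feature_maps[k][y + dy][x + dx]
                for k in range(planes)
                for dy in range(size)
                for dx in range(size)
            ]
            columns.append(column)
    return columns
```

With 50 planes and 2x2 columns each candidate has 2 × 2 × 50 = 200 entries.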

Fig. 1. a) Feedforward hierarchy in [3] (Input, 4 Gabors, Pooling, 50 Combinations, Pooling). b) Columns of C2-layer (50 planes, corresponding to 16x16 pixels) are used as local features. c) SIFT descriptor [11] is grid of gradient histograms each with 8 orientations.

We will refer to the parts-based approaches as GRAY-P, SIFT-P, and C2-P. For SIFT-P each image $i$ is described by the $J = 4 \times 4 \times 8 = 128$ dimensional representatives of the 200 k-means clusters $p_{in}$, $n = 1 \dots 200$. For GRAY-P the $p_{in}$ are the patches of image $i$ at all distinct positions $n$ ($J = 16 \times 16 = 256$). Similar to this for C2 each $p_{in}$ is a column through the feature maps of image $i$ at a distinct position $n$ as shown in Fig. 1b ($J = 2 \times 2 \times 50 = 200$). The $p_{in}$ show a large variety. Therefore we will use all $p_{in}$ directly as feature candidates $w_m$, where $m$ is an index over all combinations of $i$ and $n$, and select a subset of those candidates with a strategy that is described later. The response $r_{mi}$ of feature $w_m$ on the image $i$ is given by:

$$r_{mi} = \max_{n} \left( G(w_m, p_{in}) \right). \tag{1}$$

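For GRAY-P

$$G(w_m, p_{in}) = \frac{\sum_{j=1}^{J} h_j \, (w_{mj} - \bar{w}_m)(p_{inj} - \bar{p}_{in})}{\sqrt{\sum_{j=1}^{J} h_j \, (w_{mj} - \bar{w}_m)^2} \, \sqrt{\sum_{j=1}^{J} h_j \, (p_{inj} - \bar{p}_{in})^2}}$$

is used, which is the normalized cross-correlation, where $\bar{w}_m$ and $\bar{p}_{in}$ are the means of vector $w_m$ and $p_{in}$ respectively, and $h_j$ is a weighting which decreases the influence of border pixels with a Gaussian. For C2-P the negative Euclidean distance $G(w_m, p_{in}) = -\sqrt{\sum_{j=1}^{J} (w_{mj} - p_{inj})^2}$ shows better performance because of the sparseness in this layer. The similarity between SIFT descriptors is given by their dot product $G(w_m, p_{in}) = \sum_{j=1}^{J} w_{mj} \, p_{inj}$. The maximum activation per image is chosen as response and spatial information is neglected.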
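The three similarity measures and the maximum response of (1) can be written down compactly. The sketch below is an illustration under assumptions: the Gaussian weights `h` are taken to enter every sum of the normalized cross-correlation, plain (unweighted) means are used, and all function names are made up.

```python
import math


def weighted_ncc(w, p, h):
    """Normalized cross-correlation of w and p with per-element weights h (GRAY-P).

    Assumption: the weights enter the numerator and both sums of the denominator.
    """
    mean_w = sum(w) / len(w)
    mean_p = sum(p) / len(p)
    num = sum(hj * (wj - mean_w) * (pj - mean_p) for hj, wj, pj in zip(h, w, p))
    var_w = sum(hj * (wj - mean_w) ** 2 for hj, wj in zip(h, w))
    var_p = sum(hj * (pj - mean_p) ** 2 for hj, pj in zip(h, p))
    den = math.sqrt(var_w * var_p)
    return num / den if den > 0 else 0.0


def neg_euclidean(w, p):
    """Negative Euclidean distance (C2-P): larger values mean more similar."""
    return -math.sqrt(sum((wj - pj) ** 2 for wj, pj in zip(w, p)))


def dot(w, p):
    """Dot product (SIFT-P); the descriptors are assumed to be energy-normalized."""
    return sum(wj * pj for wj, pj in zip(w, p))


def response(w, parts, similarity):
    """r_mi of (1): maximum similarity of feature w over all parts of one image."""
    return max(similarity(w, p) for p in parts)
```

For example, `response(w, parts, neg_euclidean)` returns 0 exactly when one of the parts of the image equals `w`, and the position of that part plays no role.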

Reflecting the remarks on feature selection given above, we decided to use the following strategy: First we determine which views of a certain category each individual candidate feature $w_m$ can separate. Therefore we compute the response $r_{mi}$ for every training image with (1). Then the minimal threshold $t_m$ is chosen that guarantees that all images with $r_{mi}$ above or equal to $t_m$ belong to the same category (see Fig. 2):

$$t_m = \min \left\{ t \;\middle|\; \forall\, i, j \text{ with } r_{mi} \ge t,\ r_{mj} \ge t : \; l_i = l_j \right\}. \tag{2}$$

Here $l_i$ denotes the category label of image $i$. The images separated by the threshold are assigned a constant score $s_{mi} = k$ with respect to the feature $w_m$
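The threshold of (2) can be found by walking through the training images in order of decreasing response. The following sketch assumes a non-empty list of responses for one candidate feature; the function name `select_threshold` and the handling of tied responses are assumptions for illustration.

```python
def select_threshold(responses, labels):
    """Minimal threshold t such that all images with response >= t share one label.

    responses[i] is the response of one candidate feature on training image i,
    labels[i] is the category label of that image. Returns the threshold and the
    indices of the separated images. Images with identical responses are treated
    as a group that must be entirely of the same category. If even the top group
    is mixed, (None, []) is returned.
    """
    order = sorted(range(len(responses)), key=lambda i: responses[i], reverse=True)
    category = labels[order[0]]
    threshold, separated = None, []
    pos = 0
    while pos < len(order):
        value = responses[order[pos]]
        group = []
        while pos < len(order) and responses[order[pos]] == value:
            group.append(order[pos])
            pos += 1
        if any(labels[i] != category for i in group):
            break  # first image of another category: stop lowering the threshold
        threshold = value
        separated.extend(group)
    return threshold, separated
```

With responses `[0.9, 0.8, 0.7, 0.6, 0.5]` and labels `["duck", "duck", "cup", "duck", "cup"]` the threshold is 0.8 and the first two views are separated; the duck view at 0.6 lies below an image of another category and is therefore not covered by this feature.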


Frequently Asked Questions (2)

Q1. What are the contributions mentioned in the paper "A comparison of features in parts-based object recognition hierarchies" ?

In this paper the authors present the results of a comparative investigation of different feature types with regard to their suitability for category discrimination. The authors discuss the conceptual differences, resulting performance and consequences for hierarchical models of visual recognition.

Q2. What are the future works in "A comparison of features in parts-based object recognition hierarchies" ?

Besides the normalization of rotation for SIFT, it would be interesting to investigate other reasons for the differences in performance in future work. Since both approaches have not been applied to scenarios with multiple categories, the authors hope that their comparative study provides further helpful insight into parts-based 3D object recognition.
