Book Chapter

A comparison of features in parts-based object recognition hierarchies

09 Sep 2007, pp. 210-219

TL;DR: A comparative investigation of different feature types with regard to their suitability for category discrimination: patches of gray-scale images were compared with SIFT descriptors and with patches from the high-level output of a feedforward hierarchy related to the ventral visual pathway.

Abstract: Parts-based recognition has been suggested for generalizing from few training views in categorization scenarios. In this paper we present the results of a comparative investigation of different feature types with regard to their suitability for category discrimination. Patches of gray-scale images were compared with SIFT descriptors and with patches from the high-level output of a feedforward hierarchy related to the ventral visual pathway. We discuss the conceptual differences, resulting performance and consequences for hierarchical models of visual recognition.

Summary

1 Introduction

  • The human brain employs different kinds of interrelated representations and processes to recognize objects, depending on the familiarity of the object and the required level of recognition, which is defined by the current task.
  • A parts-based representation is especially efficient for storing and categorizing novel objects, because the largest variance in unseen views of an object can be expected in the position and arrangement of parts, while each part of an object will be visible under a large variety of 3D object transformations.
  • Here hierarchies of feature layers are used, like in the ventral visual pathway, where they combine specificity and invariance of features.
  • In other holistic methods the receptive fields of the features cover the whole image.
  • The approach selects features based on the maximization of mutual information for a single class.

2 Analytic Features

  • To generalize from few training examples, parts-based recognition follows the notion that similar combinations of parts are specific for a certain category over a wide range of variations.
  • So the authors need a reasonable feature selection strategy that evaluates which and how many views of a certain category a feature can separate from other categories and, based on those results, choose the subset of features that in combination can describe the whole scenario best.
  • The maximum activated bin in this histogram is used to normalize the rotation of the patch in advance.
  • A similar cluster step was also done in [15] to improve the generalization performance of the otherwise very specific SIFT descriptors.
  • The feedforward hierarchy proposed in [3] is shown in Fig. 1a.

3 Results

  • The authors tested the performance of the different feature types on the categorization scenario shown in Fig.
  • For the different tests the authors then varied the number of features used and the number of training views that were used by a single layer perceptron (SLP) as the final classifier.
  • C2-H is similar to C2-P and GRAY-P, and takes the lead when using a large number of views.
  • The performance of SIFT-P and GRAY-P on cups(7) is very poor and does not improve with more training views.
  • This is especially true for categories where the rotation in depth looks like rotation in plane (bottle(2), brush(4), phone(9), tool(10)).

4 Conclusion

  • The authors evaluated the performance of different types of local features when used in parts-based recognition.
  • The biologically motivated feedforward hierarchy in [3] is powerful in holistic recognition with a sufficient number of training examples, but the patches from the output layer are too general and therefore show weak performance in parts-based recognition.
  • First features are used that extract the magnitudes for 8 different local gradient directions.
  • This could be beneficial for both feature types.
  • The most related work in the direction of analytic features was done in [16], where Ullman introduced invariance over viewpoint in his fragments approach, or in the work of Dorko et al. in [15], where highly informative clusters of SIFT descriptors are used.


A Comparison of Features in Parts-Based
Object Recognition Hierarchies
Stephan Hasler, Heiko Wersing, and Edgar Körner
Honda Research Institute Europe GmbH
D-63073 Offenbach/Germany
stephan.hasler@honda-ri.de
Abstract. Parts-based recognition has been suggested for generalizing
from few training views in categorization scenarios. In this paper we
present the results of a comparative investigation of different feature
types with regard to their suitability for category discrimination. So
patches of gray-scale images were compared with SIFT descriptors and
patches from the high-level output of a feedforward hierarchy related to
the ventral visual pathway. We discuss the conceptual differences,
resulting performance and consequences for hierarchical models of visual
recognition.
1 Introduction
The human brain employs different kinds of interrelated representations and
processes to recognize objects, depending on the familiarity of the object and
the required level of recognition, which is defined by the current task. There is
evidence that for identifying highly familiar objects, like faces, holistic templates
are used that emphasize the spatial layout of the object’s parts but neglect
details of the parts themselves. This holistic prototypical representation requires
a lot of experience and coding capacity and therefore cannot be used for all
the objects in everyday life. A more compact representation can be obtained
when handling objects as combinations of shared parts. There are various
biological motivations for such a representation. The experiments of Tanaka [1]
revealed that there are high-level areas in the primate ventral visual pathway that
predict the presence of a large set of features with intermediate complexity,
generalizing over small variations and being invariant to retinotopic position and
scale. The combinatorial use of those features was shown by Tsunoda [2]. He observed
that complex objects simultaneously activate different spots in those areas and
that this activation is caused by the constituent parts. A parts-based
representation is especially efficient for storing and categorizing novel objects,
because the largest variance in unseen views of an object can be expected in the
position and arrangement of parts, while each part of an object will be visible
under a large variety of 3D object transformations.
In computer vision literature there is a similar distinction into holistic and
parts-based approaches, depending on how feature responses are aggregated over
J. Marques de Sá et al. (Eds.): ICANN 2007, Part II, LNCS 4669, pp. 210–219, 2007.
© Springer-Verlag Berlin Heidelberg 2007

the image. Parts can be local features of any kind. The response of a part detector
at different positions in an image means that the part might be present several
times but not that the probability is higher that the part is present at all. So
each peak in the multimodal response map is handled as a possible instance
of the part. In contrast to this, holistic approaches contain a layer that simply
accumulates the real-valued response of single features of the previous layer
over the whole image. This is only comparable to the biological definition if the
configurational information is kept.
Approaches with strong biological motivation are presented in [3,4]. Here
hierarchies of feature layers are used, as in the ventral visual pathway, which
combine specificity and invariance of features. There are cells that are either
sensitive to a specific pattern of activation in lower layers, in this way
increasing the feature’s complexity, or that pool the responses of similar features,
thereby generalizing over small variations. The output layer of the feedforward hierarchy
proposed in [3] contains several topographically organized feature maps which
are used directly by the final classifier. Following the above definition this is a
holistic approach. The similar hierarchy of [4] employs in the highest feature
layer a spatial max-pooling over each feature map in the previous layer, which
makes it a parts-based approach. Multimodal response characteristics and the
position of the parts are neglected.
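
The difference between accumulating a response map, max-pooling it, and treating each peak as a part instance can be illustrated with a small sketch. The response values, the function names and the peak threshold of 0.5 are assumptions made for illustration only; they are not taken from [3] or [4].

```python
# Toy response map of one feature detector over a 4x5 grid of positions.
# All values are invented for illustration.
response_map = [
    [0.1, 0.0, 0.2, 0.0, 0.1],
    [0.0, 0.9, 0.1, 0.0, 0.0],
    [0.1, 0.1, 0.0, 0.8, 0.1],
    [0.0, 0.0, 0.1, 0.1, 0.0],
]


def holistic_sum(rmap):
    """Holistic aggregation: accumulate the real-valued response over the image."""
    return sum(sum(row) for row in rmap)


def spatial_max(rmap):
    """Spatial max-pooling: keep the strongest response, discard its position."""
    return max(max(row) for row in rmap)


def local_peaks(rmap, threshold=0.5):
    """Return (row, col, value) for each strict local maximum above threshold.

    Each peak is treated as one possible instance of the part
    (8-neighbourhood; the threshold is a hypothetical parameter).
    """
    rows, cols = len(rmap), len(rmap[0])
    peaks = []
    for y in range(rows):
        for x in range(cols):
            value = rmap[y][x]
            if value < threshold:
                continue
            neighbours = [
                rmap[ny][nx]
                for ny in range(max(0, y - 1), min(rows, y + 2))
                for nx in range(max(0, x - 1), min(cols, x + 2))
                if (ny, nx) != (y, x)
            ]
            if all(value > n for n in neighbours):
                peaks.append((y, x, value))
    return peaks


total = holistic_sum(response_map)   # one number, mixes both peaks and noise
strongest = spatial_max(response_map)  # one number, position is lost
instances = local_peaks(response_map)  # two candidate part instances
```

In this toy map the sum cannot tell two strong detections from many weak ones, the maximum reports only that the part is present somewhere, and the peak list keeps both candidate instances with their positions.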
Most other approaches work more directly on the images. Very typical holistic
approaches apply histograms: e.g. in [5] the responses to local features are
simply summed, and in [6] it is counted how often a response lies in a certain
range. In other holistic methods the receptive fields of the features cover the
whole image. For example, in [7] features obtained by principal component analysis
(PCA) on gray-scale images were used to classify faces. These features, so-called
eigenfaces, show a very global activation and do not reflect parts of a face. In
contrast to PCA, other methods produce so-called parts-based features, like the
nonnegative matrix factorization (NMF) proposed in [8] or a similar scheme
proposed in [9] yielding more class-specific features. Although during training the
receptive field of each feature covers the whole image, it learns to reconstruct
a certain localized region that contains the same part in many training views
(e.g. parts of normalized frontal views of faces). But usually those features are
used in a holistic manner, meaning that they are extracted at a single position
in the test image and in this way are only sensitive to the rigid constellation of
parts that was present during training. This limits the possibilities to generalize
over geometric transformations, which is especially a drawback when using
few training examples in an unnormalized setting. Also, the holistic approaches
perform poorly in the presence of clutter and occlusion and often require extensive
preprocessing such as localization and segmentation.
Other parts-based recognition approaches also use the maximum activation
of each feature, like the highest layer in [4]. In [10] the features are fragments of
gray-scale images. The response of a feature is binary and obtained by
thresholding the maximum activation in the image. The approach selects features
based on the maximization of mutual information for a single class. This yields

fragments of intermediate complexity. An image is classified by comparing binary
activation vectors to stored representatives in a nearest neighbor fashion.
Other approaches make use of the position and treat each peak in the response
map as a possible part instance. In the scale invariant feature transform (SIFT)
approach in [11], gradient histograms are extracted for small patches around
interesting points (see Fig. 1c). Each such patch descriptor is compared against a
large repertoire of stored descriptors, where the best match votes for the presence
of an object at a certain position, scale and rotation. The votes are combined
using a Generalized Hough Transform and the maximally activated hypothesis
is chosen. A similar scheme is proposed in [12]. Here image patches are used as
features and the algorithm is capable of producing a segmentation mask for the
object hypothesis that can be used for a further refinement process. In the bags
of keypoints approaches, e.g. [13], it is counted how often parts are detected in
an image. In contrast to holistic histogram-based approaches, the presence of a
part is the result of a strong local competition of parts. Therefore it is more a
counting of symbol-type information than a summation over real-valued
signal-type responses. Parts-based recognition can be used to localize and recognize
objects at the same time and works well in the presence of clutter and occlusion.
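
The "counting of symbol-type information" in a bag of keypoints can be sketched as follows. The sketch assumes that local competition is realized as a winner-take-all assignment of each descriptor to its nearest codebook entry; the two-dimensional descriptors, the codebook and the function names are invented for illustration and do not reproduce the method of [13].

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codebook entry with the smallest squared Euclidean distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        range(len(codebook)),
        key=lambda k: squared_distance(descriptor, codebook[k]),
    )


def bag_of_keypoints(descriptors, codebook):
    """Count how often each codeword wins the competition for a descriptor."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    return counts


# Hypothetical codebook with three "parts" and four local descriptors of one image.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
descriptors = [[0.9, 0.1], [0.1, 0.8], [1.1, -0.1], [0.05, 0.0]]

histogram = bag_of_keypoints(descriptors, codebook)
```

Each descriptor contributes exactly one count to one bin, regardless of how strong its match is, which is what distinguishes this histogram from a sum over real-valued responses.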
In Sect. 2 we first comment on the task we want to solve and the nature of
the features required for this. Then we describe the investigated feature types
and our feature selection strategy. We give results for a categorization problem
in Sect. 3 and present our conclusions in Sect. 4.
2 Analytic Features
To generalize from few training examples, parts-based recognition follows the
notion that similar combinations of parts are specific for a certain category over
a wide range of variations. In this work we investigate how suitable different
feature types are for this purpose and how much effort is needed in terms of the
number of features used. As has been argued in [10], it is beneficial that a single
part can be detected in many views of one category, while being absent in other
categories. So we need a reasonable feature selection strategy that evaluates
which and how many views of a certain category a feature can separate from
other categories and, based on those results, choose the subset of features that in
combination can describe the whole scenario best. For simple categories a single
feature can separate many views and therefore only few features are necessary to
represent the whole category. For categories with more variation more features
have to be selected to cover the whole appearance. This dynamic distribution of
resources is necessary to make best use of the limited number of features.
How well certain local descriptors can be re-detected under different image
transformations, such as scale, rotation and viewpoint changes, was investigated in
[14]. Although this is a desired quality, it does not necessarily say anything about
the usefulness in object recognition tasks. To underline that the desired features
should be meaningful, i.e. offer a compromise between specificity and generality

at low costs, and to avoid confusion with approaches that learn parts-based
features, we will use the term analytic features.
We decided to compare patches of gray-scale images, for their simplicity, SIFT
descriptors, for their known invariance, and patches of the output of the
feedforward hierarchy in [3], because of the biological background.
A SIFT descriptor as proposed in [11] describes a gray-scale patch of 16x16
pixels using a grid of 4x4 gradient-histograms (see Fig. 1c). Each histogram in
the grid is made up of eight orientation bins. The magnitude of the gradient at a
certain pixel is distributed in a bilinear fashion over the neighboring histograms
(in general four), where the orientation of the gradient determines the bin. The
gradient magnitudes are scaled with a Gaussian that is centered on the patch, in
this way reducing the influence of border pixels. Prior to the calculation of the
histogram grid a single histogram with a higher number of orientation bins is
computed for the whole patch. The maximum activated bin in this histogram is
used to normalize the rotation of the patch in advance. Finally the energy of the
whole descriptor is normalized to obtain invariance to illumination. In contrast
to [11], we do not extract SIFT descriptors at a small number of interesting
keypoints, but for all locations where at least a minimum of structure is present.
In this way only uniform, dark background is neglected and on the category
scenario in Fig. 3 on average one third of all descriptors is kept. We reduce the
number of descriptors for each image by applying a k-means algorithm with
200 components. A similar clustering step was also done in [15] to improve the
generalization performance of the otherwise very specific SIFT descriptors.
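
The layout of the descriptor (a 4x4 grid of 8-bin orientation histograms over a 16x16 patch, 128 values in total, followed by energy normalization) can be made concrete with a deliberately simplified sketch. It is not the descriptor of [11]: the bilinear distribution over neighboring histograms, the Gaussian weighting and the rotation normalization are all omitted, gradients are taken by central differences, and the function name is hypothetical.

```python
import math


def sift_like_descriptor(patch):
    """Simplified 128-dimensional gradient-histogram descriptor.

    patch: 16x16 list of lists of gray values.
    Each pixel votes with its gradient magnitude into exactly one of
    4x4 cells and one of 8 orientation bins (hard assignment).
    """
    size, cells, bins = 16, 4, 8
    cell_size = size // cells
    hist = [0.0] * (cells * cells * bins)

    # Border pixels are skipped because central differences need both neighbours.
    for y in range(1, size - 1):
        for x in range(1, size - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(dx, dy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(dy, dx) % (2.0 * math.pi)
            orientation_bin = int(angle / (2.0 * math.pi) * bins) % bins
            cell_index = (y // cell_size) * cells + (x // cell_size)
            hist[cell_index * bins + orientation_bin] += magnitude

    # Normalize the energy of the whole descriptor.
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0.0 else hist


# A horizontal intensity ramp: every gradient points in the same direction.
ramp = [[float(x) for x in range(16)] for _ in range(16)]
descriptor = sift_like_descriptor(ramp)

# The same patch with three times the contrast.
scaled_ramp = [[3.0 * v for v in row] for row in ramp]
scaled_descriptor = sift_like_descriptor(scaled_ramp)
```

For the ramp, all energy falls into the first orientation bin of each cell, and the normalized descriptor is unchanged when the contrast of the patch is scaled.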
For the gray-scale patches we decided to use the same patch size as for the
SIFT approach and the influence of the pixels is also weighted with a Gaussian
that is centered on the patch.
The feedforward hierarchy proposed in [3] is shown in Fig. 1a. The S1-layer
computes the magnitudes of the response to four differently oriented Gabor filters.
This activation is pooled to a lower resolution in the C1-layer, which performs a
local OR-operation. The 50 features used in S2 are trained to efficiently reconstruct
a large set of random 4x4x4 C1-patches from natural images and are therefore
sensitive to local patterns in C1. Layer C2 performs a further pooling operation
and is the output of the hierarchy. Columns of 2x2 pixels are cut from the
C2-layer as shown in Fig. 1b and used as feature candidates. Because of the two
pooling layers, which offer a small degree of invariance to translation, a column of
2x2 pixels in C2 corresponds roughly to a patch of 16x16 pixels in the gray-scale
image.
We will refer to the parts-based approaches as GRAY-P, SIFT-P, and C2-P.
For SIFT-P each image $i$ is described by the $J = 4 \times 4 \times 8 = 128$ dimensional
representatives of the 200 k-means clusters $\mathbf{p}_{in}$, $n = 1 \ldots 200$. For GRAY-P the
$\mathbf{p}_{in}$ are the patches of image $i$ at all distinct positions $n$ ($J = 16 \times 16 = 256$).
Similar to this for C2 each $\mathbf{p}_{in}$ is a column through the feature maps of image
$i$ at a distinct position $n$ as shown in Fig. 1b ($J = 2 \times 2 \times 50 = 200$). The $\mathbf{p}_{in}$
show a large variety. Therefore we will use all $\mathbf{p}_{in}$ directly as feature candidates
$\mathbf{w}_m$, where $m$ is an index over all combinations of $i$ and $n$, and select a subset

[Figure 1, panels a), b), c). Recoverable labels: Input I; S1 (4 Gabors); C1 (Pooling); S2 (50 Combinations); C2 (Pooling, 50 Planes); 16x16 Pixels.]
Fig. 1. a) Feedforward hierarchy in [3]. b) Columns of C2-layer are used as local
features. c) SIFT descriptor [11] is a grid of gradient histograms, each with 8 orientations.
of those candidates with a strategy that is described later. The response $r_{mi}$ of
feature $\mathbf{w}_m$ on the image $i$ is given by:

$$r_{mi} = \max_n \left( G(\mathbf{w}_m, \mathbf{p}_{in}) \right). \qquad (1)$$

For GRAY-P

$$G(\mathbf{w}_m, \mathbf{p}_{in}) = \frac{\sum_{j=1}^{J} h_j \,(w_m^j - \bar{w}_m)(p_{in}^j - \bar{p}_{in})}{\sqrt{\sum_{j} h_j \,(w_m^j - \bar{w}_m)^2 \; \sum_{j} h_j \,(p_{in}^j - \bar{p}_{in})^2}}$$

is used, which is the normalized cross-correlation, where $\bar{w}_m$ and $\bar{p}_{in}$ are the
means of vector $\mathbf{w}_m$ and $\mathbf{p}_{in}$ respectively, and $h_j$ is a weighting which decreases
the influence of border pixels with a Gaussian. For C2-P the negative Euclidean distance
$G(\mathbf{w}_m, \mathbf{p}_{in}) = -\sum_{j=1}^{J} (w_m^j - p_{in}^j)^2$ shows better performance because of the
sparseness in this layer. The similarity between SIFT descriptors is given by their
dot product $G(\mathbf{w}_m, \mathbf{p}_{in}) = \sum_{j=1}^{J} w_m^j \, p_{in}^j$. The maximum activation per image is
chosen as response and spatial information is neglected.
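
The three similarity measures and the max-response of (1) can be written down as a minimal sketch. It assumes that the means in the normalized cross-correlation are plain (unweighted) vector means and that the C2-P measure is the negative sum of squared differences; the function names and the toy vectors are illustrative and not part of the original work.

```python
import math


def weighted_ncc(w, p, h):
    """Normalized cross-correlation with per-component weights h (GRAY-P)."""
    w_mean = sum(w) / len(w)
    p_mean = sum(p) / len(p)
    numerator = sum(
        hj * (wj - w_mean) * (pj - p_mean) for hj, wj, pj in zip(h, w, p)
    )
    w_energy = sum(hj * (wj - w_mean) ** 2 for hj, wj in zip(h, w))
    p_energy = sum(hj * (pj - p_mean) ** 2 for hj, pj in zip(h, p))
    denominator = math.sqrt(w_energy * p_energy)
    # A constant patch has zero variance; treat it as uncorrelated.
    return numerator / denominator if denominator > 0.0 else 0.0


def negative_squared_distance(w, p):
    """Negative sum of squared differences (assumed form of the C2-P measure)."""
    return -sum((wj - pj) ** 2 for wj, pj in zip(w, p))


def dot_product(w, p):
    """Dot product between two descriptors (SIFT-P)."""
    return sum(wj * pj for wj, pj in zip(w, p))


def response(w, patches, similarity):
    """r_mi = max over all patches n of image i of G(w_m, p_in)."""
    return max(similarity(w, p) for p in patches)


# Toy feature and two toy patches of one image.
feature = [1.0, 2.0, 3.0, 4.0]
image_patches = [[1.0, 2.0, 3.0, 5.0], [0.0, 0.0, 0.0, 0.0]]
uniform_weights = [1.0, 1.0, 1.0, 1.0]

r_c2 = response(feature, image_patches, negative_squared_distance)
r_gray = response(
    feature, image_patches, lambda a, b: weighted_ncc(a, b, uniform_weights)
)
```

Because only the maximum over positions is kept, the response says how well the best-matching patch fits the feature, not where that patch lies.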
Reflecting the remarks on feature selection given above, we decided to use
the following strategy: First we determine which views of a certain category
each individual candidate feature $\mathbf{w}_m$ can separate. To this end we compute the
response $r_{mi}$ for every training image with (1). Then the minimal threshold $t_m$
is chosen that guarantees that all images with $r_{mi}$ above or equal to $t_m$ belong
to the same category (see Fig. 2):

$$t_m = \min \left\{ t \;\middle|\; \forall\, i, j \text{ with } r_{mi} \geq t,\ r_{mj} \geq t : l_i = l_j \right\}. \qquad (2)$$

Here $l_i$ denotes the category label of image $i$. Each image separated by the
threshold is assigned a constant score $s_{mi} = k$ with respect to the feature $\mathbf{w}_m$.
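
The threshold selection of (2) can be sketched as a scan over the observed response values from the highest downwards. Two points are assumptions of this sketch rather than statements from the text: candidate thresholds are restricted to observed response values, and images that are not separated receive a score of 0. The function names and the toy data are likewise hypothetical.

```python
def separation_threshold(responses, labels):
    """Smallest observed t such that all images with response >= t share one label.

    Returns None if even the highest response value is shared by images
    of different categories.
    """
    best = None
    for t in sorted(set(responses), reverse=True):
        selected_labels = {l for r, l in zip(responses, labels) if r >= t}
        if len(selected_labels) > 1:
            break  # lowering t further would mix categories
        best = t
    return best


def separation_scores(responses, labels, k=1.0):
    """Assign the constant score k to every image at or above the threshold.

    The score 0.0 for all other images is an assumption of this sketch.
    """
    t = separation_threshold(responses, labels)
    if t is None:
        return [0.0] * len(responses)
    return [k if r >= t else 0.0 for r in responses]


# Responses of one candidate feature on five training images.
toy_responses = [0.9, 0.8, 0.7, 0.6, 0.4]
toy_labels = ["cup", "cup", "cup", "bottle", "cup"]

t_m = separation_threshold(toy_responses, toy_labels)
s_m = separation_scores(toy_responses, toy_labels)
```

In the toy data the feature separates three cup views; the fourth-ranked image is a bottle, so the cup view ranked below it is not counted as separated.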

Citations

Journal Article
TL;DR: To achieve the life-long learning ability for a cognitive system, a new learning vector quantization approach combined with a category-specific feature selection method to allow several metrical "views" on the representation space of each individual vector quantification node.
Abstract: We present a new method capable of learning multiple categories in an interactive and life-long learning fashion to approach the ''stability-plasticity dilemma''. The problem of incremental learning of multiple categories is still largely unsolved. This is especially true for the domain of cognitive robotics, requiring real-time and interactive learning. To achieve the life-long learning ability for a cognitive system, we propose a new learning vector quantization approach combined with a category-specific feature selection method to allow several metrical ''views'' on the representation space of each individual vector quantization node. These category-specific features are incrementally collected during the learning process, so that a balance between the correction of wrong representations and the stability of acquired knowledge is achieved. We demonstrate our approach for a difficult visual categorization task, where the learning is applied for several complex-shaped objects rotated in depth.

49 citations


Cites methods from "A comparison of features in parts-b..."

  • ...The parts-based feature detection (see Hasler et al. (2007) for details) is based on a preselected set of SIFT-descriptors (Lowe, 2004), which are designed to be invariant with regard to rotations in the image plane....

    [...]


Journal Article
TL;DR: This work presents an integrated vision architecture capable of incrementally learning several visual categories based on natural hand-held objects and imposes no restrictions on the viewing angle of presented objects, relaxing the common constraint on canonical views.
Abstract: We present an integrated vision architecture capable of incrementally learning several visual categories based on natural hand-held objects. Additionally we focus on interactive learning, which requires real-time image processing methods and a fast learning algorithm. The overall system is composed of a figure-ground segregation part, several feature extraction methods and a life-long learning approach combining incremental learning with category-specific feature selection. In contrast to most visual categorization approaches, where typically each view is assigned to a single category, we allow labeling with an arbitrary number of shape and color categories. We also impose no restrictions on the viewing angle of presented objects, relaxing the common constraint on canonical views.

30 citations


Patent
Andreas Knoblauch
15 Dec 2009
Abstract: This invention is in the field of machine learning and neural associative memory. In particular the invention discloses a neural associative memory structure for storing and maintaining associations between memory address patterns and memory content patterns using a neural network, as well as methods for retrieving such associations. A method for a non-linear synaptic learning of discrete synapses is disclosed, and its application on neural networks is laid out.

25 citations


Patent
Andreas Knoblauch
28 May 2010
Abstract: This invention is in the field of machine learning and neural associative memory. In particular the invention discloses a neural associative memory structure for storing and maintaining associations between memory address patterns and memory content patterns using a neural network, as well as methods for storing and retrieving such associations. Bayesian learning is applied to achieve non-linear learning.

25 citations


Proceedings Article
18 Jul 2010
TL;DR: An exemplar-based learning approach for incremental and life-long learning of visual categories and it is argued that contextual information is beneficial for this process.
Abstract: We present an exemplar-based learning approach for incremental and life-long learning of visual categories. The basic concept of the proposed learning method is to subdivide the learning process into two phases. In the first phase we utilize supervised learning to generate an appropriate category seed, while in the second phase this seed is used to autonomously bootstrap the visual representation. This second learning phase is especially useful for assistive systems like a mobile robot, because the visual knowledge can be enhanced even if no tutor is present. Although for this autonomous bootstrapping no category labels are provided, we argue that contextual information is beneficial for this process. Finally we investigate the effect of the proposed second learning phase with respect to the overall categorization performance.

11 citations


References

Journal Article
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

42,225 citations


Journal Article
TL;DR: A near-real-time computer system that can locate and track a subject's head, and then recognize the person by comparing characteristics of the face to those of known individuals, and that is easy to implement using a neural network architecture.
Abstract: We have developed a near-real-time computer system that can locate and track a subject's head, and then recognize the person by comparing characteristics of the face to those of known individuals. The computational approach taken in this system is motivated by both physiology and information theory, as well as by the practical requirements of near-real-time performance and accuracy. Our approach treats the face recognition problem as an intrinsically two-dimensional (2-D) recognition problem rather than requiring recovery of three-dimensional geometry, taking advantage of the fact that faces are normally upright and thus may be described by a small set of 2-D characteristic views. The system functions by projecting face images onto a feature space that spans the significant variations among known face images. The significant features are known as "eigenfaces," because they are the eigenvectors (principal components) of the set of faces; they do not necessarily correspond to features such as eyes, ears, and noses. The projection operation characterizes an individual face by a weighted sum of the eigenface features, and so to recognize a particular face it is necessary only to compare these weights to those of known individuals. Some particular advantages of our approach are that it provides for the ability to learn and later recognize new faces in an unsupervised manner, and that it is easy to implement using a neural network architecture.

14,128 citations


"A comparison of features in parts-b..." refers methods in this paper

  • ...in [7] features obtained by principal component analysis (PCA) on gray-scale images were used to classify faces....

    [...]


Journal Article
21 Oct 1999-Nature
TL;DR: An algorithm for non-negative matrix factorization is demonstrated that is able to learn parts of faces and semantic features of text and is in contrast to other methods that learn holistic, not parts-based, representations.
Abstract: Is perception of the whole based on perception of its parts? There is psychological and physiological evidence for parts-based representations in the brain, and certain computational theories of object recognition rely on such representations. But little is known about how brains or computers might learn the parts of objects. Here we demonstrate an algorithm for non-negative matrix factorization that is able to learn parts of faces and semantic features of text. This is in contrast to other methods, such as principal components analysis and vector quantization, that learn holistic, not parts-based, representations. Non-negative matrix factorization is distinguished from the other methods by its use of non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations. When non-negative matrix factorization is implemented as a neural network, parts-based representations emerge by virtue of two properties: the firing rates of neurons are never negative and synaptic strengths do not change sign.

9,911 citations


"A comparison of features in parts-b..." refers methods in this paper

  • ...In contrary to PCA other methods produce so called parts-based features like the nonnegative matrix factorization (NMF) proposed in [8] or a similar scheme proposed in [9] yielding more class-specific features....

    [...]


Journal Article
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Setp. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.

6,855 citations


"A comparison of features in parts-b..." refers background in this paper

  • ...How well certain local descriptors can be re-detected under different image transformations, as scale, rotation and viewpoint changes, was investigated in [14]....

    [...]


Frequently Asked Questions (2)
Q1. What are the contributions mentioned in the paper "A comparison of features in parts-based object recognition hierarchies"?

In this paper the authors present the results of a comparative investigation of different feature types with regard to their suitability for category discrimination. The authors discuss the conceptual differences, resulting performance and consequences for hierarchical models of visual recognition. 

Besides the normalization of rotation for SIFT, it would be interesting to investigate other reasons for the differences in performance in future work. Since neither approach has been applied to scenarios with multiple categories, the authors hope that their comparative study provides further helpful insight into parts-based 3D object recognition.