TL;DR: In a comparative investigation of different feature types with regard to their suitability for category discrimination, patches of gray-scale images were compared with SIFT descriptors and with patches from the high-level output of a feedforward hierarchy related to the ventral visual pathway.
Abstract: Parts-based recognition has been suggested for generalizing from few training views in categorization scenarios. In this paper we present the results of a comparative investigation of different feature types with regard to their suitability for category discrimination. To this end, patches of gray-scale images were compared with SIFT descriptors and patches from the high-level output of a feedforward hierarchy related to the ventral visual pathway. We discuss the conceptual differences, resulting performance and consequences for hierarchical models of visual recognition.
The human brain employs different kinds of interrelated representations and processes to recognize objects, depending on the familiarity of the object and the required level of recognition, which is defined by the current task.
A parts-based representation is especially efficient for storing and categorizing novel objects, because the largest variance in unseen views of an object can be expected in the position and arrangement of parts, while each part of an object will be visible under a large variety of 3D object transformations.
Here hierarchies of feature layers are used, like in the ventral visual pathway, where they combine specificity and invariance of features.
In other holistic methods the receptive fields of the features cover the whole image.
The approach selects features based on the maximization of mutual information for a single class.
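As a rough illustration of such a criterion (a sketch, not the authors' exact procedure; the data and names below are hypothetical), a feature can be scored by the mutual information between its responses and a one-vs-rest label for a single category:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical data: one row per image, one column per candidate feature
# (e.g. the maximum matching response of that feature within the image).
rng = np.random.default_rng(0)
responses = rng.random((200, 50))          # 200 views, 50 candidate features
labels = rng.integers(0, 10, size=200)     # 10 object categories

# Score each feature by mutual information with one target category
# (one-vs-rest) and keep the highest-scoring features for that category.
target_category = 3
binary_labels = (labels == target_category).astype(int)
mi = mutual_info_classif(responses, binary_labels, random_state=0)
best_features = np.argsort(mi)[::-1][:10]
print("top features for category", target_category, ":", best_features)
```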
2 Analytic Features
To generalize from few training examples, parts-based recognition follows the notion that similar combinations of parts are specific for a certain category over a wide range of variations.
So the authors need a reasonable feature selection strategy that evaluates which and how many views of a certain category a feature can separate from other categories and, based on those results, chooses the subset of features that in combination describes the whole scenario best.
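One simple way to realize such a selection, sketched here under the assumption that a boolean matrix of separated views is already available, is a greedy set-cover over the training views:

```python
import numpy as np

def greedy_feature_selection(separates, max_features=20):
    """Greedy set-cover style selection.

    separates[i, j] is True if candidate feature i separates training view j
    of its category from the other categories (how this matrix is computed
    is left open here).  Features are added one by one, each time picking
    the feature that covers the most not-yet-covered views.
    """
    n_features, n_views = separates.shape
    covered = np.zeros(n_views, dtype=bool)
    chosen = []
    for _ in range(max_features):
        gains = (separates & ~covered).sum(axis=1)
        best = int(np.argmax(gains))
        if gains[best] == 0:          # nothing new can be covered
            break
        chosen.append(best)
        covered |= separates[best]
    return chosen, covered

# Hypothetical example
rng = np.random.default_rng(1)
separates = rng.random((100, 60)) > 0.8   # 100 candidates, 60 training views
subset, covered = greedy_feature_selection(separates)
print(len(subset), "features cover", covered.sum(), "of", covered.size, "views")
```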
The maximum activated bin in this histogram is used to normalize the rotation of the patch in advance.
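This normalization step can be sketched as follows (the bin count and the interpolation settings are assumptions, not taken from the paper):

```python
import numpy as np
from scipy import ndimage

def normalize_rotation(patch, n_bins=36):
    """Rotate a gray-scale patch so its dominant gradient orientation is zero.

    The orientation histogram is weighted by gradient magnitude; the peak bin
    defines the rotation that is undone before the descriptor is computed.
    """
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)                      # in [-pi, pi]
    hist, edges = np.histogram(orientation, bins=n_bins,
                               range=(-np.pi, np.pi), weights=magnitude)
    peak = np.argmax(hist)
    dominant = 0.5 * (edges[peak] + edges[peak + 1])      # bin center in rad
    return ndimage.rotate(patch, np.degrees(dominant),
                          reshape=False, mode='nearest')

patch = np.random.default_rng(2).random((32, 32))
aligned = normalize_rotation(patch)
```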
A similar clustering step was also performed in [15] to improve the generalization performance of the otherwise very specific SIFT descriptors.
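A clustering step in this spirit could, for example, use k-means over the 128-dimensional SIFT descriptors collected for a category (the number of clusters below is an arbitrary choice):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stack of 128-dimensional SIFT descriptors collected
# from the training views of one category.
descriptors = np.random.default_rng(3).random((5000, 128)).astype(np.float32)

# Cluster the descriptors; the cluster centers then act as more general
# part prototypes than the individual, highly specific descriptors.
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(descriptors)
prototypes = kmeans.cluster_centers_          # shape (50, 128)
```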
The feedforward hierarchy proposed in [3] is shown in Fig. 1a.
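Without reproducing the exact model of [3], the general structure of such a hierarchy, alternating tuned filtering layers with pooling layers that add invariance, can be sketched like this (the toy filter bank is an assumption):

```python
import numpy as np
from scipy import ndimage

def s_layer(image, filters):
    """Tuned 'simple' layer: respond to a bank of local templates."""
    return np.stack([np.abs(ndimage.convolve(image, f)) for f in filters])

def c_layer(maps, pool=4):
    """Pooling 'complex' layer: local max pooling adds position tolerance."""
    return np.stack([ndimage.maximum_filter(m, size=pool)[::pool, ::pool]
                     for m in maps])

# Toy filter bank: oriented edge filters at 4 orientations (an assumption;
# the model in [3] uses its own first-layer features).
base = np.array([[-1., 0., 1.]] * 3)
filters = [ndimage.rotate(base, a, reshape=False) for a in (0, 45, 90, 135)]

image = np.random.default_rng(4).random((64, 64))
c1 = c_layer(s_layer(image, filters))   # first specificity/invariance stage
# Higher layers repeat the scheme on combinations of the c1 maps.
```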
3 Results
The authors tested the performance of the different feature types on the categorization scenario shown in Fig.
For the different tests the authors then varied the number of selected features and the number of training views used by a single layer perceptron (SLP) as the final classifier.
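Such a final classification stage can be set up, for instance, with scikit-learn (the paper does not specify an implementation; the data here is a random stand-in):

```python
import numpy as np
from sklearn.linear_model import Perceptron

# Hypothetical feature response vectors: one row per view,
# one column per selected feature; labels are category indices.
rng = np.random.default_rng(5)
X_train = rng.random((300, 40))
y_train = rng.integers(0, 10, size=300)
X_test = rng.random((100, 40))
y_test = rng.integers(0, 10, size=100)

slp = Perceptron(max_iter=1000, random_state=0).fit(X_train, y_train)
print("accuracy on held-out views:", slp.score(X_test, y_test))
```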
C2-H performs similarly to C2-P and GRAY-P and takes the lead when a large number of training views is used.
The performance of SIFT-P and GRAY-P on cups(7) is very poor and does not improve with more training views.
This is especially true for categories where the rotation in depth looks like rotation in plane (bottle(2), brush(4), phone(9), tool(10)).
4 Conclusion
The authors evaluated the performance of different types of local features when used in parts-based recognition.
The biologically motivated feedforward hierarchy in [3] is powerful in holistic recognition with a sufficient number of training examples, but the patches from the output layer are too general and therefore show weak performance in parts-based recognition.
First, features are used that extract the magnitudes of 8 different local gradient directions.
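One way to obtain such responses (a sketch; the hierarchy in [3] defines its own first-layer features) is to project the image gradient onto 8 evenly spaced directions and keep the rectified component:

```python
import numpy as np

def gradient_direction_magnitudes(image, n_directions=8):
    """Return one map per direction with the rectified gradient component."""
    gy, gx = np.gradient(image.astype(float))
    maps = []
    for k in range(n_directions):
        angle = 2 * np.pi * k / n_directions
        projected = gx * np.cos(angle) + gy * np.sin(angle)
        maps.append(np.maximum(projected, 0.0))   # half-wave rectification
    return np.stack(maps)

image = np.random.default_rng(6).random((64, 64))
direction_maps = gradient_direction_magnitudes(image)   # shape (8, 64, 64)
```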
This could be beneficial for both feature types.
The most related work in the direction of analytic features was done in [16], where Ullman introduced invariance over viewpoint in his fragments approach, or in the work of Dorko et al. in [15], where highly informative clusters of SIFT descriptors are used.
TL;DR: To achieve life-long learning ability for a cognitive system, a new learning vector quantization approach is combined with a category-specific feature selection method to allow several metrical "views" on the representation space of each individual vector quantization node.
Abstract: We present a new method capable of learning multiple categories in an interactive and life-long learning fashion to approach the "stability-plasticity dilemma". The problem of incremental learning of multiple categories is still largely unsolved. This is especially true for the domain of cognitive robotics, requiring real-time and interactive learning. To achieve the life-long learning ability for a cognitive system, we propose a new learning vector quantization approach combined with a category-specific feature selection method to allow several metrical "views" on the representation space of each individual vector quantization node. These category-specific features are incrementally collected during the learning process, so that a balance between the correction of wrong representations and the stability of acquired knowledge is achieved. We demonstrate our approach for a difficult visual categorization task, where the learning is applied for several complex-shaped objects rotated in depth.
53 citations
Cites methods from "A comparison of features in parts-b..."
...The parts-based feature detection (see Hasler et al. (2007) for details) is based on a preselected set of SIFT-descriptors (Lowe, 2004), which are designed to be invariant with regard to rotations in the image plane....
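The core idea of the cited approach, giving each vector quantization node category-specific metrical "views", can be illustrated very roughly as follows (a strongly simplified sketch, not the authors' algorithm; all names are hypothetical):

```python
import numpy as np

class WeightedLVQNode:
    """A prototype with one feature-weight vector per category ('metrical view')."""
    def __init__(self, position, n_categories):
        self.position = position
        self.weights = np.ones((n_categories, position.size))  # per-category metric

    def distance(self, x, category):
        d = x - self.position
        return np.sqrt(np.sum(self.weights[category] * d * d))

# Hypothetical usage: the node measures distances differently per category,
# so the same prototype can be selective for one category and tolerant for
# another; the weights would be adapted during incremental learning.
node = WeightedLVQNode(np.zeros(10), n_categories=3)
x = np.random.default_rng(7).random(10)
print([node.distance(x, c) for c in range(3)])
```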
TL;DR: This work presents an integrated vision architecture capable of incrementally learning several visual categories based on natural hand-held objects and imposes no restrictions on the viewing angle of presented objects, relaxing the common constraint on canonical views.
Abstract: We present an integrated vision architecture capable of incrementally learning several visual categories based on natural hand-held objects. Additionally we focus on interactive learning, which requires real-time image processing methods and a fast learning algorithm. The overall system is composed of a figure-ground segregation part, several feature extraction methods and a life-long learning approach combining incremental learning with category-specific feature selection. In contrast to most visual categorization approaches, where typically each view is assigned to a single category, we allow labeling with an arbitrary number of shape and color categories. We also impose no restrictions on the viewing angle of presented objects, relaxing the common constraint on canonical views.
TL;DR: In this paper, a neural associative memory structure for storing and maintaining associations between memory address patterns and memory content patterns using a neural network is presented, as well as methods for retrieving such associations.
Abstract: This invention is in the field of machine learning and neural associative memory. In particular the invention discloses a neural associative memory structure for storing and maintaining associations between memory address patterns and memory content patterns using a neural network, as well as methods for retrieving such associations. A method for a non-linear synaptic learning of discrete synapses is disclosed, and its application on neural networks is laid out.
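As a generic illustration of a non-linear learning rule for discrete synapses (a Willshaw-style binary associative memory, not the patented method), storage and retrieval can be sketched like this:

```python
import numpy as np

def store(W, address, content):
    """Willshaw-style binary storage: clip Hebbian outer products to {0, 1}."""
    return np.maximum(W, np.outer(content, address))

def retrieve(W, address, threshold):
    """Retrieve the content pattern by thresholding the dendritic sums."""
    return (W @ address >= threshold).astype(int)

rng = np.random.default_rng(11)
W = np.zeros((256, 256), dtype=int)
address = (rng.random(256) < 0.05).astype(int)   # sparse address pattern
content = (rng.random(256) < 0.05).astype(int)   # sparse content pattern
W = store(W, address, content)
recalled = retrieve(W, address, threshold=address.sum())
print("recall matches stored content:", np.array_equal(recalled, content))
```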
TL;DR: In this paper, a neural associative memory structure for storing and maintaining associations between memory address patterns and memory content patterns using a neural network is presented, as well as methods for storing and retrieving such associations.
Abstract: This invention is in the field of machine learning and neural associative memory. In particular the invention discloses a neural associative memory structure for storing and maintaining associations between memory address patterns and memory content patterns using a neural network, as well as methods for storing and retrieving such associations. Bayesian learning is applied to achieve non-linear learning.
TL;DR: An exemplar-based learning approach for incremental and life-long learning of visual categories and it is argued that contextual information is beneficial for this process.
Abstract: We present an exemplar-based learning approach for incremental and life-long learning of visual categories. The basic concept of the proposed learning method is to subdivide the learning process into two phases. In the first phase we utilize supervised learning to generate an appropriate category seed, while in the second phase this seed is used to autonomously bootstrap the visual representation. This second learning phase is especially useful for assistive systems like a mobile robot, because the visual knowledge can be enhanced even if no tutor is present. Although for this autonomous bootstrapping no category labels are provided, we argue that contextual information is beneficial for this process. Finally we investigate the effect of the proposed second learning phase with respect to the overall categorization performance.
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
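With OpenCV, the descriptor extraction and nearest-neighbor matching stage of this pipeline looks roughly as follows (the Hough clustering and pose verification steps are omitted; the image file names are placeholders):

```python
import cv2

# Placeholder image paths
img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Nearest-neighbor matching with Lowe's ratio test to reject ambiguous matches.
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), "candidate correspondences")
```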
TL;DR: A near-real-time computer system that can locate and track a subject's head and then recognize the person by comparing characteristics of the face to those of known individuals; the approach is easy to implement using a neural network architecture.
Abstract: We have developed a near-real-time computer system that can locate and track a subject's head, and then recognize the person by comparing characteristics of the face to those of known individuals. The computational approach taken in this system is motivated by both physiology and information theory, as well as by the practical requirements of near-real-time performance and accuracy. Our approach treats the face recognition problem as an intrinsically two-dimensional (2-D) recognition problem rather than requiring recovery of three-dimensional geometry, taking advantage of the fact that faces are normally upright and thus may be described by a small set of 2-D characteristic views. The system functions by projecting face images onto a feature space that spans the significant variations among known face images. The significant features are known as "eigenfaces," because they are the eigenvectors (principal components) of the set of faces; they do not necessarily correspond to features such as eyes, ears, and noses. The projection operation characterizes an individual face by a weighted sum of the eigenface features, and so to recognize a particular face it is necessary only to compare these weights to those of known individuals. Some particular advantages of our approach are that it provides for the ability to learn and later recognize new faces in an unsupervised manner, and that it is easy to implement using a neural network architecture.
14,562 citations
"A comparison of features in parts-b..." refers methods in this paper
...in [7] features obtained by principal component analysis (PCA) on gray-scale images were used to classify faces....
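The projection-and-compare scheme behind such PCA features can be reproduced in a few lines (a sketch with random data standing in for a face database):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical face database: each row is a flattened gray-scale face image.
rng = np.random.default_rng(8)
faces = rng.random((100, 64 * 64))
new_face = rng.random((1, 64 * 64))

pca = PCA(n_components=20).fit(faces)          # eigenfaces = principal components
weights = pca.transform(faces)                 # each face as a weight vector
query = pca.transform(new_face)

# Recognition reduces to comparing weight vectors, e.g. by nearest neighbor.
nearest = np.argmin(np.linalg.norm(weights - query, axis=1))
print("closest stored face:", nearest)
```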
TL;DR: An algorithm for non-negative matrix factorization is demonstrated that is able to learn parts of faces and semantic features of text and is in contrast to other methods that learn holistic, not parts-based, representations.
Abstract: Is perception of the whole based on perception of its parts? There is psychological and physiological evidence for parts-based representations in the brain, and certain computational theories of object recognition rely on such representations. But little is known about how brains or computers might learn the parts of objects. Here we demonstrate an algorithm for non-negative matrix factorization that is able to learn parts of faces and semantic features of text. This is in contrast to other methods, such as principal components analysis and vector quantization, that learn holistic, not parts-based, representations. Non-negative matrix factorization is distinguished from the other methods by its use of non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations. When non-negative matrix factorization is implemented as a neural network, parts-based representations emerge by virtue of two properties: the firing rates of neurons are never negative and synaptic strengths do not change sign.
11,500 citations
"A comparison of features in parts-b..." refers methods in this paper
...In contrast to PCA, other methods produce so-called parts-based features like the non-negative matrix factorization (NMF) proposed in [8] or a similar scheme proposed in [9] yielding more class-specific features....
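With scikit-learn, such a factorization can be sketched as follows; the non-negativity of both factors is what encourages additive, parts-like basis images (random data is used as a stand-in here):

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical data matrix: each row is a flattened, non-negative face image.
V = np.random.default_rng(9).random((100, 19 * 19))

model = NMF(n_components=25, init='nndsvda', max_iter=500, random_state=0)
W = model.fit_transform(V)        # per-image encodings (non-negative)
H = model.components_             # basis images, tend to resemble face parts
print(W.shape, H.shape)           # (100, 25) (25, 361)
```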
TL;DR: It is observed that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best and Moments and steerable filters show the best performance among the low dimensional descriptors.
Abstract: In this paper, we compare the performance of descriptors computed for local interest regions, as, for example, extracted by the Harris-Affine detector [Mikolajczyk, K and Schmid, C, 2004]. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions as well as to errors of the detector. Our evaluation uses as criterion recall with respect to precision and is carried out for different image transformations. We compare shape context [Belongie, S, et al., April 2002], steerable filters [Freeman, W and Adelson, E, Setp. 1991], PCA-SIFT [Ke, Y and Sukthankar, R, 2004], differential invariants [Koenderink, J and van Doorn, A, 1987], spin images [Lazebnik, S, et al., 2003], SIFT [Lowe, D. G., 1999], complex filters [Schaffalitzky, F and Zisserman, A, 2002], moment invariants [Van Gool, L, et al., 1996], and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low dimensional descriptors.
7,057 citations
"A comparison of features in parts-b..." refers background in this paper
...How well certain local descriptors can be re-detected under different image transformations, such as scale, rotation and viewpoint changes, was investigated in [14]....
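The evaluation criterion used there, recall with respect to precision over a sweep of the matching threshold, can be sketched as follows (all names are placeholders):

```python
import numpy as np

def recall_vs_precision(distances, is_correct, thresholds):
    """distances: match distances; is_correct: ground-truth flag per match."""
    total_correct = is_correct.sum()
    curve = []
    for t in thresholds:
        accepted = distances < t
        tp = np.sum(accepted & is_correct)
        fp = np.sum(accepted & ~is_correct)
        recall = tp / total_correct if total_correct else 0.0
        one_minus_precision = fp / max(tp + fp, 1)
        curve.append((one_minus_precision, recall))
    return np.array(curve)

rng = np.random.default_rng(10)
distances = rng.random(1000)
is_correct = rng.random(1000) > 0.5
curve = recall_vs_precision(distances, is_correct, np.linspace(0.1, 1.0, 10))
```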
Q1. What are the contributions mentioned in the paper "A comparison of features in parts-based object recognition hierarchies" ?
In this paper the authors present the results of a comparative investigation of different feature types with regard to their suitability for category discrimination. The authors discuss the conceptual differences, resulting performance and consequences for hierarchical models of visual recognition.
Q2. What are the future works in "A comparison of features in parts-based object recognition hierarchies" ?
Besides the normalization of rotation for SIFT, it would be interesting to investigate other reasons for the differences in performance in future work. Since neither approach has been applied to scenarios with multiple categories, the authors hope that their comparative study provides further helpful insight into parts-based 3D object recognition.