
Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study


HAL Id: hal-00171412
https://hal.archives-ouvertes.fr/hal-00171412
Submitted on 19 Apr 2011
Local features and kernels for classication of texture
and object categories: a comprehensive study
Jianguo Zhang, Marcin Marszalek, Svetlana Lazebnik, Cordelia Schmid
To cite this version:
Jianguo Zhang, Marcin Marszalek, Svetlana Lazebnik, Cordelia Schmid. Local features and kernels for
classication of texture and object categories: a comprehensive study. International Journal of Com-
puter Vision, Springer Verlag, 2007, 73 (2), pp.213-238. �10.1007/s11263-006-9794-4�. �hal-00171412�

Local features and kernels for classification of texture
and object categories: A comprehensive study
J. Zhang¹, M. Marszałek¹, S. Lazebnik² and C. Schmid¹
¹ INRIA, GRAVIR-CNRS, 655, av. de l’Europe, 38330 Montbonnot, France
² Beckman Institute, University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801, USA
Abstract
Recently, methods based on local image features have shown promise for texture and object recog-
nition tasks. This paper presents a large-scale evaluation of an approach that represents images as
distributions (signatures or histograms) of features extracted from a sparse set of keypoint locations
and learns a Support Vector Machine classifier with kernels based on two effective measures for
comparing distributions, the Earth Mover’s Distance and the χ² distance. We first evaluate the per-
formance of our approach with different keypoint detectors and descriptors, as well as different kernels
and classifiers. We then conduct a comparative evaluation with several state-of-the-art recognition
methods on four texture and five object databases. On most of these databases, our implementation
exceeds the best reported results and achieves comparable performance on the rest. Finally, we inves-
tigate the influence of background correlations on recognition performance via extensive tests on the
PASCAL database, for which ground-truth object localization information is available. Our experi-
ments demonstrate that image representations based on distributions of local features are surprisingly
effective for classification of texture and object images under challenging real-world conditions, in-
cluding significant intra-class variations and substantial background clutter.
Keywords: image classification, texture recognition, object recognition, scale- and affine-invariant
keypoints, support vector machines, kernel methods.
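As a concrete illustration of the pipeline summarized above (per-image feature histograms compared with the χ² distance inside an exponential kernel, fed to an SVM), the sketch below uses a generalized Gaussian kernel K = exp(−D/A) with A set to the mean training distance. The toy histograms and the particular choice of A are assumptions for illustration, not the authors' exact experimental settings.

```python
import numpy as np
from sklearn.svm import SVC

def chi2_distance(h1, h2):
    # Symmetric chi-square distance between two histograms.
    denom = h1 + h2
    mask = denom > 0
    return 0.5 * np.sum((h1[mask] - h2[mask]) ** 2 / denom[mask])

def kernel_matrix(X, Y, A):
    # Generalized Gaussian kernel K = exp(-D/A) over chi-square distances.
    D = np.array([[chi2_distance(x, y) for y in Y] for x in X])
    return np.exp(-D / A)

# Toy data: per-image feature histograms for two hypothetical classes.
train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
test = np.array([[0.95, 0.05], [0.05, 0.95]])

# Kernel scale A: the mean pairwise training distance (one common choice).
D_train = np.array([[chi2_distance(x, y) for y in train] for x in train])
A = D_train.mean()

clf = SVC(kernel="precomputed").fit(kernel_matrix(train, train, A), labels)
pred = clf.predict(kernel_matrix(test, train, A))
```

Because the kernel is precomputed, any distance between distributions (χ², EMD, or others) can be substituted without changing the classifier code.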
1 Introduction
The recognition of texture and object categories is one of the most challenging problems in computer
vision, especially in the presence of intra-class variation, clutter, occlusion, and pose changes. Histori-
cally, texture and object recognition have been treated as two separate problems in the literature. It is
customary to define texture as a visual pattern characterized by the repetition of a few basic primitives,
or textons [27]. Accordingly, many effective texture recognition approaches [8, 31, 33, 57, 58] obtain tex-
tons by clustering local image features (i.e., appearance descriptors of relatively small neighborhoods),
and represent texture images as histograms or distributions of the resulting textons. Note that these
approaches are orderless, i.e., they retain only the frequencies of the individual features, and discard all
information about their spatial layout. On the other hand, the problem of object recognition has typi-
cally been approached using parts-and-shape models that represent not only the appearance of individual
object components, but also the spatial relations between them [1, 17, 18, 19, 60]. However, recent liter-
ature also contains several proposals to represent the “visual texture” of images containing objects using
orderless bag-of-features models. Such models have proven to be effective for object classification [7, 61],
unsupervised discovery of categories [16, 51, 55], and video retrieval [56]. The success of orderless models
for these object recognition tasks may be explained with the help of an analogy to bag-of-words models
for text document classification [40, 46]. Whereas for texture recognition, local features play the role
of textons, or frequently repeated elements, for object recognition tasks, local features play the role of
“visual words” predictive of a certain “topic,” or object class. For example, an eye is highly predictive
of a face being present in the image. If our visual dictionary contains words that are sufficiently discrim-
inative when taken individually, then it is possible to achieve a high degree of success for whole-image
classification, i.e., identification of the object class contained in the image without attempting to segment
or localize that object, simply by looking at which visual words are present, regardless of their spatial layout.
Overall, there is an emerging consensus in recent literature that orderless methods are effective for both
texture and object description, and this creates the need for a large-scale quantitative evaluation of a single
approach tested on multiple texture and object databases.
To date, state-of-the-art results in both texture [31] and object recognition [18, 23, 48, 61] have been
obtained with local features computed at a sparse set of scale- or affine-invariant keypoint locations
found by specialized interest operators [34, 43]. At the same time, Support Vector Machine (SVM)
classifiers [54] have shown their promise for visual classification tasks (see [50] for an early example),
and the development of kernels suitable for use with local features has emerged as a fruitful line of re-
search [4, 13, 23, 37, 47, 59]. Most existing evaluations of methods combining kernels and local features
have been small-scale and limited to one or two datasets. Moreover, the backgrounds in many of these
datasets, such as COIL-100 [44] or ETH-80 [32], are either (mostly) uniform or highly correlated with the
foreground objects, so that the performance of the methods on challenging real-world imagery cannot be
assessed accurately. This motivates us to build an effective image classification approach combining a
bag-of-keypoints representation with a kernel-based learning method and to test the limits of its perfor-
mance on the most challenging databases available today. Our study consists of three components:
Evaluation of implementation choices. In this paper, we place a particular emphasis on producing
a carefully engineered recognition system, where every component has been chosen to maximize perfor-
mance. To this end, we conduct a comprehensive assessment of many available choices for our method,
including keypoint detector type, level of geometric invariance, feature descriptor, and classifier kernel.
Several practical insights emerge from this process. For example, we show that a combination of multiple
detectors and descriptors usually achieves better results than even the most discriminative individual
detector/descriptor channel. Also, for most datasets in our evaluation, we show that local features with
the highest possible level of invariance do not yield the best performance. Thus, in attempting to design
the most effective recognition system for a practical application, one should seek to incorporate multiple
types of complementary features, but make sure that their local invariance properties do not exceed the
level absolutely required for a given application.
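One simple way to realize such a multi-detector/descriptor combination is to sum per-channel distance matrices, each normalized by its own scale, inside a single exponential kernel. The normalization by the mean channel distance below is a common convention and an assumption of this sketch, not a prescription from the text above.

```python
import numpy as np

def multichannel_kernel(channel_distances, scales=None):
    """Combine per-channel distance matrices D_c into one kernel
    K = exp(-sum_c D_c / A_c), where A_c normalizes each channel
    (here, its mean distance) so no single channel dominates."""
    total = np.zeros_like(channel_distances[0], dtype=float)
    for c, D in enumerate(channel_distances):
        A = scales[c] if scales is not None else D.mean()
        total += D / A
    return np.exp(-total)
```

Each D_c could be, for example, a χ² or EMD distance matrix computed from one detector/descriptor pair; the combined matrix is then passed to an SVM as a precomputed kernel.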
Comparison with existing methods. We conduct a comparative evaluation with several state-of-
the-art methods for texture and object classification on four texture and five object databases. For
texture classification, our approach outperforms existing methods on Brodatz [3], KTH-TIPS [24] and
UIUCTex [31] datasets, and obtains comparable results on the CUReT dataset [9]. For object category
classification, our approach outperforms existing methods on the Xerox7 [61], Graz [48], CalTech6 [18],
CalTech101 [15] and the more difficult test set of the PASCAL challenge [14]. It obtains comparable
results on the easier PASCAL test set. The power of orderless bag-of-keypoints representations is not
particularly surprising in the case of texture images, which lack clutter and have uniform statistical
properties. However, it is not a priori obvious that such representations are sufficient for object category
classification, since they ignore spatial relations and do not separate foreground from background features.
Influence of background features. As stated above, our bag-of-keypoints method uses both fore-
ground and background features to make a classification decision about the image as a whole. For many
existing object datasets, background features are not completely uncorrelated from the foreground, and
may thus provide inadvertent “hints” for recognition (e.g., cars are frequently pictured on a road or in a
parking lot, while faces tend to appear against indoor backgrounds). Therefore, to obtain a complete un-
derstanding of how bags-of-keypoints methods work, it is important to analyze the separate contributions
of foreground and background features. To our knowledge, such an analysis has not been undertaken
to date. In this paper, we study the influence of background features on the diverse and challenging
PASCAL benchmark. Our experiments reveal that, while backgrounds do in fact contain some discrim-
inative information for the foreground category, particularly in “easier” datasets, using foreground and
background features together does not improve the performance of our method. Thus, even in the pres-
ence of background correlations, it is the features on the objects themselves that play the key role for
recognition. But at the same time, we show the danger of training the recognition system on datasets
with monotonous or highly correlated backgrounds—such a system does not perform well on a more
complex test set.

For object recognition, we have deliberately limited our evaluations to the image-level classification
task, i.e., classifying an entire test image as containing an instance of one of a fixed number of given object
classes. This task must be clearly distinguished from localization, or reporting a location hypothesis for
the object that is judged to be present. Though it is possible to perform localization with a bag-of-
keypoints representation, e.g., by incorporating a probabilistic model that can report the likelihood of an
individual feature for a given image and category [55], evaluation of localization accuracy is beyond the
scope of the present paper. It is important to emphasize that we do not propose basic bag-of-keypoints
methods as a solution to the general object recognition problem. Instead, we seek to demonstrate that,
given the right implementation choices, simple orderless image representations with suitable kernels can
be surprisingly effective on a wide variety of imagery. Thus, they can serve as good baselines for measuring
the difficulty of newly acquired datasets and for evaluating more sophisticated recognition approaches
that incorporate structural information about the object.
The rest of this paper is organized as follows. Section 2 presents existing approaches for texture
and object recognition. The components of our approach are described in Section 3. Results are given
in Section 4. We first evaluate the implementation choices relevant to our approach, i.e., we compare
different detectors and descriptors as well as different kernels. We then compare our approach to existing
texture and object category classification methods. In Section 5, we evaluate the effect of changes to the
object background. Section 6 concludes the paper with a summary of our findings and a discussion of
future work.
2 Related work
This section gives a brief survey of recent work on texture and object recognition. As stated in the intro-
duction, these two problems have typically been considered separately in the computer vision literature,
though in the last few years, we have seen a convergence in the types of methods used to attack them,
as orderless bags of features have proven to be effective for both texture and object description.
2.1 Texture recognition
A major challenge in the field of texture analysis and recognition is achieving invariance under a wide
range of geometric and photometric transformations. Early research in this domain has concentrated
on global 2D image transformations, such as rotation and scaling [6, 39]. However, such models do not
accurately capture the effects of 3D transformations (even in-plane rotations) of textured surfaces. More
recently, there has been a great deal of interest in recognizing images of textured surfaces subjected to
lighting and viewpoint changes [8, 9, 33, 35, 57, 58, 62]. Distribution-based methods have been introduced
for classifying 3D textures under varying poses and illumination changes. The basic idea is to compute a
texton histogram based on a universal representative texton dictionary. Leung and Malik [33] constructed
a 3D texton representation for classifying a “stack” of registered images of a test material with known
imaging parameters. The requirement of calibrated cameras limits the applicability of this method in
most practical situations. This limitation was removed by the work of Cula and Dana [8], who used single-
image histograms of 2D textons. Varma and Zisserman [57, 58] have further improved 2D texton-based
representations, achieving very high levels of accuracy on the Columbia-Utrecht reflectance and texture
(CUReT) database [9]. The descriptors used in their work are filter bank outputs [57] and raw pixel
values [58]. Hayman et al. [24] extended this method by using support vector machine classifiers with a
kernel based on the χ² histogram distance. Even though these methods have been successful in the complex task of
classifying images of materials despite significant appearance changes, their representations themselves
are not invariant to the changes in question. In particular, the support regions for computing descriptors
are fixed by hand; no adaptation is performed to compensate for changes in surface orientation with
respect to the camera. Lazebnik et al. [31] have proposed a different strategy, namely, an intrinsically
invariant image representation based on distributions of appearance descriptors computed at a sparse set
of affine-invariant keypoints (in contrast, earlier approaches to texture recognition can be called dense,
since they compute appearance descriptors at every pixel). This approach has achieved promising results
for texture classification under significant viewpoint changes. In the experiments presented in this paper,
we take this approach as a starting point and further improve its discriminative power with a kernel-based
learning method, provide a detailed evaluation of different descriptors and their invariance properties,
and place it in the broader context of both texture and object recognition by measuring the impact of
background clutter on its performance.
2.2 Object recognition
The earliest work on appearance-based object recognition has mainly utilized global descriptions such as
color or texture histograms [45, 50, 53]. The main drawback of such methods is their sensitivity to real-
world sources of variability such as viewpoint and lighting changes, clutter and occlusions. For this reason,
global methods were gradually supplanted over the last decade by part-based methods, which became one
of the dominant paradigms in the object recognition community. Part-based object models combine
appearance descriptors of local features with a representation of their spatial relations. Initially, part-
based methods relied on simple Harris interest points, which only provided translation invariance [1, 60].
Subsequently, local features with higher degrees of invariance were used to obtain robustness against
scaling changes [18] and affine deformations [30]. While part-based models offer an intellectually satisfying
way of representing many real-world objects, learning and inference problems for spatial relations remain
extremely complex and computationally intensive, especially in a weakly supervised setting where the
location of the object in a training image has not been marked by hand. On the other hand, orderless
bag-of-keypoints methods [55, 61] have the advantage of simplicity and computational efficiency, though
they fail to represent the geometric structure of the object class or to distinguish between foreground
and background features. For these reasons, bag-of-keypoints methods can be adversely affected by
clutter, just as earlier global methods based on color or gradient histograms were. One way to overcome this
potential weakness is to use feature selection [12] or boosting [48] to retain only the most discriminative
features for recognition. Another approach is to design novel kernels that can yield high discriminative
power despite the noise and irrelevant information that may be present in local feature sets [23, 37, 59].
While these methods have obtained promising results, they have not been extensively tested on databases
featuring heavily cluttered, uncorrelated backgrounds, so the true extent of their robustness has not been
conclusively determined. Our own approach is related to that of Grauman and Darrell [23], who have
developed a kernel that approximates the optimal partial matching between two feature sets. Specifically,
we use a kernel based on the Earth Mover’s Distance [52], which solves the partial matching problem
exactly. Finally, we note that our image representation is similar to that of [61], though our choice of
local features and classifier kernel results in significantly higher performance.
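For reference, the Earth Mover's Distance between two signatures (feature-space positions with weights) is the optimum of a transportation problem. The sketch below solves that program exactly with a generic linear-programming routine; it is an illustration under the stated formulation, whereas practical implementations use specialized transportation solvers.

```python
import numpy as np
from scipy.optimize import linprog

def emd(pos_p, w_p, pos_q, w_q):
    """Exact EMD between signatures (pos_*: (k, d) positions, w_*: weights),
    normalized by the total flow, with Euclidean ground distance."""
    m, n = len(w_p), len(w_q)
    D = np.linalg.norm(pos_p[:, None, :] - pos_q[None, :, :], axis=2)
    c = D.ravel()  # cost per unit flow f_ij, flattened row-major
    # Inequality constraints: row sums <= w_p, column sums <= w_q.
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([w_p, w_q])
    # Equality constraint: total flow equals the smaller total weight.
    A_eq = np.ones((1, m * n))
    b_eq = [min(w_p.sum(), w_q.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun / b_eq[0]
```

Because the partial matching is solved exactly, identical signatures yield a distance of zero and the measure degrades gracefully when the two signatures have different numbers of clusters.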
3 Components of the representation
This section introduces our image representation based on sparse local features. We first discuss scale-
and affine-invariant local regions and the descriptors of their appearance. We then describe different
image signatures and similarity measures suitable for comparing them.
3.1 Scale- and affine-invariant region detectors
In this paper, we use two complementary local region detector types to extract salient image structures:
The Harris-Laplace detector [43] responds to corner-like regions, while the Laplacian detector [34] extracts
blob-like regions (Fig. 1).
At the most basic level, these two detectors are invariant to scale transformations alone, i.e., they
output circular regions at a certain characteristic scale. To achieve rotation invariance, we can either
use rotationally invariant descriptors—for example, SPIN and RIFT [31], as presented in the following
section—or rotate the circular regions in the direction of the dominant gradient orientation [36, 43].
In our implementation, the dominant gradient orientation is computed as the average of all gradient
orientations in the region. Finally, we obtain affine-invariant versions of the Harris-Laplace and Laplacian
detectors through the use of an affine adaptation procedure [21, 42]. Affinely adapted detectors output
ellipse-shaped regions which are then normalized, i.e., transformed into circles. Normalization leaves a
rotational ambiguity that can be eliminated either by using rotation-invariant descriptors or by finding
the dominant gradient orientation, as described above.
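As an illustration of the orientation step, the dominant gradient orientation of a normalized grayscale patch can be estimated from its image gradients. The vector-sum formulation below is one standard, circular-safe way to average gradient directions; it is an assumption of this sketch rather than necessarily the exact averaging used in our implementation.

```python
import numpy as np

def dominant_orientation(patch):
    # Estimate the dominant gradient orientation of a grayscale patch
    # from the summed gradient vector (a circular-safe average of the
    # per-pixel gradient directions).
    gy, gx = np.gradient(patch.astype(float))
    return np.arctan2(gy.sum(), gx.sum())
```

Rotating the normalized region by the negative of this angle then removes the rotational ambiguity mentioned above.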