
Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study


HAL Id: hal-00171412
https://hal.archives-ouvertes.fr/hal-00171412
Submitted on 19 Apr 2011
Local features and kernels for classication of texture
and object categories: a comprehensive study
Jianguo Zhang, Marcin Marszalek, Svetlana Lazebnik, Cordelia Schmid
To cite this version:
Jianguo Zhang, Marcin Marszalek, Svetlana Lazebnik, Cordelia Schmid. Local features and kernels for
classication of texture and object categories: a comprehensive study. International Journal of Com-
puter Vision, Springer Verlag, 2007, 73 (2), pp.213-238. �10.1007/s11263-006-9794-4�. �hal-00171412�

Local features and kernels for classification of texture
and object categories: A comprehensive study
J. Zhang¹, M. Marszałek¹, S. Lazebnik² and C. Schmid¹
¹ INRIA, GRAVIR-CNRS, 655, av. de l’Europe, 38330 Montbonnot, France
² Beckman Institute, University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801, USA
Abstract
Recently, methods based on local image features have shown promise for texture and object recog-
nition tasks. This paper presents a large-scale evaluation of an approach that represents images as
distributions (signatures or histograms) of features extracted from a sparse set of keypoint locations
and learns a Support Vector Machine classifier with kernels based on two effective measures for
comparing distributions, the Earth Mover’s Distance and the χ² distance. We first evaluate the per-
formance of our approach with different keypoint detectors and descriptors, as well as different kernels
and classifiers. We then conduct a comparative evaluation with several state-of-the-art recognition
methods on four texture and five object databases. On most of these databases, our implementation
exceeds the best reported results and achieves comparable performance on the rest. Finally, we inves-
tigate the influence of background correlations on recognition performance via extensive tests on the
PASCAL database, for which ground-truth object localization information is available. Our experi-
ments demonstrate that image representations based on distributions of local features are surprisingly
effective for classification of texture and object images under challenging real-world conditions, in-
cluding significant intra-class variations and substantial background clutter.
Keywords: image classification, texture recognition, object recognition, scale- and affine-invariant
keypoints, support vector machines, kernel methods.
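As a concrete illustration of the pipeline summarized above (per-image feature histograms compared with the χ² distance inside an exponential kernel, fed to an SVM), the sketch below uses a generalized Gaussian kernel K = exp(−D/A) with A set to the mean training distance. The toy histograms and the particular choice of A are assumptions for illustration, not the authors' exact experimental settings.

```python
import numpy as np
from sklearn.svm import SVC

def chi2_distance(h1, h2):
    # Symmetric chi-square distance between two histograms.
    denom = h1 + h2
    mask = denom > 0
    return 0.5 * np.sum((h1[mask] - h2[mask]) ** 2 / denom[mask])

def kernel_matrix(X, Y, A):
    # Generalized Gaussian kernel K = exp(-D/A) over chi-square distances.
    D = np.array([[chi2_distance(x, y) for y in Y] for x in X])
    return np.exp(-D / A)

# Toy data: per-image feature histograms for two hypothetical classes.
train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
test = np.array([[0.95, 0.05], [0.05, 0.95]])

# Kernel scale A: the mean pairwise training distance (one common choice).
D_train = np.array([[chi2_distance(x, y) for y in train] for x in train])
A = D_train.mean()

clf = SVC(kernel="precomputed").fit(kernel_matrix(train, train, A), labels)
pred = clf.predict(kernel_matrix(test, train, A))
```

Because the kernel is precomputed, any distance between distributions (χ², EMD, or others) can be substituted without changing the classifier code.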
1 Introduction
The recognition of texture and object categories is one of the most challenging problems in computer
vision, especially in the presence of intra-class variation, clutter, occlusion, and pose changes. Histori-
cally, texture and object recognition have been treated as two separate problems in the literature. It is
customary to define texture as a visual pattern characterized by the repetition of a few basic primitives,
or textons [27]. Accordingly, many effective texture recognition approaches [8, 31, 33, 57, 58] obtain tex-
tons by clustering local image features (i.e., appearance descriptors of relatively small neighborhoods),
and represent texture images as histograms or distributions of the resulting textons. Note that these
approaches are orderless, i.e., they retain only the frequencies of the individual features, and discard all
information about their spatial layout. On the other hand, the problem of object recognition has typi-
cally been approached using parts-and-shape models that represent not only the appearance of individual
object components, but also the spatial relations between them [1, 17, 18, 19, 60]. However, recent liter-
ature also contains several proposals to represent the “visual texture” of images containing objects using
orderless bag-of-features models. Such models have proven to be effective for object classification [7, 61],
unsupervised discovery of categories [16, 51, 55], and video retrieval [56]. The success of orderless models
for these object recognition tasks may be explained with the help of an analogy to bag-of-words models
for text document classification [40, 46]. Whereas for texture recognition, local features play the role
of textons, or frequently repeated elements, for object recognition tasks, local features play the role of
“visual words” predictive of a certain “topic,” or object class. For example, an eye is highly predictive
of a face being present in the image. If our visual dictionary contains words that are sufficiently discrim-
inative when taken individually, then it is possible to achieve a high degree of success for whole-image
classification, i.e., identification of the object class contained in the image without attempting to segment
or localize that object, simply by looking at which visual words are present, regardless of their spatial layout.
Overall, there is an emerging consensus in recent literature that orderless methods are effective for both
texture and object description, and this creates the need for a large-scale quantitative evaluation of a single
approach tested on multiple texture and object databases.
To date, state-of-the-art results in both texture [31] and object recognition [18, 23, 48, 61] have been
obtained with local features computed at a sparse set of scale- or affine-invariant keypoint locations
found by specialized interest operators [34, 43]. At the same time, Support Vector Machine (SVM)
classifiers [54] have shown their promise for visual classification tasks (see [50] for an early example),
and the development of kernels suitable for use with local features has emerged as a fruitful line of re-
search [4, 13, 23, 37, 47, 59]. Most existing evaluations of methods combining kernels and local features
have been small-scale and limited to one or two datasets. Moreover, the backgrounds in many of these
datasets, such as COIL-100 [44] or ETH-80 [32], are either (mostly) uniform or highly correlated with the
foreground objects, so that the performance of the methods on challenging real-world imagery cannot be
assessed accurately. This motivates us to build an effective image classification approach combining a
bag-of-keypoints representation with a kernel-based learning method and to test the limits of its perfor-
mance on the most challenging databases available today. Our study consists of three components:
Evaluation of implementation choices. In this paper, we place a particular emphasis on producing
a carefully engineered recognition system, where every component has been chosen to maximize perfor-
mance. To this end, we conduct a comprehensive assessment of many available choices for our method,
including keypoint detector type, level of geometric invariance, feature descriptor, and classifier kernel.
Several practical insights emerge from this process. For example, we show that a combination of multiple
detectors and descriptors usually achieves better results than even the most discriminative individual
detector/descriptor channel. Also, for most datasets in our evaluation, we show that local features with
the highest possible level of invariance do not yield the best performance. Thus, in attempting to design
the most effective recognition system for a practical application, one should seek to incorporate multiple
types of complementary features, but make sure that their local invariance properties do not exceed the
level absolutely required for a given application.
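One simple way to realize such a multi-detector/descriptor combination is to sum per-channel distance matrices, each normalized by its own scale, inside a single exponential kernel. The normalization by the mean channel distance below is a common convention and an assumption of this sketch, not a prescription from the text above.

```python
import numpy as np

def multichannel_kernel(channel_distances, scales=None):
    """Combine per-channel distance matrices D_c into one kernel
    K = exp(-sum_c D_c / A_c), where A_c normalizes each channel
    (here, its mean distance) so no single channel dominates."""
    total = np.zeros_like(channel_distances[0], dtype=float)
    for c, D in enumerate(channel_distances):
        A = scales[c] if scales is not None else D.mean()
        total += D / A
    return np.exp(-total)
```

Each D_c could be, for example, a χ² or EMD distance matrix computed from one detector/descriptor pair; the combined matrix is then passed to an SVM as a precomputed kernel.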
Comparison with existing methods. We conduct a comparative evaluation with several state-of-
the-art methods for texture and object classification on four texture and five object databases. For
texture classification, our approach outperforms existing methods on Brodatz [3], KTH-TIPS [24] and
UIUCTex [31] datasets, and obtains comparable results on the CUReT dataset [9]. For object category
classification, our approach outperforms existing methods on the Xerox7 [61], Graz [48], CalTech6 [18],
CalTech101 [15] and the more difficult test set of the PASCAL challenge [14]. It obtains comparable
results on the easier PASCAL test set. The power of orderless bag-of-keypoints representations is not
particularly surprising in the case of texture images, which lack clutter and have uniform statistical
properties. However, it is not a priori obvious that such representations are sufficient for object category
classification, since they ignore spatial relations and do not separate foreground from background features.
Influence of background features. As stated above, our bag-of-keypoints method uses both fore-
ground and background features to make a classification decision about the image as a whole. For many
existing object datasets, background features are not completely uncorrelated from the foreground, and
may thus provide inadvertent “hints” for recognition (e.g., cars are frequently pictured on a road or in a
parking lot, while faces tend to appear against indoor backgrounds). Therefore, to obtain a complete un-
derstanding of how bags-of-keypoints methods work, it is important to analyze the separate contributions
of foreground and background features. To our knowledge, such an analysis has not been undertaken
to date. In this paper, we study the influence of background features on the diverse and challenging
PASCAL benchmark. Our experiments reveal that, while backgrounds do in fact contain some discrim-
inative information for the foreground category, particularly in “easier” datasets, using foreground and
background features together does not improve the performance of our method. Thus, even in the pres-
ence of background correlations, it is the features on the objects themselves that play the key role for
recognition. But at the same time, we show the danger of training the recognition system on datasets
with monotonous or highly correlated backgrounds—such a system does not perform well on a more
complex test set.

For object recognition, we have deliberately limited our evaluations to the image-level classification
task, i.e., classifying an entire test image as containing an instance of one of a fixed number of given object
classes. This task must be clearly distinguished from localization, or reporting a location hypothesis for
the object that is judged to be present. Though it is possible to perform localization with a bag-of-
keypoints representation, e.g., by incorporating a probabilistic model that can report the likelihood of an
individual feature for a given image and category [55], evaluation of localization accuracy is beyond the
scope of the present paper. It is important to emphasize that we do not propose basic bag-of-keypoints
methods as a solution to the general object recognition problem. Instead, we seek to demonstrate that,
given the right implementation choices, simple orderless image representations with suitable kernels can
be surprisingly effective on a wide variety of imagery. Thus, they can serve as good baselines for measuring
the difficulty of newly acquired datasets and for evaluating more sophisticated recognition approaches
that incorporate structural information about the object.
The rest of this paper is organized as follows. Section 2 presents existing approaches for texture
and object recognition. The components of our approach are described in Section 3. Results are given
in Section 4. We first evaluate the implementation choices relevant to our approach, i.e., we compare
different detectors and descriptors as well as different kernels. We then compare our approach to existing
texture and object category classification methods. In Section 5, we evaluate the effect of changes to the
object background. Section 6 concludes the paper with a summary of our findings and a discussion of
future work.
2 Related work
This section gives a brief survey of recent work on texture and object recognition. As stated in the intro-
duction, these two problems have typically been considered separately in the computer vision literature,
though in the last few years, we have seen a convergence in the types of methods used to attack them,
as orderless bags of features have proven to be effective for both texture and object description.
2.1 Texture recognition
A major challenge in the field of texture analysis and recognition is achieving invariance under a wide
range of geometric and photometric transformations. Early research in this domain has concentrated
on global 2D image transformations, such as rotation and scaling [6, 39]. However, such models do not
accurately capture the effects of 3D transformations (even in-plane rotations) of textured surfaces. More
recently, there has been a great deal of interest in recognizing images of textured surfaces subjected to
lighting and viewpoint changes [8, 9, 33, 35, 57, 58, 62]. Distribution-based methods have been introduced
for classifying 3D textures under varying poses and illumination changes. The basic idea is to compute a
texton histogram based on a universal representative texton dictionary. Leung and Malik [33] constructed
a 3D texton representation for classifying a “stack” of registered images of a test material with known
imaging parameters. The requirement of calibrated cameras limits the applicability of this method in
most practical situations. This limitation was removed by the work of Cula and Dana [8], who used single-
image histograms of 2D textons. Varma and Zisserman [57, 58] have further improved 2D texton-based
representations, achieving very high levels of accuracy on the Columbia-Utrecht reflectance and texture
(CUReT) database [9]. The descriptors used in their work are filter bank outputs [57] and raw pixel
values [58]. Hayman et al. [24] extended this method by using support vector machine classifiers with a
kernel based on the χ² histogram distance. Even though these methods have been successful in the complex task of
classifying images of materials despite significant appearance changes, their representations themselves
are not invariant to the changes in question. In particular, the support regions for computing descriptors
are fixed by hand; no adaptation is performed to compensate for changes in surface orientation with
respect to the camera. Lazebnik et al. [31] have proposed a different strategy, namely, an intrinsically
invariant image representation based on distributions of appearance descriptors computed at a sparse set
of affine-invariant keypoints (in contrast, earlier approaches to texture recognition can be called dense,
since they compute appearance descriptors at every pixel). This approach has achieved promising results
for texture classification under significant viewpoint changes. In the experiments presented in this paper,
we take this approach as a starting point and further improve its discriminative power with a kernel-based
learning method, provide a detailed evaluation of different descriptors and their invariance properties,
and place it in the broader context of both texture and object recognition by measuring the impact of
background clutter on its performance.
2.2 Object recognition
The earliest work on appearance-based object recognition has mainly utilized global descriptions such as
color or texture histograms [45, 50, 53]. The main drawback of such methods is their sensitivity to real-
world sources of variability such as viewpoint and lighting changes, clutter and occlusions. For this reason,
global methods were gradually supplanted over the last decade by part-based methods, which became one
of the dominant paradigms in the object recognition community. Part-based object models combine
appearance descriptors of local features with a representation of their spatial relations. Initially, part-
based methods relied on simple Harris interest points, which only provided translation invariance [1, 60].
Subsequently, local features with higher degrees of invariance were used to obtain robustness against
scaling changes [18] and affine deformations [30]. While part-based models offer an intellectually satisfying
way of representing many real-world objects, learning and inference problems for spatial relations remain
extremely complex and computationally intensive, especially in a weakly supervised setting where the
location of the object in a training image has not been marked by hand. On the other hand, orderless
bag-of-keypoints methods [55, 61] have the advantage of simplicity and computational efficiency, though
they fail to represent the geometric structure of the object class or to distinguish between foreground
and background features. For these reasons, bag-of-keypoints methods can be adversely affected by
clutter, just as earlier global methods based on color or gradient histograms were. One way to overcome this
potential weakness is to use feature selection [12] or boosting [48] to retain only the most discriminative
features for recognition. Another approach is to design novel kernels that can yield high discriminative
power despite the noise and irrelevant information that may be present in local feature sets [23, 37, 59].
While these methods have obtained promising results, they have not been extensively tested on databases
featuring heavily cluttered, uncorrelated backgrounds, so the true extent of their robustness has not been
conclusively determined. Our own approach is related to that of Grauman and Darrell [23], who have
developed a kernel that approximates the optimal partial matching between two feature sets. Specifically,
we use a kernel based on the Earth Mover’s Distance [52], which solves the partial matching problem
exactly. Finally, we note that our image representation is similar to that of [61], though our choice of
local features and classifier kernel results in significantly higher performance.
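For reference, the Earth Mover's Distance between two signatures (feature-space positions with weights) is the optimum of a transportation problem. The sketch below solves that program exactly with a generic linear-programming routine; it is an illustration under the stated formulation, whereas practical implementations use specialized transportation solvers.

```python
import numpy as np
from scipy.optimize import linprog

def emd(pos_p, w_p, pos_q, w_q):
    """Exact EMD between signatures (pos_*: (k, d) positions, w_*: weights),
    normalized by the total flow, with Euclidean ground distance."""
    m, n = len(w_p), len(w_q)
    D = np.linalg.norm(pos_p[:, None, :] - pos_q[None, :, :], axis=2)
    c = D.ravel()  # cost per unit flow f_ij, flattened row-major
    # Inequality constraints: row sums <= w_p, column sums <= w_q.
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([w_p, w_q])
    # Equality constraint: total flow equals the smaller total weight.
    A_eq = np.ones((1, m * n))
    b_eq = [min(w_p.sum(), w_q.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun / b_eq[0]
```

Because the partial matching is solved exactly, identical signatures yield a distance of zero and the measure degrades gracefully when the two signatures have different numbers of clusters.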
3 Components of the representation
This section introduces our image representation based on sparse local features. We first discuss scale-
and affine-invariant local regions and the descriptors of their appearance. We then describe different
image signatures and similarity measures suitable for comparing them.
3.1 Scale- and affine-invariant region detectors
In this paper, we use two complementary local region detector types to extract salient image structures:
The Harris-Laplace detector [43] responds to corner-like regions, while the Laplacian detector [34] extracts
blob-like regions (Fig. 1).
At the most basic level, these two detectors are invariant to scale transformations alone, i.e., they
output circular regions at a certain characteristic scale. To achieve rotation invariance, we can either
use rotationally invariant descriptors—for example, SPIN and RIFT [31], as presented in the following
section—or rotate the circular regions in the direction of the dominant gradient orientation [36, 43].
In our implementation, the dominant gradient orientation is computed as the average of all gradient
orientations in the region. Finally, we obtain affine-invariant versions of the Harris-Laplace and Laplacian
detectors through the use of an affine adaptation procedure [21, 42]. Affinely adapted detectors output
ellipse-shaped regions which are then normalized, i.e., transformed into circles. Normalization leaves a
rotational ambiguity that can be eliminated either by using rotation-invariant descriptors or by finding
the dominant gradient orientation, as described above.
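As an illustration of the orientation step, the dominant gradient orientation of a normalized grayscale patch can be estimated from its image gradients. The vector-sum formulation below is one standard, circular-safe way to average gradient directions; it is an assumption of this sketch rather than necessarily the exact averaging used in our implementation.

```python
import numpy as np

def dominant_orientation(patch):
    # Estimate the dominant gradient orientation of a grayscale patch
    # from the summed gradient vector (a circular-safe average of the
    # per-pixel gradient directions).
    gy, gx = np.gradient(patch.astype(float))
    return np.arctan2(gy.sum(), gx.sum())
```

Rotating the normalized region by the negative of this angle then removes the rotational ambiguity mentioned above.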