What are the contributions in "Image classification using super-vector coding of local image descriptors"?

This paper introduces a new framework for image classification using local visual descriptors. For all the three steps the authors suggest novel solutions which make their approach appealing in theory, more scalable in computation, and transparent in classification.

What is the main challenge to the computer vision community?

Image classification, including object recognition and scene classification, remains a major challenge to the computer vision community.

What is the function that is used to find whether an image is contained in a particular category?

An image's spatial pyramid representation is then obtained by concatenating the results of local pooling:

$$\Phi_s(X) = \left[ \Phi(X^1_{11}),\ \Phi(X^2_{11}),\ \Phi(X^2_{12}),\ \Phi(X^2_{21}),\ \Phi(X^2_{22}),\ \Phi(X^3_{11}),\ \Phi(X^3_{12}),\ \Phi(X^3_{13}) \right]$$

Image classification is done by applying classifiers based on the image representations obtained from the pooling step.
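The concatenation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain average pooling stands in for the paper's pooling operator, the grid shapes `(1, 1)`, `(2, 2)`, `(1, 3)` are a guess at how the eight regions are laid out, and the names `average_pool` and `spatial_pyramid` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def average_pool(codes: Sequence[Vector], dim: int) -> Vector:
    """Average a list of code vectors; an empty region pools to all zeros."""
    if not codes:
        return [0.0] * dim
    return [sum(c[j] for c in codes) / len(codes) for j in range(dim)]


def spatial_pyramid(
    codes: Sequence[Vector],
    coords: Sequence[Tuple[float, float]],  # (row, col) positions scaled to [0, 1)
    grids: Sequence[Tuple[int, int]] = ((1, 1), (2, 2), (1, 3)),  # assumed layouts
) -> Vector:
    """Concatenate the pooled codes of every cell of every grid level."""
    dim = len(codes[0])
    feature: Vector = []
    for rows, cols in grids:
        for r in range(rows):
            for c in range(cols):
                # Codes of the descriptors whose position falls inside cell (r, c).
                in_cell = [
                    code
                    for code, (y, x) in zip(codes, coords)
                    if int(y * rows) == r and int(x * cols) == c
                ]
                feature.extend(average_pool(in_cell, dim))
    return feature
```

With the default grids the result has 8 blocks (1 + 4 + 3 regions), each as long as a single descriptor code.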

What is the way to train a classifier?

Once the model is trained, Eq. (5) suggests that one can compute a response map based on g(x), which visualizes where in the image the classifier focuses, as shown in their experiments.

What is the effect of a change of the measure µ(·)?

In particular, a change of measure µ(·) (still piece-wise constant in each partition) leads to a rescaling of different components in Φ(X).

What is the result of $g(x) = w^\top \phi(x)$?

As a result, $g(x) = w^\top \phi(x)$ gives rise to a locally linear (i.e., piece-wise linear) function that approximates the unknown nonlinear function f(x).

What is the difference between the two methods?

Since their method naturally requires a linear classifier, its training cost scales linearly in the number of training images, while nonlinear kernel-based methods suffer quadratic or higher complexity.

What is the method for coding images?

These methods are: (1) VQ coding – using the Bhattacharyya kernel on spatial pyramid histogram representations; (2) GMM – the method described in [22]; (3) SV – the super-vector coding proposed by this paper; (4) SV-soft – the soft version of SV coding, where $[p_k(x)]_k$ for each x is truncated to retain the top 20 elements, with the remaining elements set to zero.
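The top-20 truncation used by SV-soft might look like the sketch below. The function name `truncate_top_k` is hypothetical, and the vector is deliberately not renormalized afterwards, since the text does not say whether the paper does so.

```python
from typing import List


def truncate_top_k(probs: List[float], k: int = 20) -> List[float]:
    """Keep the k largest entries of probs and set every other entry to zero."""
    if k >= len(probs):
        return list(probs)
    # Indices of the k largest values; ties are broken in favour of the lower index.
    keep = set(sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(probs)]
```

For example, `truncate_top_k([0.1, 0.5, 0.05, 0.35], k=2)` keeps only the entries 0.5 and 0.35.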

What is the purpose of this paper?

The work stresses the importance of learning good coding of local descriptors in the context of image classification, and makes the first attempt to formally incorporate the metric of local descriptors into distribution kernels.

(Open Access) Image classification using super-vector coding of local image descriptors (2010) | Xi Zhou

Q: What is the notable example of the SPM kernel approach?

One notable example is the SPM kernel approach [4], which applies average pooling on top of VQ coding, plus a nonlinear SVM classifier using Chi-square or intersection kernels.

Image Classification using Super-Vector Coding of Local Image Descriptors

Xi Zhou†, Kai Yu‡, Tong Zhang∗, and Thomas S. Huang†

† Dept. of ECE, University of Illinois at Urbana-Champaign, Illinois

‡ NEC Laboratories America, California

∗ Department of Statistics, Rutgers University, New Jersey

Abstract. This paper introduces a new framework for image classification using local visual descriptors. The pipeline first performs a nonlinear feature transformation on descriptors, then aggregates the results together to form image-level representations, and finally applies a classification model. For all the three steps we suggest novel solutions which make our approach appealing in theory, more scalable in computation, and transparent in classification. Our experiments demonstrate that the proposed classification method achieves state-of-the-art accuracy on the well-known PASCAL benchmarks.

1 Introduction

Image classification, including object recognition and scene classification, remains a major challenge to the computer vision community. Perhaps one of the most significant developments in the last decade is the application of local features to image classification, including the introduction of the “bag-of-visual-words” representation that inspires and initiates a lot of research efforts [1].

A large body of work investigates probabilistic generative models, with the objective of understanding the semantic content of images. Typically those models extend the famous topic models on bag-of-word representation by further considering the spatial information of visual words [2][3].

This paper follows another line of research on building discriminative models for classification. The previous work includes SVMs using pyramid matching kernels [4], biologically-inspired models [5][6], and KNN methods [7][8][9]. Over the past years, the nonlinear SVM method using spatial pyramid matching (SPM) kernels [4][10] seems to be dominant among the top performers in various image classification benchmarks, including Caltech-101 [11], PASCAL [12], and TRECVID. The recent improvements were often achieved by combining different types of local descriptors [10][13][14], without any fundamental change of the underlying classification method. In addition to the demand for more accurate classifiers, one has to develop more practical methods. Nonlinear SVMs scale at least quadratically with the size of training data, which makes it nontrivial to handle large-scale training data. It is thus necessary to design algorithms that are computationally more efficient.

2 ECCV-10 submission ID 453

1.1 Overview of Our Approach

Our work represents each image by a set of local descriptors with their spatial coordinates. The descriptor can be SIFT, or any other local features, computed from image patches at locations on a 2D grid. Our image classification method consists of three computational steps:

1. Descriptor coding:

Each descriptor of an image is nonlinearly mapped to form a high-dimensional sparse vector. We propose a novel nonlinear coding method called Super-Vector coding, which is algorithmically a simple extension of Vector Quantization (VQ) coding;

2. Spatial pooling:

For each local region, the codes of all the descriptors in it are aggregated to form a single vector, then vectors of different regions are concatenated to form the image-level feature vector. Our pooling is based on a novel probability kernel incorporating the similarity metric of local descriptors;

3. Image classification:

The image-level feature vector is normalized and fed into a classifier. We choose linear SVMs, which scale linearly with the size of training data.

We note that the coding-pooling-classification pipeline is the de facto framework for image scene classification. One notable example is the SPM kernel approach [4], which applies average pooling on top of VQ coding, plus a nonlinear SVM classifier using Chi-square or intersection kernels.

In this paper, we propose novel methods for each of the three steps and formalize their underlying mathematical principles. The work stresses the importance of learning good coding of local descriptors in the context of image classification, and makes the first attempt to formally incorporate the metric of local descriptors into distribution kernels. Putting all these together, the overall image classification framework enjoys a linear training complexity, and also an interpretability that is missing in conventional models (see details in Sec. 2.3). Most importantly, our method demonstrates state-of-the-art performance on the challenging PASCAL07 and PASCAL09 image classification benchmarks.

2 The Method

In the following we will describe all the three steps of our image classification pipeline in detail.

2.1 Descriptor Coding

We introduce a novel coding method, which enjoys appealing theoretical properties. Suppose we are interested in learning a smooth nonlinear function $f(x)$ defined on a high-dimensional space $\mathbb{R}^d$. The question is how to derive a good coding scheme (or nonlinear mapping) $\phi(x)$ such that $f(x)$ can be well approximated by a linear function on it, namely $w^\top \phi(x)$. Our only assumption here is that $f(x)$ should be sufficiently smooth.

Let us consider a general unsupervised learning setting, where a set of bases $C \subset \mathbb{R}^d$, called a codebook or dictionary, is employed to approximate any $x$, namely,

$$x \approx \sum_{v \in C} \gamma_v(x)\, v,$$

where $\gamma(x) = [\gamma_v(x)]_{v \in C}$ are the coefficients, and sometimes $\sum_{v} \gamma_v(x) = 1$. By restricting the cardinality of nonzeros of $\gamma(x)$ to be 1 and $\gamma_v(x) \ge 0$, we obtain the Vector Quantization (VQ) method

$$v_*(x) = \arg\min_{v \in C} \|x - v\|,$$

where $\|\cdot\|$ is the Euclidean norm (2-norm). The VQ method uses the coding $\gamma_v(x) = 1$ if $v = v_*(x)$ and $\gamma_v(x) = 0$ otherwise. We say that $f(x)$ is $\beta$ Lipschitz derivative smooth if for all $x, x' \in \mathbb{R}^d$

$$\left| f(x) - f(x') - \nabla f(x')^\top (x - x') \right| \le \frac{\beta}{2} \|x - x'\|^2.$$

It immediately implies the following simple function approximation bound via VQ coding: for all $x \in \mathbb{R}^d$

$$\left| f(x) - f\big(v_*(x)\big) - \nabla f\big(v_*(x)\big)^\top \big(x - v_*(x)\big) \right| \le \frac{\beta}{2} \|x - v_*(x)\|^2. \qquad (1)$$

This bound simply states that one can approximate $f(x)$ by $f\big(v_*(x)\big) + \nabla f\big(v_*(x)\big)^\top \big(x - v_*(x)\big)$, and the approximation error is upper bounded by the quality of VQ. It further suggests that the function approximation can be improved by learning the codebook $C$ to minimize this upper bound. One way is the K-means algorithm

$$C = \arg\min_{C} \left\{ \sum_{x} \min_{v \in C} \|x - v\|^2 \right\}.$$
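As an illustration of the K-means objective above, here is a minimal pure-Python sketch of Lloyd's iterations. It is a generic textbook version, not the authors' implementation; the function names and the requirement that the caller supplies the initial codewords are assumptions made for this example.

```python
from typing import List, Sequence

Vector = List[float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(x: Sequence[float], codebook: Sequence[Vector]) -> int:
    """Index of the codeword closest to x (the VQ assignment)."""
    return min(range(len(codebook)), key=lambda k: sq_dist(x, codebook[k]))


def kmeans(data: Sequence[Vector], init: Sequence[Vector], iters: int = 100) -> List[Vector]:
    """Lloyd's iterations starting from the given initial codewords."""
    codebook = [list(v) for v in init]
    for _ in range(iters):
        # Assignment step: group the points by their nearest codeword.
        clusters: List[List[Vector]] = [[] for _ in codebook]
        for x in data:
            clusters[nearest(x, codebook)].append(list(x))
        # Update step: move each codeword to the mean of its cluster
        # (an empty cluster keeps its previous codeword).
        updated = [
            [sum(col) / len(members) for col in zip(*members)] if members else old
            for members, old in zip(clusters, codebook)
        ]
        if updated == codebook:  # codewords no longer move, so stop early
            break
        codebook = updated
    return codebook


def quantization_cost(data: Sequence[Vector], codebook: Sequence[Vector]) -> float:
    """The K-means objective: sum over x of min_v ||x - v||^2."""
    return sum(sq_dist(x, codebook[nearest(x, codebook)]) for x in data)
```

Each iteration cannot increase `quantization_cost`, which is the quantity that controls the right-hand side of the approximation bound in Eq. (1).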

Eq. (1) also suggests that the approximation to $f(x)$ can be expressed as a linear function on a nonlinear coding scheme

$$f(x) \approx g(x) \equiv w^\top \phi(x),$$

where $\phi(x)$ is called the Super-Vector (SV) coding of $x$, defined by

$$\phi(x) = \left[ s\,\gamma_v(x),\ \gamma_v(x)\,(x - v)^\top \right]^\top_{v \in C} \qquad (2)$$

where $s$ is a nonnegative constant. It is not difficult to see that $w = \left[ \tfrac{1}{s} f(v),\ \nabla f(v)^\top \right]^\top_{v \in C}$, which can be regarded as unknown parameters to be estimated. Because $\gamma_v(x) = 1$ if $v = v_*(x)$, otherwise $\gamma_v(x) = 0$, the obtained $\phi(x)$ is a highly sparse representation, with dimensionality $|C|(d+1)$. For example, if $|C| = 3$ and $\gamma(x) = [0, 1, 0]$, then

$$\phi(x) = \Big[ \underbrace{0, \ldots, 0}_{d+1 \text{ dim.}},\ \underbrace{s,\ (x - v)^\top}_{d+1 \text{ dim.}},\ \underbrace{0, \ldots, 0}_{d+1 \text{ dim.}} \Big]^\top \qquad (3)$$
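The block structure of Eq. (3) can be reproduced with a short sketch. The function name `sv_code` and the default `s = 1.0` are assumptions for illustration; the code follows the hard-assignment definition of Eq. (2), with one (d+1)-dimensional block per codeword.

```python
from typing import List, Sequence


def sv_code(x: Sequence[float], codebook: Sequence[Sequence[float]], s: float = 1.0) -> List[float]:
    """Super-Vector code of x: all blocks are zero except that of the nearest codeword."""
    d = len(x)
    # VQ step: index of the codeword with the smallest Euclidean distance to x.
    k_star = min(
        range(len(codebook)),
        key=lambda k: sum((xi - vi) ** 2 for xi, vi in zip(x, codebook[k])),
    )
    phi = [0.0] * (len(codebook) * (d + 1))
    start = k_star * (d + 1)
    phi[start] = s  # the s * gamma_v(x) entry
    for j in range(d):
        phi[start + 1 + j] = x[j] - codebook[k_star][j]  # the (x - v) entries
    return phi
```

With three 2-D codewords the code has 3 × (2 + 1) = 9 entries, of which at most 3 are nonzero.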

Fig. 1. Function $f(x)$ approximated by $w^\top \phi(x)$; panels (1), (2), (3).

As illustrated in Figure 1, $w^\top \phi(x)$ provides a piece-wise linear function to approximate a nonlinear function $f(x)$, as shown in Figure 1-(2), while with VQ coding $\phi(x) = [\gamma_v(x)]_{v \in C}$, the same formulation $w^\top \phi(x)$ gives a piece-wise constant approximation, as shown in Figure 1-(3). This intuitively suggests that SV coding may achieve a lower function approximation error than VQ coding. We note that the popular bag-of-features image classification method essentially employs VQ to obtain histogram representations. The proposed SV coding is a simple extension of VQ, and may lead to a better approach to image classification.
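The difference between the two approximations can be checked numerically on a toy problem. The sketch below is an illustration only: the target f(x) = x² (whose derivative is 2-Lipschitz, so β = 2), the 1-D codebook, the choice s = 0.5 (s must be positive here because the weights divide by it), and all function names are assumptions. The weights are set to the ideal values [f(v)/s, f'(v)] rather than learned.

```python
from typing import List, Sequence


def f(x: float) -> float:
    """Toy target function (an assumption for this demo)."""
    return x * x


def grad_f(x: float) -> float:
    """Derivative of the toy target."""
    return 2.0 * x


def nearest_index(x: float, codebook: Sequence[float]) -> int:
    """Index of the codeword closest to x."""
    return min(range(len(codebook)), key=lambda k: abs(x - codebook[k]))


def sv_features(x: float, codebook: Sequence[float], s: float) -> List[float]:
    """1-D Super-Vector code: [s, x - v] in the nearest codeword's block, zeros elsewhere."""
    phi = [0.0] * (2 * len(codebook))
    k = nearest_index(x, codebook)
    phi[2 * k] = s
    phi[2 * k + 1] = x - codebook[k]
    return phi


def sv_weights(codebook: Sequence[float], s: float) -> List[float]:
    """Ideal weights [f(v) / s, f'(v)] per codeword, so w . phi(x) is a local Taylor expansion."""
    w: List[float] = []
    for v in codebook:
        w.extend([f(v) / s, grad_f(v)])
    return w


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


codebook = [-1.0, 0.0, 1.0]
s = 0.5
w = sv_weights(codebook, s)
x = 0.75
v = codebook[nearest_index(x, codebook)]

sv_error = abs(f(x) - dot(w, sv_features(x, codebook, s)))  # piece-wise linear fit
vq_error = abs(f(x) - f(v))                                 # piece-wise constant fit
bound = (2.0 / 2.0) * (x - v) ** 2                          # (beta / 2) * |x - v|^2 with beta = 2
```

For this quadratic target the piece-wise linear error meets the bound of Eq. (1) with equality, and it is smaller than the piece-wise constant error at the same point.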

2.2 Spatial Pooling

Pooling. Let each image be represented as a set of descriptor vectors $x$ that follows an image-specific distribution, represented as a probability density function $p(x)$ with respect to an image-independent background measure $d\mu(x)$. Let us first ignore the spatial locations of $x$, and address the spatial pooling later. A kernel-based method for image classification is based on a kernel on the probability distributions over $x \in \Omega$, $K : \mathcal{P} \times \mathcal{P} \mapsto \mathbb{R}$. A well-known example is the Bhattacharyya kernel [15]:

$$K_b(p, q) = \int_\Omega p(x)^{1/2}\, q(x)^{1/2}\, d\mu(x).$$

Here $p(x)$ and $q(x)$ represent two images as distributions over local descriptor vectors, and $\mu(x)$ is the image-independent background measure. The Bhattacharyya kernel is closely associated with the Hellinger distance, defined as $D_h(p, q) = 2 - K_b(p, q)$, which can be seen as a principled symmetric approximation of the Kullback-Leibler (KL) divergence [15]. Despite the popular application of both the Bhattacharyya kernel and the KL divergence, a significant drawback is that they ignore the underlying similarity metric of $x$, as illustrated in Figure 2. In order to avoid this problem, one has to work with very smooth distribution families that are inconvenient to work with in practice. In this paper, we propose a novel formulation that explicitly takes the similarity of $x$ into account:

$$K_s(p, q) = \int_\Omega \int_\Omega p(x)^{1/2}\, q(x')^{1/2}\, \kappa(x, x')\, d\mu(x)\, d\mu(x') = \int_\Omega \int_\Omega p(x)^{-1/2}\, q(x')^{-1/2}\, \kappa(x, x')\, p(x)\, q(x')\, d\mu(x)\, d\mu(x'),$$

where $\kappa(x, x')$ is an RKHS kernel on $\Omega$ that reflects the similarity structure of $x$. In the extreme case where $\kappa(x, x') = \delta(x - x')$ is the delta function with respect to $\mu(\cdot)$, the above kernel reduces to the Bhattacharyya kernel.
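The reduction to the Bhattacharyya kernel can be written out in one step. In the sketch below, $K_b$ denotes the Bhattacharyya kernel and $K_s$ the proposed similarity-aware kernel; these subscripts are notation assumed for this derivation. Substituting the delta function collapses the inner integral onto $x' = x$:

```latex
\begin{align*}
K_s(p, q)
  &= \int_\Omega \int_\Omega p(x)^{1/2}\, q(x')^{1/2}\, \delta(x - x')\, d\mu(x)\, d\mu(x') \\
  &= \int_\Omega p(x)^{1/2}\, q(x)^{1/2}\, d\mu(x) \\
  &= K_b(p, q).
\end{align*}
```

Any smoother choice of $\kappa$ lets nearby but non-identical descriptors contribute to the kernel value, which is the effect the delta function rules out.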

Fig. 2. Illustration of the drawback of the Bhattacharyya kernel: in both cases (panels (1) and (2)) the density kernels $K_b(p, q)$ remain the same, equal to 0.

In reality we cannot directly observe $p(x)$ from any image, but only a set $X$ of local descriptors. Therefore, based on the empirical approximation to $K_s(p, q)$, we define a kernel between sets of vectors:

$$K(X, X') = \frac{1}{N N'} \sum_{x \in X} \sum_{x' \in X'} p(x)^{-1/2}\, q(x')^{-1/2}\, \kappa(x, x') \qquad (4)$$

where $N$ and $N'$ are the sizes of the descriptor sets from the two images.

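Let $\kappa(x, x') = \langle \phi(x), \phi(x') \rangle$, where $\phi(x)$ is the SV coding defined in the previous section. It is easy to see that $\kappa(x, x') = 0$ if $x$ and $x'$ fall into different clusters. Then we have

$$K(X, X') = \frac{1}{N N'} \sum_{k=1}^{|C|} \sum_{x \in X_k} \sum_{x' \in X'_k} p(x)^{-1/2}\, q(x')^{-1/2}\, \kappa(x, x'),$$

where $X_k$ and $X'_k$ denote the descriptors of $X$ and $X'$ that fall into the $k$-th cluster.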
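Because the descriptor kernel is an inner product of codes, the set kernel of Eq. (4) is itself an inner product of two pooled image-level vectors, which is why a linear classifier can be used on top. The sketch below checks this numerically. It is an illustration under assumptions: the density values p(x) and q(x') are passed in by the caller (how they are estimated is not covered here), and the names `set_kernel` and `pooled` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def set_kernel(
    codes_x: Sequence[Vector], dens_x: Sequence[float],
    codes_y: Sequence[Vector], dens_y: Sequence[float],
) -> float:
    """Eq. (4): (1 / (N N')) * sum_x sum_x' p(x)^(-1/2) q(x')^(-1/2) <phi(x), phi(x')>."""
    total = 0.0
    for phi_x, p in zip(codes_x, dens_x):
        for phi_y, q in zip(codes_y, dens_y):
            total += dot(phi_x, phi_y) / math.sqrt(p * q)
    return total / (len(codes_x) * len(codes_y))


def pooled(codes: Sequence[Vector], dens: Sequence[float]) -> Vector:
    """Image-level vector (1 / N) * sum_x p(x)^(-1/2) phi(x)."""
    n = len(codes)
    out = [0.0] * len(codes[0])
    for phi, p in zip(codes, dens):
        for j, value in enumerate(phi):
            out[j] += value / (math.sqrt(p) * n)
    return out
```

Computing `dot(pooled(...), pooled(...))` costs time linear in the number of descriptors per image, whereas the double sum in `set_kernel` is quadratic; both return the same value.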
