Image Classification using Super-Vector Coding of Local Image Descriptors
Xi Zhou, Kai Yu, Tong Zhang, and Thomas S. Huang
Dept. of ECE, University of Illinois at Urbana-Champaign, Illinois
NEC Laboratories America, California
Department of Statistics, Rutgers University, New Jersey
Abstract. This paper introduces a new framework for image classification
using local visual descriptors. The pipeline first performs a nonlinear
feature transformation on descriptors, then aggregates the results to form
image-level representations, and finally applies a classification model.
For all three steps we suggest novel solutions that make our approach
appealing in theory, more scalable in computation, and transparent in
classification. Our experiments demonstrate that the proposed classification
method achieves state-of-the-art accuracy on the well-known PASCAL benchmarks.
1 Introduction
Image classification, including object recognition and scene classification,
remains a major challenge for the computer vision community. Perhaps one of
the most significant developments in the last decade is the application of
local features to image classification, including the introduction of the
“bag-of-visual-words” representation, which has inspired a great deal of
research [1].
A large body of work investigates probabilistic generative models, with the
objective of understanding the semantic content of images. Typically those
models extend the well-known topic models on bag-of-words representations by
further considering the spatial information of visual words [2][3].
This paper follows another line of research on building discriminative models
for classification. Previous work includes SVMs using pyramid matching
kernels [4], biologically-inspired models [5][6], and KNN methods [7][8][9].
Over the past years, the nonlinear SVM method using spatial pyramid matching
(SPM) kernels [4][10] seems to be dominant among the top performers in various
image classification benchmarks, including Caltech-101 [11], PASCAL [12], and
TRECVID. Recent improvements were often achieved by combining different types
of local descriptors [10][13][14], without any fundamental change to the
underlying classification method. In addition to the demand for more accurate
classifiers, one has to develop more practical methods. Nonlinear SVMs scale
at least quadratically with the size of the training data, which makes it
nontrivial to handle large-scale training data. It is thus necessary to design
algorithms that are computationally more efficient.

2 ECCV-10 submission ID 453
1.1 Overview of Our Approach
Our work represents each image by a set of local descriptors with their spatial
coordinates. The descriptor can be SIFT, or any other local features, computed
from image patches at locations on a 2D grid. Our image classification method
consists of three computational steps:
1. Descriptor coding:
Each descriptor of an image is nonlinearly mapped to form a high-dimensional
sparse vector. We propose a novel nonlinear coding method called Super-Vector
coding, which is algorithmically a simple extension of Vector Quantization
(VQ) coding;
2. Spatial pooling:
For each local region, the codes of all the descriptors in it are aggregated
to form a single vector, then the vectors of different regions are
concatenated to form the image-level feature vector. Our pooling is based on
a novel probability kernel incorporating the similarity metric of local
descriptors;
3. Image classification:
The image-level feature vector is normalized and fed into a classifier. We
choose linear SVMs, which scale linearly with the size of the training data.
We note that the coding-pooling-classification pipeline is the de facto
framework for image scene classification. One notable example is the SPM
kernel approach [4], which applies average pooling on top of VQ coding, plus a
nonlinear SVM classifier using Chi-square or intersection kernels.
In this paper, we propose novel methods for each of the three steps and
formalize their underlying mathematical principles. The work stresses the
importance of learning a good coding of local descriptors in the context of
image classification, and makes the first attempt to formally incorporate the
metric of local descriptors into distribution kernels. Putting all these
together, the overall image classification framework enjoys linear training
complexity, as well as an interpretability that is missing in conventional
models (see details in Sec. 2.3). Most importantly, our method demonstrates
state-of-the-art performance on the challenging PASCAL07 and PASCAL09 image
classification benchmarks.
2 The Method
In the following we describe all three steps of our image classification
pipeline in detail.
2.1 Descriptor Coding
We introduce a novel coding method, which enjoys appealing theoretical
properties. Suppose we are interested in learning a smooth nonlinear function
$f(x)$ defined on a high dimensional space $\mathbb{R}^d$. The question is,
how to derive a good coding scheme (or nonlinear mapping) $\phi(x)$ such that
$f(x)$ can be well approximated by a linear function on it, namely
$w^\top \phi(x)$. Our only assumption here is that $f(x)$ should be
sufficiently smooth.
Let us consider a general unsupervised learning setting, where a set of bases
$C \subset \mathbb{R}^d$, called codebook or dictionary, is employed to
approximate any $x$, namely,
$$x \approx \sum_{v \in C} \gamma_v(x)\, v,$$
where $\gamma(x) = [\gamma_v(x)]_{v \in C}$ is the coefficients, and sometimes
$\sum_v \gamma_v(x) = 1$. By restricting the cardinality of nonzeros of
$\gamma(x)$ to be 1 and $\gamma_v(x) \ge 0$, we obtain the Vector Quantization
(VQ) method
$$v_*(x) = \arg\min_{v \in C} \|x - v\|,$$
where $\|\cdot\|$ is the Euclidean norm (2-norm). The VQ method uses the coding
$\gamma_v(x) = 1$ if $v = v_*(x)$ and $\gamma_v(x) = 0$ otherwise. We say that
$f(x)$ is $\beta$ Lipschitz derivative smooth if for all
$x, x' \in \mathbb{R}^d$:
$$\left| f(x) - f(x') - \nabla f(x')^\top (x - x') \right| \le \frac{\beta}{2} \|x - x'\|^2 .$$
It immediately implies the following simple function approximation bound via
VQ coding: for all $x \in \mathbb{R}^d$:
$$\left| f(x) - f(v_*(x)) - \nabla f(v_*(x))^\top \big(x - v_*(x)\big) \right| \le \frac{\beta}{2} \|x - v_*(x)\|^2 . \qquad (1)$$
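As a rough numerical illustration of Eq. (1), the sketch below assigns toy descriptors to their nearest codewords and checks the bound. The test function $f(x) = \sum_i \sin(x_i)$ (for which $\beta = 1$), the random codebook, the helper name `vq_assign`, and all sizes are assumptions made for this example; none of them come from the paper.

```python
import numpy as np

def vq_assign(X, C):
    """Return, for each row of X, the index of the nearest codeword in C."""
    # Squared Euclidean distances between every descriptor and every codeword.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Hypothetical test function: f(x) = sum_i sin(x_i). Its Hessian is
# diag(-sin(x_i)), so the gradient is Lipschitz with beta = 1.
def f(X):
    return np.sin(X).sum(axis=1)

def grad_f(X):
    return np.cos(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))   # 500 toy descriptors in R^4
C = rng.normal(size=(16, 4))    # 16 arbitrary codewords (not learned)
beta = 1.0

V = C[vq_assign(X, C)]          # v_*(x) for every descriptor
# Left- and right-hand sides of Eq. (1), evaluated per descriptor.
lhs = np.abs(f(X) - f(V) - (grad_f(V) * (X - V)).sum(axis=1))
rhs = 0.5 * beta * ((X - V) ** 2).sum(axis=1)
bound_holds = bool(np.all(lhs <= rhs + 1e-12))
print(bound_holds)
```

The right-hand side depends only on the quantization error $\|x - v_*(x)\|$, so a codebook that quantizes the descriptors more tightly gives a smaller bound.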
This bound simply states that one can approximate $f(x)$ by
$f(v_*(x)) + \nabla f(v_*(x))^\top \big(x - v_*(x)\big)$, and the approximation
error is upper bounded by the quality of VQ. It further suggests that the
function approximation can be improved by learning the codebook $C$ to
minimize this upper bound. One way is the K-means algorithm
$$C = \arg\min_{C} \left\{ \sum_{x} \min_{v \in C} \|x - v\|^2 \right\} .$$
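A minimal sketch of learning such a codebook with plain Lloyd iterations follows. The function names, the random initialization, and the toy data are assumptions for illustration; Lloyd iterations only reach a local minimum of the objective, not necessarily the global arg min written above.

```python
import numpy as np

def kmeans_objective(X, C):
    """Sum over x of min_v ||x - v||^2, the quantity minimized above."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; returns the codebook and the objective history."""
    rng = np.random.default_rng(seed)
    # Initialize codewords with k distinct data points.
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    history = [kmeans_objective(X, C)]
    for _ in range(n_iter):
        # Assignment step: nearest codeword for every point.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Update step: move each codeword to the mean of its members.
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:   # an empty cluster keeps its old codeword
                C[j] = members.mean(axis=0)
        history.append(kmeans_objective(X, C))
    return C, history

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))      # toy descriptors
C, history = kmeans(X, k=8)
```

Each iteration can only keep or lower the objective, which is the sense in which the codebook is tuned to tighten the bound in Eq. (1).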
Eq. (1) also suggests that the approximation to $f(x)$ can be expressed as a
linear function on a nonlinear coding scheme
$$f(x) \approx g(x) \equiv w^\top \phi(x),$$
where $\phi(x)$ is called the Super-Vector (SV) coding of $x$, defined by
$$\phi(x) = \left[ s\,\gamma_v(x),\ \gamma_v(x)\,(x - v)^\top \right]^\top_{v \in C} \qquad (2)$$
where $s$ is a nonnegative constant. It is not difficult to see that
$w = \left[ \frac{1}{s} f(v),\ \nabla f(v) \right]_{v \in C}$, which can be
regarded as unknown parameters to be estimated. Because $\gamma_v(x) = 1$ if
$v = v_*(x)$, otherwise $\gamma_v(x) = 0$, the obtained $\phi(x)$ is a highly
sparse representation, with dimensionality $|C|(d+1)$. For example, if
$|C| = 3$ and $\gamma(x) = [0, 1, 0]$, then
$$\phi(x) = \Big[ \underbrace{0, \ldots, 0}_{d+1 \text{ dim.}},\ \underbrace{s,\ (x - v)^\top}_{d+1 \text{ dim.}},\ \underbrace{0, \ldots, 0}_{d+1 \text{ dim.}} \Big]^\top \qquad (3)$$
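The construction in Eqs. (2) and (3) can be sketched as follows. The helper `sv_code`, the toy codebook, and the test function $f$ are assumptions for illustration only. The last lines check that, with $w$ built from $f(v)/s$ and $\nabla f(v)$, the product $w^\top \phi(x)$ equals the first-order expansion $f(v_*(x)) + \nabla f(v_*(x))^\top (x - v_*(x))$.

```python
import numpy as np

def sv_code(x, C, s=1.0):
    """Super-vector code of one descriptor x for codebook C (shape K x d)."""
    K, d = C.shape
    k = int(((C - x) ** 2).sum(axis=1).argmin())   # index of v_*(x)
    phi = np.zeros(K * (d + 1))
    block = slice(k * (d + 1), (k + 1) * (d + 1))
    phi[block] = np.concatenate(([s], x - C[k]))    # [s, (x - v)^T]
    return phi, k

# Hypothetical smooth function, used only to build w = [f(v)/s, grad f(v)].
def f(x):
    return np.sin(x).sum()

def grad_f(x):
    return np.cos(x)

rng = np.random.default_rng(0)
C = rng.normal(size=(3, 4))      # |C| = 3 codewords in R^4
x = rng.normal(size=4)
s = 2.0

phi, k = sv_code(x, C, s)
# Stack one (d+1)-dimensional block per codeword, matching the layout of phi.
w = np.concatenate([np.concatenate(([f(v) / s], grad_f(v))) for v in C])

g = w @ phi                                        # w^T phi(x)
taylor = f(C[k]) + grad_f(C[k]) @ (x - C[k])       # expansion at v_*(x)
```

Only one block of `phi` is nonzero, so at most $d + 1$ of its $|C|(d+1)$ entries are active for any descriptor.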
Fig. 1. Function $f(x)$ approximated by $w^\top \phi(x)$; panels (1), (2), (3).
As illustrated in Figure 1, $w^\top \phi(x)$ provides a piece-wise linear
function to approximate a nonlinear function $f(x)$, as shown in
Figure 1-(2), while with VQ coding $\phi(x) = [\gamma_v(x)]^\top_{v \in C}$,
the same formulation $w^\top \phi(x)$ gives a piece-wise constant
approximation, as shown in Figure 1-(3). This intuitively suggests that SV
coding may achieve a lower function approximation error than VQ coding. We
note that the popular bag-of-features image classification method essentially
employs VQ to obtain histogram representations. The proposed SV coding is a
simple extension of VQ, and may lead to a better approach to image
classification.
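The piece-wise linear versus piece-wise constant behaviour can be checked on a one-dimensional toy problem. In this sketch the target $f(x) = \sin(x)$, the evenly spaced codebook, and the least-squares fit of $w$ are all assumptions for illustration; the paper's own classifier is a linear SVM, not a least-squares regressor.

```python
import numpy as np

def encode(x, centers, s=1.0, super_vector=True):
    """Rows are VQ codes (one-hot) or SV codes for scalar inputs x."""
    k = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
    n, K = len(x), len(centers)
    onehot = np.zeros((n, K))
    onehot[np.arange(n), k] = 1.0
    if not super_vector:
        return onehot
    resid = onehot * (x - centers[k])[:, None]   # gamma_v(x) * (x - v)
    # Columns are grouped as [all s*gamma_v, all residuals], which is a
    # permutation of the per-codeword layout in Eq. (2).
    return np.hstack([s * onehot, resid])

x = np.linspace(0.0, 2.0 * np.pi, 400)
y = np.sin(x)                                    # hypothetical target f(x)
centers = np.linspace(0.0, 2.0 * np.pi, 8)       # fixed 8-word codebook

errors = {}
for name, flag in [("VQ", False), ("SV", True)]:
    Phi = encode(x, centers, super_vector=flag)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares fit of w
    errors[name] = float(np.mean((Phi @ w - y) ** 2))
print(errors)
```

Because the SV features contain the VQ features as a subset of their columns, the fitted SV error can never exceed the VQ error on the same codebook; on this toy target it is strictly smaller.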
2.2 Spatial Pooling
Pooling. Let each image be represented as a set of descriptor vectors $x$ that
follows an image-specific distribution, represented as a probability density
function $p(x)$ with respect to an image-independent background measure
$\mu(x)$. Let us first ignore the spatial locations of $x$, and address
spatial pooling later. A kernel-based method for image classification is based
on a kernel on the probability distributions over $x$,
$K : \mathcal{P} \times \mathcal{P} \mapsto \mathbb{R}$. A well-known example
is the Bhattacharyya kernel [15]:
$$K_b(p, q) = \int p(x)^{\frac{1}{2}}\, q(x)^{\frac{1}{2}}\, d\mu(x).$$
Here $p(x)$ and $q(x)$ represent two images as distributions over local
descriptor vectors, and $\mu(x)$ is the image-independent background measure.
The Bhattacharyya kernel is closely associated with the Hellinger distance,
defined as $D_h(p, q) = 2 - K_b(p, q)$, which can be seen as a principled
symmetric approximation of the Kullback-Leibler (KL) divergence [15]. Despite
the popularity of both the Bhattacharyya kernel and the KL divergence, a
significant drawback is that they ignore the underlying similarity metric of
$x$, as illustrated in Figure 2. To avoid this problem, one has to work with
very smooth distribution families, which are inconvenient in practice. In this
paper, we propose a novel formulation that explicitly takes the similarity of
$x$ into account:
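$$
\begin{aligned}
K_s(p, q) &= \int\!\!\int p(x)^{\frac{1}{2}}\, q(x')^{\frac{1}{2}}\, \kappa(x, x')\, d\mu(x)\, d\mu(x') \\
          &= \int\!\!\int p(x)^{-\frac{1}{2}}\, q(x')^{-\frac{1}{2}}\, \kappa(x, x')\, p(x)\, q(x')\, d\mu(x)\, d\mu(x')
\end{aligned}
$$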
where $\kappa(x, x')$ is an RKHS kernel on the descriptor space that reflects
the similarity structure of $x$. In the extreme case where
$\kappa(x, x') = \delta(x - x')$ is the delta function with respect to
$\mu(\cdot)$, the above kernel reduces to the Bhattacharyya kernel.
Fig. 2. Illustration of the drawback of the Bhattacharyya kernel: in both
cases, (1) and (2), the density kernel $K_b(p, q)$ remains the same, equal
to 0.
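A discrete analogue makes this drawback concrete. The sketch below assumes that the two cases are a nearby and a distant pair of non-overlapping densities, and it replaces the integrals by sums over histogram bins (counting measure). The Gaussian similarity `kappa`, its bandwidth, and the helper names are hypothetical choices for this example.

```python
import numpy as np

def bhattacharyya(p, q):
    """Discrete K_b: sum_i sqrt(p_i * q_i) under the counting measure."""
    return float(np.sqrt(p * q).sum())

def similarity_kernel(p, q, kappa):
    """Discrete K_s: sum_ij sqrt(p_i) * sqrt(q_j) * kappa[i, j]."""
    return float(np.sqrt(p) @ kappa @ np.sqrt(q))

bins = np.arange(10, dtype=float)             # bin centers on a line
# Hypothetical Gaussian similarity between bins; the bandwidth is arbitrary.
kappa = np.exp(-0.5 * (bins[:, None] - bins[None, :]) ** 2)

def point_mass(i):
    """Histogram with all of its mass in bin i."""
    p = np.zeros(len(bins))
    p[i] = 1.0
    return p

p = point_mass(2)
q_near, q_far = point_mass(3), point_mass(8)  # both disjoint from p

kb_near, kb_far = bhattacharyya(p, q_near), bhattacharyya(p, q_far)
ks_near = similarity_kernel(p, q_near, kappa)
ks_far = similarity_kernel(p, q_far, kappa)
# With kappa equal to the identity (a discrete delta), K_s reduces to K_b.
ks_delta = similarity_kernel(p, p, np.eye(len(bins)))
```

Both Bhattacharyya values are zero because the supports do not overlap, while the similarity-aware kernel assigns the nearby pair a larger value than the distant one.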
In reality we cannot directly observe $p(x)$ from any image, but only a set
$X$ of local descriptors. Therefore, based on the empirical approximation to
$K_s(p, q)$, we define a kernel between sets of vectors:
$$K(X, X') = \frac{1}{N N'} \sum_{x \in X} \sum_{x' \in X'} p(x)^{-\frac{1}{2}}\, q(x')^{-\frac{1}{2}}\, \kappa(x, x') \qquad (4)$$
where $N$ and $N'$ are the sizes of the descriptor sets from two images.
Let $\kappa(x, x') = \langle \phi(x), \phi(x') \rangle$, where $\phi(x)$ is
the SV coding defined in the previous section. It is easy to see that
$\kappa(x, x') = 0$ if $x$ and $x'$ fall into different clusters. Then we have
$$K(X, X') = \frac{1}{N N'} \sum_{k=1}^{|C|} \sum_{x \in X_k} \sum_{x' \in X'_k} p(x)^{-\frac{1}{2}}\, q(x')^{-\frac{1}{2}}\, \kappa(x, x')$$
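The cluster-wise decomposition can be verified numerically. This sketch reads $X_k$ as the descriptors of $X$ assigned to the $k$-th codeword, which is an assumption. The weights `a` and `b` are placeholder positive numbers standing in for $p(x)^{-1/2}$ and $q(x')^{-1/2}$, which cannot be observed directly; the helper `sv_codes` and the toy data are likewise hypothetical.

```python
import numpy as np

def sv_codes(X, C, s=1.0):
    """SV codes (rows) and cluster indices for descriptors X given codebook C."""
    n, d = X.shape
    K = len(C)
    k = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    Phi = np.zeros((n, K, d + 1))
    Phi[np.arange(n), k, 0] = s            # s * gamma_v(x)
    Phi[np.arange(n), k, 1:] = X - C[k]    # gamma_v(x) * (x - v)
    return Phi.reshape(n, K * (d + 1)), k

rng = np.random.default_rng(0)
C = rng.normal(size=(4, 3))
X, X2 = rng.normal(size=(30, 3)), rng.normal(size=(25, 3))
# Placeholder positive weights standing in for p(x)^(-1/2) and q(x')^(-1/2).
a = rng.uniform(0.5, 2.0, size=30)
b = rng.uniform(0.5, 2.0, size=25)

Phi, k = sv_codes(X, C)
Phi2, k2 = sv_codes(X2, C)
N, N2 = len(X), len(X2)

# Eq. (4): double sum over all descriptor pairs.
K_all = (a[:, None] * b[None, :] * (Phi @ Phi2.T)).sum() / (N * N2)

# Cluster-wise form: only pairs sharing a codeword contribute.
K_clusters = 0.0
for j in range(len(C)):
    G = Phi[k == j] @ Phi2[k2 == j].T
    K_clusters += (a[k == j][:, None] * b[k2 == j][None, :] * G).sum()
K_clusters /= N * N2

# By linearity, the same value is an inner product of two weighted code sums.
K_pooled = (a @ Phi) @ (b @ Phi2) / (N * N2)
```

The three quantities agree up to floating-point error. The last form suggests why a kernel of this shape fits a linear classifier: each image contributes one weighted sum of codes.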
