
Hierarchical Gaussianization for Image Classification
Xi Zhou, Na Cui, Zhen Li, Feng Liang, and Thomas S. Huang
Dept. of ECE, University of Illinois at Urbana-Champaign
Dept. of Statistics, University of Illinois at Urbana-Champaign
{xizhou2, nacui2, zhenli3, liangf}@uiuc.edu, huang@ifp.uiuc.edu
Abstract
In this paper, we propose a new image representation to capture both the appearance and spatial information for image classification applications. First, we model the feature vectors, from the whole corpus, from each image and at each individual patch, in a Bayesian hierarchical framework using mixtures of Gaussians. After such a hierarchical Gaussianization, each image is represented by a Gaussian mixture model (GMM) for its appearance, and several Gaussian maps for its spatial layout. Then we extract the appearance information from the GMM parameters, and the spatial information from global and local statistics over Gaussian maps. Finally, we employ a supervised dimension reduction technique called DAP (discriminant attribute projection) to remove noise directions and to further enhance the discriminating power of our representation. We justify that the traditional histogram representation and the spatial pyramid matching are special cases of our hierarchical Gaussianization. We compare our new representation with other approaches in scene classification, object recognition and face recognition, and our performance ranks among the top in all three tasks.
1. Introduction
Histogram representation, as a description for orderless patch-based features, has been widely used in visual recognition and image retrieval [4, 5]. Despite its popularity, however, the histogram representation has some intrinsic limitations. For example, it is sensitive to several factors such as outliers, the choice of bins, and the noise level in the data. Most importantly, encoding high-dimensional feature vectors with a relatively small codebook leads to large quantization errors and a loss of discriminability [21]. Furthermore, the histogram representation discards all the spatial configuration of image patches, which is a key attribute for object and scene classification.
Figure 1. (a) is an input image. (b) shows the patch features in the feature space. Each "+" denotes a feature vector, whose distribution is approximated by a GMM. (c) shows a set of Gaussian maps, each of which corresponds to one Gaussian component in (b). A supervised dimension reduction algorithm, DAP, is performed in (d) to form the final image representation, the hierarchical Gaussianization vector.

Several approaches have been proposed in the literature to overcome these limitations. Soft assignment, which allows each feature vector to belong to multiple histogram bins, has been suggested to capture partial similarity between images [16, 19, 18, 26, 27, 28]. To enhance the discriminating capability of histograms, Farquhar et al. [12] and Perronnin et al. [16] introduced several ways to construct category-specific histograms, Larlus et al. [13] and Yang et al. [19] suggested integrating histogram construction with classifier training, and Moosmann et al. [15] proposed using randomized forests to build discriminative histograms. As a flexible way to model a variety of distributions, the GMM has emerged as a better alternative to histograms in age estimation, object classification and video event analysis [2, 1, 3]. On the other hand, to alleviate the loss of spatial information in the histogram representation, one of the most successful approaches by far is the spatial pyramid matching (SPM) technique proposed by Lazebnik et al. [11].
In this paper, we propose a new model-based representation for image features, capturing both the appearance and spatial information. First, we adopt a hierarchical GMM for feature vectors at different levels: the whole corpus, each image, and individual patches. We learn the image-specific GMM in a Bayesian framework to allow information sharing across different images and to bridge the universal and individual information retrievals. Given an image-specific GMM, each patch of that image is assigned to a Gaussian component with respect to a posterior probability. All these probabilities constitute a set of so-called Gaussian maps over the entire patch grid. After obtaining a GMM and Gaussian maps for each image, a process we term Hierarchical Gaussianization (HG), we extract the appearance information from the GMM parameters, and the spatial information from global and local summary statistics over Gaussian maps. Finally, all parameters of the GMM and statistics of the Gaussian maps are concatenated as a super-vector, followed by a supervised dimension reduction to further enhance the discriminating power of the representation. An illustration of this new representation is shown in Figure 1.
The remainder of this paper is organized as follows. In Section 2, we introduce the new image representation that incorporates both the visual and spatial information. In Section 3, we justify that the histogram representation and the spatial pyramid matching are special cases of the HG representation. In Section 4, we demonstrate the effectiveness of our approach on three image databases. Conclusions are given in Section 5.
2. Hierarchical Gaussianization representation
2.1. GMMs for appearance representation
Let z denote a p-dimensional feature vector from the I-th image. We model z by a GMM, namely,

p(z \mid \Theta) = \sum_{k=1}^{K} w_k^I \, \mathcal{N}(z; \mu_k^I, \Sigma_k^I),   (1)

where K denotes the total number of Gaussian components, and (w_k^I, \mu_k^I, \Sigma_k^I) are the image-specific weight, mean and covariance matrix of the k-th Gaussian component, respectively. For computational efficiency, we restrict the covariance matrix \Sigma_k^I to be a diagonal matrix \Sigma_k shared by all images.
The number of model parameters in \Theta = \{w_k^I, \mu_k^I, \Sigma_k\}_{k=1:K, I=1:N} grows linearly with N, the number of training images. In practice the number of patches from one image is usually small and thus insufficient for a robust estimate of all parameters. To overcome this problem, we propose a hierarchical Bayesian framework to jointly estimate all the GMM parameters. We model the image-specific GMM parameters w_k^I's and \mu_k^I's by conjugate priors:

(w_1^I, \ldots, w_K^I) \sim \mathrm{Dir}(T w_1, \ldots, T w_K),

\mu_k^I \sim \mathcal{N}(\mu_k, \Sigma_k / r), \quad k = 1:K.

The prior distribution over the weights w_k^I's is a Dirichlet distribution with parameters (T w_1, \ldots, T w_K), which can be interpreted as adding T pseudo-counts in total, with a w_k fraction of them from the k-th component. The prior distribution for the means \mu_k^I's is a Gaussian centered at a global mean \mu_k with a covariance matrix shrunk by a smoothing parameter r. Note that such a prior specification imposes dependence between images. The rationale behind this is to "borrow" strength across similar images for estimation and therefore overcome the small-sample-size issue suffered in conventional learning processes.
We estimate the prior mean vectors \mu_k, prior weights w_k and covariance matrices \Sigma_k by fitting a global GMM on the whole corpus, and the remaining parameters by maximizing the following Maximum A Posteriori (MAP) objective,

\max_{\Theta} \; \big[ \ln p(z \mid \Theta) + \ln p(\Theta) \big].

The MAP estimates can be obtained via an EM algorithm. In the E-step, we compute, for the feature vectors z_1, \ldots, z_N extracted from image I,

\Pr(k \mid z_i) = \frac{w_k^I \, \mathcal{N}(z_i; \mu_k^I, \Sigma_k)}{\sum_{j=1}^{K} w_j^I \, \mathcal{N}(z_i; \mu_j^I, \Sigma_j)},   (2)

n_k = \sum_{i=1}^{N} \Pr(k \mid z_i),   (3)

and in the M-step, we update

\hat{w}_k^I = \gamma_k \, n_k / N + (1 - \gamma_k) \, w_k,   (4)

\hat{\mu}_k^I = \alpha_k \, m_k + (1 - \alpha_k) \, \mu_k,   (5)

where

m_k = \frac{1}{n_k} \sum_{i=1}^{N} \Pr(k \mid z_i) \, z_i, \qquad \alpha_k = \frac{n_k}{n_k + r}, \qquad \gamma_k = \frac{N}{N + T}.
If a Gaussian component has a high probabilistic count n_k, then \alpha_k approaches 1 and the adapted parameters emphasize the new sufficient statistics m_k; otherwise, the adapted parameters are dominated by the global model \mu_k. The tuning parameters r and T also affect the MAP adaptation: in general, the larger r and T are, the larger the influence of the prior distribution on the adaptation. For example, as r goes to infinity, the MAP estimate of \mu_k^I stays fixed at the prior mean, and the same holds for T and w_k^I. In practice we adjust r and T empirically, based on the total number of coordinate patches for each image.
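To make equations (2)-(5) concrete, here is a minimal NumPy sketch of the one-pass MAP adaptation, assuming the global GMM (weights w, means mu, shared diagonal covariances var) has already been fit on the whole corpus; the array shapes, names, default hyper-parameters and single-iteration update are our illustrative choices, not the authors' code.

import numpy as np

def map_adapt(Z, w, mu, var, r=16.0, T=10.0):
    """One pass of MAP adaptation (eqs. 2-5) of a global diagonal GMM to one image.

    Z: (N, p) patch feature vectors of the image.
    w: (K,) global weights; mu: (K, p) global means; var: (K, p) diagonal covariances.
    r, T: smoothing hyper-parameters (tuned empirically in the paper).
    Returns adapted weights, adapted means, and the (N, K) posterior matrix.
    """
    N, K = Z.shape[0], w.shape[0]
    # E-step, eq. (2): log N(z_i; mu_k, diag(var_k)) for every patch/component pair.
    log_norm = -0.5 * np.log(2.0 * np.pi * var).sum(axis=1)                     # (K,)
    sq = ((Z[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :]).sum(axis=2)  # (N, K)
    log_post = np.log(w) + log_norm - 0.5 * sq
    log_post -= log_post.max(axis=1, keepdims=True)    # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)            # Pr(k | z_i)
    # Eq. (3) and the weighted mean m_k.
    n = post.sum(axis=0)                               # (K,) soft counts
    m = (post.T @ Z) / np.maximum(n, 1e-10)[:, None]   # (K, p)
    # M-step, eqs. (4)-(5): convex combination of data statistics and prior.
    alpha = n / (n + r)
    gamma = N / (N + T)
    w_hat = gamma * n / N + (1.0 - gamma) * w
    mu_hat = alpha[:, None] * m + (1.0 - alpha[:, None]) * mu
    return w_hat, mu_hat, post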
After Gaussianization, we can calculate the similarity between a pair of images via the similarity between two GMMs. A common approach is to summarize the parameters of a GMM as a vector m, and then use some vector metric, such as the inner product [2, 1, 3]. Note that m = f(w, \mu, \Sigma) is in general a function involving all parameters of the corresponding GMM. In our experiments, we follow the suggestion in [3] and choose the appearance vector for an image x^I to be

m(x^I) = \big[ \sqrt{w_1^I} \, \Sigma_1^{-1/2} \mu_1^I ; \; \cdots ; \; \sqrt{w_K^I} \, \Sigma_K^{-1/2} \mu_K^I \big].   (6)
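Given the adapted parameters, the appearance super-vector of equation (6) is a few lines on top of the map_adapt sketch above; the square-root weighting follows our reading of the super-vector construction in the cited work [3], so treat this as a sketch rather than the authors' implementation.

import numpy as np

def appearance_vector(w_hat, mu_hat, var):
    """Eq. (6): stack sqrt(w_k) * Sigma_k^{-1/2} * mu_k over all K components."""
    scaled = np.sqrt(w_hat)[:, None] * mu_hat / np.sqrt(var)   # (K, p)
    return scaled.ravel()                                      # length K * p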
2.2. Gaussian maps for spatial representation
According to equation (2), the feature vector at each patch is again modeled by a mixture of Gaussians with mixture probabilities Pr(k | z_i). For a fixed k, all such probabilities Pr(k | z_i) form a map over the patch locations, which we refer to as a Gaussian map. While each Gaussian component represents some structure in the feature space, the corresponding Gaussian map shows the geometric location of that structure on an image. For a GMM with K components, we have K Gaussian maps, and we can learn the spatial information of an image by analyzing each of these Gaussian maps.
A natural way to summarize a Gaussian map is to use its mean location or normalized mean location. However, such global summary statistics do not work well for images. In Figure 2, we plot a subset of Gaussian maps for three images from the Caltech 101 database that is analyzed in Section 4. It is clear that local information is more important for the discriminant analysis than the global one.

Figure 2. Sample Gaussian maps of three images from the Caltech 101 dataset.
Therefore we propose to hierarchically split a Gaussian map and extract summary statistics over local regions. Specifically, each of the K Gaussian maps is divided into subregions based on a sequence of increasingly coarser grids; assuming there are M subregions in total, we then calculate some summary statistic \nu over each of the M regions. As a parallel form to (6), we define v(x^I), a vector expressing the spatial information of image x^I, as follows,

v(x^I) = \big[ \nu_{11}^I ; \cdots ; \nu_{M1}^I ; \; \nu_{12}^I ; \cdots ; \nu_{M2}^I ; \; \cdots ; \; \nu_{MK}^I \big].   (7)
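One possible realization of this pooling is sketched below: the per-patch posteriors are reshaped into K maps over the patch grid, and each map is averaged over the cells of a small pyramid. The mean as the summary statistic \nu and the 1x1 plus 2x2 grids are our assumptions for illustration; the paper leaves both choices open.

import numpy as np

def spatial_vector(post, grid_h, grid_w, levels=(1, 2)):
    """Eq. (7): pool each of the K Gaussian maps over pyramid cells.

    post: (N, K) posteriors Pr(k | z_i) for patches laid out on a
          grid_h x grid_w grid in row-major order (N = grid_h * grid_w).
    Returns the concatenated statistics, region-first within each component k.
    """
    K = post.shape[1]
    maps = post.T.reshape(K, grid_h, grid_w)       # K Gaussian maps
    feats = []
    for k in range(K):
        for g in levels:                           # 1x1 grid, then 2x2 grid, ...
            row_blocks = np.array_split(np.arange(grid_h), g)
            col_blocks = np.array_split(np.arange(grid_w), g)
            for rows in row_blocks:
                for cols in col_blocks:
                    # summary statistic nu: mean posterior inside the cell
                    feats.append(maps[k][np.ix_(rows, cols)].mean())
    return np.array(feats)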
2.3. Discriminant attribute projection
We concatenate the appearance vector m(x^I) and the spatial vector v(x^I) into a super-vector

\phi(x^I) = \big[ m(x^I) ; \; \eta \, v(x^I) \big],

where \eta is a tuning parameter balancing the information contribution from the two sources. However, directly employing such a high-dimensional vector for image classification may not lead to a good performance, because the super-vector is constructed without considering the inter-category or intra-category relationships.
To enhance the discriminating power of our representation, we propose to project \phi(x^I) onto a subspace that suppresses the directions with high intra-category variability. Let V denote the projection matrix toward the subspace with high intra-category variability; then (I - V V^T) \phi(x^I) is the discriminant projection we are looking for. We solve for V via the following objective function

V = \arg\max_{V : V^T V = I} \sum_{i \neq j} \| V^T \phi(x_i) - V^T \phi(x_j) \|^2 \, W_{ij},   (8)

where W_{ij} = 1 when x_i and x_j belong to the same category, and W_{ij} = 0 otherwise. Let \Phi = [\phi(x_1), \phi(x_2), \cdots, \phi(x_N)] be a matrix with N columns, where N is the total number of training images. It can be shown that the optimal solution for V consists of the eigenvectors corresponding to the largest eigenvalues of the matrix \Phi (D - W) \Phi^T, where D is a diagonal matrix with D_{ii} = \sum_{j=1}^{N} W_{ij} for all i.
Suppose we use the dot product as a similarity measure between super-vectors. After applying discriminant attribute projection (DAP), the similarity between two images x_a and x_b is equal to

D(x_a, x_b) = \phi(x_a)^T (I - V V^T) \, \phi(x_b).   (9)

That is, the projection onto V, which is irrelevant to the classification, is discarded in the similarity calculation.
In the DAP approach, each eigen-direction is either included or excluded from later analysis. An alternative is to adaptively shrink each direction of the subspace spanned by V: directions with larger eigenvalues are shrunk more and directions with smaller eigenvalues are shrunk less. Arranging all the shrinkage factors in a diagonal matrix C, the similarity metric (9) can be re-expressed as

D(x_a, x_b) = \phi(x_a)^T (I - V C V^T) \, \phi(x_b).   (10)

In our experiments, we set C = I - \Lambda^{-1}, where \Lambda is a diagonal matrix with the eigenvalues of the matrix \Phi (D - W) \Phi^T.
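The following sketch shows one way to learn V from the training super-vectors and to evaluate the shrunk similarity of equation (10), using a plain symmetric eigendecomposition; the number of retained noise directions d is a free parameter this section does not fix, and all names are ours.

import numpy as np

def fit_dap(Phi, labels, d):
    """Learn DAP (eq. 8) from training super-vectors.

    Phi: (p, N) matrix with super-vectors phi(x_1), ..., phi(x_N) as columns.
    labels: (N,) category labels. d: number of noise directions to model.
    Returns V (p, d) and the shrinkage matrix C = I - Lambda^{-1} (d, d).
    """
    labels = np.asarray(labels)
    W = (labels[:, None] == labels[None, :]).astype(float)  # W_ij
    D = np.diag(W.sum(axis=1))                              # D_ii = sum_j W_ij
    S = Phi @ (D - W) @ Phi.T                               # Phi (D - W) Phi^T
    evals, evecs = np.linalg.eigh(S)                        # ascending eigenvalues
    idx = np.argsort(evals)[::-1][:d]                       # top-d directions
    V, lam = evecs[:, idx], evals[idx]
    C = np.eye(d) - np.diag(1.0 / lam)                      # shrinkage of eq. (10)
    return V, C

def similarity(phi_a, phi_b, V, C):
    """Eq. (10): D(x_a, x_b) = phi_a^T (I - V C V^T) phi_b."""
    return phi_a @ phi_b - (phi_a @ V) @ C @ (V.T @ phi_b)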
3. Connection to previous work
3.1. Histogram as a special case of GMMs
It is easy to see that the histogram representation is a special case of GMMs in which only the weights w_k^I are adapted: if we set the hyper-parameters T = 0 and r = \infty, then from equations (4) and (5) all the image-specific GMMs share the same mean vectors and covariance matrices, and therefore the only information captured by the GMMs is the weights w_k^I, which are proportional to the histogram counts.
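For completeness, this limiting behavior can be read directly off equations (4) and (5); the following two-line derivation is ours, not the paper's:

\gamma_k = \frac{N}{N + 0} = 1 \;\Rightarrow\; \hat{w}_k^I = \frac{n_k}{N}, \qquad \alpha_k = \lim_{r \to \infty} \frac{n_k}{n_k + r} = 0 \;\Rightarrow\; \hat{\mu}_k^I = \mu_k.

Each image is then summarized only by the normalized soft counts n_k / N, i.e., a (soft-assignment) histogram.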
Here we want to highlight three aspects in which the GMM-based approach extends histograms. First, histograms use the Euclidean distance as the clustering metric in constructing bins, while GMMs use the Mahalanobis distance, which takes into account the heterogeneity among features. Second, histograms use a hard decision rule in distributing feature vectors into bins, and the resulting data summary is sensitive to noise, while GMMs use a soft decision rule in distributing feature vectors to Gaussian components, and the resulting probabilistic summary of the data is more robust. The last and most important advantage of GMMs over histograms is the gain of information. Histograms summarize the appearance information of an image (i.e., a bag of feature vectors) by the counts in each histogram bin, which correspond to the weights of the Gaussian components in the adapted mixture model. In addition to the weights, GMMs summarize each image by the adapted mean vectors and covariance matrices, which provide richer information for constructing the super-vector and for calculating similarities between images.
3.2. SPM as a special case of Gaussian maps
To avoid the loss of spatial information with histograms, Lazebnik et al. [11] proposed a successful technique called spatial pyramid matching (SPM). In SPM, images are repeatedly divided into subregions, similarity measures are calculated for each subregion, and their weighted summation forms an overall similarity measure.

Since the histogram is a special case of GMMs, SPM corresponds to a hierarchical spatial model over a degenerate Gaussian map whose posterior probabilities are either 0 or 1. The particular similarity measure used by SPM, the histogram intersection function, corresponds to an intersection function defined over those posterior probabilities. So SPM can be viewed as a special case of Gaussian maps.
4. Experiments
In this section, we report the performance of our image representation on three diverse datasets: the fifteen scene categories [10], Caltech101 and the CMU PIE face database. We investigate the effectiveness of different aspects of our representation and further compare our results with existing work. All experiments are repeated ten times with different randomly selected training and testing images, and the average per-class recognition rate is recorded for each run. As we focus on the image representation, we simply employ the nearest centroid (NC) classifier in all the experiments. We perform all processing in grayscale, even when color images are available.
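Since the classifier matters when reading the tables below, here is a hedged sketch of a nearest centroid decision combined with the DAP similarity of equation (10); the paper does not spell this combination out, so the centroid-in-super-vector-space choice is our assumption.

import numpy as np

def nearest_centroid_predict(Phi_train, y_train, Phi_test, V, C):
    """Nearest centroid in super-vector space, scored with eq. (10).

    Phi_train, Phi_test: (p, N) matrices of super-vectors as columns.
    """
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    centroids = np.stack([Phi_train[:, y_train == c].mean(axis=1)
                          for c in classes], axis=1)         # (p, n_classes)
    # D(x, c) = x^T (I - V C V^T) c for all test/centroid pairs
    sim = Phi_test.T @ centroids - (Phi_test.T @ V) @ C @ (V.T @ centroids)
    return classes[np.argmax(sim, axis=1)]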
4.1. Scene category recognition
The scene database is composed of fifteen scene categories, thirteen provided by Fei-Fei et al. in [10] and the other two collected by Lazebnik et al. in [11]. Each scene category contains 200 to 400 images. The average size of the images is around 300 x 250 pixels. This database is one of the most comprehensive scene category databases used in the literature. Example images of different scene categories of this database are illustrated in Figure 3.

Figure 3. Example images from the scene category database (panels: CALsuburb, MITcoast, MITforest, MIThighway, MITinsidecity, MITmountain, MITopencountry, MITstreet, MITtallbuilding, PARoffice, bedroom, kitchen).
Here, the experiment setting is purposely made the same as that in [10] and [11] to guarantee the fairness of the performance comparison. Specifically, all experiments are repeated ten times with 100 randomly selected images per class for training and the rest for testing. The 128-dimensional SIFT vector is extracted within a 20 x 20 patch over a grid with a spacing of 5 pixels. The dimension of the SIFT descriptor is reduced to 64 by Principal Component Analysis (PCA). The GMM contains 512 Gaussian components, while the histogram contains 512 bins.
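A rough sketch of this feature pipeline, with OpenCV's SIFT and scikit-learn's PCA and GaussianMixture standing in for tooling the paper does not name, might look as follows; the patch size, stride, and model sizes are taken from the text above.

import cv2
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def dense_sift(gray, patch=20, stride=5):
    """128-D SIFT descriptors on a dense grid of 20x20 patches, 5-px spacing."""
    sift = cv2.SIFT_create()
    h, w = gray.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), float(patch))
                 for y in range(patch // 2, h - patch // 2, stride)
                 for x in range(patch // 2, w - patch // 2, stride)]
    _, desc = sift.compute(gray, keypoints)
    return desc  # (num_patches, 128)

# training_images: iterable of grayscale uint8 arrays, assumed loaded elsewhere.
all_desc = np.vstack([dense_sift(img) for img in training_images])
pca = PCA(n_components=64).fit(all_desc)            # SIFT reduced to 64 dims
ubm = GaussianMixture(n_components=512, covariance_type="diag")
ubm.fit(pca.transform(all_desc))                    # global corpus-level GMM

The fitted ubm.weights_, ubm.means_ and ubm.covariances_ would then play the role of the priors (w_k, \mu_k, \Sigma_k) of Section 2.1.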
Table 1. Performance comparison on the scene category database.

Algorithm        Average accuracy (%)
Histogram [10]   65.2
SPM [11]         81.4
HG               85.2
Table 1 compares our approach with several existing systems on the scene classification task. The result in [10] is 65.2%, based on a histogram representation without any spatial information. In [11], Lazebnik et al. introduced spatial pyramid matching (SPM) to incorporate spatial information into the histogram representation and reported an accuracy of 81.4% using an SVM with a nonlinear histogram intersection kernel. In our experiment, with a simple nearest centroid (NC) classifier, the HG representation achieves a superior accuracy of 85.2%. The results are consistent with our analysis in the previous sections: HG is more general than both histogram and SPM.
Table 2. Classification results on the scene category database.

Algorithm   Average accuracy (%)
Histogram   41.8
GMM         75.8
GMM+GM      80.4
GMM+DAP     82.1
HG          85.2
Table 2 gives an in-depth analysis of the effectiveness of each aspect of our representation. Here all the results are obtained with the nearest centroid (NC) classifier. The table shows the performance as the components of our representation are added one by one. It is evident that the three components, the GMM for appearance representation, Gaussian maps (GM) for spatial layout encoding and DAP for discriminant dimension reduction, jointly improve the recognition accuracy. Note that [11] reported an accuracy of 74.2% based on the histogram representation, which is higher than the 41.8% here; this is because [11] employed a nonlinear histogram intersection kernel for the SVM. This indicates that the performance of the histogram representation is sensitive to the choice of kernel metric, and relies heavily on the classifier.
Figure 4 shows the confusion matrix between the fifteen scene categories for the HG representation.

Figure 4. Confusion matrix on the scene category database for the HG representation (rows and columns: CALsuburb, MITcoast, MITforest, MIThighway, MITinsidecity, MITmountain, MITopencountry, MITstreet, MITtallbuilding, PARoffice, bedroom, industrial, kitchen, livingroom, store). The average classification accuracy is 85.2%. The entry in the i-th row and j-th column is the percentage of images from class i that were misidentified as class j.
4.2. Object recognition
Our second set of experiments is conducted on the Caltech101 database. This database consists of 101 object classes with high intra-class appearance and shape variability. The number of images in each class varies from 31 to 800, and most images are of medium resolution (about 300 x 300 pixels). This database is one of the most diverse and thoroughly studied databases for object recognition, and significant progress has been made on it by state-of-the-art algorithms. There exist several drawbacks to this database, though. For example, most objects are located at the center of an image with little cluttered background, and many classes are devoid of pose and scale variability. Moreover, the presence of rotation artifacts tends to make some classes (e.g., minaret) much easier to identify.
In this experiment, the representation step is the same as in scene recognition: first extract SIFT descriptors within a 20 x 20 sliding window, and then learn a 512-mixture GMM and Gaussian maps for each image. For the experiment setup, we follow the standard procedure, namely we randomly select 15 or 30 training images per class and 50 for testing. The recognition rate is then computed as the average of the per-class accuracies. Similar to the previous experiments, the entire procedure is repeated ten times, and the average performance and its standard deviation are reported.
Table 3 shows a performance comparison of the HG representation with several recently reported methods, all based on a single descriptor. At both training/testing settings, the HG representation achieves the best result, i.e., 65.5% for 15 training images and 73.1% for 30 training images. It is worth noting that most of the previous methods use computation-intensive classifiers, such as support vector machines, whereas our results are obtained with the simple nearest centroid classifier.
References

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11-32, 1991.