
Hierarchical Gaussianization for Image Classification
Xi Zhou, Na Cui, Zhen Li, Feng Liang, and Thomas S. Huang
Dept. of ECE, University of Illinois at Urbana-Champaign
Dept. of Statistics, University of Illinois at Urbana-Champaign
{xizhou2, nacui2, zhenli3, liangf}@uiuc.edu, huang@ifp.uiuc.edu
Abstract
In this paper, we propose a new image representation to capture both the appearance and spatial information for image classification applications. First, we model the feature vectors, from the whole corpus, from each image and at each individual patch, in a Bayesian hierarchical framework using mixtures of Gaussians. After such a hierarchical Gaussianization, each image is represented by a Gaussian mixture model (GMM) for its appearance, and several Gaussian maps for its spatial layout. Then we extract the appearance information from the GMM parameters, and the spatial information from global and local statistics over Gaussian maps. Finally, we employ a supervised dimension reduction technique called DAP (discriminant attribute projection) to remove noise directions and to further enhance the discriminating power of our representation. We justify that the traditional histogram representation and the spatial pyramid matching are special cases of our hierarchical Gaussianization. We compare our new representation with other approaches in scene classification, object recognition and face recognition, and our performance ranks among the top in all three tasks.
1. Introduction
Histogram representation, as a description for orderless patch-based features, has been widely used in visual recognition and image retrieval [4, 5]. Despite its popularity, however, the histogram representation has some intrinsic limitations. For example, it is sensitive to several factors such as outliers, the choice of bins, and the noise level in the data. Most importantly, encoding high-dimensional feature vectors with a relatively small codebook leads to large quantization errors and a loss of discriminability [21]. Furthermore, the histogram representation discards all the spatial configuration of image patches, which is a key attribute for object and scene classification.
Figure 1. (a) is an input image. (b) shows the patch features in the feature space. Each "+" denotes a feature vector, whose distribution is approximated by a GMM. (c) shows a set of Gaussian maps, each of which corresponds to one Gaussian component in (b). A supervised dimension reduction algorithm, DAP, is performed in (d) to form the final image representation, the hierarchical Gaussianization vector.

Several approaches have been proposed in the literature to overcome these limitations. Soft assignment, which allows each feature vector to belong to multiple histogram bins, has been suggested to capture partial similarity between images [16, 19, 18, 26, 27, 28]. To enhance the discriminating capability of histograms, Farquhar et al. [12] and Perronnin et al. [16] introduced several ways to construct category-specific histograms, Larlus et al. [13] and Yang et al. [19] suggested integrating histogram construction with classifier training, and Moosmann et al. [15] proposed using randomized forests to build discriminative histograms. As a flexible way to model a variety of distributions, the GMM has emerged as a better alternative to histograms in age estimation, object classification and video event analysis [2, 1, 3]. On the other hand, to alleviate the loss of spatial information in the histogram representation, one of the most successful approaches by far is the spatial pyramid matching (SPM) technique proposed by Lazebnik et al. [11].
In this paper, we propose a new model-based representation for image features, capturing both the appearance and spatial information. First, we adopt a hierarchical GMM for feature vectors at different levels: the whole corpus, each image, and individual patches. We learn the image-specific GMM in a Bayesian framework to allow information sharing across different images and to bridge the universal and individual information retrievals. Given an image-specific GMM, each patch of that image is assigned to a Gaussian component with respect to a posterior probability. All these probabilities constitute a set of so-called Gaussian maps over the entire patch grid. After obtaining a GMM and Gaussian maps for each image, a process we term Hierarchical Gaussianization (HG), we extract the appearance information from the GMM parameters, and the spatial information from global and local summary statistics over Gaussian maps. Finally, all parameters of the GMM and statistics of the Gaussian maps are concatenated as a super-vector, followed by a supervised dimension reduction to further enhance the discriminating power of the representation. An illustration of this new representation is shown in Figure 1.
The remainder of this paper is organized as follows. In Section 2, we introduce the new image representation that incorporates both the visual and spatial information. In Section 3, we justify that the histogram representation and the spatial pyramid matching are special cases of the HG representation. In Section 4, we demonstrate the effectiveness of our approach on three image databases. Conclusions are given in Section 5.
2. Hierarchical Gaussianization representation
2.1. GMMs for appearance representation
Let z denote a p-dimensional feature vector from the I-th image. We model z by a GMM, namely,

p(z \mid \Theta) = \sum_{k=1}^{K} w_k^I \, \mathcal{N}(z; \mu_k^I, \Sigma_k^I),   (1)

where K denotes the total number of Gaussian components, and (w_k^I, \mu_k^I, \Sigma_k^I) are the image-specific weight, mean and covariance matrix of the k-th Gaussian component, respectively. For computational efficiency, we restrict the covariance matrix \Sigma_k^I to be a diagonal matrix \Sigma_k shared by all images.
The number of model parameters in \Theta = \{w_k^I, \mu_k^I, \Sigma_k\}_{k=1:K, I=1:N} grows linearly with N, the number of training images. In practice the number of patches from one image is usually small and thus insufficient for a robust estimate of all parameters. To overcome this problem, we propose a hierarchical Bayesian framework to jointly estimate all the GMM parameters. We model the image-specific GMM parameters w_k^I's and \mu_k^I's by conjugate priors:

(w_1^I, \ldots, w_K^I) \sim \mathrm{Dir}(T w_1, \ldots, T w_K),

\mu_k^I \sim \mathcal{N}(\mu_k, \Sigma_k / r), \quad k = 1:K.

The prior distribution over the weights w_k^I's is a Dirichlet distribution with parameters (T w_1, \ldots, T w_K), which can be interpreted as adding T pseudo-counts in total, with a w_k fraction of them from the k-th component. The prior distribution for the means \mu_k^I's is a Gaussian centered at a global mean \mu_k with a covariance matrix shrunk by a smoothing parameter r. Note that such a prior specification imposes dependence between images. The rationale behind this is to "borrow" strength across similar images for estimation and therefore overcome the small-sample-size issue suffered in conventional learning processes.
We estimate the prior mean vectors \mu_k, prior weights w_k and covariance matrices \Sigma_k by fitting a global GMM on the whole corpus, and the remaining parameters by maximizing the following Maximum A Posteriori (MAP) objective,

\max_{\Theta} \; \big[ \ln p(z \mid \Theta) + \ln p(\Theta) \big].

The MAP estimates can be obtained via an EM algorithm. In the E-step, we compute, for the feature vectors z_1, \ldots, z_N extracted from image I,

\Pr(k \mid z_i) = \frac{w_k^I \, \mathcal{N}(z_i; \mu_k^I, \Sigma_k)}{\sum_{j=1}^{K} w_j^I \, \mathcal{N}(z_i; \mu_j^I, \Sigma_j)},   (2)

n_k = \sum_{i=1}^{N} \Pr(k \mid z_i),   (3)

and in the M-step, we update

\hat{w}_k^I = \gamma_k \, n_k / N + (1 - \gamma_k) \, w_k,   (4)

\hat{\mu}_k^I = \alpha_k \, m_k + (1 - \alpha_k) \, \mu_k,   (5)

where

m_k = \frac{1}{n_k} \sum_{i=1}^{N} \Pr(k \mid z_i) \, z_i, \qquad \alpha_k = \frac{n_k}{n_k + r}, \qquad \gamma_k = \frac{N}{N + T}.
If a Gaussian component has a high probabilistic count n_k, then \alpha_k approaches 1 and the adapted parameters emphasize the new sufficient statistics m_k; otherwise, the adapted parameters are dominated by the global model \mu_k. The tuning parameters r and T also affect the MAP adaptation: in general, the larger r and T are, the larger the influence of the prior distribution on the adaptation. For example, as r goes to infinity, the MAP estimate of \mu_k^I stays fixed at the prior mean, and the same holds for T and w_k^I. In practice we adjust r and T empirically, based on the total number of coordinate patches for each image.
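To make equations (2)-(5) concrete, here is a minimal NumPy sketch of the one-pass MAP adaptation, assuming the global GMM (weights w, means mu, shared diagonal covariances var) has already been fit on the whole corpus; the array shapes, names, default hyper-parameters and single-iteration update are our illustrative choices, not the authors' code.

import numpy as np

def map_adapt(Z, w, mu, var, r=16.0, T=10.0):
    """One pass of MAP adaptation (eqs. 2-5) of a global diagonal GMM to one image.

    Z: (N, p) patch feature vectors of the image.
    w: (K,) global weights; mu: (K, p) global means; var: (K, p) diagonal covariances.
    r, T: smoothing hyper-parameters (tuned empirically in the paper).
    Returns adapted weights, adapted means, and the (N, K) posterior matrix.
    """
    N, K = Z.shape[0], w.shape[0]
    # E-step, eq. (2): log N(z_i; mu_k, diag(var_k)) for every patch/component pair.
    log_norm = -0.5 * np.log(2.0 * np.pi * var).sum(axis=1)                     # (K,)
    sq = ((Z[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :]).sum(axis=2)  # (N, K)
    log_post = np.log(w) + log_norm - 0.5 * sq
    log_post -= log_post.max(axis=1, keepdims=True)    # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)            # Pr(k | z_i)
    # Eq. (3) and the weighted mean m_k.
    n = post.sum(axis=0)                               # (K,) soft counts
    m = (post.T @ Z) / np.maximum(n, 1e-10)[:, None]   # (K, p)
    # M-step, eqs. (4)-(5): convex combination of data statistics and prior.
    alpha = n / (n + r)
    gamma = N / (N + T)
    w_hat = gamma * n / N + (1.0 - gamma) * w
    mu_hat = alpha[:, None] * m + (1.0 - alpha[:, None]) * mu
    return w_hat, mu_hat, post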
After Gaussianization, we can calculate the similarity between a pair of images via the similarity between two GMMs. A common approach is to summarize the parameters of a GMM as a vector m, and then use some vector metric, such as the inner product [2, 1, 3]. Note that m = f(w, \mu, \Sigma) is in general a function involving all parameters of the corresponding GMM. In our experiments, we follow the suggestion in [3] and choose the appearance vector for an image x^I to be

m(x^I) = \big[ \sqrt{w_1^I} \, \Sigma_1^{-1/2} \mu_1^I ; \; \cdots ; \; \sqrt{w_K^I} \, \Sigma_K^{-1/2} \mu_K^I \big].   (6)
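Given the adapted parameters, the appearance super-vector of equation (6) is a few lines on top of the map_adapt sketch above; the square-root weighting follows our reading of the super-vector construction in the cited work [3], so treat this as a sketch rather than the authors' implementation.

import numpy as np

def appearance_vector(w_hat, mu_hat, var):
    """Eq. (6): stack sqrt(w_k) * Sigma_k^{-1/2} * mu_k over all K components."""
    scaled = np.sqrt(w_hat)[:, None] * mu_hat / np.sqrt(var)   # (K, p)
    return scaled.ravel()                                      # length K * p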
2.2. Gaussian maps for spatial representation
According to equation (2), the feature vector at each patch is again modeled by a mixture of Gaussians with mixture probabilities Pr(k | z_i). For a fixed k, all such probabilities Pr(k | z_i) form a map over the patch locations, which we refer to as a Gaussian map. While each Gaussian component represents some structure in the feature space, the corresponding Gaussian map shows the geometric location of that structure on an image. For a GMM with K components, we have K Gaussian maps, and we can learn the spatial information of an image by analyzing each of these Gaussian maps.
A natural way to summarize a Gaussian map is to use its mean location or normalized mean location. However, such global summary statistics do not work well for images. In Figure 2, we plot a subset of Gaussian maps for three images from the Caltech 101 database that is analyzed in Section 4. It is clear that local information is more important for the discriminant analysis than the global one.

Figure 2. Sample Gaussian maps of three images from the Caltech 101 dataset.
Therefore we propose to hierarchically split a Gaussian map and extract summary statistics over local regions. Specifically, each of the K Gaussian maps is divided into subregions based on a sequence of increasingly coarser grids; assuming there are M subregions in total, we then calculate some summary statistic \nu over each of the M regions. As a parallel form to (6), we define v(x^I), a vector expressing the spatial information of image x^I, as follows,

v(x^I) = \big[ \nu_{11}^I ; \cdots ; \nu_{M1}^I ; \; \nu_{12}^I ; \cdots ; \nu_{M2}^I ; \; \cdots ; \; \nu_{MK}^I \big].   (7)
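One possible realization of this pooling is sketched below: the per-patch posteriors are reshaped into K maps over the patch grid, and each map is averaged over the cells of a small pyramid. The mean as the summary statistic \nu and the 1x1 plus 2x2 grids are our assumptions for illustration; the paper leaves both choices open.

import numpy as np

def spatial_vector(post, grid_h, grid_w, levels=(1, 2)):
    """Eq. (7): pool each of the K Gaussian maps over pyramid cells.

    post: (N, K) posteriors Pr(k | z_i) for patches laid out on a
          grid_h x grid_w grid in row-major order (N = grid_h * grid_w).
    Returns the concatenated statistics, region-first within each component k.
    """
    K = post.shape[1]
    maps = post.T.reshape(K, grid_h, grid_w)       # K Gaussian maps
    feats = []
    for k in range(K):
        for g in levels:                           # 1x1 grid, then 2x2 grid, ...
            row_blocks = np.array_split(np.arange(grid_h), g)
            col_blocks = np.array_split(np.arange(grid_w), g)
            for rows in row_blocks:
                for cols in col_blocks:
                    # summary statistic nu: mean posterior inside the cell
                    feats.append(maps[k][np.ix_(rows, cols)].mean())
    return np.array(feats)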
2.3. Discriminant attribute projection
We concatenate the appearance vector m(x^I) and the spatial vector v(x^I) into a super-vector

\phi(x^I) = \big[ m(x^I) ; \; \eta \, v(x^I) \big],

where \eta is a tuning parameter balancing the information contribution from the two sources. However, directly employing such a high-dimensional vector for image classification may not lead to a good performance, because the super-vector is constructed without considering the inter-category or intra-category relationships.
To enhance the discriminating power of our representation, we propose to project \phi(x^I) onto a subspace that suppresses the directions with high intra-category variability. Let V denote the projection matrix toward the subspace with high intra-category variability; then (I - V V^T) \phi(x^I) is the discriminant projection we are looking for. We solve for V via the following objective function

V = \arg\max_{V : V^T V = I} \sum_{i \neq j} \| V^T \phi(x_i) - V^T \phi(x_j) \|^2 \, W_{ij},   (8)

where W_{ij} = 1 when x_i and x_j belong to the same category, and W_{ij} = 0 otherwise. Let \Phi = [\phi(x_1), \phi(x_2), \cdots, \phi(x_N)] be a matrix with N columns, where N is the total number of training images. It can be shown that the optimal solution for V consists of the eigenvectors corresponding to the largest eigenvalues of the matrix \Phi (D - W) \Phi^T, where D is a diagonal matrix with D_{ii} = \sum_{j=1}^{N} W_{ij} for all i.
Suppose we use the dot product as a similarity measure between super-vectors. After applying discriminant attribute projection (DAP), the similarity between two images x_a and x_b is equal to

D(x_a, x_b) = \phi(x_a)^T (I - V V^T) \, \phi(x_b).   (9)

That is, the projection onto V, which is irrelevant to the classification, is discarded in the similarity calculation.
In the DAP approach, each eigen-direction is either included or excluded from later analysis. An alternative is to adaptively shrink each direction of the subspace spanned by V: directions with larger eigenvalues are shrunk more and directions with smaller eigenvalues are shrunk less. Arranging all the shrinkage factors in a diagonal matrix C, the similarity metric (9) can be re-expressed as

D(x_a, x_b) = \phi(x_a)^T (I - V C V^T) \, \phi(x_b).   (10)

In our experiments, we set C = I - \Lambda^{-1}, where \Lambda is a diagonal matrix with the eigenvalues of the matrix \Phi (D - W) \Phi^T.
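The following sketch shows one way to learn V from the training super-vectors and to evaluate the shrunk similarity of equation (10), using a plain symmetric eigendecomposition; the number of retained noise directions d is a free parameter this section does not fix, and all names are ours.

import numpy as np

def fit_dap(Phi, labels, d):
    """Learn DAP (eq. 8) from training super-vectors.

    Phi: (p, N) matrix with super-vectors phi(x_1), ..., phi(x_N) as columns.
    labels: (N,) category labels. d: number of noise directions to model.
    Returns V (p, d) and the shrinkage matrix C = I - Lambda^{-1} (d, d).
    """
    labels = np.asarray(labels)
    W = (labels[:, None] == labels[None, :]).astype(float)  # W_ij
    D = np.diag(W.sum(axis=1))                              # D_ii = sum_j W_ij
    S = Phi @ (D - W) @ Phi.T                               # Phi (D - W) Phi^T
    evals, evecs = np.linalg.eigh(S)                        # ascending eigenvalues
    idx = np.argsort(evals)[::-1][:d]                       # top-d directions
    V, lam = evecs[:, idx], evals[idx]
    C = np.eye(d) - np.diag(1.0 / lam)                      # shrinkage of eq. (10)
    return V, C

def similarity(phi_a, phi_b, V, C):
    """Eq. (10): D(x_a, x_b) = phi_a^T (I - V C V^T) phi_b."""
    return phi_a @ phi_b - (phi_a @ V) @ C @ (V.T @ phi_b)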
3. Connection to previous work
3.1. Histogram as a special case of GMMs
It is easy to see that the histogram representation is a special case of GMMs in which only the weights w_k^I are adapted: if we set the hyper-parameters T = 0 and r = \infty, then from equations (4) and (5) all the image-specific GMMs share the same mean vectors and covariance matrices, and therefore the only information captured by the GMMs is the weights w_k^I, which are proportional to the histogram counts.
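For completeness, this limiting behavior can be read directly off equations (4) and (5); the following two-line derivation is ours, not the paper's:

\gamma_k = \frac{N}{N + 0} = 1 \;\Rightarrow\; \hat{w}_k^I = \frac{n_k}{N}, \qquad \alpha_k = \lim_{r \to \infty} \frac{n_k}{n_k + r} = 0 \;\Rightarrow\; \hat{\mu}_k^I = \mu_k.

Each image is then summarized only by the normalized soft counts n_k / N, i.e., a (soft-assignment) histogram.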
Here we want to highlight three aspects in which the GMM-based approach extends histograms. First, histograms use the Euclidean distance as the clustering metric in constructing bins, while GMMs use the Mahalanobis distance, which takes into account the heterogeneity among features. Second, histograms use a hard decision rule in distributing feature vectors into bins, and the resulting data summary is sensitive to noise, while GMMs use a soft decision rule in distributing feature vectors to Gaussian components, and the resulting probabilistic summary of the data is more robust. The last and most important advantage of GMMs over histograms is the gain of information. Histograms summarize the appearance information of an image (i.e., a bag of feature vectors) by the counts in each histogram bin, which correspond to the weights of the Gaussian components in the adapted mixture model. In addition to the weights, GMMs summarize each image by the adapted mean vectors and covariance matrices, which provide richer information for constructing the super-vector and for calculating similarities between images.
3.2. SPM as a special case of Gaussian maps
To avoid the loss of spatial information with histograms, Lazebnik et al. [11] proposed a successful technique called spatial pyramid matching (SPM). In SPM, images are repeatedly divided into subregions, similarity measures are calculated for each subregion, and their weighted summation forms an overall similarity measure.

Since the histogram is a special case of GMMs, SPM corresponds to a hierarchical spatial model over a degenerate Gaussian map whose posterior probabilities are either 0 or 1. The particular similarity measure used by SPM, the histogram intersection function, corresponds to an intersection function defined over those posterior probabilities. So SPM can be viewed as a special case of Gaussian maps.
4. Experiments
In this section, we report the performance of our image representation on three diverse datasets: the fifteen scene categories [10], Caltech101 and the CMU PIE face database. We investigate the effectiveness of different aspects of our representation and further compare our results with existing work. All experiments are repeated ten times with different randomly selected training and testing images, and the average per-class recognition rate is recorded for each run. As we focus on the image representation, we simply employ the nearest centroid (NC) classifier in all the experiments. We perform all processing in grayscale, even when color images are available.
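Since the classifier matters when reading the tables below, here is a hedged sketch of a nearest centroid decision combined with the DAP similarity of equation (10); the paper does not spell this combination out, so the centroid-in-super-vector-space choice is our assumption.

import numpy as np

def nearest_centroid_predict(Phi_train, y_train, Phi_test, V, C):
    """Nearest centroid in super-vector space, scored with eq. (10).

    Phi_train, Phi_test: (p, N) matrices of super-vectors as columns.
    """
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    centroids = np.stack([Phi_train[:, y_train == c].mean(axis=1)
                          for c in classes], axis=1)         # (p, n_classes)
    # D(x, c) = x^T (I - V C V^T) c for all test/centroid pairs
    sim = Phi_test.T @ centroids - (Phi_test.T @ V) @ C @ (V.T @ centroids)
    return classes[np.argmax(sim, axis=1)]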
4.1. Scene category recognition
The scene database is composed of fifteen scene categories, thirteen provided by Fei-Fei et al. in [10] and the other two collected by Lazebnik et al. in [11]. Each scene category contains 200 to 400 images. The average size of the images is around 300 x 250 pixels. This database is one of the most comprehensive scene category databases used in the literature. Example images of different scene categories of this database are illustrated in Figure 3.

Figure 3. Example images from the scene category database (panels: CALsuburb, MITcoast, MITforest, MIThighway, MITinsidecity, MITmountain, MITopencountry, MITstreet, MITtallbuilding, PARoffice, bedroom, kitchen).
Here, the experiment setting is purposely made the same as that in [10] and [11] to guarantee the fairness of the performance comparison. Specifically, all experiments are repeated ten times with 100 randomly selected images per class for training and the rest for testing. The 128-dimensional SIFT vector is extracted within a 20 x 20 patch over a grid with a spacing of 5 pixels. The dimension of the SIFT descriptor is reduced to 64 by Principal Component Analysis (PCA). The GMM contains 512 Gaussian components, while the histogram contains 512 bins.
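A rough sketch of this feature pipeline, with OpenCV's SIFT and scikit-learn's PCA and GaussianMixture standing in for tooling the paper does not name, might look as follows; the patch size, stride, and model sizes are taken from the text above.

import cv2
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def dense_sift(gray, patch=20, stride=5):
    """128-D SIFT descriptors on a dense grid of 20x20 patches, 5-px spacing."""
    sift = cv2.SIFT_create()
    h, w = gray.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), float(patch))
                 for y in range(patch // 2, h - patch // 2, stride)
                 for x in range(patch // 2, w - patch // 2, stride)]
    _, desc = sift.compute(gray, keypoints)
    return desc  # (num_patches, 128)

# training_images: iterable of grayscale uint8 arrays, assumed loaded elsewhere.
all_desc = np.vstack([dense_sift(img) for img in training_images])
pca = PCA(n_components=64).fit(all_desc)            # SIFT reduced to 64 dims
ubm = GaussianMixture(n_components=512, covariance_type="diag")
ubm.fit(pca.transform(all_desc))                    # global corpus-level GMM

The fitted ubm.weights_, ubm.means_ and ubm.covariances_ would then play the role of the priors (w_k, \mu_k, \Sigma_k) of Section 2.1.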
Table 1. Performance comparison on the scene category database.

Algorithm        Average accuracy (%)
Histogram [10]   65.2
SPM [11]         81.4
HG               85.2
Table 1 compares our approach with several existing systems on the scene classification task. The result in [10] is 65.2%, based on a histogram representation without any spatial information. In [11], Lazebnik et al. introduced spatial pyramid matching (SPM) to incorporate spatial information into the histogram representation and reported an accuracy of 81.4% using an SVM with a nonlinear histogram intersection kernel. In our experiment, with a simple nearest centroid (NC) classifier, the HG representation achieves a superior accuracy of 85.2%. The results are consistent with our analysis in the previous sections: HG is more general than both histogram and SPM.
Table 2. Classification results on the scene category database.

Algorithm   Average accuracy (%)
Histogram   41.8
GMM         75.8
GMM+GM      80.4
GMM+DAP     82.1
HG          85.2
Table 2 gives an in-depth analysis of the effectiveness of each aspect of our representation. Here all the results are obtained with the nearest centroid (NC) classifier. The table shows the performance as the components of our representation are added one by one. It is evident that the three components, the GMM for appearance representation, Gaussian maps (GM) for spatial layout encoding and DAP for discriminant dimension reduction, jointly improve the recognition accuracy. Note that [11] reported an accuracy of 74.2% based on the histogram representation, which is higher than the 41.8% here; this is because [11] employed a nonlinear histogram intersection kernel for the SVM. This indicates that the performance of the histogram representation is sensitive to the choice of kernel metric, and relies heavily on the classifier.
Figure 4 shows the confusion matrix between the fifteen scene categories for the HG representation.

Figure 4. Confusion matrix on the scene category database for the HG representation (rows and columns: CALsuburb, MITcoast, MITforest, MIThighway, MITinsidecity, MITmountain, MITopencountry, MITstreet, MITtallbuilding, PARoffice, bedroom, industrial, kitchen, livingroom, store). The average classification accuracy is 85.2%. The entry in the i-th row and j-th column is the percentage of images from class i that were misidentified as class j.
4.2. Object recognition
Our second set of experiments is conducted on the Caltech101 database. This database consists of 101 object classes with high intra-class appearance and shape variability. The number of images in each class varies from 31 to 800, and most images are of medium resolution (about 300 x 300 pixels). This database is one of the most diverse and thoroughly studied databases for object recognition, and significant progress has been made on it by state-of-the-art algorithms. There exist several drawbacks to this database, though. For example, most objects are located at the center of an image with little cluttered background, and many classes are devoid of pose and scale variability. Moreover, the presence of rotation artifacts tends to make some classes (e.g., minaret) much easier to identify.
In this experiment, the representation step is the same as in scene recognition: first extract SIFT descriptors within a 20 x 20 sliding window, and then learn a 512-mixture GMM and Gaussian maps for each image. For the experiment setup, we follow the standard procedure, namely we randomly select 15 or 30 training images per class and 50 for testing. The recognition rate is then computed as the average of the per-class accuracies. Similar to the previous experiments, the entire procedure is repeated ten times, and the average performance and its standard deviation are reported.
Table 3 shows a performance comparison of the HG representation with several recently reported methods, all based on a single descriptor. At both training/testing settings, the HG representation achieves the best result, i.e., 65.5% for 15 training images and 73.1% for 30 training images. It is worth noting that most of the previous methods use computation-intensive classifiers, such as support vector machines, whereas our results are obtained with the simple nearest centroid classifier.
References

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11-32, 1991.