Deep Quantization: Encoding Convolutional Activations
with Deep Generative Model
Zhaofan Qiu, Ting Yao, and Tao Mei
University of Science and Technology of China, Hefei, China
Microsoft Research, Beijing, China
zhaofanqiu@gmail.com, {tiyao, tmei}@microsoft.com
Abstract
Deep convolutional neural networks (CNNs) have proven highly effective for visual recognition, where learning a universal representation from activations of a convolutional layer is a fundamental problem. In this paper, we present Fisher Vector encoding with Variational Auto-Encoder (FV-VAE), a novel deep architecture that quantizes the local activations of a convolutional layer in a deep generative model, by training them in an end-to-end manner. To incorporate the FV encoding strategy into deep generative models, we introduce the Variational Auto-Encoder model, which steers variational inference and learning in a neural network that can be straightforwardly optimized using the standard stochastic gradient method. Different from the FV characterized by conventional generative models (e.g., the Gaussian Mixture Model), which parsimoniously fit a discrete mixture model to the data distribution, the proposed FV-VAE is more flexible in representing the natural property of the data for better generalization. Extensive experiments are conducted on three public datasets, i.e., UCF101, ActivityNet, and CUB-200-2011, in the context of video action recognition and fine-grained image classification, respectively. Superior results are reported when compared to state-of-the-art representations. Most remarkably, our proposed FV-VAE achieves to-date the best published accuracy of 94.2% on UCF101.
1. Introduction
The recent advances in deep convolutional neural networks (CNNs) have demonstrated high capability in visual recognition. For instance, an ensemble of residual nets [7] achieves 3.57% top-5 error on the ImageNet dataset [26]. More importantly, when utilizing the activations of either a fully-connected layer or a convolutional layer in a pre-trained CNN as a universal visual representation and applying this representation to other visual recognition tasks (e.g., scene understanding and semantic segmentation), CNNs also manifest impressive performance. Further improvements are expected when CNNs are fine-tuned with only a small amount of task-specific data.

(This work was performed when Zhaofan Qiu was visiting Microsoft Research as a research intern.)

Figure 1. Visual representations derived from activations of different layers in CNNs (upper row: global activations of the fully-connected layer; middle row: convolutional activations with Fisher Vector encoding; bottom row: convolutional activations with our FV-VAE encoding).
The activations of different layers in CNNs are generally grouped into two dimensions: global activations and convolutional activations. The former directly takes the activations of the fully-connected layer as visual representations, which are holistic over the entire image, as shown in the upper row of Figure 1. The latter, in contrast, creates visual representations by encoding a set of regional and local activations from a convolutional layer into a vectorial representation using quantization strategies. For example, Fisher Vector (FV) [23] is one of the most successful quantization approaches, as shown in the middle row of Figure 1. While superior results by aggregating convolutional activations are reported in most recent studies [3, 44], convolutional activations are first extracted as local descriptors followed by another separate quantization step. Thus such descriptors may not be optimally compatible with the encoding process, making the quantization sub-optimal. Furthermore, as discussed in [13], the generative model behind FV, i.e., the Gaussian Mixture Model (GMM), cannot always represent the natural clustering of the descriptors, and its inflexible Gaussian observation model limits its generalization ability.
We show in this paper that these two limitations can be mitigated by designing a deep architecture for representation learning that combines convolutional activation extraction and quantization into one-stage learning. Specifically, we present a novel Fisher Vector encoding with Variational Auto-Encoder (FV-VAE) framework to encode convolutional activations with a deep generative model (i.e., VAE), as shown in the bottom row of Figure 1. The pipeline of the proposed deep architecture generally consists of two components: a sub-network with a stack of convolution layers to produce convolutional activations, followed by a VAE structure that aggregates the regional convolutional descriptors into a FV. VAE consists of hierarchies of conditional stochastic variables and is a highly expressive model obtained by optimizing a variational approximation (an inference/recognition model) to the intractable posterior of the generative distribution. Compared to the traditional GMM, which has the form of a mixture of fixed Gaussian components, the inference model here can be regarded as an alternative that predicts specific Gaussian components for different inputs with a single neural network, making it more flexible. It is also worth noting that a classification loss is additionally considered to preserve the semantic information in the training stage. The entire architecture is trainable in an end-to-end fashion. Furthermore, in the feature extraction stage, we theoretically prove that the FV of the input descriptors can be directly computed by accumulating the gradient vector of the reconstruction loss in the VAE through back-propagation.

The main contribution of this work is the proposal of the FV-VAE architecture to encode convolutional descriptors with a Variational Auto-Encoder. We theoretically formulate the computation of FV in VAE and substantiate an implementation of FV-VAE for visual representation learning.
2. Related Work
In the literature, visual representation generation from a pre-trained CNN model has proceeded along two dimensions: global activations and convolutional activations. The first is to extract the visual representation from global activations in a CNN directly, e.g., the outputs of the fully-connected layer in VGG [30] or the pool5 layer in ResNet [7]. In practice, this scheme often starts by pre-training the CNN model on a large dataset (e.g., ImageNet) and then fine-tuning the CNN architecture with a small amount of task-specific data to better characterize the intrinsic information in the target scenario. The visual representation learnt in this direction has been exploited in a broad range of computer vision tasks including fine-grained image classification [1, 17], video action recognition [18, 19, 24, 29] and visual captioning [36, 38].
An alternative scheme is to utilize the activations from convolutional layers in a CNN as regional and local descriptors. Compared to global activations, convolutional activations from a CNN are embedded with rich spatial information, making them more transferable to different domains and more robust to translation and rotation, which has shown effectiveness in several technological advances, e.g., Spatial Pyramid Pooling (SPP) [6], Fast R-CNN [5] and Fully Convolutional Networks (FCNs) [22]. Recently, many works attempt to produce visual representations by encoding convolutional activations with different quantization strategies. For example, Fisher Vector [23] is computed on the output of the last convolutional layer of VGG networks for describing texture in [3]. Similar in spirit, Xu et al. utilize VLAD [11] to encode convolutional descriptors of video frames for multimedia event detection [44]. In [28] and [43], Sharma et al. and Xu et al. dynamically pool convolutional descriptors with attention models for action recognition and image captioning, respectively. Furthermore, convolutional descriptors of one convolutional layer are pooled with the guidance of the activations of the successive convolutional layer in [21]. In [20], convolutional descriptors from two CNNs are multiplied using the outer product and pooled to obtain the bilinear vector.
In summary, our work belongs to the second dimension and aims to compute FV on convolutional activations with deep generative models. We exploit the Variational Auto-Encoder for this purpose, which optimizes an inference model to the intractable posterior. The high flexibility of the inference model and the efficiency of the structure optimization make VAE more advanced than the traditional GMM. Our work in this paper contributes by studying not only encoding convolutional activations in a deep architecture, but also theoretically figuring out the computation of FV based on the VAE architecture.
3. Fisher Vector Meets VAE
In this section, we first recall the Fisher Vector theory, followed by presenting how to estimate the probability density function in FV through VAE. The optimization of VAE is then elaborated, and finally we describe how to compute the FV of the input descriptors.
3.1. Fisher Vector Theory
Figure 2. The overview of FV learning with VAE: (a) the training process of VAE; (b) FV extraction based on VAE.

Suppose we have two sets of local descriptors $X = \{x_t\}_{t=1}^{T_x}$ and $Y = \{y_t\}_{t=1}^{T_y}$ with $T_x$ and $T_y$ descriptors, respectively. Let $x_t, y_t \in \mathbb{R}^d$ denote the $d$-dimensional features of each descriptor. In order to measure the similarity between the two sets, a kernel method is employed by mapping them into a hyperspace. Specifically, assuming that the generation process of descriptors in $\mathbb{R}^d$ can be modeled by a probability density function $u_\theta$ with $M$ parameters $\theta = [\theta_1, \ldots, \theta_M]^\top$, the Fisher Kernel (FK) [9] between the two sets $X$ and $Y$ is given by

$K(X, Y) = {G^X_\theta}^\top F_\theta^{-1} G^Y_\theta$,  (1)

where $G^X_\theta = \nabla_\theta \log u_\theta(X)$ is defined as the Fisher score function, i.e., the gradient of the log-likelihood of the set under the generative model, and $F_\theta = E_{X \sim u_\theta}[G^X_\theta {G^X_\theta}^\top]$ is the Fisher Information Matrix (FIM) of $u_\theta$, which is regarded as a statistical feature normalization. Since $F_\theta$ is positive semi-definite, the FK in Eq. (1) can be re-written explicitly as an inner product in the hyperspace:

$K(X, Y) = {\mathcal{G}^X_\theta}^\top \mathcal{G}^Y_\theta$,  (2)

where

$\mathcal{G}^X_\theta = F_\theta^{-\frac{1}{2}} G^X_\theta = F_\theta^{-\frac{1}{2}} \nabla_\theta \log u_\theta(X)$.  (3)

Formally, $\mathcal{G}^X_\theta$ is well known as the Fisher Vector (FV). The dimension of the FV is equal to the number of generative parameters $\theta$, which is often much higher than that of the descriptor, giving the FV higher descriptive capability.
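To make Eq. (3) concrete, here is a minimal NumPy sketch that computes the Fisher Vector of a descriptor set with respect to the mean of a single diagonal Gaussian taken as $u_\theta$. The single-Gaussian choice and the closed forms for its score and (diagonal) FIM are assumptions of this illustration, not part of the paper, which uses a GMM or the VAE introduced next; only the score and FIM change for those models, while the normalization in Eq. (3) is applied in the same way.

import numpy as np

def fisher_vector_single_gaussian(X, mu, sigma):
    """Fisher Vector of a set X (T x d) w.r.t. the mean of N(mu, diag(sigma^2)).

    Score: G_mu = sum_t (x_t - mu) / sigma^2
    FIM:   F_mu = T * diag(1 / sigma^2)  (for T i.i.d. descriptors)
    FV:    F_mu^{-1/2} G_mu = (1 / sqrt(T)) * sum_t (x_t - mu) / sigma
    """
    T = X.shape[0]
    score = ((X - mu) / sigma**2).sum(axis=0)   # gradient of log-likelihood w.r.t. mu
    fim_inv_sqrt = sigma / np.sqrt(T)           # diagonal of F^{-1/2}
    return fim_inv_sqrt * score                 # Eq. (3): F^{-1/2} * grad log u(X)

# toy usage: 50 descriptors of dimension 512
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 512))
mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-6
fv = fisher_vector_single_gaussian(X, mu, sigma)
print(fv.shape)  # (512,) -- one entry per generative parameter (here, the mean)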
3.2. Probability Estimation through VAE
Next, we discuss how to estimate the probability density function $u_\theta$ in FV. In general, $u_\theta$ is chosen to be a Gaussian Mixture Model (GMM) [27, 40], as one can approximate any distribution with arbitrary precision by a GMM, in which $\theta$ is composed of the mixture weights, means and covariances of the Gaussian components. The need for a large number of mixture components and the inefficient optimization of the Expectation-Maximization algorithm, however, make the parameter learning computationally expensive and difficult to apply to large-scale complex data. Inspired by the idea of deep generative models [16, 25], which enable flexible and efficient inference learning in a neural network, we develop a Variational Auto-Encoder (VAE) to generate the probability function $u_\theta$.
Algorithm 1 Variational Auto-Encoding (VAE) Optimization
1: Input: training set $X = \{x_t\}_{t=1}^{T_x}$, corresponding labels $L = \{l_t\}_{t=1}^{T_x}$, loss weights $\lambda_1$, $\lambda_2$, $\lambda_3$.
2: Initialization: randomly initialized $\theta_0$, $\phi_0$.
3: Output: VAE parameters $\theta^*$, $\phi^*$.
4: repeat
5:   Sample $x_t$ in the minibatch.
6:   Encoder: $\mu_{z_t} \leftarrow f_\phi(x_t)$.
7:   Sampling: $z_t \leftarrow \mu_{z_t} + \epsilon \odot \sigma_z$, $\epsilon \sim \mathcal{N}(0, I)$.
8:   Decoder: $\mu_{x_t} \leftarrow f_\theta(z_t)$.
9:   Compute reconstruction loss: $L_{rec} = -\log p_\theta(x_t|z_t) = -\log \mathcal{N}(x_t; \mu_{x_t}, \sigma_x^2 I)$.
10:  Compute regularization loss: $L_{reg} = \frac{1}{2}\|\mu_{z_t}\|^2 + \frac{1}{2}\|\sigma_z\|^2 - \frac{1}{2}\sum_{k=1}^{d}(1 + \log \sigma_{z(k)}^2)$.
11:  Compute classification loss: $L_{cls} = \mathrm{softmax\_loss}(z_t, l_t)$.
12:  Fuse the three losses: $\mathcal{L}(\theta, \phi) = \lambda_1 L_{rec}(\theta, \phi) + \lambda_2 L_{reg}(\phi) + \lambda_3 L_{cls}(\phi)$.
13:  Back-propagate the gradients.
14: until maximum iteration reached.
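As an illustration of Algorithm 1, the following PyTorch sketch performs one optimization step. The layer sizes, the MLP encoder, and the treatment of $\sigma_x$ and $\sigma_z$ as shared learnable parameters (as described in Section 3.3) are assumptions of this sketch, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_in, d_z, n_cls = 512, 256, 101           # assumed sizes: descriptor dim, latent dim, #classes
encoder = nn.Sequential(nn.Linear(d_in, 512), nn.ReLU(), nn.Linear(512, d_z))  # MLP encoder f_phi
decoder = nn.Linear(d_z, d_in)             # one-layer decoder f_theta (ReLU applied below)
classifier = nn.Linear(d_z, n_cls)         # for the classification loss on z_t
log_var_x = nn.Parameter(torch.zeros(1))   # shared sigma_x^2, learnt by gradient descent
log_var_z = nn.Parameter(torch.zeros(d_z)) # shared sigma_z^2, learnt by gradient descent

params = list(encoder.parameters()) + list(decoder.parameters()) + \
         list(classifier.parameters()) + [log_var_x, log_var_z]
opt = torch.optim.SGD(params, lr=1e-2)
lam1, lam2, lam3 = 1.0, 1.0, 1.0           # loss weights lambda_1..3

def train_step(x, labels):
    mu_z = encoder(x)                                       # step 6: mu_z <- f_phi(x)
    sigma_z = torch.exp(0.5 * log_var_z)
    z = mu_z + torch.randn_like(mu_z) * sigma_z             # step 7: reparameterized sampling
    mu_x = F.relu(decoder(z))                               # step 8: mu_x <- f_theta(z)
    var_x = torch.exp(log_var_x)
    # step 9: L_rec = -log N(x; mu_x, sigma_x^2 I), up to additive constants
    l_rec = (0.5 * ((x - mu_x) ** 2) / var_x + 0.5 * log_var_x).sum(dim=1).mean()
    # step 10: L_reg = KL( N(mu_z, sigma_z^2 I) || N(0, I) )
    l_reg = (0.5 * mu_z.pow(2).sum(dim=1) + 0.5 * sigma_z.pow(2).sum()
             - 0.5 * (1.0 + log_var_z).sum()).mean()
    # step 11: softmax classification loss on z
    l_cls = F.cross_entropy(classifier(z), labels)
    loss = lam1 * l_rec + lam2 * l_reg + lam3 * l_cls       # step 12: fuse the three losses
    opt.zero_grad(); loss.backward(); opt.step()            # step 13: back-propagate
    return loss.item()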
Following the notations in Section 3.1 and assuming that all the descriptors in the set are independent, the log-likelihood of the set can be calculated as the sum over the log-likelihoods of the individual descriptors:

$\log u_\theta(X) = \sum_{t=1}^{T_x} \log p_\theta(x_t)$.  (4)
To model the probability of $x_t$ being generated from parameters $\theta$, an unobserved continuous random variable $z_t$ is involved with prior distribution $p_\theta(z)$, and each $x_t$ is generated from the conditional distribution $p_\theta(x|z)$. As such, each log-likelihood $\log p_\theta(x_t)$ can be measured using the Kullback-Leibler divergence ($D_{KL}$) as

$\log p_\theta(x_t) = D_{KL}(q_\phi(z|x_t)\,\|\,p_\theta(z|x_t)) + LB(\theta, \phi; x_t) \geq LB(\theta, \phi; x_t)$,  (5)

where $LB(\theta, \phi; x_t)$ is the variational lower bound on the likelihood of descriptor $x_t$ and can be written as

$LB(\theta, \phi; x_t) = -D_{KL}(q_\phi(z|x_t)\,\|\,p_\theta(z)) + E_{q_\phi(z|x_t)}[\log p_\theta(x_t|z)]$,  (6)

where $q_\phi(z|x)$ is a recognition model which is an approximation to the intractable posterior $p_\theta(z|x)$. In our proposed FV-VAE method, we use this lower bound $LB(\theta, \phi; x_t)$ as an approximation of the log-likelihood. Through this approximation, the generative model can be divided into two parts: an encoder $q_\phi(z|x)$ and a decoder $p_\theta(x|z)$, predicting the hidden and visible probability, respectively.
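For readers unfamiliar with variational inference, the decomposition in Eqs. (5)-(6) follows from a standard identity; the short derivation below is reproduced here for convenience and is not an addition by the authors:

$D_{KL}(q_\phi(z|x_t)\,\|\,p_\theta(z|x_t)) = E_{q_\phi(z|x_t)}[\log q_\phi(z|x_t) - \log p_\theta(z|x_t)]$
$= E_{q_\phi(z|x_t)}[\log q_\phi(z|x_t) - \log p_\theta(x_t|z) - \log p_\theta(z)] + \log p_\theta(x_t)$
$= \log p_\theta(x_t) - \big(E_{q_\phi(z|x_t)}[\log p_\theta(x_t|z)] - D_{KL}(q_\phi(z|x_t)\,\|\,p_\theta(z))\big) = \log p_\theta(x_t) - LB(\theta, \phi; x_t)$.

Rearranging yields Eq. (5), and the non-negativity of the KL divergence gives the inequality.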

Figure 3. Visual representation learning framework for image and video recognition including our FV-VAE. Spatial Pyramid Pooling (SPP) is performed on the last pooling layer of the CNN to aggregate the local descriptors of a video frame; it applies four different max pooling operations and obtains (6 × 6), (3 × 3), (2 × 2) and (1 × 1) outputs for each convolutional filter, resulting in a total of 50 descriptors per frame. For an image, an input with a higher resolution of (448 × 448) is fed into the CNN and the activations of the last convolutional layer (conv5_4+relu in VGG_19) are extracted, leading to 28 × 28 (784) dense local descriptors per image. In the training epoch, the FV-VAE architecture is learnt by minimizing the overall loss. In the extraction epoch, the learnt FV-VAE encodes the set of local descriptors into a vectorial FV representation.
3.3. Optimization of VAE
The inference model parameters $\phi$ and generative model parameters $\theta$ are straightforward to optimize using the stochastic gradient descent method. More specifically, let the prior distribution be the standard normal distribution $p_\theta(z) = \mathcal{N}(z; 0, I)$, and let both the conditional distribution $p_\theta(x|z)$ and the posterior approximation $q_\phi(z|x)$ be multivariate Gaussian distributions $\mathcal{N}(x_t; \mu_{x_t}, \sigma_{x_t}^2 I)$ and $\mathcal{N}(z_t; \mu_{z_t}, \sigma_{z_t}^2 I)$, respectively. One-step Monte Carlo sampling is exploited to estimate the latent variable $z_t$. Hence, the lower bound in Eq. (6) can be rewritten as

$LB(\theta, \phi; x_t) \approx \log p_\theta(x_t|z_t) + \frac{1}{2}\sum_{k=1}^{d}\left(1 + \log \sigma_{z_t(k)}^2\right) - \frac{1}{2}\|\mu_{z_t}\|^2 - \frac{1}{2}\|\sigma_{z_t}\|^2$,  (7)

where $z_t$ is generated from $\mathcal{N}(\mu_{z_t}, \sigma_{z_t}^2 I)$, which is equivalent to $z_t = \mu_{z_t} + \epsilon \odot \sigma_{z_t}$, $\epsilon \sim \mathcal{N}(0, I)$.
Figure 2(a) illustrates an overview of our VAE training process and Algorithm 1 further details the optimization steps. It is also worth noting that, different from the training of the standard VAE method, which estimates $\sigma_x$ and $\sigma_z$ with another parallel encoder-decoder structure, we simply learn the two covariances by the gradient descent technique and share them across all the descriptors, significantly reducing the number of parameters learnt in the VAE in our case. In addition to the basic reconstruction loss and regularization loss, we further take a classification loss into account in our VAE training to incorporate semantic information, which has been shown effective in semi-supervised generative model learning [15]. The overall loss function is then given by

$\mathcal{L}(\theta, \phi) = \lambda_1 L_{rec}(\theta, \phi) + \lambda_2 L_{reg}(\phi) + \lambda_3 L_{cls}(\phi)$.  (8)

We fix $\lambda_1 = \lambda_2 = 1$ in Eq. (8) and will investigate the effect of the tradeoff parameter $\lambda_3$ in our experiments. During training, the gradients are calculated and back-propagated to the lower layers so that the lower layers can adjust their parameters to minimize the loss.
3.4. FV Extraction
After the optimization of the model parameters $[\theta^*, \phi^*]$, Figure 2(b) demonstrates how to extract the Fisher Vector based on the learnt VAE architecture.

By replacing the log-likelihood with its approximation, i.e., the lower bound $LB(\theta, \phi; x_t)$, we can obtain the FV in Eq. (3) as

$\mathcal{G}^X_\theta = F_\theta^{-\frac{1}{2}} \nabla_\theta \log u_\theta(X) = F_\theta^{-\frac{1}{2}} \sum_{t=1}^{T_x} \left[\nabla_\theta L_{rec}(x_t; \theta^*, \phi^*)\right]$,  (9)

which is the normalized gradient vector of the reconstruction loss and can be computed directly through the back-propagation operation. It is worth noticing that when extracting the FV representation, we withdraw the sampling operation and use $\mu_{z_t}$ as $z_t$ directly to avoid stochastic factors.
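Continuing the training sketch from Section 3.3 (reusing its encoder, decoder and log_var_x, all of which are assumptions of that sketch rather than the authors' code), FV extraction per Figure 2(b) and Eq. (9) can be approximated as follows; the $F_\theta^{-1/2}$ factor is assumed to be a precomputed diagonal vector obtained as in Eq. (12).

import torch
import torch.nn.functional as F  # assumes encoder, decoder, log_var_x from the training sketch

def extract_fv(X, fim_inv_sqrt):
    """X: (T_x, d_in) local descriptors of one image or frame set.
    fim_inv_sqrt: diagonal approximation of F^{-1/2} (see Eq. (12))."""
    n_params = sum(p.numel() for p in decoder.parameters())
    acc = torch.zeros(n_params)
    for x_t in X:                                   # accumulate over the descriptor set
        mu_z = encoder(x_t.unsqueeze(0))
        z = mu_z                                    # sampling withdrawn: z_t = mu_z_t
        mu_x = F.relu(decoder(z))
        # reconstruction loss up to terms that are constant w.r.t. the decoder parameters
        l_rec = (0.5 * (x_t.unsqueeze(0) - mu_x) ** 2 / torch.exp(log_var_x)).sum()
        grads = torch.autograd.grad(l_rec, list(decoder.parameters()))
        acc += torch.cat([g.flatten() for g in grads])  # sum_t grad_theta L_rec(x_t)
    return fim_inv_sqrt * acc                       # Eq. (9): normalize by F^{-1/2}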
4. Visual Representation Learning
By utilizing FV-VAE as a deep architecture for quantization, a general visual representation learning framework is devised for image and video recognition, respectively, as illustrated in Figure 3. The basic idea is to construct a set of convolutional descriptors for an image or video frames, followed by encoding them into a vectorial FV representation using the FV-VAE architecture. Both the training epoch and the FV extraction epoch are shown in Figure 3 and the entire framework is trainable in an end-to-end manner.

We exploit different aggregation strategies to construct the set of convolutional descriptors for video frames and images, respectively, due to their different properties. A video consists of a sequence of frames with large intra-class variations caused by, e.g., camera motion and illumination conditions, making the scale of an identical object vary across frames. Following [44], we employ Spatial Pyramid Pooling (SPP) [6] on the last pooling layer to extract scale-invariant local descriptors for video frames. For images, instead, we feed a higher-resolution (e.g., 448 × 448) input into the CNN to fully utilize the image information and extract the activations of the last convolutional layer (e.g., conv5_4+relu in VGG_19), resulting in dense local descriptors (e.g., 28 × 28) as in [20].
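The two aggregation strategies above can be sketched as follows; using torch.nn.functional.adaptive_max_pool2d for the four SPP grids and the channel/spatial sizes in the toy usage are assumptions of this illustration, though the grid sizes reproduce the 50 descriptors per frame and 784 per image noted in Figure 3.

import torch
import torch.nn.functional as F

def spp_descriptors(feat_map):
    """feat_map: (C, H, W) activations of the last pooling layer for one video frame.
    Returns (50, C): 6x6 + 3x3 + 2x2 + 1x1 = 50 local descriptors of dimension C."""
    descs = []
    for g in (6, 3, 2, 1):
        pooled = F.adaptive_max_pool2d(feat_map.unsqueeze(0), g)  # (1, C, g, g)
        descs.append(pooled.squeeze(0).flatten(1).t())            # (g*g, C)
    return torch.cat(descs, dim=0)

def dense_descriptors(conv_map):
    """conv_map: (C, 28, 28) activations of conv5_4+relu for a 448x448 image.
    Returns (784, C): each spatial position is one local descriptor."""
    return conv_map.flatten(1).t()

frame_feat = torch.randn(512, 14, 14)      # assumed channel/spatial sizes, for illustration only
image_feat = torch.randn(512, 28, 28)
print(spp_descriptors(frame_feat).shape)   # torch.Size([50, 512])
print(dense_descriptors(image_feat).shape) # torch.Size([784, 512])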
In our implementation, a Multi-Layer Perceptron (MLP) is employed as the encoder in FV-VAE and a one-layer decoder is developed to reduce the dimension of the FV representation. As such, the functions in Algorithm 1 can be specified as

$\mathrm{Encoder}: \mu_{z_t} \leftarrow MLP_\phi(x_t), \qquad \mathrm{Decoder}: \mu_{x_t} \leftarrow \mathrm{ReLU}(W_\theta z_t + b_\theta)$,  (10)

where $\{W_\theta, b_\theta\}$ are the decoder parameters $\theta$. The gradient vector of $L_{rec}$ is calculated as
$\nabla_\theta L_{rec}(x_t; \theta^*, \phi^*) = \mathrm{flatten}\left[\frac{\partial L_{rec}}{\partial W_\theta}, \frac{\partial L_{rec}}{\partial b_\theta}\right] = \mathrm{flatten}\left[\frac{\partial L_{rec}}{\partial \mu_{x_t}} \cdot z_t^\top, \frac{\partial L_{rec}}{\partial \mu_{x_t}}\right] = \mathrm{flatten}\left(\frac{\partial L_{rec}}{\partial \mu_{x_t}} \cdot [z_t^\top, 1]\right) = \mathrm{flatten}\left(\frac{\mu_{x_t} - x_t}{\sigma_x^2} \odot \mathbb{1}(\mu_{x_t} > 0) \cdot [z_t^\top, 1]\right)$,  (11)

where flatten denotes flattening a matrix into a vector, and $\odot$ denotes element-wise multiplication to filter the activated elements. Considering that it is difficult to obtain an analytic solution of the FIM in this case, we make an approximation by replacing the expectation with the average over the whole training set:

$F_\theta = E_{X \sim u_\theta}[G^X_\theta {G^X_\theta}^\top] \approx \mathrm{mean}_X[G^X_\theta {G^X_\theta}^\top]$,  (12)
and

$\mathcal{G}^X_\theta = \mathrm{flatten}\left(F_\theta^{-\frac{1}{2}} \cdot \sum_{t=1}^{T_x} \left(\frac{\mu_{x_t} - x_t}{\sigma_x^2} \odot \mathbb{1}(\mu_{x_t} > 0) \cdot [z_t^\top, 1]\right)\right)$,  (13)

which is the output FV representation in our framework.
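Eqs. (11)-(13) admit a direct implementation without per-descriptor back-propagation. The NumPy sketch below mirrors the closed form under the same one-layer ReLU decoder; treating $F_\theta^{-1/2}$ as a precomputed diagonal vector is an assumption made here for tractability.

import numpy as np

def fv_vae_closed_form(X, Z, W, b, sigma2_x, fim_inv_sqrt=None):
    """X: (T, d) descriptors; Z: (T, d_z) their latent means mu_z (used as z at extraction).
    W, b: one-layer decoder parameters; sigma2_x: shared reconstruction variance.
    Returns the flattened, accumulated gradient of L_rec w.r.t. [W, b], as in Eq. (13)."""
    acc = np.zeros((W.shape[0], W.shape[1] + 1))        # accumulator for [dL/dW, dL/db]
    for x_t, z_t in zip(X, Z):
        mu_x = np.maximum(W @ z_t + b, 0.0)             # decoder: ReLU(W z + b)
        delta = (mu_x - x_t) / sigma2_x * (mu_x > 0)    # dL_rec/d(Wz+b), Eq. (11)
        acc += np.outer(delta, np.append(z_t, 1.0))     # delta * [z_t^T, 1]
    fv = acc.flatten()
    if fim_inv_sqrt is not None:                        # Eq. (13): whiten by diagonal F^{-1/2}
        fv = fim_inv_sqrt * fv
    return fv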
To improve the convergence speed and better regularize the visual representation learning for video, we train this framework by inputting one single video frame, randomly sampled from the video, rather than multiple frames. In the FV extraction stage, the video-level representation can be easily obtained by averaging the FVs of all the frames sampled from the video, since the FV in Eq. (13) is linearly additive.

Table 1. Methodology comparison of different quantization methods.
Quantization | Indicator                  | Descriptor
FV [23]      | Gaussian observation model | gradient with respect to GMM parameters
VLAD [11]    | clustering center          | difference to the assigned center
BP [20]      | local feature              | coordinate representation
FV-VAE       | VAE hidden variable        | gradient of reconstruction loss
5. Experiments
We evaluate the visual representation learnt by the FV-VAE architecture on three popular datasets, i.e., UCF101 [31], ActivityNet [2] and CUB-200-2011 [39]. The UCF101 dataset is one of the most popular video action recognition benchmarks. It consists of 13,320 videos from 101 action categories. The action categories are divided into five groups: human-object interaction, body-motion only, human-human interaction, playing musical instruments, and sports. Three training/test splits are provided by the dataset organisers and each split in UCF101 includes about 9.5K training and 3.7K test videos. The ActivityNet dataset is a large-scale video benchmark for human activity understanding. The latest released version of the dataset (v1.3) is exploited, which contains 19,994 videos from 200 activity categories. The 19,994 videos are divided into 10,024, 4,926 and 5,044 videos for the training, validation and test sets, respectively. Note that the labels of the test set are not publicly available, so the performances on the ActivityNet dataset are all reported on the validation set. Furthermore, we also validate the representation on the CUB-200-2011 dataset, which is widely adopted for fine-grained image classification and consists of 11,788 images from 200 bird species. We follow the official split on this dataset with 5,994 training and 5,794 test images.
5.1. Compared Approaches
To empirically verify the merit of the visual representation learnt by FV-VAE, we compare the following quantization methods: Global Activations (GA) directly utilizes the outputs of the fully-connected/pooling layer as the visual representation. Fisher Vector (FV) [23] produces the visual representation by concatenating the gradients with respect to the parameters of a GMM, which is trained on local descriptors. Vector of Locally Aggregated Descriptors (VLAD) [11] accumulates, for each clustering center learnt with K-means, the differences between the clustering center and the descriptors assigned to it, and then concatenates the accumulated vector of each center as the quantized representation.
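For reference, the VLAD baseline described above can be sketched in a few lines; precomputed K-means centers are assumed, and the usual power and L2 normalization steps are omitted.

import numpy as np

def vlad_encode(X, centers):
    """X: (T, d) local descriptors; centers: (K, d) K-means clustering centers.
    Returns a (K*d,) vector: per-center accumulated residuals, concatenated."""
    K, d = centers.shape
    assign = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)  # nearest center
    v = np.zeros((K, d))
    for k in range(K):
        members = X[assign == k]
        if len(members):
            v[k] = (members - centers[k]).sum(axis=0)  # accumulate differences to the center
    return v.flatten()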

References (selected, as listed on the source page)

K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
O. Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. In CVPR, 2015.
D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.