Deep Quantization: Encoding Convolutional Activations
with Deep Generative Model
Zhaofan Qiu, Ting Yao, and Tao Mei
University of Science and Technology of China, Hefei, China
Microsoft Research, Beijing, China
zhaofanqiu@gmail.com, {tiyao, tmei}@microsoft.com
Abstract
Deep convolutional neural networks (CNNs) have proven highly effective for visual recognition, where learning a universal representation from activations of a convolutional layer is a fundamental problem. In this paper, we present Fisher Vector encoding with Variational Auto-Encoder (FV-VAE), a novel deep architecture that quantizes the local activations of a convolutional layer in a deep generative model, by training them in an end-to-end manner. To incorporate the FV encoding strategy into deep generative models, we introduce the Variational Auto-Encoder model, which steers variational inference and learning in a neural network that can be straightforwardly optimized using the standard stochastic gradient method. Different from the FV characterized by conventional generative models (e.g., the Gaussian Mixture Model), which parsimoniously fit a discrete mixture model to the data distribution, the proposed FV-VAE is more flexible in representing the natural property of the data for better generalization. Extensive experiments are conducted on three public datasets, i.e., UCF101, ActivityNet, and CUB-200-2011, in the context of video action recognition and fine-grained image classification, respectively. Superior results are reported when compared to state-of-the-art representations. Most remarkably, our proposed FV-VAE achieves to-date the best published accuracy of 94.2% on UCF101.
1. Introduction
The recent advances in deep convolutional neural networks (CNNs) have demonstrated high capability in visual recognition. For instance, an ensemble of residual nets [7] achieves 3.57% top-5 error on the ImageNet dataset [26]. More importantly, when utilizing the activations of either a fully-connected layer or a convolutional layer in a pre-trained CNN as a universal visual representation and applying this representation to other visual recognition tasks (e.g., scene understanding and semantic segmentation), CNNs also manifest impressive performance. Further improvements are expected when CNNs are fine-tuned with only a small amount of task-specific data.

(This work was performed when Zhaofan Qiu was visiting Microsoft Research as a research intern.)

Figure 1. Visual representations derived from activations of different layers in CNNs (upper row: global activations of the fully-connected layer; middle row: convolutional activations with Fisher Vector encoding; bottom row: convolutional activations with our FV-VAE encoding).
The activations of different layers in CNNs are generally grouped into two dimensions: global activations and convolutional activations. The former directly takes the activations of the fully-connected layer as visual representations, which are holistic over the entire image, as shown in the upper row of Figure 1. The latter, in contrast, creates visual representations by encoding a set of regional and local activations from a convolutional layer into a vectorial representation using quantization strategies. For example, Fisher Vector (FV) [23] is one of the most successful quantization approaches, as shown in the middle row of Figure 1. While superior results by aggregating convolutional activations are reported in most recent studies [3, 44], convolutional activations are first extracted as local descriptors followed by another separate quantization step. Thus such descriptors may not be optimally compatible with the encoding process, making the quantization sub-optimal. Furthermore, as discussed in [13], the generative model behind FV, i.e., the Gaussian Mixture Model (GMM), cannot always represent the natural clustering of the descriptors, and its inflexible Gaussian observation model limits its generalization ability.
We show in this paper that these two limitations can be mitigated by designing a deep architecture for representation learning that combines convolutional activation extraction and quantization into one-stage learning. Specifically, we present a novel Fisher Vector encoding with Variational Auto-Encoder (FV-VAE) framework to encode convolutional activations with a deep generative model (i.e., VAE), as shown in the bottom row of Figure 1. The pipeline of the proposed deep architecture generally consists of two components: a sub-network with a stack of convolution layers to produce convolutional activations, followed by a VAE structure that aggregates the regional convolutional descriptors into a FV. VAE consists of hierarchies of conditional stochastic variables and is a highly expressive model obtained by optimizing a variational approximation (an inference/recognition model) to the intractable posterior of the generative distribution. Compared to the traditional GMM, which has the form of a mixture of fixed Gaussian components, the inference model here can be regarded as an alternative that predicts specific Gaussian components for different inputs with a single neural network, making it more flexible. It is also worth noting that a classification loss is additionally considered to preserve the semantic information in the training stage. The entire architecture is trainable in an end-to-end fashion. Furthermore, in the feature extraction stage, we theoretically prove that the FV of the input descriptors can be directly computed by accumulating the gradient vector of the reconstruction loss in the VAE through back-propagation.

The main contribution of this work is the proposal of the FV-VAE architecture to encode convolutional descriptors with a Variational Auto-Encoder. We theoretically formulate the computation of FV in VAE and substantiate an implementation of FV-VAE for visual representation learning.
2. Related Work
In the literature, visual representation generation from a pre-trained CNN model has proceeded along two dimensions: global activations and convolutional activations. The first is to extract the visual representation from global activations in a CNN directly, e.g., the outputs of the fully-connected layer in VGG [30] or the pool5 layer in ResNet [7]. In practice, this scheme often starts by pre-training the CNN model on a large dataset (e.g., ImageNet) and then fine-tuning the CNN architecture with a small amount of task-specific data to better characterize the intrinsic information in the target scenario. The visual representation learnt in this direction has been exploited in a broad range of computer vision tasks including fine-grained image classification [1, 17], video action recognition [18, 19, 24, 29] and visual captioning [36, 38].
An alternative scheme is to utilize the activations from convolutional layers in a CNN as regional and local descriptors. Compared to global activations, convolutional activations from a CNN are embedded with rich spatial information, making them more transferable to different domains and more robust to translation and rotation, which has shown effectiveness in several technological advances, e.g., Spatial Pyramid Pooling (SPP) [6], Fast R-CNN [5] and Fully Convolutional Networks (FCNs) [22]. Recently, many works attempt to produce visual representations by encoding convolutional activations with different quantization strategies. For example, Fisher Vector [23] is computed on the output of the last convolutional layer of VGG networks for describing texture in [3]. Similar in spirit, Xu et al. utilize VLAD [11] to encode convolutional descriptors of video frames for multimedia event detection [44]. In [28] and [43], Sharma et al. and Xu et al. dynamically pool convolutional descriptors with attention models for action recognition and image captioning, respectively. Furthermore, convolutional descriptors of one convolutional layer are pooled with the guidance of the activations of the successive convolutional layer in [21]. In [20], convolutional descriptors from two CNNs are multiplied using the outer product and pooled to obtain the bilinear vector.
In summary, our work belongs to the second dimension and aims to compute FV on convolutional activations with deep generative models. We exploit the Variational Auto-Encoder for this purpose, which optimizes an inference model to the intractable posterior. The high flexibility of the inference model and the efficiency of the structure optimization make VAE more advanced than the traditional GMM. Our work in this paper contributes by studying not only encoding convolutional activations in a deep architecture, but also theoretically figuring out the computation of FV based on the VAE architecture.
3. Fisher Vector Meets VAE
In this section, we first recall the Fisher Vector theory, followed by presenting how to estimate the probability density function in FV through VAE. The optimization of VAE is then elaborated, and finally we describe how to compute the FV of the input descriptors.
3.1. Fisher Vector Theory
Figure 2. The overview of FV learning with VAE: (a) the training process of VAE; (b) FV extraction based on VAE.

Suppose we have two sets of local descriptors $X = \{x_t\}_{t=1}^{T_x}$ and $Y = \{y_t\}_{t=1}^{T_y}$ with $T_x$ and $T_y$ descriptors, respectively. Let $x_t, y_t \in \mathbb{R}^d$ denote the $d$-dimensional features of each descriptor. In order to measure the similarity between the two sets, a kernel method is employed by mapping them into a hyperspace. Specifically, assuming that the generation process of descriptors in $\mathbb{R}^d$ can be modeled by a probability density function $u_\theta$ with $M$ parameters $\theta = [\theta_1, \ldots, \theta_M]^\top$, the Fisher Kernel (FK) [9] between the two sets $X$ and $Y$ is given by

$K(X, Y) = {G^X_\theta}^\top F_\theta^{-1} G^Y_\theta$,  (1)

where $G^X_\theta = \nabla_\theta \log u_\theta(X)$ is defined as the Fisher score function, i.e., the gradient of the log-likelihood of the set under the generative model, and $F_\theta = E_{X \sim u_\theta}[G^X_\theta {G^X_\theta}^\top]$ is the Fisher Information Matrix (FIM) of $u_\theta$, which is regarded as a statistical feature normalization. Since $F_\theta$ is positive semi-definite, the FK in Eq. (1) can be re-written explicitly as an inner product in the hyperspace:

$K(X, Y) = {\mathcal{G}^X_\theta}^\top \mathcal{G}^Y_\theta$,  (2)

where

$\mathcal{G}^X_\theta = F_\theta^{-\frac{1}{2}} G^X_\theta = F_\theta^{-\frac{1}{2}} \nabla_\theta \log u_\theta(X)$.  (3)

Formally, $\mathcal{G}^X_\theta$ is well known as the Fisher Vector (FV). The dimension of the FV is equal to the number of generative parameters $\theta$, which is often much higher than that of the descriptor, giving the FV higher descriptive capability.
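To make Eq. (3) concrete, here is a minimal NumPy sketch that computes the Fisher Vector of a descriptor set with respect to the mean of a single diagonal Gaussian taken as $u_\theta$. The single-Gaussian choice and the closed forms for its score and (diagonal) FIM are assumptions of this illustration, not part of the paper, which uses a GMM or the VAE introduced next; only the score and FIM change for those models, while the normalization in Eq. (3) is applied in the same way.

import numpy as np

def fisher_vector_single_gaussian(X, mu, sigma):
    """Fisher Vector of a set X (T x d) w.r.t. the mean of N(mu, diag(sigma^2)).

    Score: G_mu = sum_t (x_t - mu) / sigma^2
    FIM:   F_mu = T * diag(1 / sigma^2)  (for T i.i.d. descriptors)
    FV:    F_mu^{-1/2} G_mu = (1 / sqrt(T)) * sum_t (x_t - mu) / sigma
    """
    T = X.shape[0]
    score = ((X - mu) / sigma**2).sum(axis=0)   # gradient of log-likelihood w.r.t. mu
    fim_inv_sqrt = sigma / np.sqrt(T)           # diagonal of F^{-1/2}
    return fim_inv_sqrt * score                 # Eq. (3): F^{-1/2} * grad log u(X)

# toy usage: 50 descriptors of dimension 512
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 512))
mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-6
fv = fisher_vector_single_gaussian(X, mu, sigma)
print(fv.shape)  # (512,) -- one entry per generative parameter (here, the mean)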
3.2. Probability Estimation through VAE
Next, we discuss how to estimate the probability density function $u_\theta$ in FV. In general, $u_\theta$ is chosen to be a Gaussian Mixture Model (GMM) [27, 40], as one can approximate any distribution with arbitrary precision by a GMM, in which $\theta$ is composed of the mixture weights, means and covariances of the Gaussian components. The need for a large number of mixture components and the inefficient optimization of the Expectation-Maximization algorithm, however, make the parameter learning computationally expensive and difficult to apply to large-scale complex data. Inspired by the idea of deep generative models [16, 25], which enable flexible and efficient inference learning in a neural network, we develop a Variational Auto-Encoder (VAE) to generate the probability function $u_\theta$.
Algorithm 1 Variational Auto-Encoding (VAE) Optimization
1: Input: training set $X = \{x_t\}_{t=1}^{T_x}$, corresponding labels $L = \{l_t\}_{t=1}^{T_x}$, loss weights $\lambda_1$, $\lambda_2$, $\lambda_3$.
2: Initialization: randomly initialized $\theta_0$, $\phi_0$.
3: Output: VAE parameters $\theta^*$, $\phi^*$.
4: repeat
5:   Sample $x_t$ in the minibatch.
6:   Encoder: $\mu_{z_t} \leftarrow f_\phi(x_t)$.
7:   Sampling: $z_t \leftarrow \mu_{z_t} + \epsilon \odot \sigma_z$, $\epsilon \sim \mathcal{N}(0, I)$.
8:   Decoder: $\mu_{x_t} \leftarrow f_\theta(z_t)$.
9:   Compute reconstruction loss: $L_{rec} = -\log p_\theta(x_t|z_t) = -\log \mathcal{N}(x_t; \mu_{x_t}, \sigma_x^2 I)$.
10:  Compute regularization loss: $L_{reg} = \frac{1}{2}\|\mu_{z_t}\|^2 + \frac{1}{2}\|\sigma_z\|^2 - \frac{1}{2}\sum_{k=1}^{d}(1 + \log \sigma_{z(k)}^2)$.
11:  Compute classification loss: $L_{cls} = \mathrm{softmax\_loss}(z_t, l_t)$.
12:  Fuse the three losses: $\mathcal{L}(\theta, \phi) = \lambda_1 L_{rec}(\theta, \phi) + \lambda_2 L_{reg}(\phi) + \lambda_3 L_{cls}(\phi)$.
13:  Back-propagate the gradients.
14: until maximum iteration reached.
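As an illustration of Algorithm 1, the following PyTorch sketch performs one optimization step. The layer sizes, the MLP encoder, and the treatment of $\sigma_x$ and $\sigma_z$ as shared learnable parameters (as described in Section 3.3) are assumptions of this sketch, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_in, d_z, n_cls = 512, 256, 101           # assumed sizes: descriptor dim, latent dim, #classes
encoder = nn.Sequential(nn.Linear(d_in, 512), nn.ReLU(), nn.Linear(512, d_z))  # MLP encoder f_phi
decoder = nn.Linear(d_z, d_in)             # one-layer decoder f_theta (ReLU applied below)
classifier = nn.Linear(d_z, n_cls)         # for the classification loss on z_t
log_var_x = nn.Parameter(torch.zeros(1))   # shared sigma_x^2, learnt by gradient descent
log_var_z = nn.Parameter(torch.zeros(d_z)) # shared sigma_z^2, learnt by gradient descent

params = list(encoder.parameters()) + list(decoder.parameters()) + \
         list(classifier.parameters()) + [log_var_x, log_var_z]
opt = torch.optim.SGD(params, lr=1e-2)
lam1, lam2, lam3 = 1.0, 1.0, 1.0           # loss weights lambda_1..3

def train_step(x, labels):
    mu_z = encoder(x)                                       # step 6: mu_z <- f_phi(x)
    sigma_z = torch.exp(0.5 * log_var_z)
    z = mu_z + torch.randn_like(mu_z) * sigma_z             # step 7: reparameterized sampling
    mu_x = F.relu(decoder(z))                               # step 8: mu_x <- f_theta(z)
    var_x = torch.exp(log_var_x)
    # step 9: L_rec = -log N(x; mu_x, sigma_x^2 I), up to additive constants
    l_rec = (0.5 * ((x - mu_x) ** 2) / var_x + 0.5 * log_var_x).sum(dim=1).mean()
    # step 10: L_reg = KL( N(mu_z, sigma_z^2 I) || N(0, I) )
    l_reg = (0.5 * mu_z.pow(2).sum(dim=1) + 0.5 * sigma_z.pow(2).sum()
             - 0.5 * (1.0 + log_var_z).sum()).mean()
    # step 11: softmax classification loss on z
    l_cls = F.cross_entropy(classifier(z), labels)
    loss = lam1 * l_rec + lam2 * l_reg + lam3 * l_cls       # step 12: fuse the three losses
    opt.zero_grad(); loss.backward(); opt.step()            # step 13: back-propagate
    return loss.item()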
Following the notations in Section 3.1 and assuming that all the descriptors in the set are independent, the log-likelihood of the set can be calculated as the sum over the log-likelihoods of the individual descriptors:

$\log u_\theta(X) = \sum_{t=1}^{T_x} \log p_\theta(x_t)$.  (4)
To model the probability of $x_t$ being generated from parameters $\theta$, an unobserved continuous random variable $z_t$ is involved with prior distribution $p_\theta(z)$, and each $x_t$ is generated from the conditional distribution $p_\theta(x|z)$. As such, each log-likelihood $\log p_\theta(x_t)$ can be measured using the Kullback-Leibler divergence ($D_{KL}$) as

$\log p_\theta(x_t) = D_{KL}(q_\phi(z|x_t)\,\|\,p_\theta(z|x_t)) + LB(\theta, \phi; x_t) \geq LB(\theta, \phi; x_t)$,  (5)

where $LB(\theta, \phi; x_t)$ is the variational lower bound on the likelihood of descriptor $x_t$ and can be written as

$LB(\theta, \phi; x_t) = -D_{KL}(q_\phi(z|x_t)\,\|\,p_\theta(z)) + E_{q_\phi(z|x_t)}[\log p_\theta(x_t|z)]$,  (6)

where $q_\phi(z|x)$ is a recognition model which is an approximation to the intractable posterior $p_\theta(z|x)$. In our proposed FV-VAE method, we use this lower bound $LB(\theta, \phi; x_t)$ as an approximation of the log-likelihood. Through this approximation, the generative model can be divided into two parts: an encoder $q_\phi(z|x)$ and a decoder $p_\theta(x|z)$, predicting the hidden and visible probability, respectively.
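For readers unfamiliar with variational inference, the decomposition in Eqs. (5)-(6) follows from a standard identity; the short derivation below is reproduced here for convenience and is not an addition by the authors:

$D_{KL}(q_\phi(z|x_t)\,\|\,p_\theta(z|x_t)) = E_{q_\phi(z|x_t)}[\log q_\phi(z|x_t) - \log p_\theta(z|x_t)]$
$= E_{q_\phi(z|x_t)}[\log q_\phi(z|x_t) - \log p_\theta(x_t|z) - \log p_\theta(z)] + \log p_\theta(x_t)$
$= \log p_\theta(x_t) - \big(E_{q_\phi(z|x_t)}[\log p_\theta(x_t|z)] - D_{KL}(q_\phi(z|x_t)\,\|\,p_\theta(z))\big) = \log p_\theta(x_t) - LB(\theta, \phi; x_t)$.

Rearranging yields Eq. (5), and the non-negativity of the KL divergence gives the inequality.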

Figure 3. Visual representation learning framework for image and video recognition including our FV-VAE. Spatial Pyramid Pooling (SPP) is performed on the last pooling layer of the CNN to aggregate the local descriptors of a video frame; it applies four different max pooling operations and obtains (6 × 6), (3 × 3), (2 × 2) and (1 × 1) outputs for each convolutional filter, resulting in a total of 50 descriptors per frame. For an image, an input with a higher resolution of (448 × 448) is fed into the CNN and the activations of the last convolutional layer (conv5_4+relu in VGG_19) are extracted, leading to 28 × 28 (784) dense local descriptors per image. In the training epoch, the FV-VAE architecture is learnt by minimizing the overall loss. In the extraction epoch, the learnt FV-VAE encodes the set of local descriptors into a vectorial FV representation.
3.3. Optimization of VAE
The inference model parameters $\phi$ and generative model parameters $\theta$ are straightforward to optimize using the stochastic gradient descent method. More specifically, let the prior distribution be the standard normal distribution $p_\theta(z) = \mathcal{N}(z; 0, I)$, and let both the conditional distribution $p_\theta(x|z)$ and the posterior approximation $q_\phi(z|x)$ be multivariate Gaussian distributions $\mathcal{N}(x_t; \mu_{x_t}, \sigma_{x_t}^2 I)$ and $\mathcal{N}(z_t; \mu_{z_t}, \sigma_{z_t}^2 I)$, respectively. One-step Monte Carlo sampling is exploited to estimate the latent variable $z_t$. Hence, the lower bound in Eq. (6) can be rewritten as

$LB(\theta, \phi; x_t) \approx \log p_\theta(x_t|z_t) + \frac{1}{2}\sum_{k=1}^{d}\left(1 + \log \sigma_{z_t(k)}^2\right) - \frac{1}{2}\|\mu_{z_t}\|^2 - \frac{1}{2}\|\sigma_{z_t}\|^2$,  (7)

where $z_t$ is generated from $\mathcal{N}(\mu_{z_t}, \sigma_{z_t}^2 I)$, which is equivalent to $z_t = \mu_{z_t} + \epsilon \odot \sigma_{z_t}$, $\epsilon \sim \mathcal{N}(0, I)$.
Figure 2(a) illustrates an overview of our VAE training process and Algorithm 1 further details the optimization steps. It is also worth noting that, different from the training of the standard VAE method, which estimates $\sigma_x$ and $\sigma_z$ with another parallel encoder-decoder structure, we simply learn the two covariances by the gradient descent technique and share them across all the descriptors, significantly reducing the number of parameters learnt in the VAE in our case. In addition to the basic reconstruction loss and regularization loss, we further take a classification loss into account in our VAE training to incorporate semantic information, which has been shown effective in semi-supervised generative model learning [15]. The overall loss function is then given by

$\mathcal{L}(\theta, \phi) = \lambda_1 L_{rec}(\theta, \phi) + \lambda_2 L_{reg}(\phi) + \lambda_3 L_{cls}(\phi)$.  (8)

We fix $\lambda_1 = \lambda_2 = 1$ in Eq. (8) and will investigate the effect of the tradeoff parameter $\lambda_3$ in our experiments. During training, the gradients are calculated and back-propagated to the lower layers so that the lower layers can adjust their parameters to minimize the loss.
3.4. FV Extraction
After the optimization of the model parameters $[\theta^*, \phi^*]$, Figure 2(b) demonstrates how to extract the Fisher Vector based on the learnt VAE architecture.

By replacing the log-likelihood with its approximation, i.e., the lower bound $LB(\theta, \phi; x_t)$, we can obtain the FV in Eq. (3) as

$\mathcal{G}^X_\theta = F_\theta^{-\frac{1}{2}} \nabla_\theta \log u_\theta(X) = F_\theta^{-\frac{1}{2}} \sum_{t=1}^{T_x} \left[\nabla_\theta L_{rec}(x_t; \theta^*, \phi^*)\right]$,  (9)

which is the normalized gradient vector of the reconstruction loss and can be computed directly through the back-propagation operation. It is worth noticing that when extracting the FV representation, we withdraw the sampling operation and use $\mu_{z_t}$ as $z_t$ directly to avoid stochastic factors.
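Continuing the training sketch from Section 3.3 (reusing its encoder, decoder and log_var_x, all of which are assumptions of that sketch rather than the authors' code), FV extraction per Figure 2(b) and Eq. (9) can be approximated as follows; the $F_\theta^{-1/2}$ factor is assumed to be a precomputed diagonal vector obtained as in Eq. (12).

import torch
import torch.nn.functional as F  # assumes encoder, decoder, log_var_x from the training sketch

def extract_fv(X, fim_inv_sqrt):
    """X: (T_x, d_in) local descriptors of one image or frame set.
    fim_inv_sqrt: diagonal approximation of F^{-1/2} (see Eq. (12))."""
    n_params = sum(p.numel() for p in decoder.parameters())
    acc = torch.zeros(n_params)
    for x_t in X:                                   # accumulate over the descriptor set
        mu_z = encoder(x_t.unsqueeze(0))
        z = mu_z                                    # sampling withdrawn: z_t = mu_z_t
        mu_x = F.relu(decoder(z))
        # reconstruction loss up to terms that are constant w.r.t. the decoder parameters
        l_rec = (0.5 * (x_t.unsqueeze(0) - mu_x) ** 2 / torch.exp(log_var_x)).sum()
        grads = torch.autograd.grad(l_rec, list(decoder.parameters()))
        acc += torch.cat([g.flatten() for g in grads])  # sum_t grad_theta L_rec(x_t)
    return fim_inv_sqrt * acc                       # Eq. (9): normalize by F^{-1/2}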
4. Visual Representation Learning
By utilizing FV-VAE as a deep architecture for quantization, a general visual representation learning framework is devised for image and video recognition, respectively, as illustrated in Figure 3. The basic idea is to construct a set of convolutional descriptors for an image or video frames, followed by encoding them into a vectorial FV representation using the FV-VAE architecture. Both the training epoch and the FV extraction epoch are shown in Figure 3 and the entire framework is trainable in an end-to-end manner.

We exploit different aggregation strategies to construct the set of convolutional descriptors for video frames and images, respectively, due to their different properties. A video consists of a sequence of frames with large intra-class variations caused by, e.g., camera motion and illumination conditions, making the scale of an identical object vary across frames. Following [44], we employ Spatial Pyramid Pooling (SPP) [6] on the last pooling layer to extract scale-invariant local descriptors for video frames. For images, instead, we feed a higher-resolution (e.g., 448 × 448) input into the CNN to fully utilize the image information and extract the activations of the last convolutional layer (e.g., conv5_4+relu in VGG_19), resulting in dense local descriptors (e.g., 28 × 28) as in [20].
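The two aggregation strategies above can be sketched as follows; using torch.nn.functional.adaptive_max_pool2d for the four SPP grids and the channel/spatial sizes in the toy usage are assumptions of this illustration, though the grid sizes reproduce the 50 descriptors per frame and 784 per image noted in Figure 3.

import torch
import torch.nn.functional as F

def spp_descriptors(feat_map):
    """feat_map: (C, H, W) activations of the last pooling layer for one video frame.
    Returns (50, C): 6x6 + 3x3 + 2x2 + 1x1 = 50 local descriptors of dimension C."""
    descs = []
    for g in (6, 3, 2, 1):
        pooled = F.adaptive_max_pool2d(feat_map.unsqueeze(0), g)  # (1, C, g, g)
        descs.append(pooled.squeeze(0).flatten(1).t())            # (g*g, C)
    return torch.cat(descs, dim=0)

def dense_descriptors(conv_map):
    """conv_map: (C, 28, 28) activations of conv5_4+relu for a 448x448 image.
    Returns (784, C): each spatial position is one local descriptor."""
    return conv_map.flatten(1).t()

frame_feat = torch.randn(512, 14, 14)      # assumed channel/spatial sizes, for illustration only
image_feat = torch.randn(512, 28, 28)
print(spp_descriptors(frame_feat).shape)   # torch.Size([50, 512])
print(dense_descriptors(image_feat).shape) # torch.Size([784, 512])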
In our implementation, a Multi-Layer Perceptron (MLP) is employed as the encoder in FV-VAE and a one-layer decoder is developed to reduce the dimension of the FV representation. As such, the functions in Algorithm 1 can be specified as

$\mathrm{Encoder}: \mu_{z_t} \leftarrow MLP_\phi(x_t), \qquad \mathrm{Decoder}: \mu_{x_t} \leftarrow \mathrm{ReLU}(W_\theta z_t + b_\theta)$,  (10)

where $\{W_\theta, b_\theta\}$ are the decoder parameters $\theta$. The gradient vector of $L_{rec}$ is calculated as
$\nabla_\theta L_{rec}(x_t; \theta^*, \phi^*) = \mathrm{flatten}\left[\frac{\partial L_{rec}}{\partial W_\theta}, \frac{\partial L_{rec}}{\partial b_\theta}\right] = \mathrm{flatten}\left[\frac{\partial L_{rec}}{\partial \mu_{x_t}} \cdot z_t^\top, \frac{\partial L_{rec}}{\partial \mu_{x_t}}\right] = \mathrm{flatten}\left(\frac{\partial L_{rec}}{\partial \mu_{x_t}} \cdot [z_t^\top, 1]\right) = \mathrm{flatten}\left(\frac{\mu_{x_t} - x_t}{\sigma_x^2} \odot \mathbb{1}(\mu_{x_t} > 0) \cdot [z_t^\top, 1]\right)$,  (11)

where flatten denotes flattening a matrix into a vector, and $\odot$ denotes element-wise multiplication to filter the activated elements. Considering that it is difficult to obtain an analytic solution of the FIM in this case, we make an approximation by replacing the expectation with the average over the whole training set:

$F_\theta = E_{X \sim u_\theta}[G^X_\theta {G^X_\theta}^\top] \approx \mathrm{mean}_X[G^X_\theta {G^X_\theta}^\top]$,  (12)
and

$\mathcal{G}^X_\theta = \mathrm{flatten}\left(F_\theta^{-\frac{1}{2}} \cdot \sum_{t=1}^{T_x} \left(\frac{\mu_{x_t} - x_t}{\sigma_x^2} \odot \mathbb{1}(\mu_{x_t} > 0) \cdot [z_t^\top, 1]\right)\right)$,  (13)

which is the output FV representation in our framework.
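Eqs. (11)-(13) admit a direct implementation without per-descriptor back-propagation. The NumPy sketch below mirrors the closed form under the same one-layer ReLU decoder; treating $F_\theta^{-1/2}$ as a precomputed diagonal vector is an assumption made here for tractability.

import numpy as np

def fv_vae_closed_form(X, Z, W, b, sigma2_x, fim_inv_sqrt=None):
    """X: (T, d) descriptors; Z: (T, d_z) their latent means mu_z (used as z at extraction).
    W, b: one-layer decoder parameters; sigma2_x: shared reconstruction variance.
    Returns the flattened, accumulated gradient of L_rec w.r.t. [W, b], as in Eq. (13)."""
    acc = np.zeros((W.shape[0], W.shape[1] + 1))        # accumulator for [dL/dW, dL/db]
    for x_t, z_t in zip(X, Z):
        mu_x = np.maximum(W @ z_t + b, 0.0)             # decoder: ReLU(W z + b)
        delta = (mu_x - x_t) / sigma2_x * (mu_x > 0)    # dL_rec/d(Wz+b), Eq. (11)
        acc += np.outer(delta, np.append(z_t, 1.0))     # delta * [z_t^T, 1]
    fv = acc.flatten()
    if fim_inv_sqrt is not None:                        # Eq. (13): whiten by diagonal F^{-1/2}
        fv = fim_inv_sqrt * fv
    return fv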
To improve the convergence speed and better regularize the visual representation learning for video, we train this framework by inputting one single video frame, randomly sampled from the video, rather than multiple frames. In the FV extraction stage, the video-level representation can be easily obtained by averaging the FVs of all the frames sampled from the video, since the FV in Eq. (13) is linearly additive.

Table 1. Methodology comparison of different quantization methods.
Quantization | Indicator                  | Descriptor
FV [23]      | Gaussian observation model | gradient with respect to GMM parameters
VLAD [11]    | clustering center          | difference to the assigned center
BP [20]      | local feature              | coordinate representation
FV-VAE       | VAE hidden variable        | gradient of reconstruction loss
5. Experiments
We evaluate the visual representation learnt by the FV-VAE architecture on three popular datasets, i.e., UCF101 [31], ActivityNet [2] and CUB-200-2011 [39]. The UCF101 dataset is one of the most popular video action recognition benchmarks. It consists of 13,320 videos from 101 action categories. The action categories are divided into five groups: human-object interaction, body-motion only, human-human interaction, playing musical instruments, and sports. Three training/test splits are provided by the dataset organisers and each split in UCF101 includes about 9.5K training and 3.7K test videos. The ActivityNet dataset is a large-scale video benchmark for human activity understanding. The latest released version of the dataset (v1.3) is exploited, which contains 19,994 videos from 200 activity categories. The 19,994 videos are divided into 10,024, 4,926 and 5,044 videos for the training, validation and test sets, respectively. Note that the labels of the test set are not publicly available, so the performances on the ActivityNet dataset are all reported on the validation set. Furthermore, we also validate the representation on the CUB-200-2011 dataset, which is widely adopted for fine-grained image classification and consists of 11,788 images from 200 bird species. We follow the official split on this dataset with 5,994 training and 5,794 test images.
5.1. Compared Approaches
To empirically verify the merit of the visual representation learnt by FV-VAE, we compare the following quantization methods: Global Activations (GA) directly utilizes the outputs of the fully-connected/pooling layer as the visual representation. Fisher Vector (FV) [23] produces the visual representation by concatenating the gradients with respect to the parameters of a GMM, which is trained on local descriptors. Vector of Locally Aggregated Descriptors (VLAD) [11] accumulates, for each clustering center learnt with K-means, the differences between the clustering center and the descriptors assigned to it, and then concatenates the accumulated vector of each center as the quantized representation.
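For reference, the VLAD baseline described above can be sketched in a few lines; precomputed K-means centers are assumed, and the usual power and L2 normalization steps are omitted.

import numpy as np

def vlad_encode(X, centers):
    """X: (T, d) local descriptors; centers: (K, d) K-means clustering centers.
    Returns a (K*d,) vector: per-center accumulated residuals, concatenated."""
    K, d = centers.shape
    assign = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)  # nearest center
    v = np.zeros((K, d))
    for k in range(K):
        members = X[assign == k]
        if len(members):
            v[k] = (members - centers[k]).sum(axis=0)  # accumulate differences to the center
    return v.flatten()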

References (selected, as listed on the source page)

K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
O. Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. In CVPR, 2015.
D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.