Asynchronous Temporal Fields for Action Recognition

Gunnar A. Sigurdsson¹   Santosh Divvala²,³   Ali Farhadi²,³   Abhinav Gupta¹,³
¹Carnegie Mellon University   ²University of Washington   ³Allen Institute for Artificial Intelligence
Work was done while Gunnar was an intern at AI2.
github.com/gsig/temporal-fields/
Abstract

Actions are more than just movements and trajectories: we cook to eat and we hold a cup to drink from it. A thorough understanding of videos requires going beyond appearance modeling and necessitates reasoning about the sequence of activities, as well as higher-level constructs such as intentions. But how do we model and reason about these? We propose a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network. End-to-end training of such structured models is a challenging endeavor: for inference and learning we need to construct mini-batches consisting of whole videos, leading to mini-batches with only a few videos. This causes high correlation between data points, leading to breakdown of the backprop algorithm. To address this challenge, we present an asynchronous variational inference method that allows efficient end-to-end training. Our method achieves a classification mAP of 22.4% on the Charades [42] benchmark, outperforming the state-of-the-art (17.2% mAP), and offers equal gains on the task of temporal localization.
1. Introduction
Consider the video shown in Figure 1: A man walks through a doorway, stands at a table, holds a cup, pours something into it, drinks it, puts the cup on the table, and finally walks away. Despite depicting a simple activity, the video involves a rich interplay of a sequence of actions with underlying goals and intentions. For example, the man stands at the table 'to take a cup', he holds the cup 'to drink from it', etc. Thorough understanding of videos requires us to model such interplay between activities as well as to reason over extensive time scales and multiple aspects of actions (objects, scenes, etc.).
Most contemporary deep learning based methods have treated the problem of video understanding as that of only appearance and motion (trajectory) modeling [43, 53, 7, 27].
[Figure 1: video frames over time, annotated 'Holding a cup', 'Pouring into a cup', 'Drinking from a cup'; Intent: 'Getting something to drink'.]
Figure 1. Understanding human activities in videos requires jointly reasoning about multiple aspects of activities, such as 'what is happening', 'how', and 'why'. In this paper, we present an end-to-end deep structured model over time trained in a stochastic fashion. The model captures rich semantic aspects of activities, including Intent (why), Category (what), and Object (how). The figure shows video frames and annotations used in training from the Charades [42] dataset.
While this has fostered interesting progress in this domain, these methods still struggle to outperform models based on hand-crafted features, such as Dense Trajectories [56]. Why such a disconnect? We argue that video understanding requires going beyond appearance modeling, and necessitates reasoning about the activity sequence as well as higher-level constructs such as intentions. The recent emergence of large-scale datasets containing rich sequences of realistic activities [42, 63, 60] comes at a perfect time, enabling us to explore such complex reasoning.
But what is the right way to model and reason about temporal relations and goal-driven behaviour? Over the last couple of decades, graphical models such as Conditional Random Fields (CRFs) have been the prime vehicles for structured reasoning. Therefore, one possible alternative is to use ConvNet-based approaches [19] to provide features for a CRF training algorithm. Alternatively, it has been shown that integrating CRFs with ConvNet architectures and training them in an end-to-end manner provides substantial improvements in tasks such as segmentation and situation recognition [66, 1, 62].
Inspired by these advances, we present a deep-structured model that can reason temporally about multiple aspects of activities. For each frame, our model infers the activity category, object, action, progress, and scene using a CRF, where the potentials are predicted by a jointly end-to-end trained ConvNet over all predictions in all frames. This CRF has a latent node for the intent of the actor in the video and pairwise relationships between all individual frame predictions.
While our model is intuitive, training it in an end-to-end manner is a non-trivial task. In particular, end-to-end learning requires computing likelihoods for individual frames and doing joint inference about all connected frames with a CRF training algorithm. This is in stark contrast with the standard stochastic gradient descent (SGD) training algorithm (backprop) for deep networks, where we require mini-batches with a large number of independent and uncorrelated samples, not just a few whole videos. In order to handle this effectively: (1) we relax the Markov assumption and choose a fully-connected temporal model, such that each frame's prediction is influenced by all other frames, and (2) we propose an asynchronous method for training fully-connected structured models for videos. Specifically, this structure allows for an implementation where the influence (messages) from other frames is approximated by emphasizing influence from frames computed in recent iterations. These messages are more accurate and show an advantage over being limited to only neighboring frames. In addition to being more suitable for stochastic training, fully-connected models have shown increased performance on various tasks [18, 66].
In summary, our key contributions are: (a) a deep CRF-based model for structured understanding and comprehensive reasoning about videos in terms of multiple aspects, such as action sequences, objects, and even intentions; (b) an asynchronous training framework for expressive temporal CRFs that is suitable for end-to-end training of deep networks; and (c) substantial improvements over the state-of-the-art, increasing performance from 17.2% mAP to 22.4% mAP on the challenging Charades [42] benchmark.
2. Related Work
Understanding activities and actions has an extensive history [32, 59, 22, 17, 23, 2, 26, 56, 29, 21]. Interestingly, analyzing actions by their appearance has gone through multiple iterations. Early success came with hand-crafted representations such as Space Time Interest Points (STIP) [22], 3D Histograms of Gradients (HOG3D) [17], Histograms of Optical Flow (HOF) [23], and Motion Boundary Histograms [2]. These methods capture and analyze local properties of the visual-temporal data stream. In the past years, the most prominent hand-crafted representations have been from the family of trajectory-based approaches [26, 56, 29, 21], where the Improved Dense Trajectories (IDT) [56] representation is in fact on par with the state-of-the-art on multiple recent datasets [8, 42].
Recently there has been a push towards mid-level representations of video [37, 46, 13, 20] that capture more than local properties. However, these approaches still used hand-crafted features. With the advent of deep learning, learning representations from data has been extensively studied [14, 15, 44, 57, 52, 53, 24, 7, 61, 55, 40, 3]. Of these, one of the most popular frameworks has been the approach of Simonyan et al. [44], who introduced the idea of training separate color and optical-flow networks to capture local properties of the video.
Many of those approaches were designed for short clips of individual activities and hence do not generalize well to realistic sequences of activities. Capturing the whole information of the video in terms of the temporal evolution of the video stream has been the focus of some recent approaches [51, 6, 12, 35, 49, 30]. Moving towards more expressive deep networks such as LSTMs has become a popular method for encoding such temporal information [48, 4, 65, 50, 58, 41, 64]. Interestingly, while those models move towards a more complete understanding of the full video stream, they have yet to significantly outperform local methods [44] on standard benchmarks.
A different direction in understanding comes from reasoning about the complete video stream in a complementary direction: structure. Understanding activities in a human-centric fashion encodes our particular experiences with the visual world. Understanding activities with emphasis on objects has been a particularly fruitful direction [25, 36, 9, 34, 54]. In a similar vein, some works have also tried modeling activities as transformations [58] or state changes [5]. Recently, there has been significant progress in modelling the complete human-centric aspect, where image recognition is phrased in terms of objects and their roles [62, 10]. Moving beyond appearance and reasoning about the state of agents in the images requires understanding human intentions [16, 31]. This ability to understand people in terms of beliefs and intents has been traditionally studied in psychology as the Theory of Mind [33].
How to exactly model the structure of the visual and temporal world has been the pursuit of numerous fields. Of particular interest is work that combines the representative power of deep networks with structured modelling. Training such models is often cumbersome due to the differences in jointly training deep networks (stochastic sampling) and sequential models (consecutive samples) [28, 66]. In this work, we focus on fully-connected random fields, which have been popular in image segmentation [18], where image filtering was used for efficient message passing, and later extended to use CNN potentials [39].
3. Proposed Method
Given a video with multiple activities, our goal is to understand the video in terms of activities. Understanding activities requires reasoning about objects being interacted with, the place where the interaction is happening, what happened before and what happens after the current action, and even the intent of the actor in the video. We incorporate all of these by formulating a deep Conditional Random Field (CRF) over different aspects of the activity over time. That is, a video can be interpreted as a graphical model, where the components of the activity in each frame are nodes in the graph, and the model potentials are the edges in the graph.

[Figure 2: for each timepoint, a Two-Stream Network (VGG-16, fc7) predicts per-frame Scene, Progress, Action, Object, and Category (e.g., Dining, Start, Walk, Door, C097), connected over time and to a global Intent node.]
Figure 2. An overview of our structured model. The semantic part captures object, action, etc. at each frame, and the temporal aspect captures those over time. On the left side, we show how, for each timepoint in the video, a Two-Stream Network predicts the potentials. Our model jointly reasons about multiple aspects of activities in all video frames. The Intent captures groups of activities of the person throughout the whole sequence of activities, and fine-grained temporal reasoning is through fully-connected temporal connections.
In particular, we create a CRF which predicts activity, object, etc., for every frame in the video. For reasoning about time, we create a fully-connected temporal CRF, referred to as the Asynchronous Temporal Field in the text. That is, unlike a linear-chain CRF for temporal modelling (the discriminative counterpart to Hidden Markov Models), each node depends on the state of every other node in the graph. We incorporate intention as another latent variable which is connected to all the action nodes. This is an unobserved variable that influences the sequence of activities. This variable is the common underlying factor that guides and better explains the sequence of actions an agent takes. An analysis of what structure this latent variable learns is presented in the experiments. Our model has three advantages: (1) it addresses the problem of long-term interactions; (2) it incorporates reasoning about multiple parts of the activity, such as objects and intent; and (3) more interestingly, as we will see, it allows for efficient end-to-end training in an asynchronous stochastic fashion.
3.1. Architecture
In this work we encode multiple components of an activity. Each video with T frames is represented as $\{X_1, \dots, X_T, I\}$, where $X_t$ is a set of frame-level random variables for time step t and I is an unobserved random variable that represents the global intent in the entire video. We can further write $X_t = \{C_t, O_t, A_t, P_t, S_t\}$, where C is the activity category (e.g., 'drinking from cup'), O corresponds to the object (e.g., 'cup'), A represents the action (e.g., 'drink'), P represents the progress of the activity {start, middle, end}, and S represents the scene (e.g., 'Dining Room'). For clarity in the following derivation we will refer to all the associated variables of $X_t$ as a single random variable $X_t$. A more detailed description of the CRF is presented in the appendix.
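To make this notation concrete, here is a minimal sketch (not from the paper) of how the per-frame variables could be represented in code; the field types and encodings are assumptions made for this illustration only.

```python
from dataclasses import dataclass

# Illustrative container for the per-frame variables X_t = {C_t, O_t, A_t, P_t, S_t}.
# The paper treats these as discrete random variables in a CRF; this object is only
# a convenient way to hold one joint assignment.

PROGRESS_STATES = ("start", "middle", "end")  # the three values of P_t

@dataclass
class FrameVariables:
    category: int   # C_t: activity category index (e.g., 'drinking from cup')
    obj: int        # O_t: object index (e.g., 'cup')
    action: int     # A_t: action index (e.g., 'drink')
    progress: int   # P_t: index into PROGRESS_STATES
    scene: int      # S_t: scene index (e.g., 'Dining Room')

# A video is then {X_1, ..., X_T, I}: a list of FrameVariables plus one latent intent I.
```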
Mathematically, we consider a random field $\{X, I\}$ over all the random variables in our model ($\{X_1, \dots, X_T, I\}$). Given an input video $V = \{V_1, \dots, V_T\}$, where $V_t$ is a video frame, our goal is to estimate the maximum a posteriori labeling of the random field by marginalizing over the intent I. This can be written as:

$$x^* = \arg\max_x \sum_I P(x, I \mid V). \qquad (1)$$
For clarity in notation, we will drop the conditioning on V and write $P(X, I)$. We can define $P(X, I)$ using a Gibbs distribution as $P(X, I) = \frac{1}{Z(V)} \exp\left(E(x, I)\right)$, where $E(x, I)$ is the Gibbs energy over x. In our CRF, we model all unary and pairwise cliques between all frames $\{X_1, \dots, X_T\}$ and the intent I. The Gibbs energy is:
$$E(x, I) = \underbrace{\sum_i \phi_X(x_i)}_{\text{Semantic}} + \underbrace{\sum_i \phi_{XI}(x_i, I) + \sum_{i,j:\, i \neq j} \phi_{XX}(x_i, x_j)}_{\text{Temporal}}, \qquad (2)$$
where $\phi_{XX}(x_i, x_j)$ is the potential between frame i and frame j, and $\phi_{XI}(x_i, I)$ is the potential between frame i and the intent. For notational clarity, $\phi_X(x_i)$ incorporates all unary and pairwise potentials for $C_t, O_t, A_t, P_t, S_t$. The model is best understood in terms of two aspects: the Semantic aspect, which incorporates the local variables in each frame ($C_t, O_t, A_t, P_t, S_t$); and the Temporal aspect, which incorporates interactions among frames and the intent I. This is visualized in Figure 2. We will now explain the semantic and temporal potentials.

[Figure 3: a single timepoint's RGB and optical flow are processed by CNNs; their output, combined with input messages from the message server, forms the loss, which is backpropagated and produces output messages.]
Figure 3. Illustration of the learning algorithm and the message-passing structure. Each timepoint that has been processed has a message (blue highlights messages that have recently been computed). The loss receives a combination of those messages, uses those to construct new messages, and updates the network.
Semantic aspect. The frame potential $\phi_X(x_i)$ incorporates the interplay between activity category, object, action, progress, and scene, and could be written explicitly as $\phi_X(C_t, O_t, A_t, P_t, S_t)$. In practice this potential is composed of unary, pairwise, and tertiary potentials directly predicted by a CNN. We found predicting only the following terms to be sufficient without introducing too many additional parameters: $\phi_X(C_t, O_t, A_t, P_t, S_t) = \phi(O_t, P_t) + \phi(A_t, P_t) + \phi(O_t, S_t) + \phi(C_t, O_t, A_t, P_t)$, where we only model the assignments seen in the training set, and assume others are not possible.
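As an illustration of this factorization, the sketch below scores one joint assignment of a single frame from hypothetical CNN-predicted component tables; the table shapes, names, and the dictionary used to restrict the tertiary term to assignments seen in training are assumptions made for this sketch, not the paper's implementation.

```python
import numpy as np

# Factorized semantic potential:
#   phi_X(C,O,A,P,S) = phi(O,P) + phi(A,P) + phi(O,S) + phi(C,O,A,P)

def semantic_potential(phi_OP, phi_AP, phi_OS, phi_COAP, c, o, a, p, s):
    """phi_OP: [objects, progress]; phi_AP: [actions, progress]; phi_OS: [objects, scenes];
    phi_COAP: dict {(c, o, a, p): score} over assignments seen in training."""
    score = phi_OP[o, p] + phi_AP[a, p] + phi_OS[o, s]
    # Assignments never seen in the training set are treated as impossible.
    score += phi_COAP.get((c, o, a, p), -np.inf)
    return score
```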
Temporal aspect. The temporal aspect of the model is both in terms of the frame-intent potentials $\phi_{XI}(x_i, I)$ and frame-frame potentials $\phi_{XX}(x_i, x_j)$. The frame-intent potentials are predicted with a CNN from video frames (pixels and motion). The pairwise potentials $\phi_{XX}(x_i, x_j)$ for two time points i and j in our model have the form:

$$\phi_{XX}(x_i, x_j) = \mu(x_i, x_j) \sum_m w^{(m)} k^{(m)}(v_i, v_j), \qquad (3)$$

where $\mu$ models the asymmetric affinity between frames, $w$ are kernel weights, and each $k^{(m)}$ is a Gaussian kernel that depends on the video frames $v_i$ and $v_j$. In this work we use a single kernel that prioritises short-term interactions:

$$k(v_i, v_j) = \exp\left(-\frac{(j - i)^2}{2\sigma^2}\right) \qquad (4)$$

The parameters of the general asymmetric compatibility function $\mu(x_i, x_j)$ are learned from the data, and $\sigma$ is a hyper-parameter chosen by cross-validation.
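A minimal sketch of Eqs. (3) and (4), assuming a single Gaussian kernel and a learned compatibility matrix `mu` indexed by discrete frame labels; variable names and shapes are illustrative rather than taken from the paper's code.

```python
import numpy as np

def temporal_kernel(i, j, sigma):
    """Eq. (4): Gaussian kernel over frame indices, prioritising short-term interactions."""
    return np.exp(-((j - i) ** 2) / (2.0 * sigma ** 2))

def pairwise_potential(mu, x_i, x_j, i, j, sigma, w=1.0):
    """Eq. (3) with one kernel: phi_XX(x_i, x_j) = mu[x_i, x_j] * w * k(v_i, v_j)."""
    return mu[x_i, x_j] * w * temporal_kernel(i, j, sigma)
```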
3.2. Inference
While it is possible to enumerate all variable configurations in a single frame, doing so for multiple frames and their interactions is intractable. Our algorithm uses a structured variational approximation to approximate the full probability distribution. In particular, we use a mean-field approximation to make inference and learning tractable. With this approximation, we can do inference by keeping track of messages between frames, and asynchronously train one frame at a time (in a mini-batch fashion).
More formally, instead of computing the exact distribution $P(X, I)$ presented above, the structured variational approximation finds the distribution $Q(X, I)$, among a given family of distributions, that best fits the exact distribution in terms of KL-divergence. By choosing a family of tractable distributions, it is possible to make inference involving the ideal distribution tractable. Here we use the structured mean-field approximation $Q(X, I) = Q_I(I) \prod_i Q_i(x_i)$. Minimizing the KL-divergence between those two distributions yields the following iterative update equations:
$$Q_i(x_i) \propto \exp\Big( \phi_X(x_i) + \mathbb{E}_{U \sim Q_I}[\phi_{XI}(x_i, U)] + \sum_{j>i} \mathbb{E}_{U_j \sim Q_j}[\phi_{XX}(x_i, U_j)] + \sum_{j<i} \mathbb{E}_{U_j \sim Q_j}[\phi_{XX}(U_j, x_i)] \Big) \qquad (5)$$

$$Q_I(I) \propto \exp\Big( \sum_j \mathbb{E}_{U_j \sim Q_j}[\phi_{XI}(U_j, I)] \Big) \qquad (6)$$

where $Q_i$ is the marginal distribution with respect to each of the frames, and $Q_I$ is the marginal with respect to the intent. An algorithmic implementation of these equations is presented in Algorithm 1.
Algorithm 1 Inference for Asynchronous Temporal Fields
1: Initialize Q ← Uniform distribution
2: while not converged do
3:   Visit frame i
4:   Get $\sum_{j>i} \mathbb{E}_{U_j \sim Q_j}[\phi_{XX}(x_i, U_j)]$
5:   Get $\sum_{j<i} \mathbb{E}_{U_j \sim Q_j}[\phi_{XX}(U_j, x_i)]$
6:   Get $\sum_j \mathbb{E}_{U_j \sim Q_j}[\phi_{XI}(U_j, I)]$
7:   while not converged do
8:     Update $Q_i$ and $Q_I$ using Eqs. 5 and 6
9:   Send $\mathbb{E}_{U \sim Q_i}[\phi_{XX}(x, U)]$
10:  Send $\mathbb{E}_{U \sim Q_i}[\phi_{XX}(U, x)]$
11:  Send $\mathbb{E}_{U \sim Q_i}[\phi_{XI}(U, I)]$
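To illustrate step 8 of Algorithm 1, the following sketch updates $Q_i$ and $Q_I$ from already-aggregated incoming messages following Eqs. (5) and (6); the array shapes and the assumption that messages arrive pre-summed are ours, not the paper's.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def update_Qi_QI(phi_X, phi_XI, msg_from_future, msg_from_past, msg_intent):
    """Update the marginals of one frame (Q_i) and of the intent (Q_I).

    phi_X:           [num_states]               unary/semantic potential of frame i
    phi_XI:          [num_states, num_intents]  frame-intent potential of frame i
    msg_from_future: [num_states]   sum_{j>i} E_{Q_j}[phi_XX(x_i, U_j)]
    msg_from_past:   [num_states]   sum_{j<i} E_{Q_j}[phi_XX(U_j, x_i)]
    msg_intent:      [num_intents]  sum_j     E_{Q_j}[phi_XI(U_j, I)]
    """
    Q_I = softmax(msg_intent)                             # Eq. (6)
    intent_term = phi_XI @ Q_I                            # E_{U ~ Q_I}[phi_XI(x_i, U)]
    Q_i = softmax(phi_X + intent_term
                  + msg_from_future + msg_from_past)      # Eq. (5)
    return Q_i, Q_I
```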
Here 'Get' and 'Send' refer to the message server, and f(x) is a message used later by frames in the same video. The term message server is used for a central process that keeps track of which node in which video sent which message, and distributes them accordingly when requested. In practice, this could be implemented in a multi-machine setup.

[Figure 4: per-frame likelihoods over time for the initial prediction and after the 1st, 2nd, and 3rd message passes.]
Figure 4. Evolution of the prediction with an increasing number of message passes. The first row shows the initial prediction for the category 'tidying with a broom' without any message passing, where darker colors correspond to higher likelihood; blue then marks an increase in likelihood, and brown a decrease. In the first message pass, the confidence of high predictions gets spread around, and eventually increases the confidence of the whole prediction.
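A toy sketch of such a message server, assuming an in-memory dictionary keyed by video, frame, and message type; a real multi-machine setup would replace this with a shared store.

```python
from collections import defaultdict

class MessageServer:
    """Toy in-memory message server: stores messages by (video_id, frame_idx, kind)
    along with the training iteration in which each message was computed."""

    def __init__(self):
        self._store = defaultdict(list)

    def send(self, video_id, frame_idx, kind, message, iteration):
        self._store[(video_id, frame_idx, kind)].append((iteration, message))

    def get(self, video_id, frame_idx, kind):
        # Oldest first; the caller can discount stale messages (see Eq. 10 below).
        return sorted(self._store[(video_id, frame_idx, kind)], key=lambda t: t[0])
```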
3.3. Learning
Training a deep CRF model requires calculating derivatives of the objective with respect to each of the potentials in the model, which in turn requires inference of $P(X, I \mid V)$. The network is trained to maximize the log-likelihood of the data, $l(X) = \log \sum_I P(x, I \mid V)$. The goal is to update the parameters of the model, for which we need gradients with respect to the parameters. Similar to SGD, we find the gradient with respect to one part of the parameters at a time, specifically with respect to one potential in one frame, that is, $\phi_X^i(\hat{x})$ instead of $\phi_X(\hat{x})$. The partial derivatives of this loss with respect to each of the potentials are as follows:
$$\frac{\partial l(X)}{\partial \phi_X^i(\hat{x})} = \mathbb{1}_{x=\hat{x}} - Q_i(\hat{x}) \qquad (7)$$

$$\frac{\partial l(X)}{\partial \phi_{XI}^i(\hat{x}, \hat{I})} = \frac{\exp \sum_j \phi_{XI}(x_j, \hat{I})}{\sum_I \exp \sum_j \phi_{XI}(x_j, I)}\, \mathbb{1}_{x=\hat{x}} - Q_i(\hat{x})\, Q_I(\hat{I}) \qquad (8)$$
$$\frac{\partial l(X)}{\partial \mu_i(a, b)} = \sum_{j>i} \mathbb{1}_{x=a}\, k(v_i, v_j) - Q_i(\hat{x}) \sum_{j>i} Q_I(b)\, k(v_i, v_j) + \sum_{j<i} \mathbb{1}_{x=b}\, k(v_j, v_i) - Q_i(\hat{x}) \sum_{j<i} Q_I(a)\, k(v_i, v_j) \qquad (9)$$
where $\phi_X^i(\hat{x})$ and $\phi_{XI}^i(\hat{x}, \hat{I})$ are the frame and frame-intent potentials of frame i, and we use $\hat{x}$ to distinguish between the labels and the variables the derivative is taken with respect to. $\mu_i(a, b)$ are the parameters of the asymmetric affinity kernel with respect to frame i, and $\mathbb{1}_{x=\hat{x}}$ is an indicator variable that has the value one if the ground-truth label corresponds to the variable. The complete derivation is presented in the appendix. These gradients are used to update the underlying CNN model, and these update equations lead to the learning procedure presented in Algorithm 2.
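Before turning to Algorithm 2, here is a concrete reading of Eq. (7): the gradient with respect to the unary potential of frame i is the familiar "observed minus expected" difference, sketched below under the assumption that $Q_i$ is a NumPy probability vector.

```python
import numpy as np

def unary_potential_gradient(ground_truth_label, Q_i):
    """Eq. (7): d l(X) / d phi_X^i = one-hot(ground truth) - Q_i."""
    one_hot = np.zeros_like(Q_i)
    one_hot[ground_truth_label] = 1.0
    return one_hot - Q_i
```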
Algorithm 2 Learning for Asynchronous Temporal Fields
1: Given videos $\mathcal{V}$
2: while not converged do
3:   for each example in mini-batch do
4:     Sample frame $v \in V \in \mathcal{V}$
5:     Get incoming messages
6:     Update $Q_i$ and $Q_I$
7:     Find gradients with Eqs. 7-9
8:     Backprop gradients through CNN
9:     Send outgoing messages

Figure 3 graphically illustrates the learning procedure. Since the videos are repeatedly visited throughout the training process, we do not have to run multiple message passes to calculate each partial gradient. This shares ideas with contrastive divergence [11, 38]. Given a single video at test time, we visualize in Figure 4 how the predictions change as the distribution converges with multiple message passes.
Message Passing. The key thing to note is that all the incoming messages are of the form $M(z) = \sum_j f_j(z)$, where $f_j$ is some function from node j; for example, $M(z) = \sum_j \mathbb{E}_{U_j \sim Q_j}[\phi_{XI}(U_j, z)] = \sum_j f_j(z)$ from Algorithm 1. We use the following approximation during training:

$$M(z) \approx \frac{h}{\sum_j d^j} \sum_j d^j f_{J(j)}(z), \qquad (10)$$

where $d \in [0, 1]$ is a discount factor, h is a hyperparameter, and $J(\cdot)$ is an ordering of the messages in that video based on the iteration in which the message was computed. The messages are thus a weighted combination of stored messages.
4. Experimental Results and Analysis
We analyzed the efficacy of our model on the challenging tasks of video activity classification and temporal localization. In addition, we investigated the different parts of the model, and demonstrate how they operate together.
Dataset. Recent years have witnessed an emergence of large-scale datasets containing sequences of common daily activities [42, 63, 60]. For our evaluation, we chose the Charades dataset [42]. This dataset is a challenging benchmark containing 9,848 videos across 157 action classes with 66,500 annotated activities, including nouns (objects), verbs (actions), and scenes. A unique feature of this dataset is the presence of complex co-occurrences of realistic human-generated activities, making it a perfect test-bed for our analysis. We evaluate video classification using the evaluation criteria and code from [42]. Temporal localization is evaluated in terms of per-frame classification using the provided temporal annotations.
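For reference, a rough sketch of the mean average precision (mAP) metric reported above; this is a generic AP computation for illustration only, and the paper's numbers come from the official Charades evaluation code.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class. scores: [num_videos] confidences; labels: [num_videos] in {0, 1}."""
    order = np.argsort(-scores)
    labels = np.asarray(labels, dtype=float)[order]
    if labels.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(labels) / (np.arange(len(labels)) + 1)
    return float((precision_at_k * labels).sum() / labels.sum())

def mean_average_precision(score_matrix, label_matrix):
    """mAP over classes. Both matrices are [num_videos, num_classes]."""
    return float(np.mean([average_precision(score_matrix[:, c], label_matrix[:, c])
                          for c in range(score_matrix.shape[1])]))
```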
Implementation details. We use a VGG-16 network [45] with additional layers to predict the model potentials (Figure 5). We train both a network on RGB frames and a network on stacks of optical flow images, following the two-stream architecture [44]. The main challenge in training the network is the increase in the output layer size. For the larger potentials,
