Asynchronous Temporal Fields for Action Recognition

Gunnar A. Sigurdsson¹   Santosh Divvala²,³   Ali Farhadi²,³   Abhinav Gupta¹,³
¹Carnegie Mellon University   ²University of Washington   ³Allen Institute for Artificial Intelligence
Work was done while Gunnar was an intern at AI2.
github.com/gsig/temporal-fields/
Abstract

Actions are more than just movements and trajectories: we cook to eat and we hold a cup to drink from it. A thorough understanding of videos requires going beyond appearance modeling and necessitates reasoning about the sequence of activities, as well as higher-level constructs such as intentions. But how do we model and reason about these? We propose a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network. End-to-end training of such structured models is a challenging endeavor: for inference and learning we need to construct mini-batches consisting of whole videos, leading to mini-batches with only a few videos. This causes high correlation between data points, leading to breakdown of the backprop algorithm. To address this challenge, we present an asynchronous variational inference method that allows efficient end-to-end training. Our method achieves a classification mAP of 22.4% on the Charades [42] benchmark, outperforming the state-of-the-art (17.2% mAP), and offers equal gains on the task of temporal localization.
1. Introduction
Consider the video shown in Figure 1: A man walks through a doorway, stands at a table, holds a cup, pours something into it, drinks it, puts the cup on the table, and finally walks away. Despite depicting a simple activity, the video involves a rich interplay of a sequence of actions with underlying goals and intentions. For example, the man stands at the table 'to take a cup', he holds the cup 'to drink from it', etc. Thorough understanding of videos requires us to model such interplay between activities as well as to reason over extensive time scales and multiple aspects of actions (objects, scenes, etc.).
Most contemporary deep learning based methods have treated the problem of video understanding as that of only appearance and motion (trajectory) modeling [43, 53, 7, 27].
[Figure 1: video frames over time, annotated 'Holding a cup', 'Pouring into a cup', 'Drinking from a cup'; Intent: 'Getting something to drink'.]
Figure 1. Understanding human activities in videos requires jointly reasoning about multiple aspects of activities, such as 'what is happening', 'how', and 'why'. In this paper, we present an end-to-end deep structured model over time trained in a stochastic fashion. The model captures rich semantic aspects of activities, including Intent (why), Category (what), and Object (how). The figure shows video frames and annotations used in training from the Charades [42] dataset.
While this has fostered interesting progress in this domain, these methods still struggle to outperform models based on hand-crafted features, such as Dense Trajectories [56]. Why such a disconnect? We argue that video understanding requires going beyond appearance modeling, and necessitates reasoning about the activity sequence as well as higher-level constructs such as intentions. The recent emergence of large-scale datasets containing rich sequences of realistic activities [42, 63, 60] comes at a perfect time, enabling us to explore such complex reasoning.
But what is the right way to model and reason about temporal relations and goal-driven behaviour? Over the last couple of decades, graphical models such as Conditional Random Fields (CRFs) have been the prime vehicles for structured reasoning. Therefore, one possible alternative is to use ConvNet-based approaches [19] to provide features for a CRF training algorithm. Alternatively, it has been shown that integrating CRFs with ConvNet architectures and training them in an end-to-end manner provides substantial improvements in tasks such as segmentation and situation recognition [66, 1, 62].
Inspired by these advances, we present a deep-structured model that can reason temporally about multiple aspects of activities. For each frame, our model infers the activity category, object, action, progress, and scene using a CRF, where the potentials are predicted by a jointly end-to-end trained ConvNet over all predictions in all frames. This CRF has a latent node for the intent of the actor in the video and pairwise relationships between all individual frame predictions.
While our model is intuitive, training it in an end-to-end manner is a non-trivial task. In particular, end-to-end learning requires computing likelihoods for individual frames and doing joint inference about all connected frames with a CRF training algorithm. This is in stark contrast with the standard stochastic gradient descent (SGD) training algorithm (backprop) for deep networks, where we require mini-batches with a large number of independent and uncorrelated samples, not just a few whole videos. In order to handle this effectively: (1) we relax the Markov assumption and choose a fully-connected temporal model, such that each frame's prediction is influenced by all other frames, and (2) we propose an asynchronous method for training fully-connected structured models for videos. Specifically, this structure allows for an implementation where the influence (messages) from other frames is approximated by emphasizing influence from frames computed in recent iterations. These messages are more accurate and show an advantage over being limited to only neighboring frames. In addition to being more suitable for stochastic training, fully-connected models have shown increased performance on various tasks [18, 66].
In summary, our key contributions are: (a) a deep CRF-based model for structured understanding and comprehensive reasoning about videos in terms of multiple aspects, such as action sequences, objects, and even intentions; (b) an asynchronous training framework for expressive temporal CRFs that is suitable for end-to-end training of deep networks; and (c) substantial improvements over the state-of-the-art, increasing performance from 17.2% mAP to 22.4% mAP on the challenging Charades [42] benchmark.
2. Related Work
Understanding activities and actions has an extensive history [32, 59, 22, 17, 23, 2, 26, 56, 29, 21]. Interestingly, analyzing actions by their appearance has gone through multiple iterations. Early success came with hand-crafted representations such as Space Time Interest Points (STIP) [22], 3D Histograms of Gradients (HOG3D) [17], Histograms of Optical Flow (HOF) [23], and Motion Boundary Histograms [2]. These methods capture and analyze local properties of the visual-temporal data stream. In the past years, the most prominent hand-crafted representations have been from the family of trajectory-based approaches [26, 56, 29, 21], where the Improved Dense Trajectories (IDT) [56] representation is in fact on par with the state-of-the-art on multiple recent datasets [8, 42].
Recently there has been a push towards mid-level representations of video [37, 46, 13, 20] that capture more than local properties. However, these approaches still used hand-crafted features. With the advent of deep learning, learning representations from data has been extensively studied [14, 15, 44, 57, 52, 53, 24, 7, 61, 55, 40, 3]. Of these, one of the most popular frameworks has been the approach of Simonyan et al. [44], who introduced the idea of training separate color and optical-flow networks to capture local properties of the video.
Many of those approaches were designed for short clips of individual activities and hence do not generalize well to realistic sequences of activities. Capturing the whole information of the video in terms of the temporal evolution of the video stream has been the focus of some recent approaches [51, 6, 12, 35, 49, 30]. Moving towards more expressive deep networks such as LSTMs has become a popular method for encoding such temporal information [48, 4, 65, 50, 58, 41, 64]. Interestingly, while those models move towards a more complete understanding of the full video stream, they have yet to significantly outperform local methods [44] on standard benchmarks.
A different direction in understanding comes from reasoning about the complete video stream in a complementary direction: structure. Understanding activities in a human-centric fashion encodes our particular experiences with the visual world. Understanding activities with emphasis on objects has been a particularly fruitful direction [25, 36, 9, 34, 54]. In a similar vein, some works have also tried modeling activities as transformations [58] or state changes [5]. Recently, there has been significant progress in modelling the complete human-centric aspect, where image recognition is phrased in terms of objects and their roles [62, 10]. Moving beyond appearance and reasoning about the state of agents in the images requires understanding human intentions [16, 31]. This ability to understand people in terms of beliefs and intents has been traditionally studied in psychology as the Theory of Mind [33].
How to exactly model the structure of the visual and temporal world has been the pursuit of numerous fields. Of particular interest is work that combines the representative power of deep networks with structured modelling. Training such models is often cumbersome due to the differences in jointly training deep networks (stochastic sampling) and sequential models (consecutive samples) [28, 66]. In this work, we focus on fully-connected random fields, which have been popular in image segmentation [18], where image filtering was used for efficient message passing, and later extended to use CNN potentials [39].
3. Proposed Method
Given a video with multiple activities, our goal is to understand the video in terms of activities. Understanding activities requires reasoning about objects being interacted with, the place where the interaction is happening, what happened before and what happens after the current action, and even the intent of the actor in the video. We incorporate all of these by formulating a deep Conditional Random Field (CRF) over different aspects of the activity over time. That is, a video can be interpreted as a graphical model, where the components of the activity in each frame are nodes in the graph, and the model potentials are the edges in the graph.

[Figure 2: for each timepoint, a Two-Stream Network (VGG-16, fc7) predicts per-frame Scene, Progress, Action, Object, and Category (e.g., Dining, Start, Walk, Door, C097), connected over time and to a global Intent node.]
Figure 2. An overview of our structured model. The semantic part captures object, action, etc. at each frame, and the temporal aspect captures those over time. On the left side, we show how, for each timepoint in the video, a Two-Stream Network predicts the potentials. Our model jointly reasons about multiple aspects of activities in all video frames. The Intent captures groups of activities of the person throughout the whole sequence of activities, and fine-grained temporal reasoning is through fully-connected temporal connections.
In particular, we create a CRF which predicts activity, object, etc., for every frame in the video. For reasoning about time, we create a fully-connected temporal CRF, referred to as the Asynchronous Temporal Field in the text. That is, unlike a linear-chain CRF for temporal modelling (the discriminative counterpart to Hidden Markov Models), each node depends on the state of every other node in the graph. We incorporate intention as another latent variable which is connected to all the action nodes. This is an unobserved variable that influences the sequence of activities. This variable is the common underlying factor that guides and better explains the sequence of actions an agent takes. An analysis of what structure this latent variable learns is presented in the experiments. Our model has three advantages: (1) it addresses the problem of long-term interactions; (2) it incorporates reasoning about multiple parts of the activity, such as objects and intent; and (3) more interestingly, as we will see, it allows for efficient end-to-end training in an asynchronous stochastic fashion.
3.1. Architecture
In this work we encode multiple components of an activity. Each video with T frames is represented as $\{X_1, \dots, X_T, I\}$, where $X_t$ is a set of frame-level random variables for time step t and I is an unobserved random variable that represents the global intent in the entire video. We can further write $X_t = \{C_t, O_t, A_t, P_t, S_t\}$, where C is the activity category (e.g., 'drinking from cup'), O corresponds to the object (e.g., 'cup'), A represents the action (e.g., 'drink'), P represents the progress of the activity {start, middle, end}, and S represents the scene (e.g., 'Dining Room'). For clarity in the following derivation we will refer to all the associated variables of $X_t$ as a single random variable $X_t$. A more detailed description of the CRF is presented in the appendix.
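To make this notation concrete, here is a minimal sketch (not from the paper) of how the per-frame variables could be represented in code; the field types and encodings are assumptions made for this illustration only.

```python
from dataclasses import dataclass

# Illustrative container for the per-frame variables X_t = {C_t, O_t, A_t, P_t, S_t}.
# The paper treats these as discrete random variables in a CRF; this object is only
# a convenient way to hold one joint assignment.

PROGRESS_STATES = ("start", "middle", "end")  # the three values of P_t

@dataclass
class FrameVariables:
    category: int   # C_t: activity category index (e.g., 'drinking from cup')
    obj: int        # O_t: object index (e.g., 'cup')
    action: int     # A_t: action index (e.g., 'drink')
    progress: int   # P_t: index into PROGRESS_STATES
    scene: int      # S_t: scene index (e.g., 'Dining Room')

# A video is then {X_1, ..., X_T, I}: a list of FrameVariables plus one latent intent I.
```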
Mathematically, we consider a random field $\{X, I\}$ over all the random variables in our model ($\{X_1, \dots, X_T, I\}$). Given an input video $V = \{V_1, \dots, V_T\}$, where $V_t$ is a video frame, our goal is to estimate the maximum a posteriori labeling of the random field by marginalizing over the intent I. This can be written as:

$$x^* = \arg\max_x \sum_I P(x, I \mid V). \qquad (1)$$
For clarity in notation, we will drop the conditioning on V and write $P(X, I)$. We can define $P(X, I)$ using a Gibbs distribution as $P(X, I) = \frac{1}{Z(V)} \exp\left(E(x, I)\right)$, where $E(x, I)$ is the Gibbs energy over x. In our CRF, we model all unary and pairwise cliques between all frames $\{X_1, \dots, X_T\}$ and the intent I. The Gibbs energy is:
$$E(x, I) = \underbrace{\sum_i \phi_X(x_i)}_{\text{Semantic}} + \underbrace{\sum_i \phi_{XI}(x_i, I) + \sum_{i,j:\, i \neq j} \phi_{XX}(x_i, x_j)}_{\text{Temporal}}, \qquad (2)$$
where $\phi_{XX}(x_i, x_j)$ is the potential between frame i and frame j, and $\phi_{XI}(x_i, I)$ is the potential between frame i and the intent. For notational clarity, $\phi_X(x_i)$ incorporates all unary and pairwise potentials for $C_t, O_t, A_t, P_t, S_t$. The model is best understood in terms of two aspects: the Semantic aspect, which incorporates the local variables in each frame ($C_t, O_t, A_t, P_t, S_t$); and the Temporal aspect, which incorporates interactions among frames and the intent I. This is visualized in Figure 2. We will now explain the semantic and temporal potentials.

[Figure 3: a single timepoint's RGB and optical flow are processed by CNNs; their output, combined with input messages from the message server, forms the loss, which is backpropagated and produces output messages.]
Figure 3. Illustration of the learning algorithm and the message-passing structure. Each timepoint that has been processed has a message (blue highlights messages that have recently been computed). The loss receives a combination of those messages, uses those to construct new messages, and updates the network.
Semantic aspect. The frame potential $\phi_X(x_i)$ incorporates the interplay between activity category, object, action, progress, and scene, and could be written explicitly as $\phi_X(C_t, O_t, A_t, P_t, S_t)$. In practice this potential is composed of unary, pairwise, and tertiary potentials directly predicted by a CNN. We found predicting only the following terms to be sufficient without introducing too many additional parameters: $\phi_X(C_t, O_t, A_t, P_t, S_t) = \phi(O_t, P_t) + \phi(A_t, P_t) + \phi(O_t, S_t) + \phi(C_t, O_t, A_t, P_t)$, where we only model the assignments seen in the training set, and assume others are not possible.
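As an illustration of this factorization, the sketch below scores one joint assignment of a single frame from hypothetical CNN-predicted component tables; the table shapes, names, and the dictionary used to restrict the tertiary term to assignments seen in training are assumptions made for this sketch, not the paper's implementation.

```python
import numpy as np

# Factorized semantic potential:
#   phi_X(C,O,A,P,S) = phi(O,P) + phi(A,P) + phi(O,S) + phi(C,O,A,P)

def semantic_potential(phi_OP, phi_AP, phi_OS, phi_COAP, c, o, a, p, s):
    """phi_OP: [objects, progress]; phi_AP: [actions, progress]; phi_OS: [objects, scenes];
    phi_COAP: dict {(c, o, a, p): score} over assignments seen in training."""
    score = phi_OP[o, p] + phi_AP[a, p] + phi_OS[o, s]
    # Assignments never seen in the training set are treated as impossible.
    score += phi_COAP.get((c, o, a, p), -np.inf)
    return score
```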
Temporal aspect. The temporal aspect of the model is both in terms of the frame-intent potentials $\phi_{XI}(x_i, I)$ and frame-frame potentials $\phi_{XX}(x_i, x_j)$. The frame-intent potentials are predicted with a CNN from video frames (pixels and motion). The pairwise potentials $\phi_{XX}(x_i, x_j)$ for two time points i and j in our model have the form:

$$\phi_{XX}(x_i, x_j) = \mu(x_i, x_j) \sum_m w^{(m)} k^{(m)}(v_i, v_j), \qquad (3)$$

where $\mu$ models the asymmetric affinity between frames, $w$ are kernel weights, and each $k^{(m)}$ is a Gaussian kernel that depends on the video frames $v_i$ and $v_j$. In this work we use a single kernel that prioritises short-term interactions:

$$k(v_i, v_j) = \exp\left(-\frac{(j - i)^2}{2\sigma^2}\right) \qquad (4)$$

The parameters of the general asymmetric compatibility function $\mu(x_i, x_j)$ are learned from the data, and $\sigma$ is a hyper-parameter chosen by cross-validation.
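A minimal sketch of Eqs. (3) and (4), assuming a single Gaussian kernel and a learned compatibility matrix `mu` indexed by discrete frame labels; variable names and shapes are illustrative rather than taken from the paper's code.

```python
import numpy as np

def temporal_kernel(i, j, sigma):
    """Eq. (4): Gaussian kernel over frame indices, prioritising short-term interactions."""
    return np.exp(-((j - i) ** 2) / (2.0 * sigma ** 2))

def pairwise_potential(mu, x_i, x_j, i, j, sigma, w=1.0):
    """Eq. (3) with one kernel: phi_XX(x_i, x_j) = mu[x_i, x_j] * w * k(v_i, v_j)."""
    return mu[x_i, x_j] * w * temporal_kernel(i, j, sigma)
```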
3.2. Inference
While it is possible to enumerate all variable configurations in a single frame, doing so for multiple frames and their interactions is intractable. Our algorithm uses a structured variational approximation to approximate the full probability distribution. In particular, we use a mean-field approximation to make inference and learning tractable. With this approximation, we can do inference by keeping track of messages between frames, and asynchronously train one frame at a time (in a mini-batch fashion).
More formally, instead of computing the exact distribution $P(X, I)$ presented above, the structured variational approximation finds the distribution $Q(X, I)$, among a given family of distributions, that best fits the exact distribution in terms of KL-divergence. By choosing a family of tractable distributions, it is possible to make inference involving the ideal distribution tractable. Here we use the structured mean-field approximation $Q(X, I) = Q_I(I) \prod_i Q_i(x_i)$. Minimizing the KL-divergence between those two distributions yields the following iterative update equations:
$$Q_i(x_i) \propto \exp\Big( \phi_X(x_i) + \mathbb{E}_{U \sim Q_I}[\phi_{XI}(x_i, U)] + \sum_{j>i} \mathbb{E}_{U_j \sim Q_j}[\phi_{XX}(x_i, U_j)] + \sum_{j<i} \mathbb{E}_{U_j \sim Q_j}[\phi_{XX}(U_j, x_i)] \Big) \qquad (5)$$

$$Q_I(I) \propto \exp\Big( \sum_j \mathbb{E}_{U_j \sim Q_j}[\phi_{XI}(U_j, I)] \Big) \qquad (6)$$

where $Q_i$ is the marginal distribution with respect to each of the frames, and $Q_I$ is the marginal with respect to the intent. An algorithmic implementation of these equations is presented in Algorithm 1.
Algorithm 1 Inference for Asynchronous Temporal Fields
1: Initialize Q ← Uniform distribution
2: while not converged do
3:   Visit frame i
4:   Get $\sum_{j>i} \mathbb{E}_{U_j \sim Q_j}[\phi_{XX}(x_i, U_j)]$
5:   Get $\sum_{j<i} \mathbb{E}_{U_j \sim Q_j}[\phi_{XX}(U_j, x_i)]$
6:   Get $\sum_j \mathbb{E}_{U_j \sim Q_j}[\phi_{XI}(U_j, I)]$
7:   while not converged do
8:     Update $Q_i$ and $Q_I$ using Eqs. 5 and 6
9:   Send $\mathbb{E}_{U \sim Q_i}[\phi_{XX}(x, U)]$
10:  Send $\mathbb{E}_{U \sim Q_i}[\phi_{XX}(U, x)]$
11:  Send $\mathbb{E}_{U \sim Q_i}[\phi_{XI}(U, I)]$
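To illustrate step 8 of Algorithm 1, the following sketch updates $Q_i$ and $Q_I$ from already-aggregated incoming messages following Eqs. (5) and (6); the array shapes and the assumption that messages arrive pre-summed are ours, not the paper's.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def update_Qi_QI(phi_X, phi_XI, msg_from_future, msg_from_past, msg_intent):
    """Update the marginals of one frame (Q_i) and of the intent (Q_I).

    phi_X:           [num_states]               unary/semantic potential of frame i
    phi_XI:          [num_states, num_intents]  frame-intent potential of frame i
    msg_from_future: [num_states]   sum_{j>i} E_{Q_j}[phi_XX(x_i, U_j)]
    msg_from_past:   [num_states]   sum_{j<i} E_{Q_j}[phi_XX(U_j, x_i)]
    msg_intent:      [num_intents]  sum_j     E_{Q_j}[phi_XI(U_j, I)]
    """
    Q_I = softmax(msg_intent)                             # Eq. (6)
    intent_term = phi_XI @ Q_I                            # E_{U ~ Q_I}[phi_XI(x_i, U)]
    Q_i = softmax(phi_X + intent_term
                  + msg_from_future + msg_from_past)      # Eq. (5)
    return Q_i, Q_I
```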
Here 'Get' and 'Send' refer to the message server, and f(x) is a message used later by frames in the same video. The term message server is used for a central process that keeps track of which node in which video sent which message, and distributes them accordingly when requested. In practice, this could be implemented in a multi-machine setup.

[Figure 4: per-frame likelihoods over time for the initial prediction and after the 1st, 2nd, and 3rd message passes.]
Figure 4. Evolution of the prediction with an increasing number of message passes. The first row shows the initial prediction for the category 'tidying with a broom' without any message passing, where darker colors correspond to higher likelihood; blue then marks an increase in likelihood, and brown a decrease. In the first message pass, the confidence of high predictions gets spread around, and eventually increases the confidence of the whole prediction.
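A toy sketch of such a message server, assuming an in-memory dictionary keyed by video, frame, and message type; a real multi-machine setup would replace this with a shared store.

```python
from collections import defaultdict

class MessageServer:
    """Toy in-memory message server: stores messages by (video_id, frame_idx, kind)
    along with the training iteration in which each message was computed."""

    def __init__(self):
        self._store = defaultdict(list)

    def send(self, video_id, frame_idx, kind, message, iteration):
        self._store[(video_id, frame_idx, kind)].append((iteration, message))

    def get(self, video_id, frame_idx, kind):
        # Oldest first; the caller can discount stale messages (see Eq. 10 below).
        return sorted(self._store[(video_id, frame_idx, kind)], key=lambda t: t[0])
```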
3.3. Learning
Training a deep CRF model requires calculating derivatives of the objective with respect to each of the potentials in the model, which in turn requires inference of $P(X, I \mid V)$. The network is trained to maximize the log-likelihood of the data, $l(X) = \log \sum_I P(x, I \mid V)$. The goal is to update the parameters of the model, for which we need gradients with respect to the parameters. Similar to SGD, we find the gradient with respect to one part of the parameters at a time, specifically with respect to one potential in one frame, that is, $\phi_X^i(\hat{x})$ instead of $\phi_X(\hat{x})$. The partial derivatives of this loss with respect to each of the potentials are as follows:
$$\frac{\partial l(X)}{\partial \phi_X^i(\hat{x})} = \mathbb{1}_{x=\hat{x}} - Q_i(\hat{x}) \qquad (7)$$

$$\frac{\partial l(X)}{\partial \phi_{XI}^i(\hat{x}, \hat{I})} = \frac{\exp \sum_j \phi_{XI}(x_j, \hat{I})}{\sum_I \exp \sum_j \phi_{XI}(x_j, I)}\, \mathbb{1}_{x=\hat{x}} - Q_i(\hat{x})\, Q_I(\hat{I}) \qquad (8)$$
$$\frac{\partial l(X)}{\partial \mu_i(a, b)} = \sum_{j>i} \mathbb{1}_{x=a}\, k(v_i, v_j) - Q_i(\hat{x}) \sum_{j>i} Q_I(b)\, k(v_i, v_j) + \sum_{j<i} \mathbb{1}_{x=b}\, k(v_j, v_i) - Q_i(\hat{x}) \sum_{j<i} Q_I(a)\, k(v_i, v_j) \qquad (9)$$
where $\phi_X^i(\hat{x})$ and $\phi_{XI}^i(\hat{x}, \hat{I})$ are the frame and frame-intent potentials of frame i, and we use $\hat{x}$ to distinguish between the labels and the variables the derivative is taken with respect to. $\mu_i(a, b)$ are the parameters of the asymmetric affinity kernel with respect to frame i, and $\mathbb{1}_{x=\hat{x}}$ is an indicator variable that has the value one if the ground-truth label corresponds to the variable. The complete derivation is presented in the appendix. These gradients are used to update the underlying CNN model, and these update equations lead to the learning procedure presented in Algorithm 2.
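Before turning to Algorithm 2, here is a concrete reading of Eq. (7): the gradient with respect to the unary potential of frame i is the familiar "observed minus expected" difference, sketched below under the assumption that $Q_i$ is a NumPy probability vector.

```python
import numpy as np

def unary_potential_gradient(ground_truth_label, Q_i):
    """Eq. (7): d l(X) / d phi_X^i = one-hot(ground truth) - Q_i."""
    one_hot = np.zeros_like(Q_i)
    one_hot[ground_truth_label] = 1.0
    return one_hot - Q_i
```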
Algorithm 2 Learning for Asynchronous Temporal Fields
1: Given videos $\mathcal{V}$
2: while not converged do
3:   for each example in mini-batch do
4:     Sample frame $v \in V \in \mathcal{V}$
5:     Get incoming messages
6:     Update $Q_i$ and $Q_I$
7:     Find gradients with Eqs. 7-9
8:     Backprop gradients through CNN
9:     Send outgoing messages

Figure 3 graphically illustrates the learning procedure. Since the videos are repeatedly visited throughout the training process, we do not have to run multiple message passes to calculate each partial gradient. This shares ideas with contrastive divergence [11, 38]. Given a single video at test time, we visualize in Figure 4 how the predictions change as the distribution converges with multiple message passes.
Message Passing. The key thing to note is that all the incoming messages are of the form $M(z) = \sum_j f_j(z)$, where $f_j$ is some function from node j; for example, $M(z) = \sum_j \mathbb{E}_{U_j \sim Q_j}[\phi_{XI}(U_j, z)] = \sum_j f_j(z)$ from Algorithm 1. We use the following approximation during training:

$$M(z) \approx \frac{h}{\sum_j d^j} \sum_j d^j f_{J(j)}(z), \qquad (10)$$

where $d \in [0, 1]$ is a discount factor, h is a hyperparameter, and $J(\cdot)$ is an ordering of the messages in that video based on the iteration in which the message was computed. The messages are thus a weighted combination of stored messages.
4. Experimental Results and Analysis
We analyzed the efficacy of our model on the challenging tasks of video activity classification and temporal localization. In addition, we investigated the different parts of the model, and demonstrate how they operate together.
Dataset. Recent years have witnessed an emergence of large-scale datasets containing sequences of common daily activities [42, 63, 60]. For our evaluation, we chose the Charades dataset [42]. This dataset is a challenging benchmark containing 9,848 videos across 157 action classes with 66,500 annotated activities, including nouns (objects), verbs (actions), and scenes. A unique feature of this dataset is the presence of complex co-occurrences of realistic human-generated activities, making it a perfect test-bed for our analysis. We evaluate video classification using the evaluation criteria and code from [42]. Temporal localization is evaluated in terms of per-frame classification using the provided temporal annotations.
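For reference, a rough sketch of the mean average precision (mAP) metric reported above; this is a generic AP computation for illustration only, and the paper's numbers come from the official Charades evaluation code.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class. scores: [num_videos] confidences; labels: [num_videos] in {0, 1}."""
    order = np.argsort(-scores)
    labels = np.asarray(labels, dtype=float)[order]
    if labels.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(labels) / (np.arange(len(labels)) + 1)
    return float((precision_at_k * labels).sum() / labels.sum())

def mean_average_precision(score_matrix, label_matrix):
    """mAP over classes. Both matrices are [num_videos, num_classes]."""
    return float(np.mean([average_precision(score_matrix[:, c], label_matrix[:, c])
                          for c in range(score_matrix.shape[1])]))
```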
Implementation details. We use a VGG-16 network [45] with additional layers to predict the model potentials (Figure 5). We train both a network on RGB frames and a network on stacks of optical flow images, following the two-stream architecture [44]. The main challenge in training the network is the increase in the output layer size. For the larger potentials,
