Future Event Prediction: If and When
Lukáš Neumann    Andrew Zisserman    Andrea Vedaldi
Visual Geometry Group, Department of Engineering Science
University of Oxford
{lukas,az,vedaldi}@robots.ox.ac.uk
Abstract
We consider the problem of future event prediction in video: if and when a future event will occur. To this end, we propose a number of representations and loss functions tailored to this problem. These include several probabilistic formulations that also model the uncertainty of the prediction. We train and evaluate the approach on two entirely different prediction scenarios: if and when a car will stop in the BDD100k car driving dataset; and if and when a player is going to shoot a basketball towards the basket in the NCAA basketball dataset.

We show that (i) we are able to predict events far in the future, up to 10 seconds before they occur; and (ii) using attention, we can determine which areas of the image sequence are responsible for these predictions, and find that they are meaningful, e.g. traffic lights are picked out for predicting when a vehicle will stop.
1. Introduction
Image understanding is usually concerned with the problem of describing the content of a given image or video. However, in applications such as robotics and autonomous driving this is often not enough: in order to react in a timely manner to external events (such as a pedestrian crossing the street), it may be necessary to predict them before they occur or are captured by the imaging device.
Our objective in this paper is to introduce the problem of time-to-event prediction into computer vision by predicting future events in video before they occur. In addition to its practical importance, the problem also has significant scientific interest. In fact, effectively predicting the future is likely to require an understanding of subtle properties of the world and of its dynamics. Thus, this task can be used as a form of self-supervision to learn about abstract concepts in images and videos.
In this paper, we focus on two key challenges that are of direct interest in many applications: telling (i) if a certain event, such as a car stopping, is likely to occur in the near future and, in this case, (ii) when the event is going to happen. Our approach performs probabilistic prediction of future events while accounting for event rarity, which is necessary because most of the data does not contain the event of interest.
Seeking to develop a very general approach, we consider two entirely different testing scenarios. The first scenario is to predict, given a video stream captured from a moving car, whether and when the car is going to stop, in response to traffic conditions and other environmental factors. The second scenario is to predict, in videos of basketball games, whether and when a player is going to throw the ball. We develop and publish two benchmarks for these tasks building on existing public benchmark data. We also develop a new benchmarking protocol based on evaluation metrics that reflect the ability of a model to perform such predictions in a manner that is relevant to applications.
The key design decision for this task is how to represent the prediction, and the associated loss function. In section 3 we discuss a number of possible options along with their advantages and disadvantages, introducing models that were not tested before in this context. Our best model is GMMH, a hybrid between a heatmap and a Mixture of Gaussians. We show that this approach works better than what may be thought of as the "default" solutions for modelling such problems, such as pure classification or the Weibull distribution used in survival analysis.
Compared to alternatives, the main benefit of our approach is its ability to predict events far in the future, up to 10 seconds before they occur. We also show that, as the algorithm learns to predict future events, it induces a visual representation that captures small but important details of image content. For example, in the driving application, we show that the learned neural network pays attention to elements such as traffic and car lights as predictors of slowing traffic.
We also compare our method to human performance, demonstrating that such systems are competitive and in fact better at performing predictions about the future than average humans. We attribute this to the ability of algorithms to better learn a specific domain, such as traffic conditions, and thus better capture subtle cues and tells that may be overlooked by people.
2. Related Work
Early Action Recognition.
Several authors have studied the problem of early action recognition, where the task is to predict what kind of action is currently happening, using as few frames as possible. Hoai et al. [9] use a Structured Output SVM to detect facial expressions early. Aliakbarian et al. [3, 2] use context- and action-aware features with a two-stage LSTM to recognize actions from partial video sequences. Similarly, Sigurdsson et al. [18] propose a fully-connected temporal CRF model, which exploits intent information to predict in each frame the action being performed. Dave et al. [5] use a predictive-corrective model to maintain a memory state of the network for per-frame action classification. Ramanathan et al. [16] use a BLSTM to classify and detect events in individual frames of basketball sports videos, and Heidarivincheh et al. [8] detect action completion. Wei et al. [27], on the other hand, learn to classify the arrow of time in videos.
The crucial difference between all the above methods and our task is that in their setting the event has already started, i.e. the methods observe the event of interest unrolling, whereas in our task we observe frames leading towards the start of the event, but never the event itself. The same difference applies to the standard activity recognition datasets [17, 10, 19, 4], which capture many different classes of events, but typically not what leads to them. Even if the lead-up were captured, the data would not be very informative for event prediction: given the nature of the actions in these datasets, such as "singing" or "cliff diving", the motivation to start the action is mostly intrinsic to the actors and therefore cannot easily be observed in the videos until the action actually starts.
Future Frames Prediction.
A lot of work has been done on predicting future visual appearance. Vondrick et al. [25, 26] use Generative Adversarial Networks (GANs) to explicitly model foreground and background pixels and generate a short video sequence from a single image. Xue et al. [29] also predict future frames from a single image, exploiting cross convolutional networks. Liang et al. [12] then use a dual motion GAN to predict future frames given an input video sequence. The prediction horizon of all these methods is however only a couple of frames, and therefore they are not applicable for predicting events which are several seconds in the future.
Event Prediction.
In the machine learning literature, time-to-event prediction has been extensively studied. Most recently, Martinsson [13] proposed a Recurrent Neural Network model with a Weibull distribution for customer churn prediction. Soleimani et al. [20] use Gaussian processes to predict time to event in medical applications. These models however deal only with low-dimensional sequential data (such as patient temperature measurements), and thus are not directly applicable in computer vision. Where possible, we nevertheless adapt these models for high-dimensional video data and evaluate them (see section 4.1).
In computer vision, Vondrick et al. [24] exploit AlexNet [11] representations to predict what action will happen in recordings of TV series exactly 1 second ahead. Felsen et al. [6] predict which player will hold the ball next, by applying random forests to normalized overhead-view sport videos. Alahi et al. [1] use an LSTM model to predict a 2D heatmap of future pedestrian positions in overhead pedestrian videos.
Our work is significantly different in that (i) we predict whether the event will happen or not; (ii) we give a time estimate of when the event is likely to happen and with what probability; (iii) we predict over a significantly longer time horizon (up to 10 seconds); and (iv) we do not require static and normalized camera views like [1, 6].
3. Method
3.1. Time to event
Consider a short sequence of $N$ past and current visual observations $x_{t_0-N+1}, x_{t_0-N+2}, \dots, x_{t_0}$. Our aim is to estimate, based on this information, the probability that a certain event of interest will occur in the near future and, if so, when exactly (see fig. 2).
Since the choice of a time origin is immaterial, we simplify the notation by assuming $t_0 = 0$, and denote by $X_N = (x_{-N+1}, x_{-N+2}, \dots, x_0)$ the $N$ observations. Hence, the task can be formulated as estimating the conditional probability density $p(t|X_N)$ of the time to event (TTE) $\tau \geq 0$.
More precisely, we define the TTE as the time when the observed system first enters a certain state of interest. For example, in the car application the state of interest is that the car velocity is zero, and in the sport application the condition is that the ball is flying as a consequence of a throw. This definition has a few important consequences. First, the system may enter and leave the state of interest several times, and we are only concerned with the first such occurrence. Second, the system may already be in said state at time $0$, for example because the car is already stopped, so that strictly speaking $\tau$ may be less than 0. In this case we assume instead that $\tau = 0$. Thirdly, the event may not occur at all in the near future. This "no show" condition could be modelled by allowing $t$ to range in $\mathbb{R}^+_0 \cup \{+\infty\}$. Instead, it is more practical to choose a fixed prediction horizon $\Delta_{\max} > 0$ and conventionally set $\tau = 2\Delta_{\max}$ whenever the TTE is beyond the horizon.
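To make this convention concrete, the following Python sketch (our illustration, not from the paper; `DELTA_MAX` and `tte_target` are hypothetical names) maps a raw event timestamp to the training target $\tau$:

```python
from typing import Optional

DELTA_MAX = 10.0  # prediction horizon in seconds; the paper predicts up to 10 s

def tte_target(event_time: Optional[float], t0: float = 0.0) -> float:
    """Map the time of the first entry into the state of interest to the
    training target tau, following the conventions described above."""
    if event_time is None:
        return 2.0 * DELTA_MAX      # "no show": no event in the near future
    tau = event_time - t0
    if tau < 0.0:
        return 0.0                  # system already in the state of interest
    if tau > DELTA_MAX:
        return 2.0 * DELTA_MAX      # beyond the horizon, treated as "no show"
    return tau
```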

[Figure 1: three example sequences, each showing input frames, model attention, and the model output: the predicted density $p(t|X_N)$ and CDF $F(t|X_N)$ over $t \in [0, 10]$ s.]
Figure 1. Predicting car stopping in the BDD100k driving dataset. Time to event probability prediction (blue), event occurrence ground truth (red), maximal prediction horizon $\Delta_{\max}$ (dashed gray). The network has learned to look for various cues (traffic lights, cars in front on the road) to predict the probability of if and when the car will stop. It has also learned to assign non-zero probability to times/positions corresponding to green traffic lights, as they might in fact turn red by the time the car gets there (middle row). See Supplementary material for the videos.
[Figure 2: timeline diagram showing the observed frames $x_{t_0-N}, \dots, x_{t_0}$, the unobserved future, and the event occurring at time $\tau$.]
Figure 2. Predicting a future event at time $\tau$, having observed $N$ frames up until time $t_0$, where $\Delta$ denotes the time distance between the last observed frame and the event.
In the following, rather than discussing the density function $p(t|X_N)$ directly, we will make use of its cumulative distribution function (CDF), which has the usual definition $F(u|X_N) = P[t \leq u \mid X_N]$. CDFs are more convenient in our discussion as they can represent both continuous and discrete-time distributions. Furthermore, CDFs can capture both non-trivial distributions, used to encode uncertainty, as well as "deterministic" ones that put all the mass on a specific value of $t$, and can thus be used to encode point estimates (these CDFs are step functions).
3.2. Predicting the TTE
Next, we describe a number of prediction models for the TTE, all implemented as neural networks $\Phi$. These networks take as input a sequence of observations $X_N$ and output an estimate $\hat{F}(t|X_N)$ of the TTE CDF. The networks are learned from example pairs $(X_N, \tau)$ via optimization of a suitable expected loss, which depends on the nature of the model. Models differ mainly by whether they use a discrete or continuous representation of time, and by whether they predict a non-trivial distribution over possible TTEs or instead produce point estimates.
3.2.1 Discrete-time models
One-in-many classification.
The simplest approach to modelling the CDF $\hat{F}(t|X_N)$ is to quantize time into discrete bins and reduce the problem to a classification one. For simplicity, assume that the quantization interval is one second and that there are $\Delta_{\max}$ bins in total. Time $t \geq 0$ is mapped to a discrete index by the following quantizer function:

$$q(t) = \begin{cases} \lfloor t \rfloor + 1, & t < \Delta_{\max}, \\ \Delta_{\max}, & t \geq \Delta_{\max}. \end{cases}$$

Index $q = \Delta_{\max}$ means that the event occurs beyond the prediction horizon.

In order to implement this model, the network $\Phi(X_N)$ is terminated in a softmax layer configured to output a $\Delta_{\max}$-dimensional probability vector. The corresponding CDF is given by the cumulative sum

$$\hat{F}(t|X_N) = \sum_{i=1}^{q(t)} \Phi_i(X_N). \quad (1)$$

The model is trained by minimizing its negative log-likelihood $E_{(X_N,t)}[-\log \Phi_{q(t)}(X_N)]$.
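For concreteness, here is a minimal PyTorch sketch of this classification model (our own code under the stated assumptions; the backbone is stubbed by precomputed features, and names such as `ClassificationHead` are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DELTA_MAX = 10  # number of one-second bins

class ClassificationHead(nn.Module):
    """Softmax head producing the Delta_max-dimensional probability vector."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, DELTA_MAX)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return F.softmax(self.fc(feats), dim=-1)  # Phi(X_N)

def quantize(t: torch.Tensor) -> torch.Tensor:
    """q(t) = floor(t) + 1 for t < Delta_max, else Delta_max (0-indexed here)."""
    return torch.clamp(torch.floor(t).long() + 1, max=DELTA_MAX) - 1

def cdf(probs: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Eq. (1): cumulative sum of the class probabilities up to q(t)."""
    cum = torch.cumsum(probs, dim=-1)
    return cum.gather(-1, quantize(t).unsqueeze(-1)).squeeze(-1)

# training: negative log-likelihood of the ground-truth bin
head = ClassificationHead()
feats = torch.randn(8, 512)           # stand-in for backbone features
tau = torch.rand(8) * 2 * DELTA_MAX   # targets, with "no show" at 2 * Delta_max
loss = F.nll_loss(torch.log(head(feats) + 1e-8), quantize(tau))
```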
Binary classifiers.
As a variant of the previous model, we consider using $\Delta_{\max}$ independent binary classifiers $\Phi_i(X_N) \in [0, 1]$ by processing the network output not via a softmax layer, but via a sigmoid. Each such classifier $\Phi_i(X_N)$ is trained via log-likelihood maximization to predict the probability that the event occurs at time $i$ or later. At test time, the event is deemed to occur at the time the first such classifier fires; formally, the CDF is given by

$$\hat{F}(t|X_N) = \begin{cases} 1, & \exists\, i \leq t + 1 : \Phi_i(X_N) \geq \frac{1}{2}, \\ 0, & \text{otherwise.} \end{cases} \quad (2)$$

Note that the expression above is valid for values of $t$ strictly less than the horizon $\Delta_{\max}$; if all classifiers fail to fire, then the event is deemed to occur beyond the prediction horizon and is conventionally predicted to be at time $2\Delta_{\max}$, as explained before.
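A short sketch of the corresponding decoding rule (again our own illustrative code; each classifier would be trained with binary cross-entropy against its per-index target):

```python
import torch

def binary_cdf(probs: torch.Tensor, t: float, delta_max: int = 10) -> torch.Tensor:
    """Eq. (2): probs is (B, delta_max) sigmoid outputs; the CDF is 1 if any
    classifier with index i <= t + 1 fires (>= 1/2), valid for t < delta_max."""
    k = min(int(t) + 1, delta_max)
    return (probs[:, :k] >= 0.5).any(dim=1).float()
```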
Heuristic heatmaps.
Inspired by the keypoint detection literature [23, 22, 14], we explore predicting the TTE via a heuristic heatmap. To this end, the neural network is configured to output a 1D heat map $\Phi(X_N) = \mathbf{h} \in \mathbb{R}^T_+$ of size $T = 2r\Delta_{\max}$, where $r$ is the temporal resolution (in our experiments, we set $r = 4$). The TTE is decoded as the position of the maximum in the heat map $\mathbf{h}$. For $t < 2\Delta_{\max}$, the TTE CDF is obtained via cumulative summation and normalization:

$$\hat{F}(t; \mathbf{h}) = \frac{1}{\sum_{i=1}^{T} h_i} \sum_{i=1}^{\lfloor \frac{T}{2\Delta_{\max}} t \rfloor + 1} h_i. \quad (3)$$
This model is trained by minimizing the expected squared $L_2$ distance $E_{(X_N,t)}[\| \mathbf{g}_t - \Phi(X_N) \|^2]$ between the estimated heatmap and a heatmap prototype $\mathbf{g}_t$ centered at the ground-truth TTE value $\tau$. The prototype is a Gaussian-like kernel

$$[g_t]_i = \exp\left[ -\frac{1}{2\sigma^2} \left( \frac{2\Delta_{\max}}{T}\, i - \tau \right)^2 \right].$$
Note that this model allows the heatmap to have non-zero values in the region beyond the prediction horizon $\Delta_{\max}$, up to $2\Delta_{\max}$, so that the "no show" case can be captured. For training the model, "no show" samples are mapped to Gaussian window prototypes $\mathbf{g}_t$ whose center is randomly selected in the interval $[\Delta_{\max}, 2\Delta_{\max}]$. This makes the data more balanced during training, compared to a representation that outputs just zero values in the heat map when the event does not occur.
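The target construction just described can be sketched as follows (our reading of the text; `heatmap_target` is a hypothetical name, and the prototype width `sigma` is an assumed free parameter):

```python
import torch

R, DELTA_MAX = 4, 10          # temporal resolution and horizon from the paper
T = 2 * R * DELTA_MAX         # 80 bins covering [0, 2 * DELTA_MAX] seconds

def heatmap_target(tau: float, sigma: float = 1.0) -> torch.Tensor:
    """Gaussian prototype g_t centred at tau; 'no show' samples get a centre
    drawn uniformly from [DELTA_MAX, 2 * DELTA_MAX]."""
    if tau >= DELTA_MAX:
        tau = DELTA_MAX + torch.rand(1).item() * DELTA_MAX
    i = torch.arange(1, T + 1, dtype=torch.float32)
    bin_time = i * (2 * DELTA_MAX / T)   # bin index -> time in seconds (i / r)
    return torch.exp(-0.5 * ((bin_time - tau) / sigma) ** 2)

# training loss: ||g_t - h||^2 between this prototype and the predicted heatmap h
```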
3.2.2 Continuous-time models
A limitation of the previous models is their finite temporal resolution, due to the use of a quantizer function, and their lack of a representation of prediction uncertainty. We thus evaluate models that produce a continuous estimate of the TTE and that, apart from Direct Regression, also output a prediction uncertainty.
Direct regression.
The simplest approach to predicting the TTE over a continuous domain is to configure the neural network to output a real number $\Phi(X_N) = \hat{t} \in \mathbb{R}_+$ that approximates the TTE directly (following the convention that, if the event does not occur before the time horizon $\Delta_{\max}$, then the model outputs the value $2\Delta_{\max}$).

This model estimates the TTE with infinite resolution, but it does not produce an uncertainty. This fact is captured by a step-wise CDF $\hat{F}(u|\hat{t}) = [u \geq \hat{t}]$. The model is trained by minimizing the expected absolute distance between the estimated TTE $\hat{t}$ and the ground-truth TTE $\tau$, given by $E_{(X_N,t)}[|\tau - \Phi(X_N)|]$; this is the same as the expected $L_1$ distance $\| F(\cdot|\hat{t}) - F(\cdot|\tau) \|_1$ between the estimated step-wise CDF and the step-wise CDF centered at the ground-truth TTE.
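In code, this model reduces to an $L_1$ regression loss on a scalar head (a trivial sketch, with hypothetical names):

```python
import torch
import torch.nn.functional as F

def regression_loss(t_hat: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """Expected absolute distance E[|tau - Phi(X_N)|]; t_hat is the (B,)
    scalar head output, tau the (B,) targets with 2 * Delta_max for 'no show'."""
    return F.l1_loss(t_hat, tau)
```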
Gaussian distribution.
In this case, the model $\Phi(X_N) = (\mu, \sigma) \in \mathbb{R}^2_+$ outputs two real numbers, the mean $\mu$ and standard deviation $\sigma$ of a 1D Gaussian distribution $\mathcal{N}(t|\mu, \sigma)$, with CDF $\hat{F}(t|X_N) = \mathcal{N}_{CDF}(t|\Phi(X_N))$. Again, we use the convention of setting $\tau = 2\Delta_{\max}$ when the event does not occur within the prediction horizon. This model is also trained by minimizing the negative log-likelihood $E_{(X_N)}[-\log \mathcal{N}(\tau|\Phi(X_N))]$.
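A minimal sketch of this loss using `torch.distributions` (our code; the softplus reparameterization keeping $\sigma$ positive is an assumption, not specified in the paper):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def gaussian_nll(head_out: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """head_out: (B, 2) raw network outputs; tau: (B,) ground-truth TTE."""
    mu = F.softplus(head_out[:, 0])             # mean, kept positive
    sigma = F.softplus(head_out[:, 1]) + 1e-6   # std, kept strictly positive
    return -Normal(mu, sigma).log_prob(tau).mean()

# at test time, the CDF is Normal(mu, sigma).cdf(t)
```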
Weibull distribution.
The Gaussian distribution is not necessarily optimal for modeling TTEs; in fact, several authors have proposed to use the Weibull distribution in similar contexts [13]. In order to experiment with this idea, the network is modified to output the parameters $(\alpha, \beta) \in \mathbb{R}^2_+$ of a Weibull distribution, whose CDF is given by $\hat{F}(t|\alpha, \beta) = 1 - e^{-(t/\alpha)^\beta}$. The advantage of the Weibull distribution is that it can more explicitly model the case in which the event does not occur within the prediction horizon $\Delta_{\max}$. Such a "no show" event is modeled as Type I censoring, denoted by $z = 0$, whereas $z = 1$ means that the event occurs within the window. The model is learned by minimizing the negative log-likelihood $E_{(X_N,t)}[-\log p(\tau|z, \Phi(X_N))]$, where

$$\log p(t|z, \alpha, \beta) = z \left[ \log \beta + \beta \log \frac{t}{\alpha} \right] - \left( \frac{t}{\alpha} \right)^\beta \quad (4)$$

and, given our convention of mapping events that do not occur to $\tau = 2\Delta_{\max}$, we have $z = [\tau \leq \Delta_{\max}]$.
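Eq. (4) translates directly into a censored negative log-likelihood; a sketch under the same softplus assumption as above:

```python
import torch
import torch.nn.functional as F

DELTA_MAX = 10.0

def weibull_nll(head_out: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """head_out: (B, 2) raw outputs mapped to (alpha, beta); z flags samples
    where the event occurs within the horizon (z = 0 is Type I censoring)."""
    alpha = F.softplus(head_out[:, 0]) + 1e-6
    beta = F.softplus(head_out[:, 1]) + 1e-6
    z = (tau <= DELTA_MAX).float()
    t = tau.clamp(min=1e-6)                     # avoid log(0)
    log_p = z * (torch.log(beta) + beta * torch.log(t / alpha)) - (t / alpha) ** beta
    return -log_p.mean()
```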
Gaussian Mixture Model Heatmap (GMMH).
Finally, we propose a novel representation based on a Gaussian Mixture Model (GMM), which combines the benefits of the Gaussian distribution and the heuristic heat map models.

The network is configured to output three vectors $\Phi(X_N) = (\boldsymbol{\mu}, \boldsymbol{\sigma}, \mathbf{h}) \in \mathbb{R}^{T \times 3}$, again of dimension $T = 2r\Delta_{\max}$. The first two vectors $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ represent the parameters of $T$ 1D Gaussian distributions, and the third vector $\mathbf{h}$, similar to the heuristic heatmap, is used here as the weighting of the $T$ Gaussian components. The CDF of this model is simply the weighted combination of Gaussian CDFs:

$$\hat{F}(t; \boldsymbol{\mu}, \boldsymbol{\sigma}, \mathbf{h}) = \frac{1}{\langle \mathbf{1}, \mathbf{h} \rangle} \sum_{j=1}^{T} h_j \, \mathcal{N}_{CDF}(t; \mu_j, \sigma_j).$$
The loss function is the negative log-likelihood of the GMM, regularized by the loss already adopted for the heuristic heatmap:

$$E_{(X_N,t)}\left[ -\log \frac{\sum_{j=1}^{T} h_j \, \mathcal{N}(\tau; \mu_j, \sigma_j)}{\langle \mathbf{1}, \mathbf{h} \rangle} + \lambda \| \mathbf{g}_t - \mathbf{h} \|^2 \right]. \quad (5)$$
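Putting the pieces together, a sketch of the GMMH head outputs, CDF, and loss (our code; the softplus mappings to positive $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$, $\mathbf{h}$ are assumptions, and `heatmap_target` refers to the prototype sketched earlier):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

R, DELTA_MAX = 4, 10
T = 2 * R * DELTA_MAX   # 80 mixture components

def split(head_out: torch.Tensor):
    """head_out: (B, T, 3) raw outputs -> positive (mu, sigma, h)."""
    mu = F.softplus(head_out[..., 0])
    sigma = F.softplus(head_out[..., 1]) + 1e-6
    h = F.softplus(head_out[..., 2]) + 1e-8     # non-negative mixture weights
    return mu, sigma, h

def gmmh_cdf(head_out: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Weighted combination of Gaussian CDFs."""
    mu, sigma, h = split(head_out)
    cdfs = Normal(mu, sigma).cdf(t.unsqueeze(-1))               # (B, T)
    return (h * cdfs).sum(-1) / h.sum(-1)

def gmmh_loss(head_out: torch.Tensor, tau: torch.Tensor,
              g_t: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Eq. (5): mixture NLL plus the heatmap regularizer ||g_t - h||^2."""
    mu, sigma, h = split(head_out)
    dens = Normal(mu, sigma).log_prob(tau.unsqueeze(-1)).exp()  # N(tau; mu_j, sigma_j)
    nll = -torch.log((h * dens).sum(-1) / h.sum(-1) + 1e-12)
    reg = ((g_t - h) ** 2).sum(-1)
    return (nll + lam * reg).mean()
```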
3.3. Evaluation Metrics
Next, we present three evaluation metrics of increasing granularity, allowing us to compare models' performance in terms of how well they predict whether the event will or will not occur (Event Prediction Accuracy), how accurately they predict the time to the event (Time-to-Event Error), and how accurate the estimated event probability distribution is (Model Surprise).

We expect predictions to be more difficult as the event occurs farther in the future, so we break down evaluations based on this parameter. Specifically, we vary $\Delta_{gt} \in [0, \Delta_{\max}]$ to draw performance curves, as the same test events are observed from greater temporal distance. We also report the average value of the three metrics as $\Delta_{gt}$ is swept inside this range.
Event Prediction Accuracy (EPA).
EPA measures whether the model can successfully answer the if question, namely whether the event will or will not occur within the prediction window. This is a classification problem and EPA is the average classification accuracy. Recall that, in order to predict that the event does not occur, discrete-time models predict a special index/class, whereas continuous-time models predict the event to occur at a time $t > \Delta_{\max}$.

Since most of the episodes $X_N$ do not contain the event of interest (as these are comparatively rare), in order to make metrics comparable between event types and datasets we balance the testing set so that the ratio between sequences with and without the event is 50:50.
Time-to-Event Error (TTEE).
TTEE measures whether the model can successfully answer the if so, when? question, namely determine when exactly the event will occur. TTEE is the average absolute prediction error $E_{(X_N,t)}[|\tau - \Psi \circ \Phi(X_N)|]$, where $\Psi$ is the operation that maps the output of the neural network $\Phi$ to a point estimate $\hat{t} = \Psi \circ \Phi(X_N)$ for the TTE. For discrete-time models, for example, $\Psi$ maps the predicted time index to a continuous time value to allow for a comparison against the ground-truth time. The empirical average is carried over the subset of the test set where the event does occur. If the network predicts incorrectly that the event does not occur, then the TTEE for that sample is set to $\Delta_{\max}$.
Model Surprise (MS).
For models that output a prediction uncertainty in addition to a point estimate, we are also interested in measuring the quality of the predicted uncertainty value. We do so by taking the output distribution $\hat{p}(t|X_N)$ and measuring the expected negative log-likelihood of the ground truth annotations in the test set, defined as $E_{(X_N)}[-\log \hat{p}(\tau|X_N)]$. This is also known as model "surprise" and is an indication of the quality of the probabilistic output of the model: if the model assigns high probability values to the correct ground truth locations, the resulting "surprise" will be low, and vice versa.
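The three metrics can be sketched on arrays of point estimates and ground truths as follows (illustrative code with hypothetical names; `p_hat_at_tau` holds each model's density evaluated at the ground-truth TTE):

```python
import numpy as np

DELTA_MAX = 10.0

def epa(t_hat: np.ndarray, tau: np.ndarray) -> float:
    """Event Prediction Accuracy: agreement on whether the event occurs."""
    return float(((t_hat <= DELTA_MAX) == (tau <= DELTA_MAX)).mean())

def ttee(t_hat: np.ndarray, tau: np.ndarray) -> float:
    """Time-to-Event Error over samples where the event occurs; a wrongly
    predicted 'no show' is penalised with DELTA_MAX."""
    occurs = tau <= DELTA_MAX
    err = np.abs(t_hat[occurs] - tau[occurs])
    err[t_hat[occurs] > DELTA_MAX] = DELTA_MAX
    return float(err.mean())

def model_surprise(p_hat_at_tau: np.ndarray) -> float:
    """Expected negative log-likelihood of the ground truth annotations."""
    return float(-np.log(p_hat_at_tau + 1e-12).mean())
```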
3.4. Backbone Architecture
In order to implement the neural network $\Phi$, in all experiments we adapt the 3D ResNet-34 [7] architecture (see table 1) and extend it with a soft-attention module [28] to visualize which regions of the video sequence play the key role in the network's decision making. Depending on the model, we also change the output dimension accordingly (see Section 3.2). Most notably, our proposed GMMH has an output dimension of $80 \times 3$ ($T = 2r\Delta_{\max} = 2 \times 4 \times 10 = 80$).

We used the vanilla SGD optimizer with Nesterov momentum [21], with an initial learning rate of $10^{-1}$, which was decreased by a factor of 10 every time the loss stopped improving, and trained every model until the learning rate fell below $10^{-5}$. All models were implemented in PyTorch [15] and all the source code will be released to foster reproducibility.
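The schedule described above maps naturally onto `ReduceLROnPlateau`; a sketch (our code: the momentum value 0.9 and the dummy model and data are assumptions, not from the paper):

```python
import torch

model = torch.nn.Linear(512, 80 * 3)   # stand-in for 3D ResNet-34 + GMMH head
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1,
                            momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)

def train_one_epoch() -> float:
    """Placeholder epoch over random data; a real run iterates the dataset."""
    x, y = torch.randn(8, 512), torch.randn(8, 240)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

while optimizer.param_groups[0]["lr"] >= 1e-5:
    scheduler.step(train_one_epoch())   # drop LR 10x when the loss plateaus
```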
4. Experiments
We assess our approach in two challenging scenarios:
egocentric car stopping and basketball throws.
4.1. Egocentric Car Stopping
In the first experiment, we aim to predict if and when a car is about to stop, using the video stream from a forward-looking camera mounted behind a windshield.

Dataset.
We build on the BDD100k dataset [30], which consists of 100,000 video sequences, each 40 seconds long, accompanied by basic sensory data such as GPS, velocity, and acceleration. We define the stopping event as the moment the car velocity reaches zero (the state of interest described in section 3.1).

References
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[15] A. Paszke et al. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
[21] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[28] K. Xu et al. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.