Future Event Prediction: If and When
Lukáš Neumann    Andrew Zisserman    Andrea Vedaldi
Visual Geometry Group, Department of Engineering Science
University of Oxford
{lukas,az,vedaldi}@robots.ox.ac.uk
Abstract
We consider the problem of future event prediction in video: if and when a future event will occur. To this end, we propose a number of representations and loss functions tailored to this problem. These include several probabilistic formulations that also model the uncertainty of the prediction. We train and evaluate the approach on two entirely different prediction scenarios: if and when a car will stop in the BDD100k car driving dataset; and if and when a player is going to shoot a basketball towards the basket in the NCAA basketball dataset.

We show that (i) we are able to predict events far in the future, up to 10 seconds before they occur; and (ii) using attention, we can determine which areas of the image sequence are responsible for these predictions, and find that they are meaningful, e.g. traffic lights are picked out for predicting when a vehicle will stop.
1. Introduction
Image understanding is usually concerned with the problem of describing the content of a given image or video. However, in applications such as robotics and autonomous driving this is often not enough: in order to react in a timely manner to external events (such as a pedestrian crossing the street), it may be necessary to predict them before they occur or are captured by the imaging device.
Our objective in this paper is to introduce the problem of time-to-event prediction into computer vision by predicting future events in video before they occur. In addition to its practical importance, the problem also has significant scientific interest. In fact, effectively predicting the future is likely to require an understanding of subtle properties of the world and of its dynamics. Thus, this task can be used as a form of self-supervision to learn about abstract concepts in images and videos.
In this paper, we focus on two key challenges that are of direct interest in many applications: telling (i) if a certain event, such as a car stopping, is likely to occur in the near future and, in this case, (ii) when the event is going to happen. Our approach performs probabilistic prediction of future events while accounting for event rarity, which is necessary because most of the data does not contain the event of interest.
Seeking to develop a very general approach, we consider two entirely different testing scenarios. The first scenario is to predict, given a video stream captured from a moving car, whether and when the car is going to stop, in response to traffic conditions and other environmental factors. The second scenario is to predict, in videos of basketball games, whether and when a player is going to throw the ball. We develop and publish two benchmarks for these tasks building on existing public benchmark data. We also develop a new benchmarking protocol based on evaluation metrics that reflect the ability of a model to perform such predictions in a manner that is relevant to applications.
The key design decision for this task is how to represent the prediction, and the associated loss function. In section 3 we discuss a number of possible options along with their advantages and disadvantages, introducing models that were not tested before in this context. Our best model is GMMH, a hybrid between a heatmap and a Mixture of Gaussians. We show that this approach works better than what may be thought of as the "default" solutions for modelling such problems, such as pure classification or the Weibull distribution used in survival analysis.
Compared to alternatives, the main benefit of our approach is its ability to predict events far in the future, up to 10 seconds before they occur. We also show that, as the algorithm learns to predict future events, it induces a visual representation that captures small but important details of image content. For example, in the driving application, we show that the learned neural network pays attention to elements such as traffic and car lights as predictors of slowing traffic.
We also compare our method to human performance, demonstrating that such systems are competitive and in fact better at performing predictions about the future than average humans. We attribute this to the ability of algorithms to better learn a specific domain, such as traffic conditions, and thus better capture subtle cues and tells that may be overlooked by people.
2. Related Work
Early Action Recognition.
Several authors have studied the problem of early action recognition, where the task is to predict what kind of action is currently happening, using as few frames as possible. Hoai et al. [9] use a Structured Output SVM to detect facial expressions early. Aliakbarian et al. [3, 2] use context- and action-aware features with a two-stage LSTM to recognize actions from partial video sequences. Similarly, Sigurdsson et al. [18] propose a fully-connected temporal CRF model, which exploits intent information to predict in each frame the action being performed. Dave et al. [5] use a predictive-corrective model to maintain a memory state of the network for per-frame action classification. Ramanathan et al. [16] use a BLSTM to classify and detect events in individual frames of basketball sports videos, and Heidarivincheh et al. [8] detect action completion. Wei et al. [27], on the other hand, learn to classify the arrow of time in videos.
The crucial difference between all the above methods and our task is that in their setting the event has already started, i.e. the methods observe the event of interest unrolling, whereas in our task we observe frames leading towards the start of the event, but never the event itself. The same difference applies to the standard activity recognition datasets [17, 10, 19, 4], which capture many different classes of events, but typically not what leads to them. Even if the lead-up were captured, the data would not be very informative for event prediction: given the nature of the actions in these datasets, such as "singing" or "cliff diving", the motivation to start the action is mostly intrinsic to the actors and therefore cannot easily be observed in the videos until the action actually starts.
Future Frames Prediction.
A lot of work has been done on predicting future visual appearance. Vondrick et al. [25, 26] use Generative Adversarial Networks (GANs) to explicitly model foreground and background pixels and generate a short video sequence from a single image. Xue et al. [29] also predict future frames from a single image, exploiting cross convolutional networks. Liang et al. [12] then use a dual motion GAN to predict future frames given an input video sequence. The prediction horizon of all these methods is however only a couple of frames, and therefore they are not applicable for predicting events which are several seconds in the future.
Event Prediction.
In the machine learning literature, time-to-event prediction has been extensively studied. Most recently, Martinsson [13] proposed a Recurrent Neural Network model with a Weibull distribution for customer churn prediction. Soleimani et al. [20] use Gaussian processes to predict time to event in medical applications. These models however deal only with low-dimensional sequential data (such as patient temperature measurements), and thus are not directly applicable in computer vision. Where possible, we nevertheless adapt these models for high-dimensional video data and evaluate them (see section 4.1).
In computer vision, Vondrick et al. [24] exploit AlexNet [11] representations to predict what action will happen in recordings of TV series exactly 1 second ahead. Felsen et al. [6] predict which player will hold the ball next, by applying random forests to normalized overhead-view sport videos. Alahi et al. [1] use an LSTM model to predict a 2D heatmap of future pedestrian positions in overhead pedestrian videos.
Our work is significantly different in that (i) we predict whether the event will happen or not; (ii) we give a time estimate of when the event is likely to happen and with what probability; (iii) we predict over a significantly longer time horizon (up to 10 seconds); and (iv) we do not require static and normalized camera views like [1, 6].
3. Method
3.1. Time to event
Consider a short sequence of $N$ past and current visual observations $x_{t_0-N+1}, x_{t_0-N+2}, \dots, x_{t_0}$. Our aim is to estimate, based on this information, the probability that a certain event of interest will occur in the near future and, if so, when exactly (see fig. 2).
Since the choice of a time origin is immaterial, we simplify the notation by assuming $t_0 = 0$, and denote by $X_N = (x_{-N+1}, x_{-N+2}, \dots, x_0)$ the $N$ observations. Hence, the task can be formulated as estimating the conditional probability density $p(t|X_N)$ of the time to event (TTE) $\tau \geq 0$.
More precisely, we define the TTE as the time when the observed system first enters a certain state of interest. For example, in the car application the state of interest is that the car velocity is zero, and in the sport application the condition is that the ball is flying as a consequence of a throw. This definition has a few important consequences. First, the system may enter and leave the state of interest several times, and we are only concerned with the first such occurrence. Second, the system may already be in said state at time $0$, for example because the car is already stopped, so that strictly speaking $\tau$ may be less than 0. In this case we assume instead that $\tau = 0$. Thirdly, the event may not occur at all in the near future. This "no show" condition could be modelled by allowing $t$ to range in $\mathbb{R}^+_0 \cup \{+\infty\}$. Instead, it is more practical to choose a fixed prediction horizon $\Delta_{\max} > 0$ and conventionally set $\tau = 2\Delta_{\max}$ whenever the TTE is beyond the horizon.
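To make this convention concrete, the following Python sketch (our illustration, not from the paper; `DELTA_MAX` and `tte_target` are hypothetical names) maps a raw event timestamp to the training target $\tau$:

```python
from typing import Optional

DELTA_MAX = 10.0  # prediction horizon in seconds; the paper predicts up to 10 s

def tte_target(event_time: Optional[float], t0: float = 0.0) -> float:
    """Map the time of the first entry into the state of interest to the
    training target tau, following the conventions described above."""
    if event_time is None:
        return 2.0 * DELTA_MAX      # "no show": no event in the near future
    tau = event_time - t0
    if tau < 0.0:
        return 0.0                  # system already in the state of interest
    if tau > DELTA_MAX:
        return 2.0 * DELTA_MAX      # beyond the horizon, treated as "no show"
    return tau
```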

[Figure 1: three example sequences, each showing input frames, model attention, and the model output: the predicted density $p(t|X_N)$ and CDF $F(t|X_N)$ over $t \in [0, 10]$ s.]
Figure 1. Predicting car stopping in the BDD100k driving dataset. Time to event probability prediction (blue), event occurrence ground truth (red), maximal prediction horizon $\Delta_{\max}$ (dashed gray). The network has learned to look for various cues (traffic lights, cars in front on the road) to predict the probability of if and when the car will stop. It has also learned to assign non-zero probability to times/positions corresponding to green traffic lights, as they might in fact turn red by the time the car gets there (middle row). See Supplementary material for the videos.
[Figure 2: timeline diagram showing the observed frames $x_{t_0-N}, \dots, x_{t_0}$, the unobserved future, and the event occurring at time $\tau$.]
Figure 2. Predicting a future event at time $\tau$, having observed $N$ frames up until time $t_0$, where $\Delta$ denotes the time distance between the last observed frame and the event.
In the following, rather than discussing the density function $p(t|X_N)$ directly, we will make use of its cumulative distribution function (CDF), which has the usual definition $F(u|X_N) = P[t \leq u \mid X_N]$. CDFs are more convenient in our discussion as they can represent both continuous and discrete-time distributions. Furthermore, CDFs can capture both non-trivial distributions, used to encode uncertainty, as well as "deterministic" ones that put all the mass on a specific value of $t$, and can thus be used to encode point estimates (these CDFs are step functions).
3.2. Predicting the TTE
Next, we describe a number of prediction models for the TTE, all implemented as neural networks $\Phi$. These networks take as input a sequence of observations $X_N$ and output an estimate $\hat{F}(t|X_N)$ of the TTE CDF. The networks are learned from example pairs $(X_N, \tau)$ via optimization of a suitable expected loss, which depends on the nature of the model. Models differ mainly by whether they use a discrete or continuous representation of time, and by whether they predict a non-trivial distribution over possible TTEs or instead produce point estimates.
3.2.1 Discrete-time models
One-in-many classification.
The simplest approach to modelling the CDF $\hat{F}(t|X_N)$ is to quantize time into discrete bins and reduce the problem to a classification one. For simplicity, assume that the quantization interval is one second and that there are $\Delta_{\max}$ bins in total. Time $t \geq 0$ is mapped to a discrete index by the following quantizer function:

$$q(t) = \begin{cases} \lfloor t \rfloor + 1, & t < \Delta_{\max}, \\ \Delta_{\max}, & t \geq \Delta_{\max}. \end{cases}$$

Index $q = \Delta_{\max}$ means that the event occurs beyond the prediction horizon.

In order to implement this model, the network $\Phi(X_N)$ is terminated in a softmax layer configured to output a $\Delta_{\max}$-dimensional probability vector. The corresponding CDF is given by the cumulative sum

$$\hat{F}(t|X_N) = \sum_{i=1}^{q(t)} \Phi_i(X_N). \quad (1)$$

The model is trained by minimizing its negative log-likelihood $E_{(X_N,t)}[-\log \Phi_{q(t)}(X_N)]$.
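For concreteness, here is a minimal PyTorch sketch of this classification model (our own code under the stated assumptions; the backbone is stubbed by precomputed features, and names such as `ClassificationHead` are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DELTA_MAX = 10  # number of one-second bins

class ClassificationHead(nn.Module):
    """Softmax head producing the Delta_max-dimensional probability vector."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, DELTA_MAX)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return F.softmax(self.fc(feats), dim=-1)  # Phi(X_N)

def quantize(t: torch.Tensor) -> torch.Tensor:
    """q(t) = floor(t) + 1 for t < Delta_max, else Delta_max (0-indexed here)."""
    return torch.clamp(torch.floor(t).long() + 1, max=DELTA_MAX) - 1

def cdf(probs: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Eq. (1): cumulative sum of the class probabilities up to q(t)."""
    cum = torch.cumsum(probs, dim=-1)
    return cum.gather(-1, quantize(t).unsqueeze(-1)).squeeze(-1)

# training: negative log-likelihood of the ground-truth bin
head = ClassificationHead()
feats = torch.randn(8, 512)           # stand-in for backbone features
tau = torch.rand(8) * 2 * DELTA_MAX   # targets, with "no show" at 2 * Delta_max
loss = F.nll_loss(torch.log(head(feats) + 1e-8), quantize(tau))
```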
Binary classifiers.
As a variant of the previous model, we consider using $\Delta_{\max}$ independent binary classifiers $\Phi_i(X_N) \in [0, 1]$ by processing the network output not via a softmax layer, but via a sigmoid. Each such classifier $\Phi_i(X_N)$ is trained via log-likelihood maximization to predict the probability that the event occurs at time $i$ or later. At test time, the event is deemed to occur at the time the first such classifier fires; formally, the CDF is given by

$$\hat{F}(t|X_N) = \begin{cases} 1, & \exists\, i \leq t + 1 : \Phi_i(X_N) \geq \frac{1}{2}, \\ 0, & \text{otherwise.} \end{cases} \quad (2)$$

Note that the expression above is valid for values of $t$ strictly less than the horizon $\Delta_{\max}$; if all classifiers fail to fire, then the event is deemed to occur beyond the prediction horizon and is conventionally predicted to be at time $2\Delta_{\max}$, as explained before.
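A short sketch of the corresponding decoding rule (again our own illustrative code; each classifier would be trained with binary cross-entropy against its per-index target):

```python
import torch

def binary_cdf(probs: torch.Tensor, t: float, delta_max: int = 10) -> torch.Tensor:
    """Eq. (2): probs is (B, delta_max) sigmoid outputs; the CDF is 1 if any
    classifier with index i <= t + 1 fires (>= 1/2), valid for t < delta_max."""
    k = min(int(t) + 1, delta_max)
    return (probs[:, :k] >= 0.5).any(dim=1).float()
```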
Heuristic heatmaps.
Inspired by the keypoint detection literature [23, 22, 14], we explore predicting the TTE via a heuristic heatmap. To this end, the neural network is configured to output a 1D heat map $\Phi(X_N) = \mathbf{h} \in \mathbb{R}^T_+$ of size $T = 2r\Delta_{\max}$, where $r$ is the temporal resolution (in our experiments, we set $r = 4$). The TTE is decoded as the position of the maximum in the heat map $\mathbf{h}$. For $t < 2\Delta_{\max}$, the TTE CDF is obtained via cumulative summation and normalization:

$$\hat{F}(t; \mathbf{h}) = \frac{1}{\sum_{i=1}^{T} h_i} \sum_{i=1}^{\lfloor \frac{T}{2\Delta_{\max}} t \rfloor + 1} h_i. \quad (3)$$
This model is trained by minimizing the expected squared $L_2$ distance $E_{(X_N,t)}[\| \mathbf{g}_t - \Phi(X_N) \|^2]$ between the estimated heatmap and a heatmap prototype $\mathbf{g}_t$ centered at the ground-truth TTE value $\tau$. The prototype is a Gaussian-like kernel

$$[g_t]_i = \exp\left[ -\frac{1}{2\sigma^2} \left( \frac{2\Delta_{\max}}{T}\, i - \tau \right)^2 \right].$$
Note that this model allows the heatmap to have non-zero values in the region beyond the prediction horizon $\Delta_{\max}$, up to $2\Delta_{\max}$, so that the "no show" case can be captured. For training the model, "no show" samples are mapped to Gaussian window prototypes $\mathbf{g}_t$ whose center is randomly selected in the interval $[\Delta_{\max}, 2\Delta_{\max}]$. This makes the data more balanced during training, compared to a representation that outputs just zero values in the heat map when the event does not occur.
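The target construction just described can be sketched as follows (our reading of the text; `heatmap_target` is a hypothetical name, and the prototype width `sigma` is an assumed free parameter):

```python
import torch

R, DELTA_MAX = 4, 10          # temporal resolution and horizon from the paper
T = 2 * R * DELTA_MAX         # 80 bins covering [0, 2 * DELTA_MAX] seconds

def heatmap_target(tau: float, sigma: float = 1.0) -> torch.Tensor:
    """Gaussian prototype g_t centred at tau; 'no show' samples get a centre
    drawn uniformly from [DELTA_MAX, 2 * DELTA_MAX]."""
    if tau >= DELTA_MAX:
        tau = DELTA_MAX + torch.rand(1).item() * DELTA_MAX
    i = torch.arange(1, T + 1, dtype=torch.float32)
    bin_time = i * (2 * DELTA_MAX / T)   # bin index -> time in seconds (i / r)
    return torch.exp(-0.5 * ((bin_time - tau) / sigma) ** 2)

# training loss: ||g_t - h||^2 between this prototype and the predicted heatmap h
```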
3.2.2 Continuous-time models
A limitation of the previous models is their finite temporal resolution, due to the use of a quantizer function, and their lack of a representation of prediction uncertainty. We thus evaluate models that produce a continuous estimate of the TTE and that, apart from Direct Regression, also output a prediction uncertainty.
Direct regression.
The simplest approach to predicting the TTE over a continuous domain is to configure the neural network to output a real number $\Phi(X_N) = \hat{t} \in \mathbb{R}_+$ that approximates the TTE directly (following the convention that, if the event does not occur before the time horizon $\Delta_{\max}$, then the model outputs the value $2\Delta_{\max}$).

This model estimates the TTE with infinite resolution, but it does not produce an uncertainty. This fact is captured by a step-wise CDF $\hat{F}(u|\hat{t}) = [u \geq \hat{t}]$. The model is trained by minimizing the expected absolute distance between the estimated TTE $\hat{t}$ and the ground-truth TTE $\tau$, given by $E_{(X_N,t)}[|\tau - \Phi(X_N)|]$; this is the same as the expected $L_1$ distance $\| F(\cdot|\hat{t}) - F(\cdot|\tau) \|_1$ between the estimated step-wise CDF and the step-wise CDF centered at the ground-truth TTE.
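In code, this model reduces to an $L_1$ regression loss on a scalar head (a trivial sketch, with hypothetical names):

```python
import torch
import torch.nn.functional as F

def regression_loss(t_hat: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """Expected absolute distance E[|tau - Phi(X_N)|]; t_hat is the (B,)
    scalar head output, tau the (B,) targets with 2 * Delta_max for 'no show'."""
    return F.l1_loss(t_hat, tau)
```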
Gaussian distribution.
In this case, the model $\Phi(X_N) = (\mu, \sigma) \in \mathbb{R}^2_+$ outputs two real numbers, the mean $\mu$ and standard deviation $\sigma$ of a 1D Gaussian distribution $\mathcal{N}(t|\mu, \sigma)$, with CDF $\hat{F}(t|X_N) = \mathcal{N}_{CDF}(t|\Phi(X_N))$. Again, we use the convention of setting $\tau = 2\Delta_{\max}$ when the event does not occur within the prediction horizon. This model is also trained by minimizing the negative log-likelihood $E_{(X_N)}[-\log \mathcal{N}(\tau|\Phi(X_N))]$.
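A minimal sketch of this loss using `torch.distributions` (our code; the softplus reparameterization keeping $\sigma$ positive is an assumption, not specified in the paper):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def gaussian_nll(head_out: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """head_out: (B, 2) raw network outputs; tau: (B,) ground-truth TTE."""
    mu = F.softplus(head_out[:, 0])             # mean, kept positive
    sigma = F.softplus(head_out[:, 1]) + 1e-6   # std, kept strictly positive
    return -Normal(mu, sigma).log_prob(tau).mean()

# at test time, the CDF is Normal(mu, sigma).cdf(t)
```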
Weibull distribution.
The Gaussian distribution is not necessarily optimal for modeling TTEs; in fact, several authors have proposed to use the Weibull distribution in similar contexts [13]. In order to experiment with this idea, the network is modified to output the parameters $(\alpha, \beta) \in \mathbb{R}^2_+$ of a Weibull distribution, whose CDF is given by $\hat{F}(t|\alpha, \beta) = 1 - e^{-(t/\alpha)^\beta}$. The advantage of the Weibull distribution is that it can more explicitly model the case in which the event does not occur within the prediction horizon $\Delta_{\max}$. Such a "no show" event is modeled as Type I censoring, denoted by $z = 0$, whereas $z = 1$ means that the event occurs within the window. The model is learned by minimizing the negative log-likelihood $E_{(X_N,t)}[-\log p(\tau|z, \Phi(X_N))]$, where

$$\log p(t|z, \alpha, \beta) = z \left[ \log \beta + \beta \log \frac{t}{\alpha} \right] - \left( \frac{t}{\alpha} \right)^\beta \quad (4)$$

and, given our convention of mapping events that do not occur to $\tau = 2\Delta_{\max}$, we have $z = [\tau \leq \Delta_{\max}]$.
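Eq. (4) translates directly into a censored negative log-likelihood; a sketch under the same softplus assumption as above:

```python
import torch
import torch.nn.functional as F

DELTA_MAX = 10.0

def weibull_nll(head_out: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """head_out: (B, 2) raw outputs mapped to (alpha, beta); z flags samples
    where the event occurs within the horizon (z = 0 is Type I censoring)."""
    alpha = F.softplus(head_out[:, 0]) + 1e-6
    beta = F.softplus(head_out[:, 1]) + 1e-6
    z = (tau <= DELTA_MAX).float()
    t = tau.clamp(min=1e-6)                     # avoid log(0)
    log_p = z * (torch.log(beta) + beta * torch.log(t / alpha)) - (t / alpha) ** beta
    return -log_p.mean()
```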
Gaussian Mixture Model Heatmap (GMMH).
Finally, we propose a novel representation based on a Gaussian Mixture Model (GMM), which combines the benefits of the Gaussian distribution and the heuristic heat map models.

The network is configured to output three vectors $\Phi(X_N) = (\boldsymbol{\mu}, \boldsymbol{\sigma}, \mathbf{h}) \in \mathbb{R}^{T \times 3}$, again of dimension $T = 2r\Delta_{\max}$. The first two vectors $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ represent the parameters of $T$ 1D Gaussian distributions, and the third vector $\mathbf{h}$, similar to the heuristic heatmap, is used here as the weighting of the $T$ Gaussian components. The CDF of this model is simply the weighted combination of Gaussian CDFs:

$$\hat{F}(t; \boldsymbol{\mu}, \boldsymbol{\sigma}, \mathbf{h}) = \frac{1}{\langle \mathbf{1}, \mathbf{h} \rangle} \sum_{j=1}^{T} h_j \, \mathcal{N}_{CDF}(t; \mu_j, \sigma_j).$$
The loss function is the negative log-likelihood of the GMM, regularized by the loss already adopted for the heuristic heatmap:

$$E_{(X_N,t)}\left[ -\log \frac{\sum_{j=1}^{T} h_j \, \mathcal{N}(\tau; \mu_j, \sigma_j)}{\langle \mathbf{1}, \mathbf{h} \rangle} + \lambda \| \mathbf{g}_t - \mathbf{h} \|^2 \right]. \quad (5)$$
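Putting the pieces together, a sketch of the GMMH head outputs, CDF, and loss (our code; the softplus mappings to positive $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$, $\mathbf{h}$ are assumptions, and `heatmap_target` refers to the prototype sketched earlier):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

R, DELTA_MAX = 4, 10
T = 2 * R * DELTA_MAX   # 80 mixture components

def split(head_out: torch.Tensor):
    """head_out: (B, T, 3) raw outputs -> positive (mu, sigma, h)."""
    mu = F.softplus(head_out[..., 0])
    sigma = F.softplus(head_out[..., 1]) + 1e-6
    h = F.softplus(head_out[..., 2]) + 1e-8     # non-negative mixture weights
    return mu, sigma, h

def gmmh_cdf(head_out: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Weighted combination of Gaussian CDFs."""
    mu, sigma, h = split(head_out)
    cdfs = Normal(mu, sigma).cdf(t.unsqueeze(-1))               # (B, T)
    return (h * cdfs).sum(-1) / h.sum(-1)

def gmmh_loss(head_out: torch.Tensor, tau: torch.Tensor,
              g_t: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Eq. (5): mixture NLL plus the heatmap regularizer ||g_t - h||^2."""
    mu, sigma, h = split(head_out)
    dens = Normal(mu, sigma).log_prob(tau.unsqueeze(-1)).exp()  # N(tau; mu_j, sigma_j)
    nll = -torch.log((h * dens).sum(-1) / h.sum(-1) + 1e-12)
    reg = ((g_t - h) ** 2).sum(-1)
    return (nll + lam * reg).mean()
```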
3.3. Evaluation Metrics
Next, we present three evaluation metrics of increasing granularity, allowing us to compare models' performance in terms of how well they predict whether the event will or will not occur (Event Prediction Accuracy), how accurately they predict the time to the event (Time-to-Event Error), and how accurate the estimated event probability distribution is (Model Surprise).

We expect predictions to be more difficult as the event occurs farther in the future, so we break down evaluations based on this parameter. Specifically, we vary $\Delta_{gt} \in [0, \Delta_{\max}]$ to draw performance curves, as the same test events are observed from greater temporal distance. We also report the average value of the three metrics as $\Delta_{gt}$ is swept inside this range.
Event Prediction Accuracy (EPA).
EPA measures whether the model can successfully answer the if question, namely whether the event will or will not occur within the prediction window. This is a classification problem and EPA is the average classification accuracy. Recall that, in order to predict that the event does not occur, discrete-time models predict a special index/class, whereas continuous-time models predict the event to occur at a time $t > \Delta_{\max}$.

Since most of the episodes $X_N$ do not contain the event of interest (as these are comparatively rare), in order to make metrics comparable between event types and datasets we balance the testing set so that the ratio between sequences with and without the event is 50:50.
Time-to-Event Error (TTEE).
TTEE measures whether the model can successfully answer the if so, when? question, namely determine when exactly the event will occur. TTEE is the average absolute prediction error $E_{(X_N,t)}[|\tau - \Psi \circ \Phi(X_N)|]$, where $\Psi$ is the operation that maps the output of the neural network $\Phi$ to a point estimate $\hat{t} = \Psi \circ \Phi(X_N)$ for the TTE. For discrete-time models, for example, $\Psi$ maps the predicted time index to a continuous time value to allow for a comparison against the ground-truth time. The empirical average is carried over the subset of the test set where the event does occur. If the network predicts incorrectly that the event does not occur, then the TTEE for that sample is set to $\Delta_{\max}$.
Model Surprise (MS).
For models that output a prediction uncertainty in addition to a point estimate, we are also interested in measuring the quality of the predicted uncertainty value. We do so by taking the output distribution $\hat{p}(t|X_N)$ and measuring the expected negative log-likelihood of the ground truth annotations in the test set, defined as $E_{(X_N)}[-\log \hat{p}(\tau|X_N)]$. This is also known as model "surprise" and is an indication of the quality of the probabilistic output of the model: if the model assigns high probability values to the correct ground truth locations, the resulting "surprise" will be low, and vice versa.
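The three metrics can be sketched on arrays of point estimates and ground truths as follows (illustrative code with hypothetical names; `p_hat_at_tau` holds each model's density evaluated at the ground-truth TTE):

```python
import numpy as np

DELTA_MAX = 10.0

def epa(t_hat: np.ndarray, tau: np.ndarray) -> float:
    """Event Prediction Accuracy: agreement on whether the event occurs."""
    return float(((t_hat <= DELTA_MAX) == (tau <= DELTA_MAX)).mean())

def ttee(t_hat: np.ndarray, tau: np.ndarray) -> float:
    """Time-to-Event Error over samples where the event occurs; a wrongly
    predicted 'no show' is penalised with DELTA_MAX."""
    occurs = tau <= DELTA_MAX
    err = np.abs(t_hat[occurs] - tau[occurs])
    err[t_hat[occurs] > DELTA_MAX] = DELTA_MAX
    return float(err.mean())

def model_surprise(p_hat_at_tau: np.ndarray) -> float:
    """Expected negative log-likelihood of the ground truth annotations."""
    return float(-np.log(p_hat_at_tau + 1e-12).mean())
```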
3.4. Backbone Architecture
In order to implement the neural network $\Phi$, in all experiments we adapt the 3D ResNet-34 [7] architecture (see table 1) and extend it with a soft-attention module [28] to visualize which regions of the video sequence play the key role in the network's decision making. Depending on the model, we also change the output dimension accordingly (see Section 3.2). Most notably, our proposed GMMH has an output dimension of $80 \times 3$ ($T = 2r\Delta_{\max} = 2 \times 4 \times 10 = 80$).

We used the vanilla SGD optimizer with Nesterov momentum [21], with an initial learning rate of $10^{-1}$, which was decreased by a factor of 10 every time the loss stopped improving, and trained every model until the learning rate fell below $10^{-5}$. All models were implemented in PyTorch [15] and all the source code will be released to foster reproducibility.
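The schedule described above maps naturally onto `ReduceLROnPlateau`; a sketch (our code: the momentum value 0.9 and the dummy model and data are assumptions, not from the paper):

```python
import torch

model = torch.nn.Linear(512, 80 * 3)   # stand-in for 3D ResNet-34 + GMMH head
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1,
                            momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)

def train_one_epoch() -> float:
    """Placeholder epoch over random data; a real run iterates the dataset."""
    x, y = torch.randn(8, 512), torch.randn(8, 240)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

while optimizer.param_groups[0]["lr"] >= 1e-5:
    scheduler.step(train_one_epoch())   # drop LR 10x when the loss plateaus
```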
4. Experiments
We assess our approach in two challenging scenarios:
egocentric car stopping and basketball throws.
4.1. Egocentric Car Stopping
In the first experiment, we aim to predict if and when a car is about to stop, using the video stream from a forward-looking camera mounted behind a windshield.

Dataset.
We build on the BDD100k dataset [30], which consists of 100,000 video sequences, each 40 seconds long, accompanied by basic sensory data such as GPS, velocity, and acceleration. We define the stopping event as the moment the car velocity reaches zero (the state of interest described in section 3.1).

References
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[15] A. Paszke et al. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
[21] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[28] K. Xu et al. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.