
Book ChapterDOI

Video Frame Interpolation via Cyclic Fine-Tuning and Asymmetric Reverse Flow

11 Jun 2019 - pp. 311-323

TL;DR: This work uses a convolutional neural network that takes two frames as input and predicts two optical flows with pixelwise weights; the resulting model outperforms the publicly available state-of-the-art methods on multiple datasets.
Abstract: The objective in video frame interpolation is to predict additional in-between frames in a video while retaining natural motion and good visual quality. In this work, we use a convolutional neural network (CNN) that takes two frames as input and predicts two optical flows with pixelwise weights. The flows are from an unknown in-between frame to the input frames. The input frames are warped with the predicted flows, multiplied by the predicted weights, and added to form the in-between frame. We also propose a new strategy to improve the performance of video frame interpolation models: we reconstruct the original frames using the learned model by reusing the predicted frames as input for the model. This is used during inference to fine-tune the model so that it predicts the best possible frames. Our model outperforms the publicly available state-of-the-art methods on multiple datasets.


Citation (APA):
Hannemose, M., Jensen, J. N., Einarsson, G., Wilm, J., Dahl, A. B., & Frisvad, J. R. (2019). Video Frame Interpolation via Cyclic Fine-Tuning and Asymmetric Reverse Flow. In Proceedings of 2019 Scandinavian Conference on Image Analysis (pp. 311-323). Springer. Lecture Notes in Computer Science Vol. 11482. https://doi.org/10.1007/978-3-030-20205-7_26

Video Frame Interpolation via Cyclic Fine-Tuning and Asymmetric Reverse Flow

Morten Hannemose¹, Janus Nørtoft Jensen¹, Gudmundur Einarsson¹, Jakob Wilm², Anders Bjorholm Dahl¹, and Jeppe Revall Frisvad¹

¹ DTU Compute, Technical University of Denmark
² SDU Robotics, University of Southern Denmark
Abstract. The objective in video frame interpolation is to predict additional in-between frames in a video while retaining natural motion and good visual quality. In this work, we use a convolutional neural network (CNN) that takes two frames as input and predicts two optical flows with pixelwise weights. The flows are from an unknown in-between frame to the input frames. The input frames are warped with the predicted flows, multiplied by the predicted weights, and added to form the in-between frame. We also propose a new strategy to improve the performance of video frame interpolation models: we reconstruct the original frames using the learned model by reusing the predicted frames as input for the model. This is used during inference to fine-tune the model so that it predicts the best possible frames. Our model outperforms the publicly available state-of-the-art methods on multiple datasets.

Keywords: slow motion · video frame interpolation · convolutional neural networks.
1 Introduction

Video frame interpolation, also known as inbetweening, is the process of generating intermediate frames between two consecutive frames in a video sequence. This is an important technique in computer animation [19], where artists draw keyframes and let software interpolate between them. With the advent of high frame rate displays that need to display videos recorded at lower frame rates, inbetweening has become important in order to perform frame rate up-conversion [2]. Computer animation research [9, 19] indicates that good inbetweening cannot be obtained based on linear motion, as objects often deform and follow nonlinear paths between frames. In an early paper, Catmull [3] interestingly argues that inbetweening is “akin to difficult artificial intelligence problems” in that it must be able to understand the content of the images in order to accurately handle e.g. occlusions. Applying learning-based methods to the problem of inbetweening thus seems an interesting line of investigation.

Some of the first work on video frame interpolation using CNNs was presented by Niklaus et al. [17, 18]. Their approach relies on estimating kernels to jointly represent motion and interpolate intermediate frames.

Fig. 1. Diagram illustrating the cyclic fine-tuning process when predicting frame $\hat{I}_{1.5}$. The model is first applied in a pairwise manner on the four input frames $I_0$, $I_1$, $I_2$, and $I_3$, then on the results $\hat{I}_{0.5}$, $\hat{I}_{1.5}$, and $\hat{I}_{2.5}$. The results of the second iteration, $\tilde{I}_1$ and $\tilde{I}_2$, are then compared with the input frames and the weights of the network are updated. This process optimizes our model specifically to be good at interpolating frame $\hat{I}_{1.5}$.
Concurrently, Liu et al. [11] and Jiang et al. [6] used neural networks to predict optical flow and used it to warp the input images, followed by a linear blending.

Our contribution is twofold. Firstly, we propose a CNN architecture that directly estimates asymmetric optical flows and weights from an unknown intermediate frame to two input frames. We use this to interpolate the frame in-between. Existing techniques either assume that this flow is symmetric or use a symmetric approximation followed by a refinement step [6, 11, 16]. For nonlinear motion, this assumption does not hold, and we document the effect of relaxing it. Secondly, we propose a new strategy for fine-tuning a network for each specific frame in a video. We rely on the fact that interpolated frames can be used to estimate the original frames by applying the method again with the in-between frames as input. The similarity of reconstructed and original frames can be considered a proxy for the quality of the interpolated frames. For each frame we predict, the model is fine-tuned in this manner using the surrounding frames in the video, see Figure 1. This concept is not restricted to our method and could be applied to other methods as well.
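As an illustration, the following is a minimal PyTorch-style sketch of this per-frame cyclic fine-tuning (cf. Figure 1). The frame-level interface `model(a, b)`, the plain L1 reconstruction loss, and the optimizer settings are our own assumptions for the sketch, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def cyclic_fine_tune(model, I0, I1, I2, I3, steps=10, lr=1e-5):
    """Fine-tune `model` for the single prediction between I1 and I2.

    I0..I3 are consecutive frames as (1, 3, H, W) tensors. Assumed
    interface: model(a, b) returns the frame halfway between a and b.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        # First pass: interpolate in-between frames pairwise.
        I05, I15, I25 = model(I0, I1), model(I1, I2), model(I2, I3)
        # Second pass: reuse the predictions to reconstruct known frames.
        I1_rec, I2_rec = model(I05, I15), model(I15, I25)
        # Reconstruction error is a proxy for interpolation quality
        # (plain L1 here; the paper's training loss is more elaborate).
        loss = F.l1_loss(I1_rec, I1) + F.l1_loss(I2_rec, I2)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return model(I1, I2)  # the frame we actually wanted
```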
2 Related work

Video frame interpolation is usually done in two steps: motion estimation followed by frame synthesis. Motion estimation is often performed using optical flow [1, 4, 25], and optical flow algorithms have used interpolation error as an error metric [1, 12, 23]. Frame synthesis can then be done via e.g. bilinear interpolation and occlusion reasoning using simple hole filling. Other methods use phase decompositions of the input frames to predict the phase decomposition of the intermediate frame and invert this for frame generation [14, 15], or they use local per-pixel convolution kernels on the input frames to both represent motion and synthesize new frames [17, 18].

Mahajan et al. [13] determine where each pixel in an intermediate frame comes from in the surrounding input frames by solving an expensive optimization problem. Our method is similar but replaces the optimization step with a learned neural network.
The advent of CNNs has prompted several new learning-based approaches. Liu et al. [11] train a CNN to predict a symmetrical optical flow from the intermediate frame to the surrounding frames. They synthesize the target frame by interpolating the values in the input frames. Niklaus et al. [17] train a network to output local 38 × 38 convolution kernels for each pixel to be applied on the input images. In [18], they are able to improve this to 51 × 51 kernels. However, their representation is still limited to motions within this range. Jiang et al. [6] first predict bidirectional optical flows between two input frames. They combine these to get a symmetric approximation of the flows from an intermediate frame to the input frames, which is then refined in a separate step. Our method, in contrast, directly predicts the final flows to the input frames without the need for an intermediate step. Niklaus et al. [16] also initially predict bidirectional flows between the input frames and extract context maps for the images. They warp the input images and context maps to the intermediate time step using the predicted flows. Another network blends these to get the intermediate frame.

Liu et al. [10] propose a new loss term, which they call cycle consistency loss. This is a loss based on how well the output frames of a model can reconstruct the input frames. They retrain the model from [11] with this and show state-of-the-art results. We use this loss term and show how it can be used during inference to improve results. Meyer et al. [14] estimate the phase of an intermediate frame from the phases of two input frames represented by steerable pyramid filters. They invert the decomposition to reconstruct the image. This method alleviates some of the limitations of optical flow, which are also limitations of our method: sudden light changes, transparency, and motion blur, for example. However, their results have a lower level of detail.
3 Method

Given a video containing the image sequence $I_0, I_1, \cdots, I_n$, we are interested in computing additional images that can be inserted in the original sequence to increase the frame rate, while keeping good visual quality in the video. Our method doubles the frame rate, which allows for the retrieval of approximately any in-between frame by recursive application of the method. This means that we need to compute estimates of $I_{0.5}, I_{1.5}, \cdots, I_{n-0.5}$, such that the final sequence would be:
$$I_0,\, I_{0.5},\, I_1,\, \cdots,\, I_{n-0.5},\, I_n.$$
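Since each application only doubles the frame rate, a frame at an arbitrary dyadic time offset $t \in (0, 1)$ can be approximated by recursive bisection. A small sketch, assuming a hypothetical `interpolate(a, b)` that returns the midpoint frame:

```python
def frame_at(interpolate, a, b, t, depth=3):
    """Approximate the frame at time t in (0, 1) between frames a and b.

    Recursively halves the interval; t is effectively snapped to the
    nearest multiple of 1 / 2**depth.
    """
    if depth == 0:
        return a if t < 0.5 else b
    mid = interpolate(a, b)  # the frame-rate-doubling model
    if t < 0.5:
        return frame_at(interpolate, a, mid, 2 * t, depth - 1)
    return frame_at(interpolate, mid, b, 2 * t - 1, depth - 1)
```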
We simplify the problem by only looking at interpolating a single frame $I_1$ that is located temporally between two neighboring frames $I_0$ and $I_2$. If we know the optical flows from the missing frame to each of these and denote them as $F_{1\to 0}$ and $F_{1\to 2}$, we can compute an estimate of the missing frame by
$$\hat{I}_1 = W_0 \odot \mathcal{W}(F_{1\to 0}, I_0) + W_2 \odot \mathcal{W}(F_{1\to 2}, I_2), \qquad (1)$$
where $\mathcal{W}(\cdot, \cdot)$ is the backward warping function that follows the flow vector to the input frame and samples a value with bilinear interpolation, and $\odot$ denotes pixel-wise multiplication. $W_0$ and $W_2$ are weights for each pixel describing how much each of the neighboring frames should contribute to the middle frame. The weights are used for handling occlusions. Examples of flows and weights can be seen in Figure 2.
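As a concrete reading of Equation (1), here is a minimal PyTorch sketch of the backward-warp-and-blend step. `grid_sample` performs the bilinear sampling; the flow channel order (x, y) and the helper names are our assumptions, not code from the paper.

```python
import torch
import torch.nn.functional as F

def backward_warp(flow, image):
    """W(flow, image): sample `image` where the flow vectors point.

    flow:  (B, 2, H, W) flow from the target frame into `image`,
           in pixels, channels ordered (x, y) -- an assumption.
    image: (B, C, H, W).
    """
    B, _, H, W = flow.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys)).float().to(flow.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                      # follow the flow
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                   # (B, H, W, 2)
    return F.grid_sample(image, grid, mode="bilinear", align_corners=True)

def synthesize_frame(I0, I2, F10, F12, W0, W2):
    """Equation (1): pixel-wise weighted blend of two warped inputs."""
    return W0 * backward_warp(F10, I0) + W2 * backward_warp(F12, I2)
```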

Fig. 2. Illustration of the frame interpolation process with $g$ from Equation (2). From left to right: input frames, predicted flows, weights, and final interpolated frame.
[Figure 3: encoder-decoder with feature channel counts 32, 64, 128, 256, 512 in the encoder and 512, 512, 256, 128, 64, 32 in the decoder, ending in a 6-channel output. Legend: average pooling, 2× bilinear upsampling, conv + ReLU, conv, skip connection.]

Fig. 3. The architecture of our network. Input is two color images $I_0$ and $I_2$ and output is optical flows $F_{1\to 0}$, $F_{1\to 2}$, and weights $W_0$, $W_2$. Convolutions are 3 × 3 and average pooling is 2 × 2 with a stride of 2. Skip connections are implemented by adding the output of the layer that arrows emerge from to the output of the layers they point to.
We train a CNN $g$ with a U-Net [20] style architecture, illustrated in Figure 3. The network takes two images as input and predicts the flows and pixel-wise weights
$$g(I_0, I_2) \mapsto F_{1\to 0},\, F_{1\to 2},\, W_0,\, W_2. \qquad (2)$$
Our architecture uses five 2 × 2 average pooling layers with stride 2 for the encoding and five bilinear upsampling layers to upscale the layers with a factor 2 in the decoding. We use four skip connections (addition) between layers in the encoder and decoder. It should be noted that our network is fully convolutional, which implies that it works on images of any size where both dimensions are a multiple of 32. If this is not the case, we pad the image with boundary reflections.
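The following is a minimal PyTorch sketch of an encoder-decoder in this spirit: five 2 × 2 average-pooling stages, bilinear upsampling, additive skip connections, and a 6-channel head (two 2-channel flows plus two 1-channel weights). Channel widths follow Figure 3, but the number of convolutions per stage, which four stages carry skip connections, and the softmax normalization of the weights are our assumptions, not the authors' exact network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    """Two 3x3 conv + ReLU layers, as in the Figure 3 legend."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class InterpolationNet(nn.Module):
    """U-Net-style g(I0, I2) -> (F10, F12, W0, W2); cf. Equation (2)."""

    def __init__(self):
        super().__init__()
        enc_widths = [32, 64, 128, 256, 512]        # channel counts, Figure 3
        self.enc = nn.ModuleList()
        c_prev = 6                                  # two stacked RGB frames
        for c in enc_widths:
            self.enc.append(conv_block(c_prev, c))
            c_prev = c
        self.bottleneck = conv_block(512, 512)
        dec_in, dec_out = [512, 512, 256, 128, 64], [512, 256, 128, 64, 32]
        self.dec = nn.ModuleList(conv_block(i, o)
                                 for i, o in zip(dec_in, dec_out))
        self.head = nn.Conv2d(32, 6, 3, padding=1)  # 2 flows + 2 weights

    def forward(self, I0, I2):
        x = torch.cat((I0, I2), dim=1)
        skips = []
        for block in self.enc:
            x = block(x)
            skips.append(x)
            x = F.avg_pool2d(x, 2)                  # 2x2 avg pool, stride 2
        x = self.bottleneck(x)
        for i, (block, skip) in enumerate(zip(self.dec, reversed(skips))):
            x = F.interpolate(x, scale_factor=2, mode="bilinear",
                              align_corners=False)  # 2x bilinear upsampling
            x = block(x)
            if i > 0:                               # four additive skips;
                x = x + skip                        # which four is our guess
        out = self.head(x)
        F10, F12 = out[:, 0:2], out[:, 2:4]
        # Assumption: weights normalized to sum to one via softmax.
        W = torch.softmax(out[:, 4:6], dim=1)
        return F10, F12, W[:, 0:1], W[:, 1:2]
```

For inputs whose dimensions are not multiples of 32, reflection padding before the network and cropping afterwards, e.g. `F.pad(x, (0, pad_w, 0, pad_h), mode="reflect")`, mirrors the boundary-reflection padding described above.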
Our model for frame interpolation is obtained by combining Equations (1) and (2) into
$$f(I_0, I_2) = \hat{I}_1, \qquad (3)$$
where $\hat{I}_1$ is the estimated image. The model is depicted in Figure 2. All components of $f$ are differentiable, which means that our model is end-to-end trainable. It is easy to get data in the form of triplets $(I_0, I_1, I_2)$ by taking frames from videos that we use as training data for our model.
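Putting the pieces together, $f$ is simply $g$ followed by the warp-and-blend of Equation (1); a sketch in terms of the hypothetical helpers above:

```python
def f(model, I0, I2):
    """Equation (3): the full, end-to-end differentiable model.

    `model` is an InterpolationNet as sketched above; `synthesize_frame`
    is the Equation (1) helper.
    """
    F10, F12, W0, W2 = model(I0, I2)                    # Equation (2)
    return synthesize_frame(I0, I2, F10, F12, W0, W2)   # Equation (1)
```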

Citations

Journal ArticleDOI
Xianhang Cheng, Zhenzhong Chen
03 Apr 2020
TL;DR: Experimental results demonstrate that the DSepConv method significantly outperforms the other kernel-based interpolation methods and shows strong performance on par or even better than the state-of-the-art algorithms both qualitatively and quantitatively.
Abstract: Learning to synthesize non-existing frames from the original consecutive video frames is a challenging task. Recent kernel-based interpolation methods predict pixels with a single convolution process to replace the dependency of optical flow. However, when scene motion is larger than the pre-defined kernel size, these methods yield poor results even though they take thousands of neighboring pixels into account. To solve this problem in this paper, we propose to use deformable separable convolution (DSepConv) to adaptively estimate kernels, offsets and masks to allow the network to obtain information with much fewer but more relevant pixels. In addition, we show that the kernel-based methods and conventional flow-based methods are specific instances of the proposed DSepConv. Experimental results demonstrate that our method significantly outperforms the other kernel-based interpolation methods and shows strong performance on par or even better than the state-of-the-art algorithms both qualitatively and quantitatively.

20 citations


Cites background from "Video Frame Interpolation via Cycli..."

  • "...estimating flow information together with occlusion masks or visibility maps with deep convolutional neural networks (CNNs) (Jiang et al. 2018; Bao et al. 2019; 2018a; Liu et al. 2017; van Amersfoort et al. 2017; Liu et al. 2019; Xue et al. 2019; Peleg et al. 2019; Yuan et al. 2019; Hannemose et al. 2019)..."


References

Proceedings Article
Diederik P. Kingma, Jimmy Ba
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

78,539 citations


"Video Frame Interpolation via Cycli..." refers methods in this paper

  • ...We train our network using the Adam optimizer [8] with default values β1 = 0....



Proceedings Article
Karen Simonyan, Andrew Zisserman
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

38,283 citations


"Video Frame Interpolation via Cycli..." refers background in this paper

  • "...Let $\phi$ be the output of relu4_4 from VGG19 [21], then $L_f = \left\| \phi(I_1) - \phi(\hat{I}_1) \right\|_2^2$..."
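A minimal sketch of this feature-space loss with torchvision's pretrained VGG19; slicing the feature stack at index 27 to obtain the relu4_4 activation is our reading of torchvision's layer numbering and should be verified.

```python
import torch
import torchvision

# phi: VGG19 features up to relu4_4 (index 26, so slice [:27]), frozen.
vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:27].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def feature_loss(I1, I1_hat):
    """L_f = ||phi(I1) - phi(I1_hat)||_2^2 (up to normalization)."""
    return torch.nn.functional.mse_loss(vgg(I1), vgg(I1_hat),
                                        reduction="sum")
```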


Journal ArticleDOI
Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, Eero P. Simoncelli
2004 - IEEE Transactions on Image Processing
Abstract: Objective methods for assessing perceptual image quality traditionally attempted to quantify the visibility of errors (differences) between a distorted image and a reference image using a variety of known properties of the human visual system. Under the assumption that human visual perception is highly adapted for extracting structural information from a scene, we introduce an alternative complementary framework for quality assessment based on the degradation of structural information. As a specific example of this concept, we develop a structural similarity index and demonstrate its promise through a set of intuitive examples, as well as comparison to both subjective ratings and state-of-the-art objective methods on a database of images compressed with JPEG and JPEG2000. A MATLAB implementation of the proposed algorithm is available online at http://www.cns.nyu.edu//spl sim/lcv/ssim/.

30,333 citations


Book ChapterDOI
Olaf Ronneberger, Philipp Fischer, Thomas Brox
05 Oct 2015
Abstract: There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .

28,273 citations


Network Information

Related Papers (5)
  • 17 Jul 2019 - Yu-Lun Liu +5 more
  • 01 Sep 2019 - Tejas Jayashankar, Pierre Moulin +2 more

Performance Metrics
No. of citations received by the Paper in previous years

Year | Citations
2020 | 1