Coherent Online Video Style Transfer
Dongdong Chen¹, Jing Liao², Lu Yuan², Nenghai Yu¹, and Gang Hua²

¹University of Science and Technology of China, cd722522@mail.ustc.edu.cn, ynh@ustc.edu.cn
²Microsoft Research Asia, {jliao,luyuan,ganghua}@microsoft.com
Abstract

Training a feed-forward network for fast neural style transfer of images has proven successful. However, the naive extension to processing video frame by frame is prone to producing flickering results. We propose the first end-to-end network for online video style transfer, which generates temporally coherent stylized video sequences in near real-time. Two key ideas include an efficient network incorporating short-term coherence, and propagating short-term coherence to long-term, which ensures consistency over a longer period of time. Our network can incorporate different image stylization networks. We show that the proposed method clearly outperforms the per-frame baseline both qualitatively and quantitatively. Moreover, it can achieve visually comparable coherence to optimization-based video style transfer, but is three orders of magnitude faster in runtime.
1. Introduction
Inspired by the success of work from Gatys et al. [16] on
neural style transfer, there has been a surge of recent works
[36, 27, 8, 17] addressing the problem of style transfer using
deep neural networks.
In their approaches, style transfer is formulated as an op-
timization problem, i.e., starting from white noise and searching for a new image that presents neural activations similar to the content image and feature correlations similar to the style image. Notwithstanding their impressive results, these
methods are very slow in runtime due to the heavy iter-
ative optimization process. To mitigate this issue, many
works have sought to speed up the transfer by training feed-
forward networks [23, 38, 28, 9, 11, 29]. Such techniques
have been successfully applied to a number of popular apps
such as Prisma, Pikazo, DeepArt, etc.
This work was done when Dongdong Chen was an intern at MSR Asia.
Extending neural style transfer from image to video may produce new and impressive effects, whose appeal is especially strong in short video sharing, live-view effects, and movie entertainment. The approaches discussed above,
when naively extended to process each frame of the video
one-by-one, often lead to flickering and false discontinu-
ities. This is because the solution of the style transfer task
is not stable. For optimization-based methods (e.g., [16]),
the instability stems from the random initialization and lo-
cal minima of the style loss function. For those methods based on feed-forward networks (e.g., [23]), a small perturbation in the content images, e.g., in lighting, noise, and motion, may cause large variations in the stylized results, as
shown in Figure 1. Consequently, it is essential to explore
temporal consistency in videos for stable outputs.
Anderson et al. [1] and Ruder et al. [35] address the flickering problem in optimization-based methods by introducing optical flow to constrain both the initialization and the loss function. Although very impressive and smooth stylized video sequences are obtained, their runtime is quite slow (usually several minutes per frame), making these methods less practical in real-world applications.
In search of a fast and yet stable solution to video style
transfer, we present the first feed-forward network lever-
aging temporal information for video style transfer, which
is able to produce consistent and stable stylized video se-
quences in near real-time. Our network architecture consists of a series of identical networks, each of which considers two-frame temporal coherence. The basic network incorpo-
rates two sub-networks, namely the flow sub-network and
the mask sub-network, into a certain intermediate layer of a
pre-trained stylization network (e.g., [23, 9]).
The flow sub-network, which is motivated by [43], es-
timates dense feature correspondences between consecu-
tive frames. It helps all consistent points along the mo-
tion trajectory be aligned in feature domain. The mask sub-
network identifies the occlusion or motion discontinuity re-
1
arXiv:1703.09211v2 [cs.CV] 28 Mar 2017

(a) (b) (c) (d) (e)
Figure 1. The image stylization network (e.g., [23]), will amplify image variances caused by some unnoticeable changes in inputs. Upper
row shows the four inputs: (a) the original one, (b) 5% lighter than (a), (c) Gaussian noises (µ = 0, σ = 1e 4) added to (a); and (d)
the next frame of (a) with subtle motions. The middle rows show the absolute difference between (a) and other three inputs. For better
visualization, these differences are boosted by 3×. The bottom row shows the corresponding stylization results. (e) shows close-up views
of some flickering regions.
gions. It helps adaptively blend feature maps from previ-
ous frames and the current frame to avoid ghosting artifacts.
The entire architecture is trained end-to-end, and minimizes
a new loss function, jointly considering stylization and tem-
poral coherence.
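To make this sensitivity concrete, the following is a minimal sketch of the perturbation check illustrated in Figure 1, written in Python with NumPy; the function `stylize` stands for any hypothetical feed-forward per-frame stylization model (it is not part of the paper's code), and the perturbation magnitudes follow the caption above.

```python
import numpy as np

def perturbation_check(stylize, frame, next_frame):
    """Compare stylized outputs under the small input perturbations of Figure 1.

    `stylize` is a hypothetical feed-forward per-frame stylization function
    mapping an HxWx3 float image in [0, 1] to a stylized HxWx3 image.
    """
    inputs = {
        "original": frame,
        "lighter": np.clip(frame * 1.05, 0.0, 1.0),  # (b) 5% lighter
        "noisy": np.clip(frame + np.random.normal(0.0, 1e-4, frame.shape), 0.0, 1.0),  # (c)
        "next_frame": next_frame,  # (d) subtle motion
    }
    outputs = {name: stylize(img) for name, img in inputs.items()}

    # Small input differences vs. large output differences reveal the instability.
    for name in ("lighter", "noisy", "next_frame"):
        in_diff = np.abs(inputs[name] - frame).mean()
        out_diff = np.abs(outputs[name] - outputs["original"]).mean()
        print(f"{name}: mean input diff {in_diff:.6f}, mean output diff {out_diff:.6f}")
```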
There are two kinds of temporal consistency in videos,
as mentioned in [35]: long-term consistency and short-term
consistency. Long-term consistency is more appealing since it produces stable results over longer periods of time, and can even enforce consistency of the synthesized frames before and after an occlusion. This constraint can be easily enforced in optimization-based methods [35]. Unfortunately, it is quite difficult to incorporate in feed-forward networks, due to limited batch size, computation time, and cache memory. Therefore, short-term consistency is more affordable for a feed-forward network in practice.
Our solution is therefore a compromise between consistency and efficiency. Our network is designed to mainly consider short-term relationships (only two frames), but long-term consistency is partially achieved by propagating the short-term ones. Our network directly leverages the composite features obtained from the previous frame, and combines them with features at the current frame for the propagation. In this way, when a point can be traced along motion trajectories, its feature can be propagated until the track ends.
This approximation may suffer from shifting errors in
propagation, and inconsistency before and after the occlu-
sion. Nevertheless, in practice, we do not observe obvious ghosting or flickering artifacts with our online method, and online processing is necessary in many real applications. In summary,
our proposed video style transfer network is unique in the
following aspects:
• Our network is the first network leveraging temporal information that is trained end-to-end for video style transfer, which successfully generates stable results.
• Our feed-forward network is thousands of times faster than optimization-based video style transfer [1, 35], reaching 15 fps on modern GPUs.
• Our method enables online processing, and is cheap in both learning and inference, since we achieve a good approximation of long-term temporal coherence by propagating the short-term one.
• Our network is general, and can be successfully applied to several existing image stylization networks, including per-style-per-network [23] or multiple-style-per-network [9].
2. Related Work
2.1. Style Transfer for Images and Videos
Traditional image stylization works mainly focus on texture synthesis based on low-level features, which use non-parametric sampling of pixels or patches in given source
texture images [13, 20, 12] or stroke databases [30, 19].
Their extension to video mostly uses optical flow to con-
strain the temporal coherence of sampling [4, 18, 31]. A
comprehensive survey can be found in [25].
Recently, with the development of deep learning, us-
ing neural networks for stylization has become an active topic.
Gatys et al. [16] first propose a method of using pre-trained
Deep Convolutional Neural Networks (CNNs) for image stylization. It generates more impressive results compared to traditional methods because CNNs provide more semantic representations of styles. To further improve the transfer
quality, different complementary schemes have been pro-
posed, including face constraints [36], Markov Random
Field (MRF) prior [27], user guidance [8] or controls [17].
Unfortunately, these methods based on an iterative opti-
mization are computationally expensive in run-time, which
imposes a significant limitation in real applications. To make the run-time more efficient, some works directly learn a feed-forward generative network for a specific style [23, 38, 28] or multiple styles [9, 11, 29], which are hundreds of times faster than optimization-based methods.
Another direction of neural style transfer [16] is to ex-
tend it to videos. A naive solution that independently processes each frame produces flickers and false discontinuities. To preserve temporal consistency, Anderson et al. [1]
use optical flow to initialize the style transfer optimization,
and incorporate flow explicitly into the loss function. To
further reduce ghosting artifacts at the boundaries and oc-
cluded regions, Ruder et al. [35] introduce masks to filter
out the flow with low confidence in the loss function. This makes it possible to generate consistent and stable stylized video sequences, even in cases with large motion and strong occlusions. Notwithstanding their demonstrated success in video style transfer, these methods are very slow due to the iterative optimiza-
tion. Feed-forward networks [23, 38, 28, 9, 11, 29] have
proven to be efficient in image style transfer. However, we
are not aware of any work that trains a feed-forward network
that explicitly takes temporal coherence into consideration
in video style transfer.
2.2. Temporal Coherence in Video Filter
Video style transfer can be viewed as applying one kind
of artistic filter on videos. How to preserve the temporal
coherence is essential and has been considered in previous
video filtering work. One popular solution is to temporally
smooth filter parameters. For instance, Bonneel et al. [2]
and Wang et al. [39] transfer the color grade of one video to
another by temporally filtering the color transfer functions.
Another solution is to extend the filter from 2D to 3D.
Paris et al. [32] extend the Gaussian kernel in bilateral fil-
tering and mean-shift clustering to the temporal domain for
many video applications. Lang et al. [26] also extend
the notion of smoothing to the temporal domain by exploit-
ing optical flow and revisit optimization-based techniques
such as motion estimation and colorization. These temporal
smoothing and 3D extension methods are specific to their
applications, and cannot generalize to other applications,
such as stylization.
A more general solution considering temporal coher-
ence is to incorporate a post-processing step which is blind
to filters. Dong et al. [10] segment each frame into sev-
eral regions and spatiotemporally adjust the enhancement
(produced by unknown image filters) of regions of differ-
ent frames; Bonneel et al. [3] filter videos along motion
paths using a temporal edge-preserving filter. Unfortu-
nately, these post-processing methods fracture texture patterns or introduce ghosting artifacts when applied to stylization results, due to their heavy reliance on accurate optical flow.
As for stylization, previous methods (including tradi-
tional ones [4, 18, 31, 42] and neural ones [1, 35]) rely on
optical flow to track motions and keep coherence in color
and texture patterns along the motion trajectories. Never-
theless, how to add flow constraints to feed-forward styliza-
tion networks has not been investigated before.
2.3. Flow Estimation
Optical flow is known as an essential component in many
video tasks. It has been studied for decades and numerous
approaches have been proposed [21, 5, 40, 6, 41, 34]. These methods are all hand-crafted, making them difficult to integrate into and jointly train within our end-to-end network.
Recently, deep learning has been explored for solving optical flow. FlowNet [15] is the first deep CNN designed to directly estimate optical flow, achieving good results.
Later, its successors focused on accelerating the flow esti-
mation [33], or achieving better quality [22]. Zhu et al. [43]
recently integrate FlowNet [15] into image recognition networks and train the network end-to-end for fast video recognition. Our work is inspired by their idea of applying FlowNet to existing networks. However, the stylization task, unlike the recognition one, requires some new factors to be considered in the network design, such as the loss function and feature composition.
3. Method
3.1. Motivation
When style transfer is applied to consecutive frames independently (e.g., [23]), subtle changes in appearance (e.g., lighting, noise, motion) result in strong
flickering, as shown in Figure 1. By contrast, in still-image
style transfer, such small changes in the content image, es-
pecially on flat regions, may be necessary to generate spa-
tially rich and varied stylized patterns, making the result
more impressive. Thus, how to keep such spatially rich and interesting texture patterns, while preserving the temporal consistency in videos, is worthy of a more careful study.

Figure 2. Visualization of two-frame temporal consistency. Two inputs I_{t-1}, I_t pass through the stylization network [23] to obtain feature maps F_{t-1}, F_t and stylized results S_{t-1}, S_t. We may notice discontinuities between S_{t-1} and S_t in the red and green rectangles. The third row shows the warped feature map F'_t and stylized result S'_t obtained through the flow W_t. We can see texture patterns (red rectangle) are successfully traced from t-1 to t, but ghosting occurs at occluded regions (green rectangle) of S'_t. The occlusion mask is shown as M. In these false regions, F'_t (also S'_t) is replaced with F_t (also S_t) to get the composite features F^o_t and result O_t.
For simplicity, we start by exploring temporal coherence
between two frames. Our intuition is to warp the stylized
result from the previous frame to the current one, and adap-
tively fuse both together. In other words, traceable points/regions from the previous frame remain unchanged, while untraceable points/regions use the new results computed at the current frame. Such an intuitive strategy kills two birds with one stone: 1) it makes sure stylized results along the motion paths are as stable as possible; 2) it avoids ghosting artifacts at occlusions or motion discontinuities. We illustrate the idea in Figure 2.
The strategy outlined above only preserves the short-
term consistency, which can be formulated as the problem
of propagation and composition. The issue of propagation
relies on good and robust motion estimation. Instead of op-
tical flow, we are more inclined to estimate flow on deep
features, similar to [43], which may neglect noise and small
appearance variations and hence lead to more stable mo-
tion estimation. This is crucial for generating stable stylized videos, since we desire the appearance of stylized video frames not to change due to such variations. The issue of com-
position is also considered in the feature domain instead of the pixel domain, since it can further avoid seam artifacts.

Figure 3. System overview.
To further obtain the consistency over long periods of
time, we seek a new architecture to propagate short-term
consistency to long-term. The pipeline is shown in Figure 3.
At t-1, we obtain the composite feature maps F^o_{t-1}, which are constrained by two-frame consistency. At t, we reuse F^o_{t-1} for propagation and composition. By doing so, we expect all traceable points to be propagated as far as possible in the entire video. Once the points are occluded or the tracking gets lost, the composite features will keep values in-
dependently computed at the current frame. In this way, our
network only needs to consider two frames every time, but
still approaches long-term consistency.
3.2. Network Architecture
In this section, we explain the details of our proposed
end-to-end network for video style transfer. Given the input video sequence {I_t | t = 1...n}, the task is to obtain the stylized video sequence {O_t | t = 1...n}. The overall system pipeline is shown in Figure 3. At the first frame I_1, it uses an existing stylization network (e.g., [23]), denoted as Net_0, to produce the stylized result. Meanwhile, it also generates the encoded features F_1 as the input of our proposed network Net_1 at the second frame I_2. The process is iterated over the entire video sequence. Starting from the second frame I_2, we use Net_1 rather than Net_0 for style transfer.
The proposed network structure Net_1 incorporating two-frame temporal coherence is presented in Figure 4. It
consists of three main components: the style sub-network,
the flow sub-network, and the mask sub-network.
Style Sub-network. We adopt the pre-trained image style
transfer network of Johnson et al. [23] as our default style
sub-network, since it is often adopted as the basic network
structure for many follow-up works (e.g., [11, 9]). This kind of network resembles an auto-encoder architecture, with strided convolution layers as the encoder and fractionally strided convolution layers as the decoder. Such
architectures allow us to insert the flow sub-network and the
mask sub-network between the encoder and the decoder. In
Section 4.4, we provide the detailed analysis on which layer
is better for the integration of our sub-networks.
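To make the encoder/decoder split concrete, below is a minimal sketch of a Johnson-style transform network divided into an encoder and a decoder, assuming PyTorch; the layer widths, kernel sizes, use of instance normalization, and the placement of the residual blocks in the decoder are illustrative assumptions rather than the authors' released architecture, and where the split is made is analyzed in Section 4.4.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard residual block used in Johnson-style transform networks."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)

class StyleEncoder(nn.Module):
    """Front half: strided convolutions mapping an RGB frame to feature maps."""
    def __init__(self, base=32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, base, 9, stride=1, padding=4), nn.InstanceNorm2d(base), nn.ReLU(),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.InstanceNorm2d(base * 2), nn.ReLU(),
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.InstanceNorm2d(base * 4), nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)

class StyleDecoder(nn.Module):
    """Back half: residual blocks, then fractionally strided (transposed)
    convolutions mapping features back to a stylized RGB frame."""
    def __init__(self, base=32, n_res=5):
        super().__init__()
        self.layers = nn.Sequential(
            *[ResidualBlock(base * 4) for _ in range(n_res)],
            nn.ConvTranspose2d(base * 4, base * 2, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(base * 2), nn.ReLU(),
            nn.ConvTranspose2d(base * 2, base, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(base), nn.ReLU(),
            nn.Conv2d(base, 3, 9, stride=1, padding=4),
        )

    def forward(self, f):
        return self.layers(f)
```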

Figure 4. Our network architecture consists of three main components: the pretrained style sub-network, which is split into two parts: an
encoder and a decoder; the flow sub-network to predict intermediate feature flow; and the mask sub-network to regress the composition
mask.
Flow Sub-network. As a component for temporal coherence, the flow sub-network is designed to estimate the correspondences between two consecutive frames I_{t-1} and I_t, and then warp the convolutional features. We adopt FlowNet (the "Simple" version) [15] as our flow sub-network by default. It is pre-trained on the synthetic Flying Chairs dataset [15] for the optical flow task, and should be fine-tuned to produce feature flow suitable for our task.
The process is similar to [43], which uses it for video recognition. Two consecutive frames I_{t-1}, I_t are first encoded into feature maps F_{t-1}, F_t respectively by the encoder. W_t is the feature flow generated by the flow sub-network and bilinearly resized to the same spatial resolution as F_{t-1}. As the values of W_t are in general fractional, we warp F_{t-1} to F'_t via bilinear interpolation:

F'_t = W_{t←t-1}(F_{t-1}),    (1)

where W_{t←t-1}(·) denotes the function that warps features from t-1 to t using the estimated flow field W_t, namely F'_t(p) = F_{t-1}(p + W_t(p)), where p denotes a spatial location in the feature map and flow.
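As a concrete illustration of Eq. (1), here is a minimal sketch of warping a feature map with a flow field via bilinear interpolation, assuming PyTorch; it is an assumed implementation of the warping operator W_{t←t-1}(·), not the authors' code.

```python
import torch
import torch.nn.functional as F

def warp_features(feat_prev, flow):
    """Warp F_{t-1} to F'_t with a flow field, i.e. F'_t(p) = F_{t-1}(p + W_t(p)).

    feat_prev: (N, C, H, W) feature map from frame t-1.
    flow:      (N, 2, H, W) feature flow W_t in units of the feature grid,
               already resized to the feature resolution.
    """
    n, _, h, w = feat_prev.shape
    # Base sampling grid of absolute (x, y) coordinates for every location p.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=feat_prev.dtype, device=feat_prev.device),
        torch.arange(w, dtype=feat_prev.dtype, device=feat_prev.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(n, -1, -1, -1)  # (N, 2, H, W)
    coords = base + flow  # sample F_{t-1} at p + W_t(p)

    # grid_sample expects normalized coordinates in [-1, 1], shaped (N, H, W, 2).
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)

    return F.grid_sample(feat_prev, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```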
Mask Sub-network. Given the warped feature F'_t and the original feature F_t, the mask sub-network is employed to regress the composition mask M, which is then adopted to compose the two features F'_t and F_t. The value of M varies from 0 to 1. For points/regions traceable by the flow (e.g., static background), the value in the mask M tends to be 1. It suggests that the warped feature F'_t should be reused so as to keep coherence. On the contrary, at occlusion or false flow points/regions, the value in the mask M is 0, which suggests F_t should be adopted. The mask sub-network architecture consists of three convolutional layers with stride one. Its input is the absolute difference of the two feature maps

ΔF_t = |F'_t - F_t|,    (2)

and the output is a single-channel mask M, which means all feature channels share the same mask in the later composition. Here, we obtain the composite features F^o_t by linear combination of F_t and F'_t:

F^o_t = (1 - M) ⊙ F_t + M ⊙ F'_t,    (3)

where ⊙ represents element-wise multiplication.
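The following is a minimal sketch of the mask sub-network and the composition in Eqs. (2) and (3), assuming PyTorch; the three stride-one convolutions follow the text, while the hidden width, kernel sizes, and the sigmoid used to keep M in [0, 1] are assumptions.

```python
import torch
import torch.nn as nn

class MaskSubNetwork(nn.Module):
    """Regress a single-channel composition mask M in [0, 1] from the
    absolute feature difference |F'_t - F_t| (Eq. 2). Three stride-1
    convolutions as stated in the text; widths/kernels are illustrative."""
    def __init__(self, in_channels, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, stride=1, padding=1), nn.Sigmoid(),
        )

    def forward(self, feat_warped, feat_cur):
        delta = torch.abs(feat_warped - feat_cur)   # Eq. (2)
        return self.net(delta)                      # (N, 1, H, W) mask M

def compose_features(feat_warped, feat_cur, mask):
    """Eq. (3): F^o_t = (1 - M) * F_t + M * F'_t (mask broadcast over channels)."""
    return (1.0 - mask) * feat_cur + mask * feat_warped
```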
Summary of Net_1. Figure 4 summarizes our network Net_1 designed for two frames. Given two input frames I_{t-1}, I_t, they are fed into the encoder of the fixed style sub-network, generating convolutional feature maps F_{t-1}, F_t. This first step is different in inference, where F_{t-1} is not computed from I_{t-1}, but instead borrowed from the composite features F^o_{t-1} obtained at t-1. This is illustrated by the dotted lines in Figure 4. On the other branch, both frames I_{t-1}, I_t are fed into the flow sub-network to compute the feature flow W_t, which warps the features F_{t-1} (F^o_{t-1} is used in inference instead) to F'_t. Next, the difference ΔF_t between F'_t and F_t is fed into the mask sub-network, generating the mask M. New features F^o_t are obtained by a linear combination of F'_t and F_t weighted by the mask M. Finally, F^o_t is fed into the decoder of the style sub-network, generating the stylized result O_t at frame t. For inference, F^o_t is also the output passed to the next frame t+1. Since both the flow and mask sub-networks learn the relative flow W_t and mask M_t between any two frames, it is not necessary for our training to incorporate historic information (e.g., F^o_{t-1}) as the inference does. This keeps our training simple.
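Putting the pieces together, here is a minimal sketch of the two-frame forward pass and the frame-by-frame inference loop described above, reusing the hypothetical `StyleEncoder`/`StyleDecoder`, `MaskSubNetwork`, `warp_features`, and `compose_features` sketched earlier and assuming some flow sub-network `flow_net` that returns a flow field at the feature resolution; it illustrates the described data flow rather than the authors' implementation.

```python
import torch.nn as nn

class VideoStyleNet(nn.Module):
    """Two-frame network Net_1: encode, estimate feature flow, warp,
    mask, compose, and decode, following the data flow in the text."""
    def __init__(self, encoder, decoder, flow_net, mask_net):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.flow_net, self.mask_net = flow_net, mask_net

    def forward(self, frame_prev, frame_cur, feat_prev=None):
        # During training feat_prev is None and F_{t-1} is encoded from I_{t-1};
        # during inference the composite features F^o_{t-1} are passed in instead.
        if feat_prev is None:
            feat_prev = self.encoder(frame_prev)
        feat_cur = self.encoder(frame_cur)

        flow = self.flow_net(frame_prev, frame_cur)                # feature flow W_t
        feat_warped = warp_features(feat_prev, flow)               # F'_t   (Eq. 1)
        mask = self.mask_net(feat_warped, feat_cur)                # M      (Eq. 2)
        feat_out = compose_features(feat_warped, feat_cur, mask)   # F^o_t  (Eq. 3)

        return self.decoder(feat_out), feat_out                    # O_t and features for t+1

def stylize_video(encoder, decoder, net1, frames):
    """Frame-by-frame inference: the plain stylization network (Net_0) handles
    the first frame, then Net_1 reuses the composite features for propagation."""
    feat = encoder(frames[0])
    outputs = [decoder(feat)]
    for prev, cur in zip(frames[:-1], frames[1:]):
        out, feat = net1(prev, cur, feat_prev=feat)
        outputs.append(out)
    return outputs
```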
3.3. The Loss Function
To train both the flow and mask sub-networks, we define the loss function by enforcing three terms: the coherence term L_cohe, the occlusion term L_occ, and the flow term L_flow. The coherence term L_cohe penalizes the inconsistencies between stylized results of two consecutive frames:

L_cohe(O_t, S_{t-1}) = M_g ⊙ ||O_t - W_{t←t-1}(S_{t-1})||^2,    (4)

where S_{t-1} is the stylized result produced independently at t-1. The warping function W_{t←t-1}(·) uses the ground-truth
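As a rough illustration of Eq. (4), the snippet below computes a masked squared difference between O_t and the warped previous stylization, assuming PyTorch tensors; the warped result W_{t←t-1}(S_{t-1}) and the ground-truth traceability mask M_g are assumed to be precomputed from the training data, and the normalization over traceable pixels is an added assumption.

```python
import torch

def coherence_loss(out_cur, styled_prev_warped, mask_gt):
    """Eq. (4): penalize differences between O_t and the warped independent
    stylization of frame t-1, only where the ground-truth mask M_g marks
    pixels as traceable (non-occluded).

    out_cur:            (N, 3, H, W) stylized output O_t
    styled_prev_warped: (N, 3, H, W) W_{t<-t-1}(S_{t-1}), warped with ground-truth flow
    mask_gt:            (N, 1, H, W) ground-truth traceability mask M_g in {0, 1}
    """
    diff = (out_cur - styled_prev_warped) ** 2
    return (mask_gt * diff).sum() / (mask_gt.sum() * out_cur.shape[1] + 1e-8)
```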
