Coherent Online Video Style Transfer
Dongdong Chen¹, Jing Liao², Lu Yuan², Nenghai Yu¹, and Gang Hua²

¹University of Science and Technology of China, cd722522@mail.ustc.edu.cn, ynh@ustc.edu.cn
²Microsoft Research Asia, {jliao,luyuan,ganghua}@microsoft.com
Abstract

Training a feed-forward network for fast neural style transfer of images has proven successful. However, the naive extension to processing video frame by frame is prone to producing flickering results. We propose the first end-to-end network for online video style transfer, which generates temporally coherent stylized video sequences in near real-time. Two key ideas include an efficient network incorporating short-term coherence, and propagating short-term coherence to long-term, which ensures consistency over a longer period of time. Our network can incorporate different image stylization networks. We show that the proposed method clearly outperforms the per-frame baseline both qualitatively and quantitatively. Moreover, it can achieve visually comparable coherence to optimization-based video style transfer, but is three orders of magnitude faster in runtime.
1. Introduction
Inspired by the success of work from Gatys et al. [16] on
neural style transfer, there has been a surge of recent works
[36, 27, 8, 17] addressing the problem of style transfer using
deep neural networks.
In their approaches, style transfer is formulated as an op-
timization problem, i.e., starting from white noise and searching for a new image that presents neural activations similar to the content image and feature correlations similar to the style image. Notwithstanding their impressive results, these
methods are very slow in runtime due to the heavy iter-
ative optimization process. To mitigate this issue, many
works have sought to speed up the transfer by training feed-
forward networks [23, 38, 28, 9, 11, 29]. Such techniques
have been successfully applied to a number of popular apps
such as Prisma, Pikazo, DeepArt, etc.
This work was done when Dongdong Chen was an intern at MSR Asia.
Extending neural style transfer from image to video may produce new and impressive effects, whose appeal is especially strong in short video sharing, live-view effects, and movie entertainment. The approaches discussed above,
when naively extended to process each frame of the video
one-by-one, often lead to flickering and false discontinu-
ities. This is because the solution of the style transfer task
is not stable. For optimization-based methods (e.g., [16]),
the instability stems from the random initialization and lo-
cal minima of the style loss function. For those methods based on feed-forward networks (e.g., [23]), a small perturbation in the content images, e.g., in lighting, noise, and motion, may cause large variations in the stylized results, as
shown in Figure 1. Consequently, it is essential to explore
temporal consistency in videos for stable outputs.
Anderson et al. [1] and Ruder et al. [35] address the flickering problem in optimization-based methods by introducing optical flow to constrain both the initialization and the loss function. Although very impressive and smooth stylized video sequences are obtained, their runtime is quite slow (usually several minutes per frame), making these methods less practical in real-world applications.
In search of a fast and yet stable solution to video style
transfer, we present the first feed-forward network lever-
aging temporal information for video style transfer, which
is able to produce consistent and stable stylized video se-
quences in near real-time. Our network architecture consists of a series of identical networks, each of which considers two-frame temporal coherence. The basic network incorpo-
rates two sub-networks, namely the flow sub-network and
the mask sub-network, into a certain intermediate layer of a
pre-trained stylization network (e.g., [23, 9]).
The flow sub-network, which is motivated by [43], es-
timates dense feature correspondences between consecu-
tive frames. It helps all consistent points along the mo-
tion trajectory be aligned in feature domain. The mask sub-
network identifies the occlusion or motion discontinuity re-
1
arXiv:1703.09211v2 [cs.CV] 28 Mar 2017

(a) (b) (c) (d) (e)
Figure 1. The image stylization network (e.g., [23]), will amplify image variances caused by some unnoticeable changes in inputs. Upper
row shows the four inputs: (a) the original one, (b) 5% lighter than (a), (c) Gaussian noises (µ = 0, σ = 1e 4) added to (a); and (d)
the next frame of (a) with subtle motions. The middle rows show the absolute difference between (a) and other three inputs. For better
visualization, these differences are boosted by 3×. The bottom row shows the corresponding stylization results. (e) shows close-up views
of some flickering regions.
gions. It helps adaptively blend feature maps from previ-
ous frames and the current frame to avoid ghosting artifacts.
The entire architecture is trained end-to-end, and minimizes
a new loss function, jointly considering stylization and tem-
poral coherence.
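To make this sensitivity concrete, the following is a minimal sketch of the perturbation check illustrated in Figure 1, written in Python with NumPy; the function `stylize` stands for any hypothetical feed-forward per-frame stylization model (it is not part of the paper's code), and the perturbation magnitudes follow the caption above.

```python
import numpy as np

def perturbation_check(stylize, frame, next_frame):
    """Compare stylized outputs under the small input perturbations of Figure 1.

    `stylize` is a hypothetical feed-forward per-frame stylization function
    mapping an HxWx3 float image in [0, 1] to a stylized HxWx3 image.
    """
    inputs = {
        "original": frame,
        "lighter": np.clip(frame * 1.05, 0.0, 1.0),  # (b) 5% lighter
        "noisy": np.clip(frame + np.random.normal(0.0, 1e-4, frame.shape), 0.0, 1.0),  # (c)
        "next_frame": next_frame,  # (d) subtle motion
    }
    outputs = {name: stylize(img) for name, img in inputs.items()}

    # Small input differences vs. large output differences reveal the instability.
    for name in ("lighter", "noisy", "next_frame"):
        in_diff = np.abs(inputs[name] - frame).mean()
        out_diff = np.abs(outputs[name] - outputs["original"]).mean()
        print(f"{name}: mean input diff {in_diff:.6f}, mean output diff {out_diff:.6f}")
```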
There are two kinds of temporal consistency in videos,
as mentioned in [35]: long-term consistency and short-term
consistency. Long-term consistency is more appealing since it produces stable results over longer periods of time, and can even enforce consistency of the synthesized frames before and after an occlusion. This constraint can be easily enforced in optimization-based methods [35]. Unfortunately, it is quite difficult to incorporate in feed-forward networks, due to limited batch size, computation time, and cache memory. Therefore, short-term consistency is more affordable for a feed-forward network in practice.
Our solution is therefore a compromise between consistency and efficiency. Our network is designed to mainly consider short-term relationships (only two frames), but long-term consistency is partially achieved by propagating the short-term ones. Our network directly leverages the composite features obtained from the previous frame, and combines them with features at the current frame for the propagation. In this way, when a point can be traced along motion trajectories, its feature can be propagated until the track ends.
This approximation may suffer from shifting errors in
propagation, and inconsistency before and after the occlu-
sion. Nevertheless, in practice, we do not observe obvious ghosting or flickering artifacts with our online method, and online processing is necessary in many real applications. In summary,
our proposed video style transfer network is unique in the
following aspects:
• Our network is the first network leveraging temporal information that is trained end-to-end for video style transfer, which successfully generates stable results.
• Our feed-forward network is thousands of times faster than optimization-based video style transfer [1, 35], reaching 15 fps on modern GPUs.
• Our method enables online processing, and is cheap in both learning and inference, since we achieve a good approximation of long-term temporal coherence by propagating the short-term one.
• Our network is general, and can be successfully applied to several existing image stylization networks, including per-style-per-network [23] or multiple-style-per-network [9].
2. Related Work
2.1. Style Transfer for Images and Videos
Traditional image stylization works mainly focus on texture synthesis based on low-level features, which use non-parametric sampling of pixels or patches in given source
texture images [13, 20, 12] or stroke databases [30, 19].
Their extension to video mostly uses optical flow to con-
strain the temporal coherence of sampling [4, 18, 31]. A
comprehensive survey can be found in [25].
Recently, with the development of deep learning, us-
ing neural networks for stylization has become an active topic.
Gatys et al. [16] first propose a method of using pre-trained
Deep Convolutional Neural Networks (CNNs) for image stylization. It generates more impressive results compared to traditional methods because CNNs provide more semantic representations of styles. To further improve the transfer
quality, different complementary schemes have been pro-
posed, including face constraints [36], Markov Random
Field (MRF) prior [27], user guidance [8] or controls [17].
Unfortunately, these methods based on an iterative opti-
mization are computationally expensive in run-time, which
imposes a significant limitation in real applications. To make the run-time more efficient, some works directly learn a feed-forward generative network for a specific style [23, 38, 28] or multiple styles [9, 11, 29], which are hundreds of times faster than optimization-based methods.
Another direction of neural style transfer [16] is to ex-
tend it to videos. A naive solution that independently processes each frame produces flickers and false discontinuities. To preserve temporal consistency, Anderson et al. [1]
use optical flow to initialize the style transfer optimization,
and incorporate flow explicitly into the loss function. To
further reduce ghosting artifacts at the boundaries and oc-
cluded regions, Ruder et al. [35] introduce masks to filter
out the flow with low confidence in the loss function. This makes it possible to generate consistent and stable stylized video sequences, even in cases with large motion and strong occlusions. Notwithstanding their demonstrated success in video style transfer, these methods are very slow due to the iterative optimiza-
tion. Feed-forward networks [23, 38, 28, 9, 11, 29] have
proven to be efficient in image style transfer. However, we
are not aware of any work that trains a feed-forward network
that explicitly takes temporal coherence into consideration
in video style transfer.
2.2. Temporal Coherence in Video Filter
Video style transfer can be viewed as applying one kind
of artistic filter on videos. How to preserve the temporal
coherence is essential and has been considered in previous
video filtering work. One popular solution is to temporally
smooth filter parameters. For instance, Bonneel et al. [2]
and Wang et al. [39] transfer the color grade of one video to
another by temporally filtering the color transfer functions.
Another solution is to extend the filter from 2D to 3D.
Paris et al. [32] extend the Gaussian kernel in bilateral fil-
tering and mean-shift clustering to the temporal domain for
many video applications. Lang et al. [26] also extend
the notion of smoothing to the temporal domain by exploit-
ing optical flow and revisit optimization-based techniques
such as motion estimation and colorization. These temporal
smoothing and 3D extension methods are specific to their
applications, and cannot generalize to other applications,
such as stylization.
A more general solution considering temporal coher-
ence is to incorporate a post-processing step which is blind
to filters. Dong et al. [10] segment each frame into sev-
eral regions and spatiotemporally adjust the enhancement
(produced by unknown image filters) of regions of differ-
ent frames; Bonneel et al. [3] filter videos along motion
paths using a temporal edge-preserving filter. Unfortu-
nately, these post-processing methods fracture texture patterns or introduce ghosting artifacts when applied to stylization results, due to their heavy reliance on accurate optical flow.
As for stylization, previous methods (including tradi-
tional ones [4, 18, 31, 42] and neural ones [1, 35]) rely on
optical flow to track motions and keep coherence in color
and texture patterns along the motion trajectories. Never-
theless, how to add flow constraints to feed-forward styliza-
tion networks has not been investigated before.
2.3. Flow Estimation
Optical flow is known as an essential component in many
video tasks. It has been studied for decades and numerous
approaches have been proposed [21, 5, 40, 6, 41, 34]. These methods are all hand-crafted, making them difficult to integrate into and jointly train within our end-to-end network.
Recently, deep learning has been explored for solving optical flow. FlowNet [15] is the first deep CNN designed to directly estimate optical flow, achieving good results.
Later, its successors focused on accelerating the flow esti-
mation [33], or achieving better quality [22]. Zhu et al. [43]
recently integrate FlowNet [15] into image recognition networks and train the network end-to-end for fast video recognition. Our work is inspired by their idea of applying FlowNet to existing networks. However, the stylization task, unlike the recognition one, requires some new factors to be considered in the network design, such as the loss function and feature composition.
3. Method
3.1. Motivation
When style transfer is applied to consecutive frames independently (e.g., [23]), subtle changes in appearance (e.g., lighting, noise, motion) result in strong
flickering, as shown in Figure 1. By contrast, in still-image
style transfer, such small changes in the content image, es-
pecially on flat regions, may be necessary to generate spa-
tially rich and varied stylized patterns, making the result
more impressive. Thus, how to keep such spatially rich and interesting texture patterns, while preserving the temporal consistency in videos, is worthy of a more careful study.

Figure 2. Visualization of two-frame temporal consistency. Two inputs I_{t-1}, I_t pass through the stylization network [23] to obtain feature maps F_{t-1}, F_t and stylized results S_{t-1}, S_t. We may notice discontinuities between S_{t-1} and S_t in the red and green rectangles. The third row shows the warped feature map F'_t and stylized result S'_t obtained through the flow W_t. We can see texture patterns (red rectangle) are successfully traced from t-1 to t, but ghosting occurs at occluded regions (green rectangle) of S'_t. The occlusion mask is shown as M. In these false regions, F'_t (also S'_t) is replaced with F_t (also S_t) to get the composite features F^o_t and result O_t.
For simplicity, we start by exploring temporal coherence
between two frames. Our intuition is to warp the stylized
result from the previous frame to the current one, and adap-
tively fuse both together. In other words, traceable points/regions from the previous frame remain unchanged, while untraceable points/regions use the new results computed at the current frame. Such an intuitive strategy kills two birds with one stone: 1) it makes sure stylized results along the motion paths are as stable as possible; 2) it avoids ghosting artifacts at occlusions or motion discontinuities. We illustrate the idea in Figure 2.
The strategy outlined above only preserves the short-
term consistency, which can be formulated as the problem
of propagation and composition. The issue of propagation
relies on good and robust motion estimation. Instead of op-
tical flow, we are more inclined to estimate flow on deep
features, similar to [43], which may neglect noise and small
appearance variations and hence lead to more stable mo-
tion estimation. This is crucial for generating stable stylized videos, since we desire the appearance of stylized video frames not to change due to such variations. The issue of com-
position is also considered in the feature domain instead of the pixel domain, since it can further avoid seam artifacts.

Figure 3. System overview.
To further obtain the consistency over long periods of
time, we seek a new architecture to propagate short-term
consistency to long-term. The pipeline is shown in Figure 3.
At t-1, we obtain the composite feature maps F^o_{t-1}, which are constrained by two-frame consistency. At t, we reuse F^o_{t-1} for propagation and composition. By doing so, we expect all traceable points to be propagated as far as possible in the entire video. Once the points are occluded or the tracking gets lost, the composite features will keep values in-
dependently computed at the current frame. In this way, our
network only needs to consider two frames every time, but
still approaches long-term consistency.
3.2. Network Architecture
In this section, we explain the details of our proposed
end-to-end network for video style transfer. Given the input video sequence {I_t | t = 1...n}, the task is to obtain the stylized video sequence {O_t | t = 1...n}. The overall system pipeline is shown in Figure 3. At the first frame I_1, it uses an existing stylization network (e.g., [23]), denoted as Net_0, to produce the stylized result. Meanwhile, it also generates the encoded features F_1 as the input of our proposed network Net_1 at the second frame I_2. The process is iterated over the entire video sequence. Starting from the second frame I_2, we use Net_1 rather than Net_0 for style transfer.
The proposed network structure Net_1 incorporating two-frame temporal coherence is presented in Figure 4. It
consists of three main components: the style sub-network,
the flow sub-network, and the mask sub-network.
Style Sub-network. We adopt the pre-trained image style
transfer network of Johnson et al. [23] as our default style
sub-network, since it is often adopted as the basic network
structure for many follow-up works (e.g., [11, 9]). This kind of network resembles an auto-encoder architecture, with strided convolution layers as the encoder and fractionally strided convolution layers as the decoder. Such
architectures allow us to insert the flow sub-network and the
mask sub-network between the encoder and the decoder. In
Section 4.4, we provide the detailed analysis on which layer
is better for the integration of our sub-networks.
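To make the encoder/decoder split concrete, below is a minimal sketch of a Johnson-style transform network divided into an encoder and a decoder, assuming PyTorch; the layer widths, kernel sizes, use of instance normalization, and the placement of the residual blocks in the decoder are illustrative assumptions rather than the authors' released architecture, and where the split is made is analyzed in Section 4.4.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard residual block used in Johnson-style transform networks."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)

class StyleEncoder(nn.Module):
    """Front half: strided convolutions mapping an RGB frame to feature maps."""
    def __init__(self, base=32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, base, 9, stride=1, padding=4), nn.InstanceNorm2d(base), nn.ReLU(),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.InstanceNorm2d(base * 2), nn.ReLU(),
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.InstanceNorm2d(base * 4), nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)

class StyleDecoder(nn.Module):
    """Back half: residual blocks, then fractionally strided (transposed)
    convolutions mapping features back to a stylized RGB frame."""
    def __init__(self, base=32, n_res=5):
        super().__init__()
        self.layers = nn.Sequential(
            *[ResidualBlock(base * 4) for _ in range(n_res)],
            nn.ConvTranspose2d(base * 4, base * 2, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(base * 2), nn.ReLU(),
            nn.ConvTranspose2d(base * 2, base, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(base), nn.ReLU(),
            nn.Conv2d(base, 3, 9, stride=1, padding=4),
        )

    def forward(self, f):
        return self.layers(f)
```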

Figure 4. Our network architecture consists of three main components: the pretrained style sub-network, which is split into two parts: an
encoder and a decoder; the flow sub-network to predict intermediate feature flow; and the mask sub-network to regress the composition
mask.
Flow Sub-network. As a component for temporal coherence, the flow sub-network is designed to estimate the correspondences between two consecutive frames I_{t-1} and I_t, and then warp the convolutional features. We adopt FlowNet (the "Simple" version) [15] as our flow sub-network by default. It is pre-trained on the synthetic Flying Chairs dataset [15] for the optical flow task, and should be fine-tuned to produce feature flow suitable for our task.
The process is similar to [43], which uses it for video recognition. Two consecutive frames I_{t-1}, I_t are first encoded into feature maps F_{t-1}, F_t respectively by the encoder. W_t is the feature flow generated by the flow sub-network and bilinearly resized to the same spatial resolution as F_{t-1}. As the values of W_t are in general fractional, we warp F_{t-1} to F'_t via bilinear interpolation:

F'_t = W_{t←t-1}(F_{t-1}),    (1)

where W_{t←t-1}(·) denotes the function that warps features from t-1 to t using the estimated flow field W_t, namely F'_t(p) = F_{t-1}(p + W_t(p)), where p denotes a spatial location in the feature map and flow.
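As a concrete illustration of Eq. (1), here is a minimal sketch of warping a feature map with a flow field via bilinear interpolation, assuming PyTorch; it is an assumed implementation of the warping operator W_{t←t-1}(·), not the authors' code.

```python
import torch
import torch.nn.functional as F

def warp_features(feat_prev, flow):
    """Warp F_{t-1} to F'_t with a flow field, i.e. F'_t(p) = F_{t-1}(p + W_t(p)).

    feat_prev: (N, C, H, W) feature map from frame t-1.
    flow:      (N, 2, H, W) feature flow W_t in units of the feature grid,
               already resized to the feature resolution.
    """
    n, _, h, w = feat_prev.shape
    # Base sampling grid of absolute (x, y) coordinates for every location p.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=feat_prev.dtype, device=feat_prev.device),
        torch.arange(w, dtype=feat_prev.dtype, device=feat_prev.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(n, -1, -1, -1)  # (N, 2, H, W)
    coords = base + flow  # sample F_{t-1} at p + W_t(p)

    # grid_sample expects normalized coordinates in [-1, 1], shaped (N, H, W, 2).
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)

    return F.grid_sample(feat_prev, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```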
Mask Sub-network. Given the warped feature F'_t and the original feature F_t, the mask sub-network is employed to regress the composition mask M, which is then adopted to compose the two features F'_t and F_t. The value of M varies from 0 to 1. For points/regions traceable by the flow (e.g., static background), the value in the mask M tends to be 1. It suggests that the warped feature F'_t should be reused so as to keep coherence. On the contrary, at occlusion or false flow points/regions, the value in the mask M is 0, which suggests F_t should be adopted. The mask sub-network architecture consists of three convolutional layers with stride one. Its input is the absolute difference of the two feature maps

ΔF_t = |F'_t - F_t|,    (2)

and the output is a single-channel mask M, which means all feature channels share the same mask in the later composition. Here, we obtain the composite features F^o_t by linear combination of F_t and F'_t:

F^o_t = (1 - M) ⊙ F_t + M ⊙ F'_t,    (3)

where ⊙ represents element-wise multiplication.
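The following is a minimal sketch of the mask sub-network and the composition in Eqs. (2) and (3), assuming PyTorch; the three stride-one convolutions follow the text, while the hidden width, kernel sizes, and the sigmoid used to keep M in [0, 1] are assumptions.

```python
import torch
import torch.nn as nn

class MaskSubNetwork(nn.Module):
    """Regress a single-channel composition mask M in [0, 1] from the
    absolute feature difference |F'_t - F_t| (Eq. 2). Three stride-1
    convolutions as stated in the text; widths/kernels are illustrative."""
    def __init__(self, in_channels, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, stride=1, padding=1), nn.Sigmoid(),
        )

    def forward(self, feat_warped, feat_cur):
        delta = torch.abs(feat_warped - feat_cur)   # Eq. (2)
        return self.net(delta)                      # (N, 1, H, W) mask M

def compose_features(feat_warped, feat_cur, mask):
    """Eq. (3): F^o_t = (1 - M) * F_t + M * F'_t (mask broadcast over channels)."""
    return (1.0 - mask) * feat_cur + mask * feat_warped
```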
Summary of Net_1. Figure 4 summarizes our network Net_1 designed for two frames. Given two input frames I_{t-1}, I_t, they are fed into the encoder of the fixed style sub-network, generating convolutional feature maps F_{t-1}, F_t. This first step is different in inference, where F_{t-1} is not computed from I_{t-1}, but instead borrowed from the composite features F^o_{t-1} obtained at t-1. This is illustrated by the dotted lines in Figure 4. On the other branch, both frames I_{t-1}, I_t are fed into the flow sub-network to compute the feature flow W_t, which warps the features F_{t-1} (F^o_{t-1} is used in inference instead) to F'_t. Next, the difference ΔF_t between F'_t and F_t is fed into the mask sub-network, generating the mask M. New features F^o_t are obtained by a linear combination of F'_t and F_t weighted by the mask M. Finally, F^o_t is fed into the decoder of the style sub-network, generating the stylized result O_t at frame t. For inference, F^o_t is also the output passed to the next frame t+1. Since both the flow and mask sub-networks learn the relative flow W_t and mask M_t between any two frames, it is not necessary for our training to incorporate historic information (e.g., F^o_{t-1}) as the inference does. This keeps our training simple.
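Putting the pieces together, here is a minimal sketch of the two-frame forward pass and the frame-by-frame inference loop described above, reusing the hypothetical `StyleEncoder`/`StyleDecoder`, `MaskSubNetwork`, `warp_features`, and `compose_features` sketched earlier and assuming some flow sub-network `flow_net` that returns a flow field at the feature resolution; it illustrates the described data flow rather than the authors' implementation.

```python
import torch.nn as nn

class VideoStyleNet(nn.Module):
    """Two-frame network Net_1: encode, estimate feature flow, warp,
    mask, compose, and decode, following the data flow in the text."""
    def __init__(self, encoder, decoder, flow_net, mask_net):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.flow_net, self.mask_net = flow_net, mask_net

    def forward(self, frame_prev, frame_cur, feat_prev=None):
        # During training feat_prev is None and F_{t-1} is encoded from I_{t-1};
        # during inference the composite features F^o_{t-1} are passed in instead.
        if feat_prev is None:
            feat_prev = self.encoder(frame_prev)
        feat_cur = self.encoder(frame_cur)

        flow = self.flow_net(frame_prev, frame_cur)                # feature flow W_t
        feat_warped = warp_features(feat_prev, flow)               # F'_t   (Eq. 1)
        mask = self.mask_net(feat_warped, feat_cur)                # M      (Eq. 2)
        feat_out = compose_features(feat_warped, feat_cur, mask)   # F^o_t  (Eq. 3)

        return self.decoder(feat_out), feat_out                    # O_t and features for t+1

def stylize_video(encoder, decoder, net1, frames):
    """Frame-by-frame inference: the plain stylization network (Net_0) handles
    the first frame, then Net_1 reuses the composite features for propagation."""
    feat = encoder(frames[0])
    outputs = [decoder(feat)]
    for prev, cur in zip(frames[:-1], frames[1:]):
        out, feat = net1(prev, cur, feat_prev=feat)
        outputs.append(out)
    return outputs
```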
3.3. The Loss Function
To train both the flow and mask sub-networks, we define the loss function by enforcing three terms: the coherence term L_cohe, the occlusion term L_occ, and the flow term L_flow. The coherence term L_cohe penalizes the inconsistencies between stylized results of two consecutive frames:

L_cohe(O_t, S_{t-1}) = M_g ⊙ ||O_t - W_{t←t-1}(S_{t-1})||^2,    (4)

where S_{t-1} is the stylized result produced independently at t-1. The warping function W_{t←t-1}(·) uses the ground-truth
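As a rough illustration of Eq. (4), the snippet below computes a masked squared difference between O_t and the warped previous stylization, assuming PyTorch tensors; the warped result W_{t←t-1}(S_{t-1}) and the ground-truth traceability mask M_g are assumed to be precomputed from the training data, and the normalization over traceable pixels is an added assumption.

```python
import torch

def coherence_loss(out_cur, styled_prev_warped, mask_gt):
    """Eq. (4): penalize differences between O_t and the warped independent
    stylization of frame t-1, only where the ground-truth mask M_g marks
    pixels as traceable (non-occluded).

    out_cur:            (N, 3, H, W) stylized output O_t
    styled_prev_warped: (N, 3, H, W) W_{t<-t-1}(S_{t-1}), warped with ground-truth flow
    mask_gt:            (N, 1, H, W) ground-truth traceability mask M_g in {0, 1}
    """
    diff = (out_cur - styled_prev_warped) ** 2
    return (mask_gt * diff).sum() / (mask_gt.sum() * out_cur.shape[1] + 1e-8)
```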
