
Learning Video Object Segmentation from Static Images

TLDR
In this paper, the authors use a combination of offline and online learning strategies, where the former produces a refined mask from the previous frame's estimate and the latter captures the appearance of the specific object instance.


Learning Video Object Segmentation from Static Images
Federico Perazzi 1,2*   Anna Khoreva 3*   Rodrigo Benenson 3   Bernt Schiele 3   Alexander Sorkine-Hornung 1
1 Disney Research   2 ETH Zurich   3 Max Planck Institute for Informatics, Saarbrücken, Germany
Abstract
Inspired by recent advances of deep learning in instance segmentation and object tracking, we introduce the concept of convnet-based guidance applied to video object segmentation. Our model proceeds on a per-frame basis, guided by the output of the previous frame towards the object of interest in the next frame. We demonstrate that highly accurate object segmentation in videos can be enabled by using a convolutional neural network (convnet) trained with static images only. The key component of our approach is a combination of offline and online learning strategies, where the former produces a refined mask from the previous frame's estimate and the latter allows capturing the appearance of the specific object instance. Our method can handle different types of input annotations, such as bounding boxes and segments, while leveraging an arbitrary amount of annotated frames. Therefore our system is suitable for diverse applications with different requirements in terms of accuracy and efficiency. In our extensive evaluation, we obtain competitive results on three different datasets, independently of the type of input annotation.
1. Introduction
Convolutional neural networks (convnets) have shown outstanding performance in many fundamental areas of computer vision, enabled by the availability of large-scale annotated datasets (e.g., ImageNet classification [24, 43]). However, some important challenges in video processing can be difficult to approach using convnets, since creating a sufficiently large body of densely, pixel-wise annotated video data for training is usually prohibitive.
One example of such a domain is video object segmentation. Given only one or a few frames annotated with segmentation masks of a particular object instance, the task of video object segmentation is to accurately segment the same instance in all other frames of the video. Current top-performing approaches either interleave box tracking and segmentation [53], or propagate the first frame mask annotation in space-time via CRF or GrabCut-like techniques [29, 49].

* The first two authors contributed equally.

Figure 1: Given a rough mask estimate from the previous frame t-1, we train a convnet to provide a refined mask output for the current frame t. The MaskTrack ConvNet takes the input frame t and the mask estimate from frame t-1 and produces the refined mask for frame t.
One of the key insights and contributions of this paper is that fully annotated video data is not necessary. We demonstrate that highly accurate video object segmentation can be enabled using a convnet trained with static images only. We show that a convnet designed for semantic image segmentation [8] can be utilized to perform per-frame instance segmentation, i.e., segmentation of generic objects while distinguishing different instances of the same class. For each new video frame the network is guided towards the object of interest by feeding in the previous frame's mask estimate. We therefore refer to our approach as guided instance segmentation. To the best of our knowledge, it represents the first fully trained approach to video object segmentation.
Our system is efficient due to its feed-forward architecture and can generate high quality results in a single pass over the video, without the need for considering more than one frame at a time. This is in stark contrast to many other video segmentation approaches, which usually require global connections over multiple frames or even the whole video sequence in order to achieve coherent results. Furthermore, our method can handle different types of annotations, and even simple bounding boxes as input are sufficient to obtain competitive results, making our method flexible with respect to various practical applications with different requirements in terms of human supervision.
Key to the video segmentation quality of our approach is the combination of offline and online learning strategies. In the offline phase, we use deformation and coarsening on the image masks in order to train the network to produce accurate output masks from their rough estimates. An online training phase extends ideas from previous works on object tracking [12, 32] to the task of video segmentation and enables the method to be easily optimized with respect to an object of interest in a novel input video.

The result is a single, homogeneous system that compares favourably to most classical approaches on three extremely heterogeneous video segmentation benchmarks, despite using the same model and parameters across all videos. We provide a detailed ablation study and explore the impact of varying the number and types of annotations. Moreover, we discuss extensions of the proposed model that allow us to improve the quality even further.
2. Related Work
The idea of performing video object segmentation via tracking at the pixel level is at least a decade old [40]. Recent approaches interweave box tracking with box-driven segmentation (e.g. TRS [53]), or propagate the first frame segmentation via graph labeling approaches.
Local propagation. JOTS [52] builds a graph over neighboring frames connecting superpixels and (generic) object parts to solve the video labeling task. ObjFlow [49] builds a graph over pixels and superpixels, uses convnet-based appearance terms, and interleaves labeling with optical flow estimation. Instead of using superpixels or proposals, BVS [29] formulates a fully-connected pixel-level graph between frames and efficiently infers the labeling over the vertices of a spatio-temporal bilateral grid [7]. Because these methods propagate information only across neighboring frames, they have difficulties capturing long range relationships and ensuring globally consistent segmentation.
Global propagation. In order to overcome these limitations, some methods have proposed to use long-range connections between video frames [15, 25, 48, 55]. In particular, we compare to FCP [35], Z15 [56], NLC [15] and W16 [50], which build a global graph structure over object proposal segments and then infer a consistent segmentation. A limitation of methods utilizing long-range connections is that they have to operate on larger image regions such as superpixels or object proposals for acceptable speed and memory usage, compromising their ability to handle fine details.
Unsupervised segmentation. Another family of works performs moving object segmentation (over all parts of the image) and selects post-hoc the space-time tube that best matches the annotation [18, 26, 33, 53]. In contrast, our approach side-steps the use of any intermediate tracked boxes, superpixels or object proposals and proceeds on a per-frame basis, therefore efficiently handling even long sequences at full detail. We focus on propagating the first frame segmentation forward onto future frames, using an online fine-tuned convnet as appearance model for segmenting the object of interest in the next frames.
Box tracking. Some previous works have investigated approaches that improve segmentation quality by leveraging object tracking and vice versa [10, 13, 17, 40, 53]. More recent, state-of-the-art tracking methods are based on discriminative correlation filters over handcrafted features (e.g. HOG) and over frozen deep learned features [11, 12], or are convnet-based trackers in their own right [20, 32]. Our approach is most closely related to the latter group. GOTURN [20] proposes to train a convnet offline so as to directly regress the bounding box in the current frame based on the object position and appearance in the previous frame. MDNet [32] proposes to use online fine-tuning of a convnet to model the object appearance. Our training strategy is inspired by GOTURN for the offline part, and by MDNet for the online stage. Compared to the aforementioned methods, our approach operates on pixel-level masks instead of boxes. Differently from MDNet, we do not replace the domain-specific layers, instead fine-tuning all the layers on the available annotations for each individual video sequence.
Instance segmentation. At each frame, video object segmentation outputs a single instance segmentation. Given an estimate of the object location and size, bottom-up segment proposals [38] or GrabCut [42] variants can be used as shape guesses. Specific convnet architectures have also been proposed for instance segmentation [19, 36, 37, 54]. Our approach outputs per-frame instance segmentations using a convnet architecture, inspired by works from other domains like [6, 44, 54]. A concurrent work [5] also exploits convnets for video object segmentation. Differently from our approach, their segmentation is not guided, and therefore it cannot distinguish multiple instances of the same object.
Interactive video segmentation. Applications such as video editing for movie production often require a level of accuracy beyond the current state-of-the-art. Thus several works have also considered video segmentation with variable annotation effort, enabling human interaction using clicks [22, 47, 51] or strokes [1, 16, 57]. In this work we consider instead box or segment annotations on multiple frames. In § 5 we report results when varying the amount of annotation effort, from one frame per video to all frames.
3. Method
We approach the video object segmentation problem from a different perspective, which we refer to as convnet-based guided instance segmentation. For each new frame we wish to label pixels as object/non-object of interest; for this we build upon the architecture of an existing pixel labelling convnet and train it to generate per-frame instance segments. We pick DeepLabv2 [8], but our approach is agnostic of the specific architecture selected. The challenge is then: how to inform the network which instance to segment? We solve this by using two complementary strategies. First, we guide the network towards the instance of interest by feeding in the previous frame's mask estimate during offline training (§ 3.1). Second, we employ online training to fine-tune the model to incorporate specific knowledge of the object instance (§ 3.2).
3.1. Offline Training
In order to guide the pixel labeling network to segment
the object of interest, we begin by expanding the convnet
input from RGB to RGB+mask channels. The extra mask
channel is meant to provide an estimate of the visible area
of the object in the current frame, its approximate location
and shape. We can then train the labelling convnet to output
an accurate segmentation of the object, given as input the
current image and a rough estimate of the object mask. Our
tracking network is de facto a "mask refinement" network.
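To make the guidance concrete, the following minimal sketch (ours, not code from the paper; it assumes a generic pixel-labelling backbone whose first convolution accepts four input channels, standing in for the actual DeepLabv2 model) shows how the rough mask is simply appended to the RGB input:

```python
import torch
import torch.nn as nn

class GuidedSegmentationNet(nn.Module):
    """Sketch of the guided instance segmentation input: RGB + rough mask."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        # `backbone` is assumed to be any pixel-labelling convnet whose first
        # convolution takes 4 input channels (RGB + mask), e.g. a modified DeepLab.
        self.backbone = backbone

    def forward(self, rgb: torch.Tensor, rough_mask: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, H, W); rough_mask: (B, 1, H, W), values in [0, 1]
        x = torch.cat([rgb, rough_mask], dim=1)  # (B, 4, H, W) guided input
        return self.backbone(x)                  # per-pixel foreground scores
```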
There are two key observations that make this approach practical. First, very rough input masks are enough for our trained network to provide sensible output segments. Even a large bounding box as input will result in a reasonable output (see § 5.2). The main role of the input mask is to point the convnet towards the correct object instance to segment. Second, this particular approach does not require us to use video as training data, such as done in [3, 5, 20, 32]. Because we only use a mask as additional input, instead of an image crop as in [3, 20], we can synthesize training samples from single frame instance segmentation annotations. This allows us to train from a large set of diverse images, instead of having to rely on scarce video annotations.
Figure 1 shows our simplified model. To simulate the noise of the previous frame output, during offline training we generate input masks by deforming the annotations via affine transformation as well as non-rigid deformations via thin-plate splines [4], followed by a coarsening step (dilation morphological operation) to remove details of the object contour. We apply this data generation procedure over a dataset of 10^4 images containing diverse object instances, see examples in the supplementary material. At test time, given the mask estimate at time t-1, we apply the dilation operation and use the resulting rough mask as input for object segmentation in frame t.
The affine transformations and non-rigid deformations aim at modelling the expected motion of an object between two frames. The coarsening permits us to generate training samples that resemble the test time data, simulating the blobby shape of the output mask given by the convnet from the previous frame. These two ingredients make the estimation more robust to noisy segmentation estimates while helping to avoid accumulation of errors from the preceding frames. The trained convnet has learnt to do guided instance segmentation similar to networks like SharpMask [37], DeepMask [36] and Hypercolumns [19], but instead of taking a bounding box as guidance, we can use an arbitrary input mask. The training details are described in § 4.
When using offline training only, the segmentation procedure consists of two steps: the previous frame mask is coarsened and then fed into the trained network to estimate the current frame mask. Since objects have a tendency to move smoothly through space, the object mask in the preceding frame provides a good guess in the current frame, and simply copying the coarse mask from the previous frame is enough. This approach is fast and already provides good results. We also experimented with using optical flow to propagate the mask from one frame to the next, but found the optical flow errors to offset the gains.
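This per-frame propagation with the offline-trained model can be summarized by the following sketch (ours, assuming a `refine` callable that wraps the trained convnet with its pre- and post-processing; the 5-pixel dilation radius matches the coarsening value reported in § 4):

```python
import cv2
import numpy as np

def propagate_masks(frames, first_mask, refine, dilation_radius=5):
    """Per-frame mask propagation loop (sketch).

    frames: list of HxWx3 uint8 images; first_mask: HxW binary mask for frame 0;
    refine: callable (frame, rough_mask) -> refined binary mask, assumed to wrap
    the offline-trained convnet.
    """
    size = 2 * dilation_radius + 1
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (size, size))
    masks = [first_mask.astype(np.uint8)]
    for frame in frames[1:]:
        # Coarsen the previous estimate, then let the convnet refine it.
        rough = cv2.dilate(masks[-1], kernel)
        masks.append(refine(frame, rough).astype(np.uint8))
    return masks
```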
With only the offline trained network, the proposed approach allows us to achieve competitive performance compared to previously reported results (see § 5.2). However, the performance can be further improved by integrating an online training strategy, as described in the next section.
3.2. Online Training
To further boost the video segmentation quality, we borrow and extend ideas that were originally proposed for object tracking. Current top performing tracking techniques [12, 32] use some form of online training. We thus consider improving results by adding online fine-tuning as a second strategy.

The idea is to use, at test time, the segment annotation of the first video frame as additional training data. Using augmented versions of this single frame annotation, we proceed to fine-tune the model to become more specialized for the specific object instance at hand. We use a similar data augmentation as for offline training. On top of affine and non-rigid deformations for the input mask, we also add image flipping and rotations. We generate 10^3 training samples from this single annotation, and proceed to fine-tune the model previously trained offline.
With online fine-tuning, the network weights partially capture the appearance of the specific object being tracked. The model aims to strike a balance between general instance segmentation (so as to generalize to the object changes) and specific instance segmentation (so as to leverage the common appearance across video frames). The details of the online fine-tuning are provided in § 4. In our experiments we only perform fine-tuning using the annotated frame(s).

Figure 2: Examples of optical flow magnitude images. Top: RGB images. Bottom: corresponding motion magnitude estimates encoded as gray-scale images.
To the best of our knowledge, our approach is the first to use a pixel labelling network (like DeepLabv2 [8]) for the task of video object segmentation. We name our full approach, using both offline and online training, MaskTrack.
3.3. Variants
Additionally, we consider variations of the proposed model. First, we demonstrate that our approach is flexible and can handle different types of input annotations, using less supervision in the first frame annotation. Second, we describe how motion information can be easily integrated into the system, improving the quality of the object segments.
Box annotation. In this paragraph, we discuss a variant named MaskTrack-Box, which takes a bounding box annotation in the first frame as input supervision instead of a segmentation mask. To this end, we train a similar convnet that, when fed with a bounding-box annotation as input, outputs a segment. Once the first frame bounding box is converted to a segment, we switch back to the MaskTrack model that uses as guidance the output mask from the previous frame.
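A minimal sketch of this hand-over (ours; `box_net` is a hypothetical wrapper around the box-trained convnet, and the filled rectangle simply plays the role of the rough guidance channel):

```python
import numpy as np

def first_frame_segment_from_box(frame, box, box_net):
    """Convert a first-frame box annotation into an initial segment (sketch).

    box: (x0, y0, x1, y1) in pixel coordinates; box_net: a callable
    (frame, rough_mask) -> segment, assumed to be the convnet trained with
    box-shaped guidance as described above. The returned segment is then used
    to start the regular MaskTrack propagation.
    """
    h, w = frame.shape[:2]
    x0, y0, x1, y1 = (int(v) for v in box)
    rough = np.zeros((h, w), dtype=np.uint8)
    rough[y0:y1, x0:x1] = 1          # rasterize the box as a rough mask
    return box_net(frame, rough)
```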
Optical flow. On top of MaskTrack, we consider employing optical flow as a source of additional information to guide the segmentation. Given a video sequence, we compute the optical flow using EpicFlow [41] with Flow Fields matches [2] and convolutional boundaries [30]. In parallel to the vanilla MaskTrack, we proceed to compute a second output mask using the magnitude of the optical flow as input image (replicated into a three channel image). The model is used as-is, without retraining. Although it has been trained on RGB images, this strategy works because the object flow magnitude roughly looks like a gray-scale object and still captures useful object shape information, see examples in Figure 2. Using the RGB model allows us to avoid training the convnet on video datasets annotated with masks. We then fuse by averaging the output scores given by the two parallel networks, respectively fed with RGB images and optical flow magnitude as input. As shown in Table 1, optical flow provides complementary information to MaskTrack with RGB images, improving the overall performance.
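The fusion step itself is a plain average of the two score maps; a sketch (ours, assuming `net_scores` wraps the RGB-trained model and returns a per-pixel foreground probability map):

```python
import numpy as np

def fuse_rgb_and_flow(frame, flow_magnitude, rough_mask, net_scores, threshold=0.5):
    """Average the outputs of the RGB and flow-magnitude branches (sketch).

    net_scores: callable (image, rough_mask) -> HxW foreground probabilities.
    The same RGB-trained model is applied twice: once to the RGB frame and once
    to the flow magnitude replicated into three channels.
    """
    flow_img = np.repeat(flow_magnitude[..., None], 3, axis=2)  # gray -> 3 channels
    scores_rgb = net_scores(frame, rough_mask)
    scores_flow = net_scores(flow_img, rough_mask)
    fused = 0.5 * (scores_rgb + scores_flow)
    return (fused > threshold).astype(np.uint8)  # final binary mask
```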
4. Network Implementation and Training
In the following, we describe the implementation details of our approach. Specifically, we provide additional information regarding the network initialization, the offline and online training strategies, and the data augmentation.
Network. For all our experiments we use the training and test parameters of the DeepLabv2-VGG network [8]. The model is initialized from a VGG16 network pre-trained on ImageNet [46]. For the extra mask channel of the filters in the first convolutional layer we use Gaussian initialization. We also tried zero initialization, but observed no difference.
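A sketch of this initialization (ours, in PyTorch; the standard deviation of the Gaussian is an assumption, since the paper only states that the extra channel is Gaussian-initialized):

```python
import torch
import torch.nn as nn

def expand_first_conv_to_rgb_plus_mask(conv: nn.Conv2d, std: float = 0.01) -> nn.Conv2d:
    """Widen a pretrained 3-channel first convolution to 4 input channels (sketch).

    The pretrained RGB filter weights are copied; the extra mask channel is
    drawn from a zero-mean Gaussian.
    """
    new_conv = nn.Conv2d(4, conv.out_channels, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight[:, :3] = conv.weight                # keep RGB filters
        new_conv.weight[:, 3:].normal_(mean=0.0, std=std)   # Gaussian init, mask channel
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv
```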
Offline training. The advantage of our method is that it does not require expensive pixel-wise video annotations for training. Thus we can employ existing image datasets. However, in order for our model to generalize well across different videos, we avoid training on datasets that are biased towards certain semantic classes, such as COCO [28] or Pascal [14]. Instead we combine images and annotations from several saliency segmentation datasets (ECSSD [45], MSRA10K [9], SOD [31], and PASCAL-S [27]), resulting in an aggregated set of 11,282 training images.
The input masks for the extra channel are generated by deforming the binary segmentation masks via affine transformation and non-rigid deformations, as discussed in § 3.1. For the affine transformation we consider random scaling (±5% of object size) and translation (±10% shift). Non-rigid deformations are done via thin-plate splines [4], using 5 control points and randomly shifting the points in x and y directions within a ±10% margin of the original segmentation mask width and height. Next, the mask is coarsened using a dilation operation with a 5 pixel radius. This mask deformation procedure is applied over all object instances in the training set. For each image two different masks are generated. We refer the reader to the supplementary material for visual examples of deformed masks.
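For illustration, a simplified version of this mask deformation (ours; only the affine part and the dilation are implemented, and the thin-plate-spline non-rigid deformation is omitted for brevity):

```python
import cv2
import numpy as np

def deform_and_coarsen_mask(mask, rng=np.random, dilation_radius=5):
    """Generate a rough training mask from a clean binary mask (simplified sketch).

    Random scaling (+/-5%) and translation (+/-10% of the object size) follow the
    parameters above; the thin-plate-spline deformation is left out here.
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    obj_h, obj_w = ys.ptp() + 1, xs.ptp() + 1            # object bounding-box size
    scale = 1.0 + rng.uniform(-0.05, 0.05)
    tx = rng.uniform(-0.10, 0.10) * obj_w
    ty = rng.uniform(-0.10, 0.10) * obj_h
    cx, cy = xs.mean(), ys.mean()                        # scale around object center
    M = np.float32([[scale, 0.0, (1.0 - scale) * cx + tx],
                    [0.0, scale, (1.0 - scale) * cy + ty]])
    warped = cv2.warpAffine(mask.astype(np.uint8), M, (w, h), flags=cv2.INTER_NEAREST)
    size = 2 * dilation_radius + 1
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (size, size))
    return cv2.dilate(warped, kernel)                    # coarsen: drop contour detail
```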
The convnet training parameters are identical to those proposed in [8]. Therefore we use stochastic gradient descent (SGD) with mini-batches of 10 images and a polynomial learning policy with an initial learning rate of 0.001. The momentum and weight decay are set to 0.9 and 0.0005, respectively. The network is trained for 20k iterations.
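The "poly" policy referenced here decays the learning rate polynomially over the 20k iterations; a sketch (ours; the power of 0.9 is the usual DeepLab default and is an assumption, as the paper does not state it):

```python
def poly_learning_rate(iteration, base_lr=0.001, max_iter=20000, power=0.9):
    """Polynomial ("poly") learning-rate policy, as used in DeepLab-style training."""
    return base_lr * (1.0 - iteration / float(max_iter)) ** power

# Example: learning rate at the start, midway, and near the end of offline training.
for it in (0, 10000, 19999):
    print(it, poly_learning_rate(it))
```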
Online training. For online adaptation we fine-tune the model previously trained offline on the first frame for 200 iterations, with training samples generated from the first frame annotation. We augment the first frame by image flipping and rotations, as well as by deforming the annotated masks for the extra channel via affine and non-rigid deformations with the same parameters as for the offline training. This results in an augmented set of 10^3 training images. The network is trained with the same learning parameters as for offline training, fine-tuning all convolutional layers.
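A compact sketch of this online adaptation step (ours, in PyTorch; the binary cross-entropy loss is an assumption standing in for the DeepLab training loss, and `first_frame_batches` is a hypothetical iterable over the roughly 10^3 augmented first-frame samples):

```python
import itertools
import torch

def online_finetune(model, first_frame_batches, iters=200, base_lr=0.001):
    """Fine-tune the offline-trained model on augmented first-frame samples (sketch)."""
    opt = torch.optim.SGD(model.parameters(), lr=base_lr,
                          momentum=0.9, weight_decay=0.0005)
    criterion = torch.nn.BCEWithLogitsLoss()
    model.train()
    batches = itertools.cycle(first_frame_batches)       # reuse samples as needed
    for _ in range(iters):
        inputs, target = next(batches)                   # (B,4,H,W), (B,1,H,W)
        opt.zero_grad()
        loss = criterion(model(inputs), target)
        loss.backward()
        opt.step()
    return model
```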
Figure 3: By propagating annotation from the 1st frame, either from segment or just bounding box annotations, our system generates results comparable to ground truth. (Panels: the 1st frame image with box and segment annotations; for the 13th frame, the ground truth, the MaskTrack-Box result, and the MaskTrack result.)
At test time our base MaskTrack system runs at about 12 seconds per frame (averaged over DAVIS, amortizing the online fine-tuning time over all video frames), which is an order of magnitude faster than ObjFlow [49] (2 minutes per frame, averaged over DAVIS).
5. Results
In this section we describe our evaluation protocol (§ 5.1), study the importance of the different components of our system (§ 5.2), and report results comparing to state-of-the-art techniques over three datasets (§ 5.3), as well as comparing the effects of different amounts of annotation on the resulting segmentation quality (§ 5.4). Additional results are provided in the supplementary material.
5.1. Experimental setup
Datasets. We evaluate the proposed approach on three different video object segmentation datasets: DAVIS [34], YoutubeObjects [39], and SegTrack-v2 [26]. These datasets include assorted challenges such as appearance change, occlusion, motion blur and shape deformation.
DAVIS [34] consists of 50 high quality videos, totaling 3,455 frames. Pixel-level segmentation annotations are provided for each frame, where one single object or two connected objects are separated from the background.
YoutubeObjects [39] includes videos with 10 object categories. We consider the subset of 126 videos with more than 20,000 frames, for which the pixel-level ground truth segmentation masks are provided by [21].
SegTrack-v2 [26] contains 14 video sequences with 24 objects and 947 frames. Every frame is annotated with a pixel-level object mask. As instance-level annotations are provided for sequences with multiple objects, each specific instance segmentation is treated as a separate problem.
Evaluation. We evaluate using the standard mIoU metric: the intersection-over-union of the estimated segmentation and the ground truth binary mask, also known as the Jaccard index, averaged across videos. For DAVIS we use the provided benchmark code [34], which excludes the first and the last frames from the evaluation. For YoutubeObjects and SegTrack-v2 only the first frame is excluded.
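For reference, a minimal implementation of this per-video measure (ours; frame skipping follows the protocol above):

```python
import numpy as np

def jaccard(pred, gt):
    """Intersection-over-union (Jaccard index) of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / float(union) if union else 1.0

def video_miou(pred_masks, gt_masks, skip_first=True, skip_last=False):
    """Mean IoU over one video, skipping frames as in the evaluation protocol
    (first and last frames for DAVIS, only the first frame otherwise)."""
    frames = list(zip(pred_masks, gt_masks))
    frames = frames[1:] if skip_first else frames
    frames = frames[:-1] if skip_last else frames
    return float(np.mean([jaccard(p, g) for p, g in frames]))
```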
Previous works used different evaluation procedures. To ensure a consistent comparison between methods, when needed, we re-computed scores from the publicly available output masks, or reproduced the results using the available open source code. In particular, we collected new results for ObjFlow [49] and BVS [29] in order to present these methods with results across all three datasets.
5.2. Ablation study
We first study different ingredients of our method. We experiment on the DAVIS dataset and measure the performance using the mean intersection-over-union metric (mIoU). Table 1 shows the importance of each of the ingredients described in § 3 and reports the improvement of adding extra components to the MaskTrack model.
Add-ons. We first study the effect of adding a couple of ingredients on top of our base MaskTrack system, which are specifically fine-tuned for DAVIS. We see that optical flow provides complementary information to the appearance, further boosting the results (74.8 → 78.4). Adding on top a well-tuned post-processing CRF [23] can gain a couple of mIoU points, reaching 80.3% mIoU on DAVIS, the best known result on this dataset.
Although optical flow can provide interesting gains, we found it to be brittle when going across different datasets. Different strategies to handle optical flow provide 1-4% on each dataset, but none provides consistent gains across all datasets, mainly due to failure modes of the optical flow algorithms. For the sake of presenting a single model with fixed parameters across all datasets, we refrain from using a per-dataset tuned optical flow in the results of § 5.3.
Training. We next study the effect of offline/online training of the network. By disabling online fine-tuning and relying only on offline training, we see a drop of 5 IoU percentage points, showing that online fine-tuning indeed expands the tracking capabilities. If instead we skip offline training and rely only on online fine-tuning, performance drops drastically, although the absolute quality (57.6 mIoU) is surprisingly high for a system trained on ImageNet plus a single frame.
By reducing the amount of training data from 11k to 5k images we only see a minor decrease in mIoU; this indicates that even with a small amount of training data we can achieve reasonable performance. That being said, a further increase of the training data volume would lead to improved results.
Additionally, we explore the effect of offline training on video data instead of static images. We train the model on the annotated frames of two combined datasets, SegTrack-v2 and YoutubeObjects. By switching to training on video data we observe a minor decrease in mIoU; this could be explained by the lack of diversity in the video training data.
Citations
Fast Online Object Tracking and Segmentation: A Unifying Approach (Proceedings Article)
TL;DR: This method improves the offline training procedure of popular fully-convolutional Siamese approaches for object tracking by augmenting their loss with a binary segmentation task, and operates online, producing class-agnostic object segmentation masks and rotated bounding boxes at 55 frames per second.

One-Shot Video Object Segmentation (Proceedings Article)
TL;DR: One-shot video object segmentation (OSVOS) is based on a fully-convolutional neural network architecture that successively transfers generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally learns the appearance of a single annotated object of the test sequence.

Video Salient Object Detection via Fully Convolutional Networks (Journal Article)
TL;DR: Proposes a deep video saliency network consisting of two modules, for capturing the spatial and temporal saliency information, respectively, which can directly produce spatio-temporal saliency inference without time-consuming optical flow computation.

Advances in Computer Vision-Based Civil Infrastructure Inspection and Monitoring (Journal Article)
TL;DR: An overview of recent advances in computer vision techniques as they apply to the problem of civil infrastructure condition assessment, together with some of the key challenges that persist toward the goal of automated vision-based civil infrastructure inspection and monitoring.

SegFlow: Joint Learning for Video Object Segmentation and Optical Flow (Proceedings Article)
TL;DR: SegFlow has two branches in which useful information of object segmentation and optical flow is propagated bidirectionally in a unified framework; the framework is trained iteratively offline to learn a generic notion and fine-tuned online for specific objects.
References
ImageNet Classification with Deep Convolutional Neural Networks (Proceedings Article)
TL;DR: A deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieves state-of-the-art performance on ImageNet classification.

Very Deep Convolutional Networks for Large-Scale Image Recognition (Proceedings Article)
TL;DR: Investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting and shows that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

ImageNet Large Scale Visual Recognition Challenge (Journal Article)
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images, run annually from 2010 to present and attracting participation from more than fifty institutions.

Microsoft COCO: Common Objects in Context (Book Chapter)
TL;DR: A new dataset aimed at advancing the state of the art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding, gathering images of complex everyday scenes containing common objects in their natural context.

The Pascal Visual Object Classes Challenge: A Retrospective (Journal Article)
TL;DR: A review of the Pascal Visual Object Classes challenge from 2008-2012, with an appraisal of the aspects of the challenge that worked well and those that could be improved in future challenges.