Proceedings ArticleDOI

Fast Object Segmentation in Unconstrained Video

01 Dec 2013 - pp. 1777-1784
TL;DR: This method is fast, fully automatic, and makes minimal assumptions about the video, which enables handling essentially unconstrained settings, including rapidly moving background, arbitrary object motion and appearance, and non-rigid deformations and articulations.
Abstract: We present a technique for separating foreground objects from the background in a video. Our method is fast, fully automatic, and makes minimal assumptions about the video. This enables handling essentially unconstrained settings, including rapidly moving background, arbitrary object motion and appearance, and non-rigid deformations and articulations. In experiments on two datasets containing over 1400 video shots, our method outperforms a state-of-the-art background subtraction technique [4] as well as methods based on clustering point tracks [6, 18, 19]. Moreover, it performs comparably to recent video object segmentation methods based on object proposals [14, 16, 27], while being orders of magnitude faster.

Summary (2 min read)

1. Introduction

  • Video object segmentation is the task of separating foreground objects from the background in a video [14, 18, 26].
  • The latter scenario is more practically relevant, as a good solution would enable processing large amounts of video without human intervention.
  • The object can be static in a portion of the video and only part of it can be moving in some other portion (e.g. a cat starts running and then stops to lick its paws).
  • This second stage automatically bootstraps an appearance model based on the initial foreground estimate, and uses it to refine the spatial accuracy of the segmentation and to also segment the object in frames where it does not move (sec. 3.2).

3. Our approach

  • The goal of their work is to segment objects that move differently than their surroundings.
  • The authors' method has two main stages: (1) efficient initial foreground estimation (sec. 3.1), (2) foreground-background labelling refinement (sec. 3.2).
  • The authors compute the optical flow between pairs of subsequent frames and detect motion boundaries.
  • Due to inaccuracies in the flow estimation, the motion boundaries are typically incomplete and do not align perfectly with object boundaries (fig. 1f).
  • The goal of the second stage is to refine the spatial accuracy of the inside-outside maps and to segment the whole object in all frames.

3.1. Efficient initial foreground estimation

  • The authors begin by computing optical flow between pairs of subsequent frames (t, t + 1) using the state-of-the-art algorithm [6, 22].
  • The authors base their approach on motion boundaries, i.e. image points where the optical flow field changes abruptly.
  • The algorithm estimates whether a pixel is inside the object based on the point-in-polygon problem [12] from computational geometry: a ray starting from a point inside a closed curve intersects it an odd number of times.
  • A ray starting from a point outside the polygon, in contrast, will intersect it an even number of times (see the toy sketch after this list).
  • The authors' algorithm visits each pixel exactly once per direction while building S, and once to compute its vote, and is therefore linear in the number of pixels in the image.
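
As a toy illustration of the parity rule above, restricted to a single pixel and the two horizontal rays (the full method shoots 8 rays, takes a majority vote, and uses the faster integral-intersections structure of sec. 3.1), here is a minimal NumPy sketch; function and variable names are illustrative:

```python
import numpy as np

def horizontal_parity_votes(boundary, x, y):
    """Parity votes for pixel (x, y) from the two horizontal rays.

    boundary: (H, W) boolean motion-boundary map. A ray crosses the boundary
    once each time it enters a run of boundary pixels; an odd crossing count
    votes 'inside'.
    """
    def crossings(segment):
        if segment.size == 0:
            return 0
        prev = np.concatenate(([False], segment[:-1]))
        return int(np.count_nonzero(segment & ~prev))

    row = boundary[y].astype(bool)
    left = crossings(row[:x])       # ray from (x, y) towards the left border
    right = crossings(row[x + 1:])  # ray from (x, y) towards the right border
    return left % 2 == 1, right % 2 == 1
```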

3.2. Foreground-background labelling refinement

  • The authors formulate video segmentation as a pixel labelling problem with two labels (foreground and background).
  • The pairwise potentials V and W encourage spatial and temporal smoothness, respectively.
  • Two superpixels s^t_i, s^{t+1}_j in subsequent frames are connected if at least one pixel of s^t_i moves into s^{t+1}_j according to the optical flow (fig. 3).
  • Moreover, the appearance models are integrated over large image regions and over many frames, and therefore can robustly estimate the appearance of the object, despite faults in the inside-outside maps (see the sketch after this list).
  • In some frames (part of) the object may be static, and in others the inside-outside map might miss it because of incorrect optical flow estimation (fig. 4, middle row).
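
The excerpt below (sec. 3.2) does not reproduce the exact appearance model, so the following sketch shows one plausible way to bootstrap per-frame foreground/background colour models from the inside-outside maps, using Gaussian mixture models as in GrabCut; the model class, temporal window and component count are assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def appearance_scores(frames, inside_maps, t, window=5, n_components=5):
    """Fit fg/bg colour GMMs around frame t and score its pixels.

    frames: list of (H, W, 3) float RGB images; inside_maps: list of (H, W)
    boolean inside-outside maps. Pixels from frames t-window..t+window are
    pooled, integrating evidence over many frames as described above.
    Returns per-pixel foreground and background log-likelihoods for frame t.
    """
    lo, hi = max(0, t - window), min(len(frames), t + window + 1)
    fg = np.concatenate([frames[k][inside_maps[k]] for k in range(lo, hi)])
    bg = np.concatenate([frames[k][~inside_maps[k]] for k in range(lo, hi)])
    fg_gmm = GaussianMixture(n_components).fit(fg)
    bg_gmm = GaussianMixture(n_components).fit(bg)
    pixels = frames[t].reshape(-1, 3)
    H, W = inside_maps[t].shape
    return (fg_gmm.score_samples(pixels).reshape(H, W),
            bg_gmm.score_samples(pixels).reshape(H, W))
```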

4.2. YouTube-Objects

  • YouTube-Objects [19] is a large database collected from YouTube containing many videos for each of several object classes.
  • The objects undergo rapid movement, strong scale and viewpoint changes, nonrigid deformations, and are sometimes clipped by the image border (fig. 5).
  • Prest et al. [19] automatically select one segment per shot among those produced by [6], based on its appearance similarity to segments selected in other videos of the same object class, and on how likely it is to cover an object according to a class-generic objectness measure [2].
  • For evaluation the authors fit a bounding box to the top-ranked output segment (see the sketch below).
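
A minimal sketch of this bounding-box fitting step, assuming the top-ranked output segment is available as a NumPy boolean mask:

```python
import numpy as np

def mask_to_bbox(mask):
    """Fit a tight bounding box (x_min, y_min, x_max, y_max) to a boolean mask."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # empty segmentation: nothing to evaluate
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```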

4.3. Runtime

  • Given optical flow and superpixels, the authors' method takes 0.5 sec/frame on SegTrack (0.05 sec for the inside-outside maps and the rest for the foreground-background labelling refinement).
  • While [16, 27] neither report timings nor have code available to measure, their runtime must be > 120 sec/frame as they also use the object proposals [10].
  • High quality optical flow can be computed rapidly using [22] (< 1 sec/frame).
  • Currently, the authors use TurboPixels as superpixels [15] (1.5 sec/frame), but even faster alternatives are available [1].




Edinburgh Research Explorer
Fast Object Segmentation in Unconstrained Video
Citation for published version: Papazoglou, A & Ferrari, V 2013, 'Fast Object Segmentation in Unconstrained Video', in Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 1777-1784. https://doi.org/10.1109/ICCV.2013.223
Digital Object Identifier (DOI): 10.1109/ICCV.2013.223
Document Version: Peer reviewed version
Published In: Computer Vision (ICCV), 2013 IEEE International Conference on

Fast object segmentation in unconstrained video
Anestis Papazoglou
University of Edinburgh
Vittorio Ferrari
University of Edinburgh
Abstract
We present a technique for separating foreground objects
from the background in a video. Our method is fast, fully au-
tomatic, and makes minimal assumptions about the video.
This enables handling essentially unconstrained settings,
including rapidly moving background, arbitrary object mo-
tion and appearance, and non-rigid deformations and ar-
ticulations. In experiments on two datasets containing over
1400 video shots, our method outperforms a state-of-the-
art background subtraction technique [4] as well as meth-
ods based on clustering point tracks [6, 18, 19]. Moreover,
it performs comparably to recent video object segmentation
methods based on object proposals [14, 16, 27], while being
orders of magnitude faster.
1. Introduction
Video object segmentation is the task of separating fore-
ground objects from the background in a video [14, 18, 26].
This is important for a wide range of applications, includ-
ing providing spatial support for learning object class mod-
els [19], video summarization, and action recognition [5].
The task has been addressed by methods requiring a user
to annotate the object position in some frames [3, 20, 26,
24], and by fully automatic methods [14, 6, 18, 4], which
input just the video. The latter scenario is more practi-
cally relevant, as a good solution would enable processing
large amounts of video without human intervention. How-
ever, this task is very challenging, as the method is given no
knowledge about the object appearance, scale or position.
Moreover, the general unconstrained setting might include
rapidly moving backgrounds and objects, non-rigid defor-
mations and articulations (fig. 5).
In this paper we propose a technique for fully automatic
video object segmentation in unconstrained settings. Our
method is computationally efficient and makes minimal as-
sumptions about the video: the only requirement is for the
object to move differently from its surrounding background
in a good fraction of the video. The object can be static
in a portion of the video and only part of it can be mov-
ing in some other portion (e.g. a cat starts running and then
stops to lick its paws). Our method does not require a static
or slowly moving background (as opposed to classic back-
ground subtraction methods [9, 4, 7]). Moreover, it does
not assume the object follows a particular motion model,
nor that all its points move homogeneously (as opposed to
methods based on clustering point tracks [6, 17, 18]). This
is especially important when segmenting non-rigid or artic-
ulated objects such as animals (fig. 5).
The key new element in our approach is a rapid technique
to produce a rough estimate of which pixels are inside the
object based on motion boundaries in pairs of subsequent
frames (sec. 3.1). This initial estimate is then refined by
integrating information over the whole video with a spatio-
temporal extension of GrabCut [21, 14, 26]. This second
stage automatically bootstraps an appearance model based
on the initial foreground estimate, and uses it to refine the
spatial accuracy of the segmentation and to also segment the
object in frames where it does not move (sec. 3.2).
Through extensive experiments on over 1400 video shots
from two datasets [24, 19], we show that our method: (i)
handles fast moving backgrounds and objects exhibiting a
wide range of appearance, motions and deformations, in-
cluding non-rigid and articulated objects; (ii) outperforms
a state-of-the-art background subtraction technique [4] as
well as methods based on clustering point tracks [6, 18, 19];
(iii) is orders of magnitude faster than recent video object
segmentation methods based on object proposals [14, 16,
27]; (iv) outperforms the popular method [14] on the large
YouTube-Objects dataset [19]; (v) produces competitive re-
sults on the small SegTrack benchmark [24]. The source
code of our method is released at http://groups.inf.ed.ac.uk/calvin/software.html
2. Related Work
Interactive or supervised methods. Several methods for
video object segmentation require the user to manually an-
notate a few frames with object segmentations and then
propagate these annotations to all other frames [3, 20, 26].
Similarly, methods based on tracking [8, 24], require the
user to mark the object positions in the first frame and then
track them in the rest of the video.
Background subtraction. Classic background subtrac-
tion methods model the appearance of the background at

each pixel and consider pixels that change rapidly to be
foreground. These methods typically assume a stationary,
or slowly panning camera [9, 4, 7]. The background should
change slowly in order for the model to update safely with-
out generating false-positive foreground detections.
Clustering point tracks. Several automatic video seg-
mentation methods track points over several frames and
then cluster the resulting tracks based on pairwise [6, 17]
or triplet [18] similarity measures. The underlying assump-
tion induced by pairwise clustering [6, 17] is that all ob-
ject points move according to a single translation, while the
triplet model [18] assumes a single similarity transforma-
tion. These assumptions have trouble accommodating non-
rigid or articulated objects. Our method instead does not at-
tempt to cluster object points and does not assume any kind
of motion homogeneity. The object only needs to move suf-
ficiently differently from the background to generate mo-
tion boundaries along most of its physical boundary. On the
other hand, these methods [6, 17, 18] try to place multiple
objects in separate segments, whereas our method produces
a simpler binary segmentation (all objects vs background).
Ranking object proposals. The works [14, 16, 27] are
closely related to ours, as they tackle the very same task.
These methods are based on finding recurring object-like
segments, aided by recent techniques for measuring generic
object appearance [10], and achieve impressive results on
the SegTrack benchmark [24]. While the object proposal in-
frastructure is necessary to find out which image regions are
objects vs background, it makes these methods very slow
(minutes/frame). In our work instead, this goal is achieved
by a much simpler, faster process (sec. 3.1). In sec. 4 we
show that our method achieves comparable segmentation
accuracy to [14] while being two orders of magnitude faster.
Oversegmentation. Grundmann et al. [13] oversegment
a video into spatio-temporal regions of uniform motion and
appearance, analogous to still-image superpixels [15]. While
this is a useful basis for later processing, it does not solve
the video object segmentation task on its own.
3. Our approach
The goal of our work is to segment objects that move dif-
ferently than their surroundings. Our method has two main
stages: (1) efficient initial foreground estimation (sec. 3.1),
(2) foreground-background labelling refinement (sec. 3.2).
We now give a brief overview of these two stages, and then
present them in more detail in the rest of the section.
(1) Efficient initial foreground estimation. The goal of the first stage is to rapidly produce an initial estimate of which pixels might be inside the object based purely on motion. We compute the optical flow between pairs of subsequent frames and detect motion boundaries. Ideally, the motion boundaries will form a complete closed curve coinciding with the object boundaries. However, due to inaccuracies in the flow estimation, the motion boundaries are typically incomplete and do not align perfectly with object boundaries (fig. 1f). Also, occasionally false positive boundaries might be detected. We propose a novel, computationally efficient algorithm to robustly determine which pixels reside inside the moving object, taking into account all these sources of error (fig. 2c).

Figure 1. Motion boundaries. (a) Two input frames. (b) Optical flow $\vec{f}_p$; the hue of a pixel indicates its direction and the color saturation its velocity. (c) Motion boundaries $b^m_p$, based on the magnitude of the gradient of the optical flow. (d) Motion boundaries $b^\theta_p$, based on the difference in direction between a pixel and its neighbours. (e) Combined motion boundaries $b_p$. (f) Final, binary motion boundaries after thresholding, overlaid on the first frame.
(2) Foreground-background labelling refinement. As
they are purely based on motion boundaries, the inside-
outside maps produced by the first stage typically only ap-
proximately indicate where the object is. They do not accu-
rately delineate object outlines. Furthermore, (parts of) the
object might be static in some frames, or the inside-outside
maps may miss it due to incorrect optical flow estimation.
The goal of the second stage is to refine the spatial ac-
curacy of the inside-outside maps and to segment the whole
object in all frames. To achieve this, it integrates the infor-
mation from the inside-outside maps over all frames by (1)
encouraging the spatio-temporal smoothness of the output
segmentation over the whole video; (2) building dynamic
appearance models of the object and background under the
assumption that they change smoothly over time. Incorporating appearance cues is key to achieving a finer level of detail, compared to using only motion. Moreover, after learning the object appearance in the frames where the inside-outside maps found it, the second stage uses it to segment the object in frames where it was initially missed (e.g. because it is static).

Figure 2. Inside-outside maps. (Left) The ray-casting observation. Any ray originating inside a closed curve intersects it an odd number of times. Any ray originating outside intersects it an even number of times. This holds for any number of closed curves in the image. (Middle) Illustration of the integral intersections data structure S for the horizontal direction. The number of intersections for the ray going from pixel x to the left border can be easily computed as $X_{\mathrm{left}}(x, y) = S(x-1, y) = 1$, and for the right ray as $X_{\mathrm{right}}(x, y) = S(W, y) - S(x, y) = 1$. In this case, both rays vote for x being inside the object. (Right) The output inside-outside map $M^t$.
3.1. Efficient initial foreground estimation
Optical flow. We begin by computing optical flow be-
tween pairs of subsequent frames (t, t + 1) using the state-
of-the-art algorithm [6, 22]. It supports large displacements
between frames and has a computationally very efficient
GPU implementation [22] (fig. 1a+b).
Motion boundaries. We base our approach on motion
boundaries, i.e. image points where the optical flow field
changes abruptly. Motion boundaries reveal the location of
occlusion boundaries, which very often correspond to phys-
ical object boundaries [23].
Let $\vec{f}_p$ be the optical flow vector at pixel $p$. The simplest way to estimate motion boundaries is by computing the magnitude of the gradient of the optical flow field:

$$b^m_p = 1 - \exp\!\left(-\lambda^m \|\nabla \vec{f}_p\|\right) \qquad (1)$$

where $b^m_p \in [0, 1]$ is the strength of the motion boundary at pixel $p$ and $\lambda^m$ is a parameter controlling the steepness of the function.

While this measure correctly detects boundaries at rapidly moving pixels, where $b^m_p$ is close to 1, it is unreliable for pixels with intermediate $b^m_p$ values around 0.5, which could be explained either as boundaries or as errors due to inaccuracies in the optical flow (fig. 1c). To disambiguate between those two cases, we compute a second estimator $b^\theta_p \in [0, 1]$, based on the difference in direction between the motion of pixel $p$ and its neighbours $\mathcal{N}$:

$$b^\theta_p = 1 - \exp\!\left(-\lambda^\theta \max_{q \in \mathcal{N}} \delta\theta^2_{p,q}\right) \qquad (2)$$

where $\delta\theta_{p,q}$ denotes the angle between $\vec{f}_p$ and $\vec{f}_q$. The idea is that if $p$ is moving in a different direction than all its neighbours, it is likely to be on a motion boundary. This estimator can correctly detect boundaries even when the object is moving at a modest velocity, as long as it goes in a different direction than the background. However, it tends to produce false positives in static image regions, as the direction of the optical flow is noisy at points with little or no motion (fig. 1d).

As the two measures above have complementary failure modes, we combine them into a measure that is more reliable than either alone (fig. 1e):

$$b_p = \begin{cases} b^m_p & \text{if } b^m_p > T \\ b^m_p \cdot b^\theta_p & \text{if } b^m_p \le T \end{cases} \qquad (3)$$

where $T$ is a high threshold, above which $b^m_p$ is considered reliable on its own. As a last step we threshold $b_p$ at 0.5 to produce a binary motion boundary labelling (fig. 1f).
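
For concreteness, a minimal NumPy sketch of eqs. (1)-(3) follows, assuming a dense flow field of shape (H, W, 2). The choice of gradient norm, the 4-connected neighbourhood, and the parameter values lam_m, lam_theta and T are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def motion_boundaries(flow, lam_m=0.5, lam_theta=2.0, T=0.7):
    """Binary motion boundaries from dense optical flow (eqs. 1-3)."""
    fx, fy = flow[..., 0], flow[..., 1]

    # Eq. (1): gradient-magnitude cue b^m_p, taken here as the Frobenius norm
    # of the flow Jacobian; strong wherever the flow changes abruptly.
    dfx = np.stack(np.gradient(fx), axis=-1)
    dfy = np.stack(np.gradient(fy), axis=-1)
    grad_mag = np.sqrt((dfx ** 2).sum(-1) + (dfy ** 2).sum(-1))
    b_m = 1.0 - np.exp(-lam_m * grad_mag)

    # Eq. (2): direction cue b^theta_p, the maximum squared angular difference
    # to the 4-connected neighbours of each pixel.
    angle = np.arctan2(fy, fx)
    max_dtheta2 = np.zeros_like(angle)
    for shift in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        neigh = np.roll(angle, shift, axis=(0, 1))
        dtheta = np.angle(np.exp(1j * (angle - neigh)))  # wrap to [-pi, pi]
        max_dtheta2 = np.maximum(max_dtheta2, dtheta ** 2)
    b_theta = 1.0 - np.exp(-lam_theta * max_dtheta2)

    # Eq. (3): trust b^m alone above the threshold T, otherwise gate it by the
    # direction cue; finally binarise at 0.5.
    b = np.where(b_m > T, b_m, b_m * b_theta)
    return b > 0.5
```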
Inside-outside maps. The produced motion boundaries
typically do not completely cover the whole object bound-
ary. Moreover, there might be false positive boundaries, due
to inaccuracy of the optical flow estimation. We present
here a computationally efficient algorithm to robustly esti-
mate which pixels are inside the object while taking into
account these sources of error.
The algorithm estimates whether a pixel is inside the
object based on the point-in-polygon problem [12] from
computational geometry. The key observation is that any
ray starting from a point inside the polygon (or any closed
curve) will intersect the boundary of the polygon an odd
number of times. Instead, a ray starting from a point out-
side the polygon will intersect it an even number of times
(figure 2a). Since the motion boundaries are typically in-
complete, a single ray is not sufficient to determine whether
a pixel lies inside the object. Instead, we get a robust es-
timate by shooting 8 rays spaced by 45 degrees. Each ray
casts a vote on whether the pixel is inside or outside. The
final inside-outside decision is taken by majority rule, i.e. a
pixel with 5 or more rays intersecting the boundaries an odd
number of times is deemed inside.
Realizing the above idea with a naive algorithm would be computationally expensive (i.e. quadratic in the number of pixels in the image). We propose an efficient algorithm which we call integral intersections, inspired by the use of integral images in [25]. The key idea is to create a special data structure that enables very fast inside-outside evaluation by massively reusing the computational effort that went into creating the data structure.

Figure 3. Example connectivity $\mathcal{E}_t$ over time. Superpixel $s^t_1$ contains pixels that lead to $s^{t+1}_1, s^{t+1}_2, s^{t+1}_3, s^{t+1}_4$. As an example, the weight $\phi(s^t_1, s^{t+1}_1)$ is 0.28 (all others are omitted for clarity).
For each direction (horizontal, vertical and the two diag-
onals) we create a matrix S of the same size W × H as the
image. An entry S(x, y) of this matrix indicates the num-
ber of boundary intersections along the line going from the
image border up to pixel (x, y). For simplicity, we explain
here how to build S for the horizontal direction. The algo-
rithm for the other directions is analogous. The algorithm
builds S one line y at a time. The first pixel (1, y), at the left
image border, has value S(1, y) = 0. We then move right-
wards one pixel at a time and increment S(x, y) by 1 each
time we transition from a non-boundary pixel to a boundary
pixel. This results in a line S(:, y) whose entries count the
number of boundary intersections (fig. 2b.).
After computing S for all horizontal lines, the data structure is ready. We can now determine the number of intersections X for both horizontal rays (left-going and right-going) emanating from a pixel (x, y) in constant time:

$$X_{\mathrm{left}}(x, y) = S(x - 1, y) \qquad (4)$$
$$X_{\mathrm{right}}(x, y) = S(W, y) - S(x, y) \qquad (5)$$

where W is the width of the image, i.e. the rightmost pixel in a line (fig. 2b).
Our algorithm visits each pixel exactly once per direc-
tion while building S, and once to compute its vote, and is
therefore linear in the number of pixels in the image. The
algorithm is very fast in practice and takes about 0.1s per
frame of a HD video (1280x720 pixels) on a modest CPU
(Intel Core i7 at 2.0GHz).
For each video frame t, we apply the algorithm on all 8 directions and use majority voting to decide which pixels are inside, resulting in an inside-outside map $M^t$ (fig. 2c).
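
For illustration, a sketch of the integral-intersections construction for the horizontal direction only (the other directions are analogous, giving 8 rays in total). This is a reimplementation from the description above, not the authors' released code:

```python
import numpy as np

def horizontal_ray_votes(boundary):
    """Inside-votes of the two horizontal rays for every pixel (eqs. 4-5).

    boundary: (H, W) boolean motion-boundary map.
    Returns two (H, W) boolean arrays: vote of the left-going and of the
    right-going ray (True = odd number of boundary crossings = 'inside').
    """
    H, W = boundary.shape
    # S(x, y): crossings counted along the line from the left border to column
    # x, incremented on each transition from non-boundary to boundary pixel.
    entering = boundary & ~np.roll(boundary, 1, axis=1)
    entering[:, 0] = False                     # first pixel starts at 0
    S = np.cumsum(entering, axis=1)

    # X_left(x, y) = S(x - 1, y);  X_right(x, y) = S(W, y) - S(x, y)
    X_left = np.concatenate([np.zeros((H, 1), dtype=S.dtype), S[:, :-1]], axis=1)
    X_right = S[:, [-1]] - S
    return X_left % 2 == 1, X_right % 2 == 1

# The vertical and diagonal directions are handled the same way (e.g. on the
# transposed map); a pixel is labelled inside when at least 5 of its 8 rays
# report an odd crossing count.
```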
3.2. Foreground-background labelling refinement
We formulate video segmentation as a pixel labelling
problem with two labels (foreground and background). We
oversegment each frame into superpixels $\mathcal{S}^t$ [15], which greatly reduces the computational cost and memory usage, enabling us to segment much longer videos.
Each superpixel $s^t_i \in \mathcal{S}^t$ can take a label $l^t_i \in \{0, 1\}$. A labelling $L = \{l^t_i\}_{t,i}$ of all superpixels in all frames represents a segmentation of the video. Similarly to other segmentation works [14, 21, 26], we define an energy function to evaluate a labelling:

$$E(L) = \sum_{t,i} A^t_i(l^t_i) + \alpha_1 \sum_{t,i} L^t_i(l^t_i) + \alpha_2 \sum_{(i,j,t) \in \mathcal{E}_s} V^t_{ij}(l^t_i, l^t_j) + \alpha_3 \sum_{(i,j,t) \in \mathcal{E}_t} W^t_{ij}(l^t_i, l^{t+1}_j) \qquad (6)$$

$A^t$ is a unary potential evaluating how likely a superpixel is to be foreground or background according to the appearance model of frame $t$. The second unary potential $L^t$ is based on a location prior model encouraging foreground labellings in areas where independent motion has been observed. As we explain in detail later, we derive both the appearance model and the location prior parameters from the inside-outside maps $M^t$. The pairwise potentials $V$ and $W$ encourage spatial and temporal smoothness, respectively. The scalars $\alpha$ weight the various terms.

The output segmentation is the labelling that minimizes (6):

$$L^* = \operatorname*{argmin}_L E(L) \qquad (7)$$
As E is a binary pairwise energy function with submodular
pairwise potentials, we minimize it exactly with graph-cuts.
Next we use the resulting segmentation to re-estimate the
appearance models and iterate between these two steps, as
in GrabCut [21]. Below we describe the potentials in detail.
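
Before the individual potentials are described, a minimal sketch of the minimization in eq. (7) over the superpixel graph is given below. It uses the third-party PyMaxflow library as the s-t min-cut solver, which is an assumption (any solver for submodular binary pairwise energies would do); the unary costs and pairwise weights are assumed precomputed, and all names are illustrative.

```python
import numpy as np
import maxflow  # PyMaxflow: pip install PyMaxflow

def minimize_labelling(unary_fg, unary_bg, pairwise_edges):
    """Exact graph-cut minimization of a binary pairwise energy (eq. 7).

    unary_fg[i] / unary_bg[i]: cost of giving superpixel i the foreground /
    background label (the A and L terms of eq. 6, already weighted and summed).
    pairwise_edges: iterable of (i, j, w), w >= 0, covering both the spatial
    (V) and temporal (W) Potts terms.
    Returns a boolean array, True = foreground.
    """
    n = len(unary_fg)
    g = maxflow.Graph[float]()
    nodes = g.add_nodes(n)
    for i in range(n):
        # t-links: a node in the source segment pays the sink capacity and
        # vice versa, so the source segment plays the role of foreground here.
        g.add_tedge(nodes[i], unary_bg[i], unary_fg[i])
    for i, j, w in pairwise_edges:
        # n-links: Potts penalty paid when the two superpixels disagree.
        g.add_edge(nodes[i], nodes[j], w, w)
    g.maxflow()
    return np.array([g.get_segment(nodes[i]) == 0 for i in range(n)])
```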
Smoothness V, W. The spatial smoothness potential $V$ is defined over the edge set $\mathcal{E}_s$, containing pairs of spatially connected superpixels. Two superpixels are spatially connected if they are in the same frame and are adjacent.

The temporal smoothness potential $W$ is defined over the edge set $\mathcal{E}_t$, containing pairs of temporally connected superpixels. Two superpixels $s^t_i$, $s^{t+1}_j$ in subsequent frames are connected if at least one pixel of $s^t_i$ moves into $s^{t+1}_j$ according to the optical flow (fig. 3).
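
A sketch of how the temporal edge set $\mathcal{E}_t$ can be built from the flow is given below, assuming integer superpixel label maps for frames t and t+1 and the forward flow between them. Taking the weight phi as the fraction of pixels of $s^t_i$ that land in $s^{t+1}_j$ matches the example in fig. 3 but is an assumption about the exact normalisation.

```python
import numpy as np
from collections import Counter

def temporal_edges(sp_t, sp_t1, flow):
    """Build the temporal edge set between frames t and t+1 via optical flow.

    sp_t, sp_t1: (H, W) integer superpixel label maps of frames t and t+1.
    flow: (H, W, 2) forward flow (dx, dy) from frame t to frame t+1.
    Returns {(i, j): phi} for every pair of connected superpixels, with phi
    the fraction of pixels of superpixel i landing in superpixel j.
    """
    H, W = sp_t.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xt = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    yt = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    pair_counts = Counter(zip(sp_t.ravel(), sp_t1[yt, xt].ravel()))
    sizes = Counter(sp_t.ravel())
    return {(int(i), int(j)): c / sizes[i] for (i, j), c in pair_counts.items()}
```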
The functions $V$, $W$ are standard contrast-modulated Potts potentials [21, 26, 14]:

$$V^t_{ij}(l^t_i, l^t_j) = \mathrm{dis}(s^t_i, s^t_j)^{-1}\,[l^t_i \neq l^t_j]\,\exp\!\left(-\beta\,\mathrm{col}(s^t_i, s^t_j)^2\right) \qquad (8)$$

$$W^t_{ij}(l^t_i, l^{t+1}_j) = \phi(s^t_i, s^{t+1}_j)\,[l^t_i \neq l^{t+1}_j]\,\exp\!\left(-\beta\,\mathrm{col}(s^t_i, s^{t+1}_j)^2\right) \qquad (9)$$

where dis is the Euclidean distance between the centres of two superpixels and col is the difference between their average RGB color. The factor that differs from the standard
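
As a concrete reading of eqs. (8)-(9), the sketch below computes the edge weights that multiply the label-disagreement indicator; superpixel centres and mean RGB colours are assumed precomputed, the squared Euclidean colour difference is an interpretation of col(., .)^2, and the value of beta is illustrative.

```python
import numpy as np

def spatial_weight(centre_i, centre_j, rgb_i, rgb_j, beta=1e-3):
    """V in eq. (8): inverse-distance, contrast-modulated Potts weight."""
    dis = np.linalg.norm(np.asarray(centre_i, float) - np.asarray(centre_j, float))
    col2 = np.sum((np.asarray(rgb_i, float) - np.asarray(rgb_j, float)) ** 2)
    return np.exp(-beta * col2) / dis

def temporal_weight(phi_ij, rgb_i, rgb_j, beta=1e-3):
    """W in eq. (9): flow-overlap phi, contrast-modulated Potts weight."""
    col2 = np.sum((np.asarray(rgb_i, float) - np.asarray(rgb_j, float)) ** 2)
    return phi_ij * np.exp(-beta * col2)

# These weights multiply the indicator [l_i != l_j]; in the graph-cut
# construction of eq. (7) they appear directly as the n-link capacities.
```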

Citations

Proceedings ArticleDOI
27 Jun 2016
TL;DR: This work presents a new benchmark dataset and evaluation methodology for the area of video object segmentation, named DAVIS (Densely Annotated VIdeo Segmentation), and provides a comprehensive analysis of several state-of-the-art segmentation approaches using three complementary metrics.
Abstract: Over the years, datasets and benchmarks have proven their fundamental importance in computer vision research, enabling targeted progress and objective comparisons in many fields. At the same time, legacy datasets may impend the evolution of a field due to saturated algorithm performance and the lack of contemporary, high quality data. In this work we present a new benchmark dataset and evaluation methodology for the area of video object segmentation. The dataset, named DAVIS (Densely Annotated VIdeo Segmentation), consists of fifty high quality, Full HD video sequences, spanning multiple occurrences of common video object segmentation challenges such as occlusions, motionblur and appearance changes. Each video is accompanied by densely annotated, pixel-accurate and per-frame ground truth segmentation. In addition, we provide a comprehensive analysis of several state-of-the-art segmentation approaches using three complementary metrics that measure the spatial extent of the segmentation, the accuracy of the silhouette contours and the temporal coherence. The results uncover strengths and weaknesses of current approaches, opening up promising directions for future works.

1,656 citations


Cites background or methods from "Fast Object Segmentation in Unconst..."

  • ...Interestingly the assumption of a completely closed motion boundary curve coinciding with the object contours can robustly accommodate background deformations (FST)....


  • ...Unsupervised approaches have historically targeted over-segmentation [21, 51] or motion segmentation [5, 18] and only recently automatic methods for foregroundbackground separation have been proposed [13, 25, 33, 43, 45, 52]....


  • ...Aiming at detecting per-frame indicators of potential foreground object locations, KEY [24], SAL [43], and FST [33] try to determine prior information sparsely distributed over the video sequence....


  • ...Within the unsupervised category we evaluate the performance of NLC [13], FST [33], SAL [43], TRC [18], MSG [5] and CVOS [45]....


  • ...The dataset is accompanied with a comprehensive evaluation of several state-of-the-art approaches [5, 7, 13, 14, 18, 21, 24, 33, 35, 40, 43, 45]....


Proceedings ArticleDOI
01 Jan 2017
TL;DR: One-shot video object segmentation (OSVOS) as mentioned in this paper is based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence.
Abstract: This paper tackles the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame. We present One-Shot Video Object Segmentation (OSVOS), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one-shot). Although all frames are processed independently, the results are temporally coherent and stable. We perform experiments on two annotated video segmentation databases, which show that OSVOS is fast and improves the state of the art by a significant margin (79.8% vs 68.0%).

573 citations

Posted Content
TL;DR: One-shot video object segmentation (OSVOS) as mentioned in this paper is based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence.
Abstract: This paper tackles the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame. We present One-Shot Video Object Segmentation (OSVOS), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one-shot). Although all frames are processed independently, the results are temporally coherent and stable. We perform experiments on two annotated video segmentation databases, which show that OSVOS is fast and improves the state of the art by a significant margin (79.8% vs 68.0%).

523 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: This work introduces an unsupervised, geodesic distance based, salient video object segmentation method that incorporates saliency as prior for object via the computation of robust geodesIC measurement and builds global appearance models for foreground and background.
Abstract: We introduce an unsupervised, geodesic distance based, salient video object segmentation method. Unlike traditional methods, our method incorporates saliency as prior for object via the computation of robust geodesic measurement. We consider two discriminative visual features: spatial edges and temporal motion boundaries as indicators of foreground object locations. We first generate framewise spatiotemporal saliency maps using geodesic distance from these indicators. Building on the observation that foreground areas are surrounded by the regions with high spatiotemporal edge values, geodesic distance provides an initial estimation for foreground and background. Then, high-quality saliency results are produced via the geodesic distances to background regions in the subsequent frames. Through the resulting saliency maps, we build global appearance models for foreground and background. By imposing motion continuity, we establish a dynamic location model for each frame. Finally, the spatiotemporal saliency maps, appearance models and dynamic location models are combined into an energy minimization framework to attain both spatially and temporally coherent object segmentation. Extensive quantitative and qualitative experiments on benchmark video dataset demonstrate the superiority of the proposed method over the state-of-the-art algorithms.

516 citations


Cites methods from "Fast Object Segmentation in Unconst..."

  • ...We further carried out experiments on SegTrack v2 dataset [18] and 12 groups of videos randomly selected from Youtube Objects and compared our method with [32, 23, 15, 6] as well....


  • ...The methods in [32, 23, 15, 6, 19, 4, 22] and our method are unsupervised....


  • ...The average per-frame pixel error rate compared with these methods [32, 23, 15, 6, 19, 4, 22, 29, 9] for each video from SegTrack dataset [29] are summarized in Table 1....


  • ...method Ours [32] [23] [15] [6] [19] [4] [22] [29] [9]...


References
Proceedings ArticleDOI
07 Jul 2001
TL;DR: A new image representation called the “Integral Image” is introduced which allows the features used by the detector to be computed very quickly and a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions.
Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the "Integral Image" which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algo- rithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a "cascade" which allows back- ground regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection perfor- mance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.

10,592 citations

Journal ArticleDOI
TL;DR: A new superpixel algorithm is introduced, simple linear iterative clustering (SLIC), which adapts a k-means clustering approach to efficiently generate superpixels and is faster and more memory efficient, improves segmentation performance, and is straightforward to extend to supervoxel generation.
Abstract: Computer vision applications have come to rely increasingly on superpixels in recent years, but it is not always clear what constitutes a good superpixel algorithm. In an effort to understand the benefits and drawbacks of existing methods, we empirically compare five state-of-the-art superpixel algorithms for their ability to adhere to image boundaries, speed, memory efficiency, and their impact on segmentation performance. We then introduce a new superpixel algorithm, simple linear iterative clustering (SLIC), which adapts a k-means clustering approach to efficiently generate superpixels. Despite its simplicity, SLIC adheres to boundaries as well as or better than previous methods. At the same time, it is faster and more memory efficient, improves segmentation performance, and is straightforward to extend to supervoxel generation.

7,849 citations

Journal ArticleDOI
TL;DR: A review of the Pascal Visual Object Classes challenge from 2008-2012 and an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
Abstract: The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008---2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community's progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.

6,061 citations

Book
01 Jan 1995
TL;DR: This chapter discusses the development of Hardware and Software for Computer Graphics, and the design methodology of User-Computer Dialogues, which led to the creation of the Simple Raster Graphics Package.
Abstract: 1 Introduction Image Processing as Picture Analysis The Advantages of Interactive Graphics Representative Uses of Computer Graphics Classification of Applications Development of Hardware and Software for Computer Graphics Conceptual Framework for Interactive Graphics 2 Programming in the Simple Raster Graphics Package (SRGP)/ Drawing with SRGP/ Basic Interaction Handling/ Raster Graphics Features/ Limitations of SRGP/ 3 Basic Raster Graphics Algorithms for Drawing 2d Primitives Overview Scan Converting Lines Scan Converting Circles Scan Convertiing Ellipses Filling Rectangles Fillign Polygons Filling Ellipse Arcs Pattern Filling Thick Primiives Line Style and Pen Style Clipping in a Raster World Clipping Lines Clipping Circles and Ellipses Clipping Polygons Generating Characters SRGP_copyPixel Antialiasing 4 Graphics Hardware Hardcopy Technologies Display Technologies Raster-Scan Display Systems The Video Controller Random-Scan Display Processor Input Devices for Operator Interaction Image Scanners 5 Geometrical Transformations 2D Transformations Homogeneous Coordinates and Matrix Representation of 2D Transformations Composition of 2D Transformations The Window-to-Viewport Transformation Efficiency Matrix Representation of 3D Transformations Composition of 3D Transformations Transformations as a Change in Coordinate System 6 Viewing in 3D Projections Specifying an Arbitrary 3D View Examples of 3D Viewing The Mathematics of Planar Geometric Projections Implementing Planar Geometric Projections Coordinate Systems 7 Object Hierarchy and Simple PHIGS (SPHIGS) Geometric Modeling Characteristics of Retained-Mode Graphics Packages Defining and Displaying Structures Modeling Transformations Hierarchical Structure Networks Matrix Composition in Display Traversal Appearance-Attribute Handling in Hierarchy Screen Updating and Rendering Modes Structure Network Editing for Dynamic Effects Interaction Additional Output Features Implementation Issues Optimizing Display of Hierarchical Models Limitations of Hierarchical Modeling in PHIGS Alternative Forms of Hierarchical Modeling 8 Input Devices, Interaction Techniques, and Interaction Tasks Interaction Hardware Basic Interaction Tasks Composite Interaction Tasks 9 Dialogue Design The Form and Content of User-Computer Dialogues User-Interfaces Styles Important Design Considerations Modes and Syntax Visual Design The Design Methodology 10 User Interface Software Basic Interaction-Handling Models Windows-Management Systems Output Handling in Window Systems Input Handling in Window Systems Interaction-Technique Toolkits User-Interface Management Systems 11 Representing Curves and Surfaces Polygon Meshes Parametric Cubic Curves Parametric Bicubic Surfaces Quadric Surfaces 12 Solid Modeling Representing Solids Regularized Boolean Set Operations Primitive Instancing Sweep Representations Boundary Representations Spatial-Partitioning Representations Constructive Solid Geometry Comparison of Representations User Interfaces for Solid Modeling 13 Achromatic and Colored Light Achromatic Light Chromatic Color Color Models for Raster Graphics Reproducing Color Using Color in Computer Graphics 14 The Quest for Visual Realism Why Realism? 
Fundamental Difficulties Rendering Techniques for Line Drawings Rendering Techniques for Shaded Images Improved Object Models Dynamics Stereopsis Improved Displays Interacting with Our Other Senses Aliasing and Antialiasing 15 Visible-Surface Determination Functions of Two Variables Techniques for Efficient Visible-Surface Determination Algorithms for Visible-Line Determination The z-Buffer Algorithm List-Priority Algorithms Scan-Line Algorithms Area-Subdivision Algorithms Algorithms for Octrees Algorithms for Curved Surfaces Visible-Surface Ray Tracing 16 Illumination And Shading Illumination Modeling Shading Models for Polygons Surface Detail Shadows Transparency Interobject Reflections Physically Based Illumination Models Extended Light Sources Spectral Sampling Improving the Camera Model Global Illumination Algorithms Recursive Ray Tracing Radiosity Methods The Rendering Pipeline 17 Image Manipulation and Storage What Is an Image? Filtering Image Processing Geometric Transformations of Images Multipass Transformations Image Compositing Mechanisms for Image Storage Special Effects with Images Summary 18 Advanced Raster Graphic Architecture Simple Raster-Display System Display-Processor Systems Standard Graphics Pipeline Introduction to Multiprocessing Pipeline Front-End Architecture Parallel Front-End Architectures Multiprocessor Rasterization Architectures Image-Parallel Rasterization Object-Parallel Rasterization Hybrid-Parallel Rasterization Enhanced Display Capabilities 19 Advanced Geometric and Raster Algorithms Clipping Scan-Converting Primitives Antialiasing The Special Problems of Text Filling Algorithms Making copyPixel Fast The Shape Data Structure and Shape Algebra Managing Windows with bitBlt Page Description Languages 20 Advanced Modeling Techniques Extensions of Previous Techniques Procedural Models Fractal Models Grammar-Based Models Particle Systems Volume Rendering Physically Based Modeling Special Models for Natural and Synthetic Objects Automating Object Placement 21 Animation Conventional and Computer-Assisted Animation Animation Languages Methods of Controlling Animation Basic Rules of Animation Problems Peculiar to Animation Appendix: Mathematics for Computer Graphics Vector Spaces and Affine Spaces Some Standard Constructions in Vector Spaces Dot Products and Distances Matrices Linear and Affine Transformations Eigenvalues and Eigenvectors Newton-Raphson Iteration for Root Finding Bibliography Index 0201848406T04062001

5,692 citations

Journal ArticleDOI
01 Aug 2004
TL;DR: A more powerful, iterative version of the optimisation of the graph-cut approach is developed and the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result.
Abstract: The problem of efficient, interactive foreground/background segmentation in still images is of great practical importance in image editing. Classical image segmentation tools use either texture (colour) information, e.g. Magic Wand, or edge (contrast) information, e.g. Intelligent Scissors. Recently, an approach based on optimization by graph-cut has been developed which successfully combines both types of information. In this paper we extend the graph-cut approach in three respects. First, we have developed a more powerful, iterative version of the optimisation. Secondly, the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result. Thirdly, a robust algorithm for "border matting" has been developed to estimate simultaneously the alpha-matte around an object boundary and the colours of foreground pixels. We show that for moderately difficult examples the proposed method outperforms competitive tools.

5,670 citations

Frequently Asked Questions (1)
Q1. What have the authors contributed in "Fast object segmentation in unconstrained video"?

The authors present a technique for separating foreground objects from the background in a video.