Proceedings ArticleDOI

Fast Object Segmentation in Unconstrained Video

01 Dec 2013 - pp. 1777-1784
TL;DR: This method is fast, fully automatic, and makes minimal assumptions about the video, which enables handling essentially unconstrained settings, including rapidly moving background, arbitrary object motion and appearance, and non-rigid deformations and articulations.
Abstract: We present a technique for separating foreground objects from the background in a video. Our method is fast, fully automatic, and makes minimal assumptions about the video. This enables handling essentially unconstrained settings, including rapidly moving background, arbitrary object motion and appearance, and non-rigid deformations and articulations. In experiments on two datasets containing over 1400 video shots, our method outperforms a state-of-the-art background subtraction technique [4] as well as methods based on clustering point tracks [6, 18, 19]. Moreover, it performs comparably to recent video object segmentation methods based on object proposals [14, 16, 27], while being orders of magnitude faster.

Summary (2 min read)

1. Introduction

  • Video object segmentation is the task of separating foreground objects from the background in a video [14, 18, 26].
  • The latter scenario is more practically relevant, as a good solution would enable processing large amounts of video without human intervention.
  • The object can be static in a portion of the video and only part of it can be moving in some other portion (e.g. a cat starts running and then stops to lick its paws).
  • This second stage automatically bootstraps an appearance model based on the initial foreground estimate, and uses it to refine the spatial accuracy of the segmentation and to also segment the object in frames where it does not move (sec. 3.2).

3. Our approach

  • The goal of their work is to segment objects that move differently than their surroundings.
  • The authors' method has two main stages: (1) efficient initial foreground estimation (sec. 3.1), (2) foreground-background labelling refinement (sec. 3.2).
  • The authors compute the optical flow between pairs of subsequent frames and detect motion boundaries.
  • Due to inaccuracies in the flow estimation, the motion boundaries are typically incomplete and do not align perfectly with object boundaries (fig. 1f).
  • The goal of the second stage is to refine the spatial accuracy of the inside-outside maps and to segment the whole object in all frames.

3.1. Efficient initial foreground estimation

  • The authors begin by computing optical flow between pairs of subsequent frames (t, t + 1) using the state-of-the-art algorithm [6, 22].
  • The authors base their approach on motion boundaries, i.e. image points where the optical flow field changes abruptly.
  • The algorithm estimates whether a pixel is inside the object based on the point-in-polygon problem [12] from computational geometry: a ray starting from a point inside a closed curve intersects it an odd number of times.
  • A ray starting from a point outside the polygon, in contrast, will intersect it an even number of times (see the toy sketch after this list).
  • The authors' algorithm visits each pixel exactly once per direction while building S, and once to compute its vote, and is therefore linear in the number of pixels in the image.
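
As a toy illustration of the parity rule above, restricted to a single pixel and the two horizontal rays (the full method shoots 8 rays, takes a majority vote, and uses the faster integral-intersections structure of sec. 3.1), here is a minimal NumPy sketch; function and variable names are illustrative:

```python
import numpy as np

def horizontal_parity_votes(boundary, x, y):
    """Parity votes for pixel (x, y) from the two horizontal rays.

    boundary: (H, W) boolean motion-boundary map. A ray crosses the boundary
    once each time it enters a run of boundary pixels; an odd crossing count
    votes 'inside'.
    """
    def crossings(segment):
        if segment.size == 0:
            return 0
        prev = np.concatenate(([False], segment[:-1]))
        return int(np.count_nonzero(segment & ~prev))

    row = boundary[y].astype(bool)
    left = crossings(row[:x])       # ray from (x, y) towards the left border
    right = crossings(row[x + 1:])  # ray from (x, y) towards the right border
    return left % 2 == 1, right % 2 == 1
```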

3.2. Foreground-background labelling refinement

  • The authors formulate video segmentation as a pixel labelling problem with two labels (foreground and background).
  • The pairwise potentials V and W encourage spatial and temporal smoothness, respectively.
  • Two superpixels s^t_i, s^{t+1}_j in subsequent frames are connected if at least one pixel of s^t_i moves into s^{t+1}_j according to the optical flow (fig. 3).
  • Moreover, the appearance models are integrated over large image regions and over many frames, and therefore can robustly estimate the appearance of the object, despite faults in the inside-outside maps (see the sketch after this list).
  • In some frames (part of) the object may be static, and in others the inside-outside map might miss it because of incorrect optical flow estimation (fig. 4, middle row).
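
The excerpt below (sec. 3.2) does not reproduce the exact appearance model, so the following sketch shows one plausible way to bootstrap per-frame foreground/background colour models from the inside-outside maps, using Gaussian mixture models as in GrabCut; the model class, temporal window and component count are assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def appearance_scores(frames, inside_maps, t, window=5, n_components=5):
    """Fit fg/bg colour GMMs around frame t and score its pixels.

    frames: list of (H, W, 3) float RGB images; inside_maps: list of (H, W)
    boolean inside-outside maps. Pixels from frames t-window..t+window are
    pooled, integrating evidence over many frames as described above.
    Returns per-pixel foreground and background log-likelihoods for frame t.
    """
    lo, hi = max(0, t - window), min(len(frames), t + window + 1)
    fg = np.concatenate([frames[k][inside_maps[k]] for k in range(lo, hi)])
    bg = np.concatenate([frames[k][~inside_maps[k]] for k in range(lo, hi)])
    fg_gmm = GaussianMixture(n_components).fit(fg)
    bg_gmm = GaussianMixture(n_components).fit(bg)
    pixels = frames[t].reshape(-1, 3)
    H, W = inside_maps[t].shape
    return (fg_gmm.score_samples(pixels).reshape(H, W),
            bg_gmm.score_samples(pixels).reshape(H, W))
```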

4.2. YouTube-Objects

  • YouTube-Objects [19] is a large database collected from YouTube containing many videos for each of several object classes.
  • The objects undergo rapid movement, strong scale and viewpoint changes, nonrigid deformations, and are sometimes clipped by the image border (fig. 5).
  • Prest et al. [19] automatically select one segment per shot among those produced by [6], based on its appearance similarity to segments selected in other videos of the same object class, and on how likely it is to cover an object according to a class-generic objectness measure [2].
  • For evaluation the authors fit a bounding box to the top-ranked output segment (see the sketch below).
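
A minimal sketch of this bounding-box fitting step, assuming the top-ranked output segment is available as a NumPy boolean mask:

```python
import numpy as np

def mask_to_bbox(mask):
    """Fit a tight bounding box (x_min, y_min, x_max, y_max) to a boolean mask."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # empty segmentation: nothing to evaluate
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```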

4.3. Runtime

  • Given optical flow and superpixels, the authors' method takes 0.5 sec/frame on SegTrack (0.05 sec for the inside-outside maps and the rest for the foreground-background labelling refinement).
  • While [16, 27] neither report timings nor have code available to measure, their runtime must be > 120 sec/frame as they also use the object proposals [10].
  • High quality optical flow can be computed rapidly using [22] (< 1 sec/frame).
  • Currently, the authors use TurboPixels as superpixels [15] (1.5 sec/frame), but even faster alternatives are available [1].




Edinburgh Research Explorer
Fast Object Segmentation in Unconstrained Video
Citation for published version: Papazoglou, A & Ferrari, V 2013, 'Fast Object Segmentation in Unconstrained Video', in Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 1777-1784. https://doi.org/10.1109/ICCV.2013.223
Digital Object Identifier (DOI): 10.1109/ICCV.2013.223
Document Version: Peer reviewed version
Published In: Computer Vision (ICCV), 2013 IEEE International Conference on

Fast object segmentation in unconstrained video
Anestis Papazoglou
University of Edinburgh
Vittorio Ferrari
University of Edinburgh
Abstract
We present a technique for separating foreground objects
from the background in a video. Our method is fast, fully au-
tomatic, and makes minimal assumptions about the video.
This enables handling essentially unconstrained settings,
including rapidly moving background, arbitrary object mo-
tion and appearance, and non-rigid deformations and ar-
ticulations. In experiments on two datasets containing over
1400 video shots, our method outperforms a state-of-the-
art background subtraction technique [4] as well as meth-
ods based on clustering point tracks [6, 18, 19]. Moreover,
it performs comparably to recent video object segmentation
methods based on object proposals [14, 16, 27], while being
orders of magnitude faster.
1. Introduction
Video object segmentation is the task of separating fore-
ground objects from the background in a video [14, 18, 26].
This is important for a wide range of applications, includ-
ing providing spatial support for learning object class mod-
els [19], video summarization, and action recognition [5].
The task has been addressed by methods requiring a user
to annotate the object position in some frames [3, 20, 26,
24], and by fully automatic methods [14, 6, 18, 4], which
input just the video. The latter scenario is more practi-
cally relevant, as a good solution would enable processing
large amounts of video without human intervention. How-
ever, this task is very challenging, as the method is given no
knowledge about the object appearance, scale or position.
Moreover, the general unconstrained setting might include
rapidly moving backgrounds and objects, non-rigid defor-
mations and articulations (fig. 5).
In this paper we propose a technique for fully automatic
video object segmentation in unconstrained settings. Our
method is computationally efficient and makes minimal as-
sumptions about the video: the only requirement is for the
object to move differently from its surrounding background
in a good fraction of the video. The object can be static
in a portion of the video and only part of it can be mov-
ing in some other portion (e.g. a cat starts running and then
stops to lick its paws). Our method does not require a static
or slowly moving background (as opposed to classic back-
ground subtraction methods [9, 4, 7]). Moreover, it does
not assume the object follows a particular motion model,
nor that all its points move homogeneously (as opposed to
methods based on clustering point tracks [6, 17, 18]). This
is especially important when segmenting non-rigid or artic-
ulated objects such as animals (fig. 5).
The key new element in our approach is a rapid technique
to produce a rough estimate of which pixels are inside the
object based on motion boundaries in pairs of subsequent
frames (sec. 3.1). This initial estimate is then refined by
integrating information over the whole video with a spatio-
temporal extension of GrabCut [21, 14, 26]. This second
stage automatically bootstraps an appearance model based
on the initial foreground estimate, and uses it to refine the
spatial accuracy of the segmentation and to also segment the
object in frames where it does not move (sec. 3.2).
Through extensive experiments on over 1400 video shots
from two datasets [24, 19], we show that our method: (i)
handles fast moving backgrounds and objects exhibiting a
wide range of appearance, motions and deformations, in-
cluding non-rigid and articulated objects; (ii) outperforms
a state-of-the-art background subtraction technique [4] as
well as methods based on clustering point tracks [6, 18, 19];
(iii) is orders of magnitude faster than recent video object
segmentation methods based on object proposals [14, 16,
27]; (iv) outperforms the popular method [14] on the large
YouTube-Objects dataset [19]; (v) produces competitive re-
sults on the small SegTrack benchmark [24]. The source
code of our method is released at http://groups.inf.ed.ac.uk/calvin/software.html
2. Related Work
Interactive or supervised methods. Several methods for
video object segmentation require the user to manually an-
notate a few frames with object segmentations and then
propagate these annotations to all other frames [3, 20, 26].
Similarly, methods based on tracking [8, 24], require the
user to mark the object positions in the first frame and then
track them in the rest of the video.
Background subtraction. Classic background subtrac-
tion methods model the appearance of the background at

each pixel and consider pixels that change rapidly to be
foreground. These methods typically assume a stationary,
or slowly panning camera [9, 4, 7]. The background should
change slowly in order for the model to update safely with-
out generating false-positive foreground detections.
Clustering point tracks. Several automatic video seg-
mentation methods track points over several frames and
then cluster the resulting tracks based on pairwise [6, 17]
or triplet [18] similarity measures. The underlying assump-
tion induced by pairwise clustering [6, 17] is that all ob-
ject points move according to a single translation, while the
triplet model [18] assumes a single similarity transforma-
tion. These assumptions have trouble accommodating non-
rigid or articulated objects. Our method instead does not at-
tempt to cluster object points and does not assume any kind
of motion homogeneity. The object only needs to move suf-
ficiently differently from the background to generate mo-
tion boundaries along most of its physical boundary. On the
other hand, these methods [6, 17, 18] try to place multiple
objects in separate segments, whereas our method produces
a simpler binary segmentation (all objects vs background).
Ranking object proposals. The works [14, 16, 27] are
closely related to ours, as they tackle the very same task.
These methods are based on finding recurring object-like
segments, aided by recent techniques for measuring generic
object appearance [10], and achieve impressive results on
the SegTrack benchmark [24]. While the object proposal in-
frastructure is necessary to find out which image regions are
objects vs background, it makes these methods very slow
(minutes/frame). In our work instead, this goal is achieved
by a much simpler, faster process (sec. 3.1). In sec. 4 we
show that our method achieves comparable segmentation
accuracy to [14] while being two orders of magnitude faster.
Oversegmentation. Grundmann et al. [13] oversegment
a video into spatio-temporal regions of uniform motion and
appearance, analogous to still-image superpixels [15]. While
this is a useful basis for later processing, it does not solve
the video object segmentation task on its own.
3. Our approach
The goal of our work is to segment objects that move dif-
ferently than their surroundings. Our method has two main
stages: (1) efficient initial foreground estimation (sec. 3.1),
(2) foreground-background labelling refinement (sec. 3.2).
We now give a brief overview of these two stages, and then
present them in more detail in the rest of the section.
(1) Efficient initial foreground estimation. The goal of the first stage is to rapidly produce an initial estimate of which pixels might be inside the object based purely on motion. We compute the optical flow between pairs of subsequent frames and detect motion boundaries. Ideally, the motion boundaries will form a complete closed curve coinciding with the object boundaries. However, due to inaccuracies in the flow estimation, the motion boundaries are typically incomplete and do not align perfectly with object boundaries (fig. 1f). Also, occasionally false positive boundaries might be detected. We propose a novel, computationally efficient algorithm to robustly determine which pixels reside inside the moving object, taking into account all these sources of error (fig. 2c).

Figure 1. Motion boundaries. (a) Two input frames. (b) Optical flow $\vec{f}_p$; the hue of a pixel indicates its direction and the color saturation its velocity. (c) Motion boundaries $b^m_p$, based on the magnitude of the gradient of the optical flow. (d) Motion boundaries $b^\theta_p$, based on the difference in direction between a pixel and its neighbours. (e) Combined motion boundaries $b_p$. (f) Final, binary motion boundaries after thresholding, overlaid on the first frame.
(2) Foreground-background labelling refinement. As
they are purely based on motion boundaries, the inside-
outside maps produced by the first stage typically only ap-
proximately indicate where the object is. They do not accu-
rately delineate object outlines. Furthermore, (parts of) the
object might be static in some frames, or the inside-outside
maps may miss it due to incorrect optical flow estimation.
The goal of the second stage is to refine the spatial ac-
curacy of the inside-outside maps and to segment the whole
object in all frames. To achieve this, it integrates the infor-
mation from the inside-outside maps over all frames by (1)
encouraging the spatio-temporal smoothness of the output
segmentation over the whole video; (2) building dynamic
appearance models of the object and background under the
assumption that they change smoothly over time. Incorporating appearance cues is key to achieving a finer level of detail, compared to using only motion. Moreover, after learning the object appearance in the frames where the inside-outside maps found it, the second stage uses it to segment the object in frames where it was initially missed (e.g. because it is static).

Figure 2. Inside-outside maps. (Left) The ray-casting observation. Any ray originating inside a closed curve intersects it an odd number of times. Any ray originating outside intersects it an even number of times. This holds for any number of closed curves in the image. (Middle) Illustration of the integral intersections data structure S for the horizontal direction. The number of intersections for the ray going from pixel x to the left border can be easily computed as $X_{\mathrm{left}}(x, y) = S(x-1, y) = 1$, and for the right ray as $X_{\mathrm{right}}(x, y) = S(W, y) - S(x, y) = 1$. In this case, both rays vote for x being inside the object. (Right) The output inside-outside map $M^t$.
3.1. Efficient initial foreground estimation
Optical flow. We begin by computing optical flow be-
tween pairs of subsequent frames (t, t + 1) using the state-
of-the-art algorithm [6, 22]. It supports large displacements
between frames and has a computationally very efficient
GPU implementation [22] (fig. 1a+b).
Motion boundaries. We base our approach on motion
boundaries, i.e. image points where the optical flow field
changes abruptly. Motion boundaries reveal the location of
occlusion boundaries, which very often correspond to phys-
ical object boundaries [23].
Let $\vec{f}_p$ be the optical flow vector at pixel $p$. The simplest way to estimate motion boundaries is by computing the magnitude of the gradient of the optical flow field:

$$b^m_p = 1 - \exp\!\left(-\lambda^m \|\nabla \vec{f}_p\|\right) \qquad (1)$$

where $b^m_p \in [0, 1]$ is the strength of the motion boundary at pixel $p$ and $\lambda^m$ is a parameter controlling the steepness of the function.

While this measure correctly detects boundaries at rapidly moving pixels, where $b^m_p$ is close to 1, it is unreliable for pixels with intermediate $b^m_p$ values around 0.5, which could be explained either as boundaries or as errors due to inaccuracies in the optical flow (fig. 1c). To disambiguate between those two cases, we compute a second estimator $b^\theta_p \in [0, 1]$, based on the difference in direction between the motion of pixel $p$ and its neighbours $\mathcal{N}$:

$$b^\theta_p = 1 - \exp\!\left(-\lambda^\theta \max_{q \in \mathcal{N}} \delta\theta^2_{p,q}\right) \qquad (2)$$

where $\delta\theta_{p,q}$ denotes the angle between $\vec{f}_p$ and $\vec{f}_q$. The idea is that if $p$ is moving in a different direction than all its neighbours, it is likely to be on a motion boundary. This estimator can correctly detect boundaries even when the object is moving at a modest velocity, as long as it goes in a different direction than the background. However, it tends to produce false positives in static image regions, as the direction of the optical flow is noisy at points with little or no motion (fig. 1d).

As the two measures above have complementary failure modes, we combine them into a measure that is more reliable than either alone (fig. 1e):

$$b_p = \begin{cases} b^m_p & \text{if } b^m_p > T \\ b^m_p \cdot b^\theta_p & \text{if } b^m_p \le T \end{cases} \qquad (3)$$

where $T$ is a high threshold, above which $b^m_p$ is considered reliable on its own. As a last step we threshold $b_p$ at 0.5 to produce a binary motion boundary labelling (fig. 1f).
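
For concreteness, a minimal NumPy sketch of eqs. (1)-(3) follows, assuming a dense flow field of shape (H, W, 2). The choice of gradient norm, the 4-connected neighbourhood, and the parameter values lam_m, lam_theta and T are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def motion_boundaries(flow, lam_m=0.5, lam_theta=2.0, T=0.7):
    """Binary motion boundaries from dense optical flow (eqs. 1-3)."""
    fx, fy = flow[..., 0], flow[..., 1]

    # Eq. (1): gradient-magnitude cue b^m_p, taken here as the Frobenius norm
    # of the flow Jacobian; strong wherever the flow changes abruptly.
    dfx = np.stack(np.gradient(fx), axis=-1)
    dfy = np.stack(np.gradient(fy), axis=-1)
    grad_mag = np.sqrt((dfx ** 2).sum(-1) + (dfy ** 2).sum(-1))
    b_m = 1.0 - np.exp(-lam_m * grad_mag)

    # Eq. (2): direction cue b^theta_p, the maximum squared angular difference
    # to the 4-connected neighbours of each pixel.
    angle = np.arctan2(fy, fx)
    max_dtheta2 = np.zeros_like(angle)
    for shift in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        neigh = np.roll(angle, shift, axis=(0, 1))
        dtheta = np.angle(np.exp(1j * (angle - neigh)))  # wrap to [-pi, pi]
        max_dtheta2 = np.maximum(max_dtheta2, dtheta ** 2)
    b_theta = 1.0 - np.exp(-lam_theta * max_dtheta2)

    # Eq. (3): trust b^m alone above the threshold T, otherwise gate it by the
    # direction cue; finally binarise at 0.5.
    b = np.where(b_m > T, b_m, b_m * b_theta)
    return b > 0.5
```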
Inside-outside maps. The produced motion boundaries
typically do not completely cover the whole object bound-
ary. Moreover, there might be false positive boundaries, due
to inaccuracy of the optical flow estimation. We present
here a computationally efficient algorithm to robustly esti-
mate which pixels are inside the object while taking into
account these sources of error.
The algorithm estimates whether a pixel is inside the
object based on the point-in-polygon problem [12] from
computational geometry. The key observation is that any
ray starting from a point inside the polygon (or any closed
curve) will intersect the boundary of the polygon an odd
number of times. Instead, a ray starting from a point out-
side the polygon will intersect it an even number of times
(figure 2a). Since the motion boundaries are typically in-
complete, a single ray is not sufficient to determine whether
a pixel lies inside the object. Instead, we get a robust es-
timate by shooting 8 rays spaced by 45 degrees. Each ray
casts a vote on whether the pixel is inside or outside. The
final inside-outside decision is taken by majority rule, i.e. a
pixel with 5 or more rays intersecting the boundaries an odd
number of times is deemed inside.
Realizing the above idea with a naive algorithm would be computationally expensive (i.e. quadratic in the number of pixels in the image). We propose an efficient algorithm which we call integral intersections, inspired by the use of integral images in [25]. The key idea is to create a special data structure that enables very fast inside-outside evaluation by massively reusing the computational effort that went into creating the data structure.

Figure 3. Example connectivity $\mathcal{E}_t$ over time. Superpixel $s^t_1$ contains pixels that lead to $s^{t+1}_1, s^{t+1}_2, s^{t+1}_3, s^{t+1}_4$. As an example, the weight $\phi(s^t_1, s^{t+1}_1)$ is 0.28 (all others are omitted for clarity).
For each direction (horizontal, vertical and the two diag-
onals) we create a matrix S of the same size W × H as the
image. An entry S(x, y) of this matrix indicates the num-
ber of boundary intersections along the line going from the
image border up to pixel (x, y). For simplicity, we explain
here how to build S for the horizontal direction. The algo-
rithm for the other directions is analogous. The algorithm
builds S one line y at a time. The first pixel (1, y), at the left
image border, has value S(1, y) = 0. We then move right-
wards one pixel at a time and increment S(x, y) by 1 each
time we transition from a non-boundary pixel to a boundary
pixel. This results in a line S(:, y) whose entries count the
number of boundary intersections (fig. 2b.).
After computing S for all horizontal lines, the data structure is ready. We can now determine the number of intersections X for both horizontal rays (left-going and right-going) emanating from a pixel (x, y) in constant time:

$$X_{\mathrm{left}}(x, y) = S(x - 1, y) \qquad (4)$$
$$X_{\mathrm{right}}(x, y) = S(W, y) - S(x, y) \qquad (5)$$

where W is the width of the image, i.e. the rightmost pixel in a line (fig. 2b).
Our algorithm visits each pixel exactly once per direc-
tion while building S, and once to compute its vote, and is
therefore linear in the number of pixels in the image. The
algorithm is very fast in practice and takes about 0.1s per
frame of a HD video (1280x720 pixels) on a modest CPU
(Intel Core i7 at 2.0GHz).
For each video frame t, we apply the algorithm on all 8 directions and use majority voting to decide which pixels are inside, resulting in an inside-outside map $M^t$ (fig. 2c).
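
For illustration, a sketch of the integral-intersections construction for the horizontal direction only (the other directions are analogous, giving 8 rays in total). This is a reimplementation from the description above, not the authors' released code:

```python
import numpy as np

def horizontal_ray_votes(boundary):
    """Inside-votes of the two horizontal rays for every pixel (eqs. 4-5).

    boundary: (H, W) boolean motion-boundary map.
    Returns two (H, W) boolean arrays: vote of the left-going and of the
    right-going ray (True = odd number of boundary crossings = 'inside').
    """
    H, W = boundary.shape
    # S(x, y): crossings counted along the line from the left border to column
    # x, incremented on each transition from non-boundary to boundary pixel.
    entering = boundary & ~np.roll(boundary, 1, axis=1)
    entering[:, 0] = False                     # first pixel starts at 0
    S = np.cumsum(entering, axis=1)

    # X_left(x, y) = S(x - 1, y);  X_right(x, y) = S(W, y) - S(x, y)
    X_left = np.concatenate([np.zeros((H, 1), dtype=S.dtype), S[:, :-1]], axis=1)
    X_right = S[:, [-1]] - S
    return X_left % 2 == 1, X_right % 2 == 1

# The vertical and diagonal directions are handled the same way (e.g. on the
# transposed map); a pixel is labelled inside when at least 5 of its 8 rays
# report an odd crossing count.
```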
3.2. Foreground-background labelling refinement
We formulate video segmentation as a pixel labelling
problem with two labels (foreground and background). We
oversegment each frame into superpixels $\mathcal{S}^t$ [15], which greatly reduces the computational cost and memory usage, enabling us to segment much longer videos.
Each superpixel $s^t_i \in \mathcal{S}^t$ can take a label $l^t_i \in \{0, 1\}$. A labelling $L = \{l^t_i\}_{t,i}$ of all superpixels in all frames represents a segmentation of the video. Similarly to other segmentation works [14, 21, 26], we define an energy function to evaluate a labelling:

$$E(L) = \sum_{t,i} A^t_i(l^t_i) + \alpha_1 \sum_{t,i} L^t_i(l^t_i) + \alpha_2 \sum_{(i,j,t) \in \mathcal{E}_s} V^t_{ij}(l^t_i, l^t_j) + \alpha_3 \sum_{(i,j,t) \in \mathcal{E}_t} W^t_{ij}(l^t_i, l^{t+1}_j) \qquad (6)$$

$A^t$ is a unary potential evaluating how likely a superpixel is to be foreground or background according to the appearance model of frame $t$. The second unary potential $L^t$ is based on a location prior model encouraging foreground labellings in areas where independent motion has been observed. As we explain in detail later, we derive both the appearance model and the location prior parameters from the inside-outside maps $M^t$. The pairwise potentials $V$ and $W$ encourage spatial and temporal smoothness, respectively. The scalars $\alpha$ weight the various terms.

The output segmentation is the labelling that minimizes (6):

$$L^* = \operatorname*{argmin}_L E(L) \qquad (7)$$
As E is a binary pairwise energy function with submodular
pairwise potentials, we minimize it exactly with graph-cuts.
Next we use the resulting segmentation to re-estimate the
appearance models and iterate between these two steps, as
in GrabCut [21]. Below we describe the potentials in detail.
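
Before the individual potentials are described, a minimal sketch of the minimization in eq. (7) over the superpixel graph is given below. It uses the third-party PyMaxflow library as the s-t min-cut solver, which is an assumption (any solver for submodular binary pairwise energies would do); the unary costs and pairwise weights are assumed precomputed, and all names are illustrative.

```python
import numpy as np
import maxflow  # PyMaxflow: pip install PyMaxflow

def minimize_labelling(unary_fg, unary_bg, pairwise_edges):
    """Exact graph-cut minimization of a binary pairwise energy (eq. 7).

    unary_fg[i] / unary_bg[i]: cost of giving superpixel i the foreground /
    background label (the A and L terms of eq. 6, already weighted and summed).
    pairwise_edges: iterable of (i, j, w), w >= 0, covering both the spatial
    (V) and temporal (W) Potts terms.
    Returns a boolean array, True = foreground.
    """
    n = len(unary_fg)
    g = maxflow.Graph[float]()
    nodes = g.add_nodes(n)
    for i in range(n):
        # t-links: a node in the source segment pays the sink capacity and
        # vice versa, so the source segment plays the role of foreground here.
        g.add_tedge(nodes[i], unary_bg[i], unary_fg[i])
    for i, j, w in pairwise_edges:
        # n-links: Potts penalty paid when the two superpixels disagree.
        g.add_edge(nodes[i], nodes[j], w, w)
    g.maxflow()
    return np.array([g.get_segment(nodes[i]) == 0 for i in range(n)])
```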
Smoothness V, W. The spatial smoothness potential $V$ is defined over the edge set $\mathcal{E}_s$, containing pairs of spatially connected superpixels. Two superpixels are spatially connected if they are in the same frame and are adjacent.

The temporal smoothness potential $W$ is defined over the edge set $\mathcal{E}_t$, containing pairs of temporally connected superpixels. Two superpixels $s^t_i$, $s^{t+1}_j$ in subsequent frames are connected if at least one pixel of $s^t_i$ moves into $s^{t+1}_j$ according to the optical flow (fig. 3).
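
A sketch of how the temporal edge set $\mathcal{E}_t$ can be built from the flow is given below, assuming integer superpixel label maps for frames t and t+1 and the forward flow between them. Taking the weight phi as the fraction of pixels of $s^t_i$ that land in $s^{t+1}_j$ matches the example in fig. 3 but is an assumption about the exact normalisation.

```python
import numpy as np
from collections import Counter

def temporal_edges(sp_t, sp_t1, flow):
    """Build the temporal edge set between frames t and t+1 via optical flow.

    sp_t, sp_t1: (H, W) integer superpixel label maps of frames t and t+1.
    flow: (H, W, 2) forward flow (dx, dy) from frame t to frame t+1.
    Returns {(i, j): phi} for every pair of connected superpixels, with phi
    the fraction of pixels of superpixel i landing in superpixel j.
    """
    H, W = sp_t.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xt = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    yt = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    pair_counts = Counter(zip(sp_t.ravel(), sp_t1[yt, xt].ravel()))
    sizes = Counter(sp_t.ravel())
    return {(int(i), int(j)): c / sizes[i] for (i, j), c in pair_counts.items()}
```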
The functions $V$, $W$ are standard contrast-modulated Potts potentials [21, 26, 14]:

$$V^t_{ij}(l^t_i, l^t_j) = \mathrm{dis}(s^t_i, s^t_j)^{-1}\,[l^t_i \neq l^t_j]\,\exp\!\left(-\beta\,\mathrm{col}(s^t_i, s^t_j)^2\right) \qquad (8)$$

$$W^t_{ij}(l^t_i, l^{t+1}_j) = \phi(s^t_i, s^{t+1}_j)\,[l^t_i \neq l^{t+1}_j]\,\exp\!\left(-\beta\,\mathrm{col}(s^t_i, s^{t+1}_j)^2\right) \qquad (9)$$

where dis is the Euclidean distance between the centres of two superpixels and col is the difference between their average RGB color. The factor that differs from the standard
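
As a concrete reading of eqs. (8)-(9), the sketch below computes the edge weights that multiply the label-disagreement indicator; superpixel centres and mean RGB colours are assumed precomputed, the squared Euclidean colour difference is an interpretation of col(., .)^2, and the value of beta is illustrative.

```python
import numpy as np

def spatial_weight(centre_i, centre_j, rgb_i, rgb_j, beta=1e-3):
    """V in eq. (8): inverse-distance, contrast-modulated Potts weight."""
    dis = np.linalg.norm(np.asarray(centre_i, float) - np.asarray(centre_j, float))
    col2 = np.sum((np.asarray(rgb_i, float) - np.asarray(rgb_j, float)) ** 2)
    return np.exp(-beta * col2) / dis

def temporal_weight(phi_ij, rgb_i, rgb_j, beta=1e-3):
    """W in eq. (9): flow-overlap phi, contrast-modulated Potts weight."""
    col2 = np.sum((np.asarray(rgb_i, float) - np.asarray(rgb_j, float)) ** 2)
    return phi_ij * np.exp(-beta * col2)

# These weights multiply the indicator [l_i != l_j]; in the graph-cut
# construction of eq. (7) they appear directly as the n-link capacities.
```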

Citations

Proceedings ArticleDOI
27 Jun 2016
TL;DR: This work presents a new benchmark dataset and evaluation methodology for the area of video object segmentation, named DAVIS (Densely Annotated VIdeo Segmentation), and provides a comprehensive analysis of several state-of-the-art segmentation approaches using three complementary metrics.
Abstract: Over the years, datasets and benchmarks have proven their fundamental importance in computer vision research, enabling targeted progress and objective comparisons in many fields. At the same time, legacy datasets may impend the evolution of a field due to saturated algorithm performance and the lack of contemporary, high quality data. In this work we present a new benchmark dataset and evaluation methodology for the area of video object segmentation. The dataset, named DAVIS (Densely Annotated VIdeo Segmentation), consists of fifty high quality, Full HD video sequences, spanning multiple occurrences of common video object segmentation challenges such as occlusions, motionblur and appearance changes. Each video is accompanied by densely annotated, pixel-accurate and per-frame ground truth segmentation. In addition, we provide a comprehensive analysis of several state-of-the-art segmentation approaches using three complementary metrics that measure the spatial extent of the segmentation, the accuracy of the silhouette contours and the temporal coherence. The results uncover strengths and weaknesses of current approaches, opening up promising directions for future works.

1,656 citations


Cites background or methods from "Fast Object Segmentation in Unconst..."

  • ...Interestingly the assumption of a completely closed motion boundary curve coinciding with the object contours can robustly accommodate background deformations (FST)....


  • ...Unsupervised approaches have historically targeted over-segmentation [21, 51] or motion segmentation [5, 18] and only recently automatic methods for foregroundbackground separation have been proposed [13, 25, 33, 43, 45, 52]....


  • ...Aiming at detecting per-frame indicators of potential foreground object locations, KEY [24], SAL [43], and FST [33] try to determine prior information sparsely distributed over the video sequence....


  • ...Within the unsupervised category we evaluate the performance of NLC [13], FST [33], SAL [43], TRC [18], MSG [5] and CVOS [45]....


  • ...The dataset is accompanied with a comprehensive evaluation of several state-of-the-art approaches [5, 7, 13, 14, 18, 21, 24, 33, 35, 40, 43, 45]....


Proceedings ArticleDOI
01 Jan 2017
TL;DR: One-shot video object segmentation (OSVOS) as mentioned in this paper is based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence.
Abstract: This paper tackles the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame. We present One-Shot Video Object Segmentation (OSVOS), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one-shot). Although all frames are processed independently, the results are temporally coherent and stable. We perform experiments on two annotated video segmentation databases, which show that OSVOS is fast and improves the state of the art by a significant margin (79.8% vs 68.0%).

573 citations

Posted Content
TL;DR: One-shot video object segmentation (OSVOS) as mentioned in this paper is based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence.
Abstract: This paper tackles the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame. We present One-Shot Video Object Segmentation (OSVOS), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one-shot). Although all frames are processed independently, the results are temporally coherent and stable. We perform experiments on two annotated video segmentation databases, which show that OSVOS is fast and improves the state of the art by a significant margin (79.8% vs 68.0%).

523 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: This work introduces an unsupervised, geodesic distance based, salient video object segmentation method that incorporates saliency as prior for object via the computation of robust geodesIC measurement and builds global appearance models for foreground and background.
Abstract: We introduce an unsupervised, geodesic distance based, salient video object segmentation method. Unlike traditional methods, our method incorporates saliency as prior for object via the computation of robust geodesic measurement. We consider two discriminative visual features: spatial edges and temporal motion boundaries as indicators of foreground object locations. We first generate framewise spatiotemporal saliency maps using geodesic distance from these indicators. Building on the observation that foreground areas are surrounded by the regions with high spatiotemporal edge values, geodesic distance provides an initial estimation for foreground and background. Then, high-quality saliency results are produced via the geodesic distances to background regions in the subsequent frames. Through the resulting saliency maps, we build global appearance models for foreground and background. By imposing motion continuity, we establish a dynamic location model for each frame. Finally, the spatiotemporal saliency maps, appearance models and dynamic location models are combined into an energy minimization framework to attain both spatially and temporally coherent object segmentation. Extensive quantitative and qualitative experiments on benchmark video dataset demonstrate the superiority of the proposed method over the state-of-the-art algorithms.

516 citations


Cites methods from "Fast Object Segmentation in Unconst..."

  • ...We further carried out experiments on SegTrack v2 dataset [18] and 12 groups of videos randomly selected from Youtube Objects and compared our method with [32, 23, 15, 6] as well....


  • ...The methods in [32, 23, 15, 6, 19, 4, 22] and our method are unsupervised....


  • ...The average per-frame pixel error rate compared with these methods [32, 23, 15, 6, 19, 4, 22, 29, 9] for each video from SegTrack dataset [29] are summarized in Table 1....


  • ...method Ours [32] [23] [15] [6] [19] [4] [22] [29] [9]...


References
Proceedings ArticleDOI
07 Jul 2001
TL;DR: A new image representation called the “Integral Image” is introduced which allows the features used by the detector to be computed very quickly and a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions.
Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the "Integral Image" which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algo- rithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a "cascade" which allows back- ground regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection perfor- mance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.

10,592 citations

Journal ArticleDOI
TL;DR: A new superpixel algorithm is introduced, simple linear iterative clustering (SLIC), which adapts a k-means clustering approach to efficiently generate superpixels and is faster and more memory efficient, improves segmentation performance, and is straightforward to extend to supervoxel generation.
Abstract: Computer vision applications have come to rely increasingly on superpixels in recent years, but it is not always clear what constitutes a good superpixel algorithm. In an effort to understand the benefits and drawbacks of existing methods, we empirically compare five state-of-the-art superpixel algorithms for their ability to adhere to image boundaries, speed, memory efficiency, and their impact on segmentation performance. We then introduce a new superpixel algorithm, simple linear iterative clustering (SLIC), which adapts a k-means clustering approach to efficiently generate superpixels. Despite its simplicity, SLIC adheres to boundaries as well as or better than previous methods. At the same time, it is faster and more memory efficient, improves segmentation performance, and is straightforward to extend to supervoxel generation.

7,849 citations

Journal ArticleDOI
TL;DR: A review of the Pascal Visual Object Classes challenge from 2008-2012 and an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
Abstract: The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008---2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community's progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.

6,061 citations

Book
01 Jan 1995
TL;DR: This chapter discusses the development of Hardware and Software for Computer Graphics, and the design methodology of User-Computer Dialogues, which led to the creation of the Simple Raster Graphics Package.
Abstract: 1 Introduction Image Processing as Picture Analysis The Advantages of Interactive Graphics Representative Uses of Computer Graphics Classification of Applications Development of Hardware and Software for Computer Graphics Conceptual Framework for Interactive Graphics 2 Programming in the Simple Raster Graphics Package (SRGP)/ Drawing with SRGP/ Basic Interaction Handling/ Raster Graphics Features/ Limitations of SRGP/ 3 Basic Raster Graphics Algorithms for Drawing 2d Primitives Overview Scan Converting Lines Scan Converting Circles Scan Convertiing Ellipses Filling Rectangles Fillign Polygons Filling Ellipse Arcs Pattern Filling Thick Primiives Line Style and Pen Style Clipping in a Raster World Clipping Lines Clipping Circles and Ellipses Clipping Polygons Generating Characters SRGP_copyPixel Antialiasing 4 Graphics Hardware Hardcopy Technologies Display Technologies Raster-Scan Display Systems The Video Controller Random-Scan Display Processor Input Devices for Operator Interaction Image Scanners 5 Geometrical Transformations 2D Transformations Homogeneous Coordinates and Matrix Representation of 2D Transformations Composition of 2D Transformations The Window-to-Viewport Transformation Efficiency Matrix Representation of 3D Transformations Composition of 3D Transformations Transformations as a Change in Coordinate System 6 Viewing in 3D Projections Specifying an Arbitrary 3D View Examples of 3D Viewing The Mathematics of Planar Geometric Projections Implementing Planar Geometric Projections Coordinate Systems 7 Object Hierarchy and Simple PHIGS (SPHIGS) Geometric Modeling Characteristics of Retained-Mode Graphics Packages Defining and Displaying Structures Modeling Transformations Hierarchical Structure Networks Matrix Composition in Display Traversal Appearance-Attribute Handling in Hierarchy Screen Updating and Rendering Modes Structure Network Editing for Dynamic Effects Interaction Additional Output Features Implementation Issues Optimizing Display of Hierarchical Models Limitations of Hierarchical Modeling in PHIGS Alternative Forms of Hierarchical Modeling 8 Input Devices, Interaction Techniques, and Interaction Tasks Interaction Hardware Basic Interaction Tasks Composite Interaction Tasks 9 Dialogue Design The Form and Content of User-Computer Dialogues User-Interfaces Styles Important Design Considerations Modes and Syntax Visual Design The Design Methodology 10 User Interface Software Basic Interaction-Handling Models Windows-Management Systems Output Handling in Window Systems Input Handling in Window Systems Interaction-Technique Toolkits User-Interface Management Systems 11 Representing Curves and Surfaces Polygon Meshes Parametric Cubic Curves Parametric Bicubic Surfaces Quadric Surfaces 12 Solid Modeling Representing Solids Regularized Boolean Set Operations Primitive Instancing Sweep Representations Boundary Representations Spatial-Partitioning Representations Constructive Solid Geometry Comparison of Representations User Interfaces for Solid Modeling 13 Achromatic and Colored Light Achromatic Light Chromatic Color Color Models for Raster Graphics Reproducing Color Using Color in Computer Graphics 14 The Quest for Visual Realism Why Realism? 
Fundamental Difficulties Rendering Techniques for Line Drawings Rendering Techniques for Shaded Images Improved Object Models Dynamics Stereopsis Improved Displays Interacting with Our Other Senses Aliasing and Antialiasing 15 Visible-Surface Determination Functions of Two Variables Techniques for Efficient Visible-Surface Determination Algorithms for Visible-Line Determination The z-Buffer Algorithm List-Priority Algorithms Scan-Line Algorithms Area-Subdivision Algorithms Algorithms for Octrees Algorithms for Curved Surfaces Visible-Surface Ray Tracing 16 Illumination And Shading Illumination Modeling Shading Models for Polygons Surface Detail Shadows Transparency Interobject Reflections Physically Based Illumination Models Extended Light Sources Spectral Sampling Improving the Camera Model Global Illumination Algorithms Recursive Ray Tracing Radiosity Methods The Rendering Pipeline 17 Image Manipulation and Storage What Is an Image? Filtering Image Processing Geometric Transformations of Images Multipass Transformations Image Compositing Mechanisms for Image Storage Special Effects with Images Summary 18 Advanced Raster Graphic Architecture Simple Raster-Display System Display-Processor Systems Standard Graphics Pipeline Introduction to Multiprocessing Pipeline Front-End Architecture Parallel Front-End Architectures Multiprocessor Rasterization Architectures Image-Parallel Rasterization Object-Parallel Rasterization Hybrid-Parallel Rasterization Enhanced Display Capabilities 19 Advanced Geometric and Raster Algorithms Clipping Scan-Converting Primitives Antialiasing The Special Problems of Text Filling Algorithms Making copyPixel Fast The Shape Data Structure and Shape Algebra Managing Windows with bitBlt Page Description Languages 20 Advanced Modeling Techniques Extensions of Previous Techniques Procedural Models Fractal Models Grammar-Based Models Particle Systems Volume Rendering Physically Based Modeling Special Models for Natural and Synthetic Objects Automating Object Placement 21 Animation Conventional and Computer-Assisted Animation Animation Languages Methods of Controlling Animation Basic Rules of Animation Problems Peculiar to Animation Appendix: Mathematics for Computer Graphics Vector Spaces and Affine Spaces Some Standard Constructions in Vector Spaces Dot Products and Distances Matrices Linear and Affine Transformations Eigenvalues and Eigenvectors Newton-Raphson Iteration for Root Finding Bibliography Index 0201848406T04062001

5,692 citations

Journal ArticleDOI
01 Aug 2004
TL;DR: A more powerful, iterative version of the optimisation of the graph-cut approach is developed and the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result.
Abstract: The problem of efficient, interactive foreground/background segmentation in still images is of great practical importance in image editing. Classical image segmentation tools use either texture (colour) information, e.g. Magic Wand, or edge (contrast) information, e.g. Intelligent Scissors. Recently, an approach based on optimization by graph-cut has been developed which successfully combines both types of information. In this paper we extend the graph-cut approach in three respects. First, we have developed a more powerful, iterative version of the optimisation. Secondly, the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result. Thirdly, a robust algorithm for "border matting" has been developed to estimate simultaneously the alpha-matte around an object boundary and the colours of foreground pixels. We show that for moderately difficult examples the proposed method outperforms competitive tools.

5,670 citations

Frequently Asked Questions (1)
Q1. What have the authors contributed in "Fast object segmentation in unconstrained video"?

The authors present a technique for separating foreground objects from the background in a video.