Learning Layered Motion Segmentations of Video
M. Pawan Kumar, P.H.S. Torr (Dept. of Computing, Oxford Brookes University) and A. Zisserman (Dept. of Engineering Science, University of Oxford)
{pkmudigonda,philiptorr}@brookes.ac.uk, az@robots.ox.ac.uk
http://cms.brookes.ac.uk/research/visiongroup http://www.robots.ox.ac.uk/~vgg
Abstract
We present an unsupervised approach for learning a layered representation of a scene
from a video for motion segmentation. Our method is applicable to any video contain-
ing piecewise parametric motion. The learnt model is a composition of layers, which
consist of one or more segments. The shape of each segment is represented using a
binary matte and its appearance is given by the RGB value for each point belonging to
the matte. Included in the model are the effects of image projection, lighting, and mo-
tion blur. Furthermore, spatial continuity is explicitly modelled resulting in contiguous
segments. Unlike previous approaches, our method does not use reference frame(s) for
initialization. The two main contributions of our method are: (i) A novel algorithm
for obtaining the initial estimate of the model by dividing the scene into rigidly mov-
ing components using efficient loopy belief propagation; and (ii) Refining the initial
estimate using αβ-swap and α-expansion algorithms, which guarantee a strong local minimum. Results are presented on several classes of objects with different types of camera motion, e.g. videos of a human walking shot with static or translating cameras. We
compare our method with the state of the art and demonstrate significant improvements.
1 Introduction
We present an approach for learning a layered representation from a video for motion
segmentation. Our method is applicable to any video containing piecewise parametric
motion, e.g. piecewise homography, without any restrictions on camera motion. It also
accounts for the effects of occlusion, lighting and motion blur. For example, Fig. 1
shows one such sequence where a layered representation can be learnt and used to
segment the walking person from the static background.
Many different approaches for motion segmentation have been reported in the lit-
erature. Important issues are: (i) whether the methods are feature-based or dense; (ii)
whether they model occlusion; (iii) whether they model spatial continuity; (iv) whether
they apply to multiple frames (i.e. a video sequence); and (v) whether they are inde-
pendent of which frames are used for initialization.

Figure 1: Four intermediate frames of a 31 frame video sequence of a person walking
sideways where the camera is static. Given the sequence, the model which best de-
scribes the person and the background is learnt in an unsupervised manner. Note that
the arm always partially occludes the torso.
A comprehensive survey of feature-based methods can be found in [19]. Most of these methods rely on computing a homography corresponding to the motion of a planar
object. This limits their application to a restricted set of scenes and motions. Dense
methods [2, 6, 18, 22] overcome this deficiency by computing pixel-wise motion. How-
ever, many dense approaches do not model occlusion which can lead to overcounting
of data when obtaining the segmentation, e.g. see [2, 6].
Chief amongst the methods which do model occlusion are those that use a layered
representation [21]. One such approach, described in [22], divides a scene into (almost)
planar regions for occlusion reasoning. Torr et al. [18] extend this representation by
allowing for parallax disparity. However, these methods rely on a keyframe for the
initial estimation. Other approaches [8, 23] overcome this problem by using layered
flexible sprites. A flexible sprite is a 2D appearance map and matte (mask) of an object
which is allowed to deform from frame to frame according to pure translation. Winn
et al. [25] extend the model to handle affine deformations. However, these methods do not enforce spatial continuity, i.e. they assume each pixel is labelled independently of its neighbours. This leads to non-contiguous segmentation when the foreground
and background are similar in appearance (see Fig. 19(b)). Most of these approaches,
namely those described in [6, 8, 18, 21, 22], use either EM or variational methods for
learning the parameters of the model which makes them prone to local minima.
Wills et al. [24] noted the importance of spatial continuity when learning the re-
gions in a layered representation. Given an initial estimate, they learn the shape of
the regions using the powerful α-expansion algorithm [5], which guarantees a strong local minimum. However, their method does not deal with more than 2 views. In our
earlier work [10], we described a similar motion segmentation approach to [24] for a
video sequence. Like [16], this automatically learns a model of an object. However,
the method depends on a keyframe to obtain an initial estimate of the model. This has
the disadvantage that points not visible in the keyframe are not included in the model,
which leads to incomplete segmentation.
In this paper, we present a model which does not suffer from the problems men-
tioned above, i.e. (i) it models occlusion; (ii) it models spatial continuity; (iii) it handles
multiple frames; and (iv) it is learnt independent of keyframes. An initial estimate of
the model is obtained based on a method to estimate image motion with discontinuities
using a new efficient loopy belief propagation algorithm. Despite the use of piecewise
parametric motion (similar to feature-based approaches), this allows us to learn the

Figure 2: The top row shows the various layers of a human model (the latent image in
this case). Each layer consists of one or more segments whose appearance is shown.
The shape of each segment is represented by a binary matte (not shown in the image).
Any frame j can be generated using this representation by assigning appropriate values
to its parameters and latent variables. The background is not shown.
model for a wide variety of scenes. Given the initial estimate, the shape of the segments, along with the layering, is learnt by minimizing an objective function using
αβ-swap and α-expansion algorithms [5]. Results are demonstrated on several classes
of objects with different types of camera motion.
In the next section, we describe the layered representation. In section 3, we present a five stage approach to learn the parameters of the layered representation from a video. Such a model is particularly suited for applications like motion segmentation. Results are presented in section 4. Preliminary versions of this article have appeared in [10, 11]. The input videos used in this work together with the description and output of our approach are available at http://www.robots.ox.ac.uk/~vgg/research/moseg/.
2 Layered Representation
We introduce the model for a layered representation which describes the scene as a
composition of layers. Any frame of a video can be generated from our model by assigning appropriate values to its parameters and latent variables as illustrated in Fig. 2.
While the parameters of the model define the latent image, the latent variables describe
how to generate the frames using the latent image (see table 1). Together, they also
define the probability of the frame being generated.
The latent image is defined as follows. It consists of a set of n_P segments, which are 2D patterns (specified by their shape and appearance) along with their layering.

Input
  D        Data (RGB values of all pixels in every frame of a video).
  n_F      Number of frames.
Parameters
  n_P      Number of segments p_i, including the background.
  Θ_Mi     Matte for segment p_i.
  Θ_M      Set of all mattes, i.e. {Θ_Mi, i = 1, ..., n_P}.
  Θ_Ai     Appearance parameter for segment p_i.
  Θ_A      Set of all appearance parameters, i.e. {Θ_Ai, i = 1, ..., n_P}.
  H_i      Histogram specifying the distribution of the RGB values for p_i.
  l_i      Layer number of segment p_i.
Latent Variables
  Θ^j_Ti   Transformation {t_x, t_y, s_x, s_y, φ} of segment p_i to frame j.
  Θ^j_Li   Lighting variables {a^j_i, b^j_i} of segment p_i to frame j.
  Θ        {n_P, Θ_M, Θ_A, H_i, l_i; Θ_T, Θ_L}.

Table 1: Parameters and latent variables of the layered representation.
The layering determines the occlusion ordering. Thus, each layer contains a number of non-overlapping segments. We denote the i-th segment of the latent image as p_i. The shape of a segment p_i is modelled as a binary matte Θ_Mi of size equal to the frame of the video such that Θ_Mi(x) = 1 for a point x belonging to segment p_i (denoted by x ∈ p_i) and Θ_Mi(x) = 0 otherwise.
The appearance Θ_Ai(x) is the RGB value of points x ∈ p_i. We denote the set of mattes and appearance parameters for all segments as Θ_M and Θ_A respectively. The distribution of the RGB values Θ_Ai(x) for all points x ∈ p_i is specified using a histogram H_i for each segment p_i. In order to model the layers, we assign a (not necessarily unique) layer number l_i to each segment p_i such that segments belonging to the same layer share a common layer number. Each segment p_i can partially or completely occlude segment p_k, if and only if l_i > l_k. In summary, the latent image is defined by the mattes Θ_M, the appearance Θ_A, the histograms H_i and the layer numbers l_i of the n_P segments.
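
To make this concrete, here is a minimal sketch of the latent image as a data structure (Python; the class names and layout are our own illustration under the definitions above, not the authors' implementation).

```python
import numpy as np

class Segment:
    """One segment p_i of the latent image (hypothetical structure)."""
    def __init__(self, matte, appearance, layer_number, n_bins=16):
        # Binary matte Theta_Mi: matte[x] == 1 iff point x belongs to p_i.
        self.matte = matte.astype(bool)          # (H, W) array
        # Appearance Theta_Ai: an RGB value for every point of the matte.
        self.appearance = appearance             # (H, W, 3), values in [0, 1]
        # Layer number l_i; segments in the same layer share this number.
        self.layer_number = layer_number
        # Histogram H_i of the RGB values at the points x in p_i.
        rgb = appearance[self.matte]             # (N, 3) colours inside matte
        self.histogram, _ = np.histogramdd(
            rgb, bins=(n_bins,) * 3, range=((0.0, 1.0),) * 3, density=True)

class LatentImage:
    """The latent image: n_P segments together with their layering."""
    def __init__(self, segments):
        self.segments = segments                 # list of Segment
    @property
    def n_P(self):
        return len(self.segments)
```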
When generating frame j, we start from a latent image and map each point x ∈ p_i to x′ using the transformation Θ^j_Ti. This implies that points belonging to the same segment move according to a common transformation. The generated frame is then obtained by compositing the transformed segments in descending order of their layer numbers. For this paper, each transformation has five degrees of freedom: rotation, translations and anisotropic scale factors. The model accounts for the effects of lighting conditions on the appearance of a segment p_i using latent variable Θ^j_Li = {a^j_i, b^j_i}, where a^j_i and b^j_i are 3-dimensional vectors. The change in appearance of the segment p_i in frame j due to lighting conditions is modelled as

d(x′) = diag(a^j_i) · Θ_Ai(x) + b^j_i.   (1)
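
To illustrate frame generation, the sketch below (hypothetical code reusing the Segment structure above) builds the five-degree-of-freedom transformation Θ^j_Ti as a 2×3 matrix, applies the lighting model of Eq. (1), and paints segments so that higher layer numbers occlude lower ones. The nearest-neighbour forward warp is our simplification; the paper does not prescribe a particular discretization.

```python
import numpy as np

def transform_matrix(tx, ty, sx, sy, phi):
    """2x3 matrix for the 5-dof transform {t_x, t_y, s_x, s_y, phi}:
    rotation phi, anisotropic scales (sx, sy), translation (tx, ty)."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[sx * c, -sy * s, tx],
                     [sx * s,  sy * c, ty]])

def generate_frame(segments, shape, transforms, lightings):
    """Composite re-lit, transformed segments into one frame.
    transforms: one 2x3 matrix per segment; lightings: one (a, b) pair
    per segment, each a length-3 vector as in Eq. (1)."""
    H, W = shape
    frame = np.zeros((H, W, 3))
    # Painting in ascending layer order and overwriting gives the same
    # visibility as the paper's compositing in descending layer order.
    for idx in np.argsort([seg.layer_number for seg in segments]):
        seg, M = segments[idx], transforms[idx]
        a, b = lightings[idx]
        ys, xs = np.nonzero(seg.matte)                   # points x in p_i
        pts = M @ np.stack([xs, ys, np.ones_like(xs)])   # x' = Theta_Ti^j(x)
        xs2, ys2 = np.round(pts).astype(int)
        keep = (xs2 >= 0) & (xs2 < W) & (ys2 >= 0) & (ys2 < H)
        colours = seg.appearance[ys, xs] * a + b         # lighting, Eq. (1)
        frame[ys2[keep], xs2[keep]] = colours[keep]
    return frame
```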

The motion of segment p_i from frame j−1 to frame j, denoted by m^j_i, can be determined using the transformations Θ^{j−1}_Ti and Θ^j_Ti. This allows us to take into account the change in appearance due to motion blur as

c(x′) = ∫_0^T d(x′ − m^j_i(t)) dt,   (2)

where T is the total exposure time when capturing the frame.
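
Numerically, Eq. (2) might be approximated as below: a sketch that assumes the displacement m^j_i(t) grows linearly with t during the exposure and that normalizes the integral by the exposure time, so the result stays in the RGB range.

```python
import numpy as np

def motion_blurred(d_image, motion, exposure_frac=0.5, n_samples=8):
    """Approximate c(x') = (1/T) * integral_0^T d(x' - m(t)) dt by averaging
    shifted copies of the lit appearance image d(.) from Eq. (1).
    motion: full inter-frame displacement m_i^j = (dx, dy) in pixels; the
    segment is assumed to move linearly during the exposure."""
    acc = np.zeros_like(d_image, dtype=float)
    for t in np.linspace(0.0, exposure_frac, n_samples):
        sx = int(round(motion[0] * t))
        sy = int(round(motion[1] * t))
        # Sample d at x' - m(t); np.roll wraps at the border, which is a
        # simplification that is only adequate away from the image boundary.
        acc += np.roll(np.roll(d_image, -sy, axis=0), -sx, axis=1)
    return acc / n_samples
```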
Posterior of the model: We represent the set of all parameters and latent variables of the layered representation as Θ = {n_P, Θ_M, Θ_A, H_i, l_i; Θ_T, Θ_L} (summarized in table 1). Given data D, i.e. the n_F frames of a video, the posterior probability of the model is given by

Pr(Θ|D) = (1/Z) exp(−Ψ(Θ|D)),   (3)

where Z is the partition function. The energy Ψ(Θ|D) has the form

Ψ(Θ|D) = Σ_{i=1}^{n_P} Σ_{x ∈ Θ_M} [ A_i(x; Θ, D) + λ_1 Σ_{y ∈ N(x)} ( B_i(x, y; Θ, D) + λ_2 P_i(x, y; Θ) ) ],   (4)

where N(x) is the neighbourhood of x. For this paper, we define N(x) as the 8-neighbourhood of x across all mattes Θ_Mi of the layered representation (see Fig. 3). As will be seen in § 3.3, this allows us to learn the model efficiently by minimizing the energy Ψ(Θ|D) using multi-way graph cuts. However, a larger neighbourhood can be used for each point at the cost of more computation time. Note that minimizing the energy Ψ(Θ|D) is equivalent to maximizing the posterior Pr(Θ|D) since the partition function Z is independent of Θ.
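
To fix ideas, the double summation of Eq. (4) can be evaluated naively as follows. This is a sketch: the appearance, contrast and prior terms A_i, B_i and P_i are defined later in the paper, so they appear here as user-supplied callables, and the neighbourhood is taken within a single matte rather than across all mattes.

```python
import numpy as np

# Offsets of the 8-neighbourhood N(x) used in Eq. (4).
NEIGHBOURS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
              if (dx, dy) != (0, 0)]

def energy(segments, A, B, P, lam1, lam2):
    """Naive evaluation of Psi(Theta|D) in Eq. (4).
    A(i, x), B(i, x, y) and P(i, x, y) stand in for the appearance,
    contrast and prior terms; x, y are (col, row) points."""
    psi = 0.0
    for i, seg in enumerate(segments):
        h, w = seg.matte.shape
        for row, col in zip(*np.nonzero(seg.matte)):
            x = (col, row)
            pairwise = 0.0
            for dx, dy in NEIGHBOURS:
                y = (col + dx, row + dy)
                if 0 <= y[0] < w and 0 <= y[1] < h:
                    pairwise += B(i, x, y) + lam2 * P(i, x, y)
            psi += A(i, x) + lam1 * pairwise
    return psi
```

A brute-force evaluator like this is only useful for comparing candidate labellings; § 3.3 minimizes the same energy efficiently with multi-way graph cuts.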
The energy of the layered representation has two components: (i) the data log-likelihood term, which consists of the appearance term A_i(x; Θ, D) and the contrast term B_i(x, y; Θ, D), and (ii) the prior P_i(x, y; Θ). The appearance term measures the consistency of motion and colour distribution of a point x. The contrast and the prior terms encourage spatially continuous segments whose boundaries lie on edges in the frames. Their relative weight to the appearance term is given by λ_1. The weight λ_2 specifies the relative importance of the prior to the contrast term. An extension of Markov random fields (MRF) described in [12], which we call Contrast-dependent random fields (CDRF), allows a probabilistic interpretation of the energy Ψ(Θ|D) as shown in Fig. 4. We note, however, that unlike an MRF it is not straightforward to generate the frames from a CDRF since it is a discriminative model (due to the presence of the contrast term B_i(x, y)). We return to this when we provide a Conditional random field formulation of the energy Ψ(Θ|D). We begin by describing the three terms of the energy in detail.
Appearance: We denote the observed RGB values at point x′ = Θ^j_Ti(x) (i.e. the image of the point x in frame j) by I^j_i(x). The generated RGB values of the point x

References

J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

J. Lafferty, A. McCallum and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML, 2001.

Y. Boykov, O. Veksler and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Analysis and Machine Intelligence, 2001.

Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In Proc. ICCV, 2001.
Frequently Asked Questions

Q1. What are the contributions mentioned in the paper "Learning layered motion segmentations of video"?

The authors present an unsupervised approach for learning a layered representation of a scene from a video for motion segmentation. Included in the model are the effects of image projection, lighting, and motion blur. Furthermore, spatial continuity is explicitly modelled, resulting in contiguous segments. The authors compare their method with the state of the art and demonstrate significant improvements.

Answers to the remaining questions, excerpted from the paper:

The initial estimation takes approximately 5 minutes for every pair of frames: 3 minutes for computing the likelihood of the transformations and 2 minutes for MMSE estimation using LBP.

When determining rigidity of two transformations or clustering patches to obtain components, the authors allow the translations to vary by one pixel in the x and y directions to account for errors introduced by discretization of the putative transformations.

The shape of a segment p_i is modelled as a binary matte Θ_Mi of size equal to the frame of the video such that Θ_Mi(x) = 1 for a point x belonging to segment p_i (denoted by x ∈ p_i) and Θ_Mi(x) = 0 otherwise.

Recall that λ_1 and λ_2 are the weights given to the contrast and the prior terms, which encourage boundaries of segments to lie on image edges.

Given data D, i.e. the n_F frames of a video, the posterior probability of the model is given by Pr(Θ|D) = (1/Z) exp(−Ψ(Θ|D)), where Z is the partition function.

In this section, the authors describe a method to refine the estimate of the shape parameters Θ_M and determine the layer numbers l_i using the αβ-swap and α-expansion algorithms [5].

As can be seen from the table, the value of the pairwise potential is small when boundaries of the segment lie on image edges (i.e. when i ≠ k and g_ik(x, y) = 3σ).

The cost V_{x,y}(h_x, h_y) of assigning two different labels h_x and h_y to neighbouring points x and y is directly proportional to B_i(x, y; Θ, D) for that frame.