Learning Layered Motion Segmentations of Video
M. Pawan Kumar, P.H.S. Torr (Dept. of Computing, Oxford Brookes University) and A. Zisserman (Dept. of Engineering Science, University of Oxford)
{pkmudigonda,philiptorr}@brookes.ac.uk, az@robots.ox.ac.uk
http://cms.brookes.ac.uk/research/visiongroup http://www.robots.ox.ac.uk/~vgg
Abstract
We present an unsupervised approach for learning a layered representation of a scene
from a video for motion segmentation. Our method is applicable to any video contain-
ing piecewise parametric motion. The learnt model is a composition of layers, which
consist of one or more segments. The shape of each segment is represented using a
binary matte and its appearance is given by the RGB value for each point belonging to
the matte. Included in the model are the effects of image projection, lighting, and mo-
tion blur. Furthermore, spatial continuity is explicitly modelled resulting in contiguous
segments. Unlike previous approaches, our method does not use reference frame(s) for
initialization. The two main contributions of our method are: (i) A novel algorithm
for obtaining the initial estimate of the model by dividing the scene into rigidly mov-
ing components using efficient loopy belief propagation; and (ii) Refining the initial
estimate using αβ-swap and α-expansion algorithms, which guarantee a strong local minimum. Results are presented on several classes of objects with different types of camera motion, e.g. videos of a human walking shot with static or translating cameras. We
compare our method with the state of the art and demonstrate significant improvements.
1 Introduction
We present an approach for learning a layered representation from a video for motion
segmentation. Our method is applicable to any video containing piecewise parametric
motion, e.g. piecewise homography, without any restrictions on camera motion. It also
accounts for the effects of occlusion, lighting and motion blur. For example, Fig. 1
shows one such sequence where a layered representation can be learnt and used to
segment the walking person from the static background.
Many different approaches for motion segmentation have been reported in the lit-
erature. Important issues are: (i) whether the methods are feature-based or dense; (ii)
whether they model occlusion; (iii) whether they model spatial continuity; (iv) whether
they apply to multiple frames (i.e. a video sequence); and (v) whether they are inde-
pendent of which frames are used for initialization.

Figure 1: Four intermediate frames of a 31 frame video sequence of a person walking
sideways where the camera is static. Given the sequence, the model which best de-
scribes the person and the background is learnt in an unsupervised manner. Note that
the arm always partially occludes the torso.
A comprehensive survey of feature-based methods can be found in [19]. Most of these methods rely on computing a homography corresponding to the motion of a planar
object. This limits their application to a restricted set of scenes and motions. Dense
methods [2, 6, 18, 22] overcome this deficiency by computing pixel-wise motion. How-
ever, many dense approaches do not model occlusion which can lead to overcounting
of data when obtaining the segmentation, e.g. see [2, 6].
Chief amongst the methods which do model occlusion are those that use a layered
representation [21]. One such approach, described in [22], divides a scene into (almost)
planar regions for occlusion reasoning. Torr et al. [18] extend this representation by
allowing for parallax disparity. However, these methods rely on a keyframe for the
initial estimation. Other approaches [8, 23] overcome this problem by using layered
flexible sprites. A flexible sprite is a 2D appearance map and matte (mask) of an object
which is allowed to deform from frame to frame according to pure translation. Winn
et al. [25] extend the model to handle affine deformations. However, these methods do not enforce spatial continuity, i.e. they assume each pixel is labelled independently of its neighbours. This leads to non-contiguous segmentation when the foreground
and background are similar in appearance (see Fig. 19(b)). Most of these approaches,
namely those described in [6, 8, 18, 21, 22], use either EM or variational methods for
learning the parameters of the model which makes them prone to local minima.
Wills et al. [24] noted the importance of spatial continuity when learning the re-
gions in a layered representation. Given an initial estimate, they learn the shape of
the regions using the powerful α-expansion algorithm [5], which guarantees a strong local minimum. However, their method does not deal with more than 2 views. In our
earlier work [10], we described a similar motion segmentation approach to [24] for a
video sequence. Like [16], this automatically learns a model of an object. However,
the method depends on a keyframe to obtain an initial estimate of the model. This has
the disadvantage that points not visible in the keyframe are not included in the model,
which leads to incomplete segmentation.
In this paper, we present a model which does not suffer from the problems men-
tioned above, i.e. (i) it models occlusion; (ii) it models spatial continuity; (iii) it handles
multiple frames; and (iv) it is learnt independent of keyframes. An initial estimate of
the model is obtained based on a method to estimate image motion with discontinuities
using a new efficient loopy belief propagation algorithm. Despite the use of piecewise
parametric motion (similar to feature-based approaches), this allows us to learn the

Figure 2: The top row shows the various layers of a human model (the latent image in
this case). Each layer consists of one or more segments whose appearance is shown.
The shape of each segment is represented by a binary matte (not shown in the image).
Any frame j can be generated using this representation by assigning appropriate values
to its parameters and latent variables. The background is not shown.
model for a wide variety of scenes. Given the initial estimate, the shape of the segments, along with the layering, is learnt by minimizing an objective function using
αβ-swap and α-expansion algorithms [5]. Results are demonstrated on several classes
of objects with different types of camera motion.
In the next section, we describe the layered representation. In section 3, we present a five stage approach to learn the parameters of the layered representation from a video. Such a model is particularly suited for applications like motion segmentation. Results are presented in section 4. Preliminary versions of this article have appeared in [10, 11]. The input videos used in this work together with the description and output of our approach are available at http://www.robots.ox.ac.uk/~vgg/research/moseg/.
2 Layered Representation
We introduce the model for a layered representation which describes the scene as a
composition of layers. Any frame of a video can be generated from our model by assigning appropriate values to its parameters and latent variables as illustrated in Fig. 2.
While the parameters of the model define the latent image, the latent variables describe
how to generate the frames using the latent image (see table 1). Together, they also
define the probability of the frame being generated.
The latent image is defined as follows. It consists of a set of n_P segments, which are 2D patterns (specified by their shape and appearance) along with their layering.

Input
  D        Data (RGB values of all pixels in every frame of a video).
  n_F      Number of frames.
Parameters
  n_P      Number of segments p_i, including the background.
  Θ_Mi     Matte for segment p_i.
  Θ_M      Set of all mattes, i.e. {Θ_Mi, i = 1, ..., n_P}.
  Θ_Ai     Appearance parameter for segment p_i.
  Θ_A      Set of all appearance parameters, i.e. {Θ_Ai, i = 1, ..., n_P}.
  H_i      Histogram specifying the distribution of the RGB values for p_i.
  l_i      Layer number of segment p_i.
Latent Variables
  Θ^j_Ti   Transformation {t_x, t_y, s_x, s_y, φ} of segment p_i to frame j.
  Θ^j_Li   Lighting variables {a^j_i, b^j_i} of segment p_i to frame j.
  Θ        {n_P, Θ_M, Θ_A, H_i, l_i; Θ_T, Θ_L}.

Table 1: Parameters and latent variables of the layered representation.
The layering determines the occlusion ordering. Thus, each layer contains a number of non-overlapping segments. We denote the i-th segment of the latent image as p_i. The shape of a segment p_i is modelled as a binary matte Θ_Mi of size equal to the frame of the video such that Θ_Mi(x) = 1 for a point x belonging to segment p_i (denoted by x ∈ p_i) and Θ_Mi(x) = 0 otherwise.
The appearance Θ_Ai(x) is the RGB value of points x ∈ p_i. We denote the set of mattes and appearance parameters for all segments as Θ_M and Θ_A respectively. The distribution of the RGB values Θ_Ai(x) for all points x ∈ p_i is specified using a histogram H_i for each segment p_i. In order to model the layers, we assign a (not necessarily unique) layer number l_i to each segment p_i such that segments belonging to the same layer share a common layer number. Each segment p_i can partially or completely occlude segment p_k, if and only if l_i > l_k. In summary, the latent image is defined by the mattes Θ_M, the appearance Θ_A, the histograms H_i and the layer numbers l_i of the n_P segments.
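
To make this concrete, here is a minimal sketch of the latent image as a data structure (Python; the class names and layout are our own illustration under the definitions above, not the authors' implementation).

```python
import numpy as np

class Segment:
    """One segment p_i of the latent image (hypothetical structure)."""
    def __init__(self, matte, appearance, layer_number, n_bins=16):
        # Binary matte Theta_Mi: matte[x] == 1 iff point x belongs to p_i.
        self.matte = matte.astype(bool)          # (H, W) array
        # Appearance Theta_Ai: an RGB value for every point of the matte.
        self.appearance = appearance             # (H, W, 3), values in [0, 1]
        # Layer number l_i; segments in the same layer share this number.
        self.layer_number = layer_number
        # Histogram H_i of the RGB values at the points x in p_i.
        rgb = appearance[self.matte]             # (N, 3) colours inside matte
        self.histogram, _ = np.histogramdd(
            rgb, bins=(n_bins,) * 3, range=((0.0, 1.0),) * 3, density=True)

class LatentImage:
    """The latent image: n_P segments together with their layering."""
    def __init__(self, segments):
        self.segments = segments                 # list of Segment
    @property
    def n_P(self):
        return len(self.segments)
```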
When generating frame j, we start from a latent image and map each point x ∈ p_i to x′ using the transformation Θ^j_Ti. This implies that points belonging to the same segment move according to a common transformation. The generated frame is then obtained by compositing the transformed segments in descending order of their layer numbers. For this paper, each transformation has five degrees of freedom: rotation, translations and anisotropic scale factors. The model accounts for the effects of lighting conditions on the appearance of a segment p_i using latent variable Θ^j_Li = {a^j_i, b^j_i}, where a^j_i and b^j_i are 3-dimensional vectors. The change in appearance of the segment p_i in frame j due to lighting conditions is modelled as

d(x′) = diag(a^j_i) · Θ_Ai(x) + b^j_i.   (1)
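
To illustrate frame generation, the sketch below (hypothetical code reusing the Segment structure above) builds the five-degree-of-freedom transformation Θ^j_Ti as a 2×3 matrix, applies the lighting model of Eq. (1), and paints segments so that higher layer numbers occlude lower ones. The nearest-neighbour forward warp is our simplification; the paper does not prescribe a particular discretization.

```python
import numpy as np

def transform_matrix(tx, ty, sx, sy, phi):
    """2x3 matrix for the 5-dof transform {t_x, t_y, s_x, s_y, phi}:
    rotation phi, anisotropic scales (sx, sy), translation (tx, ty)."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[sx * c, -sy * s, tx],
                     [sx * s,  sy * c, ty]])

def generate_frame(segments, shape, transforms, lightings):
    """Composite re-lit, transformed segments into one frame.
    transforms: one 2x3 matrix per segment; lightings: one (a, b) pair
    per segment, each a length-3 vector as in Eq. (1)."""
    H, W = shape
    frame = np.zeros((H, W, 3))
    # Painting in ascending layer order and overwriting gives the same
    # visibility as the paper's compositing in descending layer order.
    for idx in np.argsort([seg.layer_number for seg in segments]):
        seg, M = segments[idx], transforms[idx]
        a, b = lightings[idx]
        ys, xs = np.nonzero(seg.matte)                   # points x in p_i
        pts = M @ np.stack([xs, ys, np.ones_like(xs)])   # x' = Theta_Ti^j(x)
        xs2, ys2 = np.round(pts).astype(int)
        keep = (xs2 >= 0) & (xs2 < W) & (ys2 >= 0) & (ys2 < H)
        colours = seg.appearance[ys, xs] * a + b         # lighting, Eq. (1)
        frame[ys2[keep], xs2[keep]] = colours[keep]
    return frame
```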

The motion of segment p_i from frame j−1 to frame j, denoted by m^j_i, can be determined using the transformations Θ^{j−1}_Ti and Θ^j_Ti. This allows us to take into account the change in appearance due to motion blur as

c(x′) = ∫_0^T d(x′ − m^j_i(t)) dt,   (2)

where T is the total exposure time when capturing the frame.
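
Numerically, Eq. (2) might be approximated as below: a sketch that assumes the displacement m^j_i(t) grows linearly with t during the exposure and that normalizes the integral by the exposure time, so the result stays in the RGB range.

```python
import numpy as np

def motion_blurred(d_image, motion, exposure_frac=0.5, n_samples=8):
    """Approximate c(x') = (1/T) * integral_0^T d(x' - m(t)) dt by averaging
    shifted copies of the lit appearance image d(.) from Eq. (1).
    motion: full inter-frame displacement m_i^j = (dx, dy) in pixels; the
    segment is assumed to move linearly during the exposure."""
    acc = np.zeros_like(d_image, dtype=float)
    for t in np.linspace(0.0, exposure_frac, n_samples):
        sx = int(round(motion[0] * t))
        sy = int(round(motion[1] * t))
        # Sample d at x' - m(t); np.roll wraps at the border, which is a
        # simplification that is only adequate away from the image boundary.
        acc += np.roll(np.roll(d_image, -sy, axis=0), -sx, axis=1)
    return acc / n_samples
```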
Posterior of the model: We represent the set of all parameters and latent variables of the layered representation as Θ = {n_P, Θ_M, Θ_A, H_i, l_i; Θ_T, Θ_L} (summarized in table 1). Given data D, i.e. the n_F frames of a video, the posterior probability of the model is given by

Pr(Θ|D) = (1/Z) exp(−Ψ(Θ|D)),   (3)

where Z is the partition function. The energy Ψ(Θ|D) has the form

Ψ(Θ|D) = Σ_{i=1}^{n_P} Σ_{x ∈ Θ_M} [ A_i(x; Θ, D) + λ_1 Σ_{y ∈ N(x)} ( B_i(x, y; Θ, D) + λ_2 P_i(x, y; Θ) ) ],   (4)

where N(x) is the neighbourhood of x. For this paper, we define N(x) as the 8-neighbourhood of x across all mattes Θ_Mi of the layered representation (see Fig. 3). As will be seen in § 3.3, this allows us to learn the model efficiently by minimizing the energy Ψ(Θ|D) using multi-way graph cuts. However, a larger neighbourhood can be used for each point at the cost of more computation time. Note that minimizing the energy Ψ(Θ|D) is equivalent to maximizing the posterior Pr(Θ|D) since the partition function Z is independent of Θ.
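
To fix ideas, the double summation of Eq. (4) can be evaluated naively as follows. This is a sketch: the appearance, contrast and prior terms A_i, B_i and P_i are defined later in the paper, so they appear here as user-supplied callables, and the neighbourhood is taken within a single matte rather than across all mattes.

```python
import numpy as np

# Offsets of the 8-neighbourhood N(x) used in Eq. (4).
NEIGHBOURS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
              if (dx, dy) != (0, 0)]

def energy(segments, A, B, P, lam1, lam2):
    """Naive evaluation of Psi(Theta|D) in Eq. (4).
    A(i, x), B(i, x, y) and P(i, x, y) stand in for the appearance,
    contrast and prior terms; x, y are (col, row) points."""
    psi = 0.0
    for i, seg in enumerate(segments):
        h, w = seg.matte.shape
        for row, col in zip(*np.nonzero(seg.matte)):
            x = (col, row)
            pairwise = 0.0
            for dx, dy in NEIGHBOURS:
                y = (col + dx, row + dy)
                if 0 <= y[0] < w and 0 <= y[1] < h:
                    pairwise += B(i, x, y) + lam2 * P(i, x, y)
            psi += A(i, x) + lam1 * pairwise
    return psi
```

A brute-force evaluator like this is only useful for comparing candidate labellings; § 3.3 minimizes the same energy efficiently with multi-way graph cuts.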
The energy of the layered representation has two components: (i) the data log-likelihood term, which consists of the appearance term A_i(x; Θ, D) and the contrast term B_i(x, y; Θ, D), and (ii) the prior P_i(x, y; Θ). The appearance term measures the consistency of motion and colour distribution of a point x. The contrast and the prior terms encourage spatially continuous segments whose boundaries lie on edges in the frames. Their relative weight to the appearance term is given by λ_1. The weight λ_2 specifies the relative importance of the prior to the contrast term. An extension of Markov random fields (MRF) described in [12], which we call Contrast-dependent random fields (CDRF), allows a probabilistic interpretation of the energy Ψ(Θ|D) as shown in Fig. 4. We note, however, that unlike an MRF it is not straightforward to generate the frames from a CDRF since it is a discriminative model (due to the presence of the contrast term B_i(x, y)). We return to this when we provide a Conditional random field formulation of the energy Ψ(Θ|D). We begin by describing the three terms of the energy in detail.
Appearance: We denote the observed RGB values at point x′ = Θ^j_Ti(x) (i.e. the image of the point x in frame j) by I^j_i(x). The generated RGB values of the point x

References

J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

J. Lafferty, A. McCallum and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML, 2001.

Y. Boykov, O. Veksler and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Analysis and Machine Intelligence, 2001.

Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In Proc. ICCV, 2001.
Frequently Asked Questions

Q1. What are the contributions mentioned in the paper "Learning layered motion segmentations of video"?

The authors present an unsupervised approach for learning a layered representation of a scene from a video for motion segmentation. Included in the model are the effects of image projection, lighting, and motion blur. Furthermore, spatial continuity is explicitly modelled, resulting in contiguous segments. The authors compare their method with the state of the art and demonstrate significant improvements.

Answers to the remaining questions, excerpted from the paper:

The initial estimation takes approximately 5 minutes for every pair of frames: 3 minutes for computing the likelihood of the transformations and 2 minutes for MMSE estimation using LBP.

When determining rigidity of two transformations or clustering patches to obtain components, the authors allow the translations to vary by one pixel in the x and y directions to account for errors introduced by discretization of the putative transformations.

The shape of a segment p_i is modelled as a binary matte Θ_Mi of size equal to the frame of the video such that Θ_Mi(x) = 1 for a point x belonging to segment p_i (denoted by x ∈ p_i) and Θ_Mi(x) = 0 otherwise.

Recall that λ_1 and λ_2 are the weights given to the contrast and the prior terms, which encourage boundaries of segments to lie on image edges.

Given data D, i.e. the n_F frames of a video, the posterior probability of the model is given by Pr(Θ|D) = (1/Z) exp(−Ψ(Θ|D)), where Z is the partition function.

In this section, the authors describe a method to refine the estimate of the shape parameters Θ_M and determine the layer numbers l_i using the αβ-swap and α-expansion algorithms [5].

As can be seen from the table, the value of the pairwise potential is small when boundaries of the segment lie on image edges (i.e. when i ≠ k and g_ik(x, y) = 3σ).

The cost V_{x,y}(h_x, h_y) of assigning two different labels h_x and h_y to neighbouring points x and y is directly proportional to B_i(x, y; Θ, D) for that frame.