Learning Must-Link Constraints for Video Segmentation Based on Spectral Clustering

Anna Khoreva¹, Fabio Galasso¹, Matthias Hein², and Bernt Schiele¹
¹ Max Planck Institute for Informatics, Saarbrücken, Germany
{khoreva,galasso,schiele}@mpi-inf.mpg.de
² Saarland University, Saarbrücken, Germany
hein@cs.uni-saarland.de
Abstract. In recent years it has been shown that clustering and seg-
mentation methods can greatly benefit from the integration of prior in-
formation in terms of must-link constraints. Very recently the use of such
constraints has been integrated in a rigorous manner also in graph-based
methods such as normalized cut. On the other hand spectral cluster-
ing as relaxation of the normalized cut has been shown to be among
the best methods for video segmentation. In this paper we merge these
two developments and propose to learn must-link constraints for video
segmentation with spectral clustering. We show that the integration of
learned must-link constraints not only improves the segmentation result
but also significantly reduces the required runtime, making the use of
costly spectral methods possible for today’s high quality video.
1 Introduction
Video segmentation is an open problem in computer vision, which has recently
attracted increasing attention. The problem is of high interest due to its poten-
tial applications in action recognition, scene classification, 3D reconstruction and
video indexing, among others. The literature on the topic has become prolific [7,
43, 2, 28, 27, 19, 11, 10, 4, 29] and a number of techniques have become available,
e.g. generative layered models [25, 26], graph-based models [20, 46, 36] and spec-
tral techniques [39, 8, 15, 18, 32, 35, 16].
Spectral methods, stemming from the seminal work of [39, 34], have received
much attention from the theoretical viewpoint [31, 9, 21], and currently provide
state-of-the-art segmentation performance [3, 40, 18, 41, 35, 42, 32, 16]. Spectral
clustering, as a relaxation of the NP-hard normalized cut problem, is suitable
due to its ability to include long-range affinities [18, 40] and its global view on
the problem [14], providing balanced solutions.
In this paper, we focus on two important limitations of spectral techniques:
their excessive resource requirements and their failure to exploit available
training data. The large demands of spectral techniques [40, 18] are particularly
clear in the case of high-quality video datasets [17], limiting their current
large-scale applicability. Moreover, while a labeled dataset is often available,
systematically learning the affinities used to build the graph for spectral
clustering is very difficult.

[Figure 1 panels: (a) video sequence; (b) superpixels (SPX); (c) proposed must-link superpixels (M SPX); (d) video segmentation.]
Fig. 1. Video segmentation [18] employs fine superpixels (b), resulting in large resource
requirements, especially when using spectral methods. We propose learned must-links to
merge superpixels into fewer must-link-constrained M superpixels (c). This reduces
runtime and memory consumption and maintains or improves the segmentation (d).
In particular, as the normalized cut itself is an NP-hard problem and even the
spectral relaxation is non-convex, the optimization of the minimizer which yields
the segmentation is out of reach. Thus in practice one typically validates only a
few model parameters [8, 18, 32], which prevents spectral methods from making use
of recently available large training data [17].
We propose to learn must-link constraints to overcome both limitations. Recent
spectral theory [38, 16] has shown that the integration of must-links (i.e.
forcing two vertices to be in the same cluster) allows reducing the size of the
problem, while preserving the original optimization objective for all partitions
satisfying the must-links. On the other hand, by learning must-link constraints
we can leverage the available training data in order to guide spectral clustering
towards a desired segmentation. Figure 1 illustrates the advantages of learning
must-links: superpixel-based techniques [18] build spectral graphs on fine super-
pixels, Figure 1(b); by contrast, we propose to build graphs merging superpixels
based on learned must-link constraints, Figure 1(c). In particular, specifically
training a classifier to minimize the number of false positives allows conservative
superpixel merging, which: i. reduces the problem size significantly; ii. preserves
the original optimization problem; and iii. improves the video segmentation,
Figure 1(d), because correct must-links avoid undesired solutions (cf. Section 3).
In the following, we present the integration and learning of must-link con-
straints in Section 3 and validate them experimentally under various setups in
Section 4 on two recent video segmentation datasets [8, 17].
2 Related Work
The usage of must-link constraints, first introduced in [44], is an active area
of research in machine learning known as constrained clustering (see [5] for an
overview). Integrating must-link constraints into spectral clustering has been
attempted by: i. modifying the value of affinities (cf. [24], which first
considered constrained spectral clustering); ii. modifying the spectral embedding
[30]; or iii. adding constraints in a post-processing step [49, 13, 48, 45, 33].
Interestingly, none of these methods can guarantee that the must-link constraints
are actually satisfied in the final clustering. By contrast, we employ must-link
constraints to reduce the original graph to one of smaller size, thus enforcing the
constraints while additionally benefiting runtime and memory consumption.
In particular, [38, 16] have shown that must-link constraints can be used
to reduce the graph, based on the corresponding point groupings, and proved
equivalence between the reduced and the original graph, respectively in terms
of NCut [38] and SC [16], for any clustering satisfying the must-link constraints.
We employ these recent advances and propose to learn the must-link constraints
in a data-driven discriminative fashion for video segmentation.
Other related work in segmentation has looked at merging superpixels with
equivalence [1], but using hand-designed affinities, or at learned pair-wise
relations between superpixels [23], disregarding equivalence in the agglomerative
merging process. This work brings together learning affinities and merging with
equivalence guarantees for the first time.
3 Learning spectral must-link constraints
We provide here the steps of a video segmentation framework based on the
normalized cut [39, 34, 22] and review the integration of must-link constraints by
graph reductions as proposed in [38, 16]. While the idea of learning must-link
constraints applies to any segmentation problem, we discuss in detail learning
and inference in the specific case of the video segmentation features of [18].
3.1 Segmentation and Must-link Constraints
We represent a video sequence as a graph $G = (V, E)$: nodes $i \in V$ represent
superpixels, extracted at each frame of the video sequence with an image
segmentation algorithm [3]; edges $e_{ij} \in E$ between superpixels $i$ and $j$
take non-negative weights $w_{ij}$ and express the similarity (affinity) between
the superpixels.
A video segmentation can be defined as a partition $S = \{S_1, S_2, \ldots, S_K\}$
of the (superpixel) vertex set $V$, i.e. $\bigcup_k S_k = V$ and
$S_k \cap S_m = \emptyset \;\; \forall k \neq m$.
Given $\mathcal{S}$, the set of all partitions, we look for an optimal video
segmentation $S^* = \{S^*_1, S^*_2, \ldots, S^*_N\} \in \mathcal{S}$ (where $N$ is
the number of visual objects), the minimizer of an objective function, implicit
[20, 47, 37] or explicit [39, 34, 43, 10].
Must-link constraints alter the video segmentation by reducing the set of
feasible partitions $\mathcal{S}$. Given correct must-links (correct refers to the
desired ground truth segmentation, which ideally corresponds with the optimal
segmentation $S^*$), a video segmentation algorithm generally improves in
performance, since the solver is constrained to disregard non-optimal
segmentations wrt $S^*$. Moreover, the integration of must-links leads to reduced
runtime and memory load, as the recent work [38, 16] suggests.
We are interested in learning a must-link grouping function $M$, which groups
certain superpixels in the graph (certain groupings are the conservative grouping
decisions which we propose to learn), while respecting $S^*$. $M$ should
conservatively associate each node $i$ with a point grouping
$I_k \subseteq S^*_l$ (in the most uncertain cases a point grouping may only
include a single node). More formally:
\[ M : V \mapsto \mathcal{P}, \quad i \mapsto I_k \tag{1} \]
\[ \text{s.t.} \quad I_k \subseteq S^*_l \subseteq V, \quad \bigcup_k I_k = V, \quad I_k \cap I_m = \emptyset \;\; \forall k \neq m, \]
where $\mathcal{P}$ is the set of possible partitions of $V$.
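
The conditions of Eq. 1 can be checked mechanically for any candidate grouping. The sketch below is illustrative only: the function name and the list-of-sets data layout are assumptions, not part of the original framework, and the reference segmentation stands in for the optimal segmentation. It also requires groups to be non-empty, which is the usual convention for a partition.

```python
def is_valid_must_link_grouping(groups, segments, vertices):
    """Check the conditions of Eq. (1) for a candidate point grouping.

    groups   -- candidate point groupings I_k (iterables of node ids)
    segments -- reference segmentation S_l (iterables of node ids)
    vertices -- the vertex set V
    """
    segments = [set(seg) for seg in segments]
    seen = set()
    for group in groups:
        group = set(group)
        # Groups must be non-empty and pairwise disjoint.
        if not group or (seen & group):
            return False
        # Each group must lie inside a single reference segment.
        if not any(group <= seg for seg in segments):
            return False
        seen |= group
    # Together, the groups must cover the whole vertex set.
    return seen == set(vertices)


# Toy example: five superpixels, two objects.
reference = [{0, 1, 2}, {3, 4}]
print(is_valid_must_link_grouping([[0, 1], [2], [3, 4]], reference, range(5)))
print(is_valid_must_link_grouping([[0, 1], [2, 3], [4]], reference, range(5)))
```

The second call fails because the group {2, 3} straddles two objects, which is exactly the kind of merge a conservative grouping function must avoid.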
3.2 Framework
Here we tailor the general theory to a video segmentation framework based on
the normalized cut, solved either via the spectral [39, 34] or 1-spectral [9, 21]
relaxation. Further, we discuss the integration of learned must-link constraints
via graph reduction techniques [38, 16] and learning and inference strategies.
Video segmentation setup. We build upon Galasso et al. [18]. Their con-
structed graph G = (V, E) uses superpixels extracted from the lowest level (level
1) of a hierarchical image segmentation [3]. Edges connect superpixels from spa-
tial and temporal neighbors and are weighted by their pair-wise affinities, com-
puted from motion, appearance and shape features.
We consider six pairwise affinities: spatio-temporal appearance (STA), based
on the median CIE Lab color distance; spatio-temporal motion (STM), based
on median optical flow distance; across boundary appearance (ABA) and mo-
tion (ABM), computed across the common boundary of superpixels; short-term-
temporal (STT), measuring shape similarity by the spatial overlap of optical
flow-propagated superpixels; long-term-temporal (LTT), given by the fraction
of common trajectories between the superpixels. Additionally we consider the
number of common intersecting trajectories (IT). We distinguish four types of
affinities, depending on whether the related superpixels: i. lie within the same
frame (STA,STM,ABA,ABM); ii. lie on adjacent frames (STA,STM,STT); iii-
iv. lie on frames at a distance of 2 (STT,LTT,IT) or more frames (LTT,IT)
respectively.
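
The four affinity types listed above amount to a lookup keyed on the frame distance between two superpixels. A minimal sketch follows; the function name is hypothetical and the feature computation itself is not shown.

```python
def affinities_for_frame_gap(gap):
    """Return the affinity names available for two superpixels `gap` frames apart."""
    if gap < 0:
        raise ValueError("frame gap must be non-negative")
    if gap == 0:   # i. same frame
        return ("STA", "STM", "ABA", "ABM")
    if gap == 1:   # ii. adjacent frames
        return ("STA", "STM", "STT")
    if gap == 2:   # iii. frames at a distance of 2
        return ("STT", "LTT", "IT")
    return ("LTT", "IT")  # iv. frames further apart


for gap in range(4):
    print(gap, affinities_for_frame_gap(gap))
```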
Video segmentation objective function. Given a partition of $V$ into $N$ sets
$S_1, \ldots, S_N$, the normalized cut (NCut) is defined [31] as:
\[ \mathrm{NCut}(S_1, \ldots, S_N) = \sum_{k=1}^{N} \frac{\mathrm{cut}(S_k, V \setminus S_k)}{\mathrm{vol}(S_k)}, \tag{2} \]
where $\mathrm{cut}(S_k, V \setminus S_k) = \sum_{i \in S_k,\, j \in V \setminus S_k} w_{ij}$
and $\mathrm{vol}(S_k) = \sum_{i \in S_k,\, j \in V} w_{ij}$. The
balancing factor prevents trivial solutions and is ideal when unary terms cannot
be defined, but is also the reason why minimization of the NCut is NP-hard.
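
To make Eq. 2 concrete, the following sketch evaluates the NCut of a given partition on a small graph. It assumes a symmetric weight matrix stored as a list of lists and segments with non-zero volume; the function name and the toy weights are illustrative only.

```python
def ncut(weights, partition):
    """Normalized cut (Eq. 2) of `partition` on a symmetric weight matrix."""
    n = len(weights)
    total = 0.0
    for segment in partition:
        inside = set(segment)
        # cut(S_k, V \ S_k): weight of the edges leaving the segment.
        cut = sum(weights[i][j] for i in inside for j in range(n) if j not in inside)
        # vol(S_k): total weight incident to the nodes of the segment.
        vol = sum(weights[i][j] for i in inside for j in range(n))
        total += cut / vol
    return total


# Two tightly connected pairs (0-1 and 2-3) joined by one weak edge (1-2).
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
balanced = ncut(W, [[0, 1], [2, 3]])    # cuts only the weak edge
unbalanced = ncut(W, [[0], [1, 2, 3]])  # cuts a strong edge, isolates one node
print(balanced, unbalanced)
```

The balanced split scores far lower than the split that isolates a single node, which illustrates how the volume term penalizes trivial solutions.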

Spectral relaxations. The most widely adopted relaxation of NCut is spectral
clustering (SC) [39, 34, 31], where the solution of the relaxed problem is given by
representing the data points with the first few eigenvectors and then clustering
them with k-means.
While widely adopted [16, 32, 3, 8, 40, 18, 41], the SC relaxation is known to
be loose. We therefore additionally consider 1-spectral clustering (1-SC) [21,
22], a tight relaxation based on the 1-Laplacian. However, the relaxation is only
tight for bi-partitioning; for multi-way partitioning, recursive splitting is
used as a greedy heuristic.
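
As a rough illustration of the standard SC relaxation in the two-cluster case, the sketch below embeds the nodes with the second eigenvector of the normalized graph Laplacian. It is a simplification made for brevity: instead of k-means on the first few eigenvectors, it thresholds a single eigenvector at zero. NumPy is assumed to be available, the graph is assumed connected with positive degrees, and the function name is hypothetical. It does not implement 1-SC.

```python
import numpy as np


def spectral_bipartition(weights):
    """Two-way split using the second eigenvector of the normalized Laplacian."""
    W = np.asarray(weights, dtype=float)
    degrees = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    laplacian = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order, eigenvectors as columns.
    _, eigvecs = np.linalg.eigh(laplacian)
    # Map the second eigenvector back to the generalized problem L f = lambda D f.
    embedding = d_inv_sqrt * eigvecs[:, 1]
    # Threshold at zero instead of running k-means (two-cluster shortcut).
    return (embedding > 0).astype(int)


# Two triangles (nodes 0-2 and 3-5) connected by one weak edge (2-3).
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[a, b] = W[b, a] = 1.0
W[2, 3] = W[3, 2] = 0.01
labels = spectral_bipartition(W)
print(labels)
```

Which triangle receives label 0 depends on the arbitrary sign of the eigenvector, so only the grouping, not the label values, is meaningful.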
Reducing the original graph size with learned must-link constraints allows
us to experiment with 1-SC on state-of-the-art video segmentation benchmarks [8,
17], notwithstanding the increased computational costs.
Graph reduction schemes. Given must-link constraints provided as point
groupings $\{I_1, I_2, \ldots, I_q\}$ on the original vertex set, $I_k \subseteq V$,
recent work [38, 16] shows how to integrate such constraints into the original
problem while preserving the NCut and the spectral clustering objective function,
respectively.
In more detail, integration proceeds by reducing the original graph $G$ to one
of smaller size $G^M = (V^M, E^M)$, whereby the vertex set is given by the point
grouping $V^M = \{I_1, I_2, \ldots, I_q\}$, the edge set $E^M$ preserves the
original node connectivity, and the weights $w^M_{IJ}$ are estimated so as to
preserve the original video segmentation problem in terms of the NCut or spectral
clustering objective. In particular, the NCut reduction is given by
\[ w^M_{IJ} = \sum_{i \in I} \sum_{j \in J} w_{ij} \tag{3} \]
while the spectral clustering reduction is defined as
\[ w^M_{IJ} = \begin{cases} \displaystyle\sum_{i \in I} \sum_{j \in J} w_{ij} & \text{if } I \neq J \\[2ex] \displaystyle\frac{1}{|I|} \sum_{i \in I} \sum_{j \in J} w_{ij} - \frac{|I| - 1}{|I|} \sum_{i \in I} \sum_{j \in V \setminus I} w_{ij} & \text{if } I = J, \end{cases} \tag{4} \]
provided equal affinities of elements of $G$ constrained in $G^M$, cf. [16].
3.3 Learning
An ideal must-link constraining function $M$ (Eq. 1) should only merge
superpixels which are correct, i.e. belong to the same set in the optimal
segmentation. From an implementation viewpoint, it is convenient to consider
instead $M_{pw}$, defined over the set of edges $E$ of the graph $G$ representing
the video sequence:
\[ M_{pw} : E \mapsto \{0, 1\} \tag{5} \]
$M_{pw}$ casts the must-link constraining problem as a binary classification
problem, where a true output for an input edge $e_{ij}$ means that $i$ and $j$
belong to the same point grouping in the must-link constrained graph $G^M$.
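
Given per-edge decisions from $M_{pw}$, the point groupings are the connected components of the graph restricted to the edges labeled 1. The sketch below recovers them with a union-find structure. The classifier itself is not shown: `is_must_link` is a hypothetical list of already computed binary predictions, and the function name is an assumption.

```python
def group_by_must_links(num_nodes, edges, is_must_link):
    """Turn pairwise must-link decisions into point groupings (connected components)."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), linked in zip(edges, is_must_link):
        if linked:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Five superpixels in a chain; the classifier rejects only the edge (2, 3).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
predictions = [1, 1, 0, 1]
print(group_by_must_links(5, edges, predictions))  # [[0, 1, 2], [3, 4]]
```

Because groups are formed by chaining pairwise decisions, a single wrong positive merges two whole groups, which is why the classifier is trained to keep false positives to a minimum.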
