What is the main argument for spectral clustering?

Spectral clustering, as a relaxation of the NP-hard normalized cut problem, is suitable due to its ability to include long-range affinities [18, 40] and its global view on the problem [14], providing balanced solutions.

What is the first set of experiments?

In the first set of experiments, the authors consider the Berkeley Motion Segmentation Dataset (BMDS) [8], which consists of 26 VGA-quality video sequences, representing mainly humans and cars, which the authors arrange into training, validation and test sets (6+4+16).

How has must-link constraint learning been tried?

The goal of integrating must-link constraints into spectral clustering has been tried via: i. modifying the value of affinities (cf. [24], which first considered constrained spectral clustering); ii. modifying the spectral embedding [30]; or iii. adding constraints in a post-processing step [49, 13, 48, 45, 33].

How do the authors improve the performance of the learning algorithms?

The authors have shown that learned must-link constraints improve efficiency and, in most cases, performance, as these allow discriminative training on GT data.

What is the affinity of the must-link constraining problem?

Must-link constraints have a transitive nature: $M_{pw}(e_{ij}) = 1$ and $M_{pw}(e_{ik}) = 1$ imply $M_{pw}(e_{jk}) = 1$. It is therefore crucial that all decided constraints are correct, as a few wrong ones may result in a larger set of incorrect decisions by transitive closure and potentially spoil the segmentation.

What is the function for a normalized cut?

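Given a partition of $V$ into $N$ sets $S_1, \ldots, S_N$, the normalized cut (NCut) is defined [31] as $\mathrm{NCut}(S_1, \ldots, S_N) = \sum_{k=1}^{N} \frac{\mathrm{cut}(S_k, V \setminus S_k)}{\mathrm{vol}(S_k)}$ (2), where $\mathrm{cut}(S_k, V \setminus S_k) = \sum_{i \in S_k,\, j \in V \setminus S_k} w_{ij}$ and $\mathrm{vol}(S_k) = \sum_{i \in S_k,\, j \in V} w_{ij}$.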

What is the description of the proposed method?

Thus learned must-links closely follow the spectral clustering optimization and their proposed method only provides further reduction of the problem size.

How does the proposed method reduce runtime?

With respect to the efficient reduction of [16], the authors further reduce runtime by 30% and memory load by 65%, while the authors reduce runtime by 97% and memory load by 87% wrt [18].

What is the reason why minimization of the NCut is NP-Hard?

The balancing factor prevents trivial solutions and is ideal when unary terms cannot be defined, but is also the reason why minimization of the NCut is NP-Hard.

What is the effect of learning mustlink constraints on the learning of a tree?

While this theory is applicable to general clustering and segmentation problems, the authors have particularly shown the use of learned must-link constraints in conjunction with spectral techniques, whereby recent theoretical advances employ these to reduce the original problem size, hence the runtime and memory requirements.

What is the way to learn Mpw?

From an implementation viewpoint, it is convenient to consider instead $M_{pw}$, defined over the set of edges $E$ of the graph $G$ representing the video sequence: $M_{pw} : E \mapsto \{0, 1\}$ (5). $M_{pw}$ casts the must-link constraining problem as a binary classification one, where a true output for an input edge $e_{ij}$ means that $i$ and $j$ belong to the same point grouping, in the must-link constrained graph $G^M$.

How many false positives are made on the unseen data?

Although this conservative classifier might imply that in the worst case, no must-link constraints are predicted, it turns out their classifier actually predicts for a large fraction of the edges to be linked and thus leads to a significant reduction in size, while making a few false positives on the unseen data (overall, 1 false positive per 242k true predictions).

What are the main metrics for BPR and VPR?

Besides the PR curves, the authors report aggregate performance for BPR and VPR: optimal dataset scale [ODS], optimal segmentation scale [OSS], average precision [AP].

(Open Access) Learning Must-Link Constraints for Video Segmentation Based on Spectral Clustering (2014) | Anna Khoreva

Q: What contributions have the authors mentioned in the paper "Learning must-link constraints for video segmentation based on spectral clustering" ?

In this paper the authors merge these two developments and propose to learn must-link constraints for video segmentation with spectral clustering. The authors show that the integration of learned must-link constraints not only improves the segmentation result but also significantly reduces the required runtime, making the use of costly spectral methods possible for today's high quality video.

Q: What are the main limitations of spectral techniques?

In this paper, the authors focus on two important limitations of spectral techniques: the excessive resource requirements and the lack of exploiting available training data.

Learning Must-Link Constraints for Video Segmentation based on Spectral Clustering

Anna Khoreva, Fabio Galasso, Matthias Hein, and Bernt Schiele

Max Planck Institute for Informatics, Saarbrücken, Germany
{khoreva,galasso,schiele}@mpi-inf.mpg.de

Saarland University, Saarbrücken, Germany
hein@cs.uni-saarland.de

Abstract. In recent years it has been shown that clustering and segmentation methods can greatly benefit from the integration of prior information in terms of must-link constraints. Very recently the use of such constraints has been integrated in a rigorous manner also in graph-based methods such as normalized cut. On the other hand spectral clustering as relaxation of the normalized cut has been shown to be among the best methods for video segmentation. In this paper we merge these two developments and propose to learn must-link constraints for video segmentation with spectral clustering. We show that the integration of learned must-link constraints not only improves the segmentation result but also significantly reduces the required runtime, making the use of costly spectral methods possible for today's high quality video.

1 Introduction

Video segmentation is an open problem in computer vision, which has recently attracted increasing attention. The problem is of high interest due to its potential applications in action recognition, scene classification, 3D reconstruction and video indexing, among others. The literature on the topic has become prolific [7, 43, 2, 28, 27, 19, 11, 10, 4, 29] and a number of techniques have become available, e.g. generative layered models [25, 26], graph-based models [20, 46, 36] and spectral techniques [39, 8, 15, 18, 32, 35, 16].

Spectral methods, stemming from the seminal work of [39, 34], have received much attention from the theoretical viewpoint [31, 9, 21], and currently provide state-of-the-art segmentation performance [3, 40, 18, 41, 35, 42, 32, 16]. Spectral clustering, as a relaxation of the NP-hard normalized cut problem, is suitable due to its ability to include long-range affinities [18, 40] and its global view on the problem [14], providing balanced solutions.

In this paper, we focus on two important limitations of spectral techniques: the excessive resource requirements and the lack of exploiting available training data. The large demands of spectral techniques [40, 18] are particularly clear in the case of high-quality video datasets [17], limiting their current large-scale applicability. While often a labeled dataset is available, a systematic learning of the affinities used to build the graph for spectral clustering is very difficult.

[Figure 1 panels: a. Video sequence; b. SPX; c. Proposed M SPX; d. Video segm.]

Fig. 1. Video segmentation [18] employs fine superpixels (b), resulting in large resource requirements, esp. when using spectral methods. We propose learned must-links to merge superpixels into fewer must-link-constrained M superpixels (c). This reduces runtime and memory consumption and maintains or improves the segmentation (d).

In particular, as the normalized cut itself is an NP-hard problem and even the spectral relaxation is non-convex, the optimization of the minimizer which yields the segmentation is out of reach. Thus in practice one typically validates a few model parameters [8, 18, 32], which prevents spectral methods from making use of recently available large training data [17].

We propose to learn must-link constraints to overcome both limitations. Recent spectral theory [38, 16] has shown that the integration of must-links (i.e. forcing two vertices to be in the same cluster) allows reducing the size of the problem, while preserving the original optimization objective for all partitions satisfying the must-links. On the other hand, by learning must-link constraints we can leverage the available training data in order to guide spectral clustering towards a desired segmentation. Figure 1 illustrates the advantages of learning must-links: superpixel-based techniques [18] build spectral graphs on fine superpixels, Figure 1(b); by contrast, we propose to build graphs merging superpixels based on learned must-link constraints, Figure 1(c). In particular, specifically training a classifier to minimize the number of false positives allows conservative superpixel merging, which: i. reduces the problem size significantly; ii. preserves the original optimization problem; and iii. improves the video segmentation, Figure 1(d), because correct must-links avoid undesired solutions (cf. Section 3).

In the following, we present the integration and learning of must-link constraints in Section 3 and validate them experimentally under various setups in Section 4 on two recent video segmentation datasets [8, 17].

2 Related Work

The usage of must-link constraints, first introduced in [44], is an active area of research in machine learning known as constrained clustering (see [5] for an overview). The goal of integrating must-link constraints into spectral clustering has been tried via: i. modifying the value of affinities (cf. [24], which first considered constrained spectral clustering); ii. modifying the spectral embedding [30]; or iii. adding constraints in a post-processing step [49, 13, 48, 45, 33]. Interestingly, none of these methods can guarantee that the must-link constraints are actually satisfied in the final clustering. By contrast, we employ must-link constraints to reduce the original graph to one of smaller size, thus enforcing the constraints while additionally benefiting runtime and memory consumption.

In particular, [38, 16] have shown that must-link constraints can be used to reduce the graph, based on the corresponding point groupings, and proved equivalence between the reduced and the original graph, respectively in terms of NCut [38] and SC [16], for any clustering satisfying the must-link constraints. We employ these recent advances and propose to learn the must-link constraints in a data-driven discriminative fashion for video segmentation.

Other related work in segmentation has looked at merging superpixels with equivalence [1], but using hand-designed affinities, or at learned pair-wise relations between superpixels [23], disregarding equivalence in the agglomerative merging process. This work brings together learning affinities and merging with equivalence guarantees for the first time.

3 Learning spectral must-link constraints

We provide here the steps of a video segmentation framework based on the normalized cut [39, 34, 22] and review the integration of must-link constraints by graph reductions as proposed in [38, 16]. While the idea of learning must-link constraints applies to any segmentation problem, we discuss in detail learning and inference in the specific case of the video segmentation features of [18].

3.1 Segmentation and Must-link Constraints

We represent a video sequence as a graph $G = (V, E)$: nodes $i \in V$ represent superpixels, extracted at each frame of the video sequence with an image segmentation algorithm [3]; edges $e_{ij} \in E$ between superpixels $i$ and $j$ take non-negative weights $w_{ij}$ and express the similarity (affinity) between the superpixels.

A video segmentation can be defined as a partition $S = \{S_1, S_2, \ldots, S_N\}$ of the (superpixel) vertex set $V$, i.e. $\bigcup_k S_k = V$, $S_k \cap S_m = \emptyset \;\; \forall\, k \neq m$. Given $\mathbb{S}$ the set of all partitions, we look for an optimal video segmentation $S^* = \{S^*_1, S^*_2, \ldots, S^*_N\} \in \mathbb{S}$ (where $N$ is the number of visual objects), minimizer of an objective function, implicit [20, 47, 37] or explicit [39, 34, 43, 10].

Must-link constraints alter the video segmentation by reducing the set of feasible partitions $\mathbb{S}$. Given correct (footnote 1) must-links, a video segmentation algorithm generally improves in performance, since the solver is constrained to disregard non-optimal segmentations wrt $S^*$. Moreover, the integration of must-links leads to reduced runtime and memory load as the recent work [38, 16] suggests.

We are interested in learning a must-link grouping function $M$, which groups certain (footnote 2) superpixels in the graph, while respecting $S^*$. $M$ should conservatively associate each node $i$ with a point grouping $I_i \subseteq S^*_k$ (in most uncertain cases a point grouping may only include a single node). More formally:

$$M : V \mapsto P, \quad i \mapsto I_i \tag{1}$$

$$\text{s.t.} \quad I_i \subseteq S^*_k \subseteq V, \quad \bigcup_i I_i = V, \quad I_k \cap I_m = \emptyset \;\; \forall\, k \neq m,$$

where $P$ is the set of possible partitions of $V$.

Footnote 1: correct refers to the desired ground truth segmentation, which ideally corresponds with the optimal segmentation $S^*$.

Footnote 2: certain groupings are the conservative grouping decisions which we propose to learn.
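The constraints of Eq. 1 can be illustrated with a small check. The following is a minimal Python sketch, not part of the original framework: the function name `respects_ground_truth`, the representation of groupings and segments as sets of integer node ids, and the assumption that the ground truth covers every node are all ours.

```python
def respects_ground_truth(groupings, gt_segments, num_nodes):
    """Check the conditions of Eq. 1 for a candidate must-link grouping.

    groupings   : list of sets of node ids (the point groupings I)
    gt_segments : list of sets of node ids (the desired segmentation S*),
                  assumed to cover every node exactly once
    num_nodes   : number of nodes in V, with ids 0 .. num_nodes - 1
    """
    # Map each node to the index of its ground-truth segment.
    segment_of = {}
    for k, segment in enumerate(gt_segments):
        for node in segment:
            segment_of[node] = k

    seen = set()
    for group in groupings:
        # Point groupings must be pairwise disjoint.
        if seen & group:
            return False
        seen |= group
        # Each point grouping must lie inside a single ground-truth segment.
        if len({segment_of[node] for node in group}) != 1:
            return False

    # Point groupings must cover the whole vertex set.
    return seen == set(range(num_nodes))
```

A grouping made only of singletons always passes this check, which corresponds to the most conservative case in which no must-link is decided.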

3.2 Framework

Here we tailor the general theory to a video segmentation framework based on the normalized cut, solved either via the spectral [39, 34] or 1-spectral [9, 21] relaxation. Further, we discuss the integration of learned must-link constraints via graph reduction techniques [38, 16] and learning and inference strategies.

Video segmentation setup. We build upon Galasso et al. [18]. Their constructed graph $G = (V, E)$ uses superpixels extracted from the lowest level (level 1) of a hierarchical image segmentation [3]. Edges connect superpixels from spatial and temporal neighbors and are weighted by their pair-wise affinities, computed from motion, appearance and shape features.

We consider six pairwise affinities: spatio-temporal appearance (STA), based on the median CIE Lab color distance; spatio-temporal motion (STM), based on median optical flow distance; across boundary appearance (ABA) and motion (ABM), computed across the common boundary of superpixels; short-term-temporal (STT), measuring shape similarity by the spatial overlap of optical flow-propagated superpixels; long-term-temporal (LTT), given by the fraction of common trajectories between the superpixels. Additionally we consider the number of common intersecting trajectories (IT). We distinguish four types of affinities, depending on whether the related superpixels: i. lie within the same frame (STA, STM, ABA, ABM); ii. lie on adjacent frames (STA, STM, STT); iii-iv. lie on frames at a distance of 2 (STT, LTT, IT) or more frames (LTT, IT) respectively.
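The four affinity types listed above amount to a lookup from the temporal distance between two superpixels to the set of available features. A minimal Python sketch of that lookup follows; the function name `affinities_for_frame_distance` is hypothetical and only the feature abbreviations come from the text.

```python
def affinities_for_frame_distance(distance):
    """Return the affinity features available for two superpixels whose
    frames are `distance` apart (0 means both lie in the same frame)."""
    if distance < 0:
        raise ValueError("frame distance must be non-negative")
    if distance == 0:
        # i. same frame
        return ("STA", "STM", "ABA", "ABM")
    if distance == 1:
        # ii. adjacent frames
        return ("STA", "STM", "STT")
    if distance == 2:
        # iii. frames at a distance of 2
        return ("STT", "LTT", "IT")
    # iv. frames further apart
    return ("LTT", "IT")
```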

Video segmentation objective function. Given a partition of $V$ into $N$ sets $S_1, \ldots, S_N$, the normalized cut (NCut) is defined [31] as:

$$\mathrm{NCut}(S_1, \ldots, S_N) = \sum_{k=1}^{N} \frac{\mathrm{cut}(S_k, V \setminus S_k)}{\mathrm{vol}(S_k)}, \tag{2}$$

where $\mathrm{cut}(S_k, V \setminus S_k) = \sum_{i \in S_k,\, j \in V \setminus S_k} w_{ij}$ and $\mathrm{vol}(S_k) = \sum_{i \in S_k,\, j \in V} w_{ij}$. The balancing factor prevents trivial solutions and is ideal when unary terms cannot be defined, but is also the reason why minimization of the NCut is NP-Hard.
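To make Eq. 2 concrete, the following minimal Python sketch evaluates the NCut of a given partition. It assumes a dense symmetric weight matrix stored as a list of lists and segments given as sets of node indices with positive volume; the function name `ncut` and this data layout are our assumptions, chosen for readability rather than efficiency.

```python
def ncut(weights, partition):
    """Normalized cut (Eq. 2) of `partition` on the graph given by `weights`.

    weights   : symmetric n x n list of lists of non-negative affinities w_ij
    partition : list of disjoint sets of node indices covering 0 .. n - 1
    """
    n = len(weights)
    all_nodes = set(range(n))
    total = 0.0
    for segment in partition:
        outside = all_nodes - segment
        # cut(S_k, V \ S_k): weight of edges leaving the segment.
        cut = sum(weights[i][j] for i in segment for j in outside)
        # vol(S_k): weight of all edges incident to the segment.
        vol = sum(weights[i][j] for i in segment for j in all_nodes)
        total += cut / vol
    return total
```

For two triangles with unit edge weights joined by a single bridge of weight 0.1, splitting at the bridge gives a cut of 0.1 and a volume of 6.1 per side, hence an NCut of 0.2 / 6.1.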

Spectral relaxations. The most widely adopted relaxation of NCut is spectral clustering (SC) [39, 34, 31], where the solution of the relaxed problem is given by representing the data points with the first few eigenvectors and then clustering them with k-means.
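As a rough illustration of the spectral relaxation, the sketch below handles only the two-cluster case and replaces the k-means step with a sign threshold on a single eigenvector, so it is a simplification of the procedure described above and not the pipeline used in the paper. It assumes a connected graph with positive degrees and a dense symmetric weight matrix; the function name `spectral_bipartition`, the use of power iteration and the fixed start vector are our choices to keep the example dependency-free.

```python
import math


def spectral_bipartition(weights, iterations=1000):
    """Two-way split from the second eigenvector of D^{-1/2} W D^{-1/2}."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    inv_sqrt = [1.0 / math.sqrt(d) for d in degree]

    # The leading eigenvector of the normalized affinity matrix is D^{1/2} 1.
    top = [math.sqrt(d) for d in degree]
    norm = math.sqrt(sum(t * t for t in top))
    top = [t / norm for t in top]

    def apply_shifted(v):
        # Multiply by (I + D^{-1/2} W D^{-1/2}) / 2; its eigenvalues lie in
        # [0, 1], so power iteration cannot lock onto a negative eigenvalue.
        out = []
        for i in range(n):
            acc = sum(weights[i][j] * inv_sqrt[j] * v[j] for j in range(n))
            out.append(0.5 * (v[i] + inv_sqrt[i] * acc))
        return out

    def deflate_and_normalize(v):
        # Remove the component along the leading eigenvector, then rescale.
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        length = math.sqrt(sum(a * a for a in v))
        return [a / length for a in v]

    # Deterministic start vector (assumed not parallel to `top`).
    v = deflate_and_normalize([float(i + 1) for i in range(n)])
    for _ in range(iterations):
        v = deflate_and_normalize(apply_shifted(v))

    # Rescale by D^{-1/2} and threshold at zero to obtain two cluster labels.
    return [0 if inv_sqrt[i] * v[i] >= 0 else 1 for i in range(n)]
```

Which side receives label 0 is arbitrary; only the grouping of nodes is meaningful. For more than two clusters, several eigenvectors followed by k-means would be needed, as stated in the text.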

While widely adopted [16, 32, 3, 8, 40, 18, 41], the SC relaxation is known to be loose. We therefore additionally consider the 1-spectral clustering (1-SC) [21, 22], a tight relaxation based on the 1-Laplacian. However, the relaxation is only tight for bi-partitioning; for multi-way partitioning, recursive splitting is used as a greedy heuristic.

Reducing the original graph size with learned must-link constraints allows us to experiment with 1-SC on state-of-the-art video segmentation benchmarks [8, 17], notwithstanding the increased computational costs.

Graph reduction schemes. Given must-link constraints provided as point groupings $\{I_1, I_2, \ldots, I_{|V^M|}\}$ on the original vertex set, $I_k \subseteq V$, recent work [38, 16] shows how to integrate such constraints into the original problem while preserving, respectively, the NCut and the spectral clustering objective function.

In more detail, integration proceeds by reducing the original graph $G$ to one of smaller size $G^M = (V^M, E^M)$, whereby the vertex set is given by the point grouping $V^M = \{I_1, I_2, \ldots, I_{|V^M|}\}$, the edge set $E^M$ preserves the original node connectivity and weights $w^M_{IJ}$ are estimated so as to preserve the original video segmentation problem in terms of the NCut or spectral clustering objective. In particular, the NCut reduction is given by

$$w^M_{IJ} = \sum_{i \in I} \sum_{j \in J} w_{ij} \tag{3}$$

while the spectral clustering reduction is defined as

$$w^M_{IJ} = \begin{cases} \displaystyle \sum_{i \in I} \sum_{j \in J} w_{ij} & \text{if } I \neq J \\[2ex] \displaystyle \frac{1}{|I|} \sum_{i \in I} \sum_{j \in J} w_{ij} - \frac{|I| - 1}{|I|} \sum_{i \in I} \sum_{j \in V \setminus I} w_{ij} & \text{if } I = J, \end{cases} \tag{4}$$

provided equal affinities of elements of $G$ constrained in $G^M$, cf. [16].
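The NCut reduction of Eq. 3 can be sketched in a few lines of Python. The example below is illustrative only: it assumes a dense symmetric weight matrix and point groupings given as a list of sets, applies the sum of Eq. 3 to every pair of groupings (including a grouping with itself, which yields self-loop weights), and includes a compact NCut evaluation so that the preservation of the objective can be checked on a toy graph. The function names are hypothetical.

```python
def reduce_graph_ncut(weights, groupings):
    """Reduced weights w^M_IJ = sum of w_ij over i in I, j in J (Eq. 3).

    weights   : symmetric n x n list of lists of affinities w_ij
    groupings : list of disjoint sets of node indices covering 0 .. n - 1
    """
    m = len(groupings)
    return [
        [
            sum(weights[i][j] for i in groupings[a] for j in groupings[b])
            for b in range(m)
        ]
        for a in range(m)
    ]


def ncut(weights, partition):
    """Normalized cut (Eq. 2) of `partition`, given as a list of sets."""
    n = len(weights)
    total = 0.0
    for segment in partition:
        cut = sum(weights[i][j] for i in segment
                  for j in range(n) if j not in segment)
        vol = sum(weights[i][j] for i in segment for j in range(n))
        total += cut / vol
    return total
```

For any partition of the original nodes that satisfies the must-links, evaluating `ncut` on the reduced matrix with the corresponding partition of groupings should return the same value as on the original graph, which is the equivalence the text attributes to [38].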

3.3 Learning

An ideal must-link constraining function $M$ (Eq. 1) should only merge superpixels which are correct, i.e. belong to the same set in the optimal segmentation. From an implementation viewpoint, it is convenient to consider instead $M_{pw}$, defined over the set of edges $E$ of the graph $G$ representing the video sequence:

$$M_{pw} : E \mapsto \{0, 1\} \tag{5}$$

$M_{pw}$ casts the must-link constraining problem as a binary classification one, where a true output for an input edge $e_{ij}$ means that $i$ and $j$ belong to the same point grouping, in the must-link constrained graph $G^M$.
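Because must-links are transitive, per-edge decisions of a pairwise classifier induce point groupings as connected components of the linked edges. The following minimal Python sketch shows one way to recover those groupings with a union-find structure; the function name, the edge-list format and the use of union-find are our assumptions, and the classifier producing the predictions is not shown.

```python
def groupings_from_edge_predictions(num_nodes, edges, predictions):
    """Turn per-edge must-link decisions into point groupings.

    num_nodes   : number of superpixels, with ids 0 .. num_nodes - 1
    edges       : list of (i, j) node pairs
    predictions : list of 0/1 outputs, one per edge (1 = must-link)
    """
    parent = list(range(num_nodes))

    def find(i):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the two endpoints of every edge predicted as a must-link.
    for (i, j), linked in zip(edges, predictions):
        if linked:
            parent[find(i)] = find(j)

    # Collect nodes by root; each root identifies one point grouping.
    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=min)
```

A single wrong positive prediction merges two whole components, which is why the text stresses that the decided constraints must be correct.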
