Online Video SEEDS for Temporal Window Objectness
Michael Van den Bergh¹  Gemma Roig¹  Xavier Boix¹  Santiago Manen¹  Luc Van Gool¹,²
¹ETH Zürich, Switzerland  ²KU Leuven, Belgium
{vamichae,boxavier,gemmar,vangool}@vision.ee.ethz.ch
Abstract
Superpixel and objectness algorithms are broadly used as a pre-processing step to generate support regions and to speed up further computations. Recently, many algorithms have been extended to video in order to exploit the temporal consistency between frames. However, most methods are computationally too expensive for real-time applications. We introduce an online, real-time video superpixel algorithm based on the recently proposed SEEDS superpixels. A new capability is incorporated which delivers multiple diverse samples (hypotheses) of superpixels in the same image or video sequence. The multiple samples are shown to provide a strong cue to efficiently measure the objectness of image windows, and we introduce the novel concept of objectness in temporal windows. Experiments show that the video superpixels achieve comparable performance to state-of-the-art offline methods while running at 30 fps on a single 2.8 GHz i7 CPU. State-of-the-art performance on objectness is also demonstrated, yet orders of magnitude faster and extended to temporal windows in video.
1. Introduction
Many algorithms use superpixels or objectness scores to efficiently select areas to analyze further. With an increasing number of papers on the analysis of videos, the interest in extracting similar concepts from time sequences is increasing as well. The exploitation of temporal continuity can indeed help boost several types of applications. Yet, most current solutions are computationally expensive and non-causal (i.e. they need to see the whole video first). We propose a novel method for the online extraction of video superpixels. Among its still-image counterparts, it comes closest to the recently introduced SEEDS superpixels [15].
Similar to SEEDS, we define an objective function that prefers video superpixels to have a homogeneous color, and our video superpixels can be extracted efficiently. Their optimization is based on iteratively refining the partition by exchanging blocks of pixels between superpixels. When starting off the partition of a new video frame, we exploit the hierarchical superpixel organization of the previous frame, the coarser levels of which serve as initialization.

Moreover, we propose a method to extract multiple superpixel partitions with a value of the objective function close to that of the optimum. Typically the overlapping superpixels differ in non-essential parts of their contours, but those segments that correspond to a genuine object contour are shared. This allows us to introduce a new and highly efficient objectness measure, together with its natural extension to videos (a tube of bounding boxes spanning a time interval). Fig. 1 depicts a summary of the contributions of the paper.

Figure 1. Top: Video SEEDS provide temporal superpixel tubes. Bottom: Randomized SEEDS efficiently produce multiple label hypotheses per frame. Based on these, a Video Objectness measure is introduced to propose temporal windows (tubes of bounding boxes) that are likely to contain objects.

This work has been supported by the European Commission project RADHAR (FP7 ICT 248873).
We experimentally validate the video superpixel and objectness algorithms, using standard benchmarks where possible. Both methods achieve state-of-the-art results, but at much higher speeds than available methods.

2013 IEEE International Conference on Computer Vision
1550-5499/13 $31.00 © 2013 IEEE
DOI 10.1109/ICCV.2013.54
377
2. Related Work
In this section, we review previous work related to superpixels and objectness in videos, the two tasks tackled in this paper.
Video Superpixels. Most methods are approaches for still images that have been extended to video. They either progressively add cuts or grow superpixels from centers. Methods that add cuts include the graph-based method [5] and its hierarchical extensions [8, 17], segmentation by weighted aggregation (SWA) [12], and normalized cuts with Nystrom optimization [7]. Methods that grow superpixels from centers are based on mean shift [10, 9]. Our method also starts from a still-image method, i.e. the recently introduced SEEDS approach [15]. Thus, our approach can be seen as adding a third strand to video superpixel extraction, namely one that moves the boundaries in an initial superpixel partition.
Recently, Xu et al. [16, 17] proposed a benchmark to evaluate video superpixels and a framework for streaming video segmentation using the graph-based superpixel approach of [5]. They achieved state-of-the-art results, but only at 4 seconds/frame, i.e. 2 orders of magnitude from real-time.
Temporal Window Objectness. The objectness measure was introduced by Alexe et al. [1] for still images, whereafter [11] and [6] introduced new cues to boost performance. To the best of our knowledge, objectness throughout video shots has not been introduced before. It should not be confused with the recently introduced dynamic objectness [13], which extracts objectness within a frame by including instantaneous motion. In contrast, we deliver tubes of bounding boxes throughout extended time intervals.
3. Video SEEDS
In this section, we first review the SEEDS algorithm [15] for the extraction of superpixels in stills. Subsequently, we discuss the extension of this concept to videos, the corresponding energy function, and how to optimize it.
3.1. SEEDS for stills
Let s represent the superpixel partition of an image, such that s : {1, ..., N} → {1, ..., K}, in which N represents the number of pixels in the image and K the number of superpixels. Superpixels are constrained to be contiguous blobs, which is indicated by s ∈ S, where S is the set of valid superpixel partitions. The SEEDS approach [15] for extracting superpixels in stills serves as the starting point for our video extension. Yet, we propose important refinements on which the algorithm's efficiency critically depends.
Figure 2. Overview of the Video SEEDS algorithm: the superpixel labels are propagated at an intermediary step of block-level updates. The result is fine-tuned for each frame individually.
SEEDS extracts superpixels by maximizing an objective function, thus enforcing the color histogram of each superpixel to be concentrated in a single bin. The hill climbing optimization starts from a grid of square superpixels, which it iteratively refines by swapping blocks of pixels at their boundaries. We chose SEEDS as it extracts superpixels in real-time on a single CPU.
3.2. SEEDS for videos
Our video approach propagates superpixels over multiple frames to build 3D spatio-temporal constructs. As time goes on, new video superpixels can appear and others may terminate. In the literature, this is controlled by constraining the number of superpixel tubes in the sequence. For online applications this is not possible, however, since the upcoming length and content of the sequence are unknown. Thus, we use alternative constraints defined through 2 parameters:

Superpixels per frame: the number of superpixels into which each single frame is partitioned.

Superpixel rate: the rate of creating/terminating superpixels over time.

In order to fulfill both constraints, the termination of a superpixel implies the creation of a new one in the same frame. In the experiments, we discuss how we select these parameters.
Let S be the set of valid partitions of a video. These are the partitions for which the superpixels are contiguous blobs in all frames and that exhibit the correct superpixel-per-frame and superpixel-rate behavior. Let A_k^t denote the set of pixels that belong to superpixel k at frame t. To indicate all pixels of the video superpixel up to frame t, we use A_k^{t:0}.

Figure 3. Hierarchy of blocks of pixels of 4 layers.
Similarly to [15], the energy function encourages color homogeneity within the 3D superpixels. We use a color histogram of each superpixel to evaluate this. The color histogram of A_k^{t:0} is written as c_{A_k^{t:0}}. Let H_j be a subset of the color space which determines the colors in a bin of the histogram. Then the energy function is

H(s) = \sum_k \sum_{H_j} \left( c_{A_k^{t:0}}(j) \right)^2,    (1)

which is maximal when the histograms have only one non-zero bin for each video superpixel.
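As a concrete illustration, the energy of Eq. (1) for a single frame can be sketched as below; the function and variable names are ours, and we assume each pixel's color bin has been precomputed (a simplification, not the paper's implementation):

```python
import numpy as np

def seeds_energy(labels, bins, num_superpixels, num_bins):
    """Per-frame version of Eq. (1): sum over superpixels of the squared
    (normalized) color-histogram bins. The value is maximal, equal to the
    number of superpixels, when every superpixel's histogram is
    concentrated in a single bin."""
    energy = 0.0
    for k in range(num_superpixels):
        hist = np.bincount(bins[labels == k], minlength=num_bins).astype(float)
        if hist.sum() > 0:
            hist /= hist.sum()        # normalized histogram c_{A_k}
        energy += np.sum(hist ** 2)   # sum_j c_{A_k}(j)^2
    return energy
```

With normalized histograms, a two-superpixel frame in which each superpixel is single-colored scores 2.0, while mixing colors inside a superpixel lowers its contribution.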
3.3. Online Optimization via Hill Climbing
The optimization algorithm is designed to maximize the energy function in an online fashion (i.e. only using past frames and at video rate). It computes the partition of the current frame, starting from an approximation of the last partition. Once the partition of the current frame is delivered, it remains fixed. We introduce a hill climbing algorithm that runs in real-time. It maximizes the energy by exchanging pixels between superpixels at their boundaries. This section describes the optimization in more detail. See Fig. 2 for an overview of the algorithm.
Hierarchy of blocks of pixels. Both the pixel exchange between superpixels and their temporal propagation are regulated through blocks of pixels. The SEEDS algorithm [15] started by dividing a still image into a regular grid of blocks. An important difference with our algorithm is that we consider a hierarchy of blocks of different sizes. Starting from pixels as the most detailed scale, 2×2 or 3×3 pixel blocks are formed for the second layer (how that choice is made is clarified below). Further layers each time combine 2×2 blocks of the previous one. The block size at the second layer (2×2 or 3×3) and the number of layers are chosen such that the image subdivision at the highest layer approximately yields the prescribed number of superpixels per frame. Fig. 3 illustrates an example hierarchy with 4 layers of block sizes.
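The choice of base block size and layer count can be sketched as a small search; this helper and its names are our illustration, not the authors' code:

```python
def choose_hierarchy(width, height, target_superpixels):
    """Pick the base block size (2x2 or 3x3) and the number of layers so
    that the grid of top-layer blocks approximately matches the requested
    superpixels per frame. Each layer above the base merges 2x2 blocks,
    so a top-layer block has side base * 2**(layers - 2) pixels."""
    best = None
    for base in (2, 3):
        for layers in range(2, 8):
            side = base * 2 ** (layers - 2)
            count = (width // side) * (height // side)
            err = abs(count - target_superpixels)
            if best is None or err < best[0]:
                best = (err, base, layers, count)
    _, base, layers, count = best
    return base, layers, count
```

For a 480×360 frame and a target of 200 superpixels, this search selects 2×2 base blocks and 6 layers (15×11 = 165 top-layer blocks).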
Pixel and block-level updates. An initial partition of the current frame is provided by the previous frame; this propagation process is described below. For the first frame, the initial partition corresponds to the highest block layer as just described, i.e. a regular grid. The hill climbing optimization starts from the initialization and then iteratively proposes local changes to the partition. Multiple pixel block exchanges between superpixels are considered, one after the other. If such an exchange increases the objective function, it is accepted and the partition is updated; otherwise, the exchange is discarded. The exchanged pixel blocks are adjacent to the superpixel boundaries. The algorithm starts by exchanging bigger blocks, and then it descends in the block hierarchy until it reaches the pixel level. Thus, in the first iterations larger blocks are exchanged to quickly arrive at a coarse partition that captures the global structure. Later, the partition is refined through smaller blocks and pixels that capture more details. This process is shown in Fig. 4.

Figure 4. Efficient updating at different block sizes.
Let B_n^t be a block of pixels of the current frame that belongs to superpixel n, i.e. B_n^t ⊂ A_n^t ⊂ A_n^{t:0}. To evaluate whether exchanging the block B_n^t from superpixel n to m increases the objective function, we can use one histogram intersection computation, rather than evaluating the complete energy function. This is

int(c_{B_n^t}, c_{A_m^{t:0}}) ≥ int(c_{B_n^t}, c_{A_n^{t:0} \ B_n^t}),    (2)

in which int(·, ·) denotes the intersection between two histograms, and \ the exclusion of a set. Thus, if the intersection of B_n^t with the video superpixel A_m^{t:0} is higher than the intersection with the superpixel it currently belongs to, the exchange is accepted; otherwise it is discarded. The speed of the hill climbing optimization stems from Eq. (2), since it can evaluate a block exchange with a single intersection distance computation.
In the supplementary material we show that using Eq. (2) maximizes the energy under the assumptions that |A_m^{t:0}| ≈ |A_n^{t:0}| and |B_n^t| ≪ |A_n^{t:0}|, where |·| is the cardinality of the set. Also, it assumes that the histogram of B_n^t is concentrated in a single bin. The first assumption is that video superpixels are of similar size and that the blocks are much smaller than the video superpixels. This holds most of the time, since superpixels indeed tend to be of the same size, and the blocks are defined to be at most one fourth of a superpixel in a frame, and hence are much smaller than superpixels extending over multiple frames in the video. The second assumption is that the blocks of pixels have homogeneous color histograms. This was empirically shown to hold in practice by [15] (in more than 90% of the cases), and we observed the same.
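The acceptance test of Eq. (2) amounts to comparing two histogram intersections; a minimal sketch (the names are ours, and the histograms are assumed to have comparable mass):

```python
import numpy as np

def hist_intersection(h1, h2):
    """Intersection of two histograms: sum of bin-wise minima."""
    return float(np.minimum(h1, h2).sum())

def accept_exchange(c_block, c_target, c_source_minus_block):
    """Eq. (2): move block B from superpixel n to m iff B's histogram
    intersects c_{A_m^{t:0}} at least as much as c_{A_n^{t:0} \\ B_n^t}."""
    return hist_intersection(c_block, c_target) >= \
           hist_intersection(c_block, c_source_minus_block)
```

A block that is purely of the target superpixel's dominant color is thus accepted, since its histogram barely intersects the remainder of its source superpixel.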
Creating and terminating video superpixels. According to the superpixel rate, some frames are selected to terminate and create superpixels. When a frame is selected, we first terminate a superpixel, and then we create a new one. To this aim, we introduce inequalities similar to Eq. (2). They allow us to evaluate which termination and creation of superpixels yield higher energy, again using efficient intersection distances.
Fig. 5 illustrates the creation and termination of superpixels with the notation used. When a superpixel is terminated, its pixels at frame t are incorporated into a neighbor superpixel. Let A_n^t ⊂ A_n^{t:0} and A_m^t ⊂ A_m^{t:0} be two candidate superpixels to terminate at frame t. Let A_p^{t:0} and A_q^{t:0} be the superpixel candidates to incorporate A_n^t and A_m^t, respectively. The superpixel with larger intersection with its neighbor is the one selected to terminate, i.e.

int(c_{A_n^t}, c_{A_p^{t:0}}) ≥ int(c_{A_m^t}, c_{A_q^{t:0}}).    (3)
We terminate the superpixel with the highest intersection with its neighbor among all superpixels in the frame. In the supplementary material, we show that Eq. (3) leads to the highest energy state, under the assumptions that |A_p^{t:0}| ≈ |A_q^{t:0}|, |A_n^t| ≪ |A_p^{t:0}|, |A_m^t| ≪ |A_q^{t:0}|, and that both A_n^t and A_m^t have histograms concentrated in one bin. These are similar to the assumptions for Eq. (2). Additionally, it is also assumed that c_{A_n^{t:0}} ≈ c_{A_n^{(t-1):0}} and c_{A_m^{t:0}} ≈ c_{A_m^{(t-1):0}}. That is, the color histogram of the temporal superpixel remains approximately the same whether the pixels at the current frame are included or excluded. This holds most of the time, given that |A_n^t| ≪ |A_n^{t:0}|.
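Selecting which superpixel to terminate per Eq. (3) is then an argmax of intersections over the candidates; a sketch under our own naming:

```python
import numpy as np

def select_termination(candidates):
    """Eq. (3): candidates is a list of (c_frame, c_neighbor) pairs, where
    c_frame is the histogram of a superpixel's pixels in the current frame
    and c_neighbor that of the superpixel that would absorb them.
    Terminate the candidate with the largest intersection."""
    scores = [float(np.minimum(cf, cn).sum()) for cf, cn in candidates]
    return int(np.argmax(scores))
```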
If a superpixel is terminated, a new one should be created to fulfill the constraint on the number of superpixels per frame (Sec. 3.2). The candidates to form a new superpixel are blocks of pixels that belong to an existing video superpixel. Let B_n^t ⊂ A_n^{t:0} and B_m^t ⊂ A_m^{t:0} be candidate blocks from which to create a new superpixel. We select the block of pixels whose histogram minimally intersects with its current superpixel. This is

int(c_{B_m^t}, c_{A_m^{t:0} \ B_m^t}) ≤ int(c_{B_n^t}, c_{A_n^{t:0} \ B_n^t}).    (4)
We select the block of pixels with minimum intersection in the frame. We show in the supplementary material that this yields the highest energy, assuming that |A_m^{t:0}| ≈ |A_n^{t:0}|, |B_n^t| ≪ |A_n^{t:0}|, |B_m^t| ≪ |A_m^{t:0}|, and that both B_n^t and B_m^t have histograms concentrated in one bin. These assumptions are similar to the ones of Eq. (3).
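Symmetrically, Eq. (4) picks the seed of the new superpixel as the block that fits its current superpixel worst; a sketch with our own names:

```python
import numpy as np

def select_creation(candidates):
    """Eq. (4): candidates is a list of (c_block, c_superpixel_minus_block)
    pairs. The new superpixel is created from the block whose histogram has
    the minimum intersection with the rest of its current superpixel."""
    scores = [float(np.minimum(cb, cs).sum()) for cb, cs in candidates]
    return int(np.argmin(scores))
```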
Figure 5. Termination and creation of superpixels.
Iterations. We can stop the optimization for a frame at any time and obtain a valid partition. We expect a higher value of the energy function if we let the hill climbing do more iterations, until convergence. We can fix the allowed time to run per frame, or set it on-the-fly, depending on the application. In principle, the algorithm can run on an infinitely long video, since it generates the partition online, and in memory we only need the histograms of the video superpixels that propagate to the current frame.
Initialization and Propagation. In the first frame of the video, the superpixels are initialized along a grid using the hierarchy of blocks. In the subsequent frames, the block hierarchy is exploited to initialize the superpixels. Rather than re-initializing along a grid, the new frame is initialized by taking an intermediary block-level result from the previous frame (Fig. 2). In this way, the superpixel structure is propagated from the previous frame while small details are discarded. In practice, we use 4 block layers and propagate at the 2nd layer, as shown in Fig. 4.
4. Randomized SEEDS
Some superpixel methods offer extra capabilities, such as the extraction of a hierarchy of superpixels [17]. In this section, we introduce a new capability of superpixels that, to the best of our knowledge, has not been explored before. In the next section we exploit it to design an objectness measure for temporal windows, though we expect that applications are not limited to that one.

Superpixels are over-segmentations with many more regions than objects in the image. A region that is uniform in color can be over-segmented in many different correct ways, and thus more than one partition can be valid. In Fig. 6 we give an example of different partitions with the same number of superpixels and similar energy values, whose solutions have very similar accuracy according to the superpixel benchmarks. This shows that we can extract multiple samples of superpixel partitions from the same video, all of them of comparable quality.
Figure 6. Different samples of randomized SEEDS segmentations of the same frame and with the same accuracy are combined. For randomized SEEDS, we show the average of the different samples. The objectness score is computed as the sum of the distances to the common superpixel boundaries.

Since there may be a considerable number of such partitions, we aim at extracting samples that differ as much as possible from each other. We found a heuristic that is effective and fast to compute, which consists of injecting noise into the evaluation of the block exchanges in the hill climbing, i.e. into Eq. (2). This is

int(c_{B_n^t}, c_{A_m^{t:0}}) + aξ ≥ int(c_{B_n^t}, c_{A_n^{t:0} \ B_n^t}),    (5)
where ξ is a uniform random noise variable in the interval [−1, 1] and a is a scale factor. Note that if a is small, the noise only affects the block exchanges which do not produce a large change in the energy value. In the experiments section, we analyze the effect of injecting noise by changing its scale a and show that, up to a certain level, the performance is not degraded compared to the sample obtained without adding noise, i.e. a = 0. This corroborates that there exists a diversity of over-segmentations with energy very close to the maximum that are equally valid.
Injecting noise may not be the only way to extract samples, but it is by far the most efficient to compute that we found. For example, changing the order in which we propose the exchanges of blocks of pixels in the hill climbing turned out to be successful, but slower in our implementation.
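The noisy acceptance rule of Eq. (5) only changes the left-hand side of Eq. (2); a sketch (the names are ours, the noise model follows the text):

```python
import random
import numpy as np

def accept_exchange_noisy(c_block, c_target, c_source_minus_block, a, rng=random):
    """Eq. (5): the test of Eq. (2) with additive noise a * xi, xi ~ U[-1, 1].
    A small scale a only flips decisions whose two intersections are nearly
    tied, which is what yields diverse but near-optimal partitions."""
    xi = rng.uniform(-1.0, 1.0)
    lhs = float(np.minimum(c_block, c_target).sum()) + a * xi
    rhs = float(np.minimum(c_block, c_source_minus_block).sum())
    return lhs >= rhs
```

With a = 0 the rule reduces exactly to Eq. (2).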
5. Video Objectness
In this section, we introduce an application of randomized SEEDS to video objectness. It is based on the observation that the coincidences among multiple superpixel partitions reveal the true boundaries of objects. Fig. 6 shows that when superimposing a diverse set of superpixel samples obtained with randomized SEEDS, the boundaries of the objects are preserved, while the boundaries due to over-segmentation fade away. This is because the over-segmentations coincide where there are true region boundaries, and do not in regions of similar uniform color.

In the following, we first define the objectness measure for a still image, and then we introduce how to extend it to temporal windows (tubes of bounding boxes).
Objectness Measure for Still Images. We use O to represent the intersection of several superpixel samples of randomized SEEDS. O(i) takes value 1 if all samples have a superpixel boundary at pixel i, and 0 otherwise. Thus, O is an image that indicates at which pixels the samples of randomized SEEDS agree that there is a superpixel boundary.

We define the objectness score for a still image using O. It measures the closed-boundary characteristic of objects. A bounding box is more likely to contain an object when there is a closed line in O that fits the bounding box tightly. Specifically, we compute the distance from each pixel on the perimeter of the bounding box to the nearest pixel that fulfills O(i) = 1. Thus, for pixels on the bottom or top of the bounding box, the distance is computed to the closest pixel in the same column, and for pixels on one of the sides, in the same row. See Fig. 6 for an illustration. Let X be the set of pixels inside the bounding box, Per(X) the set of pixels on the perimeter of the bounding box, and X_{R,C(p)} the pixels that are inside the bounding box and in the same row or column as pixel p. Thus, the objectness score is

(1/A) \sum_{p ∈ Per(X)} \min_{i ∈ X_{R,C(p)}, O(i)=1} d(p, i),    (6)

where d(·, ·) is the Euclidean distance, and A normalizes the score using the area of the bounding box. In the supplementary material, we show that the score can be computed very efficiently using two levels of integral images, with only 8 additions, allowing for the evaluation of over 100 million bounding boxes per second. To the best of our knowledge, no earlier work has used multiple superpixel hypotheses to build an objectness score. In the experiments, we show that using multiple hypotheses has an important impact on the performance.
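A direct, unoptimized sketch of Eq. (6) follows (the paper uses integral images instead; the names are ours, and rows or columns containing no boundary pixel are simply skipped, an assumption the text does not spell out):

```python
import numpy as np

def objectness_score(O, x0, y0, x1, y1):
    """Eq. (6), naive form: for each pixel on the box perimeter, find the
    nearest boundary pixel (O == 1) inside the box along the same column
    (top/bottom edges) or row (left/right edges), sum those distances and
    normalize by the box area. A smaller score means a closed boundary in
    O fits the box more tightly."""
    area = (x1 - x0 + 1) * (y1 - y0 + 1)
    total = 0.0
    for x in range(x0, x1 + 1):               # top and bottom edges: search column
        ys = np.flatnonzero(O[y0:y1 + 1, x])
        if len(ys):
            total += ys.min()                 # distance from top edge pixel
            total += (y1 - y0) - ys.max()     # distance from bottom edge pixel
    for y in range(y0, y1 + 1):               # left and right edges: search row
        xs = np.flatnonzero(O[y, x0:x1 + 1])
        if len(xs):
            total += xs.min()
            total += (x1 - x0) - xs.max()
    return total / area
```

A box whose perimeter lies exactly on a closed rectangle of boundary pixels scores 0, and the score grows as the box is loosened.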

References
- M. Everingham et al. The Pascal Visual Object Classes (VOC) Challenge.
- P. Felzenszwalb and D. Huttenlocher. Efficient Graph-Based Image Segmentation.
- P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik. Contour Detection and Hierarchical Image Segmentation.
- C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nystrom method.