Book ChapterDOI

ClassCut for unsupervised class segmentation

05 Sep 2010 - pp. 380–393
TL;DR: A novel method for unsupervised class segmentation on a set of images that alternates between segmenting object instances and learning a class model based on a segmentation energy defined over all images at the same time, which can be optimized efficiently by techniques used before in interactive segmentation.
Abstract: We propose a novel method for unsupervised class segmentation on a set of images. It alternates between segmenting object instances and learning a class model. The method is based on a segmentation energy defined over all images at the same time, which can be optimized efficiently by techniques used before in interactive segmentation. Over iterations, our method progressively learns a class model by integrating observations over all images. In addition to appearance, this model captures the location and shape of the class with respect to an automatically determined coordinate frame common across images. This frame allows us to build stronger shape and location models, similar to those used in object class detection. Our method is inspired by interactive segmentation methods [1], but it is fully automatic and learns models characteristic for the object class rather than specific to one particular object/image. We experimentally demonstrate on the Caltech4, Caltech101, and Weizmann horses datasets that our method (a) transfers class knowledge across images and this improves results compared to segmenting every image independently; (b) outperforms Grabcut [1] for the task of unsupervised segmentation; (c) offers competitive performance compared to the state-of-the-art in unsupervised segmentation and in particular it outperforms the topic model [2].

Summary (4 min read)

1 Introduction

  • Image segmentation is a fundamental problem in computer vision.
  • Interestingly, most previous approaches to unsupervised segmentation do not use energy functions similar to those in interactive and supervised segmentation, but instead use topic models [2] or other specialized generative models [10, 12] to find recurring patterns in the images.
  • The authors propose ClassCut, a novel method for unsupervised segmentation based on a binary pairwise energy function similar to those used in interactive/supervised segmentation.
  • Finally, their approach is also related to co-segmentation [21] where the goal is to segment a specific object from two images at the same time.

2 Overview of Our Method

  • The goal is to jointly segment objects of an unknown class from a set of images.
  • Analogous to the scheme of GrabCut [1], ClassCut alternates two stages: (1) learning/updating a class model given the current segmentations (sec. 4); (2) jointly segmenting the objects in all images given the current class model (sec. 3).
  • It converges when the segmentation is unchanged in two consecutive iterations.
  • As the class model is used in the next segmentation iteration, it transfers knowledge across images, typically from easier images to more difficult ones, aiding their segmentation (a sketch of the alternation follows this list).
  • For example, the model might learn in the first iteration that airplanes are typically grayish and the background is often blue; in the next iteration, this will help in images where the airplane is difficult to segment (e.g. because of low contrast).
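To make the alternation concrete, here is a minimal Python sketch of the loop. It is not the authors' implementation; learn_class_model and segment_all_images are hypothetical placeholders for the procedures of sec. 4 and sec. 3.

```python
def classcut(images, init_segmentation, learn_class_model, segment_all_images,
             max_iters=20):
    """Sketch of the ClassCut alternation: learn a class model from the current
    segmentations, re-segment all images jointly with it, and stop when the
    labeling no longer changes (or after max_iters)."""
    segmentation = init_segmentation                 # one set of superpixel labels per image
    model = learn_class_model(images, segmentation)  # stage (1), sec. 4
    for _ in range(max_iters):
        new_segmentation = segment_all_images(images, model)   # stage (2), sec. 3
        if new_segmentation == segmentation:                    # converged: labels unchanged
            break
        segmentation = new_segmentation
        model = learn_class_model(images, segmentation)         # stage (1), sec. 4
    return segmentation, model
```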

3 Segmentation

  • Each image I_n (given either as a full image or as an automatically determined reference frame) consists of superpixels {S_n^1, ..., S_n^{K_n}}.
  • ClassCut searches for the labeling that sets l_n^k = 1 for all superpixels S_n^k on the foreground and l_n^j = 0 for all superpixels S_n^j on the background.

3.1 Prior ΦΘ(L, I)

  • The within-image smoothness prior penalizes neighboring superpixels having different labels; the penalty is smaller if the two superpixels are separated by high gradients.
  • Objects rarely touch the boundary of the reference frame, so a border penalty discourages labeling superpixels that touch the border of the reference frame as foreground (fig. 2; see the sketch after this list).
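A possible reading of the border penalty on superpixels is sketched below in NumPy. It is an illustrative reconstruction under the assumption that each foreground superpixel pays a cost proportional to the fraction of its perimeter lying on the reference-frame border; it is not the paper's code.

```python
import numpy as np

def border_penalty(superpixel_map, labels):
    """Border penalty sketch: each superpixel labeled foreground pays a cost
    proportional to how much of its perimeter touches the reference-frame border.

    superpixel_map: HxW int array with a superpixel id per pixel (reference-frame crop)
    labels:         dict {superpixel id: 0 = background, 1 = foreground}
    """
    h, w = superpixel_map.shape
    border_mask = np.zeros((h, w), dtype=bool)
    border_mask[0, :] = border_mask[-1, :] = True
    border_mask[:, 0] = border_mask[:, -1] = True

    penalty = 0.0
    for sp_id, lab in labels.items():
        if lab != 1:
            continue                                  # only foreground superpixels are penalized
        mask = superpixel_map == sp_id
        border_px = np.count_nonzero(mask & border_mask)
        if border_px == 0:
            continue
        # perimeter estimate: superpixel pixels with at least one 4-neighbour outside it
        interior = np.zeros_like(mask)
        interior[1:-1, 1:-1] = (mask[1:-1, 1:-1] & mask[:-2, 1:-1] & mask[2:, 1:-1]
                                & mask[1:-1, :-2] & mask[1:-1, 2:])
        perimeter_px = max(np.count_nonzero(mask & ~interior), 1)
        penalty += border_px / perimeter_px
    return penalty
```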

3.2 Class Model ΨΘ(L, I)

  • The scalars w are part of the model parameters Θ and weight the terms.
  • To compute the energy contribution for a superpixel S_n^k labeled foreground, the authors average over all positions in S_n^k and incorporate this into eq. (7) as Ω_Θ(L, I) = ∑_n ∑_k (1/|S_n^k|) ∑_{s∈S_n^k} −log p_Ω(l_n^k | s)   (8); a sketch of this term follows the list. Fig. 3a shows a final location model obtained after convergence.
  • Fig. 4 shows an initial shape model and a shape model after convergence.
  • As visual descriptors f the authors use color distributions (col) and bag-of-words [23] of SURF descriptors [24] (bow).
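The averaging in eq. (8) can be written in a few lines of NumPy. The sketch below is an illustrative reconstruction under the assumption that the 32×32 location model stores a foreground probability per cell and is stretched over the reference frame.

```python
import numpy as np

def location_energy(superpixel_map, labels, p_fg_grid):
    """Location term sketch (cf. eq. 8): for every superpixel, average
    -log p(label | position) over its pixels under a 32x32 foreground-probability
    grid stretched over the reference frame.

    superpixel_map: HxW int array of superpixel ids (reference-frame coordinates)
    labels:         dict {superpixel id: 0 or 1}
    p_fg_grid:      32x32 array of foreground probabilities (the location model)
    """
    h, w = superpixel_map.shape
    g = p_fg_grid.shape[0]
    rows = np.arange(h) * g // h                      # grid cell of every pixel row
    cols = np.arange(w) * g // w                      # grid cell of every pixel column
    p_fg = p_fg_grid[rows[:, None], cols[None, :]]    # HxW per-pixel foreground probability
    eps = 1e-8
    energy = 0.0
    for sp_id, lab in labels.items():
        mask = superpixel_map == sp_id
        if not np.any(mask):
            continue
        p = p_fg[mask] if lab == 1 else 1.0 - p_fg[mask]
        energy += np.mean(-np.log(p + eps))           # average over the superpixel's pixels
    return energy
```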

3.3 Energy Minimization

  • To label the superpixels the authors use TRW-S [15].
  • TRW-S not only labels them but also computes a lower bound on the energy which may be used to assess how far from the global optimum the solution is.
  • In their experiments, the authors observed that QPBO labels on average 91% of the superpixels according to the global optimum.
  • Furthermore, the authors observed that the minimization problem is hardest in the first few iterations and easier in the later ones: over the iterations QPBO labels more superpixels and the gap between the lower bound and the actual energy of the solutions also decreases.
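The paper relies on QPBO and TRW-S as solvers. Purely for illustration, the sketch below minimizes a generic binary pairwise energy over superpixels with iterated conditional modes (ICM), a simple local method that lacks the partial-optimality and lower-bound guarantees discussed above; it is not the method used in the paper.

```python
def icm(unary, pairwise, neighbors, max_sweeps=10):
    """Iterated conditional modes for binary superpixel labels (illustration only).

    unary:     {node: (energy_if_background, energy_if_foreground)}
    pairwise:  {frozenset({i, j}): function(label_i, label_j) -> energy}, assumed symmetric
    neighbors: {node: iterable of neighboring nodes}
    """
    labels = {n: 0 for n in unary}                    # start from all-background
    for _ in range(max_sweeps):
        changed = False
        for n in unary:
            costs = []
            for lab in (0, 1):
                c = unary[n][lab]
                for m in neighbors.get(n, ()):
                    c += pairwise[frozenset({n, m})](lab, labels[m])
                costs.append(c)
            best = 0 if costs[0] <= costs[1] else 1   # greedily pick the cheaper label
            if best != labels[n]:
                labels[n] = best
                changed = True
        if not changed:                               # no label changed in a full sweep
            break
    return labels
```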

4.1 Location Model

  • The location model Ω is initialized uniformly.
  • At each iteration, the authors update the parameters of the location model using the current segmentation of all images of the current class according to the maximum likelihood criterion (fig. 3a): for each cell in the 32×32 grid they reestimate the empirical probability of foreground using the current segmentations.
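The update itself is just per-cell averaging of the current binary segmentations. A minimal NumPy sketch follows; how pixels are binned into the 32×32 grid is an assumption, as the paper only specifies the grid size.

```python
import numpy as np

def update_location_model(segmentations, grid=32, eps=1e-3):
    """Maximum-likelihood update of the location model: for every cell of a
    grid x grid lattice, the empirical foreground probability averaged over all
    current segmentations (each an HxW {0,1} array in reference-frame coordinates)."""
    p_fg = np.zeros((grid, grid), dtype=np.float64)
    for seg in segmentations:
        h, w = seg.shape
        rows = np.repeat(np.arange(h) * grid // h, w)   # grid row of every pixel (row-major order)
        cols = np.tile(np.arange(w) * grid // w, h)     # grid column of every pixel
        cell_sum = np.zeros((grid, grid))
        cell_cnt = np.zeros((grid, grid))
        np.add.at(cell_sum, (rows, cols), seg.ravel().astype(np.float64))
        np.add.at(cell_cnt, (rows, cols), 1.0)
        p_fg += cell_sum / np.maximum(cell_cnt, 1.0)    # per-image foreground fraction per cell
    p_fg /= max(len(segmentations), 1)
    return np.clip(p_fg, eps, 1.0 - eps)                # keep probabilities away from 0/1
```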

4.2 Shape Model

  • The shape model Π is initialized by accumulating the boundaries of all superpixels in the reference frame over all images.
  • As the boundaries of superpixels follow likely object boundaries, they will reoccur consistently along the true object boundaries across multiple images.
  • The initial shape model (fig. 4) already contains a rough outline of the unknown object class.
  • At each iteration, the authors update the parameters of the shape model using the current segmentation of all images according to the maximum likelihood criterion: for each of the 5 orientations in the 32×32 grid, they reestimate the empirical probability of a label-change at that position with that orientation (see the sketch after this list).
  • While the shape model captures only the boundaries of an object, not which side is foreground or background, jointly with the location model (and with the between-image smoothness) it encourages similar shapes in similar spatial arrangements to be segmented in all images.
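The corresponding update can be sketched as accumulating label-change pixels into a 32×32×5 histogram. Estimating the boundary orientation from the gradient of the binary mask is an assumption made here for illustration; the paper does not specify the estimator.

```python
import numpy as np

def update_shape_model(segmentations, grid=32, n_orient=5):
    """Re-estimate the shape model: per-image frequency of a label-change at each
    grid x grid cell, for each of n_orient boundary-orientation bins (a proxy for
    the empirical probability of a label-change at that position/orientation)."""
    counts = np.zeros((grid, grid, n_orient))
    for seg in segmentations:
        seg = seg.astype(np.float64)
        h, w = seg.shape
        gy, gx = np.gradient(seg)                        # gradient of the binary mask
        boundary = (np.abs(gx) + np.abs(gy)) > 0         # pixels where the label changes
        theta = np.mod(np.arctan2(gy, gx), np.pi)        # boundary orientation in [0, pi)
        obin = np.minimum((theta / np.pi * n_orient).astype(int), n_orient - 1)
        ys, xs = np.nonzero(boundary)
        rows = ys * grid // h
        cols = xs * grid // w
        np.add.at(counts, (rows, cols, obin[ys, xs]), 1.0)
    return counts / max(len(segmentations), 1)           # average count per image
```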

4.3 Appearance Model

  • The appearance models Υ_f are initialized using the color/SURF observations from all images and an initial segmentation.
  • This initial segmentation is obtained from a generic prior of object location trained on an external set of images with objects of other classes and their ground-truth segmentations (fig. 3b).
  • From this object location prior, the authors select the top 75% pixels as foreground; the remaining 25% as background.
  • The authors observe that this location prior is essentially a Gaussian in the middle of the reference frame.
  • If the authors are using automatically determined reference frames, the observations for the background are collected from both pixels outside the reference frame and pixels inside the reference frame but labelled as background.
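A color-only sketch of this initialization is given below (the bag-of-SURF part is omitted); the histogram binning is an assumption made for illustration.

```python
import numpy as np

def init_appearance_from_prior(image, location_prior, bins=8):
    """Initialize foreground/background color histograms from a generic location
    prior (sketch): the top 75% of pixels by prior value seed the foreground
    model, the remaining 25% the background model.

    image:          HxWx3 uint8 RGB image (reference-frame crop)
    location_prior: HxW array, higher = more likely foreground
    """
    thresh = np.percentile(location_prior, 25)           # top 75% of pixels are foreground seeds
    fg = location_prior >= thresh
    quant = image.astype(int) * bins // 256              # quantize each channel into `bins` levels
    idx = quant[..., 0] * bins * bins + quant[..., 1] * bins + quant[..., 2]
    def hist(mask):
        h = np.bincount(idx[mask], minlength=bins ** 3).astype(np.float64)
        return (h + 1.0) / (h.sum() + bins ** 3)          # Laplace-smoothed color distribution
    return hist(fg), hist(~fg)                            # p(color | fg), p(color | bg)
```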

5 Finding the Reference Frame

  • To find the reference frame, the authors use the objectness measure of [18] which quantifies how likely it is for an image window to contain an object of any class.
  • Objectness is trained to distinguish windows containing an object with a well-defined boundary and center, such as cows and telephones, from amorphous background windows, such as grass and road.
  • The authors sample 1000 windows likely to contain an object from this measure, project the object location prior (sec. 4.3) into these windows and accumulate into an objectness map M (fig. 5, (bottom)).
  • M will have peaks on the objects in the image.
  • In the experiments the authors demonstrate that this method improves the results of unsupervised segmentation compared to using the full images (sec. 6).
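The accumulation can be sketched as below. The window format (x0, y0, x1, y1) and the nearest-neighbour resizing of the prior are assumptions made for illustration, and the window sampler from [18] is treated as an external input.

```python
import numpy as np

def objectness_map(image_shape, windows, location_prior):
    """Accumulate the objectness map M: the generic object location prior is
    projected (here: nearest-neighbour resized) into every sampled window and
    summed up. Peaks of M indicate likely object locations.

    image_shape:    (H, W) of the image
    windows:        iterable of (x0, y0, x1, y1) windows, e.g. 1000 samples from [18],
                    assumed to lie inside the image
    location_prior: PxQ array, the generic location prior of sec. 4.3
    """
    H, W = image_shape
    M = np.zeros((H, W))
    P, Q = location_prior.shape
    for (x0, y0, x1, y1) in windows:
        h, w = y1 - y0, x1 - x0
        if h <= 0 or w <= 0:
            continue
        rows = np.arange(h) * P // h                     # nearest source row in the prior
        cols = np.arange(w) * Q // w                     # nearest source column in the prior
        M[y0:y1, x0:x1] += location_prior[rows[:, None], cols[None, :]]
    return M
```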

6.1 Datasets

  • The authors evaluate their unsupervised segmentation method on three datasets of varying difficulty and compare the results to a single-image GrabCut and to other stateof-the-art methods.
  • In no experiment are training images with segmentations of the unknown class used.
  • The authors use the experimental setup of [9]: for the classes airplanes, cars, faces, and motorbikes, they use the test images of [27] and segment the objects using no training data.
  • The authors use an experimental setup similar to [2]: for 28 classes, they randomly select 30 images each and determine the segmentations of the objects.
  • Note that [2] additionally uses 30 training images for each class and solves a joint segmentation and classification task (not done here).

6.2 Baselines and the State of the Art

  • To initialize GrabCut, the authors train a foreground color model from the central 25% of the area of the image and a background model from the rest.
  • Using these models, GrabCut is iterated until convergence for each image individually.
  • Notice how the automatic reference frame improves the results of GrabCut from line (c) to (d) and how GrabCut is a strong competitor for previous methods [2, 9] that were designed for unsupervised segmentation.
  • For the datasets for which results are available, the authors compare their approach to Spatial Topic Models [2].
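As a rough stand-in for this baseline, OpenCV's GrabCut can be initialized from the central rectangle covering 25% of the image area. Note that the paper trains explicit foreground/background color models from these regions rather than using a rectangle mode, so the sketch below is only an approximation.

```python
import numpy as np
import cv2

def grabcut_baseline(image_bgr, iters=5):
    """Single-image GrabCut baseline (approximation using OpenCV): the centered
    rectangle covering 25% of the image area seeds the foreground, the rest the
    background."""
    h, w = image_bgr.shape[:2]
    rect = (w // 4, h // 4, w // 2, h // 2)       # (x, y, width, height): central 25% of the area
    mask = np.zeros((h, w), np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, rect, bgd_model, fgd_model, iters, cv2.GC_INIT_WITH_RECT)
    # pixels marked (probable) foreground form the final segmentation
    return np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
```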

6.3 ClassCut

  • The authors evaluate the ability of ClassCut to segment objects of an unknown class in a set of images.
  • Note also how ClassCut improves its accuracy over iterations (line (e) to (f)), showing that it is properly learning about the class.
  • Using ClassCut the authors obtain a segmentation accuracy of 83.6%, outperforming both GrabCut (line (c)) and the spatial topic model [2] (line (a)).
  • Since neither [2] nor [9] uses such a measure, the authors compare to the GrabCut baseline.
  • This shows that the segmentations obtained using ClassCut are better aligned to the ground-truth segmentation than those from GrabCut.

7 Conclusion

  • The authors presented a novel approach to unsupervised class segmentation.
  • The authors' approach alternates between jointly segmenting the objects in all images and updating a class model, which allows it to benefit from the insights gained in interactive segmentation and object class detection.
  • The authors' model comprises inter-image priors and a comprehensive class model accounting for object appearance, shape, and location w.r.t. an automatically determined reference frame.
  • The authors demonstrate that the reference frame allows learning a novel type of shape model and aids the segmentation process.


ClassCut for Unsupervised Class Segmentation
Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari
Computer Vision Laboratory, ETH Zurich, Zurich, Switzerland
{bogdan,deselaers,ferrari}@vision.ee.ethz.ch
1 Introduction
Image segmentation is a fundamental problem in computer vision. Over the past
years methods that use graph-cut to minimize binary pairwise energy functions
have become the de-facto standard for segmenting specific objects in individual
images [1, 3, 4]. These methods employ appearance models for the foreground
and background which are estimated through user interactions [1, 3, 4].
On the one hand, analogous approaches have been presented for object class
segmentation where the appearance models are learned from a set of training
images with ground-truth segmentations [5–7]. However, obtaining ground-truth
segmentations is cumbersome and error-prone.
On the other hand, approaches to unsupervised class segmentation have also
been proposed [2, 8–10, 12, 13]. In unsupervised segmentation a set of images
depicting different instances of an object class is given, but without information
about the appearance and shape of the objects to be segmented. The aim of an
algorithm is to automatically segment the object instance in each image.

Interestingly, most previous approaches to unsupervised segmentation do not
use energy functions similar to those in interactive and supervised segmentation,
but instead use topic models [2] or other specialized generative models [10, 12]
to find recurring patterns in the images.
We propose ClassCut, a novel method for unsupervised segmentation based on
a binary pairwise energy function similar to those used in interactive/supervised
segmentation. As opposed to those, our energy function is defined over a set
of images rather than on one image [1, 3–5]. Inspired by GrabCut [1], where
the two stages of learning the foreground/background appearance models and
segmenting the image are alternated, our method alternates between learning a
class model and segmenting the objects in all images jointly. The class model is
learned from all images at the same time, so as to capture knowledge about the
class rather than specific to one image [1]. Therefore, it helps the next segmentation
iteration, as it transfers between images knowledge about the appearance
and shape of the class. Thanks to the nature of our energy function, we can
segment all images jointly using existing efficient algorithms used in interactive
segmentation approaches [1, 3, 14, 15].
Inspired by representations successfully used in supervised object class detection [16, 17], our approach anchors the object class in a reference coordinate frame common across images. This enables modeling the spatial structure and shape of the class, as well as designing novel priors tailored to the unsupervised segmentation task. We determine this reference frame automatically in every image with a procedure based on a salient object detector [18].
At each iteration ClassCut updates the class model, which captures the appearance, shape, and location distribution of the class within the reference frame. The final output of the method is the set of images with segmented object instances as well as the class model.
In the experiments, we demonstrate that our method (a) transfers knowledge
between images and this improves the performance over segmenting each image
independently; (b) outperforms the original GrabCut [1], which is the main inspiration behind it and turns out to be a very competitive baseline for unsupervised segmentation; (c) offers competitive performance compared to the state-of-the-art in unsupervised segmentation; (d) learns meaningful, intuitive class models.
Source code for ClassCut is available at http://www.vision.ee.ethz.ch/~calvin.
Related Work. We discussed in the introduction that our method employs
energy minimization techniques used in interactive segmentation [1, 3, 4, 14, 15],
and how it is related to supervised [5, 7, 19] as well as to unsupervised [2, 10–12]
class segmentation methods.
A different task is object discovery, which aims at finding multiple object
classes from a mixed set of unlabeled images [11, 29]. In our work instead, all
images contain instances of one class.
The two works closest to ours are [8, 9], which have a procedure iterating between updating a model and segmenting the images. In [8] the model is given a set of class and non-class images and then it iteratively improves the foreground/background labeling of image fragments based on their class likelihoods.

Their method learns local segmentation masks for image fragments, while our
method learns a more complete class model, including appearance, shape and
location in a global reference frame.
Arora et al. [9] learn a template consistent over all images using variational
inference. Their template model is very different from our class model, and closer
to a constellation model [20]. Moreover, their method optimizes the segmentation
of the images individually rather than jointly.
Finally, our approach is also related to co-segmentation [21] where the goal is
to segment a specific object from two images at the same time. Here we try to go
a step further and co-segment a set of images showing different object instances
of an unknown class.
2 Overview of Our Method
The goal is to jointly segment objects of an unknown class from a set of images. Analogous to the scheme of GrabCut [1], ClassCut alternates two stages: (1) learning/updating a class model given the current segmentations (sec. 4); (2) jointly segmenting the objects in all images given the current class model (sec. 3). It converges when the segmentation is unchanged in two consecutive iterations.
Our segmentation model for stage (2) is a binary pairwise energy function, which can be optimized efficiently by techniques used in interactive segmentation [1, 3, 22], but jointly over all images rather than on a single image [1] (sec. 3).
In stage (1), learning the class model over all images at once enables capturing knowledge characteristic for the class rather than specific to a particular image [1]. As the class model is used in the next segmentation iteration it transfers knowledge across images, typically from easier images to more difficult ones, aiding their segmentation. For example, the model might learn in the first iteration that airplanes are typically grayish and the background is often blue (fig. 1). In the next iteration, this will help in images where the airplane is difficult to segment (e.g. because of low contrast).
The class model we propose (sec. 3.2) consists of several components modeling different class characteristics: appearance, location, and shape. In addition to a color component also used in GrabCut [1], the appearance model includes a bag-of-words [23] of SURF descriptors [24], which is well suited for modeling class appearance. Moreover, we model the location (sec. 3.2) and shape (sec. 3.2) of the object class w.r.t. a reference coordinate frame common across images (sec. 5). Overall, our model focuses on knowledge at the class level rather than at the level of one object as in the works it is inspired from [1, 4].
In addition to the class model, the segmentation energy includes priors tailored for segmenting classes (sec. 3.1). The priors are defined on superpixels [25], which act as grouping units for homogeneous areas. Superpixels bring two advantages: (i) they provide additional structure, i.e. the set of possible segmentations is reduced to those aligning well with image boundaries; (ii) they reduce the computational complexity of segmentation. We formulate four class segmentation priors over superpixels and multiple images (sec. 3.1).

Fig. 1. Overview of our method. The top row shows the input images, the automatically determined reference frames, and the initial location and shape models. The bottom row shows how the segmentations evolve over the iterations as well as the final location and shape models.
If a common reference frame on the objects is available, our method exploits
it to anchor the location and shape models to it and to improve the effectiveness
of some of the priors. We apply a salient object detector [18] to determine this
reference frame automatically (sec. 5). In sec. 6 we show how this detector improves segmentation results compared to using the whole image as a reference
frame. Fig. 1 shows an overview of the entire method.
3 Segmentation

In the set of images I = {I_1, ..., I_N}, each image I_n (given either as a full image or as an automatically determined reference frame) consists of superpixels {S_n^1, ..., S_n^{K_n}}. We search for the labeling

L* = ( (l_1^1, ..., l_1^{K_1}), ..., (l_n^1, ..., l_n^{K_n}), ..., (l_N^1, ..., l_N^{K_N}) )

that sets l_n^k = 1 for all superpixels S_n^k on the foreground and l_n^j = 0 for all superpixels S_n^j on the background. To determine L*, we minimize

L* = argmin_L E_Θ(L, I)   with   E_Θ(L, I) = Φ_Θ(L, I) + Ψ_Θ(L, I)   (1)

where Φ is the segmentation prior (sec. 3.1) and Ψ is the class model (sec. 3.2). In sec. 3.3 we describe how to minimize eq. (1). Θ are the parameters of the model.
3.1 Prior Φ_Θ(L, I)
The prior Φ consists of four terms

Φ_Θ(L, I) = w_Λ Λ(L, I) + w_χ χ(L, I) + w_Γ Γ(L, I) + w_Δ Δ(L, I)   (2)
The scalars w are part of the model parameters Θ and weight the terms. Below
we describe the terms in detail.

Fig. 2. Priors. (a) The smoothness prior between two superpixels is weighted inversely
to the sum over the gradients along their boundary (shown in yellow and blue for
two pairs of superpixels). (b) The between image smoothness prior is weighted by
the overlap (yellow) of superpixels (shown for two pairs of superpixels (red/green) in
two images). (c) The border penalty assigns high values to superpixels touching the
reference frame boundary (dark=low values, bright=high values).
The Within Image Smoothness Λ is a smoothness prior for superpixels
which generalizes the pixel-based smoothness priors typically used in interactive
segmentation [1]. It penalizes neighboring superpixels having different labels.
Λ(L, I) = ∑_n ∑_{j,k} δ(l_n^j ≠ l_n^k) exp(−grad(S_n^j, S_n^k))   (3)

where j, k are the indices of neighboring superpixels S_n^j, S_n^k within image I_n, and δ(l_n^j ≠ l_n^k) = 1 if the labels l_n^j, l_n^k are different and 0 otherwise. The gradient grad(S_n^j, S_n^k) between S_n^j and S_n^k is computed by summing the gradient magnitudes [26] along the boundary between S_n^j and S_n^k (fig. 2a), normalized w.r.t. the length of the boundary. Thus, the penalty is smaller if the two superpixels are separated by high gradients. This term encourages segmentations aligned with the image gradients.
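For illustration, the pairwise weight of eq. (3) between two neighbouring superpixels can be sketched as follows (NumPy). This is a reconstruction from the description above, not the original implementation; the boundary tracing and the gradient operator [26] are simplified.

```python
import numpy as np

def smoothness_weight(superpixel_map, grad_mag, j, k):
    """Within-image smoothness weight between neighbouring superpixels j and k:
    average the image gradient magnitude over the pixels along their shared
    boundary and return exp(-grad), so high-contrast boundaries are cheap to cut."""
    mask_j = superpixel_map == j
    mask_k = superpixel_map == k
    # shared boundary: pixels of j with a 4-neighbour belonging to k
    boundary = np.zeros_like(mask_j)
    boundary[1:, :]  |= mask_j[1:, :]  & mask_k[:-1, :]
    boundary[:-1, :] |= mask_j[:-1, :] & mask_k[1:, :]
    boundary[:, 1:]  |= mask_j[:, 1:]  & mask_k[:, :-1]
    boundary[:, :-1] |= mask_j[:, :-1] & mask_k[:, 1:]
    n = np.count_nonzero(boundary)
    if n == 0:
        return 0.0                                  # not neighbours: no smoothness interaction
    avg_grad = grad_mag[boundary].sum() / n         # normalized by boundary length
    return float(np.exp(-avg_grad))
```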
The Between Image Smoothness χ operates on superpixels across images.
It encourages superpixels in different images but with similar location w.r.t. the
reference frame to have the same label:
χ(L, I) = ∑_{n,m} ∑_{j,k} δ(l_n^j ≠ l_m^k) · |S_n^j ∩ S_m^k| / |S_n^j ∪ S_m^k|   (4)

where n, m are two images and j, k superpixels, one in I_n, the other in I_m. This penalty grows with the overlap of the superpixels (measured as area of intersection over area of union). Therefore only overlapping superpixels interact (fig. 2b). This term encourages similar segmentations across all images (w.r.t. the reference frame).
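The between-image term only needs the intersection-over-union of two superpixels in the common reference frame. A minimal sketch, assuming both superpixel masks have already been mapped to the same reference-frame grid:

```python
import numpy as np

def overlap_weight(mask_a, mask_b):
    """Between-image smoothness weight (cf. eq. 4): intersection over union of
    two superpixel masks given in the common reference frame. Non-overlapping
    superpixels get weight 0 and therefore do not interact."""
    inter = np.count_nonzero(mask_a & mask_b)
    if inter == 0:
        return 0.0
    union = np.count_nonzero(mask_a | mask_b)
    return inter / union
```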
The Border Penalty Γ prefers superpixels at the image boundary to be
labeled background. Objects rarely touch the boundary of the reference frame.
Notice how the object would touch even a tight bounding-box around itself only
in a few points (e.g. fig. 2a). The border penalty
Γ(L, I) = ∑_n ∑_k l_n^k · border(S_n^k) / perimeter(S_n^k)   (5)

assigns to each superpixel S_n^k a penalty proportional to the number of its pixels touching the border of the reference frame (border(S_n^k)), normalized by its perimeter (perimeter(S_n^k)).

Citations
Book ChapterDOI
07 Oct 2012
TL;DR: The goal is to parse typical, often messy, indoor scenes into floor, walls, supporting surfaces, and object regions, and to recover support relationships, to better understand how 3D cues can best inform a structured 3D interpretation.
Abstract: We present an approach to interpret the major surfaces, objects, and support relations of an indoor scene from an RGBD image. Most existing work ignores physical interactions or is applied only to tidy rooms and hallways. Our goal is to parse typical, often messy, indoor scenes into floor, walls, supporting surfaces, and object regions, and to recover support relationships. One of our main interests is to better understand how 3D cues can best inform a structured 3D interpretation. We also contribute a novel integer programming formulation to infer physical support relations. We offer a new dataset of 1449 RGBD images, capturing 464 diverse indoor scenes, with detailed annotations. Our experiments demonstrate our ability to infer support relations in complex scenes and verify that our 3D scene cues and inferred support lead to better object segmentation.

4,827 citations

Book ChapterDOI
06 Sep 2014
TL;DR: A novel method for generating object bounding box proposals using edges is proposed, showing results that are significantly more accurate than the current state-of-the-art while being faster to compute.
Abstract: The use of object proposals is an effective recent approach for increasing the computational efficiency of object detection. We propose a novel method for generating object bounding box proposals using edges. Edges provide a sparse yet informative representation of an image. Our main observation is that the number of contours that are wholly contained in a bounding box is indicative of the likelihood of the box containing an object. We propose a simple box objectness score that measures the number of edges that exist in the box minus those that are members of contours that overlap the box’s boundary. Using efficient data structures, millions of candidate boxes can be evaluated in a fraction of a second, returning a ranked set of a few thousand top-scoring proposals. Using standard metrics, we show results that are significantly more accurate than the current state-of-the-art while being faster to compute. In particular, given just 1000 proposals we achieve over 96% object recall at overlap threshold of 0.5 and over 75% recall at the more challenging overlap of 0.7. Our approach runs in 0.25 seconds and we additionally demonstrate a near real-time variant with only minor loss in accuracy.

2,892 citations

Journal ArticleDOI
TL;DR: In this paper, a generic objectness measure is proposed to quantify how likely an image window is to contain an object of any class, such as cows and telephones, from amorphous background elements such as grass and road.
Abstract: We present a generic objectness measure, quantifying how likely it is for an image window to contain an object of any class. We explicitly train it to distinguish objects with a well-defined boundary in space, such as cows and telephones, from amorphous background elements, such as grass and road. The measure combines in a Bayesian framework several image cues measuring characteristics of objects, such as appearing different from their surroundings and having a closed boundary. These include an innovative cue to measure the closed boundary characteristic. In experiments on the challenging PASCAL VOC 07 dataset, we show this new cue to outperform a state-of-the-art saliency measure, and the combined objectness measure to perform better than any cue alone. We also compare to interest point operators, a HOG detector, and three recent works aiming at automatic object segmentation. Finally, we present two applications of objectness. In the first, we sample a small numberof windows according to their objectness probability and give an algorithm to employ them as location priors for modern class-specific object detectors. As we show experimentally, this greatly reduces the number of windows evaluated by the expensive class-specific model. In the second application, we use objectness as a complementary score in addition to the class-specific model, which leads to fewer false positives. As shown in several recent papers, objectness can act as a valuable focus of attention mechanism in many other applications operating on image windows, including weakly supervised learning of object categories, unsupervised pixelwise segmentation, and object tracking in video. Computing objectness is very efficient and takes only about 4 sec. per image.

1,223 citations



Cites background or methods from "ClassCut for unsupervised class seg..."

  • ...Several recent works are increasingly demonstrating the value of objectness in other applications, such as learning object classes in weakly supervised scenarios [13], [30], [47], pixelwise segmentation of objects [2], [52], unsupervised object discovery [34], and learning humans-object interactions [45]....


  • ...Analogously, to support weakly supervised pixelwise segmentation of object classes [2], [52] and unsupervised object discovery [34]....


Book ChapterDOI
07 Oct 2012
TL;DR: Evaluation on two databases validates that geodesic saliency achieves superior results and outperforms previous approaches by a large margin, in both accuracy and speed (2 ms per image), illustrating that appropriate prior exploitation is helpful for the ill-posed saliency detection problem.
Abstract: Generic object level saliency detection is important for many vision tasks. Previous approaches are mostly built on the prior that "appearance contrast between objects and backgrounds is high". Although various computational models have been developed, the problem remains challenging and huge behavioral discrepancies between previous approaches can be observed. This suggest that the problem may still be highly ill-posed by using this prior only. In this work, we tackle the problem from a different viewpoint: we focus more on the background instead of the object. We exploit two common priors about backgrounds in natural images, namely boundary and connectivity priors, to provide more clues for the problem. Accordingly, we propose a novel saliency measure called geodesic saliency. It is intuitive, easy to interpret and allows fast implementation. Furthermore, it is complementary to previous approaches, because it benefits more from background priors while previous approaches do not. Evaluation on two databases validates that geodesic saliency achieves superior results and outperforms previous approaches by a large margin, in both accuracy and speed (2 ms per image). This illustrates that appropriate prior exploitation is helpful for the ill-posed saliency detection problem.

889 citations

References
Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations


"ClassCut for unsupervised class seg..." refers methods in this paper

  • ...Inspired by representations successfully used in supervised object class detection [16, 17], our approach anchors the object class in a reference coordinate frame common across images....


Book ChapterDOI
07 May 2006
TL;DR: A novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
Abstract: In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF's strong performance.

13,011 citations


"ClassCut for unsupervised class seg..." refers methods in this paper

  • ...As visual descriptors f we use color distributions (col) and bag-of-words [23] of SURF descriptors [24] (bow)....


  • ...In addition to a color component also used in GrabCut [1], the appearance model includes a bagof-words [23] of SURF descriptors [24], which is well suited for modeling class appearance....


Journal ArticleDOI
TL;DR: A novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.

12,449 citations

Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI--SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.

10,501 citations


"ClassCut for unsupervised class seg..." refers methods in this paper

  • ...Inspired by representations successfully used in supervised object class detection [16, 17], our approach anchors the object class in a reference coordinate frame common across images....


Journal ArticleDOI
TL;DR: An efficient segmentation algorithm is developed based on a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image and it is shown that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties.
Abstract: This paper addresses the problem of segmenting an image into regions. We define a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image. We then develop an efficient segmentation algorithm based on this predicate, and show that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties. We apply the algorithm to image segmentation using two different kinds of local neighborhoods in constructing the graph, and illustrate the results with both real and synthetic images. The algorithm runs in time nearly linear in the number of graph edges and is also fast in practice. An important characteristic of the method is its ability to preserve detail in low-variability image regions while ignoring detail in high-variability regions.

5,791 citations


"ClassCut for unsupervised class seg..." refers background or methods in this paper

  • ...The priors are defined on superpixels [25], which act as grouping units for homogeneous areas....


  • ...We also report the upper bound on the performance that ClassCut can obtain using superpixels [25] (Tab....


Frequently Asked Questions (14)
Q1. What contributions have the authors mentioned in the paper "Classcut for unsupervised class segmentation" ?

The authors propose a novel method for unsupervised class segmentation on a set of images. They experimentally demonstrate on the Caltech4, Caltech101, and Weizmann horses datasets that their method (a) transfers class knowledge across images and this improves results compared to segmenting every image independently; (b) outperforms GrabCut [1] for the task of unsupervised segmentation; (c) offers competitive performance compared to the state-of-the-art in unsupervised segmentation and in particular it outperforms the topic model [2].

As the class model is used in the next segmentation iteration it transfers knowledge across images, typically from easier images to more difficult ones, aiding their segmentation. 

The authors observed that on average only about 2% of the pairwise terms in the final model (i.e. incorporating all cues) are non-submodular. 

While the shape model only knows about the boundaries of an object but not on which side is foreground or background, jointly with the location model (and with the between-image smoothness) it will encourage similar shapes in similar spatial arrangements to be segmented in all the images. 

Note that their appearance model extends the model of GrabCut [1] by the bag of SURF descriptor which is known to perform well for object classes. 

The class model the authors propose (sec. 3.2) consists of several components modeling different class characteristics: appearance, location, and shape. 

Weights and generic object location prior are set by leaving-one-out (setting parameters on 27 classes, and testing on the remaining 1; do this 28 times). 

The authors observed that the minimization problem is hardest in the first few iterations and easier in the later ones: over the iterations QPBO labels more superpixels and the gap between the lower bound and the actual energy of the solutions also decreases.

To find the reference frame, the authors use the objectness measure of [18] which quantifies how likely it is for an image window to contain an object of any class. 

At each iteration, the authors update the parameters of the shape model using the current segmentation of all images according to the maximum likelihood criterion: for each of the 5 orientations in the 32×32 grid, the authors reestimate the empirical probability for a label-change at this position and with this orientation. 

Υ_Θ^f(L, I) = ∑_n ∑_k (1/|S_n^k|) ∑_{s∈S_n^k} −log p_f(l_n^k | s)   (11). The appearance models capture the appearance of the foreground and background regions.

The gradient grad(S_n^j, S_n^k) between S_n^j and S_n^k is computed by summing the gradient magnitudes [26] along the boundary between S_n^j and S_n^k (fig. 2a), normalized w.r.t. the length of the boundary.

If a common reference frame on the objects is available, their method exploits it to anchor the location and shape models to it and to improve the effectiveness of some of the priors. 

The border penalty Γ(L, I) = ∑_n ∑_k l_n^k · border(S_n^k) / perimeter(S_n^k)   (5) assigns to each superpixel S_n^k a penalty proportional to the number of its pixels touching the border of the reference frame (border(S_n^k)), normalized by its perimeter (perimeter(S_n^k)).