ClassCut for unsupervised class segmentation
Summary (4 min read)
1 Introduction
- Image segmentation is a fundamental problem in computer vision.
- Interestingly, most previous approaches to unsupervised segmentation do not use energy functions similar to those in interactive and supervised segmentation, but instead use topic models [2] or other specialized generative models [10, 12] to find recurring patterns in the images.
- The authors propose ClassCut, a novel method for unsupervised segmentation based on a binary pairwise energy function similar to those used in interactive/supervised segmentation.
- Finally, their approach is also related to co-segmentation [21] where the goal is to segment a specific object from two images at the same time.
2 Overview of Our Method
- The goal is to jointly segment objects of an unknown class from a set of images.
- Analogous to the scheme of GrabCut [1], ClassCut alternates two stages: (1) learning/updating a class model given the current segmentations (sec. 4); (2) jointly segmenting the objects in all images given the current class model (sec. 3).
- It converges when the segmentation is unchanged in two consecutive iterations.
- As the class model is used in the next segmentation iteration it transfers knowledge across images, typically from easier images to more difficult ones, aiding their segmentation.
- In the next iteration, this will help in images where the airplane is difficult to segment (e.g. because of low contrast).
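The alternation described above can be sketched as a small driver loop. This is a minimal sketch, not the paper's implementation; the helper names (`segment`, `update_model`, `init_model`) are hypothetical stand-ins for the two stages and the model initialization.

```python
# Hedged sketch of the ClassCut-style alternation: (1) update the class model
# from the current segmentations, (2) re-segment all images given that model,
# repeating until the segmentation is unchanged between consecutive iterations.

def class_cut(images, segment, update_model, init_model, max_iters=20):
    """Alternate model updates and joint segmentation until convergence."""
    model = init_model(images)
    labels = None
    for _ in range(max_iters):
        new_labels = [segment(img, model) for img in images]
        if new_labels == labels:   # unchanged in two consecutive iterations
            break
        labels = new_labels
        model = update_model(images, labels)   # knowledge crosses images here
    return labels, model
```

Because the model is re-learned from all current segmentations, information from easy images reaches the harder ones on the next pass.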
3 Segmentation
- Each image I^n (given either as a full image or as an automatically determined reference frame) consists of superpixels {S_1^n, ..., S_{K_n}^n}.
- A segmentation assigns label l_k^n = 1 to every superpixel S_k^n on the foreground and l_j^n = 0 to every superpixel S_j^n on the background.
3.1 Prior ΦΘ(L, I)
- It penalizes neighboring superpixels having different labels.
- Thus, the penalty is smaller if the two superpixels are separated by high gradients.
- Objects rarely touch the boundary of the reference frame.
- This term penalizes labeling as foreground those superpixels that touch the border of the reference frame (fig. 2).
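The border term can be sketched directly from its definition (eq. (5) in the paper): each foreground-labeled superpixel pays a penalty proportional to the fraction of its perimeter lying on the reference-frame border. The flat per-superpixel lists used here are an assumed simplified representation.

```python
# Sketch of the border penalty: sum of l_k * border(S_k) / perimeter(S_k),
# where border(S_k) counts the superpixel's pixels on the reference-frame
# border and perimeter(S_k) normalizes by its total perimeter.

def border_penalty(labels, border_px, perimeter_px):
    """Penalize foreground superpixels touching the reference-frame border."""
    return sum(l * b / p for l, b, p in zip(labels, border_px, perimeter_px))
```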
3.2 Class Model ΨΘ(L, I)
- The scalars w are part of the model parameters Θ and weight the terms.
- To compute the energy contribution for a superpixel S_k^n labeled foreground, the authors average over all positions in S_k^n and incorporate this into eq. (7) as Ω_Θ(L, I) = Σ_n Σ_k (1/|S_k^n|) Σ_{s ∈ S_k^n} −log p_Ω(l_k^n | s) (8). Fig. 3a shows a final location model obtained after convergence.
- Fig. 4 shows an initial shape model and a shape model after convergence.
- As visual descriptors f the authors use color distributions (col) and bag-of-words [23] of SURF descriptors [24] (bow).
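The per-superpixel averaging in the location term (eq. (8)) can be sketched as follows. `p_fg` (the location model's per-position foreground probability) and the pixel-list representation of superpixels are assumptions for illustration, not the paper's interface.

```python
import math

# Sketch of eq. (8): for each superpixel, average the negative log-likelihood
# of its label over all pixel positions s inside it, then sum over superpixels.

def location_term(superpixels, labels, p_fg):
    """superpixels: list of pixel-position lists; labels: 0/1 per superpixel."""
    total = 0.0
    for pixels, lab in zip(superpixels, labels):
        nll = [-math.log(p_fg(s) if lab == 1 else 1.0 - p_fg(s))
               for s in pixels]
        total += sum(nll) / len(pixels)   # average over positions in S_k^n
    return total
```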
3.3 Energy Minimization
- To label these superpixels the authors use TRW-S [15].
- TRW-S not only labels them but also computes a lower bound on the energy which may be used to assess how far from the global optimum the solution is.
- In their experiments, the authors observed that QPBO labels on average 91% of the superpixels according to the global optimum.
- Furthermore, the authors observed that the minimization problem is hardest in the first few iterations and easier in the later iterations: over the iterations QPBO labels more superpixels and the difference between the lower bound and the actual energy of the solutions is also decreased.
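The quality checks mentioned above amount to simple bookkeeping, sketched here under the assumption that per-iteration energies, TRW-S lower bounds, and QPBO labelings (with `None` marking an unlabeled superpixel) are recorded.

```python
# Sketch of tracking minimization difficulty over iterations: the fraction of
# superpixels QPBO managed to label, and the gap between each solution's
# energy and the TRW-S lower bound (which bounds the distance to the optimum).

def solution_quality(energies, lower_bounds, qpbo_labels):
    labeled_frac = [sum(l is not None for l in labs) / len(labs)
                    for labs in qpbo_labels]
    gaps = [e - lb for e, lb in zip(energies, lower_bounds)]
    return labeled_frac, gaps
```

Shrinking gaps and growing labeled fractions over iterations reproduce the trend the authors report: the problem gets easier as the class model improves.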
4.1 Location Model
- The location model Ω is initialized uniformly.
- At each iteration, the authors update the parameters of the location model using the current segmentation of all images of the current class according to the maximum likelihood criterion (fig. 3a): for each cell in the 32×32 grid they reestimate the empirical probability of foreground using the current segmentations.
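The maximum-likelihood update above reduces to per-cell counting. The `{(i, j): [0/1 labels]}` mapping of pixels to grid cells is an assumed data layout for this sketch.

```python
# ML update sketch for the location model: each cell of the grid gets the
# empirical fraction of foreground pixels currently falling into it; cells
# that receive no pixels stay at an uninformative 0.5.

def update_location_model(cell_labels, grid=32):
    model = [[0.5] * grid for _ in range(grid)]
    for (i, j), labs in cell_labels.items():
        model[i][j] = sum(labs) / len(labs)
    return model
```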
4.2 Shape Model
- The shape model Π is initialized by accumulating the boundaries of all superpixels in the reference frame over all images.
- As the boundaries of superpixels follow likely object boundaries, they will reoccur consistently along the true object boundaries across multiple images.
- The initial shape model (fig. 4) already contains a rough outline of the unknown object class.
- At each iteration, the authors update the parameters of the shape model using the current segmentation of all images according to the maximum likelihood criterion: for each of the 5 orientations in the 32×32 grid, they reestimate the empirical probability for a label-change at this position and with this orientation.
- While the shape model only knows about the boundaries of an object but not on which side is foreground or background, jointly with the location model (and with the between-image smoothness) it will encourage similar shapes in similar spatial arrangements to be segmented in all the images.
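The shape-model update is again an empirical-ratio computation. The count arrays indexed `[orientation][row][col]` are an assumed bookkeeping layout; the paper stores 5 orientations on a 32×32 grid.

```python
# ML update sketch for the shape model: for each (orientation, grid cell),
# the probability of a label change is the ratio of observed boundary events
# (change_counts) to opportunities (total_counts) under the current
# segmentations; cells never visited get probability 0.

def update_shape_model(change_counts, total_counts):
    return [[[c / t if t else 0.0 for c, t in zip(crow, trow)]
             for crow, trow in zip(cgrid, tgrid)]
            for cgrid, tgrid in zip(change_counts, total_counts)]
```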
4.3 Appearance Model
- The appearance models Υ^f are initialized using the color/SURF observations from all images, based on an initial segmentation.
- This initial segmentation is obtained from a generic prior of object location trained on an external set of images with objects of other classes and their ground-truth segmentations (fig. 3b).
- From this object location prior, the authors select the top 75% of pixels as foreground and the remaining 25% as background.
- The authors observe that this location prior is essentially a Gaussian in the middle of the reference frame.
- If the authors are using automatically determined reference frames, the observations for the background are collected from both pixels outside the reference frame and pixels inside the reference frame but labelled as background.
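The 75%/25% split above can be sketched by ranking pixels by the generic location prior. The flat list of per-pixel prior scores is an assumption for illustration.

```python
# Sketch of the appearance-model initialization: the top 75% of pixels by
# location-prior value seed the foreground model, the rest the background.

def split_by_location_prior(prior_values, fg_fraction=0.75):
    order = sorted(range(len(prior_values)),
                   key=lambda i: prior_values[i], reverse=True)
    cut = int(round(fg_fraction * len(prior_values)))
    return set(order[:cut]), set(order[cut:])
```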
5 Finding the Reference Frame
- To find the reference frame, the authors use the objectness measure of [18] which quantifies how likely it is for an image window to contain an object of any class.
- Objectness is trained to distinguish windows containing an object with a well-defined boundary and center, such as cows and telephones, from amorphous background windows, such as grass and road.
- The authors sample 1000 windows likely to contain an object from this measure, project the object location prior (sec. 4.3) into these windows and accumulate into an objectness map M (fig. 5, bottom).
- M will have peaks on the objects in the image.
- In the experiments the authors demonstrate that this method improves the results of unsupervised segmentation compared to using the full images (sec. 6).
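The accumulation into the map M can be sketched as below. Projecting the location prior into each window is omitted here; a uniform per-window contribution weighted by the window's score is assumed instead.

```python
# Sketch of building an objectness map M: each sampled window
# (x0, y0, x1, y1), half-open, adds its score to every pixel it covers,
# so pixels covered by many high-scoring windows form peaks on the objects.

def objectness_map(height, width, windows, scores):
    M = [[0.0] * width for _ in range(height)]
    for (x0, y0, x1, y1), s in zip(windows, scores):
        for y in range(y0, y1):
            for x in range(x0, x1):
                M[y][x] += s
    return M
```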
6.1 Datasets
- The authors evaluate their unsupervised segmentation method on three datasets of varying difficulty and compare the results to a single-image GrabCut and to other state-of-the-art methods.
- In no experiment are training images with segmentations of the unknown class used.
- The authors use the experimental setup of [9]: for the classes airplanes, cars, faces, and motorbikes, they use the test images of [27] and segment the objects using no training data.
- The authors use an experimental setup similar to [2]: for 28 classes, they randomly select 30 images each and determine the segmentations of the objects.
- Note that [2] additionally uses 30 training images for each class and solves a joint segmentation and classification task (not done here).
6.2 Baselines and the State of the Art
- To initialize GrabCut, the authors train a foreground color model from the central 25% of the area of the image and a background model from the rest.
- Using these models, GrabCut is iterated until convergence for each image individually.
- Notice how the automatic reference frame improves the results of GrabCut from line (c) to (d) and how GrabCut is a strong competitor for previous methods [2, 9] that were designed for unsupervised segmentation.
- For the datasets for which results are available, the authors compare their approach to Spatial Topic Models [2].
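The GrabCut-baseline initialization described above (foreground from the central 25% of the image area) can be sketched as a centered rectangle mask; since sqrt(0.25) = 0.5, scaling each side by half gives a rectangle covering a quarter of the area.

```python
# Sketch of the GrabCut-baseline seed: a centered rectangle covering
# area_fraction of the image seeds the foreground color model; all remaining
# pixels seed the background model.

def central_fg_mask(height, width, area_fraction=0.25):
    scale = area_fraction ** 0.5   # side scale so rect area = fraction * H * W
    rh, rw = int(round(scale * height)), int(round(scale * width))
    y0, x0 = (height - rh) // 2, (width - rw) // 2
    return [[y0 <= y < y0 + rh and x0 <= x < x0 + rw
             for x in range(width)] for y in range(height)]
```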
6.3 ClassCut
- The authors evaluate the ability of ClassCut to segment objects of an unknown class in a set of images.
- Note also, how ClassCut improves its accuracy over iterations (line (e) to (f)), showing that it is properly learning about the class.
- Using ClassCut the authors obtain a segmentation accuracy of 83.6%, outperforming both GrabCut (line (c)) and the spatial topic model [2] (line (a)).
- Since neither [2, 9] use any such measure the authors compare to the GrabCut baseline.
- This shows that the segmentations obtained using ClassCut are better aligned to the ground-truth segmentation than those from GrabCut.
7 Conclusion
- The authors presented a novel approach to unsupervised class segmentation.
- The authors' approach alternates between jointly segmenting the objects in all images and updating a class model, which allows it to benefit from insights gained in interactive segmentation and object class detection.
- Their model comprises inter-image priors and a comprehensive class model accounting for object appearance, shape, and location w.r.t. an automatically determined reference frame.
- The authors demonstrate that the reference frame enables learning a novel type of shape model and aids the segmentation process.
Frequently Asked Questions (14)
Q2. What is the purpose of the class model?
As the class model is used in the next segmentation iteration it transfers knowledge across images, typically from easier images to more difficult ones, aiding their segmentation.
Q3. What fraction of pairwise terms in the final model are non-submodular?
The authors observed that on average only about 2% of the pairwise terms in the final model (i.e. incorporating all cues) are non-submodular.
Q4. What is the effect of the shape model jointly with the location model?
While the shape model only knows about the boundaries of an object but not on which side is foreground or background, jointly with the location model (and with the between-image smoothness) it will encourage similar shapes in similar spatial arrangements to be segmented in all the images.
Q5. What is the effect of the appearance model?
Note that their appearance model extends the model of GrabCut [1] with a bag-of-words of SURF descriptors, which is known to perform well for object classes.
Q6. What is the class model the authors propose?
The class model the authors propose (sec. 3.2) consists of several components modeling different class characteristics: appearance, location, and shape.
Q7. How do the authors set the weights and object location prior?
Weights and the generic object location prior are set by leave-one-out (setting parameters on 27 classes and testing on the remaining one; this is repeated 28 times).
Q8. What is the effect of QPBO on the energy of superpixels?
The authors observed that the minimization problem is hardest in the first few iterations and easier in the later iterations: over the iterations QPBO labels more superpixels and the difference between the lower bound and the actual energy of the solutions also decreases.
Q9. How is the objectness measure used to find the reference frame?
To find the reference frame, the authors use the objectness measure of [18] which quantifies how likely it is for an image window to contain an object of any class.
Q10. What is the probability of a label change?
At each iteration, the authors update the parameters of the shape model using the current segmentation of all images according to the maximum likelihood criterion: for each of the 5 orientations in the 32×32 grid, the authors reestimate the empirical probability for a label-change at this position and with this orientation.
Q11. What is the appearance model for a superpixel?
Υ^f_Θ(L, I) = Σ_n Σ_k −(1/|S_k^n|) Σ_{s ∈ S_k^n} log p_f(l_k^n | s) (11). The appearance models capture the appearance of the foreground and background regions.
Q12. What is the gradient grad between Sjn and S k n?
The gradient grad(S_j^n, S_k^n) between S_j^n and S_k^n is computed by summing the gradient magnitudes [26] along the boundary between S_j^n and S_k^n (fig. 2a), normalized w.r.t. the length of the boundary.
Q13. What is the effect of the priors?
If a common reference frame on the objects is available, their method exploits it to anchor the location and shape models to it and to improve the effectiveness of some of the priors.
Q14. What is the penalty for superpixels touching the border of the reference frame?
The border penalty Γ(L, I) = Σ_n Σ_k l_k^n · border(S_k^n) / perimeter(S_k^n) (5) assigns to each superpixel S_k^n a penalty proportional to the number of its pixels touching the reference-frame border (border(S_k^n)), normalized by its perimeter (perimeter(S_k^n)).