Fast Object Segmentation in Unconstrained Video
Summary
1. Introduction
- Video object segmentation is the task of separating foreground objects from the background in a video [14, 18, 26].
- The fully automatic scenario is more practically relevant, as a good solution would enable processing large amounts of video without human intervention.
- The object can be static in a portion of the video, and only part of it may move in another portion (e.g. a cat starts running and then stops to lick its paws).
- This second stage automatically bootstraps an appearance model based on the initial foreground estimate, and uses it to refine the spatial accuracy of the segmentation and to also segment the object in frames where it does not move (sec. 3.2).
3. Our approach
- The goal of the authors' work is to segment objects that move differently from their surroundings.
- The authors' method has two main stages: (1) efficient initial foreground estimation (sec. 3.1), and (2) foreground-background labelling refinement (sec. 3.2).
- The authors compute the optical flow between pairs of subsequent frames and detect motion boundaries (see the sketch after this list).
- Due to inaccuracies in the flow estimation, the motion boundaries are typically incomplete and do not align perfectly with object boundaries (fig. 1f).
- The goal of the second stage is to refine the spatial accuracy of the inside-outside maps and to segment the whole object in all frames.
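The following minimal sketch illustrates the motion-boundary step named above; it is an illustrative reconstruction, not the authors' code. It assumes the flow is given as horizontal and vertical component maps u, v; the parameter `lam` and the 0.5 binarization threshold are made-up values for illustration.

```python
import numpy as np

def motion_boundaries(u: np.ndarray, v: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Detect motion boundaries: image points where the optical flow field
    changes abruptly. (u, v) are the HxW horizontal/vertical flow components;
    `lam` and the threshold are illustrative assumptions, not paper values.
    """
    # Spatial gradients of both flow components (per-axis derivatives).
    du_y, du_x = np.gradient(u)
    dv_y, dv_x = np.gradient(v)
    # Magnitude of the flow gradient: large where the flow changes abruptly.
    grad_mag = np.sqrt(du_x**2 + du_y**2 + dv_x**2 + dv_y**2)
    # Squash to a (0, 1) boundary strength, then binarize.
    strength = 1.0 - np.exp(-lam * grad_mag)
    return strength > 0.5
```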
3.1. Efficient initial foreground estimation
- The authors begin by computing optical flow between pairs of subsequent frames (t, t + 1) using the state-of-the-art algorithm [6, 22].
- The authors base their approach on motion boundaries, i.e. image points where the optical flow field changes abruptly.
- The algorithm estimates whether a pixel is inside the object based on the point-in-polygon problem [12] from computational geometry: a ray starting from a point inside the polygon intersects its boundary an odd number of times.
- Conversely, a ray starting from a point outside the polygon intersects it an even number of times (see the sketch after this list).
- The authors' algorithm visits each pixel exactly once per direction while building S, and once to compute its vote, and is therefore linear in the number of pixels in the image.
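Below is a minimal sketch of the ray-casting inside-outside map, under simplifying assumptions: a binary boundary image and only the four axis-aligned ray directions (the actual method differs in such details, e.g. the set of ray directions and the voting rule). Per direction, one cumulative sum counts boundary crossings for all pixels at once, so the procedure stays linear in the number of pixels, consistent with the bullet above.

```python
import numpy as np

def inside_outside_map(boundary: np.ndarray) -> np.ndarray:
    """Ray-casting inside-outside map over a binary HxW motion-boundary
    image. For each axis-aligned direction, a cumulative sum gives the
    number of boundary crossings of a ray shot from the image border to
    each pixel; an odd count votes "inside" (point-in-polygon parity).
    """
    b = (boundary > 0).astype(np.int32)
    # Crossing counts for rays entering from each of the four sides.
    counts = [
        np.cumsum(b, axis=1),                    # from the left
        np.cumsum(b[:, ::-1], axis=1)[:, ::-1],  # from the right
        np.cumsum(b, axis=0),                    # from the top
        np.cumsum(b[::-1, :], axis=0)[::-1, :],  # from the bottom
    ]
    # Each direction votes "inside" when its crossing count is odd;
    # the final map takes the majority over the four directions.
    votes = sum((c % 2).astype(np.int32) for c in counts)
    return votes >= 3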
3.2. Foreground-background labelling refinement
- The authors formulate video segmentation as a pixel labelling problem with two labels (foreground and background); a schematic form of the energy is sketched after this list.
- The pairwise potentials V and W encourage spatial and temporal smoothness, respectively.
- Two superpixels s_i^t and s_j^{t+1} in subsequent frames are connected if at least one pixel of s_i^t moves into s_j^{t+1} according to the optical flow (fig. 3).
- Moreover, the appearance models are integrated over large image regions and over many frames, and can therefore robustly estimate the appearance of the object, despite faults in the inside-outside maps.
- In some frames (part of) the object may be static, and in others the inside-outside map might miss it because of incorrect optical flow estimation (fig. 4, middle row).
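Schematically, the labelling problem above can be written as the following energy over superpixel labels f_i^t in {foreground, background}. This display is a hedged reconstruction from the summary, not the paper's exact formulation: A is an appearance unary, L a location unary derived from the inside-outside maps, and V, W the spatial and temporal pairwise potentials named above (term weights omitted).

```latex
E(\mathbf{f}) = \sum_{t,i} A_i^t\left(f_i^t\right)
  + \sum_{t,i} L_i^t\left(f_i^t\right)
  + \sum_{(i,j)\in\mathcal{E}_s} V\left(f_i^t, f_j^t\right)
  + \sum_{(i,j)\in\mathcal{E}_t} W\left(f_i^t, f_j^{t+1}\right)
```

Here E_s connects spatially neighbouring superpixels within a frame and E_t connects superpixels in subsequent frames linked by the optical flow, as in the third bullet; a binary pairwise energy of this form can be minimized exactly with standard graph cuts when the pairwise terms are submodular.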
4.2. YouTube-Objects
- YouTube-Objects [19] is a large database collected from YouTube containing many videos for each of 10 object classes.
- The objects undergo rapid movement, strong scale and viewpoint changes, nonrigid deformations, and are sometimes clipped by the image border (fig. 5).
- Prest et al. [19] automatically select one segment per shot among those produced by [6], based on its appearance similarity to segments selected in other videos of the same object class, and on how likely it is to cover an object according to a class-generic objectness measure [2].
- For evaluation the authors fit a bounding box to the top-ranked output segment.
4.3. Runtime
- Given optical flow and superpixels, the authors' method takes 0.5 sec/frame on SegTrack (0.05 sec for the inside-outside maps and the rest for the foreground-background labelling refinement).
- While [16, 27] neither report timings nor have code available for us to measure, their runtime must be > 120 sec/frame, as they also use the object proposals [10].
- High quality optical flow can be computed rapidly using [22] (< 1 sec/frame).
- Currently, the authors use TurboPixels [15] as superpixels (1.5 sec/frame), but even faster alternatives are available [1].