
Segmentation and Recognition using Structure from Motion Point Clouds

Gabriel J. Brostow^1, Jamie Shotton^2, Julien Fauqueur^3, and Roberto Cipolla^4

^1 University College London and ETH Zurich
^2 Microsoft Research Cambridge
^3 University of Cambridge (now with MirriAd Ltd.)
^4 University of Cambridge
Abstract. We propose an algorithm for semantic segmentation based on
3D point clouds derived from ego-motion. We motivate five simple cues
designed to model specific patterns of motion and 3D world structure
that vary with object category. We introduce features that project the
3D cues back to the 2D image plane while modeling spatial layout and
context. A randomized decision forest combines many such features to
achieve a coherent 2D segmentation and recognize the object categories
present. Our main contribution is to show how semantic segmentation is
possible based solely on motion-derived 3D world structure. Our method
works well on sparse, noisy point clouds, and unlike existing approaches,
does not need appearance-based descriptors.
Experiments were performed on a challenging new video database con-
taining sequences filmed from a moving car in daylight and at dusk. The
results confirm that indeed, accurate segmentation and recognition are
possible using only motion and 3D world structure. Further, we show that
the motion-derived information complements an existing state-of-the-art
appearance-based method, improving both qualitative and quantitative
performance.
[Fig. 1 panels, left to right: input video frame | reconstructed 3D point cloud | automatic segmentation]
Fig. 1. The proposed algorithm uses 3D point clouds estimated from videos such as the
pictured driving sequence (with ground truth inset). Having trained on point clouds
from other driving sequences, our new motion and structure features, based purely on
the point cloud, perform 11-class semantic segmentation of each test frame. The colors
in the ground truth and inferred segmentation indicate category labels.

1 Introduction
We address the question of whether motion and 3D world structure can be used
to accurately segment video frames and recognize the object categories present.
In particular, as illustrated in Fig. 1, we investigate how to perform semantic
segmentation from the sparse, noisy 3D point cloud given by structure from ego-
motion. Our algorithm is able to accurately recognize objects and segment video
frames without appearance-based descriptors or dense depth estimates obtained
using e.g., dense stereo or laser range finders. The structure from motion, or
SfM, community [1] has demonstrated the value of ego-motion derived data,
and their modeling efforts have even extended to stationary geometry of cities [2].
However, the object recognition opportunities presented by the inferred motion
and structure have largely been ignored.^1

^1 The work of [3] was similarly motivated, and used laser-scans of static scenes to
compute a 3D planar patch feature, which helped to train a chain of binary classifiers.
The proposed algorithm uses camera-pose estimation from video as an exist-
ing component, and assumes ego-motion is the dominant cause of pixel flow [4].
Tracked 2D image features are triangulated to find their position in world space
and their relationship to the moving camera path. We suggest five simple motion
and structure cues that are indicative of object categories present in the scene.
Projecting these cues from the 3D point cloud to the 2D image, we build a ran-
domized decision forest classifier to perform a coherent semantic segmentation.
Our main contributions are: (i) a demonstration that semantic segmentation
is possible based solely on motion-derived 3D world structure; (ii) five intuitive
motion and structure cues and a mechanism for projecting these 3D cues to the
2D image plane for semantic segmentation; and (iii) a challenging new database
of video sequences filmed from a moving car and hand-labeled with ground-
truth semantic segmentations. Our evaluation shows performance comparable
to existing state-of-the-art appearance based techniques, and further, that our
motion-derived features complement appearance-based features, improving both
qualitative and quantitative performance.
Background. An accurate automatic scene understanding of images and videos
has been an enduring goal of computer vision, with applications varying from
image search to driving safety. Many successful techniques for 2D object recogni-
tion have used individual still images [5–7]. Without using SfM, Hoiem et al. [8, 9]
achieve exciting results by considering several spatial cues found in single images,
such as surface orientations and vanishing points, to infer the camera viewpoint
or general scene structure. This, in turn, helps object recognition algorithms
refine their hypotheses, culling spatially infeasible detections. 3D object recog-
nition is still a new research area. Huber et al. [10] matched laser rangefinder
data to learned object models. Other techniques build 3D object models and
match them to still images using local descriptors [11–14]. None of these meth-
ods, however, can exploit the motion-based cues available in video sequences.
Dalal et al. [15] is a notable exception that used differential optical flow in pairs
of images. In this paper, we reason about the moving 3D scene given a moving
2D camera. Our method works well on sparse, noisy point clouds, and does not
need appearance-based descriptors attached to 3D world points.
There is a long history of fascinating research about motion-based recognition
of human activities [16]. Laptev and Lindeberg [17] introduced the notion of
space-time interest points to help detect and represent sudden actions as high
gradient points in the xyt cube for motion-based activity recognition. Our focus
is rather object recognition, and our features do not require a stationary camera.
While it is tempting to apply other detectors (e.g., pedestrians [18]) directly
to the problem of recognizing objects from a moving camera, motion compensa-
tion and motion segmentation are still relatively open problems. Yin et al. [19]
use low-level motion cues for bi-layer video segmentation, though do not achieve
a semantic labeling. Computer vision for driving has proven challenging and has
previously been investigated with a related focus on motion segmentation [20].
For example, Kang et al. [21] have recently shown an improvement in the state
of the art while using a structure consistency constraint similar to one of our
motion cues. Leibe et al. [22] address recognition of cars and pedestrians from
a moving vehicle. Our technique handles both these and nine further categories,
and additionally semantically segments the image, without requiring their ex-
pensive stereo setup.
Optical flow has aided recognition of objects for static cameras [23], but for-
ward ego-motion dominates the visual changes in our footage. Depth-specific
motion compensation may help, but requires accurate dense-stereo reconstruc-
tion or laser range-scanning. We instead employ features based on a sparse SfM
point cloud and avoid these problems.
2 Structure from Motion Point Clouds
We use standard structure from ego-motion techniques to automatically generate
a 3D point cloud from video sequences filmed from moving cars. The dominant
motion in the sequences gives the camera world-pose and thereby the relative
3D point cloud of all tracked 2D features, including outliers.
We start by tracking 2D image features. Specifically, we use Harris-Stephens
corners [24] with localized normalized cross correlation to track 20 × 20 pixel
patches through time in a search window 15% of the image dimensions. In prac-
tice, this produced reliable 2D trajectories that usually spanned more than 5
frames. To reduce the number of mis-tracks, each initial template is tracked
only until its correlation falls below 0.97.
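As a concrete illustration of this tracking stage, the sketch below shows how Harris corners and normalized cross-correlation patch tracking could be implemented with OpenCV. It is a minimal approximation of the procedure described above, not the authors' code; the function names, the corner count, and the border handling are illustrative choices.

```python
# Hedged sketch of the 2D tracking stage: Harris corners are detected, and each
# 20x20 patch is followed frame-to-frame by normalized cross-correlation inside
# a search window (~15% of the image size), until the score drops below 0.97.
import cv2
import numpy as np

PATCH = 20          # template size (pixels), as in the paper
MIN_CORR = 0.97     # stop tracking when NCC falls below this

def detect_corners(gray, max_corners=2000):
    pts = cv2.goodFeaturesToTrack(gray, max_corners, qualityLevel=0.01,
                                  minDistance=5, useHarrisDetector=True, k=0.04)
    return [] if pts is None else [tuple(p.ravel()) for p in pts]

def track_patch(prev_gray, next_gray, center):
    """Track one 20x20 patch into the next frame; return (new_center, score)."""
    h, w = prev_gray.shape
    win = int(0.15 * max(h, w))                      # search window radius
    cx, cy = int(center[0]), int(center[1])
    half = PATCH // 2
    tmpl = prev_gray[cy - half:cy + half, cx - half:cx + half]
    x0, x1 = max(0, cx - win), min(w, cx + win)
    y0, y1 = max(0, cy - win), min(h, cy + win)
    search = next_gray[y0:y1, x0:x1]
    if tmpl.shape != (PATCH, PATCH) or search.shape[0] < PATCH or search.shape[1] < PATCH:
        return None, 0.0                             # too close to the border
    res = cv2.matchTemplate(search, tmpl, cv2.TM_CCORR_NORMED)
    _, score, _, loc = cv2.minMaxLoc(res)            # best NCC response
    new_center = (x0 + loc[0] + half, y0 + loc[1] + half)
    return (new_center, score) if score >= MIN_CORR else (None, score)
```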
Footage is obtained from a car-mounted camera. We assume, for purposes
of 3D reconstruction, that changes between images are the result of only ego-
motion. This allows us to compute a single world-point $W = (x, y, z, 1)^T$ for
each point tracked in 2D image space, $(u_t, v_t)$. A best-fit $\tilde{W}$ is computed given
at least two corresponding $3 \times 4$ camera projection matrices $P_t$ from the sequence.
Matrices $P$ are inferred in a robust pre-processing stage, for which we simply use
a commercial product [4], which normalizes the resulting up-to-scale solutions
to 1.0. Then $P$ is split into row vectors $p^{1:3}$, so $W$ projects into the camera $C_t$
as $[u_1, v_1]^T$: $[\tilde{u}_1, \tilde{v}_1, \lambda]^T = [p^1, p^2, p^3]^T W$, and dividing through by $\lambda$ gives
$u_1 = \frac{p^1 W}{p^3 W}$, $v_1 = \frac{p^2 W}{p^3 W}$, and similarly for $(u_2, v_2)$, $P_{t+1}$, and $C_{t+1}$. As long as the
feature was moving, a least squares solution exists for the three unknowns of
$\tilde{W}$, given these four or more (in the case of longer feature tracks) equations.
We reconstruct using only the most temporally separated matrices $P$, instead of
finding a $\tilde{W}$ based on the whole 2D track. This strategy generally gives maximum
disparity and saves needless computations. After computing the camera poses,
no outlier rejection is performed, so that an order of magnitude more tracked
points are triangulated for the point cloud.
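The triangulation above amounts to a small linear least-squares problem per feature track. The sketch below is one way to solve it with NumPy, assuming the two most temporally separated projection matrices and the corresponding image points are already available; the helper name and the returned residual (later reusable as the reconstruction-error cue) are illustrative.

```python
# Hedged sketch of the triangulation step: given 3x4 camera matrices P_a, P_b
# and tracked image points (u_a, v_a), (u_b, v_b), solve the over-determined
# system (p1 - u p3) W = 0, (p2 - v p3) W = 0 in least squares for (x, y, z).
import numpy as np

def triangulate(P_a, P_b, uv_a, uv_b):
    rows, rhs = [], []
    for P, (u, v) in ((P_a, uv_a), (P_b, uv_b)):
        p1, p2, p3 = P                 # row vectors of the 3x4 projection matrix
        for r in (p1 - u * p3, p2 - v * p3):
            rows.append(r[:3])         # coefficients of the unknowns (x, y, z)
            rhs.append(-r[3])          # homogeneous coordinate of W is 1
    A, b = np.asarray(rows), np.asarray(rhs)
    W, *_ = np.linalg.lstsq(A, b, rcond=None)   # best-fit 3D point
    residual = np.linalg.norm(A @ W - b)        # reconstruction error of the fit
    return W, residual
```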
3 Motion and 3D Structure Features
We now describe the new motion and 3D structure features that are based on
the inferred 3D point cloud. We suggest five simple cues that can be estimated
reliably and are projected from the 3D world into features on the 2D image
plane, where they enable semantic segmentation. We conclude the section by
explaining how a randomized decision forest combines these simple weak features
into a powerful classifier that performs the segmentation.
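The paper's forest operates on its own projected 2D feature tests (Sect. 3.2), so the following is only a hedged stand-in: it uses scikit-learn's extremely randomized trees to show how many weak per-pixel motion and structure responses could be combined into an 11-class per-pixel labelling. The feature-matrix layout, tree count, and depth are assumptions, not values from the paper.

```python
# Stand-in for the randomized decision forest: per-pixel cue responses are
# combined by an ensemble of extremely randomized trees into a class label
# per pixel, which is then reshaped back into a 2D segmentation map.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def train_forest(pixel_features, pixel_labels, n_trees=50, max_depth=15):
    """pixel_features: (N, D) cue responses; pixel_labels: (N,) class ids."""
    forest = ExtraTreesClassifier(n_estimators=n_trees, max_depth=max_depth,
                                  n_jobs=-1, class_weight="balanced")
    return forest.fit(pixel_features, pixel_labels)

def segment_frame(forest, frame_features, height, width):
    """Predict one label per pixel and reshape into a 2D segmentation map."""
    return forest.predict(frame_features).reshape(height, width)
```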
3.1 Cues from Point Clouds
Just as there are many ways to parameterize the colors and texture of appear-
ance, there are numerous ways to parameterize 3D structure and motion. We
propose five motion and structure cues. These are based on the inferred 3D point
cloud, which, given the small baseline changes, is rather noisy. The cues were
chosen as robust, intuitive, efficient to compute, and general-purpose but object-
category covariant, though these five are by no means exhaustive. The cues also
fit nicely with the powerful 3D to 2D projection mechanism (Sect. 3.2). With
the driving application in mind, they were designed to be invariant to camera
pitch, yaw, and perspective distortion, and could generalize to other domains.
The cues are: height above the camera, distance to the camera path, pro-
jected surface orientation, feature track density, and residual reconstruction er-
ror. These are intentionally weak; stronger features would not work with the
sparse noisy point clouds, though dense feature tracking could someday enable
one to apply [25]. We use machine learning to isolate reliable patterns and build
a strong classifier that combines many of these cues (Sect. 3.3). By projecting
from the 3D point cloud to the 2D image as described in Sect. 3.2, we are able
to exploit contextual relationships. One of the benefits of video is that analysis
of one frame can often be improved through information in neighboring frames.
Our cues take advantage of this since feature tracks exist over several frames.
Height above the camera $f_H$. During video of a typical drive, one will notice
that the only fairly fixed relationship between the 3D coordinate frames of the
camera $C$ and the world is the camera's height above the pavement (Fig. 2).
Measuring height in image-space would be very susceptible to bumps in the
road. Instead, after aligning the car's initial "up" vector as the camera's $y$
axis, the height of each world point $\tilde{W}$ is compared to the camera center's $y$
coordinate as $f_H(\tilde{W}) = \tilde{W}_y - C_y$. By including a fixed offset $C_y$, the algorithm
can be trained on point clouds from one vehicle, but run on other cameras and
vehicles. Our experiments use footage from two different cars.

Fig. 2. The height ($f_H$), camera distance ($f_C$), and residual error ($f_R$) features are
illustrated for a car following the dotted yellow path. The red vertical arrow shows how
$f_H$ captures the height above the ground of a 3D point (red dot) reconstructed at the
top of the stop light. The green arrow reflects the smallest distance between the point
on the railing and the car's path. The blue ellipse for $f_R$ illustrates the large residual
error, itself a feature, in estimating the world coordinate $\tilde{W}$ of a point on the moving
person's head.
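A minimal sketch of the height cue, assuming the reconstructed points are expressed in a world frame whose y axis is the car's initial "up" vector; the function and variable names are illustrative.

```python
# f_H(W~) = W~_y - C_y for every reconstructed 3D point.
import numpy as np

def f_height(W_points, camera_center_y):
    """W_points: (N, 3) array of reconstructed points; returns (N,) heights."""
    return W_points[:, 1] - camera_center_y
```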
Closest distance to camera path $f_C$. The paths of moving vehicles on road
surfaces are less repeatable than a class's absolute height in world coordinates,
but classes such as buildings and trees are normally set back from driving roads
by a fixed distance (Fig. 2). This feature, using the full sequence of camera
centers $C(t)$, gives the value of the smallest recorded 3D separation between $C$
and each $\tilde{W}$ as $f_C(\tilde{W}) = \min_t \| \tilde{W} - C(t) \|$. Note that the smallest separation
may occur after a feature in the current frame goes out of view. Such is the case
most obviously with features reconstructed on the surface of the road.
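A minimal sketch of the camera-distance cue, computing the smallest 3D separation between each reconstructed point and the full sequence of camera centres by brute force; the names and the brute-force strategy are illustrative, not from the paper's implementation.

```python
# f_C(W~) = min_t ||W~ - C(t)|| over the whole camera path.
import numpy as np

def f_closest_distance(W_points, camera_centers):
    """W_points: (N, 3); camera_centers: (T, 3). Returns (N,) minimum distances."""
    diff = W_points[:, None, :] - camera_centers[None, :, :]   # (N, T, 3)
    return np.linalg.norm(diff, axis=2).min(axis=1)
```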
Surface Orientation $f_{O_x}$, $f_{O_y}$. The points $\tilde{W}$ in the point cloud are too sparse
and inaccurate in depth to allow an accurate 3D reconstruction of a faceted
world, but do still contain useful spatial information. A 2D Delaunay
triangulation [26] is performed on all the projected $\tilde{W}$ points in a given frame. Each
2D triangle is made of 3D coordinates which have inaccurate depths but,
heuristically, acceptable relative depth estimates, and thus can give an approximate
local surface orientation. The 3D normal vector for each triangle is projected
to an angled vector on the image plane in 2D. The x and y components of this
2D angle are encoded in the red and green channels of a false-rendering of the
triangulation, shown in the supplementary data online.
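The sketch below approximates the surface-orientation cues: a 2D Delaunay triangulation of the projected points, a 3D normal per triangle, and the normal's x/y components as the two orientation channels. It assumes the 3D points are already in the camera coordinate frame, so taking the normal's x and y components stands in for the projection onto the image plane; the names are illustrative.

```python
# Approximate f_Ox, f_Oy: Delaunay-triangulate the 2D projections, compute a
# 3D normal per triangle, and keep its image-plane x/y components.
import numpy as np
from scipy.spatial import Delaunay

def surface_orientation(points_2d, points_3d):
    """points_2d: (N, 2) image projections; points_3d: (N, 3) camera-frame points.
    Returns (triangles, normals_xy) with one 2D orientation per triangle."""
    tri = Delaunay(points_2d)
    normals_xy = []
    for simplex in tri.simplices:               # vertex indices of each 2D triangle
        a, b, c = points_3d[simplex]
        n = np.cross(b - a, c - a)              # 3D facet normal
        n /= (np.linalg.norm(n) + 1e-9)
        normals_xy.append(n[:2])                # x/y components on the image plane
    return tri.simplices, np.asarray(normals_xy)
```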
Track Density $f_D$. Faster moving objects, like oncoming traffic and people,
often yield sparser feature tracks than stationary objects. Further, some object
classes have more texture than others. We thus use the track density as one of
the motion-derived cues. $f_D(t)$ is the 2D image-space map of the feature density,

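A minimal sketch of the track-density cue, approximating the 2D image-space density map by a smoothed histogram of feature positions; the bin size and Gaussian blur radius are illustrative choices, not values from the paper.

```python
# f_D: per-pixel map of how densely 2D feature tracks cover the current frame,
# built here as a 2D histogram of track positions smoothed with a Gaussian.
import numpy as np
from scipy.ndimage import gaussian_filter

def f_track_density(track_points, height, width, sigma=15.0):
    """track_points: (N, 2) array of (u, v) feature positions in this frame."""
    density = np.zeros((height, width), dtype=np.float32)
    u = np.clip(track_points[:, 0].astype(int), 0, width - 1)
    v = np.clip(track_points[:, 1].astype(int), 0, height - 1)
    np.add.at(density, (v, u), 1.0)             # count features per pixel
    return gaussian_filter(density, sigma)      # smooth into a density map
```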
References

Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features.
Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision.
Harris, C., Stephens, M.: A combined corner and edge detector.
Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees.
Frequently Asked Questions (10)
Q1. What contributions have the authors mentioned in the paper "Segmentation and recognition using structure from motion point clouds"?

The authors propose an algorithm for semantic segmentation based on 3D point clouds derived from ego-motion. The authors introduce features that project the 3D cues back to the 2D image plane while modeling spatial layout and context. Their method works well on sparse, noisy point clouds, and unlike existing approaches, does not need appearance-based descriptors. Further, the authors show that the motion-derived information complements an existing state-of-the-art appearance-based method, improving both qualitative and quantitative performance. 

The authors hope that semi-supervised techniques that use extra partially labeled or unlabeled training data may lead to improved performance in the future. 

The authors trained a randomized decision forest based on their five motion and structure cues, using combined day and dusk sequences for both training and testing. 

The labeled data has 11 categories: Building, Tree, Sky, Car, Sign-Symbol, Road, Pedestrian, Fence, Column-Pole, Sidewalk, and Bicyclist. 

By including a fixed offset Cy, the algorithm can be trained on point clouds from one vehicle, but run on other cameras and vehicles. 

One by-product of balancing the categories during training is that the areas of smaller classes in the images tend to be overestimated, spilling out into the background (e.g., the bicycle in Fig. 7). 

Learning a histogram for each pair of (motion and structure, appearance) tree leaf nodes could better modelthe joint dependencies of the two classifiers, but care must be taken so that in avoiding overfitting, quadratically more training data is not required. 

Motion and structure features do however have an obvious advantage over appearance features: generalization to novel lighting and weather conditions.

To determine the relative importance of the five motion and structure cues, the authors analyzed the proportion of each chosen by the learning algorithm, as a function of depth in the randomized forest. 

Existing databases of labeled images do not include frames taken from video sequences, and usually label relevant classes with only bounding boxes.