
Segmentation and Recognition using Structure from Motion Point Clouds

Gabriel J. Brostow^1, Jamie Shotton^2, Julien Fauqueur^3, and Roberto Cipolla^4

^1 University College London and ETH Zurich
^2 Microsoft Research Cambridge
^3 University of Cambridge (now with MirriAd Ltd.)
^4 University of Cambridge
Abstract. We propose an algorithm for semantic segmentation based on
3D point clouds derived from ego-motion. We motivate five simple cues
designed to model specific patterns of motion and 3D world structure
that vary with object category. We introduce features that project the
3D cues back to the 2D image plane while modeling spatial layout and
context. A randomized decision forest combines many such features to
achieve a coherent 2D segmentation and recognize the object categories
present. Our main contribution is to show how semantic segmentation is
possible based solely on motion-derived 3D world structure. Our method
works well on sparse, noisy point clouds, and unlike existing approaches,
does not need appearance-based descriptors.
Experiments were performed on a challenging new video database con-
taining sequences filmed from a moving car in daylight and at dusk. The
results confirm that indeed, accurate segmentation and recognition are
possible using only motion and 3D world structure. Further, we show that
the motion-derived information complements an existing state-of-the-art
appearance-based method, improving both qualitative and quantitative
performance.
[Fig. 1 panels, left to right: input video frame | reconstructed 3D point cloud | automatic segmentation]
Fig. 1. The proposed algorithm uses 3D point clouds estimated from videos such as the
pictured driving sequence (with ground truth inset). Having trained on point clouds
from other driving sequences, our new motion and structure features, based purely on
the point cloud, perform 11-class semantic segmentation of each test frame. The colors
in the ground truth and inferred segmentation indicate category labels.

1 Introduction
We address the question of whether motion and 3D world structure can be used
to accurately segment video frames and recognize the object categories present.
In particular, as illustrated in Fig. 1, we investigate how to perform semantic
segmentation from the sparse, noisy 3D point cloud given by structure from ego-
motion. Our algorithm is able to accurately recognize objects and segment video
frames without appearance-based descriptors or dense depth estimates obtained
using e.g., dense stereo or laser range finders. The structure from motion, or
SfM, community [1] has demonstrated the value of ego-motion derived data,
and their modeling efforts have even extended to stationary geometry of cities [2].
However, the object recognition opportunities presented by the inferred motion
and structure have largely been ignored.^1

^1 The work of [3] was similarly motivated, and used laser-scans of static scenes to
compute a 3D planar patch feature, which helped to train a chain of binary classifiers.
The proposed algorithm uses camera-pose estimation from video as an exist-
ing component, and assumes ego-motion is the dominant cause of pixel flow [4].
Tracked 2D image features are triangulated to find their position in world space
and their relationship to the moving camera path. We suggest five simple motion
and structure cues that are indicative of object categories present in the scene.
Projecting these cues from the 3D point cloud to the 2D image, we build a ran-
domized decision forest classifier to perform a coherent semantic segmentation.
Our main contributions are: (i) a demonstration that semantic segmentation
is possible based solely on motion-derived 3D world structure; (ii) five intuitive
motion and structure cues and a mechanism for projecting these 3D cues to the
2D image plane for semantic segmentation; and (iii) a challenging new database
of video sequences filmed from a moving car and hand-labeled with ground-
truth semantic segmentations. Our evaluation shows performance comparable
to existing state-of-the-art appearance based techniques, and further, that our
motion-derived features complement appearance-based features, improving both
qualitative and quantitative performance.
Background. An accurate automatic scene understanding of images and videos
has been an enduring goal of computer vision, with applications varying from
image search to driving safety. Many successful techniques for 2D object recogni-
tion have used individual still images [5–7]. Without using SfM, Hoiem et al. [8, 9]
achieve exciting results by considering several spatial cues found in single images,
such as surface orientations and vanishing points, to infer the camera viewpoint
or general scene structure. This, in turn, helps object recognition algorithms
refine their hypotheses, culling spatially infeasible detections. 3D object recog-
nition is still a new research area. Huber et al. [10] matched laser rangefinder
data to learned object models. Other techniques build 3D object models and
match them to still images using local descriptors [11–14]. None of these meth-
ods, however, can exploit the motion-based cues available in video sequences.
Dalal et al. [15] is a notable exception that used differential optical flow in pairs
of images. In this paper, we reason about the moving 3D scene given a moving
2D camera. Our method works well on sparse, noisy point clouds, and does not
need appearance-based descriptors attached to 3D world points.
There is a long history of fascinating research about motion-based recognition
of human activities [16]. Laptev and Lindeberg [17] introduced the notion of
space-time interest points to help detect and represent sudden actions as high
gradient points in the xyt cube for motion-based activity recognition. Our focus
is rather object recognition, and our features do not require a stationary camera.
While it is tempting to apply other detectors (e.g., pedestrians [18]) directly
to the problem of recognizing objects from a moving camera, motion compensa-
tion and motion segmentation are still relatively open problems. Yin et al. [19]
use low-level motion cues for bi-layer video segmentation, though do not achieve
a semantic labeling. Computer vision for driving has proven challenging and has
previously been investigated with a related focus on motion segmentation [20].
For example, Kang et al. [21] have recently shown an improvement in the state
of the art while using a structure consistency constraint similar to one of our
motion cues. Leibe et al. [22] address recognition of cars and pedestrians from
a moving vehicle. Our technique handles both these and nine further categories,
and additionally semantically segments the image, without requiring their ex-
pensive stereo setup.
Optical flow has aided recognition of objects for static cameras [23], but for-
ward ego-motion dominates the visual changes in our footage. Depth-specific
motion compensation may help, but requires accurate dense-stereo reconstruc-
tion or laser range-scanning. We instead employ features based on a sparse SfM
point cloud and avoid these problems.
2 Structure from Motion Point Clouds
We use standard structure from ego-motion techniques to automatically generate
a 3D point cloud from video sequences filmed from moving cars. The dominant
motion in the sequences gives the camera world-pose and thereby the relative
3D point cloud of all tracked 2D features, including outliers.
We start by tracking 2D image features. Specifically, we use Harris-Stephens
corners [24] with localized normalized cross correlation to track 20 × 20 pixel
patches through time in a search window 15% of the image dimensions. In prac-
tice, this produced reliable 2D trajectories that usually spanned more than 5
frames. To reduce the number of mis-tracks, each initial template is tracked
only until its correlation falls below 0.97.
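As a concrete illustration of this tracking stage, the sketch below shows how Harris corners and normalized cross-correlation patch tracking could be implemented with OpenCV. It is a minimal approximation of the procedure described above, not the authors' code; the function names, the corner count, and the border handling are illustrative choices.

```python
# Hedged sketch of the 2D tracking stage: Harris corners are detected, and each
# 20x20 patch is followed frame-to-frame by normalized cross-correlation inside
# a search window (~15% of the image size), until the score drops below 0.97.
import cv2
import numpy as np

PATCH = 20          # template size (pixels), as in the paper
MIN_CORR = 0.97     # stop tracking when NCC falls below this

def detect_corners(gray, max_corners=2000):
    pts = cv2.goodFeaturesToTrack(gray, max_corners, qualityLevel=0.01,
                                  minDistance=5, useHarrisDetector=True, k=0.04)
    return [] if pts is None else [tuple(p.ravel()) for p in pts]

def track_patch(prev_gray, next_gray, center):
    """Track one 20x20 patch into the next frame; return (new_center, score)."""
    h, w = prev_gray.shape
    win = int(0.15 * max(h, w))                      # search window radius
    cx, cy = int(center[0]), int(center[1])
    half = PATCH // 2
    tmpl = prev_gray[cy - half:cy + half, cx - half:cx + half]
    x0, x1 = max(0, cx - win), min(w, cx + win)
    y0, y1 = max(0, cy - win), min(h, cy + win)
    search = next_gray[y0:y1, x0:x1]
    if tmpl.shape != (PATCH, PATCH) or search.shape[0] < PATCH or search.shape[1] < PATCH:
        return None, 0.0                             # too close to the border
    res = cv2.matchTemplate(search, tmpl, cv2.TM_CCORR_NORMED)
    _, score, _, loc = cv2.minMaxLoc(res)            # best NCC response
    new_center = (x0 + loc[0] + half, y0 + loc[1] + half)
    return (new_center, score) if score >= MIN_CORR else (None, score)
```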
Footage is obtained from a car-mounted camera. We assume, for purposes
of 3D reconstruction, that changes between images are the result of only ego-
motion. This allows us to compute a single world-point $W = (x, y, z, 1)^T$ for
each point tracked in 2D image space, $(u_t, v_t)$. A best-fit $\tilde{W}$ is computed given
at least two corresponding $3 \times 4$ camera projection matrices $P_t$ from the sequence.
Matrices $P$ are inferred in a robust pre-processing stage, for which we simply use
a commercial product [4], which normalizes the resulting up-to-scale solutions
to 1.0. Then $P$ is split into row vectors $p^{1:3}$, so $W$ projects into the camera $C_t$
as $[u_1, v_1]^T$: $[\tilde{u}_1, \tilde{v}_1, \lambda]^T = [p^1, p^2, p^3]^T W$, and dividing through by $\lambda$ gives
$u_1 = \frac{p^1 W}{p^3 W}$, $v_1 = \frac{p^2 W}{p^3 W}$, and similarly for $(u_2, v_2)$, $P_{t+1}$, and $C_{t+1}$. As long as the
feature was moving, a least squares solution exists for the three unknowns of
$\tilde{W}$, given these four or more (in the case of longer feature tracks) equations.
We reconstruct using only the most temporally separated matrices $P$, instead of
finding a $\tilde{W}$ based on the whole 2D track. This strategy generally gives maximum
disparity and saves needless computations. After computing the camera poses,
no outlier rejection is performed, so that an order of magnitude more tracked
points are triangulated for the point cloud.
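The triangulation above amounts to a small linear least-squares problem per feature track. The sketch below is one way to solve it with NumPy, assuming the two most temporally separated projection matrices and the corresponding image points are already available; the helper name and the returned residual (later reusable as the reconstruction-error cue) are illustrative.

```python
# Hedged sketch of the triangulation step: given 3x4 camera matrices P_a, P_b
# and tracked image points (u_a, v_a), (u_b, v_b), solve the over-determined
# system (p1 - u p3) W = 0, (p2 - v p3) W = 0 in least squares for (x, y, z).
import numpy as np

def triangulate(P_a, P_b, uv_a, uv_b):
    rows, rhs = [], []
    for P, (u, v) in ((P_a, uv_a), (P_b, uv_b)):
        p1, p2, p3 = P                 # row vectors of the 3x4 projection matrix
        for r in (p1 - u * p3, p2 - v * p3):
            rows.append(r[:3])         # coefficients of the unknowns (x, y, z)
            rhs.append(-r[3])          # homogeneous coordinate of W is 1
    A, b = np.asarray(rows), np.asarray(rhs)
    W, *_ = np.linalg.lstsq(A, b, rcond=None)   # best-fit 3D point
    residual = np.linalg.norm(A @ W - b)        # reconstruction error of the fit
    return W, residual
```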
3 Motion and 3D Structure Features
We now describe the new motion and 3D structure features that are based on
the inferred 3D point cloud. We suggest five simple cues that can be estimated
reliably and are projected from the 3D world into features on the 2D image
plane, where they enable semantic segmentation. We conclude the section by
explaining how a randomized decision forest combines these simple weak features
into a powerful classifier that performs the segmentation.
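The paper's forest operates on its own projected 2D feature tests (Sect. 3.2), so the following is only a hedged stand-in: it uses scikit-learn's extremely randomized trees to show how many weak per-pixel motion and structure responses could be combined into an 11-class per-pixel labelling. The feature-matrix layout, tree count, and depth are assumptions, not values from the paper.

```python
# Stand-in for the randomized decision forest: per-pixel cue responses are
# combined by an ensemble of extremely randomized trees into a class label
# per pixel, which is then reshaped back into a 2D segmentation map.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def train_forest(pixel_features, pixel_labels, n_trees=50, max_depth=15):
    """pixel_features: (N, D) cue responses; pixel_labels: (N,) class ids."""
    forest = ExtraTreesClassifier(n_estimators=n_trees, max_depth=max_depth,
                                  n_jobs=-1, class_weight="balanced")
    return forest.fit(pixel_features, pixel_labels)

def segment_frame(forest, frame_features, height, width):
    """Predict one label per pixel and reshape into a 2D segmentation map."""
    return forest.predict(frame_features).reshape(height, width)
```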
3.1 Cues from Point Clouds
Just as there are many ways to parameterize the colors and texture of appear-
ance, there are numerous ways to parameterize 3D structure and motion. We
propose five motion and structure cues. These are based on the inferred 3D point
cloud, which, given the small baseline changes, is rather noisy. The cues were
chosen as robust, intuitive, efficient to compute, and general-purpose but object-
category covariant, though these five are by no means exhaustive. The cues also
fit nicely with the powerful 3D to 2D projection mechanism (Sect. 3.2). With
the driving application in mind, they were designed to be invariant to camera
pitch, yaw, and perspective distortion, and could generalize to other domains.
The cues are: height above the camera, distance to the camera path, pro-
jected surface orientation, feature track density, and residual reconstruction er-
ror. These are intentionally weak; stronger features would not work with the
sparse noisy point clouds, though dense feature tracking could someday enable
one to apply [25]. We use machine learning to isolate reliable patterns and build
a strong classifier that combines many of these cues (Sect. 3.3). By projecting
from the 3D point cloud to the 2D image as described in Sect. 3.2, we are able
to exploit contextual relationships. One of the benefits of video is that analysis
of one frame can often be improved through information in neighboring frames.
Our cues take advantage of this since feature tracks exist over several frames.
Height above the camera $f_H$. During video of a typical drive, one will notice
that the only fairly fixed relationship between the 3D coordinate frames of the
camera $C$ and the world is the camera's height above the pavement (Fig. 2).
Measuring height in image-space would be very susceptible to bumps in the
road. Instead, after aligning the car's initial "up" vector as the camera's $y$
axis, the height of each world point $\tilde{W}$ is compared to the camera center's $y$
coordinate as $f_H(\tilde{W}) = \tilde{W}_y - C_y$. By including a fixed offset $C_y$, the algorithm
can be trained on point clouds from one vehicle, but run on other cameras and
vehicles. Our experiments use footage from two different cars.

Fig. 2. The height ($f_H$), camera distance ($f_C$), and residual error ($f_R$) features are
illustrated for a car following the dotted yellow path. The red vertical arrow shows how
$f_H$ captures the height above the ground of a 3D point (red dot) reconstructed at the
top of the stop light. The green arrow reflects the smallest distance between the point
on the railing and the car's path. The blue ellipse for $f_R$ illustrates the large residual
error, itself a feature, in estimating the world coordinate $\tilde{W}$ of a point on the moving
person's head.
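A minimal sketch of the height cue, assuming the reconstructed points are expressed in a world frame whose y axis is the car's initial "up" vector; the function and variable names are illustrative.

```python
# f_H(W~) = W~_y - C_y for every reconstructed 3D point.
import numpy as np

def f_height(W_points, camera_center_y):
    """W_points: (N, 3) array of reconstructed points; returns (N,) heights."""
    return W_points[:, 1] - camera_center_y
```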
Closest distance to camera path $f_C$. The paths of moving vehicles on road
surfaces are less repeatable than a class's absolute height in world coordinates,
but classes such as buildings and trees are normally set back from driving roads
by a fixed distance (Fig. 2). This feature, using the full sequence of camera
centers $C(t)$, gives the value of the smallest recorded 3D separation between $C$
and each $\tilde{W}$ as $f_C(\tilde{W}) = \min_t \| \tilde{W} - C(t) \|$. Note that the smallest separation
may occur after a feature in the current frame goes out of view. Such is the case
most obviously with features reconstructed on the surface of the road.
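A minimal sketch of the camera-distance cue, computing the smallest 3D separation between each reconstructed point and the full sequence of camera centres by brute force; the names and the brute-force strategy are illustrative, not from the paper's implementation.

```python
# f_C(W~) = min_t ||W~ - C(t)|| over the whole camera path.
import numpy as np

def f_closest_distance(W_points, camera_centers):
    """W_points: (N, 3); camera_centers: (T, 3). Returns (N,) minimum distances."""
    diff = W_points[:, None, :] - camera_centers[None, :, :]   # (N, T, 3)
    return np.linalg.norm(diff, axis=2).min(axis=1)
```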
Surface Orientation $f_{O_x}$, $f_{O_y}$. The points $\tilde{W}$ in the point cloud are too sparse
and inaccurate in depth to allow an accurate 3D reconstruction of a faceted
world, but do still contain useful spatial information. A 2D Delaunay
triangulation [26] is performed on all the projected $\tilde{W}$ points in a given frame. Each
2D triangle is made of 3D coordinates which have inaccurate depths but,
heuristically, acceptable relative depth estimates, and thus can give an approximate
local surface orientation. The 3D normal vector for each triangle is projected
to an angled vector on the image plane in 2D. The x and y components of this
2D angle are encoded in the red and green channels of a false-rendering of the
triangulation, shown in the supplementary data online.
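The sketch below approximates the surface-orientation cues: a 2D Delaunay triangulation of the projected points, a 3D normal per triangle, and the normal's x/y components as the two orientation channels. It assumes the 3D points are already in the camera coordinate frame, so taking the normal's x and y components stands in for the projection onto the image plane; the names are illustrative.

```python
# Approximate f_Ox, f_Oy: Delaunay-triangulate the 2D projections, compute a
# 3D normal per triangle, and keep its image-plane x/y components.
import numpy as np
from scipy.spatial import Delaunay

def surface_orientation(points_2d, points_3d):
    """points_2d: (N, 2) image projections; points_3d: (N, 3) camera-frame points.
    Returns (triangles, normals_xy) with one 2D orientation per triangle."""
    tri = Delaunay(points_2d)
    normals_xy = []
    for simplex in tri.simplices:               # vertex indices of each 2D triangle
        a, b, c = points_3d[simplex]
        n = np.cross(b - a, c - a)              # 3D facet normal
        n /= (np.linalg.norm(n) + 1e-9)
        normals_xy.append(n[:2])                # x/y components on the image plane
    return tri.simplices, np.asarray(normals_xy)
```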
Track Density $f_D$. Faster moving objects, like oncoming traffic and people,
often yield sparser feature tracks than stationary objects. Further, some object
classes have more texture than others. We thus use the track density as one of
the motion-derived cues. $f_D(t)$ is the 2D image-space map of the feature density,

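A minimal sketch of the track-density cue, approximating the 2D image-space density map by a smoothed histogram of feature positions; the bin size and Gaussian blur radius are illustrative choices, not values from the paper.

```python
# f_D: per-pixel map of how densely 2D feature tracks cover the current frame,
# built here as a 2D histogram of track positions smoothed with a Gaussian.
import numpy as np
from scipy.ndimage import gaussian_filter

def f_track_density(track_points, height, width, sigma=15.0):
    """track_points: (N, 2) array of (u, v) feature positions in this frame."""
    density = np.zeros((height, width), dtype=np.float32)
    u = np.clip(track_points[:, 0].astype(int), 0, width - 1)
    v = np.clip(track_points[:, 1].astype(int), 0, height - 1)
    np.add.at(density, (v, u), 1.0)             # count features per pixel
    return gaussian_filter(density, sigma)      # smooth into a density map
```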
References

Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features.
Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision.
Harris, C., Stephens, M.: A combined corner and edge detector.
Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees.
Frequently Asked Questions (10)
Q1. What contributions have the authors mentioned in the paper "Segmentation and recognition using structure from motion point clouds"?

The authors propose an algorithm for semantic segmentation based on 3D point clouds derived from ego-motion. The authors introduce features that project the 3D cues back to the 2D image plane while modeling spatial layout and context. Their method works well on sparse, noisy point clouds, and unlike existing approaches, does not need appearance-based descriptors. Further, the authors show that the motion-derived information complements an existing state-of-the-art appearance-based method, improving both qualitative and quantitative performance. 

The authors hope that semi-supervised techniques that use extra partially labeled or unlabeled training data may lead to improved performance in the future. 

The authors trained a randomized decision forest based on their five motion and structure cues, using combined day and dusk sequences for both training and testing. 

The labeled data has 11 categories: Building, Tree, Sky, Car, Sign-Symbol, Road, Pedestrian, Fence, Column-Pole, Sidewalk, and Bicyclist. 

By including a fixed offset Cy, the algorithm can be trained on point clouds from one vehicle, but run on other cameras and vehicles. 

One by-product of balancing the categories during training is that the areas of smaller classes in the images tend to be overestimated, spilling out into the background (e.g., the bicycle in Fig. 7). 

Learning a histogram for each pair of (motion and structure, appearance) tree leaf nodes could better modelthe joint dependencies of the two classifiers, but care must be taken so that in avoiding overfitting, quadratically more training data is not required. 

Motion and structure features do however have an obvious advantage over appearance features: generalization to novel lighting and weather conditions.

To determine the relative importance of the five motion and structure cues, the authors analyzed the proportion of each chosen by the learning algorithm, as a function of depth in the randomized forest. 

Existing databases of labeled images do not include frames taken from video sequences, and usually label relevant classes with only bounding boxes.