
View Synthesis for Recognizing Unseen Poses of Object Classes

Silvio Savarese¹ and Li Fei-Fei²

¹ Department of Electrical Engineering, University of Michigan at Ann Arbor
² Department of Computer Science, Princeton University

In: D. Forsyth, P. Torr, and A. Zisserman (Eds.): ECCV 2008, Part III, LNCS 5304, pp. 602–615. © Springer-Verlag Berlin Heidelberg 2008.
Abstract. An important task in object recognition is to enable algo-
rithms to categorize objects under arbitrary poses in a cluttered 3D
world. A recent paper by Savarese & Fei-Fei [1] has proposed a novel
representation to model 3D object classes. In this representation sta-
ble parts of objects from one class are linked together to capture both
the appearance and shape properties of the object class. We propose to
extend this framework and improve the ability of the model to recog-
nize poses that have not been seen in training. Inspired by works in
single object view synthesis (e.g., Seitz & Dyer [2]), our new represen-
tation allows the model to synthesize novel views of an object class at
recognition time. This mechanism is incorporated in a novel two-step
algorithm that is able to classify objects under arbitrary and/or unseen
poses. We compare our results on pose categorization with the model
and dataset presented in [1]. In a second experiment, we collect a new,
more challenging dataset of 8 object classes from crawling the web. In
both experiments, our model shows competitive performance compared
to [1] for classifying objects in unseen poses.
1 Introduction
An important goal in object recognition is to be able to recognize an object or an
object category given an arbitrary viewpoint. Humans can do this effortlessly
under most conditions. Consider the search for your car in a crowded shopping
center parking lot. We often need to look around 360 degrees in search of our
vehicle. Similarly, this ability is crucial for a robust, intelligent visual recognition
system. Fig. 1 illustrates the problem we would like to solve. Given an image
containing some object(s), we want to 1) categorize the object as a car (or a
stapler, or a computer mouse), and 2) estimate the pose (or view) of the car.
Here by ‘pose’, we refer to the 3D information of the object that is defined by
the viewing angle and scale of the object (i.e. a particular point on the viewing
sphere represented in Fig. 6). If we have seen this pose at training time, and have a way of modeling such information, the problem reduces to matching
the known model with the new image. This is the approach followed by a number
of existing works, where either each object class is assumed to be seen under a unique pose [3,4,5,6,7,8,9,10] or a class model is associated with a specific pose, giving rise to mixture models [11,12,13]. But it is not always possible for an
algorithm to have been trained with all views of the objects. In many situations,

training is limited (either by the number of examples, or by the coverage of all possible poses of the object class); it is therefore important to be able to extrapolate information and make the best guess possible given this limitation. This is the approach we present in this paper.

Fig. 1. Categorize an Object Given an Unseen View. Examples: car, azimuth = 200°, zenith = 30°; stapler, azimuth = 75°, zenith = 50°; mouse, azimuth = 60°, zenith = 70°. azimuth: [front, right, back, left] = [0, 90, 180, 270]°; zenith: [low, med., high] = [0, 45, 90]°.
In image-based rendering, novel view synthesis (morphing) has been an active and prolific area of research [14,15,16]. Seitz & Dyer [2] proposed a method to
morph two observed views of an object into a new, unseen view using basic
principles of projective geometry. Other researchers explored similar formula-
tions [17,18] based on multi-view geometry [19] or extended these results to 3-
view morphing techniques [20]. The key property of view-synthesis techniques is
their ability to generate new views of an object without reconstructing its actual
3D model. It is unclear, however, whether these techniques can be used as-is for recognizing unseen views of object categories under very general conditions: they were designed to work on single object instances (or at most two), with no background clutter, and with feature correspondences across views given (Figs. 6–10 of [2]). In
our work we try to inherit the view-morphing machinery while generalizing it to
the case of object categories. At the opposite end of the spectrum, several works have addressed the issue of single object recognition by modeling different degrees of 3D information. Since these methods achieve recognition by matching local features [21,22,23,24] or groups of local features [25,26] under rigid geometrical transformations, they can hardly be extended to handle object classes.
Recently, a number of works have proposed interesting solutions for capturing
the multi-view essence of an object category [1,27,28,29,30,31,32]. These tech-
niques bridge the gap between models that represent an object category from just
a single 2D view and models that represent single object instances from multiple
views. Among these, [32] presents an interesting methodology for repopulating
the number of views in training by augmenting the views with synthetic data.
In [1] a framework was proposed in which stable parts of objects from one class
are linked together to capture both the appearance and shape properties of the
object class. Our work extends and simplifies the representation in [1]. Our critical contributions are:
- We propose a novel method for representing and synthesizing views of object classes that are not present in training. Our view-synthesis approach is inspired by previous research on view morphing and image synthesis from multiple views. However, the main contribution of our approach is that the synthesis takes place at the categorical level as opposed to the single object level (as previously explored).
- We propose a new algorithm that takes advantage of our view-synthesis
machinery for recognizing objects seen under arbitrary views. As opposed
to [32] where training views are augmented by using synthetic data, we
synthesize the views at recognition time. Our experimental analysis validates
our theoretical findings and shows that our algorithm is able to successfully
estimate object classes and poses under very challenging conditions.
2 Model Representation for Unseen Views
We start with an overview of the overall object category model [1] in Sec. 2.1
and give details of our new view synthesis analysis in Sec. 2.2.
2.1 Overview of the Savarese et al. Model [1]
Fig. 2 illustrates the main ideas of the model proposed by [1]. We use the car
category as an example for an overview of the model. There are two main com-
ponents of the model: the canonical parts and the linkage structure among the
canonical parts. A canonical part P in the object class model refers to a region of
the object that tends to occur frequently across different instances of the object
class (e.g. rear bumper of a car). It is automatically determined by the model.
The canonical parts are regions containing multiple features in the images, and
are the building blocks of the model. As previous research has shown, a part-based representation [26,28,29] is more stable for capturing the appearance vari-
ability across instances of objects. A critical property introduced in [1] is that the
canonical part retains the appearance of a region that is viewed most frontally
on the object. In other words, a car’s rear bumper could render different appear-
ances under different geometric transformations as the observer moves around
the viewing sphere (see [1] for details). The canonical part representation of the
car rear bumper is the one that is viewed the most frontally (Fig. 2(a)).
Given an assortment of canonical parts (e.g. the colored patches in Fig. 2(b)),
a linkage structure connects each pair of canonical parts {P_j, P_i} if they can both be visible at the same time (Fig. 2(c)). The linkage captures the relative position (represented by the 2 × 1 vector t_{ij}) and the change of pose of a canonical part given the other (represented by a 2 × 2 homographic transformation A_{ij}). If the two canonical parts share the same pose, then the linkage is simply the translation vector t_{ij} (since A_{ij} = I). For example, given that part P_i (left rear light) is canonical, the pose (and appearance) of all connected canonical parts must change according to the transformation imposed by A_{ij} for j = 1 ... N, j ≠ i, where N is the total number of parts connected to P_i. This transformation is depicted in Fig. 2(c) by showing a slanted version of each canonical part (for details of the model, the reader may refer to [1]).
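The linkage structure is essentially a sparse graph over canonical parts. As a minimal illustration (a sketch with invented names, not the authors' implementation), it could be stored as:

```python
import numpy as np

# Sketch of the linkage structure: nodes are canonical part ids, and an edge
# (i, j) exists only if parts i and j can be visible at the same time.
# Each edge stores the relative position t_ij (2-vector) and the relative
# change of pose A_ij (2x2); A_ij is the identity when both parts share a view.
links = {}  # (i, j) -> (t_ij, A_ij)

def add_link(i, j, t_ij, A_ij=None):
    """Link canonical parts i and j; A_ij defaults to I (same pose)."""
    A_ij = np.eye(2) if A_ij is None else np.asarray(A_ij, dtype=float)
    links[(i, j)] = (np.asarray(t_ij, dtype=float), A_ij)

def same_view(i, j):
    """True if i and j are linked purely by translation (A_ij = I),
    i.e. they belong to the same canonical view."""
    if (i, j) not in links:
        return False
    _, A_ij = links[(i, j)]
    return np.allclose(A_ij, np.eye(2))
```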
We define a canonical view V as the collection of canonical parts that share the same view V (Fig. 2(c)). Thus, each pair of canonical parts {P_i, P_j} within V is connected by A_{ij} = I and a translation vector t_{ij}.
Fig. 2. Model Summary. Panel a: A car within the viewing sphere. As the observer
moves on the viewing sphere the same part produces different appearances. The location
on the viewing sphere where the part is viewed the most frontally gives rise to a canonical
part. The appearance of such canonical part is highlighted in green. Panel b: Colored
markers indicate locations of other canonical parts. Panel c: Canonical parts are con-
nected together in a linkage structure. The linkage indicates the relative position and
change of pose of a canonical part given the other (if they are both visible at the same
time). This change of location and pose is represented by a translation vector and a homo-
graphic transformation respectively. The homographic transformation between canoni-
cal parts is illustrated by showing that some canonical parts are slanted with respect to
others. A collection of canonical parts that share the same view defines a canonical view
(for instance, see the canonical parts enclosed in the area highlighted in yellow).
We can interpret a
canonical view V as a subset of the overall linkage structure of the object cat-
egory. Notice that by construction a canonical view may coincide with one of
the object category poses used in learning. However, not all the poses used in
learning will be associated to a canonical view V . The reason is that a canon-
ical view is a collection of canonical parts and each canonical part summarizes
the appearance variability of an object category part under different poses. The
relationship of parts within the same canonical view is what previous literature has extensively used for representing 2D object categories from single 2D views
(e.g. the constellation models [4,6]). The linkage structure can be interpreted
as its generalization to the multi-view case. Similarly to other methods based
on constellations of features or parts, the linkage structure of canonical parts is
robust to occlusions and background clutter.
2.2 Representing an Unseen View
The critical question is: how can we represent (synthesize) a novel non-canonical
view from the set of canonical views contained in the linkage structure? As we
will show in Sec. 3, this ability becomes crucial if we want to recognize an object
category seen under an arbitrary pose. Our approach is inspired by previous
research on view morphing and image synthesis from multiple views. We show
that it is possible to use similar machinery for synthesizing the appearance, pose
and position of canonical parts from two or more canonical views. Notice that the
output of this representation (synthesis) is a novel view of the object category,
not just a novel view of a single object instance, whereas all previous morphing
techniques are used for synthesizing novel views of single objects.
Representing Canonical Parts. In [1], each canonical part is represented by
a distribution of feature descriptors along with their x, y location within the part.
In our work, we simplify this representation and describe a canonical part P by a convex quadrangle B (e.g., the bounding box) enclosing the set of features. The appearance of this part is then characterized by a bag-of-codewords model [5], that is, a normalized histogram h of the vector quantized descriptors contained in B. Our choice of feature detectors and descriptors is the same as in [1]. A standard K-means algorithm can be used for extracting the codewords. B is a 2 × 4 matrix encoding the b = [x, y]^T coordinates of the four corners of the quadrangle, i.e. B = [b_1 ... b_4]; h is an M × 1 vector, where M is the size of the vocabulary of vector quantized descriptors.
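For concreteness, the histogram h could be computed as follows. This is a generic bag-of-codewords sketch: the specific detectors and descriptors of [1] are abstracted away, and all names are illustrative.

```python
import numpy as np

def codeword_histogram(descriptors, codebook):
    """Vector quantize the descriptors found inside quadrangle B against a
    K-means codebook (M x D) and return the normalized histogram h (M,)."""
    # Squared distance from every descriptor (N x D) to every codeword (M x D).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)               # nearest codeword per descriptor
    h = np.bincount(labels, minlength=len(codebook)).astype(float)
    return h / max(h.sum(), 1.0)             # normalize so h sums to 1
```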
Given a linked pair of canonical parts {P_i, P_j} and their corresponding {B_i, B_j}, the relative position of the parts {P_i, P_j} is defined by t_{ij} = c_i − c_j, where the centroid c_i = (1/4) Σ_k b_k; the relative change of pose is defined by A_{ij}, which encodes the homographic transformation acting on the coordinates of B_i. This simplification is crucial for allowing more flexibility in handling the synthesis of novel non-canonical views at the categorical level.
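In code, the relative-position term follows directly from the corner coordinates; a small sketch under the same 2 × 4 convention for B (illustrative names, not the authors' code):

```python
import numpy as np

def centroid(B):
    """c = (1/4) * sum_k b_k: mean of the four corners of quadrangle B (2x4)."""
    return B.mean(axis=1)

def relative_position(B_i, B_j):
    """t_ij = c_i - c_j: relative position of two linked canonical parts."""
    return centroid(B_i) - centroid(B_j)
```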
View Morphing. Given two views of a 3D object it is possible to synthesize
a novel view by using view-interpolating techniques without reconstructing the
3D object shape. It has been shown that a simple linear image interpolation (or appearance morphing) between views does not convey a correct 3D rigid shape transformation unless the views are parallel (that is, the camera moves parallel to the image plane) [15]. Moreover, Seitz & Dyer [2] have shown that if the camera projection matrices are known, then a geometrical-morphing technique can be used to synthesize a new view even without parallel views. However,
estimating the camera projection matrices for the object category may be very
difficult in practice. We notice that under the assumption of having the views
in a neighborhood on the viewing sphere, the cameras can be approximated as
being parallel, enabling a simple linear interpolation scheme (Fig. 3). Next we
show that by combining appearance and geometrical morphing it is possible to
synthesize a novel view (meant as a collection of parts along with their linkage)
from two or more canonical views.
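As a toy illustration of the parallel-view approximation (a hedged sketch of the general idea, not the paper's exact formulation), both geometry and appearance can be interpolated linearly between two corresponding parts:

```python
import numpy as np

def linear_morph(B_a, B_b, h_a, h_b, s):
    """Interpolate a part seen in two (approximately parallel) views.
    B_a, B_b: 2x4 corner quadrangles; h_a, h_b: codeword histograms;
    s in [0, 1] selects a point along the camera trajectory."""
    B_s = (1.0 - s) * B_a + s * B_b   # geometric morphing of the corners
    h_s = (1.0 - s) * h_a + s * h_b   # appearance morphing of the histograms
    return B_s, h_s
```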
Two-View Synthesis. We start with the simpler case of synthesizing from two canonical views V^n and V^m. A synthesized view V^s can be expressed as a collection of linked parts morphed from the corresponding canonical parts belonging to V^n and V^m. Specifically, a pair of linked parts {P_i^s, P_j^s} ∈ V^s can be synthesized from the pair {P_i^n ∈ V^n, P_j^m ∈ V^m} if and only if P_i^n and P_j^m are linked by the homographic transformation A_{ij} ≠ I (Fig. 3). If we represent {P_i^s, P_j^s} by the quadrangles {B_i^s, B_j^s} and the histograms {h_i^s, h_j^s} respectively, a new view is expressed by:
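The equation itself is truncated in this copy of the text. As a hedged reconstruction of the idea (an assumption built from the linkage definitions and the parallel-view interpolation above, not the paper's verbatim formula), the part's quadrangle can be mapped into the other view through (A_{ij}, t_{ij}) and then interpolated:

```python
import numpy as np

def synthesize_part(B_i_n, h_i_n, h_j_m, A_ij, t_ij, s):
    """Hypothetical sketch of two-view part synthesis (s in [0, 1]).
    B_i_n: 2x4 corners of P_i^n in canonical view V^n.
    A_ij, t_ij: linkage from P_i^n to the linked part P_j^m in V^m."""
    # Where the quadrangle of P_i^n would fall when seen from V^m.
    B_i_m = A_ij @ B_i_n + np.asarray(t_ij, dtype=float).reshape(2, 1)
    # Geometric morphing between the two projections of the part.
    B_s = (1.0 - s) * B_i_n + s * B_i_m
    # Appearance morphing between the linked parts' histograms.
    h_s = (1.0 - s) * np.asarray(h_i_n) + s * np.asarray(h_j_m)
    return B_s, h_s
```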

References
- P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. CVPR, 2001.
- D. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV, 1999.
- R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
- G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In ECCV Workshop on Statistical Learning in Computer Vision, 2004.