
View Synthesis for Recognizing Unseen Poses of Object Classes

Silvio Savarese¹ and Li Fei-Fei²

¹ Department of Electrical Engineering, University of Michigan at Ann Arbor
² Department of Computer Science, Princeton University

In: D. Forsyth, P. Torr, and A. Zisserman (Eds.): ECCV 2008, Part III, LNCS 5304, pp. 602–615. © Springer-Verlag Berlin Heidelberg 2008.
Abstract. An important task in object recognition is to enable algo-
rithms to categorize objects under arbitrary poses in a cluttered 3D
world. A recent paper by Savarese & Fei-Fei [1] has proposed a novel
representation to model 3D object classes. In this representation sta-
ble parts of objects from one class are linked together to capture both
the appearance and shape properties of the object class. We propose to
extend this framework and improve the ability of the model to recog-
nize poses that have not been seen in training. Inspired by works in
single object view synthesis (e.g., Seitz & Dyer [2]), our new represen-
tation allows the model to synthesize novel views of an object class at
recognition time. This mechanism is incorporated in a novel two-step
algorithm that is able to classify objects under arbitrary and/or unseen
poses. We compare our results on pose categorization with the model
and dataset presented in [1]. In a second experiment, we collect a new,
more challenging dataset of 8 object classes from crawling the web. In
both experiments, our model shows competitive performance compared
to [1] for classifying objects in unseen poses.
1 Introduction
An important goal in object recognition is to be able to recognize an object or an
object category given an arbitrary viewpoint. Humans can do this effortlessly
under most conditions. Consider the search for your car in a crowded shopping
center parking lot. We often need to look around 360 degrees in search of our
vehicle. Similarly, this ability is crucial for a robust, intelligent visual recognition
system. Fig. 1 illustrates the problem we would like to solve. Given an image
containing some object(s), we want to 1) categorize the object as a car (or a
stapler, or a computer mouse), and 2) estimate the pose (or view) of the car.
Here by ‘pose’, we refer to the 3D information of the object that is defined by
the viewing angle and scale of the object (i.e. a particular point on the viewing
sphere represented in Fig. 6). If we have seen this pose at training time, and have a way of modeling such information, the problem reduces to matching
the known model with the new image. This is the approach followed by a number
of existing works, where either each object class is assumed to be seen under a unique pose [3,4,5,6,7,8,9,10] or a class model is associated with a specific pose, giving rise to mixture models [11,12,13]. But it is not always possible for an
algorithm to have been trained with all views of the objects. In many situations,

training is limited (either by the number of examples, or by the coverage of all possible poses of the object class); it is therefore important to be able to extrapolate information and make the best guess possible given this limitation. This is the approach we present in this paper.

Fig. 1. Categorize an Object Given an Unseen View. Examples: car, azimuth = 200°, zenith = 30°; stapler, azimuth = 75°, zenith = 50°; mouse, azimuth = 60°, zenith = 70°. azimuth: [front, right, back, left] = [0, 90, 180, 270]°; zenith: [low, med., high] = [0, 45, 90]°.
In image-based rendering, novel view synthesis (morphing) has been an active and prolific area of research [14,15,16]. Seitz & Dyer [2] proposed a method to
morph two observed views of an object into a new, unseen view using basic
principles of projective geometry. Other researchers explored similar formula-
tions [17,18] based on multi-view geometry [19] or extended these results to 3-
view morphing techniques [20]. The key property of view-synthesis techniques is
their ability to generate new views of an object without reconstructing its actual
3D model. It is unclear, however, whether these techniques can be used as-is for recognizing unseen views of object categories under very general conditions: they were designed to work on single object instances (or at most two), with no background clutter, and with feature correspondences across views given (Figs. 6–10 of [2]). In
our work we try to inherit the view-morphing machinery while generalizing it to
the case of object categories. At the opposite end of the spectrum, several works have addressed the issue of single object recognition by modeling different degrees of 3D information. Since these methods achieve recognition by matching local features [21,22,23,24] or groups of local features [25,26] under rigid geometrical transformations, they can hardly be extended to handle object classes.
Recently, a number of works have proposed interesting solutions for capturing
the multi-view essence of an object category [1,27,28,29,30,31,32]. These tech-
niques bridge the gap between models that represent an object category from just
a single 2D view and models that represent single object instances from multiple
views. Among these, [32] presents an interesting methodology for repopulating
the number of views in training by augmenting the views with synthetic data.
In [1] a framework was proposed in which stable parts of objects from one class
are linked together to capture both the appearance and shape properties of the
object class. Our work extends and simplifies the representation in [1]. Our critical contributions are:
- We propose a novel method for representing and synthesizing views of object classes that are not present in training. Our view-synthesis approach is inspired by previous research on view morphing and image synthesis from multiple views. However, the main contribution of our approach is that the synthesis takes place at the categorical level as opposed to the single object level (as previously explored).
- We propose a new algorithm that takes advantage of our view-synthesis
machinery for recognizing objects seen under arbitrary views. As opposed
to [32] where training views are augmented by using synthetic data, we
synthesize the views at recognition time. Our experimental analysis validates
our theoretical findings and shows that our algorithm is able to successfully
estimate object classes and poses under very challenging conditions.
2 Model Representation for Unseen Views
We start with an overview of the overall object category model [1] in Sec. 2.1
and give details of our new view synthesis analysis in Sec. 2.2.
2.1 Overview of the Savarese et al. Model [1]
Fig. 2 illustrates the main ideas of the model proposed by [1]. We use the car
category as an example for an overview of the model. There are two main com-
ponents of the model: the canonical parts and the linkage structure among the
canonical parts. A canonical part P in the object class model refers to a region of
the object that tends to occur frequently across different instances of the object
class (e.g. rear bumper of a car). It is automatically determined by the model.
The canonical parts are regions containing multiple features in the images, and
are the building blocks of the model. As previous research has shown, a part-based representation [26,28,29] is more stable for capturing the appearance vari-
ability across instances of objects. A critical property introduced in [1] is that the
canonical part retains the appearance of a region that is viewed most frontally
on the object. In other words, a car’s rear bumper could render different appear-
ances under different geometric transformations as the observer moves around
the viewing sphere (see [1] for details). The canonical part representation of the
car rear bumper is the one that is viewed the most frontally (Fig. 2(a)).
Given an assortment of canonical parts (e.g. the colored patches in Fig. 2(b)),
a linkage structure connects each pair of canonical parts {P_j, P_i} if they can both be visible at the same time (Fig. 2(c)). The linkage captures the relative position (represented by the 2 × 1 vector t_{ij}) and the change of pose of a canonical part given the other (represented by a 2 × 2 homographic transformation A_{ij}). If the two canonical parts share the same pose, then the linkage is simply the translation vector t_{ij} (since A_{ij} = I). For example, given that part P_i (left rear light) is canonical, the pose (and appearance) of all connected canonical parts must change according to the transformation imposed by A_{ij} for j = 1 ... N, j ≠ i, where N is the total number of parts connected to P_i. This transformation is depicted in Fig. 2(c) by showing a slanted version of each canonical part (for details of the model, the reader may refer to [1]).
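The linkage structure is essentially a sparse graph over canonical parts. As a minimal illustration (a sketch with invented names, not the authors' implementation), it could be stored as:

```python
import numpy as np

# Sketch of the linkage structure: nodes are canonical part ids, and an edge
# (i, j) exists only if parts i and j can be visible at the same time.
# Each edge stores the relative position t_ij (2-vector) and the relative
# change of pose A_ij (2x2); A_ij is the identity when both parts share a view.
links = {}  # (i, j) -> (t_ij, A_ij)

def add_link(i, j, t_ij, A_ij=None):
    """Link canonical parts i and j; A_ij defaults to I (same pose)."""
    A_ij = np.eye(2) if A_ij is None else np.asarray(A_ij, dtype=float)
    links[(i, j)] = (np.asarray(t_ij, dtype=float), A_ij)

def same_view(i, j):
    """True if i and j are linked purely by translation (A_ij = I),
    i.e. they belong to the same canonical view."""
    if (i, j) not in links:
        return False
    _, A_ij = links[(i, j)]
    return np.allclose(A_ij, np.eye(2))
```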
We define a canonical view V as the collection of canonical parts that share the same view V (Fig. 2(c)). Thus, each pair of canonical parts {P_i, P_j} within V is connected by A_{ij} = I and a translation vector t_{ij}.
Fig. 2. Model Summary. Panel a: A car within the viewing sphere. As the observer
moves on the viewing sphere the same part produces different appearances. The location
on the viewing sphere where the part is viewed the most frontally gives rise to a canonical
part. The appearance of such canonical part is highlighted in green. Panel b: Colored
markers indicate locations of other canonical parts. Panel c: Canonical parts are con-
nected together in a linkage structure. The linkage indicates the relative position and
change of pose of a canonical part given the other (if they are both visible at the same
time). This change of location and pose is represented by a translation vector and a homo-
graphic transformation respectively. The homographic transformation between canoni-
cal parts is illustrated by showing that some canonical parts are slanted with respect to
others. A collection of canonical parts that share the same view defines a canonical view
(for instance, see the canonical parts enclosed in the area highlighted in yellow).
We can interpret a
canonical view V as a subset of the overall linkage structure of the object cat-
egory. Notice that by construction a canonical view may coincide with one of
the object category poses used in learning. However, not all the poses used in
learning will be associated to a canonical view V . The reason is that a canon-
ical view is a collection of canonical parts and each canonical part summarizes
the appearance variability of an object category part under different poses. The
relationship of parts within the same canonical view is what previous literature has extensively used for representing 2D object categories from single 2D views
(e.g. the constellation models [4,6]). The linkage structure can be interpreted
as its generalization to the multi-view case. Similarly to other methods based
on constellations of features or parts, the linkage structure of canonical parts is
robust to occlusions and background clutter.
2.2 Representing an Unseen View
The critical question is: how can we represent (synthesize) a novel non-canonical
view from the set of canonical views contained in the linkage structure? As we
will show in Sec. 3, this ability becomes crucial if we want to recognize an object
category seen under an arbitrary pose. Our approach is inspired by previous
research on view morphing and image synthesis from multiple views. We show
that it is possible to use similar machinery for synthesizing the appearance, pose
and position of canonical parts from two or more canonical views. Notice that the
output of this representation (synthesis) is a novel view of the object category,
not just a novel view of a single object instance, whereas all previous morphing
techniques are used for synthesizing novel views of single objects.
Representing Canonical Parts. In [1], each canonical part is represented by
a distribution of feature descriptors along with their x, y location within the part.
In our work, we simplify this representation and describe a canonical part P by a convex quadrangle B (e.g., the bounding box) enclosing the set of features. The appearance of this part is then characterized by a bag-of-codewords model [5], that is, a normalized histogram h of the vector quantized descriptors contained in B. Our choice of feature detectors and descriptors is the same as in [1]. A standard K-means algorithm can be used for extracting the codewords. B is a 2 × 4 matrix encoding the b = [x, y]^T coordinates of the four corners of the quadrangle, i.e. B = [b_1 ... b_4]; h is an M × 1 vector, where M is the size of the vocabulary of vector quantized descriptors.
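For concreteness, the histogram h could be computed as follows. This is a generic bag-of-codewords sketch: the specific detectors and descriptors of [1] are abstracted away, and all names are illustrative.

```python
import numpy as np

def codeword_histogram(descriptors, codebook):
    """Vector quantize the descriptors found inside quadrangle B against a
    K-means codebook (M x D) and return the normalized histogram h (M,)."""
    # Squared distance from every descriptor (N x D) to every codeword (M x D).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)               # nearest codeword per descriptor
    h = np.bincount(labels, minlength=len(codebook)).astype(float)
    return h / max(h.sum(), 1.0)             # normalize so h sums to 1
```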
Given a linked pair of canonical parts {P_i, P_j} and their corresponding {B_i, B_j}, the relative position of the parts {P_i, P_j} is defined by t_{ij} = c_i − c_j, where the centroid c_i = (1/4) Σ_k b_k; the relative change of pose is defined by A_{ij}, which encodes the homographic transformation acting on the coordinates of B_i. This simplification is crucial for allowing more flexibility in handling the synthesis of novel non-canonical views at the categorical level.
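In code, the relative-position term follows directly from the corner coordinates; a small sketch under the same 2 × 4 convention for B (illustrative names, not the authors' code):

```python
import numpy as np

def centroid(B):
    """c = (1/4) * sum_k b_k: mean of the four corners of quadrangle B (2x4)."""
    return B.mean(axis=1)

def relative_position(B_i, B_j):
    """t_ij = c_i - c_j: relative position of two linked canonical parts."""
    return centroid(B_i) - centroid(B_j)
```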
View Morphing. Given two views of a 3D object it is possible to synthesize
a novel view by using view-interpolating techniques without reconstructing the
3D object shape. It has been shown that a simple linear image interpolation (or appearance morphing) between views does not convey a correct 3D rigid shape transformation unless the views are parallel (that is, the camera moves parallel to the image plane) [15]. Moreover, Seitz & Dyer [2] have shown that if the camera projection matrices are known, then a geometrical-morphing technique can be used to synthesize a new view even without parallel views. However,
estimating the camera projection matrices for the object category may be very
difficult in practice. We notice that under the assumption of having the views
in a neighborhood on the viewing sphere, the cameras can be approximated as
being parallel, enabling a simple linear interpolation scheme (Fig. 3). Next we
show that by combining appearance and geometrical morphing it is possible to
synthesize a novel view (meant as a collection of parts along with their linkage)
from two or more canonical views.
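As a toy illustration of the parallel-view approximation (a hedged sketch of the general idea, not the paper's exact formulation), both geometry and appearance can be interpolated linearly between two corresponding parts:

```python
import numpy as np

def linear_morph(B_a, B_b, h_a, h_b, s):
    """Interpolate a part seen in two (approximately parallel) views.
    B_a, B_b: 2x4 corner quadrangles; h_a, h_b: codeword histograms;
    s in [0, 1] selects a point along the camera trajectory."""
    B_s = (1.0 - s) * B_a + s * B_b   # geometric morphing of the corners
    h_s = (1.0 - s) * h_a + s * h_b   # appearance morphing of the histograms
    return B_s, h_s
```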
Two-View Synthesis. We start with the simpler case of synthesizing from two canonical views V^n and V^m. A synthesized view V^s can be expressed as a collection of linked parts morphed from the corresponding canonical parts belonging to V^n and V^m. Specifically, a pair of linked parts {P_i^s, P_j^s} ∈ V^s can be synthesized from the pair {P_i^n ∈ V^n, P_j^m ∈ V^m} if and only if P_i^n and P_j^m are linked by the homographic transformation A_{ij} ≠ I (Fig. 3). If we represent {P_i^s, P_j^s} by the quadrangles {B_i^s, B_j^s} and the histograms {h_i^s, h_j^s} respectively, a new view is expressed by:
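The equation itself is truncated in this copy of the text. As a hedged reconstruction of the idea (an assumption built from the linkage definitions and the parallel-view interpolation above, not the paper's verbatim formula), the part's quadrangle can be mapped into the other view through (A_{ij}, t_{ij}) and then interpolated:

```python
import numpy as np

def synthesize_part(B_i_n, h_i_n, h_j_m, A_ij, t_ij, s):
    """Hypothetical sketch of two-view part synthesis (s in [0, 1]).
    B_i_n: 2x4 corners of P_i^n in canonical view V^n.
    A_ij, t_ij: linkage from P_i^n to the linked part P_j^m in V^m."""
    # Where the quadrangle of P_i^n would fall when seen from V^m.
    B_i_m = A_ij @ B_i_n + np.asarray(t_ij, dtype=float).reshape(2, 1)
    # Geometric morphing between the two projections of the part.
    B_s = (1.0 - s) * B_i_n + s * B_i_m
    # Appearance morphing between the linked parts' histograms.
    h_s = (1.0 - s) * np.asarray(h_i_n) + s * np.asarray(h_j_m)
    return B_s, h_s
```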

References
- P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. CVPR, 2001.
- D. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV, 1999.
- R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
- G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In ECCV Workshop on Statistical Learning in Computer Vision, 2004.