Proceedings ArticleDOI

Parsing IKEA Objects: Fine Pose Estimation

TL;DR: This work addresses the problem of localizing and estimating the fine-pose of objects in the image with exact 3D models by using local keypoint detectors to find candidate poses and score global alignment of each candidate pose to the image.
Abstract: We address the problem of localizing and estimating the fine-pose of objects in the image with exact 3D models. Our main focus is to unify contributions from the 1970s with recent advances in object detection: use local keypoint detectors to find candidate poses and score global alignment of each candidate pose to the image. Moreover, we also provide a new dataset containing fine-aligned objects with their exactly matched 3D models, and a set of models for widely used objects. We also evaluate our algorithm both on object detection and fine pose estimation, and show that our method outperforms state-of-the-art algorithms.

Summary (3 min read)

1. Introduction

  • The authors can find surprisingly accurate 3D models of IKEA furniture, such as billy bookcase and ektorp sofa, created by IKEA fans from Google 3D Warehouse and other publicly available databases.
  • Detecting 3D objects in images and estimating their pose was a popular topic in the early days of computer vision [14] and has gained a renewed interest in the last few years.
  • Most of the approaches were designed to work at instance level detection, as it was expected that the 3D model would accurately fit the object in the image.
  • There might be large variation in the appearance, as the CAD models do not completely constrain the appearance of the objects placed in the real world.
  • The authors' contributions are threefold: (1) proposing a detection problem that has some of the challenges of category level detection while allowing for an accurate representation of the object pose in the image.

2.1. Our Model

  • The core of their algorithm is to define the final score based on multiple sources of information such as local correspondence, geometric distance, and global alignment.
  • The authors will use this linearity later to learn their discriminative classifier.

2.2. Local correspondence error

  • The goal here is to measure the local correspondences.
  • Given a projection P , the authors find the local shape-based matching score between the rendered image of the CAD model and the 2D image.
  • Because their CAD model contains only shape information, a simple distance measure on standard local descriptors fails to match the two images. (The authors assume the image is captured by a regular pinhole camera and is not cropped, so the principal point is in the middle of the image.)
  • Moreover, the authors see better performance by training an LDA classifier to discriminate each keypoint from the rest of the points.

2.3. Geometric distance between correspondences

  • The second term measures the alignment error between a 3D interest line l and one of its corresponding 2D lines cl [12].
  • Note that this cost does not penalize displacement along the line.
  • This is important because the end points of detected lines on I are not reliable due to occlusion and noise in the image.

2.4. Global dissimilarity

  • One key improvement of their work compared to traditional works on pose estimation using a 3D model is how the authors measure the global alignment.
  • In other words, the authors would like to measure the alignment between the boundary of their proposed pose P and the texture boundary of I .
  • The authors define an inner boundary region by eroding the proposed object mask and subtracting the result from the mask, and an outer boundary by dilating the proposed object mask.
  • Hence, a large change in these two histograms indicates a large texture pattern change, and it is ideal if the LBP histogram difference along the proposed pose’s boundary is large.
  • The authors used a modified Chamfer distance from Satkin et al. [18].

2.5. Optimization & Learning

  • Hence, the authors first simplify the optimization by quantizing the space of solutions.
  • Because their L(P, c) is ∞ if any local correspondence score is below the threshold, the authors first find all sets of correspondences for which all local correspondences are above the threshold.
  • For the subset of {(P, c)} in the training set, the authors extract the geometric distance and global alignment features, [D(P, c), G(P)], and the binary labels based on the distance from the ground truth (defined in Eq 9).
  • Then, the authors learn the weights using a linear SVM classifier.

3.1. Dataset

  • In order to develop and evaluate fine pose estimation based on 3D models, the authors created a new dataset of images and 3D models representing typical indoor scenes.
  • The authors explicitly collected IKEA 3D models from Google 3D Warehouse, and images from Flickr.
  • For their dataset, the authors provide 800 images and 225 3D models.
  • IKEAobject is the split where 300 images are queried by individual object name (e.g. ikea chair poang and ikea sofa ektorp).
  • Given these correspondences, the authors solve the least-squares problem of Eq 4 using Levenberg-Marquardt.

3.2. Error Metric

  • The authors introduce a new error metric for fine-pose estimation.
  • Intuitively, the authors use the average 3D distance between all points in the ground truth and the proposal.
  • When the distance is small, this is close to the average error in viewing angle for all points.
  • Formally, given an estimated pose $P_e$ and a ground truth pose $P_{gt}$, the score is $\mathrm{score}(P_e, P_{gt}) = \sum_{X_i} \|E_e X_i - E_{gt} X_i\|_2 \,/\, \sum_{X_i} \|E_{gt} X_i\|_2$ (Eq 9).

4.1. Correspondences

  • This is crucial for the rest of their system as each additional poor correspondence grows the search space of RANSAC exponentially.
  • The authors compare Harris corner detectors against their detector based on LDA classifiers.
  • On average, to capture 5 correct correspondences with their method, each interest point has to consider only its top 10 matched candidates.
  • The recall of their full model is upper bounded by that of this set.
  • Figure 7 shows a semi-log plot of recall vs minimum number of top candidates required per image.

4.3. Final Pose Estimation

  • Table 1 shows the evaluation of their method with various sets of features as well as two state-of-the-art object detectors: Deformable part models [4] and Exemplar LDA [8] in IKEAobject database.
  • These baseline detectors have two drawbacks: (1) if there are some mixtures with high false positive rates, then the whole system can break down, and (2) they are trained with rendered images due to their requirement of images for each different mixture.
  • Adding these two features greatly boosts performance.
  • As the table shows, their method does not fluctuate much with a threshold change, whereas both [8] and [4] suffer and their performance drops significantly.
  • Figure 8 shows several detection examples where pose estimation is incorrect, but bounding box estimation is still correct with a threshold of 0.5.

5. Conclusion

  • The authors have introduced a novel problem and model of estimating fine-pose of objects in the image with exact 3D models, combining traditionally used and recently developed techniques.
  • The authors believe their approach can extend further to more generic object classes, and enable the community to try more ambitious goals such as accurate 3D contextual modeling and full 3D room parsing.
  • The authors also thank Phillip Isola and Aditya Khosla for important suggestions and discussion.




Parsing IKEA Objects: Fine Pose Estimation
Joseph J. Lim Hamed Pirsiavash Antonio Torralba
Massachusetts Institute of Technology
Computer Science and Artificial Intelligence Laboratory
{lim,hpirsiav,torralba}@csail.mit.edu
Abstract
We address the problem of localizing and estimating the
fine-pose of objects in the image with exact 3D models. Our
main focus is to unify contributions from the 1970s with re-
cent advances in object detection: use local keypoint de-
tectors to find candidate poses and score global alignment
of each candidate pose to the image. Moreover, we also
provide a new dataset containing fine-aligned objects with
their exactly matched 3D models, and a set of models for
widely used objects. We also evaluate our algorithm both
on object detection and fine pose estimation, and show that
our method outperforms state-of-the-art algorithms.
1. Introduction
Just as a thought experiment imagine that we want to
detect and fit 3D models of IKEA furniture in the images
as shown in Figure 1. We can find surprisingly accurate
3D models of IKEA furniture, such as billy bookcase and
ektorp sofa, created by IKEA fans from Google 3D Ware-
house and other publicly available databases. Therefore,
detecting those models in images could seem to be a very
similar task to an instance detection problem in which we
have training images of the exact instance that we want to
detect. But, it is not exactly like detecting instances. In the
case of typical 3D models (including IKEA models from
Google Warehouse), the exact appearance of each piece is
not available, only the 3D shape is. Moreover, appearance
in real images will vary significantly due to a number of
factors. For instance, IKEA furniture might appear with
different colors and textures, and with geometric deforma-
tions (as people building them might not do it perfectly) and
occlusions (e.g., a chair might have a cushion on top of it).
The problem that we introduce in this paper is detect-
ing and accurately fitting exact 3D models of objects to real
images, as shown in Figure 1. Detecting 3D objects in im-
ages and estimating their pose was a popular topic in the
early days of computer vision [14] and has gained a renewed
interest in the last few years. The traditional approaches
[Figure 1. Our goal in this paper is to detect and estimate the
fine-pose of an object in the image given an exact 3D model.
Panels: 3D Model, Original Image, Fine-pose Estimation.]
[12, 14] were dominated by using accurate geometric rep-
resentations of 3D objects with an emphasis on viewpoint
invariance. Objects would appear in almost any pose and
orientation in the image. Most of the approaches were de-
signed to work at instance level detection as it was expected
that the 3D model would accurately fit the model in the im-
age. Instance level detection regained interest with the in-
troduction of new invariant local descriptors that dramati-
cally improved the detection of interest points [17]. Those
models assumed knowledge about geometry and appear-
ance of the instance. Having access to accurate knowledge
about those two aspects allowed precise detections and pose
estimations of the object on images even in the presence of
clutter and occlusions.
[Figure 2. Local correspondence: for each 3D interest point X_i
(red, green, and blue), we train an LDA patch detector on an edgemap
and use its response as part of our cost function. We compute HOG on
edgemaps to ensure a real image and our model share the modality.
Panels: Image, Edgemap, HOG, local correspondence detection.]

In the last few years, researchers interested in category
level detection have extended 2D constellation models to in-
clude 3D information in the object representation. Many of
these models [9, 20, 10, 6, 5, 19, 16, 7, 21] rely on gradient-
based features [1, 13]. Category level detection requires the
models to be generic and flexible, as they have to deal with
all the variations in shape and appearance of the instances
that belong to the same category. Therefore, the shape rep-
resentations used for those models are coarse (e.g., modeled
as the constellation of a few 3D parts or planes) and the ex-
pected output is at best an approximate fitting of the 3D
model to the image.
In this paper we introduce a detection task that is in the
intersection of these two settings; it is more generic than
detecting instances, but we assume richer models than the
ones typically used in category level detection. In particu-
lar, we assume that accurate CAD models of the objects are
available. Hence, there is little variation on the shape of the
instances that form one category. However, there might be
large variation in the appearance, as the CAD models do not
completely constrain the appearance of the objects placed
in the real world. Although assuming that CAD models are
available might seem to be a restrictive assumption, there
are available 3D CAD models for most man-made artifacts,
used for manufacturing or virtual reality. All those mod-
els could be used as training data for our system. Hence,
we focus on detection and pose estimation of objects in the
wild given their 3D CAD models. Our goal is to provide an
accurate localization of the object, as in the instance level
detection problem, but dealing with some of the variability
that one finds in category level detection.
Our contributions are threefold: (1) Proposing a detec-
tion problem that has some of the challenges of category
level detection while allowing for an accurate representa-
tion of the object pose in the image. This problem will mo-
tivate the development of better 3D object models and the
algorithms needed to find them in images. (2) We propose
a novel solution that unifies contributions from the 1970s
with recent advances in object detection. (3) And we in-
troduce a new dataset of 3D IKEA models obtained from
Google Warehouse and real images containing instances of
IKEA furniture and annotated with ground truth pose.
2. Methods
We now propose the framework that can detect objects
and estimate their poses simultaneously by matching to one
of the 3D models in our database.
2.1. Our Model
The core of our algorithm is to define the final score
based on multiple sources of information such as local
correspondence, geometric distance, and global alignment.
Suppose we are given the image I containing an object for
which we want to estimate the projection matrix P with 9
degrees of freedom.¹ We define our cost function S with
three terms as follows:

$$S(P, c) = L(P, c) + w_D\, D(P, c) + w_G\, G(P) \qquad (1)$$

where c refers to the set of correspondences, L measures
error in local correspondences between the 3D model and
the 2D image, D measures the geometric distance between
correspondences in 3D, and G measures the global dissimilarity
in 2D. Note that we designed our model to be linear in the
weight vectors w_D and w_G. We will use this linearity later to
learn our discriminative classifier.
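To make the linear form of Eq 1 concrete, the following is a minimal Python sketch of how a candidate's score could be assembled. The function and argument names are ours, not the paper's; w_D is treated as a scalar and w_G as a weight vector over the global feature vector of Eq 5.

```python
import numpy as np

def candidate_score(L, D, G, w_D, w_G):
    """Sketch of Eq 1: S(P, c) = L(P, c) + w_D * D(P, c) + w_G . G(P).

    L and D are scalar error terms for one candidate pose; G is the global
    dissimilarity feature vector; w_D (scalar) and w_G (vector) are the
    weights the paper later learns with a linear SVM. Lower is better.
    """
    return L + w_D * D + float(np.dot(w_G, np.asarray(G)))
```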
2.2. Local correspondence error
The goal here is to measure the local correspondences.
Given a projection P , we find the local shape-based match-
ing score between the rendered image of the CAD model
and the 2D image.

¹We assume the image is captured by a regular pinhole camera and is
not cropped, so the principal point is in the middle of the image.

Because our CAD model contains only shape information, a simple
distance measure on standard
local descriptors fails to match the two images. In order
to overcome this modality difference, we compute HOG on
the edgemap of both images since it is more robust to ap-
pearance change but still sensitive to the change in shape.
Moreover, we see better performance by training an LDA
classifier to discriminate each keypoint from the rest of the
points. This is equivalent to a dot product between descrip-
tors in the whitened space. One advantage of using LDA
is its high training speed compared to other previous HOG-
template based approaches. Figure 2 illustrates this step.
More formally, our correspondence error is measured by

$$L(P, c) = \sum_i H\left( w_i^T \phi(x_{c_i}) - \lambda \right) \qquad (2)$$

$$w_i = \Sigma^{-1}\left( \phi(P(X_i)) - \mu \right) \qquad (3)$$

where $w_i$ is the weight learned using LDA based on the
covariance $\Sigma$ and mean $\mu$ obtained from a large external
dataset [8], $\phi(\cdot)$ is a HOG descriptor computed on a 20x20 pixel
edgemap patch of a given point, $x_{c_i}$ is the 2D corresponding
point of 3D point $X_i$, $P(\cdot)$ projects a 3D point to a 2D
coordinate given pose P, and lastly $H(x - \lambda)$ binarizes x
to 0 if $x \ge \lambda$ or to $\infty$ otherwise.
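A minimal sketch of how the LDA-based matching of Eqs 2 and 3 could be implemented, assuming the HOG descriptors, the external covariance/mean statistics of [8], and the threshold lambda are already computed; all names here are illustrative, not from the paper's code.

```python
import numpy as np

def lda_keypoint_weights(phi_rendered, Sigma, mu):
    """Eq 3 (sketch): w_i = Sigma^{-1} (phi(P(X_i)) - mu), the LDA template
    for one 3D interest point, whitened by external HOG statistics [8]."""
    return np.linalg.solve(Sigma, phi_rendered - mu)

def local_correspondence_error(w_i, phi_patch, lam):
    """One term of Eq 2 (sketch): 0 when the whitened matching score clears
    the threshold, infinity otherwise, so a single bad match vetoes the pose."""
    return 0.0 if float(w_i @ phi_patch) >= lam else np.inf
```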
2.3. Geometric distance between correspondences
Given a proposed set of correspondences c between the
CAD model and the image, it is also necessary to measure if
c is a geometrically acceptable pose based on the 3D model.
For the error measure, we use the Euclidean distance between
the projection of $X_i$ and its corresponding 2D point $x_{c_i}$, as
well as the line distance defined in [12] between the 3D line
$l$ and its corresponding 2D line $c_l$:

$$D(P, c) = \sum_{i \in V} \left\| P(X_i) - x_{c_i} \right\|_2 + \sum_{l \in L} \left\| \left[ \cos\theta_{c_l},\ \sin\theta_{c_l},\ -\rho_{c_l} \right] \begin{bmatrix} P(X_{l_1}) & P(X_{l_2}) \\ 1 & 1 \end{bmatrix} \right\|_2 \qquad (4)$$

where $X_{l_1}$ and $X_{l_2}$ are the end points of line $l$, and
$\theta_{c_l}$ and $\rho_{c_l}$ are the polar coordinate parameters of
line $c_l$.

The first term measures a pairwise distance between a 3D
interest point $X_i$ and its 2D corresponding point $x_{c_i}$. The
second term measures the alignment error between a 3D
interest line $l$ and one of its corresponding 2D lines $c_l$ [12].
We use the Hough transform on edges to extract 2D lines
from I. Note that this cost does not penalize displacement
along the line. This is important because the end points
of detected lines on I are not reliable due to occlusion and
noise in the image.
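A sketch of the two terms of Eq 4 under a simplified 3x4 projection matrix; `project` is our stand-in for the paper's 9-degree-of-freedom projection, and the line term uses the polar form reconstructed above, where sliding along the line costs nothing.

```python
import numpy as np

def project(P, X):
    """Simplified pinhole projection: P is a 3x4 matrix, X is (n, 3)."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]

def point_term(P, X3d, x2d):
    """First term of Eq 4: sum_i ||P(X_i) - x_{c_i}||_2."""
    return np.linalg.norm(project(P, X3d) - x2d, axis=1).sum()

def line_term(P, X1, X2, theta, rho):
    """Second term of Eq 4 for one line: perpendicular offsets of the two
    projected endpoints from the 2D line (cos t) x + (sin t) y = rho; only
    the offsets count, not displacement along the line."""
    n = np.array([np.cos(theta), np.sin(theta)])
    ends = project(P, np.vstack([X1, X2]))   # (2, 2) projected endpoints
    return np.linalg.norm(ends @ n - rho)
```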
2.4. Global dissimilarity
One key improvement of our work compared to tradi-
tional works on pose estimation using a 3D model is how
we measure the global alignment. We use recently devel-
oped features to measure the global alignment between our
proposal and the image:
$$G(P) = \left[ f_{\mathrm{HOG}},\ f_{\mathrm{region}},\ f_{\mathrm{edge}},\ f_{\mathrm{corr}},\ f_{\mathrm{texture}} \right] \qquad (5)$$
The description of each feature follows:
HOG-based: While D and L from Eq 1 are designed to
capture fine local alignments, the local points are sparse, and
hence an object-scale alignment measure is still missing. In order
to capture edge alignment per orientation, we compute a
fine HOG descriptor (2x2 per cell) of the edgemaps of I and
the rendered image of pose P. We use similarity measures
based on cosine similarity between the two descriptor vectors:

$$f_{\mathrm{HOG}} = \left[ \frac{\phi(I)^T \phi(P)}{\|\phi(P)\|^2},\ \frac{\phi(I)^T \phi(P)}{\|\phi(I)M\|^2},\ \frac{\phi(I)^T \phi(P)}{\|\phi(I)M\| \, \|\phi(P)\|} \right] \qquad (6)$$

where $\phi(\cdot)$ is a HOG descriptor and M is a mask matrix
for counting how many pixels of P fall into each cell. We
multiply $\phi(I)$ by M to normalize only within the area proposed
by P, without being affected by other areas.
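A sketch of Eq 6 on flattened HOG vectors; the masking by M and the three normalizations follow the reconstruction above, and all names are ours.

```python
import numpy as np

def f_hog(phi_I, phi_P, M):
    """Sketch of Eq 6: three cosine-style similarities between the HOG of
    the image edgemap, phi_I, and of the rendered pose, phi_P. M zeroes the
    cells the proposal does not cover, so the normalization is not diluted
    by background. All inputs are flat numpy arrays of equal length."""
    dot = float(phi_I @ phi_P)
    phi_IM = phi_I * M
    return np.array([
        dot / (phi_P @ phi_P),          # normalized by ||phi(P)||^2
        dot / (phi_IM @ phi_IM),        # normalized by ||phi(I)M||^2
        dot / (np.linalg.norm(phi_IM) * np.linalg.norm(phi_P)),  # cosine
    ])
```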
Regions: We add another alignment feature based on superpixels.
If our proposal P is a reasonable candidate, superpixels
should not cross over the object boundary, except in
heavily occluded areas. To measure this spillover, we first
extract superpixels $R_I$ from I using [3] and compute a
ratio between the area of a proposed pose, $|R_P|$, and the area
of regions that have some overlap with $R_P$. This ratio
captures the spillover effect, as shown in Figure 3.
Figure 3. Region feature: one feature to measure a fine alignment
is the ratio between the areas of a proposed pose (yellow) and the
regions overlapped with the proposed pose. (a) is the original image
I; (b) shows an example to encourage, while (c) shows an example
to penalize. [Panels: (a) Original image, (b) Candidate 1,
(c) Candidate 2.]
$$f_{\mathrm{region}} = \frac{|R_P|}{\sum_{R_I :\, |R_P \cap R_I| > 0.1 |R_P|} |R_I|} \qquad (7)$$
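A sketch of the region feature as the text describes it (see Figure 3); the 10% overlap test reflects our reading of the garbled print of Eq 7.

```python
import numpy as np

def f_region(pose_mask, superpixels):
    """Region feature (sketch of Eq 7): ratio of the proposed pose's area to
    the total area of superpixels that substantially overlap it. pose_mask
    is boolean; superpixels is an integer label map of the same shape.
    Values near 1 mean superpixels respect the proposed boundary; small
    values mean heavy spillover (Figure 3c)."""
    area_P = int(pose_mask.sum())
    overlap_area = 0
    for label in np.unique(superpixels):
        region = superpixels == label
        if (region & pose_mask).sum() > 0.1 * area_P:
            overlap_area += int(region.sum())
    return area_P / overlap_area if overlap_area else 0.0
```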
Texture boundary: The goal of this feature is to capture
appearance by measuring how well our proposed pose sep-
arates object boundary. In other words, we would like to

measure the alignment between the boundary of our pro-
posed pose P and the texture boundary of I. For this pur-
pose, we use a well-known texture classification feature,
Local Binary Pattern (LBP) [15].
We compute histograms of LBP on the inner boundary and
outer boundary of proposed pose P. We define an inner
boundary region by eroding the proposed object mask and
subtracting the result from the mask, and an outer boundary
by dilating the proposed object mask and subtracting the
mask. Essentially, the histograms will encode the texture
patterns of near-inner/outer pixels along the proposed pose
P's boundary. Hence, a large change between these two
histograms indicates a large texture pattern change, and it
is ideal if the LBP histogram difference along the proposed
pose's boundary is large. This feature will discourage the
object boundary from aligning with contours with small
texture change (such as contours within an object or
contours due to illumination).
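A sketch of the texture-boundary cue using SciPy morphology and scikit-image's LBP; the band width, LBP parameters, and the L1 histogram distance are our assumptions, since the paper does not pin them down here.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion
from skimage.feature import local_binary_pattern

def f_texture(gray, mask, n_points=8, radius=1, band=5):
    """Texture boundary (sketch): LBP histograms on thin bands just inside
    and just outside the proposed object mask; a large difference suggests
    a true texture boundary. gray is a 2D image, mask a boolean pose mask."""
    inner = mask & ~binary_erosion(mask, iterations=band)
    outer = binary_dilation(mask, iterations=band) & ~mask
    lbp = local_binary_pattern(gray, n_points, radius, method="uniform")
    bins = np.arange(n_points + 3)            # "uniform" has P + 2 labels
    h_in, _ = np.histogram(lbp[inner], bins=bins, density=True)
    h_out, _ = np.histogram(lbp[outer], bins=bins, density=True)
    return np.abs(h_in - h_out).sum()         # L1 histogram difference
```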
Edges: We extract edges [11] from the image to measure
their alignment with the edgemap of estimated pose. We
used a modified Chamfer distance from Satkin, et. al. [18].
$$f_{\mathrm{edge}} = \left[ \frac{1}{|R|} \sum_{a \in R} \min\left( \min_{b \in I} \|a - b\|,\ \gamma \right),\ \frac{1}{|I|} \sum_{b \in I} \min\left( \min_{a \in R} \|b - a\|,\ \gamma \right) \right] \qquad (8)$$

where we use $\gamma \in \{10, 25, 50, \infty\}$ to control the influence
of outlier edges.
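Eq 8's truncated, symmetric Chamfer distances can be computed with distance transforms; a sketch assuming boolean edge maps, with gamma as reconstructed above.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def f_edge(edges_render, edges_image, gammas=(10, 25, 50, np.inf)):
    """Sketch of Eq 8: for each truncation gamma, the mean distance from
    rendered edge pixels R to the nearest image edge pixel, and vice versa.
    Inputs are boolean edge masks of the same shape."""
    # distance_transform_edt measures distance to the nearest zero, so
    # invert the masks to get distance-to-nearest-edge maps.
    d_to_image = distance_transform_edt(~edges_image)
    d_to_render = distance_transform_edt(~edges_render)
    feats = []
    for g in gammas:
        feats.append(np.minimum(d_to_image[edges_render], g).mean())
        feats.append(np.minimum(d_to_render[edges_image], g).mean())
    return np.array(feats)
```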
Number of correspondences: $f_{\mathrm{corr}}$ is a binary vector,
where the i-th dimension indicates if there are more than i
good correspondences between the 3D model and the 2D
image under pose P. Good correspondences are the ones
with local correspondence error (in Eq 2) below a threshold.
2.5. Optimization & Learning
Algorithm 1 Pose set {P} search
For each initial seed pose,
while fewer than n different candidates are found do
    Choose a random 3D interest point and its 2D correspondence (this determines a displacement)
    for i = 1 to 5 do
        Choose a random local correspondence agreeing with the correspondences selected (over the previous i - 1 iterations)
    end for
    Estimate parameters by solving least squares
    Find all correspondences agreeing with this solution
    Estimate parameters using all correspondences
end while
Our cost function S(P, c) from Eq 1 with G(P) is a non-convex
function and is not easy to solve directly. Hence,
we first simplify the optimization by quantizing the space
of solutions. Because our L(P, c) is ∞ if any local correspondence
score is below the threshold, we first find all sets
of correspondences for which all local correspondences are
above the threshold. Then, we find the pose P that minimizes
$L(P, c) + w_D D(P, c)$. Finally, we optimize the cost
function in the discrete space by evaluating all candidates.

We use RANSAC to populate a set of candidates by
optimizing L(P, c). Our RANSAC procedure is shown in
Alg 1. We then minimize $L(P, c) + w_D D(P, c)$ by estimating
pose P for each found correspondence set c. Given
a set of correspondences c, we estimate pose P using the
Levenberg-Marquardt algorithm, minimizing D(P, c).
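A loose sketch of the candidate search of Alg 1; the callbacks are hypothetical stand-ins for the paper's components (the least-squares/Levenberg-Marquardt fit and the two agreement tests), so this shows the control flow rather than the exact procedure.

```python
import numpy as np

def pose_candidates(matches, n, solve_pose, consistent, supports):
    """Alg 1 (sketch), for one seed pose: grow a minimal set of agreeing
    correspondences, fit a pose, then refit on its full consensus set.
    `matches` is a list of (3D interest point, 2D detection) hypotheses."""
    rng = np.random.default_rng(0)
    candidates = []
    while len(candidates) < n:
        sample = [matches[rng.integers(len(matches))]]  # random seed match
        for _ in range(5):                              # grow a minimal set
            pool = [m for m in matches if consistent(sample, m)]
            if not pool:
                break
            sample.append(pool[rng.integers(len(pool))])
        pose = solve_pose(sample)                       # minimal-set estimate
        inliers = [m for m in matches if supports(pose, m)]
        candidates.append(solve_pose(inliers))          # refit on consensus
    return candidates
```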
We again leverage the discretized space from Alg 1 in
order to learn the weights for Eq 1. For the subset of {(P, c)}
in the training set, we extract the geometric distance and
global alignment features, $[D(P, c),\ G(P)]$, and the binary
labels based on the distance from the ground truth (defined
in Eq 9). Then, we learn the weights using a linear SVM
classifier.
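The learning step reduces to standard supervised training; a sketch with scikit-learn, where the feature layout and the C value are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def learn_weights(D_feats, G_feats, labels):
    """Sketch of the weight learning: one row per candidate (P, c), features
    [D(P, c), G(P)], and binary labels from the Eq 9 distance to ground
    truth. The linear SVM's coefficients act as w_D and w_G in Eq 1.
    D_feats and G_feats are 2D arrays with matching row counts."""
    X = np.hstack([D_feats, G_feats])
    return LinearSVC(C=1.0).fit(X, np.asarray(labels)).coef_.ravel()
```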
3. Evaluation
3.1. Dataset
Figure 4. Labeling tool: our labeling tool enables users to browse
through 3D models and label point correspondences to an image.
The tool provides feedback by rendering the estimated pose on the
image, and a user can edit more correspondences.
In order to develop and evaluate fine pose estimation
based on 3D models, we created a new dataset of images
and 3D models representing typical indoor scenes. We ex-
plicitly collected IKEA 3D models from Google 3D Ware-
house, and images from Flickr. The key difference of this
dataset from previous works [18, 20] is that we align ex-
act 3D models with each image, whereas others provided
coarse pose information without using exact 3D models.
For our dataset, we provide 800 images and 225 3D
models. All 800 images are fully annotated with 90 different
models.

[Figure 5. Dataset: (a) examples of 3D models we collected from
Google Warehouse (IKEA object, IKEA room), and (b) ground truth
images where objects are aligned with 3D models using our labeling
tool. For clarity, we show only one object per image when there are
multiple objects.]

Also, we separate the data into two different
splits: IKEAobject and IKEAroom. IKEAobject is
the split where 300 images are queried by individual object
name (e.g. ikea chair poang and ikea sofa ektorp). Hence, it
tends to contain only a few objects at relatively large scales.
IKEAroom is the split where 500 images are queried by
ikea room and ikea home; it contains more complex scenes
where multiple objects appear at a smaller scale. Figures 5a
and 5b show examples of our 3D models and annotated images.
For alignment, we created an online tool that allows a
user to browse through models and label point correspon-
dences (usually 5 are sufficient), and check the model’s esti-
mated pose as the user labels. Given these correspondences,
we solve the least-squares problem of Eq 4 using Levenberg-
Marquardt. Here, we obtain the full intrinsic/extrinsic pa-
rameters except the skewness and the principal point. Figure 4
shows a screenshot of our tool.
3.2. Error Metric
We introduce a new error metric for fine-pose estima-
tion. Intuitively, we use the average 3D distance between
all points in the ground truth and the proposal. When the
distance is small, this is close to the average error in view-
ing angle for all points. Formally, given an estimated pose
$P_e$ and a ground truth pose $P_{gt}$ of image I, we obtain
corresponding 3D points in the camera space. Then, we compute
the average pairwise distance between all corresponding
points divided by their distance to the camera. We consider
$P_e$ correct if this average value is less than a threshold.

$$\mathrm{score}(P_e, P_{gt}) = \frac{\sum_{X_i} \|E_e X_i - E_{gt} X_i\|_2}{\sum_{X_i} \|E_{gt} X_i\|_2} \qquad (9)$$
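A direct sketch of Eq 9, assuming E_e and E_gt are the 3x4 extrinsic matrices mapping model points into camera space.

```python
import numpy as np

def pose_error(E_e, E_gt, X):
    """Sketch of Eq 9: total displacement of the model's 3D points under the
    estimated vs. ground-truth extrinsics, normalized by the ground-truth
    points' distance to the camera. X is (n, 3) model points."""
    Xh = np.hstack([X, np.ones((len(X), 1))])   # homogeneous model points
    pe, pgt = Xh @ E_e.T, Xh @ E_gt.T           # points in camera space
    return (np.linalg.norm(pe - pgt, axis=1).sum()
            / np.linalg.norm(pgt, axis=1).sum())
```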
4. Results
4.1. Correspondences
First of all, we evaluate our algorithm on finding good
correspondences between a 3D model and an image. This
is crucial for the rest of our system as each additional poor
correspondence grows the search space of RANSAC expo-
nentially.
[Figure 6. Correspondence evaluation: a plot of detected keypoints
vs. the number of point detections per detector, comparing our
interest point detector and the Harris detector. The minimum number
of interest points we need for reliable pose estimation is 5. Ours
can recall 5 correct correspondences by considering only the top 10
detections per 3D interest point, whereas the Harris detector
requires 100 per point. This results in effectively 10^5 times fewer
search iterations in RANSAC.]

Citations


Journal ArticleDOI
TL;DR: What is now the de-facto standard formulation for SLAM is presented, covering a broad set of topics including robustness and scalability in long-term mapping, metric and semantic representations for mapping, theoretical performance guarantees, active SLAM and exploration, and other new frontiers.
Abstract: Simultaneous Localization and Mapping (SLAM)consists in the concurrent construction of a model of the environment (the map), and the estimation of the state of the robot moving within it. The SLAM community has made astonishing progress over the last 30 years, enabling large-scale real-world applications, and witnessing a steady transition of this technology to industry. We survey the current state of SLAM. We start by presenting what is now the de-facto standard formulation for SLAM. We then review related work, covering a broad set of topics including robustness and scalability in long-term mapping, metric and semantic representations for mapping, theoretical performance guarantees, active SLAM and exploration, and other new frontiers. This paper simultaneously serves as a position paper and tutorial to those who are users of SLAM. By looking at the published research with a critical eye, we delineate open challenges and new research issues, that still deserve careful scientific investigation. The paper also contains the authors' take on two questions that often animate discussions during robotics conferences: Do robots need SLAM? and Is SLAM solved?

1,828 citations


Cites background from "Parsing IKEA Objects: Fine Pose Est..."

  • ...sentations, which define a solid as a combination of atoms in a dictionary, have been considered in robotics and computer vision, with dictionary learned from data [266] or based on existing repositories of object models [149], [157]....

    [...]

Posted Content
TL;DR: Wu et al. proposed a 3D Generative Adversarial Network (3D-GAN), which generates 3D objects from a probabilistic space by leveraging recent advances in volumetric convolutional networks and generative adversarial nets.
Abstract: We study the problem of 3D object generation. We propose a novel framework, namely 3D Generative Adversarial Network (3D-GAN), which generates 3D objects from a probabilistic space by leveraging recent advances in volumetric convolutional networks and generative adversarial nets. The benefits of our model are three-fold: first, the use of an adversarial criterion, instead of traditional heuristic criteria, enables the generator to capture object structure implicitly and to synthesize high-quality 3D objects; second, the generator establishes a mapping from a low-dimensional probabilistic space to the space of 3D objects, so that we can sample objects without a reference image or CAD models, and explore the 3D object manifold; third, the adversarial discriminator provides a powerful 3D shape descriptor which, learned without supervision, has wide applications in 3D object recognition. Experiments demonstrate that our method generates high-quality 3D objects, and our unsupervisedly learned features achieve impressive performance on 3D object recognition, comparable with those of supervised learning methods.

886 citations

Proceedings ArticleDOI
24 Mar 2014
TL;DR: PASCAL3D+ dataset is contributed, which is a novel and challenging dataset for 3D object detection and pose estimation, and on average there are more than 3,000 object instances per category.
Abstract: 3D object detection and pose estimation methods have become popular in recent years since they can handle ambiguities in 2D images and also provide a richer description for objects compared to 2D object detectors. However, most of the datasets for 3D recognition are limited to a small amount of images per category or are captured in controlled environments. In this paper, we contribute PASCAL3D+ dataset, which is a novel and challenging dataset for 3D object detection and pose estimation. PASCAL3D+ augments 12 rigid categories of the PASCAL VOC 2012 [4] with 3D annotations. Furthermore, more images are added for each category from ImageNet [3]. PASCAL3D+ images exhibit much more variability compared to the existing 3D datasets, and on average there are more than 3,000 object instances per category. We believe this dataset will provide a rich testbed to study 3D detection and pose estimation and will help to significantly push forward research in this area. We provide the results of variations of DPM [6] on our new dataset for object detection and viewpoint estimation in different scenarios, which can be used as baselines for the community. Our benchmark is available online at http://cvgl.stanford.edu/projects/pascal3d

853 citations


Cites background from "Parsing IKEA Objects: Fine Pose Est..."

  • ...PASCAL3D+ (ours) ETH-80 [13] [26] 3DObject [22] EPFL Car [20] [27] KITTI [8] NYU Depth [24] NYC3DCars [19] IKEA [15]...

    [...]

  • ...[15] provides dense 3D annotations for some of the IKEA objects....

    [...]

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations

Journal ArticleDOI
TL;DR: A generalized gray-scale and rotation invariant operator presentation that allows for detecting the "uniform" patterns for any quantization of the angular space and for any spatial resolution and presents a method for combining multiple operators for multiresolution analysis.
Abstract: Presents a theoretically very simple, yet efficient, multiresolution approach to gray-scale and rotation invariant texture classification based on local binary patterns and nonparametric discrimination of sample and prototype distributions. The method is based on recognizing that certain local binary patterns, termed "uniform," are fundamental properties of local image texture and their occurrence histogram is proven to be a very powerful texture feature. We derive a generalized gray-scale and rotation invariant operator presentation that allows for detecting the "uniform" patterns for any quantization of the angular space and for any spatial resolution and presents a method for combining multiple operators for multiresolution analysis. The proposed approach is very robust in terms of gray-scale variations since the operator is, by definition, invariant against any monotonic transformation of the gray scale. Another advantage is computational simplicity as the operator can be realized with a few operations in a small neighborhood and a lookup table. Experimental results demonstrate that good discrimination can be achieved with the occurrence statistics of simple rotation invariant local binary patterns.

14,245 citations

Journal ArticleDOI
TL;DR: A review of the Pascal Visual Object Classes challenge from 2008-2012 and an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
Abstract: The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008---2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community's progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.

6,061 citations

Journal ArticleDOI
TL;DR: An efficient segmentation algorithm is developed based on a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image and it is shown that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties.
Abstract: This paper addresses the problem of segmenting an image into regions. We define a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image. We then develop an efficient segmentation algorithm based on this predicate, and show that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties. We apply the algorithm to image segmentation using two different kinds of local neighborhoods in constructing the graph, and illustrate the results with both real and synthetic images. The algorithm runs in time nearly linear in the number of graph edges and is also fast in practice. An important characteristic of the method is its ability to preserve detail in low-variability image regions while ignoring detail in high-variability regions.

5,791 citations

Frequently Asked Questions (1)
Q1. What have the authors contributed in "Parsing IKEA Objects: Fine Pose Estimation"?

The authors provide a new dataset containing fine-aligned objects with their exactly matched 3D models, and a set of models for widely used objects. They also evaluate their algorithm both on object detection and fine pose estimation, and show that their method outperforms state-of-the-art algorithms.