Proceedings ArticleDOI

3D object recognition using spin-images for a humanoid stereoscopic vision system

01 Oct 2006-pp 2955-2960

TL;DR: A method for quickly computing multi-resolution and interpolating spin-images for a humanoid robot having a stereoscopic vision system and the results on simulation and on real data show the effectiveness of this method.

Summary (2 min read)

Introduction

  • Moreover if the information is precise enough, it can also be used for grasping behaviour.
  • Recent works on 3D object model building make possible a description based on geometrical features.
  • This behaviour consists of two major steps: first, building an internal representation of an object unknown to the robot; second, finding this object in an unknown environment.

B. Normal computation

  • When computing spin-images, the normal computation should be as insensitive to noise as possible.
  • This is especially important for vision-based information, where the noise might be significant.
  • Using the Stanford Bunny model, and adding Gaussian noise of 20 percent of the average adjacent edge length, the most stable method found was the gravity centre of the polygons formed by the neighbours of each point.

C. Spin-image filling

  • Regarding the spin-image filling, Johnson proposes two ways: either a direct accumulation, or a bilinear interpolation.
  • The direct accumulation makes the spin-image sensitive to noise.
  • To solve this problem, a bilinear interpolation smooths the effect of noise by sharing the density information among the 4 points connected to the surface.
  • One of the most important features needed in their case is the possibility to perceive the object at different distances, and thus at different resolutions.

A. Computing resolution of an object

  • The resolution of the perceived object depends upon the stereoscopic system capabilities, the distance between the robot and the object, and any sub-sampling scheme used during image processing.
  • A reconstruction error may also be induced by the segmentation used to match two points in the right and the left images, in their case a correlation.
  • The 3D uncertainty volumes perceived by the stereo sensor are the intersection of the cones representing the pixel surfaces on the image planes.
  • They can also be interpreted as the location error of a 3D point. [8] and [9] proposed an ellipsoid-based approximation of this volume, while [10] proposed a guaranteed bounding box using interval analysis.
  • Thus, in order to extract a global resolution from the scene, the average edge length L_scene is also used.

B. Multi-resolution signature

  • The dyadic scheme consists in dividing each dimension of the spin-image by 2 between two resolutions.
  • The main question is how to share the information carried by the points which will disappear.
  • One can notice that the same quadrant may have several notations depending on the reference point used.
  • It should be stressed that in their current implementation, only the spin-images are subjected to a multi-resolution scheme.

A. Selection of the best resolution

  • From section III, the object resolution is the average edge length in the scene.
  • Then the resolution for the model’s spin-images is chosen according to Eq. 1.
  • Two spin-images (p,q) with the same resolution are compared using the normalised correlation function proposed in [3].
  • Thus, during the multi-resolution phase, the spin-images are not normalised.

B. Rigid transformation evaluation

  • The main rigid transformation is obtained as follows:
  • Some points are randomly selected in the scene.
  • Their corresponding points in the model are searched by comparing their spin-image to all the model’s spin-images as depicted in Fig.
  • If e is the real rigid transformation, then it should project the maximum number of points from the scene to the model.

C. Final correlation coefficient

  • In order to verify the main rigid transformation, points of the model are chosen randomly and verified against the scene using the proposed main rigid transformation.
  • The main correlation coefficient is the average of the 80% best correlation coefficients.
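The verification step above can be sketched in a few lines; this is a minimal illustration of "average of the 80% best coefficients", where the function name and interface are ours, not the paper's:

```python
def main_correlation_coefficient(coeffs, keep=0.8):
    """Average of the best `keep` fraction of per-point correlation
    coefficients (the paper uses the 80% best)."""
    ranked = sorted(coeffs, reverse=True)   # best coefficients first
    n = max(1, int(len(ranked) * keep))     # how many of them to keep
    return sum(ranked[:n]) / n
```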

A. Simulation

  • The previously described algorithm was tested on different situations to check its efficiency.
  • First, a Stanford Bunny spin-image was tested against a spin-image of the dinosaur represented in Fig.
  • The third case intends to simulate a single view of the complete 3D model, and the subsequent self-occlusion as shown in Fig. 10.
  • The associated rigid transformation has no rotation and no translation.
  • The resulting main correlation coefficient was 0.22.

B. OpenHRP[11] simulator

  • The HRP-2 humanoid robot is simulated inside a house environment.
  • The goal of this simulation was to try to cope with different objects present in the scene.
  • The Stanford Bunny is above a table, behind chairs, and several objects are present in the background, as depicted in Fig. 11 and Fig.
  • Using the previously described scheme, the model is found with a correlation coefficient close to 0.99.
  • One can conclude that the other objects in the scene do not decrease the efficiency of the search.

C. Real data

  • The HRP-2 humanoid robot is equipped with a trinoptic vision system.
  • Using a correlation method to match points between the left image and the right image, clouds of 3D points are computed using epipolar geometry.
  • The object used for this test is a box of cookies depicted in Fig. 12.(b).

D. Computation time

  • To build the Stanford Bunny model, it takes 6 minutes and 24 seconds for 34834 points.
  • The recognition process takes 32 seconds for a scene, using 100 spin images to compute the rigid transformation.
  • Two kinds of improvement are possible: using a compression scheme such as the Principal Component Analysis as proposed in [3], or a Wavelet based approach such as WaveMesh [13].


HAL Id: hal-01117854
https://hal.archives-ouvertes.fr/hal-01117854
Submitted on 18 Feb 2015
3D object recognition using spin-images for a humanoid
stereoscopic vision system
Olivier Stasse, Sylvain Dupitier, Kazuhito Yokoi
To cite this version:
Olivier Stasse, Sylvain Dupitier, Kazuhito Yokoi. 3D object recognition using spin-images for a humanoid stereoscopic vision system. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2006, Beijing, China. pp. 2955-2960, 10.1109/IROS.2006.282151. hal-01117854

3D object recognition using spin-images for a humanoid
stereoscopic vision system.
Olivier Stasse, Sylvain Dupitier and Kazuhito Yokoi
AIST/IS-CNRS/STIC Joint Japanese-French Robotics Laboratory (JRL)
Intelligent Systems Research Institute (IS),
National Institute of Advanced Industrial Science and Technology (AIST)
AIST Central 2, Umezono 1-1-1, Tsukuba, Ibaraki, 305-8568 Japan
{olivier.stasse,kazuhito.yokoi}@aist.go.jp
Abstract— This paper presents a 3D object recognition method based on spin-images for a humanoid robot having a stereoscopic vision system. Spin-images have been proposed to search CAD model databases, and use 3D range information. In this context, the use of a vision system is taken into account through a multi-resolution approach. A method for quickly computing multi-resolution and interpolated spin-images is proposed. Results on simulation and on real data are given, and show the effectiveness of this method.
Index Terms— Spin-images, multi-resolution, 3D recognition, humanoid robot.
I. INTRODUCTION
Efficient real-time tracking exists for collections of 2D views [1] [2]. However, in a humanoid context, 3D geometrical information is important because the high redundancy of such robots allows several kinds of 3D postures. Moreover, if the information is precise enough, it can also be used for grasping behaviour. Recent works on 3D object model building make possible a description based on geometrical features. Towards the design of a search engine for databases of CAD models, several 3D descriptors have been proposed to build signatures of 3D objects [3], [4], [5]. The recognition process proposed here is based on spin-images, proposed initially by [3]. The main difference between the conventional work and this one lies in the targeted application and a search scheme based on multi-resolution spin-images. Moreover, the computation of the multi-resolution scheme is refined and allows a fast implementation.
The targeted application is a “Treasure hunting” behaviour on an HRP-2 humanoid robot [6]. This behaviour consists of two major steps: first, building an internal representation of an object unknown to the robot; second, finding this object in an unknown environment. This behaviour is useful for a robot used in an industrial environment, or as an aid for elderly people. It may incrementally build its knowledge of its surrounding environment and of the objects it has to manipulate without any a priori models. The time constraint is crucial, as a reasonable limit has to be set on the time an end user can wait for the robot to achieve its mission. Finally, to cope with the widest set of objects, the method should rely on a limited set of assumptions.
The remainder of this paper is as follows: in section II the computation of spin-images is introduced, section III details how the multi-resolution signature of objects is computed, section IV details the search process, and finally section V presents the simulation and the experiments realized with the presented algorithm.
Fig. 1. Example of spin-image computation: a point P on a 3D mesh with its axes α and β, and the resulting discrete spin-image.
II. SPIN IMAGES
A. Description
A spin-image can be seen as an image representing the distribution of the object’s density viewed from a particular point [3]. More precisely, it is assumed that all the 3D data are given as a mesh Mesh = (V, E), where V are the vertices and E the edges. Let’s consider a vertex P ∈ V. The spin-image axes are the normal at the point P and a vector perpendicular to this normal; the former is called β, and the latter α. The support region of a spin-image is a cylinder centred on P and aligned with its normal. From this, each point of the model is assigned to a ring with a height along β and a radius along α. An example of spin-images for a dinosaur model is given in Fig. 1.
There are two parameters of importance when using spin-images: the size of the rings (δα, δβ), and the boundaries of the spin-image (α_max, β_max). The size of the rings is similar to a resolution parameter. The limitation (α_max, β_max) makes it possible to impose constraints between the point P chosen for computing the spin-image and the other points P′ of the model. This is particularly meaningful to take occlusion problems into account. In our implementation, two points should have less than 90 degrees between their normals; a greater value would imply that P′ is occluded by some other points while P is facing the camera.
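The (α, β) coordinates underlying this description can be sketched as follows; this is a minimal NumPy illustration (the function name and interface are ours, not the paper’s):

```python
import numpy as np

def spin_coordinates(P, n, X):
    """Project a model point X into the spin-image frame of the oriented
    point (P, n): beta is the signed height along the normal n, and
    alpha the radial distance to the normal axis."""
    n = n / np.linalg.norm(n)
    d = X - P
    beta = float(np.dot(d, n))                   # height along the normal
    alpha = float(np.linalg.norm(d - beta * n))  # radius around the axis
    return alpha, beta
```

Discretising (α, β) with bin sizes (δα, δβ) and clipping at (α_max, β_max) then yields the spin-image cell of X.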
B. Normal computation
When computing spin-images, the normal computation should be as insensitive to noise as possible. This is especially important for vision-based information, where the noise might be significant. Following the tests done in [7], the following methods have been tested: gravity centre of the polygons formed by the neighbours of each point; inertia matrix; normal average of each face; normal average of faces formed by neighbour points only; normal average weighted by angle; normal average weighted by sine and edge length reciprocal; normal average weighted by areas of adjacent triangles; normal average weighted by edge length reciprocals; normal average weighted by square roots of edge length reciprocals. Using the Stanford Bunny model, and adding Gaussian noise of 20 percent of the average adjacent edge length, the most stable method found was the gravity centre of the polygons formed by the neighbours of each point.
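One plausible reading of the retained estimator (the normal of the polygon formed by a vertex’s one-ring neighbours, fanned around its gravity centre) can be sketched as follows; this is our interpretation, not the paper’s code:

```python
import numpy as np

def vertex_normal_from_neighbour_polygon(neighbours):
    """Estimate a vertex normal from the polygon of its one-ring
    neighbours, fanned around the polygon's gravity centre
    (one plausible reading of the estimator retained in the paper)."""
    pts = np.asarray(neighbours, dtype=float)
    centre = pts.mean(axis=0)            # gravity centre of the neighbour polygon
    n = np.zeros(3)
    for i in range(len(pts)):
        u = pts[i] - centre
        v = pts[(i + 1) % len(pts)] - centre
        n += np.cross(u, v)              # area-weighted normal of each fan triangle
    norm = np.linalg.norm(n)
    return n / norm if norm > 0 else n
```

Because the contributions are area-weighted, a single noisy neighbour perturbs the estimate less than a naive average of face normals would.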
Fig. 2. Two ways to fill a spin-image: (a) direct filling, where the projected point M = (α, β) is accumulated entirely into the corner (α_i, β_j) of its cell; (b) bilinear interpolation, where its contribution is shared among the four corners with weights (1−a)(1−b), a(1−b), (1−a)b and ab.
C. Spin-image filling
Regarding the spin-image filling, Johnson proposes two ways: either a direct accumulation, or a bilinear interpolation. Those two methods are depicted in Fig. 2. M is the projection of a point P′ ∈ V. The first solution relates M = (α, β), lying in the cell (α_i, β_j)-(α_{i+1}, β_j)-(α_{i+1}, β_{j+1})-(α_i, β_{j+1}), to the point (α_i, β_j) regardless of its position in the cell. This makes the spin-image sensitive to noise: if M is close to a boundary, a small displacement involves an important discrete modification. To solve this problem, a bilinear interpolation smooths the effect of noise by sharing the density information among the 4 points connected to the cell. This is achieved by computing the distance of M to those 4 points, using two parameters (a, b) as depicted in Fig. 2. If the points are processed iteratively in the order {0, 1, ..., k, k+1, ..., |V|−1}, then the densities are updated as follows:

W_{i,j}(k+1) = W_{i,j}(k) + (1−a)(1−b)
W_{i+1,j}(k+1) = W_{i+1,j}(k) + a(1−b)
W_{i,j+1}(k+1) = W_{i,j+1}(k) + (1−a)b
W_{i+1,j+1}(k+1) = W_{i+1,j+1}(k) + ab

where a = (α − α_i)/δα and b = (β − β_j)/δβ. It is straightforward to check that for a point M the sum of the four contributions is one. In the remainder of this paper, for the sake of clarity, the iteration number is implicit.
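The bilinear update above can be sketched as an accumulation loop; this is an illustrative NumPy version assuming the (α, β) coordinates are already in the spin-image frame (names and interface are ours):

```python
import numpy as np

def fill_spin_image(points, delta_alpha, delta_beta, n_alpha, n_beta):
    """Accumulate (alpha, beta) pairs into a spin-image with the bilinear
    filling: each point spreads (1-a)(1-b), a(1-b), (1-a)b, ab over the
    four corners of its cell, so each point contributes exactly 1."""
    W = np.zeros((n_alpha + 1, n_beta + 1))
    for alpha, beta in points:
        i = int(alpha // delta_alpha)
        j = int(beta // delta_beta)
        if not (0 <= i < n_alpha and 0 <= j < n_beta):
            continue                     # outside (alpha_max, beta_max): discarded
        a = (alpha - i * delta_alpha) / delta_alpha
        b = (beta - j * delta_beta) / delta_beta
        W[i, j]         += (1 - a) * (1 - b)
        W[i + 1, j]     += a * (1 - b)
        W[i, j + 1]     += (1 - a) * b
        W[i + 1, j + 1] += a * b
    return W
```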
III. MULTI-RESOLUTION
One of the most important features needed in our case is the possibility to perceive the object at different distances, and thus at different resolutions. This implies building a multi-resolution signature of the object, and being able to compute the resolution at which the object has been perceived. In the following, the finest spin-image SI_{r_max} has the highest resolution, which corresponds to (δα/2^{r_max}, δβ/2^{r_max}), while the spin-image SI_k has a resolution (δα/2^k, δβ/2^k) = (δα_k, δβ_k).
A. Computing resolution of an object

Fig. 3. Model induced by the surface nature of the pixels: a pixel in the left and right images back-projects through the optical centres into a 3D volume at distance d, approximated either by a Gaussian (ellipsoid) model or by an interval-analysis model.
The resolution of the perceived object depends upon the stereoscopic system capabilities, the distance between the robot and the object, and any sub-sampling scheme used during image processing. An error may also be induced by the segmentation used to match two points in the right and left images, in our case a correlation. If the pixel is considered as a surface on the image plane, the stereoscopic vision system may be seen as a sensor which perceives 3D volumes. Those volumes are the intersection of the cones representing the surfaces on the image planes. A 2D representation is given in Fig. 3. They can also be interpreted as the location error of a 3D point. [8] and [9] proposed an ellipsoid-based approximation of this volume, while [10] proposed a guaranteed bounding box using interval analysis. Both techniques show the non-linearity of the uncertainty related to the reconstruction of a 3D point. From those previous works, it is clear that the error estimation, and here the resolution, may differ for different parts of the object. While computing the signature, the resolution of the model is given by the average edge length L_model = (1/|E|) Σ_{e∈E} ||e|| of its corresponding data. The number m of multiple-resolution pictures can be deduced from the relationship B_model = L_model · 2^m, where B_model = min{X_max, Y_max, Z_max} and {X_max, Y_max, Z_max} is the bounding box of the model. Thus, in order to extract a global resolution from the scene, the average edge length L_scene is also used. The resolution r is chosen in the signature such that:

min{ r ∈ N | L_scene < 2^r · L_model }   (1)
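Eq. 1 amounts to a short scan over the stored pyramid levels; a minimal sketch, assuming levels 0..r_max exist and that scenes coarser than every level fall back to the coarsest one (that fallback is our assumption, not stated in the paper):

```python
def select_resolution(L_scene, L_model, r_max):
    """Pick the signature resolution r for a perceived scene, following
    Eq. (1): the smallest r with L_scene < 2**r * L_model."""
    for r in range(r_max + 1):
        if L_scene < (2 ** r) * L_model:
            return r
    return r_max  # assumed fallback: scene coarser than every stored level
```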
B. Multi-resolution signature
The dyadic scheme consists in dividing each dimension of the spin-image by 2 between two resolutions. Using the direct filling way, it is possible to compute, from resolution r+1 to resolution r, the density of a point M = (i, j) in SI_r by:

W^r_{(i,j)} = W^{r+1}_{(2i,2j)} + W^{r+1}_{(2i+1,2j)} + W^{r+1}_{(2i,2j+1)} + W^{r+1}_{(2i+1,2j+1)}
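For the direct-filled case, this dyadic reduction is a plain 2×2 block sum; an illustrative NumPy sketch (our naming), assuming even image sides:

```python
import numpy as np

def downsample_direct(W_fine):
    """Dyadic reduction of a direct-filled spin-image: each coarse cell
    (i, j) sums the four fine cells (2i, 2j), (2i+1, 2j),
    (2i, 2j+1), (2i+1, 2j+1)."""
    h, w = W_fine.shape
    assert h % 2 == 0 and w % 2 == 0, "fine image sides must be even"
    return (W_fine[0::2, 0::2] + W_fine[1::2, 0::2]
            + W_fine[0::2, 1::2] + W_fine[1::2, 1::2])
```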
Using the bilinear interpolated image, the relationship between W^r and W^{r+1} is not so obvious. In Fig. 4, the points from resolutions r and r+1 are depicted. Our goal is to find a relationship between the density W^r_{(i,j)} and the densities W^{r+1}_{(2i+k,2j+l)} for k ∈ {−2,−1,0,1,2} and l ∈ {−2,−1,0,1,2}. The main question is how to share the information carried by the points which will disappear. In Fig. 4, let’s consider N_4. As this point is not present at resolution r, its contribution has to be redistributed to the four adjacent points remaining at resolution r. However, as the density of a point M depends upon its distance, if M was in Q^r_{(i,j),0,2} = Q^{r+1}_{(2i−1,2j−1),2}, then its contribution has already been partially taken into account by N^{r+1}_{(2i,2j)}, but not by N^{r+1}_{(2i,2j−2)}, N^{r+1}_{(2i−2,2j−2)}, and N^{r+1}_{(2i−2,2j)}. For these three points, an offset of (δα/2^r, δβ/2^r) has to be introduced while processing N^r_{(i,j)}.
We note Q^r_{(i,j)} the surface described by the points N^r_{(i−1,j−1)}, N^r_{(i+1,j−1)}, N^r_{(i+1,j+1)}, N^r_{(i−1,j+1)}. This surface can be cut into four quadrants Q^r_{(i,j),l}, l ∈ {0,1,2,3}, as depicted in Fig. 4. For convenience, and following those notations, those quadrants may also be divided by four and will be noted Q^r_{(i,j),l,k}, k ∈ {0,1,2,3}. One can notice that the same quadrant may have several notations depending on the reference point used. For instance Q^r_{(i,j),2} = Q^r_{(i+1,j+1),0}, or Q^r_{(i,j),0,2} = Q^{r+1}_{(2i−1,2j−1),2}.
The notation used for the variables (a, b) is now extended, as they change according to the resolution: a(M, N^r_{(i,j)}) is the distance along α from N^r_{(i,j)} to M, and b(M, N^r_{(i,j)}) is the same along β. The relationship between those variables from one resolution to the next is summarised in Tab. I.
TABLE I
COEFFICIENTS FOR COMPUTING THE MULTI-RESOLUTION BILINEAR INTERPOLATION

Area          | Distances
Q^r_{(i,j),0} | a(M, N^r_{(i,j)}) = a(M, N^{r+1}_{(2i,2j)});  b(M, N^r_{(i,j)}) = b(M, N^{r+1}_{(2i,2j)})
Q^r_{(i,j),1} | a(M, N^r_{(i,j)}) = a(M, N^{r+1}_{(2i+1,2j)}) + δα/2^{r+1};  b(M, N^r_{(i,j)}) = b(M, N^{r+1}_{(2i+1,2j)})
Q^r_{(i,j),2} | a(M, N^r_{(i,j)}) = a(M, N^{r+1}_{(2i+1,2j+1)}) + δα/2^{r+1};  b(M, N^r_{(i,j)}) = b(M, N^{r+1}_{(2i+1,2j+1)}) + δβ/2^{r+1}
Q^r_{(i,j),3} | a(M, N^r_{(i,j)}) = a(M, N^{r+1}_{(2i,2j+1)});  b(M, N^r_{(i,j)}) = b(M, N^{r+1}_{(2i,2j+1)}) + δβ/2^{r+1}
Lemma: Let’s note W^r_{(i,j)}(Q) the contribution of the quadrant Q to the density at point (i, j) of a spin-image of resolution r filled by bilinear interpolation. If N_m ∈ {N^{r+1}_{(2i+k,2j+l)}} for k ∈ {0,1,2} and l ∈ {0,1,2}, with m = 3k + l, then we have:

W^r_{(i,j)}(Q^r_{(i,j),2}) = Σ_{n=0}^{3} Σ_{m=0}^{8} (1 − a_{N_m}/δα_r)(1 − b_{N_m}/δβ_r) W^{r+1}_{N_m}(Q^r_{(i,j),2,n})
W^r_{(i+1,j)}(Q^r_{(i+1,j),3}) = Σ_{n=0}^{3} Σ_{m=0}^{8} (a_{N_m}/δα_r)(1 − b_{N_m}/δβ_r) W^{r+1}_{N_m}(Q^r_{(i,j),3,n})
W^r_{(i,j+1)}(Q^r_{(i,j+1),1}) = Σ_{n=0}^{3} Σ_{m=0}^{8} (1 − a_{N_m}/δα_r)(b_{N_m}/δβ_r) W^{r+1}_{N_m}(Q^r_{(i,j),1,n})
W^r_{(i+1,j+1)}(Q^r_{(i,j),0}) = Σ_{n=0}^{3} Σ_{m=0}^{8} (a_{N_m}/δα_r)(b_{N_m}/δβ_r) W^{r+1}_{N_m}(Q^r_{(i,j),0,n})
(2)

with a_{N_m} = a(N_m, N^r_{(i,j)}), b_{N_m} = b(N_m, N^r_{(i,j)}), and W^{r+1}_{N_m} = W^{r+1}_{N^{r+1}_{(2i+k,2j+l)}}. Finally

W^r_{(i,j)} = Σ_{n=0}^{3} W^r_{(i,j)}(Q^r_{(i,j),n})   (3)

Fig. 4. Computing bilinear interpolated spin-images from one resolution to the other: the coarse points N^r_{(i,j)}, N^r_{(i+1,j)}, N^r_{(i,j+1)}, N^r_{(i+1,j+1)}, the fine points N_0, ..., N_8 = N^{r+1}_{(2i+k,2j+l)}, the quadrants Q^r_0, ..., Q^r_3 and the sub-quadrants Q^r_{0,0}, ..., Q^r_{0,3}.
Proof: We give here a partial proof to illustrate the general concept. Let’s consider a point M ∈ Q^r_{(i,j),2,2} = Q^{r+1}_{(2i+1,2j+1),2} = Q^{r+1}_{N_4,2} at resolution r+1. The points N_4, N_5, N_7 and N_8 of the spin-image mesh are considered. The contribution provided by M to each of those points is computed as follows:

W^{r+1}_{N_4}(Q^{r+1}_{N_4,2}) = Σ_{M ∈ Q^{r+1}_{N_4,2}} (1 − a(M,N_4)/δα_{r+1}) (1 − b(M,N_4)/δβ_{r+1})
W^{r+1}_{N_5}(Q^{r+1}_{N_4,2}) = Σ_{M ∈ Q^{r+1}_{N_4,2}} (1 − a(M,N_4)/δα_{r+1}) (b(M,N_4)/δβ_{r+1})
W^{r+1}_{N_7}(Q^{r+1}_{N_4,2}) = Σ_{M ∈ Q^{r+1}_{N_4,2}} (a(M,N_4)/δα_{r+1}) (1 − b(M,N_4)/δβ_{r+1})
W^{r+1}_{N_8}(Q^{r+1}_{N_4,2}) = Σ_{M ∈ Q^{r+1}_{N_4,2}} (a(M,N_4)/δα_{r+1}) (b(M,N_4)/δβ_{r+1})

Now the same point M ∈ Q^{r+1}_{N_4,2} at resolution r can be computed through the bilinear interpolation filling. This may be written for N^r_{(i,j)}:

W^r_{(i,j)}(Q^{r+1}_{N_4,2}) = Σ_{M ∈ Q^{r+1}_{N_4,2}} (1 − a(M, N^r_{(i,j)})/δα_r) (1 − b(M, N^r_{(i,j)})/δβ_r)   (4)

From Tab. I, and having 2δα_{r+1} = δα_r, Eq. 4 can be rewritten:

W^r_{(i,j)}(Q^{r+1}_{N_4,2}) = W^r_{(i,j)}(Q^r_{(i,j),2,2})
= Σ_{M ∈ Q^{r+1}_{N_4,2}} (1 − (a(M,N_4) + δα_{r+1})/(2δα_{r+1})) (1 − (b(M,N_4) + δβ_{r+1})/(2δβ_{r+1}))
= Σ_{M ∈ Q^{r+1}_{N_4,2}} (1/2)(1 − a(M,N_4)/δα_{r+1}) · (1/2)(1 − b(M,N_4)/δβ_{r+1})
= (1/4) W^{r+1}_{N_4}(Q^{r+1}_{N_4,2})
(5)
Using the same arguments, we can find:

W^r_{(i,j)}(Q^r_{(i,j),2,0}) = W^{r+1}_{N_0}(Q^r_{(i,j),2,0}) + (1/2) W^{r+1}_{N_1}(Q^r_{(i,j),2,0}) + (1/2) W^{r+1}_{N_3}(Q^r_{(i,j),2,0}) + (1/4) W^{r+1}_{N_4}(Q^r_{(i,j),2,0})
W^r_{(i,j)}(Q^r_{(i,j),2,1}) = (1/2) W^{r+1}_{N_1}(Q^r_{(i,j),2,1}) + (1/4) W^{r+1}_{N_4}(Q^r_{(i,j),2,1})
W^r_{(i,j)}(Q^r_{(i,j),2,3}) = (1/2) W^{r+1}_{N_3}(Q^r_{(i,j),2,3}) + (1/4) W^{r+1}_{N_4}(Q^r_{(i,j),2,3})
(6)
Thus

W^r_{(i,j)}(Q^r_{(i,j),2}) = Σ_{n=0}^{3} W^r_{(i,j)}(Q^r_{(i,j),2,n})
= W^{r+1}_{N_0}(Q^r_{(i,j),2,0}) + (1/2) W^{r+1}_{N_1}(Q^r_{(i,j),2,0}) + (1/2) W^{r+1}_{N_3}(Q^r_{(i,j),2,0}) + (1/4) W^{r+1}_{N_4}(Q^r_{(i,j),2,0})
+ (1/2) W^{r+1}_{N_1}(Q^r_{(i,j),2,1}) + (1/4) W^{r+1}_{N_4}(Q^r_{(i,j),2,1})
+ (1/4) W^{r+1}_{N_4}(Q^r_{(i,j),2,2})
+ (1/2) W^{r+1}_{N_3}(Q^r_{(i,j),2,3}) + (1/4) W^{r+1}_{N_4}(Q^r_{(i,j),2,3})
= Σ_{n=0}^{3} Σ_{m=0}^{8} (1 − a_{N_m}/δα_r)(1 − b_{N_m}/δβ_r) W^{r+1}_{N_m}(Q^r_{(i,j),2,n})
(7)
The same arguments hold for the other points and prove the lemma.
The multi-resolution computation of the spin-images is done by first computing the most precise spin-image through examination of every point. For each point of the spin-image, four densities corresponding to each quadrant are stored. For lower-resolution images, the density is computed using the position of the point with respect to the quadrant considered and Eq. 2.
It should be stressed here that in our current implementation, only the spin-images are subjected to a multi-resolution scheme. In this first step, no sub-sampling of the mesh has been applied. Thus, while the size of the spin-images decreases in this process, the number of points does not.
IV. SEARCH PROCESS

Fig. 5. A 3D mesh extracted from the Stanford Bunny flying in the OpenHRP simulator. The scene is cut according to the bounding box of the model.

The search process described here is based on a 3D mesh. This can be either a single view of the environment or an incrementally built representation. In our current implementation, it is a single view provided by the stereoscopic system. In the following, it is called the scene. The scene is divided into sub-blocks. The sub-block size is given by the bounding box of the searched object, as depicted in Fig. 5. On each of the sub-blocks the following algorithm is applied:
1) Select the best resolution according to the average edge length;
2) Get the main rigid transformation which projects the model into the scene;
3) Check if the model is in the scene using the previously computed rigid transformation. This provides a main correlation coefficient, and the position plus orientation of the seen object in the scene.
A. Selection of the best resolution
From section III, the object resolution is the average edge length in the scene. Then the resolution for the model’s spin-images is chosen according to Eq. 1. Two spin-images (p, q) with the same resolution are compared using the following correlation function, as proposed in [3]:

R = (N Σ_i p_i q_i − Σ_i p_i Σ_i q_i) / sqrt[ (N Σ_i p_i² − (Σ_i p_i)²) (N Σ_i q_i² − (Σ_i q_i)²) ],  R ∈ [−1; 1]   (8)

with N the number of non-empty points in the spin-image of the scene, and the sums taken over those points. This correlation can be proven to be independent of the normalisation of a spin-image. Thus, during the multi-resolution phase, the spin-images are not normalised.
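Eq. 8 is a normalised (Pearson-style) correlation over the non-empty scene bins; a minimal sketch (our naming; we assume "non-empty" means strictly positive density in the scene spin-image):

```python
import numpy as np

def spin_image_correlation(p, q):
    """Correlation R of Eq. (8) between a scene spin-image p and a model
    spin-image q of the same resolution, restricted to the N non-empty
    bins of p. Scale-invariant, so unnormalised spin-images compare fine."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0                      # the N non-empty points of the scene image
    p, q = p[mask], q[mask]
    n = p.size
    num = n * np.dot(p, q) - p.sum() * q.sum()
    den = np.sqrt((n * np.dot(p, p) - p.sum() ** 2)
                  * (n * np.dot(q, q) - q.sum() ** 2))
    return num / den if den > 0 else 0.0
```

The scale invariance is visible directly: multiplying q by any positive constant scales numerator and denominator identically, which is why the multi-resolution spin-images can be left unnormalised.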
B. Rigid transformation evaluation
The main rigid transformation is obtained as follows:
Some points are randomly selected in the scene. Their cor-
responding points in the model are searched by comparing

Citations
More filters

Journal ArticleDOI
TL;DR: The paper describes how real-time — or high-bandwidth — cognitive processes can be obtained by combining vision with walking and the central point of the methodology is to use appropriate models to reduce the complexity of the search space.
Abstract: Aiming at building versatile humanoid systems, we present in this paper the real-time implementation of behaviors which integrate walking and vision to achieve general functionalities. The paper describes how real-time — or high-bandwidth — cognitive processes can be obtained by combining vision with walking. The central point of our methodology is to use appropriate models to reduce the complexity of the search space. We will describe the models introduced in the different blocks of the system and their relationships: walking pattern, self-localization and map building, real-time reactive vision behaviors, and planning.

28 citations


Journal ArticleDOI
TL;DR: A systematic literature review concerning 3D object recognition and classification published between 2006 and 2016 is presented, using the methodology for systematic review proposed by Kitchenham.
Abstract: In this paper, we present a systematic literature review concerning 3D object recognition and classification. We cover articles published between 2006 and 2016 available in three scientific databases (ScienceDirect, IEEE Xplore and ACM), using the methodology for systematic review proposed by Kitchenham. Based on this methodology, we used tags and exclusion criteria to select papers about the topic under study. After the works selection, we applied a categorization process aiming to group similar object representation types, analyzing the steps applied for object recognition, the tests and evaluation performed and the databases used. Lastly, we compressed all the obtained information in a general overview and presented future prospects for the area.

26 citations


Cites methods from "3D object recognition using spin-im..."

  • ...Other works employing the SI for object matching and representation are: Stasse [277] presents a multi-resolution SI approach for object representation and recognition; Assfalg [17] shows a SI variation called Spin Image Signatures (SIS), which is developed under the SI approach with adaptations to support effective retrieval by content; Li [163] demonstrates a framework to identify partial 3D format in 3D CAD parts using the SI as descriptor; Ping [233] proposes the Tsallis entropy use to generate a concise SI representation, called Tsallis Entropy vector of Spin Image (TESI); Choi [47] proposed an improved SI version, which enhances the format discrimination performance, called Angular-Partitioned Spin Images (APSIs)....

    [...]


Proceedings ArticleDOI
09 May 2011
TL;DR: A shape model-based approach using stereo vision and machine learning for object categorization is introduced allowing proper categorization of unknown objects even when object appearance and shape substantially differ from the training set.
Abstract: Humanoid robots should be able to grasp and handle objects in the environment, even if the objects are seen for the first time. A plausible solution to this problem is to categorize these objects into existing classes with associated actions and functional knowledge. So far, efforts on visual object categorization using humanoid robots have either been focused on appearance-based methods or have been restricted to object recognition without generalization capabilities. In this work, a shape model-based approach using stereo vision and machine learning for object categorization is introduced. The state-of-the-art features for shape matching and shape retrieval were evaluated and selectively transfered into the visual categorization. Visual sensing from different vantage points allows the reconstruction of 3D mesh models of the objects found in the scene by exploiting knowledge about the environment for model-based segmentation and registration. These reconstructed 3D mesh models were used for shape feature extraction for categorization and provide sufficient information for grasping and manipulation. Finally, the visual categorization was successfully performed with a variety of features and classifiers allowing proper categorization of unknown objects even when object appearance and shape substantially differ from the training set. Experimental evaluation with the humanoid robot ARMAR-IIIa is presented.

17 citations


Cites methods from "3D object recognition using spin-im..."

  • ...In [14], spin images were used in a 3D object detection system with the humanoid robot HRP-2 [18]....

    [...]

  • ...Spin images are shape descriptors which have been applied to surface matching [12], object recognition [13][14], 3D registration [15] and 3D object retrieval [16]....

    [...]


Proceedings Article
01 Dec 2008
TL;DR: The current status of the group in trying to make a humanoid robot autonomously build an internal representation of an object, and later on to find it in an unknown environment named "treasure hunting" is described.
Abstract: This paper intends to describe the current status of our group in trying to make a humanoid robot autonomously build an internal representation of an object, and later on to find it in an unknown environment. This problem is named "treasure hunting". In both cases, the main difficulty is to be able to find the next best position of the vision sensor in order to realize the behavior while taking care of the robots limitation. We briefly describe the models and the processes we are currently investigating in building this overall behavior. Along the description we stress the current key problems faced while trying to solve this problem.

8 citations


Cites methods from "3D object recognition using spin-im..."

  • ...Depending on the task different recognitions can be used, as we have at our disposal either a 3D-edge model [25] or a Spin-Image [26]....



Dissertation
04 Apr 2013
TL;DR: The last part of this thesis tries to draw some directions where innovative ideas may break some current technical locks in humanoid robotics.
Abstract: This manuscript presents my research activities on real-time vision-based behaviors for complex robots such as humanoids. The main scientific question structuring this work is the following: "What are the decisional processes that make it possible for a humanoid robot to generate motion in real-time based upon visual information?" In soccer, humans can decide to kick a ball while running and while all the other players are constantly moving. When recast as an optimization problem for a humanoid robot, finding a solution for such a behavior is generally computationally hard. For instance, the problem of visual search considered in this work is NP-complete. The first part of this work concerns real-time motion generation. Starting from the general constraints that a humanoid robot has to fulfill to generate a feasible motion, some core problems are presented. From this, several contributions allowing a humanoid robot to react to changes in the environment are presented. They revolve around walking pattern generation, whole-body motion for obstacle avoidance, and real-time foot-step planning in constrained environments. The second part of this work concerns real-time acquisition of knowledge about the environment through computer vision. Two main behaviors are considered: visual search and visual object model construction. They are treated as a whole, taking into account the model of the sensor, the motion cost, the mechanical constraints of the robot, the geometry of the environment, as well as the limitations of the vision processes. In addition, contributions on coupling Self Localization and Map Building with walking, and on real-time foot-step generation based on visual servoing, are presented. Finally, the core technologies developed in the previous contexts were used in different applications: human-robot interaction, tele-operation, and human behavior analysis.
Based upon the feedback from several integrated demonstrators on the humanoid robot HRP-2, the last part of this thesis tries to draw some directions where innovative ideas may break some current technical locks in humanoid robotics.

7 citations


References

Journal ArticleDOI
Abstract: We present a 3D shape-based object recognition system for simultaneous recognition of multiple objects in scenes containing clutter and occlusion. Recognition is based on matching surfaces by matching points using the spin image representation. The spin image is a data level shape descriptor that is used to match surfaces represented as surface meshes. We present a compression scheme for spin images that results in efficient multiple object recognition which we verify with results showing the simultaneous recognition of multiple objects from a library of 20 models. Furthermore, we demonstrate the robust performance of recognition in the presence of clutter and occlusion through analysis of recognition trials on 100 scenes.
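The spin-image representation described above can be sketched in a few lines. The following is an illustrative sketch, not the authors' implementation: for an oriented point (p, n), every surface point x is mapped to cylindrical coordinates (alpha, beta) relative to the normal axis and accumulated into a 2D histogram, here using the bilinear filling scheme that smooths the effect of noise by sharing each point's contribution among the four surrounding bins. The function name and the parameters `bin_size` and `image_width` are arbitrary choices for the example.

```python
import numpy as np

def spin_image(points, p, n, bin_size=0.1, image_width=10):
    """Accumulate a spin image at the oriented point (p, n).

    alpha: radial distance from the normal axis through p.
    beta:  signed height along the normal n.
    Bilinear interpolation spreads each point over the 4
    surrounding bins, which reduces sensitivity to noise.
    """
    n = n / np.linalg.norm(n)
    img = np.zeros((2 * image_width, image_width))  # rows: beta, cols: alpha
    for x in points:
        d = x - p
        beta = np.dot(n, d)
        alpha = np.sqrt(max(np.dot(d, d) - beta * beta, 0.0))
        # Continuous bin coordinates (beta axis is centred vertically).
        i = image_width - beta / bin_size
        j = alpha / bin_size
        i0, j0 = int(np.floor(i)), int(np.floor(j))
        if 0 <= i0 < 2 * image_width - 1 and 0 <= j0 < image_width - 1:
            a, b = i - i0, j - j0
            img[i0,     j0]     += (1 - a) * (1 - b)
            img[i0 + 1, j0]     += a * (1 - b)
            img[i0,     j0 + 1] += (1 - a) * b
            img[i0 + 1, j0 + 1] += a * b
    return img
```

Each in-range point contributes a total mass of exactly 1 to the image, split among four bins, so the image sum counts the points that fell inside the support region.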

2,596 citations


Proceedings ArticleDOI
23 Jun 2003
TL;DR: The limitations of canonical alignment are described and an alternate method, based on spherical harmonics, for obtaining rotation invariant representations is discussed, which reduces the dimensionality of the descriptor, providing a more compact representation, which in turn makes comparing two models more efficient.
Abstract: One of the challenges in 3D shape matching arises from the fact that in many applications, models should be considered to be the same if they differ by a rotation. Consequently, when comparing two models, a similarity metric implicitly provides the measure of similarity at the optimal alignment. Explicitly solving for the optimal alignment is usually impractical. So, two general methods have been proposed for addressing this issue: (1) Every model is represented using rotation invariant descriptors. (2) Every model is described by a rotation dependent descriptor that is aligned into a canonical coordinate system defined by the model. In this paper, we describe the limitations of canonical alignment and discuss an alternate method, based on spherical harmonics, for obtaining rotation invariant representations. We describe the properties of this tool and show how it can be applied to a number of existing, orientation dependent descriptors to improve their matching performance. The advantages of this tool are two-fold: First, it improves the matching performance of many descriptors. Second, it reduces the dimensionality of the descriptor, providing a more compact representation, which in turn makes comparing two models more efficient.
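The rotation-invariance idea in this abstract can be illustrated numerically: a rotation mixes spherical-harmonic coefficients only within a fixed degree l, so the per-degree power (the sum of squared coefficient magnitudes) is a rotation-invariant signature. The sketch below, which is a demonstration rather than the paper's method, checks this for a degree-2 function on the sphere; the coefficient values, grid resolution, and rotation angle are arbitrary choices for the example.

```python
import numpy as np

def Y2(m, theta, phi):
    """Orthonormal degree-2 complex spherical harmonics (Condon-Shortley)."""
    st, ct = np.sin(theta), np.cos(theta)
    if m == 0:
        return np.sqrt(5 / (16 * np.pi)) * (3 * ct**2 - 1) + 0j
    if abs(m) == 1:
        r = np.sqrt(15 / (8 * np.pi)) * st * ct
        return (-r if m == 1 else r) * np.exp(1j * m * phi)
    r = np.sqrt(15 / (32 * np.pi)) * st**2
    return r * np.exp(1j * m * phi)

# Arbitrary coefficients of a band-limited function (degree 2 only), m = -2..2.
coeffs = [0.5 + 0.2j, 0.1j, 1.0 + 0j, -0.3 + 0j, 0.25 - 0.1j]

def f(theta, phi):
    return sum(c * Y2(m, theta, phi) for m, c in zip(range(-2, 3), coeffs))

# Midpoint quadrature grid over the sphere (theta: polar, phi: azimuth).
nt, nph = 400, 400
t = (np.arange(nt) + 0.5) * np.pi / nt
p = (np.arange(nph) + 0.5) * 2 * np.pi / nph
T, P = np.meshgrid(t, p, indexing="ij")
w = np.sin(T) * (np.pi / nt) * (2 * np.pi / nph)  # solid-angle weights

def project(vals):
    """Recover degree-2 coefficients by integrating against conj(Y2)."""
    return np.array([np.sum(vals * np.conj(Y2(m, T, P)) * w)
                     for m in range(-2, 3)])

def rotated_angles(theta, phi, a):
    """Pull the grid points back through a rotation about the x-axis,
    so f evaluated here samples the rotated function f(R^{-1} x)."""
    x = np.sin(theta) * np.cos(phi)
    y = np.sin(theta) * np.sin(phi)
    z = np.cos(theta)
    c, s = np.cos(a), np.sin(a)
    y2, z2 = c * y + s * z, -s * y + c * z
    return np.arccos(np.clip(z2, -1, 1)), np.arctan2(y2, x)

c_before = project(f(T, P))
c_after = project(f(*rotated_angles(T, P, 0.7)))

# Individual coefficients change, but the degree-2 power does not.
power_before = np.sum(np.abs(c_before) ** 2)
power_after = np.sum(np.abs(c_after) ** 2)
```

Discarding the phase relations between degrees is what costs these descriptors some discriminative power, which is the trade-off the paper discusses against canonical alignment.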

1,375 citations


"3D object recognition using spin-im..." refers methods in this paper

  • ...Towards the design of a search engine for databases of CAD models, several 3D descriptors have been proposed to build signatures of 3D objects [3], [4], [5]....



Proceedings ArticleDOI
27 Sep 2004
TL;DR: The development of humanoid robot HRP-2 is presented, including its appearance design, mechanisms, electrical systems, specifications, and the features upgraded from its prototype.
Abstract: The development of humanoid robot HRP-2 is presented in this paper. HRP-2 is a humanoid robotics platform, which we developed in phase two of HRP. HRP was a humanoid robotics project run by the Ministry of Economy, Trade and Industry (METI) of Japan for five years, from FY1998 to FY2002. The biped locomotion ability of HRP-2 is improved so that HRP-2 can cope with uneven surfaces, can walk at two thirds of human speed, and can walk on a narrow path. The whole-body motion ability of HRP-2 is also improved so that HRP-2 can get up by itself if it tips over safely. In this paper, the appearance design, the mechanisms, the electrical systems, specifications, and features upgraded from its prototype are also introduced.

882 citations


"3D object recognition using spin-im..." refers background in this paper

  • ...The targeted application is a “Treasure hunting” behaviour on a HRP-2 humanoid robot [6]....



Book ChapterDOI
11 May 2004
TL;DR: Two new regional shape descriptors are introduced: 3D shape contexts and harmonic shape contexts that outperform the others on cluttered scenes on recognition of vehicles in range scans of scenes using a database of 56 cars.
Abstract: Recognition of three dimensional (3D) objects in noisy and cluttered scenes is a challenging problem in 3D computer vision. One approach that has been successful in past research is the regional shape descriptor. In this paper, we introduce two new regional shape descriptors: 3D shape contexts and harmonic shape contexts. We evaluate the performance of these descriptors on the task of recognizing vehicles in range scans of scenes using a database of 56 cars. We compare the two novel descriptors to an existing descriptor, the spin image, showing that the shape context based descriptors have a higher recognition rate on noisy scenes and that 3D shape contexts outperform the others on cluttered scenes.

844 citations


"3D object recognition using spin-im..." refers methods in this paper

  • ...Towards the design of a search engine for databases of CAD models, several 3D descriptors have been proposed to build signatures of 3D objects [3], [4], [5]....



Proceedings ArticleDOI
14 Oct 2008
TL;DR: The development of humanoid robot HRP-3 is presented; its main mechanical and structural components are designed to prevent the penetration of dust or spray, and its wrist and hand are newly designed to improve manipulation.
Abstract: In this paper, the development of humanoid robot HRP-3 is presented. HRP-3, which stands for Humanoid Robotics Platform-3, is a human-size humanoid robot developed as the succeeding model of HRP-2. One of the features of HRP-3 is that its main mechanical and structural components are designed to prevent the penetration of dust or spray. Another is that its wrist and hand are newly designed to improve manipulation. Software for a humanoid robot in a real environment is also improved. We include information on the mechanical features of HRP-3 together with the newly developed hand, as well as the technologies implemented in the HRP-3 prototype. Electrical features and some experimental results using HRP-3 are also presented.

699 citations


"3D object recognition using spin-im..." refers background in this paper

  • ...The targeted application is a “Treasure hunting” behaviour on a HRP-2 humanoid robot [6]....



Frequently Asked Questions (1)
Q1. What are the contributions mentioned in the paper "3d object recognition using spin-images for a humanoid stereoscopic vision system" ?

This paper presents a 3D object recognition method based on spin-images for a humanoid robot having a stereoscopic vision system.