
Article title: Data-Driven Grasp Synthesis—A Survey
Authors: Bohg, J.; Morales, A.; Asfour, T.; Kragic, D.
Journal: IEEE Transactions on Robotics
Version: Postprint
Bibliographic citation (ISO 690): BOHG, Jeannette, et al.
Data-driven grasp synthesis—a survey. IEEE Transactions on
Robotics, 2014, vol. 30, no. 2, p. 289-309.
UJI repository URL: http://hdl.handle.net/10234/132550

Data-Driven Grasp Synthesis - A Survey
Jeannette Bohg, Member, IEEE, Antonio Morales, Member, IEEE, Tamim Asfour, Member, IEEE,
Danica Kragic, Member, IEEE
Abstract—We review the work on data-driven grasp synthesis
and the methodologies for sampling and ranking candidate
grasps. We divide the approaches into three groups based on
whether they synthesize grasps for known, familiar or unknown
objects. This structure allows us to identify common object
representations and perceptual processes that facilitate the employed
data-driven grasp synthesis technique. In the case of known
objects, we concentrate on the approaches that are based on
object recognition and pose estimation. In the case of familiar
objects, the techniques use some form of a similarity matching
to a set of previously encountered objects. Finally, for the
approaches dealing with unknown objects, the core part is the
extraction of specific features that are indicative of good grasps.
Our survey provides an overview of the different methodologies
and discusses open problems in the area of robot grasping. We
also draw a parallel to the classical approaches that rely on
analytic formulations.
Index Terms—Object grasping and manipulation, grasp
synthesis, grasp planning, visual perception, object recognition
and classification, visual representations
I. INTRODUCTION
Given an object, grasp synthesis refers to the problem of
finding a grasp configuration that satisfies a set of criteria
relevant for the grasping task. Finding a suitable grasp among
the infinite set of candidates is a challenging problem and has
been addressed frequently in the robotics community, resulting
in an abundance of approaches.
In the recent review of Sahbani et al. [1], the authors divide
the methodologies into analytic and empirical. Following
Shimoga [2], analytic refers to methods that construct
force-closure grasps with a multi-fingered robotic hand that are
dexterous, in equilibrium, stable and exhibit a certain dynamic
behaviour. Grasp synthesis is then usually formulated as a
constrained optimization problem over criteria that measure
one or several of these four properties. In this case, a grasp is
typically defined by the grasp map that transforms the forces
exerted at a set of contact points to object wrenches [3].
The criteria are based on geometric, kinematic or dynamic
formulations. Analytic formulations towards grasp synthesis
have also been reviewed by Bicchi and Kumar [4].
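For concreteness, the grasp map and the resulting optimization problem can be written down as follows; this is standard notation in the spirit of [3], our paraphrase rather than a formulation taken verbatim from any single cited work:

\[
w = G\,f_c, \qquad f_c \in FC = \{\, f : \|f_{t,i}\| \le \mu\, f_{n,i} \ \forall i \,\},
\]

where \(w \in \mathbb{R}^6\) is the net object wrench, \(f_c\) stacks the contact forces, \(G\) is the grasp map, and \(FC\) is the Coulomb friction cone with coefficient \(\mu\). Grasp synthesis then amounts to choosing contacts (and a hand posture reaching them) that maximize a quality measure \(Q(G)\) subject to force closure, i.e., for every disturbance wrench \(w_{ext}\) there exists an \(f_c \in FC\) with \(G f_c = -w_{ext}\).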
Empirical or data-driven approaches rely on sampling grasp
candidates for an object and ranking them according to a
specific metric. This process is usually based on some existing
J. Bohg is with the Autonomous Motion Department at the MPI for
Intelligent Systems, Tübingen, Germany, e-mail: jbohg@tuebingen.mpg.de.
A. Morales is with the Robotic Intelligence Lab. at Universitat Jaume I,
Castelló, Spain, e-mail: Antonio.Morales@uji.es.
T. Asfour is with the KIT, Karlsruhe, Germany, e-mail: asfour@kit.edu.
D. Kragic is with the Centre for Autonomous Systems, Computational
Vision and Active Perception Lab, Royal Institute of Technology KTH,
Stockholm, Sweden, e-mail: dank@kth.se.
This work has been supported by FLEXBOT (FP7-ERC-279933).
grasp experience that can be a heuristic or is generated in
simulation or on a real robot. Kamon et al. [5] refer to this
as the comparative and Shimoga [2] as the knowledge-based
approach. Here, a grasp is commonly parameterized by [6, 7]:
- the grasping point on the object with which the tool center point (TCP) should be aligned,
- the approach vector, which describes the 3D angle at which the robot hand approaches the grasping point,
- the wrist orientation of the robotic hand, and
- an initial finger configuration.
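As a reading aid, this parameterization can be captured in a small data structure; the following is a minimal sketch whose field names and types are chosen by us, not taken from any cited system:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GraspCandidate:
    """One grasp hypothesis, following the parameterization in [6, 7]."""
    grasp_point: np.ndarray       # 3D point on the object to align the TCP with
    approach_vector: np.ndarray   # 3D direction along which the hand approaches
    wrist_orientation: float      # rotation of the hand about the approach axis [rad]
    finger_config: np.ndarray     # initial joint angles of the fingers
    quality: float = 0.0          # score assigned by the ranking metric
```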
Data-driven approaches differ in how the set of grasp
candidates is sampled, how the grasp quality is estimated and how
good grasps are represented for future use. Some methods
measure grasp quality based on analytic formulations, but
more commonly they encode e.g. human demonstrations,
perceptual information or semantics.
A. Brief Overview of Analytic Approaches
Analytic approaches provide guarantees regarding the
criteria that measure the previously mentioned four grasp
properties. However, these are usually based on assumptions such as
simplified contact models, Coulomb friction and rigid body
modeling [3, 8]. Although these assumptions render grasp
analysis practical, inconsistencies and ambiguities, especially
regarding the analysis of grasp dynamics, are usually attributed
to their approximate nature.
In this context, Bicchi and Kumar [4] identified the
problem of finding an accurate and tractable model of contact
compliance as particularly relevant. This is needed to analyze
statically-indeterminate grasps in which not all internal forces
can be controlled. This case arises e.g. for under-actuated
hands or grasp synergies, where the number of controlled
degrees of freedom is smaller than the number of contact forces.
Prattichizzo et al. [9] model such a system by introducing a set
of springs at the contacts and joints and show how its dexterity
can be analyzed. Rosales et al. [10] adopt the same model of
compliance to synthesize feasible and prehensile grasps. In
this case, only statically-determinate grasps are considered.
The problem of finding a suitable hand configuration is cast
as a constrained optimization problem in which compliance is
introduced to simultaneously address the constraints of contact
reachability, object restraint and force controllability. As is
the case with many other analytic approaches towards grasp
synthesis, the proposed model is only studied in simulation
where accurate models of the hand kinematics, the object and
their relative alignment are available.
In practice, systematic and random errors are inherent to a
robotic system; they are due to noisy sensors and inaccurate
models of the object, the robot, etc. The relative position of object

and hand can therefore only be known approximately which
makes an accurate placement of the fingertips difficult. In
2000, Bicchi and Kumar [4] identified a lack of approaches
towards synthesizing grasps that are robust to positioning
errors. Since then, this problem has shifted into focus. One
line of research follows the approach of independent contact
regions (ICRs) as defined by Nguyen [11]: a set of regions on
the object in which each finger can be independently placed
anywhere without the grasp losing the force-closure property.
Several examples for computing them are presented by Roa
and Suárez [12] or Krug et al. [13]. Another line of research
towards robustness against inaccurate end-effector positioning
makes use of the caging formulation. Rodriguez et al. [14]
found that there are caging configurations of a three-fingered
manipulator around a planar object that are specifically suited
as a way point to grasping it. Once the manipulator is in
such a configuration, either opening or closing the fingers is
guaranteed to result in an equilibrium grasp without the need
for accurate positioning of the fingers. Seo et al. [15] exploited
the fact that two-fingered immobilizing grasps of an object are
always preceded by a caging configuration. Full body grasps
of planar objects are synthesized by first finding a two-contact
caging configuration and then using additional contacts to
restrain the object. Results have been presented in simulation
and demonstrated on a real robot.
Another assumption commonly made in analytic approaches
is that precise geometric and physical models of an object are
available to the robot, which is not always the case. In addition,
we may not know the surface properties or friction coefficients,
weight, center of mass and weight distribution. Some of these
can be retrieved through interaction: Zhang and Trinkle [16]
propose to use a particle filter to simultaneously estimate the
physical parameters of an object and track it while it is being
pushed. The dynamic model of the object is formulated as a
mixed nonlinear complementarity problem. The authors show
that even when the object is occluded and the state estimate
cannot be updated through visual observation, the motion of
the object is accurately predicted over time. Although methods
like this relax some of the assumptions, they are still limited
to simulation [14, 10] or consider 2D objects [14, 15, 16].
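The estimation idea in Zhang and Trinkle [16] can be illustrated with one generic particle-filter step. This is a minimal sketch assuming user-supplied dynamics and observation models; their actual dynamics model is the mixed nonlinear complementarity problem mentioned above:

```python
import numpy as np

def particle_filter_step(particles, weights, dynamics, likelihood, observation, rng):
    """One generic particle-filter update (sketch, hypothetical interfaces).

    particles : (n, d) array; each row stacks the object pose and its
                physical parameters (e.g. friction coefficient, mass).
    dynamics  : stochastic motion model, p -> next p.
    likelihood: observation model, (observation, p) -> probability.
    """
    # Predict: propagate each particle through the (stochastic) dynamics.
    particles = np.array([dynamics(p, rng) for p in particles])
    # Update: re-weight by how well each particle explains the observation.
    weights = weights * np.array([likelihood(observation, p) for p in particles])
    weights /= weights.sum()
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(particles):
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights
```

When the object is occluded, the update step can simply be skipped and the prediction step keeps propagating the belief, which matches the behavior the authors report.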
B. Development of Data-Driven Methods
Up to the year 2000, the field of robotic grasping¹ was
clearly dominated by analytic approaches [11, 4, 17, 2]. Apart
from e.g. Kamon et al. [5], data-driven grasp synthesis started
to become popular with the availability of GraspIt! [18] in
2004. Many highly cited approaches have been developed,
analyzed and evaluated in this or other simulators [19, 20, 21,
22, 23, 24]. These approaches differ in how grasp candidates
are sampled from the infinite space of possibilities. For grasp
ranking, they rely on classical metrics based on analytic
formulations, such as the widely used ε-metric proposed by
Ferrari and Canny [17]. It constructs the grasp wrench space
(GWS) by computing the convex hull over the wrenches at the
contact points between the hand and the object. ε ranks the
quality of a force-closure grasp by quantifying the radius of
the maximum sphere still fully contained in the GWS.

¹ Citation counts for the most influential articles in the field, extracted from
scholar.google.com in October 2013. [11]: 733. [4]: 490. [17]: 477. [2]: 405.
[5]: 77. [18]: 384. [19]: 353. [20]: 100. [21]: 110. [22]: 95. [23]: 96. [24]:
108. [25]: 38. [26]: 156. [27]: 39. [28]: 277. [29]: 75. [30]: 40. [31]: 21. [32]:
43. [33]: 77. [34]: 26. [35]: 191. [36]: 58. [37]: 75. [38]: 39.
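Once the contact wrenches are given, the ε-metric can be sketched in a few lines. This is a simplified version assuming precomputed wrenches and the unit-length facet normals that SciPy's convex hull provides; a full implementation additionally needs a friction-cone discretization and a choice of torque scaling:

```python
import numpy as np
from scipy.spatial import ConvexHull

def epsilon_quality(wrenches):
    """Simplified sketch of the Ferrari-Canny epsilon metric [17].

    wrenches : (n, 6) array of contact wrenches (force, torque) spanning
               the grasp wrench space (GWS) as their convex hull.
    Returns the radius of the largest origin-centered ball contained in
    the GWS, or 0.0 if the grasp is not force-closure.
    """
    hull = ConvexHull(wrenches)
    # Each facet satisfies normal . x + offset <= 0 for interior points;
    # with unit normals, -offset is the facet's distance from the origin.
    distances = -hull.equations[:, -1]
    if np.any(distances < 0):  # origin lies outside the GWS
        return 0.0
    return float(distances.min())
```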
Developing and evaluating approaches in simulation is
attractive because the environment and its attributes can be
completely controlled. A large number of experiments can
be efficiently performed without having access to expensive
robotics hardware that would also add a lot of complexity to
the evaluation process. However, it is not clear if the simulated
environment resembles the real world well enough to transfer
methods easily. Only recently, several articles [39, 40, 24]
have analyzed this question and come to the conclusion that
the classic metrics are not good predictors for grasp success
in the real world. They do not seem to cope well with the
challenges arising in unstructured environments. Diankov [24]
claims that in practice grasps synthesized using this metric
tend to be relatively fragile. Balasubramanian et al. [39]
systematically tested a number of grasps in the real world that
were stable according to classical grasp metrics. Compared
to grasps planned by humans and transferred to a robot by
kinesthetic teaching on the same objects, they under-performed
significantly. A similar study has been conducted by Weisz and
Allen [40]. It focuses on the ability of the ε-metric to predict
grasp stability under object pose error. The authors found that
it performs poorly especially when grasping large objects.
As pointed out by Bicchi and Kumar [4] and Prattichizzo
and Trinkle [8], grasp closure is often wrongly equated with
stability. Closure states the existence of equilibrium which is
a necessary but not sufficient condition. Stability can only be
defined when considering the grasp as a dynamical system
and in the context of its behavior when perturbed from
an equilibrium. Seen in this light, the results of the above
mentioned studies are not surprising. However, they suggest
that there is a large gap between reality and the models for
grasping that are currently available and tractable.
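The distinction can be made precise in textbook dynamical-systems terms (our formulation, not taken from the cited works): closure asserts the existence of an equilibrium \(x^\ast\) of the grasp dynamics, whereas Lyapunov stability additionally requires that trajectories starting near \(x^\ast\) remain near it:

\[
\dot{x} = f(x), \qquad f(x^\ast) = 0,
\]
\[
\forall \varepsilon > 0\ \exists \delta > 0:\quad \|x(0) - x^\ast\| < \delta \ \Rightarrow\ \|x(t) - x^\ast\| < \varepsilon \quad \forall t \ge 0.
\]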
For this reason, several researchers [25, 26, 27] proposed
to let the robot learn how to grasp from experience that is
gathered during grasp execution. Although collecting examples
is extremely time-consuming, the problem of transferring
the learned model to the real robot is nonexistent. A crucial
question is how the object to be grasped is represented and
how the experience is generalized to novel objects.
Saxena et al. [28] pushed machine learning approaches for
data-driven grasp synthesis even further. A simple logistic
regressor was trained on large amounts of synthetic labeled
training data to predict good grasping points in a monocular
image. The authors demonstrated their method in a household
scenario in which a robot emptied a dishwasher. None of
the classical principles based on analytic formulations were
used. This paper spawned a lot of research [29, 30, 31, 32]
in which essentially one question is addressed: What are the
object features that are sufficiently discriminative to infer a
suitable grasp configuration?
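A toy version of such a learning setup is shown below, with synthetic stand-in features and labels rather than the hand-crafted image filter responses and synthetic training images used in [28]:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the setup in Saxena et al. [28]: a logistic regressor
# trained on labeled patch features to score grasping points. Feature
# dimension and data here are placeholders, not the original design.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))       # one feature vector per image patch
y = rng.integers(0, 2, size=1000)      # 1 = patch contains a good grasping point
clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = clf.predict_proba(X[:5])[:, 1]  # rank patches by grasp probability
```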
From 2009, there were further developments in the area
of 3D sensing. Projected Texture Stereo was proposed by
Konolige [41]. This technology is built into the sensor head of
the PR2 [42], a robot that is available to comparatively many

robotics research labs and running on the open-source middleware
ROS [43]. In 2010, Microsoft released the Kinect [44], a
highly accurate depth-sensing device based on the technology
developed by PrimeSense [45]. Due to its low price and
simple usage, it became a ubiquitous device within the robotics
community. Although the importance of 3D data for grasping
had been recognized before, many new approaches were
proposed that operate on real-world 3D data. They are either
heuristics that map structures in this data directly to grasp
configurations [33, 34], or they try to detect and recognize
objects and estimate their pose [35, 46].

Figure 1: We identified a number of aspects that influence how the final set of grasp
hypotheses is generated for an object. The most important one is the assumed prior
object knowledge, as discussed in Section I-D. Numerous different object-grasp
representations have been proposed in the literature, relying on features of different
modalities such as 2D or 3D vision or tactile sensors. Either local object parts or the
object as a whole are linked to specific grasp configurations. Grasp synthesis can
either be analytic or data-driven; the latter is further detailed in Fig. 2. Very few
approaches explicitly address the task or hand kinematics of the robot. [Mind-map
branches: Prior Object Knowledge (Known, Familiar, Unknown); Grasp Synthesis
(Analytic, Data-Driven); Object Features (2D, 3D, Multi-Modal); Task; Hand (Gripper,
Multi-Fingered); Object-Grasp Representation (Global, Local).]
Furthermore, we have recently seen an increasing number of
robots fulfilling very specific tasks such as towel folding [37]
or preparing pancakes [38]. In these scenarios, grasping is
embedded into a sequence of different manipulation actions.
C. Analytic vs. Data-Driven Approaches
Contrary to analytic approaches, methods following the
data-driven paradigm place more weight on the object
representation and the perceptual processing, e.g., feature
extraction, similarity metrics, object recognition or
classification and pose estimation. The resulting data is then
used to retrieve grasps from some knowledge base, or to sample
and rank them by comparison to existing grasp experience. The
parameterization of the grasp is less specific (e.g. an approach
vector instead of fingertip positions) and therefore accommodates
uncertainties in perception and execution. This provides a natural
precursor to reactive grasping [47, 48, 49, 33, 50], which,
given a grasp hypothesis, considers the problem of robustly
acquiring it under uncertainty. Data-driven methods cannot
provide guarantees regarding the aforementioned criteria of
dexterity, equilibrium, stability and dynamic behaviour [2].
These criteria can only be verified empirically. However, data-driven methods form
the basis for studying grasp dynamics and further developing
analytic models that better resemble reality.
D. Classification of Data-Driven Approaches
Sahbani et al. [1] divide the data-driven methods based on
whether they employ object features or observation of humans
during grasping. We believe that this falls short of capturing
the diversity of these approaches especially in terms of the
ability to transfer grasp experience between similar objects
and the role of perception in this process. In this survey, we
propose to group data-driven grasp synthesis approaches based
on what they assume to know a priori about the query object:
Known Objects: These approaches assume that the query
object has been encountered before and that grasps have
already been generated for it. Commonly, the robot has
access to a database containing geometric object models
that are associated with a number of good grasps. This
database is usually built offline and in the following will
be referred to as an experience database. Once the object
has been recognized, the goal is to estimate its pose and
retrieve a suitable grasp.

Familiar Objects: Instead of exact identity, the
approaches in this group assume that the query object is
similar to previously encountered ones. New objects can
be familiar on different levels. Low-level similarity can
be defined in terms of shape, color or texture. High-level
similarity can be defined based on object category. These
approaches assume that new objects similar to old ones
can be grasped in a similar way. The challenge is to
find an object representation and a similarity metric that
allow grasp experience to be transferred; a minimal sketch
of such transfer follows this list.
Unknown Objects: Approaches in this group do not
assume to have access to object models or any sort of
grasp experience. They focus on identifying structure or
features in sensory data for generating and ranking grasp
candidates. These are usually based on local or global
features of the object as perceived by the sensor.
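The similarity-based transfer mentioned for familiar objects can be sketched as a nearest-neighbor lookup, assuming fixed-length object descriptors and a stored (descriptor, grasp) memory; the function name and the nearest-neighbor choice are ours, not from any specific cited method:

```python
import numpy as np

def transfer_grasp(query_descriptor, experience):
    """Return the grasp of the most similar previously encountered object.

    experience : list of (descriptor, grasp) pairs, where descriptors are
                 fixed-length feature vectors (e.g. shape or appearance).
    """
    descriptors = np.stack([d for d, _ in experience])
    dists = np.linalg.norm(descriptors - query_descriptor, axis=1)
    return experience[int(np.argmin(dists))][1]
```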
We find the above classification suitable for surveying
the data-driven approaches since the assumed prior object
knowledge determines the necessary perceptual processing and
associated object representations for generating and ranking
grasp candidates. For known objects, the problems of
recognition and pose estimation have to be addressed. The object is
usually represented by a complete geometric 3D object model.
For familiar objects, an object representation has to be found
that is suitable for comparing new objects to already encountered
ones in terms of graspability. For unknown objects, heuristics
have to be developed for directly linking structure in the
sensory data to candidate grasps.
Only a minority of the approaches discussed in this survey
cannot be clearly assigned to one of these three
groups. Most of the included papers use sensor data from the
scene to perform data-driven grasp synthesis and are part of a
real robotic system that can execute grasps.
Finally, this classification is well in line with research in
the field of neuroscience, specifically with the theory of the
dorsal and ventral stream in human visual processing [51]. The
dorsal pathway processes immediate action-relevant features
while the ventral pathway extracts context- and scene-relevant
information and is related to object recognition. The visual
processing in the ventral and dorsal pathways can be related
to the grouping of grasp synthesis for familiar/known and
unknown objects, respectively. The details of such links are
out of the scope of this paper. Extensive and detailed reviews
on the neuroscience of grasping are offered in [52, 53, 54].
E. Aspects Influencing the Generation of Grasp Hypotheses
The number of candidate grasps that can be applied to an
object is infinite. Sampling some of these candidates and
defining a quality metric for selecting a good subset of grasp
hypotheses is the core subject of the approaches reviewed
in this survey. In addition to the prior object knowledge,
we identified a number of other factors that characterize
these metrics and thereby influence which grasp hypotheses
are selected by a method. Fig. 1 shows a mind map that
structures these aspects. An important one is how the quality
of a candidate grasp depends on the object, i.e., the object-
grasp representation. Some approaches extract local object
attributes (e.g. curvature, contact area with the hand) around a
candidate grasp. Other approaches take global characteristics
(e.g. center of mass, bounding box) and their relation to a
grasp configuration into account. Depending on the sensor
device, object features can be based on 2D or 3D visual data
as well as on other modalities. Furthermore, grasp synthesis
can be analytic or data-driven. We further categorize the latter
in Fig. 2: there are methods for learning either from human
demonstration, labeled training data, or trial and error. Other
methods rely on various heuristics to directly link structure
in sensory data to candidate grasps. There is relatively little
work on task-dependent grasping. Also, the applied robotic
hand is usually not in the focus of the discussed approaches.
We will therefore not examine these two aspects. However, we
will indicate whether an approach takes the task into account
and whether an approach is developed for a gripper or for the
more complex case of a multi-fingered hand. Tables I-III list
all the methods in this survey. The table columns follow the
structure proposed in Figs. 1 and 2.

Figure 2: Data-driven grasp synthesis can either be based on heuristics or on learning
from data. The data can be provided in the form of offline-generated labeled training
data, human demonstration, or trial and error.

Table I: Data-Driven Approaches for Grasping Known Objects. [Columns: object-grasp
representation (local, global); object features (2D, 3D, multi-modal); grasp synthesis
(heuristic, human demonstration, labeled data, trial and error); task; multi-fingered;
deformable; real data. Rows: Glover et al. [55], Goldfeder et al. [21], Berenson et
al. [56], Miller et al. [19], Przybylski et al. [57], Roa et al. [58], Detry et al. [27],
Detry et al. [59], Huebner et al. [60], Faria et al. [61], Diankov [24], Balasubramanian
et al. [39], Borst et al. [22], Brook et al. [62], Ciocarlie and Allen [23], Romero et
al. [63], Papazov et al. [64], Morales et al. [7], Collet Romea et al. [65], Kroemer et
al. [66], Ekvall and Kragic [6], Tegin et al. [67], Pastor et al. [49], Stulp et al. [68].]
II. GRASPING KNOWN OBJECTS
If the object to be grasped is known and there is already a
database of grasp hypotheses for it, the problem of finding a
feasible grasp reduces to estimating the object pose and then
filtering the hypotheses by reachability. Table I summarizes all
the approaches discussed in this section.
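The pipeline just described can be summarized in a short sketch. Poses are represented as homogeneous transforms, and the reachability predicate stands in for the robot's inverse-kinematics check; all names here are hypothetical, not from a specific cited system:

```python
import numpy as np

def grasps_for_known_object(object_pose, stored_grasps, is_reachable):
    """Rank the stored grasps of a recognized object in the current scene.

    object_pose   : (4, 4) object-to-world transform from pose estimation.
    stored_grasps : list of (grasp_pose, quality), with grasp_pose a (4, 4)
                    transform in the object frame (the experience database
                    of Section I-D).
    is_reachable  : predicate over world-frame grasp poses, e.g. an
                    inverse-kinematics feasibility check.
    """
    ranked = []
    for grasp_pose, quality in stored_grasps:
        world_grasp = object_pose @ grasp_pose  # express grasp in world frame
        if is_reachable(world_grasp):
            ranked.append((quality, world_grasp))
    ranked.sort(key=lambda t: t[0], reverse=True)  # best-ranked grasp first
    return [g for _, g in ranked]
```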
