The Pascal Visual Object Classes Challenge: A Retrospective
Mark Everingham, S. M. Ali Eslami, Luc Van Gool,
Christopher K. I. Williams, John Winn, Andrew Zisserman
Abstract The Pascal Visual Object Classes (VOC)
challenge consists of two components: (i) a publicly
available dataset of images together with ground truth
annotation and standardised evaluation software; and
(ii) an annual competition and workshop. There are five
challenges: classification, detection, segmentation, ac-
tion classification, and person layout. In this paper we
provide a review of the challenge from 2008–2012.
The paper is intended for two audiences: algorithm
designers, researchers who want to see what the state
of the art is, as measured by performance on the VOC
datasets, along with the limitations and weak points
of the current generation of algorithms; and, challenge
designers, who want to see what we as organisers have
learnt from the process and our recommendations for
the organisation of future challenges.
Mark Everingham, who died in 2012, was the key member of
the VOC project. His contribution was crucial and substan-
tial. For these reasons he is included as the posthumous first
author of this paper. An appreciation of his life and work can
be found in Zisserman et al (2012).
Mark Everingham
University of Leeds, UK
S. M. Ali Eslami (corresponding author)
Microsoft Research, Cambridge, UK
(The majority of this work was performed
whilst at the University of Edinburgh)
E-mail: alie@microsoft.com
Luc Van Gool
KU Leuven, Belgium and ETH, Switzerland
Christopher K. I. Williams
University of Edinburgh, UK
John Winn
Microsoft Research, Cambridge, UK
Andrew Zisserman
University of Oxford, UK
To analyse the performance of submitted algorithms
on the VOC datasets we introduce a number of novel
evaluation methods: a bootstrapping method for deter-
mining whether differences in the performance of two
algorithms are significant or not; a normalised average
precision so that performance can be compared across
classes with different proportions of positive instances;
a clustering method for visualising the performance
across multiple algorithms so that the hard and easy
images can be identified; and the use of a joint classi-
fier over the submitted algorithms in order to measure
their complementarity and combined performance. We
also analyse the community’s progress through time us-
ing the methods of Hoiem et al (2012) to identify the
types of occurring errors.
We conclude the paper with an appraisal of the as-
pects of the challenge that worked well, and those that
could be improved in future challenges.
1 Introduction
The Pascal¹ Visual Object Classes (VOC) Challenge
has been an annual event since 2006. The challenge con-
sists of two components: (i) a publicly available dataset
of images obtained from the Flickr web site (2013), to-
gether with ground truth annotation and standardised
evaluation software; and (ii) an annual competition and
workshop. There are three principal challenges: classifi-
cation “does the image contain any instances of a par-
ticular object class?” (where object classes include cars,
people, dogs, etc.), detection “where are the instances
¹ Pascal stands for pattern analysis, statistical modelling and computational learning. It was an EU Network of Excellence funded project under the IST Programme of the European Union.

of a particular object class in the image (if any)?”,
and segmentation “to which class does each pixel be-
long?”. In addition, there are two subsidiary challenges
(‘tasters’): action classification “what action is be-
ing performed by an indicated person in this image?”
(where actions include jumping, phoning, riding a bike,
etc.) and person layout “where are the head, hands
and feet of people in this image?”. The challenges were
issued with deadlines each year, and a workshop held to
compare and discuss that year’s results and methods.
The challenges up to and including the year 2007
were described in our paper Everingham et al (2010).
The purp ose of this paper is not just to continue the
story from 2008 until the final run of the challenge in
2012, although we will cover that to some extent. In-
stead we aim to inform two audiences: first, algorithm
designers, those researchers who want to see what the
state of the art is, as measured by performance on the
VOC datasets, and the limitations and weak points of
the current generation of algorithms; second, challenge
designers, who want to see what we as organisers have
learnt from the process and our recommendations for
the organisation of future challenges.
1.1 Paper layout
This paper is organised as follows: we start with a re-
view of the challenges in Section 2, describing in brief
the competitions, datasets, annotation procedure, and
evaluation criteria of the 2012 challenge, and what was
changed over the 2008–2012 lifespan of the challenges.
The parts on annotation procedures and changes to the
challenges are intended for challenge organisers.
Section 3 provides an overview of the results for the
2012 challenge and, thereby, a snapshot of the state
of the art. We then use these 2012 results for several
additional and novel analyses, going further than those
given at the challenge workshops and in our previous
publication on the challenge (Everingham et al, 2010).
At the end of Section 3 we consider the question of how
the performance of algorithms can be fairly compared
when all that is available is their prediction on the test
set, and propose a method for doing this. This is aimed
at challenge organisers.
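As a rough illustration of the flavour of such a comparison, the following Python sketch implements a generic paired bootstrap over test images. It is not necessarily the exact procedure proposed later in the paper; the function and parameter names are purely illustrative, and NumPy is assumed.

```python
# Generic paired-bootstrap sketch for asking whether the difference between
# two methods' scores on the same test set is significant (illustrative only;
# the paper's own procedure is defined in its evaluation sections).
import numpy as np

def bootstrap_difference(metric, preds_a, preds_b, labels, n_rounds=1000, seed=0):
    """metric(preds, labels) -> scalar, e.g. average precision.
    preds_a, preds_b, labels are NumPy arrays over the same test images.
    Returns the bootstrap distribution of metric(A) - metric(B)."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_rounds):
        idx = rng.integers(0, n, size=n)              # resample test images with replacement
        diffs.append(metric(preds_a[idx], labels[idx]) -
                     metric(preds_b[idx], labels[idx]))
    return np.asarray(diffs)

# If, say, 95% of the bootstrap differences lie on one side of zero, the gap
# between the two methods is unlikely to be an artefact of the particular
# sample of test images.
```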
Section 4 takes stock and tries to answer broader
questions about where our field is at in terms of the clas-
sification and detection problems that can or cannot be
solved. First, inspired by Hoiem et al (2012), we propose
evaluation measures that normalise against the propor-
tion of positive instances in a class (a problem when
comparing average precision across classes). It is shown
that some classes like ‘person’ still pose larger prob-
lems to modern methods than may have been believed.
Second, we describe a clustering method for visualising
the performance across multiple algorithms submitted
during the lifespan of the challenges, so that the char-
acteristics of hard and easy images can be identified.
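To make the first of these points concrete, consider a brief worked illustration (ours, not the paper's normalised measure itself): a method that ranks the test images in random order attains, in expectation, a precision roughly equal to the fraction of positive images at every recall level, so its chance-level average precision is approximately
\[ \mathbb{E}[\mathrm{AP}_{\mathrm{chance}}] \approx \frac{N_{\mathrm{pos}}}{N_{\mathrm{pos}} + N_{\mathrm{neg}}}, \]
which already differs substantially between a frequent class such as 'person' and a rare one. Raw average precision therefore cannot be compared directly across classes with different proportions of positives.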
Section 5 investigates the level of complementarity
of the different methods. It focusses on classification,
for which a ‘super-method’ is designed by combining
the 2012 submitted methods. It turns out that a considerable gain over any single existing method can be obtained with such a combination, without any one of those methods playing a dominant role in the super-method.
Even the combination of only pairs of classifiers can
bring a substantial improvement and we make sugges-
tions for such pairs that would be especially promising.
We also comment on the construction of super-methods
for detection and segmentation.
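To give a flavour of how such a super-method can be constructed, the sketch below trains a linear classifier per class whose input features are the real-valued confidences supplied by each submitted method. This is a minimal illustration assuming scikit-learn is available, not the exact model or training protocol used in Section 5.

```python
# Minimal sketch of a stacked "super-method" for one VOC class: a linear
# classifier whose features are the confidences output by each submitted
# method (illustrative only; not the exact model used in Section 5).
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_super_classifier(method_scores, labels):
    """method_scores: (n_images, n_methods) array of per-method confidences
    for one class; labels: (n_images,) binary presence/absence labels."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(method_scores, labels)
    return clf

# clf.coef_ indicates how much weight each submitted method receives, and
# clf.decision_function(new_scores) gives the combined confidence used to
# rank test images.
```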
In Section 6 we turn to progress through time. From
the evaluation server, we have available to us the results
of all algorithms for the challenges from 2009 to 2012,
and we analyse these using the methods of Hoiem et al
(2012) to identify the types of errors occurring across
time. Although important progress has been made, it
has often not been as monotonic as one might expect.
This underlines the fact that novel, promising ideas may
require some consolidation time and benchmark scores
must not be used to discard such novelties. Also, the
diversity among the scores has increased as time has
progressed.
Section 7 summarises our conclusions, covering both what we believe worked well and a number of caveats.
This section also makes suggestions that we hope will
be useful for future challenge organisers.
2 Challenge Review
This section reviews the challenges, datasets, annota-
tion and evaluation procedures over the 2009–2012 cy-
cles of the challenge. It gives a bare bones summary of
the challenges and then concentrates on changes since
the 2008 release. Our companion paper (Everingham
et al, 2010) describes in detail the motivation, annota-
tions, and evaluation measures of the VOC challenges,
and these details are not repeated here. Sec. 2.3 on the
annotation procedure is intended principally for chal-
lenge organisers.
2.1 Challenge tasks
This section gives a short overview of the three princi-
pal challenge tasks on classification, detection, and seg-
mentation, and of the two subsidiary tasks (‘tasters’)

Vehicles: Aeroplane, Bicycle, Boat, Bus, Car, Motorbike, Train
Household: Bottle, Chair, Dining table, Potted plant, Sofa, TV/monitor
Animals: Bird, Cat, Cow, Dog, Horse, Sheep
Other: Person
Table 1: The VOC classes. The classes can be considered
in a notional taxonomy.
on action classification and person layout. The evalua-
tion of each of these challenges is described in detail in
Sec. 2.4.
2.1.1 Classification
For each of twenty object classes predict the pres-
ence/absence of at least one object of that class in a
test image. The twenty object classes are listed in Ta-
ble 1. Participants are required to provide a real-valued
confidence of the object’s presence for each test image
so that a precision-recall curve can be drawn. Partici-
pants may choose to tackle all, or any subset of object
classes, for example ‘cars only’ or ‘motorbikes and cars’.
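For reference, the sketch below shows how a precision-recall curve and an average-precision summary can be computed from such per-image confidences. It is a generic illustration assuming NumPy; the exact interpolation used by the official evaluation is described in Sec. 2.4 and the companion paper.

```python
# Generic sketch of computing a precision-recall curve and an average-
# precision summary from per-image confidences and ground-truth
# presence/absence labels (not the official devkit code).
import numpy as np

def precision_recall(confidences, labels):
    order = np.argsort(-np.asarray(confidences))      # rank by decreasing confidence
    labels = np.asarray(labels)[order].astype(float)
    tp = np.cumsum(labels)                            # true positives at each rank
    fp = np.cumsum(1.0 - labels)                      # false positives at each rank
    recall = tp / labels.sum()
    precision = tp / (tp + fp)
    return precision, recall

def average_precision(precision, recall):
    # Step-wise area under the precision-recall curve.
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))
```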
Two competitions are defined according to the
choice of training data: (i) taken from the VOC train-
ing/validation data provided, or (ii) from any source ex-
cluding the VOC test data. In the first competition, any
annotation provided in the VOC training/validation
data may be used for training, for example bounding
boxes or particular views e.g. ‘frontal’ or ‘left’. Partici-
pants are not permitted to perform additional manual
annotation of either training or test data. In the second
competition, any source of training data may be used
except the provided test images.
2.1.2 Detection
For each of the twenty classes, predict the bounding
boxes of each object of that class in a test image (if
any), with associated real-valued confidence. Partici-
pants may choose to tackle all, or any subset of ob-
ject classes. Two competitions are defined in a similar
manner to the classification challenge.
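For orientation, detection evaluation of this kind scores a predicted bounding box against a ground-truth box by the area of their intersection divided by the area of their union. The sketch below shows this overlap test; it is our illustration rather than the official development-kit code, and the 0.5 threshold is stated explicitly as the conventional choice rather than taken from this section.

```python
# Sketch of the intersection-over-union overlap test commonly used to decide
# whether a predicted bounding box matches a ground-truth box (not the
# official VOC devkit code; 0.5 is the conventional threshold).
def iou(box_a, box_b):
    """Boxes are (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_correct_detection(pred_box, gt_box, threshold=0.5):
    return iou(pred_box, gt_box) >= threshold
```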
It is clear that the additional requirement to locate
the instances in an image makes detection a more de-
manding task than classification; simply guessing the right answer is far less likely to succeed. It is also true that
detection can support more applications than mere clas-
sification, e.g. obstacle avoidance, tracking, etc. Dur-
ing the course of the Pascal VOC challenge it had
even been suggested that only detection matters and
classification is hardly relevant. However, this view is
rather extreme. Even in cases where detection is the
end goal, classification may be an appropriate initial
step to guide resources towards images that hold good
promise of containing the target class. This is similar
to how an ‘objectness’ analysis (e.g. Alexe et al, 2010)
can guide a detector’s attention to specific locations
within an image. Classification could also be used to
put regression methods for counting into action, which
have been shown to perform well without any detection
(Lempitsky and Zisserman, 2010).
2.1.3 Segmentation
For each test image, predict the object class of each
pixel, or give it ‘background’ status if the object does
not belong to one of the twenty specified classes. There
are no confidence values associated with this prediction.
Two competitions are defined in a similar manner to the
classification and detection challenges.
Segmentation is clearly more challenging than de-
tection and its solution tends to be more time consum-
ing. Detection can therefore be the task of choice in
cases where such fine-grained image analysis is not re-
quired by the application. However, several applications
do need a more detailed knowledge about object outline
or shape, such as robot grasping or image retargeting.
Even if segmentation is the goal, detection can provide
a good initialization (e.g. Leibe et al, 2004).
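As an aside on how segmentation predictions of this kind are typically scored, the sketch below computes a per-class intersection-over-union accuracy over all pixels. This is our illustration of the commonly used measure rather than the official scoring code, and it omits any handling of unlabelled or boundary pixels.

```python
# Sketch of a per-class intersection-over-union segmentation accuracy
# (illustrative only; not the official scoring code, and handling of
# unlabelled/boundary pixels is omitted).
import numpy as np

def per_class_iou(pred, gt, num_classes=21):   # 20 object classes + background
    """pred, gt: integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else float('nan'))
    return ious
```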
2.1.4 Action classification
This taster was introduced in 2010. The motivation was
that the world is dynamic and snapshots of it still con-
vey substantial information about these dynamics. Sev-
eral of the actions were chosen to involve object classes
that were also part of the classification and detection
challenges (like a person riding a horse, or a person
riding a bike). The actions themselves were all geared
towards people.
In 2010 the challenge was: for each of ten action
classes predict if a specified person (indicated by a
bounding box) in a test image is performing the corre-
sponding action. The output is a real-valued confidence
that the action is being performed so that a precision-
recall curve can be drawn. The action classes are ‘jump-
ing’, ‘phoning’, ‘playing instrument’, ‘reading’, ‘riding
bike’, ‘riding horse’, ‘running’, ‘taking photo’, ‘using
computer’, ‘walking’, and participants may choose to
tackle all, or any subset of action classes, for example

‘walking only’ or ‘walking and running’. Note, the ac-
tion classes are not exclusive, for example a person can
be both ‘riding a bicycle’ and ‘phoning’. In 2011 an
‘other’ class was introduced (for actions different from
the ten already specified). This increased the difficulty
of the challenge. The output is still a real-valued con-
fidence for each of the ten actions. As with other parts
of the challenge, the training could be either based on
the official Pascal VOC training data, or on external
data.
It was necessary for us to specify the person of in-
terest in the image as there may be several people per-
forming different actions. In 2012 the person of interest
was specified by both a bounding box and a point on
the torso, and a separate competition was defined for each.
The motivation for this additional point annotation was
that the aspect ratio of the bounding box might pro-
vide some information on the action being performed,
and this was almost entirely removed if only a point
was provided. For example, the aspect ratio of the box
could help distinguish walking and running from other
action classes (this was a criticism raised during the
2011 Pascal VOC workshop).
2.1.5 Person layout
For each person in a test image (their bounding box
is provided) predict the presence or absence of parts
(head, hands and feet), and the bounding boxes of those
parts. The prediction of a person layout should be out-
put with an associated real-valued confidence of the
layout so that a precision-recall curve can be gener-
ated for each person. The success of the layout predic-
tion depends both on: (i) a correct prediction of parts
present/absent (e.g. are the hands visible or occluded);
(ii) a correct prediction of bounding boxes for the vis-
ible parts. Two competitions are defined in a similar
manner to the classification challenge.
2.2 Datasets
For the purposes of the challenge, the data is di-
vided into two main subsets: training/validation data
(trainval), and test data (test). For participants’
convenience, the trainval data is further divided into
suggested training (train) and validation (val) sets,
however participants are free to use any data in the
trainval set for training and/or validation.
There is complete annotation for the twenty classes:
i.e. all images are annotated with bounding boxes for
every instance of the twenty classes for the classifica-
tion and detection challenges. In addition to a bound-
ing box for each object, attributes such as 'orientation', 'occluded', 'truncated' and 'difficult' are specified. The full
list of attributes and their definitions is given in Ever-
ingham et al (2010). Fig. 1 shows samples from each of
the challenges including annotations. Note, the annota-
tions on the test set are not publicly released.
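For concreteness, the sketch below reads the per-object annotation fields mentioned above from one of the publicly distributed XML files. The element names follow the distributed development kit, but this is an illustrative reader rather than official code; the devkit documentation gives the authoritative field definitions.

```python
# Minimal sketch of reading one VOC annotation file (illustrative only;
# element names follow the publicly distributed devkit XML layout).
import xml.etree.ElementTree as ET

def read_annotation(xml_path):
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.findall('object'):
        box = obj.find('bndbox')
        objects.append({
            'class': obj.findtext('name'),
            'truncated': obj.findtext('truncated') == '1',
            'difficult': obj.findtext('difficult') == '1',
            'bbox': tuple(int(float(box.findtext(k)))
                          for k in ('xmin', 'ymin', 'xmax', 'ymax')),
        })
    return objects
```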
Statistics for the number of object instances and
images in the training and validation datasets for the
classification, detection, segmentation and layout chal-
lenges are given in Table 3, and for the action classifica-
tion challenge in Table 4. Note, we do not release the
exact numbers of object instances in the test set, but
both the number of instances per class and number of
images are approximately balanced with those in the
trainval set.
The number of images and instances in all the tasks
was increased up to 2011. From 2011 to 2012 the num-
ber of images in the classification, detection and person
layout tasks was not increased, and only those for seg-
mentation and action classification were augmented.
From 2009 onwards the data for all tasks consists of
the previous years’ images augmented with new images.
Before this, in 2008 and earlier, an entirely new dataset
was released each year for the classification/detection
tasks. Augmenting allows the number of images to grow
each year and, more importantly, means that test results can be compared with those from previous years. Thus, for example, the performance of all methods from 2009–2012 can be evaluated on the 2009 test set (al-
though the methods may have used a different number
of training images).
2.3 Annotation procedure
The procedure of collecting the data and annotating it
with ground truth is described in our companion pa-
per (Everingham et al, 2010). However, the annotation
process has evolved since that time and we outline here
the main changes in the collection and annotation pro-
cedure. Note, for challenge organisers, one essential fac-
tor in obtaining consistent annotations is to have guide-
lines available in advance of the annotation process. The
ones used for VOC are available at the Pascal VOC
annotation guidelines web page (2012).
2.3.1 Use of Mechanical Turk for initial class labelling
of the images
We aimed to collect a balanced set of images with a cer-
tain minimum number of instances of each class. This
required finding sufficient images of the rarer classes,
such as ‘bus’ and ‘dining table’. In previous years this
had been achieved by getting the annotators to focus on
such classes towards the end of the annotation period,
