The Pascal Visual Object Classes Challenge: A Retrospective
Mark Everingham, S. M. Ali Eslami, Luc Van Gool,
Christopher K. I. Williams, John Winn, Andrew Zisserman
Abstract The Pascal Visual Object Classes (VOC)
challenge consists of two components: (i) a publicly
available dataset of images together with ground truth
annotation and standardised evaluation software; and
(ii) an annual competition and workshop. There are five
challenges: classification, detection, segmentation, ac-
tion classification, and person layout. In this paper we
provide a review of the challenge from 2008–2012.
The paper is intended for two audiences: algorithm
designers, researchers who want to see what the state
of the art is, as measured by performance on the VOC
datasets, along with the limitations and weak points
of the current generation of algorithms; and, challenge
designers, who want to see what we as organisers have
learnt from the process and our recommendations for
the organisation of future challenges.
Mark Everingham, who died in 2012, was the key member of
the VOC project. His contribution was crucial and substan-
tial. For these reasons he is included as the posthumous first
author of this paper. An appreciation of his life and work can
be found in Zisserman et al (2012).
Mark Everingham
University of Leeds, UK
S. M. Ali Eslami (corresponding author)
Microsoft Research, Cambridge, UK
(The majority of this work was performed
whilst at the University of Edinburgh)
E-mail: alie@microsoft.com
Luc Van Gool
KU Leuven, Belgium and ETH, Switzerland
Christopher K. I. Williams
University of Edinburgh, UK
John Winn
Microsoft Research, Cambridge, UK
Andrew Zisserman
University of Oxford, UK
To analyse the performance of submitted algorithms
on the VOC datasets we introduce a number of novel
evaluation methods: a bootstrapping method for deter-
mining whether differences in the performance of two
algorithms are significant or not; a normalised average
precision so that performance can be compared across
classes with different proportions of positive instances;
a clustering method for visualising the performance
across multiple algorithms so that the hard and easy
images can be identified; and the use of a joint classi-
fier over the submitted algorithms in order to measure
their complementarity and combined performance. We
also analyse the community’s progress through time us-
ing the methods of Hoiem et al (2012) to identify the
types of occurring errors.
We conclude the paper with an appraisal of the as-
pects of the challenge that worked well, and those that
could be improved in future challenges.
1 Introduction
The Pascal¹ Visual Object Classes (VOC) Challenge
has been an annual event since 2006. The challenge con-
sists of two components: (i) a publicly available dataset
of images obtained from the Flickr web site (2013), to-
gether with ground truth annotation and standardised
evaluation software; and (ii) an annual competition and
workshop. There are three principal challenges: classifi-
cation “does the image contain any instances of a par-
ticular object class?” (where object classes include cars,
people, dogs, etc.), detection “where are the instances
¹ Pascal stands for pattern analysis, statistical modelling and computational learning. It was an EU Network of Excellence funded project under the IST Programme of the European Union.

of a particular object class in the image (if any)?”,
and segmentation “to which class does each pixel be-
long?”. In addition, there are two subsidiary challenges
(‘tasters’): action classification “what action is be-
ing performed by an indicated person in this image?”
(where actions include jumping, phoning, riding a bike,
etc.) and person layout “where are the head, hands
and feet of people in this image?”. The challenges were
issued with deadlines each year, and a workshop held to
compare and discuss that year’s results and methods.
The challenges up to and including the year 2007
were described in our paper Everingham et al (2010).
The purp ose of this paper is not just to continue the
story from 2008 until the final run of the challenge in
2012, although we will cover that to some extent. In-
stead we aim to inform two audiences: first, algorithm
designers, those researchers who want to see what the
state of the art is, as measured by performance on the
VOC datasets, and the limitations and weak points of
the current generation of algorithms; second, challenge
designers, who want to see what we as organisers have
learnt from the process and our recommendations for
the organisation of future challenges.
1.1 Paper layout
This paper is organised as follows: we start with a re-
view of the challenges in Section 2, describing in brief
the competitions, datasets, annotation procedure, and
evaluation criteria of the 2012 challenge, and what was
changed over the 2008–2012 lifespan of the challenges.
The parts on annotation procedures and changes to the
challenges are intended for challenge organisers.
Section 3 provides an overview of the results for the
2012 challenge and, thereby, a snapshot of the state
of the art. We then use these 2012 results for several
additional and novel analyses, going further than those
given at the challenge workshops and in our previous
publication on the challenge (Everingham et al, 2010).
At the end of Section 3 we consider the question of how
the performance of algorithms can be fairly compared
when all that is available is their prediction on the test
set, and propose a method for doing this. This is aimed
at challenge organisers.
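As a rough illustration of the flavour of such a comparison, the following Python sketch implements a generic paired bootstrap over test images. It is not necessarily the exact procedure proposed later in the paper; the function and parameter names are purely illustrative, and NumPy is assumed.

```python
# Generic paired-bootstrap sketch for asking whether the difference between
# two methods' scores on the same test set is significant (illustrative only;
# the paper's own procedure is defined in its evaluation sections).
import numpy as np

def bootstrap_difference(metric, preds_a, preds_b, labels, n_rounds=1000, seed=0):
    """metric(preds, labels) -> scalar, e.g. average precision.
    preds_a, preds_b, labels are NumPy arrays over the same test images.
    Returns the bootstrap distribution of metric(A) - metric(B)."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_rounds):
        idx = rng.integers(0, n, size=n)              # resample test images with replacement
        diffs.append(metric(preds_a[idx], labels[idx]) -
                     metric(preds_b[idx], labels[idx]))
    return np.asarray(diffs)

# If, say, 95% of the bootstrap differences lie on one side of zero, the gap
# between the two methods is unlikely to be an artefact of the particular
# sample of test images.
```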
Section 4 takes stock and tries to answer broader
questions about where our field is at in terms of the clas-
sification and detection problems that can or cannot be
solved. First, inspired by Hoiem et al (2012), we propose
evaluation measures that normalise against the propor-
tion of positive instances in a class (a problem when
comparing average precision across classes). It is shown
that some classes like ‘person’ still pose larger prob-
lems to modern methods than may have been believed.
Second, we describe a clustering method for visualising
the performance across multiple algorithms submitted
during the lifespan of the challenges, so that the char-
acteristics of hard and easy images can be identified.
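To make the first of these points concrete, consider a brief worked illustration (ours, not the paper's normalised measure itself): a method that ranks the test images in random order attains, in expectation, a precision roughly equal to the fraction of positive images at every recall level, so its chance-level average precision is approximately
\[ \mathbb{E}[\mathrm{AP}_{\mathrm{chance}}] \approx \frac{N_{\mathrm{pos}}}{N_{\mathrm{pos}} + N_{\mathrm{neg}}}, \]
which already differs substantially between a frequent class such as 'person' and a rare one. Raw average precision therefore cannot be compared directly across classes with different proportions of positives.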
Section 5 investigates the level of complementarity
of the different methods. It focusses on classification,
for which a ‘super-method’ is designed by combining
the 2012 submitted methods. It turns out that a considerable gain over any single existing method can be obtained with such a combination, without any one of those methods playing a dominant role in the super-method.
Even the combination of only pairs of classifiers can
bring a substantial improvement and we make sugges-
tions for such pairs that would be especially promising.
We also comment on the construction of super-methods
for detection and segmentation.
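To give a flavour of how such a super-method can be constructed, the sketch below trains a linear classifier per class whose input features are the real-valued confidences supplied by each submitted method. This is a minimal illustration assuming scikit-learn is available, not the exact model or training protocol used in Section 5.

```python
# Minimal sketch of a stacked "super-method" for one VOC class: a linear
# classifier whose features are the confidences output by each submitted
# method (illustrative only; not the exact model used in Section 5).
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_super_classifier(method_scores, labels):
    """method_scores: (n_images, n_methods) array of per-method confidences
    for one class; labels: (n_images,) binary presence/absence labels."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(method_scores, labels)
    return clf

# clf.coef_ indicates how much weight each submitted method receives, and
# clf.decision_function(new_scores) gives the combined confidence used to
# rank test images.
```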
In Section 6 we turn to progress through time. From
the evaluation server, we have available to us the results
of all algorithms for the challenges from 2009 to 2012,
and we analyse these using the methods of Hoiem et al
(2012) to identify the types of errors occurring across
time. Although important progress has been made, it
has often not been as monotonic as one might expect.
This underlines the fact that novel, promising ideas may
require some consolidation time and benchmark scores
must not be used to discard such novelties. Also, the
diversity among the scores has increased as time has
progressed.
Section 7 summarises our conclusions, covering both what we believe worked well and a number of caveats.
This section also makes suggestions that we hope will
be useful for future challenge organisers.
2 Challenge Review
This section reviews the challenges, datasets, annota-
tion and evaluation procedures over the 2009–2012 cy-
cles of the challenge. It gives a bare bones summary of
the challenges and then concentrates on changes since
the 2008 release. Our companion paper (Everingham
et al, 2010) describes in detail the motivation, annota-
tions, and evaluation measures of the VOC challenges,
and these details are not repeated here. Sec. 2.3 on the
annotation procedure is intended principally for chal-
lenge organisers.
2.1 Challenge tasks
This section gives a short overview of the three princi-
pal challenge tasks on classification, detection, and seg-
mentation, and of the two subsidiary tasks (‘tasters’)

Vehicles: Aeroplane, Bicycle, Boat, Bus, Car, Motorbike, Train
Household: Bottle, Chair, Dining table, Potted plant, Sofa, TV/monitor
Animals: Bird, Cat, Cow, Dog, Horse, Sheep
Other: Person
Table 1: The VOC classes. The classes can be considered
in a notional taxonomy.
on action classification and person layout. The evalua-
tion of each of these challenges is described in detail in
Sec. 2.4.
2.1.1 Classification
For each of twenty object classes predict the pres-
ence/absence of at least one object of that class in a
test image. The twenty object classes are listed in Ta-
ble 1. Participants are required to provide a real-valued
confidence of the object’s presence for each test image
so that a precision-recall curve can be drawn. Partici-
pants may choose to tackle all, or any subset of object
classes, for example ‘cars only’ or ‘motorbikes and cars’.
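For reference, the sketch below shows how a precision-recall curve and an average-precision summary can be computed from such per-image confidences. It is a generic illustration assuming NumPy; the exact interpolation used by the official evaluation is described in Sec. 2.4 and the companion paper.

```python
# Generic sketch of computing a precision-recall curve and an average-
# precision summary from per-image confidences and ground-truth
# presence/absence labels (not the official devkit code).
import numpy as np

def precision_recall(confidences, labels):
    order = np.argsort(-np.asarray(confidences))      # rank by decreasing confidence
    labels = np.asarray(labels)[order].astype(float)
    tp = np.cumsum(labels)                            # true positives at each rank
    fp = np.cumsum(1.0 - labels)                      # false positives at each rank
    recall = tp / labels.sum()
    precision = tp / (tp + fp)
    return precision, recall

def average_precision(precision, recall):
    # Step-wise area under the precision-recall curve.
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))
```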
Two competitions are defined according to the
choice of training data: (i) taken from the VOC train-
ing/validation data provided, or (ii) from any source ex-
cluding the VOC test data. In the first competition, any
annotation provided in the VOC training/validation
data may be used for training, for example bounding
boxes or particular views e.g. ‘frontal’ or ‘left’. Partici-
pants are not permitted to perform additional manual
annotation of either training or test data. In the second
competition, any source of training data may be used
except the provided test images.
2.1.2 Detection
For each of the twenty classes, predict the bounding
boxes of each object of that class in a test image (if
any), with associated real-valued confidence. Partici-
pants may choose to tackle all, or any subset of ob-
ject classes. Two competitions are defined in a similar
manner to the classification challenge.
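For orientation, detection evaluation of this kind scores a predicted bounding box against a ground-truth box by the area of their intersection divided by the area of their union. The sketch below shows this overlap test; it is our illustration rather than the official development-kit code, and the 0.5 threshold is stated explicitly as the conventional choice rather than taken from this section.

```python
# Sketch of the intersection-over-union overlap test commonly used to decide
# whether a predicted bounding box matches a ground-truth box (not the
# official VOC devkit code; 0.5 is the conventional threshold).
def iou(box_a, box_b):
    """Boxes are (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_correct_detection(pred_box, gt_box, threshold=0.5):
    return iou(pred_box, gt_box) >= threshold
```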
It is clear that the additional requirement to locate
the instances in an image makes detection a more de-
manding task than classification; simply guessing the right answer is far less likely to succeed. It is also true that
detection can support more applications than mere clas-
sification, e.g. obstacle avoidance, tracking, etc. Dur-
ing the course of the Pascal VOC challenge it had
even been suggested that only detection matters and
classification is hardly relevant. However, this view is
rather extreme. Even in cases where detection is the
end goal, classification may be an appropriate initial
step to guide resources towards images that hold good
promise of containing the target class. This is similar
to how an ‘objectness’ analysis (e.g. Alexe et al, 2010)
can guide a detector’s attention to specific locations
within an image. Classification could also be used to
put regression methods for counting into action, which
have been shown to perform well without any detection
(Lempitsky and Zisserman, 2010).
2.1.3 Segmentation
For each test image, predict the object class of each
pixel, or give it ‘background’ status if the object does
not belong to one of the twenty specified classes. There
are no confidence values associated with this prediction.
Two competitions are defined in a similar manner to the
classification and detection challenges.
Segmentation is clearly more challenging than de-
tection and its solution tends to be more time consum-
ing. Detection can therefore be the task of choice in
cases where such fine-grained image analysis is not re-
quired by the application. However, several applications
do need a more detailed knowledge about object outline
or shape, such as robot grasping or image retargeting.
Even if segmentation is the goal, detection can provide
a good initialization (e.g. Leibe et al, 2004).
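As an aside on how segmentation predictions of this kind are typically scored, the sketch below computes a per-class intersection-over-union accuracy over all pixels. This is our illustration of the commonly used measure rather than the official scoring code, and it omits any handling of unlabelled or boundary pixels.

```python
# Sketch of a per-class intersection-over-union segmentation accuracy
# (illustrative only; not the official scoring code, and handling of
# unlabelled/boundary pixels is omitted).
import numpy as np

def per_class_iou(pred, gt, num_classes=21):   # 20 object classes + background
    """pred, gt: integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else float('nan'))
    return ious
```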
2.1.4 Action classification
This taster was introduced in 2010. The motivation was
that the world is dynamic and snapshots of it still con-
vey substantial information about these dynamics. Sev-
eral of the actions were chosen to involve object classes
that were also part of the classification and detection
challenges (like a person riding a horse, or a person
riding a bike). The actions themselves were all geared
towards people.
In 2010 the challenge was: for each of ten action
classes predict if a specified person (indicated by a
bounding box) in a test image is performing the corre-
sponding action. The output is a real-valued confidence
that the action is being performed so that a precision-
recall curve can be drawn. The action classes are ‘jump-
ing’, ‘phoning’, ‘playing instrument’, ‘reading’, ‘riding
bike’, ‘riding horse’, ‘running’, ‘taking photo’, ‘using
computer’, ‘walking’, and participants may choose to
tackle all, or any subset of action classes, for example

‘walking only’ or ‘walking and running’. Note, the ac-
tion classes are not exclusive, for example a person can
be both ‘riding a bicycle’ and ‘phoning’. In 2011 an
‘other’ class was introduced (for actions different from
the ten already specified). This increased the difficulty
of the challenge. The output is still a real-valued con-
fidence for each of the ten actions. As with other parts
of the challenge, the training could be either based on
the official Pascal VOC training data, or on external
data.
It was necessary for us to specify the person of in-
terest in the image as there may be several people per-
forming different actions. In 2012 the person of interest
was specified by both a bounding box and a point on
the torso, and a separate competition was defined for each.
The motivation for this additional point annotation was
that the aspect ratio of the bounding box might pro-
vide some information on the action being performed,
and this was almost entirely removed if only a point
was provided. For example, the aspect ratio of the box
could help distinguish walking and running from other
action classes (this was a criticism raised during the
2011 Pascal VOC workshop).
2.1.5 Person layout
For each person in a test image (their bounding box
is provided) predict the presence or absence of parts
(head, hands and feet), and the bounding boxes of those
parts. The prediction of a person layout should be out-
put with an associated real-valued confidence of the
layout so that a precision-recall curve can be gener-
ated for each person. The success of the layout predic-
tion depends both on: (i) a correct prediction of parts
present/absent (e.g. are the hands visible or occluded);
(ii) a correct prediction of bounding boxes for the vis-
ible parts. Two competitions are defined in a similar
manner to the classification challenge.
2.2 Datasets
For the purposes of the challenge, the data is di-
vided into two main subsets: training/validation data
(trainval), and test data (test). For participants’
convenience, the trainval data is further divided into
suggested training (train) and validation (val) sets,
however participants are free to use any data in the
trainval set for training and/or validation.
There is complete annotation for the twenty classes:
i.e. all images are annotated with bounding boxes for
every instance of the twenty classes for the classifica-
tion and detection challenges. In addition to a bound-
ing box for each object, attributes such as 'orientation', 'occluded', 'truncated' and 'difficult' are specified. The full
list of attributes and their definitions is given in Ever-
ingham et al (2010). Fig. 1 shows samples from each of
the challenges including annotations. Note, the annota-
tions on the test set are not publicly released.
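For concreteness, the sketch below reads the per-object annotation fields mentioned above from one of the publicly distributed XML files. The element names follow the distributed development kit, but this is an illustrative reader rather than official code; the devkit documentation gives the authoritative field definitions.

```python
# Minimal sketch of reading one VOC annotation file (illustrative only;
# element names follow the publicly distributed devkit XML layout).
import xml.etree.ElementTree as ET

def read_annotation(xml_path):
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.findall('object'):
        box = obj.find('bndbox')
        objects.append({
            'class': obj.findtext('name'),
            'truncated': obj.findtext('truncated') == '1',
            'difficult': obj.findtext('difficult') == '1',
            'bbox': tuple(int(float(box.findtext(k)))
                          for k in ('xmin', 'ymin', 'xmax', 'ymax')),
        })
    return objects
```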
Statistics for the number of object instances and
images in the training and validation datasets for the
classification, detection, segmentation and layout chal-
lenges are given in Table 3, and for the action classifica-
tion challenge in Table 4. Note, we do not release the
exact numbers of object instances in the test set, but
both the number of instances per class and number of
images are approximately balanced with those in the
trainval set.
The number of images and instances in all the tasks
was increased up to 2011. From 2011 to 2012 the num-
ber of images in the classification, detection and person
layout tasks was not increased, and only those for seg-
mentation and action classification were augmented.
From 2009 onwards the data for all tasks consists of
the previous years’ images augmented with new images.
Before this, in 2008 and earlier, an entirely new dataset
was released each year for the classification/detection
tasks. Augmenting allows the number of images to grow
each year and, more importantly, means that test results can be compared with those from previous years. Thus, for example, the performance of all methods from 2009–2012 can be evaluated on the 2009 test set (al-
though the methods may have used a different number
of training images).
2.3 Annotation procedure
The procedure of collecting the data and annotating it
with ground truth is described in our companion pa-
per (Everingham et al, 2010). However, the annotation
process has evolved since that time and we outline here
the main changes in the collection and annotation pro-
cedure. Note, for challenge organisers, one essential fac-
tor in obtaining consistent annotations is to have guide-
lines available in advance of the annotation process. The
ones used for VOC are available at the Pascal VOC
annotation guidelines web page (2012).
2.3.1 Use of Mechanical Turk for initial class labelling
of the images
We aimed to collect a balanced set of images with a cer-
tain minimum number of instances of each class. This
required finding sufficient images of the rarer classes,
such as ‘bus’ and ‘dining table’. In previous years this
had been achieved by getting the annotators to focus on
such classes towards the end of the annotation period,
