What have the authors contributed in "Microsoft COCO: Common Objects in Context"?

The authors present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. The authors present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, the authors provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

How many annotators were used in the sample?

By observing how recall increased as annotators were added, the authors estimate that, in practice, over 99% of all object categories not later rejected as false positives are detected given 8 annotators.

How many worker hours were used to generate object segmentation masks?

Utilizing over 70,000 worker hours, a vast collection of object instances was gathered, annotated and organized to drive the advancement of object detection and segmentation algorithms.

How many instances of a category were segmented in an image?

After 10-15 instances of a category were segmented in an image, the remaining instances were marked as “crowds” using a single (possibly multipart) segment.

Why did the authors choose to include only “thing” categories?

Since the authors are primarily interested in precise localization of object instances, the authors decided to only include “thing” categories and not “stuff.”

How many instances of a given category were segmented?

For images containing 10 object instances or fewer of a given category, every instance was individually segmented (note that in some images up to 15 instances were segmented).

What is the effect of difficult examples on the learning model?

Such examples may act as noise and pollute the learned model if the model is not rich enough to capture such appearance variability.

What is the interesting observation about the dataset?

Only 10% of the images in MS COCO contain a single object category; in comparison, over 60% of the images in ImageNet and PASCAL VOC contain a single object category.

(Open Access) Microsoft COCO: Common Objects in Context (2014) | Tsung-Yi Lin

Q: What is the task of labeling objects in a scene?

The task of labeling semantic objects in a scene requires that each pixel of an image be labeled as belonging to a category, such as sky, chair, floor, street, etc.

Q: How many instances of a given category were discarded?

Segmentations of insufficient quality were discarded and the corresponding instances added back to the pool of unsegmented objects.

Q: How many datasets were created for the detection of basic object categories?

For the detection of basic object categories, a multiyear effort from 2005 to 2012 was devoted to the creation and maintenance of a series of benchmark datasets that were widely adopted.

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár

Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

1 INTRODUCTION

One of the primary goals of computer vision is the understanding of visual scenes. Scene understanding involves numerous tasks including recognizing what objects are present, localizing the objects in 2D and 3D, determining the objects’ and scene’s attributes, characterizing relationships between objects and providing a semantic description of the scene. The current object classification and detection datasets [1], [2], [3], [4] help us explore the first challenges related to scene understanding. For instance the ImageNet dataset [1], which contains an unprecedented number of images, has recently enabled breakthroughs in both object classification and detection research [5], [6], [7]. The community has also created datasets containing object attributes [8], scene attributes [9], keypoints [10], and 3D scene information [11]. This leads us to the obvious question: what datasets will best continue our advance towards our ultimate goal of scene understanding?

We introduce a new large-scale dataset that addresses three core research problems in scene understanding: detecting non-iconic views (or non-canonical perspectives [12]) of objects, contextual reasoning between objects and the precise 2D localization of objects. For many categories of objects, there exists an iconic view. For example, when performing a web-based image search for the object category “bike,” the top-ranked retrieved examples appear in profile, unobstructed near the center of a neatly composed photo. We posit that current recognition systems perform fairly well on iconic views, but struggle to recognize objects otherwise – in the background, partially occluded, amid clutter [13] – reflecting the composition of actual everyday scenes. We verify this experimentally; when evaluated on everyday scenes, models trained on our data perform better than those trained with prior datasets. A challenge is finding natural images that contain multiple objects. The identity of many objects can only be resolved using context, due to small size or ambiguous appearance in the image. To push research in contextual reasoning, images depicting scenes [3] rather than objects in isolation are necessary. Finally, we argue that detailed spatial understanding of object layout will be a core component of scene analysis. An object’s spatial location can be defined coarsely using a bounding box [2] or with a precise pixel-level segmentation [14], [15], [16]. As we demonstrate, to measure either kind of localization performance it is essential for the dataset to have every instance of every object category labeled and fully segmented. Our dataset is unique in its annotation of instance-level segmentation masks, Fig. 1.

Fig. 1: While previous object recognition datasets have focused on (a) image classification, (b) object bounding box localization or (c) semantic pixel-level segmentation, we focus on (d) segmenting individual object instances. We introduce a large, richly-annotated dataset comprised of images depicting complex everyday scenes of common objects in their natural context.

• T.Y. Lin and S. Belongie are with Cornell NYC Tech and the Cornell Computer Science Department.
• M. Maire is with the Toyota Technological Institute at Chicago.
• L. Bourdev and P. Dollár are with Facebook AI Research. The majority of this work was performed while P. Dollár was with Microsoft Research.
• R. Girshick and C. L. Zitnick are with Microsoft Research, Redmond.
• J. Hays is with Brown University.
• P. Perona is with the California Institute of Technology.
• D. Ramanan is with the University of California at Irvine.

arXiv:1405.0312v3 [cs.CV] 21 Feb 2015

To create a large-scale dataset that accomplishes these three goals we employed a novel pipeline for gathering data with extensive use of Amazon Mechanical Turk. First and most importantly, we harvested a large set of images containing contextual relationships and non-iconic object views. We accomplished this using a surprisingly simple yet effective technique that queries for pairs of objects in conjunction with images retrieved via scene-based queries [17], [3]. Next, each image was labeled as containing particular object categories using a hierarchical labeling approach [18]. For each category found, the individual instances were labeled, verified, and finally segmented. Given the inherent ambiguity of labeling, each of these stages has numerous tradeoffs that we explored in detail.

The Microsoft Common Objects in COntext (MS COCO) dataset contains 91 common object categories with 82 of them having more than 5,000 labeled instances, Fig. 6. In total the dataset has 2,500,000 labeled instances in 328,000 images. In contrast to the popular ImageNet dataset [1], COCO has fewer categories but more instances per category. This can aid in learning detailed object models capable of precise 2D localization. The dataset is also significantly larger in number of instances per category than the PASCAL VOC [2] and SUN [3] datasets. Additionally, a critical distinction between our dataset and others is the number of labeled instances per image which may aid in learning contextual information, Fig. 5. MS COCO contains considerably more object instances per image (7.7) as compared to ImageNet (3.0) and PASCAL (2.3). In contrast, the SUN dataset, which contains significant contextual information, has over 17 objects and “stuff” per image but considerably fewer object instances overall.
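As a rough illustration of the per-image statistics quoted above, the sketch below computes the mean number of labeled instances per image and the fraction of images that contain a single category. The record layout (dicts with `id`, `image_id`, and `category` keys) and the function name are assumptions made for this example; they are not the dataset's actual file format.

```python
from collections import defaultdict


def per_image_stats(images, annotations):
    """Return (mean instances per image, fraction of images with one category).

    `images` is a list of dicts with an "id" key; `annotations` is a list of
    dicts with "image_id" and "category" keys. This layout is hypothetical.
    """
    instance_counts = defaultdict(int)
    categories = defaultdict(set)
    for ann in annotations:
        instance_counts[ann["image_id"]] += 1
        categories[ann["image_id"]].add(ann["category"])

    n_images = len(images)
    # Images without any annotation contribute zero instances.
    mean_instances = sum(instance_counts[img["id"]] for img in images) / n_images
    single_category = sum(1 for img in images if len(categories[img["id"]]) == 1)
    return mean_instances, single_category / n_images


# Toy data: three images, six labeled instances.
images = [{"id": 1}, {"id": 2}, {"id": 3}]
annotations = [
    {"image_id": 1, "category": "dog"},
    {"image_id": 1, "category": "car"},
    {"image_id": 1, "category": "person"},
    {"image_id": 2, "category": "chair"},
    {"image_id": 2, "category": "chair"},
    {"image_id": 3, "category": "person"},
]
mean_instances, single_fraction = per_image_stats(images, annotations)
print(mean_instances, single_fraction)
```

On real data the same two numbers would correspond to the instances-per-image comparison in the text, under whatever counting conventions the authors used.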

An abridged version of this work appeared in [19].

2 RELATED WORK

Throughout the history of computer vision research datasets have played a critical role. They not only provide a means to train and evaluate algorithms, they drive research in new and more challenging directions. The creation of ground truth stereo and optical flow datasets [20], [21] helped stimulate a flood of interest in these areas. The early evolution of object recognition datasets [22], [23], [24] facilitated the direct comparison of hundreds of image recognition algorithms while simultaneously pushing the field towards more complex problems. Recently, the ImageNet dataset [1] containing millions of images has enabled breakthroughs in both object classification and detection research using a new class of deep learning algorithms [5], [6], [7].

Datasets related to object recognition can be roughly split into three groups: those that primarily address object classification, object detection and semantic scene labeling. We address each in turn.

Image Classification. The task of object classification requires binary labels indicating whether objects are present in an image; see Fig. 1(a). Early datasets of this type comprised images containing a single object with blank backgrounds, such as the MNIST handwritten digits [25] or COIL household objects [26]. Caltech 101 [22] and Caltech 256 [23] marked the transition to more realistic object images retrieved from the internet while also increasing the number of object categories to 101 and 256, respectively. Popular datasets in the machine learning community due to the larger number of training examples, CIFAR-10 and CIFAR-100 [27] offered 10 and 100 categories from a dataset of tiny 32 × 32 images [28]. While these datasets contained up to 60,000 images and hundreds of categories, they still only captured a small fraction of our visual world.

Recently, ImageNet [1] made a striking departure from the incremental increase in dataset sizes. They proposed the creation of a dataset containing 22k categories with 500-1000 images each. Unlike previous datasets containing entry-level categories [29], such as “dog” or “chair,” like [28], ImageNet used the WordNet Hierarchy [30] to obtain both entry-level and fine-grained [31] categories. Currently, the ImageNet dataset contains over 14 million labeled images and has enabled significant advances in image classification [5], [6], [7].

Object detection. Detecting an object entails both stating that an object belonging to a specified class is present, and localizing it in the image. The location of an object is typically represented by a bounding box, Fig. 1(b). Early algorithms focused on face detection [32] using various ad hoc datasets. Later, more realistic and challenging face detection datasets were created [33]. Another popular challenge is the detection of pedestrians for which several datasets have been created [24], [4]. The Caltech Pedestrian Dataset [4] contains 350,000 labeled instances with bounding boxes.

For the detection of basic object categories, a multi-year effort from 2005 to 2012 was devoted to the creation and maintenance of a series of benchmark datasets that were widely adopted. The PASCAL VOC [2] datasets contained 20 object categories spread over 11,000 images. Over 27,000 object instance bounding boxes were labeled, of which almost 7,000 had detailed segmentations. Recently, a detection challenge has been created from 200 object categories using a subset of 400,000 images from ImageNet [34]. An impressive 350,000 objects have been labeled using bounding boxes.

Since the detection of many objects such as sunglasses, cellphones or chairs is highly dependent on contextual information, it is important that detection datasets contain objects in their natural environments. In our dataset we strive to collect images rich in contextual information. The use of bounding boxes also limits the accuracy with which detection algorithms may be evaluated. We propose the use of fully segmented instances to enable more accurate detector evaluation.

Fig. 2: Example of (a) iconic object images, (b) iconic scene images, and (c) non-iconic images.

Semantic scene labeling. The task of labeling semantic objects in a scene requires that each pixel of an image be labeled as belonging to a category, such as sky, chair, floor, street, etc. In contrast to the detection task, individual instances of objects do not need to be segmented, Fig. 1(c). This enables the labeling of objects for which individual instances are hard to define, such as grass, streets, or walls. Datasets exist for both indoor [11] and outdoor [35], [14] scenes. Some datasets also include depth information [11]. Similar to semantic scene labeling, our goal is to measure the pixel-wise accuracy of object labels. However, we also aim to distinguish between individual instances of an object, which requires a solid understanding of each object’s extent.

A novel dataset that combines many of the properties of both object detection and semantic scene labeling datasets is the SUN dataset [3] for scene understanding. SUN contains 908 scene categories from the WordNet dictionary [30] with segmented objects. The 3,819 object categories span those common to object detection datasets (person, chair, car) and to semantic scene labeling (wall, sky, floor). Since the dataset was collected by finding images depicting various scene types, the number of instances per object category exhibits the long tail phenomenon. That is, a few categories have a large number of instances (wall: 20,213, window: 16,080, chair: 7,971) while most have a relatively modest number of instances (boat: 349, airplane: 179, floor lamp: 276). In our dataset, we ensure that each object category has a significant number of instances, Fig. 5.

Other vision datasets. Datasets have spurred the advancement of numerous fields in computer vision. Some notable datasets include the Middlebury datasets for stereo vision [20], multi-view stereo [36] and optical flow [21]. The Berkeley Segmentation Data Set (BSDS500) [37] has been used extensively to evaluate both segmentation and edge detection algorithms. Datasets have also been created to recognize both scene [9] and object attributes [8], [38]. Indeed, numerous areas of vision have benefited from challenging datasets that helped catalyze progress.

3 IMAGE COLLECTION

We next describe how the object categories and candidate images are selected.

3.1 Common Object Categories

The selection of object categories is a non-trivial exercise. The categories must form a representative set of all categories, be relevant to practical applications and occur with high enough frequency to enable the collection of a large dataset. Other important decisions are whether to include both “thing” and “stuff” categories [39] and whether fine-grained [31], [1] and object-part categories should be included. “Thing” categories include objects for which individual instances may be easily labeled (person, chair, car), whereas “stuff” categories include materials and objects with no clear boundaries (sky, street, grass). Since we are primarily interested in precise localization of object instances, we decided to only include “thing” categories and not “stuff.” However, since “stuff” categories can provide significant contextual information, we believe the future labeling of “stuff” categories would be beneficial.

The specificity of object categories can vary significantly. For instance, a dog could be a member of the “mammal”, “dog”, or “German shepherd” categories. To enable the practical collection of a significant number of instances per category, we chose to limit our dataset to entry-level categories, i.e. category labels that are commonly used by humans when describing objects (dog, chair, person). It is also possible that some object categories may be parts of other object categories. For instance, a face may be part of a person. We anticipate the inclusion of object-part categories (face, hands, wheels) would be beneficial for many real-world applications.

We used several sources to collect entry-level object categories of “things.” We first compiled a list of categories by combining categories from PASCAL VOC [2] and a subset of the 1200 most frequently used words that denote visually identifiable objects [40]. To further augment our set of candidate categories, several children ranging in ages from 4 to 8 were asked to name every object they see in indoor and outdoor environments. The final 272 candidates may be found in the appendix. Finally, the co-authors voted on a 1 to 5 scale for each category taking into account how commonly they occur, their usefulness for practical applications, and their diversity relative to other categories. The final selection of categories attempts to pick categories with high votes, while keeping the number of categories per super-category (animals, vehicles, furniture, etc.) balanced. Categories for which obtaining a large number of instances (greater than 5,000) was difficult were also removed. To ensure backwards compatibility all categories from PASCAL VOC [2] are also included. Our final list of 91 proposed categories is in Fig. 5(a).

Fig. 3: Our annotation pipeline is split into 3 primary tasks: (a) labeling the categories present in the image (§4.1), (b) locating and marking all instances of the labeled categories (§4.2), and (c) segmenting each object instance (§4.3).

3.2 Non-iconic Image Collection

Given the list of object categories, our next goal was to collect a set of candidate images. We may roughly group images into three types, Fig. 2: iconic-object images [41], iconic-scene images [3] and non-iconic images. Typical iconic-object images have a single large object in a canonical perspective centered in the image, Fig. 2(a). Iconic-scene images are shot from canonical viewpoints and commonly lack people, Fig. 2(b). Iconic images have the benefit that they may be easily found by directly searching for specific categories using Google or Bing image search. While iconic images generally provide high quality object instances, they can lack important contextual information and non-canonical viewpoints.

Our goal was to collect a dataset such that a majority of images are non-iconic, Fig. 2(c). It has been shown that datasets containing more non-iconic images are better at generalizing [42]. We collected non-iconic images using two strategies. First, as popularized by PASCAL VOC [2], we collected images from Flickr, which tends to have fewer iconic images. Flickr contains photos uploaded by amateur photographers with searchable metadata and keywords. Second, we did not search for object categories in isolation. A search for “dog” will tend to return iconic images of large, centered dogs. However, if we searched for pairwise combinations of object categories, such as “dog + car”, we found many more non-iconic images. Surprisingly, these images typically do not just contain the two categories specified in the search, but numerous other categories as well. To further supplement our dataset we also searched for scene/object category pairs, see the appendix. We downloaded at most 5 photos taken by a single photographer within a short time window. In the rare cases in which enough images could not be found, we searched for single categories and performed an explicit filtering stage to remove iconic images. The result is a collection of 328,000 images with rich contextual relationships between objects as shown in Figs. 2(c) and 6.
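The pairwise search strategy can be sketched in a few lines. The example below only builds query strings in the “dog + car” style mentioned in the text; the function names, the separator, and the sample category and scene lists are illustrative assumptions, and no actual search API is called.

```python
from itertools import combinations, product


def pairwise_object_queries(categories, separator=" + "):
    """Build one query string per unordered pair of object categories."""
    return [separator.join(pair) for pair in combinations(categories, 2)]


def scene_object_queries(scenes, categories, separator=" + "):
    """Build one query string per (scene, object category) pair."""
    return [separator.join(pair) for pair in product(scenes, categories)]


# Hypothetical sample lists, not the paper's actual category or scene lists.
object_queries = pairwise_object_queries(["dog", "car", "chair"])
scene_queries = scene_object_queries(["kitchen", "street"], ["dog", "car"])
print(object_queries)  # ['dog + car', 'dog + chair', 'car + chair']
print(scene_queries)

# With 91 categories there are 91 * 90 / 2 = 4095 unordered pairs.
n_pairs = len(pairwise_object_queries([f"category_{i}" for i in range(91)]))
print(n_pairs)
```

Using unordered pairs avoids issuing both “dog + car” and “car + dog”; whether the authors queried every pair is not stated in this section.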

4 IMAGE ANNOTATION

We next describe how we annotated our image collection. Due to our desire to label over 2.5 million object instances, the design of a cost efficient yet high quality annotation pipeline was critical. The annotation pipeline is outlined in Fig. 3. For all crowdsourcing tasks we used workers on Amazon’s Mechanical Turk (AMT). Our user interfaces are described in detail in the appendix. Note that, since the original version of this work [19], we have taken a number of steps to further improve the quality of the annotations. In particular, we have increased the number of annotators for the category labeling and instance spotting stages to eight. We also added a stage to verify the instance segmentations.

4.1 Category Labeling

The first task in annotating our dataset is determining which object categories are present in each image, Fig. 3(a). Since we have 91 categories and a large number of images, asking workers to answer 91 binary classification questions per image would be prohibitively expensive. Instead, we used a hierarchical approach [18].

Fig. 4: Worker precision and recall for the category labeling task. (a) The union of multiple AMT workers (blue) has better recall than any expert (red). Ground truth was computed using majority vote of the experts. (b) Shows the number of workers (circle size) and average number of jobs per worker (circle color) for each precision/recall range. Most workers have high precision; such workers generally also complete more jobs. For this plot ground truth for each worker is the union of responses from all other AMT workers. See §4.4 for details.

We group the object categories into 11 super-categories (see the appendix). For a given image, a worker was presented with each group of categories in turn and asked to indicate whether any instances exist for that super-category. This greatly reduces the time needed to classify the various categories. For example, a worker may easily determine no animals are present in the image without having to specifically look for cats, dogs, etc. If a worker determines instances from the super-category (animal) are present, for each subordinate category (dog, cat, etc.) present, the worker must drag the category’s icon onto the image over one instance of the category. The placement of these icons is critical for the following stage. We emphasize that only a single instance of each category needs to be annotated in this stage. To ensure high recall, 8 workers were asked to label each image. A category is considered present if any worker indicated the category; false positives are handled in subsequent stages. A detailed analysis of performance is presented in §4.4. This stage took ∼20k worker hours to complete.
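The "present if any worker indicated it" rule amounts to taking the union of the workers' label sets. The following minimal sketch, using made-up worker responses and hypothetical function names, shows how recall against a reference label set can only stay flat or grow as more workers are added, which is the motivation for using several annotators per image.

```python
def union_labels(worker_labels):
    """Categories marked present by at least one worker."""
    present = set()
    for labels in worker_labels:
        present |= set(labels)
    return present


def recall(predicted, reference):
    """Fraction of reference categories that appear in the predicted set."""
    if not reference:
        return 1.0
    return len(predicted & reference) / len(reference)


# Made-up responses from four workers for a single image.
workers = [{"dog"}, {"dog", "car"}, set(), {"chair"}]
reference = {"dog", "car", "chair"}

# Recall of the union after the first k workers, for k = 1..4.
recall_by_k = [
    recall(union_labels(workers[:k]), reference)
    for k in range(1, len(workers) + 1)
]
print(recall_by_k)
```

The union trades precision for recall: a single mistaken worker introduces a false positive, which is why the text defers false-positive handling to later stages.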

4.2 Instance Spotting

In the next stage all instances of the object categories in an image were labeled, Fig. 3(b). In the previous stage each worker labeled one instance of a category, but multiple object instances may exist. Therefore, for each image, a worker was asked to place a cross on top of each instance of a specific category found in the previous stage. To boost recall, the location of the instance found by a worker in the previous stage was shown to the current worker. Such priming helped workers quickly find an initial instance upon first seeing the image. The workers could also use a magnifying glass to find small instances. Each worker was asked to label at most 10 instances of a given category per image. Each image was labeled by 8 workers for a total of ∼10k worker hours.

4.3 Instance Segmentation

Our final stage is the laborious task of segmenting each object instance, Fig. 3(c). For this stage we modified the excellent user interface developed by Bell et al. [16] for image segmentation. Our interface asks the worker to segment an object instance specified by a worker in the previous stage. If other instances have already been segmented in the image, those segmentations are shown to the worker. A worker may also indicate there are no object instances of the given category in the image (implying a false positive label from the previous stage) or that all object instances are already segmented.

Segmenting 2,500,000 object instances is an extremely time consuming task requiring over 22 worker hours per 1,000 segmentations. To minimize cost we only had a single worker segment each instance. However, when first completing the task, most workers produced only coarse instance outlines. As a consequence, we required all workers to complete a training task for each object category. The training task required workers to segment an object instance. Workers could not complete the task until their segmentation adequately matched the ground truth. The use of a training task vastly improved the quality of the workers (approximately 1 in 3 workers passed the training stage) and resulting segmentations. Example segmentations may be viewed in Fig. 6.
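The text does not say how a training segmentation was judged to "adequately match" the ground truth. One common way to compare two masks is intersection over union (IoU); the sketch below uses that measure with a hypothetical threshold of 0.8 purely for illustration, and should not be read as the authors' actual criterion.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks (2D lists of 0/1)."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


def passes_training(worker_mask, reference_mask, threshold=0.8):
    """Hypothetical pass rule: IoU with the reference must reach the threshold."""
    return mask_iou(worker_mask, reference_mask) >= threshold


# Reference object: a 2x2 block in the middle of a 4x4 image.
reference = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# A coarse outline that covers a 3x3 block around the object.
coarse = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
coarse_iou = mask_iou(coarse, reference)  # 4 shared pixels out of 9 covered
print(coarse_iou, passes_training(coarse, reference))
print(passes_training(reference, reference))
```

A coarse outline like the one above scores well below the illustrative threshold, which matches the kind of first-attempt output the text describes.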

While the training task filtered out most bad workers, we also performed an explicit verification step on each segmented instance to ensure good quality. Multiple workers (3 to 5) were asked to judge each segmentation and indicate whether it matched the instance well or not. Segmentations of insufficient quality were discarded and the corresponding instances added back to the pool of unsegmented objects. Finally, some approved workers consistently produced poor segmentations; all work obtained from such workers was discarded.
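The verification loop can be sketched as a vote over each segmentation followed by re-queuing of rejected instances. This section does not give the acceptance rule, so the `min_approval` fraction below is a hypothetical parameter, as are the function name and the instance identifiers.

```python
def verify_segmentations(judgments, min_approval=0.75):
    """Split instances into accepted ones and ones to segment again.

    `judgments` maps an instance id to a list of boolean votes from the
    verifying workers. `min_approval` is an assumed threshold, not a value
    taken from the text.
    """
    accepted = []
    requeued = []
    for instance_id, votes in judgments.items():
        approval = sum(votes) / len(votes)
        if approval >= min_approval:
            accepted.append(instance_id)
        else:
            # Discard the segmentation; the instance returns to the
            # pool of unsegmented objects.
            requeued.append(instance_id)
    return accepted, requeued


# Made-up votes from 3 to 5 verifying workers per segmentation.
judgments = {
    "instance_a": [True, True, True],
    "instance_b": [True, False, False, True, False],
    "instance_c": [True, True, True, False],
}
accepted, requeued = verify_segmentations(judgments)
print(accepted)  # ['instance_a', 'instance_c']
print(requeued)  # ['instance_b']
```

Re-queued instances would go through segmentation and verification again until a segmentation is accepted.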

