Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick,
James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár

T.-Y. Lin and S. Belongie are with Cornell NYC Tech and the Cornell Computer
Science Department. M. Maire is with the Toyota Technological Institute at
Chicago. L. Bourdev and P. Dollár are with Facebook AI Research; the majority
of this work was performed while P. Dollár was with Microsoft Research.
R. Girshick and C. L. Zitnick are with Microsoft Research, Redmond. J. Hays
is with Brown University. P. Perona is with the California Institute of
Technology. D. Ramanan is with the University of California at Irvine.
Abstract—We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of
object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex
everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in
precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a
total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via
novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of
the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and
segmentation detection results using a Deformable Parts Model.
1 INTRODUCTION
One of the primary goals of computer vision is the
understanding of visual scenes. Scene understanding
involves numerous tasks including recognizing what
objects are present, localizing the objects in 2D and 3D,
determining the objects’ and scene’s attributes, charac-
terizing relationships between objects and providing a
semantic description of the scene. The current object clas-
sification and detection datasets [1], [2], [3], [4] help us
explore the first challenges related to scene understand-
ing. For instance the ImageNet dataset [1], which con-
tains an unprecedented number of images, has recently
enabled breakthroughs in both object classification and
detection research [5], [6], [7]. The community has also
created datasets containing object attributes [8], scene
attributes [9], keypoints [10], and 3D scene information
[11]. This leads us to the obvious question: what datasets
will best continue our advance towards our ultimate goal
of scene understanding?
We introduce a new large-scale dataset that addresses
three core research problems in scene understanding: de-
tecting non-iconic views (or non-canonical perspectives
[12]) of objects, contextual reasoning between objects
and the precise 2D localization of objects. For many
categories of objects, there exists an iconic view. For
example, when performing a web-based image search
for the object category “bike,” the top-ranked retrieved
examples appear in profile, unobstructed near the cen-
ter of a neatly composed photo. We posit that current
recognition systems perform fairly well on iconic views,
but struggle to recognize objects otherwise: in the
background, partially occluded, or amid clutter [13],
reflecting the composition of actual everyday scenes.

Fig. 1: While previous object recognition datasets have
focused on (a) image classification, (b) object bounding
box localization or (c) semantic pixel-level segmentation,
we focus on (d) segmenting individual object instances.
We introduce a large, richly-annotated dataset comprised
of images depicting complex everyday scenes of common
objects in their natural context.

We
verify this experimentally; when evaluated on everyday
scenes, models trained on our data perform better than
those trained with prior datasets. A challenge is finding
natural images that contain multiple objects. The identity
of many objects can only be resolved using context, due
to small size or ambiguous appearance in the image. To
push research in contextual reasoning, images depicting
scenes [3] rather than objects in isolation are necessary.
Finally, we argue that detailed spatial understanding of
object layout will be a core component of scene analysis.
An object’s spatial location can be defined coarsely using
a bounding box [2] or with a precise pixel-level segmen-
tation [14], [15], [16]. As we demonstrate, to measure
either kind of localization performance it is essential
for the dataset to have every instance of every object
category labeled and fully segmented. Our dataset is
unique in its annotation of instance-level segmentation
masks, Fig. 1.
To create a large-scale dataset that accomplishes these
three goals we employed a novel pipeline for gathering
data with extensive use of Amazon Mechanical Turk.
First and most importantly, we harvested a large set
of images containing contextual relationships and non-
iconic object views. We accomplished this using a sur-
prisingly simple yet effective technique that queries for
pairs of objects in conjunction with images retrieved
via scene-based queries [17], [3]. Next, each image was
labeled as containing particular object categories using
a hierarchical labeling approach [18]. For each category
found, the individual instances were labeled, verified,
and finally segmented. Given the inherent ambiguity of
labeling, each of these stages has numerous tradeoffs that
we explored in detail.
The Microsoft Common Objects in COntext (MS
COCO) dataset contains 91 common object categories
with 82 of them having more than 5,000 labeled in-
stances, Fig. 6. In total the dataset has 2,500,000 labeled
instances in 328,000 images. In contrast to the popular
ImageNet dataset [1], COCO has fewer categories but
more instances per category. This can aid in learning
detailed object models capable of precise 2D localization.
The dataset is also significantly larger in number of in-
stances per category than the PASCAL VOC [2] and SUN
[3] datasets. Additionally, a critical distinction between
our dataset and others is the number of labeled instances
per image which may aid in learning contextual informa-
tion, Fig. 5. MS COCO contains considerably more object
instances per image (7.7) as compared to ImageNet (3.0)
and PASCAL (2.3). In contrast, the SUN dataset, which
contains significant contextual information, has over 17
objects and “stuff” per image but considerably fewer
object instances overall.
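For readers working with the released annotations, these aggregate statistics can be checked with the COCO Python API (pycocotools). The sketch below is illustrative only: the annotation file path is a placeholder, and the exact counts depend on the release and split used.

```python
# Illustrative sketch: basic dataset statistics via pycocotools.
# Assumes `pip install pycocotools`; the annotation path is a placeholder.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2014.json")  # hypothetical path

num_categories = len(coco.getCatIds())
num_images = len(coco.getImgIds())
num_instances = len(coco.getAnnIds())

print("categories:", num_categories)
print("images:", num_images)
print("labeled instances:", num_instances)
# Roughly 2,500,000 / 328,000 ≈ 7.6 labeled instances per image overall,
# consistent with the per-image figure quoted above.
print("instances per image: %.1f" % (num_instances / num_images))
```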
An abridged version of this work appeared in [19].
2 RELATED WORK
Throughout the history of computer vision research,
datasets have played a critical role. They not only pro-
vide a means to train and evaluate algorithms, but also
drive research in new and more challenging directions.
The creation of ground truth stereo and optical flow
datasets [20], [21] helped stimulate a flood of interest
in these areas. The early evolution of object recognition
datasets [22], [23], [24] facilitated the direct comparison
of hundreds of image recognition algorithms while si-
multaneously pushing the field towards more complex
problems. Recently, the ImageNet dataset [1] containing
millions of images has enabled breakthroughs in both
object classification and detection research using a new
class of deep learning algorithms [5], [6], [7].
Datasets related to object recognition can be roughly
split into three groups: those that primarily address
object classification, object detection and semantic scene
labeling. We address each in turn.
Image Classification. The task of object classification
requires binary labels indicating whether objects are
present in an image; see Fig. 1(a). Early datasets of this
type comprised images containing a single object with
blank backgrounds, such as the MNIST handwritten
digits [25] or COIL household objects [26]. Caltech 101
[22] and Caltech 256 [23] marked the transition to more
realistic object images retrieved from the internet while
also increasing the number of object categories to 101
and 256, respectively. CIFAR-10 and CIFAR-100 [27],
popular in the machine learning community due to their
larger number of training examples, offered 10 and 100
categories drawn from a dataset of tiny 32 × 32 images [28].
While these datasets contained up to 60,000 images and
hundreds of categories, they still only captured a small
fraction of our visual world.
Recently, ImageNet [1] made a striking departure from
the incremental increase in dataset sizes. They proposed
the creation of a dataset containing 22k categories with
500-1000 images each. Unlike previous datasets that con-
tained only entry-level categories [29], such as “dog” or
“chair” [28], ImageNet used the WordNet hierarchy [30] to
obtain both entry-level and fine-grained [31] categories.
Currently, the ImageNet dataset contains over 14 million
labeled images and has enabled significant advances in
image classification [5], [6], [7].
Object detection. Detecting an object entails both
stating that an object belonging to a specified class is
present, and localizing it in the image. The location of
an object is typically represented by a bounding box,
Fig. 1(b). Early algorithms focused on face detection [32]
using various ad hoc datasets. Later, more realistic and
challenging face detection datasets were created [33].
Another popular challenge is the detection of pedestri-
ans for which several datasets have been created [24],
[4]. The Caltech Pedestrian Dataset [4] contains 350,000
labeled instances with bounding boxes.
For the detection of basic object categories, a multi-
year effort from 2005 to 2012 was devoted to the creation
and maintenance of a series of benchmark datasets that
were widely adopted. The PASCAL VOC [2] datasets
contained 20 object categories spread over 11,000 images.
Over 27,000 object instance bounding boxes were la-
beled, of which almost 7,000 had detailed segmentations.
Recently, a detection challenge has been created from 200
object categories using a subset of 400,000 images from
ImageNet [34]. An impressive 350,000 objects have been
labeled using bounding boxes.
Since the detection of many objects such as sunglasses,
cellphones or chairs is highly dependent on contextual
information, it is important that detection datasets con-
tain objects in their natural environments. In our dataset
we strive to collect images rich in contextual information.
The use of bounding boxes also limits the accuracy
with which detection algorithms may be evaluated. We
propose the use of fully segmented instances to enable
more accurate detector evaluation.
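Detection benchmarks typically score a candidate by its intersection over union (IoU) with the ground truth. The sketch below is not from the paper; it simply illustrates why box-level IoU can overstate localization quality for thin or non-convex objects, which is the kind of error pixel-level annotations make measurable.

```python
# Illustrative sketch (not the paper's evaluation code): box IoU vs. mask IoU.
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / float(union)

def mask_iou(m1, m2):
    """IoU of two boolean masks of identical shape."""
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return float(inter) / float(union)

# A thin diagonal object fills only ~1% of its bounding box, so a detection
# covering the whole box has perfect box IoU but very low mask IoU.
gt_mask = np.zeros((100, 100), dtype=bool)
gt_mask[np.arange(100), np.arange(100)] = True
pred_mask = np.ones((100, 100), dtype=bool)

print(box_iou((0, 0, 100, 100), (0, 0, 100, 100)))  # 1.0
print(mask_iou(gt_mask, pred_mask))                  # 0.01
```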

Fig. 2: Examples of (a) iconic object images, (b) iconic scene images, and (c) non-iconic images.
Semantic scene labeling. The task of labeling se-
mantic objects in a scene requires that each pixel of an
image be labeled as belonging to a category, such as
sky, chair, floor, street, etc. In contrast to the detection
task, individual instances of objects do not need to be
segmented, Fig. 1(c). This enables the labeling of objects
for which individual instances are hard to define, such
as grass, streets, or walls. Datasets exist for both indoor
[11] and outdoor [35], [14] scenes. Some datasets also
include depth information [11]. Similar to semantic scene
labeling, our goal is to measure the pixel-wise accuracy
of object labels. However, we also aim to distinguish
between individual instances of an object, which requires
a solid understanding of each object’s extent.
A novel dataset that combines many of the properties
of both object detection and semantic scene labeling
datasets is the SUN dataset [3] for scene understanding.
SUN contains 908 scene categories from the WordNet
dictionary [30] with segmented objects. The 3,819 ob-
ject categories span those common to object detection
datasets (person, chair, car) and to semantic scene la-
beling (wall, sky, floor). Since the dataset was collected
by finding images depicting various scene types, the
number of instances per object category exhibits the long
tail phenomenon. That is, a few categories have a large
number of instances (wall: 20,213, window: 16,080, chair:
7,971) while most have a relatively modest number of
instances (boat: 349, airplane: 179, floor lamp: 276). In
our dataset, we ensure that each object category has a
significant number of instances, Fig. 5.
Other vision datasets. Datasets have spurred the ad-
vancement of numerous fields in computer vision. Some
notable datasets include the Middlebury datasets for
stereo vision [20], multi-view stereo [36] and optical flow
[21]. The Berkeley Segmentation Data Set (BSDS500) [37]
has been used extensively to evaluate both segmentation
and edge detection algorithms. Datasets have also been
created to recognize both scene [9] and object attributes
[8], [38]. Indeed, numerous areas of vision have benefited
from challenging datasets that helped catalyze progress.
3 IMAGE COLLECTION
We next describe how the object categories and candi-
date images are selected.
3.1 Common Object Categories
The selection of object categories is a non-trivial exercise.
The categories must form a representative set of all
categories, be relevant to practical applications and occur
with high enough frequency to enable the collection of
a large dataset. Other important decisions are whether
to include both “thing” and “stuff” categories [39] and
whether fine-grained [31], [1] and object-part categories
should be included. “Thing” categories include objects
for which individual instances may be easily labeled
(person, chair, car), whereas “stuff” categories include
materials and objects with no clear boundaries (sky,
street, grass). Since we are primarily interested in pre-
cise localization of object instances, we decided to only
include “thing” categories and not “stuff.” However,
since “stuff” categories can provide significant contex-
tual information, we believe the future labeling of “stuff”
categories would be beneficial.
The specificity of object categories can vary signifi-
cantly. For instance, a dog could be a member of the
“mammal”, “dog”, or “German shepherd” categories. To
enable the practical collection of a significant number
of instances per category, we chose to limit our dataset
to entry-level categories, i.e. category labels that are
commonly used by humans when describing objects
(dog, chair, person). It is also possible that some object
categories may be parts of other object categories. For in-
stance, a face may be part of a person. We anticipate the
inclusion of object-part categories (face, hands, wheels)
would be beneficial for many real-world applications.
We used several sources to collect entry-level object
categories of “things.” We first compiled a list of cate-
gories by combining categories from PASCAL VOC [2]
and a subset of the 1200 most frequently used words
that denote visually identifiable objects [40]. To further
augment our set of candidate categories, several children
ranging in ages from 4 to 8 were asked to name every

Fig. 3: Our annotation pipeline is split into 3 primary tasks: (a) labeling the categories present in the image (§4.1),
(b) locating and marking all instances of the labeled categories (§4.2), and (c) segmenting each object instance (§4.3).
object they see in indoor and outdoor environments.
The final 272 candidates may be found in the appendix.
Finally, the co-authors voted on a 1 to 5 scale for each
category taking into account how commonly they oc-
cur, their usefulness for practical applications, and their
diversity relative to other categories. The final selec-
tion of categories attempts to pick categories with high
votes, while keeping the number of categories per super-
category (animals, vehicles, furniture, etc.) balanced. Cat-
egories for which obtaining a large number of instances
(greater than 5,000) was difficult were also removed.
To ensure backwards compatibility all categories from
PASCAL VOC [2] are also included. Our final list of 91
proposed categories is in Fig. 5(a).
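In code form, this selection procedure amounts to a scoring-and-balancing pass over the candidate list. The sketch below paraphrases that procedure under illustrative assumptions: the votes, super-category assignments, thresholds and category names are stand-ins, not the authors' actual data.

```python
# Illustrative sketch of the category selection heuristic described above.
# All concrete values (votes, super-categories, limits) are placeholders.
from collections import defaultdict

# candidate -> (mean co-author vote on a 1-5 scale, super-category)
candidates = {
    "dog": (4.8, "animal"), "cat": (4.7, "animal"),
    "car": (4.9, "vehicle"), "bicycle": (4.5, "vehicle"),
    "unicycle": (3.2, "vehicle"),
    "chair": (4.6, "furniture"), "couch": (4.1, "furniture"),
}
pascal_voc = {"dog", "cat", "car", "bicycle", "chair"}  # kept for compatibility
too_rare = {"unicycle"}        # stand-in for categories with < 5,000 instances
max_per_super = 2              # illustrative per-super-category balance limit

selected = set(pascal_voc)
per_super = defaultdict(int)
for name in selected:
    per_super[candidates[name][1]] += 1

# Add remaining candidates in order of decreasing vote, keeping
# super-categories balanced and dropping hard-to-collect categories.
for name, (vote, super_cat) in sorted(candidates.items(), key=lambda kv: -kv[1][0]):
    if name in selected or name in too_rare:
        continue
    if per_super[super_cat] < max_per_super:
        selected.add(name)
        per_super[super_cat] += 1

print(sorted(selected))
```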
3.2 Non-iconic Image Collection
Given the list of object categories, our next goal was to
collect a set of candidate images. We may roughly group
images into three types, Fig. 2: iconic-object images [41],
iconic-scene images [3] and non-iconic images. Typical
iconic-object images have a single large object in a
canonical perspective centered in the image, Fig. 2(a).
Iconic-scene images are shot from canonical viewpoints
and commonly lack people, Fig. 2(b). Iconic images have
the benefit that they may be easily found by directly
searching for specific categories using Google or Bing
image search. While iconic images generally provide
high quality object instances, they can lack important
contextual information and non-canonical viewpoints.
Our goal was to collect a dataset such that a majority
of images are non-iconic, Fig. 2(c). It has been shown that
datasets containing more non-iconic images are better at
generalizing [42]. We collected non-iconic images using
two strategies. First as popularized by PASCAL VOC
[2], we collected images from Flickr which tends to have
fewer iconic images. Flickr contains photos uploaded by
amateur photographers with searchable metadata and
keywords. Second, we did not search for object cate-
gories in isolation. A search for “dog” will tend to return
iconic images of large, centered dogs. However, if we
searched for pairwise combinations of object categories,
such as “dog + car,” we found many more non-iconic
images. Surprisingly, these images typically do not just
contain the two categories specified in the search, but nu-
merous other categories as well. To further supplement
our dataset we also searched for scene/object category
pairs, see the appendix. We downloaded at most 5
photos taken by a single photographer within a short
time window. In the rare cases in which enough images
could not be found, we searched for single categories
and performed an explicit filtering stage to remove iconic
images. The result is a collection of 328,000 images with
rich contextual relationships between objects as shown
in Figs. 2(c) and 6.
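As a rough illustration of this collection strategy (the category and scene lists, the photo metadata fields, and the time window below are placeholders; the actual query lists appear in the appendix), pairwise queries can be generated and the per-photographer limit enforced as follows:

```python
# Illustrative sketch: pairwise search queries plus a per-photographer cap.
# Category/scene lists, the photo record schema and the time window are
# placeholders, not the actual values used to build the dataset.
from collections import defaultdict
from itertools import combinations

object_categories = ["dog", "car", "chair", "person"]   # placeholder subset
scene_categories = ["kitchen", "street"]                 # placeholder subset

queries = ["%s + %s" % pair for pair in combinations(object_categories, 2)]
queries += ["%s %s" % (scene, obj)
            for scene in scene_categories for obj in object_categories]

def limit_per_photographer(photos, max_photos=5, window_sec=3600):
    """photos: list of dicts with 'owner' and 'timestamp' keys (assumed
    schema). Keep at most `max_photos` per owner within each time window."""
    kept = []
    counts = defaultdict(int)
    for p in sorted(photos, key=lambda p: (p["owner"], p["timestamp"])):
        bucket = (p["owner"], p["timestamp"] // window_sec)
        if counts[bucket] < max_photos:
            counts[bucket] += 1
            kept.append(p)
    return kept

print(queries[:3])   # e.g. ['dog + car', 'dog + chair', 'dog + person']
```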
4 IMAGE ANNOTATION
We next describe how we annotated our image collec-
tion. Due to our desire to label over 2.5 million object
instances, the design of a cost efficient yet high quality
annotation pipeline was critical. The annotation pipeline
is outlined in Fig. 3. For all crowdsourcing tasks we
used workers on Amazon’s Mechanical Turk (AMT). Our
user interfaces are described in detail in the appendix.
Note that, since the original version of this work [19],
we have taken a number of steps to further improve
the quality of the annotations. In particular, we have
increased the number of annotators for the category
labeling and instance spotting stages to eight. We also
added a stage to verify the instance segmentations.
4.1 Category Labeling
The first task in annotating our dataset is determin-
ing which object categories are present in each image,
Fig. 3(a). Since we have 91 categories and a large number
of images, asking workers to answer 91 binary clas-
sification questions per image would be prohibitively
expensive. Instead, we used a hierarchical approach [18].

Fig. 4: Worker precision and recall for the category labeling task. (a) The union of multiple AMT workers (blue)
has better recall than any expert (red). Ground truth was computed using majority vote of the experts. (b) Shows
the number of workers (circle size) and average number of jobs per worker (circle color) for each precision/recall
range. Most workers have high precision; such workers generally also complete more jobs. For this plot ground
truth for each worker is the union of responses from all other AMT workers. See §4.4 for details.
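The per-worker points in Fig. 4(b) use a leave-one-out reference: each worker's responses are scored against the union of all other workers' responses. A minimal sketch of that computation, with an assumed data layout (sets of (image, category) pairs per worker):

```python
# Illustrative sketch: leave-one-out precision/recall per worker.
# `labels[w]` is the set of (image_id, category) pairs marked by worker w;
# this layout is an assumption made for the example.
def leave_one_out_pr(labels):
    scores = {}
    for worker, own in labels.items():
        reference = set().union(*(v for k, v in labels.items() if k != worker))
        true_pos = len(own & reference)
        precision = true_pos / len(own) if own else 1.0
        recall = true_pos / len(reference) if reference else 1.0
        scores[worker] = (precision, recall)
    return scores

labels = {
    "w1": {(1, "dog"), (1, "car"), (2, "chair")},
    "w2": {(1, "dog"), (2, "chair")},
    "w3": {(1, "dog"), (2, "chair"), (2, "cat")},
}
print(leave_one_out_pr(labels))
```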
We group the object categories into 11 super-categories
(see the appendix). For a given image, a worker was
presented with each group of categories in turn and
asked to indicate whether any instances exist for that
super-category. This greatly reduces the time needed to
classify the various categories. For example, a worker
may easily determine no animals are present in the im-
age without having to specifically look for cats, dogs, etc.
If a worker determines instances from the super-category
(animal) are present, for each subordinate category (dog,
cat, etc.) present, the worker must drag the category’s
icon onto the image over one instance of the category.
The placement of these icons is critical for the following
stage. We emphasize that only a single instance of each
category needs to be annotated in this stage. To ensure
high recall, 8 workers were asked to label each image. A
category is considered present if any worker indicated
the category; false positives are handled in subsequent
stages. A detailed analysis of performance is presented
in §4.4. This stage took 20k worker hours to complete.
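In pseudocode, the hierarchical step reduces the 91 per-image questions to one question per super-category plus follow-ups only for confirmed groups, and the final per-image label set is the union over the eight workers. The sketch below uses an illustrative super-category map, not the full 11-group list from the appendix.

```python
# Illustrative sketch of hierarchical category labeling and union aggregation.
# The super-category map is a small placeholder, not the full 11-group list.
super_categories = {
    "animal": ["dog", "cat", "horse"],
    "vehicle": ["car", "bicycle", "bus"],
    "furniture": ["chair", "couch", "bed"],
}

def label_image(worker_says_present):
    """worker_says_present(name) -> bool stands in for a worker's judgment
    in the labeling UI, for a super-category group or a single category."""
    found = []
    for group, members in super_categories.items():
        if not worker_says_present(group):   # e.g. "no animals in this image"
            continue                         # all subordinate questions skipped
        found.extend(c for c in members if worker_says_present(c))
    return found

# A category is considered present if any of the 8 workers indicates it;
# false positives are handled in later stages.
workers = [lambda n, t={"animal", "dog"}: n in t,
           lambda n, t={"animal", "dog", "vehicle", "car"}: n in t]
present = set()
for worker in workers:
    present |= set(label_image(worker))
print(present)   # {'dog', 'car'}
```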
4.2 Instance Spotting
In the next stage all instances of the object categories
in an image were labeled, Fig. 3(b). In the previous
stage each worker labeled one instance of a category, but
multiple object instances may exist. Therefore, for each
image, a worker was asked to place a cross on top of
each instance of a specific category found in the previous
stage. To boost recall, the location of the instance found
by a worker in the previous stage was shown to the
current worker. Such priming helped workers quickly
find an initial instance upon first seeing the image. The
workers could also use a magnifying glass to find small
instances. Each worker was asked to label at most 10
instances of a given category per image. Each image was
labeled by 8 workers for a total of 10k worker hours.
4.3 Instance Segmentation
Our final stage is the laborious task of segmenting each
object instance, Fig. 3(c). For this stage we modified
the excellent user interface developed by Bell et al. [16]
for image segmentation. Our interface asks the worker
to segment an object instance specified by a worker in
the previous stage. If other instances have already been
segmented in the image, those segmentations are shown
to the worker. A worker may also indicate there are
no object instances of the given category in the image
(implying a false positive label from the previous stage)
or that all object instances are already segmented.
Segmenting 2,500,000 object instances is an extremely
time consuming task requiring over 22 worker hours per
1,000 segmentations. To minimize cost we only had a
single worker segment each instance. However, when
first completing the task, most workers produced only
coarse instance outlines. As a consequence, we required
all workers to complete a training task for each object
category. The training task required workers to segment
an object instance. Workers could not complete the task
until their segmentation adequately matched the ground
truth. The use of a training task vastly improved the
quality of the workers (approximately 1 in 3 workers
passed the training stage) and resulting segmentations.
Example segmentations may be viewed in Fig. 6.
While the training task filtered out most bad workers,
we also performed an explicit verification step on each
segmented instance to ensure good quality. Multiple
workers (3 to 5) were asked to judge each segmentation
and indicate whether it matched the instance well or not.
Segmentations of insufficient quality were discarded and
the corresponding instances added back to the pool of
unsegmented objects. Finally, some approved workers
consistently produced poor segmentations; all work ob-
tained from such workers was discarded.
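A rough sketch of the verification bookkeeping follows; the approval rule (how many of the 3 to 5 judgments must be positive) is an assumption for illustration, as the text does not state the exact threshold. The final line is simply the back-of-the-envelope effort implied by the figures quoted above.

```python
# Illustrative sketch: verification of instance segmentations.
# The approval threshold and data layout are assumptions, not the paper's.
def verify_segmentations(judgments, min_votes=3, min_approval=0.5):
    """judgments: dict mapping segmentation id -> list of bool worker votes.
    Rejected instances are returned to the pool of unsegmented objects."""
    approved, rejected = [], []
    for seg_id, votes in judgments.items():
        enough = len(votes) >= min_votes
        if enough and sum(votes) / len(votes) >= min_approval:
            approved.append(seg_id)
        else:
            rejected.append(seg_id)
    return approved, rejected

print(verify_segmentations({"seg1": [True, True, False],
                            "seg2": [False, False, True, False]}))
# (['seg1'], ['seg2'])

# Effort implied by the numbers above: over 22 worker hours per 1,000
# segmentations across 2,500,000 instances.
print(2_500_000 / 1_000 * 22)   # ~55,000 worker hours for segmentation alone
```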
