Microsoft COCO: Common Objects in Context

TL;DR: A new dataset aims to advance the state of the art in object recognition by placing object recognition in the context of the broader question of scene understanding, gathering images of complex everyday scenes that contain common objects in their natural context.
Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

Summary (3 min read)

1 INTRODUCTION

  • One of the primary goals of computer vision is the understanding of visual scenes.
  • The authors introduce a new large-scale dataset that addresses three core research problems in scene understanding: detecting non-iconic views (or non-canonical perspectives [12]) of objects, contextual reasoning between objects and the precise 2D localization of objects.
  • For each category found, the individual instances were labeled, verified, and finally segmented.
  • Additionally, a critical distinction between their dataset and others is the number of labeled instances per image, which may aid in learning contextual information (Fig. 5). MS COCO contains considerably more object instances per image (7.7) as compared to ImageNet (3.0) and PASCAL (2.3).

3.1 Common Object Categories

  • The categories must form a representative set of all categories, be relevant to practical applications and occur with high enough frequency to enable the collection of a large dataset.
  • Other important decisions are whether to include both “thing” and “stuff” categories [39] and whether fine-grained [31], [1] and object-part categories should be included.
  • To enable the practical collection of a significant number of instances per category, the authors chose to limit their dataset to entry-level categories, i.e. category labels that are commonly used by humans when describing objects (dog, chair, person).
  • The final selection of categories attempts to pick categories with high votes, while keeping the number of categories per supercategory (animals, vehicles, furniture, etc.) balanced.

3.2 Non-iconic Image Collection

  • Given the list of object categories, their next goal was to collect a set of candidate images.
  • The authors' goal was to collect a dataset such that a majority of images are non-iconic, Fig. 2(c).
  • First, as popularized by PASCAL VOC [2], the authors collected images from Flickr, which tends to have fewer iconic images.
  • Surprisingly, these images typically do not just contain the two categories specified in the search, but numerous other categories as well.
  • The result is a collection of 328,000 images with rich contextual relationships between objects as shown in Figs. 2(c) and 6.

4 IMAGE ANNOTATION

  • The authors next describe how they annotated their image collection.
  • Due to their desire to label over 2.5 million object instances, the design of a cost efficient yet high quality annotation pipeline was critical.
  • For all crowdsourcing tasks the authors used workers on Amazon’s Mechanical Turk (AMT).
  • Note that, since the original version of this work [19], the authors have taken a number of steps to further improve the quality of the annotations.

4.1 Category Labeling

  • The first task in annotating their dataset is determining which object categories are present in each image, Fig. 3(a).
  • Since the authors have 91 categories and a large number of images, asking workers to answer 91 binary classification questions per image would be prohibitively expensive.
  • For a given image, a worker was presented with each group of categories in turn and asked to indicate whether any instances exist for that super-category.
  • This greatly reduces the time needed to classify the various categories.
  • The placement of these icons is critical for the following stage.

4.2 Instance Spotting

  • In the next stage all instances of the object categories in an image were labeled, Fig. 3(b).
  • To boost recall, the location of the instance found by a worker in the previous stage was shown to the current worker.
  • Such priming helped workers quickly find an initial instance upon first seeing the image.
  • The workers could also use a magnifying glass to find small instances.
  • Each image was labeled by 8 workers for a total of ∼10k worker hours.

4.3 Instance Segmentation

  • The authors' final stage is the laborious task of segmenting each object instance, Fig. 3(c).
  • To minimize cost the authors only had a single worker segment each instance.
  • The training task required workers to segment an object instance.
  • Workers could not complete the task until their segmentation adequately matched the ground truth.
  • After 10-15 instances of a category were segmented in an image, the remaining instances were marked as “crowds” using a single (possibly multipart) segment.

4.4 Annotation Performance Analysis

  • The authors analyzed crowd worker quality on the category labeling task by comparing to dedicated expert workers, see Fig. 4(a).
  • Ground truth was computed using majority vote of the experts.
  • Fig. 4(a) shows that the union of 8 AMT workers, the same number as was used to collect their labels, achieved greater recall than any of the expert workers (a toy version of this computation is sketched after this list).
  • Note that a similar analysis may be done for instance spotting in which 8 annotators were also used.
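
A toy version of the analysis in Fig. 4(a) can be written in a few lines. The sketch below is not the authors' code; it assumes per-annotator label sets of (image, category) pairs, builds ground truth by majority vote over the experts, and scores the union of the crowd workers against it.

```python
# Illustrative sketch: precision/recall of the union of several crowd workers'
# category labels against ground truth formed by majority vote over experts.
# Labels are represented as sets of (image_id, category) pairs.
from collections import Counter

def majority_vote(expert_labels):
    """expert_labels: list of sets of (image_id, category) pairs, one per expert."""
    counts = Counter(pair for labels in expert_labels for pair in labels)
    n = len(expert_labels)
    return {pair for pair, c in counts.items() if c > n / 2}

def union_precision_recall(worker_labels, ground_truth):
    """worker_labels: list of sets of (image_id, category) pairs, one per worker."""
    union = set().union(*worker_labels)
    tp = len(union & ground_truth)
    precision = tp / len(union) if union else 1.0
    recall = tp / len(ground_truth) if ground_truth else 1.0
    return precision, recall

# Toy data: three experts, three workers.
experts = [{(1, "dog"), (1, "car")}, {(1, "dog")}, {(1, "dog"), (2, "cat")}]
workers = [{(1, "dog")}, {(1, "car")}, {(2, "cat")}]
gt = majority_vote(experts)                  # {(1, "dog")}
print(union_precision_recall(workers, gt))   # the union trades precision for recall
```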

5 DATASET STATISTICS

  • Next, the authors analyze the properties of the Microsoft Common Objects in COntext (MS COCO) dataset in comparison to several other popular datasets.
  • On average their dataset contains 3.5 categories and 7.7 instances per image.
  • Another interesting observation is that only 10% of the images in MS COCO contain a single object category per image; in comparison, over 60% of images contain a single object category in ImageNet and PASCAL VOC (a sketch of computing such per-image statistics follows this list).
  • Generally smaller objects are harder to recognize and require more contextual reasoning to recognize.
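
The per-image statistics quoted above can be recomputed from the released annotations. The sketch below assumes a COCO-style JSON file with "images" and "annotations" lists, where each annotation carries an image_id and a category_id; the file name is a placeholder, not a path the paper specifies.

```python
# Sketch: per-image statistics (categories per image, instances per image)
# from a COCO-style annotation file. Assumes the released JSON layout with
# "images" and "annotations" lists; the file name is just a placeholder.
import json
from collections import defaultdict

with open("instances_train.json") as f:          # hypothetical path
    data = json.load(f)

cats_per_image = defaultdict(set)
insts_per_image = defaultdict(int)
for ann in data["annotations"]:
    cats_per_image[ann["image_id"]].add(ann["category_id"])
    insts_per_image[ann["image_id"]] += 1

n_images = len(data["images"])
avg_cats = sum(len(s) for s in cats_per_image.values()) / n_images
avg_insts = sum(insts_per_image.values()) / n_images
single_cat = sum(1 for s in cats_per_image.values() if len(s) == 1) / n_images

print(f"avg categories/image: {avg_cats:.1f}")    # paper reports ~3.5
print(f"avg instances/image:  {avg_insts:.1f}")   # paper reports ~7.7
print(f"single-category images: {100 * single_cat:.0f}%")
```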

6 DATASET SPLITS

  • To accommodate a faster release schedule, the authors split the MS COCO dataset into two roughly equal parts.
  • The authors took care to minimize the chance of near-duplicate images existing across splits by explicitly removing near duplicates (detected with [43]) and grouping images by photographer and date taken (a sketch of such group-aware splitting follows this list).
  • The authors are currently finalizing the evaluation server for automatic evaluation on the test set.
  • The authors did not collect segmentations for the following 11 categories: hat, shoe, eyeglasses (too many instances), mirror, window, door, street sign (ambiguous and difficult to label), plate, desk (due to confusion with bowl and dining table, respectively) and blender, hair brush (too few instances).
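
The grouping idea behind the split is easy to sketch. The code below is an illustration under assumed field names, not the authors' pipeline, and it omits the near-duplicate detection step ([43]) entirely: it only shows how whole photographer/date groups are kept on one side of the split.

```python
# Sketch of a group-aware split: all photos from the same photographer on the
# same day stay in one split, so near-duplicate shots cannot straddle the
# train/val boundary. Field names are illustrative, not the authors' schema.
import random
from collections import defaultdict

def split_by_group(images, val_fraction=0.5, seed=0):
    """images: list of dicts with 'id', 'photographer', 'date_taken'."""
    groups = defaultdict(list)
    for img in images:
        groups[(img["photographer"], img["date_taken"])].append(img["id"])

    keys = list(groups)
    random.Random(seed).shuffle(keys)

    val_ids, target = set(), val_fraction * len(images)
    for key in keys:
        if len(val_ids) >= target:
            break
        val_ids.update(groups[key])

    train_ids = {img["id"] for img in images} - val_ids
    return train_ids, val_ids
```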

7 ALGORITHMIC ANALYSIS

  • For the following experiments the authors take a subset of 55,000 images from their dataset and obtain tight-fitting bounding boxes from the annotated segmentation masks.
  • Consistent with past observations [46], the authors find that including difficult (non-iconic) images during training may not always help.
  • These observations support two hypotheses: 1) MS COCO is significantly more difficult than PASCAL VOC and 2) models trained on MS COCO can generalize better to easier datasets such as PASCAL VOC given more training data.
  • The authors then measure the intersection over union of the predicted and ground truth segmentation masks, see Fig.
  • To establish a baseline for their dataset, the authors project learned DPM part masks onto the image to create segmentation masks.
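
Two steps of this analysis lend themselves to a short sketch: deriving a tight bounding box from a binary segmentation mask, and scoring a predicted mask against a ground-truth mask with intersection over union. The code below is an illustrative reimplementation, not the evaluation code used for the paper's baselines.

```python
# Sketch: tight bounding box from a binary mask, and mask-level IoU.
import numpy as np

def mask_to_bbox(mask):
    """mask: 2D boolean array. Returns (x_min, y_min, x_max, y_max)."""
    ys, xs = np.where(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()

def mask_iou(pred, gt):
    """pred, gt: 2D boolean arrays of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

# Toy example: two overlapping square masks.
a = np.zeros((10, 10), dtype=bool)
a[2:6, 2:6] = True
b = np.zeros((10, 10), dtype=bool)
b[4:8, 4:8] = True
print(mask_to_bbox(a))   # (2, 2, 5, 5)
print(mask_iou(a, b))    # 4 / 28, roughly 0.14
```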

8 DISCUSSION

  • The authors introduced a new dataset for detecting and segmenting objects found in everyday life in their natural environments.
  • Dataset statistics indicate the images contain rich contextual information with many objects present per image.
  • To download and learn more about MS COCO please see the project website.

Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár
Abstract—We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of
object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex
everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in
precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4 year old. With a
total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via
novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of
the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and
segmentation detection results using a Deformable Parts Model.
1 INTRODUCTION
One of the primary goals of computer vision is the
understanding of visual scenes. Scene understanding
involves numerous tasks including recognizing what
objects are present, localizing the objects in 2D and 3D,
determining the objects’ and scene’s attributes, charac-
terizing relationships between objects and providing a
semantic description of the scene. The current object clas-
sification and detection datasets [1], [2], [3], [4] help us
explore the first challenges related to scene understand-
ing. For instance the ImageNet dataset [1], which con-
tains an unprecedented number of images, has recently
enabled breakthroughs in both object classification and
detection research [5], [6], [7]. The community has also
created datasets containing object attributes [8], scene
attributes [9], keypoints [10], and 3D scene information
[11]. This leads us to the obvious question: what datasets
will best continue our advance towards our ultimate goal
of scene understanding?
We introduce a new large-scale dataset that addresses
three core research problems in scene understanding: de-
tecting non-iconic views (or non-canonical perspectives
[12]) of objects, contextual reasoning between objects
and the precise 2D localization of objects. For many
categories of objects, there exists an iconic view. For
example, when performing a web-based image search
for the object category “bike,” the top-ranked retrieved
examples appear in profile, unobstructed near the cen-
ter of a neatly composed photo. We posit that current
recognition systems perform fairly well on iconic views,
but struggle to recognize objects otherwise in the
T.Y. Lin and S. Belongie are with Cornell NYC Tech and the Cornell
Computer Science Department.
M. Maire is with the Toyota Technological Institute at Chicago.
L. Bourdev and P. Dollár are with Facebook AI Research. The majority of
this work was performed while P. Dollár was with Microsoft Research.
R. Girshick and C. L. Zitnick are with Microsoft Research, Redmond.
J. Hays is with Brown University.
P. Perona is with the California Institute of Technology.
D. Ramanan is with the University of California at Irvine.
Fig. 1: While previous object recognition datasets have
focused on (a) image classification, (b) object bounding
box localization or (c) semantic pixel-level segmentation,
we focus on (d) segmenting individual object instances.
We introduce a large, richly-annotated dataset comprised
of images depicting complex everyday scenes of com-
mon objects in their natural context.
background, partially occluded, amid clutter [13] re-
flecting the composition of actual everyday scenes. We
verify this experimentally; when evaluated on everyday
scenes, models trained on our data perform better than
those trained with prior datasets. A challenge is finding
natural images that contain multiple objects. The identity
of many objects can only be resolved using context, due
to small size or ambiguous appearance in the image. To
push research in contextual reasoning, images depicting
scenes [3] rather than objects in isolation are necessary.
Finally, we argue that detailed spatial understanding of
object layout will be a core component of scene analysis.
An object’s spatial location can be defined coarsely using
a bounding box [2] or with a precise pixel-level segmen-
tation [14], [15], [16]. As we demonstrate, to measure
either kind of localization performance it is essential
for the dataset to have every instance of every object
category labeled and fully segmented. Our dataset is
unique in its annotation of instance-level segmentation
masks, Fig. 1.
To create a large-scale dataset that accomplishes these
three goals we employed a novel pipeline for gathering
data with extensive use of Amazon Mechanical Turk.
First and most importantly, we harvested a large set
of images containing contextual relationships and non-
iconic object views. We accomplished this using a sur-
prisingly simple yet effective technique that queries for
pairs of objects in conjunction with images retrieved
via scene-based queries [17], [3]. Next, each image was
labeled as containing particular object categories using
a hierarchical labeling approach [18]. For each category
found, the individual instances were labeled, verified,
and finally segmented. Given the inherent ambiguity of
labeling, each of these stages has numerous tradeoffs that
we explored in detail.
The Microsoft Common Objects in COntext (MS
COCO) dataset contains 91 common object categories
with 82 of them having more than 5,000 labeled in-
stances, Fig. 6. In total the dataset has 2,500,000 labeled
instances in 328,000 images. In contrast to the popular
ImageNet dataset [1], COCO has fewer categories but
more instances per category. This can aid in learning
detailed object models capable of precise 2D localization.
The dataset is also significantly larger in number of in-
stances per category than the PASCAL VOC [2] and SUN
[3] datasets. Additionally, a critical distinction between
our dataset and others is the number of labeled instances
per image which may aid in learning contextual informa-
tion, Fig. 5. MS COCO contains considerably more object
instances per image (7.7) as compared to ImageNet (3.0)
and PASCAL (2.3). In contrast, the SUN dataset, which
contains significant contextual information, has over 17
objects and “stuff” per image but considerably fewer
object instances overall.
An abridged version of this work appeared in [19].
2 RELATED WORK
Throughout the history of computer vision research
datasets have played a critical role. They not only pro-
vide a means to train and evaluate algorithms, they
drive research in new and more challenging directions.
The creation of ground truth stereo and optical flow
datasets [20], [21] helped stimulate a flood of interest
in these areas. The early evolution of object recognition
datasets [22], [23], [24] facilitated the direct comparison
of hundreds of image recognition algorithms while si-
multaneously pushing the field towards more complex
problems. Recently, the ImageNet dataset [1] containing
millions of images has enabled breakthroughs in both
object classification and detection research using a new
class of deep learning algorithms [5], [6], [7].
Datasets related to object recognition can be roughly
split into three groups: those that primarily address
object classification, object detection and semantic scene
labeling. We address each in turn.
Image Classification The task of object classification
requires binary labels indicating whether objects are
present in an image; see Fig. 1(a). Early datasets of this
type comprised images containing a single object with
blank backgrounds, such as the MNIST handwritten
digits [25] or COIL household objects [26]. Caltech 101
[22] and Caltech 256 [23] marked the transition to more
realistic object images retrieved from the internet while
also increasing the number of object categories to 101
and 256, respectively. Popular datasets in the machine
learning community due to the larger number of training
examples, CIFAR-10 and CIFAR-100 [27] offered 10 and
100 categories from a dataset of tiny 32 × 32 images [28].
While these datasets contained up to 60,000 images and
hundreds of categories, they still only captured a small
fraction of our visual world.
Recently, ImageNet [1] made a striking departure from
the incremental increase in dataset sizes. They proposed
the creation of a dataset containing 22k categories with
500-1000 images each. Unlike previous datasets contain-
ing entry-level categories [29], such as “dog” or “chair,”
like [28], ImageNet used the WordNet Hierarchy [30] to
obtain both entry-level and fine-grained [31] categories.
Currently, the ImageNet dataset contains over 14 million
labeled images and has enabled significant advances in
image classification [5], [6], [7].
Object detection Detecting an object entails both
stating that an object belonging to a specified class is
present, and localizing it in the image. The location of
an object is typically represented by a bounding box,
Fig. 1(b). Early algorithms focused on face detection [32]
using various ad hoc datasets. Later, more realistic and
challenging face detection datasets were created [33].
Another popular challenge is the detection of pedestri-
ans for which several datasets have been created [24],
[4]. The Caltech Pedestrian Dataset [4] contains 350,000
labeled instances with bounding boxes.
For the detection of basic object categories, a multi-
year effort from 2005 to 2012 was devoted to the creation
and maintenance of a series of benchmark datasets that
were widely adopted. The PASCAL VOC [2] datasets
contained 20 object categories spread over 11,000 images.
Over 27,000 object instance bounding boxes were la-
beled, of which almost 7,000 had detailed segmentations.
Recently, a detection challenge has been created from 200
object categories using a subset of 400,000 images from
ImageNet [34]. An impressive 350,000 objects have been
labeled using bounding boxes.
Since the detection of many objects such as sunglasses,
cellphones or chairs is highly dependent on contextual
information, it is important that detection datasets con-
tain objects in their natural environments. In our dataset
we strive to collect images rich in contextual information.
The use of bounding boxes also limits the accuracy
for which detection algorithms may be evaluated. We
propose the use of fully segmented instances to enable
more accurate detector evaluation.

Fig. 2: Example of (a) iconic object images, (b) iconic scene images, and (c) non-iconic images.
Semantic scene labeling The task of labeling se-
mantic objects in a scene requires that each pixel of an
image be labeled as belonging to a category, such as
sky, chair, floor, street, etc. In contrast to the detection
task, individual instances of objects do not need to be
segmented, Fig. 1(c). This enables the labeling of objects
for which individual instances are hard to define, such
as grass, streets, or walls. Datasets exist for both indoor
[11] and outdoor [35], [14] scenes. Some datasets also
include depth information [11]. Similar to semantic scene
labeling, our goal is to measure the pixel-wise accuracy
of object labels. However, we also aim to distinguish
between individual instances of an object, which requires
a solid understanding of each object’s extent.
A novel dataset that combines many of the properties
of both object detection and semantic scene labeling
datasets is the SUN dataset [3] for scene understanding.
SUN contains 908 scene categories from the WordNet
dictionary [30] with segmented objects. The 3,819 ob-
ject categories span those common to object detection
datasets (person, chair, car) and to semantic scene la-
beling (wall, sky, floor). Since the dataset was collected
by finding images depicting various scene types, the
number of instances per object category exhibits the long
tail phenomenon. That is, a few categories have a large
number of instances (wall: 20,213, window: 16,080, chair:
7,971) while most have a relatively modest number of
instances (boat: 349, airplane: 179, floor lamp: 276). In
our dataset, we ensure that each object category has a
significant number of instances, Fig. 5.
Other vision datasets Datasets have spurred the ad-
vancement of numerous fields in computer vision. Some
notable datasets include the Middlebury datasets for
stereo vision [20], multi-view stereo [36] and optical flow
[21]. The Berkeley Segmentation Data Set (BSDS500) [37]
has been used extensively to evaluate both segmentation
and edge detection algorithms. Datasets have also been
created to recognize both scene [9] and object attributes
[8], [38]. Indeed, numerous areas of vision have benefited
from challenging datasets that helped catalyze progress.
3 IMAGE COLLECTION
We next describe how the object categories and candi-
date images are selected.
3.1 Common Object Categories
The selection of object categories is a non-trivial exercise.
The categories must form a representative set of all
categories, be relevant to practical applications and occur
with high enough frequency to enable the collection of
a large dataset. Other important decisions are whether
to include both “thing” and “stuff” categories [39] and
whether fine-grained [31], [1] and object-part categories
should be included. “Thing” categories include objects
for which individual instances may be easily labeled
(person, chair, car) where “stuff” categories include
materials and objects with no clear boundaries (sky,
street, grass). Since we are primarily interested in pre-
cise localization of object instances, we decided to only
include “thing” categories and not “stuff.” However,
since “stuff” categories can provide significant contex-
tual information, we believe the future labeling of “stuff”
categories would be beneficial.
The specificity of object categories can vary signifi-
cantly. For instance, a dog could be a member of the
“mammal”, “dog”, or “German shepherd” categories. To
enable the practical collection of a significant number
of instances per category, we chose to limit our dataset
to entry-level categories, i.e. category labels that are
commonly used by humans when describing objects
(dog, chair, person). It is also possible that some object
categories may be parts of other object categories. For in-
stance, a face may be part of a person. We anticipate the
inclusion of object-part categories (face, hands, wheels)
would be beneficial for many real-world applications.
We used several sources to collect entry-level object
categories of “things.” We first compiled a list of cate-
gories by combining categories from PASCAL VOC [2]
and a subset of the 1200 most frequently used words
that denote visually identifiable objects [40]. To further
augment our set of candidate categories, several children
ranging in ages from 4 to 8 were asked to name every object they see in indoor and outdoor environments.

Fig. 3: Our annotation pipeline is split into 3 primary tasks: (a) labeling the categories present in the image (§4.1), (b) locating and marking all instances of the labeled categories (§4.2), and (c) segmenting each object instance (§4.3).
The final 272 candidates may be found in the appendix.
Finally, the co-authors voted on a 1 to 5 scale for each
category taking into account how commonly they oc-
cur, their usefulness for practical applications, and their
diversity relative to other categories. The final selec-
tion of categories attempts to pick categories with high
votes, while keeping the number of categories per super-
category (animals, vehicles, furniture, etc.) balanced. Cat-
egories for which obtaining a large number of instances
(greater than 5,000) was difficult were also removed.
To ensure backwards compatibility all categories from
PASCAL VOC [2] are also included. Our final list of 91
proposed categories is in Fig. 5(a).
3.2 Non-iconic Image Collection
Given the list of object categories, our next goal was to
collect a set of candidate images. We may roughly group
images into three types, Fig. 2: iconic-object images [41],
iconic-scene images [3] and non-iconic images. Typical
iconic-object images have a single large object in a
canonical perspective centered in the image, Fig. 2(a).
Iconic-scene images are shot from canonical viewpoints
and commonly lack people, Fig. 2(b). Iconic images have
the benefit that they may be easily found by directly
searching for specific categories using Google or Bing
image search. While iconic images generally provide
high quality object instances, they can lack important
contextual information and non-canonical viewpoints.
Our goal was to collect a dataset such that a majority
of images are non-iconic, Fig. 2(c). It has been shown that
datasets containing more non-iconic images are better at
generalizing [42]. We collected non-iconic images using
two strategies. First as popularized by PASCAL VOC
[2], we collected images from Flickr which tends to have
fewer iconic images. Flickr contains photos uploaded by
amateur photographers with searchable metadata and
keywords. Second, we did not search for object cate-
gories in isolation. A search for “dog” will tend to return
iconic images of large, centered dogs. However, if we
searched for pairwise combinations of object categories,
such as “dog + car” we found many more non-iconic
images. Surprisingly, these images typically do not just
contain the two categories specified in the search, but nu-
merous other categories as well. To further supplement
our dataset we also searched for scene/object category
pairs, see the appendix. We downloaded at most 5
photos taken by a single photographer within a short
time window. In the rare cases in which enough images
could not be found, we searched for single categories
and performed an explicit filtering stage to remove iconic
images. The result is a collection of 328,000 images with
rich contextual relationships between objects as shown
in Figs. 2(c) and 6.
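
The pairwise-query strategy can be illustrated with a few lines of code. The sketch below simply enumerates object-object and scene-object query strings; the category and scene lists are toy placeholders, and the actual Flickr search, per-photographer download limits and filtering steps described above are omitted.

```python
# Sketch of the pairwise-query idea for surfacing non-iconic images: combine
# each category with other object categories and, as a supplement, with scene
# types, rather than searching for a category alone.
from itertools import combinations

categories = ["dog", "car", "bicycle", "chair", "person"]   # toy list
scenes = ["kitchen", "street", "beach"]                      # toy list

object_pair_queries = [f"{a} + {b}" for a, b in combinations(categories, 2)]
scene_object_queries = [f"{scene} {cat}" for scene in scenes for cat in categories]

print(object_pair_queries[:3])   # ['dog + car', 'dog + bicycle', 'dog + chair']
print(scene_object_queries[:3])  # ['kitchen dog', 'kitchen car', 'kitchen bicycle']
```
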
4 IMAGE ANNOTATION
We next describe how we annotated our image collec-
tion. Due to our desire to label over 2.5 million object
instances, the design of a cost efficient yet high quality
annotation pipeline was critical. The annotation pipeline
is outlined in Fig. 3. For all crowdsourcing tasks we
used workers on Amazon’s Mechanical Turk (AMT). Our
user interfaces are described in detail in the appendix.
Note that, since the original version of this work [19],
we have taken a number of steps to further improve
the quality of the annotations. In particular, we have
increased the number of annotators for the category
labeling and instance spotting stages to eight. We also
added a stage to verify the instance segmentations.
4.1 Category Labeling
The first task in annotating our dataset is determin-
ing which object categories are present in each image,
Fig. 3(a). Since we have 91 categories and a large number
of images, asking workers to answer 91 binary clas-
sification questions per image would be prohibitively
expensive. Instead, we used a hierarchical approach [18].

Fig. 4: Worker precision and recall for the category labeling task. (a) The union of multiple AMT workers (blue)
has better recall than any expert (red). Ground truth was computed using majority vote of the experts. (b) Shows
the number of workers (circle size) and average number of jobs per worker (circle color) for each precision/recall
range. Most workers have high precision; such workers generally also complete more jobs. For this plot ground
truth for each worker is the union of responses from all other AMT workers. See §4.4 for details.
We group the object categories into 11 super-categories
(see the appendix). For a given image, a worker was
presented with each group of categories in turn and
asked to indicate whether any instances exist for that
super-category. This greatly reduces the time needed to
classify the various categories. For example, a worker
may easily determine no animals are present in the im-
age without having to specifically look for cats, dogs, etc.
If a worker determines instances from the super-category
(animal) are present, for each subordinate category (dog,
cat, etc.) present, the worker must drag the category’s
icon onto the image over one instance of the category.
The placement of these icons is critical for the following
stage. We emphasize that only a single instance of each
category needs to be annotated in this stage. To ensure
high recall, 8 workers were asked to label each image. A
category is considered present if any worker indicated
the category; false positives are handled in subsequent
stages. A detailed analysis of performance is presented
in §4.4. This stage took 20k worker hours to complete.
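
A small sketch makes the savings from the hierarchical scheme concrete. The toy hierarchy below is illustrative only (the dataset actually groups the 91 categories into 11 super-categories); it counts how many prompts a worker sees for one image under the two-level scheme, versus one binary question per category in a flat scheme.

```python
# Sketch of why hierarchical labeling is cheaper than 91 flat yes/no questions:
# a worker first answers one question per super-category, then only sees the
# subordinate categories of super-categories marked present.
hierarchy = {   # toy fragment, not the full 11-group hierarchy
    "animal": ["dog", "cat", "horse", "bird"],
    "vehicle": ["car", "bicycle", "bus", "truck"],
    "furniture": ["chair", "couch", "bed", "dining table"],
}

def questions_asked(present_supercats):
    """Count prompts shown for one image under the two-level scheme."""
    top_level = len(hierarchy)                       # one prompt per super-category
    second_level = sum(len(hierarchy[s]) for s in present_supercats)
    return top_level + second_level

# An image containing only animals: 3 super-category prompts + 4 animal prompts
# = 7, versus 12 flat questions for this toy hierarchy (91 in the real dataset).
print(questions_asked({"animal"}))
```
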
4.2 Instance Spotting
In the next stage all instances of the object categories
in an image were labeled, Fig. 3(b). In the previous
stage each worker labeled one instance of a category, but
multiple object instances may exist. Therefore, for each
image, a worker was asked to place a cross on top of
each instance of a specific category found in the previous
stage. To boost recall, the location of the instance found
by a worker in the previous stage was shown to the
current worker. Such priming helped workers quickly
find an initial instance upon first seeing the image. The
workers could also use a magnifying glass to find small
instances. Each worker was asked to label at most 10
instances of a given category per image. Each image was
labeled by 8 workers for a total of 10k worker hours.
4.3 Instance Segmentation
Our final stage is the laborious task of segmenting each
object instance, Fig. 3(c). For this stage we modified
the excellent user interface developed by Bell et al. [16]
for image segmentation. Our interface asks the worker
to segment an object instance specified by a worker in
the previous stage. If other instances have already been
segmented in the image, those segmentations are shown
to the worker. A worker may also indicate there are
no object instances of the given category in the image
(implying a false positive label from the previous stage)
or that all object instances are already segmented.
Segmenting 2,500,000 object instances is an extremely
time consuming task requiring over 22 worker hours per
1,000 segmentations. To minimize cost we only had a
single worker segment each instance. However, when
first completing the task, most workers produced only
coarse instance outlines. As a consequence, we required
all workers to complete a training task for each object
category. The training task required workers to segment
an object instance. Workers could not complete the task
until their segmentation adequately matched the ground
truth. The use of a training task vastly improved the
quality of the workers (approximately 1 in 3 workers
passed the training stage) and resulting segmentations.
Example segmentations may be viewed in Fig. 6.
While the training task filtered out most bad workers,
we also performed an explicit verification step on each
segmented instance to ensure good quality. Multiple
workers (3 to 5) were asked to judge each segmentation
and indicate whether it matched the instance well or not.
Segmentations of insufficient quality were discarded and
the corresponding instances added back to the pool of
unsegmented objects. Finally, some approved workers
consistently produced poor segmentations; all work ob-
tained from such workers was discarded.
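
The verification step amounts to a vote over worker judgements. In the sketch below the acceptance rule (simple majority) is an assumption, since the paper states only that 3 to 5 workers judged each segmentation; rejected instances are returned to the pool of unsegmented objects.

```python
# Sketch of segmentation verification: workers vote on each mask, and instances
# whose masks are not accepted go back into the unsegmented pool.
def verify_segmentations(judgements, unsegmented_pool):
    """judgements: dict mapping instance_id -> list of bool votes (True = good)."""
    approved = set()
    for instance_id, votes in judgements.items():
        if sum(votes) > len(votes) / 2:        # assumed acceptance rule
            approved.add(instance_id)
        else:
            unsegmented_pool.add(instance_id)  # re-queue for segmentation
    return approved

pool = set()
print(verify_segmentations({"a": [True, True, False], "b": [False, False, True]}, pool))
print(pool)   # {'b'} goes back to the unsegmented pool
```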

Citations
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

123,388 citations

Posted Content
TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

44,703 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations

Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations

Journal ArticleDOI
TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals and further merge RPN and Fast R-CNN into a single network by sharing their convolutionAL features.
Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with ’attention’ mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3] , our detection system has a frame rate of 5 fps ( including all steps ) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.

26,458 citations


Cites methods from "Microsoft COCO: Common Objects in C..."

  • ...Region proposal methods typically rely on inexpensive features and economical inference schemes....


References
01 Oct 2008
TL;DR: The database contains labeled face photographs spanning the range of conditions typically encountered in everyday life, and exhibits “natural” variability in factors such as pose, lighting, race, accessories, occlusions, and background.
Abstract: Most face databases have been created under controlled conditions to facilitate the study of specific parameters on the face recognition problem. These parameters include such variables as position, pose, lighting, background, camera quality, and gender. While there are many applications for face recognition technology in which one can control the parameters of image acquisition, there are also many applications in which the practitioner has little or no control over such parameters. This database, Labeled Faces in the Wild, is provided as an aid in studying the latter, unconstrained, recognition problem. The database contains labeled face photographs spanning the range of conditions typically encountered in everyday life. The database exhibits “natural” variability in factors such as pose, lighting, race, accessories, occlusions, and background. In addition to describing the details of the database, we provide specific experimental paradigms for which the database is suitable. This is done in an effort to make research performed with the database as consistent and comparable as possible. We provide baseline results, including results of a state of the art face recognition system combined with a face alignment system. To facilitate experimentation on the database, we provide several parallel databases, including an aligned version.

5,742 citations


"Microsoft COCO: Common Objects in C..." refers background in this paper

  • ...ly represented by a bounding box, Figure1(b). Early algorithms focused on face detection [27] using various ad hoc datasets. Later, more realistic and challenging face detection datasets were created [28]. Another popular challenge is the detection of pedestrians for which several datasets have been created [29], [4]. The Caltech Pedestrian Dataset [4] contains 350,000 labeled instances with bounding ...


Journal ArticleDOI
TL;DR: This paper investigates two fundamental problems in computer vision: contour detection and image segmentation and presents state-of-the-art algorithms for both of these tasks.
Abstract: This paper investigates two fundamental problems in computer vision: contour detection and image segmentation. We present state-of-the-art algorithms for both of these tasks. Our contour detector combines multiple local cues into a globalization framework based on spectral clustering. Our segmentation algorithm consists of generic machinery for transforming the output of any contour detector into a hierarchical region tree. In this manner, we reduce the problem of image segmentation to that of contour detection. Extensive experimental evaluation demonstrates that both our contour detection and segmentation methods significantly outperform competing algorithms. The automatically generated hierarchical segmentations can be interactively refined by user-specified annotations. Computation at multiple image resolutions provides a means of coupling our system to recognition applications.

5,068 citations


Additional excerpts

  • ...The Berkeley Segmentation Data Set (BSDS500) [37] has been used extensively to evaluate both segmentation and edge detection algorithms....


Book ChapterDOI
07 Oct 2012
TL;DR: The goal is to parse typical, often messy, indoor scenes into floor, walls, supporting surfaces, and object regions, and to recover support relationships, to better understand how 3D cues can best inform a structured 3D interpretation.
Abstract: We present an approach to interpret the major surfaces, objects, and support relations of an indoor scene from an RGBD image. Most existing work ignores physical interactions or is applied only to tidy rooms and hallways. Our goal is to parse typical, often messy, indoor scenes into floor, walls, supporting surfaces, and object regions, and to recover support relationships. One of our main interests is to better understand how 3D cues can best inform a structured 3D interpretation. We also contribute a novel integer programming formulation to infer physical support relations. We offer a new dataset of 1449 RGBD images, capturing 464 diverse indoor scenes, with detailed annotations. Our experiments demonstrate our ability to infer support relations in complex scenes and verify that our 3D scene cues and inferred support lead to better object segmentation.

4,827 citations


"Microsoft COCO: Common Objects in C..." refers background in this paper

  • ...The community has also created datasets containing object attributes [8], scene attributes [9], keypoints [10], and 3D scene information [11]....


  • ...Some datasets also include depth information [11]....


  • ...Datasets exist for both indoor [11] and outdoor [35], [14] scenes....


Journal ArticleDOI
TL;DR: In this article, a large collection of images with ground truth labels is built to be used for object detection and recognition research, such data is useful for supervised learning and quantitative evaluation.
Abstract: We seek to build a large collection of images with ground truth labels to be used for object detection and recognition research. Such data is useful for supervised learning and quantitative evaluation. To achieve this, we developed a web-based tool that allows easy image annotation and instant sharing of such annotations. Using this annotation tool, we have collected a large dataset that spans many object categories, often containing multiple instances over a wide variety of images. We quantify the contents of the dataset and compare against existing state of the art datasets used for object recognition and detection. Also, we show how to extend the dataset to automatically enhance object labels with WordNet, discover object parts, recover a depth ordering of objects in a scene, and increase the number of labels using minimal user supervision and images from the web.

3,501 citations

Journal ArticleDOI
TL;DR: An extensive evaluation of the state of the art in a unified framework of monocular pedestrian detection using sixteen pretrained state-of-the-art detectors across six data sets and proposes a refined per-frame evaluation methodology.
Abstract: Pedestrian detection is a key problem in computer vision, with several applications that have the potential to positively impact quality of life. In recent years, the number of approaches to detecting pedestrians in monocular images has grown steadily. However, multiple data sets and widely varying evaluation protocols are used, making direct comparisons difficult. To address these shortcomings, we perform an extensive evaluation of the state of the art in a unified framework. We make three primary contributions: 1) We put together a large, well-annotated, and realistic monocular pedestrian detection data set and study the statistics of the size, position, and occlusion patterns of pedestrians in urban scenes, 2) we propose a refined per-frame evaluation methodology that allows us to carry out probing and informative comparisons, including measuring performance in relation to scale and occlusion, and 3) we evaluate the performance of sixteen pretrained state-of-the-art detectors across six data sets. Our study allows us to assess the state of the art and provides a framework for gauging future efforts. Our experiments show that despite significant progress, performance still has much room for improvement. In particular, detection is disappointing at low resolutions and for partially occluded pedestrians.

3,170 citations


"Microsoft COCO: Common Objects in C..." refers background in this paper

  • ...Many object detection algorithms benefit from additional annotations, such as the amount an instance is occluded [4] or the location of keypoints on the object [10]....


  • ...Another popular challenge is the detection of pedestrians for which several datasets have been created [24], [4]....


  • ...The Caltech Pedestrian Dataset [4] contains 350,000 labeled instances with bounding boxes....


  • ...The current object classification and detection datasets [1], [2], [3], [4] help us explore the first challenges related to scene understanding....


Frequently Asked Questions (15)
Q1. What have the authors contributed in "Microsoft coco: common objects in context" ?

The authors present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. The authors present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, the authors provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model. 

Since the detection of many objects such as sunglasses, cellphones or chairs is highly dependent on contextual information, it is important that detection datasets contain objects in their natural environments. 

By observing how recall increased as the authors added annotators, the authors estimate that in practice over 99% of all object categories not later rejected as false positives are detected given 8 annotators.

Segmenting 2,500,000 object instances is an extremely time consuming task requiring over 22 worker hours per 1,000 segmentations. 

The task of labeling semantic objects in a scene requires that each pixel of an image be labeled as belonging to a category, such as sky, chair, floor, street, etc. 

Utilizing over 70,000 worker hours, a vast collection of object instances was gathered, annotated and organized to drive the advancement of object detection and segmentation algorithms. 

Segmentations of insufficient quality were discarded and the corresponding instances added back to the pool of unsegmented objects. 

After 10-15 instances of a category were segmented in an image, the remaining instances were marked as “crowds” using a single (possibly multipart) segment. 

For the detection of basic object categories, a multiyear effort from 2005 to 2012 was devoted to the creation and maintenance of a series of benchmark datasets that were widely adopted. 

If a worker determines instances from the super-category (animal) are present, for each subordinate category (dog, cat, etc.) present, the worker must drag the category’s icon onto the image over one instance of the category. 

Since the authors are primarily interested in precise localization of object instances, the authors decided to only include “thing” categories and not “stuff.” 

For images containing 10 object instances or fewer of a given category, every instance was individually segmented (note that in some images up to 15 instances were segmented). 

“Thing” categories include objects for which individual instances may be easily labeled (person, chair, car) where “stuff” categories include materials and objects with no clear boundaries (sky, street, grass). 

Such examples may act as noise and pollute the learned model if the model is not rich enough to capture such appearance variability. 

Another interesting observation is only 10% of the images in MS COCO have only one category per image, in comparison, over 60% of images contain a single object category in ImageNet and PASCAL VOC.