
Book Chapter DOI

Microsoft COCO: Common Objects in Context

06 Sep 2014-pp 740-755

TL;DR: A new dataset that advances the state of the art in object recognition by placing object recognition in the context of the broader question of scene understanding, gathering images of complex everyday scenes that contain common objects in their natural context.
Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

Summary

1 INTRODUCTION

  • One of the primary goals of computer vision is the understanding of visual scenes.
  • The authors introduce a new large-scale dataset that addresses three core research problems in scene understanding: detecting non-iconic views (or non-canonical perspectives [12]) of objects, contextual reasoning between objects and the precise 2D localization of objects.
  • The authors posit that current recognition systems perform fairly well on iconic views, but struggle to recognize objects in non-iconic views – in the background, partially occluded, or amid clutter.
  • For each category found, the individual instances were labeled, verified, and finally segmented.
  • Additionally, a critical distinction between their dataset and others is the number of labeled instances per image, which may aid in learning contextual information, Fig. 5. MS COCO contains considerably more object instances per image (7.7) as compared to ImageNet (3.0) and PASCAL (2.3).

3 IMAGE COLLECTION

  • The authors next describe how the object categories and candidate images are selected.

3.1 Common Object Categories

  • The categories must form a representative set of all categories, be relevant to practical applications and occur with high enough frequency to enable the collection of a large dataset.
  • Other important decisions are whether to include both “thing” and “stuff” categories [39] and whether fine-grained [31], [1] and object-part categories should be included.
  • To enable the practical collection of a significant number of instances per category, the authors chose to limit their dataset to entry-level categories, i.e. category labels that are commonly used by humans when describing objects (dog, chair, person).
  • The final selection of categories attempts to pick categories with high votes, while keeping the number of categories per supercategory (animals, vehicles, furniture, etc.) balanced.

3.2 Non-iconic Image Collection

  • Given the list of object categories, their next goal was to collect a set of candidate images.
  • The authors' goal was to collect a dataset such that a majority of images are non-iconic, Fig. 2(c).
  • First as popularized by PASCAL VOC [2], the authors collected images from Flickr which tends to have fewer iconic images.
  • Surprisingly, these images typically do not just contain the two categories specified in the search, but numerous other categories as well.
  • The result is a collection of 328,000 images with rich contextual relationships between objects as shown in Figs. 2(c) and 6.

4 IMAGE ANNOTATION

  • The authors next describe how they annotated their image collection.
  • Due to their desire to label over 2.5 million object instances, the design of a cost efficient yet high quality annotation pipeline was critical.
  • For all crowdsourcing tasks the authors used workers on Amazon’s Mechanical Turk (AMT).
  • Note that, since the original version of this work [19], the authors have taken a number of steps to further improve the quality of the annotations.

4.1 Category Labeling

  • The first task in annotating their dataset is determining which object categories are present in each image, Fig. 3(a).
  • Since the authors have 91 categories and a large number of images, asking workers to answer 91 binary classification questions per image would be prohibitively expensive.
  • For a given image, a worker was presented with each group of categories in turn and asked to indicate whether any instances exist for that super-category.
  • This greatly reduces the time needed to classify the various categories.
  • The placement of these icons is critical for the following stage.

4.2 Instance Spotting

  • In the next stage all instances of the object categories in an image were labeled, Fig. 3(b).
  • To boost recall, the location of the instance found by a worker in the previous stage was shown to the current worker.
  • Such priming helped workers quickly find an initial instance upon first seeing the image.
  • The workers could also use a magnifying glass to find small instances.
  • Each image was labeled by 8 workers for a total of ∼10k worker hours.

4.3 Instance Segmentation

  • The authors' final stage is the laborious task of segmenting each object instance, Fig. 3(c).
  • To minimize cost the authors only had a single worker segment each instance.
  • The training task required workers to segment an object instance.
  • Workers could not complete the task until their segmentation adequately matched the ground truth.
  • After 10-15 instances of a category were segmented in an image, the remaining instances were marked as “crowds” using a single (possibly multipart) segment.
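The "crowd" convention above carries through to the released annotations: in COCO-style JSON, crowd regions are flagged with `iscrowd = 1`. A minimal sketch, with invented sample annotations, of separating per-instance labels from crowd labels:

```python
# Sketch: separating per-instance annotations from "crowd" annotations,
# assuming COCO-style dicts where crowd regions carry iscrowd=1.
# The sample annotations below are invented for illustration.

annotations = [
    {"id": 1, "category_id": 1, "iscrowd": 0},  # a single segmented person
    {"id": 2, "category_id": 1, "iscrowd": 0},  # another person instance
    {"id": 3, "category_id": 1, "iscrowd": 1},  # remaining people as one crowd region
]

def split_crowds(anns):
    """Return (instance_anns, crowd_anns) for a list of COCO-style annotations."""
    instances = [a for a in anns if a.get("iscrowd", 0) == 0]
    crowds = [a for a in anns if a.get("iscrowd", 0) == 1]
    return instances, crowds

instances, crowds = split_crowds(annotations)
```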

4.4 Annotation Performance Analysis

  • The authors analyzed crowd worker quality on the category labeling task by comparing to dedicated expert workers, see Fig. 4(a).
  • Ground truth was computed using majority vote of the experts.
  • Fig. 4(a) shows that the union of 8 AMT workers, the same number as was used to collect their labels, achieved greater recall than any of the expert workers.
  • Object category presence is often ambiguous.
  • Note that a similar analysis may be done for instance spotting in which 8 annotators were also used.
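The comparison in Fig. 4(a) can be sketched as follows; the expert and worker label sets below are invented, and ground truth is the expert majority vote as described above:

```python
# Sketch of the Fig. 4(a) analysis: compare the union of several AMT
# workers' category labels against ground truth formed by expert majority
# vote. All labels below are invented for illustration.

def majority_vote(expert_labels):
    """Categories marked present by more than half of the experts."""
    counts = {}
    for labels in expert_labels:
        for c in labels:
            counts[c] = counts.get(c, 0) + 1
    return {c for c, n in counts.items() if n > len(expert_labels) / 2}

def precision_recall(predicted, truth):
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(truth) if truth else 1.0
    return precision, recall

experts = [{"dog", "chair"}, {"dog", "chair", "cup"}, {"dog"}]
truth = majority_vote(experts)           # {"dog", "chair"}
workers = [{"dog"}, {"chair", "cup"}]    # taking the union boosts recall
union = set().union(*workers)
p, r = precision_recall(union, truth)
```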

4.5 Caption Annotation

  • A full description of the caption statistics and how they were gathered will be provided shortly in a separate publication.

5 DATASET STATISTICS

  • Next, the authors analyze the properties of the Microsoft Common Objects in COntext (MS COCO) dataset in comparison to several other popular datasets.
  • These include ImageNet [1], PASCAL VOC 2012 [2], and SUN [3].
  • On average their dataset contains 3.5 categories and 7.7 instances per image.
  • Another interesting observation is that only 10% of the images in MS COCO contain a single category per image; in comparison, over 60% of the images in ImageNet and PASCAL VOC contain a single object category.
  • Smaller objects are generally harder to recognize and require more contextual reasoning.
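The per-image statistics quoted above can be computed as sketched below from a toy mapping of image id to the category labels of its instances (all data invented):

```python
# Sketch of the per-image statistics: mean categories per image, mean
# instances per image, and the fraction of single-category images.
# The toy dataset below is invented for illustration.

images = {
    "img1": ["person", "person", "dog", "chair"],  # 3 categories, 4 instances
    "img2": ["car"],                               # a single-category image
    "img3": ["cat", "sofa"],
}

def dataset_stats(images):
    n = len(images)
    cats = [len(set(v)) for v in images.values()]   # distinct categories
    insts = [len(v) for v in images.values()]       # labeled instances
    frac_single = sum(1 for c in cats if c == 1) / n
    return sum(cats) / n, sum(insts) / n, frac_single

mean_cats, mean_insts, frac_single = dataset_stats(images)
```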

6 DATASET SPLITS

  • To accommodate a faster release schedule, the authors split the MS COCO dataset into two roughly equal parts.
  • The cumulative 2015 release will contain a total of 165,482 train, 81,208 val, and 81,434 test images.
  • The authors took care to minimize the chance of near-duplicate images existing across splits by explicitly removing near duplicates (detected with [43]) and grouping images by photographer and date taken.
  • The authors are currently finalizing the evaluation server for automatic evaluation on the test set.
  • The authors did not collect segmentations for the following 11 categories: hat, shoe, eyeglasses (too many instances), mirror, window, door, street sign (ambiguous and difficult to label), plate, desk (due to confusion with bowl and dining table, respectively) and blender, hair brush (too few instances).
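Grouping images by photographer before splitting, as described above, keeps likely near-duplicates inside one split. A hedged sketch (the greedy group assignment and the photographer ids are illustrative, not the authors' exact procedure):

```python
# Sketch: assign whole photographer groups to splits so that photos from
# one photographer (a common source of near-duplicates) never straddle
# train and val. Group assignment order and split ratio are illustrative.

from collections import defaultdict

def split_by_photographer(photos, val_fraction=0.5):
    """photos: list of (image_id, photographer_id). Groups never straddle splits."""
    groups = defaultdict(list)
    for image_id, photographer in photos:
        groups[photographer].append(image_id)
    train, val = [], []
    target_val = val_fraction * len(photos)
    for photographer in sorted(groups):      # deterministic order
        bucket = val if len(val) < target_val else train
        bucket.extend(groups[photographer])
    return train, val

photos = [(1, "a"), (2, "a"), (3, "b"), (4, "c")]
train, val = split_by_photographer(photos)
```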

7 ALGORITHMIC ANALYSIS

  • For the following experiments the authors take a subset of 55,000 images from their dataset1 and obtain tight-fitting bounding boxes from the annotated segmentation masks.
  • Consistent with past observations [46], the authors find that including difficult (non-iconic) images during training may not always help.
  • These observations support two hypotheses: 1) MS COCO is significantly more difficult than PASCAL VOC and 2) models trained on MS COCO can generalize better to easier datasets such as PASCAL VOC given more training data.
  • The authors then measure the intersection over union of the predicted and ground truth segmentation masks, see Fig.
  • To establish a baseline for their dataset, the authors project learned DPM part masks onto the image to create segmentation masks.
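Two of the operations mentioned above are easy to make concrete: deriving a tight-fitting bounding box from a binary segmentation mask, and the intersection-over-union of two masks. A minimal sketch using plain nested lists (the toy mask is invented):

```python
# Sketch: tight bounding box from a binary mask, and mask IoU.
# Masks are lists of lists of 0/1; the example mask is invented.

def mask_to_bbox(mask):
    """Tight (x0, y0, x1, y1) box around nonzero pixels, inclusive corners.
    Assumes the mask has at least one nonzero pixel."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    return min(xs), min(ys), max(xs), max(ys)

def mask_iou(a, b):
    inter = sum(1 for ra, rb in zip(a, b) for va, vb in zip(ra, rb) if va and vb)
    union = sum(1 for ra, rb in zip(a, b) for va, vb in zip(ra, rb) if va or vb)
    return inter / union if union else 0.0

mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
bbox = mask_to_bbox(mask)   # (1, 1, 2, 2)
iou = mask_iou(mask, mask)  # identical masks -> 1.0
```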

8 DISCUSSION

  • The authors introduced a new dataset for detecting and segmenting objects found in everyday life in their natural environments.
  • Dataset statistics indicate the images contain rich contextual information with many objects present per image.
  • To download and learn more about MS COCO please see the project website2.
  • P.P. and D.R. were supported by ONR MURI Grant N00014-10-1-0933.


Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick,
James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár
Abstract—We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of
object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex
everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in
precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a
total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via
novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of
the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and
segmentation detection results using a Deformable Parts Model.
1 INTRODUCTION
One of the primary goals of computer vision is the
understanding of visual scenes. Scene understanding
involves numerous tasks including recognizing what
objects are present, localizing the objects in 2D and 3D,
determining the objects’ and scene’s attributes, charac-
terizing relationships between objects and providing a
semantic description of the scene. The current object clas-
sification and detection datasets [1], [2], [3], [4] help us
explore the first challenges related to scene understand-
ing. For instance the ImageNet dataset [1], which con-
tains an unprecedented number of images, has recently
enabled breakthroughs in both object classification and
detection research [5], [6], [7]. The community has also
created datasets containing object attributes [8], scene
attributes [9], keypoints [10], and 3D scene information
[11]. This leads us to the obvious question: what datasets
will best continue our advance towards our ultimate goal
of scene understanding?
We introduce a new large-scale dataset that addresses
three core research problems in scene understanding: de-
tecting non-iconic views (or non-canonical perspectives
[12]) of objects, contextual reasoning between objects
and the precise 2D localization of objects. For many
categories of objects, there exists an iconic view. For
example, when performing a web-based image search
for the object category “bike,” the top-ranked retrieved
examples appear in profile, unobstructed near the center
of a neatly composed photo. We posit that current
recognition systems perform fairly well on iconic views,
but struggle to recognize objects otherwise – in the
background, partially occluded, amid clutter [13] –
reflecting the composition of actual everyday scenes.

T.Y. Lin and S. Belongie are with Cornell NYC Tech and the Cornell
Computer Science Department. M. Maire is with the Toyota Technological
Institute at Chicago. L. Bourdev and P. Dollár are with Facebook AI
Research; the majority of this work was performed while P. Dollár was
with Microsoft Research. R. Girshick and C. L. Zitnick are with Microsoft
Research, Redmond. J. Hays is with Brown University. P. Perona is with
the California Institute of Technology. D. Ramanan is with the University
of California at Irvine.
arXiv:1405.0312v3 [cs.CV] 21 Feb 2015

Fig. 1: While previous object recognition datasets have focused on
(a) image classification, (b) object bounding box localization or
(c) semantic pixel-level segmentation, we focus on (d) segmenting
individual object instances. We introduce a large, richly-annotated
dataset comprised of images depicting complex everyday scenes of
common objects in their natural context.

We verify this experimentally; when evaluated on everyday scenes,
models trained on our data perform better than those trained with
prior datasets. A challenge is finding natural images that contain
multiple objects. The identity of many objects can only be resolved
using context, due to small size or ambiguous appearance in the image.
To push research in contextual reasoning, images depicting scenes [3]
rather than objects in isolation are necessary. Finally, we argue that
detailed spatial understanding of object layout will be a core
component of scene analysis. An object's spatial location can be
defined coarsely using a bounding box [2] or with a precise pixel-level
segmentation [14], [15], [16]. As we demonstrate, to measure either
kind of localization performance it is essential for the dataset to
have every instance of every object
category labeled and fully segmented. Our dataset is
unique in its annotation of instance-level segmentation
masks, Fig. 1.
To create a large-scale dataset that accomplishes these
three goals we employed a novel pipeline for gathering
data with extensive use of Amazon Mechanical Turk.
First and most importantly, we harvested a large set
of images containing contextual relationships and non-
iconic object views. We accomplished this using a sur-
prisingly simple yet effective technique that queries for
pairs of objects in conjunction with images retrieved
via scene-based queries [17], [3]. Next, each image was
labeled as containing particular object categories using
a hierarchical labeling approach [18]. For each category
found, the individual instances were labeled, verified,
and finally segmented. Given the inherent ambiguity of
labeling, each of these stages has numerous tradeoffs that
we explored in detail.
The Microsoft Common Objects in COntext (MS
COCO) dataset contains 91 common object categories
with 82 of them having more than 5,000 labeled in-
stances, Fig. 6. In total the dataset has 2,500,000 labeled
instances in 328,000 images. In contrast to the popular
ImageNet dataset [1], COCO has fewer categories but
more instances per category. This can aid in learning
detailed object models capable of precise 2D localization.
The dataset is also significantly larger in number of in-
stances per category than the PASCAL VOC [2] and SUN
[3] datasets. Additionally, a critical distinction between
our dataset and others is the number of labeled instances
per image which may aid in learning contextual informa-
tion, Fig. 5. MS COCO contains considerably more object
instances per image (7.7) as compared to ImageNet (3.0)
and PASCAL (2.3). In contrast, the SUN dataset, which
contains significant contextual information, has over 17
objects and “stuff” per image but considerably fewer
object instances overall.
An abridged version of this work appeared in [19].
2 RELATED WORK
Throughout the history of computer vision research
datasets have played a critical role. They not only pro-
vide a means to train and evaluate algorithms, they
drive research in new and more challenging directions.
The creation of ground truth stereo and optical flow
datasets [20], [21] helped stimulate a flood of interest
in these areas. The early evolution of object recognition
datasets [22], [23], [24] facilitated the direct comparison
of hundreds of image recognition algorithms while si-
multaneously pushing the field towards more complex
problems. Recently, the ImageNet dataset [1] containing
millions of images has enabled breakthroughs in both
object classification and detection research using a new
class of deep learning algorithms [5], [6], [7].
Datasets related to object recognition can be roughly
split into three groups: those that primarily address
object classification, object detection and semantic scene
labeling. We address each in turn.
Image Classification The task of object classification
requires binary labels indicating whether objects are
present in an image; see Fig. 1(a). Early datasets of this
type comprised images containing a single object with
blank backgrounds, such as the MNIST handwritten
digits [25] or COIL household objects [26]. Caltech 101
[22] and Caltech 256 [23] marked the transition to more
realistic object images retrieved from the internet while
also increasing the number of object categories to 101
and 256, respectively. Popular datasets in the machine
learning community due to the larger number of training
examples, CIFAR-10 and CIFAR-100 [27] offered 10 and
100 categories from a dataset of tiny 32 × 32 images [28].
While these datasets contained up to 60,000 images and
hundreds of categories, they still only captured a small
fraction of our visual world.
Recently, ImageNet [1] made a striking departure from
the incremental increase in dataset sizes. They proposed
the creation of a dataset containing 22k categories with
500-1000 images each. Unlike previous datasets contain-
ing entry-level categories [29], such as “dog” or “chair,”
like [28], ImageNet used the WordNet Hierarchy [30] to
obtain both entry-level and fine-grained [31] categories.
Currently, the ImageNet dataset contains over 14 million
labeled images and has enabled significant advances in
image classification [5], [6], [7].
Object detection Detecting an object entails both
stating that an object belonging to a specified class is
present, and localizing it in the image. The location of
an object is typically represented by a bounding box,
Fig. 1(b). Early algorithms focused on face detection [32]
using various ad hoc datasets. Later, more realistic and
challenging face detection datasets were created [33].
Another popular challenge is the detection of pedestri-
ans for which several datasets have been created [24],
[4]. The Caltech Pedestrian Dataset [4] contains 350,000
labeled instances with bounding boxes.
For the detection of basic object categories, a multi-
year effort from 2005 to 2012 was devoted to the creation
and maintenance of a series of benchmark datasets that
were widely adopted. The PASCAL VOC [2] datasets
contained 20 object categories spread over 11,000 images.
Over 27,000 object instance bounding boxes were la-
beled, of which almost 7,000 had detailed segmentations.
Recently, a detection challenge has been created from 200
object categories using a subset of 400,000 images from
ImageNet [34]. An impressive 350,000 objects have been
labeled using bounding boxes.
Since the detection of many objects such as sunglasses,
cellphones or chairs is highly dependent on contextual
information, it is important that detection datasets con-
tain objects in their natural environments. In our dataset
we strive to collect images rich in contextual information.
The use of bounding boxes also limits the accuracy
for which detection algorithms may be evaluated. We
propose the use of fully segmented instances to enable
more accurate detector evaluation.

Fig. 2: Example of (a) iconic object images, (b) iconic scene images, and (c) non-iconic images.
Semantic scene labeling The task of labeling se-
mantic objects in a scene requires that each pixel of an
image be labeled as belonging to a category, such as
sky, chair, floor, street, etc. In contrast to the detection
task, individual instances of objects do not need to be
segmented, Fig. 1(c). This enables the labeling of objects
for which individual instances are hard to define, such
as grass, streets, or walls. Datasets exist for both indoor
[11] and outdoor [35], [14] scenes. Some datasets also
include depth information [11]. Similar to semantic scene
labeling, our goal is to measure the pixel-wise accuracy
of object labels. However, we also aim to distinguish
between individual instances of an object, which requires
a solid understanding of each object’s extent.
A novel dataset that combines many of the properties
of both object detection and semantic scene labeling
datasets is the SUN dataset [3] for scene understanding.
SUN contains 908 scene categories from the WordNet
dictionary [30] with segmented objects. The 3,819 ob-
ject categories span those common to object detection
datasets (person, chair, car) and to semantic scene la-
beling (wall, sky, floor). Since the dataset was collected
by finding images depicting various scene types, the
number of instances per object category exhibits the long
tail phenomenon. That is, a few categories have a large
number of instances (wall: 20,213, window: 16,080, chair:
7,971) while most have a relatively modest number of
instances (boat: 349, airplane: 179, floor lamp: 276). In
our dataset, we ensure that each object category has a
significant number of instances, Fig. 5.
Other vision datasets Datasets have spurred the ad-
vancement of numerous fields in computer vision. Some
notable datasets include the Middlebury datasets for
stereo vision [20], multi-view stereo [36] and optical flow
[21]. The Berkeley Segmentation Data Set (BSDS500) [37]
has been used extensively to evaluate both segmentation
and edge detection algorithms. Datasets have also been
created to recognize both scene [9] and object attributes
[8], [38]. Indeed, numerous areas of vision have benefited
from challenging datasets that helped catalyze progress.
3 IMAGE COLLECTION
We next describe how the object categories and candi-
date images are selected.
3.1 Common Object Categories
The selection of object categories is a non-trivial exercise.
The categories must form a representative set of all
categories, be relevant to practical applications and occur
with high enough frequency to enable the collection of
a large dataset. Other important decisions are whether
to include both “thing” and “stuff” categories [39] and
whether fine-grained [31], [1] and object-part categories
should be included. “Thing” categories include objects
for which individual instances may be easily labeled
(person, chair, car) where “stuff” categories include
materials and objects with no clear boundaries (sky,
street, grass). Since we are primarily interested in precise
localization of object instances, we decided to only
include “thing” categories and not “stuff.” However,
since “stuff” categories can provide significant contextual
information, we believe the future labeling of “stuff”
categories would be beneficial.
The specificity of object categories can vary signifi-
cantly. For instance, a dog could be a member of the
“mammal”, “dog”, or “German shepherd” categories. To
enable the practical collection of a significant number
of instances per category, we chose to limit our dataset
to entry-level categories, i.e. category labels that are
commonly used by humans when describing objects
(dog, chair, person). It is also possible that some object
categories may be parts of other object categories. For in-
stance, a face may be part of a person. We anticipate the
inclusion of object-part categories (face, hands, wheels)
would be beneficial for many real-world applications.
We used several sources to collect entry-level object
categories of “things.” We first compiled a list of cate-
gories by combining categories from PASCAL VOC [2]
and a subset of the 1200 most frequently used words
that denote visually identifiable objects [40]. To further
augment our set of candidate categories, several children
ranging in ages from 4 to 8 were asked to name every

Fig. 3: Our annotation pipeline is split into 3 primary tasks: (a) labeling the categories present in the image (§4.1),
(b) locating and marking all instances of the labeled categories (§4.2), and (c) segmenting each object instance (§4.3).
object they see in indoor and outdoor environments.
The final 272 candidates may be found in the appendix.
Finally, the co-authors voted on a 1 to 5 scale for each
category taking into account how commonly they oc-
cur, their usefulness for practical applications, and their
diversity relative to other categories. The final selec-
tion of categories attempts to pick categories with high
votes, while keeping the number of categories per super-
category (animals, vehicles, furniture, etc.) balanced. Cat-
egories for which obtaining a large number of instances
(greater than 5,000) was difficult were also removed.
To ensure backwards compatibility all categories from
PASCAL VOC [2] are also included. Our final list of 91
proposed categories is in Fig. 5(a).
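The vote-based, supercategory-balanced selection described above can be sketched as a greedy pass over candidates sorted by vote; the candidate names, votes, and the cap per super-category below are invented, not the authors' exact procedure:

```python
# Sketch: greedily pick the highest-voted candidate categories while
# capping the number per super-category to keep the selection balanced.
# All candidates, votes, and the cap are invented for illustration.

def select_categories(candidates, per_super_cap, total):
    """candidates: list of (name, supercategory, mean_vote) tuples."""
    chosen, per_super = [], {}
    for name, sup, vote in sorted(candidates, key=lambda c: -c[2]):
        if per_super.get(sup, 0) < per_super_cap and len(chosen) < total:
            chosen.append(name)
            per_super[sup] = per_super.get(sup, 0) + 1
    return chosen

candidates = [
    ("dog", "animal", 4.8), ("cat", "animal", 4.7), ("horse", "animal", 4.5),
    ("car", "vehicle", 4.9), ("chair", "furniture", 4.2),
]
# "horse" is skipped despite a high vote: the animal cap is already full.
picked = select_categories(candidates, per_super_cap=2, total=4)
```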
3.2 Non-iconic Image Collection
Given the list of object categories, our next goal was to
collect a set of candidate images. We may roughly group
images into three types, Fig. 2: iconic-object images [41],
iconic-scene images [3] and non-iconic images. Typical
iconic-object images have a single large object in a
canonical perspective centered in the image, Fig. 2(a).
Iconic-scene images are shot from canonical viewpoints
and commonly lack people, Fig. 2(b). Iconic images have
the benefit that they may be easily found by directly
searching for specific categories using Google or Bing
image search. While iconic images generally provide
high quality object instances, they can lack important
contextual information and non-canonical viewpoints.
Our goal was to collect a dataset such that a majority
of images are non-iconic, Fig. 2(c). It has been shown that
datasets containing more non-iconic images are better at
generalizing [42]. We collected non-iconic images using
two strategies. First as popularized by PASCAL VOC
[2], we collected images from Flickr which tends to have
fewer iconic images. Flickr contains photos uploaded by
amateur photographers with searchable metadata and
keywords. Second, we did not search for object cate-
gories in isolation. A search for “dog” will tend to return
iconic images of large, centered dogs. However, if we
searched for pairwise combinations of object categories,
such as “dog + car,” we found many more non-iconic
images. Surprisingly, these images typically do not just
contain the two categories specified in the search, but nu-
merous other categories as well. To further supplement
our dataset we also searched for scene/object category
pairs, see the appendix. We downloaded at most 5
photos taken by a single photographer within a short
time window. In the rare cases in which enough images
could not be found, we searched for single categories
and performed an explicit filtering stage to remove iconic
images. The result is a collection of 328,000 images with
rich contextual relationships between objects as shown
in Figs. 2(c) and 6.
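The pairwise-query strategy can be sketched directly; the category names echo the example above, and the query-string format is illustrative:

```python
# Sketch: instead of searching Flickr for a single category (which tends
# to return iconic images), generate queries for pairs of categories.
# The "a + b" query format is illustrative.

from itertools import combinations

def pairwise_queries(categories):
    """All unordered category pairs as search-query strings."""
    return [f"{a} + {b}" for a, b in combinations(categories, 2)]

queries = pairwise_queries(["dog", "car", "chair"])
# 3 categories -> 3 pairs: "dog + car", "dog + chair", "car + chair"
```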
4 IMAGE ANNOTATION
We next describe how we annotated our image collec-
tion. Due to our desire to label over 2.5 million object
instances, the design of a cost efficient yet high quality
annotation pipeline was critical. The annotation pipeline
is outlined in Fig. 3. For all crowdsourcing tasks we
used workers on Amazon’s Mechanical Turk (AMT). Our
user interfaces are described in detail in the appendix.
Note that, since the original version of this work [19],
we have taken a number of steps to further improve
the quality of the annotations. In particular, we have
increased the number of annotators for the category
labeling and instance spotting stages to eight. We also
added a stage to verify the instance segmentations.
4.1 Category Labeling
The first task in annotating our dataset is determin-
ing which object categories are present in each image,
Fig. 3(a). Since we have 91 categories and a large number
of images, asking workers to answer 91 binary clas-
sification questions per image would be prohibitively
expensive. Instead, we used a hierarchical approach [18].

Fig. 4: Worker precision and recall for the category labeling task. (a) The union of multiple AMT workers (blue)
has better recall than any expert (red). Ground truth was computed using majority vote of the experts. (b) Shows
the number of workers (circle size) and average number of jobs per worker (circle color) for each precision/recall
range. Most workers have high precision; such workers generally also complete more jobs. For this plot ground
truth for each worker is the union of responses from all other AMT workers. See §4.4 for details.
We group the object categories into 11 super-categories
(see the appendix). For a given image, a worker was
presented with each group of categories in turn and
asked to indicate whether any instances exist for that
super-category. This greatly reduces the time needed to
classify the various categories. For example, a worker
may easily determine no animals are present in the im-
age without having to specifically look for cats, dogs, etc.
If a worker determines instances from the super-category
(animal) are present, for each subordinate category (dog,
cat, etc.) present, the worker must drag the category’s
icon onto the image over one instance of the category.
The placement of these icons is critical for the following
stage. We emphasize that only a single instance of each
category needs to be annotated in this stage. To ensure
high recall, 8 workers were asked to label each image. A
category is considered present if any worker indicated
the category; false positives are handled in subsequent
stages. A detailed analysis of performance is presented
in §4.4. This stage took 20k worker hours to complete.
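Why the hierarchical approach is cheaper can be made concrete with a small counting sketch: a worker answers one question per super-category and drills down only where a super-category is present. The toy hierarchy below is invented (the real dataset uses 91 categories in 11 super-categories):

```python
# Sketch: count the questions a worker answers under hierarchical
# labeling versus one binary question per category. The toy hierarchy
# (4 super-categories of 5 categories each) is invented.

def questions_asked(hierarchy, present_supers):
    """hierarchy: {super: [categories]}. Questions under the hierarchical scheme."""
    n = len(hierarchy)               # one question per super-category
    for sup in present_supers:
        n += len(hierarchy[sup])     # drill down only where present
    return n

supers = ["animal", "vehicle", "furniture", "food"]
hierarchy = {s: [f"{s}_{j}" for j in range(5)] for s in supers}

flat = sum(len(v) for v in hierarchy.values())               # 20 binary questions
hier = questions_asked(hierarchy, present_supers=["animal"])  # 4 + 5 = 9
```

The more super-categories are absent from an image, the larger the saving, which is why a worker "may easily determine no animals are present" without checking for cats or dogs individually.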
4.2 Instance Spotting
In the next stage all instances of the object categories
in an image were labeled, Fig. 3(b). In the previous
stage each worker labeled one instance of a category, but
multiple object instances may exist. Therefore, for each
image, a worker was asked to place a cross on top of
each instance of a specific category found in the previous
stage. To boost recall, the location of the instance found
by a worker in the previous stage was shown to the
current worker. Such priming helped workers quickly
find an initial instance upon first seeing the image. The
workers could also use a magnifying glass to find small
instances. Each worker was asked to label at most 10
instances of a given category per image. Each image was
labeled by 8 workers for a total of 10k worker hours.
4.3 Instance Segmentation
Our final stage is the laborious task of segmenting each
object instance, Fig. 3(c). For this stage we modified
the excellent user interface developed by Bell et al. [16]
for image segmentation. Our interface asks the worker
to segment an object instance specified by a worker in
the previous stage. If other instances have already been
segmented in the image, those segmentations are shown
to the worker. A worker may also indicate there are
no object instances of the given category in the image
(implying a false positive label from the previous stage)
or that all object instances are already segmented.
Segmenting 2,500,000 object instances is an extremely
time-consuming task, requiring over 22 worker hours per
1,000 segmentations. To minimize cost we only had a
single worker segment each instance. However, when
first completing the task, most workers produced only
coarse instance outlines. As a consequence, we required
all workers to complete a training task for each object
category. The training task required workers to segment
an object instance. Workers could not complete the task
until their segmentation adequately matched the ground
truth. The use of a training task vastly improved the
quality of the workers (approximately 1 in 3 workers
passed the training stage) and resulting segmentations.
Example segmentations may be viewed in Fig. 6.
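The paper says only that a trainee's segmentation had to "adequately match" the ground truth; a natural (but assumed) concrete criterion is a mask intersection-over-union threshold. A hypothetical sketch, with masks represented as sets of pixel coordinates and a made-up 0.7 threshold:

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]  # (row, col)

def mask_iou(pred: Set[Pixel], gt: Set[Pixel]) -> float:
    """Intersection-over-union of two binary masks given as pixel sets."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0

def passes_training(pred: Set[Pixel], gt: Set[Pixel],
                    threshold: float = 0.7) -> bool:
    # The threshold is illustrative; the paper does not state the
    # exact acceptance criterion for the training task.
    return mask_iou(pred, gt) >= threshold
```

Gating workers on a per-category training task like this filters out coarse outlines before any production work is accepted, which is cheaper than rejecting finished segmentations later.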
While the training task filtered out most bad workers,
we also performed an explicit verification step on each
segmented instance to ensure good quality. Multiple
workers (3 to 5) were asked to judge each segmentation
and indicate whether it matched the instance well or not.
Segmentations of insufficient quality were discarded and
the corresponding instances added back to the pool of
unsegmented objects. Finally, some approved workers
consistently produced poor segmentations; all work obtained
from such workers was discarded.
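This verification step can be sketched as a simple vote over the 3 to 5 judgments per segmentation. The majority-approval rule below is an assumption — the paper states only that segmentations of insufficient quality were discarded and their instances returned to the unsegmented pool:

```python
from typing import List, Tuple

def verify_segmentation(votes: List[bool]) -> bool:
    """Accept a segmentation if a majority of its judges approve.
    (The exact acceptance rule is an assumption; the paper says only
    that 3 to 5 workers judged each segmentation.)"""
    assert 3 <= len(votes) <= 5, "each instance gets 3 to 5 judges"
    return sum(votes) > len(votes) / 2

def triage(segmentations: list,
           votes_per_seg: List[List[bool]]) -> Tuple[list, list]:
    """Split segmentations into accepted ones and ones whose instances
    go back into the pool of unsegmented objects."""
    accepted, requeue = [], []
    for seg, votes in zip(segmentations, votes_per_seg):
        (accepted if verify_segmentation(votes) else requeue).append(seg)
    return accepted, requeue
```

A separate per-worker tally over these votes would also support the final filter described above, in which all work from consistently poor workers is discarded.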
