Microsoft COCO: Common Objects in Context
Summary
- One of the primary goals of computer vision is the understanding of visual scenes.
- The authors introduce a new large-scale dataset that addresses three core research problems in scene understanding: detecting non-iconic views (or non-canonical perspectives) of objects, contextual reasoning between objects, and the precise 2D localization of objects.
- The authors posit that current recognition systems perform fairly well on iconic views, but struggle to recognize objects otherwise.
- T.Y. Lin and S. Belongie are with Cornell NYC Tech and the Cornell Computer Science Department.
- For each category found, the individual instances were labeled, verified, and finally segmented.
- Additionally, a critical distinction between their dataset and others is the number of labeled instances per image, which may aid in learning contextual information (Fig. 5). MS COCO contains considerably more object instances per image (7.7) than ImageNet (3.0) and PASCAL (2.3).
3 IMAGE COLLECTION
- The authors next describe how the object categories and candidate images are selected.
3.1 Common Object Categories
- The categories must form a representative set of all categories, be relevant to practical applications and occur with high enough frequency to enable the collection of a large dataset.
- Other important decisions are whether to include both “thing” and “stuff” categories, and whether fine-grained and object-part categories should be included.
- To enable the practical collection of a significant number of instances per category, the authors chose to limit their dataset to entry-level categories, i.e. category labels that are commonly used by humans when describing objects (dog, chair, person).
- The final selection of categories attempts to pick categories with high votes, while keeping the number of categories per supercategory (animals, vehicles, furniture, etc.) balanced.
3.2 Non-iconic Image Collection
- Given the list of object categories, their next goal was to collect a set of candidate images.
- The authors' goal was to collect a dataset in which a majority of images are non-iconic, Fig. 2(c).
- First, as popularized by PASCAL VOC, the authors collected images from Flickr, which tends to have fewer iconic images.
- Second, the authors searched for pairwise combinations of object categories rather than for single categories in isolation. Surprisingly, the resulting images typically contain not just the two categories specified in the search, but numerous other categories as well.
- The result is a collection of 328,000 images with rich contextual relationships between objects as shown in Figs. 2(c) and 6.
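The pairwise search strategy described above can be illustrated with a short sketch. The category subset and the query format (two names joined by a space) are assumptions made for illustration; the summary does not specify how the search strings were actually built.

```python
from itertools import combinations

# Hypothetical subset of category names; the full list has 91 categories.
CATEGORIES = ["dog", "car", "chair", "person"]


def pairwise_queries(categories):
    """Return one search string per unordered pair of categories."""
    return [f"{a} {b}" for a, b in combinations(categories, 2)]


queries = pairwise_queries(CATEGORIES)
print(len(queries))  # 6 pairs from 4 categories
print(queries[0])    # dog car
```

With n categories this yields n(n-1)/2 queries, so the number of searches grows quadratically with the size of the category list.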
4 IMAGE ANNOTATION
- The authors next describe how they annotated their image collection.
- Due to their desire to label over 2.5 million object instances, the design of a cost efficient yet high quality annotation pipeline was critical.
- For all crowdsourcing tasks the authors used workers on Amazon’s Mechanical Turk (AMT).
- Note that, since the original version of this work, the authors have taken a number of steps to further improve the quality of the annotations.
4.1 Category Labeling
- The first task in annotating their dataset is determining which object categories are present in each image, Fig. 3(a).
- Since the authors have 91 categories and a large number of images, asking workers to answer 91 binary classification questions per image would be prohibitively expensive.
- For a given image, a worker was presented with each group of categories in turn and asked to indicate whether any instances exist for that super-category.
- This greatly reduces the time needed to classify the various categories.
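- When instances of a super-category were present, the worker dragged the icon of each matching category onto one instance of it in the image. The placement of these icons is critical for the following stage.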
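A rough cost model shows why grouping categories reduces labeling effort. This is a simplified sketch, not the authors' interface: it assumes one yes/no question per super-category, followed by one question per member category only when the group is present. The grouping below is a hypothetical three-group subset.

```python
# Hypothetical grouping; a small subset, not the paper's category table.
SUPER_CATEGORIES = {
    "animal": ["dog", "cat", "horse"],
    "vehicle": ["car", "bus", "bicycle"],
    "furniture": ["chair", "couch", "bed"],
}


def questions_asked(present, groups=SUPER_CATEGORIES):
    """Count questions for one image under flat vs. hierarchical labeling.

    `present` is the set of category names that appear in the image.
    Flat: one question per category.
    Hierarchical: one question per group, plus one per member of every
    group that contains at least one present category.
    """
    flat = sum(len(members) for members in groups.values())
    hierarchical = 0
    for members in groups.values():
        hierarchical += 1  # "is anything from this group present?"
        if present & set(members):
            hierarchical += len(members)  # drill down into the group
    return flat, hierarchical


print(questions_asked({"dog"}))  # (9, 6)
```

The saving comes from images in which most groups are absent, since each absent group is dismissed with a single answer.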
4.2 Instance Spotting
- In the next stage all instances of the object categories in an image were labeled, Fig. 3(b).
- To boost recall, the location of the instance found by a worker in the previous stage was shown to the current worker.
- Such priming helped workers quickly find an initial instance upon first seeing the image.
- The workers could also use a magnifying glass to find small instances.
- Each image was labeled by 8 workers for a total of ∼10k worker hours.
4.3 Instance Segmentation
- The authors' final stage is the laborious task of segmenting each object instance, Fig. 3(c).
- To minimize cost, the authors had only a single worker segment each instance.
- The training task required workers to segment an object instance.
- Workers could not complete the task until their segmentation adequately matched the ground truth.
- After 10-15 instances of a category were segmented in an image, the remaining instances were marked as “crowds” using a single (possibly multipart) segment.
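The crowd rule above can be sketched as a simple partition of the spotted instances of one category in one image. The function name and the fixed `limit` are assumptions for illustration; the text only states that the cutoff fell somewhere between 10 and 15 instances.

```python
def plan_segmentation(instance_ids, limit=10):
    """Split spotted instances into individually segmented ones and a crowd.

    The first `limit` instances get their own segmentation; any remainder
    would be covered by a single (possibly multipart) crowd segment.
    """
    individual = instance_ids[:limit]
    crowd = instance_ids[limit:]
    return individual, crowd


individual, crowd = plan_segmentation(list(range(13)), limit=10)
print(len(individual))  # 10
print(crowd)            # [10, 11, 12]
```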
4.4 Annotation Performance Analysis
- The authors analyzed crowd worker quality on the category labeling task by comparing to dedicated expert workers, see Fig. 4(a).
- Ground truth was computed using majority vote of the experts.
- Fig. 4(a) shows that the union of 8 AMT workers, the same number as was used to collect their labels, achieved greater recall than any of the expert workers.
- Object category presence is often ambiguous.
- Note that a similar analysis may be done for instance spotting in which 8 annotators were also used.
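One way to see why the union of several workers reaches high recall is a back-of-the-envelope probability calculation. The sketch below assumes workers miss an instance independently of each other, which is a simplifying assumption and not something the summary establishes.

```python
def union_recall(p_single, n_workers=8):
    """Recall of the union of n independent workers.

    `p_single` is the probability that one worker marks a given category.
    The union misses it only if every worker misses it.
    """
    return 1.0 - (1.0 - p_single) ** n_workers


print(round(union_recall(0.5), 4))  # 0.9961
```

Under this assumption, a category that each worker finds only half the time is missed by all 8 workers with probability 0.5^8, roughly 0.004.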
5 DATASET STATISTICS
- Next, the authors analyze the properties of the Microsoft Common Objects in COntext (MS COCO) dataset in comparison to several other popular datasets.
- These include ImageNet, PASCAL VOC 2012, and SUN.
- On average their dataset contains 3.5 categories and 7.7 instances per image.
- Another interesting observation is that only 10% of the images in MS COCO contain a single category. In comparison, over 60% of images in ImageNet and PASCAL VOC contain a single object category.
- Smaller objects are generally harder to recognize and require more contextual reasoning.
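Per-image statistics of this kind can be computed from a flat list of instance annotations. The sketch below uses a toy list of dictionaries with `image_id` and `category_id` keys; the data and the function name are made up for illustration, and images without annotations are counted through the `num_images` argument.

```python
from collections import defaultdict


def per_image_stats(annotations, num_images):
    """Return (mean categories/image, mean instances/image,
    fraction of images with exactly one category)."""
    cats = defaultdict(set)
    counts = defaultdict(int)
    for ann in annotations:
        cats[ann["image_id"]].add(ann["category_id"])
        counts[ann["image_id"]] += 1
    mean_cats = sum(len(c) for c in cats.values()) / num_images
    mean_inst = sum(counts.values()) / num_images
    single = sum(1 for c in cats.values() if len(c) == 1) / num_images
    return mean_cats, mean_inst, single


# Toy data: image 1 has two instances of one category and one of another;
# image 2 has a single instance.
toy = [
    {"image_id": 1, "category_id": 18},
    {"image_id": 1, "category_id": 18},
    {"image_id": 1, "category_id": 3},
    {"image_id": 2, "category_id": 1},
]
print(per_image_stats(toy, num_images=2))  # (1.5, 2.0, 0.5)
```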
6 DATASET SPLITS
- To accommodate a faster release schedule, the authors split the MS COCO dataset into two roughly equal parts.
- The cumulative 2015 release will contain a total of 165,482 train, 81,208 val, and 81,434 test images.
- The authors took care to minimize the chance of near-duplicate images existing across splits by explicitly removing near duplicates and by grouping images by photographer and date taken.
- The authors are currently finalizing the evaluation server for automatic evaluation on the test set.
- The authors did not collect segmentations for the following 11 categories: hat, shoe, eyeglasses (too many instances), mirror, window, door, street sign (ambiguous and difficult to label), plate, desk (due to confusion with bowl and dining table, respectively) and blender, hair brush (too few instances).
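Grouping by photographer and date when building the splits can be sketched as a group-aware assignment, in which every image from the same (photographer, date) pair lands in the same split. The hashing scheme, the split names, and the function itself are assumptions for illustration; the summary does not describe the exact mechanism used.

```python
import hashlib


def assign_split(photographer, date_taken, train_fraction=0.5):
    """Deterministically map a (photographer, date) group to a split.

    Hashing the group key, rather than the image id, keeps images that
    were likely taken together on the same side of the split.
    """
    key = f"{photographer}|{date_taken}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 1000
    return "train" if bucket < train_fraction * 1000 else "heldout"


# Two images by the same photographer on the same day share a split.
first = assign_split("alice", "2014-05-01")
second = assign_split("alice", "2014-05-01")
print(first == second)  # True
```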
7 ALGORITHMIC ANALYSIS
- For the following experiments the authors take a subset of 55,000 images from their dataset and obtain tight-fitting bounding boxes from the annotated segmentation masks.
- Consistent with past observations, the authors find that including difficult (non-iconic) images during training may not always help.
- These observations support two hypotheses: 1) MS COCO is significantly more difficult than PASCAL VOC and 2) models trained on MS COCO can generalize better to easier datasets such as PASCAL VOC given more training data.
- The authors then measure the intersection over union of the predicted and ground truth segmentation masks, see Fig.
- To establish a baseline for their dataset, the authors project learned DPM part masks onto the image to create segmentation masks.
- The authors introduced a new dataset for detecting and segmenting objects found in everyday life in their natural environments.
- Dataset statistics indicate the images contain rich contextual information with many objects present per image.
- To download and learn more about MS COCO please see the project website.
- P.P. and D.R. were supported by ONR MURI Grant N00014-10-1-0933.
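Two operations from the algorithmic analysis, deriving a tight bounding box from a segmentation mask and measuring the intersection over union of two masks, can be sketched in plain Python. Masks are represented here as lists of rows of 0/1 values, and the box is returned as (x, y, width, height); both choices, like the function names, are assumptions made for this illustration rather than the authors' code.

```python
def mask_to_bbox(mask):
    """Tight bounding box (x_min, y_min, width, height) of a binary mask.

    Returns None when the mask has no foreground pixels.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    x_min, x_max = min(cols), max(cols)
    y_min, y_max = rows[0], rows[-1]
    return (x_min, y_min, x_max - x_min + 1, y_max - y_min + 1)


def mask_iou(pred, truth):
    """Intersection over union of two equally sized binary masks."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 0.0


truth = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
print(mask_to_bbox(truth))    # (1, 1, 2, 2)
print(mask_iou(pred, truth))  # 0.4
```

In the example, the two masks share 2 pixels and together cover 5, giving an IoU of 2/5.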