Microsoft COCO: Common Objects in Context
Summary
1 INTRODUCTION
- One of the primary goals of computer vision is the understanding of visual scenes.
- The authors introduce a new large-scale dataset that addresses three core research problems in scene understanding: detecting non-iconic views (or non-canonical perspectives [12]) of objects, contextual reasoning between objects, and precise 2D localization of objects.
- For each category found, the individual instances were labeled, verified, and finally segmented.
- Additionally, a critical distinction between their dataset and others is the number of labeled instances per image, which may aid in learning contextual information (Fig. 5): MS COCO contains considerably more object instances per image (7.7) than ImageNet (3.0) and PASCAL (2.3).
3.1 Common Object Categories
- The categories must form a representative set of all categories, be relevant to practical applications and occur with high enough frequency to enable the collection of a large dataset.
- Other important decisions are whether to include both “thing” and “stuff” categories [39] and whether fine-grained [31], [1] and object-part categories should be included.
- To enable the practical collection of a significant number of instances per category, the authors chose to limit their dataset to entry-level categories, i.e. category labels that are commonly used by humans when describing objects (dog, chair, person).
- The final selection of categories attempts to pick categories with high votes, while keeping the number of categories per supercategory (animals, vehicles, furniture, etc.) balanced.
3.2 Non-iconic Image Collection
- Given the list of object categories, their next goal was to collect a set of candidate images.
- The authors' goal was to collect a dataset in which the majority of images are non-iconic (Fig. 2(c)).
- First, as popularized by PASCAL VOC [2], the authors collected images from Flickr, which tends to have fewer iconic images.
- Surprisingly, these images typically contain not just the two categories specified in the search but numerous other categories as well (a sketch of this pairwise querying follows the list).
- The result is a collection of 328,000 images with rich contextual relationships between objects as shown in Figs. 2(c) and 6.
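The pairwise search strategy itself is easy to sketch. Below is a minimal illustration; the category list is a tiny stand-in for the real set of 91, and the actual Flickr retrieval (and the scene-based queries used alongside object pairs) is not shown:

```python
from itertools import combinations

# Illustrative subset of entry-level categories; the full list has 91.
categories = ["dog", "chair", "person", "car", "bicycle"]

def pairwise_queries(cats):
    """Yield keyword searches that pair two object categories -- the trick
    the authors used to surface non-iconic, multi-object scenes on Flickr."""
    for a, b in combinations(cats, 2):
        yield f"{a} {b}"

for query in pairwise_queries(categories):
    print(query)  # e.g. "dog chair", sent to Flickr's keyword search
```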
4 IMAGE ANNOTATION
- The authors next describe how they annotated their image collection.
- Due to their desire to label over 2.5 million object instances, the design of a cost-efficient yet high-quality annotation pipeline was critical.
- For all crowdsourcing tasks the authors used workers on Amazon’s Mechanical Turk (AMT).
- Note that, since the original version of this work [19], the authors have taken a number of steps to further improve the quality of the annotations.
4.1 Category Labeling
- The first task in annotating their dataset is determining which object categories are present in each image (Fig. 3(a)).
- Since the authors have 91 categories and a large number of images, asking workers to answer 91 binary classification questions per image would be prohibitively expensive.
- For a given image, a worker was presented with each group of categories in turn and asked to indicate whether any instances exist for that super-category.
- This hierarchical screening greatly reduces the time needed to classify the various categories (a sketch follows this list).
- The placement of these icons is critical for the following stage.
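A minimal sketch of the hierarchical screening logic. The super-category grouping shown here is a small hypothetical subset, and the `ask()` callable stands in for a worker's judgment; the key point is that a "no" at the super-category level skips all of its subordinate questions:

```python
# Hypothetical grouping; MS COCO organizes its 91 categories into
# super-categories such as "animal", "vehicle" and "furniture".
SUPER_CATEGORIES = {
    "animal": ["dog", "cat", "horse"],
    "vehicle": ["car", "bicycle", "bus"],
}

def label_image(ask):
    """Return the categories a worker reports for one image.
    `ask(question) -> bool` stands in for the worker's judgment."""
    present = []
    for supercat, subcats in SUPER_CATEGORIES.items():
        # One coarse question first; a "no" skips every subordinate category.
        if not ask(f"Does the image contain any {supercat}?"):
            continue
        for cat in subcats:
            if ask(f"Is there a {cat}? If so, drag its icon onto an instance."):
                present.append(cat)
    return present

# Example: a scripted "worker" that only sees a dog.
answers = lambda q: "animal" in q or "dog" in q
print(label_image(answers))  # ['dog']
```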
4.2 Instance Spotting
- In the next stage, all instances of the object categories in an image were labeled (Fig. 3(b)).
- To boost recall, the location of the instance found by a worker in the previous stage was shown to the current worker.
- Such priming helped workers quickly find an initial instance upon first seeing the image.
- The workers could also use a magnifying glass to find small instances.
- Each image was labeled by 8 workers for a total of ∼10k worker hours.
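The paper estimates that 8 annotators detect over 99% of object categories (see Q3 below). A crude way to see why recall saturates with added workers is an independence model; this is my simplification, not the authors' analysis, and the per-worker recall of 0.45 is purely illustrative:

```python
def union_recall(p: float, k: int) -> float:
    """Recall of the union of k independent workers, each with recall p."""
    return 1.0 - (1.0 - p) ** k

# With an illustrative per-worker recall of 0.45, eight workers already
# push union recall above 99%, consistent with the paper's observation.
for k in range(1, 9):
    print(f"k={k}: recall={union_recall(0.45, k):.3f}")
```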
4.3 Instance Segmentation
- The authors' final stage is the laborious task of segmenting each object instance (Fig. 3(c)).
- To minimize cost the authors only had a single worker segment each instance.
- Workers were first given a training task that required them to segment an object instance.
- Workers could not complete the task until their segmentation adequately matched the ground truth.
- After 10-15 instances of a category were segmented in an image, the remaining instances were marked as “crowds” using a single (possibly multipart) segment.
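This cutoff is why every COCO annotation carries an `iscrowd` flag, with crowd regions stored as a single (possibly multipart) mask. A minimal sketch using the pycocotools API that separates individually segmented instances from crowd regions for one image (the annotation file path is an assumption about your local setup):

```python
from pycocotools.coco import COCO

# Path is an assumption -- point it at a local COCO annotation file.
coco = COCO("annotations/instances_train2014.json")

img_id = coco.getImgIds()[0]  # any image will do for illustration

# Individually segmented instances (polygons) ...
instance_ids = coco.getAnnIds(imgIds=img_id, iscrowd=False)
# ... versus "crowd" regions: a single mask covering the leftovers
# once 10-15 instances of a category had been segmented.
crowd_ids = coco.getAnnIds(imgIds=img_id, iscrowd=True)

print(f"{len(instance_ids)} instances, {len(crowd_ids)} crowd regions")
```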
4.4 Annotation Performance Analysis
- The authors analyzed crowd worker quality on the category labeling task by comparing to dedicated expert workers, see Fig. 4(a).
- Ground truth was computed using majority vote of the experts.
- Fig. 4(a) shows that the union of 8 AMT workers, the same number as was used to collect their labels, achieved greater recall than any of the expert workers.
- Note that a similar analysis may be done for instance spotting in which 8 annotators were also used.
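A minimal sketch of the evaluation behind Fig. 4(a): ground truth is the majority vote of the experts, and the crowd is scored by the union of its workers' labels. The helper names and toy labels are illustrative, not the authors' code:

```python
from collections import Counter

def majority_vote(expert_labels):
    """Ground truth: categories named by more than half of the experts."""
    counts = Counter(cat for labels in expert_labels for cat in labels)
    return {cat for cat, n in counts.items() if n > len(expert_labels) / 2}

def union_recall(worker_labels, ground_truth):
    """Recall of the union of all workers' category labels."""
    union = set().union(*worker_labels)
    return len(union & ground_truth) / len(ground_truth)

# Toy example: 3 experts, 8 AMT workers labeling one image.
experts = [{"dog", "chair"}, {"dog", "chair", "person"}, {"dog", "chair"}]
workers = [{"dog"}, {"chair"}, {"dog", "person"}, set(),
           {"dog", "chair"}, {"chair"}, {"dog"}, {"chair"}]
gt = majority_vote(experts)
print(sorted(gt), union_recall(workers, gt))  # ['chair', 'dog'] 1.0
```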
5 DATASET STATISTICS
- Next, the authors analyze the properties of the Microsoft Common Objects in COntext (MS COCO) dataset in comparison to several other popular datasets.
- On average their dataset contains 3.5 categories and 7.7 instances per image.
- Another interesting observation is that only 10% of the images in MS COCO contain a single object category; in comparison, over 60% of images in ImageNet and PASCAL VOC do (see the sketch after this list for how to reproduce such statistics).
- Generally, smaller objects are harder to recognize and require more contextual reasoning.
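The per-image statistics above are straightforward to recompute from the released annotations. A sketch using pycocotools; the file path is an assumption, crowd regions are excluded here, and exact figures will vary with the release and with how unannotated images are handled:

```python
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2014.json")  # path is an assumption

n_imgs = n_cats = n_insts = single_cat = 0
for img_id in coco.getImgIds():
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id, iscrowd=False))
    if not anns:
        continue  # images with no labeled instances are skipped here
    cats = {a["category_id"] for a in anns}
    n_imgs += 1
    n_cats += len(cats)
    n_insts += len(anns)
    single_cat += len(cats) == 1

print(f"categories/image: {n_cats / n_imgs:.1f}")   # paper reports 3.5
print(f"instances/image:  {n_insts / n_imgs:.1f}")  # paper reports 7.7
print(f"single-category:  {100 * single_cat / n_imgs:.0f}%")  # paper: ~10%
```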
6 DATASET SPLITS
- To accommodate a faster release schedule, the authors split the MS COCO dataset into two roughly equal parts.
- The authors took care to minimize the chance of near-duplicate images existing across splits by explicitly removing near-duplicates (detected with [43]) and grouping images by photographer and date taken (a grouping sketch follows this list).
- The authors are currently finalizing the evaluation server for automatic evaluation on the test set.
- The authors did not collect segmentations for the following 11 categories: hat, shoe, eyeglasses (too many instances), mirror, window, door, street sign (ambiguous and difficult to label), plate, desk (due to confusion with bowl and dining table, respectively) and blender, hair brush (too few instances).
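A minimal sketch of the leakage-avoidance idea: assign each photographer/date group wholesale to one split, so that bursts of near-identical shots from the same session cannot straddle the boundary. The grouping key and hash-based assignment are illustrative, not the authors' exact procedure:

```python
import hashlib

def split_for(photographer: str, date_taken: str) -> str:
    """Assign a whole photographer/date group to one split, so near-duplicate
    shots from the same session land on the same side."""
    key = f"{photographer}|{date_taken}".encode()
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 2
    return "part1" if bucket == 0 else "part2"

# All images from the same session go to the same split.
print(split_for("alice", "2013-07-04"))
print(split_for("alice", "2013-07-04"))  # identical key -> same split
```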
7 ALGORITHMIC ANALYSIS
- For the following experiments the authors take a subset of 55,000 images from their dataset and obtain tight-fitting bounding boxes from the annotated segmentation masks.
- Consistent with past observations [46], the authors find that including difficult (non-iconic) images during training may not always help.
- These observations support two hypotheses: 1) MS COCO is significantly more difficult than PASCAL VOC and 2) models trained on MS COCO can generalize better to easier datasets such as PASCAL VOC given more training data.
- The authors then measure the intersection over union (IoU) of the predicted and ground truth segmentation masks (a sketch of mask IoU follows this list).
- To establish a baseline for their dataset, the authors project learned DPM part masks onto the image to create segmentation masks.
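Two of the operations above are easy to make concrete: deriving a tight bounding box from a binary mask, and the mask IoU used to score the projected part masks. A numpy sketch with toy masks standing in for a prediction and a ground truth:

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray):
    """Tight (x0, y0, x1, y1) box around the nonzero pixels of a mask."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

gt = np.zeros((10, 10), dtype=bool); gt[2:8, 2:8] = True
pred = np.zeros((10, 10), dtype=bool); pred[3:9, 3:9] = True
print(mask_to_bbox(gt))              # (2, 2, 7, 7)
print(round(mask_iou(gt, pred), 3))  # 0.532
```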
8 DISCUSSION
- The authors introduced a new dataset for detecting and segmenting objects found in everyday life in their natural environments.
- Dataset statistics indicate the images contain rich contextual information with many objects present per image.
- To download and learn more about MS COCO please see the project website.
Frequently Asked Questions (15)
Q2. Why is it important for detection datasets to contain objects in their natural environments?
Since the detection of many objects such as sunglasses, cellphones or chairs is highly dependent on contextual information, it is important that detection datasets contain objects in their natural environments.
Q3. How many annotators were used in the sample?
By observing how recall increased as the authors added annotators, the authors estimate that in practice over 99% of all object categories not later rejected as false positives are detected given 8 annotators.
Q4. How many worker hours did it take to segment objects?
Segmenting 2,500,000 object instances is an extremely time consuming task requiring over 22 worker hours per 1,000 segmentations.
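(At that rate, the segmentation stage alone works out to roughly 2,500,000 / 1,000 × 22 = 55,000 worker hours, consistent with the 70,000+ total worker hours quoted below for the full annotation effort.)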
Q5. What is the task of labeling objects in a scene?
The task of labeling semantic objects in a scene requires that each pixel of an image be labeled as belonging to a category, such as sky, chair, floor, street, etc.
Q6. How many worker hours were used to generate object segmentation masks?
Utilizing over 70,000 worker hours, a vast collection of object instances was gathered, annotated and organized to drive the advancement of object detection and segmentation algorithms.
Q7. What happened to segmentations of insufficient quality?
Segmentations of insufficient quality were discarded and the corresponding instances added back to the pool of unsegmented objects.
Q8. How many instances of a category were segmented in an image?
After 10-15 instances of a category were segmented in an image, the remaining instances were marked as “crowds” using a single (possibly multipart) segment.
Q9. What effort was devoted to datasets for the detection of basic object categories?
For the detection of basic object categories, a multiyear effort from 2005 to 2012 was devoted to the creation and maintenance of a series of benchmark datasets that were widely adopted.
Q10. What must a worker do when instances of a super-category are present?
If a worker determines instances from the super-category (animal) are present, for each subordinate category (dog, cat, etc.) present, the worker must drag the category’s icon onto the image over one instance of the category.
Q11. Why did the authors choose to include only “thing” categories?
Since the authors are primarily interested in precise localization of object instances, the authors decided to only include “thing” categories and not “stuff.”
Q12. How many instances of a given category were segmented?
For images containing 10 object instances or fewer of a given category, every instance was individually segmented (note that in some images up to 15 instances were segmented).
Q13. What distinguishes “thing” categories from “stuff” categories?
“Thing” categories include objects for which individual instances may be easily labeled (person, chair, car), whereas “stuff” categories include materials and objects with no clear boundaries (sky, street, grass).
Q14. What is the effect of difficult examples on the learning model?
Such examples may act as noise and pollute the learned model if the model is not rich enough to capture such appearance variability.
Q15. What is the interesting observation about the dataset?
Another interesting observation is that only 10% of the images in MS COCO contain a single object category; in comparison, over 60% of images in ImageNet and PASCAL VOC do.