Microsoft COCO: Common Objects in Context
Frequently Asked Questions (15)
Q2. What is the importance of a detection dataset?
Since the detection of many objects such as sunglasses, cellphones or chairs is highly dependent on contextual information, it is important that detection datasets contain objects in their natural environments.
Q3. How many annotators were used in the sample?
By observing how recall increased as annotators were added, the authors estimate that, in practice, over 99% of all object categories not later rejected as false positives are detected given 8 annotators.
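As a rough illustration of why recall climbs quickly as annotators are added, the sketch below assumes that each annotator independently spots a given category with the same probability. Both the independence assumption and the value 0.5 are hypothetical and are not figures from the paper, whose estimate comes from observed recall.

```python
def union_recall(p_single: float, num_annotators: int) -> float:
    """Probability that at least one of `num_annotators` finds a category.

    Assumes (hypothetically) that annotators are independent and that each
    one finds the category with probability `p_single`.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if num_annotators < 0:
        raise ValueError("num_annotators must be non-negative")
    # A category is missed only if every annotator misses it.
    return 1.0 - (1.0 - p_single) ** num_annotators


# Hypothetical per-annotator detection probability of 0.5.
for k in (1, 2, 4, 8):
    print(k, round(union_recall(0.5, k), 4))
```

Under these assumptions, even a modest per-annotator detection rate is enough for the combined recall to pass 99% at 8 annotators.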
Q4. How many worker hours did it take to segment objects?
Segmenting 2,500,000 object instances is an extremely time-consuming task, requiring over 22 worker hours per 1,000 segmentations.
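The two figures above imply a lower bound on the total segmentation effort. A minimal sketch of that arithmetic follows; the function and constant names are illustrative, and the rate is treated as exactly 22 hours even though the text says "over 22".

```python
# Figures quoted in the answer above.
TOTAL_INSTANCES = 2_500_000
HOURS_PER_1000_SEGMENTATIONS = 22  # stated as a lower bound ("over 22")


def segmentation_hours(instances: int, hours_per_1000: float) -> float:
    """Estimate worker hours needed to segment `instances` objects."""
    return instances / 1000 * hours_per_1000


estimated_hours = segmentation_hours(TOTAL_INSTANCES, HOURS_PER_1000_SEGMENTATIONS)
print(f"At least {estimated_hours:,.0f} worker hours for segmentation alone")
```

This gives at least 55,000 worker hours for segmentation alone, which is consistent with the total of over 70,000 worker hours quoted for the whole annotation effort.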
Q5. What is the task of labeling objects in a scene?
The task of labeling semantic objects in a scene requires that each pixel of an image be labeled as belonging to a category, such as sky, chair, floor, street, etc.
Q6. How many worker hours were used to generate object segmentation masks?
Utilizing over 70,000 worker hours, a vast collection of object instances was gathered, annotated and organized to drive the advancement of object detection and segmentation algorithms.
Q7. What happened to segmentations of insufficient quality?
Segmentations of insufficient quality were discarded and the corresponding instances added back to the pool of unsegmented objects.
Q8. How many instances of a category were segmented in an image?
After 10-15 instances of a category were segmented in an image, the remaining instances were marked as “crowds” using a single (possibly multipart) segment.
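The rule above can be sketched as a small planning function. The name `plan_segmentation`, the `SegmentationPlan` type, and the `cap` parameter are hypothetical; the text gives the cutoff only as a range of 10-15, so the cap is left configurable rather than fixed.

```python
from dataclasses import dataclass


@dataclass
class SegmentationPlan:
    individual: int  # instances that each get their own segment
    has_crowd: bool  # whether leftover instances share one "crowd" segment
    in_crowd: int    # number of instances covered by the crowd segment


def plan_segmentation(num_instances: int, cap: int = 10) -> SegmentationPlan:
    """Split the instances of one category in one image into individually
    segmented instances and a single crowd region.

    `cap` is an assumed cutoff; the source states a range of 10-15.
    """
    if num_instances < 0 or cap < 0:
        raise ValueError("num_instances and cap must be non-negative")
    individual = min(num_instances, cap)
    leftover = num_instances - individual
    return SegmentationPlan(individual, leftover > 0, leftover)


print(plan_segmentation(7))           # few instances: no crowd segment
print(plan_segmentation(40, cap=15))  # many instances: remainder becomes a crowd
```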
Q9. What earlier effort produced benchmark datasets for the detection of basic object categories?
For the detection of basic object categories, a multiyear effort from 2005 to 2012 was devoted to the creation and maintenance of a series of benchmark datasets that were widely adopted.
Q10. How does a worker mark which categories are present in an image?
If a worker determines instances from the super-category (animal) are present, for each subordinate category (dog, cat, etc.) present, the worker must drag the category’s icon onto the image over one instance of the category.
Q11. Why did the authors choose to include only “thing” categories?
Since the authors are primarily interested in precise localization of object instances, they decided to include only “thing” categories and not “stuff.”
Q12. How many instances of a given category were segmented?
For images containing 10 object instances or fewer of a given category, every instance was individually segmented (note that in some images up to 15 instances were segmented).
Q13. What is the difference between “thing” and “stuff” categories?
“Thing” categories include objects for which individual instances may be easily labeled (person, chair, car), whereas “stuff” categories include materials and objects with no clear boundaries (sky, street, grass).
Q14. What is the effect of difficult examples on the learning model?
Such examples may act as noise and pollute the learned model if the model is not rich enough to capture such appearance variability.
Q15. What is the interesting observation about the dataset?
Another interesting observation is that only 10% of the images in MS COCO contain a single category per image; in comparison, over 60% of images in ImageNet and PASCAL VOC contain a single object category.
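A statistic of this kind can be computed from per-image category labels. The sketch below uses a made-up four-image example rather than real dataset annotations, and the function name and input layout are assumptions chosen for illustration.

```python
from typing import Dict, List


def single_category_fraction(image_categories: Dict[str, List[str]]) -> float:
    """Fraction of images whose instances all belong to one category.

    `image_categories` maps an image id to the category label of each
    annotated instance in that image (a hypothetical layout).
    """
    if not image_categories:
        raise ValueError("need at least one image")
    single = sum(
        1 for labels in image_categories.values() if len(set(labels)) == 1
    )
    return single / len(image_categories)


# Toy data, not taken from any dataset.
toy_images = {
    "img1": ["person", "person"],      # one category, two instances
    "img2": ["person", "chair"],       # two categories
    "img3": ["dog"],                   # one category
    "img4": ["car", "person", "dog"],  # three categories
}
toy_fraction = single_category_fraction(toy_images)
print(f"{toy_fraction:.0%} of toy images contain a single category")
```

Note that an image with several instances of the same category still counts as single-category, which matches the per-category wording of the observation.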