Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Open Access, Posted Content

Ranjay Krishna, +11 more

- 23 Feb 2016 -

arXiv: Computer Vision and Pattern Recog...

TLDR

The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.

Abstract:

Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) in order to answer correctly that "the person is riding a horse-drawn carriage". In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 100K images where each image has an average of 21 objects, 18 attributes, and 18 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answers.
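The carriage example in the abstract can be illustrated with a minimal sketch. It assumes relationships are stored as predicate(subject, object) triples; the class and function names (`Relationship`, `objects_of`, `subjects_of`, `describe_vehicle`) are hypothetical and are not the dataset's actual schema or API. The sketch also skips the WordNet canonicalization step that would map "person" to "man", and queries "man" directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One pairwise relationship, written predicate(subject, obj)."""
    predicate: str
    subject: str
    obj: str


# The two relationships named in the abstract's example.
relationships = [
    Relationship("riding", "man", "carriage"),
    Relationship("pulling", "horse", "carriage"),
]


def objects_of(predicate, subject, rels):
    """Objects that `subject` is related to via `predicate`."""
    return [r.obj for r in rels if r.predicate == predicate and r.subject == subject]


def subjects_of(predicate, obj, rels):
    """Subjects that are related to `obj` via `predicate`."""
    return [r.subject for r in rels if r.predicate == predicate and r.obj == obj]


def describe_vehicle(rider, rels):
    """Answer "What vehicle is <rider> riding?" by chaining two relationships."""
    descriptions = []
    for vehicle in objects_of("riding", rider, rels):
        pullers = subjects_of("pulling", vehicle, rels)
        if pullers:
            # Combine riding(rider, vehicle) with pulling(puller, vehicle).
            descriptions.append(f"{pullers[0]}-drawn {vehicle}")
        else:
            descriptions.append(vehicle)
    return descriptions


print(describe_vehicle("man", relationships))  # ['horse-drawn carriage']
```

Object detection alone would return "man", "horse", and "carriage"; the answer "horse-drawn carriage" only emerges once both relationships are combined.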

Citations

Journal ArticleDOI

Places: A 10 Million Image Database for Scene Recognition

Bolei Zhou, +4 more

- 01 Jun 2018 -

IEEE Transactions on Pattern Analysis an...

TL;DR: The Places Database is described, a repository of 10 million scene photographs labeled with scene semantic categories and comprising a large and diverse list of the types of environments encountered in the world; state-of-the-art Convolutional Neural Networks are provided as baselines that significantly outperform previous approaches.

Posted Content

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.

Peter Anderson, +6 more

- 25 Jul 2017 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.

Proceedings ArticleDOI

ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases

Xiaosong Wang, +5 more

TL;DR: The ChestX-ray8 database is presented, comprising 108,948 frontal-view X-ray images of 32,717 unique patients, with eight disease image labels text-mined from the associated radiological reports using natural language processing.

Journal ArticleDOI

Deep Visual-Semantic Alignments for Generating Image Descriptions

Andrej Karpathy, +1 more

- 01 Apr 2017 -

IEEE Transactions on Pattern Analysis an...

TL;DR: A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.

Journal ArticleDOI

Deep Learning for Generic Object Detection: A Survey

Li Liu, +7 more

- 01 Feb 2020 -

International Journal of Computer Vision

TL;DR: A comprehensive survey of the recent achievements in this field brought about by deep learning techniques, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics.

References

Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

TL;DR: A deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax achieved state-of-the-art performance on ImageNet classification.

Journal ArticleDOI

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997 -

Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.

Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

Jia Deng, +5 more

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

Proceedings ArticleDOI

Going deeper with convolutions

Christian Szegedy, +8 more

TL;DR: Inception is a deep convolutional neural network architecture that set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

Related Papers (5)

Microsoft COCO: Common Objects in Context

Deep Residual Learning for Image Recognition

ImageNet: A large-scale hierarchical image database

Very Deep Convolutional Networks for Large-Scale Image Recognition

Long short-term memory