Open Access Proceedings Article

Meshed-Memory Transformer for Image Captioning

TL;DR
The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions, integrating learned a priori knowledge, and uses a mesh-like connectivity at the decoding stage to exploit both low- and high-level features.
Abstract
Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M² - a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions, integrating learned a priori knowledge, and uses a mesh-like connectivity at the decoding stage to exploit both low- and high-level features. Experimentally, we investigate the performance of the M² Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the "Karpathy" test split and on the online test server. We also assess its performance when describing objects unseen in the training set. Trained models and code for reproducing the experiments are publicly available at: https://github.com/aimagelab/meshed-memory-transformer.
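To make the memory-augmented encoding concrete: the idea can be pictured as scaled dot-product attention over image regions whose keys and values are extended with learned memory slots, which is how a priori knowledge independent of the input image can be injected. The sketch below is a minimal, single-head PyTorch rendering of that idea, not the authors' implementation (see their repository for that); the class and parameter names (`MemoryAugmentedAttention`, `n_memories`) are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryAugmentedAttention(nn.Module):
    """Single-head scaled dot-product attention whose keys and values are
    extended with learned memory slots (hypothetical sketch; names are ours)."""

    def __init__(self, d_model: int, n_memories: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learned memory vectors shared across all images: the "a priori
        # knowledge" slots appended to keys and values (assumption: one
        # shared set of slots per layer).
        self.mem_k = nn.Parameter(torch.randn(n_memories, d_model))
        self.mem_v = nn.Parameter(torch.randn(n_memories, d_model))
        self.scale = d_model ** -0.5

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, n_regions, d_model) detector features per image
        b = regions.size(0)
        q = self.q_proj(regions)
        k = self.k_proj(regions)
        v = self.v_proj(regions)
        # Concatenate memory slots so every region can also attend to them.
        k = torch.cat([k, self.mem_k.unsqueeze(0).expand(b, -1, -1)], dim=1)
        v = torch.cat([v, self.mem_v.unsqueeze(0).expand(b, -1, -1)], dim=1)
        weights = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return weights @ v  # refined region features, same shape as input


# Example: 2 images, 50 regions each, 512-d features, 40 memory slots
out = MemoryAugmentedAttention(512, 40)(torch.randn(2, 50, 512))
```

The mesh-like decoding would then attend to the outputs of every encoder layer and weight their contributions with learned gates; that cross-layer gating is omitted from this sketch for brevity.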



Citations
Proceedings Article

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

TL;DR: Introduces the Conceptual 12M (CC12M) dataset, 12 million image-text pairs specifically meant for vision-and-language pre-training.
Posted Content

AdaBins: Depth Estimation using Adaptive Bins

TL;DR: A transformer-based architecture block that divides the depth range into bins whose centers are estimated adaptively per image, showing a decisive improvement over the state of the art on several popular depth datasets across all metrics.
Journal Article

Contrastive Learning of Medical Visual Representations from Paired Images and Text

TL;DR: This work proposes an alternative unsupervised strategy to learn medical visual representations directly from the naturally occurring pairing of images and textual data, and shows that this method leads to image representations that considerably outperform strong baselines in most settings.
Proceedings Article

RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words

TL;DR: Zhang et al. propose a Grid-Augmented (GA) module, which incorporates relative geometry features between grids to enhance visual representations, and an Adaptive-Attention (AA) module on top of a transformer decoder that adaptively measures the contribution of visual and language cues before predicting each word.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: Proposes a residual learning framework to ease the training of networks substantially deeper than those used previously; the approach won 1st place on the ILSVRC 2015 classification task.
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Proceedings Article

Attention is All you Need

TL;DR: Proposes the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieving state-of-the-art performance on English-to-French translation.
Proceedings Article

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context-window methods, and produces a vector space with meaningful substructure.
Book Chapter

Microsoft COCO: Common Objects in Context

TL;DR: A new dataset that advances the state of the art in object recognition by placing it in the broader context of scene understanding, gathering images of complex everyday scenes containing common objects in their natural context.