Learning Transferable Visual Models From Natural Language Supervision

Open Access, Proceedings Article

- pp 8748-8763

TLDR

In this paper, a pre-training task of predicting which caption goes with which image is used to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.

Abstract:

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.
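The pre-training task and the zero-shot transfer described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss function, so the symmetric cross-entropy over an image-text similarity matrix used here is an assumption, as are the function names, the `temperature` value, and the use of plain NumPy arrays in place of real encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each embedding to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (N, D) arrays; row i of each comes from the same pair.
    # temperature is a hypothetical default, not a value taken from the abstract.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # (N, N) pairwise similarities
    idx = np.arange(logits.shape[0])
    # Each image should pick its own caption (rows), and each caption its own image (columns).
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)

def zero_shot_classify(image_emb, class_text_emb):
    # class_text_emb: (C, D) embeddings of one text description per class.
    # Returns, for each image, the index of the most similar class description.
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)
```

In this sketch, zero-shot classification needs no labeled training examples for the downstream task: the class names are turned into text embeddings, and each image is assigned to the nearest one.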

Citations

Journal Article

A Review on Explainability in Multimodal Deep Neural Nets

Gargi Joshi, +2 more

- 31 Mar 2021 -

IEEE Access

TL;DR: This article presents a comprehensive survey and commentary on explainability in multimodal deep neural networks, especially for vision and language tasks, covering the significance, datasets, fundamental building blocks of the methods and techniques, challenges, applications, and future trends in this domain.

Posted Content

Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions

Anil Rahate, +3 more

- 29 Jul 2021 -

arXiv: Learning

TL;DR: A comprehensive survey on the emerging area of multimodal co-learning is provided in this article, along with the important ideas and directions for future work that will be beneficial for the entire research community focusing on this exciting domain.

Posted Content

Concept Generalization in Visual Representation Learning

Mert Bulent Sariyildiz, +3 more

- 10 Dec 2020 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: The authors argue that the semantic relationships between seen and unseen concepts affect generalization performance, and propose ImageNet-CoG, a novel benchmark on the ImageNet-21K (IN-21K) dataset that enables measuring concept generalization in a principled way.

Posted Content

Robust fine-tuning of zero-shot models

Mitchell Wortsman, +7 more

- 04 Sep 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: Weight-space ensembling, which combines the weights of the zero-shot and fine-tuned models, provides large accuracy improvements out-of-distribution while matching or improving in-distribution accuracy.
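One simple way to combine the weights of two models with identical architectures is linear interpolation, parameter by parameter. The sketch below assumes that form; the TL;DR does not state the exact combination rule, and the function name, the `alpha` mixing coefficient, and the dict-of-arrays representation of model weights are all hypothetical.

```python
import numpy as np

def interpolate_weights(zero_shot, fine_tuned, alpha):
    # zero_shot, fine_tuned: dicts mapping parameter name -> NumPy array.
    # alpha = 0 returns the zero-shot weights, alpha = 1 the fine-tuned weights.
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("models must have the same parameter names")
    return {
        name: (1.0 - alpha) * zero_shot[name] + alpha * fine_tuned[name]
        for name in zero_shot
    }
```

Under this assumption, `alpha` would be chosen by evaluating the interpolated model on held-out data rather than fixed in advance.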

Posted Content

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

Kevin Lin, +7 more

- 25 Nov 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: The authors propose an end-to-end transformer-based model for video captioning, which takes video frame patches directly as inputs and outputs a natural language description.

Related Papers (5)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Deep Residual Learning for Image Recognition

Attention is All you Need

ImageNet: A large-scale hierarchical image database

Adam: A Method for Stochastic Optimization