Open Access Proceedings Article

Learning Transferable Visual Models From Natural Language Supervision

TLDR
In this paper, a pre-training task of predicting which caption goes with which image is used to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
Abstract
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.
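
The pre-training task described in the abstract is contrastive: for a batch of N (image, text) pairs, the model must identify which of the N×N possible pairings actually occur, and the same text embeddings later score class-name prompts for zero-shot classification. The following is a minimal sketch of that idea, assuming PyTorch and generic, pre-computed image/text embeddings; the function names and the fixed temperature are illustrative, not the authors' released implementation.

```python
# Minimal sketch (assumes PyTorch; `image_features` / `text_features` are
# batched embeddings from any image/text encoder -- not the released code).
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric cross-entropy over cosine-similarity logits for N aligned pairs."""
    image_features = F.normalize(image_features, dim=-1)      # unit-norm embeddings
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature  # N x N similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # true pairs on the diagonal
    loss_images = F.cross_entropy(logits, targets)     # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_images + loss_texts) / 2

def zero_shot_classify(image_features, class_text_features):
    """Pick the class whose prompt embedding (e.g. 'a photo of a dog') is closest."""
    image_features = F.normalize(image_features, dim=-1)
    class_text_features = F.normalize(class_text_features, dim=-1)
    return (image_features @ class_text_features.t()).argmax(dim=-1)
```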


Citations
Journal Article

A Review on Explainability in Multimodal Deep Neural Nets

TL;DR: A comprehensive survey and commentary on explainability in multimodal deep neural networks, especially for vision-and-language tasks, is presented in this article, covering the significance, datasets, fundamental building blocks of the methods and techniques, challenges, applications, and future trends in the domain.
Posted Content

Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions

TL;DR: A comprehensive survey of the emerging area of multimodal co-learning is provided in this article, along with important ideas and directions for future work intended to benefit the research community focusing on this domain.
Posted Content

Concept Generalization in Visual Representation Learning

TL;DR: It is argued that the semantic relationships between seen and unseen concepts affect generalization performance, and ImageNet-CoG, a novel benchmark built on the ImageNet-21K (IN-21K) dataset, is proposed to measure concept generalization in a principled way.
Posted Content

Robust fine-tuning of zero-shot models

TL;DR: Ensembling the weights of the zero-shot and fine-tuned models (weight-space ensembles) provides large accuracy improvements out of distribution while matching or improving in-distribution accuracy; a minimal sketch of this weight interpolation follows this entry.
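
A rough sketch of that weight-space ensembling idea, assuming both models share an identical architecture and a PyTorch `state_dict` interface; the function name and mixing coefficient `alpha` are illustrative and not taken from the paper's code.

```python
import copy
import torch

def interpolate_weights(zero_shot_model, fine_tuned_model, alpha=0.5):
    """Return a model whose floating-point parameters are the convex combination
    (1 - alpha) * zero_shot + alpha * fine_tuned; alpha=0 keeps the zero-shot model."""
    zs_state = zero_shot_model.state_dict()
    ft_state = fine_tuned_model.state_dict()
    merged_state = {}
    for name, zs_param in zs_state.items():
        if torch.is_floating_point(zs_param):
            merged_state[name] = (1 - alpha) * zs_param + alpha * ft_state[name]
        else:
            # Integer buffers (e.g. batch-norm counters) cannot be interpolated.
            merged_state[name] = ft_state[name]
    merged = copy.deepcopy(zero_shot_model)
    merged.load_state_dict(merged_state)
    return merged
```
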
Posted Content

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

TL;DR: An end-to-end transformer-based model for video captioning, SwinBERT, is proposed; it takes video frame patches directly as input and outputs a natural language description.