Open Access · Proceedings Article
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
pp. 8748–8763
TL;DR: In this paper, a pre-training task of predicting which caption goes with which image is used to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
Abstract:
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.
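To make the pre-training task in the abstract concrete, the sketch below implements the kind of symmetric contrastive objective it describes: for a batch of N (image, text) pairs, every image embedding is scored against every text embedding and the model is trained to identify the N correct pairings. This is a minimal illustration with random tensors standing in for the image and text encoders, not the authors' released implementation; the function name clip_style_loss and the temperature value are assumptions for the example.

import torch
import torch.nn.functional as F

def clip_style_loss(image_features: torch.Tensor,
                    text_features: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    # Symmetric cross-entropy over the N x N cosine-similarity matrix;
    # the correct (image, text) pairs lie on the diagonal.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature   # (N, N)
    targets = torch.arange(logits.size(0))
    loss_images = F.cross_entropy(logits, targets)       # image -> matching text
    loss_texts = F.cross_entropy(logits.t(), targets)    # text  -> matching image
    return (loss_images + loss_texts) / 2

# Toy usage: random vectors stand in for encoder outputs.
image_embeddings = torch.randn(8, 512)
text_embeddings = torch.randn(8, 512)
print(clip_style_loss(image_embeddings, text_embeddings))

At test time the same similarity scoring supports the zero-shot transfer described in the abstract: candidate class names are rendered as captions, and the class whose text embedding is most similar to the image embedding is predicted, without any dataset-specific training.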
Citations
Journal Article · DOI
A Review on Explainability in Multimodal Deep Neural Nets
TL;DR: This article presents a comprehensive survey and commentary on explainability in multimodal deep neural networks, especially for vision-and-language tasks, covering the significance, datasets, fundamental building blocks of the methods and techniques, challenges, applications, and future trends in this domain.
Posted Content
Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions
TL;DR: This article provides a comprehensive survey of the emerging area of multimodal co-learning, along with important ideas and directions for future work intended to benefit the research community focusing on this domain.
Posted Content
Concept Generalization in Visual Representation Learning
TL;DR: It is argued that the semantic relationships between seen and unseen concepts affect generalization performance, and ImageNet-CoG, a novel benchmark on the ImageNet-21K (IN-21K) dataset, is proposed to enable measuring concept generalization in a principled way.
Posted Content
Robust fine-tuning of zero-shot models
Mitchell Wortsman, Gabriel Ilharco, Mike Li, Jong Wook Kim, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, Ludwig Schmidt +7 more
TL;DR: Weight-space ensembles, formed by ensembling the weights of the zero-shot and fine-tuned models, provide large accuracy improvements out of distribution while matching or improving in-distribution accuracy (a minimal sketch of this interpolation appears after the citation list).
Posted Content
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
TL;DR: This paper proposes SwinBERT, an end-to-end transformer-based model for video captioning that takes video frame patches directly as inputs and outputs a natural language description.
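The weight-space ensembling summarized in the "Robust fine-tuning of zero-shot models" entry above reduces, in its simplest form, to a parameter-wise linear interpolation between the zero-shot and fine-tuned checkpoints. The sketch below is a hedged illustration under that reading, not the paper's released code; the function name weight_space_ensemble and the default mixing coefficient alpha are assumptions for the example.

import torch

def weight_space_ensemble(zero_shot_state: dict,
                          fine_tuned_state: dict,
                          alpha: float = 0.5) -> dict:
    # Parameter-wise interpolation: theta = (1 - alpha) * zero_shot + alpha * fine_tuned.
    # alpha trades in-distribution accuracy (favoring the fine-tuned weights)
    # against out-of-distribution robustness (favoring the zero-shot weights).
    return {
        name: (1.0 - alpha) * zero_shot_state[name] + alpha * fine_tuned_state[name]
        for name in zero_shot_state
    }

# Toy usage: two small models with the same architecture stand in for the
# zero-shot and fine-tuned checkpoints.
zero_shot = torch.nn.Linear(4, 2)
fine_tuned = torch.nn.Linear(4, 2)
merged = torch.nn.Linear(4, 2)
merged.load_state_dict(weight_space_ensemble(zero_shot.state_dict(),
                                             fine_tuned.state_dict(),
                                             alpha=0.5))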