Self-supervised Pretraining of Visual Features in the Wild

Home
/
Papers
/
Self-supervised Pretraining of Visual Features in the Wild

Posted Content•

Self-supervised Pretraining of Visual Features in the Wild

Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek S. Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, Piotr Bojanowski - Show less +7 more

02 Mar 2021-arXiv: Computer Vision and Pattern Recognition-

TL;DR: Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods as mentioned in this paper. But self-learning cannot learn from any random image and from any unbounded dataset.

read less

Abstract: Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods. These results have been achieved in a control environment, that is the highly curated ImageNet dataset. However, the premise of self-supervised learning is that it can learn from any random image and from any unbounded dataset. In this work, we explore if self-supervision lives to its expectation by training large models on random, uncurated images with no supervision. Our final SElf-supERvised (SEER) model, a RegNetY with 1.3B parameters trained on 1B random images with 512 GPUs achieves 84.2% top-1 accuracy, surpassing the best self-supervised pretrained model by 1% and confirming that self-supervised learning works in a real world setting. Interestingly, we also observe that self-supervised models are good few-shot learners achieving 77.9% top-1 with access to only 10% of ImageNet. Code: this https URL

...read moreread less

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Self-supervised Learning: Generative or Contrastive.

[...]

Xiao Liu¹, Fanjin Zhang¹, Zhenyu Hou¹, Zhaoyu Wang¹, Li Mian², Jing Zhang, Jie Tang³ - Show less +3 more•Institutions (3)

Tsinghua University¹, Beijing Institute of Technology², Renmin University of China³

15 Jun 2020-arXiv: Learning

TL;DR: This survey takes a look into new self-supervised learning methods for representation in computer vision, natural language processing, and graph learning, and comprehensively review the existing empirical methods into three main categories according to their objectives.

...read moreread less

Abstract: Deep supervised learning has achieved great success in the last decade. However, its deficiencies of dependence on manual labels and vulnerability to attacks have driven people to explore a better solution. As an alternative, self-supervised learning attracts many researchers for its soaring performance on representation learning in the last several years. Self-supervised representation learning leverages input data itself as supervision and benefits almost all types of downstream tasks. In this survey, we take a look into new self-supervised learning methods for representation in computer vision, natural language processing, and graph learning. We comprehensively review the existing empirical methods and summarize them into three main categories according to their objectives: generative, contrastive, and generative-contrastive (adversarial). We further investigate related theoretical analysis work to provide deeper thoughts on how self-supervised learning works. Finally, we briefly discuss open problems and future directions for self-supervised learning. An outline slide for the survey is provided.

...read moreread less

576 citations

Posted Content•

Emerging Properties in Self-Supervised Vision Transformers

[...]

Mathilde Caron¹, Hugo Touvron², Hugo Touvron¹, Ishan Misra¹, Hervé Jégou¹, Julien Mairal, Piotr Bojanowski¹, Armand Joulin¹ - Show less +4 more•Institutions (2)

Facebook¹, University of Paris²

29 Apr 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets) beyond the fact that adapting selfsupervised methods to this architecture works particularly well, they make the following observations: first, self-vised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets.

...read moreread less

Abstract: In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.

...read moreread less

557 citations

Journal Article•DOI•

Artificial intelligence and machine learning for medical imaging: A technology review.

[...]

Ana M. Barragan-Montero¹, Umair Javaid¹, Gilmer Valdes², Dan Nguyen³, Paul Desbordes¹, Benoît Macq¹, S. Willems⁴, Liesbeth Vandewinckele⁴, Mats Holmström, Fredrik Löfman, Steven Michiels¹, Kevin Souris¹, Edmond Sterpin⁴, John Aldo Lee¹ - Show less +10 more•Institutions (4)

Université catholique de Louvain¹, University of California, San Francisco², University of Texas Southwestern Medical Center³, Katholieke Universiteit Leuven⁴

01 Mar 2021-Physica Medica

TL;DR: Artificial intelligence (AI) has recently become a very popular buzzword, as a consequence of disruptive technical advances and impressive experimental results, notably in the field of image analysis and processing as discussed by the authors.

...read moreread less

66 citations

Journal Article•DOI•

Review on self-supervised image recognition using deep neural networks

[...]

Kriti Ohri¹, Mukesh Kumar¹•Institutions (1)

National Institute of Technology, Patna¹

19 Jul 2021-Knowledge Based Systems

TL;DR: Self-supervised learning as discussed by the authors is a form of unsupervised deep learning that allows the network to learn rich visual features that help in performing downstream computer vision tasks such as image classification, object detection, and image segmentation.

...read moreread less

Abstract: Deep learning has brought significant developments in image understanding tasks such as object detection, image classification, and image segmentation. But the success of image recognition largely relies on supervised learning that requires huge number of human-annotated labels. To avoid costly collection of labeled data and the domains where very few standard pre-trained models exist, self-supervised learning comes to our rescue. Self-supervised learning is a form of unsupervised learning that allows the network to learn rich visual features that help in performing downstream computer vision tasks such as image classification, object detection, and image segmentation. This paper provides a thorough review of self-supervised learning which has the potential to revolutionize the computer vision field using unlabeled data. First, the motivation of self-supervised learning is discussed, and other annotation efficient learning schemes. Then, the general pipeline for supervised learning and self-supervised learning is illustrated. Next, various handcrafted pretext tasks are explained that enable learning of visual features using unlabeled image dataset. The paper also highlights the recent breakthroughs in self-supervised learning using contrastive learning and clustering methods that are outperforming supervised learning. Finally, we have performance comparisons of self-supervised techniques on evaluation tasks such as image classification and detection. In the end, the paper is concluded with practical considerations and open challenges of image recognition tasks in self-supervised learning regime. From the onset of the review paper, the core focus is on visual feature learning from images using the self-supervised approaches.

...read moreread less

51 citations

Posted Content•DOI•

Self-Supervised Deep-Learning Encodes High-Resolution Features of Protein Subcellular Localization

[...]

Hirofumi Kobayashi, Cheveralls Kc, Leonetti, Loic Royer

29 Mar 2021-bioRxiv

TL;DR: In this article, a deep learning-based approach for fully self-supervised protein localization profiling and clustering is presented, which does not require pre-existing knowledge, categories, or annotations.

...read moreread less

Abstract: Elucidating the diversity and complexity of protein localization is essential to fully understand cellular architecture. Here, we present cytoself, a deep learning-based approach for fully self-supervised protein localization profiling and clustering. cytoself leverages a self-supervised training scheme that does not require pre-existing knowledge, categories, or annotations. Applying cytoself to images of 1311 endogenously labeled proteins from the recently released OpenCell database creates a highly resolved protein localization atlas. We show that the representations derived from cytoself encapsulate highly specific features that can be used to derive functional insights for proteins on the sole basis of their localization. Finally, to better understand the inner workings of our model, we dissect the emergent features from which our clustering is derived, interpret these features in the context of the fluorescence images, and analyze the performance contributions of the different components of our approach.

...read moreread less

35 citations

1
2
3
4
…
5
6
7
8
9

Collapse

References

PDF

Open Access

More filters

Proceedings Article•DOI•

Deep Residual Learning for Image Recognition

[...]

Kaiming He¹, Xiangyu Zhang¹, Shaoqing Ren¹, Jian Sun¹•Institutions (1)

Microsoft¹

27 Jun 2016

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.

...read moreread less

Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

...read moreread less

123,388 citations

Journal Article•DOI•

ImageNet Large Scale Visual Recognition Challenge

[...]

Olga Russakovsky¹, Jia Deng², Hao Su¹, Jonathan Krause¹, Sanjeev Satheesh¹, Sean Ma¹, Zhiheng Huang¹, Andrej Karpathy¹, Aditya Khosla³, Michael S. Bernstein¹, Alexander C. Berg⁴, Li Fei-Fei¹ - Show less +8 more•Institutions (4)

Stanford University¹, University of Michigan², Massachusetts Institute of Technology³, University of North Carolina at Chapel Hill⁴

01 Dec 2015-International Journal of Computer Vision

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.

...read moreread less

Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

...read moreread less

30,811 citations

Book Chapter•DOI•

Microsoft COCO: Common Objects in Context

[...]

Tsung-Yi Lin¹, Michael Maire², Serge Belongie¹, James Hays, Pietro Perona², Deva Ramanan³, Piotr Dollár⁴, C. Lawrence Zitnick⁴ - Show less +4 more•Institutions (4)

Cornell University¹, California Institute of Technology², University of California, Irvine³, Microsoft⁴

06 Sep 2014

TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.

...read moreread less

Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

...read moreread less

30,462 citations

Posted Content•

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

[...]

Jacob Devlin¹, Ming-Wei Chang¹, Kenton Lee¹, Kristina Toutanova¹•Institutions (1)

Google¹

11 Oct 2018-arXiv: Computation and Language

TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

...read moreread less

Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

...read moreread less

29,480 citations

Journal Article•DOI•

The Pascal Visual Object Classes (VOC) Challenge

[...]

Mark Everingham¹, Luc Van Gool², Christopher Williams³, John Winn⁴, Andrew Zisserman⁵ - Show less +1 more•Institutions (5)

University of Leeds¹, Katholieke Universiteit Leuven², University of Edinburgh³, Microsoft⁴, University of Oxford⁵

01 Jun 2010-International Journal of Computer Vision

TL;DR: The state-of-the-art in evaluated methods for both classification and detection are reviewed, whether the methods are statistically different, what they are learning from the images, and what the methods find easy or confuse.

...read moreread less

Abstract: The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.

...read moreread less

15,935 citations