Open Access Posted Content

Multi-task Self-Supervised Visual Learning

TLDR
The results show that deeper networks work better, and that combining tasks, even via a naive multi-head architecture, always improves performance.
Abstract
We investigate methods for combining multiple self-supervised tasks (i.e., supervised tasks where data can be collected without manual labeling) in order to train a single visual representation. First, we provide an apples-to-apples comparison of four different self-supervised tasks using the very deep ResNet-101 architecture. We then combine tasks to jointly train a network. We also explore lasso regularization to encourage the network to factorize the information in its representation, and methods for "harmonizing" network inputs in order to learn a more unified representation. We evaluate all methods on ImageNet classification, PASCAL VOC detection, and NYU depth prediction. Our results show that deeper networks work better, and that combining tasks, even via a naive multi-head architecture, always improves performance. Our best joint network nearly matches the PASCAL performance of a model pre-trained on ImageNet classification, and matches the ImageNet network on NYU depth prediction.
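The "naive multi-head architecture" mentioned in the abstract can be illustrated with a minimal NumPy sketch: one shared trunk feeds several task-specific heads, and the joint objective is simply the sum of per-task losses. All shapes, the toy ReLU trunk, and the placeholder losses below are illustrative assumptions, not the paper's actual ResNet-101 setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def trunk(x, W):
    """Shared representation: a single ReLU layer stands in for the deep trunk."""
    return np.maximum(0.0, x @ W)

def head(h, W_task):
    """Task-specific linear head on top of the shared features."""
    return h @ W_task

# Toy dimensions: 8 images, 32 input dims, 16 shared features, 2 tasks.
x = rng.normal(size=(8, 32))
W_shared = rng.normal(size=(32, 16))
heads = {
    "rotation": rng.normal(size=(16, 4)),      # hypothetical 4-way rotation task
    "colorization": rng.normal(size=(16, 3)),  # hypothetical 3-bin color task
}

h = trunk(x, W_shared)
logits = {task: head(h, W) for task, W in heads.items()}

# Naive joint objective: unweighted sum of per-task losses
# (mean-squared logits here are a placeholder for real task losses).
joint_loss = sum(np.mean(z ** 2) for z in logits.values())
```

The point of the sketch is that each task only adds a cheap head; the shared trunk receives gradients from every task, which is the mechanism behind the "always improves performance" result.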


Citations
Posted Content

Deep Clustering for Unsupervised Learning of Visual Features

TL;DR: This work presents DeepCluster, a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features, and that outperforms the current state of the art by a significant margin on all the standard benchmarks.
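The alternation at the heart of DeepCluster can be sketched as follows: extract features, cluster them with k-means to obtain pseudo-labels, then (in the full method) train the network on those pseudo-labels. The feature extractor, dimensions, and the omitted training step below are toy stand-ins, not the actual convnet pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

def features(x, W):
    # Stand-in for the convnet feature extractor.
    return np.maximum(0.0, x @ W)

def kmeans_assign(f, centroids):
    # Assign each feature vector to its nearest centroid; the cluster
    # index serves as the pseudo-label for that image.
    d = ((f[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)

x = rng.normal(size=(20, 10))
W = rng.normal(size=(10, 5))
k = 3
centroids = rng.normal(size=(k, 5))

for _ in range(2):  # alternate: cluster features, then train on pseudo-labels
    f = features(x, W)
    labels = kmeans_assign(f, centroids)
    # Clustering half of the alternation: recompute centroids.
    for j in range(k):
        if (labels == j).any():
            centroids[j] = f[labels == j].mean(0)
    # In DeepCluster proper, W would now be updated by supervised
    # training against `labels`; that step is omitted in this sketch.
```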
Posted Content

Learning deep representations by mutual information estimation and maximization

TL;DR: It is shown that structure matters: incorporating knowledge about locality in the input into the objective can significantly improve a representation’s suitability for downstream tasks and is an important step towards flexible formulations of representation learning objectives for specific end-goals.
Journal ArticleDOI

Self-supervised learning for medical image analysis using image context restoration.

TL;DR: A novel self-supervised learning strategy based on context restoration is proposed in order to better exploit unlabelled images and is validated in three common problems in medical imaging: classification, localization, and segmentation.
Posted Content

Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination

TL;DR: This work formulates this intuition as a non-parametric classification problem at the instance level, and uses noise-contrastive estimation to tackle the computational challenges imposed by the large number of instance classes.
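The instance-level classification idea can be sketched in a few lines: every image is its own class, embeddings are L2-normalized and stored in a memory bank, and an image is classified by a temperature-scaled softmax over its similarities to all bank entries. The encoder, sizes, and the single-pass setup below are toy assumptions (the temperature value is borrowed from the paper's setup); the real method uses noise-contrastive estimation to avoid the full softmax.

```python
import numpy as np

rng = np.random.default_rng(2)

def embed(x, W):
    # Stand-in encoder: project and L2-normalize, as in the memory-bank setup.
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

n, d = 16, 8             # 16 images, i.e. 16 instance-level "classes"
x = rng.normal(size=(n, 32))
W = rng.normal(size=(32, d))
bank = embed(x, W)       # memory bank: one stored embedding per instance

# Instance-level classification: softmax over similarity to every bank entry.
tau = 0.07               # temperature
logits = embed(x, W) @ bank.T / tau
probs = np.exp(logits - logits.max(1, keepdims=True))
probs /= probs.sum(1, keepdims=True)
pred = probs.argmax(1)   # each image should retrieve its own bank entry
```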
Posted Content

Objects that Sound

TL;DR: In this article, audio and visual embeddings are learned from unlabeled video using only audio-visual correspondence (AVC) as the objective function, which is a form of cross-modal self-supervision from video.
References
Journal ArticleDOI

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection spanning hundreds of object categories and millions of images; it has been run annually from 2010 to the present, attracting participation from more than fifty institutions.
Posted Content

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

TL;DR: Faster R-CNN, as discussed by the authors, proposes a Region Proposal Network (RPN) to generate high-quality region proposals, which are used by Fast R-CNN for detection.
Book ChapterDOI

Identity Mappings in Deep Residual Networks

TL;DR: In this paper, it is shown that forward and backward signals can be directly propagated from one block to any other block when identity mappings are used as the skip connections and after-addition activation.
Proceedings Article

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

TL;DR: In this paper, the authors show that training with residual connections accelerates the training of Inception networks significantly, and they also present several new streamlined architectures for both residual and non-residual Inception Networks.
Proceedings ArticleDOI

Action Recognition with Improved Trajectories

TL;DR: Dense trajectories, an efficient video representation for action recognition that achieved state-of-the-art results on a variety of datasets, are improved by taking camera motion into account to correct them.