Author

Longlong Jing

Other affiliations: The Graduate Center, CUNY
Bio: Longlong Jing is an academic researcher from City University of New York. The author has contributed to research on topics including convolutional neural networks and feature learning. The author has an h-index of 9 and has co-authored 26 publications receiving 893 citations. Previous affiliations of Longlong Jing include The Graduate Center, CUNY.

Papers
Journal ArticleDOI
TL;DR: An extensive review of deep learning-based self-supervised methods for learning general visual features from images or videos, a subset of unsupervised learning that learns general image and video features from large-scale unlabeled data without any human-annotated labels.
Abstract: Large-scale labeled data are generally required to train deep neural networks to obtain good performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, self-supervised learning methods, a subset of unsupervised learning methods, have been proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminology of this field are described. Then the common deep neural network architectures used for self-supervised learning are summarized. Next, the schema and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used datasets for images, videos, audio, and 3D data, as well as the existing self-supervised visual feature learning methods. Finally, quantitative performance comparisons of the reviewed methods on benchmark datasets are summarized and discussed for both image and video feature learning, and the paper concludes with a set of promising future directions for self-supervised visual feature learning.
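
The general pipeline the survey describes reduces to two stages: train a network on a pretext task whose labels are generated automatically from the data, then evaluate the learned features, typically by training a linear classifier on the frozen backbone. A minimal PyTorch sketch of that pipeline, with all architectures and sizes chosen arbitrarily for illustration:

import torch
import torch.nn as nn

# Stage 1: pretext training on unlabeled data. The backbone maps images to
# features; the head predicts the pseudo-label defined by the pretext task
# (e.g., which of four rotations was applied to the input).
backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
pretext_head = nn.Linear(32, 4)
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(pretext_head.parameters()), lr=0.01)

def pretext_step(images, pseudo_labels):
    loss = nn.functional.cross_entropy(pretext_head(backbone(images)), pseudo_labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Stage 2: evaluation by linear probing -- freeze the backbone and train only
# a linear classifier on the labeled downstream dataset.
for p in backbone.parameters():
    p.requires_grad = False
linear_probe = nn.Linear(32, 1000)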

876 citations

Posted Content
TL;DR: A review of self-supervised learning, a subset of unsupervised image and video feature learning that aims to learn general visual features from large-scale unlabeled data without using any human-annotated labels.
Abstract: Large-scale labeled data are generally required to train deep neural networks to obtain good performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, self-supervised learning methods, a subset of unsupervised learning methods, have been proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminology of this field are described. Then the common deep neural network architectures used for self-supervised learning are summarized. Next, the main components and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used image and video datasets and the existing self-supervised visual feature learning methods. Finally, quantitative performance comparisons of the reviewed methods on benchmark datasets are summarized and discussed for both image and video feature learning, and the paper concludes with a set of promising future directions for self-supervised visual feature learning.

595 citations

Posted Content
TL;DR: With the self-supervised pre-trained 3DRotNet from large datasets, recognition accuracy is boosted by 20.4% on UCF101 and 16.7% on HMDB51, respectively, compared to models trained from scratch.
Abstract: The success of deep neural networks generally requires a vast amount of training data to be labeled, which is expensive and infeasible at scale, especially for video collections. To alleviate this problem, in this paper we propose 3DRotNet: a fully self-supervised approach to learn spatiotemporal features from unlabeled videos. A set of rotations is applied to all videos, and a pretext task is defined as predicting these rotations. In accomplishing this task, 3DRotNet is trained to understand the semantic concepts and motions in videos. In other words, it learns a spatiotemporal video representation that can be transferred to improve video understanding tasks on small datasets. Our extensive experiments demonstrate the effectiveness of the proposed framework on action recognition, leading to significant improvements over state-of-the-art self-supervised methods. With the self-supervised pre-trained 3DRotNet from large datasets, recognition accuracy is boosted by 20.4% on UCF101 and 16.7% on HMDB51, respectively, compared to models trained from scratch.
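
The pretext task itself is simple to express: rotate each clip by one of four fixed angles and train the network to predict which rotation was applied. A minimal sketch of that idea, using a toy 3D ConvNet in place of the actual 3DRotNet architecture:

import torch
import torch.nn as nn

def rotate_clip(clip, k):
    # clip: (C, T, H, W); rotate every frame spatially by k * 90 degrees.
    return torch.rot90(clip, k, dims=(2, 3))

def make_rotation_batch(clips):
    # Draw a random rotation in {0, 90, 180, 270} degrees per clip; the
    # drawn index doubles as the pretext label.
    labels = torch.randint(0, 4, (len(clips),))
    inputs = torch.stack([rotate_clip(c, int(k)) for c, k in zip(clips, labels)])
    return inputs, labels

# Toy 3D ConvNet standing in for the deeper 3DRotNet backbone.
model = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(16, 4),                # predict which of the four rotations
)

clips = [torch.randn(3, 16, 112, 112) for _ in range(8)]  # fake unlabeled clips
inputs, labels = make_rotation_batch(clips)
loss = nn.functional.cross_entropy(model(inputs), labels)
loss.backward()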

135 citations

Posted Content
28 Nov 2018
TL;DR: A novel 3D ConvNet-based fully self-supervised framework to learn spatiotemporal video features without using any human-labeled annotations; it outperforms state-of-the-art fully self-supervised methods on both the UCF101 and HMDB51 datasets, achieving 62.9% and 33.7% accuracy, respectively.
Abstract: To alleviate the expensive cost of data collection and annotation, many self-supervised learning methods have been proposed to learn image representations without human-labeled annotations. However, self-supervised learning for video representations is not yet well addressed. In this paper, we propose a novel 3D ConvNet-based fully self-supervised framework to learn spatiotemporal video features without using any human-labeled annotations. First, a set of pre-designed geometric transformations (e.g., rotations of 0, 90, 180, and 270 degrees) is applied to each video. Then a pretext task can be defined as "recognizing the pre-designed geometric transformations." Therefore, spatiotemporal video features can be learned in the process of accomplishing this pretext task without using human-labeled annotations. The learned spatiotemporal video representations can further be employed as pretrained features for different video-related applications. The proposed geometric transformations (e.g., rotations) prove effective for learning representative spatiotemporal features in our 3D ConvNet-based fully self-supervised framework. With the pre-trained spatiotemporal features from two large video datasets, action recognition performance is significantly boosted, by 20.4% on the UCF101 dataset and 16.7% on the HMDB51 dataset, respectively, compared to a model trained from scratch. Furthermore, our framework outperforms state-of-the-art fully self-supervised methods on both the UCF101 and HMDB51 datasets, achieving 62.9% and 33.7% accuracy, respectively.
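
Employing the learned representation as pretrained features amounts to dropping the 4-way transformation head and attaching a downstream classifier. A hedged sketch of that transfer step for action recognition; the toy backbone and the checkpoint path are placeholders, not the paper's released artifacts:

import torch
import torch.nn as nn

# Toy backbone assumed to hold weights learned on the rotation pretext task;
# the checkpoint path below is a placeholder, not a released artifact.
backbone = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
)
# backbone.load_state_dict(torch.load("pretext_pretrained.pt"))

# Swap the 4-way transformation head for an action-recognition head
# (e.g., 101 classes for UCF101) and fine-tune end to end.
model = nn.Sequential(backbone, nn.Linear(16, 101))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

clips = torch.randn(2, 3, 16, 112, 112)      # toy labeled batch
labels = torch.randint(0, 101, (2,))
loss = nn.functional.cross_entropy(model(clips), labels)
loss.backward()
optimizer.step()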

81 citations

Journal ArticleDOI
TL;DR: This paper proposes a novel recursive coarse-to-fine semantic segmentation framework based on only image-level category labels; it can be easily extended to the foreground object segmentation task and achieves performance comparable to state-of-the-art supervised methods on the Internet object dataset.
Abstract: Deep neural network-based semantic segmentation generally requires large-scale, costly annotations for training to obtain good performance. To avoid the pixel-wise segmentation annotations that most methods need, some researchers have recently attempted to use object-level labels (e.g., bounding boxes) or image-level labels (e.g., image categories). In this paper, we propose a novel recursive coarse-to-fine semantic segmentation framework based on only image-level category labels. For each image, an initial coarse mask is first generated by a convolutional neural network-based unsupervised foreground segmentation model and is then enhanced by a graph model. The enhanced coarse mask is fed to a fully convolutional neural network to be recursively refined. Unlike existing image-level label-based semantic segmentation methods, which require labeling of all categories for images that contain multiple types of objects, our framework needs only one label per image and can handle images that contain multi-category objects. Trained only on ImageNet, our framework achieves performance on the PASCAL VOC dataset comparable to other state-of-the-art image-level label-based methods of semantic segmentation. Furthermore, our framework can be easily extended to the foreground object segmentation task and achieves performance comparable to state-of-the-art supervised methods on the Internet object dataset.
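
The recursive loop the abstract describes can be sketched directly: generate a coarse mask, enhance it, fit a fully convolutional network to the current masks, then re-predict to obtain the next, finer masks. A schematic PyTorch sketch in which the unsupervised foreground model and the graph enhancement are crude stand-ins (an intensity threshold and an average-pooling smoother), not the paper's actual components:

import torch
import torch.nn as nn
import torch.nn.functional as F

def coarse_foreground_mask(image):
    # Stand-in for the unsupervised CNN foreground model: a crude
    # intensity threshold producing an initial coarse mask.
    return (image.mean(dim=0, keepdim=True) > image.mean()).float()

def graph_enhance(mask):
    # Stand-in for the graph-model enhancement: simple spatial smoothing.
    return F.avg_pool2d(mask.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)

fcn = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(8, 1, 1), nn.Sigmoid())     # toy FCN
opt = torch.optim.Adam(fcn.parameters(), lr=1e-3)

images = [torch.rand(3, 64, 64) for _ in range(4)]
masks = [graph_enhance(coarse_foreground_mask(im)) for im in images]

for _ in range(3):                            # recursive refinement rounds
    for im, m in zip(images, masks):          # fit the FCN to current masks
        loss = F.binary_cross_entropy(fcn(im.unsqueeze(0)), m.unsqueeze(0))
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                     # re-predict finer masks
        masks = [fcn(im.unsqueeze(0)).squeeze(0) for im in images]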

69 citations


Cited by
Posted Content
TL;DR: This paper proposes an online algorithm, SwAV, that takes advantage of contrastive methods without requiring pairwise comparisons to be computed, using a swapped prediction mechanism in which the cluster assignment of one view is predicted from the representation of another view.
Abstract: Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring pairwise comparisons to be computed. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or views) of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a swapped prediction mechanism where we predict the cluster assignment of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without significantly increasing the memory or compute requirements. We validate our findings by achieving 75.3% top-1 accuracy on ImageNet with ResNet-50, as well as surpassing supervised pretraining on all the considered transfer tasks.
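
The swapped prediction mechanism can be written compactly: compute soft cluster assignments (codes) for each view with a few Sinkhorn-Knopp normalization steps, then train each view's projection to predict the other view's codes. A simplified single-process sketch of that loss, assuming L2-normalized embeddings and learnable prototypes; the multi-crop augmentation, queue, and distributed details of the real implementation are omitted:

import torch
import torch.nn.functional as F

def sinkhorn(scores, eps=0.05, n_iters=3):
    # Sinkhorn-Knopp: iteratively normalize rows and columns so that
    # cluster assignments are (approximately) equally partitioned.
    q = torch.exp(scores / eps).t()          # (K, B)
    q /= q.sum()
    K, B = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)
        q /= K
        q /= q.sum(dim=0, keepdim=True)
        q /= B
    return (q * B).t()                       # (B, K) soft codes

def swav_loss(z1, z2, prototypes, temp=0.1):
    # z1, z2: L2-normalized embeddings of two views of the same images, (B, D)
    # prototypes: (D, K) learnable, unit-norm cluster centers
    p1, p2 = z1 @ prototypes, z2 @ prototypes    # prototype scores, (B, K)
    with torch.no_grad():                        # codes serve as targets only
        q1, q2 = sinkhorn(p1), sinkhorn(p2)
    # Swapped prediction: view 1 predicts view 2's code and vice versa.
    l1 = -(q2 * F.log_softmax(p1 / temp, dim=1)).sum(dim=1).mean()
    l2 = -(q1 * F.log_softmax(p2 / temp, dim=1)).sum(dim=1).mean()
    return (l1 + l2) / 2

The prototypes are learned jointly with the encoder, and gradients are blocked through the codes, which is why the Sinkhorn step runs under no_grad.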

1,771 citations

Posted Content
TL;DR: This work revisits and improves Contrastive Predictive Coding, an unsupervised objective for learning representations that make the variability in natural signals more predictable, producing features that support state-of-the-art linear classification accuracy on the ImageNet dataset.
Abstract: Human observers can learn to recognize new categories of images from a handful of examples, yet doing so with artificial systems remains an open challenge. We hypothesize that data-efficient recognition is enabled by representations which make the variability in natural signals more predictable. We therefore revisit and improve Contrastive Predictive Coding, an unsupervised objective for learning such representations. This new implementation produces features which support state-of-the-art linear classification accuracy on the ImageNet dataset. When used as input for non-linear classification with deep neural networks, this representation allows us to use 2-5x fewer labels than classifiers trained directly on image pixels. Finally, this unsupervised representation substantially improves transfer learning to object detection on the PASCAL VOC dataset, surpassing fully supervised pre-trained ImageNet classifiers.
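
At the heart of Contrastive Predictive Coding is an InfoNCE loss: a context vector predicts a target latent, and the prediction must pick out the true latent from negatives drawn from the rest of the batch. A minimal sketch of that objective, assuming the context vectors and target latents are already computed (the full model adds a patch encoder and an autoregressive aggregation network):

import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(context, targets, W):
    # context: (B, Dc) aggregated context vectors; targets: (B, Dz) latents
    # to be predicted; W: (Dc, Dz) learned bilinear prediction weights.
    preds = context @ W                  # predicted latents, (B, Dz)
    logits = preds @ targets.t()         # similarity of each prediction to
                                         # every latent in the batch, (B, B)
    labels = torch.arange(len(context))  # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

B, Dc, Dz = 32, 128, 64
W = nn.Parameter(0.01 * torch.randn(Dc, Dz))
context, targets = torch.randn(B, Dc), torch.randn(B, Dz)
info_nce(context, targets, W).backward()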

1,035 citations

Posted Content
TL;DR: A comprehensive review of recent pioneering efforts in semantic and instance segmentation, including convolutional pixel-labeling networks, encoder-decoder architectures, multiscale and pyramid-based approaches, recurrent networks, visual attention models, and generative models in adversarial settings are provided.
Abstract: Image segmentation is a key topic in image processing and computer vision, with applications such as scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among many others. Various algorithms for image segmentation have been developed in the literature. Recently, due to the success of deep learning models in a wide range of vision applications, there has been a substantial number of works aimed at developing image segmentation approaches using deep learning models. In this survey, we provide a comprehensive review of the literature at the time of this writing, covering a broad spectrum of pioneering works for semantic and instance-level segmentation, including fully convolutional pixel-labeling networks, encoder-decoder architectures, multi-scale and pyramid-based approaches, recurrent networks, visual attention models, and generative models in adversarial settings. We investigate the similarities, strengths, and challenges of these deep learning models, examine the most widely used datasets, report performances, and discuss promising future research directions in this area.

950 citations
