Showing papers by "Ross Girshick" published in 2018

Proceedings Article•DOI•

Non-local Neural Networks

Xiaolong Wang¹, Ross Girshick¹, Abhinav Gupta², Kaiming He¹•Institutions (2)

18 Jun 2018

TL;DR: This paper presents non-local operations, which compute the response at a position as a weighted sum of the features at all positions and can therefore capture long-range dependencies.

Abstract: Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method [4] in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete with or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code will be made available.
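
The weighted-sum formulation in the abstract above can be illustrated with a minimal Python sketch. The dot-product similarity, the softmax normalization, and the function name `non_local` are assumptions chosen for illustration; the abstract itself specifies only that the response at each position is a weighted sum of the features at all positions.

```python
import math


def non_local(features):
    """Apply a simple non-local operation to a list of feature vectors.

    features: list of N feature vectors (lists of floats), one per position.
    Returns a list of N output vectors of the same dimension.
    """
    outputs = []
    for x_i in features:
        # Similarity of position i to every position j (assumed: dot product).
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in features]
        # Softmax normalization (assumed) so the weights sum to 1.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Response at position i: weighted sum of the features at all positions.
        y_i = [
            sum(w * x_j[d] for w, x_j in zip(weights, features))
            for d in range(len(x_i))
        ]
        outputs.append(y_i)
    return outputs
```

In this sketch every output position looks at every input position, so the cost grows quadratically with the number of positions.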

8,059 citations

Book Chapter•DOI•

Exploring the Limits of Weakly Supervised Pretraining

Dhruv Mahajan¹, Ross Girshick¹, Vignesh Ramanathan¹, Kaiming He¹, Manohar Paluri¹, Yixuan Li¹, Ashwin Bharambe¹, Laurens van der Maaten¹•Institutions (1)

Facebook¹

08 Sep 2018

TL;DR: In this paper, the authors presented a transfer learning approach with large convolutional networks trained to predict hashtags on billions of social media images and reported the highest ImageNet-1k single-crop, top-1 accuracy to date.

Abstract: State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards “small”. Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are obvious: such datasets are difficult to collect and annotate. In this paper, we present a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. Our experiments demonstrate that training for large-scale hashtag prediction leads to excellent results. We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform extensive experiments that provide novel empirical data on the relationship between large-scale pretraining and transfer learning performance.

860 citations

Proceedings Article•DOI•

Low-Shot Learning from Imaginary Data

Yu-Xiong Wang¹, Ross Girshick¹, Martial Hebert², Bharath Hariharan³•Institutions (3)

Facebook¹, Carnegie Mellon University², Cornell University³

18 Jun 2018

TL;DR: This work builds on recent progress in meta-learning by combining a meta-learner with a "hallucinator" that produces additional training examples, and optimizing both models jointly, yielding state-of-the-art performance on the challenging ImageNet low-shot classification benchmark.

Abstract: Humans can quickly learn new visual concepts, perhaps because they can easily visualize or imagine what novel objects look like from different views. Incorporating this ability to hallucinate novel instances of new concepts might help machine vision systems perform better low-shot learning, i.e., learning concepts from few examples. We present a novel approach to low-shot learning that uses this idea. Our approach builds on recent progress in meta-learning ("learning to learn") by combining a meta-learner with a "hallucinator" that produces additional training examples, and optimizing both models jointly. Our hallucinator can be incorporated into a variety of meta-learners and provides significant gains: up to a 6 point boost in classification accuracy when only a single training example is available, yielding state-of-the-art performance on the challenging ImageNet low-shot classification benchmark.

639 citations

Posted Content•

Rethinking ImageNet Pre-training

Kaiming He¹, Ross Girshick, Piotr Dollár•Institutions (1)

Facebook¹

21 Nov 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: Experiments show that ImageNet pre-training speeds up convergence early in training but does not necessarily provide regularization or improve final target task accuracy; the authors expect these findings to encourage a rethink of the current de facto paradigm of 'pre-training and fine-tuning' in computer vision.

Abstract: We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization. The results are no worse than their ImageNet pre-training counterparts even when using the hyper-parameters of the baseline system (Mask R-CNN) that were optimized for fine-tuning pre-trained models, with the sole exception of increasing the number of training iterations so the randomly initialized models may converge. Training from random initialization is surprisingly robust; our results hold even when: (i) using only 10% of the training data, (ii) for deeper and wider models, and (iii) for multiple tasks and metrics. Experiments show that ImageNet pre-training speeds up convergence early in training, but does not necessarily provide regularization or improve final target task accuracy. To push the envelope we demonstrate 50.9 AP on COCO object detection without using any external data, a result on par with the top COCO 2017 competition results that used ImageNet pre-training. These observations challenge the conventional wisdom of ImageNet pre-training for dependent tasks and we expect these discoveries will encourage people to rethink the current de facto paradigm of 'pre-training and fine-tuning' in computer vision.

597 citations

Posted Content•

Exploring the Limits of Weakly Supervised Pretraining

Dhruv Mahajan¹, Ross Girshick¹, Vignesh Ramanathan¹, Kaiming He¹, Manohar Paluri¹, Yixuan Li¹, Ashwin Bharambe¹, Laurens van der Maaten¹•Institutions (1)

Facebook¹

02 May 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images and shows improvements on several image classification and object detection tasks, and reports the highest ImageNet-1k single-crop, top-1 accuracy to date.

Abstract: State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards "small". Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are obvious: such datasets are difficult to collect and annotate. In this paper, we present a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. Our experiments demonstrate that training for large-scale hashtag prediction leads to excellent results. We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform extensive experiments that provide novel empirical data on the relationship between large-scale pretraining and transfer learning performance.

490 citations

Proceedings Article•DOI•

Detecting and Recognizing Human-Object Interactions

Georgia Gkioxari¹, Ross Girshick¹, Piotr Dollár¹, Kaiming He¹•Institutions (1)

Facebook¹

18 Jun 2018

TL;DR: In this paper, a human-centric approach is proposed to detect human, verb, and object triplets in challenging everyday photos, where the appearance of a person is used as a cue for localizing the objects they are interacting with.

Abstract: To understand the visual world, a machine must not only recognize individual object instances but also how they interact. Humans are often at the center of such interactions and detecting human-object interactions is an important practical and scientific problem. In this paper, we address the task of detecting (human, verb, object) triplets in challenging everyday photos. We propose a novel model that is driven by a human-centric approach. Our hypothesis is that the appearance of a person - their pose, clothing, action - is a powerful cue for localizing the objects they are interacting with. To exploit this cue, our model learns to predict an action-specific density over target object locations based on the appearance of a detected person. Our model also jointly learns to detect people and objects, and by fusing these predictions it efficiently infers interaction triplets in a clean, jointly trained end-to-end system we call InteractNet. We validate our approach on the recently introduced Verbs in COCO (V-COCO) and HICO-DET datasets, where we show quantitatively compelling results.

388 citations

Proceedings Article•DOI•

Data Distillation: Towards Omni-Supervised Learning

Ilija Radosavovic¹, Piotr Dollár¹, Ross Girshick¹, Georgia Gkioxari¹, Kaiming He¹•Institutions (1)

Facebook¹

18 Jun 2018

TL;DR: The authors argue that visual recognition models have recently become accurate enough to apply classic ideas about self-training to challenging real-world data, and they propose data distillation, a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations.

Abstract: We investigate omni-supervised learning, a special regime of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data. Omni-supervised learning is lower-bounded by performance on existing labeled datasets, offering the potential to surpass state-of-the-art fully supervised methods. To exploit the omni-supervised setting, we propose data distillation, a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations. We argue that visual recognition models have recently become accurate enough that it is now possible to apply classic ideas about self-training to challenging real-world data. Our experimental results show that in the cases of human keypoint detection and general object detection, state-of-the-art models trained with data distillation surpass the performance of using labeled data from the COCO dataset alone.
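
The data distillation step described in the abstract above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the function name `distill_labels`, the `model` interface, the use of plain averaging as the ensembling rule, and the score `threshold` for keeping annotations are all hypothetical choices made for the example.

```python
def distill_labels(model, image, transforms, threshold):
    """Ensemble one model's predictions over transformed copies of an
    unlabeled input and keep confident results as new annotations.

    model:      callable mapping an input to a list of per-position scores
                (hypothetical interface, for illustration only).
    transforms: list of (apply, invert) callables; invert maps predictions
                back into the frame of the original input.
    threshold:  minimum averaged score to keep (assumed selection rule).
    """
    # Run the single model on each transformed copy and undo the transform.
    aligned = [invert(model(apply(image))) for apply, invert in transforms]
    # Ensemble by averaging the aligned predictions position by position.
    averaged = [sum(scores) / len(aligned) for scores in zip(*aligned)]
    # Keep only confident positions as automatically generated annotations.
    return [(k, s) for k, s in enumerate(averaged) if s >= threshold]


# Toy usage: a 1-D "image", an identity transform, and a horizontal flip.
identity = (lambda x: list(x), lambda p: list(p))
flip = (lambda x: x[::-1], lambda p: p[::-1])
toy_model = lambda x: list(x)  # stand-in model that echoes its input as scores
labels = distill_labels(toy_model, [0.9, 0.2, 0.7], [identity, flip], 0.5)
```

The generated `labels` would then be added to the labeled data when retraining, which is the self-training idea the abstract refers to.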

319 citations

Proceedings Article•DOI•

Learning to Segment Every Thing

Ronghang Hu¹, Piotr Dollár², Kaiming He², Trevor Darrell¹, Ross Girshick²•Institutions (2)

University of California, Berkeley¹, Facebook²

18 Jun 2018

TL;DR: A new partially supervised training paradigm is proposed, together with a novel weight transfer function, that enables training instance segmentation models on a large set of categories, all of which have box annotations but only a small fraction of which have mask annotations.

Abstract: Most methods for object instance segmentation require all training examples to be labeled with segmentation masks. This requirement makes it expensive to annotate new categories and has restricted instance segmentation models to ~100 well-annotated classes. The goal of this paper is to propose a new partially supervised training paradigm, together with a novel weight transfer function, that enables training instance segmentation models on a large set of categories all of which have box annotations, but only a small fraction of which have mask annotations. These contributions allow us to train Mask R-CNN to detect and segment 3000 visual concepts using box annotations from the Visual Genome dataset and mask annotations from the 80 classes in the COCO dataset. We evaluate our approach in a controlled study on the COCO dataset. This work is a first step towards instance segmentation models that have broad comprehension of the visual world.

256 citations

Posted Content•

Long-Term Feature Banks for Detailed Video Understanding

Chao-Yuan Wu¹, Christoph Feichtenhofer², Haoqi Fan², Kaiming He², Philipp Krähenbühl¹, Ross Girshick•Institutions (2)

University of Texas at Austin¹, Facebook²

12 Dec 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes a long-term feature bank—supportive information extracted over the entire span of a video—to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds.

Abstract: To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank (supportive information extracted over the entire span of a video) to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades.

247 citations

Posted Content•

Panoptic Segmentation

Alexander Kirillov¹, Kaiming He¹, Ross Girshick, Carsten Rother², Piotr Dollár•Institutions (2)

Facebook¹, Heidelberg University²

03 Jan 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: Panoptic segmentation unifies the typically distinct tasks of semantic segmentation and instance segmentation; the authors also propose a novel panoptic quality (PQ) metric that captures performance for all classes (stuff and things) in an interpretable and unified manner.

Abstract: We propose and study a task we name panoptic segmentation (PS). Panoptic segmentation unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance). The proposed task requires generating a coherent scene segmentation that is rich and complete, an important step toward real-world vision systems. While early work in computer vision addressed related image/scene parsing tasks, these are not currently popular, possibly due to lack of appropriate metrics or associated recognition challenges. To address this, we propose a novel panoptic quality (PQ) metric that captures performance for all classes (stuff and things) in an interpretable and unified manner. Using the proposed metric, we perform a rigorous study of both human and machine performance for PS on three existing datasets, revealing interesting insights about the task. The aim of our work is to revive the interest of the community in a more unified view of image segmentation.
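
The abstract above names the PQ metric without stating its formula. As a rough sketch for a single class, PQ is the sum of IoUs over matched (true positive) segment pairs divided by |TP| + ½|FP| + ½|FN|, where a predicted and a ground-truth segment count as matched when their IoU exceeds 0.5. The function name `panoptic_quality` and the assumption that matching has already been done are illustrative choices, not part of the listing.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Compute PQ for one class from pre-matched segments.

    matched_ious: IoU of each matched (predicted, ground-truth) segment pair,
                  i.e. the true positives; assumed to be matched already.
    num_fp:       number of unmatched predicted segments (false positives).
    num_fn:       number of unmatched ground-truth segments (false negatives).
    """
    num_tp = len(matched_ious)
    denominator = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if denominator == 0:
        # No segments of this class in prediction or ground truth; returning
        # 0.0 is a convention chosen for this sketch.
        return 0.0
    return sum(matched_ious) / denominator
```

For example, two matches with IoUs 0.8 and 0.6 plus one false positive and one false negative give 1.4 / 3. An overall score would average the per-class values across both stuff and thing classes.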

241 citations

Posted Content•

Low-Shot Learning from Imaginary Data

Yu-Xiong Wang¹, Ross Girshick¹, Martial Hebert², Bharath Hariharan³•Institutions (3)

Facebook¹, Carnegie Mellon University², Cornell University³

16 Jan 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a meta-learner is combined with a "hallucinator" that produces additional training examples, and both models are optimized jointly for low-shot learning, achieving state-of-the-art performance on the ImageNet low-shot classification benchmark.

Proceedings Article•DOI•

Learning by Asking Questions

Ishan Misra¹, Ross Girshick¹, Rob Fergus², Martial Hebert¹, Abhinav Gupta¹, Laurens van der Maaten¹•Institutions (2)

Carnegie Mellon University¹, Facebook²

18 Jun 2018

TL;DR: Learning-by-asking (LBA) is an interactive learning framework for the development and testing of intelligent visual systems that more closely mimics natural learning and has the potential to be more data-efficient than the traditional VQA setting.

Abstract: We introduce an interactive learning framework for the development and testing of intelligent visual systems, called learning-by-asking (LBA). We explore LBA in the context of the Visual Question Answering (VQA) task. LBA differs from standard VQA training in that most questions are not observed during training time, and the learner must ask questions it wants answers to. Thus, LBA more closely mimics natural learning and has the potential to be more data-efficient than the traditional VQA setting. We present a model that performs LBA on the CLEVR dataset, and show that it automatically discovers an easy-to-hard curriculum when learning interactively from an oracle. Our LBA-generated data consistently matches or outperforms the CLEVR train data and is more sample efficient. We also show that our model asks questions that generalize to state-of-the-art VQA models and to novel test time distributions.

Patent•

Machine-Learning Models Based on Non-local Neural Networks

Kaiming He¹, Ross Girshick¹, Xiaolong Wang•Institutions (1)

Facebook¹

15 Nov 2018

TL;DR: In this article, the authors propose a method for training a baseline machine learning model based on a neural network comprising a plurality of stages, where each stage comprises a number of neural blocks and each non-local operation is based on pairwise and unary functions.

Abstract: In one embodiment, a method includes training a baseline machine-learning model based on a neural network comprising a plurality of stages, wherein each stage comprises a plurality of neural blocks, accessing a plurality of training samples comprising a plurality of content objects, respectively, determining one or more non-local operations, wherein each non-local operation is based on one or more pairwise functions and one or more unary functions, generating one or more non-local blocks based on the plurality of training samples and the one or more non-local operations, determining a stage from the plurality of stages of the neural network, and training a non-local machine-learning model by inserting each of the one or more non-local blocks in between at least two of the plurality of neural blocks in the determined stage of the neural network.
