Top 15 papers published by Ross Girshick from Facebook in 2020

Journal Article•DOI•

Focal Loss for Dense Object Detection

Tsung-Yi Lin¹, Priya Goyal¹, Ross Girshick¹, Kaiming He¹, Piotr Dollár¹•Institutions (1)

01 Feb 2020-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: Focal loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training, which improves the accuracy of one-stage detectors.

Abstract: The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron .
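
The down-weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not state the formula, so the form used here, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), and the default values gamma=2 and alpha=0.25 are assumptions taken from how the focal loss is commonly presented. The function names are hypothetical and this is not the Detectron implementation.

```python
import math


def binary_cross_entropy(p, y):
    """Standard cross entropy for one prediction p = P(class 1) and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Cross entropy scaled by (1 - p_t)^gamma so well-classified examples count less.

    gamma and alpha defaults are assumed values, not taken from the abstract.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy negative (background predicted with p = 0.01) versus a hard positive (p = 0.1).
easy_negative = focal_loss(0.01, 0)
hard_positive = focal_loss(0.1, 1)
print(easy_negative, hard_positive)
```

With gamma set to 0 the modulating factor disappears and the loss reduces to alpha-weighted cross entropy, which is a quick way to check an implementation.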

5,734 citations

Proceedings Article•DOI•

Momentum Contrast for Unsupervised Visual Representation Learning

Kaiming He¹, Haoqi Fan¹, Yuxin Wu¹, Saining Xie¹, Ross Girshick¹•Institutions (1)

Facebook¹

14 Jun 2020

TL;DR: This paper proposes Momentum Contrast (MoCo) for unsupervised visual representation learning, which builds a large and consistent dictionary on the fly to facilitate contrastive learning.

Abstract: We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
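
The two ingredients named in the abstract, a queue of keys and a moving-averaged encoder, can be sketched in a few lines. This is a toy illustration on plain Python lists, not the authors' code: the names `momentum_update` and `KeyQueue`, the momentum value 0.999, and the use of a bounded deque are all assumptions made for the example.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Move each key-encoder parameter slowly toward the query-encoder parameter.

    The update k <- m * k + (1 - m) * q is an exponential moving average;
    m = 0.999 is an assumed value.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class KeyQueue:
    """Fixed-size dictionary of keys: new batches push out the oldest entries."""

    def __init__(self, max_size):
        self.keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        # deque(maxlen=...) silently drops the oldest items when full.
        self.keys.extend(batch_keys)


queue = KeyQueue(max_size=4)
queue.enqueue(["k1", "k2", "k3"])
queue.enqueue(["k4", "k5"])  # "k1" is dropped
print(list(queue.keys))
```

Because the queue size is decoupled from the mini-batch size, the dictionary can be much larger than a single batch, while the slow encoder update keeps the stored keys roughly consistent with one another.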

4,128 citations

Posted Content•

Improved Baselines with Momentum Contrastive Learning

Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He

09 Mar 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: With simple modifications to MoCo, this note establishes stronger baselines that outperform SimCLR and do not require large training batches, and hopes this will make state-of-the-art unsupervised learning research more accessible.

Abstract: Contrastive unsupervised learning has recently shown encouraging progress, e.g., in Momentum Contrast (MoCo) and SimCLR. In this note, we verify the effectiveness of two of SimCLR's design improvements by implementing them in the MoCo framework. With simple modifications to MoCo---namely, using an MLP projection head and more data augmentation---we establish stronger baselines that outperform SimCLR and do not require large training batches. We hope this will make state-of-the-art unsupervised learning research more accessible. Code will be made public.

1,947 citations

Journal Article•DOI•

Mask R-CNN

Kaiming He¹, Georgia Gkioxari¹, Piotr Dollár¹, Ross Girshick¹•Institutions (1)

Facebook¹

01 Feb 2020-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition, which achieves state-of-the-art performance in instance segmentation.

Abstract: We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: https://github.com/facebookresearch/Detectron .

1,506 citations

Proceedings Article•DOI•

Designing Network Design Spaces

Ilija Radosavovic¹, Raj Prateek Kosaraju¹, Ross Girshick¹, Kaiming He¹, Piotr Dollár¹•Institutions (1)

Facebook¹

14 Jun 2020

TL;DR: The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes, and outperform the popular EfficientNet models while being up to 5x faster on GPUs.

Abstract: In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
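
The "quantized linear function" mentioned in the abstract can be made concrete with a small sketch. The abstract gives no formula, so the details below are assumptions: per-block widths start from a linear ramp u_j = w0 + wa * j, are snapped to the nearest power of a multiplier wm (relative to w0), and are then rounded to a multiple of 8. The function name and the parameter values in the example are hypothetical.

```python
import math


def quantized_linear_widths(depth, w0, wa, wm, multiple=8):
    """Return one width per block from a quantized linear ramp.

    depth: number of blocks; w0: initial width; wa: slope of the linear ramp;
    wm: width multiplier used for quantization. All are assumed parameter names.
    """
    widths = []
    for j in range(depth):
        u = w0 + wa * j                                   # linear width before quantization
        s = round(math.log(u / w0) / math.log(wm))        # nearest integer power of wm
        w = w0 * wm ** s                                  # quantized width
        widths.append(int(round(w / multiple)) * multiple)  # hardware-friendly rounding
    return widths


print(quantized_linear_widths(depth=4, w0=24, wa=24, wm=2))
```

Blocks that land on the same quantized width form a stage, so a handful of parameters describes a whole network rather than one width per layer.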

1,041 citations

Proceedings Article•DOI•

PointRend: Image Segmentation As Rendering

Alexander Kirillov¹, Yuxin Wu¹, Kaiming He¹, Ross Girshick¹•Institutions (1)

Facebook¹

14 Jun 2020

TL;DR: PointRend is a point-based rendering module that performs segmentation predictions at adaptively selected locations based on an iterative subdivision algorithm, producing crisp object boundaries in regions that are over-smoothed by previous methods.

Abstract: We present a new method for efficient high-quality image segmentation of objects and scenes. By analogizing classical computer graphics methods for efficient rendering with over- and undersampling challenges faced in pixel labeling tasks, we develop a unique perspective of image segmentation as a rendering problem. From this vantage, we present the PointRend (Point-based Rendering) neural network module: a module that performs point-based segmentation predictions at adaptively selected locations based on an iterative subdivision algorithm. PointRend can be flexibly applied to both instance and semantic segmentation tasks by building on top of existing state-of-the-art models. While many concrete implementations of the general idea are possible, we show that a simple design already achieves excellent results. Qualitatively, PointRend outputs crisp object boundaries in regions that are over-smoothed by previous methods. Quantitatively, PointRend yields significant gains on COCO and Cityscapes, for both instance and semantic segmentation. PointRend's efficiency enables output resolutions that are otherwise impractical in terms of memory or computation compared to existing approaches. Code has been made available at https://github.com/facebookresearch/detectron2/tree/master/projects/PointRend.
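
The idea of predicting only at "adaptively selected locations" can be sketched as picking the grid cells whose current prediction is least certain. The abstract does not say how locations are chosen, so the criterion below (foreground probability closest to 0.5) and the function name are assumptions for illustration, not the detectron2 implementation.

```python
def select_uncertain_points(prob_grid, num_points):
    """Return (row, col) coordinates of the num_points most uncertain cells.

    prob_grid is a 2D list of foreground probabilities from a coarse prediction.
    Uncertainty is measured as distance from 0.5 (an assumed criterion).
    """
    scored = [
        (abs(p - 0.5), (r, c))
        for r, row in enumerate(prob_grid)
        for c, p in enumerate(row)
    ]
    scored.sort()  # smallest distance from 0.5 first
    return [point for _, point in scored[:num_points]]


coarse = [
    [0.99, 0.55],
    [0.02, 0.48],
]
print(select_uncertain_points(coarse, num_points=2))
```

In an iterative scheme, the coarse grid would be upsampled, only the selected points re-predicted at the finer resolution, and the step repeated, so computation concentrates near object boundaries.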

393 citations

Posted Content•

Designing Network Design Spaces

Ilija Radosavovic¹, Raj Prateek Kosaraju¹, Ross Girshick¹, Kaiming He¹, Piotr Dollár¹•Institutions (1)

Facebook¹

30 Mar 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, the authors propose a new network design paradigm called RegNet, where instead of focusing on designing individual network instances, they design network design spaces that parametrize populations of networks.

Abstract: In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.

99 citations

Book Chapter•DOI•

Are Labels Necessary for Neural Architecture Search

Chenxi Liu¹, Piotr Dollár², Kaiming He², Ross Girshick², Alan L. Yuille¹, Saining Xie²•Institutions (2)

Johns Hopkins University¹, Facebook²

23 Aug 2020

TL;DR: This work reveals the potentially surprising finding that labels are not necessary, and that image statistics alone may be sufficient to identify good neural architectures.

Abstract: Existing neural network architectures in computer vision—whether designed by humans or by machines—were typically found using both images and their associated labels. In this paper, we ask the question: can we find high-quality neural architectures using only images, but no human-annotated labels? To answer this question, we first define a new setup called Unsupervised Neural Architecture Search (UnNAS). We then conduct two sets of experiments. In sample-based experiments, we train a large number (500) of diverse architectures with either supervised or unsupervised objectives, and find that the architecture rankings produced with and without labels are highly correlated. In search-based experiments, we run a well-established NAS algorithm (DARTS) using various unsupervised objectives, and report that the architectures searched without labels can be competitive to their counterparts searched with labels. Together, these results reveal the potentially surprising finding that labels are not necessary, and the image statistics alone may be sufficient to identify good neural architectures.

62 citations

Proceedings Article•DOI•

A Multigrid Method for Efficiently Training Video Models

Chao-Yuan Wu¹, Ross Girshick², Kaiming He², Christoph Feichtenhofer², Philipp Krähenbühl¹•Institutions (2)

University of Texas at Austin¹, Facebook²

14 Jun 2020

TL;DR: Inspired by multigrid methods in numerical optimization, this work proposes variable mini-batch shapes with different spatial-temporal resolutions, varied according to a schedule, to speed up the training of competitive deep video models.

Abstract: Training competitive deep video models is an order of magnitude slower than training their counterpart image models. Slow training causes long research cycles, which hinders progress in video understanding research. Following standard practice for training image models, video model training has used a fixed mini-batch shape: a specific number of clips, frames, and spatial size. However, what is the optimal shape? High resolution models perform well, but train slowly. Low resolution models train faster, but are less accurate. Inspired by multigrid methods in numerical optimization, we propose to use variable mini-batch shapes with different spatial-temporal resolutions that are varied according to a schedule. The different shapes arise from resampling the training data on multiple sampling grids. Training is accelerated by scaling up the mini-batch size and learning rate when shrinking the other dimensions. We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU). As an illustrative example, the proposed multigrid method trains a ResNet-50 SlowFast network 4.5x faster (wall-clock time, same hardware) while also improving accuracy (+0.8% absolute) on Kinetics-400 compared to baseline training. Code is available online.
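
The trade-off described in the abstract, a larger mini-batch and learning rate when the clip shape shrinks, can be sketched with simple arithmetic. The abstract does not give the exact rule, so this example assumes that the total number of sampled values per mini-batch is held roughly constant and that the learning rate scales linearly with batch size. The function name and the shapes are hypothetical.

```python
def rescale_minibatch(base_batch, base_lr, base_shape, new_shape):
    """Scale batch size and learning rate when the clip shape changes.

    Shapes are (frames, height, width). Assumes constant total samples per
    mini-batch and a linear learning-rate scaling rule (both assumptions).
    """
    base_t, base_h, base_w = base_shape
    t, h, w = new_shape
    factor = (base_t * base_h * base_w) / (t * h * w)
    batch = max(1, int(base_batch * factor))
    lr = base_lr * batch / base_batch
    return batch, lr


# Halving frames, height, and width shrinks each clip by 8x,
# so the batch (and learning rate) can grow by 8x.
print(rescale_minibatch(8, 0.1, base_shape=(16, 224, 224), new_shape=(8, 112, 112)))
```

A schedule would then cycle through several such shapes during training, returning to the full-resolution shape toward the end.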

59 citations

Journal Article•DOI•

Impact of data on generalization of AI for surgical intelligence applications

Omri Bar, Daniel Neimark, Maya Zohar, Gregory D. Hager¹, Ross Girshick, Gerald M. Fried², Tamir Wolf, Dotan Asselmann•Institutions (2)

Johns Hopkins University¹, McGill University²

17 Dec 2020-Scientific Reports

TL;DR: This work assesses surgical workflow recognition and reports a deep learning system that not only detects surgical phases with high accuracy but also generalizes to new settings and unseen medical centers.

Abstract: AI is becoming ubiquitous, revolutionizing many aspects of our lives. In surgery, it is still a promise. AI has the potential to improve surgeon performance and impact patient care, from post-operative debrief to real-time decision support. But, how much data is needed by an AI-based system to learn surgical context with high fidelity? To answer this question, we leveraged a large-scale, diverse, cholecystectomy video dataset. We assessed surgical workflow recognition and report a deep learning system, that not only detects surgical phases, but does so with high accuracy and is able to generalize to new settings and unseen medical centers. Our findings provide a solid foundation for translating AI applications from research to practice, ushering in a new era of surgical intelligence.

31 citations

Proceedings Article•DOI•

Large Scale Weakly and Semi-Supervised Learning for Low-Resource Video ASR

Kritika Singh¹, Vimal Manohar¹, Alex Xiao¹, Sergey Edunov¹, Ross Girshick¹, Vitaliy Liptchinsky¹, Christian Fuegen¹, Yatharth Saraf¹, Geoffrey Zweig¹, Abdelrahman Mohamed¹•Institutions (1)

Facebook¹

25 Oct 2020

TL;DR: This paper investigates distillation methods at the frame level and the sequence level for hybrid, encoder-only CTC-based, and encoder-decoder speech recognition systems on Dutch and Romanian, using 27,000 and 58,000 hours of unlabeled audio respectively.

Abstract: Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on the other. We investigate distillation methods at the frame level and the sequence level for hybrid, encoder-only CTC-based, and encoder-decoder speech recognition systems on Dutch and Romanian languages using 27,000 and 58,000 hours of unlabeled audio respectively. Although all approaches improved upon their respective baseline WERs by more than 8%, sequence-level distillation for encoder-decoder models provided the largest relative WER reduction of 20% compared to the strongest data-augmented supervised baseline.

Patent•

Method and system for using machine-learning for object instance segmentation

Kaiming He¹, Georgia Gkioxari, Piotr Dollár, Ross Girshick•Institutions (1)

Facebook¹

14 Jul 2020

TL;DR: This patent describes generating an instance segmentation mask for a region of interest by processing a regional feature map with a second neural network; once trained, the second network is configured to generate instance segmentation masks for object instances depicted in images.

Abstract: In one embodiment, a method includes a computing system accessing a training image. The system may generate a feature map for the training image using a first neural network. The system may identify a region of interest in the feature map and generate a regional feature map for the region of interest based on sampling locations defined by a sampling region. The sampling region and the region of interest may correspond to the same region in the feature map. The system may generate an instance segmentation mask associated with the region of interest by processing the regional feature map using a second neural network. The second neural network may be trained using the instance segmentation mask. Once trained, the second neural network is configured to generate instance segmentation masks for object instances depicted in images.

Posted Content•

Are Labels Necessary for Neural Architecture Search

Chenxi Liu¹, Piotr Dollár², Kaiming He², Ross Girshick², Alan L. Yuille¹, Saining Xie²•Institutions (2)

Johns Hopkins University¹, Facebook²

26 Mar 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, the authors define a new setup called Unsupervised Neural Architecture Search (UnNAS) and conduct two sets of experiments to find high-quality neural architectures using only images, but no human-annotated labels.

Abstract: Existing neural network architectures in computer vision -- whether designed by humans or by machines -- were typically found using both images and their associated labels. In this paper, we ask the question: can we find high-quality neural architectures using only images, but no human-annotated labels? To answer this question, we first define a new setup called Unsupervised Neural Architecture Search (UnNAS). We then conduct two sets of experiments. In sample-based experiments, we train a large number (500) of diverse architectures with either supervised or unsupervised objectives, and find that the architecture rankings produced with and without labels are highly correlated. In search-based experiments, we run a well-established NAS algorithm (DARTS) using various unsupervised objectives, and report that the architectures searched without labels can be competitive to their counterparts searched with labels. Together, these results reveal the potentially surprising finding that labels are not necessary, and the image statistics alone may be sufficient to identify good neural architectures.

Proceedings Article•DOI•

Training ASR Models By Generation of Contextual Information

Kritika Singh¹, Dmytro Okhonko¹, Jun Liu¹, Yongqiang Wang¹, Frank Zhang¹, Ross Girshick¹, Sergey Edunov¹, Fuchun Peng¹, Yatharth Saraf¹, Geoffrey Zweig¹, Abdelrahman Mohamed¹•Institutions (1)

Facebook¹

04 May 2020

TL;DR: This work uses loosely related contextual information as a surrogate for ground-truth labels to train an encoder-decoder transformer model, achieving an average 20.8% WER reduction over a 1000-hour supervised baseline and an average 13.4% WER reduction when using only the weakly supervised encoder for CTC fine-tuning.

Abstract: Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data. However, in many applications and locales, only moderate amounts of data are available, which has led to a surge in semi- and weakly-supervised learning research. In this paper, we conduct a large-scale study evaluating the effectiveness of weakly-supervised learning for speech recognition by using loosely related contextual information as a surrogate for ground-truth labels. For weakly supervised training, we use 50k hours of public English social media videos along with their respective titles and post text to train an encoder-decoder transformer model. Our best encoder-decoder models achieve an average of 20.8% WER reduction over a 1000 hours supervised baseline, and an average of 13.4% WER reduction when using only the weakly supervised encoder for CTC fine-tuning. Our results show that our setup for weak supervision improved both the encoder acoustic representations as well as the decoder language generation abilities.

Posted Content•

Large scale weakly and semi-supervised learning for low-resource video ASR

Kritika Singh¹, Vimal Manohar¹, Alex Xiao¹, Sergey Edunov¹, Ross Girshick¹, Vitaliy Liptchinsky¹, Christian Fuegen¹, Yatharth Saraf¹, Geoffrey Zweig¹, Abdelrahman Mohamed¹•Institutions (1)

Facebook¹

16 May 2020-arXiv: Audio and Speech Processing

TL;DR: A large scale systematic comparison between two self-labeling methods, and weakly-supervised pretraining using contextual metadata on the challenging task of transcribing social media videos in low-resource conditions is conducted.

Abstract: Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on the other. We investigate distillation methods at the frame level and the sequence level for hybrid, encoder-only CTC-based, and encoder-decoder speech recognition systems on Dutch and Romanian languages using 27,000 and 58,000 hours of unlabeled audio respectively. Although all approaches improved upon their respective baseline WERs by more than 8%, sequence-level distillation for encoder-decoder models provided the largest relative WER reduction of 20% compared to the strongest data-augmented supervised baseline.
