Showing papers by "Ross Girshick" published in 2021


Proceedings ArticleDOI
Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, Kaiming He
29 Apr 2021
TL;DR: This paper proposes a simple objective that encourages temporally-persistent features in the same video; despite its simplicity, it works surprisingly well across (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures.
Abstract: We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart. Code will be made available at https://github.com/facebookresearch/SlowFast.

175 citations
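For intuition, the temporal-persistency objective can be illustrated with a minimal contrastive sketch. This is not the authors' implementation (the paper unifies four recent image-based frameworks); it simply treats two clips from the same video as a positive pair and clips from other videos in the batch as negatives, under an InfoNCE-style loss.

```python
# Minimal sketch (illustrative, not the paper's code) of the temporal-persistency idea:
# two clips sampled from the same video form a positive pair; other videos in the
# batch act as negatives under an InfoNCE-style contrastive loss.
import torch
import torch.nn.functional as F

def temporal_persistency_loss(emb_clip_a, emb_clip_b, temperature=0.1):
    """emb_clip_a, emb_clip_b: (N, D) embeddings of two clips from the same N videos."""
    a = F.normalize(emb_clip_a, dim=1)
    b = F.normalize(emb_clip_b, dim=1)
    logits = a @ b.t() / temperature                       # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Usage with a placeholder encoder producing 128-d clip embeddings:
# loss = temporal_persistency_loss(encoder(clips_at_t0), encoder(clips_at_t60))
```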


Proceedings ArticleDOI
01 Jun 2021
TL;DR: Boundary IoU (intersection-over-union) is a new segmentation evaluation measure focused on boundary quality; it is significantly more sensitive than the standard Mask IoU to boundary errors on large objects and does not over-penalize errors on smaller objects.
Abstract: We present Boundary IoU (Intersection-over-Union), a new segmentation evaluation measure focused on boundary quality. We perform an extensive analysis across different error types and object sizes and show that Boundary IoU is significantly more sensitive than the standard Mask IoU measure to boundary errors for large objects and does not over-penalize errors on smaller objects. The new quality measure displays several desirable characteristics like symmetry w.r.t. prediction/ground truth pairs and balanced responsiveness across scales, which makes it more suitable for segmentation evaluation than other boundary-focused measures like Trimap IoU and F-measure. Based on Boundary IoU, we update the standard evaluation protocols for instance and panoptic segmentation tasks by proposing the Boundary AP (Average Precision) and Boundary PQ (Panoptic Quality) metrics, respectively. Our experiments show that the new evaluation metrics track boundary quality improvements that are generally overlooked by current Mask IoU-based evaluation metrics. We hope that the adoption of the new boundary-sensitive evaluation metrics will lead to rapid progress in segmentation methods that improve boundary quality.

127 citations
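A minimal sketch of the Boundary IoU computation, assuming the boundary region of a mask is approximated as the mask minus its erosion by a few pixels (the paper sets the boundary width relative to the image diagonal); this is an illustration of the definition, not the official implementation.

```python
# Illustrative sketch of Boundary IoU: IoU between the "boundary regions" of the
# ground-truth and predicted masks, where a boundary region is the mask minus its
# erosion by a small pixel distance d.
import numpy as np
from scipy.ndimage import binary_erosion

def boundary_region(mask, dilation_px=2):
    """mask: (H, W) boolean array. Pixels of the mask within ~dilation_px of its contour."""
    eroded = binary_erosion(mask, iterations=dilation_px)
    return mask & ~eroded

def boundary_iou(gt, pred, dilation_px=2):
    gt_b = boundary_region(gt.astype(bool), dilation_px)
    pred_b = boundary_region(pred.astype(bool), dilation_px)
    inter = np.logical_and(gt_b, pred_b).sum()
    union = np.logical_or(gt_b, pred_b).sum()
    return inter / union if union > 0 else 1.0
```

Because only pixels near the contours contribute, interior pixels of a large object no longer dominate the score, which is what makes the measure sensitive to boundary errors.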


Proceedings ArticleDOI
11 Mar 2021
TL;DR: This paper proposes a simple fast compound scaling strategy for convolutional neural networks that primarily scales model width, while scaling depth and resolution to a lesser extent.
Abstract: In this work we analyze strategies for convolutional neural network scaling; that is, the process of scaling a base convolutional network to endow it with greater computational complexity and consequently representational power. Example scaling strategies may include increasing model width, depth, resolution, etc. While various scaling strategies exist, their tradeoffs are not fully understood. Existing analysis typically focuses on the interplay of accuracy and flops (floating point operations). Yet, as we demonstrate, various scaling strategies affect model parameters, activations, and consequently actual runtime quite differently. In our experiments we show the surprising result that numerous scaling strategies yield networks with similar accuracy but with widely varying properties. This leads us to propose a simple fast compound scaling strategy that encourages primarily scaling model width, while scaling depth and resolution to a lesser extent. Unlike currently popular scaling strategies, which result in about $O(s)$ increase in model activations w.r.t. scaling flops by a factor of $s$, the proposed fast compound scaling results in close to an $O(\sqrt{s})$ increase in activations, while achieving excellent accuracy. Fewer activations lead to speedups on modern memory-bandwidth-limited hardware (e.g., GPUs). More generally, we hope this work provides a framework for analyzing scaling strategies under various computational constraints.

57 citations
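The $O(\sqrt{s})$ claim can be seen from a back-of-the-envelope model (illustrative only, not the paper's exact scaling rule): if flops grow roughly as depth x width^2 x resolution^2 and activations as depth x width x resolution^2, then scaling only the width by sqrt(s) multiplies flops by s but activations by only sqrt(s).

```python
# Back-of-the-envelope arithmetic (illustrative, not the paper's scaling rule) for
# why width-dominant scaling keeps activation growth near sqrt(s) of the flop growth.
import math

def complexity(depth, width, resolution):
    flops = depth * width**2 * resolution**2        # rough conv-net flop count
    activations = depth * width * resolution**2     # rough activation count
    return flops, activations

def width_only_scaling(depth, width, resolution, s):
    """Scale flops by a factor s by increasing width alone."""
    return depth, width * math.sqrt(s), resolution

base = (10, 64, 224)
scaled = width_only_scaling(*base, s=4.0)
f0, a0 = complexity(*base)
f1, a1 = complexity(*scaled)
print(f"flops x{f1 / f0:.1f}, activations x{a1 / a0:.1f}")   # flops x4.0, activations x2.0
```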


Proceedings ArticleDOI
17 Oct 2021
TL;DR: PyTorchVideo as discussed by the authors is an open-source deep learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing.
Abstract: We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools including multimodal data loading, transformations, and models that reproduce state-of-the-art performance. PyTorchVideo further supports hardware acceleration that enables real-time inference on mobile devices. The library is based on PyTorch and can be used by any training framework; for example, PyTorchLightning, PySlowFast, or Classy Vision. PyTorchVideo is available at https://pytorchvideo.org/.

34 citations
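A short usage sketch, assuming the torch.hub entry points documented at https://pytorchvideo.org/; the model name "slow_r50" and the expected clip layout are taken from the public model zoo and may differ between releases.

```python
# Hedged usage sketch: load a pretrained video classification model from the
# PyTorchVideo hub and run it on a dummy clip.
import torch

model = torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=True)
model.eval()

clip = torch.randn(1, 3, 8, 256, 256)   # dummy clip: (batch, channels, frames, height, width)
with torch.no_grad():
    logits = model(clip)                 # Kinetics-400 class scores
print(logits.shape)
```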


Proceedings Article
06 Dec 2021
TL;DR: Replacing the ViT patchify stem with a small number of stacked stride-two 3x3 convolutions markedly improves the optimization stability and peak accuracy of the original ViT model.
Abstract: Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are far easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p pxp convolution (p=16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks. To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3x3 convolutions. While the vast majority of computation in the two ViT designs is identical, we find that this small change in early visual processing results in markedly different training behavior in terms of the sensitivity to optimization settings as well as the final model accuracy. Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance (by ~1-2% top-1 accuracy on ImageNet-1k), while maintaining flops and runtime. The improvement can be observed across the wide spectrum of model complexities (from 1G to 36G flops) and dataset scales (from ImageNet-1k to ImageNet-21k). These findings lead us to recommend using a standard, lightweight convolutional stem for ViT models as a more robust architectural choice compared to the original ViT model design.

22 citations
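The two stems compared in the paper can be sketched as follows (the channel widths below are illustrative, not the paper's exact configuration); both reduce a 224x224 image by 16x before the transformer blocks.

```python
# Sketch of the original ViT patchify stem versus a convolutional stem built from
# stacked stride-two 3x3 convolutions (channel widths are illustrative).
import torch
from torch import nn

embed_dim = 384

# Original ViT "patchify" stem: a single stride-16, 16x16 convolution.
patchify_stem = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

# Convolutional stem: four stride-two 3x3 convolutions give the same 16x
# downsampling, followed by a 1x1 projection to the transformer width.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 48, 3, stride=2, padding=1), nn.BatchNorm2d(48), nn.ReLU(inplace=True),
    nn.Conv2d(48, 96, 3, stride=2, padding=1), nn.BatchNorm2d(96), nn.ReLU(inplace=True),
    nn.Conv2d(96, 192, 3, stride=2, padding=1), nn.BatchNorm2d(192), nn.ReLU(inplace=True),
    nn.Conv2d(192, 384, 3, stride=2, padding=1), nn.BatchNorm2d(384), nn.ReLU(inplace=True),
    nn.Conv2d(384, embed_dim, 1),
)

x = torch.randn(1, 3, 224, 224)
print(patchify_stem(x).shape, conv_stem(x).shape)   # both: (1, 384, 14, 14)
```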


Posted Content
TL;DR: Masked autoencoders (MAE) as mentioned in this paper are scalable self-supervised learners for computer vision, which is based on two core designs: an asymmetric encoder-decoder architecture with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
Abstract: This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

17 citations
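The high-ratio random masking at the core of MAE can be sketched as follows (a simplified illustration in the spirit of the paper, not the released code): patch indices are shuffled per sample, only the visible 25% are passed to the encoder, and the restore order is kept so the decoder can later reassemble the full set of tokens.

```python
# Simplified sketch of MAE-style random masking: keep a random 25% of patches as
# "visible" encoder inputs and remember the ids needed to restore the original order.
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (N, L, D) patch embeddings. Returns visible patches, mask, restore ids."""
    N, L, D = patches.shape
    len_keep = int(L * (1 - mask_ratio))

    noise = torch.rand(N, L)                          # per-patch random scores
    ids_shuffle = torch.argsort(noise, dim=1)         # ascending: lowest scores are kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)   # inverse permutation

    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(N, L)                           # 1 = masked, 0 = visible
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore

visible, mask, ids_restore = random_masking(torch.randn(2, 196, 768))
print(visible.shape)   # (2, 49, 768): only 25% of the 196 patches reach the encoder
```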


Posted Content
TL;DR: Replacing the ViT patchify stem with a small number of stacked stride-two 3x3 convolutions markedly improves the optimization stability and peak accuracy of the original ViT model.
Abstract: Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are far easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p pxp convolution (p=16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks. To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3x3 convolutions. While the vast majority of computation in the two ViT designs is identical, we find that this small change in early visual processing results in markedly different training behavior in terms of the sensitivity to optimization settings as well as the final model accuracy. Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance (by ~1-2% top-1 accuracy on ImageNet-1k), while maintaining flops and runtime. The improvement can be observed across the wide spectrum of model complexities (from 1G to 36G flops) and dataset scales (from ImageNet-1k to ImageNet-21k). These findings lead us to recommend using a standard, lightweight convolutional stem for ViT models as a more robust architectural choice compared to the original ViT model design.

10 citations


Posted Content
TL;DR: In this article, the default implementation of AP is shown to produce a gameable metric, where a simple, nonsensical re-ranking policy can improve AP by a large margin, and a pooled version of AP (AP-pool) is proposed to reward properly calibrated detectors by directly comparing cross-category rankings.
Abstract: By design, average precision (AP) for object detection aims to treat all classes independently: AP is computed independently per category and averaged. On the one hand, this is desirable as it treats all classes, rare to frequent, equally. On the other hand, it ignores cross-category confidence calibration, a key property in real-world use cases. Unfortunately, we find that on imbalanced, large-vocabulary datasets, the default implementation of AP is neither category independent, nor does it directly reward properly calibrated detectors. In fact, we show that the default implementation produces a gameable metric, where a simple, nonsensical re-ranking policy can improve AP by a large margin. To address these limitations, we introduce two complementary metrics. First, we present a simple fix to the default AP implementation, ensuring that it is truly independent across categories as originally intended. We benchmark recent advances in large-vocabulary detection and find that many reported gains do not translate to improvements under our new per-class independent evaluation, suggesting recent improvements may arise from difficult to interpret changes to cross-category rankings. Given the importance of reliably benchmarking cross-category rankings, we consider a pooled version of AP (AP-pool) that rewards properly calibrated detectors by directly comparing cross-category rankings. Finally, we revisit classical approaches for calibration and find that explicitly calibrating detectors improves state-of-the-art on AP-pool by 1.7 points.

8 citations
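The pooled-AP idea can be sketched as follows (illustrative only, not the paper's exact evaluation protocol; detections are assumed to be already matched to ground truth): all detections from all categories are merged into a single confidence-ranked list and one average precision is computed, so cross-category calibration directly affects the score.

```python
# Illustrative pooled-AP sketch: rank every detection from every category by a
# shared confidence scale and compute a single average precision over that list.
# An over-confident category then pushes other categories' correct detections down
# the ranking, which is exactly what per-category AP cannot detect.
import numpy as np

def pooled_average_precision(scores, is_tp, num_gt):
    """scores, is_tp: flat arrays over all detections pooled across categories."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / np.arange(1, len(tp) + 1)
    # Summation form of AP: average of the precision at the rank of each true positive.
    return float(np.sum(precision * tp) / num_gt)

scores = [0.9, 0.8, 0.75, 0.6]   # detections pooled from all categories
is_tp  = [1,   0,   1,    1]
print(pooled_average_precision(scores, is_tp, num_gt=4))   # ~0.60
```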


Posted Content
TL;DR: This paper proposes a simple fast compound scaling strategy that primarily scales model width, while scaling depth and resolution to a lesser extent; it yields close to an $O(\sqrt{s})$ increase in activations when flops are scaled by a factor of $s$, while achieving excellent accuracy.
Abstract: In this work we analyze strategies for convolutional neural network scaling; that is, the process of scaling a base convolutional network to endow it with greater computational complexity and consequently representational power. Example scaling strategies may include increasing model width, depth, resolution, etc. While various scaling strategies exist, their tradeoffs are not fully understood. Existing analysis typically focuses on the interplay of accuracy and flops (floating point operations). Yet, as we demonstrate, various scaling strategies affect model parameters, activations, and consequently actual runtime quite differently. In our experiments we show the surprising result that numerous scaling strategies yield networks with similar accuracy but with widely varying properties. This leads us to propose a simple fast compound scaling strategy that encourages primarily scaling model width, while scaling depth and resolution to a lesser extent. Unlike currently popular scaling strategies, which result in about $O(s)$ increase in model activation w.r.t. scaling flops by a factor of $s$, the proposed fast compound scaling results in close to $O(\sqrt{s})$ increase in activations, while achieving excellent accuracy. This leads to comparable speedups on modern memory-limited hardware (e.g., GPU, TPU). More generally, we hope this work provides a framework for analyzing and selecting scaling strategies under various computational constraints.

7 citations


Posted Content
TL;DR: Boundary IoU (intersection-over-union) is a new segmentation evaluation measure focused on boundary quality; it is significantly more sensitive than the standard Mask IoU to boundary errors on large objects and does not over-penalize errors on smaller objects.
Abstract: We present Boundary IoU (Intersection-over-Union), a new segmentation evaluation measure focused on boundary quality. We perform an extensive analysis across different error types and object sizes and show that Boundary IoU is significantly more sensitive than the standard Mask IoU measure to boundary errors for large objects and does not over-penalize errors on smaller objects. The new quality measure displays several desirable characteristics like symmetry w.r.t. prediction/ground truth pairs and balanced responsiveness across scales, which makes it more suitable for segmentation evaluation than other boundary-focused measures like Trimap IoU and F-measure. Based on Boundary IoU, we update the standard evaluation protocols for instance and panoptic segmentation tasks by proposing the Boundary AP (Average Precision) and Boundary PQ (Panoptic Quality) metrics, respectively. Our experiments show that the new evaluation metrics track boundary quality improvements that are generally overlooked by current Mask IoU-based evaluation metrics. We hope that the adoption of the new boundary-sensitive evaluation metrics will lead to rapid progress in segmentation methods that improve boundary quality.

5 citations



Posted Content
Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, Kaiming He
TL;DR: The authors present a large-scale study on unsupervised spatiotemporal representation learning from videos and propose a simple objective that generalizes four recent image-based frameworks to space-time.
Abstract: We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart. Code is made available at https://github.com/facebookresearch/SlowFast.

Posted Content
TL;DR: PyTorchVideo as discussed by the authors is an open-source deep learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing.
Abstract: We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools including multimodal data loading, transformations, and models that reproduce state-of-the-art performance. PyTorchVideo further supports hardware acceleration that enables real-time inference on mobile devices. The library is based on PyTorch and can be used by any training framework; for example, PyTorchLightning, PySlowFast, or Classy Vision. PyTorchVideo is available at https://pytorchvideo.org/