Showing papers by "Kaiming He" published in 2018

Proceedings Article•DOI•

Non-local Neural Networks

Xiaolong Wang¹, Ross Girshick¹, Abhinav Gupta², Kaiming He¹•Institutions (2)

18 Jun 2018

TL;DR: The non-local operation computes the response at a position as a weighted sum of the features at all positions, which lets it capture long-range dependencies.

Abstract: Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method [4] in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete with or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code will be made available.
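
The weighted sum described in this abstract can be illustrated with a minimal pure-Python sketch. The abstract does not say how the weights are computed, so the softmax over dot-product similarities used here is an assumption, and the function name `non_local` is hypothetical. Learned embeddings and the residual connection of a full block are omitted.

```python
import math

def non_local(features):
    """features: list of N positions, each a list of C floats."""
    outputs = []
    for x_i in features:
        # Similarity of position i to every position j.
        # The dot product is an assumed choice, not taken from the abstract.
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in features]
        # Softmax turns the scores into weights that sum to 1.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Response at i: weighted sum of the features at ALL positions.
        y_i = [
            sum(w * x_j[c] for w, x_j in zip(weights, features))
            for c in range(len(x_i))
        ]
        outputs.append(y_i)
    return outputs
```

Every output position depends on every input position, no matter how far apart they are, which is the contrast with a convolution's local neighborhood.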

8,059 citations

Posted Content•

Group Normalization

Yuxin Wu, Kaiming He

22 Mar 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: Group Normalization can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks.

Abstract: Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems --- BN's error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN's usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN's computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code in modern libraries.
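
The grouping step described in this abstract can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the authors' code: the function name `group_norm` and the `eps` value are hypothetical, the input is a single example stored as a list of channels, and the learned per-channel scale and shift are left out.

```python
import math

def group_norm(sample, num_groups, eps=1e-5):
    """sample: C channels for ONE example, each a flat list of spatial values."""
    num_channels = len(sample)
    if num_channels % num_groups != 0:
        raise ValueError("num_groups must divide the number of channels")
    per_group = num_channels // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * per_group:(g + 1) * per_group]
        # Mean and variance are computed over all channels and positions
        # in this group, and never over other examples in the batch.
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in channel] for channel in group])
    return out
```

Because the function takes one example at a time, its result cannot change with batch size, which is the property the abstract emphasizes.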

1,924 citations

Book Chapter•DOI•

Exploring the Limits of Weakly Supervised Pretraining

Dhruv Mahajan¹, Ross Girshick¹, Vignesh Ramanathan¹, Kaiming He¹, Manohar Paluri¹, Yixuan Li¹, Ashwin Bharambe¹, Laurens van der Maaten¹•Institutions (1)

Facebook¹

08 Sep 2018

TL;DR: In this paper, the authors presented a transfer learning approach with large convolutional networks trained to predict hashtags on billions of social media images and reported the highest ImageNet-1k single-crop, top-1 accuracy to date.

Abstract: State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards “small”. Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are obvious: such datasets are difficult to collect and annotate. In this paper, we present a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. Our experiments demonstrate that training for large-scale hashtag prediction leads to excellent results. We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform extensive experiments that provide novel empirical data on the relationship between large-scale pretraining and transfer learning performance.

860 citations

Posted Content•

Rethinking ImageNet Pre-training

Kaiming He¹, Ross Girshick, Piotr Dollár•Institutions (1)

Facebook¹

21 Nov 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: Experiments show that ImageNet pre-training speeds up convergence early in training but does not necessarily provide regularization or improve final target task accuracy. The authors expect these findings to encourage people to rethink the current de facto paradigm of 'pre-training and fine-tuning' in computer vision.

Abstract: We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization. The results are no worse than their ImageNet pre-training counterparts even when using the hyper-parameters of the baseline system (Mask R-CNN) that were optimized for fine-tuning pre-trained models, with the sole exception of increasing the number of training iterations so the randomly initialized models may converge. Training from random initialization is surprisingly robust; our results hold even when: (i) using only 10% of the training data, (ii) for deeper and wider models, and (iii) for multiple tasks and metrics. Experiments show that ImageNet pre-training speeds up convergence early in training, but does not necessarily provide regularization or improve final target task accuracy. To push the envelope we demonstrate 50.9 AP on COCO object detection without using any external data---a result on par with the top COCO 2017 competition results that used ImageNet pre-training. These observations challenge the conventional wisdom of ImageNet pre-training for dependent tasks and we expect these discoveries will encourage people to rethink the current de facto paradigm of 'pre-training and fine-tuning' in computer vision.

597 citations

Posted Content•

Exploring the Limits of Weakly Supervised Pretraining

Dhruv Mahajan¹, Ross Girshick¹, Vignesh Ramanathan¹, Kaiming He¹, Manohar Paluri¹, Yixuan Li¹, Ashwin Bharambe¹, Laurens van der Maaten¹•Institutions (1)

Facebook¹

02 May 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images and shows improvements on several image classification and object detection tasks, and reports the highest ImageNet-1k single-crop, top-1 accuracy to date.

Abstract: State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards "small". Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are obvious: such datasets are difficult to collect and annotate. In this paper, we present a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. Our experiments demonstrate that training for large-scale hashtag prediction leads to excellent results. We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform extensive experiments that provide novel empirical data on the relationship between large-scale pretraining and transfer learning performance.

490 citations

Proceedings Article•DOI•

Detecting and Recognizing Human-Object Interactions

Georgia Gkioxari¹, Ross Girshick¹, Piotr Dollár¹, Kaiming He¹•Institutions (1)

Facebook¹

18 Jun 2018

TL;DR: In this paper, a human-centric approach is proposed to detect human, verb, and object triplets in challenging everyday photos, where the appearance of a person is used as a cue for localizing the objects they are interacting with.

Abstract: To understand the visual world, a machine must not only recognize individual object instances but also how they interact. Humans are often at the center of such interactions and detecting human-object interactions is an important practical and scientific problem. In this paper, we address the task of detecting (human, verb, object) triplets in challenging everyday photos. We propose a novel model that is driven by a human-centric approach. Our hypothesis is that the appearance of a person - their pose, clothing, action - is a powerful cue for localizing the objects they are interacting with. To exploit this cue, our model learns to predict an action-specific density over target object locations based on the appearance of a detected person. Our model also jointly learns to detect people and objects, and by fusing these predictions it efficiently infers interaction triplets in a clean, jointly trained end-to-end system we call InteractNet. We validate our approach on the recently introduced Verbs in COCO (V-COCO) and HICO-DET datasets, where we show quantitatively compelling results.

388 citations

Proceedings Article•DOI•

Data Distillation: Towards Omni-Supervised Learning

Ilija Radosavovic¹, Piotr Dollár¹, Ross Girshick¹, Georgia Gkioxari¹, Kaiming He¹•Institutions (1)

Facebook¹

18 Jun 2018

TL;DR: The authors argue that visual recognition models have recently become accurate enough to apply classic ideas about self-training to challenging real-world data, and they propose data distillation, a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations.

Abstract: We investigate omni-supervised learning, a special regime of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data. Omni-supervised learning is lower-bounded by performance on existing labeled datasets, offering the potential to surpass state-of-the-art fully supervised methods. To exploit the omni-supervised setting, we propose data distillation, a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations. We argue that visual recognition models have recently become accurate enough that it is now possible to apply classic ideas about self-training to challenging real-world data. Our experimental results show that in the cases of human keypoint detection and general object detection, state-of-the-art models trained with data distillation surpass the performance of using labeled data from the COCO dataset alone.
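
The label-generation step described in this abstract can be sketched as a toy example. Everything concrete here is an assumption made for illustration: the "image" is a 1-D list of values, the two transformations are identity and flip, `toy_model` is a made-up scorer with a deliberate position bias, and the 0.5 threshold is arbitrary. None of it comes from the paper.

```python
def identity(xs):
    return list(xs)

def flip(xs):
    return list(reversed(xs))

# (transform, inverse) pairs; a flip undoes itself.
TRANSFORMS = [(identity, identity), (flip, flip)]

def toy_model(xs):
    """Hypothetical scorer that over-scores positions near the end."""
    last = max(1, len(xs) - 1)
    return [min(1.0, max(0.0, x + 0.2 * i / last)) for i, x in enumerate(xs)]

def distill_labels(model, unlabeled, threshold=0.5):
    """Ensemble one model's predictions over several transformed inputs."""
    aligned = []
    for forward, inverse in TRANSFORMS:
        scores = model(forward(unlabeled))
        aligned.append(inverse(scores))  # map scores back to the input frame
    mean_scores = [sum(col) / len(aligned) for col in zip(*aligned)]
    # Thresholded ensemble scores become the new training annotations.
    return [1 if s >= threshold else 0 for s in mean_scores]
```

In this toy setup a single pass of `toy_model` labels the last position of `[0.9, 0.1, 0.35]` as positive because of its bias, while the ensembled labels do not. That shows why averaging over transformations can give cleaner annotations than one prediction.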

319 citations

Posted Content•

Feature Denoising for Improving Adversarial Robustness

Cihang Xie¹, Yuxin Wu², Laurens van der Maaten², Alan L. Yuille¹, Kaiming He²•Institutions (2)

Johns Hopkins University¹, Facebook²

09 Dec 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: Feature denoising networks use non-local means or other filters to denoise the features of CNNs and, when combined with adversarial training, improve the state of the art in adversarial robustness under both white-box and black-box attacks.

Abstract: Adversarial attacks to image classification systems present challenges to convolutional networks and opportunities for understanding them. This study suggests that adversarial perturbations on images lead to noise in the features constructed by these networks. Motivated by this observation, we develop new network architectures that increase adversarial robustness by performing feature denoising. Specifically, our networks contain blocks that denoise the features using non-local means or other filters; the entire networks are trained end-to-end. When combined with adversarial training, our feature denoising networks substantially improve the state-of-the-art in adversarial robustness in both white-box and black-box attack settings. On ImageNet, under 10-iteration PGD white-box attacks where prior art has 27.9% accuracy, our method achieves 55.7%; even under extreme 2000-iteration PGD white-box attacks, our method secures 42.6% accuracy. Our method was ranked first in Competition on Adversarial Attacks and Defenses (CAAD) 2018 --- it achieved 50.6% classification accuracy on a secret, ImageNet-like test dataset against 48 unknown attackers, surpassing the runner-up approach by ~10%. Code is available at this https URL.

302 citations

Proceedings Article•DOI•

Learning to Segment Every Thing

Ronghang Hu¹, Piotr Dollár², Kaiming He², Trevor Darrell¹, Ross Girshick²•Institutions (2)

University of California, Berkeley¹, Facebook²

18 Jun 2018

TL;DR: A new partially supervised training paradigm is proposed, together with a novel weight transfer function, that enables training instance segmentation models on a large set of categories, all of which have box annotations but only a small fraction of which have mask annotations.

Abstract: Most methods for object instance segmentation require all training examples to be labeled with segmentation masks. This requirement makes it expensive to annotate new categories and has restricted instance segmentation models to ~100 well-annotated classes. The goal of this paper is to propose a new partially supervised training paradigm, together with a novel weight transfer function, that enables training instance segmentation models on a large set of categories all of which have box annotations, but only a small fraction of which have mask annotations. These contributions allow us to train Mask R-CNN to detect and segment 3000 visual concepts using box annotations from the Visual Genome dataset and mask annotations from the 80 classes in the COCO dataset. We evaluate our approach in a controlled study on the COCO dataset. This work is a first step towards instance segmentation models that have broad comprehension of the visual world.

256 citations

Posted Content•

Long-Term Feature Banks for Detailed Video Understanding

Chao-Yuan Wu¹, Christoph Feichtenhofer², Haoqi Fan², Kaiming He², Philipp Krähenbühl¹, Ross Girshick•Institutions (2)

University of Texas at Austin¹, Facebook²

12 Dec 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes a long-term feature bank—supportive information extracted over the entire span of a video—to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds.

Abstract: To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank---supportive information extracted over the entire span of a video---to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades.

247 citations

Posted Content•

Panoptic Segmentation

Alexander Kirillov¹, Kaiming He¹, Ross Girshick, Carsten Rother², Piotr Dollár•Institutions (2)

Facebook¹, Heidelberg University²

03 Jan 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: Panoptic segmentation unifies the typically distinct tasks of semantic segmentation and instance segmentation. The paper also proposes a novel panoptic quality (PQ) metric that captures performance for all classes (stuff and things) in an interpretable and unified manner.

Abstract: We propose and study a task we name panoptic segmentation (PS). Panoptic segmentation unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance). The proposed task requires generating a coherent scene segmentation that is rich and complete, an important step toward real-world vision systems. While early work in computer vision addressed related image/scene parsing tasks, these are not currently popular, possibly due to lack of appropriate metrics or associated recognition challenges. To address this, we propose a novel panoptic quality (PQ) metric that captures performance for all classes (stuff and things) in an interpretable and unified manner. Using the proposed metric, we perform a rigorous study of both human and machine performance for PS on three existing datasets, revealing interesting insights about the task. The aim of our work is to revive the interest of the community in a more unified view of image segmentation.

Posted Content•

SlowFast Networks for Video Recognition

Christoph Feichtenhofer¹, Haoqi Fan¹, Jitendra Malik², Kaiming He¹•Institutions (2)

Facebook¹, University of California, Berkeley²

10 Dec 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a SlowFast network is proposed for action classification and detection in video, which involves a Slow pathway, operating at low frame rate, to capture spatial semantics, and a Fast pathway to capture motion at fine temporal resolution.

Abstract: We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at: this https URL
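
The two frame rates described in this abstract can be illustrated by how each pathway might sample frames from one clip. This is only a sketch of the input sampling, not of the network: the function name `sample_pathways` and the default values `slow_stride=16` and `speed_ratio=8` are assumptions chosen for the example.

```python
def sample_pathways(frames, slow_stride=16, speed_ratio=8):
    """Split one clip into a sparse Slow input and a dense Fast input.

    slow_stride and speed_ratio are illustrative values, not from the abstract.
    """
    if slow_stride % speed_ratio != 0:
        raise ValueError("speed_ratio must divide slow_stride")
    fast_stride = slow_stride // speed_ratio
    slow = frames[::slow_stride]  # low frame rate: spatial semantics
    fast = frames[::fast_stride]  # high frame rate: fine temporal detail
    return slow, fast
```

For a 64-frame clip, these defaults give the Slow pathway 4 frames and the Fast pathway 32, so the Fast pathway sees `speed_ratio` times as many frames.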

Posted Content•

GLoMo: Unsupervisedly Learned Relational Graphs as Transferable Representations

Zhilin Yang, Junbo Jake Zhao, Bhuwan Dhingra, Kaiming He, William W. Cohen, Ruslan Salakhutdinov, Yann LeCun

14 Jun 2018-arXiv: Learning

TL;DR: This work explores the possibility of learning generic latent relational graphs that capture dependencies between pairs of data units from large-scale unlabeled data and transferring the graphs to downstream tasks, and shows that the learned graphs are generic enough to be transferred to different embeddings on which the graphs have not been trained.

Abstract: Modern deep transfer learning approaches have mainly focused on learning generic feature vectors from one task that are transferable to other tasks, such as word embeddings in language and pretrained convolutional features in vision. However, these approaches usually transfer unary features and largely ignore more structured graphical representations. This work explores the possibility of learning generic latent relational graphs that capture dependencies between pairs of data units (e.g., words or pixels) from large-scale unlabeled data and transferring the graphs to downstream tasks. Our proposed transfer learning framework improves performance on various tasks including question answering, natural language inference, sentiment analysis, and image classification. We also show that the learned graphs are generic enough to be transferred to different embeddings on which the graphs have not been trained (including GloVe embeddings, ELMo embeddings, and task-specific RNN hidden units), or embedding-free units such as image pixels.

Proceedings Article•

GLoMo: Unsupervised Learning of Transferable Relational Graphs

Zhilin Yang¹, Junbo Jake Zhao²,³, Bhuwan Dhingra¹, Kaiming He², William W. Cohen¹, Ruslan Salakhutdinov¹, Yann LeCun²•Institutions (3)

Carnegie Mellon University¹, Facebook², New York University³

01 Jan 2018

Patent•

Machine-Learning Models Based on Non-local Neural Networks

Kaiming He¹, Ross Girshick¹, Xiaolong Wang•Institutions (1)

Facebook¹

15 Nov 2018

TL;DR: The patent describes a method that trains a baseline machine-learning model on a multi-stage neural network, generates non-local blocks from non-local operations based on pairwise and unary functions, and trains a non-local model by inserting those blocks between neural blocks in a chosen stage.

Abstract: In one embodiment, a method includes training a baseline machine-learning model based on a neural network comprising a plurality of stages, wherein each stage comprises a plurality of neural blocks, accessing a plurality of training samples comprising a plurality of content objects, respectively, determining one or more non-local operations, wherein each non-local operation is based on one or more pairwise functions and one or more unary functions, generating one or more non-local blocks based on the plurality of training samples and the one or more non-local operations, determining a stage from the plurality of stages of the neural network, and training a non-local machine-learning model by inserting each of the one or more non-local blocks in between at least two of the plurality of neural blocks in the determined stage of the neural network.
