SciSpace (formerly Typeset)
Author

Kaiming He

Bio: Kaiming He is an academic researcher from Facebook. The author has contributed to research in topics including object detection and image segmentation. He has an h-index of 89 and has co-authored 135 publications receiving 272,091 citations. Previous affiliations of Kaiming He include The Chinese University of Hong Kong and Microsoft.


Papers
Posted Content
TL;DR: This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision, built on two core designs: an asymmetric encoder-decoder architecture whose encoder operates only on the visible subset of patches (without mask tokens), and a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
Abstract: This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
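
A minimal sketch of the random-masking step the abstract describes, in PyTorch; the function name and the per-sample shuffling details are illustrative, not the authors' released code:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patch embeddings; drop the rest.

    patches: (batch, num_patches, dim) tensor of embedded image patches.
    Returns the visible patches and the indices that were kept.
    """
    b, n, d = patches.shape
    num_keep = int(n * (1 - mask_ratio))
    # Per-sample random permutation: patches with the lowest noise are kept.
    noise = torch.rand(b, n, device=patches.device)
    ids_keep = torch.argsort(noise, dim=1)[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, ids_keep
```

The encoder would run only on the returned visible patches; mask tokens enter later, at the lightweight decoder.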

17 citations

Patent
Jifeng Dai1, Kaiming He1, Jian Sun1
14 Oct 2015
TL;DR: In this paper, a system receives a training image and uses it to generate candidate masks; the candidate masks are ranked, and a set of the ranked masks is selected for further processing in training for semantic segmentation.
Abstract: Disclosed herein are technologies directed to training a neural network to perform semantic segmentation. A system receives a training image, and using the training image, candidate masks are generated. The candidate masks are ranked, and a set of the ranked candidate masks is selected for further processing. One of the set of the ranked candidate masks is selected to train the neural network. That mask is also used as an input to train the neural network in a further training evolution. In some examples, the mask is selected randomly to reduce the likelihood of ending up in poor local optima that result in poor training inputs.
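
A toy sketch of the rank-then-randomly-select step the abstract describes; the names (candidate_masks, scores, top_k) are hypothetical, since the patent text does not fix these details:

```python
import random

def pick_training_mask(candidate_masks, scores, top_k=10):
    """Rank candidate masks by score, keep the top-k, then pick one at random."""
    # Sort by score only (via the key), so mask arrays are never compared directly.
    ranked = sorted(zip(scores, candidate_masks), key=lambda p: p[0], reverse=True)
    shortlist = ranked[:top_k]
    # Choosing randomly among good candidates, rather than always the single best,
    # reduces the chance of converging to poor local optima.
    _, chosen = random.choice(shortlist)
    return chosen
```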

15 citations

Patent
15 Nov 2018
TL;DR: In this article, the authors propose a method for training a baseline machine-learning model based on a neural network comprising a plurality of stages, each comprising a number of neural blocks; non-local blocks, each based on pairwise and unary functions, are inserted between the neural blocks of a selected stage.
Abstract: In one embodiment, a method includes: training a baseline machine-learning model based on a neural network comprising a plurality of stages, wherein each stage comprises a plurality of neural blocks; accessing a plurality of training samples comprising a plurality of content objects, respectively; determining one or more non-local operations, wherein each non-local operation is based on one or more pairwise functions and one or more unary functions; generating one or more non-local blocks based on the plurality of training samples and the one or more non-local operations; determining a stage from the plurality of stages of the neural network; and training a non-local machine-learning model by inserting each of the one or more non-local blocks in between at least two of the plurality of neural blocks in the determined stage of the neural network.
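
As a concrete illustration, here is a sketch of one common instantiation of such a non-local block: a dot-product pairwise function with a softmax, 1x1 convolutions as the embedding and unary functions, and a residual connection so the block can be inserted between existing neural blocks. The channel sizes are assumptions, not the patent's exact formulation:

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Non-local block: pairwise affinity between all positions (theta, phi)
    and a unary transform (g), added back to the input as a residual."""

    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.theta = nn.Conv2d(channels, reduced, 1)  # embedding for the pairwise function
        self.phi = nn.Conv2d(channels, reduced, 1)    # embedding for the pairwise function
        self.g = nn.Conv2d(channels, reduced, 1)      # unary function
        self.out = nn.Conv2d(reduced, channels, 1)    # restores the channel count

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (b, hw, r)
        k = self.phi(x).flatten(2)                    # (b, r, hw)
        v = self.g(x).flatten(2).transpose(1, 2)      # (b, hw, r)
        attn = torch.softmax(q @ k, dim=-1)           # pairwise affinities over all positions
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                        # residual: safe to insert mid-network
```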

14 citations

Patent
08 Aug 2014
TL;DR: In this article, an apparatus for providing super-resolution of a low-resolution image is described, which comprises a patch extracting and representing device with a first set of filters configured to extract patches from the low-resolution image and represent each of the extracted patches as a high-dimensional vector.
Abstract: Disclosed is an apparatus for providing super-resolution of a low-resolution image. The apparatus may comprise: a patch extracting and representing device comprising a first set of filters configured to extract patches from the low-resolution image and represent each of the extracted patches as a high-dimensional vector; a mapping device comprising a second set of filters configured to nonlinearly map each of the high-dimensional vectors onto a high-resolution patch-wise representation; and an aggregating device configured to aggregate the high-resolution patch-wise representations to generate a high-resolution image for the low-resolution image. A method for providing super-resolution of a low-resolution image is also disclosed.
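
A sketch of how the three devices could each be realized as a set of convolutional filters; the 9-1-5 kernel sizes and 64/32 channel widths follow a common super-resolution CNN configuration and are assumptions here, not the patent's claims:

```python
import torch.nn as nn

class SuperResolutionNet(nn.Module):
    """Patch extraction/representation, non-linear mapping, and aggregation,
    each realized as a set of convolutional filters."""

    def __init__(self):
        super().__init__()
        self.extract = nn.Conv2d(1, 64, kernel_size=9, padding=4)    # patches -> high-dim vectors
        self.mapping = nn.Conv2d(64, 32, kernel_size=1)              # non-linear mapping
        self.aggregate = nn.Conv2d(32, 1, kernel_size=5, padding=2)  # aggregate to HR image
        self.relu = nn.ReLU(inplace=True)

    def forward(self, lr_upscaled):
        # Input: the low-resolution image upscaled (e.g., bicubically) to the target size.
        x = self.relu(self.extract(lr_upscaled))
        x = self.relu(self.mapping(x))
        return self.aggregate(x)
```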

10 citations

Patent
14 Jul 2020
TL;DR: In this article, an instance segmentation mask associated with a region of interest is generated by processing a regional feature map using a second neural network; once trained, the second network generates instance segmentation masks for object instances depicted in images.
Abstract: In one embodiment, a method includes a computing system accessing a training image. The system may generate a feature map for the training image using a first neural network. The system may identify a region of interest in the feature map and generate a regional feature map for the region of interest based on sampling locations defined by a sampling region. The sampling region and the region of interest may correspond to the same region in the feature map. The system may generate an instance segmentation mask associated with the region of interest by processing the regional feature map using a second neural network. The second neural network may be trained using the instance segmentation mask. Once trained, the second neural network is configured to generate instance segmentation masks for object instances depicted in images.
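
A sketch of the described pipeline, using torchvision's roi_align to stand in for sampling the regional feature map over the region of interest; the tensor sizes, class count, and mask-head architecture are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class MaskHead(nn.Module):
    """Second network: turns a fixed-size regional feature map into
    per-instance mask logits."""

    def __init__(self, channels=256, num_classes=80):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.upsample = nn.ConvTranspose2d(channels, channels, 2, stride=2)
        self.predict = nn.Conv2d(channels, num_classes, 1)

    def forward(self, regional_features):
        x = F.relu(self.conv(regional_features))
        x = F.relu(self.upsample(x))
        return self.predict(x)  # (num_rois, num_classes, 28, 28) mask logits

# The first network's output stands in here as a random feature map.
feature_map = torch.randn(1, 256, 50, 50)
# One region of interest: (batch_index, x1, y1, x2, y2) in feature-map coordinates.
rois = torch.tensor([[0.0, 10.0, 10.0, 30.0, 30.0]])
regional = roi_align(feature_map, rois, output_size=(14, 14), spatial_scale=1.0)
mask_logits = MaskHead()(regional)
```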

9 citations


Cited by
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks substantially deeper than those used previously; an ensemble of these residual networks won 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
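
A minimal sketch of the core idea: a residual block whose stacked layers learn a residual function F(x) and whose output is F(x) + x via an identity shortcut. This is the simple equal-channel case; projection shortcuts and exact layer counts vary across the paper's architectures:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: the stacked layers learn F(x) and the block
    outputs F(x) + x through an identity shortcut."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(out + x)  # the identity shortcut makes the layers learn a residual
```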

123,388 citations

Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
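
A sketch of the paper's central design choice, depth built from stacks of very small 3x3 filters; the helper name and channel widths here are illustrative:

```python
import torch.nn as nn

def vgg_stage(in_ch, out_ch, num_convs):
    """One VGG-style stage: a stack of 3x3 convolutions, then 2x2 max-pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

# Depth comes from stacking small filters, not from enlarging them.
stage = vgg_stage(64, 128, num_convs=2)
```

Two stacked 3x3 convolutions cover the receptive field of a single 5x5 convolution with fewer parameters and an extra non-linearity, which is why pushing the depth to 16-19 weight layers pays off.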

55,235 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Book ChapterDOI
05 Oct 2015
TL;DR: Ronneberger et al. propose a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently; the network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks.
Abstract: There is broad consensus that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
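
A minimal one-level sketch of the U-shaped architecture the abstract describes: a contracting path for context, an expanding path for localization, and a skip connection between matching levels. Channel widths and the two-class head are assumptions:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One-level U-Net: contracting path for context, expanding path for
    localization, and a skip connection between the matching levels."""

    def __init__(self, in_ch=1, base=16):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.bottom = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.fuse = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(base, 2, 1)  # e.g., cell vs. background

    def forward(self, x):
        skip = self.down(x)                      # contracting path
        x = self.bottom(self.pool(skip))         # coarse, high-context features
        x = self.up(x)                           # expanding path
        x = self.fuse(torch.cat([x, skip], 1))   # skip connection restores detail
        return self.head(x)
```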

49,590 citations

Posted Content
TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

44,703 citations