scispace - formally typeset
Search or ask a question
Author

Zeming Li

Bio: Zeming Li is an academic researcher from Tsinghua University. The author has contributed to research in topics: Object detection & Computer science. The author has an hindex of 16, co-authored 44 publications receiving 1638 citations.

Papers published on a yearly basis

Papers
More filters
Proceedings ArticleDOI
01 Oct 2019
TL;DR: Object365 can serve as a better feature learning dataset for localization-sensitive tasks like object detection and semantic segmentation and better generalization ability of Object365 has been verified on CityPersons, VOC segmentation, and ADE tasks.
Abstract: In this paper, we introduce a new large-scale object detection dataset, Objects365, which has 365 object categories over 600K training images. More than 10 million, high-quality bounding boxes are manually labeled through a three-step, carefully designed annotation pipeline. It is the largest object detection dataset (with full annotation) so far and establishes a more challenging benchmark for the community. Objects365 can serve as a better feature learning dataset for localization-sensitive tasks like object detection and semantic segmentation. The Objects365 pre-trained models significantly outperform ImageNet pre-trained models with 5.6 points gain (42 vs 36.4) based on the standard setting of 90K iterations on COCO benchmark. Even compared with much long training time like 540K iterations, our Objects365 pretrained model with 90K iterations still have 2.7 points gain (42 vs 39.3). Meanwhile, the finetuning time can be greatly reduced (up to 10 times) when reaching the same accuracy. Better generalization ability of Object365 has also been verified on CityPersons, VOC segmentation, and ADE tasks. The dataset as well as the pretrained-models have been released at www.objects365.org.

331 citations

Proceedings ArticleDOI
01 Jun 2018
TL;DR: In this paper, the authors proposed a large mini-batch object detector (MegDet) to enable the training with a large minibatch size up to 256, so that they can effectively utilize at most 128 GPUs to significantly shorten the training time.
Abstract: The development of object detection in the era of deep learning, from R-CNN [11], Fast/Faster R-CNN [10, 31] to recent Mask R-CNN [14] and RetinaNet [24], mainly come from novel network, new framework, or loss design. However, mini-batch size, a key factor for the training of deep neural networks, has not been well studied for object detection. In this paper, we propose a Large Mini-Batch Object Detector (MegDet) to enable the training with a large minibatch size up to 256, so that we can effectively utilize at most 128 GPUs to significantly shorten the training time. Technically, we suggest a warmup learning rate policy and Cross-GPU Batch Normalization, which together allow us to successfully train a large mini-batch detector in much less time (e.g., from 33 hours to 4 hours), and achieve even better accuracy. The MegDet is the backbone of our submission (mmAP 52.5%) to COCO 2017 Challenge, where we won the 1st place of Detection task.

316 citations

Posted Content
TL;DR: The authors' ResNet-101 based light-head R-CNN outperforms state-of-art object detectors on COCO while keeping time efficiency and significantly outperforming the single-stage, fast detectors like YOLO and SSD on both speed and accuracy.
Abstract: In this paper, we first investigate why typical two-stage methods are not as fast as single-stage, fast detectors like YOLO and SSD. We find that Faster R-CNN and R-FCN perform an intensive computation after or before RoI warping. Faster R-CNN involves two fully connected layers for RoI recognition, while R-FCN produces a large score maps. Thus, the speed of these networks is slow due to the heavy-head design in the architecture. Even if we significantly reduce the base model, the computation cost cannot be largely decreased accordingly. We propose a new two-stage detector, Light-Head R-CNN, to address the shortcoming in current two-stage approaches. In our design, we make the head of network as light as possible, by using a thin feature map and a cheap R-CNN subnet (pooling and single fully-connected layer). Our ResNet-101 based light-head R-CNN outperforms state-of-art object detectors on COCO while keeping time efficiency. More importantly, simply replacing the backbone with a tiny network (e.g, Xception), our Light-Head R-CNN gets 30.7 mmAP at 102 FPS on COCO, significantly outperforming the single-stage, fast detectors like YOLO and SSD on both speed and accuracy. Code will be made publicly available.

273 citations

Posted Content
TL;DR: This report presents the method which wins the nuScenes3D Detection Challenge, and proposes a balanced group-ing head to boost the performance for the categories withsimilar shapes, achieving state-of-the-art detection performance on thenuScenes dataset.
Abstract: This report presents our method which wins the nuScenes3D Detection Challenge [17] held in Workshop on Autonomous Driving(WAD, CVPR 2019). Generally, we utilize sparse 3D convolution to extract rich semantic features, which are then fed into a class-balanced multi-head network to perform 3D object detection. To handle the severe class imbalance problem inherent in the autonomous driving scenarios, we design a class-balanced sampling and augmentation strategy to generate a more balanced data distribution. Furthermore, we propose a balanced group-ing head to boost the performance for the categories withsimilar shapes. Based on the Challenge results, our methodoutperforms the PointPillars [14] baseline by a large mar-gin across all metrics, achieving state-of-the-art detection performance on the nuScenes dataset. Code will be released at CBGS.

270 citations

Proceedings ArticleDOI
Zheng Ge1, Songtao Liu, Zeming Li, Osamu Yoshie1, Jian Sun 
26 Mar 2021
TL;DR: In this paper, the authors propose to formulate the label assignment procedure as an optimal transport assignment (OT) problem, which can be solved via Sinkhorn-Knopp Iteration.
Abstract: Recent advances in label assignment in object detection mainly seek to independently define positive/negative training samples for each ground-truth (gt) object. In this paper, we innovatively revisit the label assignment from a global perspective and propose to formulate the assigning procedure as an Optimal Transport (OT) problem – a well-studied topic in Optimization Theory. Concretely, we define the unit transportation cost between each demander (anchor) and supplier (gt) pair as the weighted summation of their classification and regression losses. After formulation, finding the best assignment solution is converted to solve the optimal transport plan at minimal transportation costs, which can be solved via Sinkhorn-Knopp Iteration. On COCO, a single FCOS-ResNet-50 detector equipped with Optimal Transport Assignment (OTA) can reach 40.7% mAP under 1× scheduler, outperforming all other existing assigning methods. Extensive experiments conducted on COCO and CrowdHuman further validate the effectiveness of our proposed OTA, especially its superiority in crowd scenarios. The code is available at https://github.com/Megvii-BaseDetection/OTA.

242 citations


Cited by
More filters
Posted Content
TL;DR: This work uses new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, C mBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100.
Abstract: There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Source code is at this https URL

5,709 citations

Posted Content
Kaiming He1, Haoqi Fan1, Yuxin Wu1, Saining Xie1, Ross Girshick1 
TL;DR: This article proposed Momentum Contrast (MoCo) for unsupervised visual representation learning, which enables building a large and consistent dictionary on-the-fly that facilitates contrastive learning.
Abstract: We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.

4,272 citations

Proceedings ArticleDOI
Kaiming He1, Haoqi Fan1, Yuxin Wu1, Saining Xie1, Ross Girshick1 
14 Jun 2020
TL;DR: This article proposed Momentum Contrast (MoCo) for unsupervised visual representation learning, which enables building a large and consistent dictionary on-the-fly that facilitates contrastive learning.
Abstract: We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.

4,128 citations

Proceedings ArticleDOI
Mingxing Tan1, Ruoming Pang1, Quoc V. Le1
14 Jun 2020
TL;DR: EfficientDetD7 as discussed by the authors proposes a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion, and a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time.
Abstract: Model efficiency has become increasingly important in computer vision. In this paper, we systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion; Second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time. Based on these optimizations and EfficientNet backbones, we have developed a new family of object detectors, called EfficientDet, which consistently achieve much better efficiency than prior art across a wide spectrum of resource constraints. In particular, with single-model and single-scale, our EfficientDetD7 achieves state-of-the-art 52.2 AP on COCO test-dev with 52M parameters and 325B FLOPs, being 4x – 9x smaller and using 13x – 42x fewer FLOPs than previous detector.

3,423 citations

Book ChapterDOI
08 Sep 2018
TL;DR: ShuffleNet V2 as discussed by the authors proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs, based on a series of controlled experiments, and derives several practical guidelines for efficient network design.
Abstract: Currently, the neural network architecture design is mostly guided by the indirect metric of computation complexity, i.e., FLOPs. However, the direct metric, e.g., speed, also depends on the other factors such as memory access cost and platform characterics. Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, this work derives several practical guidelines for efficient network design. Accordingly, a new architecture is presented, called ShuffleNet V2. Comprehensive ablation experiments verify that our model is the state-of-the-art in terms of speed and accuracy tradeoff.

3,393 citations