scispace - formally typeset
Search or ask a question
Author

Tete Xiao

Other affiliations: Peking University
Bio: Tete Xiao is an academic researcher from University of California, Berkeley. The author has contributed to research in topics: Feature (computer vision) & Minimum bounding box. The author has an hindex of 18, co-authored 31 publications receiving 3302 citations. Previous affiliations of Tete Xiao include Peking University.

Papers
More filters
Book ChapterDOI
08 Sep 2018
TL;DR: A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations and it is shown that it is able to effectively segment a wide range of concepts from images.
Abstract: Humans recognize the visual world at multiple levels: we effortlessly categorize scenes and detect objects inside, while also identifying the textures and surfaces of the objects along with their different compositional parts. In this paper, we study a new task called Unified Perceptual Parsing, which requires the machine vision systems to recognize as many visual concepts as possible from a given image. A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations. We benchmark our framework on Unified Perceptual Parsing and show that it is able to effectively segment a wide range of concepts from images. The trained networks are further applied to discover visual knowledge in natural scenes (Models are available at https://github.com/CSAILVision/unifiedparsing).

977 citations

Journal ArticleDOI
TL;DR: The ADE20K dataset as discussed by the authors contains 25k images of complex everyday scenes containing a variety of objects in their natural spatial context, on average there are 19.5 instances and 10.5 object classes per image.
Abstract: Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite efforts of the community in data collection, there are still few image datasets covering a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present a densely annotated dataset ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. Totally there are 25k images of the complex everyday scenes containing a variety of objects in their natural spatial context. On average there are 19.5 instances and 10.5 object classes per image. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both of the benchmarks and re-implement state-of-the-art models for open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for the semantic segmentation performance. We show that the networks trained on ADE20K are able to segment a wide variety of scenes and objects.

961 citations

Book ChapterDOI
08 Sep 2018
TL;DR: IoU-Net as discussed by the authors learns to predict the IoU between each detected bounding box and the matched ground-truth, which improves the NMS procedure by preserving accurately localized bounding boxes.
Abstract: Modern CNN-based object detectors rely on bounding box regression and non-maximum suppression to localize objects. While the probabilities for class labels naturally reflect classification confidence, localization confidence is absent. This makes properly localized bounding boxes degenerate during iterative regression or even suppressed during NMS. In the paper we propose IoU-Net learning to predict the IoU between each detected bounding box and the matched ground-truth. The network acquires this confidence of localization, which improves the NMS procedure by preserving accurately localized bounding boxes. Furthermore, an optimization-based bounding box refinement method is proposed, where the predicted IoU is formulated as the objective. Extensive experiments on the MS-COCO dataset show the effectiveness of IoU-Net, as well as its compatibility with and adaptivity to several state-of-the-art object detectors.

529 citations

Posted Content
TL;DR: IoU-Net learning to predict the IoU between each detected bounding box and the matched ground-truth is proposed, which improves the NMS procedure by preserving accurately localized bounding boxes.
Abstract: Modern CNN-based object detectors rely on bounding box regression and non-maximum suppression to localize objects. While the probabilities for class labels naturally reflect classification confidence, localization confidence is absent. This makes properly localized bounding boxes degenerate during iterative regression or even suppressed during NMS. In the paper we propose IoU-Net learning to predict the IoU between each detected bounding box and the matched ground-truth. The network acquires this confidence of localization, which improves the NMS procedure by preserving accurately localized bounding boxes. Furthermore, an optimization-based bounding box refinement method is proposed, where the predicted IoU is formulated as the objective. Extensive experiments on the MS-COCO dataset show the effectiveness of IoU-Net, as well as its compatibility with and adaptivity to several state-of-the-art object detectors.

523 citations

Posted Content
TL;DR: This work presents a densely annotated dataset ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts, and shows that the networks trained on this dataset are able to segment a wide variety of scenes and objects.
Abstract: Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with dense and detailed annotations for scene parsing. In this paper, we introduce and analyze the ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. A generic network design called Cascade Segmentation Module is then proposed to enable the segmentation networks to parse a scene into stuff, objects, and object parts in a cascade. We evaluate the proposed module integrated within two existing semantic segmentation networks, yielding significant improvements for scene parsing. We further show that the scene parsing networks trained on ADE20K can be applied to a wide variety of scenes and objects.

399 citations


Cited by
More filters
Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper exploits the capability of global context information by different-region-based context aggregation through the pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet) to produce good quality results on the scene parsing task.
Abstract: Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in ImageNet scene parsing challenge 2016, PASCAL VOC 2012 benchmark and Cityscapes benchmark. A single PSPNet yields the new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes.

10,189 citations

Posted Content
Kaiming He1, Haoqi Fan1, Yuxin Wu1, Saining Xie1, Ross Girshick1 
TL;DR: This article proposed Momentum Contrast (MoCo) for unsupervised visual representation learning, which enables building a large and consistent dictionary on-the-fly that facilitates contrastive learning.
Abstract: We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.

4,272 citations

Proceedings ArticleDOI
Kaiming He1, Haoqi Fan1, Yuxin Wu1, Saining Xie1, Ross Girshick1 
14 Jun 2020
TL;DR: This article proposed Momentum Contrast (MoCo) for unsupervised visual representation learning, which enables building a large and consistent dictionary on-the-fly that facilitates contrastive learning.
Abstract: We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.

4,128 citations

Posted Content
Ze Liu1, Yutong Lin1, Yue Cao1, Han Hu1, Yixuan Wei1, Zheng Zhang1, Stephen Lin1, Baining Guo1 
TL;DR: Wang et al. as mentioned in this paper proposed a new vision Transformer called Swin Transformer, which is computed with shifted windows to address the differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.
Abstract: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The code and models will be made publicly available at~\url{this https URL}.

3,518 citations