
Citation: Zhou, Bolei, et al. "Semantic Understanding of Scenes Through the ADE20K Dataset." International Journal of Computer Vision 127 (2019): 302–321. https://doi.org/10.1007/s11263-018-1140-0 © 2018 Author(s)
Publisher: Springer Nature
Version: Original manuscript
Citable link: https://hdl.handle.net/1721.1/125771
Terms of use: Creative Commons Attribution-NonCommercial-ShareAlike (http://creativecommons.org/licenses/by-nc-sa/4.0/)

Semantic Understanding of Scenes through the ADE20K Dataset
Bolei Zhou · Hang Zhao · Xavier Puig · Tete Xiao · Sanja Fidler · Adela Barriuso · Antonio Torralba
Abstract Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite the efforts of the community in data collection, there are still few image datasets covering a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present the densely annotated dataset ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. In total, there are 25k images of complex everyday scenes containing a variety of objects in their natural spatial context. On average there are 19.5 instances and 10.5 object classes per image. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both benchmarks and re-implement state-of-the-art models as open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for semantic segmentation performance. We show that networks trained on ADE20K are able to segment a wide variety of scenes and objects.¹
B. Zhou
Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong.
H. Zhao, X. Puig, A. Barriuso, A. Torralba
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA.
T. Xiao
School of Electronic Engineering and Computer Science, Peking University, China.
S. Fidler
Department of Computer Science, University of Toronto, Canada.
¹ Dataset is available at http://groups.csail.mit.edu/vision/datasets/ADE20K. Pretrained models and code are released at https://github.com/CSAILVision/semantic-segmentation-pytorch.
Keywords Scene understanding · Semantic segmentation · Instance segmentation · Image dataset · Deep neural networks
1 Introduction
Semantic understanding of visual scenes is one of the holy grails of computer vision. The emergence of large-scale image datasets like ImageNet [29], COCO [18] and Places [38], along with the rapid development of deep convolutional neural network (CNN) approaches, has brought great advancements to visual scene understanding. Nowadays, given a visual scene of a living room, a robot equipped with a trained CNN can accurately predict the scene category. However, to freely navigate in the scene and manipulate the objects inside, the robot has far more information to extract from the input image: it needs to recognize and localize not only objects like sofa, table, and TV, but also their parts, e.g., the seat of a chair or the handle of a cup, to allow proper manipulation, as well as to segment stuff like the floor, walls and ceiling for spatial navigation.
Recognizing and segmenting objects and stuff at the pixel level remains one of the key problems in scene understanding. Going beyond image-level recognition, pixel-level scene understanding requires a much denser annotation of scenes with a large set of objects. However, current datasets have a limited number of objects (e.g., COCO [18], Pascal [10]), and in many cases those objects are not the most common objects one encounters in the world (like frisbees or baseball bats), or the datasets only cover a limited set of scenes (e.g., Cityscapes [7]). Some notable exceptions are Pascal-Context [22] and the SUN database [34]. However, Pascal-Context still contains scenes primarily focused on 20 object classes, while SUN has noisy labels at the object level.

Fig. 1 Images in the ADE20K dataset are densely annotated in detail with objects and parts. The first row shows the sample images, the second row shows the annotation of objects, and the third row shows the annotation of object parts. The color scheme encodes both the object categories and the object instances: different object categories have large color differences, while different instances of the same object category have small color differences (e.g., different person instances in the first image have slightly different colors).
The motivation of this work is to collect a dataset that has densely annotated images (every pixel has a semantic label) with a large and unrestricted open vocabulary. The images in our dataset are manually segmented in great detail, covering a diverse set of scene, object and object part categories. The challenge in collecting such annotations is finding reliable annotators, as well as the fact that labeling is difficult if the class list is not defined in advance. On the other hand, open-vocabulary naming also suffers from naming inconsistencies across different annotators. In contrast, our dataset was annotated by a single expert annotator, providing extremely detailed and exhaustive image annotations. On average, our annotator labeled 29 annotation segments per image, compared to the 16 segments per image labeled by external annotators (like workers from Amazon Mechanical Turk). Furthermore, the data consistency and quality are much higher than those of external annotators. Fig. 1 shows examples from our dataset.
A preliminary version of this work was published in [39]. Compared to the previous conference paper, we include a more detailed description of the dataset, more baseline results on the scene parsing benchmark, the introduction of the new instance segmentation benchmark and its baseline results, as well as the effect of synchronized batch normalization and the joint training of objects and parts. We also include the contents of the Places Challenges we hosted at ECCV’16 and ICCV’17 and the analysis of the challenge results.
The sections of this work are organized as follows. In Sec. 2 we describe the construction of the ADE20K dataset and its statistics. In Sec. 3 we introduce the two pixel-wise scene understanding benchmarks we build upon ADE20K: scene parsing and instance segmentation. We train and evaluate several baseline networks on the benchmarks. We also re-implement and open-source several state-of-the-art scene parsing models and evaluate the effect of the batch normalization size. In Sec. 4 we introduce the Places Challenges at ECCV’16 and ICCV’17 based on the benchmarks of ADE20K, as well as the qualitative and quantitative analysis of the challenge results. In Sec. 5 we train networks jointly to segment objects and their parts. Sec. 6 explores the applications of the scene parsing networks to hierarchical semantic segmentation and automatic scene content removal. Sec. 7 concludes this work.
1.1 Related work
Many datasets have been collected for the purpose of semantic understanding of scenes. We review these datasets according to the level of detail of their annotations, then briefly go through previous work on semantic segmentation networks.
Object classification/detection datasets. Most of the large-scale datasets typically only contain labels at the image level or provide bounding boxes. Examples include ImageNet [29], Pascal [10], and KITTI [11]. ImageNet has the largest set of classes, but contains relatively simple scenes. Pascal and KITTI are more challenging and have more objects per image; however, their classes and scenes are more constrained.
Semantic segmentation datasets. Existing datasets with pixel-level labels typically provide annotations only for a subset of foreground objects (20 in PASCAL VOC [10] and 91 in Microsoft COCO [18]). Collecting dense annotations where all pixels are labeled is much more challenging. Such efforts include Pascal-Context [22], NYU Depth V2 [23], the SUN database [34], the SUN RGB-D dataset [31], the Cityscapes dataset [7], and OpenSurfaces [2, 3]. Recently, the COCO stuff dataset [4] provided additional stuff segmentation complementary to the 80 object categories in the COCO dataset, while the COCO attributes dataset [26] annotated attributes for some objects in the COCO dataset. Such progressive enhancement of a dataset with diverse annotations over the years has contributed greatly to the modern development of image datasets.
Datasets with objects, parts and attributes. Two datasets were released that go beyond the typical labeling setup by also providing pixel-level annotation for the object parts, i.e., the Pascal-Part dataset [6], or material classes, i.e., OpenSurfaces [2, 3]. We advance this effort by collecting very high-resolution imagery of a much wider selection of scenes, containing a large set of object classes per image. We annotated both stuff and object classes, for which we additionally annotated their parts, and parts of these parts. We believe that our dataset, ADE20K, is one of the most comprehensive datasets of its kind. We provide a comparison between datasets in Sec. 2.6.
Semantic segmentation models. With the success of convolutional neural networks (CNNs) for image classification [17], there is growing interest in semantic pixel-wise labeling using CNNs with dense output, such as the fully convolutional network [20], deconvolutional neural networks [25], the encoder-decoder SegNet [1], multi-task network cascades [9], and DilatedVGG [5, 36]. They are benchmarked on the Pascal dataset with impressive performance on segmenting the 20 object classes. Some of them [20, 1] are evaluated on Pascal-Context [22] or the SUN RGB-D dataset [31] to show their capability to segment more object classes in scenes. Joint stuff and object segmentation is explored in [8], which uses pre-computed superpixels and feature masking to represent stuff. A cascade of instance segmentation and categorization has been explored in [9]. A multiscale pyramid pooling module is proposed to improve scene parsing in [37]. A recent multi-task segmentation network, UPerNet, is proposed to segment visual concepts from different levels [35].
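The multiscale pyramid pooling idea referenced above [37] can be pictured as follows: the final feature map is average-pooled to several grid sizes, each pooled map is reduced with a 1x1 convolution and upsampled back to the input resolution, and the results are concatenated with the original features as global context. The PyTorch sketch below is only an illustration of that idea; the bin sizes and channel widths are assumptions, not the reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Illustrative pyramid pooling module (bin sizes are assumptions)."""
    def __init__(self, in_channels, bins=(1, 2, 3, 6)):
        super().__init__()
        reduced = in_channels // len(bins)
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                       # pool features to a b x b grid
                nn.Conv2d(in_channels, reduced, 1, bias=False),
                nn.BatchNorm2d(reduced),
                nn.ReLU(inplace=True))
            for b in bins)

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                                align_corners=False) for stage in self.stages]
        # concatenate the multi-scale context with the original feature map
        return torch.cat([x] + pooled, dim=1)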
2 ADE20K: Fully Annotated Image Dataset
In this section, we describe the construction of our ADE20K
dataset and analyze its statistics.
2.1 Image annotation
For our dataset, we are interested in having a diverse set of scenes with dense annotations of all the visual concepts present. A visual concept can be 1) a discrete object, which is a thing with a well-defined shape, e.g., a car or a person; 2) stuff, which covers amorphous background regions, e.g., grass or sky; or 3) an object part, which is a component of an existing object instance that has some functional meaning, such as a head or a leg. Images come from the LabelMe [30], SUN [34], and Places [38] datasets and were selected to cover the 900 scene categories defined in the SUN database. Images were annotated by a single expert worker using the LabelMe interface [30]. Fig. 2 shows a snapshot of the annotation interface and one fully segmented image. The worker provided three types of annotations: object segments with names, object parts, and attributes. All object instances are segmented independently so that the dataset can be used to train and evaluate detection or segmentation algorithms.
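For concreteness, one way to picture such an annotation is sketched below; this record layout is an assumption made for exposition only, not the ADE20K release format.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AnnotatedSegment:
    """Hypothetical in-memory view of one annotated segment."""
    name: str                                   # open-vocabulary label, e.g. "car" or "sky"
    kind: str                                   # "object", "stuff", or "part"
    polygon: List[Tuple[float, float]]          # (x, y) vertices in image coordinates
    attributes: List[str] = field(default_factory=list)            # e.g. ["occluded", "cropped"]
    parts: List["AnnotatedSegment"] = field(default_factory=list)  # parts attached to this instance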
Given that the objects appearing in the dataset are fully annotated, even in the regions where they are occluded, there are multiple areas where the polygons from different regions overlap. In order to convert the annotated polygons into a segmentation mask, we sort the objects in an image by depth layers. Background classes like ‘sky’ or ‘wall’ are set as the farthest layers. The depths of the remaining objects are set as follows: when a polygon is fully contained inside another polygon, the object from the inner polygon is given a closer depth layer. When objects only partially overlap, we look at the region of intersection between the two polygons, and set as the closest object the one whose polygon has more points in the region of intersection. Once objects have been sorted, the segmentation mask is constructed by iterating over the objects in decreasing depth, ensuring that object parts never occlude whole objects and no object is occluded by its parts.
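A minimal sketch of this depth-layer ordering is given below. It assumes each segment is a dict with "name", "polygon" (a list of (x, y) vertices) and "class_id" keys, and it assumes a particular set of background classes; it illustrates the rules stated above and is not the authors' annotation pipeline.

import numpy as np
from functools import cmp_to_key
from PIL import Image, ImageDraw

BACKGROUND = {"sky", "wall", "floor", "ceiling"}   # assumed "farthest" classes

def rasterize(polygon, height, width):
    """Fill a polygon given as [(x, y), ...] into a boolean mask."""
    canvas = Image.new("1", (width, height), 0)
    ImageDraw.Draw(canvas).polygon(polygon, fill=1)
    return np.asarray(canvas, dtype=bool)

def compare_depth(a, b):
    """Pairwise rule: negative means segment `a` lies behind segment `b`."""
    if a["name"] in BACKGROUND and b["name"] not in BACKGROUND:
        return -1
    if b["name"] in BACKGROUND and a["name"] not in BACKGROUND:
        return 1
    overlap = a["mask"] & b["mask"]
    if not overlap.any():
        return 0                                # disjoint regions: order does not matter
    if not (a["mask"] & ~b["mask"]).any():
        return 1                                # a fully inside b -> a gets a closer layer
    if not (b["mask"] & ~a["mask"]).any():
        return -1                               # b fully inside a -> b gets a closer layer
    def vertices_inside(seg):                   # partial overlap: count polygon vertices
        h, w = overlap.shape                    # falling in the intersection region
        return sum(bool(overlap[int(y), int(x)]) for x, y in seg["polygon"]
                   if 0 <= int(y) < h and 0 <= int(x) < w)
    return 1 if vertices_inside(a) > vertices_inside(b) else -1

def build_label_mask(segments, height, width):
    """Paint segments from farthest to closest into one integer label map."""
    for seg in segments:                        # note: adds a "mask" key to each segment
        seg["mask"] = rasterize(seg["polygon"], height, width)
    ordered = sorted(segments, key=cmp_to_key(compare_depth))
    label_map = np.zeros((height, width), dtype=np.int32)   # 0 = unlabeled
    for seg in ordered:                         # closer segments overwrite farther ones
        label_map[seg["mask"]] = seg["class_id"]
    return label_map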
Datasets such as COCO [18], Pascal [10] or Cityscapes [7] start by defining a set of object categories of interest. However, when labeling all the objects in a scene, working with a predefined list of objects is not possible, as new categories appear frequently (see Fig. 6.d). Here, the annotator created a dictionary of visual concepts to which new classes were added constantly, to ensure consistency in object naming.
Object parts are associated with object instances. Note that parts can have parts too, and we label these associations as well. For example, the ‘rim’ is a part of a ‘wheel’, which in turn is part of a ‘car’. A ‘knob’ is a part of a ‘door’ that can be part of a ‘cabinet’. The part hierarchy shown in Fig. 3 has a depth of 3.
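To make the nesting concrete, the small sketch below measures the depth of such a part hierarchy over an assumed nested-dict layout (the layout is for illustration only, not the dataset's file format).

def hierarchy_depth(segment):
    """Depth 1 for a segment with no parts; +1 per level of nested parts."""
    parts = segment.get("parts", [])
    if not parts:
        return 1
    return 1 + max(hierarchy_depth(p) for p in parts)

car = {"name": "car",
       "parts": [{"name": "wheel",
                  "parts": [{"name": "rim"}]}]}
print(hierarchy_depth(car))   # prints 3: car -> wheel -> rim, as in the example above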
2.2 Dataset summary
After annotation, there are 20,210 images in the training set, 2,000 images in the validation set, and 3,000 images in the testing set. There are in total 3,169 class labels annotated; among them, 2,693 are object and stuff classes and 476 are object part classes. All the images are exhaustively annotated with objects, and many objects are also annotated with their parts. For each object there is additional information about whether it is occluded or cropped, along with other attributes. The images in the validation set are exhaustively annotated with parts, while the part annotations are not exhaustive over the images in the training set. Sample images and annotations from the ADE20K dataset are shown in Fig. 1.
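Assuming an in-memory representation along the lines sketched earlier (a dict mapping each split to a list of images, each image being a list of segment dicts with "name" and "kind" fields — an assumption, not the release format), the summary numbers above could be recomputed roughly as follows.

from statistics import mean

def summarize(dataset):
    """dataset: {"train": [...], "val": [...], "test": [...]}; each image is a list of segments."""
    for split, images in dataset.items():
        print(f"{split}: {len(images)} images, "
              f"{mean(len(segs) for segs in images):.1f} segments/image on average")
    segments = [s for images in dataset.values() for segs in images for s in segs]
    object_classes = {s["name"] for s in segments if s["kind"] in ("object", "stuff")}
    part_classes = {s["name"] for s in segments if s["kind"] == "part"}
    print(f"{len(object_classes)} object/stuff classes, {len(part_classes)} part classes")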

Fig. 2 The annotation interface, with the list of objects and their associated parts in the image.
Fig. 3 Section of the relation tree of objects and parts for the dataset. Each number indicates the number of instances for each object. The full relation tree is available at the dataset webpage.
2.3 Annotation consistency
Defining a labeling protocol is relatively easy when the labeling task is restricted to a fixed list of object classes; however, it becomes challenging when the class list is open-ended. As the goal is to label all the objects within each image, the list of classes grows unbounded. Many object classes appear only a few times across the entire collection of images. However, those rare object classes cannot be ignored, as they might be important elements for the interpretation of the scene. Labeling in these conditions becomes difficult because we need to keep a growing list of all the object classes in order to have consistent naming across the entire dataset. Despite the best effort of the annotator, the process is not free from noise.
To analyze the annotation consistency we took a subset of 61 randomly chosen images from the validation set, then asked our annotator to annotate them again (with a time difference of six months). One expects that there are some differences between the two annotations.
