
Citation: Zhou, Bolei, et al. "Semantic Understanding of Scenes Through the ADE20K Dataset." International Journal of Computer Vision 127 (2019): 302–321. https://doi.org/10.1007/s11263-018-1140-0 © 2018 Author(s)
Publisher: Springer Nature
Version: Original manuscript
Citable link: https://hdl.handle.net/1721.1/125771
Terms of use: Creative Commons Attribution-NonCommercial-ShareAlike (http://creativecommons.org/licenses/by-nc-sa/4.0/)

Semantic Understanding of Scenes through the ADE20K Dataset
Bolei Zhou · Hang Zhao · Xavier Puig · Tete Xiao · Sanja Fidler · Adela Barriuso · Antonio Torralba
Abstract Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite the efforts of the community in data collection, there are still few image datasets covering a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present the densely annotated dataset ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. In total, there are 25k images of complex everyday scenes containing a variety of objects in their natural spatial context. On average there are 19.5 instances and 10.5 object classes per image. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both benchmarks and re-implement state-of-the-art models as open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for semantic segmentation performance. We show that networks trained on ADE20K are able to segment a wide variety of scenes and objects.¹
B. Zhou
Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong.
H. Zhao, X. Puig, A. Barriuso, A. Torralba
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA.
T. Xiao
School of Electronic Engineering and Computer Science, Peking University, China.
S. Fidler
Department of Computer Science, University of Toronto, Canada.
¹ Dataset is available at http://groups.csail.mit.edu/vision/datasets/ADE20K. Pretrained models and code are released at https://github.com/CSAILVision/semantic-segmentation-pytorch.
Keywords Scene understanding · Semantic segmentation · Instance segmentation · Image dataset · Deep neural networks
1 Introduction
Semantic understanding of visual scenes is one of the holy grails of computer vision. The emergence of large-scale image datasets like ImageNet [29], COCO [18] and Places [38], along with the rapid development of deep convolutional neural network (CNN) approaches, has brought great advancements to visual scene understanding. Nowadays, given a visual scene of a living room, a robot equipped with a trained CNN can accurately predict the scene category. However, to freely navigate in the scene and manipulate the objects inside, the robot has far more information to extract from the input image: it needs to recognize and localize not only objects like sofa, table, and TV, but also their parts, e.g., the seat of a chair or the handle of a cup, to allow proper manipulation, as well as to segment stuff like the floor, walls and ceiling for spatial navigation.
Recognizing and segmenting objects and stuff at the pixel level remains one of the key problems in scene understanding. Going beyond image-level recognition, pixel-level scene understanding requires a much denser annotation of scenes with a large set of objects. However, current datasets have a limited number of objects (e.g., COCO [18], Pascal [10]), and in many cases those objects are not the most common objects one encounters in the world (like frisbees or baseball bats), or the datasets only cover a limited set of scenes (e.g., Cityscapes [7]). Some notable exceptions are Pascal-Context [22] and the SUN database [34]. However, Pascal-Context still contains scenes primarily focused on 20 object classes, while SUN has noisy labels at the object level.

Fig. 1 Images in the ADE20K dataset are densely annotated in detail with objects and parts. The first row shows the sample images, the second row shows the annotation of objects, and the third row shows the annotation of object parts. The color scheme encodes both the object categories and the object instances: different object categories have large color differences, while different instances of the same object category have small color differences (e.g., different person instances in the first image have slightly different colors).
The motivation of this work is to collect a dataset that has densely annotated images (every pixel has a semantic label) with a large and unrestricted open vocabulary. The images in our dataset are manually segmented in great detail, covering a diverse set of scene, object and object part categories. The challenge in collecting such annotations is finding reliable annotators, as well as the fact that labeling is difficult if the class list is not defined in advance. On the other hand, open-vocabulary naming also suffers from naming inconsistencies across different annotators. In contrast, our dataset was annotated by a single expert annotator, providing extremely detailed and exhaustive image annotations. On average, our annotator labeled 29 annotation segments per image, compared to the 16 segments per image labeled by external annotators (like workers from Amazon Mechanical Turk). Furthermore, the data consistency and quality are much higher than those of external annotators. Fig. 1 shows examples from our dataset.
A preliminary version of this work was published in [39]. Compared to the previous conference paper, we include a more detailed description of the dataset, more baseline results on the scene parsing benchmark, the introduction of the new instance segmentation benchmark and its baseline results, as well as the effect of synchronized batch normalization and the joint training of objects and parts. We also include the contents of the Places Challenges we hosted at ECCV’16 and ICCV’17 and the analysis of the challenge results.
The sections of this work are organized as follows. In Sec. 2 we describe the construction of the ADE20K dataset and its statistics. In Sec. 3 we introduce the two pixel-wise scene understanding benchmarks we build upon ADE20K: scene parsing and instance segmentation. We train and evaluate several baseline networks on the benchmarks. We also re-implement and open-source several state-of-the-art scene parsing models and evaluate the effect of the batch normalization size. In Sec. 4 we introduce the Places Challenges at ECCV’16 and ICCV’17 based on the benchmarks of ADE20K, as well as the qualitative and quantitative analysis of the challenge results. In Sec. 5 we train networks jointly to segment objects and their parts. Sec. 6 explores the applications of the scene parsing networks to hierarchical semantic segmentation and automatic scene content removal. Sec. 7 concludes this work.
1.1 Related work
Many datasets have been collected for the purpose of semantic understanding of scenes. We review these datasets according to the level of detail of their annotations, then briefly go through previous work on semantic segmentation networks.
Object classification/detection datasets. Most of the large-scale datasets typically only contain labels at the image level or provide bounding boxes. Examples include ImageNet [29], Pascal [10], and KITTI [11]. ImageNet has the largest set of classes, but contains relatively simple scenes. Pascal and KITTI are more challenging and have more objects per image; however, their classes and scenes are more constrained.
Semantic segmentation datasets. Existing datasets with pixel-level labels typically provide annotations only for a subset of foreground objects (20 in PASCAL VOC [10] and 91 in Microsoft COCO [18]). Collecting dense annotations where all pixels are labeled is much more challenging. Such efforts include Pascal-Context [22], NYU Depth V2 [23], the SUN database [34], the SUN RGB-D dataset [31], the Cityscapes dataset [7], and OpenSurfaces [2, 3]. Recently, the COCO stuff dataset [4] provided additional stuff segmentation complementary to the 80 object categories in the COCO dataset, while the COCO attributes dataset [26] annotated attributes for some objects in the COCO dataset. Such progressive enhancement of a dataset with diverse annotations over the years has contributed greatly to the modern development of image datasets.
Datasets with objects, parts and attributes. Two datasets were released that go beyond the typical labeling setup by also providing pixel-level annotation for the object parts, i.e., the Pascal-Part dataset [6], or material classes, i.e., OpenSurfaces [2, 3]. We advance this effort by collecting very high-resolution imagery of a much wider selection of scenes, containing a large set of object classes per image. We annotated both stuff and object classes, for which we additionally annotated their parts, and parts of these parts. We believe that our dataset, ADE20K, is one of the most comprehensive datasets of its kind. We provide a comparison between datasets in Sec. 2.6.
Semantic segmentation models. With the success of convolutional neural networks (CNNs) for image classification [17], there is growing interest in semantic pixel-wise labeling using CNNs with dense output, such as the fully convolutional network [20], deconvolutional neural networks [25], the encoder-decoder SegNet [1], multi-task network cascades [9], and DilatedVGG [5, 36]. They are benchmarked on the Pascal dataset with impressive performance on segmenting the 20 object classes. Some of them [20, 1] are evaluated on Pascal-Context [22] or the SUN RGB-D dataset [31] to show their capability to segment more object classes in scenes. Joint stuff and object segmentation is explored in [8], which uses pre-computed superpixels and feature masking to represent stuff. A cascade of instance segmentation and categorization has been explored in [9]. A multiscale pyramid pooling module is proposed to improve scene parsing in [37]. A recent multi-task segmentation network, UPerNet, is proposed to segment visual concepts from different levels [35].
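The multiscale pyramid pooling idea referenced above [37] can be pictured as follows: the final feature map is average-pooled to several grid sizes, each pooled map is reduced with a 1x1 convolution and upsampled back to the input resolution, and the results are concatenated with the original features as global context. The PyTorch sketch below is only an illustration of that idea; the bin sizes and channel widths are assumptions, not the reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Illustrative pyramid pooling module (bin sizes are assumptions)."""
    def __init__(self, in_channels, bins=(1, 2, 3, 6)):
        super().__init__()
        reduced = in_channels // len(bins)
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                       # pool features to a b x b grid
                nn.Conv2d(in_channels, reduced, 1, bias=False),
                nn.BatchNorm2d(reduced),
                nn.ReLU(inplace=True))
            for b in bins)

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                                align_corners=False) for stage in self.stages]
        # concatenate the multi-scale context with the original feature map
        return torch.cat([x] + pooled, dim=1)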
2 ADE20K: Fully Annotated Image Dataset
In this section, we describe the construction of our ADE20K
dataset and analyze its statistics.
2.1 Image annotation
For our dataset, we are interested in having a diverse set of scenes with dense annotations of all the visual concepts present. A visual concept can be 1) a discrete object, which is a thing with a well-defined shape, e.g., a car or a person; 2) stuff, which covers amorphous background regions, e.g., grass or sky; or 3) an object part, which is a component of an existing object instance that has some functional meaning, such as a head or a leg. Images come from the LabelMe [30], SUN [34], and Places [38] datasets and were selected to cover the 900 scene categories defined in the SUN database. Images were annotated by a single expert worker using the LabelMe interface [30]. Fig. 2 shows a snapshot of the annotation interface and one fully segmented image. The worker provided three types of annotations: object segments with names, object parts, and attributes. All object instances are segmented independently so that the dataset can be used to train and evaluate detection or segmentation algorithms.
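For concreteness, one way to picture such an annotation is sketched below; this record layout is an assumption made for exposition only, not the ADE20K release format.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AnnotatedSegment:
    """Hypothetical in-memory view of one annotated segment."""
    name: str                                   # open-vocabulary label, e.g. "car" or "sky"
    kind: str                                   # "object", "stuff", or "part"
    polygon: List[Tuple[float, float]]          # (x, y) vertices in image coordinates
    attributes: List[str] = field(default_factory=list)            # e.g. ["occluded", "cropped"]
    parts: List["AnnotatedSegment"] = field(default_factory=list)  # parts attached to this instance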
Given that the objects appearing in the dataset are fully annotated, even in the regions where they are occluded, there are multiple areas where the polygons from different regions overlap. In order to convert the annotated polygons into a segmentation mask, we sort the objects in an image by depth layers. Background classes like ‘sky’ or ‘wall’ are set as the farthest layers. The depths of the remaining objects are set as follows: when a polygon is fully contained inside another polygon, the object from the inner polygon is given a closer depth layer. When objects only partially overlap, we look at the region of intersection between the two polygons, and set as the closest object the one whose polygon has more points in the region of intersection. Once objects have been sorted, the segmentation mask is constructed by iterating over the objects in decreasing depth, ensuring that object parts never occlude whole objects and no object is occluded by its parts.
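A minimal sketch of this depth-layer ordering is given below. It assumes each segment is a dict with "name", "polygon" (a list of (x, y) vertices) and "class_id" keys, and it assumes a particular set of background classes; it illustrates the rules stated above and is not the authors' annotation pipeline.

import numpy as np
from functools import cmp_to_key
from PIL import Image, ImageDraw

BACKGROUND = {"sky", "wall", "floor", "ceiling"}   # assumed "farthest" classes

def rasterize(polygon, height, width):
    """Fill a polygon given as [(x, y), ...] into a boolean mask."""
    canvas = Image.new("1", (width, height), 0)
    ImageDraw.Draw(canvas).polygon(polygon, fill=1)
    return np.asarray(canvas, dtype=bool)

def compare_depth(a, b):
    """Pairwise rule: negative means segment `a` lies behind segment `b`."""
    if a["name"] in BACKGROUND and b["name"] not in BACKGROUND:
        return -1
    if b["name"] in BACKGROUND and a["name"] not in BACKGROUND:
        return 1
    overlap = a["mask"] & b["mask"]
    if not overlap.any():
        return 0                                # disjoint regions: order does not matter
    if not (a["mask"] & ~b["mask"]).any():
        return 1                                # a fully inside b -> a gets a closer layer
    if not (b["mask"] & ~a["mask"]).any():
        return -1                               # b fully inside a -> b gets a closer layer
    def vertices_inside(seg):                   # partial overlap: count polygon vertices
        h, w = overlap.shape                    # falling in the intersection region
        return sum(bool(overlap[int(y), int(x)]) for x, y in seg["polygon"]
                   if 0 <= int(y) < h and 0 <= int(x) < w)
    return 1 if vertices_inside(a) > vertices_inside(b) else -1

def build_label_mask(segments, height, width):
    """Paint segments from farthest to closest into one integer label map."""
    for seg in segments:                        # note: adds a "mask" key to each segment
        seg["mask"] = rasterize(seg["polygon"], height, width)
    ordered = sorted(segments, key=cmp_to_key(compare_depth))
    label_map = np.zeros((height, width), dtype=np.int32)   # 0 = unlabeled
    for seg in ordered:                         # closer segments overwrite farther ones
        label_map[seg["mask"]] = seg["class_id"]
    return label_map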
Datasets such as COCO [18], Pascal [10] or Cityscapes [7] start by defining a set of object categories of interest. However, when labeling all the objects in a scene, working with a predefined list of objects is not possible, as new categories appear frequently (see Fig. 6.d). Here, the annotator created a dictionary of visual concepts to which new classes were added constantly, to ensure consistency in object naming.
Object parts are associated with object instances. Note that parts can have parts too, and we label these associations as well. For example, the ‘rim’ is a part of a ‘wheel’, which in turn is part of a ‘car’. A ‘knob’ is a part of a ‘door’ that can be part of a ‘cabinet’. The part hierarchy shown in Fig. 3 has a depth of 3.
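To make the nesting concrete, the small sketch below measures the depth of such a part hierarchy over an assumed nested-dict layout (the layout is for illustration only, not the dataset's file format).

def hierarchy_depth(segment):
    """Depth 1 for a segment with no parts; +1 per level of nested parts."""
    parts = segment.get("parts", [])
    if not parts:
        return 1
    return 1 + max(hierarchy_depth(p) for p in parts)

car = {"name": "car",
       "parts": [{"name": "wheel",
                  "parts": [{"name": "rim"}]}]}
print(hierarchy_depth(car))   # prints 3: car -> wheel -> rim, as in the example above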
2.2 Dataset summary
After annotation, there are 20,210 images in the training set, 2,000 images in the validation set, and 3,000 images in the testing set. There are in total 3,169 class labels annotated; among them, 2,693 are object and stuff classes and 476 are object part classes. All the images are exhaustively annotated with objects, and many objects are also annotated with their parts. For each object there is additional information about whether it is occluded or cropped, along with other attributes. The images in the validation set are exhaustively annotated with parts, while the part annotations are not exhaustive over the images in the training set. Sample images and annotations from the ADE20K dataset are shown in Fig. 1.
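Assuming an in-memory representation along the lines sketched earlier (a dict mapping each split to a list of images, each image being a list of segment dicts with "name" and "kind" fields — an assumption, not the release format), the summary numbers above could be recomputed roughly as follows.

from statistics import mean

def summarize(dataset):
    """dataset: {"train": [...], "val": [...], "test": [...]}; each image is a list of segments."""
    for split, images in dataset.items():
        print(f"{split}: {len(images)} images, "
              f"{mean(len(segs) for segs in images):.1f} segments/image on average")
    segments = [s for images in dataset.values() for segs in images for s in segs]
    object_classes = {s["name"] for s in segments if s["kind"] in ("object", "stuff")}
    part_classes = {s["name"] for s in segments if s["kind"] == "part"}
    print(f"{len(object_classes)} object/stuff classes, {len(part_classes)} part classes")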

Fig. 2 The annotation interface, with the list of objects and their associated parts in the image.
Fig. 3 Section of the relation tree of objects and parts for the dataset. Each number indicates the number of instances for each object. The full relation tree is available at the dataset webpage.
2.3 Annotation consistency
Defining a labeling protocol is relatively easy when the labeling task is restricted to a fixed list of object classes; however, it becomes challenging when the class list is open-ended. As the goal is to label all the objects within each image, the list of classes grows unbounded. Many object classes appear only a few times across the entire collection of images. However, those rare object classes cannot be ignored, as they might be important elements for the interpretation of the scene. Labeling in these conditions becomes difficult because we need to keep a growing list of all the object classes in order to have consistent naming across the entire dataset. Despite the best effort of the annotator, the process is not free from noise.
To analyze the annotation consistency we took a subset of 61 randomly chosen images from the validation set, then asked our annotator to annotate them again (with a time difference of six months). One expects that there are some differences between the two annotations.
