(Open Access) The Cityscapes Dataset for Semantic Urban Scene Understanding (2016) | Marius Cordts

Proceedings Article•DOI•

Semantic Image Synthesis With Spatially-Adaptive Normalization

Taesung Park¹, Ming-Yu Liu², Ting-Chun Wang², Jun-Yan Zhu³•Institutions (3)

University of California, Berkeley¹, Nvidia², Massachusetts Institute of Technology³

18 Mar 2019

TL;DR: Spatially-adaptive normalization is proposed, a simple but effective layer for synthesizing photorealistic images given an input semantic layout that allows users to easily control the style and content of image synthesis results as well as create multi-modal results.

Abstract: We propose spatially-adaptive normalization, a simple but effective layer for synthesizing photorealistic images given an input semantic layout. Previous methods directly feed the semantic layout as input to the network, forcing the network to memorize the information throughout all the layers. Instead, we propose using the input layout for modulating the activations in normalization layers through a spatially-adaptive, learned affine transformation. Experiments on several challenging datasets demonstrate the superiority of our method compared to existing approaches, regarding both visual fidelity and alignment with input layouts. Finally, our model allows users to easily control the style and content of image synthesis results as well as create multi-modal results. Code is available upon publication.
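
A minimal sketch of the idea in the abstract, not the authors' implementation: a channel of activations is normalized and then scaled and shifted by per-pixel values that depend on the semantic label at each location. In the paper the modulation parameters come from learned layers applied to the layout; here hypothetical per-class lookup tables (`gamma_by_class`, `beta_by_class`) stand in for them, and a single channel is normalized over its spatial extent only.

```python
import math

def spatially_adaptive_norm(x, layout, gamma_by_class, beta_by_class, eps=1e-5):
    """Normalize one channel `x` (H x W floats), then apply a per-pixel
    affine transform selected by the semantic `layout` (H x W class ids).

    gamma_by_class / beta_by_class are hypothetical lookup tables standing in
    for the learned layers that map the layout to modulation parameters.
    """
    values = [v for row in x for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    inv_std = 1.0 / math.sqrt(var + eps)

    out = []
    for x_row, label_row in zip(x, layout):
        out_row = []
        for v, label in zip(x_row, label_row):
            normalized = (v - mean) * inv_std
            # Scale and shift depend on the label at this pixel.
            out_row.append(gamma_by_class[label] * normalized + beta_by_class[label])
        out.append(out_row)
    return out

x = [[1.0, 3.0], [1.0, 3.0]]
layout = [[0, 0], [1, 1]]  # top row: class 0, bottom row: class 1
gamma = {0: 1.0, 1: 2.0}
beta = {0: 0.0, 1: 10.0}
y = spatially_adaptive_norm(x, layout, gamma, beta)
```

Both rows of `x` hold the same activations, yet the outputs differ because each row carries a different label. This is the sense in which the layout modulates the normalization instead of being fed only at the network input.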

2,159 citations

Cites methods from "The Cityscapes Dataset for Semantic..."

...We use state-of-the-art segmentation networks for each dataset: DeepLabV2 [6, 32] for COCOStuff, UperNet101 [42] for ADE20K, and DRN-D-105 [44] for Cityscapes....
[...]
...Recent work has achieved photorealistic semantic image synthesis results [35, 39] on the Cityscapes dataset....
[...]
...We note that the SIMS model produces a lower FID score but has poor segmentation performances on the Cityscapes dataset....
[...]
...• Cityscapes dataset [8] contains street scene images in German cities....
[...]
...In Figure 19 and 20, we show additional synthesis results from the proposed method on the ADE20K-outdoor and Cityscapes datasets with comparison to those from the CRN [7], SIMS [35], and pix2pixHD [40] methods....
[...]

Proceedings Article•DOI•

Unsupervised Learning of Depth and Ego-Motion from Video

Tinghui Zhou¹, Matthew Brown², Noah Snavely², David G. Lowe²•Institutions (2)

University of California, Berkeley¹, Google²

25 Apr 2017

TL;DR: In this paper, an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences is presented, which uses single-view depth and multiview pose networks with a loss based on warping nearby views to the target using the computed depth and pose.

Abstract: We present an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences. In common with recent work [10, 14, 16], we use an end-to-end learning approach with view synthesis as the supervisory signal. In contrast to the previous work, our method is completely unsupervised, requiring only monocular video sequences for training. Our method uses single-view depth and multiview pose networks, with a loss based on warping nearby views to the target using the computed depth and pose. The networks are thus coupled by the loss during training, but can be applied independently at test time. Empirical evaluation on the KITTI dataset demonstrates the effectiveness of our approach: 1) monocular depth performs comparably with supervised methods that use either ground-truth pose or depth for training, and 2) pose estimation performs favorably compared to established SLAM systems under comparable input settings.
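
A rough sketch of the warping step the abstract describes, under simplifying assumptions that are not taken from the paper: a pinhole camera with hypothetical intrinsics `(fx, fy, cx, cy)` and a relative pose given as a rotation matrix and a translation vector. A target pixel is back-projected with its predicted depth, moved into the source camera frame, and projected again. The photometric difference at the resulting location is what a view-synthesis loss would penalize.

```python
def project_to_source(u, v, depth, intrinsics, rotation, translation):
    """Map target pixel (u, v) with predicted depth to source-view pixel coords.

    intrinsics: (fx, fy, cx, cy) of an assumed pinhole camera.
    rotation: 3x3 nested list, translation: length-3 list (target -> source).
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in the target camera frame.
    point = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Apply the rigid transform into the source camera frame.
    moved = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if moved[2] <= 0:
        return None  # point is behind the source camera
    # Project back onto the source image plane.
    return (fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (100.0, 100.0, 50.0, 50.0)
# With an identity pose the pixel maps back to itself.
same = project_to_source(60.0, 40.0, 5.0, K, identity, [0.0, 0.0, 0.0])
# A sideways translation shifts the projection; nearer points shift more.
shifted = project_to_source(50.0, 50.0, 10.0, K, identity, [1.0, 0.0, 0.0])
```

Because the projected location depends on both the depth and the pose, an error in either one shows up in the same photometric loss, which is how the two networks become coupled during training.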

1,972 citations

Posted Content•

nuScenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, Oscar Beijbom

26 Mar 2019-arXiv: Learning

TL;DR: nuScenes is the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 lidar, all with full 360 degree field of view.

Abstract: Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image based benchmark datasets have driven development in computer vision tasks such as object detection, tracking and segmentation of agents in the environment. Most autonomous vehicles, however, carry a combination of cameras and range sensors such as lidar and radar. As machine learning based methods for detection and tracking become more prevalent, there is a need to train and evaluate such methods on datasets containing range sensor data along with images. In this work we present nuTonomy scenes (nuScenes), the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 lidar, all with full 360 degree field of view. nuScenes comprises 1000 scenes, each 20s long and fully annotated with 3D bounding boxes for 23 classes and 8 attributes. It has 7x as many annotations and 100x as many images as the pioneering KITTI dataset. We define novel 3D detection and tracking metrics. We also provide careful dataset analysis as well as baselines for lidar and image based detection and tracking. Data, development kit and more information are available online.

1,939 citations

Cites background or methods from "The Cityscapes Dataset for Semantic..."

...CamVid [8], Cityscapes [19], Mapillary Vistas [33], D2-City [11], BDD100k [85] and Apolloscape [41] released ever growing datasets with segmentation masks....
[...]
...mAP with a threshold on IOU is perhaps the most popular metric for object detection [32, 19, 21]....
[...]
...Most datasets provide 2D semantic annotations as boxes or masks (class or instance) [8, 19, 33, 85, 55]....
[...]
...KITTI [32] was the pioneering multimodal dataset providing dense pointclouds from a lidar sensor as well as front-facing stereo images and GPS/IMU data....
[...]
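
For readers unfamiliar with the metric named in the excerpts above, a minimal sketch of intersection over union (IoU) for axis-aligned 2D boxes follows. The `(x_min, y_min, x_max, y_max)` box format and the 0.5 matching threshold are illustrative assumptions, not values taken from the cited papers.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def is_match(prediction, ground_truth, threshold=0.5):
    """A prediction counts as a true positive when IoU reaches the threshold."""
    return box_iou(prediction, ground_truth) >= threshold

overlap = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
disjoint = box_iou((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0))
```

Average precision is then computed from the predictions that pass this matching test, and mAP averages it over classes.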

Book Chapter•DOI•

Multimodal Unsupervised Image-to-Image Translation

Xun Huang¹, Ming-Yu Liu², Serge Belongie¹, Jan Kautz²•Institutions (2)

Cornell University¹, Nvidia²

08 Sep 2018

TL;DR: In this article, the authors propose a multimodal unsupervised image-to-image translation (MUNIT) framework, where the image representation can be decomposed into a content code that is domain-invariant and a style code that captures domain-specific properties.

Abstract: Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any examples of corresponding image pairs. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to state-of-the-art approaches further demonstrate the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at https://github.com/nvlabs/MUNIT.

1,874 citations

Journal Article•DOI•

Past, Present, and Future of Simultaneous Localization And Mapping: Towards the Robust-Perception Age

Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, José L. Neira, Ian Reid, John J. Leonard

19 Jun 2016-arXiv: Robotics

TL;DR: This survey presents what is now the de-facto standard formulation for SLAM and reviews related work covering a broad set of topics, including robustness and scalability in long-term mapping, metric and semantic representations for mapping, theoretical performance guarantees, active SLAM and exploration, and other new frontiers.

Abstract: Simultaneous Localization and Mapping (SLAM) consists in the concurrent construction of a model of the environment (the map), and the estimation of the state of the robot moving within it. The SLAM community has made astonishing progress over the last 30 years, enabling large-scale real-world applications, and witnessing a steady transition of this technology to industry. We survey the current state of SLAM. We start by presenting what is now the de-facto standard formulation for SLAM. We then review related work, covering a broad set of topics including robustness and scalability in long-term mapping, metric and semantic representations for mapping, theoretical performance guarantees, active SLAM and exploration, and other new frontiers. This paper simultaneously serves as a position paper and tutorial to those who are users of SLAM. By looking at the published research with a critical eye, we delineate open challenges and new research issues that still deserve careful scientific investigation. The paper also contains the authors' take on two questions that often animate discussions during robotics conferences: Do robots need SLAM? and Is SLAM solved?

1,828 citations

Cites background from "The Cityscapes Dataset for Semantic..."

...of visual maps created by large fleets of autonomous vehicles is a compelling area for future work. One can identify tasks for which different flavors of SLAM fo...
[...]
