
Showing papers by "Thomas Brox" published in 2017


Proceedings ArticleDOI
21 Jul 2017
TL;DR: The concept of end-to-end learning of optical flow is advanced and made to work really well; faster variants are presented that allow optical flow computation at up to 140fps with accuracy matching the original FlowNet.
Abstract: The FlowNet demonstrated that optical flow estimation can be cast as a learning problem. However, the state of the art with regard to the quality of the flow has still been defined by traditional methods. Particularly on small displacements and real-world data, FlowNet cannot compete with variational methods. In this paper, we advance the concept of end-to-end learning of optical flow and make it work really well. The large improvements in quality and speed are caused by three major contributions: first, we focus on the training data and show that the schedule of presenting data during training is very important. Second, we develop a stacked architecture that includes warping of the second image with intermediate optical flow. Third, we elaborate on small displacements by introducing a subnetwork specializing on small motions. FlowNet 2.0 is only marginally slower than the original FlowNet but decreases the estimation error by more than 50%. It performs on par with state-of-the-art methods, while running at interactive frame rates. Moreover, we present faster variants that allow optical flow computation at up to 140fps with accuracy matching the original FlowNet.

2,553 citations
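The warping step in the stacked architecture can be sketched as backward warping with bilinear sampling: once an intermediate flow is available, the second image is warped toward the first so that the next network only estimates a residual flow. A minimal PyTorch sketch, with the function name and tensor layout as assumptions rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def warp(img2, flow):
    # img2: (B, C, H, W); flow: (B, 2, H, W) in pixels, channel 0 = x, 1 = y.
    b, _, h, w = img2.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(img2.device)   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                      # where to sample img2
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0          # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                   # (B, H, W, 2)
    return F.grid_sample(img2, grid, align_corners=True)
```

A stacked module would then feed the first image, `warp(img2, flow)`, and the current flow estimate into the next network, so that it refines only the remaining displacement.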


Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this paper, a deep convolutional decoder architecture is proposed that generates volumetric 3D outputs in a compute- and memory-efficient manner by using an octree representation.
Abstract: We present a deep convolutional decoder architecture that can generate volumetric 3D outputs in a compute- and memory-efficient manner by using an octree representation. The network learns to predict both the structure of the octree, and the occupancy values of individual cells. This makes it a particularly valuable technique for generating 3D shapes. In contrast to standard decoders acting on regular voxel grids, the architecture does not have cubic complexity. This allows representing much higher resolution outputs with a limited memory budget. We demonstrate this in several application domains, including 3D convolutional autoencoders, generation of objects and whole scenes from high-level representations, and shape from a single image.

697 citations
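The escape from cubic complexity comes from subdividing only cells whose content is neither uniformly empty nor uniformly filled. A toy Python sketch of that recursion; the three-state cell classification follows the paper's description, while the names and the `predict_state` interface are illustrative assumptions:

```python
from dataclasses import dataclass, field

EMPTY, FILLED, MIXED = 0, 1, 2   # per-cell states predicted by the decoder

@dataclass
class OctreeCell:
    depth: int
    state: int
    children: list = field(default_factory=list)

def grow_octree(predict_state, depth=0, max_depth=4):
    """Only 'mixed' cells are subdivided, so the number of cells grows
    roughly with the surface area of the shape rather than the volume
    of a dense voxel grid."""
    state = predict_state(depth)
    cell = OctreeCell(depth, state)
    if state == MIXED and depth < max_depth:
        cell.children = [grow_octree(predict_state, depth + 1, max_depth)
                         for _ in range(8)]
    return cell
```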


Proceedings ArticleDOI
01 Jul 2017
TL;DR: DeMoN, presented in this paper, is an end-to-end architecture composed of multiple stacked encoder-decoder networks, the core part being an iterative network that is able to improve its own predictions.
Abstract: In this paper we formulate structure from motion as a learning problem. We train a convolutional network end-to-end to compute depth and camera motion from successive, unconstrained image pairs. The architecture is composed of multiple stacked encoder-decoder networks, the core part being an iterative network that is able to improve its own predictions. The network estimates not only depth and motion, but additionally surface normals, optical flow between the images and confidence of the matching. A crucial component of the approach is a training loss based on spatial relative differences. Compared to traditional two-frame structure from motion methods, results are more accurate and more robust. In contrast to the popular depth-from-single-image networks, DeMoN learns the concept of matching and, thus, better generalizes to structures not seen during training.

580 citations
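The "training loss based on spatial relative differences" can be illustrated as a penalty on discrete depth gradients rather than absolute values. A hedged PyTorch sketch; DeMoN's actual loss normalizes the gradients and combines several terms, and the spacings and names here are assumptions:

```python
import torch

def relative_difference_loss(pred, gt, spacings=(1, 2, 4)):
    """L1 distance between discrete gradients of predicted and ground-truth
    depth at several pixel spacings; emphasizes relative depth structure
    and discontinuities over absolute scale."""
    loss = pred.new_zeros(())
    for s in spacings:
        dx_p, dx_g = pred[..., :, s:] - pred[..., :, :-s], gt[..., :, s:] - gt[..., :, :-s]
        dy_p, dy_g = pred[..., s:, :] - pred[..., :-s, :], gt[..., s:, :] - gt[..., :-s, :]
        loss = loss + (dx_p - dx_g).abs().mean() + (dy_p - dy_g).abs().mean()
    return loss
```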


Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this paper, the authors propose a deep network that learns a network-implicit 3D articulation prior; together with detected keypoints in the images, it yields good estimates of the 3D pose.
Abstract: Low-cost consumer depth cameras and deep learning have enabled reasonable 3D hand pose estimation from single depth images. In this paper, we present an approach that estimates 3D hand pose from regular RGB images. This task has far more ambiguities due to the missing depth information. To this end, we propose a deep network that learns a network-implicit 3D articulation prior. Together with detected keypoints in the images, this network yields good estimates of the 3D pose. We introduce a large scale 3D hand pose dataset based on synthetic hand models for training the involved networks. Experiments on a variety of test sets, including one on sign language recognition, demonstrate the feasibility of 3D hand pose estimation on single color images.

539 citations
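The two-stage idea, 2D keypoint detection followed by a learned lifting network whose weights encode an implicit articulation prior, can be sketched as follows. The 21-joint layout matches common hand models; the layer sizes and names are assumptions, not the paper's architecture:

```python
import torch.nn as nn

class LiftingNet(nn.Module):
    """Toy lifting network: maps 21 detected 2D keypoints to canonical
    3D joint coordinates. The learned mapping acts as an implicit prior
    over feasible hand articulations."""
    def __init__(self, n_joints=21):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_joints * 2, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_joints * 3))

    def forward(self, kp2d):                     # (B, 21, 2) normalized keypoints
        b = kp2d.size(0)
        return self.net(kp2d.view(b, -1)).view(b, -1, 3)
```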


Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper proposes a simple yet effective sparse convolution layer which explicitly considers the location of missing data during the convolution operation, and demonstrates the benefits of the proposed network architecture in synthetic and real experiments with respect to various baseline approaches.
Abstract: In this paper, we consider convolutional neural networks operating on sparse inputs with an application to depth upsampling from sparse laser scan data. First, we show that traditional convolutional networks perform poorly when applied to sparse data even when the location of missing data is provided to the network. To overcome this problem, we propose a simple yet effective sparse convolution layer which explicitly considers the location of missing data during the convolution operation. We demonstrate the benefits of the proposed network architecture in synthetic and real experiments with respect to various baseline approaches. Compared to dense baselines, the proposed sparse convolution network generalizes well to novel datasets and is invariant to the level of sparsity in the data. For our evaluation, we derive a novel dataset from the KITTI benchmark, comprising 93k depth annotated RGB images. Our dataset allows for training and evaluating depth upsampling and depth prediction techniques in challenging real-world settings and will be made available upon publication.

518 citations
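The sparse convolution layer can be sketched as a convolution over mask-zeroed inputs, renormalized by the number of observed pixels under each kernel window, with the validity mask propagated by max pooling. A PyTorch sketch under those assumptions, close in spirit to the paper's layer but not its reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConv(nn.Module):
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.k = k

    def forward(self, x, mask):          # mask: (B, 1, H, W) float, 1 = observed
        x = self.conv(x * mask)          # missing inputs contribute zero
        # Renormalize by the number of observed pixels under each kernel window.
        n_obs = F.avg_pool2d(mask, self.k, 1, self.k // 2) * self.k ** 2
        x = x / n_obs.clamp(min=1.0) + self.bias.view(1, -1, 1, 1)
        # An output pixel is valid if any input under the kernel was observed.
        new_mask = F.max_pool2d(mask, self.k, 1, self.k // 2)
        return x, new_mask
```

Normalizing by `n_obs` makes the filter response independent of how many inputs happened to be observed, which is what gives the claimed invariance to the sparsity level.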


Journal ArticleDOI
TL;DR: It is found that methods that either take prior information into account using learning strategies or analyze cells in a global spatiotemporal video context performed better than other methods under the segmentation and tracking scenarios included in the Cell Tracking Challenge.
Abstract: We present a combined report on the results of three editions of the Cell Tracking Challenge, an ongoing initiative aimed at promoting the development and objective evaluation of cell segmentation and tracking algorithms. With 21 participating algorithms and a data repository consisting of 13 data sets from various microscopy modalities, the challenge displays today's state-of-the-art methodology in the field. We analyzed the challenge results using performance measures for segmentation and tracking that rank all participating methods. We also analyzed the performance of all of the algorithms in terms of biological measures and practical usability. Although some methods scored high in all technical aspects, none obtained fully correct solutions. We found that methods that either take prior information into account using learning strategies or analyze cells in a global spatiotemporal video context performed better than other methods under the segmentation and tracking scenarios included in the challenge.

468 citations


Posted Content
TL;DR: A deep convolutional decoder architecture that can generate volumetric 3D outputs in a compute- and memory-efficient manner by using an octree representation that learns to predict both the structure of the octree, and the occupancy values of individual cells.
Abstract: We present a deep convolutional decoder architecture that can generate volumetric 3D outputs in a compute- and memory-efficient manner by using an octree representation. The network learns to predict both the structure of the octree, and the occupancy values of individual cells. This makes it a particularly valuable technique for generating 3D shapes. In contrast to standard decoders acting on regular voxel grids, the architecture does not have cubic complexity. This allows representing much higher resolution outputs with a limited memory budget. We demonstrate this in several application domains, including 3D convolutional autoencoders, generation of objects and whole scenes from high-level representations, and shape from a single image.

362 citations


Posted Content
TL;DR: In this article, the location of missing data is considered in the convolutional layer of the network and a simple sparse convolution layer is proposed for depth upsampling from sparse laser scan data.
Abstract: In this paper, we consider convolutional neural networks operating on sparse inputs with an application to depth upsampling from sparse laser scan data. First, we show that traditional convolutional networks perform poorly when applied to sparse data even when the location of missing data is provided to the network. To overcome this problem, we propose a simple yet effective sparse convolution layer which explicitly considers the location of missing data during the convolution operation. We demonstrate the benefits of the proposed network architecture in synthetic and real experiments with respect to various baseline approaches. Compared to dense baselines, the proposed sparse convolution network generalizes well to novel datasets and is invariant to the level of sparsity in the data. For our evaluation, we derive a novel dataset from the KITTI benchmark, comprising 93k depth annotated RGB images. Our dataset allows for training and evaluating depth upsampling and depth prediction techniques in challenging real-world settings and will be made available upon publication.

236 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper proposes a network architecture that computes and integrates the most important visual cues for action recognition (pose, motion, and the raw images) and introduces a Markov chain model which adds cues successively.
Abstract: General human action recognition requires understanding of various visual cues. In this paper, we propose a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images. For the integration, we introduce a Markov chain model which adds cues successively. The resulting approach is efficient and applicable to action classification as well as to spatial and temporal action localization. The two contributions clearly improve the performance over respective baselines. The overall approach achieves state-of-the-art action classification performance on HMDB51, J-HMDB and NTU RGB+D datasets. Moreover, it yields state-of-the-art spatio-temporal action localization results on UCF101 and J-HMDB.

209 citations
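The Markov chain integration, adding one cue at a time with each stage refining the running prediction, can be illustrated with a toy fusion module. The dimensions, names, and use of single linear stages are assumptions; the paper chains full sub-networks:

```python
import torch
import torch.nn as nn

class CueChain(nn.Module):
    """Integrate cues one at a time: each stage refines the running class
    scores given one additional cue (e.g. pose, then motion, then raw
    appearance), loosely mirroring a Markov-chain fusion."""
    def __init__(self, cue_dims, n_classes):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Linear(d + n_classes, n_classes) for d in cue_dims)
        self.n_classes = n_classes

    def forward(self, cues):             # list of (B, d_i) cue features
        scores = torch.zeros(cues[0].size(0), self.n_classes,
                             device=cues[0].device)
        for stage, cue in zip(self.stages, cues):
            scores = scores + stage(torch.cat([cue, scores], dim=1))
        return scores
```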


Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this paper, the authors present an approach for generating universal adversarial perturbations that make the network yield a desired target segmentation as output, and empirically show that there exist barely perceptible universal noise patterns which result in nearly the same predicted segmentation for arbitrary inputs.
Abstract: While deep learning is remarkably successful on perceptual tasks, it was also shown to be vulnerable to adversarial perturbations of the input. These perturbations denote noise added to the input that was generated specifically to fool the system while being quasi-imperceptible for humans. More severely, there even exist universal perturbations that are input-agnostic but fool the network on the majority of inputs. While recent work has focused on image classification, this work proposes attacks against semantic image segmentation: we present an approach for generating (universal) adversarial perturbations that make the network yield a desired target segmentation as output. We show empirically that there exist barely perceptible universal noise patterns which result in nearly the same predicted segmentation for arbitrary inputs. Furthermore, we also show the existence of universal noise which removes a target class (e.g., all pedestrians) from the segmentation while leaving the segmentation mostly unchanged otherwise.

204 citations
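The universal targeted attack can be sketched as optimizing one shared noise image over many inputs: step the perturbation toward a fixed target segmentation and project back onto an L-infinity ball. A hedged sketch in which the loss, step sizes, and `target_fn` (which produces the desired target labeling) are assumptions; the paper's optimization differs in details:

```python
import torch
import torch.nn.functional as F

def universal_perturbation(model, loader, target_fn, eps=10 / 255,
                           step=1 / 255, epochs=5):
    """loader yields (images, labels); model returns per-pixel class
    logits (B, C, H, W); target_fn(images) returns target labels (B, H, W)."""
    xi = None
    for _ in range(epochs):
        for imgs, _ in loader:
            if xi is None:               # one shared perturbation for all images
                xi = torch.zeros_like(imgs[:1], requires_grad=True)
            loss = F.cross_entropy(model(imgs + xi), target_fn(imgs))
            loss.backward()
            with torch.no_grad():        # signed gradient step plus projection
                xi -= step * xi.grad.sign()
                xi.clamp_(-eps, eps)
            xi.grad.zero_()
    return xi.detach()
```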


Journal ArticleDOI
TL;DR: In this paper, the authors train a generative network on rendered 3D models of chairs, tables, and cars to generate images of objects given object style, viewpoint, and color.
Abstract: We train generative ‘up-convolutional’ neural networks which are able to generate images of objects given object style, viewpoint, and color. We train the networks on rendered 3D models of chairs, tables, and cars. Our experiments show that the networks do not merely learn all images by heart, but rather find a meaningful representation of 3D models allowing them to assess the similarity of different models, interpolate between given views to generate the missing ones, extrapolate views, and invent new objects not present in the training set by recombining training instances, or even two different object classes. Moreover, we show that such generative networks can be used to find correspondences between different objects from the dataset, outperforming existing approaches on this task.
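A minimal "up-convolutional" generator of the kind described, where concatenated style, viewpoint, and color codes are expanded into an image by transposed convolutions, might look like this; the layer counts and sizes are assumptions, not the paper's exact network:

```python
import torch.nn as nn

class UpConvDecoder(nn.Module):
    """Toy up-convolutional generator: a latent code is mapped to a small
    feature map and repeatedly upsampled into an RGB image."""
    def __init__(self, code_dim, base=64):
        super().__init__()
        self.base = base
        self.fc = nn.Linear(code_dim, base * 8 * 4 * 4)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(base * 8, base * 4, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(base * 2, base, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(base, 3, 4, 2, 1), nn.Tanh())

    def forward(self, code):                     # (B, code_dim) = style/view/color
        x = self.fc(code).view(-1, self.base * 8, 4, 4)
        return self.up(x)                        # (B, 3, 64, 64) image
```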

Posted Content
TL;DR: This work presents an approach for generating (universal) adversarial perturbations that make the network yield a desired target segmentation as output and shows empirically that there exist barely perceptible universal noise patterns which result in nearly the same predicted segmentation for arbitrary inputs.
Abstract: While deep learning is remarkably successful on perceptual tasks, it was also shown to be vulnerable to adversarial perturbations of the input. These perturbations denote noise added to the input that was generated specifically to fool the system while being quasi-imperceptible for humans. More severely, there even exist universal perturbations that are input-agnostic but fool the network on the majority of inputs. While recent work has focused on image classification, this work proposes attacks against semantic image segmentation: we present an approach for generating (universal) adversarial perturbations that make the network yield a desired target segmentation as output. We show empirically that there exist barely perceptible universal noise patterns which result in nearly the same predicted segmentation for arbitrary inputs. Furthermore, we also show the existence of universal noise which removes a target class (e.g., all pedestrians) from the segmentation while leaving the segmentation mostly unchanged otherwise.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: In this article, the authors argue that objects induce different features in the network under rotation and propose a multi-task approach, in which the network is trained to predict the pose of the object in addition to the class label.
Abstract: Recent work has shown good recognition results in 3D object recognition using 3D convolutional networks. In this paper, we show that the object orientation plays an important role in 3D recognition. More specifically, we argue that objects induce different features in the network under rotation. Thus, we approach the category-level classification task as a multi-task problem, in which the network is trained to predict the pose of the object in addition to the class label as a parallel task. We show that this yields significant improvements in the classification results. We test our suggested architecture on several datasets representing various 3D data sources: LiDAR data, CAD models, and RGB-D images. We report state-of-the-art results on classification as well as significant improvements in precision and speed over the baseline on 3D detection.
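The multi-task formulation amounts to a shared trunk with parallel class and orientation heads trained jointly. A toy sketch in which the feature dimension, pose binning, and loss weight are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassAndPoseHead(nn.Module):
    """Parallel heads over a shared 3D feature vector: predicting the
    discretized orientation alongside the class label acts as an
    auxiliary task that improves classification."""
    def __init__(self, feat_dim=256, n_classes=10, n_pose_bins=12):
        super().__init__()
        self.cls = nn.Linear(feat_dim, n_classes)
        self.pose = nn.Linear(feat_dim, n_pose_bins)

    def forward(self, feat):
        return self.cls(feat), self.pose(feat)

def multitask_loss(cls_logits, pose_logits, cls_gt, pose_gt, w_pose=0.5):
    return (F.cross_entropy(cls_logits, cls_gt)
            + w_pose * F.cross_entropy(pose_logits, pose_gt))
```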

Posted Content
TL;DR: In this paper, a Markov chain model is proposed to integrate pose, motion, and raw images for action recognition, which achieves state-of-the-art performance on HMDB51, J-HMDB and NTU RGB+D datasets.
Abstract: General human action recognition requires understanding of various visual cues. In this paper, we propose a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images. For the integration, we introduce a Markov chain model which adds cues successively. The resulting approach is efficient and applicable to action classification as well as to spatial and temporal action localization. The two contributions clearly improve the performance over respective baselines. The overall approach achieves state-of-the-art action classification performance on HMDB51, J-HMDB and NTU RGB+D datasets. Moreover, it yields state-of-the-art spatio-temporal action localization results on UCF101 and J-HMDB.

Proceedings ArticleDOI
01 May 2017
TL;DR: This paper proposes a novel approach for learning a discriminative holistic image representation which exploits the image content to create a dense and salient scene description and shows that the learnt image representation outperforms both off-the-shelf features from deep networks and hand-crafted features.
Abstract: Visual place recognition under difficult perceptual conditions remains a challenging problem due to changing weather conditions, illumination and seasons. Long-term visual navigation approaches for robot localization should be robust to these dynamics of the environment. Existing methods typically leverage feature descriptions of whole images or image regions from Deep Convolutional Neural Networks. Some approaches also exploit sequential information to alleviate the problem of spatially inconsistent and non-perfect image matches. In this paper, we propose a novel approach for learning a discriminative holistic image representation which exploits the image content to create a dense and salient scene description. These salient descriptions are learnt over a variety of datasets under large perceptual changes. Such an approach enables us to precisely segment the regions of an image which are geometrically stable over large time lags. We combine features from these salient regions and an off-the-shelf holistic representation to form a more robust scene descriptor. We also introduce a semantically labeled dataset which captures extreme perceptual and structural scene dynamics over the course of 3 years. We evaluated our approach with extensive experiments on data collected over several kilometers in Freiburg and show that our learnt image representation outperforms off-the-shelf features from deep networks and hand-crafted features.

Posted Content
Anna Khoreva, Rodrigo Benenson, Eddy Ilg, Thomas Brox, Bernt Schiele
28 Mar 2017
TL;DR: In-domain per-video training data, generated as described in this paper, allows training of high-quality appearance- and motion-based models as well as tuning of the post-processing stage, without ImageNet pre-training.
Abstract: Convolutional networks reach top quality in pixel-level video object segmentation but require a large amount of training data (1k~100k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20x~1000x less annotated data than competing methods. Our approach is suitable for both single and multiple object segmentation. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize ("lucid dream") plausible future video frames. In-domain per-video training data allows us to train high quality appearance- and motion-based models, as well as tune the post-processing stage. This approach allows us to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the video object segmentation task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and general "objectness" knowledge are required for the video object segmentation task.
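The "lucid dreaming" step synthesizes training frames from the single annotated first frame. A heavily simplified NumPy illustration, using translation only; the actual pipeline also in-paints the background, deforms the object, and perturbs its appearance:

```python
import numpy as np

def lucid_dream(img, mask, n=10, max_shift=20, seed=0):
    """img: (H, W, 3) uint8 first frame; mask: (H, W) binary object mask.
    Returns n synthesized (frame, mask) training pairs in which the
    annotated object is pasted at a randomly shifted position."""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(n):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        fg = np.roll(img * mask[..., None], (dy, dx), axis=(0, 1))
        fg_mask = np.roll(mask, (dy, dx), axis=(0, 1))
        frame = img.copy()                     # background kept as-is (no in-painting)
        frame[fg_mask > 0] = fg[fg_mask > 0]   # paste the shifted object on top
        pairs.append((frame, fg_mask))
    return pairs
```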

Proceedings ArticleDOI
01 Jul 2017
TL;DR: A combinatorial optimization problem whose feasible solutions define both a decomposition and a node labeling of a given graph, which offers a common mathematical abstraction of seemingly unrelated computer vision tasks, including instance-separating semantic segmentation, articulated human body pose estimation and multiple object tracking.
Abstract: We state a combinatorial optimization problem whose feasible solutions define both a decomposition and a node labeling of a given graph. This problem offers a common mathematical abstraction of seemingly unrelated computer vision tasks, including instance-separating semantic segmentation, articulated human body pose estimation and multiple object tracking. Conceptually, it generalizes the unconstrained integer quadratic program and the minimum cost lifted multicut problem, both of which are NP-hard. In order to find feasible solutions efficiently, we define two local search algorithms that converge monotonously to a local optimum, offering a feasible solution at any time. To demonstrate the effectiveness of these algorithms in tackling computer vision tasks, we apply them to instances of the problem that we construct from published data, using published algorithms. We report state-of-the-art application-specific accuracy in the three above-mentioned applications.
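The monotone local search idea can be illustrated, in a much-reduced form, by ICM-style single-node label moves: a move is accepted only if it lowers the objective, so a feasible solution is available at any time. This toy version covers only the labeling part; the paper's algorithms additionally search over graph decompositions:

```python
def local_search(n_nodes, n_labels, unary, pair_cost, neighbors, sweeps=50):
    """unary[v][l]: cost of label l at node v; pair_cost[l1][l2]: cost for
    labels l1, l2 on adjacent nodes (assumed symmetric); neighbors[v]:
    adjacency lists. Greedy single-node moves; the objective never increases."""
    labels = [min(range(n_labels), key=lambda l: unary[v][l])
              for v in range(n_nodes)]
    for _ in range(sweeps):
        changed = False
        for v in range(n_nodes):
            def local_cost(l):
                return unary[v][l] + sum(pair_cost[labels[u]][l]
                                         for u in neighbors[v])
            best = min(range(n_labels), key=local_cost)
            if local_cost(best) < local_cost(labels[v]):
                labels[v], changed = best, True
        if not changed:
            break                              # local optimum reached
    return labels
```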

Anna Khoreva, Rodrigo Benenson, Eddy Ilg, Thomas Brox, Bernt Schiele
01 Jan 2017
TL;DR: This work proposes a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20x~1000x less annotated data than competing methods, and generates in-domain training data using the provided annotation on the first frame of each video to synthesize ("lucid dream") plausible future video frames.
Abstract: Convolutional networks reach top quality in pixel-level video object segmentation but require a large amount of training data (1k~100k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20x~1000x less annotated data than competing methods. Our approach is suitable for both single and multiple object segmentation. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize ("lucid dream") plausible future video frames. In-domain per-video training data allows us to train high quality appearance- and motion-based models, as well as tune the post-processing stage. This approach allows us to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the video object segmentation task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and general "objectness" knowledge are required for the video object segmentation task.

Posted Content
TL;DR: In-domain per-video training data, generated as described in this paper, allows training of high-quality appearance- and motion-based models as well as tuning of the post-processing stage, without ImageNet pre-training.
Abstract: Convolutional networks reach top quality in pixel-level video object segmentation but require a large amount of training data (1k~100k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20x~1000x less annotated data than competing methods. Our approach is suitable for both single and multiple object segmentation. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize ("lucid dream") plausible future video frames. In-domain per-video training data allows us to train high quality appearance- and motion-based models, as well as tune the post-processing stage. This approach allows us to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the video object segmentation task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and general "objectness" knowledge are required for the video object segmentation task.

Posted Content
TL;DR: In this paper, the authors show how existing adversarial attackers can be transferred to semantic segmentation and that it is possible to create imperceptible adversarial perturbations that lead a deep network to misclassify almost all pixels of a chosen class while leaving network prediction nearly unchanged outside this class.
Abstract: Machine learning methods in general and Deep Neural Networks in particular have shown to be vulnerable to adversarial perturbations. So far this phenomenon has mainly been studied in the context of whole-image classification. In this contribution, we analyse how adversarial perturbations can affect the task of semantic segmentation. We show how existing adversarial attackers can be transferred to this task and that it is possible to create imperceptible adversarial perturbations that lead a deep network to misclassify almost all pixels of a chosen class while leaving network prediction nearly unchanged outside this class.

Posted Content
TL;DR: In this paper, the authors propose a deep network that learns a network-implicit 3D articulation prior together with detected keypoints in the images, which yields good estimates of the 3D pose.
Abstract: Low-cost consumer depth cameras and deep learning have enabled reasonable 3D hand pose estimation from single depth images. In this paper, we present an approach that estimates 3D hand pose from regular RGB images. This task has far more ambiguities due to the missing depth information. To this end, we propose a deep network that learns a network-implicit 3D articulation prior. Together with detected keypoints in the images, this network yields good estimates of the 3D pose. We introduce a large scale 3D hand pose dataset based on synthetic hand models for training the involved networks. Experiments on a variety of test sets, including one on sign language recognition, demonstrate the feasibility of 3D hand pose estimation on single color images.

Journal ArticleDOI
TL;DR: The surface normals are explicitly optimized and used for surface extraction to improve the reconstruction at edges and corners, and memory efficiency is optimized by data aggregation, such that robust data terms can be used even on very large scenes.
Abstract: We present a variational approach for surface reconstruction from a set of oriented points with scale information. We focus particularly on scenarios with nonuniform point densities due to images taken from different distances. In contrast to previous methods, we integrate the scale information in the objective and globally optimize the signed distance function of the surface on a balanced octree grid. We use a finite element discretization on the dual structure of the octree minimizing the number of variables. The tetrahedral mesh is generated efficiently with a lookup table which allows to map octree cells to the nodes of the finite elements. We optimize memory efficiency by data aggregation, such that robust data terms can be used even on very large scenes. The surface normals are explicitly optimized and used for surface extraction to improve the reconstruction at edges and corners.
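The kind of objective involved can be written schematically as a scale-weighted data term on the signed distance function plus normal-alignment and regularization terms. The notation below is assumed for illustration and does not reproduce the paper's exact functional:

```latex
% Schematic only: u is the signed distance function; p_i are oriented
% points with normals n_i and scales s_i; w(s_i) weights observations
% by their scale.
E(u) = \sum_i w(s_i)\, u(p_i)^2
     + \mu \sum_i w(s_i)\, \lVert \nabla u(p_i) - n_i \rVert^2
     + \lambda \int_\Omega \lVert \nabla u(x) \rVert \, dx
```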

Book ChapterDOI
13 Sep 2017
TL;DR: This paper provides an end-to-end video super-resolution network that, in contrast to previous works, includes the estimation of optical flow in the overall network architecture and shows that with this network configuration, video super-resolution can benefit from optical flow, obtaining state-of-the-art results on the popular test sets.
Abstract: Learning approaches have shown great success in the task of super-resolving an image given a low resolution input. Video super-resolution aims for exploiting additionally the information from multiple images. Typically, the images are related via optical flow and consecutive image warping. In this paper, we provide an end-to-end video super-resolution network that, in contrast to previous works, includes the estimation of optical flow in the overall network architecture. We analyze the usage of optical flow for video super-resolution and find that common off-the-shelf image warping does not allow video super-resolution to benefit much from optical flow. We rather propose an operation for motion compensation that performs warping from low to high resolution directly. We show that with this network configuration, video super-resolution can benefit from optical flow and we obtain state-of-the-art results on the popular test sets. We also show that the processing of whole images rather than independent patches is responsible for a large increase in accuracy.
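The proposed motion compensation, warping from low to high resolution directly, can be approximated by sampling the low-resolution neighbor frame at high-resolution target positions displaced by upsampled flow. A hedged PyTorch sketch; this is not the paper's operator, and the names and interpolation choices are assumptions:

```python
import torch
import torch.nn.functional as F

def warp_lr_to_hr(lr_img, lr_flow, scale):
    """Sample the low-res frame directly on the high-res pixel grid,
    displaced by upsampled flow (a rough stand-in for the paper's
    joint upsampling/warping operation)."""
    b, _, h_lr, w_lr = lr_img.shape
    h, w = h_lr * scale, w_lr * scale
    # Upsample flow to the high-res grid; displacements grow by `scale`.
    hr_flow = F.interpolate(lr_flow, size=(h, w), mode="bilinear",
                            align_corners=False) * scale
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(lr_img.device).unsqueeze(0)
    coords = base + hr_flow                       # HR sample positions
    gx = 2 * (coords[:, 0] / scale) / max(w_lr - 1, 1) - 1
    gy = 2 * (coords[:, 1] / scale) / max(h_lr - 1, 1) - 1
    return F.grid_sample(lr_img, torch.stack((gx, gy), dim=-1),
                         align_corners=True)      # (B, C, H, W) output
```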

Posted Content
TL;DR: In this paper, an end-to-end video super-resolution network that includes the estimation of optical flow in the overall network architecture is proposed, which can benefit from optical flow and obtain state-of-the-art results on the popular test sets.
Abstract: Learning approaches have shown great success in the task of super-resolving an image given a low resolution input. Video super-resolution aims for exploiting additionally the information from multiple images. Typically, the images are related via optical flow and consecutive image warping. In this paper, we provide an end-to-end video super-resolution network that, in contrast to previous works, includes the estimation of optical flow in the overall network architecture. We analyze the usage of optical flow for video super-resolution and find that common off-the-shelf image warping does not allow video super-resolution to benefit much from optical flow. We rather propose an operation for motion compensation that performs warping from low to high resolution directly. We show that with this network configuration, video super-resolution can benefit from optical flow and we obtain state-of-the-art results on the popular test sets. We also show that the processing of whole images rather than independent patches is responsible for a large increase in accuracy.

Posted Content
Anna Khoreva, Rodrigo Benenson, Eddy Ilg, Thomas Brox, Bernt Schiele
28 Mar 2017
TL;DR: This work proposes a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20x~1000x less annotated data than competing methods, indicating that using a larger training set is not automatically better, and that for the video object segmentation task a smaller training set that is closer to the target domain is more effective.
Abstract: Convolutional networks reach top quality in pixel-level video object segmentation but require a large amount of training data (1k~100k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20x~1000x less annotated data than competing methods. Our approach is suitable for both single and multiple object segmentation. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize ("lucid dream") plausible future video frames. In-domain per-video training data allows us to train high quality appearance- and motion-based models, as well as tune the post-processing stage. This approach allows us to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the video object segmentation task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and general "objectness" knowledge are required for the video object segmentation task.

Book ChapterDOI
01 Jun 2017
TL;DR: In this paper, a vision-based localization approach that learns from LiDAR-based methods by using their output as training data is proposed, combining a cheap, passive sensor with an accuracy that is on-par with LiDAR-based localization.
Abstract: Compared to LiDAR-based localization methods, which provide high accuracy but rely on expensive sensors, visual localization approaches only require a camera and thus are more cost-effective; however, their accuracy and reliability are typically inferior to those of LiDAR-based methods. In this work, we propose a vision-based localization approach that learns from LiDAR-based localization methods by using their output as training data, thus combining a cheap, passive sensor with an accuracy that is on-par with LiDAR-based localization. The approach consists of two deep networks trained on visual odometry and topological localization, respectively, and a successive optimization to combine the predictions of these two networks. Furthermore, we introduce a new challenging pedestrian-based dataset for localization with a high degree of noise. Results obtained by evaluating the proposed approach on this novel dataset demonstrate localization errors up to 10 times smaller than those obtained with traditional vision-based localization methods.

Proceedings Article
15 Feb 2017
TL;DR: In this article, the authors show how existing adversarial attackers can be transferred to semantic segmentation and that it is possible to create imperceptible adversarial perturbations that lead a deep network to misclassify almost all pixels of a chosen class while leaving network prediction nearly unchanged outside this class.
Abstract: Machine learning methods in general and Deep Neural Networks in particular have shown to be vulnerable to adversarial perturbations. So far this phenomenon has mainly been studied in the context of whole-image classification. In this contribution, we analyse how adversarial perturbations can affect the task of semantic segmentation. We show how existing adversarial attackers can be transferred to this task and that it is possible to create imperceptible adversarial perturbations that lead a deep network to misclassify almost all pixels of a chosen class while leaving network prediction nearly unchanged outside this class.

Posted Content
TL;DR: In this article, the authors propose an Armijo-like line search strategy for non-smooth non-convex optimization with Bregman proximal point approximation.
Abstract: We propose a unifying algorithm for non-smooth non-convex optimization. The algorithm approximates the objective function by a convex model function and finds an approximate (Bregman) proximal point of the convex model. This approximate minimizer of the model function yields a descent direction, along which the next iterate is found. Complemented with an Armijo-like line search strategy, we obtain a flexible algorithm for which we prove (subsequential) convergence to a stationary point under weak assumptions on the growth of the model function error. Special instances of the algorithm with a Euclidean distance function are, for example, Gradient Descent, Forward--Backward Splitting, ProxDescent, without the common requirement of a "Lipschitz continuous gradient". In addition, we consider a broad class of Bregman distance functions (generated by Legendre functions) replacing the Euclidean distance. The algorithm has a wide range of applications including many linear and non-linear inverse problems in signal/image processing and machine learning.
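Schematically, each iteration minimizes a convex model of the objective plus a Bregman proximity term and then line-searches along the resulting direction. The notation is assumed for illustration:

```latex
% f_{x^k} is a convex model of the objective near x^k; D_h is the
% Bregman distance generated by a Legendre function h; eta_k comes
% from the Armijo-like line search.
z^{k} \approx \operatorname*{arg\,min}_{x} \; f_{x^k}(x)
              + \tfrac{1}{\gamma_k} D_h(x, x^k),
\qquad
x^{k+1} = x^k + \eta_k \, (z^{k} - x^k)
```

With h(x) = ||x||^2 / 2, the Bregman distance reduces to the squared Euclidean distance and the scheme specializes to Gradient Descent, Forward-Backward Splitting, or ProxDescent, matching the special cases named in the abstract.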

Journal ArticleDOI
TL;DR: A deep network architecture and training procedures are proposed that allow stylizing arbitrary-length videos in a consistent and stable way, nearly in real time; the proposed methods clearly outperform simpler baselines both qualitatively and quantitatively.
Abstract: Manually re-drawing an image in a certain artistic style takes a professional artist a long time. Doing this for a video sequence single-handedly is beyond imagination. We present two computational approaches that transfer the style from one image (for example, a painting) to a whole video sequence. In our first approach, we adapt to videos the original image style transfer technique by Gatys et al. based on energy minimization. We introduce new ways of initialization and new loss functions to generate consistent and stable stylized video sequences even in cases with large motion and strong occlusion. Our second approach formulates video stylization as a learning problem. We propose a deep network architecture and training procedures that allow us to stylize arbitrary-length videos in a consistent and stable way, and nearly in real time. We show that the proposed methods clearly outperform simpler baselines both qualitatively and quantitatively. Finally, we propose a way to adapt these approaches also to 360 degree images and videos as they emerge with recent virtual reality hardware.
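Consistency and stability across frames are typically enforced by a temporal loss that ties each stylized frame to the flow-warped previous one outside occluded regions. A minimal sketch, assuming the warping and occlusion estimation are computed elsewhere:

```python
def temporal_loss(stylized_t, stylized_prev_warped, valid_mask):
    """stylized_t, stylized_prev_warped: (B, 3, H, W) tensors;
    valid_mask: (B, 1, H, W), 1 where the flow is valid (not occluded).
    Penalizes changes between consecutive stylized frames on
    temporally traceable pixels."""
    diff = (stylized_t - stylized_prev_warped) ** 2 * valid_mask
    return diff.sum() / valid_mask.sum().clamp(min=1.0)
```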

Posted Content
TL;DR: In this paper, the authors propose a vision-based localization approach that learns from LiDAR-based methods by using their output as training data, thus combining a cheap, passive sensor with an accuracy that is on-par with LiDAR-based localization.
Abstract: Compared to LiDAR-based localization methods, which provide high accuracy but rely on expensive sensors, visual localization approaches only require a camera and thus are more cost-effective, while their accuracy and reliability are typically inferior to LiDAR-based methods. In this work, we propose a vision-based localization approach that learns from LiDAR-based localization methods by using their output as training data, thus combining a cheap, passive sensor with an accuracy that is on-par with LiDAR-based localization. The approach consists of two deep networks trained on visual odometry and topological localization, respectively, and a successive optimization to combine the predictions of these two networks. We evaluate the approach on a new challenging pedestrian-based dataset captured over the course of six months in varying weather conditions with a high degree of noise. The experiments demonstrate that the localization errors are up to 10 times smaller than with traditional vision-based localization methods.