
Showing papers presented at "German Conference on Pattern Recognition in 2019"


Book ChapterDOI
10 Sep 2019
TL;DR: This paper proposes a model that predicts depth and semantic labels simultaneously, which leads to improved results and even reduced computational costs compared to independent estimation, and empirically shows that the CNN is capable of learning more meaningful and semantically richer features.
Abstract: Autonomous vehicles and robots require a full scene understanding of the environment to interact with it. Such a perception typically incorporates pixel-wise knowledge of the depths and semantic labels for each image from a video sensor. Recent learning-based methods estimate both types of information independently using two separate CNNs. In this paper, we propose a model that is able to predict both outputs simultaneously, which leads to improved results and even reduced computational costs compared to independent estimation of depth and semantics. We also empirically prove that the CNN is capable of learning more meaningful and semantically richer features. Furthermore, our SDNet estimates the depth based on ordinal classification. On the basis of these two enhancements, our proposed method achieves state-of-the-art results in semantic segmentation and depth estimation from single monocular input images on two challenging datasets.

43 citations
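
To make the joint-prediction idea concrete, here is a minimal PyTorch sketch of a shared encoder feeding a semantic head and an ordinal depth head, with depth cast as a stack of binary "deeper than bin k" decisions. Layer sizes, class count and bin count are illustrative assumptions, not the SDNet architecture.

```python
# Minimal sketch (not the SDNet architecture): one shared encoder feeds a
# semantic-segmentation head and an ordinal depth head. Depth is predicted as
# K binary "is depth > bin k" decisions, a common way to cast ordinal regression.
import torch
import torch.nn as nn

class JointDepthSemanticsNet(nn.Module):
    def __init__(self, num_classes=19, num_depth_bins=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.semantic_head = nn.Conv2d(128, num_classes, 1)
        # 2 logits per bin -> binary decision "depth exceeds bin k"
        self.depth_head = nn.Conv2d(128, 2 * num_depth_bins, 1)
        self.num_depth_bins = num_depth_bins

    def forward(self, x):
        feats = self.encoder(x)
        semantics = self.semantic_head(feats)            # (B, C, H/4, W/4)
        ord_logits = self.depth_head(feats)              # (B, 2K, H/4, W/4)
        b, _, h, w = ord_logits.shape
        ord_logits = ord_logits.view(b, 2, self.num_depth_bins, h, w)
        # expected number of exceeded bins serves as a discretized depth estimate
        prob_exceed = ord_logits.softmax(dim=1)[:, 1]    # (B, K, H/4, W/4)
        depth_index = prob_exceed.sum(dim=1)             # (B, H/4, W/4)
        return semantics, ord_logits, depth_index

net = JointDepthSemanticsNet()
sem, ord_logits, depth = net(torch.randn(1, 3, 128, 256))
```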


Book ChapterDOI
10 Sep 2019
TL;DR: The proposed two-head model performs comparably to the C-way multi-class model trained to predict uniform distribution in outliers, while outperforming several other validated approaches.
Abstract: Recent success on realistic road driving datasets has increased interest in exploring robust performance in real-world applications. One of the major unsolved problems is to identify image content which can not be reliably recognized with a given inference engine. We therefore study approaches to recover a dense outlier map alongside the primary task with a single forward pass, by relying on shared convolutional features. We consider semantic segmentation as the primary task and perform extensive validation on WildDash val (inliers), LSUN val (outliers), and pasted objects from Pascal VOC 2007 (outliers). We achieve the best validation performance by training to discriminate inliers from pasted ImageNet-1k content, even though ImageNet-1k contains many road-driving pixels, and, at least nominally, fails to account for the full diversity of the visual world. The proposed two-head model performs comparably to the C-way multi-class model trained to predict uniform distribution in outliers, while outperforming several other validated approaches. We evaluate our best two models on the WildDash test dataset and set a new state of the art on the WildDash benchmark.

34 citations
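
A hedged sketch of the training-data construction described above: paste a crop from out-of-distribution content into an inlier image and mark the pasted pixels as outliers. The crop size, placement policy and stand-in arrays are illustrative.

```python
# Illustrative sketch of the training-data idea: paste a crop from an
# out-of-distribution image into an inlier image and mark the pasted pixels
# as outliers. Sizes and placement policy are arbitrary choices here.
import numpy as np

def paste_outlier(inlier_img, outlier_img, size=64, rng=None):
    rng = rng or np.random.default_rng()
    h, w, _ = inlier_img.shape
    oh, ow, _ = outlier_img.shape
    # random crop from the outlier source
    oy, ox = rng.integers(0, oh - size), rng.integers(0, ow - size)
    patch = outlier_img[oy:oy + size, ox:ox + size]
    # random paste location in the inlier image
    y, x = rng.integers(0, h - size), rng.integers(0, w - size)
    out = inlier_img.copy()
    out[y:y + size, x:x + size] = patch
    mask = np.zeros((h, w), dtype=np.uint8)      # 1 = outlier pixel
    mask[y:y + size, x:x + size] = 1
    return out, mask

road = np.zeros((256, 512, 3), dtype=np.uint8)          # stand-in for an inlier image
pasted_src = np.full((224, 224, 3), 127, dtype=np.uint8)  # stand-in for outlier content
img, outlier_mask = paste_outlier(road, pasted_src)
```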


Book ChapterDOI
10 Sep 2019
TL;DR: The authors explore two variations of synthetic data for this challenging problem: a dataset with purely synthetic humans and a real dataset augmented with synthetic humans. They study which approach better generalizes to real data, as well as the influence of virtual humans in the training loss.
Abstract: Neural networks need big annotated datasets for training. However, manual annotation can be too expensive or even unfeasible for certain tasks, like multi-person 2D pose estimation with severe occlusions. A remedy for this is synthetic data with perfect ground truth. Here we explore two variations of synthetic data for this challenging problem: a dataset with purely synthetic humans and a real dataset augmented with synthetic humans. We then study which approach better generalizes to real data, as well as the influence of virtual humans in the training loss. Using the augmented dataset, without considering synthetic humans in the loss, leads to the best results. We observe that not all synthetic samples are equally informative for training, while the informative samples are different for each training stage. To exploit this observation, we employ an adversarial student-teacher framework; the teacher improves the student by providing the hardest samples for its current state as a challenge. Experiments show that the student-teacher framework outperforms normal training on the purely synthetic dataset.

28 citations
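
The hardest-sample idea can be sketched as ranking candidate synthetic samples by the current student loss and returning the most difficult ones; the paper's teacher is adversarial rather than a plain ranking, so this is only an approximation of the mechanism.

```python
# Simplified sketch: pick the synthetic samples that are hardest for the
# current student (highest loss). The actual teacher in the paper is trained
# adversarially; this ranking only illustrates the "hardest samples" notion.
import torch

def select_hardest(student, images, targets, loss_fn, k=8):
    """images/targets: tensors with the batch dimension first."""
    student.eval()
    with torch.no_grad():
        losses = torch.stack([loss_fn(student(img.unsqueeze(0)), tgt.unsqueeze(0))
                              for img, tgt in zip(images, targets)])
    hard_idx = losses.topk(min(k, len(losses))).indices   # k hardest samples
    return images[hard_idx], targets[hard_idx]
```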


Book ChapterDOI
10 Sep 2019
TL;DR: A semi-supervised method for segmentation (delineation) of salt bodies in seismic images which utilizes unlabeled data for multi-round self-training, outperforms the state of the art on the TGS Salt Identification Challenge dataset, and ranks first among the 3234 competing methods.
Abstract: Seismic image analysis plays a crucial role in a wide range of industrial applications and has been receiving significant attention. One of the essential challenges of seismic imaging is detecting subsurface salt structure which is indispensable for the identification of hydrocarbon reservoirs and drill path planning. Unfortunately, the exact identification of large salt deposits is notoriously difficult and professional seismic imaging often requires expert human interpretation of salt bodies. Convolutional neural networks (CNNs) have been successfully applied in many fields, and several attempts have been made in the field of seismic imaging. But the high cost of manual annotations by geophysics experts and scarce publicly available labeled datasets hinder the performance of the existing CNN-based methods. In this work, we propose a semi-supervised method for segmentation (delineation) of salt bodies in seismic images which utilizes unlabeled data for multi-round self-training. To reduce error amplification during self-training we propose a scheme which uses an ensemble of CNNs. We show that our approach outperforms the state of the art on the TGS Salt Identification Challenge dataset and ranks first among the 3234 competing methods. The source code is available on GitHub.

27 citations
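
One self-training round could look roughly like the following sketch: average the ensemble's predictions on unlabeled images and keep only confidently pseudo-labeled pixels. The threshold and averaging scheme are assumptions, not the paper's exact procedure.

```python
# Sketch of one self-training round: average the predictions of an ensemble on
# unlabeled images and keep only confidently pseudo-labeled pixels.
import numpy as np

def pseudo_label(prob_maps, confidence=0.9):
    """prob_maps: list of (H, W) foreground-probability maps from the ensemble."""
    mean_prob = np.mean(prob_maps, axis=0)
    labels = (mean_prob > 0.5).astype(np.uint8)              # salt / no-salt decision
    keep = np.maximum(mean_prob, 1.0 - mean_prob) > confidence
    return labels, keep                                       # train only where keep is True

probs = [np.random.rand(101, 101) for _ in range(3)]          # stand-ins for CNN outputs
labels, keep_mask = pseudo_label(probs)
```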


Book ChapterDOI
10 Sep 2019
TL;DR: This work presents an effective pipeline for large-scale 3D reconstruction which extends existing methods in several ways and introduces an outlier filtering considering the MVS geometry and proposes a plane completion method based on growing superpixels allowing a generic generation of high-quality 3D models.
Abstract: Multi-View Stereo (MVS)-based 3D reconstruction is a major topic in computer vision for which a vast number of methods have been proposed over the last decades, showing impressive visual results. For a long time, benchmarks like Middlebury [32] have numerically ranked the individual methods, considering accuracy and completeness as quality attributes. While the Middlebury benchmark provides low-resolution images only, the recently published ETH3D [31] and Tanks and Temples [19] benchmarks allow for an evaluation of high-resolution and large-scale MVS from natural camera configurations. This benchmarking reveals that still only few methods can be used for the reconstruction of large-scale models. We present an effective pipeline for large-scale 3D reconstruction which extends existing methods in several ways: (i) We introduce an outlier filtering considering the MVS geometry. (ii) To avoid incomplete models from local matching methods we propose a plane completion method based on growing superpixels, allowing the generic generation of high-quality 3D models. (iii) Finally, we use deep learning for a subsequent filtering of outliers in segmented sky areas. We give experimental evidence on benchmarks that our contributions improve the quality of the 3D model and that our method is state-of-the-art in high-quality 3D reconstruction from high-resolution images or large image sets.

27 citations


Book ChapterDOI
10 Sep 2019
TL;DR: This work builds upon a real-time 2D multi-person pose estimation system and greedily solves the association problem between multiple views as problems associated with greedy matching such as occlusion can be easily resolved in 3D.
Abstract: In this work we propose an approach for estimating 3D human poses of multiple people from a set of calibrated cameras. Estimating 3D human poses from multiple views has several compelling properties: human poses are estimated within a global coordinate space and multiple cameras provide an extended field of view which helps in resolving ambiguities, occlusions and motion blur. Our approach builds upon a real-time 2D multi-person pose estimation system and greedily solves the association problem between multiple views. We utilize bipartite matching to track multiple people over multiple frames. This proves to be especially efficient, as problems associated with greedy matching, such as occlusion, can be easily resolved in 3D. Our approach achieves state-of-the-art results on popular benchmarks and may serve as a baseline for future work.

26 citations
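
The association step can be illustrated with a small bipartite-matching sketch using the Hungarian algorithm; the cost definition (Euclidean distance between 3D positions) and the gating threshold are illustrative assumptions.

```python
# Minimal sketch of the association step: build a cost matrix between existing
# 3D tracks and new detections and solve it with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_positions, detection_positions, max_dist=0.5):
    """Both arguments: (N, 3) / (M, 3) arrays of 3D positions in metres."""
    cost = np.linalg.norm(track_positions[:, None, :] -
                          detection_positions[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    # discard assignments that are geometrically implausible
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]

tracks = np.array([[0.0, 0.0, 1.7], [2.0, 1.0, 1.6]])
detections = np.array([[0.1, 0.0, 1.7], [5.0, 5.0, 1.8]])
print(associate(tracks, detections))   # -> [(0, 0)]
```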


Book ChapterDOI
10 Sep 2019
TL;DR: In this article, a classification-specific part estimation method is proposed that uses an initial prediction as well as back-propagation of feature importance via gradient computations in order to estimate relevant image regions.
Abstract: Fine-grained visual categorization is a classification task for distinguishing categories with high intra-class and small inter-class variance. While global approaches aim at using the whole image for performing the classification, part-based solutions gather additional local information in terms of attentions or parts. We propose a novel classification-specific part estimation that uses an initial prediction as well as back-propagation of feature importance via gradient computations in order to estimate relevant image regions. The subsequently detected parts are then not only selected by a-posteriori classification knowledge, but also have an intrinsic spatial extent that is determined automatically. This is in contrast to most part-based approaches and even to available ground-truth part annotations, which only provide point coordinates and no additional scale information. We show in our experiments on various widely-used fine-grained datasets the effectiveness of the mentioned part selection method in conjunction with the extracted part features.

24 citations
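
A generic Grad-CAM-style sketch of the underlying mechanism: back-propagate the score of the initial prediction to a convolutional feature map and read off a coarse relevance map from which part regions could be proposed. The backbone and hooked layer are placeholders; this is not the paper's exact part-proposal algorithm.

```python
# Generic sketch of gradient-based feature importance (Grad-CAM-style): take the
# initial prediction, back-propagate its score to a convolutional feature map and
# derive a coarse relevance map over the image.
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18().eval()            # placeholder backbone
feats, grads = {}, {}
layer = model.layer4
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 3, 224, 224)
logits = model(x)
score = logits[0, logits[0].argmax()]                    # score of the initial prediction
score.backward()

weights = grads["g"].mean(dim=(2, 3), keepdim=True)      # channel importance
relevance = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
relevance = F.interpolate(relevance, size=x.shape[-2:], mode="bilinear",
                          align_corners=False)           # (1, 1, 224, 224) relevance map
```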


Book ChapterDOI
10 Sep 2019
TL;DR: A mechanical adjustment procedure based on straight line observations above and below water is proposed that allows for accurate alignments in photogrammetric applications and is demonstrated on real data for acrylic and glass domes in the water.
Abstract: Dome ports act as spherical windows in underwater housings through which a camera can observe objects in the water. As compared to flat glass interfaces, they do not limit the field of view, and they do not cause refraction of light observed by a pinhole camera positioned exactly in the center of the dome. Mechanically adjusting a real lens to this position is a challenging task, in particular for those integrated in deep sea housings. In this contribution a mechanical adjustment procedure based on straight line observations above and below water is proposed that allows for accurate alignments. Additionally, we show a chessboard-based method employing an underwater/above-water image pair to estimate potentially remaining offsets from the dome center to allow refraction correction in photogrammetric applications. Besides providing intuition about the severity of refraction in certain settings, we demonstrate the methods on real data for acrylic and glass domes in the water.

19 citations


Book ChapterDOI
10 Sep 2019
TL;DR: This work proposes a novel approach which simultaneously aligns the source domains at the class-level in a shared feature space, and maps the target domain data in the same space through an adversarially trained ensemble of source domain classifiers.
Abstract: We address the problem of multi-source unsupervised domain adaptation (MS-UDA) for the purpose of visual recognition. As opposed to single source UDA, MS-UDA deals with multiple labeled source domains and a single unlabeled target domain. Notice that the conventional MS-UDA training is based on formalizing independent mappings between the target and the individual source domains without explicitly assessing the need for aligning the source domains among themselves. We argue that such a paradigm invariably overlooks the inherent category-level correlation among the source domains which, on the contrary, is deemed to bring meaningful complementarity in the learned shared feature space. In this regard, we propose a novel approach which simultaneously (i) aligns the source domains at the class-level in a shared feature space, and (ii) maps the target domain data in the same space through an adversarially trained ensemble of source domain classifiers. Experimental results obtained on the Office-31, ImageCLEF-DA, and Office-CalTech datasets validate that our approach achieves superior accuracy compared to state-of-the-art methods.

18 citations


Book ChapterDOI
10 Sep 2019
TL;DR: "Deep Archetypal Analysis" generates latent representations of high-dimensional datasets in terms of fractions of intuitively understandable basic entities called archetypes, an unsupervised method to represent multivariate data points as sparse convex combinations of extremal elements of the dataset.
Abstract: Deep Archetypal Analysis (DeepAA) generates latent representations of high-dimensional datasets in terms of intuitively understandable basic entities called archetypes. The proposed method extends linear Archetypal Analysis (AA), an unsupervised method to represent multivariate data points as convex combinations of extremal data points. Unlike the original formulation, Deep AA is generative and capable of handling side information. In addition, our model provides the ability for data-driven representation learning which reduces the dependence on expert knowledge. We empirically demonstrate the applicability of our approach by exploring the chemical space of small organic molecules. In doing so, we employ the archetype constraint to learn two different latent archetype representations for the same dataset, with respect to two chemical properties. This type of supervised exploration marks a distinct starting point and lets us steer de novo molecular design.

17 citations
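
The archetype constraint can be sketched as a latent layer that expresses each code as a convex combination (softmax weights) of a small set of learnable archetypes; dimensions and the number of archetypes below are illustrative.

```python
# Sketch of the archetype constraint in latent space: latent points are expressed
# as convex combinations of a small set of learnable archetypes, so every code
# lies inside the archetype simplex.
import torch
import torch.nn as nn

class ArchetypalLayer(nn.Module):
    def __init__(self, in_dim=128, latent_dim=8, num_archetypes=5):
        super().__init__()
        self.weight_head = nn.Linear(in_dim, num_archetypes)
        self.archetypes = nn.Parameter(torch.randn(num_archetypes, latent_dim))

    def forward(self, h):
        a = torch.softmax(self.weight_head(h), dim=-1)   # convex combination weights
        z = a @ self.archetypes                          # point inside the archetype simplex
        return z, a

layer = ArchetypalLayer()
z, weights = layer(torch.randn(16, 128))                 # z lies in the convex hull of archetypes
```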


Book ChapterDOI
10 Sep 2019
TL;DR: In this article, the authors compare a variety of 2D and 3D methods such as Fourier analysis with state-of-the-art deep neural networks for the classification of local fiber orientations.
Abstract: Collagen fiber orientations in bones, visible with Second Harmonic Generation (SHG) microscopy, represent the inner structure and its alteration due to influences like cancer. While analyses of these orientations are valuable for medical research, it is not feasible to analyze the needed large amounts of local orientations manually. Since these local orientations have uncertain borders, only rough regions can be segmented instead of a pixel-wise segmentation. We analyze the effect of these uncertain borders on human performance in a user study. Furthermore, we compare a variety of 2D and 3D methods such as classical approaches like Fourier analysis with state-of-the-art deep neural networks for the classification of local fiber orientations. We present a general way to use pretrained 2D weights in 3D neural networks, such as Inception-ResNet-3D, a 3D extension of Inception-ResNet-v2. In a 10-fold cross-validation our two-stage segmentation based on Inception-ResNet-3D and transferred 2D ImageNet weights achieves an accuracy comparable to human performance.
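
A common way to reuse pretrained 2D weights in a 3D network is kernel inflation: repeat the 2D kernel along the new depth axis and rescale. The sketch below follows this generic I3D-style scheme; the paper's exact transfer procedure may differ.

```python
# Sketch of transferring pretrained 2D weights into a 3D convolution by
# "inflating" them: repeat the 2D kernel along the depth axis and divide by the
# depth so activations keep roughly the same magnitude.
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, depth: int) -> nn.Conv3d:
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(depth, *conv2d.kernel_size),
                       stride=(1, *conv2d.stride),
                       padding=(depth // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        w2d = conv2d.weight                              # (out, in, kH, kW)
        w3d = w2d.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth
        conv3d.weight.copy_(w3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

c2d = nn.Conv2d(3, 16, kernel_size=3, padding=1)         # stand-in for a pretrained layer
c3d = inflate_conv2d(c2d, depth=3)
out = c3d(torch.randn(1, 3, 8, 32, 32))                  # (1, 16, 8, 32, 32)
```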

Book ChapterDOI
10 Sep 2019
TL;DR: This work develops a novel recursive generator model for brain image time series, and trains it on large-scale longitudinal data sets (ADNI/AIBL) and demonstrates the predictive value of the brain aging model in the context of conversion prognosis from mild cognitive impairment to Alzheimer's disease.
Abstract: Predicting the age progression of individual brain images from longitudinal data has been a challenging problem, while its solution is considered key to improve dementia prognosis. Often, approaches are limited to group-level predictions, lack the ability to extrapolate, can not scale to many samples, or do not operate directly on image inputs. We address these issues with the first approach to artificial aging of brain images based on Wasserstein Generative Adversarial Networks. We develop a novel recursive generator model for brain image time series, and train it on large-scale longitudinal data sets (ADNI/AIBL). In addition to thorough analysis of results on healthy and demented subjects, we demonstrate the predictive value of our brain aging model in the context of conversion prognosis from mild cognitive impairment to Alzheimer’s disease. Conversion prognosis for a baseline image is achieved in two steps. First, we estimate the future brain image with the Generative Adversarial Network. This follow-up image is passed to a CNN classifier, pre-trained to discriminate between mild cognitive impairment and Alzheimer’s disease. It estimates the Alzheimer probability for the follow-up image, which represents an effective measure for future disease risk.

Book ChapterDOI
10 Sep 2019
TL;DR: In this article, a semantic segmentation model without lateral connections within the upsampling path is proposed, which ensures that the forecasting addresses only the most abstract features on a very coarse resolution.
Abstract: Future anticipation is of vital importance in autonomous driving and other decision-making systems. We present a method to anticipate semantic segmentation of future frames in driving scenarios based on feature-to-feature forecasting. Our method is based on a semantic segmentation model without lateral connections within the upsampling path. Such a design ensures that the forecasting addresses only the most abstract features on a very coarse resolution. We further propose to express feature-to-feature forecasting with deformable convolutions. This increases the modelling power due to being able to represent different motion patterns within a single feature map. Experiments show that our models with deformable convolutions outperform their regular and dilated counterparts while minimally increasing the number of parameters. Our method achieves state-of-the-art performance on the Cityscapes validation set when forecasting nine timesteps into the future.
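
Feature-to-feature forecasting with a deformable convolution can be sketched with torchvision's DeformConv2d: an offset branch predicts per-location sampling offsets from the stacked past features, so different motion patterns can be represented inside one feature map. Channel sizes and the history length are illustrative.

```python
# Sketch of feature-to-feature forecasting with a deformable convolution:
# an offset branch predicts per-location sampling offsets from concatenated
# past feature maps, and the deformable convolution forecasts the next one.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class F2FForecast(nn.Module):
    def __init__(self, channels=128, history=4):
        super().__init__()
        in_ch = channels * history                             # concatenated past features
        self.offset_pred = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(in_ch, channels, kernel_size=3, padding=1)

    def forward(self, past_feats):
        """past_feats: (B, history*C, H, W) -> forecast feature map (B, C, H, W)."""
        offsets = self.offset_pred(past_feats)
        return self.deform(past_feats, offsets)

f2f = F2FForecast()
future = f2f(torch.randn(2, 512, 16, 32))                      # forecast of the next frame's features
```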

Book ChapterDOI
10 Sep 2019
TL;DR: The authors present 3D-BEVIS (3D bird's-eye-view instance segmentation), a deep learning framework for joint semantic and instance segmentation on 3D point clouds.
Abstract: Recent deep learning models achieve impressive results on 3D scene analysis tasks by operating directly on unstructured point clouds. A lot of progress was made in the field of object classification and semantic segmentation. However, the task of instance segmentation is currently less explored. In this work, we present 3D-BEVIS (3D bird’s-eye-view instance segmentation), a deep learning framework for joint semantic- and instance-segmentation on 3D point clouds. Following the idea of previous proposal-free instance segmentation approaches, our model learns a feature embedding and groups the obtained feature space into semantic instances. Current point-based methods process local sub-parts of a full scene independently, followed by a heuristic merging step. However, to perform instance segmentation by clustering on a full scene, globally consistent features are required. Therefore, we propose to combine local point geometry with global context information using an intermediate bird’s-eye view representation.
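
The intermediate bird's-eye-view representation can be illustrated by scattering 3D points into a top-down grid, here keeping the maximum height per cell; grid extent and resolution are arbitrary example values.

```python
# Sketch of a bird's-eye-view representation: scatter 3D points into a 2D grid
# seen from above, keeping the maximum height per cell.
import numpy as np

def bev_height_map(points, x_range=(-10, 10), y_range=(-10, 10), cell=0.1):
    """points: (N, 3) array of x, y, z coordinates."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    bev = np.full((ny, nx), -np.inf, dtype=np.float32)
    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    for i, j, z in zip(iy[valid], ix[valid], points[valid, 2]):
        bev[i, j] = max(bev[i, j], z)                # keep the highest point per cell
    return bev

cloud = np.random.uniform(-10, 10, size=(5000, 3)).astype(np.float32)
height = bev_height_map(cloud)                       # (200, 200) top-down view
```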

Book ChapterDOI
10 Sep 2019
TL;DR: In this paper, a non-causal Tracking by Deblatting (deblurring and matting) method is proposed to estimate continuous, complete and accurate object trajectories.
Abstract: Tracking by Deblatting (Deblatting = deblurring and matting) stands for solving an inverse problem of deblurring and image matting for tracking motion-blurred objects. We propose non-causal Tracking by Deblatting which estimates continuous, complete and accurate object trajectories. Energy minimization by dynamic programming is used to detect abrupt changes of motion, called bounces. High-order polynomials are fitted to segments, which are parts of the trajectory separated by bounces. The output is a continuous trajectory function which assigns location for every real-valued time stamp from zero to the number of frames. Additionally, we show that from the trajectory function precise physical calculations are possible, such as radius, gravity or sub-frame object velocity. Velocity estimation is compared to the high-speed camera measurements and radars. Results show high performance of the proposed method in terms of Trajectory-IoU, recall and velocity estimation.
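
The trajectory-fitting step can be sketched by fitting polynomials x(t), y(t) to one segment between bounces and differentiating them for sub-frame velocity. The dynamic-programming bounce detection is omitted, and the polynomial degree and toy data are illustrative.

```python
# Sketch of the per-segment trajectory fit: fit polynomials x(t), y(t) to the
# detections of one segment and differentiate them to get sub-frame velocity.
import numpy as np

t = np.arange(0, 20, dtype=float)                     # frame indices of one segment
x = 3.0 * t + 0.5 * t**2 + np.random.normal(0, 0.2, t.size)   # toy detections
y = 40.0 - 1.5 * t + np.random.normal(0, 0.2, t.size)

cx, cy = np.polyfit(t, x, 2), np.polyfit(t, y, 2)     # per-segment polynomial fits

def velocity(ts):
    """Speed in pixels per frame at (possibly non-integer) time stamps ts."""
    return np.hypot(np.polyval(np.polyder(cx), ts), np.polyval(np.polyder(cy), ts))

print(velocity(np.array([2.5, 10.0, 17.25])))         # sub-frame velocity estimates
```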

Book ChapterDOI
10 Sep 2019
TL;DR: In this article, a model-based autoencoder is proposed to extract interpretable and physically meaningful parameters for terahertz (THz) sensing applications, which requires solving an inverse problem in which a model function determined by these parameters needs to be fitted to the measured data.
Abstract: Terahertz (THz) sensing is a promising imaging technology for a wide variety of different applications. Extracting the interpretable and physically meaningful parameters for such applications, however, requires solving an inverse problem in which a model function determined by these parameters needs to be fitted to the measured data. Since the underlying optimization problem is nonconvex and very costly to solve, we propose learning the prediction of suitable parameters from the measured data directly. More precisely, we develop a model-based autoencoder in which the encoder network predicts suitable parameters and the decoder is fixed to a physically meaningful model function, such that we can train the encoding network in an unsupervised way. We illustrate numerically that the resulting network is more than 140 times faster than classical optimization techniques while making predictions with only slightly higher objective values. Using such predictions as starting points of local optimization techniques allows us to converge to better local minima about twice as fast as optimizing without the network-based initialization.
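
A minimal sketch of a model-based autoencoder, assuming a stand-in analytic decoder (a damped sinusoid) instead of the actual THz model function: the encoder predicts parameters, the fixed decoder synthesizes the measurement, and the reconstruction error trains the encoder without any parameter labels.

```python
# Sketch of a model-based autoencoder: the encoder predicts physical parameters,
# the decoder is a fixed analytic model function, so training is unsupervised.
# The damped sinusoid below is a stand-in, not the paper's THz model function.
import math
import torch
import torch.nn as nn

T = torch.linspace(0, 1, 256)                          # measurement grid

def physical_model(params, t=T):
    """Fixed decoder: amplitude, decay and frequency -> simulated measurement."""
    amp, decay, freq = params[:, 0:1], params[:, 1:2], params[:, 2:3]
    return amp * torch.exp(-decay * t) * torch.sin(2 * math.pi * freq * t)

encoder = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 3), nn.Softplus())
optim = torch.optim.Adam(encoder.parameters(), lr=1e-3)

measurements = physical_model(torch.tensor([[1.0, 3.0, 5.0]])).repeat(16, 1)
measurements = measurements + 0.01 * torch.randn_like(measurements)   # toy training batch

for _ in range(100):                                   # unsupervised reconstruction loss
    params = encoder(measurements)
    loss = ((physical_model(params) - measurements) ** 2).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()
```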

Book ChapterDOI
10 Sep 2019
TL;DR: Stability training is studied as a general-purpose method to increase the robustness of deep neural networks against input perturbations and its use as an alternative to data augmentation is explored.
Abstract: We study the recently introduced stability training as a general-purpose method to increase the robustness of deep neural networks against input perturbations. In particular, we explore its use as an alternative to data augmentation and validate its performance against a number of distortion types and transformations including adversarial examples. In our image classification experiments using ImageNet data, stability training performs on a par with or even outperforms data augmentation for specific transformations, while consistently offering improved robustness against a broader range of distortion strengths and types unseen during training, a considerably smaller hyperparameter dependence and less potentially negative side effects compared to data augmentation.
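
The stability objective can be sketched as a task loss plus a term that keeps the output on a perturbed copy of the input close to the output on the clean input; the Gaussian perturbation, KL distance and weight alpha below are illustrative choices.

```python
# Sketch of a stability-training loss: the usual task loss plus a term that
# penalises divergence between predictions on clean and perturbed inputs.
import torch
import torch.nn.functional as F

def stability_loss(model, images, labels, alpha=0.01, noise_std=0.04):
    clean_logits = model(images)
    task_loss = F.cross_entropy(clean_logits, labels)
    perturbed = images + noise_std * torch.randn_like(images)   # illustrative perturbation
    stab = F.kl_div(F.log_softmax(model(perturbed), dim=1),
                    F.softmax(clean_logits.detach(), dim=1),
                    reduction="batchmean")
    return task_loss + alpha * stab
```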

Book ChapterDOI
10 Sep 2019
TL;DR: This work uses a parametric time-frequency representation of vector autoregressive Granger causality for causal inference and shows that an anomalous event can be identified as the event where the causal intensities differ according to a distance measure from the average causal intensity.
Abstract: Causal inference in dynamical systems is a challenge for different research areas. So far it is mostly about understanding to what extent the underlying causal mechanisms can be derived from observed time series. Here we investigate whether anomalous events can also be identified based on the observed changes in causal relationships. We use a parametric time-frequency representation of vector autoregressive Granger causality for causal inference. The use of a time-frequency approach allows us to deal with the nonstationarity of the time series as well as to define the time scale on which changes occur. We present two representative examples in environmental systems: a land-atmosphere ecosystem and marine climate. We show that an anomalous event can be identified as the event where the causal intensities differ according to a distance measure from the average causal intensities. The driver of the anomalous event can then be identified based on the analysis of changes in the causal effect relationships.

Book ChapterDOI
10 Sep 2019
TL;DR: This work proposes to mitigate the problem of inferring nonlinear cause-effect dependencies in the presence of a hidden confounder by using deep learning with domain knowledge integration and suggests a time series anomaly detection approach using causal link intensity increase as an indicator of the anomaly.
Abstract: Causality analysis represents one of the most important tasks when examining dynamical systems such as ecological time series. We propose to mitigate the problem of inferring nonlinear cause-effect dependencies in the presence of a hidden confounder by using deep learning with domain knowledge integration. Moreover, we suggest a time series anomaly detection approach using causal link intensity increase as an indicator of the anomaly. Our proposed method is based on the Causal Effect Variational Autoencoder (CEVAE) which we extend and apply to anomaly detection in time series. We evaluate our method on synthetic data having properties of ecological time series and compare to the vector autoregressive Granger causality (VAR-GC) baseline.

Book ChapterDOI
10 Sep 2019
TL;DR: A new attack known as MLAttack, i.e., Multiple Layers Attack, carefully selects several layers and uses them to define a loss function for gradient-based adversarial attacks on semantic segmentation architectures, demonstrating that MLAttack performs better than existing state-of-the-art semantic segmentation attacks.
Abstract: Despite the immense success of deep neural networks, their applicability is limited because they can be fooled by adversarial examples, which are generated by adding visually imperceptible and structured perturbations to the original image. Semantic segmentation is required in several visual recognition tasks, but unlike image classification, only a few studies are available for attacking semantic segmentation networks. The existing semantic segmentation adversarial attacks employ different gradient-based loss functions which are defined using only the last layer of the network for gradient backpropagation. However, some components of semantic segmentation networks (like multiscale analysis) implicitly mitigate several adversarial attacks, due to which the existing attacks perform poorly. This provides us the motivation to introduce a new attack in this paper known as MLAttack, i.e., Multiple Layers Attack. It carefully selects several layers and uses them to define a loss function for gradient-based adversarial attacks on semantic segmentation architectures. Experiments conducted on publicly available datasets using state-of-the-art segmentation network architectures demonstrate that MLAttack performs better than existing state-of-the-art semantic segmentation attacks.
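
A generic sketch of the mechanism of using several layers in a gradient-based attack: collect activations of selected layers with forward hooks and aggregate per-layer terms with the output-layer loss before taking sign-gradient steps. The per-layer term used here (deviation from clean activations) and the step schedule are assumptions; the exact MLAttack loss is defined in the paper.

```python
# Generic multi-layer attack sketch: combine the output-layer segmentation loss
# with feature-deviation terms at several hooked intermediate layers, then take
# iterative sign-gradient steps. This shows the mechanism only, not MLAttack's loss.
import torch
import torch.nn.functional as F

def multi_layer_attack(model, layers, image, gt_mask, eps=8 / 255, steps=10):
    acts = {}
    hooks = [l.register_forward_hook(lambda m, i, o, k=idx: acts.__setitem__(k, o))
             for idx, l in enumerate(layers)]
    with torch.no_grad():
        model(image)                                   # record clean activations
        clean_acts = {k: a.detach() for k, a in acts.items()}
    adv = image.clone()
    for _ in range(steps):
        adv = adv.detach().requires_grad_(True)
        logits = model(adv)
        loss = F.cross_entropy(logits, gt_mask)        # output-layer term
        for k, a in acts.items():                      # intermediate-layer terms
            loss = loss + F.mse_loss(a, clean_acts[k])
        grad = torch.autograd.grad(loss, adv)[0]
        adv = (adv + (eps / steps) * grad.sign()).clamp(0, 1)
    for h in hooks:
        h.remove()
    return adv.detach()
```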

Book ChapterDOI
10 Sep 2019
TL;DR: The first approach for 3D point-cloud to image translation based on conditional Generative Adversarial Networks (cGAN) is presented, which opens up new ways of augmenting or texturing 3D data, aiming at the generation of fully individual images.
Abstract: We present the first approach for 3D point-cloud to image translation based on conditional Generative Adversarial Networks (cGAN). The model handles multi-modal information sources from different domains, i.e. raw point-sets and images. The generator is capable of processing three conditions, where the point-cloud is encoded as a raw point-set and a camera projection. An image background patch is used as a constraint to bias environmental texturing. A global approximation function within the generator is applied directly to the point-cloud (Point-Net). Hence, the representation learning model incorporates global 3D characteristics directly at the latent feature space. Conditions are used to bias the background and the viewpoint of the generated image. This opens up new ways of augmenting or texturing 3D data, aiming at the generation of fully individual images. We successfully evaluated our method on the KITTI and SunRGBD datasets with an outstanding object detection inception score.

Book ChapterDOI
10 Sep 2019
TL;DR: Knowledge transfer, zero-shot learning and semantic image retrieval are methods that aim at improving accuracy by utilizing semantic information, e.g., from WordNet to augment or replace missing visual data in the form of labeled training images.
Abstract: Knowledge transfer, zero-shot learning and semantic image retrieval are methods that aim at improving accuracy by utilizing semantic information, e.g., from WordNet. It is assumed that this information can augment or replace missing visual data in the form of labeled training images because semantic similarity correlates with visual similarity.

Book ChapterDOI
10 Sep 2019
TL;DR: In this article, a conditional generative adversarial network is used to generate multispectral imagery given a set of climatic, terrain and anthropogenic predictors, and the generated imagery of the landscapes shares many characteristics with the real imagery.
Abstract: Landscapes are meaningful ecological units that strongly depend on the environmental conditions. Such dependencies between landscapes and the environment have been noted since the beginning of Earth sciences and cast into conceptual models describing the interdependencies of climate, geology, vegetation and geomorphology. Here, we ask whether landscapes, as seen from space, can be statistically predicted from pertinent environmental conditions. To this end we adapted a deep learning generative model in order to establish the relationship between the environmental conditions and the view of landscapes from the Sentinel-2 satellite. We trained a conditional generative adversarial network to generate multispectral imagery given a set of climatic, terrain and anthropogenic predictors. The generated imagery of the landscapes shares many characteristics with the real imagery. Results based on landscape patch metrics, indicative of landscape composition and structure, show that the proposed generative model creates landscapes that are more similar to the targets than the baseline models, while overall reflectance and vegetation cover are predicted better. We demonstrate that for many purposes the generated landscapes behave as real ones, with immediate application for global change studies. We envision the application of machine learning as a tool to forecast the effects of climate change on the spatial features of landscapes, while we assess its limitations and breaking points.

Book ChapterDOI
10 Sep 2019
TL;DR: It is hypothesized that INN autoencoders might not have any intrinsic information loss and thereby are not bounded to a maximal number of layers after which only suboptimal results can be achieved.
Abstract: Autoencoders are able to learn useful data representations in an unsupervised manner and have been widely used in various machine learning and computer vision tasks. In this work, we present methods to train Invertible Neural Networks (INNs) as (variational) autoencoders which we call INN (variational) autoencoders. Our experiments on MNIST, CIFAR and CelebA show that for low bottleneck sizes our INN autoencoder achieves results similar to the classical autoencoder. However, for large bottleneck sizes our INN autoencoder outperforms its classical counterpart. Based on the empirical results, we hypothesize that INN autoencoders might not have any intrinsic information loss and thereby are not bounded to a maximal number of layers (depth) after which only suboptimal results can be achieved (Code available at https://github.com/Xenovortex/Training-Invertible-Neural-Networks-as-Autoencoders.git).
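
A minimal sketch of using an invertible network as an autoencoder: generic RealNVP-style affine couplings map x to z, only the first bottleneck dimensions are kept as the code, the rest are zeroed, and the inverse mapping reconstructs x. The coupling design and sizes are assumptions, not necessarily the paper's architecture.

```python
# Minimal INN-as-autoencoder sketch: invertible affine couplings, a truncated
# latent code, and exact inversion for reconstruction.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=1)
        return torch.cat([x1, x2 * torch.exp(torch.tanh(s)) + t], dim=1)

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.net(y1).chunk(2, dim=1)
        return torch.cat([y1, (y2 - t) * torch.exp(-torch.tanh(s))], dim=1)

class INNAutoencoder(nn.Module):
    def __init__(self, dim=784, bottleneck=32, blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList([AffineCoupling(dim) for _ in range(blocks)])
        self.bottleneck = bottleneck

    def encode(self, x):
        for b in self.blocks:
            x = b(x).flip(dims=[1])          # flip so both halves get transformed
        return x

    def decode(self, z):
        for b in reversed(self.blocks):
            z = b.inverse(z.flip(dims=[1]))
        return z

    def forward(self, x):
        z = self.encode(x)
        z_trunc = torch.cat([z[:, :self.bottleneck],
                             torch.zeros_like(z[:, self.bottleneck:])], dim=1)
        return self.decode(z_trunc)          # reconstruction from the truncated code

model = INNAutoencoder()
recon = model(torch.randn(8, 784))           # train with e.g. MSE(recon, x)
```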

Book ChapterDOI
10 Sep 2019
TL;DR: In this paper, the authors address the problem of learning a single model for person re-identification, attribute classification, body part segmentation, and pose estimation, which is a classical multi-task learning problem.
Abstract: We address the problem of learning a single model for person re-identification, attribute classification, body part segmentation, and pose estimation. With predictions for these tasks we gain a more holistic understanding of persons, which is valuable for many applications. This is a classical multi-task learning problem. However, no dataset exists that these tasks could be jointly learned from. Hence several datasets need to be combined during training, which in other contexts has often led to reduced performance in the past. We extensively evaluate how the different task and datasets influence each other and how different degrees of parameter sharing between the tasks affect performance. Our final model matches or outperforms its single-task counterparts without creating significant computational overhead, rendering it highly interesting for resource-constrained scenarios such as mobile robotics.

Book ChapterDOI
10 Sep 2019
TL;DR: A fully convolutional neural network that jointly predicts a semantic 3D reconstruction of a scene as well as a corresponding octree representation that is much more efficient in terms of memory consumption and inference efficiency, while achieving similar reconstruction performance.
Abstract: We present a fully convolutional neural network that jointly predicts a semantic 3D reconstruction of a scene as well as a corresponding octree representation. This approach leverages the efficiency of an octree data structure to improve the capacities of volumetric semantic 3D reconstruction methods, especially in terms of scalability. At every octree level, the network predicts a semantic class for every voxel and decides which voxels should be further split in order to refine the reconstruction, thus working in a coarse-to-fine manner. The semantic prediction part of our method builds on recent work that combines traditional variational optimization and neural networks. In contrast to previous networks that work on dense voxel grids, our network is much more efficient in terms of memory consumption and inference efficiency, while achieving similar reconstruction performance. This allows for a high resolution reconstruction in case of limited memory. We perform experiments on the SUNCG and ScanNetv2 datasets on which our network shows comparable reconstruction results to the corresponding dense network while consuming less memory.

Book ChapterDOI
Nikolai Ufer, Kam To Lui, Katja Schwarz, Paul Warkentin, Björn Ommer
10 Sep 2019
TL;DR: A weakly supervised learning approach which generates stronger features by encoding far more context than previous methods by introducing a new convolutional layer which is a learned mixture of differently strided convolutions and allows the network to encode much more context while preserving matching accuracy at the same time.
Abstract: Finding semantic correspondences is a challenging problem. With the breakthrough of CNNs stronger features are available for tasks like classification but not specifically for the requirements of semantic matching. In the following we present a weakly supervised learning approach which generates stronger features by encoding far more context than previous methods. First, we generate more suitable training data using a geometrically informed correspondence mining method which is less prone to spurious matches and requires only image category labels as supervision. Second, we introduce a new convolutional layer which is a learned mixture of differently strided convolutions and allows the network to encode much more context while preserving matching accuracy at the same time. The strong geometric encoding on the feature side enables us to learn a semantic flow network, which generates more natural deformations than parametric transformation based models and is able to predict foreground regions at the same time. Our semantic flow network outperforms current state-of-the-art on several semantic matching benchmarks and the learned features show astonishing performance regarding simple nearest neighbor matching.
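
The learned mixture of differently strided convolutions can be sketched as parallel convolutions with different strides whose outputs are upsampled to a common resolution and blended with learned softmax weights; strides and channel sizes are illustrative.

```python
# Sketch of a "mixture of strides" layer: parallel convolutions with different
# strides, upsampled back to full resolution and blended with learned weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StrideMixtureConv(nn.Module):
    def __init__(self, in_ch, out_ch, strides=(1, 2, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=s, padding=1) for s in strides])
        self.mix = nn.Parameter(torch.zeros(len(strides)))   # learned mixture logits

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = [F.interpolate(c(x), size=(h, w), mode="bilinear", align_corners=False)
                for c in self.convs]
        weights = torch.softmax(self.mix, dim=0)
        return sum(w * o for w, o in zip(weights, outs))

layer = StrideMixtureConv(64, 64)
y = layer(torch.randn(1, 64, 32, 32))        # (1, 64, 32, 32), larger effective context
```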

Book ChapterDOI
10 Sep 2019
TL;DR: A novel end-to-end framework for the visual relationship detection task is proposed, with a spatial attention model for specializing predicate features and a feature embedding model with a bi-directional RNN which considers subject, predicate and object as a time sequence.
Abstract: Visual relationship detection aims at predicting categories of predicates and object pairs, and also at locating the object pairs. Recognizing the relationships between individual objects is important for describing visual scenes in static images. In this paper, we propose a novel end-to-end framework for the visual relationship detection task. First, we design a spatial attention model for specializing predicate features. Compared to a normal ROI-pooling layer, this structure significantly improves Predicate Classification performance. Second, for extracting relative spatial configuration, we propose to map simple geometric representations to a high dimension, which boosts relationship detection accuracy. Third, we implement a feature embedding model with a bi-directional RNN which considers subject, predicate and object as a time sequence. We evaluate our method on three tasks. The experiments demonstrate that our method achieves competitive results compared to state-of-the-art methods.
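
The sequence-embedding idea can be sketched by treating the subject, predicate and object features as a length-3 sequence fed to a bidirectional LSTM; feature and hidden sizes and the predicate count are illustrative.

```python
# Sketch of the sequence-embedding idea: treat subject, predicate and object
# features as a length-3 sequence and run a bidirectional RNN over it.
import torch
import torch.nn as nn

class SPOEmbedding(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_predicates=70):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_predicates)

    def forward(self, subj_feat, pred_feat, obj_feat):
        seq = torch.stack([subj_feat, pred_feat, obj_feat], dim=1)   # (B, 3, feat_dim)
        out, _ = self.rnn(seq)
        return self.classifier(out[:, 1])      # predicate logits from the middle step

model = SPOEmbedding()
logits = model(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
```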

Book ChapterDOI
10 Sep 2019
TL;DR: A novel convolutional neural network is proposed to verify a match between two normalized images of the human iris, using a novel Unit-Circle layer which replaces the Gabor-filtering step in a common iris-verification pipeline.
Abstract: We propose a novel convolutional neural network to verify a match between two normalized images of the human iris. The network is trained end-to-end and validated on three publicly available datasets, yielding state-of-the-art results against four baseline methods. The network outperforms the state-of-the-art method on the CASIA.v4 dataset by a 10% margin. In the network, we use a novel “Unit-Circle” layer which replaces the Gabor-filtering step in a common iris-verification pipeline. We show that the layer improves the performance of the model by up to 15% on previously-unseen data.

Book ChapterDOI
10 Sep 2019
TL;DR: A deep neural network is trained for video prediction which embeds the video sequence in a low-dimensional recurrent latent space representation and optimizes the total correlation of the latent dimensions within a variational recurrent auto-encoder framework.
Abstract: Physical scene understanding is a fundamental human ability. Empowering artificial systems with such understanding is an important step towards flexible and adaptive behavior in the real world. As a step in this direction, we propose a novel approach to physical scene understanding in video. We train a deep neural network for video prediction which embeds the video sequence in a low-dimensional recurrent latent space representation. We optimize the total correlation of the latent dimensions within a variational recurrent auto-encoder framework. This encourages the representation to disentangle the latent physical factors of variation in the training data. To train and evaluate our approach, we use synthetic video sequences in three different physical scenarios with various degrees of difficulty. Our experiments demonstrate that our model can disentangle several appearance-related properties in the unsupervised case. If we add supervision signals for the latent code, our model can further improve the disentanglement of dynamics-related properties.