
Showing papers presented at "German Conference on Pattern Recognition in 2020"


Book ChapterDOI
28 Sep 2020
TL;DR: This work provides globally consistent reference poses with up-to centimeter accuracy obtained from the fusion of direct stereo visual-inertial odometry with RTK-GNSS, which enables research on visual odometry, global place recognition, and map-based re-localization tracking.
Abstract: We present a novel dataset covering seasonal and challenging perceptual conditions for autonomous driving. Among others, it enables research on visual odometry, global place recognition, and map-based re-localization tracking. The data was collected in different scenarios and under a wide variety of weather conditions and illuminations, including day and night. This resulted in more than 350 km of recordings in nine different environments ranging from multi-level parking garage over urban (including tunnels) to countryside and highway. We provide globally consistent reference poses with up-to centimeter accuracy obtained from the fusion of direct stereo visual-inertial odometry with RTK-GNSS. The full dataset is available at https://www.4seasons-dataset.com.

43 citations


Book ChapterDOI
28 Sep 2020
TL;DR: In this article, a conditional invertible neural network (cINN) is proposed to address the task of diverse image-to-image translation for natural images, which combines the purely generative INN model with an unconstrained feed-forward network.
Abstract: We introduce a new architecture called a conditional invertible neural network (cINN), and use it to address the task of diverse image-to-image translation for natural images. This is not easily possible with existing INN models due to some fundamental limitations. The cINN combines the purely generative INN model with an unconstrained feed-forward network, which efficiently preprocesses the conditioning image into maximally informative features. All parameters of a cINN are jointly optimized with a stable, maximum likelihood-based training procedure. Even though INN-based models have received far less attention in the literature than GANs, they have been shown to have some remarkable properties absent in GANs, e.g. apparent immunity to mode collapse. We find that our cINNs leverage these properties for image-to-image translation, demonstrated on day to night translation and image colorization. Furthermore, we take advantage of our bidirectional cINN architecture to explore and manipulate emergent properties of the latent space, such as changing the image style in an intuitive way.

22 citations
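
For readers who want a concrete picture of the coupling design described above, the following minimal PyTorch snippet shows one conditional affine coupling block; the class name, layer sizes, and the concatenation-based conditioning are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

# One conditional affine coupling block (illustrative sketch).
class ConditionalCoupling(nn.Module):
    def __init__(self, dim, cond_dim, hidden=128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x, c):
        # Split the input; the first half passes through unchanged.
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(torch.cat([x1, c], dim=1)).chunk(2, dim=1)
        s = torch.tanh(s)                       # bounded log-scales for stable training
        z2 = x2 * torch.exp(s) + t              # invertible affine transform of the second half
        log_det = s.sum(dim=1)                  # contribution to the log-determinant
        return torch.cat([x1, z2], dim=1), log_det

    def inverse(self, z, c):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        s, t = self.net(torch.cat([z1, c], dim=1)).chunk(2, dim=1)
        s = torch.tanh(s)
        return torch.cat([z1, (z2 - t) * torch.exp(-s)], dim=1)

Stacking such blocks and maximizing the Gaussian log-likelihood of the latent code plus the accumulated log-determinants corresponds to the stable maximum-likelihood training mentioned in the abstract; the conditioning vector c would come from the feed-forward preprocessing network.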


Book ChapterDOI
28 Sep 2020
TL;DR: In this paper, the latent space is partitioned into disjoint subspaces for modality-specific and shared factors, and the model learns to disentangle these in a purely self-supervised manner.
Abstract: Multimodal generative models learn a joint distribution over multiple modalities and thus have the potential to learn richer representations than unimodal models. However, current approaches are either inefficient in dealing with more than two modalities or fail to capture both modality-specific and shared variations. We introduce a new multimodal generative model that integrates both modality-specific and shared factors and aggregates shared information across any subset of modalities efficiently. Our method partitions the latent space into disjoint subspaces for modality-specific and shared factors and learns to disentangle these in a purely self-supervised manner. Empirically, we show improvements in representation learning and generative performance compared to previous methods and showcase the disentanglement capabilities.

14 citations
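
A rough sketch of the latent-space partitioning described above (illustrative only; the heads, layer sizes, and the simple mean aggregation are assumptions rather than the paper's exact aggregation rule):

import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    def __init__(self, in_dim, spec_dim, shared_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.spec_head = nn.Linear(256, spec_dim)      # modality-specific factors
        self.shared_head = nn.Linear(256, shared_dim)  # factors shared across modalities

    def forward(self, x):
        h = self.body(x)
        return self.spec_head(h), self.shared_head(h)

def aggregate_shared(shared_codes):
    # Combine shared codes from an arbitrary subset of observed modalities.
    return torch.stack(shared_codes, dim=0).mean(dim=0)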


Book ChapterDOI
28 Sep 2020
TL;DR: In this article, the authors propose an ensemble of class-balanced experts that combines the strength of diverse classifiers to achieve state-of-the-art performance in long-tailed recognition.
Abstract: Deep learning enables impressive performance in image recognition using large-scale artificially-balanced datasets. However, real-world datasets exhibit highly class-imbalanced distributions, yielding two main challenges: relative imbalance amongst the classes and data scarcity for medium-shot or few-shot classes. In this work, we address the problem of long-tailed recognition wherein the training set is highly imbalanced and the test set is kept balanced. Differently from existing paradigms relying on data-resampling, cost-sensitive learning, online hard example mining, loss objective reshaping, and/or memory-based modeling, we propose an ensemble of class-balanced experts that combines the strength of diverse classifiers. Our ensemble of class-balanced experts reaches results close to state-of-the-art and an extended ensemble establishes a new state-of-the-art on two benchmarks for long-tailed recognition. We conduct extensive experiments to analyse the performance of the ensembles, and discover that in modern large-scale datasets, relative imbalance is a harder problem than data scarcity. The training and evaluation code is available at https://github.com/ssfootball04/class-balanced-experts.

12 citations
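
To make the ensembling step concrete, here is a hedged sketch of how logits from experts responsible for disjoint class groups could be fused into a single prediction; the function and variable names are hypothetical, and the grouping and fusion details differ in the paper.

import torch

def fuse_expert_logits(expert_logits, expert_classes, num_classes):
    # expert_logits: list of [B, |C_k|] tensors, one per expert
    # expert_classes: list of index lists, the classes each expert is responsible for
    batch = expert_logits[0].shape[0]
    fused = torch.full((batch, num_classes), float('-inf'))
    for logits, classes in zip(expert_logits, expert_classes):
        fused[:, classes] = logits
    return fused  # argmax over the fused scores gives the final prediction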


Book ChapterDOI
01 Apr 2020
TL;DR: This work shows, for the first time, how to jointly reconstruct both the individual tracer particles and a dense 3D fluid motion field from the image data, using an integrated energy minimization.
Abstract: The standard approach to densely reconstruct the motion in a volume of fluid is to inject high-contrast tracer particles and record their motion with multiple high-speed cameras. Almost all existing work processes the acquired multi-view video in two separate steps: first, a per-frame reconstruction of the particles, usually in the form of soft occupancy likelihoods in a voxel representation; followed by 3D motion estimation, with some form of dense matching between the precomputed voxel grids from different time steps. In this sequential procedure, the first step cannot use temporal consistency considerations to support the reconstruction, while the second step has no access to the original, high-resolution image data. We show, for the first time, how to jointly reconstruct both the individual tracer particles and a dense 3D fluid motion field from the image data, using an integrated energy minimization. Our hybrid Lagrangian/Eulerian model explicitly reconstructs individual particles, and at the same time recovers a dense 3D motion field in the entire domain. Making particles explicit greatly reduces the memory consumption and allows one to use the high-resolution input images for matching, whereas the dense motion field makes it possible to include physical a-priori constraints and account for the incompressibility and viscosity of the fluid. The method exhibits greatly (\({\approx }70\%\)) improved results over a recent baseline with two separate steps for 3D reconstruction and motion estimation. Our results with only two time steps are comparable to those of state-of-the-art tracking-based methods that require much longer sequences.

9 citations


Book ChapterDOI
28 Sep 2020
TL;DR: A model that is based on a conditional generative adversarial network designed to generate 2D human poses conditioned on human-written text descriptions is proposed, indicating that it is possible to generate poses that are consistent with the given semantic features, especially for actions with distinctive poses.
Abstract: This work focuses on synthesizing human poses from human-level text descriptions. We propose a model that is based on a conditional generative adversarial network. It is designed to generate 2D human poses conditioned on human-written text descriptions. The model is trained and evaluated using the COCO dataset, which consists of images capturing complex everyday scenes with various human poses. We show through qualitative and quantitative results that the model is capable of synthesizing plausible poses matching the given text, indicating that it is possible to generate poses that are consistent with the given semantic features, especially for actions with distinctive poses.

8 citations
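
As a schematic of the conditioning described above, a minimal generator that maps noise plus a text embedding to 2D keypoints might look as follows (layer sizes, joint count, and the choice of text encoder are assumptions for illustration):

import torch
import torch.nn as nn

class PoseGenerator(nn.Module):
    def __init__(self, noise_dim=64, text_dim=300, num_joints=17):
        super().__init__()
        self.num_joints = num_joints
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_joints * 2),        # (x, y) per joint
        )

    def forward(self, z, text_emb):
        out = self.net(torch.cat([z, text_emb], dim=1))
        return out.view(-1, self.num_joints, 2)

A discriminator conditioned on the same text embedding would then judge whether a pose is plausible for the given description.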


Book ChapterDOI
28 Sep 2020
TL;DR: Center3D as discussed by the authors uses a combination of classification and regression to understand the hidden depth information more robustly than each method alone, which achieved a better speed-accuracy trade-off in real-time monocular object detection.
Abstract: Localizing objects in 3D space and understanding their associated 3D properties is challenging given only monocular RGB images. The situation is compounded by the loss of depth information during perspective projection. We present Center3D, a one-stage anchor-free approach and an extension of CenterNet, to efficiently estimate 3D location and depth using only monocular RGB images. By exploiting the difference between 2D and 3D centers, we are able to estimate depth consistently. Center3D uses a combination of classification and regression to understand the hidden depth information more robustly than each method alone. Our method employs two joint approaches: (1) LID: a classification-dominated approach with sequential Linear Increasing Discretization. (2) DepJoint: a regression-dominated approach with multiple Eigen’s transformations [6] for depth estimation. Evaluating on KITTI dataset [8] for moderate objects, Center3D improved the AP in BEV from \(29.7\%\) to \(\mathbf {43.5\%}\), and the AP in 3D from \(18.6\%\) to \(\mathbf {40.5\%}\). Compared with state-of-the-art detectors, Center3D has achieved a better speed-accuracy trade-off in realtime monocular object detection.

7 citations
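
For intuition on the classification branch, the snippet below generates depth bins whose widths increase linearly with the bin index, so near depths are discretized more finely than far ones; the exact formula and depth range used in the paper may differ.

import numpy as np

def linear_increasing_bins(d_min, d_max, num_bins):
    # Bin widths proportional to 1, 2, ..., num_bins, rescaled to cover [d_min, d_max].
    widths = np.arange(1, num_bins + 1, dtype=float)
    widths *= (d_max - d_min) / widths.sum()
    return d_min + np.concatenate([[0.0], np.cumsum(widths)])   # bin edges

edges = linear_increasing_bins(0.0, 80.0, 10)
labels = np.digitize([5.0, 30.0, 70.0], edges) - 1              # classification targets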


Book ChapterDOI
28 Sep 2020
TL;DR: In this paper, a cycle consistency loss over time is introduced to predict the past activities given the predicted future, which achieves state-of-the-art results on two datasets: the Breakfast dataset and 50Salads.
Abstract: With the success of deep learning methods in analyzing activities in videos, more attention has recently been focused towards anticipating future activities. However, most of the work on anticipation either analyzes a partially observed activity or predicts the next action class. Recently, new approaches have been proposed to extend the prediction horizon up to several minutes in the future and that anticipate a sequence of future activities including their durations. While these works decouple the semantic interpretation of the observed sequence from the anticipation task, we propose a framework for anticipating future activities directly from the features of the observed frames and train it in an end-to-end fashion. Furthermore, we introduce a cycle consistency loss over time by predicting the past activities given the predicted future. Our framework achieves state-of-the-art results on two datasets: the Breakfast dataset and 50Salads.

7 citations
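
A hedged sketch of the temporal cycle-consistency idea: besides penalizing the anticipated future labels, a second head predicts the observed past back from the predicted future, and both errors enter the loss (names and weighting are illustrative).

import torch.nn.functional as F

def anticipation_loss(future_logits, future_labels, past_logits, past_labels, lam=1.0):
    forward_loss = F.cross_entropy(future_logits, future_labels)   # predicted future vs. ground truth
    cycle_loss = F.cross_entropy(past_logits, past_labels)         # past re-predicted from the future
    return forward_loss + lam * cycle_loss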


Book ChapterDOI
28 Sep 2020
TL;DR: In this paper, it is observed that a regular convolution layer applying a filter in the same way over known and unknown areas causes visual artifacts in the inpainted image, and that existing feature re-normalization approaches either require many learnable parameters or assume a binary representation of the certainty of an output.
Abstract: A regular convolution layer applying a filter in the same way over known and unknown areas causes visual artifacts in the inpainted image. Several studies address this issue with feature re-normalization on the output of the convolution. However, these models use a significant amount of learnable parameters for feature re-normalization [41, 48], or assume a binary representation of the certainty of an output [11, 26].

6 citations
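
To illustrate the kind of feature re-normalization the abstract refers to, here is a sketch in the spirit of partial convolutions: the response over a partially known window is rescaled by the fraction of known pixels and the mask is updated for the next layer (this illustrates the prior work being discussed, not the paper's own layer).

import torch
import torch.nn.functional as F

def partial_conv2d(x, mask, weight, bias=None):
    # x: [B, C, H, W] features; mask: [B, 1, H, W] with 1 = known, 0 = missing
    pad = weight.shape[-1] // 2
    out = F.conv2d(x * mask, weight, bias=None, padding=pad)
    ones = torch.ones_like(weight[:1, :1])                     # all-ones kernel, one channel
    coverage = F.conv2d(mask, ones, padding=pad)               # number of known pixels per window
    out = out * (ones.numel() / coverage.clamp(min=1e-8))      # re-normalize by the known fraction
    if bias is not None:
        out = out + bias.view(1, -1, 1, 1)
    return out, (coverage > 0).float()                         # updated mask for the next layer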


Book ChapterDOI
28 Sep 2020
TL;DR: In this paper, a Haar wavelet based block autoregressive model leveraging split couplings is proposed to model the dependency structure of multimodal distributions, particularly over long time horizons.
Abstract: Prediction of trajectories such as that of pedestrians is crucial to the performance of autonomous agents. While previous works have leveraged conditional generative models like GANs and VAEs for learning the likely future trajectories, accurately modeling the dependency structure of these multimodal distributions, particularly over long time horizons remains challenging. Normalizing flow based generative models can model complex distributions admitting exact inference. These include variants with split coupling invertible transformations that are easier to parallelize compared to their autoregressive counterparts. To this end, we introduce a novel Haar wavelet based block autoregressive model leveraging split couplings, conditioned on coarse trajectories obtained from Haar wavelet based transformations at different levels of granularity. This yields an exact inference method that models trajectories at different spatio-temporal resolutions in a hierarchical manner. We illustrate the advantages of our approach for generating diverse and accurate trajectories on two real-world datasets – Stanford Drone and Intersection Drone.

6 citations
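
The coarse conditioning trajectories come from a Haar decomposition; one level of it is easy to write down (sketch only; the normalization and the handling of odd lengths are illustrative choices):

import numpy as np

def haar_level(traj):
    # traj: [T, 2] sequence of (x, y) positions with even T
    even, odd = traj[0::2], traj[1::2]
    coarse = (even + odd) / np.sqrt(2.0)    # half-resolution "average" trajectory
    detail = (even - odd) / np.sqrt(2.0)    # differences, modelled conditioned on the coarse level
    return coarse, detail

traj = np.cumsum(np.random.randn(16, 2), axis=0)   # toy trajectory
coarse, detail = haar_level(traj)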


Book ChapterDOI
28 Sep 2020
TL;DR: In this article, a differentiable physics engine within an action-conditional video representation network is used to learn a physical latent representation, which can be used to predict future video frames from input images and actions.
Abstract: Video representation learning has recently attracted attention in computer vision due to its applications for activity and scene forecasting or vision-based planning and control. Video prediction models often learn a latent representation of video which is encoded from input frames and decoded back into images. Even when conditioned on actions, purely deep learning based architectures typically lack a physically interpretable latent space. In this study, we use a differentiable physics engine within an action-conditional video representation network to learn a physical latent representation. We propose supervised and self-supervised learning methods to train our network and identify physical properties. The latter uses spatial transformers to decode physical states back into images. The simulation scenarios in our experiments comprise pushing, sliding and colliding objects, for which we also analyze the observability of the physical properties. In experiments we demonstrate that our network can learn to encode images and identify physical properties like mass and friction from videos and action sequences in the simulated scenarios. We evaluate the accuracy of our supervised and self-supervised methods and compare it with a system identification baseline which directly learns from state trajectories. We also demonstrate the ability of our method to predict future video frames from input images and actions.

Book ChapterDOI
28 Sep 2020
TL;DR: In this paper, a deep transfer learning methodology is proposed to perform water segmentation and water level prediction on river camera images, starting from segmentation networks pre-trained on the general-purpose ADE20k and COCO-stuff datasets.
Abstract: We investigate a deep transfer learning methodology to perform water segmentation and water level prediction on river camera images. Starting from pre-trained segmentation networks that provided state-of-the-art results on general purpose semantic image segmentation datasets ADE20k and COCO-stuff, we show that we can apply transfer learning methods for semantic water segmentation. Our transfer learning approach improves the current segmentation results of two water segmentation datasets available in the literature. We also investigate the usage of the water segmentation networks in combination with on-site ground surveys to automate the process of water level estimation on river camera images. Our methodology has the potential to impact the study and modelling of flood-related events.
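
A generic transfer-learning sketch of the kind of adaptation described (the authors start from ADE20k/COCO-stuff models; the torchvision model, its pre-training, and the two-class head below are stand-in assumptions):

import torch.nn as nn
import torchvision

model = torchvision.models.segmentation.deeplabv3_resnet50(pretrained=True)
model.classifier[4] = nn.Conv2d(256, 2, kernel_size=1)   # 2 classes: water, background

# Optionally freeze the backbone and fine-tune only the new head on the water datasets.
for p in model.backbone.parameters():
    p.requires_grad = False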

Book ChapterDOI
28 Sep 2020
TL;DR: BBFNet as mentioned in this paper predicts watershed levels and uses them to detect large instance candidates where boundaries are well defined, and predicts instance centers by means of Hough voting followed by mean-shift to reliably detect small objects.
Abstract: In this work we introduce a new Bounding-Box Free Network (BBFNet) for panoptic segmentation. Panoptic segmentation is an ideal problem for proposal-free methods as it already requires per-pixel semantic class labels. We use this observation to exploit class boundaries from off-the-shelf semantic segmentation networks and refine them to predict instance labels. Towards this goal BBFNet predicts coarse watershed levels and uses them to detect large instance candidates where boundaries are well defined. For smaller instances, whose boundaries are less reliable, BBFNet also predicts instance centers by means of Hough voting followed by mean-shift to reliably detect small objects. A novel triplet loss network helps merging fragmented instances while refining boundary pixels. Our approach is distinct from previous works in panoptic segmentation that rely on a combination of a semantic segmentation network with a computationally costly instance segmentation network based on bounding box proposals, such as Mask R-CNN, to guide the prediction of instance labels using a Mixture-of-Expert (MoE) approach. We benchmark our proposal-free method on Cityscapes and Microsoft COCO datasets and show competitive performance with other MoE based approaches while outperforming existing non-proposal based methods on the COCO dataset. We show the flexibility of our method using different semantic segmentation backbones and provide video results on challenging scenes in the wild in the supplementary material.

Book ChapterDOI
28 Sep 2020
TL;DR: In this paper, the authors propose a network architecture and training procedure for learning monocular 3D object detection without 3D bounding box labels; by representing the objects as triangular meshes and employing differentiable shape rendering, they define loss functions based on depth maps, segmentation masks, and ego- and object-motion.
Abstract: The training of deep-learning-based 3D object detectors requires large datasets with 3D bounding box labels for supervision that have to be generated by hand-labeling. We propose a network architecture and training procedure for learning monocular 3D object detection without 3D bounding box labels. By representing the objects as triangular meshes and employing differentiable shape rendering, we define loss functions based on depth maps, segmentation masks, and ego- and object-motion, which are generated by pre-trained, off-the-shelf networks. We evaluate the proposed algorithm on the real-world KITTI dataset and achieve promising performance in comparison to state-of-the-art methods requiring 3D bounding box labels for training and superior performance to conventional baseline methods.

Book ChapterDOI
28 Sep 2020
TL;DR: Camera calibration is a prerequisite for many computer vision applications; while a good calibration can turn a camera into a measurement device, it can also deteriorate a system's performance if not done correctly, and inspecting and evaluating calibration results still requires expert knowledge.
Abstract: Camera calibration is a prerequisite for many computer vision applications. While a good calibration can turn a camera into a measurement device, it can also deteriorate a system’s performance if not done correctly. In the recent past, there have been great efforts to simplify the calibration process. Yet, inspection and evaluation of calibration results typically still requires expert knowledge.

Book ChapterDOI
28 Sep 2020
TL;DR: In this article, a family of loss functions is proposed that allows deep image compression to be optimized for different observers and to interpolate between human-perceived visual quality and classification accuracy, enabling a more unified view on image compression.
Abstract: Deep neural networks have recently advanced the state-of-the-art in image compression and surpassed many traditional compression algorithms. The training of such networks involves carefully trading off entropy of the latent representation against reconstruction quality. The term quality crucially depends on the observer of the images which, in the vast majority of literature, is assumed to be human. In this paper, we aim to go beyond this notion of compression quality and look at human visual perception and image classification simultaneously. To that end, we use a family of loss functions that allows us to optimize deep image compression depending on the observer and to interpolate between human perceived visual quality and classification accuracy, enabling a more unified view on image compression. Our extensive experiments show that using perceptual loss functions to train a compression system preserves classification accuracy much better than traditional codecs such as BPG without requiring retraining of classifiers on compressed images. For example, compressing ImageNet to 0.25 bpp reduces Inception-ResNet classification accuracy by only 2%. At the same time, when using a human friendly loss function, the same compression system achieves competitive performance in terms of MS-SSIM. By combining these two objective functions, we show that there is a pronounced trade-off in compression quality between the human visual system and classification accuracy.
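
A hedged sketch of the loss-family idea, interpolating between a human-oriented distortion term and a classification-oriented term inside a standard rate-distortion objective (the MSE stand-in, alpha, and lambda are illustrative, not the paper's exact definitions):

import torch.nn.functional as F

def combined_loss(x, x_hat, rate_bits, classifier, labels, alpha=0.5, lam=0.01):
    perceptual = F.mse_loss(x_hat, x)                      # stand-in for a perceptual/MS-SSIM term
    task = F.cross_entropy(classifier(x_hat), labels)      # keep reconstructions classifiable
    distortion = alpha * perceptual + (1.0 - alpha) * task
    return distortion + lam * rate_bits.mean()             # rate-distortion trade-off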

Book ChapterDOI
28 Sep 2020
TL;DR: This paper proposes using a novel differentiable convolutional distance transform layer for segmentation networks such as U-Net to regularize the training process and addresses the problem of numerical instability for large images by presenting a cascaded procedure with locally restricted convolutional distance transforms.
Abstract: In this paper we propose using a novel differentiable convolutional distance transform layer for segmentation networks such as U-Net to regularize the training process. In contrast to related work, we do not need to learn the distance transform, but use an approximation, which can be achieved by means of the convolutional operation. Therefore, the distance transform is directly applicable without previous training and it is also differentiable to ensure the gradient flow during backpropagation. First, we present the derivation of the convolutional distance transform by Karam et al. [6]. Then we address the problem of numerical instability for large images by presenting a cascaded procedure with locally restricted convolutional distance transforms. Afterwards, we discuss the issue of non-binary segmentation outputs for the convolutional distance transform and present our solution attempt for the incorporation into deep segmentation networks. We then demonstrate the feasibility of our proposal in an ablation study on the publicly available SegTHOR data set.
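
A rough sketch of the log-sum-exp style approximation behind a convolutional distance transform: distances to the nearest foreground pixel are obtained as a soft minimum realized by a single convolution with an exponential kernel (the kernel size, lambda, and clipping below are illustrative; the paper's cascaded, locally restricted variant addresses the numerical issues of this naive version for large images).

import numpy as np
from scipy.signal import fftconvolve

def conv_distance_transform(binary, lam=0.35, radius=31):
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    kernel = np.exp(-np.sqrt(xs**2 + ys**2) / lam)           # exp(-d / lambda)
    acc = fftconvolve(binary.astype(float), kernel, mode='same')
    return -lam * np.log(np.clip(acc, 1e-30, None))          # soft minimum distance to the foreground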

Book ChapterDOI
28 Sep 2020
TL;DR: Facets of the lifted multicut polytope for trees defined by the inequalities of a canonical relaxation are characterized, establishing a connection to the combinatorial properties of alternative formulations such as sequential set partitioning.
Abstract: We study the lifted multicut problem restricted to trees, which is NP-hard in general and solvable in polynomial time for paths. In particular, we characterize facets of the lifted multicut polytope for trees defined by the inequalities of a canonical relaxation. Moreover, we present an additional class of inequalities associated with paths that are facet-defining. Taken together, our facets yield a complete totally dual integral description of the lifted multicut polytope for paths. This description establishes a connection to the combinatorial properties of alternative formulations such as sequential set partitioning.
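
For reference, a sketch of the commonly used lifted multicut ILP whose canonical relaxation supplies the inequalities studied above; the notation (graph G = (V, E), lifted edge set E' with E ⊆ E', costs c) follows the usual definition and is not taken verbatim from the paper.

\begin{align*}
  \min_{x \in \{0,1\}^{E'}} \; & \sum_{e \in E'} c_e \, x_e \\
  \text{s.t.}\quad
  & x_{vw} \le \sum_{e \in P} x_e && \text{for every } vw \in E' \text{ and every } vw\text{-path } P \subseteq E,\\
  & 1 - x_{vw} \le \sum_{e \in C} (1 - x_e) && \text{for every } vw \in E' \text{ and every } vw\text{-cut } C \subseteq E.
\end{align*}

The canonical relaxation replaces \(x \in \{0,1\}^{E'}\) by \(x \in [0,1]^{E'}\); the paper characterizes which of these inequalities define facets when G is a tree.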

Book ChapterDOI
28 Sep 2020
TL;DR: In this article, a generative model consisting of two disentangled representations for an object's shape and appearance and a latent variable for the part segmentation is used to discover semantic part segmentations without supervision.
Abstract: We address the problem of discovering part segmentations of articulated objects without supervision. In contrast to keypoints, part segmentations provide information about part localizations on the level of individual pixels. Capturing both locations and semantics, they are an attractive target for supervised learning approaches. However, large annotation costs limit the scalability of supervised algorithms to other object categories than humans. Unsupervised approaches potentially allow to use much more data at a lower cost. Most existing unsupervised approaches focus on learning abstract representations to be refined with supervision into the final representation. Our approach leverages a generative model consisting of two disentangled representations for an object’s shape and appearance and a latent variable for the part segmentation. From a single image, the trained model infers a semantic part segmentation map. In experiments, we compare our approach to previous state-of-the-art approaches and observe significant gains in segmentation accuracy and shape consistency (Code available at https://compvis.github.io/unsupervised-part-segmentation). Our work demonstrates the feasibility to discover semantic part segmentations without supervision.

Book ChapterDOI
28 Sep 2020
TL;DR: In this article, a surrogate model for neural architecture performance prediction built upon Graph Neural Networks (GNN) is proposed for structurally unknown architectures (i.e., zero shot prediction).
Abstract: In computer vision research, the process of automating architecture engineering, Neural Architecture Search (NAS), has gained substantial interest. Due to the high computational costs, most recent approaches to NAS as well as the few available benchmarks only provide limited search spaces. In this paper we propose a surrogate model for neural architecture performance prediction built upon Graph Neural Networks (GNN). We demonstrate the effectiveness of this surrogate model on neural architecture performance prediction for structurally unknown architectures (i.e. zero shot prediction) by evaluating the GNN on several experiments on the NAS-Bench-101 dataset.
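
A minimal sketch of a GNN regressor over architecture graphs (the layer types, sizes, and pooling are assumptions; NAS-Bench-101 cells are DAGs whose nodes can be given one-hot operation features):

import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class ArchPredictor(nn.Module):
    def __init__(self, num_ops, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(num_ops, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = nn.Linear(hidden, 1)          # predicted validation accuracy

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        return self.head(global_mean_pool(h, batch)).squeeze(-1)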

Book ChapterDOI
28 Sep 2020
TL;DR: In this paper, a self-supervised learning task called phase-swap was introduced to detect if bio-signals have been obtained by merging the amplitude and phase from different sources.
Abstract: Various hand-crafted feature representations of bio-signals rely primarily on the amplitude or power of the signal in specific frequency bands. The phase component is often discarded as it is more sample specific, and thus more sensitive to noise, than the amplitude. However, in general, the phase component also carries information relevant to the underlying biological processes. In fact, in this paper we show the benefits of learning the coupling of both phase and amplitude components of a bio-signal. We do so by introducing a novel self-supervised learning task, which we call phase-swap, that detects if bio-signals have been obtained by merging the amplitude and phase from different sources. We show in our evaluation that neural networks trained on this task generalize better across subjects and recording sessions than their fully supervised counterpart.
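
The pretext task is easy to state in code: a positive sample keeps its own amplitude and phase, while a negative sample mixes the amplitude of one recording with the phase of another (sketch with NumPy; the segment length and labelling convention are illustrative).

import numpy as np

def phase_swap(sig_a, sig_b):
    fa, fb = np.fft.rfft(sig_a), np.fft.rfft(sig_b)
    swapped = np.abs(fa) * np.exp(1j * np.angle(fb))   # amplitude of A, phase of B
    return np.fft.irfft(swapped, n=len(sig_a))

x_a = np.random.randn(1000)                 # toy stand-ins for bio-signal segments
x_b = np.random.randn(1000)
positive = x_a                              # label: amplitude and phase from the same source
negative = phase_swap(x_a, x_b)             # label: phase-swapped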

Book ChapterDOI
28 Sep 2020
TL;DR: In this article, the authors apply the concept of visual soft attention to efficiently learn a model for lung cancer segmentation from only a small fraction of PET/CT scans and a larger pool of CT-only scans.
Abstract: PET/CT imaging is the gold standard for the diagnosis and staging of lung cancer. However, especially in healthcare systems with limited resources, costly PET/CT images are often not readily available. Conventional machine learning models either process CT or PET/CT images but not both. Models designed for PET/CT images are hence restricted by the number of PET images, such that they are unable to additionally leverage CT-only data. In this work, we apply the concept of visual soft attention to efficiently learn a model for lung cancer segmentation from only a small fraction of PET/CT scans and a larger pool of CT-only scans. We show that our model is capable of jointly processing PET/CT as well as CT-only images, which performs on par with the respective baselines whether or not PET images are available at test time. We then demonstrate that the model learns efficiently from only a few PET/CT scans in a setting where mostly CT-only data is available, unlike conventional models.

Book ChapterDOI
28 Sep 2020
TL;DR: In this article, a proposal-free instance segmentation method is proposed that combines predictions from overlapping masks into edge weights of a signed graph that is subsequently partitioned to obtain all final instances concurrently.
Abstract: This work introduces a new proposal-free instance segmentation method that builds on single-instance segmentation masks predicted across the entire image in a sliding window style. In contrast to related approaches, our method concurrently predicts all masks, one for each pixel, and thus resolves any conflict jointly across the entire image. Specifically, predictions from overlapping masks are combined into edge weights of a signed graph that is subsequently partitioned to obtain all final instances concurrently. The result is a parameter-free method that is strongly robust to noise and prioritizes predictions with the highest consensus across overlapping masks. All masks are decoded from a low dimensional latent representation, which results in great memory savings strictly required for applications to large volumetric images. We test our method on the challenging CREMI 2016 neuron segmentation benchmark where it achieves competitive scores.

Book ChapterDOI
28 Sep 2020
TL;DR: In this paper, a multi-stage guidance framework for interactive segmentation is proposed, which incorporates user cues at different stages of the network, allowing user interactions to impact the final segmentation output in a more direct way.
Abstract: Segmenting objects of interest in an image is an essential building block of applications such as photo-editing and image analysis. Under interactive settings, one should achieve good segmentations while minimizing user input. Current deep learning-based interactive segmentation approaches use early fusion and incorporate user cues at the image input layer. Since segmentation CNNs have many layers, early fusion may weaken the influence of user interactions on the final prediction results. As such, we propose a new multi-stage guidance framework for interactive segmentation. By incorporating user cues at different stages of the network, we allow user interactions to impact the final segmentation output in a more direct way. Our proposed framework has a negligible increase in parameter count compared to early-fusion frameworks. We perform extensive experimentation on the standard interactive instance segmentation and one-click segmentation benchmarks and report state-of-the-art performance.

Book ChapterDOI
28 Sep 2020
TL;DR: In this paper, the authors propose a method to obtain smooth and consistent double-layer estimates of scenes with transparent materials by combining estimates from models with different layer hypotheses in a cost volume with subsequent minimization of a joint second order energy on two depth layers.
Abstract: 3D depth computation from stereo data has been one of the most researched topics in computer vision. While state-of-art approaches have flourished over time, reconstruction of transparent materials is still considered an open problem. Based on 3D light field data we propose a method to obtain smooth and consistent double-layer estimates of scenes with transparent materials. Our novel approach robustly combines estimates from models with different layer hypotheses in a cost volume with subsequent minimization of a joint second order \(\mathrm {TGV}\) energy on two depth layers. Additionally we showcase the results of our approach on objects from common inspection use-cases in an industrial setting and compare our work to related methods.

Book ChapterDOI
28 Sep 2020
TL;DR: In this paper, a single affine coupling layer under maximum likelihood loss is analyzed and a tight lower bound on the loss depending on the orthogonal transformation of the data before the coupling is derived, yielding a layer-wise training algorithm for deep affine flows.
Abstract: Deep Affine Normalizing Flows are efficient and powerful models for high-dimensional density estimation and sample generation. Yet little is known about how they succeed in approximating complex distributions, given the seemingly limited expressiveness of individual affine layers. In this work, we take a first step towards theoretical understanding by analyzing the behaviour of a single affine coupling layer under maximum likelihood loss. We show that such a layer estimates and normalizes conditional moments of the data distribution, and derive a tight lower bound on the loss depending on the orthogonal transformation of the data before the affine coupling. This bound can be used to identify the optimal orthogonal transform, yielding a layer-wise training algorithm for deep affine flows. Toy examples confirm our findings and stimulate further research by highlighting the remaining gap between layer-wise and end-to-end training of deep affine flows.
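
A toy experiment makes the conditional-moment interpretation tangible: a single coupling predicts a shift t(x1) and log-scale s(x1) for x2, and minimizing the negative log-likelihood under a standard normal base density drives t and exp(s) toward the conditional mean and standard deviation of x2 given x1 (the toy data and network below are illustrative).

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(2000):
    x1 = torch.randn(256, 1)
    x2 = 2.0 * x1 + 0.5 * torch.randn(256, 1)       # toy data: E[x2|x1] = 2*x1, std = 0.5
    t, log_s = net(x1).chunk(2, dim=1)
    z2 = (x2 - t) * torch.exp(-log_s)               # normalized second coordinate
    nll = 0.5 * z2.pow(2) + log_s                   # negative log-likelihood up to a constant
    opt.zero_grad()
    nll.mean().backward()
    opt.step()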

Book ChapterDOI
28 Sep 2020
TL;DR: This article showed that neural networks can easily fit their training set perfectly and generalize well to future data, defying the classic bias-variance trade-off of machine learning theory.
Abstract: Modern neural networks can easily fit their training set perfectly. Surprisingly, despite being “overfit” in this way, they tend to generalize well to future data, thereby defying the classic bias–variance trade-off of machine learning theory. Of the many possible explanations, a prevalent one is that training by stochastic gradient descent (SGD) imposes an implicit bias that leads it to learn simple functions, and these simple functions generalize well. However, the specifics of this implicit bias are not well understood.

Book ChapterDOI
28 Sep 2020
TL;DR: In this article, the authors propose to discard spatial information via shuffling locations or average pooling during both training and testing phases to investigate the impact on individual layers, and observe that spatial information can be deleted from later layers with small accuracy drops, which indicates spatial information at later layers is not necessary for good test accuracy.
Abstract: Intuitively, image classification should profit from using spatial information. Recent work, however, suggests that this might be overrated in standard CNNs. In this paper, we are pushing the envelope and aim to investigate the reliance on spatial information further. We propose to discard spatial information via shuffling locations or average pooling during both training and testing phases to investigate the impact on individual layers. Interestingly, we observe that spatial information can be deleted from later layers with small accuracy drops, which indicates spatial information at later layers is not necessary for good test accuracy. For example, the test accuracy of VGG-16 only drops by 0.03% and 2.66% with spatial information completely removed from the last 30% and 53% layers on CIFAR-100, respectively. Evaluation on several object recognition datasets with a wide range of CNN architectures shows an overall consistent pattern.
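
The two ablations are simple to implement; below is a sketch of shuffling spatial locations and of collapsing them with global average pooling, applied to intermediate feature maps (the per-layer placement used in the paper is not reproduced here).

import torch

def shuffle_spatial(feat):
    # feat: [B, C, H, W]; permute the H*W locations, identically for all channels
    b, c, h, w = feat.shape
    perm = torch.randperm(h * w)
    return feat.flatten(2)[:, :, perm].view(b, c, h, w)

def average_spatial(feat):
    # Replace every location by the spatial mean, removing spatial information entirely.
    return feat.mean(dim=(2, 3), keepdim=True).expand_as(feat)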

Book ChapterDOI
28 Sep 2020
TL;DR: Safe feature-based vehicle localization requires correct and reliable association between detected and mapped localization landmarks; incorrect associations may result in faulty position estimates and risk the integrity of vehicle localization.
Abstract: Safe feature-based vehicle localization requires correct and reliable association between detected and mapped localization landmarks. Incorrect feature associations result in faulty position estimates and risk integrity of vehicle localization. Depending on the number and kind of available localization landmarks, there is only a limited guarantee for correct data association due to various ambiguities.

Book ChapterDOI
28 Sep 2020
TL;DR: In this article, projection-based Random Forests are used to use various degrees of local context without changing the overall properties of the classifier (i.e. its capacity).
Abstract: Context - i.e. information not contained in a particular measurement but in its spatial proximity - plays a vital role in the analysis of images in general and in the semantic segmentation of Polarimetric Synthetic Aperture Radar (PolSAR) images in particular. Nevertheless, a detailed study on whether context should be incorporated implicitly (e.g. by spatial features) or explicitly (by exploiting classifiers tailored towards image analysis) and to which degree contextual information has a positive influence on the final classification result is missing in the literature. In this paper we close this gap by using projection-based Random Forests that allow to use various degrees of local context without changing the overall properties of the classifier (i.e. its capacity). Results on two PolSAR data sets - one airborne over a rural area, one space-borne over a dense urban area - show that local context indeed has substantial influence on the achieved accuracy by reducing label noise and resolving ambiguities. However, increasing access to local context beyond a certain amount has a negative effect on the obtained semantic maps.