
Showing papers presented at the "German Conference on Pattern Recognition" in 2017


Book Chapter
13 Sep 2017
TL;DR: Surprisingly, in numerical experiments on image reconstruction problems it turns out that giving up exact minimization leads to a consistent performance increase, in particular in the case of convex models.
Abstract: In this paper, we introduce variational networks (VNs) for image reconstruction. VNs are fully learned models based on the framework of incremental proximal gradient methods. They provide a natural transition between classical variational methods and state-of-the-art residual neural networks. Due to their incremental nature, VNs are very efficient, but only approximately minimize the underlying variational model. Surprisingly, in our numerical experiments on image reconstruction problems it turns out that giving up exact minimization leads to a consistent performance increase, in particular in the case of convex models.

134 citations
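
For intuition, here is a minimal sketch of the kind of proximal gradient iteration that such models unroll, assuming a quadratic data term and an l1 prior (the choice of prior, the names and the parameters are illustrative; the paper's networks learn these components):

    import numpy as np

    def soft_threshold(x, t):
        # Proximal operator of t * ||x||_1.
        return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

    def proximal_gradient(A, y, lam=0.1, tau=None, n_iters=100):
        """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by proximal gradient
        steps; a learned VN replaces the fixed prior with trainable parts."""
        if tau is None:
            tau = 1.0 / np.linalg.norm(A, 2) ** 2   # step size small enough to converge
        x = np.zeros(A.shape[1])
        for _ in range(n_iters):
            grad = A.T @ (A @ x - y)                       # gradient of the data term
            x = soft_threshold(x - tau * grad, tau * lam)  # proximal step on the prior
        return x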


Book Chapter
13 Sep 2017
TL;DR: A deep learning approach is proposed to remove motion blur from a single image captured in the wild, i.e., in an uncontrolled setting, together with a novel convolutional neural network architecture and a dataset of blurry images with ground truth.
Abstract: We propose a deep learning approach to remove motion blur from a single image captured in the wild, i.e., in an uncontrolled setting. Thus, we consider motion blur degradations that are due to both camera and object motion, as well as to occlusions and objects coming into view. In this scenario, a model-based approach would require a very large set of parameters, whose fitting is a challenge on its own. Hence, we take a data-driven approach and design both a novel convolutional neural network architecture and a dataset of blurry images with ground truth. The network directly produces the sharp image as output and is built in three pyramid stages, which allow blur to be removed gradually, from a small amount at the lowest scale to the full amount at the scale of the input image. To obtain corresponding blurry and sharp image pairs, we use videos from a high frame-rate video camera. For each small video clip we select the central frame as the sharp image and use the frame average as the corresponding blurred image. Finally, to ensure that the averaging process is a sufficient approximation to real blurry images, we estimate optical flow and select frames with displacements smaller than a pixel. We demonstrate state-of-the-art performance on datasets with both synthetic and real images.

106 citations
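
The dataset construction can be sketched in a few lines, assuming a list of consecutive frames from a high frame-rate video (the optical-flow-based frame selection described above is omitted):

    import numpy as np

    def make_blur_pair(frames):
        """frames: consecutive HxWx3 float images from a high frame-rate video.
        The central frame serves as the sharp ground truth; the temporal
        average approximates the motion-blurred observation."""
        sharp = frames[len(frames) // 2]
        blurry = np.mean(np.stack(frames, axis=0), axis=0)
        return blurry, sharp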


Book Chapter
13 Sep 2017
TL;DR: Human motion is estimated by minimizing the difference between computed flow fields and the output of a novel flow renderer; with just a single semi-automatic initialization step, monocular sequences can be reconstructed without joint annotation.
Abstract: This paper presents a method to estimate 3D human pose and body shape from monocular videos. While recent approaches infer the 3D pose from silhouettes and landmarks, we exploit properties of optical flow to temporally constrain the reconstructed motion. We estimate human motion by minimizing the difference between computed flow fields and the output of our novel flow renderer. By just using a single semi-automatic initialization step, we are able to reconstruct monocular sequences without joint annotation. Our test scenarios demonstrate that optical flow effectively regularizes the under-constrained problem of human shape and motion estimation from monocular video.

52 citations


Book Chapter
13 Sep 2017
TL;DR: This paper provides an end-to-end video super-resolution network that, in contrast to previous works, includes the estimation of optical flow in the overall network architecture, and shows that with this configuration video super-resolution can benefit from optical flow, obtaining state-of-the-art results on the popular test sets.
Abstract: Learning approaches have shown great success in the task of super-resolving an image given a low resolution input. Video super-resolution aims to additionally exploit the information from multiple images. Typically, the images are related via optical flow and consecutive image warping. In this paper, we provide an end-to-end video super-resolution network that, in contrast to previous works, includes the estimation of optical flow in the overall network architecture. We analyze the usage of optical flow for video super-resolution and find that common off-the-shelf image warping does not allow video super-resolution to benefit much from optical flow. Instead, we propose an operation for motion compensation that performs warping from low to high resolution directly. We show that with this network configuration, video super-resolution can benefit from optical flow, and we obtain state-of-the-art results on the popular test sets. We also show that the processing of whole images rather than independent patches is responsible for a large increase in accuracy.

48 citations
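
For reference, the common off-the-shelf backward warping that the paper finds insufficient looks roughly as follows (grayscale, bilinear sampling; the proposed method instead warps from low to high resolution inside the network):

    import numpy as np
    from scipy.ndimage import map_coordinates

    def backward_warp(img, flow):
        """Warp a grayscale image toward the reference frame using a flow
        field of shape (H, W, 2) holding (dx, dy) per pixel."""
        h, w = img.shape
        ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
        coords = np.stack([ys + flow[..., 1], xs + flow[..., 0]])
        return map_coordinates(img, coords, order=1, mode='nearest')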


Book Chapter
13 Sep 2017
TL;DR: In this article, a Gaussian sphere representation arising from an inverse gnomonic projection of lines detected in an image is used for vanishing point detection from uncalibrated monocular images.
Abstract: We present a novel approach for vanishing point detection from uncalibrated monocular images. In contrast to state-of-the-art methods, we make no a priori assumptions about the observed scene. Our method is based on a convolutional neural network (CNN) which does not use natural images, but a Gaussian sphere representation arising from an inverse gnomonic projection of lines detected in an image. This allows us to rely on synthetic data for training, eliminating the need for labelled images. Our method achieves competitive performance on three horizon estimation benchmark datasets. We further highlight some additional use cases for our vanishing point detection algorithm.

35 citations
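
The underlying geometry can be sketched as follows: each detected line maps to a great circle on the unit (Gaussian) sphere, and vanishing points are directions where many circles intersect. The focal length f below is an assumption, since the input images are uncalibrated:

    import numpy as np

    def great_circle_normal(p1, p2, f=1.0):
        """Map an image line through centered pixel points p1, p2 to the unit
        normal of its great circle on the Gaussian sphere."""
        r1 = np.array([p1[0], p1[1], f])    # viewing ray through p1
        r2 = np.array([p2[0], p2[1], f])    # viewing ray through p2
        n = np.cross(r1, r2)                # normal of the interpretation plane
        return n / np.linalg.norm(n)

    # A vanishing point candidate is the common direction of two lines:
    # v ~ np.cross(n_i, n_j); many great circles meeting at v indicate a VP.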


Book Chapter
13 Sep 2017
TL;DR: Four local search algorithms for correlation clustering converge monotonically to a fixpoint, offering a feasible solution at any time; this encourages a broader application of correlation clustering, especially in settings where the number of clusters is not known and needs to be estimated from the data.
Abstract: This paper empirically compares four local search algorithms for correlation clustering by applying them to a variety of instances of the correlation clustering problem for the tasks of image segmentation, hand-written digit classification and social network analysis. Although the local search algorithms establish neither lower bounds nor approximation certificates, they converge monotonically to a fixpoint, offering a feasible solution at any time. For some algorithms, the time of convergence is affordable for all instances we consider. This finding encourages a broader application of correlation clustering, especially in settings where the number of clusters is not known and needs to be estimated from data.

25 citations
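
One of the simplest members of this algorithm family, greedy vertex moving on a dense weight matrix, can be sketched as follows (the paper's implementations and instances are more elaborate):

    import numpy as np

    def greedy_vertex_moving(W, labels, max_sweeps=100):
        """Correlation clustering local search: repeatedly move single nodes to
        the cluster (possibly a new, empty one) that maximally increases the
        total within-cluster weight. W[i, j] > 0 attracts, < 0 repels. Every
        accepted move improves the objective, so a fixpoint is reached."""
        n = len(labels)
        for _ in range(max_sweeps):
            changed = False
            for i in range(n):
                def affinity(l):
                    return sum(W[i, j] for j in range(n)
                               if j != i and labels[j] == l)
                current = affinity(labels[i])
                candidates = set(labels) | {max(labels) + 1}  # include a fresh cluster
                best = max(candidates, key=affinity)
                if affinity(best) > current:
                    labels[i] = best
                    changed = True
            if not changed:       # fixpoint: no single move improves the objective
                break
        return labels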


Book Chapter
13 Sep 2017
TL;DR: Based on the principle of posterior agreement, a general framework for model selection to rank kernels for Gaussian process regression is developed and compared with maximum evidence and leave-one-out cross-validation.
Abstract: Gaussian processes are powerful tools since they can model non-linear dependencies between inputs, while remaining analytically tractable. A Gaussian process is characterized by a mean function and a covariance function (kernel), which are determined by a model selection criterion. The functions to be compared do not just differ in their parametrization but in their fundamental structure. It is often not clear which function structure to choose, for instance to decide between a squared exponential and a rational quadratic kernel. Based on the principle of posterior agreement, we develop a general framework for model selection to rank kernels for Gaussian process regression and compare it with maximum evidence (also called marginal likelihood) and leave-one-out cross-validation. Given the disagreement between current state-of-the-art methods in our experiments, we show the difficulty of model selection and the need for an information-theoretic approach.

23 citations
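
A minimal sketch of the maximum evidence baseline that the paper compares against: candidate kernels are ranked by the log marginal likelihood of the data (the posterior agreement criterion itself is more involved):

    import numpy as np

    def log_evidence(K, y, noise_var=1e-2):
        """log p(y | X) for a zero-mean GP with kernel matrix K and i.i.d.
        Gaussian observation noise; higher values rank a kernel better."""
        n = len(y)
        L = np.linalg.cholesky(K + noise_var * np.eye(n))
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return (-0.5 * y @ alpha
                - np.log(np.diag(L)).sum()
                - 0.5 * n * np.log(2.0 * np.pi))

    def rbf_kernel(X, lengthscale=1.0):
        # Squared exponential kernel, one of the structures being compared.
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)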


Book Chapter
13 Sep 2017
TL;DR: This work proposes to leverage the powerful discriminative nature of CNNs for novelty detection by investigating class-specific activation patterns, assuming that a semantic category can be described by its extreme value signature, which specifies the dimensions of deep neural activations with the largest values.
Abstract: Achieving or even surpassing human-level accuracy became recently possible in a variety of application scenarios due to the rise of convolutional neural networks (CNNs) trained from large datasets. However, solving supervised visual recognition tasks by discriminating among known categories is only one side of the coin. In contrast to this, novelty detection is still an unsolved task where instances of yet unknown categories need to be identified. Therefore, we propose to leverage the powerful discriminative nature of CNNs for novelty detection tasks by investigating class-specific activation patterns. More precisely, we assume that a semantic category can be described by its extreme value signature, which specifies which dimensions of the deep neural activations have the largest values. Following this intuition, we show that already a small number of high-valued dimensions suffices to separate known from unknown categories. Our approach is simple, intuitive, and can easily be put on top of CNNs trained for vanilla classification tasks. We empirically validate the benefits of our approach in terms of accuracy and speed by comparing it against established methods in a variety of novelty detection tasks derived from ImageNet. Finally, we show that visualizing extreme value signatures allows inspecting the class-specific patterns learned during training, which may ultimately help to better understand CNN models.

23 citations
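
A simplified sketch of the signature idea (the paper's exact relevance measure and matching differ):

    import numpy as np

    def extreme_value_signature(class_activations, k=10):
        """Dimensions of the deep feature space with the largest mean
        activation for one class: its extreme value signature."""
        return set(np.argsort(class_activations.mean(axis=0))[-k:])

    def novelty_score(activations, signatures, k=10):
        """Compare a sample's top-k dimensions against all known class
        signatures; low maximal overlap suggests a novel category."""
        top = set(np.argsort(activations)[-k:])
        return 1.0 - max(len(top & sig) / k for sig in signatures.values())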


Book Chapter
13 Sep 2017
TL;DR: The idea of a primal-dual network is to combine the structure of regular energy optimization techniques, in particular of first order methods, with the flexibility of Deep Learning to adapt to the statistics of the input data.
Abstract: In the past, classic energy optimization techniques were the driving force behind many innovations and are a building block for almost any problem in computer vision. Efficient algorithms are mandatory to achieve the real-time processing needed in many applications, such as autonomous driving. However, energy models, even if designed by human experts, might never be able to fully capture the complexity of natural scenes and images. Similar to optimization techniques, Deep Learning has changed the landscape of computer vision in recent years and has helped to push the performance of many models to unprecedented heights. Our idea of a primal-dual network is to combine the structure of regular energy optimization techniques, in particular of first order methods, with the flexibility of Deep Learning to adapt to the statistics of the input data.

18 citations
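
The first order methods referenced here are typically primal-dual schemes in the style of Chambolle and Pock; a compact example for TV denoising (the ROF model) shows the kind of iteration such a network unrolls and parameterizes:

    import numpy as np

    def grad(u):
        gx = np.roll(u, -1, axis=1) - u; gx[:, -1] = 0.0   # forward differences
        gy = np.roll(u, -1, axis=0) - u; gy[-1, :] = 0.0
        return gx, gy

    def div(px, py):
        dx = px - np.roll(px, 1, axis=1); dx[:, 0] = px[:, 0]  # negative adjoint of grad
        dy = py - np.roll(py, 1, axis=0); dy[0, :] = py[0, :]
        return dx + dy

    def tv_denoise(f, lam=0.1, tau=0.25, sigma=0.25, n_iters=200):
        """Primal-dual iterations for min_x 0.5*||x - f||^2 + lam * TV(x)."""
        x, x_bar = f.copy(), f.copy()
        px, py = np.zeros_like(f), np.zeros_like(f)
        for _ in range(n_iters):
            gx, gy = grad(x_bar)
            px, py = px + sigma * gx, py + sigma * gy
            scale = np.maximum(1.0, np.hypot(px, py) / lam)  # project onto the dual ball
            px, py = px / scale, py / scale
            x_old = x
            x = (x + tau * div(px, py) + tau * f) / (1.0 + tau)  # prox of the data term
            x_bar = 2.0 * x - x_old                              # extrapolation step
        return x

A primal-dual network keeps this update structure but replaces the fixed operators and step sizes with learned ones.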


Book Chapter
13 Sep 2017
TL;DR: In this article, the authors proposed an efficient and robust approach for reducing the size of deep neural networks by pruning entire neurons, which exploits maxout units for combining neurons into more complex convex functions and makes use of a local relevance measurement that ranks neurons according to their activation on the training set.
Abstract: This paper presents an efficient and robust approach for reducing the size of deep neural networks by pruning entire neurons. It exploits maxout units for combining neurons into more complex convex functions, and it makes use of a local relevance measurement that ranks neurons according to their activation on the training set in order to prune them. Additionally, a parameter reduction comparison between neuron and weight pruning is shown. We show empirically that the proposed neuron pruning reduces the number of parameters dramatically. The evaluation is performed on two tasks, MNIST handwritten digit recognition and LFW face verification, using a LeNet-5 and a VGG16 network architecture. The network size is reduced by up to 74% and 61%, respectively, without affecting the network's performance. The main advantage of neuron pruning is its direct influence on the size of the network architecture. Furthermore, we show that neuron pruning can be combined with subsequent weight pruning, reducing the size of the LeNet-5 and VGG16 by up to 92% and 80%, respectively.

10 citations
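
A toy version of activation-based pruning for a single fully connected layer; the paper's relevance measure operates on maxout units, so this only conveys the ranking idea:

    import numpy as np

    def prune_layer(W_in, b, W_out, activations, keep_ratio=0.5):
        """Rank the layer's neurons by mean absolute activation over the
        training set and keep only the strongest ones; both the incoming
        and the outgoing weight matrices shrink accordingly."""
        relevance = np.abs(activations).mean(axis=0)    # one score per neuron
        keep = np.argsort(relevance)[-int(len(relevance) * keep_ratio):]
        return W_in[:, keep], b[keep], W_out[keep, :]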


Book Chapter
13 Sep 2017
TL;DR: This work presents a reconstruction pipeline for a large-scale 3D environment viewed by a single moving RGB-D camera, focusing on algorithms which are easily parallelizable on GPUs, allowing the pipeline to be used in real-time scenarios where the user can interactively view the reconstruction and adapt camera motion as required.
Abstract: We present a reconstruction pipeline for a large-scale 3D environment viewed by a single moving RGB-D camera. Our approach combines advantages of fast and direct, regularization-free depth fusion and accurate, but costly variational schemes. The scene’s depth geometry is extracted from each camera view and efficiently integrated into a large, dense grid as a truncated signed distance function, which is organized in an octree. To account for noisy real-world input data, variational range image integration is performed in local regions of the volume directly on this octree structure. We focus on algorithms which are easily parallelizable on GPUs, allowing the pipeline to be used in real-time scenarios where the user can interactively view the reconstruction and adapt camera motion as required.
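
The regularization-free fusion step is the standard running weighted TSDF average; a flat-grid sketch (the paper organizes the volume in an octree and adds variational refinement on top):

    import numpy as np

    def integrate_view(tsdf, weights, sdf_obs, trunc=0.05, max_weight=100.0):
        """Fuse one view's signed distances (sdf_obs, same shape as the
        volume) into the global truncated signed distance function."""
        d = np.clip(sdf_obs / trunc, -1.0, 1.0)   # truncate to [-1, 1]
        valid = sdf_obs > -trunc                   # skip voxels far behind the surface
        new_w = valid.astype(float)
        tsdf = np.where(valid,
                        (tsdf * weights + d * new_w)
                        / np.maximum(weights + new_w, 1e-9),
                        tsdf)
        weights = np.minimum(weights + new_w, max_weight)
        return tsdf, weights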

Book Chapter
13 Sep 2017
TL;DR: An adaptive regularization scheme is proposed in a variational framework where a convex composite energy functional is optimized; the relative weight between data fidelity and regularization is determined from the residual, which measures how well the observation fits the model.
Abstract: We propose an adaptive regularization scheme in a variational framework where a convex composite energy functional is optimized. We consider a number of imaging problems, including segmentation and motion estimation, whose solutions are obtained as minimizers of energy functionals that mainly consist of a data fidelity term, a regularization term, and a control parameter for their trade-off. We present an algorithm to determine the relative weight between data fidelity and regularization based on the residual, which measures how well the observation fits the model. Our adaptive regularization scheme is designed to locally control the regularization at each pixel, based on the assumption that the diversity of the residual of a given imaging model varies spatially. The energy optimization is presented in the alternating direction method of multipliers (ADMM) framework, where the adaptive regularization is iteratively applied, along with a mathematical analysis of the proposed algorithm. We demonstrate the robustness and effectiveness of our adaptive regularization through experiments in which the qualitative and quantitative results of each imaging task are superior to those obtained with a constant regularization scheme. These desired properties, robustness and effectiveness, of regularization parameter selection in a variational framework for imaging problems are achieved by merely replacing the static regularization parameter with our adaptive one.

Book Chapter
13 Sep 2017
TL;DR: This work addresses the problem of estimating heart rate from face videos under real conditions using a model based on the recursive inference problem that leverages the local invariance of the heart rate using the canonical state space representation of an Itō process and a Wiener velocity model.
Abstract: This work addresses the problem of estimating heart rate from face videos under real conditions, using a model based on a recursive inference problem that leverages the local invariance of the heart rate. The proposed solution is based on the canonical state space representation of an Itō process and a Wiener velocity model. Empirical results show excellent real-time and estimation performance of heart rate in the presence of disturbing factors such as rigid head motion, talking and facial expressions under natural illumination, making heart rate estimation from face videos applicable in a much broader sense. To facilitate comparisons and to support research, we have made the code and data for reproducing the results publicly available.
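
The recursive inference with a Wiener velocity model boils down to a Kalman filter whose state holds the heart rate and its rate of change; a minimal sketch with assumed noise parameters:

    import numpy as np

    def track_heart_rate(measurements, dt=1.0, q=0.1, r=4.0):
        """Kalman filter with a Wiener (constant) velocity model; the state
        is [rate, rate_change]. q and r are assumed process/measurement
        noise levels, not values from the paper."""
        F = np.array([[1.0, dt], [0.0, 1.0]])
        H = np.array([[1.0, 0.0]])
        Q = q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
        x, P = np.array([measurements[0], 0.0]), np.eye(2) * 10.0
        estimates = []
        for z in measurements:
            x, P = F @ x, F @ P @ F.T + Q      # predict
            S = H @ P @ H.T + r                 # innovation covariance
            K = P @ H.T / S                     # Kalman gain
            x = x + (K * (z - H @ x)).ravel()   # update with the new reading
            P = (np.eye(2) - K @ H) @ P
            estimates.append(x[0])
        return estimates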

Book Chapter
13 Sep 2017
TL;DR: In this article, an approach is presented for learning dilation parameters adaptively per channel, consistently improving semantic segmentation results on street-scene datasets like Cityscapes and CamVid.
Abstract: Contextual information is crucial for semantic segmentation. However, finding the optimal trade-off between keeping desired fine details and providing sufficiently large receptive fields is non-trivial. This is even more so when objects or classes present in an image vary significantly in size. Dilated convolutions have proven valuable for semantic segmentation because they allow the size of the receptive field to be increased without sacrificing image resolution. However, in current state-of-the-art methods, dilation parameters are hand-tuned and fixed. In this paper, we present an approach for learning dilation parameters adaptively per channel, consistently improving semantic segmentation results on street-scene datasets like Cityscapes and CamVid.

Book Chapter
13 Sep 2017
TL;DR: A reconstruction algorithm for the modulo camera of Zhao et al. is proposed that is robust to image noise and produces significantly fewer artifacts than the original reconstruction approach, which assumes noise-free measurements and quickly breaks down otherwise.
Abstract: Photographing scenes with high dynamic range (HDR) poses great challenges to consumer cameras with their limited sensor bit depth. To address this, Zhao et al. recently proposed a novel sensor concept – the modulo camera – which captures the least significant bits of the recorded scene instead of going into saturation. Similar to conventional pipelines, HDR images can be reconstructed from multiple exposures, but significantly fewer images are needed than with a typical saturating sensor. While the concept is appealing, we show that the original reconstruction approach assumes noise-free measurements and quickly breaks down otherwise. To address this, we propose a novel reconstruction algorithm that is robust to image noise and produces significantly fewer artifacts. We theoretically analyze correctness as well as limitations, and show that our approach significantly outperforms the baseline on real data.
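
To make the failure mode concrete: the noise-free baseline essentially unwraps each modulo reading against a coarse irradiance estimate (e.g., a scaled shorter exposure). Noise can flip the rollover count by one, causing a full-range error, which is what the proposed algorithm is designed to withstand. A sketch of that baseline:

    import numpy as np

    def unwrap_modulo(mod_img, coarse_estimate, max_val=256.0):
        """Noise-free modulo unwrapping: choose the rollover count k per
        pixel so that mod_img + k * max_val best matches a coarse
        irradiance estimate. Illustrative only; breaks down under noise."""
        k = np.round((coarse_estimate - mod_img) / max_val)
        return mod_img + np.maximum(k, 0.0) * max_val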

Book Chapter
13 Sep 2017
TL;DR: In a large experimental study, all combinations of five image normalization methods and five image representations are investigated with respect to classification performance in two application scenarios in kidney histopathology.
Abstract: The advancing pervasion of digital pathology in research and clinical practice results in a strong need for image analysis techniques in the field of histopathology. For diverse reasons, histopathological imaging generally exhibits a high degree of variability. As automated segmentation approaches are known to be vulnerable, especially to unseen variability, we investigate several stain normalization methods to compensate for variations between different whole slide images. In a large experimental study, we investigate all combinations of five image normalization methods (not only stain normalization) and five image representations with respect to classification performance in two application scenarios in kidney histopathology. Finally, we also pose the question whether color normalization is sufficient to compensate for the changed properties between whole slide images in an application scenario with little training data.
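
As an example of the simplest kind of method in such a comparison, per-channel mean/std matching against a reference slide (in the spirit of Reinhard color transfer; whether this exact variant is among the paper's five methods is not stated here):

    import numpy as np

    def mean_std_normalize(src, ref):
        """Match per-channel mean and standard deviation of a source slide
        image to a reference, e.g. in LAB space (arrays of shape HxWx3)."""
        mu_s, sd_s = src.mean((0, 1)), src.std((0, 1))
        mu_r, sd_r = ref.mean((0, 1)), ref.std((0, 1))
        return (src - mu_s) / (sd_s + 1e-9) * sd_r + mu_r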

Book Chapter
13 Sep 2017
TL;DR: A highly accurate and precise multi-view, multi-projector, and multi-pattern phase scanning method for shape acquisition that is able to handle occlusions and optically challenging materials is introduced.
Abstract: We introduce a highly accurate and precise multi-view, multi-projector, and multi-pattern phase scanning method for shape acquisition that is able to handle occlusions and optically challenging materials. The 3D reconstruction is formulated as a two-step process which first estimates reliable measurement samples and then simultaneously optimizes over all cameras, projectors, and patterns. This holistic approach results in significant quality improvements. Furthermore, the acquisition time is drastically reduced by relying on just six high-frequency sinusoidal captures without the need of phase unwrapping, which is implicitly provided by the multi-view geometry.
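
The per-pixel phase recovery from N equally shifted sinusoidal patterns is the classical N-step formula; the paper's contribution lies in fusing such measurements across cameras and projectors without explicit phase unwrapping:

    import numpy as np

    def wrapped_phase(images):
        """Standard N-step phase shifting: images[n] observes
        A + B * cos(phi + 2*pi*n/N); returns the wrapped phase per pixel."""
        N = len(images)
        deltas = 2.0 * np.pi * np.arange(N) / N
        num = sum(I * np.sin(d) for I, d in zip(images, deltas))
        den = sum(I * np.cos(d) for I, d in zip(images, deltas))
        return np.arctan2(-num, den)   # wrapped phase in (-pi, pi]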

Book Chapter
13 Sep 2017
TL;DR: It is shown that learning a regularizer for the SR problem improves the reconstruction results compared to an iterative reconstruction algorithm using TV or TGV regularization.
Abstract: In this paper, we present a novel method for multi-frame super-resolution (SR). Our main goal is to improve the spatial resolution of a multi-line scan camera for an industrial inspection task. High resolution output images are reconstructed using our proposed SR algorithm for multi-channel data, which is based on the trainable reaction-diffusion model. As this is a supervised learning approach, we simulate ground truth data for a real imaging scenario. We show that learning a regularizer for the SR problem improves the reconstruction results compared to an iterative reconstruction algorithm using TV or TGV regularization. We test the learned regularizer, trained on simulated data, on images acquired with the real camera setup and achieve excellent results.

Book Chapter
13 Sep 2017
TL;DR: The relative Intersection over Union (rIoU) accuracy measure is introduced, which normalizes the IoU with the optimal box for the segmentation to generate an accuracy measure that ranges between 0 and 1 and allows a more precise measurement of accuracies.
Abstract: The accuracy of object detectors and trackers is most commonly evaluated by the Intersection over Union (IoU) criterion. To date, most approaches are restricted to axis-aligned or oriented boxes and, as a consequence, many datasets are only labeled with boxes. Nevertheless, axis-aligned or oriented boxes cannot accurately capture an object's shape. To address this, a number of densely segmented datasets have started to emerge in both the object detection and the object tracking communities. However, evaluating the accuracy of object detectors and trackers that are restricted to boxes on densely segmented data is not straightforward. To close this gap, we introduce the relative Intersection over Union (rIoU) accuracy measure. The measure normalizes the IoU with the optimal box for the segmentation to generate an accuracy measure that ranges between 0 and 1 and allows a more precise measurement of accuracies. Furthermore, it enables an efficient and easy way to understand scenes and the strengths and weaknesses of an object detection or tracking approach. We show how the new measure can be efficiently calculated and present an easy-to-use evaluation framework. The framework is tested on the DAVIS and the VOT2016 segmentations and has been made available to the community.
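
The measure itself is a one-line normalization once the optimal box is known; a sketch with integer pixel coordinates (finding the optimal box requires a search, which the paper shows how to do efficiently; here it is assumed given):

    import numpy as np

    def iou_box_mask(box, mask):
        """IoU between an axis-aligned box (x0, y0, x1, y1) and a binary mask."""
        x0, y0, x1, y1 = box
        box_mask = np.zeros_like(mask, dtype=bool)
        box_mask[y0:y1, x0:x1] = True
        union = np.logical_or(box_mask, mask).sum()
        return np.logical_and(box_mask, mask).sum() / union if union else 0.0

    def relative_iou(box, mask, optimal_box):
        """rIoU in [0, 1]: a prediction matching the segmentation as well as
        the best possible box scores 1, even for non-rectangular objects."""
        return iou_box_mask(box, mask) / iou_box_mask(optimal_box, mask)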

Book Chapter
13 Sep 2017
TL;DR: A filtering network (FNet) is proposed, a method which replaces NMS with a differentiable neural network that allows joint reasoning and re-scoring of the generated set of hypotheses per image, and demonstrates that FNet, a feed-forward network architecture, is able to mimic NMS decisions, despite the sequential nature of NMS.
Abstract: Most object detection systems consist of three stages. First, a set of individual hypotheses for object locations is generated using a proposal generating algorithm. Second, a classifier scores every generated hypothesis independently to obtain a multi-class prediction. Finally, all scored hypotheses are filtered via a non-differentiable and decoupled non-maximum suppression (NMS) post-processing step. In this paper, we propose a filtering network (FNet), a method which replaces NMS with a differentiable neural network that allows joint reasoning and re-scoring of the generated set of hypotheses per image. This formulation enables end-to-end training of the full object detection pipeline. First, we demonstrate that FNet, a feed-forward network architecture, is able to mimic NMS decisions, despite the sequential nature of NMS. We further analyze NMS failures and propose a loss formulation that is better aligned with the mean average precision (mAP) evaluation metric. We evaluate FNet on several standard detection datasets. Results surpass standard NMS on highly occluded settings of a synthetic overlapping MNIST dataset and show competitive behavior on PascalVOC2007 and KITTI detection benchmarks.
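
For context, the sequential procedure that FNet learns to imitate and improve upon is classic greedy NMS:

    import numpy as np

    def box_iou(a, b):
        x0, y0 = np.maximum(a[:2], b[:2])
        x1, y1 = np.minimum(a[2:], b[2:])
        inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    def greedy_nms(boxes, scores, iou_thresh=0.5):
        """Keep the highest-scoring box, drop all boxes overlapping it by
        more than iou_thresh, repeat - non-differentiable and sequential."""
        order = np.argsort(scores)[::-1]
        keep = []
        while order.size:
            i = order[0]
            keep.append(i)
            remaining = order[1:]
            mask = np.array([box_iou(boxes[i], boxes[j]) < iou_thresh
                             for j in remaining], dtype=bool)
            order = remaining[mask]
        return keep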

Book Chapter
13 Sep 2017
TL;DR: This work proposes a two-stage RANSAC procedure in which, in the first step, only features classified as ground points are processed and are subsequently used to score two-view geometry hypotheses generated by the five-point algorithm using samples of non-ground points.
Abstract: The computation of the essential matrix using the five-point algorithm is a staple task that is usually considered solved. However, we show that the algorithm frequently selects erroneous solutions in the presence of noise and outliers. These errors arise when the supporting point correspondences supplied to the algorithm do not adequately cover all essential planes in the scene, leading to ambiguous essential matrix solutions. This is not merely a theoretical problem: such scene conditions often occur in 3D reconstruction of real-world data, when fronto-parallel point correspondences, such as points on building facades, are captured but correspondences on obliquely observed planes, such as the ground plane, are missed. To solve this problem, we propose to leverage semantic labelings of image features to guide hypothesis selection in the five-point algorithm. More specifically, we propose a two-stage RANSAC procedure in which, in the first step, only features classified as ground points are processed. These inlier ground features are subsequently used to score two-view geometry hypotheses generated by the five-point algorithm using samples of non-ground points. Results for scenes with prominent ground regions demonstrate the ability of our approach to recover epipolar geometries that describe the entire scene, rather than only well-sampled scene planes.

Book Chapter
13 Sep 2017
TL;DR: This work proposes a novel recurrent ConvNet architecture called recurrent residual networks to address the task of action recognition, and shows that the model improves over both the standard ResNet architecture and a ResNet extended by a fully recurrent layer.
Abstract: Action recognition is a fundamental problem in computer vision with many potential applications such as video surveillance, human computer interaction, and robot learning. Given pre-segmented videos, the task is to recognize actions happening within videos. Historically, hand-crafted video features were used to address the task of action recognition. With the success of Deep ConvNets as an image analysis method, many extensions of standard ConvNets were proposed to process variable-length video data. In this work, we propose a novel recurrent ConvNet architecture called recurrent residual networks to address the task of action recognition. The approach extends ResNet, a state-of-the-art model for image classification. While the original formulation of ResNet aims at learning spatial residuals in its layers, we extend the approach by introducing recurrent connections that allow learning a spatio-temporal residual. In contrast to fully recurrent networks, our temporal connections only allow a limited range of preceding frames to contribute to the output for the current frame, enabling efficient training and inference as well as limiting the temporal context to a reasonable local range around each frame. On a large-scale action recognition dataset, we show that our model improves over both the standard ResNet architecture and a ResNet extended by a fully recurrent layer.
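
A toy PyTorch block conveying the stated idea (the wiring and names are illustrative, not the paper's architecture): the residual for frame t mixes current features with the previous frame's state, keeping the temporal context local:

    import torch
    import torch.nn as nn

    class RecurrentResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
            self.temporal = nn.Conv2d(channels, channels, 3, padding=1)

        def forward(self, x_t, h_prev):
            # Spatio-temporal residual: current frame plus previous frame's state.
            residual = torch.relu(self.spatial(x_t) + self.temporal(h_prev))
            h_t = x_t + residual   # becomes h_prev for frame t + 1
            return h_t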

Book Chapter
13 Sep 2017
TL;DR: This paper proposes an approach for the semantic segmentation of a 3D point cloud using local 3D moment invariants and the integration of contextual information on the task of analyzing forestal and urban areas which were recorded by terrestrial LiDAR scanners.
Abstract: In this paper, we propose an approach for the semantic segmentation of a 3D point cloud using local 3D moment invariants and the integration of contextual information. Specifically, we focus on the task of analyzing forestal and urban areas which were recorded by terrestrial LiDAR scanners. We demonstrate how 3D moment invariants can be leveraged as local features and that they are on a par with established descriptors. Furthermore, we show how an iterative learning scheme can increase the overall quality by taking neighborhood relationships between classes into account. Our experiments show that the approach achieves very good results for a variety of tasks including both binary and multi-class settings.
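
A reduced example of this kind of local feature: rotation- and translation-invariant quantities derived from the second-order moments of a point's neighborhood (the paper uses a richer set of 3D moment invariants):

    import numpy as np

    def second_order_invariants(neighborhood):
        """Eigenvalues of the centered scatter matrix of a local 3D point
        neighborhood (N x 3): invariant under rotation and translation,
        they distinguish linear, planar and volumetric structures."""
        centered = neighborhood - neighborhood.mean(axis=0)
        cov = centered.T @ centered / len(neighborhood)
        return np.sort(np.linalg.eigvalsh(cov))[::-1]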

Book Chapter
13 Sep 2017
TL;DR: A novel method for the combined extraction of points, lines and arcs in images; by constructing a graph describing the topology between the features, more complex structures can be described across multiple connected primitives.
Abstract: This article presents a novel method for the combined extraction of points, lines and arcs in images. Geometric primitives are fitted to extracted edge pixels. To obtain points, the intersections between the geometric primitives are calculated. The method allows precise and at the same time robust detection of image features. By constructing a graph describing the topology between the features, more complex structures can be described across multiple connected primitives.

Book Chapter
13 Sep 2017
TL;DR: This work proposes a self-supervised approach that is able to utilize the myriad of easily available dashcam videos from YouTube or from autonomous vehicles to perform fully automatic training by simply watching others drive, and plays training videos backwards in time to track patches that cars have driven over together with their spatio-temporal interrelations.
Abstract: The most prominent approach for autonomous cars to learn what areas of a scene are drivable is to utilize tedious human supervision in the form of pixel-wise image labeling for training deep semantic segmentation algorithms. However, the underlying CNNs require vast amounts of this training information, rendering the expensive pixel-wise labeling of images a bottleneck. Thus, we propose a self-supervised approach that is able to utilize the myriad of easily available dashcam videos from YouTube or from autonomous vehicles to perform fully automatic training by simply watching others drive. We play training videos backwards in time and track patches that cars have driven over together with their spatio-temporal interrelations, which are a rich source of context information. Collecting large numbers of these local regions enables fully automatic self-supervision for training a CNN. The proposed method has the potential to extend and complement the popular supervised CNN learning of drivable pixels by using a rich, presently untapped source of unlabeled training data.

Book Chapter
13 Sep 2017
TL;DR: In this article, a zero-shot approach is proposed to solve the semantic boundary and edge detection problem without using edge labels during the training phase, by relying on conventional whole image neural net classifiers that were trained using large bounding boxes.
Abstract: Semantic boundary and edge detection aims at simultaneously detecting object edge pixels in images and assigning class labels to them. Systematic training of predictors for this task requires the labeling of edges in images, which is a particularly tedious task. We propose a novel strategy for solving this task when pixel-level annotations are not available, performing it in an almost zero-shot manner by relying on conventional whole image neural net classifiers that were trained using large bounding boxes. Our method performs the following two steps at test time. First, it predicts the class labels by applying the trained whole image network to the test images. Second, it computes pixel-wise scores from the obtained predictions by applying backprop gradients as well as recent visualization algorithms such as deconvolution and layer-wise relevance propagation. We show that high pixel-wise scores are indicative of the location of semantic boundaries, which suggests that the semantic boundary problem can be approached without using edge labels during the training phase.
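
The second step can be sketched in a few lines with plain backprop gradients (deconvolution and layer-wise relevance propagation are drop-in alternatives); the model and input normalization are assumed:

    import torch

    def pixel_scores(model, image, class_idx):
        """Backprop the predicted class score to the input; the per-pixel
        gradient magnitude serves as a semantic boundary score."""
        image = image.clone().requires_grad_(True)   # image: (3, H, W), normalized
        score = model(image.unsqueeze(0))[0, class_idx]
        score.backward()
        return image.grad.abs().max(dim=0).values    # max over color channels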

Book Chapter
13 Sep 2017
TL;DR: By enhancing low resolution images with a new super-resolution inception network architecture, this work is able to improve upon the state of the art in facial landmark detection.
Abstract: Modern convolutional neural networks for facial landmark detection have become increasingly robust against occlusions, lighting conditions and pose variations. With the predictions being close to pixel-accurate in some cases, intuitively, the input resolution should be as high as possible. We verify this intuition by thoroughly analyzing the impact of low image resolution on landmark prediction performance. Indeed, performance degradations are already measurable for faces smaller than 50 × 50 px. In order to mitigate those degradations, a new super-resolution inception network architecture is developed which outperforms recent super-resolution methods on various data sets. By enhancing low resolution images with our model, we are able to improve upon the state of the art in facial landmark detection.

Book Chapter
13 Sep 2017
TL;DR: A convolutional neural network (CNN) based model is presented that predicts future movements of a ball given a series of images depicting the ball and its environment, and it is investigated whether networks with stereo visual input perform better than those with monocular vision only.
Abstract: In this work we present a convolutional neural network (CNN) based model that predicts future movements of a ball given a series of images depicting the ball and its environment. For training and evaluation, we use artificially generated image sequences. Two scenarios are analyzed: prediction in a simple table tennis environment and in a more challenging squash environment. Classical 2D convolution layers are compared with 3D convolution layers that extract the motion information of the ball from contiguous frames. Moreover, we investigate whether networks with stereo visual input perform better than those with monocular vision only. Our experiments suggest that CNNs can indeed predict physical behaviour with small error rates on unseen data, but the performance drops for very complex underlying movements.
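
A minimal sketch of the 3D-convolution variant compared in the paper, which extracts motion information from contiguous frames (layer sizes and the output head are assumptions):

    import torch
    import torch.nn as nn

    class BallPredictor(nn.Module):
        """Map a stack of frames (B, C, T, H, W) to future ball positions."""
        def __init__(self, in_channels=3, horizon=5):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),   # pool over time and space
            )
            self.head = nn.Linear(32, 2 * horizon)  # (x, y) per future step

        def forward(self, clips):
            z = self.features(clips).flatten(1)
            return self.head(z).view(len(clips), -1, 2)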

Book Chapter
13 Sep 2017
TL;DR: In this paper, the authors propose a method for large displacement optical flow in which local matching costs are learned by a convolutional neural network (CNN) and a smoothness prior is imposed by a conditional random field (CRF).
Abstract: We propose a method for large displacement optical flow in which local matching costs are learned by a convolutional neural network (CNN) and a smoothness prior is imposed by a conditional random field (CRF). We tackle the computation- and memory-intensive operations on the 4D cost volume by a min-projection, which reduces the memory complexity from quadratic to linear, and by binary descriptors for efficient matching. This enables evaluation of the cost on the fly and allows learning and CRF inference to be performed on high resolution images without ever storing the 4D cost volume. To address the problem of learning binary descriptors we propose a new hybrid learning scheme. In contrast to current state-of-the-art approaches for learning binary CNNs, we can compute the exact non-zero gradient within our model. We compare several methods for training binary descriptors and show results on publicly available benchmarks.

Book Chapter
13 Sep 2017
TL;DR: This work proposes an edge adaptive seeding for superpixel segmentation methods, generating more seeds in areas with more edges and vice versa, following the assumption that edges distinguish objects and thus are a good indicator of the level of clutter in an image region.
Abstract: Finding a suitable seeding resolution when using superpixel segmentation methods is usually challenging. Different parts of the image contain different levels of clutter, resulting in an either too dense or too coarse segmentation. Since both possible outcomes cause problems for subsequent processing, we propose an edge adaptive seeding for superpixel segmentation methods, generating more seeds in areas with more edges and vice versa. This follows the assumption that edges distinguish objects and thus are a good indicator of the level of clutter in an image region. Our evaluation on five datasets using three popular superpixel segmentation methods shows that edge adaptive seeding leads to improved results compared to other priors as well as to uniform seeding.
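
A simplified version of the seeding prior: distribute a fixed seed budget over a coarse grid in proportion to local edge density (the placement within each cell and the choice of edge detector are left open):

    import numpy as np

    def adaptive_seed_budget(edge_map, grid=8, total_seeds=400):
        """Split a binary edge map into grid x grid cells and assign each
        cell a number of superpixel seeds proportional to its share of
        edge pixels: cluttered cells get more seeds, homogeneous fewer."""
        h, w = edge_map.shape
        ch, cw = h // grid, w // grid
        cells = edge_map[:ch * grid, :cw * grid].reshape(grid, ch, grid, cw)
        density = cells.sum(axis=(1, 3)) + 1e-9
        return np.round(density / density.sum() * total_seeds).astype(int)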