Showing papers presented at the German Conference on Pattern Recognition in 2018


Book Chapter DOI
09 Oct 2018
TL;DR: In this paper, the authors investigate the impact of different flow algorithms and input transformations on the performance of a state-of-the-art action recognition method, and show that optical flow is useful for action recognition because it is invariant to appearance, but that the EPE of current methods is not well correlated with action recognition performance.
Abstract: Most of the top-performing action recognition methods use optical flow as a “black box” input. Here we take a deeper look at the combination of flow and action recognition, and investigate why optical flow is helpful, what makes a flow method good for action recognition, and how we can make it better. In particular, we investigate the impact of different flow algorithms and input transformations to better understand how these affect a state-of-the-art action recognition method. Furthermore, we fine-tune two neural-network flow methods end-to-end on the most widely used action recognition dataset (UCF101). Based on these experiments, we make the following five observations: (1) optical flow is useful for action recognition because it is invariant to appearance, (2) optical flow methods are optimized to minimize end-point error (EPE), but the EPE of current methods is not well correlated with action recognition performance, (3) for the flow methods tested, accuracy at boundaries and at small displacements is most correlated with action recognition performance, (4) training optical flow to minimize classification error instead of minimizing EPE improves recognition performance, and (5) optical flow learned for the task of action recognition differs from traditional optical flow, especially inside the human body and at the boundary of the body. These observations may encourage optical flow researchers to look beyond EPE as a goal, and guide action recognition researchers to seek better motion cues, leading to a tighter integration of the optical flow and action recognition communities.

160 citations


Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors propose an end-to-end clustering training schedule for neural networks that is direct, i.e., the output is a probability distribution over cluster memberships.
Abstract: We propose a novel end-to-end clustering training schedule for neural networks that is direct, i.e. the output is a probability distribution over cluster memberships. A neural network maps images to embeddings. We introduce centroid variables that have the same shape as image embeddings. These variables are jointly optimized with the network’s parameters. This is achieved by a cost function that associates the centroid variables with embeddings of input images. Finally, an additional layer maps embeddings to logits, allowing for the direct estimation of the respective cluster membership. Unlike other methods, this does not require any additional classifier to be trained on the embeddings in a separate step. The proposed approach achieves state-of-the-art results in unsupervised classification and we provide an extensive ablation study to demonstrate its capabilities.

96 citations
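The centroid-variable mechanism described above is compact enough to sketch. Below is a minimal, hypothetical PyTorch illustration, not the authors' code; names such as `embed_net` and `n_clusters` are assumptions. Trainable centroids share the embedding shape, a distance-based cost associates them with the image embeddings, and a final linear layer maps embeddings directly to cluster-membership logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectClustering(nn.Module):
    """Sketch of direct clustering: a network embeds images, trainable
    centroid variables live in the same space, and an extra layer maps
    embeddings to cluster-membership logits (no separate classifier)."""
    def __init__(self, embed_net: nn.Module, embed_dim: int, n_clusters: int):
        super().__init__()
        self.embed_net = embed_net  # CNN mapping images to embeddings
        self.centroids = nn.Parameter(torch.randn(n_clusters, embed_dim))
        self.to_logits = nn.Linear(embed_dim, n_clusters)

    def forward(self, images):
        z = self.embed_net(images)               # (B, embed_dim)
        logits = self.to_logits(z)               # (B, n_clusters)
        # Cost associating centroids with embeddings: pull each embedding
        # towards its softly assigned centroid; optimized jointly with
        # the network's parameters, as described in the abstract.
        d = torch.cdist(z, self.centroids) ** 2  # (B, n_clusters)
        assign = F.softmax(-d, dim=1)
        centroid_cost = (assign * d).sum(dim=1).mean()
        return logits, centroid_cost
```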


Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors propose a deep architecture that keeps the information about the available source domains separate while leveraging generic perceptual information, introducing domain-specific aggregation modules that, through an aggregation layer strategy, merge generic and specific information in an effective manner.
Abstract: Visual recognition systems are meant to work in the real world. For this to happen, they must work robustly in any visual domain, and not only on the data used during training. Within this context, a very realistic scenario deals with domain generalization, i.e. the ability to build visual recognition algorithms able to work robustly in several visual domains, without having access to any information about target data statistics. This paper contributes to this research thread, proposing a deep architecture that keeps the information about the available source-domain data separate while at the same time leveraging generic perceptual information. We achieve this by introducing domain-specific aggregation modules that, through an aggregation layer strategy, are able to merge generic and specific information in an effective manner. Experiments on two different benchmark databases show the power of our approach, reaching the new state of the art in domain generalization.

77 citations


Book Chapter DOI
09 Oct 2018
TL;DR: This paper proposes Convolve, Attend and Spell, an attention-based sequence-to-sequence model for handwritten word recognition that achieves competitive results on the IAM dataset without needing any pre-processing step, predefined lexicon, or language model.
Abstract: This paper proposes Convolve, Attend and Spell, an attention-based sequence-to-sequence model for handwritten word recognition. The proposed architecture has three main parts: an encoder, consisting of a CNN and a bi-directional GRU; an attention mechanism devoted to focusing on the pertinent features; and a decoder formed by a unidirectional GRU, able to spell the corresponding word, character by character. Compared with the recent state of the art, our model achieves competitive results on the IAM dataset without needing any pre-processing step, predefined lexicon, or language model. Code and additional results are available at https://github.com/omni-us/research-seq2seq-HTR.

65 citations


Book Chapter DOI
09 Oct 2018
TL;DR: This work discusses the operational opportunity of having a live face probe to support the morphing detection decision, proposes a detection approach that takes advantage of it, and considers the facial landmark shifting patterns between reference and probe images.
Abstract: Face morphing attacks create face images that are verifiable against multiple identities. Associating such images with identity documents leads to faulty identity links, enabling attacks on operations like border crossing. Most previously proposed morphing attack detection approaches directly classify features extracted from the investigated image. We discuss the operational opportunity of having a live face probe to support the morphing detection decision and propose a detection approach that takes advantage of it. Our proposed solution considers the facial landmark shifting patterns between reference and probe images. These shifts are represented as directed distances to avoid confusion with shifts caused by other variations. We validated our approach using a publicly available database built on 549 identities. Our proposed detection concept is tested with three landmark detectors and shown to outperform the baseline concept based on handcrafted and transferable CNN features.

55 citations
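The directed-distance idea has a simple core that can be sketched. The following hedged numpy/scikit-learn snippet illustrates the general concept only; the function names are hypothetical and the paper's exact feature construction and classifier may differ.

```python
import numpy as np
from sklearn.svm import SVC

def directed_shift_features(ref_landmarks: np.ndarray,
                            probe_landmarks: np.ndarray) -> np.ndarray:
    """Signed per-landmark displacements (dx, dy) between the reference
    (document) image and the live probe, flattened into one vector.
    Keeping the direction of each shift, rather than its magnitude only,
    helps separate morphing-induced shifts from shifts caused by
    ordinary appearance variation."""
    assert ref_landmarks.shape == probe_landmarks.shape  # (n_landmarks, 2)
    return (probe_landmarks - ref_landmarks).ravel()

# Usage sketch: X stacks one feature vector per reference/probe pair,
# y marks pairs whose reference image is a morph.
# clf = SVC(kernel="rbf").fit(X_train, y_train)
```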


Book Chapter DOI
09 Oct 2018
TL;DR: In this article, a self-supervised method for representation learning utilizing RGB and optical flow was proposed based on the observation that cross-modal information has a high semantic meaning and proposed a method to effectively exploit this signal.
Abstract: In this paper we present a self-supervised method for representation learning utilizing two different modalities. Based on the observation that cross-modal information has a high semantic meaning we propose a method to effectively exploit this signal. For our approach we utilize video data since it is available on a large scale and provides easily accessible modalities given by RGB and optical flow. We demonstrate state-of-the-art performance on highly contested action recognition datasets in the context of self-supervised learning. We show that our feature representation also transfers to other tasks and conduct extensive ablation studies to validate our core contributions.

52 citations


Book Chapter DOI
09 Oct 2018
TL;DR: A real-time, low-drift laser odometry approach that tightly integrates sequentially measured 3D multi-beam LIDAR data with inertial measurements, and was ranked within the top five laser-only algorithms of the KITTI odometry benchmark.
Abstract: We propose a real-time, low-drift laser odometry approach that tightly integrates sequentially measured 3D multi-beam LIDAR data with inertial measurements. The laser measurements are motion-compensated using a novel algorithm based on non-rigid registration of two consecutive laser sweeps and a local map. IMU data is tightly integrated by means of factor-graph optimization on a pose graph. We evaluate our method on a public dataset and also obtain results on our own datasets that contain information not commonly found in existing datasets. At the time of writing, our method was ranked within the top five laser-only algorithms of the KITTI odometry benchmark.

45 citations


Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors introduce a more realistic and challenging vehicle re-id benchmark, called Vehicle Re-Identification in Context (VRIC), which contains 60,430 images of 5,622 vehicle identities captured by 60 different cameras at heterogeneous road traffic scenes in both day-time and night-time.
Abstract: Existing vehicle re-identification (re-id) evaluation benchmarks consider strongly artificial test scenarios by assuming the availability of high quality images and fine-grained appearance at an almost constant image scale, reminiscent of images required for Automatic Number Plate Recognition, e.g. VeRi-776. Such assumptions are often invalid in realistic vehicle re-id scenarios where arbitrarily changing image resolutions (scales) are the norm. This makes the existing vehicle re-id benchmarks limited for testing the true performance of a re-id method. In this work, we introduce a more realistic and challenging vehicle re-id benchmark, called Vehicle Re-Identification in Context (VRIC). In contrast to existing vehicle re-id datasets, VRIC is uniquely characterised by vehicle images subject to more realistic and unconstrained variations in resolution (scale), motion blur, illumination, occlusion, and viewpoint. It contains 60,430 images of 5,622 vehicle identities captured by 60 different cameras at heterogeneous road traffic scenes in both day-time and night-time. Given the nature of this new benchmark, we further investigate a multi-scale matching approach to vehicle re-id by learning more discriminative feature representations from multi-resolution images. Extensive evaluations show that the proposed multi-scale method outperforms the state-of-the-art vehicle re-id methods on three benchmark datasets: VehicleID, VeRi-776, and VRIC (Available at http://qmul-vric.github.io).

40 citations
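As a rough illustration of multi-scale matching, here is a hedged PyTorch sketch; the class name, fusion layer, and chosen scales are assumptions rather than the paper's architecture. A shared backbone embeds several resolutions of the same image, and the embeddings are fused into one re-id descriptor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEmbedder(nn.Module):
    """Sketch: embed an image at several resolutions with one shared
    backbone and fuse the results into a single descriptor, so matching
    stays robust to the resolution changes highlighted by VRIC."""
    def __init__(self, backbone: nn.Module, feat_dim: int,
                 scales=(1.0, 0.5, 0.25)):
        super().__init__()
        # backbone must end in global pooling so that any input
        # resolution yields a fixed (B, feat_dim) feature vector.
        self.backbone = backbone
        self.scales = scales
        self.fuse = nn.Linear(feat_dim * len(scales), feat_dim)

    def forward(self, x):
        feats = []
        for s in self.scales:
            xi = x if s == 1.0 else F.interpolate(
                x, scale_factor=s, mode="bilinear", align_corners=False)
            feats.append(self.backbone(xi))      # (B, feat_dim) each
        z = self.fuse(torch.cat(feats, dim=1))
        return F.normalize(z, dim=1)             # unit-length descriptor
```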


Book Chapter DOI
09 Oct 2018
TL;DR: This work presents a novel table tennis robot system with high-accuracy vision detection and fast robot reaction based on an industrial KUKA Agilus R900 sixx robot with 6 DOF, and tests both a curve fitting approach and an extended Kalman filter for predicting the ball’s trajectory.
Abstract: In recent years robotic table tennis has become a popular research challenge for image processing and robot control. Here we present a novel table tennis robot system with high-accuracy vision detection and fast robot reaction. Our system is based on an industrial KUKA Agilus R900 sixx robot with 6 DOF. Four cameras are used for ball position detection at 150 fps. We employ a multiple-camera calibration method, and use iterative triangulation to reconstruct the 3D ball position with an accuracy of 2.0 mm. In order to detect the fast-flying ball in real time, we combine color and background thresholding. For predicting the ball’s trajectory we test both a curve fitting approach and an extended Kalman filter. Our robot is able to play rallies with a human of up to 50 consecutive strokes and has a general hitting rate of 87%.

31 citations
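Of the two trajectory predictors mentioned, the curve-fitting approach is easy to sketch. This hedged numpy illustration is not the authors' implementation: it fits a quadratic in time to each coordinate of the triangulated ball positions (gravity makes the vertical axis quadratic; for short horizons with little drag the other axes are nearly so) and extrapolates.

```python
import numpy as np

def fit_trajectory(t: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """Fit x(t) = a*t^2 + b*t + c per coordinate of the observed 3D ball
    positions pos with shape (n_obs, 3). Returns (3, 3) polynomial
    coefficients, one row per axis."""
    return np.stack([np.polyfit(t, pos[:, k], deg=2) for k in range(3)])

def predict(coeffs: np.ndarray, t_query: float) -> np.ndarray:
    """Evaluate the fitted polynomials at a future time t_query."""
    return np.array([np.polyval(c, t_query) for c in coeffs])

# Usage sketch: positions triangulated at 150 fps feed the fit, and the
# robot queries the predicted ball position a fraction of a second ahead.
```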


Book Chapter DOI
09 Oct 2018
TL;DR: A convolutional neural network with encoder-decoder architecture and a new loss function, the batch soft Dice loss function, used to train the network is introduced, and the resulting model produces segmentations of every OAR in the public MICCAI 2015 Head And Neck Auto-Segmentation Challenge dataset.
Abstract: This paper deals with segmentation of organs at risk (OAR) in the head and neck area in CT images, which is a crucial step for reliable intensity-modulated radiotherapy treatment. We introduce a convolutional neural network with encoder-decoder architecture and a new loss function, the batch soft Dice loss function, used to train the network. The resulting model produces segmentations of every OAR in the public MICCAI 2015 Head And Neck Auto-Segmentation Challenge dataset. Despite the heavy class imbalance in the data, we improve on the accuracy of current state-of-the-art methods by 0.33 mm in terms of average surface distance and by 0.11 in terms of the Dice overlap coefficient on average.

27 citations
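A widely used form of the soft Dice loss computed over an entire batch can be sketched as follows. This is a hedged illustration of the general idea and may differ in detail from the paper's batch soft Dice formulation.

```python
import torch

def batch_soft_dice_loss(probs: torch.Tensor, target: torch.Tensor,
                         eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice with the sums taken over batch and spatial dimensions
    together, one score per class. Pooling the batch this way is one
    means of countering heavy class imbalance: small organs that appear
    in only a few slices still contribute a full per-class term.
    probs, target: (B, C, ...) with probs in [0, 1] and target one-hot."""
    dims = (0,) + tuple(range(2, probs.dim()))       # batch + spatial dims
    intersection = (probs * target).sum(dims)
    denom = probs.sum(dims) + target.sum(dims)
    dice = (2 * intersection + eps) / (denom + eps)  # per-class Dice
    return 1 - dice.mean()
```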


Book Chapter DOI
09 Oct 2018
TL;DR: A novel Multi-stream Long Short-Term Memory (M-LSTM) network for recognizing driver activities is presented; its contextual descriptor is built to be semantically rich and meaningful, and when coupled with appearance features it turns out to be highly discriminative.
Abstract: Automatic recognition of in-vehicle activities has a significant impact on the next generation of intelligent vehicles. In this paper, we present a novel Multi-stream Long Short-Term Memory (M-LSTM) network for recognizing driver activities. We bring together ideas from recent works on LSTMs and transfer learning for object detection and body pose by exploring the use of deep convolutional neural networks (CNN). Recent work has also shown that representations such as hand-object interactions are important cues in characterizing human activities. The proposed M-LSTM integrates these ideas under one framework, where two streams focus on appearance information with two different levels of abstraction. The other two streams analyze the contextual information involving the configuration of body parts and body-object interactions. The proposed contextual descriptor is built to be semantically rich and meaningful, and even when coupled with appearance features it turns out to be highly discriminative. We validate this on two challenging datasets consisting of driver activities.

Book Chapter DOI
09 Oct 2018
TL;DR: The TCE loss is presented, a robust derivative of the standard Cross Entropy loss used in deep learning for classification tasks, that requires no modification of the training regime compared to the CE loss and can be applied in all applications where the CE loss is currently used.
Abstract: We present the Tamed Cross Entropy (TCE) loss function, a robust derivative of the standard Cross Entropy (CE) loss used in deep learning for classification tasks. Unlike other robust losses, the TCE loss is designed to exhibit the same training properties as the CE loss in noiseless scenarios. Therefore, the TCE loss requires no modification of the training regime compared to the CE loss and, in consequence, can be applied in all applications where the CE loss is currently used. We evaluate the TCE loss using the ResNet architecture on four image datasets that we artificially contaminated with various levels of label noise. The TCE loss outperforms the CE loss in every tested scenario.

Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors proposed a new attack scheme for the class of ReLU networks based on a direct optimization on the resulting linear regions, which is less susceptible to defences targeting their functional properties.
Abstract: It has recently been shown that neural networks, but also other classifiers, are vulnerable to so-called adversarial attacks: in object recognition, for example, an almost imperceptible change of the image changes the decision of the classifier. Relatively fast heuristics have been proposed to produce these adversarial inputs, but the problem of finding the optimal adversarial input, that is, the one with the minimal change of the input, is NP-hard. While methods based on mixed-integer optimization which find the optimal adversarial input have been developed, they do not scale to large networks. Currently, the attack scheme proposed by Carlini and Wagner is considered to produce the best adversarial inputs. In this paper we propose a new attack scheme for the class of ReLU networks based on a direct optimization over the resulting linear regions. In our experimental validation we improve over the Carlini-Wagner attack in all but one of 18 experiments, with a relative improvement of up to 9%. As our approach is based on the geometrical structure of ReLU networks, it is less susceptible to defences targeting their functional properties.

Book Chapter DOI
09 Oct 2018
TL;DR: In this paper, a CNN-based object detection approach for multi-view X-ray image data is proposed to detect prohibited objects in carry-on luggage as part of aviation security screening.
Abstract: Motivated by the detection of prohibited objects in carry-on luggage as part of aviation security screening, we develop a CNN-based object detection approach for multi-view X-ray image data. Our contributions are two-fold. First, we introduce a novel multi-view pooling layer to perform a 3D aggregation of 2D CNN features extracted from each view. To that end, our pooling layer exploits the known geometry of the imaging system to ensure geometric consistency of the feature aggregation. Second, we introduce an end-to-end trainable multi-view detection pipeline based on Faster R-CNN, which derives the region proposals and performs the final classification in 3D using these aggregated multi-view features. Our approach shows significant accuracy gains compared to single-view detection while even being more efficient than performing single-view detection in each view.

Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors proposed a new method to count objects of specific categories that are significantly smaller than the ground sampling distance of a satellite image, which is hard due to the cluttered nature of scenes where different object categories occur.
Abstract: We propose a new method to count objects of specific categories that are significantly smaller than the ground sampling distance of a satellite image. This task is hard due to the cluttered nature of scenes where different object categories occur. Target objects can be partially occluded, vary in appearance within the same class, and look similar to objects of other categories. Since traditional object detection is infeasible due to the small size of objects with respect to the pixel size, we cast object counting as a density estimation problem. To distinguish objects of different classes, our approach combines density estimation with semantic segmentation in an end-to-end learnable convolutional neural network (CNN). Experiments show that deep semantic density estimation can robustly count objects of various classes in cluttered scenes. Experiments also suggest that we need specific CNN architectures in remote sensing instead of blindly applying existing ones from computer vision.
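Casting counting as density estimation has a one-line core: the per-class count is the spatial integral of a predicted density map. A hedged PyTorch sketch, with function name and tensor layout assumed:

```python
import torch

def counts_from_density(density: torch.Tensor) -> torch.Tensor:
    """Given per-class density maps of shape (B, C, H, W) predicted by a
    segmentation-style CNN, the count per image and class is the spatial
    sum of the map. Training typically regresses towards ground-truth
    maps built by placing a small Gaussian at each annotated object
    centre, so the maps integrate to the true counts."""
    return density.sum(dim=(2, 3))  # (B, C)
```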

Book Chapter DOI
09 Oct 2018
TL;DR: A convolutional neural network with residual building blocks learns to predict the future irradiance state from a small set of sky images, significantly outperforming the established baseline and state-of-the-art methods for estimating irradiance fluctuations.
Abstract: We present a novel image-based approach for estimating irradiance fluctuations from sky images. Our goal is a very short-term prediction of the irradiance state around a photovoltaic power plant 5–10 min ahead of time, in order to adjust alternative energy sources and ensure a stable energy network. To this end, we propose a convolutional neural network with residual building blocks that learns to predict the future irradiance state from a small set of sky images. Our experiments on two large datasets demonstrate that the network abstracts over local site-specific properties, such as day- and month-dependent sun positions, as well as generic properties, such as moving, forming, or dissolving clouds and seasonal changes. Moreover, our approach significantly outperforms the established baseline and state-of-the-art methods.

Book Chapter DOI
09 Oct 2018
TL;DR: This work introduces a suite of tools that exploit sparsity in both the feature maps and the filter weights, and thereby allow for significantly lower memory footprints and computation times than the conventional dense framework, when processing data with a high degree of sparsity.
Abstract: While CNNs naturally lend themselves to densely sampled data, and sophisticated implementations are available, they lack the ability to efficiently process sparse data. In this work we introduce a suite of tools that exploit sparsity in both the feature maps and the filter weights, and thereby allow for significantly lower memory footprints and computation times than the conventional dense framework, when processing data with a high degree of sparsity. Our scheme provides (i) an efficient GPU implementation of a convolution layer based on direct, sparse convolution; (ii) a filter step within the convolution layer, which we call attention, that prevents fill-in, i.e., the tendency of convolution to rapidly decrease sparsity, and guarantees an upper bound on the computational resources; and (iii) an adaptation of back-propagation that makes it possible to combine our approach with standard learning frameworks, while still exploiting sparsity in the data and the model.

Book Chapter DOI
09 Oct 2018
TL;DR: In this article, a neural network architecture based on an analytical formulation of the parallel-to-fan beam conversion problem following the concept of precision learning is proposed to learn the unknown operators in this conversion in a data-driven manner.
Abstract: In this paper, we derive a neural network architecture based on an analytical formulation of the parallel-to-fan beam conversion problem following the concept of precision learning. The network allows learning the unknown operators in this conversion in a data-driven manner, avoiding interpolation and potential loss of resolution. Integration of known operators results in a small number of trainable parameters that can be estimated from synthetic data only. The concept is evaluated in the context of hybrid MRI/X-ray imaging, where transformation of the parallel-beam MRI projections to fan-beam X-ray projections is required. The proposed method is compared to a traditional rebinning method. The results demonstrate that the proposed method is superior to ray-by-ray interpolation and is able to deliver sharper images using the same amount of parallel-beam input projections, which is crucial for interventional applications. We believe that this approach forms a basis for further work uniting deep learning, signal processing, physics, and traditional pattern recognition.

Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors investigate frame interpolation as a proxy task for optical flow using real movies, and train a CNN unsupervised for temporal interpolation such a network implicitly estimates motion, but cannot handle untextured regions.
Abstract: The difficulty of annotating training data is a major obstacle to using CNNs for low-level tasks in video. Synthetic data often does not generalize to real videos, while unsupervised methods require heuristic losses. Proxy tasks can overcome these issues, and start by training a network for a task for which annotation is easier or which can be trained unsupervised. The trained network is then fine-tuned for the original task using small amounts of ground truth data. Here, we investigate frame interpolation as a proxy task for optical flow. Using real movies, we train a CNN unsupervised for temporal interpolation. Such a network implicitly estimates motion, but cannot handle untextured regions. By fine-tuning on small amounts of ground truth flow, the network can learn to fill in homogeneous regions and compute full optical flow fields. Using this unsupervised pre-training, our network outperforms similar architectures that were trained supervised using synthetic optical flow.

Book Chapter DOI
09 Oct 2018
TL;DR: The recently introduced weight imprinting technique is employed in order to use the available training data to train accurate classifiers in the absence of sufficient examples for some classes.
Abstract: The size of current plankton image datasets renders manual classification virtually infeasible. The training of models for machine classification is complicated by the fact that a large number of classes consist of only a few examples. We employ the recently introduced weight imprinting technique in order to use the available training data to train accurate classifiers in the absence of sufficient examples for some classes.
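Weight imprinting, introduced by Qi et al. (CVPR 2018), has a simple core. This hedged PyTorch sketch, with function name and tensor layout assumed, sets the classifier weight of a rare class to the normalized mean embedding of its few examples instead of learning it by gradient descent.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def imprint_weights(classifier_weight: torch.Tensor,
                    embeddings: torch.Tensor, class_idx: int) -> None:
    """Set the weight row of a low-shot class to the L2-normalized mean
    of its L2-normalized example embeddings, giving the class a sensible
    cosine-similarity decision direction without any gradient updates.
    classifier_weight: (n_classes, embed_dim); embeddings: (n, embed_dim)."""
    mean_emb = F.normalize(embeddings, dim=1).mean(dim=0)
    classifier_weight[class_idx] = F.normalize(mean_emb, dim=0)
```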

Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors estimate optimal weights for correspondences using PointNet and train the network directly with the criterion to minimize the registration error, achieving an accuracy of 0.74 ± 0.26 mm and highly improved robustness.
Abstract: Registration of pre-operative 3-D volumes to intra-operative 2-D X-ray images is important in minimally invasive medical procedures. Rigid registration can be performed by estimating a global rigid motion that optimizes the alignment of local correspondences. However, inaccurate correspondences challenge the registration performance. To minimize their influence, we estimate optimal weights for correspondences using PointNet. We train the network directly with the criterion to minimize the registration error. We propose an objective function which includes point-to-plane correspondence-based motion estimation and projection error computation, thereby enabling the learning of a weighting strategy that optimally fits the underlying formulation of the registration task in an end-to-end fashion. For single-vertebra registration, we achieve an accuracy of 0.74 ± 0.26 mm and highly improved robustness. The success rate is increased from 79.3% to 94.3% and the capture range from 3 mm to 13 mm.

Book Chapter DOI
09 Oct 2018
TL;DR: Information-Theoretic Active Learning (ITAL), a novel batch-mode active learning method for binary classification, turns out to be highly flexible and provides state-of-the-art performance across various datasets, such as MIRFLICKR and ImageNet.
Abstract: We propose Information-Theoretic Active Learning (ITAL), a novel batch-mode active learning method for binary classification, and apply it for acquiring meaningful user feedback in the context of content-based image retrieval. Instead of combining different heuristics such as uncertainty, diversity, or density, our method is based on maximizing the mutual information between the predicted relevance of the images and the expected user feedback regarding the selected batch. We propose suitable approximations to this computationally demanding problem and also integrate an explicit model of user behavior that accounts for possible incorrect labels and unnameable instances. Furthermore, our approach takes into account not only the structure of the data but also the expected model output change caused by the user feedback. In contrast to other methods, ITAL turns out to be highly flexible and provides state-of-the-art performance across various datasets, such as MIRFLICKR and ImageNet.

Book Chapter DOI
09 Oct 2018
TL;DR: This work presents a novel approach that combines unsupervised computation of representative manifold-valued features, called labels, with the spatially regularized geometric assignment of these labels to given manifold-valued data.
Abstract: Manifold models of image features abound in computer vision. We present a novel approach that combines unsupervised computation of representative manifold-valued features, called labels, and the spatially regularized assignment of these labels to given manifold-valued data. Both processes evolve dynamically through two Riemannian gradient flows that are coupled. The representation of labels and assignment variables are kept separate, to enable the flexible application to various manifold data models. As a case study, we apply our approach to the unsupervised learning of covariance descriptors on the positive definite matrix manifold, through spatially regularized geometric assignment.

Book Chapter DOI
09 Oct 2018
TL;DR: The results indicate that the proposed mid-level fusion of LiDAR and camera data improves both the geometric and semantic accuracy of the Stixel model significantly while reducing the computational overhead as well as the amount of generated data in comparison to using a single modality on its own.
Abstract: This paper presents a compact and accurate representation of 3D scenes that are observed by a LiDAR sensor and a monocular camera. The proposed method is based on the well-established Stixel model originally developed for stereo vision applications. We extend this Stixel concept to incorporate data from multiple sensor modalities. The resulting mid-level fusion scheme takes full advantage of the geometric accuracy of LiDAR measurements as well as the high resolution and semantic detail of RGB images. The obtained environment model provides a geometrically and semantically consistent representation of the 3D scene at a significantly reduced amount of data while minimizing information loss at the same time. Since the different sensor modalities are considered as input to a joint optimization problem, the solution is obtained with only minor computational overhead. We demonstrate the effectiveness of the proposed multimodal Stixel algorithm on a manually annotated ground truth dataset. Our results indicate that the proposed mid-level fusion of LiDAR and camera data improves both the geometric and semantic accuracy of the Stixel model significantly while reducing the computational overhead as well as the amount of generated data in comparison to using a single modality on its own.

Book Chapter DOI
09 Oct 2018
TL;DR: The proposed formulation extends a recent sublabel-accurate relaxation for multi-label problems and thus allows for accurate solutions using only a small number of labels, significantly improving over previous approaches towards lifting the total generalized variation.
Abstract: We propose a novel idea to introduce regularization based on second order total generalized variation (TGV) into optimization frameworks based on functional lifting. The proposed formulation extends a recent sublabel-accurate relaxation for multi-label problems and thus allows for accurate solutions using only a small number of labels, significantly improving over previous approaches towards lifting the total generalized variation. Moreover, even recent sublabel-accurate methods exhibit staircasing artifacts when used in conjunction with common first order regularizers such as the total variation (TV). This becomes very obvious, for example, when computing derivatives of disparity maps computed with these methods to obtain normals, which immediately reveals their local flatness and yields inaccurate normal maps. We show that our approach is effective in reducing these artifacts, obtaining disparity maps with a smooth normal field in a single optimization pass.

Book Chapter DOI
09 Oct 2018
TL;DR: This paper proposes a new approach for dense depth estimation based on multimodal stereo images that employs a combined cost function utilizing robust metrics and a transformation to an illumination independent representation and presents a confidence based weighting scheme which allows a pixel-wise weight adjustment within the cost function.
Abstract: In this paper, we propose a new approach for dense depth estimation based on multimodal stereo images. Our approach employs a combined cost function utilizing robust metrics and a transformation to an illumination-independent representation. Additionally, we present a confidence-based weighting scheme which allows a pixel-wise weight adjustment within the cost function. We demonstrate the capabilities of our approach using RGB and thermal images. The resulting depth maps are evaluated by comparing them to depth measurements of a Velodyne HDL-64E LiDAR sensor. We show that our method outperforms current state-of-the-art dense matching methods regarding depth estimation based on multimodal input images.

Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors propose an Expectation-Maximization (EM) training scheme that makes oblique-split decision trees end-to-end trainable while remaining deterministic at test time.
Abstract: Conventional decision trees have a number of favorable properties, including interpretability, a small computational footprint and the ability to learn from little training data. However, they lack a key quality that has helped fuel the deep learning revolution: that of being end-to-end trainable. Kontschieder et al. (2015) have addressed this deficit, but at the cost of losing a main attractive trait of decision trees: the fact that each sample is routed along a small subset of tree nodes only. We here propose a model and an Expectation-Maximization training scheme for decision trees that are fully probabilistic at train time, but after an annealing process become deterministic at test time. We analyze the learned oblique split parameters on image datasets and show that neural networks can be trained at each split. In summary, we present an end-to-end learning scheme for deterministic decision trees and present results on par with or superior to published standard oblique decision tree algorithms.
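The annealing idea can be sketched at the level of a single split. This hedged PyTorch snippet is an illustration, not the authors' model: each oblique split routes samples left with a sigmoid probability whose steepness is raised during training, so the tree becomes deterministic at test time.

```python
import torch
import torch.nn as nn

class ObliqueSplit(nn.Module):
    """One probabilistic oblique split: a sample x is routed left with
    probability sigmoid(s * (w.x + b)). Raising the steepness s anneals
    soft routing toward a hard split (route left iff w.x + b > 0)."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, 1)  # oblique hyperplane (w, b)

    def forward(self, x, steepness: float = 1.0):
        return torch.sigmoid(steepness * self.linear(x)).squeeze(-1)

# Training sketch: start with a small steepness and increase it per
# epoch; at test time replace the sigmoid with a hard threshold so each
# sample visits only one root-to-leaf path, as in a conventional tree.
```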

Book Chapter DOI
09 Oct 2018
TL;DR: KS(conf) is described: a procedure for detecting out-of-specs situations that is easy to implement, adds almost no overhead to the system, works with all networks, including pretrained ones, and requires no a priori knowledge about how the data distribution could change.
Abstract: Computer vision systems for automatic image categorization have become accurate and reliable enough that they can run continuously for days or even years as components of real-world commercial applications. A major open problem in this context, however, is quality control. Good classification performance can only be expected if systems run under the specific conditions, in particular data distributions, that they were trained for. Surprisingly, none of the currently used deep network architectures have a built-in functionality that could detect if a network operates on data from a distribution it was not trained for, such that a warning to the human users could potentially be triggered.
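The method's name hints at its mechanism: a Kolmogorov-Smirnov test on the network's confidence values. The following scipy sketch is a hedged reading of that idea; the function name, the use of top softmax confidences, and the thresholding are assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np
from scipy.stats import ks_2samp

def out_of_specs_alarm(val_conf: np.ndarray, deployed_conf: np.ndarray,
                       alpha: float = 0.01) -> bool:
    """Compare the distribution of the classifier's top softmax
    confidences observed at deployment against the distribution recorded
    on in-specs validation data, using a two-sample Kolmogorov-Smirnov
    test. A significant difference suggests the input distribution has
    shifted, and a warning can be raised."""
    _stat, p_value = ks_2samp(val_conf, deployed_conf)
    return p_value < alpha
```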

Book Chapter DOI
09 Oct 2018
TL;DR: An industry-scale tracking framework based on state-of-the-art methods such as Mask R-CNN is described and evaluated, and a Siamese-network-inspired feature vector matching with a novel feature improver network is adapted, which increases tracking performance.
Abstract: Inside parcel distribution hubs, several tenths of up to 100,000 parcels processed each day get lost. Human operators have to tediously recover these parcels by searching through large amounts of video footage from the installed large-scale camera network. We want to assist these operators and work towards an automatic solution. The challenge lies both in the size of the hub with a high number of cameras and in the adverse conditions. We describe and evaluate an industry-scale tracking framework based on state-of-the-art methods such as Mask R-CNN. Moreover, we adapt a Siamese-network-inspired feature vector matching with a novel feature improver network, which increases tracking performance. Our calibration method exploits a calibration parcel and is suitable for both overlapping and non-overlapping camera views. It requires little manual effort and needs only a single drive-by of the calibration parcel for each conveyor belt. With these methods, most parcels can be tracked start-to-end.

Book Chapter DOI
09 Oct 2018
TL;DR: In this paper, the authors investigate how Siamese networks can be used efficiently for assessing the style compatibility between images of furniture items, and show that the middle layers of pretrained CNNs can capture essential information about furniture style, which allows for efficient applications of such networks for this task.
Abstract: When judging style, a key question that often arises is whether or not a pair of objects are compatible with each other. In this paper we investigate how Siamese networks can be used efficiently for assessing the style compatibility between images of furniture items. We show that the middle layers of pretrained CNNs can capture essential information about furniture style, which allows for efficient applications of such networks for this task. We also use a joint image-text embedding method that allows for the querying of stylistically compatible furniture items, along with additional attribute constraints based on text. To evaluate our methods, we collect and present a large-scale dataset of images of furniture of different style categories, accompanied by text attributes.
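As a rough illustration of the Siamese setup described above, here is a hedged PyTorch sketch; the class name, projection head, and cosine scoring are assumptions rather than the paper's exact design. Mid-layer features from a shared pretrained CNN are projected and compared as a style-compatibility score.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleSiamese(nn.Module):
    """Score the style compatibility of two furniture images from
    mid-layer features of a shared pretrained CNN, since mid layers
    carry style information more than object identity."""
    def __init__(self, mid_layer_extractor: nn.Module, feat_dim: int):
        super().__init__()
        # Frozen pretrained CNN truncated at a middle layer, followed by
        # global pooling, so it returns (B, feat_dim) features.
        self.extract = mid_layer_extractor
        self.head = nn.Linear(feat_dim, 64)

    def forward(self, img_a, img_b):
        za = F.normalize(self.head(self.extract(img_a)), dim=1)
        zb = F.normalize(self.head(self.extract(img_b)), dim=1)
        return (za * zb).sum(dim=1)  # cosine compatibility score

# A contrastive or margin loss on compatible/incompatible pairs trains
# the head while the pretrained backbone stays fixed.
```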