Showing papers presented at the German Conference on Pattern Recognition in 2018


Book Chapter DOI
09 Oct 2018
TL;DR: In this paper, the authors investigate the impact of different flow algorithms and input transformations on the performance of a state-of-the-art action recognition method, and show that optical flow is useful for action recognition because it is invariant to appearance, but that the EPE of current methods is not well correlated with action recognition performance.
Abstract: Most of the top-performing action recognition methods use optical flow as a “black box” input. Here we take a deeper look at the combination of flow and action recognition, and investigate why optical flow is helpful, what makes a flow method good for action recognition, and how we can make it better. In particular, we investigate the impact of different flow algorithms and input transformations to better understand how these affect a state-of-the-art action recognition method. Furthermore, we fine-tune two neural-network flow methods end-to-end on the most widely used action recognition dataset (UCF101). Based on these experiments, we make the following five observations: (1) optical flow is useful for action recognition because it is invariant to appearance, (2) optical flow methods are optimized to minimize end-point error (EPE), but the EPE of current methods is not well correlated with action recognition performance, (3) for the flow methods tested, accuracy at boundaries and at small displacements is most correlated with action recognition performance, (4) training optical flow to minimize classification error instead of minimizing EPE improves recognition performance, and (5) optical flow learned for the task of action recognition differs from traditional optical flow, especially inside the human body and at the boundary of the body. These observations may encourage optical flow researchers to look beyond EPE as a goal, and guide action recognition researchers to seek better motion cues, leading to a tighter integration of the optical flow and action recognition communities.

160 citations


Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors propose an end-to-end clustering training schedule for neural networks that is direct, i.e., the output is a probability distribution over cluster memberships.
Abstract: We propose a novel end-to-end clustering training schedule for neural networks that is direct, i.e. the output is a probability distribution over cluster memberships. A neural network maps images to embeddings. We introduce centroid variables that have the same shape as image embeddings. These variables are jointly optimized with the network’s parameters. This is achieved by a cost function that associates the centroid variables with embeddings of input images. Finally, an additional layer maps embeddings to logits, allowing for the direct estimation of the respective cluster membership. Unlike other methods, this does not require any additional classifier to be trained on the embeddings in a separate step. The proposed approach achieves state-of-the-art results in unsupervised classification and we provide an extensive ablation study to demonstrate its capabilities.

96 citations
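The centroid-variable mechanism described above is compact enough to sketch. Below is a minimal, hypothetical PyTorch illustration, not the authors' code; names such as `embed_net` and `n_clusters` are assumptions. Trainable centroids share the embedding shape, a distance-based cost associates them with the image embeddings, and a final linear layer maps embeddings directly to cluster-membership logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectClustering(nn.Module):
    """Sketch of direct clustering: a network embeds images, trainable
    centroid variables live in the same space, and an extra layer maps
    embeddings to cluster-membership logits (no separate classifier)."""
    def __init__(self, embed_net: nn.Module, embed_dim: int, n_clusters: int):
        super().__init__()
        self.embed_net = embed_net  # CNN mapping images to embeddings
        self.centroids = nn.Parameter(torch.randn(n_clusters, embed_dim))
        self.to_logits = nn.Linear(embed_dim, n_clusters)

    def forward(self, images):
        z = self.embed_net(images)               # (B, embed_dim)
        logits = self.to_logits(z)               # (B, n_clusters)
        # Cost associating centroids with embeddings: pull each embedding
        # towards its softly assigned centroid; optimized jointly with
        # the network's parameters, as described in the abstract.
        d = torch.cdist(z, self.centroids) ** 2  # (B, n_clusters)
        assign = F.softmax(-d, dim=1)
        centroid_cost = (assign * d).sum(dim=1).mean()
        return logits, centroid_cost
```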


Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors propose a deep architecture that keeps the information about the available source domains separate while leveraging generic perceptual information, introducing domain-specific aggregation modules that, through an aggregation layer strategy, merge generic and specific information in an effective manner.
Abstract: Visual recognition systems are meant to work in the real world. For this to happen, they must work robustly in any visual domain, and not only on the data used during training. Within this context, a very realistic scenario deals with domain generalization, i.e. the ability to build visual recognition algorithms able to work robustly in several visual domains, without having access to any information about target data statistics. This paper contributes to this research thread, proposing a deep architecture that keeps the information about the available source-domain data separate while at the same time leveraging generic perceptual information. We achieve this by introducing domain-specific aggregation modules that, through an aggregation layer strategy, are able to merge generic and specific information in an effective manner. Experiments on two different benchmark databases show the power of our approach, reaching the new state of the art in domain generalization.

77 citations


Book Chapter DOI
09 Oct 2018
TL;DR: This paper proposes Convolve, Attend and Spell, an attention-based sequence-to-sequence model for handwritten word recognition that achieves competitive results on the IAM dataset without needing any pre-processing step, predefined lexicon, or language model.
Abstract: This paper proposes Convolve, Attend and Spell, an attention-based sequence-to-sequence model for handwritten word recognition. The proposed architecture has three main parts: an encoder, consisting of a CNN and a bi-directional GRU; an attention mechanism devoted to focusing on the pertinent features; and a decoder formed by a unidirectional GRU, able to spell the corresponding word, character by character. Compared with the recent state of the art, our model achieves competitive results on the IAM dataset without needing any pre-processing step, predefined lexicon, or language model. Code and additional results are available at https://github.com/omni-us/research-seq2seq-HTR.

65 citations


Book Chapter DOI
09 Oct 2018
TL;DR: This work discusses the operational opportunity of having a live face probe to support the morphing detection decision, proposes a detection approach that takes advantage of it, and considers the facial landmark shifting patterns between reference and probe images.
Abstract: Face morphing attacks create face images that are verifiable against multiple identities. Associating such images with identity documents leads to faulty identity links, enabling attacks on operations like border crossing. Most previously proposed morphing attack detection approaches directly classify features extracted from the investigated image. We discuss the operational opportunity of having a live face probe to support the morphing detection decision and propose a detection approach that takes advantage of it. Our proposed solution considers the facial landmark shifting patterns between reference and probe images. These shifts are represented as directed distances to avoid confusion with shifts caused by other variations. We validated our approach using a publicly available database built on 549 identities. Our proposed detection concept is tested with three landmark detectors and shown to outperform the baseline concept based on handcrafted and transferable CNN features.

55 citations
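The directed-distance idea has a simple core that can be sketched. The following hedged numpy/scikit-learn snippet illustrates the general concept only; the function names are hypothetical and the paper's exact feature construction and classifier may differ.

```python
import numpy as np
from sklearn.svm import SVC

def directed_shift_features(ref_landmarks: np.ndarray,
                            probe_landmarks: np.ndarray) -> np.ndarray:
    """Signed per-landmark displacements (dx, dy) between the reference
    (document) image and the live probe, flattened into one vector.
    Keeping the direction of each shift, rather than its magnitude only,
    helps separate morphing-induced shifts from shifts caused by
    ordinary appearance variation."""
    assert ref_landmarks.shape == probe_landmarks.shape  # (n_landmarks, 2)
    return (probe_landmarks - ref_landmarks).ravel()

# Usage sketch: X stacks one feature vector per reference/probe pair,
# y marks pairs whose reference image is a morph.
# clf = SVC(kernel="rbf").fit(X_train, y_train)
```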


Book Chapter DOI
09 Oct 2018
TL;DR: In this article, a self-supervised method for representation learning utilizing RGB and optical flow was proposed based on the observation that cross-modal information has a high semantic meaning and proposed a method to effectively exploit this signal.
Abstract: In this paper we present a self-supervised method for representation learning utilizing two different modalities. Based on the observation that cross-modal information has a high semantic meaning we propose a method to effectively exploit this signal. For our approach we utilize video data since it is available on a large scale and provides easily accessible modalities given by RGB and optical flow. We demonstrate state-of-the-art performance on highly contested action recognition datasets in the context of self-supervised learning. We show that our feature representation also transfers to other tasks and conduct extensive ablation studies to validate our core contributions.

52 citations


Book Chapter DOI
09 Oct 2018
TL;DR: A real-time, low-drift laser odometry approach that tightly integrates sequentially measured 3D multi-beam LIDAR data with inertial measurements, and was ranked within the top five laser-only algorithms of the KITTI odometry benchmark.
Abstract: We propose a real-time, low-drift laser odometry approach that tightly integrates sequentially measured 3D multi-beam LIDAR data with inertial measurements. The laser measurements are motion-compensated using a novel algorithm based on non-rigid registration of two consecutive laser sweeps and a local map. IMU data is tightly integrated by means of factor-graph optimization on a pose graph. We evaluate our method on a public dataset and also obtain results on our own datasets that contain information not commonly found in existing datasets. At the time of writing, our method was ranked within the top five laser-only algorithms of the KITTI odometry benchmark.

45 citations


Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors introduce a more realistic and challenging vehicle re-id benchmark, called Vehicle Re-Identification in Context (VRIC), which contains 60,430 images of 5,622 vehicle identities captured by 60 different cameras at heterogeneous road traffic scenes in both day-time and night-time.
Abstract: Existing vehicle re-identification (re-id) evaluation benchmarks consider strongly artificial test scenarios by assuming the availability of high quality images and fine-grained appearance at an almost constant image scale, reminiscent of images required for Automatic Number Plate Recognition, e.g. VeRi-776. Such assumptions are often invalid in realistic vehicle re-id scenarios where arbitrarily changing image resolutions (scales) are the norm. This makes the existing vehicle re-id benchmarks limited for testing the true performance of a re-id method. In this work, we introduce a more realistic and challenging vehicle re-id benchmark, called Vehicle Re-Identification in Context (VRIC). In contrast to existing vehicle re-id datasets, VRIC is uniquely characterised by vehicle images subject to more realistic and unconstrained variations in resolution (scale), motion blur, illumination, occlusion, and viewpoint. It contains 60,430 images of 5,622 vehicle identities captured by 60 different cameras at heterogeneous road traffic scenes in both day-time and night-time. Given the nature of this new benchmark, we further investigate a multi-scale matching approach to vehicle re-id by learning more discriminative feature representations from multi-resolution images. Extensive evaluations show that the proposed multi-scale method outperforms the state-of-the-art vehicle re-id methods on three benchmark datasets: VehicleID, VeRi-776, and VRIC (Available at http://qmul-vric.github.io).

40 citations
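As a rough illustration of multi-scale matching, here is a hedged PyTorch sketch; the class name, fusion layer, and chosen scales are assumptions rather than the paper's architecture. A shared backbone embeds several resolutions of the same image, and the embeddings are fused into one re-id descriptor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEmbedder(nn.Module):
    """Sketch: embed an image at several resolutions with one shared
    backbone and fuse the results into a single descriptor, so matching
    stays robust to the resolution changes highlighted by VRIC."""
    def __init__(self, backbone: nn.Module, feat_dim: int,
                 scales=(1.0, 0.5, 0.25)):
        super().__init__()
        # backbone must end in global pooling so that any input
        # resolution yields a fixed (B, feat_dim) feature vector.
        self.backbone = backbone
        self.scales = scales
        self.fuse = nn.Linear(feat_dim * len(scales), feat_dim)

    def forward(self, x):
        feats = []
        for s in self.scales:
            xi = x if s == 1.0 else F.interpolate(
                x, scale_factor=s, mode="bilinear", align_corners=False)
            feats.append(self.backbone(xi))      # (B, feat_dim) each
        z = self.fuse(torch.cat(feats, dim=1))
        return F.normalize(z, dim=1)             # unit-length descriptor
```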


Book Chapter DOI
09 Oct 2018
TL;DR: This work presents a novel table tennis robot system with high-accuracy vision detection and fast robot reaction based on an industrial KUKA Agilus R900 sixx robot with 6 DOF, and tests both a curve fitting approach and an extended Kalman filter for predicting the ball’s trajectory.
Abstract: In recent years robotic table tennis has become a popular research challenge for image processing and robot control. Here we present a novel table tennis robot system with high-accuracy vision detection and fast robot reaction. Our system is based on an industrial KUKA Agilus R900 sixx robot with 6 DOF. Four cameras are used for ball position detection at 150 fps. We employ a multiple-camera calibration method, and use iterative triangulation to reconstruct the 3D ball position with an accuracy of 2.0 mm. In order to detect the fast-flying ball in real time, we combine color and background thresholding. For predicting the ball’s trajectory we test both a curve fitting approach and an extended Kalman filter. Our robot is able to play rallies with a human of up to 50 consecutive strokes and has a general hitting rate of 87%.

31 citations
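Of the two trajectory predictors mentioned, the curve-fitting approach is easy to sketch. This hedged numpy illustration is not the authors' implementation: it fits a quadratic in time to each coordinate of the triangulated ball positions (gravity makes the vertical axis quadratic; for short horizons with little drag the other axes are nearly so) and extrapolates.

```python
import numpy as np

def fit_trajectory(t: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """Fit x(t) = a*t^2 + b*t + c per coordinate of the observed 3D ball
    positions pos with shape (n_obs, 3). Returns (3, 3) polynomial
    coefficients, one row per axis."""
    return np.stack([np.polyfit(t, pos[:, k], deg=2) for k in range(3)])

def predict(coeffs: np.ndarray, t_query: float) -> np.ndarray:
    """Evaluate the fitted polynomials at a future time t_query."""
    return np.array([np.polyval(c, t_query) for c in coeffs])

# Usage sketch: positions triangulated at 150 fps feed the fit, and the
# robot queries the predicted ball position a fraction of a second ahead.
```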


Book Chapter DOI
09 Oct 2018
TL;DR: A convolutional neural network with encoder-decoder architecture and a new loss function, the batch soft Dice loss function, used to train the network is introduced, and the resulting model produces segmentations of every OAR in the public MICCAI 2015 Head And Neck Auto-Segmentation Challenge dataset.
Abstract: This paper deals with segmentation of organs at risk (OAR) in the head and neck area in CT images, which is a crucial step for reliable intensity-modulated radiotherapy treatment. We introduce a convolutional neural network with encoder-decoder architecture and a new loss function, the batch soft Dice loss function, used to train the network. The resulting model produces segmentations of every OAR in the public MICCAI 2015 Head And Neck Auto-Segmentation Challenge dataset. Despite the heavy class imbalance in the data, we improve on the accuracy of current state-of-the-art methods by 0.33 mm in terms of average surface distance and by 0.11 in terms of the Dice overlap coefficient on average.

27 citations
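A widely used form of the soft Dice loss computed over an entire batch can be sketched as follows. This is a hedged illustration of the general idea and may differ in detail from the paper's batch soft Dice formulation.

```python
import torch

def batch_soft_dice_loss(probs: torch.Tensor, target: torch.Tensor,
                         eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice with the sums taken over batch and spatial dimensions
    together, one score per class. Pooling the batch this way is one
    means of countering heavy class imbalance: small organs that appear
    in only a few slices still contribute a full per-class term.
    probs, target: (B, C, ...) with probs in [0, 1] and target one-hot."""
    dims = (0,) + tuple(range(2, probs.dim()))       # batch + spatial dims
    intersection = (probs * target).sum(dims)
    denom = probs.sum(dims) + target.sum(dims)
    dice = (2 * intersection + eps) / (denom + eps)  # per-class Dice
    return 1 - dice.mean()
```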


Book Chapter DOI
09 Oct 2018
TL;DR: A novel Multi-stream Long Short-Term Memory (M-LSTM) network for recognizing driver activities is presented; its contextual descriptor is built to be semantically rich and meaningful, and when coupled with appearance features it turns out to be highly discriminative.
Abstract: Automatic recognition of in-vehicle activities has a significant impact on the next generation of intelligent vehicles. In this paper, we present a novel Multi-stream Long Short-Term Memory (M-LSTM) network for recognizing driver activities. We bring together ideas from recent works on LSTMs and transfer learning for object detection and body pose by exploring the use of deep convolutional neural networks (CNN). Recent work has also shown that representations such as hand-object interactions are important cues in characterizing human activities. The proposed M-LSTM integrates these ideas under one framework, where two streams focus on appearance information with two different levels of abstraction. The other two streams analyze the contextual information involving the configuration of body parts and body-object interactions. The proposed contextual descriptor is built to be semantically rich and meaningful, and even when coupled with appearance features it turns out to be highly discriminative. We validate this on two challenging datasets consisting of driver activities.

Book Chapter DOI
09 Oct 2018
TL;DR: The TCE loss is presented, a robust derivative of the standard Cross Entropy loss used in deep learning for classification tasks, that requires no modification of the training regime compared to the CE loss and can be applied in all applications where the CE loss is currently used.
Abstract: We present the Tamed Cross Entropy (TCE) loss function, a robust derivative of the standard Cross Entropy (CE) loss used in deep learning for classification tasks. Unlike other robust losses, the TCE loss is designed to exhibit the same training properties as the CE loss in noiseless scenarios. Therefore, the TCE loss requires no modification of the training regime compared to the CE loss and, in consequence, can be applied in all applications where the CE loss is currently used. We evaluate the TCE loss using the ResNet architecture on four image datasets that we artificially contaminated with various levels of label noise. The TCE loss outperforms the CE loss in every tested scenario.

Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors proposed a new attack scheme for the class of ReLU networks based on a direct optimization on the resulting linear regions, which is less susceptible to defences targeting their functional properties.
Abstract: It has recently been shown that neural networks, but also other classifiers, are vulnerable to so-called adversarial attacks: in object recognition, for example, an almost imperceptible change of the image changes the decision of the classifier. Relatively fast heuristics have been proposed to produce these adversarial inputs, but the problem of finding the optimal adversarial input, that is, the one with the minimal change of the input, is NP-hard. While methods based on mixed-integer optimization which find the optimal adversarial input have been developed, they do not scale to large networks. Currently, the attack scheme proposed by Carlini and Wagner is considered to produce the best adversarial inputs. In this paper we propose a new attack scheme for the class of ReLU networks based on a direct optimization over the resulting linear regions. In our experimental validation we improve over the Carlini-Wagner attack in all but one of 18 experiments, with a relative improvement of up to 9%. As our approach is based on the geometrical structure of ReLU networks, it is less susceptible to defences targeting their functional properties.

Book Chapter DOI
09 Oct 2018
TL;DR: In this paper, a CNN-based object detection approach for multi-view X-ray image data is proposed to detect prohibited objects in carry-on luggage as part of aviation security screening.
Abstract: Motivated by the detection of prohibited objects in carry-on luggage as part of aviation security screening, we develop a CNN-based object detection approach for multi-view X-ray image data. Our contributions are two-fold. First, we introduce a novel multi-view pooling layer to perform a 3D aggregation of 2D CNN features extracted from each view. To that end, our pooling layer exploits the known geometry of the imaging system to ensure geometric consistency of the feature aggregation. Second, we introduce an end-to-end trainable multi-view detection pipeline based on Faster R-CNN, which derives the region proposals and performs the final classification in 3D using these aggregated multi-view features. Our approach shows significant accuracy gains compared to single-view detection while even being more efficient than performing single-view detection in each view.

Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors proposed a new method to count objects of specific categories that are significantly smaller than the ground sampling distance of a satellite image, which is hard due to the cluttered nature of scenes where different object categories occur.
Abstract: We propose a new method to count objects of specific categories that are significantly smaller than the ground sampling distance of a satellite image. This task is hard due to the cluttered nature of scenes where different object categories occur. Target objects can be partially occluded, vary in appearance within the same class, and look similar to objects of other categories. Since traditional object detection is infeasible due to the small size of objects with respect to the pixel size, we cast object counting as a density estimation problem. To distinguish objects of different classes, our approach combines density estimation with semantic segmentation in an end-to-end learnable convolutional neural network (CNN). Experiments show that deep semantic density estimation can robustly count objects of various classes in cluttered scenes. Experiments also suggest that we need specific CNN architectures in remote sensing instead of blindly applying existing ones from computer vision.
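Casting counting as density estimation has a one-line core: the per-class count is the spatial integral of a predicted density map. A hedged PyTorch sketch, with function name and tensor layout assumed:

```python
import torch

def counts_from_density(density: torch.Tensor) -> torch.Tensor:
    """Given per-class density maps of shape (B, C, H, W) predicted by a
    segmentation-style CNN, the count per image and class is the spatial
    sum of the map. Training typically regresses towards ground-truth
    maps built by placing a small Gaussian at each annotated object
    centre, so the maps integrate to the true counts."""
    return density.sum(dim=(2, 3))  # (B, C)
```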

Book Chapter DOI
09 Oct 2018
TL;DR: A convolutional neural network with residual building blocks learns to predict the future irradiance state from a small set of sky images, significantly outperforming the established baseline and state-of-the-art methods for estimating irradiance fluctuations.
Abstract: We present a novel image-based approach for estimating irradiance fluctuations from sky images. Our goal is a very short-term prediction of the irradiance state around a photovoltaic power plant 5–10 min ahead of time, in order to adjust alternative energy sources and ensure a stable energy network. To this end, we propose a convolutional neural network with residual building blocks that learns to predict the future irradiance state from a small set of sky images. Our experiments on two large datasets demonstrate that the network abstracts over local site-specific properties, such as day- and month-dependent sun positions, as well as generic properties, such as moving, forming, or dissolving clouds and seasonal changes. Moreover, our approach significantly outperforms the established baseline and state-of-the-art methods.

Book Chapter DOI
09 Oct 2018
TL;DR: This work introduces a suite of tools that exploit sparsity in both the feature maps and the filter weights, and thereby allow for significantly lower memory footprints and computation times than the conventional dense framework, when processing data with a high degree of sparsity.
Abstract: While CNNs naturally lend themselves to densely sampled data, and sophisticated implementations are available, they lack the ability to efficiently process sparse data. In this work we introduce a suite of tools that exploit sparsity in both the feature maps and the filter weights, and thereby allow for significantly lower memory footprints and computation times than the conventional dense framework, when processing data with a high degree of sparsity. Our scheme provides (i) an efficient GPU implementation of a convolution layer based on direct, sparse convolution; (ii) a filter step within the convolution layer, which we call attention, that prevents fill-in, i.e., the tendency of convolution to rapidly decrease sparsity, and guarantees an upper bound on the computational resources; and (iii) an adaptation of back-propagation that makes it possible to combine our approach with standard learning frameworks, while still exploiting sparsity in the data and the model.

Book Chapter DOI
09 Oct 2018
TL;DR: In this article, a neural network architecture based on an analytical formulation of the parallel-to-fan beam conversion problem following the concept of precision learning is proposed to learn the unknown operators in this conversion in a data-driven manner.
Abstract: In this paper, we derive a neural network architecture based on an analytical formulation of the parallel-to-fan beam conversion problem following the concept of precision learning. The network allows learning the unknown operators in this conversion in a data-driven manner, avoiding interpolation and potential loss of resolution. Integration of known operators results in a small number of trainable parameters that can be estimated from synthetic data only. The concept is evaluated in the context of hybrid MRI/X-ray imaging, where transformation of the parallel-beam MRI projections to fan-beam X-ray projections is required. The proposed method is compared to a traditional rebinning method. The results demonstrate that the proposed method is superior to ray-by-ray interpolation and is able to deliver sharper images using the same amount of parallel-beam input projections, which is crucial for interventional applications. We believe that this approach forms a basis for further work uniting deep learning, signal processing, physics, and traditional pattern recognition.

Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors investigate frame interpolation as a proxy task for optical flow using real movies, and train a CNN unsupervised for temporal interpolation such a network implicitly estimates motion, but cannot handle untextured regions.
Abstract: The difficulty of annotating training data is a major obstacle to using CNNs for low-level tasks in video. Synthetic data often does not generalize to real videos, while unsupervised methods require heuristic losses. Proxy tasks can overcome these issues, and start by training a network for a task for which annotation is easier or which can be trained unsupervised. The trained network is then fine-tuned for the original task using small amounts of ground truth data. Here, we investigate frame interpolation as a proxy task for optical flow. Using real movies, we train a CNN unsupervised for temporal interpolation. Such a network implicitly estimates motion, but cannot handle untextured regions. By fine-tuning on small amounts of ground truth flow, the network can learn to fill in homogeneous regions and compute full optical flow fields. Using this unsupervised pre-training, our network outperforms similar architectures that were trained supervised using synthetic optical flow.

Book Chapter DOI
09 Oct 2018
TL;DR: The recently introduced weight imprinting technique is employed in order to use the available training data to train accurate classifiers in the absence of sufficient examples for some classes.
Abstract: The size of current plankton image datasets renders manual classification virtually infeasible. The training of models for machine classification is complicated by the fact that a large number of classes consist of only a few examples. We employ the recently introduced weight imprinting technique in order to use the available training data to train accurate classifiers in the absence of sufficient examples for some classes.
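Weight imprinting, introduced by Qi et al. (CVPR 2018), has a simple core. This hedged PyTorch sketch, with function name and tensor layout assumed, sets the classifier weight of a rare class to the normalized mean embedding of its few examples instead of learning it by gradient descent.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def imprint_weights(classifier_weight: torch.Tensor,
                    embeddings: torch.Tensor, class_idx: int) -> None:
    """Set the weight row of a low-shot class to the L2-normalized mean
    of its L2-normalized example embeddings, giving the class a sensible
    cosine-similarity decision direction without any gradient updates.
    classifier_weight: (n_classes, embed_dim); embeddings: (n, embed_dim)."""
    mean_emb = F.normalize(embeddings, dim=1).mean(dim=0)
    classifier_weight[class_idx] = F.normalize(mean_emb, dim=0)
```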

Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors estimate optimal weights for correspondences using PointNet and train the network directly with the criterion to minimize the registration error, achieving an accuracy of 0.74 ± 0.26 mm and highly improved robustness.
Abstract: Registration of pre-operative 3-D volumes to intra-operative 2-D X-ray images is important in minimally invasive medical procedures. Rigid registration can be performed by estimating a global rigid motion that optimizes the alignment of local correspondences. However, inaccurate correspondences challenge the registration performance. To minimize their influence, we estimate optimal weights for correspondences using PointNet. We train the network directly with the criterion to minimize the registration error. We propose an objective function which includes point-to-plane correspondence-based motion estimation and projection error computation, thereby enabling the learning of a weighting strategy that optimally fits the underlying formulation of the registration task in an end-to-end fashion. For single-vertebra registration, we achieve an accuracy of 0.74 ± 0.26 mm and highly improved robustness. The success rate is increased from 79.3% to 94.3% and the capture range from 3 mm to 13 mm.

Book Chapter DOI
09 Oct 2018
TL;DR: Information-Theoretic Active Learning (ITAL), a novel batch-mode active learning method for binary classification, turns out to be highly flexible and provides state-of-the-art performance across various datasets, such as MIRFLICKR and ImageNet.
Abstract: We propose Information-Theoretic Active Learning (ITAL), a novel batch-mode active learning method for binary classification, and apply it for acquiring meaningful user feedback in the context of content-based image retrieval. Instead of combining different heuristics such as uncertainty, diversity, or density, our method is based on maximizing the mutual information between the predicted relevance of the images and the expected user feedback regarding the selected batch. We propose suitable approximations to this computationally demanding problem and also integrate an explicit model of user behavior that accounts for possible incorrect labels and unnameable instances. Furthermore, our approach takes into account not only the structure of the data but also the expected model output change caused by the user feedback. In contrast to other methods, ITAL turns out to be highly flexible and provides state-of-the-art performance across various datasets, such as MIRFLICKR and ImageNet.

Book Chapter DOI
09 Oct 2018
TL;DR: This work presents a novel approach that combines unsupervised computation of representative manifold-valued features, called labels, with the spatially regularized geometric assignment of these labels to given manifold-valued data.
Abstract: Manifold models of image features abound in computer vision. We present a novel approach that combines unsupervised computation of representative manifold-valued features, called labels, and the spatially regularized assignment of these labels to given manifold-valued data. Both processes evolve dynamically through two Riemannian gradient flows that are coupled. The representation of labels and assignment variables are kept separate, to enable the flexible application to various manifold data models. As a case study, we apply our approach to the unsupervised learning of covariance descriptors on the positive definite matrix manifold, through spatially regularized geometric assignment.

Book Chapter DOI
09 Oct 2018
TL;DR: The results indicate that the proposed mid-level fusion of LiDAR and camera data improves both the geometric and semantic accuracy of the Stixel model significantly while reducing the computational overhead as well as the amount of generated data in comparison to using a single modality on its own.
Abstract: This paper presents a compact and accurate representation of 3D scenes that are observed by a LiDAR sensor and a monocular camera. The proposed method is based on the well-established Stixel model originally developed for stereo vision applications. We extend this Stixel concept to incorporate data from multiple sensor modalities. The resulting mid-level fusion scheme takes full advantage of the geometric accuracy of LiDAR measurements as well as the high resolution and semantic detail of RGB images. The obtained environment model provides a geometrically and semantically consistent representation of the 3D scene at a significantly reduced amount of data while minimizing information loss at the same time. Since the different sensor modalities are considered as input to a joint optimization problem, the solution is obtained with only minor computational overhead. We demonstrate the effectiveness of the proposed multimodal Stixel algorithm on a manually annotated ground truth dataset. Our results indicate that the proposed mid-level fusion of LiDAR and camera data improves both the geometric and semantic accuracy of the Stixel model significantly while reducing the computational overhead as well as the amount of generated data in comparison to using a single modality on its own.

Book Chapter DOI
09 Oct 2018
TL;DR: The proposed formulation extends a recent sublabel-accurate relaxation for multi-label problems and thus allows for accurate solutions using only a small number of labels, significantly improving over previous approaches towards lifting the total generalized variation.
Abstract: We propose a novel idea to introduce regularization based on second order total generalized variation (TGV) into optimization frameworks based on functional lifting. The proposed formulation extends a recent sublabel-accurate relaxation for multi-label problems and thus allows for accurate solutions using only a small number of labels, significantly improving over previous approaches towards lifting the total generalized variation. Moreover, even recent sublabel-accurate methods exhibit staircasing artifacts when used in conjunction with common first order regularizers such as the total variation (TV). This becomes very obvious, for example, when computing derivatives of disparity maps computed with these methods to obtain normals, which immediately reveals their local flatness and yields inaccurate normal maps. We show that our approach is effective in reducing these artifacts, obtaining disparity maps with a smooth normal field in a single optimization pass.

Book Chapter DOI
09 Oct 2018
TL;DR: This paper proposes a new approach for dense depth estimation based on multimodal stereo images that employs a combined cost function utilizing robust metrics and a transformation to an illumination independent representation and presents a confidence based weighting scheme which allows a pixel-wise weight adjustment within the cost function.
Abstract: In this paper, we propose a new approach for dense depth estimation based on multimodal stereo images. Our approach employs a combined cost function utilizing robust metrics and a transformation to an illumination-independent representation. Additionally, we present a confidence-based weighting scheme which allows a pixel-wise weight adjustment within the cost function. We demonstrate the capabilities of our approach using RGB and thermal images. The resulting depth maps are evaluated by comparing them to depth measurements of a Velodyne HDL-64E LiDAR sensor. We show that our method outperforms current state-of-the-art dense matching methods regarding depth estimation based on multimodal input images.

Book Chapter DOI
09 Oct 2018
TL;DR: In this article, the authors propose an Expectation-Maximization (EM) training scheme that makes oblique-split decision trees end-to-end trainable while remaining deterministic at test time.
Abstract: Conventional decision trees have a number of favorable properties, including interpretability, a small computational footprint and the ability to learn from little training data. However, they lack a key quality that has helped fuel the deep learning revolution: that of being end-to-end trainable. Kontschieder et al. (2015) have addressed this deficit, but at the cost of losing a main attractive trait of decision trees: the fact that each sample is routed along a small subset of tree nodes only. We here propose a model and an Expectation-Maximization training scheme for decision trees that are fully probabilistic at train time, but after an annealing process become deterministic at test time. We analyze the learned oblique split parameters on image datasets and show that neural networks can be trained at each split. In summary, we present an end-to-end learning scheme for deterministic decision trees and present results on par with or superior to published standard oblique decision tree algorithms.
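The annealing idea can be sketched at the level of a single split. This hedged PyTorch snippet is an illustration, not the authors' model: each oblique split routes samples left with a sigmoid probability whose steepness is raised during training, so the tree becomes deterministic at test time.

```python
import torch
import torch.nn as nn

class ObliqueSplit(nn.Module):
    """One probabilistic oblique split: a sample x is routed left with
    probability sigmoid(s * (w.x + b)). Raising the steepness s anneals
    soft routing toward a hard split (route left iff w.x + b > 0)."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, 1)  # oblique hyperplane (w, b)

    def forward(self, x, steepness: float = 1.0):
        return torch.sigmoid(steepness * self.linear(x)).squeeze(-1)

# Training sketch: start with a small steepness and increase it per
# epoch; at test time replace the sigmoid with a hard threshold so each
# sample visits only one root-to-leaf path, as in a conventional tree.
```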

Book Chapter DOI
09 Oct 2018
TL;DR: KS(conf) is described: a procedure for detecting out-of-specs situations that is easy to implement, adds almost no overhead to the system, works with all networks, including pretrained ones, and requires no a priori knowledge about how the data distribution could change.
Abstract: Computer vision systems for automatic image categorization have become accurate and reliable enough that they can run continuously for days or even years as components of real-world commercial applications. A major open problem in this context, however, is quality control. Good classification performance can only be expected if systems run under the specific conditions, in particular data distributions, that they were trained for. Surprisingly, none of the currently used deep network architectures have a built-in functionality that could detect if a network operates on data from a distribution it was not trained for, such that a warning to the human users could potentially be triggered.
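The method's name hints at its mechanism: a Kolmogorov-Smirnov test on the network's confidence values. The following scipy sketch is a hedged reading of that idea; the function name, the use of top softmax confidences, and the thresholding are assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np
from scipy.stats import ks_2samp

def out_of_specs_alarm(val_conf: np.ndarray, deployed_conf: np.ndarray,
                       alpha: float = 0.01) -> bool:
    """Compare the distribution of the classifier's top softmax
    confidences observed at deployment against the distribution recorded
    on in-specs validation data, using a two-sample Kolmogorov-Smirnov
    test. A significant difference suggests the input distribution has
    shifted, and a warning can be raised."""
    _stat, p_value = ks_2samp(val_conf, deployed_conf)
    return p_value < alpha
```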

Book Chapter DOI
09 Oct 2018
TL;DR: An industry-scale tracking framework based on state-of-the-art methods such as Mask R-CNN is described and evaluated, and a Siamese-network-inspired feature vector matching with a novel feature improver network is adapted, which increases tracking performance.
Abstract: Inside parcel distribution hubs, several tenths of up to 100,000 parcels processed each day get lost. Human operators have to tediously recover these parcels by searching through large amounts of video footage from the installed large-scale camera network. We want to assist these operators and work towards an automatic solution. The challenge lies both in the size of the hub with a high number of cameras and in the adverse conditions. We describe and evaluate an industry-scale tracking framework based on state-of-the-art methods such as Mask R-CNN. Moreover, we adapt a Siamese-network-inspired feature vector matching with a novel feature improver network, which increases tracking performance. Our calibration method exploits a calibration parcel and is suitable for both overlapping and non-overlapping camera views. It requires little manual effort and needs only a single drive-by of the calibration parcel for each conveyor belt. With these methods, most parcels can be tracked start-to-end.

Book Chapter DOI
09 Oct 2018
TL;DR: In this paper, the authors investigate how Siamese networks can be used efficiently for assessing the style compatibility between images of furniture items, and show that the middle layers of pretrained CNNs can capture essential information about furniture style, which allows for efficient applications of such networks for this task.
Abstract: When judging style, a key question that often arises is whether or not a pair of objects are compatible with each other. In this paper we investigate how Siamese networks can be used efficiently for assessing the style compatibility between images of furniture items. We show that the middle layers of pretrained CNNs can capture essential information about furniture style, which allows for efficient applications of such networks for this task. We also use a joint image-text embedding method that allows for the querying of stylistically compatible furniture items, along with additional attribute constraints based on text. To evaluate our methods, we collect and present a large-scale dataset of images of furniture of different style categories, accompanied by text attributes.
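As a rough illustration of the Siamese setup described above, here is a hedged PyTorch sketch; the class name, projection head, and cosine scoring are assumptions rather than the paper's exact design. Mid-layer features from a shared pretrained CNN are projected and compared as a style-compatibility score.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleSiamese(nn.Module):
    """Score the style compatibility of two furniture images from
    mid-layer features of a shared pretrained CNN, since mid layers
    carry style information more than object identity."""
    def __init__(self, mid_layer_extractor: nn.Module, feat_dim: int):
        super().__init__()
        # Frozen pretrained CNN truncated at a middle layer, followed by
        # global pooling, so it returns (B, feat_dim) features.
        self.extract = mid_layer_extractor
        self.head = nn.Linear(feat_dim, 64)

    def forward(self, img_a, img_b):
        za = F.normalize(self.head(self.extract(img_a)), dim=1)
        zb = F.normalize(self.head(self.extract(img_b)), dim=1)
        return (za * zb).sum(dim=1)  # cosine compatibility score

# A contrastive or margin loss on compatible/incompatible pairs trains
# the head while the pretrained backbone stays fixed.
```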