
Showing papers presented at "German Conference on Pattern Recognition in 2014"


Book ChapterDOI
02 Sep 2014
TL;DR: A structured lighting system is presented for creating high-resolution stereo datasets of static indoor scenes with highly accurate ground-truth disparities, using novel techniques for efficient 2D subpixel correspondence search and self-calibration of cameras and projectors with modeling of lens distortion.
Abstract: We present a structured lighting system for creating high-resolution stereo datasets of static indoor scenes with highly accurate ground-truth disparities. The system includes novel techniques for efficient 2D subpixel correspondence search and self-calibration of cameras and projectors with modeling of lens distortion. Combining disparity estimates from multiple projector positions, we are able to achieve a disparity accuracy of 0.2 pixels on most observed surfaces, including in half-occluded regions. We contribute 33 new 6-megapixel datasets obtained with our system and demonstrate that they present new challenges for the next generation of stereo algorithms.

1,071 citations
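
A key ingredient above is merging disparity estimates from multiple projector positions into one high-accuracy map. As a rough illustration of that fusion step only (not the paper's pipeline; the function name, NaN convention, and support threshold are invented), a per-pixel robust combination might look like this:

```python
import numpy as np

def merge_disparities(disparity_maps, min_support=2):
    """Fuse per-projector disparity maps into one robust per-pixel estimate.

    disparity_maps: list of HxW float arrays with np.nan where a projector
    position gave no estimate (e.g., in its shadow). Threshold is invented.
    """
    stack = np.stack(disparity_maps)                 # P x H x W
    merged = np.nanmedian(stack, axis=0)             # robust per-pixel fusion
    support = np.sum(~np.isnan(stack), axis=0)       # how many maps saw it
    merged[support < min_support] = np.nan           # too few observations
    return merged

# Toy usage: three noisy 4x4 maps around a true disparity of 10.0.
maps = [10.0 + 0.1 * np.random.randn(4, 4) for _ in range(3)]
maps[0][0, 0] = np.nan                               # a half-occluded pixel
print(merge_disparities(maps).round(2))
```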


Book ChapterDOI
02 Sep 2014
TL;DR: This paper follows a two-step approach where it first learns to predict a semantic representation from video and then generates natural language descriptions from it, and models across-sentence consistency at the level of the SR by enforcing a consistent topic.
Abstract: Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches for automatic video description focus on generating only single sentences and are not able to vary the descriptions’ level of detail. In this paper, we address both of these limitations: for a variable level of detail we produce coherent multi-sentence descriptions of complex videos. To understand the difference between detailed and short descriptions, we collect and analyze a video description corpus of three levels of detail. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from it. For our multi-sentence descriptions we model across-sentence consistency at the level of the SR by enforcing a consistent topic. Human judges rate our descriptions as more readable, correct, and relevant than related work.

244 citations


Book ChapterDOI
02 Sep 2014
TL;DR: This work directly learns a mapping from image patches, corrupted by missing pixels, onto complete image patches, represented as a deep neural network that is automatically trained on a large image dataset to exploit the shape information of the missing regions.
Abstract: Most inpainting approaches require a good image model to infer the unknown pixels. In this work, we directly learn a mapping from image patches, corrupted by missing pixels, onto complete image patches. This mapping is represented as a deep neural network that is automatically trained on a large image data set. In particular, we are interested in the question whether it is helpful to exploit the shape information of the missing regions, i.e. the masks, which is something commonly ignored by other approaches. In comprehensive experiments on various images, we demonstrate that our learning-based approach is able to use this extra information and can achieve state-of-the-art inpainting results. Furthermore, we show that training with such extra information is useful for blind inpainting, where the exact shape of the missing region might be uncertain, for instance due to aliasing effects.

143 citations
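
The core idea of exploiting the mask is simply to hand the network the corruption pattern as an extra input channel alongside the corrupted patch. A minimal PyTorch sketch of that input design, with an illustrative toy architecture that is not the paper's:

```python
import torch
import torch.nn as nn

class MaskAwareInpainter(nn.Module):
    """Toy patch-to-patch inpainting net; the mask enters as a 2nd channel."""

    def __init__(self, patch=17):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * patch * patch, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, patch * patch),        # predict the complete patch
        )

    def forward(self, corrupted, mask):
        x = torch.cat([corrupted, mask], dim=1)   # B x 2 x H x W
        return self.net(x)

# One training step: corrupt clean patches, let the net restore them.
model = MaskAwareInpainter()
clean = torch.rand(8, 1, 17, 17)
mask = (torch.rand(8, 1, 17, 17) > 0.3).float()   # 1 = pixel kept
loss = nn.functional.mse_loss(model(clean * mask, mask), clean.flatten(1))
loss.backward()
print(loss.item())
```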


Book ChapterDOI
02 Sep 2014
TL;DR: This work proposes a method to label roads in aerial images and extract a topologically correct road network; the method outperforms several baselines on two challenging data sets, both in terms of precision/recall and w.r.t. topological correctness.
Abstract: We propose a method to label roads in aerial images and extract a topologically correct road network. Three factors make road extraction difficult: (i) high intra-class variability due to clutter like cars, markings, shadows on the roads; (ii) low inter-class variability, because some non-road structures are made of similar materials; and (iii) most importantly, a complex structural prior: roads form a connected network of thin segments, with slowly changing width and curvature, often bordered by buildings, etc. We model this rich, but complicated contextual information at two levels. Locally, the context and layout of roads is learned implicitly, by including multi-scale appearance information from a large neighborhood in the per-pixel classifier. Globally, the network structure is enforced explicitly: we first detect promising stretches of road via shortest-path search on the per-pixel evidence, and then select pixels on an optimal subset of these paths by energy minimization in a CRF, where each putative path forms a higher-order clique. The model outperforms several baselines on two challenging data sets, both in terms of precision/recall and w.r.t. topological correctness.

67 citations
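
The "promising stretches of road via shortest-path search on the per-pixel evidence" step can be illustrated compactly: turn classifier probabilities into costs and run a cheapest-path search between candidate endpoints. A sketch using scikit-image (the stripe data and endpoints are invented):

```python
import numpy as np
from skimage.graph import route_through_array

# Per-pixel road evidence p(road) from a classifier (a toy stripe here).
prob = np.full((50, 50), 0.1)
prob[25, :] = 0.9
cost = -np.log(np.clip(prob, 1e-6, 1.0))     # low cost where road is likely

# Cheapest path between two candidate endpoints = one putative road stretch,
# which would then enter the CRF as a higher-order clique.
path, total_cost = route_through_array(cost, (25, 0), (25, 49),
                                       fully_connected=True)
print(len(path), round(total_cost, 2))
```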


Book ChapterDOI
02 Sep 2014
TL;DR: This paper builds on the recent dataset [2] leveraging the existing taxonomy of human activities and reveals that holistic and pose-based methods are highly complementary, and their performance varies significantly depending on the activity.
Abstract: Holistic methods based on dense trajectories [29, 30] are currently the de facto standard for recognition of human activities in video. Whether holistic representations will sustain or will be superseded by higher level video encoding in terms of body pose and motion is the subject of an ongoing debate [12]. In this paper we aim to clarify the underlying factors responsible for good performance of holistic and pose-based representations. To that end we build on our recent dataset [2] leveraging the existing taxonomy of human activities. This dataset includes 24,920 video snippets covering 410 human activities in total. Our analysis reveals that holistic and pose-based methods are highly complementary, and their performance varies significantly depending on the activity. We find that holistic methods are mostly affected by the number and speed of trajectories, whereas pose-based methods are mostly influenced by viewpoint of the person. We observe striking performance differences across activities: for certain activities results with pose-based features are more than twice as accurate compared to holistic features, and vice versa. The best performing approach in our comparison is based on the combination of holistic and pose-based approaches, which again underlines their complementarity.

58 citations


Book ChapterDOI
02 Sep 2014
TL;DR: The first principled explanation of the empirically successful semi-global matching algorithm is offered, clarifying its exact relation to belief propagation and tree-reweighted message passing.
Abstract: Semi-global matching, originally introduced in the context of dense stereo, is a very successful heuristic to minimize the energy of a pairwise multi-label Markov Random Field defined on a grid. We offer the first principled explanation of this empirically successful algorithm, and clarify its exact relation to belief propagation and tree-reweighted message passing. One outcome of this new connection is an uncertainty measure for the MAP label of a variable in a Markov Random Field.

57 citations
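
For reference, semi-global matching aggregates unary matching costs along several scan directions with a small penalty P1 for disparity changes of one and a larger penalty P2 for arbitrary jumps. A NumPy sketch of a single left-to-right pass (a textbook rendering of the heuristic, not code from the paper):

```python
import numpy as np

def aggregate_left_to_right(cost, P1=1.0, P2=8.0):
    """One semi-global matching pass: aggregate costs along each image row.

    cost: H x W x D unary matching costs. The recurrence adds the cheapest
    transition from the previous pixel's labels; the full heuristic sums
    such passes over several scan directions.
    """
    H, W, D = cost.shape
    L = np.empty_like(cost)
    L[:, 0] = cost[:, 0]
    for x in range(1, W):
        prev = L[:, x - 1]                              # H x D
        best_prev = prev.min(axis=1, keepdims=True)     # H x 1
        padded = np.pad(prev, ((0, 0), (1, 1)), constant_values=np.inf)
        shift = np.minimum(padded[:, :-2], padded[:, 2:]) + P1
        jump = best_prev + P2
        L[:, x] = (cost[:, x] + np.minimum(np.minimum(prev, shift), jump)
                   - best_prev)
    return L

costs = np.random.rand(4, 6, 8)   # toy 4x6 image with 8 disparity labels
print(aggregate_left_to_right(costs, P1=0.1, P2=0.5).shape)
```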


Book ChapterDOI
02 Sep 2014
TL;DR: This paper is the first to transfer and adapt submapping to RGB-D sensors and to provide a detailed analysis of the resulting gain, finding that the method outperforms several state-of-the-art approaches in both speed and accuracy.
Abstract: The key contribution of this paper is a novel submapping technique for RGB-D-based bundle adjustment. Our approach significantly speeds up 3D object reconstruction with respect to full bundle adjustment while generating visually compelling 3D models of high metric accuracy. While submapping has been explored previously for mono and stereo cameras, we are the first to transfer and adapt this concept to RGB-D sensors and to provide a detailed analysis of the resulting gain. In our approach, we partition the input data uniformly into submaps to optimize them individually by minimizing the 3D alignment error. Subsequently, we fix the interior variables and optimize only over the separator variables between the submaps. As we demonstrate in this paper, our method reduces the runtime of full bundle adjustment by 32 % on average while still being able to deal with real-world noise of cheap commodity sensors. We evaluated our method on a large number of benchmark datasets, and found that we outperform several state-of-the-art approaches both in terms of speed and accuracy. Furthermore, we present highly accurate 3D reconstructions of various objects to demonstrate the validity of our approach.

44 citations


Book ChapterDOI
02 Sep 2014
TL;DR: Formalizing a connection between Random Forests and ANNs allows exploiting the former to initialize the latter; further parameter optimization within the ANN framework yields models that are intermediate between RF and ANN, and achieve performance better than RF and ANN on the majority of the UCI datasets used for benchmarking.
Abstract: While Artificial Neural Networks (ANNs) are highly expressive models, they are hard to train from limited data. Formalizing a connection between Random Forests (RFs) and ANNs allows exploiting the former to initialize the latter. Further parameter optimization within the ANN framework yields models that are intermediate between RF and ANN, and achieve performance better than RF and ANN on the majority of the UCI datasets used for benchmarking.

43 citations
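
The first step of such a mapping is mechanical: every internal split node of a trained forest tests one feature against one threshold, which is exactly a first-layer neuron with a one-hot weight vector. A partial sketch with scikit-learn (the paper's full construction additionally encodes leaf membership in a second layer before fine-tuning; dataset and hyperparameters are only for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=10, max_depth=4,
                            random_state=0).fit(X, y)

# Every internal split node tests one feature against one threshold.
splits = []
for est in rf.estimators_:
    t = est.tree_
    for node in range(t.node_count):
        if t.children_left[node] != -1:      # internal node, not a leaf
            splits.append((t.feature[node], t.threshold[node]))

# First-layer initialization: neuron i fires on "x[feature_i] > threshold_i".
W1 = np.zeros((len(splits), X.shape[1]))
b1 = np.zeros(len(splits))
for i, (f, thr) in enumerate(splits):
    W1[i, f] = 1.0
    b1[i] = -thr
print(W1.shape)  # a next layer would encode leaf membership, then fine-tune
```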


Book ChapterDOI
02 Sep 2014
TL;DR: A novel model is introduced that combines Deep Convolutional Neural Networks with a global inference model derived from a convex variational relaxation of the minimum s-t cut problem on graphs, which is frequently used for the task of image segmentation.
Abstract: In this paper we introduce a novel model that combines Deep Convolutional Neural Networks with a global inference model. Our model is derived from a convex variational relaxation of the minimum s-t cut problem on graphs, which is frequently used for the task of image segmentation. We treat the outputs of Convolutional Neural Networks as the unary and pairwise potentials of a graph and derive a smooth approximation to the minimum s-t cut problem. During training, this approximation facilitates the adaptation of the Convolutional Neural Network to the smoothing that is induced by the global model. The training algorithm can be understood as a modified backpropagation algorithm that explicitly takes the global inference layer into account.

39 citations


Book ChapterDOI
02 Sep 2014
TL;DR: A flow-based propagation of user scribbles from the first to subsequent video frames is proposed, which drastically reduces the user input; the approach is compared to state-of-the-art video completion methods.
Abstract: We propose a framework for temporally consistent video completion. To this end we generalize the exemplar-based inpainting method of Criminisi et al. [7] to video inpainting. Specifically we address two important issues: Firstly, we propose a color and optical flow inpainting to ensure temporal consistency of inpainting even for complex motion of foreground and background. Secondly, rather than requiring the user to hand-label the inpainting region in every single image, we propose a flow-based propagation of user scribbles from the first to subsequent video frames which drastically reduces the user input. Experimental comparisons to state-of-the-art video completion methods demonstrate the benefits of the proposed approach.

34 citations
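
The scribble-propagation idea can be sketched in a few lines: estimate optical flow between consecutive frames and warp the user's mask along it. A rough OpenCV illustration (Farneback flow stands in for whichever flow method the authors use; the toy frames are invented):

```python
import cv2
import numpy as np

def propagate_scribbles(prev_gray, next_gray, scribble_mask):
    """Carry a user scribble mask from one frame to the next via optical flow."""
    # Flow from the *next* frame back to the previous one, so we can sample
    # the previous mask at the position each next-frame pixel came from.
    flow = cv2.calcOpticalFlowFarneback(next_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (xs + flow[..., 0]).astype(np.float32)
    map_y = (ys + flow[..., 1]).astype(np.float32)
    return cv2.remap(scribble_mask, map_x, map_y, cv2.INTER_NEAREST)

prev = np.zeros((64, 64), np.uint8); prev[20:40, 20:40] = 255
nxt = np.roll(prev, 3, axis=1)                   # scene shifts right by 3 px
mask = np.zeros((64, 64), np.uint8); mask[25:35, 25:35] = 1
print(propagate_scribbles(prev, nxt, mask).sum())  # mask follows the motion
```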


Book ChapterDOI
02 Sep 2014
TL;DR: It is shown that the integration of learned must-link constraints not only improves the segmentation result but also significantly reduces the required runtime, making the use of costly spectral methods possible for today’s high quality video.
Abstract: In recent years it has been shown that clustering and segmentation methods can greatly benefit from the integration of prior information in terms of must-link constraints. Very recently the use of such constraints has been integrated in a rigorous manner also in graph-based methods such as normalized cut. On the other hand spectral clustering as relaxation of the normalized cut has been shown to be among the best methods for video segmentation. In this paper we merge these two developments and propose to learn must-link constraints for video segmentation with spectral clustering. We show that the integration of learned must-link constraints not only improves the segmentation result but also significantly reduces the required runtime, making the use of costly spectral methods possible for today’s high quality video.
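
One simple way to see the effect of must-link constraints in a spectral method is to force high affinity between constrained pairs before clustering. This toy scikit-learn sketch is only an illustration; the paper's integration into the normalized cut relaxation is considerably more rigorous, and the data here are placeholders for video superpixels or trajectories:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),      # stand-ins for video
               rng.normal(2, 0.3, (20, 2))])     # superpixels/trajectories

d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
A = np.exp(-d2 / 0.5)                            # Gaussian affinity

# Inject (learned) must-link constraints by forcing maximal affinity.
for i, j in [(0, 5), (1, 7), (20, 25)]:
    A[i, j] = A[j, i] = A.max()

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(labels)
```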

Book ChapterDOI
02 Sep 2014
TL;DR: This work proposes a framework for hand tracking that can capture the motion of two interacting hands using only a single, inexpensive RGB-D camera, and combines a generative model with collision detection and discriminatively learned salient points.
Abstract: Hand motion capture has been an active research topic, following the success of full-body pose tracking. Despite similarities, hand tracking proves to be more challenging, characterized by a higher dimensionality, severe occlusions and self-similarity between fingers. For this reason, most approaches rely on strong assumptions, like hands in isolation or expensive multi-camera systems, that limit practical use. In this work, we propose a framework for hand tracking that can capture the motion of two interacting hands using only a single, inexpensive RGB-D camera. Our approach combines a generative model with collision detection and discriminatively learned salient points. We quantitatively evaluate our approach on 14 new sequences with challenging interactions.

Book ChapterDOI
02 Sep 2014
TL;DR: This paper shows that state-of-the-art methods combining detection with segmentation can be significantly improved by introducing a new iterative classification, statistical modeling, and segmentation procedure based on a detect-and-merge algorithm.
Abstract: There have recently been advances in the area of fully automatic detection of clustered objects in color images. State of the art methods combine detection with segmentation. In this paper we show that these methods can be significantly improved by introducing a new iterative classification, statistical modeling, and segmentation procedure. The proposed method uses a detect-and-merge algorithm, which iteratively finds and validates new objects and subsequently updates the statistical model, while converging in very few iterations.

Book ChapterDOI
02 Sep 2014
TL;DR: This work presents a new global optimization approach for multiple people tracking based on a hierarchical tracklet framework that casts the optimization problem as a minimum cost arborescence problem in an acyclic directed graph, where a tracking solution can be obtained in linear time.
Abstract: We present a new global optimization approach for multiple people tracking based on a hierarchical tracklet framework. A new type of tracklets is introduced, which we call tree tracklets. They contain bifurcations to naturally deal with ambiguous tracking situations. Difficult decisions are postponed to a later iteration of the hierarchical framework, when more information is available. We cast the optimization problem as a minimum cost arborescence problem in an acyclic directed graph, where a tracking solution can be obtained in linear time. Experiments on six publicly available datasets show that the method performs well when compared to state-of-the-art tracking algorithms.
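
The optimization step is easy to illustrate: on an acyclic tracklet graph, the minimum cost arborescence reduces to picking each node's cheapest incoming edge, which is what makes the linear-time claim plausible. A toy sketch using networkx only for graph storage (graph and costs are invented):

```python
import networkx as nx

# Toy tracklet graph: a virtual root S feeds possible track starts; edge
# weights are linking costs between (tree) tracklets.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("S", "t1", 1.0), ("S", "t2", 1.5),
    ("t1", "t3", 0.4), ("t2", "t3", 0.9),    # ambiguous link: who feeds t3?
    ("t1", "t4", 0.7), ("t3", "t5", 0.2),
])

# Because the graph is acyclic, the minimum cost arborescence is just each
# node's cheapest incoming edge, found in a single linear-time sweep.
best_parent = {}
for u, v, w in G.edges(data="weight"):
    if v not in best_parent or w < best_parent[v][1]:
        best_parent[v] = (u, w)
print(best_parent)   # t3 is linked to t1, resolving the ambiguity
```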

Book ChapterDOI
02 Sep 2014
TL;DR: This paper presents a principled way to additionally integrate top-down prior information about object location and shape that arises from independent system modules, ranging from geometric cues up to highly confident object detections, in a consistent scene representation for traffic scenarios.
Abstract: This paper presents a stereo vision-based scene model for traffic scenarios. Our approach effectively couples bottom-up image segmentation with object-level knowledge in a sound probabilistic fashion. The relevant scene structure, i.e. obstacles and freespace, is encoded using individual Stixels as building blocks that are computed bottom-up from dense disparity images. We present a principled way to additionally integrate top-down prior information about object location and shape that arises from independent system modules, ranging from geometric cues up to highly confident object detections. This results in an efficient exploration of orthogonal image-based cues, such as disparity and gray-level intensity data, combined in a consistent scene representation. The overall segmentation problem is modeled as a Markov Random Field and solved efficiently through Dynamic Programming.

Book ChapterDOI
02 Sep 2014
TL;DR: This work proposes a lens-based depth estimation scheme based on a novel adaptive lens selection strategy and shows that this strategy achieves similar error rates as selection strategies with a fixed number of lenses, while being computationally less time consuming.
Abstract: Multi-focus portable plenoptic camera devices provide a reasonable tradeoff between spatial and angular resolution while enlarging the depth of field of a standard camera. Many applications using the data captured by these camera devices require or benefit from correspondences established between the single microlens images. In this work we propose a lens-based depth estimation scheme based on a novel adaptive lens selection strategy. Coarse depth estimates serve as indicators for suitable target lenses. The selection criterion accounts for lens overlap and the amount of defocus blur between the reference and possible target lenses. The depth maps are regularized using a semi-global strategy. For insufficiently textured scenes, we further incorporate a semi-global coarse regularization with respect to the lens-grid. In contrast to algorithms operating on the complete lightfield, our algorithm has a low memory footprint. The resulting per-lens dense depth maps are well suited for volumetric surface reconstruction techniques. We show that our selection strategy achieves similar error rates as selection strategies with a fixed number of lenses, while being computationally less time consuming. Results are presented for synthetic as well as real-world datasets.

Book ChapterDOI
02 Sep 2014
TL;DR: A simple and effective framework for multi-view image sequence interpolation in space and time is proposed and two novel filtering approaches for outlier elimination and a robust approach for match extrapolations at the image boundaries are introduced.
Abstract: We propose a simple and effective framework for multi-view image sequence interpolation in space and time. For spatial view point interpolation we present a robust feature-based matching algorithm that allows for wide-baseline camera configurations. To this end, we introduce two novel filtering approaches for outlier elimination and a robust approach for match extrapolations at the image boundaries. For small-baseline and temporal interpolations we rely on an established optical flow based approach. We perform a quantitative and qualitative evaluation of our framework and present applications and results. Our method has a low runtime and results can compete with state-of-the-art methods.

Book ChapterDOI
02 Sep 2014
TL;DR: In this paper, the trimmed reconstruction error is minimized directly over the Stiefel manifold, avoiding the deflation often used by projection pursuit methods; the resulting method has no free parameter and is computationally very efficient.
Abstract: It is well known that Principal Component Analysis (PCA) is strongly affected by outliers and a lot of effort has been put into robustification of PCA. In this paper we present a new algorithm for robust PCA minimizing the trimmed reconstruction error. By directly minimizing over the Stiefel manifold, we avoid deflation as often used by projection pursuit methods. In distinction to other methods for robust PCA, our method has no free parameter and is computationally very efficient. We illustrate the performance on various datasets including an application to background modeling and subtraction. Our method performs better or similar to current state-of-the-art methods while being faster.
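
The trimmed reconstruction error objective can be sketched with a simple alternating scheme: select the points with the smallest reconstruction error, then refit an orthonormal basis (a point on the Stiefel manifold) to them. Note the paper instead minimizes directly over the Stiefel manifold; this alternation, with invented names and a fixed trimming fraction, only illustrates the objective:

```python
import numpy as np

def trimmed_pca(X, k, keep=0.8, iters=50):
    """Sketch of robust PCA by minimizing the trimmed reconstruction error."""
    n, d = X.shape
    h = int(keep * n)
    mu = X.mean(0)
    U = np.linalg.qr(np.random.randn(d, k))[0]     # random Stiefel point
    for _ in range(iters):
        R = X - mu
        err = ((R - (R @ U) @ U.T) ** 2).sum(1)    # per-point residual
        inliers = np.argsort(err)[:h]              # trim the largest errors
        mu = X[inliers].mean(0)
        C = X[inliers] - mu
        U = np.linalg.svd(C, full_matrices=False)[2][:k].T
    return U, mu, inliers

X = np.random.randn(200, 5) @ np.diag([3, 2, 1, 0.1, 0.1])
X[:10] += 20                                       # gross outliers
U, mu, inliers = trimmed_pca(X, k=2)
print(U.shape, np.intersect1d(np.arange(10), inliers).size)  # outliers trimmed
```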

Book ChapterDOI
02 Sep 2014
TL;DR: This work proposes, for the first time, a general purpose segmentation algorithm to extract the most informative and interpretable features as convolution kernels while simultaneously building a multivariate decision tree.
Abstract: Most computer vision and especially segmentation tasks require to extract features that represent local appearance of patches. Relevant features can be further processed by learning algorithms to infer posterior probabilities that pixels belong to an object of interest. Deep Convolutional Neural Networks (CNN) define a particularly successful class of learning algorithms for semantic segmentation, although they proved to be very slow to train even when employing special purpose hardware. We propose, for the first time, a general purpose segmentation algorithm to extract the most informative and interpretable features as convolution kernels while simultaneously building a multivariate decision tree. The algorithm trains several orders of magnitude faster than regular CNNs and achieves state of the art results in processing quality on benchmark datasets.

Book ChapterDOI
02 Sep 2014
TL;DR: This work proposes a solution based on a single video camera that is not only far less intrusive but also much cheaper, and outperforms current motion segmentation and tracking approaches for Cerebral Palsy detection.
Abstract: Motions of organs or extremities are important features for clinical diagnosis. However, tracking and segmentation of complex, quickly changing motion patterns is challenging, certainly in the presence of occlusions. Neither state-of-the-art tracking nor motion segmentation approaches are able to deal with such cases. Thus far, motion capture systems or the like were needed, which are complicated to handle and which impact the movements. We propose a solution based on a single video camera that is not only far less intrusive, but also a lot cheaper. The limitations of tracking and motion segmentation are overcome by a new approach to integrate prior knowledge in the form of weak labeling into motion segmentation. Using the example of Cerebral Palsy detection, we segment motion patterns of infants into the different body parts by analyzing body movements. Our experimental results show that our approach outperforms current motion segmentation and tracking approaches.

Book ChapterDOI
02 Sep 2014
TL;DR: This paper focuses on image segmentation where some label classes, such as the background class acting as a pool of objects, exhibit strong internal boundaries, while other label classes should be modeled as a single region even if some internal boundaries are visible.
Abstract: For image segmentation, recent advances in optimization make it possible to combine noisy region appearance terms with pairwise terms which can not only discourage, but also encourage label transitions, depending on boundary evidence. These models have the potential to overcome problems such as the shrinking bias. However, with the ability to encourage label transitions comes a different problem: strong boundary evidence can overrule weak region appearance terms to create new regions out of nowhere. While some label classes, such as the background class, which is the pool of objects, exhibit strong internal boundaries, other label classes should be modeled as a single region, even if some internal boundaries are visible.

Book ChapterDOI
02 Sep 2014
TL;DR: It is shown that, to obtain a statistically sound result, intuitively appealing deterministic reduction strategies are problematic, and that a simple reduction strategy based on random deletion performs best in the evaluation.
Abstract: This paper deals with efficient means for camera pose estimation for difficult scenes. Particularly, we speed up the combination of image triplets to image sets by hierarchical merging and a reduction of the number of merged points. By image sets we denote a generalization of image sequences where images can be linked in multiple directions, i.e., they can form a graph. To obtain reliable results for triplets, we use large numbers of corresponding points. For a high-quality and yet efficient merging of the triplets we propose strategies for the reduction of the number of points. The strategies are evaluated based on statistical measures employing the full covariance information for the camera poses from bundle adjustment. We show that to obtain a statistically sound result, intuitively appealing deterministic reduction strategies are problematic, and that a simple reduction strategy based on random deletion performs best in our evaluation. We also discuss the benefits of the evaluation measures for finding conceptual and implementation weaknesses. The paper is illustrated with a number of experiments giving standard deviations for all values.
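
The winning reduction strategy is almost trivially simple, which is part of the paper's point: uniform random deletion avoids the selection bias that deterministic "keep the best points" heuristics introduce into the pose covariances. A sketch (function name and data are placeholders):

```python
import numpy as np

def reduce_points(points, target, seed=0):
    """Keep a uniform random subset of tie points before merging triplets."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=target, replace=False)
    return [points[i] for i in sorted(idx)]

tie_points = [f"pt{i}" for i in range(10000)]    # placeholder tie points
print(len(reduce_points(tie_points, 500)))
```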

Book ChapterDOI
02 Sep 2014
TL;DR: This work presents a novel approach to integrate spatial information into the BoVWs model in a rotation-invariant way by encoding the triangular relationship among the positions of identical visual words in the 2D image space, and validates the proposed method for rotation invariance on datasets of ancient coins and butterflies.
Abstract: Incorporating the spatial information of visual words enhances the performance of the well-known bag-of-visual words (BoVWs) model for problems like object category recognition. However, object images can undergo various in-plane rotations, due to which the spatial information must be added to the BoVWs model in a rotation-invariant manner. We present a novel approach to integrate the spatial information into the BoVWs model in a rotation-invariant way by encoding the triangular relationship among the positions of identical visual words in the 2D image space. Our proposed BoVWs model is based on densely sampled local features for which the dominant orientations are calculated. Thus we achieve rotation-invariance both globally and locally. We validate our proposed method for rotation-invariance on datasets of ancient coins and butterflies and achieve better performance than the conventional BoVWs model.
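
The rotation-invariance argument rests on the fact that distances between word positions are unchanged by in-plane rotation, so any feature built from triangle side lengths is invariant too. A toy sketch of one possible triangle signature (the exact encoding in the paper may differ):

```python
import numpy as np
from itertools import combinations

def triangle_signature(p, q, r):
    """Rotation-invariant description of a triangle of word positions:
    its sorted side lengths, scale-normalized by the longest side."""
    sides = sorted([np.linalg.norm(p - q), np.linalg.norm(q - r),
                    np.linalg.norm(r - p)])
    return np.array(sides) / sides[-1]

# Positions of one visual word in an image; rotating the image leaves the
# signatures unchanged, so histograms over them remain comparable.
pos = np.array([[10, 10], [40, 12], [25, 40], [60, 55]], float)
sigs = [triangle_signature(*pos[list(t)])
        for t in combinations(range(len(pos)), 3)]
print(np.round(sigs, 2))
```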

Book ChapterDOI
02 Sep 2014
TL;DR: This paper proposes a descriptor that comprises the direction and magnitude of curvature and naturally expands classical orientation histograms like SIFT and HOG, demonstrating the general benefit of the expansion exemplarily for image classification, object detection, and descriptor matching.
Abstract: Descriptors based on orientation histograms are widely used in computer vision. The spatial pooling involved in these representations provides important invariance properties, yet it is also responsible for the loss of important details. In this paper, we suggest a way to preserve the details described by the local curvature. We propose a descriptor that comprises the direction and magnitude of curvature and naturally expands classical orientation histograms like SIFT and HOG. We demonstrate the general benefit of the expansion exemplarily for image classification, object detection, and descriptor matching.
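
The curvature of image level lines has a closed form in first and second derivatives, kappa = (Iy^2 Ixx - 2 Ix Iy Ixy + Ix^2 Iyy) / (Ix^2 + Iy^2)^(3/2), and its magnitude and sign can then be pooled into histograms alongside gradient orientation. A NumPy sketch of the per-pixel quantity only (the pooling into a SIFT/HOG-style descriptor, as the paper proposes, is omitted):

```python
import numpy as np

def isophote_curvature(img, eps=1e-8):
    """Per-pixel curvature of image level lines from 1st/2nd derivatives."""
    Iy, Ix = np.gradient(img.astype(float))      # axis 0 = y, axis 1 = x
    Ixy, Ixx = np.gradient(Ix)
    Iyy, _ = np.gradient(Iy)
    num = Iy**2 * Ixx - 2 * Ix * Iy * Ixy + Ix**2 * Iyy
    den = (Ix**2 + Iy**2) ** 1.5 + eps
    return num / den

yy, xx = np.mgrid[:64, :64]
disk = ((xx - 32) ** 2 + (yy - 32) ** 2 < 15 ** 2).astype(float)
print(np.abs(isophote_curvature(disk)).max())    # high values on the circle
```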

Book ChapterDOI
02 Sep 2014
TL;DR: A novel variational model to jointly estimate geometry and motion from a sequence of light fields captured with a plenoptic camera is presented, which enforces multi-view geometry consistency, and piecewise smoothness assumptions on the scene flow variables.
Abstract: In this paper we present a novel variational model to jointly estimate geometry and motion from a sequence of light fields captured with a plenoptic camera. The proposed model uses the so-called sub-aperture representation of the light field. Sub-aperture images represent images with slightly different viewpoints, which can be extracted from the light field. The sub-aperture representation allows us to formulate a convex global energy functional, which enforces multi-view geometry consistency, and piecewise smoothness assumptions on the scene flow variables. We optimize the proposed scene flow model by using an efficient preconditioned primal-dual algorithm. Finally, we also present synthetic and real world experiments.

Book ChapterDOI
02 Sep 2014
TL;DR: A way to boost the performance of 2D pose estimation based on the output of the 3D pose reconstruction process is explored, thus closing the loop in the pose estimation pipeline.
Abstract: In this paper we consider the task of articulated 3D human pose estimation in challenging scenes with dynamic background and multiple people. Initial progress on this task has been achieved building on discriminatively trained part-based models that deliver a set of 2D body pose candidates that are then subsequently refined by reasoning in 3D [1, 4, 5]. The performance of such methods is limited by the performance of the underlying 2D pose estimation approaches. In this paper we explore a way to boost the performance of 2D pose estimation based on the output of the 3D pose reconstruction process, thus closing the loop in the pose estimation pipeline. We build our approach around a component that is able to identify true positive pose estimation hypotheses with high confidence. We then either retrain 2D pose estimation models using such highly confident hypotheses as additional training examples, or we use similarity to these hypotheses as a cue for 2D pose estimation. We consider a number of features that can be used for assessing the confidence of the pose estimation results. The strongest feature in our comparison corresponds to the ensemble agreement on the 3D pose output. We evaluate our approach on two publicly available datasets, improving over the state of the art in each case.

Book ChapterDOI
02 Sep 2014
TL;DR: This work provides the first pose-invariant approach to estimate gaze from unconstrained still images, with results for pose-invariant gaze estimation on the UUlm Head Pose and Gaze Database and attribute description on the Multi-PIE database.
Abstract: Our goal is to obtain an eye gaze estimation and a face description based on attributes (e.g. glasses, beard or thick lips) from still images. An attribute-based face description reflects human vocabulary and is therefore adequate as a face description. Head pose and eye gaze play an important role in human interaction and are a key element to extract interaction information from still images. Pose variation is a major challenge when analyzing them. Most current approaches for facial image analysis are not explicitly pose-invariant. To obtain a pose-invariant representation, we have to account for the three-dimensional nature of a face. A 3D Morphable Model (3DMM) of faces is used to obtain a dense 3D reconstruction of the face in the image. This Analysis-by-Synthesis approach provides model parameters which contain an explicit face description and a dense model-to-image correspondence. However, the fit is restricted to the model space and cannot explain all variations. Our model only contains straight gaze directions and lacks high-detail textural features. To overcome these limitations, we use the obtained correspondence in a discriminative approach. The dense correspondence is used to extract a pose-normalized version of the input image. The warped image contains all information from the original image and preserves gaze and detailed textural information. On the pose-normalized representation we train a regression function to obtain gaze estimation and attribute description. We provide results for pose-invariant gaze estimation on still images on the UUlm Head Pose and Gaze Database and attribute description on the Multi-PIE database. To the best of our knowledge, this is the first pose-invariant approach to estimate gaze from unconstrained still images.

Book ChapterDOI
02 Sep 2014
TL;DR: A method to improve the classification result by combining multiple deep convolutional neural networks in a committee is presented, achieving results that are better than the state of the art.
Abstract: Deep convolutional neural networks are known to give good results on image classification tasks. In this paper we present a method to improve the classification result by combining multiple such networks in a committee. We adopt the STL-10 dataset which has very few training examples and show that our method can achieve results that are better than the state of the art. The networks are trained layer-wise and no backpropagation is used. We also explore the effects of dataset augmentation by mirroring, rotation, and scaling.
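
At test time, a committee simply averages the class probabilities of its independently trained members. A PyTorch sketch with an invented small CNN on STL-10-sized inputs (training, which the paper does layer-wise without backpropagation, is omitted here):

```python
import torch
import torch.nn as nn

def small_cnn():
    """Illustrative member architecture, not the paper's."""
    return nn.Sequential(
        nn.Conv2d(3, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(4),
        nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(4),
        nn.Flatten(), nn.Linear(32 * 6 * 6, 10),
    )

# A committee: each member would be trained independently (e.g., on
# differently augmented data); their class probabilities are averaged.
committee = [small_cnn() for _ in range(5)]
x = torch.rand(4, 3, 96, 96)                 # STL-10-sized inputs
with torch.no_grad():
    probs = torch.stack([m(x).softmax(dim=1) for m in committee]).mean(0)
print(probs.argmax(dim=1))
```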

Book ChapterDOI
02 Sep 2014
TL;DR: A new tracking-by-detection algorithm is proposed for multiple targets from multiple dynamic, unlocalized and unconstrained cameras, and it is shown that the method can effectively deal with independently moving cameras and camera registration noise.
Abstract: We propose a new tracking-by-detection algorithm for multiple targets from multiple dynamic, unlocalized and unconstrained cameras. In the past, tracking has either been done with multiple static cameras, or single and stereo dynamic cameras. We register several moving cameras using a given 3D model from Structure from Motion (SfM), and initialize the tracking given the registration. The camera uncertainty estimate can be efficiently incorporated into a flow-network formulation for tracking. As this is a novel task in the tracking domain, we evaluate our method on a new challenging dataset for tracking with multiple moving cameras and show that our tracking method can effectively deal with independently moving cameras and camera registration noise.

Book ChapterDOI
02 Sep 2014
TL;DR: An orthogonal approach is presented that learns patch representations specifically tailored to every single test exemplar for fine-grained recognition or subordinate categorization, tasks where an algorithm needs to reliably differentiate between visually similar categories, e.g., different bird species.
Abstract: In this paper, we present a new approach for fine-grained recognition or subordinate categorization, tasks where an algorithm needs to reliably differentiate between visually similar categories, e.g., different bird species. While previous approaches aim at learning a single generic representation and models with increasing complexity, we propose an orthogonal approach that learns patch representations specifically tailored to every single test exemplar. Since we query a constant number of images similar to a given test image, we obtain very compact features and avoid large-scale training with all classes and examples. Our learned mid-level features are built on shape and color detectors estimated from discovered patches reflecting small highly discriminative structures in the queried images. We evaluate our approach for fine-grained recognition on the CUB-2011 birds dataset and show that high recognition rates can be obtained by model combination.