
Showing papers by "Federico Tombari published in 2018"


Posted Content
TL;DR: A benchmark for 6D pose estimation of a rigid object from a single RGB-D input image shows that methods based on point-pair features currently perform best, outperforming template matching methods, learning-based methods and methods based on 3D local features.
Abstract: We propose a benchmark for 6D pose estimation of a rigid object from a single RGB-D input image. The training data consists of a texture-mapped 3D object model or images of the object in known 6D poses. The benchmark comprises: i) eight datasets in a unified format that cover different practical scenarios, including two new datasets focusing on varying lighting conditions, ii) an evaluation methodology with a pose-error function that deals with pose ambiguities, iii) a comprehensive evaluation of 15 diverse recent methods that captures the status quo of the field, and iv) an online evaluation system that is open for continuous submission of new results. The evaluation shows that methods based on point-pair features currently perform best, outperforming template matching methods, learning-based methods and methods based on 3D local features. The project website is available at bop.felk.cvut.cz.

224 citations
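
The abstract above mentions a pose-error function that deals with pose ambiguities; in the BOP setup this is the Visible Surface Discrepancy. As a lighter, self-contained illustration of how a pose error can be made robust to object symmetries, the sketch below implements the widely used ADD and ADD-S metrics in plain NumPy; the model points, poses and acceptance threshold are illustrative, not taken from the benchmark.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform (R: 3x3, t: 3,) to an (N, 3) point set."""
    return points @ R.T + t

def add_error(model_points, R_est, t_est, R_gt, t_gt):
    """ADD: mean distance between corresponding model points under the
    estimated and ground-truth poses (suited to non-symmetric objects)."""
    est = transform(model_points, R_est, t_est)
    gt = transform(model_points, R_gt, t_gt)
    return np.linalg.norm(est - gt, axis=1).mean()

def adds_error(model_points, R_est, t_est, R_gt, t_gt):
    """ADD-S: for each ground-truth point take the closest estimated point,
    which makes the error invariant to object symmetries."""
    est = transform(model_points, R_est, t_est)
    gt = transform(model_points, R_gt, t_gt)
    # Pairwise distances (N x N); fine for a few thousand model points.
    d = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return d.min(axis=1).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform(-0.05, 0.05, size=(500, 3))   # hypothetical object model (metres)
    R_gt, t_gt = np.eye(3), np.array([0.0, 0.0, 0.5])
    R_est, t_est = np.eye(3), np.array([0.002, 0.0, 0.5])
    # A pose is often accepted if the error is below 10% of the object diameter.
    print(add_error(pts, R_est, t_est, R_gt, t_gt))
    print(adds_error(pts, R_est, t_est, R_gt, t_gt))
```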


Book ChapterDOI
08 Sep 2018
TL;DR: In this article, the authors propose a benchmark for 6D pose estimation of a rigid object from a single RGB-D input image, which consists of a texture-mapped 3D object model or images of the object in known 6D poses.
Abstract: We propose a benchmark for 6D pose estimation of a rigid object from a single RGB-D input image. The training data consists of a texture-mapped 3D object model or images of the object in known 6D poses. The benchmark comprises: (i) eight datasets in a unified format that cover different practical scenarios, including two new datasets focusing on varying lighting conditions, (ii) an evaluation methodology with a pose-error function that deals with pose ambiguities, (iii) a comprehensive evaluation of 15 diverse recent methods that captures the status quo of the field, and (iv) an online evaluation system that is open for continuous submission of new results. The evaluation shows that methods based on point-pair features currently perform best, outperforming template matching methods, learning-based methods and methods based on 3D local features. The project website is available at bop.felk.cvut.cz.

193 citations


Book ChapterDOI
08 Sep 2018
TL;DR: This work proposes a general-purpose, fully-convolutional network architecture for efficiently processing large-scale 3D data and demonstrates its ability to effectively learn both low-level features as well as complex compositional relationships by evaluating it on benchmark datasets for semantic voxel segmentation, semantic part segmentation and 3D scene captioning.
Abstract: This work proposes a general-purpose, fully-convolutional network architecture for efficiently processing large-scale 3D data. One striking characteristic of our approach is its ability to process unorganized 3D representations such as point clouds as input and then transform them internally into ordered structures to be processed via 3D convolutions. In contrast to conventional approaches that maintain either unorganized or organized representations, from input to output, our approach has the advantage of operating on memory-efficient input data representations while at the same time exploiting the natural structure of convolutional operations to avoid the redundant computing and storing of spatial information in the network. The network eliminates the need to pre- or post-process the raw sensor data. This, together with the fully-convolutional nature of the network, makes it an end-to-end method able to process point clouds of huge spaces or even entire rooms with up to 200k points at once. Another advantage is that our network can produce either an ordered output or map predictions directly onto the input cloud, thus making it suitable as a general-purpose point cloud descriptor applicable to many 3D tasks. We demonstrate our network’s ability to effectively learn both low-level features as well as complex compositional relationships by evaluating it on benchmark datasets for semantic voxel segmentation, semantic part segmentation and 3D scene captioning.

172 citations
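
The central idea above — turning an unorganized point cloud into an ordered structure that 3D convolutions can consume, and mapping predictions back onto the input points — can be illustrated with a minimal occupancy-grid round trip. This is only a conceptual sketch; the paper's internal representation and network are more involved, and all names below are hypothetical.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Map an unorganized (N, 3) point cloud to an ordered occupancy grid.
    Also returns the per-point voxel indices needed to map voxel-level
    predictions back onto the input cloud."""
    origin = points.min(axis=0)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    dims = idx.max(axis=0) + 1
    grid = np.zeros(dims, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0   # occupancy
    return grid, idx

def unvoxelize(voxel_predictions, point_voxel_idx):
    """Project per-voxel predictions (e.g. semantic labels) back to the points."""
    return voxel_predictions[point_voxel_idx[:, 0],
                             point_voxel_idx[:, 1],
                             point_voxel_idx[:, 2]]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    cloud = rng.uniform(0.0, 4.0, size=(200_000, 3))      # e.g. a room-scale scan
    grid, idx = voxelize(cloud, voxel_size=0.05)
    # Stand-in for the network: label every occupied voxel with class 1.
    labels = (grid > 0).astype(np.int64)
    point_labels = unvoxelize(labels, idx)
    print(grid.shape, point_labels.shape)
```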


Book ChapterDOI
08 Sep 2018
TL;DR: This work proposes a learning approach for panoramic depth map estimation from a single image, thanks to a specifically developed distortion-aware deformable convolution filter, which can be trained by means of conventional perspective images, then used to regress depth for panoramic images, thus bypassing the effort needed to create annotated panoramic training datasets.
Abstract: There is a high demand for 3D data for 360° panoramic images and videos, pushed by the growing availability on the market of specialized hardware for both capturing (e.g., omni-directional cameras) as well as visualizing in 3D (e.g., head mounted displays) panoramic images and videos. At the same time, 3D sensors able to capture 3D panoramic data are expensive and/or hardly available. To fill this gap, we propose a learning approach for panoramic depth map estimation from a single image. Thanks to a specifically developed distortion-aware deformable convolution filter, our method can be trained by means of conventional perspective images, then used to regress depth for panoramic images, thus bypassing the effort needed to create annotated panoramic training datasets. We also demonstrate our approach for emerging tasks such as panoramic monocular SLAM, panoramic semantic segmentation and panoramic style transfer.

143 citations
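
As a rough intuition for what a distortion-aware filter on equirectangular images has to compensate for: the same horizontal angular extent is stretched over more pixels near the poles, so a kernel's sampling locations can be widened by the inverse cosine of the latitude to keep its effective footprint roughly constant. The sketch below computes such per-row horizontal offsets; it is a simplification for illustration, not the authors' deformable convolution.

```python
import numpy as np

def equirect_offsets(height, kernel=3):
    """For each row of an equirectangular panorama, compute the horizontal
    sampling offsets of a `kernel` x `kernel` filter, stretched by 1/cos(latitude)
    (rows near the poles get wider support). Simplified illustration only;
    real distortion-aware filters also warp the vertical offsets."""
    half = kernel // 2
    base = np.arange(-half, half + 1, dtype=np.float32)          # e.g. [-1, 0, 1]
    rows = np.arange(height, dtype=np.float32)
    lat = (0.5 - (rows + 0.5) / height) * np.pi                  # latitude in (-pi/2, pi/2)
    stretch = 1.0 / np.clip(np.cos(lat), 1e-3, None)             # grows towards the poles
    # (height, kernel) horizontal offsets in pixels for every row.
    return base[None, :] * stretch[:, None]

if __name__ == "__main__":
    offsets = equirect_offsets(height=256)
    print(offsets[128])   # equator row: essentially [-1, 0, 1]
    print(offsets[10])    # near the pole: strongly stretched
```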


Book ChapterDOI
08 Sep 2018
TL;DR: A new visual loss is proposed that drives the pose update by aligning object contours, thus avoiding the definition of any explicit appearance model and producing pose accuracies that come close to 3D ICP without the need for depth data.
Abstract: We present a novel approach for model-based 6D pose refinement in color data. Building on the established idea of contour-based pose tracking, we teach a deep neural network to predict a translational and rotational update. At the core, we propose a new visual loss that drives the pose update by aligning object contours, thus avoiding the definition of any explicit appearance model. In contrast to previous work our method is correspondence-free, segmentation-free, can handle occlusion and is agnostic to geometrical symmetry as well as visual ambiguities. Additionally, we observe a strong robustness towards rough initialization. The approach can run in real-time and produces pose accuracies that come close to 3D ICP without the need for depth data. Furthermore, our networks are trained from purely synthetic data and will be published together with the refinement code at http://campar.in.tum.de/Main/FabianManhardt to ensure reproducibility.

135 citations
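
The contour-alignment idea behind the visual loss can be conveyed with a simple stand-in: given binary silhouettes of the object under the estimated and observed poses, measure how far the estimated contour lies from the observed one via a distance transform. The paper formulates this as a loss that drives a network's pose update; the snippet below (NumPy/SciPy, toy masks) only shows the geometric measurement.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, binary_erosion

def contour(mask):
    """Boundary pixels of a binary mask."""
    return mask & ~binary_erosion(mask)

def contour_alignment_error(pred_mask, target_mask):
    """Mean distance (in pixels) from the predicted contour to the target contour."""
    # Distance from every pixel to the nearest target-contour pixel.
    dist_to_target = distance_transform_edt(~contour(target_mask))
    pred_contour = contour(pred_mask)
    if not pred_contour.any():
        return np.inf
    return dist_to_target[pred_contour].mean()

if __name__ == "__main__":
    # Two slightly shifted square silhouettes as a toy example.
    target = np.zeros((100, 100), dtype=bool); target[30:70, 30:70] = True
    pred = np.zeros((100, 100), dtype=bool);   pred[33:73, 32:72] = True
    print(contour_alignment_error(pred, target))
```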


Proceedings ArticleDOI
01 Oct 2018
TL;DR: This work proposes a semantic monocular SLAM framework designed to deal with highly dynamic environments, combining feature-based and direct approaches to achieve robustness under challenging conditions and shows more stable pose estimation in dynamic environments.
Abstract: Recent advances in monocular SLAM have enabled real-time capable systems which run robustly under the assumption of a static environment, but fail in the presence of dynamic scene changes and motion, since they lack explicit dynamic outlier handling. We propose a semantic monocular SLAM framework designed to deal with highly dynamic environments, combining feature-based and direct approaches to achieve robustness under challenging conditions. The proposed approach exploits semantic information extracted from the scene within an explicit probabilistic model, which maximizes the probability for both tracking and mapping to rely on those scene parts that do not present a relative motion with respect to the camera. We show more stable pose estimation in dynamic environments and comparable performance to the state of the art on static sequences on the Virtual KITTI and Synthia datasets.

70 citations
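
One heavily simplified way to picture the probabilistic handling of dynamic content described above is to down-weight each feature's residual by the prior probability that its semantic class is static with respect to the camera. Class names, weights and the robust cost below are illustrative assumptions, not the paper's model.

```python
import numpy as np

# Illustrative prior probability that a semantic class is static w.r.t. the camera.
STATIC_PRIOR = {"road": 0.99, "building": 0.98, "vegetation": 0.95,
                "car": 0.30, "pedestrian": 0.05}

def weighted_tracking_cost(residuals, classes, sigma=1.0):
    """Robust tracking cost: each residual is weighted by the prior that the
    observed scene part is static, so dynamic objects barely influence the pose."""
    w = np.array([STATIC_PRIOR.get(c, 0.5) for c in classes])
    huber = np.where(np.abs(residuals) <= sigma,
                     0.5 * residuals**2,
                     sigma * (np.abs(residuals) - 0.5 * sigma))
    return float(np.sum(w * huber))

if __name__ == "__main__":
    res = np.array([0.2, 0.1, 3.5, 4.0])                  # reprojection errors (px)
    cls = ["road", "building", "car", "pedestrian"]        # per-feature semantics
    print(weighted_tracking_cost(res, cls))
```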


Posted Content
TL;DR: In this paper, a new visual loss is proposed to align object contours, thus avoiding the definition of any explicit appearance model, and a deep neural network is trained to predict a translational and rotational update.
Abstract: We present a novel approach for model-based 6D pose refinement in color data. Building on the established idea of contour-based pose tracking, we teach a deep neural network to predict a translational and rotational update. At the core, we propose a new visual loss that drives the pose update by aligning object contours, thus avoiding the definition of any explicit appearance model. In contrast to previous work our method is correspondence-free, segmentation-free, can handle occlusion and is agnostic to geometrical symmetry as well as visual ambiguities. Additionally, we observe a strong robustness towards rough initialization. The approach can run in real-time and produces pose accuracies that come close to 3D ICP without the need for depth data. Furthermore, our networks are trained from purely synthetic data and will be published together with the refinement code to ensure reproducibility.

59 citations


Journal ArticleDOI
TL;DR: Through extensive experiments on standard datasets, this work shows how feature matching performance improves significantly by deploying 3D descriptors together with companion detectors learned by the proposed methodology, compared to the adoption of established state-of-the-art 3D detectors based on hand-crafted saliency functions.
Abstract: The established approach to 3D keypoint detection consists in defining effective handcrafted saliency functions based on geometric cues with the aim of maximizing keypoint repeatability. In contrast, the idea behind our work is to learn a descriptor-specific keypoint detector so as to optimize the end-to-end performance of the feature matching pipeline. Accordingly, we cast 3D keypoint detection as a classification problem between surface patches that can or cannot be matched correctly by a given 3D descriptor, i.e., those that are either good or not with respect to that descriptor. We propose a machine learning framework that allows for defining examples of good surface patches from the training data and leverages Random Forest classifiers to realize both fixed-scale and adaptive-scale 3D keypoint detectors. Through extensive experiments on standard datasets, we show how feature matching performance improves significantly by deploying 3D descriptors together with companion detectors learned by our methodology, compared to the adoption of established state-of-the-art 3D detectors based on hand-crafted saliency functions.

51 citations
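
Casting keypoint detection as a binary classification of surface patches — matchable vs. non-matchable for a given descriptor — maps naturally onto an off-the-shelf Random Forest. The sketch below assumes patch features and descriptor-derived labels are already available and uses synthetic stand-ins for both; it is not the authors' exact feature design.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# Hypothetical training data: one row of geometric features per surface patch,
# label 1 if the companion 3D descriptor matched this patch correctly, else 0.
patch_features = rng.normal(size=(5000, 32))
labels = (patch_features[:, 0] + 0.5 * patch_features[:, 3] > 0).astype(int)

detector = RandomForestClassifier(n_estimators=100, max_depth=12, n_jobs=-1)
detector.fit(patch_features, labels)

# At detection time, keep the patches the forest considers most likely to be
# matchable and use their centres as keypoints.
test_features = rng.normal(size=(1000, 32))
scores = detector.predict_proba(test_features)[:, 1]
keypoint_idx = np.argsort(scores)[::-1][:100]
print(keypoint_idx[:10], scores[keypoint_idx[:10]])
```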


Journal ArticleDOI
TL;DR: It is conjecture that features learned via a convolutional neural network provide the ability to distinctively detect seizures from video, and even allow the system to generalize to different seizure types.
Abstract: Epileptic seizures constitute a serious neurological condition for patients and, if untreated, considerably decrease their quality of life. Early and correct diagnosis by semiological seizure analysis provides the main approach to treat and improve the patients’ condition. To obtain reliable and quantifiable information, medical professionals perform seizure detection and subsequent analysis using expensive video-EEG systems in specialized epilepsy monitoring units. However, the detection of seizures, especially under difficult circumstances such as occlusion by the blanket or in the absence of predictive EEG patterns, is highly subjective and should therefore be supported by automated systems. In this work, we conjecture that features learned via a convolutional neural network provide the ability to distinctively detect seizures from video, and even allow our system to generalize to different seizure types. By comparing our method to the state of the art, we show the superior performance of learned features...

45 citations


Proceedings ArticleDOI
30 Mar 2018
TL;DR: In this article, a layer that acts as a spatio-semantic guide is added to the network to modify the network's activations, either directly via an energy minimization scheme or indirectly through a recurrent model that translates human language queries to interaction weights.
Abstract: Interaction and collaboration between humans and intelligent machines has become increasingly important as machine learning methods move into real-world applications that involve end users. While much prior work lies at the intersection of natural language and vision, such as image captioning or image generation from text descriptions, less focus has been placed on the use of language to guide or improve the performance of a learned visual processing algorithm. In this paper, we explore methods to flexibly guide a trained convolutional neural network through user input to improve its performance during inference. We do so by inserting a layer that acts as a spatio-semantic guide into the network. This guide is trained to modify the network's activations, either directly via an energy minimization scheme or indirectly through a recurrent model that translates human language queries to interaction weights. Learning the verbal interaction is fully automatic and does not require manual text annotations. We evaluate the method on two datasets, showing that guiding a pre-trained network can improve performance, and provide extensive insights into the interaction between the guide and the CNN.

35 citations


Book ChapterDOI
08 Sep 2018
TL;DR: In this paper, a triplet-based deep metric learning approach is proposed to deal with human motion data, in particular with the problem of varying input size and computationally expensive hard negative mining due to motion pair alignment.
Abstract: Effectively measuring the similarity between two human motions is necessary for several computer vision tasks such as gait analysis, person identification and action retrieval. Nevertheless, we believe that traditional approaches such as L2 distance or Dynamic Time Warping based on hand-crafted local pose metrics fail to appropriately capture the semantic relationship across motions and, as such, are not suitable for being employed as metrics within these tasks. This work addresses this limitation by means of a triplet-based deep metric learning specifically tailored to deal with human motion data, in particular with the problem of varying input size and computationally expensive hard negative mining due to motion pair alignment. Specifically, we propose (1) a novel metric learning objective based on a triplet architecture and Maximum Mean Discrepancy; as well as, (2) a novel deep architecture based on attentive recurrent neural networks. One benefit of our objective function is that it enforces a better separation within the learned embedding space of the different motion categories by means of the associated distribution moments. At the same time, our attentive recurrent neural network allows processing varying input sizes to a fixed size of embedding while learning to focus on those motion parts that are semantically distinctive. Our experiments on two different datasets demonstrate significant improvements over conventional human motion metrics.
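
The Maximum Mean Discrepancy term in objective (1) admits a compact estimate; below is a standard biased RBF-kernel MMD between two sets of motion embeddings, of the kind one could plug into such a triplet objective. The kernel bandwidth and the synthetic embeddings are assumptions, not the paper's settings.

```python
import numpy as np

def rbf_kernel(a, b, sigma):
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy between two
    embedding sets x (n, d) and y (m, d) under an RBF kernel."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    anchor_class = rng.normal(0.0, 1.0, size=(64, 128))   # embeddings of one motion class
    other_class = rng.normal(0.5, 1.0, size=(64, 128))    # embeddings of another class
    print(mmd2(anchor_class, anchor_class[32:]))           # small: same distribution
    print(mmd2(anchor_class, other_class))                 # larger: distributions differ
```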

Proceedings ArticleDOI
27 Dec 2018
TL;DR: In this article, the authors propose an efficient and scalable method for incrementally building a dense, semantically annotated 3D map in real-time, which assigns class probabilities to each region, not each element (e.g., surfel and voxel), of the 3D maps built up through a robust SLAM framework and incrementally segmented with a geometric-based segmentation method.
Abstract: We propose an efficient and scalable method for incrementally building a dense, semantically annotated 3D map in real time. The proposed method assigns class probabilities to each region, not each element (e.g., surfel and voxel), of the 3D map, which is built up through a robust SLAM framework and incrementally segmented with a geometric-based segmentation method. Differently from all other approaches, our method is capable of running at over 30 Hz while performing all processing components, including SLAM, segmentation, 2D recognition, and updating class probabilities of each segmentation label at every incoming frame, thanks to the high efficiency that characterizes the computationally intensive stages of our framework. By utilizing a specifically designed CNN to improve the frame-wise segmentation result, we can also achieve high accuracy. We validate our method on the NYUv2 dataset by comparing with the state of the art in terms of accuracy and computational efficiency, and by means of an analysis in terms of time and space complexity.
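
The per-segment class-probability update can be sketched as a simple recursive Bayesian fusion: each segment keeps a distribution over classes that is multiplied by every new frame-wise prediction and renormalised. This generic formulation is shown below for illustration; the exact update rule in the paper may differ.

```python
import numpy as np

NUM_CLASSES = 13  # e.g. an NYUv2-style label set

class SegmentLabelFusion:
    """Per-segment class probabilities, updated with every incoming frame."""

    def __init__(self):
        self.prob = {}  # segment id -> probability vector over classes

    def update(self, segment_id, frame_prediction, eps=1e-3):
        """Multiply the stored distribution by the new CNN prediction for the
        pixels of this segment, then renormalise (recursive Bayes)."""
        p = self.prob.get(segment_id, np.full(NUM_CLASSES, 1.0 / NUM_CLASSES))
        p = p * (frame_prediction + eps)   # eps keeps zero scores from locking in
        self.prob[segment_id] = p / p.sum()

    def label(self, segment_id):
        return int(np.argmax(self.prob[segment_id]))

if __name__ == "__main__":
    fusion = SegmentLabelFusion()
    rng = np.random.default_rng(4)
    for _ in range(30):                      # 30 frames observing segment 7
        pred = rng.dirichlet(np.ones(NUM_CLASSES) + 5 * np.eye(NUM_CLASSES)[2])
        fusion.update(7, pred)
    print(fusion.label(7), fusion.prob[7].round(3))
```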

Posted Content
TL;DR: In this article, the authors predict multiple pose and class outcomes to estimate the specific pose distribution generated by symmetries and repetitive textures, and show the benefits of their approach which provides not only a better explanation for pose ambiguity, but also a higher accuracy in terms of pose estimation.
Abstract: 3D object detection and pose estimation from a single image are two inherently ambiguous problems. Oftentimes, objects appear similar from different viewpoints due to shape symmetries, occlusion and repetitive textures. This ambiguity in both detection and pose estimation means that an object instance can be perfectly described by several different poses and even classes. In this work we propose to explicitly deal with this uncertainty. For each object instance we predict multiple pose and class outcomes to estimate the specific pose distribution generated by symmetries and repetitive textures. The distribution collapses to a single outcome when the visual appearance uniquely identifies just one valid pose. We show the benefits of our approach which provides not only a better explanation for pose ambiguity, but also a higher accuracy in terms of pose estimation.
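
The "predict several outcomes and let the best one explain the data" idea can be written as a relaxed winner-takes-all loss over M hypotheses: only the hypothesis closest to the ground truth is fully penalised, with a small share of the loss spread over the rest so all heads keep training. The formulation below is a generic sketch, not the paper's exact objective.

```python
import numpy as np

def multi_hypothesis_loss(hypotheses, ground_truth, per_hypothesis_loss, eps=0.05):
    """Relaxed winner-takes-all loss over M hypotheses.

    hypotheses: array of shape (M, D) with M predicted outcomes.
    ground_truth: array of shape (D,).
    per_hypothesis_loss: function mapping (prediction, target) -> scalar.
    eps: small share of the loss distributed over the non-winning hypotheses.
    """
    losses = np.array([per_hypothesis_loss(h, ground_truth) for h in hypotheses])
    best = np.argmin(losses)
    weights = np.full(len(losses), eps / max(len(losses) - 1, 1))
    weights[best] = 1.0 - eps
    return float(np.sum(weights * losses)), int(best)

if __name__ == "__main__":
    # Two plausible poses of a rotationally symmetric object (angle in radians).
    hyps = np.array([[0.05], [np.pi + 0.02]])
    gt = np.array([0.0])
    loss, winner = multi_hypothesis_loss(
        hyps, gt, lambda a, b: float(np.linalg.norm(a - b)))
    print(loss, winner)   # hypothesis 0 wins; the symmetric alternative is barely penalised
```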

Journal ArticleDOI
04 Jul 2018
TL;DR: An online RGB-D based scene understanding method for indoor scenes, running in real time on mobile devices, achieves an accuracy close to that of state-of-the-art 3-D scene understanding methods while being much more efficient, enabling real-time execution on low-power embedded systems.
Abstract: We propose an online RGB-D based scene understanding method for indoor scenes running in real time on mobile devices. First, we incrementally reconstruct the scene via simultaneous localization and mapping and compute a three-dimensional (3-D) geometric segmentation by fusing segments obtained from each input depth image in a global 3-D model. We combine this geometric segmentation with semantic annotations to obtain a semantic segmentation in the form of a semantic map. To accomplish efficient semantic segmentation, we encode the segments in the global model with a fast incremental 3-D descriptor and use a random forest to determine their semantic label. The predictions from successive frames are then fused to obtain a confident semantic class across time. As a result, the overall method achieves an accuracy close to that of state-of-the-art 3-D scene understanding methods while being much more efficient, enabling real-time execution on low-power embedded systems.

Journal ArticleDOI
TL;DR: A detailed analysis of PPFs is presented, which highlights the conditions under which PPFs perform particularly well as well as their main weaknesses, and finds that PPFs degrade faster than most local histogram features under disturbances such as occlusion and clutter.

Proceedings ArticleDOI
25 Oct 2018
TL;DR: In this article, the authors propose a method to reconstruct, complete and semantically label a 3D scene from a single input depth image by using multiple adversarial loss terms that enforce realistic outputs with respect to the ground truth, but also an effective embedding of the internal features.
Abstract: We propose a method to reconstruct, complete and semantically label a 3D scene from a single input depth image. We improve the accuracy of the regressed semantic 3D maps by a novel architecture based on adversarial learning. In particular, we suggest using multiple adversarial loss terms that not only enforce realistic outputs with respect to the ground truth, but also an effective embedding of the internal features. This is done by correlating the latent features of the encoder working on partial 2.5D data with the latent features extracted from a variational 3D auto-encoder trained to reconstruct the complete semantic scene. In addition, differently from other approaches that operate entirely through 3D convolutions, at test time we retain the original 2.5D structure of the input during downsampling to improve the effectiveness of the internal representation of our model. We test our approach on the main benchmark datasets for semantic scene completion to qualitatively and quantitatively assess the effectiveness of our proposal.

Journal ArticleDOI
TL;DR: This work proposes another solution that utilizes a multi-camera system such that the data simultaneously acquired from multiple RGB-D sensors helps the tracker to handle challenging conditions that affect a subset of the cameras.
Abstract: We demonstrate how 3D head tracking and pose estimation can be effectively and efficiently achieved from noisy RGB-D sequences. Our proposal leverages a random forest framework, designed to regress the 3D head pose at every frame in a temporal tracking manner. One peculiarity of the algorithm is that it exploits together (1) a generic training dataset of 3D head models, which is learned once offline; and (2) an online refinement with subject-specific 3D data, which aims for the tracker to withstand slight facial deformations and to adapt its forest to the specific characteristics of an individual subject. The combination of these allows our algorithm to be robust even under extreme poses, where the user's face is no longer visible in the image. Finally, we also propose another solution that utilizes a multi-camera system such that the data simultaneously acquired from multiple RGB-D sensors helps the tracker to handle challenging conditions that affect a subset of the cameras. Notably, the proposed multi-camera framework yields a real-time performance of approximately 8 ms per frame given six cameras and one CPU core, and scales up linearly to 30 fps with 25 cameras.

Proceedings ArticleDOI
17 May 2018
TL;DR: This work proposes a situation assessment algorithm for classifying driving situations with respect to their suitability for lane changing, based on a Bidirectional Recurrent Neural Network, which uses Long Short-Term Memory units, and integrates a prediction component in the form of the Intelligent Driver Model.
Abstract: One of the greatest challenges towards fully autonomous cars is the understanding of complex and dynamic scenes. Such understanding is needed for planning of maneuvers, especially those that are particularly frequent such as lane changes. While in recent years advanced driver-assistance systems have made driving safer and more comfortable, these have mostly focused on car following scenarios, and less on maneuvers involving lane changes. In this work we propose a situation assessment algorithm for classifying driving situations with respect to their suitability for lane changing. For this, we propose a deep learning architecture based on a Bidirectional Recurrent Neural Network, which uses Long Short-Term Memory units, and integrates a prediction component in the form of the Intelligent Driver Model. We prove the feasibility of our algorithm on the publicly available NGSIM datasets, where we outperform existing methods.
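
The Intelligent Driver Model used as the prediction component is a compact car-following law; a standard formulation is sketched below. Parameter values are common defaults for illustration, not necessarily those used in the paper.

```python
import numpy as np

def idm_acceleration(v, gap, dv,
                     v0=30.0,      # desired speed (m/s)
                     T=1.5,        # desired time headway (s)
                     a_max=1.4,    # maximum acceleration (m/s^2)
                     b=2.0,        # comfortable deceleration (m/s^2)
                     s0=2.0,       # minimum gap (m)
                     delta=4.0):
    """Intelligent Driver Model: acceleration of the ego vehicle given its
    speed v, the gap to the leading vehicle and the approach rate dv = v - v_lead."""
    s_star = s0 + v * T + v * dv / (2.0 * np.sqrt(a_max * b))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / max(gap, 1e-3)) ** 2)

if __name__ == "__main__":
    # Ego at 25 m/s, 40 m behind a leader driving 20 m/s -> IDM commands braking.
    print(idm_acceleration(v=25.0, gap=40.0, dv=5.0))
    # Free road (huge gap), below desired speed -> gentle acceleration.
    print(idm_acceleration(v=25.0, gap=1e6, dv=0.0))
```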

Book ChapterDOI
02 Dec 2018
TL;DR: In this paper, the authors propose to estimate multiple grasp poses from a single RGB image of the target object by replacing conventional grasp rectangles with grasp belief maps, which hold more precise location information than a rectangle and account for the uncertainty inherent to the task.
Abstract: Humans excel in grasping and manipulating objects because of their life-long experience and knowledge about the 3D shape and weight distribution of objects. However, the lack of such intuition in robots makes robotic grasping an exceptionally challenging task. There are often several equally viable options for grasping an object. Yet this ambiguity is not modeled in conventional systems that estimate a single, optimal grasp position. We propose to tackle this problem by simultaneously estimating multiple grasp poses from a single RGB image of the target object. Further, we reformulate the problem of robotic grasping by replacing conventional grasp rectangles with grasp belief maps, which hold more precise location information than a rectangle and account for the uncertainty inherent to the task. We augment a fully convolutional neural network with a multiple hypothesis prediction model that predicts a set of grasp hypotheses in under 60 ms, which is critical for real-time robotic applications. The grasp detection accuracy reaches over 90% for unseen objects, outperforming the current state of the art on this task.
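
The switch from grasp rectangles to grasp belief maps can be pictured by rasterising each grasp as an oriented 2D Gaussian centred on the grasp point, with spreads tied to the rectangle's extent; the network then regresses such maps instead of box parameters. The exact parameterisation below is an assumption made for illustration.

```python
import numpy as np

def grasp_belief_map(shape, center, angle, width, height):
    """Rasterise an oriented 2D Gaussian 'belief map' for one grasp hypothesis.

    shape:  (H, W) of the output map.
    center: (x, y) grasp centre in pixels.
    angle:  grasp orientation in radians.
    width/height: extents of the original grasp rectangle (pixels), used here
                  as standard deviations along / across the grasp axis.
    """
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]].astype(np.float32)
    dx, dy = xs - center[0], ys - center[1]
    # Rotate coordinates into the grasp frame.
    u = np.cos(angle) * dx + np.sin(angle) * dy
    v = -np.sin(angle) * dx + np.cos(angle) * dy
    return np.exp(-0.5 * ((u / (width / 2.0)) ** 2 + (v / (height / 2.0)) ** 2))

if __name__ == "__main__":
    belief = grasp_belief_map((224, 224), center=(120, 100),
                              angle=np.deg2rad(30), width=60, height=20)
    print(belief.shape, belief.max(), np.unravel_index(belief.argmax(), belief.shape))
```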

Journal ArticleDOI
TL;DR: A discriminative alternative for the association step is proposed that leverages random forests to infer correspondences in one shot; it allows for large deformations, prevents tracking errors from accumulating, and is referred to as ‘tracking-by-detection of 3D human shapes.’
Abstract: 3D human shape tracking consists in fitting a template model to temporal sequences of visual observations. It usually comprises an association step that finds correspondences between the model and the input data, and a deformation step that fits the model to the observations given correspondences. Most current approaches follow the Iterative-Closest-Point (ICP) paradigm, where the association step is carried out by searching for the nearest neighbors. This fails when large deformations occur, and errors in the association tend to propagate over time. In this paper, we propose a discriminative alternative for the association that leverages random forests to infer correspondences in one shot. Regardless of the choice of shape parameterization, be it surface or volumetric meshes, we convert 3D shapes to volumetric distance fields and thereby design features to train the forest. We investigate two ways to draw volumetric samples: voxels of regular grids and cells from Centroidal Voronoi Tessellation (CVT). While the former consumes considerable memory and in turn limits us to learn only subject-specific correspondences, the latter yields a much smaller memory footprint by compactly tessellating the interior space of a shape with optimal discretization. This facilitates the use of larger cross-subject training databases, generalizes to different human subjects and hence results in less overfitting and better detection. The discriminative correspondences are successfully integrated into both surface and volumetric deformation frameworks that recover human shape poses, which we refer to as ‘tracking-by-detection of 3D human shapes.’ It allows for large deformations and prevents tracking errors from being accumulated. When combined with ICP for refinement, it proves to yield better accuracy in registration and more stability when tracking over time. Evaluations on existing datasets demonstrate the benefits with respect to the state of the art.

Book ChapterDOI
16 Sep 2018
TL;DR: In this paper, a two-step transfer learning based training process with a robust loss function is proposed to train deep models for the task of fine-grained skin lesion classification.
Abstract: Within medical imaging, manual curation of sufficiently many well-labeled samples is cost-, time- and scale-prohibitive. To improve the representativeness of the training dataset, for the first time, we present an approach to utilize large amounts of freely available web data through web-crawling. To handle the noisy and weak nature of web annotations, we propose a two-step transfer learning based training process with a robust loss function, termed Webly Supervised Learning (WSL), to train deep models for the task. We also leverage search by image to improve the search specificity of our web-crawling and reduce cross-domain noise. Within WSL, we explicitly model the noise structure between classes and incorporate it to selectively distill knowledge from the web data during model training. To demonstrate improved performance due to WSL, we benchmarked on a publicly available 10-class fine-grained skin lesion classification dataset and report a significant improvement of top-1 classification accuracy from 71.25% to 80.53% due to the incorporation of web supervision.
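
One standard way to "explicitly model the noise structure between classes" is a noise transition matrix T, with T[i, j] the probability that a sample of true class i is labelled j on the web; the clean prediction is pushed through T before computing the cross-entropy (forward correction). The sketch below shows that mechanism generically and is not claimed to be the paper's exact WSL formulation.

```python
import numpy as np

def forward_corrected_cross_entropy(clean_probs, noisy_label, transition):
    """Cross-entropy against a (possibly noisy) web label after pushing the
    model's clean class posterior through the noise transition matrix.

    clean_probs: (C,) softmax output of the model (posterior over true classes).
    noisy_label: integer label as found on the web.
    transition:  (C, C) matrix, transition[i, j] = P(observed j | true i).
    """
    noisy_probs = clean_probs @ transition          # predicted distribution over web labels
    return float(-np.log(noisy_probs[noisy_label] + 1e-12))

if __name__ == "__main__":
    C = 4
    # Mostly correct web labels, but class 1 is often mislabelled as class 2.
    T = np.full((C, C), 0.02)
    np.fill_diagonal(T, 0.94)
    T[1, 2] += 0.3; T[1, 1] -= 0.3
    T = T / T.sum(axis=1, keepdims=True)
    clean = np.array([0.05, 0.85, 0.05, 0.05])      # model believes the true class is 1
    # A web label of 2 is only mildly penalised, since class 1 is often tagged as 2.
    print(forward_corrected_cross_entropy(clean, 2, T))
    print(forward_corrected_cross_entropy(clean, 3, T))
```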

Proceedings ArticleDOI
01 Oct 2018
TL;DR: It is shown that good segmentation results on monocular images are achieved, which substantially exceed the performance of the algorithm employed for automatic labeling without the need of any manual annotation.
Abstract: We propose a new approach for generating training data for the task of drivable area segmentation with deep neural networks (DNNs). The impressive progress of deep learning in recent years has demonstrated the superior performance of DNNs over traditional machine learning and deterministic algorithms for various tasks. Nevertheless, the acquisition of large-scale datasets with associated ground truth labels still poses an expensive and labor-intensive problem. We contribute to the solution of this problem for the task of road segmentation by proposing an automatic labeling pipeline which leverages a deterministic stereo-based approach for ground plane detection to create large datasets suitable for training neural networks. Based on the popular Cityscapes [1] and KITTI [2] datasets and two off-the-shelf DNNs for semantic segmentation, we show that we can achieve good segmentation results on monocular images, which substantially exceed the performance of the algorithm employed for automatic labeling without the need for any manual annotation.
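
The deterministic stereo-based ground plane detection used for automatic labelling can be approximated by a classic RANSAC plane fit on the reconstructed 3D points; points close to the plane would then be marked as drivable and projected back into the image as training labels. The sketch below covers only the plane fit, with illustrative thresholds and synthetic data.

```python
import numpy as np

def ransac_plane(points, iters=200, inlier_thresh=0.05, rng=None):
    """Fit a plane (n, d) with n . x + d = 0 to an (N, 3) point cloud via RANSAC."""
    rng = rng or np.random.default_rng(0)
    best_inliers, best_plane = None, None
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                      # degenerate (collinear) sample
        n = n / norm
        d = -n @ p0
        dist = np.abs(points @ n + d)
        inliers = dist < inlier_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (n, d)
    return best_plane, best_inliers

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    ground = np.c_[rng.uniform(-10, 10, 3000), rng.uniform(0, 30, 3000),
                   rng.normal(0.0, 0.02, 3000)]          # z ~ 0: road surface
    clutter = rng.uniform([-10, 0, 0.3], [10, 30, 3.0], size=(500, 3))
    pts = np.vstack([ground, clutter])
    (n, d), inliers = ransac_plane(pts)
    print(n.round(2), round(d, 3), inliers.sum())         # normal ~ (0, 0, ±1)
```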

Posted Content
TL;DR: A novel metric learning objective based on a triplet architecture and Maximum Mean Discrepancy is proposed and a novel deep architecture based on attentive recurrent neural networks is proposed, which enforces a better separation within the learned embedding space of the different motion categories by means of the associated distribution moments.
Abstract: Effectively measuring the similarity between two human motions is necessary for several computer vision tasks such as gait analysis, person identification and action retrieval. Nevertheless, we believe that traditional approaches such as L2 distance or Dynamic Time Warping based on hand-crafted local pose metrics fail to appropriately capture the semantic relationship across motions and, as such, are not suitable for being employed as metrics within these tasks. This work addresses this limitation by means of a triplet-based deep metric learning specifically tailored to deal with human motion data, in particular with the problem of varying input size and computationally expensive hard negative mining due to motion pair alignment. Specifically, we propose (1) a novel metric learning objective based on a triplet architecture and Maximum Mean Discrepancy; as well as, (2) a novel deep architecture based on attentive recurrent neural networks. One benefit of our objective function is that it enforces a better separation within the learned embedding space of the different motion categories by means of the associated distribution moments. At the same time, our attentive recurrent neural network allows processing varying input sizes to a fixed size of embedding while learning to focus on those motion parts that are semantically distinctive. Our experiments on two different datasets demonstrate significant improvements over conventional human motion metrics.

Book ChapterDOI
08 Sep 2018
TL;DR: This document summarizes the 4th International Workshop on Recovering 6D Object Pose which was organized in conjunction with ECCV 2018 in Munich and featured four invited talks, oral and poster presentations of accepted workshop papers, and an introduction of the BOP benchmark for 6D object pose estimation.
Abstract: This document summarizes the 4th International Workshop on Recovering 6D Object Pose which was organized in conjunction with ECCV 2018 in Munich. The workshop featured four invited talks, oral and poster presentations of accepted workshop papers, and an introduction of the BOP benchmark for 6D object pose estimation. The workshop was attended by 100+ people working on relevant topics in both academia and industry who shared up-to-date advances and discussed open problems.

Posted Content
TL;DR: In this paper, a layer that acts as a spatio-semantic guide is added to the network to modify the network's activations, either directly via an energy minimization scheme or indirectly through a recurrent model that translates human language queries to interaction weights.
Abstract: Interaction and collaboration between humans and intelligent machines has become increasingly important as machine learning methods move into real-world applications that involve end users. While much prior work lies at the intersection of natural language and vision, such as image captioning or image generation from text descriptions, less focus has been placed on the use of language to guide or improve the performance of a learned visual processing algorithm. In this paper, we explore methods to flexibly guide a trained convolutional neural network through user input to improve its performance during inference. We do so by inserting a layer that acts as a spatio-semantic guide into the network. This guide is trained to modify the network's activations, either directly via an energy minimization scheme or indirectly through a recurrent model that translates human language queries to interaction weights. Learning the verbal interaction is fully automatic and does not require manual text annotations. We evaluate the method on two datasets, showing that guiding a pre-trained network can improve performance, and provide extensive insights into the interaction between the guide and the CNN.

Posted Content
TL;DR: In this article, a general-purpose, fully-convolutional network architecture for efficiently processing large-scale 3D data is proposed, which can process unorganized 3D representations such as point clouds as input, then transform them internally to ordered structures to be processed via 3D convolutions.
Abstract: This work proposes a general-purpose, fully-convolutional network architecture for efficiently processing large-scale 3D data. One striking characteristic of our approach is its ability to process unorganized 3D representations such as point clouds as input and then transform them internally into ordered structures to be processed via 3D convolutions. In contrast to conventional approaches that maintain either unorganized or organized representations, from input to output, our approach has the advantage of operating on memory-efficient input data representations while at the same time exploiting the natural structure of convolutional operations to avoid the redundant computing and storing of spatial information in the network. The network eliminates the need to pre- or post-process the raw sensor data. This, together with the fully-convolutional nature of the network, makes it an end-to-end method able to process point clouds of huge spaces or even entire rooms with up to 200k points at once. Another advantage is that our network can produce either an ordered output or map predictions directly onto the input cloud, thus making it suitable as a general-purpose point cloud descriptor applicable to many 3D tasks. We demonstrate our network's ability to effectively learn both low-level features as well as complex compositional relationships by evaluating it on benchmark datasets for semantic voxel segmentation, semantic part segmentation and 3D scene captioning.

Posted Content
TL;DR: 3D point-capsule networks are proposed: an auto-encoder designed to process sparse 3D point clouds while preserving spatial arrangements of the input data, enabling new applications such as part interpolation and replacement.
Abstract: In this paper, we propose 3D point-capsule networks, an auto-encoder designed to process sparse 3D point clouds while preserving spatial arrangements of the input data. 3D capsule networks arise as a direct consequence of our novel unified 3D auto-encoder formulation. Their dynamic routing scheme and the peculiar 2D latent space deployed by our approach bring in improvements for several common point cloud-related tasks, such as object classification, object reconstruction and part segmentation as substantiated by our extensive evaluations. Moreover, it enables new applications such as part interpolation and replacement.

Proceedings ArticleDOI
TL;DR: This work proposes a method to reconstruct, complete and semantically label a 3D scene from a single input depth image by using multiple adversarial loss terms that not only enforce realistic outputs with respect to the ground truth, but also an effective embedding of the internal features.
Abstract: We propose a method to reconstruct, complete and semantically label a 3D scene from a single input depth image. We improve the accuracy of the regressed semantic 3D maps by a novel architecture based on adversarial learning. In particular, we suggest using multiple adversarial loss terms that not only enforce realistic outputs with respect to the ground truth, but also an effective embedding of the internal features. This is done by correlating the latent features of the encoder working on partial 2.5D data with the latent features extracted from a variational 3D auto-encoder trained to reconstruct the complete semantic scene. In addition, differently from other approaches that operate entirely through 3D convolutions, at test time we retain the original 2.5D structure of the input during downsampling to improve the effectiveness of the internal representation of our model. We test our approach on the main benchmark datasets for semantic scene completion to qualitatively and quantitatively assess the effectiveness of our proposal.

Book ChapterDOI
01 Jan 2018
TL;DR: This chapter outlines how computer vision can support the surgeon during an intervention via the example of surgical instrument tracking in retinal microsurgery, and shows how to derive algorithms for simultaneous tool tracking and pose estimation based on random forests.
Abstract: In recent years, computer vision has become a remarkable tool for various computer-assisted medical applications, paving the way towards the use of augmented reality and advanced visualization in the medical domain. In this chapter we outline how computer vision can support the surgeon during an intervention via the example of surgical instrument tracking in retinal microsurgery, which incorporates challenges and requirements that are common for the employment of this technique in various medical applications. In particular, we show how to derive algorithms for simultaneous tool tracking and pose estimation based on random forests and how to increase robustness in problems associated with retinal microsurgery images, such as wide variations in illumination and high noise levels. Furthermore, we elaborate on how to evaluate the overall performance of such an algorithm in terms of accuracy and describe the missing steps that are necessary to deploy these techniques in real clinical practice.

Book ChapterDOI
TL;DR: The 4th International Workshop on Recovering 6D Object Pose was organized in conjunction with ECCV 2018 in Munich and was attended by 100+ people working on relevant topics in both academia and industry, who shared up-to-date advances and discussed open problems.
Abstract: This document summarizes the 4th International Workshop on Recovering 6D Object Pose which was organized in conjunction with ECCV 2018 in Munich. The workshop featured four invited talks, oral and poster presentations of accepted workshop papers, and an introduction of the BOP benchmark for 6D object pose estimation. The workshop was attended by 100+ people working on relevant topics in both academia and industry who shared up-to-date advances and discussed open problems.