
Showing papers by "Federico Tombari published in 2018"


Posted Content
TL;DR: A benchmark for 6D pose estimation of a rigid object from a single RGB-D input image shows that methods based on point-pair features currently perform best, outperforming template matching methods, learning-based methods and methods based on 3D local features.
Abstract: We propose a benchmark for 6D pose estimation of a rigid object from a single RGB-D input image. The training data consists of a texture-mapped 3D object model or images of the object in known 6D poses. The benchmark comprises: i) eight datasets in a unified format that cover different practical scenarios, including two new datasets focusing on varying lighting conditions, ii) an evaluation methodology with a pose-error function that deals with pose ambiguities, iii) a comprehensive evaluation of 15 diverse recent methods that captures the status quo of the field, and iv) an online evaluation system that is open for continuous submission of new results. The evaluation shows that methods based on point-pair features currently perform best, outperforming template matching methods, learning-based methods and methods based on 3D local features. The project website is available at bop.felk.cvut.cz.

224 citations
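
The abstract above mentions a pose-error function that deals with pose ambiguities; in the BOP setup this is the Visible Surface Discrepancy. As a lighter, self-contained illustration of how a pose error can be made robust to object symmetries, the sketch below implements the widely used ADD and ADD-S metrics in plain NumPy; the model points, poses and acceptance threshold are illustrative, not taken from the benchmark.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform (R: 3x3, t: 3,) to an (N, 3) point set."""
    return points @ R.T + t

def add_error(model_points, R_est, t_est, R_gt, t_gt):
    """ADD: mean distance between corresponding model points under the
    estimated and ground-truth poses (suited to non-symmetric objects)."""
    est = transform(model_points, R_est, t_est)
    gt = transform(model_points, R_gt, t_gt)
    return np.linalg.norm(est - gt, axis=1).mean()

def adds_error(model_points, R_est, t_est, R_gt, t_gt):
    """ADD-S: for each ground-truth point take the closest estimated point,
    which makes the error invariant to object symmetries."""
    est = transform(model_points, R_est, t_est)
    gt = transform(model_points, R_gt, t_gt)
    # Pairwise distances (N x N); fine for a few thousand model points.
    d = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return d.min(axis=1).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform(-0.05, 0.05, size=(500, 3))   # hypothetical object model (metres)
    R_gt, t_gt = np.eye(3), np.array([0.0, 0.0, 0.5])
    R_est, t_est = np.eye(3), np.array([0.002, 0.0, 0.5])
    # A pose is often accepted if the error is below 10% of the object diameter.
    print(add_error(pts, R_est, t_est, R_gt, t_gt))
    print(adds_error(pts, R_est, t_est, R_gt, t_gt))
```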


Book ChapterDOI
08 Sep 2018
TL;DR: In this article, the authors propose a benchmark for 6D pose estimation of a rigid object from a single RGB-D input image, which consists of a texture-mapped 3D object model or images of the object in known 6D poses.
Abstract: We propose a benchmark for 6D pose estimation of a rigid object from a single RGB-D input image. The training data consists of a texture-mapped 3D object model or images of the object in known 6D poses. The benchmark comprises: (i) eight datasets in a unified format that cover different practical scenarios, including two new datasets focusing on varying lighting conditions, (ii) an evaluation methodology with a pose-error function that deals with pose ambiguities, (iii) a comprehensive evaluation of 15 diverse recent methods that captures the status quo of the field, and (iv) an online evaluation system that is open for continuous submission of new results. The evaluation shows that methods based on point-pair features currently perform best, outperforming template matching methods, learning-based methods and methods based on 3D local features. The project website is available at bop.felk.cvut.cz.

193 citations


Book ChapterDOI
08 Sep 2018
TL;DR: This work proposes a general-purpose, fully-convolutional network architecture for efficiently processing large-scale 3D data and demonstrates its ability to effectively learn both low-level features as well as complex compositional relationships by evaluating it on benchmark datasets for semantic voxel segmentation, semantic part segmentation and 3D scene captioning.
Abstract: This work proposes a general-purpose, fully-convolutional network architecture for efficiently processing large-scale 3D data. One striking characteristic of our approach is its ability to process unorganized 3D representations such as point clouds as input and then transform them internally into ordered structures to be processed via 3D convolutions. In contrast to conventional approaches that maintain either unorganized or organized representations, from input to output, our approach has the advantage of operating on memory-efficient input data representations while at the same time exploiting the natural structure of convolutional operations to avoid the redundant computing and storing of spatial information in the network. The network eliminates the need to pre- or post-process the raw sensor data. This, together with the fully-convolutional nature of the network, makes it an end-to-end method able to process point clouds of huge spaces or even entire rooms with up to 200k points at once. Another advantage is that our network can produce either an ordered output or map predictions directly onto the input cloud, thus making it suitable as a general-purpose point cloud descriptor applicable to many 3D tasks. We demonstrate our network’s ability to effectively learn both low-level features as well as complex compositional relationships by evaluating it on benchmark datasets for semantic voxel segmentation, semantic part segmentation and 3D scene captioning.

172 citations
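
The central idea above — turning an unorganized point cloud into an ordered structure that 3D convolutions can consume, and mapping predictions back onto the input points — can be illustrated with a minimal occupancy-grid round trip. This is only a conceptual sketch; the paper's internal representation and network are more involved, and all names below are hypothetical.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Map an unorganized (N, 3) point cloud to an ordered occupancy grid.
    Also returns the per-point voxel indices needed to map voxel-level
    predictions back onto the input cloud."""
    origin = points.min(axis=0)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    dims = idx.max(axis=0) + 1
    grid = np.zeros(dims, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0   # occupancy
    return grid, idx

def unvoxelize(voxel_predictions, point_voxel_idx):
    """Project per-voxel predictions (e.g. semantic labels) back to the points."""
    return voxel_predictions[point_voxel_idx[:, 0],
                             point_voxel_idx[:, 1],
                             point_voxel_idx[:, 2]]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    cloud = rng.uniform(0.0, 4.0, size=(200_000, 3))      # e.g. a room-scale scan
    grid, idx = voxelize(cloud, voxel_size=0.05)
    # Stand-in for the network: label every occupied voxel with class 1.
    labels = (grid > 0).astype(np.int64)
    point_labels = unvoxelize(labels, idx)
    print(grid.shape, point_labels.shape)
```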


Book ChapterDOI
08 Sep 2018
TL;DR: This work proposes a learning approach for panoramic depth map estimation from a single image, thanks to a specifically developed distortion-aware deformable convolution filter, which can be trained by means of conventional perspective images, then used to regress depth for panoramic images, thus bypassing the effort needed to create annotated panoramic training datasets.
Abstract: There is a high demand for 3D data for 360° panoramic images and videos, pushed by the growing availability on the market of specialized hardware for both capturing (e.g., omni-directional cameras) as well as visualizing in 3D (e.g., head mounted displays) panoramic images and videos. At the same time, 3D sensors able to capture 3D panoramic data are expensive and/or hardly available. To fill this gap, we propose a learning approach for panoramic depth map estimation from a single image. Thanks to a specifically developed distortion-aware deformable convolution filter, our method can be trained by means of conventional perspective images, then used to regress depth for panoramic images, thus bypassing the effort needed to create annotated panoramic training datasets. We also demonstrate our approach for emerging tasks such as panoramic monocular SLAM, panoramic semantic segmentation and panoramic style transfer.

143 citations
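
As a rough intuition for what a distortion-aware filter on equirectangular images has to compensate for: the same horizontal angular extent is stretched over more pixels near the poles, so a kernel's sampling locations can be widened by the inverse cosine of the latitude to keep its effective footprint roughly constant. The sketch below computes such per-row horizontal offsets; it is a simplification for illustration, not the authors' deformable convolution.

```python
import numpy as np

def equirect_offsets(height, kernel=3):
    """For each row of an equirectangular panorama, compute the horizontal
    sampling offsets of a `kernel` x `kernel` filter, stretched by 1/cos(latitude)
    (rows near the poles get wider support). Simplified illustration only;
    real distortion-aware filters also warp the vertical offsets."""
    half = kernel // 2
    base = np.arange(-half, half + 1, dtype=np.float32)          # e.g. [-1, 0, 1]
    rows = np.arange(height, dtype=np.float32)
    lat = (0.5 - (rows + 0.5) / height) * np.pi                  # latitude in (-pi/2, pi/2)
    stretch = 1.0 / np.clip(np.cos(lat), 1e-3, None)             # grows towards the poles
    # (height, kernel) horizontal offsets in pixels for every row.
    return base[None, :] * stretch[:, None]

if __name__ == "__main__":
    offsets = equirect_offsets(height=256)
    print(offsets[128])   # equator row: essentially [-1, 0, 1]
    print(offsets[10])    # near the pole: strongly stretched
```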


Book ChapterDOI
08 Sep 2018
TL;DR: A new visual loss is proposed that drives the pose update by aligning object contours, thus avoiding the definition of any explicit appearance model and producing pose accuracies that come close to 3D ICP without the need for depth data.
Abstract: We present a novel approach for model-based 6D pose refinement in color data. Building on the established idea of contour-based pose tracking, we teach a deep neural network to predict a translational and rotational update. At the core, we propose a new visual loss that drives the pose update by aligning object contours, thus avoiding the definition of any explicit appearance model. In contrast to previous work our method is correspondence-free, segmentation-free, can handle occlusion and is agnostic to geometrical symmetry as well as visual ambiguities. Additionally, we observe a strong robustness towards rough initialization. The approach can run in real-time and produces pose accuracies that come close to 3D ICP without the need for depth data. Furthermore, our networks are trained from purely synthetic data and will be published together with the refinement code at http://campar.in.tum.de/Main/FabianManhardt to ensure reproducibility.

135 citations
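
The contour-alignment idea behind the visual loss can be conveyed with a simple stand-in: given binary silhouettes of the object under the estimated and observed poses, measure how far the estimated contour lies from the observed one via a distance transform. The paper formulates this as a loss that drives a network's pose update; the snippet below (NumPy/SciPy, toy masks) only shows the geometric measurement.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, binary_erosion

def contour(mask):
    """Boundary pixels of a binary mask."""
    return mask & ~binary_erosion(mask)

def contour_alignment_error(pred_mask, target_mask):
    """Mean distance (in pixels) from the predicted contour to the target contour."""
    # Distance from every pixel to the nearest target-contour pixel.
    dist_to_target = distance_transform_edt(~contour(target_mask))
    pred_contour = contour(pred_mask)
    if not pred_contour.any():
        return np.inf
    return dist_to_target[pred_contour].mean()

if __name__ == "__main__":
    # Two slightly shifted square silhouettes as a toy example.
    target = np.zeros((100, 100), dtype=bool); target[30:70, 30:70] = True
    pred = np.zeros((100, 100), dtype=bool);   pred[33:73, 32:72] = True
    print(contour_alignment_error(pred, target))
```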


Proceedings ArticleDOI
01 Oct 2018
TL;DR: This work proposes a semantic monocular SLAM framework designed to deal with highly dynamic environments, combining feature-based and direct approaches to achieve robustness under challenging conditions and shows more stable pose estimation in dynamic environments.
Abstract: Recent advances in monocular SLAM have enabled real-time capable systems which run robustly under the assumption of a static environment, but fail in the presence of dynamic scene changes and motion, since they lack explicit dynamic outlier handling. We propose a semantic monocular SLAM framework designed to deal with highly dynamic environments, combining feature-based and direct approaches to achieve robustness under challenging conditions. The proposed approach exploits semantic information extracted from the scene within an explicit probabilistic model, which maximizes the probability for both tracking and mapping to rely on those scene parts that do not present a relative motion with respect to the camera. We show more stable pose estimation in dynamic environments and comparable performance to the state of the art on static sequences on the Virtual KITTI and Synthia datasets.

70 citations
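
One heavily simplified way to picture the probabilistic handling of dynamic content described above is to down-weight each feature's residual by the prior probability that its semantic class is static with respect to the camera. Class names, weights and the robust cost below are illustrative assumptions, not the paper's model.

```python
import numpy as np

# Illustrative prior probability that a semantic class is static w.r.t. the camera.
STATIC_PRIOR = {"road": 0.99, "building": 0.98, "vegetation": 0.95,
                "car": 0.30, "pedestrian": 0.05}

def weighted_tracking_cost(residuals, classes, sigma=1.0):
    """Robust tracking cost: each residual is weighted by the prior that the
    observed scene part is static, so dynamic objects barely influence the pose."""
    w = np.array([STATIC_PRIOR.get(c, 0.5) for c in classes])
    huber = np.where(np.abs(residuals) <= sigma,
                     0.5 * residuals**2,
                     sigma * (np.abs(residuals) - 0.5 * sigma))
    return float(np.sum(w * huber))

if __name__ == "__main__":
    res = np.array([0.2, 0.1, 3.5, 4.0])                  # reprojection errors (px)
    cls = ["road", "building", "car", "pedestrian"]        # per-feature semantics
    print(weighted_tracking_cost(res, cls))
```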


Posted Content
TL;DR: In this paper, a new visual loss is proposed to align object contours, thus avoiding the definition of any explicit appearance model, and a deep neural network is trained to predict a translational and rotational update.
Abstract: We present a novel approach for model-based 6D pose refinement in color data. Building on the established idea of contour-based pose tracking, we teach a deep neural network to predict a translational and rotational update. At the core, we propose a new visual loss that drives the pose update by aligning object contours, thus avoiding the definition of any explicit appearance model. In contrast to previous work our method is correspondence-free, segmentation-free, can handle occlusion and is agnostic to geometrical symmetry as well as visual ambiguities. Additionally, we observe a strong robustness towards rough initialization. The approach can run in real-time and produces pose accuracies that come close to 3D ICP without the need for depth data. Furthermore, our networks are trained from purely synthetic data and will be published together with the refinement code to ensure reproducibility.

59 citations


Journal ArticleDOI
TL;DR: Through extensive experiments on standard datasets, this work shows how feature matching performance improves significantly by deploying 3D descriptors together with companion detectors learned by the proposed methodology, compared to the adoption of established state-of-the-art 3D detectors based on hand-crafted saliency functions.
Abstract: The established approach to 3D keypoint detection consists in defining effective handcrafted saliency functions based on geometric cues with the aim of maximizing keypoint repeatability. In contrast, the idea behind our work is to learn a descriptor-specific keypoint detector so as to optimize the end-to-end performance of the feature matching pipeline. Accordingly, we cast 3D keypoint detection as a classification problem between surface patches that can or cannot be matched correctly by a given 3D descriptor, i.e., those that are either good or not with respect to that descriptor. We propose a machine learning framework that allows for defining examples of good surface patches from the training data and leverages Random Forest classifiers to realize both fixed-scale and adaptive-scale 3D keypoint detectors. Through extensive experiments on standard datasets, we show how feature matching performance improves significantly by deploying 3D descriptors together with companion detectors learned by our methodology, compared to the adoption of established state-of-the-art 3D detectors based on hand-crafted saliency functions.

51 citations
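
Casting keypoint detection as a binary classification of surface patches — matchable vs. non-matchable for a given descriptor — maps naturally onto an off-the-shelf Random Forest. The sketch below assumes patch features and descriptor-derived labels are already available and uses synthetic stand-ins for both; it is not the authors' exact feature design.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# Hypothetical training data: one row of geometric features per surface patch,
# label 1 if the companion 3D descriptor matched this patch correctly, else 0.
patch_features = rng.normal(size=(5000, 32))
labels = (patch_features[:, 0] + 0.5 * patch_features[:, 3] > 0).astype(int)

detector = RandomForestClassifier(n_estimators=100, max_depth=12, n_jobs=-1)
detector.fit(patch_features, labels)

# At detection time, keep the patches the forest considers most likely to be
# matchable and use their centres as keypoints.
test_features = rng.normal(size=(1000, 32))
scores = detector.predict_proba(test_features)[:, 1]
keypoint_idx = np.argsort(scores)[::-1][:100]
print(keypoint_idx[:10], scores[keypoint_idx[:10]])
```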


Journal ArticleDOI
TL;DR: It is conjecture that features learned via a convolutional neural network provide the ability to distinctively detect seizures from video, and even allow the system to generalize to different seizure types.
Abstract: Epileptic seizures constitute a serious neurological condition for patients and, if untreated, considerably decrease their quality of life. Early and correct diagnosis by semiological seizure analysis provides the main approach to treat and improve the patients’ condition. To obtain reliable and quantifiable information, medical professionals perform seizure detection and subsequent analysis using expensive video-EEG systems in specialized epilepsy monitoring units. However, the detection of seizures, especially under difficult circumstances such as occlusion by the blanket or in the absence of predictive EEG patterns, is highly subjective and should therefore be supported by automated systems. In this work, we conjecture that features learned via a convolutional neural network provide the ability to distinctively detect seizures from video, and even allow our system to generalize to different seizure types. By comparing our method to the state of the art, we show the superior performance of learned features...

45 citations


Proceedings ArticleDOI
30 Mar 2018
TL;DR: In this article, a layer that acts as a spatio-semantic guide is added to the network to modify the network's activations, either directly via an energy minimization scheme or indirectly through a recurrent model that translates human language queries to interaction weights.
Abstract: Interaction and collaboration between humans and intelligent machines has become increasingly important as machine learning methods move into real-world applications that involve end users. While much prior work lies at the intersection of natural language and vision, such as image captioning or image generation from text descriptions, less focus has been placed on the use of language to guide or improve the performance of a learned visual processing algorithm. In this paper, we explore methods to flexibly guide a trained convolutional neural network through user input to improve its performance during inference. We do so by inserting a layer that acts as a spatio-semantic guide into the network. This guide is trained to modify the network's activations, either directly via an energy minimization scheme or indirectly through a recurrent model that translates human language queries to interaction weights. Learning the verbal interaction is fully automatic and does not require manual text annotations. We evaluate the method on two datasets, showing that guiding a pre-trained network can improve performance, and provide extensive insights into the interaction between the guide and the CNN.

35 citations


Book ChapterDOI
08 Sep 2018
TL;DR: In this paper, a triplet-based deep metric learning approach is proposed to deal with human motion data, in particular with the problem of varying input size and computationally expensive hard negative mining due to motion pair alignment.
Abstract: Effectively measuring the similarity between two human motions is necessary for several computer vision tasks such as gait analysis, person identification and action retrieval. Nevertheless, we believe that traditional approaches such as L2 distance or Dynamic Time Warping based on hand-crafted local pose metrics fail to appropriately capture the semantic relationship across motions and, as such, are not suitable for being employed as metrics within these tasks. This work addresses this limitation by means of a triplet-based deep metric learning specifically tailored to deal with human motion data, in particular with the problem of varying input size and computationally expensive hard negative mining due to motion pair alignment. Specifically, we propose (1) a novel metric learning objective based on a triplet architecture and Maximum Mean Discrepancy; as well as, (2) a novel deep architecture based on attentive recurrent neural networks. One benefit of our objective function is that it enforces a better separation within the learned embedding space of the different motion categories by means of the associated distribution moments. At the same time, our attentive recurrent neural network allows processing varying input sizes to a fixed size of embedding while learning to focus on those motion parts that are semantically distinctive. Our experiments on two different datasets demonstrate significant improvements over conventional human motion metrics.
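
The Maximum Mean Discrepancy term in objective (1) admits a compact estimate; below is a standard biased RBF-kernel MMD between two sets of motion embeddings, of the kind one could plug into such a triplet objective. The kernel bandwidth and the synthetic embeddings are assumptions, not the paper's settings.

```python
import numpy as np

def rbf_kernel(a, b, sigma):
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy between two
    embedding sets x (n, d) and y (m, d) under an RBF kernel."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    anchor_class = rng.normal(0.0, 1.0, size=(64, 128))   # embeddings of one motion class
    other_class = rng.normal(0.5, 1.0, size=(64, 128))    # embeddings of another class
    print(mmd2(anchor_class, anchor_class[32:]))           # small: same distribution
    print(mmd2(anchor_class, other_class))                 # larger: distributions differ
```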

Proceedings ArticleDOI
27 Dec 2018
TL;DR: In this article, the authors propose an efficient and scalable method for incrementally building a dense, semantically annotated 3D map in real-time, which assigns class probabilities to each region, not each element (e.g., surfel and voxel), of the 3D maps built up through a robust SLAM framework and incrementally segmented with a geometric-based segmentation method.
Abstract: We propose an efficient and scalable method for incrementally building a dense, semantically annotated 3D map in real time. The proposed method assigns class probabilities to each region, not each element (e.g., surfel and voxel), of the 3D map, which is built up through a robust SLAM framework and incrementally segmented with a geometric-based segmentation method. Differently from all other approaches, our method is capable of running at over 30 Hz while performing all processing components, including SLAM, segmentation, 2D recognition, and updating class probabilities of each segmentation label at every incoming frame, thanks to the high efficiency that characterizes the computationally intensive stages of our framework. By utilizing a specifically designed CNN to improve the frame-wise segmentation result, we can also achieve high accuracy. We validate our method on the NYUv2 dataset by comparing with the state of the art in terms of accuracy and computational efficiency, and by means of an analysis in terms of time and space complexity.
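
The per-segment class-probability update can be sketched as a simple recursive Bayesian fusion: each segment keeps a distribution over classes that is multiplied by every new frame-wise prediction and renormalised. This generic formulation is shown below for illustration; the exact update rule in the paper may differ.

```python
import numpy as np

NUM_CLASSES = 13  # e.g. an NYUv2-style label set

class SegmentLabelFusion:
    """Per-segment class probabilities, updated with every incoming frame."""

    def __init__(self):
        self.prob = {}  # segment id -> probability vector over classes

    def update(self, segment_id, frame_prediction, eps=1e-3):
        """Multiply the stored distribution by the new CNN prediction for the
        pixels of this segment, then renormalise (recursive Bayes)."""
        p = self.prob.get(segment_id, np.full(NUM_CLASSES, 1.0 / NUM_CLASSES))
        p = p * (frame_prediction + eps)   # eps keeps zero scores from locking in
        self.prob[segment_id] = p / p.sum()

    def label(self, segment_id):
        return int(np.argmax(self.prob[segment_id]))

if __name__ == "__main__":
    fusion = SegmentLabelFusion()
    rng = np.random.default_rng(4)
    for _ in range(30):                      # 30 frames observing segment 7
        pred = rng.dirichlet(np.ones(NUM_CLASSES) + 5 * np.eye(NUM_CLASSES)[2])
        fusion.update(7, pred)
    print(fusion.label(7), fusion.prob[7].round(3))
```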

Posted Content
TL;DR: In this article, the authors predict multiple pose and class outcomes to estimate the specific pose distribution generated by symmetries and repetitive textures, and show the benefits of their approach which provides not only a better explanation for pose ambiguity, but also a higher accuracy in terms of pose estimation.
Abstract: 3D object detection and pose estimation from a single image are two inherently ambiguous problems. Oftentimes, objects appear similar from different viewpoints due to shape symmetries, occlusion and repetitive textures. This ambiguity in both detection and pose estimation means that an object instance can be perfectly described by several different poses and even classes. In this work we propose to explicitly deal with this uncertainty. For each object instance we predict multiple pose and class outcomes to estimate the specific pose distribution generated by symmetries and repetitive textures. The distribution collapses to a single outcome when the visual appearance uniquely identifies just one valid pose. We show the benefits of our approach which provides not only a better explanation for pose ambiguity, but also a higher accuracy in terms of pose estimation.
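
The "predict several outcomes and let the best one explain the data" idea can be written as a relaxed winner-takes-all loss over M hypotheses: only the hypothesis closest to the ground truth is fully penalised, with a small share of the loss spread over the rest so all heads keep training. The formulation below is a generic sketch, not the paper's exact objective.

```python
import numpy as np

def multi_hypothesis_loss(hypotheses, ground_truth, per_hypothesis_loss, eps=0.05):
    """Relaxed winner-takes-all loss over M hypotheses.

    hypotheses: array of shape (M, D) with M predicted outcomes.
    ground_truth: array of shape (D,).
    per_hypothesis_loss: function mapping (prediction, target) -> scalar.
    eps: small share of the loss distributed over the non-winning hypotheses.
    """
    losses = np.array([per_hypothesis_loss(h, ground_truth) for h in hypotheses])
    best = np.argmin(losses)
    weights = np.full(len(losses), eps / max(len(losses) - 1, 1))
    weights[best] = 1.0 - eps
    return float(np.sum(weights * losses)), int(best)

if __name__ == "__main__":
    # Two plausible poses of a rotationally symmetric object (angle in radians).
    hyps = np.array([[0.05], [np.pi + 0.02]])
    gt = np.array([0.0])
    loss, winner = multi_hypothesis_loss(
        hyps, gt, lambda a, b: float(np.linalg.norm(a - b)))
    print(loss, winner)   # hypothesis 0 wins; the symmetric alternative is barely penalised
```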

Journal ArticleDOI
04 Jul 2018
TL;DR: An online RGB-D based scene understanding method for indoor scenes, running in real time on mobile devices, achieves an accuracy close to that of state-of-the-art 3-D scene understanding methods while being much more efficient, enabling real-time execution on low-power embedded systems.
Abstract: We propose an online RGB-D based scene understanding method for indoor scenes running in real time on mobile devices. First, we incrementally reconstruct the scene via simultaneous localization and mapping and compute a three-dimensional (3-D) geometric segmentation by fusing segments obtained from each input depth image in a global 3-D model. We combine this geometric segmentation with semantic annotations to obtain a semantic segmentation in the form of a semantic map. To accomplish efficient semantic segmentation, we encode the segments in the global model with a fast incremental 3-D descriptor and use a random forest to determine their semantic label. The predictions from successive frames are then fused to obtain a confident semantic class across time. As a result, the overall method achieves an accuracy close to that of state-of-the-art 3-D scene understanding methods while being much more efficient, enabling real-time execution on low-power embedded systems.

Journal ArticleDOI
TL;DR: A detailed analysis of PPFs is presented, which highlights the conditions under which PPFs perform particularly well as well as their main weaknesses, and finds that PPFs degrade faster than most local histogram features under disturbances such as occlusion and clutter.

Proceedings ArticleDOI
25 Oct 2018
TL;DR: In this article, the authors propose a method to reconstruct, complete and semantically label a 3D scene from a single input depth image by using multiple adversarial loss terms that enforce realistic outputs with respect to the ground truth, but also an effective embedding of the internal features.
Abstract: We propose a method to reconstruct, complete and semantically label a 3D scene from a single input depth image. We improve the accuracy of the regressed semantic 3D maps by a novel architecture based on adversarial learning. In particular, we suggest using multiple adversarial loss terms that not only enforce realistic outputs with respect to the ground truth, but also an effective embedding of the internal features. This is done by correlating the latent features of the encoder working on partial 2.5D data with the latent features extracted from a variational 3D auto-encoder trained to reconstruct the complete semantic scene. In addition, differently from other approaches that operate entirely through 3D convolutions, at test time we retain the original 2.5D structure of the input during downsampling to improve the effectiveness of the internal representation of our model. We test our approach on the main benchmark datasets for semantic scene completion to qualitatively and quantitatively assess the effectiveness of our proposal.

Journal ArticleDOI
TL;DR: This work proposes another solution that utilizes a multi-camera system such that the data simultaneously acquired from multiple RGB-D sensors helps the tracker to handle challenging conditions that affect a subset of the cameras.
Abstract: We demonstrate how 3D head tracking and pose estimation can be effectively and efficiently achieved from noisy RGB-D sequences. Our proposal leverages a random forest framework, designed to regress the 3D head pose at every frame in a temporal tracking manner. One peculiarity of the algorithm is that it exploits together (1) a generic training dataset of 3D head models, which is learned once offline; and (2) an online refinement with subject-specific 3D data, which aims for the tracker to withstand slight facial deformations and to adapt its forest to the specific characteristics of an individual subject. The combination of these allows our algorithm to be robust even under extreme poses, where the user's face is no longer visible in the image. Finally, we also propose another solution that utilizes a multi-camera system such that the data simultaneously acquired from multiple RGB-D sensors helps the tracker to handle challenging conditions that affect a subset of the cameras. Notably, the proposed multi-camera framework yields a real-time performance of approximately 8 ms per frame given six cameras and one CPU core, and scales up linearly to 30 fps with 25 cameras.

Proceedings ArticleDOI
17 May 2018
TL;DR: This work proposes a situation assessment algorithm for classifying driving situations with respect to their suitability for lane changing, based on a Bidirectional Recurrent Neural Network, which uses Long Short-Term Memory units, and integrates a prediction component in the form of the Intelligent Driver Model.
Abstract: One of the greatest challenges towards fully autonomous cars is the understanding of complex and dynamic scenes. Such understanding is needed for planning of maneuvers, especially those that are particularly frequent such as lane changes. While in recent years advanced driver-assistance systems have made driving safer and more comfortable, these have mostly focused on car following scenarios, and less on maneuvers involving lane changes. In this work we propose a situation assessment algorithm for classifying driving situations with respect to their suitability for lane changing. For this, we propose a deep learning architecture based on a Bidirectional Recurrent Neural Network, which uses Long Short-Term Memory units, and integrates a prediction component in the form of the Intelligent Driver Model. We prove the feasibility of our algorithm on the publicly available NGSIM datasets, where we outperform existing methods.
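
The Intelligent Driver Model used as the prediction component is a compact car-following law; a standard formulation is sketched below. Parameter values are common defaults for illustration, not necessarily those used in the paper.

```python
import numpy as np

def idm_acceleration(v, gap, dv,
                     v0=30.0,      # desired speed (m/s)
                     T=1.5,        # desired time headway (s)
                     a_max=1.4,    # maximum acceleration (m/s^2)
                     b=2.0,        # comfortable deceleration (m/s^2)
                     s0=2.0,       # minimum gap (m)
                     delta=4.0):
    """Intelligent Driver Model: acceleration of the ego vehicle given its
    speed v, the gap to the leading vehicle and the approach rate dv = v - v_lead."""
    s_star = s0 + v * T + v * dv / (2.0 * np.sqrt(a_max * b))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / max(gap, 1e-3)) ** 2)

if __name__ == "__main__":
    # Ego at 25 m/s, 40 m behind a leader driving 20 m/s -> IDM commands braking.
    print(idm_acceleration(v=25.0, gap=40.0, dv=5.0))
    # Free road (huge gap), below desired speed -> gentle acceleration.
    print(idm_acceleration(v=25.0, gap=1e6, dv=0.0))
```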

Book ChapterDOI
02 Dec 2018
TL;DR: In this paper, the authors propose to estimate multiple grasp poses from a single RGB image of the target object by replacing conventional grasp rectangles with grasp belief maps, which hold more precise location information than a rectangle and account for the uncertainty inherent to the task.
Abstract: Humans excel in grasping and manipulating objects because of their life-long experience and knowledge about the 3D shape and weight distribution of objects. However, the lack of such intuition in robots makes robotic grasping an exceptionally challenging task. There are often several equally viable options for grasping an object. Yet this ambiguity is not modeled in conventional systems that estimate a single, optimal grasp position. We propose to tackle this problem by simultaneously estimating multiple grasp poses from a single RGB image of the target object. Further, we reformulate the problem of robotic grasping by replacing conventional grasp rectangles with grasp belief maps, which hold more precise location information than a rectangle and account for the uncertainty inherent to the task. We augment a fully convolutional neural network with a multiple hypothesis prediction model that predicts a set of grasp hypotheses in under 60 ms, which is critical for real-time robotic applications. The grasp detection accuracy reaches over 90% for unseen objects, outperforming the current state of the art on this task.
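
The switch from grasp rectangles to grasp belief maps can be pictured by rasterising each grasp as an oriented 2D Gaussian centred on the grasp point, with spreads tied to the rectangle's extent; the network then regresses such maps instead of box parameters. The exact parameterisation below is an assumption made for illustration.

```python
import numpy as np

def grasp_belief_map(shape, center, angle, width, height):
    """Rasterise an oriented 2D Gaussian 'belief map' for one grasp hypothesis.

    shape:  (H, W) of the output map.
    center: (x, y) grasp centre in pixels.
    angle:  grasp orientation in radians.
    width/height: extents of the original grasp rectangle (pixels), used here
                  as standard deviations along / across the grasp axis.
    """
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]].astype(np.float32)
    dx, dy = xs - center[0], ys - center[1]
    # Rotate coordinates into the grasp frame.
    u = np.cos(angle) * dx + np.sin(angle) * dy
    v = -np.sin(angle) * dx + np.cos(angle) * dy
    return np.exp(-0.5 * ((u / (width / 2.0)) ** 2 + (v / (height / 2.0)) ** 2))

if __name__ == "__main__":
    belief = grasp_belief_map((224, 224), center=(120, 100),
                              angle=np.deg2rad(30), width=60, height=20)
    print(belief.shape, belief.max(), np.unravel_index(belief.argmax(), belief.shape))
```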

Journal ArticleDOI
TL;DR: A discriminative alternative for the association step is proposed that leverages random forests to infer correspondences in one shot; it allows for large deformations, prevents tracking errors from accumulating, and is referred to as ‘tracking-by-detection of 3D human shapes.’
Abstract: 3D human shape tracking consists in fitting a template model to temporal sequences of visual observations. It usually comprises an association step that finds correspondences between the model and the input data, and a deformation step that fits the model to the observations given correspondences. Most current approaches follow the Iterative-Closest-Point (ICP) paradigm, where the association step is carried out by searching for the nearest neighbors. This fails when large deformations occur, and errors in the association tend to propagate over time. In this paper, we propose a discriminative alternative for the association that leverages random forests to infer correspondences in one shot. Regardless of the choice of shape parameterization, be it surface or volumetric meshes, we convert 3D shapes to volumetric distance fields and thereby design features to train the forest. We investigate two ways to draw volumetric samples: voxels of regular grids and cells from Centroidal Voronoi Tessellation (CVT). While the former consumes considerable memory and in turn limits us to learn only subject-specific correspondences, the latter yields a much smaller memory footprint by compactly tessellating the interior space of a shape with optimal discretization. This facilitates the use of larger cross-subject training databases, generalizes to different human subjects and hence results in less overfitting and better detection. The discriminative correspondences are successfully integrated into both surface and volumetric deformation frameworks that recover human shape poses, which we refer to as ‘tracking-by-detection of 3D human shapes.’ It allows for large deformations and prevents tracking errors from being accumulated. When combined with ICP for refinement, it proves to yield better accuracy in registration and more stability when tracking over time. Evaluations on existing datasets demonstrate the benefits with respect to the state of the art.

Book ChapterDOI
16 Sep 2018
TL;DR: In this paper, a two-step transfer learning based training process with a robust loss function is proposed to train deep models for the task of fine-grained skin lesion classification.
Abstract: Within medical imaging, manual curation of sufficiently many well-labeled samples is cost-, time- and scale-prohibitive. To improve the representativeness of the training dataset, for the first time, we present an approach to utilize large amounts of freely available web data through web-crawling. To handle the noisy and weak nature of web annotations, we propose a two-step transfer learning based training process with a robust loss function, termed Webly Supervised Learning (WSL), to train deep models for the task. We also leverage search by image to improve the search specificity of our web-crawling and reduce cross-domain noise. Within WSL, we explicitly model the noise structure between classes and incorporate it to selectively distill knowledge from the web data during model training. To demonstrate improved performance due to WSL, we benchmarked on a publicly available 10-class fine-grained skin lesion classification dataset and report a significant improvement of top-1 classification accuracy from 71.25% to 80.53% due to the incorporation of web supervision.
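
One standard way to "explicitly model the noise structure between classes" is a noise transition matrix T, with T[i, j] the probability that a sample of true class i is labelled j on the web; the clean prediction is pushed through T before computing the cross-entropy (forward correction). The sketch below shows that mechanism generically and is not claimed to be the paper's exact WSL formulation.

```python
import numpy as np

def forward_corrected_cross_entropy(clean_probs, noisy_label, transition):
    """Cross-entropy against a (possibly noisy) web label after pushing the
    model's clean class posterior through the noise transition matrix.

    clean_probs: (C,) softmax output of the model (posterior over true classes).
    noisy_label: integer label as found on the web.
    transition:  (C, C) matrix, transition[i, j] = P(observed j | true i).
    """
    noisy_probs = clean_probs @ transition          # predicted distribution over web labels
    return float(-np.log(noisy_probs[noisy_label] + 1e-12))

if __name__ == "__main__":
    C = 4
    # Mostly correct web labels, but class 1 is often mislabelled as class 2.
    T = np.full((C, C), 0.02)
    np.fill_diagonal(T, 0.94)
    T[1, 2] += 0.3; T[1, 1] -= 0.3
    T = T / T.sum(axis=1, keepdims=True)
    clean = np.array([0.05, 0.85, 0.05, 0.05])      # model believes the true class is 1
    # A web label of 2 is only mildly penalised, since class 1 is often tagged as 2.
    print(forward_corrected_cross_entropy(clean, 2, T))
    print(forward_corrected_cross_entropy(clean, 3, T))
```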

Proceedings ArticleDOI
01 Oct 2018
TL;DR: It is shown that good segmentation results on monocular images are achieved, which substantially exceed the performance of the algorithm employed for automatic labeling without the need of any manual annotation.
Abstract: We propose a new approach for generating training data for the task of drivable area segmentation with deep neural networks (DNNs). The impressive progress of deep learning in recent years has demonstrated the superior performance of DNNs over traditional machine learning and deterministic algorithms for various tasks. Nevertheless, the acquisition of large-scale datasets with associated ground truth labels still poses an expensive and labor-intensive problem. We contribute to the solution of this problem for the task of road segmentation by proposing an automatic labeling pipeline which leverages a deterministic stereo-based approach for ground plane detection to create large datasets suitable for training neural networks. Based on the popular Cityscapes [1] and KITTI [2] datasets and two off-the-shelf DNNs for semantic segmentation, we show that we can achieve good segmentation results on monocular images, which substantially exceed the performance of the algorithm employed for automatic labeling without the need for any manual annotation.
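
The deterministic stereo-based ground plane detection used for automatic labelling can be approximated by a classic RANSAC plane fit on the reconstructed 3D points; points close to the plane would then be marked as drivable and projected back into the image as training labels. The sketch below covers only the plane fit, with illustrative thresholds and synthetic data.

```python
import numpy as np

def ransac_plane(points, iters=200, inlier_thresh=0.05, rng=None):
    """Fit a plane (n, d) with n . x + d = 0 to an (N, 3) point cloud via RANSAC."""
    rng = rng or np.random.default_rng(0)
    best_inliers, best_plane = None, None
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                      # degenerate (collinear) sample
        n = n / norm
        d = -n @ p0
        dist = np.abs(points @ n + d)
        inliers = dist < inlier_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (n, d)
    return best_plane, best_inliers

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    ground = np.c_[rng.uniform(-10, 10, 3000), rng.uniform(0, 30, 3000),
                   rng.normal(0.0, 0.02, 3000)]          # z ~ 0: road surface
    clutter = rng.uniform([-10, 0, 0.3], [10, 30, 3.0], size=(500, 3))
    pts = np.vstack([ground, clutter])
    (n, d), inliers = ransac_plane(pts)
    print(n.round(2), round(d, 3), inliers.sum())         # normal ~ (0, 0, ±1)
```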

Posted Content
TL;DR: A novel metric learning objective based on a triplet architecture and Maximum Mean Discrepancy is proposed and a novel deep architecture based on attentive recurrent neural networks is proposed, which enforces a better separation within the learned embedding space of the different motion categories by means of the associated distribution moments.
Abstract: Effectively measuring the similarity between two human motions is necessary for several computer vision tasks such as gait analysis, person identification and action retrieval. Nevertheless, we believe that traditional approaches such as L2 distance or Dynamic Time Warping based on hand-crafted local pose metrics fail to appropriately capture the semantic relationship across motions and, as such, are not suitable for being employed as metrics within these tasks. This work addresses this limitation by means of a triplet-based deep metric learning specifically tailored to deal with human motion data, in particular with the problem of varying input size and computationally expensive hard negative mining due to motion pair alignment. Specifically, we propose (1) a novel metric learning objective based on a triplet architecture and Maximum Mean Discrepancy; as well as, (2) a novel deep architecture based on attentive recurrent neural networks. One benefit of our objective function is that it enforces a better separation within the learned embedding space of the different motion categories by means of the associated distribution moments. At the same time, our attentive recurrent neural network allows processing varying input sizes to a fixed size of embedding while learning to focus on those motion parts that are semantically distinctive. Our experiments on two different datasets demonstrate significant improvements over conventional human motion metrics.

Book ChapterDOI
08 Sep 2018
TL;DR: This document summarizes the 4th International Workshop on Recovering 6D Object Pose which was organized in conjunction with ECCV 2018 in Munich and featured four invited talks, oral and poster presentations of accepted workshop papers, and an introduction of the BOP benchmark for 6D object pose estimation.
Abstract: This document summarizes the 4th International Workshop on Recovering 6D Object Pose which was organized in conjunction with ECCV 2018 in Munich. The workshop featured four invited talks, oral and poster presentations of accepted workshop papers, and an introduction of the BOP benchmark for 6D object pose estimation. The workshop was attended by 100+ people working on relevant topics in both academia and industry who shared up-to-date advances and discussed open problems.

Posted Content
TL;DR: In this paper, a layer that acts as a spatio-semantic guide is added to the network to modify the network's activations, either directly via an energy minimization scheme or indirectly through a recurrent model that translates human language queries to interaction weights.
Abstract: Interaction and collaboration between humans and intelligent machines has become increasingly important as machine learning methods move into real-world applications that involve end users. While much prior work lies at the intersection of natural language and vision, such as image captioning or image generation from text descriptions, less focus has been placed on the use of language to guide or improve the performance of a learned visual processing algorithm. In this paper, we explore methods to flexibly guide a trained convolutional neural network through user input to improve its performance during inference. We do so by inserting a layer that acts as a spatio-semantic guide into the network. This guide is trained to modify the network's activations, either directly via an energy minimization scheme or indirectly through a recurrent model that translates human language queries to interaction weights. Learning the verbal interaction is fully automatic and does not require manual text annotations. We evaluate the method on two datasets, showing that guiding a pre-trained network can improve performance, and provide extensive insights into the interaction between the guide and the CNN.

Posted Content
TL;DR: In this article, a general-purpose, fully-convolutional network architecture for efficiently processing large-scale 3D data is proposed, which can process unorganized 3D representations such as point clouds as input, then transform them internally to ordered structures to be processed via 3D convolutions.
Abstract: This work proposes a general-purpose, fully-convolutional network architecture for efficiently processing large-scale 3D data. One striking characteristic of our approach is its ability to process unorganized 3D representations such as point clouds as input and then transform them internally into ordered structures to be processed via 3D convolutions. In contrast to conventional approaches that maintain either unorganized or organized representations, from input to output, our approach has the advantage of operating on memory-efficient input data representations while at the same time exploiting the natural structure of convolutional operations to avoid the redundant computing and storing of spatial information in the network. The network eliminates the need to pre- or post-process the raw sensor data. This, together with the fully-convolutional nature of the network, makes it an end-to-end method able to process point clouds of huge spaces or even entire rooms with up to 200k points at once. Another advantage is that our network can produce either an ordered output or map predictions directly onto the input cloud, thus making it suitable as a general-purpose point cloud descriptor applicable to many 3D tasks. We demonstrate our network's ability to effectively learn both low-level features as well as complex compositional relationships by evaluating it on benchmark datasets for semantic voxel segmentation, semantic part segmentation and 3D scene captioning.

Posted Content
TL;DR: 3D point-capsule networks are proposed: an auto-encoder designed to process sparse 3D point clouds while preserving spatial arrangements of the input data, enabling new applications such as part interpolation and replacement.
Abstract: In this paper, we propose 3D point-capsule networks, an auto-encoder designed to process sparse 3D point clouds while preserving spatial arrangements of the input data. 3D capsule networks arise as a direct consequence of our novel unified 3D auto-encoder formulation. Their dynamic routing scheme and the peculiar 2D latent space deployed by our approach bring in improvements for several common point cloud-related tasks, such as object classification, object reconstruction and part segmentation as substantiated by our extensive evaluations. Moreover, it enables new applications such as part interpolation and replacement.

Proceedings ArticleDOI
TL;DR: This work proposes a method to reconstruct, complete and semantically label a 3D scene from a single input depth image by using multiple adversarial loss terms that not only enforce realistic outputs with respect to the ground truth, but also an effective embedding of the internal features.
Abstract: We propose a method to reconstruct, complete and semantically label a 3D scene from a single input depth image. We improve the accuracy of the regressed semantic 3D maps by a novel architecture based on adversarial learning. In particular, we suggest using multiple adversarial loss terms that not only enforce realistic outputs with respect to the ground truth, but also an effective embedding of the internal features. This is done by correlating the latent features of the encoder working on partial 2.5D data with the latent features extracted from a variational 3D auto-encoder trained to reconstruct the complete semantic scene. In addition, differently from other approaches that operate entirely through 3D convolutions, at test time we retain the original 2.5D structure of the input during downsampling to improve the effectiveness of the internal representation of our model. We test our approach on the main benchmark datasets for semantic scene completion to qualitatively and quantitatively assess the effectiveness of our proposal.

Book ChapterDOI
01 Jan 2018
TL;DR: This chapter outlines how computer vision can support the surgeon during an intervention via the example of surgical instrument tracking in retinal microsurgery, and shows how to derive algorithms for simultaneous tool tracking and pose estimation based on random forests.
Abstract: In recent years, computer vision has become a remarkable tool for various computer-assisted medical applications, paving the way towards the use of augmented reality and advanced visualization in the medical domain. In this chapter we outline how computer vision can support the surgeon during an intervention via the example of surgical instrument tracking in retinal microsurgery, which incorporates challenges and requirements that are common for the employment of this technique in various medical applications. In particular, we show how to derive algorithms for simultaneous tool tracking and pose estimation based on random forests and how to increase robustness in problems associated with retinal microsurgery images, such as wide variations in illumination and high noise levels. Furthermore, we elaborate on how to evaluate the overall performance of such an algorithm in terms of accuracy and describe the missing steps that are necessary to deploy these techniques in real clinical practice.

Book ChapterDOI
TL;DR: The 4th International Workshop on Recovering 6D Object Pose was organized in conjunction with ECCV 2018 in Munich and was attended by 100+ people working on relevant topics in both academia and industry, who shared up-to-date advances and discussed open problems.
Abstract: This document summarizes the 4th International Workshop on Recovering 6D Object Pose which was organized in conjunction with ECCV 2018 in Munich. The workshop featured four invited talks, oral and poster presentations of accepted workshop papers, and an introduction of the BOP benchmark for 6D object pose estimation. The workshop was attended by 100+ people working on relevant topics in both academia and industry who shared up-to-date advances and discussed open problems.