Showing papers on "3D reconstruction published in 2020"


Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work proposes a differentiable rendering formulation for implicit shape and texture representations, shows that depth gradients can be derived analytically using implicit differentiation, and finds that the method can be used for multi-view 3D reconstruction, directly producing watertight meshes.
Abstract: Learning-based 3D reconstruction methods have shown impressive results. However, most methods require 3D supervision which is often hard to obtain for real-world datasets. Recently, several works have proposed differentiable rendering techniques to train reconstruction models from RGB images. Unfortunately, these approaches are currently restricted to voxel- and mesh-based representations, suffering from discretization or low resolution. In this work, we propose a differentiable rendering formulation for implicit shape and texture representations. Implicit representations have recently gained popularity as they represent shape and texture continuously. Our key insight is that depth gradients can be derived analytically using the concept of implicit differentiation. This allows us to learn implicit shape and texture representations directly from RGB images. We experimentally show that our single-view reconstructions rival those learned with full 3D supervision. Moreover, we find that our method can be used for multi-view 3D reconstruction, directly resulting in watertight meshes.
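To make the key insight concrete, here is a worked sketch of the implicit-differentiation argument in notation chosen for this note (not taken from the paper): $f_\theta$ is the implicit field, $\mathbf{r}_0$ the camera center, $\mathbf{w}$ the ray direction, $\tau$ the level-set threshold, and $\hat d$ the depth of the surface point along the ray.

```latex
% The surface point along the ray is defined implicitly by the level set:
f_\theta\!\left(\mathbf{r}_0 + \hat d\,\mathbf{w}\right) = \tau
% Differentiating both sides with respect to the network parameters \theta:
\frac{\partial f_\theta}{\partial \theta}
  + \left(\nabla_{\mathbf{p}} f_\theta \cdot \mathbf{w}\right)
    \frac{\partial \hat d}{\partial \theta} = 0
\quad\Longrightarrow\quad
\frac{\partial \hat d}{\partial \theta}
  = -\left(\nabla_{\mathbf{p}} f_\theta \cdot \mathbf{w}\right)^{-1}
    \frac{\partial f_\theta}{\partial \theta}
% i.e. the depth gradient is available in closed form, without storing the
% intermediate results of a volumetric renderer.
```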

709 citations


Posted Content
TL;DR: pixelNeRF predicts a continuous neural scene representation conditioned on one or few input images; it can be trained across multiple scenes to learn a scene prior, enabling novel view synthesis in a feed-forward manner from a sparse set of views.
Abstract: We propose pixelNeRF, a learning framework that predicts a continuous neural scene representation conditioned on one or few input images. The existing approach for constructing neural radiance fields involves optimizing the representation to every scene independently, requiring many calibrated views and significant compute time. We take a step towards resolving these shortcomings by introducing an architecture that conditions a NeRF on image inputs in a fully convolutional manner. This allows the network to be trained across multiple scenes to learn a scene prior, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views (as few as one). Leveraging the volume rendering approach of NeRF, our model can be trained directly from images with no explicit 3D supervision. We conduct extensive experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects as well as entire unseen categories. We further demonstrate the flexibility of pixelNeRF by demonstrating it on multi-object ShapeNet scenes and real scenes from the DTU dataset. In all cases, pixelNeRF outperforms current state-of-the-art baselines for novel view synthesis and single image 3D reconstruction. For the video and code, please visit the project website: this https URL
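A minimal sketch of the pixel-aligned conditioning described above, assuming a hypothetical `nerf_mlp` network and a standard pinhole projection; the real pixelNeRF implementation differs in its encoder, coordinate conventions, and multi-view aggregation.

```python
import torch
import torch.nn.functional as F

def conditioned_radiance(query_xyz, view_dir, image_feats, K, cam_pose, nerf_mlp):
    """Sketch: condition a NeRF-style MLP on pixel-aligned image features.

    query_xyz:   (N, 3) world-space sample points along camera rays
    view_dir:    (N, 3) viewing directions
    image_feats: (1, C, H, W) feature map from a 2D CNN encoder
    K, cam_pose: intrinsics (3, 3) and world-to-camera extrinsics (4, 4)
    nerf_mlp:    hypothetical network mapping (xyz, dir, feature) -> (rgb, sigma)
    """
    # Transform query points into the input camera's frame.
    ones = torch.ones(query_xyz.shape[0], 1, device=query_xyz.device)
    cam_xyz = (cam_pose @ torch.cat([query_xyz, ones], dim=1).T).T[:, :3]

    # Project onto the image plane to get pixel coordinates.
    uv = (K @ cam_xyz.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)

    # Normalize to [-1, 1] and sample pixel-aligned features bilinearly.
    H, W = image_feats.shape[-2:]
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    feats = F.grid_sample(image_feats, grid.view(1, -1, 1, 2),
                          align_corners=True)          # (1, C, N, 1)
    feats = feats[0, :, :, 0].T                        # (N, C)

    # The MLP predicts color and density conditioned on the sampled features;
    # standard NeRF volume rendering then integrates them along each ray.
    return nerf_mlp(query_xyz, view_dir, feats)
```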

527 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: The authors propose Implicit Feature Networks (IF-Nets), which deliver continuous outputs, handle multiple topologies, and complete shapes for missing or sparse input data while retaining the desirable properties of recent learned implicit functions.
Abstract: While many works focus on 3D reconstruction from images, in this paper, we focus on 3D shape reconstruction and completion from a variety of 3D inputs, which are deficient in some respect: low and high resolution voxels, sparse and dense point clouds, complete or incomplete. Processing of such 3D inputs is an increasingly important problem as they are the output of 3D scanners, which are becoming more accessible, and are the intermediate output of 3D computer vision algorithms. Recently, learned implicit functions have shown great promise as they produce continuous reconstructions. However, we identified two limitations in reconstruction from 3D inputs: 1) details present in the input data are not retained, and 2) poor reconstruction of articulated humans. To solve this, we propose Implicit Feature Networks (IF-Nets), which deliver continuous outputs, can handle multiple topologies, and complete shapes for missing or sparse input data retaining the nice properties of recent learned implicit functions, but critically they can also retain detail when it is present in the input data, and can reconstruct articulated humans. Our work differs from prior work in two crucial aspects. First, instead of using a single vector to encode a 3D shape, we extract a learnable 3-dimensional multi-scale tensor of deep features, which is aligned with the original Euclidean space embedding the shape. Second, instead of classifying x-y-z point coordinates directly, we classify deep features extracted from the tensor at a continuous query point. We show that this forces our model to make decisions based on global and local shape structure, as opposed to point coordinates, which are arbitrary under Euclidean transformations. Experiments demonstrate that IF-Nets outperform prior work in 3D object reconstruction in ShapeNet, and obtain significantly more accurate 3D human reconstructions. Code and project website is available at https://virtualhumans.mpi-inf.mpg.de/ifnets/.
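A rough sketch of the query step described above, i.e. classifying deep features sampled at a continuous 3D point rather than raw x-y-z coordinates; the encoder, number of scales, and names here are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def query_occupancy(points, feature_grids, decoder):
    """Sketch of an IF-Net style point query.

    points:        (B, N, 3) query coordinates, normalized to [-1, 1]^3
    feature_grids: list of (B, C_k, D_k, H_k, W_k) multi-scale 3D feature
                   tensors produced by a 3D CNN encoder (assumed given)
    decoder:       small MLP mapping concatenated features -> occupancy logit
    """
    # grid_sample expects a (B, N, 1, 1, 3) sampling grid for 5D inputs.
    grid = points.view(points.shape[0], -1, 1, 1, 3)
    sampled = []
    for feats in feature_grids:
        s = F.grid_sample(feats, grid, align_corners=True)    # (B, C_k, N, 1, 1)
        sampled.append(s[..., 0, 0])                          # (B, C_k, N)
    # Concatenate features across scales and classify each point.
    per_point = torch.cat(sampled, dim=1).transpose(1, 2)     # (B, N, sum C_k)
    return decoder(per_point)                                 # (B, N, 1) logits
```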

390 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: This paper proposes ARCH (Animatable Reconstruction of Clothed Humans), a novel end-to-end framework for accurate reconstruction of animation-ready 3D clothed humans from a monocular image and shows numerous qualitative examples of animated, high-quality reconstructed avatars unseen in the literature so far.
Abstract: In this paper, we propose ARCH (Animatable Reconstruction of Clothed Humans), a novel end-to-end framework for accurate reconstruction of animation-ready 3D clothed humans from a monocular image. Existing approaches to digitize 3D humans struggle to handle pose variations and recover details. Also, they do not produce models that are animation ready. In contrast, ARCH is a learned pose-aware model that produces detailed 3D rigged full-body human avatars from a single unconstrained RGB image. A Semantic Space and a Semantic Deformation Field are created using a parametric 3D body estimator. They allow the transformation of 2D/3D clothed humans into a canonical space, reducing ambiguities in geometry caused by pose variations and occlusions in training data. Detailed surface geometry and appearance are learned using an implicit function representation with spatial local features. Furthermore, we propose additional per-pixel supervision on the 3D reconstruction using opacity-aware differentiable rendering. Our experiments indicate that ARCH increases the fidelity of the reconstructed humans. We obtain more than 50% lower reconstruction errors for standard metrics compared to state-of-the-art methods on public datasets. We also show numerous qualitative examples of animated, high-quality reconstructed avatars unseen in the literature so far.

253 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: The proposed ATV consists of only a small number of planes with low memory and computation costs; yet, it efficiently partitions local depth ranges within learned small uncertainty intervals, which enables reconstruction with high completeness and accuracy in a coarse-to-fine fashion.
Abstract: We present Uncertainty-aware Cascaded Stereo Network (UCS-Net) for 3D reconstruction from multiple RGB images. Multi-view stereo (MVS) aims to reconstruct fine-grained scene geometry from multi-view images. Previous learning-based MVS methods estimate per-view depth using plane sweep volumes (PSVs) with a fixed depth hypothesis at each plane; this requires densely sampled planes for high accuracy, which is impractical for high-resolution depth because of limited memory. In contrast, we propose adaptive thin volumes (ATVs); in an ATV, the depth hypothesis of each plane is spatially varying, which adapts to the uncertainties of previous per-pixel depth predictions. Our UCS-Net has three stages: the first stage processes a small PSV to predict low-resolution depth; two ATVs are then used in the following stages to refine the depth with higher resolution and higher accuracy. Our ATV consists of only a small number of planes with low memory and computation costs; yet, it efficiently partitions local depth ranges within learned small uncertainty intervals. We propose to use variance-based uncertainty estimates to adaptively construct ATVs; this differentiable process leads to reasonable and fine-grained spatial partitioning. Our multi-stage framework progressively sub-divides the vast scene space with increasing depth resolution and precision, which enables reconstruction with high completeness and accuracy in a coarse-to-fine fashion. We demonstrate that our method achieves superior performance compared with other learning-based MVS methods on various challenging datasets.
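One possible way to build the spatially varying depth hypotheses of an ATV from the previous stage's probability volume, following the variance-based idea described above; the interval scale `k` and the exact sampling scheme are assumptions for illustration, not the paper's implementation.

```python
import torch

def adaptive_depth_hypotheses(prob_volume, depth_values, num_planes, k=1.5):
    """Sketch: per-pixel depth hypotheses for an adaptive thin volume.

    prob_volume:  (B, D, H, W) per-pixel probabilities over D depth planes
    depth_values: (B, D, H, W) depth value of each plane
    """
    # Expected depth and variance of the per-pixel depth distribution.
    mean = (prob_volume * depth_values).sum(dim=1, keepdim=True)           # (B,1,H,W)
    var = (prob_volume * (depth_values - mean) ** 2).sum(dim=1, keepdim=True)
    std = var.clamp(min=1e-8).sqrt()

    # Uncertainty-aware interval [mean - k*std, mean + k*std], sampled with a
    # small number of spatially varying planes for the next, finer stage.
    low, high = mean - k * std, mean + k * std
    steps = torch.linspace(0, 1, num_planes, device=prob_volume.device)
    steps = steps.view(1, num_planes, 1, 1)
    return low + (high - low) * steps                                       # (B,P,H,W)
```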

181 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: This paper introduces BlendedMVS, a novel large-scale dataset providing sufficient training ground truth for learning-based MVS; training on it endows models with significantly better generalization ability than other MVS datasets.
Abstract: While deep learning has recently achieved great success on multi-view stereo (MVS), limited training data makes the trained model hard to be generalized to unseen scenarios. Compared with other computer vision tasks, it is rather difficult to collect a large-scale MVS dataset as it requires expensive active scanners and labor-intensive process to obtain ground truth 3D structures. In this paper, we introduce BlendedMVS, a novel large-scale dataset, to provide sufficient training ground truth for learning-based MVS. To create the dataset, we apply a 3D reconstruction pipeline to recover high-quality textured meshes from images of well-selected scenes. Then, we render these mesh models to color images and depth maps. To introduce the ambient lighting information during training, the rendered color images are further blended with the input images to generate the training input. Our dataset contains over 17k high-resolution images covering a variety of scenes, including cities, architectures, sculptures and small objects. Extensive experiments demonstrate that BlendedMVS endows the trained model with significantly better generalization ability compared with other MVS datasets. The dataset and pretrained models are available at https://github.com/YoYo000/BlendedMVS.

179 citations


Book ChapterDOI
23 Aug 2020
TL;DR: Deep Local Shapes (DeepLS) replaces the dense volumetric signed distance function (SDF) representation used in traditional surface reconstruction systems with a set of locally learned continuous SDFs defined by a neural network.
Abstract: Efficiently reconstructing complex and intricate surfaces at scale is a long-standing goal in machine perception. To address this problem we introduce Deep Local Shapes (DeepLS), a deep shape representation that enables encoding and reconstruction of high-quality 3D shapes without prohibitive memory requirements. DeepLS replaces the dense volumetric signed distance function (SDF) representation used in traditional surface reconstruction systems with a set of locally learned continuous SDFs defined by a neural network, inspired by recent work such as DeepSDF. Unlike DeepSDF, which represents an object-level SDF with a neural network and a single latent code, we store a grid of independent latent codes, each responsible for storing information about surfaces in a small local neighborhood. This decomposition of scenes into local shapes simplifies the prior distribution that the network must learn, and also enables efficient inference. We demonstrate the effectiveness and generalization power of DeepLS by showing object shape encoding and reconstructions of full scenes, where DeepLS delivers high compression, accuracy, and local shape completion.
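A minimal sketch of the local-latent-grid idea described above, with an assumed cell indexing and coordinate convention; the actual DeepLS system also handles latent code optimization and overlap between neighboring cells, which are omitted here.

```python
import torch

def local_sdf(query_xyz, latent_grid, cell_size, origin, decoder):
    """Sketch: each grid cell stores its own latent code, and a shared decoder
    predicts the SDF from (local coordinates, local code).

    query_xyz:   (N, 3) world-space query points
    latent_grid: (Gx, Gy, Gz, L) grid of independent latent codes
    cell_size:   scalar edge length of a cell; origin: (3,) grid origin
    decoder:     shared MLP mapping (local_xyz, code) -> signed distance
    """
    # Index of the cell containing each query point.
    rel = (query_xyz - origin) / cell_size
    idx = rel.floor().long()
    Gx, Gy, Gz, _ = latent_grid.shape
    idx[:, 0].clamp_(0, Gx - 1)
    idx[:, 1].clamp_(0, Gy - 1)
    idx[:, 2].clamp_(0, Gz - 1)

    # Coordinates of each point relative to its cell, roughly in [0, 1)^3.
    local_xyz = rel - idx.float()

    # Look up the cell's latent code and decode the local SDF value.
    codes = latent_grid[idx[:, 0], idx[:, 1], idx[:, 2]]      # (N, L)
    return decoder(torch.cat([local_xyz, codes], dim=-1))     # (N, 1)
```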

169 citations


Posted Content
TL;DR: This work introduces Deep Local Shapes (DeepLS), a deep shape representation that enables encoding and reconstruction of high-quality 3D shapes without prohibitive memory requirements, and demonstrates the effectiveness and generalization power of this representation.
Abstract: Efficiently reconstructing complex and intricate surfaces at scale is a long-standing goal in machine perception. To address this problem we introduce Deep Local Shapes (DeepLS), a deep shape representation that enables encoding and reconstruction of high-quality 3D shapes without prohibitive memory requirements. DeepLS replaces the dense volumetric signed distance function (SDF) representation used in traditional surface reconstruction systems with a set of locally learned continuous SDFs defined by a neural network, inspired by recent work such as DeepSDF. Unlike DeepSDF, which represents an object-level SDF with a neural network and a single latent code, we store a grid of independent latent codes, each responsible for storing information about surfaces in a small local neighborhood. This decomposition of scenes into local shapes simplifies the prior distribution that the network must learn, and also enables efficient inference. We demonstrate the effectiveness and generalization power of DeepLS by showing object shape encoding and reconstructions of full scenes, where DeepLS delivers high compression, accuracy, and local shape completion.

168 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work addresses the problem of multi-person 3D pose estimation from a single image by incorporating the SMPL parametric body model in a top-down framework and proposing two novel losses that enable more coherent reconstruction in natural images.
Abstract: In this work, we address the problem of multi-person 3D pose estimation from a single image. A typical regression approach in the top-down setting of this problem would first detect all humans and then reconstruct each one of them independently. However, this type of prediction suffers from incoherent results, e.g., interpenetration and inconsistent depth ordering between the people in the scene. Our goal is to train a single network that learns to avoid these problems and generate a coherent 3D reconstruction of all the humans in the scene. To this end, a key design choice is the incorporation of the SMPL parametric body model in our top-down framework, which enables the use of two novel losses. First, a distance field-based collision loss penalizes interpenetration among the reconstructed people. Second, a depth ordering-aware loss reasons about occlusions and promotes a depth ordering of people that leads to a rendering which is consistent with the annotated instance segmentation. This provides depth supervision signals to the network, even if the image has no explicit 3D annotations. The experiments show that our approach outperforms previous methods on standard 3D pose benchmarks, while our proposed losses enable more coherent reconstruction in natural images. The project website with videos, results, and code can be found at: https://jiangwenpl.github.io/multiperson
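A hedged sketch of what a distance field-based interpenetration penalty can look like, assuming precomputed per-person signed distance volumes in normalized coordinates; the paper's exact collision loss (and its depth-ordering loss) is more involved than this illustration.

```python
import torch
import torch.nn.functional as F

def collision_penalty(vertices, sdf_volumes):
    """Sketch: penalize vertices of each person that fall inside another person.

    vertices:    list of (V_j, 3) mesh vertices per person, in [-1, 1]^3
    sdf_volumes: list of (D, H, W) signed distance grids per person
                 (negative inside the body); both are assumed precomputed.
    """
    loss = vertices[0].new_zeros(())
    for i, phi in enumerate(sdf_volumes):
        vol = phi.view(1, 1, *phi.shape)                      # (1,1,D,H,W)
        for j, verts in enumerate(vertices):
            if i == j:
                continue
            # Trilinear sampling keeps the penalty differentiable w.r.t. vertices.
            grid = verts.view(1, -1, 1, 1, 3)                 # (1,V,1,1,3)
            d = F.grid_sample(vol, grid, align_corners=True).view(-1)
            loss = loss + torch.relu(-d).sum()                # penalize interior points
    return loss
```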

133 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: SDFDiff is a differentiable rendering approach for 3D shapes represented by signed distance functions (SDFs), which can represent shapes with arbitrary topology and guarantee watertight surfaces.
Abstract: We propose SDFDiff, a novel approach for image-based shape optimization using differentiable rendering of 3D shapes represented by signed distance functions (SDFs). Compared to other representations, SDFs have the advantage that they can represent shapes with arbitrary topology, and that they guarantee watertight surfaces. We apply our approach to the problem of multi-view 3D reconstruction, where we achieve high reconstruction quality and can capture complex topology of 3D objects. In addition, we employ a multi-resolution strategy to obtain a robust optimization algorithm. We further demonstrate that our SDF-based differentiable renderer can be integrated with deep learning models, which opens up options for learning approaches on 3D objects without 3D supervision. In particular, we apply our method to single-view 3D reconstruction and achieve state-of-the-art results.
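For context, a generic sphere-tracing routine for locating the ray-surface intersection of an SDF; SDFDiff's grid-based differentiable renderer differs in its details, so this is only an illustration of the underlying idea, with names chosen here.

```python
import torch

def sphere_trace(ray_origins, ray_dirs, sdf_fn, num_steps=64, eps=1e-4):
    """Generic sphere tracing: march each ray by the queried signed distance
    until it (approximately) reaches the zero level set.

    sdf_fn: any callable mapping (N, 3) points to (N,) signed distances.
    """
    t = torch.zeros(ray_origins.shape[0], device=ray_origins.device)
    converged = torch.zeros_like(t, dtype=torch.bool)
    for _ in range(num_steps):
        points = ray_origins + t.unsqueeze(-1) * ray_dirs
        d = sdf_fn(points)
        converged = converged | (d.abs() < eps)
        # Stepping by the signed distance is safe: the surface is at least d away.
        t = torch.where(converged, t, t + d)
    return ray_origins + t.unsqueeze(-1) * ray_dirs, converged
```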

112 citations


Journal ArticleDOI
TL;DR: A unified framework is presented for class-specific 3D reconstruction from a single image and generation of new 3D shape samples; it can be trained purely from 2D images, without pose annotations, and with only a single view per instance.
Abstract: We present a unified framework tackling two problems: class-specific 3D reconstruction from a single image, and generation of new 3D shape samples. These tasks have received considerable attention recently; however, most existing approaches rely on 3D supervision, annotation of 2D images with keypoints or poses, and/or training with multiple views of each object instance. Our framework is very general: it can be trained in similar settings to existing approaches, while also supporting weaker supervision. Importantly, it can be trained purely from 2D images, without pose annotations, and with only a single view per instance. We employ meshes as an output representation, instead of voxels used in most prior work. This allows us to reason over lighting parameters and exploit shading information during training, which previous 2D-supervised methods cannot. Thus, our method can learn to generate and reconstruct concave object classes. We evaluate our approach in various settings, showing that: (i) it learns to disentangle shape from pose and lighting; (ii) using shading in the loss improves performance compared to just silhouettes; (iii) when using a standard single white light, our model outperforms state-of-the-art 2D-supervised methods, both with and without pose supervision, thanks to exploiting shading cues; (iv) performance improves further when using multiple coloured lights, even approaching that of state-of-the-art 3D-supervised methods; (v) shapes produced by our model capture smooth surfaces and fine details better than voxel-based approaches; and (vi) our approach supports concave classes such as bathtubs and sofas, which methods based on silhouettes cannot learn.

Posted Content
TL;DR: An end-to-end method is presented that reconstructs a scene by directly regressing a truncated signed distance function (TSDF) from a set of posed RGB images; semantic segmentation of the 3D model is obtained without significant computation.
Abstract: We present an end-to-end 3D reconstruction method for a scene by directly regressing a truncated signed distance function (TSDF) from a set of posed RGB images. Traditional approaches to 3D reconstruction rely on an intermediate representation of depth maps prior to estimating a full 3D model of a scene. We hypothesize that a direct regression to 3D is more effective. A 2D CNN extracts features from each image independently which are then back-projected and accumulated into a voxel volume using the camera intrinsics and extrinsics. After accumulation, a 3D CNN refines the accumulated features and predicts the TSDF values. Additionally, semantic segmentation of the 3D model is obtained without significant computation. This approach is evaluated on the Scannet dataset where we significantly outperform state-of-the-art baselines (deep multiview stereo followed by traditional TSDF fusion) both quantitatively and qualitatively. We compare our 3D semantic segmentation to prior methods that use a depth sensor since no previous work attempts the problem with only RGB input.
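A simplified sketch of the back-projection step described above, with illustrative variable names: each voxel center is projected into an image using the intrinsics and extrinsics, and the 2D feature at that pixel is gathered; averaging these per-view features over all frames gives the volume that the 3D CNN refines into TSDF values.

```python
import torch
import torch.nn.functional as F

def backproject_features(feat2d, K, cam_from_world, voxel_xyz):
    """Sketch: accumulate 2D CNN features of one image into a voxel volume.

    feat2d:         (C, H, W) features of one image
    K:              (3, 3) camera intrinsics
    cam_from_world: (4, 4) extrinsics (world -> camera)
    voxel_xyz:      (Nv, 3) world coordinates of voxel centers
    Returns (Nv, C) features and an (Nv,) validity mask (voxel in front of the
    camera and inside the image bounds).
    """
    ones = torch.ones(voxel_xyz.shape[0], 1, device=voxel_xyz.device)
    cam = (cam_from_world @ torch.cat([voxel_xyz, ones], dim=1).T).T[:, :3]
    valid = cam[:, 2] > 1e-3                      # voxel must lie in front of the camera

    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-3)
    H, W = feat2d.shape[-2:]
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    valid &= (grid.abs() <= 1).all(dim=-1)

    feats = F.grid_sample(feat2d.unsqueeze(0), grid.view(1, -1, 1, 2),
                          align_corners=True)[0, :, :, 0].T   # (Nv, C)
    return feats * valid.float().unsqueeze(-1), valid
```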

Journal ArticleDOI
TL;DR: Deep Prior Diffraction Tomography (DP-DT) reconstructs the 3D refractive index (RI) of thick biological samples at high resolution from a sequence of low-resolution images collected under angularly varying illumination.
Abstract: We present a tomographic imaging technique, termed Deep Prior Diffraction Tomography (DP-DT), to reconstruct the 3D refractive index (RI) of thick biological samples at high resolution from a sequence of low-resolution images collected under angularly varying illumination. DP-DT processes the multi-angle data using a phase retrieval algorithm that is extended by a deep image prior (DIP), which reparameterizes the 3D sample reconstruction with an untrained, deep generative 3D convolutional neural network (CNN). We show that DP-DT effectively addresses the missing cone problem, which otherwise degrades the resolution and quality of standard 3D reconstruction algorithms. As DP-DT does not require pre-captured data or pre-training, it is not biased towards any particular dataset. Hence, it is a general technique that can be applied to a wide variety of 3D samples, including scenarios in which large datasets for supervised training would be infeasible or expensive. We applied DP-DT to obtain 3D RI maps of bead phantoms and complex biological specimens, both in simulation and experiment, and show that DP-DT produces higher-quality results than standard regularization techniques. We further demonstrate the generality of DP-DT, using two different scattering models, the first Born and multi-slice models. Our results point to the potential benefits of DP-DT for other 3D imaging modalities, including X-ray computed tomography, magnetic resonance imaging, and electron microscopy.

Journal ArticleDOI
TL;DR: In this article, a weakly supervised learning-based approach is proposed for 3D shape completion from sparse and noisy point clouds, which neither requires slow optimization nor direct supervision, but is able to compete with the fully supervised baseline of Dai et al. (in: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), 2017).
Abstract: We address the problem of 3D shape completion from sparse and noisy point clouds, a fundamental problem in computer vision and robotics. Recent approaches are either data-driven or learning-based: Data-driven approaches rely on a shape model whose parameters are optimized to fit the observations; Learning-based approaches, in contrast, avoid the expensive optimization step by learning to directly predict complete shapes from incomplete observations in a fully-supervised setting. However, full supervision is often not available in practice. In this work, we propose a weakly-supervised learning-based approach to 3D shape completion which neither requires slow optimization nor direct supervision. While we also learn a shape prior on synthetic data, we amortize, i.e., learn, maximum likelihood fitting using deep neural networks resulting in efficient shape completion without sacrificing accuracy. On synthetic benchmarks based on ShapeNet (Chang et al. Shapenet: an information-rich 3d model repository, 2015. arXiv:1512.03012 ) and ModelNet (Wu et al., in: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), 2015) as well as on real robotics data from KITTI (Geiger et al., in: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), 2012) and Kinect (Yang et al., 3d object dense reconstruction from a single depth view, 2018. arXiv:1802.00411 ), we demonstrate that the proposed amortized maximum likelihood approach is able to compete with the fully supervised baseline of Dai et al. (in: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), 2017) and outperforms the data-driven approach of Engelmann et al. (in: Proceedings of the German conference on pattern recognition (GCPR), 2016), while requiring less supervision and being significantly faster.

Journal ArticleDOI
TL;DR: The first real-time method for motion capture of the skeletal pose and 3D surface geometry of two closely interacting hands from a single RGB camera, combining a multi-task CNN with generative model fitting.
Abstract: Tracking and reconstructing the 3D pose and geometry of two hands in interaction is a challenging problem that has a high relevance for several human-computer interaction applications, including AR/VR, robotics, or sign language recognition. Existing works are either limited to simpler tracking settings (e.g., considering only a single hand or two spatially separated hands), or rely on less ubiquitous sensors, such as depth cameras. In contrast, in this work we present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera that explicitly considers close interactions. In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN that regresses multiple complementary pieces of information, including segmentation, dense matchings to a 3D hand model, and 2D keypoint positions, together with newly proposed intra-hand relative depth and inter-hand distance maps. These predictions are subsequently used in a generative model fitting framework in order to estimate pose and shape parameters of a 3D hand model for both hands. We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline through an extensive ablation study. Moreover, we demonstrate that our approach offers previously unseen two-hand tracking performance from RGB, and quantitatively and qualitatively outperforms existing RGB-based methods that were not explicitly designed for two-hand interactions. Moreover, our method even performs on-par with depth-based real-time methods.

Journal ArticleDOI
TL;DR: The proposed adaptive stereo matching strategy was designed to make the multi-vision system adaptable to field perception, so it can be easily transferred to similar applications such as 3D reconstruction of agricultural targets, 3D positioning of fruit clusters, and 3D obstacle avoidance for robotic arms.

Journal ArticleDOI
06 May 2020-Sensors
TL;DR: An overview of the computer vision based indoor localization domain is offered, presenting application areas, commercial tools, existing benchmarks, and other reviews, and proposing a new classification based on the configuration stage (use of known environment data), sensing devices, type of detected elements, and localization method.
Abstract: Computer vision based indoor localization methods use either an infrastructure of static cameras to track mobile entities (e.g., people, robots) or cameras attached to the mobile entities. Methods in the first category employ object tracking, while the others map images from mobile cameras with images acquired during a configuration stage or extracted from 3D reconstructed models of the space. This paper offers an overview of the computer vision based indoor localization domain, presenting application areas, commercial tools, existing benchmarks, and other reviews. It provides a survey of indoor localization research solutions, proposing a new classification based on the configuration stage (use of known environment data), sensing devices, type of detected elements, and localization method. It groups 70 of the most recent and relevant image based indoor localization methods according to the proposed classification and discusses their advantages and drawbacks. It highlights localization methods that also offer orientation information, as this is required by an increasing number of applications of indoor localization (e.g., augmented reality).

Proceedings ArticleDOI
14 Jun 2020
TL;DR: A new neural network operating on RGB-D frames is proposed that significantly outperforms existing non-rigid reconstruction methods that do not use learned data terms, as well as learning-based approaches that only use self-supervision.
Abstract: Applying data-driven approaches to non-rigid 3D reconstruction has been difficult, which we believe can be attributed to the lack of a large-scale training corpus. A recent approach proposes self-supervision based on non-rigid reconstruction; unfortunately, it fails for important cases such as highly non-rigid deformations. We first address this problem of lack of data by introducing a novel semi-supervised strategy to obtain dense inter-frame correspondences from a sparse set of annotations. This way, we obtain a large dataset of 400 scenes, over 390,000 RGB-D frames, and 5,533 densely aligned frame pairs; in addition, we provide a test set along with several metrics for evaluation. Based on this corpus, we introduce a data-driven non-rigid feature matching approach, which we integrate into an optimization-based reconstruction pipeline. Here, we propose a new neural network that operates on RGB-D frames, while maintaining robustness under large non-rigid deformations and producing accurate predictions. Our approach significantly outperforms existing non-rigid reconstruction methods that do not use learned data terms, as well as learning-based approaches that only use self-supervision.

Posted Content
TL;DR: This work is the first to try and solve the single-view reconstruction problem without a category-specific template mesh or semantic keypoints, and demonstrates that the unsupervised method performs comparably to, if not better than, existing category-specific reconstruction methods learned with supervision.
Abstract: We learn a self-supervised, single-view 3D reconstruction model that predicts the 3D mesh shape, texture and camera pose of a target object with a collection of 2D images and silhouettes. The proposed method does not necessitate 3D supervision, manually annotated keypoints, multi-view images of an object or a prior 3D template. The key insight of our work is that objects can be represented as a collection of deformable parts, and each part is semantically coherent across different instances of the same category (e.g., wings on birds and wheels on cars). Therefore, by leveraging self-supervisedly learned part segmentation of a large collection of category-specific images, we can effectively enforce semantic consistency between the reconstructed meshes and the original images. This significantly reduces ambiguities during joint prediction of shape and camera pose of an object, along with texture. To the best of our knowledge, we are the first to try and solve the single-view reconstruction problem without a category-specific template mesh or semantic keypoints. Thus our model can easily generalize to various object categories without such labels, e.g., horses, penguins, etc. Through a variety of experiments on several categories of deformable and rigid objects, we demonstrate that our unsupervised method performs comparably if not better than existing category-specific reconstruction methods learned with supervision.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: FroDO is a method for accurate 3D reconstruction of object instances from RGB video that infers their location, pose and shape in a coarse-to-fine manner; key to FroDO is embedding object shapes in a novel learnt shape space that allows seamless switching between sparse point cloud and dense DeepSDF decoding.
Abstract: Object-oriented maps are important for scene understanding since they jointly capture geometry and semantics, allow individual instantiation and meaningful reasoning about objects. We introduce FroDO, a method for accurate 3D reconstruction of object instances from RGB video that infers their location, pose and shape in a coarse to fine manner. Key to FroDO is to embed object shapes in a novel learnt shape space that allows seamless switching between sparse point cloud and dense DeepSDF decoding. Given an input sequence of localized RGB frames, FroDO first aggregates 2D detections to instantiate a 3D bounding box per object. A shape code is regressed using an encoder network before optimizing shape and pose further under the learnt shape priors using sparse or dense shape representations. The optimization uses multi-view geometric, photometric and silhouette losses. We evaluate on real-world datasets, including Pix3D, Redwood-OS, and ScanNet, for single-view, multi-view, and multi-object reconstruction.

Journal ArticleDOI
TL;DR: This paper proposes a general end-to-end deep learning-based 3D reconstruction framework that allows, in principle, an arbitrary number of constraints to be incorporated to further improve the reconstruction accuracy.

Journal ArticleDOI
TL;DR: A deep sensor fusion framework is proposed for high-precision depth estimation from LiDAR point clouds and stereo images, together with a simple but effective approach to generate pseudo ground truth labels from the raw KITTI dataset.
Abstract: We address the problem of 3D reconstruction from uncalibrated LiDAR point cloud and stereo images. Since the usage of each sensor alone for 3D reconstruction has weaknesses in terms of density and accuracy, we propose a deep sensor fusion framework for high-precision depth estimation. The proposed architecture consists of calibration network and depth fusion network, where both networks are designed considering the trade-off between accuracy and efficiency for mobile devices. The calibration network first corrects an initial extrinsic parameter to align the input sensor coordinate systems. The accuracy of calibration is markedly improved by formulating the calibration in the depth domain. In the depth fusion network, complementary characteristics of sparse LiDAR and dense stereo depth are then encoded in a boosting manner. Since training data for the LiDAR and stereo depth fusion are rather limited, we introduce a simple but effective approach to generate pseudo ground truth labels from the raw KITTI dataset. The experimental evaluation verifies that the proposed method outperforms current state-of-the-art methods on the KITTI benchmark. We also collect data using our proprietary multi-sensor acquisition platform and verify that the proposed method generalizes across different sensor settings and scenes.

Book ChapterDOI
23 Aug 2020
TL;DR: A scene-level TSDF is regressed directly from posed RGB images: a 2D CNN extracts features from each image independently, which are then back-projected and accumulated into a voxel volume using the camera intrinsics and extrinsics.
Abstract: We present an end-to-end 3D reconstruction method for a scene by directly regressing a truncated signed distance function (TSDF) from a set of posed RGB images. Traditional approaches to 3D reconstruction rely on an intermediate representation of depth maps prior to estimating a full 3D model of a scene. We hypothesize that a direct regression to 3D is more effective. A 2D CNN extracts features from each image independently which are then back-projected and accumulated into a voxel volume using the camera intrinsics and extrinsics. After accumulation, a 3D CNN refines the accumulated features and predicts the TSDF values. Additionally, semantic segmentation of the 3D model is obtained without significant computation. This approach is evaluated on the Scannet dataset where we significantly outperform state-of-the-art baselines (deep multiview stereo followed by traditional TSDF fusion) both quantitatively and qualitatively. We compare our 3D semantic segmentation to prior methods that use a depth sensor since no previous work attempts the problem with only RGB input.

Posted Content
TL;DR: This work proposes Implicit Feature Networks (IF-Nets), which deliver continuous outputs, handle multiple topologies, and complete shapes for missing or sparse input data while retaining the nice properties of recent learned implicit functions; critically, they also retain detail when it is present in the input data and can reconstruct articulated humans.
Abstract: While many works focus on 3D reconstruction from images, in this paper, we focus on 3D shape reconstruction and completion from a variety of 3D inputs, which are deficient in some respect: low and high resolution voxels, sparse and dense point clouds, complete or incomplete. Processing of such 3D inputs is an increasingly important problem as they are the output of 3D scanners, which are becoming more accessible, and are the intermediate output of 3D computer vision algorithms. Recently, learned implicit functions have shown great promise as they produce continuous reconstructions. However, we identified two limitations in reconstruction from 3D inputs: 1) details present in the input data are not retained, and 2) poor reconstruction of articulated humans. To solve this, we propose Implicit Feature Networks (IF-Nets), which deliver continuous outputs, can handle multiple topologies, and complete shapes for missing or sparse input data retaining the nice properties of recent learned implicit functions, but critically they can also retain detail when it is present in the input data, and can reconstruct articulated humans. Our work differs from prior work in two crucial aspects. First, instead of using a single vector to encode a 3D shape, we extract a learnable 3-dimensional multi-scale tensor of deep features, which is aligned with the original Euclidean space embedding the shape. Second, instead of classifying x-y-z point coordinates directly, we classify deep features extracted from the tensor at a continuous query point. We show that this forces our model to make decisions based on global and local shape structure, as opposed to point coordinates, which are arbitrary under Euclidean transformations. Experiments demonstrate that IF-Nets clearly outperform prior work in 3D object reconstruction in ShapeNet, and obtain significantly more accurate 3D human reconstructions.

Proceedings Article
01 Jan 2020
TL;DR: Deformable Tetrahedral Meshes (DefTet) is a parameterization that utilizes volumetric tetrahedral meshes for the reconstruction problem; it can represent arbitrary, complex topology, is both memory- and computation-efficient, and can produce high-fidelity reconstructions with a significantly smaller grid size.
Abstract: 3D shape representations that accommodate learning-based 3D reconstruction are an open problem in machine learning and computer graphics. Previous work on neural 3D reconstruction demonstrated benefits, but also limitations, of point cloud, voxel, surface mesh, and implicit function representations. We introduce Deformable Tetrahedral Meshes (DefTet) as a particular parameterization that utilizes volumetric tetrahedral meshes for the reconstruction problem. Unlike existing volumetric approaches, DefTet optimizes for both vertex placement and occupancy, and is differentiable with respect to standard 3D reconstruction loss functions. It is thus simultaneously high-precision, volumetric, and amenable to learning-based neural architectures. We show that it can represent arbitrary, complex topology, is both memory and computationally efficient, and can produce high-fidelity reconstructions with a significantly smaller grid size than alternative volumetric approaches. The predicted surfaces are also inherently defined as tetrahedral meshes, thus do not require post-processing. We demonstrate that DefTet matches or exceeds both the quality of the previous best approaches and the performance of the fastest ones. Our approach obtains high-quality tetrahedral meshes computed directly from noisy point clouds, and is the first to showcase high-quality 3D tet-mesh results using only a single image as input. Our project webpage: this https URL

Proceedings Article
20 Oct 2020
TL;DR: This paper derives a novel differentiable rendering formulation for learning signed distance functions (SDF) from 2D silhouettes and proposes SDF-SRN, an approach that outperforms the state of the art under challenging single-view supervision settings on both synthetic and real-world datasets.
Abstract: Dense 3D object reconstruction from a single image has recently witnessed remarkable advances, but supervising neural networks with ground-truth 3D shapes is impractical due to the laborious process of creating paired image-shape datasets. Recent efforts have turned to learning 3D reconstruction without 3D supervision from RGB images with annotated 2D silhouettes, dramatically reducing the cost and effort of annotation. These techniques, however, remain impractical as they still require multi-view annotations of the same object instance during training. As a result, most experimental efforts to date have been limited to synthetic datasets. In this paper, we address this issue and propose SDF-SRN, an approach that requires only a single view of objects at training time, offering greater utility for real-world scenarios. SDF-SRN learns implicit 3D shape representations to handle arbitrary shape topologies that may exist in the datasets. To this end, we derive a novel differentiable rendering formulation for learning signed distance functions (SDF) from 2D silhouettes. Our method outperforms the state of the art under challenging single-view supervision settings on both synthetic and real-world datasets.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work proposes a neural network that predicts non-linear updates to better account for typical fusion errors and outperforms the traditional fusion approach and related learned approaches on both synthetic and real data.
Abstract: The efficient fusion of depth maps is a key part of most state-of-the-art 3D reconstruction methods. Besides requiring high accuracy, these depth fusion methods need to be scalable and real-time capable. To this end, we present a novel real-time capable machine learning-based method for depth map fusion. Similar to the seminal depth map fusion approach by Curless and Levoy, we only update a local group of voxels to ensure real-time capability. Instead of a simple linear fusion of depth information, we propose a neural network that predicts non-linear updates to better account for typical fusion errors. Our network is composed of a 2D depth routing network and a 3D depth fusion network which efficiently handle sensor-specific noise and outliers. This is especially useful for surface edges and thin objects for which the original approach suffers from thickening artifacts. Our method outperforms the traditional fusion approach and related learned approaches on both synthetic and real data. We demonstrate the performance of our method in reconstructing fine geometric details from noise and outlier contaminated data on various scenes.
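For reference, the classic Curless and Levoy style per-voxel update that such learned methods replace is a simple weighted running average (simplified sketch below, operating on per-voxel scalars); the contribution described above is to substitute a learned, non-linear update for this rule.

```python
def linear_tsdf_update(tsdf, weight, new_tsdf, new_weight, max_weight=128.0):
    """Simplified running weighted average used in classic volumetric fusion.

    tsdf, weight:         current per-voxel TSDF value and accumulated weight
    new_tsdf, new_weight: TSDF value and weight from the incoming depth map
    """
    fused = (tsdf * weight + new_tsdf * new_weight) / (weight + new_weight)
    fused_weight = min(weight + new_weight, max_weight)  # cap the confidence
    return fused, fused_weight
```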

Book ChapterDOI
08 Oct 2020
TL;DR: An end-to-end Convolutional Neural Network approach is presented for 3D reconstruction of knee bones directly from two bi-planar X-ray images; the results indicate that the deep learning model is very efficient, generalizes well, and produces high-quality reconstructions.
Abstract: We present an end-to-end Convolutional Neural Network (CNN) approach for 3D reconstruction of knee bones directly from two bi-planar X-ray images. Clinically, capturing the 3D models of the bones is crucial for surgical planning, implant fitting, and postoperative evaluation. X-ray imaging significantly reduces the exposure of patients to ionizing radiation compared to Computer Tomography (CT) imaging, and is much more common and inexpensive compared to Magnetic Resonance Imaging (MRI) scanners. However, retrieving 3D models from such 2D scans is extremely challenging. In contrast to the common approach of statistically modeling the shape of each bone, our deep network learns the distribution of the bones’ shapes directly from the training images. We train our model with both supervised and unsupervised losses using Digitally Reconstructed Radiograph (DRR) images generated from CT scans. To apply our model to X-Ray data, we use style transfer to transform between X-Ray and DRR modalities. As a result, at test time, without further optimization, our solution directly outputs a 3D reconstruction from a pair of bi-planar X-ray images, while preserving geometric constraints. Our results indicate that our deep learning model is very efficient, generalizes well and produces high quality reconstructions.

Book ChapterDOI
01 Jan 2020
TL;DR: This chapter describes the algorithms for calibrating the image acquisition system that are fundamental for recovering metric information about the scene from the images, such as an object's size or accurate measurements of the object–observer distance (the pose).
Abstract: This chapter describes the algorithms for calibrating the image acquisition system (normally a single camera or a stereo vision system) that are fundamental for recovering metric information about the scene from the images (detecting an object's size or determining accurate measurements of object–observer distance, i.e., the pose). The various camera calibration methods are presented that determine the intrinsic parameters and the extrinsic parameters defining the geometric transformation from the world reference system to that of the camera. The epipolar geometry introduced in Chap. 5 is exploited to solve the problem of correspondence of homologous points in a stereo vision system, whether or not the cameras are calibrated. Epipolar geometry simplifies the search for homologous points between the stereo images (the correspondence problem) by introducing the Essential matrix (calibrated approach) and the Fundamental matrix (weakly calibrated approach). The algorithms for estimating these matrices are also described, assuming the corresponding points of a calibration platform are known a priori. This is also accomplished with the image alignment procedure known as stereo image rectification. Finally, the triangulation procedures for the unambiguous 3D reconstruction of the scene geometry are described, given the 2D projections of homologous points in the stereo images and the calibration parameters of the stereo system. If only the intrinsic parameters are known, the 3D geometry of the scene is reconstructed by estimating the extrinsic parameters of the system up to an undeterminable scale factor. If the calibration parameters of the stereo system are not available and only the correspondences between the stereo images are known, the structure of the scene is recovered up to an unknown homography transformation.
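A compact illustration of the weakly calibrated pipeline summarized above, using standard OpenCV routines; the correspondences `pts1`, `pts2` and the shared intrinsic matrix `K` are assumed given, and this is a sketch of the general recipe rather than the chapter's own code.

```python
import cv2
import numpy as np

def reconstruct_from_correspondences(pts1, pts2, K):
    """Estimate F from correspondences, derive E with known intrinsics,
    recover the relative pose, and triangulate 3D points (up to scale).

    pts1, pts2: (N, 2) float arrays of corresponding pixel coordinates
    K:          (3, 3) intrinsic matrix, assumed shared by both cameras
    """
    # Fundamental matrix from point correspondences (weakly calibrated step).
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)

    # Essential matrix from F and the intrinsics, then relative pose (R, t);
    # t has unit norm, so the reconstruction is defined only up to scale.
    E = K.T @ F @ K
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

    # Projection matrices of the two views and linear triangulation.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    X_h = cv2.triangulatePoints(P1, P2,
                                pts1.T.astype(np.float64),
                                pts2.T.astype(np.float64))
    return (X_h[:3] / X_h[3]).T            # (N, 3) points, up to scale
```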

Book ChapterDOI
23 Aug 2020
TL;DR: In this article, a self-supervised, single-view 3D reconstruction model that predicts the 3D mesh shape, texture and camera pose of a target object with a collection of 2D images and silhouettes is proposed.
Abstract: We learn a self-supervised, single-view 3D reconstruction model that predicts the 3D mesh shape, texture and camera pose of a target object with a collection of 2D images and silhouettes. The proposed method does not necessitate 3D supervision, manually annotated keypoints, multi-view images of an object or a prior 3D template. The key insight of our work is that objects can be represented as a collection of deformable parts, and each part is semantically coherent across different instances of the same category (e.g., wings on birds and wheels on cars). Therefore, by leveraging part segmentation of a large collection of category-specific images learned via self-supervision, we can effectively enforce semantic consistency between the reconstructed meshes and the original images. This significantly reduces ambiguities during joint prediction of shape and camera pose of an object, along with texture. To the best of our knowledge, we are the first to try and solve the single-view reconstruction problem without a category-specific template mesh or semantic keypoints. Thus our model can easily generalize to various object categories without such labels, e.g., horses, penguins, etc. Through a variety of experiments on several categories of deformable and rigid objects, we demonstrate that our unsupervised method performs comparably if not better than existing category-specific reconstruction methods learned with supervision. More details can be found at the project page https://sites.google.com/nvidia.com/unsup-mesh-2020.