Top 158 papers published in the topic of View synthesis in 2020

Posted Content•

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Ben Mildenhall¹, Pratul P. Srinivasan¹, Matthew Tancik¹, Jonathan T. Barron², Ravi Ramamoorthi³, Ren Ng¹•Institutions (3)

University of California, Berkeley¹, Google², University of California, San Diego³

19 Mar 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work describes how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrates results that outperform prior work on neural rendering and view synthesis.

Abstract: We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location $(x,y,z)$ and viewing direction $(\theta, \phi)$) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.
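
The compositing step that the abstract calls "classic volume rendering" can be sketched for a single ray as follows. This is a minimal illustration, not the authors' implementation: it assumes the network has already been queried at sample points along the ray, and the names `composite_ray`, `densities`, `colors`, and `deltas` are hypothetical. Each sample contributes its color weighted by its opacity and by the transmittance accumulated in front of it.

```python
import math

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, front to back.

    densities: volume density at each sample (non-negative floats)
    colors:    RGB triple at each sample
    deltas:    distance between adjacent samples
    Returns the composited RGB pixel and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(densities, colors, deltas):
        # Opacity of this segment from its density and length.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(3):
            pixel[k] += weight * rgb[k]
        # Light surviving past this segment.
        transmittance *= 1.0 - alpha
    return pixel, weights

# Illustrative values only: empty space, then a nearly opaque green
# surface that hides the blue sample behind it.
pixel, weights = composite_ray(
    densities=[0.0, 1e9, 5.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    deltas=[0.1, 0.1, 0.1],
)
print(pixel, weights)
```

Every operation here is differentiable with respect to the densities and colors, which is why a photometric loss on the composited pixel can be used to optimize the network that produced them.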

2,435 citations

Book Chapter•DOI•

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Ben Mildenhall¹, Pratul P. Srinivasan¹, Matthew Tancik¹, Jonathan T. Barron², Ravi Ramamoorthi³, Ren Ng¹•Institutions (3)

University of California, Berkeley¹, Google², University of California, San Diego³

23 Aug 2020

TL;DR: In this article, a fully-connected (non-convolutional) deep network is used to synthesize novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views.

Abstract: We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction $(\theta ,\phi )$) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.

951 citations

Posted Content•

pixelNeRF: Neural Radiance Fields from One or Few Images

Alex Yu¹, Vickie Ye¹, Matthew Tancik¹, Angjoo Kanazawa¹•Institutions (1)

University of California, Berkeley¹

03 Dec 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: pixelNeRF predicts a continuous neural scene representation conditioned on one or few input images; it can be trained across multiple scenes to learn a scene prior, enabling novel view synthesis in a feed-forward manner from a sparse set of views (as few as one).

Abstract: We propose pixelNeRF, a learning framework that predicts a continuous neural scene representation conditioned on one or few input images. The existing approach for constructing neural radiance fields involves optimizing the representation to every scene independently, requiring many calibrated views and significant compute time. We take a step towards resolving these shortcomings by introducing an architecture that conditions a NeRF on image inputs in a fully convolutional manner. This allows the network to be trained across multiple scenes to learn a scene prior, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views (as few as one). Leveraging the volume rendering approach of NeRF, our model can be trained directly from images with no explicit 3D supervision. We conduct extensive experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects as well as entire unseen categories. We further demonstrate the flexibility of pixelNeRF by demonstrating it on multi-object ShapeNet scenes and real scenes from the DTU dataset. In all cases, pixelNeRF outperforms current state-of-the-art baselines for novel view synthesis and single image 3D reconstruction. For the video and code, please visit the project website: this https URL

527 citations

Posted Content•

NeRF++: Analyzing and Improving Neural Radiance Fields.

Kai Zhang, Gernot Riegler, Noah Snavely, Vladlen Koltun

15 Oct 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: A parametrization issue involved in applying NeRF to 360 captures of objects within large-scale, unbounded 3D scenes is addressed, and the method improves view synthesis fidelity in this challenging scenario.

Abstract: Neural Radiance Fields (NeRF) achieve impressive view synthesis results for a variety of capture settings, including 360 capture of bounded scenes and forward-facing capture of bounded and unbounded scenes. NeRF fits multi-layer perceptrons (MLPs) representing view-invariant opacity and view-dependent color volumes to a set of training images, and samples novel views based on volume rendering techniques. In this technical report, we first remark on radiance fields and their potential ambiguities, namely the shape-radiance ambiguity, and analyze NeRF's success in avoiding such ambiguities. Second, we address a parametrization issue involved in applying NeRF to 360 captures of objects within large-scale, unbounded 3D scenes. Our method improves view synthesis fidelity in this challenging scenario. Code is available at this https URL.

413 citations

Posted Content•

GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis

Katja Schwarz¹, Yiyi Liao¹, Michael Niemeyer¹, Andreas Geiger¹•Institutions (1)

Max Planck Society¹

05 Jul 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes a generative model for radiance fields which have recently proven successful for novel view synthesis of a single scene, and introduces a multi-scale patch-based discriminator to demonstrate synthesis of high-resolution images while training the model from unposed 2D images alone.

Abstract: While 2D generative adversarial networks have enabled high-resolution image synthesis, they largely lack an understanding of the 3D world and the image formation process. Thus, they do not provide precise control over camera viewpoint or object pose. To address this problem, several recent approaches leverage intermediate voxel-based representations in combination with differentiable rendering. However, existing methods either produce low image resolution or fall short in disentangling camera and scene properties, e.g., the object identity may vary with the viewpoint. In this paper, we propose a generative model for radiance fields which have recently proven successful for novel view synthesis of a single scene. In contrast to voxel-based representations, radiance fields are not confined to a coarse discretization of the 3D space, yet allow for disentangling camera and scene properties while degrading gracefully in the presence of reconstruction ambiguity. By introducing a multi-scale patch-based discriminator, we demonstrate synthesis of high-resolution images while training our model from unposed 2D images alone. We systematically analyze our approach on several challenging synthetic and real-world datasets. Our experiments reveal that radiance fields are a powerful representation for generative image synthesis, leading to 3D consistent models that render with high fidelity.

358 citations

Proceedings Article•DOI•

SynSin: End-to-End View Synthesis From a Single Image

Olivia Wiles¹, Georgia Gkioxari², Richard Szeliski², Justin Johnson³•Institutions (3)

University of Oxford¹, Facebook², University of Michigan³

14 Jun 2020

TL;DR: This work proposes a novel differentiable point cloud renderer that is used to transform a latent 3D point cloud of features into the target view and outperforms baselines and prior work on the Matterport, Replica, and RealEstate10K datasets.

Abstract: View synthesis allows for the generation of new views of a scene given one or more images. This is challenging; it requires comprehensively understanding the 3D scene from images. As a result, current methods typically use multiple images, train on ground-truth depth, or are limited to synthetic data. We propose a novel end-to-end model for this task using a single image at test time; it is trained on real images without any ground-truth 3D information. To this end, we introduce a novel differentiable point cloud renderer that is used to transform a latent 3D point cloud of features into the target view. The projected features are decoded by our refinement network to inpaint missing regions and generate a realistic output image. The 3D component inside of our generative model allows for interpretable manipulation of the latent feature space at test time, e.g. we can animate trajectories from a single image. Additionally, we can generate high resolution images and generalise to other input resolutions. We outperform baselines and prior work on the Matterport, Replica, and RealEstate10K datasets.

298 citations

Posted Content•

Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes

Zhengqi Li¹, Simon Niklaus², Noah Snavely¹, Oliver Wang²•Institutions (2)

Cornell University¹, Adobe Systems²

26 Nov 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: A method to perform novel view and time synthesis of dynamic scenes, requiring only a monocular video with known camera poses as input, is presented, and a new representation that models the dynamic scene as a time-variant continuous function of appearance, geometry, and 3D scene motion is introduced.

Abstract: We present a method to perform novel view and time synthesis of dynamic scenes, requiring only a monocular video with known camera poses as input. To do this, we introduce Neural Scene Flow Fields, a new representation that models the dynamic scene as a time-variant continuous function of appearance, geometry, and 3D scene motion. Our representation is optimized through a neural network to fit the observed input views. We show that our representation can be used for complex dynamic scenes, including thin structures, view-dependent effects, and natural degrees of motion. We conduct a number of experiments that demonstrate our approach significantly outperforms recent monocular view synthesis methods, and show qualitative results of space-time view synthesis on a variety of real-world videos.

271 citations

Proceedings Article•DOI•

Single-View View Synthesis With Multiplane Images

Richard Tucker¹, Noah Snavely¹•Institutions (1)

Google¹

14 Jun 2020

TL;DR: This work learns to predict a multiplane image directly from a single input image, and introduces scale-invariant view synthesis for supervision, which enables training on online video.

Abstract: A recent strand of work in view synthesis uses deep learning to generate multiplane images—a camera-centric, layered 3D representation—given two or more input images at known viewpoints. We apply this representation to single-view view synthesis, a problem which is more challenging but has potentially much wider application. Our method learns to predict a multiplane image directly from a single image input, and we introduce scale-invariant view synthesis for supervision, enabling us to train on online video. We show this approach is applicable to several different datasets, that it additionally generates reasonable depth maps, and that it learns to fill in content behind the edges of foreground objects in background layers.

234 citations

Posted Content•

NeRV: Neural Reflectance and Visibility Fields for Relighting and View Synthesis

Pratul P. Srinivasan¹, Boyang Deng¹, Xiuming Zhang², Matthew Tancik³, Ben Mildenhall³, Jonathan T. Barron¹•Institutions (3)

Google¹, Massachusetts Institute of Technology², University of California, Berkeley³

07 Dec 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work presents a method that takes as input a set of images of a scene illuminated by unconstrained known lighting, and produces as output a 3D representation that can be rendered from novel viewpoints under arbitrary lighting conditions.

Abstract: We present a method that takes as input a set of images of a scene illuminated by unconstrained known lighting, and produces as output a 3D representation that can be rendered from novel viewpoints under arbitrary lighting conditions. Our method represents the scene as a continuous volumetric function parameterized as MLPs whose inputs are a 3D location and whose outputs are the following scene properties at that input location: volume density, surface normal, material parameters, distance to the first surface intersection in any direction, and visibility of the external environment in any direction. Together, these allow us to render novel views of the object under arbitrary lighting, including indirect illumination effects. The predicted visibility and surface intersection fields are critical to our model's ability to simulate direct and indirect illumination during training, because the brute-force techniques used by prior work are intractable for lighting conditions outside of controlled setups with a single light. Our method outperforms alternative approaches for recovering relightable 3D scene representations, and performs well in complex lighting settings that have posed a significant challenge to prior work.

227 citations

Posted Content•

NeRD: Neural Reflectance Decomposition from Image Collections

Mark Boss¹, Raphael Braun, Varun Jampani², Jonathan T. Barron³, Ce Liu³, Hendrik P. A. Lensch¹•Institutions (3)

University of Tübingen¹, Nvidia², Google³

07 Dec 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: A neural reflectance decomposition (NeRD) technique that uses physically-based rendering to decompose the scene into spatially varying BRDF material properties enabling fast real-time rendering with novel illuminations.

Abstract: Decomposing a scene into its shape, reflectance, and illumination is a challenging but essential problem in computer vision and graphics. This problem is inherently more challenging when the illumination is not a single light source under laboratory conditions but is instead an unconstrained environmental illumination. Though recent work has shown that implicit representations can be used to model the radiance field of an object, these techniques only enable view synthesis and not relighting. Additionally, evaluating these radiance fields is resource and time-intensive. By decomposing a scene into explicit representations, any rendering framework can be leveraged to generate novel views under any illumination in real-time. NeRD is a method that achieves this decomposition by introducing physically-based rendering to neural radiance fields. Even challenging non-Lambertian reflectances, complex geometry, and unknown illumination can be decomposed into high-quality models. The datasets and code is available on the project page: this https URL

211 citations

Book Chapter•DOI•

Free View Synthesis

Gernot Riegler¹, Vladlen Koltun¹•Institutions (1)

Intel¹

23 Aug 2020

TL;DR: This work presents a method for novel view synthesis from input images that are freely distributed around a scene; it can synthesize images for free camera movement through the scene and works for general scenes with unconstrained geometric layouts.

Abstract: We present a method for novel view synthesis from input images that are freely distributed around a scene. Our method does not rely on a regular arrangement of input views, can synthesize images for free camera movement through the scene, and works for general scenes with unconstrained geometric layouts. We calibrate the input images via SfM and erect a coarse geometric scaffold via MVS. This scaffold is used to create a proxy depth map for a novel view of the scene. Based on this depth map, a recurrent encoder-decoder network processes reprojected features from nearby views and synthesizes the new view. Our network does not need to be optimized for a given scene. After training on a dataset, it works in previously unseen environments with no fine-tuning or per-scene optimization. We evaluate the presented approach on challenging real-world datasets, including Tanks and Temples, where we demonstrate successful view synthesis for the first time and substantially outperform prior and concurrent work.

Posted Content•

iNeRF: Inverting Neural Radiance Fields for Pose Estimation

Lin Yen-Chen, Peter R. Florence, Jonathan T. Barron, Alberto Rodriguez, Phillip Isola, Tsung-Yi Lin

10 Dec 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: iNeRF can perform category-level object pose estimation, including object instances not seen during training, with RGB images by inverting a NeRF model inferred from a single view.

Abstract: We present iNeRF, a framework that performs mesh-free pose estimation by "inverting" a Neural Radiance Field (NeRF). NeRFs have been shown to be remarkably effective for the task of view synthesis - synthesizing photorealistic novel views of real-world scenes or objects. In this work, we investigate whether we can apply analysis-by-synthesis via NeRF for mesh-free, RGB-only 6DoF pose estimation - given an image, find the translation and rotation of a camera relative to a 3D object or scene. Our method assumes that no object mesh models are available during either training or test time. Starting from an initial pose estimate, we use gradient descent to minimize the residual between pixels rendered from a NeRF and pixels in an observed image. In our experiments, we first study 1) how to sample rays during pose refinement for iNeRF to collect informative gradients and 2) how different batch sizes of rays affect iNeRF on a synthetic dataset. We then show that for complex real-world scenes from the LLFF dataset, iNeRF can improve NeRF by estimating the camera poses of novel images and using these images as additional training data for NeRF. Finally, we show iNeRF can perform category-level object pose estimation, including object instances not seen during training, with RGB images by inverting a NeRF model inferred from a single view.
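
The refinement loop described in the abstract (start from an initial pose, render, compare with the observed pixels, take a gradient step) can be illustrated with a deliberately tiny stand-in. The sketch below is an assumption-heavy toy, not iNeRF itself: the "scene" is a fixed 1D function instead of a trained NeRF, the "pose" is a single translation offset instead of a 6DoF transform, and all names (`scene`, `render`, `refine_pose`) are hypothetical.

```python
import math

def scene(x):
    """Hypothetical smooth 1D 'scene' standing in for a trained NeRF."""
    return math.sin(x) + 0.5 * math.cos(2.0 * x)

def scene_grad(x):
    """Analytic derivative of scene(x) with respect to x."""
    return math.cos(x) - math.sin(2.0 * x)

def render(offset, coords):
    """'Render' pixels by sampling the scene at shifted coordinates."""
    return [scene(u + offset) for u in coords]

def refine_pose(observed, coords, init_offset, lr=0.2, steps=200):
    """Gradient descent on the mean squared pixel residual."""
    offset = init_offset
    for _ in range(steps):
        grad = 0.0
        for u, target in zip(coords, observed):
            residual = scene(u + offset) - target
            # Chain rule: d(residual^2)/d(offset)
            grad += 2.0 * residual * scene_grad(u + offset)
        offset -= lr * grad / len(coords)
    return offset

# Illustrative setup: 64 pixel coordinates and an unknown true offset of 0.3.
coords = [2.0 * math.pi * i / 64 for i in range(64)]
observed = render(0.3, coords)
estimate = refine_pose(observed, coords, init_offset=0.0)
print(estimate)
```

As with any local optimizer, the toy only recovers the offset when the initial guess lies in the basin around the true value, which is the same reason the abstract stresses starting from an initial pose estimate.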

Journal Article•DOI•

State of the Art on Neural Rendering

Ayush Tewari, Ohad Fried¹, Justus Thies², Vincent Sitzmann¹, Stephen Lombardi³, Kalyan Sunkavalli⁴, Ricardo Martin-Brualla⁵, Tomas Simon³, Jason Saragih³, Matthias Nießner², Rohit Pandey⁵, Sean Fanello⁵, Gordon Wetzstein¹, Jun-Yan Zhu⁴, Christian Theobalt, Maneesh Agrawala¹, Eli Shechtman⁴, Dan B. Goldman⁵, Michael Zollhöfer³•Institutions (5)

Stanford University¹, Technische Universität München², Facebook³, Adobe Systems⁴, Google⁵

01 May 2020-Computer Graphics Forum

TL;DR: Neural rendering is a new and rapidly emerging field that combines generative machine learning techniques with physical knowledge from computer graphics, e.g., by the integration of differentiable rendering into network training.

Abstract: Efficient rendering of photo-realistic virtual worlds is a long standing effort of computer graphics. Modern graphics techniques have succeeded in synthesizing photo-realistic images from hand-crafted scene representations. However, the automatic generation of shape, materials, lighting, and other aspects of scenes remains a challenging problem that, if solved, would make photo-realistic computer graphics more widely accessible. Concurrently, progress in computer vision and machine learning have given rise to a new approach to image synthesis and editing, namely deep generative models. Neural rendering is a new and rapidly emerging field that combines generative machine learning techniques with physical knowledge from computer graphics, e.g., by the integration of differentiable rendering into network training. With a plethora of applications in computer graphics and vision, neural rendering is poised to become a new area in the graphics community, yet no survey of this emerging field exists. This state-of-the-art report summarizes the recent trends and applications of neural rendering. We focus on approaches that combine classic computer graphics techniques with deep generative models to obtain controllable and photo-realistic outputs. Starting with an overview of the underlying computer graphics and machine learning concepts, we discuss critical aspects of neural rendering approaches. This state-of-the-art report is focused on the many important use cases for the described algorithms such as novel view synthesis, semantic photo manipulation, facial and body reenactment, relighting, free-viewpoint video, and the creation of photo-realistic avatars for virtual and augmented reality telepresence. Finally, we conclude with a discussion of the social implications of such technology and investigate open research problems.

Journal Article•DOI•

Immersive light field video with a layered mesh representation

Michael Broxton¹, John Flynn¹, Ryan Overbeck¹, Daniel Erickson¹, Peter Hedman¹, Matthew DuVall¹, Jason Dourgarian¹, Jay Busch¹, Matt Whalen¹, Paul Debevec¹•Institutions (1)

Google¹

08 Jul 2020-ACM Transactions on Graphics

TL;DR: Advancing over previous work, this system is able to reproduce challenging content such as view-dependent reflections, semi-transparent surfaces, and near-field objects as close as 34 cm to the surface of the camera rig.

Abstract: We present a system for capturing, reconstructing, compressing, and rendering high quality immersive light field video. We accomplish this by leveraging the recently introduced DeepView view interpolation algorithm, replacing its underlying multi-plane image (MPI) scene representation with a collection of spherical shells that are better suited for representing panoramic light field content. We further process this data to reduce the large number of shell layers to a small, fixed number of RGBA+depth layers without significant loss in visual quality. The resulting RGB, alpha, and depth channels in these layers are then compressed using conventional texture atlasing and video compression techniques. The final compressed representation is lightweight and can be rendered on mobile VR/AR platforms or in a web browser. We demonstrate light field video results using data from the 16-camera rig of [Pozo et al. 2019] as well as a new low-cost hemispherical array made from 46 synchronized action sports cameras. From this data we produce 6 degree of freedom volumetric videos with a wide 70 cm viewing baseline, 10 pixels per degree angular resolution, and a wide field of view, at 30 frames per second video frame rates. Advancing over previous work, we show that our system is able to reproduce challenging content such as view-dependent reflections, semi-transparent surfaces, and near-field objects as close as 34 cm to the surface of the camera rig.

Posted Content•

State of the Art on Neural Rendering

Ayush Tewari, Ohad Fried¹, Justus Thies², Vincent Sitzmann¹, Stephen Lombardi³, Kalyan Sunkavalli⁴, Ricardo Martin-Brualla⁵, Tomas Simon³, Jason Saragih³, Matthias Nießner², Rohit Pandey⁵, Sean Fanello⁵, Gordon Wetzstein¹, Jun-Yan Zhu⁴, Christian Theobalt, Maneesh Agrawala¹, Eli Shechtman⁴, Dan B. Goldman⁵, Michael Zollhöfer³•Institutions (5)

Stanford University¹, Technische Universität München², Facebook³, Adobe Systems⁴, Google⁵

08 Apr 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This state‐of‐the‐art report summarizes the recent trends and applications of neural rendering and focuses on approaches that combine classic computer graphics techniques with deep generative models to obtain controllable and photorealistic outputs.

Abstract: Efficient rendering of photo-realistic virtual worlds is a long standing effort of computer graphics. Modern graphics techniques have succeeded in synthesizing photo-realistic images from hand-crafted scene representations. However, the automatic generation of shape, materials, lighting, and other aspects of scenes remains a challenging problem that, if solved, would make photo-realistic computer graphics more widely accessible. Concurrently, progress in computer vision and machine learning have given rise to a new approach to image synthesis and editing, namely deep generative models. Neural rendering is a new and rapidly emerging field that combines generative machine learning techniques with physical knowledge from computer graphics, e.g., by the integration of differentiable rendering into network training. With a plethora of applications in computer graphics and vision, neural rendering is poised to become a new area in the graphics community, yet no survey of this emerging field exists. This state-of-the-art report summarizes the recent trends and applications of neural rendering. We focus on approaches that combine classic computer graphics techniques with deep generative models to obtain controllable and photo-realistic outputs. Starting with an overview of the underlying computer graphics and machine learning concepts, we discuss critical aspects of neural rendering approaches. This state-of-the-art report is focused on the many important use cases for the described algorithms such as novel view synthesis, semantic photo manipulation, facial and body reenactment, relighting, free-viewpoint video, and the creation of photo-realistic avatars for virtual and augmented reality telepresence. Finally, we conclude with a discussion of the social implications of such technology and investigate open research problems.

Posted Content•

3D Photography using Context-aware Layered Depth Inpainting

Meng-Li Shih¹, Shih-Yang Su², Johannes Kopf³, Jia-Bin Huang²•Institutions (3)

National Tsing Hua University¹, Virginia Tech², Facebook³

09 Apr 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: A learning-based inpainting model is presented that iteratively synthesizes new local color-and-depth content into the occluded region in a spatial context-aware manner and can be efficiently rendered with motion parallax using standard graphics engines.

Abstract: We propose a method for converting a single RGB-D input image into a 3D photo - a multi-layer representation for novel view synthesis that contains hallucinated color and depth structures in regions occluded in the original view. We use a Layered Depth Image with explicit pixel connectivity as underlying representation, and present a learning-based inpainting model that synthesizes new local color-and-depth content into the occluded region in a spatial context-aware manner. The resulting 3D photos can be efficiently rendered with motion parallax using standard graphics engines. We validate the effectiveness of our method on a wide range of challenging everyday scenes and show fewer artifacts compared with the state of the arts.

Proceedings Article•DOI•

3D Photography Using Context-Aware Layered Depth Inpainting

Meng-Li Shih¹, Shih-Yang Su², Johannes Kopf³, Jia-Bin Huang²•Institutions (3)

National Tsing Hua University¹, Virginia Tech², Facebook³

14 Jun 2020

TL;DR: In this paper, a multi-layer representation for novel view synthesis that contains hallucinated color and depth structures in regions occluded in the original view is proposed, which can be efficiently rendered with motion parallax using standard graphics engines.

Abstract: We propose a method for converting a single RGB-D input image into a 3D photo, i.e., a multi-layer representation for novel view synthesis that contains hallucinated color and depth structures in regions occluded in the original view. We use a Layered Depth Image with explicit pixel connectivity as underlying representation, and present a learning-based inpainting model that iteratively synthesizes new local color-and-depth content into the occluded region in a spatial context-aware manner. The resulting 3D photos can be efficiently rendered with motion parallax using standard graphics engines. We validate the effectiveness of our method on a wide range of challenging everyday scenes and show less artifacts when compared with the state-of-the-arts.

Posted Content•

AutoInt: Automatic Integration for Fast Neural Volume Rendering

David B. Lindell¹, Julien N. P. Martel¹, Gordon Wetzstein¹•Institutions (1)

Stanford University¹

03 Dec 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes automatic integration, a new framework for learning efficient, closed-form solutions to integrals using coordinate-based neural networks, and improves render times by more than 10× at the cost of slightly reduced image quality.

Abstract: Numerical integration is a foundational technique in scientific computing and is at the core of many computer vision applications. Among these applications, neural volume rendering has recently been proposed as a new paradigm for view synthesis, achieving photorealistic image quality. However, a fundamental obstacle to making these methods practical is the extreme computational and memory requirements caused by the required volume integrations along the rendered rays during training and inference. Millions of rays, each requiring hundreds of forward passes through a neural network are needed to approximate those integrations with Monte Carlo sampling. Here, we propose automatic integration, a new framework for learning efficient, closed-form solutions to integrals using coordinate-based neural networks. For training, we instantiate the computational graph corresponding to the derivative of the network. The graph is fitted to the signal to integrate. After optimization, we reassemble the graph to obtain a network that represents the antiderivative. By the fundamental theorem of calculus, this enables the calculation of any definite integral in two evaluations of the network. Applying this approach to neural rendering, we improve a tradeoff between rendering speed and image quality: improving render times by greater than 10 times with a tradeoff of slightly reduced image quality.
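
The key identity in the abstract, that a definite integral costs two evaluations of the antiderivative network, can be checked on a toy example. The sketch below is illustrative only and makes several assumptions: the "integral network" is a one-hidden-layer sine network with fixed, hand-picked weights (no training is performed), its derivative is written out by hand instead of being assembled from a computational graph, and the names `integral_net`, `grad_net`, and `definite_integral` are hypothetical.

```python
import math

# Hypothetical fixed parameters standing in for trained weights.
W1 = [1.5, -0.7, 2.3]   # input-to-hidden weights
B1 = [0.1, 0.4, -0.2]   # hidden biases
W2 = [0.8, -1.1, 0.5]   # hidden-to-output weights

def integral_net(x):
    """Toy antiderivative network: sum_i W2[i] * sin(W1[i] * x + B1[i])."""
    return sum(w2 * math.sin(w1 * x + b) for w1, b, w2 in zip(W1, B1, W2))

def grad_net(x):
    """Derivative of integral_net; it shares the same parameters.

    In the setting the abstract describes, this is the network that
    would be fitted to the signal to integrate.
    """
    return sum(w2 * w1 * math.cos(w1 * x + b) for w1, b, w2 in zip(W1, B1, W2))

def definite_integral(a, b):
    """Fundamental theorem of calculus: two network evaluations."""
    return integral_net(b) - integral_net(a)

def riemann_midpoint(f, a, b, n=10000):
    """Reference quadrature needing n evaluations of the integrand."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

closed_form = definite_integral(0.0, 2.0)
numerical = riemann_midpoint(grad_net, 0.0, 2.0)
print(closed_form, numerical)
```

The two printed values should agree to within the quadrature error, while the closed-form path evaluates the network twice instead of thousands of times; that gap is the source of the rendering speedup the abstract reports.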

Proceedings Article•

GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis

Katja Schwarz¹, Yiyi Liao¹, Michael Niemeyer¹, Andreas Geiger¹•Institutions (1)

Max Planck Society¹

05 Jul 2020

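TL;DR: This paper proposes a generative model for radiance fields that, unlike voxel-based representations, is not confined to a coarse discretization of 3D space and still disentangles camera and scene properties; a multi-scale patch-based discriminator enables high-resolution synthesis when training from unposed 2D images alone.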

Abstract: While 2D generative adversarial networks have enabled high-resolution image synthesis, they largely lack an understanding of the 3D world and the image formation process. Thus, they do not provide precise control over camera viewpoint or object pose. To address this problem, several recent approaches leverage intermediate voxel-based representations in combination with differentiable rendering. However, existing methods either produce low image resolution or fall short in disentangling camera and scene properties, e.g., the object identity may vary with the viewpoint. In this paper, we propose a generative model for radiance fields which have recently proven successful for novel view synthesis of a single scene. In contrast to voxel-based representations, radiance fields are not confined to a coarse discretization of the 3D space, yet allow for disentangling camera and scene properties while degrading gracefully in the presence of reconstruction ambiguity. By introducing a multi-scale patch-based discriminator, we demonstrate synthesis of high-resolution images while training our model from unposed 2D images alone. We systematically analyze our approach on several challenging synthetic and real-world datasets. Our experiments reveal that radiance fields are a powerful representation for generative image synthesis, leading to 3D consistent models that render with high fidelity.

Posted Content•

Neural Scene Graphs for Dynamic Scenes

Julian Ost, Fahim Mannan, Nils Thuerey¹, Julian Knodt², Felix Heide•Institutions (2)

Technische Universität München¹, Princeton University²

20 Nov 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes a learned scene graph representation, which encodes object transformations and radiance, allowing us to efficiently render novel arrangements and views of the scene, and presents the first neural rendering method that represents multi-object dynamic scenes as scene graphs.

Abstract: Recent implicit neural rendering methods have demonstrated that it is possible to learn accurate view synthesis for complex scenes by predicting their volumetric density and color supervised solely by a set of RGB images. However, existing methods are restricted to learning efficient representations of static scenes that encode all scene objects into a single neural network, and lack the ability to represent dynamic scenes and decompositions into individual scene objects. In this work, we present the first neural rendering method that decomposes dynamic scenes into scene graphs. We propose a learned scene graph representation, which encodes object transformation and radiance, to efficiently render novel arrangements and views of the scene. To this end, we learn implicitly encoded scenes, combined with a jointly learned latent representation to describe objects with a single implicit function. We assess the proposed method on synthetic and real automotive data, validating that our approach learns dynamic scenes -- only by observing a video of this scene -- and allows for rendering novel photo-realistic views of novel scene compositions with unseen sets of objects at unseen poses.

Proceedings Article•DOI•

Deep Image Spatial Transformation for Person Image Generation

Yurui Ren¹, Xiaoming Yu¹, Junming Chen¹, Thomas H. Li¹, Ge Li¹•Institutions (1)

Peking University¹

14 Jun 2020

TL;DR: Ren et al. propose a differentiable global-flow local-attention framework that reassembles the inputs at the feature level: it calculates the global correlations between sources and targets to predict flow fields, then warps the source features using a content-aware sampling method with the obtained local attention coefficients.

Abstract: Pose-guided person image generation is to transform a source person image to a target pose. This task requires spatial manipulations of source data. However, Convolutional Neural Networks are limited by the lack of ability to spatially transform the inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. Specifically, our model first calculates the global correlations between sources and targets to predict flow fields. Then, the flowed local patch pairs are extracted from the feature maps to calculate the local attention coefficients. Finally, we warp the source features using a content-aware sampling method with the obtained local attention coefficients. The results of both subjective and objective experiments demonstrate the superiority of our model. Besides, additional results in video animation and view synthesis show that our model is applicable to other tasks requiring spatial transformation. Our source code is available at https://github.com/RenYurui/Global-Flow-Local-Attention.

Posted Content•

Stable View Synthesis

Gernot Riegler¹, Vladlen Koltun¹•Institutions (1)

Intel¹

14 Nov 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: Experimental results demonstrate that SVS outperforms state-of-the-art view synthesis methods both quantitatively and qualitatively on three diverse real-world datasets, achieving unprecedented levels of realism in free-viewpoint video of challenging large-scale scenes.

Abstract: We present Stable View Synthesis (SVS). Given a set of source images depicting a scene from freely distributed viewpoints, SVS synthesizes new views of the scene. The method operates on a geometric scaffold computed via structure-from-motion and multi-view stereo. Each point on this 3D scaffold is associated with view rays and corresponding feature vectors that encode the appearance of this point in the input images. The core of SVS is view-dependent on-surface feature aggregation, in which directional feature vectors at each 3D point are processed to produce a new feature vector for a ray that maps this point into the new target view. The target view is then rendered by a convolutional network from a tensor of features synthesized in this way for all pixels. The method is composed of differentiable modules and is trained end-to-end. It supports spatially-varying view-dependent importance weighting and feature transformation of source images at each point; spatial and temporal stability due to the smooth dependence of on-surface feature aggregation on the target view; and synthesis of view-dependent effects such as specular reflection. Experimental results demonstrate that SVS outperforms state-of-the-art view synthesis methods both quantitatively and qualitatively on three diverse real-world datasets, achieving unprecedented levels of realism in free-viewpoint video of challenging large-scale scenes.

Posted Content•

Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video.

Edgar Tretschk¹, Ayush Tewari¹, Vladislav Golyanik¹, Michael Zollhöfer², Christoph Lassner³, Christian Theobalt¹•Institutions (3)

Max Planck Society¹, Stanford University², Facebook³

22 Dec 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work presents Non-Rigid Neural Radiance Fields (NR-NeRF), which disentangle a dynamic scene into a canonical volume and its deformation, with the deformation implemented as ray bending and constrained by a rigidity network.

Abstract: We present Non-Rigid Neural Radiance Fields (NR-NeRF), a reconstruction and novel view synthesis approach for general non-rigid dynamic scenes. Our approach takes RGB images of a dynamic scene as input (e.g., from a monocular video recording), and creates a high-quality space-time geometry and appearance representation. We show that a single handheld consumer-grade camera is sufficient to synthesize sophisticated renderings of a dynamic scene from novel virtual camera views, e.g. a `bullet-time' video effect. NR-NeRF disentangles the dynamic scene into a canonical volume and its deformation. Scene deformation is implemented as ray bending, where straight rays are deformed non-rigidly. We also propose a novel rigidity network to better constrain rigid regions of the scene, leading to more stable results. The ray bending and rigidity network are trained without explicit supervision. Our formulation enables dense correspondence estimation across views and time, and compelling video editing applications such as motion exaggeration. Our code will be open sourced.
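
The ray-bending description above can be made concrete with a small sketch. It assumes that each sample on a straight ray is moved by a per-point offset scaled by a rigidity score in [0, 1], with 0 meaning rigid; `toy_offset` and `toy_rigidity` are hypothetical hand-written stand-ins for the learned ray-bending and rigidity networks, so this shows the data flow only, not the authors' model.

```python
import math


def sample_ray(origin, direction, near, far, n):
    """Return n evenly spaced 3D points on a straight ray (n >= 2)."""
    points = []
    for k in range(n):
        t = near + (far - near) * k / (n - 1)
        points.append(tuple(o + t * d for o, d in zip(origin, direction)))
    return points


def toy_offset(p):
    """Hypothetical stand-in for the learned ray-bending network."""
    return (0.1 * math.sin(p[2]), 0.0, 0.0)


def toy_rigidity(p):
    """Hypothetical stand-in for the rigidity network: 0 = rigid, 1 = non-rigid."""
    return 1.0 if p[0] >= 0.0 else 0.0


def bend(points):
    """Map straight-ray samples into the canonical volume.

    Each point is shifted by its offset scaled by its rigidity score, so
    points in rigid regions are left where they are.
    """
    bent = []
    for p in points:
        w = toy_rigidity(p)
        off = toy_offset(p)
        bent.append(tuple(c + w * o for c, o in zip(p, off)))
    return bent


rigid_ray = sample_ray((-1.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.0, 2.0, 5)
moving_ray = sample_ray((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.0, 2.0, 5)
print(bend(rigid_ray) == rigid_ray)
print(bend(moving_ray)[2])
```

In a full pipeline the bent points, rather than the straight-ray samples, would be passed to the canonical radiance field before volume rendering.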

Proceedings Article•

NeRD: Neural Reflectance Decomposition From Image Collections

Mark Boss¹, Raphael Braun, Varun Jampani², Jonathan T. Barron³, Ce Liu³, Hendrik P. A. Lensch¹•Institutions (3)

University of Tübingen¹, Nvidia², Google³

01 Jan 2020

TL;DR: NeRD decomposes a scene into explicit representations by introducing physically-based rendering to neural radiance fields, so that any rendering framework can be leveraged to generate novel views under any illumination in real time.

Abstract: Decomposing a scene into its shape, reflectance, and illumination is a challenging but essential problem in computer vision and graphics. This problem is inherently more challenging when the illumination is not a single light source under laboratory conditions but is instead an unconstrained environmental illumination. Though recent work has shown that implicit representations can be used to model the radiance field of an object, these techniques only enable view synthesis and not relighting. Additionally, evaluating these radiance fields is resource and time-intensive. By decomposing a scene into explicit representations, any rendering framework can be leveraged to generate novel views under any illumination in real-time. NeRD is a method that achieves this decomposition by introducing physically-based rendering to neural radiance fields. Even challenging non-Lambertian reflectances, complex geometry, and unknown illumination can be decomposed into high-quality models. The datasets and code are available on the project page: this https URL

Posted Content•

Neural Radiance Flow for 4D View Synthesis and Video Processing

Yilun Du¹, Yinan Zhang, Hong-Xing Yu², Joshua B. Tenenbaum¹, Jiajun Wu³•Institutions (3)

Massachusetts Institute of Technology¹, University of California, San Diego², Stanford University³

17 Dec 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work uses a neural implicit representation that learns to capture the 3D occupancy, radiance, and dynamics of the scene, and demonstrates that the learned representation can serve as an implicit scene prior, enabling video processing tasks such as image super-resolution and de-noising without any additional supervision.

Abstract: We present a method, Neural Radiance Flow (NeRFlow), to learn a 4D spatial-temporal representation of a dynamic scene from a set of RGB images. Key to our approach is the use of a neural implicit representation that learns to capture the 3D occupancy, radiance, and dynamics of the scene. By enforcing consistency across different modalities, our representation enables multi-view rendering in diverse dynamic scenes, including water pouring, robotic interaction, and real images, outperforming state-of-the-art methods for spatial-temporal view synthesis. Our approach works even when input images are captured with only one camera. We further demonstrate that the learned representation can serve as an implicit scene prior, enabling video processing tasks such as image super-resolution and de-noising without any additional supervision.

Book Chapter•DOI•

Crowdsampling the Plenoptic Function

Zhengqi Li¹, Wenqi Xian¹, Abe Davis¹, Noah Snavely¹•Institutions (1)

Cornell University¹

23 Aug 2020

TL;DR: A new DeepMPI representation is introduced, motivated by observations on the sparsity structure of the plenoptic function, that allows for real-time synthesis of photorealistic views that are continuous in both space and across changes in lighting.

Abstract: Many popular tourist landmarks are captured in a multitude of online, public photos. These photos represent a sparse and unstructured sampling of the plenoptic function for a particular scene. In this paper, we present a new approach to novel view synthesis under time-varying illumination from such data. Our approach builds on the recent multi-plane image (MPI) format for representing local light fields under fixed viewing conditions. We introduce a new DeepMPI representation, motivated by observations on the sparsity structure of the plenoptic function, that allows for real-time synthesis of photorealistic views that are continuous in both space and across changes in lighting. Our method can synthesize the same compelling parallax and view-dependent effects as previous MPI methods, while simultaneously interpolating along changes in reflectance and illumination with time. We show how to learn a model of these effects in an unsupervised way from an unstructured collection of photos without temporal registration, demonstrating significant improvements over recent work in neural rendering. More information can be found at crowdsampling.io.
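
For readers unfamiliar with the multi-plane image (MPI) format mentioned above, the sketch below shows the standard back-to-front "over" compositing commonly used to render one pixel from a stack of RGBA planes. It is a generic illustration under that assumption, with a made-up two-plane stack, and it does not model the DeepMPI extensions for lighting and time.

```python
def composite_over(planes):
    """Blend one pixel's MPI samples with the 'over' operator.

    planes: list of ((r, g, b), alpha) tuples ordered from the farthest
    plane to the nearest one. Returns the composited (r, g, b) color.
    """
    rgb = (0.0, 0.0, 0.0)
    for color, alpha in planes:
        # The nearer plane covers what is behind it in proportion to alpha.
        rgb = tuple(alpha * c + (1.0 - alpha) * acc for c, acc in zip(color, rgb))
    return rgb


# Hypothetical pixel: an opaque red far plane behind a half-transparent blue plane.
pixel = composite_over([((1.0, 0.0, 0.0), 1.0), ((0.0, 0.0, 1.0), 0.5)])
print(pixel)
```

Rendering a novel view would additionally warp each plane into the target camera with a homography before compositing; that step is omitted here.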

Posted Content•

Deep Image Spatial Transformation for Person Image Generation

Yurui Ren¹, Xiaoming Yu¹, Junming Chen¹, Thomas H. Li¹, Ge Li¹•Institutions (1)

Peking University¹

02 Mar 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes a differentiable global-flow local-attention framework that reassembles the inputs at the feature level to transform a source person image to a target pose; subjective and objective experiments demonstrate the superiority of the model.

Abstract: Pose-guided person image generation is to transform a source person image to a target pose. This task requires spatial manipulations of source data. However, Convolutional Neural Networks are limited by the lack of ability to spatially transform the inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. Specifically, our model first calculates the global correlations between sources and targets to predict flow fields. Then, the flowed local patch pairs are extracted from the feature maps to calculate the local attention coefficients. Finally, we warp the source features using a content-aware sampling method with the obtained local attention coefficients. The results of both subjective and objective experiments demonstrate the superiority of our model. Besides, additional results in video animation and view synthesis show that our model is applicable to other tasks requiring spatial transformation. Our source code is available at this https URL.

Posted Content•

Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image

Andrew Liu¹, Richard Tucker², Varun Jampani², Ameesh Makadia², Noah Snavely³, Angjoo Kanazawa⁴•Institutions (4)

Massachusetts Institute of Technology¹, Google², Cornell University³, University of California, Berkeley⁴

17 Dec 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work introduces the problem of perpetual view generation: long-range generation of novel views corresponding to an arbitrarily long camera trajectory given a single image. It takes a hybrid approach that integrates both geometry and image synthesis in an iterative "render, refine, and repeat" framework.

Abstract: We introduce the problem of perpetual view generation -- long-range generation of novel views corresponding to an arbitrarily long camera trajectory given a single image. This is a challenging problem that goes far beyond the capabilities of current view synthesis methods, which work for a limited range of viewpoints and quickly degenerate when presented with a large camera motion. Methods designed for video generation also have limited ability to produce long video sequences and are often agnostic to scene geometry. We take a hybrid approach that integrates both geometry and image synthesis in an iterative render, refine, and repeat framework, allowing for long-range generation that covers large distances after hundreds of frames. Our approach can be trained from a set of monocular video sequences without any manual annotation. We propose a dataset of aerial footage of natural coastal scenes, and compare our method with recent view synthesis and conditional video generation baselines, showing that it can generate plausible scenes for much longer time horizons over large camera trajectories compared to existing methods. Please visit our project page at this https URL.

Posted Content•

Neural Human Video Rendering by Learning Dynamic Textures and Rendering-to-Video Translation.

Lingjie Liu, Weipeng Xu¹, Marc Habermann², Michael Zollhoefer², Florian Bernard², Hyeongwoo Kim, Wenping Wang², Christian Theobalt³•Institutions (3)

University of Hong Kong¹, Max Planck Society², Stanford University³

14 Jan 2020-arXiv: Graphics

TL;DR: A novel human video synthesis method that addresses the artifacts of 2D image-to-image translation (over-smoothing, missing body parts, temporally unstable fine-scale detail) by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space, showing significant improvement over the state of the art both qualitatively and quantitatively.

Abstract: Synthesizing realistic videos of humans using neural networks has been a popular alternative to the conventional graphics-based rendering pipeline due to its high efficiency. Existing works typically formulate this as an image-to-image translation problem in 2D screen space, which leads to artifacts such as over-smoothing, missing body parts, and temporal instability of fine-scale detail, such as pose-dependent wrinkles in the clothing. In this paper, we propose a novel human video synthesis method that approaches these limiting factors by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space. More specifically, our method relies on the combination of two convolutional neural networks (CNNs). Given the pose information, the first CNN predicts a dynamic texture map that contains time-coherent high-frequency details, and the second CNN conditions the generation of the final video on the temporally coherent output of the first CNN. We demonstrate several applications of our approach, such as human reenactment and novel view synthesis from monocular video, where we show significant improvement over the state of the art both qualitatively and quantitatively.

Posted Content•

Novel View Synthesis of Dynamic Scenes with Globally Coherent Depths from a Monocular Camera

Jae Shin Yoon¹, Kihwan Kim², Orazio Gallo², Hyun Soo Park¹, Jan Kautz²•Institutions (2)

University of Minnesota¹, Nvidia²

02 Apr 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper presents a new method to synthesize an image from arbitrary views and times given a collection of images of a dynamic scene, evaluates its depth estimation and view synthesis on diverse real-world dynamic scenes, and reports better performance than existing methods.

Abstract: This paper presents a new method to synthesize an image from arbitrary views and times given a collection of images of a dynamic scene. A key challenge for the novel view synthesis arises from dynamic scene reconstruction where epipolar geometry does not apply to the local motion of dynamic contents. To address this challenge, we propose to combine the depth from single view (DSV) and the depth from multi-view stereo (DMV), where DSV is complete, i.e., a depth is assigned to every pixel, yet view-variant in its scale, while DMV is view-invariant yet incomplete. Our insight is that although its scale and quality are inconsistent with other views, the depth estimation from a single view can be used to reason about the globally coherent geometry of dynamic contents. We cast this problem as learning to correct the scale of DSV, and to refine each depth with locally consistent motions between views to form a coherent depth estimation. We integrate these tasks into a depth fusion network in a self-supervised fashion. Given the fused depth maps, we synthesize a photorealistic virtual view in a specific location and time with our deep blending network that completes the scene and renders the virtual view. We evaluate our method of depth estimation and view synthesis on diverse real-world dynamic scenes and show the outstanding performance over existing methods.
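
The complementary roles of DSV and DMV described above can be illustrated with a deliberately simplified sketch. The paper learns the scale correction and refinement with a depth fusion network; the code below swaps in a closed-form least-squares scale as a stand-in, and the function names and flat-list depth maps (with `None` marking pixels where DMV is missing) are assumptions made for illustration only.

```python
def fit_scale(dsv, dmv):
    """Least-squares scale s minimizing sum (s * dsv - dmv)^2 over valid DMV pixels."""
    num = 0.0
    den = 0.0
    for s, m in zip(dsv, dmv):
        if m is not None:
            num += s * m
            den += s * s
    if den == 0.0:
        raise ValueError("no valid multi-view depth to anchor the scale")
    return num / den


def fuse_depth(dsv, dmv):
    """Keep DMV where it exists; fill the gaps with rescaled DSV."""
    scale = fit_scale(dsv, dmv)
    return [m if m is not None else scale * s for s, m in zip(dsv, dmv)]


# Hypothetical 4-pixel example: DSV is complete but off by a factor of 2,
# DMV is metrically consistent but missing at the second pixel.
single_view = [1.0, 2.0, 3.0, 4.0]
multi_view = [2.0, None, 6.0, 8.0]
print(fuse_depth(single_view, multi_view))
```

A global scale cannot capture the per-view refinement with locally consistent motion that the abstract describes, which is one reason a learned fusion network is used instead.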

Showing papers on "View synthesis published in 2020"