Showing papers on "View synthesis" published in 2021


Proceedings ArticleDOI
20 Jun 2021
TL;DR: A method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views using a network architecture that includes a multilayer perceptron and a ray transformer that estimates radiance and volume density at continuous 5D locations.
Abstract: We present a method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views. The core of our method is a network architecture that includes a multilayer perceptron and a ray transformer that estimates radiance and volume density at continuous 5D locations (3D spatial locations and 2D viewing directions), drawing appearance information on the fly from multiple source views. By drawing on source views at render time, our method hearkens back to classic work on image-based rendering (IBR), and allows us to render high-resolution imagery. Unlike neural scene representation work that optimizes per-scene functions for rendering, we learn a generic view interpolation function that generalizes to novel scenes. We render images using classic volume rendering, which is fully differentiable and allows us to train using only multi-view posed images as supervision. Experiments show that our method outperforms recent novel view synthesis methods that also seek to generalize to novel scenes. Further, if fine-tuned on each scene, our method is competitive with state-of-the-art single-scene neural rendering methods.1
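The method above relies on classic volume rendering to composite per-sample radiance and density along each ray. As a point of reference, here is a minimal NumPy sketch of that compositing step as used by NeRF-style renderers; the per-sample colors and densities are assumed to come from some network, which is not modeled here.

```python
import numpy as np

def composite_ray(colors, sigmas, t_vals):
    """Alpha-composite per-sample radiance along one ray.

    colors: (N, 3) RGB predicted at each sample point
    sigmas: (N,)   volume densities at each sample point
    t_vals: (N,)   distances of the samples along the ray (near to far)
    """
    # Distances between adjacent samples (last interval treated as very large).
    deltas = np.diff(t_vals, append=1e10)
    # Opacity of each segment and accumulated transmittance in front of it.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1] + 1e-10]))
    weights = alphas * trans
    # Expected color seen along the ray.
    return (weights[:, None] * colors).sum(axis=0)

# Toy usage with random per-sample predictions along one ray.
rgb = composite_ray(np.random.rand(64, 3), np.random.rand(64), np.linspace(2.0, 6.0, 64))
```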

402 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: In this paper, the authors propose Neural Body, a new human body representation which assumes that learned neural representations at different frames share the same set of latent codes anchored to a deformable mesh, so that the observations across frames can be naturally integrated.
Abstract: This paper addresses the challenge of novel view synthesis for a human performer from a very sparse set of camera views. Some recent works have shown that learning implicit neural representations of 3D scenes achieves remarkable view synthesis quality given dense input views. However, the representation learning will be ill-posed if the views are highly sparse. To solve this ill-posed problem, our key idea is to integrate observations over video frames. To this end, we propose Neural Body, a new human body representation which assumes that the learned neural representations at different frames share the same set of latent codes anchored to a deformable mesh, so that the observations across frames can be naturally integrated. The deformable mesh also provides geometric guidance for the network to learn 3D representations more efficiently. To evaluate our approach, we create a multi-view dataset named ZJU-MoCap that captures performers with complex motions. Experiments on ZJU-MoCap show that our approach outperforms prior works by a large margin in terms of novel view synthesis quality. We also demonstrate the capability of our approach to reconstruct a moving person from a monocular video on the People-Snapshot dataset.
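The key idea is that one set of latent codes is anchored to the vertices of a deformable mesh, so posing the mesh carries the same codes to every frame. The sketch below is a deliberately simplified stand-in for that anchoring: it returns the code of the nearest posed vertex for a query point, whereas the paper diffuses the codes with a sparse 3D CNN; `posed_vertices` and `codes` are hypothetical inputs.

```python
import numpy as np

def query_anchored_code(point, posed_vertices, codes):
    """Return the latent code anchored to the mesh vertex nearest to `point`.

    point:          (3,)    query location in world space at some frame t
    posed_vertices: (V, 3)  mesh vertices deformed to frame t (e.g. by a body model)
    codes:          (V, C)  per-vertex latent codes shared by all frames
    """
    dists = np.linalg.norm(posed_vertices - point, axis=1)
    return codes[np.argmin(dists)]

# Toy usage: 6890 SMPL-sized vertices and 16-dim codes shared across frames.
verts_t = np.random.randn(6890, 3)     # stand-in for the posed mesh at frame t
latent = np.random.randn(6890, 16)     # learned once, reused for every frame
code = query_anchored_code(np.zeros(3), verts_t, latent)
# `code` would be fed, with the query point, to an MLP predicting color and density.
```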

364 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: PixelNeRF as mentioned in this paper is a learning framework that predicts a continuous neural scene representation conditioned on one or few input images, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views.
Abstract: We propose pixelNeRF, a learning framework that predicts a continuous neural scene representation conditioned on one or few input images. The existing approach for constructing neural radiance fields [27] involves optimizing the representation to every scene independently, requiring many calibrated views and significant compute time. We take a step towards resolving these shortcomings by introducing an architecture that conditions a NeRF on image inputs in a fully convolutional manner. This allows the network to be trained across multiple scenes to learn a scene prior, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views (as few as one). Leveraging the volume rendering approach of NeRF, our model can be trained directly from images with no explicit 3D supervision. We conduct extensive experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects as well as entire unseen categories. We further demonstrate the flexibility of pixelNeRF by demonstrating it on multi-object ShapeNet scenes and real scenes from the DTU dataset. In all cases, pixelNeRF outperforms current state-of-the-art baselines for novel view synthesis and single image 3D reconstruction. For the video and code, please visit the project website:https://alexyu.net/pixelnerf.
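pixelNeRF's conditioning boils down to projecting each 3D query point into a source view and sampling the CNN feature at that pixel. Below is a hedged sketch of that step under a standard pinhole model (intrinsics `K`, world-to-camera pose `R`, `t`, point assumed in front of the camera); the CNN and the downstream MLP are assumed to exist elsewhere and are not shown.

```python
import numpy as np

def sample_image_feature(point, K, R, t, feat_map):
    """Project a 3D point into a source view and bilinearly sample its CNN feature.

    point:    (3,)       world-space query point
    K:        (3, 3)     camera intrinsics
    R, t:     (3,3),(3,) world-to-camera rotation and translation
    feat_map: (H, W, C)  feature map extracted from the source image
    """
    cam = R @ point + t                      # to camera coordinates
    uv = (K @ cam)[:2] / cam[2]              # perspective projection (pixels)
    h, w, _ = feat_map.shape
    x = np.clip(uv[0], 0, w - 1.001)
    y = np.clip(uv[1], 0, h - 1.001)
    x0, y0 = int(x), int(y)
    dx, dy = x - x0, y - y0
    # Bilinear interpolation of the four neighbouring feature vectors.
    return ((1 - dx) * (1 - dy) * feat_map[y0, x0] + dx * (1 - dy) * feat_map[y0, x0 + 1]
            + (1 - dx) * dy * feat_map[y0 + 1, x0] + dx * dy * feat_map[y0 + 1, x0 + 1])

# Toy usage: a 64x64 feature map with 16 channels and an identity-pose camera.
feat = sample_image_feature(np.array([0.1, 0.0, 2.0]),
                            np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1.0]]),
                            np.eye(3), np.zeros(3), np.random.randn(64, 64, 16))
```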

242 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this article, a neural scene flow field is proposed to model the dynamic scene as a time-varying continuous function of appearance, geometry, and 3D scene motion.
Abstract: We present a method to perform novel view and time synthesis of dynamic scenes, requiring only a monocular video with known camera poses as input. To do this, we introduce Neural Scene Flow Fields, a new representation that models the dynamic scene as a time-variant continuous function of appearance, geometry, and 3D scene motion. Our representation is optimized through a neural network to fit the observed input views. We show that our representation can be used for varieties of in-the-wild scenes, including thin structures, view-dependent effects, and complex degrees of motion. We conduct a number of experiments that demonstrate our approach significantly outperforms recent monocular view synthesis methods, and show qualitative results of space-time view synthesis on a variety of real-world videos.

144 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: SRF as mentioned in this paper predicts color and density for each 3D point given an encoding of its stereo correspondence in the input images, implicitly learned by an ensemble of pair-wise similarities, emulating classical stereo.
Abstract: Recent neural view synthesis methods have achieved impressive quality and realism, surpassing classical pipelines which rely on multi-view reconstruction. State-of-the-Art methods, such as NeRF [34], are designed to learn a single scene with a neural network and require dense multi-view inputs. Testing on a new scene requires re-training from scratch, which takes 2-3 days. In this work, we introduce Stereo Radiance Fields (SRF), a neural view synthesis approach that is trained end-to-end, generalizes to new scenes, and requires only sparse views at test time. The core idea is a neural architecture inspired by classical multi-view stereo methods, which estimates surface points by finding similar image regions in stereo images. In SRF, we predict color and density for each 3D point given an encoding of its stereo correspondence in the input images. The encoding is implicitly learned by an ensemble of pair-wise similarities – emulating classical stereo. Experiments show that SRF learns structure instead of over-fitting on a scene. We train on multiple scenes of the DTU dataset and generalize to new ones without re-training, requiring only 10 sparse and spread-out views as input. We show that 10-15 minutes of fine-tuning further improve the results, achieving significantly sharper, more detailed results than scene-specific models. The code, model, and videos are available – https://virtualhumans.mpi-inf.mpg.de/srf/.
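The stereo-correspondence encoding described above can be illustrated with a toy version: per-view feature vectors sampled at the point's projections are compared pairwise, emulating classical stereo matching costs. This is only a sketch of the idea, with the feature extraction assumed done beforehand (e.g. as in the sampling sketch above).

```python
import numpy as np

def pairwise_similarity_encoding(view_feats):
    """Encode a 3D point by pairwise similarities of its per-view features.

    view_feats: (N, C) one feature vector per input view, sampled where the
                point projects into that view.
    Returns a vector of dot-product similarities for all view pairs (i < j).
    """
    n = view_feats.shape[0]
    sims = [view_feats[i] @ view_feats[j] for i in range(n) for j in range(i + 1, n)]
    return np.asarray(sims)   # fed, together with the features, to a color/density MLP

# Toy usage with 10 sparse input views and 32-dim features -> 45 pairwise terms.
enc = pairwise_similarity_encoding(np.random.randn(10, 32))
```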

122 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: The authors presented a method that takes as input a set of images of a scene illuminated by unconstrained known lighting, and produces as output a 3D representation that can be rendered from novel viewpoints under arbitrary lighting conditions.
Abstract: We present a method that takes as input a set of images of a scene illuminated by unconstrained known lighting, and produces as output a 3D representation that can be rendered from novel viewpoints under arbitrary lighting conditions. Our method represents the scene as a continuous volumetric function parameterized as MLPs whose inputs are a 3D location and whose outputs are the following scene properties at that input location: volume density, surface normal, material parameters, distance to the first surface intersection in any direction, and visibility of the external environment in any direction. Together, these allow us to render novel views of the object under arbitrary lighting, including indirect illumination effects. The predicted visibility and surface intersection fields are critical to our model’s ability to simulate direct and indirect illumination during training, because the brute-force techniques used by prior work are intractable for lighting conditions outside of controlled setups with a single light. Our method outperforms alternative approaches for recovering relightable 3D scene representations, and performs well in complex lighting settings that have posed a significant challenge to prior work.

114 citations


Proceedings Article
20 Apr 2021
TL;DR: This paper shows that implicit surface models and radiance fields can be formulated in a unified way, enabling both surface and volume rendering with the same model; this unified perspective enables novel, more efficient sampling procedures and the ability to reconstruct accurate surfaces without input masks.
Abstract: Neural implicit 3D representations have emerged as a powerful paradigm for reconstructing surfaces from multi-view images and synthesizing novel views. Unfortunately, existing methods such as DVR or IDR require accurate per-pixel object masks as supervision. At the same time, neural radiance fields have revolutionized novel view synthesis. However, NeRF's estimated volume density does not admit accurate surface reconstruction. Our key insight is that implicit surface models and radiance fields can be formulated in a unified way, enabling both surface and volume rendering using the same model. This unified perspective enables novel, more efficient sampling procedures and the ability to reconstruct accurate surfaces without input masks. We compare our method on the DTU, BlendedMVS, and a synthetic indoor dataset. Our experiments demonstrate that we outperform NeRF in terms of reconstruction quality while performing on par with IDR without requiring masks.
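One way to read the unification above is that the per-sample occupancy in [0, 1] plays the role of alpha in volume rendering, so the same model supports both surface and volume rendering. A small sketch of that occupancy-based compositing, with occupancies assumed to come from the implicit surface network:

```python
import numpy as np

def composite_occupancy(colors, occupancies):
    """Volume rendering where per-sample occupancy plays the role of alpha.

    colors:      (N, 3) per-sample RGB along a ray (near to far)
    occupancies: (N,)   per-sample occupancy in [0, 1] from the surface model
    """
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - occupancies[:-1]]))
    weights = occupancies * trans          # concentrates at the first occupied sample
    return (weights[:, None] * colors).sum(axis=0)

# With a hard 0/1 occupancy all weight falls on the first occupied sample, which is
# exactly surface rendering; soft occupancies recover volume rendering.
rgb = composite_occupancy(np.random.rand(32, 3), np.linspace(0.0, 1.0, 32))
```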

102 citations


Proceedings Article
01 Jan 2021
TL;DR: MVSNeRF as discussed by the authors proposes a generic deep neural network that can reconstruct radiance fields from only three nearby input views via fast network inference, leveraging plane-swept cost volumes (widely used in multi-view stereo) for geometry-aware scene reasoning.
Abstract: We present MVSNeRF, a novel neural rendering approach that can efficiently reconstruct neural radiance fields for view synthesis. Unlike prior works on neural radiance fields that consider per-scene optimization on densely captured images, we propose a generic deep neural network that can reconstruct radiance fields from only three nearby input views via fast network inference. Our approach leverages plane-swept cost volumes (widely used in multi-view stereo) for geometry-aware scene reasoning, and combines this with physically based volume rendering for neural radiance field reconstruction. We train our network on real objects in the DTU dataset, and test it on three different datasets to evaluate its effectiveness and generalizability. Our approach can generalize across scenes (even indoor scenes, completely different from our training scenes of objects) and generate realistic view synthesis results using only three input images, significantly outperforming concurrent works on generalizable radiance field reconstruction. Moreover, if dense images are captured, our estimated radiance field representation can be easily fine-tuned; this leads to fast per-scene reconstruction with higher rendering quality and substantially less optimization time than NeRF.
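The plane-swept cost volume mentioned above measures how well the source views agree at each candidate depth, commonly as the variance of warped features. A simplified sketch, assuming the per-view feature maps have already been warped onto a shared set of depth planes (the homography warp and the subsequent 3D CNN are omitted):

```python
import numpy as np

def variance_cost_volume(warped_feats):
    """Cost volume as the feature variance across source views.

    warped_feats: (V, D, H, W, C) features of V views warped onto D depth planes.
    Returns:      (D, H, W, C)    low variance = views agree = plausible depth.
    """
    mean = warped_feats.mean(axis=0)
    return ((warped_feats - mean) ** 2).mean(axis=0)

# Toy usage: 3 input views, 64 depth planes, 32x32 feature maps, 8 channels.
cost = variance_cost_volume(np.random.randn(3, 64, 32, 32, 8))
# MVSNeRF further processes such a volume with a 3D CNN before reconstructing the
# radiance field; that stage is not modeled here.
```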

94 citations


Proceedings ArticleDOI
Gernot Riegler, Vladlen Koltun
01 Jun 2021
TL;DR: The core of Stable View Synthesis (SVS) as discussed by the authors is view-dependent on-surface feature aggregation, in which directional feature vectors at each 3D point are processed to produce a new feature vector for a ray that maps this point into the new target view.
Abstract: We present Stable View Synthesis (SVS). Given a set of source images depicting a scene from freely distributed viewpoints, SVS synthesizes new views of the scene. The method operates on a geometric scaffold computed via structure-from-motion and multi-view stereo. Each point on this 3D scaffold is associated with view rays and corresponding feature vectors that encode the appearance of this point in the input images. The core of SVS is view-dependent on-surface feature aggregation, in which directional feature vectors at each 3D point are processed to produce a new feature vector for a ray that maps this point into the new target view. The target view is then rendered by a convolutional network from a tensor of features synthesized in this way for all pixels. The method is composed of differentiable modules and is trained end-to-end. It supports spatially-varying view-dependent importance weighting and feature transformation of source images at each point; spatial and temporal stability due to the smooth dependence of on-surface feature aggregation on the target view; and synthesis of view-dependent effects such as specular reflection. Experimental results demonstrate that SVS outperforms state-of-the-art view synthesis methods both quantitatively and qualitatively on three diverse real-world datasets, achieving unprecedented levels of realism in free-viewpoint video of challenging large-scale scenes. Code is available at https://github.com/intel-isl/StableViewSynthesis
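The on-surface aggregation turns the set of directional feature vectors at a surface point into one feature for the target ray. As a hand-crafted stand-in for the learned aggregation in SVS, the sketch below weights each source feature by a softmax over the cosine between its viewing direction and the target direction; treat it as an illustration of the interface, not the paper's network.

```python
import numpy as np

def aggregate_on_surface(feats, src_dirs, tgt_dir, temperature=0.1):
    """Blend per-source-view features of one surface point for a target ray.

    feats:    (V, C) feature of the point as seen in each source image
    src_dirs: (V, 3) unit viewing directions associated with those features
    tgt_dir:  (3,)   unit viewing direction of the target ray
    """
    cos = src_dirs @ tgt_dir                       # alignment with the target ray
    w = np.exp(cos / temperature)
    w /= w.sum()                                   # softmax weights
    return (w[:, None] * feats).sum(axis=0)        # feature handed to the rendering CNN

# Source views looking along the target direction dominate the blended feature.
f = aggregate_on_surface(np.random.randn(5, 16),
                         np.eye(3)[[0, 1, 2, 0, 1]],        # toy unit directions
                         np.array([1.0, 0.0, 0.0]))
```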

76 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: In this paper, the authors propose a multi-task architecture in which image synthesis and retrieval are considered jointly, which can bias their network to learn latent feature representations that are useful for retrieval if they utilize them to generate images across two input domains.
Abstract: The goal of cross-view image based geo-localization is to determine the location of a given street view image by matching it against a collection of geo-tagged satellite images. This task is notoriously challenging due to the drastic viewpoint and appearance differences between the two domains. We show that we can address this discrepancy explicitly by learning to synthesize realistic street views from satellite inputs. Following this observation, we propose a novel multi-task architecture in which image synthesis and retrieval are considered jointly. The rationale behind this is that we can bias our network to learn latent feature representations that are useful for retrieval if we utilize them to generate images across the two input domains. To the best of our knowledge, ours is the first approach that creates realistic street views from satellite images and localizes the corresponding query street-view simultaneously in an end-to-end manner. In our experiments, we obtain state-of-the-art performance on the CVUSA and CVACT benchmarks. Finally, we show compelling qualitative results for satellite-to-street view synthesis.

70 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: Automatic integration as discussed by the authors is a new framework for learning efficient, closed-form solutions to integrals using coordinate-based neural networks, which can improve render times by more than 10× with a tradeoff of reduced image quality.
Abstract: Numerical integration is a foundational technique in scientific computing and is at the core of many computer vision applications. Among these applications, neural volume rendering has recently been proposed as a new paradigm for view synthesis, achieving photorealistic image quality. However, a fundamental obstacle to making these methods practical is the extreme computational and memory requirements caused by the required volume integrations along the rendered rays during training and inference. Millions of rays, each requiring hundreds of forward passes through a neural network are needed to approximate those integrations with Monte Carlo sampling. Here, we propose automatic integration, a new framework for learning efficient, closed-form solutions to integrals using coordinate-based neural networks. For training, we instantiate the computational graph corresponding to the derivative of the coordinate-based network. The graph is fitted to the signal to integrate. After optimization, we reassemble the graph to obtain a network that represents the antiderivative. By the fundamental theorem of calculus, this enables the calculation of any definite integral in two evaluations of the network. Applying this approach to neural rendering, we improve a tradeoff between rendering speed and image quality: improving render times by greater than 10× with a tradeoff of reduced image quality.
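The core trick (train the derivative of a coordinate network, then evaluate its antiderivative twice per definite integral) can be reproduced with autograd in a few lines. Below is a minimal PyTorch sketch on a 1D toy integrand rather than the full neural rendering setting.

```python
import torch

# Antiderivative network F(t); its derivative dF/dt is what gets fitted to the signal.
F = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(),
                        torch.nn.Linear(64, 64), torch.nn.Tanh(),
                        torch.nn.Linear(64, 1))

def dF(t):
    """Evaluate dF/dt with autograd -- the 'grad network' used during training."""
    t = t.requires_grad_(True)
    y = F(t)
    return torch.autograd.grad(y, t, torch.ones_like(y), create_graph=True)[0]

signal = lambda t: torch.sin(t)                     # toy integrand
opt = torch.optim.Adam(F.parameters(), lr=1e-3)
for _ in range(2000):                               # fit the derivative to the signal
    t = torch.rand(256, 1) * 4.0
    loss = ((dF(t) - signal(t)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Fundamental theorem of calculus: any definite integral now costs two evaluations of F.
a, b = torch.zeros(1, 1), torch.full((1, 1), 2.0)
estimate = (F(b) - F(a)).item()                     # should land near 1 - cos(2) ≈ 1.416
```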

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, the authors propose to represent multi-object dynamic scenes as learned scene graphs that encode object transformations and radiance, allowing them to efficiently render novel arrangements and views of the scene.
Abstract: Recent implicit neural rendering methods have demonstrated that it is possible to learn accurate view synthesis for complex scenes by predicting their volumetric density and color supervised solely by a set of RGB images. However, existing methods are restricted to learning efficient representations of static scenes that encode all scene objects into a single neural network, and they lack the ability to represent dynamic scenes and decompose scenes into individual objects. In this work, we present the first neural rendering method that represents multi-object dynamic scenes as scene graphs. We propose a learned scene graph representation, which encodes object transformations and radiance, allowing us to efficiently render novel arrangements and views of the scene. To this end, we learn implicitly encoded scenes, combined with a jointly learned latent representation to describe similar objects with a single implicit function. We assess the proposed method on synthetic and real automotive data, validating that our approach learns dynamic scenes – only by observing a video of this scene – and allows for rendering novel photo-realistic views of novel scene compositions with unseen sets of objects at unseen poses.

Journal ArticleDOI
TL;DR: A new large-scale visual localization method targeted at indoor spaces, together with a new challenging dataset on which it significantly outperforms current state-of-the-art indoor localization approaches.
Abstract: We seek to predict the 6 degree-of-freedom (6DoF) pose of a query photograph with respect to a large indoor 3D map. The contributions of this work are three-fold. First, we develop a new large-scale visual localization method targeted for indoor spaces. The method proceeds along three steps: (i) efficient retrieval of candidate poses that scales to large-scale environments, (ii) pose estimation using dense matching rather than sparse local features to deal with weakly textured indoor scenes, and (iii) pose verification by virtual view synthesis that is robust to significant changes in viewpoint, scene layout, and occlusion. Second, we release a new dataset with reference 6DoF poses for large-scale indoor localization. Query photographs are captured by mobile phones at a different time than the reference 3D map, thus presenting a realistic indoor localization scenario. Third, we demonstrate that our method significantly outperforms current state-of-the-art indoor localization approaches on this new challenging data. Code and data are publicly available.

Proceedings Article
01 Jan 2021
TL;DR: DietNeRF as discussed by the authors introduces an auxiliary semantic consistency loss that encourages realistic renderings at novel poses; it is trained on individual scenes to correctly render given input views from the same pose and to match high-level semantic attributes across different, random poses.
Abstract: We present DietNeRF, a 3D neural scene representation estimated from a few images. Neural Radiance Fields (NeRF) learn a continuous volumetric representation of a scene through multi-view consistency, and can be rendered from novel viewpoints by ray casting. While NeRF has an impressive ability to reconstruct geometry and fine details given many images, up to 100 for challenging 360° scenes, it often finds a degenerate solution to its image reconstruction objective when only a few input views are available. To improve few-shot quality, we propose DietNeRF. We introduce an auxiliary semantic consistency loss that encourages realistic renderings at novel poses. DietNeRF is trained on individual scenes to (1) correctly render given input views from the same pose, and (2) match high-level semantic attributes across different, random poses. Our semantic loss allows us to supervise DietNeRF from arbitrary poses. We extract these semantics using a pre-trained visual encoder such as CLIP, a Vision Transformer trained on hundreds of millions of diverse single-view, 2D photographs mined from the web with natural language supervision. In experiments, DietNeRF improves the perceptual quality of few-shot view synthesis when learned from scratch, can render novel views with as few as one observed image when pre-trained on a multi-view dataset, and produces plausible completions of completely unobserved regions.
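The auxiliary loss compares high-level embeddings of a rendering from an arbitrary pose against embeddings of an observed view, so supervision exists even where no pixel-wise ground truth does. A hedged sketch of such a loss follows; `encoder` stands in for a frozen pre-trained image encoder such as CLIP's vision tower, which is not loaded here, and the toy encoder at the bottom is purely illustrative.

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(encoder, rendered, observed):
    """Cosine loss between embeddings of a novel-pose rendering and an observed
    view; the two images need not share pixel-wise correspondence."""
    with torch.no_grad():                                 # target embedding is fixed
        target = F.normalize(encoder(observed.unsqueeze(0)), dim=-1)
    pred = F.normalize(encoder(rendered.unsqueeze(0)), dim=-1)
    return 1.0 - (pred * target).sum(dim=-1).mean()

# Toy usage with a hypothetical embedding network standing in for a frozen CLIP encoder.
toy_encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
loss = semantic_consistency_loss(toy_encoder, torch.rand(3, 64, 64), torch.rand(3, 64, 64))
```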

Journal ArticleDOI
TL;DR: In this article, a semi-parametric approach is proposed to learn a neural representation of the light transport of a scene that is embedded in a texture atlas of known but possibly rough geometry.
Abstract: The light transport (LT) of a scene describes how it appears under different lighting conditions from different viewing directions, and complete knowledge of a scene’s LT enables the synthesis of novel views under arbitrary lighting. In this article, we focus on image-based LT acquisition, primarily for human bodies within a light stage setup. We propose a semi-parametric approach for learning a neural representation of the LT that is embedded in a texture atlas of known but possibly rough geometry. We model all non-diffuse and global LT as residuals added to a physically based diffuse base rendering. In particular, we show how to fuse previously seen observations of illuminants and views to synthesize a new image of the same scene under a desired lighting condition from a chosen viewpoint. This strategy allows the network to learn complex material effects (such as subsurface scattering) and global illumination (such as diffuse interreflection), while guaranteeing the physical correctness of the diffuse LT (such as hard shadows). With this learned LT, one can relight the scene photorealistically with a directional light or an HDRI map, synthesize novel views with view-dependent effects, or do both simultaneously, all in a unified framework using a set of sparse observations. Qualitative and quantitative experiments demonstrate that our Neural Light Transport (NLT) outperforms state-of-the-art solutions for relighting and view synthesis, without requiring separate treatments for both problems that prior work requires. The code and data are available at http://nlt.csail.mit.edu.

Proceedings Article
01 Jan 2021
TL;DR: The authors propose a scene-guided training strategy to resolve the 3D space ambiguity in occluded regions and learn sharp boundaries for each object; their system not only achieves competitive performance for static-scene novel view synthesis, but also produces realistic rendering for object-level editing.
Abstract: Implicit neural rendering techniques have shown promising results for novel view synthesis. However, existing methods usually encode the entire scene as a whole, which is generally not aware of the object identity and limits the ability to the high-level editing tasks such as moving or adding furniture. In this paper, we present a novel neural scene rendering system, which learns an object-compositional neural radiance field and produces realistic rendering with editing capability for a clustered and real-world scene. Specifically, we design a novel two-pathway architecture, in which the scene branch encodes the scene geometry and appearance, and the object branch encodes each standalone object conditioned on learnable object activation codes. To survive the training in heavily cluttered scenes, we propose a scene-guided training strategy to solve the 3D space ambiguity in the occluded regions and learn sharp boundaries for each object. Extensive experiments demonstrate that our system not only achieves competitive performance for static scene novel-view synthesis, but also produces realistic rendering for object-level editing.

Journal ArticleDOI
TL;DR: A Two-stream Attention Network (TSAN)-based synthesized-view quality enhancement method is proposed for 3D High Efficiency Video Coding (3D-HEVC) in this article and achieves significantly better performance than state-of-the-art methods.
Abstract: In three-dimensional video system, the texture and depth videos are jointly encoded, and then the Depth Image Based Rendering (DIBR) is utilized to realize view synthesis. However, the compression distortion of texture and depth videos, as well as the disocclusion problem in DIBR degrade the visual quality of the synthesized view. To address this problem, a Two-stream Attention Network (TSAN)-based synthesized view quality enhancement method is proposed for 3D-High Efficiency Video Coding (3D-HEVC) in this paper. First, the shortcomings of the view synthesis technique and traditional convolutional neural networks are analyzed. Then, based on these analyses, a TSAN with two information extraction streams is proposed for enhancing the quality of the synthesized view, in which the global information extraction stream learns the contextual information, and the local information extraction stream extracts the texture information from the rendered image. Third, a Multi-Scale Residual Attention Block (MSRAB) is proposed, which can efficiently detect features in different scales, and adaptively refine features by considering interdependencies among spatial dimensions. Extensive experimental results show that the proposed synthesized view quality enhancement method achieves significantly better performance than the state-of-the-art methods.

Posted Content
TL;DR: In this article, depth measurements are incorporated into the radiance field formulation to produce more detailed and complete reconstructions than methods based on either color or depth data alone; the differentiable volume rendering loss is especially beneficial for learning the signed distance field in regions with missing depth measurements.
Abstract: In this work, we explore how to leverage the success of implicit novel view synthesis methods for surface reconstruction. Methods which learn a neural radiance field have shown amazing image synthesis results, but the underlying geometry representation is only a coarse approximation of the real geometry. We demonstrate how depth measurements can be incorporated into the radiance field formulation to produce more detailed and complete reconstruction results than using methods based on either color or depth data alone. In contrast to a density field as the underlying geometry representation, we propose to learn a deep neural network which stores a truncated signed distance field. Using this representation, we show that one can still leverage differentiable volume rendering to estimate color values of the observed images during training to compute a reconstruction loss. This is beneficial for learning the signed distance field in regions with missing depth measurements. Furthermore, we correct misalignment errors of the camera, improving the overall reconstruction quality. In several experiments, we showcase our method and compare to existing works on classical RGB-D fusion and learned representations.
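One concrete way depth measurements can enter such a formulation is as a supervision term on the depth expected under the volume rendering weights, next to the usual color loss. The sketch below shows that generic combination (not the paper's exact truncated-SDF objective), assuming the per-sample weights have already been computed by the renderer, e.g. as in the compositing sketch near the top of this page.

```python
import numpy as np

def rgbd_ray_loss(weights, t_vals, colors, gt_rgb, gt_depth):
    """Combine a photometric loss with a depth loss on the expected ray depth.

    weights:  (N,)   volume rendering weights of the samples along one ray
    t_vals:   (N,)   sample distances along the ray
    colors:   (N, 3) per-sample RGB
    gt_rgb:   (3,)   observed pixel color
    gt_depth: float or np.nan  sensor depth for this ray (NaN where missing)
    """
    rgb = (weights[:, None] * colors).sum(axis=0)
    color_loss = float(((rgb - gt_rgb) ** 2).mean())
    exp_depth = float((weights * t_vals).sum())
    # Rays without a sensor reading fall back to photometric supervision only.
    depth_loss = 0.0 if np.isnan(gt_depth) else abs(exp_depth - gt_depth)
    return color_loss + 0.1 * depth_loss      # 0.1 is an arbitrary example weight
```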

Proceedings Article
01 Jan 2021
TL;DR: MINE as mentioned in this paper predicts a 4-channel image (RGB and volume density) at arbitrary depth values to jointly reconstruct the camera frustum and fill in occluded contents, which can then be easily rendered into novel RGB or depth views using differentiable rendering.
Abstract: In this paper, we propose MINE to perform novel view synthesis and depth estimation via dense 3D reconstruction from a single image. Our approach is a continuous depth generalization of the Multiplane Images (MPI) by introducing the NEural radiance fields (NeRF). Given a single image as input, MINE predicts a 4-channel image (RGB and volume density) at arbitrary depth values to jointly reconstruct the camera frustum and fill in occluded contents. The reconstructed and inpainted frustum can then be easily rendered into novel RGB or depth views using differentiable rendering. Extensive experiments on RealEstate10K, KITTI and Flowers Light Fields show that our MINE outperforms state-of-the-art by a large margin in novel view synthesis. We also achieve competitive results in depth estimation on iBims-1 and NYU-v2 without annotated depth supervision. Our source code is available at this https URL
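Rendering a multiplane-style representation reduces to "over"-compositing the predicted RGB-alpha planes from front to back once they are warped into the target view. A simplified sketch of that compositing step, with the warp omitted and the per-plane predictions assumed given:

```python
import numpy as np

def composite_planes(rgba_planes):
    """Front-to-back 'over' compositing of RGB-alpha planes (near plane first).

    rgba_planes: (D, H, W, 4) per-plane RGB and alpha, e.g. predicted at D sampled
                 depth values and already warped into the target view.
    Returns:     (H, W, 3) composited image.
    """
    out = np.zeros(rgba_planes.shape[1:3] + (3,))
    trans = np.ones(rgba_planes.shape[1:3] + (1,))      # accumulated transmittance
    for plane in rgba_planes:                           # near to far
        rgb, alpha = plane[..., :3], plane[..., 3:4]
        out += trans * alpha * rgb
        trans *= (1.0 - alpha)
    return out

img = composite_planes(np.random.rand(32, 48, 64, 4))   # toy: 32 planes, 48x64 pixels
```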

Journal ArticleDOI
TL;DR: This paper proposes a differentiable point-based splatting pipeline based on a bi-directional Elliptical Weighted Average (EWA) solution, which can be applied to multi-view harmonization and stylization in addition to novel view synthesis.
Abstract: There has recently been great interest in neural rendering methods. Some approaches use 3D geometry reconstructed with Multi-View Stereo (MVS) but cannot recover from the errors of this process, while others directly learn a volumetric neural representation, but suffer from expensive training and inference. We introduce a general approach that is initialized with MVS, but allows further optimization of scene properties in the space of input views, including depth and reprojected features, resulting in improved novel view synthesis. A key element of our approach is a differentiable point-based splatting pipeline, based on our bi-directional Elliptical Weighted Average solution. To further improve quality and efficiency of our point-based method, we introduce a probabilistic depth test and efficient camera selection. We use these elements together in our neural renderer, allowing us to achieve a good compromise between quality and speed. Our pipeline can be applied to multi-view harmonization and stylization in addition to novel view synthesis.

Posted Content
TL;DR: In this paper, the authors propose an end-to-end framework, termed NeRF--, for training NeRF models given only RGB images, without pre-computed camera parameters.
Abstract: This paper tackles the problem of novel view synthesis (NVS) from 2D images without known camera poses and intrinsics. Among various NVS techniques, Neural Radiance Field (NeRF) has recently gained popularity due to its remarkable synthesis quality. Existing NeRF-based approaches assume that the camera parameters associated with each input image are either directly accessible at training, or can be accurately estimated with conventional techniques based on correspondences, such as Structure-from-Motion. In this work, we propose an end-to-end framework, termed NeRF--, for training NeRF models given only RGB images, without pre-computed camera parameters. Specifically, we show that the camera parameters, including both intrinsics and extrinsics, can be automatically discovered via joint optimisation during the training of the NeRF model. On the standard LLFF benchmark, our model achieves comparable novel view synthesis results compared to the baseline trained with COLMAP pre-computed camera parameters. We also conduct extensive analyses to understand the model behaviour under different camera trajectories, and show that in scenarios where COLMAP fails, our model still produces robust results.
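The essential change relative to a standard NeRF training loop is that camera parameters become trainable tensors updated by the same photometric loss. A minimal PyTorch sketch of that part, assuming axis-angle rotations, a shared focal length, and an external NeRF and ray sampler that are not shown:

```python
import torch

n_images = 20
# Per-image extrinsics (axis-angle + translation) and a shared focal length, all
# optimised jointly with the NeRF weights instead of being pre-computed.
rot_vecs = torch.nn.Parameter(1e-3 * torch.randn(n_images, 3))   # near-identity init
trans = torch.nn.Parameter(torch.zeros(n_images, 3))
log_focal = torch.nn.Parameter(torch.zeros(1))

def axis_angle_to_matrix(v, eps=1e-8):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = v.norm() + eps
    k = v / theta
    zero = torch.zeros((), dtype=v.dtype)
    K = torch.stack([torch.stack([zero, -k[2], k[1]]),
                     torch.stack([k[2], zero, -k[0]]),
                     torch.stack([-k[1], k[0], zero])])
    return torch.eye(3) + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)

# These parameters would simply be added to the optimiser alongside the NeRF, e.g.
# torch.optim.Adam([{'params': nerf.parameters(), 'lr': 5e-4},
#                   {'params': [rot_vecs, trans, log_focal], 'lr': 1e-3}])
R0 = axis_angle_to_matrix(rot_vecs[0])   # pose of image 0, refined as training runs
```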

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper propose a learning-based framework to synthesize novel views and reconstruct densely-sampled LFs from sparsely-sampled LFs, in which details in novel views are explored from the spatial and angular domains.
Abstract: Due to hardware restriction, it is costly to capture densely-sampled Light Fields (LFs) with high angular and spatial resolution, which becomes the main bottleneck of LFs development. In this paper, we propose a learning-based framework to synthesize novel views and reconstruct densely-sampled LFs from sparsely-sampled LFs. In the proposed framework, micro-lens image stacks and view image stacks are separately grouped, in which details in novel views are explored from spatial and angular domain. The two kinds of stacks contain epipolar information and 3D convolution layers are employed to effectively extract features that include structure information. Moreover, an innovative way is proposed to synthesize views by upsampling micro-lens image stacks using deconvolution layers. The parameters in decovolution layers provide the view position information and different interpolation and extrapolation tasks can be explicitly modeled. It is validated that this view synthesis module can be embedded in different frameworks and improve the related performances. Without precise depth estimation and view warping, the proposed method is mainly designed for reconstructing LFs with small baselines. Related experimental results show that the proposed model outperforms other state-of-the-art methods in terms of both visual and numerical evaluations. Furthermore, the consistency between synthesized views and the intrinsic structure information is well preserved in the proposed method.

Posted Content
TL;DR: The authors propose a framework integrated with more reliable supervision guided by semantic co-segmentation and data augmentation, which extracts mutual semantics from multi-view images to guide semantic consistency.
Abstract: Recent studies have witnessed that self-supervised methods based on view synthesis obtain clear progress on multi-view stereo (MVS). However, existing methods rely on the assumption that the corresponding points among different views share the same color, which may not always be true in practice. This may lead to unreliable self-supervised signal and harm the final reconstruction performance. To address the issue, we propose a framework integrated with more reliable supervision guided by semantic co-segmentation and data-augmentation. Specially, we excavate mutual semantic from multi-view images to guide the semantic consistency. And we devise effective data-augmentation mechanism which ensures the transformation robustness by treating the prediction of regular samples as pseudo ground truth to regularize the prediction of augmented samples. Experimental results on DTU dataset show that our proposed methods achieve the state-of-the-art performance among unsupervised methods, and even compete on par with supervised methods. Furthermore, extensive experiments on Tanks&Temples dataset demonstrate the effective generalization ability of the proposed method.

Proceedings ArticleDOI
01 Jun 2021
TL;DR: The proposed Shadow Neural Radiance Field (S-NeRF) methodology not only performs novel view synthesis and full 3D shape estimation, it also enables shadow detection, albedo synthesis, and transient object filtering, without any explicit shape supervision.
Abstract: We present a new generic method for shadow-aware multi-view satellite photogrammetry of Earth Observation scenes. Our proposed method, the Shadow Neural Radiance Field (S-NeRF) follows recent advances in implicit volumetric representation learning. For each scene, we train S-NeRF using very high spatial resolution optical images taken from known viewing angles. The learning requires no labels or shape priors: it is self-supervised by an image reconstruction loss. To accommodate for changing light source conditions both from a directional light source (the Sun) and a diffuse light source (the sky), we extend the NeRF approach in two ways. First, direct illumination from the Sun is modeled via a local light source visibility field. Second, indirect illumination from a diffuse light source is learned as a non-local color field as a function of the position of the Sun. Quantitatively, the combination of these factors reduces the altitude and color errors in shaded areas, compared to NeRF. The S-NeRF methodology not only performs novel view synthesis and full 3D shape estimation, it also enables shadow detection, albedo synthesis, and transient object filtering, without any explicit shape supervision.

Journal ArticleDOI
TL;DR: In this paper, a learning-based approach is proposed to synthesize the view from an arbitrary camera position given a sparse set of images by jointly modeling the epipolar property and occlusion in designing a convolutional neural network.
Abstract: This paper presents a learning-based approach to synthesize the view from an arbitrary camera position given a sparse set of images. A key challenge for this novel view synthesis arises from the reconstruction process, when the views from different input images may not be consistent due to obstruction in the light path. We overcome this by jointly modeling the epipolar property and occlusion in designing a convolutional neural network. We start by defining and computing the aperture disparity map, which approximates the parallax and measures the pixel-wise shift between two views. While this relates to free-space rendering and can fail near the object boundaries, we further develop a warping confidence map to address pixel occlusion in these challenging regions. The proposed method is evaluated on diverse real-world and synthetic light field scenes, and it shows better performance over several state-of-the-art techniques.


Proceedings Article
13 Apr 2021
TL;DR: In this article, a Bundle-Adjusting Neural Radiance Fields (BARF) is proposed for training NeRF from imperfect (or even unknown) camera poses, which enables view synthesis and localization of video sequences from unknown camera poses.
Abstract: Neural Radiance Fields (NeRF) have recently gained a surge of interest within the computer vision community for its power to synthesize photorealistic novel views of real-world scenes. One limitation of NeRF, however, is its requirement of accurate camera poses to learn the scene representations. In this paper, we propose Bundle-Adjusting Neural Radiance Fields (BARF) for training NeRF from imperfect (or even unknown) camera poses -- the joint problem of learning neural 3D representations and registering camera frames. We establish a theoretical connection to classical image alignment and show that coarse-to-fine registration is also applicable to NeRF. Furthermore, we show that naively applying positional encoding in NeRF has a negative impact on registration with a synthesis-based objective. Experiments on synthetic and real-world data show that BARF can effectively optimize the neural scene representations and resolve large camera pose misalignment at the same time. This enables view synthesis and localization of video sequences from unknown camera poses, opening up new avenues for visual localization systems (e.g. SLAM) and potential applications for dense 3D mapping and reconstruction.
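The coarse-to-fine registration is realized by progressively enabling the higher-frequency bands of the positional encoding as training proceeds. Below is a sketch of that frequency-windowed encoding following the schedule described in the BARF paper; the exact frequency scaling is a common NeRF-style choice and should be treated as an assumption.

```python
import numpy as np

def windowed_positional_encoding(x, num_freqs, alpha):
    """BARF-style coarse-to-fine positional encoding.

    num_freqs: number of frequency bands L
    alpha:     annealing parameter in [0, L], raised from 0 to L over training so
               that low frequencies are used first and high ones fade in smoothly.
    """
    x = np.atleast_1d(x)
    feats = [x]
    for k in range(num_freqs):
        # Window weight for band k: 0 before it is scheduled, cosine ease-in, then 1.
        w = 0.5 * (1.0 - np.cos(np.pi * np.clip(alpha - k, 0.0, 1.0)))
        freq = (2.0 ** k) * np.pi
        feats.append(w * np.sin(freq * x))
        feats.append(w * np.cos(freq * x))
    return np.concatenate(feats)

enc_early = windowed_positional_encoding(0.3, num_freqs=10, alpha=2.0)    # mostly coarse bands
enc_late = windowed_positional_encoding(0.3, num_freqs=10, alpha=10.0)    # all bands active
```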

Journal ArticleDOI
TL;DR: In this article, a view synthesis method is proposed to provide immersive free navigation with 6 degrees of freedom in real-time for natural and virtual scenery, for both static and dynamic content, which can take any number of input views with their corresponding depth maps as priors.
Abstract: This paper presents a novel approach to provide immersive free navigation with 6 Degrees of Freedom in real-time for natural and virtual scenery, for both static and dynamic content. Stemming from the state-of-the-art in Depth Image-Based Rendering and the OpenGL pipeline, this new View Synthesis method achieves free navigation at up to 90 FPS and can take any number of input views with their corresponding depth maps as priors. Video content can be played thanks to GPU decompression, supporting free navigation with full parallax in real-time. To render a novel viewpoint, each selected input view is warped using the camera pose and associated depth map, using an implicit 3D representation. The warped views are then blended all together to generate the chosen virtual view. Various view blending approaches specifically designed to avoid visual artifacts are compared. Using as few as four input views appears to be an optimal trade-off between computation time and quality, allowing to synthesize high-quality stereoscopic views in real-time, offering a genuine immersive virtual reality experience. Additionally, the proposed approach provides high-quality rendering of a 3D scenery on holographic light field displays. Our results are comparable - objectively and subjectively - to the state of the art view synthesis tools NeRF and LLFF, while maintaining an overall lower complexity and real-time rendering.
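At the heart of such Depth Image-Based Rendering is the warp that lifts a source pixel to 3D using its depth and reprojects it into the virtual camera; blending several warped views then covers disocclusions. A minimal sketch of the per-pixel warp under a pinhole model (blending and hole filling are omitted):

```python
import numpy as np

def warp_pixel(u, v, depth, K_src, pose_src, K_dst, pose_dst):
    """Reproject a source pixel (u, v) with known depth into a target camera.

    K_*:    (3, 3) pinhole intrinsics
    pose_*: (4, 4) camera-to-world transforms of the source and target cameras
    Returns (u', v') pixel coordinates in the target view.
    """
    # Unproject to a 3D point in the source camera frame, then to world space.
    p_cam = depth * (np.linalg.inv(K_src) @ np.array([u, v, 1.0]))
    p_world = pose_src[:3, :3] @ p_cam + pose_src[:3, 3]
    # World space into the target camera frame, then project back to pixels.
    R, t = pose_dst[:3, :3], pose_dst[:3, 3]
    q = R.T @ (p_world - t)
    uvw = K_dst @ q
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

# Sanity check: identical cameras map a pixel back onto itself.
K = np.diag([500.0, 500.0, 1.0]); eye = np.eye(4)
assert np.allclose(warp_pixel(320.0, 240.0, 2.0, K, eye, K, eye), (320.0, 240.0))
```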

Journal ArticleDOI
TL;DR: This paper proposes to drastically reduce the number of input images to four images with depth maps in any pose, in order to create the missing images with depth image-based rendering.
Abstract: To computer-generate high-quality holographic stereograms, a huge number of images must be provided: several hundred for a horizontal parallax and the square of this number for a full parallax. In this paper, we propose to drastically reduce this number to four input images with depth maps (or equivalently, four groups of neighboring images used to compute a depth map) in any pose, in order to create the missing images with depth image-based rendering. We evaluate the view synthesis method objectively before providing visual results of the corresponding holographic stereograms. We believe this method outperforms shearlet-based approaches in objective view synthesis quality metrics and in the number of required input images (7×7).

Journal ArticleDOI
Wen Liu, Zhixin Piao, Zhi Tu, Wenhan Luo, Lin Ma, Shenghua Gao
TL;DR: Liu et al. as discussed in this paper propose an Attentional Liquid Warping GAN with Attentional Liquid Warping Block (AttLWB) that propagates the source information in both image and feature spaces to the synthesized reference.
Abstract: We tackle human image synthesis, including motion imitation, appearance transfer, and novel view synthesis, within a unified framework. The model, once being trained, can be used to handle all these tasks. We propose to use a 3D body mesh recovery module to disentangle the pose and shape. It can not only model the joint location and rotation but also characterize the personalized body shape. To preserve the source information, such as texture, style, color, and face identity, we propose an Attentional Liquid Warping GAN with Attentional Liquid Warping Block (AttLWB) that propagates the source information in both image and feature spaces to the synthesized reference. Specifically, the source features are extracted by a denoising convolutional auto-encoder for characterizing the source identity well. Our proposed method can support a more flexible warping from multiple sources. To further improve the generalization ability of the unseen source images, a one/few-shot adversarial learning is applied in a self-supervised way to generate high-resolution 512 x 512 and 1024 x 1024 results. Also, we build a new dataset, namely iPER, for the evaluation of these three tasks. Extensive experiments demonstrate the effectiveness of our methods in terms of preserving face identity, shape consistency, and clothes details.