Showing papers on "View synthesis" published in 2021


Proceedings ArticleDOI
20 Jun 2021
TL;DR: A method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views using a network architecture that includes a multilayer perceptron and a ray transformer that estimates radiance and volume density at continuous 5D locations.
Abstract: We present a method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views. The core of our method is a network architecture that includes a multilayer perceptron and a ray transformer that estimates radiance and volume density at continuous 5D locations (3D spatial locations and 2D viewing directions), drawing appearance information on the fly from multiple source views. By drawing on source views at render time, our method hearkens back to classic work on image-based rendering (IBR), and allows us to render high-resolution imagery. Unlike neural scene representation work that optimizes per-scene functions for rendering, we learn a generic view interpolation function that generalizes to novel scenes. We render images using classic volume rendering, which is fully differentiable and allows us to train using only multi-view posed images as supervision. Experiments show that our method outperforms recent novel view synthesis methods that also seek to generalize to novel scenes. Further, if fine-tuned on each scene, our method is competitive with state-of-the-art single-scene neural rendering methods.1
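The method above relies on classic volume rendering to composite per-sample radiance and density along each ray. As a point of reference, here is a minimal NumPy sketch of that compositing step as used by NeRF-style renderers; the per-sample colors and densities are assumed to come from some network, which is not modeled here.

```python
import numpy as np

def composite_ray(colors, sigmas, t_vals):
    """Alpha-composite per-sample radiance along one ray.

    colors: (N, 3) RGB predicted at each sample point
    sigmas: (N,)   volume densities at each sample point
    t_vals: (N,)   distances of the samples along the ray (near to far)
    """
    # Distances between adjacent samples (last interval treated as very large).
    deltas = np.diff(t_vals, append=1e10)
    # Opacity of each segment and accumulated transmittance in front of it.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1] + 1e-10]))
    weights = alphas * trans
    # Expected color seen along the ray.
    return (weights[:, None] * colors).sum(axis=0)

# Toy usage with random per-sample predictions along one ray.
rgb = composite_ray(np.random.rand(64, 3), np.random.rand(64), np.linspace(2.0, 6.0, 64))
```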

402 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: In this paper, the authors propose Neural Body, a new human body representation which assumes that learned neural representations at different frames share the same set of latent codes anchored to a deformable mesh, so that the observations across frames can be naturally integrated.
Abstract: This paper addresses the challenge of novel view synthesis for a human performer from a very sparse set of camera views. Some recent works have shown that learning implicit neural representations of 3D scenes achieves remarkable view synthesis quality given dense input views. However, the representation learning will be ill-posed if the views are highly sparse. To solve this ill-posed problem, our key idea is to integrate observations over video frames. To this end, we propose Neural Body, a new human body representation which assumes that the learned neural representations at different frames share the same set of latent codes anchored to a deformable mesh, so that the observations across frames can be naturally integrated. The deformable mesh also provides geometric guidance for the network to learn 3D representations more efficiently. To evaluate our approach, we create a multi-view dataset named ZJU-MoCap that captures performers with complex motions. Experiments on ZJU-MoCap show that our approach outperforms prior works by a large margin in terms of novel view synthesis quality. We also demonstrate the capability of our approach to reconstruct a moving person from a monocular video on the People-Snapshot dataset.
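The key idea is that one set of latent codes is anchored to the vertices of a deformable mesh, so posing the mesh carries the same codes to every frame. The sketch below is a deliberately simplified stand-in for that anchoring: it returns the code of the nearest posed vertex for a query point, whereas the paper diffuses the codes with a sparse 3D CNN; `posed_vertices` and `codes` are hypothetical inputs.

```python
import numpy as np

def query_anchored_code(point, posed_vertices, codes):
    """Return the latent code anchored to the mesh vertex nearest to `point`.

    point:          (3,)    query location in world space at some frame t
    posed_vertices: (V, 3)  mesh vertices deformed to frame t (e.g. by a body model)
    codes:          (V, C)  per-vertex latent codes shared by all frames
    """
    dists = np.linalg.norm(posed_vertices - point, axis=1)
    return codes[np.argmin(dists)]

# Toy usage: 6890 SMPL-sized vertices and 16-dim codes shared across frames.
verts_t = np.random.randn(6890, 3)     # stand-in for the posed mesh at frame t
latent = np.random.randn(6890, 16)     # learned once, reused for every frame
code = query_anchored_code(np.zeros(3), verts_t, latent)
# `code` would be fed, with the query point, to an MLP predicting color and density.
```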

364 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: PixelNeRF as mentioned in this paper is a learning framework that predicts a continuous neural scene representation conditioned on one or few input images, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views.
Abstract: We propose pixelNeRF, a learning framework that predicts a continuous neural scene representation conditioned on one or few input images. The existing approach for constructing neural radiance fields [27] involves optimizing the representation to every scene independently, requiring many calibrated views and significant compute time. We take a step towards resolving these shortcomings by introducing an architecture that conditions a NeRF on image inputs in a fully convolutional manner. This allows the network to be trained across multiple scenes to learn a scene prior, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views (as few as one). Leveraging the volume rendering approach of NeRF, our model can be trained directly from images with no explicit 3D supervision. We conduct extensive experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects as well as entire unseen categories. We further demonstrate the flexibility of pixelNeRF by demonstrating it on multi-object ShapeNet scenes and real scenes from the DTU dataset. In all cases, pixelNeRF outperforms current state-of-the-art baselines for novel view synthesis and single image 3D reconstruction. For the video and code, please visit the project website:https://alexyu.net/pixelnerf.
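pixelNeRF's conditioning boils down to projecting each 3D query point into a source view and sampling the CNN feature at that pixel. Below is a hedged sketch of that step under a standard pinhole model (intrinsics `K`, world-to-camera pose `R`, `t`, point assumed in front of the camera); the CNN and the downstream MLP are assumed to exist elsewhere and are not shown.

```python
import numpy as np

def sample_image_feature(point, K, R, t, feat_map):
    """Project a 3D point into a source view and bilinearly sample its CNN feature.

    point:    (3,)       world-space query point
    K:        (3, 3)     camera intrinsics
    R, t:     (3,3),(3,) world-to-camera rotation and translation
    feat_map: (H, W, C)  feature map extracted from the source image
    """
    cam = R @ point + t                      # to camera coordinates
    uv = (K @ cam)[:2] / cam[2]              # perspective projection (pixels)
    h, w, _ = feat_map.shape
    x = np.clip(uv[0], 0, w - 1.001)
    y = np.clip(uv[1], 0, h - 1.001)
    x0, y0 = int(x), int(y)
    dx, dy = x - x0, y - y0
    # Bilinear interpolation of the four neighbouring feature vectors.
    return ((1 - dx) * (1 - dy) * feat_map[y0, x0] + dx * (1 - dy) * feat_map[y0, x0 + 1]
            + (1 - dx) * dy * feat_map[y0 + 1, x0] + dx * dy * feat_map[y0 + 1, x0 + 1])

# Toy usage: a 64x64 feature map with 16 channels and an identity-pose camera.
feat = sample_image_feature(np.array([0.1, 0.0, 2.0]),
                            np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1.0]]),
                            np.eye(3), np.zeros(3), np.random.randn(64, 64, 16))
```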

242 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this article, a neural scene flow field is proposed to model the dynamic scene as a time-varying continuous function of appearance, geometry, and 3D scene motion.
Abstract: We present a method to perform novel view and time synthesis of dynamic scenes, requiring only a monocular video with known camera poses as input. To do this, we introduce Neural Scene Flow Fields, a new representation that models the dynamic scene as a time-variant continuous function of appearance, geometry, and 3D scene motion. Our representation is optimized through a neural network to fit the observed input views. We show that our representation can be used for varieties of in-the-wild scenes, including thin structures, view-dependent effects, and complex degrees of motion. We conduct a number of experiments that demonstrate our approach significantly outperforms recent monocular view synthesis methods, and show qualitative results of space-time view synthesis on a variety of real-world videos.

144 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: SRF as mentioned in this paper predicts color and density for each 3D point given an encoding of its stereo correspondence in the input images, implicitly learned by an ensemble of pair-wise similarities, emulating classical stereo.
Abstract: Recent neural view synthesis methods have achieved impressive quality and realism, surpassing classical pipelines which rely on multi-view reconstruction. State-of-the-Art methods, such as NeRF [34], are designed to learn a single scene with a neural network and require dense multi-view inputs. Testing on a new scene requires re-training from scratch, which takes 2-3 days. In this work, we introduce Stereo Radiance Fields (SRF), a neural view synthesis approach that is trained end-to-end, generalizes to new scenes, and requires only sparse views at test time. The core idea is a neural architecture inspired by classical multi-view stereo methods, which estimates surface points by finding similar image regions in stereo images. In SRF, we predict color and density for each 3D point given an encoding of its stereo correspondence in the input images. The encoding is implicitly learned by an ensemble of pair-wise similarities – emulating classical stereo. Experiments show that SRF learns structure instead of over-fitting on a scene. We train on multiple scenes of the DTU dataset and generalize to new ones without re-training, requiring only 10 sparse and spread-out views as input. We show that 10-15 minutes of fine-tuning further improve the results, achieving significantly sharper, more detailed results than scene-specific models. The code, model, and videos are available – https://virtualhumans.mpi-inf.mpg.de/srf/.
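The stereo-correspondence encoding described above can be illustrated with a toy version: per-view feature vectors sampled at the point's projections are compared pairwise, emulating classical stereo matching costs. This is only a sketch of the idea, with the feature extraction assumed done beforehand (e.g. as in the sampling sketch above).

```python
import numpy as np

def pairwise_similarity_encoding(view_feats):
    """Encode a 3D point by pairwise similarities of its per-view features.

    view_feats: (N, C) one feature vector per input view, sampled where the
                point projects into that view.
    Returns a vector of dot-product similarities for all view pairs (i < j).
    """
    n = view_feats.shape[0]
    sims = [view_feats[i] @ view_feats[j] for i in range(n) for j in range(i + 1, n)]
    return np.asarray(sims)   # fed, together with the features, to a color/density MLP

# Toy usage with 10 sparse input views and 32-dim features -> 45 pairwise terms.
enc = pairwise_similarity_encoding(np.random.randn(10, 32))
```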

122 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: The authors presented a method that takes as input a set of images of a scene illuminated by unconstrained known lighting, and produces as output a 3D representation that can be rendered from novel viewpoints under arbitrary lighting conditions.
Abstract: We present a method that takes as input a set of images of a scene illuminated by unconstrained known lighting, and produces as output a 3D representation that can be rendered from novel viewpoints under arbitrary lighting conditions. Our method represents the scene as a continuous volumetric function parameterized as MLPs whose inputs are a 3D location and whose outputs are the following scene properties at that input location: volume density, surface normal, material parameters, distance to the first surface intersection in any direction, and visibility of the external environment in any direction. Together, these allow us to render novel views of the object under arbitrary lighting, including indirect illumination effects. The predicted visibility and surface intersection fields are critical to our model’s ability to simulate direct and indirect illumination during training, because the brute-force techniques used by prior work are intractable for lighting conditions outside of controlled setups with a single light. Our method outperforms alternative approaches for recovering relightable 3D scene representations, and performs well in complex lighting settings that have posed a significant challenge to prior work.

114 citations


Proceedings Article
20 Apr 2021
TL;DR: This paper shows that implicit surface models and radiance fields can be formulated in a unified way, enabling both surface and volume rendering with the same model; this unified perspective enables novel, more efficient sampling procedures and the ability to reconstruct accurate surfaces without input masks.
Abstract: Neural implicit 3D representations have emerged as a powerful paradigm for reconstructing surfaces from multi-view images and synthesizing novel views. Unfortunately, existing methods such as DVR or IDR require accurate per-pixel object masks as supervision. At the same time, neural radiance fields have revolutionized novel view synthesis. However, NeRF's estimated volume density does not admit accurate surface reconstruction. Our key insight is that implicit surface models and radiance fields can be formulated in a unified way, enabling both surface and volume rendering using the same model. This unified perspective enables novel, more efficient sampling procedures and the ability to reconstruct accurate surfaces without input masks. We compare our method on the DTU, BlendedMVS, and a synthetic indoor dataset. Our experiments demonstrate that we outperform NeRF in terms of reconstruction quality while performing on par with IDR without requiring masks.
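One way to read the unification above is that the per-sample occupancy in [0, 1] plays the role of alpha in volume rendering, so the same model supports both surface and volume rendering. A small sketch of that occupancy-based compositing, with occupancies assumed to come from the implicit surface network:

```python
import numpy as np

def composite_occupancy(colors, occupancies):
    """Volume rendering where per-sample occupancy plays the role of alpha.

    colors:      (N, 3) per-sample RGB along a ray (near to far)
    occupancies: (N,)   per-sample occupancy in [0, 1] from the surface model
    """
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - occupancies[:-1]]))
    weights = occupancies * trans          # concentrates at the first occupied sample
    return (weights[:, None] * colors).sum(axis=0)

# With a hard 0/1 occupancy all weight falls on the first occupied sample, which is
# exactly surface rendering; soft occupancies recover volume rendering.
rgb = composite_occupancy(np.random.rand(32, 3), np.linspace(0.0, 1.0, 32))
```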

102 citations


Proceedings Article
01 Jan 2021
TL;DR: MVSNeRF as discussed by the authors proposes a generic deep neural network that can reconstruct radiance fields from only three nearby input views via fast network inference, leveraging plane-swept cost volumes (widely used in multi-view stereo) for geometry-aware scene reasoning.
Abstract: We present MVSNeRF, a novel neural rendering approach that can efficiently reconstruct neural radiance fields for view synthesis. Unlike prior works on neural radiance fields that consider per-scene optimization on densely captured images, we propose a generic deep neural network that can reconstruct radiance fields from only three nearby input views via fast network inference. Our approach leverages plane-swept cost volumes (widely used in multi-view stereo) for geometry-aware scene reasoning, and combines this with physically based volume rendering for neural radiance field reconstruction. We train our network on real objects in the DTU dataset, and test it on three different datasets to evaluate its effectiveness and generalizability. Our approach can generalize across scenes (even indoor scenes, completely different from our training scenes of objects) and generate realistic view synthesis results using only three input images, significantly outperforming concurrent works on generalizable radiance field reconstruction. Moreover, if dense images are captured, our estimated radiance field representation can be easily fine-tuned; this leads to fast per-scene reconstruction with higher rendering quality and substantially less optimization time than NeRF.
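The plane-swept cost volume mentioned above measures how well the source views agree at each candidate depth, commonly as the variance of warped features. A simplified sketch, assuming the per-view feature maps have already been warped onto a shared set of depth planes (the homography warp and the subsequent 3D CNN are omitted):

```python
import numpy as np

def variance_cost_volume(warped_feats):
    """Cost volume as the feature variance across source views.

    warped_feats: (V, D, H, W, C) features of V views warped onto D depth planes.
    Returns:      (D, H, W, C)    low variance = views agree = plausible depth.
    """
    mean = warped_feats.mean(axis=0)
    return ((warped_feats - mean) ** 2).mean(axis=0)

# Toy usage: 3 input views, 64 depth planes, 32x32 feature maps, 8 channels.
cost = variance_cost_volume(np.random.randn(3, 64, 32, 32, 8))
# MVSNeRF further processes such a volume with a 3D CNN before reconstructing the
# radiance field; that stage is not modeled here.
```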

94 citations


Proceedings ArticleDOI
Gernot Riegler, Vladlen Koltun
01 Jun 2021
TL;DR: The core of Stable View Synthesis (SVS) as discussed by the authors is view-dependent on-surface feature aggregation, in which directional feature vectors at each 3D point are processed to produce a new feature vector for a ray that maps this point into the new target view.
Abstract: We present Stable View Synthesis (SVS). Given a set of source images depicting a scene from freely distributed viewpoints, SVS synthesizes new views of the scene. The method operates on a geometric scaffold computed via structure-from-motion and multi-view stereo. Each point on this 3D scaffold is associated with view rays and corresponding feature vectors that encode the appearance of this point in the input images. The core of SVS is view-dependent on-surface feature aggregation, in which directional feature vectors at each 3D point are processed to produce a new feature vector for a ray that maps this point into the new target view. The target view is then rendered by a convolutional network from a tensor of features synthesized in this way for all pixels. The method is composed of differentiable modules and is trained end-to-end. It supports spatially-varying view-dependent importance weighting and feature transformation of source images at each point; spatial and temporal stability due to the smooth dependence of on-surface feature aggregation on the target view; and synthesis of view-dependent effects such as specular reflection. Experimental results demonstrate that SVS outperforms state-of-the-art view synthesis methods both quantitatively and qualitatively on three diverse real-world datasets, achieving unprecedented levels of realism in free-viewpoint video of challenging large-scale scenes. Code is available at https://github.com/intel-isl/StableViewSynthesis
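The on-surface aggregation turns the set of directional feature vectors at a surface point into one feature for the target ray. As a hand-crafted stand-in for the learned aggregation in SVS, the sketch below weights each source feature by a softmax over the cosine between its viewing direction and the target direction; treat it as an illustration of the interface, not the paper's network.

```python
import numpy as np

def aggregate_on_surface(feats, src_dirs, tgt_dir, temperature=0.1):
    """Blend per-source-view features of one surface point for a target ray.

    feats:    (V, C) feature of the point as seen in each source image
    src_dirs: (V, 3) unit viewing directions associated with those features
    tgt_dir:  (3,)   unit viewing direction of the target ray
    """
    cos = src_dirs @ tgt_dir                       # alignment with the target ray
    w = np.exp(cos / temperature)
    w /= w.sum()                                   # softmax weights
    return (w[:, None] * feats).sum(axis=0)        # feature handed to the rendering CNN

# Source views looking along the target direction dominate the blended feature.
f = aggregate_on_surface(np.random.randn(5, 16),
                         np.eye(3)[[0, 1, 2, 0, 1]],        # toy unit directions
                         np.array([1.0, 0.0, 0.0]))
```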

76 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: In this paper, the authors propose a multi-task architecture in which image synthesis and retrieval are considered jointly, which can bias their network to learn latent feature representations that are useful for retrieval if they utilize them to generate images across two input domains.
Abstract: The goal of cross-view image based geo-localization is to determine the location of a given street view image by matching it against a collection of geo-tagged satellite images. This task is notoriously challenging due to the drastic viewpoint and appearance differences between the two domains. We show that we can address this discrepancy explicitly by learning to synthesize realistic street views from satellite inputs. Following this observation, we propose a novel multi-task architecture in which image synthesis and retrieval are considered jointly. The rationale behind this is that we can bias our network to learn latent feature representations that are useful for retrieval if we utilize them to generate images across the two input domains. To the best of our knowledge, ours is the first approach that creates realistic street views from satellite images and localizes the corresponding query street-view simultaneously in an end-to-end manner. In our experiments, we obtain state-of-the-art performance on the CVUSA and CVACT benchmarks. Finally, we show compelling qualitative results for satellite-to-street view synthesis.

70 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: Automatic integration as discussed by the authors is a new framework for learning efficient, closed-form solutions to integrals using coordinate-based neural networks, which can improve render times by more than 10× with a tradeoff of reduced image quality.
Abstract: Numerical integration is a foundational technique in scientific computing and is at the core of many computer vision applications. Among these applications, neural volume rendering has recently been proposed as a new paradigm for view synthesis, achieving photorealistic image quality. However, a fundamental obstacle to making these methods practical is the extreme computational and memory requirements caused by the required volume integrations along the rendered rays during training and inference. Millions of rays, each requiring hundreds of forward passes through a neural network are needed to approximate those integrations with Monte Carlo sampling. Here, we propose automatic integration, a new framework for learning efficient, closed-form solutions to integrals using coordinate-based neural networks. For training, we instantiate the computational graph corresponding to the derivative of the coordinate-based network. The graph is fitted to the signal to integrate. After optimization, we reassemble the graph to obtain a network that represents the antiderivative. By the fundamental theorem of calculus, this enables the calculation of any definite integral in two evaluations of the network. Applying this approach to neural rendering, we improve a tradeoff between rendering speed and image quality: improving render times by greater than 10× with a tradeoff of reduced image quality.
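The core trick (train the derivative of a coordinate network, then evaluate its antiderivative twice per definite integral) can be reproduced with autograd in a few lines. Below is a minimal PyTorch sketch on a 1D toy integrand rather than the full neural rendering setting.

```python
import torch

# Antiderivative network F(t); its derivative dF/dt is what gets fitted to the signal.
F = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(),
                        torch.nn.Linear(64, 64), torch.nn.Tanh(),
                        torch.nn.Linear(64, 1))

def dF(t):
    """Evaluate dF/dt with autograd -- the 'grad network' used during training."""
    t = t.requires_grad_(True)
    y = F(t)
    return torch.autograd.grad(y, t, torch.ones_like(y), create_graph=True)[0]

signal = lambda t: torch.sin(t)                     # toy integrand
opt = torch.optim.Adam(F.parameters(), lr=1e-3)
for _ in range(2000):                               # fit the derivative to the signal
    t = torch.rand(256, 1) * 4.0
    loss = ((dF(t) - signal(t)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Fundamental theorem of calculus: any definite integral now costs two evaluations of F.
a, b = torch.zeros(1, 1), torch.full((1, 1), 2.0)
estimate = (F(b) - F(a)).item()                     # should land near 1 - cos(2) ≈ 1.416
```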

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, the authors propose to represent multi-object dynamic scenes as learned scene graphs that encode object transformations and radiance, allowing them to efficiently render novel arrangements and views of the scene.
Abstract: Recent implicit neural rendering methods have demonstrated that it is possible to learn accurate view synthesis for complex scenes by predicting their volumetric density and color supervised solely by a set of RGB images. However, existing methods are restricted to learning efficient representations of static scenes that encode all scene objects into a single neural network, and they lack the ability to represent dynamic scenes and decompose scenes into individual objects. In this work, we present the first neural rendering method that represents multi-object dynamic scenes as scene graphs. We propose a learned scene graph representation, which encodes object transformations and radiance, allowing us to efficiently render novel arrangements and views of the scene. To this end, we learn implicitly encoded scenes, combined with a jointly learned latent representation to describe similar objects with a single implicit function. We assess the proposed method on synthetic and real automotive data, validating that our approach learns dynamic scenes – only by observing a video of this scene – and allows for rendering novel photo-realistic views of novel scene compositions with unseen sets of objects at unseen poses.

Journal ArticleDOI
TL;DR: A new large-scale visual localization method targeted at indoor spaces, together with a new challenging dataset on which it significantly outperforms current state-of-the-art indoor localization approaches.
Abstract: We seek to predict the 6 degree-of-freedom (6DoF) pose of a query photograph with respect to a large indoor 3D map. The contributions of this work are three-fold. First, we develop a new large-scale visual localization method targeted for indoor spaces. The method proceeds along three steps: (i) efficient retrieval of candidate poses that scales to large-scale environments, (ii) pose estimation using dense matching rather than sparse local features to deal with weakly textured indoor scenes, and (iii) pose verification by virtual view synthesis that is robust to significant changes in viewpoint, scene layout, and occlusion. Second, we release a new dataset with reference 6DoF poses for large-scale indoor localization. Query photographs are captured by mobile phones at a different time than the reference 3D map, thus presenting a realistic indoor localization scenario. Third, we demonstrate that our method significantly outperforms current state-of-the-art indoor localization approaches on this new challenging data. Code and data are publicly available.

Proceedings Article
01 Jan 2021
TL;DR: DietNeRF as discussed by the authors introduces an auxiliary semantic consistency loss that encourages realistic renderings at novel poses; it is trained on individual scenes to correctly render given input views from the same pose and to match high-level semantic attributes across different, random poses.
Abstract: We present DietNeRF, a 3D neural scene representation estimated from a few images. Neural Radiance Fields (NeRF) learn a continuous volumetric representation of a scene through multi-view consistency, and can be rendered from novel viewpoints by ray casting. While NeRF has an impressive ability to reconstruct geometry and fine details given many images, up to 100 for challenging 360° scenes, it often finds a degenerate solution to its image reconstruction objective when only a few input views are available. To improve few-shot quality, we propose DietNeRF. We introduce an auxiliary semantic consistency loss that encourages realistic renderings at novel poses. DietNeRF is trained on individual scenes to (1) correctly render given input views from the same pose, and (2) match high-level semantic attributes across different, random poses. Our semantic loss allows us to supervise DietNeRF from arbitrary poses. We extract these semantics using a pre-trained visual encoder such as CLIP, a Vision Transformer trained on hundreds of millions of diverse single-view, 2D photographs mined from the web with natural language supervision. In experiments, DietNeRF improves the perceptual quality of few-shot view synthesis when learned from scratch, can render novel views with as few as one observed image when pre-trained on a multi-view dataset, and produces plausible completions of completely unobserved regions.
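The auxiliary loss compares high-level embeddings of a rendering from an arbitrary pose against embeddings of an observed view, so supervision exists even where no pixel-wise ground truth does. A hedged sketch of such a loss follows; `encoder` stands in for a frozen pre-trained image encoder such as CLIP's vision tower, which is not loaded here, and the toy encoder at the bottom is purely illustrative.

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(encoder, rendered, observed):
    """Cosine loss between embeddings of a novel-pose rendering and an observed
    view; the two images need not share pixel-wise correspondence."""
    with torch.no_grad():                                 # target embedding is fixed
        target = F.normalize(encoder(observed.unsqueeze(0)), dim=-1)
    pred = F.normalize(encoder(rendered.unsqueeze(0)), dim=-1)
    return 1.0 - (pred * target).sum(dim=-1).mean()

# Toy usage with a hypothetical embedding network standing in for a frozen CLIP encoder.
toy_encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
loss = semantic_consistency_loss(toy_encoder, torch.rand(3, 64, 64), torch.rand(3, 64, 64))
```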

Journal ArticleDOI
TL;DR: In this article, a semi-parametric approach is proposed to learn a neural representation of the light transport of a scene that is embedded in a texture atlas of known but possibly rough geometry.
Abstract: The light transport (LT) of a scene describes how it appears under different lighting conditions from different viewing directions, and complete knowledge of a scene’s LT enables the synthesis of novel views under arbitrary lighting. In this article, we focus on image-based LT acquisition, primarily for human bodies within a light stage setup. We propose a semi-parametric approach for learning a neural representation of the LT that is embedded in a texture atlas of known but possibly rough geometry. We model all non-diffuse and global LT as residuals added to a physically based diffuse base rendering. In particular, we show how to fuse previously seen observations of illuminants and views to synthesize a new image of the same scene under a desired lighting condition from a chosen viewpoint. This strategy allows the network to learn complex material effects (such as subsurface scattering) and global illumination (such as diffuse interreflection), while guaranteeing the physical correctness of the diffuse LT (such as hard shadows). With this learned LT, one can relight the scene photorealistically with a directional light or an HDRI map, synthesize novel views with view-dependent effects, or do both simultaneously, all in a unified framework using a set of sparse observations. Qualitative and quantitative experiments demonstrate that our Neural Light Transport (NLT) outperforms state-of-the-art solutions for relighting and view synthesis, without requiring separate treatments for both problems that prior work requires. The code and data are available at http://nlt.csail.mit.edu.

Proceedings Article
01 Jan 2021
TL;DR: The authors propose a scene-guided training strategy to resolve the 3D space ambiguity in occluded regions and learn sharp boundaries for each object; their system not only achieves competitive performance for static-scene novel view synthesis, but also produces realistic rendering for object-level editing.
Abstract: Implicit neural rendering techniques have shown promising results for novel view synthesis. However, existing methods usually encode the entire scene as a whole, which is generally not aware of the object identity and limits the ability to the high-level editing tasks such as moving or adding furniture. In this paper, we present a novel neural scene rendering system, which learns an object-compositional neural radiance field and produces realistic rendering with editing capability for a clustered and real-world scene. Specifically, we design a novel two-pathway architecture, in which the scene branch encodes the scene geometry and appearance, and the object branch encodes each standalone object conditioned on learnable object activation codes. To survive the training in heavily cluttered scenes, we propose a scene-guided training strategy to solve the 3D space ambiguity in the occluded regions and learn sharp boundaries for each object. Extensive experiments demonstrate that our system not only achieves competitive performance for static scene novel-view synthesis, but also produces realistic rendering for object-level editing.

Journal ArticleDOI
TL;DR: A Two-stream Attention Network (TSAN)-based synthesized-view quality enhancement method is proposed for 3D High Efficiency Video Coding (3D-HEVC) in this article and achieves significantly better performance than state-of-the-art methods.
Abstract: In three-dimensional video system, the texture and depth videos are jointly encoded, and then the Depth Image Based Rendering (DIBR) is utilized to realize view synthesis. However, the compression distortion of texture and depth videos, as well as the disocclusion problem in DIBR degrade the visual quality of the synthesized view. To address this problem, a Two-stream Attention Network (TSAN)-based synthesized view quality enhancement method is proposed for 3D-High Efficiency Video Coding (3D-HEVC) in this paper. First, the shortcomings of the view synthesis technique and traditional convolutional neural networks are analyzed. Then, based on these analyses, a TSAN with two information extraction streams is proposed for enhancing the quality of the synthesized view, in which the global information extraction stream learns the contextual information, and the local information extraction stream extracts the texture information from the rendered image. Third, a Multi-Scale Residual Attention Block (MSRAB) is proposed, which can efficiently detect features in different scales, and adaptively refine features by considering interdependencies among spatial dimensions. Extensive experimental results show that the proposed synthesized view quality enhancement method achieves significantly better performance than the state-of-the-art methods.

Posted Content
TL;DR: In this article, depth measurements are incorporated into the radiance field formulation to produce more detailed and complete reconstructions than methods based on either color or depth data alone; the differentiable volume rendering loss is especially beneficial for learning the signed distance field in regions with missing depth measurements.
Abstract: In this work, we explore how to leverage the success of implicit novel view synthesis methods for surface reconstruction. Methods which learn a neural radiance field have shown amazing image synthesis results, but the underlying geometry representation is only a coarse approximation of the real geometry. We demonstrate how depth measurements can be incorporated into the radiance field formulation to produce more detailed and complete reconstruction results than using methods based on either color or depth data alone. In contrast to a density field as the underlying geometry representation, we propose to learn a deep neural network which stores a truncated signed distance field. Using this representation, we show that one can still leverage differentiable volume rendering to estimate color values of the observed images during training to compute a reconstruction loss. This is beneficial for learning the signed distance field in regions with missing depth measurements. Furthermore, we correct misalignment errors of the camera, improving the overall reconstruction quality. In several experiments, we showcase our method and compare to existing works on classical RGB-D fusion and learned representations.
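One concrete way depth measurements can enter such a formulation is as a supervision term on the depth expected under the volume rendering weights, next to the usual color loss. The sketch below shows that generic combination (not the paper's exact truncated-SDF objective), assuming the per-sample weights have already been computed by the renderer, e.g. as in the compositing sketch near the top of this page.

```python
import numpy as np

def rgbd_ray_loss(weights, t_vals, colors, gt_rgb, gt_depth):
    """Combine a photometric loss with a depth loss on the expected ray depth.

    weights:  (N,)   volume rendering weights of the samples along one ray
    t_vals:   (N,)   sample distances along the ray
    colors:   (N, 3) per-sample RGB
    gt_rgb:   (3,)   observed pixel color
    gt_depth: float or np.nan  sensor depth for this ray (NaN where missing)
    """
    rgb = (weights[:, None] * colors).sum(axis=0)
    color_loss = float(((rgb - gt_rgb) ** 2).mean())
    exp_depth = float((weights * t_vals).sum())
    # Rays without a sensor reading fall back to photometric supervision only.
    depth_loss = 0.0 if np.isnan(gt_depth) else abs(exp_depth - gt_depth)
    return color_loss + 0.1 * depth_loss      # 0.1 is an arbitrary example weight
```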

Proceedings Article
01 Jan 2021
TL;DR: MINE as mentioned in this paper predicts a 4-channel image (RGB and volume density) at arbitrary depth values to jointly reconstruct the camera frustum and fill in occluded contents, which can then be easily rendered into novel RGB or depth views using differentiable rendering.
Abstract: In this paper, we propose MINE to perform novel view synthesis and depth estimation via dense 3D reconstruction from a single image. Our approach is a continuous depth generalization of the Multiplane Images (MPI) by introducing the NEural radiance fields (NeRF). Given a single image as input, MINE predicts a 4-channel image (RGB and volume density) at arbitrary depth values to jointly reconstruct the camera frustum and fill in occluded contents. The reconstructed and inpainted frustum can then be easily rendered into novel RGB or depth views using differentiable rendering. Extensive experiments on RealEstate10K, KITTI and Flowers Light Fields show that our MINE outperforms state-of-the-art by a large margin in novel view synthesis. We also achieve competitive results in depth estimation on iBims-1 and NYU-v2 without annotated depth supervision. Our source code is available at this https URL
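Rendering a multiplane-style representation reduces to "over"-compositing the predicted RGB-alpha planes from front to back once they are warped into the target view. A simplified sketch of that compositing step, with the warp omitted and the per-plane predictions assumed given:

```python
import numpy as np

def composite_planes(rgba_planes):
    """Front-to-back 'over' compositing of RGB-alpha planes (near plane first).

    rgba_planes: (D, H, W, 4) per-plane RGB and alpha, e.g. predicted at D sampled
                 depth values and already warped into the target view.
    Returns:     (H, W, 3) composited image.
    """
    out = np.zeros(rgba_planes.shape[1:3] + (3,))
    trans = np.ones(rgba_planes.shape[1:3] + (1,))      # accumulated transmittance
    for plane in rgba_planes:                           # near to far
        rgb, alpha = plane[..., :3], plane[..., 3:4]
        out += trans * alpha * rgb
        trans *= (1.0 - alpha)
    return out

img = composite_planes(np.random.rand(32, 48, 64, 4))   # toy: 32 planes, 48x64 pixels
```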

Journal ArticleDOI
TL;DR: This paper proposes a differentiable point-based splatting pipeline based on a bi-directional Elliptical Weighted Average (EWA) solution, which can be applied to multi-view harmonization and stylization in addition to novel view synthesis.
Abstract: There has recently been great interest in neural rendering methods. Some approaches use 3D geometry reconstructed with Multi-View Stereo (MVS) but cannot recover from the errors of this process, while others directly learn a volumetric neural representation, but suffer from expensive training and inference. We introduce a general approach that is initialized with MVS, but allows further optimization of scene properties in the space of input views, including depth and reprojected features, resulting in improved novel view synthesis. A key element of our approach is a differentiable point-based splatting pipeline, based on our bi-directional Elliptical Weighted Average solution. To further improve quality and efficiency of our point-based method, we introduce a probabilistic depth test and efficient camera selection. We use these elements together in our neural renderer, allowing us to achieve a good compromise between quality and speed. Our pipeline can be applied to multi-view harmonization and stylization in addition to novel view synthesis.

Posted Content
TL;DR: In this paper, the authors propose an end-to-end framework, termed NeRF--, for training NeRF models given only RGB images, without pre-computed camera parameters.
Abstract: This paper tackles the problem of novel view synthesis (NVS) from 2D images without known camera poses and intrinsics. Among various NVS techniques, Neural Radiance Field (NeRF) has recently gained popularity due to its remarkable synthesis quality. Existing NeRF-based approaches assume that the camera parameters associated with each input image are either directly accessible at training, or can be accurately estimated with conventional techniques based on correspondences, such as Structure-from-Motion. In this work, we propose an end-to-end framework, termed NeRF--, for training NeRF models given only RGB images, without pre-computed camera parameters. Specifically, we show that the camera parameters, including both intrinsics and extrinsics, can be automatically discovered via joint optimisation during the training of the NeRF model. On the standard LLFF benchmark, our model achieves comparable novel view synthesis results compared to the baseline trained with COLMAP pre-computed camera parameters. We also conduct extensive analyses to understand the model behaviour under different camera trajectories, and show that in scenarios where COLMAP fails, our model still produces robust results.
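The essential change relative to a standard NeRF training loop is that camera parameters become trainable tensors updated by the same photometric loss. A minimal PyTorch sketch of that part, assuming axis-angle rotations, a shared focal length, and an external NeRF and ray sampler that are not shown:

```python
import torch

n_images = 20
# Per-image extrinsics (axis-angle + translation) and a shared focal length, all
# optimised jointly with the NeRF weights instead of being pre-computed.
rot_vecs = torch.nn.Parameter(1e-3 * torch.randn(n_images, 3))   # near-identity init
trans = torch.nn.Parameter(torch.zeros(n_images, 3))
log_focal = torch.nn.Parameter(torch.zeros(1))

def axis_angle_to_matrix(v, eps=1e-8):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = v.norm() + eps
    k = v / theta
    zero = torch.zeros((), dtype=v.dtype)
    K = torch.stack([torch.stack([zero, -k[2], k[1]]),
                     torch.stack([k[2], zero, -k[0]]),
                     torch.stack([-k[1], k[0], zero])])
    return torch.eye(3) + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)

# These parameters would simply be added to the optimiser alongside the NeRF, e.g.
# torch.optim.Adam([{'params': nerf.parameters(), 'lr': 5e-4},
#                   {'params': [rot_vecs, trans, log_focal], 'lr': 1e-3}])
R0 = axis_angle_to_matrix(rot_vecs[0])   # pose of image 0, refined as training runs
```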

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper propose a learning-based framework to synthesize novel views and reconstruct densely-sampled LFs from sparsely-sampled LFs, in which details in novel views are explored from the spatial and angular domains.
Abstract: Due to hardware restriction, it is costly to capture densely-sampled Light Fields (LFs) with high angular and spatial resolution, which becomes the main bottleneck of LFs development. In this paper, we propose a learning-based framework to synthesize novel views and reconstruct densely-sampled LFs from sparsely-sampled LFs. In the proposed framework, micro-lens image stacks and view image stacks are separately grouped, in which details in novel views are explored from spatial and angular domain. The two kinds of stacks contain epipolar information and 3D convolution layers are employed to effectively extract features that include structure information. Moreover, an innovative way is proposed to synthesize views by upsampling micro-lens image stacks using deconvolution layers. The parameters in decovolution layers provide the view position information and different interpolation and extrapolation tasks can be explicitly modeled. It is validated that this view synthesis module can be embedded in different frameworks and improve the related performances. Without precise depth estimation and view warping, the proposed method is mainly designed for reconstructing LFs with small baselines. Related experimental results show that the proposed model outperforms other state-of-the-art methods in terms of both visual and numerical evaluations. Furthermore, the consistency between synthesized views and the intrinsic structure information is well preserved in the proposed method.

Posted Content
TL;DR: The authors propose a framework integrated with more reliable supervision guided by semantic co-segmentation and data augmentation, which extracts mutual semantics from multi-view images to guide semantic consistency.
Abstract: Recent studies have witnessed that self-supervised methods based on view synthesis obtain clear progress on multi-view stereo (MVS). However, existing methods rely on the assumption that the corresponding points among different views share the same color, which may not always be true in practice. This may lead to unreliable self-supervised signal and harm the final reconstruction performance. To address the issue, we propose a framework integrated with more reliable supervision guided by semantic co-segmentation and data-augmentation. Specially, we excavate mutual semantic from multi-view images to guide the semantic consistency. And we devise effective data-augmentation mechanism which ensures the transformation robustness by treating the prediction of regular samples as pseudo ground truth to regularize the prediction of augmented samples. Experimental results on DTU dataset show that our proposed methods achieve the state-of-the-art performance among unsupervised methods, and even compete on par with supervised methods. Furthermore, extensive experiments on Tanks&Temples dataset demonstrate the effective generalization ability of the proposed method.

Proceedings ArticleDOI
01 Jun 2021
TL;DR: The proposed Shadow Neural Radiance Field (S-NeRF) methodology not only performs novel view synthesis and full 3D shape estimation, it also enables shadow detection, albedo synthesis, and transient object filtering, without any explicit shape supervision.
Abstract: We present a new generic method for shadow-aware multi-view satellite photogrammetry of Earth Observation scenes. Our proposed method, the Shadow Neural Radiance Field (S-NeRF) follows recent advances in implicit volumetric representation learning. For each scene, we train S-NeRF using very high spatial resolution optical images taken from known viewing angles. The learning requires no labels or shape priors: it is self-supervised by an image reconstruction loss. To accommodate for changing light source conditions both from a directional light source (the Sun) and a diffuse light source (the sky), we extend the NeRF approach in two ways. First, direct illumination from the Sun is modeled via a local light source visibility field. Second, indirect illumination from a diffuse light source is learned as a non-local color field as a function of the position of the Sun. Quantitatively, the combination of these factors reduces the altitude and color errors in shaded areas, compared to NeRF. The S-NeRF methodology not only performs novel view synthesis and full 3D shape estimation, it also enables shadow detection, albedo synthesis, and transient object filtering, without any explicit shape supervision.

Journal ArticleDOI
TL;DR: In this paper, a learning-based approach is proposed to synthesize the view from an arbitrary camera position given a sparse set of images by jointly modeling the epipolar property and occlusion in designing a convolutional neural network.
Abstract: This paper presents a learning-based approach to synthesize the view from an arbitrary camera position given a sparse set of images. A key challenge for this novel view synthesis arises from the reconstruction process, when the views from different input images may not be consistent due to obstruction in the light path. We overcome this by jointly modeling the epipolar property and occlusion in designing a convolutional neural network. We start by defining and computing the aperture disparity map, which approximates the parallax and measures the pixel-wise shift between two views. While this relates to free-space rendering and can fail near the object boundaries, we further develop a warping confidence map to address pixel occlusion in these challenging regions. The proposed method is evaluated on diverse real-world and synthetic light field scenes, and it shows better performance over several state-of-the-art techniques.


Proceedings Article
13 Apr 2021
TL;DR: In this article, a Bundle-Adjusting Neural Radiance Fields (BARF) is proposed for training NeRF from imperfect (or even unknown) camera poses, which enables view synthesis and localization of video sequences from unknown camera poses.
Abstract: Neural Radiance Fields (NeRF) have recently gained a surge of interest within the computer vision community for its power to synthesize photorealistic novel views of real-world scenes. One limitation of NeRF, however, is its requirement of accurate camera poses to learn the scene representations. In this paper, we propose Bundle-Adjusting Neural Radiance Fields (BARF) for training NeRF from imperfect (or even unknown) camera poses -- the joint problem of learning neural 3D representations and registering camera frames. We establish a theoretical connection to classical image alignment and show that coarse-to-fine registration is also applicable to NeRF. Furthermore, we show that naively applying positional encoding in NeRF has a negative impact on registration with a synthesis-based objective. Experiments on synthetic and real-world data show that BARF can effectively optimize the neural scene representations and resolve large camera pose misalignment at the same time. This enables view synthesis and localization of video sequences from unknown camera poses, opening up new avenues for visual localization systems (e.g. SLAM) and potential applications for dense 3D mapping and reconstruction.
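The coarse-to-fine registration is realized by progressively enabling the higher-frequency bands of the positional encoding as training proceeds. Below is a sketch of that frequency-windowed encoding following the schedule described in the BARF paper; the exact frequency scaling is a common NeRF-style choice and should be treated as an assumption.

```python
import numpy as np

def windowed_positional_encoding(x, num_freqs, alpha):
    """BARF-style coarse-to-fine positional encoding.

    num_freqs: number of frequency bands L
    alpha:     annealing parameter in [0, L], raised from 0 to L over training so
               that low frequencies are used first and high ones fade in smoothly.
    """
    x = np.atleast_1d(x)
    feats = [x]
    for k in range(num_freqs):
        # Window weight for band k: 0 before it is scheduled, cosine ease-in, then 1.
        w = 0.5 * (1.0 - np.cos(np.pi * np.clip(alpha - k, 0.0, 1.0)))
        freq = (2.0 ** k) * np.pi
        feats.append(w * np.sin(freq * x))
        feats.append(w * np.cos(freq * x))
    return np.concatenate(feats)

enc_early = windowed_positional_encoding(0.3, num_freqs=10, alpha=2.0)    # mostly coarse bands
enc_late = windowed_positional_encoding(0.3, num_freqs=10, alpha=10.0)    # all bands active
```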

Journal ArticleDOI
TL;DR: In this article, a view synthesis method is proposed to provide immersive free navigation with 6 degrees of freedom in real-time for natural and virtual scenery, for both static and dynamic content, which can take any number of input views with their corresponding depth maps as priors.
Abstract: This paper presents a novel approach to provide immersive free navigation with 6 Degrees of Freedom in real-time for natural and virtual scenery, for both static and dynamic content. Stemming from the state-of-the-art in Depth Image-Based Rendering and the OpenGL pipeline, this new View Synthesis method achieves free navigation at up to 90 FPS and can take any number of input views with their corresponding depth maps as priors. Video content can be played thanks to GPU decompression, supporting free navigation with full parallax in real-time. To render a novel viewpoint, each selected input view is warped using the camera pose and associated depth map, using an implicit 3D representation. The warped views are then blended all together to generate the chosen virtual view. Various view blending approaches specifically designed to avoid visual artifacts are compared. Using as few as four input views appears to be an optimal trade-off between computation time and quality, allowing to synthesize high-quality stereoscopic views in real-time, offering a genuine immersive virtual reality experience. Additionally, the proposed approach provides high-quality rendering of a 3D scenery on holographic light field displays. Our results are comparable - objectively and subjectively - to the state of the art view synthesis tools NeRF and LLFF, while maintaining an overall lower complexity and real-time rendering.
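At the heart of such Depth Image-Based Rendering is the warp that lifts a source pixel to 3D using its depth and reprojects it into the virtual camera; blending several warped views then covers disocclusions. A minimal sketch of the per-pixel warp under a pinhole model (blending and hole filling are omitted):

```python
import numpy as np

def warp_pixel(u, v, depth, K_src, pose_src, K_dst, pose_dst):
    """Reproject a source pixel (u, v) with known depth into a target camera.

    K_*:    (3, 3) pinhole intrinsics
    pose_*: (4, 4) camera-to-world transforms of the source and target cameras
    Returns (u', v') pixel coordinates in the target view.
    """
    # Unproject to a 3D point in the source camera frame, then to world space.
    p_cam = depth * (np.linalg.inv(K_src) @ np.array([u, v, 1.0]))
    p_world = pose_src[:3, :3] @ p_cam + pose_src[:3, 3]
    # World space into the target camera frame, then project back to pixels.
    R, t = pose_dst[:3, :3], pose_dst[:3, 3]
    q = R.T @ (p_world - t)
    uvw = K_dst @ q
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

# Sanity check: identical cameras map a pixel back onto itself.
K = np.diag([500.0, 500.0, 1.0]); eye = np.eye(4)
assert np.allclose(warp_pixel(320.0, 240.0, 2.0, K, eye, K, eye), (320.0, 240.0))
```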

Journal ArticleDOI
TL;DR: This paper proposes to drastically reduce the number of input images to four images with depth maps in any pose, in order to create the missing images with depth image-based rendering.
Abstract: To computer-generate high-quality holographic stereograms, a huge number of images must be provided: several hundred for a horizontal parallax and the square of this number for a full parallax. In this paper, we propose to drastically reduce this number to four input images with depth maps (or equivalently, four groups of neighboring images used to compute a depth map) in any pose, in order to create the missing images with depth image-based rendering. We evaluate the view synthesis method objectively before providing visual results of the corresponding holographic stereograms. We believe this method outperforms shearlet-based approaches in objective view synthesis quality metrics and in the number of required input images (7×7).

Journal ArticleDOI
Wen Liu, Zhixin Piao, Zhi Tu, Wenhan Luo, Lin Ma, Shenghua Gao
TL;DR: Liu et al. as discussed in this paper propose an Attentional Liquid Warping GAN with Attentional Liquid Warping Block (AttLWB) that propagates the source information in both image and feature spaces to the synthesized reference.
Abstract: We tackle human image synthesis, including motion imitation, appearance transfer, and novel view synthesis, within a unified framework. The model, once being trained, can be used to handle all these tasks. We propose to use a 3D body mesh recovery module to disentangle the pose and shape. It can not only model the joint location and rotation but also characterize the personalized body shape. To preserve the source information, such as texture, style, color, and face identity, we propose an Attentional Liquid Warping GAN with Attentional Liquid Warping Block (AttLWB) that propagates the source information in both image and feature spaces to the synthesized reference. Specifically, the source features are extracted by a denoising convolutional auto-encoder for characterizing the source identity well. Our proposed method can support a more flexible warping from multiple sources. To further improve the generalization ability of the unseen source images, a one/few-shot adversarial learning is applied in a self-supervised way to generate high-resolution 512 x 512 and 1024 x 1024 results. Also, we build a new dataset, namely iPER, for the evaluation of these three tasks. Extensive experiments demonstrate the effectiveness of our methods in terms of preserving face identity, shape consistency, and clothes details.