
Showing papers on "View synthesis" published in 2023


Posted ContentDOI
TL;DR: In this article, a 4D dynamic neural scene representation is presented that captures full-body appearance in motion from multi-view video input and enables playback from novel, unseen viewpoints.
Abstract: Representing human performance at high-fidelity is an essential building block in diverse applications, such as film production, computer games or videoconferencing. To close the gap to production-level quality, we introduce HumanRF, a 4D dynamic neural scene representation that captures full-body appearance in motion from multi-view video input, and enables playback from novel, unseen viewpoints. Our novel representation acts as a dynamic video encoding that captures fine details at high compression rates by factorizing space-time into a temporal matrix-vector decomposition. This allows us to obtain temporally coherent reconstructions of human actors for long sequences, while representing high-resolution details even in the context of challenging motion. While most research focuses on synthesizing at resolutions of 4MP or lower, we address the challenge of operating at 12MP. To this end, we introduce ActorsHQ, a novel multi-view dataset that provides 12MP footage from 160 cameras for 16 sequences with high-fidelity, per-frame mesh reconstructions. We demonstrate challenges that emerge from using such high-resolution data and show that our newly introduced HumanRF effectively leverages this data, making a significant step towards production-level quality novel view synthesis.

4 citations
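The space-time factorization described above can be illustrated with a toy low-rank decomposition of a 4D feature grid; the rank, resolution, and the query_features helper below are illustrative assumptions, not the HumanRF implementation.

```python
import numpy as np

# Hypothetical low-rank space-time feature volume: a feature at (x, y, z, t)
# is reconstructed from R rank-1 components, each pairing a spatial grid with
# a temporal vector (a matrix-vector style decomposition).
R, T = 8, 300                 # rank and number of frames (illustrative)
res = 64                      # spatial grid resolution (illustrative)

spatial_factors = np.random.randn(R, res, res, res).astype(np.float32)  # per-rank 3D grids
temporal_factors = np.random.randn(R, T).astype(np.float32)             # per-rank time curves

def query_features(ix, iy, iz, t):
    """Reconstruct a feature value at integer grid index (ix, iy, iz) and frame t
    by summing rank-1 space-time products (trilinear interpolation omitted)."""
    return np.sum(spatial_factors[:, ix, iy, iz] * temporal_factors[:, t])

# The full 4D grid would hold res**3 * T values; the factors store only
# R * (res**3 + T), which is where the high compression rate comes from.
print(query_features(10, 20, 30, 42))
```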


Proceedings ArticleDOI
01 Jan 2023
TL;DR: Control-NeRF as discussed by the authors is a method for performing flexible, 3D-aware image content manipulation while enabling high-quality novel view synthesis from a set of posed input images.
Abstract: We present Control-NeRF, a method for performing flexible, 3D-aware image content manipulation while enabling high-quality novel view synthesis from a set of posed input images. NeRF-based approaches [23] are effective for novel view synthesis; however, such models memorize the radiance for every point in a scene within a neural network. Since these models are scene-specific and lack a 3D scene representation, classical editing, such as shape manipulation or combining scenes, is not possible. While there are some recent hybrid approaches that combine NeRF with external scene representations such as sparse voxels, planes, hash tables, etc. [16], [5], [24], [9], they focus mostly on efficiency and don't explore the scene editing and manipulation capabilities of hybrid approaches. With the aim of exploring controllable scene representations for novel view synthesis, our model couples learnt scene-specific 3D feature volumes with a general NeRF rendering network. We can generalize to novel scenes by optimizing only the scene-specific 3D feature volume, while keeping the parameters of the rendering network fixed. Since the feature volumes are independent of the rendering model, we can manipulate and combine scenes by editing their corresponding feature volumes. The edited volume can then be plugged into the rendering model to synthesize high-quality novel views. We demonstrate scene manipulations including scene mixing; applying rigid and non-rigid transformations; and inserting, moving and deleting objects in a scene, all while producing photo-realistic novel-view synthesis results.

2 citations
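To make the split between scene-specific feature volumes and a shared rendering network concrete, here is a hedged PyTorch-style sketch; the volume resolution, feature size, and MLP layout are assumptions for illustration rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedRenderer(nn.Module):
    """Rendering MLP shared across scenes: maps a sampled feature (+ view dir)
    to RGB and density. Sizes are illustrative."""
    def __init__(self, feat_dim=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 4),            # RGB + density
        )

    def forward(self, feats, view_dirs):
        return self.mlp(torch.cat([feats, view_dirs], dim=-1))

# One learnable feature volume per scene; editing or mixing scenes amounts to
# editing or combining these volumes while the renderer stays fixed.
scene_volume = nn.Parameter(torch.randn(1, 16, 64, 64, 64))      # (N, C, D, H, W)
renderer = SharedRenderer(feat_dim=16)

points = torch.rand(1, 1, 1, 1024, 3) * 2 - 1                     # query points in [-1, 1]^3
feats = F.grid_sample(scene_volume, points, align_corners=True)   # (1, 16, 1, 1, 1024)
feats = feats.squeeze(2).squeeze(2).permute(0, 2, 1)              # (1, 1024, 16)
dirs = F.normalize(torch.randn(1, 1024, 3), dim=-1)
rgb_sigma = renderer(feats, dirs)                                 # (1, 1024, 4)
```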


Proceedings ArticleDOI
01 Jan 2023
TL;DR: Zhang et al. as mentioned in this paper propose to leverage both the global and local features to form an expressive 3D representation, and train a multi-layer perceptron (MLP) network conditioned on the learned 3D representations to perform volume rendering.
Abstract: Although neural radiance fields (NeRF) have shown impressive advances in novel view synthesis, most methods require multiple input images of the same scene with accurate camera poses. In this work, we seek to substantially reduce the inputs to a single unposed image. Existing approaches using local image features to reconstruct a 3D object often render blurry predictions at viewpoints distant from the source view. To address this, we propose to leverage both the global and local features to form an expressive 3D representation. The global features are learned from a vision transformer, while the local features are extracted from a 2D convolutional network. To synthesize a novel view, we train a multi-layer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering. This novel 3D representation allows the network to reconstruct unseen regions without enforcing constraints like symmetry or canonical coordinate systems. Our method renders novel views from just a single input image, and generalizes across multiple object categories using a single model. Quantitative and qualitative evaluations demonstrate that the proposed method achieves state-of-the-art performance and renders richer details than existing approaches. https://cseweb.ucsd.edu/%7eviscomp/projects/VisionNeRF/

1 citation
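A rough sketch of conditioning a NeRF-style MLP on a concatenation of a global (transformer) feature and a local (CNN) feature, in the spirit of the approach above; the feature dimensions and network shape are stand-ins, and the feature extractors themselves are omitted.

```python
import torch
import torch.nn as nn

class ConditionedNeRF(nn.Module):
    """Illustrative MLP predicting (RGB, density) for a 3D point conditioned on
    a global image feature and a local feature sampled at the point's
    projection in the source view."""
    def __init__(self, d_global=256, d_local=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + d_global + d_local, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 4),
        )

    def forward(self, xyz, global_feat, local_feat):
        # xyz: (B, N, 3); global_feat: (B, d_global); local_feat: (B, N, d_local)
        g = global_feat.unsqueeze(1).expand(-1, xyz.shape[1], -1)
        return self.net(torch.cat([xyz, g, local_feat], dim=-1))

model = ConditionedNeRF()
out = model(torch.rand(2, 128, 3), torch.rand(2, 256), torch.rand(2, 128, 64))
print(out.shape)   # torch.Size([2, 128, 4])
```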


Proceedings ArticleDOI
04 Jun 2023
TL;DR: In this article, a template-free method is proposed to reconstruct surface geometry and appearance using neural implicit representations from multi-view videos, which leverages topology-aware deformation and the signed distance field to learn complex dynamic surfaces.
Abstract: Reconstructing general dynamic scenes is important for many computer vision and graphics applications. Recent works represent the dynamic scene with neural radiance fields for photorealistic view synthesis, while their surface geometry is under-constrained and noisy. Other works introduce surface constraints to the implicit neural representation to disentangle the ambiguity of geometry and appearance field for static scene reconstruction. To bridge the gap between rendering dynamic scenes and recovering static surface geometry, we propose a template-free method to reconstruct surface geometry and appearance using neural implicit representations from multi-view videos. We leverage topology-aware deformation and the signed distance field to learn complex dynamic surfaces via differentiable volume rendering without scene-specific prior knowledge like template models. Furthermore, we propose a novel mask-based ray selection strategy to significantly boost the optimization on challenging time-varying regions. Experiments on different multi-view video datasets demonstrate that our method achieves high-fidelity surface reconstruction as well as photorealistic novel view synthesis.

1 citation


Proceedings ArticleDOI
07 Jun 2023
TL;DR: In this article, a layered synthesis algorithm is proposed that combines offline and online sources of colour and depth information to increase the quality of the synthesized image; foreground objects are synthesized using online depth and colour information.
Abstract: In this work we present a novel approach for the real-time generation of synthetic views for free-viewpoint video. Our system is based on purely passive stereo cameras which, under the constraints of real-time operation, yield unreliable depth maps, especially in the background areas. To solve this issue we propose a layered synthesis algorithm that combines offline and online sources of colour and depth information to increase the quality of the synthesized image. Foreground objects are synthesized using online depth and colour information. The background, by contrast, is assumed to be stationary, so its geometry can be reconstructed offline. Since foreground objects still affect the colour of the background, we carefully mix online and offline colour information when rendering it. Finally, foreground and background are combined to form the final image.
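At its core, the layering described above reduces to mask-based compositing of an online foreground over a background mixed from offline and online colour; the sketch below shows only that step, with the foreground mask assumed to come from the online depth and the blend weight chosen arbitrarily.

```python
import numpy as np

def composite_view(fg_color, fg_mask, bg_offline, bg_online, blend=0.5):
    """Hypothetical compositing step for a layered DIBR pipeline.

    fg_color:   online-warped foreground colours, (H, W, 3)
    fg_mask:    1 where the online depth marks a foreground pixel, (H, W)
    bg_offline: background colours rendered from the offline reconstruction
    bg_online:  background colours from the live cameras (captures shadows etc.)
    blend:      assumed mixing weight between offline and online background
    """
    background = blend * bg_offline + (1.0 - blend) * bg_online
    mask = fg_mask[..., None].astype(np.float32)
    return mask * fg_color + (1.0 - mask) * background

H, W = 480, 640
img = composite_view(np.zeros((H, W, 3)), np.zeros((H, W)),
                     np.ones((H, W, 3)), np.ones((H, W, 3)))
```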

Posted ContentDOI
23 Mar 2023
TL;DR: Zhang et al. as mentioned in this paper propose a geometry-guided attention mechanism to register the appearance code from multi-view 2D images to a geometry proxy which can alleviate the misalignment between inaccurate geometry prior and pixel space.
Abstract: In this work, we focus on synthesizing high-fidelity novel view images for arbitrary human performers, given a set of sparse multi-view images. It is a challenging task due to the large variation among articulated body poses and heavy self-occlusions. To alleviate this, we introduce an effective generalizable framework, Generalizable Model-based Neural Radiance Fields (GM-NeRF), to synthesize free-viewpoint images. Specifically, we propose a geometry-guided attention mechanism to register the appearance code from multi-view 2D images to a geometry proxy, which can alleviate the misalignment between the inaccurate geometry prior and pixel space. On top of that, we further conduct neural rendering and partial gradient backpropagation for efficient perceptual supervision and improvement of the perceptual quality of synthesis. To evaluate our method, we conduct experiments on the synthesized datasets THuman2.0 and Multi-garment, and the real-world datasets Genebody and ZJUMocap. The results demonstrate that our approach outperforms state-of-the-art methods in terms of novel view synthesis and geometric reconstruction.

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed an end-to-end as-deformable-as-possible (ADAP) single-image-based view synthesis solution without a depth prior.
Abstract: Depth-image-based rendering (DIBR) technologies have been widely employed to synthesize novel realistic views from a single image in 3D video applications. However, DIBR-oriented approaches heavily rely on the accuracy of depth maps, usually requiring ground-truth depth as a prior. Even then, extensive floating-point precision losses and invalid holes may appear in the synthesized view due to warping error and occlusion. In this paper, we propose an end-to-end as-deformable-as-possible (ADAP) single-image-based view synthesis solution without a depth prior. It addresses the above issues through two stages: alignment and reconstruction. We first transform the input image to the latent feature space and then reconstruct the novel view in the image domain. In the first stage, the input image is deformed to align with the synthesized view at the feature level. To this end, we propose an ADAP alignment mechanism that progresses from pixel-level warping to error-level quantization to feature-level alignment, progressively improving the deformation capability for handling challenging motion conditions in real-world scenes. In the second stage, we exploit an occlusion-aware reconstruction module to recover content details from the deformed feature at the pixel level. Extensive experiments demonstrate that our alignment-reconstruction approach is robust to depth map quality. Even with a coarsely estimated depth map, our solution outperforms other SoTA schemes on the popular benchmarks.

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a novel deep gradual-conversion and cycle network (DGCC-Net) for single-view synthesis by jointly considering the gradual and cycle synthesis between source and target views.
Abstract: With the popular application of convolutional neural networks in computational intelligence, deep learning-based view synthesis has become an active research topic. Although promising performance has been achieved by existing learning-based view synthesis methods, obtaining a clearer target view in the single-view synthesis task is still a challenging problem. In this paper, we propose a novel deep gradual-conversion and cycle network (DGCC-Net) for single-view synthesis by jointly considering the gradual and cycle synthesis between source and target views. Specifically, a gradual conversion mechanism is designed to synthesize a clearer target view in a gradual manner, which learns the progressive rotation trend from the source to the target view by introducing an intermediate transformation. Based on the synthesized target view, a cycle synthesis mechanism is designed to further promote the learning of the single-view synthesis network by mapping the synthesized target back to the source view. By utilizing the proposed gradual conversion and cycle synthesis mechanisms, the whole network achieves a cycle view synthesis mapping between source and target views to obtain a better target view. Experiments on widely used datasets indicate that the proposed DGCC-Net exceeds state-of-the-art methods.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a neural radiance field (NeRF)-based framework for high-quality view synthesis using only a sparse set of RGB-D images, which can be easily captured using cameras and LiDAR sensors on current consumer devices.
Abstract: The recently proposed neural radiance fields (NeRF) use a continuous function formulated as a multi-layer perceptron (MLP) to model the appearance and geometry of a 3D scene. This enables realistic synthesis of novel views, even for scenes with view dependent appearance. Many follow-up works have since extended NeRFs in different ways. However, a fundamental restriction of the method remains that it requires a large number of images captured from densely placed viewpoints for high-quality synthesis and the quality of the results quickly degrades when the number of captured views is insufficient. To address this problem, we propose a novel NeRF-based framework capable of high-quality view synthesis using only a sparse set of RGB-D images, which can be easily captured using cameras and LiDAR sensors on current consumer devices. First, a geometric proxy of the scene is reconstructed from the captured RGB-D images. Renderings of the reconstructed scene along with precise camera parameters can then be used to pre-train a network. Finally, the network is fine-tuned with a small number of real captured images. We further introduce a patch discriminator to supervise the network under novel views during fine-tuning, as well as a 3D color prior to improve synthesis quality. We demonstrate that our method can generate arbitrary novel views of a 3D scene from as few as 6 RGB-D images. Extensive experiments show the improvements of our method compared with the existing NeRF-based methods, including approaches that also aim to reduce the number of input images.

Posted ContentDOI
20 Mar 2023
TL;DR: Zhang et al. as mentioned in this paper proposed geometry-aware contrastive learning to learn multi-view consistent features with geometric constraints, and adopted cross-view attention to enhance the geometry perception of features by querying features across input views.
Abstract: Although many recent works have investigated generalizable NeRF-based novel view synthesis for unseen scenes, they seldom consider the synthetic-to-real generalization, which is desired in many practical applications. In this work, we first investigate the effects of synthetic data in synthetic-to-real novel view synthesis and surprisingly observe that models trained with synthetic data tend to produce sharper but less accurate volume densities. For pixels where the volume densities are correct, fine-grained details will be obtained. Otherwise, severe artifacts will be produced. To maintain the advantages of using synthetic data while avoiding its negative effects, we propose to introduce geometry-aware contrastive learning to learn multi-view consistent features with geometric constraints. Meanwhile, we adopt cross-view attention to further enhance the geometry perception of features by querying features across input views. Experiments demonstrate that under the synthetic-to-real setting, our method can render images with higher quality and better fine-grained details, outperforming existing generalizable novel view synthesis methods in terms of PSNR, SSIM, and LPIPS. When trained on real data, our method also achieves state-of-the-art results.

Proceedings ArticleDOI
04 Jun 2023
TL;DR: Wang et al. as discussed by the authors proposed a 3D-aware encoder for GAN inversion and face editing based on the powerful StyleNeRF model, which combines a parametric 3D face model with a learnable detail representation model to generate geometry, texture and view direction codes.
Abstract: GAN inversion has been exploited in many face manipulation tasks, but 2D GANs often fail to generate multi-view 3D consistent images. The encoders designed for 2D GANs are not able to provide sufficient 3D information for the inversion and editing. Therefore, 3D-aware GAN inversion is proposed to increase the 3D editing capability of GANs. However, the 3D-aware GAN inversion remains under-explored. To tackle this problem, we propose a 3D-aware (3Da) encoder for GAN inversion and face editing based on the powerful StyleNeRF model. Our proposed 3Da encoder combines a parametric 3D face model with a learnable detail representation model to generate geometry, texture and view direction codes. For more flexible face manipulation, we then design a dual-branch StyleFlow module to transfer the StyleNeRF codes with disentangled geometry and texture flows. Extensive experiments demonstrate that we realize 3D consistent face manipulation in both facial attribute editing and texture transfer. Furthermore, for video editing, we make the sequence of frame codes share a common canonical manifold, which improves the temporal consistency of the edited attributes.

Posted ContentDOI
23 Mar 2023
TL;DR: LipRF as discussed by the authors uses Lipschitz mapping to transform the appearance representation of a pre-trained neural radiance field with the help of 2D style transfer to generate visually consistent and photorealistic stylized scenes from novel views.
Abstract: Recent advances in 3D scene representation and novel view synthesis have witnessed the rise of Neural Radiance Fields (NeRFs). Nevertheless, it is not trivial to exploit NeRF for the photorealistic 3D scene stylization task, which aims to generate visually consistent and photorealistic stylized scenes from novel views. Simply coupling NeRF with photorealistic style transfer (PST) will result in cross-view inconsistency and degradation of stylized view syntheses. Through a thorough analysis, we demonstrate that this non-trivial task can be simplified in a new light: When transforming the appearance representation of a pre-trained NeRF with Lipschitz mapping, the consistency and photorealism across source views will be seamlessly encoded into the syntheses. That motivates us to build a concise and flexible learning framework, namely LipRF, which upgrades arbitrary 2D PST methods with Lipschitz mapping tailored for the 3D scene. Technically, LipRF first pre-trains a radiance field to reconstruct the 3D scene, and then emulates the style on each view by 2D PST as the prior to learn a Lipschitz network to stylize the pre-trained appearance. Given that the Lipschitz condition strongly impacts the expressivity of the neural network, we devise an adaptive regularization to balance the reconstruction and stylization. A gradual gradient aggregation strategy is further introduced to optimize LipRF in a cost-efficient manner. We conduct extensive experiments to show the high quality and robust performance of LipRF on both photorealistic 3D stylization and object appearance editing.
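As a rough illustration of a Lipschitz-constrained mapping, the layer below caps its Lipschitz constant by rescaling the weight matrix whenever its spectral norm exceeds a bound; the bound value and layer stack are assumptions and not the LipRF parameterization.

```python
import torch
import torch.nn as nn

class LipschitzLinear(nn.Module):
    """Linear layer whose spectral norm is clamped so the layer is at most
    `lip`-Lipschitz (illustrative; not the LipRF parameterization)."""
    def __init__(self, in_dim, out_dim, lip=1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.lip = lip

    def forward(self, x):
        sigma = torch.linalg.matrix_norm(self.weight, ord=2)     # spectral norm
        scale = torch.clamp(self.lip / (sigma + 1e-8), max=1.0)  # only shrink, never grow
        return x @ (self.weight * scale).t() + self.bias

# Stacking such layers (ReLU is 1-Lipschitz) bounds the whole mapping's
# Lipschitz constant by the product of the per-layer bounds, which is what
# keeps the stylized appearance structurally close to the pre-trained one.
net = nn.Sequential(LipschitzLinear(32, 64), nn.ReLU(), LipschitzLinear(64, 3))
colors = net(torch.rand(1024, 32))
```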

Posted ContentDOI
29 Mar 2023
TL;DR: In this article, the complementary behavior of multi-view stereo and monocular depth is exploited to obtain better per-view scene depth for nearby and far points, respectively, and camera poses are jointly refined with image-based rendering via multiple rotation averaging graph optimization.
Abstract: We introduce an approach to enhance novel view synthesis from images taken with a freely moving camera. The introduced approach focuses on outdoor scenes, where recovering an accurate geometric scaffold and camera poses is challenging, leading to inferior results with the state-of-the-art stable view synthesis (SVS) method. SVS and related methods fail for outdoor scenes primarily due to (i) over-relying on multiview stereo (MVS) for geometric scaffold recovery and (ii) assuming COLMAP-computed camera poses as the best possible estimates, despite it being well studied that MVS 3D reconstruction accuracy is limited by scene disparity and camera-pose accuracy is sensitive to key-point correspondence selection. This work proposes a principled way to enhance novel view synthesis solutions, drawing inspiration from the basics of multiple view geometry. By leveraging the complementary behavior of MVS and monocular depth, we arrive at a better per-view scene depth for nearby and far points, respectively. Moreover, our approach jointly refines camera poses with image-based rendering via multiple rotation averaging graph optimization. The recovered scene depth and camera poses enable better view-dependent on-surface feature aggregation over the entire scene. Extensive evaluation of our approach on popular benchmark datasets, such as Tanks and Temples, shows substantial improvement in view synthesis results compared to the prior art. For instance, our method shows a 1.5 dB PSNR improvement on Tanks and Temples. Similar statistics are observed when tested on other benchmark datasets such as FVS, Mip-NeRF 360, and DTU.

Posted ContentDOI
28 Feb 2023
TL;DR: In this paper, the authors present a method for reconstructing high-quality meshes of large unbounded real-world scenes suitable for photorealistic novel view synthesis using a hybrid neural volume-surface scene representation designed to have well-behaved level sets that correspond to surfaces in the scene.
Abstract: We present a method for reconstructing high-quality meshes of large unbounded real-world scenes suitable for photorealistic novel view synthesis. We first optimize a hybrid neural volume-surface scene representation designed to have well-behaved level sets that correspond to surfaces in the scene. We then bake this representation into a high-quality triangle mesh, which we equip with a simple and fast view-dependent appearance model based on spherical Gaussians. Finally, we optimize this baked representation to best reproduce the captured viewpoints, resulting in a model that can leverage accelerated polygon rasterization pipelines for real-time view synthesis on commodity hardware. Our approach outperforms previous scene representations for real-time rendering in terms of accuracy, speed, and power consumption, and produces high quality meshes that enable applications such as appearance editing and physical simulation.
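The view-dependent appearance model mentioned above can be illustrated by evaluating a few spherical Gaussian lobes per view direction; the lobe count and parameter layout below are assumptions rather than the paper's exact formulation.

```python
import numpy as np

def sg_color(view_dir, lobe_axes, lobe_sharpness, lobe_colors, diffuse):
    """Evaluate a simple spherical-Gaussian appearance model for one view
    direction (illustrative shapes, not the paper's exact parameterization).

    view_dir:       unit 3-vector
    lobe_axes:      (K, 3) unit lobe directions
    lobe_sharpness: (K,) positive sharpness values
    lobe_colors:    (K, 3) per-lobe RGB amplitudes
    diffuse:        (3,) view-independent base color
    """
    cos = lobe_axes @ view_dir                      # (K,)
    weights = np.exp(lobe_sharpness * (cos - 1.0))  # spherical Gaussian lobes
    return diffuse + weights @ lobe_colors          # (3,)

K = 3
axes = np.random.randn(K, 3)
axes /= np.linalg.norm(axes, axis=1, keepdims=True)
c = sg_color(np.array([0.0, 0.0, 1.0]), axes, np.full(K, 8.0),
             np.random.rand(K, 3) * 0.2, np.array([0.4, 0.4, 0.4]))
```

Because these per-vertex lobe parameters are baked into the mesh, evaluating them per pixel fits naturally into a rasterization shader, which is what enables real-time rendering on commodity hardware.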

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a deep model to predict the quality of view synthesis based on Curriculum-style Structure Generation without conducting the DIBR process, where a structure generation network is first built to learn the structure of the new view from the original one by curriculum-style training.
Abstract: Existing quality metrics of view synthesis usually perform on synthesized images, which are produced based on a computationally expensive depth-image-based rendering (DIBR) process. Moreover, current metrics quantify quality by extracting hand-crafted features, which may fail to fully capture the complex distortion characteristics. With the success of deep learning on numerous computer vision tasks, it has become possible to utilize convolutional neural networks to predict the quality of DIBR-synthesized images. In this letter, we propose a deep model to predict the quality of view synthesis based on Curriculum-style Structure Generation without conducting the DIBR process. Specifically, considering that the distortion of view synthesis is mainly manifested in the destruction of image structure, a structure generation network is first built to learn the structure of the new view from the original one by curriculum-style training. Then, we transfer the prior knowledge learned from the last phase into the quality prediction network for measuring the structure distortion, based on which a regressor is introduced to produce the quality score. Experimental results prove the advantages of the proposed model.

Posted ContentDOI
18 May 2023
TL;DR: In this article, the authors propose ConsistentNeRF, a method that leverages depth information to regularize both multi-view and single-view 3D consistency among pixels.
Abstract: Neural Radiance Fields (NeRF) has demonstrated remarkable 3D reconstruction capabilities with dense view images. However, its performance significantly deteriorates under sparse view settings. We observe that learning the 3D consistency of pixels among different views is crucial for improving reconstruction quality in such cases. In this paper, we propose ConsistentNeRF, a method that leverages depth information to regularize both multi-view and single-view 3D consistency among pixels. Specifically, ConsistentNeRF employs depth-derived geometry information and a depth-invariant loss to concentrate on pixels that exhibit 3D correspondence and maintain consistent depth relationships. Extensive experiments on recent representative works reveal that our approach can considerably enhance model performance in sparse view conditions, achieving improvements of up to 94% in PSNR, 76% in SSIM, and 31% in LPIPS compared to the vanilla baselines across various benchmarks, including DTU, NeRF Synthetic, and LLFF.
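The abstract mentions a depth-invariant loss; one common way to compare depth maps while ignoring absolute scale is to normalize each map by a median shift and mean scale before taking an L1 difference, sketched below as an assumed stand-in rather than the paper's exact loss.

```python
import torch

def scale_invariant_depth_loss(pred_depth, ref_depth, eps=1e-6):
    """Compare two depth maps after removing a per-map median shift and mean
    scale, so only the depth *relationships* between pixels are penalized
    (illustrative stand-in for a depth-invariant loss)."""
    def normalize(d):
        t = d.median()
        s = (d - t).abs().mean() + eps
        return (d - t) / s
    return (normalize(pred_depth) - normalize(ref_depth)).abs().mean()

loss = scale_invariant_depth_loss(torch.rand(64, 64) * 5.0, torch.rand(64, 64) * 2.0)
```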

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper used joint information from multi-view panoramic videos to further explore the pixel correlation, and obtained sub-pixels from different reference views in the virtual view by performing multiview three-dimensional image warping.


Proceedings ArticleDOI
25 Mar 2023
TL;DR: In this article, the authors evaluate the quality of depth maps from plenoptic 2.0 cameras for DIBR applications by simulating a multi-plenoptic camera array and synthesizing virtual images based on the obtained color and depth maps.
Abstract: We aim to evaluate the quality of depth maps from plenoptic 2.0 cameras for DIBR applications by simulating a multi-plenoptic camera array and synthesizing virtual images based on the obtained color and depth maps. The depth maps generated in real time by the plenoptic cameras are compared with offline, high-quality depth maps estimated by the MPEG-I Depth Estimation Reference Software (DERS). Results show that the virtual images synthesized from RGBD plenoptic cameras have acceptable quality for real-time DIBR applications, although lower than when using the depth estimated by DERS, due to a loss of inter-view depth consistency.

Posted ContentDOI
17 Apr 2023
TL;DR: In this paper, a multi-view transformer encoder is proposed for novel view synthesis given only a single wide-baseline stereo image pair, and an image-space epipolar line sampling scheme is used to assemble image features for a target ray.
Abstract: We introduce a method for novel view synthesis given only a single wide-baseline stereo image pair. In this challenging regime, 3D scene points are regularly observed only once, requiring prior-based reconstruction of scene geometry and appearance. We find that existing approaches to novel view synthesis from sparse observations fail due to recovering incorrect 3D geometry and due to the high cost of differentiable rendering that precludes their scaling to large-scale training. We take a step towards resolving these shortcomings by formulating a multi-view transformer encoder, proposing an efficient, image-space epipolar line sampling scheme to assemble image features for a target ray, and a lightweight cross-attention-based renderer. Our contributions enable training of our method on a large-scale real-world dataset of indoor and outdoor scenes. We demonstrate that our method learns powerful multi-view geometry priors while reducing the rendering time. We conduct extensive comparisons on held-out test scenes across two real-world datasets, significantly outperforming prior work on novel view synthesis from sparse image observations and achieving multi-view-consistent novel view synthesis.
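A stripped-down version of image-space epipolar line sampling: given a fundamental matrix relating the target and source views, feature locations are sampled along the epipolar line of a target pixel; the fixed sample count and the fundamental-matrix input are illustrative assumptions.

```python
import numpy as np

def sample_epipolar_line(F, target_pixel, width, height, num_samples=32):
    """Return pixel locations along the epipolar line l = F @ x in the source
    image for a target pixel x (homogeneous), keeping only in-bounds samples.
    Simplified: samples uniformly in u and solves the line equation for v."""
    x = np.array([target_pixel[0], target_pixel[1], 1.0])
    a, b, c = F @ x                               # line: a*u + b*v + c = 0
    us = np.linspace(0, width - 1, num_samples)
    vs = -(a * us + c) / (b + 1e-8)
    valid = (vs >= 0) & (vs < height)
    return np.stack([us[valid], vs[valid]], axis=1)

F = np.random.randn(3, 3)
pts = sample_epipolar_line(F, (120.0, 200.0), 640, 480)
# Image features gathered at `pts` would then be fed to a cross-attention renderer.
```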

Journal ArticleDOI
01 Feb 2023-Sensors
TL;DR: In this article, a light field synthesis algorithm is proposed that uses focal stack images and the all-in-focus image to synthesize a 9 × 9 sub-aperture view light field image.
Abstract: Light field reconstruction and synthesis algorithms are essential for improving the lower spatial resolution for hand-held plenoptic cameras. Previous light field synthesis algorithms produce blurred regions around depth discontinuities, especially for stereo-based algorithms, where no information is available to fill the occluded areas in the light field image. In this paper, we propose a light field synthesis algorithm that uses the focal stack images and the all-in-focus image to synthesize a 9 × 9 sub-aperture view light field image. Our approach uses depth from defocus to estimate a depth map. Then, we use the depth map and the all-in-focus image to synthesize the sub-aperture views, and their corresponding depth maps by mimicking the apparent shifting of the central image according to the depth values. We handle the occluded regions in the synthesized sub-aperture views by filling them with the information recovered from the focal stack images. We also show that, if the depth levels in the image are known, we can synthesize a high-accuracy light field image with just five focal stack images. The accuracy of our approach is compared with three state-of-the-art algorithms: one non-learning and two CNN-based approaches, and the results show that our algorithm outperforms all three in terms of PSNR and SSIM metrics.
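A toy version of the shift-and-sample idea described above: each sub-aperture view is produced by shifting the all-in-focus image per pixel according to the depth map and the view's angular offset; integer-pixel shifts and nearest-neighbour sampling are simplifications, and the focal-stack occlusion filling is omitted.

```python
import numpy as np

def synthesize_subaperture(all_in_focus, depth, du, dv, shift_scale=1.0):
    """Warp the central all-in-focus image to the sub-aperture view at angular
    offset (du, dv) by a depth-dependent integer shift (toy illustration;
    occlusion filling from the focal stack is omitted)."""
    H, W, _ = all_in_focus.shape
    out = np.zeros_like(all_in_focus)
    ys, xs = np.mgrid[0:H, 0:W]
    sx = np.clip(xs + np.round(shift_scale * du * depth).astype(int), 0, W - 1)
    sy = np.clip(ys + np.round(shift_scale * dv * depth).astype(int), 0, H - 1)
    out[ys, xs] = all_in_focus[sy, sx]
    return out

img = np.random.rand(64, 64, 3)
depth = np.random.rand(64, 64) * 4
view = synthesize_subaperture(img, depth, du=2, dv=-1)
```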

Posted ContentDOI
11 Jan 2023
TL;DR: Geometry-biased Transformers as discussed by the authors incorporate geometric inductive biases in the set-latent representation-based inference to encourage multi-view geometric consistency, and induce the geometric bias by augmenting the dot-product attention mechanism to also incorporate 3D distances between rays associated with tokens as a learnable bias.
Abstract: We tackle the task of synthesizing novel views of an object given a few input images and associated camera viewpoints. Our work is inspired by recent 'geometry-free' approaches where multi-view images are encoded as a (global) set-latent representation, which is then used to predict the color for arbitrary query rays. While this representation yields (coarsely) accurate images corresponding to novel viewpoints, the lack of geometric reasoning limits the quality of these outputs. To overcome this limitation, we propose 'Geometry-biased Transformers' (GBTs) that incorporate geometric inductive biases in the set-latent representation-based inference to encourage multi-view geometric consistency. We induce the geometric bias by augmenting the dot-product attention mechanism to also incorporate 3D distances between rays associated with tokens as a learnable bias. We find that this, along with camera-aware embeddings as input, allows our models to generate significantly more accurate outputs. We validate our approach on the real-world CO3D dataset, where we train our system over 10 categories and evaluate its view-synthesis ability for novel objects as well as unseen categories. We empirically validate the benefits of the proposed geometric biases and show that our approach significantly improves over prior works.
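The geometric bias described above can be written as an additive term on the attention logits proportional to the pairwise 3D distance between the rays behind each token; below is a hedged single-head sketch with a learnable scalar bias weight, not the GBT architecture itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryBiasedAttention(nn.Module):
    """Single-head dot-product attention whose logits are offset by a learned
    multiple of pairwise ray distances (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.bias_weight = nn.Parameter(torch.tensor(-1.0))  # learnable bias scale
        self.scale = dim ** -0.5

    def forward(self, tokens, ray_dists):
        # tokens: (B, N, dim); ray_dists: (B, N, N) pairwise 3D distances between rays
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)
        logits = (q @ k.transpose(-1, -2)) * self.scale + self.bias_weight * ray_dists
        return F.softmax(logits, dim=-1) @ v

attn = GeometryBiasedAttention(64)
out = attn(torch.rand(2, 100, 64), torch.rand(2, 100, 100))
```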

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a multiview-video-based framework to exploit correlations in color across different viewpoints of plenoptic point clouds, which can provide a BD-rate reduction of over 70%.
Abstract: Plenoptic point clouds are more complete representations of three-dimensional (3-D) objects than single-color point clouds, as they can have multiple colors per spatial point, representing colors of each point as seen from different view angles. They are more realistic but also involve a larger volume of data in need of compression. Therefore, in this paper, a multiview-video-based framework is proposed to better exploit correlations in color across different viewpoints. To the best of the authors' knowledge, this is the first work to exploit correlations in color across different viewpoints using a multiview-video-based framework. In addition, it is observed that some unoccupied pixels, which do not have corresponding points in plenoptic point clouds and are of no use to the quality of the reconstructed plenoptic point cloud colors, may cost many bits. To address this problem, a block-based group smoothing and a combined occupancy-map-based rate distortion optimization and four-neighbor average residual padding are further proposed to reduce the bit cost of unoccupied color pixels. The proposed algorithms are implemented in the Moving Picture Experts Group (MPEG) video-based point cloud compression (V-PCC) and multiview extension of High Efficiency Video Coding (MV-HEVC) reference software. The experimental results show that the proposed algorithms can lead to a Bjontegaard Delta bitrate (BD-rate) reduction of 40% compared with the RAHT-KLT. Compared with the V-PCC independently applied to each view direction, the proposed algorithms can provide a BD-rate reduction of over 70%.

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a novel occlusion-aware source sampler (OSS) module which efficiently transfers the pixels of source views to the target view's frustum.
Abstract: Novel view synthesis has attracted tremendous research attention recently for its applications in virtual reality and immersive telepresence. Rendering a locally immersive light field (LF) based on arbitrary large baseline RGB references is a challenging problem that lacks efficient solutions with existing novel view synthesis techniques. In this work, we aim at truthfully rendering local immersive novel views/LF images based on large baseline LF captures and a single RGB image in the target view. To fully explore the precious information from source LF captures, we propose a novel occlusion-aware source sampler (OSS) module which efficiently transfers the pixels of source views to the target view’s frustum in an occlusion-aware manner. An attention-based deep visual fusion module is proposed to fuse the revealed occluded background content with a preliminary LF into a final refined LF. The proposed source sampling and fusion mechanism not only helps to provide information for occluded regions from varying observation angles, but also proves to be able to effectively enhance the visual rendering quality. Experimental results show that our proposed method is able to render high-quality LF images/novel views with sparse RGB references and outperforms state-of-the-art LF rendering and novel view synthesis methods.

Posted ContentDOI
17 Feb 2023
TL;DR: MixNeRF as mentioned in this paper estimates the joint distribution of RGB colors along the ray samples by modeling it with a mixture of distributions and remodels the colors with regenerated blending weights based on the estimated ray depth.
Abstract: Neural Radiance Field (NeRF) has broken new ground in novel view synthesis due to its simple concept and state-of-the-art quality. However, it suffers from severe performance degradation unless trained with a dense set of images with different camera poses, which hinders its practical applications. Although previous methods addressing this problem achieved promising results, they relied heavily on additional training resources, which goes against the philosophy of sparse-input novel view synthesis and its pursuit of training efficiency. In this work, we propose MixNeRF, an effective training strategy for novel view synthesis from sparse inputs that models a ray with a mixture density model. Our MixNeRF estimates the joint distribution of RGB colors along the ray samples by modeling it with a mixture of distributions. We also propose a new task of ray depth estimation as a useful training objective, which is highly correlated with 3D scene geometry. Moreover, we remodel the colors with regenerated blending weights based on the estimated ray depth, further improving the robustness to colors and viewpoints. Our MixNeRF outperforms other state-of-the-art methods on various standard benchmarks with superior efficiency of training and inference.
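To make the mixture-of-distributions idea concrete, the sketch below scores a pixel's ground-truth colour under a mixture whose components are centred at the per-sample colours and weighted by the volume-rendering blending weights; using fixed-scale Gaussian components is an assumption, not the paper's exact likelihood model.

```python
import torch

def mixture_rgb_nll(sample_colors, blend_weights, gt_rgb, sigma=0.1):
    """Negative log-likelihood of the ground-truth pixel colour under a mixture
    whose components are centred at each ray sample's colour and weighted by
    the volume-rendering blending weights (fixed-scale Gaussian components are
    an illustrative choice).

    sample_colors: (N_rays, N_samples, 3)
    blend_weights: (N_rays, N_samples), nonnegative, roughly summing to 1
    gt_rgb:        (N_rays, 3)
    """
    diff = gt_rgb.unsqueeze(1) - sample_colors                     # (R, S, 3)
    log_comp = -0.5 * (diff ** 2).sum(-1) / sigma**2 \
               - 3 * torch.log(torch.tensor(sigma) * (2 * torch.pi) ** 0.5)
    log_mix = torch.logsumexp(torch.log(blend_weights + 1e-8) + log_comp, dim=1)
    return -log_mix.mean()

loss = mixture_rgb_nll(torch.rand(16, 64, 3),
                       torch.softmax(torch.rand(16, 64), -1),
                       torch.rand(16, 3))
```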

Posted ContentDOI
20 Apr 2023
TL;DR: In this article, an autoregressive conditional diffusion-based model is proposed that interpolates visible scene elements and extrapolates unobserved regions in a view in a geometrically consistent manner.
Abstract: Novel view synthesis from a single input image is a challenging task, where the goal is to generate a new view of a scene from a desired camera pose that may be separated by a large motion. The highly uncertain nature of this synthesis task due to unobserved elements within the scene (i.e., occlusion) and outside the field-of-view makes the use of generative models appealing to capture the variety of possible outputs. In this paper, we propose a novel generative model which is capable of producing a sequence of photorealistic images consistent with a specified camera trajectory, and a single starting image. Our approach is centred on an autoregressive conditional diffusion-based model capable of interpolating visible scene elements, and extrapolating unobserved regions in a view, in a geometrically consistent manner. Conditioning is limited to an image capturing a single camera view and the (relative) pose of the new camera view. To measure the consistency over a sequence of generated views, we introduce a new metric, the thresholded symmetric epipolar distance (TSED), to measure the number of consistent frame pairs in a sequence. While previous methods have been shown to produce high quality images and consistent semantics across pairs of views, we show empirically with our metric that they are often inconsistent with the desired camera poses. In contrast, we demonstrate that our method produces both photorealistic and view-consistent imagery.
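The TSED metric builds on the symmetric epipolar distance between matched points in two generated views; the helper below computes that distance given a fundamental matrix, while the correspondence matching and the thresholding/counting steps are left out since the abstract describes them only at a high level.

```python
import numpy as np

def symmetric_epipolar_distance(F, pts1, pts2):
    """Symmetric epipolar distance for corresponding points pts1 <-> pts2,
    each (N, 2), under fundamental matrix F: point-to-epipolar-line distance
    measured in both images and summed."""
    def homog(p):
        return np.hstack([p, np.ones((len(p), 1))])
    x1, x2 = homog(pts1), homog(pts2)
    l2 = x1 @ F.T                                   # epipolar lines in image 2
    l1 = x2 @ F                                     # epipolar lines in image 1
    d2 = np.abs(np.sum(x2 * l2, axis=1)) / np.linalg.norm(l2[:, :2], axis=1)
    d1 = np.abs(np.sum(x1 * l1, axis=1)) / np.linalg.norm(l1[:, :2], axis=1)
    return d1 + d2

F = np.random.randn(3, 3)
d = symmetric_epipolar_distance(F, np.random.rand(10, 2) * 100,
                                np.random.rand(10, 2) * 100)
# TSED would then count frame pairs whose matches mostly fall below a distance threshold.
```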

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors presented PatchMatch Multi-View Stereo (PM-MVS), which is a highly accurate 3D reconstruction method that can be used in various environments.
Abstract: PatchMatch Stereo is a method for generating a depth map from stereo images by repeating spatial propagation and view propagation. The concept of PatchMatch Stereo can be easily extended to Multi-View Stereo (MVS). In this paper, we present PatchMatch Multi-View Stereo (PM-MVS), which is a highly accurate 3D reconstruction method that can be used in various environments. Three techniques are introduced to PM-MVS: (i) matching score evaluation, (ii) viewpoint selection, and (iii) outlier filtering. The combination of normalized cross-correlation with bilateral weights and geometric consistency between viewpoints is used to improve the estimation accuracy of depth and normal maps at object boundaries and poor-texture regions. For each pixel, viewpoints used for stereo matching are carefully selected in order to improve robustness against disturbances such as occlusion, noise, blur, and distortion. Outliers are removed from reconstructed 3D point clouds by a weighted median filter and consistency-based filters assuming multi-view geometry. Through a set of experiments using public multi-view image datasets, we demonstrate that the proposed method exhibits efficient performance compared with conventional methods.
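A compact version of the matching cost mentioned above: normalized cross-correlation between two patches with bilateral weights that down-weight pixels far from the patch centre spatially or in intensity; the weighting parameters are illustrative choices.

```python
import numpy as np

def bilateral_weighted_ncc(patch_ref, patch_src, sigma_s=3.0, sigma_c=0.2):
    """Weighted NCC between two grayscale patches (H, W). Weights combine
    spatial distance to the patch centre and intensity distance to the centre
    pixel of the reference patch (illustrative parameter choices)."""
    H, W = patch_ref.shape
    ys, xs = np.mgrid[0:H, 0:W]
    cy, cx = (H - 1) / 2, (W - 1) / 2
    w_spatial = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma_s ** 2))
    w_color = np.exp(-(patch_ref - patch_ref[int(cy), int(cx)]) ** 2 / (2 * sigma_c ** 2))
    w = (w_spatial * w_color).ravel()
    w /= w.sum()

    a, b = patch_ref.ravel(), patch_src.ravel()
    ma, mb = np.sum(w * a), np.sum(w * b)
    cov = np.sum(w * (a - ma) * (b - mb))
    va, vb = np.sum(w * (a - ma) ** 2), np.sum(w * (b - mb) ** 2)
    return cov / (np.sqrt(va * vb) + 1e-8)

score = bilateral_weighted_ncc(np.random.rand(11, 11), np.random.rand(11, 11))
```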

Posted ContentDOI
30 Mar 2023
TL;DR: Zhang et al. as discussed by the authors proposed a pose-guided diffusion model to generate a consistent long-term video of novel views from a single image by using epipolar lines as constraints to facilitate the association between different viewpoints.
Abstract: Novel view synthesis from a single image has been a cornerstone problem for many Virtual Reality applications that provide immersive experiences. However, most existing techniques can only synthesize novel views within a limited range of camera motion or fail to generate consistent and high-quality novel views under significant camera movement. In this work, we propose a pose-guided diffusion model to generate a consistent long-term video of novel views from a single image. We design an attention layer that uses epipolar lines as constraints to facilitate the association between different viewpoints. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of the proposed diffusion model against state-of-the-art transformer-based and GAN-based approaches.

Posted ContentDOI
11 Apr 2023
TL;DR: SfMNeRF as discussed by the authors employs the epipolar, photometric consistency, depth smoothness, and position-of-matches constraints to explicitly reconstruct the 3D-scene structure.
Abstract: With dense inputs, Neural Radiance Fields (NeRF) is able to render photo-realistic novel views under static conditions. Although the synthesis quality is excellent, existing NeRF-based methods fail to obtain moderate three-dimensional (3D) structures. The novel view synthesis quality drops dramatically given sparse input due to the implicitly reconstructed inaccurate 3D-scene structure. We propose SfMNeRF, a method to better synthesize novel views as well as reconstruct the 3D-scene geometry. SfMNeRF leverages the knowledge from the self-supervised depth estimation methods to constrain the 3D-scene geometry during view synthesis training. Specifically, SfMNeRF employs the epipolar, photometric consistency, depth smoothness, and position-of-matches constraints to explicitly reconstruct the 3D-scene structure. Through these explicit constraints and the implicit constraint from NeRF, our method improves the view synthesis as well as the 3D-scene geometry performance of NeRF at the same time. In addition, SfMNeRF synthesizes novel sub-pixels in which the ground truth is obtained by image interpolation. This strategy enables SfMNeRF to include more samples to improve generalization performance. Experiments on two public datasets demonstrate that SfMNeRF surpasses state-of-the-art approaches. Code is available at https://github.com/XTU-PR-LAB/SfMNeRF
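Two of the constraints listed above are standard in self-supervised depth estimation and easy to state in code: a photometric consistency term between a warped and a target view, and an edge-aware depth smoothness term; these are generic formulations used for illustration, not the paper's exact losses.

```python
import torch

def photometric_loss(warped, target):
    """L1 photometric consistency between a view warped into the target frame
    and the target image, both (B, 3, H, W)."""
    return (warped - target).abs().mean()

def edge_aware_smoothness(depth, image):
    """Edge-aware first-order depth smoothness: depth gradients are penalized
    less where the image has strong gradients.
    depth: (B, 1, H, W), image: (B, 3, H, W)."""
    dx_d = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    dy_d = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    dx_i = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

img = torch.rand(1, 3, 32, 32)
loss = photometric_loss(torch.rand(1, 3, 32, 32), img) \
     + 0.1 * edge_aware_smoothness(torch.rand(1, 1, 32, 32), img)
```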