Showing papers on "View synthesis published in 2019"


Journal ArticleDOI
TL;DR: This work proposes Neural Textures, learned feature maps that are trained as part of the scene capture process and can be utilized to coherently re-render or manipulate existing video content in both static and dynamic environments at real-time rates.
Abstract: The modern computer graphics pipeline can synthesize images at remarkable visual quality; however, it requires well-defined, high-quality 3D content as input. In this work, we explore the use of imperfect 3D content, for instance, obtained from photo-metric reconstructions with noisy and incomplete surface geometry, while still aiming to produce photo-realistic (re-)renderings. To address this challenging problem, we introduce Deferred Neural Rendering, a new paradigm for image synthesis that combines the traditional graphics pipeline with learnable components. Specifically, we propose Neural Textures, which are learned feature maps that are trained as part of the scene capture process. Similar to traditional textures, neural textures are stored as maps on top of 3D mesh proxies; however, the high-dimensional feature maps contain significantly more information, which can be interpreted by our new deferred neural rendering pipeline. Both neural textures and deferred neural renderer are trained end-to-end, enabling us to synthesize photo-realistic images even when the original 3D content was imperfect. In contrast to traditional, black-box 2D generative neural networks, our 3D representation gives us explicit control over the generated output, and allows for a wide range of application domains. For instance, we can synthesize temporally-consistent video re-renderings of recorded 3D scenes as our representation is inherently embedded in 3D space. This way, neural textures can be utilized to coherently re-render or manipulate existing video content in both static and dynamic environments at real-time rates. We show the effectiveness of our approach in several experiments on novel view synthesis, scene editing, and facial reenactment, and compare to state-of-the-art approaches that leverage the standard graphics pipeline as well as conventional generative neural networks.

734 citations
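The key mechanism described in the abstract, sampling a learned feature texture through the mesh's UV parameterization and decoding the sampled features with a small renderer network, can be illustrated with a minimal PyTorch sketch. The module names, dimensions, and the rasterized-UV input below are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTextureSketch(nn.Module):
    """Toy analogue of deferred neural rendering: a learned feature texture
    is sampled at rasterized UV coordinates and decoded to RGB."""
    def __init__(self, tex_res=256, feat_dim=16):
        super().__init__()
        # Learnable "neural texture": a feature map stored on the mesh's UV atlas.
        self.texture = nn.Parameter(torch.randn(1, feat_dim, tex_res, tex_res) * 0.01)
        # Tiny deferred renderer that maps sampled features to colors.
        self.renderer = nn.Sequential(
            nn.Conv2d(feat_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, uv):
        # uv: (1, H, W, 2) rasterized UV coordinates in [-1, 1], from a mesh proxy.
        feats = F.grid_sample(self.texture, uv, align_corners=True)
        return self.renderer(feats)

uv = torch.rand(1, 128, 128, 2) * 2 - 1   # placeholder UVs; a rasterizer would supply these
image = NeuralTextureSketch()(uv)         # (1, 3, 128, 128) rendered view
```

In the actual pipeline the UV map comes from rasterizing the reconstructed mesh proxy, and both the texture and the renderer are optimized end-to-end against the captured images.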


Journal ArticleDOI
TL;DR: An algorithm for view synthesis from an irregular grid of sampled views that first expands each sampled view into a local light field via a multiplane image (MPI) scene representation, then renders novel views by blending adjacent local light fields.
Abstract: We present a practical and robust deep learning solution for capturing and rendering novel views of complex real world scenes for virtual exploration. Previous approaches either require intractably dense view sampling or provide little to no guidance for how users should sample views of a scene to reliably render high-quality novel views. Instead, we propose an algorithm for view synthesis from an irregular grid of sampled views that first expands each sampled view into a local light field via a multiplane image (MPI) scene representation, then renders novel views by blending adjacent local light fields. We extend traditional plenoptic sampling theory to derive a bound that specifies precisely how densely users should sample views of a given scene when using our algorithm. In practice, we apply this bound to capture and render views of real world scenes that achieve the perceptual quality of Nyquist rate view sampling while using up to 4000x fewer views. We demonstrate our approach's practicality with an augmented reality smartphone app that guides users to capture input images of a scene and viewers that enable real-time virtual exploration on desktop and mobile platforms.

400 citations
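Rendering from a single multiplane image, the building block of the local light fields above, is just back-to-front alpha compositing of RGBA planes. A minimal NumPy sketch, under the assumption that the planes have already been reprojected (via per-plane homographies) into the target view:

```python
import numpy as np

def composite_mpi(rgba_planes):
    """Render an MPI by back-to-front 'over' compositing.
    rgba_planes: (D, H, W, 4) with planes ordered back (index 0) to front."""
    out = np.zeros(rgba_planes.shape[1:3] + (3,))
    for plane in rgba_planes:
        rgb, alpha = plane[..., :3], plane[..., 3:4]
        out = rgb * alpha + out * (1.0 - alpha)   # standard over operator
    return out

mpi = np.random.rand(32, 64, 64, 4)   # toy MPI with 32 planes
image = composite_mpi(mpi)            # (64, 64, 3)
```

Blending several such renderings from adjacent MPIs, weighted by proximity to the novel viewpoint, gives the final view described in the abstract.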


Proceedings ArticleDOI
15 Jun 2019
TL;DR: This work proposes DeepVoxels, a learned representation that encodes the view-dependent appearance of a 3D scene without having to explicitly model its geometry, based on a Cartesian 3D grid of persistent embedded features that learn to make use of the underlying 3D scene structure.
Abstract: In this work, we address the lack of 3D understanding of generative neural networks by introducing a persistent 3D feature embedding for view synthesis. To this end, we propose DeepVoxels, a learned representation that encodes the view-dependent appearance of a 3D scene without having to explicitly model its geometry. At its core, our approach is based on a Cartesian 3D grid of persistent embedded features that learn to make use of the underlying 3D scene structure. Our approach combines insights from 3D geometric computer vision with recent advances in learning image-to-image mappings based on adversarial loss functions. DeepVoxels is supervised, without requiring a 3D reconstruction of the scene, using a 2D re-rendering loss and enforces perspective and multi-view geometry in a principled manner. We apply our persistent 3D scene representation to the problem of novel view synthesis demonstrating high-quality results for a variety of challenging scenes.

353 citations
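The central operation implied by the abstract, relating a persistent Cartesian grid of features to a 2D view through perspective geometry, can be sketched as a projection of voxel centers into a target camera. The pinhole model and nearest-pixel splatting below are simplifying assumptions, not the paper's differentiable lifting and projection layers:

```python
import numpy as np

def project_voxel_features(features, centers, K, R, t, hw):
    """Splat per-voxel feature vectors into a 2D feature map for one view.
    features: (N, C), centers: (N, 3) voxel centers in world coordinates,
    K: (3, 3) intrinsics, R: (3, 3), t: (3,) world-to-camera, hw: (H, W)."""
    H, W = hw
    C = features.shape[1]
    cam = centers @ R.T + t                  # world -> camera
    valid = cam[:, 2] > 1e-6                 # keep voxels in front of the camera
    pix = cam[valid] @ K.T
    pix = pix[:, :2] / pix[:, 2:3]           # perspective divide
    u, v = np.round(pix[:, 0]).astype(int), np.round(pix[:, 1]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    fmap = np.zeros((H, W, C))
    fmap[v[inside], u[inside]] = features[valid][inside]   # nearest-pixel splat
    return fmap

grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, 8)] * 3), -1).reshape(-1, 3)
fmap = project_voxel_features(np.random.rand(len(grid), 16), grid,
                              np.array([[60.0, 0, 32], [0, 60.0, 32], [0, 0, 1]]),
                              np.eye(3), np.array([0, 0, 4.0]), (64, 64))
```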


Proceedings ArticleDOI
15 Jun 2019
TL;DR: This work presents a novel approach to view synthesis using multiplane images (MPIs) that incorporates occlusion reasoning, improving performance on challenging scene features such as object boundaries, lighting reflections, thin structures, and scenes with high depth complexity.
Abstract: We present a novel approach to view synthesis using multiplane images (MPIs). Building on recent advances in learned gradient descent, our algorithm generates an MPI from a set of sparse camera viewpoints. The resulting method incorporates occlusion reasoning, improving performance on challenging scene features such as object boundaries, lighting reflections, thin structures, and scenes with high depth complexity. We show that our method achieves high-quality, state-of-the-art results on two datasets: the Kalantari light field dataset, and a new camera array dataset, Spaces, which we make publicly available.

335 citations
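Learned gradient descent, which the method builds on, alternates a differentiable rendering loss with a network that turns the current estimate and its gradient into an update of the MPI. The placeholder renderer and single-convolution update operator below are assumptions for illustration only, not DeepView's architecture:

```python
import torch
import torch.nn as nn

def render_mpi(mpi):
    # Placeholder differentiable renderer: over-composite RGBA planes, back to front.
    rgb, a = torch.sigmoid(mpi[..., :3, :, :]), torch.sigmoid(mpi[..., 3:4, :, :])
    out = torch.zeros_like(rgb[:, 0])
    for d in range(mpi.shape[1]):
        out = rgb[:, d] * a[:, d] + out * (1 - a[:, d])
    return out

update_net = nn.Conv2d(2 * 4, 4, 3, padding=1)   # toy learned update operator

mpi = torch.zeros(1, 16, 4, 32, 32, requires_grad=True)   # (batch, planes, RGBA, H, W)
target = torch.rand(1, 3, 32, 32)
for step in range(3):                                      # a few learned-gradient steps
    loss = ((render_mpi(mpi) - target) ** 2).mean()
    grad = torch.autograd.grad(loss, mpi, create_graph=True)[0]
    # The network maps (current estimate, gradient) -> an additive update, per plane.
    upd = update_net(torch.cat([mpi, grad], dim=2).flatten(0, 1)).view_as(mpi)
    mpi = mpi + upd
```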


Proceedings ArticleDOI
01 Jun 2019
TL;DR: This paper presents a theoretical analysis showing how the range of views that can be rendered from an MPI increases linearly with the MPI disparity sampling frequency, as well as a novel MPI prediction procedure that theoretically enables view extrapolations of up to 4 times the lateral viewpoint movement allowed by prior work.
Abstract: We explore the problem of view synthesis from a narrow baseline pair of images, and focus on generating high-quality view extrapolations with plausible disocclusions. Our method builds upon prior work in predicting a multiplane image (MPI), which represents scene content as a set of RGBA planes within a reference view frustum and renders novel views by projecting this content into the target viewpoints. We present a theoretical analysis showing how the range of views that can be rendered from an MPI increases linearly with the MPI disparity sampling frequency, as well as a novel MPI prediction procedure that theoretically enables view extrapolations of up to 4 times the lateral viewpoint movement allowed by prior work. Our method ameliorates two specific issues that limit the range of views renderable by prior methods: 1) We expand the range of novel views that can be rendered without depth discretization artifacts by using a 3D convolutional network architecture along with a randomized-resolution training procedure to allow our model to predict MPIs with increased disparity sampling frequency. 2) We reduce the repeated texture artifacts seen in disocclusions by enforcing a constraint that the appearance of hidden content at any depth must be drawn from visible content at or behind that depth.

209 citations
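The linear relationship stated above can be motivated with classic plenoptic-sampling reasoning: depth-discretization artifacts stay hidden roughly as long as the disparity gap between adjacent MPI planes, scaled by the rendering baseline, remains within about a pixel, so doubling the number of planes doubles the renderable range. A back-of-the-envelope check; all numbers are made up for illustration and the constant differs from the paper's exact bound:

```python
# Illustrative only: plenoptic-sampling-style bound on view extrapolation from an MPI.
# Assumption: artifacts appear once adjacent planes separate by more than ~1 pixel
# of disparity at the target view; the paper's precise analysis is not reproduced here.
def max_baseline(num_planes, d_min, d_max, pixel_budget=1.0):
    """d_min/d_max: scene disparity range (pixels per unit camera baseline)."""
    disparity_step = (d_max - d_min) / num_planes   # disparity gap between adjacent planes
    return pixel_budget / disparity_step            # baseline where the gap hits the budget

for planes in (32, 64, 128):
    print(planes, "planes ->", round(max_baseline(planes, d_min=0.0, d_max=16.0), 2),
          "units of baseline")   # doubles with the plane count, i.e. grows linearly
```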


Journal ArticleDOI
TL;DR: In this paper, a semantic-aware neural network is used to estimate the scene depth from a single image, which is then combined with a segmentation-based depth adjustment process to synthesize the 3D Ken Burns effect.
Abstract: The Ken Burns effect allows animating still images with a virtual camera scan and zoom. Adding parallax, which results in the 3D Ken Burns effect, enables significantly more compelling results. Creating such effects manually is time-consuming and demands sophisticated editing skills. Existing automatic methods, however, require multiple input images from varying viewpoints. In this paper, we introduce a framework that synthesizes the 3D Ken Burns effect from a single image, supporting both a fully automatic mode and an interactive mode with the user controlling the camera. Our framework first leverages a depth prediction pipeline, which estimates scene depth that is suitable for view synthesis tasks. To address the limitations of existing depth estimation methods such as geometric distortions, semantic distortions, and inaccurate depth boundaries, we develop a semantic-aware neural network for depth prediction, couple its estimate with a segmentation-based depth adjustment process, and employ a refinement neural network that facilitates accurate depth predictions at object boundaries. According to this depth estimate, our framework then maps the input image to a point cloud and synthesizes the resulting video frames by rendering the point cloud from the corresponding camera positions. To address disocclusions while maintaining geometrically and temporally coherent synthesis results, we utilize context-aware color- and depth-inpainting to fill in the missing information in the extreme views of the camera path, thus extending the scene geometry of the point cloud. Experiments with a wide variety of image content show that our method enables realistic synthesis results. Our study demonstrates that our system allows users to achieve better results while requiring little effort compared to existing solutions for the 3D Ken Burns effect creation.

186 citations
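The rendering step in such a pipeline, lifting the image to a point cloud with the predicted depth and re-projecting it from a moved camera, can be sketched in NumPy. A simple pinhole model, nearest-pixel splatting, and a z-buffer are assumed here; the framework's actual point-cloud renderer and the context-aware inpainting are not reproduced:

```python
import numpy as np

def warp_to_new_view(image, depth, K, R, t):
    """Forward-warp an RGB image into a new camera using per-pixel depth.
    image: (H, W, 3), depth: (H, W), K: (3, 3) intrinsics,
    R, t: rotation/translation of the new camera relative to the source."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pts = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3) * depth.reshape(-1, 1)
    pts = pts @ np.linalg.inv(K).T              # unproject to 3D (source camera frame)
    pts = pts @ R.T + t                         # move into the new camera frame
    proj = pts @ K.T
    z = proj[:, 2]
    uv = np.round(proj[:, :2] / z[:, None]).astype(int)
    out = np.zeros_like(image)
    zbuf = np.full((H, W), np.inf)
    ok = (z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    for (x, y), zi, c in zip(uv[ok], z[ok], image.reshape(-1, 3)[ok]):
        if zi < zbuf[y, x]:                     # keep the nearest surface (z-buffer)
            zbuf[y, x], out[y, x] = zi, c
    return out                                  # holes left black are the disocclusions to inpaint

newview = warp_to_new_view(np.random.rand(48, 64, 3), np.full((48, 64), 2.0),
                           np.array([[80.0, 0, 32], [0, 80.0, 24], [0, 0, 1]]),
                           np.eye(3), np.array([0.05, 0.0, 0.0]))
```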


Proceedings ArticleDOI
Wen Liu1, Zhixin Piao1, Jie Min1, Wenhan Luo2, Lin Ma2, Shenghua Gao1 
01 Oct 2019
TL;DR: Liu et al. as mentioned in this paper propose to use a 3D body mesh recovery module to disentangle the pose and shape, which can not only model the joint location and rotation but also characterize the personalized body shape.
Abstract: We tackle human motion imitation, appearance transfer, and novel view synthesis within a unified framework, which means that the model, once trained, can be used to handle all these tasks. The existing task-specific methods mainly use 2D keypoints (pose) to estimate the human body structure. However, they only express the position information and cannot characterize the personalized shape of the individual person or model the limb rotations. In this paper, we propose to use a 3D body mesh recovery module to disentangle the pose and shape, which can not only model the joint location and rotation but also characterize the personalized body shape. To preserve the source information, such as texture, style, color, and face identity, we propose a Liquid Warping GAN with a Liquid Warping Block (LWB) that propagates the source information in both image and feature spaces and synthesizes an image with respect to the reference. Specifically, the source features are extracted by a denoising convolutional auto-encoder to characterize the source identity well. Furthermore, our proposed method is able to support a more flexible warping from multiple sources. In addition, we build a new dataset, namely the Impersonator (iPER) dataset, for the evaluation of human motion imitation, appearance transfer, and novel view synthesis. Extensive experiments demonstrate the effectiveness of our method in several aspects, such as robustness in occlusion cases and the preservation of face identity, shape consistency, and clothing details. All codes and datasets are available on https://svip-lab.github.io/project/impersonator.html.

174 citations


Proceedings ArticleDOI
Inchang Choi1, Orazio Gallo2, Alejandro Troccoli2, Min H. Kim1, Jan Kautz2 
01 Nov 2019
TL;DR: Extreme View Synthesis as mentioned in this paper estimates a depth probability volume, rather than just a single depth value for each pixel of the novel view, and combines learned image priors and the depth uncertainty to synthesize a refined image with fewer artifacts.
Abstract: We present Extreme View Synthesis, a solution for novel view extrapolation that works even when the number of input images is small, as few as two. In this context, occlusions and depth uncertainty are two of the most pressing issues, and worsen as the degree of extrapolation increases. We follow the traditional paradigm of performing depth-based warping and refinement, with a few key improvements. First, we estimate a depth probability volume, rather than just a single depth value for each pixel of the novel view. This allows us to leverage depth uncertainty in challenging regions, such as depth discontinuities. After using it to get an initial estimate of the novel view, we explicitly combine learned image priors and the depth uncertainty to synthesize a refined image with fewer artifacts. Our method is the first to show visually pleasing results for baseline magnifications of up to 30x.

170 citations
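A depth probability volume, as opposed to a single depth map, keeps a per-pixel distribution over depth hypotheses; a common way to consume it is to take the expected depth and treat the entropy as an uncertainty cue for the refinement stage. A small NumPy sketch, where the softmax normalization and shapes are assumptions rather than the paper's exact formulation:

```python
import numpy as np

def depth_and_uncertainty(cost_volume, depth_hypotheses):
    """cost_volume: (D, H, W) matching scores for D depth hypotheses.
    Returns per-pixel expected depth and entropy as an uncertainty proxy."""
    e = np.exp(cost_volume - cost_volume.max(axis=0, keepdims=True))
    p = e / e.sum(axis=0, keepdims=True)                    # softmax over depth
    depth = (p * depth_hypotheses[:, None, None]).sum(axis=0)
    entropy = -(p * np.log(p + 1e-8)).sum(axis=0)           # high near depth discontinuities
    return depth, entropy

hyp = np.linspace(1.0, 10.0, 32)
d, u = depth_and_uncertainty(np.random.rand(32, 48, 64), hyp)
```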


Journal ArticleDOI
TL;DR: This paper synthesizes novel viewpoints across a wide range of viewing directions (covering a 60° cone) from a sparse set of just six viewing directions, based on a deep convolutional network trained to directly synthesize new views from the six input views.
Abstract: The goal of light transport acquisition is to take images from a sparse set of lighting and viewing directions, and combine them to enable arbitrary relighting with changing view. While relighting from sparse images has received significant attention, there has been relatively less progress on view synthesis from a sparse set of "photometric" images---images captured under controlled conditions, lit by a single directional source; we use a spherical gantry to position the camera on a sphere surrounding the object. In this paper, we synthesize novel viewpoints across a wide range of viewing directions (covering a 60° cone) from a sparse set of just six viewing directions. While our approach relates to previous view synthesis and image-based rendering techniques, those methods are usually restricted to much smaller baselines, and are captured under environment illumination. At our baselines, input images have few correspondences and large occlusions; however we benefit from structured photometric images. Our method is based on a deep convolutional network trained to directly synthesize new views from the six input views. This network combines 3D convolutions on a plane sweep volume with a novel per-view per-depth plane attention map prediction network to effectively aggregate multi-view appearance. We train our network with a large-scale synthetic dataset of 1000 scenes with complex geometry and material properties. In practice, it is able to synthesize novel viewpoints for captured real data and reproduces complex appearance effects like occlusions, view-dependent specularities and hard shadows. Moreover, the method can also be combined with previous relighting techniques to enable changing both lighting and view, and applied to computer vision problems like multiview stereo from sparse image sets.
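The plane sweep volume that the network's 3D convolutions operate on is built by warping each input view onto a set of fronto-parallel depth planes of the reference camera using plane-induced homographies. A compact OpenCV sketch, where the camera conventions and depth sampling are assumptions:

```python
import numpy as np, cv2

def plane_sweep_volume(src_img, K_ref, K_src, R, t, depths):
    """Warp one source view onto fronto-parallel planes of the reference camera.
    R, t map reference-camera coordinates to source-camera coordinates.
    Returns a (D, H, W, 3) stack; stacking several views gives the full PSV."""
    H_img, W_img = src_img.shape[:2]
    n = np.array([0.0, 0.0, 1.0])          # plane normal in the reference frame
    volume = []
    for d in depths:
        # Homography induced by the plane n.X = d: x_src ~ K_src (R + t n^T / d) K_ref^-1 x_ref
        H = K_src @ (R + np.outer(t, n) / d) @ np.linalg.inv(K_ref)
        warped = cv2.warpPerspective(src_img, H, (W_img, H_img),
                                     flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
        volume.append(warped)
    return np.stack(volume)

K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
src = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
psv = plane_sweep_volume(src, K, K, np.eye(3), np.array([0.1, 0, 0]), np.linspace(1, 5, 8))
```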

Proceedings ArticleDOI
15 Jun 2019
TL;DR: The authors propose a style consistency discriminator to determine whether a pair of images are consistent in style, and an adaptive semantic consistency loss for synthesizing style-consistent results to the exemplar.
Abstract: Example-guided image synthesis aims to synthesize an image from a semantic label map and an exemplary image indicating style. We use the term "style" in this problem to refer to implicit characteristics of images, for example: in portraits "style" includes gender, racial identity, age, hairstyle; in full body pictures it includes clothing; in street scenes it refers to weather and time of day and such like. A semantic label map in these cases indicates facial expression, full body pose, or scene segmentation. We propose a solution to the example-guided image synthesis problem using conditional generative adversarial networks with style consistency. Our key contributions are (i) a novel style consistency discriminator to determine whether a pair of images are consistent in style; (ii) an adaptive semantic consistency loss; and (iii) a training data sampling strategy, for synthesizing style-consistent results to the exemplar. We demonstrate the efficiency of our method on face, dance and street view synthesis tasks.

Journal ArticleDOI
TL;DR: A novel approach based on Convolutional Neural Networks is proposed to jointly predict depth maps and foreground separation masks, which are used to condition Generative Adversarial Networks for hallucinating plausible color and depth in the initially occluded areas.

Proceedings ArticleDOI
01 Sep 2019
TL;DR: In this article, the authors explore spherical view synthesis for learning monocular 360 depth in a self-supervised manner and demonstrate its feasibility for horizontal and vertical baselines, as well as for the trinocular case.
Abstract: Learning based approaches for depth perception are limited by the availability of clean training data. This has led to the utilization of view synthesis as an indirect objective for learning depth estimation using efficient data acquisition procedures. Nonetheless, most research focuses on pinhole based monocular vision, with scarce works presenting results for omnidirectional input. In this work, we explore spherical view synthesis for learning monocular 360 depth in a self-supervised manner and demonstrate its feasibility. Under a purely geometrically derived formulation we present results for horizontal and vertical baselines, as well as for the trinocular case. Further, we show how to better exploit the expressiveness of traditional CNNs when applied to the equirectangular domain in an efficient manner. Finally, given the availability of ground truth depth data, our work is uniquely positioned to compare view synthesis against direct supervision in a consistent and fair manner. The results indicate that alternative research directions might be better suited to enable higher quality depth perception. Our data, models and code are publicly available at https://vcl3d.github.io/SphericalViewSynthesis/.
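Spherical view synthesis replaces the pinhole warp of ordinary self-supervised depth pipelines with an equirectangular one: each pixel maps to a direction on the unit sphere, is scaled by its depth, translated by the baseline, and re-projected. A purely geometric NumPy sketch of that mapping; the coordinate convention is an assumption, and the differentiable image sampling used for training is omitted:

```python
import numpy as np

def spherical_reproject(depth, baseline):
    """Map each equirectangular pixel of a source 360 view to its location in a
    target 360 view whose camera is translated by `baseline` (3-vector, metres).
    depth: (H, W) per-pixel radial depth. Returns float pixel coords (H, W, 2)."""
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H) + 0.5, np.arange(W) + 0.5, indexing="ij")
    lon = u / W * 2 * np.pi - np.pi                 # longitude in [-pi, pi)
    lat = np.pi / 2 - v / H * np.pi                 # latitude in [-pi/2, pi/2]
    dirs = np.stack([np.cos(lat) * np.sin(lon),     # unit ray directions
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], -1)
    pts = dirs * depth[..., None] - np.asarray(baseline)   # 3D points in the target frame
    r = np.linalg.norm(pts, axis=-1)
    lon2 = np.arctan2(pts[..., 0], pts[..., 2])
    lat2 = np.arcsin(np.clip(pts[..., 1] / np.maximum(r, 1e-8), -1, 1))
    u2 = (lon2 + np.pi) / (2 * np.pi) * W
    v2 = (np.pi / 2 - lat2) / np.pi * H
    return np.stack([u2, v2], -1)   # sample/splat the source image here to synthesize the view

coords = spherical_reproject(np.full((128, 256), 3.0), baseline=[0, 0.26, 0])  # vertical baseline
```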

Proceedings ArticleDOI
Jie Song1, Xu Chen1, Otmar Hilliges1
01 Oct 2019
TL;DR: In this paper, a learning pipeline determines the output pixels directly from the source color, which leads to more accurate view synthesis under continuous 6-DoF camera control and outperforms state-of-the-art baseline methods on public datasets.
Abstract: We propose a method to produce a continuous stream of novel views under fine-grained (e.g., 1 degree step-size) camera control at interactive rates. A novel learning pipeline determines the output pixels directly from the source color. Injecting geometric transformations, including perspective projection, 3D rotation and translation into the network forces implicit reasoning about the underlying geometry. The latent 3D geometry representation is compact and meaningful under 3D transformation, being able to produce geometrically accurate views for both single objects and natural scenes. Our experiments show that both proposed components, the transforming encoder-decoder and depth-guided appearance mapping, lead to significantly improved generalization beyond the training views and in consequence to more accurate view synthesis under continuous 6-DoF camera control. Finally, we show that our method outperforms state-of-the-art baseline methods on public datasets.

Journal ArticleDOI
TL;DR: A no-reference quality index for Synthesized views using DoG-based Edge statistics and Texture naturalness (SET) is proposed, and the experimental results demonstrate that the proposed metric is advantageous over the relevant state of the art in dealing with the distortions in the whole view synthesis process.
Abstract: View synthesis is a key technique in free-viewpoint video, which renders virtual views based on texture and depth images. The distortions in synthesized views come from two stages, i.e., the stage of the acquisition and processing of texture and depth images, and the rendering stage using depth-image-based-rendering (DIBR) algorithms. The existing view synthesis quality metrics are designed for the distortions caused by a single stage, which cannot accurately evaluate the quality of the entire view synthesis process. With the considerations that the distortions introduced by two stages both cause edge degradation and texture unnaturalness, and the Difference-of-Gaussian (DoG) representation is powerful in capturing image edge and texture characteristics by simulating the center-surrounding receptive fields of retinal ganglion cells of human eyes, this paper presents a no-reference quality index for Synthesized views using DoG-based Edge statistics and Texture naturalness (SET). To mimic the multi-scale property of the human visual system (HVS), DoG images are first calculated at multiple scales. Then, the orientation selective statistics features and the texture naturalness features are calculated on the DoG images and the coarsest scale image, producing two groups of quality-aware features. Finally, the quality model is learnt from these features using the random forest regression model. The experimental results on two view synthesis image databases demonstrate that the proposed metric is advantageous over the relevant state of the art in dealing with the distortions in the whole view synthesis process.
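The pipeline described above, multi-scale Difference-of-Gaussian responses summarized by simple statistics and regressed to a quality score with a random forest, can be approximated with SciPy and scikit-learn. The specific statistics, scales, and the toy training data below are stand-ins, not the features defined in the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.ensemble import RandomForestRegressor

def dog_features(gray, sigmas=(1, 2, 4, 8)):
    """Crude quality-aware features: moments of DoG responses at several scales."""
    feats = []
    for s in sigmas:
        dog = gaussian_filter(gray, s) - gaussian_filter(gray, 2 * s)
        feats += [dog.mean(), dog.std(), np.abs(dog).mean(), (dog ** 2).mean()]
    return np.array(feats)

# Toy training run: random "synthesized views" paired with random subjective scores.
X = np.stack([dog_features(np.random.rand(64, 64)) for _ in range(50)])
y = np.random.rand(50)                      # stand-in MOS values
model = RandomForestRegressor(n_estimators=50).fit(X, y)
score = model.predict(dog_features(np.random.rand(64, 64))[None])
```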

Journal ArticleDOI
TL;DR: To handle large up-sampling factors, a deeply supervised network structure is presented to enforce strong supervision in each stage of the network and a multi-scale fusion strategy is proposed to effectively exploit the feature maps at different scales and handle the blocking effect.
Abstract: Deep convolutional neural network (DCNN) has been successfully applied to depth map super-resolution and outperforms existing methods by a wide margin. However, there still exist two major issues with these DCNN-based depth map super-resolution methods that hinder the performance: 1) the low-resolution depth maps either need to be up-sampled before feeding into the network or substantial deconvolution has to be used and 2) the supervision (high-resolution depth maps) is only applied at the end of the network, thus it is difficult to handle large up-sampling factors, such as ×8 and ×16. In this paper, we propose a new framework to tackle the above problems. First, we propose to represent the task of depth map super-resolution as a series of novel view synthesis sub-tasks. The novel view synthesis sub-task aims at generating (synthesizing) a depth map from a different camera pose, which could be learned in parallel. Second, to handle large up-sampling factors, we present a deeply supervised network structure to enforce strong supervision in each stage of the network. Third, a multi-scale fusion strategy is proposed to effectively exploit the feature maps at different scales and handle the blocking effect. In this way, our proposed framework could deal with challenging depth map super-resolution efficiently under large up-sampling factors (e.g., ×8 and ×16). Our method only uses the low-resolution depth map as input, and the support of color image is not needed, which greatly reduces the restriction of our method. Extensive experiments on various benchmarking data sets demonstrate the superiority of our method over current state-of-the-art depth map super-resolution methods.
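Deep supervision, the second ingredient above, simply attaches a loss to each intermediate stage of the network instead of only to the final output. A schematic PyTorch fragment in which the stage modules and targets are placeholders rather than the paper's architecture:

```python
import torch
import torch.nn as nn

# Three placeholder up-sampling stages, each producing an intermediate depth map.
stages = nn.ModuleList([
    nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                  nn.Conv2d(1, 1, 3, padding=1))
    for _ in range(3)])                                   # x2 per stage -> x8 overall

lowres = torch.rand(1, 1, 16, 16)
targets = [torch.rand(1, 1, 16 * 2 ** (i + 1), 16 * 2 ** (i + 1)) for i in range(3)]

x, loss = lowres, 0.0
for stage, gt in zip(stages, targets):
    x = stage(x)
    loss = loss + nn.functional.l1_loss(x, gt)   # supervision at every stage, not just the end
loss.backward()
```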

Proceedings ArticleDOI
01 Oct 2019
TL;DR: This paper proposes an encoder-decoder based generative adversarial network VI-GAN to tackle the problem of synthesizing novel views from a 2D image and makes the decoder hallucinate the image of a novel view based on the extracted feature and an arbitrary user-specific camera pose.
Abstract: Synthesizing novel views from a 2D image requires to infer 3D structure and project it back to 2D from a new viewpoint. In this paper, we propose an encoder-decoder based generative adversarial network VI-GAN to tackle this problem. Our method is to let the network, after seeing many images of objects belonging to the same category in different views, obtain essential knowledge of intrinsic properties of the objects. To this end, an encoder is designed to extract view-independent feature that characterizes intrinsic properties of the input image, which includes 3D structure, color, texture etc. We also make the decoder hallucinate the image of a novel view based on the extracted feature and an arbitrary user-specific camera pose. Extensive experiments demonstrate that our model can synthesize high-quality images in different views with continuous camera poses, and is general for various applications.

Journal ArticleDOI
TL;DR: A new DIBR-synthesized image database with the associated subjective scores is presented, and subjective test results show that the inter-view synthesis methods, having more input information, significantly outperform the single-view-based ones.
Abstract: Depth-image-based rendering (DIBR) is a fundamental technology in several 3-D-related applications, such as free viewpoint video, virtual reality, and augmented reality. However, new challenges have also been brought in assessing the quality of DIBR-synthesized views since this process induces some new types of distortions, which are inherently different from the distortion caused by video coding. In this paper, we present a new DIBR-synthesized image database with the associated subjective scores. We also test the performances of the state-of-the-art objective quality metrics on this database. This paper focuses only on the distortions induced by different DIBR synthesis methods. Seven state-of-the-art DIBR algorithms, including inter-view synthesis and single-view-based synthesis methods, are considered in this database. The quality of synthesized views was assessed subjectively by 41 observers and objectively using 14 state-of-the-art objective metrics. Subjective test results show that the inter-view synthesis methods, having more input information, significantly outperform the single-view-based ones. Correlation results between the tested objective metrics and the subjective scores on this database reveal that further studies are still needed for a better objective quality metric dedicated to the DIBR-synthesized views.

Journal ArticleDOI
23 Jan 2019
TL;DR: Hirose et al. as mentioned in this paper presented a view synthesis method for mobile robots in dynamic environments, and its application to the estimation of future traversability, which predicts future images for given virtual robot velocity commands using only RGB images at previous and current time steps.
Abstract: We present VUNet, a novel view (VU) synthesis method for mobile robots in dynamic environments, and its application to the estimation of future traversability. Our method predicts future images for given virtual robot velocity commands using only RGB images at previous and current time steps. The future images result from applying two types of image changes to the previous and current images: first, changes caused by a different camera pose; second, changes due to the motion of the dynamic obstacles. We learn to predict these two types of changes disjointly using two novel network architectures, SNet and DNet. We combine SNet and DNet to synthesize future images that we pass to our previously presented method GONet [N. Hirose, A. Sadeghian, M. Vazquez, P. Goebel, and S. Savarese, “Gonet: A semi-supervised deep learning approach for traversability estimation,” in Proc. IEEE International Conference on Intelligent Robots and Systems, 2018, pp. 3044–3051] to estimate the traversable areas around the robot. Our quantitative and qualitative evaluation indicates that our approach for view synthesis predicts accurate future images in both static and dynamic environments. We also show that these virtual images can be used to estimate future traversability correctly. We apply our view synthesis-based traversability estimation method to two applications for assisted teleoperation.

Journal ArticleDOI
TL;DR: A real-time high-quality DIBR system that consists of disparity estimation and view synthesis is proposed and a local approach that focuses on depth discontinuities and disparity smoothness is presented to improve the disparity accuracy.
Abstract: Depth image-based rendering (DIBR) techniques have recently drawn more attention in various 3D applications. In this paper, a real-time high-quality DIBR system that consists of disparity estimation and view synthesis is proposed. For disparity estimation, a local approach that focuses on depth discontinuities and disparity smoothness is presented to improve the disparity accuracy. For view synthesis, a method that contains view interpolation and extrapolation is proposed to render high-quality virtual views. Moreover, the system is designed with an optimized parallelism scheme to achieve a high throughput, and can be scaled up easily. It is implemented on an Altera Stratix IV FPGA at a processing speed of 45 frames per second for 1080p resolution. Evaluated on selected image sets of the Middlebury benchmark, the average error rate of the disparity maps is 6.02%; the average peak signal to noise ratio and structural similarity values of the virtual views are 30.07 dB and 0.9303, respectively. The experimental results indicate that the proposed DIBR system has the top-performing processing speed and its accuracy performance is among the best of state-of-the-art hardware implementations.
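The view-interpolation core of a DIBR pipeline like this one shifts each pixel horizontally by its disparity scaled to the virtual camera position, resolving occlusions by painting far pixels first. A NumPy sketch under the assumptions of rectified cameras and no hole filling:

```python
import numpy as np

def dibr_shift(image, disparity, alpha):
    """Render a virtual view between two rectified cameras.
    alpha in [0, 1]: 0 keeps the reference view, 1 reaches the other camera.
    image: (H, W, 3), disparity: (H, W) in pixels at alpha = 1."""
    H, W = disparity.shape
    out = np.zeros_like(image)
    depth_order = np.argsort(disparity.ravel())          # paint far (low-disparity) pixels first
    ys, xs = np.unravel_index(depth_order, disparity.shape)
    for y, x in zip(ys, xs):
        xt = int(round(x - alpha * disparity[y, x]))      # horizontal shift only
        if 0 <= xt < W:
            out[y, xt] = image[y, x]                      # nearer pixels overwrite farther ones
    return out                                            # remaining holes need inpainting

view = dibr_shift(np.random.rand(48, 64, 3), np.random.rand(48, 64) * 8, alpha=0.5)
```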

Journal ArticleDOI
TL;DR: A new region-of-interest (ROI)-based video compression method is designed for light field videos and a novel view synthesis algorithm is presented to generate arbitrary viewpoints at the receiver to improve the compression performance.
Abstract: Light field videos provide a rich representation of the real world, thus research on this technology is of urgency and interest for both the scientific community and industries. Light field applications such as virtual reality and post-production in the movie industry require a large number of viewpoints of the captured scene to achieve an immersive experience, and this creates a significant burden on light field compression and streaming. In this paper, we first present a light field video dataset captured with a plenoptic camera. Then a new region-of-interest (ROI)-based video compression method is designed for light field videos. In order to further improve the compression performance, a novel view synthesis algorithm is presented to generate arbitrary viewpoints at the receiver. The experimental evaluation of four light field video sequences demonstrates that the proposed ROI-based compression method can save 5%-7% in bitrates in comparison to conventional light field video compression methods. Furthermore, the proposed view synthesis-based compression method not only can achieve a reduction of about 50% in bitrates against conventional compression methods, but the synthesized views also exhibit visual quality identical to that of their ground truth.

Proceedings ArticleDOI
01 Nov 2019
TL;DR: This paper presents a flexible Virtual Multi-View Synthesis module that can be adopted into 3D object detection methods to improve orientation estimation and shows that this module greatly improves the orientation estimation on the challenging pedestrian class on the KITTI benchmark.
Abstract: Accurately estimating the orientation of pedestrians is an important and challenging task for autonomous driving because this information is essential for tracking and predicting pedestrian behavior. This paper presents a flexible Virtual Multi-View Synthesis module that can be adopted into 3D object detection methods to improve orientation estimation. The module uses a multi-step process to acquire the fine-grained semantic information required for accurate orientation estimation. First, the scene’s point cloud is densified using a structure preserving depth completion algorithm and each point is colorized using its corresponding RGB pixel. Next, virtual cameras are placed around each object in the densified point cloud to generate novel viewpoints, which preserve the object’s appearance. We show that this module greatly improves the orientation estimation on the challenging pedestrian class on the KITTI benchmark. When used with the open-source 3D detector AVOD-FPN, we outperform all other published methods on the pedestrian Orientation, 3D, and Bird’s Eye View benchmarks.
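The "virtual cameras placed around each object" step amounts to building look-at poses on a circle around the object's centroid and rendering the colorized point cloud from each of them. A small NumPy sketch of that placement; the radius, the z-up look-at convention, and the view count are illustrative assumptions, not the module's actual parameters:

```python
import numpy as np

def look_at(cam_pos, target, up=np.array([0.0, 0.0, 1.0])):
    """Build a world-to-camera rotation and translation that looks at `target`."""
    z = target - cam_pos
    z = z / np.linalg.norm(z)                 # camera forward axis
    x = np.cross(z, up); x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                   # rows are the camera axes
    return R, -R @ cam_pos

def virtual_cameras(centroid, radius=2.0, n_views=8):
    """Camera poses on a circle around an object centroid (e.g., a pedestrian)."""
    poses = []
    for ang in np.linspace(0, 2 * np.pi, n_views, endpoint=False):
        pos = centroid + radius * np.array([np.cos(ang), np.sin(ang), 0.0])
        poses.append(look_at(pos, centroid))
    return poses

poses = virtual_cameras(np.array([10.0, 2.0, 0.8]))   # render the colored point cloud from each
```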

Proceedings Article
01 Jan 2019
TL;DR: This work devises an approach that exploits known geometric properties of the scene (per-frame camera extrinsics and depth) in order to warp reference views into the new ones, and obtains images that are geometrically consistent with all the views in the scene camera system.
Abstract: Given a set of a reference RGBD views of an indoor environment, and a new viewpoint, our goal is to predict the view from that location. Prior work on new-view generation has predominantly focused on significantly constrained scenarios, typically involving artificially rendered views of isolated CAD models. Here we tackle a much more challenging version of the problem. We devise an approach that exploits known geometric properties of the scene (per-frame camera extrinsics and depth) in order to warp reference views into the new ones. The defects in the generated views are handled by a novel RGBD inpainting network, PerspectiveNet, that is fine-tuned for a given scene in order to obtain images that are geometrically consistent with all the views in the scene camera system. Experiments conducted on the ScanNet and SceneNet datasets reveal performance superior to strong baselines.

Proceedings ArticleDOI
11 Dec 2019
TL;DR: This paper is an attempt to deliver good-quality Depth Estimation Reference Software (DERS) that is well-structured for further use in the worldwide MPEG standardization committee.
Abstract: For enabling virtual reality on natural content, Depth Image-Based Rendering (DIBR) techniques have been steadily developed over the past decade, but their quality highly depends on that of the depth estimation. This paper is an attempt to deliver good-quality Depth Estimation Reference Software (DERS) that is well-structured for further use in the worldwide MPEG standardization committee. The existing DERS has been refactored, debugged and extended to any number of input views for generating accurate depth maps. Their quality has been validated by synthesizing DIBR virtual views with the Reference View Synthesizer (RVS) and the Versatile View Synthesizer (VVS), using the available MPEG test sequences. Resulting images and runtimes are reported.

Journal ArticleDOI
TL;DR: A full-resolution network extracts fine-scale image features, which helps prevent blurry artifacts, and a synthesis layer is used to not only warp the observed pixels to the desired positions but also hallucinate the missing pixels from other recorded pixels.
Abstract: Research works in novel viewpoint synthesis are based mainly on multiview input images. In this paper, we focus on a more challenging and ill-posed problem, that is, to synthesize surrounding novel viewpoints from a single image. To achieve this goal, we design a full-resolution network to extract fine-scale image features, which helps prevent blurry artifacts. We also involve a pretrained relative depth estimation network, so that three-dimensional information is utilized to infer the flow field between the input and the target image. Since the depth network is trained on depth order between any pair of objects, large-scale image features are also involved in our system. Finally, a synthesis layer is used to not only warp the observed pixels to the desired positions but also hallucinate the missing pixels from other recorded pixels. Experiments show that our technique successfully synthesizes reasonable novel viewpoints surrounding the input, while other state-of-the-art techniques fail.