Top 114 papers published in the topic of View synthesis in 2019

Journal Article•DOI•

Deferred neural rendering: image synthesis using neural textures

[...]

Justus Thies¹, Michael Zollhöfer², Matthias Nießner¹•Institutions (2)

Technische Universität München¹, Stanford University²

12 Jul 2019-ACM Transactions on Graphics

TL;DR: This work proposes Neural Textures, which are learned feature maps that are trained as part of the scene capture process that can be utilized to coherently re-render or manipulate existing video content in both static and dynamic environments at real-time rates.

...read moreread less

Abstract: The modern computer graphics pipeline can synthesize images at remarkable visual quality; however, it requires well-defined, high-quality 3D content as input. In this work, we explore the use of imperfect 3D content, for instance, obtained from photo-metric reconstructions with noisy and incomplete surface geometry, while still aiming to produce photo-realistic (re-)renderings. To address this challenging problem, we introduce Deferred Neural Rendering, a new paradigm for image synthesis that combines the traditional graphics pipeline with learnable components. Specifically, we propose Neural Textures, which are learned feature maps that are trained as part of the scene capture process. Similar to traditional textures, neural textures are stored as maps on top of 3D mesh proxies; however, the high-dimensional feature maps contain significantly more information, which can be interpreted by our new deferred neural rendering pipeline. Both neural textures and deferred neural renderer are trained end-to-end, enabling us to synthesize photo-realistic images even when the original 3D content was imperfect. In contrast to traditional, black-box 2D generative neural networks, our 3D representation gives us explicit control over the generated output, and allows for a wide range of application domains. For instance, we can synthesize temporally-consistent video re-renderings of recorded 3D scenes as our representation is inherently embedded in 3D space. This way, neural textures can be utilized to coherently re-render or manipulate existing video content in both static and dynamic environments at real-time rates. We show the effectiveness of our approach in several experiments on novel view synthesis, scene editing, and facial reenactment, and compare to state-of-the-art approaches that leverage the standard graphics pipeline as well as conventional generative neural networks.

...read moreread less

734 citations

Posted Content•

Deferred Neural Rendering: Image Synthesis using Neural Textures

[...]

Justus Thies¹, Michael Zollhöfer², Matthias Nießner¹•Institutions (2)

Technische Universität München¹, Stanford University²

28 Apr 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes Neural Textures, which are learned feature maps that are trained as part of the scene capture process that can be utilized to coherently re-render or manipulate existing video content in both static and dynamic environments at real-time rates.

...read moreread less

Abstract: The modern computer graphics pipeline can synthesize images at remarkable visual quality; however, it requires well-defined, high-quality 3D content as input. In this work, we explore the use of imperfect 3D content, for instance, obtained from photo-metric reconstructions with noisy and incomplete surface geometry, while still aiming to produce photo-realistic (re-)renderings. To address this challenging problem, we introduce Deferred Neural Rendering, a new paradigm for image synthesis that combines the traditional graphics pipeline with learnable components. Specifically, we propose Neural Textures, which are learned feature maps that are trained as part of the scene capture process. Similar to traditional textures, neural textures are stored as maps on top of 3D mesh proxies; however, the high-dimensional feature maps contain significantly more information, which can be interpreted by our new deferred neural rendering pipeline. Both neural textures and deferred neural renderer are trained end-to-end, enabling us to synthesize photo-realistic images even when the original 3D content was imperfect. In contrast to traditional, black-box 2D generative neural networks, our 3D representation gives us explicit control over the generated output, and allows for a wide range of application domains. For instance, we can synthesize temporally-consistent video re-renderings of recorded 3D scenes as our representation is inherently embedded in 3D space. This way, neural textures can be utilized to coherently re-render or manipulate existing video content in both static and dynamic environments at real-time rates. We show the effectiveness of our approach in several experiments on novel view synthesis, scene editing, and facial reenactment, and compare to state-of-the-art approaches that leverage the standard graphics pipeline as well as conventional generative neural networks.

...read moreread less

444 citations

Journal Article•DOI•

Local light field fusion: practical view synthesis with prescriptive sampling guidelines

[...]

Ben Mildenhall¹, Pratul P. Srinivasan¹, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari², Ravi Ramamoorthi¹, Ren Ng¹, Abhishek Kar - Show less +3 more•Institutions (2)

University of California¹, Texas A&M University²

12 Jul 2019-ACM Transactions on Graphics

TL;DR: An algorithm for view synthesis from an irregular grid of sampled views that first expands each sampled view into a local light field via a multiplane image (MPI) scene representation, then renders novel views by blending adjacent local light fields.

...read moreread less

Abstract: We present a practical and robust deep learning solution for capturing and rendering novel views of complex real world scenes for virtual exploration. Previous approaches either require intractably dense view sampling or provide little to no guidance for how users should sample views of a scene to reliably render high-quality novel views. Instead, we propose an algorithm for view synthesis from an irregular grid of sampled views that first expands each sampled view into a local light field via a multiplane image (MPI) scene representation, then renders novel views by blending adjacent local light fields. We extend traditional plenoptic sampling theory to derive a bound that specifies precisely how densely users should sample views of a given scene when using our algorithm. In practice, we apply this bound to capture and render views of real world scenes that achieve the perceptual quality of Nyquist rate view sampling while using up to 4000X fewer views. We demonstrate our approach's practicality with an augmented reality smart-phone app that guides users to capture input images of a scene and viewers that enable realtime virtual exploration on desktop and mobile platforms.

...read moreread less

400 citations

Proceedings Article•DOI•

DeepVoxels: Learning Persistent 3D Feature Embeddings

[...]

Vincent Sitzmann¹, Justus Thies², Felix Heide³, Matthias NieBner², Gordon Wetzstein¹, Michael Zollhöfer¹ - Show less +2 more•Institutions (3)

Stanford University¹, Technische Universität München², Princeton University³

15 Jun 2019

TL;DR: This work proposes DeepVoxels, a learned representation that encodes the view-dependent appearance of a 3D scene without having to explicitly model its geometry, based on a Cartesian 3D grid of persistent embedded features that learn to make use of the underlying3D scene structure.

...read moreread less

Abstract: In this work, we address the lack of 3D understanding of generative neural networks by introducing a persistent 3D feature embedding for view synthesis. To this end, we propose DeepVoxels, a learned representation that encodes the view-dependent appearance of a 3D scene without having to explicitly model its geometry. At its core, our approach is based on a Cartesian 3D grid of persistent embedded features that learn to make use of the underlying 3D scene structure. Our approach combines insights from 3D geometric computer vision with recent advances in learning image-to-image mappings based on adversarial loss functions. DeepVoxels is supervised, without requiring a 3D reconstruction of the scene, using a 2D re-rendering loss and enforces perspective and multi-view geometry in a principled manner. We apply our persistent 3D scene representation to the problem of novel view synthesis demonstrating high-quality results for a variety of challenging scenes.

...read moreread less

353 citations

Posted Content•

Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines

[...]

Ben Mildenhall¹, Pratul P. Srinivasan¹, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari², Ravi Ramamoorthi¹, Ren Ng¹, Abhishek Kar - Show less +3 more•Institutions (2)

University of California¹, Texas A&M University²

02 May 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: An algorithm for view synthesis from an irregular grid of sampled views that first expands each sampled view into a local light field via a multiplane image (MPI) scene representation, then renders novel views by blending adjacent local light fields.

...read moreread less

Abstract: We present a practical and robust deep learning solution for capturing and rendering novel views of complex real world scenes for virtual exploration. Previous approaches either require intractably dense view sampling or provide little to no guidance for how users should sample views of a scene to reliably render high-quality novel views. Instead, we propose an algorithm for view synthesis from an irregular grid of sampled views that first expands each sampled view into a local light field via a multiplane image (MPI) scene representation, then renders novel views by blending adjacent local light fields. We extend traditional plenoptic sampling theory to derive a bound that specifies precisely how densely users should sample views of a given scene when using our algorithm. In practice, we apply this bound to capture and render views of real world scenes that achieve the perceptual quality of Nyquist rate view sampling while using up to 4000x fewer views. We demonstrate our approach's practicality with an augmented reality smartphone app that guides users to capture input images of a scene and viewers that enable realtime virtual exploration on desktop and mobile platforms.

...read moreread less

338 citations

Proceedings Article•DOI•

DeepView: View Synthesis With Learned Gradient Descent

[...]

John Flynn¹, Michael Broxton¹, Paul Debevec¹, Matthew DuVall¹, Graham Fyffe¹, Ryan Overbeck¹, Noah Snavely¹, Richard Tucker¹ - Show less +4 more•Institutions (1)

Google¹

15 Jun 2019

TL;DR: This work presents a novel approach to view synthesis using multiplane images (MPIs) that incorporates occlusion reasoning, improving performance on challenging scene features such as object boundaries, lighting reflections, thin structures, and scenes with high depth complexity.

...read moreread less

Abstract: We present a novel approach to view synthesis using multiplane images (MPIs). Building on recent advances in learned gradient descent, our algorithm generates an MPI from a set of sparse camera viewpoints. The resulting method incorporates occlusion reasoning, improving performance on challenging scene features such as object boundaries, lighting reflections, thin structures, and scenes with high depth complexity. We show that our method achieves high-quality, state-of-the-art results on two datasets: the Kalantari light field dataset, and a new camera array dataset, Spaces, which we make publicly available.

...read moreread less

335 citations

Proceedings Article•DOI•

Pushing the Boundaries of View Extrapolation With Multiplane Images

[...]

Pratul P. Srinivasan¹, Richard Tucker², Jonathan T. Barron², Ravi Ramamoorthi³, Ren Ng¹, Noah Snavely⁴ - Show less +2 more•Institutions (4)

University of California, Berkeley¹, Google², University of California, San Diego³, Cornell University⁴

01 Jun 2019

TL;DR: This paper presents a theoretical analysis showing how the range of views that can be rendered from an MPI increases linearly with the MPI disparity sampling frequency, as well as a novel MPI prediction procedure that theoretically enables view extrapolations of up to 4 times the lateral viewpoint movement allowed by prior work.

...read moreread less

Abstract: We explore the problem of view synthesis from a narrow baseline pair of images, and focus on generating high-quality view extrapolations with plausible disocclusions. Our method builds upon prior work in predicting a multiplane image (MPI), which represents scene content as a set of RGBA planes within a reference view frustum and renders novel views by projecting this content into the target viewpoints. We present a theoretical analysis showing how the range of views that can be rendered from an MPI increases linearly with the MPI disparity sampling frequency, as well as a novel MPI prediction procedure that theoretically enables view extrapolations of up to 4 times the lateral viewpoint movement allowed by prior work. Our method ameliorates two specific issues that limit the range of views renderable by prior methods: 1) We expand the range of novel views that can be rendered without depth discretization artifacts by using a 3D convolutional network architecture along with a randomized-resolution training procedure to allow our model to predict MPIs with increased disparity sampling frequency. 2) We reduce the repeated texture artifacts seen in disocclusions by enforcing a constraint that the appearance of hidden content at any depth must be drawn from visible content at or behind that depth.

...read moreread less

209 citations

Journal Article•DOI•

3D Ken Burns effect from a single image

[...]

Simon Niklaus¹, Long Mai², Jimei Yang², Feng Liu¹•Institutions (2)

Portland State University¹, Adobe Systems²

08 Nov 2019-ACM Transactions on Graphics

TL;DR: In this paper, a semantic-aware neural network is used to estimate the scene depth from a single image, which is then combined with a segmentation-based depth adjustment process to synthesize the 3D Ken Burns effect.

...read moreread less

Abstract: The Ken Burns effect allows animating still images with a virtual camera scan and zoom. Adding parallax, which results in the 3D Ken Burns effect, enables significantly more compelling results. Creating such effects manually is time-consuming and demands sophisticated editing skills. Existing automatic methods, however, require multiple input images from varying viewpoints. In this paper, we introduce a framework that synthesizes the 3D Ken Burns effect from a single image, supporting both a fully automatic mode and an interactive mode with the user controlling the camera. Our framework first leverages a depth prediction pipeline, which estimates scene depth that is suitable for view synthesis tasks. To address the limitations of existing depth estimation methods such as geometric distortions, semantic distortions, and inaccurate depth boundaries, we develop a semantic-aware neural network for depth prediction, couple its estimate with a segmentation-based depth adjustment process, and employ a refinement neural network that facilitates accurate depth predictions at object boundaries. According to this depth estimate, our framework then maps the input image to a point cloud and synthesizes the resulting video frames by rendering the point cloud from the corresponding camera positions. To address disocclusions while maintaining geometrically and temporally coherent synthesis results, we utilize context-aware color- and depth-inpainting to fill in the missing information in the extreme views of the camera path, thus extending the scene geometry of the point cloud. Experiments with a wide variety of image content show that our method enables realistic synthesis results. Our study demonstrates that our system allows users to achieve better results while requiring little effort compared to existing solutions for the 3D Ken Burns effect creation.

...read moreread less

186 citations

Proceedings Article•DOI•

Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis

[...]

Wen Liu¹, Zhixin Piao¹, Jie Min¹, Wenhan Luo², Lin Ma², Shenghua Gao¹ - Show less +2 more•Institutions (2)

ShanghaiTech University¹, Tencent²

01 Oct 2019

TL;DR: Zhang et al. as mentioned in this paper propose to use a 3D body mesh recovery module to disentangle the pose and shape, which can not only model the joint location and rotation but also characterize the personalized body shape.

...read moreread less

Abstract: We tackle the human motion imitation, appearance transfer, and novel view synthesis within a unified framework, which means that the model once being trained can be used to handle all these tasks. The existing task-specific methods mainly use 2D keypoints (pose) to estimate the human body structure. However, they only expresses the position information with no abilities to characterize the personalized shape of the individual person and model the limbs rotations. In this paper, we propose to use a 3D body mesh recovery module to disentangle the pose and shape, which can not only model the joint location and rotation but also characterize the personalized body shape. To preserve the source information, such as texture, style, color, and face identity, we propose a Liquid Warping GAN with Liquid Warping Block (LWB) that propagates the source information in both image and feature spaces, and synthesizes an image with respect to the reference. Specifically, the source features are extracted by a denoising convolutional auto-encoder for characterizing the source identity well. Furthermore, our proposed method is able to support a more flexible warping from multiple sources. In addition, we build a new dataset, namely Impersonator (iPER) dataset, for the evaluation of human motion imitation, appearance transfer, and novel view synthesis. Extensive experiments demonstrate the effectiveness of our method in several aspects, such as robustness in occlusion case and preserving face identity, shape consistency and clothes details. All codes and datasets are available on https://svip-lab.github.io/project/impersonator.html.

...read moreread less

174 citations

Proceedings Article•DOI•

Extreme View Synthesis

[...]

Inchang Choi¹, Orazio Gallo², Alejandro Troccoli², Min H. Kim¹, Jan Kautz² - Show less +1 more•Institutions (2)

KAIST¹, Nvidia²

01 Nov 2019

TL;DR: Extreme View Synthesis as mentioned in this paper estimates a depth probability volume, rather than just a single depth value for each pixel of the novel view, and combines learned image priors and the depth uncertainty to synthesize a refined image with less artifacts.

...read moreread less

Abstract: We present Extreme View Synthesis, a solution for novel view extrapolation that works even when the number of input images is small---as few as two. In this context, occlusions and depth uncertainty are two of the most pressing issues, and worsen as the degree of extrapolation increases. We follow the traditional paradigm of performing depth-based warping and refinement, with a few key improvements. First, we estimate a depth probability volume, rather than just a single depth value for each pixel of the novel view. This allows us to leverage depth uncertainty in challenging regions, such as depth discontinuities. After using it to get an initial estimate of the novel view, we explicitly combine learned image priors and the depth uncertainty to synthesize a refined image with less artifacts. Our method is the first to show visually pleasing results for baseline magnifications of up to 30x.

...read moreread less

170 citations

Posted Content•

Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis

[...]

Wen Liu¹, Zhixin Piao¹, Jie Min¹, Wenhan Luo¹, Lin Ma², Shenghua Gao¹ - Show less +2 more•Institutions (2)

ShanghaiTech University¹, Tencent²

26 Sep 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: A 3D body mesh recovery module is proposed to disentangle the pose and shape, which can not only model the joint location and rotation but also characterize the personalized body shape and is able to support a more flexible warping from multiple sources.

...read moreread less

Abstract: We tackle the human motion imitation, appearance transfer, and novel view synthesis within a unified framework, which means that the model once being trained can be used to handle all these tasks. The existing task-specific methods mainly use 2D keypoints (pose) to estimate the human body structure. However, they only expresses the position information with no abilities to characterize the personalized shape of the individual person and model the limbs rotations. In this paper, we propose to use a 3D body mesh recovery module to disentangle the pose and shape, which can not only model the joint location and rotation but also characterize the personalized body shape. To preserve the source information, such as texture, style, color, and face identity, we propose a Liquid Warping GAN with Liquid Warping Block (LWB) that propagates the source information in both image and feature spaces, and synthesizes an image with respect to the reference. Specifically, the source features are extracted by a denoising convolutional auto-encoder for characterizing the source identity well. Furthermore, our proposed method is able to support a more flexible warping from multiple sources. In addition, we build a new dataset, namely Impersonator (iPER) dataset, for the evaluation of human motion imitation, appearance transfer, and novel view synthesis. Extensive experiments demonstrate the effectiveness of our method in several aspects, such as robustness in occlusion case and preserving face identity, shape consistency and clothes details. All codes and datasets are available on this https URL

...read moreread less

Journal Article•DOI•

Deep view synthesis from sparse photometric images

[...]

Zexiang Xu¹, Sai Bi¹, Kalyan Sunkavalli², Sunil Hadap³, Hao Su¹, Ravi Ramamoorthi¹ - Show less +2 more•Institutions (3)

University of California¹, Adobe Systems², Amazon.com³

12 Jul 2019-ACM Transactions on Graphics

TL;DR: This paper synthesizes novel viewpoints across a wide range of viewing directions (covering a 60° cone) from a sparse set of just six viewing directions, based on a deep convolutional network trained to directly synthesize new views from the six input views.

...read moreread less

Abstract: The goal of light transport acquisition is to take images from a sparse set of lighting and viewing directions, and combine them to enable arbitrary relighting with changing view. While relighting from sparse images has received significant attention, there has been relatively less progress on view synthesis from a sparse set of "photometric" images---images captured under controlled conditions, lit by a single directional source; we use a spherical gantry to position the camera on a sphere surrounding the object. In this paper, we synthesize novel viewpoints across a wide range of viewing directions (covering a 60° cone) from a sparse set of just six viewing directions. While our approach relates to previous view synthesis and image-based rendering techniques, those methods are usually restricted to much smaller baselines, and are captured under environment illumination. At our baselines, input images have few correspondences and large occlusions; however we benefit from structured photometric images. Our method is based on a deep convolutional network trained to directly synthesize new views from the six input views. This network combines 3D convolutions on a plane sweep volume with a novel per-view per-depth plane attention map prediction network to effectively aggregate multi-view appearance. We train our network with a large-scale synthetic dataset of 1000 scenes with complex geometry and material properties. In practice, it is able to synthesize novel viewpoints for captured real data and reproduces complex appearance effects like occlusions, view-dependent specularities and hard shadows. Moreover, the method can also be combined with previous relighting techniques to enable changing both lighting and view, and applied to computer vision problems like multiview stereo from sparse image sets.

...read moreread less

Posted Content•

DeepView: View Synthesis with Learned Gradient Descent.

[...]

John Flynn¹, Michael Broxton¹, Paul Debevec¹, Matthew DuVall¹, Graham Fyffe¹, Ryan Overbeck¹, Noah Snavely¹, Richard Tucker¹ - Show less +4 more•Institutions (1)

Google¹

18 Jun 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, an approach to view synthesis using multiplane images (MPIs) is presented, which incorporates occlusion reasoning, improving performance on challenging scene features such as object boundaries, lighting reflections, thin structures, and scenes with high depth complexity.

...read moreread less

Abstract: We present a novel approach to view synthesis using multiplane images (MPIs). Building on recent advances in learned gradient descent, our algorithm generates an MPI from a set of sparse camera viewpoints. The resulting method incorporates occlusion reasoning, improving performance on challenging scene features such as object boundaries, lighting reflections, thin structures, and scenes with high depth complexity. We show that our method achieves high-quality, state-of-the-art results on two datasets: the Kalantari light field dataset, and a new camera array dataset, Spaces, which we make publicly available.

...read moreread less

Posted Content•

Pushing the Boundaries of View Extrapolation with Multiplane Images

[...]

Pratul P. Srinivasan¹, Richard Tucker², Jonathan T. Barron², Ravi Ramamoorthi³, Ren Ng¹, Noah Snavely⁴ - Show less +2 more•Institutions (4)

University of California, Berkeley¹, Google², University of California, San Diego³, Cornell University⁴

01 May 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, the authors explore the problem of view synthesis from a narrow baseline pair of images, and focus on generating high-quality view extrapolations with plausible disocclusions.

...read moreread less

Abstract: We explore the problem of view synthesis from a narrow baseline pair of images, and focus on generating high-quality view extrapolations with plausible disocclusions. Our method builds upon prior work in predicting a multiplane image (MPI), which represents scene content as a set of RGB$\alpha$ planes within a reference view frustum and renders novel views by projecting this content into the target viewpoints. We present a theoretical analysis showing how the range of views that can be rendered from an MPI increases linearly with the MPI disparity sampling frequency, as well as a novel MPI prediction procedure that theoretically enables view extrapolations of up to $4\times$ the lateral viewpoint movement allowed by prior work. Our method ameliorates two specific issues that limit the range of views renderable by prior methods: 1) We expand the range of novel views that can be rendered without depth discretization artifacts by using a 3D convolutional network architecture along with a randomized-resolution training procedure to allow our model to predict MPIs with increased disparity sampling frequency. 2) We reduce the repeated texture artifacts seen in disocclusions by enforcing a constraint that the appearance of hidden content at any depth must be drawn from visible content at or behind that depth. Please see our results video at: this https URL.

...read moreread less

Proceedings Article•DOI•

Example-Guided Style-Consistent Image Synthesis From Semantic Labeling

[...]

Miao Wang¹, Guo-Ye Yang², Ruilong Li², Run-Ze Liang², Song-Hai Zhang², Peter Hall³, Shi-Min Hu² - Show less +3 more•Institutions (3)

Beihang University¹, Tsinghua University², University of Bath³

15 Jun 2019

TL;DR: The authors propose a style consistency discriminator to determine whether a pair of images are consistent in style, and an adaptive semantic consistency loss for synthesizing style-consistent results to the exemplar.

...read moreread less

Abstract: Example-guided image synthesis aims to synthesize an image from a semantic label map and an exemplary image indicating style. We use the term "style" in this problem to refer to implicit characteristics of images, for example: in portraits "style" includes gender, racial identity, age, hairstyle; in full body pictures it includes clothing; in street scenes it refers to weather and time of day and such like. A semantic label map in these cases indicates facial expression, full body pose, or scene segmentation. We propose a solution to the example-guided image synthesis problem using conditional generative adversarial networks with style consistency. Our key contributions are (i) a novel style consistency discriminator to determine whether a pair of images are consistent in style; (ii) an adaptive semantic consistency loss; and (iii) a training data sampling strategy, for synthesizing style-consistent results to the exemplar. We demonstrate the efficiency of our method on face, dance and street view synthesis tasks.

...read moreread less

Journal Article•DOI•

Peeking behind objects: Layered depth prediction from a single image

[...]

Helisa Dhamo¹, Keisuke Tateno¹, Iro Laina¹, Nassir Navab¹, Federico Tombari¹ - Show less +1 more•Institutions (1)

Technische Universität München¹

01 Jul 2019-Pattern Recognition Letters

TL;DR: A novel approach based on Convolutional Neural Networks to jointly predict depth maps and foreground separation masks used to condition Generative Adversarial Networks for hallucinating plausible color and depths in the initially occluded areas is proposed.

...read moreread less

Proceedings Article•DOI•

Spherical View Synthesis for Self-Supervised 360° Depth Estimation

[...]

Nikolaos Zioulis¹, Antonis Karakottas, Dimitrios Zarpalas, Federico Alvarez¹, Petros Daras - Show less +1 more•Institutions (1)

Technical University of Madrid¹

01 Sep 2019

TL;DR: In this article, the authors explore spherical view synthesis for learning monocular 360 depth in a self-supervised manner and demonstrate its feasibility for horizontal and vertical baselines, as well as for the trinocular case.

...read moreread less

Abstract: Learning based approaches for depth perception are limited by the availability of clean training data. This has led to the utilization of view synthesis as an indirect objective for learning depth estimation using efficient data acquisition procedures. Nonetheless, most research focuses on pinhole based monocular vision, with scarce works presenting results for omnidirectional input. In this work, we explore spherical view synthesis for learning monocular 360 depth in a self-supervised manner and demonstrate its feasibility. Under a purely geometrically derived formulation we present results for horizontal and vertical baselines, as well as for the trinocular case. Further, we show how to better exploit the expressiveness of traditional CNNs when applied to the equirectangular domain in an efficient manner. Finally, given the availability of ground truth depth data, our work is uniquely positioned to compare view synthesis against direct supervision in a consistent and fair manner. The results indicate that alternative research directions might be better suited to enable higher quality depth perception. Our data, models and code are publicly available at https://vcl3d.github.io/SphericalViewSynthesis/.

...read moreread less

Proceedings Article•DOI•

Monocular Neural Image Based Rendering With Continuous View Control

[...]

Jie Song¹, Xu Chen¹, Otmar Hilliges¹•Institutions (1)

ETH Zurich¹

01 Oct 2019

TL;DR: In this paper, a learning pipeline determines the output pixels directly from the source color, which leads to more accurate view synthesis under continuous 6-DoF camera control and outperforms state-of-the-art baseline methods on public datasets.

...read moreread less

Abstract: We propose a method to produce a continuous stream of novel views under fine-grained (e.g., 1 degree step-size) camera control at interactive rates. A novel learning pipeline determines the output pixels directly from the source color. Injecting geometric transformations, including perspective projection, 3D rotation and translation into the network forces implicit reasoning about the underlying geometry. The latent 3D geometry representation is compact and meaningful under 3D transformation, being able to produce geometrically accurate views for both single objects and natural scenes. Our experiments show that both proposed components, the transforming encoder-decoder and depth-guided appearance mapping, lead to significantly improved generalization beyond the training views and in consequence to more accurate view synthesis under continuous 6-DoF camera control. Finally, we show that our method outperforms state-of-the-art baseline methods on public datasets.

...read moreread less

Journal Article•DOI•

No-Reference Quality Assessment for View Synthesis Using DoG-Based Edge Statistics and Texture Naturalness

[...]

Yu Zhou¹, Leida Li², Shiqi Wang³, Jinjian Wu², Yuming Fang⁴, Xinbo Gao² - Show less +2 more•Institutions (4)

China University of Mining and Technology¹, Xidian University², City University of Hong Kong³, Jiangxi University of Finance and Economics⁴

26 Apr 2019-IEEE Transactions on Image Processing

TL;DR: A no-reference quality index for Synthesized views using DoG-based Edge statistics and Texture naturalness (SET) and the experimental results demonstrate that the proposed metric is advantageous over the relevant state of the art in dealing with the distortions in the whole view synthesis process.

...read moreread less

Abstract: View synthesis is a key technique in free-viewpoint video, which renders virtual views based on texture and depth images. The distortions in synthesized views come from two stages, i.e., the stage of the acquisition and processing of texture and depth images, and the rendering stage using depth-image-based-rendering (DIBR) algorithms. The existing view synthesis quality metrics are designed for the distortions caused by a single stage, which cannot accurately evaluate the quality of the entire view synthesis process. With the considerations that the distortions introduced by two stages both cause edge degradation and texture unnaturalness, and the Difference-of-Gaussian (DoG) representation is powerful in capturing image edge and texture characteristics by simulating the center-surrounding receptive fields of retinal ganglion cells of human eyes, this paper presents a no-reference quality index for Synthesized views using DoG-based Edge statistics and Texture naturalness (SET). To mimic the multi-scale property of the human visual system (HVS), DoG images are first calculated at multiple scales. Then, the orientation selective statistics features and the texture naturalness features are calculated on the DoG images and the coarsest scale image, producing two groups of quality-aware features. Finally, the quality model is learnt from these features using the random forest regression model. The experimental results on two view synthesis image databases demonstrate that the proposed metric is advantageous over the relevant state of the art in dealing with the distortions in the whole view synthesis process.

...read moreread less

Journal Article•DOI•

Deeply Supervised Depth Map Super-Resolution as Novel View Synthesis

[...]

Xibin Song¹, Yuchao Dai², Xueying Qin¹•Institutions (2)

Shandong University¹, Northwestern Polytechnical University²

01 Aug 2019-IEEE Transactions on Circuits and Systems for Video Technology

TL;DR: To handle large up-sampling factors, a deeply supervised network structure is presented to enforce strong supervision in each stage of the network and a multi-scale fusion strategy is proposed to effectively exploit the feature maps at different scales and handle the blocking effect.

...read moreread less

Abstract: Deep convolutional neural network (DCNN) has been successfully applied to depth map super-resolution and outperforms existing methods by a wide margin. However, there still exist two major issues with these DCNN-based depth map super-resolution methods that hinder the performance: 1) the low-resolution depth maps either need to be up-sampled before feeding into the network or substantial deconvolution has to be used and 2) the supervision (high-resolution depth maps) is only applied at the end of the network, thus it is difficult to handle large up-sampling factors, such as $\times 8$ and $\times 16$ . In this paper, we propose a new framework to tackle the above problems. First, we propose to represent the task of depth map super-resolution as a series of novel view synthesis sub-tasks. The novel view synthesis sub-task aims at generating (synthesizing) a depth map from a different camera pose, which could be learned in parallel. Second, to handle large up-sampling factors, we present a deeply supervised network structure to enforce strong supervision in each stage of the network. Third, a multi-scale fusion strategy is proposed to effectively exploit the feature maps at different scales and handle the blocking effect. In this way, our proposed framework could deal with challenging depth map super-resolution efficiently under large up-sampling factors (e.g., $\times 8$ and $\times 16$ ). Our method only uses the low-resolution depth map as input, and the support of color image is not needed, which greatly reduces the restriction of our method. Extensive experiments on various benchmarking data sets demonstrate the superiority of our method over current state-of-the-art depth map super-resolution methods.

...read moreread less

Proceedings Article•DOI•

View Independent Generative Adversarial Network for Novel View Synthesis

[...]

Xiaogang Xu¹, Ying-Cong Chen¹, Jiaya Jia¹•Institutions (1)

The Chinese University of Hong Kong¹

01 Oct 2019

TL;DR: This paper proposes an encoder-decoder based generative adversarial network VI-GAN to tackle the problem of synthesizing novel views from a 2D image and makes the decoder hallucinate the image of a novel view based on the extracted feature and an arbitrary user-specific camera pose.

...read moreread less

Abstract: Synthesizing novel views from a 2D image requires to infer 3D structure and project it back to 2D from a new viewpoint. In this paper, we propose an encoder-decoder based generative adversarial network VI-GAN to tackle this problem. Our method is to let the network, after seeing many images of objects belonging to the same category in different views, obtain essential knowledge of intrinsic properties of the objects. To this end, an encoder is designed to extract view-independent feature that characterizes intrinsic properties of the input image, which includes 3D structure, color, texture etc. We also make the decoder hallucinate the image of a novel view based on the extracted feature and an arbitrary user-specific camera pose. Extensive experiments demonstrate that our model can synthesize high-quality images in different views with continuous camera poses, and is general for various applications.

...read moreread less

Journal Article•DOI•

A Benchmark of DIBR Synthesized View Quality Assessment Metrics on a New Database for Immersive Media Applications

[...]

Shishun Tian¹, Lu Zhang¹, Luce Morin¹, Olivier Deforges¹•Institutions (1)

University of Rennes¹

01 May 2019-IEEE Transactions on Multimedia

TL;DR: A new DIBR-synthesized image database with the associated subjective scores is presented and subjective test results show that the interview synthesis methods, having more input information, significantly outperform the single-view-based ones.

...read moreread less

Abstract: Depth-image-based rendering (DIBR) is a funda-mental technology in several 3-D-related applications, such as free viewpoint video, virtual reality, and augmented reality. However, new challenges have also been brought in assessing the quality of DIBR-synthesized views since this process induces some new types of distortions, which are inherently different from the distortion caused by video coding. In this paper, we present a new DIBR-synthesized image database with the associated subjective scores. We also test the performances of the state-of-the-art objective quality metrics on this database. This paper focuses on the distortions only induced by different DIBR synthesis methods. Seven state-of-the-art DIBR algorithms, including inter-view synthesis and single-view-based synthesis methods, are considered in this database. The quality of synthesized views was assessed subjectively by 41 observers and objectively using 14 state-of-the-art objective metrics. Subjective test results show that the interview synthesis methods, having more input information, significantly outperform the single-view-based ones. Correlation results between the tested objective metrics and the subjective scores on this database reveal that further studies are still needed for a better objective quality metric dedicated to the DIBR-synthesized views.

...read moreread less

Journal Article•DOI•

VUNet: Dynamic Scene View Synthesis for Traversability Estimation Using an RGB Camera

[...]

Noriaki Hirose¹, Amir Sadeghian¹, Fei Xia¹, Roberto Martín-Martín¹, Silvio Savarese¹ - Show less +1 more•Institutions (1)

Stanford University¹

23 Jan 2019

TL;DR: Hirose et al. as mentioned in this paper presented a view synthesis method for mobile robots in dynamic environments, and its application to the estimation of future traversability, which predicts future images for given virtual robot velocity commands using only RGB images at previous and current time steps.

...read moreread less

Abstract: We present VUNet, a novel view(VU) synthesis method for mobile robots in dynamic environments, and its application to the estimation of future traversability. Our method predicts future images for given virtual robot velocity commands using only RGB images at previous and current time steps. The future images result from applying two types of image changes to the previous and current images: first, changes caused by different camera pose. Second, changes due to the motion of the dynamic obstacles. We learn to predict these two types of changes disjointly using two novel network architectures, SNet and DNet. We combine SNet and DNet to synthesize future images that we pass to our previously presented method GONet [N. Hirose, A. Sadeghian, M. Vazquez, P. Goebel, and S. Savarese, “Gonet: A semi-supervised deep learning approach for traversability estimation,” in Proc. IEEE International Conference on Intelligent Robots and Systems, 2018, pp. 3044–3051] to estimate the traversable areas around the robot. Our quantitative and qualitative evaluation indicate that our approach for view synthesis predicts accurate future images in both static and dynamic environments. We also show that these virtual images can be used to estimate future traversability correctly. We apply our view synthesis-based traversability estimation method to two applications for assisted teleoperation.

...read moreread less

Posted Content•

Spherical View Synthesis for Self-Supervised 360 Depth Estimation

[...]

Nikolaos Zioulis¹, Antonis Karakottas, Dimitrios Zarpalas, Federico Alvarez¹, Petros Daras - Show less +1 more•Institutions (1)

Technical University of Madrid¹

17 Sep 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work explores spherical view synthesis for learning monocular 360 depth in a self-supervised manner and demonstrates its feasibility, and shows how to better exploit the expressiveness of traditional CNNs when applied to the equirectangular domain in an efficient manner.

...read moreread less

Abstract: Learning based approaches for depth perception are limited by the availability of clean training data. This has led to the utilization of view synthesis as an indirect objective for learning depth estimation using efficient data acquisition procedures. Nonetheless, most research focuses on pinhole based monocular vision, with scarce works presenting results for omnidirectional input. In this work, we explore spherical view synthesis for learning monocular 360 depth in a self-supervised manner and demonstrate its feasibility. Under a purely geometrically derived formulation we present results for horizontal and vertical baselines, as well as for the trinocular case. Further, we show how to better exploit the expressiveness of traditional CNNs when applied to the equirectangular domain in an efficient manner. Finally, given the availability of ground truth depth data, our work is uniquely positioned to compare view synthesis against direct supervision in a consistent and fair manner. The results indicate that alternative research directions might be better suited to enable higher quality depth perception. Our data, models and code are publicly available at this https URL.

...read moreread less

Journal Article•DOI•

A Real-Time High-Quality Complete System for Depth Image-Based Rendering on FPGA

[...]

Yanzhe Li¹, Luc Claesen², Kai Huang¹, Menglian Zhao¹•Institutions (2)

Zhejiang University¹, University of Hasselt²

01 Apr 2019-IEEE Transactions on Circuits and Systems for Video Technology

TL;DR: A real-time high-quality DIBR system that consists of disparity estimation and view synthesis is proposed and a local approach that focuses on depth discontinuities and disparity smoothness is presented to improve the disparity accuracy.

...read moreread less

Abstract: Depth image-based rendering (DIBR) techniques have recently drawn more attention in various 3D applications. In this paper, a real-time high-quality DIBR system that consists of disparity estimation and view synthesis is proposed. For disparity estimation, a local approach that focuses on depth discontinuities and disparity smoothness is presented to improve the disparity accuracy. For view synthesis, a method that contains view interpolation and extrapolation is proposed to render high-quality virtual views. Moreover, the system is designed with an optimized parallelism scheme to achieve a high throughput, and can be scaled up easily. It is implemented on an Altera Stratix IV FPGA at a processing speed of 45 frames per second for 1080p resolution. Evaluated on selected image sets of the Middlebury benchmark, the average error rate of the disparity maps is 6.02%; the average peak signal to noise ratio and structural similarity values of the virtual views are 30.07 dB and 0.9303, respectively. The experimental results indicate that the proposed DIBR system has the top-performing processing speed and its accuracy performance is among the best of state-of-the-art hardware implementations.

...read moreread less

Journal Article•DOI•

Region-of-interest compression and view synthesis for light field video streaming

[...]

Bing Wang¹, Qiang Peng¹, Eric Wang², Kang Han², Wei Xiang² - Show less +1 more•Institutions (2)

Southwest Jiaotong University¹, James Cook University²

26 Mar 2019-IEEE Access

TL;DR: A new region-of-interest (ROI)-based video compression method is designed for light field videos and a novel view synthesis algorithm is presented to generate arbitrary viewpoints at the receiver to improve the compression performance.

...read moreread less

Abstract: Light field videos provide a rich representation of real-world, thus the research of this technology is of urgency and interest for both the scientific community and industries. Light field applications such as virtual reality and post-production in the movie industry require a large number of viewpoints of the captured scene to achieve an immersive experience, and this creates a significant burden on light field compression and streaming. In this paper, we first present a light field video dataset captured with a plenoptic camera. Then a new region-of-interest (ROI)-based video compression method is designed for light field videos. In order to further improve the compression performance, a novel view synthesis algorithm is presented to generate arbitrary viewpoints at the receiver. The experimental evaluation of four light field video sequences demonstrates that the proposed ROI-based compression method can save 5%-7% in bitrates in comparison to conventional light field video compression methods. Furthermore, the proposed view synthesis-based compression method not only can achieve a reduction of about 50% in bitrates against conventional compression methods, but the synthesized views can exhibit identical visual quality as their ground truth.

...read moreread less

Proceedings Article•DOI•

Improving 3D Object Detection for Pedestrians with Virtual Multi-View Synthesis Orientation Estimation

[...]

Jason Ku¹, Alex D. Pon¹, Sean Walsh¹, Steven L. Waslander¹•Institutions (1)

University of Toronto¹

01 Nov 2019

TL;DR: This paper presents a flexible Virtual Multi-View Synthesis module that can be adopted into 3D object detection methods to improve orientation estimation and shows that this module greatly improves the orientation estimation on the challenging pedestrian class on the KITTI benchmark.

...read moreread less

Abstract: Accurately estimating the orientation of pedestrians is an important and challenging task for autonomous driving because this information is essential for tracking and predicting pedestrian behavior. This paper presents a flexible Virtual Multi-View Synthesis module that can be adopted into 3D object detection methods to improve orientation estimation. The module uses a multi-step process to acquire the fine-grained semantic information required for accurate orientation estimation. First, the scene’s point cloud is densified using a structure preserving depth completion algorithm and each point is colorized using its corresponding RGB pixel. Next, virtual cameras are placed around each object in the densified point cloud to generate novel viewpoints, which preserve the object’s appearance. We show that this module greatly improves the orientation estimation on the challenging pedestrian class on the KITTI benchmark. When used with the open-source 3D detector AVOD-FPN, we outperform all other published methods on the pedestrian Orientation, 3D, and Bird’s Eye View benchmarks.

...read moreread less

Proceedings Article•

PerspectiveNet: A Scene-consistent Image Generator for New View Synthesis in Real Indoor Environments

[...]

David Novotny¹, Ben Graham², Jeremy Reizenstein³•Institutions (3)

University of Oxford¹, Facebook², University of Warwick³

01 Jan 2019

TL;DR: This work devise an approach that exploits known geometric properties of the scene (per-frame camera extrinsics and depth) in order to warp reference views into the new ones, and obtains images that are geometrically consistent with all the views in the scene camera system.

...read moreread less

Abstract: Given a set of a reference RGBD views of an indoor environment, and a new viewpoint, our goal is to predict the view from that location. Prior work on new-view generation has predominantly focused on significantly constrained scenarios, typically involving artificially rendered views of isolated CAD models. Here we tackle a much more challenging version of the problem. We devise an approach that exploits known geometric properties of the scene (per-frame camera extrinsics and depth) in order to warp reference views into the new ones. The defects in the generated views are handled by a novel RGBD inpainting network, PerspectiveNet, that is fine-tuned for a given scene in order to obtain images that are geometrically consistent with all the views in the scene camera system. Experiments conducted on the ScanNet and SceneNet datasets reveal performance superior to strong baselines.

...read moreread less

Proceedings Article•DOI•

MPEG-I Depth Estimation Reference Software

[...]

Segolene Rogge¹, Daniele Bonatto², Jaime Sancho³, Ruben Salvador³, Eduardo Juarez³, Adrian Munteanu¹, Gauthier Lafruit² - Show less +3 more•Institutions (3)

Vrije Universiteit Brussel¹, Université libre de Bruxelles², Technical University of Madrid³

11 Dec 2019

TL;DR: This paper is an attempt to deliver good-quality Depth Estimation Reference Software (DERS) that is well-structured for further use in the worldwide MPEG standardization committee.

...read moreread less

Abstract: For enabling virtual reality on natural content, Depth Image-Based Rendering (DIBR) techniques have been steadily developed over the past decade, but their quality highly depends on that of the depth estimation. This paper is an attempt to deliver good-quality Depth Estimation Reference Software (DERS) that is well-structured for further use in the worldwide MPEG standardization committee.The existing DERS has been refactored, debugged and extended to any number of input views for generating accurate depth maps. Their quality has been validated by synthesizing DIBR virtual views with the Reference View Synthesizer (RVS) and the Versatile View Synthesizer (VVS), using the available MPEG test sequences. Resulting images and runtimes are reported.

...read moreread less

Journal Article•DOI•

Depth-Assisted Full Resolution Network for Single Image-Based View Synthesis

[...]

Xiaodong Cun¹, Feng Xu², Chi-Man Pun¹, Hao Gao³•Institutions (3)

University of Macau¹, Tsinghua University², Nanjing University of Posts and Telecommunications³

01 Mar 2019-IEEE Computer Graphics and Applications

TL;DR: A full resolution network to extract fine-scale image features, which contributes to prevent blurry artifacts and a synthesis layer is used to not only warp the observed pixels to the desired positions but also hallucinate the missing pixels from other recorded pixels.

...read moreread less

Abstract: Research works in novel viewpoint synthesis are based mainly on multiview input images. In this paper, we focus on a more challenging and ill-posed problem that is to synthesize surrounding novel viewpoints from a single image. To achieve this goal, we design a full resolution network to extract fine-scale image features, which contributes to prevent blurry artifacts. We also involve a pretrained relative depth estimation network, thus three-dimensional information is utilized to infer the flow field between the input and the target image. Since the depth network is trained by depth order between any pair of objects, large-scale image features are also involved in our system. Finally, a synthesis layer is used to not only warp the observed pixels to the desired positions but also hallucinate the missing pixels from other recorded pixels. Experiments show that our technique successfully synthesizes reasonable novel viewpoints surrounding the input, while other state-of-the-art techniques fail.

...read moreread less

Showing papers on "View synthesis published in 2019"