Showing papers by "Christian Theobalt" published in 2021

Open Access

Proceedings Article

Ayush Tewari, Ohad Fried¹, Justus Thies², Vincent Sitzmann³, Stephen Lombardi⁴, Zexiang Xu⁵, Tomas Simon⁴, Matthias Nießner⁶, Edgar Tretschk, Lingjie Liu, Ben Mildenhall⁷, Pratul P. Srinivasan⁷, Rohit Pandey⁷, Sergio Orts-Escolano⁷, Sean Fanello⁷, M. Guo⁸, Gordon Wetzstein⁸, Jun-Yan Zhu⁹, Christian Theobalt, Maneesh Agrawala⁸, Dan B. Goldman⁷, Michael Zollhöfer⁴•Institutions (9)

Interdisciplinary Center Herzliya¹, Max Planck Society², Massachusetts Institute of Technology³, Facebook⁴, Adobe Systems⁵, Technische Universität München⁶, Google⁷, Stanford University⁸, Carnegie Mellon University⁹

09 Aug 2021

TL;DR: Loss functions for Neural Rendering Jun-Yan Zhu shows the importance of knowing the number of neurons in the system and how many neurons are firing at the same time.

Abstract: Loss functions for Neural Rendering Jun-Yan Zhu

174 citations

Proceedings Article

i3DMM: Deep Implicit 3D Morphable Model of Human Heads

Tarun Yenamandra¹, Ayush Tewari, Florian Bernard¹, Hans-Peter Seidel, Mohamed Elgharib, Daniel Cremers¹, Christian Theobalt•Institutions (1)

Technische Universität München¹

01 Jun 2021

TL;DR: This paper presents the first deep implicit 3D morphable model (i3DMM) of full heads, which captures identity-specific geometry, texture, and expressions of the frontal face and also models the entire head, including hair.

Abstract: We present the first deep implicit 3D morphable model (i3DMM) of full heads. Unlike earlier morphable face models, it not only captures identity-specific geometry, texture, and expressions of the frontal face, but also models the entire head, including hair. We collect a new dataset consisting of 64 people with different expressions and hairstyles to train i3DMM. Our approach has the following favorable properties: (i) It is the first full head morphable model that includes hair. (ii) In contrast to mesh-based models, it can be trained on merely rigidly aligned scans, without requiring difficult non-rigid registration. (iii) We design a novel architecture to decouple the shape model into an implicit reference shape and a deformation of this reference shape. With that, dense correspondences between shapes can be learned implicitly. (iv) This architecture allows us to semantically disentangle the geometry and color components, as color is learned in the reference space. Geometry is further disentangled as identity, expressions, and hairstyle, while color is disentangled as identity and hairstyle components. We show the merits of i3DMM using ablation studies, comparisons to state-of-the-art models, and applications such as semantic head editing and texture transfer. We will make our model publicly available.

80 citations

Proceedings Article

Learning Speech-driven 3D Conversational Gestures from Video

Ikhsanul Habibie¹, Weipeng Xu², Dushyant Mehta¹, Lingjie Liu¹, Hans-Peter Seidel¹, Gerard Pons-Moll³, Mohamed Elgharib¹, Christian Theobalt¹•Institutions (3)

Max Planck Society¹, Facebook², University of Tübingen³

14 Sep 2021

TL;DR: In this article, the authors propose an approach to synthesize the synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input.

Abstract: We propose the first approach to synthesize the synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesis of conversational body gestures is a multi-modal problem since many similar gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this setting, we train a Generative Adversarial Network (GAN) based model that measures the plausibility of the generated sequences of 3D body motion when paired with the input audio features. We also contribute a new corpus that contains more than 33 hours of annotated data from in-the-wild videos of talking people. To this end, we apply state-of-the-art monocular approaches for 3D body and hand pose estimation as well as 3D face performance capture to the video corpus. In this way, we can train on orders of magnitude more data than previous algorithms that resort to complex in-studio motion capture solutions, and thereby train more expressive synthesis algorithms. Our experiments and user study show the state-of-the-art quality of our speech-synthesized full 3D character animations.

38 citations

Proceedings Article

Quantum Permutation Synchronization

Tolga Birdal¹, Vladislav Golyanik², Christian Theobalt², Leonidas J. Guibas¹•Institutions (2)

Stanford University¹, Max Planck Society²

01 Jun 2021

TL;DR: In this paper, the first quantum algorithm for solving a synchronization problem in the context of computer vision is presented, which involves solving a non-convex optimization problem in discrete variables.

Abstract: We present QuantumSync, the first quantum algorithm for solving a synchronization problem in the context of computer vision. In particular, we focus on permutation synchronization, which involves solving a non-convex optimization problem in discrete variables. We start by formulating synchronization as a quadratic unconstrained binary optimization problem (QUBO). While such a formulation respects the binary nature of the problem, ensuring that the result is a set of permutations requires extra care. Hence, we: (i) show how to insert permutation constraints into a QUBO problem and (ii) solve the constrained QUBO problem on the current generation of D-Wave adiabatic quantum computers. Thanks to quantum annealing, we guarantee global optimality with high probability while sampling the energy landscape to yield confidence estimates. Our proof-of-concept realization on the adiabatic D-Wave computer demonstrates that quantum machines offer a promising way to solve the prevalent yet difficult synchronization problems.
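
As an illustration of point (i), one common way to push permutation constraints into a QUBO is a quadratic penalty on the row and column sums of the binary matrix. The sketch below is not taken from the paper: the function names, the penalty weight `lam`, and the brute-force check (standing in for a quantum annealer) are assumptions made for this example.

```python
import itertools
import numpy as np

def permutation_penalty_qubo(n, lam=1.0):
    """Build Q and a constant so that x^T Q x + offset = lam * ||A x - 1||^2
    for binary x, where A x collects the row and column sums of the n x n
    matrix X (flattened row-major into x)."""
    A = np.zeros((2 * n, n * n))
    for i in range(n):
        for j in range(n):
            A[i, i * n + j] = 1.0      # row-sum constraint i
            A[n + j, i * n + j] = 1.0  # column-sum constraint j
    b = np.ones(2 * n)
    # For binary variables x_k^2 = x_k, so the linear term -2 b^T A x
    # can be folded into the diagonal of Q.
    Q = lam * (A.T @ A - 2.0 * np.diag(A.T @ b))
    offset = lam * float(b @ b)
    return Q, offset

def brute_force_minimizers(Q):
    """Enumerate all binary vectors (small n only) and return the minimum
    energy together with every vector that attains it."""
    m = Q.shape[0]
    best_energy, best = np.inf, []
    for bits in itertools.product((0, 1), repeat=m):
        x = np.array(bits, dtype=float)
        energy = float(x @ Q @ x)
        if energy < best_energy - 1e-9:
            best_energy, best = energy, [x]
        elif abs(energy - best_energy) <= 1e-9:
            best.append(x)
    return best_energy, best
```

With this penalty, the energy plus the offset is zero exactly when every row and column of X sums to one, so the minimizers are the permutation matrices. In a full problem, the penalty matrix would be added to the QUBO matrix of the synchronization objective, with `lam` chosen large enough to dominate it.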

34 citations

Journal Article

Real-time deep dynamic characters

Marc Habermann¹, Lingjie Liu¹, Weipeng Xu², Michael Zollhoefer², Gerard Pons-Moll¹, Christian Theobalt¹•Institutions (2)

Max Planck Society¹, Facebook²

17 Jul 2021-ACM Transactions on Graphics

TL;DR: In this paper, a graph convolutional network architecture is used to enable motion-dependent deformation learning of body and clothing, including dynamics, and a neural generative dynamic texture model creates corresponding dynamic texture maps.

Abstract: We propose a deep videorealistic 3D human character model displaying highly realistic shape, motion, and dynamic appearance learned in a new weakly supervised way from multi-view imagery. In contrast to previous work, our controllable 3D character displays dynamics, e.g., the swing of the skirt, dependent on skeletal body motion in an efficient data-driven way, without requiring complex physics simulation. Our character model also features a learned dynamic texture model that accounts for photo-realistic motion-dependent appearance details, as well as view-dependent lighting effects. During training, we do not need to resort to difficult dynamic 3D capture of the human; instead we can train our model entirely from multi-view video in a weakly supervised manner. To this end, we propose a parametric and differentiable character representation which allows us to model coarse and fine dynamic deformations, e.g., garment wrinkles, as explicit spacetime coherent mesh geometry that is augmented with high-quality dynamic textures dependent on motion and view point. As input to the model, only an arbitrary 3D skeleton motion is required, making it directly compatible with the established 3D animation pipeline. We use a novel graph convolutional network architecture to enable motion-dependent deformation learning of body and clothing, including dynamics, and a neural generative dynamic texture model creates corresponding dynamic texture maps. We show that by merely providing new skeletal motions, our model creates motion-dependent surface deformations, physically plausible dynamic clothing deformations, as well as video-realistic surface textures at a much higher level of detail than previous state of the art approaches, and even in real-time.

31 citations

Proceedings Article

Learning Complete 3D Morphable Face Models from Images and Videos

Mallikarjun B R¹, Ayush Tewari¹, Hans-Peter Seidel¹, Mohamed Elgharib¹, Christian Theobalt¹•Institutions (1)

Max Planck Society¹

01 Jun 2021

TL;DR: In this article, a self-supervised learning-based approach is proposed to learn complete 3D models of face identity and expression geometry, and reflectance, just from images and videos.

Abstract: Most 3D face reconstruction methods rely on 3D morphable models, which disentangle the space of facial deformations into identity and expression geometry, and skin reflectance. These models are typically learned from a limited number of 3D scans and thus do not generalize well across different identities and expressions. We present the first approach to learn complete 3D models of face identity and expression geometry, and reflectance, just from images and videos. The virtually endless collection of such data, in combination with our self-supervised learning-based approach allows for learning face models that generalize beyond the span of existing approaches. Our network design and loss functions ensure a disentangled parameterization of not only identity and albedo, but also, for the first time, an expression basis. Our method also allows for in-the-wild monocular reconstruction at test time. We show that our learned models better generalize and lead to higher quality image-based reconstructions than existing approaches. We show that the learned model can also be personalized to a video, for a better capture of the geometry and albedo.

31 citations

Proceedings Article

High-Fidelity Neural Human Motion Transfer from Monocular Video

Moritz Kappel¹, Vladislav Golyanik², Mohamed Elgharib², Jann-Ole Henningson¹, Hans-Peter Seidel², Susana Castillo¹, Christian Theobalt², Marcus Magnor¹•Institutions (2)

Braunschweig University of Technology¹, Max Planck Society²

01 Jun 2021

TL;DR: In this paper, a framework is presented that performs high-fidelity and temporally consistent human motion transfer with natural pose-dependent non-rigid deformations for several types of loose garments, generating images in three subsequent stages: synthesizing human shape, structure, and appearance.

Abstract: Video-based human motion transfer creates video animations of humans following a source motion. Current methods show remarkable results for tightly-clad subjects. However, the lack of temporally consistent handling of plausible clothing dynamics, including fine and high-frequency details, significantly limits the attainable visual quality. We address these limitations for the first time in the literature and present a new framework which performs high-fidelity and temporally-consistent human motion transfer with natural pose-dependent non-rigid deformations, for several types of loose garments. In contrast to the previous techniques, we perform image generation in three subsequent stages: synthesizing human shape, structure, and appearance. Given a monocular RGB video of an actor, we train a stack of recurrent deep neural networks that generate these intermediate representations from 2D poses and their temporal derivatives. Splitting the difficult motion transfer problem into subtasks that are aware of the temporal motion context helps us to synthesize results with plausible dynamics and pose-dependent detail. It also allows artistic control of results by manipulation of individual framework stages. In the experimental results, we significantly outperform the state-of-the-art in terms of video realism. The source code is available at https://graphics.tu-bs.de/publications/kappel2020high-fidelity.

25 citations

Proceedings Article

Pose-Guided Human Animation from a Single Image in the Wild

Jae Shin Yoon¹, Lingjie Liu², Vladislav Golyanik², Kripasindhu Sarkar², Hyun Soo Park¹, Christian Theobalt²•Institutions (2)

University of Minnesota¹, Max Planck Society²

01 Jun 2021

TL;DR: In this article, a compositional neural network is designed to predict the silhouette, garment labels, and textures of a person from a single image of the person controlled by a sequence of body poses.

Abstract: We present a new pose transfer method for synthesizing a human animation from a single image of a person controlled by a sequence of body poses. Existing pose transfer methods exhibit significant visual artifacts when applied to a novel scene, resulting in temporal inconsistency and failures in preserving the identity and textures of the person. To address these limitations, we design a compositional neural network that predicts the silhouette, garment labels, and textures. Each modular network is explicitly dedicated to a subtask that can be learned from synthetic data. At inference time, we utilize the trained network to produce a unified representation of appearance and its labels in UV coordinates, which remains constant across poses. The unified representation provides incomplete yet strong guidance for generating the appearance in response to the pose change. We use the trained network to complete the appearance and render it with the background. With these strategies, we are able to synthesize human animations that can preserve the identity and appearance of the person in a temporally coherent way without any fine-tuning of the network on the testing scene. Experiments show that our method outperforms the state of the art in terms of synthesis quality, temporal coherence, and generalization ability.

22 citations

Journal Article

PhotoApp: photorealistic appearance editing of head portraits

Mallikarjun B R, Ayush Tewari, Abdallah Dib¹, Tim Weyrich², Bernd Bickel³, Hans-Peter Seidel, Hanspeter Pfister⁴, Wojciech Matusik⁵, Louis Chevallier¹, Mohamed Elgharib, Christian Theobalt•Institutions (5)

InterDigital, Inc.¹, University College London², Institute of Science and Technology Austria³, Harvard University⁴, Massachusetts Institute of Technology⁵

17 Jul 2021-ACM Transactions on Graphics

TL;DR: In this paper, the camera viewpoint and scene illumination are modelled in the latent space of StyleGAN to produce high-quality photorealistic results for in-the-wild images and significantly outperforms existing methods.

Abstract: Photorealistic editing of head portraits is a challenging task as humans are very sensitive to inconsistencies in faces. We present an approach for high-quality intuitive editing of the camera viewpoint and scene illumination (parameterised with an environment map) in a portrait image. This requires our method to capture and control the full reflectance field of the person in the image. Most editing approaches rely on supervised learning using training data captured with setups such as light and camera stages. Such datasets are expensive to acquire, not readily available and do not capture all the rich variations of in-the-wild portrait images. In addition, most supervised approaches only focus on relighting, and do not allow camera viewpoint editing. Thus, they only capture and control a subset of the reflectance field. Recently, portrait editing has been demonstrated by operating in the generative model space of StyleGAN. While such approaches do not require direct supervision, there is a significant loss of quality when compared to the supervised approaches. In this paper, we present a method which learns from limited supervised training data. The training images only include people in a fixed neutral expression with eyes closed, without much hair or background variations. Each person is captured under 150 one-light-at-a-time conditions and under 8 camera poses. Instead of training directly in the image space, we design a supervised problem which learns transformations in the latent space of StyleGAN. This combines the best of supervised learning and generative adversarial modeling. We show that the StyleGAN prior allows for generalisation to different expressions, hairstyles and backgrounds. This produces high-quality photorealistic results for in-the-wild images and significantly outperforms existing methods. Our approach can edit the illumination and pose simultaneously, and runs at interactive rates.

21 citations

Journal Article

Neural monocular 3D human motion capture with physical awareness

Soshi Shimada, Vladislav Golyanik, Weipeng Xu¹, Patrick Pérez², Christian Theobalt•Institutions (2)

Facebook¹, Valeo²

17 Jul 2021-ACM Transactions on Graphics

TL;DR: In this paper, a proportional-derivative controller with gains predicted by a neural network is proposed to reduce delays even in the presence of fast motions and prevent physically implausible foot-floor penetration as a hard constraint.

Abstract: We present a new trainable system for physically plausible markerless 3D human motion capture, which achieves state-of-the-art results in a broad range of challenging scenarios. Unlike most neural methods for human motion capture, our approach, which we dub "physionical", is aware of physical and environmental constraints. It combines in a fully-differentiable way several key innovations, i.e., 1) a proportional-derivative controller, with gains predicted by a neural network, that reduces delays even in the presence of fast motions, 2) an explicit rigid body dynamics model and 3) a novel optimisation layer that prevents physically implausible foot-floor penetration as a hard constraint. The inputs to our system are 2D joint keypoints, which are canonicalised in a novel way so as to reduce the dependency on intrinsic camera parameters, both at train and test time. This enables more accurate global translation estimation without generalisability loss. Our model can be finetuned only with 2D annotations when the 3D annotations are not available. It produces smooth and physically-principled 3D motions at an interactive frame rate in a wide variety of challenging scenes, including newly recorded ones. Its advantages are especially noticeable on in-the-wild sequences that significantly differ from common 3D pose estimation benchmarks such as Human 3.6M and MPI-INF-3DHP. Qualitative results are provided in the supplementary video.
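
To make component 1) concrete, the sketch below shows a plain proportional-derivative control law driving joint angles toward a target pose. It is a simplified illustration, not the system described above: the gains `kp` and `kd` are fixed constants standing in for the network-predicted gains, the joints are assumed to have unit inertia, and the rigid body dynamics model and the foot-floor constraint are omitted.

```python
import numpy as np

def pd_torque(q, q_dot, q_target, kp, kd):
    """PD control law: pull toward the target pose and damp the velocity."""
    return kp * (q_target - q) - kd * q_dot

def simulate(q0, q_target, kp, kd, dt=0.01, steps=500):
    """Integrate independent unit-inertia joints with semi-implicit Euler."""
    q = np.asarray(q0, dtype=float).copy()
    q_dot = np.zeros_like(q)
    for _ in range(steps):
        tau = pd_torque(q, q_dot, q_target, kp, kd)
        q_dot = q_dot + dt * tau  # unit inertia: acceleration equals torque
        q = q + dt * q_dot        # position update uses the new velocity
    return q
```

Higher `kp` makes the joints follow the target more tightly, while `kd` suppresses overshoot. Letting a network set the gains for each frame, as the abstract describes, allows this trade-off to change with the motion.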

18 citations

Posted Content

Advances in Neural Rendering

Ayush Tewari, Justus Thies, Ben Mildenhall, Pratul P. Srinivasan, Edgar Tretschk, Yifan Wang, Christoph Lassner, Vincent Sitzmann, Ricardo Martin-Brualla, Stephen Lombardi, Tomas Simon, Christian Theobalt, Matthias Niessner, Jonathan T. Barron, Gordon Wetzstein, Michael Zollhoefer, Vladislav Golyanik

10 Nov 2021-arXiv: Graphics

TL;DR: In this paper, a state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, often referred to as neural scene representations.

Abstract: Synthesizing photo-realistic images and videos is at the heart of computer graphics and has been the focus of decades of research. Traditionally, synthetic images of a scene are generated using rendering algorithms such as rasterization or ray tracing, which take specifically defined representations of geometry and material properties as input. Collectively, these inputs define the actual scene and what is rendered, and are referred to as the scene representation (where a scene consists of one or more objects). Example scene representations are triangle meshes with accompanied textures (e.g., created by an artist), point clouds (e.g., from a depth sensor), volumetric grids (e.g., from a CT scan), or implicit surface functions (e.g., truncated signed distance fields). The reconstruction of such a scene representation from observations using differentiable rendering losses is known as inverse graphics or inverse rendering. Neural rendering is closely related, and combines ideas from classical computer graphics and machine learning to create algorithms for synthesizing images from real-world observations. Neural rendering is a leap forward towards the goal of synthesizing photo-realistic image and video content. In recent years, we have seen immense progress in this field through hundreds of publications that show different ways to inject learnable components into the rendering pipeline. This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, often now referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene. In addition to methods that handle static scenes, we cover neural scene representations for modeling non-rigidly deforming objects...

Proceedings Article

Monocular Real-time Full Body Capture with Inter-part Correlations

Yuxiao Zhou¹, Marc Habermann², Ikhsanul Habibie², Ayush Tewari², Christian Theobalt², Feng Xu¹•Institutions (2)

Tsinghua University¹, Max Planck Society²

01 Jun 2021

TL;DR: In this paper, the shape and motion of body and hands together with a dynamic 3D face model from a single color image are estimated using a new neural network architecture that exploits correlations between body and hand at high computational efficiency.

Abstract: We present the first method for real-time full body capture that estimates shape and motion of body and hands together with a dynamic 3D face model from a single color image. Our approach uses a new neural network architecture that exploits correlations between body and hands at high computational efficiency. Unlike previous works, our approach is jointly trained on multiple datasets focusing on hand, body or face separately, without requiring data where all the parts are annotated at the same time, which is much more difficult to create at sufficient variety. The possibility of such multi-dataset training enables superior generalization ability. In contrast to earlier monocular full body methods, our approach captures more expressive 3D face geometry and color by estimating the shape, expression, albedo and illumination parameters of a statistical face model. Our method achieves competitive accuracy on public benchmarks, while being significantly faster and providing more complete face reconstructions.

Journal Article

Real-time Global Illumination Decomposition of Videos

Abhimitra Meka¹, Mohammad Shafiei¹, Michael Zollhöfer², Christian Richardt³, Christian Theobalt¹•Institutions (3)

Max Planck Society¹, University of Pittsburgh², University of Bath³

10 Aug 2021

TL;DR: The first approach for the decomposition of a monocular color video into direct and indirect illumination components in real time is proposed and improvements over the state-of-the-art in this field are shown, in both quality and runtime.

Abstract: We propose the first approach for the decomposition of a monocular color video into direct and indirect illumination components in real time. We retrieve, in separate layers, the contribution made to the scene appearance by the scene reflectance, the light sources, and the reflections from various coherent scene regions to one another. Existing techniques that invert global light transport require image capture under multiplexed controlled lighting or only enable the decomposition of a single image at slow off-line frame rates. In contrast, our approach works for regular videos and produces temporally coherent decomposition layers at real-time frame rates. At the core of our approach are several sparsity priors that enable the estimation of the per-pixel direct and indirect illumination layers based on a small set of jointly estimated base reflectance colors. The resulting variational decomposition problem uses a new formulation based on sparse and dense sets of non-linear equations that we solve efficiently using a novel alternating data-parallel optimization strategy. We evaluate our approach qualitatively and quantitatively and show improvements over the state-of-the-art in this field, in both quality and runtime. In addition, we demonstrate various real-time appearance editing applications for videos with consistent illumination.

Journal Article

Real-time Pose and Shape Reconstruction of Two Interacting Hands With a Single Depth Camera

Franziska Mueller¹, Micah Davis¹, Florian Bernard¹, Oleksandr Sotnychenko², Mickeal Verschoor², Miguel A. Otaduy², Dan Casas¹, Christian Theobalt³•Institutions (3)

Max Planck Society¹, King Juan Carlos University², Association for Computing Machinery³

15 Jun 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a method for real-time pose and shape reconstruction of two strongly interacting hands is presented, which combines an extensive list of favorable properties: it is markerless, uses a single consumer-level depth camera, runs in real time, handles inter- and intra-hand collisions, and automatically adjusts to the user's hand shape.

Abstract: We present a novel method for real-time pose and shape reconstruction of two strongly interacting hands. Our approach is the first two-hand tracking solution that combines an extensive list of favorable properties, namely it is marker-less, uses a single consumer-level depth camera, runs in real time, handles inter- and intra-hand collisions, and automatically adjusts to the user's hand shape. In order to achieve this, we embed a recent parametric hand pose and shape model and a dense correspondence predictor based on a deep neural network into a suitable energy minimization framework. For training the correspondence prediction network, we synthesize a two-hand dataset based on physical simulations that includes both hand pose and shape annotations while at the same time avoiding inter-hand penetrations. To achieve real-time rates, we phrase the model fitting in terms of a nonlinear least-squares problem so that the energy can be optimized based on a highly efficient GPU-based Gauss-Newton optimizer. We show state-of-the-art results in scenes that exceed the complexity level demonstrated by previous work, including tight two-hand grasps, significant inter-hand occlusions, and gesture interaction.
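
The Gauss-Newton idea mentioned above can be illustrated on a much smaller nonlinear least-squares problem. The sketch below fits a 2D rotation and translation to point correspondences on the CPU with NumPy. It is only an analogy for the model-fitting step: the energy, the hand model, and the GPU solver of the paper are not reproduced, and all function names are assumptions made for this example.

```python
import numpy as np

def rotation(theta):
    """2D rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def residuals(params, src, dst):
    """Stacked residuals R(theta) p_i + t - q_i, ordered [x0, y0, x1, y1, ...]."""
    theta, tx, ty = params
    pred = src @ rotation(theta).T + np.array([tx, ty])
    return (pred - dst).ravel()

def jacobian(params, src):
    """Derivative of the residuals with respect to (theta, tx, ty)."""
    theta = params[0]
    c, s = np.cos(theta), np.sin(theta)
    d_rot = np.array([[-s, -c], [c, -s]])  # dR/dtheta
    J = np.zeros((2 * len(src), 3))
    J[:, 0] = (src @ d_rot.T).ravel()
    J[0::2, 1] = 1.0  # x residuals depend on tx
    J[1::2, 2] = 1.0  # y residuals depend on ty
    return J

def gauss_newton(src, dst, params0, iterations=10):
    """Repeatedly linearize the residuals and solve the linear least-squares
    problem J * step = -r for the parameter update."""
    params = np.asarray(params0, dtype=float)
    for _ in range(iterations):
        r = residuals(params, src, dst)
        J = jacobian(params, src)
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        params = params + step
    return params
```

Each iteration only needs the residual vector and its Jacobian, and both are computed independently per correspondence. That per-term independence is what makes this class of solver a natural fit for data-parallel hardware.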

Posted Content

Neural Actor: Neural Free-view Synthesis of Human Actors with Pose Control

Lingjie Liu¹, Marc Habermann¹, Viktor Rudnev¹, Kripasindhu Sarkar¹, Jiatao Gu², Christian Theobalt¹•Institutions (2)

Max Planck Society¹, Facebook²

03 Jun 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Neural Actor uses a coarse body model as a proxy to unwarp the surrounding 3D space into a canonical pose, and a neural radiance field then learns pose-dependent geometric deformations and pose- and view-dependent appearance effects.

Abstract: We propose Neural Actor (NA), a new method for high-quality synthesis of humans from arbitrary viewpoints and under arbitrary controllable poses. Our method is built upon recent neural scene representation and rendering works which learn representations of geometry and appearance from only 2D images. While existing works demonstrated compelling rendering of static scenes and playback of dynamic scenes, photo-realistic reconstruction and rendering of humans with neural implicit methods, in particular under user-controlled novel poses, is still difficult. To address this problem, we utilize a coarse body model as the proxy to unwarp the surrounding 3D space into a canonical pose. A neural radiance field learns pose-dependent geometric deformations and pose- and view-dependent appearance effects in the canonical space from multi-view video input. To synthesize novel views of high-fidelity dynamic geometry and appearance, we leverage 2D texture maps defined on the body model as latent variables for predicting residual deformations and the dynamic appearance. Experiments demonstrate that our method achieves better quality than the state of the art on playback as well as novel pose synthesis, and can even generalize well to new poses that starkly differ from the training poses. Furthermore, our method also supports body shape control of the synthesized results.

Posted Content

Towards High Fidelity Monocular Face Reconstruction with Rich Reflectance using Self-supervised Learning and Ray Tracing

Abdallah Dib¹, Cedric Thebault¹, Junghyun Ahn¹, Philippe-Henri Gosselin¹, Christian Theobalt², Louis Chevallier¹•Institutions (2)

InterDigital, Inc.¹, Max Planck Society²

29 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a CNN encoder is combined with a differentiable ray tracer to improve the robustness of monocular face reconstruction in general lighting conditions, which substantially improves the reconstruction quality of shape, appearance and lighting even in scenes with difficult illumination.

Abstract: Robust face reconstruction from a monocular image in general lighting conditions is challenging. Methods combining deep neural network encoders with differentiable rendering have opened up the path for very fast monocular reconstruction of geometry, lighting and reflectance. They can also be trained in a self-supervised manner for increased robustness and better generalization. However, their differentiable rasterization based image formation models, as well as the underlying scene parameterization, limit them to Lambertian face reflectance and to poor shape details. More recently, ray tracing was introduced for monocular face reconstruction within a classic optimization-based framework and enables state-of-the-art results. However, optimization-based approaches are inherently slow and lack robustness. In this paper, we build our work on the aforementioned approaches and propose a new method that greatly improves reconstruction quality and robustness in general scenes. We achieve this by combining a CNN encoder with a differentiable ray tracer, which enables us to base the reconstruction on much more advanced personalized diffuse and specular albedos, a more sophisticated illumination model and a plausible representation of self-shadows. This enables a big leap forward in reconstruction quality of shape, appearance and lighting even in scenes with difficult illumination. With consistent face attributes reconstruction, our method leads to practical applications such as relighting and self-shadows removal. Compared to state-of-the-art methods, our results show improved accuracy and validity of the approach.

Posted Content•

Synthesis of Compositional Animations from Textual Descriptions

Anindita Ghosh¹, Noshaba Cheema², Cennet Oguz, Christian Theobalt³, Philipp Slusallek²•Institutions (3)

St. Xavier's College-Autonomous, Mumbai¹, Saarland University², Max Planck Society³

26 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes a hierarchical two-stream sequential model to explore a finer joint-level mapping between natural language sentences and 3D pose sequences corresponding to the given motion; the model can generate plausible pose sequences for short sentences describing single actions and for long compositional sentences describing multiple sequential and superimposed actions.

Abstract: "How can we animate 3D-characters from a movie script or move robots by simply telling them what we would like them to do?" "How unstructured and complex can we make a sentence and still generate plausible movements from it?" These are questions that need to be answered in the long-run, as the field is still in its infancy. Inspired by these problems, we present a new technique for generating compositional actions, which handles complex input sentences. Our output is a 3D pose sequence depicting the actions in the input sentence. We propose a hierarchical two-stream sequential model to explore a finer joint-level mapping between natural language sentences and 3D pose sequences corresponding to the given motion. We learn two manifold representations of the motion -- one each for the upper body and the lower body movements. Our model can generate plausible pose sequences for short sentences describing single actions as well as long compositional sentences describing multiple sequential and superimposed actions. We evaluate our proposed model on the publicly available KIT Motion-Language Dataset containing 3D pose data with human-annotated sentences. Experimental results show that our model advances the state-of-the-art on text-based motion synthesis in objective evaluations by a margin of 50%. Qualitative evaluations based on a user study indicate that our synthesized motions are perceived to be the closest to the ground-truth motion captures for both short and compositional sentences.

Journal Article•DOI•

Learning Dynamic Textures for Neural Rendering of Human Actors

Lingjie Liu¹, Weipeng Xu², Marc Habermann², Michael Zollhöfer³, Florian Bernard², Hyeongwoo Kim², Wenping Wang¹, Christian Theobalt²•Institutions (3)

University of Hong Kong¹, Max Planck Society², Stanford University³

01 Oct 2021-IEEE Transactions on Visualization and Computer Graphics

TL;DR: In this article, the authors propose a method that disentangles the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space.

Abstract: Synthesizing realistic videos of humans using neural networks has been a popular alternative to the conventional graphics-based rendering pipeline due to its high efficiency. Existing works typically formulate this as an image-to-image translation problem in 2D screen space, which leads to artifacts such as over-smoothing, missing body parts, and temporal instability of fine-scale detail, such as pose-dependent wrinkles in the clothing. In this article, we propose a novel human video synthesis method that approaches these limiting factors by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space. More specifically, our method relies on the combination of two convolutional neural networks (CNNs). Given the pose information, the first CNN predicts a dynamic texture map that contains time-coherent high-frequency details, and the second CNN conditions the generation of the final video on the temporally coherent output of the first CNN. We demonstrate several applications of our approach, such as human reenactment and novel view synthesis from monocular video, where we show significant improvement over the state of the art both qualitatively and quantitatively.

Proceedings Article•DOI•

Differentiable Event Stream Simulator for Non-Rigid 3D Tracking

Jalees Nehvi¹, Vladislav Golyanik, Franziska Mueller², Hans-Peter Seidel, Mohamed Elgharib, Christian Theobalt•Institutions (2)

Saarland University¹, Google²

30 Apr 2021

TL;DR: In this paper, a differentiable simulator for non-rigid 3D tracking of deformable objects (such as human hands, isometric surfaces and general watertight meshes) from event streams is proposed.

Abstract: This paper introduces the first differentiable simulator of event streams, i.e., streams of asynchronous brightness change signals recorded by event cameras. Our differentiable simulator enables non-rigid 3D tracking of deformable objects (such as human hands, isometric surfaces and general watertight meshes) from event streams by leveraging an analysis-by-synthesis principle. So far, event-based tracking and reconstruction of non-rigid objects in 3D, like hands and body, has been either tackled using explicit event trajectories or large-scale datasets. In contrast, our method does not require any such processing or data, and can be readily applied to incoming event streams. We show the effectiveness of our approach for various types of non-rigid objects and compare to existing methods for non-rigid 3D tracking. In our experiments, the proposed energy-based formulations outperform competing RGB-based methods in terms of 3D errors. The source code and the new data are publicly available.
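
To make "asynchronous brightness change signals" concrete, here is a minimal sketch of the commonly used contrast-threshold model of a single event-camera pixel. It is an illustration under stated assumptions, not the paper's simulator: the function name `simulate_events`, the threshold value, and the sampled brightness signal are all hypothetical, and this version is not differentiable.

```python
import math


def simulate_events(brightness, timestamps, threshold=0.4, eps=1e-6):
    """Return (timestamp, polarity) events for one pixel.

    Assumed model: an event fires each time the log-brightness moves
    by at least `threshold` away from the level at the last event.
    """
    events = []
    # Reference log-brightness, updated whenever an event fires.
    ref = math.log(brightness[0] + eps)
    for t, value in zip(timestamps[1:], brightness[1:]):
        log_value = math.log(value + eps)
        # A large change between two samples can trigger several events.
        while abs(log_value - ref) >= threshold:
            polarity = 1 if log_value > ref else -1
            ref += polarity * threshold
            events.append((t, polarity))
    return events


# Hypothetical pixel that brightens at t=2 and darkens sharply at t=4.
events = simulate_events([1.0, 1.0, 2.0, 2.0, 0.5], [0, 1, 2, 3, 4])
```

An analysis-by-synthesis tracker would compare such simulated events against the observed stream; doing so with gradients requires a differentiable formulation, which this sketch does not attempt.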

Proceedings Article•

NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction

Peng Wang¹, Lingjie Liu², Yuan Liu¹, Christian Theobalt¹, Taku Komura³, Wenping Wang⁴•Institutions (4)

University of Hong Kong¹, Max Planck Society², University of Edinburgh³, Texas A&M University⁴

06 Dec 2021

TL;DR: NeuS represents a surface as the zero-level set of a signed distance function (SDF) and develops a new volume rendering method to train a neural SDF representation.

Abstract: We present a novel neural surface reconstruction method, called NeuS, for reconstructing objects and scenes with high fidelity from 2D image inputs. Existing neural surface reconstruction approaches, such as DVR and IDR, require foreground masks as supervision, easily get trapped in local minima, and therefore struggle with the reconstruction of objects with severe self-occlusion or thin structures. Meanwhile, recent neural methods for novel view synthesis, such as NeRF and its variants, use volume rendering to produce a neural scene representation with robustness of optimization, even for highly complex objects. However, extracting high-quality surfaces from this learned implicit representation is difficult because there are not sufficient surface constraints in the representation. In NeuS, we propose to represent a surface as the zero-level set of a signed distance function (SDF) and develop a new volume rendering method to train a neural SDF representation. We observe that the conventional volume rendering method causes inherent geometric errors (i.e. bias) for surface reconstruction, and therefore propose a new formulation that is free of bias in the first order of approximation, thus leading to more accurate surface reconstruction even without the mask supervision. Experiments on the DTU dataset and the BlendedMVS dataset show that NeuS outperforms the state of the art in high-quality surface reconstruction, especially for objects and scenes with complex structures and self-occlusion.
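
One way to picture how an SDF can drive volume rendering is sketched below. It is an illustration only, not the paper's implementation: it assumes that a logistic sigmoid with a hypothetical sharpness parameter `s` turns SDF samples along a ray into per-interval opacities, and the function names and sample values are made up for the example. The exact formulation in the paper may differ.

```python
import math


def sigmoid(x, s):
    """Logistic sigmoid with sharpness s (assumed parameter)."""
    return 1.0 / (1.0 + math.exp(-s * x))


def render_weights(sdf_values, s=10.0):
    """Compute compositing weights for the intervals between ray samples.

    sdf_values: SDF at consecutive samples along one ray, near to far.
    Opacity is positive only where the SDF decreases along the ray,
    i.e. where the ray moves towards or into the surface.
    """
    weights = []
    transmittance = 1.0  # fraction of light not yet absorbed
    for f_near, f_far in zip(sdf_values[:-1], sdf_values[1:]):
        phi_near = sigmoid(f_near, s)
        phi_far = sigmoid(f_far, s)
        alpha = max((phi_near - phi_far) / phi_near, 0.0)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


# Hypothetical ray that crosses the surface between the 3rd and 4th sample.
weights = render_weights([0.5, 0.3, 0.1, -0.1, -0.3, -0.5])
```

With these sample values the largest weight falls on the interval that contains the zero crossing of the SDF, which is the behaviour one wants if rendered colors are meant to come from the surface itself.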

Journal Article•DOI•

Learning meaningful controls for fluids

Mengyu Chu¹, Nils Thuerey², Hans-Peter Seidel¹, Christian Theobalt¹, Rhaleb Zayer¹•Institutions (2)

Max Planck Society¹, Technische Universität München²

17 Jul 2021-ACM Transactions on Graphics

TL;DR: This paper proposes a novel data-driven conditional adversarial model that solves the challenging and theoretically ill-posed problem of deriving plausible velocity fields from a single frame of a density field.

Abstract: While modern fluid simulation methods achieve high-quality simulation results, it is still a big challenge to interpret and control motion from visual quantities, such as the advected marker density. These visual quantities play an important role in user interactions: Being familiar and meaningful to humans, these quantities have a strong correlation with the underlying motion. We propose a novel data-driven conditional adversarial model that solves the challenging and theoretically ill-posed problem of deriving plausible velocity fields from a single frame of a density field. Besides density modifications, our generative model is the first to enable the control of the results using all of the following control modalities: obstacles, physical parameters, kinetic energy, and vorticity. Our method is based on a new conditional generative adversarial neural network that explicitly embeds physical quantities into the learned latent space, and a new cyclic adversarial network design for control disentanglement. We show the high quality and versatile controllability of our results for density-based inference, realistic obstacle interaction, and sensitive responses to modifications of physical parameters, kinetic energy, and vorticity. Code, models, and results can be found at https://github.com/RachelCmy/den2vel.

Journal Article•DOI•

Real-time Deep Dynamic Characters

Marc Habermann¹, Lingjie Liu¹, Weipeng Xu¹, Michael Zollhoefer², Gerard Pons-Moll¹, Christian Theobalt¹•Institutions (2)

Max Planck Society¹, Facebook²

04 May 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a parametric and differentiable character representation is proposed to model coarse and fine dynamic deformations, e.g., garment wrinkles, as explicit space-time coherent mesh geometry that is augmented with high-quality dynamic textures dependent on motion and view point.

Abstract: We propose a deep videorealistic 3D human character model displaying highly realistic shape, motion, and dynamic appearance learned in a new weakly supervised way from multi-view imagery. In contrast to previous work, our controllable 3D character displays dynamics, e.g., the swing of the skirt, dependent on skeletal body motion in an efficient data-driven way, without requiring complex physics simulation. Our character model also features a learned dynamic texture model that accounts for photo-realistic motion-dependent appearance details, as well as view-dependent lighting effects. During training, we do not need to resort to difficult dynamic 3D capture of the human; instead we can train our model entirely from multi-view video in a weakly supervised manner. To this end, we propose a parametric and differentiable character representation which allows us to model coarse and fine dynamic deformations, e.g., garment wrinkles, as explicit space-time coherent mesh geometry that is augmented with high-quality dynamic textures dependent on motion and view point. As input to the model, only an arbitrary 3D skeleton motion is required, making it directly compatible with the established 3D animation pipeline. We use a novel graph convolutional network architecture to enable motion-dependent deformation learning of body and clothing, including dynamics, and a neural generative dynamic texture model creates corresponding dynamic texture maps. We show that by merely providing new skeletal motions, our model creates motion-dependent surface deformations, physically plausible dynamic clothing deformations, as well as video-realistic surface textures at a much higher level of detail than previous state of the art approaches, and even in real-time.

Proceedings Article•

Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video

Edgar Tretschk¹, Ayush Tewari¹, Vladislav Golyanik¹, Michael Zollhöfer², Christoph Lassner³, Christian Theobalt¹•Institutions (3)

Max Planck Society¹, Stanford University², Facebook³

01 Jan 2021

TL;DR: In this article, Non-Rigid Neural Radiance Fields (NR-NeRF) are proposed, which disentangle a dynamic scene into a canonical volume and its deformation, with the deformation implemented as ray bending.

Abstract: We present Non-Rigid Neural Radiance Fields (NR-NeRF), a reconstruction and novel view synthesis approach for general non-rigid dynamic scenes. Our approach takes RGB images of a dynamic scene as input (e.g., from a monocular video recording), and creates a high-quality space-time geometry and appearance representation. We show that a single handheld consumer-grade camera is sufficient to synthesize sophisticated renderings of a dynamic scene from novel virtual camera views, e.g. a 'bullet-time' video effect. NR-NeRF disentangles the dynamic scene into a canonical volume and its deformation. Scene deformation is implemented as ray bending, where straight rays are deformed non-rigidly. We also propose a novel rigidity network to better constrain rigid regions of the scene, leading to more stable results. The ray bending and rigidity network are trained without explicit supervision. Our formulation enables dense correspondence estimation across views and time, and compelling video editing applications such as motion exaggeration. Our code will be open sourced.
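
The ray-bending idea can be pictured with a small sketch: points sampled on a straight camera ray are each shifted by an offset before the canonical volume would be queried. Everything here is hypothetical and for illustration only: `bend_ray`, `offset_fn`, and `rigidity_fn` are made-up names standing in for learned networks, and scaling the offset by a rigidity score in [0, 1] is an assumption about how a rigidity term could be applied, not a statement of the paper's design.

```python
def bend_ray(origin, direction, depths, offset_fn, rigidity_fn=None):
    """Map samples on a straight ray to bent positions.

    origin, direction: 3-tuples describing the straight camera ray.
    depths: distances along the ray at which to sample.
    offset_fn: hypothetical stand-in for a learned deformation,
        mapping a 3D point to a 3D offset.
    rigidity_fn: optional hypothetical stand-in for a rigidity score
        in [0, 1]; 0 leaves the point on the straight ray (assumption).
    """
    bent_points = []
    for t in depths:
        point = tuple(o + t * d for o, d in zip(origin, direction))
        offset = offset_fn(point)
        if rigidity_fn is not None:
            score = rigidity_fn(point)
            offset = tuple(score * c for c in offset)
        bent_points.append(tuple(p + c for p, c in zip(point, offset)))
    return bent_points


# Toy deformation: shift points sideways in proportion to their depth.
toy_offset = lambda p: (0.1 * p[2], 0.0, 0.0)
bent = bend_ray((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), [1.0, 2.0], toy_offset)
```

With a zero offset, or a rigidity score of 0 everywhere, the samples stay on the straight ray, which matches the intuition that rigid regions should not be deformed.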

Posted Content•

Style and Pose Control for Image Synthesis of Humans from a Single Monocular View

Kripasindhu Sarkar¹, Vladislav Golyanik¹, Lingjie Liu¹, Christian Theobalt¹•Institutions (1)

Max Planck Society¹

22 Feb 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: StylePoseGAN extends a non-controllable generator to accept conditioning of pose and appearance separately; trained in a fully supervised way, it disentangles pose, appearance and body parts and significantly outperforms existing single image re-rendering methods.

Abstract: Photo-realistic re-rendering of a human from a single image with explicit control over body pose, shape and appearance enables a wide range of applications, such as human appearance transfer, virtual try-on, motion imitation, and novel view synthesis. While significant progress has been made in this direction using learning-based image generation tools, such as GANs, existing approaches yield noticeable artefacts such as blurring of fine details, unrealistic distortions of the body parts and garments as well as severe changes of the textures. We, therefore, propose a new method for synthesising photo-realistic human images with explicit control over pose and part-based appearance, i.e., StylePoseGAN, where we extend a non-controllable generator to accept conditioning of pose and appearance separately. Our network can be trained in a fully supervised way with human images to disentangle pose, appearance and body parts, and it significantly outperforms existing single image re-rendering methods. Our disentangled representation opens up further applications such as garment transfer, motion transfer, virtual try-on, head (identity) swap and appearance interpolation. StylePoseGAN achieves state-of-the-art image generation fidelity on common perceptual metrics compared to the current best-performing methods and convinces in a comprehensive user study.

Posted Content•

Efficient and Differentiable Shadow Computation for Inverse Problems

Linjie Lyu¹, Marc Habermann¹, Lingjie Liu¹, Mallikarjun B R¹, Ayush Tewari¹, Christian Theobalt¹•Institutions (1)

Max Planck Society¹

01 Apr 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes an accurate yet efficient approach for differentiable visibility and soft shadow computation, based on spherical harmonics approximations of the scene illumination and visibility, where the occluding surface is approximated with spheres.

Abstract: Differentiable rendering has received increasing interest for image-based inverse problems. It can benefit traditional optimization-based solutions to inverse problems, but also allows for self-supervision of learning-based approaches for which training data with ground truth annotation is hard to obtain. However, existing differentiable renderers either do not model visibility of the light sources from the different points in the scene, responsible for shadows in the images, or are too slow for being used to train deep architectures over thousands of iterations. To this end, we propose an accurate yet efficient approach for differentiable visibility and soft shadow computation. Our approach is based on the spherical harmonics approximations of the scene illumination and visibility, where the occluding surface is approximated with spheres. This allows for a significantly more efficient shadow computation compared to methods based on ray tracing. As our formulation is differentiable, it can be used to solve inverse problems such as texture, illumination, rigid pose, and geometric deformation recovery from images using analysis-by-synthesis optimization.
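
A reason spherical harmonics (SH) approximations can be cheap is that, for an orthonormal SH basis, the integral over the sphere of a product of two SH-expanded functions reduces to a dot product of their coefficient vectors. The sketch below checks this numerically for the first two SH bands. It is a generic illustration, not the paper's shadow formulation (which would also have to account for terms such as the surface reflectance); the coefficient values and function names are made up.

```python
import math


def sh_basis(direction):
    """Real SH basis for bands 0 and 1, evaluated at a unit direction."""
    x, y, z = direction
    c0 = math.sqrt(1.0 / (4.0 * math.pi))
    c1 = math.sqrt(3.0 / (4.0 * math.pi))
    return [c0, c1 * y, c1 * z, c1 * x]


def sh_eval(coeffs, direction):
    """Evaluate an SH-expanded function at a unit direction."""
    return sum(c * b for c, b in zip(coeffs, sh_basis(direction)))


def sh_product_integral(light_coeffs, visibility_coeffs):
    """Integral of the product of two SH functions: a dot product."""
    return sum(l * v for l, v in zip(light_coeffs, visibility_coeffs))


def brute_force_integral(light_coeffs, visibility_coeffs, n_theta=100, n_phi=200):
    """Midpoint-rule integration over the sphere, used only as a check."""
    total = 0.0
    d_theta = math.pi / n_theta
    d_phi = 2.0 * math.pi / n_phi
    for i in range(n_theta):
        theta = (i + 0.5) * d_theta
        for j in range(n_phi):
            phi = (j + 0.5) * d_phi
            d = (math.sin(theta) * math.cos(phi),
                 math.sin(theta) * math.sin(phi),
                 math.cos(theta))
            value = sh_eval(light_coeffs, d) * sh_eval(visibility_coeffs, d)
            total += value * math.sin(theta) * d_theta * d_phi
    return total


# Hypothetical illumination and visibility coefficients.
light = [1.0, 0.2, 0.5, -0.3]
visibility = [0.8, 0.0, 0.4, 0.1]
fast = sh_product_integral(light, visibility)
slow = brute_force_integral(light, visibility)
```

The dot product touches four numbers, while the brute-force check evaluates the integrand at 20,000 directions, and the two results agree up to the quadrature error.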

Posted Content•

Egocentric Videoconferencing

Mohamed Elgharib, Mohit Mendiratta, Justus Thies, Matthias Nießner, Hans-Peter Seidel, Ayush Tewari, Vladislav Golyanik, Christian Theobalt

07 Jul 2021-arXiv: Graphics

TL;DR: A conditional generative adversarial neural network is employed that learns a transition from the highly distorted egocentric views to frontal views common in videoconferencing, and produces temporally smooth video-realistic renderings in real-time using a video-to-video translation network in conjunction with a temporal discriminator.

Abstract: We introduce a method for egocentric videoconferencing that enables hands-free video calls, for instance by people wearing smart glasses or other mixed-reality devices. Videoconferencing portrays valuable non-verbal communication and face expression cues, but usually requires a front-facing camera. Using a frontal camera in a hands-free setting when a person is on the move is impractical. Even holding a mobile phone camera in front of the face while sitting for a long duration is not convenient. To overcome these issues, we propose a low-cost wearable egocentric camera setup that can be integrated into smart glasses. Our goal is to mimic a classical video call, and therefore, we transform the egocentric perspective of this camera into a front facing video. To this end, we employ a conditional generative adversarial neural network that learns a transition from the highly distorted egocentric views to frontal views common in videoconferencing. Our approach learns to transfer expression details directly from the egocentric view without using a complex intermediate parametric expressions model, as it is used by related face reenactment methods. We successfully handle subtle expressions, not easily captured by parametric blendshape-based solutions, e.g., tongue movement, eye movements, eye blinking, strong expressions and depth varying movements. To get control over the rigid head movements in the target view, we condition the generator on synthetic renderings of a moving neutral face. This allows us to synthesize results at different head poses. Our technique produces temporally smooth video-realistic renderings in real-time using a video-to-video translation network in conjunction with a temporal discriminator. We demonstrate the improved capabilities of our technique by comparing against related state-of-the-art approaches.

Proceedings Article•DOI•

Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks

Xiaoxiao Long¹, Lingjie Liu², Wei Li, Christian Theobalt², Wenping Wang¹•Institutions (2)

University of Hong Kong¹, Max Planck Society²

20 Jun 2021

TL;DR: This paper proposes an Epipolar Spatio-Temporal (EST) transformer to explicitly associate geometric and temporal correlation with multiple estimated depth maps, and designs a compact hybrid network consisting of a 2D context-aware network and a 3D matching network.

Abstract: We present a novel method for multi-view depth estimation from a single video, which is a critical task in various applications, such as perception, reconstruction and robot navigation. Although previous learning-based methods have demonstrated compelling results, most works estimate depth maps of individual video frames independently, without taking into consideration the strong geometric and temporal coherence among the frames. Moreover, current state-of-the-art (SOTA) models mostly adopt a fully 3D convolution network for cost regularization and therefore require high computational cost, thus limiting their deployment in real-world applications. Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer to explicitly associate geometric and temporal correlation with multiple estimated depth maps. Furthermore, to reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network consisting of a 2D context-aware network and a 3D matching network which learn 2D context information and 3D disparity cues separately. Extensive experiments demonstrate that our method achieves higher accuracy in depth estimation and a significant speedup over the SOTA methods.

Posted Content•

Quantum Permutation Synchronization

Tolga Birdal¹, Vladislav Golyanik², Christian Theobalt², Leonidas J. Guibas¹•Institutions (2)

Stanford University¹, Max Planck Society²

19 Jan 2021-arXiv: Quantum Physics

TL;DR: In this article, the first quantum algorithm for solving a synchronization problem in the context of computer vision is presented, which involves solving a non-convex optimization problem in discrete variables.

Proceedings Article•DOI•

VideoForensicsHQ: Detecting High-Quality Manipulated Face Videos

Gereon Fox¹, Wentao Liu¹, Hyeongwoo Kim¹, Hans-Peter Seidel¹, Mohamed Elgharib¹, Christian Theobalt¹•Institutions (1)

Max Planck Society¹

05 Jul 2021

TL;DR: In this article, a new benchmark dataset for face video forgery detection is introduced, which allows the authors to demonstrate that existing detection techniques have difficulties detecting fakes that reliably fool the human eye.

Abstract: There are concerns that new approaches to the synthesis of high quality face videos may be misused to manipulate videos with malicious intent. The research community therefore developed methods for the detection of modified footage and assembled benchmark datasets for this task. In this paper, we examine how the performance of forgery detectors depends on the presence of artefacts that the human eye can see. We introduce a new benchmark dataset for face video forgery detection, of unprecedented quality. It allows us to demonstrate that existing detection techniques have difficulties detecting fakes that reliably fool the human eye. We thus introduce a new family of detectors that examine combinations of spatial and temporal features and outperform existing approaches both in terms of detection accuracy and generalization.

Posted Content•

HumanGAN: A Generative Model of Humans Images

Kripasindhu Sarkar, Lingjie Liu, Vladislav Golyanik, Christian Theobalt

11 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes a generative model for images of dressed humans offering control over pose, local body part appearance and garment style; it is the first method to solve various aspects of human image generation, such as global appearance sampling, pose transfer, parts and garment transfer, and parts sampling, jointly in a unified framework.

Abstract: Generative adversarial networks achieve great performance in photorealistic image synthesis in various domains, including human images. However, they usually employ latent vectors that encode the sampled outputs globally. This does not allow convenient control of semantically-relevant individual parts of the image, and is not able to draw samples that only differ in partial aspects, such as clothing style. We address these limitations and present a generative model for images of dressed humans offering control over pose, local body part appearance and garment style. This is the first method to solve various aspects of human image generation such as global appearance sampling, pose transfer, parts and garment transfer, and parts sampling jointly in a unified framework. As our model encodes part-based latent appearance vectors in a normalized pose-independent space and warps them to different poses, it preserves body and clothing appearance under varying posture. Experiments show that our flexible and general generative method outperforms task-specific baselines for pose-conditioned image generation, pose transfer and part sampling in terms of realism and output resolution.
