
Showing papers by Christian Theobalt published in 2017


Journal ArticleDOI
20 Jul 2017
TL;DR: In this paper, a fully convolutional pose formulation is proposed that regresses 2D and 3D joint positions jointly in real time and does not require tightly cropped input frames.
Abstract: We present the first real-time method to capture the full global 3D skeletal pose of a human in a stable, temporally consistent manner using a single RGB camera. Our method combines a new convolutional neural network (CNN) based pose regressor with kinematic skeleton fitting. Our novel fully-convolutional pose formulation regresses 2D and 3D joint positions jointly in real time and does not require tightly cropped input frames. A real-time kinematic skeleton fitting method uses the CNN output to yield temporally stable 3D global pose reconstructions on the basis of a coherent kinematic skeleton. This makes our approach the first monocular RGB method usable in real-time applications such as 3D character control---thus far, the only monocular methods for such applications employed specialized RGB-D cameras. Our method's accuracy is quantitatively on par with the best offline 3D monocular RGB pose estimation methods. Our results are qualitatively comparable to, and sometimes better than, results from monocular RGB-D approaches, such as the Kinect. However, we show that our approach is more broadly applicable than RGB-D solutions, i.e., it works for outdoor scenes, community videos, and low quality commodity RGB cameras.
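
A minimal sketch of the global-pose lifting idea described above, assuming a pinhole camera with made-up intrinsics: the CNN's root-relative 3D joints are fit to its 2D detections with a small nonlinear least-squares problem that includes a temporal smoothness term. The paper's kinematic skeleton fitting optimizes full joint angles; for brevity only the global translation is optimized here, and the joint count and weights are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

def project(P, f=1000.0, c=(512.0, 512.0)):
    """Pinhole projection of Nx3 camera-space points (f and c are assumed intrinsics)."""
    return f * P[:, :2] / P[:, 2:3] + np.asarray(c)

def residuals(t, joints3d_rel, joints2d, t_prev, w2d=1.0, w_smooth=1.0):
    P = joints3d_rel + t                          # place the root-relative pose in camera space
    r2d = w2d * (project(P) - joints2d).ravel()   # reprojection against the CNN's 2D detections
    rsm = w_smooth * (t - t_prev)                 # temporal smoothness on the global position
    return np.concatenate([r2d, rsm])

# toy stand-ins for the per-frame CNN outputs
rng = np.random.default_rng(0)
joints3d_rel = rng.normal(scale=0.3, size=(17, 3))               # root-relative 3D joints
joints2d = project(joints3d_rel + np.array([0.1, -0.2, 3.0]))    # matching 2D detections
res = least_squares(residuals, x0=np.array([0.0, 0.0, 2.5]),
                    args=(joints3d_rel, joints2d, np.array([0.0, 0.0, 2.5])))
print("estimated global position:", res.x)
```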

859 citations


Journal ArticleDOI
TL;DR: In this paper, a robust pose estimation strategy for real-time, high-quality 3D scanning of large-scale scenes is proposed that considers the complete history of RGB-D input with an efficient hierarchical approach, removing the heavy reliance on temporal tracking and instead continually localizing to the globally optimized frames.
Abstract: Real-time, high-quality, 3D scanning of large-scale scenes is key to mixed reality and robotic applications. However, scalability brings challenges of drift in pose estimation, introducing significant errors in the accumulated model. Approaches often require hours of offline processing to globally correct model errors. Recent online methods demonstrate compelling results but suffer from (1) needing minutes to perform online correction, preventing true real-time use; (2) brittle frame-to-frame (or frame-to-model) pose estimation, resulting in many tracking failures; or (3) supporting only unstructured point-based representations, which limit scan quality and applicability. We systematically address these issues with a novel, real-time, end-to-end reconstruction framework. At its core is a robust pose estimation strategy, optimizing per frame for a global set of camera poses by considering the complete history of RGB-D input with an efficient hierarchical approach. We remove the heavy reliance on temporal tracking and continually localize to the globally optimized frames instead. We contribute a parallelizable optimization framework, which employs correspondences based on sparse features and dense geometric and photometric matching. Our approach estimates globally optimized (i.e., bundle adjusted) poses in real time, supports robust tracking with recovery from gross tracking failures (i.e., relocalization), and re-estimates the 3D model in real time to ensure global consistency, all within a single framework. Our approach outperforms state-of-the-art online systems with quality on par to offline methods, but with unprecedented speed and scan completeness. Our framework leads to a comprehensive online scanning solution for large indoor environments, enabling ease of use and high-quality results.
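
One building block behind the "correspondences based on sparse features" mentioned above can be illustrated by the closed-form rigid alignment (Kabsch/Procrustes) of corresponding 3D points from two frames. This is only the pairwise core; the paper's hierarchical, globally bundle-adjusted optimization over the full frame history goes well beyond this sketch, and the toy correspondences below are synthetic.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rigid transform (R, t) mapping src points onto dst points."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # guard against reflections
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s

# synthetic correspondences between two frames (sparse feature matches in the real system)
rng = np.random.default_rng(1)
src = rng.uniform(-1, 1, size=(50, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.2, -0.1, 0.05])
R, t = rigid_align(src, dst)
print(np.allclose(R, R_true, atol=1e-6), t)
```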

711 citations


Journal ArticleDOI
TL;DR: This work presents the first real-time method to capture the full global 3D skeletal pose of a human in a stable, temporally consistent manner using a single RGB camera and shows that the approach is more broadly applicable than RGB-D solutions, i.e., it works for outdoor scenes, community videos, and low quality commodity RGB cameras.
Abstract: We present the first real-time method to capture the full global 3D skeletal pose of a human in a stable, temporally consistent manner using a single RGB camera. Our method combines a new convolutional neural network (CNN) based pose regressor with kinematic skeleton fitting. Our novel fully-convolutional pose formulation regresses 2D and 3D joint positions jointly in real time and does not require tightly cropped input frames. A real-time kinematic skeleton fitting method uses the CNN output to yield temporally stable 3D global pose reconstructions on the basis of a coherent kinematic skeleton. This makes our approach the first monocular RGB method usable in real-time applications such as 3D character control---thus far, the only monocular methods for such applications employed specialized RGB-D cameras. Our method's accuracy is quantitatively on par with the best offline 3D monocular RGB pose estimation methods. Our results are qualitatively comparable to, and sometimes better than, results from monocular RGB-D approaches, such as the Kinect. However, we show that our approach is more broadly applicable than RGB-D solutions, i.e. it works for outdoor scenes, community videos, and low quality commodity RGB cameras.

644 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this article, a CNN-based approach for 3D human body pose estimation from single RGB images is proposed to address the issue of limited generalizability of models trained solely on the starkly limited publicly available 3D pose data.
Abstract: We propose a CNN-based approach for 3D human body pose estimation from single RGB images that addresses the issue of limited generalizability of models trained solely on the starkly limited publicly available 3D pose data. Using only the existing 3D pose data and 2D pose data, we show state-of-the-art performance on established benchmarks through transfer of learned features, while also generalizing to in-the-wild scenes. We further introduce a new training set for human body pose estimation from monocular images of real humans that has the ground truth captured with a multi-camera marker-less motion capture system. It complements existing corpora with greater diversity in pose, human appearance, clothing, occlusion, and viewpoints, and enables an increased scope of augmentation. We also contribute a new benchmark that covers outdoor and indoor scenes, and demonstrate that our 3D pose dataset shows better in-the-wild performance than existing annotated data, which is further improved in conjunction with transfer learning from 2D pose data. All in all, we argue that the use of transfer learning of representations in tandem with algorithmic and data contributions is crucial for general 3D body pose estimation.
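
A hedged sketch of the transfer-learning idea: reuse a backbone that was trained for 2D pose (a plain ResNet stands in for it here), attach a new head that regresses root-relative 3D joints, freeze the earliest layers and fine-tune the rest. The backbone choice, frozen layers, joint count and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torchvision

NUM_JOINTS = 17

backbone = torchvision.models.resnet50()   # in practice: load weights from a 2D-pose-trained network
backbone.fc = nn.Identity()                # keep the 2048-d pooled features

pose3d_head = nn.Sequential(
    nn.Linear(2048, 1024), nn.ReLU(inplace=True),
    nn.Linear(1024, NUM_JOINTS * 3),
)

# freeze the earliest layers, fine-tune the rest together with the new head
for name, p in backbone.named_parameters():
    p.requires_grad = not name.startswith(("conv1", "bn1", "layer1"))

params = [p for p in backbone.parameters() if p.requires_grad] + list(pose3d_head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

images = torch.randn(4, 3, 224, 224)       # stand-in for a training batch
target = torch.randn(4, NUM_JOINTS, 3)     # stand-in for mocap ground-truth 3D joints
pred = pose3d_head(backbone(images)).view(4, NUM_JOINTS, 3)
loss = nn.functional.mse_loss(pred, target)
loss.backward(); optimizer.step()
print(float(loss))
```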

620 citations


Posted Content
TL;DR: A novel model-based deep convolutional autoencoder that addresses the highly challenging problem of reconstructing a 3D human face from a single in-the-wild color image and can be trained end-to-end in an unsupervised manner, which renders training on very large real world data feasible.
Abstract: In this work we propose a novel model-based deep convolutional autoencoder that addresses the highly challenging problem of reconstructing a 3D human face from a single in-the-wild color image. To this end, we combine a convolutional encoder network with an expert-designed generative model that serves as decoder. The core innovation is our new differentiable parametric decoder that encapsulates image formation analytically based on a generative model. Our decoder takes as input a code vector with exactly defined semantic meaning that encodes detailed face pose, shape, expression, skin reflectance and scene illumination. Due to this new way of combining CNN-based with model-based face reconstruction, the CNN-based encoder learns to extract semantically meaningful parameters from a single monocular input image. For the first time, a CNN encoder and an expert-designed generative model can be trained end-to-end in an unsupervised manner, which renders training on very large (unlabeled) real world data feasible. The obtained reconstructions compare favorably to current state-of-the-art approaches in terms of quality and richness of representation.
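
To make the "model-based decoder" idea concrete, here is a drastically simplified, hedged sketch: a small CNN encoder predicts a semantic code (identity, expression, similarity transform), and a fixed linear morphable model acts as the differentiable decoder. Random bases stand in for a real 3D morphable model, sparse 2D landmarks stand in for the photometric self-supervision, and reflectance and illumination are omitted entirely; the sketch only shows how gradients flow from an image-space loss through an analytic decoder into the encoder.

```python
import torch
import torch.nn as nn

N_LMK, N_ID, N_EXP = 68, 80, 64            # assumed landmark and basis counts

class Encoder(nn.Module):
    """Tiny CNN standing in for the convolutional encoder; outputs a semantic code vector."""
    def __init__(self, code_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, code_dim),
        )
    def forward(self, x):
        return self.net(x)

class MorphableDecoder(nn.Module):
    """Differentiable 'expert-designed' decoder: a linear face model plus a similarity
    transform, evaluated only at sparse 2D landmarks for brevity."""
    def __init__(self):
        super().__init__()
        self.register_buffer("mean", torch.randn(N_LMK, 2))                  # stand-in model
        self.register_buffer("id_basis", 0.01 * torch.randn(N_ID, N_LMK, 2))
        self.register_buffer("exp_basis", 0.01 * torch.randn(N_EXP, N_LMK, 2))
    def forward(self, code):
        alpha, delta, scale, trans = torch.split(code, [N_ID, N_EXP, 1, 2], dim=1)
        shape = self.mean + torch.einsum("bi,ijk->bjk", alpha, self.id_basis) \
                          + torch.einsum("be,ejk->bjk", delta, self.exp_basis)
        return torch.exp(scale)[:, :, None] * shape + trans[:, None, :]

encoder, decoder = Encoder(N_ID + N_EXP + 3), MorphableDecoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)      # only the encoder is learned here

images = torch.randn(2, 3, 128, 128)                       # unlabeled training images
detected_lmk = torch.randn(2, N_LMK, 2)                    # stand-in 2D landmark detections
loss = nn.functional.mse_loss(decoder(encoder(images)), detected_lmk)
loss.backward(); opt.step()
print(float(loss))
```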

355 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: A novel model-based deep convolutional autoencoder that addresses the highly challenging problem of reconstructing a 3D human face from a single in-the-wild color image and can be trained end-to-end in an unsupervised manner, which renders training on very large real world data feasible.
Abstract: In this work we propose a novel model-based deep convolutional autoencoder that addresses the highly challenging problem of reconstructing a 3D human face from a single in-the-wild color image. To this end, we combine a convolutional encoder network with an expert-designed generative model that serves as decoder. The core innovation is the differentiable parametric decoder that encapsulates image formation analytically based on a generative model. Our decoder takes as input a code vector with exactly defined semantic meaning that encodes detailed face pose, shape, expression, skin reflectance and scene illumination. Due to this new way of combining CNN-based with model-based face reconstruction, the CNN-based encoder learns to extract semantically meaningful parameters from a single monocular input image. For the first time, a CNN encoder and an expert-designed generative model can be trained end-to-end in an unsupervised manner, which renders training on very large (unlabeled) real world data feasible. The obtained reconstructions compare favorably to current state-of-the-art approaches in terms of quality and richness of representation.

316 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this article, an approach for real-time, robust and accurate hand pose estimation from moving egocentric RGB-D cameras in cluttered real environments is presented, which uses two subsequently applied Convolutional Neural Networks (CNNs) to localize the hand and regress 3D joint locations.
Abstract: We present an approach for real-time, robust and accurate hand pose estimation from moving egocentric RGB-D cameras in cluttered real environments. Existing methods typically fail for hand-object interactions in cluttered scenes imaged from egocentric viewpoints, common for virtual or augmented reality applications. Our approach uses two subsequently applied Convolutional Neural Networks (CNNs) to localize the hand and regress 3D joint locations. Hand localization is achieved by using a CNN to estimate the 2D position of the hand center in the input, even in the presence of clutter and occlusions. The localized hand position, together with the corresponding input depth value, is used to generate a normalized cropped image that is fed into a second CNN to regress relative 3D hand joint locations in real time. For added accuracy, robustness and temporal stability, we refine the pose estimates using a kinematic pose tracking energy. To train the CNNs, we introduce a new photorealistic dataset that uses a merged reality approach to capture and synthesize large amounts of annotated data of natural hand interaction in cluttered scenes. Through quantitative and qualitative evaluation, we show that our method is robust to self-occlusion and occlusions by objects, particularly in moving egocentric perspectives.
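
The second CNN described above consumes a normalized crop centered on the detected hand. A hedged sketch of that normalization step follows; the camera intrinsics, metric cube size and output resolution are assumed values, and a real pipeline would use proper interpolation instead of the nearest-neighbour resize used here to stay dependency-free.

```python
import numpy as np

def normalized_crop(depth, center_uv, center_depth, cube_mm=300.0, fx=475.0, fy=475.0, out=128):
    """Crop a metric cube around the detected hand centre and normalize depth to [-1, 1]."""
    u, v = center_uv
    ru = int(round(cube_mm / 2.0 * fx / center_depth))       # half-size of the cube in pixels
    rv = int(round(cube_mm / 2.0 * fy / center_depth))
    u0, u1 = max(0, u - ru), min(depth.shape[1], u + ru)
    v0, v1 = max(0, v - rv), min(depth.shape[0], v + rv)
    crop = depth[v0:v1, u0:u1].astype(np.float32)
    crop = np.clip(crop, center_depth - cube_mm / 2, center_depth + cube_mm / 2)
    crop = (crop - center_depth) / (cube_mm / 2)             # map to [-1, 1]
    ui = np.linspace(0, crop.shape[1] - 1, out).astype(int)  # nearest-neighbour resize
    vi = np.linspace(0, crop.shape[0] - 1, out).astype(int)
    return crop[np.ix_(vi, ui)]

depth = np.full((480, 640), 2000.0)                          # toy depth map in millimetres
depth[200:280, 300:360] = 450.0                              # a "hand" at 45 cm
patch = normalized_crop(depth, center_uv=(330, 240), center_depth=450.0)
print(patch.shape, patch.min(), patch.max())
```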

235 citations


Posted Content
TL;DR: In this article, a geometrically consistent image-to-image translation network is proposed to translate synthetic images to real-world images, such that the so-generated images follow the same statistical distribution as real-world hand images.
Abstract: We address the highly challenging problem of real-time 3D hand tracking based on a monocular RGB-only sequence. Our tracking method combines a convolutional neural network with a kinematic 3D hand model, such that it generalizes well to unseen data, is robust to occlusions and varying camera viewpoints, and leads to anatomically plausible as well as temporally smooth hand motions. For training our CNN we propose a novel approach for the synthetic generation of training data that is based on a geometrically consistent image-to-image translation network. To be more specific, we use a neural network that translates synthetic images to "real" images, such that the so-generated images follow the same statistical distribution as real-world hand images. For training this translation network we combine an adversarial loss and a cycle-consistency loss with a geometric consistency loss in order to preserve geometric properties (such as hand pose) during translation. We demonstrate that our hand tracking system outperforms the current state-of-the-art on challenging RGB-only footage.

204 citations


Journal ArticleDOI
TL;DR: A widely used statistical body representation is rebuilt from the largest commercially available scan database by developing robust best-practice solutions for scan alignment that quantitatively lead to the best learned models, and the resulting model is made available to the community.

194 citations


Posted Content
TL;DR: This work presents the first marker-less approach for temporally coherent 3D performance capture of a human with general clothing from monocular video that significantly outperforms previous monocular methods in terms of accuracy, robustness and scene complexity that can be handled.
Abstract: We present the first marker-less approach for temporally coherent 3D performance capture of a human with general clothing from monocular video. Our approach reconstructs articulated human skeleton motion as well as medium-scale non-rigid surface deformations in general scenes. Human performance capture is a challenging problem due to the large range of articulation, potentially fast motion, and considerable non-rigid deformations, even from multi-view data. Reconstruction from monocular video alone is drastically more challenging, since strong occlusions and the inherent depth ambiguity lead to a highly ill-posed reconstruction problem. We tackle these challenges by a novel approach that employs sparse 2D and 3D human pose detections from a convolutional neural network using a batch-based pose estimation strategy. Joint recovery of per-batch motion allows to resolve the ambiguities of the monocular reconstruction problem based on a low dimensional trajectory subspace. In addition, we propose refinement of the surface geometry based on fully automatically extracted silhouettes to enable medium-scale non-rigid alignment. We demonstrate state-of-the-art performance capture results that enable exciting applications such as video editing and free viewpoint video, previously infeasible from monocular video. Our qualitative and quantitative evaluation demonstrates that our approach significantly outperforms previous monocular methods in terms of accuracy, robustness and scene complexity that can be handled.
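
The "low dimensional trajectory subspace" used for batch-based recovery can be illustrated with a discrete cosine basis: per-batch trajectories are represented by a few low-frequency coefficients, which regularizes noisy per-frame estimates and fills in frames where detection failed. Using a DCT basis, the batch length and the subspace dimension are illustrative assumptions; the paper's subspace and energy terms are more involved.

```python
import numpy as np

def dct_basis(n_frames, n_coeffs):
    """Orthonormal DCT-II basis; the low-frequency rows span smooth trajectories."""
    t = (np.arange(n_frames) + 0.5) / n_frames
    B = np.cos(np.pi * np.outer(np.arange(n_coeffs), t))
    B[0] *= 1.0 / np.sqrt(2.0)
    return B * np.sqrt(2.0 / n_frames)

F, K = 50, 8                                  # assumed batch length and subspace dimension
B = dct_basis(F, K)                           # (K, F)

# toy per-frame 3D detections for one joint, with noise and a few missing frames
t = np.linspace(0, 1, F)
clean = np.stack([np.sin(2 * np.pi * t), t, 0.2 * np.cos(2 * np.pi * t)], axis=1)   # (F, 3)
observed = clean + 0.05 * np.random.default_rng(2).normal(size=clean.shape)
weights = np.ones(F); weights[20:25] = 0.0    # frames with failed detections

# weighted least-squares fit of the trajectory inside the low-dimensional subspace
W = np.diag(weights)
coeffs = np.linalg.solve(B @ W @ B.T, B @ W @ observed)   # (K, 3)
recovered = B.T @ coeffs                                  # smooth, gap-filled trajectory
print(np.abs(recovered - clean).mean())
```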

140 citations


Posted Content
TL;DR: In this article, a multi-level face model is proposed to combine the advantage of 3D Morphable Models for regularization with the out-of-space generalization of a learned corrective space.
Abstract: The reconstruction of dense 3D models of face geometry and appearance from a single image is highly challenging and ill-posed. To constrain the problem, many approaches rely on strong priors, such as parametric face models learned from limited 3D scan data. However, prior models restrict generalization of the true diversity in facial geometry, skin reflectance and illumination. To alleviate this problem, we present the first approach that jointly learns 1) a regressor for face shape, expression, reflectance and illumination on the basis of 2) a concurrently learned parametric face model. Our multi-level face model combines the advantage of 3D Morphable Models for regularization with the out-of-space generalization of a learned corrective space. We train end-to-end on in-the-wild images without dense annotations by fusing a convolutional encoder with a differentiable expert-designed renderer and a self-supervised training loss, both defined at multiple detail levels. Our approach compares favorably to the state-of-the-art in terms of reconstruction quality, better generalizes to real world faces, and runs at over 250 Hz.

Posted Content
TL;DR: In this article, an occlusion-robust pose-map (ORPM) formulation is proposed for multi-person 3D pose estimation in general scenes from a monocular RGB camera; it outputs a fixed number of maps which encode the 3D joint locations of all people in the scene.
Abstract: We propose a new single-shot method for multi-person 3D pose estimation in general scenes from a monocular RGB camera. Our approach uses novel occlusion-robust pose-maps (ORPM) which enable full body pose inference even under strong partial occlusions by other people and objects in the scene. ORPM outputs a fixed number of maps which encode the 3D joint locations of all people in the scene. Body part associations allow us to infer 3D pose for an arbitrary number of people without explicit bounding box prediction. To train our approach we introduce MuCo-3DHP, the first large scale training data set showing real images of sophisticated multi-person interactions and occlusions. We synthesize a large corpus of multi-person images by compositing images of individual people (with ground truth from multi-view performance capture). We evaluate our method on our new challenging 3D annotated multi-person test set MuPoTs-3D where we achieve state-of-the-art performance. To further stimulate research in multi-person 3D pose estimation, we will make our new datasets and associated code publicly available for research purposes.

Journal ArticleDOI
TL;DR: A method for accurate marker-less capture of articulated skeleton motion of several subjects in general scenes, indoors and outdoors, even from input filmed with as few as two cameras is proposed, and is efficient and lends itself to implementation on parallel computing hardware, such as GPUs.
Abstract: Marker-less motion capture has seen great progress, but most state-of-the-art approaches fail to reliably track articulated human body motion with a very low number of cameras, let alone when applied in outdoor scenes with general background. In this paper, we propose a method for accurate marker-less capture of articulated skeleton motion of several subjects in general scenes, indoors and outdoors, even from input filmed with as few as two cameras. The new algorithm combines the strengths of a discriminative image-based joint detection method with a model-based generative motion tracking algorithm through a unified pose optimization energy. The discriminative part-based pose detection method is implemented using Convolutional Networks (ConvNet) and estimates unary potentials for each joint of a kinematic skeleton model. These unary potentials serve as the basis of a probabilistic extraction of pose constraints for tracking by using weighted sampling from a pose posterior that is guided by the model. In the final energy, we combine these constraints with an appearance-based model-to-image similarity term. Poses can be computed very efficiently using iterative local optimization, since joint detection with a trained ConvNet is fast, and since our formulation yields a combined pose estimation energy with analytic derivatives. In combination, this enables tracking of full articulated joint angles at state-of-the-art accuracy and temporal stability with a very low number of cameras. Our method is efficient and lends itself to implementation on parallel computing hardware, such as GPUs. We test our method extensively and show its advantages over related work on many indoor and outdoor data sets captured by ourselves, as well as data sets made available to the community by other research labs. The availability of good evaluation data sets is paramount for scientific progress, and many existing test data sets focus on controlled indoor settings, do not feature much variety in the scenes, and often lack a large corpus of data with ground truth annotation. We therefore further contribute with a new extensive test data set called MPI-MARCOnI for indoor and outdoor marker-less motion capture that features 12 scenes of varying complexity and varying camera count, and that features ground truth reference data from different modalities, ranging from manual joint annotations to marker-based motion capture results. Our new method is tested on these data, and the data set will be made available to the community.

Proceedings ArticleDOI
TL;DR: The method is robust to self-occlusion and occlusions by objects, particularly in moving egocentric perspectives, and a new photorealistic dataset is introduced that uses a merged reality approach to capture and synthesize large amounts of annotated data of natural hand interaction in cluttered scenes.
Abstract: We present an approach for real-time, robust and accurate hand pose estimation from moving egocentric RGB-D cameras in cluttered real environments. Existing methods typically fail for hand-object interactions in cluttered scenes imaged from egocentric viewpoints, common for virtual or augmented reality applications. Our approach uses two subsequently applied Convolutional Neural Networks (CNNs) to localize the hand and regress 3D joint locations. Hand localization is achieved by using a CNN to estimate the 2D position of the hand center in the input, even in the presence of clutter and occlusions. The localized hand position, together with the corresponding input depth value, is used to generate a normalized cropped image that is fed into a second CNN to regress relative 3D hand joint locations in real time. For added accuracy, robustness and temporal stability, we refine the pose estimates using a kinematic pose tracking energy. To train the CNNs, we introduce a new photorealistic dataset that uses a merged reality approach to capture and synthesize large amounts of annotated data of natural hand interaction in cluttered scenes. Through quantitative and qualitative evaluation, we show that our method is robust to self-occlusion and occlusions by objects, particularly in moving egocentric perspectives.

Posted Content
TL;DR: This work proposes to recover high-quality facial pose, shape, expression, reflectance and illumination using a deep neural network that is trained using a large, synthetically created dataset and builds on a novel loss function that measures model-space similarity directly in parameter space and significantly improves reconstruction accuracy.
Abstract: We introduce InverseFaceNet, a deep convolutional inverse rendering framework for faces that jointly estimates facial pose, shape, expression, reflectance and illumination from a single input image in a single shot. By estimating all these parameters from just a single image, advanced editing possibilities on a single face image, such as appearance editing and relighting, become feasible. Previous learning-based face reconstruction approaches do not jointly recover all dimensions, or are severely limited in terms of visual quality. In contrast, we propose to recover high-quality facial pose, shape, expression, reflectance and illumination using a deep neural network that is trained using a large, synthetically created dataset. Our approach builds on a novel loss function that measures model-space similarity directly in parameter space and significantly improves reconstruction accuracy. In addition, we propose an analysis-by-synthesis breeding approach which iteratively updates the synthetic training corpus based on the distribution of real-world images, and we demonstrate that this strategy outperforms completely synthetically trained networks. Finally, we show high-quality reconstructions and compare our approach to several state-of-the-art approaches.
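
A hedged reading of the loss "measured directly in parameter space": the regressed face-model parameters are compared with the ground-truth parameters of the synthetic training sample group by group, with per-group weights balancing their very different scales. The parameter layout, dimensions and weights below are illustrative assumptions, not the paper's.

```python
import torch

def model_space_loss(pred, target, slices, weights):
    """Weighted squared error computed directly on the face-model parameter vector."""
    loss = 0.0
    for name, sl in slices.items():
        loss = loss + weights[name] * torch.mean((pred[:, sl] - target[:, sl]) ** 2)
    return loss

# assumed layout of the code vector: pose | shape | expression | reflectance | illumination
slices = {"pose": slice(0, 6), "shape": slice(6, 86), "expression": slice(86, 150),
          "reflectance": slice(150, 230), "illumination": slice(230, 257)}
weights = {"pose": 10.0, "shape": 1.0, "expression": 1.0, "reflectance": 1.0, "illumination": 0.5}

pred = torch.randn(4, 257, requires_grad=True)   # network output for a batch
target = torch.randn(4, 257)                     # parameters of the synthetic training samples
loss = model_space_loss(pred, target, slices, weights)
loss.backward()
print(float(loss))
```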

Proceedings ArticleDOI
02 May 2017
TL;DR: WatchSense uses a depth sensor embedded in a wearable device to expand the input space to neighboring areas of skin and the space above it and increases the expressiveness of input by interweaving mid-air and multitouch for several interactive applications.
Abstract: This paper contributes a novel sensing approach to support on- and above-skin finger input for interaction on the move. WatchSense uses a depth sensor embedded in a wearable device to expand the input space to neighboring areas of skin and the space above it. Our approach addresses challenging camera-based tracking conditions, such as oblique viewing angles and occlusions. It can accurately detect fingertips, their locations, and whether they are touching the skin or hovering above it. It extends previous work that supported either mid-air or multitouch input by simultaneously supporting both. We demonstrate feasibility with a compact, wearable prototype attached to a user's forearm (simulating an integrated depth sensor). Our prototype---which runs in real-time on consumer mobile devices---enables a 3D input space on the back of the hand. We evaluated the accuracy and robustness of the approach in a user study. We also show how WatchSense increases the expressiveness of input by interweaving mid-air and multitouch for several interactive applications.

Journal ArticleDOI
TL;DR: Opt, as discussed by the authors, is a language for writing objective functions over image- or graph-structured unknowns concisely and at a high level; its compiler automatically transforms these specifications into state-of-the-art GPU solvers based on Gauss-Newton or Levenberg-Marquardt methods.
Abstract: Many graphics and vision problems can be expressed as non-linear least squares optimizations of objective functions over visual data, such as images and meshes. The mathematical descriptions of these functions are extremely concise, but their implementation in real code is tedious, especially when optimized for real-time performance on modern GPUs in interactive applications. In this work, we propose a new language, Opt, for writing these objective functions over image- or graph-structured unknowns concisely and at a high level. Our compiler automatically transforms these specifications into state-of-the-art GPU solvers based on Gauss-Newton or Levenberg-Marquardt methods. Opt can generate different variations of the solver, so users can easily explore tradeoffs in numerical precision, matrix-free methods, and solver approaches. In our results, we implement a variety of real-world graphics and vision applications. Their energy functions are expressible in tens of lines of code and produce highly optimized GPU solver implementations. These solvers are competitive in performance with the best published hand-tuned, application-specific GPU solvers, and orders of magnitude beyond a general-purpose auto-generated solver.
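
For intuition, the kind of solver Opt generates can be sketched as a generic dense Gauss-Newton loop (Opt itself emits specialized, matrix-free GPU code; this is only the mathematical skeleton, and the curve-fitting objective is an arbitrary example):

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=20, damping=1e-6):
    """Dense Gauss-Newton loop; the small diagonal damping makes it Levenberg-Marquardt-like."""
    x = x0.astype(float).copy()
    for _ in range(iters):
        r, J = residual(x), jacobian(x)
        dx = np.linalg.solve(J.T @ J + damping * np.eye(len(x)), -J.T @ r)
        x += dx
        if np.linalg.norm(dx) < 1e-10:
            break
    return x

# example objective written as residuals: fit y ~ a * exp(b * t)
t = np.linspace(0, 1, 30)
y = 2.0 * np.exp(-1.5 * t)
residual = lambda x: x[0] * np.exp(x[1] * t) - y
jacobian = lambda x: np.stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)], axis=1)
print(gauss_newton(residual, jacobian, np.array([1.0, 0.0])))   # approaches (2.0, -1.5)
```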

Proceedings ArticleDOI
30 Jul 2017
TL;DR: FaceVR is introduced, a novel method for gaze-aware facial reenactment in the Virtual Reality (VR) context that combines a robust algorithm to perform real-time facial motion capture of an actor who is wearing a head-mounted display (HMD) with a new data-driven approach for eye tracking from monocular videos.
Abstract: We introduce FaceVR, a novel method for gaze-aware facial reenactment in the Virtual Reality (VR) context. The key component of FaceVR is a robust algorithm to perform real-time facial motion capture of an actor who is wearing a head-mounted display (HMD), as well as a new data-driven approach for eye tracking from monocular videos. In addition to these face reconstruction components, FaceVR incorporates photo-realistic re-rendering in real time, thus allowing artificial modifications of face and eye appearances. For instance, we can alter facial expressions, change gaze directions, or remove the VR goggles in realistic re-renderings. In a live setup with a source and a target actor, we apply these newly-introduced algorithmic components. We assume that the source actor is wearing a VR device, and we capture his facial expressions and eye movement in real-time. For the target video, we mimic a similar tracking process; however, we use the source input to drive the animations of the target video, thus enabling gaze-aware facial reenactment. To render the modified target video on a stereo display, we augment our capture and reconstruction process with stereo data. In the end, FaceVR produces compelling results for a variety of applications, such as gaze-aware facial reenactment, reenactment in virtual reality, removal of VR goggles, and re-targeting of somebody's gaze direction in a video conferencing call.

Journal ArticleDOI
11 Aug 2017
TL;DR: A novel real-time approach for user-guided intrinsic decomposition of static scenes captured by an RGB-D sensor is presented, which improves on the decomposition quality of existing intrinsic video decomposition techniques by further constraining the ill-posed decomposition problem.
Abstract: We present a novel real-time approach for user-guided intrinsic decomposition of static scenes captured by an RGB-D sensor. In the first step, we acquire a three-dimensional representation of the scene using a dense volumetric reconstruction framework. The obtained reconstruction serves as a proxy to densely fuse reflectance estimates and to store user-provided constraints in three-dimensional space. User constraints, in the form of constant shading and reflectance strokes, can be placed directly on the real-world geometry using an intuitive touch-based interaction metaphor, or using interactive mouse strokes. Fusing the decomposition results and constraints in three-dimensional space allows for robust propagation of this information to novel views by re-projection. We leverage this information to improve on the decomposition quality of existing intrinsic video decomposition techniques by further constraining the ill-posed decomposition problem. In addition to improved decomposition quality, we show a variety of live augmented reality applications such as recoloring of objects, relighting of scenes and editing of material appearance.
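
A toy, hedged sketch of the underlying problem: in the log domain, image = reflectance + shading, and a user "constant reflectance" stroke adds equality constraints that make the otherwise ill-posed least-squares system better behaved. This offline grayscale sketch ignores the paper's volumetric fusion, chromaticity weighting and real-time solver; the toy scene, weights and stroke placement are assumptions.

```python
import numpy as np

# toy scene: reflectance is a bright square on a dark background, shading is a smooth gradient
H = W = 24
refl = np.full((H, W), 0.3); refl[6:18, 6:18] = 0.8
shade = np.tile(np.linspace(0.4, 1.0, W), (H, 1))
logI = np.log(refl * shade).ravel()
N = H * W
idx = np.arange(N).reshape(H, W)

pairs = [(idx[i, j], idx[i, j + 1]) for i in range(H) for j in range(W - 1)] + \
        [(idx[i, j], idx[i + 1, j]) for i in range(H - 1) for j in range(W)]

rows, b = [], []
for p, q in pairs:
    w = 1.0 if abs(logI[p] - logI[q]) < 0.2 else 0.01   # reflectance changes only across strong edges
    r = np.zeros(N); r[p], r[q] = w, -w
    rows.append(r); b.append(0.0)
    lam = 0.5                                           # shading (logI - x) should vary slowly
    r = np.zeros(N); r[p], r[q] = -lam, lam
    rows.append(r); b.append(lam * (logI[q] - logI[p]))

# user "constant reflectance" stroke across the background, plus one anchor equation
# that fixes the global scale ambiguity to an assumed stroke albedo
stroke = [idx[2, j] for j in range(2, 22)]
for p, q in zip(stroke[:-1], stroke[1:]):
    r = np.zeros(N); r[p], r[q] = 5.0, -5.0
    rows.append(r); b.append(0.0)
r = np.zeros(N); r[idx[2, 2]] = 1.0
rows.append(r); b.append(np.log(0.3))

x = np.linalg.lstsq(np.array(rows), np.array(b), rcond=None)[0]   # x = log reflectance
print("mean abs log-reflectance error:", np.abs(x - np.log(refl).ravel()).mean())
```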

Journal ArticleDOI
TL;DR: A novel approach is presented to recover true fine surface detail of deforming meshes reconstructed from multi-view video by formulating dense dynamic surface reconstruction as a global optimization problem of the densely deforming surface.
Abstract: This paper presents a novel approach to recover true fine surface detail of deforming meshes reconstructed from multi-view video. Template-based methods for performance capture usually produce a coarse-to-medium scale detail 4D surface reconstruction which does not contain the real high-frequency geometric detail present in the original video footage. Fine scale deformation is often incorporated in a second pass by using stereo constraints, features, or shading-based refinement. In this paper, we propose an alternative solution to this second stage by formulating dense dynamic surface reconstruction as a global optimization problem of the densely deforming surface. Our main contribution is an implicit representation of a deformable mesh that uses a set of Gaussian functions on the surface to represent the initial coarse mesh, and a set of Gaussians for the images to represent the original captured multi-view images. We effectively find the fine scale deformations for all mesh vertices, which maximize photo-temporal-consistency, by densely optimizing our model-to-image consistency energy on all vertex positions. Our formulation yields a smooth closed form energy with implicit occlusion handling and analytic derivatives. Furthermore, it does not require error-prone correspondence finding or discrete sampling of surface displacement values. We demonstrate our approach on a variety of datasets of human subjects wearing loose clothing and performing different motions. We qualitatively and quantitatively demonstrate that our technique successfully reproduces finer detail than the input baseline geometry.

Posted Content
TL;DR: This work introduces InverseFaceNet, a deep convolutional inverse rendering framework for faces that jointly estimates facial pose, shape, expression, reflectance and illumination from a single input image, and demonstrates that its self-supervised bootstrapping strategy outperforms completely synthetically trained networks.
Abstract: We introduce InverseFaceNet, a deep convolutional inverse rendering framework for faces that jointly estimates facial pose, shape, expression, reflectance and illumination from a single input image. By estimating all parameters from just a single image, advanced editing possibilities on a single face image, such as appearance editing and relighting, become feasible in real time. Most previous learning-based face reconstruction approaches do not jointly recover all dimensions, or are severely limited in terms of visual quality. In contrast, we propose to recover high-quality facial pose, shape, expression, reflectance and illumination using a deep neural network that is trained using a large, synthetically created training corpus. Our approach builds on a novel loss function that measures model-space similarity directly in parameter space and significantly improves reconstruction accuracy. We further propose a self-supervised bootstrapping process in the network training loop, which iteratively updates the synthetic training corpus to better reflect the distribution of real-world imagery. We demonstrate that this strategy outperforms completely synthetically trained networks. Finally, we show high-quality reconstructions and compare our approach to several state-of-the-art approaches.

Posted Content
TL;DR: In this paper, a lifting-free convex relaxation of quadratic optimisation problems over permutation matrices is presented that is provably at least as tight as existing lifting-free convex relaxations and is experimentally superior to existing convex and non-convex methods on problems such as image arrangement and multi-graph matching.
Abstract: In this work we study convex relaxations of quadratic optimisation problems over permutation matrices. While existing semidefinite programming approaches can achieve remarkably tight relaxations, they have the strong disadvantage that they lift the original n×n-dimensional variable to an n²×n²-dimensional variable, which limits their practical applicability. In contrast, here we present a lifting-free convex relaxation that is provably at least as tight as existing (lifting-free) convex relaxations. We demonstrate experimentally that our approach is superior to existing convex and non-convex methods for various problems, including image arrangement and multi-graph matching.
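
To illustrate the problem class rather than the paper's tightened relaxation, the sketch below relaxes the permutation to the doubly-stochastic (Birkhoff) polytope, runs Frank-Wolfe on a graph-matching objective, and rounds back to a permutation. This is a generic baseline under assumed problem sizes and step rules, not the relaxation proposed in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def objective(X, A, B):
    R = A - X @ B @ X.T
    return float(np.sum(R * R))

def frank_wolfe_match(A, B, iters=100):
    """min_X ||A - X B X^T||^2 over doubly-stochastic X, then round to a permutation."""
    n = A.shape[0]
    X = np.full((n, n), 1.0 / n)                        # barycenter initialization
    for _ in range(iters):
        G = -4.0 * (A - X @ B @ X.T) @ X @ B            # gradient (A and B symmetric)
        r, c = linear_sum_assignment(G)                 # linear minimization over the polytope
        S = np.zeros((n, n)); S[r, c] = 1.0
        if np.sum(G * (X - S)) < 1e-9:                  # Frank-Wolfe gap: no descent direction left
            break
        gammas = np.linspace(0.0, 1.0, 11)              # crude line search keeps the iterate feasible
        X = min((X + g * (S - X) for g in gammas), key=lambda Y: objective(Y, A, B))
    r, c = linear_sum_assignment(-X)                    # round to the closest permutation
    P = np.zeros((n, n)); P[r, c] = 1.0
    return P

rng = np.random.default_rng(3)
n = 12
B = rng.uniform(size=(n, n)); B = (B + B.T) / 2         # weighted graph
P_true = np.eye(n)[rng.permutation(n)]
A = P_true @ B @ P_true.T                               # isomorphic copy of the graph
P = frank_wolfe_match(A, B)
print("correctly matched nodes:", int((P * P_true).sum()), "/", n)
```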

Posted Content
09 Dec 2017
TL;DR: This work proposes a new efficient single-shot method for multi-person 3D pose estimation in general scenes from a monocular RGB camera that succeeds even under strong partial body occlusions by other people and objects in the scene.
Abstract: We propose a new efficient single-shot method for multi-person 3D pose estimation in general scenes from a monocular RGB camera. Our fully convolutional DNN-based approach jointly infers 2D and 3D joint locations on the basis of an extended 3D location map supported by body part associations. This new formulation enables the readout of full body poses at a subset of visible joints without the need for explicit bounding box tracking. It therefore succeeds even under strong partial body occlusions by other people and objects in the scene. We also contribute the first training data set showing real images of sophisticated multi-person interactions and occlusions. To this end, we leverage multi-view video-based performance capture of individual people for ground truth annotation and a new image compositing for user-controlled synthesis of large corpora of real multi-person images. We also propose a new video-recorded multi-person test set with ground truth 3D annotations. Our method achieves state-of-the-art performance on challenging multi-person scenes.
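
The "readout" step named above can be sketched as follows: for every joint, the 3D coordinate channels of the location maps are read at the position of that joint's 2D heatmap maximum. The occlusion-robust redundancy and body-part association logic of the paper are omitted, and the map resolution and joint count are assumptions.

```python
import numpy as np

def read_pose(heatmaps, locmap_x, locmap_y, locmap_z):
    """Read each joint's 3D coordinate from the location maps at its 2D heatmap maximum."""
    J, H, W = heatmaps.shape
    pose = np.zeros((J, 3))
    for j in range(J):
        v, u = np.unravel_index(np.argmax(heatmaps[j]), (H, W))
        pose[j] = (locmap_x[j, v, u], locmap_y[j, v, u], locmap_z[j, v, u])
    return pose

# toy network outputs for one person: 17 joints on a 64x64 output grid
rng = np.random.default_rng(5)
J, H, W = 17, 64, 64
heatmaps = rng.uniform(size=(J, H, W))
locmap_x, locmap_y, locmap_z = rng.normal(size=(3, J, H, W))
print(read_pose(heatmaps, locmap_x, locmap_y, locmap_z).shape)   # (17, 3)
```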

Posted Content
TL;DR: An automatic method is proposed for generating high-quality annotations for depth-based hand segmentation by exploiting the visual cues given by an RGBD sensor and a pair of colored gloves, which lowers the cost/complexity of creating high-quality datasets and makes it easy to expand the dataset in the future.
Abstract: We propose an automatic method for generating high-quality annotations for depth-based hand segmentation, and introduce a large-scale hand segmentation dataset. Existing datasets are typically limited to a single hand. By exploiting the visual cues given by an RGBD sensor and a pair of colored gloves, we automatically generate dense annotations for two hand segmentation. This lowers the cost/complexity of creating high quality datasets, and makes it easy to expand the dataset in the future. We further show that existing datasets, even with data augmentation, are not sufficient to train a hand segmentation algorithm that can distinguish two hands. Source and datasets will be made publicly available.
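
A hedged sketch of the colored-glove annotation idea: with each hand wearing a distinctly colored glove, per-pixel labels for an aligned RGB-D frame can be generated automatically by color thresholding plus a depth gate. The HSV ranges and depth threshold below are illustrative and would be calibrated to the actual gloves and capture setup.

```python
import numpy as np
import cv2

# assumed HSV ranges for a blue glove (left hand) and a green glove (right hand)
LEFT_RANGE  = (np.array([100,  80,  50]), np.array([130, 255, 255]))
RIGHT_RANGE = (np.array([ 40,  80,  50]), np.array([ 80, 255, 255]))

def glove_labels(bgr, depth, max_depth_mm=1000):
    """Return a label map: 0 background, 1 left hand, 2 right hand."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    near = (depth > 0) & (depth < max_depth_mm)             # ignore far-away pixels
    left  = (cv2.inRange(hsv, *LEFT_RANGE)  > 0) & near
    right = (cv2.inRange(hsv, *RIGHT_RANGE) > 0) & near
    labels = np.zeros(bgr.shape[:2], dtype=np.uint8)
    labels[left], labels[right] = 1, 2
    return labels

# toy frame: a blue patch and a green patch in front of the camera
bgr = np.zeros((240, 320, 3), dtype=np.uint8)
bgr[50:100, 40:90]   = (255, 0, 0)          # blue in BGR
bgr[50:100, 200:250] = (0, 255, 0)          # green in BGR
depth = np.full((240, 320), 600, dtype=np.uint16)
print(np.unique(glove_labels(bgr, depth), return_counts=True))
```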

Posted Content
16 Nov 2017
TL;DR: This work introduces a large-scale RGBD hand segmentation dataset, with detailed and automatically generated high-quality ground-truth annotations, and proposes a novel architecture employing strided convolution/deconvolutions in place of max-pooling and unpooling layers.
Abstract: We introduce a large-scale RGBD hand segmentation dataset, with detailed and automatically generated high-quality ground-truth annotations. Existing real-world datasets are limited in quantity due to the difficulty in manually annotating ground-truth labels. By leveraging a pair of brightly colored gloves and an RGBD camera, we propose an acquisition pipeline that eases the task of annotating very large datasets with minimal human intervention. We then quantify the importance of a large annotated dataset in this domain, and compare the performance of existing datasets in the training of deep-learning architectures. Finally, we propose a novel architecture employing strided convolution/deconvolutions in place of max-pooling and unpooling layers. Our variant outperforms baseline architectures while remaining computationally efficient at inference time. Source and datasets will be made publicly available.
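
A minimal sketch of the architectural variant named above: an encoder-decoder segmentation network that downsamples with strided convolutions and upsamples with transposed convolutions instead of max-pooling/unpooling. The channel widths, depth and RGB-D input format are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class StridedSegNet(nn.Module):
    def __init__(self, in_ch=4, n_classes=3):               # RGB-D input; background/left/right
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),   # 1/2
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),      # 1/4
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),     # 1/8
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, n_classes, 4, stride=2, padding=1),             # per-pixel logits
        )
    def forward(self, x):
        return self.decoder(self.encoder(x))

net = StridedSegNet()
x = torch.randn(2, 4, 128, 128)                 # batch of RGB-D crops
target = torch.randint(0, 3, (2, 128, 128))     # per-pixel ground-truth labels
logits = net(x)
loss = nn.functional.cross_entropy(logits, target)
loss.backward()
print(logits.shape, float(loss))
```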

Proceedings ArticleDOI
12 Jun 2017
TL;DR: This work learns low-dimensional continuous criteria via interactive ranking, so that the novice user need only describe the relative ordering of examples, and actively suggests data points to the user to rank in a more informative way than existing work.
Abstract: Large databases are often organized by hand-labeled metadata—or criteria—which are expensive to collect. We can use unsupervised learning to model database variation, but these models are often high dimensional, complex to parameterize, or require expert knowledge. We learn low-dimensional continuous criteria via interactive ranking, so that the novice user need only describe the relative ordering of examples. This is formed as semi-supervised label propagation in which we maximize the information gained from a limited number of examples. Further, we actively suggest data points to the user to rank in a more informative way than existing work. Our efficient approach allows users to interactively organize thousands of data points along 1D and 2D continuous sliders. We experiment with databases of imagery and geometry to demonstrate that our tool is useful for quickly assessing and organizing the content of large databases.
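
The semi-supervised label propagation at the core of this approach can be sketched on a k-nearest-neighbour graph: the few user-ranked examples are clamped to their slider positions and diffused to all other items, so every data point receives a continuous criterion value. The feature space, graph construction and parameters here are illustrative assumptions, and the active suggestion of which items to rank next is not shown.

```python
import numpy as np

def propagate(features, labeled_idx, labeled_vals, k=10, alpha=0.99, iters=200):
    n = features.shape[0]
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (np.median(d2) + 1e-12))
    np.fill_diagonal(W, 0.0)
    keep = np.argsort(-W, axis=1)[:, :k]                   # keep the k strongest edges per node
    M = np.zeros_like(W)
    np.put_along_axis(M, keep, np.take_along_axis(W, keep, axis=1), axis=1)
    W = np.maximum(M, M.T)                                 # symmetrize
    S = W / W.sum(axis=1, keepdims=True)                   # row-normalized transition matrix
    y = np.zeros(n); y[labeled_idx] = labeled_vals
    f = y.copy()
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * y                # diffuse, stay close to user labels
        f[labeled_idx] = labeled_vals                      # clamp the ranked examples
    return f

# toy data: points along a 1D manifold in 2D; the user ranks only three of them
rng = np.random.default_rng(4)
t = np.sort(rng.uniform(0, 1, 200))
features = np.stack([t, 0.05 * rng.normal(size=200)], axis=1)
f = propagate(features, labeled_idx=[0, 100, 199], labeled_vals=[0.0, 0.5, 1.0])
print(np.corrcoef(f, t)[0, 1])                             # should correlate strongly with t
```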

Posted Content
TL;DR: In this paper, the authors learn low-dimensional continuous criteria via interactive ranking, so that the novice user need only describe the relative ordering of examples, and actively suggest data points to the user to rank in a more informative way than existing work.
Abstract: Large databases are often organized by hand-labeled metadata, or criteria, which are expensive to collect. We can use unsupervised learning to model database variation, but these models are often high dimensional, complex to parameterize, or require expert knowledge. We learn low-dimensional continuous criteria via interactive ranking, so that the novice user need only describe the relative ordering of examples. This is formed as semi-supervised label propagation in which we maximize the information gained from a limited number of examples. Further, we actively suggest data points to the user to rank in a more informative way than existing work. Our efficient approach allows users to interactively organize thousands of data points along 1D and 2D continuous sliders. We experiment with datasets of imagery and geometry to demonstrate that our tool is useful for quickly assessing and organizing the content of large databases.