
Showing papers by "Christian Theobalt published in 2012"


Journal ArticleDOI
01 Nov 2012
TL;DR: This approach is the first to capture facial performances of such high quality from a single stereo rig and it is demonstrated that it brings facial performance capture out of the studio, into the wild, and within the reach of everybody.
Abstract: Recent progress in passive facial performance capture has shown impressively detailed results on highly articulated motion. However, most methods rely on complex multi-camera set-ups, controlled lighting or fiducial markers. This prevents them from being used in general environments, outdoor scenes, during live action on a film set, or by freelance animators and everyday users who want to capture their digital selves. In this paper, we therefore propose a lightweight passive facial performance capture approach that is able to reconstruct high-quality dynamic facial geometry from only a single pair of stereo cameras. Our method succeeds under uncontrolled and time-varying lighting, and also in outdoor scenes. Our approach builds upon and extends recent image-based scene flow computation, lighting estimation and shading-based refinement algorithms. It integrates them into a pipeline that is specifically tailored towards facial performance reconstruction from challenging binocular footage under uncontrolled lighting. In an experimental evaluation, the strong capabilities of our method become explicit: We achieve detailed and spatio-temporally coherent results for expressive facial motion in both indoor and outdoor scenes -- even from low quality input images recorded with a hand-held consumer stereo camera. We believe that our approach is the first to capture facial performances of such high quality from a single stereo rig and we demonstrate that it brings facial performance capture out of the studio, into the wild, and within the reach of everybody.
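
The shading-based refinement stage admits a compact schematic form. Assuming Lambertian reflectance with per-pixel albedo and a low-order lighting model (the symbols and weights below are notational assumptions, not the paper's exact energy), the refined facial geometry X minimizes

```latex
\[
\min_{X}\; \sum_{x} \big\| I(x) - \rho(x)\, s_{l}\!\big(n_X(x)\big) \big\|^2
\;+\; \lambda_{\mathrm{flow}}\, E_{\mathrm{flow}}(X)
\;+\; \lambda_{\mathrm{smooth}}\, E_{\mathrm{smooth}}(X)
\]
```

where I is the observed image, ρ the estimated albedo, s_l the shading predicted from the estimated lighting l, and n_X(x) the surface normal visible at pixel x; the remaining terms keep the result close to the scene-flow estimate and spatially smooth.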

178 citations


Journal ArticleDOI
TL;DR: To turn a video camera augmented with a recent infrared time-of-flight depth camera into a practical RGBZ video camera, efficient data filtering techniques are developed that are tailored to the noise characteristics of IR depth cameras.
Abstract: Sophisticated video processing effects require both image and geometry information. We explore the possibility of augmenting a video camera with a recent infrared time-of-flight depth camera, to capture high-resolution RGB and low-resolution, noisy depth at video frame rates. To turn such a setup into a practical RGBZ video camera, we develop efficient data filtering techniques that are tailored to the noise characteristics of IR depth cameras. We first remove typical artefacts in the RGBZ data and then apply an efficient spatiotemporal denoising and upsampling scheme. This allows us to record temporally coherent RGBZ videos at interactive frame rates and to use them to render a variety of effects in unprecedented quality. We show effects such as video relighting, geometry-based abstraction and stylisation, background segmentation and rendering in stereoscopic 3D. © 2012 Wiley Periodicals, Inc.
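
A standard building block for this kind of RGB-guided depth processing is joint bilateral upsampling, where the high-resolution image steers an edge-preserving filter on the low-resolution depth. The NumPy sketch below is a minimal, unoptimized illustration of that idea (nearest-neighbour pre-upsampling, wrap-around borders); it is not the authors' filter, which is additionally temporal and tuned to ToF noise characteristics.

```python
import numpy as np

def joint_bilateral_upsample(depth_lo, rgb_hi, sigma_s=4.0, sigma_r=0.1, radius=8):
    """Upsample low-res depth to RGB resolution, guided by the RGB image."""
    H, W = rgb_hi.shape[:2]
    h, w = depth_lo.shape
    # Nearest-neighbour upsampling as the starting point.
    ys = np.arange(H) * h // H
    xs = np.arange(W) * w // W
    depth_nn = depth_lo[ys[:, None], xs[None, :]]

    guide = rgb_hi.astype(np.float64) / 255.0
    out = np.zeros((H, W))
    weight_sum = np.zeros((H, W))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = np.roll(np.roll(depth_nn, dy, axis=0), dx, axis=1)
            g_shift = np.roll(np.roll(guide, dy, axis=0), dx, axis=1)
            # Spatial weight (distance) times range weight (color similarity).
            w_s = np.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))
            w_r = np.exp(-np.sum((guide - g_shift) ** 2, axis=2) / (2 * sigma_r ** 2))
            out += w_s * w_r * shifted
            weight_sum += w_s * w_r
    return out / np.maximum(weight_sum, 1e-8)
```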

141 citations


Book ChapterDOI
07 Oct 2012
TL;DR: This work presents an algorithm for marker-less performance capture of interacting humans using only three hand-held Kinect cameras that succeeds on general uncontrolled indoor scenes with potentially dynamic background, and it succeeds even if the cameras are moving.
Abstract: We present an algorithm for marker-less performance capture of interacting humans using only three hand-held Kinect cameras. Our method reconstructs human skeletal poses, deforming surface geometry and camera poses for every time step of the depth video. Skeletal configurations and camera poses are found by solving a joint energy minimization problem which optimizes the alignment of RGBZ data from all cameras, as well as the alignment of human shape templates to the Kinect data. The energy function is based on a combination of geometric correspondence finding, implicit scene segmentation, and correspondence finding using image features. Only the combination of geometric and photometric correspondences and the integration of human pose and camera pose estimation enables reliable performance capture with only three sensors. As opposed to previous performance capture methods, our algorithm succeeds on general uncontrolled indoor scenes with potentially dynamic background, and it succeeds even if the cameras are moving.
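
Schematically, the joint optimization over all actors' skeletal poses Θ and the hand-held camera poses C has the form below; the term names and weights are illustrative assumptions, with the exact energy defined in the paper.

```latex
\[
(\hat{\Theta}, \hat{C}) \;=\; \arg\min_{\Theta,\,C}\;
E_{\mathrm{geo}}(\Theta, C)
\;+\; \lambda_{\mathrm{photo}}\, E_{\mathrm{photo}}(\Theta, C)
\;+\; \lambda_{\mathrm{seg}}\, E_{\mathrm{seg}}(\Theta, C)
\]
```

Here E_geo aligns the human shape templates to the merged RGBZ data, E_photo scores image-feature correspondences, and E_seg accounts for the implicit scene segmentation; as the abstract notes, only the combination of all cues makes three moving sensors sufficient.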

119 citations


Book ChapterDOI
07 Oct 2012
TL;DR: This work provides experimental validation with several real-world video sequences to demonstrate that, unlike in previous work, inpainting videos shot with free-moving cameras does not necessarily require estimation of absolute camera positions and per-frame per-pixel depth maps.
Abstract: We propose a method for removing marked dynamic objects from videos captured with a free-moving camera, so long as the objects occlude parts of the scene with a static background. Our approach takes as input a video, a mask marking the object to be removed, and a mask marking the dynamic objects to remain in the scene. To inpaint a frame, we align other candidate frames in which parts of the missing region are visible. Among these candidates, a single source is chosen to fill each pixel so that the final arrangement is color-consistent. Intensity differences between sources are smoothed using gradient domain fusion. Our frame alignment process assumes that the scene can be approximated using piecewise planar geometry: a set of homographies is estimated for each frame pair, and one is selected per pixel such that the color discrepancy is minimized and the epipolar constraints are maintained. We provide experimental validation with several real-world video sequences to demonstrate that, unlike in previous work, inpainting videos shot with free-moving cameras does not necessarily require estimation of absolute camera positions and per-frame per-pixel depth maps.
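
The alignment step can be approximated with off-the-shelf tools. The OpenCV sketch below aligns one candidate frame to the target with a single RANSAC-fitted homography, i.e. the piecewise-planar assumption collapsed to one dominant plane; the paper instead estimates a set of homographies per frame pair and selects one per pixel under color and epipolar constraints. All parameter choices are illustrative.

```python
import cv2
import numpy as np

def align_candidate_frame(target_gray, candidate_gray):
    """Warp a candidate frame onto the target using one dominant homography."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(candidate_gray, None)
    kp2, des2 = orb.detectAndCompute(target_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # RANSAC discards correspondences on dynamic objects and other planes.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    h, w = target_gray.shape[:2]
    return cv2.warpPerspective(candidate_gray, H, (w, h))
```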

113 citations


Journal ArticleDOI
TL;DR: This work proposes a new approach to video completion that can deal with complex scenes containing dynamic background and non‐periodical moving objects, and builds upon the idea that the spatio‐temporal hole left by a removed object can be filled with data available on other regions of the video where the occluded objects were visible.
Abstract: Removing dynamic objects from videos is an extremely challenging problem that even visual effects professionals often solve with time-consuming manual frame-by-frame editing. We propose a new approach to video completion that can deal with complex scenes containing dynamic background and non-periodical moving objects. We build upon the idea that the spatio-temporal hole left by a removed object can be filled with data available in other regions of the video where the occluded objects were visible. Video completion is performed by solving a large combinatorial problem that searches for an optimal pattern of pixel offsets from occluded to unoccluded regions. Our contribution includes an energy functional that generalizes well over different scenes with stable parameters, and that has the desirable convergence properties for a graph-cut-based optimization. We provide an interface to guide the completion process that both reduces computation time and allows for efficient correction of small errors in the result. We demonstrate that our approach can effectively complete complex, high-resolution occlusions that are greater in difficulty than what existing methods have shown. © 2012 Wiley Periodicals, Inc.
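
The combinatorial search is a pixel-labeling problem in which each label is an offset into an unoccluded region; a schematic version of such an energy (the paper's actual functional and its convergence analysis differ in the details) is

```latex
\[
E(L) \;=\; \sum_{p \in \Omega} D\big(p,\, L(p)\big)
\;+\; \lambda \sum_{(p,q) \in \mathcal{N}} V\big(L(p),\, L(q)\big)
\]
```

where Ω is the spatio-temporal hole, L(p) the offset from which pixel p copies its color, D measures color consistency of the copied data, and V penalizes neighboring pixels copying from incoherent offsets; energies of this form are amenable to graph-cut optimization such as alpha-expansion.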

75 citations


Book ChapterDOI
07 Oct 2012
TL;DR: A marker-less method for full-body human performance capture that analyzes shading information from a sequence of multi-view images recorded under uncontrolled and changing lighting conditions; it is applicable in cases where background segmentation cannot be performed or a set of training poses is unavailable.
Abstract: This paper presents a marker-less method for full body human performance capture by analyzing shading information from a sequence of multi-view images, which are recorded under uncontrolled and changing lighting conditions. Both the articulated motion of the limbs and the fine-scale surface detail are estimated in a temporally coherent manner. In a temporal framework, differential 3D human pose-changes from the previous time-step are expressed in terms of constraints on the visible image displacements derived from shading cues, estimated albedo and estimated scene illumination. The incident illumination at each frame is estimated jointly with pose, by assuming the Lambertian model of reflectance. The proposed method is independent of image silhouettes and training data, and is thus applicable in cases where background segmentation cannot be performed or a set of training poses is unavailable. We show results on challenging cases for pose-tracking such as changing backgrounds, occlusions and changing lighting conditions.
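
Under the Lambertian assumption stated above, a standard way to write the image formation (widely used in shading-based methods; the paper's exact parameterization may differ) employs a second-order spherical-harmonics lighting model:

```latex
\[
I(x) \;\approx\; \rho(x) \sum_{k=1}^{9} l_k\, H_k\big(n(x)\big)
\]
```

where ρ(x) is the albedo at pixel x, n(x) the surface normal, H_k the spherical-harmonics basis functions, and l_k the illumination coefficients estimated jointly with the pose. Differentiating this model with respect to the pose parameters yields the shading-derived constraints on image displacements mentioned in the abstract.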

56 citations


Proceedings ArticleDOI
16 Jun 2012
TL;DR: A new spatio-temporal method for markerless motion capture that reconstructs the pose and motion of a character from a multi-view video sequence without requiring the cameras to be synchronized and without aligning captured frames in time is presented.
Abstract: We present a new spatio-temporal method for markerless motion capture. We reconstruct the pose and motion of a character from a multi-view video sequence without requiring the cameras to be synchronized and without aligning captured frames in time. By formulating the model-to-image similarity measure as a temporally continuous functional, we are also able to reconstruct motion in much higher temporal detail than was possible with previous synchronized approaches. By purposefully running the cameras unsynchronized, we can capture even very fast motion at effective frame rates exceeding those that off-the-shelf but high-quality cameras provide.
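
Schematically, the temporally continuous similarity measure integrates a model-to-image consistency score along a continuous pose curve θ(t), summed over cameras; because each unsynchronized camera samples this integral at its own timestamps, no frame alignment is needed. The notation below is an illustrative assumption, not the paper's exact functional:

```latex
\[
E(\theta) \;=\; \sum_{c=1}^{C} \int_{t_0}^{t_1}
\Phi\Big( I_c(t),\; \Pi_c\big(M(\theta(t))\big) \Big)\, \mathrm{d}t
\]
```

where M(θ(t)) is the body model posed at time t, Π_c the projection into camera c, and Φ the per-view similarity.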

46 citations


Journal ArticleDOI
01 Jul 2012
TL;DR: A system that analyzes collections of unstructured but related video data to create a Videoscape, a data structure that enables interactive exploration of video collections by visually navigating -- spatially and/or temporally -- between different clips, is proposed.
Abstract: The abundance of mobile devices and digital cameras with video capture makes it easy to obtain large collections of video clips that contain the same location, environment, or event. However, such an unstructured collection is difficult to comprehend and explore. We propose a system that analyzes collections of unstructured but related video data to create a Videoscape: a data structure that enables interactive exploration of video collections by visually navigating -- spatially and/or temporally -- between different clips. We automatically identify transition opportunities, or portals. From these portals, we construct the Videoscape, a graph whose edges are video clips and whose nodes are portals between clips. Now structured, the videos can be interactively explored by walking the graph or by geographic map. Given this system, we gauge preference for different video transition styles in a user study, and generate heuristics that automatically choose an appropriate transition style. We evaluate our system using three further user studies, which allows us to conclude that Videoscapes provides significant benefits over related methods. Our system leads to previously unseen ways of interactive spatio-temporal exploration of casually captured videos, and we demonstrate this on several video collections.
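
The core data structure is easy to state: a graph whose edges are video clips and whose nodes are portals. A minimal Python sketch (all names hypothetical) of the structure and of "walking the graph" follows.

```python
from dataclasses import dataclass, field

@dataclass
class Portal:
    """Node: a visual transition opportunity shared by two or more clips."""
    portal_id: int
    geo_location: tuple | None = None  # (lat, lon), if available

@dataclass
class Clip:
    """Edge: a video segment running from one portal to another."""
    video_file: str
    start_portal: int
    end_portal: int

@dataclass
class Videoscape:
    portals: dict[int, Portal] = field(default_factory=dict)
    clips: list[Clip] = field(default_factory=list)

    def choices_at(self, portal_id: int) -> list[Clip]:
        """Clips offered to the viewer when the current clip ends here."""
        return [c for c in self.clips if c.start_portal == portal_id]
```

Exploration then alternates between playing a clip and, at its end portal, offering choices_at(end_portal), rendering one of the studied transition styles into the selected clip.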

45 citations


Journal ArticleDOI
TL;DR: This paper presents an algorithm that automatically creates animation rigs for multi‐component 3D models, as they are typically found in online shape databases, and implicitly handles large scale and proportional differences between input and target skeletons.
Abstract: Rigging an arbitrary 3D character by creating an animation skeleton is a time-consuming process even for experienced animators. In this paper, we present an algorithm that automatically creates animation rigs for multi-component 3D models, as they are typically found in online shape databases. Our algorithm takes as input a multi-component model and an input animation skeleton with associated motion data. It then creates a target skeleton for the input model, calculates rigid skinning weights, and computes a mapping between the joints of the target skeleton and the input animation skeleton. The automatic approach does not need additional semantic information, such as component labels or user-provided correspondences, and succeeds on a wide range of models with widely varying numbers of components. It implicitly handles large scale and proportional differences between input and target skeletons and can deal with certain morphological differences, e.g., if input and target have different numbers of limbs. The output of our algorithm can be directly used in a retargeting system to create a plausible animated character. © 2012 Wiley Periodicals, Inc.
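
For the rigid-skinning step, a common baseline is to bind each component to its nearest bone, measured as point-to-segment distance from the component centroid. The sketch below implements only that baseline; it is a stand-in for, not a description of, the paper's skinning computation.

```python
import numpy as np

def point_segment_distance(p, a, b):
    """Distance from point p to the bone segment with endpoints a and b."""
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / max(np.dot(ab, ab), 1e-12), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def rigid_skinning(component_centroids, bones):
    """bones: list of (head, tail) joint-position pairs.
    Returns, per component, the index of the bone it is rigidly bound to,
    i.e. a skinning weight of 1 for that bone and 0 for all others."""
    return [int(np.argmin([point_segment_distance(c, a, b) for a, b in bones]))
            for c in component_centroids]
```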

40 citations


Book ChapterDOI
28 Aug 2012
TL;DR: A novel algorithm for temporally synchronizing multiple videos capturing the same dynamic scene by using a stable RANSAC-based optimization approach that identifies an informative subset of video pairs which prevents the RansAC algorithm from being biased by outliers.
Abstract: We present a novel algorithm for temporally synchronizing multiple videos capturing the same dynamic scene. Our algorithm relies on general image features and it does not require explicitly tracking any specific object, making it applicable to general scenes with complex motion. This is facilitated by our new trajectory filtering and matching schemes that correctly identify matching pairs of trajectories (inliers) from a large set of potential candidate matches, of which many are outliers. We find globally optimal synchronization parameters by using a stable RANSAC-based optimization approach. For multi-video synchronization, the algorithm identifies an informative subset of video pairs which prevents the RANSAC algorithm from being biased by outliers. Experiments on two-camera and multi-camera synchronization demonstrate the performance of our algorithm.
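
For two cameras, the RANSAC stage reduces to a one-parameter problem if the synchronization model is a pure temporal offset (a simplification; the paper's model and scoring may be richer). A hypothetical sketch:

```python
import random

def ransac_time_offset(candidate_matches, offset_of, n_iters=1000, tol=0.5):
    """candidate_matches: trajectory pairs, many of which are outliers.
    offset_of(match): temporal offset (in frames) implied by one pair.
    Returns the offset supported by the most candidates, plus its inliers."""
    best_inliers = []
    for _ in range(n_iters):
        hyp = offset_of(random.choice(candidate_matches))
        inliers = [m for m in candidate_matches if abs(offset_of(m) - hyp) < tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refine the hypothesis by averaging over its inlier set.
    refined = sum(offset_of(m) for m in best_inliers) / len(best_inliers)
    return refined, best_inliers
```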

25 citations


Proceedings ArticleDOI
13 Oct 2012
TL;DR: A system for real-time deformation of the shape and appearance of people who are standing in front of a depth+RGB camera, such as the Microsoft Kinect, made possible by a morphable model of 3D human shape that was learnt from a large database of 3D scans of people in various body shapes and poses.
Abstract: We present a system for real-time deformation of the shape and appearance of people who are standing in front of a depth+RGB camera, such as the Microsoft Kinect. Our system allows manipulating human body shape parameters such as height, muscularity, weight, waist girth and leg length. The manipulated appearance is displayed in real time. Thus, instead of posing in front of a real mirror and visualizing their appearance, users can pose in front of a 'virtual mirror' and visualize themselves in different body shapes. Our system is made possible by a morphable model of 3D human shape that was learnt from a large database of 3D scans of people in various body shapes and poses. In an initialization step, which lasts a couple of seconds, this model is fitted to the depth data to estimate each person's 3D shape parameters. Then, a succession of pose tracking, body segmentation, shape deformation and image warping steps is performed -- in real time and independently for multiple people. We present a variety of results in the paper and the video, showing the interactive virtual mirror cabinet experience.
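
Morphable models of this kind are conventionally linear: a body shape is the mean shape plus a weighted sum of PCA basis deformations learnt from the scan database. The NumPy sketch below shows fitting and resynthesis in that standard form, assuming vertex correspondence to the observed depth points is already established (which the real system must solve during initialization).

```python
import numpy as np

def fit_shape_coefficients(observed, mean_shape, basis, reg=1e-2):
    """observed, mean_shape: (3N,) stacked vertex coordinates.
    basis: (3N, K) PCA deformation basis learnt from a scan database.
    Solves min_w ||mean_shape + basis @ w - observed||^2 + reg * ||w||^2."""
    K = basis.shape[1]
    b = observed - mean_shape
    # Ridge regularization keeps the coefficients in a plausible range.
    return np.linalg.solve(basis.T @ basis + reg * np.eye(K), basis.T @ b)

def deform(mean_shape, basis, w):
    """Reconstruct a body shape from (possibly edited) coefficients."""
    return mean_shape + basis @ w
```

Shape manipulation then amounts to editing the entries of w that correlate with attributes such as height or weight before calling deform.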

Book ChapterDOI
07 Oct 2012
TL;DR: This work poses the identification of potential matches as a link prediction problem in an image correspondence graph, and proposes an effective algorithm to solve this problem.
Abstract: How best to efficiently establish correspondence among a large set of images or video frames is an interesting unanswered question. For large databases, the high computational cost of performing pair-wise image matching is a major problem. However, for many applications, images are inherently sparsely connected, and so current techniques try to correctly estimate small potentially matching subsets of databases upon which to perform expensive pair-wise matching. Our contribution is to pose the identification of potential matches as a link prediction problem in an image correspondence graph, and to propose an effective algorithm to solve this problem. Our algorithm facilitates incremental image matching: initially, the match graph is very sparse, but it becomes dense as we alternate between link prediction and verification. We demonstrate the effectiveness of our algorithm by comparing it with several existing alternatives on large-scale databases. Our resulting match graph is useful for many different applications. As an example, we show the benefits of our graph construction method to a label propagation application which propagates user-provided sparse object labels to other instances of that object in large image collections.
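
The alternation between link prediction and verification can be organized as below. A common-neighbour score stands in for the paper's link predictor (an assumption made for brevity), and the expensive pairwise matcher is supplied as a callback.

```python
import itertools

def build_match_graph(n_images, seed_pairs, verify, rounds=5, top_k=100):
    """Alternate link prediction and verification on a match graph.
    verify(i, j) -> bool performs expensive pairwise image matching."""
    neighbours = {i: set() for i in range(n_images)}
    tested = set()
    for i, j in seed_pairs:
        tested.add(frozenset((i, j)))
        if verify(i, j):
            neighbours[i].add(j)
            neighbours[j].add(i)
    for _ in range(rounds):
        # Predict: untested pairs sharing verified neighbours are promising.
        scores = []
        for i, j in itertools.combinations(range(n_images), 2):
            if frozenset((i, j)) in tested:
                continue
            common = len(neighbours[i] & neighbours[j])
            if common > 0:
                scores.append((common, i, j))
        # Verify only the most promising predicted links; the graph densifies.
        for _, i, j in sorted(scores, reverse=True)[:top_k]:
            tested.add(frozenset((i, j)))
            if verify(i, j):
                neighbours[i].add(j)
                neighbours[j].add(i)
    return neighbours
```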

Patent
11 May 2012
TL;DR: In this paper, the authors present methods and systems that enable interactive exploration of digital videos casually captured by consumer devices such as mobile phone cameras and tablets, including transitions and other such features.
Abstract: Approaches presented herein enable the interactive exploration of digital videos. The videos can include digital videos that have been casually captured by consumer devices, such as mobile phone cameras, tablets, and the like. Robust methods and systems are presented that enable such digital videos to be explored in interesting and advantageous ways, including transitions and other such features.

Journal ArticleDOI
TL;DR: This work presents a markerless performance capture system that can acquire the motion and the texture of human actors performing fast movements using only commodity hardware and introduces a model‐based deblurring algorithm which is able to handle disocclusion, self‐occlusion and complex object motions.
Abstract: We present a markerless performance capture system that can acquire the motion and the texture of human actors performing fast movements using only commodity hardware. To this end we introduce two novel concepts: First, a staggered surround multi-view recording setup that enables us to perform model-based motion capture on motion-blurred images, and second, a model-based deblurring algorithm which is able to handle disocclusion, self-occlusion and complex object motions. We show that the model-based approach is not only a powerful strategy for tracking but also for deblurring highly complex blur patterns. © 2012 Wiley Periodicals, Inc.
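
Model-based deblurring rests on the standard blur-formation model: a blurred frame is the temporal average of sharp renderings of the moving model over the exposure interval. Schematically (the occlusion and disocclusion handling, which is the hard part, is the paper's contribution):

```latex
\[
B(x) \;=\; \frac{1}{T} \int_{0}^{T} S\big(x,\, \theta(t)\big)\, \mathrm{d}t
\]
```

where B is the observed motion-blurred image, T the exposure time, and S the sharp image rendered with the tracked model in pose θ(t).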

Journal ArticleDOI
TL;DR: This special issue discusses new results, technologies and applications related to the capture, representation, processing, storage, transmission and visualization of 3D geometric and photometric models.
Abstract: This special issue discusses new results, technologies and applications related to the capture, representation, processing, storage, transmission and visualization of 3D geometric and photometric models. The guest editors selected 7 papers from 128 as candidates for this publication. The topics of the accepted papers cover several important problems in the area of 3D vision. The first four papers discuss image registration through keypoint detection and matching. “Physical Scale Keypoints: Matching and Registration for Combined Intensity/Range Images” by Smith et al. presents a method to detect and match keypoints between images for which the depth of each pixel is known. “Interesting Interest Points—A Comparative Study of Interest Point Performance on a Unique Data Set” by Aanaes et al. presents a framework for the evaluation of keypoint detection with ground truth, using the recall rate as the main criterion. The conference version of this paper received the Best Paper Award at 3DPVT 2010. “Imposing Semi-Local Geometric Constraints for Accurate Correspondences Selection in Structure from Motion: A Game-Theoretic Perspective” by Albarelli et al. discusses the prob-

Proceedings ArticleDOI
01 Jan 2012
TL;DR: A generic regularization framework is presented that comprises a prior on natural images as well as an application-specific conditional model based on Gaussian processes, enabling users to easily adapt the algorithm to a specific problem and data set of interest.
Abstract: In this paper, we describe a framework for learning-based image enhancement. At the core of our algorithm lies a generic regularization framework that comprises a prior on natural images, as well as an application-specific conditional model based on Gaussian processes. In contrast to prior learning-based approaches, our algorithm can instantly learn task-specific degradation models from sample images, which enables users to easily adapt the algorithm to a specific problem and data set of interest. This is facilitated by our efficient approximation scheme for large-scale Gaussian processes. We demonstrate the efficiency and effectiveness of our approach by applying it to two example enhancement applications: single-image super-resolution as well as artifact removal in JPEG-encoded images.
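
As a toy illustration of learning a degradation model from sample images, one can regress clean pixel values from degraded patches with an off-the-shelf Gaussian process, as sketched below with scikit-learn. This is only a conceptual stand-in: the paper combines a natural-image prior with an efficient large-scale GP approximation, whereas the direct regressor used here scales poorly.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def extract_patches(img, size=5):
    """All (size x size) patches of a grayscale image, with their centers."""
    r = size // 2
    H, W = img.shape
    patches, centers = [], []
    for y in range(r, H - r):
        for x in range(r, W - r):
            patches.append(img[y - r:y + r + 1, x - r:x + r + 1].ravel())
            centers.append((y, x))
    return np.array(patches), centers

def train_enhancer(degraded, clean, n_samples=500, seed=0):
    """Learn a degraded-patch -> clean-center-pixel map from one image pair."""
    X, centers = extract_patches(degraded)
    y = np.array([clean[c] for c in centers])
    idx = np.random.default_rng(seed).choice(len(X), n_samples, replace=False)
    gp = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(1e-2))
    gp.fit(X[idx], y[idx])
    return gp

def enhance(gp, degraded):
    """Apply the learned model to a new degraded image."""
    X, centers = extract_patches(degraded)
    out = degraded.astype(float).copy()
    for val, (y, x) in zip(gp.predict(X), centers):
        out[y, x] = val
    return out
```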

DOI
01 Jan 2012
TL;DR: The executive summary and abstracts of the talks given during the seminar as well as the outcome of several working groups on specific research topics are presented in this report.
Abstract: This report documents the program and the outcomes of Dagstuhl Seminar 12431 "Time-of-Flight Imaging: Algorithms, Sensors and Applications". The seminar brought together researchers with diverse background from both academia and industry to discuss various aspects of Time-of-Flight imaging and general depth sensors. The executive summary and abstracts of the talks given during the seminar as well as the outcome of several working groups on specific research topics are presented in this report.

Proceedings ArticleDOI
05 Aug 2012
TL;DR: A 3D reconstruction method enabling high-resolution marker-based capture of deforming surfaces that allows all markers to look exactly the same and does not rely on temporal tracking.
Abstract: We present a 3D reconstruction method enabling high resolution marker-based capturing of deforming surfaces. In contrast to previous work, we allow all markers to look exactly the same and do not rely on temporal tracking. This implies considerable advantages: markers can be smaller and are easier to apply since identification is not needed; long-range motions that normally confuse temporal tracking algorithms become feasible. However, the correct matching of markers between camera views is highly ambiguous in such a scenario. To solve this problem we propose an optimization framework that considers multi-view conflicts and local smoothness of the captured surface. An iterative relaxation method based on graph matching is adopted to obtain a consistent, smooth reconstruction for all stereo pairs of a multi-camera system simultaneously. Preliminary experiments show excellent and robust results.
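
Because all markers look identical, per-stereo-pair matching must rely on geometry alone. As a simpler stand-in for the paper's iterative graph-matching relaxation (which additionally enforces multi-view consistency and surface smoothness), a single stereo pair can already be solved by Hungarian assignment on an epipolar-distance cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def epipolar_cost(pts1, pts2, F):
    """Cost matrix: distance of each marker in view 2 to the epipolar line
    induced by each marker in view 1 under fundamental matrix F."""
    ones = np.ones((len(pts1), 1))
    lines = (F @ np.hstack([pts1, ones]).T).T            # (N, 3) lines
    norm = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    lines = lines / np.maximum(norm, 1e-12)
    h2 = np.hstack([pts2, np.ones((len(pts2), 1))])      # (M, 3)
    return np.abs(lines @ h2.T)                          # (N, M) distances

def match_markers(pts1, pts2, F, max_dist=2.0):
    """One-to-one matches whose epipolar distance is below max_dist pixels."""
    cost = epipolar_cost(pts1, pts2, F)
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < max_dist]
```

The paper's relaxation can be viewed as coupling many such pairwise problems and re-weighting them until all stereo pairs of the camera system agree on a smooth surface.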

Patent
11 May 2012
TL;DR: A method is presented for exploring, browsing and navigating, in three dimensions, a sparse and unstructured digital video collection comprising two or more videos and an index of possible visual space and time transition frames between pairs of videos.
Abstract: Method for exploring, browsing and navigating in three dimensions a sparse, unstructured digital video collection comprising two or more videos and an index of possible visual space and time transition frames ("portals") between pairs of videos. The method comprises the steps of displaying at least a part of a first video; receiving a user input; displaying a visual transition (such as a 3D camera sweep, warp, or dissolve) from the first video to a second video, based on the user input; and displaying at least a part of the second video.