
Showing papers by "Christian Theobalt published in 2011"


Proceedings ArticleDOI
06 Nov 2011
TL;DR: In this article, the authors present an efficient and robust pose estimation framework for tracking full-body motions from a single depth image stream using a data-driven hybrid strategy that combines local optimization with global retrieval techniques.
Abstract: In recent years, depth cameras have become a widely available sensor type that captures depth images at real-time frame rates. Even though recent approaches have shown that 3D pose estimation from monocular 2.5D depth images has become feasible, there are still challenging problems due to strong noise in the depth data and self-occlusions in the motions being captured. In this paper, we present an efficient and robust pose estimation framework for tracking full-body motions from a single depth image stream. Following a data-driven hybrid strategy that combines local optimization with global retrieval techniques, we contribute several technical improvements that lead to speed-ups of an order of magnitude compared to previous approaches. In particular, we introduce a variant of Dijkstra's algorithm to efficiently extract pose features from the depth data and describe a novel late-fusion scheme based on an efficiently computable sparse Hausdorff distance to combine local and global pose estimates. Our experiments show that the combination of these techniques facilitates real-time tracking with stable results even for fast and complex motions, making it applicable to a wide range of interactive scenarios.
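The late-fusion step above hinges on a cheaply computable sparse Hausdorff distance between point sets. A minimal sketch of that idea, assuming 3D point clouds; the function name, subset size, and random sampling scheme are hypothetical stand-ins for the paper's actual feature extraction and fusion:

```python
import numpy as np

def sparse_hausdorff(a, b, n_samples=8, seed=0):
    """Approximate the directed Hausdorff distance from point set `a`
    to point set `b` using only a sparse subset of `a`.  This keeps
    per-frame hypothesis scoring cheap, at the cost of computing a
    lower bound of the full distance."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(a), size=min(n_samples, len(a)), replace=False)
    sub = a[idx]                               # sparse subset of a
    # for each sampled point, distance to its nearest neighbour in b
    d = np.linalg.norm(sub[:, None, :] - b[None, :, :], axis=2).min(axis=1)
    return d.max()                             # worst case over the subset

# score two pose hypotheses against the same observed point cloud
observed = np.array([[0.0, 0.0, 2.0], [0.1, 1.0, 2.0], [0.0, 2.0, 2.1]])
hyp_good = observed + 0.05                     # close to the observation
hyp_bad  = observed + 0.80                     # far from the observation
assert sparse_hausdorff(hyp_good, observed) < sparse_hausdorff(hyp_bad, observed)
```

Sampling only a handful of points is what makes the measure usable for ranking many local and global pose candidates per frame.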

254 citations


Proceedings ArticleDOI
06 Nov 2011
TL;DR: A novel continuous and differentiable model-to-image similarity measure is introduced that can be used to estimate the skeletal motion of a human at 5–15 frames per second even for many camera views.
Abstract: We present an approach for modeling the human body by sums of spatial Gaussians (SoG), allowing us to perform fast and high-quality markerless motion capture from multi-view video sequences. The SoG model is equipped with a color model to represent the shape and appearance of the human and can be reconstructed from a sparse set of images. Similar to the human body, we also represent the image domain as SoG that models color consistent image blobs. Based on the SoG models of the image and the human body, we introduce a novel continuous and differentiable model-to-image similarity measure that can be used to estimate the skeletal motion of a human at 5–15 frames per second even for many camera views. In our experiments, we show that our method, which does not rely on silhouettes or training data, offers a good balance between accuracy and computational cost.
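What makes the SoG similarity continuous and differentiable is that the overlap integral of two Gaussians has a closed form. A toy 2D sketch, assuming isotropic blobs and a simple hypothetical colour-similarity weight (the paper's actual weighting differs):

```python
import numpy as np

def gaussian_overlap(mu_i, sig_i, mu_j, sig_j):
    """Closed-form integral over the plane of the product of two
    isotropic 2D Gaussians exp(-|x - mu|^2 / (2 sigma^2)).  The product
    of Gaussians is again a Gaussian, so the overlap is smooth and
    differentiable in all parameters."""
    s2 = sig_i**2 + sig_j**2
    d2 = np.sum((np.asarray(mu_i) - np.asarray(mu_j))**2)
    return 2.0 * np.pi * (sig_i**2 * sig_j**2 / s2) * np.exp(-d2 / (2.0 * s2))

def sog_similarity(model_blobs, image_blobs):
    """Sum pairwise overlaps, weighted by how well blob colours match
    (here: a crude squared-distance weight in RGB)."""
    e = 0.0
    for mu_m, sig_m, col_m in model_blobs:
        for mu_i, sig_i, col_i in image_blobs:
            w = max(0.0, 1.0 - np.sum((np.asarray(col_m) - np.asarray(col_i))**2))
            e += w * gaussian_overlap(mu_m, sig_m, mu_i, sig_i)
    return e

# a model blob scores higher against an image blob at the same position
aligned = sog_similarity([((0, 0), 1.0, (1, 0, 0))], [((0, 0), 1.0, (1, 0, 0))])
shifted = sog_similarity([((0, 0), 1.0, (1, 0, 0))], [((4, 0), 1.0, (1, 0, 0))])
assert aligned > shifted
```

Because every term is differentiable, gradient-based optimisation of the skeletal pose is possible, which is what enables the reported 5–15 fps.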

244 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: A markerless motion capture approach that reconstructs the skeletal motion and detailed time-varying surface geometry of two closely interacting people from multi-view video and provides a reference dataset for multi-person motion capture with ground truth.
Abstract: We present a markerless motion capture approach that reconstructs the skeletal motion and detailed time-varying surface geometry of two closely interacting people from multi-view video. Due to ambiguities in feature-to-person assignments and frequent occlusions, it is not feasible to directly apply single-person capture approaches to the multi-person case. We therefore propose a combined image segmentation and tracking approach to overcome these difficulties. A new probabilistic shape and appearance model is employed to segment the input images and to assign each pixel uniquely to one person. Thereafter, a single-person markerless motion and surface capture approach can be applied to each individual, either one-by-one or in parallel, even under strong occlusions. We demonstrate the performance of our approach on several challenging multi-person motions, including dance and martial arts, and also provide a reference dataset for multi-person motion capture with ground truth.
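The unique pixel-to-person assignment can be illustrated with per-person colour likelihoods. A toy sketch assuming simple per-channel Gaussian colour models; the paper's probabilistic shape-and-appearance model is considerably richer:

```python
import numpy as np

def assign_pixels(pixels, person_models):
    """Assign every pixel to exactly one person by maximum likelihood
    under per-person Gaussian colour models (per-channel mean, std)."""
    pixels = np.asarray(pixels, dtype=float)           # (N, 3) RGB
    scores = []
    for mean, std in person_models:
        diff = (pixels - np.asarray(mean)) / np.asarray(std)
        # Gaussian log-likelihood, up to an additive constant
        scores.append(-0.5 * np.sum(diff**2, axis=1) - np.sum(np.log(std)))
    return np.argmax(np.stack(scores), axis=0)         # unique label per pixel

# hypothetical example: person 0 wears red, person 1 wears blue
models = [((0.9, 0.1, 0.1), (0.1, 0.1, 0.1)),
          ((0.1, 0.1, 0.9), (0.1, 0.1, 0.1))]
labels = assign_pixels([(0.85, 0.10, 0.15), (0.10, 0.20, 0.95)], models)
assert list(labels) == [0, 1]
```

Once each pixel carries a unique person label, each individual's silhouette is available separately, which is what lets the single-person capture pipeline run per person.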

164 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: A new algorithm is presented that combines multi-view stereo and shading-based refinement for high-quality reconstruction of 3D geometry models from images taken under constant but otherwise arbitrary illumination and shows that its final reconstructions rival laser range scans.
Abstract: Multi-view stereo methods reconstruct 3D geometry from images well for sufficiently textured scenes, but often fail to recover high-frequency surface detail, particularly for smoothly shaded surfaces. On the other hand, shape-from-shading methods can recover fine detail from shading variations. Unfortunately, it is non-trivial to apply shape-from-shading alone to multi-view data, and most shading-based estimation methods only succeed under very restricted or controlled illumination. We present a new algorithm that combines multi-view stereo and shading-based refinement for high-quality reconstruction of 3D geometry models from images taken under constant but otherwise arbitrary illumination. We have tested our algorithm on several scenes captured under general and unknown lighting conditions, and we show that our final reconstructions rival laser range scans.
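The shading-based refinement can be illustrated on a single Lambertian patch: given an estimated albedo and light direction, the surface orientation is adjusted until predicted shading matches the observed intensity. A one-variable sketch with hypothetical names; the actual solver refines entire meshes under general, estimated illumination:

```python
import math

def shade(theta, light_theta, albedo):
    """Lambertian shading of a patch whose normal is tilted by `theta`
    radians, lit by a directional light tilted by `light_theta`."""
    return albedo * max(0.0, math.cos(theta - light_theta))

def refine_normal(theta0, observed, light_theta, albedo, steps=200, lr=0.5):
    """Gradient descent on the shading residual (shade - observed)^2.
    Note the ambiguity: tilts of +/- theta about the light direction
    shade identically, so the stereo initialisation `theta0` selects
    which solution is reached.  (Gradient valid while the patch faces
    the light, i.e. before the max(0, .) clamp activates.)"""
    theta = theta0
    for _ in range(steps):
        r = shade(theta, light_theta, albedo) - observed
        grad = -2.0 * r * albedo * math.sin(theta - light_theta)
        theta -= lr * grad
    return theta

observed = shade(0.3, 0.0, 1.0)          # intensity of the true surface
theta = refine_normal(0.8, observed, 0.0, 1.0)
assert abs(shade(theta, 0.0, 1.0) - observed) < 1e-6
```

This also shows why the coarse stereo result matters: shading alone is ambiguous, and the stereo geometry anchors the refinement near the correct solution.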

160 citations


Proceedings ArticleDOI
25 Jul 2011
TL;DR: A warping-based texture synthesis approach that uses the retrieved most-similar database frames to synthesize spatio-temporally coherent target video frames to create realistic videos of people, even if the target motions and camera views are different from the database content.
Abstract: We present a method to synthesize plausible video sequences of humans according to user-defined body motions and viewpoints. We first capture a small database of multi-view video sequences of an actor performing various basic motions. This database needs to be captured only once and serves as the input to our synthesis algorithm. We then apply a marker-less model-based performance capture approach to the entire database to obtain pose and geometry of the actor in each database frame. To create novel video sequences of the actor from the database, a user animates a 3D human skeleton with novel motion and viewpoints. Our technique then synthesizes a realistic video sequence of the actor performing the specified motion based only on the initial database. The first key component of our approach is a new efficient retrieval strategy to find appropriate spatio-temporally coherent database frames from which to synthesize target video frames. The second key component is a warping-based texture synthesis approach that uses the retrieved most-similar database frames to synthesize spatio-temporally coherent target video frames. For instance, this enables us to easily create video sequences of actors performing dangerous stunts without them being placed in harm's way. We show through a variety of result videos and a user study that we can synthesize realistic videos of people, even if the target motions and camera views are different from the database content.
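The retrieval strategy must balance per-frame pose similarity against temporal coherence of the picked database frames. A toy dynamic-programming sketch, with scalar "poses" and a hypothetical jump penalty standing in for the paper's actual similarity and coherence terms:

```python
def retrieve_frames(target_poses, db_poses, jump_cost=1.0):
    """For each target frame pick a database frame matching the target
    pose, paying a penalty whenever consecutive picks are not
    consecutive in the database (spatio-temporal coherence)."""
    n, m = len(target_poses), len(db_poses)
    INF = float("inf")
    cost = [[0.0] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    for j in range(m):
        cost[0][j] = abs(target_poses[0] - db_poses[j])
    for t in range(1, n):
        for j in range(m):
            match = abs(target_poses[t] - db_poses[j])
            best, best_k = INF, 0
            for k in range(m):
                trans = 0.0 if j == k + 1 else jump_cost
                if cost[t - 1][k] + trans < best:
                    best, best_k = cost[t - 1][k] + trans, k
            cost[t][j] = match + best
            back[t][j] = best_k
    j = min(range(m), key=lambda j: cost[n - 1][j])   # cheapest final frame
    path = [j]
    for t in range(n - 1, 0, -1):                     # backtrack
        j = back[t][j]
        path.append(j)
    return path[::-1]

# target motion sweeps 0..3; the database contains that motion twice
db = [0, 1, 2, 3, 0, 1, 2, 3]
assert retrieve_frames([0, 1, 2, 3], db) in ([0, 1, 2, 3], [4, 5, 6, 7])
```

Favouring runs of consecutive database frames is what keeps the warped texture temporally stable rather than flickering between unrelated source frames.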

126 citations


Proceedings ArticleDOI
06 Nov 2011
TL;DR: This work presents an approach to add true fine-scale spatio-temporal shape detail to dynamic scene geometry captured from multi-view video footage and uses weak temporal priors on lighting, albedo and geometry which improve reconstruction quality yet allow for temporal variations in the data.
Abstract: We present an approach to add true fine-scale spatio-temporal shape detail to dynamic scene geometry captured from multi-view video footage. Our approach exploits shading information to recover the millimeter-scale surface structure, but in contrast to related approaches succeeds under general unconstrained lighting conditions. Our method starts off from a set of multi-view video frames and an initial series of reconstructed coarse 3D meshes that lack any surface detail. In a spatio-temporal maximum a posteriori probability (MAP) inference framework, our approach first estimates the incident illumination and the spatially-varying albedo map on the mesh surface for every time instant. Thereafter, albedo and illumination are used to estimate the true geometric detail visible in the images and add it to the coarse reconstructions. The MAP framework uses weak temporal priors on lighting, albedo and geometry which improve reconstruction quality yet allow for temporal variations in the data.
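A weak temporal prior of the kind used in the MAP framework can be sketched on a scalar per-frame quantity: a data-fidelity term plus a penalty on frame-to-frame change. The weight and the coordinate-descent solver here are hypothetical illustrations, not the paper's inference scheme:

```python
import numpy as np

def smooth_map_estimate(observations, prior_weight, iters=500):
    """MAP estimate of a per-frame quantity (e.g. an illumination
    coefficient) under a Gaussian data term and a weak temporal prior:
        E(x) = sum_t (x_t - y_t)^2 + w * sum_t (x_t - x_{t-1})^2
    Solved by coordinate descent; each update is the exact minimiser
    of E in that single variable."""
    y = np.asarray(observations, dtype=float)
    x = y.copy()
    for _ in range(iters):
        for t in range(len(x)):
            num, den = y[t], 1.0
            if t > 0:
                num += prior_weight * x[t - 1]; den += prior_weight
            if t < len(x) - 1:
                num += prior_weight * x[t + 1]; den += prior_weight
            x[t] = num / den
    return x

noisy = [1.0, 5.0, 1.0, 1.0]            # one outlier frame
smooth = smooth_map_estimate(noisy, prior_weight=2.0)
assert smooth[1] < noisy[1]             # the weak prior pulls the outlier in
```

Because the prior is weak (finite `prior_weight`) it damps spurious per-frame jumps without forbidding genuine temporal variation, which is exactly the trade-off the abstract describes.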

113 citations


01 Jan 2011
TL;DR: A new approach for video inpainting that can deal with complex scenes with dynamic backgrounds and many non-periodically moving occluding scene elements, and is built on the idea that a spatio-temporal hole created by a removed scene element can be filled by copying information from other space-time locations in the video.
Abstract: Removal of dynamic scene elements from video is an extremely challenging problem that even movie professionals often solve through days of manual frame-by-frame editing. The disoccluded regions in the video have to be inpainted in a coherent way, even if originally occluded objects or background are dynamic. To make this problem easier, we propose a new approach for video inpainting that can deal with complex scenes with dynamic backgrounds and many non-periodically moving occluding scene elements. It is built on the idea that a spatio-temporal hole created by a removed scene element can be filled by copying information from other space-time locations in the video, where objects and background are unoccluded. Inpainting is performed by solving a combinatorial optimization problem that searches for the optimal pattern of pixel shifts. Solving this problem naively, even on short videos, quickly becomes infeasible. The primary contributions of this work are a new energy functional with desirable convergence properties, an efficient hierarchical solution strategy, and an effective search space reduction strategy that restricts potential pixel shifts to regions around tracked objects in the scene. A simple interface enables the user to optionally support the algorithm in marking and tracking dynamic objects. Our approach can efficiently inpaint holes even in HD videos with many occlusions, and requires little user input.
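The core idea, copying from other space-time locations by optimising pixel shifts, can be sketched in 1D: choose a shift whose copied content agrees with the known samples bordering the hole. Names and the candidate set are hypothetical; the actual method optimises a full shift field hierarchically:

```python
import numpy as np

def inpaint_by_shift(signal, hole, candidate_shifts):
    """Fill a hole in a 1D signal by copying from a shifted location.
    The shift is chosen so the copied values best agree with the known
    samples just outside the hole -- a toy version of optimising a
    pattern of pixel shifts."""
    s = np.asarray(signal, dtype=float)
    lo, hi = hole                                   # hole = [lo, hi)
    border = [lo - 1, hi]                           # known neighbours
    best_shift, best_cost = None, float("inf")
    for d in candidate_shifts:
        if lo + d < 0 or hi + d > len(s):
            continue                                # shift leaves the video
        # consistency of the shifted content with the hole border
        cost = sum((s[b + d] - s[b])**2 for b in border)
        if cost < best_cost:
            best_shift, best_cost = d, cost
    out = s.copy()
    out[lo:hi] = s[lo + best_shift:hi + best_shift]
    return out, best_shift

# a periodic background with period 4: the hole is best filled from
# one period away rather than from an arbitrary offset
sig = [0, 1, 2, 3] * 4
filled, shift = inpaint_by_shift(sig, (5, 7), [-4, -3, 4, 5])
assert shift in (-4, 4)
```

Restricting `candidate_shifts` mirrors the paper's search-space reduction: without it, every pixel could in principle shift anywhere in the video, which is what makes the naive problem infeasible.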

14 citations


Patent
29 Nov 2011
TL;DR: In this article, a computer-implemented method for tracking and reshaping a human-shaped figure in a digital video is presented, comprising the steps: acquiring a body model of the figure from the digital video, adapting a shape of the body model, modifying frames of the video based on the adapted body model, and outputting the video.
Abstract: The invention concerns a computer-implemented method for tracking and reshaping a human-shaped figure in a digital video comprising the steps: acquiring a body model of the figure from the digital video, adapting a shape of the body model, modifying frames of the digital video based on the adapted body model, and outputting the digital video.

8 citations


01 Jan 2011
TL;DR: The proposed algorithm can instantly learn task-specific degradation models from sample images which enables users to easily adapt the algorithm to a specific problem and data set of interest, facilitated by the efficient approximation scheme of large-scale Gaussian processes.
Abstract: Many computer vision and computational photography applications essentially solve an image enhancement problem. The image has been deteriorated by a specific noise process, such as aberrations from camera optics and compression artifacts, that we would like to remove. We describe a framework for learning-based image enhancement. At the core of our algorithm lies a generic regularization framework that comprises a prior on natural images, as well as an application-specific conditional model based on Gaussian processes. In contrast to prior learning-based approaches, our algorithm can instantly learn task-specific degradation models from sample images which enables users to easily adapt the algorithm to a specific problem and data set of interest. This is facilitated by our efficient approximation scheme of large-scale Gaussian processes. We demonstrate the efficiency and effectiveness of our approach by applying it to example enhancement applications including single-image super-resolution, as well as artifact removal in JPEG- and JPEG 2000-encoded images.
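Learning a task-specific degradation model from sample pairs can be sketched with plain GP regression on scalar values. This toy maps degraded to clean values; the paper works on image patches and additionally uses a natural-image prior and a large-scale GP approximation, none of which appears here:

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between 1D input arrays."""
    d = np.subtract.outer(a, b)
    return np.exp(-0.5 * (d / length)**2)

def gp_enhance(train_degraded, train_clean, test_degraded, noise=1e-3):
    """Gaussian-process regression from degraded to clean values,
    learned directly from example pairs.  `noise` regularises the
    kernel matrix and models observation noise."""
    x = np.asarray(train_degraded, float)
    y = np.asarray(train_clean, float)
    K = rbf(x, x) + noise * np.eye(len(x))
    alpha = np.linalg.solve(K, y)              # standard GP posterior mean
    return rbf(np.asarray(test_degraded, float), x) @ alpha

# hypothetical "degradation": values were halved; the GP learns to
# undo it purely from sample pairs, with no hand-designed model
deg = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
clean = [2 * v for v in deg]
pred = gp_enhance(deg, clean, [0.25])
assert abs(pred[0] - 0.5) < 0.1
```

The point of the sketch is the "instant learning" property: swapping in a different set of sample pairs retrains the degradation model with no algorithmic changes.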

5 citations




01 Jan 2011
TL;DR: A system that analyzes collections of unstructured but related video data to create a Videoscape, a data structure that enables interactive exploration of video collections by visually navigating – spatially and/or temporally – between different clips, is proposed.
Abstract: The abundance of mobile devices and digital cameras with video capture makes it easy to obtain large collections of video clips that contain the same location, environment, or event. However, such an unstructured collection is difficult to comprehend and explore. We propose a system that analyzes collections of unstructured but related video data to create a Videoscape: a data structure that enables interactive exploration of video collections by visually navigating – spatially and/or temporally – between different clips. We automatically identify transition opportunities, or portals. From these portals, we construct the Videoscape, a graph whose edges are video clips and whose nodes are portals between clips. Now structured, the videos can be interactively explored by walking the graph or by geographic map. Given this system, we gauge preference for different video transition styles in a user study, and generate heuristics that automatically choose an appropriate transition style. We evaluate our system using three further user studies, which allows us to conclude that Videoscapes provides significant benefits over related methods. Our system leads to previously unseen ways of interactive spatio-temporal exploration of casually captured videos, and we demonstrate this on several video collections.
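The Videoscape's graph structure, with clips as edges and portals as nodes, directly supports route planning between places. A minimal sketch with hypothetical portal and clip names; the assumption that a clip can be traversed in either direction is ours:

```python
from collections import deque

def build_videoscape(clips):
    """Videoscape as an adjacency map: nodes are portals, each edge is
    a video clip connecting two portals.  `clips` is a list of
    (clip_name, portal_a, portal_b) tuples."""
    adj = {}
    for name, a, b in clips:
        adj.setdefault(a, []).append((b, name))
        adj.setdefault(b, []).append((a, name))  # assume clips play either way
    return adj

def plan_route(adj, start, goal):
    """Breadth-first search returning the shortest sequence of clips
    to play in order to travel from one portal to another."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        node, clips_so_far = queue.popleft()
        if node == goal:
            return clips_so_far
        for nxt, clip in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, clips_so_far + [clip]))
    return None                                   # portals not connected

# hypothetical portals and clips around a town square
scape = build_videoscape([("clipA", "fountain", "arch"),
                          ("clipB", "arch", "market"),
                          ("clipC", "fountain", "tower")])
assert plan_route(scape, "fountain", "market") == ["clipA", "clipB"]
```

Interactive exploration then reduces to graph traversal: each step of the route is a clip to play, and each node visited is a portal at which the chosen transition style is rendered.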