Journal ArticleDOI

Reconstructing detailed dynamic face geometry from monocular video

01 Nov 2013 - Vol. 32, Iss. 6, pp. 158
TL;DR: This work presents a method that captures detailed, dynamic, spatio-temporally coherent 3D face geometry from monocular video without markers; it works under uncontrolled lighting and reconstructs expressive motion including high-frequency face detail such as folds and laugh lines.
Abstract: Detailed facial performance geometry can be reconstructed using dense camera and light setups in controlled studios. However, a wide range of important applications cannot employ these approaches, including all movie productions shot from a single principal camera. For post-production, these require dynamic monocular face capture for appearance modification. We present a new method for capturing face geometry from monocular video. Our approach captures detailed, dynamic, spatio-temporally coherent 3D face geometry without the need for markers. It works under uncontrolled lighting, and it successfully reconstructs expressive motion including high-frequency face detail such as folds and laugh lines. After simple manual initialization, the capturing process is fully automatic, which makes it versatile, lightweight and easy-to-deploy. Our approach tracks accurate sparse 2D features between automatically selected key frames to animate a parametric blend shape model, which is further refined in pose, expression and shape by temporally coherent optical flow and photometric stereo. We demonstrate performance capture results for long and complex face sequences captured indoors and outdoors, and we exemplify the relevance of our approach as an enabling technology for model-based face editing in movies and video, such as adding new facial textures, as well as a step towards enabling everyone to do facial performance capture with a single affordable camera.
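The abstract describes a pipeline in which tracked sparse 2D features first drive a parametric blendshape model, which is then refined with temporally coherent optical flow and shading cues. As a rough illustration of the first step only, here is a minimal sketch of fitting blendshape weights to 2D landmark observations, assuming a fixed head pose, a weak-perspective camera, and synthetic data; all names are illustrative and this is not the authors' implementation, which jointly refines pose, expression, and shape.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic blendshape rig: neutral landmark positions plus K additive deltas.
L, K = 30, 6                                      # tracked landmarks, blendshapes
neutral = rng.random((L, 3))
deltas = rng.standard_normal((K, L, 3)) * 0.05    # per-blendshape 3D offsets

def project(points3d, scale=100.0):
    """Fixed weak-perspective camera: scale the points and drop the depth axis."""
    return scale * points3d[:, :2]

# "Observed" 2D features come from some unknown expression weights.
w_true = rng.random(K)
observed2d = project(neutral + np.tensordot(w_true, deltas, axes=1))

# With pose and camera fixed, the projected landmarks are linear in the weights,
# so fitting the expression reduces to an ordinary least-squares problem.
A = np.stack([project(d) for d in deltas], axis=-1).reshape(-1, K)   # (2L, K)
b = (observed2d - project(neutral)).reshape(-1)
w_est, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(w_est, w_true))   # True: exact recovery in this noiseless toy
```

Because the projection is linear once the pose is fixed, the weights fall out of a single least-squares solve; the paper's energy additionally optimizes pose and adds optical-flow and photometric-stereo terms.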


Citations
Proceedings ArticleDOI
27 Jun 2016
TL;DR: A novel approach for real-time facial reenactment of a monocular target video sequence (e.g., a YouTube video) that addresses the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling and re-renders the manipulated output video in a photo-realistic fashion.
Abstract: We present a novel approach for real-time facial reenactment of a monocular target video sequence (e.g., Youtube video). The source sequence is also a monocular video stream, captured live with a commodity webcam. Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion. To this end, we first address the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling. At run time, we track facial expressions of both source and target video using a dense photometric consistency measure. Reenactment is then achieved by fast and efficient deformation transfer between source and target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. Finally, we convincingly re-render the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination. We demonstrate our method in a live setup, where Youtube videos are reenacted in real time.
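The run-time tracking described above is driven by a dense photometric consistency measure between the synthesized face and the video frame. Below is a minimal sketch of such a residual, simplified to sampling the frame at projected vertex positions on synthetic data rather than full rasterization; function and variable names are illustrative, not from the paper.

```python
import numpy as np

def photometric_residual(frame, vertex_colors, pixel_coords):
    """Sum of squared color differences between model-predicted vertex colors
    and the video frame sampled at the vertices' projected pixel positions
    (a simplified, per-vertex stand-in for a dense photometric term)."""
    h, w, _ = frame.shape
    xs = np.clip(np.rint(pixel_coords[:, 0]).astype(int), 0, w - 1)
    ys = np.clip(np.rint(pixel_coords[:, 1]).astype(int), 0, h - 1)
    diff = frame[ys, xs, :] - vertex_colors
    return float((diff ** 2).sum())

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
frame = rng.random((120, 160, 3))                 # RGB frame, values in [0, 1]
vertex_colors = rng.random((500, 3))              # colors predicted by the face model
pixel_coords = rng.random((500, 2)) * [160, 120]  # projected vertex positions (x, y)
print(photometric_residual(frame, vertex_colors, pixel_coords))
```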

1,011 citations


Cites methods from "Reconstructing detailed dynamic fac..."

  • ...Offline RGB Performance Capture Recent offline performance capture techniques approach the hard monocular reconstruction problem by fitting a blendshape [15] or a multi-linear face [26] model to the input video sequence....


Journal ArticleDOI
TL;DR: FLAME (Faces Learned with an Articulated Model and Expressions) is low-dimensional yet more expressive than the FaceWarehouse model and the Basel Face Model, and is compared against both by fitting each model to static 3D scans and 4D sequences with the same optimization method.
Abstract: The field of 3D face modeling has a large gap between high-end and low-end methods. At the high end, the best facial animation is indistinguishable from real humans, but this comes at the cost of extensive manual labor. At the low end, face capture from consumer depth sensors relies on 3D face models that are not expressive enough to capture the variability in natural facial shape and expression. We seek a middle ground by learning a facial model from thousands of accurately aligned 3D scans. Our FLAME model (Faces Learned with an Articulated Model and Expressions) is designed to work with existing graphics software and be easy to fit to data. FLAME uses a linear shape space trained from 3800 scans of human heads. FLAME combines this linear shape space with an articulated jaw, neck, and eyeballs, pose-dependent corrective blendshapes, and additional global expression blendshapes. The pose and expression dependent articulations are learned from 4D face sequences in the D3DFACS dataset along with additional 4D sequences. We accurately register a template mesh to the scan sequences and make the D3DFACS registrations available for research purposes. In total the model is trained from over 33, 000 scans. FLAME is low-dimensional but more expressive than the FaceWarehouse model and the Basel Face Model. We compare FLAME to these models by fitting them to static 3D scans and 4D sequences using the same optimization method. FLAME is significantly more accurate and is available for research purposes (http://flame.is.tue.mpg.de).
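FLAME combines a linear shape space, global expression blendshapes, and articulated joints. The sketch below evaluates a FLAME-like model in a highly simplified form, with additive shape and expression offsets and a single skinned jaw joint on synthetic data; it is not the released FLAME code and omits the neck and eyeball joints and the pose-dependent corrective blendshapes.

```python
import numpy as np

def rodrigues(axis_angle):
    """Axis-angle vector to 3x3 rotation matrix."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-12:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def flame_like(template, shape_dirs, expr_dirs, beta, psi, jaw_pose, jaw_pivot, jaw_weights):
    """Evaluate a FLAME-like model: add linear shape and expression offsets,
    then blend each vertex between its rest position and a jaw rotation about
    a pivot, using per-vertex skinning weights (simplified LBS)."""
    v = template + shape_dirs @ beta + expr_dirs @ psi      # (N, 3)
    rotated = (v - jaw_pivot) @ rodrigues(jaw_pose).T + jaw_pivot
    w = jaw_weights[:, None]
    return (1 - w) * v + w * rotated

# Toy usage with synthetic data.
rng = np.random.default_rng(1)
N, S, E = 1000, 10, 5        # vertices, shape components, expression components
verts = flame_like(
    template=rng.random((N, 3)),
    shape_dirs=rng.standard_normal((N, 3, S)) * 0.01,
    expr_dirs=rng.standard_normal((N, 3, E)) * 0.01,
    beta=rng.standard_normal(S),
    psi=rng.standard_normal(E),
    jaw_pose=np.array([0.2, 0.0, 0.0]),   # small jaw opening (axis-angle, radians)
    jaw_pivot=np.array([0.5, 0.4, 0.5]),
    jaw_weights=rng.random(N),
)
print(verts.shape)   # (1000, 3)
```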

629 citations

Posted Content
TL;DR: Face2Face addresses the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling and convincingly re-renders the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination.
Abstract: We present Face2Face, a novel approach for real-time facial reenactment of a monocular target video sequence (e.g., Youtube video). The source sequence is also a monocular video stream, captured live with a commodity webcam. Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion. To this end, we first address the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling. At run time, we track facial expressions of both source and target video using a dense photometric consistency measure. Reenactment is then achieved by fast and efficient deformation transfer between source and target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. Finally, we convincingly re-render the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination. We demonstrate our method in a live setup, where Youtube videos are reenacted in real time.

448 citations


Cites methods from "Reconstructing detailed dynamic fac..."

  • ...Offline RGB Performance Capture Recent offline performance capture techniques approach the hard monocular reconstruction problem by fitting a blendshape [15] or a multi-linear face [26] model to the input video sequence....


Journal ArticleDOI
27 Jul 2014
TL;DR: A combined hardware and software solution for markerless, real-time reconstruction of non-rigidly deforming physical objects of arbitrary shape, an order of magnitude faster than state-of-the-art methods while matching the quality and robustness of many offline algorithms.
Abstract: We present a combined hardware and software solution for markerless reconstruction of non-rigidly deforming physical objects with arbitrary shape in real-time. Our system uses a single self-contained stereo camera unit built from off-the-shelf components and consumer graphics hardware to generate spatio-temporally coherent 3D models at 30 Hz. A new stereo matching algorithm estimates real-time RGB-D data. We start by scanning a smooth template model of the subject as they move rigidly. This geometric surface prior avoids strong scene assumptions, such as a kinematic human skeleton or a parametric shape model. Next, a novel GPU pipeline performs non-rigid registration of live RGB-D data to the smooth template using an extended non-linear as-rigid-as-possible (ARAP) framework. High-frequency details are fused onto the final mesh using a linear deformation model. The system is an order of magnitude faster than state-of-the-art methods, while matching the quality and robustness of many offline algorithms. We show precise real-time reconstructions of diverse scenes, including: large deformations of users' heads, hands, and upper bodies; fine-scale wrinkles and folds of skin and clothing; and non-rigid interactions performed by users on flexible objects such as toys. We demonstrate how acquired models can be used for many interactive scenarios, including re-texturing, online performance capture and preview, and real-time shape and motion re-targeting.
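The non-rigid registration in this system extends an as-rigid-as-possible (ARAP) framework. As a rough illustration, the sketch below evaluates a basic ARAP energy on a toy mesh, estimating the best per-vertex rotation in closed form via SVD; it shows the energy only, not the paper's GPU registration pipeline or its weighting.

```python
import numpy as np

def arap_energy(rest, deformed, edges):
    """Basic as-rigid-as-possible (ARAP) energy: for each vertex, estimate the
    best-fitting local rotation via SVD of the edge-vector covariance, then sum
    squared deviations of deformed edges from the rigidly rotated rest edges."""
    n = rest.shape[0]
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    energy = 0.0
    for i in range(n):
        if not nbrs[i]:
            continue
        P = rest[nbrs[i]] - rest[i]            # rest-pose edge vectors
        Q = deformed[nbrs[i]] - deformed[i]    # deformed edge vectors
        U, _, Vt = np.linalg.svd(P.T @ Q)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:               # avoid a reflection solution
            Vt[-1] *= -1
            R = Vt.T @ U.T
        energy += ((Q - P @ R.T) ** 2).sum()
    return energy

# Toy check: a rigidly rotated chain has (near) zero ARAP energy, a stretch does not.
rest = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0], [2, 1, 0]])
edges = [(0, 1), (1, 2), (2, 3)]
a = 0.5
Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
print(arap_energy(rest, rest @ Rz.T, edges))             # ~0
print(arap_energy(rest, rest * [1.5, 1.0, 1.0], edges))  # > 0
```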

444 citations

Journal ArticleDOI
27 Jul 2014
TL;DR: This work presents a fully automatic approach to real-time facial tracking and animation with a single video camera; it learns a generic regressor from public image datasets to infer accurate 2D facial landmarks as well as the 3D facial shape from 2D video frames.
Abstract: We present a fully automatic approach to real-time facial tracking and animation with a single video camera. Our approach does not need any calibration for each individual user. It learns a generic regressor from public image datasets, which can be applied to any user and arbitrary video cameras to infer accurate 2D facial landmarks as well as the 3D facial shape from 2D video frames. The inferred 2D landmarks are then used to adapt the camera matrix and the user identity to better match the facial expressions of the current user. The regression and adaptation are performed in an alternating manner. With more and more facial expressions observed in the video, the whole process converges quickly with accurate facial tracking and animation. In experiments, our approach demonstrates a level of robustness and accuracy on par with state-of-the-art techniques that require a time-consuming calibration step for each individual user, while running at 28 fps on average. We consider our approach to be an attractive solution for wide deployment in consumer-level applications.
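The approach above alternates landmark regression with adaptation of the camera matrix and user identity. The sketch below shows only the basic structure of cascaded shape regression on synthetic data: sample shape-indexed features around the current landmark estimates, apply a linear regressor, and repeat. The regressors here are random placeholders; in the cited work they are learned from public image datasets.

```python
import numpy as np

def cascaded_regression(image, init_shape, regressors, patch=3):
    """Cascaded shape regression: at each stage, sample shape-indexed features
    (here: mean intensity around each current landmark) and apply a linear
    regressor to update the landmark positions."""
    h, w = image.shape
    shape = init_shape.copy()                  # (L, 2) landmark positions
    for W_stage in regressors:                 # one linear regressor per stage
        feats = []
        for x, y in shape:
            x0 = int(np.clip(round(x), 0, w - 1))
            y0 = int(np.clip(round(y), 0, h - 1))
            xs = slice(max(x0 - patch, 0), min(x0 + patch + 1, w))
            ys = slice(max(y0 - patch, 0), min(y0 + patch + 1, h))
            feats.append(image[ys, xs].mean())
        shape = shape + (W_stage @ np.asarray(feats)).reshape(-1, 2)
    return shape

# Toy usage: 5 landmarks, 3 stages with random (untrained) regressors.
rng = np.random.default_rng(2)
image = rng.random((100, 100))
init = rng.random((5, 2)) * 100
stages = [rng.standard_normal((10, 5)) * 0.1 for _ in range(3)]
print(cascaded_regression(image, init, stages))
```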

416 citations

References
Journal ArticleDOI
Abstract: We describe a new method of matching statistical models of appearance to images. A set of model parameters control modes of shape and gray-level variation learned from a training set. We construct an efficient iterative matching algorithm by learning the relationship between perturbations in the model parameters and the induced image errors.

6,200 citations

Journal ArticleDOI
TL;DR: This paper presents a novel and efficient facial image representation based on local binary pattern (LBP) texture features that is assessed in the face recognition problem under different challenges.
Abstract: This paper presents a novel and efficient facial image representation based on local binary pattern (LBP) texture features. The face image is divided into several regions from which the LBP feature distributions are extracted and concatenated into an enhanced feature vector to be used as a face descriptor. The performance of the proposed method is assessed in the face recognition problem under different challenges. Other applications and several extensions are also discussed.

5,563 citations


"Reconstructing detailed dynamic fac..." refers background in this paper

  • ...LBPs are very effective for expression matching and identification tasks [Ahonen et al. 2006] and encode the relative brightness around a pixel by assigning a binary value to each neighboring pixel, depending on whether it is brighter or not....

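As the excerpt notes, an LBP code assigns each neighboring pixel a bit depending on whether it is brighter than the center. A minimal sketch for a single 3x3 patch follows; the neighbor ordering and the >= convention are one common choice, not necessarily the exact variant used in the paper.

```python
import numpy as np

def lbp_code(patch):
    """8-bit local binary pattern for the center of a 3x3 patch: each neighbor
    contributes a 1 if it is at least as bright as the center pixel."""
    center = patch[1, 1]
    # Neighbors in clockwise order starting from the top-left pixel.
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    return sum((1 if n >= center else 0) << i for i, n in enumerate(neighbors))

patch = np.array([[ 90, 120,  60],
                  [ 80, 100, 140],
                  [200,  95,  40]])
print(lbp_code(patch))   # 74: bits set for the neighbors at least as bright as 100
```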

Book ChapterDOI
02 Jun 1998
TL;DR: A novel method of interpreting images using an Active Appearance Model (AAM), a statistical model of the shape and grey-level appearance of the object of interest which can generalise to almost any valid example.
Abstract: We demonstrate a novel method of interpreting images using an Active Appearance Model (AAM). An AAM contains a statistical model of the shape and grey-level appearance of the object of interest which can generalise to almost any valid example. During a training phase we learn the relationship between model parameter displacements and the residual errors induced between a training image and a synthesised model example. To match to an image we measure the current residuals and use the model to predict changes to the current parameters, leading to a better fit. A good overall match is obtained in a few iterations, even from poor starting estimates. We describe the technique in detail and give results of quantitative performance tests. We anticipate that the AAM algorithm will be an important method for locating deformable objects in many applications.
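The AAM search can be illustrated with a toy linear appearance model: perturb the parameters during training, regress parameter displacements from the induced residuals, then iterate the predicted correction at match time. The sketch below uses entirely synthetic data and a plain least-squares regression; it mirrors the idea of the AAM update, not the paper's actual shape-and-texture model.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy linear appearance model: image(p) = a0 + J @ p. This stands in for the
# AAM's combined shape and grey-level model; everything here is synthetic.
n_pixels, n_params = 60, 4
a0 = rng.random(n_pixels)
J = rng.standard_normal((n_pixels, n_params))

def render(p):
    return a0 + J @ p

# Training phase: perturb the parameters, record the induced image residuals,
# and learn a linear map from residuals back to parameter displacements.
target = render(np.zeros(n_params))                      # image to be matched
disps = rng.standard_normal((200, n_params)) * 0.5       # known perturbations
residuals = np.stack([render(d) - target for d in disps])
R, *_ = np.linalg.lstsq(residuals, disps, rcond=None)    # residual -> displacement

# Matching phase: from a poor starting estimate, iterate the predicted correction.
p = rng.standard_normal(n_params)
for _ in range(5):
    p = p - (render(p) - target) @ R
print(np.abs(p).max())   # ~0: the fit converges (in one step for this linear toy)
```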

3,905 citations

Journal ArticleDOI
TL;DR: An algorithm for finding the least-squares solution of R and T, which is based on the singular value decomposition (SVD) of a 3 × 3 matrix, is presented.
Abstract: Two point sets {p_i} and {p'_i}, i = 1, 2, ..., N, are related by p'_i = R p_i + T + N_i, where R is a rotation matrix, T a translation vector, and N_i a noise vector. Given {p_i} and {p'_i}, we present an algorithm for finding the least-squares solution of R and T, which is based on the singular value decomposition (SVD) of a 3 × 3 matrix. This new algorithm is compared to two earlier algorithms with respect to computer time requirements.

3,862 citations


"Reconstructing detailed dynamic fac..." refers background in this paper

  • ...Finding the least squares solution of {s,R, t} to expression (6) for a constant set of blending weights is equivalent to aligning two 3D point sets, which can be solved in closed form by SVD [Arun et al. 1987]....


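The closed-form alignment referenced in the excerpt is the SVD-based solution of Arun et al. [1987]: center both point sets, take the SVD of their cross-covariance, and read off the rotation and translation. A minimal sketch follows; the paper additionally solves for a scale factor s, which is omitted here.

```python
import numpy as np

def align_point_sets(P, Q):
    """Closed-form least-squares rigid alignment (Arun et al. 1987): find R, t
    minimizing sum ||R @ p_i + t - q_i||^2 via SVD of the cross-covariance of
    the centered point sets."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - p_mean, Q - q_mean
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = q_mean - R @ p_mean
    return R, t

# Toy check: recover a known rotation and translation from noiseless points.
rng = np.random.default_rng(4)
P = rng.random((100, 3))
a = 0.7
R_true = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
t_true = np.array([0.3, -0.2, 1.0])
Q = P @ R_true.T + t_true
R_est, t_est = align_point_sets(P, Q)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))   # True True
```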

Journal Article
John Platt
TL;DR: The sequential minimal optimization (SMO) algorithm as mentioned in this paper uses a series of smallest possible QP problems to solve a large QP problem, which avoids using a time-consuming numerical QP optimization as an inner loop.
Abstract: This paper proposes a new algorithm for training support vector machines: Sequential Minimal Optimization, or SMO. Training a support vector machine requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of smallest possible QP problems. These small QP problems are solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Because matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size for various test problems, while the standard chunking SVM algorithm scales somewhere between linear and cubic in the training set size. SMO’s computation time is dominated by SVM evaluation, hence SMO is fastest for linear SVMs and sparse data sets. On realworld sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm.

2,856 citations


"Reconstructing detailed dynamic fac..." refers methods in this paper

  • ...This can be solved efficiently by methods based on sequential minimal optimization [Platt 1998]....

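The core of SMO is the analytic solution of its smallest possible QP: jointly updating two Lagrange multipliers while keeping the linear constraint satisfied. A minimal sketch of that two-multiplier step follows, based on Platt's update formulas; the working-set heuristics, error cache, and threshold update of the full algorithm are omitted.

```python
import numpy as np

def smo_pair_update(a1, a2, y1, y2, E1, E2, K11, K12, K22, C):
    """Analytic solution of SMO's smallest possible QP: jointly optimize two
    Lagrange multipliers while keeping sum_i alpha_i * y_i fixed (Platt 1998).
    E1 and E2 are the current prediction errors f(x_i) - y_i."""
    # Feasible segment for alpha2 from the box constraints and the equality constraint.
    if y1 != y2:
        lo, hi = max(0.0, a2 - a1), min(C, C + a2 - a1)
    else:
        lo, hi = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    eta = K11 + K22 - 2.0 * K12            # curvature along the constraint line
    if eta <= 0 or lo >= hi:
        return a1, a2                       # skip degenerate pairs, as Platt's heuristics do
    a2_new = float(np.clip(a2 + y2 * (E1 - E2) / eta, lo, hi))
    a1_new = a1 + y1 * y2 * (a2 - a2_new)   # keeps the linear constraint satisfied
    return a1_new, a2_new

# Toy usage with made-up kernel values and errors.
print(smo_pair_update(a1=0.2, a2=0.5, y1=1, y2=-1, E1=0.3, E2=-0.1,
                      K11=1.0, K12=0.2, K22=1.0, C=1.0))
```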