
Showing papers by "Hao Li" published in 2016


Posted Content
TL;DR: The authors proposed a multi-scale neural patch synthesis approach based on joint optimization of image content and texture constraints, which not only preserves contextual structures but also produces high-frequency details by matching and adapting patches with the most similar mid-layer feature correlations of a deep classification network.
Abstract: Recent advances in deep learning have shown exciting promise in filling large holes in natural images with semantically plausible and context-aware details, impacting fundamental image manipulation tasks such as object removal. While these learning-based methods are significantly more effective in capturing high-level features than prior techniques, they can only handle very low-resolution inputs due to memory limitations and difficulty in training. Even for slightly larger images, the inpainted regions appear blurry, and unpleasant boundaries become visible. We propose a multi-scale neural patch synthesis approach based on joint optimization of image content and texture constraints, which not only preserves contextual structures but also produces high-frequency details by matching and adapting patches with the most similar mid-layer feature correlations of a deep classification network. We evaluate our method on the ImageNet and Paris Streetview datasets and achieve state-of-the-art inpainting accuracy. We show our approach produces sharper and more coherent results than prior methods, especially for high-resolution images.
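A minimal sketch, not the authors' released code, of what a joint content-and-texture objective of this kind can look like. It assumes a fixed mid-layer feature extractor phi (e.g. a VGG layer), a coarse low-resolution prediction coarse, and a binary hole mask at feature resolution; all names and weights below are illustrative assumptions.

    # Illustrative only: content term keeps the estimate near the coarse prediction;
    # texture term matches hole patches to their nearest known patches in feature space.
    import torch
    import torch.nn.functional as F

    def extract_patches(feat, size=3):
        # feat: (1, C, H, W) -> (N, C*size*size) flattened patches
        return F.unfold(feat, kernel_size=size).squeeze(0).t()

    def texture_loss(feat, hole_mask, size=3):
        """Match each patch inside the hole to its nearest patch outside the hole.
        hole_mask: (1, 1, H, W) binary mask of the missing region at feature resolution."""
        patches = extract_patches(feat, size)
        in_hole = F.unfold(hole_mask.float(), kernel_size=size).squeeze(0).t().max(dim=1).values > 0
        hole_p, known_p = patches[in_hole], patches[~in_hole]
        sim = F.normalize(hole_p, dim=1) @ F.normalize(known_p, dim=1).t()
        nn_idx = sim.argmax(dim=1)
        return ((hole_p - known_p[nn_idx].detach()) ** 2).mean()

    def joint_loss(x, coarse, phi, hole_mask_feat, w_tex=1e-3):
        content = F.mse_loss(x, coarse)                  # stay close to the coarse prediction
        texture = texture_loss(phi(x), hole_mask_feat)   # high-frequency detail from known patches
        return content + w_tex * texture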

419 citations


Proceedings ArticleDOI
01 Jun 2016
TL;DR: This work uses a deep convolutional neural network to train a feature descriptor on depth map pixels, but crucially, rather than training the network to solve the shape correspondence problem directly, it trains it to solve a body region classification problem, modified to increase the smoothness of the learned descriptors near region boundaries.
Abstract: We propose a deep learning approach for finding dense correspondences between 3D scans of people. Our method requires only partial geometric information in the form of two depth maps or partial reconstructed surfaces, works for humans in arbitrary poses and wearing any clothing, does not require the two people to be scanned from similar view-points, and runs in real time. We use a deep convolutional neural network to train a feature descriptor on depth map pixels, but crucially, rather than training the network to solve the shape correspondence problem directly, we train it to solve a body region classification problem, modified to increase the smoothness of the learned descriptors near region boundaries. This approach ensures that nearby points on the human body are nearby in feature space, and vice versa, rendering the feature descriptor suitable for computing dense correspondences between the scans. We validate our method on real and synthetic data for both clothed and unclothed humans, and show that our correspondences are more robust than is possible with state-of-the-art unsupervised methods, and more accurate than those found using methods that require full watertight 3D geometry.
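Once such a per-pixel descriptor has been learned, dense correspondence reduces to nearest-neighbor search in feature space, as the abstract notes. A short sketch of that final step (not the paper's implementation), assuming desc_a and desc_b are (N, D) descriptor arrays for valid depth pixels of the two scans:

    import numpy as np

    def dense_correspondences(desc_a, desc_b):
        # normalize so that nearest neighbor by dot product ~ cosine similarity
        a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
        b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
        sim = a @ b.T                   # (Na, Nb) similarity matrix
        return sim.argmax(axis=1)       # best match in scan B for each pixel of scan A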

165 citations


Journal ArticleDOI
11 Nov 2016
TL;DR: This work introduces a novel system for HMD users to control a digital avatar in real-time while producing plausible speech animation and emotional expressions, demonstrates the quality of the system on a variety of subjects, and evaluates its performance against state-of-the-art real-time facial tracking techniques.
Abstract: Significant challenges currently prohibit expressive interaction in virtual reality (VR). Occlusions introduced by head-mounted displays (HMDs) make existing facial tracking techniques intractable, and even state-of-the-art techniques used for real-time facial tracking in unconstrained environments fail to capture subtle details of the user's facial expressions that are essential for compelling speech animation. We introduce a novel system for HMD users to control a digital avatar in real-time while producing plausible speech animation and emotional expressions. Using a monocular camera attached to an HMD, we record multiple subjects performing various facial expressions and speaking several phonetically-balanced sentences. These images are used with artist-generated animation data corresponding to these sequences to train a convolutional neural network (CNN) to regress images of a user's mouth region to the parameters that control a digital avatar. To make training this system more tractable, we use audio-based alignment techniques to map images of multiple users making the same utterance to the corresponding animation parameters. We demonstrate that this approach is also feasible for tracking the expressions around the user's eye region with an internal infrared (IR) camera, thereby enabling full facial tracking. This system requires no user-specific calibration, uses easily obtainable consumer hardware, and produces high-quality animations of speech and emotional expressions. Finally, we demonstrate the quality of our system on a variety of subjects and evaluate its performance against state-of-the-art real-time facial tracking techniques.
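A loose sketch of the regression step described above: a small CNN that maps a mouth-region crop to the parameters driving the avatar. The layer sizes, the 48x48 grayscale crop, and n_params are assumptions for illustration, not the authors' architecture.

    import torch
    import torch.nn as nn

    class MouthRegressor(nn.Module):
        def __init__(self, n_params=30):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),   # 48 -> 24
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 24 -> 12
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(), # 12 -> 6
            )
            self.head = nn.Sequential(
                nn.Flatten(), nn.Linear(128 * 6 * 6, 256), nn.ReLU(),
                nn.Linear(256, n_params),   # rig / blendshape parameters driving the avatar
            )

        def forward(self, crop):            # crop: (B, 1, 48, 48) grayscale mouth image
            return self.head(self.features(crop))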

134 citations


Posted Content
TL;DR: A state-of-the-art regression-based facial tracking framework is adopted, with segmented face images as training data, and accurate and uninterrupted facial performance capture is demonstrated in the presence of extreme occlusion and even side views.
Abstract: We introduce the concept of unconstrained real-time 3D facial performance capture through explicit semantic segmentation in the RGB input. To ensure robustness, cutting edge supervised learning approaches rely on large training datasets of face images captured in the wild. While impressive tracking quality has been demonstrated for faces that are largely visible, any occlusion due to hair, accessories, or hand-to-face gestures would result in significant visual artifacts and loss of tracking accuracy. The modeling of occlusions has been mostly avoided due to its immense space of appearance variability. To address this curse of high dimensionality, we perform tracking in unconstrained images assuming non-face regions can be fully masked out. Along with recent breakthroughs in deep learning, we demonstrate that pixel-level facial segmentation is possible in real-time by repurposing convolutional neural networks designed originally for general semantic segmentation. We develop an efficient architecture based on a two-stream deconvolution network with complementary characteristics, and introduce carefully designed training samples and data augmentation strategies for improved segmentation accuracy and robustness. We adopt a state-of-the-art regression-based facial tracking framework with segmented face images as training, and demonstrate accurate and uninterrupted facial performance capture in the presence of extreme occlusion and even side views. Furthermore, the resulting segmentation can be directly used to composite partial 3D face models on the input images and enable seamless facial manipulation tasks, such as virtual make-up or face replacement.
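A rough sketch of the two-stream deconvolution idea, with all sizes and the fusion rule being assumptions rather than the paper's architecture: a shared encoder feeds two decoder streams whose outputs are fused into a per-pixel face probability, which can then mask non-face pixels before the regression-based tracker sees the image.

    import torch
    import torch.nn as nn

    class TwoStreamSegmenter(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            # two deconvolution streams with complementary roles (coarse/robust vs. fine)
            self.stream_a = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            )
            self.stream_b = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            )

        def forward(self, img):                            # img: (B, 3, H, W), H and W divisible by 4
            z = self.encoder(img)
            logits = self.stream_a(z) + self.stream_b(z)   # simple additive fusion of the streams
            return torch.sigmoid(logits)                   # per-pixel face probability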

127 citations


Proceedings ArticleDOI
01 Oct 2016
TL;DR: This paper proposes a deep neural network (DNN) based model that directly predicts the CTR of an image ad from raw image pixels and other basic features in one step; convolution layers automatically extract representative visual features from the image, and nonlinear CTR features are then learned from the visual and other contextual features by fully-connected layers.
Abstract: Click-through rate (CTR) prediction of image ads is the core task of online display advertising systems, and logistic regression (LR) has been frequently applied as the prediction model. However, the LR model lacks the ability to extract complex, intrinsic nonlinear features from handcrafted high-dimensional image features, which limits its effectiveness. To solve this issue, in this paper, we introduce a novel deep neural network (DNN) based model that directly predicts the CTR of an image ad based on raw image pixels and other basic features in one step. The DNN model employs convolution layers to automatically extract representative visual features from images, and nonlinear CTR features are then learned from visual features and other contextual features by using fully-connected layers. Empirical evaluations on a real-world dataset with over 50 million records demonstrate the effectiveness and efficiency of this method.
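A hedged sketch of the architecture the abstract describes (layer sizes, the 64x64 input, and n_basic_features are assumptions): convolution layers produce visual features from the raw ad image, which are concatenated with basic contextual features and passed through fully-connected layers to output a click probability.

    import torch
    import torch.nn as nn

    class ImageCTRModel(nn.Module):
        def __init__(self, n_basic_features=20):
            super().__init__()
            self.visual = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(), # 32 -> 16
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),                # -> (B, 32)
            )
            self.mlp = nn.Sequential(
                nn.Linear(32 + n_basic_features, 64), nn.ReLU(),
                nn.Linear(64, 1),
            )

        def forward(self, image, basic):      # image: (B, 3, 64, 64); basic: (B, n_basic_features)
            x = torch.cat([self.visual(image), basic], dim=1)
            return torch.sigmoid(self.mlp(x)) # predicted CTR in [0, 1]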

100 citations


Book ChapterDOI
08 Oct 2016
TL;DR: In this paper, a two-stream deconvolutional neural network was proposed to perform real-time 3D facial performance capture in the presence of extreme occlusion and even side views.
Abstract: We introduce the concept of unconstrained real-time 3D facial performance capture through explicit semantic segmentation in the RGB input. To ensure robustness, cutting edge supervised learning approaches rely on large training datasets of face images captured in the wild. While impressive tracking quality has been demonstrated for faces that are largely visible, any occlusion due to hair, accessories, or hand-to-face gestures would result in significant visual artifacts and loss of tracking accuracy. The modeling of occlusions has been mostly avoided due to its immense space of appearance variability. To address this curse of high dimensionality, we perform tracking in unconstrained images assuming non-face regions can be fully masked out. Along with recent breakthroughs in deep learning, we demonstrate that pixel-level facial segmentation is possible in real-time by repurposing convolutional neural networks designed originally for general semantic segmentation. We develop an efficient architecture based on a two-stream deconvolution network with complementary characteristics, and introduce carefully designed training samples and data augmentation strategies for improved segmentation accuracy and robustness. We adopt a state-of-the-art regression-based facial tracking framework with segmented face images as training, and demonstrate accurate and uninterrupted facial performance capture in the presence of extreme occlusion and even side views. Furthermore, the resulting segmentation can be directly used to composite partial 3D face models on the input images and enable seamless facial manipulation tasks, such as virtual make-up or face replacement.

96 citations


Posted Content
TL;DR: In this article, a real-time deep learning framework for video-based facial performance capture is presented, which can reduce the amount of labor involved in the development of modern narrative-driven video games or films involving realistic digital doubles of actors.
Abstract: We present a real-time deep learning framework for video-based facial performance capture -- the dense 3D tracking of an actor's face given a monocular video. Our pipeline begins with accurately capturing a subject using a high-end production facial capture pipeline based on multi-view stereo tracking and artist-enhanced animations. With 5-10 minutes of captured footage, we train a convolutional neural network to produce high-quality output, including self-occluded regions, from a monocular video sequence of that subject. Since this 3D facial performance capture is fully automated, our system can drastically reduce the amount of labor involved in the development of modern narrative-driven video games or films involving realistic digital doubles of actors and potentially hours of animated dialogue per character. We compare our results with several state-of-the-art monocular real-time facial capture techniques and demonstrate compelling animation inference in challenging areas such as eyes and lips.
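The supervision this implies can be summarized in a few lines. The sketch below is not the authors' pipeline; model, n_vertices, and the dataset of (frame, mesh) pairs produced by the high-end capture pipeline are assumed placeholders.

    import torch
    import torch.nn as nn

    def train_step(model, frame, target_vertices, optimizer):
        """frame: (B, 3, H, W) video frame; target_vertices: (B, n_vertices, 3) from the
        production capture pipeline. Dense 3D tracking is cast as per-vertex regression."""
        pred = model(frame).view_as(target_vertices)
        loss = nn.functional.mse_loss(pred, target_vertices)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()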

54 citations


Book ChapterDOI
08 Oct 2016
TL;DR: An end-to-end system for reconstructing complete watertight and textured models of moving subjects such as clothed humans and animals, using only three or four handheld sensors, with a new pairwise registration algorithm that minimizes, using a particle swarm strategy, an alignment error metric based on mutual visibility and occlusion.
Abstract: We present an end-to-end system for reconstructing complete watertight and textured models of moving subjects such as clothed humans and animals, using only three or four handheld sensors. The heart of our framework is a new pairwise registration algorithm that minimizes, using a particle swarm strategy, an alignment error metric based on mutual visibility and occlusion. We show that this algorithm reliably registers partial scans with as little as 15% overlap without requiring any initial correspondences, and outperforms alternative global registration algorithms. This registration algorithm allows us to reconstruct moving subjects from free-viewpoint video produced by consumer-grade sensors, without extensive sensor calibration, constrained capture volume, expensive arrays of cameras, or templates of the subject geometry.
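A minimal particle-swarm sketch of the optimization strategy mentioned above. The paper's visibility-and-occlusion error metric is replaced here by a generic alignment_error(params) callback over 6-DoF rigid parameters; all hyperparameters are illustrative assumptions.

    import numpy as np

    def particle_swarm(alignment_error, dim=6, n_particles=64, iters=100,
                       w=0.7, c1=1.5, c2=1.5, span=1.0, seed=0):
        rng = np.random.default_rng(seed)
        x = rng.uniform(-span, span, (n_particles, dim))      # candidate rigid transforms
        v = np.zeros_like(x)
        p_best = x.copy()
        p_cost = np.array([alignment_error(p) for p in x])
        g_best = p_best[p_cost.argmin()].copy()
        for _ in range(iters):
            r1, r2 = rng.random((2, n_particles, dim))
            v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
            x = x + v
            cost = np.array([alignment_error(p) for p in x])
            improved = cost < p_cost
            p_best[improved], p_cost[improved] = x[improved], cost[improved]
            g_best = p_best[p_cost.argmin()].copy()
        return g_best                                          # best 6-DoF parameters found

    # usage sketch: best = particle_swarm(lambda p: my_visibility_error(scan_a, scan_b, p))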

45 citations


Posted Content
TL;DR: A data-driven inference method is presented that can synthesize a photorealistic texture map of a complete 3D face model given a partial 2D view of a person in the wild; successful face reconstructions from a wide range of low-resolution input images are demonstrated.
Abstract: We present a data-driven inference method that can synthesize a photorealistic texture map of a complete 3D face model given a partial 2D view of a person in the wild. After an initial estimation of shape and low-frequency albedo, we compute a high-frequency partial texture map, without the shading component, of the visible face area. To extract the fine appearance details from this incomplete input, we introduce a multi-scale detail analysis technique based on mid-layer feature correlations extracted from a deep convolutional neural network. We demonstrate that fitting a convex combination of feature correlations from a high-resolution face database can yield a semantically plausible facial detail description of the entire face. A complete and photorealistic texture map can then be synthesized by iteratively optimizing for the reconstructed feature correlations. Using these high-resolution textures and a commercial rendering framework, we can produce high-fidelity 3D renderings that are visually comparable to those obtained with state-of-the-art multi-view face capture systems. We demonstrate successful face reconstructions from a wide range of low resolution input images, including those of historical figures. In addition to extensive evaluations, we validate the realism of our results using a crowdsourced user study.
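A sketch under assumptions of the convex-combination step: the "feature correlations" are treated here as Gram matrices of a mid-layer CNN feature map, and the convex fit is approximated with non-negative least squares followed by renormalization (the paper optimizes this more carefully and then synthesizes the texture from the blended correlations).

    import numpy as np
    from scipy.optimize import nnls

    def gram(feat):                       # feat: (C, H*W) feature map of one face image
        return feat @ feat.T / feat.shape[1]

    def fit_convex_detail(partial_gram, database_grams):
        """Fit weights w >= 0, sum(w) ~= 1 so that sum_i w_i * G_i matches the partial
        input's Gram matrix; database_grams is a list of (C, C) matrices."""
        A = np.stack([g.ravel() for g in database_grams], axis=1)   # (C*C, n_database)
        w, _ = nnls(A, partial_gram.ravel())
        return w / max(w.sum(), 1e-8)     # blended correlations: (A @ w) reshaped to (C, C)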

39 citations


Proceedings ArticleDOI
28 Nov 2016
TL;DR: This course introduces the mathematical foundations and theory behind registration algorithms, along with the practical tools needed to design systems that leverage information from RGBD devices, and illustrates the practical relevance of the theory with concrete applications.
Abstract: Registration algorithms are an essential component of many computer graphics and computer vision systems. With recent technological advances in RGBD sensors (color plus depth), an active area of research is in techniques combining color, geometry, and learnt priors for robust real-time registration. The goal of this course is to introduce the mathematical foundations and theoretical explanation of registration algorithms, in addition to the practical tools to design systems that leverage information from RGBD devices. We present traditional methods for correspondence computation derived from geometric first principles, along with modern techniques leveraging pre-processing of annotated datasets (e.g. deep neural networks). To illustrate the practical relevance of the theoretical content, we discuss applications including static and dynamic scanning/reconstruction as well as real-time tracking of hands and faces. An up-to-date version of the course notes, as well as slides and source code can be found at http://gfx.uvic.ca/teaching/registration.
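As a compact example of the "correspondences from geometric first principles" theme, here is one textbook point-to-point ICP step (closest-point matching followed by a closed-form rigid fit). This is generic ICP, not code from the course notes.

    import numpy as np

    def icp_step(src, dst):
        """src: (N, 3), dst: (M, 3) point clouds; returns R, t aligning src toward dst."""
        # 1. correspondences: closest dst point for every src point
        d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        matched = dst[d2.argmin(axis=1)]
        # 2. closed-form rigid transform (Kabsch) from the matched pairs
        mu_s, mu_d = src.mean(0), matched.mean(0)
        H = (src - mu_s).T @ (matched - mu_d)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:          # avoid reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_d - R @ mu_s
        return R, t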

34 citations


Posted Content
TL;DR: A novel deep neural network (DNN) based model is introduced that directly predicts the CTR of an image ad based on raw image pixels and other basic features in one step.
Abstract: Click-through rate (CTR) prediction of image ads is the core task of online display advertising systems, and logistic regression (LR) has been frequently applied as the prediction model. However, the LR model lacks the ability to extract complex, intrinsic nonlinear features from handcrafted high-dimensional image features, which limits its effectiveness. To solve this issue, in this paper, we introduce a novel deep neural network (DNN) based model that directly predicts the CTR of an image ad based on raw image pixels and other basic features in one step. The DNN model employs convolution layers to automatically extract representative visual features from images, and nonlinear CTR features are then learned from visual features and other contextual features by using fully-connected layers. Empirical evaluations on a real-world dataset with over 50 million records demonstrate the effectiveness and efficiency of this method.

Proceedings ArticleDOI
23 May 2016
TL;DR: A system is proposed that automatically generates photorealistic 3D blendshape-based face models using only a single consumer RGB-D sensor, together with a registration method that solves dense correspondences between two face scans by combining facial landmark detection and optical flow.
Abstract: Creating and animating realistic 3D human faces is an important element of virtual reality, video games, and other areas that involve interactive 3D graphics. In this paper, we propose a system to generate photorealistic 3D blendshape-based face models automatically using only a single consumer RGB-D sensor. The capture and processing requires no artistic expertise to operate, takes 15 seconds to capture and generate a single facial expression, and approximately 1 minute of processing time per expression to transform it into a blendshape model. Our main contributions include a complete end-to-end pipeline for capturing and generating photorealistic blendshape models automatically and a registration method that solves dense correspondences between two face scans by utilizing facial landmark detection and optical flow. We demonstrate the effectiveness of the proposed method by capturing different human subjects with a variety of sensors and puppeteering their 3D faces with real-time facial performance retargeting. The rapid nature of our method allows for just-in-time construction of a digital face. To that end, we also integrated our pipeline with a virtual reality facial performance capture system that allows dynamic embodiment of the generated faces despite partial occlusion of the user's real face by the head-mounted display.
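For context, a minimal sketch of what a blendshape-based model is (not the paper's pipeline): an expression is the neutral mesh plus a weighted sum of per-expression vertex offsets, and real-time retargeting amounts to streaming new weights from a face tracker each frame.

    import numpy as np

    def blendshape_face(neutral, deltas, weights):
        """neutral: (V, 3) mesh; deltas: (K, V, 3) expression offsets; weights: (K,) in [0, 1]."""
        return neutral + np.tensordot(weights, deltas, axes=1)   # (V, 3) posed face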

Journal ArticleDOI
TL;DR: Defining a subject-specific normal-equivalent anatomy with a statistical shape model effectively quantifies anatomical dysmorphism of the femoral head–neck junction and can improve pre-surgical analysis of patients diagnosed with femoro-acetabular impingement.
Abstract: Background: Objective quantification of anatomical variations about the femur head–neck junction in pre-operative planning for surgical intervention in femoro-acetabular impingement is problematic, as no clear definition of average normal anatomy for a specific subject exists. Methods: We have defined the normal-equivalent of a subject's anatomy by using a statistical shape model and geometric shape optimization for finding correspondences, while excluding the femoral head–neck junction during the fitting procedure. The presented technique was evaluated on a cohort of 20 patients. Results: The difference in α-angle measurement between the actual morphology and the predicted normal-equivalent averaged 1.3° (SD 1.7°) in the control group versus 8° (SD 7.3°) in the patient group (p < 0.05). Conclusions: Defining normal-equivalent anatomy is effective in quantifying anatomical dysmorphism of the femoral head–neck junction and as such can improve presurgical analysis of patients diagnosed with femoro-acetabular impingement.
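A simplified sketch of the core idea under assumptions (a linear PCA shape model and a masked least-squares fit standing in for the paper's geometric shape optimization): fit the model only to vertices outside the femoral head–neck junction, then read off the full reconstruction as the subject's normal-equivalent anatomy.

    import numpy as np

    def normal_equivalent(subject, mean_shape, modes, exclude_mask):
        """subject, mean_shape: (V, 3); modes: (K, V, 3) PCA modes; exclude_mask: (V,) bool,
        True for vertices on the head-neck junction that are left out of the fit."""
        keep = ~exclude_mask
        A = modes[:, keep, :].reshape(modes.shape[0], -1).T      # (n_kept*3, K)
        b = (subject - mean_shape)[keep].ravel()
        coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
        return mean_shape + np.tensordot(coeffs, modes, axes=1)  # full predicted normal anatomy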

Journal ArticleDOI
TL;DR: A hybrid calibration method is proposed that combines redundant baselines with reduced noise injection; it requires only a small noise-injection network and can handle the out-of-plane distortion of antenna arrays.
Abstract: The calibration of aperture synthesis radiometers (ASRs) with large arrays by internal noise injection suffers from greatly increased mass and volume and cannot deal with the out-of-plane distortion of antenna arrays. In this letter, a hybrid calibration method is proposed that combines redundant baselines with reduced noise injection. The hybrid method needs only a small noise-injection network and can deal with the out-of-plane distortion of antenna arrays. Its performance depends on the distribution of the noise-injection elements, which is analyzed and determined by minimizing the root mean square of all residual amplitude and phase errors. Simulation results demonstrate its performance. The proposed hybrid method is particularly helpful for the calibration of ASRs with large arrays in geostationary orbit.
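A heavily hedged sketch of the redundant-baseline component only, using one common linear formulation: for each visibility, arg(V_ij) is approximately phi_i - phi_j + psi_b, where phi_i are element phase errors and psi_b is the true phase shared by all baselines of the same type. The letter's hybrid method additionally uses a reduced noise-injection network and handles amplitude errors and out-of-plane distortion, none of which is modeled below.

    import numpy as np

    def redundant_phase_calibration(pairs, baseline_type, phases, n_elem, n_types):
        """pairs: list of (i, j) element indices; baseline_type: type index per pair;
        phases: measured arg(V_ij) in radians. Ignores phase wrapping and degeneracies."""
        A = np.zeros((len(pairs) + 1, n_elem + n_types))
        b = np.zeros(len(pairs) + 1)
        for row, ((i, j), t, ph) in enumerate(zip(pairs, baseline_type, phases)):
            A[row, i], A[row, j], A[row, n_elem + t] = 1.0, -1.0, 1.0
            b[row] = ph
        A[-1, :n_elem] = 1.0              # fix the overall phase reference (sum of phi = 0)
        sol, *_ = np.linalg.lstsq(A, b, rcond=None)
        return sol[:n_elem]               # estimated per-element phase errors (radians)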

Proceedings ArticleDOI
28 Nov 2016
TL;DR: The technology enables the automatic generation of a complete head model from a fully unconstrained image; the resulting model can be instantly animated by anyone in real time through natural facial performances captured with a regular RGB camera.
Abstract: The age of social media and immersive technologies has created a growing need for processing detailed visual representations of ourselves as virtual and augmented reality is growing into the next generation platform for online communication, connecting hundreds of millions of users. A realistic simulation of our presence in a mixed reality environment is unthinkable without a compelling and directable 3D digitization of ourselves. With the wide availability of mobile cameras and internet images, we introduce a technology that can build a realistic 3D avatar from a single photograph. This textured 3D face model includes hair and can be instantly animated by anyone in real-time through natural facial performances captured from a regular RGB camera. Immediate applications include personalized gaming and VR-enabled social networks using automatically digitized 3D avatars, as well as mobile apps such as video messengers (e.g., Snapchat) with face-swapping capabilities. As opposed to existing solutions, our technology enables the automatic generation of a complete head model from a fully unconstrained image.

Proceedings ArticleDOI
24 Jul 2016
TL;DR: This course will cover the major topics and challenges in using image acquisition to model the human body.
Abstract: Modeling the human body is of special interest in computer graphics to create "virtual humans", but material and optical properties of biological tissues are complex and not easily captured. This course will cover the major topics and challenges in using image acquisition to model the human body.

Posted Content
TL;DR: In this paper, the authors present an end-to-end system for reconstructing complete watertight and textured models of moving subjects such as clothed humans and animals, using only three or four handheld sensors.
Abstract: We present an end-to-end system for reconstructing complete watertight and textured models of moving subjects such as clothed humans and animals, using only three or four handheld sensors. The heart of our framework is a new pairwise registration algorithm that minimizes, using a particle swarm strategy, an alignment error metric based on mutual visibility and occlusion. We show that this algorithm reliably registers partial scans with as little as 15% overlap without requiring any initial correspondences, and outperforms alternative global registration algorithms. This registration algorithm allows us to reconstruct moving subjects from free-viewpoint video produced by consumer-grade sensors, without extensive sensor calibration, constrained capture volume, expensive arrays of cameras, or templates of the subject geometry.