
Showing papers by "Hao Li" published in 2016


Posted Content
TL;DR: The authors proposed a multi-scale neural patch synthesis approach based on joint optimization of image content and texture constraints, which not only preserves contextual structures but also produces high-frequency details by matching and adapting patches with the most similar mid-layer feature correlations of a deep classification network.
Abstract: Recent advances in deep learning have shown exciting promise in filling large holes in natural images with semantically plausible and context-aware details, impacting fundamental image manipulation tasks such as object removal. While these learning-based methods are significantly more effective in capturing high-level features than prior techniques, they can only handle very low-resolution inputs due to memory limitations and difficulty in training. Even for slightly larger images, the inpainted regions appear blurry, and unpleasant boundaries become visible. We propose a multi-scale neural patch synthesis approach based on joint optimization of image content and texture constraints, which not only preserves contextual structures but also produces high-frequency details by matching and adapting patches with the most similar mid-layer feature correlations of a deep classification network. We evaluate our method on the ImageNet and Paris Streetview datasets and achieve state-of-the-art inpainting accuracy. We show our approach produces sharper and more coherent results than prior methods, especially for high-resolution images.
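A minimal sketch, not the authors' released code, of what a joint content-and-texture objective of this kind can look like. It assumes a fixed mid-layer feature extractor phi (e.g. a VGG layer), a coarse low-resolution prediction coarse, and a binary hole mask at feature resolution; all names and weights below are illustrative assumptions.

    # Illustrative only: content term keeps the estimate near the coarse prediction;
    # texture term matches hole patches to their nearest known patches in feature space.
    import torch
    import torch.nn.functional as F

    def extract_patches(feat, size=3):
        # feat: (1, C, H, W) -> (N, C*size*size) flattened patches
        return F.unfold(feat, kernel_size=size).squeeze(0).t()

    def texture_loss(feat, hole_mask, size=3):
        """Match each patch inside the hole to its nearest patch outside the hole.
        hole_mask: (1, 1, H, W) binary mask of the missing region at feature resolution."""
        patches = extract_patches(feat, size)
        in_hole = F.unfold(hole_mask.float(), kernel_size=size).squeeze(0).t().max(dim=1).values > 0
        hole_p, known_p = patches[in_hole], patches[~in_hole]
        sim = F.normalize(hole_p, dim=1) @ F.normalize(known_p, dim=1).t()
        nn_idx = sim.argmax(dim=1)
        return ((hole_p - known_p[nn_idx].detach()) ** 2).mean()

    def joint_loss(x, coarse, phi, hole_mask_feat, w_tex=1e-3):
        content = F.mse_loss(x, coarse)                  # stay close to the coarse prediction
        texture = texture_loss(phi(x), hole_mask_feat)   # high-frequency detail from known patches
        return content + w_tex * texture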

419 citations


Proceedings ArticleDOI
01 Jun 2016
TL;DR: This work uses a deep convolutional neural network to train a feature descriptor on depth map pixels, but crucially, rather than training the network to solve the shape correspondence problem directly, it trains it to solve a body region classification problem, modified to increase the smoothness of the learned descriptors near region boundaries.
Abstract: We propose a deep learning approach for finding dense correspondences between 3D scans of people. Our method requires only partial geometric information in the form of two depth maps or partial reconstructed surfaces, works for humans in arbitrary poses and wearing any clothing, does not require the two people to be scanned from similar view-points, and runs in real time. We use a deep convolutional neural network to train a feature descriptor on depth map pixels, but crucially, rather than training the network to solve the shape correspondence problem directly, we train it to solve a body region classification problem, modified to increase the smoothness of the learned descriptors near region boundaries. This approach ensures that nearby points on the human body are nearby in feature space, and vice versa, rendering the feature descriptor suitable for computing dense correspondences between the scans. We validate our method on real and synthetic data for both clothed and unclothed humans, and show that our correspondences are more robust than is possible with state-of-the-art unsupervised methods, and more accurate than those found using methods that require full watertight 3D geometry.
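Once such a per-pixel descriptor has been learned, dense correspondence reduces to nearest-neighbor search in feature space, as the abstract notes. A short sketch of that final step (not the paper's implementation), assuming desc_a and desc_b are (N, D) descriptor arrays for valid depth pixels of the two scans:

    import numpy as np

    def dense_correspondences(desc_a, desc_b):
        # normalize so that nearest neighbor by dot product ~ cosine similarity
        a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
        b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
        sim = a @ b.T                   # (Na, Nb) similarity matrix
        return sim.argmax(axis=1)       # best match in scan B for each pixel of scan A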

165 citations


Journal ArticleDOI
11 Nov 2016
TL;DR: This work introduces a novel system for HMD users to control a digital avatar in real-time while producing plausible speech animation and emotional expressions, demonstrates the quality of the system on a variety of subjects, and evaluates its performance against state-of-the-art real-time facial tracking techniques.
Abstract: Significant challenges currently prohibit expressive interaction in virtual reality (VR). Occlusions introduced by head-mounted displays (HMDs) make existing facial tracking techniques intractable, and even state-of-the-art techniques used for real-time facial tracking in unconstrained environments fail to capture subtle details of the user's facial expressions that are essential for compelling speech animation. We introduce a novel system for HMD users to control a digital avatar in real-time while producing plausible speech animation and emotional expressions. Using a monocular camera attached to an HMD, we record multiple subjects performing various facial expressions and speaking several phonetically-balanced sentences. These images are used with artist-generated animation data corresponding to these sequences to train a convolutional neural network (CNN) to regress images of a user's mouth region to the parameters that control a digital avatar. To make training this system more tractable, we use audio-based alignment techniques to map images of multiple users making the same utterance to the corresponding animation parameters. We demonstrate that this approach is also feasible for tracking the expressions around the user's eye region with an internal infrared (IR) camera, thereby enabling full facial tracking. This system requires no user-specific calibration, uses easily obtainable consumer hardware, and produces high-quality animations of speech and emotional expressions. Finally, we demonstrate the quality of our system on a variety of subjects and evaluate its performance against state-of-the-art real-time facial tracking techniques.
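A loose sketch of the regression step described above: a small CNN that maps a mouth-region crop to the parameters driving the avatar. The layer sizes, the 48x48 grayscale crop, and n_params are assumptions for illustration, not the authors' architecture.

    import torch
    import torch.nn as nn

    class MouthRegressor(nn.Module):
        def __init__(self, n_params=30):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),   # 48 -> 24
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 24 -> 12
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(), # 12 -> 6
            )
            self.head = nn.Sequential(
                nn.Flatten(), nn.Linear(128 * 6 * 6, 256), nn.ReLU(),
                nn.Linear(256, n_params),   # rig / blendshape parameters driving the avatar
            )

        def forward(self, crop):            # crop: (B, 1, 48, 48) grayscale mouth image
            return self.head(self.features(crop))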

134 citations


Posted Content
TL;DR: A state-of-the-art regression-based facial tracking framework is adopted, with segmented face images as training data, and accurate and uninterrupted facial performance capture is demonstrated in the presence of extreme occlusion and even side views.
Abstract: We introduce the concept of unconstrained real-time 3D facial performance capture through explicit semantic segmentation in the RGB input. To ensure robustness, cutting edge supervised learning approaches rely on large training datasets of face images captured in the wild. While impressive tracking quality has been demonstrated for faces that are largely visible, any occlusion due to hair, accessories, or hand-to-face gestures would result in significant visual artifacts and loss of tracking accuracy. The modeling of occlusions has been mostly avoided due to its immense space of appearance variability. To address this curse of high dimensionality, we perform tracking in unconstrained images assuming non-face regions can be fully masked out. Along with recent breakthroughs in deep learning, we demonstrate that pixel-level facial segmentation is possible in real-time by repurposing convolutional neural networks designed originally for general semantic segmentation. We develop an efficient architecture based on a two-stream deconvolution network with complementary characteristics, and introduce carefully designed training samples and data augmentation strategies for improved segmentation accuracy and robustness. We adopt a state-of-the-art regression-based facial tracking framework with segmented face images as training, and demonstrate accurate and uninterrupted facial performance capture in the presence of extreme occlusion and even side views. Furthermore, the resulting segmentation can be directly used to composite partial 3D face models on the input images and enable seamless facial manipulation tasks, such as virtual make-up or face replacement.
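A rough sketch of the two-stream deconvolution idea, with all sizes and the fusion rule being assumptions rather than the paper's architecture: a shared encoder feeds two decoder streams whose outputs are fused into a per-pixel face probability, which can then mask non-face pixels before the regression-based tracker sees the image.

    import torch
    import torch.nn as nn

    class TwoStreamSegmenter(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            # two deconvolution streams with complementary roles (coarse/robust vs. fine)
            self.stream_a = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            )
            self.stream_b = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            )

        def forward(self, img):                            # img: (B, 3, H, W), H and W divisible by 4
            z = self.encoder(img)
            logits = self.stream_a(z) + self.stream_b(z)   # simple additive fusion of the streams
            return torch.sigmoid(logits)                   # per-pixel face probability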

127 citations


Proceedings ArticleDOI
01 Oct 2016
TL;DR: This paper proposes a deep neural network (DNN) based model that directly predicts the CTR of an image ad from raw image pixels and other basic features in one step; convolution layers automatically extract representative visual features from the image, and nonlinear CTR features are then learned from the visual and other contextual features by fully-connected layers.
Abstract: Click-through rate (CTR) prediction of image ads is the core task of online display advertising systems, and logistic regression (LR) has been frequently applied as the prediction model. However, the LR model lacks the ability to extract complex, intrinsic nonlinear features from handcrafted high-dimensional image features, which limits its effectiveness. To solve this issue, in this paper, we introduce a novel deep neural network (DNN) based model that directly predicts the CTR of an image ad based on raw image pixels and other basic features in one step. The DNN model employs convolution layers to automatically extract representative visual features from images, and nonlinear CTR features are then learned from visual features and other contextual features by using fully-connected layers. Empirical evaluations on a real-world dataset with over 50 million records demonstrate the effectiveness and efficiency of this method.
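A hedged sketch of the architecture the abstract describes (layer sizes, the 64x64 input, and n_basic_features are assumptions): convolution layers produce visual features from the raw ad image, which are concatenated with basic contextual features and passed through fully-connected layers to output a click probability.

    import torch
    import torch.nn as nn

    class ImageCTRModel(nn.Module):
        def __init__(self, n_basic_features=20):
            super().__init__()
            self.visual = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(), # 32 -> 16
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),                # -> (B, 32)
            )
            self.mlp = nn.Sequential(
                nn.Linear(32 + n_basic_features, 64), nn.ReLU(),
                nn.Linear(64, 1),
            )

        def forward(self, image, basic):      # image: (B, 3, 64, 64); basic: (B, n_basic_features)
            x = torch.cat([self.visual(image), basic], dim=1)
            return torch.sigmoid(self.mlp(x)) # predicted CTR in [0, 1]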

100 citations


Book ChapterDOI
08 Oct 2016
TL;DR: In this paper, a two-stream deconvolutional neural network was proposed to perform real-time 3D facial performance capture in the presence of extreme occlusion and even side views.
Abstract: We introduce the concept of unconstrained real-time 3D facial performance capture through explicit semantic segmentation in the RGB input. To ensure robustness, cutting edge supervised learning approaches rely on large training datasets of face images captured in the wild. While impressive tracking quality has been demonstrated for faces that are largely visible, any occlusion due to hair, accessories, or hand-to-face gestures would result in significant visual artifacts and loss of tracking accuracy. The modeling of occlusions has been mostly avoided due to its immense space of appearance variability. To address this curse of high dimensionality, we perform tracking in unconstrained images assuming non-face regions can be fully masked out. Along with recent breakthroughs in deep learning, we demonstrate that pixel-level facial segmentation is possible in real-time by repurposing convolutional neural networks designed originally for general semantic segmentation. We develop an efficient architecture based on a two-stream deconvolution network with complementary characteristics, and introduce carefully designed training samples and data augmentation strategies for improved segmentation accuracy and robustness. We adopt a state-of-the-art regression-based facial tracking framework with segmented face images as training, and demonstrate accurate and uninterrupted facial performance capture in the presence of extreme occlusion and even side views. Furthermore, the resulting segmentation can be directly used to composite partial 3D face models on the input images and enable seamless facial manipulation tasks, such as virtual make-up or face replacement.

96 citations


Posted Content
TL;DR: In this article, a real-time deep learning framework for video-based facial performance capture is presented, which can reduce the amount of labor involved in the development of modern narrative-driven video games or films involving realistic digital doubles of actors.
Abstract: We present a real-time deep learning framework for video-based facial performance capture -- the dense 3D tracking of an actor's face given a monocular video. Our pipeline begins with accurately capturing a subject using a high-end production facial capture pipeline based on multi-view stereo tracking and artist-enhanced animations. With 5-10 minutes of captured footage, we train a convolutional neural network to produce high-quality output, including self-occluded regions, from a monocular video sequence of that subject. Since this 3D facial performance capture is fully automated, our system can drastically reduce the amount of labor involved in the development of modern narrative-driven video games or films involving realistic digital doubles of actors and potentially hours of animated dialogue per character. We compare our results with several state-of-the-art monocular real-time facial capture techniques and demonstrate compelling animation inference in challenging areas such as eyes and lips.
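The supervision this implies can be summarized in a few lines. The sketch below is not the authors' pipeline; model, n_vertices, and the dataset of (frame, mesh) pairs produced by the high-end capture pipeline are assumed placeholders.

    import torch
    import torch.nn as nn

    def train_step(model, frame, target_vertices, optimizer):
        """frame: (B, 3, H, W) video frame; target_vertices: (B, n_vertices, 3) from the
        production capture pipeline. Dense 3D tracking is cast as per-vertex regression."""
        pred = model(frame).view_as(target_vertices)
        loss = nn.functional.mse_loss(pred, target_vertices)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()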

54 citations


Book ChapterDOI
08 Oct 2016
TL;DR: An end-to-end system for reconstructing complete watertight and textured models of moving subjects such as clothed humans and animals, using only three or four handheld sensors, with a new pairwise registration algorithm that minimizes, using a particle swarm strategy, an alignment error metric based on mutual visibility and occlusion.
Abstract: We present an end-to-end system for reconstructing complete watertight and textured models of moving subjects such as clothed humans and animals, using only three or four handheld sensors. The heart of our framework is a new pairwise registration algorithm that minimizes, using a particle swarm strategy, an alignment error metric based on mutual visibility and occlusion. We show that this algorithm reliably registers partial scans with as little as 15% overlap without requiring any initial correspondences, and outperforms alternative global registration algorithms. This registration algorithm allows us to reconstruct moving subjects from free-viewpoint video produced by consumer-grade sensors, without extensive sensor calibration, constrained capture volume, expensive arrays of cameras, or templates of the subject geometry.
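A minimal particle-swarm sketch of the optimization strategy mentioned above. The paper's visibility-and-occlusion error metric is replaced here by a generic alignment_error(params) callback over 6-DoF rigid parameters; all hyperparameters are illustrative assumptions.

    import numpy as np

    def particle_swarm(alignment_error, dim=6, n_particles=64, iters=100,
                       w=0.7, c1=1.5, c2=1.5, span=1.0, seed=0):
        rng = np.random.default_rng(seed)
        x = rng.uniform(-span, span, (n_particles, dim))      # candidate rigid transforms
        v = np.zeros_like(x)
        p_best = x.copy()
        p_cost = np.array([alignment_error(p) for p in x])
        g_best = p_best[p_cost.argmin()].copy()
        for _ in range(iters):
            r1, r2 = rng.random((2, n_particles, dim))
            v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
            x = x + v
            cost = np.array([alignment_error(p) for p in x])
            improved = cost < p_cost
            p_best[improved], p_cost[improved] = x[improved], cost[improved]
            g_best = p_best[p_cost.argmin()].copy()
        return g_best                                          # best 6-DoF parameters found

    # usage sketch: best = particle_swarm(lambda p: my_visibility_error(scan_a, scan_b, p))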

45 citations


Posted Content
TL;DR: A data-driven inference method is presented that can synthesize a photorealistic texture map of a complete 3D face model given a partial 2D view of a person in the wild; successful face reconstructions from a wide range of low-resolution input images are demonstrated.
Abstract: We present a data-driven inference method that can synthesize a photorealistic texture map of a complete 3D face model given a partial 2D view of a person in the wild. After an initial estimation of shape and low-frequency albedo, we compute a high-frequency partial texture map, without the shading component, of the visible face area. To extract the fine appearance details from this incomplete input, we introduce a multi-scale detail analysis technique based on mid-layer feature correlations extracted from a deep convolutional neural network. We demonstrate that fitting a convex combination of feature correlations from a high-resolution face database can yield a semantically plausible facial detail description of the entire face. A complete and photorealistic texture map can then be synthesized by iteratively optimizing for the reconstructed feature correlations. Using these high-resolution textures and a commercial rendering framework, we can produce high-fidelity 3D renderings that are visually comparable to those obtained with state-of-the-art multi-view face capture systems. We demonstrate successful face reconstructions from a wide range of low resolution input images, including those of historical figures. In addition to extensive evaluations, we validate the realism of our results using a crowdsourced user study.
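A sketch under assumptions of the convex-combination step: the "feature correlations" are treated here as Gram matrices of a mid-layer CNN feature map, and the convex fit is approximated with non-negative least squares followed by renormalization (the paper optimizes this more carefully and then synthesizes the texture from the blended correlations).

    import numpy as np
    from scipy.optimize import nnls

    def gram(feat):                       # feat: (C, H*W) feature map of one face image
        return feat @ feat.T / feat.shape[1]

    def fit_convex_detail(partial_gram, database_grams):
        """Fit weights w >= 0, sum(w) ~= 1 so that sum_i w_i * G_i matches the partial
        input's Gram matrix; database_grams is a list of (C, C) matrices."""
        A = np.stack([g.ravel() for g in database_grams], axis=1)   # (C*C, n_database)
        w, _ = nnls(A, partial_gram.ravel())
        return w / max(w.sum(), 1e-8)     # blended correlations: (A @ w) reshaped to (C, C)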

39 citations


Proceedings ArticleDOI
28 Nov 2016
TL;DR: This course introduces the mathematical foundations and theory behind registration algorithms, along with the practical tools needed to design systems that leverage information from RGBD devices, and illustrates the practical relevance of the theory with concrete applications.
Abstract: Registration algorithms are an essential component of many computer graphics and computer vision systems. With recent technological advances in RGBD sensors (color plus depth), an active area of research is in techniques combining color, geometry, and learnt priors for robust real-time registration. The goal of this course is to introduce the mathematical foundations and theoretical explanation of registration algorithms, in addition to the practical tools to design systems that leverage information from RGBD devices. We present traditional methods for correspondence computation derived from geometric first principles, along with modern techniques leveraging pre-processing of annotated datasets (e.g. deep neural networks). To illustrate the practical relevance of the theoretical content, we discuss applications including static and dynamic scanning/reconstruction as well as real-time tracking of hands and faces. An up-to-date version of the course notes, as well as slides and source code can be found at http://gfx.uvic.ca/teaching/registration.
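As a compact example of the "correspondences from geometric first principles" theme, here is one textbook point-to-point ICP step (closest-point matching followed by a closed-form rigid fit). This is generic ICP, not code from the course notes.

    import numpy as np

    def icp_step(src, dst):
        """src: (N, 3), dst: (M, 3) point clouds; returns R, t aligning src toward dst."""
        # 1. correspondences: closest dst point for every src point
        d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        matched = dst[d2.argmin(axis=1)]
        # 2. closed-form rigid transform (Kabsch) from the matched pairs
        mu_s, mu_d = src.mean(0), matched.mean(0)
        H = (src - mu_s).T @ (matched - mu_d)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:          # avoid reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_d - R @ mu_s
        return R, t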

34 citations


Posted Content
TL;DR: A novel deep neural network (DNN) based model is introduced that directly predicts the CTR of an image ad based on raw image pixels and other basic features in one step.
Abstract: Click-through rate (CTR) prediction of image ads is the core task of online display advertising systems, and logistic regression (LR) has been frequently applied as the prediction model. However, the LR model lacks the ability to extract complex, intrinsic nonlinear features from handcrafted high-dimensional image features, which limits its effectiveness. To solve this issue, in this paper, we introduce a novel deep neural network (DNN) based model that directly predicts the CTR of an image ad based on raw image pixels and other basic features in one step. The DNN model employs convolution layers to automatically extract representative visual features from images, and nonlinear CTR features are then learned from visual features and other contextual features by using fully-connected layers. Empirical evaluations on a real-world dataset with over 50 million records demonstrate the effectiveness and efficiency of this method.

Proceedings ArticleDOI
23 May 2016
TL;DR: A system is proposed that automatically generates photorealistic 3D blendshape-based face models using only a single consumer RGB-D sensor, together with a registration method that solves dense correspondences between two face scans by combining facial landmark detection and optical flow.
Abstract: Creating and animating realistic 3D human faces is an important element of virtual reality, video games, and other areas that involve interactive 3D graphics. In this paper, we propose a system to generate photorealistic 3D blendshape-based face models automatically using only a single consumer RGB-D sensor. The capture and processing requires no artistic expertise to operate, takes 15 seconds to capture and generate a single facial expression, and approximately 1 minute of processing time per expression to transform it into a blendshape model. Our main contributions include a complete end-to-end pipeline for capturing and generating photorealistic blendshape models automatically and a registration method that solves dense correspondences between two face scans by utilizing facial landmark detection and optical flow. We demonstrate the effectiveness of the proposed method by capturing different human subjects with a variety of sensors and puppeteering their 3D faces with real-time facial performance retargeting. The rapid nature of our method allows for just-in-time construction of a digital face. To that end, we also integrated our pipeline with a virtual reality facial performance capture system that allows dynamic embodiment of the generated faces despite partial occlusion of the user's real face by the head-mounted display.
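For context, a minimal sketch of what a blendshape-based model is (not the paper's pipeline): an expression is the neutral mesh plus a weighted sum of per-expression vertex offsets, and real-time retargeting amounts to streaming new weights from a face tracker each frame.

    import numpy as np

    def blendshape_face(neutral, deltas, weights):
        """neutral: (V, 3) mesh; deltas: (K, V, 3) expression offsets; weights: (K,) in [0, 1]."""
        return neutral + np.tensordot(weights, deltas, axes=1)   # (V, 3) posed face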

Journal ArticleDOI
TL;DR: Defining a subject-specific normal-equivalent anatomy with a statistical shape model effectively quantifies anatomical dysmorphism of the femoral head–neck junction and can improve pre-surgical analysis of patients diagnosed with femoro-acetabular impingement.
Abstract: Background: Objective quantification of anatomical variations about the femur head–neck junction in pre-operative planning for surgical intervention in femoro-acetabular impingement is problematic, as no clear definition of average normal anatomy for a specific subject exists. Methods: We have defined the normal-equivalent of a subject's anatomy by using a statistical shape model and geometric shape optimization for finding correspondences, while excluding the femoral head–neck junction during the fitting procedure. The presented technique was evaluated on a cohort of 20 patients. Results: The difference in α-angle measurement between the actual morphology and the predicted normal-equivalent averaged 1.3° (SD 1.7°) in the control group versus 8° (SD 7.3°) in the patient group (p < 0.05). Conclusions: Defining normal-equivalent anatomy is effective in quantifying anatomical dysmorphism of the femoral head–neck junction and as such can improve presurgical analysis of patients diagnosed with femoro-acetabular impingement.
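A simplified sketch of the core idea under assumptions (a linear PCA shape model and a masked least-squares fit standing in for the paper's geometric shape optimization): fit the model only to vertices outside the femoral head–neck junction, then read off the full reconstruction as the subject's normal-equivalent anatomy.

    import numpy as np

    def normal_equivalent(subject, mean_shape, modes, exclude_mask):
        """subject, mean_shape: (V, 3); modes: (K, V, 3) PCA modes; exclude_mask: (V,) bool,
        True for vertices on the head-neck junction that are left out of the fit."""
        keep = ~exclude_mask
        A = modes[:, keep, :].reshape(modes.shape[0], -1).T      # (n_kept*3, K)
        b = (subject - mean_shape)[keep].ravel()
        coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
        return mean_shape + np.tensordot(coeffs, modes, axes=1)  # full predicted normal anatomy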

Journal ArticleDOI
TL;DR: A hybrid calibration method is proposed that combines redundant baselines with reduced noise injection; it requires only a small noise-injection network and can handle the out-of-plane distortion of antenna arrays.
Abstract: The calibration of aperture synthesis radiometers (ASRs) with large arrays by internal noise injection suffers from greatly increased mass and volume and cannot deal with the out-of-plane distortion of antenna arrays. In this letter, a hybrid calibration method is proposed that combines redundant baselines with reduced noise injection. The hybrid method needs only a small noise-injection network and can deal with the out-of-plane distortion of antenna arrays. Its performance depends on the distribution of the noise-injection elements, which is analyzed and determined by minimizing the root mean square of all residual amplitude and phase errors. Simulation results demonstrate its performance. The proposed hybrid method is particularly helpful for the calibration of ASRs with large arrays in geostationary orbit.
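A heavily hedged sketch of the redundant-baseline component only, using one common linear formulation: for each visibility, arg(V_ij) is approximately phi_i - phi_j + psi_b, where phi_i are element phase errors and psi_b is the true phase shared by all baselines of the same type. The letter's hybrid method additionally uses a reduced noise-injection network and handles amplitude errors and out-of-plane distortion, none of which is modeled below.

    import numpy as np

    def redundant_phase_calibration(pairs, baseline_type, phases, n_elem, n_types):
        """pairs: list of (i, j) element indices; baseline_type: type index per pair;
        phases: measured arg(V_ij) in radians. Ignores phase wrapping and degeneracies."""
        A = np.zeros((len(pairs) + 1, n_elem + n_types))
        b = np.zeros(len(pairs) + 1)
        for row, ((i, j), t, ph) in enumerate(zip(pairs, baseline_type, phases)):
            A[row, i], A[row, j], A[row, n_elem + t] = 1.0, -1.0, 1.0
            b[row] = ph
        A[-1, :n_elem] = 1.0              # fix the overall phase reference (sum of phi = 0)
        sol, *_ = np.linalg.lstsq(A, b, rcond=None)
        return sol[:n_elem]               # estimated per-element phase errors (radians)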

Proceedings ArticleDOI
28 Nov 2016
TL;DR: The technology enables the automatic generation of a complete head model from a fully unconstrained image; the resulting model can be instantly animated by anyone in real time through natural facial performances captured with a regular RGB camera.
Abstract: The age of social media and immersive technologies has created a growing need for processing detailed visual representations of ourselves as virtual and augmented reality is growing into the next generation platform for online communication, connecting hundreds of millions of users. A realistic simulation of our presence in a mixed reality environment is unthinkable without a compelling and directable 3D digitization of ourselves. With the wide availability of mobile cameras and internet images, we introduce a technology that can build a realistic 3D avatar from a single photograph. This textured 3D face model includes hair and can be instantly animated by anyone in real-time through natural facial performances captured from a regular RGB camera. Immediate applications include personalized gaming and VR-enabled social networks using automatically digitized 3D avatars, as well as mobile apps such as video messengers (e.g., Snapchat) with face-swapping capabilities. As opposed to existing solutions, our technology enables the automatic generation of a complete head model from a fully unconstrained image.

Proceedings ArticleDOI
24 Jul 2016
TL;DR: This course will cover the major topics and challenges in using image acquisition to model the human body.
Abstract: Modeling the human body is of special interest in computer graphics to create "virtual humans", but material and optical properties of biological tissues are complex and not easily captured. This course will cover the major topics and challenges in using image acquisition to model the human body.

Posted Content
TL;DR: In this paper, the authors present an end-to-end system for reconstructing complete watertight and textured models of moving subjects such as clothed humans and animals, using only three or four handheld sensors.
Abstract: We present an end-to-end system for reconstructing complete watertight and textured models of moving subjects such as clothed humans and animals, using only three or four handheld sensors. The heart of our framework is a new pairwise registration algorithm that minimizes, using a particle swarm strategy, an alignment error metric based on mutual visibility and occlusion. We show that this algorithm reliably registers partial scans with as little as 15% overlap without requiring any initial correspondences, and outperforms alternative global registration algorithms. This registration algorithm allows us to reconstruct moving subjects from free-viewpoint video produced by consumer-grade sensors, without extensive sensor calibration, constrained capture volume, expensive arrays of cameras, or templates of the subject geometry.