
Showing papers in "ACM Transactions on Graphics in 2017"


Journal ArticleDOI
TL;DR: Given audio of President Barack Obama, a high quality video of him speaking with accurate lip sync is synthesized, composited into a target video clip, and a recurrent neural network learns the mapping from raw audio features to mouth shapes to produce photorealistic results.
Abstract: Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.
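A minimal sketch of the kind of audio-to-mouth mapping described above, assuming the audio is summarized by per-frame feature vectors (e.g., MFCCs) and the mouth shape by a low-dimensional coefficient vector; the layer sizes and the plain LSTM are illustrative choices, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    """Recurrent regressor from audio features to mouth-shape coefficients."""
    def __init__(self, n_audio_feats=28, hidden=60, n_mouth_coeffs=18):
        super().__init__()
        self.lstm = nn.LSTM(n_audio_feats, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mouth_coeffs)

    def forward(self, audio_feats):            # (batch, time, n_audio_feats)
        h, _ = self.lstm(audio_feats)          # (batch, time, hidden)
        return self.out(h)                     # one mouth shape per time step

model = AudioToMouth()
mouth = model(torch.randn(1, 100, 28))         # 100 frames of audio features
```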

763 citations


Journal ArticleDOI
TL;DR: In this paper, a robust pose estimation strategy is proposed for real-time, high-quality, 3D scanning of large-scale scenes using RGB-D input with an efficient hierarchical approach, which removes heavy reliance on temporal tracking and continually localizes to the globally optimized frames instead.
Abstract: Real-time, high-quality, 3D scanning of large-scale scenes is key to mixed reality and robotic applications. However, scalability brings challenges of drift in pose estimation, introducing significant errors in the accumulated model. Approaches often require hours of offline processing to globally correct model errors. Recent online methods demonstrate compelling results but suffer from (1) needing minutes to perform online correction, preventing true real-time use; (2) brittle frame-to-frame (or frame-to-model) pose estimation, resulting in many tracking failures; or (3) supporting only unstructured point-based representations, which limit scan quality and applicability. We systematically address these issues with a novel, real-time, end-to-end reconstruction framework. At its core is a robust pose estimation strategy, optimizing per frame for a global set of camera poses by considering the complete history of RGB-D input with an efficient hierarchical approach. We remove the heavy reliance on temporal tracking and continually localize to the globally optimized frames instead. We contribute a parallelizable optimization framework, which employs correspondences based on sparse features and dense geometric and photometric matching. Our approach estimates globally optimized (i.e., bundle adjusted) poses in real time, supports robust tracking with recovery from gross tracking failures (i.e., relocalization), and re-estimates the 3D model in real time to ensure global consistency, all within a single framework. Our approach outperforms state-of-the-art online systems with quality on par with offline methods, but with unprecedented speed and scan completeness. Our framework leads to a comprehensive online scanning solution for large indoor environments, enabling ease of use and high-quality results.

711 citations


Journal ArticleDOI
TL;DR: The O-CNN is presented, an Octree-based Convolutional Neural Network (CNN) for 3D shape analysis built upon the octree representation of 3D shapes, which takes the average normal vectors of a 3D model sampled in the finest leaf octants as input and performs 3D CNN operations on the octants occupied by the 3D shape surface.
Abstract: We present O-CNN, an Octree-based Convolutional Neural Network (CNN) for 3D shape analysis. Built upon the octree representation of 3D shapes, our method takes the average normal vectors of a 3D model sampled in the finest leaf octants as input and performs 3D CNN operations on the octants occupied by the 3D shape surface. We design a novel octree data structure to efficiently store the octant information and CNN features into the graphics memory and execute the entire O-CNN training and evaluation on the GPU. O-CNN supports various CNN structures and works for 3D shapes in different representations. By restraining the computations on the octants occupied by 3D surfaces, the memory and computational costs of the O-CNN grow quadratically as the depth of the octree increases, which makes the 3D CNN feasible for high-resolution 3D models. We compare the performance of the O-CNN with other existing 3D CNN solutions and demonstrate the efficiency and efficacy of O-CNN in three shape analysis tasks, including object classification, shape retrieval, and shape segmentation.
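The central idea, restricting CNN computation to octants occupied by the surface, can be illustrated with a small sketch (assumptions: features and a precomputed neighbor table are stored only for occupied octants, with index -1 marking empty neighbors that contribute zeros; this is not the released O-CNN code).

```python
import numpy as np

def octant_conv(features, neighbors, weights):
    """features : (n_occupied, c_in)   per-octant inputs (e.g., averaged normals)
       neighbors: (n_occupied, k)      neighbor indices, -1 where the octant is empty
       weights  : (k, c_in, c_out)     one filter slice per neighbor offset"""
    padded = np.vstack([features, np.zeros((1, features.shape[1]))])  # extra zero row for "empty"
    gathered = padded[neighbors]                    # (n_occupied, k, c_in); index -1 hits the zero row
    return np.einsum('nki,kio->no', gathered, weights)

feats = np.random.rand(500, 3)                      # 500 occupied finest-level octants
nbrs  = np.random.randint(-1, 500, size=(500, 27))  # toy 3x3x3 neighborhoods
out   = octant_conv(feats, nbrs, np.random.rand(27, 3, 16))
```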

699 citations


Journal ArticleDOI
TL;DR: FLAME (Faces Learned with an Articulated Model and Expressions) is low-dimensional but more expressive than the FaceWarehouse model and the Basel Face Model, as shown by fitting all three models to static 3D scans and 4D sequences using the same optimization method.
Abstract: The field of 3D face modeling has a large gap between high-end and low-end methods. At the high end, the best facial animation is indistinguishable from real humans, but this comes at the cost of extensive manual labor. At the low end, face capture from consumer depth sensors relies on 3D face models that are not expressive enough to capture the variability in natural facial shape and expression. We seek a middle ground by learning a facial model from thousands of accurately aligned 3D scans. Our FLAME model (Faces Learned with an Articulated Model and Expressions) is designed to work with existing graphics software and be easy to fit to data. FLAME uses a linear shape space trained from 3800 scans of human heads. FLAME combines this linear shape space with an articulated jaw, neck, and eyeballs, pose-dependent corrective blendshapes, and additional global expression blendshapes. The pose- and expression-dependent articulations are learned from 4D face sequences in the D3DFACS dataset along with additional 4D sequences. We accurately register a template mesh to the scan sequences and make the D3DFACS registrations available for research purposes. In total, the model is trained from over 33,000 scans. FLAME is low-dimensional but more expressive than the FaceWarehouse model and the Basel Face Model. We compare FLAME to these models by fitting them to static 3D scans and 4D sequences using the same optimization method. FLAME is significantly more accurate and is available for research purposes (http://flame.is.tue.mpg.de).
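A hedged sketch of how such a model assembles a head mesh: a template plus linear identity and expression offsets, after which pose-corrective blendshapes and skinning of the jaw, neck, and eyeball joints would be applied (omitted here). The dimensions are illustrative assumptions, not the released model's exact sizes.

```python
import numpy as np

n_verts, n_shape, n_expr = 5000, 300, 100
template   = np.zeros((n_verts, 3))                       # mean head
shape_dirs = np.random.randn(n_verts, 3, n_shape) * 1e-3  # identity basis
expr_dirs  = np.random.randn(n_verts, 3, n_expr) * 1e-3   # expression basis

def head_mesh(betas, psis):
    """betas: (n_shape,) identity coefficients; psis: (n_expr,) expression coefficients."""
    v = template + shape_dirs @ betas + expr_dirs @ psis
    # A full model would add pose-corrective blendshapes here and then apply
    # linear blend skinning with the articulated jaw, neck, and eyeball joints.
    return v

verts = head_mesh(np.zeros(n_shape), np.zeros(n_expr))
```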

629 citations


Journal ArticleDOI
TL;DR: A benchmark for image-based 3D reconstruction with high-resolution video sequences provided as input, supporting the development of novel pipelines that take advantage of video input to increase reconstruction fidelity.
Abstract: We present a benchmark for image-based 3D reconstruction. The benchmark sequences were acquired outside the lab, in realistic conditions. Ground-truth data was captured using an industrial laser scanner. The benchmark includes both outdoor scenes and indoor environments. High-resolution video sequences are provided as input, supporting the development of novel pipelines that take advantage of video input to increase reconstruction fidelity. We report the performance of many image-based 3D reconstruction pipelines on the new benchmark. The results point to exciting challenges and opportunities for future work.

553 citations


Journal ArticleDOI
TL;DR: A model of hands and bodies interacting together is formulated and fit to full-body 4D sequences, yielding reconstructions that move naturally with detailed hand motions and a realism not seen before in full-body performance capture.
Abstract: Humans move their hands and bodies together to communicate and solve tasks. Capturing and replicating such coordinated activity is critical for virtual characters that behave realistically. Surprisingly, most methods treat the 3D modeling and tracking of bodies and hands separately. Here we formulate a model of hands and bodies interacting together and fit it to full-body 4D sequences. When scanning or capturing the full body in 3D, hands are small and often partially occluded, making their shape and pose hard to recover. To cope with low-resolution, occlusion, and noise, we develop a new model called MANO (hand Model with Articulated and Non-rigid defOrmations). MANO is learned from around 1000 high-resolution 3D scans of hands of 31 subjects in a wide variety of hand poses. The model is realistic, low-dimensional, captures non-rigid shape changes with pose, is compatible with standard graphics packages, and can fit any human hand. MANO provides a compact mapping from hand poses to pose blend shape corrections and a linear manifold of pose synergies. We attach MANO to a standard parameterized 3D body shape model (SMPL), resulting in a fully articulated body and hand model (SMPL+H). We illustrate SMPL+H by fitting complex, natural activities of subjects captured with a 4D scanner. The fitting is fully automatic and results in full body models that move naturally with detailed hand motions and a realism not seen before in full body performance capture. The models and data are freely available for research purposes at http://mano.is.tue.mpg.de.
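The phrase "pose blend shape corrections" can be made concrete with a small sketch, assuming (as in SMPL-style models) that the corrective offsets are a linear function of the per-joint rotation matrices minus the rest pose; the corrective basis and sizes below are illustrative stand-ins, not the released MANO parameters.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

n_verts, n_joints = 778, 15
pose_dirs = np.random.randn(n_verts, 3, 9 * n_joints) * 1e-4   # toy corrective basis

def pose_blendshape(axis_angles):
    """axis_angles: (n_joints, 3) joint rotations; returns per-vertex offsets."""
    rots = R.from_rotvec(axis_angles).as_matrix()               # (n_joints, 3, 3)
    feat = (rots - np.eye(3)).reshape(-1)                       # zero at the rest pose
    return pose_dirs @ feat                                     # (n_verts, 3)

offsets = pose_blendshape(np.zeros((n_joints, 3)))              # no correction at rest
```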

536 citations


Journal ArticleDOI
TL;DR: This paper aims to learn a variety of environment-aware locomotion skills with a limited amount of prior knowledge by adopting a two-level hierarchical control framework and training both levels using deep reinforcement learning.
Abstract: Learning physics-based locomotion skills is a difficult problem, leading to solutions that typically exploit prior knowledge of various forms. In this paper we aim to learn a variety of environment-aware locomotion skills with a limited amount of prior knowledge. We adopt a two-level hierarchical control framework. First, low-level controllers are learned that operate at a fine timescale and which achieve robust walking gaits that satisfy stepping-target and style objectives. Second, high-level controllers are then learned which plan at the timescale of steps by invoking desired step targets for the low-level controller. The high-level controller makes decisions directly based on high-dimensional inputs, including terrain maps or other suitable representations of the surroundings. Both levels of the control policy are trained using deep reinforcement learning. Results are demonstrated on a simulated 3D biped. Low-level controllers are learned for a variety of motion styles and demonstrate robustness with respect to force-based disturbances, terrain variations, and style interpolation. High-level controllers are demonstrated that are capable of following trails through terrains, dribbling a soccer ball towards a target location, and navigating through static or dynamic obstacles.
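A schematic of the two-level control loop described above, with placeholder policies standing in for the trained networks (the rates, dimensions, and toy dynamics are assumptions for illustration only).

```python
import numpy as np

def high_level_policy(terrain_map, goal):        # invoked once per footstep
    return goal[:2] + np.random.randn(2) * 0.05  # hypothetical step target

def low_level_policy(state, step_target):        # runs at the fine control timescale
    return -0.1 * (state[:2] - step_target)      # toy action tracking the step target

state, goal = np.zeros(10), np.array([5.0, 0.0])
for step in range(20):                           # one high-level decision per step
    target = high_level_policy(terrain_map=None, goal=goal)
    for t in range(30):                          # many low-level actions per step
        action = low_level_policy(state, target)
        state[:2] += 0.01 * action               # stand-in for the physics simulator
```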

518 citations


Journal ArticleDOI
TL;DR: A unified focus, aberration correction, and vision correction model, along with a user calibration process, accounts for any optical defects between the light source and retina to enable truly compact, eyeglasses-like displays with wide fields of view that would be inaccessible through conventional means.
Abstract: We present novel designs for virtual and augmented reality near-eye displays based on phase-only holographic projection. Our approach is built on the principles of Fresnel holography and double phase amplitude encoding with additional hardware, phase correction factors, and spatial light modulator encodings to achieve full color, high contrast and low noise holograms with high resolution and true per-pixel focal control. We provide a GPU-accelerated implementation of all holographic computation that integrates with the standard graphics pipeline and enables real-time (≥90 Hz) calculation directly or through eye tracked approximations. A unified focus, aberration correction, and vision correction model, along with a user calibration process, accounts for any optical defects between the light source and retina. We use this optical correction ability not only to fix minor aberrations but to enable truly compact, eyeglasses-like displays with wide fields of view (80°) that would be inaccessible through conventional means. All functionality is evaluated across a series of hardware prototypes; we discuss remaining challenges to incorporate all features into a single device.
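One of the named ingredients, double phase amplitude encoding, has a compact form worth spelling out: a complex field a·exp(iφ) with a normalized to [0, 1] equals the average of two unit-amplitude phasors with phases φ ± arccos(a), so a phase-only SLM can represent it by spatially interleaving the two phases. The checkerboard interleaving below is a common choice, not necessarily the paper's exact SLM encoding.

```python
import numpy as np

def double_phase_encode(field):
    """field: complex array with |field| normalized to [0, 1]."""
    a     = np.clip(np.abs(field), 0.0, 1.0)
    phi   = np.angle(field)
    theta = np.arccos(a)        # a*e^{i*phi} = 0.5*(e^{i(phi+theta)} + e^{i(phi-theta)})
    p_lo, p_hi = phi - theta, phi + theta
    checker = (np.indices(field.shape).sum(axis=0) % 2).astype(bool)
    return np.where(checker, p_hi, p_lo)          # phase-only hologram

field = np.random.rand(4, 4) * np.exp(1j * 2 * np.pi * np.random.rand(4, 4))
hologram = double_phase_encode(field)
```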

510 citations


Journal ArticleDOI
TL;DR: In this paper, a convolutional neural network is used to predict the coefficients of a locally affine model in bilateral space, which is then applied to the full-resolution image.
Abstract: Performance is a critical challenge in mobile image processing. Given a reference imaging pipeline, or even human-adjusted pairs of images, we seek to reproduce the enhancements and enable real-time evaluation. For this, we introduce a new neural network architecture inspired by bilateral grid processing and local affine color transforms. Using pairs of input/output images, we train a convolutional neural network to predict the coefficients of a locally-affine model in bilateral space. Our architecture learns to make local, global, and content-dependent decisions to approximate the desired image transformation. At runtime, the neural network consumes a low-resolution version of the input image, produces a set of affine transformations in bilateral space, upsamples those transformations in an edge-preserving fashion using a new slicing node, and then applies those upsampled transformations to the full-resolution image. Our algorithm processes high-resolution images on a smartphone in milliseconds, provides a real-time viewfinder at 1080p resolution, and matches the quality of state-of-the-art approximation techniques on a large class of image operators. Unlike previous work, our model is trained off-line from data and therefore does not require access to the original operator at runtime. This allows our model to learn complex, scene-dependent transformations for which no reference implementation is available, such as the photographic edits of a human retoucher.
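A simplified sketch of the final stage described above: a low-resolution grid of local 3x4 affine color transforms is sliced at full resolution using each pixel's position and a guide intensity, and the selected matrix is applied to the RGB value. For brevity this uses nearest-neighbor lookup and a luminance guide; the paper's slicing node interpolates in an edge-preserving way and learns the guide.

```python
import numpy as np

def apply_bilateral_affine(image, grid):
    """image: (H, W, 3) in [0, 1];  grid: (gh, gw, gd, 3, 4) affine coefficients."""
    H, W, _ = image.shape
    gh, gw, gd = grid.shape[:3]
    guide = image.mean(axis=2)                                    # toy guide map
    ys = (np.arange(H)[:, None] * gh // H).repeat(W, axis=1)
    xs = (np.arange(W)[None, :] * gw // W).repeat(H, axis=0)
    zs = np.clip((guide * gd).astype(int), 0, gd - 1)
    A = grid[ys, xs, zs]                                          # (H, W, 3, 4) per-pixel affine
    homog = np.concatenate([image, np.ones((H, W, 1))], axis=2)   # homogeneous RGB
    return np.einsum('hwij,hwj->hwi', A, homog)

out = apply_bilateral_affine(np.random.rand(64, 96, 3),
                             np.random.rand(16, 16, 8, 3, 4))
```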

510 citations


Journal ArticleDOI
TL;DR: The ClothCap approach uses a new multi-part 3D model of clothed bodies, automatically segments each piece of clothing, estimates the minimally clothed body shape and pose under the clothing, and tracks the 3D deformations of the clothing over time.
Abstract: Designing and simulating realistic clothing is challenging. Previous methods addressing the capture of clothing from 3D scans have been limited to single garments and simple motions, lack detail, or require specialized texture patterns. Here we address the problem of capturing regular clothing on fully dressed people in motion. People typically wear multiple pieces of clothing at a time. To estimate the shape of such clothing, track it over time, and render it believably, each garment must be segmented from the others and the body. Our ClothCap approach uses a new multi-part 3D model of clothed bodies, automatically segments each piece of clothing, estimates the minimally clothed body shape and pose under the clothing, and tracks the 3D deformations of the clothing over time. We estimate the garments and their motion from 4D scans; that is, high-resolution 3D scans of the subject in motion at 60 fps. ClothCap is able to capture a clothed person in motion, extract their clothing, and retarget the clothing to new body shapes; this provides a step towards virtual try-on.

441 citations


Journal ArticleDOI
TL;DR: A novel neural network architecture for encoding and synthesis of 3D shapes, particularly their structures, is introduced and it is demonstrated that without supervision, the network learns meaningful structural hierarchies adhering to perceptual grouping principles, produces compact codes which enable applications such as shape classification and partial matching, and supports shape synthesis and interpolation with significant variations in topology and geometry.
Abstract: We introduce a novel neural network architecture for encoding and synthesis of 3D shapes, particularly their structures. Our key insight is that 3D shapes are effectively characterized by their hierarchical organization of parts, which reflects fundamental intra-shape relationships such as adjacency and symmetry. We develop a recursive neural net (RvNN) based autoencoder to map a flat, unlabeled, arbitrary part layout to a compact code. The code effectively captures hierarchical structures of man-made 3D objects of varying structural complexities despite being fixed-dimensional: an associated decoder maps a code back to a full hierarchy. The learned bidirectional mapping is further tuned using an adversarial setup to yield a generative model of plausible structures, from which novel structures can be sampled. Finally, our structure synthesis framework is augmented by a second trained module that produces fine-grained part geometry, conditioned on global and local structural context, leading to a full generative pipeline for 3D shapes. We demonstrate that without supervision, our network learns meaningful structural hierarchies adhering to perceptual grouping principles, produces compact codes which enable applications such as shape classification and partial matching, and supports shape synthesis and interpolation with significant variations in topology and geometry.
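A hedged sketch of the recursive encoder half of such a network: leaf parts (here represented by 12-number box parameters, an assumption) are mapped to fixed-length codes, and a small merge network repeatedly combines two child codes into a parent code up to the root. The decoder, adversarial tuning, and geometry module are omitted.

```python
import torch
import torch.nn as nn

class MergeEncoder(nn.Module):
    def __init__(self, code=80):
        super().__init__()
        self.box_enc = nn.Sequential(nn.Linear(12, code), nn.Tanh())        # leaf part -> code
        self.merge   = nn.Sequential(nn.Linear(2 * code, code), nn.Tanh())  # (child, child) -> parent

    def encode(self, node):
        if 'box' in node:                                   # leaf
            return self.box_enc(node['box'])
        left, right = self.encode(node['left']), self.encode(node['right'])
        return self.merge(torch.cat([left, right], dim=-1))

enc  = MergeEncoder()
tree = {'left': {'box': torch.randn(12)}, 'right': {'box': torch.randn(12)}}
root_code = enc.encode(tree)                                # fixed-length shape code
```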

Journal ArticleDOI
TL;DR: A convolutional neural network is used as the learning model, and three different system architectures for modeling the HDR merge process are presented and compared; the performance of the system is demonstrated by producing high-quality HDR images from a set of three LDR images.
Abstract: Producing a high dynamic range (HDR) image from a set of images with different exposures is a challenging process for dynamic scenes. A category of existing techniques first register the input images to a reference image and then merge the aligned images into an HDR image. However, the artifacts of the registration usually appear as ghosting and tearing in the final HDR images. In this paper, we propose a learning-based approach to address this problem for dynamic scenes. We use a convolutional neural network (CNN) as our learning model and present and compare three different system architectures to model the HDR merge process. Furthermore, we create a large dataset of input LDR images and their corresponding ground truth HDR images to train our system. We demonstrate the performance of our system by producing high-quality HDR images from a set of three LDR images. Experimental results show that our method consistently produces better results than several state-of-the-art approaches on challenging scenes.

Journal ArticleDOI
Tero Karras, Timo Aila, Samuli Laine, Antti Herva, Jaakko Lehtinen
TL;DR: This work presents a machine learning technique for driving 3D facial animation by audio input in real time and with low latency, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.
Abstract: We present a machine learning technique for driving 3D facial animation by audio input in real time and with low latency. Our deep neural network learns a mapping from input waveforms to the 3D vertex coordinates of a face model, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone. During inference, the latent code can be used as an intuitive control for the emotional state of the face puppet. We train our network with 3–5 minutes of high-quality animation data obtained using traditional, vision-based performance capture methods. Even though our primary goal is to model the speaking style of a single actor, our model yields reasonable results even when driven with audio from other speakers with different gender, accent, or language, as we demonstrate with a user study. The results are applicable to in-game dialogue, low-cost localization, virtual reality avatars, and telepresence.

Journal ArticleDOI
Eric Scott Penner, Li Zhang
TL;DR: A novel algorithm for view synthesis is presented that utilizes a soft 3D reconstruction to improve quality, continuity, and robustness; this representation is shown to be beneficial throughout the view synthesis pipeline.
Abstract: We present a novel algorithm for view synthesis that utilizes a soft 3D reconstruction to improve quality, continuity and robustness. Our main contribution is the formulation of a soft 3D representation that preserves depth uncertainty through each stage of 3D reconstruction and rendering. We show that this representation is beneficial throughout the view synthesis pipeline. During view synthesis, it provides a soft model of scene geometry that provides continuity across synthesized views and robustness to depth uncertainty. During 3D reconstruction, the same robust estimates of scene visibility can be applied iteratively to improve depth estimation around object edges. Our algorithm is based entirely on O(1) filters, making it conducive to acceleration, and it works with structured or unstructured sets of input views. We compare with recent classical and learning-based algorithms on plenoptic lightfields, wide baseline captures, and lightfield videos produced from camera arrays.

Journal ArticleDOI
TL;DR: An end-to-end deep neural network is trained that directly regresses a limited field-of-view photo to HDR illumination, without strong assumptions on scene geometry, material properties, or lighting, which makes it possible to automatically recover high-quality HDR illumination estimates that significantly outperform previous state-of-the-art methods.
Abstract: We propose an automatic method to infer high dynamic range illumination from a single, limited field-of-view, low dynamic range photograph of an indoor scene. In contrast to previous work that relies on specialized image capture, user input, and/or simple scene models, we train an end-to-end deep neural network that directly regresses a limited field-of-view photo to HDR illumination, without strong assumptions on scene geometry, material properties, or lighting. We show that this can be accomplished in a three step process: 1) we train a robust lighting classifier to automatically annotate the location of light sources in a large dataset of LDR environment maps, 2) we use these annotations to train a deep neural network that predicts the location of lights in a scene from a single limited field-of-view photo, and 3) we fine-tune this network using a small dataset of HDR environment maps to predict light intensities. This allows us to automatically recover high-quality HDR illumination estimates that significantly outperform previous state-of-the-art methods. Consequently, using our illumination estimates for applications like 3D object insertion produces photo-realistic results that we validate via a perceptual user study.

Journal ArticleDOI
TL;DR: A novel supervised learning approach allows the filtering kernel to be more complex and general by leveraging a deep convolutional neural network (CNN) architecture, and a novel kernel-prediction network is introduced which uses the CNN to estimate the local weighting kernels used to compute each denoised pixel from its neighbors.
Abstract: Regression-based algorithms have been shown to be good at denoising Monte Carlo (MC) renderings by leveraging their inexpensive by-products (e.g., feature buffers). However, when using higher-order models to handle complex cases, these techniques often overfit to noise in the input. For this reason, supervised learning methods have been proposed that train on a large collection of reference examples, but they use explicit filters that limit their denoising ability. To address these problems, we propose a novel, supervised learning approach that allows the filtering kernel to be more complex and general by leveraging a deep convolutional neural network (CNN) architecture. In one embodiment of our framework, the CNN directly predicts the final denoised pixel value as a highly non-linear combination of the input features. In a second approach, we introduce a novel, kernel-prediction network which uses the CNN to estimate the local weighting kernels used to compute each denoised pixel from its neighbors. We train and evaluate our networks on production data and observe improvements over state-of-the-art MC denoisers, showing that our methods generalize well to a variety of scenes. We conclude by analyzing various components of our architecture and identify areas of further research in deep learning for MC denoising.
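The kernel-prediction variant can be summarized by how the predicted kernels are applied (a simplified NumPy sketch, not the production implementation): the network's raw per-pixel outputs are softmax-normalized into a k x k weighting kernel, and each denoised pixel is the weighted sum of its noisy neighbors.

```python
import numpy as np

def apply_predicted_kernels(noisy, raw_kernels):
    """noisy: (H, W, 3);  raw_kernels: (H, W, k*k) raw network outputs."""
    H, W, _ = noisy.shape
    k = int(np.sqrt(raw_kernels.shape[-1]))
    r = k // 2
    w = np.exp(raw_kernels - raw_kernels.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                        # per-pixel softmax weights
    padded = np.pad(noisy, ((r, r), (r, r), (0, 0)), mode='edge')
    out = np.zeros_like(noisy)
    for i, (dy, dx) in enumerate((dy, dx) for dy in range(k) for dx in range(k)):
        out += w[..., i:i + 1] * padded[dy:dy + H, dx:dx + W]  # weighted neighbor sum
    return out

denoised = apply_predicted_kernels(np.random.rand(32, 32, 3),
                                   np.random.randn(32, 32, 21 * 21))
```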

Journal ArticleDOI
TL;DR: This work proposes a variant of deep convolutional networks better suited to the class of noise present in Monte Carlo rendering, which allows for much larger pixel neighborhoods to be taken into account, while also improving execution speed by an order of magnitude.
Abstract: We describe a machine learning technique for reconstructing image sequences rendered using Monte Carlo methods. Our primary focus is on reconstruction of global illumination with extremely low sampling budgets at interactive rates. Motivated by recent advances in image restoration with deep convolutional networks, we propose a variant of these networks better suited to the class of noise present in Monte Carlo rendering. We allow for much larger pixel neighborhoods to be taken into account, while also improving execution speed by an order of magnitude. Our primary contribution is the addition of recurrent connections to the network in order to drastically improve temporal stability for sequences of sparsely sampled input images. Our method also has the desirable property of automatically modeling relationships based on auxiliary per-pixel input channels, such as depth and normals. We show significantly higher quality results compared to existing methods that run at comparable speeds, and furthermore argue a clear path for making our method run at realtime rates in the near future.

Journal ArticleDOI
TL;DR: The first deep-learning-based approach for fully automatic inference of HDR images from single LDR inputs using convolutional neural networks is proposed, which can reproduce not only natural tones without introducing visible noise but also the colors of saturated pixels.
Abstract: Inferring a high dynamic range (HDR) image from a single low dynamic range (LDR) input is an ill-posed problem where we must compensate lost data caused by under-/over-exposure and color quantization. To tackle this, we propose the first deep-learning-based approach for fully automatic inference using convolutional neural networks. Because a naive way of directly inferring a 32-bit HDR image from an 8-bit LDR image is intractable due to the difficulty of training, we take an indirect approach; the key idea of our method is to synthesize LDR images taken with different exposures (i.e., bracketed images) based on supervised learning, and then reconstruct an HDR image by merging them. By learning the relative changes of pixel values due to increased/decreased exposures using 3D deconvolutional networks, our method can reproduce not only natural tones without introducing visible noise but also the colors of saturated pixels. We demonstrate the effectiveness of our method by comparing our results not only with those of conventional methods but also with ground-truth HDR images.
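The final merging step described above follows standard exposure fusion in the linear domain; a sketch under common assumptions (display gamma of 2.2 and a hat-shaped well-exposedness weight, which may differ from the paper's exact choices):

```python
import numpy as np

def merge_brackets(ldr_images, exposure_times, gamma=2.2):
    """ldr_images: list of (H, W, 3) arrays in [0, 1]; exposure_times: matching floats."""
    num, den = 0.0, 0.0
    for ldr, t in zip(ldr_images, exposure_times):
        lin = ldr ** gamma                        # undo display gamma
        w = 1.0 - np.abs(2.0 * ldr - 1.0)         # favor well-exposed pixels
        num += w * lin / t                        # radiance estimate from this exposure
        den += w
    return num / np.maximum(den, 1e-6)

hdr = merge_brackets([np.random.rand(8, 8, 3) for _ in range(3)], [0.25, 1.0, 4.0])
```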

Journal ArticleDOI
TL;DR: A simple and effective deep learning approach to automatically generate natural looking speech animation that synchronizes to input speech and can also generate on-demand speech animation in real-time from user speech input.
Abstract: We introduce a simple and effective deep learning approach to automatically generate natural looking speech animation that synchronizes to input speech. Our approach uses a sliding window predictor that learns arbitrary nonlinear mappings from phoneme label input sequences to mouth movements in a way that accurately captures natural motion and visual coarticulation effects. Our deep learning approach enjoys several attractive properties: it runs in real-time, requires minimal parameter tuning, generalizes well to novel input speech sequences, is easily edited to create stylized and emotional speech, and is compatible with existing animation retargeting approaches. One important focus of our work is to develop an effective approach for speech animation that can be easily integrated into existing production pipelines. We provide a detailed description of our end-to-end approach, including machine learning design decisions. Generalized speech animation results are demonstrated over a wide range of animation clips on a variety of characters and voices, including singing and foreign language input. Our approach can also generate on-demand speech animation in real-time from user speech input.

Journal ArticleDOI
TL;DR: This paper presents a method for applying deep learning to sphere-type shapes using a global seamless parameterization to a planar flat-torus, for which the convolution operator is well defined and the standard deep learning framework can be readily applied for learning semantic, high-level properties of the shape.
Abstract: The recent success of convolutional neural networks (CNNs) for image processing tasks is inspiring research efforts attempting to achieve similar success for geometric tasks. One of the main challenges in applying CNNs to surfaces is defining a natural convolution operator on surfaces. In this paper we present a method for applying deep learning to sphere-type shapes using a global seamless parameterization to a planar flat-torus, for which the convolution operator is well defined. As a result, the standard deep learning framework can be readily applied for learning semantic, high-level properties of the shape. An indication of our success in bridging the gap between images and surfaces is the fact that our algorithm succeeds in learning semantic information from an input of raw low-dimensional feature vectors. We demonstrate the usefulness of our approach by presenting two applications: human body segmentation, and automatic landmark detection on anatomical surfaces. We show that our algorithm compares favorably with competing geometric deep-learning algorithms for segmentation tasks, and is able to produce meaningful correspondences on anatomical surfaces where hand-crafted features are bound to fail.
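The reason the flat-torus parameterization makes convolution well defined is that the domain wraps around, so a standard CNN layer only needs periodic padding. A minimal sketch (sizes are illustrative; the pushed-forward surface features and the seamless parameterization itself are assumed to be computed elsewhere):

```python
import torch
import torch.nn as nn

class TorusConv(nn.Module):
    """Ordinary 2D convolution with circular padding, matching the torus topology."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, padding_mode='circular')

    def forward(self, x):                          # x: (batch, c_in, H, W) on the flat torus
        return self.conv(x)

feat_on_torus = torch.randn(1, 3, 64, 64)          # surface features mapped to the flat torus
out = TorusConv(3, 16)(feat_on_torus)
```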

Journal ArticleDOI
TL;DR: A technique is presented to automatically animate a still portrait, making it possible for the subject in the photo to come to life and express various emotions; it gives rise to reactive profiles, where people in still images can automatically interact with their viewers.
Abstract: We present a technique to automatically animate a still portrait, making it possible for the subject in the photo to come to life and express various emotions. We use a driving video (of a different subject) and develop means to transfer the expressiveness of the subject in the driving video to the target portrait. In contrast to previous work that requires an input video of the target face to reenact a facial performance, our technique uses only a single target image. We animate the target image through 2D warps that imitate the facial transformations in the driving video. As warps alone do not carry the full expressiveness of the face, we add fine-scale dynamic details which are commonly associated with facial expressions such as creases and wrinkles. Furthermore, we hallucinate regions that are hidden in the input target face, most notably in the inner mouth. Our technique gives rise to reactive profiles, where people in still images can automatically interact with their viewers. We demonstrate our technique operating on numerous still portraits from the internet.

Journal ArticleDOI
TL;DR: A systematic computational framework is proposed for the design of space structures that incorporates static soundness, approximation of reference surfaces, boundary alignment, and geometric regularity and is validated by a variety of examples and comparisons.
Abstract: We study the design and optimization of statically sound and materially efficient space structures constructed by connected beams. We propose a systematic computational framework for the design of space structures that incorporates static soundness, approximation of reference surfaces, boundary alignment, and geometric regularity. To tackle this challenging problem, we first jointly optimize node positions and connectivity through a nonlinear continuous optimization algorithm. Next, with fixed nodes and connectivity, we formulate the assignment of beam cross sections as a mixed-integer programming problem with a bilinear objective function and quadratic constraints. We solve this problem with a novel and practical alternating direction method based on linear programming relaxation. The capability and efficiency of the algorithms and the computational framework are validated by a variety of examples and comparisons.

Journal ArticleDOI
TL;DR: The online nature of the method enables incorporation of feedback into the planning and control loop and makes the algorithm robust to disturbances; the method is also extended to coordinate multiple drones for dynamic multi-view shots, typical of action sequences and live TV coverage.
Abstract: We propose a method for automated aerial videography in dynamic and cluttered environments. An online receding horizon optimization formulation facilitates the planning process for novices and experts alike. The algorithm takes high-level plans as input, which we dub virtual rails, alongside interactively defined aesthetic framing objectives and jointly solves for 3D quadcopter motion plans and associated velocities. The method generates control inputs subject to constraints of a non-linear quadrotor model and dynamic constraints imposed by actors moving in an a priori unknown way. The output plans are physically feasible for the horizon length, and we apply the resulting control inputs directly at each time-step, without requiring a separate trajectory tracking algorithm. The online nature of the method enables incorporation of feedback into the planning and control loop and makes the algorithm robust to disturbances. Furthermore, we extend the method to include coordination between multiple drones to enable dynamic multi-view shots, typical for action sequences and live TV coverage. The algorithm runs in real-time on standard hardware and computes motion plans for several drones on the order of milliseconds. Finally, we evaluate the approach qualitatively with a number of challenging shots, involving multiple drones and actors, and characterize the computational performance experimentally.

Journal ArticleDOI
TL;DR: This work proposes a novel single-view hair generation pipeline, based on 3D-model and texture retrieval, shape refinement, and polystrip patching optimization, and demonstrates the flexibility of polystrips in handling hairstyle variations, as opposed to conventional strand-based representations.
Abstract: We present a fully automatic framework that digitizes a complete 3D head with hair from a single unconstrained image. Our system offers a practical and consumer-friendly end-to-end solution for avatar personalization in gaming and social VR applications. The reconstructed models include secondary components (eyes, teeth, tongue, and gums) and provide animation-friendly blendshapes and joint-based rigs. While the generated face is a high-quality textured mesh, we propose a versatile and efficient polygonal strips (polystrips) representation for the hair. Polystrips are suitable for an extremely wide range of hairstyles and textures and are compatible with existing game engines for real-time rendering. In addition to integrating state-of-the-art advances in facial shape modeling and appearance inference, we propose a novel single-view hair generation pipeline, based on 3D-model and texture retrieval, shape refinement, and polystrip patching optimization. The performance of our hairstyle retrieval is enhanced using a deep convolutional neural network for semantic hair attribute classification. Our generated models are visually comparable to state-of-the-art game characters designed by professional artists. For real-time settings, we demonstrate the flexibility of polystrips in handling hairstyle variations, as opposed to conventional strand-based representations. We further show the effectiveness of our approach on a large number of images taken in the wild, and how compelling avatars can be easily created by anyone.

Journal ArticleDOI
Kaiwen Guo, Feng Xu, Tao Yu, Xiaoyang Liu, Qionghai Dai, Yebin Liu
TL;DR: A real-time method that uses a single-view RGB-D input (a depth sensor integrated with a color camera) to simultaneously reconstruct a casual scene with a detailed geometry model, surface albedo, per- frame non-rigid motion, and per-frame low-frequency lighting, without requiring any template or motion priors is proposed.
Abstract: This article proposes a real-time method that uses a single-view RGB-D input (a depth sensor integrated with a color camera) to simultaneously reconstruct a casual scene with a detailed geometry model, surface albedo, per-frame non-rigid motion, and per-frame low-frequency lighting, without requiring any template or motion priors. The key observation is that accurate scene motion can be used to integrate temporal information to recover the precise appearance, whereas the intrinsic appearance can help to establish true correspondence in the temporal domain to recover motion. Based on this observation, we first propose a shading-based scheme to leverage appearance information for motion estimation. Then, using the reconstructed motion, a volumetric albedo fusing scheme is proposed to complete and refine the intrinsic appearance of the scene by incorporating information from multiple frames. Since the two schemes are iteratively applied during recording, the reconstructed appearance and motion become increasingly more accurate. In addition to the reconstruction results, our experiments also show that additional applications can be achieved, such as relighting, albedo editing, and free-viewpoint rendering of a dynamic scene, since geometry, appearance, and motion are all reconstructed by our technique.

Journal ArticleDOI
TL;DR: A new local descriptor for 3D shapes is presented, directly applicable to a wide range of shape analysis problems such as point correspondences, semantic segmentation, affordance prediction, and shape-to-scan matching; the descriptor is produced by a convolutional network trained to embed geometrically and semantically similar points close to one another in descriptor space.
Abstract: We present a new local descriptor for 3D shapes, directly applicable to a wide range of shape analysis problems such as point correspondences, semantic segmentation, affordance prediction, and shape-to-scan matching. The descriptor is produced by a convolutional network that is trained to embed geometrically and semantically similar points close to one another in descriptor space. The network processes surface neighborhoods around points on a shape that are captured at multiple scales by a succession of progressively zoomed-out views, taken from carefully selected camera positions. We leverage two extremely large sources of data to train our network. First, since our network processes rendered views in the form of 2D images, we repurpose architectures pretrained on massive image datasets. Second, we automatically generate a synthetic dense point correspondence dataset by nonrigid alignment of corresponding shape parts in a large collection of segmented 3D models. As a result of these design choices, our network effectively encodes multiscale local context and fine-grained surface detail. Our network can be trained to produce either category-specific descriptors or more generic descriptors by learning from multiple shape categories. Once trained, at test time, the network extracts local descriptors for shapes without requiring any part segmentation as input. Our method can produce effective local descriptors even for shapes whose category is unknown or different from the ones used while training. We demonstrate through several experiments that our learned local descriptors are more discriminative compared to state-of-the-art alternatives and are effective in a variety of shape analysis applications.

Journal ArticleDOI
TL;DR: It is shown that Projective Dynamics can be interpreted as a quasi-Newton method, which enables very efficient simulation of a large class of hyperelastic materials, including the Neo-Hookean, spline-based materials, and others.
Abstract: We present a new method for real-time physics-based simulation supporting many different types of hyperelastic materials. Previous methods such as Position-Based or Projective Dynamics are fast but support only a limited selection of materials; even classical materials such as the Neo-Hookean elasticity are not supported. Recently, Xu et al. [2015] introduced new “spline-based materials” that can be easily controlled by artists to achieve desired animation effects. Simulation of these types of materials currently relies on Newton’s method, which is slow, even with only one iteration per timestep. In this article, we show that Projective Dynamics can be interpreted as a quasi-Newton method. This insight enables very efficient simulation of a large class of hyperelastic materials, including the Neo-Hookean, spline-based materials, and others. The quasi-Newton interpretation also allows us to leverage ideas from numerical optimization. In particular, we show that our solver can be further accelerated using L-BFGS updates (Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm). Our final method is typically more than 10 times faster than one iteration of Newton’s method without compromising quality. In fact, our result is often more accurate than the result obtained with one iteration of Newton’s method. Our method is also easier to implement, implying reduced software development costs.
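The quasi-Newton reading can be sketched as follows: the constant Projective Dynamics system matrix plays the role of the initial Hessian approximation inside a standard L-BFGS two-loop recursion, with recent position/gradient differences refining the search direction. `solve_pd` below is a stand-in for a prefactorized solve with that constant matrix; the toy diagonal matrix is only for illustration.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list, solve_pd):
    """grad: current gradient; s_list/y_list: recent position/gradient differences (oldest first)."""
    q, alphas = grad.copy(), []
    for s, y in zip(reversed(s_list), reversed(y_list)):     # first loop: newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append((rho, a))
    r = solve_pd(q)                                          # initial Hessian = constant PD matrix
    for (rho, a), s, y in zip(reversed(alphas), s_list, y_list):  # second loop: oldest to newest
        b = rho * (y @ r)
        r += (a - b) * s
    return -r                                                # quasi-Newton descent direction

H = np.diag(np.linspace(1.0, 3.0, 5))                        # toy stand-in for the PD matrix
d = lbfgs_direction(np.ones(5), [0.1 * np.ones(5)], [0.2 * np.ones(5)],
                    solve_pd=lambda g: np.linalg.solve(H, g))
```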

Journal ArticleDOI
TL;DR: In this article, a convolutional neural network (CNN) based solution for modeling physically plausible spatially varying surface reflectance functions (SVBRDF) from a single photograph of a planar material sample under unknown natural illumination is presented.
Abstract: We present a convolutional neural network (CNN) based solution for modeling physically plausible spatially varying surface reflectance functions (SVBRDF) from a single photograph of a planar material sample under unknown natural illumination. Gathering a sufficiently large set of labeled training pairs consisting of photographs of SVBRDF samples and corresponding reflectance parameters is a difficult and arduous process. To reduce the amount of required labeled training data, we propose to leverage the appearance information embedded in unlabeled images of spatially varying materials to self-augment the training process. Starting from an initial approximative network obtained from a small set of labeled training pairs, we estimate provisional model parameters for each unlabeled training exemplar. Given this provisional reflectance estimate, we then synthesize a novel temporary labeled training pair by rendering the exact corresponding image under a new lighting condition. After refining the network using these additional training samples, we re-estimate the provisional model parameters for the unlabeled data and repeat the self-augmentation process until convergence. We demonstrate the efficacy of the proposed network structure on spatially varying wood, metals, and plastics, as well as thoroughly validate the effectiveness of the self-augmentation training process.
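The self-augmentation loop has a simple structure, shown schematically below; `train`, `predict`, and `render` are trivial stand-ins (hypothetical helpers, not the paper's code), and only the loop structure is the point: estimate provisional SVBRDF parameters for unlabeled photos, re-render them under new lighting to obtain exactly labeled synthetic pairs, and retrain.

```python
import numpy as np

def train(net, pairs):         return net + 0.1 * len(pairs)      # stub "training"
def predict(net, photo):       return np.full(4, net)             # stub SVBRDF estimate
def render(svbrdf, lighting):  return svbrdf.mean() * lighting    # stub renderer

def self_augment(net, labeled, unlabeled, rounds=3):
    net = train(net, labeled)                           # initial approximative network
    for _ in range(rounds):
        synthetic = []
        for photo in unlabeled:
            params = predict(net, photo)                # provisional reflectance estimate
            relit  = render(params, np.random.rand())   # exact label under a new lighting
            synthetic.append((relit, params))
        net = train(net, labeled + synthetic)           # refine with the augmented set
    return net

net = self_augment(0.0, labeled=[(None, None)], unlabeled=[None, None])
```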

Journal ArticleDOI
TL;DR: This work designs an anisotropic hyperelastic constitutive model that separately characterizes the response to manifold strain as well as shearing and compression in the directions orthogonal to the manifold, and proposes a novel hybrid Lagrangian/Eulerian approach that preserves the best aspects of both views.
Abstract: The typical elastic surface or curve simulation method takes a Lagrangian approach and consists of three components: time integration, collision detection and collision response. The Lagrangian view is beneficial because it naturally allows for tracking of the codimensional manifold, however collision must then be detected and resolved separately. Eulerian methods are promising alternatives because collision processing is automatic and while this is effective for volumetric objects, advection of a codimensional manifold is too inaccurate in practice. We propose a novel hybrid Lagrangian/Eulerian approach that preserves the best aspects of both views. Similar to the Drucker-Prager and Mohr-Coulomb models for granular materials, we define our collision response with a novel elastoplastic constitutive model. To achieve this, we design an anisotropic hyperelastic constitutive model that separately characterizes the response to manifold strain as well as shearing and compression in the directions orthogonal to the manifold. We discretize the model with the Material Point Method and a novel codimensional Lagrangian/Eulerian update of the deformation gradient. Collision intensive scenarios with millions of degrees of freedom require only a few minutes per frame and examples with up to one million degrees of freedom run in less than thirty seconds per frame.

Journal ArticleDOI
TL;DR: This work presents a scalable approach for the optimization of flip-preventing energies in the general context of simplicial mappings and specifically for mesh parameterization and shows that the algorithm can be applied to mesh deformation and mesh quality improvement.
Abstract: We present a scalable approach for the optimization of flip-preventing energies in the general context of simplicial mappings and specifically for mesh parameterization. Our iterative minimization is based on the observation that many distortion energies can be optimized indirectly by minimizing a family of simpler proxy energies. Minimization of these proxies is a natural extension of the local/global minimization of the ARAP energy. Our algorithm is simple to implement and scales to datasets with millions of faces. We demonstrate our approach for the computation of maps that minimize a conformal or isometric distortion energy, both in two and three dimensions. In addition to mesh parameterization, we show that our algorithm can be applied to mesh deformation and mesh quality improvement.
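The local/global structure referred to above can be sketched abstractly: a fixed linear operator G maps vertex coordinates to per-element 2x2 Jacobians, the local step projects each Jacobian onto the closest rotation (via SVD), and the global step solves a least-squares problem with the fixed operator. The random G and the plain ARAP-style projection below are illustrative assumptions; the paper's proxy energies generalize this projection, and a real implementation assembles G from the mesh and prefactorizes the global system once.

```python
import numpy as np

def closest_rotation(J):
    U, _, Vt = np.linalg.svd(J)
    R = U @ Vt
    if np.linalg.det(R) < 0:                  # avoid reflections
        U[:, -1] *= -1
        R = U @ Vt
    return R

def local_global(G, x, iters=10):
    """G: (4*m, n) maps flat 2D vertex coords to m flattened 2x2 Jacobians."""
    m = G.shape[0] // 4
    for _ in range(iters):
        jac = (G @ x).reshape(m, 2, 2)
        targets = np.stack([closest_rotation(J) for J in jac])          # local step
        x, *_ = np.linalg.lstsq(G, targets.reshape(-1), rcond=None)     # global step
    return x

G = np.random.randn(8, 6)                     # 2 toy elements, 3 "vertices" in 2D
x = local_global(G, np.random.randn(6))
```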