
Showing papers by "Jimei Yang" published in 2020


Book ChapterDOI
23 Aug 2020
TL;DR: A deep generative model is introduced that outputs not only an inpainting result but also a corresponding confidence map; using this map as feedback, it progressively fills the hole by trusting only high-confidence pixels at each iteration and focusing on the remaining pixels in the next iteration.
Abstract: Existing image inpainting methods often produce artifacts when dealing with large holes in real applications. To address this challenge, we propose an iterative inpainting method with a feedback mechanism. Specifically, we introduce a deep generative model which not only outputs an inpainting result but also a corresponding confidence map. Using this map as feedback, it progressively fills the hole by trusting only high-confidence pixels inside the hole at each iteration and focuses on the remaining pixels in the next iteration. As it reuses partial predictions from the previous iterations as known pixels, this process gradually improves the result. In addition, we propose a guided upsampling network to enable generation of high-resolution inpainting results. We achieve this by extending the Contextual Attention module [39] to borrow high-resolution feature patches in the input image. Furthermore, to mimic real object removal scenarios, we collect a large object mask dataset and synthesize more realistic training data that better simulates user inputs. Experiments show that our method significantly outperforms existing methods in both quantitative and qualitative evaluations. More results and Web APP are available at https://zengxianyu.github.io/iic.

125 citations
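
For illustration, here is a minimal Python sketch of the confidence-feedback loop this abstract describes; inpaint_step is a hypothetical stand-in for the paper's generative model (it simply mean-fills the hole and reports a flat confidence), and the iteration count and threshold are assumed values.

import numpy as np

def inpaint_step(image, mask):
    # Placeholder "model": mean-fill the hole and report a flat confidence there.
    filled = image.copy()
    filled[mask] = image[~mask].mean(axis=0)
    confidence = np.where(mask, 0.6, 1.0)
    return filled, confidence

def iterative_inpaint(image, mask, iterations=4, threshold=0.5):
    # Progressively trust only high-confidence pixels inside the hole; pixels
    # accepted in one pass are treated as known pixels in the next pass.
    result, remaining = image.copy(), mask.copy()
    for step in range(iterations):
        if not remaining.any():
            break
        filled, confidence = inpaint_step(result, remaining)
        last_pass = step == iterations - 1
        trusted = remaining & ((confidence >= threshold) | last_pass)
        result[trusted] = filled[trusted]
        remaining = remaining & ~trusted
    return result

image = np.random.rand(64, 64, 3)
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True                 # the hole to be filled
completed = iterative_inpaint(image, mask)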


Book ChapterDOI
23 Aug 2020
TL;DR: A physics-based method for inferring 3D human motion from video sequences that takes initial 2D and 3D pose estimates as input and produces motions that are significantly more realistic than those from purely kinematic methods, substantially improving quantitative measures of both kinematic and dynamic plausibility.
Abstract: Existing deep models predict 2D and 3D kinematic poses from video that are approximately accurate, but contain visible errors that violate physical constraints, such as feet penetrating the ground and bodies leaning at extreme angles. In this paper, we present a physics-based method for inferring 3D human motion from video sequences that takes initial 2D and 3D pose estimates as input. We first estimate ground contact timings with a novel prediction network which is trained without hand-labeled data. A physics-based trajectory optimization then solves for a physically-plausible motion, based on the inputs. We show this process produces motions that are significantly more realistic than those from purely kinematic methods, substantially improving quantitative measures of both kinematic and dynamic plausibility. We demonstrate our method on character animation and pose estimation tasks on dynamic motions of dancing and sports with complex contact patterns.

60 citations
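
As a toy illustration of the kind of contact-aware trajectory optimization the abstract describes (not the paper's actual formulation), the sketch below refines a noisy 1-D foot-height trajectory: it stays close to the kinematic estimate, pins contact frames to the ground, and penalizes acceleration. The contact labels, weights, and 1-D setup are assumptions; in the paper, contacts come from a learned prediction network.

import numpy as np

def refine_foot_height(z, contact, w_data=1.0, w_contact=10.0, w_smooth=4.0):
    # Stack weighted least-squares residuals and solve for the refined trajectory.
    T = len(z)
    rows, rhs = [], []
    for t in range(T):                       # data term: stay near the kinematic estimate
        e = np.zeros(T); e[t] = w_data
        rows.append(e); rhs.append(w_data * z[t])
    for t in np.where(contact)[0]:           # contact term: foot height -> 0
        e = np.zeros(T); e[t] = w_contact
        rows.append(e); rhs.append(0.0)
    for t in range(1, T - 1):                # smoothness term: small acceleration
        e = np.zeros(T); e[[t - 1, t, t + 1]] = w_smooth * np.array([1.0, -2.0, 1.0])
        rows.append(e); rhs.append(0.0)
    A, b = np.vstack(rows), np.array(rhs)
    return np.linalg.lstsq(A, b, rcond=None)[0]

kinematic = 0.02 * np.random.randn(60) + np.r_[np.zeros(30), 0.2 * np.ones(30)]
contact = np.r_[np.ones(30, dtype=bool), np.zeros(30, dtype=bool)]
refined = refine_foot_height(kinematic, contact)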


Posted Content
TL;DR: Attribute-conditioned Layout GAN is introduced to incorporate the attributes of design elements for graphic layout generation by forcing both the generator and the discriminator to meet attribute conditions.
Abstract: Modeling layout is an important first step for graphic design. Recently, methods for generating graphic layouts have progressed, particularly with Generative Adversarial Networks (GANs). However, the problem of specifying the locations and sizes of design elements usually involves constraints with respect to element attributes, such as area, aspect ratio and reading-order. Automating attribute conditional graphic layouts remains a complex and unsolved problem. In this paper, we introduce Attribute-conditioned Layout GAN to incorporate the attributes of design elements for graphic layout generation by forcing both the generator and the discriminator to meet attribute conditions. Due to the complexity of graphic designs, we further propose an element dropout method to make the discriminator look at partial lists of elements and learn their local patterns. In addition, we introduce various loss designs following different design principles for layout optimization. We demonstrate that the proposed method can synthesize graphic layouts conditioned on different element attributes. It can also adjust well-designed layouts to new sizes while retaining elements' original reading-orders. The effectiveness of our method is validated through a user study.

30 citations
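
A minimal PyTorch sketch of the attribute-conditioned generator/discriminator pattern and the element-dropout idea described above; the element count, attribute encoding, and layer sizes are illustrative assumptions rather than the paper's architecture.

import torch
import torch.nn as nn

N_ELEM, ATTR_DIM, NOISE_DIM = 8, 4, 16     # elements per layout, attributes per element

class LayoutGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + ATTR_DIM, 64), nn.ReLU(),
            nn.Linear(64, 4), nn.Sigmoid())            # (x, y, w, h) per element
    def forward(self, noise, attrs):                    # (B, N, NOISE_DIM), (B, N, ATTR_DIM)
        return self.net(torch.cat([noise, attrs], dim=-1))

class LayoutDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.elem = nn.Sequential(nn.Linear(4 + ATTR_DIM, 64), nn.ReLU())
        self.head = nn.Linear(64, 1)
    def forward(self, boxes, attrs, keep_prob=0.7):
        h = self.elem(torch.cat([boxes, attrs], dim=-1))            # per-element features
        keep = (torch.rand(h.shape[:2], device=h.device) < keep_prob).float()
        h = h * keep.unsqueeze(-1)                                  # element dropout
        return self.head(h.sum(1) / keep.sum(1, keepdim=True).clamp(min=1))

G, D = LayoutGenerator(), LayoutDiscriminator()
attrs = torch.rand(2, N_ELEM, ATTR_DIM)                 # e.g. class, area, aspect ratio, order
boxes = G(torch.randn(2, N_ELEM, NOISE_DIM), attrs)
realism = D(boxes, attrs)                               # both G and D see the attribute conditions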


Proceedings ArticleDOI
01 Mar 2020
TL;DR: A neural network based detector for localizing ground contact events of human feet is presented and used to impose a physical constraint for optimization of the whole human dynamics in a video.
Abstract: In this paper, we aim to reduce the footskate artifacts when reconstructing human dynamics from monocular RGB videos. Recent work has made substantial progress in improving the temporal smoothness of the reconstructed motion trajectories. Their results, however, still suffer from severe foot skating and slippage artifacts. To tackle this issue, we present a neural network based detector for localizing ground contact events of human feet and use it to impose a physical constraint for optimization of the whole human dynamics in a video. We present a detailed study on the proposed ground contact detector and demonstrate high-quality human motion reconstruction results in various videos.

23 citations
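
In the spirit of the ground contact detector described above, the sketch below is a small temporal CNN that maps a window of 2D keypoints to per-frame, per-foot contact probabilities; the joint count, clip length, and architecture are assumptions, not the paper's network.

import torch
import torch.nn as nn

N_JOINTS, N_FEET = 17, 2                      # e.g. COCO-style 2D keypoints, two feet

class ContactDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(N_JOINTS * 2, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, N_FEET, kernel_size=1))
    def forward(self, keypoints):             # (batch, frames, joints, 2), normalized coordinates
        b, t, j, c = keypoints.shape
        x = keypoints.reshape(b, t, j * c).transpose(1, 2)    # (B, J*2, T)
        return torch.sigmoid(self.net(x)).transpose(1, 2)     # (B, T, N_FEET)

detector = ContactDetector()
probs = detector(torch.rand(1, 120, N_JOINTS, 2))             # a 120-frame clip
contacts = probs > 0.5                        # binary labels used as optimization constraints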


Posted Content
TL;DR: This work introduces a biomechanically constrained generative adversarial network that performs long-term inbetweening of human motions, conditioned on keyframe constraints.
Abstract: The ability to generate complex and realistic human body animations at scale, while following specific artistic constraints, has been a fundamental goal for the game and animation industry for decades. Popular techniques include key-framing, physics-based simulation, and database methods via motion graphs. Recently, motion generators based on deep learning have been introduced. Although these learning models can automatically generate highly intricate stylized motions of arbitrary length, they still lack user control. To this end, we introduce the problem of long-term inbetweening, which involves automatically synthesizing complex motions over a long time interval given very sparse keyframes by users. We identify a number of challenges related to this problem, including maintaining biomechanical and keyframe constraints, preserving natural motions, and designing the entire motion sequence holistically while considering all constraints. We introduce a biomechanically constrained generative adversarial network that performs long-term inbetweening of human motions, conditioned on keyframe constraints. This network uses a novel two-stage approach where it first predicts local motion in the form of joint angles, and then predicts global motion, i.e. the global path that the character follows. Since there are typically a number of possible motions that could satisfy the given user constraints, we also enable our network to generate a variety of outputs with a scheme that we call Motion DNA. This approach allows the user to manipulate and influence the output content by feeding seed motions (DNA) to the network. Trained with 79 classes of captured motion data, our network performs robustly on a variety of highly complex motion styles.

17 citations
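
A schematic PyTorch sketch of the two-stage idea described above, where local joint angles are predicted first and the global root path second, with a "Motion DNA" seed modeled as a single conditioning vector; the GRU backbone and all sizes are assumptions for illustration.

import torch
import torch.nn as nn

N_JOINTS, DNA_DIM, HID = 24, 32, 128

class TwoStageInbetweener(nn.Module):
    def __init__(self):
        super().__init__()
        self.local = nn.GRU(N_JOINTS * 3 + DNA_DIM, HID, batch_first=True)
        self.local_out = nn.Linear(HID, N_JOINTS * 3)    # stage 1: per-frame joint angles
        self.globl = nn.GRU(N_JOINTS * 3, HID, batch_first=True)
        self.globl_out = nn.Linear(HID, 3)               # stage 2: per-frame root translation
    def forward(self, keyframe_track, dna):
        # keyframe_track: (B, T, N_JOINTS*3), a naive interpolation of the sparse user keyframes
        # dna: (B, DNA_DIM), a seed code that steers the style of the filled-in motion
        dna_t = dna.unsqueeze(1).expand(-1, keyframe_track.size(1), -1)
        h, _ = self.local(torch.cat([keyframe_track, dna_t], dim=-1))
        angles = self.local_out(h)
        g, _ = self.globl(angles)
        return angles, self.globl_out(g)

net = TwoStageInbetweener()
angles, root_path = net(torch.randn(1, 240, N_JOINTS * 3), torch.randn(1, DNA_DIM))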


Book ChapterDOI
23 Aug 2020
TL;DR: A review of the AIM 2020 Extreme Image Inpainting Challenge, which covered two tracks, classical image inpainting and semantically guided image inpainting, with the goal of inpainting large parts of the image.
Abstract: This paper reviews the AIM 2020 challenge on extreme image inpainting. This report focuses on proposed solutions and results for two different tracks on extreme image inpainting: classical image inpainting and semantically guided image inpainting. The goal of track 1 is to inpaint a large part of the image with no supervision. Similarly, the goal of track 2 is to inpaint the image by having access to the entire semantic segmentation map of the input. The challenge had 88 and 74 participants, respectively. 11 and 6 teams competed in the final phase of the challenge, respectively. This report gauges current solutions and sets a benchmark for future extreme image inpainting methods.

15 citations


Journal ArticleDOI
06 Oct 2020
TL;DR: This paper proposes a motion synthesis technique that can rapidly generate animated motion for characters engaged in two-party conversations, synthesizing gestures and other body motions that synchronize with novel input audio clips.
Abstract: Plausible conversations among characters are required to generate the ambiance of social settings such as a restaurant, hotel lobby, or cocktail party. In this paper, we propose a motion synthesis technique that can rapidly generate animated motion for characters engaged in two-party conversations. Our system synthesizes gestures and other body motions for dyadic conversations that synchronize with novel input audio clips. Human conversations feature many different forms of coordination and synchronization. For example, speakers use hand gestures to emphasize important points, and listeners often nod in agreement or acknowledgment. To achieve the desired degree of realism, our method first constructs a motion graph that preserves the statistics of a database of recorded conversations performed by a pair of actors. This graph is then used to search for a motion sequence that respects three forms of audio-motion coordination in human conversations: coordination to phonemic clause, listener response, and partner's hesitation pause. We assess the quality of the generated animations through a user study that compares them to the originally recorded motion and evaluate the effects of each type of audio-motion coordination via ablation studies.

12 citations
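
As a toy illustration of searching a motion graph under audio-motion coordination (a much simplified stand-in for the paper's method), the sketch below walks a tiny hand-made graph and prefers gesture clips near stressed phonemic-clause times for the speaker and nods for the listener; the graph, labels, and timings are made up.

import random

motion_graph = {
    "idle_a":    {"label": "idle",    "next": ["idle_b", "gesture_a", "nod_a"]},
    "idle_b":    {"label": "idle",    "next": ["idle_a", "gesture_b", "nod_a"]},
    "gesture_a": {"label": "gesture", "next": ["idle_a", "idle_b"]},
    "gesture_b": {"label": "gesture", "next": ["idle_a", "idle_b"]},
    "nod_a":     {"label": "nod",     "next": ["idle_a", "idle_b"]},
}

def synthesize(role, clause_times, n_clips, clip_len=1.0, start="idle_a"):
    # Greedy walk: pick a clip whose label matches what the audio asks for at that time.
    wanted = "gesture" if role == "speaker" else "nod"
    sequence, current, t = [start], start, 0.0
    for _ in range(n_clips - 1):
        t += clip_len
        near_clause = any(abs(t - c) < clip_len for c in clause_times)
        options = motion_graph[current]["next"]
        preferred = [n for n in options if motion_graph[n]["label"] == wanted]
        current = random.choice(preferred if near_clause and preferred else options)
        sequence.append(current)
    return sequence

clauses = [2.0, 5.0, 7.5]                                     # stressed phonemic-clause times (s)
speaker_motion = synthesize("speaker", clauses, n_clips=10)
listener_motion = synthesize("listener", [c + 0.5 for c in clauses], n_clips=10)  # delayed nods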


Journal ArticleDOI
TL;DR: A system that synthesizes realistic, simulation-ready liquid splashes from simple user sketch input, using a conditional generative adversarial network trained with physics-based simulation data followed by model refinement.
Abstract: Splashing is one of the most fascinating liquid phenomena in the real world and it is favored by artists to create stunning visual effects, both statically and dynamically. Unfortunately, the generation of complex and specialized liquid splashes is a challenging task and often requires considerable time and effort. In this paper, we present a novel system that synthesizes realistic liquid splashes from simple user sketch input. Our system adopts a conditional generative adversarial network (cGAN) trained with physics-based simulation data to produce raw liquid splash models from input sketches, and then applies model refinement processes to further improve their small-scale details. The system considers not only the trajectory of every user stroke, but also its speed, which makes the splash model simulation-ready with its underlying 3D flow. Compared with simulation-based modeling techniques through trials and errors, our system offers flexibility, convenience and intuition in liquid splash design and editing. We evaluate the usability and the efficiency of our system in an immersive virtual reality environment. Thanks to this system, an amateur user can now generate a variety of realistic liquid splashes in just a few minutes.

7 citations
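
For illustration, a small sketch of how a user stroke might be encoded before conditioning a generator of this kind: the stroke trajectory fills one channel of a grid and its per-point speed another, so both shape and speed inform the output. The grid size and encoding are assumptions, not the system's actual input representation.

import numpy as np

def encode_stroke(points, speeds, size=64):
    # points: (N, 2) positions in [0, 1]^2; speeds: (N,) stroke speed at each point.
    grid = np.zeros((2, size, size), dtype=np.float32)
    for (x, y), s in zip(points, speeds):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[0, row, col] = 1.0                          # trajectory occupancy channel
        grid[1, row, col] = max(grid[1, row, col], s)    # speed channel
    return grid

t = np.linspace(0.0, 1.0, 200)
points = np.stack([t, 0.5 + 0.3 * np.sin(4 * np.pi * t)], axis=1)   # a wavy stroke
speeds = np.linalg.norm(np.gradient(points, axis=0), axis=1) * len(t)
condition = encode_stroke(points, speeds)    # would be fed to the conditional generator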


Patent
Zhe Lin, Xin Lu, Xiaohui Shen, Jimei Yang, Jiahui Yu
25 Aug 2020
TL;DR: Digital image completion by jointly learning generation and patch matching: a digital image having at least one hole is provided as input to an image completer built on a dual-stage framework that combines a coarse image neural network and an image refinement network.
Abstract: Digital image completion by learning generation and patch matching jointly is described. Initially, a digital image having at least one hole is received. This holey digital image is provided as input to an image completer formed with a dual-stage framework that combines a coarse image neural network and an image refinement network. The coarse image neural network generates a coarse prediction of imagery for filling the holes of the holey digital image. The image refinement network receives the coarse prediction as input, refines the coarse prediction, and outputs a filled digital image having refined imagery that fills these holes. The image refinement network generates refined imagery using a patch matching technique, which includes leveraging information corresponding to patches of known pixels for filtering patches generated based on the coarse prediction. Based on this, the image completer outputs the filled digital image with the refined imagery.

6 citations
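
A naive numpy version of the patch-matching idea in the refinement stage: each patch centered inside the hole of the coarse prediction is compared against patches from the known region and borrows the value of its best match. This is only a sketch of the principle, not the patented network; the patch size and similarity measure are assumptions.

import numpy as np

def match_patches(coarse, mask, patch=3):
    # coarse: (H, W) grayscale coarse prediction; mask: (H, W) bool, True inside the hole.
    h, w = coarse.shape
    r = patch // 2
    known, holes = [], []
    for y in range(r, h - r):
        for x in range(r, w - r):
            win = mask[y - r:y + r + 1, x - r:x + r + 1]
            (holes if win.any() else known).append((y, x))
    known_feats = np.stack([coarse[y - r:y + r + 1, x - r:x + r + 1].ravel() for y, x in known])
    out = coarse.copy()
    for y, x in holes:
        q = coarse[y - r:y + r + 1, x - r:x + r + 1].ravel()
        sims = known_feats @ q / (np.linalg.norm(known_feats, axis=1) * np.linalg.norm(q) + 1e-8)
        by, bx = known[int(sims.argmax())]
        out[y, x] = coarse[by, bx]          # borrow the center of the best-matching known patch
    return out

coarse = np.random.rand(32, 32)              # stand-in for the coarse network's prediction
mask = np.zeros((32, 32), dtype=bool)
mask[12:20, 12:20] = True
refined = match_patches(coarse, mask)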


Patent
Zhe Lin, Xin Lu, Xiaohui Shen, Jimei Yang, Jiahui Yu
02 Jun 2020
TL;DR: An offset prediction neural network predicts patch displacement maps whose offset vectors represent a displacement of pixels of the digital image to different locations for performing an image editing operation; the pixel values of the affected pixels are then set according to this mapping.
Abstract: Predicting patch displacement maps using a neural network is described. Initially, a digital image on which an image editing operation is to be performed is provided as input to a patch matcher having an offset prediction neural network. From this image and based on the image editing operation for which this network is trained, the offset prediction neural network generates an offset prediction formed as a displacement map, which has offset vectors that represent a displacement of pixels of the digital image to different locations for performing the image editing operation. Pixel values of the digital image are copied to the image pixels affected by the operation by: determining the offset vectors that correspond to the image pixels affected by the image editing operation and mapping the pixel values of the image pixels represented by the determined offset vectors to the affected pixels. According to this mapping, the pixel values of the affected pixels are set, effective to perform the image editing operation.

2 citations
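
The core of applying a predicted displacement map can be illustrated in a few lines: each affected pixel's offset vector says where in the image its value is copied from. The offsets below are hand-made; in the patent they come from the trained offset prediction network.

import numpy as np

def apply_displacement(image, offsets, affected):
    # image: (H, W, 3); offsets: (H, W, 2) integer (dy, dx); affected: (H, W) bool.
    h, w = affected.shape
    out = image.copy()
    ys, xs = np.nonzero(affected)
    src_y = np.clip(ys + offsets[ys, xs, 0], 0, h - 1)
    src_x = np.clip(xs + offsets[ys, xs, 1], 0, w - 1)
    out[ys, xs] = image[src_y, src_x]        # map source pixel values onto the affected pixels
    return out

image = np.random.rand(48, 48, 3)
affected = np.zeros((48, 48), dtype=bool)
affected[10:20, 10:20] = True                # pixels touched by the editing operation
offsets = np.zeros((48, 48, 2), dtype=int)
offsets[affected] = [0, 25]                  # toy offsets: copy from 25 pixels to the right
edited = apply_displacement(image, offsets, affected)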


Patent
Chen Fang, Zhe Lin, Zhaowen Wang, Zhang Yulun, Yilin Wang, Jimei Yang
13 Aug 2020
TL;DR: In this paper, a style of a digital image is transferred to another digital image of arbitrary resolution using a patch-by-patch style transfer process at several increasing resolutions, or scale levels, of both content and style images.
Abstract: A style of a digital image is transferred to another digital image of arbitrary resolution. A high-resolution (HR) content image is segmented into several low-resolution (LR) patches. The resolution of a style image is matched to have the same resolution as the LR content image patches. Style transfer is then performed on a patch-by-patch basis using, for example, a pair of feature transforms—whitening and coloring. The patch-by-patch style transfer process is then repeated at several increasing resolutions, or scale levels, of both the content and style images. The results of the style transfer at each scale level are incorporated into successive scale levels up to and including the original HR scale. As a result, style transfer can be performed with images having arbitrary resolutions to produce visually pleasing results with good spatial consistency.
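
For reference, a compact numpy sketch of the whitening and coloring transform (WCT) that is applied patch by patch at each scale level; the encoder/decoder that would extract and invert these features is omitted, and the feature dimensions are assumptions.

import numpy as np

def wct(content_feat, style_feat, eps=1e-5):
    # content_feat, style_feat: (C, N) feature matrices (channels x spatial positions).
    def cov_power(feat, power):
        centered = feat - feat.mean(axis=1, keepdims=True)
        cov = centered @ centered.T / (feat.shape[1] - 1) + eps * np.eye(feat.shape[0])
        vals, vecs = np.linalg.eigh(cov)
        return vecs @ np.diag(vals ** power) @ vecs.T, centered
    whiten, content_centered = cov_power(content_feat, -0.5)   # covariance^(-1/2)
    color, _ = cov_power(style_feat, 0.5)                      # covariance^(+1/2)
    out = color @ (whiten @ content_centered)                  # whiten, then re-color
    return out + style_feat.mean(axis=1, keepdims=True)

content = np.random.rand(64, 32 * 32)    # features of one low-resolution content patch
style = np.random.rand(64, 32 * 32)      # style features matched to the same resolution
stylized = wct(content, style)           # repeated per patch and per scale level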

Patent
14 May 2020
TL;DR: A GAN system trains a generator module to refine digital image layouts; a wireframe rendering discriminator module rasterizes each refined layout into a wireframe image, which is then compared with at least one ground truth digital image layout using a loss function as part of machine learning.
Abstract: Digital image layout training is described using wireframe rendering within a generative adversarial network (GAN) system. A GAN system is employed to train the generator module to refine digital image layouts. To do so, a wireframe rendering discriminator module rasterizes a refined training digital image layout received from a generator module into a wireframe digital image layout. The wireframe digital image layout is then compared with at least one ground truth digital image layout using a loss function as part of machine learning by the wireframe discriminator module. The generator module is then trained by backpropagating a result of the comparison.
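
A small sketch of the wireframe-rendering step described above: a layout, given as element bounding boxes, is rasterized into a wireframe image that an image discriminator can compare against ground-truth layouts. This numpy version only illustrates the rendering; a differentiable rasterizer would be needed for end-to-end training, and the canvas size and example boxes are assumptions.

import numpy as np

def render_wireframe(boxes, size=128):
    # boxes: list of (x, y, w, h) element bounding boxes in [0, 1] layout coordinates.
    canvas = np.zeros((size, size), dtype=np.float32)
    for x, y, w, h in boxes:
        x0, y0 = int(x * size), int(y * size)
        x1 = min(int((x + w) * size), size - 1)
        y1 = min(int((y + h) * size), size - 1)
        canvas[y0:y1 + 1, [x0, x1]] = 1.0    # vertical edges of the element's box
        canvas[[y0, y1], x0:x1 + 1] = 1.0    # horizontal edges of the element's box
    return canvas

layout = [(0.10, 0.05, 0.80, 0.15),          # header
          (0.10, 0.25, 0.50, 0.60),          # text block
          (0.65, 0.25, 0.25, 0.30)]          # image
wireframe = render_wireframe(layout)         # consumed by the wireframe discriminator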

Posted Content
TL;DR: In this article, a physics-based method for inferring 3D human motion from video sequences is presented, which takes initial 2D and 3D pose estimates as input and solves for a physically plausible motion, based on the inputs.
Abstract: Existing deep models predict 2D and 3D kinematic poses from video that are approximately accurate, but contain visible errors that violate physical constraints, such as feet penetrating the ground and bodies leaning at extreme angles. In this paper, we present a physics-based method for inferring 3D human motion from video sequences that takes initial 2D and 3D pose estimates as input. We first estimate ground contact timings with a novel prediction network which is trained without hand-labeled data. A physics-based trajectory optimization then solves for a physically-plausible motion, based on the inputs. We show this process produces motions that are significantly more realistic than those from purely kinematic methods, substantially improving quantitative measures of both kinematic and dynamic plausibility. We demonstrate our method on character animation and pose estimation tasks on dynamic motions of dancing and sports with complex contact patterns.

Patent
Chen Fang, Zhe Lin, Zhaowen Wang, Zhang Yulun, Wang Yilin, Jimei Yang
16 Jul 2020
TL;DR: In this article, a feature transfer module iteratively transfers style features to the coarse feature map and generates a fine feature map, and a decoder generates an output image with content of the content image in a style of the style image from the fused features.
Abstract: In implementations of transferring image style to content of a digital image, an image editing system includes an encoder that extracts features from a content image and features from a style image. A whitening and color transform generates coarse features from the content and style features extracted by the encoder for one pass of encoding and decoding. Hence, the processing delay and memory requirements are low. A feature transfer module iteratively transfers style features to the coarse feature map and generates a fine feature map. The image editing system fuses the fine features with the coarse features, and a decoder generates an output image with content of the content image in a style of the style image from the fused features. Accordingly, the image editing system efficiently transfers an image style to image content in real-time, without undesirable artifacts in the output image.

Patent
Xin Sun, Zhili Chen, Nathan A. Carr, Murria Julio Marco, Jimei Yang
12 Mar 2020
TL;DR: For each pixel of a painting stroke input on a canvas, a neighborhood patch of pixels is selected and input into a neural network, which outputs a shading function; the painting stroke is then rendered on the canvas using that shading function.
Abstract: According to one general aspect, systems and techniques for rendering a painting stroke of a three-dimensional digital painting include receiving a painting stroke input on a canvas, where the painting stroke includes a plurality of pixels. For each of the pixels in the plurality of pixels, a neighborhood patch of pixels is selected and input into a neural network and a shading function is output from the neural network. The painting stroke is rendered on the canvas using the shading function.
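
The per-pixel procedure in this abstract can be sketched directly: for every pixel on the stroke, a neighborhood patch is cropped and passed to a small network whose output shades that pixel. The patch size, network, and single-channel canvas below are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH = 9
shading_net = nn.Sequential(                 # stand-in for the trained shading network
    nn.Linear(PATCH * PATCH, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid())

def shade_stroke(canvas, stroke_pixels):
    # canvas: (H, W) single-channel painting state; stroke_pixels: list of (row, col).
    r = PATCH // 2
    padded = F.pad(canvas, (r, r, r, r))     # zero-pad so border pixels get full patches
    shaded = canvas.clone()
    with torch.no_grad():
        for row, col in stroke_pixels:
            patch = padded[row:row + PATCH, col:col + PATCH].reshape(1, -1)
            shaded[row, col] = shading_net(patch).squeeze()
    return shaded

canvas = torch.rand(64, 64)
stroke = [(32, c) for c in range(10, 50)]    # pixels of a horizontal painting stroke
rendered = shade_stroke(canvas, stroke)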

Posted Content
TL;DR: An iterative inpainting method with a feedback mechanism is proposed: a deep generative model outputs not only an inpainting result but also a corresponding confidence map, progressively filling the hole by trusting only high-confidence pixels inside the hole at each iteration and focusing on the remaining pixels in the next iteration.
Abstract: Existing image inpainting methods often produce artifacts when dealing with large holes in real applications. To address this challenge, we propose an iterative inpainting method with a feedback mechanism. Specifically, we introduce a deep generative model which not only outputs an inpainting result but also a corresponding confidence map. Using this map as feedback, it progressively fills the hole by trusting only high-confidence pixels inside the hole at each iteration and focuses on the remaining pixels in the next iteration. As it reuses partial predictions from the previous iterations as known pixels, this process gradually improves the result. In addition, we propose a guided upsampling network to enable generation of high-resolution inpainting results. We achieve this by extending the Contextual Attention module to borrow high-resolution feature patches in the input image. Furthermore, to mimic real object removal scenarios, we collect a large object mask dataset and synthesize more realistic training data that better simulates user inputs. Experiments show that our method significantly outperforms existing methods in both quantitative and qualitative evaluations. More results and Web APP are available at https://zengxianyu.github.io/iic.


Patent
22 Oct 2020
TL;DR: A 3D motion effect is generated from a 2D image by building a global point cloud through inpainting occlusion gaps in one or more extremal views, generating intermediate views from the point cloud and a camera path, and combining the extremal and intermediate views.
Abstract: Systems and methods are described for generating a three dimensional (3D) effect from a two dimensional (2D) image. The methods may include generating a depth map based on a 2D image, identifying a camera path, generating one or more extremal views based on the 2D image and the camera path, generating a global point cloud by inpainting occlusion gaps in the one or more extremal views, generating one or more intermediate views based on the global point cloud and the camera path, and combining the one or more extremal views and the one or more intermediate views to produce a 3D motion effect.
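
A simplified numpy sketch of the view-synthesis core this pipeline relies on: pixels are lifted into a 3D point cloud using the depth map and re-projected for a camera moved along the path, leaving gaps where occluded content would need inpainting. The pinhole intrinsics, depth values, and camera shift are toy assumptions.

import numpy as np

def render_from_depth(image, depth, camera_shift, focal=100.0):
    h, w, _ = image.shape
    cx, cy = w / 2.0, h / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Back-project every pixel into 3D camera space (the global point cloud).
    X = (xs - cx) * depth / focal
    Y = (ys - cy) * depth / focal
    Z = depth
    # Re-project for a camera translated by camera_shift = (tx, ty, tz).
    tx, ty, tz = camera_shift
    u = np.round(focal * (X - tx) / (Z - tz) + cx).astype(int)
    v = np.round(focal * (Y - ty) / (Z - tz) + cy).astype(int)
    new_view = np.zeros_like(image)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    new_view[v[valid], u[valid]] = image[ys[valid], xs[valid]]   # naive splat, no z-ordering
    return new_view           # the remaining black gaps are what inpainting must fill

image = np.random.rand(96, 128, 3)
depth = 2.0 + np.random.rand(96, 128)        # stand-in for the predicted depth map
extremal_view = render_from_depth(image, depth, camera_shift=(0.1, 0.0, 0.2))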

Patent
Brian Price, Ning Xu, Naoto Inoue, Jimei Yang, Ito Daicho
19 Nov 2020
TL;DR: A two-tone digital line drawing is automatically generated from a photograph using a generator network trained on human-generated line drawings, where the background of the image is one tone and the contents of the input photograph are represented by lines drawn in the second tone.
Abstract: Computing systems and computer-implemented methods can be used for automatically generating a digital line drawing of the contents of a photograph. In various examples, these techniques include use of a neural network, referred to as a generator network, that is trained on a dataset of photographs and human-generated line drawings of the photographs. The training data set teaches the neural network to trace the edges and features of objects in the photographs, as well as which edges or features can be ignored. The output of the generator network is a two-tone digital image, where the background of the image is one tone, and the contents in the input photographs are represented by lines drawn in the second tone. In some examples, a second neural network, referred to as a restorer network, can further process the output of the generator network, and remove visual artifacts and clean up the lines.
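
Schematically, the two-network pipeline reads as below: a generator maps the photograph to a rough line drawing, a restorer cleans it up, and the result is thresholded to two tones. The tiny convolutional stacks and the threshold are placeholders for the patent's trained networks.

import torch
import torch.nn as nn

def conv_stack():                            # placeholder image-to-image network
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())

generator, restorer = conv_stack(), conv_stack()

def photo_to_line_drawing(photo, threshold=0.5):
    with torch.no_grad():
        rough = generator(photo)             # traced edges, possibly with artifacts
        clean = restorer(rough)              # artifacts removed, lines cleaned up
    gray = clean.mean(dim=1, keepdim=True)
    return (gray > threshold).float()        # two tones: background vs. drawn lines

photo = torch.rand(1, 3, 128, 128)
line_drawing = photo_to_line_drawing(photo)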