scispace - formally typeset
Search or ask a question

Showing papers by "Jimei Yang published in 2018"


Proceedings ArticleDOI
18 Jun 2018
TL;DR: Yu et al. as discussed by the authors proposed a new deep generative model-based approach which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions.
Abstract: Recent deep learning based approaches have shown promising results for the challenging task of inpainting large missing regions in an image. These methods can generate visually plausible image structures and textures, but often create distorted structures or blurry textures inconsistent with surrounding areas. This is mainly due to ineffectiveness of convolutional neural networks in explicitly borrowing or copying information from distant spatial locations. On the other hand, traditional texture and patch synthesis approaches are particularly suitable when it needs to borrow textures from the surrounding regions. Motivated by these observations, we propose a new deep generative model-based approach which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions. The model is a feedforward, fully convolutional neural network which can process images with multiple holes at arbitrary locations and with variable sizes during the test time. Experiments on multiple datasets including faces (CelebA, CelebA-HQ), textures (DTD) and natural images (ImageNet, Places2) demonstrate that our proposed approach generates higher-quality inpainting results than existing ones. Code, demo and models are available at: https://github.com/JiahuiYu/generative_inpainting.

1,397 citations


Posted Content
TL;DR: In this article, a new deep generative model-based approach is proposed which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions.
Abstract: Recent deep learning based approaches have shown promising results for the challenging task of inpainting large missing regions in an image. These methods can generate visually plausible image structures and textures, but often create distorted structures or blurry textures inconsistent with surrounding areas. This is mainly due to ineffectiveness of convolutional neural networks in explicitly borrowing or copying information from distant spatial locations. On the other hand, traditional texture and patch synthesis approaches are particularly suitable when it needs to borrow textures from the surrounding regions. Motivated by these observations, we propose a new deep generative model-based approach which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions. The model is a feed-forward, fully convolutional neural network which can process images with multiple holes at arbitrary locations and with variable sizes during the test time. Experiments on multiple datasets including faces (CelebA, CelebA-HQ), textures (DTD) and natural images (ImageNet, Places2) demonstrate that our proposed approach generates higher-quality inpainting results than existing ones. Code, demo and models are available at: this https URL.

1,333 citations


Posted Content
TL;DR: The proposed gated convolution solves the issue of vanilla convolution that treats all input pixels as valid ones, generalizes partial convolution by providing a learnable dynamic feature selection mechanism for each channel at each spatial location across all layers.
Abstract: We present a generative image inpainting system to complete images with free-form mask and guidance. The system is based on gated convolutions learned from millions of images without additional labelling efforts. The proposed gated convolution solves the issue of vanilla convolution that treats all input pixels as valid ones, generalizes partial convolution by providing a learnable dynamic feature selection mechanism for each channel at each spatial location across all layers. Moreover, as free-form masks may appear anywhere in images with any shape, global and local GANs designed for a single rectangular mask are not applicable. Thus, we also present a patch-based GAN loss, named SN-PatchGAN, by applying spectral-normalized discriminator on dense image patches. SN-PatchGAN is simple in formulation, fast and stable in training. Results on automatic image inpainting and user-guided extension demonstrate that our system generates higher-quality and more flexible results than previous methods. Our system helps user quickly remove distracting objects, modify image layouts, clear watermarks and edit faces. Code, demo and models are available at: this https URL

848 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: The authors decompose expressions into three modular components related to subject appearance, location, and relationship to other objects in an end-to-end framework, which allows to flexibly adapt to expressions containing different types of information.
Abstract: In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-the-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks. Demo1 and code2 are provided.

626 citations


Posted Content
TL;DR: A definition of a continuous representation is advanced, which can be helpful for training deep neural networks and related to topological concepts such as homeomorphism and embedding, and results show that continuous rotation representations outperform discontinuous ones for several practical problems in graphics and vision.
Abstract: In neural networks, it is often desirable to work with various representations of the same space. For example, 3D rotations can be represented with quaternions or Euler angles. In this paper, we advance a definition of a continuous representation, which can be helpful for training deep neural networks. We relate this to topological concepts such as homeomorphism and embedding. We then investigate what are continuous and discontinuous representations for 2D, 3D, and n-dimensional rotations. We demonstrate that for 3D rotations, all representations are discontinuous in the real Euclidean spaces of four or fewer dimensions. Thus, widely used representations such as quaternions and Euler angles are discontinuous and difficult for neural networks to learn. We show that the 3D rotations have continuous representations in 5D and 6D, which are more suitable for learning. We also present continuous representations for the general case of the n-dimensional rotation group SO(n). While our main focus is on rotations, we also show that our constructions apply to other groups such as the orthogonal group and similarity transforms. We finally present empirical results, which show that our continuous rotation representations outperform discontinuous ones for several practical problems in graphics and vision, including a simple autoencoder sanity test, a rotation estimator for 3D point clouds, and an inverse kinematics solver for 3D human poses.

464 citations


Book ChapterDOI
08 Sep 2018
TL;DR: BodyNet as mentioned in this paper proposes an end-to-end trainable network that benefits from a volumetric 3D loss, a multi-view re-projection loss, and intermediate supervision of 2D pose, 2D body part segmentation, and 3D pose.
Abstract: Human shape estimation is an important task for video editing, animation and fashion industry. Predicting 3D human body shape from natural images, however, is highly challenging due to factors such as variation in human bodies, clothing and viewpoint. Prior methods addressing this problem typically attempt to fit parametric body models with certain priors on pose and shape. In this work we argue for an alternative representation and propose BodyNet, a neural network for direct inference of volumetric body shape from a single image. BodyNet is an end-to-end trainable network that benefits from (i) a volumetric 3D loss, (ii) a multi-view re-projection loss, and (iii) intermediate supervision of 2D pose, 2D body part segmentation, and 3D pose. Each of them results in performance improvement as demonstrated by our experiments. To evaluate the method, we fit the SMPL model to our network output and show state-of-the-art results on the SURREAL and Unite the People datasets, outperforming recent approaches. Besides achieving state-of-the-art performance, our method also enables volumetric body-part segmentation.

385 citations


Proceedings ArticleDOI
18 Jun 2018
TL;DR: This paper presents the first end-to-end neural architecture for piece-wise planar reconstruction from a single RGB image, and outperforms baseline methods in terms of both plane segmentation and depth estimation accuracy.
Abstract: This paper proposes a deep neural network (DNN) for piece-wise planar depthmap reconstruction from a single RGB image. While DNNs have brought remarkable progress to single-image depth prediction, piece-wise planar depthmap reconstruction requires a structured geometry representation, and has been a difficult task to master even for DNNs. The proposed end-to-end DNN learns to directly infer a set of plane parameters and corresponding plane segmentation masks from a single RGB image. We have generated more than 50,000 piece-wise planar depthmaps for training and testing from ScanNet, a large-scale RGBD video database. Our qualitative and quantitative evaluations demonstrate that the proposed approach outperforms baseline methods in terms of both plane segmentation and depth estimation accuracy. To the best of our knowledge, this paper presents the first end-to-end neural architecture for piece-wise planar reconstruction from a single RGB image. Code and data are available at https://github.com/art-programmer/PlaneNet.

183 citations


Proceedings ArticleDOI
16 Apr 2018
TL;DR: In this paper, a recurrent neural network architecture with a Forward Kinematics layer and cycle consistency based adversarial training objective is proposed for unsupervised motion retargeting, which works online and adapts the motion sequence on-the-fly as new frames are received.
Abstract: We propose a recurrent neural network architecture with a Forward Kinematics layer and cycle consistency based adversarial training objective for unsupervised motion retargetting. Our network captures the high-level properties of an input motion by the forward kinematics layer, and adapts them to a target character with different skeleton bone lengths (e.g., shorter, longer arms etc.). Collecting paired motion training sequences from different characters is expensive. Instead, our network utilizes cycle consistency to learn to solve the Inverse Kinematics problem in an unsupervised manner. Our method works online, i.e., it adapts the motion sequence on-the-fly as new frames are received. In our experiments, we use the Mixamo animation data1 to test our method for a variety of motions and characters and achieve state-of-the-art results. We also demonstrate motion retargetting from monocular human videos to 3D characters using an off-the-shelf 3D pose estimator.

166 citations


Posted Content
TL;DR: BodyNet is an end-to-end trainable network that benefits from a volumetric 3D loss, a multi-view re-projection loss, and intermediate supervision of 2D pose, 2D body part segmentation, and 3D pose and achieves state-of-the-art performance.
Abstract: Human shape estimation is an important task for video editing, animation and fashion industry. Predicting 3D human body shape from natural images, however, is highly challenging due to factors such as variation in human bodies, clothing and viewpoint. Prior methods addressing this problem typically attempt to fit parametric body models with certain priors on pose and shape. In this work we argue for an alternative representation and propose BodyNet, a neural network for direct inference of volumetric body shape from a single image. BodyNet is an end-to-end trainable network that benefits from (i) a volumetric 3D loss, (ii) a multi-view re-projection loss, and (iii) intermediate supervision of 2D pose, 2D body part segmentation, and 3D pose. Each of them results in performance improvement as demonstrated by our experiments. To evaluate the method, we fit the SMPL model to our network output and show state-of-the-art results on the SURREAL and Unite the People datasets, outperforming recent approaches. Besides achieving state-of-the-art performance, our method also enables volumetric body-part segmentation.

150 citations


Book ChapterDOI
08 Sep 2018
TL;DR: Wang et al. as mentioned in this paper formulated the multi-frame prediction task as a multiple time step flow (multi-flow) prediction phase followed by a flow-to-frame synthesis phase.
Abstract: Existing video prediction methods mainly rely on observing multiple historical frames or focus on predicting the next one-frame. In this work, we study the problem of generating consecutive multiple future frames by observing one single still image only. We formulate the multi-frame prediction task as a multiple time step flow (multi-flow) prediction phase followed by a flow-to-frame synthesis phase. The multi-flow prediction is modeled in a variational probabilistic manner with spatial-temporal relationships learned through 3D convolutions. The flow-to-frame synthesis is modeled as a generative process in order to keep the predicted results lying closer to the manifold shape of real video sequence. Such a two-phase design prevents the model from directly looking at the high-dimensional pixel space of the frame sequence and is demonstrated to be more effective in predicting better and diverse results. Extensive experimental results on videos with different types of motion show that the proposed algorithm performs favorably against existing methods in terms of quality, diversity and human perceptual evaluation.

117 citations


Proceedings Article
27 Sep 2018
TL;DR: A novel differentiable wireframe rendering layer is proposed that maps the generated layout to a wireframe image, upon which a CNN-based discriminator is used to optimize the layouts in image space.
Abstract: Layout is important for graphic design and scene generation. We propose a novel Generative Adversarial Network, called LayoutGAN, that synthesizes layouts by modeling geometric relations of different types of 2D elements. The generator of LayoutGAN takes as input a set of randomly-placed 2D graphic elements and uses self-attention modules to refine their labels and geometric parameters jointly to produce a realistic layout. Accurate alignment is critical for good layouts. We thus propose a novel differentiable wireframe rendering layer that maps the generated layout to a wireframe image, upon which a CNN-based discriminator is used to optimize the layouts in image space. We validate the effectiveness of LayoutGAN in various experiments including MNIST digit generation, document layout generation, clipart abstract scene generation and tangram graphic design.

Posted Content
TL;DR: In this article, the authors proposed an end-to-end neural architecture for piece-wise planar reconstruction from a single RGB image, which can directly infer a set of plane parameters and corresponding plane segmentation masks.
Abstract: This paper proposes a deep neural network (DNN) for piece-wise planar depthmap reconstruction from a single RGB image. While DNNs have brought remarkable progress to single-image depth prediction, piece-wise planar depthmap reconstruction requires a structured geometry representation, and has been a difficult task to master even for DNNs. The proposed end-to-end DNN learns to directly infer a set of plane parameters and corresponding plane segmentation masks from a single RGB image. We have generated more than 50,000 piece-wise planar depthmaps for training and testing from ScanNet, a large-scale RGBD video database. Our qualitative and quantitative evaluations demonstrate that the proposed approach outperforms baseline methods in terms of both plane segmentation and depth estimation accuracy. To the best of our knowledge, this paper presents the first end-to-end neural architecture for piece-wise planar reconstruction from a single RGB image. Code and data are available at this https URL.

Patent
Zhe Lin1, Yibing Song1, Xin Lu1, Xiaohui Shen1, Jimei Yang1 
15 May 2018
TL;DR: In this paper, the authors used a first neural network and a second neural network to generate image information used to generate a segmentation mask that corresponds to the object portrayed in the digital image.
Abstract: Systems and methods are disclosed for segmenting a digital image to identify an object portrayed in the digital image from background pixels in the digital image. In particular, in one or more embodiments, the disclosed systems and methods use a first neural network and a second neural network to generate image information used to generate a segmentation mask that corresponds to the object portrayed in the digital image. Specifically, in one or more embodiments, the disclosed systems and methods optimize a fit between a mask boundary of the segmentation mask to edges of the object portrayed in the digital image to accurately segment the object within the digital image.

Posted Content
TL;DR: The authors decompose expressions into three modular components related to subject appearance, location, and relationship to other objects in an end-to-end framework, which allows to flexibly adapt to expressions containing different types of information.
Abstract: In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks. Demo and code are provided.

Posted Content
TL;DR: This work forms the multi-frame prediction task as a multiple time step flow (multi-flow) prediction phase followed by a flow-to-frame synthesis phase, which prevents the model from directly looking at the high-dimensional pixel space of the frame sequence and is demonstrated to be more effective in predicting better and diverse results.
Abstract: Existing video prediction methods mainly rely on observing multiple historical frames or focus on predicting the next one-frame. In this work, we study the problem of generating consecutive multiple future frames by observing one single still image only. We formulate the multi-frame prediction task as a multiple time step flow (multi-flow) prediction phase followed by a flow-to-frame synthesis phase. The multi-flow prediction is modeled in a variational probabilistic manner with spatial-temporal relationships learned through 3D convolutions. The flow-to-frame synthesis is modeled as a generative process in order to keep the predicted results lying closer to the manifold shape of real video sequence. Such a two-phase design prevents the model from directly looking at the high-dimensional pixel space of the frame sequence and is demonstrated to be more effective in predicting better and diverse results. Extensive experimental results on videos with different types of motion show that the proposed algorithm performs favorably against existing methods in terms of quality, diversity and human perceptual evaluation.

Patent
Lin Zhe1, Lu Xin1, Shen Xiaohui1, Jimei Yang1, Chenxi Liu1 
19 Jan 2018
TL;DR: In this paper, a fully convolutional neural network identifies and encodes the image features and a word embedding model generates the token vectors, and a recurrent neural network (RNN) iteratively updates a segmentation map based on combinations of the image feature encoding and the word vectors.
Abstract: The invention is directed towards segmenting images based on natural language phrases. An image and an n-gram, including a sequence of tokens, are received. An encoding of image features and a sequence of token vectors are generated. A fully convolutional neural network identifies and encodes the image features. A word embedding model generates the token vectors. A recurrent neural network (RNN) iteratively updates a segmentation map based on combinations of the image feature encoding and the token vectors. The segmentation map identifies which pixels are included in an image region referenced by the n-gram. A segmented image is generated based on the segmentation map. The RNN may be a convolutional multimodal RNN. A separate RNN, such as a long short-term memory network, may iteratively update an encoding of semantic features based on the order of tokens. The first RNN may update the segmentation map based on the semantic feature encoding.

Posted Content
TL;DR: In this article, a recurrent neural network architecture with a Forward Kinematics layer and cycle consistency based adversarial training objective is proposed for unsupervised motion retargeting, which works online and adapts the motion sequence on-the-fly as new frames are received.
Abstract: We propose a recurrent neural network architecture with a Forward Kinematics layer and cycle consistency based adversarial training objective for unsupervised motion retargetting. Our network captures the high-level properties of an input motion by the forward kinematics layer, and adapts them to a target character with different skeleton bone lengths (e.g., shorter, longer arms etc.). Collecting paired motion training sequences from different characters is expensive. Instead, our network utilizes cycle consistency to learn to solve the Inverse Kinematics problem in an unsupervised manner. Our method works online, i.e., it adapts the motion sequence on-the-fly as new frames are received. In our experiments, we use the Mixamo animation data to test our method for a variety of motions and characters and achieve state-of-the-art results. We also demonstrate motion retargetting from monocular human videos to 3D characters using an off-the-shelf 3D pose estimator.

Posted Content
TL;DR: A two-stage learning framework to teach a machine to doodle in a simulated painting environment via Stroke Demonstration and deep Q-learning (SDQ), which generates a sequence of pen actions to reproduce a reference drawing and mimics the behavior of human painters.
Abstract: Doodling is a useful and common intelligent skill that people can learn and master. In this work, we propose a two-stage learning framework to teach a machine to doodle in a simulated painting environment via Stroke Demonstration and deep Q-learning (SDQ). The developed system, Doodle-SDQ, generates a sequence of pen actions to reproduce a reference drawing and mimics the behavior of human painters. In the first stage, it learns to draw simple strokes by imitating in supervised fashion from a set of strokeaction pairs collected from artist paintings. In the second stage, it is challenged to draw real and more complex doodles without ground truth actions; thus, it is trained with Qlearning. Our experiments confirm that (1) doodling can be learned without direct stepby- step action supervision and (2) pretraining with stroke demonstration via supervised learning is important to improve performance. We further show that Doodle-SDQ is effective at producing plausible drawings in different media types, including sketch and watercolor.

Patent
Yijun Li1, Chen Fang1, Jimei Yang1, Zhaowen Wang1, Xin Lu1 
23 Aug 2018
TL;DR: In this paper, the authors describe techniques for synthesizing an image style based on a plurality of neural networks, where a computer system selects a style image based on user input that identifies the style image.
Abstract: In some embodiments, techniques for synthesizing an image style based on a plurality of neural networks are described. A computer system selects a style image based on user input that identifies the style image. The computer system generates an image based on a generator neural network and a loss neural network. The generator neural network outputs the synthesized image based on a noise vector and the style image and is trained based on style features generated from the loss neural network. The loss neural network outputs the style features based on a training image. The training image and the style image have a same resolution. The style features are generated at different resolutions of the training image. The computer system provides the synthesized image to a user device in response to the user input.

Proceedings ArticleDOI
17 Aug 2018
TL;DR: A novel approach that uses a generative adversarial network (GAN) to synthesize realistic oil painting brush strokes, where the network is trained with data generated by a high-fidelity simulator to replace the expensive fluid simulation with a neural network.
Abstract: We introduce a novel approach that uses a generative adversarial network (GAN) to synthesize realistic oil painting brush strokes, where the network is trained with data generated by a high-fidelity simulator. Among approaches to digitally synthesizing natural media painting strokes, physically based simulation produces by far the most realistic visual results and allows the most intuitive control of stroke variations. However, accurate physics simulations are known to be computationally expensive and often cannot meet the performance requirements of painting applications.In our work, we propose to replace the expensive fluid simulation with a neural network. The network takes the existing canvas and a new stroke trajectory as input and produces the height and color of the new stroke as output. We train the network with a dataset generated with a high quality offline simulator. The network is able to produce visual quality comparable to the offline simulator with better performance than the existing real-time oil painting simulator. Finally, we implement a real-time painting system using the trained network.

Patent
06 Sep 2018
TL;DR: In this article, a neural network is used to decompose an input digital image into intrinsic physical properties (e.g., material, illumination, and shape) and then a rendering layer is utilized to generate a modified digital image based on the target property and the remaining (unsubstituted) intrinsic physical property.
Abstract: The present disclosure includes methods and systems for generating modified digital images utilizing a neural network that includes a rendering layer. In particular, the disclosed systems and methods can train a neural network to decompose an input digital image into intrinsic physical properties (e.g., such as material, illumination, and shape). Moreover, the systems and methods can substitute one of the intrinsic physical properties for a target property (e.g., a modified material, illumination, or shape). The systems and methods can utilize a rendering layer trained to synthesize a digital image to generate a modified digital image based on the target property and the remaining (unsubstituted) intrinsic physical properties. Systems and methods can increase the accuracy of modified digital images by generating modified digital images that realistically reflect a confluence of intrinsic physical properties of an input digital image and target (i.e., modified) properties.

Patent
21 Dec 2018
TL;DR: In this paper, a source image from a source viewpoint and including a common portion of the object is encoded in 2D data, and an intermediate image that includes an intermediate view of the objects is generated based on the data.
Abstract: Embodiments are directed towards providing a target view, from a target viewpoint, of a 3D object. A source image, from a source viewpoint and including a common portion of the object, is encoded in 2D data. An intermediate image that includes an intermediate view of the object is generated based on the data. The intermediate view is from the target viewpoint and includes the common portion of the object and a disoccluded portion of the object not visible in the source image. The intermediate image includes a common region and a disoccluded region corresponding to the disoccluded portion of the object. The disoccluded region is updated to include a visual representation of a prediction of the disoccluded portion of the object. The prediction is based on a trained image completion model. The target view is based on the common region and the updated disoccluded region of the intermediate image.

Proceedings Article
01 Jan 2018
TL;DR: A two-stage learning framework to teach a machine to doodle in a simulated painting environment via Stroke Demonstration and deep Q-learning (SDQ), which generates a sequence of pen actions to reproduce a reference drawing and mimics the behavior of human painters.
Abstract: Doodling is a useful and common intelligent skill that people can learn and master. In this work, we propose a two-stage learning framework to teach a machine to doodle in a simulated painting environment via Stroke Demonstration and deep Q-learning (SDQ). The developed system, Doodle-SDQ, generates a sequence of pen actions to reproduce a reference drawing and mimics the behavior of human painters. In the first stage, it learns to draw simple strokes by imitating in supervised fashion from a set of strokeaction pairs collected from artist paintings. In the second stage, it is challenged to draw real and more complex doodles without ground truth actions; thus, it is trained with Qlearning. Our experiments confirm that (1) doodling can be learned without direct stepby-step action supervision and (2) pretraining with stroke demonstration via supervised learning is important to improve performance. We further show that Doodle-SDQ is effective at producing plausible drawings in different media types, including sketch and watercolor. A short video can be found at https://www.youtube.com/watch? v=-5FVUQFQTaE.