Showing papers by "Jimei Yang" published in 2017


Proceedings ArticleDOI
21 Jul 2017
TL;DR: An effective face completion algorithm based on a deep generative model, trained with a combination of a reconstruction loss, two adversarial losses, and a semantic parsing loss to ensure pixel faithfulness and local-global content consistency.
Abstract: In this paper, we propose an effective face completion algorithm using a deep generative model. In contrast to well-studied background completion, the face completion task is more challenging because it often requires generating semantically new pixels for missing key components (e.g., eyes and mouths) that exhibit large appearance variations. Unlike existing nonparametric algorithms that search for patches to synthesize, our algorithm directly generates content for the missing regions with a neural network. The model is trained with a combination of a reconstruction loss, two adversarial losses and a semantic parsing loss, which together ensure pixel faithfulness and local-global content consistency. With extensive experimental results, we demonstrate qualitatively and quantitatively that our model can handle a large area of missing pixels in arbitrary shapes and generate realistic face completion results.
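
As a rough illustration of how the four training terms named in the abstract might be combined, here is a minimal NumPy sketch; the loss weights, the non-saturating GAN form, and the array shapes are placeholders rather than the published settings.

import numpy as np

def completion_loss(recon, target, d_local_fake, d_global_fake,
                    parse_pred, parse_gt,
                    w_rec=1.0, w_local=0.3, w_global=0.3, w_parse=0.05):
    """Weighted sum of the four training terms named in the abstract.
    The weights and the GAN loss form are illustrative guesses."""
    l_rec = np.mean(np.abs(recon - target))                      # pixel reconstruction
    l_adv_local = -np.mean(np.log(d_local_fake + 1e-8))          # local discriminator on the hole
    l_adv_global = -np.mean(np.log(d_global_fake + 1e-8))        # global discriminator on the full face
    l_parse = -np.mean(parse_gt * np.log(parse_pred + 1e-8))     # semantic parsing cross-entropy
    return w_rec * l_rec + w_local * l_adv_local + w_global * l_adv_global + w_parse * l_parse

rng = np.random.default_rng(0)
loss = completion_loss(rng.random((128, 128, 3)), rng.random((128, 128, 3)),
                       rng.random(16), rng.random(16),
                       rng.random((128, 128, 11)), rng.random((128, 128, 11)))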

511 citations


Proceedings Article
23 May 2017
TL;DR: A pair of feature transforms, whitening and coloring, is embedded into an image reconstruction network; the transforms directly match the feature covariance of the content image to that of a given style image, in a spirit similar to the Gram-matrix-based objective of neural style transfer.
Abstract: Universal style transfer aims to transfer arbitrary visual styles to content images. Existing feed-forward methods, while enjoying inference efficiency, are mainly limited by an inability to generalize to unseen styles or by compromised visual quality. In this paper, we present a simple yet effective method that tackles these limitations without training on any pre-defined styles. The key ingredient of our method is a pair of feature transforms, whitening and coloring, that are embedded into an image reconstruction network. The whitening and coloring transforms directly match the feature covariance of the content image to that of a given style image, which shares a similar spirit with the Gram-matrix-based cost in neural style transfer. We demonstrate the effectiveness of our algorithm by generating high-quality stylized images and comparing with a number of recent methods. We also analyze our method by visualizing the whitened features and synthesizing textures by simple feature coloring.
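
A minimal NumPy sketch of the core whitening-and-coloring step on flattened encoder features; the content-style blending weight and the surrounding encoder-decoder are omitted, and the epsilon regularization is an assumption.

import numpy as np

def whiten_and_color(content_feat, style_feat, eps=1e-5):
    """content_feat, style_feat: (C, H*W) activations from the same encoder layer.
    Whitening removes the content covariance; coloring imposes the style covariance."""
    c = content_feat - content_feat.mean(axis=1, keepdims=True)
    s = style_feat - style_feat.mean(axis=1, keepdims=True)
    cov_c = c @ c.T / (c.shape[1] - 1) + eps * np.eye(c.shape[0])
    cov_s = s @ s.T / (s.shape[1] - 1) + eps * np.eye(s.shape[0])
    Ec, Dc, _ = np.linalg.svd(cov_c)
    Es, Ds, _ = np.linalg.svd(cov_s)
    whitened = Ec @ np.diag(Dc ** -0.5) @ Ec.T @ c      # covariance becomes (nearly) identity
    colored = Es @ np.diag(Ds ** 0.5) @ Es.T @ whitened # covariance becomes that of the style
    return colored + style_feat.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
out = whiten_and_color(rng.standard_normal((64, 1024)), rng.standard_normal((64, 900)))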

508 citations


Posted Content
TL;DR: This paper demonstrates qualitatively and quantitatively that the proposed face completion algorithm can handle a large area of missing pixels in arbitrary shapes and generate realistic face completion results.
Abstract: In this paper, we propose an effective face completion algorithm using a deep generative model. In contrast to well-studied background completion, the face completion task is more challenging because it often requires generating semantically new pixels for missing key components (e.g., eyes and mouths) that exhibit large appearance variations. Unlike existing nonparametric algorithms that search for patches to synthesize, our algorithm directly generates content for the missing regions with a neural network. The model is trained with a combination of a reconstruction loss, two adversarial losses and a semantic parsing loss, which together ensure pixel faithfulness and local-global content consistency. With extensive experimental results, we demonstrate qualitatively and quantitatively that our model can handle a large area of missing pixels in arbitrary shapes and generate realistic face completion results.

354 citations


Proceedings Article
17 Jul 2017
TL;DR: A hierarchical approach is proposed that first estimates high-level structure in the input frames, then predicts how that structure evolves in the future, and finally, given a single past frame and the predicted high-level structure, constructs the future frames without observing any pixel-level predictions.
Abstract: We propose a hierarchical approach for making long-term predictions of future frames. To avoid the compounding errors inherent in recursive pixel-level prediction, we first estimate high-level structure in the input frames, then predict how that structure evolves in the future, and finally, by observing a single frame from the past and the predicted high-level structure, construct the future frames without observing any of the pixel-level predictions. Long-term video prediction is difficult when the predicted frames are fed back recurrently, because small errors in pixel space amplify exponentially as predictions are made deeper into the future. Our approach prevents this pixel-level error propagation by removing the need to observe the predicted frames. Our model is built from a combination of LSTM and analogy-based encoder-decoder convolutional neural networks, which independently predict the video structure and generate the future frames, respectively. In experiments, our model is evaluated on the Human3.6M and Penn Action datasets on the task of long-term pixel-level video prediction of humans performing actions and demonstrates significantly better results than the state of the art.
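
The key point is that predicted pixels are never fed back into the model. A toy sketch of that rollout, with stand-in callables in place of the pose LSTM and the analogy-based encoder-decoder:

import numpy as np

def hierarchical_predict(past_poses, last_frame, pose_step, frame_generator, horizon):
    """Two-stage rollout in the spirit of the abstract: a recurrent model evolves
    the high-level structure (e.g. body pose), and each future frame is generated
    from one past frame plus the predicted structure, so no predicted pixels are
    ever fed back into the model."""
    poses, frames = [past_poses[-1]], []
    for _ in range(horizon):
        poses.append(pose_step(poses[-1]))
        frames.append(frame_generator(last_frame, past_poses[-1], poses[-1]))
    return poses[1:], frames

rng = np.random.default_rng(0)
past_poses = [rng.standard_normal(26) for _ in range(3)]        # e.g. 13 two-dimensional joints
last_frame = rng.standard_normal((64, 64, 3))
pose_step = lambda p: p + 0.05                                  # toy stand-in for the pose LSTM
frame_generator = lambda f, p_ref, p_new: f + (p_new - p_ref).mean()  # toy stand-in for the analogy decoder
poses, frames = hierarchical_predict(past_poses, last_frame, pose_step, frame_generator, horizon=5)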

266 citations


Proceedings ArticleDOI
21 Jul 2017
TL;DR: A deep generative feed-forward network is proposed that enables efficient synthesis of multiple textures within a single network and meaningful interpolation between them, and a suite of important techniques is introduced to achieve better convergence and diversity.
Abstract: Recent progress on deep discriminative and generative modeling has shown promising results on texture synthesis. However, existing feed-forward methods trade off generality for efficiency and suffer from several issues, such as limited generality (i.e., one network per texture), lack of diversity (i.e., visually identical outputs), and suboptimality (i.e., less satisfying visual effects). In this work, we focus on solving these issues for improved texture synthesis. We propose a deep generative feed-forward network that enables efficient synthesis of multiple textures within a single network and meaningful interpolation between them. Meanwhile, a suite of important techniques is introduced to achieve better convergence and diversity. With extensive experiments, we demonstrate the effectiveness of the proposed model and techniques for synthesizing a large number of textures and show its application to stylization.
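
One way to read the "single network for multiple textures" idea is that the generator is conditioned on a one-hot selection unit alongside a noise vector; interpolating selection vectors then interpolates textures. A hedged sketch, where the dimensions and the interpolation recipe are illustrative rather than the published design:

import numpy as np

def generator_input(texture_id, num_textures, noise_dim=64, rng=None):
    """Input to a single multi-texture generator: a one-hot selection unit picks
    the texture, and a noise vector drives diversity across samples."""
    rng = rng or np.random.default_rng()
    sel = np.zeros(num_textures)
    sel[texture_id] = 1.0
    return np.concatenate([sel, rng.standard_normal(noise_dim)])

a = generator_input(0, num_textures=10)
b = generator_input(3, num_textures=10)
mixed = 0.5 * a + 0.5 * b    # fed to the generator, this would yield a blend of textures 0 and 3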

266 citations


Posted Content
TL;DR: The model is built from a combination of LSTM and analogy-based encoder-decoder convolutional neural networks, which independently predict the video structure and generate the future frames, respectively; it prevents pixel-level error propagation by removing the need to observe the predicted frames.
Abstract: We propose a hierarchical approach for making long-term predictions of future frames. To avoid the compounding errors inherent in recursive pixel-level prediction, we first estimate high-level structure in the input frames, then predict how that structure evolves in the future, and finally, by observing a single frame from the past and the predicted high-level structure, construct the future frames without observing any of the pixel-level predictions. Long-term video prediction is difficult when the predicted frames are fed back recurrently, because small errors in pixel space amplify exponentially as predictions are made deeper into the future. Our approach prevents this pixel-level error propagation by removing the need to observe the predicted frames. Our model is built from a combination of LSTM and analogy-based encoder-decoder convolutional neural networks, which independently predict the video structure and generate the future frames, respectively. In experiments, our model is evaluated on the Human3.6M and Penn Action datasets on the task of long-term pixel-level video prediction of humans performing actions and demonstrates significantly better results than the state of the art.

246 citations


Proceedings ArticleDOI
08 Mar 2017
TL;DR: In this paper, a transformation-grounded image generation network is proposed for novel 3D view synthesis from a single image, which explicitly infers the parts of the geometry visible both in the input and novel views and then casts the remaining synthesis problem as image completion.
Abstract: We present a transformation-grounded image generation network for novel 3D view synthesis from a single image. Our approach first explicitly infers the parts of the geometry visible in both the input and novel views and then casts the remaining synthesis problem as image completion. Specifically, we predict both a flow that moves pixels from the input to the novel view and a visibility map that helps deal with occlusion/disocclusion. Next, conditioned on those intermediate results, we hallucinate (infer) the parts of the object invisible in the input image. In addition to the new network structure, training with a combination of adversarial and perceptual losses reduces common artifacts of novel view synthesis such as distortions and holes, while successfully generating high-frequency details and preserving visual aspects of the input image. We evaluate our approach on a wide range of synthetic and real examples. Both qualitative and quantitative results show our method achieves significantly better results than existing methods.
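
A toy sketch of the compositing step described above: warp the input by the predicted flow, then use the visibility map to decide where warped pixels are kept and where hallucinated content fills disocclusions. Nearest-neighbour warping is a simplification of whatever sampling the network actually uses.

import numpy as np

def warp_with_flow(image, flow):
    """Nearest-neighbour backward warp of `image` (H, W, 3) by an offset field `flow` (H, W, 2)."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    return image[src_y, src_x]

def compose_novel_view(input_image, flow, visibility, hallucinated):
    """Keep warped input pixels where the visibility map says they survive the
    viewpoint change, and fall back to hallucinated content in disoccluded regions."""
    v = np.clip(visibility, 0.0, 1.0)[..., None]
    return v * warp_with_flow(input_image, flow) + (1.0 - v) * hallucinated

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
view = compose_novel_view(img, rng.standard_normal((64, 64, 2)),
                          rng.random((64, 64)), rng.random((64, 64, 3)))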

243 citations


Posted Content
TL;DR: An end-to-end trainable network architecture is proposed for pixel-level future prediction in natural videos; it decomposes motion and content, the two key components generating the dynamics in videos.
Abstract: We propose a deep neural network for the prediction of future frames in natural video sequences. To effectively handle complex evolution of pixels in videos, we propose to decompose the motion and content, two key components generating dynamics in videos. Our model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional LSTM for pixel-level prediction, which independently capture the spatial layout of an image and the corresponding temporal dynamics. By independently modeling motion and content, predicting the next frame reduces to converting the extracted content features into the next frame content by the identified motion features, which simplifies the task of prediction. Our model is end-to-end trainable over multiple time steps, and naturally learns to decompose motion and content without separate training. We evaluate the proposed network architecture on human activity videos using KTH, Weizmann action, and UCF-101 datasets. We show state-of-the-art performance in comparison to recent approaches. To the best of our knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.
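
As a reading of the decomposition, the motion pathway can be thought of as consuming frame-to-frame differences while the content pathway sees only the most recent frame; the sketch below uses toy stand-ins for the learned encoders, fusion, and decoder.

import numpy as np

def motion_content_step(frames, enc_content, enc_motion, combine, decode):
    """One prediction step in the spirit of the abstract: the motion pathway sees
    frame-to-frame differences, the content pathway sees only the latest frame,
    and the decoder turns the fused features into the next frame."""
    diffs = [b - a for a, b in zip(frames[:-1], frames[1:])]
    return decode(combine(enc_content(frames[-1]), enc_motion(diffs)))

rng = np.random.default_rng(0)
frames = [rng.standard_normal((64, 64, 3)) for _ in range(4)]
next_frame = motion_content_step(frames,
                                 enc_content=lambda f: f,                 # toy content encoder
                                 enc_motion=lambda ds: np.mean(ds, axis=0),  # toy motion encoder
                                 combine=lambda c, m: c + m,              # toy fusion
                                 decode=np.tanh)                          # toy decoder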

233 citations


Posted Content
TL;DR: The key ingredient of the method is a pair of feature transforms, whitening and coloring, embedded into an image reconstruction network; they directly match the feature covariance of the content image to that of a given style image.
Abstract: Universal style transfer aims to transfer arbitrary visual styles to content images. Existing feed-forward methods, while enjoying inference efficiency, are mainly limited by an inability to generalize to unseen styles or by compromised visual quality. In this paper, we present a simple yet effective method that tackles these limitations without training on any pre-defined styles. The key ingredient of our method is a pair of feature transforms, whitening and coloring, that are embedded into an image reconstruction network. The whitening and coloring transforms directly match the feature covariance of the content image to that of a given style image, which shares a similar spirit with the Gram-matrix-based cost in neural style transfer. We demonstrate the effectiveness of our algorithm by generating high-quality stylized images and comparing with a number of recent methods. We also analyze our method by visualizing the whitened features and synthesizing textures via simple feature coloring.

233 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel top-down saliency model that jointly learns a Conditional Random Field (CRF) and a discriminative dictionary via a max-margin approach, training the dictionary under CRF supervision and the CRF on top of sparse codes.
Abstract: Top-down visual saliency is an important module of visual attention. In this work, we propose a novel top-down saliency model that jointly learns a Conditional Random Field (CRF) and a visual dictionary. The proposed model incorporates a layered structure from top to bottom: CRF, sparse coding and image patches. With sparse coding as an intermediate layer, CRF is learned in a feature-adaptive manner; meanwhile with CRF as the output layer, the dictionary is learned under structured supervision. For efficient and effective joint learning, we develop a max-margin approach via a stochastic gradient descent algorithm. Experimental results on the Graz-02 and PASCAL VOC datasets show that our model performs favorably against state-of-the-art top-down saliency methods for target object localization. In addition, the dictionary update significantly improves the performance of our model. We demonstrate the merits of the proposed top-down saliency model by applying it to prioritizing object proposals for detection and predicting human fixations.
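
Schematically, and with notation chosen here rather than taken from the paper, the layered model can be read as a CRF whose unary potentials are linear in the sparse codes of each image patch, with the dictionary D and the weights w learned jointly:

E(\mathbf{y} \mid \mathbf{x}; \mathbf{w}, D) =
    \sum_i \mathbf{w}^{\top} \boldsymbol{\alpha}_i(D)\, y_i
    + \sum_{(i,j) \in \mathcal{E}} \psi(y_i, y_j),
\qquad
\boldsymbol{\alpha}_i(D) = \arg\min_{\boldsymbol{\alpha}}
    \tfrac{1}{2} \lVert \mathbf{x}_i - D\boldsymbol{\alpha} \rVert_2^2
    + \lambda \lVert \boldsymbol{\alpha} \rVert_1 .

Max-margin learning then updates both w and D by stochastic gradient descent on a structured hinge loss, which is the joint training the abstract refers to.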

213 citations


Posted Content
TL;DR: This work presents a transformation-grounded image generation network for novel 3D view synthesis from a single image that first explicitly infers the parts of the geometry visible both in the input and novel views and then casts the remaining synthesis problem as image completion.
Abstract: We present a transformation-grounded image generation network for novel 3D view synthesis from a single image. Instead of taking a 'blank slate' approach, we first explicitly infer the parts of the geometry visible in both the input and novel views and then re-cast the remaining synthesis problem as image completion. Specifically, we predict both a flow that moves pixels from the input to the novel view and a visibility map that helps deal with occlusion/disocclusion. Next, conditioned on those intermediate results, we hallucinate (infer) the parts of the object invisible in the input image. In addition to the new network structure, training with a combination of adversarial and perceptual losses reduces common artifacts of novel view synthesis such as distortions and holes, while successfully generating high-frequency details and preserving visual aspects of the input image. We evaluate our approach on a wide range of synthetic and real examples. Both qualitative and quantitative results show our method achieves significantly better results than existing methods.

Proceedings Article
25 Jun 2017
TL;DR: To the best of the knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.
Abstract: We propose a deep neural network for the prediction of future frames in natural video sequences. To effectively handle complex evolution of pixels in videos, we propose to decompose the motion and content, two key components generating dynamics in videos. Our model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional LSTM for pixel-level prediction, which independently capture the spatial layout of an image and the corresponding temporal dynamics. By independently modeling motion and content, predicting the next frame reduces to converting the extracted content features into the next frame content by the identified motion features, which simplifies the task of prediction. Our model is end-to-end trainable over multiple time steps, and naturally learns to decompose motion and content without separate training. We evaluate the proposed network architecture on human activity videos using KTH, Weizmann action, and UCF-101 datasets. We show state-of-the-art performance in comparison to recent approaches. To the best of our knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.

Proceedings ArticleDOI
23 Mar 2017
TL;DR: It is argued that jointly modeling the two modalities, i.e., learning word-to-image interaction directly, is more natural for the image segmentation task, and a convolutional multimodal LSTM is proposed to encode the sequential interactions between individual words, visual information, and spatial information.
Abstract: In this paper we are interested in the problem of image segmentation given natural language descriptions, i.e., referring expressions. Existing works tackle this problem by first modeling images and sentences independently and then segmenting images by combining the two types of representations. We argue that learning word-to-image interaction directly, by jointly modeling the two modalities, is more natural for the image segmentation task, and we propose a convolutional multimodal LSTM to encode the sequential interactions between individual words, visual information, and spatial information. We show that our proposed model outperforms the baseline model on benchmark datasets. In addition, we analyze the intermediate output of the proposed multimodal LSTM approach and empirically explain how this approach enforces a more effective word-to-image interaction.
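
A sketch of how the per-word input to a convolutional multimodal LSTM might be assembled: the word embedding is tiled over the spatial grid and stacked with the visual features and a coordinate map, and the sentence is consumed one word at a time. The exact channels and the toy cell below are assumptions, not the published design.

import numpy as np

def word_step_input(word_emb, visual_feat):
    """Per-word input: tile the word embedding over the spatial grid and stack it
    with the visual features and a normalized two-channel coordinate map."""
    h, w, _ = visual_feat.shape
    tiled = np.broadcast_to(word_emb, (h, w, word_emb.shape[0]))
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys / (h - 1), xs / (w - 1)], axis=-1)
    return np.concatenate([visual_feat, tiled, coords], axis=-1)

rng = np.random.default_rng(0)
visual = rng.standard_normal((20, 20, 256))
words = [rng.standard_normal(300) for _ in range(6)]             # one embedding per word
conv_lstm = lambda x, s: np.tanh(0.5 * s + 0.01 * x[..., :256])  # toy stand-in for the learned cell
state = np.zeros((20, 20, 256))
for emb in words:                                                # the sentence is consumed word by word
    state = conv_lstm(word_step_input(emb, visual), state)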

Proceedings ArticleDOI
04 Aug 2017
TL;DR: 3D-PRNN is a generative recurrent neural network that synthesizes multiple plausible shapes composed of a set of primitives, preserving long-range structural coherence and describing objects of varying complexity with a compact representation.
Abstract: The success of various applications, including robotics, digital content creation, and visualization, demands a structured and abstract representation of the 3D world from limited sensor data. Inspired by human perception of 3D shapes as collections of simple parts, we explore such an abstract shape representation based on primitives. Given a single depth image of an object, we present 3D-PRNN, a generative recurrent neural network that synthesizes multiple plausible shapes composed of a set of primitives. Our generative model encodes symmetry characteristics of common man-made objects, preserves long-range structural coherence, and describes objects of varying complexity with a compact representation. We also propose a method based on Gaussian Fields to generate a large-scale dataset of primitive-based shape representations to train our network. We evaluate our approach on a wide range of examples and show that it outperforms nearest-neighbor based shape retrieval methods and is on par with voxel-based generative models while using a significantly reduced parameter space.
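
A sketch of the sequential generation loop suggested by the abstract: a recurrent cell conditioned on depth features emits one primitive per step until a stop signal fires. The primitive parameterisation, state size, and stopping rule are guesses, not the published details.

import numpy as np

def generate_primitives(depth_feat, rnn_step, max_prims=20):
    """Emit a variable-length sequence of box primitives, one per recurrent step.
    `rnn_step` returns (params, stop_prob, new_state); `params` packs, e.g., the
    size, translation and rotation of one box."""
    prims, state = [], np.zeros(128)
    for _ in range(max_prims):
        params, stop_prob, state = rnn_step(depth_feat, state)
        prims.append(params)
        if stop_prob > 0.5:              # the network decides when the shape is complete
            break
    return prims

rng = np.random.default_rng(0)
def rnn_step(feat, state):               # toy stand-in for the learned recurrent cell
    state = np.tanh(state + 0.1 * feat[:128])
    return state[:10], rng.random(), state

prims = generate_primitives(rng.standard_normal(512), rnn_step)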

Proceedings ArticleDOI
Yu-Wei Chao1, Jimei Yang2, Brian Price2, Scott Cohen2, Jia Deng1 
21 Jul 2017
TL;DR: The 3D Pose Forecasting Network (3D-PFNet) is proposed, which integrates recent advances on single-image human pose estimation and sequence prediction, and converts the 2D predictions into 3D space.
Abstract: This paper presents the first study on forecasting human dynamics from static images. The problem is to input a single RGB image and generate a sequence of upcoming human body poses in 3D. To address the problem, we propose the 3D Pose Forecasting Network (3D-PFNet). Our 3D-PFNet integrates recent advances on single-image human pose estimation and sequence prediction, and converts the 2D predictions into 3D space. We train our 3D-PFNet using a three-step training strategy to leverage a diverse source of training data, including image and video based human pose datasets and 3D motion capture (MoCap) data. We demonstrate competitive performance of our 3D-PFNet on 2D pose forecasting and 3D structure recovery through quantitative and qualitative results.
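
A toy sketch of the two stages described above: a recurrent decoder forecasts a sequence of 2D poses from a single image's features, and a second module lifts each 2D pose into 3D. Both modules are illustrative stand-ins, not the published 3D-PFNet components.

import numpy as np

def forecast_dynamics(image_feat, rnn_step, lift_to_3d, horizon=16):
    """Forecast a sequence of 2D poses from one image, then lift each pose to 3D."""
    poses_2d, state = [], np.zeros(256)
    for _ in range(horizon):
        pose_2d, state = rnn_step(image_feat, state)
        poses_2d.append(pose_2d)
    return poses_2d, [lift_to_3d(p) for p in poses_2d]

rng = np.random.default_rng(0)
def rnn_step(feat, state):                # toy stand-in for the pose-sequence decoder
    state = np.tanh(state + 0.05 * feat[:256])
    return state[:26].reshape(13, 2), state

lift_to_3d = lambda p2d: np.concatenate([p2d, np.zeros((13, 1))], axis=1)   # trivial depth guess
poses_2d, poses_3d = forecast_dynamics(rng.standard_normal(512), rnn_step, lift_to_3d)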

Posted Content
TL;DR: 3D-PRNN is presented, a generative recurrent neural network that synthesizes multiple plausible shapes composed of a set of primitives; it outperforms nearest-neighbor based shape retrieval methods and is on par with voxel-based generative models while using a significantly reduced parameter space.
Abstract: The success of various applications, including robotics, digital content creation, and visualization, demands a structured and abstract representation of the 3D world from limited sensor data. Inspired by human perception of 3D shapes as collections of simple parts, we explore such an abstract shape representation based on primitives. Given a single depth image of an object, we present 3D-PRNN, a generative recurrent neural network that synthesizes multiple plausible shapes composed of a set of primitives. Our generative model encodes symmetry characteristics of common man-made objects, preserves long-range structural coherence, and describes objects of varying complexity with a compact representation. We also propose a method based on Gaussian Fields to generate a large-scale dataset of primitive-based shape representations to train our network. We evaluate our approach on a wide range of examples and show that it outperforms nearest-neighbor based shape retrieval methods and is on par with voxel-based generative models while using a significantly reduced parameter space.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: It is shown experimentally that the learned predictive features in the model significantly enhance video parsing performance when combined with a standard image parsing network.
Abstract: Video scene parsing is challenging for two reasons: first, it is non-trivial to learn meaningful video representations for producing temporally consistent labeling maps; second, such a learning process becomes more difficult with insufficient labeled video training data. In this work, we propose a unified framework to address both problems; to our knowledge it is the first model to employ predictive feature learning for video scene parsing. The predictive feature learning is carried out in two predictive tasks: frame prediction and predictive parsing. We show experimentally that the learned predictive features significantly enhance video parsing performance when combined with a standard image parsing network. Interestingly, the performance gain brought by the predictive learning is almost costless, as the features are learned from a large amount of unlabeled video data in an unsupervised way. Extensive experiments on two challenging datasets, Cityscapes and CamVid, demonstrate the effectiveness of our model, showing remarkable improvement over well-established baselines.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this article, the authors propose an end-to-end network architecture that replicates the forward image formation process to disentangle intrinsic physical properties of an image, i.e. shape, illumination, and material.
Abstract: The ability to edit the materials of objects in images is desirable to many content creators. However, this is an extremely challenging task, as it requires disentangling the intrinsic physical properties of an image. We propose an end-to-end network architecture that replicates the forward image formation process to accomplish this task. Specifically, given a single image, the network first predicts intrinsic properties, i.e., shape, illumination, and material, which are then provided to a rendering layer. This layer performs in-network image synthesis, thereby enabling the network to understand the physics behind the image formation process. The proposed rendering layer is fully differentiable, supports both diffuse and specular materials, and is thus applicable in a variety of problem settings. We demonstrate a rich set of visually plausible material editing examples and provide an extensive comparative study.
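
A minimal sketch of what an in-network rendering layer of this shape can look like, here a single directional light with diffuse and specular (Phong-style) terms; the actual layer supports richer illumination, so treat this as an assumption-laden illustration rather than the paper's formulation.

import numpy as np

def shade(normals, albedo, light_dir, view_dir, spec_strength=0.5, shininess=32.0):
    """Diffuse-plus-specular shading of per-pixel normals under one directional
    light; a differentiable layer of this shape lets gradients flow back to
    shape, material and illumination."""
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)
    n_dot_l = np.clip(np.einsum('hwc,c->hw', normals, l), 0.0, None)
    diffuse = albedo * n_dot_l[..., None]
    r = 2.0 * n_dot_l[..., None] * normals - l            # reflect the light about the normal
    specular = np.clip(np.einsum('hwc,c->hw', r, v), 0.0, None) ** shininess
    return np.clip(diffuse + spec_strength * specular[..., None], 0.0, 1.0)

rng = np.random.default_rng(0)
normals = rng.standard_normal((64, 64, 3))
normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
img = shade(normals, rng.random((64, 64, 3)), np.array([0.3, 0.5, 1.0]), np.array([0.0, 0.0, 1.0]))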

Proceedings ArticleDOI
01 Jan 2017
TL;DR: A novel segmentation approach uses a rectangle as a soft constraint by transforming it into a Euclidean distance map; it produces accurate segmentation results from sloppy rectangles while remaining general enough for both interactive segmentation and instance segmentation.
Abstract: Most previous bounding-box-based segmentation methods assume the bounding box tightly covers the object of interest. In practice, however, a rectangle input is often too large or too small. In this paper, we propose a novel segmentation approach that uses a rectangle as a soft constraint by transforming it into a Euclidean distance map. A convolutional encoder-decoder network is trained end-to-end by concatenating images with these distance maps as inputs and predicting the object masks as outputs. Our approach produces accurate segmentation results from sloppy rectangles while remaining general enough for both interactive segmentation and instance segmentation. We show that our network extends to curve-based input without retraining. We further apply our network to instance-level semantic segmentation and resolve any overlap using a conditional random field. Experiments on benchmark datasets demonstrate the effectiveness of the proposed approaches.
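
A small sketch of the rectangle-to-distance-map transformation using SciPy's Euclidean distance transform; whether the published map is signed, truncated, or measured to the boundary exactly as below is a detail this sketch does not claim to reproduce.

import numpy as np
from scipy.ndimage import distance_transform_edt

def rectangle_to_distance_map(height, width, x0, y0, x1, y1):
    """Turn a (possibly sloppy) rectangle into a Euclidean distance map that can
    be concatenated with the image as an extra input channel: each pixel stores
    its distance to the nearest point on the rectangle's boundary."""
    boundary = np.zeros((height, width), dtype=bool)
    boundary[y0, x0:x1 + 1] = True
    boundary[y1, x0:x1 + 1] = True
    boundary[y0:y1 + 1, x0] = True
    boundary[y0:y1 + 1, x1] = True
    # distance_transform_edt measures distance to the nearest zero entry,
    # so invert the boundary mask
    return distance_transform_edt(~boundary)

dist = rectangle_to_distance_map(240, 320, x0=60, y0=40, x1=250, y1=200)
# the network input would then be np.concatenate([image, dist[..., None]], axis=-1)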

Proceedings Article
01 Nov 2017
TL;DR: A novel model to simultaneously predict scene parsing and optical flow in unobserved future video frames is proposed and shows significantly better parsing and motion prediction results when compared to well-established baselines and individual prediction models on the large-scale Cityscapes dataset.
Abstract: It is important for intelligent systems, e.g., autonomous vehicles and robots, to anticipate the future in order to plan early and make decisions accordingly. Predicting future scene parsing and motion dynamics helps agents better understand the visual environment, as the former provides dense semantic segmentations, i.e., what objects will be present and where they will appear, while the latter provides dense motion information, i.e., how the objects will move. In this paper, we propose a novel model to predict scene parsing and motion dynamics in unobserved future video frames simultaneously. Using history information (preceding frames and corresponding scene parsing results) as input, our model can predict the scene parsing and motion for an arbitrary number of time steps ahead. More importantly, our model is superior to methods that predict parsing and motion separately, as the complementary relationship between the two tasks is fully exploited through joint learning. To the best of our knowledge, this is the first attempt at jointly predicting scene parsing and motion dynamics in future frames. On the large-scale Cityscapes dataset, our model is shown to produce significantly better parsing and motion prediction results than well-established baselines. In addition, we show that our model can be used to predict the steering angle of the vehicles, which further verifies its ability to learn underlying latent parameters.

Posted Content
TL;DR: A rectangle input is transformed into a Euclidean distance map, and a convolutional encoder-decoder network is trained end-to-end by concatenating images with these distance maps as inputs and predicting the object masks as outputs.
Abstract: Most previous bounding-box-based segmentation methods assume the bounding box tightly covers the object of interest. In practice, however, a rectangle input is often too large or too small. In this paper, we propose a novel segmentation approach that uses a rectangle as a soft constraint by transforming it into a Euclidean distance map. A convolutional encoder-decoder network is trained end-to-end by concatenating images with these distance maps as inputs and predicting the object masks as outputs. Our approach produces accurate segmentation results from sloppy rectangles while remaining general enough for both interactive segmentation and instance segmentation. We show that our network extends to curve-based input without retraining. We further apply our network to instance-level semantic segmentation and resolve any overlap using a conditional random field. Experiments on benchmark datasets demonstrate the effectiveness of the proposed approaches.

Posted Content
TL;DR: A convolutional multimodal LSTM is proposed to encode the sequential interactions between individual words, visual information, and spatial information for image segmentation given natural language descriptions.
Abstract: In this paper we are interested in the problem of image segmentation given natural language descriptions, i.e., referring expressions. Existing works tackle this problem by first modeling images and sentences independently and then segmenting images by combining the two types of representations. We argue that learning word-to-image interaction directly, by jointly modeling the two modalities, is more natural for the image segmentation task, and we propose a convolutional multimodal LSTM to encode the sequential interactions between individual words, visual information, and spatial information. We show that our proposed model outperforms the baseline model on benchmark datasets. In addition, we analyze the intermediate output of the proposed multimodal LSTM approach and empirically explain how this approach enforces a more effective word-to-image interaction.

Posted Content
TL;DR: This work proposes a new FoveaNet model that fully exploits the perspective geometry of scene images to address the common failures of generic parsing models, and introduces a new dense CRF model that takes the perspective geometry as a prior potential.
Abstract: Parsing urban scene images benefits many applications, especially self-driving. Most current solutions employ generic image parsing models that treat all scales and locations in the images equally and do not consider the geometric properties of car-captured urban scene images. Thus, they suffer from heterogeneous object scales caused by the perspective projection of cameras onto actual scenes and inevitably encounter parsing failures on distant objects, as well as other boundary and recognition errors. In this work, we propose a new FoveaNet model that fully exploits the perspective geometry of scene images and addresses the common failures of generic parsing models. FoveaNet estimates the perspective geometry of a scene image through a convolutional network that integrates supportive evidence from contextual objects within the image. Based on the perspective geometry information, FoveaNet "undoes" the camera perspective projection by analyzing regions in the space of the actual scene, and thus provides much more reliable parsing results. Furthermore, to effectively address recognition errors, FoveaNet introduces a new dense CRF model that takes the perspective geometry as a prior potential. We evaluate FoveaNet on two urban scene parsing datasets, Cityscapes and CamVid, and demonstrate that it outperforms all well-established baselines and achieves new state-of-the-art performance.

Posted Content
TL;DR: In this article, a deep generative feed-forward network is proposed that enables efficient synthesis of multiple textures within a single network and meaningful interpolation between them, while a suite of important techniques is introduced to achieve better convergence and diversity.
Abstract: Recent progress on deep discriminative and generative modeling has shown promising results on texture synthesis. However, existing feed-forward methods trade off generality for efficiency and suffer from several issues, such as limited generality (i.e., one network per texture), lack of diversity (i.e., visually identical outputs), and suboptimality (i.e., less satisfying visual effects). In this work, we focus on solving these issues for improved texture synthesis. We propose a deep generative feed-forward network that enables efficient synthesis of multiple textures within a single network and meaningful interpolation between them. Meanwhile, a suite of important techniques is introduced to achieve better convergence and diversity. With extensive experiments, we demonstrate the effectiveness of the proposed model and techniques for synthesizing a large number of textures and show its application to stylization.

Proceedings ArticleDOI
08 Aug 2017
TL;DR: FoveaNet as discussed by the authors estimates the perspective geometry of a scene image through a convolutional network which integrates supportive evidence from contextual objects within the image, and thus provides much more reliable parsing results.
Abstract: Parsing urban scene images benefits many applications, especially self-driving. Most current solutions employ generic image parsing models that treat all scales and locations in the images equally and do not consider the geometric properties of car-captured urban scene images. Thus, they suffer from heterogeneous object scales caused by the perspective projection of cameras onto actual scenes and inevitably encounter parsing failures on distant objects, as well as other boundary and recognition errors. In this work, we propose a new FoveaNet model that fully exploits the perspective geometry of scene images and addresses the common failures of generic parsing models. FoveaNet estimates the perspective geometry of a scene image through a convolutional network that integrates supportive evidence from contextual objects within the image. Based on the perspective geometry information, FoveaNet "undoes" the camera perspective projection by analyzing regions in the space of the actual scene, and thus provides much more reliable parsing results. Furthermore, to effectively address recognition errors, FoveaNet introduces a new dense CRF model that takes the perspective geometry as a prior potential. We evaluate FoveaNet on two urban scene parsing datasets, Cityscapes and CamVid, and demonstrate that it outperforms all well-established baselines and achieves new state-of-the-art performance.

Posted Content
TL;DR: In this paper, a model is proposed to simultaneously predict scene parsing and optical flow in unobserved future video frames; scene parsing enables structured motion prediction by decomposing optical flow into semantic groups, and the model can also predict the steering angle of vehicles, indicating that it learns latent representations of scene dynamics.
Abstract: The ability to predict the future is important for intelligent systems, e.g., autonomous vehicles and robots, to plan early and make decisions accordingly. Future scene parsing and optical flow estimation are two key tasks that help agents better understand their environments, as the former provides dense semantic information, i.e., what objects will be present and where they will appear, while the latter provides dense motion information, i.e., how the objects will move. In this paper, we propose a novel model to simultaneously predict scene parsing and optical flow in unobserved future video frames. To the best of our knowledge, this is the first attempt at jointly predicting scene parsing and motion dynamics. In particular, scene parsing enables structured motion prediction by decomposing optical flow into different groups, while optical flow estimation brings reliable pixel-wise correspondence to scene parsing. By exploiting this mutually beneficial relationship, our model shows significantly better parsing and motion prediction results than well-established baselines and individual prediction models on the large-scale Cityscapes dataset. In addition, we demonstrate that our model can be used to predict the steering angle of the vehicles, which further verifies its ability to learn latent representations of scene dynamics.

Posted Content
TL;DR: This work proposes an end-to-end network architecture that replicates the forward image formation process to accomplish material editing, and demonstrates a rich set of visually plausible material editing examples and provides an extensive comparative study.
Abstract: The ability to edit the materials of objects in images is desirable to many content creators. However, this is an extremely challenging task, as it requires disentangling the intrinsic physical properties of an image. We propose an end-to-end network architecture that replicates the forward image formation process to accomplish this task. Specifically, given a single image, the network first predicts intrinsic properties, i.e., shape, illumination, and material, which are then provided to a rendering layer. This layer performs in-network image synthesis, thereby enabling the network to understand the physics behind the image formation process. The proposed rendering layer is fully differentiable, supports both diffuse and specular materials, and is thus applicable in a variety of problem settings. We demonstrate a rich set of visually plausible material editing examples and provide an extensive comparative study.

Posted Content
Yu-Wei Chao1, Jimei Yang2, Brian Price2, Scott Cohen2, Jia Deng1 
TL;DR: The 3D Pose Forecasting Network (3D-PFNet) is proposed; it integrates recent advances in single-image human pose estimation and sequence prediction, and converts the 2D predictions into 3D space.
Abstract: This paper presents the first study on forecasting human dynamics from static images. The problem is to input a single RGB image and generate a sequence of upcoming human body poses in 3D. To address the problem, we propose the 3D Pose Forecasting Network (3D-PFNet). Our 3D-PFNet integrates recent advances on single-image human pose estimation and sequence prediction, and converts the 2D predictions into 3D space. We train our 3D-PFNet using a three-step training strategy to leverage a diverse source of training data, including image and video based human pose datasets and 3D motion capture (MoCap) data. We demonstrate competitive performance of our 3D-PFNet on 2D pose forecasting and 3D pose recovery through quantitative and qualitative results.

Patent
07 Apr 2017
TL;DR: A forecasting neural network extracts features from an input image, a recurrent neural network forecasts features from those extracted features, pose is predicted from the forecasted features, and additional poses are forecasted from additional forecasted features.
Abstract: A forecasting neural network receives data and extracts features from the data. A recurrent neural network included in the forecasting neural network provides forecasted features based on the extracted features. In an embodiment, the forecasting neural network receives an image, and features of the image are extracted. The recurrent neural network forecasts features based on the extracted features, and pose is forecasted based on the forecasted features. Additionally or alternatively, additional poses are forecasted based on additional forecasted features.

Patent
18 Jan 2017
TL;DR: A generator network is trained to synthesize texture images conditioned on a selection unit input and is then used to synthesize texture images for selected style images; the generator network can be configured to minimize a covariance-matrix-based style loss and/or a diversity loss while synthesizing the texture images.
Abstract: Systems and techniques that synthesize an image with similar texture to a selected style image. A generator network is trained to synthesize texture images depending on a selection unit input. The training configures the generator network to synthesize texture images that are similar to individual style images of multiple style images based on which is selected by the selection unit input. The generator network can be configured to minimize a covariance matrix-based style loss and/or a diversity loss in synthesizing the texture images. After training the generator network, the generator network is used to synthesize texture images for selected style images. For example, this can involve receiving user input selecting a selected style image, determining the selection unit input based on the selected style image, and synthesizing texture images using the generator network with the selection unit input and noise input.