Author

Jie Hu

Bio: Jie Hu is an academic researcher from Xiamen University. The author has contributed to research in topics: Image translation & Feature (computer vision). The author has an h-index of 6 and has co-authored 12 publications receiving 102 citations.

Papers
Proceedings ArticleDOI
02 Mar 2021
TL;DR: Li et al. propose Hierarchical Style Disentanglement (HiSD), which organizes labels into a hierarchy of independent tags, exclusive attributes, and disentangled styles from top to bottom for controllable image-to-image translation.
Abstract: Recently, image-to-image translation has made significant progress in achieving both multi-label (i.e., translation conditioned on different labels) and multi-style (i.e., generation with diverse styles) tasks. However, because the independence and exclusiveness among the labels are left unexplored, existing methods suffer from uncontrolled manipulations of the translation results. In this paper, we propose Hierarchical Style Disentanglement (HiSD) to address this issue. Specifically, we organize the labels into a hierarchical tree structure, in which independent tags, exclusive attributes, and disentangled styles are allocated from top to bottom. Correspondingly, a new translation process is designed to adapt to this structure, in which the styles are identified for controllable translations. Both qualitative and quantitative results on the CelebA-HQ dataset verify the ability of the proposed HiSD. The code has been released at https://github.com/imlixinyang/HiSD.
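To make the tag/attribute/style hierarchy concrete, below is a minimal, hypothetical PyTorch sketch of how labels could be organized and how a sampled style code might be injected during one translation step. The tag names, module sizes, and injection scheme are assumptions made for illustration; this is not the released HiSD code.

```python
# Hypothetical sketch of HiSD's label hierarchy and a style-conditioned translation step.
# Tag/attribute names and module sizes are illustrative assumptions, not the released code.
import torch
import torch.nn as nn

# Independent tags, each with mutually exclusive attributes (top two levels of the tree).
HIERARCHY = {
    "bangs": ["with", "without"],
    "glasses": ["with", "without"],
    "hair_color": ["black", "blond", "brown"],
}

class Mapper(nn.Module):
    """Maps random noise to a disentangled style code for one (tag, attribute) pair."""
    def __init__(self, noise_dim=32, style_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(),
                                 nn.Linear(128, style_dim))

    def forward(self, z):
        return self.net(z)

class Translator(nn.Module):
    """Injects the style code of the selected tag into image features (bottom level)."""
    def __init__(self, feat_ch=64, style_dim=64):
        super().__init__()
        self.to_scale = nn.Linear(style_dim, feat_ch)
        self.conv = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)

    def forward(self, feat, style):
        scale = self.to_scale(style).unsqueeze(-1).unsqueeze(-1)  # broadcast over H, W
        return self.conv(feat * (1 + scale))

# One translation step: pick a tag, sample a style for one of its attributes, edit features.
# (A full model would keep one mapper/translator per tag; a single pair is shown here.)
feat = torch.randn(1, 64, 32, 32)      # encoder features of an input face
mapper, translator = Mapper(), Translator()
style = mapper(torch.randn(1, 32))     # latent-guided style for, e.g., ("bangs", "with")
edited = translator(feat, style)
print(edited.shape)                    # torch.Size([1, 64, 32, 32])
```

In this reading, choosing a tag selects which part of the model is used, choosing an attribute selects which style distribution is sampled, and the sampled style determines the concrete appearance of the edit.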

67 citations

Journal ArticleDOI
TL;DR: This paper presents a novel face sketch synthesis method by multidomain adversarial learning (termed MDAL), which overcomes the defects of blurs and deformations toward high-quality synthesis.
Abstract: Given a training set of face photo-sketch pairs, face sketch synthesis aims at learning a mapping from the photo domain to the sketch domain. Despite the exciting progress made in the literature, it remains an open problem to synthesize high-quality sketches free of blurs and deformations. Recent advances in generative adversarial training provide a new insight into face sketch synthesis, from which perspective the existing synthesis pipelines can be fundamentally revisited. In this paper, we present a novel face sketch synthesis method based on multidomain adversarial learning (termed MDAL), which overcomes the defects of blurs and deformations toward high-quality synthesis. The principle of our scheme relies on the concept of "interpretation through synthesis." In particular, we first interpret face photographs in the photo domain and face sketches in the sketch domain by reconstructing each via adversarial learning. We define the intermediate products of the reconstruction process as latent variables, which form a latent domain. Second, via adversarial learning, we make the distributions of the latent variables indistinguishable between the reconstruction of a face photograph and that of a face sketch. Finally, given an input face photograph, the latent variable obtained by reconstructing it is used to synthesize the corresponding sketch. Quantitative comparisons to state-of-the-art methods demonstrate the superiority of the proposed MDAL method.
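As a rough mental model of the latent-domain alignment described above, the following hedged sketch uses two tiny MLP auto-encoders and a latent discriminator; the flattened-image representation and all layer sizes are assumptions made for brevity, not the authors' architecture.

```python
# Minimal sketch of MDAL-style latent-domain alignment, assuming simple MLP auto-encoders;
# layer sizes are illustrative and this is not the authors' implementation.
import torch
import torch.nn as nn

def mlp(din, dout):
    return nn.Sequential(nn.Linear(din, 256), nn.ReLU(), nn.Linear(256, dout))

enc_photo, dec_photo = mlp(1024, 64), mlp(64, 1024)    # photo-domain auto-encoder
enc_sketch, dec_sketch = mlp(1024, 64), mlp(64, 1024)  # sketch-domain auto-encoder
latent_disc = mlp(64, 1)   # discriminator that tells photo latents from sketch latents

photo, sketch = torch.randn(8, 1024), torch.randn(8, 1024)  # flattened toy images

# 1) Interpret each domain by reconstructing itself (reconstruction losses).
z_p, z_s = enc_photo(photo), enc_sketch(sketch)
rec_loss = ((dec_photo(z_p) - photo) ** 2).mean() + ((dec_sketch(z_s) - sketch) ** 2).mean()

# 2) Adversarially make the two latent distributions indistinguishable (discriminator side).
bce = nn.BCEWithLogitsLoss()
adv_loss = bce(latent_disc(z_p), torch.ones(8, 1)) + bce(latent_disc(z_s), torch.zeros(8, 1))

# 3) Inference: the latent of a photo is decoded by the sketch decoder to synthesize a sketch.
with torch.no_grad():
    synthesized_sketch = dec_sketch(enc_photo(photo))
print(rec_loss.item(), adv_loss.item(), synthesized_sketch.shape)
```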

59 citations

Proceedings ArticleDOI
13 Jul 2018
TL;DR: A novel generative adversarial network termed pGAN is proposed, which can generate face sketches efficiently using training data under fixed conditions and handle the aforementioned uncontrolled conditions.
Abstract: Despite the extensive progress in face sketch synthesis, existing methods are mostly workable only under constrained conditions, such as fixed illumination, pose, background, and ethnic origin, which are hard to control in real-world scenarios. The key issue lies in the difficulty of using data collected under fixed conditions to train a model that is robust to imaging variations. In this paper, we propose a novel generative adversarial network termed pGAN, which can generate face sketches efficiently using training data captured under fixed conditions while handling the aforementioned uncontrolled conditions. In pGAN, we embed key photo priors into the synthesis process and design a parametric sigmoid activation function to compensate for illumination variations. Compared to existing methods, we quantitatively demonstrate that the proposed method works well on face photos in the wild.
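The parametric sigmoid mentioned above can be pictured as a sigmoid whose slope and shift are learnable. The sketch below is an assumed parameterization for illustration only, since the abstract does not spell out the exact form used in pGAN.

```python
# A hedged sketch of a "parametric sigmoid" activation with learnable slope and shift;
# the exact parameterization used in pGAN is not given here, so this form is an assumption.
import torch
import torch.nn as nn

class ParametricSigmoid(nn.Module):
    def __init__(self):
        super().__init__()
        self.slope = nn.Parameter(torch.ones(1))   # controls contrast of the tone mapping
        self.shift = nn.Parameter(torch.zeros(1))  # controls the brightness offset

    def forward(self, x):
        # A learned slope/shift remaps intensities, compensating illumination variation.
        return torch.sigmoid(self.slope * (x - self.shift))

x = torch.randn(1, 3, 64, 64)        # a toy face-photo tensor
print(ParametricSigmoid()(x).shape)  # torch.Size([1, 3, 64, 64])
```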

33 citations

Proceedings ArticleDOI
15 Jun 2019
TL;DR: This paper proposes a Hybrid Auto-Encoder (HAE) to translate visual features, which learns a mapping by minimizing the translation and reconstruction errors, and an Undirected Affinity Measurement (UAM) is further designed to quantify the affinity among different types of visual features.
Abstract: Most existing visual search systems are deployed based upon fixed kinds of visual features, which prohibits feature reuse across different systems or when upgrading a system with a new type of feature. Such a setting is obviously inflexible and time/memory consuming, which could be remedied if visual features could be "translated" across systems. In this paper, we make the first attempt at visual feature translation to break through the barrier of using features across different visual search systems. To this end, we propose a Hybrid Auto-Encoder (HAE) to translate visual features, which learns a mapping by minimizing the translation and reconstruction errors. Based upon HAE, an Undirected Affinity Measurement (UAM) is further designed to quantify the affinity among different types of visual features. Extensive experiments have been conducted on several public datasets with sixteen different types of widely used features in visual search systems. Quantitative results show the encouraging possibilities of feature translation. For the first time, the affinity among widely used features like SIFT and DELF is reported.
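A minimal sketch of the translation-plus-reconstruction objective described above, assuming simple fully connected layers and arbitrary feature dimensions; both are illustrative choices, not the paper's configuration.

```python
# Illustrative hybrid auto-encoder that translates source features into target features
# while also reconstructing the source; dimensions and layers are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridAutoEncoder(nn.Module):
    def __init__(self, src_dim=128, tgt_dim=256, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(src_dim, hidden), nn.ReLU())
        self.dec_translate = nn.Linear(hidden, tgt_dim)    # source -> target feature space
        self.dec_reconstruct = nn.Linear(hidden, src_dim)  # source -> source reconstruction

    def forward(self, src):
        h = self.encoder(src)
        return self.dec_translate(h), self.dec_reconstruct(h)

hae = HybridAutoEncoder()
src_feat = torch.randn(32, 128)   # one feature type extracted from a batch of images
tgt_feat = torch.randn(32, 256)   # the paired feature of another type for the same images

translated, reconstructed = hae(src_feat)
loss = F.mse_loss(translated, tgt_feat) + F.mse_loss(reconstructed, src_feat)
loss.backward()
print(loss.item())
```

Under this reading, the residual translation error between feature types is also what an affinity measure such as UAM could be built on, since feature pairs that translate with low error are more closely related.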

18 citations

Posted Content
TL;DR: An Attribute Guided UIT model termed AGUIT is proposed to tackle the multi-modal and multi-domain tasks of UIT jointly under a novel semi-supervised setting, which also benefits representation disentanglement and fine control of outputs.
Abstract: Unpaired Image-to-Image Translation (UIT) focuses on translating images among different domains using unpaired data, and has received increasing research attention due to its practical usage. However, existing UIT schemes suffer from the need for supervised training as well as the lack of encoded domain information. In this paper, we propose an Attribute Guided UIT model termed AGUIT to tackle these two challenges. AGUIT handles the multi-modal and multi-domain tasks of UIT jointly under a novel semi-supervised setting, which also benefits representation disentanglement and fine control of outputs. In particular, AGUIT benefits in two ways: (1) It adopts a novel semi-supervised learning process by translating attributes of labeled data to unlabeled data, and then reconstructing the unlabeled data through a cycle-consistency operation. (2) It decomposes the image representation into a domain-invariant content code and a domain-specific style code. The redesigned style code embeds image style into two variables, drawn from a standard Gaussian distribution and from the distribution of domain labels, which facilitates fine control of the translation because both variables are continuous. Finally, we introduce a new challenge for UIT models, i.e., disentangled transfer, which uses the disentangled representation to translate data less related to the training set. Extensive experiments demonstrate the capacity of AGUIT over existing state-of-the-art models.
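The content/style split described above can be illustrated with a toy example; the encoders, shapes, and the way the Gaussian-like part and the domain-label part are concatenated below are assumptions for illustration only, not the authors' exact design.

```python
# A rough sketch of AGUIT's representation split: a domain-invariant content code plus a
# style code built from a standard-Gaussian part and a domain-label part (both assumed).
import torch
import torch.nn as nn

content_enc = nn.Sequential(nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU())
style_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 8))  # Gaussian-like part

img = torch.rand(1, 3, 64, 64)
domain_label = torch.tensor([[1.0, 0.0, 0.0]])   # e.g. a one-hot label for one domain

content = content_enc(img)                                 # domain-invariant content code
style = torch.cat([style_enc(img), domain_label], dim=1)   # style = [Gaussian part | label part]

# Fine control: because both parts are continuous, either can be edited or interpolated
# before decoding (a real model would inject `style_edit` into a decoder, e.g. via AdaIN).
style_edit = style.clone()
style_edit[:, -3:] = torch.tensor([0.5, 0.5, 0.0])         # blend two domains
print(content.shape, style_edit.shape)
```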

16 citations


Cited by
Journal ArticleDOI
TL;DR: The recent research on GANs in the field of image processing, including image synthesis, image generation, image semantic editing, image-to-image translation, image super-resolution, image inpainting, and cartoon generation is introduced.
Abstract: Generative Adversarial Networks (GANs) have achieved impressive results in various image synthesis tasks and have become a hot topic in computer vision research. In this paper, we introduce recent research on GANs in the field of image processing, including image synthesis, image generation, image semantic editing, image-to-image translation, image super-resolution, image inpainting, and cartoon generation. We analyze and summarize the methods these applications use to improve the generated results. Then, we discuss the challenges faced by GANs and introduce some methods to deal with these problems. We also preview likely future research directions in the field of GANs, such as video generation, facial animation synthesis, and 3D face reconstruction. The purpose of this review is to provide insights into the research on GANs and to present the various applications based on GANs in different scenarios.

83 citations

Journal ArticleDOI
TL;DR: This survey provides a comprehensive review of adversarial models for image synthesis, summarizes synthetic image generation methods, and discusses categories including image-to-image translation, fusion image generation, label-to-image mapping, and text-to-image translation.

80 citations

Journal ArticleDOI
TL;DR: Zhang et al. propose a novel composition-aided generative adversarial network (CA-GAN) for face photo-sketch synthesis, which utilizes paired inputs, including a face photo/sketch and the corresponding pixelwise face labels, for generating a sketch/photo.
Abstract: Face photo–sketch synthesis aims at generating a facial sketch/photo conditioned on a given photo/sketch. It has wide applications including digital entertainment and law enforcement. Precisely depicting face photos/sketches remains challenging due to the restrictions of structural realism and textural consistency. While existing methods achieve compelling results, they mostly yield blurred effects and great deformation over various facial components, giving the synthesized images an unrealistic feeling. To tackle this challenge, in this article we propose using facial composition information to aid the synthesis of face sketches/photos. Specifically, we propose a novel composition-aided generative adversarial network (CA-GAN) for face photo–sketch synthesis. In CA-GAN, we utilize paired inputs, including a face photo/sketch and the corresponding pixelwise face labels, for generating a sketch/photo. Next, to focus training on hard-to-generate components and delicate facial structures, we propose a compositional reconstruction loss. In addition, we employ a perceptual loss function to encourage the synthesized image and the real image to be perceptually similar. Finally, we use stacked CA-GANs (SCA-GANs) to further rectify defects and add compelling details. The experimental results show that our method is capable of generating both visually comfortable and identity-preserving face sketches/photos over a wide range of challenging data. In addition, our method significantly decreases the best previous Fréchet inception distance (FID) from 36.2 to 26.2 for sketch synthesis, and from 60.9 to 30.5 for photo synthesis. Furthermore, we demonstrate that the proposed method has considerable generalization ability.
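One way to read the compositional reconstruction loss is as a pixelwise-weighted reconstruction error whose weights come from the face-label map; the specific weighting below is a hypothetical illustration rather than the loss actually used in CA-GAN.

```python
# Hedged sketch of a composition-weighted reconstruction loss: pixelwise face labels
# up-weight hard-to-generate components. The weighting scheme here is an assumption.
import torch

def compositional_reconstruction_loss(fake, real, face_labels, component_weights):
    """L1 loss where each pixel is weighted according to its facial-component class."""
    # face_labels: (B, H, W) integer map of components; component_weights: (C,) tensor.
    pixel_w = component_weights[face_labels]                   # (B, H, W)
    return (pixel_w.unsqueeze(1) * (fake - real).abs()).mean()

fake = torch.rand(2, 3, 64, 64)
real = torch.rand(2, 3, 64, 64)
labels = torch.randint(0, 4, (2, 64, 64))        # 4 hypothetical component classes
weights = torch.tensor([1.0, 2.0, 2.0, 0.5])     # emphasize delicate facial structures
print(compositional_reconstruction_loss(fake, real, labels, weights).item())
```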

80 citations

Journal ArticleDOI
TL;DR: A novel multimodal emotion recognition model for conversational videos based on reinforcement learning and domain knowledge (ERLDK) is proposed in this paper; it achieves state-of-the-art results on the weighted average and on most of the specific emotion categories.
Abstract: Multimodal emotion recognition in conversational videos (ERC) has developed rapidly in recent years. To fully extract the relative context from video clips, most studies build their models on entire dialogues, which deprives them of real-time ERC ability. Different from related research, this paper proposes a novel multimodal emotion recognition model for conversational videos based on reinforcement learning and domain knowledge (ERLDK). In ERLDK, a reinforcement learning algorithm is introduced to conduct real-time ERC as conversations occur. The history utterances are collected into an emotion-pair, which represents the multimodal context of the next utterance to be recognized. A dueling deep Q-network (DDQN) built on gated recurrent unit (GRU) layers is designed to learn the correct action from the alternative emotion categories. Domain knowledge is extracted from a public dataset based on the prior information of emotion-pairs. The extracted domain knowledge is used to revise the results from the RL module and is transferred to another dataset to examine its rationality. The experimental results show that ERLDK achieves state-of-the-art results on the weighted average and on most of the specific emotion categories.
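To illustrate the DDQN-over-GRU component described above, here is a hedged sketch of a dueling Q-network whose state is a GRU encoding of the emotion-pair context; the feature dimension and number of emotion categories are assumed values, and this is not the authors' implementation.

```python
# Illustrative dueling DQN head over GRU-encoded utterance context, loosely mirroring the
# ERLDK description; feature sizes and the number of emotion categories are assumptions.
import torch
import torch.nn as nn

class DuelingGRUQNet(nn.Module):
    def __init__(self, feat_dim=100, hidden=64, n_emotions=6):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.value = nn.Linear(hidden, 1)               # state-value stream
        self.advantage = nn.Linear(hidden, n_emotions)  # per-emotion advantage stream

    def forward(self, context):
        # context: (batch, timesteps, feat_dim) multimodal features of the emotion-pair history
        _, h = self.gru(context)
        h = h.squeeze(0)
        adv = self.advantage(h)
        # Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return self.value(h) + adv - adv.mean(dim=1, keepdim=True)

q_net = DuelingGRUQNet()
context = torch.randn(4, 2, 100)       # an emotion-pair (two preceding utterances) per sample
action = q_net(context).argmax(dim=1)  # the emotion category chosen for the next utterance
print(action)
```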

71 citations