
Showing papers by "Ran He published in 2021"


Journal Article
TL;DR: A method that edits target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video; the method is end-to-end learnable and robust to voice variations in the source audio.
Abstract: We present a method to edit target portrait footage by taking a sequence of audio as input and synthesizing a photo-realistic video. This method is unique because it is highly dynamic. It does not assume a person-specific rendering network, yet it is capable of translating one source audio into one randomly chosen video output within a set of speech videos. Instead of learning a highly heterogeneous and nonlinear mapping from audio to the video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, therefore preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.

72 citations
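The abstract above describes a pipeline whose core learned component maps per-frame audio features to 3DMM expression parameters. A minimal PyTorch sketch of such an audio-to-expression recurrent network follows; the layer sizes, the 29-dim audio features, and the 64-dim expression space are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Recurrent network mapping per-frame audio features to 3DMM expression parameters."""
    def __init__(self, audio_dim=29, hidden_dim=256, expr_dim=64):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, expr_dim)

    def forward(self, audio_feats):           # audio_feats: (batch, frames, audio_dim)
        hidden, _ = self.rnn(audio_feats)     # temporal context across frames
        return self.head(hidden)              # (batch, frames, expr_dim)

# The predicted expression parameters would replace the expression component of each
# frame's 3D face reconstruction, while geometry and pose are kept from the target video.
model = AudioToExpression()
expr = model(torch.randn(2, 100, 29))         # e.g., 100 frames of per-frame audio features
print(expr.shape)                             # torch.Size([2, 100, 64])
```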


Journal ArticleDOI
TL;DR: A comprehensive survey of recent audio-visual learning development is provided, dividing the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning.
Abstract: Audio-visual learning, aimed at exploiting the relationship between audio and visual modalities, has drawn considerable attention since deep learning started to be used successfully. Researchers tend to leverage these two modalities to improve the performance of previously considered single-modality tasks or address new challenging problems. In this paper, we provide a comprehensive survey of recent audio-visual learning development. We divide the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. State-of-the-art methods, as well as the remaining challenges of each subfield, are further discussed. Finally, we summarize the commonly used datasets and challenges.

57 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this article, a memory-oriented semi-supervised method is proposed in which a self-supervised memory module records the prototypical patterns of rain degradations and is updated in a self-supervised way, enabling the network to explore and exploit the properties of rain streaks from both synthetic and real data.
Abstract: Deep learning based methods have shown dramatic improvements in image rain removal by using large-scale paired data of synthetic datasets. However, due to the various appearances of real rain streaks that may differ from those in the synthetic training data, it is challenging to directly extend existing methods to real-world scenes. To address this issue, we propose a memory-oriented semi-supervised (MOSS) method which enables the network to explore and exploit the properties of rain streaks from both synthetic and real data. The key aspect of our method is designing an encoder-decoder neural network that is augmented with a self-supervised memory module, where items in the memory record the prototypical patterns of rain degradations and are updated in a self-supervised way. Consequently, the rainy styles can be comprehensively derived from synthetic or real-world degraded images without the need for clean labels. Furthermore, we present a self-training mechanism that attempts to transfer deraining knowledge from supervised rain removal to unsupervised cases. An additional target network, which is updated with an exponential moving average of the online deraining network, is utilized to produce pseudo-labels for unlabeled rainy images. Meanwhile, the deraining network is optimized with supervised objectives on both synthetic paired data and pseudo-paired noisy data. Extensive experiments show that the proposed method achieves more appealing results than recent state-of-the-art methods, not only on limited labeled data but also on unlabeled real-world images.

50 citations
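The self-training mechanism above is a mean-teacher-style scheme: a target network tracks an exponential moving average (EMA) of the online deraining network and supplies pseudo-labels for real rainy images. A hedged sketch of that update follows, with assumed parameter names and momentum value.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(online_net, target_net, momentum=0.999):
    """target <- momentum * target + (1 - momentum) * online."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)

def pseudo_pair_loss(online_net, target_net, real_rainy_batch):
    """Pseudo-paired objective on unlabeled real rainy images."""
    with torch.no_grad():
        pseudo_clean = target_net(real_rainy_batch)   # pseudo ground truth from the EMA teacher
    pred = online_net(real_rainy_batch)
    return F.l1_loss(pred, pseudo_clean)

# Typical usage: create target_net as a deep copy of the online deraining network, then
# after each optimizer step (synthetic paired loss + pseudo-paired loss) call
# ema_update(online_net, target_net).
```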


Journal ArticleDOI
TL;DR: A novel framework is proposed that simplifies face manipulation into two correlated stages, a boundary prediction stage and a disentangled face synthesis stage, and dramatically improves the synthesis quality.
Abstract: Face manipulation has shown remarkable advances with the flourish of Generative Adversarial Networks. However, due to the difficulties of controlling structures and textures, it is challenging to model poses and expressions simultaneously, especially for extreme manipulation at high resolution. In this article, we propose a novel framework that simplifies face manipulation into two correlated stages: a boundary prediction stage and a disentangled face synthesis stage. The first stage models poses and expressions jointly via boundary images. Specifically, a conditional encoder-decoder network is employed to predict the boundary image of the target face in a semi-supervised way. Pose and expression estimators are introduced to improve the prediction performance. In the second stage, the predicted boundary image and the input face image are encoded into the structure and the texture latent space by two encoder networks, respectively. A proxy network and a feature threshold loss are further imposed to disentangle the latent space. Furthermore, due to the lack of high-resolution face manipulation databases to verify the effectiveness of our method, we collect a new high-quality Multi-View Face (MVF-HQ) database. It contains 120,283 images at 6000 × 4000 resolution from 479 identities with diverse poses, expressions, and illuminations. MVF-HQ is much larger in scale and much higher in resolution than publicly available high-resolution face manipulation databases. We will release MVF-HQ soon to push forward the advance of face manipulation. Qualitative and quantitative experiments on four databases show that our method dramatically improves the synthesis quality.

46 citations


Proceedings ArticleDOI
Gege Gao, Huaibo Huang, Chaoyou Fu, Zhaoyang Li, Ran He
01 Jun 2021
TL;DR: In this paper, a novel identity disentangling and swapping network, InfoSwap, is proposed to extract the most expressive information for identity representation from a pre-trained face recognition model, formulating the learning of disentangled representations as an information bottleneck tradeoff, i.e., finding an optimal compression of the pre-trained latent features.
Abstract: Improving the performance of face forgery detectors often requires more identity-swapped images of higher quality. One core objective of identity swapping is to generate identity-discriminative faces that are distinct from the target while identical to the source. To this end, properly disentangling identity and identity-irrelevant information is critical and remains a challenging endeavor. In this work, we propose a novel information disentangling and swapping network, called InfoSwap, to extract the most expressive information for identity representation from a pre-trained face recognition model. The key insight of our method is to formulate the learning of disentangled representations as optimizing an information bottleneck tradeoff, in terms of finding an optimal compression of the pre-trained latent features. Moreover, a novel identity contrastive loss is proposed for further disentanglement by requiring a proper distance between the generated identity and the target. While most prior works have focused on using various loss functions to implicitly guide the learning of representations, we demonstrate that our model can provide explicit supervision for learning disentangled representations, achieving impressive performance in generating more identity-discriminative swapped faces.

44 citations
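The information bottleneck tradeoff mentioned above is conventionally written as below, with Z the compressed identity representation, X the pre-trained latent features, and Y the identity; this is the generic IB form, and the exact formulation and weighting used by InfoSwap may differ.

```latex
\max_{p(z \mid x)}\; I(Z; Y) \;-\; \beta\, I(Z; X)
```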


Journal ArticleDOI
Chaoyou Fu, Xiang Wu, Yibo Hu, Huaibo Huang, Ran He
TL;DR: In this article, a dual variational generator is designed to learn the joint distribution of paired heterogeneous images, and a pairwise identity preserving loss is imposed on the generated paired heterogenous images to ensure their identity consistency.
Abstract: Heterogeneous Face Recognition (HFR) refers to matching cross-domain faces and plays a crucial role in public security. Nevertheless, HFR is confronted with challenges from large domain discrepancy and insufficient heterogeneous data. In this paper, we formulate HFR as a dual generation problem, and tackle it via a novel Dual Variational Generation (DVG-Face) framework. Specifically, a dual variational generator is elaborately designed to learn the joint distribution of paired heterogeneous images. However, the small-scale paired heterogeneous training data may limit the identity diversity of sampling. In order to break through the limitation, we propose to integrate abundant identity information of large-scale visible data into the joint distribution. Furthermore, a pairwise identity preserving loss is imposed on the generated paired heterogeneous images to ensure their identity consistency. As a consequence, massive new diverse paired heterogeneous images with the same identity can be generated from noises. The identity consistency and identity diversity properties allow us to employ these generated images to train the HFR network via a contrastive learning mechanism, yielding both domain-invariant and discriminative embedding features. Concretely, the generated paired heterogeneous images are regarded as positive pairs, and the images obtained from different samplings are considered as negative pairs. Our method achieves superior performances over state-of-the-art methods on seven challenging databases belonging to five HFR tasks, including NIR-VIS, Sketch-Photo, Profile-Frontal Photo, Thermal-VIS, and ID-Camera.

31 citations
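The contrastive mechanism described above treats a generated NIR-VIS pair (same sampled identity) as a positive pair and images from different samplings as negatives. A minimal InfoNCE-style sketch under that reading follows; the temperature and exact loss form are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def paired_contrastive_loss(nir_emb, vis_emb, temperature=0.1):
    """nir_emb, vis_emb: (batch, dim) embeddings of generated heterogeneous face pairs."""
    nir = F.normalize(nir_emb, dim=1)
    vis = F.normalize(vis_emb, dim=1)
    logits = nir @ vis.t() / temperature            # (batch, batch) cosine similarities
    targets = torch.arange(nir.size(0), device=nir.device)
    # Diagonal entries correspond to generated positive pairs (same sampled identity);
    # off-diagonal entries come from different noise samplings and serve as negatives.
    return F.cross_entropy(logits, targets)
```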


Journal ArticleDOI
Mandi Luo, Jie Cao, Xin Ma, Xiaoyu Zhang, Ran He
TL;DR: In this paper, a hierarchical disentanglement module is proposed to decouple attributes from the identity representation, and Graph Convolutional Networks are applied to recover geometric information by exploring the interrelations among local regions, guaranteeing the preservation of identities in face data augmentation.
Abstract: Substantial improvements have been achieved in the field of face recognition due to the successful application of deep neural networks. However, existing methods are sensitive to both the quality and quantity of the training data. Despite the availability of large-scale datasets, the long tail data distribution induces strong biases in model learning. In this paper, we present a Face Augmentation Generative Adversarial Network (FA-GAN) to reduce the influence of imbalanced deformation attribute distributions. We propose to decouple these attributes from the identity representation with a novel hierarchical disentanglement module. Moreover, Graph Convolutional Networks (GCNs) are applied to recover geometric information by exploring the interrelations among local regions to guarantee the preservation of identities in face data augmentation. Extensive experiments on face reconstruction, face manipulation, and face recognition demonstrate the effectiveness and generalization ability of the proposed method.

31 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a source hypothesis transfer (SHOT) method, which learns the feature extraction module for the target domain by fitting the target data features to the frozen source classification module (representing classification hypothesis).
Abstract: Unsupervised domain adaptation (UDA) aims to transfer knowledge from a related but different well-labeled source domain to a new unlabeled target domain. Most existing UDA methods require access to the source data, and thus are not applicable when the data are confidential and not shareable due to privacy concerns. This paper aims to tackle a realistic setting in which only a classification model trained over the source data, rather than the source data itself, is available. To address it, we propose a novel approach called Source HypOthesis Transfer (SHOT), which learns the feature extraction module for the target domain by fitting the target data features to the frozen source classification module (representing the classification hypothesis). Specifically, SHOT exploits both information maximization and self-supervised learning for the feature extractor learning to ensure the target features are implicitly aligned with the features of unseen source data. Furthermore, we propose a new labeling transfer strategy, which separates the target data into two splits based on the confidence of predictions (labeling information), and then employs semi-supervised learning to improve the accuracy of less-confident predictions in the target domain. Extensive experiments on various domain adaptation tasks show that our methods achieve results surpassing or comparable to the state of the art.

31 citations
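The information-maximization term that SHOT uses for target feature learning is the standard combination of per-sample entropy minimization and batch-level prediction diversity. A short sketch under that reading follows; treat it as an approximation of the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def information_maximization_loss(logits, eps=1e-6):
    """logits: (batch, classes) target-domain predictions."""
    probs = F.softmax(logits, dim=1)
    # Entropy term: make each individual prediction confident.
    ent = -(probs * torch.log(probs + eps)).sum(dim=1).mean()
    # Diversity term: negative entropy of the batch-mean prediction, so minimizing it
    # keeps predictions spread over classes instead of collapsing onto one.
    mean_probs = probs.mean(dim=0)
    div = (mean_probs * torch.log(mean_probs + eps)).sum()
    return ent + div
```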


Journal ArticleDOI
TL;DR: The identified core regions may serve as a basis for building markers for ASD and OCD diagnoses, as well as measures of symptom severity, and may inform future development of machine-learning models for psychiatric disorders.
Abstract: Objective: Psychiatric disorders commonly comprise comorbid symptoms, such as autism spectrum disorder (ASD), obsessive-compulsive disorder (OCD), and attention deficit hyperactivity disorder (ADHD)...

27 citations


Journal ArticleDOI
07 Jun 2021
TL;DR: In this paper, an inconsistency-aware wavelet dual-branch network was proposed for face forgery detection, which is mainly based on two kinds of forgery clues called inter-image and intra-image inconsistencies.
Abstract: Current face forgery techniques can generate high-fidelity fake faces with extremely low labor and time costs. As a result, face forgery detection becomes an important research topic to prevent technology abuse. In this paper, we present an inconsistency-aware wavelet dual-branch network for face forgery detection. This model is mainly based on two kinds of forgery clues called inter-image and intra-image inconsistencies. To fully utilize them, we firstly enhance the forgery features by using additional inputs based on stationary wavelet decomposition (SWD). Then, considering the different properties of the two inconsistencies, we design a dual-branch network that predicts image-level and pixel-level forgery labels respectively. The segmentation branch aims to recognize real and fake local regions, which is crucial for discovering intra-image inconsistency. The classification branch learns to discriminate the real and fake images globally, thus can extract inter-image inconsistency. Finally, bilinear pooling is employed to fuse the features from the two branches. We find that the bilinear pooling is a kind of spatial attentive pooling. It effectively utilizes the rich spatial features learned by the segmentation branch. Experimental results show that the proposed method surpasses the state-of-the-art face forgery detection methods.

23 citations
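Bilinear pooling, used above to fuse the classification and segmentation branches, accumulates the outer product of the two feature maps over spatial locations, which is why the abstract reads it as a spatial attentive pooling. An illustrative sketch with assumed shapes:

```python
import torch
import torch.nn.functional as F

def bilinear_pool(cls_feat, seg_feat):
    """cls_feat: (B, Ca, H, W) classification-branch features;
    seg_feat: (B, Cb, H, W) segmentation-branch features."""
    B, Ca, H, W = cls_feat.shape
    Cb = seg_feat.shape[1]
    a = cls_feat.reshape(B, Ca, H * W)
    b = seg_feat.reshape(B, Cb, H * W)
    # Outer product accumulated over spatial positions: each location of the segmentation
    # branch re-weights the classification features.
    fused = torch.bmm(a, b.transpose(1, 2)) / (H * W)       # (B, Ca, Cb)
    fused = fused.flatten(1)
    # Signed square root and L2 normalization, as in standard bilinear pooling.
    fused = torch.sign(fused) * torch.sqrt(fused.abs() + 1e-8)
    return F.normalize(fused, dim=1)
```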


Proceedings ArticleDOI
Jia Li, Zhaoyang Li, Jie Cao, Xingguang Song, Ran He
01 Jun 2021
TL;DR: In this paper, the authors propose a two-stage framework named FaceInpainter to implement controllable identity-guided face inpainting (IGFI) under heterogeneous domains.
Abstract: In this work, we propose a novel two-stage framework named FaceInpainter to implement controllable Identity-Guided Face Inpainting (IGFI) under heterogeneous domains. Concretely, by explicitly disentangling foreground and background of the target face, the first stage focuses on adaptive face fitting to the fixed background via a Styled Face Inpainting Network (SFI-Net), with 3D priors and texture code of the target, as well as identity factor of the source face. It is challenging to deal with the inconsistency between the new identity of the source and the original background of the target, concerning the face shape and appearance on the fused boundary. The second stage consists of a Joint Refinement Network (JR-Net) to refine the swapped face. It leverages AdaIN considering identity and multi-scale texture codes, for feature transformation of the decoded face from SFI-Net with facial occlusions. We adopt the contextual loss to implicitly preserve the attributes, encouraging face deformation and fewer texture distortions. Experimental results demonstrate that our approach handles high-quality identity adaptation to heterogeneous domains, exhibiting the competitive performance compared with state-of-the-art methods concerning both attribute and identity fidelity.
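The AdaIN step mentioned for JR-Net re-normalizes decoded face features with per-channel scale and shift derived from the identity and multi-scale texture codes. A minimal sketch follows; the MLP mapping from codes to affine parameters is an assumption for illustration, not the paper's module.

```python
import torch
import torch.nn as nn

def adain(content_feat, gamma, beta, eps=1e-5):
    """content_feat: (B, C, H, W) decoded face features; gamma, beta: (B, C)."""
    mean = content_feat.mean(dim=(2, 3), keepdim=True)
    std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    normalized = (content_feat - mean) / std
    return normalized * gamma[:, :, None, None] + beta[:, :, None, None]

class CodeToAffine(nn.Module):
    """Maps a concatenated identity + texture code to per-channel scale and shift."""
    def __init__(self, code_dim, channels):
        super().__init__()
        self.fc = nn.Linear(code_dim, channels * 2)

    def forward(self, code):                # code: (B, code_dim)
        gamma, beta = self.fc(code).chunk(2, dim=1)
        return gamma, beta
```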

Posted Content
TL;DR: In this article, a cross-modality neural architecture search (CM-NAS) is proposed to find the optimal separation scheme for each batch normalization layer in a BN-oriented search space.
Abstract: Visible-Infrared person re-identification (VI-ReID) aims to match cross-modality pedestrian images, breaking through the limitation of single-modality person ReID in dark environments. In order to mitigate the impact of the large modality discrepancy, existing works manually design various two-stream architectures to separately learn modality-specific and modality-sharable representations. Such a manual design routine, however, highly depends on massive experiments and empirical practice, which is time consuming and labor intensive. In this paper, we systematically study the manually designed architectures, and identify that appropriately separating Batch Normalization (BN) layers is the key to bringing a great boost towards cross-modality matching. Based on this observation, the essential objective is to find the optimal separation scheme for each BN layer. To this end, we propose a novel method, named Cross-Modality Neural Architecture Search (CM-NAS). It consists of a BN-oriented search space in which the standard optimization can be fulfilled subject to the cross-modality task. Equipped with the searched architecture, our method outperforms state-of-the-art counterparts on both benchmarks, improving the Rank-1/mAP by 6.70%/6.13% on SYSU-MM01 and by 12.17%/11.23% on RegDB. Code is released at this https URL.
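The design decision CM-NAS searches over can be pictured as a per-layer switch between a shared BatchNorm and modality-specific BatchNorms while the convolution weights stay shared. A hedged sketch with illustrative module names:

```python
import torch
import torch.nn as nn

class SearchableBNBlock(nn.Module):
    """Convolution weights are always shared; the search decides whether the two
    modalities share one BatchNorm or use modality-specific BatchNorms."""
    def __init__(self, channels, separate_bn: bool):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.separate_bn = separate_bn
        if separate_bn:
            self.bn_vis = nn.BatchNorm2d(channels)
            self.bn_ir = nn.BatchNorm2d(channels)
        else:
            self.bn = nn.BatchNorm2d(channels)

    def forward(self, x, modality: str):
        x = self.conv(x)
        if self.separate_bn:
            x = self.bn_vis(x) if modality == "vis" else self.bn_ir(x)
        else:
            x = self.bn(x)
        return torch.relu(x)

# The search space is then the binary choice of `separate_bn` for every BN layer
# in the backbone, optimized for the cross-modality ReID objective.
```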

Journal ArticleDOI
TL;DR: A coupled adversarial learning (CAL) approach for VIS-NIR face matching is proposed by performing adversarial learning on both the image and feature levels to reduce the spectrum domain discrepancy and the over-fitting problem.

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this article, a data augmentation method (ReMix) is proposed to solve the problem of overfitting in image-to-image (I2I) translation by interpolating training samples at the feature level, together with a novel content loss based on the perceptual relations among samples.
Abstract: Image-to-image (I2I) translation methods based on generative adversarial networks (GANs) typically suffer from overfitting when limited training data is available. In this work, we propose a data augmentation method (ReMix) to tackle this issue. We interpolate training samples at the feature level and propose a novel content loss based on the perceptual relations among samples. The generator learns to translate the in-between samples rather than memorizing the training set, and thereby forces the discriminator to generalize. The proposed approach effectively reduces the ambiguity of generation and renders content-preserving results. The ReMix method can be easily incorporated into existing GAN models with minor modifications. Experimental results on numerous tasks demonstrate that GAN models equipped with the ReMix method achieve significant improvements.
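The feature-level interpolation at the heart of ReMix resembles mixup applied to encoder features. A schematic sketch under that reading follows; the Beta-distributed mixing ratio and the encode/translate interface are assumptions, not the paper's exact implementation.

```python
import torch

def remix_features(feat_a, feat_b, alpha=1.0):
    """feat_a, feat_b: (B, C, H, W) encoder features of two real training samples."""
    lam = torch.distributions.Beta(alpha, alpha).sample((feat_a.size(0),))
    lam = lam.to(feat_a.device).view(-1, 1, 1, 1)
    mixed = lam * feat_a + (1.0 - lam) * feat_b
    return mixed, lam

# Schematic training step: the generator decodes the mixed feature into an in-between
# image, and a content loss ties the perceptual relations of the output to the same
# mixing ratio lam, discouraging the generator from memorizing individual samples.
```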

Journal ArticleDOI
Aijing Yu, Haoxue Wu, Huaibo Huang, Zhen Lei, Ran He
TL;DR: A novel exemplar-based variational spectral attention network is proposed to produce high-fidelity VIS images from NIR data; a spectral conditional attention module is introduced to reduce the domain gap between NIR and VIS data and thereby improve the performance of NIR-VIS heterogeneous face recognition on various databases including LAMP-HQ.
Abstract: Near-infrared-visible (NIR-VIS) heterogeneous face recognition matches NIR to corresponding VIS face images. However, due to the sensing gap, NIR images often lose some identity information so that the NIR-VIS recognition issue is more difficult than conventional VIS face recognition. Recently, NIR-VIS heterogeneous face recognition has attracted considerable attention in the computer vision community because of its convenience and adaptability in practical applications. Various deep learning-based methods have been proposed and substantially increased the recognition performance, but the lack of NIR-VIS training samples leads to the difficulty of the model training process. In this paper, we propose a new Large-Scale Multi-Pose High-Quality NIR-VIS database ‘LAMP-HQ’ containing 56,788 NIR and 16,828 VIS images of 573 subjects with large diversities in pose, illumination, attribute, scene and accessory. We furnish a benchmark along with the protocol for NIR-VIS face recognition via generation on LAMP-HQ, including Pixel2Pixel, CycleGAN, ADFL, PCFH, and PACH. Furthermore, we propose a novel exemplar-based variational spectral attention network to produce high-fidelity VIS images from NIR data. A spectral conditional attention module is introduced to reduce the domain gap between NIR and VIS data and then improve the performance of NIR-VIS heterogeneous face recognition on various databases including the LAMP-HQ.

Journal ArticleDOI
Huaibo Huang, Aijing Yu, Zhenhua Chai, Ran He, Tieniu Tan
TL;DR: In this paper, a selective wavelet attention learning method is proposed to separate rain and background information in the embedding space, which improves the accuracy of single image deraining.
Abstract: Single image deraining refers to the process of restoring the clean background scene from a rainy image. Current approaches have resorted to deep learning techniques to remove rain from a single image by leveraging some prior information. However, due to the various appearances of rain streaks and accumulation, it is difficult to separate rain and background information in the embedding space, which results in inaccurate deraining. To address this issue, this paper proposes a selective wavelet attention learning method by learning a series of wavelet attention maps to guide the separation of rain and background information in both spatial and frequency domains. The key aspect of our method is utilizing wavelet transform to learn the content and structure of rainy features because the high-frequency features are more sensitive to rain degradations, whereas the low-frequency features preserve more of the background content. To begin with, we develop a selective wavelet attention encoder–decoder network to learn wavelet attention maps guiding the separation of rainy and background features at multiple scales. Meanwhile, we introduce wavelet pooling and unpooling to the encoder–decoder network, which shows superiority in learning increasingly abstract representations while preserving the background details. In addition, we propose latent alignment learning to supervise the background features as well as augment the training data to further improve the accuracy of deraining. Finally, we employ a hierarchical discriminator network based on selective wavelet attention to adversarially improve the visual fidelity of the generated results both globally and locally. Extensive experiments on synthetic and real datasets demonstrate that the proposed approach achieves more appealing results both quantitatively and qualitatively than the recent state-of-the-art methods.

Proceedings ArticleDOI
Yuting Xu, Gengyun Jia, Huaibo Huang, Junxian Duan, Ran He
TL;DR: In this paper, a Visual-Semantic Transformer (VST) is proposed to detect face forgery based on semantic-aware feature relations, achieving 99.58% accuracy on FF++ (Raw) and 96.16% on Celeb-DF.
Abstract: This paper proposes a novel Visual-Semantic Transformer (VST) to detect face forgery based on semantic-aware feature relations. In face images, intrinsic feature relations exist between different semantic parsing regions. We find that face forgery algorithms always change such relations. Therefore, we start the approach by extracting a Contextual Feature Sequence (CFS) using a transformer encoder to best capture abnormal feature relation patterns. Meanwhile, images are segmented into soft face regions by a face parsing module. Then we merge the CFS and the soft face regions into Visual Semantic Sequences (VSS) representing features of semantic regions. The VSS is fed into the transformer decoder, in which the relations at the semantic region level are modeled. Our method achieves 99.58% accuracy on FF++ (Raw) and 96.16% accuracy on Celeb-DF. Extensive experiments demonstrate that our framework outperforms or is comparable with state-of-the-art detection methods, especially towards unseen forgery methods.

Journal ArticleDOI
TL;DR: In this article, a new search space is designed for feature pyramids in object detectors; the architecture search process is formulated as a combinatorial optimization problem and solved by a Simulated Annealing-based Network Architecture Search method (SA-NAS).
Abstract: Feature pyramids have delivered significant improvement in object detection. However, building effective feature pyramids heavily relies on expert knowledge, and also requires strenuous efforts to balance effectiveness and efficiency. Automatic search methods, such as NAS-FPN, automate the design of feature pyramids, but their low search efficiency makes it difficult to apply them in a large search space. In this paper, we propose a novel search framework for a feature pyramid network, called AutoDet, which enables the automatic discovery of informative connections between multi-scale features and configures detection architectures with both high efficiency and state-of-the-art performance. In AutoDet, a new search space is specifically designed for feature pyramids in object detectors, which is more general than NAS-FPN. Furthermore, the architecture search process is formulated as a combinatorial optimization problem and solved by a Simulated Annealing-based Network Architecture Search method (SA-NAS). Compared with existing NAS methods, AutoDet ensures a dramatic reduction in search times. For example, our SA-NAS can be up to 30x faster than reinforcement learning-based approaches. Furthermore, AutoDet is compatible with both one-stage and two-stage structures with all kinds of backbone networks. We demonstrate the effectiveness of AutoDet with outperforming single-model results on the COCO dataset. Without pre-training on OpenImages, AutoDet with the ResNet-101 backbone achieves an AP of 39.7 and 47.3 for one-stage and two-stage architectures, respectively, which surpass current state-of-the-art methods.
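SA-NAS is a simulated-annealing search over feature-pyramid configurations. A generic annealing loop of that kind is sketched below; random_architecture, mutate, and evaluate (e.g., a proxy detection AP) are placeholders rather than the paper's actual search space or proxy task.

```python
import math
import random

def simulated_annealing_search(random_architecture, mutate, evaluate,
                               steps=500, t_start=1.0, t_end=0.01):
    """Search for a high-scoring architecture with geometric temperature cooling."""
    arch = random_architecture()
    score = evaluate(arch)
    best_arch, best_score = arch, score
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        cand = mutate(arch)
        cand_score = evaluate(cand)
        # Always accept improvements; accept worse candidates with a temperature-
        # dependent probability so the search can escape local optima.
        if cand_score > score or random.random() < math.exp((cand_score - score) / t):
            arch, score = cand, cand_score
            if score > best_score:
                best_arch, best_score = arch, score
    return best_arch, best_score
```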

Journal ArticleDOI
TL;DR: In this article, the robust property of the Kernel Mean P-Power Error loss (KMPE-Loss) is investigated, and a novel robust transfer feature learning (RTFL) method is proposed to enhance the robustness of domain adaptation.

Proceedings ArticleDOI
TL;DR: In this paper, the authors proposed Contrastive Uncertainty Learning (CUL) by integrating the merits of uncertainty learning and contrastive self-supervised learning to improve the performance of iris recognition with insufficient labeled data.
Abstract: Cross-database recognition is still an unavoidable challenge when deploying an iris recognition system to a new environment. In this paper, we present a compromise problem that resembles the real-world scenario, named iris recognition with insufficient labeled samples. This new problem aims to improve the recognition performance by utilizing partially labeled or unlabeled data. To address the problem, we propose Contrastive Uncertainty Learning (CUL) by integrating the merits of uncertainty learning and contrastive self-supervised learning. CUL makes two efforts to learn a discriminative and robust feature representation. On the one hand, CUL explores the uncertain acquisition factors and adopts a probabilistic embedding to represent the iris image. In the probabilistic representation, the identity information and acquisition factors are disentangled into the mean and variance, avoiding the impact of uncertain acquisition factors on the identity information. On the other hand, CUL utilizes probabilistic embeddings to generate virtual positive and negative pairs. Then CUL builds its contrastive loss to group similar samples closely and push dissimilar samples apart. The experimental results demonstrate the effectiveness of the proposed CUL for iris recognition with insufficient labeled samples.
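The probabilistic embedding described above predicts a mean (identity information) and a variance (uncertain acquisition factors), and virtual positive pairs come from re-sampling the same distribution. A hedged sketch follows; the reparameterization and the InfoNCE-style loss form are assumptions, not the paper's exact construction.

```python
import torch
import torch.nn.functional as F

def sample_embedding(mu, log_var):
    """Reparameterized sample from N(mu, exp(log_var)); mu, log_var: (B, D)."""
    std = torch.exp(0.5 * log_var)
    return mu + std * torch.randn_like(std)

def virtual_pair_contrastive_loss(mu, log_var, temperature=0.1):
    # Two independent samplings of the same per-image distributions.
    z1 = F.normalize(sample_embedding(mu, log_var), dim=1)
    z2 = F.normalize(sample_embedding(mu, log_var), dim=1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(mu.size(0), device=mu.device)
    # Diagonal = virtual positive pairs (same iris, different sampled acquisition factors);
    # off-diagonal = virtual negative pairs from other images in the batch.
    return F.cross_entropy(logits, targets)
```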

Journal ArticleDOI
TL;DR: In this paper, a saliency search network (SSN) is proposed to extract domain-invariant identity features, with the searching process guided by an information bottleneck network to mitigate the overfitting problems caused by small datasets.
Abstract: Near-infrared-visual (NIR-VIS) heterogeneous face recognition (HFR) aims to match NIR face images with the corresponding VIS ones. It is a challenging task due to the sensing gaps among different modalities. Occlusions in the input face images make the task extremely complex. To tackle these problems, we present a Saliency Search Network (SSN) to extract domain-invariant identity features. We propose to automatically search the efficient parts of face images in a modality-aware manner, and remove redundant information. Moreover, the searching process is guided by an information bottleneck network, which mitigates the overfitting problems caused by small datasets. Extensive experiments on both complete and partial NIR-VIS HFR on multiple datasets demonstrate the effectiveness and robustness of the proposed method to modality discrepancy and occlusions.

Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this paper, a self-supervised Siamese inference network is proposed to improve the robustness and generalization of deep learning based image inpainting approaches; it can encode contextual semantics from full-resolution images and obtain more discriminative representations.
Abstract: Most deep learning based image inpainting approaches adopt an autoencoder or its variants to fill missing regions in images. Encoders are usually utilized to learn powerful representational spaces, which are important for dealing with sophisticated learning tasks. Specifically, in image inpainting tasks, masks with any shapes can appear anywhere in images (i.e., free-form masks), forming complex patterns. It is difficult for encoders to capture such powerful representations under this complex situation. To tackle this problem, we propose a self-supervised Siamese inference network to improve the robustness and generalization. It can encode contextual semantics from full-resolution images and obtain more discriminative representations. We further propose a multi-scale decoder with a novel dual attention fusion module (DAF), which can combine both the restored and known regions in a smooth way. This multi-scale architecture is beneficial for decoding discriminative representations learned by encoders into images layer by layer. In this way, unknown regions are filled naturally from outside to inside. Qualitative and quantitative experiments on multiple datasets, including facial and natural datasets (i.e., Celeb-HQ, Paris Street View, Places2 and ImageNet), demonstrate that our proposed method outperforms state-of-the-art methods in generating high-quality inpainting results.

Posted Content
TL;DR: In this paper, a two-step adaptation framework called Dis-tune is proposed for unsupervised domain adaptation, which first distills the knowledge from the source model to a customized target model, and then fine-tunes the distilled model to fit the target domain.
Abstract: To alleviate the burden of labeling, unsupervised domain adaptation (UDA) aims to transfer knowledge in previous related labeled datasets (source) to a new unlabeled dataset (target). Despite impressive progress, prior methods always need to access the raw source data and develop data-dependent alignment approaches to recognize the target samples in a transductive learning manner, which may raise privacy concerns from source individuals. Several recent studies resort to an alternative solution by exploiting the well-trained white-box model instead of the raw data from the source domain; however, it may leak the raw data through generative adversarial training. This paper studies a practical and interesting setting for UDA, where only a black-box source model (i.e., only network predictions are available) is provided during adaptation in the target domain. Besides, different neural networks are even allowed to be employed for different domains. For this new problem, we propose a novel two-step adaptation framework called Distill and Fine-tune (Dis-tune). Specifically, Dis-tune first structurally distills the knowledge from the source model to a customized target model, then unsupervisedly fine-tunes the distilled model to fit the target domain. To verify the effectiveness, we consider two UDA scenarios (i.e., closed-set and partial-set), and discover that Dis-tune achieves highly competitive performance to state-of-the-art approaches.
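A schematic sketch of the two-step recipe: step 1 distills the black-box source model's soft predictions on target images into a customized target network; step 2 fine-tunes that network with an unsupervised objective (plain entropy minimization is used here as a stand-in for the paper's fine-tuning loss). All names are illustrative.

```python
import torch
import torch.nn.functional as F

def distill_step(target_model, images, blackbox_probs, optimizer):
    """blackbox_probs: (B, C) soft predictions queried from the black-box source model."""
    log_probs = F.log_softmax(target_model(images), dim=1)
    loss = F.kl_div(log_probs, blackbox_probs, reduction="batchmean")
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def finetune_step(target_model, images, optimizer):
    """Unsupervised fine-tuning; entropy minimization stands in for the actual objective."""
    probs = F.softmax(target_model(images), dim=1)
    loss = -(probs * torch.log(probs + 1e-6)).sum(dim=1).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```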

Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this paper, a two-stage inpainting framework is proposed in which a completed segmentation map of the corrupted image is first predicted by a segmentation reconstruction network, and fine-grained image details are restored in the second stage by an image generator.
Abstract: Image inpainting faces the challenging issue of the requirements on structure reasonableness and texture coherence. In this paper, we propose a two-stage inpainting framework to address this issue. The basic idea is to address the two requirements in two separate stages. A completed segmentation of the corrupted image is first predicted through a segmentation reconstruction network, while fine-grained image details are restored in the second stage through an image generator. The two stages are connected in series, as the image details are generated under the guidance of the completed segmentation map predicted in the first stage. Specifically, in the second stage, we propose a novel graph-based relation network to model the relationships existing in the corrupted image. In the relation network, both the intra-relationship for pixels in the same semantic region and the inter-relationship between different semantic parts are considered, improving the consistency and compatibility of image textures. Besides, a contrastive loss is designed to facilitate the relation network training. Such a framework not only simplifies the inpainting problem directly, but also exploits the relationships in the corrupted image explicitly. Extensive experiments on various public datasets quantitatively and qualitatively demonstrate the superiority of our approach compared with the state of the art.

Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this paper, a semantic and temporal synchronous landmark learning method was proposed to synthesize a talking face video with accurate mouth synchronization and natural face motion, where a U-Net generation network with adaptive reconstruction loss was employed to generate facial images for the predicted landmarks.
Abstract: Given a speech clip and facial image, the goal of talking face generation is to synthesize a talking face video with accurate mouth synchronization and natural face motion. Recent progress has proven the effectiveness of the landmarks as the intermediate information during talking face generation. However, the large gap between audio and visual modalities makes the prediction of landmarks challenging and limits generation ability. This paper proposes a semantic and temporal synchronous landmark learning method for talking face generation. First, we propose to introduce a word detector to enforce richer semantic information. Then, we propose to preserve the temporal synchronization and consistency between landmarks and audio via the proposed temporal residual loss. Lastly, we employ a U-Net generation network with adaptive reconstruction loss to generate facial images for the predicted landmarks. Experimental results on two benchmark datasets LRW and GRID demonstrate the effectiveness of our model compared to the state-of-the-art methods of talking face generation.

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, the reenactment is decomposed into three catenate processes, shape modeling, motion transfer and texture synthesis, and three crucial components, i.e., Parametric Shape Modeling, Expansionary Motion Transfer and Unsupervised Texture Synthesizer, are introduced to overcome the remarkable variances of pareidolia faces.
Abstract: We present a new application direction named Pareidolia Face Reenactment, which is defined as animating a static illusory face to move in tandem with a human face in a video. Due to the large differences between pareidolia face reenactment and traditional human face reenactment, two main challenges are introduced, i.e., shape variance and texture variance. In this work, we propose a novel Parametric Unsupervised Reenactment Algorithm to tackle these two challenges. Specifically, we propose to decompose the reenactment into three catenate processes: shape modeling, motion transfer and texture synthesis. With this decomposition, we introduce three crucial components, i.e., Parametric Shape Modeling, Expansionary Motion Transfer and Unsupervised Texture Synthesizer, to overcome the problems brought by the remarkable variances of pareidolia faces. Extensive experiments show the superior performance of our method both qualitatively and quantitatively. Code, model and data are available on our project page.

Proceedings Article
03 May 2021
TL;DR: In this paper, the authors proposed a graph information bottleneck (GIB) objective based on a mutual information estimator for the irregular graph data and a bi-level optimization scheme to maximize the GIB objective.
Abstract: Given the input graph and its label/property, several key problems of graph learning, such as finding interpretable subgraphs, graph denoising and graph compression, can be attributed to the fundamental problem of recognizing a subgraph of the original one. This subgraph shall be as informative as possible, yet contain less redundant and noisy structure. This problem setting is closely related to the well-known information bottleneck (IB) principle, which, however, has been less studied for irregular graph data and graph neural networks (GNNs). In this paper, we propose a framework of Graph Information Bottleneck (GIB) for the subgraph recognition problem in deep graph learning. Under this framework, one can recognize the maximally informative yet compressive subgraph, named IB-subgraph. However, the GIB objective is notoriously hard to optimize, mostly due to the intractability of the mutual information of irregular graph data and the unstable optimization process. In order to tackle these challenges, we propose: i) a GIB objective based on a mutual information estimator for irregular graph data; ii) a bi-level optimization scheme to maximize the GIB objective; iii) a connectivity loss to stabilize the optimization process. We evaluate the properties of the IB-subgraph in three application scenarios: improvement of graph classification, graph interpretation and graph denoising. Extensive experiments demonstrate that the information-theoretic IB-subgraph enjoys superior graph properties.
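In information-bottleneck terms, the objective described above can be written in the standard GIB form below, with G the input graph, G_sub the recognized subgraph, and Y the graph label; the paper estimates the mutual-information terms with a neural estimator and optimizes them in a bi-level scheme.

```latex
\max_{G_{\mathrm{sub}}}\; I(G_{\mathrm{sub}};\, Y) \;-\; \beta\, I(G_{\mathrm{sub}};\, G)
```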

Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this paper, an attentional wavelet network is proposed for photo-to-Chinese-painting transfer, which first introduces wavelets to obtain high-level conception and local details in Chinese paintings via the 2-D Haar wavelet transform.
Abstract: Traditional Chinese paintings pay particular attention to ‘Gongbi’ and ‘Xieyi’, which raises a challenging task of generating Chinese paintings from photos. ‘Xieyi’ creates the high-level conception of a painting, while ‘Gongbi’ refers to portraying local details in a painting. This paper proposes an attentional wavelet network for photo-to-Chinese-painting transfer. We first introduce wavelets to obtain high-level conception and local details in Chinese paintings via the 2-D Haar wavelet transform. Moreover, we design a high-level transform stream and a local enhancement stream to process the high-frequency and low-frequency components, respectively. Furthermore, we exploit a self-attention mechanism to compatibly pick up high-level information, which is used to remedy missing details when reconstructing the Chinese painting. To advance our experiments, we set up a new dataset named P2ADataset, with diverse photos and Chinese paintings of famous mountains around China. Experimental results compared with state-of-the-art style transfer algorithms verify the effectiveness of the proposed method. We will release the codes and data to the public.
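One level of the 2-D Haar wavelet transform referred to above splits an image into a low-frequency approximation (the ‘Xieyi’-like global conception) and three high-frequency detail bands (the ‘Gongbi’-like local strokes). A plain-NumPy illustration using an averaging convention (the scaling differs from the orthonormal transform); even height and width are assumed.

```python
import numpy as np

def haar_dwt2(img):
    """img: (H, W) grayscale array with even H and W."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4.0    # low-frequency approximation (local 2x2 averages)
    lh = (a + b - c - d) / 4.0    # detail: difference across rows
    hl = (a - b + c - d) / 4.0    # detail: difference across columns
    hh = (a - b - c + d) / 4.0    # detail: diagonal difference
    return ll, (lh, hl, hh)

ll, details = haar_dwt2(np.random.rand(256, 256))
print(ll.shape)                   # (128, 128)
```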

Proceedings ArticleDOI
10 Jan 2021
TL;DR: In this article, an exemplar guided cross-spectral face hallucination (EGCH) method is proposed to reduce the domain discrepancy through disentangled representation learning for NIR-VIS heterogeneous face recognition.
Abstract: Recently, many near-infrared-visible (NIR-VIS) heterogeneous face recognition (HFR) methods have been proposed in the community. But it remains a challenging problem because of the sensing gap along with large pose variations. In this paper, we propose an Exemplar Guided Cross-Spectral Face Hallucination (EGCH) method to reduce the domain discrepancy through disentangled representation learning. For each modality, EGCH contains a spectral encoder as well as a structure encoder to disentangle the spectral and structure representations, respectively. It also contains a traditional generator that reconstructs the input from the above two representations, and a structure generator that predicts the facial parsing map from the structure representation. Besides, mutual information minimization and maximization are conducted to boost disentanglement and make the representations adequately expressed. Then the translation is built on structure representations between the two modalities. Provided with the transformed NIR structure representation and the original VIS spectral representation, EGCH is capable of producing high-fidelity VIS images that preserve the topology structure of the input NIR while transferring the spectral information of an arbitrary VIS exemplar. Extensive experiments demonstrate that the proposed method achieves more promising results both qualitatively and quantitatively than state-of-the-art NIR-VIS methods.

Proceedings Article
01 Jan 2021
TL;DR: In this article, the authors explore a novel attack paradigm in which backdoor triggers are sample-specific: only certain training samples need to be modified with an invisible perturbation, without manipulating other training components.
Abstract: Recently, backdoor attacks pose a new security threat to the training process of deep neural networks (DNNs). Attackers intend to inject hidden backdoors into DNNs, such that the attacked model performs well on benign samples, whereas its prediction will be maliciously changed if hidden backdoors are activated by the attacker-defined trigger. Existing backdoor attacks usually adopt the setting that triggers are sample-agnostic, i.e., different poisoned samples contain the same trigger, with the result that the attacks can be easily mitigated by current backdoor defenses. In this work, we explore a novel attack paradigm, where backdoor triggers are sample-specific. In our attack, we only need to modify certain training samples with invisible perturbation, and do not need to manipulate other training components (e.g., training loss and model structure) as required in many existing attacks. Specifically, inspired by the recent advance in DNN-based image steganography, we generate sample-specific invisible additive noises as backdoor triggers by encoding an attacker-specified string into benign images through an encoder-decoder network. The mapping from the string to the target label will be generated when DNNs are trained on the poisoned dataset. Extensive experiments on benchmark datasets verify the effectiveness of our method in attacking models with or without defenses.