
Showing papers by "Ran He published in 2017"


Proceedings ArticleDOI
13 Apr 2017
TL;DR: A Two-Pathway Generative Adversarial Network (TP-GAN) is proposed for photorealistic frontal view synthesis that simultaneously perceives global structures and local details.
Abstract: Photorealistic frontal view synthesis from a single face image has a wide range of applications in the field of face recognition. Although data-driven deep learning methods have been proposed to address this problem by seeking solutions from ample face data, this problem is still challenging because it is intrinsically ill-posed. This paper proposes a Two-Pathway Generative Adversarial Network (TP-GAN) for photorealistic frontal view synthesis by simultaneously perceiving global structures and local details. Four landmark-located patch networks are proposed to attend to local textures in addition to the commonly used global encoder-decoder network. Beyond the novel architecture, we make this ill-posed problem well constrained by introducing a combination of adversarial loss, symmetry loss and identity preserving loss. The combined loss function leverages both the frontal face distribution and pre-trained discriminative deep face models to guide an identity preserving inference of frontal views from profiles. Different from previous deep learning methods that mainly rely on intermediate features for recognition, our method directly leverages the synthesized identity preserving image for downstream tasks like face recognition and attribute estimation. Experimental results demonstrate that our method not only presents compelling perceptual results but also outperforms state-of-the-art results on large pose face recognition.
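
To make the combined objective concrete, here is a minimal PyTorch sketch of how the four loss terms described in the abstract could be assembled. The function names, the frozen face model `face_net`, and all loss weights are hypothetical placeholders, not the paper's exact formulation or values.

```python
import torch
import torch.nn.functional as F

def tpgan_style_generator_loss(frontal_fake, frontal_real, d_fake_logits,
                               face_net, w_pix=1.0, w_adv=0.001,
                               w_sym=0.1, w_ip=0.01):
    """Illustrative TP-GAN-style generator objective (weights are guesses)."""
    # Pixel-wise reconstruction against the ground-truth frontal view.
    l_pix = F.l1_loss(frontal_fake, frontal_real)
    # Adversarial loss: the generator tries to make the discriminator say "real".
    l_adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    # Symmetry loss: a synthesized frontal face should match its horizontal flip.
    l_sym = F.l1_loss(frontal_fake, torch.flip(frontal_fake, dims=[3]))
    # Identity-preserving loss in the feature space of a pre-trained,
    # frozen discriminative face model.
    with torch.no_grad():
        feat_real = face_net(frontal_real)
    l_ip = F.mse_loss(face_net(frontal_fake), feat_real)
    return w_pix * l_pix + w_adv * l_adv + w_sym * l_sym + w_ip * l_ip
```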

509 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: A wavelet-based CNN approach is presented that can ultra-resolve a very low resolution face image of 16 × 16 pixels or smaller to larger versions at multiple scaling factors in a unified framework, with three types of loss: wavelet prediction loss, texture loss and full-image loss.
Abstract: Most modern face super-resolution methods resort to convolutional neural networks (CNN) to infer high-resolution (HR) face images. When dealing with very low resolution (LR) images, the performance of these CNN-based methods greatly degrades. Meanwhile, these methods tend to produce over-smoothed outputs and miss some textural details. To address these challenges, this paper presents a wavelet-based CNN approach that can ultra-resolve a very low resolution face image of 16 × 16 pixels or smaller to larger versions at multiple scaling factors (2×, 4×, 8× and even 16×) in a unified framework. Different from conventional CNN methods that directly infer HR images, our approach first learns to predict the series of wavelet coefficients of the HR image corresponding to the LR input before reconstructing the HR image from them. To capture both the global topology information and the local texture details of human faces, we present a flexible and extensible convolutional neural network with three types of loss: wavelet prediction loss, texture loss and full-image loss. Extensive experiments demonstrate that the proposed approach achieves more appealing results both quantitatively and qualitatively than state-of-the-art super-resolution methods.
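
As a rough illustration of the wavelet-prediction idea, the sketch below computes a one-level Haar decomposition in PyTorch and penalizes predicted sub-bands against those of the HR target. The prediction network, the texture loss and the full-image loss are omitted, and the normalization convention shown is one common choice, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def haar_decompose(img):
    """One-level 2D Haar wavelet transform of a (B, C, H, W) image.
    Returns the four sub-bands (LL, LH, HL, HH)."""
    a = img[:, :, 0::2, 0::2]
    b = img[:, :, 0::2, 1::2]
    c = img[:, :, 1::2, 0::2]
    d = img[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def wavelet_prediction_loss(pred_bands, hr_image):
    """Compare predicted wavelet sub-bands against those of the HR target."""
    target_bands = haar_decompose(hr_image)
    return sum(F.mse_loss(p, t) for p, t in zip(pred_bands, target_bands))
```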

369 citations


Posted Content
TL;DR: A deep supervised discrete hashing algorithm is proposed based on the assumption that the learned binary codes should be ideal for classification, addressing limitations of previous deep hashing methods (e.g., the semantic information is not fully exploited).
Abstract: With the rapid growth of image and video data on the web, hashing has been extensively studied for image or video search in recent years. Benefiting from recent advances in deep learning, deep hashing methods have achieved promising results for image retrieval. However, there are some limitations of previous deep hashing methods (e.g., the semantic information is not fully exploited). In this paper, we develop a deep supervised discrete hashing algorithm based on the assumption that the learned binary codes should be ideal for classification. Both the pairwise label information and the classification information are used to learn the hash codes within a one-stream framework. We constrain the outputs of the last layer to be binary codes directly, which is rarely investigated in deep hashing algorithms. Because of the discrete nature of hash codes, an alternating minimization method is used to optimize the objective function. Experimental results have shown that our method outperforms current state-of-the-art methods on benchmark datasets.
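
A continuous-relaxation sketch of such an objective is shown below, assuming PyTorch; the paper itself handles the discrete constraint with alternating minimization rather than the quantization penalty used here, and all names and weights are placeholders.

```python
import torch
import torch.nn.functional as F

def supervised_hashing_loss(h, labels, classifier, pair_sim, mu=1.0, nu=0.1):
    """h: (N, k) real-valued last-layer outputs; pair_sim: (N, N) with 1 for
    same-class pairs and 0 otherwise; classifier: linear head on the codes."""
    # Pairwise likelihood term: code inner products should reflect similarity.
    theta = h @ h.t() / 2
    l_pair = (F.softplus(theta) - pair_sim * theta).mean()
    # Classification term: the (relaxed) codes should be good features.
    l_cls = F.cross_entropy(classifier(h), labels)
    # Quantization penalty pulling outputs toward {-1, +1}.
    l_quant = (h - torch.sign(h).detach()).pow(2).mean()
    return l_pair + mu * l_cls + nu * l_quant
```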

152 citations


Proceedings Article
13 Feb 2017
TL;DR: A deep convolutional network approach that uses only one network to map both NIR and VIS images to a compact Euclidean space and achieves 94% verification rate at FAR=0.1% on the challenging CASIA NIR-VIS 2.0 face recognition dataset.
Abstract: Visual versus near infrared (VIS-NIR) face recognition is still a challenging heterogeneous task due to the large appearance difference between the VIS and NIR modalities. This paper presents a deep convolutional network approach that uses only one network to map both NIR and VIS images to a compact Euclidean space. The low-level layers of this network are trained only on large-scale VIS data. Each convolutional layer is implemented by the simplest case of the maxout operator. The high-level layer is divided into two orthogonal subspaces that contain modality-invariant identity information and modality-variant spectrum information, respectively. Our joint formulation leads to an alternating minimization approach for deep representation at training time and an efficient computation for heterogeneous data at testing time. Experimental evaluations show that our method achieves a 94% verification rate at FAR=0.1% on the challenging CASIA NIR-VIS 2.0 face recognition dataset. Compared with state-of-the-art methods, it reduces the error rate by 58% with only a compact 64-D representation.
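
A toy PyTorch version of the high-level split into two subspaces might look as follows; the layer sizes and the soft orthogonality penalty are assumptions for illustration, not the paper's exact construction.

```python
import torch
import torch.nn as nn

class SharedInvariantHead(nn.Module):
    """High-level layer split into a modality-invariant identity subspace
    and a modality-variant spectrum subspace (dimensions are illustrative)."""
    def __init__(self, in_dim=256, id_dim=64, spec_dim=64):
        super().__init__()
        self.w_id = nn.Linear(in_dim, id_dim, bias=False)      # identity info
        self.w_spec = nn.Linear(in_dim, spec_dim, bias=False)  # spectrum info

    def forward(self, x):
        return self.w_id(x), self.w_spec(x)

    def orthogonality_penalty(self):
        # Encourage the two projection bases to span orthogonal subspaces.
        cross = self.w_id.weight @ self.w_spec.weight.t()
        return cross.pow(2).sum()
```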

143 citations


Proceedings Article
01 Dec 2017
TL;DR: A deep supervised discrete hashing algorithm is developed based on the assumption that the learned binary codes should be ideal for classification, where both the pairwise label information and the classification information are used to learn the hash codes within a one-stream framework.
Abstract: With the rapid growth of image and video data on the web, hashing has been extensively studied for image or video search in recent years. Benefiting from recent advances in deep learning, deep hashing methods have achieved promising results for image retrieval. However, there are some limitations of previous deep hashing methods (e.g., the semantic information is not fully exploited). In this paper, we develop a deep supervised discrete hashing algorithm based on the assumption that the learned binary codes should be ideal for classification. Both the pairwise label information and the classification information are used to learn the hash codes within a one-stream framework. We constrain the outputs of the last layer to be binary codes directly, which is rarely investigated in deep hashing algorithms. Because of the discrete nature of hash codes, an alternating minimization method is used to optimize the objective function. Experimental results have shown that our method outperforms current state-of-the-art methods on benchmark datasets.

119 citations


Posted Content
TL;DR: A Two-Pathway Generative Adversarial Network (TP-GAN) is proposed for photorealistic frontal view synthesis that simultaneously perceives global structures and local details.
Abstract: Photorealistic frontal view synthesis from a single face image has a wide range of applications in the field of face recognition. Although data-driven deep learning methods have been proposed to address this problem by seeking solutions from ample face data, this problem is still challenging because it is intrinsically ill-posed. This paper proposes a Two-Pathway Generative Adversarial Network (TP-GAN) for photorealistic frontal view synthesis by simultaneously perceiving global structures and local details. Four landmark-located patch networks are proposed to attend to local textures in addition to the commonly used global encoder-decoder network. Beyond the novel architecture, we make this ill-posed problem well constrained by introducing a combination of adversarial loss, symmetry loss and identity preserving loss. The combined loss function leverages both the frontal face distribution and pre-trained discriminative deep face models to guide an identity preserving inference of frontal views from profiles. Different from previous deep learning methods that mainly rely on intermediate features for recognition, our method directly leverages the synthesized identity preserving image for downstream tasks like face recognition and attribute estimation. Experimental results demonstrate that our method not only presents compelling perceptual results but also outperforms state-of-the-art results on large pose face recognition.

99 citations


Journal ArticleDOI
TL;DR: A single and simple multi-task framework is employed to efficiently utilize the supervision of aesthetic and semantic labels, and a correlation item between these two tasks is further introduced to the framework by incorporating inter-task relationship learning.
Abstract: Human beings often assess the aesthetic quality of an image coupled with the identification of the image’s semantic content. This paper addresses the correlation issue between automatic aesthetic quality assessment and semantic recognition. We cast the assessment problem as the main task within a multi-task deep model, and argue that the semantic recognition task offers the key to addressing this problem. Based on convolutional neural networks, we employ a single and simple multi-task framework to efficiently utilize the supervision of aesthetic and semantic labels. A correlation item between these two tasks is further introduced to the framework by incorporating inter-task relationship learning. This item not only provides useful insight into the correlation but also improves the assessment accuracy of the aesthetic task. In particular, an effective strategy is developed to keep a balance between the two tasks, which facilitates optimizing the parameters of the framework. Extensive experiments on the challenging Aesthetic Visual Analysis dataset and the Photo.net dataset validate the importance of semantic recognition in aesthetic quality assessment, and demonstrate that multi-task deep models can discover an effective aesthetic representation to achieve state-of-the-art results.
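
The shared-trunk, two-head structure described above could be sketched in PyTorch as follows; the backbone, dimensions and balancing weight `lam` are assumptions, and the paper's inter-task correlation item is simplified to a plain weighted sum here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AestheticSemanticNet(nn.Module):
    """Shared CNN trunk with two heads: aesthetic quality and semantic class."""
    def __init__(self, backbone, feat_dim, n_semantic, n_aesthetic=2):
        super().__init__()
        self.backbone = backbone                 # any CNN producing feat_dim
        self.aesthetic_head = nn.Linear(feat_dim, n_aesthetic)
        self.semantic_head = nn.Linear(feat_dim, n_semantic)

    def forward(self, x):
        f = self.backbone(x)
        return self.aesthetic_head(f), self.semantic_head(f)

def multitask_loss(aes_logits, sem_logits, aes_y, sem_y, lam=0.5):
    # lam balances the two tasks so neither dominates training.
    return F.cross_entropy(aes_logits, aes_y) + lam * F.cross_entropy(sem_logits, sem_y)
```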

97 citations


Posted Content
TL;DR: CDL seeks a shared feature space in which the heterogeneous face matching problem can be approximately treated as a homogeneous face matching problem, and achieves better performance on the challenging CASIA NIR-VIS 2.0 face recognition database, the IIIT-D Sketch database, the CUHK Face Sketch (CUFS) database and the CUHK Face Sketch FERET (CUFSF) database, significantly outperforming state-of-the-art heterogeneous face recognition methods.
Abstract: Heterogeneous face matching is a challenging problem in face recognition due to the large domain difference as well as insufficient pairwise images in different modalities during training. This paper proposes a coupled deep learning (CDL) approach for heterogeneous face matching. CDL seeks a shared feature space in which the heterogeneous face matching problem can be approximately treated as a homogeneous face matching problem. The objective function of CDL mainly includes two parts. The first part contains a trace norm and a block-diagonal prior as relevance constraints, which not only encourage unpaired images from multiple modalities to be clustered and correlated, but also regularize the parameters to alleviate overfitting. An approximate variational formulation is introduced to deal with the difficulty of optimizing the low-rank constraint directly. The second part contains a cross-modal ranking among triplets of domain-specific images to maximize the margin between different identities and augment the limited training samples. In addition, an alternating minimization method is employed to iteratively update the parameters of CDL. Experimental results show that CDL achieves better performance on the challenging CASIA NIR-VIS 2.0 face recognition database, the IIIT-D Sketch database, the CUHK Face Sketch (CUFS) database and the CUHK Face Sketch FERET (CUFSF) database, significantly outperforming state-of-the-art heterogeneous face recognition methods.
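
The two parts of the objective can be sketched directly in PyTorch, as below. The trace norm is computed exactly here rather than via the paper's approximate variational formulation, and the block-diagonal prior is omitted; names, margin and weight are placeholders.

```python
import torch
import torch.nn.functional as F

def relevance_constraint(w, alpha=1e-3):
    """Trace (nuclear) norm on a projection weight matrix, encouraging
    low-rank, correlated projections across modalities."""
    return alpha * torch.linalg.svdvals(w).sum()

def cross_modal_triplet(anchor_nir, pos_vis, neg_vis, margin=0.5):
    """Ranking term: an NIR anchor should be closer to the VIS image of the
    same identity than to a VIS image of a different identity."""
    d_pos = F.pairwise_distance(anchor_nir, pos_vis)
    d_neg = F.pairwise_distance(anchor_nir, neg_vis)
    return F.relu(d_pos - d_neg + margin).mean()
```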

60 citations


Proceedings ArticleDOI
01 Jul 2017
TL;DR: A triplet-based loss function is introduced to enforce the selected feature groups to preserve the ordinal locality of the original data, which contributes to distance-based clustering tasks.
Abstract: Unsupervised feature selection has shown significant potential in distance-based clustering tasks. This paper proposes a novel triplet-induced method. First, a triplet-based loss function is introduced to enforce the selected feature groups to preserve the ordinal locality of the original data, which contributes to distance-based clustering tasks. Second, we simplify orthogonal basis clustering by imposing an orthogonal constraint on the feature projection matrix. Consequently, a general framework for simultaneous feature selection and clustering is discussed. Third, an alternating minimization algorithm with rapid convergence is employed to efficiently optimize the proposed model. Extensive comparison experiments on several benchmark datasets validate the encouraging clustering gains of the proposed method.
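
A rough PyTorch rendering of the two key ingredients, a triplet loss that preserves distance order and a soft orthogonality constraint on the projection, is given below; the hinge margin and the penalty form are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ordinal_locality_loss(x, w, triplets, margin=0.1):
    """x: (N, D) data; w: (D, d) feature selection/projection matrix;
    triplets: (T, 3) index rows (i, j, k) with x_i closer to x_j than to
    x_k in the original space; the projection should preserve that order."""
    z = x @ w
    i, j, k = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    d_ij = (z[i] - z[j]).pow(2).sum(dim=1)
    d_ik = (z[i] - z[k]).pow(2).sum(dim=1)
    return F.relu(d_ij - d_ik + margin).mean()

def orthogonality_penalty(w):
    """Soft version of the orthogonal constraint W^T W = I."""
    d = w.shape[1]
    return (w.t() @ w - torch.eye(d, device=w.device)).pow(2).sum()
```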

51 citations


Posted Content
Yi Li, Lingxiao Song, Xiang Wu, Ran He, Tieniu Tan
TL;DR: A learning-from-generation approach for makeup-invariant face verification is proposed by introducing a bi-level adversarial network (BLAN): to alleviate the negative effects of makeup, non-makeup images are first generated from makeup ones, and the synthesized non-makeup images are then used for further verification.
Abstract: Makeup is widely used to improve facial attractiveness and is well accepted by the public. However, different makeup styles result in significant facial appearance changes. It remains a challenging problem to match makeup and non-makeup face images. This paper proposes a learning-from-generation approach for makeup-invariant face verification by introducing a bi-level adversarial network (BLAN). To alleviate the negative effects of makeup, we first generate non-makeup images from makeup ones, and then use the synthesized non-makeup images for further verification. The two adversarial networks in BLAN are integrated into an end-to-end deep network, one at the pixel level for reconstructing appealing facial images and the other at the feature level for preserving identity information. These two networks jointly reduce the sensing gap between makeup and non-makeup images. Moreover, we constrain the generator by incorporating multiple perceptual losses. Experimental results on three benchmark makeup face datasets demonstrate that our method achieves state-of-the-art verification accuracy across makeup status and can produce photo-realistic non-makeup face images.
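
A minimal sketch of the bi-level generator objective, assuming PyTorch, two discriminators (pixel level and feature level) and a frozen face feature extractor `face_net`; the weights and the exact set of perceptual terms are placeholders rather than the paper's configuration.

```python
import torch
import torch.nn.functional as F

def blan_style_generator_loss(fake_img, real_img, d_pix_logits, d_feat_logits,
                              face_net, w_pix=10.0, w_adv=1.0, w_ip=0.1):
    """Generator objective with pixel-level and feature-level adversaries."""
    # Pixel-level adversary judges visual realism of the de-makeup image.
    l_adv_pix = F.binary_cross_entropy_with_logits(
        d_pix_logits, torch.ones_like(d_pix_logits))
    # Feature-level adversary judges the identity features of the output.
    l_adv_feat = F.binary_cross_entropy_with_logits(
        d_feat_logits, torch.ones_like(d_feat_logits))
    # Reconstruction and identity terms tie the output to the non-makeup target.
    l_pix = F.l1_loss(fake_img, real_img)
    with torch.no_grad():
        target_feat = face_net(real_img)
    l_ip = F.mse_loss(face_net(fake_img), target_feat)
    return w_pix * l_pix + w_adv * (l_adv_pix + l_adv_feat) + w_ip * l_ip
```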

35 citations


Proceedings ArticleDOI
01 Mar 2017
TL;DR: The aesthetic map is learned by a deep convolutional neural network with a large-scale dataset for aesthetic quality assessment, and an aesthetic preservation model is developed to compute the aesthetic information retained in crops, so as to avoid cropping out highly aesthetic regions.
Abstract: Image cropping is a fundamental task in image editing for enhancing the aesthetic quality of images. In this paper, we propose an automatic image cropping technique based on an aesthetic map and a gradient energy map. Instead of relying on the aesthetic rules used in previous methods, we learn the aesthetic map with a deep convolutional neural network trained on a large-scale dataset for aesthetic quality assessment. The aesthetic map can highlight the discriminative image regions for the high (or low) aesthetic quality category. The gradient energy map captures the spatial distribution of edges in an image and is used to compute image simplicity. A composition model is then learned from the aesthetic map and the gradient energy map to evaluate the quality of composition for candidate crops. Moreover, an aesthetic preservation model is developed to compute the aesthetic information retained in crops, so as to avoid cropping out highly aesthetic regions. Experiments show that our approach significantly outperforms state-of-the-art cropping methods.
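
To make the two maps concrete, here is a small NumPy sketch: a gradient energy map plus a toy crop score that rewards retained aesthetic mass and penalizes cluttered content. The scoring combination and the weight `lam` are invented for illustration; the paper learns its composition model rather than hand-coding it.

```python
import numpy as np

def gradient_energy_map(img_gray):
    """Edge-energy map: sum of absolute horizontal and vertical gradients."""
    gx = np.abs(np.gradient(img_gray, axis=1))
    gy = np.abs(np.gradient(img_gray, axis=0))
    return gx + gy

def crop_score(aesthetic_map, energy_map, x, y, w, h, lam=0.5):
    """Toy crop score: keep aesthetic mass inside the crop, prefer simple
    (low gradient-energy) content. Illustrative only."""
    aes_in = aesthetic_map[y:y + h, x:x + w].sum() / (aesthetic_map.sum() + 1e-8)
    clutter = energy_map[y:y + h, x:x + w].mean() / (energy_map.mean() + 1e-8)
    return aes_in - lam * clutter
```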

Proceedings Article
Yi Li, Lingxiao Song, Xiang Wu, Ran He, Tieniu Tan
01 Sep 2017
TL;DR: A learning-from-generation approach for makeup-invariant face verification is proposed by introducing a bi-level adversarial network (BLAN): to alleviate the negative effects of makeup, non-makeup images are first generated from makeup ones, and the synthesized non-makeup images are then used for further verification.
Abstract: Makeup is widely used to improve facial attractiveness and is well accepted by the public. However, different makeup styles result in significant facial appearance changes. It remains a challenging problem to match makeup and non-makeup face images. This paper proposes a learning-from-generation approach for makeup-invariant face verification by introducing a bi-level adversarial network (BLAN). To alleviate the negative effects of makeup, we first generate non-makeup images from makeup ones, and then use the synthesized non-makeup images for further verification. The two adversarial networks in BLAN are integrated into an end-to-end deep network, one at the pixel level for reconstructing appealing facial images and the other at the feature level for preserving identity information. These two networks jointly reduce the sensing gap between makeup and non-makeup images. Moreover, we constrain the generator by incorporating multiple perceptual losses. Experimental results on three benchmark makeup face datasets demonstrate that our method achieves state-of-the-art verification accuracy across makeup status and can produce photo-realistic non-makeup face images.

Posted Content
Zhihe Lu, Zhihang Li, Jie Cao, Ran He, Zhenan Sun
TL;DR: A comprehensive review of typical face synthesis works, covering traditional methods as well as advanced deep learning approaches, is provided; in particular, the Generative Adversarial Net (GAN) is highlighted for generating photo-realistic and identity-preserving results.
Abstract: Face synthesis has been a fascinating yet challenging problem in computer vision and machine learning. Its main research effort is to design algorithms that generate photo-realistic face images from a given semantic domain. It has been a crucial preprocessing step of mainstream face recognition approaches and an excellent test of an AI system's ability to use complicated probability distributions. In this paper, we provide a comprehensive review of typical face synthesis works, covering traditional methods as well as advanced deep learning approaches. In particular, the Generative Adversarial Net (GAN) is highlighted for generating photo-realistic and identity-preserving results. Furthermore, publicly available databases and evaluation metrics are introduced in detail. We end the review by discussing unsolved difficulties and promising directions for future research.

Journal ArticleDOI
TL;DR: A novel conditional random field (CRF) based framework for weakly supervised semantic segmentation that outperforms or is comparable to state-of-the-art segmentation methods and is fit for domain adaptation.
Abstract: The task of semantic segmentation is to infer a predefined category label for each pixel in the image. In most cases, image segmentation is established as a fully supervised task. These methods are all built on the basis of having access to sufficient pixel-wise annotated samples for training. However, obtaining satisfactory ground truth is not only labor-intensive but also time-consuming, which severely hinders the generality of these fully supervised methods. Instead of pixel-level ground truth, weakly supervised approaches learn their models from much less prior information, e.g., image-level annotation. In this paper, we propose a novel conditional random field (CRF) based framework for weakly supervised semantic segmentation. Enlightened by jigsaw puzzles, we start the approach by merging superpixels from an image into larger pieces with a newly designed strategy. Then pieces from all the training images are gathered and associated with appropriate semantic labels by a CRF. Thus, the piece library is constructed, achieving remarkable universality and flexibility. At test time, we compare the superpixels with image pieces in the library and assign them the labels that minimize the potential energy. In addition, the proposed framework is fit for domain adaptation and obtains promising results, which is of great practical value. Extensive experimental results on the PASCAL VOC 2007, MSRC-21 and VOC 2012 databases demonstrate that our framework outperforms or is comparable to state-of-the-art segmentation methods.

Posted Content
TL;DR: This framework integrates cross-spectral face hallucination and discriminative feature learning into an end-to-end adversarial network, and outperforms state-of-the-art HFR methods without requiring a complex network or a large-scale training dataset.
Abstract: The gap between the sensing patterns of different face modalities remains a challenging problem in heterogeneous face recognition (HFR). This paper proposes an adversarial discriminative feature learning framework that closes the sensing gap via adversarial learning in both the raw-pixel space and a compact feature space. This framework integrates cross-spectral face hallucination and discriminative feature learning into an end-to-end adversarial network. In the pixel space, we make use of generative adversarial networks to perform cross-spectral face hallucination. An elaborate two-path model is introduced to alleviate the lack of paired images, giving consideration to both global structures and local textures. In the feature space, an adversarial loss and a high-order variance discrepancy loss are employed to measure the global and local discrepancy between the two heterogeneous distributions, respectively. These two losses enhance domain-invariant feature learning and modality-independent noise removal. Experimental results on three NIR-VIS databases show that our proposed approach outperforms state-of-the-art HFR methods without requiring a complex network or a large-scale training dataset.
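
As a stand-in for the high-order variance discrepancy term, the sketch below matches higher-order central moments of NIR and VIS feature batches in PyTorch; the chosen orders and the norm are assumptions, not the paper's exact definition.

```python
import torch

def variance_discrepancy(f_nir, f_vis, orders=(2, 3)):
    """Match higher-order central moments of two feature batches (B, D),
    pulling the NIR and VIS feature distributions together."""
    loss = 0.0
    for k in orders:
        m_nir = ((f_nir - f_nir.mean(0)) ** k).mean(0)   # per-dim k-th moment
        m_vis = ((f_vis - f_vis.mean(0)) ** k).mean(0)
        loss = loss + (m_nir - m_vis).pow(2).sum().sqrt()
    return loss
```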

Posted Content
Lingxiao Song, Zhihe Lu, Ran He, Zhenan Sun, Tieniu Tan
TL;DR: A Geometry-Guided Generative Adversarial Network (G2-GAN) is proposed for photo-realistic and identity-preserving facial expression synthesis.
Abstract: Facial expression synthesis has drawn much attention in the fields of computer graphics and pattern recognition. It has been widely used in face animation and recognition. However, it is still challenging due to the high-level semantics of large and non-linear face geometry variations. This paper proposes a Geometry-Guided Generative Adversarial Network (G2-GAN) for photo-realistic and identity-preserving facial expression synthesis. We employ facial geometry (fiducial points) as a controllable condition to guide facial texture synthesis with a specific expression. A pair of generative adversarial subnetworks are jointly trained towards opposite tasks: expression removal and expression synthesis. The paired networks form a mapping cycle between the neutral expression and arbitrary expressions, which also facilitates other applications such as face transfer and expression-invariant face recognition. Experimental results show that our method can generate compelling perceptual results on various facial expression synthesis databases. An expression-invariant face recognition experiment is also performed to further show the advantages of our proposed method.
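
The mapping cycle between the two subnetworks could be expressed as below, assuming PyTorch, a synthesis generator `g_syn`, a removal generator `g_rem` and a geometry condition (e.g., fiducial-point heatmaps); the adversarial and identity terms are omitted, and names are hypothetical.

```python
import torch
import torch.nn.functional as F

def g2gan_style_cycle_loss(g_syn, g_rem, neutral, expressive, geometry):
    """Cycle-consistency between expression synthesis and removal,
    both conditioned on facial geometry."""
    # Forward cycle: neutral -> expressive -> back to neutral.
    fake_exp = g_syn(neutral, geometry)
    rec_neutral = g_rem(fake_exp, geometry)
    # Backward cycle: expressive -> neutral -> back to expressive.
    fake_neu = g_rem(expressive, geometry)
    rec_exp = g_syn(fake_neu, geometry)
    return F.l1_loss(rec_neutral, neutral) + F.l1_loss(rec_exp, expressive)
```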

Proceedings ArticleDOI
Zhihe Lu, Zhihang Li, Jie Cao, Ran He, Zhenan Sun
15 Jun 2017
TL;DR: A comprehensive review of typical face synthesis works that involve traditional methods as well as advanced deep learning approaches is provided in this article, where Generative Adversarial Networks (GANs) are highlighted to generate photo-realistic and identity preserving results.
Abstract: Face synthesis has been a fascinating yet challenging problem in computer vision and machine learning. Its main research effort is to design algorithms that generate photo-realistic face images from a given semantic domain. It has been a crucial preprocessing step of mainstream face recognition approaches and an excellent test of an AI system's ability to use complicated probability distributions. In this paper, we provide a comprehensive review of typical face synthesis works, covering traditional methods as well as advanced deep learning approaches. In particular, the Generative Adversarial Net (GAN) is highlighted for generating photo-realistic and identity-preserving results. Furthermore, publicly available databases and evaluation metrics are introduced in detail. We end the review by discussing unsolved difficulties and promising directions for future research.

Posted Content
TL;DR: An Adversarial Occlusion-aware Face Detector (AOFD) is introduced that simultaneously detects occluded faces and segments occluded areas, employing an adversarial training strategy to generate occlusion-like face features that are difficult for a face detector to recognize.
Abstract: Occluded face detection is a challenging detection task due to the large appearance variations incurred by various real-world occlusions. This paper introduces an Adversarial Occlusion-aware Face Detector (AOFD) that simultaneously detects occluded faces and segments occluded areas. Specifically, we employ an adversarial training strategy to generate occlusion-like face features that are difficult for a face detector to recognize. An occlusion mask is predicted while detecting occluded faces, and the occluded area is utilized as an auxiliary signal rather than being regarded as a hindrance. Moreover, the supervisory signals from the segmentation branch reversely affect the features, accordingly aiding the detection of heavily occluded faces. Consequently, AOFD is able to find faces with few exposed facial landmarks with very high confidence and maintains high detection accuracy even for masked faces. Extensive experiments demonstrate that AOFD not only significantly outperforms state-of-the-art methods on the MAFA occluded face detection dataset, but also achieves competitive accuracy on benchmark datasets for general face detection such as FDDB.

Journal ArticleDOI
TL;DR: Discrete Cross-Modal Hashing (DCMH) is a novel supervised cross-modal hashing method that learns binary codes without relaxing them, using the codes as ideal features for classification.
Abstract: Hashing techniques have been widely adopted for cross-modal retrieval due to their low storage cost and fast query speed. Recently, some unimodal hashing methods have tried to directly optimize the objective function with discrete binary constraints. Inspired by these methods, the authors propose a novel supervised cross-modal hashing method called Discrete Cross-Modal Hashing (DCMH) to learn the binary codes without relaxing them. DCMH is formulated through semantic similarity reconstruction, and it learns binary codes for use as ideal features for classification. Furthermore, DCMH alternately updates binary codes for each modality, and its discrete hashing codes are learned efficiently, bit by bit, which is quite promising for large-scale datasets. To evaluate the effectiveness of the proposed discrete optimization, the authors optimize their objective function in a relax-and-threshold manner. Extensive empirical results on both image-text and image-tag datasets demonstrate that DCMH is a significant improvement over previous approaches in terms of training time and retrieval performance.
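
The semantic similarity reconstruction at the heart of the method can be sketched as follows, assuming PyTorch; this shows the relax-and-threshold baseline mentioned above, whereas the paper's main contribution is to update the discrete codes bit by bit without relaxation. Names and scaling are placeholders.

```python
import torch

def similarity_reconstruction(bx, by, sim, k):
    """bx, by: (N, k) codes for the two modalities (real-valued during the
    relaxed phase); sim: (N, N) semantic similarity in {-1, +1}. Code inner
    products, scaled to [-1, 1], should reconstruct the similarity matrix."""
    recon = (bx @ by.t()) / k
    return (recon - sim).pow(2).mean()

# Relax-and-threshold: optimize real-valued codes, then binarize at the end.
# binary_codes = torch.sign(real_codes)
```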

Posted Content
Yanbo Fan, Jian Liang, Ran He, Bao-Gang Hu, Siwei Lyu 
TL;DR: This paper proposes a novel localized multi-view subspace clustering model that considers the confidence levels of both views and samples; a regularizer on the weight parameters is developed based on convex conjugacy theory, and sample weights are determined in an adaptive manner.
Abstract: In multi-view clustering, different views may have different confidence levels when learning a consensus representation. Existing methods usually address this by assigning distinctive weights to different views. However, due to the noisy nature of real-world applications, the confidence levels of samples in the same view may also vary. Thus, considering a unified weight for a view may lead to suboptimal solutions. In this paper, we propose a novel localized multi-view subspace clustering model that considers the confidence levels of both views and samples. By properly assigning a weight to each sample under each view, we can obtain a robust consensus representation by fusing the noiseless structures among views and samples. We further develop a regularizer on the weight parameters based on convex conjugacy theory, and sample weights are determined in an adaptive manner. An efficient iterative algorithm is developed with a convergence guarantee. Experimental results on four benchmarks demonstrate the correctness and effectiveness of the proposed model.
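
For intuition, here is a generic adaptive per-sample, per-view weighting in PyTorch: smaller residual means higher confidence and thus larger weight. This exponential scheme is a stand-in for illustration only; the paper derives its weights from a convex-conjugate regularizer rather than this rule.

```python
import torch

def adaptive_sample_view_weights(residuals, gamma=1.0):
    """residuals: (V, N) per-view, per-sample reconstruction errors.
    Returns weights that downweight noisy samples within each view."""
    w = torch.exp(-residuals / gamma)        # low error -> high confidence
    return w / w.sum(dim=1, keepdim=True)    # normalize within each view
```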

Posted Content
TL;DR: Experimental results on four Meshface databases demonstrate the effectiveness of the proposed method for face completion under structured occlusions.
Abstract: Face completion aims to generate semantically new pixels for missing facial components. It is a challenging generative task due to the large variations of face appearance. This paper studies generative face completion under structured occlusions. We treat face completion and corruption as disentangling and fusing processes of clean faces and occlusions, and propose a jointly disentangling and fusing Generative Adversarial Network (DF-GAN). First, three domains are constructed, corresponding to the distributions of occluded faces, clean faces and structured occlusions. The disentangling and fusing processes are formulated as transformations between the three domains. Then the disentangling and fusing networks are built to learn the transformations from unpaired data, where an encoder-decoder structure is adopted that allows DF-GAN to simulate structured occlusions by modifying the latent representations. Finally, the disentangling and fusing processes are unified into a dual learning framework along with an adversarial strategy. The proposed method is evaluated on the Meshface verification problem. Experimental results on four Meshface databases demonstrate the effectiveness of our proposed method for face completion under structured occlusions.

Posted Content
TL;DR: Wasserstein CNN (WCNN) is proposed to learn invariant features between near-infrared and visual face images (i.e., NIR-VIS face recognition) by minimizing the Wasserstein distance between the NIR and VIS feature distributions for an invariant deep feature representation.
Abstract: Heterogeneous face recognition (HFR) aims to match facial images acquired from different sensing modalities, with mission-critical applications in forensics, security and commercial sectors. However, HFR is a much more challenging problem than traditional face recognition because of large intra-class variations of heterogeneous face images and limited training samples of cross-modality face image pairs. This paper proposes a novel approach, namely Wasserstein CNN (convolutional neural networks, or WCNN for short), to learn invariant features between near-infrared and visual face images (i.e., NIR-VIS face recognition). The low-level layers of WCNN are trained with widely available face images in the visual spectrum. The high-level layer is divided into three parts: the NIR layer, the VIS layer and the NIR-VIS shared layer. The first two layers aim to learn modality-specific features, while the NIR-VIS shared layer is designed to learn a modality-invariant feature subspace. The Wasserstein distance is introduced into the NIR-VIS shared layer to measure the dissimilarity between heterogeneous feature distributions. WCNN learning thus aims to minimize the Wasserstein distance between the NIR distribution and the VIS distribution for an invariant deep feature representation of heterogeneous face images. To avoid overfitting on small-scale heterogeneous face data, a correlation prior is introduced on the fully-connected layers of the WCNN network to reduce the parameter space. This prior is implemented by a low-rank constraint in an end-to-end network. The joint formulation leads to an alternating minimization for deep feature representation at the training stage and an efficient computation for heterogeneous data at the testing stage. Extensive experiments on three challenging NIR-VIS face recognition databases demonstrate the significant superiority of Wasserstein CNN over state-of-the-art methods.
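
If each modality's shared-layer features are modeled as a diagonal Gaussian, the squared 2-Wasserstein distance between the two distributions has a closed form, sketched below in PyTorch. The diagonal-Gaussian simplification is an assumption for illustration; the paper's shared-layer criterion may differ in detail.

```python
import torch

def gaussian_w2(f_nir, f_vis, eps=1e-6):
    """Squared 2-Wasserstein distance between two feature batches (B, D),
    each modeled as a diagonal Gaussian: ||m1 - m2||^2 + ||s1 - s2||^2."""
    m1, m2 = f_nir.mean(0), f_vis.mean(0)
    s1 = f_nir.var(0, unbiased=False).clamp_min(eps).sqrt()
    s2 = f_vis.var(0, unbiased=False).clamp_min(eps).sqrt()
    return (m1 - m2).pow(2).sum() + (s1 - s2).pow(2).sum()
```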

Posted Content
TL;DR: An effective distance metric on image sets is defined that explicitly minimizes the intra-set distance and maximizes the inter-set distance simultaneously, and, inspired by the Neural Turing Machine, a Memory Attention Weighting is proposed to adapt to set-aware global contents.
Abstract: Face recognition has made great progress with the development of deep learning. However, video face recognition (VFR) remains an ongoing task due to varying illumination, low resolution, pose variations and motion blur. Most existing CNN-based VFR methods only obtain a feature vector from a single image and simply aggregate the features in a video, giving little consideration to the correlations among face images in one video. In this paper, we propose a novel Attention-Set based Metric Learning (ASML) method to measure the statistical characteristics of image sets. It is a promising and generalized extension of Maximum Mean Discrepancy with memory attention weighting. First, we define an effective distance metric on image sets, which explicitly minimizes the intra-set distance and maximizes the inter-set distance simultaneously. Second, inspired by the Neural Turing Machine, a Memory Attention Weighting is proposed to adapt to set-aware global contents. ASML is then naturally integrated into CNNs, resulting in an end-to-end learning scheme. Our method achieves state-of-the-art performance for the task of video face recognition on three widely used benchmarks: YouTubeFace, YouTube Celebrities and Celebrity-1000.
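
A simplified stand-in for the attention-weighted set distance, in PyTorch: attention over each set is derived from a learnable query vector, and the distance compares the attention-weighted set means, in the spirit of an MMD-style statistic. The single-query attention is an assumption, not the paper's Memory Attention Weighting.

```python
import torch

def attention_set_distance(set_a, set_b, query):
    """set_a: (Na, D), set_b: (Nb, D) image-set features; query: (D,)
    learnable vector producing attention weights over each set."""
    wa = torch.softmax(set_a @ query, dim=0)      # (Na,) attention over set A
    wb = torch.softmax(set_b @ query, dim=0)      # (Nb,) attention over set B
    mean_a = (wa.unsqueeze(1) * set_a).sum(0)     # attention-weighted means
    mean_b = (wb.unsqueeze(1) * set_b).sum(0)
    return (mean_a - mean_b).pow(2).sum()

# Metric learning would minimize this distance for same-identity set pairs
# and push it above a margin for different identities.
```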

Proceedings ArticleDOI
TL;DR: A two-stage multi-task auto-encoder framework for fast face alignment that incorporates head pose information to handle large view variations; its computational cost is much lower than that of its deep learning competitors.
Abstract: Face alignment is an important problem in computer vision. It is still an open problem due to variations of facial attributes (e.g., head pose, facial expression, illumination). Many studies have shown that face alignment and facial attribute analysis are often correlated. This paper develops a two-stage multi-task auto-encoder framework for fast face alignment that incorporates head pose information to handle large view variations. In the first and second stages, multi-task auto-encoders are used to roughly locate and then refine facial landmark locations with related pose information, respectively. In addition, a shape constraint is naturally encoded into our two-stage face alignment framework to preserve facial structures. A coarse-to-fine strategy is adopted to refine the facial landmark results under the shape constraint. Furthermore, the computational cost of our method is much lower than that of its deep learning competitors. Experimental results on various challenging datasets show the effectiveness of the proposed method.


Proceedings ArticleDOI
01 Nov 2017
TL;DR: This paper proposes a novel Attention-Set based Metric Learning (ASML) method for VFR that is a promising and generalized extension of Maximum Mean Discrepancy with Memory Attention Weighting inspired by Neural Turing Machine.
Abstract: Face recognition has made great progress with the development of deep learning. However, video face recognition (VFR) remains an ongoing task due to varying illumination, low resolution, pose variations and motion blur. In this paper, we propose a novel Attention-Set based Metric Learning (ASML) method for VFR. It is a promising and generalized extension of Maximum Mean Discrepancy with a Memory Attention Weighting inspired by the Neural Turing Machine. ASML can be naturally integrated into convolutional neural networks, resulting in an end-to-end learning scheme. Our method achieves state-of-the-art performance for the task of video face recognition on three widely used benchmarks: YouTubeFace, YouTube Celebrities and Celebrity-1000.

Book ChapterDOI
30 Jun 2017
TL;DR: A global perception feedback convolutional neural network that considers the global structure of the visual response during feedback inference and eliminates the “visual illusions” produced in the process of visual attention.
Abstract: The top-down feedback mechanism is an important module of visual attention for weakly supervised learning. Previous top-down feedback convolutional neural networks often perform local perception during feedback. Inspired by the fact that the visual system is sensitive to global topological properties [1], we propose a global perception feedback convolutional neural network that considers the global structure of the visual response during feedback inference. The global perception eliminates the “visual illusions” produced in the process of visual attention. It is achieved by simply imposing the trace norm on hidden neuron activations. In particular, when updating the status of hidden neuron activations during gradient backpropagation, we discard the minor constituents of the SVD, which ensures both the global low-rank structure of the feedback information and the elimination of local noise. Experimental results on the ImageNet dataset corroborate our claims and demonstrate the effectiveness of our global perception model.
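
The "discard minor SVD constituents" step can be illustrated with a truncated-SVD projection in PyTorch; the energy-based rank selection and the threshold `keep` are assumptions standing in for however the paper selects components.

```python
import torch

def low_rank_feedback(activations, keep=0.9):
    """Project hidden activations (N, D) onto their dominant SVD components,
    discarding minor constituents to keep a global low-rank structure."""
    u, s, vh = torch.linalg.svd(activations, full_matrices=False)
    energy = torch.cumsum(s, dim=0) / s.sum()
    r = int((energy < keep).sum()) + 1          # smallest rank covering 'keep'
    return (u[:, :r] * s[:r]) @ vh[:r]          # rank-r reconstruction
```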