
Showing papers by "Stan Z. Li published in 2017"


Proceedings Article•DOI•
Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, Stan Z. Li
17 Aug 2017
TL;DR: S3FD proposes a scale-equitable face detection framework to handle different scales of faces well and improves the recall rate of small faces with a scale compensation anchor matching strategy.
Abstract: This paper presents a real-time face detector, named Single Shot Scale-invariant Face Detector (S3FD), which performs superiorly on various scales of faces with a single deep neural network, especially for small faces. Specifically, we try to solve the common problem that anchor-based detectors deteriorate dramatically as the objects become smaller. We make contributions in the following three aspects: 1) proposing a scale-equitable face detection framework to handle different scales of faces well. We tile anchors on a wide range of layers to ensure that all scales of faces have enough features for detection. Besides, we design anchor scales based on the effective receptive field and a proposed equal proportion interval principle; 2) improving the recall rate of small faces by a scale compensation anchor matching strategy; 3) reducing the false positive rate of small faces via a max-out background label. As a consequence, our method achieves state-of-the-art detection performance on all the common face detection benchmarks, including the AFW, PASCAL face, FDDB and WIDER FACE datasets, and can run at 36 FPS on an Nvidia Titan X (Pascal) for VGA-resolution images.
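
As a rough illustration of the scale-equitable design described above, the sketch below tiles one square anchor per feature-map cell, with each layer's anchor scale set to four times its stride (the strides and scales follow the paper's setup; the helper itself is our own illustration, not the authors' code).

```python
# Illustrative sketch of S3FD's "equal proportion interval" principle:
# every detection layer gets an anchor scale of 4x its stride, so anchors
# of all sizes tile the image with the same density.
strides = [4, 8, 16, 32, 64, 128]          # detection-layer strides in S3FD
anchor_scales = [4 * s for s in strides]   # -> [16, 32, 64, 128, 256, 512]

def tile_anchors(img_w, img_h):
    """Yield square anchors (cx, cy, side) for every layer, one per cell."""
    for stride, scale in zip(strides, anchor_scales):
        for y in range(0, img_h, stride):
            for x in range(0, img_w, stride):
                # anchor centered on this feature-map cell
                yield (x + stride / 2, y + stride / 2, scale)
```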

374 citations


Proceedings Article•DOI•
Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, Stan Z. Li
TL;DR: FaceBoxes is a novel face detector consisting of the Rapidly Digested Convolutional Layers (RDCL) and the Multiple Scale Convolutional Layers (MSCL).
Abstract: Although tremendous strides have been made in face detection, one of the remaining open challenges is to achieve real-time speed on the CPU as well as maintain high performance, since effective models for face detection tend to be computationally prohibitive. To address this challenge, we propose a novel face detector, named FaceBoxes, with superior performance on both speed and accuracy. Specifically, our method has a lightweight yet powerful network structure that consists of the Rapidly Digested Convolutional Layers (RDCL) and the Multiple Scale Convolutional Layers (MSCL). The RDCL is designed to enable FaceBoxes to achieve real-time speed on the CPU. The MSCL aims at enriching the receptive fields and discretizing anchors over different layers to handle faces of various scales. Besides, we propose a new anchor densification strategy to make different types of anchors have the same density on the image, which significantly improves the recall rate of small faces. As a consequence, the proposed detector runs at 20 FPS on a single CPU core and 125 FPS using a GPU for VGA-resolution images. Moreover, the speed of FaceBoxes is invariant to the number of faces. We comprehensively evaluate this method and present state-of-the-art detection performance on several face detection benchmark datasets, including the AFW, PASCAL face, and FDDB.
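
The anchor densification idea lends itself to a short sketch: replicate a small anchor n x n times with sub-cell offsets so that every anchor type reaches the same density (anchor scale divided by tiling interval). The function and the example densification factors below are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of FaceBoxes-style anchor densification.
def densify(cx, cy, scale, stride, n):
    """Return n*n copies of one anchor, evenly offset inside its cell."""
    boxes = []
    for i in range(n):
        for j in range(n):
            off_x = (i + 0.5) / n - 0.5   # offsets within (-0.5, 0.5) cells
            off_y = (j + 0.5) / n - 0.5
            boxes.append((cx + off_x * stride, cy + off_y * stride, scale))
    return boxes

# e.g. on a stride-32 layer, densifying a 32-pixel anchor 4x and a 64-pixel
# anchor 2x would match the density of an undensified 128-pixel anchor.
```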

207 citations


Proceedings Article•DOI•
21 Jul 2017
TL;DR: A novel multi-view subspace clustering model that harnesses the complementary information between different representations by introducing a novel position-aware exclusivity term, while a consistency term is employed to make these complementary representations further share a common indicator.
Abstract: Multi-view subspace clustering aims to partition a set of multi-source data into their underlying groups. To boost the performance of multi-view clustering, numerous subspace learning algorithms have been developed in recent years, but with rare exploitation of the representation complementarity between different views as well as the indicator consistency among the representations, let alone considering them simultaneously. In this paper, we propose a novel multi-view subspace clustering model that attempts to harness the complementary information between different representations by introducing a novel position-aware exclusivity term. Meanwhile, a consistency term is employed to make these complementary representations further share a common indicator. We formulate the above concerns into a unified optimization framework, sketched below. Experiments on several benchmark datasets demonstrate the effectiveness of our algorithm over other state-of-the-art methods.
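
One plausible way to write the unified optimization framework is sketched below; the exclusivity and consistency terms are assumptions chosen to match the abstract's description, not the paper's exact objective.

```latex
% Assumed formulation: per-view self-expressive reconstruction, a
% position-aware exclusivity term between view representations (Hadamard
% product), and a consistency term tying every view to a common indicator.
\min_{\{Z_v\},\, Z^{*}} \;
  \sum_{v=1}^{V} \lVert X_v - X_v Z_v \rVert_F^2
  \;+\; \lambda_1 \sum_{v \neq w} \lVert Z_v \odot Z_w \rVert_1
  \;+\; \lambda_2 \sum_{v=1}^{V} \lVert Z_v - Z^{*} \rVert_F^2
```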

199 citations


Posted Content•
TL;DR: This paper presents a real-time face detector, named Single Shot Scale-invariant Face Detector (S3FD), which performs superiorly on various scales of faces with a single deep neural network, especially for small faces.
Abstract: This paper presents a real-time face detector, named Single Shot Scale-invariant Face Detector (S$^3$FD), which performs superiorly on various scales of faces with a single deep neural network, especially for small faces. Specifically, we try to solve the common problem that anchor-based detectors deteriorate dramatically as the objects become smaller. We make contributions in the following three aspects: 1) proposing a scale-equitable face detection framework to handle different scales of faces well. We tile anchors on a wide range of layers to ensure that all scales of faces have enough features for detection. Besides, we design anchor scales based on the effective receptive field and a proposed equal proportion interval principle; 2) improving the recall rate of small faces by a scale compensation anchor matching strategy; 3) reducing the false positive rate of small faces via a max-out background label. As a consequence, our method achieves state-of-the-art detection performance on all the common face detection benchmarks, including the AFW, PASCAL face, FDDB and WIDER FACE datasets, and can run at 36 FPS on an Nvidia Titan X (Pascal) for VGA-resolution images.

150 citations


Posted Content•
Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, Stan Z. Li
TL;DR: FaceBoxes proposes a lightweight yet powerful network structure, consisting of the Rapidly Digested Convolutional Layers (RDCL) and the Multiple Scale Convolutional Layers (MSCL), that enables real-time face detection speed on the CPU.
Abstract: Although tremendous strides have been made in face detection, one of the remaining open challenges is to achieve real-time speed on the CPU as well as maintain high performance, since effective models for face detection tend to be computationally prohibitive. To address this challenge, we propose a novel face detector, named FaceBoxes, with superior performance on both speed and accuracy. Specifically, our method has a lightweight yet powerful network structure that consists of the Rapidly Digested Convolutional Layers (RDCL) and the Multiple Scale Convolutional Layers (MSCL). The RDCL is designed to enable FaceBoxes to achieve real-time speed on the CPU. The MSCL aims at enriching the receptive fields and discretizing anchors over different layers to handle faces of various scales. Besides, we propose a new anchor densification strategy to make different types of anchors have the same density on the image, which significantly improves the recall rate of small faces. As a consequence, the proposed detector runs at 20 FPS on a single CPU core and 125 FPS using a GPU for VGA-resolution images. Moreover, the speed of FaceBoxes is invariant to the number of faces. We comprehensively evaluate this method and present state-of-the-art detection performance on several face detection benchmark datasets, including the AFW, PASCAL face, and FDDB. Code is available at this https URL

117 citations


Journal Article•DOI•
TL;DR: The proposed multi-label convolutional neural network (MLCNN) can simultaneously predict multiple pedestrian attributes and significantly outperforms the SVM-based method on the PETA database.
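
A minimal sketch of how multi-label attribute prediction is typically wired (our illustration; the backbone, attribute count, and loss are assumptions, not MLCNN's exact architecture):

```python
import torch
import torch.nn as nn

# Toy multi-label setup: one backbone, one sigmoid output per attribute,
# trained with binary cross-entropy so all attributes are predicted at once.
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(16, 35)                 # e.g. 35 binary pedestrian attributes
criterion = nn.BCEWithLogitsLoss()       # independent per-attribute labels

x = torch.randn(2, 3, 128, 64)           # toy batch of pedestrian crops
labels = torch.randint(0, 2, (2, 35)).float()
loss = criterion(head(backbone(x)), labels)
```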

106 citations


Book Chapter•DOI•
Xuezhi Liang, Xiaobo Wang, Zhen Lei, Shengcai Liao, Stan Z. Li
14 Nov 2017
TL;DR: A novel soft-margin softmax (SM-Softmax) loss that improves the discriminative power of features; it can not only adjust the desired continuous soft margin but can also be easily optimized by typical stochastic gradient descent (SGD).
Abstract: In deep classification, the softmax loss (Softmax) is arguably one of the most commonly used components to train deep convolutional neural networks (CNNs). However, such a widely used loss is limited due to its lack of encouraging the discriminability of features. Recently, the large-margin softmax loss (L-Softmax [1]) was proposed to explicitly enhance the feature discrimination, with a hard margin and complex forward and backward computation. In this paper, we propose a novel soft-margin softmax (SM-Softmax) loss to improve the discriminative power of features. Specifically, SM-Softmax only modifies the forward of Softmax by introducing a non-negative real number m, without changing the backward. Thus it can not only adjust the desired continuous soft margin but can also be easily optimized by typical stochastic gradient descent (SGD). Experimental results on three benchmark datasets demonstrate the superiority of our SM-Softmax over the baseline Softmax, the alternative L-Softmax and several state-of-the-art competitors.
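
Since SM-Softmax only changes the forward pass, the idea fits in a few lines. The sketch below subtracts the margin m from the target-class logit before the softmax; reading the abstract this way is our assumption, not the paper's exact formulation.

```python
import numpy as np

# Hedged sketch of the soft-margin softmax idea: enforce f_y - m > f_j by
# subtracting a non-negative margin m from the target logit in the forward
# pass; the backward keeps the standard softmax form, so SGD applies as-is.
def sm_softmax_loss(logits, label, m=0.5):
    """logits: (C,) scores for one sample; label: target class index."""
    z = logits.copy()
    z[label] -= m                      # the soft margin, continuous in m
    z -= z.max()                       # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])
```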

81 citations


Proceedings Article•
01 Jan 2017
TL;DR: A novel coding method named weighted linear coding (WLC) learns multi-level descriptors from raw pixel data in an unsupervised manner; it guarantees saliency through a similarity constraint, and the resulting descriptors strike a good balance between robustness and distinctiveness.
Abstract: In this paper, we propose a novel coding method named weighted linear coding (WLC) to learn multi-level (e.g., pixel-level, patch-level and image-level) descriptors from raw pixel data in an unsupervised manner. It guarantees the property of saliency through a similarity constraint. The resulting multi-level descriptors strike a good balance between robustness and distinctiveness. Based on WLC, all data from the same region can be jointly encoded. Consequently, spatial consistency is preserved when we extract holistic image features. Furthermore, we apply PCA to these features to obtain compact person representations. During the person matching stage, we exploit the complementary information residing in the multi-level descriptors via a score-level fusion strategy. Experiments on the challenging person re-identification datasets VIPeR and CUHK01 demonstrate the effectiveness of our method.
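
The score-level fusion step can be sketched as a convex combination of per-level similarity scores (the weights here are placeholder assumptions; the paper's fusion details may differ):

```python
import numpy as np

# Minimal sketch of score-level fusion over multi-level descriptors:
# match at each level separately, then mix the scores with convex weights.
def fused_score(desc_a, desc_b, weights=(0.3, 0.3, 0.4)):
    """desc_a, desc_b: lists of L2-normalized pixel-, patch- and image-level
    descriptors for two person images; returns one fused similarity."""
    scores = [float(a @ b) for a, b in zip(desc_a, desc_b)]  # cosine scores
    return sum(w * s for w, s in zip(weights, scores))
```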

46 citations


Posted Content•
TL;DR: This paper proposes a novel single-shot based detector, called RefineDet, that achieves better accuracy than two-stage methods and maintains comparable efficiency of one- stage methods.
Abstract: For object detection, the two-stage approach (e.g., Faster R-CNN) has been achieving the highest accuracy, whereas the one-stage approach (e.g., SSD) has the advantage of high efficiency. To inherit the merits of both while overcoming their disadvantages, in this paper, we propose a novel single-shot based detector, called RefineDet, that achieves better accuracy than two-stage methods and maintains comparable efficiency to one-stage methods. RefineDet consists of two inter-connected modules, namely, the anchor refinement module and the object detection module. Specifically, the former aims to (1) filter out negative anchors to reduce the search space for the classifier, and (2) coarsely adjust the locations and sizes of anchors to provide better initialization for the subsequent regressor. The latter module takes the refined anchors from the former as input to further improve the regression and predict multi-class labels. Meanwhile, we design a transfer connection block to transfer the features in the anchor refinement module to predict locations, sizes and class labels of objects in the object detection module. The multi-task loss function enables us to train the whole network in an end-to-end way. Extensive experiments on PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO demonstrate that RefineDet achieves state-of-the-art detection accuracy with high efficiency. Code is available at this https URL
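
A toy module can make the two-step structure concrete. The sketch below is a deliberately simplified illustration (single feature map, assumed channel counts), not the released RefineDet code:

```python
import torch
import torch.nn as nn

# Toy sketch of RefineDet's two inter-connected modules.
class TinyRefineDet(nn.Module):
    def __init__(self, channels=256, num_classes=21, anchors_per_cell=3):
        super().__init__()
        k = anchors_per_cell
        self.arm_cls = nn.Conv2d(channels, 2 * k, 3, padding=1)  # fg/bg score
        self.arm_reg = nn.Conv2d(channels, 4 * k, 3, padding=1)  # anchor deltas
        self.tcb = nn.Sequential(                                # transfer block
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.odm_cls = nn.Conv2d(channels, num_classes * k, 3, padding=1)
        self.odm_reg = nn.Conv2d(channels, 4 * k, 3, padding=1)

    def forward(self, feat):
        arm_scores = self.arm_cls(feat)  # used to discard easy negative anchors
        arm_deltas = self.arm_reg(feat)  # coarse anchor adjustment
        odm_feat = self.tcb(feat)        # ARM features handed to the ODM
        return (arm_scores, arm_deltas,
                self.odm_cls(odm_feat), self.odm_reg(odm_feat))
```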

37 citations


Proceedings Article•DOI•
TL;DR: Identification loss combined with center loss trains a deep model for person re-identification without requiring image pairs or triplets, while the inter-class distinction and intra-class variance are well handled.
Abstract: The person re-identification task has been greatly boosted by deep convolutional neural networks (CNNs) in recent years. The core is to enlarge the inter-class distinction as well as reduce the intra-class variance. However, to achieve this, existing deep models prefer to adopt image pairs or triplets to form a verification loss, which is inefficient and unstable since the number of training pairs or triplets grows rapidly as the training data grows. Moreover, their performance is limited since they ignore the fact that different embedding dimensions may carry different importance. In this paper, we propose to employ identification loss with center loss to train a deep model for person re-identification. The training process is efficient since it does not require image pairs or triplets for training, while the inter-class distinction and intra-class variance are well handled. To boost the performance, a new feature reweighting (FRW) layer is designed to explicitly emphasize the importance of each embedding dimension, thus leading to an improved embedding. Experiments on several benchmark datasets have shown the superiority of our method over the state-of-the-art alternatives in both accuracy and speed.
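
A compact sketch of the training objective and the FRW idea, assuming a simple reading of FRW as a learned per-dimension scaling and a placeholder center-loss weight lam (both are our assumptions):

```python
import torch
import torch.nn as nn

class FRW(nn.Module):
    """Feature reweighting: one learned weight per embedding dimension."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * self.w   # emphasize important dimensions, damp the rest

def total_loss(logits, emb, labels, centers, lam=0.005):
    """centers: (num_classes, dim) class centers (an nn.Parameter in practice)."""
    id_loss = nn.functional.cross_entropy(logits, labels)      # identification
    center_loss = ((emb - centers[labels]) ** 2).sum(1).mean() # pull to center
    return id_loss + lam * center_loss
```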

37 citations


Proceedings Article•DOI•
01 May 2017
TL;DR: In the final testing phase of the Micro Emotion Challenge, the proposed multi-modality convolutional neural network based on visual and geometrical information proved more effective and achieved better performance.
Abstract: Micro emotion recognition is a very challenging problem because of the subtle appearance variations among different facial expression classes. To deal with this problem, we propose a multi-modality convolutional neural network (CNN) based on visual and geometrical information. The visual face image and structured geometry are embedded into a unified network, and the recognition accuracy benefits from the fused information. The proposed network includes two branches. The first branch extracts visual features from color face images, and the other branch extracts geometry features from 68 facial landmarks. Both visual and geometry features are then concatenated into a long vector, which is finally fed to a hinge loss layer. Compared with a CNN architecture that uses only face images, our method is more effective and achieves better performance. In the final testing phase of the Micro Emotion Challenge, our method took first place with a misclassification score of 80.212137.
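
The two-branch fusion can be illustrated with a toy network; the layer sizes below are placeholder assumptions, and only the overall structure (a CNN branch for the image, a branch for the 68 landmarks, concatenation before the classifier) follows the description above:

```python
import torch
import torch.nn as nn

# Simplified sketch of the two-branch, multi-modality design.
class TwoBranchNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.visual = nn.Sequential(                       # color-image branch
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())         # -> 64-d visual feature
        self.geometry = nn.Sequential(                     # landmark branch
            nn.Linear(136, 64), nn.ReLU())                 # 68 x 2 = 136 inputs
        self.classifier = nn.Linear(64 + 64, num_classes)  # fused long vector

    def forward(self, image, landmarks):
        """image: (B, 3, H, W); landmarks: (B, 136) flattened 68 points."""
        fused = torch.cat([self.visual(image), self.geometry(landmarks)], dim=1)
        return self.classifier(fused)  # scores for a hinge (or other) loss
```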

Journal Article•DOI•
Hailin Shi, Xiaobo Wang, Dong Yi, Zhen Lei, Xiangyu Zhu, Stan Z. Li
TL;DR: The proposed HJB explicitly models the modality difference of image pairs and is able to discriminate same/different face pairs more accurately, showing the superiority of the HJB over previous methods.
Abstract: In many face recognition applications, the modalities of face images in the gallery and probe sets are different, which is known as heterogeneous face recognition. How to reduce the feature gap between images from different modalities is a critical issue in developing a highly accurate face recognition algorithm. Recently, joint Bayesian (JB) has demonstrated superior performance on general face recognition compared to traditional discriminant analysis methods like subspace learning. However, the original JB treats the two input samples equally, without taking the modality difference between them into account, and may thus be suboptimal for the heterogeneous face recognition problem. In this work, we extend the original JB by modeling the gallery and probe images using two different Gaussian distributions to propose a heterogeneous joint Bayesian (HJB) formulation for cross-modality face recognition. The proposed HJB explicitly models the modality difference of image pairs and is therefore able to discriminate same/different face pairs more accurately. Extensive experiments conducted on visible–near-infrared and ID photo versus spot face recognition problems show the superiority of the HJB over previous methods.
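
Sketching the modeling step in the joint Bayesian style the abstract describes (the exact covariance parameterization and learning procedure are in the paper; this is a paraphrase, not the derivation):

```latex
% Joint Bayesian writes a face as identity plus within-class noise; the
% HJB extension gives the gallery (g) and probe (p) modalities their own
% noise Gaussians instead of a shared one.
x_g = \mu + \varepsilon_g, \qquad x_p = \mu + \varepsilon_p, \qquad
\mu \sim \mathcal{N}(0, S_\mu),\;\;
\varepsilon_g \sim \mathcal{N}(0, S_{\varepsilon_g}),\;\;
\varepsilon_p \sim \mathcal{N}(0, S_{\varepsilon_p})
% A pair is then verified by the usual log-likelihood ratio:
r(x_g, x_p) = \log \frac{P(x_g, x_p \mid \mathrm{same})}
                        {P(x_g, x_p \mid \mathrm{different})}
```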

Book Chapter•DOI•
Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, Stan Z. Li
28 Oct 2017
TL;DR: This work proposes a novel face detector, dubbed the Densely Connected Face Proposal Network (DCFPN), with high performance as well as real-time speed on CPU devices; it uses a dense anchor strategy and a fair L1 loss function to handle small faces well.
Abstract: Accuracy and efficiency are two conflicting challenges for face detection, since effective models tend to be computationally prohibitive. To address these two conflicting challenges, our core idea is to shrink the input image and focus on detecting small faces. Specifically, we propose a novel face detector, dubbed the Densely Connected Face Proposal Network (DCFPN), with high performance as well as real-time speed on CPU devices. On the one hand, we subtly design a lightweight-but-powerful fully convolutional network with consideration of both efficiency and accuracy. On the other hand, we use a dense anchor strategy and propose a fair L1 loss function to handle small faces well. As a consequence, our method can detect faces at 30 FPS on a single 2.60 GHz CPU core and 250 FPS using a GPU for VGA-resolution images. We achieve state-of-the-art performance on the AFW, PASCAL face and FDDB datasets.
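
The abstract does not define the fair L1 loss, so the sketch below is speculative: one natural reading is an L1 regression error normalized by face size, so small faces are not drowned out by large ones.

```python
import numpy as np

# Speculative sketch of a size-normalized ("fair") L1 regression loss.
def fair_l1(pred, target):
    """pred, target: (N, 4) boxes as (cx, cy, w, h)."""
    scale = np.maximum(target[:, 2:3], target[:, 3:4])   # per-face size
    return np.abs((pred - target) / scale).sum(axis=1).mean()
```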

Posted Content•
09 May 2017
TL;DR: A new feature reweighting (FRW) layer is designed to explicitly emphasize the importance of each embedding dimension, thus leading to an improved embedding in a deep model for person re-identification.
Abstract: The person re-identification task has been greatly boosted by deep convolutional neural networks (CNNs) in recent years. The core is to enlarge the inter-class distinction as well as reduce the intra-class variance. However, to achieve this, existing deep models prefer to adopt image pairs or triplets to form a verification loss, which is inefficient and unstable since the number of training pairs or triplets grows rapidly as the training data grows. Moreover, their performance is limited since they ignore the fact that different embedding dimensions may carry different importance. In this paper, we propose to employ identification loss with center loss to train a deep model for person re-identification. The training process is efficient since it does not require image pairs or triplets for training, while the inter-class distinction and intra-class variance are well handled. To boost the performance, a new feature reweighting (FRW) layer is designed to explicitly emphasize the importance of each embedding dimension, thus leading to an improved embedding. Experiments on several benchmark datasets have shown the superiority of our method over the state-of-the-art alternatives.

Book Chapter•DOI•
Zhen Lei, Wang Tao, Xiangyu Zhu, Tianyu Fu, Stan Z. Li 
30 Sep 2017
TL;DR: The proposed anti-spoofing method is applicable to face recognition applications such as face access control and remote authentication on mobile devices, and the simple head rotation requirement is acceptable in these applications.
Abstract: This work focuses on the most common and cheapest face spoofing methods, i.e., photo attacks (including a photo printed on paper or a photo displayed on an electronic screen). Many previous works [3-6] propose to classify genuine and fake samples based on frontal face images and achieve good performance on several face spoofing databases. However, in real applications, the imposter will try his best to fool the system, and the texture difference between genuine and fake samples is usually very small. In order to achieve robust face anti-spoofing performance, other cues like 3D face structure and motion patterns can be incorporated. In this work, we propose to detect spoofing photo attacks based on a sequence of rotated face images. Both the structure and texture information from the rotated face sequence are exploited. In practice, the users are only asked to make a simple movement (i.e., rotate their faces). As pointed out in [7], this head rotation requirement is much simpler than traditional challenge-response-based face anti-spoofing methods, in which a combination of multiple movements is usually necessary. The proposed anti-spoofing method is applicable to face recognition applications such as face access control and remote authentication on mobile devices, and the simple head rotation requirement is acceptable in these applications.
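
One simple structure cue of the kind this approach can exploit is planarity: under rotation, landmarks on a printed photo move according to a single homography, whereas a real 3D face leaves a larger fitting residual. The check below is our own toy illustration, not the paper's algorithm:

```python
import numpy as np
import cv2

# Toy planarity check for photo-attack detection.
def planarity_score(pts_a, pts_b):
    """pts_a, pts_b: (68, 2) landmark arrays from two frames of the rotation."""
    a = pts_a.reshape(-1, 1, 2).astype(np.float32)
    b = pts_b.reshape(-1, 1, 2).astype(np.float32)
    H, _ = cv2.findHomography(a, b, 0)        # 0 = plain least-squares fit
    proj = cv2.perspectiveTransform(a, H)     # where a planar scene would land
    return float(np.linalg.norm(proj - b, axis=2).mean())

# Small residual -> planar surface (likely a photo attack); large -> 3D face.
```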

Posted Content•
TL;DR: This paper proposes a new method named soft Gaussian mapping (SGM) that models the discrepancies between color names and pixels using a Gaussian and utilizes the inverse of the covariance matrix to bridge the gap between them.
Abstract: Color names based image representation is successfully used in person re-identification, due to the advantages of being compact, intuitively understandable, and robust to photometric variance. However, the underlying distribution of color names' RGB values differs from that of image pixels' RGB values, which may lead to inaccuracy when directly comparing them in Euclidean space. In this paper, we propose a new method named soft Gaussian mapping (SGM) to address this problem. We model the discrepancies between color names and pixels using a Gaussian and utilize the inverse of the covariance matrix to bridge the gap between them. Based on SGM, an image can be converted into several soft Gaussian maps. In each soft Gaussian map, we further seek to establish stable and robust descriptors within a local region through a max pooling operation. A robust image representation based on color names is then obtained by concatenating the statistical descriptors in each stripe. When labeled data are available, a discriminative subspace projection matrix is learned to build efficient representations of an image via cross-view coupling learning. Experiments on the public datasets VIPeR, PRID450S and CUHK03 demonstrate the effectiveness of our method.
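
The soft Gaussian mapping step can be sketched as a Mahalanobis-weighted soft assignment of pixels to color names; in practice the covariance would be estimated from data, so the argument below is a placeholder for that estimate:

```python
import numpy as np

# Hedged sketch of soft Gaussian mapping: each pixel's RGB value is softly
# assigned to every color name through a Gaussian whose inverse covariance
# bridges the gap between the two distributions.
def soft_gaussian_map(pixels, color_names, inv_cov):
    """pixels: (N, 3) RGB; color_names: (K, 3) prototypes; inv_cov: (3, 3)."""
    diff = pixels[:, None, :] - color_names[None, :, :]       # (N, K, 3)
    maha = np.einsum('nkd,de,nke->nk', diff, inv_cov, diff)   # Mahalanobis^2
    weights = np.exp(-0.5 * maha)
    return weights / weights.sum(axis=1, keepdims=True)       # (N, K) soft maps
```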

Posted Content•
TL;DR: Identification loss combined with center loss trains a deep model for person re-identification without requiring image pairs or triplets, while the inter-class distinction and intra-class variance are well handled.
Abstract: The person re-identification task has been greatly boosted by deep convolutional neural networks (CNNs) in recent years. The core is to enlarge the inter-class distinction as well as reduce the intra-class variance. However, to achieve this, existing deep models prefer to adopt image pairs or triplets to form a verification loss, which is inefficient and unstable since the number of training pairs or triplets grows rapidly as the training data grows. Moreover, their performance is limited since they ignore the fact that different embedding dimensions may carry different importance. In this paper, we propose to employ identification loss with center loss to train a deep model for person re-identification. The training process is efficient since it does not require image pairs or triplets for training, while the inter-class distinction and intra-class variance are well handled. To boost the performance, a new feature reweighting (FRW) layer is designed to explicitly emphasize the importance of each embedding dimension, thus leading to an improved embedding. Experiments on several benchmark datasets have shown the superiority of our method over the state-of-the-art alternatives in both accuracy and speed.