
Showing papers by "Stan Z. Li published in 2019"


Posted Content
TL;DR: An Adaptive Training Sample Selection (ATSS) method is proposed to automatically select positive and negative samples according to the statistical characteristics of objects; it significantly improves the performance of anchor-based and anchor-free detectors and bridges the gap between them.
Abstract: Object detection has been dominated by anchor-based detectors for several years. Recently, anchor-free detectors have become popular due to the proposal of FPN and Focal Loss. In this paper, we first point out that the essential difference between anchor-based and anchor-free detection is actually how to define positive and negative training samples, which leads to the performance gap between them. If they adopt the same definition of positive and negative samples during training, there is no obvious difference in the final performance, regardless of whether regression starts from a box or a point. This shows that how to select positive and negative training samples is important for current object detectors. Then, we propose an Adaptive Training Sample Selection (ATSS) to automatically select positive and negative samples according to the statistical characteristics of objects. It significantly improves the performance of anchor-based and anchor-free detectors and bridges the gap between them. Finally, we discuss the necessity of tiling multiple anchors per location on the image to detect objects. Extensive experiments conducted on MS COCO support our aforementioned analysis and conclusions. With the newly introduced ATSS, we improve state-of-the-art detectors by a large margin to $50.7\%$ AP without introducing any overhead. The code is available at this https URL

564 citations
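As an aid to readers, below is a minimal NumPy sketch of the adaptive selection rule the abstract describes: take the k anchors closest to each ground-truth center on every pyramid level, then keep those whose IoU exceeds the mean plus standard deviation of the candidates' IoUs. The function names and the k=9 default are illustrative assumptions, not the released code.

```python
import numpy as np

def iou(anchors, gt):
    """IoU between anchors (N,4) and one ground-truth box (4,), boxes as x1,y1,x2,y2."""
    x1 = np.maximum(anchors[:, 0], gt[0]); y1 = np.maximum(anchors[:, 1], gt[1])
    x2 = np.minimum(anchors[:, 2], gt[2]); y2 = np.minimum(anchors[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_g - inter)

def atss_select(anchors_per_level, gt, k=9):
    """Return indices of positive anchors for one GT box (illustrative sketch)."""
    gt_center = np.array([(gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2])
    candidates, offset = [], 0
    for anchors in anchors_per_level:
        centers = np.stack([(anchors[:, 0] + anchors[:, 2]) / 2,
                            (anchors[:, 1] + anchors[:, 3]) / 2], axis=1)
        dist = np.linalg.norm(centers - gt_center, axis=1)
        candidates.append(np.argsort(dist)[:k] + offset)  # k closest anchors per level
        offset += len(anchors)
    candidates = np.concatenate(candidates)
    all_anchors = np.concatenate(anchors_per_level, axis=0)
    ious = iou(all_anchors[candidates], gt)
    thresh = ious.mean() + ious.std()        # adaptive, per-object threshold
    return candidates[ious >= thresh]        # (center-inside-GT check omitted)
```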


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a 3D Dense Face Alignment (3DDFA) framework, in which a dense 3D Morphable Model (3DMM) is fitted to the image via Cascaded Convolutional Neural Networks.
Abstract: Face alignment, which fits a face model to an image and extracts the semantic meanings of facial pixels, has been an important topic in the computer vision community. However, most algorithms are designed for faces in small to medium poses (yaw angle smaller than 45 degrees), and lack the ability to align faces in large poses up to 90 degrees. The challenges are three-fold. First, the commonly used landmark face model assumes that all the landmarks are visible and is therefore not suitable for large poses. Second, the face appearance varies more drastically across large poses, from the frontal view to the profile view. Third, labelling landmarks in large poses is extremely challenging since the invisible landmarks have to be guessed. In this paper, we propose to tackle these three challenges in a new alignment framework termed 3D Dense Face Alignment (3DDFA), in which a dense 3D Morphable Model (3DMM) is fitted to the image via Cascaded Convolutional Neural Networks. We also utilize 3D information to synthesize face images in profile views to provide abundant samples for training. Experiments on the challenging AFLW database show that the proposed approach achieves significant improvements over the state-of-the-art methods.

358 citations
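For context, the 3DMM that 3DDFA fits is conventionally written as a mean shape plus identity and expression bases, projected with a scaled orthographic camera; the sketch below shows that standard formulation and a generic cascaded-regression loop. All names, shapes and the render hook are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def project_3dmm(alpha_id, alpha_exp, f, R, t2d, S_mean, A_id, A_exp):
    """Weak-perspective projection of a 3DMM instance (standard formulation,
    not the authors' exact code). Shapes: S_mean (3N,), A_id (3N,Kid), A_exp (3N,Kexp)."""
    S = S_mean + A_id @ alpha_id + A_exp @ alpha_exp   # dense 3D shape, (3N,)
    S = S.reshape(-1, 3).T                             # (3, N) vertices
    return f * (R[:2] @ S) + t2d[:, None]              # 2D projection, (2, N)

def cascaded_fit(image, params, regressors, render):
    """Cascaded regression loop: each CNN stage predicts a parameter update
    from the image and the current fitting (render is a hypothetical hook)."""
    for cnn in regressors:
        feat = render(image, params)     # feature map encoding the current fit
        params = params + cnn(feat)      # CNN predicts the parameter residual
    return params
```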


Proceedings ArticleDOI
15 Jun 2019
TL;DR: This paper proposes the Adaptive Margin Softmax to adjust the margins for different classes adaptively, and makes the sampling process adaptive in two respects: first, Hard Prototype Mining adaptively selects a small number of hard classes to participate in classification; second, Adaptive Data Sampling finds valuable samples for training adaptively.
Abstract: Training large-scale unbalanced data is the central topic in face recognition. In the past two years, face recognition has achieved remarkable improvements due to the introduction of margin based Softmax loss. However, these methods carry an implicit assumption that all classes possess sufficient samples to describe their distributions, so that a manually set margin is enough to squeeze each class's intra-class variation equally. Real face datasets are highly unbalanced, which means the classes have tremendously different numbers of samples. In this paper, we argue that the margin should be adapted to different classes. We propose the Adaptive Margin Softmax to adjust the margins for different classes adaptively. In addition to the unbalance challenge, face data always consists of large-scale classes and samples. Smartly selecting valuable classes and samples to participate in the training makes the training more effective and efficient. To this end, we also make the sampling process adaptive in two respects: first, we propose Hard Prototype Mining to adaptively select a small number of hard classes to participate in classification; second, for data sampling, we introduce Adaptive Data Sampling to find valuable samples for training adaptively. We combine these three parts together as AdaptiveFace. Extensive analysis and experiments on LFW, LFW BLUFR and MegaFace show that our method performs better than state-of-the-art methods using the same network architecture and training dataset. Code is available at https://github.com/haoliu1994/AdaptiveFace.

150 citations
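The following PyTorch sketch illustrates one plausible reading of an adaptive-margin softmax: a cosine-margin loss in which the margin is a learnable per-class parameter, with a simple regularizer that discourages margins from collapsing to zero. The scale, initialization and regularizer weight are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMarginSoftmax(nn.Module):
    """Additive-margin softmax with one learnable margin per class (sketch)."""
    def __init__(self, feat_dim, num_classes, scale=30.0, lam=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.margin = nn.Parameter(torch.full((num_classes,), 0.35))  # per-class margin
        self.scale, self.lam = scale, lam

    def forward(self, feats, labels):
        cos = F.linear(F.normalize(feats), F.normalize(self.weight))  # (B, C)
        m = self.margin[labels]                                # margin of each sample's class
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.scale * (cos - onehot * m.unsqueeze(1))  # subtract margin at target class
        # regularizer nudging margins to stay large (illustrative choice)
        return F.cross_entropy(logits, labels) - self.lam * self.margin.mean()
```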


Proceedings ArticleDOI
15 Jun 2019
TL;DR: CASIA-SURF as mentioned in this paper is a large-scale multi-modal dataset for face anti-spoofing, which consists of 1,000 subjects with 21,000 videos and each sample has three modalities (i.e., RGB, depth and IR).
Abstract: Face anti-spoofing is essential to prevent face recognition systems from a security breach. Much of the progress in recent years has been driven by the availability of face anti-spoofing benchmark datasets. However, existing face anti-spoofing benchmarks have a limited number of subjects (≤170) and modalities (≤2), which hinders the further development of the academic community. To facilitate face anti-spoofing research, we introduce a large-scale multi-modal dataset, namely CASIA-SURF, which is the largest publicly available dataset for face anti-spoofing in terms of both subjects and visual modalities. Specifically, it consists of 1,000 subjects with 21,000 videos, and each sample has 3 modalities (i.e., RGB, Depth and IR). We also provide a measurement set, an evaluation protocol and training/validation/testing subsets, developing a new benchmark for face anti-spoofing. Moreover, we present a new multi-modal fusion method as a baseline, which performs feature re-weighting to select the more informative channel features while suppressing the less useful ones for each modality. Extensive experiments have been conducted on the proposed dataset to verify its significance and generalization capability. The dataset is available at https://sites.google.com/qq.com/chalearnfacespoofingattackdete/.

138 citations
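A minimal sketch of the kind of channel re-weighting fusion the baseline describes, in squeeze-and-excitation style: concatenate the per-modality feature maps, squeeze them to a channel descriptor, and gate each channel with a learned weight. The layer sizes and reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalFusionSE(nn.Module):
    """Fuse RGB / Depth / IR feature maps by re-weighting their channels
    (squeeze-and-excitation style sketch, not the authors' exact baseline)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        total = channels * 3
        self.fc = nn.Sequential(
            nn.Linear(total, total // reduction), nn.ReLU(inplace=True),
            nn.Linear(total // reduction, total), nn.Sigmoid())

    def forward(self, rgb, depth, ir):            # each (B, C, H, W)
        x = torch.cat([rgb, depth, ir], dim=1)    # (B, 3C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))           # squeeze to (B, 3C), then excite
        return x * w.unsqueeze(-1).unsqueeze(-1)  # informative channels boosted
```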


Journal ArticleDOI
TL;DR: A novel multi-view subspace clustering model that attempts to form an informative intactness-aware similarity based on the intact space learning technique is proposed and its superior performance over other state-of-the-art alternatives is revealed.

115 citations


Journal ArticleDOI
17 Jul 2019
TL;DR: A novel single-shot face detector, named Selective Refinement Network (SRN), which introduces novel two-step classification and regression operations selectively into an anchor-based face detector to reduce false positives and improve location accuracy simultaneously.
Abstract: High performance face detection remains a very challenging problem, especially when there exist many tiny faces. This paper presents a novel single-shot face detector, named Selective Refinement Network (SRN), which introduces novel two-step classification and regression operations selectively into an anchor-based face detector to reduce false positives and improve location accuracy simultaneously. In particular, the SRN consists of two modules: the Selective Two-step Classification (STC) module and the Selective Two-step Regression (STR) module. The STC aims to filter out most simple negative anchors from low level detection layers to reduce the search space for the subsequent classifier, while the STR is designed to coarsely adjust the locations and sizes of anchors from high level detection layers to provide better initialization for the subsequent regressor. Moreover, we design a Receptive Field Enhancement (RFE) block to provide more diverse receptive fields, which helps to better capture faces in some extreme poses. As a consequence, the proposed SRN detector achieves state-of-the-art performance on all the widely used face detection benchmarks, including the AFW, PASCAL face, FDDB, and WIDER FACE datasets. Code will be released to facilitate further studies on the face detection problem.

112 citations
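The selective two-step classification idea can be summarized in a few lines: anchors that the first-step classifier already scores as confident background are removed before the second step. The sketch below is illustrative; the 0.99 threshold is an assumption, not necessarily the paper's value.

```python
import numpy as np

def selective_two_step_filter(first_neg_prob, threshold=0.99):
    """Keep only anchors the first-step classifier is NOT confident are background.
    first_neg_prob: (N,) background probabilities from the first step (sketch)."""
    keep = first_neg_prob < threshold          # easy negatives are discarded
    return np.nonzero(keep)[0]                 # indices passed to the second step
```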


Proceedings ArticleDOI
01 Oct 2019
TL;DR: Extensive experiments and ablation studies on seven re-id datasets demonstrate the superiority of the proposed UGA over most state-of-the-art unsupervised and domain adaptation re-ID methods.
Abstract: In this paper, we propose an unsupervised graph association (UGA) framework to learn the underlying view-invariant representations from video pedestrian tracklets. The core points of UGA are mining the underlying cross-view associations and reducing the damage of noisy associations. To this end, UGA adopts a two-stage training strategy: (1) an intra-camera learning stage and (2) an inter-camera learning stage. The former learns the intra-camera representation for each camera, while the latter builds a cross-view graph (CVG) to associate different cameras. By doing this, we can learn view-invariant representations for all persons. Extensive experiments and ablation studies on seven re-id datasets demonstrate the superiority of the proposed UGA over most state-of-the-art unsupervised and domain adaptation re-id methods.

110 citations
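One plausible way to build the cross-view graph the abstract mentions is to connect tracklets from different cameras that are mutual nearest neighbors in feature space; the NumPy sketch below shows that construction. The mutual-nearest-neighbor criterion is an assumption for illustration, not necessarily the paper's exact association rule.

```python
import numpy as np

def cross_view_graph(feats, cam_ids):
    """Associate tracklets across cameras: connect pairs that are mutual
    nearest neighbors in feature space (one plausible CVG construction)."""
    sim = feats @ feats.T                           # cosine-like similarities
    same_cam = cam_ids[:, None] == cam_ids[None, :]
    sim[same_cam] = -np.inf                         # only cross-camera candidates
    nn = sim.argmax(axis=1)                         # each node's best cross-view match
    edges = [(i, j) for i, j in enumerate(nn) if nn[j] == i and i < j]
    return edges                                    # mutual matches become graph edges
```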


Journal ArticleDOI
TL;DR: Extensive comparative evaluations conducted on multiple large-scale benchmarks, including PA-100K, RAP, PETA, Market-1501, and Duke attribute datasets, further demonstrate the effectiveness of the proposed JLPLS-PAA framework for pedestrian attribute analysis.
Abstract: Recognizing pedestrian attributes in surveillance scenes is an inherently challenging task, especially for pedestrian images with large pose variations, complex backgrounds, and various camera viewing angles. To select important and discriminative regions or pixels against these variations, three attention mechanisms are proposed: parsing attention, label attention, and spatial attention. These attention mechanisms access effective information by considering the problem from different perspectives. To be specific, the parsing attention extracts discriminative features by learning not only where to turn attention to but also how to aggregate features from different semantic regions of human bodies, e.g., head and upper body. The label attention collects the discriminative features for each attribute in a targeted way. Different from the parsing and label attention mechanisms, the spatial attention considers the problem from a global perspective, aiming at selecting several important and discriminative image regions or pixels for all attributes. Then, we propose a joint learning framework, formulated in a multi-task-like way, in which these three attention mechanisms are learned concurrently to extract complementary and correlated features. This joint learning framework is named Joint Learning of Parsing attention, Label attention, and Spatial attention for Pedestrian Attributes Analysis (JLPLS-PAA, for short). Extensive comparative evaluations conducted on multiple large-scale benchmarks, including the PA-100K, RAP, PETA, Market-1501, and Duke attribute datasets, further demonstrate the effectiveness of the proposed JLPLS-PAA framework for pedestrian attribute analysis.

67 citations
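As a concrete reference point, a generic spatial attention block of the kind described, one saliency mask over image locations shared by all attributes, can be written in a few lines of PyTorch; this is a textbook module, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Predict a (B,1,H,W) saliency mask and re-weight the feature map (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, feat):                 # feat: (B, C, H, W)
        attn = self.mask(feat)               # which pixels matter for the attributes
        return feat * attn                   # discriminative regions are emphasized
```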


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a deep learning based large-scale bisample learning (LBL) method for ID versus Spot (IvS) face recognition, where a classification-verification-classification training strategy is proposed to progressively enhance the IvS performance.
Abstract: In real-world face recognition applications, there is a tremendous amount of data with two images for each person. One is an ID photo for face enrollment, and the other is a probe photo captured on spot. Most existing methods are designed for training data with limited breadth (a relatively small number of classes) and sufficient depth (many samples for each class). They would meet great challenges on ID versus Spot (IvS) data, including under-represented intra-class variations and an excessive demand on computing devices. In this paper, we propose a deep learning based large-scale bisample learning (LBL) method for IvS face recognition. To tackle the bisample problem with only two samples for each class, a classification–verification–classification training strategy is proposed to progressively enhance the IvS performance. Besides, a dominant prototype softmax is incorporated to make the deep learning scalable on large-scale classes. We conduct LBL on an IvS face dataset with more than two million identities. Experimental results show that the proposed method achieves superior performance over previous ones, validating the effectiveness of LBL on IvS face recognition.

48 citations
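Under one reading of the dominant prototype softmax, each iteration classifies the batch against only its own classes plus the most similar ("dominant") prototypes rather than all two-million-plus classes. The sketch below illustrates that idea; the selection size and scale factor are assumptions.

```python
import torch
import torch.nn.functional as F

def dominant_prototype_loss(feats, prototypes, labels, k=256):
    """Softmax over a reduced prototype set: the batch's true classes plus the
    k most similar prototypes (illustrative sketch of the scalability idea)."""
    sims = F.normalize(feats) @ F.normalize(prototypes).t()      # (B, C) similarities
    topk = sims.topk(k, dim=1).indices.flatten()                 # hardest prototypes
    active = torch.unique(torch.cat([topk, labels]))             # keep true classes too
    logits = sims[:, active]                                     # reduced softmax
    remapped = (labels.unsqueeze(1) == active.unsqueeze(0)).float().argmax(1)
    return F.cross_entropy(30.0 * logits, remapped)
```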


Journal ArticleDOI
TL;DR: A novel face detector, named FaceBoxes, with superior performance on both speed and accuracy is proposed, and a new anchor densification strategy to make different types of anchors have the same density on the image, which significantly improves the recall rate of small faces.

46 citations


Journal ArticleDOI
TL;DR: This work proposes a scale-aware detection network using a wide scale range of layers associated with appropriate scales of anchors to handle faces with various scales, and describes a new equal density principle to ensure that anchors with different scales are evenly distributed on the image.
Abstract: In this work, we describe a single-shot scale-aware convolutional neural network based face detector (SFDet). In comparison with the state-of-the-art anchor-based face detection methods, the main advantages of our method are summarized in four aspects. (1) We propose a scale-aware detection network using a wide scale range of layers associated with appropriate scales of anchors to handle faces with various scales, and describe a new equal density principle to ensure that anchors with different scales are evenly distributed on the image. (2) To improve the recall rates of faces with certain scales (e.g., faces whose scales are quite different from those of the designed anchors), we design a new anchor matching strategy with scale compensation. (3) We introduce an IoU-aware weighting scheme for each training sample in the classification loss calculation to encode samples accurately in the training process. (4) Considering the class imbalance issue, a max-out background strategy is used to reduce false positives. Several experiments are conducted on public challenging face detection datasets, i.e., WIDER FACE, AFW, PASCAL Face, FDDB, and MAFA, to demonstrate that the proposed method achieves state-of-the-art results and runs at 82.1 FPS for VGA-resolution images.
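The max-out background strategy in (4) can be illustrated concretely: the head predicts several background scores per location and keeps only their maximum, which makes it harder for any single noisy background score to produce a false positive. In the PyTorch sketch below, the number of background copies is an assumption.

```python
import torch
import torch.nn as nn

class MaxOutBackgroundHead(nn.Module):
    """Binary face/background classification head with max-out on the
    background class (sketch of the strategy; nm copies is illustrative)."""
    def __init__(self, channels, nm=3):
        super().__init__()
        self.nm = nm
        self.conv = nn.Conv2d(channels, nm + 1, kernel_size=3, padding=1)

    def forward(self, feat):                       # feat: (B, C, H, W)
        out = self.conv(feat)                      # nm background scores + 1 face score
        bg = out[:, : self.nm].max(dim=1, keepdim=True).values
        face = out[:, self.nm :]
        return torch.cat([bg, face], dim=1)        # (B, 2, H, W) logits
```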

Proceedings ArticleDOI
16 Jun 2019
TL;DR: An overview of the Chalearn LAP multi-modal face anti-spoofing attack detection challenge, including its design, evaluation protocol and a summary of results, is presented.
Abstract: Anti-spoofing attack detection is critical to guarantee the security of face-based authentication and facial analysis systems. Recently, a multi-modal face anti-spoofing dataset, CASIA-SURF, has been released with the goal of boosting research in this important topic. CASIA-SURF is the largest public dataset for facial anti-spoofing attack detection in terms of both diversity and modalities: it comprises 1,000 subjects and 21,000 video samples. We organized a challenge around this novel resource to boost research on the subject. The Chalearn LAP multi-modal face anti-spoofing attack detection challenge attracted more than 300 teams for the development phase, with a total of 13 teams qualifying for the final round. This paper presents an overview of the challenge, including its design, evaluation protocol and a summary of results. We analyze the top ranked solutions and draw conclusions derived from the competition. In addition, we outline future work directions.

Posted Content
TL;DR: A single-shot refinement face detector, named RefineFace, is presented to achieve high performance; it achieves state-of-the-art results and runs at 37.3 FPS with ResNet-18 for VGA-resolution images.
Abstract: Face detection has achieved significant progress in recent years. However, high performance face detection still remains a very challenging problem, especially when there exist many tiny faces. In this paper, we present a single-shot refinement face detector, named RefineFace, to achieve high performance. Specifically, it consists of five modules: Selective Two-step Regression (STR), Selective Two-step Classification (STC), Scale-aware Margin Loss (SML), Feature Supervision Module (FSM) and Receptive Field Enhancement (RFE). To enhance the regression ability for high location accuracy, STR coarsely adjusts locations and sizes of anchors from high level detection layers to provide better initialization for the subsequent regressor. To improve the classification ability for high recall efficiency, STC first filters out most simple negatives from low level detection layers to reduce the search space for the subsequent classifier; then SML is applied to better distinguish faces from background at various scales, and FSM is introduced to let the backbone learn more discriminative features for classification. Besides, RFE is presented to provide more diverse receptive fields to better capture faces in some extreme poses. Extensive experiments conducted on WIDER FACE, AFW, PASCAL Face, FDDB and MAFA demonstrate that our method achieves state-of-the-art results and runs at $37.3$ FPS with ResNet-18 for VGA-resolution images.

Proceedings ArticleDOI
01 Aug 2019
TL;DR: A novel unified network named Deep Hybrid-Aligned Architecture for facial age estimation that contains global, local and global-local branches, which are jointly optimized and thus can capture multiple types of features with complementary information.
Abstract: In this paper, we propose a novel unified network named Deep Hybrid-Aligned Architecture for facial age estimation. It contains global, local and global-local branches, which are jointly optimized and thus can capture multiple types of features with complementary information. In each branch, we employ a separate loss for each sub-network to extract independent features and use a recurrent fusion to explore correlations among those region features. Considering that pose variations may lead to misalignment in different regions, we design an Aligned Region Pooling operation to generate aligned region features. Moreover, a new large age dataset named Web-FaceAge, containing more than 120K samples, is collected under diverse scenes and spanning a large age range. Experiments on five age benchmark datasets, including Web-FaceAge, Morph, FG-NET, CACD and Chalearn LAP 2015, show that the proposed method outperforms the state-of-the-art approaches significantly.

Posted Content
TL;DR: An improved SRN face detector is presented by combining several useful techniques, obtaining the best performance on the widely used WIDER FACE face detection benchmark.
Abstract: As a long-standing problem in computer vision, face detection has attracted much attention in recent decades for its practical applications. With the availability of the WIDER FACE face detection benchmark, much progress has been made by various algorithms in recent years. Among them, the Selective Refinement Network (SRN) face detector introduces two-step classification and regression operations selectively into an anchor-based face detector to reduce false positives and improve location accuracy simultaneously. Moreover, it designs a receptive field enhancement block to provide more diverse receptive fields. In this report, to further improve the performance of SRN, we exploit some existing techniques via extensive experiments, including a new data augmentation strategy, an improved backbone network, MS COCO pretraining, a decoupled classification module, a segmentation branch and a Squeeze-and-Excitation block. Some of these techniques bring performance improvements, while a few of them do not adapt well to our baseline. As a consequence, we present an improved SRN face detector that combines the useful techniques together and obtains the best performance on the widely used WIDER FACE face detection benchmark.

Posted Content
TL;DR: This paper presents a review of the 2018 WIDER Challenge on Face and Pedestrian and summarizes the winning solutions for all three tracks, and presents discussions on open problems and potential research directions in these topics.
Abstract: This paper presents a review of the 2018 WIDER Challenge on Face and Pedestrian. The challenge focuses on the problem of precise localization of human faces and bodies, and accurate association of identities. It comprises three tracks: (i) WIDER Face, which aims at soliciting new approaches to advance the state-of-the-art in face detection; (ii) WIDER Pedestrian, which aims to find effective and efficient approaches to address the problem of pedestrian detection in unconstrained environments; and (iii) WIDER Person Search, which presents an exciting challenge of searching persons across 192 movies. In total, 73 teams made valid submissions to the challenge tracks. We summarize the winning solutions for all three tracks and present discussions on open problems and potential research directions in these topics.

Proceedings ArticleDOI
08 Jul 2019
TL;DR: A Clustering and Dynamic Sampling (CDS) method is proposed, which transfers the useful knowledge of an existing labeled source domain to the unlabeled target one and improves the discriminability of the CNN model on the source domain.
Abstract: Person Re-Identification (Re-ID) has witnessed great improvements due to the advances of deep convolutional neural networks (CNN). Despite this, existing methods mainly suffer from poor generalization to unseen scenes because of the different characteristics between different domains. To address this issue, a Clustering and Dynamic Sampling (CDS) method is proposed in this paper, which tries to transfer the useful knowledge of an existing labeled source domain to the unlabeled target one. Specifically, to improve the discriminability of the CNN model on the source domain, we use commonly shared pedestrian attributes (e.g., gender, hat and clothing color) to enrich the information and resort to a margin-based softmax (e.g., A-Softmax) loss to train the model. For the unlabeled target domain, we iteratively cluster the samples into several centers and dynamically select informative ones from each center to fine-tune the source-domain model. Extensive experiments on the DukeMTMC-reID and Market-1501 datasets show that the proposed method greatly improves the state of the art in unsupervised domain adaptation.
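A schematic of one clustering-and-dynamic-sampling round might look as follows, with scikit-learn's k-means standing in for the clustering step and "closest to the cluster center" as the informativeness heuristic; both choices are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def cds_round(extract_features, finetune, target_images, n_clusters=500, per_cluster=4):
    """One clustering-and-dynamic-sampling round on the unlabeled target domain
    (illustrative sketch; the selection heuristic is an assumption)."""
    feats = extract_features(target_images)                  # (N, D) CNN embeddings
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(feats)
    selected, pseudo_labels = [], []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(feats[idx] - km.cluster_centers_[c], axis=1)
        keep = idx[np.argsort(d)[:per_cluster]]              # most reliable members
        selected.extend(keep)
        pseudo_labels.extend([c] * len(keep))
    finetune(target_images[selected], np.array(pseudo_labels))  # pseudo-labeled update
```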

Proceedings ArticleDOI
01 Jun 2019
TL;DR: Zhang et al. as discussed by the authors presented a method to synthesize virtual spoof data in 3D space to alleviate the problem that acquiring spoof data is very expensive, since live faces must be re-printed and re-captured in many views.
Abstract: Face anti-spoofing is crucial for the security of face recognition systems. Learning based methods, especially deep learning based methods, need large-scale training samples to reduce overfitting. However, acquiring spoof data is very expensive since the live faces must be re-printed and re-captured in many views. In this paper, we present a method to synthesize virtual spoof data in 3D space to alleviate this problem. Specifically, we consider a printed photo as a flat surface and mesh it into a 3D object, which is then randomly bent and rotated in 3D space. Afterward, the transformed 3D photo is rendered through perspective projection as a virtual sample. The synthetic virtual samples can significantly boost the anti-spoofing performance when combined with a proposed data balancing strategy. Our promising results open up new possibilities for advancing face anti-spoofing using cheap and large-scale synthetic data.
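The geometry described above is simple to sketch: mesh the photo as a grid, bend it with a sinusoid along one axis, rotate it, and apply a pinhole projection. In the NumPy sketch below, the bend amplitude, rotation angle and focal length are illustrative values; texturing and rendering are omitted.

```python
import numpy as np

def synthesize_spoof_view(h, w, bend=0.08, angle=0.3, focal=800.0, depth=2.0):
    """Vertex positions of a bent, rotated photo after perspective projection
    (geometry sketch only; rendering/texturing omitted)."""
    ys, xs = np.meshgrid(np.linspace(-0.5, 0.5, h), np.linspace(-0.5, 0.5, w),
                         indexing="ij")
    zs = bend * np.sin(np.pi * xs)                 # bend the flat photo in 3D
    pts = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
    c, s = np.cos(angle), np.sin(angle)            # rotate around the vertical axis
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    pts = pts @ R.T
    pts[:, 2] += depth                             # push in front of the camera
    uv = focal * pts[:, :2] / pts[:, 2:3]          # perspective projection
    return uv.reshape(h, w, 2)
```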

Posted Content
TL;DR: This work designs a head-body relationship discriminating module to perform relational learning between heads and human bodies, and leverages this learned relationship to regain the suppressed human detections and reduce head false positives.
Abstract: Head and human detection have been rapidly improved with the development of deep convolutional neural networks. However, these two tasks are often studied separately without considering their inherent correlation, with the result that 1) head detection often suffers from more false positives, and 2) the performance of human detectors frequently drops dramatically in crowded scenes. To handle these two issues, we present a novel joint head and human detection network, namely JointDet, which effectively detects head and human body simultaneously. Moreover, we design a head-body relationship discriminating module to perform relational learning between heads and human bodies, and leverage this learned relationship to regain the suppressed human detections and reduce head false positives. To verify the effectiveness of the proposed method, we annotate head bounding boxes of the CityPersons and Caltech-USA datasets, and conduct extensive experiments on the CrowdHuman, CityPersons and Caltech-USA datasets. As a consequence, the proposed JointDet detector achieves state-of-the-art performance on these three benchmarks. To facilitate further studies on the head and human detection problem, all new annotations, source codes and trained models will be made public.

Proceedings ArticleDOI
04 Jun 2019
TL;DR: A novel loss function based on knowledge distillation is proposed to boost the performance of lightweight face detectors, yielding a CPU real-time face detector that runs at 20 FPS with state-of-the-art performance among CPU-based detectors.
Abstract: Although face detection has progressed significantly in recent years, it is still a challenging task to get a fast face detector with competitive performance, especially on CPU-based devices. In this paper, we propose a novel loss function based on knowledge distillation to boost the performance of lightweight face detectors. More specifically, a student detector learns additional soft labels from a teacher detector by mimicking its classification map. To make the knowledge transfer more efficient, a threshold function is designed to assign threshold values adaptively for different objectness scores such that only the informative samples are used for mimicking. Experiments on FDDB and WIDER FACE show that the proposed method improves the performance of face detectors consistently. With the help of the proposed training method, we get a CPU real-time face detector that runs at 20 FPS while being state-of-the-art in performance among CPU-based detectors.
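A minimal sketch of classification-map mimicking with a score-dependent gate is shown below: the student matches the teacher's per-anchor scores, but only on anchors whose teacher objectness falls inside an "informative" band. The band endpoints are assumptions; the paper's threshold function is adaptive rather than fixed.

```python
import torch
import torch.nn.functional as F

def mimic_loss(student_map, teacher_map, low=0.2, high=0.9):
    """L2 mimic loss on classification maps, keeping only anchors whose teacher
    objectness falls in an 'informative' band (the band is an assumption)."""
    with torch.no_grad():
        score = teacher_map.sigmoid()
        gate = ((score > low) & (score < high)).float()   # drop trivial/easy anchors
    diff = (student_map.sigmoid() - score) ** 2
    return (gate * diff).sum() / gate.sum().clamp(min=1.0)
```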

Proceedings ArticleDOI
01 Sep 2019
TL;DR: A multi-modality 3D mask face anti-spoofing database named 3DMA, which contains 920 videos of 67 genuine subjects wearing 48 kinds of 3D masks, captured in visual (VIS) and near-infrared (NIR) modalities, is released.
Abstract: Benefiting from publicly available databases, face anti-spoofing has recently gained extensive attention in the academic community. However, most of the existing databases focus on 2D attacks, including photo and video attacks. The only two public 3D mask face anti-spoofing databases are very small. In this paper, we release a multi-modality 3D mask face anti-spoofing database named 3DMA, which contains 920 videos of 67 genuine subjects wearing 48 kinds of 3D masks, captured in visual (VIS) and near-infrared (NIR) modalities. To simulate real world scenarios, two illumination and four capturing distance settings are deployed during the collection process. To the best of our knowledge, the proposed database is currently the most extensive public database for 3D mask face anti-spoofing. Furthermore, we build three protocols for performance evaluation under different illumination conditions and distances. Experimental results with Convolutional Neural Network (CNN) and LBP-based methods reveal that our proposed 3DMA is indeed a challenge for face anti-spoofing. This database is available at http://www.cbsr.ia.ac.cn/english/3DMA.html. We hope our public 3DMA database can help to pave the way for further research on 3D mask face anti-spoofing.

Proceedings ArticleDOI
04 Jun 2019
TL;DR: This paper proposes a novel single-shot detector for joint face detection and alignment, namely FLDet, with remarkable performance on both speed and accuracy, and introduces a new data augmentation strategy to make full use of the face alignment dataset.
Abstract: Face detection and alignment are treated as two independent tasks and conducted sequentially in most face applications. However, these two tasks are highly related and can be integrated into a single model. In this paper, we propose a novel single-shot detector for joint face detection and alignment, namely FLDet, with remarkable performance on both speed and accuracy. Specifically, the FLDet consists of three main modules: the Rapidly Digested Backbone (RDB), the Lightweight Feature Pyramid Network (LFPN) and the Multi-task Detection Module (MDM). The RDB quickly shrinks the spatial size of feature maps to guarantee CPU real-time speed. The LFPN integrates different detection layers in a top-down fashion to enrich the features of low-level layers with little extra time overhead. The MDM jointly performs face and landmark detection over different layers to handle faces of various scales. Besides, we introduce a new data augmentation strategy to make full use of the face alignment dataset. As a result, the proposed FLDet can run at 20 FPS on a single CPU core and at 120 FPS using a GPU for VGA-resolution images. Notably, the FLDet can be trained end-to-end and its inference time is invariant to the number of faces. We achieve competitive results on both face detection and face alignment benchmark datasets, including AFW, PASCAL FACE, FDDB and AFLW.

Posted Content
TL;DR: A method to synthesize virtual spoof data in 3D space is presented to alleviate the problem of expensive spoof data acquisition, opening up new possibilities for advancing face anti-spoofing using cheap and large-scale synthetic data.
Abstract: Face anti-spoofing is crucial for the security of face recognition systems. Learning based methods, especially deep learning based methods, need large-scale training samples to reduce overfitting. However, acquiring spoof data is very expensive since the live faces must be re-printed and re-captured in many views. In this paper, we present a method to synthesize virtual spoof data in 3D space to alleviate this problem. Specifically, we consider a printed photo as a flat surface and mesh it into a 3D object, which is then randomly bent and rotated in 3D space. Afterward, the transformed 3D photo is rendered through perspective projection as a virtual sample. The synthetic virtual samples can significantly boost the anti-spoofing performance when combined with a proposed data balancing strategy. Our promising results open up new possibilities for advancing face anti-spoofing using cheap and large-scale synthetic data.

Journal ArticleDOI
TL;DR: This paper establishes a novel joint multi-task model that simultaneously detects multiple faces and their landmarks on a given scene image, and proposes an end-to-end convolutional network that shares and transforms feature representations between the task-specific modules.
Abstract: Recent studies of face alignment have involved developing an isolated algorithm on well-cropped face images. It is difficult to obtain the expected input by using an off-the-shelf face detector in practical applications. In this paper, we attempt to bridge face detection and face alignment by establishing a novel joint multi-task model, which allows us to simultaneously detect multiple faces and their landmarks on a given scene image. In contrast to pipeline-based frameworks that cascade separate models, we propose an end-to-end convolutional network that shares and transforms feature representations between the task-specific modules. To learn a robust landmark estimator for unconstrained face alignment, three types of context enhanced blocks are designed to encode feature maps with multi-level context, multi-scale context, and global context. In the post-processing step, we develop a shape reconstruction algorithm based on a point distribution model to refine the landmark outliers. Extensive experiments demonstrate that our results are robust for the landmark location task and insensitive to the location of estimated face regions. Furthermore, our method significantly outperforms recent state-of-the-art methods on several challenging datasets including 300W, AFLW, and COFW.
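The point-distribution-model refinement mentioned in the post-processing step follows the classic Cootes-style recipe: project the detected landmarks onto a PCA shape basis and clamp each coefficient to a few standard deviations, which pulls outliers back toward a plausible face shape. The sketch below uses the conventional three-sigma clamp; it is a generic PDM, not the paper's exact algorithm.

```python
import numpy as np

def pdm_refine(landmarks, mean_shape, basis, eigvals, clamp=3.0):
    """Refine a (2K,) landmark vector with a PCA point distribution model.
    basis: (2K, M) eigenvectors, eigvals: (M,) eigenvalues (standard PDM sketch)."""
    b = basis.T @ (landmarks - mean_shape)                  # shape coefficients
    limit = clamp * np.sqrt(eigvals)                        # +/- 3 sigma per mode
    b = np.clip(b, -limit, limit)                           # suppress outlier modes
    return mean_shape + basis @ b                           # reconstructed landmarks
```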

Posted Content
TL;DR: CASIA-SURF as discussed by the authors is the largest publicly available dataset for face anti-spoofing in terms of both subjects and modalities, which consists of $1,000$ subjects with $21,000$ videos, and each sample has $3$ modalities (i.e., RGB, Depth and IR).
Abstract: Face anti-spoofing is essential to prevent face recognition systems from a security breach. Much of the progress in recent years has been driven by the availability of face anti-spoofing benchmark datasets. However, existing face anti-spoofing benchmarks have a limited number of subjects ($\leq 170$) and modalities ($\leq 2$), which hinders the further development of the academic community. To facilitate face anti-spoofing research, we introduce a large-scale multi-modal dataset, namely CASIA-SURF, which is the largest publicly available dataset for face anti-spoofing in terms of both subjects and modalities. Specifically, it consists of $1,000$ subjects with $21,000$ videos, and each sample has $3$ modalities (i.e., RGB, Depth and IR). We also provide comprehensive evaluation metrics, diverse evaluation protocols, training/validation/testing subsets and a measurement tool, developing a new benchmark for face anti-spoofing. Moreover, we present a novel multi-modal multi-scale fusion method as a strong baseline, which performs feature re-weighting to select the more informative channel features while suppressing the less useful ones for each modality across different scales. Extensive experiments have been conducted on the proposed dataset to verify its significance and generalization capability. The dataset is available at this https URL

Proceedings ArticleDOI
01 Jun 2019
TL;DR: A cross-view coupling learning method is presented to build a common subspace in which the learned features contain the transition information among different cameras, and a new approach, soft Gaussian mapping (SGM), is proposed to bridge the semantic gap between color names and image pixels using a Gaussian model.
Abstract: In this paper, we propose an efficient image representation strategy for addressing the task of small-scale person re-identification. Taking advantage of its compactness and intuitive interpretability, we adopt the color names descriptor (CND) as our color feature. To overcome the inaccuracy of comparing color names with image pixels in Euclidean space, we propose a new approach, soft Gaussian mapping (SGM), which uses a Gaussian model to bridge their semantic gap. We further present a cross-view coupling learning method to build a common subspace in which the learned features contain the transition information among different cameras. Experiments on challenging small-scale public benchmark datasets demonstrate the effectiveness of our proposed method.
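The Gaussian bridge that SGM describes can be written directly: each pixel is softly assigned to the K color names with weights proportional to a per-name Gaussian likelihood. In the sketch below, the means and variances are inputs; how they are estimated or learned is left out, and the isotropic-Gaussian form is an assumption.

```python
import numpy as np

def soft_gaussian_mapping(pixels, means, sigmas):
    """Soft color-name assignment: w_k(x) is proportional to
    exp(-||x - mu_k||^2 / (2 sigma_k^2)).
    pixels: (N, 3) colors, means: (K, 3), sigmas: (K,). Returns (N, K) weights."""
    d2 = ((pixels[:, None, :] - means[None]) ** 2).sum(-1)   # (N, K) squared distances
    w = np.exp(-d2 / (2.0 * sigmas[None] ** 2))
    return w / w.sum(axis=1, keepdims=True)                  # normalized descriptor
```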

Posted Content
TL;DR: This work proposes a static-dynamic fusion mechanism for multi-modal face anti-spoofing, inspired by motion divergences between real and fake faces, and develops a partially shared fusion method to learn complementary information from multiple modalities.
Abstract: Regardless of the usage of deep learning or handcrafted methods, the dynamic information from videos and the effect of cross-ethnicity are rarely considered in face anti-spoofing. In this work, we propose a static-dynamic fusion mechanism for multi-modal face anti-spoofing. Inspired by motion divergences between real and fake faces, we incorporate the dynamic image calculated by rank pooling, together with static information, into a convolutional neural network (CNN) for each modality (i.e., RGB, Depth and infrared (IR)). Then, we develop a partially shared fusion method to learn complementary information from multiple modalities. Furthermore, in order to study the generalization capability of the proposal in terms of cross-ethnicity attacks and unknown spoofs, we introduce the largest public cross-ethnicity Face Anti-spoofing (CASIA-CeFA) dataset, covering 3 ethnicities, 3 modalities, 1,607 subjects, and 2D plus 3D attack types. Experiments demonstrate that the proposed method achieves state-of-the-art results on CASIA-CeFA, CASIA-SURF, OULU-NPU and SiW.
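For reference, the dynamic image mentioned above is commonly computed by approximate rank pooling, a fixed weighted sum of the clip's frames; the linear coefficients alpha_t = 2t - T - 1 below are the widely used closed-form surrogate from the dynamic-image literature and may differ from the authors' exact variant.

```python
import numpy as np

def dynamic_image(frames):
    """Approximate rank pooling: weighted sum of frames with alpha_t = 2t - T - 1
    (t = 1..T), the common closed-form surrogate for learning-to-rank pooling."""
    T = len(frames)
    alphas = 2.0 * np.arange(1, T + 1) - T - 1          # later frames weigh more
    di = np.tensordot(alphas, np.asarray(frames, dtype=np.float64), axes=1)
    return (di - di.min()) / (di.ptp() + 1e-8)          # normalize for visualization
```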

Posted Content
TL;DR: Zhang et al. as discussed by the authors designed a mask-guided module to leverage the head information to enhance the feature representation learning of the backbone network and developed a strict classification criterion by improving the quality of positive samples during training to eliminate common false positives.
Abstract: Pedestrian detection in crowded scenes is a challenging problem, because occlusion happens frequently among different pedestrians. In this paper, we propose an effective and efficient detection network to hunt pedestrians in crowded scenes. The proposed method, namely PedHunter, introduces strong occlusion handling ability to existing region-based detection networks without bringing extra computation to the inference stage. Specifically, we design a mask-guided module to leverage head information to enhance the feature representation learning of the backbone network. Moreover, we develop a strict classification criterion by improving the quality of positive samples during training to eliminate common false positives of pedestrian detection in crowded scenes. Besides, we present an occlusion-simulated data augmentation to enrich the pattern and quantity of occlusion samples and improve the occlusion robustness. As a consequence, we achieve state-of-the-art results on three pedestrian detection datasets: CityPersons, Caltech-USA and CrowdHuman. To facilitate further studies on occluded pedestrian detection in surveillance scenes, we release a new pedestrian dataset, called SUR-PED, with a total of over 162k high-quality manually labeled instances in 10k images. The proposed dataset, source codes and trained models will be released.
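An occlusion-simulated augmentation can be as simple as pasting a flat patch over a random part of a ground-truth pedestrian box during training, as sketched below; the patch source, shape and size range are assumptions, since the paper's exact scheme is not specified here.

```python
import numpy as np

def simulate_occlusion(image, box, rng, min_frac=0.2, max_frac=0.5):
    """Occlude a random part of a pedestrian box with a mean-color patch
    (illustrative; the paper's exact occlusion source/shape may differ).
    rng: a numpy.random.Generator instance."""
    x1, y1, x2, y2 = [int(v) for v in box]
    w, h = x2 - x1, y2 - y1
    ow = int(w * rng.uniform(min_frac, max_frac))
    oh = int(h * rng.uniform(min_frac, max_frac))
    ox = rng.integers(x1, max(x1 + 1, x2 - ow))
    oy = rng.integers(y1, max(y1 + 1, y2 - oh))
    image[oy:oy + oh, ox:ox + ow] = image.mean(axis=(0, 1))  # flat occluder patch
    return image
```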

Book ChapterDOI
12 Oct 2019
TL;DR: A novel attention-based method with a 3D convolutional neural network (CNN) is proposed to perform isolated gesture recognition from RGB-D videos, together with an adaptive model fusion strategy to fuse the predicted probabilities from multi-modality inputs.
Abstract: In this paper, we focus on isolated gesture recognition from RGB-D videos. Our main idea is to design an algorithm that can extract global and local information from multi-modality inputs. To this end, we propose a novel attention-based method with a 3D convolutional neural network (CNN) to recognize isolated gestures. It includes two parts. The first is a global and local spatial-attention network (GLSANet), which takes into account both the global information that focuses on the context of the frame and the local information that focuses on the hand/arm actions of the person, to extract efficient features from multi-modality inputs simultaneously. The second is an adaptive model fusion strategy to fuse the predicted probabilities from multi-modality inputs. Experiments demonstrate that the proposed method achieves state-of-the-art performance on the IsoGD dataset.

Posted Content
TL;DR: A large and diverse dataset named WiderPerson is introduced for dense pedestrian detection in the wild; it contains dense pedestrians with various kinds of occlusions, and analysis shows that the classification ability of pedestrian detectors needs to be improved to reduce false alarm and missed detection rates.
Abstract: Pedestrian detection has achieved significant progress with the availability of existing benchmark datasets. However, there is a gap in diversity and density between real world requirements and current pedestrian detection benchmarks: 1) most existing datasets are taken from a vehicle driving through regular traffic scenarios, usually leading to insufficient diversity; 2) crowd scenarios with highly occluded pedestrians are still under-represented, resulting in low density. To narrow this gap and facilitate future pedestrian detection research, we introduce a large and diverse dataset named WiderPerson for dense pedestrian detection in the wild. This dataset involves five types of annotations in a wide range of scenarios, no longer limited to the traffic scenario. There are a total of $13,382$ images with $399,786$ annotations, i.e., $29.87$ annotations per image, which means this dataset contains dense pedestrians with various kinds of occlusions. Hence, pedestrians in the proposed dataset are extremely challenging due to large variations in scenario and occlusion, which makes it suitable for evaluating pedestrian detectors in the wild. We introduce an improved Faster R-CNN and the vanilla RetinaNet to serve as baselines for the new pedestrian detection benchmark. Several experiments are conducted on previous datasets, including Caltech-USA and CityPersons, to analyze the generalization capabilities of the proposed dataset, and we achieve state-of-the-art performances on these previous datasets without bells and whistles. Finally, we analyze common failure cases and find that the classification ability of pedestrian detectors needs to be improved to reduce false alarm and missed detection rates. The proposed dataset is available at this http URL