scispace - formally typeset
Search or ask a question
Author

Kaipeng Zhang

Bio: Kaipeng Zhang is an academic researcher from National Taiwan University. The author has contributed to research in topics: Convolutional neural network & Computer science. The author has an hindex of 13, co-authored 20 publications receiving 7072 citations. Previous affiliations of Kaipeng Zhang include Chinese Academy of Sciences & University of Tokyo.

Papers
More filters
Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a deep cascaded multitask framework that exploits the inherent correlation between detection and alignment to boost up their performance, which leverages a cascaded architecture with three stages of carefully designed deep convolutional networks to predict face and landmark location in a coarse-to-fine manner.
Abstract: Face detection and alignment in unconstrained environment are challenging due to various poses, illuminations, and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this letter, we propose a deep cascaded multitask framework that exploits the inherent correlation between detection and alignment to boost up their performance. In particular, our framework leverages a cascaded architecture with three stages of carefully designed deep convolutional networks to predict face and landmark location in a coarse-to-fine manner. In addition, we propose a new online hard sample mining strategy that further improves the performance in practice. Our method achieves superior accuracy over the state-of-the-art techniques on the challenging face detection dataset and benchmark and WIDER FACE benchmarks for face detection, and annotated facial landmarks in the wild benchmark for face alignment, while keeps real-time performance.

3,980 citations

Book ChapterDOI
08 Oct 2016
TL;DR: This paper proposes a new supervision signal, called center loss, for face recognition task, which simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers.
Abstract: Convolutional neural networks (CNNs) have been widely used in computer vision community, significantly improving the state-of-the-art. In most of the available CNNs, the softmax loss function is used as the supervision signal to train the deep model. In order to enhance the discriminative power of the deeply learned features, this paper proposes a new supervision signal, called center loss, for face recognition task. Specifically, the center loss simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers. More importantly, we prove that the proposed center loss function is trainable and easy to optimize in the CNNs. With the joint supervision of softmax loss and center loss, we can train a robust CNNs to obtain the deep features with the two key learning objectives, inter-class dispension and intra-class compactness as much as possible, which are very essential to face recognition. It is encouraging to see that our CNNs (with such joint supervision) achieve the state-of-the-art accuracy on several important face recognition benchmarks, Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and MegaFace Challenge. Especially, our new approach achieves the best results on MegaFace (the largest public domain face benchmark) under the protocol of small training set (contains under 500000 images and under 20000 persons), significantly improving the previous results and setting new state-of-the-art for both face recognition and face verification tasks.

3,464 citations

Journal ArticleDOI
TL;DR: A deep cascaded multitask framework that exploits the inherent correlation between detection and alignment to boost up their performance and achieves superior accuracy over the state-of-the-art techniques on the challenging face detection dataset and benchmark.
Abstract: Face detection and alignment in unconstrained environment are challenging due to various poses, illuminations and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this paper, we propose a deep cascaded multi-task framework which exploits the inherent correlation between them to boost up their performance. In particular, our framework adopts a cascaded structure with three stages of carefully designed deep convolutional networks that predict face and landmark location in a coarse-to-fine manner. In addition, in the learning process, we propose a new online hard sample mining strategy that can improve the performance automatically without manual sample selection. Our method achieves superior accuracy over the state-of-the-art techniques on the challenging FDDB and WIDER FACE benchmark for face detection, and AFLW benchmark for face alignment, while keeps real time performance.

1,982 citations

Book ChapterDOI
08 Sep 2018
TL;DR: This paper defines a super-identity loss to measure the identity difference between a hallucinated face and its corresponding high-resolution face within the hypersphere identity metric space and presents a domain-integrated training approach by constructing a robust identity metric for faces from these two domains.
Abstract: Face hallucination is a generative task to super-resolve the facial image with low resolution while human perception of face heavily relies on identity information. However, previous face hallucination approaches largely ignore facial identity recovery. This paper proposes Super-Identity Convolutional Neural Network (SICNN) to recover identity information for generating faces closed to the real identity. Specifically, we define a super-identity loss to measure the identity difference between a hallucinated face and its corresponding high-resolution face within the hypersphere identity metric space. However, directly using this loss will lead to a Dynamic Domain Divergence problem, which is caused by the large margin between the high-resolution domain and the hallucination domain. To overcome this challenge, we present a domain-integrated training approach by constructing a robust identity metric for faces from these two domains. Extensive experimental evaluations demonstrate that the proposed SICNN achieves superior visual quality over the state-of-the-art methods on a challenging task to super-resolve 12 \(\times \) 14 faces with an 8\(\times \) upscaling factor. In addition, SICNN significantly improves the recognizability of ultra-low-resolution faces.

124 citations

Proceedings ArticleDOI
01 Oct 2017
TL;DR: Inside Cascaded Structure introduces a two-stream contextual CNN architecture that leverages body part information adaptively to enhance face detection and achieves competitive accuracy to the state-of-the-art techniques while keeps real time performance.
Abstract: Deep Convolutional Neural Networks (CNNs) achieve substantial improvements in face detection in the wild. Classical CNN-based face detection methods simply stack successive layers of filters where an input sample should pass through all layers before reaching a face/non-face decision. Inspired by the fact that for face detection, filters in deeper layers can discriminate between difficult face/non-face samples while those in shallower layers can efficiently reject simple non-face samples, we propose Inside Cascaded Structure that introduces face/non-face classifiers at different layers within the same CNN. In the training phase, we propose data routing mechanism which enables different layers to be trained by different types of samples, and thus deeper layers can focus on handling more difficult samples compared with traditional architecture. In addition, we introduce a two-stream contextual CNN architecture that leverages body part information adaptively to enhance face detection. Extensive experiments on the challenging FD-DB and WIDER FACE benchmarks demonstrate that our method achieves competitive accuracy to the state-of-the-art techniques while keeps real time performance.

81 citations


Cited by
More filters
Proceedings Article
15 Mar 2017
TL;DR: Prototypical Networks as discussed by the authors learn a metric space in which classification can be performed by computing distances to prototype representations of each class, and achieve state-of-the-art results on the CU-Birds dataset.
Abstract: We propose Prototypical Networks for the problem of few-shot classification, where a classifier must generalize to new classes not seen in the training set, given only a small number of examples of each new class. Prototypical Networks learn a metric space in which classification can be performed by computing distances to prototype representations of each class. Compared to recent approaches for few-shot learning, they reflect a simpler inductive bias that is beneficial in this limited-data regime, and achieve excellent results. We provide an analysis showing that some simple design decisions can yield substantial improvements over recent approaches involving complicated architectural choices and meta-learning. We further extend Prototypical Networks to zero-shot learning and achieve state-of-the-art results on the CU-Birds dataset.

5,333 citations

Proceedings ArticleDOI
15 Jun 2019
TL;DR: This paper presents arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks, and shows that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead.
Abstract: One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that can enhance the discriminative power. Centre loss penalises the distance between deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in the angular space and therefore penalises the angles between deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to its exact correspondence to geodesic distance on a hypersphere. We present arguably the most extensive experimental evaluation against all recent state-of-the-art face recognition methods on ten face recognition benchmarks which includes a new large-scale image database with trillions of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead. To facilitate future research, the code has been made available.

4,312 citations

Book ChapterDOI
08 Oct 2016
TL;DR: This paper proposes a new supervision signal, called center loss, for face recognition task, which simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers.
Abstract: Convolutional neural networks (CNNs) have been widely used in computer vision community, significantly improving the state-of-the-art. In most of the available CNNs, the softmax loss function is used as the supervision signal to train the deep model. In order to enhance the discriminative power of the deeply learned features, this paper proposes a new supervision signal, called center loss, for face recognition task. Specifically, the center loss simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers. More importantly, we prove that the proposed center loss function is trainable and easy to optimize in the CNNs. With the joint supervision of softmax loss and center loss, we can train a robust CNNs to obtain the deep features with the two key learning objectives, inter-class dispension and intra-class compactness as much as possible, which are very essential to face recognition. It is encouraging to see that our CNNs (with such joint supervision) achieve the state-of-the-art accuracy on several important face recognition benchmarks, Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and MegaFace Challenge. Especially, our new approach achieves the best results on MegaFace (the largest public domain face benchmark) under the protocol of small training set (contains under 500000 images and under 20000 persons), significantly improving the previous results and setting new state-of-the-art for both face recognition and face verification tasks.

3,464 citations

Journal ArticleDOI
TL;DR: In this article, a review of deep learning-based object detection frameworks is provided, focusing on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further.
Abstract: Due to object detection’s close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles that combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy, and optimization function. In this paper, we provide a review of deep learning-based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely, the convolutional neural network. Then, we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection, and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network-based learning systems.

3,097 citations

21 Jan 2018
TL;DR: It is shown that the highest error involves images of dark-skinned women, while the most accurate result is for light-skinned men, in commercial API-based classifiers of gender from facial images, including IBM Watson Visual Recognition.
Abstract: The paper “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification” by Joy Buolamwini and Timnit Gebru, that will be presented at the Conference on Fairness, Accountability, and Transparency (FAT*) in February 2018, evaluates three commercial API-based classifiers of gender from facial images, including IBM Watson Visual Recognition. The study finds these services to have recognition capabilities that are not balanced over genders and skin tones [1]. In particular, the authors show that the highest error involves images of dark-skinned women, while the most accurate result is for light-skinned men.

2,528 citations