Proceedings ArticleDOI

Attribute and simile classifiers for face verification

TL;DR: Two novel methods for face verification: binary "attribute" classifiers trained to recognize the presence or absence of describable aspects of visual appearance, and "simile" classifiers that learn the similarity of faces to specific reference people; evaluated in part on a new data set of real-world images of public figures acquired from the internet.
Abstract: We present two novel methods for face verification. Our first method - “attribute” classifiers - uses binary classifiers trained to recognize the presence or absence of describable aspects of visual appearance (e.g., gender, race, and age). Our second method - “simile” classifiers - removes the manual labeling required for attribute classification and instead learns the similarity of faces, or regions of faces, to specific reference people. Neither method requires costly, often brittle, alignment between image pairs; yet, both methods produce compact visual descriptions, and work on real-world images. Furthermore, both the attribute and simile classifiers improve on the current state-of-the-art for the LFW data set, reducing the error rates compared to the current best by 23.92% and 26.34%, respectively, and 31.68% when combined. For further testing across pose, illumination, and expression, we introduce a new data set - termed PubFig - of real-world images of public figures (celebrities and politicians) acquired from the internet. This data set is both larger (60,000 images) and deeper (300 images per individual) than existing data sets of its kind. Finally, we present an evaluation of human performance.
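
As a rough illustration of the pipeline the abstract describes, the sketch below trains one binary classifier per attribute and then verifies a face pair from the element-wise difference of the two attribute vectors. This is a minimal sketch assuming scikit-learn, with random synthetic features standing in for the paper's low-level region descriptors, not the authors' implementation:

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)
n, d = 200, 64                       # synthetic training faces, feature dim
attribute_names = ["gender", "race", "age_band"]

# 1) One binary classifier per describable attribute.
attribute_clfs = {}
for name in attribute_names:
    X = rng.normal(size=(n, d))      # stand-in for low-level region features
    y = rng.integers(0, 2, size=n)   # stand-in presence/absence labels
    attribute_clfs[name] = SVC(kernel="rbf").fit(X, y)

def attribute_vector(face_feats):
    """Signed classifier scores form a compact description of the face."""
    x = face_feats.reshape(1, -1)
    return np.array([clf.decision_function(x)[0]
                     for clf in attribute_clfs.values()])

# 2) Verification: a second classifier decides "same person?" from the
#    absolute difference of the two faces' attribute vectors.
pairs = rng.normal(size=(100, 2, d))   # stand-in face-pair features
same = rng.integers(0, 2, size=100)    # stand-in same/different labels
P = np.array([np.abs(attribute_vector(a) - attribute_vector(b))
              for a, b in pairs])
verifier = LinearSVC().fit(P, same)
```
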
Citations
Proceedings Article
21 Jun 2010
TL;DR: Replacing the binary stochastic hidden units of restricted Boltzmann machines with noisy rectified linear units yields features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset.
Abstract: Restricted Boltzmann machines were developed using binary stochastic hidden units. These can be generalized by replacing each binary unit by an infinite number of copies that all have the same weights but have progressively more negative biases. The learning and inference rules for these "Stepped Sigmoid Units" are unchanged. They can be approximated efficiently by noisy, rectified linear units. Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset. Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors.
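
The construction in the abstract can be checked numerically: an infinite stack of tied binary units with biases offset by -0.5, -1.5, -2.5, ... has expected activation close to softplus(x) = log(1 + e^x), which the noisy rectified linear unit max(0, x + N(0, sigmoid(x))) approximates cheaply. A small NumPy sketch (not the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stepped_sigmoid_sum(x, n_copies=100):
    """Finite truncation of sum_{i>=1} sigmoid(x - i + 0.5)."""
    offsets = np.arange(n_copies) + 0.5
    return sigmoid(x[..., None] - offsets).sum(axis=-1)

def noisy_relu(x, rng):
    """Sampling-time approximation: max(0, x + N(0, sigmoid(x)))."""
    return np.maximum(0.0, x + rng.normal(scale=np.sqrt(sigmoid(x))))

x = np.linspace(-5.0, 5.0, 11)
softplus = np.log1p(np.exp(x))
print(np.allclose(stepped_sigmoid_sum(x), softplus, atol=0.05))  # True
print(noisy_relu(x, np.random.default_rng(0)))
```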

14,799 citations


Cites background or methods from "Attribute and simile classifiers for face verification"

  • ...During discriminative fine-tuning, we use a subset of the Pubfig face dataset (Kumar et al., 2009) as a validation set for selecting the model architecture and for early stopping....

  • ...The two current best LFW results are 0.8683 ± 0.0034 (Wolf et al., 2009), and 0.8529 ± 0.0123 (Kumar et al., 2009)....

  • ...Recently it has been found that humans are able to get 94.27% accuracy on the LFW task even when the centre of the image is masked (Kumar et al., 2009)....

Proceedings ArticleDOI
07 Dec 2015
TL;DR: A novel deep learning framework for attribute prediction in the wild that cascades two CNNs, LNet and ANet, which are fine-tuned jointly with attribute tags, but pre-trained differently.
Abstract: Predicting face attributes in the wild is challenging due to complex face variations. We propose a novel deep learning framework for attribute prediction in the wild. It cascades two CNNs, LNet and ANet, which are fine-tuned jointly with attribute tags, but pre-trained differently. LNet is pre-trained by massive general object categories for face localization, while ANet is pre-trained by massive face identities for attribute prediction. This framework not only outperforms the state-of-the-art with a large margin, but also reveals valuable facts on learning face representation. (1) It shows how the performances of face localization (LNet) and attribute prediction (ANet) can be improved by different pre-training strategies. (2) It reveals that although the filters of LNet are fine-tuned only with image-level attribute tags, their response maps over entire images have strong indication of face locations. This fact enables training LNet for face localization with only image-level annotations, but without face bounding boxes or landmarks, which are required by all attribute recognition works. (3) It also demonstrates that the high-level hidden neurons of ANet automatically discover semantic concepts after pre-training with massive face identities, and such concepts are significantly enriched after fine-tuning with attribute tags. Each attribute can be well explained with a sparse linear combination of these concepts.
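
Point (3) above, that each attribute is well explained by a sparse linear combination of hidden-neuron "concepts", can be illustrated with sparse regression. The toy sketch below assumes scikit-learn and uses synthetic activations in place of ANet's top hidden layer:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
H = rng.normal(size=(500, 100))          # stand-in: hidden activations
true_w = np.zeros(100)
true_w[[3, 17, 42]] = [1.5, -2.0, 0.8]   # attribute driven by 3 "concepts"
y = H @ true_w + 0.1 * rng.normal(size=500)

# L1-penalized regression recovers a sparse explanation of the attribute.
sparse_fit = Lasso(alpha=0.1).fit(H, y)
concepts = np.flatnonzero(np.abs(sparse_fit.coef_) > 0.05)
print("neurons explaining the attribute:", concepts)  # typically [3, 17, 42]
```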

6,273 citations


Cites background from "Attribute and simile classifiers for face verification"

  • ...[15] predicted face attributes by extracting hand-crafted features from ten face parts....

  • ...Face attributes are beneficial for multiple applications such as face verification [15, 2, 24], identification [20], and retrieval....

  • ...[15] extracted HOG-like features on various face regions to tackle attribute classification and face verification....

  • ...Extracting hand-crafted features at pre-defined landmarks has become a standard step in attribute recognition [9, 15, 4, 2]....

  • ...Recent local models [15, 4, 5, 2, 19, 32] first detect object parts and extract features from each part....

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work revisits both the alignment and representation steps of the face recognition pipeline, employing explicit 3D face modeling to apply a piecewise affine transformation and deriving a face representation from a nine-layer deep neural network.
Abstract: In modern face recognition, the conventional pipeline consists of four stages: detect => align => represent => classify. We revisit both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derive a face representation from a nine-layer deep neural network. This deep network involves more than 120 million parameters using several locally connected layers without weight sharing, rather than the standard convolutional layers. Thus we trained it on the largest facial dataset to-date, an identity labeled dataset of four million facial images belonging to more than 4, 000 identities. The learned representations coupling the accurate model-based alignment with the large facial database generalize remarkably well to faces in unconstrained environments, even with a simple classifier. Our method reaches an accuracy of 97.35% on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the current state of the art by more than 27%, closely approaching human-level performance.
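
The error-reduction claim can be sanity-checked from the abstract's own numbers; a short calculation (the 27% figure is taken as exact here, so the result is approximate):

```python
# 97.35% accuracy is 2.65% error; a >27% relative error reduction
# implies the prior state of the art sat near 96.4% accuracy on LFW.
deepface_error = 1 - 0.9735
implied_prior_error = deepface_error / (1 - 0.27)
print(f"implied prior accuracy: {1 - implied_prior_error:.2%}")  # ~96.37%
```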

6,132 citations


Posted Content
TL;DR: Preprint version of the ICCV 2015 entry above: a deep learning framework for face attribute prediction in the wild that cascades two CNNs, LNet and ANet, fine-tuned jointly with attribute tags but pre-trained differently.

2,822 citations

Journal ArticleDOI
TL;DR: This review, which focuses on the application of CNNs to image classification tasks, covers their development, from their predecessors up to recent state-of-the-art deep learning systems.
Abstract: Convolutional neural networks (CNNs) have been applied to visual tasks since the late 1980s. However, despite a few scattered applications, they were dormant until the mid-2000s when developments in computing power and the advent of large amounts of labeled data, supplemented by improved algorithms, contributed to their advancement and brought them to the forefront of a neural network renaissance that has seen rapid progression since 2012. In this review, which focuses on the application of CNNs to image classification tasks, we cover their development, from their predecessors up to recent state-of-the-art deep learning systems. Along the way, we analyze (1) their early successes, (2) their role in the deep learning renaissance, (3) selected symbolic works that have contributed to their recent popularity, and (4) several improvement attempts by reviewing contributions and challenges of over 300 publications. We also introduce some of their current trends and remaining challenges.

2,366 citations

References
More filters
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
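
The recognition pipeline the abstract outlines (nearest-neighbor feature matching followed by geometric verification) can be sketched with OpenCV. Note that the RANSAC homography step below stands in for the paper's Hough clustering plus least-squares pose verification, and the image file names are placeholders:

```python
import cv2
import numpy as np

img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)   # placeholder files
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Fast nearest-neighbor matching, filtered by Lowe's ratio test.
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]

# Geometric verification: keep only matches consistent with one pose.
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print(f"{int(inliers.sum())} geometrically consistent matches")
```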

46,906 citations

Journal ArticleDOI
TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated, and the performance of the support-vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Abstract: The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
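
The construction the abstract describes, a non-linear map to a high-dimensional feature space where a linear decision surface is fit, is exactly what a polynomial-kernel SVM computes implicitly. A toy two-class example assuming scikit-learn (not the paper's OCR benchmark):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: linearly inseparable in input space, separable
# under a degree-2 polynomial feature map.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0).fit(X_tr, y_tr)
print(f"test accuracy: {svm.score(X_te, y_te):.2f}")
```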

37,861 citations


"Attribute and simile classifiers fo..." refers methods in this paper

  • ...Automatically determining the gender of a face has been an active area of research since at least 1990 [15, 9], and includes more recent work [24] using Support Vector Machines (SVMs) [8]....

01 Jan 2011
TL;DR: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and consequently match distinctive invariant features from images that can then be used to reliably match objects in differing images.
Abstract: The Scale-Invariant Feature Transform (or SIFT) algorithm is a highly robust method to extract and consequently match distinctive invariant features from images. These features can then be used to reliably match objects in differing images. The algorithm was first proposed by Lowe [12] and further developed to increase performance, resulting in the classic paper [13] that served as foundation for SIFT which has played an important role in robotic and machine vision in the past decade.

14,708 citations


"Attribute and simile classifiers fo..." refers background in this paper

  • ...Our low-level features are designed following a great deal of work in face recognition (and the larger recognition community) which has identified gradient direction and local descriptors around fiducial features as effective first steps toward dealing with illumination [6, 28, 22, 23, 10]....

Proceedings Article
Yoav Freund1, Robert E. Schapire1
03 Jul 1996
TL;DR: This paper describes experiments carried out to assess how well AdaBoost, with and without pseudo-loss, performs on real learning problems, and compares boosting to Breiman's "bagging" method when used to aggregate various classifiers.
Abstract: In an earlier paper, we introduced a new "boosting" algorithm called AdaBoost which, theoretically, can be used to significantly reduce the error of any learning algorithm that consistently generates classifiers whose performance is a little better than random guessing. We also introduced the related notion of a "pseudo-loss" which is a method for forcing a learning algorithm of multi-label concepts to concentrate on the labels that are hardest to discriminate. In this paper, we describe experiments we carried out to assess how well AdaBoost, with and without pseudo-loss, performs on real learning problems. We performed two sets of experiments. The first set compared boosting to Breiman's "bagging" method when used to aggregate various classifiers (including decision trees and single attribute-value tests). We compared the performance of the two methods on a collection of machine-learning benchmarks. In the second set of experiments, we studied in more detail the performance of boosting using a nearest-neighbor classifier on an OCR problem.
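
For reference, the sketch below implements the standard binary AdaBoost loop with decision stumps as the weak learner; it is a generic NumPy illustration, not the paper's pseudo-loss variant or its benchmark setup:

```python
import numpy as np

def train_adaboost(X, y, n_rounds=20):
    """Binary AdaBoost with decision stumps; labels y in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                    # example weights
    ensemble = []                              # (feature, thresh, sign, alpha)
    for _ in range(n_rounds):
        best, best_err = None, np.inf
        for j in range(d):                     # exhaustive stump search
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.where(X[:, j] > t, 1, -1)
                    err = w[pred != y].sum()
                    if err < best_err:
                        best, best_err = (j, t, s), err
        j, t, s = best
        alpha = 0.5 * np.log((1 - best_err + 1e-12) / (best_err + 1e-12))
        pred = s * np.where(X[:, j] > t, 1, -1)
        w *= np.exp(-alpha * y * pred)         # up-weight the mistakes
        w /= w.sum()
        ensemble.append((j, t, s, alpha))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] > t, 1, -1)
                for j, t, s, a in ensemble)
    return np.sign(score)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 2] > 0, 1, -1)   # toy target
model = train_adaboost(X, y)
print("train accuracy:", (predict(model, X) == y).mean())
```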

7,601 citations


"Attribute and simile classifiers fo..." refers methods in this paper

  • ...Training requires a set of positive and negative example images for each attribute, and uses a simplified version of adaboost [12] to choose from the set of low-level features described in the previous section....
