Proceedings ArticleDOI

Volume structured ordinal features with background similarity measure for video face recognition

TL;DR: The proposed method not only encodes jointly the local spatial and temporal information, but also extracts the most discriminative facial dynamic information while trying to discard spatio-temporal features related to intra-personal variations.
Abstract: Several studies have shown the benefits of using spatio-temporal information for video face recognition. However, most existing spatio-temporal representations do not capture the local discriminative information present in human faces. In this paper we introduce a new local spatio-temporal descriptor, based on structured ordinal features, for video face recognition. The proposed method not only jointly encodes the local spatial and temporal information, but also extracts the most discriminative facial dynamic information while discarding spatio-temporal features related to intra-personal variations. In addition, a similarity measure based on a set of background samples is proposed for use with our descriptor and is shown to boost its performance. Extensive experiments conducted on the recent and challenging YouTube Faces database demonstrate the good performance of our proposal, achieving state-of-the-art results.
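The descriptor described above builds on ordinal comparisons of region averages over a video volume. As a rough illustration of the underlying idea (not the authors' exact VSOF operator; the block size, radius and neighbourhood layout here are arbitrary assumptions), an 8-bit ordinal code for one space-time location could be computed as:

```python
import numpy as np

def ordinal_volume_code(volume, t, y, x, block=2, radius=3):
    """Toy ordinal code for one voxel location: compare the mean of a
    central space-time block against 8 neighbouring blocks placed at
    `radius` in the spatial plane. Returns an 8-bit integer code.
    (Illustrative only -- the paper's VSOF operator is more elaborate.)"""
    def block_mean(tc, yc, xc):
        return volume[tc:tc+block, yc:yc+block, xc:xc+block].mean()
    center = block_mean(t, y, x)
    offsets = [(-radius, -radius), (-radius, 0), (-radius, radius), (0, radius),
               (radius, radius), (radius, 0), (radius, -radius), (0, -radius)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if block_mean(t, y + dy, x + dx) > center:   # ordinal comparison
            code |= 1 << bit
    return code
```

Because only the ordering of block means matters, not their absolute values, such codes are robust to monotonic illumination changes.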
Citations
Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work revisits both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derive a face representation from a nine-layer deep neural network.
Abstract: In modern face recognition, the conventional pipeline consists of four stages: detect => align => represent => classify. We revisit both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derive a face representation from a nine-layer deep neural network. This deep network involves more than 120 million parameters using several locally connected layers without weight sharing, rather than the standard convolutional layers. Thus we trained it on the largest facial dataset to date, an identity-labeled dataset of four million facial images belonging to more than 4,000 identities. The learned representations, coupling the accurate model-based alignment with the large facial database, generalize remarkably well to faces in unconstrained environments, even with a simple classifier. Our method reaches an accuracy of 97.35% on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the current state of the art by more than 27%, closely approaching human-level performance.
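The 27% figure in the abstract above is a relative error reduction, which is easy to check. A small sketch; the prior accuracy of 96.33% used below is a hypothetical figure chosen purely for illustration, not taken from this page:

```python
def relative_error_reduction(acc_new, acc_old):
    """Relative reduction in classification error when accuracy
    improves from acc_old to acc_new (both given in percent)."""
    err_new, err_old = 100.0 - acc_new, 100.0 - acc_old
    return (err_old - err_new) / err_old

# DeepFace reports 97.35% on LFW; with a hypothetical prior state of
# the art of 96.33%, the error drops by roughly 28% -- consistent with
# the "more than 27%" claim.
reduction = relative_error_reduction(97.35, 96.33)
```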

6,132 citations


Additional excerpts

  • ...VSOF+OSS [23] 79....


Proceedings ArticleDOI
23 Jun 2014
TL;DR: The proposed DDML trains a deep neural network which learns a set of hierarchical nonlinear transformations to project face pairs into the same feature subspace, under which the distance of each positive face pair is less than a smaller threshold and that of each negative pair is higher than a larger threshold.
Abstract: This paper presents a new discriminative deep metric learning (DDML) method for face verification in the wild. Different from existing metric learning-based face verification methods which aim to learn a Mahalanobis distance metric to maximize the inter-class variations and minimize the intra-class variations, simultaneously, the proposed DDML trains a deep neural network which learns a set of hierarchical nonlinear transformations to project face pairs into the same feature subspace, under which the distance of each positive face pair is less than a smaller threshold and that of each negative pair is higher than a larger threshold, respectively, so that discriminative information can be exploited in the deep network. Our method achieves very competitive face verification performance on the widely used LFW and YouTube Faces (YTF) datasets.
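The dual-threshold constraint described in the abstract (positive pairs closer than a smaller threshold, negative pairs farther than a larger one) can be expressed as a hinge-style penalty on pair distances. A minimal sketch, assuming distances have already been computed in the learned feature space; the threshold values and the exact loss form are assumptions, not the paper's formulation:

```python
import numpy as np

def pairwise_hinge_loss(distances, is_positive, t_small=1.0, t_large=2.0):
    """Hinge-style penalty enforcing the constraint DDML describes:
    each positive pair's distance should fall below t_small, and each
    negative pair's distance should exceed t_large (t_small < t_large).
    A sketch of the constraint, not the paper's exact objective."""
    d = np.asarray(distances, dtype=float)
    pos = np.asarray(is_positive, dtype=bool)
    loss_pos = np.maximum(0.0, d - t_small)[pos].sum()   # pull positives in
    loss_neg = np.maximum(0.0, t_large - d)[~pos].sum()  # push negatives out
    return loss_pos + loss_neg
```

Pairs already satisfying their threshold contribute zero, so only violating pairs drive the network's updates.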

730 citations


Cites methods from "Volume structured ordinal features ..."

  • ...These compared methods include Matched Background Similarity (MBGS) [34], APEM [21], STFRD+PMML [6], MBGS+SVM [37], VSOF+OSS (Adaboost) [24], and PHL+SILD [16]....


Proceedings ArticleDOI
01 Jul 2017
TL;DR: This NAN is trained with a standard classification or verification loss without any extra supervision signal, and it is found that it automatically learns to advocate high-quality face images while repelling low-quality ones such as blurred, occluded and improperly exposed faces.
Abstract: This paper presents a Neural Aggregation Network (NAN) for video face recognition. The network takes a face video or face image set of a person with a variable number of face images as its input, and produces a compact, fixed-dimension feature representation for recognition. The whole network is composed of two modules. The feature embedding module is a deep Convolutional Neural Network (CNN) which maps each face image to a feature vector. The aggregation module consists of two attention blocks which adaptively aggregate the feature vectors to form a single feature inside the convex hull spanned by them. Due to the attention mechanism, the aggregation is invariant to the image order. Our NAN is trained with a standard classification or verification loss without any extra supervision signal, and we found that it automatically learns to advocate high-quality face images while repelling low-quality ones such as blurred, occluded and improperly exposed faces. The experiments on the IJB-A, YouTube Faces and Celebrity-1000 video face recognition benchmarks show that it consistently outperforms naive aggregation methods and achieves state-of-the-art accuracy.
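The attention-based aggregation described above (order-invariant weights forming a point in the convex hull of the input features) amounts to a softmax-weighted average. A minimal single-query sketch; the real NAN stacks two attention blocks with learned parameters:

```python
import numpy as np

def attention_aggregate(features, q):
    """Aggregate a variable-length set of per-frame feature vectors into
    one vector via softmax attention. The weights are positive and sum
    to one, so the result lies in the convex hull of the inputs and is
    invariant to frame order. `q` stands in for one attention block's
    learned query; a simplified sketch of the NAN aggregation module.
    features: (n, d) array, q: (d,) array."""
    scores = features @ q                  # one relevance score per frame
    w = np.exp(scores - scores.max())      # numerically stable softmax
    w /= w.sum()
    return w @ features                    # convex combination of frames
```

With a zero query the weights are uniform and the result is simply the mean of the frames, i.e. naive average pooling is a special case.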

323 citations


Cites background from "Volume structured ordinal features ..."

  • ...Video face recognition has caught more and more attention from the community in recent years [29, 17, 30, 5, 20, 18, 19, 21, 10, 26, 23]....


Posted Content
TL;DR: This preprint presents the Neural Aggregation Network (NAN) for video face recognition, whose aggregation module consists of two attention blocks that adaptively aggregate per-frame feature vectors into a single feature inside the convex hull spanned by them.
Abstract: This paper presents a Neural Aggregation Network (NAN) for video face recognition. The network takes a face video or face image set of a person with a variable number of face images as its input, and produces a compact, fixed-dimension feature representation for recognition. The whole network is composed of two modules. The feature embedding module is a deep Convolutional Neural Network (CNN) which maps each face image to a feature vector. The aggregation module consists of two attention blocks which adaptively aggregate the feature vectors to form a single feature inside the convex hull spanned by them. Due to the attention mechanism, the aggregation is invariant to the image order. Our NAN is trained with a standard classification or verification loss without any extra supervision signal, and we found that it automatically learns to advocate high-quality face images while repelling low-quality ones such as blurred, occluded and improperly exposed faces. The experiments on IJB-A, YouTube Face, Celebrity-1000 video face recognition benchmarks show that it consistently outperforms naive aggregation methods and achieves the state-of-the-art accuracy.

291 citations

Journal ArticleDOI
TL;DR: A discriminative deep multi-metric learning method to jointly learn multiple neural networks, under which the correlation of different features of each sample is maximized, and the distance of each positive pair is reduced and that of each negative pair is enlarged.
Abstract: This paper presents a new discriminative deep metric learning (DDML) method for face and kinship verification in wild conditions. While metric learning has achieved reasonably good performance in face and kinship verification, most existing metric learning methods aim to learn a single Mahalanobis distance metric to maximize the inter-class variations and minimize the intra-class variations, which cannot capture the nonlinear manifold on which face images usually lie. To address this, we propose a DDML method to train a deep neural network to learn a set of hierarchical nonlinear transformations to project face pairs into the same latent feature space, under which the distance of each positive pair is reduced and that of each negative pair is enlarged. To better use the commonality of multiple feature descriptors to make all the features more robust for face and kinship verification, we develop a discriminative deep multi-metric learning method to jointly learn multiple neural networks, under which the correlation of different features of each sample is maximized, and the distance of each positive pair is reduced and that of each negative pair is enlarged. Extensive experimental results show that our proposed methods achieve acceptable results in both face and kinship verification.

264 citations

References
Proceedings ArticleDOI
07 Sep 2009
TL;DR: It is demonstrated that regular sampling of space-time features consistently outperforms all tested space-time interest point detectors for human actions in realistic settings, and that the ranking of most methods is consistent across different datasets.
Abstract: Local space-time features have recently become a popular video representation for action recognition. Several methods for feature localization and description have been proposed in the literature and promising recognition results were demonstrated for a number of action classes. The comparison of existing methods, however, is often limited given the different experimental settings used. The purpose of this paper is to evaluate and compare previously proposed space-time features in a common experimental setup. In particular, we consider four different feature detectors and six local feature descriptors and use a standard bag-of-features SVM approach for action recognition. We investigate the performance of these methods on a total of 25 action classes distributed over three datasets with varying difficulty. Among interesting conclusions, we demonstrate that regular sampling of space-time features consistently outperforms all tested space-time interest point detectors for human actions in realistic settings. We also demonstrate a consistent ranking for the majority of methods over different datasets and discuss their advantages and limitations.
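The dense regular sampling that this evaluation found to outperform interest-point detectors simply extracts space-time cuboids on a fixed grid. A minimal sketch; cuboid size and stride are illustrative assumptions, not values from the paper:

```python
import numpy as np

def dense_cuboids(video, size=(8, 16, 16), stride=(4, 8, 8)):
    """Regularly sample space-time cuboids from a (T, H, W) video --
    the dense-sampling baseline, as opposed to detector-driven
    interest points. Returns one flattened cuboid per row."""
    T, H, W = video.shape
    st, sy, sx = size
    dt, dy, dx = stride
    cubes = [video[t:t+st, y:y+sy, x:x+sx].ravel()
             for t in range(0, T - st + 1, dt)
             for y in range(0, H - sy + 1, dy)
             for x in range(0, W - sx + 1, dx)]
    return np.stack(cubes)
```

Each cuboid would then be turned into a local descriptor (HOG/HOF etc.) and quantized in a bag-of-features pipeline.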

1,485 citations


"Volume structured ordinal features ..." refers background in this paper

  • ...Local spatio-temporal descriptors have become very popular for human action recognition in videos [17]....


Proceedings ArticleDOI
20 Jun 2011
TL;DR: A comprehensive database of labeled videos of faces in challenging, uncontrolled conditions, the ‘YouTube Faces’ database, is presented along with benchmark pair-matching tests, and a novel set-to-set similarity measure, the Matched Background Similarity (MBGS), is described.
Abstract: Recognizing faces in unconstrained videos is a task of mounting importance. While obviously related to face recognition in still images, it has its own unique characteristics and algorithmic requirements. Over the years several methods have been suggested for this problem, and a few benchmark data sets have been assembled to facilitate its study. However, there is a sizable gap between the actual application needs and the current state of the art. In this paper we make the following contributions. (a) We present a comprehensive database of labeled videos of faces in challenging, uncontrolled conditions (i.e., ‘in the wild’), the ‘YouTube Faces’ database, along with benchmark, pair-matching tests. (b) We employ our benchmark to survey and compare the performance of a large variety of existing video face recognition techniques. Finally, (c) we describe a novel set-to-set similarity measure, the Matched Background Similarity (MBGS). This similarity is shown to considerably improve performance on the benchmark tests.
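The MBGS idea is to judge set-to-set similarity relative to a pool of background samples rather than in absolute terms. The sketch below is a deliberately simplified, classifier-free stand-in: the original MBGS trains a discriminative classifier (e.g. an SVM) against each set's matched background, whereas here the matched background only normalises a nearest-neighbour score:

```python
import numpy as np

def matched_background_similarity(set1, set2, background, k=5):
    """Simplified, classifier-free take on the MBGS idea: score set2
    against set1, but subtract how similar set1 is to its k nearest
    'background' samples, so generic frame content (lighting, pose,
    backdrop) contributes less. All inputs are (n, d) arrays of
    L2-normalised frame descriptors. Illustrative stand-in only --
    the original trains a discriminative classifier instead."""
    direct = (set1 @ set2.T).max(axis=1).mean()        # set1 -> set2 match
    bg_sims = set1 @ background.T                      # set1 vs pool
    matched = np.sort(bg_sims, axis=1)[:, -k:].mean()  # k nearest bg samples
    return direct - matched
```

Intuitively, a high score requires set2 to match set1 better than generic background faces do.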

1,423 citations


"Volume structured ordinal features ..." refers background or methods in this paper

  • ...Recently, the publicly available YouTube Faces database [19] was introduced, containing more than 3425 videos of 1595 subjects obtained from YouTube, with significant variations in expression, illumination, pose, resolution and background....


  • ...Three LBP variants were used as descriptors in [19] and many state-of-the-art methods were tested, including all-frames comparisons, pose-based methods, algebraic methods, non-algebraic set methods and Matched Background Similarity (MBGS) methods....


  • ...Similar to [19], the ROC curve was obtained for all splits together, using the average recognition rates....


  • ...Recently, an alternative approach which combines both strategies has emerged [19, 20]....


  • ...In our experiments we follow the protocol defined for the YouTube Faces database [19], where 5000 randomly chosen video pairs are used....


Book
21 Jun 2011
TL;DR: Computer Vision Using Local Binary Patterns provides a detailed description of the LBP methods and their variants both in spatial and spatiotemporal domains and provides an excellent overview as to how texture methods can be utilized for solving different kinds of computer vision and image analysis problems.
Abstract: The recent emergence of Local Binary Patterns (LBP) has led to significant progress in applying texture methods to various computer vision problems and applications. The focus of this research has broadened from 2D textures to 3D textures and spatiotemporal (dynamic) textures. Also, where texture was once utilized for applications such as remote sensing, industrial inspection and biomedical image analysis, the introduction of LBP-based approaches has provided outstanding results in problems relating to face and activity analysis, with future scope for face and facial expression recognition, biometrics, visual surveillance and video analysis. Computer Vision Using Local Binary Patterns provides a detailed description of the LBP methods and their variants both in spatial and spatiotemporal domains. This comprehensive reference also provides an excellent overview as to how texture methods can be utilized for solving different kinds of computer vision and image analysis problems. Source codes of the basic LBP algorithms, demonstrations, some databases and a comprehensive LBP bibliography can be found on an accompanying web site. Topics include: local binary patterns and their variants in spatial and spatiotemporal domains; texture classification and segmentation; description of interest regions; applications in image retrieval and 3D recognition; recognition and segmentation of dynamic textures; background subtraction; recognition of actions; face analysis using still images and image sequences; visual speech recognition; and LBP in various applications. Written by pioneers of LBP, this book is an essential resource for researchers, professional engineers and graduate students in computer vision, image analysis and pattern recognition. The book will also be of interest to all those who work with specific applications of machine vision.
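For reference, the basic LBP operator the book is built around is only a few lines: each interior pixel is described by thresholding its 8 neighbours against the centre value, yielding an 8-bit code. A minimal dense implementation:

```python
import numpy as np

def lbp_codes(image):
    """Basic 3x3 Local Binary Pattern: each interior pixel is encoded
    by comparing its 8 neighbours against the centre pixel, giving an
    8-bit code per pixel. Returns an (H-2, W-2) array of codes."""
    img = np.asarray(image, dtype=float)
    c = img[1:-1, 1:-1]                       # centre pixels
    # neighbour offsets, clockwise from top-left
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offs):
        n = img[1+dy:img.shape[0]-1+dy, 1+dx:img.shape[1]-1+dx]
        codes |= (n >= c).astype(np.uint8) << bit
    return codes
```

Histograms of these codes over image regions form the texture descriptors used throughout the LBP literature.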

641 citations


"Volume structured ordinal features ..." refers background in this paper

  • ...One of them is the Extended set of Volume Local Binary Patterns (EVLBP) [5], which is an extension of the very successful Local Binary Pattern (LBP) operator [14] to the case of videos....


Book ChapterDOI
Shengcai Liao, Xiangxin Zhu, Zhen Lei, Lun Zhang, Stan Z. Li
27 Aug 2007
TL;DR: Experiments on Face Recognition Grand Challenge (FRGC) ver2.0 database show that the proposed MB-LBP method significantly outperforms other LBP based face recognition algorithms.
Abstract: In this paper, we propose a novel representation, called Multiscale Block Local Binary Pattern (MB-LBP), and apply it to face recognition. The Local Binary Pattern (LBP) has been proved to be effective for image representation, but it is too local to be robust. In MB-LBP, the computation is done based on average values of block subregions, instead of individual pixels. In this way, the MB-LBP code presents several advantages: (1) it is more robust than LBP; (2) it encodes not only microstructures but also macrostructures of image patterns, and hence provides a more complete image representation than the basic LBP operator; and (3) MB-LBP can be computed very efficiently using integral images. Furthermore, in order to reflect the uniform appearance of MB-LBP, we redefine the uniform patterns via statistical analysis. Finally, AdaBoost learning is applied to select the most effective uniform MB-LBP features and construct face classifiers. Experiments on the Face Recognition Grand Challenge (FRGC) ver2.0 database show that the proposed MB-LBP method significantly outperforms other LBP-based face recognition algorithms.
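The block-comparison idea of MB-LBP can be sketched directly: replace basic LBP's pixel comparisons with comparisons of s×s block means. This is only an illustrative sketch; the paper additionally computes block means via integral images for efficiency and redefines uniform patterns statistically:

```python
import numpy as np

def mb_lbp_code(image, y, x, s=3):
    """Multiscale Block LBP at one position: compare the mean of the
    central s-by-s block against its 8 surrounding s-by-s blocks,
    instead of comparing single pixels as basic LBP does. (y, x) is
    the top-left corner of the centre block. Illustrative sketch."""
    img = np.asarray(image, dtype=float)
    def bmean(yy, xx):
        return img[yy:yy+s, xx:xx+s].mean()
    c = bmean(y, x)
    offs = [(-s, -s), (-s, 0), (-s, s), (0, s),
            (s, s), (s, 0), (s, -s), (0, -s)]
    code = 0
    for bit, (dy, dx) in enumerate(offs):
        if bmean(y + dy, x + dx) >= c:
            code |= 1 << bit
    return code
```

Averaging over blocks before comparing is what makes MB-LBP capture macrostructure and resist pixel-level noise.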

633 citations


"Volume structured ordinal features ..." refers background in this paper

  • ...Different from Multi-block LBP [10], SOF compares not only adjacent regions but also neighboring regions of some size at a given radius....


Journal ArticleDOI
TL;DR: A recently proposed distributed neural system for face perception, with minor modifications, can accommodate the psychological findings with moving faces.

466 citations


"Volume structured ordinal features ..." refers background in this paper

  • ...Psychophysical and neural studies suggest that humans use both spatial and dynamic information when recognizing moving faces [13]....
