Book ChapterDOI

On Frame Selection for Video Face Recognition

TL;DR: The role and importance of frame selection in video face recognition is discussed, an overview of existing techniques is provided, an entropy-based frame selection algorithm is presented, and a new paradigm for frame selection algorithms is proposed as a path forward.
Abstract: In surveillance applications, video can be an advantageous alternative to traditional still-image face recognition. To extract discriminative features from a video, frame-based processing is preferred; however, not all frames are suitable for face recognition. While some frames are distorted by noise, blur, and occlusion, others may be affected by covariates such as pose, illumination, and expression. Frames affected by imaging artifacts such as noise and blur may not contain reliable facial information and may degrade recognition performance; in an ideal scenario, such frames should not be considered for feature extraction and matching. Furthermore, a video contains a large number of frames, adjacent frames carry largely redundant information, and processing every frame increases the computational complexity. Instead of utilizing all the frames of a video, frame selection can be performed to determine the subset of frames best suited for face recognition. Several frame selection algorithms have been proposed in the literature to address these concerns. In this chapter, we discuss the role and importance of frame selection in video face recognition, provide an overview of existing techniques, present an entropy-based frame selection algorithm with results and analysis on the Point-and-Shoot Challenge (PaSC), a recent benchmark database, and propose a new paradigm for frame selection algorithms as a path forward.
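The entropy-based selection idea described in the abstract can be sketched in a few lines. This is an illustrative toy, not the chapter's exact algorithm: each frame is scored by the Shannon entropy of its grayscale intensity histogram, and the k highest-scoring frames are kept (the function names and the top-k rule are choices made here for illustration).

```python
import numpy as np

def frame_entropy(frame, bins=256):
    """Shannon entropy (bits) of a grayscale frame's intensity histogram."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                       # drop empty bins to avoid log(0)
    return -np.sum(p * np.log2(p))

def select_frames(frames, k=5):
    """Indices of the k frames with the highest entropy, in temporal order."""
    scores = [frame_entropy(f) for f in frames]
    order = np.argsort(scores)[::-1]   # highest entropy first
    return sorted(order[:k].tolist())

# Toy run: a flat frame carries no information, a textured one does.
rng = np.random.default_rng(0)
textured = rng.integers(0, 256, size=(32, 32))
flat = np.full((32, 32), 128)
picked = select_frames([flat, textured, flat], k=1)   # → [1]
```

A uniform frame has zero histogram entropy, so the textured frame is the one selected in the toy run above; real selectors would combine such a score with face detection confidence and redundancy suppression.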
Citations
Proceedings ArticleDOI
01 Jan 2018
TL;DR: A convolutional neural network based key-frame extraction (KFE) engine with Graphic Processing Unit (GPU) acceleration, which aims to extract key-frames with high-quality faces correctly and swiftly, and is adaptive to different face recognition back-ends.
Abstract: Nowadays we see an increasing demand for face in video recognition. However, to overcome the large variations in face quality in video streams, and to improve the processing speed of the face recognition system, frame selection becomes a necessary and essential step prior to performing face recognition. In this paper, we propose a convolutional neural network (CNN) based key-frame extraction (KFE) engine with Graphic Processing Unit (GPU) acceleration, which aims to extract key-frames with high-quality faces correctly and swiftly. We evaluated our method on the ChokePoint dataset following NIST standards and compared it against several representative key-frame selection approaches. The experimental results show that our CNN-based KFE engine can largely reduce the total processing time for face in video recognition, as well as improve the recognition accuracy of the face recognition back-end. With GPU acceleration, our KFE engine reaches and exceeds the real-time processing speed requirement at HD resolution, making it capable of processing multiple video streams on the fly. On top of that, our proposed KFE engine is adaptive to different face recognition back-ends.

21 citations


Cites methods from "On Frame Selection for Video Face R..."

  • ...The existing frame selection methods can be divided into three main categories [3]: clustering based, optical flow based and quality based....


Proceedings ArticleDOI
01 Feb 2018
TL;DR: The experimental results show that the proposed CNN-based key-frame extraction (KFE) engine can dramatically reduce the data volume while improving the FR performance, and can achieve higher than real-time performance with GPU acceleration in dealing with HD videos in real application scenarios.
Abstract: Face in video recognition (FiVR) technology is widely applied in various fields such as video analytics and real-time video surveillance. However, FiVR technology also faces the challenges of high-volume video data, real-time processing requirements, and improving the performance of face recognition (FR) algorithms. To overcome these challenges, frame selection becomes a necessary and beneficial step before the FR stage. In this paper, we propose a CNN-based key-frame extraction (KFE) engine with GPU acceleration, employing our innovative Face Quality Assessment (FQA) module. For theoretical performance analysis of the KFE engine, we evaluated representative one-person video datasets such as PaSC, FiA and ChokePoint using ROC and DET curves. For performance analysis under a practical scenario, we evaluated multi-person videos using the ChokePoint dataset as well as in-house captured full-HD videos. The experimental results show that our KFE engine can dramatically reduce the data volume while improving the FR performance. In addition, our KFE engine can achieve higher than real-time performance with GPU acceleration when dealing with HD videos in real application scenarios.

19 citations


Additional excerpts

  • ...The existing frame selection approaches for FiVR can be divided into three main categories [7]: face image clustering based, optical-flow based and face image quality based....


Journal ArticleDOI
TL;DR: This work proposes a new method for extracting keyframes from videos, based on face quality and deep learning, for a face recognition task, and shows the effectiveness of the proposed method compared to state-of-the-art methods.
Abstract: Indexing is the process of extracting a compact, significant and pertinent signature that describes the content of the data. This field has a broad spectrum of promising applications, such as Face in Video Recognition (FiVR), which has motivated the interest of researchers around the world. Since video contains a huge amount of data, extracting the relevant frames becomes a necessary and essential step prior to performing face recognition. In this context, we propose a new method for extracting keyframes from videos, based on face quality and deep learning, for a face recognition task. Faces are first detected using the MTCNN detector, which locates five landmarks (the two eyes, the two corners of the mouth, and the nose), bounds each face in a bounding box, and provides a confidence score. The method itself has two steps. The first step generates a face quality score for each face in the dataset prepared for the learning stage; to generate these quality scores, we use three face feature extractors: Gabor, LBP, and HoG. The second step trains a deep convolutional neural network in a supervised manner to select the frames with the best face quality. The obtained results show the effectiveness of the proposed method compared to state-of-the-art methods.
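The "face image quality based" family of selectors that this paper belongs to can be illustrated with hand-crafted quality proxies. The sketch below is a stand-in under stated assumptions: it scores a face crop by sharpness (variance of a Laplacian response) plus an exposure term, rather than by the paper's Gabor/LBP/HoG-derived scores or its trained CNN; the weights and function names are illustrative.

```python
import numpy as np

def laplacian_var(img):
    """Sharpness proxy: variance of a 4-neighbour Laplacian response."""
    img = img.astype(float)
    lap = (-4.0 * img[1:-1, 1:-1]
           + img[:-2, 1:-1] + img[2:, 1:-1]
           + img[1:-1, :-2] + img[1:-1, 2:])
    return lap.var()

def quality_score(face, w_sharp=1.0, w_expo=1.0):
    """Weighted sum of sharpness and an exposure term (1 = ideal mid-gray)."""
    sharp = laplacian_var(face)
    expo = 1.0 - abs(face.mean() - 128.0) / 128.0
    return w_sharp * sharp + w_expo * expo

# A crisp, high-contrast crop should outscore a featureless one.
crisp = np.indices((16, 16)).sum(axis=0) % 2 * 255   # checkerboard pattern
featureless = np.full((16, 16), 128)
```

Ranking frames by such a score and keeping the top-scoring ones is the quality-based selection pattern; the paper's contribution is learning the score with a CNN instead of hand-tuning the terms and weights.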

6 citations


Cites background or methods from "On Frame Selection for Video Face R..."

  • ...In this study, we propose a new method for keyframe extraction based on FQA and CNN. Unlike [4, 55, 61], our method does not use metrics and predefined weights to calculate face quality....


  • ...The best frames will be chosen based on a subjective criterion, namely Face Quality Assessment (FQA)....


  • ...Inspired by the huge success of deep convolutional neural networks (DCNN) in several computer vision tasks, several authors use this technique for extracting keyframes based on FQA [13, 18, 74, 75]....


  • ...As part of the performance analysis experiments, our FQM-CNN method achieved higher accuracy rates than similar methods that use keyframe extraction based on FQA for face recognition....


  • ...For this reason, we are interested in keyframe extraction based on FQA....


Proceedings ArticleDOI
23 Mar 2018
TL;DR: This work proposes a novel methodology for face detection on low-resolution videos, based on the parallel Gunnar Farnebäck optical flow algorithm, Haar Cascades and Local Binary Patterns, which can detect faces at a rate of 50%.
Abstract: The use of video cameras for security reasons has increased in recent times. Identifying a person with automatic face detection systems is of greater importance today, but the low quality of the videos makes it difficult, and this remains an open problem that many researchers are trying to solve. We propose a novel methodology for face detection on low-resolution videos based on the parallel Gunnar Farnebäck optical flow algorithm, Haar Cascades and Local Binary Patterns. Our model does not use illumination normalization or super-resolution techniques, which are commonly used in the literature. The results on the CAVIAR database show a better detection rate compared with the OpenCV library, the Dlib C++ library and the corresponding Matlab function, which use the well-known Viola-Jones Haar cascade algorithm and HOGs. Even though these tools do not reach a detection rate of even 1% on this data, our proposal detects faces at a rate of 50%.
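The motion-gated idea here (restrict face detection to regions that move) can be illustrated without OpenCV. The sketch below substitutes simple frame differencing for the Gunnar Farnebäck dense optical flow used in the paper; the threshold and the bounding-box rule are illustrative choices, not the authors' pipeline.

```python
import numpy as np

def motion_mask(prev, curr, thresh=20):
    """Binary mask of pixels whose intensity changed noticeably
    (a crude stand-in for dense optical-flow magnitude)."""
    return np.abs(curr.astype(int) - prev.astype(int)) > thresh

def motion_bbox(mask):
    """Tight bounding box around the moving region, or None if static."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return (ys.min(), xs.min(), ys.max(), xs.max())

# A bright 4x4 "face" patch moves two pixels to the right between frames.
prev = np.zeros((16, 16), dtype=np.uint8)
curr = np.zeros((16, 16), dtype=np.uint8)
prev[6:10, 4:8] = 200
curr[6:10, 6:10] = 200
bbox = motion_bbox(motion_mask(prev, curr))   # → (6, 4, 9, 9)
```

In a full pipeline, a detector (Haar cascade, LBP cascade, etc.) would then be run only inside `bbox`, which is what makes the approach cheap on low-resolution video.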

5 citations


Cites methods from "On Frame Selection for Video Face R..."

  • ...All characteristics of these frames make it more difficult to apply techniques such as face detection or recognition, which were originally designed for images in semi-controlled environments [2],[5]-[14]...


References
Proceedings Article
24 Aug 1981
TL;DR: In this paper, the spatial intensity gradient of the images is used to find a good match using a type of Newton-Raphson iteration, which can be generalized to handle rotation, scaling and shearing.
Abstract: Image registration finds a variety of applications in computer vision. Unfortunately, traditional image registration techniques tend to be costly. We present a new image registration technique that makes use of the spatial intensity gradient of the images to find a good match using a type of Newton-Raphson iteration. Our technique is faster because it examines far fewer potential matches between the images than existing techniques. Furthermore, this registration technique can be generalized to handle rotation, scaling and shearing. We show how our technique can be adapted for use in a stereo vision system.
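The Newton-Raphson registration idea from this reference can be shown in one dimension: linearize the matching error between the two signals using the spatial intensity gradient, and iterate the resulting update. A minimal numpy sketch for pure translation follows (the function name and test signals are illustrative, not from the paper):

```python
import numpy as np

def lk_shift(f, g, iters=10):
    """Estimate d such that g(x) ≈ f(x + d), by Newton-Raphson steps
    on the matching error, using the spatial intensity gradient of f."""
    d = 0.0
    x = np.arange(len(f), dtype=float)
    for _ in range(iters):
        f_shift = np.interp(x + d, x, f)   # f resampled at x + d
        grad = np.gradient(f_shift)        # spatial gradient (the key idea)
        err = g - f_shift
        denom = np.sum(grad * grad)
        if denom == 0.0:
            break
        d += np.sum(grad * err) / denom    # one Newton-Raphson update
    return d

# Two Gaussian "scanlines"; the second satisfies g(x) = f(x + 2).
x = np.arange(32, dtype=float)
f = np.exp(-0.5 * ((x - 10.0) / 3.0) ** 2)
g = np.exp(-0.5 * ((x - 8.0) / 3.0) ** 2)
d_hat = lk_shift(f, g)                     # ≈ 2.0
```

The 2-D version solves a small linear system per step instead of the scalar division, and the same linearization generalizes to rotation, scaling and shearing, as the abstract notes.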

12,944 citations

01 Jan 2010
TL;DR: This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations.
Abstract: We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders, denoising autoencoders are able to learn Gabor-like edge detectors from natural image patches and larger stroke detectors from digit images. This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations.
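A denoising autoencoder of the kind described in this reference can be sketched in plain numpy: mask a fraction of the input, encode and decode with tied weights, and take gradients against the clean input. This is a minimal single-layer toy with masking noise; the hyperparameters are illustrative, and it omits the stacking that gives the paper its deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAE:
    """Single-layer denoising autoencoder with tied weights:
    corrupt x, encode, decode, and reconstruct the *clean* x."""

    def __init__(self, n_in, n_hid, lr=0.5):
        self.W = rng.normal(0.0, 0.1, (n_in, n_hid))
        self.b = np.zeros(n_hid)   # hidden bias
        self.c = np.zeros(n_in)    # visible bias
        self.lr = lr

    def step(self, x, corrupt=0.3):
        """One SGD step on one binary example; returns cross-entropy loss."""
        x_noisy = x * (rng.random(x.shape) > corrupt)   # masking noise
        h = sigmoid(x_noisy @ self.W + self.b)          # encode
        y = sigmoid(h @ self.W.T + self.c)              # decode (tied weights)
        dy = y - x                                      # CE gradient at output
        dh = (dy @ self.W) * h * (1.0 - h)
        self.W -= self.lr * (np.outer(x_noisy, dh) + np.outer(dy, h))
        self.c -= self.lr * dy
        self.b -= self.lr * dh
        return -np.mean(x * np.log(y + 1e-9) + (1 - x) * np.log(1 - y + 1e-9))

# Train on random binary patterns; the per-epoch loss should fall.
patterns = (rng.random((20, 16)) > 0.5).astype(float)
dae = DenoisingAE(n_in=16, n_hid=8)
losses = [np.mean([dae.step(p) for p in patterns]) for _ in range(50)]
```

The essential point is that the target of the reconstruction loss is the uncorrupted input, which is what forces the hidden layer to capture structure rather than copy pixels.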

5,303 citations


Proceedings Article
04 Dec 2006
TL;DR: These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.
Abstract: Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.
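The greedy layer-wise strategy can be illustrated with ordinary autoencoders standing in for the RBMs of a DBN (an assumption made here for brevity): train the first layer unsupervised on the data, then train each subsequent layer on the codes produced by the layers below.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hid, lr=0.2, epochs=100):
    """Fit a one-hidden-layer sigmoid autoencoder to X; return the encoder."""
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.1, (n_in, n_hid))   # encoder weights
    b = np.zeros(n_hid)
    V = rng.normal(0.0, 0.1, (n_hid, n_in))   # decoder weights
    c = np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W + b)                # codes
        Y = sigmoid(H @ V + c)                # reconstruction
        dY = (Y - X) / len(X)                 # cross-entropy gradient
        dH = (dY @ V.T) * H * (1.0 - H)
        V -= lr * H.T @ dY
        c -= lr * dY.sum(axis=0)
        W -= lr * X.T @ dH
        b -= lr * dH.sum(axis=0)
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Train layers one at a time, each on the codes of the layers below."""
    encoders, H = [], X
    for n_hid in layer_sizes:
        W, b = train_autoencoder(H, n_hid)
        encoders.append((W, b))
        H = sigmoid(H @ W + b)                # codes feed the next layer
    return encoders, H

X = (rng.random((30, 12)) > 0.5).astype(float)
encoders, codes = greedy_pretrain(X, [8, 4])  # a 12-8-4 stack
```

In the paper's setting, the stacked weights then initialize a deep network that is fine-tuned with supervision; the claim tested empirically is that this initialization lands near a good local minimum.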

4,385 citations