Other affiliations: International Institute of Information Technology
Bio: Vijay Kumar is an academic researcher from the International Institute of Information Technology, Hyderabad. The author has contributed to research on facial recognition systems and face detection. The author has an h-index of 8 and has co-authored 14 publications receiving 236 citations. Previous affiliations of Vijay Kumar include the International Institute of Information Technology.
01 Dec 2013
TL;DR: A more challenging Indian Movie Face Database (IMFDB) that has much more variability compared to LFW and Pubfigs is introduced and is the first face database that provides a detailed annotation in terms of age, pose, gender, expression, amount of occlusion, for each face which may help other face related applications.
Abstract: Recognizing human faces in the wild is emerging as a critically important and technically challenging computer vision problem. With a few notable exceptions, most previous works in the last several decades have focused on recognizing faces captured in a laboratory setting. However, with the introduction of databases such as LFW and Pubfigs, the face recognition community is gradually shifting its focus to much more challenging unconstrained settings. Since its introduction, the LFW verification benchmark has received a lot of attention, with various researchers contributing towards state-of-the-art results. To further boost unconstrained face recognition research, we introduce a more challenging Indian Movie Face Database (IMFDB) that has much more variability compared to LFW and Pubfigs. The database consists of 34512 faces of 100 known actors collected from approximately 103 Indian movies. Unlike LFW and Pubfigs, which used face detectors to automatically detect the faces from web collections, faces in IMFDB are detected manually from all the movies. Manual selection of faces from movies resulted in a high degree of variability (in scale, pose, expression, illumination, age, occlusion, makeup) of the kind one sees in the natural world. IMFDB is the first face database that provides a detailed annotation in terms of age, pose, gender, expression, and amount of occlusion for each face, which may help other face-related applications.
24 Aug 2014
TL;DR: This work attempts the raga classification problem in a non-linear SVM framework using a combination of two kernels that represent the similarities of a music signal using two different features: the pitch-class profile and the n-gram distribution of notes.
Abstract: In this work, we propose a method to identify the ragas of an Indian Carnatic music signal. This has several interesting applications in digital music indexing, recommendation and retrieval. However, this problem is hard due to (i) the absence of a fixed frequency for a note, (ii) the relative scale of notes, (iii) oscillations around a note, and (iv) improvisations. In this work, we attempt the raga classification problem in a non-linear SVM framework using a combination of two kernels that represent the similarities of a music signal using two different features: the pitch-class profile and the n-gram distribution of notes. This differs from the previous pitch-class profile based approaches, where the temporal information of notes is ignored. We evaluated the proposed approach on our own raga dataset and the CompMusic dataset and show an improvement of 10.19% by combining the information from two features relevant to Indian Carnatic music.
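The idea of combining a pitch-class kernel with a note n-gram kernel can be illustrated with a minimal sketch. The abstract does not say which kernel functions or combination weights the authors use, so everything here is an assumption for illustration: notes are given as integers, both kernels are histogram-intersection kernels, and they are mixed with a hypothetical weight alpha.

```python
from collections import Counter

def pitch_class_profile(notes, n_classes=12):
    """Normalized histogram of pitch classes (order of notes is discarded)."""
    counts = [0.0] * n_classes
    for note in notes:
        counts[note % n_classes] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]

def ngram_distribution(notes, n=2):
    """Normalized distribution of consecutive note n-grams (keeps local order)."""
    grams = Counter(tuple(notes[i:i + n]) for i in range(len(notes) - n + 1))
    total = sum(grams.values()) or 1
    return {g: c / total for g, c in grams.items()}

def intersection_dense(p, q):
    """Histogram-intersection kernel for two equal-length histograms."""
    return sum(min(a, b) for a, b in zip(p, q))

def intersection_sparse(p, q):
    """Histogram-intersection kernel for two sparse (dict) histograms."""
    return sum(min(v, q[k]) for k, v in p.items() if k in q)

def combined_kernel(seqs, alpha=0.5, n=2):
    """Gram matrix: alpha * K_pitch + (1 - alpha) * K_ngram (alpha is illustrative)."""
    pcps = [pitch_class_profile(s) for s in seqs]
    grams = [ngram_distribution(s, n) for s in seqs]
    m = len(seqs)
    return [[alpha * intersection_dense(pcps[i], pcps[j])
             + (1 - alpha) * intersection_sparse(grams[i], grams[j])
             for j in range(m)] for i in range(m)]

# An ascending phrase and its reversal share every pitch class but no bigram,
# so only the n-gram kernel tells them apart.
K = combined_kernel([[0, 2, 4, 5, 7], [7, 5, 4, 2, 0]])
```

A Gram matrix of this kind could then be handed to any SVM implementation that accepts precomputed kernels.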
01 Jul 2017
TL;DR: In this article, the authors proposed a network that jointly optimizes a single loss over multiple body regions for learning a person representation, and showed significant improvements over previously proposed approaches on all the benchmarks including photo album setting of PIPA.
Abstract: Person recognition methods that use multiple body regions have shown significant improvements over traditional face-based recognition. One of the primary challenges in full-body person recognition is the extreme variation in pose and viewpoint. In this work, (i) we present an approach that tackles pose variations by utilizing multiple models that are trained on specific poses and combined using pose-aware weights during testing. (ii) For learning a person representation, we propose a network that jointly optimizes a single loss over multiple body regions. (iii) Finally, we introduce new benchmarks to evaluate person recognition in diverse scenarios and show significant improvements over previously proposed approaches on all the benchmarks, including the photo album setting of PIPA.
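Combining pose-specific models with pose-aware weights at test time can be sketched as a weighted average of per-model identity scores. The abstract does not describe how the weights are obtained or how scores are fused, so the function name, the pose labels, and the normalization below are all assumptions for illustration.

```python
def combine_pose_models(scores_per_pose, pose_weights):
    """Fuse identity scores from pose-specific models.

    scores_per_pose: {pose: [score for each identity]} (hypothetical model outputs)
    pose_weights:    {pose: non-negative weight for the test image's pose}
    """
    total = sum(pose_weights.values())
    n_ids = len(next(iter(scores_per_pose.values())))
    fused = [0.0] * n_ids
    for pose, scores in scores_per_pose.items():
        w = pose_weights.get(pose, 0.0) / total  # normalize weights to sum to 1
        for i, s in enumerate(scores):
            fused[i] += w * s
    return fused

# A test image judged mostly "profile" leans on the profile-trained model.
fused = combine_pose_models(
    {"frontal": [0.9, 0.1], "profile": [0.2, 0.8]},
    {"frontal": 0.25, "profile": 0.75},
)
predicted_identity = max(range(len(fused)), key=fused.__getitem__)
```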
07 Dec 2015
TL;DR: A novel approach that incorporates higher order information in the voting process that ensures that a visual word or a phrase in an exemplar makes a major contribution only if it occurs at its semantic location, thereby suppressing the noise significantly.
Abstract: Recently, exemplar based approaches have been successfully applied for face detection in the wild. Contrary to traditional approaches that model face variations from a large and diverse set of training examples, exemplar-based approaches use a collection of discriminatively trained exemplars for detection. In this paradigm, each exemplar casts a vote using retrieval framework and generalized Hough voting, to locate the faces in the target image. The advantage of this approach is that by having a large database that covers all possible variations, faces in challenging conditions can be detected without having to learn explicit models for different variations. Current schemes, however, make an assumption of independence between the visual words, ignoring their relations in the process. They also ignore the spatial consistency of the visual words. Consequently, every exemplar word contributes equally during voting regardless of its location. In this paper, we propose a novel approach that incorporates higher order information in the voting process. We discover visual phrases that contain semantically related visual words and exploit them for detection along with the visual words. For spatial consistency, we estimate the spatial distribution of visual words and phrases from the entire database and then weigh their occurrence in exemplars. This ensures that a visual word or a phrase in an exemplar makes a major contribution only if it occurs at its semantic location, thereby suppressing the noise significantly. We perform extensive experiments on standard FDDB, AFW and G-album datasets and show significant improvement over previous exemplar approaches.
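The voting step described above can be sketched as a generalized Hough accumulator in which each matched visual word votes for a face center with a location-dependent weight. This is a simplified illustration rather than the authors' implementation: visual phrases are omitted, and weight_fn is a hypothetical stand-in for the spatial distribution estimated from the database.

```python
from collections import defaultdict

def hough_vote(matches, exemplar_center, weight_fn, cell=8):
    """Accumulate weighted votes for the face center in the target image.

    matches: list of (word_id, (xe, ye), (xt, yt)), a visual word found at
             (xe, ye) in the exemplar and at (xt, yt) in the target.
    weight_fn(word_id, (xe, ye)): weight of that word at that exemplar location.
    """
    acc = defaultdict(float)
    cx, cy = exemplar_center
    for word, (xe, ye), (xt, yt) in matches:
        # Transfer the word-to-center offset from the exemplar to the target.
        vx, vy = xt + (cx - xe), yt + (cy - ye)
        acc[(int(vx // cell), int(vy // cell))] += weight_fn(word, (xe, ye))
    return acc

# Words 1 and 2 sit at plausible locations and get full weight; word 3 is
# treated as occurring away from its usual location and is down-weighted.
weights = {1: 1.0, 2: 1.0, 3: 0.1}
acc = hough_vote(
    [(1, (40, 40), (140, 90)), (2, (60, 45), (161, 96)), (3, (10, 10), (30, 200))],
    exemplar_center=(50, 50),
    weight_fn=lambda word, loc: weights[word],
)
best_cell = max(acc, key=acc.get)
```

With uniform weights every match would count equally, which is the behavior the abstract attributes to earlier exemplar schemes.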
24 Aug 2014
TL;DR: This work considers the problem of automatic identification of faces in videos such as movies, given a dictionary of known faces from a public or an alternate database, and proposes a two-stage approach that recognizes the faces in a video with a sparse representation framework based on l1-minimization and selects a few key-frames based on a robust confidence measure.
Abstract: We consider the problem of automatic identification of faces in videos such as movies, given a dictionary of known faces from a public or an alternate database. This has applications in video indexing, content based search, surveillance, and real time recognition on wearable computers. We propose a two stage approach for this problem. First, we recognize the faces in a video using a sparse representation framework using l1-minimization and select a few key-frames based on a robust confidence measure. We then use transductive learning to propagate the labels from the key-frames to the remaining frames by incorporating constraints simultaneously in temporal and feature spaces. This is in contrast to some of the previous approaches where every test frame/track is identified independently, ignoring the correlation between the faces in video tracks. Having a few key frames belonging to few subjects for label propagation rather than a large dictionary of actors reduces the amount of confusion. We evaluate the performance of our algorithm on Movie Trailer face dataset and five movie clips, and achieve a significant improvement in labeling accuracy compared to previous approaches.
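The first stage, recognition by sparse representation, can be sketched as follows. The abstract does not specify the solver or the confidence measure, so this sketch makes its own choices: it solves an l1-regularized least-squares problem with plain ISTA (iterative soft-thresholding) and labels the query by the class whose atoms reconstruct it with the smallest residual. The function names and the value of lam are illustrative.

```python
def soft(v, t):
    """Soft-thresholding: the proximal operator of t * |v|."""
    if abs(v) <= t:
        return 0.0
    return v - t if v > 0 else v + t

def ista(D, y, lam=0.1, iters=300):
    """Approximately minimize 0.5 * ||y - sum_j x[j] * D[j]||^2 + lam * ||x||_1.

    D is a list of dictionary atoms (each a list of floats), y is the query.
    """
    # 1 / ||D||_F^2 is a safe step size (it never exceeds 1 / Lipschitz constant).
    step = 1.0 / sum(v * v for atom in D for v in atom)
    x = [0.0] * len(D)
    for _ in range(iters):
        recon = [sum(x[j] * D[j][i] for j in range(len(D))) for i in range(len(y))]
        resid = [r - t for r, t in zip(recon, y)]
        grad = [sum(a * r for a, r in zip(atom, resid)) for atom in D]
        x = [soft(xj - step * g, step * lam) for xj, g in zip(x, grad)]
    return x

def classify(D, labels, y, lam=0.1):
    """Return (label with smallest class-wise residual, all residuals, coefficients)."""
    x = ista(D, y, lam)
    residuals = {}
    for c in sorted(set(labels)):
        recon = [sum(x[j] * D[j][i] for j in range(len(D)) if labels[j] == c)
                 for i in range(len(y))]
        residuals[c] = sum((a - b) ** 2 for a, b in zip(y, recon)) ** 0.5
    return min(residuals, key=residuals.get), residuals, x

# Toy dictionary: two atoms for actor "A", one for actor "B".
D = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
label, residuals, x = classify(D, ["A", "A", "B"], [0.2, 0.0, 1.0])
```

The gap between the best and second-best residual is one possible signal for picking key-frames, though the paper's own confidence measure may differ.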
17 Aug 2017
TL;DR: S3FD is a real-time face detector built on a scale-equitable face detection framework that handles different scales of faces well and improves the recall rate of small faces with a scale compensation anchor matching strategy.
Abstract: This paper presents a real-time face detector, named Single Shot Scale-invariant Face Detector (S3FD), which performs superiorly on various scales of faces with a single deep neural network, especially for small faces. Specifically, we try to solve the common problem that anchor-based detectors deteriorate dramatically as the objects become smaller. We make contributions in the following three aspects: 1) proposing a scale-equitable face detection framework to handle different scales of faces well. We tile anchors on a wide range of layers to ensure that all scales of faces have enough features for detection. Besides, we design anchor scales based on the effective receptive field and a proposed equal proportion interval principle; 2) improving the recall rate of small faces by a scale compensation anchor matching strategy; 3) reducing the false positive rate of small faces via a max-out background label. As a consequence, our method achieves state-of-the-art detection performance on all the common face detection benchmarks, including the AFW, PASCAL face, FDDB and WIDER FACE datasets, and can run at 36 FPS on an Nvidia Titan X (Pascal) for VGA-resolution images.
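The abstract does not spell out the scale compensation anchor matching strategy, so the following is only a sketch of one two-stage scheme in that spirit: match anchors to a face with an IoU threshold first, then, for faces left with too few anchors, fall back to their top-ranked anchors above a lower threshold. The thresholds t1 and t2 and the count top_n are illustrative assumptions, and conflicts between faces competing for the same anchor are ignored.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_anchors(faces, anchors, t1=0.35, t2=0.1, top_n=3):
    """Return {face index: sorted list of matched anchor indices}."""
    matches = {}
    for fi, face in enumerate(faces):
        scored = [(iou(face, a), ai) for ai, a in enumerate(anchors)]
        # Stage 1: ordinary threshold matching.
        matched = [ai for v, ai in scored if v >= t1]
        # Stage 2: compensate faces that matched fewer than top_n anchors.
        if len(matched) < top_n:
            candidates = sorted((p for p in scored if p[0] > t2), reverse=True)
            matched = [ai for _, ai in candidates[:top_n]]
        matches[fi] = sorted(matched)
    return matches

# A 10x10 face against 16x16 anchors: stage 1 alone keeps only anchor 0,
# while stage 2 also recovers anchor 1.
anchors = [(0, 0, 16, 16), (2, 2, 18, 18), (5, 5, 21, 21), (100, 100, 116, 116)]
matches = match_anchors([(0, 0, 10, 10)], anchors)
```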
01 Nov 2018
TL;DR: It is shown that AGR consistently operationalises gender in a trans-exclusive way, and consequently carries disproportionate risk for trans people subject to it.
Abstract: Automatic Gender Recognition (AGR) is a subfield of facial recognition that aims to algorithmically identify the gender of individuals from photographs or videos. In wider society the technology has proposed applications in physical access control, data analytics and advertising. Within academia, it is already used in the field of Human-Computer Interaction (HCI) to analyse social media usage. Given the long-running critiques of HCI for failing to consider and include transgender (trans) perspectives in research, and the potential implications of AGR for trans people if deployed, I sought to understand how AGR and HCI understand the term "gender", and how HCI describes and deploys gender recognition technology. Using a content analysis of papers from both fields, I show that AGR consistently operationalises gender in a trans-exclusive way, and consequently carries disproportionate risk for trans people subject to it. In addition, I use the dearth of discussion of this in HCI papers that apply AGR to discuss how HCI operationalises gender, and the implications that this has for the field's research. I conclude with recommendations for alternatives to AGR, and some ideas for how HCI can work towards a more effective and trans-inclusive treatment of gender.
08 Sep 2018
TL;DR: This paper proposes a context-assisted single shot face detector, named PyramidBox, to handle the hard face detection problem by improving the utilization of contextual information in three aspects.
Abstract: Face detection has been well studied for many years, and one of the remaining challenges is to detect small, blurred and partially occluded faces in uncontrolled environments. This paper proposes a novel context-assisted single shot face detector, named PyramidBox, to handle the hard face detection problem. Observing the importance of the context, we improve the utilization of contextual information in the following three aspects. First, we design a novel context anchor to supervise high-level contextual feature learning by a semi-supervised method, which we call PyramidAnchors. Second, we propose the Low-level Feature Pyramid Network to combine adequate high-level context semantic features with low-level facial features, which also allows PyramidBox to predict faces of all scales in a single shot. Third, we introduce a context-sensitive structure to increase the capacity of the prediction network and improve the final accuracy of the output. In addition, we use the method of Data-anchor-sampling to augment the training samples across different scales, which increases the diversity of training data for smaller faces. By exploiting the value of context, PyramidBox achieves superior performance among the state-of-the-art on the two common face detection benchmarks, FDDB and WIDER FACE. Our code is available in PaddlePaddle: https://github.com/PaddlePaddle/models/tree/develop/fluid/face_detection.
TL;DR: Zhang et al. propose a novel face detector, named FaceBoxes, which consists of the Rapidly Digested Convolutional Layers (RDCL) and the Multiple Scale Convolutional Layers (MSCL).
Abstract: Although tremendous strides have been made in face detection, one of the remaining open challenges is to achieve real-time speed on the CPU as well as maintain high performance, since effective models for face detection tend to be computationally prohibitive. To address this challenge, we propose a novel face detector, named FaceBoxes, with superior performance on both speed and accuracy. Specifically, our method has a lightweight yet powerful network structure that consists of the Rapidly Digested Convolutional Layers (RDCL) and the Multiple Scale Convolutional Layers (MSCL). The RDCL is designed to enable FaceBoxes to achieve real-time speed on the CPU. The MSCL aims at enriching the receptive fields and discretizing anchors over different layers to handle faces of various scales. Besides, we propose a new anchor densification strategy to make different types of anchors have the same density on the image, which significantly improves the recall rate of small faces. As a consequence, the proposed detector runs at 20 FPS on a single CPU core and 125 FPS using a GPU for VGA-resolution images. Moreover, the speed of FaceBoxes is invariant to the number of faces. We comprehensively evaluate this method and present state-of-the-art detection performance on several face detection benchmark datasets, including the AFW, PASCAL face, and FDDB.
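The goal of anchor densification, giving anchors of different sizes the same density on the image, can be sketched as follows. The abstract gives no formula, so the sketch assumes that density is measured as anchor size divided by the tiling stride and that a densified anchor is replaced by an n x n grid of evenly spaced copies inside each cell. The function names and the sizes used are illustrative.

```python
def tiling_density(size, stride, n=1):
    """Assumed density measure: anchor size over the effective tiling interval."""
    return n * size / stride

def densify(cx, cy, stride, size, n):
    """Replace one anchor centered at (cx, cy) by n * n evenly spaced anchors.

    Returns boxes as (x1, y1, x2, y2); the n * n centers are spread uniformly
    over the stride x stride cell around (cx, cy).
    """
    offsets = [(k + 0.5) * stride / n - stride / 2 for k in range(n)]
    half = size / 2
    return [(cx + dx - half, cy + dy - half, cx + dx + half, cy + dy + half)
            for dy in offsets for dx in offsets]

# With a stride of 32, small anchors are sparser than large ones unless
# they are densified: 32px anchors need n=4 to match 128px anchors.
densities = [tiling_density(32, 32, 4), tiling_density(64, 32, 2), tiling_density(128, 32, 1)]
boxes = densify(16, 16, stride=32, size=32, n=2)
```

Under this assumed measure, all three anchor sizes end up with the same density after densification.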
TL;DR: This work substantially extends the largest available in-the-wild database (Aff-Wild) to study continuous emotions such as valence and arousal and annotates parts of the database with basic expressions and action units, which allows the joint study of all three types of behavior states.
Abstract: Affective computing has been largely limited in terms of available data resources. The need to collect and annotate diverse in-the-wild datasets has become apparent with the rise of deep learning models, as the default approach to address any computer vision task. Some in-the-wild databases have been recently proposed. However: i) their size is small, ii) they are not audiovisual, iii) only a small part is manually annotated, iv) they contain a small number of subjects, or v) they are not annotated for all main behavior tasks (valence-arousal estimation, action unit detection and basic expression classification). To address these, we substantially extend the largest available in-the-wild database (Aff-Wild) to study continuous emotions such as valence and arousal. Furthermore, we annotate parts of the database with basic expressions and action units. As a consequence, for the first time, this allows the joint study of all three types of behavior states. We call this database Aff-Wild2. We conduct extensive experiments with CNN and CNN-RNN architectures that use visual and audio modalities; these networks are trained on Aff-Wild2 and their performance is then evaluated on 10 publicly available emotion databases. We show that the networks achieve state-of-the-art performance for the emotion recognition tasks. Additionally, we adapt the ArcFace loss function in the emotion recognition context and use it for training two new networks on Aff-Wild2 and then re-train them in a variety of diverse expression recognition databases. The networks are shown to improve the existing state-of-the-art. The database, emotion recognition models and source code are available at this http URL.