Proceedings ArticleDOI

DroneSURF: Benchmark Dataset for Drone-based Face Recognition

TL;DR: This research presents a novel large-scale drone dataset, DroneSURF: Drone Surveillance of Faces, to facilitate research on face recognition, along with a detailed description of the data distribution, protocols for evaluation, and baseline results.
Abstract: Unmanned Aerial Vehicles (UAVs) or drones are often used to reach remote areas or regions which are inaccessible to humans. Equipped with a large field of view, compact size, and remote control abilities, drones are deemed suitable for monitoring crowded or disaster-hit areas, and performing aerial surveillance. While research has focused on area monitoring, object detection and tracking, limited attention has been given to person identification, especially face recognition, using drones. This research presents a novel large-scale drone dataset, DroneSURF: Drone Surveillance of Faces, in order to facilitate research for face recognition. The dataset contains 200 videos of 58 subjects, captured across 411K frames, having over 786K face annotations. The proposed dataset demonstrates variations across two surveillance use cases: (i) active and (ii) passive, two locations, and two acquisition times. DroneSURF encapsulates challenges due to the effect of motion, variations in pose, illumination, background, altitude, and resolution, especially due to the large and varying distance between the drone and the subjects. This research presents a detailed description of the proposed DroneSURF dataset, along with information regarding the data distribution, protocols for evaluation, and baseline results.
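As a quick sense of scale, here is a back-of-the-envelope computation from the rounded figures quoted in the abstract (approximate by construction, not an official statistic of the dataset):

    # Rounded counts taken from the abstract; results are approximate.
    videos = 200
    subjects = 58
    frames = 411_000            # "411K frames"
    face_annotations = 786_000  # "over 786K face annotations"

    print(f"~{frames / videos:.0f} frames per video on average")           # ~2055
    print(f"~{face_annotations / frames:.1f} annotated faces per frame")   # ~1.9
    print(f"~{face_annotations / subjects:,.0f} annotations per subject")  # ~13,552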
Citations
Posted Content
TL;DR: The VisDrone dataset, captured over various urban/suburban areas of 14 different cities across China from north to south, is described; as the largest such dataset ever published, it enables extensive evaluation and investigation of visual analysis algorithms on the drone platform.
Abstract: Drones, or general UAVs, equipped with cameras have been rapidly deployed in a wide range of applications, including agriculture, aerial photography, and surveillance. Consequently, automatic understanding of visual data collected from drones is in high demand, bringing computer vision and drones more and more closely together. To promote and track the developments of object detection and tracking algorithms, we have organized two challenge workshops in conjunction with ECCV 2018 and ICCV 2019, attracting more than 100 teams around the world. We provide a large-scale drone-captured dataset, VisDrone, which includes four tracks, i.e., (1) image object detection, (2) video object detection, (3) single object tracking, and (4) multi-object tracking. In this paper, we first present a thorough review of object detection and tracking datasets and benchmarks, and discuss the challenges of collecting large-scale drone-based object detection and tracking datasets with fully manual annotations. After that, we describe our VisDrone dataset, which is captured over various urban/suburban areas of 14 different cities across China from north to south. Being the largest such dataset ever published, VisDrone enables extensive evaluation and investigation of visual analysis algorithms on the drone platform. We provide a detailed analysis of the current state of the field of large-scale object detection and tracking on drones, conclude the challenge, and propose future directions. We expect the benchmark to largely boost research and development in video analysis on drone platforms. All the datasets and experimental results can be downloaded from the website: this https URL.

129 citations

Journal ArticleDOI
TL;DR: A novel approach based on the so-called Context-aware Multi-task Siamese Network (CMSN) model is proposed; it explores new cues in UAV videos by judging the consistency degree between objects and contexts, and can be used for both SOT and MOT.
Abstract: With the increasing popularity of Unmanned Aerial Vehicles (UAVs) in computer vision-related applications, intelligent UAV video analysis has recently attracted the attention of an increasing number of researchers. To facilitate research in the UAV field, this paper presents a UAV dataset with 100 videos featuring approximately 2700 vehicles recorded under unconstrained conditions and 840k manually annotated bounding boxes. These UAV videos were recorded in complex real-world scenarios and pose significant new challenges, such as complex scenes, high density, small objects, and large camera motion, to the existing object detection and tracking methods. These challenges have encouraged us to define a benchmark for three fundamental computer vision tasks, namely, object detection, single object tracking (SOT) and multiple object tracking (MOT), on our UAV dataset. Specifically, our UAV benchmark facilitates evaluation and detailed analysis of state-of-the-art detection and tracking methods on the proposed UAV dataset. Furthermore, we propose a novel approach based on the so-called Context-aware Multi-task Siamese Network (CMSN) model that explores new cues in UAV videos by judging the consistency degree between objects and contexts and that can be used for SOT and MOT. The experimental results demonstrate that our model could make tracking results more robust in both SOT and MOT, showing that the current tracking and detection methods have limitations in dealing with the proposed UAV benchmark and that further research is indeed needed.
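The CMSN model itself is not specified in this abstract, so the following is only a generic, illustrative sketch of the idea it names: a Siamese (shared-weight) network that embeds an object crop and its surrounding context patch and scores their "consistency degree" via cosine similarity. The architecture, layer sizes, and names below are assumptions, not the authors' design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConsistencySiamese(nn.Module):
        """Toy shared-weight embedder; NOT the CMSN model from the paper."""
        def __init__(self, embed_dim=128):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, embed_dim),
            )

        def forward(self, obj_crop, ctx_crop):
            # Same backbone (shared weights) applied to object and context.
            f_obj = F.normalize(self.backbone(obj_crop), dim=1)
            f_ctx = F.normalize(self.backbone(ctx_crop), dim=1)
            return (f_obj * f_ctx).sum(dim=1)  # cosine "consistency" in [-1, 1]

    # Usage: scores = ConsistencySiamese()(obj_batch, ctx_batch), where both
    # batches are float tensors of shape (N, 3, H, W).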

90 citations


Cites background from "DroneSURF: Benchmark Dataset for Dr..."

  • ...However, only a very few UAV datasets have been constructed so far and most datasets are limited to a specific task, such as visual tracking [e.g., UAV123 (Mueller et al. 2016), UAV123L (Mueller et al. 2016) and Campus (Robicquet et al. 2016)] or detection [e.g., CARPK (Hsieh et al. 2017), DOTA (Xia et al. 2018) and DroneSURF (Kalra et al. 2019)]....

    [...]


Journal ArticleDOI
TL;DR: VisDrone, as discussed by the authors, is a large-scale drone-captured dataset which includes four tracks, i.e., (1) image object detection, (2) video object detection, (3) single object tracking, and (4) multi-object tracking.
Abstract: Drones, or general UAVs, equipped with cameras have been rapidly deployed in a wide range of applications, including agriculture, aerial photography, and surveillance. Consequently, automatic understanding of visual data collected from drones is in high demand, bringing computer vision and drones more and more closely together. To promote and track the developments of object detection and tracking algorithms, we have organized three challenge workshops in conjunction with ECCV 2018, ICCV 2019 and ECCV 2020, attracting more than 100 teams around the world. We provide a large-scale drone-captured dataset, VisDrone, which includes four tracks, i.e., (1) image object detection, (2) video object detection, (3) single object tracking, and (4) multi-object tracking. In this paper, we first present a thorough review of object detection and tracking datasets and benchmarks, and discuss the challenges of collecting large-scale drone-based object detection and tracking datasets with fully manual annotations. After that, we describe our VisDrone dataset, which is captured over various urban/suburban areas of 14 different cities across China from north to south. Being the largest such dataset ever published, VisDrone enables extensive evaluation and investigation of visual analysis algorithms for the drone platform. We provide a detailed analysis of the current state of the field of large-scale object detection and tracking on drones, conclude the challenge, and propose future directions. We expect the benchmark to largely boost research and development in video analysis on drone platforms. All the datasets and experimental results can be downloaded from https://github.com/VisDrone/VisDrone-Dataset .
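For readers who want to work with drone annotations of this kind, the sketch below parses a VisDrone-style multi-object-tracking ground-truth file. The comma-separated field order shown is my recollection of the format documented in the linked repository and should be verified there before use.

    import csv
    from collections import defaultdict

    # Assumed field order of one annotation line (verify against the repository):
    # frame, target_id, left, top, width, height, score, category, truncation, occlusion
    FIELDS = ["frame", "target_id", "left", "top", "width", "height",
              "score", "category", "truncation", "occlusion"]

    def load_tracks(path):
        """Group boxes by target_id: {id: [(frame, left, top, width, height), ...]}."""
        tracks = defaultdict(list)
        with open(path, newline="") as f:
            for row in csv.reader(f):
                rec = dict(zip(FIELDS, (float(v) for v in row[:len(FIELDS)])))
                tracks[int(rec["target_id"])].append(
                    (int(rec["frame"]), rec["left"], rec["top"],
                     rec["width"], rec["height"]))
        return tracks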

60 citations

Journal ArticleDOI
TL;DR: This survey presents recent advancements in 2D object detection for the case of UAVs, focusing on the differences, strategies, and trade-offs between the generic problem of object detection, and the adaptation of such solutions for operations of the UAV.
Abstract: The spread of Unmanned Aerial Vehicles (UAVs) in the last decade has revolutionized many application fields. Most investigated research topics focus on increasing autonomy during operational campaigns, environmental monitoring, surveillance, mapping, and labeling. To achieve such complex goals, a high-level module is exploited to build semantic knowledge, leveraging the outputs of a low-level module that takes data acquired from multiple sensors and extracts information concerning what is sensed. All in all, the detection of objects is undoubtedly the most important low-level task, and the most employed sensors to accomplish it are by far RGB cameras, due to their cost, dimensions, and the wide literature on RGB-based object detection. This survey presents recent advancements in 2D object detection for the case of UAVs, focusing on the differences, strategies, and trade-offs between the generic problem of object detection and the adaptation of such solutions for operations of the UAV. Moreover, a new taxonomy is proposed that considers different height intervals and is driven by the methodological approaches introduced by works in the state of the art, rather than by hardware, physical, and/or technological constraints.

51 citations

Journal ArticleDOI
TL;DR: A complete review of databases in the first two groups, together with works that used those databases to apply their methods, is presented, and vision-based intelligent applications and their databases are explored.
Abstract: Analyzing videos and images captured by unmanned aerial vehicles or aerial drones is an emerging application attracting significant attention from researchers in various areas of computer vision. Currently, the major challenge is the development of autonomous operations that complete missions and replace human operators. In this paper, based on how videos and images captured by drones are analyzed in computer vision, we review these applications by categorizing them into three groups. The first group is related to remote sensing, with challenges such as camera calibration, image matching, and aerial triangulation. The second group is related to autonomous drone navigation, in which computer vision methods are designed to address challenges such as flight control, visual localization and mapping, and target tracking and obstacle detection. The third group is dedicated to using images and videos captured by drones in various applications, such as surveillance, agriculture and forestry, animal detection, disaster detection, and face recognition. Since most of the computer vision methods related to the three categories have been designed for real-world conditions, and providing such real conditions with drones is often impossible, we aim to explore papers that provide a database for these purposes. Some current survey papers exist for the first two groups; however, those surveys have not aimed at exploring any databases. This paper presents a complete review of databases in the first two groups and of works that used these databases to apply their methods. Vision-based intelligent applications and their databases are explored in the third group, and we discuss open problems and avenues for future research.

45 citations

References
Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
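For readers who want to try the descriptor, here is a minimal sketch using scikit-image's HOG implementation; the parameter values are common defaults rather than the exact settings tuned in the paper.

    from skimage import color, data
    from skimage.feature import hog

    # Any grayscale crop works; pedestrian detection typically uses 64x128 windows.
    image = color.rgb2gray(data.astronaut())

    features, hog_image = hog(
        image,
        orientations=9,           # fine orientation binning
        pixels_per_cell=(8, 8),   # relatively coarse spatial binning
        cells_per_block=(2, 2),   # overlapping blocks with local contrast normalization
        block_norm="L2-Hys",
        visualize=True,
    )
    print(features.shape)  # one long descriptor vector, typically fed to a linear SVM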

31,952 citations

Journal ArticleDOI
TL;DR: A generalized gray-scale and rotation-invariant operator presentation is derived that allows detecting the "uniform" patterns for any quantization of the angular space and for any spatial resolution, and a method for combining multiple operators for multiresolution analysis is presented.
Abstract: Presents a theoretically very simple, yet efficient, multiresolution approach to gray-scale and rotation invariant texture classification based on local binary patterns and nonparametric discrimination of sample and prototype distributions. The method is based on recognizing that certain local binary patterns, termed "uniform," are fundamental properties of local image texture and their occurrence histogram is proven to be a very powerful texture feature. We derive a generalized gray-scale and rotation invariant operator presentation that allows for detecting the "uniform" patterns for any quantization of the angular space and for any spatial resolution and present a method for combining multiple operators for multiresolution analysis. The proposed approach is very robust in terms of gray-scale variations since the operator is, by definition, invariant against any monotonic transformation of the gray scale. Another advantage is computational simplicity as the operator can be realized with a few operations in a small neighborhood and a lookup table. Experimental results demonstrate that good discrimination can be achieved with the occurrence statistics of simple rotation invariant local binary patterns.
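A minimal sketch of the rotation-invariant "uniform" LBP histogram described above, using scikit-image; the neighbourhood size and radius here are illustrative choices.

    import numpy as np
    from skimage import data
    from skimage.feature import local_binary_pattern

    image = data.camera()   # 8-bit grayscale test image
    P, R = 8, 1             # 8 neighbours on a circle of radius 1

    # method="uniform" yields the rotation-invariant uniform codes (values 0..P+1):
    # P + 1 uniform patterns plus one catch-all bin for non-uniform patterns.
    lbp = local_binary_pattern(image, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=np.arange(P + 3), density=True)
    print(hist)             # (P + 2)-bin occurrence histogram used as the texture feature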

14,245 citations


"DroneSURF: Benchmark Dataset for Dr..." refers methods in this paper

  • ...Some key observations are as follows: (i) Face Recognition Performance: From the frame-wise identification results reported in Table IV, it can be observed that for all frames, best rank-1 identification performance of 14.36% is obtained with VGG-Face feature descriptor for active surveillance, while an accuracy of 5.08% is achieved with LBP features for passive surveillance....

    [...]

  • ...Baselines for face recognition have been computed with two hand-crafted features: (i) Histogram of Oriented Gradients (HOG) [8], (ii) Local Binary Pattern (LBP) [23], two deep learning based feature extractor: (iii) VGG-Face [25], (iv) VGG-Face2 [7], and (v) a Commercial-Off-The-Shelf system (COTS)....

    [...]
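The rank-1 and rank-5 numbers quoted in the excerpts above are closed-set identification accuracies. The sketch below shows one common way such numbers are computed, matching probe descriptors to a gallery by cosine similarity; it is a generic illustration, not the exact matching protocol defined in the paper.

    import numpy as np

    def rank_k_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids, k=1):
        """Fraction of probes whose true identity appears among the k most
        similar gallery entries (cosine similarity). Inputs are NumPy arrays."""
        p = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
        g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
        sims = p @ g.T                            # (num_probes, num_gallery)
        top_k = np.argsort(-sims, axis=1)[:, :k]  # indices of the k best matches
        hits = [probe_ids[i] in gallery_ids[top_k[i]] for i in range(len(probe_ids))]
        return float(np.mean(hits))

    # rank1 = rank_k_accuracy(P, P_ids, G, G_ids, k=1)
    # rank5 = rank_k_accuracy(P, P_ids, G, G_ids, k=5)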

Journal ArticleDOI
TL;DR: In this paper, a face detection framework is described that is capable of processing images extremely rapidly while achieving high detection rates; implemented on a conventional desktop, the detector runs at 15 frames per second.
Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the “Integral Image” which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.
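A minimal NumPy sketch of the "Integral Image" idea: after one cumulative-sum pass, the sum of any rectangular region (the building block of the Haar-like features) costs only four array lookups.

    import numpy as np

    def integral_image(img):
        """ii[y, x] = sum of img[:y, :x], with a padding row/column of zeros."""
        ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
        ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
        return ii

    def box_sum(ii, top, left, height, width):
        """Sum of img[top:top+height, left:left+width] in O(1)."""
        bottom, right = top + height, left + width
        return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

    img = np.arange(16).reshape(4, 4)
    ii = integral_image(img)
    assert box_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()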

13,037 citations

Proceedings ArticleDOI
07 Jul 2001
TL;DR: A new image representation called the “Integral Image” is introduced which allows the features used by the detector to be computed very quickly and a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions.
Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the "Integral Image" which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.
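A minimal usage sketch of a Viola-Jones-style cascade as shipped with OpenCV; the bundled pretrained Haar cascade is in the spirit of this work rather than the authors' exact classifier, and "image.jpg" is a placeholder path.

    import cv2

    # Pretrained frontal-face Haar cascade distributed with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    img = cv2.imread("image.jpg")  # placeholder input path
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Multi-scale sliding-window detection; early cascade stages reject most
    # background windows cheaply, as the abstract describes.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)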

10,592 citations


"DroneSURF: Benchmark Dataset for Dr..." refers methods in this paper

  • ...(ii) Analysis of Face Detection: Tiny Face and Viola Jones detected a total of 131K and 64K faces for active surveillance, while the ground truth annotated faces are a little over 125K....

    [...]

  • ...On the other hand, the total number of detected faces by Viola Jones detector is less than half of the total annotated faces....

    [...]

  • ...For passive surveillance, Tiny Face and Viola Jones detected a total of 136K and 35K faces, respectively, for the ground truth annotated faces of over 155K....

    [...]

  • ...Table III (precision and recall, in %, for the two face detectors under both surveillance scenarios):

        Algorithm          Active Surveillance        Passive Surveillance
                           Precision    Recall        Precision    Recall
        Viola Jones [29]   22.60        27.50         2.15         1.15
        Tiny Face [14]     96.52        94.59         95.36        78.80

    [...]

  • ...For the two face detectors, Viola Jones and Tiny Face, Table III presents the precision and recall values for both scenarios of active and passive surveillance....

    [...]
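The precision and recall figures quoted above come from matching detected boxes against ground-truth annotations. The sketch below shows one standard way to compute them, via greedy IoU matching at a fixed threshold; the exact matching criterion used in the paper may differ.

    def iou(a, b):
        """Intersection over union of two boxes given as (x, y, w, h)."""
        ax2, ay2 = a[0] + a[2], a[1] + a[3]
        bx2, by2 = b[0] + b[2], b[1] + b[3]
        iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
        ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    def precision_recall(detections, ground_truth, thresh=0.5):
        """Greedy one-to-one matching of detections to ground-truth boxes."""
        matched, tp = set(), 0
        for d in detections:
            best, best_iou = None, thresh
            for j, g in enumerate(ground_truth):
                score = iou(d, g)
                if j not in matched and score >= best_iou:
                    best, best_iou = j, score
            if best is not None:
                matched.add(best)
                tp += 1
        precision = tp / len(detections) if detections else 0.0
        recall = tp / len(ground_truth) if ground_truth else 0.0
        return precision, recall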

Proceedings ArticleDOI
01 Jan 2015
TL;DR: It is shown how a very large scale dataset can be assembled by a combination of automation and human in the loop, and the trade off between data purity and time is discussed.
Abstract: The goal of this paper is face recognition – from either a single photograph or from a set of faces tracked in a video. Recent progress in this area has been due to two factors: (i) end to end learning for the task using a convolutional neural network (CNN), and (ii) the availability of very large scale training datasets. We make two contributions: first, we show how a very large scale dataset (2.6M images, over 2.6K people) can be assembled by a combination of automation and human in the loop, and discuss the trade off between data purity and time; second, we traverse through the complexities of deep network training and face recognition to present methods and procedures to achieve comparable state of the art results on the standard LFW and YTF face benchmarks.
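As a hedged illustration of "recognition from a set of faces tracked in a video": per-frame CNN descriptors (for example, the penultimate-layer features of a network like VGG-Face) are often pooled into a single template and compared by cosine similarity. The pooling and threshold below are generic choices, not the exact procedure of the paper.

    import numpy as np

    def track_template(frame_embeddings):
        """Average-pool L2-normalised per-frame descriptors into one template."""
        e = np.asarray(frame_embeddings, dtype=np.float64)
        e /= np.linalg.norm(e, axis=1, keepdims=True)
        t = e.mean(axis=0)
        return t / np.linalg.norm(t)

    def same_identity(template_a, template_b, threshold=0.5):
        """Verification decision from cosine similarity; the threshold is
        illustrative and would be tuned on a validation split."""
        return float(np.dot(template_a, template_b)) >= threshold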

5,308 citations


"DroneSURF: Benchmark Dataset for Dr..." refers methods in this paper

  • ...Some key observations are as follows: (i) Face Recognition Performance: From the frame-wise identification results reported in Table IV, it can be observed that for all frames, best rank-1 identification performance of 14.36% is obtained with VGG-Face feature descriptor for active surveillance, while an accuracy of 5.08% is achieved with LBP features for passive surveillance....

    [...]

  • ...With VGG-Face, quality-based frame selection results in increased performance for both the scenarios (14.36% to 16.78%, and 4.66% to 4.95%), as compared to alternate frame selection....

    [...]

  • ...At rank-5, VGG-Face descriptor achieves the best performance of around 39% for active surveillance, while the highest performance for passive surveillance is only around 24%....

    [...]

  • ...Baselines for face recognition have been computed with two hand-crafted features: (i) Histogram of Oriented Gradients (HOG) [8], (ii) Local Binary Pattern (LBP) [23], two deep learning based feature extractor: (iii) VGG-Face [25], (iv) VGG-Face2 [7], and (v) a Commercial-Off-The-Shelf system (COTS)....

    [...]
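The excerpts above compare quality-based frame selection with simply taking alternate frames. The paper's actual quality measure is not given in these snippets, so the sketch below uses a common stand-in (variance of the Laplacian as a sharpness score) purely to illustrate the idea.

    import cv2
    import numpy as np

    def sharpness(face_crop_bgr):
        """Variance of the Laplacian as a simple quality proxy
        (the paper's actual quality measure may differ)."""
        gray = cv2.cvtColor(face_crop_bgr, cv2.COLOR_BGR2GRAY)
        return cv2.Laplacian(gray, cv2.CV_64F).var()

    def select_frames(face_crops, n=10):
        """Keep the n highest-quality crops from a tracked face sequence,
        instead of sampling every alternate frame."""
        scores = [sharpness(c) for c in face_crops]
        keep = np.argsort(scores)[::-1][:n]
        return [face_crops[i] for i in sorted(keep)]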