Journal ArticleDOI

Robust Head Detection in Complex Videos Using Two-Stage Deep Convolution Framework

19 May 2020-IEEE Access (Institute of Electrical and Electronics Engineers (IEEE))-Vol. 8, pp 98679-98692
TL;DR: This paper presents a two-stage head detection framework that utilizes a fully convolutional network (FCN) to generate scale-aware proposals, followed by a CNN that classifies each proposal into two classes, i.e. head and background.
Abstract: Pedestrian head detection plays an important role in identifying and localizing individuals in real-world visual data. Head detection is a nontrivial problem due to considerable variance in camera view-points, scales, human poses, and appearances in the scene. The translation-invariance property of convolutional neural networks (CNNs) enables large-capacity CNNs to handle the problems of appearance and pose variation in the scene; the problem of scale invariance, however, is still an open issue. To address this problem, this paper presents a two-stage head detection framework that utilizes a fully convolutional network (FCN) to generate scale-aware proposals, followed by a CNN that classifies each proposal into two classes, i.e. head and background. Experimental results show that using the scale-aware proposals obtained by the FCN, the object recall rate and mean average precision (mAP) are improved. Additionally, we demonstrate that our framework achieves state-of-the-art results on four challenging benchmark datasets, i.e. HollywoodHeads, Casablanca, SHOCK, and WIDERFACE.
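
A minimal sketch of the two-stage idea, assuming a PyTorch-style implementation (the module names, layer sizes, and the single-scale thresholding and cropping are illustrative simplifications, not the authors' code; the paper's proposals are additionally scale-aware):

```python
# Stage 1: an FCN scores every spatial location; Stage 2: a CNN classifies
# each high-scoring crop as head vs. background. Illustrative sketch only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProposalFCN(nn.Module):
    """Stage 1: fully convolutional net that scores every spatial location."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.score = nn.Conv2d(64, 1, 1)  # per-location "headness" score

    def forward(self, x):
        return torch.sigmoid(self.score(self.features(x)))

class HeadClassifierCNN(nn.Module):
    """Stage 2: binary head/background classifier over fixed-size crops."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 2),  # two classes: head, background
        )

    def forward(self, crops):
        return self.net(crops)

def detect(image, fcn, classifier, thresh=0.5, crop=64):
    """Score the full image, crop high-scoring regions, classify each crop."""
    heat = fcn(image)                            # (1, 1, H', W')
    ys, xs = torch.nonzero(heat[0, 0] > thresh, as_tuple=True)
    stride = image.shape[-1] // heat.shape[-1]   # heatmap-to-image scale
    crops = []
    for y, x in zip(ys.tolist(), xs.tolist()):
        cy, cx = y * stride, x * stride          # proposal centre in the image
        patch = image[..., max(cy - crop // 2, 0):cy + crop // 2,
                           max(cx - crop // 2, 0):cx + crop // 2]
        crops.append(F.interpolate(patch, size=(crop, crop)))
    if not crops:
        return torch.empty(0, 2)
    return classifier(torch.cat(crops))          # head/background logits
```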


Citations
Journal ArticleDOI
30 Aug 2021-Sensors
TL;DR: In this paper, the authors developed a new approach based on two parallel DeepLabv3+ networks to improve the performance of the person detection system, which can be used not only for detection of the human head but also for several other semantic segmentation applications.
Abstract: In the field of computer vision, object detection consists of automatically finding objects in images by giving their positions. The most common fields of application are safety systems (pedestrian detection, identification of behavior) and control systems. Another important application is head/person detection, which is the primary material for road safety, rescue, surveillance, etc. In this study, we developed a new approach based on two parallel DeepLabv3+ networks to improve the performance of the person detection system. For the implementation of our semantic segmentation model, we established a working methodology with two types of ground truths extracted from the bounding boxes given by the original annotations. The approach was evaluated on our two private datasets as well as on a public dataset. To show the performance of the proposed system, a comparative analysis was carried out against two state-of-the-art deep learning semantic segmentation models: SegNet and U-Net. By achieving 99.14% global accuracy, the results demonstrate that the developed strategy is an efficient way to build a deep neural network model for semantic segmentation. This strategy can be used not only for detection of the human head but also for several other semantic segmentation applications.

5 citations
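
Since the study trains a segmentation model from box annotations, the sketch below shows one plausible way such masks could be derived from bounding boxes; the two variants (filled box, inscribed ellipse) are assumptions for illustration, as the summary does not specify the actual extraction rules:

```python
# Derive binary segmentation ground truth from (x1, y1, x2, y2) boxes.
import numpy as np

def box_mask(shape, boxes):
    """Binary mask with each (x1, y1, x2, y2) box filled in."""
    mask = np.zeros(shape, dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1
    return mask

def ellipse_mask(shape, boxes):
    """Tighter mask: an ellipse inscribed in each box (heads are roundish)."""
    mask = np.zeros(shape, dtype=np.uint8)
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        rx, ry = max((x2 - x1) / 2, 1), max((y2 - y1) / 2, 1)
        inside = ((xx - cx) / rx) ** 2 + ((yy - cy) / ry) ** 2 <= 1
        mask |= inside.astype(np.uint8)
    return mask
```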

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a novel fusion framework for occupancy detection and estimation based on two different perspectives: a head detection method combined with indoor scene knowledge to filter false positives and recover missed detections, and a two-vision entrance counting method to refine the predicted results.

3 citations

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed Motion-aware Pseudo-Siamese Network (MPSN), an end-to-end approach that leverages head motion information to guide deep learning models to extract effective head features in indoor scenarios.

2 citations

Posted Content
TL;DR: Wang et al. as discussed by the authors proposed Motion-aware Pseudo Siamese Network (MPSN), which leverages head motion information to guide the deep model to extract effective head features in indoor scenarios.
Abstract: Head detection in the indoor video is an essential component of many real-world applications. While deep models have achieved remarkable progress in general object detection, they are not satisfying enough in complex indoor scenes. The indoor surveillance video often includes cluttered background objects, among which heads have small scales and diverse poses. In this paper, we propose Motion-aware Pseudo Siamese Network (MPSN), an end-to-end approach that leverages head motion information to guide the deep model to extract effective head features in indoor scenarios. By taking the pixel-wise difference of adjacent frames as the auxiliary input, MPSN effectively enhances human head motion information and removes the irrelevant objects in the background. Compared with prior methods, it achieves superior performance on the two indoor video datasets. Our experiments show that MPSN successfully suppresses static background objects and highlights the moving instances, especially human heads in indoor videos. We also compare different methods to capture head motion, which demonstrates the simplicity and flexibility of MPSN. Finally, to validate the robustness of MPSN, we conduct adversarial experiments with a mathematical solution of small perturbations for robust model selection. Code is available at https://github.com/pl-share/MPSN.
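
The motion cue at the heart of MPSN can be sketched in a few lines (a toy illustration, not the MPSN code; see the linked repository for the actual implementation):

```python
# The pixel-wise difference of adjacent frames serves as an auxiliary input
# alongside the RGB frame, enhancing motion and suppressing static clutter.
import torch

def motion_input(frame_t, frame_prev):
    """Absolute per-pixel difference: static background tends toward zero,
    while moving regions (e.g. heads) retain large values."""
    return (frame_t - frame_prev).abs()

# Usage: feed frame_t to one branch of the pseudo-Siamese backbone and the
# difference image to the other, then fuse the two feature maps.
frame_prev = torch.rand(1, 3, 224, 224)
frame_t = torch.rand(1, 3, 224, 224)
diff = motion_input(frame_t, frame_prev)   # same shape as the RGB input
```
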
Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed two modules for enhancing useful feature extraction in the set abstraction (SA) layer to improve 3D object detection accuracy, focusing on the foreground and boundary scores of the points and reweighting the Furthest Point Sampling (FPS) using the evaluated scores.
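
A hedged sketch of what score-reweighted furthest point sampling could look like (the weighting scheme here, multiplying the FPS distance by a per-point score, is an assumption for illustration):

```python
# Standard FPS picks the point furthest from the already-chosen set;
# reweighting scales that distance by a per-point score so foreground and
# boundary points are favoured during sampling.
import numpy as np

def weighted_fps(points, scores, k):
    """points: (N, 3) array; scores: (N,) in [0, 1]; returns k indices."""
    n = len(points)
    chosen = [0]
    dist = np.full(n, np.inf)
    for _ in range(k - 1):
        d = np.linalg.norm(points - points[chosen[-1]], axis=1)
        dist = np.minimum(dist, d)          # distance to nearest chosen point
        chosen.append(int(np.argmax(dist * scores)))
    return chosen
```
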
References
Proceedings Article
03 Dec 2012
TL;DR: As discussed by the authors, a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax achieved state-of-the-art performance on ImageNet classification.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations
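
The architecture summarized above can be sketched as follows (channel sizes follow the published AlexNet, but this is an illustrative reconstruction, not the original code; it expects 227x227 inputs so the flatten size works out):

```python
# Five conv layers (some followed by max-pooling), three fully connected
# layers with dropout, and a final 1000-way output (softmax at the loss).
import torch.nn as nn

alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),   # 1000-way class scores
)
```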

Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

55,235 citations


"Robust Head Detection in Complex Vi..." refers methods in this paper

  • ...We use different state-of-the-art network architectures, like AlexNet [19], VGGS [4], VGG-verydeep-16 [38] and ZF [54] for classification....


  • ...For the classification, we use different architectures, AlexNet [19], VGGS [4], VGG-verydeep-16 [38], and ZF [54]....

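The design principle, stacking very small 3x3 filters, can be illustrated with a short sketch (a VGG-16-style layout; an illustrative reconstruction, not the released models):

```python
# Two stacked 3x3 convolutions cover a 5x5 receptive field with fewer
# parameters and an extra non-linearity than one 5x5 convolution.
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    """n_convs 3x3 conv+ReLU layers followed by 2x2 max-pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU()]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

# VGG-16-style feature extractor: 2+2+3+3+3 = 13 conv layers; three fully
# connected layers on top bring the total to 16 weight layers.
features = nn.Sequential(
    vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3),
    vgg_block(256, 512, 3), vgg_block(512, 512, 3),
)
```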


Proceedings ArticleDOI
07 Jun 2015
TL;DR: The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
Abstract: Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [20], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [3] to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.

28,225 citations
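
The key conversion the abstract describes, replacing fixed-size fully connected layers with 1x1 convolutions, can be sketched as follows (toy layer sizes, assuming PyTorch; 21 = PASCAL VOC's 20 classes + background):

```python
# With 1x1 convolutions in place of fully connected layers, the network
# accepts inputs of arbitrary size and emits a spatial map of class scores.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(               # stand-in for a classification net
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
# A per-location classifier: what was Linear(64, 21) in the classification
# net is applied at every spatial position as a 1x1 convolution.
head = nn.Conv2d(64, 21, kernel_size=1)

x = torch.rand(1, 3, 320, 240)          # arbitrary input size is fine now
scores = head(backbone(x))              # (1, 21, 160, 120) coarse score map
# Upsample back to input resolution for dense, pixel-wise prediction:
dense = F.interpolate(scores, size=x.shape[-2:], mode='bilinear',
                      align_corners=False)
```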

Proceedings ArticleDOI
27 Jun 2016
TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Abstract: We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

27,256 citations


"Robust Head Detection in Complex Vi..." refers methods in this paper

  • ...YOLO beats Faster-RCNN in terms of inference speed on most existing object detection datasets, however, at the cost of accuracy....


  • ...You only look once (YOLO) [30] generates bounding boxes using regression and classifies each bounding box by assigning it class scores....


  • ...[31] J. Redmon and A. Farhadi, ‘‘YOLO9000: Better, faster, stronger,’’ in Proc....

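A rough illustration of the single-tensor output the abstract describes (S=7, B=2, C=20 follow the original PASCAL VOC configuration; the decoding below is a simplified sketch):

```python
# The network predicts one S x S x (B*5 + C) tensor in a single pass: each
# grid cell holds B boxes (x, y, w, h, confidence) plus C class probabilities.
import torch

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes
pred = torch.rand(S, S, B * 5 + C)       # would come from the network

cell = pred[3, 4]                        # all predictions of one grid cell
boxes = cell[:B * 5].view(B, 5)          # B rows of (x, y, w, h, confidence)
class_probs = cell[B * 5:]               # C conditional class probabilities
# Class-specific confidence per box = box confidence * class probability:
scores = boxes[:, 4:5] * class_probs     # (B, C) detection scores
```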