Author

Haroon Idrees

Bio: Haroon Idrees is an academic researcher from the University of Sargodha. The author has contributed to research in the topics of Frame (networking) and TRECVID. The author has an h-index of 22 and has co-authored 53 publications receiving 2,517 citations. Previous affiliations of Haroon Idrees include International Islamic University, Islamabad and the University of Central Florida.


Papers
Proceedings ArticleDOI
23 Jun 2013
TL;DR: This work estimates counts in an image region by combining multiple sources, including low-confidence head detections, repetition of texture elements, and frequency-domain analysis, along with the confidence of observing individuals, and enforces a global consistency constraint on counts using a Markov Random Field.
Abstract: We propose to leverage multiple sources of information to compute an estimate of the number of individuals present in an extremely dense crowd visible in a single image. Due to problems including perspective, occlusion, clutter, and few pixels per person, counting by human detection in such images is almost impossible. Instead, our approach relies on multiple sources such as low-confidence head detections, repetition of texture elements (using SIFT), and frequency-domain analysis to estimate counts, along with the confidence associated with observing individuals, in an image region. Second, we employ a global consistency constraint on counts using a Markov Random Field, which caters for disparity in counts in local neighborhoods and across scales. We tested our approach on a new dataset of fifty crowd images containing 64K annotated humans, with head counts ranging from 94 to 4,543. This is in stark contrast to datasets used by existing methods, which contain no more than tens of individuals. We experimentally demonstrate the efficacy and reliability of the proposed approach by quantifying the counting performance.
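As a rough illustration of the fusion-plus-consistency idea, the sketch below is my own minimal Python, not the authors' code; the confidence-weighted fusion, the quadratic smoothing energy, and the grid size are all assumptions standing in for the paper's MRF inference.

```python
# Sketch: fuse per-patch count estimates from several sources, then
# smooth them with a simple MRF-style neighborhood energy (assumed form).
import numpy as np

def fuse_counts(estimates, confidences):
    """Confidence-weighted fusion of per-patch counts.
    estimates, confidences: arrays of shape (num_sources, H, W)."""
    w = confidences / (confidences.sum(axis=0, keepdims=True) + 1e-8)
    return (w * estimates).sum(axis=0)

def mrf_smooth(counts, lam=0.5, iters=50):
    """Minimize sum (c - c0)^2 + lam * sum_{neighbors} (c_i - c_j)^2
    by Jacobi iterations; a stand-in for the paper's MRF inference."""
    c0, c = counts.copy(), counts.copy()
    for _ in range(iters):
        nb = np.zeros_like(c); deg = np.zeros_like(c)
        nb[1:] += c[:-1]; deg[1:] += 1
        nb[:-1] += c[1:]; deg[:-1] += 1
        nb[:, 1:] += c[:, :-1]; deg[:, 1:] += 1
        nb[:, :-1] += c[:, 1:]; deg[:, :-1] += 1
        c = (c0 + lam * nb) / (1.0 + lam * deg)
    return c

# Toy usage: three sources on a 4x4 grid of patches.
rng = np.random.default_rng(0)
est = rng.uniform(50, 120, size=(3, 4, 4))
conf = rng.uniform(0.2, 1.0, size=(3, 4, 4))
patch_counts = mrf_smooth(fuse_counts(est, conf))
print("total count:", patch_counts.sum())
```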

897 citations

Book ChapterDOI
08 Sep 2018
TL;DR: A novel approach is proposed that simultaneously solves the problems of counting, density map estimation, and localization of people in a given dense crowd image; it significantly outperforms the state of the art on the new dataset, the most challenging to date, with the largest number of crowd annotations and the most diverse set of scenes.
Abstract: With multiple crowd gatherings of millions of people every year in events ranging from pilgrimages to protests, concerts to marathons, and festivals to funerals, visual crowd analysis is emerging as a new frontier in computer vision. In particular, counting in highly dense crowds is a challenging problem with far-reaching applicability in crowd safety and management, as well as in gauging the political significance of protests and demonstrations. In this paper, we propose a novel approach that simultaneously solves the problems of counting, density map estimation, and localization of people in a given dense crowd image. Our formulation is based on the important observation that the three problems are inherently related to each other, making the loss function for optimizing a deep CNN decomposable. Since localization requires high-quality images and annotations, we introduce the UCF-QNRF dataset, which overcomes the shortcomings of previous datasets and contains 1.25 million humans manually marked with dot annotations. Finally, we present evaluation measures and a comparison with recent deep CNNs, including those developed specifically for crowd counting. Our approach significantly outperforms the state of the art on the new dataset, which is the most challenging dataset with the largest number of crowd annotations in the most diverse set of scenes.
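The abstract's key claim is that relating the three tasks makes the training loss decomposable. Below is a minimal PyTorch sketch of what such a composition loss could look like; the scale pyramid, the count weight, and the function name are my assumptions, not the paper's released formulation.

```python
# Sketch: a decomposable loss tying density estimation, localization
# (via multi-scale density agreement), and counting together.
import torch
import torch.nn.functional as F

def composition_loss(pred_density, gt_density, count_weight=0.01):
    """pred_density, gt_density: (B, 1, H, W) density maps whose sums
    equal the person counts. Combines pixel-wise density losses at
    several scales with a count regression term."""
    loss = 0.0
    for scale in (1, 2, 4):  # assumed pyramid of resolutions
        p = F.avg_pool2d(pred_density, scale) * scale**2  # preserve sums
        g = F.avg_pool2d(gt_density, scale) * scale**2
        loss = loss + F.mse_loss(p, g)
    pred_count = pred_density.sum(dim=(1, 2, 3))
    gt_count = gt_density.sum(dim=(1, 2, 3))
    return loss + count_weight * F.l1_loss(pred_count, gt_count)
```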

579 citations

Journal ArticleDOI
TL;DR: The THUMOS benchmark is described in detail and an overview of data collection and annotation procedures is given, including a comprehensive empirical study evaluating the differences in action recognition between trimmed and untrimmed videos, and how well methods trained on trimmed videos generalize to untrimmed videos.

415 citations

Book ChapterDOI
05 Sep 2010
TL;DR: This paper divides the scene into grid cells, solves the tracking problem optimally within each cell using bipartite graph matching and then links tracks across cells, and uses median background modeling, which requires few frames to obtain a workable model.
Abstract: In this paper, we tackle the problem of object detection and tracking in the new and challenging domain of wide area surveillance. This problem poses several challenges: large camera motion, strong parallax, a large number of moving objects, a small number of pixels on target, single-channel data, and low video framerate. We propose a method that overcomes these challenges and evaluate it on the CLIF dataset. We use median background modeling, which requires few frames to obtain a workable model. We remove false detections due to parallax and registration errors using gradient information of the background image. In order to keep the complexity of the tracking problem manageable, we divide the scene into grid cells, solve the tracking problem optimally within each cell using bipartite graph matching, and then link tracks across cells. Besides tractability, grid cells allow us to define a set of local scene constraints, such as road orientation and object context. We use these constraints as part of the cost function for the tracking problem, which allows us to track fast-moving objects in low-framerate videos. In addition, we manually generated ground truth for four sequences and performed a quantitative evaluation of the proposed algorithm.
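To illustrate the per-cell step, here is a short Python sketch of optimal detection-to-track assignment with the Hungarian algorithm; the function name, the pure-distance cost, and the gating threshold are my assumptions (the paper's cost additionally encodes road orientation and object context).

```python
# Sketch: optimal bipartite matching between tracks and detections
# inside one grid cell.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_in_cell(track_positions, det_positions, max_dist=30.0):
    """Optimal matching between existing tracks and new detections.
    Positions are (N, 2) pixel coordinates. Returns (track_idx, det_idx)
    pairs whose cost passes a gating distance."""
    if len(track_positions) == 0 or len(det_positions) == 0:
        return []
    # Cost here is plain Euclidean distance; scene constraints would be
    # added as extra cost terms.
    cost = np.linalg.norm(
        track_positions[:, None, :] - det_positions[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]

# Toy usage
tracks = np.array([[10.0, 10.0], [50.0, 40.0]])
dets = np.array([[12.0, 11.0], [48.0, 43.0], [200.0, 200.0]])
print(match_in_cell(tracks, dets))  # [(0, 0), (1, 1)]
```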

209 citations

Proceedings ArticleDOI
23 Jun 2014
TL;DR: The key idea is to learn a query-specific generative model on the features and tags of nearest neighbors using the proposed NMF-KNN approach, which imposes a consensus constraint on the coefficient matrices across different features to solve the problem of feature fusion.
Abstract: Real-world image databases such as Flickr are characterized by the continuous addition of new images. Recent approaches for image annotation, i.e., the problem of assigning tags to images, have two major drawbacks. First, either models are learned using the entire training data, or, to handle the issue of dataset imbalance, tag-specific discriminative models are trained. Such models become obsolete and require relearning when new images and tags are added to the database. Second, the task of feature fusion is typically handled with ad hoc approaches. In this paper, we present a weighted extension of Multi-view Non-negative Matrix Factorization (NMF) to address the aforementioned drawbacks. The key idea is to learn a query-specific generative model on the features and tags of nearest neighbors using the proposed NMF-KNN approach, which imposes a consensus constraint on the coefficient matrices across different features. This forces the coefficient vectors across features to be consistent and thus naturally solves the problem of feature fusion, while the weight matrices introduced in the proposed formulation alleviate the issue of dataset imbalance. Furthermore, our approach, being query-specific, is unaffected by the addition of images and tags to a database. We tested our method on two datasets used for the evaluation of image annotation and obtained competitive results.
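To make the consensus idea concrete, here is a minimal NumPy sketch of multi-view NMF with a consensus constraint on the coefficient matrices. The multiplicative updates, the regularizer weight mu, and the use of the mean as the consensus are my assumptions; the actual NMF-KNN formulation additionally includes the weight matrices the abstract mentions.

```python
# Sketch: multi-view NMF where each view's coefficient matrix H_v is
# pulled toward a shared consensus H_star.
import numpy as np

def multiview_nmf(views, rank=10, mu=0.1, iters=200, seed=0):
    """views: list of nonnegative matrices X_v (d_v x n), one per feature.
    Factorizes X_v ~ W_v @ H_v with a consensus penalty mu*||H_v - H_star||^2.
    Returns (Ws, Hs, H_star)."""
    rng = np.random.default_rng(seed)
    n = views[0].shape[1]
    Ws = [rng.random((X.shape[0], rank)) for X in views]
    Hs = [rng.random((rank, n)) for _ in views]
    eps = 1e-9
    for _ in range(iters):
        H_star = np.mean(Hs, axis=0)  # consensus coefficients
        for v, X in enumerate(views):
            W, H = Ws[v], Hs[v]
            # Lee-Seung multiplicative updates with the consensus term.
            W *= (X @ H.T) / (W @ H @ H.T + eps)
            H *= (W.T @ X + mu * H_star) / (W.T @ W @ H + mu * H + eps)
    return Ws, Hs, np.mean(Hs, axis=0)
```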

137 citations


Cited by
Proceedings ArticleDOI
Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, Yi Ma
27 Jun 2016
TL;DR: With the proposed simple MCNN model, the method outperforms all existing methods, and experiments show that the model, once trained on one dataset, can be readily transferred to a new dataset.
Abstract: This paper aims to develop a method that can accurately estimate the crowd count from an individual image with arbitrary crowd density and arbitrary perspective. To this end, we have proposed a simple but effective Multi-column Convolutional Neural Network (MCNN) architecture to map the image to its crowd density map. The proposed MCNN allows the input image to be of arbitrary size or resolution. By utilizing filters with receptive fields of different sizes, the features learned by each column CNN are adaptive to variations in people/head size due to perspective effects or image resolution. Furthermore, the true density map is computed accurately based on geometry-adaptive kernels, which do not require knowing the perspective map of the input image. Since existing crowd counting datasets do not adequately cover all the challenging situations considered in our work, we have collected and labelled a large new dataset that includes 1,198 images with about 330,000 annotated heads. On this challenging new dataset, as well as all existing datasets, we conduct extensive experiments to verify the effectiveness of the proposed model and method. In particular, with the proposed simple MCNN model, our method outperforms all existing methods. In addition, experiments show that our model, once trained on one dataset, can be readily transferred to a new dataset.
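A minimal Python sketch of the geometry-adaptive kernel idea follows: each head annotation is spread by a Gaussian whose width scales with the mean distance to its k nearest annotated neighbors. The parameter values (k=3, beta=0.3), the fallback sigma, and the function name are my assumptions for illustration.

```python
# Sketch: ground-truth density map from dot annotations using
# geometry-adaptive Gaussian kernels.
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import cKDTree

def geometry_adaptive_density(points, shape, k=3, beta=0.3):
    """points: (N, 2) array of (row, col) head annotations.
    shape: (H, W) of the image. Returns a density map summing to ~N."""
    density = np.zeros(shape, dtype=np.float64)
    if len(points) == 0:
        return density
    tree = cKDTree(points)
    # Distances to k nearest neighbors (query k+1: first hit is the point itself).
    dists, _ = tree.query(points, k=min(k + 1, len(points)))
    for (r, c), d in zip(points.astype(int), dists):
        sigma = beta * d[1:].mean() if len(points) > 1 else 15.0  # assumed fallback
        impulse = np.zeros(shape)
        impulse[min(r, shape[0] - 1), min(c, shape[1] - 1)] = 1.0
        density += gaussian_filter(impulse, sigma)
    return density

# Toy usage
pts = np.array([[20, 20], [22, 25], [80, 90]])
dm = geometry_adaptive_density(pts, (100, 120))
print(round(dm.sum(), 2))  # ~3.0
```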

1,603 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: A deep convolutional neural network is proposed for crowd counting; it is trained alternately on two related learning objectives, crowd density and crowd count, to obtain a better local optimum for both objectives.
Abstract: Cross-scene crowd counting is a challenging task in which no laborious data annotation is required for counting people in new target surveillance crowd scenes unseen in the training set. The performance of most existing crowd counting methods drops significantly when they are applied to an unseen scene. To address this problem, we propose a deep convolutional neural network (CNN) for crowd counting that is trained alternately on two related learning objectives, crowd density and crowd count. This switchable learning approach is able to obtain a better local optimum for both objectives. To handle an unseen target crowd scene, we present a data-driven method to fine-tune the trained CNN model for the target scene. A new dataset including 108 crowd scenes with nearly 200,000 head annotations is introduced to better evaluate the accuracy of cross-scene crowd counting methods. Extensive experiments on the proposed dataset and two existing datasets demonstrate the effectiveness and reliability of our approach.
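The switchable-learning idea can be sketched as a training loop that alternates objectives. The tiny model, the epoch-parity schedule, and all names below are my assumptions, not the paper's architecture or released code.

```python
# Sketch: alternating between a density-map objective and a
# count-regression objective during training.
import torch
import torch.nn as nn

class CrowdCNN(nn.Module):
    """Assumed minimal stand-in: a conv backbone producing a density
    map, with the global count read out as the map's sum."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1))
    def forward(self, x):
        density = torch.relu(self.features(x))
        return density, density.sum(dim=(1, 2, 3))

model = CrowdCNN()
opt = torch.optim.SGD(model.parameters(), lr=1e-4)
mse = nn.MSELoss()

def train_step(img, gt_density, gt_count, epoch):
    density, count = model(img)
    # Switch objectives between epochs (assumed schedule).
    if epoch % 2 == 0:
        loss = mse(density, gt_density)  # density objective
    else:
        loss = mse(count, gt_count)      # count objective
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```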

1,143 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: CSRNet, as discussed by the authors, is composed of two major components: a convolutional neural network (CNN) as the front-end for 2D feature extraction and a dilated CNN as the back-end, which uses dilated kernels to deliver larger receptive fields and to replace pooling operations.
Abstract: We propose a network for Congested Scene Recognition, called CSRNet, to provide a data-driven, deep learning method that can understand highly congested scenes and perform accurate count estimation as well as produce high-quality density maps. The proposed CSRNet is composed of two major components: a convolutional neural network (CNN) as the front-end for 2D feature extraction and a dilated CNN as the back-end, which uses dilated kernels to deliver larger receptive fields and to replace pooling operations. CSRNet is an easily trained model because of its pure convolutional structure. We demonstrate CSRNet on four datasets (the ShanghaiTech dataset, the UCF_CC_50 dataset, the WorldEXPO'10 dataset, and the UCSD dataset) and achieve state-of-the-art performance. On the ShanghaiTech Part_B dataset, CSRNet achieves 47.3% lower Mean Absolute Error (MAE) than the previous state-of-the-art method. We also extend the targeted applications to counting other objects, such as vehicles in the TRANCOS dataset. Results show that CSRNet significantly improves output quality, with 15.4% lower MAE than the previous state-of-the-art approach.
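The front-end/back-end split described in the abstract can be sketched in a few lines of PyTorch: a truncated VGG-16 front-end followed by dilated 3x3 convolutions in place of further pooling. The exact layer widths and the class name below are my assumptions; this is a CSRNet-like sketch, not the released weights or code.

```python
# Sketch: CNN front-end for feature extraction + dilated-conv back-end
# that enlarges receptive fields without pooling.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class CSRNetLike(nn.Module):
    def __init__(self):
        super().__init__()
        # Front-end: VGG-16 layers up through conv4_3 (1/8 resolution).
        self.frontend = nn.Sequential(*list(vgg16(weights=None).features)[:23])
        # Back-end: dilated 3x3 convolutions (dilation=2) keep spatial size.
        def dconv(cin, cout):
            return [nn.Conv2d(cin, cout, 3, padding=2, dilation=2),
                    nn.ReLU(inplace=True)]
        self.backend = nn.Sequential(
            *dconv(512, 512), *dconv(512, 512), *dconv(512, 512),
            *dconv(512, 256), *dconv(256, 128), *dconv(128, 64),
            nn.Conv2d(64, 1, 1))  # 1-channel density map
    def forward(self, x):
        return self.backend(self.frontend(x))

# Toy usage: density map at 1/8 input resolution.
y = CSRNetLike()(torch.randn(1, 3, 256, 256))
print(y.shape)  # torch.Size([1, 1, 32, 32])
```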

1,120 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.
Abstract: This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.8% mAP, underscoring the need for developing new approaches for video understanding.

850 citations