Book ChapterDOI

People Counting in Videos by Fusing Temporal Cues from Spatial Context-Aware Convolutional Neural Networks

TL;DR: An efficient method is presented for people counting in video sequences from fixed cameras by utilising the responses of spatially context-aware convolutional neural networks (CNN) in the temporal domain.
Abstract: We present an efficient method for people counting in video sequences from fixed cameras by utilising the responses of spatially context-aware convolutional neural networks (CNN) in the temporal domain. For stationary cameras, the background information remains fairly static, while foreground characteristics, such as size and orientation, may depend on their image location; thus the use of whole frames for training a CNN improves the differentiation between background and foreground pixels. Foreground density representing the presence of people in the environment can then be associated with people counts. Moreover, the fusion of the count estimations in the temporal domain can further enhance the accuracy of the final count. Our methodology was tested using the publicly available Mall dataset and achieved a mean deviation error of 0.091.

Summary (2 min read)

1 Introduction

  • Counting people can provide useful information for monitoring purposes in public areas, assist urban planners in designing more efficient environments, provide cues for situations that might endanger the safety of civilians, and also be used by shopping mall and retail store managers for evaluating their business practices.
  • People counting is a very challenging problem, and although commercial solutions exist, these focus mainly on top-view cameras, where occlusions between people are minimal.
  • Such an approach seems consistent with how humans would approach the problem, as implied by expressions such as ‘headcount’.
  • Feeding the CNN with whole images allows modelling of the local context, i.e. the expected local appearance (e.g. size, orientation) of the foreground pedestrian heads and the spatial distribution of pixel luminance in the background.
  • Temporal coherence is exploited by refined regression of count estimations from multiple frames.

2 Previous Work

  • Counting methods can be mainly categorised into two groups: counting by detection and counting by regression.
  • For each pixel, a linear transformation of its feature descriptor is learned, using a random forest to match the ground truth density function.
  • Also, the camera perspective effect is not taken into consideration, which could invalidate the regression assumption.
  • Hybrid methods [7, 10, 17, 20] aim to combine the benefits of both approaches by fusing their techniques.
  • Counting then becomes a problem of finding a relationship between the number of foreground pixels and the number of humans present in the image, a relationship which is learned using a neural network.

3 Method

  • Deep learning machines have addressed many problems that were deemed unsolvable in a surprisingly easy way.
  • Most of the research has focused on the use of static architectures, ignoring relevant dynamic aspects of some of the problems.
  • The authors' work explores how to use time cues in an efficient manner; therefore they avoided recurrent neural networks or other time-domain architectures.
  • An appropriate representation of the input can lead to better and faster learning of the network [15].
  • Every frame is pre-processed by first computing the mean over all training images and subtracting it from every pixel before the frame enters the network (see the sketch after this list).
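A minimal sketch of this mean-subtraction step, assuming the frames are already loaded into a numpy array (names and shapes are illustrative, not from the paper):

    import numpy as np

    def mean_centre(frames, mean_image=None):
        # frames: array of shape (N, H, W) or (N, H, W, C).
        # mean_image: pass the mean computed on the training set when
        # centring validation/test frames; if None it is computed here.
        frames = frames.astype(np.float64)
        if mean_image is None:
            mean_image = frames.mean(axis=0)  # per-pixel mean over all frames
        return frames - mean_image, mean_image

    # compute the mean once on the training set, reuse it at test time:
    # train_c, mu = mean_centre(train_frames)
    # test_c, _ = mean_centre(test_frames, mean_image=mu)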

3.1 Density Estimation

  • The density learning pipeline (Fig. 1) comprises four convolutional layers followed by a fully connected one.
  • In contrast to [7], where all activations in a feature map share the same bias, in their case each feature activation is characterised by its own bias.
  • The last layer of the density estimation pipeline is a fully connected one (F1 in Eq. 1) and has as many neurons as are present in one feature map of the previous layer (i.e. C4).
  • The cost function the authors use for the comparison is the Kullback–Leibler divergence shown in Eq. 6, and the error produced is the mean cost across all the examples seen; a sketch of this cost follows the list.
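Eq. 6 itself is not reproduced in this summary; the sketch below shows a standard Kullback–Leibler divergence cost averaged over a batch, which is what the text describes (the flattening, per-row normalisation and epsilon guard are assumptions):

    import numpy as np

    def kl_cost(pred_density, true_density, eps=1e-8):
        # pred_density, true_density: arrays of shape (N, H*W); each row
        # is normalised so it can be treated as a distribution
        p = true_density / (true_density.sum(axis=1, keepdims=True) + eps)
        q = pred_density / (pred_density.sum(axis=1, keepdims=True) + eps)
        kl = np.sum(p * np.log((p + eps) / (q + eps)), axis=1)
        return kl.mean()  # mean cost across all examples seen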

3.2 Counting

  • The final layer of each pipeline is dedicated to estimating the relationship between this density and the actual count of people.
  • So, a single linear neuron (L1 in Fig. 1) is fully connected with the sigmoid neurons of F1.
  • Learning is performed by linear regression using the mean square error across a number of examples as the cost function.
  • The accuracy of people counting is further improved by fusing measurements from networks operating on subsequent frames along the temporal dimension.
  • Each rectified neuron has an activation function similar to Eq. 7, with the only difference that negative values, produced by the summation of the weighted input with the bias, produce a zero output; see the sketch after this list.
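A schematic sketch of the counting head and the temporal fusion just described (the window length and weight values are illustrative assumptions; the paper learns the weights by regression against a mean-square-error cost):

    import numpy as np

    def count_from_density(f1, w, b):
        # single linear neuron (L1) fully connected to the sigmoid
        # outputs of F1: count = w . f1 + b
        return float(np.dot(w, f1) + b)

    def fuse_counts(counts, w_t, b_t):
        # one rectified unit over the per-frame count estimates:
        # negative pre-activations produce a zero output
        return max(float(np.dot(w_t, counts) + b_t), 0.0)

    # e.g. fuse the estimates from 5 subsequent frames
    per_frame = np.array([12.1, 11.8, 12.4, 12.0, 11.9])
    print(fuse_counts(per_frame, w_t=np.full(5, 0.2), b_t=0.0))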

4 Results

  • The network described earlier was implemented using Python and the pylearn2 and theano machine learning libraries [2, 3].
  • Instead of learning a density and then performing a linear regression to estimate the count, the training of the density and the counting takes place in an alternating manner.
  • For [7], the ground truth was based on cropped images of size 320 × 240 from the original 640 × 480 binary images created in the previous step, scaled to a resolution of 33 × 23 and normalised with values between 0 and 1 (a preparation sketch follows this list).
  • Training a CNN requires fine tuning of various parameters.
  • On the other hand, the approach in [20] has a plethora of parameters to adjust in order to solve the problem of detecting people in an image, and furthermore it exchanges node information by using fully connected layers.
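A plain-numpy sketch of that ground-truth preparation (the crop origin and the nearest-neighbour resize are assumptions; any image library's resize would serve equally well):

    import numpy as np

    def nn_resize(img, out_h, out_w):
        # nearest-neighbour resize with plain numpy indexing
        rows = (np.arange(out_h) * img.shape[0] / out_h).astype(int)
        cols = (np.arange(out_w) * img.shape[1] / out_w).astype(int)
        return img[rows][:, cols]

    def make_target(binary_map, top=0, left=0):
        # crop a 320 x 240 window from the 640 x 480 binary map,
        # scale it to 33 x 23 and normalise values into [0, 1]
        crop = binary_map[top:top + 240, left:left + 320].astype(np.float64)
        small = nn_resize(crop, 23, 33)
        rng = small.max() - small.min()
        return (small - small.min()) / rng if rng > 0 else small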

5 Conclusion

  • In this work a methodology using CNN was presented for people counting.
  • The authors have demonstrated that using the whole image information as training input, instead of using cropped images, performs better, as the network is able to learn how to distinguish between the foreground and the background.
  • Furthermore, fusing the count estimates in the temporal domain improves the estimations further.
  • To the best of their knowledge, their method is the first to propose the application of a CNN on the whole image for the task of people counting, and furthermore to use temporal information for the same task.


This is a postprint version of the published document:
Sourtzinos, P., Velastin, S.A., Jara, M., Zegers, P. and Makris, D. (2016). People Counting in Videos by Fusing Temporal Cues from Spatial Context-Aware Convolutional Neural Networks. In European Conference on Computer Vision 2016 Workshops, Part II, LNCS 9914, pp. 655–667. DOI: https://doi.org/10.1007/978-3-319-48881-3_46
© Springer International Publishing Switzerland 2016.

People Counting in Videos by Fusing Temporal
Cues from Spatial Context-Aware
Convolutional Neural Networks

Panos Sourtzinos¹, Sergio A. Velastin², Miguel Jara³, Pablo Zegers⁴, and Dimitrios Makris¹

¹ School of Computer Science and Mathematics, Kingston University, Kingston, UK
  psourt@gmail.com, d.makris@kingston.ac.uk
² Department of Computer Science, Universidad Carlos III de Madrid, Getafe, Spain
  sergio.velastin@ieee.org
³ Departamento de Informática, Universidad de Santiago de Chile, Santiago, Chile
  miguel.jara.rodriguez@gmail.com
⁴ Faculty of Engineering and Applied Sciences, Universidad de los Andes, Santiago, Chile
  pablozegers@gmail.com
Abstract. We present an efficient method for people counting in video sequences from fixed cameras by utilising the responses of spatially context-aware convolutional neural networks (CNN) in the temporal domain. For stationary cameras, the background information remains fairly static, while foreground characteristics, such as size and orientation, may depend on their image location; thus the use of whole frames for training a CNN improves the differentiation between background and foreground pixels. Foreground density representing the presence of people in the environment can then be associated with people counts. Moreover, the fusion of the count estimations in the temporal domain can further enhance the accuracy of the final count. Our methodology was tested using the publicly available Mall dataset and achieved a mean deviation error of 0.091.

Keywords: People counting · Convolutional neural networks · Video analysis
1 Introduction

Counting people can provide useful information for monitoring purposes in public areas, assist urban planners in designing more efficient environments, provide cues for situations that might endanger the safety of civilians, and also be used by shopping mall and retail store managers for evaluating their business practices. In principle, such knowledge can be obtained by analysing image and video footage from location-specific cameras with the goal to measure the number of people in them. For this reason, in this work we present an efficient method for counting people in images and video sequences from fixed cameras, which incorporates the fusion of context-aware cues from CNN in the temporal domain.
People counting is a very challenging problem, and although commercial solutions exist, these focus mainly on top-view cameras, where occlusions between people are minimal. An effective approach is to detect the heads of the pedestrians present in an image, since they are less prone to disappear in the image through occlusions, and then sum the head detections to measure the total count. Such an approach seems consistent with how humans would approach the problem, as implied by expressions such as ‘headcount’. Furthermore, since our interest is in measuring the count of people using stationary cameras, where the background is assumed fairly static, a local context-aware detector that is spatially tuned to distinguish foreground objects (e.g. heads) from the background scene is more promising than a general-purpose detector.
The main contribution of this work is the proposal of a convolutional neural network (CNN) that uses global image information, rather than cropped images, for people counting, and the use of temporal coherence for enhancing the precision of the obtained results. Feeding the CNN with whole images allows modelling of the local context, i.e. the expected local appearance (e.g. size, orientation) of the foreground pedestrian heads and the spatial distribution of pixel luminance in the background. The output of the CNN for each frame is an intermediate density map, and head counts are estimated using regression. Temporal coherence is exploited by refined regression of count estimations from multiple frames. In Sect. 2 a background study on the methods for people counting is presented, while in Sect. 3 the methodology of our approach is described. Finally, in Sect. 4 the results and a critical discussion of our methodology are given, followed by the concluding section in Sect. 5.
2 Previous Work

Counting methods can be mainly categorised into two groups: counting by detection and counting by regression. In the former case, human shape models are used to localise people on the image plane, while the latter is based on the relationship between a distribution of low-level features in the whole image and the number of people in it. Hybrid methods combine these two approaches, i.e. a person detector is used to create a footprint on a distribution describing the whole image, which is then used to infer the number of people in it. The use of CNNs for the task of people counting is by its nature such an approach. In the following sections we identify some methods, but as the literature on the topic is extensive, space limitations prevent us from giving a fuller review.

In counting by detection [16] the idea is to detect the presence of people in an image and then sum the detections to produce the final count. People detection is achieved by object detectors (whole or part-based), based on learned models that use features such as histograms of oriented gradients (HOG), poselets, edgelets and others, which describe a shape model of a human body using pixel information. Traditionally, a location-invariant object detector is applied using a sliding window technique followed by non-maximal suppression to localise the objects of interest.

In [16] a HOG detector is used to create a probability distribution over the image. To deal with occlusions, the HOG detector is trained to learn only the upper part of the human body. Next, the optical flow between two consecutive frames is computed. Assuming that the upper human body exhibits a uniform motion, in contrast with the motion generated by the limbs, a mask resembling the shape of the upper human body is scanned through the optical flow response and a probability distribution of uniform motion is computed. The probability distributions learned from the shape model and the uniform motion model are then combined, and the fused probability distribution is searched, using mean shift mode estimation, to localise head detections.
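A schematic of the fusion step in [16] (the two probability maps below are random stand-ins, and a simple arg-max replaces the actual mean shift mode estimation):

    import numpy as np

    h, w = 120, 160
    rng = np.random.default_rng(0)
    p_shape = rng.random((h, w))   # stand-in: upper-body HOG confidence map
    p_motion = rng.random((h, w))  # stand-in: uniform-motion probability map

    fused = p_shape * p_motion     # combine the two distributions
    peak = np.unravel_index(np.argmax(fused), fused.shape)
    print(peak)                    # location of the strongest head hypothesis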
A pitfall of counting by detection techniques is that they do not perform well in images with low resolution, since objects in these appear small and do not generate enough information to be detected. Moreover, since most of these approaches use a sliding window to scan the whole image multiple times at different scales, they are computationally heavy and thus slow.
In counting by regression [1, 11, 13], a mapping from some low-level image characteristics, like edges or corners, to the number of objects is estimated using machine learning methods. Although this approach avoids the hard task of object detection, ambiguities may arise from the presence of objects of other classes that may also generate responses. Furthermore, since some of these methods are location-invariant, the training phase requires a large amount of data in order to cover all the possible perspective nonlinearities of the image plane. On the other hand, annotating the ground truth data is simpler as it only involves manual counting.
In [13] the main idea is that integrals of density functions over pixel grids should
match the object counts in an image. It is assumed that each pixel is characterised by a
discretized feature vector and the training data are dot annotated (e.g. torso). Each
annotated pixel is then characterised, using a randomised tree approach, by a feature
descriptor combining the modalities of the actual image, the difference image and the
foreground image. For each pixel, a linear transformation of its feature descriptor is
learned, using a random forest to match the ground truth density function.
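The core idea of [13], that the integral of the density over any image region gives the object count in that region, reduces to a sum over pixels once the density is discretised. A toy illustration in plain numpy (the density values are invented):

    import numpy as np

    # toy density map: each annotated person contributes unit mass
    # spread over a small patch of pixels
    density = np.zeros((8, 8))
    density[1:3, 1:3] = 0.25  # person 1: total mass 1.0 over a 2x2 patch
    density[5:7, 4:6] = 0.25  # person 2

    print(density.sum())            # whole-image count: 2.0
    print(density[0:4, 0:4].sum())  # count inside a region of interest: 1.0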
In [11] a mixture of Gaussians is initially applied to extract foreground information. Histograms of the area of the foreground blobs and of edge orientation are then used as features to describe the image. Finally, a feed-forward back-propagation neural network is used with the histograms of the normalised features as inputs, learning the number of pedestrians in the image.
In [1] a method for counting people using the Harris corner detector is presented. Motion vectors are used to differentiate between static and moving corners. Assuming that each person in the image generates the same number of moving corners, the number of people in a frame is computed as the ratio of the moving corners detected over the average number of corners per person. As a consequence, this approach fails to recognise static people. Also, the camera perspective effect is not taken into consideration, which could invalidate the regression assumption.
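The regression assumption of [1] reduces to a single ratio; a minimal illustration with invented numbers:

    def corner_ratio_count(moving_corners, corners_per_person):
        # every person is assumed to generate roughly the same
        # number of moving corners
        return moving_corners / corners_per_person

    # e.g. 180 moving Harris corners at ~12 corners per person
    print(corner_ratio_count(180, 12.0))  # 15.0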
A drawback of all regression approaches is that they cannot discriminate well between intra-class variations (i.e. differences in human sizes, humans carrying objects, humans with bicycles, etc.), and since they lack learned object shape models, they are unable to differentiate between inter-class (e.g. animals) differences. Thus their application is mostly location-specific.

Hybrid methods [7, 10, 17, 20] aim to combine the benefits of both approaches by fusing their techniques. For instance, in [17] a density image is computed where each pixel value defines the confidence output of the person detector used in [13]. This value is then discretized and represented by a binary feature vector. SIFT features are extracted from the image in order to compute another binary feature vector. The concatenation of the two binary feature vectors is then used to describe each pixel, and by minimizing the regularised MESA distance, the weight of each discretized feature is learned. The density of each pixel is thus calculated by multiplying its feature descriptor by the learned weight vector, and the count of people in the image is then estimated by the integral of the density of the image.
Another example of a hybrid approach is presented in [10], which copes with crowded situations. A Gaussian mixture model is initially applied on a grayscale video sequence to obtain the foreground information. After perspective correction, this is further processed using a closing operation. Counting then becomes a problem of finding a relationship between the number of foreground pixels and the number of humans present in the image, a relationship which is learned using a neural network (a minimal sketch follows).
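A minimal sketch of that foreground-pixels-to-count mapping; [10] learns it with a neural network, so the least-squares line and all numbers below are stand-in assumptions for illustration:

    import numpy as np

    # invented training pairs: (foreground pixels after perspective
    # correction, people in the frame)
    fg_pixels = np.array([1200.0, 2500.0, 3900.0, 5100.0, 6400.0])
    people = np.array([3.0, 6.0, 10.0, 13.0, 16.0])

    # fit count = slope * fg_pixels + intercept by least squares
    A = np.stack([fg_pixels, np.ones_like(fg_pixels)], axis=1)
    (slope, intercept), *_ = np.linalg.lstsq(A, people, rcond=None)

    print(slope * 4500.0 + intercept)  # predicted count for a new frame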
Finally, two hybrid approaches [7, 20] are the only ones, as far as we know, that use CNN purely for people counting. Both attempt to exploit the CNN characteristic of spatial invariance in the detection of patterns, and thus the networks described are trained as human detectors by using spatial crops from whole images for training. In [7] a CNN learns to estimate the density of people in an image by using cropped images from the full-resolution training dataset. The trained network is then applied to the whole image to produce a density map of human presence, and moreover its parameters are transferred to two similar networks that are applied on different resolutions of the global image. The responses from the three networks are then averaged to produce a final density map. To count the number of people in the density image, each point of the estimated density is fed to a linear regression node. The weights of the node are then learned independently of the density estimation. In [20] cropped images are also used for training; however, the learning of the density and the total count is not sequential, but takes place in parallel. Both the density map and the linear regression node are connected to the same CNN, and learning takes place by alternating the cost function between the one used for density estimation and the one used for count estimation.
3 Method

Deep learning machines have addressed many problems that were deemed unsolvable in a surprisingly easy way. However, most of the research has focused on the use of static architectures, ignoring relevant dynamic aspects of some of the problems. This is especially true in video analytics, where analysis is mainly frame-based, and traditionally the information obtained from each frame has been integrated using some heuristic-based algorithm. This has been recognized by many in the field, and many recent publications have extended and complemented the convolutional neural network (CNN) architecture into the time domain, achieving good results (e.g. [4, 8, 9, 19]). Our work explores how to use time cues in an efficient manner; therefore we avoided recurrent neural networks or other time-domain architectures.

Citations
Proceedings ArticleDOI
25 Jul 2017
TL;DR: A convolutional LSTM (ConvLSTM) model is proposed to capture both spatial and temporal dependencies for crowd counting, together with a bidirectional extension that can access long-range information in both directions.
Abstract: Region of Interest (ROI) crowd counting can be formulated as a regression problem of learning a mapping from an image or a video frame to a crowd density map. Recently, convolutional neural network (CNN) models have achieved promising results for crowd counting. However, even when dealing with video data, CNN-based methods still consider each video frame independently, ignoring the strong temporal correlation between neighboring frames. To exploit the otherwise very useful temporal information in video sequences, we propose a variant of a recent deep learning model called convolutional LSTM (ConvLSTM) for crowd counting. Unlike the previous CNN-based methods, our method fully captures both spatial and temporal dependencies. Furthermore, we extend the ConvLSTM model to a bidirectional ConvLSTM model which can access long-range information in both directions. Extensive experiments using four publicly available datasets demonstrate the reliability of our approach and the effectiveness of incorporating temporal information to boost the accuracy of crowd counting. In addition, we also conduct some transfer learning experiments to show that once our model is trained on one dataset, its learning experience can be transferred easily to a new dataset which consists of only very few video frames for model adaptation.

146 citations

Journal ArticleDOI
TL;DR: This survey presents detailed attributes of CNN with special emphasis on optimization methods that have been utilized in CNN-based methods, and introduces a taxonomy that summarizes important aspects of the CNN for approaching crowd behaviour analysis.
Abstract: Interest in automatic crowd behaviour analysis has grown considerably in the last few years. Crowd behaviour analysis has become an integral part all over the world for ensuring peaceful event organizations and minimum casualties in the places of public and religious interests. Traditionally, the area of crowd analysis was computed using handcrafted features. However, the real-world images and videos consist of nonlinearity that must be used efficiently for gaining accuracies in the results. As in many other computer vision areas, deep learning-based methods have taken giant strides for obtaining state-of-the-art performance in crowd behaviour analysis. This paper presents a comprehensive survey of current convolution neural network (CNN)-based methods for crowd behaviour analysis. We have also surveyed popular software tools for CNN in the recent years. This survey presents detailed attributes of CNN with special emphasis on optimization methods that have been utilized in CNN-based methods. It also reviews fundamental and innovative methodologies, both conventional and latest methods of CNN, reported in the last few years. We introduce a taxonomy that summarizes important aspects of the CNN for approaching crowd behaviour analysis. Details of the proposed architectures, crowd analysis needs and their respective datasets are reviewed. In addition, we summarize and discuss the main works proposed so far with particular interest on CNNs on how they treat the temporal dimension of data, their highlighting features and opportunities and challenges for future research. To the best of our knowledge, this is a unique survey for crowd behaviour analysis using the CNN. We hope that this survey would become a reference in this ever-evolving field of research.

122 citations

Journal ArticleDOI
Biao Yang, Jinmeng Cao, Nan Wang, Yuyu Zhang, Guo-Zeng Cui
TL;DR: A cross-scene counting model is learned with information transferred from other scenes and the effectiveness of DA-ELM in transferring information through embedding domain adaptation into an ELM framework is revealed.
Abstract: Cross-scene counting is difficult if only limited training samples are available in the new scene. In this paper, a cross-scene counting model is learned with information transferred from other scenes. Counting is achieved through regression, which maps the features of crowds to their counts. Hand-crafted features are extracted from segmented crowd foregrounds obtained through block robust principal component analysis. Samples of existing scenes (source domain) are adaptively transferred into the new scene (target domain) through domain adaptation. Then, a counting model based on domain adaptation-extreme learning machine (DA-ELM) is efficiently learned via iterative optimization with training samples of both domains. Quantitative analysis indicates that the DA-ELM can count the crowds of a new scene with only a half of the training samples compared with counting without domain adaptation. Contrastive evaluations based on three benchmarking data sets are implemented with several state-of-the-art domain adaptation approaches, including hand-crafted feature-based and deep neural network-based approaches. Results reveal the effectiveness of DA-ELM in transferring information through embedding domain adaptation into an ELM framework.

11 citations


Cites methods from "People Counting in Videos by Fusing..."

  • ...[13] utilized the responses of a spatially context-aware CNN in the temporal domain to enhance the accuracy of the final count....

Book ChapterDOI
20 Nov 2017
TL;DR: A multi-scale fully convolutional network for robust crowd counting, achieved through density map estimation, that can adapt not only to sparse scenes but also to dense ones and achieves state-of-the-art counting performance on benchmark datasets.
Abstract: Crowd count estimation from a still crowd image with arbitrary perspective and density level is one of the challenges in crowd analysis. Techniques developed in the past performed poorly in highly congested scenes with several thousands of people. To resolve the problem, we propose a Multi-scale Fully Convolutional Network for robust crowd counting, that is achieved through estimating density map. Our approach consists of the following contributions: (1) an adaptive human-shaped kernel is proposed to generate the ground truth of the density map. (2) A deep, multi-scale, fully convolutional network is proposed to predict crowd counts. Per-scale loss is used to guarantee the effectiveness of multi-scale strategy. (3) Several attempts, e.g. de-convolutional and minimizing per-scale loss, are tried to improve the counting performance of the proposed approach. Our approach can adapt to not only sparse scenes, but also dense ones. In addition, it achieves the state-of-the-art counting performance in benchmarking datasets, including the World Expo’10, the UCF_CC_50, and the UCSD datasets.

1 citation

References
Proceedings ArticleDOI
07 Jun 2015
TL;DR: A deep convolutional neural network is proposed for crowd counting, and it is trained alternatively with two related learning objectives, crowd density and crowd count, to obtain better local optimum for both objectives.
Abstract: Cross-scene crowd counting is a challenging task where no laborious data annotation is required for counting people in new target surveillance crowd scenes unseen in the training set. The performance of most existing crowd counting methods drops significantly when they are applied to an unseen scene. To address this problem, we propose a deep convolutional neural network (CNN) for crowd counting, and it is trained alternatively with two related learning objectives, crowd density and crowd count. This proposed switchable learning approach is able to obtain better local optimum for both objectives. To handle an unseen target crowd scene, we present a data-driven method to finetune the trained CNN model for the target scene. A new dataset including 108 crowd scenes with nearly 200,000 head annotations is introduced to better evaluate the accuracy of cross-scene crowd counting methods. Extensive experiments on the proposed and another two existing datasets demonstrate the effectiveness and reliability of our approach.

1,143 citations

Proceedings Article
06 Dec 2010
TL;DR: This work focuses on the practically-attractive case when the training images are annotated with dots, and introduces a new loss function, which is well-suited for visual object counting tasks and at the same time can be computed efficiently via a maximum subarray algorithm.
Abstract: We propose a new supervised learning framework for visual object counting tasks, such as estimating the number of cells in a microscopic image or the number of humans in surveillance video frames. We focus on the practically-attractive case when the training images are annotated with dots (one dot per object). Our goal is to accurately estimate the count. However, we evade the hard task of learning to detect and localize individual object instances. Instead, we cast the problem as that of estimating an image density whose integral over any image region gives the count of objects within that region. Learning to infer such density can be formulated as a minimization of a regularized risk quadratic cost function. We introduce a new loss function, which is well-suited for such learning, and at the same time can be computed efficiently via a maximum subarray algorithm. The learning can then be posed as a convex quadratic program solvable with cutting-plane optimization. The proposed framework is very flexible as it can accept any domain-specific visual features. Once trained, our system provides accurate object counts and requires a very small time overhead over the feature extraction step, making it a good candidate for applications involving real-time processing or dealing with huge amount of visual data.

1,098 citations

Proceedings ArticleDOI
01 Jan 2010
TL;DR: This paper illustrates how to use Theano, outlines the scope of the compiler, provides benchmarks on both CPU and GPU processors, and explains its overall design.
Abstract: Theano is a compiler for mathematical expressions in Python that combines the convenience of NumPy's syntax with the speed of optimized native machine language. The user composes mathematical expressions in a high-level description that mimics NumPy's syntax and semantics, while being statically typed and functional (as opposed to imperative). These expressions allow Theano to provide symbolic differentiation. Before performing computation, Theano optimizes the choice of expressions, translates them into C++ (or CUDA for GPU), and compiles them into dynamically loaded Python modules, all automatically. Common machine learning algorithms implemented with Theano are from 1.6× to 7.5× faster than competitive alternatives (including those implemented with C/C++, NumPy/SciPy and MATLAB) when compiled for the CPU, and between 6.5× and 44× faster when compiled for the GPU. This paper illustrates how to use Theano, outlines the scope of the compiler, provides benchmarks on both CPU and GPU processors, and explains its overall design.

939 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: In this article, the authors proposed a method to extract spatio-temporal feature representations to build strong classifiers using Convolutional Neural Networks and link their predictions to produce detections consistent in time.
Abstract: We address the problem of action detection in videos. Driven by the latest progress in object detection from 2D images, we build action models using rich feature hierarchies derived from shape and kinematic cues. We incorporate appearance and motion in two ways. First, starting from image region proposals we select those that are motion salient and thus are more likely to contain the action. This leads to a significant reduction in the number of regions being processed and allows for faster computations. Second, we extract spatio-temporal feature representations to build strong classifiers using Convolutional Neural Networks. We link our predictions to produce detections consistent in time, which we call action tubes. We show that our approach outperforms other techniques in the task of action detection.

694 citations

Proceedings ArticleDOI
01 Jan 2012
TL;DR: This paper presents a single regression model based approach that is able to estimate people count in spatially localised regions and is more scalable without the need for training a large number of regressors proportional to the number of local regions.
Abstract: This paper presents a multi-output regression model for crowd counting in public scenes. Existing counting by regression methods either learn a single model for global counting, or train a large number of separate regressors for localised density estimation. In contrast, our single regression model based approach is able to estimate people count in spatially localised regions and is more scalable without the need for training a large number of regressors proportional to the number of local regions. In particular, the proposed model automatically learns the functional mapping between interdependent low-level features and multi-dimensional structured outputs. The model is able to discover the inherent importance of different features for people counting at different spatial locations. Extensive evaluations on an existing crowd analysis benchmark dataset and a new more challenging dataset demonstrate the effectiveness of our approach.

661 citations

Frequently Asked Questions (2)
Q1. What contributions have the authors mentioned in the paper "People Counting in Videos by Fusing Temporal Cues from Spatial Context-Aware Convolutional Neural Networks"?

The authors present an efficient method for people counting in video sequences from fixed cameras by utilising the responses of spatially context-aware convolutional neural networks (CNN) in the temporal domain. Moreover, the fusion of the count estimations in the temporal domain can further enhance the accuracy of the final count.

Possible future lines of research may include minimising an information-theoretic measure instead of the Euclidean error, in order to take into account the probabilistic nature of the problem. Moreover, network architectures that utilise recurrent nodes can be used to take advantage of their application in the temporal domain, and the use of other CNN architectures which incorporate temporal features, such as optical flow, can also be investigated.