# People Counting in Videos by Fusing Temporal Cues from Spatial Context-Aware Convolutional Neural Networks

## Summary

### 1 Introduction

- Counting people can provide useful information for monitoring purposes in public areas, assist urban planners in designing more efficient environments, provide cues for situations that might endanger the safety of civilians, and also be used by shopping mall and retail store managers for evaluating their business practices.
- People counting is a very challenging problem, and although commercial solutions exist, these focus mainly on top-view cameras, where occlusions between people are minimal.
- Such an approach seems consistent with how humans would approach the problem, as implied by expressions such as ‘headcount’.
- Feeding the CNN with whole images allows modelling of the local context, i.e. the expected local appearance (e.g. size, orientation) of the foreground pedestrian heads and the spatial distribution of pixel luminance in the background.
- Temporal coherence is exploited by refined regression of count estimations from multiple frames.

### 2 Previous Work

- Counting methods can be mainly categorised into two groups.
- For each pixel, a linear transformation of its feature descriptor is learned, using a random forest to match the ground truth density function.
- Also, the camera perspective effect is not taken into consideration, which could invalidate the regression assumption.
- Hybrid methods [7, 10, 17, 20] aim to combine the benefits of both approaches by fusing their techniques.
- Counting then becomes a problem of finding a relationship between the number of foreground pixels and the number of humans present in the image, a relationship which is learned using a neural network.

### 3 Method

- Deep learning machines have addressed many problems that were deemed unsolvable in a surprisingly easy way.
- Most of the research has focused on static architectures, ignoring the relevant dynamic aspects of some of the problems.
- The authors' work explores how to use temporal cues efficiently; for that reason they avoided recurrent neural networks and other time-domain architectures.
- Appropriate representation of the input can lead to better and faster learning of the network [15].
- Before entering the network, every frame is pre-processed: the mean over all training images is calculated and subtracted from all the pixels.
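
The mean-subtraction step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the mean is a single scalar computed over every pixel of every training image (a per-pixel mean image is the other plausible reading), and the function names `global_mean` and `subtract_mean` are hypothetical.

```python
def global_mean(images):
    """Mean pixel value over every pixel of every training image.

    Assumption: the mean is one scalar, not a per-pixel mean image.
    """
    total = 0.0
    count = 0
    for image in images:
        for row in image:
            total += sum(row)
            count += len(row)
    return total / count


def subtract_mean(image, mean):
    """Return a copy of `image` with `mean` subtracted from each pixel."""
    return [[pixel - mean for pixel in row] for row in image]


# Two tiny 2x2 grey-level "frames" standing in for the training set.
train = [
    [[0.0, 2.0], [4.0, 6.0]],
    [[1.0, 3.0], [5.0, 7.0]],
]
mean = global_mean(train)                 # 28 / 8 = 3.5
centred = subtract_mean(train[0], mean)   # [[-3.5, -1.5], [0.5, 2.5]]
```

The same training-set mean would be reused unchanged for validation and test frames, so that all inputs share one reference level.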

### 3.1 Density Estimation

- The density learning pipeline (Fig. 1) comprises four convolutional layers followed by a fully connected one.
- In contrast to [7], where all activations in a feature share the same bias, in the authors' network each feature activation has its own bias.
- The last layer of the density estimation pipeline is a fully connected one (F1 in Eq. 1) and has as many neurons as there are present in one feature of the previous layer (i.e. C4).
- The cost function the authors use for the comparison is the Kullback–Leibler divergence shown in Eq. 6, and the error produced is the mean cost across all the examples seen.
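
As a rough illustration of a Kullback–Leibler cost averaged over examples, the sketch below may help. Eq. 6 is not reproduced in this summary, so the exact form is an assumption: the code uses the standard discrete divergence KL(target || predicted), with both flattened maps rescaled to sum to one. The names `kl_divergence` and `mean_kl` and the `eps` guard are hypothetical.

```python
import math


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two flattened, non-negative maps.

    Assumption: each map is rescaled to sum to one before comparison.
    `eps` only guards against log of zero in the prediction.
    """
    t_sum = sum(target)
    p_sum = sum(predicted)
    cost = 0.0
    for t, p in zip(target, predicted):
        t_n = t / t_sum
        p_n = p / p_sum
        if t_n > 0.0:  # terms with zero target mass contribute nothing
            cost += t_n * math.log(t_n / max(p_n, eps))
    return cost


def mean_kl(targets, predictions):
    """Mean KL cost across a batch of examples."""
    costs = [kl_divergence(t, p) for t, p in zip(targets, predictions)]
    return sum(costs) / len(costs)


targets = [[1.0, 0.0], [0.2, 0.8]]
predictions = [[0.5, 0.5], [0.2, 0.8]]
batch_cost = mean_kl(targets, predictions)  # (log 2 + 0) / 2
```

A perfect prediction gives zero cost, and the cost grows as predicted density moves away from where the ground truth places it.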

### 3.2 Counting

- The final layer of each pipeline is dedicated to estimating the relationship between this density and the actual count of people.
- So, a single linear neuron (L1 in Fig. 1) is fully connected with the sigmoid neurons of F1.
- Learning is performed by linear regression, using the mean square error across a number of examples as the cost function.
- The accuracy of people counting is further improved by fusing measurements from networks operating on subsequent frames along the temporal dimension.
- Each rectified neuron has an activation function similar to Eq. 7, with the only difference that negative values, produced by the summation of the weighted input with the bias, produce a zero output.
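
The rectified neuron and the temporal fusion of per-frame counts can be sketched as below. The layout is an assumption made for illustration (one hidden layer of rectified neurons feeding a single linear output), since this summary does not specify the fusion architecture; all weights are made-up values, and `rectified_neuron` and `fuse_counts` are hypothetical names.

```python
def rectified_neuron(inputs, weights, bias):
    """Weighted sum plus bias, with negative results clamped to zero."""
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return max(0.0, pre_activation)


def fuse_counts(frame_counts, hidden_weights, hidden_biases,
                out_weights, out_bias):
    """Fuse count estimates from consecutive frames into one estimate.

    Assumed layout: rectified hidden neurons, then one linear output.
    """
    hidden = [
        rectified_neuron(frame_counts, w, b)
        for w, b in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(out_weights, hidden)) + out_bias


# Count estimates from three consecutive frames (illustrative values).
frame_counts = [10.0, 12.0, 11.0]
hidden_weights = [[0.5, 0.25, 0.25], [-1.0, 0.0, 0.0]]
hidden_biases = [0.0, 0.0]
fused = fuse_counts(frame_counts, hidden_weights, hidden_biases,
                    out_weights=[1.0, 1.0], out_bias=0.0)
# First hidden neuron: 5.0 + 3.0 + 2.75 = 10.75; second is clamped to 0.
```

In practice the weights would be learned by regression against the ground-truth count rather than set by hand.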

### 4 Results

- The network described earlier was implemented using Python and the pylearn2 and theano machine learning libraries [2, 3].
- Instead of learning a density and then performing a linear regression to estimate the count, the training of the density and of the counting alternate.
- For [7], the ground truth was based on cropped images of size 320 × 240 from the original 640 × 480 binary images created in the previous step, scaled to a resolution of 33 × 23 and normalised with values between 0 and 1.
- Training a CNN requires fine tuning of various parameters.
- On the other hand, the approach in [20] has a plethora of parameters to adjust and also has to solve the problem of detecting people in an image; furthermore, it exchanges node information by using fully connected layers.
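
The ground-truth preparation described for [7] (crop 320 × 240 from a 640 × 480 binary image, scale to 33 × 23, normalise to [0, 1]) might look like the sketch below. Several details are assumptions, because the summary does not state them: the crop position, area averaging as the scaling method, and min-max normalisation. The function names are hypothetical.

```python
def crop(image, top, left, height, width):
    """Cut a height x width window out of a row-major image."""
    return [row[left:left + width] for row in image[top:top + height]]


def downscale(image, out_h, out_w):
    """Area-average into an out_h x out_w grid (bins may differ by a pixel)."""
    in_h, in_w = len(image), len(image[0])
    result = []
    for i in range(out_h):
        r0, r1 = i * in_h // out_h, (i + 1) * in_h // out_h
        row = []
        for j in range(out_w):
            c0, c1 = j * in_w // out_w, (j + 1) * in_w // out_w
            block = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(block) / len(block))
        result.append(row)
    return result


def normalise(image):
    """Min-max scale values into [0, 1]; a constant image maps to zeros."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in image]
    return [[(v - lo) / (hi - lo) for v in row] for row in image]


# A 640 x 480 binary image with one 20 x 20 foreground blob.
binary = [[0] * 640 for _ in range(480)]
for r in range(200, 220):
    for c in range(300, 320):
        binary[r][c] = 1

window = crop(binary, top=120, left=160, height=240, width=320)  # assumed offset
gt = normalise(downscale(window, out_h=23, out_w=33))            # 23 rows, 33 columns
```

Area averaging is used here only because it keeps partially covered cells fractional; any other resampling method would fit the same pipeline.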

### 5 Conclusion

- In this work, a methodology using a CNN was presented for people counting.
- The authors have demonstrated that using the whole image as training input, instead of cropped images, performs better, as the network is able to learn how to distinguish between the foreground and the background.
- Furthermore, fusing the count estimates in the temporal domain improves them further.
- To the best of their knowledge, their method is the first to propose the application of a CNN on the whole image for the task of people counting, and furthermore to use temporal information for the same task.


##### Citations


### Cites methods from "People Counting in Videos by Fusing..."

...[13] utilized the responses of a spatially context-aware CNN in the temporal domain to enhance the accuracy of the final count....


##### Frequently Asked Questions (2)

###### Q2. What have the authors stated for future work in "People Counting in Videos by Fusing Temporal Cues from Spatial Context-Aware Convolutional Neural Networks"?

Possible future lines of research include minimising an information-theoretic measure instead of the Euclidean error, in order to take into account the probabilistic nature of the problem. Moreover, network architectures that use recurrent nodes could be applied to exploit the temporal domain, and other CNN architectures that incorporate temporal features, such as optical flow, could also be investigated.