Proceedings ArticleDOI

Generating High-Quality Crowd Density Maps Using Contextual Pyramid CNNs

01 Oct 2017 - pp. 1879-1888
TL;DR: A novel method called Contextual Pyramid CNN (CP-CNN) for generating high-quality crowd density and count estimation by explicitly incorporating global and local contextual information of crowd images is presented.
Abstract: We present a novel method called Contextual Pyramid CNN (CP-CNN) for generating high-quality crowd density and count estimation by explicitly incorporating global and local contextual information of crowd images. The proposed CP-CNN consists of four modules: Global Context Estimator (GCE), Local Context Estimator (LCE), Density Map Estimator (DME) and a Fusion-CNN (F-CNN). GCE is a VGG-16 based CNN that encodes global context and is trained to classify input images into different density classes, whereas LCE is another CNN that encodes local context information and is trained to perform patch-wise classification of input images into different density classes. DME is a multi-column architecture-based CNN that aims to generate high-dimensional feature maps from the input image, which are fused with the contextual information estimated by GCE and LCE using F-CNN. To generate high-resolution and high-quality density maps, F-CNN uses a set of convolutional and fractionally-strided convolutional layers and is trained along with the DME in an end-to-end fashion using a combination of adversarial loss and pixel-level Euclidean loss. Extensive experiments on highly challenging datasets show that the proposed method achieves significant improvements over the state-of-the-art methods.
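
As a concrete illustration of the training objective described above, the following is a minimal PyTorch-style sketch of a weighted combination of pixel-level Euclidean loss and adversarial loss for the density-map generator. The weight `lambda_adv` and the discriminator interface are assumptions for illustration, not values from the paper.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()  # pixel-level Euclidean loss
bce = nn.BCELoss()  # adversarial loss (discriminator output assumed sigmoid)

def generator_loss(pred_density, gt_density, disc_on_pred, lambda_adv=1e-3):
    """Combined objective for the density-map generator (DME + F-CNN)."""
    l_euclidean = mse(pred_density, gt_density)
    # The generator is rewarded when the discriminator labels its
    # output as real (target = 1).
    l_adversarial = bce(disc_on_pred, torch.ones_like(disc_on_pred))
    return l_euclidean + lambda_adv * l_adversarial
```
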
Citations
Book ChapterDOI
08 Sep 2018
TL;DR: A novel approach is proposed that simultaneously solves the problems of counting, density map estimation and localization of people in a given dense crowd image; it significantly outperforms the state of the art on the new dataset, the most challenging one to date, with the largest number of crowd annotations across the most diverse set of scenes.
Abstract: With multiple crowd gatherings of millions of people every year in events ranging from pilgrimages to protests, concerts to marathons, and festivals to funerals; visual crowd analysis is emerging as a new frontier in computer vision. In particular, counting in highly dense crowds is a challenging problem with far-reaching applicability in crowd safety and management, as well as gauging political significance of protests and demonstrations. In this paper, we propose a novel approach that simultaneously solves the problems of counting, density map estimation and localization of people in a given dense crowd image. Our formulation is based on an important observation that the three problems are inherently related to each other making the loss function for optimizing a deep CNN decomposable. Since localization requires high-quality images and annotations, we introduce UCF-QNRF dataset that overcomes the shortcomings of previous datasets, and contains 1.25 million humans manually marked with dot annotations. Finally, we present evaluation measures and comparison with recent deep CNNs, including those developed specifically for crowd counting. Our approach significantly outperforms state-of-the-art on the new dataset, which is the most challenging dataset with the largest number of crowd annotations in the most diverse set of scenes.
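
The observation that the three tasks are inherently related admits a simple illustration: the crowd count is the integral of the density map, so count and density terms can be optimized jointly from one prediction. Below is a hypothetical sketch with assumed weights, not the paper's exact decomposition.

```python
import torch
import torch.nn.functional as F

def joint_loss(pred_density, gt_density, w_density=1.0, w_count=0.1):
    """Joint density + counting objective from a single prediction."""
    l_density = F.mse_loss(pred_density, gt_density)
    # The count follows from the same prediction by integration (sum).
    l_count = torch.abs(pred_density.sum() - gt_density.sum())
    return w_density * l_density + w_count * l_count
```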

579 citations


Cites background from "Generating High-Quality Crowd Densi..."

  • ...Sindagi and Patel [26] presented a CNN-based approach that incorporates global and local contextual information in an image to generate density maps....


Book ChapterDOI
08 Sep 2018
TL;DR: A novel training loss combining Euclidean loss and local pattern consistency loss is proposed, which improves model performance in the authors' experiments and achieves performance superior to state-of-the-art methods with far fewer parameters.
Abstract: In this paper, we propose a novel encoder-decoder network, called Scale Aggregation Network (SANet), for accurate and efficient crowd counting. The encoder extracts multi-scale features with scale aggregation modules and the decoder generates high-resolution density maps by using a set of transposed convolutions. Moreover, we find that most existing works use only Euclidean loss, which assumes independence among pixels but ignores the local correlation in density maps. Therefore, we propose a novel training loss, combining Euclidean loss and local pattern consistency loss, which improves the performance of the model in our experiments. In addition, we use normalization layers to ease the training process and apply a patch-based test scheme to reduce the impact of the statistic shift problem. To demonstrate the effectiveness of the proposed method, we conduct extensive experiments on four major crowd counting datasets; our method achieves superior performance to state-of-the-art methods with far fewer parameters.
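
The combined loss can be sketched as follows: Euclidean loss plus a local pattern consistency term computed as one minus a local (windowed) SSIM between the predicted and ground-truth density maps. Window size, sigma and the weight `alpha` below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_window(k=11, sigma=1.5):
    """Normalized 2-D Gaussian window, shape (1, 1, k, k)."""
    coords = torch.arange(k, dtype=torch.float32) - (k - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = (g / g.sum()).unsqueeze(0)
    return (g.t() @ g).view(1, 1, k, k)

def local_ssim(x, y, window, c1=1e-4, c2=9e-4):
    """Single-scale SSIM map; x, y are (B, 1, H, W) density maps."""
    mu_x, mu_y = F.conv2d(x, window), F.conv2d(y, window)
    var_x = F.conv2d(x * x, window) - mu_x ** 2
    var_y = F.conv2d(y * y, window) - mu_y ** 2
    cov = F.conv2d(x * y, window) - mu_x * mu_y
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def sanet_style_loss(pred, gt, window, alpha=1e-3):
    l_euclidean = F.mse_loss(pred, gt)
    l_consistency = 1 - local_ssim(pred, gt, window).mean()
    return l_euclidean + alpha * l_consistency
```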

573 citations


Cites background or methods from "Generating High-Quality Crowd Densi..."

  • ...[32,6] explored methods to incorporate the contextual information by learning various density levels and generate high-resolution density maps....


  • ...Method MCNN [3] Switch-CNN [5] CP-CNN [6] CSRNet [33] SANet...


  • ...(2) [5,32,6] require density level classifier to provide contextual information....


  • ...Some works [3,4,5,6] have achieved significant improvement by addressing the scale variation issue with multi-scale architecture....


  • ...We follow experiment setting of [6] to generate density maps with perspective maps....


Proceedings ArticleDOI
18 Jun 2018
TL;DR: A novel density-aware multi-stream densely connected convolutional neural network-based algorithm, called DID-MDN, for joint rain density estimation and de-raining, which achieves significant improvements over the recent state-of-the-art methods.
Abstract: Single image rain streak removal is an extremely challenging problem due to the presence of non-uniform rain densities in images. We present a novel density-aware multi-stream densely connected convolutional neural network-based algorithm, called DID-MDN, for joint rain density estimation and de-raining. The proposed method enables the network itself to automatically determine the rain-density information and then efficiently remove the corresponding rain-streaks guided by the estimated rain-density label. To better characterize rain-streaks with different scales and shapes, a multi-stream densely connected de-raining network is proposed which efficiently leverages features from different scales. Furthermore, a new dataset containing images with rain-density labels is created and used to train the proposed density-aware network. Extensive experiments on synthetic and real datasets demonstrate that the proposed method achieves significant improvements over the recent state-of-the-art methods. In addition, an ablation study is performed to demonstrate the improvements obtained by different modules in the proposed method. The code can be downloaded at https://github.com/hezhangsprinter/DID-MDN
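
The density-guided idea can be sketched minimally: a small classifier predicts a rain-density label, and its output conditions the de-raining branch. Module shapes and channel counts below are illustrative assumptions, not the paper's multi-stream densely connected architecture.

```python
import torch
import torch.nn as nn

class DensityGuidedDerain(nn.Module):
    def __init__(self, num_density_classes=3):
        super().__init__()
        # Rain-density classifier: one global label per image.
        self.density_cls = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_density_classes))
        # De-raining head conditioned on the broadcast density label.
        self.derain = nn.Conv2d(3 + num_density_classes, 3, 3, padding=1)

    def forward(self, rainy):
        label = self.density_cls(rainy).softmax(dim=1)   # (B, K)
        b, k = label.shape
        cond = label.view(b, k, 1, 1).expand(-1, -1, *rainy.shape[2:])
        return self.derain(torch.cat([rainy, cond], dim=1)), label
```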

535 citations


Cites background from "Generating High-Quality Crowd Densi..."

  • ...leveraged in various applications such as semantic segmentation [45], face alignment [22], visual tracking [18], crowd counting [30], single image super-resolution [43], face anti-spoofing [1], action recognition [48], depth estimation [5], single image dehazing [24, 42, 40] and also in single image de-raining [36]....


Proceedings ArticleDOI
20 Jun 2019
TL;DR: In this article, an end-to-end trainable deep architecture that combines features obtained using multiple receptive field sizes and learns the importance of each such feature at each image location is proposed.
Abstract: State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate crowd density. They typically use the same filters over the whole image or over large image patches. Only then do they estimate local scale to compensate for perspective distortion. This is typically achieved by training an auxiliary classifier to select, for predefined image patches, the best kernel size among a limited set of choices. As such, these methods are not end-to-end trainable and restricted in the scope of context they can leverage. In this paper, we introduce an end-to-end trainable deep architecture that combines features obtained using multiple receptive field sizes and learns the importance of each such feature at each image location. In other words, our approach adaptively encodes the scale of the contextual information required to accurately predict crowd density. This yields an algorithm that outperforms state-of-the-art crowd counting methods, especially when perspective effects are strong.
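
One way to realize such adaptive, per-location scale weighting is sketched below: build a feature pyramid by pooling at several scales, predict a per-pixel softmax weight for each scale, and fuse. Channel sizes and the pooling scheme are assumptions for illustration; inputs are assumed at least 8 pixels per side.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveScaleFusion(nn.Module):
    def __init__(self, channels=64, scales=(1, 2, 4, 8)):
        super().__init__()
        self.scales = scales
        # Predicts one weight map per scale from the stacked pyramid.
        self.weight = nn.Conv2d(channels * len(scales), len(scales), 1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, feat):
        h, w = feat.shape[2:]
        pyramid = [
            F.interpolate(F.avg_pool2d(feat, s), size=(h, w),
                          mode='bilinear', align_corners=False)
            for s in self.scales]
        # Per-pixel softmax over scales: learned importance per location.
        w_maps = torch.softmax(self.weight(torch.cat(pyramid, 1)), dim=1)
        fused = sum(w_maps[:, i:i + 1] * p for i, p in enumerate(pyramid))
        return self.proj(fused)
```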

369 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: A novel crowd counting (density estimation) framework called Adversarial Cross-Scale Consistency Pursuit (ACSCP) is proposed, which designs a novel scale-consistency regularizer enforcing that the sum of the crowd counts from local patches is coherent with the overall count of their region union.
Abstract: Crowd counting or density estimation is a challenging task in computer vision due to large scale variations, perspective distortions and serious occlusions, etc. Existing methods generally suffer from two issues: 1) the model averaging effects in multi-scale CNNs induced by the widely adopted $L_2$ regression loss; and 2) inconsistent estimation across different scaled inputs. To explicitly address these issues, we propose a novel crowd counting (density estimation) framework called Adversarial Cross-Scale Consistency Pursuit (ACSCP). On one hand, a U-net structured generation network is designed to generate the density map from an input patch, and an adversarial loss is directly employed to shrink the solution onto a realistic subspace, thus attenuating the blurry effects of density map estimation. On the other hand, we design a novel scale-consistency regularizer which enforces that the sum of the crowd counts from local patches (i.e., small scale) is coherent with the overall count of their region union (i.e., large scale). The above losses are integrated via a joint training scheme, so as to help boost density estimation performance by further exploring the collaboration between both objectives. Extensive experiments on four benchmarks have well demonstrated the effectiveness of the proposed innovations as well as the superior performance over prior art.
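
The scale-consistency regularizer itself is straightforward to sketch: the counts predicted on the four quadrants of an image should sum to the count predicted on the whole image. The 2x2 split follows the abstract; the L1 reduction here is an assumption.

```python
import torch

def scale_consistency(density_whole, density_patches):
    """Cross-scale consistency between whole-image and patch counts.

    density_whole:   (B, 1, H, W) whole-image density map
    density_patches: (B, 4, h, w) density maps of the four quadrants
    """
    whole_count = density_whole.sum(dim=(1, 2, 3))
    patch_count = density_patches.sum(dim=(1, 2, 3))
    return torch.abs(whole_count - patch_count).mean()
```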

348 citations


Cites background from "Generating High-Quality Crowd Densi..."

  • ...In most cases [37, 3, 20, 28], to deal with human scale changes, multiple convolution paths (sub-networks) with varying sized kernels are fused to yield the final density map prediction....


  • ...[34] MCNN [37] Switch-CNN [25] CP-CNN [28] ACSCP (ours)...


  • ...Both our method and CP-CNN are contemporary works starting to consider the quality of density maps....


  • ...However, it seems unfair that the training process of CP-CNN demands extra a priori density-class labels (i.e., global and local density classes) which are NOT directly provided by datasets....


  • ...First, although different sizes of convolutional kernels are utilized to extract multi-scale features [37, 28], (i....


References
Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
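
The core design point, stacking small 3x3 filters, is easy to sketch: two stacked 3x3 convolutions cover the receptive field of a single 5x5 while using fewer parameters and adding an extra non-linearity. Below is a VGG-style block under assumed channel counts.

```python
import torch.nn as nn

def vgg_style_block(in_ch, out_ch, num_convs=2):
    """Stack of 3x3 conv + ReLU layers followed by 2x2 max pooling."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)
```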

49,914 citations

Journal ArticleDOI
TL;DR: In this article, a structural similarity index is proposed for image quality assessment based on the degradation of structural information, and its promise is demonstrated through comparison with both subjective ratings and state-of-the-art objective methods on a database of images compressed with JPEG and JPEG2000.
Abstract: Objective methods for assessing perceptual image quality traditionally attempted to quantify the visibility of errors (differences) between a distorted image and a reference image using a variety of known properties of the human visual system. Under the assumption that human visual perception is highly adapted for extracting structural information from a scene, we introduce an alternative complementary framework for quality assessment based on the degradation of structural information. As a specific example of this concept, we develop a structural similarity index and demonstrate its promise through a set of intuitive examples, as well as comparison to both subjective ratings and state-of-the-art objective methods on a database of images compressed with JPEG and JPEG2000. A MATLAB implementation of the proposed algorithm is available online at http://www.cns.nyu.edu/~lcv/ssim/.
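
For reference, the structural similarity index introduced in this paper takes, for two aligned image patches $x$ and $y$, the standard form

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$

where $\mu$, $\sigma^2$ and $\sigma_{xy}$ are local means, variances and covariance, and $C_1$, $C_2$ are small constants that stabilize the division.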

40,609 citations

Journal ArticleDOI
08 Dec 2014
TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are simultaneously trained: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
Abstract: We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to ½ everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
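
The minimax two-player game described above corresponds to the value function

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))],$$

whose unique equilibrium has $G$ reproducing the data distribution and $D$ outputting $1/2$ everywhere.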

38,211 citations


"Generating High-Quality Crowd Densi..." refers methods in this paper

  • ...Furthermore, we train the CNNs in a Generative Adversarial Network (GAN) based framework [10] to exploit the recent success of adversarial loss to achieve high-quality and sharper density maps....


  • ...In a further attempt to improve the quality of density maps, the F-CNN is trained using a weighted combination of pixel-wise Euclidean loss and adversarial loss [10]....


Proceedings ArticleDOI
21 Jul 2017
TL;DR: Conditional adversarial networks are investigated as a general-purpose solution to image-to-image translation problems and it is demonstrated that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.
Abstract: We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. Moreover, since the release of the pix2pix software associated with this paper, hundreds of Twitter users have posted their own artistic experiments using our system. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either.
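
As the paper describes, the conditional adversarial loss is paired with an L1 reconstruction term, giving the full objective

$$G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G),$$

where $\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]$ and $\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\lVert y - G(x, z) \rVert_1]$.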

11,958 citations


"Generating High-Quality Crowd Densi..." refers background or methods in this paper

  • ...The use of adversarial loss helps us combat the widely acknowledged issue of blurred results obtained by minimizing only the Euclidean loss [13]....


  • ...Motivated by these observations and the recent success of GANs for overcoming the issues of L2-minimization [13], we attempt to further improve the quality of density maps by minimizing a weighted combination of pixel-wise Euclidean loss and adversarial loss....


  • ...It has been widely acknowledged that minimization of L2 error results in blurred results especially for image reconstruction tasks [13, 14, 45, 46, 47]....


Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper exploits the capability of global context information by different-region-based context aggregation through the pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet) to produce good quality results on the scene parsing task.
Abstract: Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in ImageNet scene parsing challenge 2016, PASCAL VOC 2012 benchmark and Cityscapes benchmark. A single PSPNet yields the new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes.
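
The pyramid pooling module can be sketched as follows: pool the backbone feature map into several bin sizes, reduce channels with 1x1 convolutions, upsample, and concatenate with the input. The bin sizes (1, 2, 3, 6) follow the paper; channel counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)
        # One branch per bin size: adaptive pool + 1x1 channel reduction.
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, 1, bias=False),
                          nn.ReLU(inplace=True))
            for b in bins])

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample each pooled prior back to the input resolution.
        priors = [F.interpolate(stage(x), size=(h, w), mode='bilinear',
                                align_corners=False)
                  for stage in self.stages]
        return torch.cat([x] + priors, dim=1)
```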

10,189 citations


"Generating High-Quality Crowd Densi..." refers background in this paper

  • ...Several recent works for semantic segmentation [21], scene parsing [51] and visual saliency [52] have demonstrated that incorporating contextual information can provide significant improvements in the results....
