
This is a repository copy of Efficient deep CNN-based fire detection and localization in
video surveillance applications.
White Rose Research Online URL for this paper:
http://eprints.whiterose.ac.uk/150629/
Version: Accepted Version
Article:
Muhammad, K., Ahmad, J., Lv, Z. et al. (3 more authors) (2019) Efficient deep CNN-based
fire detection and localization in video surveillance applications. IEEE Transactions on
Systems, Man, and Cybernetics: Systems, 49 (7). pp. 1419-1434. ISSN 2168-2216
https://doi.org/10.1109/tsmc.2018.2830099
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other users, including reprinting/ republishing this material for advertising or
promotional purposes, creating new collective works for resale or redistribution to servers
or lists, or reuse of any copyrighted components of this work in other works. Reproduced
in accordance with the publisher's self-archiving policy.
eprints@whiterose.ac.uk
https://eprints.whiterose.ac.uk/
Efficient Deep CNN-Based Fire Detection and Localization in Video Surveillance Applications

Khan Muhammad, Jamil Ahmad, Student Member, IEEE, Zhihan Lv, Member, IEEE, Paolo Bellavista, Senior Member, IEEE, Po Yang, Member, IEEE, Sung Wook Baik, Member, IEEE
Abstract—Convolutional neural networks (CNNs) have yielded
state-of-the-art performance in image classification and other
computer vision tasks. Their application in fire detection systems
will substantially improve detection accuracy, which will
eventually minimize fire disasters and reduce the ecological and
social ramifications. However, the major concern with
CNN-based fire detection systems is their implementation in
real-world surveillance networks, due to their high memory and
computational requirements for inference. In this work, we
propose an energy-friendly and computationally efficient CNN
architecture, inspired by the SqueezeNet architecture for fire
detection, localization, and semantic understanding of the scene of
the fire. It uses smaller convolutional kernels and contains no
dense, fully connected layers, which helps keep the computational
requirements to a minimum. Despite its low computational needs,
the experimental results demonstrate that our proposed solution
achieves accuracies that are comparable to other, more complex
models, mainly due to its increased depth. Moreover, the paper
shows how a trade-off can be reached between fire detection
accuracy and efficiency, by considering the specific characteristics
of the problem of interest and the variety of fire data.
Index Terms—Convolutional Neural Networks, Deep
Learning, Fire Detection, Fire Localization, Fire Disaster, Image
Classification, Surveillance Networks
I. INTRODUCTION
Recently, a variety of sensors have been introduced for
different applications such as setting off a fire alarm [1],
vehicle obstacle detection, visualizing the interior of the
human body for diagnosis [2-4], animal and ship monitoring,
and surveillance [5]. Of these applications, surveillance has
primarily attracted the attention of researchers due to the
enhanced embedded processing capabilities of cameras. Using
smart surveillance systems, various abnormal events such as
road accidents, fires, and medical emergencies can be detected
at early stages, and the appropriate authority can be
autonomously informed [6]. A fire is an abnormal event which
can cause significant damage to lives and property within a
very short time [7]. The main causes of such disasters are human error and system failures, which can result in severe loss of human life and other damage [8]. In Europe, fire disasters affect 10,000 km² of vegetation zones each year; in North America and Russia, the damage is about 100,000 km². In June 2013, fire disasters killed 19 firefighters and destroyed 100 houses in Arizona, USA. Similarly, another forest fire in August 2013 in California burned an area of 1042 km², causing a loss of $127.35 million [9]. According to an annual disaster report [10], fire disasters alone affected 494,000 people and resulted in a loss of $3.1 billion in 2015. In order to avoid such
disasters, it is important to detect fires at early stages utilizing
smart surveillance cameras.
Two broad categories of approach can be identified for fire
detection: traditional fire alarms and vision sensor-assisted fire
detection. Traditional fire alarm systems are based on sensors
that require close proximity for activation, such as infrared and
optical sensors. These sensors are not well suited to critical
environments and need human involvement to confirm a fire in
the case of an alarm, involving a visit to the location of the fire.
Furthermore, such systems cannot usually provide information
about the size, location, and burning degree of the fire. To
overcome these limitations, numerous vision sensor-based
methods have been explored by researchers in this field
[11-14]; these have the advantages of less human interference,
faster response, affordable cost, and larger surveillance
coverage. In addition, such systems can confirm a fire without
requiring a visit to the fire’s location, and can provide detailed
information about the fire, including its location, size, and burning degree. Despite these advantages, there are still some issues
with these systems, e.g. the complexity of the scenes under
observation, irregular lighting, and low-quality frames;
researchers have made several efforts to address these aspects,
taking into consideration both color and motion features.
Chen et al. [8] examined the dynamic behavior of fires using
RGB and HSI color models and proposed a decision
rule-assisted fire detection approach, which uses the irregular
properties of fire for detection. Their approach is based on
frame-to-frame differences, and hence cannot distinguish
between fire and fire-colored moving regions. Marbach et al.
[15] investigated the YUV color space using motion
information to classify pixels into fire and non-fire
components. Toreyin et al. [16] used temporal and spatial
wavelet analysis to determine fire and non-fire regions. Their
approach uses many heuristic thresholds, which greatly
restricts its real-world implementation. Han et al. [17]
compared video frames with their color information for tunnel
fire detection; this method is suitable only for static fires, as it is
based on numerous parameters. Celik et al. [18] explored the
YCbCr color space and presented a pixel classification method
for flames. To this end, they proposed novel rules for separating
the chrominance and luminance components. However, their
method is unable to detect fire from a large distance or at small
scales, which are important in the early detection of fires. In
addition to these color space-based techniques, Borges et al.
Efficient Deep CNN-Based Fire Detection and
Localization in Video Surveillance Applications
Khan Muhammad, Jamil Ahmad, Student Member, IEEE, Zhihan Lv, Member, IEEE, Paolo Bellavista,
Senior Member, IEEE, Po Yang, Member, IEEE , Sung Wook Baik, Member, IEEE
R

2
[19] utilized the low-level features including color, skewness,
and roughness in combination with a Bayes classifier for fire
recognition.
Rafiee et al. [20] investigated a multi-resolution 2D wavelet
analysis to improve the thresholding mechanism in the RGB
color space. Their method reduced the rate of false alarms by
considering variations in energy as well as shape; however,
false alarms can be higher in this approach for the case of rigid
body movements within the frames, such as the movement of a
human arm in the scene. In [21], the authors presented a
modified version of [20] based on the YUV color model, which
obtained better results than the RGB version. Another similar
method based on color information and an SVM classifier is
presented in [22]. This method can process 20 frames/sec;
however, it cannot detect a fire from a large distance or of small
size, which can occur in real-world surveillance footage.
Color-based methods typically generate more false alarms due to variations in shadows and brightness, and often misclassify people wearing red clothes or red vehicles as fire. Mueller et al. [23]
attempted to solve this issue by analyzing changes in the shape
of a fire and the movement of rigid objects. Their algorithm can
distinguish between rigid moving objects and a flame, based on
a feature vector extracted from the optical flow and the physical
behavior of a fire. De Lascio et al. [24] combined color and
motion information for the detection of fire in surveillance
videos. Dimitropoulos et al. [25] used spatio-temporal features
based on texture analysis followed by an SVM classifier to
classify candidate regions of the video frames into fire and
non-fire. This method is heavily dependent on the parameters
used; for instance, small-sized blocks increase the rate of false
alarms, while larger blocks reduce its sensitivity. Similarly, the
time window is also crucial to the performance of this system;
smaller values reduce the detection accuracy, while larger
values increase the computational complexity. These
dependencies greatly affect the feasibility of this approach for
implementation in real surveillance systems. Recently, the
authors of [21] proposed a real-time fire detection algorithm
based on color, shape, and motion features, combined in a
multi-expert system. The accuracy of this approach is higher
than that of other methods; however, the number of false alarms
is still high, and the accuracy of fire detection can be further
improved. A survey of the existing literature shows that
computationally expensive methods achieve better accuracy, while
simpler methods compromise on accuracy and the rate of false
positives. Hence, there is a need to find a better trade-off
between these metrics for several application scenarios of
practical interest, for which existing computationally expensive
methods do not fit well.
To address the above issues, we investigate convolutional
neural network (CNN)-based deep features for early fire
detection in surveillance networks. The key contributions can
be summarized as follows:
1. We avoid the time-consuming effort of conventional hand-crafted feature engineering for fire detection, and explore deep
learning architectures for early fire detection in closed-circuit
television (CCTV) surveillance networks for both indoor and
outdoor environments. Our proposed fire detection
framework improves fire detection accuracy and reduces
false alarms, compared to state-of-the-art methods. Thus, our
algorithm can play a vital role in the early detection of fire to
minimize damage.
2. We train and fine-tune an AlexNet architecture [26] for fire
detection using a transfer learning strategy. Our model
outperforms conventional fire detection methods based on hand-engineered features. However, the model remains
comparatively large in size (238 MB), making its
implementation difficult in resource-constrained equipment.
3. To reduce the size of the model, we fine-tune a model with a
similar architecture to the SqueezeNet model for fire
detection at the early stages. The size of the model was
reduced from 238 MB to 3 MB, a saving of 235 MB, thus minimizing the cost and making its
implementation more feasible in surveillance networks. The
proposed model requires 0.72 GFLOPS/image compared to
AlexNet, whose computational complexity is 2
GFLOPS/image. This makes our proposed model more
efficient in terms of inference, allowing it to process multiple
surveillance streams.
4. An intelligent feature map selection algorithm is proposed to choose appropriate feature maps from the convolutional
layers of the trained CNN, which are sensitive to fire regions.
These feature maps allow a more accurate segmentation of
fire compared to other hand-crafted methods. The
segmentation information can be further analyzed to assess
the essential characteristics of the fire, for instance its growth
rate. Using this approach, the severity of the fire and/or its
burning degree can also be determined. Another novel
characteristic of our system is the ability to identify the object
which is on fire, using a pre-trained model trained on 1,000
classes of objects in the ImageNet dataset. This enables our
approach to determine whether the fire is in a car, a house, a
forest or any other object. Using this semantic information,
firefighters can prioritize their targets by primarily focusing
on regions with the strongest fire.
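As an illustration of the idea in contribution 4, the following is a minimal sketch of selecting a fire-sensitive feature map and thresholding it into a rough segmentation mask. It assumes PyTorch, and the scoring criterion (mean activation inside versus outside ground-truth fire regions) is our own hypothetical stand-in, not the exact algorithm of the paper.

import torch

def select_fire_sensitive_map(feature_maps, fire_masks):
    """Score each channel by its mean activation inside ground-truth fire
    regions minus its mean activation outside, and return the best channel.

    feature_maps: (N, C, H, W) activations from a convolutional layer
    fire_masks:   (N, 1, H, W) binary fire masks at the same resolution
    """
    inside = (feature_maps * fire_masks).sum(dim=(0, 2, 3)) / fire_masks.sum().clamp(min=1)
    outside = (feature_maps * (1 - fire_masks)).sum(dim=(0, 2, 3)) / (1 - fire_masks).sum().clamp(min=1)
    return torch.argmax(inside - outside)  # index of the most fire-sensitive channel

def segment_fire(feature_map, threshold=0.5):
    """Binarize one selected feature map into a rough fire segmentation mask."""
    fmap = (feature_map - feature_map.min()) / (feature_map.max() - feature_map.min() + 1e-8)
    return fmap > threshold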
The remainder of this paper is organized as follows. We
propose our architecture in Section 2. Our experimental results
using benchmark datasets and a feasibility analysis of the
proposed work are discussed in Section 3. Finally, the
manuscript is concluded in Section 4 and possible future
research directions are suggested.
II. THE PROPOSED FRAMEWORK
Fire detection using hand-crafted features is a tedious task,
due to the time-consuming nature of feature engineering. It is
particularly challenging to detect a fire at an early stage in
scenes with changing lighting conditions, shadows, and
fire-like objects; conventional low-level feature-based methods
generate a high rate of false alarms and have low detection
accuracy. To overcome these issues, we investigate deep
learning models for possible fire detection at early stages
during surveillance. Taking into consideration the accuracy, the
embedded processing capabilities of smart cameras, and the
number of false alarms, we examine various deep CNNs for the
target problem. A systematic diagram of our framework is
given in Fig. 1.

Fig. 1: Overview of the proposed system for fire detection using a deep CNN
A. Convolutional Neural Network Architecture
CNNs have shown encouraging performance in numerous
computer vision problems and applications, such as object
detection and localization [27, 28], image segmentation,
super-resolution, classification [29-31], and indexing and
retrieval [32]. This widespread success is due to their
hierarchical structure, which automatically learns very strong
features from raw data. A typical CNN architecture consists of
three well-known processing layers: 1) a convolution layer,
where various feature maps are produced when different
kernels are applied to the input data; 2) a pooling layer, which
is used for the selection of maximum activation considering a
small neighborhood of feature maps received from the
previous convolution layer; the goal of this layer is to achieve
translation invariance to some extent and dimensionality
reduction; and 3) a fully connected layer which models
high-level information from the input data and constructs its
global representation. This layer follows numerous stacks of
convolution and pooling layers, thus resulting in a high-level
representation of the input data. These layers are arranged in a
hierarchical architecture such that the output of one layer acts
as the input of the next layer. During the training phase, the
weights of all neurons in convolutional kernels and fully
connected layers are adjusted and learnt. These weights model
the representative characteristics of the input training data, and
in turn can perform the target classification.
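To make the three layer types concrete, the following is a minimal illustrative sketch of a generic convolution, pooling, and fully connected stack. It assumes PyTorch (the paper's implementation uses Caffe) and is a textbook toy network, not the proposed architecture.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # 1) convolution: kernels slide over the input and produce feature maps
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        # 2) max pooling: keeps the maximum activation in each 2x2 neighborhood,
        #    giving partial translation invariance and dimensionality reduction
        self.pool = nn.MaxPool2d(kernel_size=2)
        # 3) fully connected: models high-level, global information
        self.fc = nn.Linear(16 * 112 * 112, num_classes)

    def forward(self, x):  # x: (N, 3, 224, 224)
        x = self.pool(torch.relu(self.conv(x)))
        return self.fc(x.flatten(1))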
We use a model with an architecture similar to that of
SqueezeNet [33], modified in accordance with our target
problem. The original model was trained on the ImageNet
dataset and is capable of classifying 1000 different objects. In
our case, however, we used this architecture to detect fire and
non-fire images. This was achieved by reducing the number of
neurons in the final layer from 1000 to 2. By keeping the rest
of the architecture similar to the original, we aimed to reuse the
parameters to solve the fire detection problem more
effectively.
There are several reasons for this choice, such as a lower communication cost between servers in the case of distributed training, and a higher feasibility of deployment on FPGAs, application-specific integrated circuits, and other hardware architectures with memory and bandwidth constraints. The model consists of two regular
convolutional layers, three max pooling layers, one average
pooling layer, and eight modules called “fire modules”. The
input of the model is color images with dimensions of
224×224×3 pixels. In the first convolution layer, 64 filters of
size 3×3 are applied to the input image, generating 64 feature
maps. The maximum activations of these 64 features maps are
selected by the first max pooling layer with a stride of two
pixels, using a neighborhood of 3×3 pixels. This reduces the
size of the feature maps by a factor of two, thus retaining the
most useful information while discarding the less important
details. Next, we use two fire modules of 128 filters, followed
by another fire module of 256 filters. Each fire module performs a squeeze convolution followed by an expansion.
Since each module consists of multiple filter resolutions and
there is no native support for such convolution layers in the
Caffe framework [34], an expansion layer was introduced,
with two separate convolution layers in each fire module. The
first convolution layer contains 1×1 filters, while the second
layer consists of 3×3 filters. The output of these two layers is
concatenated in the channel dimension. Following the three
fire modules, there is another max pooling layer which
operates in the same way as the first max pooling layer.
Following the last fire module (Fire9) of 512 filters, we
modify the convolution layer according to the problem of
interest by reducing the number of classes to two (M = 2: fire and normal). The output of this layer is passed to the average pooling layer, whose result is fed directly into the
Softmax classifier to calculate the probabilities of the two
target classes.
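As an illustration of the structure just described, the following is a minimal sketch of a fire module and of the final layers modified for the two-class problem. It is written in PyTorch as an assumption (the paper's implementation is in Caffe), and the per-module filter counts are parameters to be filled in from the description above.

import torch
import torch.nn as nn

class FireModule(nn.Module):
    """Squeeze (1x1) convolution followed by parallel 1x1/3x3 expand convolutions."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        # the two expand branches are concatenated in the channel dimension
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Final layers adapted to M = 2 classes (fire and normal): a two-channel
# convolution, global average pooling, and a softmax over the two classes.
head = nn.Sequential(
    nn.Conv2d(512, 2, kernel_size=1),  # 512 channels out of the last fire module
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Softmax(dim=1),
)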

A significant number of weights need to be properly
adjusted in CNNs, and a huge amount of training data is
usually required for this. These parameters can suffer from
overfitting if insufficient training data is used. The fully
connected layers usually contain the most parameters, and
these can cause significant overfitting. These problems can be
avoided by introducing regularization layers such as dropout,
or by replacing dense fully connected layers with convolution
layers. In view of this, a number of models were trained based
on the collected training data. Several benchmark datasets
were then used to evaluate the classification performance of
these models. During the experiments, a transfer learning
strategy was also explored in an attempt to further improve the
accuracy. Interestingly, we achieved an improvement in
classification accuracy of approximately 5% for the test data
after fine-tuning. A transfer learning strategy can solve
problems more efficiently based on the re-use of previously
learned knowledge. This reflects the human strategy of
applying existing knowledge to different problems in several
domains of interest. Employing this strategy, we used a
pre-trained SqueezeNet model and fine-tuned it according to
our classification problem with a slower learning rate of 0.001.
We also removed the last fully connected layers to make the
architecture as efficient as possible in terms of classification
accuracy. The process of fine-tuning was executed for 10
epochs; this increased the classification accuracy from 89.8%
to 94.50%, an improvement of approximately 5%.
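A minimal sketch of this fine-tuning recipe is given below, assuming PyTorch and torchvision rather than the Caffe setup actually used in the paper; the training batch shown is a stand-in for a real fire/normal image loader.

import torch
import torch.nn as nn
from torchvision import models

# start from the ImageNet pre-trained SqueezeNet and re-target its classifier
model = models.squeezenet1_0(weights="IMAGENET1K_V1")
model.classifier[1] = nn.Conv2d(512, 2, kernel_size=1)  # 1000 classes -> 2
model.num_classes = 2

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # slower learning rate
criterion = nn.CrossEntropyLoss()

# stand-in batch; replace with a loader over real fire/normal images
train_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,)))]

for epoch in range(10):  # fine-tuning executed for 10 epochs
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()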
Fig. 2: Prediction scores for a set of query images using the proposed deep
CNN.
B. Differences from Other Network Models
The key difference between our proposed CNN architecture in Fig. 3 and SqueezeNet [33] is that our model simplifies the SqueezeNet model by using no residual connections, which makes it more lightweight while keeping its computational efficiency balanced.
As shown in Fig. 3, there is an architectural similarity between our CNN's fire modules and Inception modules: like Inception modules, fire modules have multiple sizes of filters at the same level of depth in the network. For example, Inception-v1 modules have multiple instances of 1×1, 3×3, and 5×5 filters alongside each other. This raises the relevant question of how a CNN architect should decide how many filters of each size to have in each module. Some versions of the Inception modules have 10 or more filter banks per module.
Doing careful A/B comparisons of "how many of each type of
filter" would easily lead to a combinatorial explosion. But, in
the Fire modules, there are just 3 filter banks (1x1_1, 1x1_2,
and 3x3_2). With this setup, a further question arises: what are the tradeoffs between "many 1x1_2 and few 3x3_2" and "few 1x1_2 and many 3x3_2" filters, in terms of metrics such as model size and accuracy? From the SqueezeNet experiments [33], it is evident that 50% 1x1_2 and 50% 3x3_2 filters yield the same accuracy level as 99% 3x3_2 filters, but with a significant difference in model size and computational footprint. The lesson learned is that it pays to adopt, to some extent, a simple step-by-step methodology: look for the point where adding more spatial resolution to the filters stops improving accuracy, and stop there; beyond that point, computation and model parameters are being wasted.
Our proposed network also compares favorably with other network models such as AlexNet [26] and GoogLeNet [27]. It is lightweight, requiring only 3 MB of memory, which is less than AlexNet and GoogLeNet, and it is computationally inexpensive, requiring only 0.72 GFLOPS/image compared to other networks such as AlexNet (which needs 2 GFLOPS/image). Thus, our proposed model maintains a better trade-off between computational complexity, memory requirements, fire detection accuracy, and the number of false alarms than other networks.
Looking at GoogLeNet-v1, some of the Inception-v1 modules are set up such that the early filter banks have 75% as many filters as the late filter banks, as if they had a "squeeze ratio" (SR) of 0.75. Another interesting point was to find the tradeoffs that emerge if the number of filters at the beginning of each module is cut down more aggressively. It was experimentally found, again, that there is a saturation point: going from SR = 0.75 to SR = 1.0 increases the computational footprint and model size, but does not significantly improve accuracy. Thus, the fire modules have been very useful in our experience for understanding the tradeoffs that emerge when selecting the number of filters inside CNN modules.
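The following illustrative helper shows how the squeeze ratio and the 3x3-filter percentage jointly determine the filter counts in a fire module. It is a sketch using the terminology of the SqueezeNet paper; the default values are examples for discussion, not the exact configuration of our model.

def fire_module_filters(expand_filters: int, sr: float = 0.75, pct_3x3: float = 0.5):
    """Return (squeeze, expand 1x1, expand 3x3) filter counts for one fire module.

    sr: squeeze ratio, the fraction of expand filters kept in the squeeze layer.
    pct_3x3: fraction of the expand filters that are 3x3 rather than 1x1.
    """
    squeeze = int(sr * expand_filters)
    expand3x3 = int(pct_3x3 * expand_filters)
    expand1x1 = expand_filters - expand3x3
    return squeeze, expand1x1, expand3x3

# e.g. a 256-filter module with SR = 0.75 and a 50/50 expand split
print(fire_module_filters(256))  # -> (192, 128, 128)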
C. Deep CNN for Fire Detection and Localization
Although deep CNN architectures learn very strong
features automatically from raw data, some effort is required
to train the appropriate model considering the quality and
quantity of the available data and the nature of the target
problem. We trained various models with different parameter
settings, and following the fine-tuning process obtained an
optimal model which can detect fire from a large distance and
at a small scale, under varying conditions, and in both indoor
and outdoor scenarios.
Another motivational factor for the proposed deep CNN
was the avoidance of pre-processing and feature engineering,
which are required by traditional fire detection algorithms. To
test a given image, it is fed forward through the deep CNN,
which assigns a label of ‘fire’ or ‘normal’ to the input image.
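A minimal test-time sketch of this forward pass is given below, assuming PyTorch and the fine-tuned model from the earlier sketch; the preprocessing statistics are the standard ImageNet values, and the mapping of output indices to the 'fire' and 'normal' classes is an assumption that depends on the training labels.

import torch
from torchvision import transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # the network's input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # standard ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])

def classify(model, image_path):
    """Feed one image forward and return its 'fire' / 'normal' label."""
    model.eval()
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]
    # index 0 = fire, index 1 = normal is assumed here
    return "fire" if probs[0] > probs[1] else "normal"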

References
A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
S. J. Pan and Q. Yang, "A Survey on Transfer Learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional Architecture for Fast Feature Embedding," in Proc. ACM International Conference on Multimedia, 2014.
F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv:1602.07360, 2016.