
This is a repository copy of Efficient deep CNN-based fire detection and localization in
video surveillance applications.
White Rose Research Online URL for this paper:
http://eprints.whiterose.ac.uk/150629/
Version: Accepted Version
Article:
Muhammad, K., Ahmad, J., Lv, Z. et al. (3 more authors) (2019) Efficient deep CNN-based
fire detection and localization in video surveillance applications. IEEE Transactions on
Systems, Man, and Cybernetics: Systems, 49 (7). pp. 1419-1434. ISSN 2168-2216
https://doi.org/10.1109/tsmc.2018.2830099
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other users, including reprinting/ republishing this material for advertising or
promotional purposes, creating new collective works for resale or redistribution to servers
or lists, or reuse of any copyrighted components of this work in other works. Reproduced
in accordance with the publisher's self-archiving policy.
eprints@whiterose.ac.uk
https://eprints.whiterose.ac.uk/
Efficient Deep CNN-Based Fire Detection and Localization in Video Surveillance Applications

Khan Muhammad, Jamil Ahmad, Student Member, IEEE, Zhihan Lv, Member, IEEE, Paolo Bellavista, Senior Member, IEEE, Po Yang, Member, IEEE, Sung Wook Baik, Member, IEEE
Abstract—Convolutional neural networks (CNNs) have yielded
state-of-the-art performance in image classification and other
computer vision tasks. Their application in fire detection systems
will substantially improve detection accuracy, which will
eventually minimize fire disasters and reduce the ecological and
social ramifications. However, the major concern with
CNN-based fire detection systems is their implementation in
real-world surveillance networks, due to their high memory and
computational requirements for inference. In this work, we
propose an energy-friendly and computationally efficient CNN
architecture, inspired by the SqueezeNet architecture for fire
detection, localization, and semantic understanding of the scene of
the fire. It uses smaller convolutional kernels and contains no
dense, fully connected layers, which helps keep the computational
requirements to a minimum. Despite its low computational needs,
the experimental results demonstrate that our proposed solution
achieves accuracies that are comparable to other, more complex
models, mainly due to its increased depth. Moreover, the paper
shows how a trade-off can be reached between fire detection
accuracy and efficiency, by considering the specific characteristics
of the problem of interest and the variety of fire data.
Index Terms—Convolutional Neural Networks, Deep
Learning, Fire Detection, Fire Localization, Fire Disaster, Image
Classification, Surveillance Networks
I. INTRODUCTION
Recently, a variety of sensors have been introduced for
different applications such as setting off a fire alarm [1],
vehicle obstacle detection, visualizing the interior of the
human body for diagnosis [2-4], animal and ship monitoring,
and surveillance [5]. Of these applications, surveillance has
primarily attracted the attention of researchers due to the
enhanced embedded processing capabilities of cameras. Using
smart surveillance systems, various abnormal events such as
road accidents, fires, and medical emergencies can be detected
at early stages, and the appropriate authority can be
autonomously informed [6]. A fire is an abnormal event which
can cause significant damage to lives and property within a
very short time [7]. The main causes of such disasters are human error and system failures, which can result in severe loss of human life and other damage [8]. In Europe, fire disasters affect 10,000 km² of vegetation zones each year; in North America and Russia, the damage is about 100,000 km². In June 2013, fire disasters killed 19 firefighters and destroyed 100 houses in Arizona, USA. Similarly, another forest fire in August 2013 in California burned an area of 1042 km², causing a loss of $127.35 million [9]. According to an annual disaster report [10], fire disasters alone affected 494,000 people and resulted in a loss of $3.1 billion in 2015. In order to avoid such
disasters, it is important to detect fires at early stages utilizing
smart surveillance cameras.
Two broad categories of approach can be identified for fire
detection: traditional fire alarms and vision sensor-assisted fire
detection. Traditional fire alarm systems are based on sensors
that require close proximity for activation, such as infrared and
optical sensors. These sensors are not well suited to critical
environments and need human involvement to confirm a fire in
the case of an alarm, involving a visit to the location of the fire.
Furthermore, such systems cannot usually provide information
about the size, location, and burning degree of the fire. To
overcome these limitations, numerous vision sensor-based
methods have been explored by researchers in this field
[11-14]; these have the advantages of less human interference,
faster response, affordable cost, and larger surveillance
coverage. In addition, such systems can confirm a fire without
requiring a visit to the fire’s location, and can provide detailed
information about the fire, including its location, size, and burning degree. Despite these advantages, there are still some issues
with these systems, e.g. the complexity of the scenes under
observation, irregular lighting, and low-quality frames;
researchers have made several efforts to address these aspects,
taking into consideration both color and motion features.
Chen et al. [8] examined the dynamic behavior of fires using
RGB and HSI color models and proposed a decision
rule-assisted fire detection approach, which uses the irregular
properties of fire for detection. Their approach is based on
frame-to-frame differences, and hence cannot distinguish
between fire and fire-colored moving regions. Marbach et al.
[15] investigated the YUV color space using motion
information to classify pixels into fire and non-fire
components. Toreyin et al. [16] used temporal and spatial
wavelet analysis to determine fire and non-fire regions. Their
approach uses many heuristic thresholds, which greatly
restricts its real-world implementation. Han et al. [17]
compared video frames with their color information for tunnel
fire detection; this method is suitable only for static fires, as it is
based on numerous parameters. Celik et al. [18] explored the
YCbCr color space and presented a pixel classification method
for flames. To this end, they proposed novel rules for separating
the chrominance and luminance components. However, their
method is unable to detect fire from a large distance or at small
scales, which are important in the early detection of fires. In
addition to these color space-based techniques, Borges et al.
Efficient Deep CNN-Based Fire Detection and
Localization in Video Surveillance Applications
Khan Muhammad, Jamil Ahmad, Student Member, IEEE, Zhihan Lv, Member, IEEE, Paolo Bellavista,
Senior Member, IEEE, Po Yang, Member, IEEE , Sung Wook Baik, Member, IEEE
R

2
[19] utilized the low-level features including color, skewness,
and roughness in combination with a Bayes classifier for fire
recognition.
Rafiee et al. [20] investigated a multi-resolution 2D wavelet
analysis to improve the thresholding mechanism in the RGB
color space. Their method reduced the rate of false alarms by
considering variations in energy as well as shape; however,
false alarms can be higher in this approach for the case of rigid
body movements within the frames, such as the movement of a
human arm in the scene. In [21], the authors presented a
modified version of [20] based on the YUV color model, which
obtained better results than the RGB version. Another similar
method based on color information and an SVM classifier is
presented in [22]. This method can process 20 frames/sec;
however, it cannot detect a fire from a large distance or of small
size, which can occur in real-world surveillance footage.
Color-based methods typically generate more false alarms due to variations in shadows and brightness, and often misclassify people wearing red clothes or red vehicles as fire. Mueller et al. [23]
attempted to solve this issue by analyzing changes in the shape
of a fire and the movement of rigid objects. Their algorithm can
distinguish between rigid moving objects and a flame, based on
a feature vector extracted from the optical flow and the physical
behavior of a fire. De Lascio et al. [24] combined color and
motion information for the detection of fire in surveillance
videos. Dimitropoulos et al. [25] used spatio-temporal features
based on texture analysis followed by an SVM classifier to
classify candidate regions of the video frames into fire and
non-fire. This method is heavily dependent on the parameters
used; for instance, small-sized blocks increase the rate of false
alarms, while larger blocks reduce its sensitivity. Similarly, the
time window is also crucial to the performance of this system;
smaller values reduce the detection accuracy, while larger
values increase the computational complexity. These
dependencies greatly affect the feasibility of this approach for
implementation in real surveillance systems. Recently, the
authors of [21] proposed a real-time fire detection algorithm
based on color, shape, and motion features, combined in a
multi-expert system. The accuracy of this approach is higher
than that of other methods; however, the number of false alarms
is still high, and the accuracy of fire detection can be further
improved. A survey of the existing literature shows that
computationally expensive methods achieve better accuracy, while
simpler methods compromise on accuracy and the rate of false
positives. Hence, there is a need to find a better trade-off
between these metrics for several application scenarios of
practical interest, for which existing computationally expensive
methods do not fit well.
To address the above issues, we investigate convolutional
neural network (CNN)-based deep features for early fire
detection in surveillance networks. The key contributions can
be summarized as follows:
1. We avoid the time-consuming effort of conventional hand-crafted feature engineering for fire detection, and explore deep
learning architectures for early fire detection in closed-circuit
television (CCTV) surveillance networks for both indoor and
outdoor environments. Our proposed fire detection
framework improves fire detection accuracy and reduces
false alarms, compared to state-of-the-art methods. Thus, our
algorithm can play a vital role in the early detection of fire to
minimize damage.
2. We train and fine-tune an AlexNet architecture [26] for fire
detection using a transfer learning strategy. Our model
outperforms conventional fire detection methods based on hand-engineered features. However, the model remains
comparatively large in size (238 MB), making its
implementation difficult in resource-constrained equipment.
3. To reduce the size of the model, we fine-tune a model with a
similar architecture to the SqueezeNet model for fire
detection at the early stages. The size of the model was
reduced from 238 MB to 3 MB, a saving of 235 MB, thus minimizing the cost and making its
implementation more feasible in surveillance networks. The
proposed model requires 0.72 GFLOPS/image compared to
AlexNet, whose computational complexity is 2
GFLOPS/image. This makes our proposed model more
efficient in terms of inference, allowing it to process multiple
surveillance streams.
4. An intelligent feature map selection algorithm is proposed to choose appropriate feature maps from the convolutional
layers of the trained CNN, which are sensitive to fire regions.
These feature maps allow a more accurate segmentation of
fire compared to other hand-crafted methods. The
segmentation information can be further analyzed to assess
the essential characteristics of the fire, for instance its growth
rate. Using this approach, the severity of the fire and/or its
burning degree can also be determined. Another novel
characteristic of our system is the ability to identify the object
which is on fire, using a pre-trained model trained on 1,000
classes of objects in the ImageNet dataset. This enables our
approach to determine whether the fire is in a car, a house, a
forest or any other object. Using this semantic information,
firefighters can prioritize their targets by primarily focusing
on regions with the strongest fire.
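As an illustration of the idea in contribution 4, the following is a minimal sketch of selecting a fire-sensitive feature map and thresholding it into a rough segmentation mask. It assumes PyTorch, and the scoring criterion (mean activation inside versus outside ground-truth fire regions) is our own hypothetical stand-in, not the exact algorithm of the paper.

import torch

def select_fire_sensitive_map(feature_maps, fire_masks):
    """Score each channel by its mean activation inside ground-truth fire
    regions minus its mean activation outside, and return the best channel.

    feature_maps: (N, C, H, W) activations from a convolutional layer
    fire_masks:   (N, 1, H, W) binary fire masks at the same resolution
    """
    inside = (feature_maps * fire_masks).sum(dim=(0, 2, 3)) / fire_masks.sum().clamp(min=1)
    outside = (feature_maps * (1 - fire_masks)).sum(dim=(0, 2, 3)) / (1 - fire_masks).sum().clamp(min=1)
    return torch.argmax(inside - outside)  # index of the most fire-sensitive channel

def segment_fire(feature_map, threshold=0.5):
    """Binarize one selected feature map into a rough fire segmentation mask."""
    fmap = (feature_map - feature_map.min()) / (feature_map.max() - feature_map.min() + 1e-8)
    return fmap > threshold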
The remainder of this paper is organized as follows. We
propose our architecture in Section 2. Our experimental results
using benchmark datasets and a feasibility analysis of the
proposed work are discussed in Section 3. Finally, the
manuscript is concluded in Section 4 and possible future
research directions are suggested.
II. THE PROPOSED FRAMEWORK
Fire detection using hand-crafted features is a tedious task,
due to the time-consuming nature of feature engineering. It is
particularly challenging to detect a fire at an early stage in
scenes with changing lighting conditions, shadows, and
fire-like objects; conventional low-level feature-based methods
generate a high rate of false alarms and have low detection
accuracy. To overcome these issues, we investigate deep
learning models for possible fire detection at early stages
during surveillance. Taking into consideration the accuracy, the
embedded processing capabilities of smart cameras, and the
number of false alarms, we examine various deep CNNs for the
target problem. A systematic diagram of our framework is
given in Fig. 1.

Fig. 1: Overview of the proposed system for fire detection using a deep CNN
A. Convolutional Neural Network Architecture
CNNs have shown encouraging performance in numerous
computer vision problems and applications, such as object
detection and localization [27, 28], image segmentation,
super-resolution, classification [29-31], and indexing and
retrieval [32]. This widespread success is due to their
hierarchical structure, which automatically learns very strong
features from raw data. A typical CNN architecture consists of
three well-known processing layers: 1) a convolution layer,
where various feature maps are produced when different
kernels are applied to the input data; 2) a pooling layer, which
is used for the selection of maximum activation considering a
small neighborhood of feature maps received from the
previous convolution layer; the goal of this layer is to achieve
translation invariance to some extent and dimensionality
reduction; and 3) a fully connected layer which models
high-level information from the input data and constructs its
global representation. This layer follows numerous stacks of
convolution and pooling layers, thus resulting in a high-level
representation of the input data. These layers are arranged in a
hierarchical architecture such that the output of one layer acts
as the input of the next layer. During the training phase, the
weights of all neurons in convolutional kernels and fully
connected layers are adjusted and learnt. These weights model
the representative characteristics of the input training data, and
in turn can perform the target classification.
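To make the three layer types concrete, the following is a minimal illustrative sketch of a generic convolution, pooling, and fully connected stack. It assumes PyTorch (the paper's implementation uses Caffe) and is a textbook toy network, not the proposed architecture.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # 1) convolution: kernels slide over the input and produce feature maps
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        # 2) max pooling: keeps the maximum activation in each 2x2 neighborhood,
        #    giving partial translation invariance and dimensionality reduction
        self.pool = nn.MaxPool2d(kernel_size=2)
        # 3) fully connected: models high-level, global information
        self.fc = nn.Linear(16 * 112 * 112, num_classes)

    def forward(self, x):  # x: (N, 3, 224, 224)
        x = self.pool(torch.relu(self.conv(x)))
        return self.fc(x.flatten(1))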
We use a model with an architecture similar to that of
SqueezeNet [33], modified in accordance with our target
problem. The original model was trained on the ImageNet
dataset and is capable of classifying 1000 different objects. In
our case, however, we used this architecture to detect fire and
non-fire images. This was achieved by reducing the number of
neurons in the final layer from 1000 to 2. By keeping the rest
of the architecture similar to the original, we aimed to reuse the
parameters to solve the fire detection problem more
effectively.
There are several reasons for this choice, such as a lower communication cost between servers in the case of distributed training, and a higher feasibility of deployment on FPGAs, application-specific integrated circuits, and other hardware architectures with memory and bandwidth constraints. The model consists of two regular
convolutional layers, three max pooling layers, one average
pooling layer, and eight modules called “fire modules”. The
input of the model is color images with dimensions of
224×224×3 pixels. In the first convolution layer, 64 filters of
size 3×3 are applied to the input image, generating 64 feature
maps. The maximum activations of these 64 features maps are
selected by the first max pooling layer with a stride of two
pixels, using a neighborhood of 3×3 pixels. This reduces the
size of the feature maps by a factor of two, thus retaining the
most useful information while discarding the less important
details. Next, we use two fire modules of 128 filters, followed
by another fire module of 256 filters. Each fire module performs a squeeze convolution followed by an expansion.
Since each module consists of multiple filter resolutions and
there is no native support for such convolution layers in the
Caffe framework [34], an expansion layer was introduced,
with two separate convolution layers in each fire module. The
first convolution layer contains 1×1 filters, while the second
layer consists of 3×3 filters. The output of these two layers is
concatenated in the channel dimension. Following the three
fire modules, there is another max pooling layer which
operates in the same way as the first max pooling layer.
Following the last fire module (Fire9) of 512 filters, we
modify the convolution layer according to the problem of
interest by reducing the number of classes to two (M = 2: fire and normal). The output of this layer is passed to the average pooling layer, whose result is fed directly into the
Softmax classifier to calculate the probabilities of the two
target classes.
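As an illustration of the structure just described, the following is a minimal sketch of a fire module and of the final layers modified for the two-class problem. It is written in PyTorch as an assumption (the paper's implementation is in Caffe), and the per-module filter counts are parameters to be filled in from the description above.

import torch
import torch.nn as nn

class FireModule(nn.Module):
    """Squeeze (1x1) convolution followed by parallel 1x1/3x3 expand convolutions."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        # the two expand branches are concatenated in the channel dimension
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Final layers adapted to M = 2 classes (fire and normal): a two-channel
# convolution, global average pooling, and a softmax over the two classes.
head = nn.Sequential(
    nn.Conv2d(512, 2, kernel_size=1),  # 512 channels out of the last fire module
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Softmax(dim=1),
)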

A significant number of weights need to be properly
adjusted in CNNs, and a huge amount of training data is
usually required for this. These parameters can suffer from
overfitting if insufficient training data is used. The fully
connected layers usually contain the most parameters, and
these can cause significant overfitting. These problems can be
avoided by introducing regularization layers such as dropout,
or by replacing dense fully connected layers with convolution
layers. In view of this, a number of models were trained based
on the collected training data. Several benchmark datasets
were then used to evaluate the classification performance of
these models. During the experiments, a transfer learning
strategy was also explored in an attempt to further improve the
accuracy. Interestingly, we achieved an improvement in
classification accuracy of approximately 5% for the test data
after fine-tuning. A transfer learning strategy can solve
problems more efficiently based on the re-use of previously
learned knowledge. This reflects the human strategy of
applying existing knowledge to different problems in several
domains of interest. Employing this strategy, we used a
pre-trained SqueezeNet model and fine-tuned it according to
our classification problem with a slower learning rate of 0.001.
We also removed the last fully connected layers to make the
architecture as efficient as possible in terms of classification
accuracy. The process of fine-tuning was executed for 10
epochs; this increased the classification accuracy from 89.8%
to 94.50%, an improvement of approximately 5%.
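A minimal sketch of this fine-tuning recipe is given below, assuming PyTorch and torchvision rather than the Caffe setup actually used in the paper; the training batch shown is a stand-in for a real fire/normal image loader.

import torch
import torch.nn as nn
from torchvision import models

# start from the ImageNet pre-trained SqueezeNet and re-target its classifier
model = models.squeezenet1_0(weights="IMAGENET1K_V1")
model.classifier[1] = nn.Conv2d(512, 2, kernel_size=1)  # 1000 classes -> 2
model.num_classes = 2

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # slower learning rate
criterion = nn.CrossEntropyLoss()

# stand-in batch; replace with a loader over real fire/normal images
train_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,)))]

for epoch in range(10):  # fine-tuning executed for 10 epochs
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()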
Fig. 2: Prediction scores for a set of query images using the proposed deep
CNN.
B. Differences from Other Network Models
The key difference between our proposed CNN architecture in Fig. 3 and SqueezeNet [33] is that our model simplifies the SqueezeNet model by using no residual connections, which makes it more lightweight while keeping its computational efficiency balanced.
As shown in Fig. 3, there is an architectural similarity between our CNN's fire modules and Inception modules: like Inception modules, fire modules have multiple sizes of filters at the same level of depth in the network. For example, Inception-v1 modules have multiple instances of 1×1, 3×3, and 5×5 filters alongside each other. This raises the relevant question of how a CNN architect should decide how many filters of each size to have in each module. Some versions of the Inception modules have 10 or more filter banks per module.
Doing careful A/B comparisons of "how many of each type of
filter" would easily lead to a combinatorial explosion. But, in
the Fire modules, there are just 3 filter banks (1x1_1, 1x1_2,
and 3x3_2). With this setup, a further question arises: what are the tradeoffs between "many 1x1_2 and few 3x3_2" and "few 1x1_2 and many 3x3_2" filters, in terms of metrics such as model size and accuracy? From the SqueezeNet experiments [33], it is evident that 50% 1x1_2 and 50% 3x3_2 filters yield the same accuracy level as 99% 3x3_2 filters, but with a significant difference in model size and computational footprint. The lesson learned is that it pays to adopt, to some extent, a simple step-by-step methodology: look for the point where adding more spatial resolution to the filters stops improving accuracy, and stop there; beyond that point, computation and model parameters are being wasted.
Our proposed network also compares favorably with other network models such as AlexNet [26] and GoogLeNet [27]. It is lightweight, requiring only 3 MB of memory, which is less than AlexNet and GoogLeNet, and it is computationally inexpensive, requiring only 0.72 GFLOPS/image compared to other networks such as AlexNet (which needs 2 GFLOPS/image). Thus, our proposed model maintains a better trade-off between computational complexity, memory requirements, fire detection accuracy, and the number of false alarms than other networks.
Looking at GoogLeNet-v1, some of the Inception-v1 modules are set up such that the early filter banks have 75% as many filters as the late filter banks, as if they had a "squeeze ratio" (SR) of 0.75. Another interesting point was to find the tradeoffs that emerge if the number of filters at the beginning of each module is cut down more aggressively. It was experimentally found, again, that there is a saturation point: going from SR = 0.75 to SR = 1.0 increases the computational footprint and model size, but does not significantly improve accuracy. Thus, the fire modules have been very useful in our experience for understanding the tradeoffs that emerge when selecting the number of filters inside CNN modules.
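The following illustrative helper shows how the squeeze ratio and the 3x3-filter percentage jointly determine the filter counts in a fire module. It is a sketch using the terminology of the SqueezeNet paper; the default values are examples for discussion, not the exact configuration of our model.

def fire_module_filters(expand_filters: int, sr: float = 0.75, pct_3x3: float = 0.5):
    """Return (squeeze, expand 1x1, expand 3x3) filter counts for one fire module.

    sr: squeeze ratio, the fraction of expand filters kept in the squeeze layer.
    pct_3x3: fraction of the expand filters that are 3x3 rather than 1x1.
    """
    squeeze = int(sr * expand_filters)
    expand3x3 = int(pct_3x3 * expand_filters)
    expand1x1 = expand_filters - expand3x3
    return squeeze, expand1x1, expand3x3

# e.g. a 256-filter module with SR = 0.75 and a 50/50 expand split
print(fire_module_filters(256))  # -> (192, 128, 128)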
C. Deep CNN for Fire Detection and Localization
Although deep CNN architectures learn very strong
features automatically from raw data, some effort is required
to train the appropriate model considering the quality and
quantity of the available data and the nature of the target
problem. We trained various models with different parameter
settings, and following the fine-tuning process obtained an
optimal model which can detect fire from a large distance and
at a small scale, under varying conditions, and in both indoor
and outdoor scenarios.
Another motivational factor for the proposed deep CNN
was the avoidance of pre-processing and feature engineering,
which are required by traditional fire detection algorithms. To
test a given image, it is fed forward through the deep CNN,
which assigns a label of ‘fire’ or ‘normal’ to the input image.
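A minimal test-time sketch of this forward pass is given below, assuming PyTorch and the fine-tuned model from the earlier sketch; the preprocessing statistics are the standard ImageNet values, and the mapping of output indices to the 'fire' and 'normal' classes is an assumption that depends on the training labels.

import torch
from torchvision import transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # the network's input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # standard ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])

def classify(model, image_path):
    """Feed one image forward and return its 'fire' / 'normal' label."""
    model.eval()
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]
    # index 0 = fire, index 1 = normal is assumed here
    return "fire" if probs[0] > probs[1] else "normal"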

References
A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
S. J. Pan and Q. Yang, "A Survey on Transfer Learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional Architecture for Fast Feature Embedding," in Proc. ACM International Conference on Multimedia, 2014.
F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv:1602.07360, 2016.