A CNN Approach for Audio Classification in Construction Sites


Alessandro Maccagno¹, Andrea Mastropietro¹, Umberto Mazziotta¹,
Michele Scarpiniti², Yong-Cheol Lee³, and Aurelio Uncini²
¹ Department of Computer, Control and Management Engineering,
Sapienza University of Rome, Italy,
{maccagno.1653200,mastropietro.1652886,mazziotta.1647818}@studenti.uniroma1.it
² Department of Information Engineering, Electronics and Telecommunications,
Sapienza University of Rome, Italy,
{michele.scarpiniti,aurelio.uncini}@uniroma1.it
³ Department of Construction Management,
Louisiana State University, Baton Rouge, USA,
yclee@lsu.edu
Abstract. Convolutional Neural Networks (CNNs) have been widely used in the
field of audio recognition and classification, since they often provide
positive results. Motivated by the success of this kind of approach and the
lack of practical methodologies for monitoring construction sites using audio
data, we developed an application for the classification of different types
and brands of construction vehicles and tools, which operates on the emitted
audio through a stack of convolutional layers. The proposed architecture works
on the mel-spectrogram representation of the input audio frames and
demonstrates its effectiveness in environmental sound classification (ESC),
achieving a high accuracy. In summary, our contribution shows that techniques
employed for general ESC can also be successfully adapted to a more specific
environmental sound classification task, such as event recognition in
construction sites.
Keywords: Deep learning, Convolutional neural networks, Audio processing,
Environmental sound classification, Construction sites.
1 Introduction
In recent years, many research efforts have been made towards the event
classification of audio data, due to the availability of cheap sensors [1]. In
fact, systems based on acoustic sensors are of particular interest for their
flexibility and low cost [2]. When we consider generic outdoor scenarios, an
automatic monitoring system based on a microphone array would be an invaluable
tool in assessing and controlling any type of situation occurring in the
environment [3]. This includes, but is not limited to, handling large civil
and/or military events. The idea in these works is to use Computational
Auditory Scene Analysis (CASA) [4], which involves Computational Intelligence
and Machine Learning techniques,
to recognize the presence of specific objects in sound tracks. This last
problem is a notable example of Automatic Audio Classification (AAC) [5], the
task of automatically assigning a given audio signal to one of a set of
predefined classes.
Turning to the more specific field of environmental sound classification
(ESC) in construction sites, the closest attempts have been performed by Cheng
et al. [6], who used Support Vector Machines (SVM) to analyze the activity of
construction tools and equipment. Recent applications of AAC have also
addressed audio-based construction site monitoring [7–9], in order to improve
the construction process management of field activities. This approach is
proving to be a promising method and a supportive resource for unmanned field
monitoring and safety surveillance that leverages construction project
management and decision making [8, 9]. More recently, several studies have
extended these efforts to more complicated architectures exploiting Deep
Learning techniques [10].
In the literature, it is possible to find several successful applications of
deep learning to environmental sound classification. For example, Piczak [11]
exploits a 2-layer CNN working on the spectrogram of the data to perform ESC,
reaching an average accuracy of 70% over different datasets. Other approaches,
instead of using handcrafted features such as the spectrogram, perform
end-to-end environmental sound classification, obtaining higher accuracy than
the former [12, 13].
Inspired and motivated by the MelNet architecture described by Li et al. [14],
which has proven to be remarkably effective in environmental sound
classification, the aim of this paper is to develop an application able to
recognize vehicles and tools used in construction sites, and to classify them
in terms of type and brand. This task is tackled with a neural network
approach, involving a Deep Convolutional Neural Network (DCNN) that is fed
with the mel spectrogram of the audio source as input. The classification is
carried out on five classes extracted from audio files collected in several
construction sites, containing in situ recordings of multiple vehicles and
tools. We demonstrate that the proposed approach for ESC can obtain good
results (average accuracy of 97%) in a very specific domain such as
construction sites.
The rest of this paper is organized as follows. Section 2 describes the
proposed approach used to perform the sound classification. Section 3
introduces the experimental setup, while Section 4 shows the obtained
numerical results. Finally, Section 5 concludes the paper and outlines some
future directions.
2 The proposed approach
CNNs are a particular type of neural network that uses the convolution
operation in one or more layers for the learning process. These networks are
inspired by the primal visual system, and are therefore extensively used with
image and video inputs [10]. A CNN is composed of three main types of layers:
Convolutional layer: The convolutional layer is the one tasked with applying
the convolution operation to the input. This is done by passing a filter (or
kernel) over the input matrix, computing the convolution value, and using the
obtained result as the value of one cell of the output matrix (called the
feature map); the filter is then shifted by a predefined stride along its
dimensions. The filter parameters are learned during the training process.
Detector layer: In the detector layer, the output of the convolution is
passed through a nonlinear function, usually a ReLU function.
Pooling layer: The pooling layer is meant to reduce the dimensionality of
the data by combining the output of neuron clusters at one layer into one
single neuron in the subsequent layer.
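The three stages described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the 5 × 5 input, the 2 × 2 difference filter, and the helper names `conv2d_valid`, `relu` and `max_pool` are hypothetical, and no padding is applied.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    feature_map = []
    for i in range(0, len(image) - kh + 1, stride):
        row = []
        for j in range(0, len(image[0]) - kw + 1, stride):
            # Sum of element-wise products between the filter and the patch
            # (cross-correlation, which deep learning libraries call convolution).
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Detector stage: clamp negative responses to zero."""
    return [[max(0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Pooling stage: keep the maximum of each non-overlapping size x size block."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[max(feature_map[i * size + a][j * size + b]
                 for a in range(size) for b in range(size))
             for j in range(cols)]
            for i in range(rows)]


# Hypothetical toy input and filter, chosen only for illustration.
image = [
    [3, 1, 0, 2, 1],
    [0, 2, 4, 1, 0],
    [1, 0, 1, 3, 2],
    [2, 5, 0, 1, 1],
    [0, 1, 2, 0, 4],
]
kernel = [[1, 0], [0, -1]]

feature_map = conv2d_valid(image, kernel)   # 4 x 4 feature map
activated = relu(feature_map)               # negatives set to 0
pooled = max_pool(activated, size=2)        # 2 x 2 summary
print(pooled)
```

In a trained network the filter values are learned rather than fixed by hand, and many filters are applied in parallel to produce one feature map each.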
The last layer of the network is a fully connected one (a layer whose units
are connected to every single unit from the previous one), which outputs the
probability that the input belongs to each of the classes.
CNNs show some advantages with respect to traditional fully connected neural
networks, because they allow sparse interactions, parameter sharing and
equivariant representations.
We used CNNs in our approach because of the intrinsic nature of audio signals.
CNNs are extensively used with images and, since the spectrum of the audio is
an actual picture of the signal, it is straightforward to see why CNNs are a
good idea for this kind of input: they are able to exploit the adjacency
properties of audio signals and recognize patterns in the spectrum images that
can properly represent each one of the classes taken into consideration.
The proposed architecture consists of a DCNN composed of eight layers, as
shown in Fig. 1, that is fed with the mel spectrogram extracted from audio
signals and its time derivative. Specifically, the input is a tensor of
dimension 60 × 2 × 2, that is, a pair of images representing the spectrogram
and its time derivative: 60 is the number of mel bands, the first 2 is the
number of time buckets, and the last 2 is the number of channels. Then, we
have five convolutional layers, followed by a dense fully connected layer with
200 units and a final softmax layer that performs the classification over the
5 classes. The structure of the proposed network is summarized in Table 1 and
shown graphically in Fig. 1.
Layer           Input Shape          Filters  Kernel Size  Strides  Output Shape
Conv1           [batch, 60, 2, 2]    24       (6,2)        (1,1)    [batch, 60, 2, 24]
Conv2           [batch, 60, 2, 24]   24       (6,2)        (1,1)    [batch, 60, 2, 24]
Conv3           [batch, 60, 2, 24]   48       (5,1)        (2,2)    [batch, 30, 1, 48]
Conv4           [batch, 30, 1, 48]   48       (5,1)        (2,2)    [batch, 15, 1, 48]
Conv5           [batch, 15, 1, 48]   64       (4,1)        (2,2)    [batch, 8, 1, 64]
Flatten         [batch, 8, 1, 64]    -        -            -        [batch, 512]
Dense           [batch, 512]         -        -            -        [batch, 200]
Output - Dense  [batch, 200]         -        -            -        [batch, 5]
Table 1. Parameters of the proposed DCNN architecture.
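The output shapes in Table 1 can be reproduced with a short calculation. The sketch below assumes that every convolution uses "same" padding, so that each spatial dimension becomes ceil(size / stride); the paper does not state the padding mode, but this assumption matches all the shapes in the table. The helper name `same_padding_output` is hypothetical.

```python
import math


def same_padding_output(size, stride):
    """Output length of a 'same'-padded convolution: ceil(size / stride)."""
    return math.ceil(size / stride)


# (name, filters, strides) as listed in Table 1. Under the assumed "same"
# padding, the kernel size does not affect the output shape.
conv_layers = [
    ("Conv1", 24, (1, 1)),
    ("Conv2", 24, (1, 1)),
    ("Conv3", 48, (2, 2)),
    ("Conv4", 48, (2, 2)),
    ("Conv5", 64, (2, 2)),
]

shape = (60, 2, 2)  # (mel bands, time buckets, channels), batch axis omitted
shapes = {}
for name, filters, (stride_h, stride_w) in conv_layers:
    shape = (same_padding_output(shape[0], stride_h),
             same_padding_output(shape[1], stride_w),
             filters)
    shapes[name] = shape
    print(name, shape)

# Flattening the last feature maps gives the input size of the dense layer.
flattened = shape[0] * shape[1] * shape[2]
print("Flatten", flattened)
```

The final value agrees with the 512-unit input of the dense layer reported in the table.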

All the layers employ a ReLU activation function except for the output layer,
which uses a softmax function. The optimizer chosen for the network is Adam
[15], with a learning rate set to 0.0005. This value was chosen by performing
a grid search in the range [0.00001, 0.001]. Moreover, a dropout strategy,
with a rate equal to 30%, has been used in the dense layer.
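To make the role of the learning rate concrete, the following sketch applies the Adam update rule [15] to a single scalar parameter with the learning rate reported above. The decay rates `beta1 = 0.9`, `beta2 = 0.999` and `eps = 1e-8` are the defaults suggested in the Adam paper and are an assumption here, since the values used by the authors are not stated; the quadratic objective is a toy example.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.0005, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy objective f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
    print(t, theta)
```

Because the update is normalized by the gradient magnitude, each early step moves the parameter by roughly the learning rate itself, which is why the searched range of learning rates is so narrow.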
Fig. 1. Graphical representation of the proposed architecture.
Regarding the setting of the other hyper-parameters, different strategies were
adopted. For the batch size, a grid search was used to determine the most
appropriate value. The filter size and the stride were set according to the
input size. Small filters were adopted so as to capture the small, local and
adjacent features that are typical of audio data. Lastly, to prevent the
network from either growing too deep, adding unnecessary complexity for no
actual return, or being too shallow, and therefore returning substandard
results, we decided to use the same number of layers as other related works,
such as the one in [14], as a baseline. Variations on this depth have not
shown appreciable improvements in the overall effectiveness of the network's
classification, so it has been kept unchanged.
2.1 Spectrogram Extraction
The proposed DCNN uses, as its inputs, the mel spectrogram, which is a version
of the spectrogram whose frequency scale has been warped in a perceptual way,
and its time derivative.
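As an illustration of this perceptual warping, the sketch below converts between Hz and mel using the common HTK-style formula m = 2595 log10(1 + f / 700) and places band boundaries at equal mel spacing. This particular formula is an assumption: the paper does not say which mel variant was used, and librosa's default is a different (Slaney-style) scale, so exact band edges may differ. The function names are hypothetical.

```python
import math


def hz_to_mel_htk(freq_hz):
    """HTK-style mel scale (an assumed variant, not confirmed by the paper)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels, f_min, f_max):
    """Boundary frequencies (Hz) of n_mels bands equally spaced on the mel scale.

    Returns n_mels + 2 points: each band spans three consecutive points.
    """
    m_min, m_max = hz_to_mel_htk(f_min), hz_to_mel_htk(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz_htk(m_min + i * step) for i in range(n_mels + 2)]


# 60 bands up to the Nyquist frequency of a 22,050 Hz signal.
edges = mel_band_edges(60, 0.0, 11025.0)
print("lowest band spacing (Hz): ", edges[1] - edges[0])
print("highest band spacing (Hz):", edges[-1] - edges[-2])
```

The bands are narrow at low frequencies and wide at high frequencies, which is the distortion of the frequency axis referred to above.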
The technique used to extract the spectrogram from the sample is the same
used by Piczak [11], via the Python library librosa (available at
https://librosa.github.io/). The frames were resampled to 22,050 Hz, then a
window of size 1024 with a hop size of 512 and 60 mel bands was used. A mel
band represents an interval of frequencies which are perceived to have the
same pitch by human listeners. Mel bands have been found to perform well in
speech recognition.
With these parameters, and the chosen length of 30 ms for the frames (see
next sections), we obtain a small spectrogram of 60 rows (bands) and 2 columns
(buckets). Then, again using librosa, we compute the derivative of the
spectrogram and stack the two matrices, obtaining a dual-channel input which
is fed into the network.
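The 60 × 2 spectrogram size follows from the parameters above. The short calculation below is a sketch that assumes centered, edge-padded framing (which, to our understanding, is the default in librosa's short-time Fourier transform), so that the number of time buckets is 1 + floor(n_samples / hop). The constant and function names are hypothetical.

```python
SAMPLE_RATE = 22050   # Hz, after resampling
HOP_LENGTH = 512      # samples between consecutive windows
N_MELS = 60           # mel bands
FRAME_MS = 30         # length of each audio frame fed to the network


def n_time_buckets(n_samples, hop_length):
    """Spectrogram columns under the assumed centered, padded framing.

    With this framing the window size (1024) does not enter the count.
    """
    return 1 + n_samples // hop_length


n_samples = int(SAMPLE_RATE * FRAME_MS / 1000)    # samples in one 30 ms frame
buckets = n_time_buckets(n_samples, HOP_LENGTH)   # columns of the spectrogram
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE          # time covered by one hop

# Two channels: the spectrogram and its time derivative.
input_shape = (N_MELS, buckets, 2)
print(n_samples, buckets, round(hop_ms, 2), input_shape)
```

The hop duration of about 23 ms also matches the bucket length quoted in the caption of Fig. 2.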
[Figure 2: two panels, "spectrum" and "delta"; horizontal axis: time bucket (0 to 400); vertical axis: mel band.]
Fig. 2. Example of a log-mel spectrogram extracted from a fragment along with
its derivative. The abscissa shows the time buckets, each of which represents
a sample about 23 ms long, while the ordinate shows the log-mel bands. Since
our fragments are 30 ms long, the spectrogram we extract will contain 2
buckets.
3 Experimental setup
3.1 Dataset
The authors collected audio data of equipment operations at several
construction sites involving diverse construction machines and equipment.
Unlike with artificially built datasets, working with real data raises various
problems, such as noise due to weather conditions and/or workers talking among
themselves. Thus, we focused our work on the classification of a reduced
number of classes, specifically Backhoe JD50D Compact, Compactor Ingersoll
Rand, Concrete Mixer, Excavator Cat 320E, and Excavator Hitachi 50U. Classes
which did not have enough usable audio (too short, excessive noise, low audio
quality) were ignored for this work. The activities of these machines were
observed during certain periods, and the audio signals generated were recorded
accordingly. A Zoom H1 digital handy recorder was used for data collection.
All files were recorded at a sample rate of 44,100 Hz, and a total of about
one hour of sound data (eight different files for each machine) was used to
train the architecture.
3.2 Data Preprocessing
In order to feed the network with enough and proper data, each audio file for
each class is segmented into fixed length frames (the choice of the best frame

References
Adam: A Method for Stochastic Optimization (proceedings article)
Introduction to Machine Learning (book)
Introduction to Machine Learning (book chapter)
A Scale for the Measurement of the Psychological Magnitude Pitch (journal article)
Computational Auditory Scene Analysis: Principles, Algorithms, and Applications (journal article)