Book Chapter DOI

A CNN Approach for Audio Classification in Construction Sites

TL;DR: This work developed an application that classifies different types and brands of construction vehicles and tools from their emitted audio through a stack of convolutional layers, demonstrating the effectiveness of CNNs in environmental sound classification (ESC) by achieving high accuracy.
Abstract: Convolutional Neural Networks (CNNs) have been widely used in the field of audio recognition and classification, since they often provide positive results. Motivated by the success of this kind of approach and the lack of practical methodologies for the monitoring of construction sites by using audio data, we developed an application for the classification of different types and brands of construction vehicles and tools, which operates on the emitted audio through a stack of convolutional layers. The proposed architecture works on the mel-spectrogram representation of the input audio frames and it demonstrates its effectiveness in environmental sound classification (ESC) achieving a high accuracy. In summary, our contribution shows that techniques employed for general ESC can be also successfully adapted to a more specific environmental sound classification task, such as event recognition in construction sites.

Summary (2 min read)

1 Introduction

  • In recent years, many research efforts have been made towards the event classification of audio data, due to the availability of cheap sensors [1].
  • This approach is revealing itself as a promising method and a supportive resource for unmanned field monitoring and safety surveillance that leverages construction project management and decision making [8, 9].
  • The classification will be carried out on five classes extracted from audio files collected in several construction sites, containing in situ recordings of multiple vehicles and tools.
  • Section 3 introduces the experimental setup, while Section 4 shows the obtained numerical results.

2 The proposed approach

  • CNNs are a particular type of neural networks, which use the convolution operation in one or more layers for the learning process.
  • In the detector layer, the output of the convolution is passed through a nonlinear function, usually a ReLU function.
  • The reason the authors used CNNs in their approach is the intrinsic nature of audio signals.
  • For the batch size, a grid search was used to determine the most appropriate values.
  • Lastly, to prevent the network depth from either exploding in size, adding unnecessary complexity for no actual return, or not being high enough, and therefore returning substandard results, the authors decided to use the same number of layers as other related works, such as the one in [14], as a baseline.

2.1 Spectrogram Extraction

  • The proposed DCNN takes as its inputs the mel spectrogram, a version of the spectrogram where the frequency scale has been distorted in a perceptual way, and its time derivative.
  • A mel band represents an interval of frequencies which are perceived to have the same pitch by human listeners.
  • With these parameters and the chosen length of 30 ms for the frames (see next sections), the authors obtain a small spectrogram of 60 rows and 2 columns.
  • Then, again using librosa, the authors compute the derivative of the spectrogram and overlap the two matrices, obtaining a dual-channel input which is fed into the network.

3.1 Dataset

  • The authors collected audio data of equipment operations in several construction sites, consisting of diverse construction machines and equipment.
  • Unlike artificially built datasets, when working with real data different problems arise, such as noise due to weather conditions and/or workers talking among themselves.
  • Thus, the authors focused their work on the classification of a reduced number of classes, specifically Backhoe JD50D Compact, Compactor Ingersoll Rand, Concrete Mixer, Excavator Cat 320E, Excavator Hitachi 50U.
  • Classes which did not have enough usable audio (too short, excessive noise, low quality of the audio) were ignored for this work.
  • A Zoom H1 digital handy recorder has been used for data collection purposes.

3.2 Data Preprocessing

  • In order to feed the network with enough and proper data, each audio file for each class is segmented into fixed length frames (the choice of the best frame size is described in the experiment section).
  • As a first step, the authors split the original audio files into two parts, training samples (70% of the original length) and test samples (30% of the original length).
  • This is done to avoid testing the network on data used previously to train the network, as this would cause the network to overfit and give misleading results.
  • After that, the dataset is balanced by taking N samples for each class, where N is the number of elements contained in the class with the least amount of samples.
  • Numerical results have been evaluated in terms of accuracy, recall, precision and F1 score [17].
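The splitting, segmentation, and balancing steps described above can be sketched as follows. The 70/30 time split and the smallest-class balancing rule come from the text; the function and variable names are illustrative assumptions.

```python
import numpy as np

def split_and_frame(signal, frame_len, train_frac=0.7):
    """Split one audio signal in time (70% train / 30% test), then cut
    each part into fixed-length, non-overlapping frames."""
    split = int(len(signal) * train_frac)
    train, test = signal[:split], signal[split:]
    frames = lambda x: [x[i:i + frame_len]
                        for i in range(0, len(x) - frame_len + 1, frame_len)]
    return frames(train), frames(test)

def balance(frames_per_class):
    """Keep N frames per class, where N is the size of the smallest class."""
    n = min(len(f) for f in frames_per_class.values())
    return {cls: f[:n] for cls, f in frames_per_class.items()}
```

Splitting the raw files in time before framing keeps every test frame disjoint from the training material, as the bullets above require.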

4.1 Selecting the frame size

  • A sizeable amount of time was spent finding the proper length for the audio segments.
  • This is of crucial importance since, if the length is not adequate, the network will not be able to learn proper features that clearly characterize the input.
  • Hence, in order to select the most suitable length, the authors generated different dataset variants by splitting the audio using different frame lengths, and they subsequently trained and tested different models on the differently-sized datasets.
  • With a smaller frame size, better results are obtained, while a drop in performance is noticed as the size increases.
  • The results of the classification are shown in the next subsection.
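The selection procedure above can be sketched as a loop over candidate frame lengths; `train_and_eval` is a hypothetical stand-in for building a dataset variant, training a model, and returning its accuracy.

```python
def pick_frame_size(candidates_ms, train_and_eval, sr=22050):
    """Try several frame lengths (in ms) and keep the one whose
    trained model scores the highest accuracy."""
    scores = {}
    for ms in candidates_ms:
        frame_len = int(sr * ms / 1000)  # frame length in samples
        scores[ms] = train_and_eval(frame_len)
    best = max(scores, key=scores.get)
    return best, scores
```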

4.2 Classification Results

  • As just stated, a 5-fold cross validation was performed and the results are shown in Table 2.
  • The dataset was split into training set and validation set (80% – 20%) for each fold.
  • The way the network learns can be seen in Fig.
  • The class with the worst result is the Excavator Cat 320E, which performs at 95% accuracy.
  • All details, features and parameters of the implemented classifiers can be found in [8].
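The accuracy, recall, precision, and F1 metrics mentioned above can be computed per class from a confusion matrix; this is a generic sketch, not the authors' evaluation code.

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = number of samples of true class i predicted as class j.
    Returns overall accuracy plus per-class precision, recall, and F1."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)  # column sums = predicted
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)     # row sums = actual
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1
```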

4.3 Prediction

  • The proposed approach can be used to promptly predict the active working vehicles and tools.
  • In fact, with such an approach, project managers will be able to remotely and continuously monitor the status of workers and machines, investigate the effective distribution of hours, and detect issues of safety in a timely manner.
  • Every frame will be classified as belonging to one of the classes and the audio track will be labeled according to the majority of the labels among all the frames.
  • One can also see the probability of the input track belonging to each of the classes.
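The frame-level majority vote described above can be sketched as follows; the per-frame predictions would come from the trained network, and reporting the vote fractions per class gives the track-level class probabilities mentioned in the last bullet.

```python
from collections import Counter

def label_track(frame_predictions):
    """Label a whole audio track with the majority class among its
    per-frame predictions, and report the vote fraction per class."""
    votes = Counter(frame_predictions)
    label = votes.most_common(1)[0][0]
    fractions = {cls: n / len(frame_predictions) for cls, n in votes.items()}
    return label, fractions
```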


A CNN Approach for Audio Classification in Construction Sites

Alessandro Maccagno¹, Andrea Mastropietro¹, Umberto Mazziotta¹, Michele Scarpiniti², Yong-Cheol Lee³, and Aurelio Uncini²

¹ Department of Computer, Control and Management Engineering, Sapienza University of Rome, Italy, {maccagno.1653200,mastropietro.1652886,mazziotta.1647818}@studenti.uniroma1.it
² Department of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome, Italy, {michele.scarpiniti,aurelio.uncini}@uniroma1.it
³ Department of Construction Management, Louisiana State University, Baton Rouge, USA, yclee@lsu.edu
Abstract. Convolutional Neural Networks (CNNs) have been widely used in the field of audio recognition and classification, since they often provide positive results. Motivated by the success of this kind of approach and the lack of practical methodologies for the monitoring of construction sites by using audio data, we developed an application for the classification of different types and brands of construction vehicles and tools, which operates on the emitted audio through a stack of convolutional layers. The proposed architecture works on the mel-spectrogram representation of the input audio frames and it demonstrates its effectiveness in environmental sound classification (ESC) achieving a high accuracy. In summary, our contribution shows that techniques employed for general ESC can be also successfully adapted to a more specific environmental sound classification task, such as event recognition in construction sites.
Keywords: Deep learning, Convolutional neural networks, Audio processing, Environmental sound classification, Construction sites.
1 Introduction

In recent years, many research efforts have been made towards the event classification of audio data, due to the availability of cheap sensors [1]. In fact, systems based on acoustic sensors are of particular interest for their flexibility and cheapness [2]. When we consider generic outdoor scenarios, an automatic monitoring system based on a microphone array would be an invaluable tool in assessing and controlling any type of situation occurring in the environment [3]. This includes, but is not limited to, handling large civil and/or military events. The idea in these works is to use Computational Auditory Scene Analysis (CASA) [4], which involves Computational Intelligence and Machine Learning techniques, to recognize the presence of specific objects in sound tracks. This last problem is a notable example of Automatic Audio Classification (AAC) [5], the task of automatically labeling a given audio signal in a set of predefined classes.
Getting into the more specific field of environmental sound classification (ESC) in construction sites, the closest attempts have been performed by Cheng et al. [6], who used Support Vector Machines (SVM) to analyze the activity of construction tools and equipment. Recent applications of AAC have also been addressed to audio-based construction site monitoring [7–9], in order to improve the construction process management of field activities. This approach is revealing itself as a promising method and a supportive resource for unmanned field monitoring and safety surveillance that leverages construction project management and decision making [8, 9]. More recently, several studies extend these efforts to more complicated architectures exploiting Deep Learning techniques [10].

In the literature, it is possible to find several instances of successful applications in the field of environmental sound classification that make use of deep learning. For example, in the work of Piczak [11], the author exploits a 2-layered CNN working on the spectrogram of the data to perform ESC, reaching an average accuracy of 70% over different datasets. Other approaches, instead of using handcrafted features such as the spectrogram, perform end-to-end environmental sound classification, obtaining better results than the previous ones [12, 13].

Inspired and motivated by the MelNet architecture described by Li et al. [14], which has been proven to be remarkably effective in environmental sound classification, the aim of this paper is to develop an application able to recognize vehicles and tools used in construction sites, and classify them in terms of type and brand. This task will be tackled with a neural network approach, involving the use of a Deep Convolutional Neural Network (DCNN), which will be fed with the mel spectrogram of the audio source as input. The classification will be carried out on five classes extracted from audio files collected in several construction sites, containing in situ recordings of multiple vehicles and tools. We demonstrate that the proposed approach for ESC can obtain good results (average accuracy of 97%) in a very specific domain such as that of construction sites.

The rest of this paper is organized as follows. Section 2 describes the proposed approach used to perform the sound classification. Section 3 introduces the experimental setup, while Section 4 shows the obtained numerical results. Finally, Section 5 concludes the paper and outlines some future directions.
2 The proposed approach

CNNs are a particular type of neural networks, which use the convolution operation in one or more layers for the learning process. These networks are inspired by the primal visual system, and are therefore extensively used with image and video inputs [10]. A CNN is composed of three main layers:

Convolutional layer: The convolutional layer is the one tasked with applying the convolution operation on the input. This is done by passing a filter (or kernel) over the matricial input, computing the convolution value, and using the obtained result as the value of one cell of the output matrix (called feature map); the filter is then shifted by a predefined stride along its dimensions. The filter's parameters are trained during the training process.

Detector layer: In the detector layer, the output of the convolution is passed through a nonlinear function, usually a ReLU function.

Pooling layer: The pooling layer is meant to reduce the dimensionality of the data by combining the output of neuron clusters at one layer into a single neuron in the subsequent layer.

The last layer of the network is a fully connected one (a layer whose units are connected to every single unit of the previous one), which outputs the probability of the input belonging to each of the classes.
CNNs in a machine learning system show some advantages with respect to traditional fully connected neural networks, because they allow sparse interactions, parameter sharing and equivariant representations.

The reason we used CNNs in our approach is the intrinsic nature of audio signals. CNNs are extensively used with images and, since the spectrum of the audio is an actual picture of the signal, it is straightforward to see why CNNs are a good fit for this kind of input, being able to exploit the adjacency properties of audio signals and recognize patterns in the spectrum images that can properly represent each of the classes taken into consideration.
The proposed architecture consists of a DCNN composed of eight layers, as shown in Fig. 1, that is fed with the mel spectrogram extracted from the audio signals and its time derivative. Specifically, we have as input a tensor of dimension 60 × 2 × 2, that is, a couple of images representing the spectrogram and its time derivative: 60 is the number of mel bands, while 2 is the number of time buckets. Then, we have five convolutional layers, followed by a dense fully connected layer with 200 units and a final softmax layer that performs the classification over the 5 classes. The structure of the proposed network is summarized in Table 1, and it can be graphically appreciated in Fig. 1.
Layer           Input Shape         Filters   Kernel Size   Strides   Output Shape
Conv1           [batch, 60, 2, 2]   24        (6,2)         (1,1)     [batch, 60, 2, 24]
Conv2           [batch, 60, 2, 24]  24        (6,2)         (1,1)     [batch, 60, 2, 24]
Conv3           [batch, 60, 2, 24]  48        (5,1)         (2,2)     [batch, 30, 1, 48]
Conv4           [batch, 30, 1, 48]  48        (5,1)         (2,2)     [batch, 15, 1, 48]
Conv5           [batch, 15, 1, 48]  64        (4,1)         (2,2)     [batch, 8, 1, 64]
Flatten         [batch, 8, 1, 64]   –         –             –         [batch, 512]
Dense           [batch, 512]        –         –             –         [batch, 200]
Output - Dense  [batch, 200]        –         –             –         [batch, 5]

Table 1. Parameters of the proposed DCNN architecture.
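The architecture of Table 1 can be reproduced, for instance, with the Keras API. Note that "same" padding is an assumption made here so that the listed strides yield the listed output shapes; details such as weight initialization are not specified in the text, so this is a sketch rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dcnn(n_classes=5):
    """DCNN of Table 1: five conv layers on a 60x2x2 mel-spectrogram
    input, a 200-unit dense layer with 30% dropout, and a softmax output."""
    model = models.Sequential([
        layers.Input(shape=(60, 2, 2)),
        layers.Conv2D(24, (6, 2), strides=(1, 1), padding="same", activation="relu"),
        layers.Conv2D(24, (6, 2), strides=(1, 1), padding="same", activation="relu"),
        layers.Conv2D(48, (5, 1), strides=(2, 2), padding="same", activation="relu"),
        layers.Conv2D(48, (5, 1), strides=(2, 2), padding="same", activation="relu"),
        layers.Conv2D(64, (4, 1), strides=(2, 2), padding="same", activation="relu"),
        layers.Flatten(),                  # 8 * 1 * 64 = 512 features
        layers.Dense(200, activation="relu"),
        layers.Dropout(0.3),               # dropout rate from the text
        layers.Dense(n_classes, activation="softmax"),
    ])
    # Adam with learning rate 0.0005, as selected by the grid search
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```

With "same" padding, each (2,2)-strided convolution halves the 60-band axis (60 → 30 → 15 → 8) and collapses the 2 time buckets to 1, matching the output shapes in the table.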

All the layers employ a ReLU activation function except for the output layer, which uses a Softmax function. The optimizer chosen for the network is the Adam optimizer [15], with a learning rate set to 0.0005. This value was chosen by performing a grid search in the range [0.00001, 0.001]. Moreover, a dropout strategy, with a rate equal to 30%, has been used in the dense layer.
Fig. 1. Graphical representation of the proposed architecture.
Regarding the setting of other hyper-parameters, different strategies were adopted. For the batch size, a grid search was used to determine the most appropriate values. The filter size and the stride were set reasonably according to the input size. Small filters were adopted so as to capture the small, local and adjacent features that are typical of audio data. Lastly, to prevent the network depth from either exploding in size, adding unnecessary complexity for no actual return, or not being high enough, and therefore returning substandard results, we decided to use the same number of layers as other related works, such as the one in [14], as a baseline. Variations on this depth have not shown appreciable improvements in the overall effectiveness of the network's classification, so it has been kept unchanged.
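The grid searches mentioned above can be organized as a simple exhaustive loop; the candidate values below and the `evaluate` callable are illustrative assumptions, not the authors' exact grid.

```python
from itertools import product

def grid_search(evaluate, learning_rates, batch_sizes):
    """Score every (learning rate, batch size) pair with the given
    evaluation callable and return the best-scoring combination."""
    best, best_score = None, float("-inf")
    for lr, bs in product(learning_rates, batch_sizes):
        score = evaluate(lr, bs)  # e.g. validation accuracy of a trained model
        if score > best_score:
            best, best_score = (lr, bs), score
    return best, best_score
```

In the paper's setting, `evaluate` would train the DCNN with the given pair and return validation accuracy; the learning-rate candidates would be drawn from the stated range [0.00001, 0.001].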
2.1 Spectrogram Extraction

The proposed DCNN uses, as its inputs, the mel spectrogram, which is a version of the spectrogram where the frequency scale has been distorted in a perceptual way, and its time derivative.

The technique used to extract the spectrogram from the sample is the same used by Piczak [11], via the Python library librosa⁴. The frames were resampled to 22,050 Hz, then a window of size 1024 with a hop size of 512 and 60 mel bands has been used. A mel band represents an interval of frequencies which are perceived to have the same pitch by human listeners. Mel bands have been found to perform well in speech recognition.

With these parameters, and the chosen length of 30 ms for the frames (see next sections), we obtain a small spectrogram of 60 rows (bands) and 2 columns (buckets). Then, again using librosa, we compute the derivative of the spectrogram and we overlap the two matrices, obtaining a dual-channel input which is fed into the network.

⁴ Available at: https://librosa.github.io/
Fig. 2. Example of a log-mel spectrogram extracted from a fragment along with its derivative. On the abscissa we find the time buckets, each representing a sample about 23 ms long, while on the ordinate we find the log-mel bands. Since our fragments are 30 ms long, the extracted spectrogram contains 2 buckets.
3 Experimental setup

3.1 Dataset

We collected audio data of equipment operations in several construction sites, consisting of diverse construction machines and equipment. Unlike artificially built datasets, when working with real data different problems arise, such as noise due to weather conditions and/or workers talking among themselves. Thus, we focused our work on the classification of a reduced number of classes, specifically Backhoe JD50D Compact, Compactor Ingersoll Rand, Concrete Mixer, Excavator Cat 320E, and Excavator Hitachi 50U. Classes which did not have enough usable audio (too short, excessive noise, low quality of the audio) were ignored for this work. The activities of these machines were observed during certain periods, and the audio signals generated were recorded accordingly. A Zoom H1 digital handy recorder has been used for data collection purposes. All files have been recorded using a sample rate of 44,100 Hz, and a total of about one hour of sound data (eight different files for each machine) has been used to train the architecture.
3.2 Data Preprocessing

In order to feed the network with enough and proper data, each audio file for each class is segmented into fixed-length frames (the choice of the best frame size is described in the experiment section).

Citations
Journal ArticleDOI
TL;DR: The sounds of work activities and equipment operations at a construction site provide critical information regarding construction progress, task performance, and safety issues.
Abstract: The sounds of work activities and equipment operations at a construction site provide critical information regarding construction progress, task performance, and safety issues The construc

28 citations

Journal ArticleDOI
TL;DR: The result of this study demonstrates the potential of the proposed system to be applied for automated monitoring and data collection in modular construction factory in conjunction with other activity recognition frameworks based on computer vision (CV) and/or inertial measurement units (IMU).

25 citations

Journal ArticleDOI
TL;DR: The aim of the work is to obtain an accurate and flexible tool for consistently executing and managing the unmanned monitoring of construction sites by using distributed acoustic sensors by using a Deep Belief Network based approach.
Abstract: In this paper, we propose a Deep Belief Network (DBN) based approach for the classification of audio signals to improve work activity identification and remote surveillance of construction projects. The aim of the work is to obtain an accurate and flexible tool for consistently executing and managing the unmanned monitoring of construction sites by using distributed acoustic sensors. In this paper, ten classes of multiple construction equipment and tools, frequently and broadly used in construction sites, have been collected and examined to conduct and validate the proposed approach. The input provided to the DBN consists in the concatenation of several statistics evaluated by a set of spectral features, like MFCCs and mel-scaled spectrogram. The proposed architecture, along with the preprocessing and the feature extraction steps, has been described in details while the effectiveness of the proposed idea has been demonstrated by some numerical results, evaluated by using real-world recordings. The final overall accuracy on the test set is up to 98% and is a significantly improved performance compared to other state-of-the-are approaches. A practical and real-time application of the presented method has been also proposed in order to apply the classification scheme to sound data recorded in different environmental scenarios.

23 citations

Journal ArticleDOI
TL;DR: A deep neural network architecture, Convolutional Neural Network-based ocean noise classification cum recognition system capable of classifying vocalization of cetaceans, fishes, marine invertebrates, anthropogenic sounds, natural sounds, and the unidentified ocean sounds from passive acoustic ocean noise recordings is presented.

17 citations

Proceedings ArticleDOI
24 Jan 2021
TL;DR: In this article, a Deep Recurrent Neural Network (DRNN) approach based on Long Short Term Memory (LSTM) units was proposed for the classification of audio signals recorded in construction sites.
Abstract: In this paper, we propose a Deep Recurrent Neural Network (DRNN) approach based on Long-Short Term Memory (LSTM) units for the classification of audio signals recorded in construction sites. Five classes of multiple vehicles and tools, normally used in construction sites, have been considered. The input provided to the DRNN consists in the concatenation of several spectral features, like MFCCs, mel-scaled spectrogram, chroma and spectral contrast. The proposed architecture and the feature extraction have been described. Some experimental results, obtained by using real-world recordings, demonstrate the effectiveness of the proposed idea. The final overall accuracy on the test set is up to 97% and overcomes other state-of-the-art approaches.

11 citations

References
Journal ArticleDOI
TL;DR: An extensive set of simulations showing the effectiveness of the proposed architecture are provided, with a particular emphasis on the innovative aspects that are introduced with respect to the state-of-the-art.
Abstract: The aim of this paper is to describe a novel security system able to localize and classify audio sources in an outdoor environment. Its primary intended use is for security monitoring in severe scenarios, and it has been designed to cope with a large set of heterogeneous objects, including weapons, human speakers and vehicles. The system is the result of a research project sponsored by the Italian Ministry of Defense. It is composed of a large squared array of 864 microphones arranged in a rectangular lattice, whose input is processed using a classical delay-and-sum beamformer. The result of this localization process is elaborated by a complex multi-level classification system designed in a modular fashion. In this paper, after presenting the details of the system's design, with a particular emphasis on the innovative aspects that are introduced with respect to the state-of-the-art, we provide an extensive set of simulations showing the effectiveness of the proposed architecture. We conclude by describing the current limits of the system, and the projected further developments.

15 citations

Trending Questions (1)
What is the use of CNN in the construction industry?

CNNs are used in the construction industry for audio classification, specifically for monitoring construction sites by classifying different types and brands of construction vehicles and tools.