
Book ChapterDOI

A CNN Approach for Audio Classification in Construction Sites

01 Jan 2021, pp. 371-381

TL;DR: This work developed an application for the classification of different types and brands of construction vehicles and tools, which operates on the emitted audio through a stack of convolutional layers, demonstrating its effectiveness in environmental sound classification (ESC) by achieving high accuracy.

Abstract: Convolutional Neural Networks (CNNs) have been widely used in the field of audio recognition and classification, since they often provide positive results. Motivated by the success of this kind of approach and the lack of practical methodologies for the monitoring of construction sites by using audio data, we developed an application for the classification of different types and brands of construction vehicles and tools, which operates on the emitted audio through a stack of convolutional layers. The proposed architecture works on the mel-spectrogram representation of the input audio frames and demonstrates its effectiveness in environmental sound classification (ESC), achieving high accuracy. In summary, our contribution shows that techniques employed for general ESC can also be successfully adapted to a more specific environmental sound classification task, such as event recognition in construction sites.

Summary (2 min read)

1 Introduction

  • In recent years, many research efforts have been made towards the event classification of audio data, due to the availability of cheap sensors [1].
  • This approach is revealing itself as a promising method and a supportive resource for unmanned field monitoring and safety surveillance that leverages construction project management and decision making [8, 9].
  • The classification is carried out on five classes extracted from audio files collected in several construction sites, containing in situ recordings of multiple vehicles and tools.
  • Section 3 introduces the experimental setup, while Section 4 shows the obtained numerical results.

2 The proposed approach

  • CNNs are a particular type of neural network that uses the convolution operation in one or more layers of the learning process.
  • In the detector layer, the output of the convolution is passed through a nonlinear function, usually a ReLU function.
  • The authors used CNNs because of the intrinsic nature of audio signals.
  • For the batch size, a grid search was used to determine the most appropriate values.
  • Lastly, to keep the network depth from either exploding in size (adding unnecessary complexity for no actual return) or being too shallow (yielding substandard results), the authors used the same number of layers as related works, such as [14], as a baseline.

2.1 Spectrogram Extraction

  • The proposed DCNN uses as its inputs the mel spectrogram, a version of the spectrogram whose frequency scale has been warped in a perceptually motivated way, and its time derivative.
  • A mel band represents an interval of frequencies which are perceived to have the same pitch by human listeners.
  • With these parameters and the chosen frame length of 30 ms (see next sections), the authors obtain a small spectrogram of 60 rows and 2 columns (computed with the librosa library, available at https://librosa.github.io/).
  • Then, again using librosa, the authors compute the derivative of the spectrogram and overlap the two matrices, obtaining a dual-channel input which is fed into the network.

3.1 Dataset

  • The authors collected audio data of equipment operations in several construction sites featuring diverse construction machines and equipment.
  • Unlike artificially built datasets, when working with real data different problems arise, such as noise due to weather conditions and/or workers talking among themselves.
  • Thus, the authors focused their work on the classification of a reduced number of classes, specifically Backhoe JD50D Compact, Compactor Ingersoll Rand, Concrete Mixer, Excavator Cat 320E, Excavator Hitachi 50U.
  • Classes which did not have enough usable audio (too short, excessive noise, low quality of the audio) were ignored for this work.
  • A Zoom H1 digital handy recorder has been used for data collection purposes.

3.2 Data Preprocessing

  • In order to feed the network with enough and proper data, each audio file for each class is segmented into fixed length frames (the choice of the best frame size is described in the experiment section).
  • As a first step, the authors split each original audio file into two parts: training samples (70% of the original length) and test samples (30% of the original length).
  • This is done to avoid testing the network on data previously used for training, as evaluating on data already seen in training would give misleadingly optimistic results (a minimal sketch of the pipeline follows this list).
  • After that, the dataset is balanced by taking N samples for each class, where N is the number of elements contained in the class with the least amount of samples.
  • Numerical results have been evaluated in terms of accuracy, recall, precision and F1 score [17].
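
A minimal sketch of this preprocessing pipeline, under stated assumptions: the 70/30 split is taken along each file's length as described above, the 30 ms frame length comes from Section 4.1, and the function and variable names are illustrative rather than the authors' code.

```python
# Minimal sketch of the preprocessing described above (illustrative names):
# 70/30 train/test split along each file, segmentation into fixed-length
# frames, and balancing to the size of the smallest class.
import numpy as np
import librosa

FRAME_SEC = 0.030  # 30 ms, the best-performing length (see Section 4.1)

def split_and_frame(path, sr=22050):
    """Return (train_frames, test_frames) for one audio file."""
    y, _ = librosa.load(path, sr=sr)
    cut = int(0.7 * len(y))          # first 70% for training, last 30% for testing
    n = int(FRAME_SEC * sr)
    segment = lambda x: [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
    return segment(y[:cut]), segment(y[cut:])

def balance(frames_per_class, seed=0):
    """Keep N random frames per class, N = size of the smallest class."""
    n = min(len(v) for v in frames_per_class.values())
    rng = np.random.default_rng(seed)
    return {c: [v[i] for i in rng.permutation(len(v))[:n]]
            for c, v in frames_per_class.items()}
```

The balanced frames would then go through the spectrogram extraction described in Section 2.1, and the accuracy, recall, precision and F1 metrics can be computed with, for example, sklearn.metrics.classification_report.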

4.1 Selecting the frame size

  • A sizeable amount of time was spent on finding the proper length for the audio segments.
  • This is of crucial importance since, if the length is not adequate, the network will not be able to learn proper features that clearly characterize the input.
  • Hence, in order to select the most suitable length, the authors generated different dataset variants by splitting the audio using different frame lengths, and subsequently trained and tested separate models on the differently-sized datasets (see the sketch after this list).
  • Smaller frame sizes yield better results, while performance drops as the size increases.
  • The results of the classification are shown in the next subsection.
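
A sketch of that selection procedure is shown below; the two callables are hypothetical stand-ins, and any candidate lengths passed in would be illustrative, since the paper reports only the outcome of the search.

```python
# Sketch of the frame-length selection described above. Hypothetical helpers:
# build_dataset(frame_ms) -> (train_set, test_set), re-segmenting the audio,
# and train_and_eval(train_set, test_set) -> test accuracy of a fresh model.
def select_frame_length(candidates_ms, build_dataset, train_and_eval):
    scores = {}
    for ms in candidates_ms:                      # e.g. [30, 60, 120, 250, 500]
        train_set, test_set = build_dataset(frame_ms=ms)
        scores[ms] = train_and_eval(train_set, test_set)
    best = max(scores, key=scores.get)            # keep the best-scoring length
    return best, scores
```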

4.2 Classification Results

  • As just stated, a 5-fold cross validation was performed and the results are shown in Table 2.
  • The dataset was split into a training set and a validation set (80% – 20%) for each fold; a minimal sketch of this protocol is given after this list.
  • The network's learning behavior can be seen in the corresponding figure.
  • The class with the worst result is the Excavator Cat 320E, which performs at 95% accuracy.
  • All details, features and parameters of the implemented classifiers can be found in [8].
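
A minimal sketch of that protocol, assuming scikit-learn's StratifiedKFold for the splits (the paper does not name a library) and a build_model function returning a compiled Keras-style classifier:

```python
# 5-fold cross-validation sketch: each fold trains on 80% of the data and
# validates on the held-out 20%, matching the split stated above. Assumes
# y holds integer class labels, build_model() returns a classifier whose
# loss matches that encoding (e.g. sparse categorical cross-entropy) and
# whose metrics include accuracy, so evaluate() returns [loss, accuracy].
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, build_model, epochs=50):  # epoch count is an assumption
    accuracies = []
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, val_idx in folds.split(X, y):
        model = build_model()                      # fresh model per fold
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        accuracies.append(model.evaluate(X[val_idx], y[val_idx], verbose=0)[1])
    return float(np.mean(accuracies)), float(np.std(accuracies))
```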

4.3 Prediction

  • The proposed approach can be used to promptly predict the active working vehicles and tools.
  • In fact, with such an approach, project managers will be able to remotely and continuously monitor the status of workers and machines, investigate the effective distribution of hours, and detect issues of safety in a timely manner.
  • Every frame will be classified as belonging to one of the classes and the audio track will be labeled according to the majority of the labels among all the frames.
  • One can also see the probability of the input track belonging to each of the classes (a minimal sketch of this labeling scheme follows).
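
A minimal sketch of this majority-vote labeling, assuming `model` is the trained classifier and `frames` holds the preprocessed inputs of one track (names are illustrative):

```python
# Majority-vote labeling of a full track: classify each frame, assign the
# track the most frequent class, and report mean per-class probabilities.
import numpy as np

def label_track(model, frames, class_names):
    probs = model.predict(np.stack(frames), verbose=0)  # (n_frames, n_classes)
    votes = probs.argmax(axis=1)                        # per-frame labels
    majority = np.bincount(votes, minlength=len(class_names)).argmax()
    return class_names[majority], probs.mean(axis=0)    # track label, class probs
```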


A CNN Approach for Audio Classification in Construction Sites

Alessandro Maccagno¹, Andrea Mastropietro¹, Umberto Mazziotta¹, Michele Scarpiniti², Yong-Cheol Lee³, and Aurelio Uncini²

¹ Department of Computer, Control and Management Engineering, Sapienza University of Rome, Italy
{maccagno.1653200,mastropietro.1652886,mazziotta.1647818}@studenti.uniroma1.it
² Department of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome, Italy
{michele.scarpiniti,aurelio.uncini}@uniroma1.it
³ Department of Construction Management, Louisiana State University, Baton Rouge, USA
yclee@lsu.edu
Abstract. Convolutional Neural Networks (CNNs) have been widely used in the field of audio recognition and classification, since they often provide positive results. Motivated by the success of this kind of approach and the lack of practical methodologies for the monitoring of construction sites by using audio data, we developed an application for the classification of different types and brands of construction vehicles and tools, which operates on the emitted audio through a stack of convolutional layers. The proposed architecture works on the mel-spectrogram representation of the input audio frames and demonstrates its effectiveness in environmental sound classification (ESC), achieving high accuracy. In summary, our contribution shows that techniques employed for general ESC can also be successfully adapted to a more specific environmental sound classification task, such as event recognition in construction sites.

Keywords: Deep learning, Convolutional neural networks, Audio processing, Environmental sound classification, Construction sites.
1 Introduction

In recent years, many research efforts have been made towards the event classification of audio data, due to the availability of cheap sensors [1]. In fact, systems based on acoustic sensors are of particular interest for their flexibility and cheapness [2]. When we consider generic outdoor scenarios, an automatic monitoring system based on a microphone array would be an invaluable tool in assessing and controlling any type of situation occurring in the environment [3]. This includes, but is not limited to, handling large civil and/or military events. The idea in these works is to use Computational Auditory Scene Analysis (CASA) [4], which involves Computational Intelligence and Machine Learning techniques, to recognize the presence of specific objects in sound tracks. This last problem is a notable example of Automatic Audio Classification (AAC) [5], the task of automatically labeling a given audio signal with one of a set of predefined classes.

Getting into the more specific field of environmental sound classification (ESC) in construction sites, the closest attempts have been performed by Cheng et al. [6], who used Support Vector Machines (SVM) to analyze the activity of construction tools and equipment. Recent applications of AAC have also addressed audio-based construction site monitoring [7–9], in order to improve the construction process management of field activities. This approach is revealing itself as a promising method and a supportive resource for unmanned field monitoring and safety surveillance that leverages construction project management and decision making [8, 9]. More recently, several studies have extended these efforts to more complicated architectures exploiting Deep Learning techniques [10].

In the literature, it is possible to find several instances of successful applications of deep learning in the field of environmental sound classification. For example, in the work of Piczak [11], the author exploits a 2-layered CNN working on the spectrogram of the data to perform ESC, reaching an average accuracy of 70% over different datasets. Other approaches, instead of using handcrafted features such as the spectrogram, perform end-to-end environmental sound classification, obtaining better results than the previous ones [12, 13].

Inspired and motivated by the MelNet architecture described by Li et al. [14], which has proven to be remarkably effective in environmental sound classification, the aim of this paper is to develop an application able to recognize vehicles and tools used in construction sites, and to classify them in terms of type and brand. This task is tackled with a neural network approach, involving the use of a Deep Convolutional Neural Network (DCNN), which is fed with the mel spectrogram of the audio source as input. The classification is carried out on five classes extracted from audio files collected in several construction sites, containing in situ recordings of multiple vehicles and tools. We demonstrate that the proposed approach for ESC can obtain good results (average accuracy of 97%) in a very specific domain such as construction sites.

The rest of this paper is organized as follows. Section 2 describes the proposed approach used to perform the sound classification. Section 3 introduces the experimental setup, while Section 4 shows the obtained numerical results. Finally, Section 5 concludes the paper and outlines some future directions.
2 The proposed approach

CNNs are a particular type of neural network, which use the convolution operation in one or more layers of the learning process. These networks are inspired by the primal visual system, and are therefore extensively used with image and video inputs [10]. A CNN is composed of three main layers:

  • Convolutional Layer: the layer tasked with applying the convolution operation to the input. This is done by passing a filter (or kernel) over the matricial input, computing the convolution value, and using the obtained result as the value of one cell of the output matrix (called a feature map); the filter is then shifted by a predefined stride along its dimensions. The filters' parameters are learned during the training process.
  • Detector Layer: the output of the convolution is passed through a nonlinear function, usually a ReLU.
  • Pooling Layer: reduces the dimensionality of the data by combining the output of neuron clusters at one layer into a single neuron in the subsequent layer.

The last layer of the network is a fully connected one (a layer whose units are connected to every single unit of the previous one), which outputs the probability of the input belonging to each of the classes.

CNNs show some advantages with respect to traditional fully connected neural networks, because they allow sparse interactions, parameter sharing and equivariant representations.

The reason we used CNNs in our approach is the intrinsic nature of audio signals. CNNs are extensively used with images and, since the spectrogram of an audio signal is effectively a picture of it, it is straightforward to see why CNNs suit this kind of input: they can exploit the adjacency properties of audio signals and recognize patterns in the spectrum images that characterize each of the classes under consideration.

The proposed architecture consists of a DCNN composed of eight layers, as shown in Fig. 1, fed with the mel spectrogram extracted from the audio signals and its time derivative. Specifically, the input is a tensor of dimension 60 × 2 × 2, i.e., a pair of images representing the spectrogram and its time derivative: 60 is the number of mel bands, while 2 is the number of time buckets. Then, we have five convolutional layers, followed by a dense fully connected layer with 200 units and a final softmax layer that performs the classification over the 5 classes. The structure of the proposed network is summarized in Table 1 and graphically depicted in Fig. 1.
Layer Input Shape Filters Kernel Size Strides Output Shape
Conv1 [batch, 60, 2, 2] 24 (6,2) (1,1) [batch, 60, 2, 24]
Conv2 [batch, 60, 2, 24] 24 (6,2) (1,1) [batch, 60, 2, 24]
Conv3 [batch, 60, 2, 24] 48 (5,1) (2,2) [batch, 30, 1, 48]
Conv4 [batch, 30, 1, 48] 48 (5,1) (2,2) [batch, 15, 1, 48]
Conv5 [batch, 15, 1, 48] 64 (4,1) (2,2) [batch, 8, 1, 64]
Flatten [batch, 8, 1, 64] [batch, 512]
Dense [batch, 512] [batch, 200]
Output - Dense [batch, 200] [batch, 5]
Table 1. Parameters of the proposed DCNN architecture.

All the layers employ a ReLU activation function, except for the output layer, which uses a softmax function. The optimizer chosen for the network is Adam [15], with a learning rate set to 0.0005; this value was chosen by performing a grid search in the range [0.00001, 0.001]. Moreover, a dropout strategy, with a rate equal to 30%, has been used in the dense layer.
Fig. 1. Graphical representation of the proposed architecture.
Regarding the setting of the other hyper-parameters, different strategies were adopted. For the batch size, a grid search was used to determine the most appropriate value. The filter sizes and strides were set according to the input size: small filters were adopted so as to capture the small, local and adjacent features that are typical of audio data. Lastly, to prevent the network depth from either exploding in size (adding unnecessary complexity for no actual return) or being too small (returning substandard results), we used the same number of layers as related works, such as [14], as a baseline. Variations on this depth have not shown appreciable improvements in the network's overall classification effectiveness, so it has been kept unchanged.
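
For concreteness, the following is a minimal Keras sketch of a network matching Table 1. The paper does not specify the padding scheme or deep-learning framework; "same" padding is assumed here because it reproduces the output shapes listed in Table 1, while the learning rate and dropout come from the text above.

```python
# Minimal Keras sketch of the DCNN in Table 1. "same" padding is an
# assumption (not stated in the paper) chosen because it reproduces the
# output shapes of Table 1; learning rate and dropout follow the text.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_classes=5):
    model = models.Sequential([
        layers.Input(shape=(60, 2, 2)),  # mel bands x time buckets x {log-mel, delta}
        layers.Conv2D(24, (6, 2), strides=(1, 1), padding="same", activation="relu"),  # Conv1
        layers.Conv2D(24, (6, 2), strides=(1, 1), padding="same", activation="relu"),  # Conv2
        layers.Conv2D(48, (5, 1), strides=(2, 2), padding="same", activation="relu"),  # Conv3
        layers.Conv2D(48, (5, 1), strides=(2, 2), padding="same", activation="relu"),  # Conv4
        layers.Conv2D(64, (4, 1), strides=(2, 2), padding="same", activation="relu"),  # Conv5
        layers.Flatten(),                # 8 x 1 x 64 -> 512
        layers.Dense(200, activation="relu"),
        layers.Dropout(0.3),             # 30% dropout on the dense layer
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```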
2.1 Spectrogram Extraction

The proposed DCNN uses as its inputs the mel spectrogram, a version of the spectrogram whose frequency scale has been warped in a perceptually motivated way, and its time derivative.

The technique used to extract the spectrogram from the samples is the same used by Piczak [11], via the Python library librosa (available at https://librosa.github.io/). The frames were resampled to 22,050 Hz; then a window of size 1024, a hop size of 512 and 60 mel bands have been used. A mel band represents an interval of frequencies that are perceived to have the same pitch by human listeners; mel bands have been found to perform well in speech recognition.
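
For reference, the widely used HTK formulation of the mel scale maps a frequency $f$ (in Hz) to mels as

\[
m(f) = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right).
\]

This is one common convention; librosa's default mel filter bank follows Slaney's slightly different formulation, and the paper does not state which variant was used.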
With these parameters and the chosen frame length of 30 ms (see the next sections), we obtain a small spectrogram of 60 rows (bands) and 2 columns (buckets). Then, again using librosa, we compute the time derivative of the spectrogram and overlap the two matrices, obtaining a dual-channel input which is fed into the network.
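
A minimal sketch of this extraction step is given below, under stated assumptions: the function takes one mono fragment and its sampling rate, and width=3 is used for the derivative because librosa's default 9-frame delta window cannot fit the 2 time buckets produced by a 30 ms fragment (the paper does not report its delta settings).

```python
# Minimal sketch of the feature extraction described above: resample to
# 22,050 Hz, 60-band log-mel spectrogram (window 1024, hop 512), time
# derivative, and stacking into a dual-channel input.
import numpy as np
import librosa

def extract_input(y, sr):
    y = librosa.resample(y, orig_sr=sr, target_sr=22050)
    mel = librosa.feature.melspectrogram(y=y, sr=22050, n_fft=1024,
                                         hop_length=512, n_mels=60)
    log_mel = librosa.power_to_db(mel)                 # log-mel, as in Fig. 2
    delta = librosa.feature.delta(log_mel, width=3, mode="nearest")
    return np.stack([log_mel, delta], axis=-1)         # shape (60, buckets, 2)
```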
Fig. 2. Example of a log-mel spectrogram extracted from a fragment (top panel: spectrum; bottom panel: delta), along with its derivative. On the abscissa are the time buckets, each representing a sample about 23 ms long, while on the ordinate are the log-mel bands. Since our fragments are 30 ms long, the spectrogram we extract contains 2 buckets.
3 Experimental setup

3.1 Dataset

We collected audio data of equipment operations in several construction sites featuring diverse construction machines and equipment. Unlike artificially built datasets, real data present different problems, such as noise due to weather conditions and/or workers talking among themselves. Thus, we focused our work on the classification of a reduced number of classes, specifically Backhoe JD50D Compact, Compactor Ingersoll Rand, Concrete Mixer, Excavator Cat 320E, and Excavator Hitachi 50U. Classes which did not have enough usable audio (too short, excessive noise, low audio quality) were excluded from this work. The activities of these machines were observed during certain periods, and the audio signals generated were recorded accordingly. A Zoom H1 digital handy recorder has been used for data collection purposes. All files have been recorded at a sample rate of 44,100 Hz, and a total of about one hour of sound data (eight different files for each machine) has been used to train the architecture.

3.2 Data Preprocessing

In order to feed the network with enough and proper data, each audio file for each class is segmented into fixed-length frames (the choice of the best frame size is described in the experiment section).

Citations

Journal ArticleDOI
TL;DR: The result of this study demonstrates the potential of the proposed system to be applied for automated monitoring and data collection in modular construction factory in conjunction with other activity recognition frameworks based on computer vision (CV) and/or inertial measurement units (IMU).
Abstract: Modular construction is an attractive building method due to its advantages over traditional stick-built methods in terms of reduced waste and construction time, more control over resources and environment, and easier implementation of novel techniques and technologies in a controlled factory setting. However, efficient and timely decision-making in modular factories requires spatiotemporal information about the resources regarding their locations and activities which motivates the necessity for an automated activity identification framework. Thus, this paper utilizes sound, a ubiquitous data source present in every modular construction factory, for the automatic identification of commonly performed manual activities such as hammering, nailing, sawing, etc. To develop a robust activity identification model, it is imperative to engineer the appropriate features of the data source (i.e., traits of the signal) that provides a compact yet descriptive representation of the parameterized audio signal based on the nature of the sound, which is very dependent on the application domain. In-depth analysis regarding appropriate features selection and engineering for audio-based activity identification in construction is missing from current research. Thus, this research extensively investigates the effects of various features extracted from four different domains related to audio signals (time-, time-frequency-, cepstral-, and wavelet-domains), in the overall performance of the activity identification model. The effect of these features on activity identification performance was tested by collecting and analyzing audio data generated from manual activities at a modular construction factory. The collected audio signals were first balanced using time-series data augmentation techniques and then used to extract a 318-dimensional feature vector containing 18 different feature sets from the abovementioned four domains. Several sensitivity analyses were performed to optimize the feature space using a feature ranking technique (i.e., Relief algorithm), and the contribution of features in the top feature sets using a support vector machine (SVM). Eventually, a final feature space was designed containing a 130-dimensional feature vector and 0.5-second window size yielding about 97% F-1 score for identifying different activities. The contributions of this study are two-fold: 1. A novel means of automated manual construction activity identification using audio signal is presented; and 2. Foundational knowledge on the selection and optimization of the feature space from four domains is provided for future work in this research field. The result of this study demonstrates the potential of the proposed system to be applied for automated monitoring and data collection in modular construction factory in conjunction with other activity recognition frameworks based on computer vision (CV) and/or inertial measurement units (IMU).

7 citations


Journal ArticleDOI
TL;DR: The sounds of work activities and equipment operations at a construction site provide critical information regarding construction progress, task performance, and safety issues.
Abstract: The sounds of work activities and equipment operations at a construction site provide critical information regarding construction progress, task performance, and safety issues.

6 citations


Proceedings ArticleDOI
24 Jan 2021
Abstract: In this paper, we propose a Deep Recurrent Neural Network (DRNN) approach based on Long Short-Term Memory (LSTM) units for the classification of audio signals recorded in construction sites. Five classes of multiple vehicles and tools, normally used in construction sites, have been considered. The input provided to the DRNN consists of the concatenation of several spectral features, like MFCCs, mel-scaled spectrogram, chroma and spectral contrast. The proposed architecture and the feature extraction have been described. Some experimental results, obtained by using real-world recordings, demonstrate the effectiveness of the proposed idea. The final overall accuracy on the test set is up to 97% and outperforms other state-of-the-art approaches.

2 citations


Journal ArticleDOI
TL;DR: The aim of the work is to obtain an accurate and flexible tool for consistently executing and managing the unmanned monitoring of construction sites by using distributed acoustic sensors by using a Deep Belief Network based approach.
Abstract: In this paper, we propose a Deep Belief Network (DBN) based approach for the classification of audio signals to improve work activity identification and remote surveillance of construction projects. The aim of the work is to obtain an accurate and flexible tool for consistently executing and managing the unmanned monitoring of construction sites by using distributed acoustic sensors. In this paper, ten classes of multiple construction equipment and tools, frequently and broadly used in construction sites, have been collected and examined to conduct and validate the proposed approach. The input provided to the DBN consists of the concatenation of several statistics evaluated from a set of spectral features, like MFCCs and mel-scaled spectrogram. The proposed architecture, along with the preprocessing and the feature extraction steps, has been described in detail, while the effectiveness of the proposed idea has been demonstrated by some numerical results, evaluated by using real-world recordings. The final overall accuracy on the test set is up to 98%, a significantly improved performance compared to other state-of-the-art approaches. A practical and real-time application of the presented method has also been proposed in order to apply the classification scheme to sound data recorded in different environmental scenarios.

1 citation


Proceedings ArticleDOI
14 Oct 2020
TL;DR: This research paper is an effort to identify 4 similar String instruments (Acoustic Guitar, Cello, Violin and Electric Guitar) in the audio recordings and shows that using image based transfer learning models like Inception and VGG gives better results among the aforementioned architectures.
Abstract: Automatic Instrument recognition in sound recordings is a traditional method and has been gaining a lot of attention since the last decades due to the advent of music streaming services like Spotify, Apple Music, Deezer etc. Distinction between similar instruments like Cello and Violin, Flute and Clarinet still remains a challenging task for machines and even humans. This research paper is an effort to identify 4 similar String instruments (Acoustic Guitar, Cello, Violin and Electric Guitar) in the audio recordings. We have used 1D MLPs and 2D CNN architectures for classifying the sounds and compared the performance on different audio features. Our experiments also show that using image based transfer learning models like Inception and VGG gives better results among the aforementioned architectures.

1 citation




References

Proceedings Article
01 Jan 2015, D. P. Kingma and J. Ba: "Adam: A Method for Stochastic Optimization" (ICLR)
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

78,539 citations


Book
01 Oct 2004, E. Alpaydın: "Introduction to Machine Learning" (MIT Press)
TL;DR: Introduction to Machine Learning is a comprehensive textbook on the subject, covering a broad array of topics not usually included in introductory machine learning texts, and discusses many methods from different fields, including statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining.
Abstract: The goal of machine learning is to program computers to use example data or past experience to solve a given problem. Many successful applications of machine learning exist already, including systems that analyze past sales data to predict customer behavior, optimize robot behavior so that a task can be completed using minimum resources, and extract knowledge from bioinformatics data. Introduction to Machine Learning is a comprehensive textbook on the subject, covering a broad array of topics not usually included in introductory machine learning texts. In order to present a unified treatment of machine learning problems and solutions, it discusses many methods from different fields, including statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining. All learning algorithms are explained so that the student can easily move from the equations in the book to a computer program. The text covers such topics as supervised learning, Bayesian decision theory, parametric methods, multivariate methods, multilayer perceptrons, local models, hidden Markov models, assessing and comparing classification algorithms, and reinforcement learning. New to the second edition are chapters on kernel machines, graphical models, and Bayesian estimation; expanded coverage of statistical tests in a chapter on design and analysis of machine learning experiments; case studies available on the Web (with downloadable results for instructors); and many additional exercises. All chapters have been revised and updated. Introduction to Machine Learning can be used by advanced undergraduates and graduate students who have completed courses in computer programming, probability, calculus, and linear algebra. It will also be of interest to engineers in the field who are concerned with the application of machine learning methods. Adaptive Computation and Machine Learning series

3,947 citations


Book ChapterDOI
01 Jan 2019
Abstract: Machine learning evolved from a collection of powerful techniques in AI and has been extensively used in data mining; it allows a system to learn useful structural patterns and models from training data. Machine learning algorithms can be broadly classified into four categories: supervised, unsupervised, semi-supervised and reinforcement learning. In this chapter, widely used machine learning algorithms are introduced, and each algorithm is briefly explained with some examples.

1,664 citations


Book
D. Wang and G. J. Brown (Eds.): "Computational Auditory Scene Analysis: Principles, Algorithms, and Applications" (Wiley-IEEE Press, 2006)
Abstract: A comprehensive treatment of computational auditory scene analysis (CASA), covering the fundamentals of human and computational auditory scene analysis, multiple-F0 estimation, feature-based and model-based speech segregation, binaural sound localization, localization-based grouping, reverberation, analysis of musical audio signals, robust automatic speech recognition, and neural and perceptual modeling.

899 citations


Journal ArticleDOI
S. S. Stevens, J. Volkmann, and E. B. Newman: "A Scale for the Measurement of the Psychological Magnitude Pitch" (J. Acoust. Soc. Am., 1937)
Abstract: A subjective scale for the measurement of pitch was constructed from determinations of the half‐value of pitches at various frequencies. This scale differs from both the musical scale and the frequency scale, neither of which is subjective. Five observers fractionated tones of 10 different frequencies at a loudness level of 60 db. From these fractionations a numerical scale was constructed which is proportional to the perceived magnitude of subjective pitch. In numbering the scale the 1000‐cycle tone was assigned the pitch of 1000 subjective units (mels). The close agreement of the pitch scale with an integration of the differential thresholds (DL's) shows that, unlike the DL's for loudness, all DL's for pitch are of uniform subjective magnitude. The agreement further implies that pitch and differential sensitivity to pitch are both rectilinear functions of extent on the basilar membrane. The correspondence of the pitch scale and the experimentally determined location of the resonant areas of the basilar membrane suggests that, in cutting a pitch in half, the observer adjusts the tone until it stimulates a position half‐way from the original locus to the apical end of the membrane. Measurement of the subjective size of musical intervals (such as octaves) in terms of the pitch scale shows that the intervals become larger as the frequency of the mid‐point of the interval increases (except in the two highest audible octaves). This result confirms earlier judgments as to the relative size of octaves in different parts of the frequency range.

891 citations