Book Chapter DOI

A CNN Approach for Audio Classification in Construction Sites

TL;DR: This work developed an application that classifies different types and brands of construction vehicles and tools from their emitted audio through a stack of convolutional layers, demonstrating the effectiveness of CNNs in environmental sound classification (ESC) by achieving high accuracy.
Abstract: Convolutional Neural Networks (CNNs) have been widely used in the field of audio recognition and classification, since they often provide positive results. Motivated by the success of this kind of approach and the lack of practical methodologies for the monitoring of construction sites by using audio data, we developed an application for the classification of different types and brands of construction vehicles and tools, which operates on the emitted audio through a stack of convolutional layers. The proposed architecture works on the mel-spectrogram representation of the input audio frames and it demonstrates its effectiveness in environmental sound classification (ESC) achieving a high accuracy. In summary, our contribution shows that techniques employed for general ESC can be also successfully adapted to a more specific environmental sound classification task, such as event recognition in construction sites.

Summary (2 min read)

1 Introduction

  • In recent years, many research efforts have been made towards the event classification of audio data, due to the availability of cheap sensors [1].
  • This approach is revealing itself as a promising method and a supportive resource for unmanned field monitoring and safety surveillance that leverages construction project management and decision making [8, 9].
  • The classification will be carried out on five classes extracted from audio files collected in several construction sites, containing in situ recordings of multiple vehicles and tools.
  • Section 3 introduces the experimental setup, while Section 4 shows the obtained numerical results.

2 The proposed approach

  • CNNs are a particular type of neural networks, which use the convolution operation in one or more layers for the learning process.
  • In the detector layer, the output of the convolution is passed through a nonlinear function, usually a ReLU function.
  • The reason the authors used CNNs in their approach is the intrinsic nature of audio signals.
  • For the batch size, a grid search was used to determine the most appropriate values.
  • Lastly, to prevent the network depth from either exploding in size, adding unnecessary complexity for no actual return, or not being high enough, and therefore returning substandard results, the authors decided to use the same number of layers as other related works, such as the one in [14], as a baseline.

2.1 Spectrogram Extraction

  • The proposed DCNN takes as its inputs the mel spectrogram, a version of the spectrogram where the frequency scale has been distorted in a perceptual way, and its time derivative.
  • A mel band represents an interval of frequencies which are perceived to have the same pitch by human listeners.
  • With these parameters and the chosen length of 30 ms for the frames (see next sections), the authors obtain a small spectrogram of 60 rows and 2 columns.
  • Then, again using librosa, the authors compute the derivative of the spectrogram and overlap the two matrices, obtaining a dual-channel input which is fed into the network.

3.1 Dataset

  • The authors collected audio data of equipment operations in several construction sites, consisting of diverse construction machines and equipment.
  • Unlike artificially built datasets, when working with real data different problems arise, such as noise due to weather conditions and/or workers talking among themselves.
  • Thus, the authors focused their work on the classification of a reduced number of classes, specifically Backhoe JD50D Compact, Compactor Ingersoll Rand, Concrete Mixer, Excavator Cat 320E, Excavator Hitachi 50U.
  • Classes which did not have enough usable audio (too short, excessive noise, low quality of the audio) were ignored for this work.
  • A Zoom H1 digital handy recorder has been used for data collection purposes.

3.2 Data Preprocessing

  • In order to feed the network with enough and proper data, each audio file for each class is segmented into fixed length frames (the choice of the best frame size is described in the experiment section).
  • As a first step, the authors split the original audio files into two parts, training samples (70% of the original length) and test samples (30% of the original length).
  • This is done to avoid testing the network on data used previously to train the network, as this would cause the network to overfit and give misleading results.
  • After that, the dataset is balanced by taking N samples for each class, where N is the number of elements contained in the class with the least amount of samples.
  • Numerical results have been evaluated in terms of accuracy, recall, precision and F1 score [17].
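The splitting, segmentation, and balancing steps described above can be sketched as follows. The 70/30 time split and the smallest-class balancing rule come from the text; the function and variable names are illustrative assumptions.

```python
import numpy as np

def split_and_frame(signal, frame_len, train_frac=0.7):
    """Split one audio signal in time (70% train / 30% test), then cut
    each part into fixed-length, non-overlapping frames."""
    split = int(len(signal) * train_frac)
    train, test = signal[:split], signal[split:]
    frames = lambda x: [x[i:i + frame_len]
                        for i in range(0, len(x) - frame_len + 1, frame_len)]
    return frames(train), frames(test)

def balance(frames_per_class):
    """Keep N frames per class, where N is the size of the smallest class."""
    n = min(len(f) for f in frames_per_class.values())
    return {cls: f[:n] for cls, f in frames_per_class.items()}
```

Splitting the raw files in time before framing keeps every test frame disjoint from the training material, as the bullets above require.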

4.1 Selecting the frame size

  • A sizeable amount of time was spent finding the proper length for the audio segments.
  • This is of crucial importance since, if the length is not adequate, the network will not be able to learn proper features that clearly characterize the input.
  • Hence, in order to select the most suitable length, the authors generated different dataset variants by splitting the audio using different frame lengths, and they subsequently trained and tested different models on the differently-sized datasets.
  • With a smaller frame size, better results are obtained, while a drop in performance is noticed as the size increases.
  • The results of the classification are shown in the next subsection.
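The selection procedure above can be sketched as a loop over candidate frame lengths; `train_and_eval` is a hypothetical stand-in for building a dataset variant, training a model, and returning its accuracy.

```python
def pick_frame_size(candidates_ms, train_and_eval, sr=22050):
    """Try several frame lengths (in ms) and keep the one whose
    trained model scores the highest accuracy."""
    scores = {}
    for ms in candidates_ms:
        frame_len = int(sr * ms / 1000)  # frame length in samples
        scores[ms] = train_and_eval(frame_len)
    best = max(scores, key=scores.get)
    return best, scores
```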

4.2 Classification Results

  • As just stated, a 5-fold cross validation was performed and the results are shown in Table 2.
  • The dataset was split into training set and validation set (80% – 20%) for each fold.
  • The way the network learns can be seen in Fig.
  • The class with the worst result is the Excavator Cat 320E, which performs at 95% accuracy.
  • All details, features and parameters of the implemented classifiers can be found in [8].
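The accuracy, recall, precision, and F1 metrics mentioned above can be computed per class from a confusion matrix; this is a generic sketch, not the authors' evaluation code.

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = number of samples of true class i predicted as class j.
    Returns overall accuracy plus per-class precision, recall, and F1."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)  # column sums = predicted
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)     # row sums = actual
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1
```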

4.3 Prediction

  • The proposed approach can be used to promptly predict the active working vehicles and tools.
  • In fact, with such an approach, project managers will be able to remotely and continuously monitor the status of workers and machines, investigate the effective distribution of hours, and detect issues of safety in a timely manner.
  • Every frame will be classified as belonging to one of the classes and the audio track will be labeled according to the majority of the labels among all the frames.
  • One can also see the probability of the input track belonging to each of the classes.
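The frame-level majority vote described above can be sketched as follows; the per-frame predictions would come from the trained network, and reporting the vote fractions per class gives the track-level class probabilities mentioned in the last bullet.

```python
from collections import Counter

def label_track(frame_predictions):
    """Label a whole audio track with the majority class among its
    per-frame predictions, and report the vote fraction per class."""
    votes = Counter(frame_predictions)
    label = votes.most_common(1)[0][0]
    fractions = {cls: n / len(frame_predictions) for cls, n in votes.items()}
    return label, fractions
```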


A CNN Approach for Audio Classification in Construction Sites

Alessandro Maccagno¹, Andrea Mastropietro¹, Umberto Mazziotta¹, Michele Scarpiniti², Yong-Cheol Lee³, and Aurelio Uncini²

¹ Department of Computer, Control and Management Engineering, Sapienza University of Rome, Italy, {maccagno.1653200,mastropietro.1652886,mazziotta.1647818}@studenti.uniroma1.it
² Department of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome, Italy, {michele.scarpiniti,aurelio.uncini}@uniroma1.it
³ Department of Construction Management, Louisiana State University, Baton Rouge, USA, yclee@lsu.edu
Abstract. Convolutional Neural Networks (CNNs) have been widely used in the field of audio recognition and classification, since they often provide positive results. Motivated by the success of this kind of approach and the lack of practical methodologies for the monitoring of construction sites by using audio data, we developed an application for the classification of different types and brands of construction vehicles and tools, which operates on the emitted audio through a stack of convolutional layers. The proposed architecture works on the mel-spectrogram representation of the input audio frames and it demonstrates its effectiveness in environmental sound classification (ESC) achieving a high accuracy. In summary, our contribution shows that techniques employed for general ESC can be also successfully adapted to a more specific environmental sound classification task, such as event recognition in construction sites.
Keywords: Deep learning, Convolutional neural networks, Audio processing, Environmental sound classification, Construction sites.
1 Introduction

In recent years, many research efforts have been made towards the event classification of audio data, due to the availability of cheap sensors [1]. In fact, systems based on acoustic sensors are of particular interest for their flexibility and cheapness [2]. When we consider generic outdoor scenarios, an automatic monitoring system based on a microphone array would be an invaluable tool in assessing and controlling any type of situation occurring in the environment [3]. This includes, but is not limited to, handling large civil and/or military events. The idea in these works is to use Computational Auditory Scene Analysis (CASA) [4], which involves Computational Intelligence and Machine Learning techniques, to recognize the presence of specific objects in sound tracks. This last problem is a notable example of Automatic Audio Classification (AAC) [5], the task of automatically labeling a given audio signal in a set of predefined classes.
Getting into the more specific field of environmental sound classification (ESC) in construction sites, the closest attempts have been performed by Cheng et al. [6], who used Support Vector Machines (SVM) to analyze the activity of construction tools and equipment. Recent applications of AAC have also been addressed to audio-based construction site monitoring [7–9], in order to improve the construction process management of field activities. This approach is revealing itself as a promising method and a supportive resource for unmanned field monitoring and safety surveillance that leverages construction project management and decision making [8, 9]. More recently, several studies extend these efforts to more complicated architectures exploiting Deep Learning techniques [10].

In the literature, it is possible to find several instances of successful applications in the field of environmental sound classification that make use of deep learning. For example, in the work of Piczak [11], the author exploits a 2-layered CNN working on the spectrogram of the data to perform ESC, reaching an average accuracy of 70% over different datasets. Other approaches, instead of using handcrafted features such as the spectrogram, perform end-to-end environmental sound classification, obtaining better results than the previous ones [12, 13].

Inspired and motivated by the MelNet architecture described by Li et al. [14], which has been proven to be remarkably effective in environmental sound classification, the aim of this paper is to develop an application able to recognize vehicles and tools used in construction sites, and classify them in terms of type and brand. This task will be tackled with a neural network approach, involving the use of a Deep Convolutional Neural Network (DCNN), which will be fed with the mel spectrogram of the audio source as input. The classification will be carried out on five classes extracted from audio files collected in several construction sites, containing in situ recordings of multiple vehicles and tools. We demonstrate that the proposed approach for ESC can obtain good results (average accuracy of 97%) in a very specific domain such as that of construction sites.

The rest of this paper is organized as follows. Section 2 describes the proposed approach used to perform the sound classification. Section 3 introduces the experimental setup, while Section 4 shows the obtained numerical results. Finally, Section 5 concludes the paper and outlines some future directions.
2 The proposed approach

CNNs are a particular type of neural networks, which use the convolution operation in one or more layers for the learning process. These networks are inspired by the primal visual system, and are therefore extensively used with image and video inputs [10]. A CNN is composed of three main layers:

Convolutional layer: The convolutional layer is the one tasked with applying the convolution operation on the input. This is done by passing a filter (or kernel) over the matricial input, computing the convolution value, and using the obtained result as the value of one cell of the output matrix (called feature map); the filter is then shifted by a predefined stride along its dimensions. The filter's parameters are trained during the training process.

Detector layer: In the detector layer, the output of the convolution is passed through a nonlinear function, usually a ReLU function.

Pooling layer: The pooling layer is meant to reduce the dimensionality of the data by combining the output of neuron clusters at one layer into a single neuron in the subsequent layer.

The last layer of the network is a fully connected one (a layer whose units are connected to every single unit of the previous one), which outputs the probability of the input belonging to each of the classes.
CNNs in a machine learning system show some advantages with respect to traditional fully connected neural networks, because they allow sparse interactions, parameter sharing and equivariant representations.

The reason we used CNNs in our approach is the intrinsic nature of audio signals. CNNs are extensively used with images and, since the spectrum of the audio is an actual picture of the signal, it is straightforward to see why CNNs are a good fit for this kind of input, being able to exploit the adjacency properties of audio signals and recognize patterns in the spectrum images that can properly represent each of the classes taken into consideration.
The proposed architecture consists of a DCNN composed of eight layers, as shown in Fig. 1, that is fed with the mel spectrogram extracted from the audio signals and its time derivative. Specifically, we have as input a tensor of dimension 60 × 2 × 2, that is, a couple of images representing the spectrogram and its time derivative: 60 is the number of mel bands, while 2 is the number of time buckets. Then, we have five convolutional layers, followed by a dense fully connected layer with 200 units and a final softmax layer that performs the classification over the 5 classes. The structure of the proposed network is summarized in Table 1, and it can be graphically appreciated in Fig. 1.
Layer           Input Shape         Filters   Kernel Size   Strides   Output Shape
Conv1           [batch, 60, 2, 2]   24        (6,2)         (1,1)     [batch, 60, 2, 24]
Conv2           [batch, 60, 2, 24]  24        (6,2)         (1,1)     [batch, 60, 2, 24]
Conv3           [batch, 60, 2, 24]  48        (5,1)         (2,2)     [batch, 30, 1, 48]
Conv4           [batch, 30, 1, 48]  48        (5,1)         (2,2)     [batch, 15, 1, 48]
Conv5           [batch, 15, 1, 48]  64        (4,1)         (2,2)     [batch, 8, 1, 64]
Flatten         [batch, 8, 1, 64]   –         –             –         [batch, 512]
Dense           [batch, 512]        –         –             –         [batch, 200]
Output - Dense  [batch, 200]        –         –             –         [batch, 5]

Table 1. Parameters of the proposed DCNN architecture.
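The architecture of Table 1 can be reproduced, for instance, with the Keras API. Note that "same" padding is an assumption made here so that the listed strides yield the listed output shapes; details such as weight initialization are not specified in the text, so this is a sketch rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dcnn(n_classes=5):
    """DCNN of Table 1: five conv layers on a 60x2x2 mel-spectrogram
    input, a 200-unit dense layer with 30% dropout, and a softmax output."""
    model = models.Sequential([
        layers.Input(shape=(60, 2, 2)),
        layers.Conv2D(24, (6, 2), strides=(1, 1), padding="same", activation="relu"),
        layers.Conv2D(24, (6, 2), strides=(1, 1), padding="same", activation="relu"),
        layers.Conv2D(48, (5, 1), strides=(2, 2), padding="same", activation="relu"),
        layers.Conv2D(48, (5, 1), strides=(2, 2), padding="same", activation="relu"),
        layers.Conv2D(64, (4, 1), strides=(2, 2), padding="same", activation="relu"),
        layers.Flatten(),                  # 8 * 1 * 64 = 512 features
        layers.Dense(200, activation="relu"),
        layers.Dropout(0.3),               # dropout rate from the text
        layers.Dense(n_classes, activation="softmax"),
    ])
    # Adam with learning rate 0.0005, as selected by the grid search
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```

With "same" padding, each (2,2)-strided convolution halves the 60-band axis (60 → 30 → 15 → 8) and collapses the 2 time buckets to 1, matching the output shapes in the table.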

All the layers employ a ReLU activation function except for the output layer, which uses a Softmax function. The optimizer chosen for the network is the Adam optimizer [15], with a learning rate set to 0.0005. This value was chosen by performing a grid search in the range [0.00001, 0.001]. Moreover, a dropout strategy, with a rate equal to 30%, has been used in the dense layer.
Fig. 1. Graphical representation of the proposed architecture.
Regarding the setting of other hyper-parameters, different strategies were adopted. For the batch size, a grid search was used to determine the most appropriate values. The filter size and the stride were set reasonably according to the input size. Small filters were adopted so as to capture the small, local and adjacent features that are typical of audio data. Lastly, to prevent the network depth from either exploding in size, adding unnecessary complexity for no actual return, or not being high enough, and therefore returning substandard results, we decided to use the same number of layers as other related works, such as the one in [14], as a baseline. Variations on this depth have not shown appreciable improvements in the overall effectiveness of the network's classification, so it has been kept unchanged.
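The grid searches mentioned above can be organized as a simple exhaustive loop; the candidate values below and the `evaluate` callable are illustrative assumptions, not the authors' exact grid.

```python
from itertools import product

def grid_search(evaluate, learning_rates, batch_sizes):
    """Score every (learning rate, batch size) pair with the given
    evaluation callable and return the best-scoring combination."""
    best, best_score = None, float("-inf")
    for lr, bs in product(learning_rates, batch_sizes):
        score = evaluate(lr, bs)  # e.g. validation accuracy of a trained model
        if score > best_score:
            best, best_score = (lr, bs), score
    return best, best_score
```

In the paper's setting, `evaluate` would train the DCNN with the given pair and return validation accuracy; the learning-rate candidates would be drawn from the stated range [0.00001, 0.001].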
2.1 Spectrogram Extraction

The proposed DCNN uses, as its inputs, the mel spectrogram, which is a version of the spectrogram where the frequency scale has been distorted in a perceptual way, and its time derivative.

The technique used to extract the spectrogram from the sample is the same used by Piczak [11], via the Python library librosa⁴. The frames were resampled to 22,050 Hz, then a window of size 1024 with a hop size of 512 and 60 mel bands has been used. A mel band represents an interval of frequencies which are perceived to have the same pitch by human listeners. Mel bands have been found to perform well in speech recognition.

With these parameters, and the chosen length of 30 ms for the frames (see next sections), we obtain a small spectrogram of 60 rows (bands) and 2 columns (buckets). Then, again using librosa, we compute the derivative of the spectrogram and we overlap the two matrices, obtaining a dual-channel input which is fed into the network.

⁴ Available at: https://librosa.github.io/
Fig. 2. Example of a log-mel spectrogram extracted from a fragment along with its derivative. On the abscissa we find the time buckets, each representing a sample about 23 ms long, while on the ordinate we find the log-mel bands. Since our fragments are 30 ms long, the extracted spectrogram contains 2 buckets.
3 Experimental setup

3.1 Dataset

We collected audio data of equipment operations in several construction sites, consisting of diverse construction machines and equipment. Unlike artificially built datasets, when working with real data different problems arise, such as noise due to weather conditions and/or workers talking among themselves. Thus, we focused our work on the classification of a reduced number of classes, specifically Backhoe JD50D Compact, Compactor Ingersoll Rand, Concrete Mixer, Excavator Cat 320E, and Excavator Hitachi 50U. Classes which did not have enough usable audio (too short, excessive noise, low quality of the audio) were ignored for this work. The activities of these machines were observed during certain periods, and the audio signals generated were recorded accordingly. A Zoom H1 digital handy recorder has been used for data collection purposes. All files have been recorded using a sample rate of 44,100 Hz, and a total of about one hour of sound data (eight different files for each machine) has been used to train the architecture.
3.2 Data Preprocessing

In order to feed the network with enough and proper data, each audio file for each class is segmented into fixed-length frames (the choice of the best frame size is described in the experiment section).

Citations
Journal ArticleDOI
TL;DR: The sounds of work activities and equipment operations at a construction site provide critical information regarding construction progress, task performance, and safety issues.
Abstract: The sounds of work activities and equipment operations at a construction site provide critical information regarding construction progress, task performance, and safety issues The construc

28 citations

Journal ArticleDOI
TL;DR: The result of this study demonstrates the potential of the proposed system to be applied for automated monitoring and data collection in modular construction factory in conjunction with other activity recognition frameworks based on computer vision (CV) and/or inertial measurement units (IMU).

25 citations

Journal ArticleDOI
TL;DR: The aim of the work is to obtain an accurate and flexible tool for consistently executing and managing the unmanned monitoring of construction sites by using distributed acoustic sensors by using a Deep Belief Network based approach.
Abstract: In this paper, we propose a Deep Belief Network (DBN) based approach for the classification of audio signals to improve work activity identification and remote surveillance of construction projects. The aim of the work is to obtain an accurate and flexible tool for consistently executing and managing the unmanned monitoring of construction sites by using distributed acoustic sensors. In this paper, ten classes of multiple construction equipment and tools, frequently and broadly used in construction sites, have been collected and examined to conduct and validate the proposed approach. The input provided to the DBN consists in the concatenation of several statistics evaluated by a set of spectral features, like MFCCs and mel-scaled spectrogram. The proposed architecture, along with the preprocessing and the feature extraction steps, has been described in details while the effectiveness of the proposed idea has been demonstrated by some numerical results, evaluated by using real-world recordings. The final overall accuracy on the test set is up to 98% and is a significantly improved performance compared to other state-of-the-are approaches. A practical and real-time application of the presented method has been also proposed in order to apply the classification scheme to sound data recorded in different environmental scenarios.

23 citations

Journal ArticleDOI
TL;DR: A deep neural network architecture, Convolutional Neural Network-based ocean noise classification cum recognition system capable of classifying vocalization of cetaceans, fishes, marine invertebrates, anthropogenic sounds, natural sounds, and the unidentified ocean sounds from passive acoustic ocean noise recordings is presented.

17 citations

Proceedings ArticleDOI
24 Jan 2021
TL;DR: In this article, a Deep Recurrent Neural Network (DRNN) approach based on Long Short Term Memory (LSTM) units was proposed for the classification of audio signals recorded in construction sites.
Abstract: In this paper, we propose a Deep Recurrent Neural Network (DRNN) approach based on Long-Short Term Memory (LSTM) units for the classification of audio signals recorded in construction sites. Five classes of multiple vehicles and tools, normally used in construction sites, have been considered. The input provided to the DRNN consists in the concatenation of several spectral features, like MFCCs, mel-scaled spectrogram, chroma and spectral contrast. The proposed architecture and the feature extraction have been described. Some experimental results, obtained by using real-world recordings, demonstrate the effectiveness of the proposed idea. The final overall accuracy on the test set is up to 97% and overcomes other state-of-the-art approaches.

11 citations

References
Journal ArticleDOI
TL;DR: An extensive set of simulations showing the effectiveness of the proposed architecture are provided, with a particular emphasis on the innovative aspects that are introduced with respect to the state-of-the-art.
Abstract: The aim of this paper is to describe a novel security system able to localize and classify audio sources in an outdoor environment. Its primary intended use is for security monitoring in severe scenarios, and it has been designed to cope with a large set of heterogeneous objects, including weapons, human speakers and vehicles. The system is the result of a research project sponsored by the Italian Ministry of Defense. It is composed of a large squared array of 864 microphones arranged in a rectangular lattice, whose input is processed using a classical delay-and-sum beamformer. The result of this localization process is elaborated by a complex multi-level classification system designed in a modular fashion. In this paper, after presenting the details of the system's design, with a particular emphasis on the innovative aspects that are introduced with respect to the state-of-the-art, we provide an extensive set of simulations showing the effectiveness of the proposed architecture. We conclude by describing the current limits of the system, and the projected further developments.

15 citations

Trending Questions (1)
What is the use of CNN in the construction industry?

CNNs are used in the construction industry for audio classification, specifically for monitoring construction sites by classifying different types and brands of construction vehicles and tools.