A CNN Approach for Audio Classification in Construction Sites
TL;DR: This work developed an application for the classification of different types and brands of construction vehicles and tools, which operates on the emitted audio through a stack of convolutional layers, demonstrating its effectiveness in environmental sound classification (ESC) achieving a high accuracy.
Abstract: Convolutional Neural Networks (CNNs) have been widely used in the field of audio recognition and classification, since they often provide positive results. Motivated by the success of this kind of approach and the lack of practical methodologies for the monitoring of construction sites by using audio data, we developed an application for the classification of different types and brands of construction vehicles and tools, which operates on the emitted audio through a stack of convolutional layers. The proposed architecture works on the mel-spectrogram representation of the input audio frames and demonstrates its effectiveness in environmental sound classification (ESC), achieving high accuracy. In summary, our contribution shows that techniques employed for general ESC can also be successfully adapted to a more specific environmental sound classification task, such as event recognition in construction sites.
Summary (2 min read)
- In recent years, many research efforts have been devoted to the event classification of audio data, thanks to the availability of cheap sensors.
- This approach is revealing itself as a promising method and a supportive resource for unmanned field monitoring and safety surveillance that leverages construction project management and decision making [8, 9].
- The classification is carried out on five classes extracted from audio files collected at several construction sites, containing in situ recordings of multiple vehicles and tools.
- Section 3 introduces the experimental setup, while Section 4 shows the obtained numerical results.
2 The proposed approach
- CNNs are a particular type of neural network that uses the convolution operation in one or more layers for the learning process.
- In the detector layer, the output of the convolution is passed through a nonlinear function, usually a ReLU function.
- The authors chose CNNs for their approach because of the intrinsic nature of audio signals.
- For the batch size, a grid search was used to determine the most appropriate values.
- Lastly, to keep the network from growing too deep, which adds complexity for no actual return, or from being too shallow, which gives substandard results, the authors used the same number of layers as other related works as a baseline.
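The convolution and detector stages described above can be illustrated with a minimal pure-Python sketch. The toy input, the kernel values, and the function names are made up for illustration and are not taken from the authors' network.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding.

    This is cross-correlation (no kernel flip), which is what most CNN
    libraries call "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Detector stage: clamp negative activations to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Toy 3x3 input and 2x2 horizontal-difference kernel (illustrative values).
image = [[1.0, 2.0, 0.0],
         [0.0, 1.0, 3.0],
         [2.0, 0.0, 1.0]]
kernel = [[1.0, -1.0],
          [1.0, -1.0]]

activations = relu(conv2d_valid(image, kernel))
print(activations)
```

Only one of the four positions produces a positive response here, so the ReLU zeroes out the rest; in a trained network the kernel values are learned rather than fixed by hand.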
2.1 Spectrogram Extraction
- The proposed DCNN takes as input the mel spectrogram, a version of the spectrogram whose frequency scale has been warped in a perceptual way, together with its time derivative.
- A mel band represents an interval of frequencies which are perceived to have the same pitch by human listeners.
- With the chosen frame length of 30 ms (see next sections), the authors obtain a small spectrogram of 60 rows and 2 columns, computed with librosa (available at https://librosa.github.io/).
- Then, again using librosa, the authors compute the derivative of the spectrogram and stack the two matrices, obtaining a dual-channel input which is fed into the network.
- The authors collected audio data of equipment operations at several construction sites, covering diverse construction machines and equipment.
- Unlike artificially built datasets, real data raise additional problems, such as noise due to weather conditions and/or workers talking among themselves.
- Thus, the authors focused their work on the classification of a reduced number of classes, specifically Backhoe JD50D Compact, Compactor Ingersoll Rand, Concrete Mixer, Excavator Cat 320E, Excavator Hitachi 50U.
- Classes which did not have enough usable audio (too short, excessive noise, low quality of the audio) were ignored for this work.
- A Zoom H1 digital handy recorder was used for data collection.
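As a rough illustration of the spectrogram extraction step, the sketch below shows how 60 mel bands can be laid out and why a 30 ms frame can yield only 2 spectrogram columns. It assumes a 22050 Hz sample rate, a hop of 512 samples, a centered short-time Fourier transform, and the HTK-style mel formula; none of these values is stated in the text. A simple finite difference stands in for librosa's delta computation, and all function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    # HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(n_mels, f_min, f_max):
    # n_mels triangular bands need n_mels + 2 edges, equally spaced in mel.
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


def n_columns(n_samples, hop_length):
    # Column count of a centered STFT: one column per full hop, plus one.
    return 1 + n_samples // hop_length


def stack_with_delta(spec):
    # spec: list of mel-band rows, each a list over time columns.
    # Finite difference along time, clamped at the borders; this is a
    # stand-in for librosa's delta feature, not a reimplementation of it.
    delta = [
        [row[min(t + 1, len(row) - 1)] - row[max(t - 1, 0)]
         for t in range(len(row))]
        for row in spec
    ]
    return [spec, delta]  # shape: (2 channels, n_mels, n_columns)


sample_rate = 22050                      # assumed, not given in the text
frame_samples = int(0.030 * sample_rate)  # 30 ms frame
edges = mel_band_edges(60, 0.0, sample_rate / 2)
print(n_columns(frame_samples, 512))      # columns per 30 ms frame
print(round(edges[1] - edges[0], 1), round(edges[-1] - edges[-2], 1))
```

The two printed band widths show the perceptual warping: bands are narrow at low frequencies and much wider near the top of the range.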
3.2 Data Preprocessing
- In order to feed the network with enough and proper data, each audio file for each class is segmented into fixed length frames (the choice of the best frame size is described in the experiment section).
- As a first step, the authors split the original audio files into two parts: training samples (70% of the original length) and test samples (30% of the original length).
- This is done to avoid testing the network on data previously used to train it, as this would hide any overfitting and give misleadingly optimistic results.
- After that, the dataset is balanced by taking N samples for each class, where N is the number of elements contained in the class with the fewest samples.
- Numerical results are evaluated in terms of accuracy, recall, precision and F1 score.
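The preprocessing steps above (split each recording 70/30, cut each part into fixed-length frames, then balance the classes) can be sketched as follows. Dropping the trailing partial frame, subsampling at random, and the fixed seed are assumptions made for the example; the function names are hypothetical.

```python
import random


def split_train_test(signal, train_fraction=0.7):
    # Split one recording by length BEFORE framing, so that no frame
    # from the test portion can appear in the training set.
    cut = int(len(signal) * train_fraction)
    return signal[:cut], signal[cut:]


def segment(signal, frame_len):
    # Non-overlapping fixed-length frames; the trailing remainder is dropped.
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def balance(frames_by_class, seed=0):
    # Keep N frames per class, N being the size of the smallest class.
    rng = random.Random(seed)
    n = min(len(frames) for frames in frames_by_class.values())
    return {label: rng.sample(frames, n)
            for label, frames in frames_by_class.items()}


# Toy "recordings": plain lists of sample indices instead of real audio.
long_recording = list(range(1000))
short_recording = list(range(430))

train_a, test_a = split_train_test(long_recording)
train_b, test_b = split_train_test(short_recording)

train_frames = balance({
    "class_a": segment(train_a, 30),
    "class_b": segment(train_b, 30),
})
print({label: len(frames) for label, frames in train_frames.items()})
```

Splitting before framing is the design choice that matters here: framing first and then shuffling would let neighbouring frames of the same recording land on both sides of the split.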
4.1 Selecting the frame size
- A sizeable amount of time was spent finding the proper length for the audio segments.
- This is of crucial importance since, if the length is not adequate, the network will not be able to learn proper features that clearly characterize the input.
- Hence, in order to select the most suitable length, the authors generated different dataset variants by splitting the audio using different frame lengths, and they subsequently trained and tested different models on the differently-sized datasets.
- Smaller frame sizes give better results, while performance drops as the frame size increases.
- The results of the classification are shown in the next subsection.
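Comparing models trained on different frame lengths relies on the evaluation metrics named earlier (accuracy, recall, precision and F1 score). The sketch below computes them from lists of true and predicted labels; the function name and the toy labels are illustrative and do not reproduce the authors' results.

```python
def classification_report(y_true, y_pred):
    """Return overall accuracy and per-class precision, recall and F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    per_class = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, per_class


# Toy labels for four frames (illustrative only).
y_true = ["mixer", "mixer", "compactor", "compactor"]
y_pred = ["mixer", "compactor", "compactor", "compactor"]

accuracy, per_class = classification_report(y_true, y_pred)
print(accuracy)
print(per_class)
```

Running the same function on the predictions of each frame-length variant gives directly comparable scores for choosing among them.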
4.2 Classification Results
- A 5-fold cross validation was performed and the results are shown in Table 2.
- The dataset was split into training set and validation set (80% – 20%) for each fold.
- The way the network learns can be seen in Fig.
- The class with the worst result is the Excavator Cat 320E, which reaches 95% accuracy.
- All details, features and parameters of the implemented classifiers can be found in .
- The proposed approach can be used to promptly predict the active working vehicles and tools.
- In fact, with such an approach, project managers will be able to remotely and continuously monitor the status of workers and machines, investigate the effective distribution of hours, and detect issues of safety in a timely manner.
- Every frame will be classified as belonging to one of the classes and the audio track will be labeled according to the majority of the labels among all the frames.
- One can also see the probability that the input track belongs to each of the classes.
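The track-level labeling described above, where each frame is classified and the track takes the majority label, can be sketched as follows. The per-frame probabilities are made-up stand-ins for network outputs, and averaging them to obtain track-level probabilities is an assumption, since the text does not say how those probabilities are computed.

```python
from collections import Counter


def label_track(frame_probs, class_names):
    """Majority vote over per-frame predictions, plus mean class probabilities."""
    # Predicted class of each frame: index of its largest probability.
    frame_labels = [
        class_names[max(range(len(p)), key=p.__getitem__)]
        for p in frame_probs
    ]
    # Majority vote; ties go to the label encountered first.
    track_label = Counter(frame_labels).most_common(1)[0][0]
    # Track-level probabilities as the mean over frames (assumed scheme).
    n_frames = len(frame_probs)
    track_probs = {
        name: sum(p[i] for p in frame_probs) / n_frames
        for i, name in enumerate(class_names)
    }
    return track_label, track_probs


class_names = ["Concrete Mixer", "Excavator Cat 320E"]
frame_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]  # hypothetical outputs

track_label, track_probs = label_track(frame_probs, class_names)
print(track_label)
print(track_probs)
```

Two of the three frames vote for the mixer, so the track takes that label even though the third frame is a confident vote for the excavator.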
Cites methods from "A CNN Approach for Audio Classifica..."
... Maccagno, Alessandro & Mastropietro, Andrea & Mazziotta, Umberto & Scarpiniti, Michele & Lee, Yongcheol & Uncini, Aurelio....