
Book ChapterDOI

A CNN Approach for Audio Classification in Construction Sites

01 Jan 2021, pp. 371-381

TL;DR: This work developed an application for the classification of different types and brands of construction vehicles and tools, which operates on the emitted audio through a stack of convolutional layers, demonstrating its effectiveness in environmental sound classification (ESC) by achieving high accuracy.

Abstract: Convolutional Neural Networks (CNNs) have been widely used in the field of audio recognition and classification, since they often provide positive results. Motivated by the success of this kind of approach and the lack of practical methodologies for the monitoring of construction sites by using audio data, we developed an application for the classification of different types and brands of construction vehicles and tools, which operates on the emitted audio through a stack of convolutional layers. The proposed architecture works on the mel-spectrogram representation of the input audio frames and demonstrates its effectiveness in environmental sound classification (ESC), achieving high accuracy. In summary, our contribution shows that techniques employed for general ESC can also be successfully adapted to a more specific environmental sound classification task, such as event recognition in construction sites.

Summary (2 min read)

1 Introduction

  • In recent years, many research efforts have been made towards the event classification of audio data, due to the availability of cheap sensors [1].
  • This approach is revealing itself as a promising method and a supportive resource for unmanned field monitoring and safety surveillance that leverages construction project management and decision making [8, 9].
  • The classification is carried out on five classes extracted from audio files collected in several construction sites, containing in situ recordings of multiple vehicles and tools.
  • Section 3 introduces the experimental setup, while Section 4 shows the obtained numerical results.

2 The proposed approach

  • CNNs are a particular type of neural network that uses the convolution operation in one or more layers of the learning process.
  • In the detector layer, the output of the convolution is passed through a nonlinear function, usually a ReLU function.
  • The authors used CNNs because of the intrinsic nature of audio signals.
  • For the batch size, a grid search was used to determine the most appropriate values.
  • Lastly, to keep the network depth from either exploding in size (adding unnecessary complexity for no actual return) or being too shallow (yielding substandard results), the authors used the same number of layers as related works, such as [14], as a baseline.

2.1 Spectrogram Extraction

  • The proposed DCNN uses as its inputs the mel spectrogram, a version of the spectrogram whose frequency scale has been warped in a perceptually motivated way, and its time derivative.
  • A mel band represents an interval of frequencies which are perceived to have the same pitch by human listeners.
  • With these parameters and the chosen frame length of 30 ms (see next sections), the authors obtain a small spectrogram of 60 rows and 2 columns (computed with the librosa library, available at https://librosa.github.io/).
  • Then, again using librosa, the authors compute the derivative of the spectrogram and overlap the two matrices, obtaining a dual-channel input which is fed into the network.

3.1 Dataset

  • The authors collected audio data of equipment operations in several construction sites featuring diverse construction machines and equipment.
  • Unlike artificially built datasets, when working with real data different problems arise, such as noise due to weather conditions and/or workers talking among themselves.
  • Thus, the authors focused their work on the classification of a reduced number of classes, specifically Backhoe JD50D Compact, Compactor Ingersoll Rand, Concrete Mixer, Excavator Cat 320E, Excavator Hitachi 50U.
  • Classes which did not have enough usable audio (too short, excessive noise, low quality of the audio) were ignored for this work.
  • A Zoom H1 digital handy recorder has been used for data collection purposes.

3.2 Data Preprocessing

  • In order to feed the network with enough and proper data, each audio file for each class is segmented into fixed length frames (the choice of the best frame size is described in the experiment section).
  • As a first step, the authors split each original audio file into two parts: training samples (70% of the original length) and test samples (30% of the original length).
  • This is done to avoid testing the network on data previously used for training, as evaluating on data already seen in training would give misleadingly optimistic results (a minimal sketch of the pipeline follows this list).
  • After that, the dataset is balanced by taking N samples for each class, where N is the number of elements contained in the class with the least amount of samples.
  • Numerical results have been evaluated in terms of accuracy, recall, precision and F1 score [17].
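
A minimal sketch of this preprocessing pipeline, under stated assumptions: the 70/30 split is taken along each file's length as described above, the 30 ms frame length comes from Section 4.1, and the function and variable names are illustrative rather than the authors' code.

```python
# Minimal sketch of the preprocessing described above (illustrative names):
# 70/30 train/test split along each file, segmentation into fixed-length
# frames, and balancing to the size of the smallest class.
import numpy as np
import librosa

FRAME_SEC = 0.030  # 30 ms, the best-performing length (see Section 4.1)

def split_and_frame(path, sr=22050):
    """Return (train_frames, test_frames) for one audio file."""
    y, _ = librosa.load(path, sr=sr)
    cut = int(0.7 * len(y))          # first 70% for training, last 30% for testing
    n = int(FRAME_SEC * sr)
    segment = lambda x: [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
    return segment(y[:cut]), segment(y[cut:])

def balance(frames_per_class, seed=0):
    """Keep N random frames per class, N = size of the smallest class."""
    n = min(len(v) for v in frames_per_class.values())
    rng = np.random.default_rng(seed)
    return {c: [v[i] for i in rng.permutation(len(v))[:n]]
            for c, v in frames_per_class.items()}
```

The balanced frames would then go through the spectrogram extraction described in Section 2.1, and the accuracy, recall, precision and F1 metrics can be computed with, for example, sklearn.metrics.classification_report.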

4.1 Selecting the frame size

  • A sizeable amount of time was spent on finding the proper length for the audio segments.
  • This is of crucial importance since, if the length is not adequate, the network will not be able to learn proper features that clearly characterize the input.
  • Hence, in order to select the most suitable length, the authors generated different dataset variants by splitting the audio using different frame lengths, and subsequently trained and tested separate models on the differently-sized datasets (see the sketch after this list).
  • Smaller frame sizes yield better results, while performance drops as the size increases.
  • The results of the classification are shown in the next subsection.
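
A sketch of that selection procedure is shown below; the two callables are hypothetical stand-ins, and any candidate lengths passed in would be illustrative, since the paper reports only the outcome of the search.

```python
# Sketch of the frame-length selection described above. Hypothetical helpers:
# build_dataset(frame_ms) -> (train_set, test_set), re-segmenting the audio,
# and train_and_eval(train_set, test_set) -> test accuracy of a fresh model.
def select_frame_length(candidates_ms, build_dataset, train_and_eval):
    scores = {}
    for ms in candidates_ms:                      # e.g. [30, 60, 120, 250, 500]
        train_set, test_set = build_dataset(frame_ms=ms)
        scores[ms] = train_and_eval(train_set, test_set)
    best = max(scores, key=scores.get)            # keep the best-scoring length
    return best, scores
```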

4.2 Classification Results

  • As just stated, a 5-fold cross validation was performed and the results are shown in Table 2.
  • The dataset was split into a training set and a validation set (80% – 20%) for each fold; a minimal sketch of this protocol is given after this list.
  • The network's learning behavior can be seen in the corresponding figure.
  • The class with the worst result is the Excavator Cat 320E, which performs at 95% accuracy.
  • All details, features and parameters of the implemented classifiers can be found in [8].
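
A minimal sketch of that protocol, assuming scikit-learn's StratifiedKFold for the splits (the paper does not name a library) and a build_model function returning a compiled Keras-style classifier:

```python
# 5-fold cross-validation sketch: each fold trains on 80% of the data and
# validates on the held-out 20%, matching the split stated above. Assumes
# y holds integer class labels, build_model() returns a classifier whose
# loss matches that encoding (e.g. sparse categorical cross-entropy) and
# whose metrics include accuracy, so evaluate() returns [loss, accuracy].
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, build_model, epochs=50):  # epoch count is an assumption
    accuracies = []
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, val_idx in folds.split(X, y):
        model = build_model()                      # fresh model per fold
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        accuracies.append(model.evaluate(X[val_idx], y[val_idx], verbose=0)[1])
    return float(np.mean(accuracies)), float(np.std(accuracies))
```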

4.3 Prediction

  • The proposed approach can be used to promptly predict the active working vehicles and tools.
  • In fact, with such an approach, project managers will be able to remotely and continuously monitor the status of workers and machines, investigate the effective distribution of hours, and detect issues of safety in a timely manner.
  • Every frame will be classified as belonging to one of the classes and the audio track will be labeled according to the majority of the labels among all the frames.
  • One can also see the probability of the input track belonging to each of the classes (a minimal sketch of this labeling scheme follows).
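
A minimal sketch of this majority-vote labeling, assuming `model` is the trained classifier and `frames` holds the preprocessed inputs of one track (names are illustrative):

```python
# Majority-vote labeling of a full track: classify each frame, assign the
# track the most frequent class, and report mean per-class probabilities.
import numpy as np

def label_track(model, frames, class_names):
    probs = model.predict(np.stack(frames), verbose=0)  # (n_frames, n_classes)
    votes = probs.argmax(axis=1)                        # per-frame labels
    majority = np.bincount(votes, minlength=len(class_names)).argmax()
    return class_names[majority], probs.mean(axis=0)    # track label, class probs
```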


A CNN Approach for Audio Classification in Construction Sites

Alessandro Maccagno¹, Andrea Mastropietro¹, Umberto Mazziotta¹, Michele Scarpiniti², Yong-Cheol Lee³, and Aurelio Uncini²

¹ Department of Computer, Control and Management Engineering, Sapienza University of Rome, Italy
{maccagno.1653200,mastropietro.1652886,mazziotta.1647818}@studenti.uniroma1.it
² Department of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome, Italy
{michele.scarpiniti,aurelio.uncini}@uniroma1.it
³ Department of Construction Management, Louisiana State University, Baton Rouge, USA
yclee@lsu.edu
Abstract. Convolutional Neural Networks (CNNs) have been widely used in the field of audio recognition and classification, since they often provide positive results. Motivated by the success of this kind of approach and the lack of practical methodologies for the monitoring of construction sites by using audio data, we developed an application for the classification of different types and brands of construction vehicles and tools, which operates on the emitted audio through a stack of convolutional layers. The proposed architecture works on the mel-spectrogram representation of the input audio frames and demonstrates its effectiveness in environmental sound classification (ESC), achieving high accuracy. In summary, our contribution shows that techniques employed for general ESC can also be successfully adapted to a more specific environmental sound classification task, such as event recognition in construction sites.

Keywords: Deep learning, Convolutional neural networks, Audio processing, Environmental sound classification, Construction sites.
1 Introduction

In recent years, many research efforts have been made towards the event classification of audio data, due to the availability of cheap sensors [1]. In fact, systems based on acoustic sensors are of particular interest for their flexibility and cheapness [2]. When we consider generic outdoor scenarios, an automatic monitoring system based on a microphone array would be an invaluable tool in assessing and controlling any type of situation occurring in the environment [3]. This includes, but is not limited to, handling large civil and/or military events. The idea in these works is to use Computational Auditory Scene Analysis (CASA) [4], which involves Computational Intelligence and Machine Learning techniques, to recognize the presence of specific objects in sound tracks. This last problem is a notable example of Automatic Audio Classification (AAC) [5], the task of automatically labeling a given audio signal with one of a set of predefined classes.

Getting into the more specific field of environmental sound classification (ESC) in construction sites, the closest attempts have been performed by Cheng et al. [6], who used Support Vector Machines (SVM) to analyze the activity of construction tools and equipment. Recent applications of AAC have also addressed audio-based construction site monitoring [7–9], in order to improve the construction process management of field activities. This approach is revealing itself as a promising method and a supportive resource for unmanned field monitoring and safety surveillance that leverages construction project management and decision making [8, 9]. More recently, several studies have extended these efforts to more complicated architectures exploiting Deep Learning techniques [10].

In the literature, it is possible to find several instances of successful applications of deep learning in the field of environmental sound classification. For example, in the work of Piczak [11], the author exploits a 2-layered CNN working on the spectrogram of the data to perform ESC, reaching an average accuracy of 70% over different datasets. Other approaches, instead of using handcrafted features such as the spectrogram, perform end-to-end environmental sound classification, obtaining better results than the previous ones [12, 13].

Inspired and motivated by the MelNet architecture described by Li et al. [14], which has proven to be remarkably effective in environmental sound classification, the aim of this paper is to develop an application able to recognize vehicles and tools used in construction sites, and to classify them in terms of type and brand. This task is tackled with a neural network approach, involving the use of a Deep Convolutional Neural Network (DCNN), which is fed with the mel spectrogram of the audio source as input. The classification is carried out on five classes extracted from audio files collected in several construction sites, containing in situ recordings of multiple vehicles and tools. We demonstrate that the proposed approach for ESC can obtain good results (average accuracy of 97%) in a very specific domain such as construction sites.

The rest of this paper is organized as follows. Section 2 describes the proposed approach used to perform the sound classification. Section 3 introduces the experimental setup, while Section 4 shows the obtained numerical results. Finally, Section 5 concludes the paper and outlines some future directions.
2 The proposed approach

CNNs are a particular type of neural network, which use the convolution operation in one or more layers of the learning process. These networks are inspired by the primal visual system, and are therefore extensively used with image and video inputs [10]. A CNN is composed of three main layers:

  • Convolutional Layer: the layer tasked with applying the convolution operation to the input. This is done by passing a filter (or kernel) over the matricial input, computing the convolution value, and using the obtained result as the value of one cell of the output matrix (called a feature map); the filter is then shifted by a predefined stride along its dimensions. The filters' parameters are learned during the training process.
  • Detector Layer: the output of the convolution is passed through a nonlinear function, usually a ReLU.
  • Pooling Layer: reduces the dimensionality of the data by combining the output of neuron clusters at one layer into a single neuron in the subsequent layer.

The last layer of the network is a fully connected one (a layer whose units are connected to every single unit of the previous one), which outputs the probability of the input belonging to each of the classes.

CNNs show some advantages with respect to traditional fully connected neural networks, because they allow sparse interactions, parameter sharing and equivariant representations.

The reason we used CNNs in our approach is the intrinsic nature of audio signals. CNNs are extensively used with images and, since the spectrogram of an audio signal is effectively a picture of it, it is straightforward to see why CNNs suit this kind of input: they can exploit the adjacency properties of audio signals and recognize patterns in the spectrum images that characterize each of the classes under consideration.

The proposed architecture consists of a DCNN composed of eight layers, as shown in Fig. 1, fed with the mel spectrogram extracted from the audio signals and its time derivative. Specifically, the input is a tensor of dimension 60 × 2 × 2, i.e., a pair of images representing the spectrogram and its time derivative: 60 is the number of mel bands, while 2 is the number of time buckets. Then, we have five convolutional layers, followed by a dense fully connected layer with 200 units and a final softmax layer that performs the classification over the 5 classes. The structure of the proposed network is summarized in Table 1 and graphically depicted in Fig. 1.
Layer Input Shape Filters Kernel Size Strides Output Shape
Conv1 [batch, 60, 2, 2] 24 (6,2) (1,1) [batch, 60, 2, 24]
Conv2 [batch, 60, 2, 24] 24 (6,2) (1,1) [batch, 60, 2, 24]
Conv3 [batch, 60, 2, 24] 48 (5,1) (2,2) [batch, 30, 1, 48]
Conv4 [batch, 30, 1, 48] 48 (5,1) (2,2) [batch, 15, 1, 48]
Conv5 [batch, 15, 1, 48] 64 (4,1) (2,2) [batch, 8, 1, 64]
Flatten [batch, 8, 1, 64] [batch, 512]
Dense [batch, 512] [batch, 200]
Output - Dense [batch, 200] [batch, 5]
Table 1. Parameters of the proposed DCNN architecture.

All the layers employ a ReLU activation function, except for the output layer, which uses a softmax function. The optimizer chosen for the network is Adam [15], with a learning rate set to 0.0005; this value was chosen by performing a grid search in the range [0.00001, 0.001]. Moreover, a dropout strategy, with a rate equal to 30%, has been used in the dense layer.
Fig. 1. Graphical representation of the proposed architecture.
Regarding the setting of the other hyper-parameters, different strategies were adopted. For the batch size, a grid search was used to determine the most appropriate value. The filter sizes and strides were set according to the input size: small filters were adopted so as to capture the small, local and adjacent features that are typical of audio data. Lastly, to prevent the network depth from either exploding in size (adding unnecessary complexity for no actual return) or being too small (returning substandard results), we used the same number of layers as related works, such as [14], as a baseline. Variations on this depth have not shown appreciable improvements in the network's overall classification effectiveness, so it has been kept unchanged.
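
For concreteness, the following is a minimal Keras sketch of a network matching Table 1. The paper does not specify the padding scheme or deep-learning framework; "same" padding is assumed here because it reproduces the output shapes listed in Table 1, while the learning rate and dropout come from the text above.

```python
# Minimal Keras sketch of the DCNN in Table 1. "same" padding is an
# assumption (not stated in the paper) chosen because it reproduces the
# output shapes of Table 1; learning rate and dropout follow the text.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_classes=5):
    model = models.Sequential([
        layers.Input(shape=(60, 2, 2)),  # mel bands x time buckets x {log-mel, delta}
        layers.Conv2D(24, (6, 2), strides=(1, 1), padding="same", activation="relu"),  # Conv1
        layers.Conv2D(24, (6, 2), strides=(1, 1), padding="same", activation="relu"),  # Conv2
        layers.Conv2D(48, (5, 1), strides=(2, 2), padding="same", activation="relu"),  # Conv3
        layers.Conv2D(48, (5, 1), strides=(2, 2), padding="same", activation="relu"),  # Conv4
        layers.Conv2D(64, (4, 1), strides=(2, 2), padding="same", activation="relu"),  # Conv5
        layers.Flatten(),                # 8 x 1 x 64 -> 512
        layers.Dense(200, activation="relu"),
        layers.Dropout(0.3),             # 30% dropout on the dense layer
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```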
2.1 Spectrogram Extraction

The proposed DCNN uses as its inputs the mel spectrogram, a version of the spectrogram whose frequency scale has been warped in a perceptually motivated way, and its time derivative.

The technique used to extract the spectrogram from the samples is the same used by Piczak [11], via the Python library librosa (available at https://librosa.github.io/). The frames were resampled to 22,050 Hz; then a window of size 1024, a hop size of 512 and 60 mel bands have been used. A mel band represents an interval of frequencies that are perceived to have the same pitch by human listeners; mel bands have been found to perform well in speech recognition.
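
For reference, the widely used HTK formulation of the mel scale maps a frequency $f$ (in Hz) to mels as

\[
m(f) = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right).
\]

This is one common convention; librosa's default mel filter bank follows Slaney's slightly different formulation, and the paper does not state which variant was used.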
With these parameters and the chosen frame length of 30 ms (see the next sections), we obtain a small spectrogram of 60 rows (bands) and 2 columns (buckets). Then, again using librosa, we compute the time derivative of the spectrogram and overlap the two matrices, obtaining a dual-channel input which is fed into the network.
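
A minimal sketch of this extraction step is given below, under stated assumptions: the function takes one mono fragment and its sampling rate, and width=3 is used for the derivative because librosa's default 9-frame delta window cannot fit the 2 time buckets produced by a 30 ms fragment (the paper does not report its delta settings).

```python
# Minimal sketch of the feature extraction described above: resample to
# 22,050 Hz, 60-band log-mel spectrogram (window 1024, hop 512), time
# derivative, and stacking into a dual-channel input.
import numpy as np
import librosa

def extract_input(y, sr):
    y = librosa.resample(y, orig_sr=sr, target_sr=22050)
    mel = librosa.feature.melspectrogram(y=y, sr=22050, n_fft=1024,
                                         hop_length=512, n_mels=60)
    log_mel = librosa.power_to_db(mel)                 # log-mel, as in Fig. 2
    delta = librosa.feature.delta(log_mel, width=3, mode="nearest")
    return np.stack([log_mel, delta], axis=-1)         # shape (60, buckets, 2)
```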
Fig. 2. Example of a log-mel spectrogram extracted from a fragment (top panel: spectrum; bottom panel: delta), along with its derivative. On the abscissa are the time buckets, each representing a sample about 23 ms long, while on the ordinate are the log-mel bands. Since our fragments are 30 ms long, the spectrogram we extract contains 2 buckets.
3 Experimental setup

3.1 Dataset

We collected audio data of equipment operations in several construction sites featuring diverse construction machines and equipment. Unlike artificially built datasets, real data present different problems, such as noise due to weather conditions and/or workers talking among themselves. Thus, we focused our work on the classification of a reduced number of classes, specifically Backhoe JD50D Compact, Compactor Ingersoll Rand, Concrete Mixer, Excavator Cat 320E, and Excavator Hitachi 50U. Classes which did not have enough usable audio (too short, excessive noise, low audio quality) were excluded from this work. The activities of these machines were observed during certain periods, and the audio signals generated were recorded accordingly. A Zoom H1 digital handy recorder has been used for data collection purposes. All files have been recorded at a sample rate of 44,100 Hz, and a total of about one hour of sound data (eight different files for each machine) has been used to train the architecture.

3.2 Data Preprocessing

In order to feed the network with enough and proper data, each audio file for each class is segmented into fixed-length frames (the choice of the best frame size is described in the experiment section).

Citations

Journal ArticleDOI
TL;DR: The result of this study demonstrates the potential of the proposed system to be applied for automated monitoring and data collection in modular construction factory in conjunction with other activity recognition frameworks based on computer vision (CV) and/or inertial measurement units (IMU).
Abstract: Modular construction is an attractive building method due to its advantages over traditional stick-built methods in terms of reduced waste and construction time, more control over resources and environment, and easier implementation of novel techniques and technologies in a controlled factory setting. However, efficient and timely decision-making in modular factories requires spatiotemporal information about the resources regarding their locations and activities which motivates the necessity for an automated activity identification framework. Thus, this paper utilizes sound, a ubiquitous data source present in every modular construction factory, for the automatic identification of commonly performed manual activities such as hammering, nailing, sawing, etc. To develop a robust activity identification model, it is imperative to engineer the appropriate features of the data source (i.e., traits of the signal) that provides a compact yet descriptive representation of the parameterized audio signal based on the nature of the sound, which is very dependent on the application domain. In-depth analysis regarding appropriate features selection and engineering for audio-based activity identification in construction is missing from current research. Thus, this research extensively investigates the effects of various features extracted from four different domains related to audio signals (time-, time-frequency-, cepstral-, and wavelet-domains), in the overall performance of the activity identification model. The effect of these features on activity identification performance was tested by collecting and analyzing audio data generated from manual activities at a modular construction factory. The collected audio signals were first balanced using time-series data augmentation techniques and then used to extract a 318-dimensional feature vector containing 18 different feature sets from the abovementioned four domains. Several sensitivity analyses were performed to optimize the feature space using a feature ranking technique (i.e., Relief algorithm), and the contribution of features in the top feature sets using a support vector machine (SVM). Eventually, a final feature space was designed containing a 130-dimensional feature vector and 0.5-second window size yielding about 97% F-1 score for identifying different activities. The contributions of this study are two-fold: 1. A novel means of automated manual construction activity identification using audio signal is presented; and 2. Foundational knowledge on the selection and optimization of the feature space from four domains is provided for future work in this research field. The result of this study demonstrates the potential of the proposed system to be applied for automated monitoring and data collection in modular construction factory in conjunction with other activity recognition frameworks based on computer vision (CV) and/or inertial measurement units (IMU).

7 citations


Journal ArticleDOI
TL;DR: The sounds of work activities and equipment operations at a construction site provide critical information regarding construction progress, task performance, and safety issues.
Abstract: The sounds of work activities and equipment operations at a construction site provide critical information regarding construction progress, task performance, and safety issues.

6 citations


Proceedings ArticleDOI
24 Jan 2021
Abstract: In this paper, we propose a Deep Recurrent Neural Network (DRNN) approach based on Long Short-Term Memory (LSTM) units for the classification of audio signals recorded in construction sites. Five classes of multiple vehicles and tools, normally used in construction sites, have been considered. The input provided to the DRNN consists of the concatenation of several spectral features, like MFCCs, mel-scaled spectrogram, chroma and spectral contrast. The proposed architecture and the feature extraction have been described. Some experimental results, obtained by using real-world recordings, demonstrate the effectiveness of the proposed idea. The final overall accuracy on the test set is up to 97% and outperforms other state-of-the-art approaches.

2 citations


Journal ArticleDOI
TL;DR: The aim of the work is to obtain an accurate and flexible tool for consistently executing and managing the unmanned monitoring of construction sites by using distributed acoustic sensors by using a Deep Belief Network based approach.
Abstract: In this paper, we propose a Deep Belief Network (DBN) based approach for the classification of audio signals to improve work activity identification and remote surveillance of construction projects. The aim of the work is to obtain an accurate and flexible tool for consistently executing and managing the unmanned monitoring of construction sites by using distributed acoustic sensors. In this paper, ten classes of multiple construction equipment and tools, frequently and broadly used in construction sites, have been collected and examined to conduct and validate the proposed approach. The input provided to the DBN consists of the concatenation of several statistics evaluated from a set of spectral features, like MFCCs and mel-scaled spectrogram. The proposed architecture, along with the preprocessing and the feature extraction steps, has been described in detail, while the effectiveness of the proposed idea has been demonstrated by some numerical results, evaluated by using real-world recordings. The final overall accuracy on the test set is up to 98%, a significantly improved performance compared to other state-of-the-art approaches. A practical and real-time application of the presented method has also been proposed in order to apply the classification scheme to sound data recorded in different environmental scenarios.

1 citation


Proceedings ArticleDOI
14 Oct 2020
TL;DR: This research paper is an effort to identify 4 similar String instruments (Acoustic Guitar, Cello, Violin and Electric Guitar) in the audio recordings and shows that using image based transfer learning models like Inception and VGG gives better results among the aforementioned architectures.
Abstract: Automatic Instrument recognition in sound recordings is a traditional method and has been gaining a lot of attention since the last decades due to the advent of music streaming services like Spotify, Apple Music, Deezer etc. Distinction between similar instruments like Cello and Violin, Flute and Clarinet still remains a challenging task for machines and even humans. This research paper is an effort to identify 4 similar String instruments (Acoustic Guitar, Cello, Violin and Electric Guitar) in the audio recordings. We have used 1D MLPs and 2D CNN architectures for classifying the sounds and compared the performance on different audio features. Our experiments also show that using image based transfer learning models like Inception and VGG gives better results among the aforementioned architectures.

1 citation




References

Proceedings Article
01 Jan 2015, D. P. Kingma and J. Ba: "Adam: A Method for Stochastic Optimization" (ICLR)
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

78,539 citations


Book
01 Oct 2004, E. Alpaydın: "Introduction to Machine Learning" (MIT Press)
TL;DR: Introduction to Machine Learning is a comprehensive textbook on the subject, covering a broad array of topics not usually included in introductory machine learning texts, and discusses many methods from different fields, including statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining.
Abstract: The goal of machine learning is to program computers to use example data or past experience to solve a given problem. Many successful applications of machine learning exist already, including systems that analyze past sales data to predict customer behavior, optimize robot behavior so that a task can be completed using minimum resources, and extract knowledge from bioinformatics data. Introduction to Machine Learning is a comprehensive textbook on the subject, covering a broad array of topics not usually included in introductory machine learning texts. In order to present a unified treatment of machine learning problems and solutions, it discusses many methods from different fields, including statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining. All learning algorithms are explained so that the student can easily move from the equations in the book to a computer program. The text covers such topics as supervised learning, Bayesian decision theory, parametric methods, multivariate methods, multilayer perceptrons, local models, hidden Markov models, assessing and comparing classification algorithms, and reinforcement learning. New to the second edition are chapters on kernel machines, graphical models, and Bayesian estimation; expanded coverage of statistical tests in a chapter on design and analysis of machine learning experiments; case studies available on the Web (with downloadable results for instructors); and many additional exercises. All chapters have been revised and updated. Introduction to Machine Learning can be used by advanced undergraduates and graduate students who have completed courses in computer programming, probability, calculus, and linear algebra. It will also be of interest to engineers in the field who are concerned with the application of machine learning methods. Adaptive Computation and Machine Learning series

3,947 citations


Book ChapterDOI
01 Jan 2019
Abstract: Machine learning evolved from a collection of powerful techniques in AI and has been extensively used in data mining; it allows a system to learn useful structural patterns and models from training data. Machine learning algorithms can be broadly classified into four categories: supervised, unsupervised, semi-supervised and reinforcement learning. In this chapter, widely used machine learning algorithms are introduced, and each algorithm is briefly explained with some examples.

1,664 citations


Book
D. Wang and G. J. Brown (Eds.): "Computational Auditory Scene Analysis: Principles, Algorithms, and Applications" (Wiley-IEEE Press, 2006)
Abstract: A comprehensive treatment of computational auditory scene analysis (CASA), covering the fundamentals of human and computational auditory scene analysis, multiple-F0 estimation, feature-based and model-based speech segregation, binaural sound localization, localization-based grouping, reverberation, analysis of musical audio signals, robust automatic speech recognition, and neural and perceptual modeling.

899 citations


Journal ArticleDOI
S. S. Stevens, J. Volkmann, and E. B. Newman: "A Scale for the Measurement of the Psychological Magnitude Pitch" (J. Acoust. Soc. Am., 1937)
Abstract: A subjective scale for the measurement of pitch was constructed from determinations of the half‐value of pitches at various frequencies. This scale differs from both the musical scale and the frequency scale, neither of which is subjective. Five observers fractionated tones of 10 different frequencies at a loudness level of 60 db. From these fractionations a numerical scale was constructed which is proportional to the perceived magnitude of subjective pitch. In numbering the scale the 1000‐cycle tone was assigned the pitch of 1000 subjective units (mels). The close agreement of the pitch scale with an integration of the differential thresholds (DL's) shows that, unlike the DL's for loudness, all DL's for pitch are of uniform subjective magnitude. The agreement further implies that pitch and differential sensitivity to pitch are both rectilinear functions of extent on the basilar membrane. The correspondence of the pitch scale and the experimentally determined location of the resonant areas of the basilar membrane suggests that, in cutting a pitch in half, the observer adjusts the tone until it stimulates a position half‐way from the original locus to the apical end of the membrane. Measurement of the subjective size of musical intervals (such as octaves) in terms of the pitch scale shows that the intervals become larger as the frequency of the mid‐point of the interval increases (except in the two highest audible octaves). This result confirms earlier judgments as to the relative size of octaves in different parts of the frequency range.

891 citations