
HAL Id: hal-01456201
https://hal.inria.fr/hal-01456201
Submitted on 4 Feb 2017
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Robust features for environmental sound classification
Sunit Sivasankaran, K.M.M. Prabhu
To cite this version:
Sunit Sivasankaran, K.M.M. Prabhu. Robust features for environmental sound classification. 2013 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Jan 2013, Bangalore, India. pp.1-6, 10.1109/CONECCT.2013.6469297. hal-01456201

Robust features for Environmental Sound
Classification
Sunit Sivasankaran, K.M.M. Prabhu
Department of Electrical Engineering
Indian Institute Of Technology Madras, Chennai-36, India
Email: sunit.sivasankaran@gmail.com, prabhu@ee.iitm.ac.in
Abstract—In this paper we describe algorithms to classify environmental sounds with the aim of providing contextual information to devices such as hearing aids for optimum performance. We use signal sub-band energy to construct a signal-dependent dictionary and matching pursuit algorithms to obtain a sparse representation of a signal. The coefficients of the sparse vector are used as weights to compute weighted features. These features, along with mel frequency cepstral coefficients (MFCC), are used as feature vectors for classification. Experimental results show that the proposed method gives a maximum accuracy of 95.6% while classifying 14 categories of environmental sound using a Gaussian mixture model (GMM).
I. INTRODUCTION
To a careful listener, an audio recording is a rich source
of information giving clues such as location, direction of
vehicular movement, environmental information, speed
of wind and so on. It is, therefore, only natural to
ask if we could make machines imitate human listening
capabilities. One step in this direction is to train a
machine to automatically classify the environment based
on a set of features extracted from an audio sample.
Environmental sound classification has a variety of applications. Modern hearing aids [1] consist of several programs which account for the reverberation model of environments such as meeting rooms, auditoriums and other noisy spaces. Automatic recognition of the surrounding environment allows these devices to switch between programs and work with minimum user intervention. Other applications include video scene analysis, which is generally achieved using computationally heavy video processing techniques. Liu et al. [2] have proposed a set of low-level features, such as frequency centroid, frequency bandwidth and sub-band energy ratio, to characterize the audio clip of a video scene; these features were shown to have good scene discrimination capabilities.
Another interesting application of the environmental classification problem is in creating an automatic diary [3] by processing audio recorded over a certain period. Users are automatically presented with the locations they have visited, based on classifying the environments using features extracted from the audio samples.
The problem of environment classification is a sub-
problem of a bigger area of research called computational
auditory scene recognition (CASR) [4], where the focus
is on recognizing the context. Applications of CASR include providing context to devices such as hearing aids and mobile phones, enabling them to provide better service.
Previous attempts to classify environments have given rise to new sets of features. Peltonen et al. [4] used mel frequency cepstral coefficients (MFCC) as features, and Gaussian mixture models (GMM) and neural networks as classifiers. They report an average recognition rate of only 68.4% using MFCC as features and GMM as classifier, while classifying 17 natural sounds. Chu et al. [5] proposed to use a combination of MFCC and a set of features extracted using the matching pursuit (MP) algorithm to classify a set of 14 natural sounds. Using GMM as the preferred classifier, they reported an accuracy of about 83.9%. Adiloğlu et al. [6] developed a dissimilarity function to compute the distance between sounds and used a support vector machine (SVM) for classification.
In this paper we extend the work reported in [5] and classify environmental recordings using the MP algorithm. We develop different frequency scaling methods for constructing a dictionary, with the objective of capturing information which is not captured by MFCC. We achieve a maximum accuracy of about 95.6% while classifying 14 classes using a GMM classifier. The proposed algorithm constructs a dictionary using prior knowledge of the signal, with only a small increase in computational cost.
The rest of the paper is organized as follows. Section II explains the feature extraction methods, including the MP-based features. Section III introduces the linear piecewise model for constructing a better dictionary using the sub-band energy of the signal, and Section IV discusses the procedure to obtain weights for computing weighted features. Section V explains the experimental setup used. Results are analysed in Section VI, while Section VII concludes the paper.
II. FEATURE EXTRACTION
A varied set of features, such as zero crossing rate, MFCC, band energy ratio, spectral flux, statistical moments ([2], [4]) and features obtained using the MP algorithm, along with their combinations, have been used to classify natural audio sounds. The best reported accuracy [5] was obtained using a combination of MFCC and MP features. MFCC are obtained by first computing the short-time Fourier transform of the signal. The spectrum values of each frame are then grouped into bands using a set of triangular filters [7]. The bandwidths of the triangular filters are constant for center frequencies below 1 kHz and increase exponentially up to 4 kHz. Thirteen mel frequency cepstral coefficients for each frame are obtained by taking the discrete cosine transform (DCT) of the log magnitude of the filter outputs. Since the filter bandwidths of the filterbank are narrow below 1 kHz, MFCC can represent the low frequency content of the signal adequately well.
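The MFCC pipeline described above can be sketched in pure Python. This is a minimal illustration, not the paper's implementation: it uses the standard mel warping (linear below 1 kHz, logarithmic above), an assumed 26-filter bank, a direct O(n²) DFT, and a DCT-II to obtain 13 coefficients.

```python
import math

def hz_to_mel(f):
    # Standard mel mapping: roughly linear below 1 kHz, logarithmic above.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters whose centres are equally spaced on the mel scale.
    lo_m, hi_m = hz_to_mel(0.0), hz_to_mel(fs / 2.0)
    mel_pts = [lo_m + i * (hi_m - lo_m) / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(round(mel_to_hz(m) / fs * n_fft)) for m in mel_pts]
    fb = []
    for j in range(1, n_filters + 1):
        lo, c, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, c):          # rising edge of the triangle
            filt[k] = (k - lo) / (c - lo)
        for k in range(c, hi):          # falling edge of the triangle
            filt[k] = (hi - k) / (hi - c)
        fb.append(filt)
    return fb

def mfcc(frame, fs, n_filters=26, n_coeff=13):
    n = len(frame)
    half = n // 2 + 1
    # Power spectrum via a direct DFT (O(n^2); fine for a short sketch).
    power = []
    for k in range(half):
        re = sum(frame[t] * math.cos(-2 * math.pi * k * t / n) for t in range(n))
        im = sum(frame[t] * math.sin(-2 * math.pi * k * t / n) for t in range(n))
        power.append((re * re + im * im) / n)
    fb = mel_filterbank(n_filters, n, fs)
    log_e = [math.log(max(sum(f[k] * power[k] for k in range(half)), 1e-12)) for f in fb]
    # DCT-II of the log filterbank energies gives the cepstral coefficients.
    return [sum(log_e[j] * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j in range(n_filters)) for c in range(n_coeff)]
```

The frame length, filter count and DFT implementation here are illustrative choices; a real system would use an FFT and the 20 ms Hamming-windowed blocks described in Section V.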
A. Matching Pursuit for feature extraction
There are many algorithms [8] for obtaining a sparse representation of a signal, given a dictionary. Commonly used algorithms are basis pursuit (BP), orthogonal matching pursuit (OMP), iterative hard thresholding (IHT) and compressive sampling matching pursuit (CoSaMP). Sounds containing harmonic sections, such as the sound of an ambulance siren, can also be decomposed using harmonic matching pursuit [9]. Owing to the simplicity of orthogonal matching pursuit (OMP), we use this technique to compute the MP features.

Given a dictionary D of size m × n and an observed signal y, the MP algorithm gives a sparse vector x using an iterative approach. Each iteration captures the maximum possible residual energy. The number of iterations is either fixed by predetermining the sparsity, ||x||_0 = K, or by thresholding the residual energy, ||Dx − y||_2. In this paper we predetermine the value of K. Chu et al. [5] reported no significant improvement in classification performance for K > 5, because of which we fix the value of K to five.
1) Building the dictionary: A detailed review of dictionary building techniques for matching pursuit algorithms is given in [10]. Approaches include learned dictionaries such as K-SVD [11], which is known to represent signals better, but is computationally intensive and data dependent. On the other hand, analytical dictionaries such as the Fourier and wavelet dictionaries have fast implementations and analytic formulations with supporting proofs and error bounds. One such dictionary is the Gabor dictionary, whose atoms are constructed using a Gabor filter and are known to give a good representation of audio signals. Gammatone filter based dictionaries, which are modelled on human psychoacoustics, have also been reported to represent audio well [12]. We, however, use Gabor atoms to construct our dictionary, since the principles developed in this paper can easily be extended to other time-frequency atoms as well.

A real discrete Gabor time-frequency atom can be represented by

g[k] = (k_g / √s) e^{−π(k−u)²/s²} cos(2πω(k − u) + θ),   (1)

where the constants s, u, k_g and ω are the scale, shift, normalization factor and frequency values. Scale and shift values were set to s = 2^p (1 ≤ p ≤ 8) and u ∈ {0, 64, 128, 192}, respectively, to construct the dictionary. A logarithmic scale for ω = C i^{2.6} (with 1 ≤ i ≤ 35, C = 0.5 × 35^{−2.6}) was used to accommodate finer granularity below 1000 Hz [5]. This gave a dictionary with n = 8 × 4 × 35 = 1120 atoms. The phase θ was set to zero, since it has been reported to have no significant impact on the classification result.
The dictionary constructed using Gabor atoms was used by the OMP algorithm to choose the five atoms which best correlate with the signal. The mean (µ) and the standard deviation (σ) of the frequencies (ω_s) and scales (s_s) of the selected atoms were used as MP features. The feature set for classification is as follows:

[MFCC, µ(ω_s), σ(ω_s), µ(s_s), σ(s_s)].

We refer to the above set of features as unweighted features. The MP features depend on the dictionary D, whose atoms are constructed using the Gabor function. The following section describes a method to construct relevant dictionaries using the energy distribution of the signal.
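The dictionary of Eq. (1) with the parameter grid above can be sketched as follows. The atom length of 256 samples is an assumption for illustration (it matches the largest shift plus one scale step); k_g is realized by normalizing each atom to unit l2 norm.

```python
import math

def gabor_atom(n, s, u, omega, theta=0.0):
    # Real Gabor atom of Eq. (1); the factor k_g normalizes to unit l2 norm.
    g = [math.exp(-math.pi * (k - u) ** 2 / s ** 2) *
         math.cos(2 * math.pi * omega * (k - u) + theta) for k in range(n)]
    norm = math.sqrt(sum(v * v for v in g)) or 1.0
    return [v / norm for v in g]

def gabor_dictionary(n=256):
    # Parameter grid from the paper: s = 2^p (1 <= p <= 8),
    # u in {0, 64, 128, 192}, omega = C * i^2.6 with C = 0.5 * 35^-2.6,
    # 1 <= i <= 35, theta = 0  ->  8 * 4 * 35 = 1120 atoms.
    scales = [2 ** p for p in range(1, 9)]
    shifts = [0, 64, 128, 192]
    C = 0.5 * 35 ** -2.6
    freqs = [C * i ** 2.6 for i in range(1, 36)]
    return [gabor_atom(n, s, u, w) for s in scales for u in shifts for w in freqs]

D = gabor_dictionary()
```

With these ranges the dictionary has exactly 1120 unit-norm atoms, matching the count stated in the text.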
III. LINEAR PIECEWISE SCALING
In this section we introduce the piecewise method of frequency scaling to build a dictionary using a priori knowledge of the signal. Here, the sub-band energy ratio, which is the normalized energy distribution over sub-bands, is used to determine the number of atoms to be allocated per frequency band. Since MFCC represent the signal well in the low frequency region (< 1 kHz), we pass the signal through a high pass filter having a cut-off frequency of 1 kHz. This ensures that the MP features do not capture information which has already been captured by the MFCC. The j-th sub-band energy, E_sb(j), is obtained by

E_sb(j) = Σ_{P ∈ sb(j)} |X(P)|²,   j = 1, 2, . . . , N,

where X(P) is the discrete Fourier transform (DFT) of the signal and N is the total number of sub-bands. We then normalize the energy to obtain a distribution function as follows:

E_sb^n(j) = E_sb(j) / Σ_{i=1}^{N} E_sb(i),   j = 1, 2, . . . , N.

The product E_sb^n(j) × n_f, rounded off to the nearest integer, decides the number of frequency elements to be allocated to the j-th sub-band and is denoted by n_sb(j):

n_sb(j) = round(E_sb^n(j) × n_f),   j = 1, 2, . . . , N,   (2)

where round(·) denotes the rounding operator and n_f is the total number of frequency elements. In our experiment, we set n_f = 35.
A linear piecewise model for the j-th sub-band is then constructed by dividing the straight line joining the frequency boundaries of the sub-band into n_sb(j) equi-spaced points. The corresponding frequency points are used to construct the dictionary D using Gabor atoms (1) for OMP. Fig. 1 shows the frequency allocation for ocean and casino sounds using this algorithm. An ocean sound has higher energy in the low frequency region, because of which the algorithm adaptively allocates more atoms to the lower frequency band. Similarly, a higher number of atoms was allocated to the high frequency region in the case of a casino sound, due to its higher energy in the high frequency band. In one such example, 19 atoms were allocated for an ocean sound in the frequency range ω < 0.1π, as compared to 6 for a casino sound (Fig. 1).
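The energy-driven allocation of Eq. (2) and the linear piecewise placement of frequency points can be sketched as below. The equal-width band split and the normalized band edges are illustrative assumptions; the paper does not specify the exact sub-band boundaries.

```python
def allocate_atoms(spectrum, n_subbands, n_f=35):
    # spectrum: |X(P)|^2 values (above the 1 kHz high-pass cut-off).
    # Split into n_subbands equal-width bands and compute E_sb(j).
    m = len(spectrum)
    edges = [round(j * m / n_subbands) for j in range(n_subbands + 1)]
    e_sb = [sum(spectrum[edges[j]:edges[j + 1]]) for j in range(n_subbands)]
    total = sum(e_sb) or 1.0
    # Eq. (2): atoms per band proportional to the normalized band energy.
    return [round(n_f * e / total) for e in e_sb]

def band_frequencies(n_sb, band_edges):
    # Linear piecewise model: n_sb(j) equi-spaced frequency points
    # strictly inside band j, whose boundaries are band_edges[j..j+1].
    freqs = []
    for j, n in enumerate(n_sb):
        lo, hi = band_edges[j], band_edges[j + 1]
        freqs += [lo + (i + 1) * (hi - lo) / (n + 1) for i in range(n)]
    return freqs
```

For a signal whose energy is concentrated in its lowest band, all n_f frequency points land inside that band, which is exactly the adaptive behaviour described for the ocean sound.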
IV. COMPUTING WEIGHTED FEATURES
After obtaining the dictionary D, the OMP algorithm is used to find atoms which correlate well with the signal. Each of the selected atoms has a different value of the correlation coefficient with respect to the signal. To capture this variation, we propose to use the weighted mean and deviation.
If d_s are the atoms selected by orthogonal matching pursuit and r_i is the residue after the i-th iteration of the algorithm, the inner products x_s(i) = d_s(i)^T r_i, for i = 1, 2, . . . , K, are the non-zero components of the sparse vector x and are used as weights while computing the weighted mean (µ_w):

µ_w(ω_s) = [ Σ_{i=1}^{K} |x_s(i)| × ω_s(i) ] / [ Σ_{i=1}^{K} |x_s(i)| ].   (3)

The unbiased estimator for the weighted standard deviation is computed by

σ_w(ω_s) = sqrt( [ V_1 / (V_1² − V_2) ] × Σ_{i=1}^{K} |x_s(i)| × (ω_s(i) − µ_w(ω_s))² ),   (4)

where V_1 = Σ_{i=1}^{K} |x_s(i)| and V_2 = Σ_{i=1}^{K} x_s²(i).
We similarly compute the weighted mean and standard deviation of the scale parameter. The new set of features is:

[MFCC, µ_w(ω_s), σ_w(ω_s), µ_w(s_s), σ_w(s_s)].
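Equations (3) and (4) translate directly into code. The sketch below follows the formulas as written, with the sparse coefficients x_s supplying the weights |x_s(i)|.

```python
import math

def weighted_mean(x_s, vals):
    # Eq. (3): weights are |x_s(i)|, the magnitudes of the sparse coefficients.
    w = [abs(x) for x in x_s]
    return sum(wi * v for wi, v in zip(w, vals)) / sum(w)

def weighted_std(x_s, vals):
    # Eq. (4): unbiased weighted standard deviation with
    # V1 = sum |x_s(i)| and V2 = sum x_s(i)^2.
    w = [abs(x) for x in x_s]
    v1 = sum(w)
    v2 = sum(wi * wi for wi in w)
    mu = weighted_mean(x_s, vals)
    return math.sqrt(v1 / (v1 * v1 - v2) *
                     sum(wi * (v - mu) ** 2 for wi, v in zip(w, vals)))
```

With equal weights the formulas reduce to the ordinary sample mean and (unbiased) standard deviation, which is a quick sanity check on the implementation.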
The summary of the algorithm is detailed below:

Algorithm (Feature Extraction)
INPUT: audio segment y[n], number of sub-bands (N), total number of frequency elements (n_f), scale range ({s}), shift range ({u})
step 1: Find the support ω.
  Pass the signal y[n] through a high pass filter having a cut-off of 1 kHz.
  Divide the spectrum Y(e^{jω}) into N sub-bands.
  Compute the energy in each sub-band: E_sb(j) = Σ_{P ∈ sb(j)} |Y(P)|².
  Find the number of atoms to be allocated to each sub-band, n_sb (Eqn. (2)).
  Find the plausible set of frequency elements ω = {ω_1, ω_2, . . . , ω_N}.
step 2: Extract feature vectors using OMP.
  Construct a Gabor dictionary D using ω as frequency elements.
  Run OMP using D as dictionary.
  Init: ω_s = ∅, s_s = ∅, x_s = ∅, D_0 = ∅, residual r_0 = y and counter c = 1.
  WHILE: c ≤ 5
    Find the column d_k of D which correlates the most with the residue:
      k = arg max_j |⟨r_{c−1}, d_j⟩|
      D_c = D_{c−1} ∪ {d_k}
      ω_s = ω_s ∪ freq{d_k}
      s_s = s_s ∪ scale{d_k}
    Find the best coefficients:
      x_s = arg min_θ ||y − D_c θ||_2
    Update the residual:
      r_c = y − D_c x_s
    c := c + 1
  END WHILE
  Find the weighted mean (3) and standard deviation (4) of ω_s and s_s.
OUTPUT: Weighted features [µ_w(ω_s), σ_w(ω_s), µ_w(s_s), σ_w(s_s)]
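The OMP loop in step 2 can be sketched in pure Python. This is an illustrative implementation, not the paper's code: the least-squares step solves the normal equations of the chosen atoms with a small Gaussian-elimination helper, which is fine for K = 5 but not how a production OMP would be written.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for the small K x K system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def omp(D, y, K=5):
    # D: list of unit-norm atoms; y: signal; K: sparsity (five in the paper).
    chosen, residual = [], y[:]
    for _ in range(K):
        # Atom most correlated with the current residual.
        k = max(range(len(D)),
                key=lambda j: abs(sum(a * r for a, r in zip(D[j], residual))))
        if k not in chosen:
            chosen.append(k)
        # Least-squares fit of y on the chosen atoms (normal equations).
        G = [[sum(a * b for a, b in zip(D[i], D[j])) for j in chosen] for i in chosen]
        rhs = [sum(a * b for a, b in zip(D[i], y)) for i in chosen]
        coeffs = solve(G, rhs)
        approx = [sum(c * D[i][t] for c, i in zip(coeffs, chosen))
                  for t in range(len(y))]
        residual = [a - b for a, b in zip(y, approx)]
    return chosen, coeffs
```

The indices in `chosen` map back to the frequency and scale of each selected atom (ω_s and s_s), and `coeffs` supplies the weights x_s used in Eqs. (3) and (4).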
Fig. 1. Frequency allocation using the piece-wise model for casino and ocean sounds, and its comparison with the low frequency (ω = C × i^{2.6}) and high frequency (ω = 0.5 × 1000^{−1/i}) emphasized scaling models (Section VI). [Figure: frequency ω_i (× π) versus index i, 1 ≤ i ≤ 35.]
V. EXPERIMENTS
A. Dataset
The results presented in this paper are based on a set of data collected from the online sound repository freesound.org [13]. The collection method is similar to that mentioned in [5]. In order to compare our results with those presented in [5], we used the data set collected from [13] and applied it to both our algorithm and the one presented in [5]. A total of 14 types of audio recordings, namely, Nature in daytime, Inside Vehicle, Restaurant, Casino, Nature at Night, Bell, Playgrounds, Street Traffic, Thundering, Train, Rain, Stream, Ocean and Street with Ambulance, representing different environments, were collected. Each of these environments is henceforth referred to as a class. No preprocessing was done on the collected data.

B. Method
For each of the 14 classes, a minimum of 4 unique recordings was collected. All collected data had two-channel (stereo) recordings, of which only one channel was used, to avoid duplication. All the files were collected as uncompressed WAV files with a sampling rate of 44.1 kHz, but of varying durations (from 30 secs to 8 mins). We divide each recording into segments of 4 seconds; 75% of all the collected segments were used for training and 25% for testing. Features for both training and testing were computed on these segments. MFCC were computed by dividing the 4 sec segments into blocks of 20 ms using a Hamming window with 50% overlap. The same blocks were used to compute the MP features.
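The framing arithmetic above (4 s segments, 20 ms blocks, 50% overlap at 44.1 kHz) works out as follows; this small helper only computes the block count, not the windowing itself.

```python
def frame_count(duration_s=4.0, fs=44100, win_ms=20.0, overlap=0.5):
    # 4 s segments cut into 20 ms Hamming-windowed blocks with 50% overlap.
    n = int(duration_s * fs)          # samples per segment (176400)
    win = int(win_ms * fs / 1000.0)   # samples per block (882 at 44.1 kHz)
    hop = int(win * (1.0 - overlap))  # 441-sample hop for 50% overlap
    return (n - win) // hop + 1
```

With the paper's settings each 4-second segment yields 399 blocks, each of which produces one MFCC vector.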
To classify the segments, we build a Gaussian mixture model (GMM) for each class, described by

p = Σ_{k=1}^{N_g} α_k N(µ_k, Σ_k),

where N(·) is the normal distribution function, α_k is the weight, and µ_k and Σ_k are the mean and variance of the k-th mixture. N_g is the number of mixtures in the GMM. α_k, µ_k and Σ_k are obtained from the features extracted from the training data, using the standard expectation maximization (EM) algorithm.
To classify a segment s, the posterior probability p(s|µ_i, Σ_i) is computed for each frame of the segment. A segment is assigned to the k-th class if the sum of the posterior probabilities of all its frames for the k-th class is maximum. If there are N_s frames in a segment, the segment is allocated to class c_k, where

k = arg max_i Σ_{j=1}^{N_s} log{p(s_j | µ_i, Σ_i)}.
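The decision rule above can be sketched as follows. For brevity this sketch assumes diagonal covariances (the paper does not state the covariance structure) and represents each class GMM as a list of (α_k, µ_k, σ²_k) triples.

```python
import math

def log_gauss(x, mu, var):
    # Log density of a diagonal-covariance Gaussian.
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))

def log_gmm(x, gmm):
    # gmm: list of (alpha_k, mu_k, var_k) mixtures; log p(x) via log-sum-exp.
    terms = [math.log(a) + log_gauss(x, mu, var) for a, mu, var in gmm]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def classify_segment(frames, class_gmms):
    # Assign the segment to the class maximizing the summed frame log-likelihoods.
    scores = [sum(log_gmm(f, g) for f in frames) for g in class_gmms]
    return max(range(len(scores)), key=lambda i: scores[i])
```

Summing log-likelihoods over frames rather than multiplying raw probabilities avoids numerical underflow over the hundreds of frames in a 4-second segment.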
Different values of N_g were tried in [5]; the best results were obtained using a GMM having 5 mixture components. Hence, we construct 5-mixture GMMs for all classes in our experiments. A confusion matrix was constructed, and accuracy values were computed as the ratio of the sum of the diagonal values to the total sum of all elements in the matrix. All results reported in this paper are the average of the accuracy values obtained using ten-fold cross validation.
VI. RESULTS AND ANALYSIS
We implemented the algorithm detailed in [5] on our dataset and obtained an accuracy of 83.2% while classifying the audio segments without preprocessing them. We use this as a baseline to compare the performance of our algorithm. On studying the periodograms of environmental sounds (Fig. 2), we found that certain signals, such as audio recordings from a casino, have substantial energy in the higher frequency region. To understand the impact of high frequencies on the classification, we constructed a dictionary using a high frequency emphasized scaling function, ω = 0.5 × a^{−1/i}, as against the low frequency emphasized scaling function ω = C × i^{2.6} advocated in [5]. A value of a = 1000 was chosen for a smoother ascent towards high frequencies while still distributing enough atoms in the higher part of the frequency spectrum. The variation of ω as a function of i for a = 1000 is shown in Fig. 1. We obtain an accuracy of 84.5% using weighted features and the high frequency emphasized scaling function, an improvement of 1.56% over the low frequency emphasized scaling function. The increase in accuracy is due to a better representation of the high frequency region in the MP features; we reiterate that the low frequency region is already adequately represented by the MFCC. A class-wise accuracy comparison of the low frequency and high frequency emphasized scaling functions using unweighted features is shown in Fig. 3. The high frequency emphasized dictionary outperforms its low frequency counterpart while classifying sounds from nature at daytime, inside vehicle, casino and nature at night, whose periodograms showed high energy in the high frequency region, while the low frequency emphasized dictionary performed better in classifying sounds from ocean, rain and train, which have higher energy content in the lower frequency range.
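The contrast between the two scaling functions can be made concrete by evaluating both over the 35 indices and counting how many frequency points each places below ω = 0.1 (× π), the same region discussed for Fig. 1.

```python
def low_emphasis(i, C=0.5 * 35 ** -2.6):
    # omega = C * i^2.6: dense atom placement at low frequencies [5].
    return C * i ** 2.6

def high_emphasis(i, a=1000.0):
    # omega = 0.5 * a^(-1/i): a smooth ascent that still places many
    # atoms in the upper part of the spectrum (a = 1000 here).
    return 0.5 * a ** (-1.0 / i)

low = [low_emphasis(i) for i in range(1, 36)]
high = [high_emphasis(i) for i in range(1, 36)]
# Count atoms placed below omega = 0.1 (x pi) by each scaling.
low_lo = sum(w < 0.1 for w in low)
high_lo = sum(w < 0.1 for w in high)
```

The low-emphasis scaling puts 18 of its 35 points below ω = 0.1, against only 4 for the high-emphasis scaling, which is what frees atoms to cover the casino-like high-energy upper band.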
Fig. 2. Variation in the periodograms of environmental sounds: (a) casino recording, (b) ambulance recording. [Figure: power/frequency (dB/rad/sample) versus ω (× π rad/sample).]
A. Effect of using sub-band information
The results obtained using the high frequency emphasized dictionary showed the need to construct signal-specific dictionaries.
References (partial; numbering follows the in-text citations)
[5] "Environmental Sound Recognition With Time–Frequency Audio Features."
[8] "Computational Methods for Sparse Solution of Linear Inverse Problems."
[9] "Harmonic decomposition of audio signals with matching pursuit."
[10] "Dictionaries for Sparse Representation Modeling."