
HAL Id: hal-01456201
https://hal.inria.fr/hal-01456201
Submitted on 4 Feb 2017
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Robust features for environmental sound classification
Sunit Sivasankaran, K.M.M. Prabhu
To cite this version:
Sunit Sivasankaran, K.M.M. Prabhu. Robust features for environmental sound classification. 2013 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Jan 2013, Bangalore, India. pp.1-6, 10.1109/CONECCT.2013.6469297. hal-01456201

Robust features for Environmental Sound
Classification
Sunit Sivasankaran, K.M.M. Prabhu
Department of Electrical Engineering
Indian Institute Of Technology Madras, Chennai-36, India
Email: sunit.sivasankaran@gmail.com, prabhu@ee.iitm.ac.in
Abstract—In this paper we describe algorithms to classify environmental sounds with the aim of providing contextual information to devices such as hearing aids for optimum performance. We use signal sub-band energy to construct a signal-dependent dictionary and matching pursuit algorithms to obtain a sparse representation of a signal. The coefficients of the sparse vector are used as weights to compute weighted features. These features, along with mel frequency cepstral coefficients (MFCC), are used as feature vectors for classification. Experimental results show that the proposed method gives a maximum accuracy of 95.6% while classifying 14 categories of environmental sound using a Gaussian mixture model (GMM).
I. INTRODUCTION
To a careful listener, an audio recording is a rich source
of information giving clues such as location, direction of
vehicular movement, environmental information, speed
of wind and so on. It is, therefore, only natural to
ask if we could make machines imitate human listening
capabilities. One step in this direction is to train a
machine to automatically classify the environment based
on a set of features extracted from an audio sample.
Environmental sound classification has a variety of applications. Modern hearing aids [1] consist of several programs which account for the reverberation model of environments such as meeting rooms, auditoriums and other noisy spaces. Automatic recognition of the surrounding environment allows these devices to switch between programs and work with minimum user intervention. Other applications include video scene analysis, which is generally achieved using computationally heavy video processing techniques. Liu et al. [2] have proposed a set of low-level features, such as frequency centroid, frequency bandwidth and sub-band energy ratio, to characterize the audio clip of a video scene; these features were shown to have good scene discrimination capabilities.
Another interesting application of the environmental classification problem is in creating an automatic diary [3] by processing audio recorded over a certain period. Users are automatically presented with the locations they have visited, based on classifying the environments using features extracted from the audio samples.
The problem of environment classification is a sub-
problem of a bigger area of research called computational
auditory scene recognition (CASR) [4], where the focus
is on recognizing the context. Applications of CASR include providing context to devices such as hearing aids and mobile phones, enabling them to provide better service.
Previous attempts to classify environments have given rise to new sets of features. Peltonen et al. [4] used mel frequency cepstral coefficients (MFCC) as features, and Gaussian mixture models (GMM) and neural networks as classifiers. They report an average recognition rate of only 68.4% using MFCC as features and GMM as classifier, while classifying 17 natural sounds. Chu et al. [5] proposed to use a combination of MFCC and a set of features extracted using the matching pursuit (MP) algorithm to classify a set of 14 natural sounds. Using GMM as the preferred classifier, they reported an accuracy of about 83.9%. Adiloğlu et al. [6] developed a dissimilarity function to compute the distance between sounds and used a support vector machine (SVM) for classification.
In this paper we extend the work reported in [5] and classify environmental recordings using the MP algorithm. We develop different frequency scaling methods for constructing a dictionary, with the objective of capturing information which is not captured by MFCC. We achieve a maximum accuracy of about 95.6% while classifying 14 classes using a GMM classifier. The proposed algorithm constructs a dictionary using prior knowledge of the signal, with only a small increase in computational cost.
The rest of the paper is organized as follows. Section II explains the feature extraction methods, including the MP-based features. Section III introduces the linear piecewise model for constructing a better dictionary using the sub-band energy of the signal, and Section IV discusses the procedure to obtain weights for computing weighted features. Section V explains the experimental setup used. Results are analysed in Section VI, while Section VII concludes the paper.
II. FEATURE EXTRACTION
A varied set of features, such as zero crossing rate, MFCC, band energy ratio, spectral flux, statistical moments ([2], [4]) and features obtained using the MP algorithm, along with their combinations, have been used to classify natural audio sounds. The best reported accuracy [5] was obtained using a combination of MFCC and MP features. MFCC are obtained by first computing the short-time Fourier transform of the signal. The spectrum values of each frame are then grouped into bands using a set of triangular filters [7]. The bandwidths of the triangular filters are constant for center frequencies below 1 kHz and increase exponentially up to 4 kHz. Thirteen mel frequency cepstral coefficients for each frame are obtained by taking the discrete cosine transform (DCT) of the log magnitude of the filter outputs. Since the filter bandwidths of the filterbank are narrow below 1 kHz, MFCC can represent the low frequency content of the signal adequately well.
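The MFCC pipeline described above can be sketched in pure Python. This is a minimal illustration, not the paper's implementation: it uses the standard mel warping (linear below 1 kHz, logarithmic above), an assumed 26-filter bank, a direct O(n²) DFT, and a DCT-II to obtain 13 coefficients.

```python
import math

def hz_to_mel(f):
    # Standard mel mapping: roughly linear below 1 kHz, logarithmic above.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters whose centres are equally spaced on the mel scale.
    lo_m, hi_m = hz_to_mel(0.0), hz_to_mel(fs / 2.0)
    mel_pts = [lo_m + i * (hi_m - lo_m) / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(round(mel_to_hz(m) / fs * n_fft)) for m in mel_pts]
    fb = []
    for j in range(1, n_filters + 1):
        lo, c, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, c):          # rising edge of the triangle
            filt[k] = (k - lo) / (c - lo)
        for k in range(c, hi):          # falling edge of the triangle
            filt[k] = (hi - k) / (hi - c)
        fb.append(filt)
    return fb

def mfcc(frame, fs, n_filters=26, n_coeff=13):
    n = len(frame)
    half = n // 2 + 1
    # Power spectrum via a direct DFT (O(n^2); fine for a short sketch).
    power = []
    for k in range(half):
        re = sum(frame[t] * math.cos(-2 * math.pi * k * t / n) for t in range(n))
        im = sum(frame[t] * math.sin(-2 * math.pi * k * t / n) for t in range(n))
        power.append((re * re + im * im) / n)
    fb = mel_filterbank(n_filters, n, fs)
    log_e = [math.log(max(sum(f[k] * power[k] for k in range(half)), 1e-12)) for f in fb]
    # DCT-II of the log filterbank energies gives the cepstral coefficients.
    return [sum(log_e[j] * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j in range(n_filters)) for c in range(n_coeff)]
```

The frame length, filter count and DFT implementation here are illustrative choices; a real system would use an FFT and the 20 ms Hamming-windowed blocks described in Section V.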
A. Matching Pursuit for feature extraction
There are many algorithms [8] for obtaining a sparse representation of a signal, given a dictionary. Commonly used algorithms are basis pursuit (BP), orthogonal matching pursuit (OMP), iterative hard thresholding (IHT) and compressive sampling matching pursuit (CoSaMP). Sounds containing harmonic sections, such as the sound of an ambulance siren, can also be decomposed using harmonic matching pursuit [9]. Owing to the simplicity of orthogonal matching pursuit (OMP), we use this technique to compute the MP features.

Given a dictionary D of size m × n and an observed signal y, the MP algorithm gives a sparse vector x using an iterative approach. Each iteration captures the maximum possible residual energy. The number of iterations is either fixed by predetermining the sparsity, ||x||_0 = K, or by thresholding the residual energy, ||Dx − y||_2. In this paper we predetermine the value of K. Chu et al. [5] reported no significant improvement in classification performance for K > 5, because of which we fix the value of K to five.
1) Building the dictionary: A detailed review of dictionary building techniques for matching pursuit algorithms is given in [10]. Approaches include learned dictionaries such as K-SVD [11], which is known to represent signals better, but is computationally intensive and data dependent. On the other hand, analytical dictionaries such as the Fourier and wavelet dictionaries have fast implementations and analytic formulations with supporting proofs and error bounds. One such dictionary is the Gabor dictionary, whose atoms are constructed using a Gabor filter and are known to give a good representation of audio signals. Gammatone filter based dictionaries, which are modelled on human psychoacoustics, have also been reported to represent audio well [12]. We, however, use Gabor atoms to construct our dictionary, since the principles developed in this paper can easily be extended to other time-frequency atoms as well.

A real discrete Gabor time-frequency atom can be represented by

g[k] = (k_g / √s) e^{−π(k−u)²/s²} cos(2πω(k − u) + θ),   (1)

where the constants s, u, k_g and ω are the scale, shift, normalization factor and frequency values. Scale and shift values were set to s = 2^p (1 ≤ p ≤ 8) and u ∈ {0, 64, 128, 192}, respectively, to construct the dictionary. A logarithmic scale for ω = C i^{2.6} (with 1 ≤ i ≤ 35, C = 0.5 × 35^{−2.6}) was used to accommodate finer granularity below 1000 Hz [5]. This gave a dictionary with n = 8 × 4 × 35 = 1120 atoms. The phase θ was set to zero, since it has been reported to have no significant impact on the classification result.
The dictionary constructed using Gabor atoms was used by the OMP algorithm to choose the five atoms which best correlate with the signal. The mean (µ) and the standard deviation (σ) of the frequencies (ω_s) and scales (s_s) of the selected atoms were used as MP features. The feature set for classification is as follows:

[MFCC, µ(ω_s), σ(ω_s), µ(s_s), σ(s_s)].

We refer to the above set of features as unweighted features. The MP features depend on the dictionary D, whose atoms are constructed using the Gabor function. The following section describes a method to construct relevant dictionaries using the energy distribution of the signal.
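The dictionary of Eq. (1) with the parameter grid above can be sketched as follows. The atom length of 256 samples is an assumption for illustration (it matches the largest shift plus one scale step); k_g is realized by normalizing each atom to unit l2 norm.

```python
import math

def gabor_atom(n, s, u, omega, theta=0.0):
    # Real Gabor atom of Eq. (1); the factor k_g normalizes to unit l2 norm.
    g = [math.exp(-math.pi * (k - u) ** 2 / s ** 2) *
         math.cos(2 * math.pi * omega * (k - u) + theta) for k in range(n)]
    norm = math.sqrt(sum(v * v for v in g)) or 1.0
    return [v / norm for v in g]

def gabor_dictionary(n=256):
    # Parameter grid from the paper: s = 2^p (1 <= p <= 8),
    # u in {0, 64, 128, 192}, omega = C * i^2.6 with C = 0.5 * 35^-2.6,
    # 1 <= i <= 35, theta = 0  ->  8 * 4 * 35 = 1120 atoms.
    scales = [2 ** p for p in range(1, 9)]
    shifts = [0, 64, 128, 192]
    C = 0.5 * 35 ** -2.6
    freqs = [C * i ** 2.6 for i in range(1, 36)]
    return [gabor_atom(n, s, u, w) for s in scales for u in shifts for w in freqs]

D = gabor_dictionary()
```

With these ranges the dictionary has exactly 1120 unit-norm atoms, matching the count stated in the text.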
III. LINEAR PIECEWISE SCALING
In this section we introduce the piecewise method of frequency scaling to build a dictionary using a priori knowledge of the signal. Here, the sub-band energy ratio, which is the normalized energy distribution over sub-bands, is used to determine the number of atoms to be allocated per frequency band. Since MFCC represent the signal well in the low frequency region (< 1 kHz), we pass the signal through a high pass filter having a cut-off frequency of 1 kHz. This ensures that the MP features do not capture information which has already been captured by the MFCC. The j-th sub-band energy, E_sb(j), is obtained by

E_sb(j) = Σ_{P ∈ sb(j)} |X(P)|²,   j = 1, 2, . . . , N,

where X(P) is the discrete Fourier transform (DFT) of the signal and N is the total number of sub-bands. We then normalize the energy to obtain a distribution function as follows:

E_sb^n(j) = E_sb(j) / Σ_{i=1}^{N} E_sb(i),   j = 1, 2, . . . , N.

The product E_sb^n(j) × n_f, rounded off to the nearest integer, decides the number of frequency elements to be allocated to the j-th sub-band and is denoted by n_sb(j):

n_sb(j) = round(E_sb^n(j) × n_f),   j = 1, 2, . . . , N,   (2)

where round(·) denotes the rounding operator and n_f is the total number of frequency elements. In our experiment, we set n_f = 35.
A linear piecewise model for the j-th sub-band is then constructed by dividing the straight line joining the frequency boundaries of the sub-band into n_sb(j) equi-spaced points. The corresponding frequency points are used to construct the dictionary D using Gabor atoms (1) for OMP. Fig. 1 shows the frequency allocation for ocean and casino sounds using this algorithm. An ocean sound has higher energy in the low frequency region, because of which the algorithm adaptively allocates more atoms to the lower frequency band. Similarly, a higher number of atoms was allocated to the high frequency region in the case of a casino sound, due to its higher energy in the high frequency band. In one such example, 19 atoms were allocated for an ocean sound in the frequency range ω < 0.1π, as compared to 6 for a casino sound (Fig. 1).
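The energy-driven allocation of Eq. (2) and the linear piecewise placement of frequency points can be sketched as below. The equal-width band split and the normalized band edges are illustrative assumptions; the paper does not specify the exact sub-band boundaries.

```python
def allocate_atoms(spectrum, n_subbands, n_f=35):
    # spectrum: |X(P)|^2 values (above the 1 kHz high-pass cut-off).
    # Split into n_subbands equal-width bands and compute E_sb(j).
    m = len(spectrum)
    edges = [round(j * m / n_subbands) for j in range(n_subbands + 1)]
    e_sb = [sum(spectrum[edges[j]:edges[j + 1]]) for j in range(n_subbands)]
    total = sum(e_sb) or 1.0
    # Eq. (2): atoms per band proportional to the normalized band energy.
    return [round(n_f * e / total) for e in e_sb]

def band_frequencies(n_sb, band_edges):
    # Linear piecewise model: n_sb(j) equi-spaced frequency points
    # strictly inside band j, whose boundaries are band_edges[j..j+1].
    freqs = []
    for j, n in enumerate(n_sb):
        lo, hi = band_edges[j], band_edges[j + 1]
        freqs += [lo + (i + 1) * (hi - lo) / (n + 1) for i in range(n)]
    return freqs
```

For a signal whose energy is concentrated in its lowest band, all n_f frequency points land inside that band, which is exactly the adaptive behaviour described for the ocean sound.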
IV. COMPUTING WEIGHTED FEATURES
After obtaining the dictionary D, the OMP algorithm is used to find atoms which correlate well with the signal. Each of the selected atoms has a different value of the correlation coefficient with respect to the signal. To capture this variation, we propose to use the weighted mean and deviation.
If d_s are the atoms selected by orthogonal matching pursuit and r_i is the residue after the i-th iteration of the algorithm, the inner products x_s(i) = d_s(i)^T r_i, for i = 1, 2, . . . , K, are the non-zero components of the sparse vector x and are used as weights while computing the weighted mean (µ_w):

µ_w(ω_s) = [ Σ_{i=1}^{K} |x_s(i)| × ω_s(i) ] / [ Σ_{i=1}^{K} |x_s(i)| ].   (3)

The unbiased estimator for the weighted standard deviation is computed by

σ_w(ω_s) = sqrt( [ V_1 / (V_1² − V_2) ] × Σ_{i=1}^{K} |x_s(i)| × (ω_s(i) − µ_w(ω_s))² ),   (4)

where V_1 = Σ_{i=1}^{K} |x_s(i)| and V_2 = Σ_{i=1}^{K} x_s²(i).
We similarly compute the weighted mean and standard deviation of the scale parameter. The new set of features is:

[MFCC, µ_w(ω_s), σ_w(ω_s), µ_w(s_s), σ_w(s_s)].
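Equations (3) and (4) translate directly into code. The sketch below follows the formulas as written, with the sparse coefficients x_s supplying the weights |x_s(i)|.

```python
import math

def weighted_mean(x_s, vals):
    # Eq. (3): weights are |x_s(i)|, the magnitudes of the sparse coefficients.
    w = [abs(x) for x in x_s]
    return sum(wi * v for wi, v in zip(w, vals)) / sum(w)

def weighted_std(x_s, vals):
    # Eq. (4): unbiased weighted standard deviation with
    # V1 = sum |x_s(i)| and V2 = sum x_s(i)^2.
    w = [abs(x) for x in x_s]
    v1 = sum(w)
    v2 = sum(wi * wi for wi in w)
    mu = weighted_mean(x_s, vals)
    return math.sqrt(v1 / (v1 * v1 - v2) *
                     sum(wi * (v - mu) ** 2 for wi, v in zip(w, vals)))
```

With equal weights the formulas reduce to the ordinary sample mean and (unbiased) standard deviation, which is a quick sanity check on the implementation.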
The summary of the algorithm is detailed below:

Algorithm (Feature Extraction)
INPUT: audio segment y[n], number of sub-bands (N), total number of frequency elements (n_f), scale range ({s}), shift range ({u})
step 1: Find the support ω.
  Pass the signal y[n] through a high pass filter having a cut-off of 1 kHz.
  Divide the spectrum Y(e^{jω}) into N sub-bands.
  Compute the energy in each sub-band: E_sb(j) = Σ_{P ∈ sb(j)} |Y(P)|².
  Find the number of atoms to be allocated to each sub-band, n_sb (Eqn. (2)).
  Find the plausible set of frequency elements ω = {ω_1, ω_2, . . . , ω_N}.
step 2: Extract feature vectors using OMP.
  Construct a Gabor dictionary D using ω as frequency elements.
  Run OMP using D as dictionary.
  Init: ω_s = ∅, s_s = ∅, x_s = ∅, D_0 = ∅, residual r_0 = y and counter c = 1.
  WHILE: c ≤ 5
    Find the column d_k of D which correlates the most with the residue:
      k = arg max_j |⟨r_{c−1}, d_j⟩|
      D_c = D_{c−1} ∪ {d_k}
      ω_s = ω_s ∪ freq{d_k}
      s_s = s_s ∪ scale{d_k}
    Find the best coefficients:
      x_s = arg min_θ ||y − D_c θ||_2
    Update the residual:
      r_c = y − D_c x_s
    c := c + 1
  END WHILE
  Find the weighted mean (3) and standard deviation (4) of ω_s and s_s.
OUTPUT: Weighted features [µ_w(ω_s), σ_w(ω_s), µ_w(s_s), σ_w(s_s)]
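The OMP loop in step 2 can be sketched in pure Python. This is an illustrative implementation, not the paper's code: the least-squares step solves the normal equations of the chosen atoms with a small Gaussian-elimination helper, which is fine for K = 5 but not how a production OMP would be written.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for the small K x K system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def omp(D, y, K=5):
    # D: list of unit-norm atoms; y: signal; K: sparsity (five in the paper).
    chosen, residual = [], y[:]
    for _ in range(K):
        # Atom most correlated with the current residual.
        k = max(range(len(D)),
                key=lambda j: abs(sum(a * r for a, r in zip(D[j], residual))))
        if k not in chosen:
            chosen.append(k)
        # Least-squares fit of y on the chosen atoms (normal equations).
        G = [[sum(a * b for a, b in zip(D[i], D[j])) for j in chosen] for i in chosen]
        rhs = [sum(a * b for a, b in zip(D[i], y)) for i in chosen]
        coeffs = solve(G, rhs)
        approx = [sum(c * D[i][t] for c, i in zip(coeffs, chosen))
                  for t in range(len(y))]
        residual = [a - b for a, b in zip(y, approx)]
    return chosen, coeffs
```

The indices in `chosen` map back to the frequency and scale of each selected atom (ω_s and s_s), and `coeffs` supplies the weights x_s used in Eqs. (3) and (4).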
Fig. 1. Frequency allocation using the piece-wise model for casino and ocean sounds, and its comparison with the low frequency (ω = C × i^{2.6}) and high frequency (ω = 0.5 × 1000^{−1/i}) emphasized scaling models (Section VI). [Figure: frequency ω_i (× π) versus index i, 1 ≤ i ≤ 35.]
V. EXPERIMENTS
A. Dataset
The results presented in this paper are based on a set of data collected from the online sound repository freesound.org [13]. The collection method is similar to that mentioned in [5]. In order to compare our results with those presented in [5], we used the data set collected from [13] and applied it to both our algorithm and the one presented in [5]. A total of 14 types of audio recordings, namely, Nature in daytime, Inside Vehicle, Restaurant, Casino, Nature at Night, Bell, Playgrounds, Street Traffic, Thundering, Train, Rain, Stream, Ocean and Street with Ambulance, representing different environments, were collected. Each of these environments is henceforth referred to as a class. No preprocessing was done on the collected data.

B. Method
For each of the 14 classes, a minimum of 4 unique recordings was collected. All collected data had two-channel (stereo) recordings, of which only one channel was used, to avoid duplication. All the files were collected as uncompressed WAV files with a sampling rate of 44.1 kHz, but of varying durations (from 30 secs to 8 mins). We divide each recording into segments of 4 seconds; 75% of all the collected segments were used for training and 25% for testing. Features for both training and testing were computed on these segments. MFCC were computed by dividing the 4 sec segments into blocks of 20 ms using a Hamming window with 50% overlap. The same blocks were used to compute the MP features.
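The framing arithmetic above (4 s segments, 20 ms blocks, 50% overlap at 44.1 kHz) works out as follows; this small helper only computes the block count, not the windowing itself.

```python
def frame_count(duration_s=4.0, fs=44100, win_ms=20.0, overlap=0.5):
    # 4 s segments cut into 20 ms Hamming-windowed blocks with 50% overlap.
    n = int(duration_s * fs)          # samples per segment (176400)
    win = int(win_ms * fs / 1000.0)   # samples per block (882 at 44.1 kHz)
    hop = int(win * (1.0 - overlap))  # 441-sample hop for 50% overlap
    return (n - win) // hop + 1
```

With the paper's settings each 4-second segment yields 399 blocks, each of which produces one MFCC vector.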
To classify the segments, we build a Gaussian mixture model (GMM) for each class, described by

p = Σ_{k=1}^{N_g} α_k N(µ_k, Σ_k),

where N(·) is the normal distribution function, α_k is the weight, and µ_k and Σ_k are the mean and variance of the k-th mixture. N_g is the number of mixtures in the GMM. α_k, µ_k and Σ_k are obtained from the features extracted from the training data, using the standard expectation maximization (EM) algorithm.
To classify a segment s, the posterior probability p(s|µ_i, Σ_i) is computed for each frame of the segment. A segment is assigned to the k-th class if the sum of the posterior probabilities of all its frames for the k-th class is maximum. If there are N_s frames in a segment, the segment is allocated to class c_k, where

k = arg max_i Σ_{j=1}^{N_s} log{p(s_j | µ_i, Σ_i)}.
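The decision rule above can be sketched as follows. For brevity this sketch assumes diagonal covariances (the paper does not state the covariance structure) and represents each class GMM as a list of (α_k, µ_k, σ²_k) triples.

```python
import math

def log_gauss(x, mu, var):
    # Log density of a diagonal-covariance Gaussian.
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))

def log_gmm(x, gmm):
    # gmm: list of (alpha_k, mu_k, var_k) mixtures; log p(x) via log-sum-exp.
    terms = [math.log(a) + log_gauss(x, mu, var) for a, mu, var in gmm]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def classify_segment(frames, class_gmms):
    # Assign the segment to the class maximizing the summed frame log-likelihoods.
    scores = [sum(log_gmm(f, g) for f in frames) for g in class_gmms]
    return max(range(len(scores)), key=lambda i: scores[i])
```

Summing log-likelihoods over frames rather than multiplying raw probabilities avoids numerical underflow over the hundreds of frames in a 4-second segment.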
Different values of N_g were tried in [5]; the best results were obtained using a GMM having 5 mixture components. Hence, we construct 5-mixture GMMs for all classes in our experiments. A confusion matrix was constructed, and accuracy values were computed as the ratio of the sum of the diagonal values to the total sum of all elements in the matrix. All results reported in this paper are the average of the accuracy values obtained using ten-fold cross validation.
VI. RESULTS AND ANALYSIS
We implemented the algorithm detailed in [5] on our dataset and obtained an accuracy of 83.2% while classifying the audio segments without preprocessing them. We use this as a baseline to compare the performance of our algorithm. On studying the periodograms of environmental sounds (Fig. 2), we found that certain signals, such as audio recordings from a casino, have substantial energy in the higher frequency region. To understand the impact of high frequencies on the classification, we constructed a dictionary using a high frequency emphasized scaling function, ω = 0.5 × a^{−1/i}, as against the low frequency emphasized scaling function ω = C × i^{2.6} advocated in [5]. A value of a = 1000 was chosen for a smoother ascent towards high frequencies while still distributing enough atoms in the higher part of the frequency spectrum. The variation of ω as a function of i for a = 1000 is shown in Fig. 1. We obtain an accuracy of 84.5% using weighted features and the high frequency emphasized scaling function, an improvement of 1.56% over the low frequency emphasized scaling function. The increase in accuracy is due to a better representation of the high frequency region in the MP features; we reiterate that the low frequency region is already adequately represented by the MFCC. A class-wise accuracy comparison of the low frequency and high frequency emphasized scaling functions using unweighted features is shown in Fig. 3. The high frequency emphasized dictionary outperforms its low frequency counterpart while classifying sounds from nature at daytime, inside vehicle, casino and nature at night, whose periodograms showed high energy in the high frequency region, while the low frequency emphasized dictionary performed better in classifying sounds from ocean, rain and train, which have higher energy content in the lower frequency range.
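The contrast between the two scaling functions can be made concrete by evaluating both over the 35 indices and counting how many frequency points each places below ω = 0.1 (× π), the same region discussed for Fig. 1.

```python
def low_emphasis(i, C=0.5 * 35 ** -2.6):
    # omega = C * i^2.6: dense atom placement at low frequencies [5].
    return C * i ** 2.6

def high_emphasis(i, a=1000.0):
    # omega = 0.5 * a^(-1/i): a smooth ascent that still places many
    # atoms in the upper part of the spectrum (a = 1000 here).
    return 0.5 * a ** (-1.0 / i)

low = [low_emphasis(i) for i in range(1, 36)]
high = [high_emphasis(i) for i in range(1, 36)]
# Count atoms placed below omega = 0.1 (x pi) by each scaling.
low_lo = sum(w < 0.1 for w in low)
high_lo = sum(w < 0.1 for w in high)
```

The low-emphasis scaling puts 18 of its 35 points below ω = 0.1, against only 4 for the high-emphasis scaling, which is what frees atoms to cover the casino-like high-energy upper band.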
Fig. 2. Variation in the periodograms of environmental sounds: (a) casino recording, (b) ambulance recording. [Figure: power/frequency (dB/rad/sample) versus ω (× π rad/sample).]
A. Effect of using sub-band information
The results obtained using the high frequency emphasized dictionary showed the need to construct signal-specific dictionaries.
References (partial; numbering follows the in-text citations)
[5] "Environmental Sound Recognition With Time–Frequency Audio Features."
[8] "Computational Methods for Sparse Solution of Linear Inverse Problems."
[9] "Harmonic decomposition of audio signals with matching pursuit."
[10] "Dictionaries for Sparse Representation Modeling."