
This is a postprint version of the following published document:
Navarro Mesa, J. L., et al. (eds.) (2014). Advances in Speech and
Language Technologies for Iberian Languages: Second International
Conference, IberSPEECH 2014, Las Palmas de Gran Canaria, Spain,
November 19-21, 2014. Proceedings. (pp. 109-118). (Lecture Notes in
Computer Science; 8854). Springer International Publishing.
DOI: http://dx.doi.org/10.1007/978-3-319-13623-3_12
© 2014 Springer International Publishing Switzerland

Deep Maxout Networks Applied
to Noise-Robust Speech Recognition
F. de-la-Calle-Silos, A. Gallardo-Antolín,
and C. Peláez-Moreno
Department of Signal Theory and Communications,
Universidad Carlos III de Madrid,
Leganés (Madrid), Spain
fsilos@tsc.uc3m.es
Abstract. Deep Neural Networks (DNN) have become very popular for
acoustic modeling due to the improvements found over traditional Gaus-
sian Mixture Models (GMM). However, not many works have addressed
the robustness of these systems under noisy conditions. Recently, the
machine learning community has proposed new methods to improve the
accuracy of DNNs by using techniques such as dropout and maxout. In
this paper, we investigate Deep Maxout Networks (DMN) for acoustic
modeling in a noisy automatic speech recognition environment. Experi-
ments show that DMNs improve substantially the recognition accuracy
over DNNs and other traditional techniques in both clean and noisy con-
ditions on the TIMIT dataset.
Keywords: noise robustness, deep neural networks, dropout, deep max-
out networks, speech recognition, deep learning.
1 Introduction
Machine performance in Automatic Speech Recognition (ASR) tasks is still far
away from that of humans, and noisy conditions only compound the problem.
Noise robustness techniques can be divided into two approaches: feature en-
hancement and model adaptation. Feature enhancement tries to remove noise
from the speech signal without changing the acoustic model parameters while
model adaptation changes these parameters to fit the model to the noisy speech
signal. Apart from these techniques, recent years have witnessed an important
leap in performance with the introduction of new acoustic models based on Deep
Neural Networks (DNNs), in comparison with conventional Gaussian Mixture
Model-Hidden Markov Model (GMM-HMM) ASR systems ([7], [3]). Nevertheless,
the performance of this kind of ASR system in noisy conditions has not yet been
fully assessed.
Deep Neural Networks can be applied both in the so-called tandem [16] and
hybrid [15] architectures. In the first case, DNNs can be trained to generate
bottleneck features which are fed to a conventional GMM-HMM back-end. In
the second, DNNs are employed for acoustic modeling by replacing the GMMs
within an HMM system. In this paper we adopt a hybrid DNN configuration.

DNN-HMM hybrid systems combine several features that make them superior
to previous Artificial Neural Network (ANN)-HMM hybrid systems [11]: a) DNNs
have a larger number of hidden layers, leading to systems with many more
parameters than the latter. As a result, these models are less influenced by the
mismatch between training and testing data, but can easily suffer from overfitting
if the training set is not big enough; b) the network usually models senones
(tied states) directly (although there might be thousands of senones); and c) long
context windows are used. Although conventional ANNs can also take longer
context windows into account than HMMs, or are able to model senones, the key
to the success of the DNN-HMM is the combination of these components.
DNN-HMM systems with these properties are often named Context-Dependent
Deep Neural Network HMM (CD-DNN-HMM).
However, the most remarkable difference with respect to traditional neural
networks is that a pre-training stage is needed to reduce the chance that the error
back-propagation algorithm employed for training falls into a poor local minimum.
Besides, some recent methods have been proposed to avoid overfitting and improve
the accuracy of the networks, such as dropout [8], which randomly omits hidden
units in the training stage. Another related technique is the so-called
Deep Maxout Network (DMN) [5], which splits the hidden units at each
layer into non-overlapping groups, each of them generating an activation with a
max-pooling operation. This way, DMNs reduce the size of the parameter space
significantly, making them well suited for ASR tasks, where the training sets and
the input and output dimensions are normally quite large. For this reason, DMNs
have been employed in low-resource speech recognition tasks [14], boosting
the performance over other methods. We hypothesize that DMNs can improve
the recognition rates in noisy conditions given that they are capable of modeling
the speech variability from limited data more effectively [14].
As mentioned before, the number of research works that test DNNs in noisy
conditions is still small. Notably, [18] applies DNNs with dropout on the Aurora
4 dataset with encouraging results. To the best of our knowledge, the present paper
is the first to apply Deep Maxout Networks in combination with dropout strategies
to a noisy speech recognition task, demonstrating a substantial improvement in
recognition accuracy over standard DNNs and other traditional techniques.
The remainder of this paper is organized as follows: Section 2 introduces deep
neural networks and their application in a hybrid automatic speech recognition
architecture; Section 3 and Section 4 describe the dropout and maxout methods,
respectively. Finally, our results are presented in Section 5, followed by some
conclusions and further lines of research in Section 6.
2 Deep Neural Networks and Hybrid Speech Recognition
Systems
A Deep Neural Network (DNN) is a Multi-Layer Perceptron (MLP) with a larger
number of hidden layers between its inputs and outputs, whose weights are fully
connected and are often initialized using an unsupervised pre-training scheme.

As in a traditional MLP, the feed-forward architecture can be computed as follows:

h^{(l+1)} = \sigma\left(W^{(l)} h^{(l)} + b^{(l)}\right), \quad 1 \leq l \leq L \qquad (1)

where h^{(l+1)} is the vector of inputs to layer l+1, σ(x) = (1 + e^{-x})^{-1} is the
sigmoid activation function, L is the total number of hidden layers, h^{(l)} is the
output vector of hidden layer l, and W^{(l)} and b^{(l)} are the weight matrix and
bias vector of layer l, respectively.
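As an illustration of Eq. (1), the following NumPy sketch propagates a feature vector through a stack of fully connected sigmoid layers; the layer sizes, random weights and example input are illustrative assumptions, not values taken from this paper.

# Minimal sketch of the feed-forward computation of Eq. (1).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dnn_forward(x, weights, biases):
    """Propagate an input vector through fully connected sigmoid layers."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)          # h^(l+1) = sigma(W^(l) h^(l) + b^(l))
    return h

rng = np.random.default_rng(0)
layer_sizes = [39, 1024, 1024, 1024]    # e.g. a 39-dimensional acoustic feature vector
weights = [0.01 * rng.standard_normal((m, n))
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(m) for m in layer_sizes[1:]]
print(dnn_forward(rng.standard_normal(39), weights, biases).shape)   # (1024,)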
Training a DNN using the well-known error back-propagation (BP) algorithm
with a random initialization of its weight matrices may not provide a good
performance, as it may become stuck in a local minimum. To overcome this
problem, DNN parameters are often initialized using an unsupervised technique
such as Restricted Boltzmann Machines (RBMs) [6] or Stacked Denoising
Autoencoders (SDAs) [19]. Nevertheless, as will be explained later in this paper,
pre-training may not be necessary if some recently proposed anti-overfitting
techniques are used.
2.1 Hybrid Speech Recognition Systems
In a hybrid DNN/HMM system, just as in classical ANN/HMM hybrids [1], a
DNN is trained to classify the input acoustic features into classes corresponding
to the states of HMMs, in such a way that the state emission likelihoods, usually
computed with GMMs, are replaced by the likelihoods generated by the DNN.
The DNN estimates the posterior probability p(s|o_t) of each state s given the
observation o_t at time t through a softmax final layer:

p(s|o_t) = \frac{\exp\left(W_s^{(L)} h^{(L)} + b_s^{(L)}\right)}{\sum_{\bar{s}} \exp\left(W_{\bar{s}}^{(L)} h^{(L)} + b_{\bar{s}}^{(L)}\right)} \qquad (2)
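The softmax layer of Eq. (2) can be sketched as follows; the number of senones, the hidden dimension and the random parameters are hypothetical placeholders, not values from the paper.

# Sketch of the softmax output layer of Eq. (2).
import numpy as np

def softmax(a):
    a = a - np.max(a)                 # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()

rng = np.random.default_rng(1)
n_states, hidden_dim = 1943, 1024             # hypothetical senone count
W_out = 0.01 * rng.standard_normal((n_states, hidden_dim))
b_out = np.zeros(n_states)
h_L = rng.standard_normal(hidden_dim)         # output of the last hidden layer
posteriors = softmax(W_out @ h_L + b_out)     # p(s | o_t) for every state s
print(posteriors.sum())                       # ~1.0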
In a hybrid ASR system, the HMM topology is set from a previously trained
GMM-HMM, and the DNN training data come from the forced-alignment be-
tween the state-level transcripts and the corresponding speech signals obtained
by using this initial GMM-HMM system.
In the recognition stage, the DNN estimates the emission probability of each
HMM state. To obtain the state emission likelihoods p(o_t|s), Bayes' rule is used
as follows:

p(o_t|s) = \frac{p(s|o_t) \cdot p(o_t)}{p(s)} \qquad (3)

where p(s|o_t) is the posterior probability estimated by the DNN, p(o_t) is a
scaling factor that is constant for each observation and can be ignored, and p(s)
is the class prior, which can be estimated by counting the occurrences of each
state in the training data.
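A possible implementation of this conversion is sketched below: the priors p(s) are estimated from state counts in a toy forced alignment, and the division of Eq. (3) is carried out in the log domain, as is usual in decoders; none of the numbers come from the paper's experiments.

# Sketch of Eq. (3): posteriors to scaled emission likelihoods.
import numpy as np

def state_priors(alignment, n_states, floor=1e-10):
    counts = np.bincount(alignment, minlength=n_states).astype(float)
    return np.maximum(counts / counts.sum(), floor)

def scaled_log_likelihoods(posteriors, priors):
    # log p(o_t|s) is proportional to log p(s|o_t) - log p(s); p(o_t) is dropped
    return np.log(posteriors + 1e-20) - np.log(priors)

n_states = 5
alignment = np.array([0, 0, 1, 2, 2, 2, 3, 4, 4])    # toy state-level alignment
priors = state_priors(alignment, n_states)
posteriors = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # one frame of DNN outputs
print(scaled_log_likelihoods(posteriors, priors))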
3 Dropout
The most important problem to overcome in DNN training is overfitting. Normally,
this problem arises when we try to train a large DNN with a small training
set. A training method called dropout, proposed in [8], tries to reduce overfitting
and improve the generalization capability of the network by randomly omitting
a certain percentage of the hidden units on each training iteration.
When dropout is employed, the activation function of Eq. (1) can be rewritten
as:
h^{(l+1)} = m^{(l)} \odot \sigma\left(W^{(l)} h^{(l)} + b^{(l)}\right), \quad 1 \leq l \leq L \qquad (4)

where ⊙ denotes the element-wise product, m^{(l)} is a binary vector of the same
dimension as h^{(l)} whose elements are sampled from a Bernoulli distribution with
probability p. This probability is the so-called Hidden Drop Factor (HDF) and
must be determined over a validation set, as will be seen in Section 5.
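A minimal sketch of the dropout forward pass of Eq. (4) follows; it assumes that HDF is the probability of dropping a unit (so each mask element equals 1 with probability 1 - HDF), and all shapes and values are illustrative.

# Sketch of the dropout forward pass of Eq. (4).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout_layer(h, W, b, hdf, rng, training=True):
    a = sigmoid(W @ h + b)
    if not training:
        return a                                        # all units active at test time
    m = (rng.random(a.shape) >= hdf).astype(a.dtype)    # binary mask m^(l)
    return m * a                                        # element-wise product of Eq. (4)

rng = np.random.default_rng(2)
W, b = 0.1 * rng.standard_normal((8, 8)), np.zeros(8)
h = rng.standard_normal(8)
print(dropout_layer(h, W, b, hdf=0.2, rng=rng))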
As the sigmoid function has the property that σ(0) = 0, Eq. (4) can be
rewritten as:

h^{(l+1)} = \sigma\left(m^{(l)} \odot \left(W^{(l)} h^{(l)} + b^{(l)}\right)\right), \quad 1 \leq l \leq L \qquad (5)

where dropout is applied to the inputs of the activation function, leading to a more
efficient way of performing dropout training.
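A sketch of the Eq. (5) variant, under the same assumption about HDF as before: the only change with respect to the previous sketch is that the binary mask multiplies the pre-activations rather than the sigmoid outputs; all values are illustrative.

# Sketch of the pre-activation dropout of Eq. (5).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout_layer_preact(h, W, b, hdf, rng):
    m = (rng.random(b.shape) >= hdf).astype(h.dtype)    # binary mask m^(l)
    return sigmoid(m * (W @ h + b))                     # mask applied before the activation

rng = np.random.default_rng(3)
W, b, h = 0.1 * rng.standard_normal((8, 8)), np.zeros(8), rng.standard_normal(8)
print(dropout_layer_preact(h, W, b, hdf=0.2, rng=rng))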
Note that dropout is only applied in the training stage, whereas in testing
all the hidden units remain active. A dropout DNN can be seen as an ensemble
of DNNs, given that on each presentation of a training example a different
sub-model is trained, and the sub-models' predictions are averaged together. This
technique is similar to bagging [2], where many different models are trained using
different subsets of the training data, but in dropout each model is only trained
for a single iteration and all the models share some parameters.
Dropout networks are trained with the standard stochastic gradient descent
algorithm, but using the forward architecture presented in Eq. (4) instead of
Eq. (1). Following [13], we compensate the parameters in testing by scaling the
weight matrices taking the dropout factor into account as follows:

W^{(l)} = (1 - \mathrm{HDF}) \cdot W^{(l)} \qquad (6)
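A sketch of this test-time compensation, under the same assumption that HDF is the drop probability: every trained weight matrix is scaled by (1 - HDF) before decoding. The matrices below are random placeholders.

# Sketch of the test-time weight scaling of Eq. (6).
import numpy as np

def compensate_weights(weights, hdf):
    """Return the test-time weight matrices, scaled by (1 - HDF)."""
    return [(1.0 - hdf) * W for W in weights]

rng = np.random.default_rng(4)
train_weights = [rng.standard_normal((4, 4)) for _ in range(3)]
test_weights = compensate_weights(train_weights, hdf=0.2)
print(test_weights[0] / train_weights[0])    # every entry equals 0.8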
Dropout has already been successfully tested on noise-robust ASR in [18]. Its
benefits come from the improved generalization abilities attained by reducing the
network's capacity. Another interpretation of the behaviour of dropout is that in
the training stage it adds random noise to the training set, resulting in a network
that is very robust to variabilities in the inputs (in our particular case, due to the
addition of noise).
4 Deep Maxout Networks
A Deep Maxout Network (DMN) [5] is a modification of the feed-forward
architecture (Eq. (1)) in which the maxout activation function is employed. The
maxout unit simply takes the maximum over a set of inputs. In a DMN, the hidden
units at each layer are divided into non-overlapping groups of size g, and the
output of hidden node i of layer l + 1 can be computed as follows:

h_i^{(l+1)} = \max_{j \in 1, \ldots, g} z_{ij}^{(l+1)}, \quad 1 \leq l \leq L \qquad (7)

where z_{ij}^{(l+1)} are the linear pre-activation values from layer l:

z^{(l+1)} = W^{(l)} h^{(l)} + b^{(l)} \qquad (8)

As can be observed, the max-pooling operation is applied over the z^{(l+1)} vector.
Note that DMNs fairly reduce the number of parameters over DNNs, as the weight
matrix W^{(l)} of each layer in the DMN is 1/g of the size of its equivalent DNN
weight matrix.
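The following sketch implements a single maxout layer following Eqs. (7) and (8): the linear pre-activations are computed with one affine transformation and then reduced with a max over non-overlapping groups of size g. The unit count, group size and input dimension are assumptions for the example only.

# Sketch of a maxout layer, Eqs. (7)-(8).
import numpy as np

def maxout_layer(h, W, b, g):
    z = W @ h + b                         # z^(l+1) = W^(l) h^(l) + b^(l)
    assert z.size % g == 0                # units must split into groups of size g
    return z.reshape(-1, g).max(axis=1)   # h_i^(l+1) = max_j z_ij^(l+1)

rng = np.random.default_rng(5)
n_units, g, in_dim = 400, 3, 39
W = 0.01 * rng.standard_normal((n_units * g, in_dim))
b = np.zeros(n_units * g)
out = maxout_layer(rng.standard_normal(in_dim), W, b, g)
print(out.shape)                          # (400,)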

References
Breiman, L.: Bagging predictors. Machine Learning 24(2), 123-140 (1996)
Hinton, G., Deng, L., Yu, D., et al.: Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine 29(6) (2012)
Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
Povey, D., et al.: The Kaldi Speech Recognition Toolkit. In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (2011)
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11 (2010)