
This is a postprint version of the following published document:
Navarro Mesa, J. L., et al. (eds.) (2014). Advances in Speech and
Language Technologies for Iberian Languages: Second International
Conference, IberSPEECH 2014, Las Palmas de Gran Canaria, Spain,
November 19-21, 2014. Proceedings. (pp. 109-118). (Lecture Notes in
Computer Science; 8854). Springer International Publishing.
DOI: http://dx.doi.org/10.1007/978-3-319-13623-3_12
© 2014 Springer International Publishing Switzerland

Deep Maxout Networks Applied
to Noise-Robust Speech Recognition
F. de-la-Calle-Silos, A. Gallardo-Antolín,
and C. Peláez-Moreno
Department of Signal Theory and Communications,
Universidad Carlos III de Madrid,
Leganés (Madrid), Spain
fsilos@tsc.uc3m.es
Abstract. Deep Neural Networks (DNN) have become very popular for
acoustic modeling due to the improvements found over traditional Gaus-
sian Mixture Models (GMM). However, not many works have addressed
the robustness of these systems under noisy conditions. Recently, the
machine learning community has proposed new methods to improve the
accuracy of DNNs by using techniques such as dropout and maxout. In
this paper, we investigate Deep Maxout Networks (DMN) for acoustic
modeling in a noisy automatic speech recognition environment. Experi-
ments show that DMNs improve substantially the recognition accuracy
over DNNs and other traditional techniques in both clean and noisy con-
ditions on the TIMIT dataset.
Keywords: noise robustness, deep neural networks, dropout, deep max-
out networks, speech recognition, deep learning.
1 Introduction
Machine performance in Automatic Speech Recognition (ASR) tasks is still far
away from that of humans, and noisy conditions only compound the problem.
Noise robustness techniques can be divided into two approaches: feature en-
hancement and model adaptation. Feature enhancement tries to remove noise
from the speech signal without changing the acoustic model parameters while
model adaptation changes these parameters to fit the model to the noisy speech
signal. Apart from these techniques, recent years have witnessed an important
leap in performance with the introduction of new acoustic models based on Deep
Neural Networks (DNNs), in comparison with conventional Gaussian Mixture
Model-Hidden Markov Model (GMM-HMM) ASR systems ([7], [3]). Nevertheless,
the performance of this kind of ASR system in noisy conditions has not yet been
fully assessed.
Deep Neural Networks can be applied both in the so-called tandem [16] and
hybrid [15] architectures. In the first case, DNNs can be trained to generate
bottleneck features which are fed to a conventional GMM-HMM back-end. In
the second, DNNs are employed for acoustic modeling by replacing the GMMs
within an HMM system. In this paper we adopt a hybrid DNN configuration.

DNN-HMM hybrid systems combine several features that make them superior
to previous Artificial Neural Network (ANN)-HMM hybrid systems [11]: a) DNNs
have a larger number of hidden layers, leading to systems with many more
parameters than the latter. As a result, these models are less influenced by the
mismatch between training and testing data, but can easily suffer from overfitting
if the training set is not big enough; b) the network usually models senones
(tied states) directly (although there might be thousands of senones); and c) long
context windows are used. Although conventional ANNs can also take longer
context windows into account than HMMs, or are able to model senones, the key
to the success of the DNN-HMM is the combination of these components.
DNN-HMM systems with these properties are often named Context-Dependent
Deep Neural Network HMM (CD-DNN-HMM).
However, the most remarkable difference with respect to traditional neural
networks is that a pre-training stage is needed to reduce the chance that the error
back-propagation algorithm employed for training falls into a poor local minimum.
Besides, some recent methods have been proposed to avoid overfitting and improve
the accuracy of the networks, such as dropout [8], which randomly omits hidden
units in the training stage. Another related technique is the so-called
Deep Maxout Network (DMN) [5], which splits the hidden units at each
layer into non-overlapping groups, each of them generating an activation with a
max-pooling operation. This way, DMNs reduce the size of the parameter space
significantly, making them well suited for ASR tasks, where the training sets and
the input and output dimensions are normally quite large. For this reason, DMNs
have been employed in low-resource speech recognition tasks [14], boosting
the performance over other methods. We hypothesize that DMNs can improve
the recognition rates in noisy conditions given that they are capable of modeling
the speech variability from limited data more effectively [14].
As mentioned before, the number of research works that test DNNs in noisy
conditions is still small. Notably, [18] applies DNNs with dropout on the Aurora
4 dataset with encouraging results. To the best of our knowledge, the present paper
is the first to apply Deep Maxout Networks in combination with dropout strategies
to a noisy speech recognition task, demonstrating a substantial improvement in
recognition accuracy over standard DNNs and other traditional techniques.
The remainder of this paper is organized as follows: Section 2 introduces deep
neural networks and their application in a hybrid automatic speech recognition
architecture; Section 3 and Section 4 describe the dropout and maxout methods,
respectively. Finally, our results are presented in Section 5, followed by some
conclusions and further lines of research in Section 6.
2 Deep Neural Networks and Hybrid Speech Recognition
Systems
A Deep Neural Network (DNN) is a Multi-Layer Perceptron (MLP) with a larger
number of hidden layers between its inputs and outputs, whose weights are fully
connected and are often initialized using an unsupervised pre-training scheme.

As in a traditional MLP, the feed-forward architecture can be computed as follows:

h^{(l+1)} = \sigma\left(W^{(l)} h^{(l)} + b^{(l)}\right), \quad 1 \leq l \leq L \qquad (1)

where h^{(l+1)} is the vector of inputs to layer l+1, σ(x) = (1 + e^{-x})^{-1} is the
sigmoid activation function, L is the total number of hidden layers, h^{(l)} is the
output vector of hidden layer l, and W^{(l)} and b^{(l)} are the weight matrix and
bias vector of layer l, respectively.
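As an illustration of Eq. (1), the following NumPy sketch propagates a feature vector through a stack of fully connected sigmoid layers; the layer sizes, random weights and example input are illustrative assumptions, not values taken from this paper.

# Minimal sketch of the feed-forward computation of Eq. (1).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dnn_forward(x, weights, biases):
    """Propagate an input vector through fully connected sigmoid layers."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)          # h^(l+1) = sigma(W^(l) h^(l) + b^(l))
    return h

rng = np.random.default_rng(0)
layer_sizes = [39, 1024, 1024, 1024]    # e.g. a 39-dimensional acoustic feature vector
weights = [0.01 * rng.standard_normal((m, n))
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(m) for m in layer_sizes[1:]]
print(dnn_forward(rng.standard_normal(39), weights, biases).shape)   # (1024,)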
Training a DNN using the well-known error back-propagation (BP) algorithm
with a random initialization of its weight matrices may not provide a good
performance, as it may become stuck in a local minimum. To overcome this
problem, DNN parameters are often initialized using an unsupervised technique
such as Restricted Boltzmann Machines (RBMs) [6] or Stacked Denoising
Autoencoders (SDAs) [19]. Nevertheless, as will be explained later in this paper,
pre-training may not be necessary if some recently proposed anti-overfitting
techniques are used.
2.1 Hybrid Speech Recognition Systems
In a hybrid DNN/HMM system, just as in classical ANN/HMM hybrids [1], a
DNN is trained to classify the input acoustic features into classes corresponding
to the states of HMMs, in such a way that the state emission likelihoods, usually
computed with GMMs, are replaced by the likelihoods generated by the DNN.
The DNN estimates the posterior probability p(s|o_t) of each state s given the
observation o_t at time t through a softmax final layer:

p(s|o_t) = \frac{\exp\left(W_s^{(L)} h^{(L)} + b_s^{(L)}\right)}{\sum_{\bar{s}} \exp\left(W_{\bar{s}}^{(L)} h^{(L)} + b_{\bar{s}}^{(L)}\right)} \qquad (2)
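The softmax layer of Eq. (2) can be sketched as follows; the number of senones, the hidden dimension and the random parameters are hypothetical placeholders, not values from the paper.

# Sketch of the softmax output layer of Eq. (2).
import numpy as np

def softmax(a):
    a = a - np.max(a)                 # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()

rng = np.random.default_rng(1)
n_states, hidden_dim = 1943, 1024             # hypothetical senone count
W_out = 0.01 * rng.standard_normal((n_states, hidden_dim))
b_out = np.zeros(n_states)
h_L = rng.standard_normal(hidden_dim)         # output of the last hidden layer
posteriors = softmax(W_out @ h_L + b_out)     # p(s | o_t) for every state s
print(posteriors.sum())                       # ~1.0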
In a hybrid ASR system, the HMM topology is set from a previously trained
GMM-HMM, and the DNN training data come from the forced-alignment be-
tween the state-level transcripts and the corresponding speech signals obtained
by using this initial GMM-HMM system.
In the recognition stage, the DNN estimates the emission probability of each
HMM state. To obtain the state emission likelihoods p(o_t|s), Bayes' rule is used
as follows:

p(o_t|s) = \frac{p(s|o_t) \cdot p(o_t)}{p(s)} \qquad (3)

where p(s|o_t) is the posterior probability estimated by the DNN, p(o_t) is a
scaling factor that is constant for each observation and can be ignored, and p(s)
is the class prior, which can be estimated by counting the occurrences of each
state in the training data.
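A possible implementation of this conversion is sketched below: the priors p(s) are estimated from state counts in a toy forced alignment, and the division of Eq. (3) is carried out in the log domain, as is usual in decoders; none of the numbers come from the paper's experiments.

# Sketch of Eq. (3): posteriors to scaled emission likelihoods.
import numpy as np

def state_priors(alignment, n_states, floor=1e-10):
    counts = np.bincount(alignment, minlength=n_states).astype(float)
    return np.maximum(counts / counts.sum(), floor)

def scaled_log_likelihoods(posteriors, priors):
    # log p(o_t|s) is proportional to log p(s|o_t) - log p(s); p(o_t) is dropped
    return np.log(posteriors + 1e-20) - np.log(priors)

n_states = 5
alignment = np.array([0, 0, 1, 2, 2, 2, 3, 4, 4])    # toy state-level alignment
priors = state_priors(alignment, n_states)
posteriors = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # one frame of DNN outputs
print(scaled_log_likelihoods(posteriors, priors))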
3 Dropout
The most important problem to overcome in DNN training is overfitting. Normally,
this problem arises when we try to train a large DNN with a small training
set. A training method called dropout, proposed in [8], tries to reduce overfitting
and improve the generalization capability of the network by randomly omitting
a certain percentage of the hidden units on each training iteration.
When dropout is employed, the activation function of Eq. (1) can be rewritten
as:
h^{(l+1)} = m^{(l)} \odot \sigma\left(W^{(l)} h^{(l)} + b^{(l)}\right), \quad 1 \leq l \leq L \qquad (4)

where ⊙ denotes the element-wise product, m^{(l)} is a binary vector of the same
dimension as h^{(l)} whose elements are sampled from a Bernoulli distribution with
probability p. This probability is the so-called Hidden Drop Factor (HDF) and
must be determined over a validation set, as will be seen in Section 5.
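A minimal sketch of the dropout forward pass of Eq. (4) follows; it assumes that HDF is the probability of dropping a unit (so each mask element equals 1 with probability 1 - HDF), and all shapes and values are illustrative.

# Sketch of the dropout forward pass of Eq. (4).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout_layer(h, W, b, hdf, rng, training=True):
    a = sigmoid(W @ h + b)
    if not training:
        return a                                        # all units active at test time
    m = (rng.random(a.shape) >= hdf).astype(a.dtype)    # binary mask m^(l)
    return m * a                                        # element-wise product of Eq. (4)

rng = np.random.default_rng(2)
W, b = 0.1 * rng.standard_normal((8, 8)), np.zeros(8)
h = rng.standard_normal(8)
print(dropout_layer(h, W, b, hdf=0.2, rng=rng))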
As the sigmoid function has the property that σ(0) = 0, Eq. (4) can be
rewritten as:

h^{(l+1)} = \sigma\left(m^{(l)} \odot \left(W^{(l)} h^{(l)} + b^{(l)}\right)\right), \quad 1 \leq l \leq L \qquad (5)

where dropout is applied to the inputs of the activation function, leading to a more
efficient way of performing dropout training.
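A sketch of the Eq. (5) variant, under the same assumption about HDF as before: the only change with respect to the previous sketch is that the binary mask multiplies the pre-activations rather than the sigmoid outputs; all values are illustrative.

# Sketch of the pre-activation dropout of Eq. (5).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout_layer_preact(h, W, b, hdf, rng):
    m = (rng.random(b.shape) >= hdf).astype(h.dtype)    # binary mask m^(l)
    return sigmoid(m * (W @ h + b))                     # mask applied before the activation

rng = np.random.default_rng(3)
W, b, h = 0.1 * rng.standard_normal((8, 8)), np.zeros(8), rng.standard_normal(8)
print(dropout_layer_preact(h, W, b, hdf=0.2, rng=rng))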
Note that dropout is only applied in the training stage, whereas in testing
all the hidden units remain active. A dropout DNN can be seen as an ensemble
of DNNs, given that on each presentation of a training example a different
sub-model is trained, and the sub-models' predictions are averaged together. This
technique is similar to bagging [2], where many different models are trained using
different subsets of the training data, but in dropout each model is only trained
for a single iteration and all the models share some parameters.
Dropout networks are trained with the standard stochastic gradient descent
algorithm, but using the forward architecture presented in Eq. (4) instead of
Eq. (1). Following [13], we compensate the parameters in testing by scaling the
weight matrices taking the dropout factor into account as follows:

W^{(l)} = (1 - \mathrm{HDF}) \cdot W^{(l)} \qquad (6)
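A sketch of this test-time compensation, under the same assumption that HDF is the drop probability: every trained weight matrix is scaled by (1 - HDF) before decoding. The matrices below are random placeholders.

# Sketch of the test-time weight scaling of Eq. (6).
import numpy as np

def compensate_weights(weights, hdf):
    """Return the test-time weight matrices, scaled by (1 - HDF)."""
    return [(1.0 - hdf) * W for W in weights]

rng = np.random.default_rng(4)
train_weights = [rng.standard_normal((4, 4)) for _ in range(3)]
test_weights = compensate_weights(train_weights, hdf=0.2)
print(test_weights[0] / train_weights[0])    # every entry equals 0.8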
Dropout has already been successfully tested on noise-robust ASR in [18]. Its
benefits come from the improved generalization abilities attained by reducing the
network's capacity. Another interpretation of the behaviour of dropout is that in
the training stage it adds random noise to the training set, resulting in a network
that is very robust to variabilities in the inputs (in our particular case, due to the
addition of noise).
4 Deep Maxout Networks
A Deep Maxout Network (DMN) [5] is a modification of the feed-forward
architecture (Eq. (1)) in which the maxout activation function is employed. The
maxout unit simply takes the maximum over a set of inputs. In a DMN, the hidden
units at each layer are divided into non-overlapping groups of size g, and the
output of hidden node i of layer l + 1 can be computed as follows:

h_i^{(l+1)} = \max_{j \in 1, \ldots, g} z_{ij}^{(l+1)}, \quad 1 \leq l \leq L \qquad (7)

where z_{ij}^{(l+1)} are the linear pre-activation values from layer l:

z^{(l+1)} = W^{(l)} h^{(l)} + b^{(l)} \qquad (8)

As can be observed, the max-pooling operation is applied over the z^{(l+1)} vector.
Note that DMNs fairly reduce the number of parameters over DNNs, as the weight
matrix W^{(l)} of each layer in the DMN is 1/g of the size of its equivalent DNN
weight matrix.
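The following sketch implements a single maxout layer following Eqs. (7) and (8): the linear pre-activations are computed with one affine transformation and then reduced with a max over non-overlapping groups of size g. The unit count, group size and input dimension are assumptions for the example only.

# Sketch of a maxout layer, Eqs. (7)-(8).
import numpy as np

def maxout_layer(h, W, b, g):
    z = W @ h + b                         # z^(l+1) = W^(l) h^(l) + b^(l)
    assert z.size % g == 0                # units must split into groups of size g
    return z.reshape(-1, g).max(axis=1)   # h_i^(l+1) = max_j z_ij^(l+1)

rng = np.random.default_rng(5)
n_units, g, in_dim = 400, 3, 39
W = 0.01 * rng.standard_normal((n_units * g, in_dim))
b = np.zeros(n_units * g)
out = maxout_layer(rng.standard_normal(in_dim), W, b, g)
print(out.shape)                          # (400,)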

References
Breiman, L.: Bagging predictors. Machine Learning 24(2), 123-140 (1996)
Hinton, G., Deng, L., Yu, D., et al.: Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine 29(6) (2012)
Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
Povey, D., et al.: The Kaldi Speech Recognition Toolkit. In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (2011)
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11 (2010)