
SIZE MATTERS:
AN EMPIRICAL STUDY OF NEURAL NETWORK TRAINING FOR LARGE
VOCABULARY CONTINUOUS SPEECH RECOGNITION
Dan Ellis† and Nelson Morgan†‡
† International Computer Science Institute, 1947 Center St, Berkeley, CA 94704
‡ University of California at Berkeley, EECS Department, Berkeley, CA 94720
Tel: (510) 643-9153, FAX: (510) 643-7684, Email: {dpwe, morgan}@icsi.berkeley.edu
ABSTRACT
We have trained and tested a number of large neural networks for the purpose of emission probability estimation in large vocabulary continuous speech recognition. In particular, the problem under test is the DARPA Broadcast News task. Our goal here was to determine the relationship between training time, word error rate, size of the training set, and size of the neural network. In all cases, the network architecture was quite simple, comprising a single large hidden layer with an input window consisting of feature vectors from 9 frames around the current time, with a single output for each of 54 phonetic categories. Thus far, simultaneous increases to the size of the training set and the neural network improve performance; in other words, more data helps, as does the training of more parameters. We continue to be surprised that such a simple system works as well as it does for complex tasks. Given a limitation in training time, however, there appears to be an optimal ratio of training patterns to parameters of around 25:1 in these circumstances. Additionally, doubling the training data and system size appears to provide diminishing returns of error rate reduction for the largest systems.
1. INTRODUCTION
For about 10 years, we and others have trained large neural networks to estimate posterior probabilities of context-independent phonetic classes for use in speech recognition systems based on Hidden Markov Models (HMMs) [7]. For small tasks, moderate amounts of training data, and when simple models were used, we consistently found that we could provide competitive and often improved recognition performance in comparison with systems that used more standard architectures and training methods (e.g., Gaussian mixtures trained with a Maximum Likelihood criterion) [9]. However, for large tasks for which a great deal of training data was available, we have had difficulty achieving the performance of likelihood-based HMM systems. Some of this difference is undoubtedly due to scientifically uninteresting factors, such as the resources required to correct faulty transcriptions. We have also wondered whether some of the observed difficulty might be a straightforward tradeoff between computation and performance. As we have usually designed it, hybrid HMM/ANN training requires the update of all network parameters in response to every input frame. On the other hand, direct training of state-conditional feature densities in HMM systems only requires the update of parameters corresponding to the state or states associated with each feature vector. Furthermore, at least in principle, likelihood-based HMM systems can always benefit from more acoustic data by improving the estimates for ever-finer categories (e.g., from triphones to quinphones), since with more data these rarer categories will become more populated.
Of course, there are analogous procedures available for connectionist systems; for instance, the CDNN described in [1], with variants explored in [2] and [5], can yield density estimates for context-dependent classes as a product of network outputs. The ACID/HNN system of [3] goes even further, resulting in an extensive set of polyphone probabilities that can be used for a fully context-dependent system in the spirit of the large HMMs. In experiments on Switchboard, for instance, this latter system appears to be quite comparable in performance to the best likelihood-based HMMs. However, to achieve this result the simplicity of the large single network is sacrificed, leaving us with the question: can we extract greater recognition accuracy from an increase in training data without complicating the structure?
In previous work, we typically did not incorporate large amounts of training data (e.g., much more than 10 hours of speech). We also did not have sufficient computational resources to explore the simplest approach: namely, to keep the simple architecture constant and merely increase the size of the network for training on larger training sets. This year, we developed a baseline recognition system for the Broadcast News task, for which we had 74 hours of acoustic data.¹ While using all of this data was preferable, systems trained on subsets were good enough to generate the comparative results for this experiment. For the parts of the experiment in which we used larger networks and larger fractions of the data, the amount of computation would have previously been prohibitive. However, we recently completed the development of a multiprocessor machine incorporating VLSI developed in our group, and this permitted trainings that required on the order of 10^15 arithmetic operations for the larger cases.
¹ Components of a variant of this system are currently being developed for the 1998 DARPA Hub 4 evaluation, in collaboration with the connectionist groups at Cambridge University, Sheffield University, and Faculté Polytechnique de Mons.

Given the availability of acoustic materials and computational resources, we decided to push our simple system to its limit, and also to test it for a range of training set and neural net sizes.
2. METHODS
The basic procedure was to train neural networks with a range of sizes on acoustic training data from different amounts of large vocabulary continuous speech. Each network was then used in a hybrid HMM/ANN recognizer, and was evaluated with word error rate on a large vocabulary task using a 65K word lexicon.
2.1. Corpus
For these experiments, we used the Broadcast News corpus. This is a collection of speech from American radio and television news broadcasts, such as the National Public Radio program All Things Considered or Nightline, televised in the U.S. on the ABC network. These shows comprise a whole range of speaking conditions, from planned speech in studio environments to spontaneous speech in noisy field conditions over telephone lines. The (possibly multi-sentence) segments are divided into 7 different focus conditions representing different acoustic/speaking environments; the majority conditions are planned studio speech and spontaneous studio speech.
2.2. System Architecture
As in many of our previous papers [7], the underlying statistical model was an extremely simple HMM. For each of 54 phonetic categories, we had an HMM consisting of a strictly left-to-right model with multiple states tied to a single distribution; multiple repeated states were used to establish a minimum duration for each phone. Transition probabilities were set to .5. The emission probabilities of the HMM were scaled likelihoods estimated by dividing the network outputs by the priors for each class. The network was a Multi-Layer Perceptron (MLP) with a single sigmoidal hidden layer, whose size for these experiments was varied from 500 to 4000 by factors of two. For each choice of hidden layer size, a training was done using 1/8, 1/4, 1/2, and all of the 74 hours of acoustic training material that was available to us for this study. Note that the largest training incorporated about 700,000 parameters and 16 million acoustic frames. 54 outputs associated with the phonetic classes were generated by softmax functions computed from the weighted hidden unit outputs. For the main set of experiments reported here, the input consisted of feature vectors from the frame associated with the target label, as well as from 4 vectors into the past and 4 into the future.
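To make the estimator concrete, the following is a minimal NumPy sketch of the forward pass and prior division described above. It is illustrative only: the original system ran on custom hardware, and all function and array names here are ours, not the authors' code.

```python
import numpy as np

def mlp_scaled_likelihoods(frames, W1, b1, W2, b2, priors, t):
    """Posterior estimation and prior scaling for one time step.

    frames : (T, 13) array of PLP-12+gain feature vectors
    W1     : (9*13, H) input-to-hidden weights; b1: (H,) biases
    W2     : (H, 54) hidden-to-output weights;  b2: (54,) biases
    priors : (54,) relative frequencies of the phone classes in training
    t      : index of the current (target) frame
    """
    # 9-frame input window: 4 frames of past context, the current
    # frame, and 4 frames of future context, concatenated.
    x = frames[t - 4 : t + 5].reshape(-1)        # (117,)

    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))     # sigmoidal hidden layer

    z = h @ W2 + b2                              # 54 output activations
    z -= z.max()                                 # numerical stability
    posteriors = np.exp(z) / np.exp(z).sum()     # softmax: P(class | x)

    # Dividing posteriors by class priors gives likelihoods scaled by a
    # class-independent factor, usable directly as HMM emission scores.
    return posteriors / priors
```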
To generate the features used in these experiments, the speech was filtered and downsampled to 8 kHz. PLP-12 features [4], including the PLP gain term to give a 13-element feature vector, were computed every 16 ms, and normalized according to the mean and variance of each segment in the training data. These segments were provided to us by our colleagues in the Cambridge University connectionist group, and roughly corresponded to an utterance or a group of utterances by a single speaker. In practice a good segmentation system (such as the one developed by the HTK group at Cambridge) does not degrade performance over that achieved by manually segmenting chunks associated with a single signal source [11]. We used the Noway large vocabulary decoder [8], and co-developed a large vocabulary pronunciation lexicon with our partners at Cambridge and Sheffield. The backoff trigram grammar from the Cambridge BN 97 system was used, incorporating 7M bigrams and 24M trigrams. It had been trained using both text sources (broadcast news and newswire texts) and broadcast news acoustic transcripts.
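The per-segment normalization amounts to standardizing each feature dimension over the frames of a segment. A minimal sketch, assuming a (frames x 13) array per segment (not the original feature pipeline):

```python
import numpy as np

def normalize_segment(features, eps=1e-8):
    """Mean/variance-normalize one segment of feature vectors.

    features : (T, 13) array, one PLP-12+gain vector per 16 ms frame.
    Returns the array with each dimension standardized over the
    segment, reducing channel- and speaker-level differences.
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / (sigma + eps)
```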
2.3. Training Hardware and Software
Connectionist training of large networks is quite computationally demanding, as noted above. To aid in this task, we developed a vector microprocessor described in [10], and vectorized software that incorporated efficient assembler routines for this processor while providing a C++ structure that permitted a moderate range of experimentation for our trainings. The board-level system (called the Spert-II) also includes 8 MB of fast (zero wait state) SRAM so that memory accesses are not a bottleneck for the neural computation of large networks. Current high end PCs and moderate level workstations are now fast enough to compete with this system (when highly optimized software is used), but we have also developed 2-processor and 4-processor systems which are significantly faster for sufficiently large networks. Using a commercial bus expansion chassis, these permit the connection of four Spert-IIs to a single Sparc host. Although the bus bandwidth is necessarily shared between the four processors, accumulating error gradients over 16 to 32 patterns permits a near linear speedup for the larger networks trained in this study. The Spert-II boards can also be used independently within the multiprocessor system for those problems with smaller networks.
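As a rough illustration of the accumulation strategy (not the Spert-II software; the `workers` objects and their `gradient` method are hypothetical stand-ins for the per-board computation):

```python
import numpy as np

def accumulated_update(workers, params, lr, batch=32):
    """Illustrative bulk-synchronous update over several boards.

    Each worker computes gradients on its share of a 16-32 pattern
    mini-batch; summing the gradients before a single weight update
    amortizes the shared-bus traffic over many patterns, which is
    what makes the near-linear speedup possible.
    """
    grad = np.zeros_like(params)
    for w in workers:                      # one entry per board (hypothetical API)
        grad += w.gradient(params, batch // len(workers))
    return params - lr * grad
```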
3. RESULTS
                        Training set size, hours
  # Hidden units     9.25     18.5      37       74
       500          42.8%    41.0%    40.2%    39.2%
      1000          41.8%    38.8%    36.5%    36.9%
      2000          40.4%    37.2%    35.6%    34.4%
      4000          40.3%    37.4%    33.9%    33.7%

Table 1: Word Error Rate percentages for the overall hybrid recognition system incorporating classifier networks with different-sized hidden layers, trained on varying amounts of acoustic data. In each case, the input consisted of 9 vectors of length 13, and the output layer was 54 units. The number of parameters for each layered network was ((9x13)+54)x(# Hidden units) weights, plus (# Hidden units + 54) biases.
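As a check, the caption's formula reproduces the "about 700,000 parameters" quoted in Section 2.2 for the largest network:

```python
def n_params(hidden_units):
    """Parameter count from the Table 1 caption's formula."""
    n_in, n_out = 9 * 13, 54                 # 117 input units, 54 outputs
    weights = (n_in + n_out) * hidden_units  # ((9x13)+54) x hidden units
    biases = hidden_units + n_out            # hidden units + 54
    return weights + biases

print(n_params(4000))  # 688054, i.e. about 700,000 parameters
```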
Table 1 details the effect of varying the size of the training set and the number of hidden layer units on the error rate of the overall hybrid system, all other parameters being held constant. These results are plotted as a 3-dimensional surface, i.e. error rate as a function of training set and hidden layer size, in Figure 1. The most obvious trend is that increasing either parameter will improve the overall system performance in virtually every condition.

[Figure: surface plot titled "WER for PLP12N nets vs. net size & training data"; horizontal axes: hidden layer size (500-4000 units) and training set (9.25-74 hours); vertical axis: WER% (32-44).]
Figure 1: Surface plot of system word error rate as a function of the amount of training data and the hidden layer size.
There are some caveats to be borne in mind when considering the word error rate figures. These results were obtained on a separate test set of 32 minutes containing 5938 words; by our reckoning, to be significant at the 5% level, error rates must differ by at least 1.5%.² There is additional variability introduced by the randomization of pattern presentation used in the network training.
Training followed a standard `simulated annealing' process with repeated passes or `epochs' over the entire training set; after initial stabilization, the learning rate was halved on each successive iteration until the frame classification accuracy on a separate cross-validation set improved by less than 0.5%. The interaction between this criterion and other variables meant that the different networks trained for different numbers of epochs, between 7 and 10.
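In outline, the schedule behaves like the sketch below. This is a paraphrase of the description above, not the original training code; `train_epoch` and `cv_accuracy` are hypothetical helpers, and the exact stabilization test is our assumption.

```python
def anneal_schedule(net, lr, min_gain=0.005):
    """Sketch of the annealing schedule: train at a fixed rate until
    cross-validation gains flatten, then halve the rate every epoch,
    stopping once an epoch improves CV frame accuracy (as a fraction)
    by less than 0.5% absolute.
    """
    prev = cv_accuracy(net)           # hypothetical helper
    annealing = False
    while True:
        train_epoch(net, lr)          # hypothetical helper: one full pass
        gain = cv_accuracy(net) - prev
        prev += gain
        if not annealing:
            if gain < min_gain:       # initial stabilization is over
                annealing = True
                lr *= 0.5
        else:
            if gain < min_gain:       # improvement under 0.5%: done
                break
            lr *= 0.5                 # halve on each successive epoch
    return net
```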
The acoustic data for these experiments was limited to 4 kHz bandwidth before feature extraction.³ While this processing succeeded in its intention of improving the relative performance on the telephone-channel speech which forms some 15% of the corpus, it appears to increase the error for the remaining full-bandwidth data. Finally, the decoder pruning for these tests was chosen to be fairly aggressive, giving a typical recognition speed of about 2x real-time; slower but more exhaustive decoding would yield a relative error rate improvement of 5-10%.

² This test set, used internally within our collaboration with Cambridge and Sheffield, is a subset of the BN 97 evaluation set. Previous experience at Cambridge suggests that this subset is slightly "harder" than the larger set, typically by about a factor of 10% in relative error.

³ This bandlimiting was done for its complementarity with the Cambridge system, with whom we would be merging models; in practice the overall loss in performance due to this bandwidth reduction appeared to be minor.
[Figure: plot titled "WER vs. frames/weight"; horizontal axis: frames/weight (log scale, roughly 1-500); vertical axis: WER% (32-44); one curve per fixed number of connection updates per epoch, tagged 178 GCUP, 356 GCUP, 712 GCUP, 1.42 TCUP, 2.8 TCUP, 5.7 TCUP, and 11.4 TCUP.]
Figure 2: Slices through the surface of the previous figure, showing the variation of error rate with the ratio of training patterns to network weights for a fixed number of connection updates per training epoch.
Figure 2 shows a succession of slices through the error-rate surface, taken in planes parallel to the view plane. Each slice corresponds to a constant product of training set size and hidden layer size, or equivalently the number of connection updates per complete training epoch. Each line is tagged with this number, with the maximum value of 11.4 TCUP (11.4 x 10^12 updates) for the case of the 4000 hidden units by (117+54) weights per unit by 16.7 million training patterns. These slices confirm the central `dip' visible in the error rate surface, indicating that for a given amount of training computation, there is an optimal ratio of training frames per network weight in the range 10 to 40.
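The tags and the quoted optimum can be checked with the numbers given above:

```python
hidden = 4000
weights_per_unit = 9 * 13 + 54        # 117 inputs + 54 outputs = 171
weights = hidden * weights_per_unit   # 684,000 connections
frames = 16.7e6                       # training patterns, full 74 h set

cup_per_epoch = weights * frames      # connection updates per epoch
print(cup_per_epoch)                  # ~1.14e13, i.e. 11.4 TCUP

print(frames / weights)               # ~24 frames per weight, inside the
                                      # optimal 10-40 range (near 25:1)
```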
4. DISCUSSION
Our primary observation, that improvements are almost always obtained by increasing either or both of the amount of training data or the number of network parameters, is not surprising. It is encouraging, however, that these increases continue to be significant out to the practical limits of our current resources, at least when considering simultaneous increases of training set size and network size (i.e. the leading diagonal of Table 1). In fact, the 1998 Broadcast News evaluations provide a second nominal 100 hour training set, so we are now planning to train an 8000 hidden unit net on 150 hours of data (using 28 features per frame). Even using our custom multi-processor hardware, this training will require over three weeks of computation. Were we to use our 300 MHz Sun Ultra-30, we project that this training would take several months to complete.
Given our earlier experience with training networks for speech recognition, our test points for this study straddled a minimum in the patterns-per-parameter dimension. The size of the available training set and the practical limits on network size coincided at about this ratio, using PLP-12 as the input feature. As part of our Broadcast News effort, we are also employing a different set of 28 features based on the modulation spectrum, using a modified form of the approach described in [6]. While we have too few results to see if this ratio changes when evaluated over a different-sized vector of different features, it is clear from the experiments we have done that we do continue to derive improvements from increasing the network and training size.
Finally, although the error rate does continue to fall as we move to larger data sets and more parameters, examination of the leading diagonal of Table 1 shows that there does appear to be a diminishing of returns for this strategy. The error reduction for each doubling of both training set size and parameters goes from 9.3% for the first doubling down to 5.3% for the last. It may be that we are nearing the limits of potential improvements of this system without incorporating more structure. In fact, as previously noted, we are currently engaged in developing a joint system with our European partners in which we are merging estimators that often lead to different errors. Ultimately, this is likely to be the way in which we will incorporate an even larger number of parameters for improved recognition accuracy.
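These relative reductions follow directly from the leading diagonal of Table 1 (42.8%, 38.8%, 35.6%, 33.7% WER):

```python
diag = [42.8, 38.8, 35.6, 33.7]   # Table 1 diagonal: 500/9.25h ... 4000/74h
for a, b in zip(diag, diag[1:]):
    print(f"{100 * (a - b) / a:.1f}% relative reduction")
# 9.3%, 8.2%, 5.3% - from 9.3% for the first doubling to 5.3% for the last
```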
5. CONCLUSION
As stated in the title, it appears that over the range of parameters we investigated, size does matter, and the most obvious route to improving speech recognition, that of increasing the amount of training data and the number of classifier parameters, is still a viable course for the hybrid connectionist architecture. While our absolute system performance is not as good as some other more complex systems, it is notable how much can be achieved by this baseline. Routine refinements such as context-dependence, gender-dependence, feature adaptation (e.g. vocal-tract length normalization) and higher-order grammars can all be employed to improve performance. Also, simple model merging techniques using multiple hybrid HMM/ANN estimators form part of the overall Broadcast News evaluation effort we are conducting in collaboration with Cambridge University and our other partners.
Acknowledgments
We thank Eric Fosler and Adam Janin for their part in this work, and Jim Beck and David Johnson for the computational systems that we used. Additionally, we thank Steve Renals, Gethin Williams, Tony Robinson, and especially Gary Cook for a range of support that was essential for getting up to speed on the Broadcast News large vocabulary recognition task. This study was conducted with primary support from the European Community basic research grant SPRACH, under a subcontract from the Faculté Polytechnique de Mons in Belgium.
6. REFERENCES
[1] Bourlard, H., Morgan, N., Wooters, C., and Renals, S., "CDNN: A Context Dependent Neural Network for Continuous Speech Recognition," Proc. IEEE Int. Conf. Acoustics, Speech & Signal Processing, San Francisco, pp. II-349-352, 1992.
[2] Cohen, M., Franco, H., Morgan, N., Rumelhart, D., and Abrash, V., "Context-Dependent Multiple Distribution Phonetic Modeling," Advances in Neural Information Processing Systems V, pp. 649-657, 1993.
[3] Fritsch, J., "ACID/HNN: A Framework for Hierarchical Connectionist Acoustic Modeling," 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, eds. S. Furui, B.-H. Juang, and W. Chou, pp. 164-171, 1997.
[4] Hermansky, H., "Perceptual linear predictive (PLP) analysis of speech," Journal Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, 1990.
[5] Kershaw, D., Robinson, A., and Hochberg, M., "Context-Dependent Classes in a Hybrid Recurrent Network-HMM Speech Recognition System," Advances in Neural Information Processing Systems VIII, pp. 750-756, 1996.
[6] Kingsbury, B., Morgan, N., and Greenberg, S., "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25(1-2), August 1998.
[7] Morgan, N., and Bourlard, H., "Continuous Speech Recognition: An Introduction to the Hybrid HMM/Connectionist Approach," Signal Processing Magazine, pp. 25-42, May 1995.
[8] Renals, S., and Hochberg, M., "Efficient Search Using Posterior Phone Probability Estimates," Proc. IEEE Int. Conf. Acoustics, Speech & Signal Processing, Detroit, vol. 1, pp. 596-599, 1995.
[9] Steeneken, J.M. and Van Leeuwen, D.A., "Multi-lingual assessment of speaker independent large vocabulary speech-recognition systems: the SQALE project (speech recognition quality assessment for language engineering)," Proceedings of EUROSPEECH'95 (Madrid, Spain), 1995.
[10] Wawrzynek, J., Asanovic, K., Kingsbury, B., Beck, J., Johnson, D., and Morgan, N., "SPERT-II: A Vector Microprocessor System," IEEE Computer, vol. 29, no. 3, pp. 79-86, March 1996.
[11] Woodland, P., Hain, T., Johnson, S., Niesler, T., Tuerk, A., Whittaker, E., and Young, S., "The 1997 HTK Broadcast News Transcription System," Proc. of the Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia, pp. 41-48, 1998.