
SIZE MATTERS:
AN EMPIRICAL STUDY OF NEURAL NETWORK TRAINING FOR LARGE
VOCABULARY CONTINUOUS SPEECH RECOGNITION
Dan Ellis† and Nelson Morgan†‡
† International Computer Science Institute, 1947 Center St, Berkeley, CA 94704
‡ University of California at Berkeley, EECS Department, Berkeley, CA 94720
Tel: (510) 643-9153, FAX: (510) 643-7684, Email: {dpwe, morgan}@icsi.berkeley.edu
ABSTRACT
We have trained and tested a number of large neural networks for the purpose of emission probability estimation in large vocabulary continuous speech recognition. In particular, the problem under test is the DARPA Broadcast News task. Our goal here was to determine the relationship between training time, word error rate, size of the training set, and size of the neural network. In all cases, the network architecture was quite simple, comprising a single large hidden layer with an input window consisting of feature vectors from 9 frames around the current time, with a single output for each of 54 phonetic categories. Thus far, simultaneous increases to the size of the training set and the neural network improve performance; in other words, more data helps, as does the training of more parameters. We continue to be surprised that such a simple system works as well as it does for complex tasks. Given a limitation in training time, however, there appears to be an optimal ratio of training patterns to parameters of around 25:1 in these circumstances. Additionally, doubling the training data and system size appears to provide diminishing returns of error rate reduction for the largest systems.
1. INTRODUCTION
For about 10 years, we and others have trained large neural networks to estimate posterior probabilities of context-independent phonetic classes for use in speech recognition systems based on Hidden Markov Models (HMMs) [7]. For small tasks, moderate amounts of training data, and when simple models were used, we consistently found that we could provide competitive and often improved recognition performance in comparison with systems that used more standard architectures and training methods (e.g., Gaussian mixtures trained with a Maximum Likelihood criterion) [9]. However, for large tasks for which a great deal of training data was available, we have had difficulty achieving the performance of likelihood-based HMM systems. Some of this difference is undoubtedly due to scientifically uninteresting factors, such as the resources required to correct faulty transcriptions. We have also wondered whether some of the observed difficulty might be a straightforward tradeoff between computation and performance. As we have usually designed it, hybrid HMM/ANN training requires the update of all network parameters in response to every input frame. On the other hand, direct training of state-conditional feature densities in HMM systems only requires the update of parameters corresponding to the state or states associated with each feature vector. Furthermore, at least in principle, likelihood-based HMM systems can always benefit from more acoustic data by improving the estimates for ever-finer categories (e.g., from triphones to quinphones), since with more data these rarer categories will become more populated.
Of course, there are analogous procedures available for connectionist systems; for instance, the CDNN described in [1], with variants explored in [2] and [5], can yield density estimates for context-dependent classes as a product of network outputs. The ACID/HNN system of [3] goes even further, resulting in an extensive set of polyphone probabilities that can be used for a fully context-dependent system in the spirit of the large HMMs. In experiments on Switchboard, for instance, this latter system appears to be quite comparable in performance to the best likelihood-based HMMs. However, to achieve this result the simplicity of the large single network is sacrificed, leaving us with the question: can we extract greater recognition accuracy from an increase in training data without complicating the structure?
In previous work, we typically did not incorporate large amounts of training data (e.g., much more than 10 hours of speech). We also did not have sufficient computational resources to explore the simplest approach: namely, to keep the simple architecture constant and merely increase the size of the network for training on larger training sets. This year, we developed a baseline recognition system for the Broadcast News task, for which we had 74 hours of acoustic data.¹ While using all of this data was preferable, systems trained on subsets were good enough to generate the comparative results for this experiment. For the parts of the experiment in which we used larger networks and larger fractions of the data, the amount of computation would have previously been prohibitive. However, we recently completed the development of a multiprocessor machine incorporating VLSI developed in our group, and this permitted trainings that required on the order of 10^15 arithmetic operations for the larger cases.
¹ Components of a variant of this system are currently being developed for the 1998 DARPA Hub 4 evaluation, in collaboration with the connectionist groups at Cambridge University, Sheffield University, and Faculté Polytechnique de Mons.

Given the availability of acoustic materials and computational resources, we decided to push our simple system to its limit, and also to test it for a range of training set and neural net sizes.
2. METHODS
The basic procedure was to train neural networks with a range of sizes on acoustic training data from different amounts of large vocabulary continuous speech. Each network was then used in a hybrid HMM/ANN recognizer, and was evaluated with word error rate on a large vocabulary task using a 65K word lexicon.
2.1. Corpus
For these experiments, we used the Broadcast News corpus. This is a collection of speech from American radio and television news broadcasts, such as the National Public Radio program All Things Considered or Nightline, televised in the U.S. on the ABC network. These shows comprise a whole range of speaking conditions, from planned speech in studio environments to spontaneous speech in noisy field conditions over telephone lines. The (possibly multi-sentence) segments are divided into 7 different focus conditions representing different acoustic/speaking environments; the majority conditions are planned studio speech and spontaneous studio speech.
2.2. System Architecture
As in many of our previous papers [7], the underlying statistical model was an extremely simple HMM. For each of 54 phonetic categories, we had an HMM consisting of a strictly left-to-right model with multiple states tied to a single distribution; multiple repeated states were used to establish a minimum duration for each phone. Transition probabilities were set to .5. The emission probabilities of the HMM were scaled likelihoods estimated by dividing the network outputs by the priors for each class. The network was a Multi-Layer Perceptron (MLP) with a single sigmoidal hidden layer, whose size for these experiments was varied from 500 to 4000 by factors of two. For each choice of hidden layer size, a training was done using 1/8, 1/4, 1/2, and all of the 74 hours of acoustic training material that was available to us for this study. Note that the largest training incorporated about 700,000 parameters and 16 million acoustic frames. 54 outputs associated with the phonetic classes were generated by softmax functions computed from the weighted hidden unit outputs. For the main set of experiments reported here, the input consisted of feature vectors from the frame associated with the target label, as well as from 4 vectors into the past and 4 into the future.
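To make the estimator concrete, the following is a minimal NumPy sketch of the forward pass and prior division described above. It is illustrative only: the original system ran on custom hardware, and all function and array names here are ours, not the authors' code.

```python
import numpy as np

def mlp_scaled_likelihoods(frames, W1, b1, W2, b2, priors, t):
    """Posterior estimation and prior scaling for one time step.

    frames : (T, 13) array of PLP-12+gain feature vectors
    W1     : (9*13, H) input-to-hidden weights; b1: (H,) biases
    W2     : (H, 54) hidden-to-output weights;  b2: (54,) biases
    priors : (54,) relative frequencies of the phone classes in training
    t      : index of the current (target) frame
    """
    # 9-frame input window: 4 frames of past context, the current
    # frame, and 4 frames of future context, concatenated.
    x = frames[t - 4 : t + 5].reshape(-1)        # (117,)

    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))     # sigmoidal hidden layer

    z = h @ W2 + b2                              # 54 output activations
    z -= z.max()                                 # numerical stability
    posteriors = np.exp(z) / np.exp(z).sum()     # softmax: P(class | x)

    # Dividing posteriors by class priors gives likelihoods scaled by a
    # class-independent factor, usable directly as HMM emission scores.
    return posteriors / priors
```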
To generate the features used in these experiments, the speech was filtered and downsampled to 8 kHz. PLP-12 features [4], including the PLP gain term to give a 13-element feature vector, were computed every 16 ms, and normalized according to the mean and variance of each segment in the training data. These segments were provided to us by our colleagues in the Cambridge University connectionist group, and roughly corresponded to an utterance or a group of utterances by a single speaker. In practice a good segmentation system (such as the one developed by the HTK group at Cambridge) does not degrade performance over that achieved by manually segmenting chunks associated with a single signal source [11]. We used the Noway large vocabulary decoder [8], and co-developed a large vocabulary pronunciation lexicon with our partners at Cambridge and Sheffield. The backoff trigram grammar from the Cambridge BN 97 system was used, incorporating 7M bigrams and 24M trigrams. It had been trained using both text sources (broadcast news and newswire texts) and broadcast news acoustic transcripts.
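The per-segment normalization amounts to standardizing each feature dimension over the frames of a segment. A minimal sketch, assuming a (frames x 13) array per segment (not the original feature pipeline):

```python
import numpy as np

def normalize_segment(features, eps=1e-8):
    """Mean/variance-normalize one segment of feature vectors.

    features : (T, 13) array, one PLP-12+gain vector per 16 ms frame.
    Returns the array with each dimension standardized over the
    segment, reducing channel- and speaker-level differences.
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / (sigma + eps)
```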
2.3. Training Hardware and Software
Connectionist training of large networks is quite computationally demanding, as noted above. To aid in this task, we developed a vector microprocessor described in [10], and vectorized software that incorporated efficient assembler routines for this processor while providing a C++ structure that permitted a moderate range of experimentation for our trainings. The board-level system (called the Spert-II) also includes 8 MB of fast (zero wait state) SRAM so that memory accesses are not a bottleneck for the neural computation of large networks. Current high end PCs and moderate level workstations are now fast enough to compete with this system (when highly optimized software is used), but we have also developed 2-processor and 4-processor systems which are significantly faster for sufficiently large networks. Using a commercial bus expansion chassis, these permit the connection of four Spert-IIs to a single Sparc host. Although the bus bandwidth is necessarily shared between the four processors, accumulating error gradients over 16 to 32 patterns permits a near linear speedup for the larger networks trained in this study. The Spert-II boards can also be used independently within the multiprocessor system for those problems with smaller networks.
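As a rough illustration of the accumulation strategy (not the Spert-II software; the `workers` objects and their `gradient` method are hypothetical stand-ins for the per-board computation):

```python
import numpy as np

def accumulated_update(workers, params, lr, batch=32):
    """Illustrative bulk-synchronous update over several boards.

    Each worker computes gradients on its share of a 16-32 pattern
    mini-batch; summing the gradients before a single weight update
    amortizes the shared-bus traffic over many patterns, which is
    what makes the near-linear speedup possible.
    """
    grad = np.zeros_like(params)
    for w in workers:                      # one entry per board (hypothetical API)
        grad += w.gradient(params, batch // len(workers))
    return params - lr * grad
```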
3. RESULTS
                        Training set size, hours
  # Hidden units     9.25     18.5      37       74
       500          42.8%    41.0%    40.2%    39.2%
      1000          41.8%    38.8%    36.5%    36.9%
      2000          40.4%    37.2%    35.6%    34.4%
      4000          40.3%    37.4%    33.9%    33.7%

Table 1: Word Error Rate percentages for the overall hybrid recognition system incorporating classifier networks with different-sized hidden layers, trained on varying amounts of acoustic data. In each case, the input consisted of 9 vectors of length 13, and the output layer was 54 units. The number of parameters for each layered network was ((9x13)+54)x(# Hidden units) weights, plus (# Hidden units + 54) biases.
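As a check, the caption's formula reproduces the "about 700,000 parameters" quoted in Section 2.2 for the largest network:

```python
def n_params(hidden_units):
    """Parameter count from the Table 1 caption's formula."""
    n_in, n_out = 9 * 13, 54                 # 117 input units, 54 outputs
    weights = (n_in + n_out) * hidden_units  # ((9x13)+54) x hidden units
    biases = hidden_units + n_out            # hidden units + 54
    return weights + biases

print(n_params(4000))  # 688054, i.e. about 700,000 parameters
```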
Table 1 details the effect of varying the size of the training set and the number of hidden layer units on the error rate of the overall hybrid system, all other parameters being held constant. These results are plotted as a 3-dimensional surface, i.e. error rate as a function of training set and hidden layer size, in Figure 1. The most obvious trend is that increasing either parameter will improve the overall system performance in virtually every condition.

[Figure: surface plot titled "WER for PLP12N nets vs. net size & training data"; horizontal axes: hidden layer size (500-4000 units) and training set (9.25-74 hours); vertical axis: WER% (32-44).]
Figure 1: Surface plot of system word error rate as a function of the amount of training data and the hidden layer size.
There are some caveats to be borne in mind when considering the word error rate figures. These results were obtained on a separate test set of 32 minutes containing 5938 words; by our reckoning, to be significant at the 5% level, error rates must differ by at least 1.5%.² There is additional variability introduced by the randomization of pattern presentation used in the network training.
Training followed a standard `simulated annealing' process with repeated passes or `epochs' over the entire training set; after initial stabilization, the learning rate was halved on each successive iteration until the frame classification accuracy on a separate cross-validation set improved by less than 0.5%. The interaction between this criterion and other variables meant that the different networks trained for different numbers of epochs, between 7 and 10.
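In outline, the schedule behaves like the sketch below. This is a paraphrase of the description above, not the original training code; `train_epoch` and `cv_accuracy` are hypothetical helpers, and the exact stabilization test is our assumption.

```python
def anneal_schedule(net, lr, min_gain=0.005):
    """Sketch of the annealing schedule: train at a fixed rate until
    cross-validation gains flatten, then halve the rate every epoch,
    stopping once an epoch improves CV frame accuracy (as a fraction)
    by less than 0.5% absolute.
    """
    prev = cv_accuracy(net)           # hypothetical helper
    annealing = False
    while True:
        train_epoch(net, lr)          # hypothetical helper: one full pass
        gain = cv_accuracy(net) - prev
        prev += gain
        if not annealing:
            if gain < min_gain:       # initial stabilization is over
                annealing = True
                lr *= 0.5
        else:
            if gain < min_gain:       # improvement under 0.5%: done
                break
            lr *= 0.5                 # halve on each successive epoch
    return net
```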
The acoustic data for these experiments was limited to 4 kHz bandwidth before feature extraction.³ While this processing succeeded in its intention of improving the relative performance on the telephone-channel speech which forms some 15% of the corpus, it appears to increase the error for the remaining full-bandwidth data. Finally, the decoder pruning for these tests was chosen to be fairly aggressive, giving a typical recognition speed of about 2x real-time; slower but more exhaustive decoding would yield a relative error rate improvement of 5-10%.

² This test set, used internally within our collaboration with Cambridge and Sheffield, is a subset of the BN 97 evaluation set. Previous experience at Cambridge suggests that this subset is slightly "harder" than the larger set, typically by about a factor of 10% in relative error.

³ This bandlimiting was done for its complementarity with the Cambridge system, with whom we would be merging models; in practice the overall loss in performance due to this bandwidth reduction appeared to be minor.
[Figure: plot titled "WER vs. frames/weight"; horizontal axis: frames/weight (log scale, roughly 1-500); vertical axis: WER% (32-44); one curve per fixed number of connection updates per epoch, tagged 178 GCUP, 356 GCUP, 712 GCUP, 1.42 TCUP, 2.8 TCUP, 5.7 TCUP, and 11.4 TCUP.]
Figure 2: Slices through the surface of the previous figure, showing the variation of error rate with the ratio of training patterns to network weights for a fixed number of connection updates per training epoch.
Figure 2 shows a succession of slices through the error-rate surface, taken in planes parallel to the view plane. Each slice corresponds to a constant product of training set size and hidden layer size, or equivalently the number of connection updates per complete training epoch. Each line is tagged with this number, with the maximum value of 11.4 TCUP (11.4 x 10^12 updates) for the case of the 4000 hidden units by (117+54) weights per unit by 16.7 million training patterns. These slices confirm the central `dip' visible in the error rate surface, indicating that for a given amount of training computation, there is an optimal ratio of training frames per network weight in the range 10 to 40.
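The tags and the quoted optimum can be checked with the numbers given above:

```python
hidden = 4000
weights_per_unit = 9 * 13 + 54        # 117 inputs + 54 outputs = 171
weights = hidden * weights_per_unit   # 684,000 connections
frames = 16.7e6                       # training patterns, full 74 h set

cup_per_epoch = weights * frames      # connection updates per epoch
print(cup_per_epoch)                  # ~1.14e13, i.e. 11.4 TCUP

print(frames / weights)               # ~24 frames per weight, inside the
                                      # optimal 10-40 range (near 25:1)
```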
4. DISCUSSION
Our primary observation, that improvements are almost always obtained by increasing either or both of the amount of training data or the number of network parameters, is not surprising. It is encouraging, however, that these increases continue to be significant out to the practical limits of our current resources, at least when considering simultaneous increases of training set size and network size (i.e. the leading diagonal of Table 1). In fact, the 1998 Broadcast News evaluations provide a second nominal 100 hour training set, so we are now planning to train an 8000 hidden unit net on 150 hours of data (using 28 features per frame). Even using our custom multi-processor hardware, this training will require over three weeks of computation. Were we to use our 300 MHz Sun Ultra-30, we project that this training would take several months to complete.
Given our earlier experience with training networks for speech recognition, our test points for this study straddled a minimum in the patterns-per-parameter dimension. The size of the available training set and the practical limits on network size coincided at about this ratio, using PLP-12 as the input feature. As part of our Broadcast News effort, we are also employing a different set of 28 features based on the modulation spectrum, using a modified form of the approach described in [6]. While we have too few results to see if this ratio changes when evaluated over a different-sized vector of different features, it is clear from the experiments we have done that we do continue to derive improvements from increasing the network and training size.
Finally, although the error rate does continue to fall as we move to larger data sets and more parameters, examination of the leading diagonal of Table 1 shows that there does appear to be a diminishing of returns for this strategy. The error reduction for each doubling of both training set size and parameters goes from 9.3% for the first doubling down to 5.3% for the last. It may be that we are nearing the limits of potential improvements of this system without incorporating more structure. In fact, as previously noted, we are currently engaged in developing a joint system with our European partners in which we are merging estimators that often lead to different errors. Ultimately, this is likely to be the way in which we will incorporate an even larger number of parameters for improved recognition accuracy.
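These relative reductions follow directly from the leading diagonal of Table 1 (42.8%, 38.8%, 35.6%, 33.7% WER):

```python
diag = [42.8, 38.8, 35.6, 33.7]   # Table 1 diagonal: 500/9.25h ... 4000/74h
for a, b in zip(diag, diag[1:]):
    print(f"{100 * (a - b) / a:.1f}% relative reduction")
# 9.3%, 8.2%, 5.3% - from 9.3% for the first doubling to 5.3% for the last
```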
5. CONCLUSION
As stated in the title, it appears that over the range of parameters we investigated, size does matter, and the most obvious route to improving speech recognition, that of increasing the amount of training data and the number of classifier parameters, is still a viable course for the hybrid connectionist architecture. While our absolute system performance is not as good as some other more complex systems, it is notable how much can be achieved by this baseline. Routine refinements such as context-dependence, gender-dependence, feature adaptation (e.g. vocal-tract length normalization) and higher-order grammars can all be employed to improve performance. Also, simple model merging techniques using multiple hybrid HMM/ANN estimators form part of the overall Broadcast News evaluation effort we are conducting in collaboration with Cambridge University and our other partners.
Acknowledgments
We thank Eric Fosler and Adam Janin for their part in this work, and Jim Beck and David Johnson for the computational systems that we used. Additionally, we thank Steve Renals, Gethin Williams, Tony Robinson, and especially Gary Cook for a range of support that was essential for getting up to speed on the Broadcast News large vocabulary recognition task. This study was conducted with primary support from the European Community basic research grant SPRACH, under a subcontract from the Faculté Polytechnique de Mons in Belgium.
6. REFERENCES
[1] Bourlard, H., Morgan, N., Wooters, C., and Renals, S., "CDNN: A Context Dependent Neural Network for Continuous Speech Recognition," Proc. IEEE Int. Conf. Acoustics, Speech & Signal Processing, San Francisco, pp. II-349-352, 1992.
[2] Cohen, M., Franco, H., Morgan, N., Rumelhart, D., and Abrash, V., "Context-Dependent Multiple Distribution Phonetic Modeling," Advances in Neural Information Processing Systems V, pp. 649-657, 1993.
[3] Fritsch, J., "ACID/HNN: A Framework for Hierarchical Connectionist Acoustic Modeling," 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, eds. S. Furui, B.-H. Juang, and W. Chou, pp. 164-171, 1997.
[4] Hermansky, H., "Perceptual linear predictive (PLP) analysis of speech," Journal Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, 1990.
[5] Kershaw, D., Robinson, A., and Hochberg, M., "Context-Dependent Classes in a Hybrid Recurrent Network-HMM Speech Recognition System," Advances in Neural Information Processing Systems VIII, pp. 750-756, 1996.
[6] Kingsbury, B., Morgan, N., and Greenberg, S., "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25(1-2), August 1998.
[7] Morgan, N., and Bourlard, H., "Continuous Speech Recognition: An Introduction to the Hybrid HMM/Connectionist Approach," Signal Processing Magazine, pp. 25-42, May 1995.
[8] Renals, S., and Hochberg, M., "Efficient Search Using Posterior Phone Probability Estimates," Proc. IEEE Int. Conf. Acoustics, Speech & Signal Processing, Detroit, vol. 1, pp. 596-599, 1995.
[9] Steeneken, J.M. and Van Leeuwen, D.A., "Multi-lingual assessment of speaker independent large vocabulary speech-recognition systems: the SQALE project (speech recognition quality assessment for language engineering)," Proceedings of EUROSPEECH'95 (Madrid, Spain), 1995.
[10] Wawrzynek, J., Asanovic, K., Kingsbury, B., Beck, J., Johnson, D., and Morgan, N., "SPERT-II: A Vector Microprocessor System," IEEE Computer, vol. 29, no. 3, pp. 79-86, March 1996.
[11] Woodland, P., Hain, T., Johnson, S., Niesler, T., Tuerk, A., Whittaker, E., and Young, S., "The 1997 HTK Broadcast News Transcription System," Proc. of the Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia, pp. 41-48, 1998.