Size matters: an empirical study of neural network training for large vocabulary continuous speech recognition
References
Perceptual linear predictive (PLP) analysis of speech
Robust speech recognition using the modulation spectrogram
Continuous speech recognition
Spert-II: a vector microprocessor system
CDNN: a context dependent neural network for continuous speech recognition
Frequently Asked Questions (11)
Q2. What was the basic procedure for this experiment?
The basic procedure was to train neural networks with a range of sizes on acoustic training data from different amounts of large vocabulary continuous speech.
Q3. How much error reduction for each doubling of both training set size and parameters?
The error reduction for each doubling of both training set size and parameters goes from 9.3% for the first doubling down to 5.3% for the last.
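The compounding effect of successive doublings can be sketched with a short calculation. Only the endpoint relative reductions (9.3% and 5.3%) come from the study; the starting error rate and the intermediate reduction are illustrative assumptions:

```python
# Compound word-error-rate improvement over successive doublings of
# both training data and model size. Only the endpoint reductions
# (9.3% and 5.3%) are from the study; the starting WER of 30% and the
# intermediate value are assumptions for illustration.

def compounded_error(initial_wer, reductions):
    """Apply a sequence of relative error reductions to a starting WER."""
    wer = initial_wer
    for r in reductions:
        wer *= (1.0 - r)
    return wer

# Three doublings (e.g. 1/8 -> 1/4 -> 1/2 -> full training set).
reductions = [0.093, 0.073, 0.053]
print(round(compounded_error(30.0, reductions), 2))  # about 23.89
```

The point of the sketch is that even diminishing per-doubling gains compound into a substantial absolute improvement over several doublings.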
Q4. For how many years have the authors trained large neural networks to estimate posterior probabilities of context-independent phonetic classes?
For about 10 years, the authors and others have trained large neural networks to estimate posterior probabilities of context-independent phonetic classes for use in speech recognition systems based on Hidden Markov Models (HMMs) [7].
Q5. How many hours of training was used for this study?
For each choice of hidden layer size, trainings were done using 1/8, 1/4, 1/2, and all of the 74 hours of acoustic training material that was available for this study.
Q6. How many arithmetic operations were required for the larger cases?
The authors recently completed the development of a multiprocessor machine incorporating VLSI developed in their group, which permitted trainings that required on the order of 10^15 arithmetic operations for the larger cases.
Q7. What is the `dip' in the error rate surface?
These slices confirm the central `dip' visible in the error rate surface, indicating that, for a given amount of training computation, there is an optimal ratio of training frames per network weight in the range 10 to 40.
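The frames-per-weight ratio for a given configuration is easy to check. A minimal sketch, assuming the common 10 ms frame step (100 frames per second); the example network size below is illustrative, not a figure from the study:

```python
# Ratio of training frames to network weights, the quantity whose
# optimum the study places between 10 and 40. Assumes the common
# 10 ms frame step (100 frames/second); the one-million-weight
# network is an illustrative value, not a figure from the paper.

FRAMES_PER_SECOND = 100  # assumed 10 ms hop

def frames_per_weight(hours_of_audio, num_weights):
    """Training frames divided by network weights."""
    frames = hours_of_audio * 3600 * FRAMES_PER_SECOND
    return frames / num_weights

# e.g. the full 74-hour set with a one-million-weight network:
print(frames_per_weight(74, 1_000_000))  # 26.64, inside the 10-40 range
```

Under these assumptions, 74 hours paired with about a million weights lands inside the reported optimal range.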
Q8. What is the effect of the decoding on the telephone-channel speech?
While this processing succeeded in its intention of improving the relative performance on the telephone-channel speech, which forms some 15% of the corpus, it appears to increase the error for the remaining full-bandwidth data.
Q9. What are the common types of speech?
The (possibly multi-sentence) segments are divided into 7 different focus conditions representing different acoustic/speaking environments; the majority conditions are planned studio speech and spontaneous studio speech.
Q10. What is the reason for the difficulty?
Some of this difference is undoubtedly due to scientifically uninteresting factors, such as the resources required to correct faulty transcription.
Q11. By how much must error rates differ to be significant?
These results were obtained on a separate test set of 32 minutes containing 5938 words; by their reckoning, to be significant at the 5% level, error rates must differ by at least 1.5%.