Speaker adaptation of context dependent deep neural networks
Citations
Deep Learning: Methods and Applications
Speech Recognition Using Deep Neural Networks: A Systematic Review
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding
A Review of Domain Adaptation without Target Labels
References
Learning representations by back-propagating errors
Learning internal representations by error propagation
Perceptual linear predictive (PLP) analysis of speech
Acoustic Modeling Using Deep Belief Networks
Maximum likelihood linear transformations for HMM-based speech recognition
Frequently Asked Questions (16)
Q2. How many Gaussian states were used in the real time GMM system?
The speaker-independent, real-time GMM system uses PLP features [20], semi-tied covariances [21], linear discriminant analysis to reduce the 9 consecutively stacked PLP features down to 39 dimensions, and boosted MMI discriminative training [22] of context-dependent states [23] clustered using decision trees [24] to 7969 states; the real-time GMM system contained a total of 340k Gaussians, while a larger 500k Gaussian system was also available.
Q3. how can momentum speed up the training process?
It has been found that using momentum [14] can speed up the training process by adding common contributions from previous updates to the gradient update as a second term:

Δw_t = −η ∇_w E(w_t) + α Δw_{t−1}    (4)

For example, with α set to 0.9, constant parts of the gradient are amplified by 1/(1−α), or 10, while parts that oscillate are smoothed out over time.
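The amplification effect described above can be sketched numerically. This is an illustrative example, not the paper's code; the function name and the hyperparameter values η = 0.1, α = 0.9 are assumptions for the demonstration.

```python
import numpy as np

def momentum_step(w, grad, velocity, eta=0.1, alpha=0.9):
    """One update of Delta_w_t = -eta * grad E(w_t) + alpha * Delta_w_{t-1}."""
    velocity = -eta * grad + alpha * velocity
    return w + velocity, velocity

# With a constant gradient, the step size grows toward eta / (1 - alpha),
# i.e. the constant component is amplified by 1 / (1 - 0.9) = 10.
w = np.zeros(2)
v = np.zeros(2)
g = np.array([1.0, 1.0])
for _ in range(100):
    w, v = momentum_step(w, g, v)
print(v)  # converges toward [-1.0, -1.0] = -(eta / (1 - alpha)) * g
```

An oscillating gradient (sign flipping each step) would instead largely cancel in the velocity term, which is the smoothing behavior the text describes.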
Q4. What is the power of a DNN in acoustic modeling?
The power of DNNs over conventional GMMs for acoustic modeling in large vocabulary continuous speech recognition has been demonstrated in recent literature, where the number of hidden layers is between 5 and 9, with thousands of hidden nodes and context dependent output states [9, 10].
Q5. What is the definition of a deep neural network?
Since many more layers are used (for example, 5 to 9) than were typically explored in the past, such a model has been described as a deep neural network, or DNN.
Q6. What is the reason for the use of deep neural networks in acoustic modeling?
The recent successful application of deep neural networks for acoustic modeling has been shown to be due to:
• deep networks of many layers,
• wide hidden layers of many nodes, and
• many context dependent states to model phonemes.
Q7. What is the simplest way to overfit the constrained transform?
In [18], the solution to overfitting the constrained transform for adapting the first layer in a network to a speaker can also be viewed as L2 regularization.
Q8. What is the main idea of the paper?
This paper will also examine various aspects of the training procedure, including optimization hyperparameters such as learning rate and momentum, supervised enrollment versus unsupervised training, how the amount of data affects gains, and stochastic minibatch versus batch training.
Q9. How is the DNN system able to run in realtime on a recent mobile device?
However, on a recent mobile device the smallest 100k Gaussian GMM system was able to run in real time, whereas the small 4x512 DNN was slightly slower on the same device.
Q10. How many frames did the DNNs examine?
The networks examined were small: 9 stacked frames yielding 234 inputs, one hidden layer of 1000 units, and 48 outputs, one for each context-independent phone.
Q11. How does the L2 prior regularization work?
In this work the authors show that L2 prior regularization is helpful in improving generalization when adapting neural networks to speaker specific data.
Q12. What is the way to estimate the acoustic model?
This can be manipulated into a transform that is applied efficiently to the speech features, with the model parameters unchanged:

N(o_t; μ̂_{m,s}, Σ̂_{m,s}) = |A_s| N(A_s o_t + b_s; μ_m, Σ_m)    (2)

Rather than using a single transformation per speaker, transforms can be estimated for similar Gaussians in the acoustic model by clustering them, e.g. using regression trees [6, 7].
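Applying the transform on the feature side, as in equation (2), is a simple per-frame affine map. The following is a minimal sketch, assuming a 39-dimensional front end as in the GMM system described earlier; the variable names and the randomly generated transform are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 39
# Hypothetical speaker-specific CMLLR transform: a matrix A_s near
# identity and a bias b_s. In practice these are estimated from the
# speaker's adaptation data, not sampled randomly.
A_s = np.eye(dim) + 0.01 * rng.standard_normal((dim, dim))
b_s = 0.1 * rng.standard_normal(dim)

def cmllr_transform(features, A, b):
    """Map each frame o_t (a row of `features`) to A @ o_t + b,
    leaving the model means and variances untouched."""
    return features @ A.T + b

frames = rng.standard_normal((100, dim))   # 100 frames of 39-dim features
adapted = cmllr_transform(frames, A_s, b_s)
print(adapted.shape)  # (100, 39)
```

Because only the features change, the same decoder and acoustic model can be reused across speakers, which is what makes the feature-space formulation efficient.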
Q13. What is the alternative technique for acoustic modeling?
An alternative technique, called constrained MLLR (CMLLR) [1], constrains the transformations of the model means and variances to share the same matrix.
Q14. What is the weight decay factor on the L2 penalty term?
Here the weight update in equation 4 becomes:

Δw_t = −η ∇_w E(w_t) + α Δw_{t−1} − β(w_{t−1} − w_0)    (6)

where β is the weight decay factor on the L2 penalty term, which decays the weights towards the original model weights.
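The pull of the penalty term toward the original weights w_0 can be sketched as follows. This is an illustrative example under assumed hyperparameters (η, α, β are not the paper's tuned values), not the authors' implementation.

```python
import numpy as np

def adapt_step(w, w0, grad, velocity, eta=0.1, alpha=0.9, beta=0.01):
    """One update with momentum plus an L2 prior penalty:
    Delta_w_t = -eta*grad + alpha*Delta_w_{t-1} - beta*(w_{t-1} - w0)."""
    velocity = -eta * grad + alpha * velocity - beta * (w - w0)
    return w + velocity, velocity

w0 = np.zeros(3)             # speaker-independent starting weights
w = w0.copy()
v = np.zeros(3)
for _ in range(50):
    grad = np.zeros(3)       # with zero gradient, the penalty dominates
    w, v = adapt_step(w, w0, grad, v)
print(np.allclose(w, w0))    # True: weights stay at the prior
```

With a nonzero gradient the update balances fitting the speaker's data against staying close to w_0, which is the regularization effect that improves generalization on small adaptation sets.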
Q15. How many speakers are adapted to the DNN?
In this section, results are obtained by adapting the network per speaker, with labels determined from the large 6x2176 DNN, i.e. with a WER of 10.2% on the 10 minutes of adaptation data.
Q16. How many speakers can be adapted to form an evaluation set?
This allows the construction of an anonymized data set of 80 speakers, each with about an hour of adaptation data, forming the Pers1a adapt set, and ten minutes of evaluation data, forming the Pers1a eval set.