SPEAKER ADAPTATION OF CONTEXT DEPENDENT DEEP NEURAL NETWORKS
Hank Liao
Google Inc.
New York, NY, USA.
ABSTRACT
There has been little work on examining how deep neural networks may be adapted to speakers for improved speech recognition accuracy. Past work has examined using a discriminatively trained affine transformation of the input features applied at a frame level or the re-training of the entire shallow network for a specific speaker. This work explores how deep neural networks may be adapted to speakers by re-training the input layer, the output layer or the entire network. We look at how L2 regularization using weight decay to the speaker independent model improves generalization. Other training factors are examined including the role momentum plays and stochastic mini-batch versus batch training. While improvements are significant for smaller networks, the largest show little gain from adaptation on a large vocabulary mobile speech recognition task.
Index Terms— Large vocabulary continuous speech recognition, multilayer perceptrons, deep neural networks, speaker adaptation
1. INTRODUCTION
In automatic speech recognition systems it is common to adapt a well-trained, general acoustic model to new users or environmental conditions. There are a variety of common techniques for Gaussian mixture model-based (GMM) acoustic models of speech and much research in this area [1]. Recently, the multilayer perceptron (MLP) has shown excellent results for modeling speech acoustics [2]. Since many more layers, for example 5 to 9 layers, are used than was typically explored in the past, this has been described as a deep neural network or DNN. With the recent trend towards adopting this acoustic model, it is worth investigating if and how DNNs can be adapted for new users or environments. Whether adaptation is even necessary is debatable since the larger networks have been shown to be invariant to some speaker effects, although they may still gain from some feature space transformations [3]. This paper examines how specific layers in a DNN acoustic model can be adapted directly for specific speakers, and how the size of the network and regularization during training affect supervised speaker enrollment and unsupervised speaker adaptation strategies.
The paper is organized as follows. First, commonly used speaker adaptation techniques for GMM-based acoustic models are discussed. Second, state-of-the-art DNN acoustic modeling is outlined along with a review of techniques for adapting neural network models. Experiments contrasting these techniques are reported next. Finally, conclusions based on the experimental results are summarized.
2. GMM SYSTEM ADAPTATION
Many state-of-the-art speech recognition systems use a hidden Markov model of speech with GMMs modeling the output context-dependent state distributions. To improve performance, they may use a variety of adaptation techniques that are described briefly here. A Gaussian in the speaker independent acoustic model, containing all states, can be indexed by m, with mean and variance parameters denoted by \mu_m and \Sigma_m. The acoustic model can be estimated directly from the speaker data, but using the well-trained speaker independent model as a prior. This results in a maximum a posteriori update of the model mean [4], also known as MAP estimation:

    \hat{\mu}_m = \frac{\tau \mu_m + \sum_t^T \gamma_{m,t} o_t}{\tau + T}

where T is the total number of adaptation frames, \tau controls the weight of the prior data, and \gamma_{m,t} is the posterior probability of Gaussian component m at time t. This approach deals with data sparsity by using the prior model information.
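As a concrete illustration of the update above, a minimal NumPy sketch follows; the function and argument names are illustrative, not from the paper.

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, gammas, tau=1.0):
    """MAP update of one Gaussian mean, following the equation above.

    prior_mean : (D,)   speaker independent mean mu_m
    frames     : (T, D) adaptation observations o_t
    gammas     : (T,)   posteriors gamma_{m,t} of component m per frame
    tau        : prior weight controlling how far the mean can move
    """
    T = frames.shape[0]
    weighted_sum = (gammas[:, None] * frames).sum(axis=0)  # sum_t gamma_{m,t} o_t
    return (tau * prior_mean + weighted_sum) / (tau + T)

# Toy usage: with little adaptation data the estimate stays close to the prior.
rng = np.random.default_rng(0)
mu = np.zeros(3)
obs = rng.normal(1.0, 0.1, size=(5, 3))
post = np.ones(5)
print(map_adapt_mean(mu, obs, post, tau=1.0))
```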
An alternative is to tie the model transformation across the many acoustic model parameters, which leads to the formulation for maximum likelihood linear regression, or MLLR [5]. A shared affine transformation is applied to the model means

    \mathcal{N}(o_t; \hat{\mu}_{m,s}, \Sigma_m) = \mathcal{N}(o_t; A_s \mu_m + b_s, \Sigma_m)    (1)

to maximize the likelihood of observed data from the speaker given the model parameters, where o_t is a frame of speech from the speaker, and A_s and b_s are the matrix and bias of the affine transform for speaker s. This requires an update of every model mean in the acoustic model every time the speaker changes.
An alternative technique is called constrained MLLR (CMLLR) [1], which constrains the transformation of the model means and variances to share the same matrix. This can be manipulated into being a transform that is applied efficiently to the speech features, with the model parameters unchanged:

    \mathcal{N}(o_t; \hat{\mu}_{m,s}, \hat{\Sigma}_{m,s}) = |A_s| \mathcal{N}(A_s o_t + b_s; \mu_m, \Sigma_m)    (2)

Rather than using a single transformation per speaker, transforms can be estimated for similar Gaussians in the acoustic model by clustering them, e.g. using regression trees [6, 7]. This improves the power of MLLR and CMLLR transformations.
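To make the contrast concrete, the hedged NumPy sketch below shows how the transforms of equations (1) and (2) would be applied once A_s and b_s have been estimated; the estimation itself (normally an EM procedure over the speaker's data) is omitted, and placeholder values stand in for the transform.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 39                                              # feature dimension
A_s = np.eye(D) + 0.01 * rng.normal(size=(D, D))    # placeholder speaker transform
b_s = 0.1 * rng.normal(size=D)

# MLLR (eq. 1): transform every Gaussian mean in the model once per speaker.
means = rng.normal(size=(1000, D))                  # toy model means mu_m
mllr_means = means @ A_s.T + b_s                    # hat{mu}_{m,s} = A_s mu_m + b_s

# CMLLR (eq. 2): leave the model alone and transform each feature frame instead,
# adding log|A_s| to the log-likelihood as the Jacobian term.
frames = rng.normal(size=(200, D))                  # toy observations o_t
cmllr_frames = frames @ A_s.T + b_s
log_det_term = np.linalg.slogdet(A_s)[1]
```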
3. DEEP NEURAL NETWORK ACOUSTIC MODELS
The use of neural networks for acoustic modeling in speech recognition is not novel. The multilayer perceptron neural network was used for speech recognition in [8]. However, the computational power available over a decade ago limited their effectiveness. The recent successful application of deep neural networks for acoustic modeling has been shown to be due to:
- deep networks of many layers,
- wide hidden layers of many nodes, and
- many context dependent states to model phonemes.
Graphics processing units have also enabled these deep networks to be trained in reasonable amounts of time. The power of DNNs over conventional GMMs for acoustic modeling in large vocabulary continuous speech recognition has been demonstrated in recent literature, where the number of hidden layers is between 5 and 9, with thousands of hidden nodes and context dependent output states [9, 10]. The DNN is also adept at de-correlating frames, allowing a larger context window with many consecutively "stacked" frames of log filterbank features compared to mel frequency cepstral coefficients (MFCC) or perceptual linear prediction (PLP) features [11] (on the order of 26 frames compared to 11 respectively). While there is some debate whether pre-training [12], or other forms of neural network initialization [3], are needed, all continue to use back-propagation to fine-tune the DNN [13], that is, using this weight update at time t
    \Delta w_t = -\eta \nabla_w E(w_t)    (3)

where \nabla_w represents the gradient operator with respect to the weight vector w, E(w) the error function and \eta the learning rate. It has been found that using momentum [14] can speed up the training process by adding common contributions from previous updates to the gradient update in a second term

    \Delta w_t = -\eta \nabla_w E(w_t) + \alpha \Delta w_{t-1}    (4)

For example, with \alpha set to 0.9, constant parts of the gradient are amplified by 1/(1-\alpha), or 10, while parts that oscillate are smoothed out over time.
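This amplification can be checked numerically; the short sketch below (an illustration, not code from the paper) iterates the update of equation (4) with a constant gradient component.

```python
import numpy as np

alpha, eta = 0.9, 0.02      # momentum and learning rate as used later in the paper
grad = 1.0                  # a constant gradient component
delta = 0.0
for t in range(100):
    delta = -eta * grad + alpha * delta   # eq. (4)
print(delta / (-eta * grad))              # -> ~10 = 1 / (1 - alpha)
```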
To apply the neural network in acoustic modeling, the CD state emission likelihood is computed from the CD state posterior generated by the DNN as follows, using Bayes' rule

    p(o|s; \theta) \propto \frac{P(s|o; \theta)}{P(s; \theta)}    (5)

where o denotes the features, s a CD state, and P(s) the state prior. The approximation would be an equality if the right hand side were multiplied by p(o).
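In practice decoding uses the DNN log-posteriors with the log state priors subtracted; a minimal sketch, assuming log-softmax outputs and priors estimated from the training alignments:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_state_priors):
    """Pseudo log-likelihoods log p(o|s) ~ log P(s|o) - log P(s), per eq. (5).

    log_posteriors   : (T, S) log-softmax outputs of the DNN per frame
    log_state_priors : (S,)   log of CD state priors from the training alignments
    """
    # p(o) is dropped: it is constant per frame and does not affect decoding.
    return log_posteriors - log_state_priors
```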
The first DNN deployed in production by Google [15] has the topology shown in figure 1. A comparison of the number of parameters is shown against a GMM based system that had approximately the same real time speed.

Fig. 1. Mobile speech recognition deep neural network.
3.1. Adaptation of Neural Networks
Previous work in [16] looked at speaker adaptation of shallow neural networks with context independent units. The networks examined were small: 9 stacked frames yielding 234 inputs, one hidden layer of 1000 units, and 48 outputs, one for each context independent phone. Adaptation was performed by either estimating a normalization affine transform applied to each frame trained via back-propagation, re-training the entire network, or both combined. Significant gains were achieved using these techniques. Later this was referred to as feature-space discriminative linear regression and applied to a large vocabulary task with a DNN [3]; the improvement on a 45M parameter network (16.9M non-zero) was less at 4% relative.
This paper is motivated by this previous work to examine how re-training only portions of the network, and the size of the network, affect speaker adaptation performance. This paper will also examine how regularization may be used to improve generalization. This can be done by adding half of a squared penalty term to the error function to minimize the difference between the updated weights and the unadapted network weights, which shall be referred to as L2 prior regularization. Here the weight update in equation 4 then becomes

    \Delta w_t = -\eta \nabla_w E(w_t) + \alpha \Delta w_{t-1} - \beta (w_{t-1} - w_0)    (6)

where \beta is the weight decay factor on the L2 penalty term which decays the weights towards the original model weights. The larger the penalty term, the more difficult it is for the updated weights to deviate from the original model weights w_0. This is similar to what is described as MAP adaptation of a maximum entropy model in [17]. In [18], the solution to overfitting the constrained transform for adapting the first layer in a network to a speaker can also be viewed as L2 regularization.
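A minimal NumPy sketch of this regularized update for a single weight array is given below; the names are illustrative rather than the paper's implementation, and the gradient is assumed to come from ordinary back-propagation.

```python
import numpy as np

def adapt_step(w, w0, grad, delta_prev, eta=0.02, alpha=0.9, beta=0.01):
    """One weight update following eq. (6).

    w          : current weights                w_{t-1}
    w0         : speaker independent weights    w_0 (the prior)
    grad       : gradient of the error          nabla_w E(w)
    delta_prev : previous update                Delta w_{t-1}
    """
    # Gradient step + momentum + decay of the weights towards the SI model.
    delta = -eta * grad + alpha * delta_prev - beta * (w - w0)
    return w + delta, delta
```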
This paper will also examine various aspects of the training procedure including the optimization hyperparameters such as learning rate and momentum, supervised enrollment versus unsupervised training, how the amount of data affects gains, and stochastic mini-batch versus batch training.
4. EXPERIMENTAL RESULTS
The experiments are conducted on proprietary, anonymized mobile search data sets. Utterances are typically short, at about 3-10 seconds in duration. The training set is approximately 3000 hours of mobile speech data. Two test sets were used: the first, Unified1a, is a uniform sampling of 30 hours of mobile search data which includes VoiceSearch queries and VoiceIME dictation. The second is sampled from users who have opted to donate their speech data [19]. This allows the construction of an anonymized data set of 80 speakers, each with about an hour of adaptation data to form the Pers1a adapt set and ten minutes of evaluation data to form an evaluation set, Pers1a eval.
The vocabulary of the recognition system is approximately one million words. The language model is a Katz smoothed language model trained on the transcribed acoustic training data, written query sources and unsupervised mobile speech data. A variety of acoustic models are evaluated. The speaker independent, real-time GMM system uses PLP features [20], semi-tied covariances [21] and linear discriminant analysis for dimensionality reduction of the 9 consecutively stacked PLP features down to 39 dimensions, and boosted MMI discriminative training [22] of context-dependent states [23] clustered using decision trees [24] to 7969 states; the real time GMM system contained a total of 340k Gaussians, while a larger 500k Gaussian system was also available. The deployed DNN system uses 11 stacked 40-dimensional log filterbank feature frames, 4 hidden layers of 2560 nodes each, and 7969 output nodes corresponding to the same context dependent state inventory. A much smaller DNN was trained with 16 stacked frames, 4 hidden layers of 512 nodes each and 1000 outputs. Also a larger DNN with 26 stacked frames, 6 hidden layers of 2176 nodes and 14247 outputs was trained to be more comparable to the results in [3].
4.1. Experiments on Unified1a
First some experiments are done on the Unified1a test set. Table 1 compares the relative performance of the various acoustic models discussed above. The results demonstrate that for a similar number of parameters (41M) the DNN is significantly better than the GMM system. The DNN system is also effective at smaller and larger network sizes, whereas the GMM system degrades badly at 100k Gaussians and doesn't improve much by increasing the number of parameters beyond 580k Gaussians. However, on an embedded device the smallest 100k Gaussian GMM system was able to run in real time on a recent mobile device, whereas the small 4x512 DNN is slightly slower on the same device.

  Model                 # of params   Unified1a
  GMM, 100k Gaussian    7.5M          20.6
  GMM, 340k Gaussian    27M           16.0
  GMM, 580k Gaussian    46M           15.4
  DNN, 4x512+1000       1.6M          16.1
  DNN, 4x2560+7969      41M           12.3
  DNN, 6x2176+14000     60M            9.9

Table 1. Comparison of unadapted GMM and DNN acoustic models, number of parameters and % WER.
4.2. Experiments on Pers1a
Some comparisons can be made between a GMM system and a DNN. We choose to compare a slightly smaller GMM system (27M parameters) with the 41M parameter DNN system since it is only slightly worse than the bigger system, and from experience adaptation of GMM based systems tends to work better when the models are smaller. Table 2 compares various GMM system adaptation techniques with the 4x2560 hidden layer DNN. CMLLR and MLLR give similar gains and improve as expected with an increased number of transforms; at 256 transforms the relative reduction in word error rate (WER) is 12-14%. MAP adaptation does surprisingly poorly, yielding the equivalent gain of 1 MLLR transform; perhaps this is due to the errors in the adaptation hypothesis. The unadapted DNN performs better than any of the adapted GMM systems; although it is larger, it is unlikely that a larger adapted GMM system would be better than the unadapted DNN system since the absolute difference on Unified1a between the smaller and bigger GMM system was 0.6%.

  System   Speaker          # of Transforms (%WER)
           Adaptation       2      16     256
  GMM      Unadapted        18.3
           CMLLR            16.7   16.1   16.1
           MLLR             16.7   16.1   15.7
           MAP, τ=1.0       16.7
  DNN      Unadapted        14.1

Table 2. Comparison of unsupervised speaker adaptation, using 1 hour of adaptation data, of 340k Gaussian GMM acoustic model with 4x2560 hidden layer DNN on Pers1a (%WER).
Neural Network Adaptation
The DNN can be adapted as suggested previously. Instead of applying a discriminatively trained transform of the features, in this work we look at speaker adaptation by re-training only the input layer, the output layer or the entire network. When training, and re-training for adaptation, the neural networks used a default step size of 0.02, along with momentum set to 0.9, weight decay set to 0.01 and a mini-batch size of 200 frames. First some initial experimentation was done with one speaker to test some of the hyperparameter settings. A single speaker has about 1500 words for adaptation and 1500 words for the held-out evaluation testing. Table 3 shows the difference between using momentum or not during adaptation.
  Adapt    α     step   Adapt (Epochs)         Eval (Epochs)
  Style          size   1     10    100        1     10    100
  Unadapted             11.7                   15.0
  Mini-    0.9   0.02   7.6   6.8   5.9        11.7  11.6  11.9
  batch    0.0   0.2    7.9   7.0   6.4        12.0  11.4  11.5
  Batch    0.9   0.02   11.3  11.0  8.2        14.3  13.5  11.9
           0.0   0.2    11.5  10.5  8.2        14.3  13.3  12.0

Table 3. Comparing the effect of momentum (α) for adapting a 4x512 hidden layer DNN to one speaker (% WER): Adapt is a ten minute portion of the adaptation data, whereas Eval is the held out test.
Compared to multiplying the step size by ten to 0.2, momentum was found to give similar results, but was nevertheless applied for consistency with the original neural network training regime. We also found stochastic mini-batch training to converge faster than batch updates, e.g. within 10 epochs versus 100-1000 epochs in a batch training scenario. With mini-batch training and without regularization, the error rates could be reduced by more than half on the adaptation data, but didn't generalize on the held out evaluation data; using L2 prior regularization improved results as shown in table 4. The L2 prior regularization was found not to be useful with batch training, probably due to the smoother gradients. Further network adaptation results all use L2 prior regularization.

  Adapt    β      Adapt (Epochs)         Eval (Epochs)
  Style           1     10    100        1     10    100
  Unadapted       11.7                   15.0
  Mini-    0.01   7.6   6.8   5.9        11.7  11.6  11.9
  batch    0      7.2   4.8   2.8        11.9  12.5  14.8
  Batch    0.01   11.5  10.5  8.2        14.3  13.3  12.0
           0      11.5  10.4  8.3        14.3  13.3  12.2

Table 4. Comparing the use of L2 prior regularization (β := weight decay) for adapting a 4x512 hidden layer DNN to one speaker (% WER): Adapt is a ten minute portion of the adaptation data, whereas Eval is the held out test. Mini-batch training used momentum, batch training did not.
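Putting the pieces together, the following hedged sketch outlines the layer-selective adaptation loop described in this section, using the hyperparameters quoted above; the weight list, gradient routine and mini-batch source are hypothetical placeholders rather than the paper's implementation.

```python
import numpy as np

def adapt_dnn(weights_si, minibatches, grad_fn, which="input",
              eta=0.02, alpha=0.9, beta=0.01, epochs=10):
    """Speaker adaptation by re-training part of a DNN with the eq. (6) update.

    weights_si  : list of weight arrays of the speaker independent DNN
    minibatches : sequence of (features, targets) chunks (~200 frames each)
    grad_fn     : back-propagation routine returning per-layer gradients
                  (placeholder; any DNN toolkit's gradients would do)
    which       : "input", "output", or "all" layers to update
    """
    w = [layer.copy() for layer in weights_si]
    deltas = [np.zeros_like(layer) for layer in w]
    if which == "input":
        layers = [0]
    elif which == "output":
        layers = [len(w) - 1]
    else:
        layers = range(len(w))
    for _ in range(epochs):
        for feats, targets in minibatches:
            grads = grad_fn(w, feats, targets)
            for i in layers:
                # eq. (6): gradient step + momentum + decay towards the SI weights
                deltas[i] = (-eta * grads[i] + alpha * deltas[i]
                             - beta * (w[i] - weights_si[i]))
                w[i] += deltas[i]
    return w
```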
Enrollment-style network adaptation
One approach to speaker adaptation is to have speakers read some material where the transcript is known and use this labeled data to fine-tune the network. This is often called speaker enrollment. In this work though we do not have speakers read data; rather we use the labeled results for 10 minutes of the adaptation data as the enrollment data for supervised training. Table 5 shows the results for this type of speaker adaptation, where the number of epochs is shown and different parts of the DNN are adapted. Note that while labeled data is used to adapt the network to each speaker, the results are on held out evaluation data.
                   # Epochs (%WER)
  Adaptation       1     10    100
  Unadapted        17.1
  Input layer      15.7  15.4  15.2
  Output layer     16.4  16.0  15.9
  All layers       14.4  14.2  14.4

Table 5. Enrollment style speaker adaptation of a 4x512 hidden layer DNN evaluated on Pers1a (%WER).
Although there are more output parameters, only updating the input layer gives better results than only updating the output layer. Even better is updating all 1.6M parameters in the network. Surprisingly, after 1 epoch most of the gains from adaptation are achieved. Only updating a single layer gives a relative improvement of around 10%, whereas updating the entire network gives more than 15%. These are similar to the gains found when adapting GMM acoustic models with common techniques.
Unsupervised network adaptation
The previous section looked at adaptation when the labels for the adaptation data were known. Whether the adaptation improvements hold when the labels are unknown is not certain. In this section, results are obtained by adapting the network per speaker with labels determined from the large 6x2176 deep DNN, i.e. with a WER of 10.2% on the 10 minutes of adaptation data. These results are shown in table 6. The upper three rows demonstrate large gains of 6-30% relative when evaluating on the same unsupervised adaptation data. Thus this approach could be used for offline systems that use multi-pass decoding strategies. In the bottom three rows, the unsupervised enrollment adaptation shows less improvement than with the speaker enrollment in table 5, but the adaptation of the entire network still shows about a 15% relative gain, indicating an incremental unsupervised adaptation strategy could work well.

  Test set     Adaptation      # Epochs (%WER)
                               1     10    100
  Adapt10min   Unadapted       16.5
               Input layer     14.6  13.8  13.5
               Output layer    15.5  14.5  14.0
               All layers      12.5  11.5  11.2
  Eval         Unadapted       17.1
               Input layer     15.8  15.5  15.3
               Output layer    16.5  16.2  16.0
               All layers      14.7  14.5  14.5

Table 6. Unsupervised speaker adaptation of a 4x512 hidden layer DNN evaluated on Pers1a (%WER).
Adaptation of a large network
The previous DNN adaptation results demonstrated that speaker adaptation could work well for a small network. Table 7 describes results of unsupervised adaptation of a large DNN. Here the results are more mixed. While results with supervised data show improvements, unsupervised adaptation of the large neural network shows no gain from adapting just the input or output layer. There is a small improvement from adapting the entire network, of 4.9% relative, but storing an entire network of 60M parameters per speaker is unwieldy. This minimal gain is slightly more than the 4% relative found by using a speaker-level discriminative transform with a DNN in [3] on the Fisher task; the discriminative transform is much more compact and can be estimated more rapidly. The minimal gain can be attributed to the large number of layers in the network that perform a powerful speaker normalizing feature extraction [11].

  Test set     Adaptation      # Epochs (%WER)
                               1     10
  Adapt10min   Unadapted       10.2
               Input layer     10.4  10.0
               Output layer    10.2  10.0
               All layers      9.8   10.0
  Eval         Unadapted       10.8
               Input layer     11.0  10.9
               Output layer    11.0  10.9
               All layers      10.2  10.3

Table 7. Unsupervised speaker adaptation of a 6x2176 hidden layer DNN evaluated on Pers1a (%WER).
5. CONCLUSIONS
This paper compares some standard speaker adapted GMM systems with adaptation of deep neural networks. As shown in prior work, an unadapted DNN system is handily better than an adapted GMM-based acoustic model. In this work we show that L2 prior regularization is helpful in improving generalization when adapting neural networks to speaker specific data. While momentum wasn't found to be appreciably useful, mini-batch training converged considerably faster than batch training and yielded slightly better results. It is also shown that small networks can benefit from both supervised and unsupervised speaker adaptation. Similar to previous work, large neural networks do not benefit as much from adaptation techniques.
6. REFERENCES
[1] M.J.F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, Jan. 1998.
[2] D. Yu, L. Deng, and G. Dahl, "Roles of pre-training and fine-tuning in context-dependent DNN-HMMs for real-world speech recognition," in Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Dec. 2010.
[3] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. ASRU, 2011.
[4] J.L. Gauvain and C.H. Lee, "Bayesian learning of Gaussian mixture densities for hidden Markov models," in Proc. DARPA Speech and Natural Language Workshop, 1991.
[5] C.J. Leggetter and P.C. Woodland, "Maximum likelihood linear regression speaker adaptation of continuous density HMMs," Computer Speech and Language, 1997.
[6] K. Shinoda and T. Watanabe, "Speaker adaptation with autonomous control using tree structure," in Proc. Eurospeech, 1995.
[7] M.J.F. Gales, "The generation and use of regression class trees for MLLR adaptation," Tech. Rep. CUED/F-INFENG/TR263, University of Cambridge, 1996. Available from http://mi.eng.cam.ac.uk/reports/index-speech.html.
[8] M. Cohen, H. Franco, N. Morgan, D. Rumelhart, and V. Abrash, "Hybrid neural network/hidden Markov model continuous-speech recognition," in Proc. Eurospeech, 1992.
[9] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proc. Interspeech, 2011.
[10] T. Sainath, B. Kingsbury, B. Ramabhadran, P. Fousek, P. Novak, and A. Mohamed, "Making deep belief networks effective for large vocabulary continuous speech recognition," in Proc. ASRU, 2011.
[11] A. Mohamed, G. Hinton, and G. Penn, "Understanding how deep belief networks perform acoustic modelling," in Proc. ICASSP, 2012.
[12] A. Mohamed, G.E. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Trans. on Acoustics, Speech, and Language Processing, 2012.
[13] D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, Oct. 1986.
[14] D. Rumelhart, G. Hinton, and R. Williams, "Learning internal representations by error propagation," Parallel Distributed Processing, vol. 1, 1986.
[15] N. Jaitly, P. Nguyen, A.W. Senior, and V. Vanhoucke, "Application of pretrained deep neural networks to large vocabulary speech recognition," in Proc. Interspeech, 2012.
[16] V. Abrash, H. Franco, A. Sankar, and M. Cohen, "Connectionist speaker normalization and adaptation," in Proc. Eurospeech, 1995.
[17] C. Chelba and A. Acero, "Adaptation of maximum entropy capitalizer: Little data can help a lot," in Proc. ACL, 2004.
[18] J. Trmal, J. Zelinka, and L. Müller, "Adaptation of a feedforward artificial neural network using a linear transform," in Proc. TSD, 2010.
[19] A. Mané and G. Shires, "Voice search gets personal," Dec. 2010, http://googlemobile.blogspot.com/2010/12/voice-search-gets-personal.html.
[20] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[21] M.J.F. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Trans. on Speech and Audio Processing, May 1999.
[22] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, "Boosted MMI for model and feature-space discriminative training," in Proc. ICASSP, 2008.
[23] L.R. Bahl, P.V. de Souza, P.S. Gopalkrishnan, D. Nahamoo, and M.A. Picheny, "Context dependent modelling of phones in continuous speech using decision trees," in Proc. DARPA Speech and Natural Language Processing Workshop, 1991, pp. 264–270.
[24] S.J. Young, J.J. Odell, and P.C. Woodland, "Tree-based state tying for high accuracy acoustic modelling," in Proc. ARPA Workshop on Human Language Technology, 1994, pp. 307–312.