LSTM Neural Networks for Language Modeling
Martin Sundermeyer, Ralf Schlüter, and Hermann Ney
Human Language Technology and Pattern Recognition, Computer
Science Department, RWTH Aachen University, Aachen, Germany
{sundermeyer,schlueter,ney}@cs.rwth-aachen.de
Abstract
Neural networks have become increasingly popular for the task
of language modeling. Whereas feed-forward networks only
exploit a fixed context length to predict the next word of a se-
quence, conceptually, standard recurrent neural networks can
take into account all of the predecessor words. On the other
hand, it is well known that recurrent networks are difficult to
train and therefore are unlikely to show the full potential of re-
current models.
These problems are addressed by the Long Short-Term
Memory neural network architecture. In this work, we ana-
lyze this type of network on an English and a large French
language modeling task. Experiments show improvements of
about 8 % relative in perplexity over standard recurrent neural
network LMs. In addition, we gain considerable improvements
in WER on top of a state-of-the-art speech recognition system.
Index Terms: language modeling, recurrent neural networks,
LSTM neural networks
1. Introduction
In automatic speech recognition, the language model (LM) of a
recognition system is the core component that incorporates syn-
tactical and semantical constraints of a given natural language.
While today mainly backing-off models ([1]) are used for the
recognition pass, feed-forward neural network LMs, first intro-
duced in [2], have become an important supplement to existing
techniques in the rescoring stage ([3]).
Both approaches rely on the n-gram approximation where the probability $p(w_1^N)$ of a word sequence $w_1^N$ is factorized as
$$p(w_1^N) = \prod_{m=1}^{N} p(w_m \mid h_m)$$
so that only the $n-1$ preceding words $h_m := w_{m-n+1}^{m-1}$ are used to estimate the probability of the word at position $m$.
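As a small illustration of this factorization, the following Python sketch scores a sentence under an n-gram LM. The estimator `prob(w, h)` returning $p(w \mid h)$ is a hypothetical stand-in (e.g. a backing-off or neural network LM), not something defined in the paper.

```python
import math

def sentence_log_prob(words, prob, n=3):
    """Score a word sequence with an n-gram LM: each word is conditioned
    on at most the n-1 preceding words h_m = w_{m-n+1}^{m-1}."""
    log_p = 0.0
    for m, w in enumerate(words):
        h = tuple(words[max(0, m - (n - 1)):m])  # truncated history
        log_p += math.log(prob(w, h))
    return log_p
```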
However, neural network LMs overcome a major drawback of
backing-off models ([5]): Whenever an n-gram (h, w) has not
been observed in training, a backing-off model lacks an explicit
estimate for the probability of this n-gram. Therefore it falls
back on the estimate for the $(n-1)$-gram $(\bar{h}, w)$ where the left-most word of $h$ has been removed to construct $\bar{h}$, and $\gamma(h)$ is a normalization constant:
$$p(w \mid h) = \gamma(h)\, p(w \mid \bar{h}).$$
In contrast to backing-off models, neural network LMs always
estimate probabilities based on the full history, regardless of
whether the n-gram was seen in training or not.
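The backing-off recursion just described can be sketched as follows; the helpers `seen`, `discounted_prob`, and `gamma` are hypothetical stand-ins for whatever smoothing method (e.g. Kneser-Ney) provides them.

```python
# Hedged sketch of backing-off: if the n-gram (h, w) was not observed in
# training, fall back to the shortened history h_bar (left-most word removed),
# weighted by the normalization constant gamma(h).

def backoff_prob(w, h, seen, discounted_prob, gamma):
    if not h or seen(h, w):
        return discounted_prob(w, h)     # explicit (possibly discounted) estimate
    h_bar = h[1:]                        # remove the left-most word of h
    return gamma(h) * backoff_prob(w, h_bar, seen, discounted_prob, gamma)
```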
On the other hand, the n-gram assumption still leads to an
inaccuracy in the modeling when feed-forward neural network
LMs are used. According to the chain rule of probability theory,
all predecessor words $w_1^{m-1}$ need to be taken into account to predict the $m$-th word of a sentence:
$$p(w_1^N) = \prod_{m=1}^{N} p(w_m \mid w_1^{m-1}).$$
This can be remedied by replacing the feed-forward architecture
by a recurrent neural network architecture which is appropriate
for sequence modeling (see [6], [7]).
Unfortunately, recurrent neural networks are hard to train
using backpropagation through time ([8]). The main difficulty
lies in the well-known vanishing gradient problem ([9]) which
means that the gradient that is propagated back through the net-
work either decays or grows exponentially.
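A toy illustration of this scaling effect (our own example, with made-up factors): a gradient that is multiplied by a constant factor at every time step either decays towards zero or blows up exponentially.

```python
# Repeatedly scale a "gradient" by a factor slightly below or above one.
grad = 1.0
for factor in (0.9, 1.1):
    g = grad
    for t in range(50):          # 50 "time steps" (word positions)
        g *= factor
    print(f"factor={factor}: after 50 steps gradient magnitude ~ {g:.3g}")
# factor=0.9 -> ~5e-3 (vanishing), factor=1.1 -> ~1.2e2 (exploding)
```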
One approach to improved training of recurrent neural net-
works lies in better optimization algorithms that make use of
higher-order information (see e. g. [10]). However, this usu-
ally comes at the price of significantly increased computational
costs which makes these methods less attractive for language
modeling where the amount of training data is extremely large.
An alternative solution called Long Short-Term Memory
(LSTM) was proposed in [11]: The network architecture is
modified such that the vanishing gradient problem is explicitly
avoided, whereas the training algorithm is left unchanged.
In this work, we introduce LSTMs to the field of language
modeling. We analyze its effectiveness on an English and a
large French corpus in terms of perplexity and word error rate.
Furthermore, we investigate techniques for decreased training
times and compare different neural network LM architectures.
2. LSTM neural networks
In [11], the vanishing gradient problem was analysed in detail.
Whenever the gradient of the error function of the neural net-
work is propagated back through a unit of a neural network, it
gets scaled by a certain factor. For nearly all practically relevant
cases, this factor is either greater than one or smaller than one.
As a result, in a recurrent neural network, the gradient
blows up or decays exponentially over time. (From the language
modeling point of view, time steps correspond to word posi-
tions in a sentence.) Thus, the gradient either dominates the
next weight adaptation step or effectively gets lost.
To avoid this scaling effect, the authors re-designed the unit
of a neural network in such a way that its corresponding scal-
ing factor is fixed to one. The new unit type that is obtained
from this design goal is rather limited in its learning capabili-
ties. Therefore, the unit was enriched by several so-called gat-

ing units. The final unit is depicted in Fig. 1, where we have
included two modifications of the original LSTM unit proposed
in [12] and [13].
Figure 1: LSTM memory cell with gating units
A standard neural network unit $i$ only consists of the input activation $a_i$ and the output activation $b_i$ which are related, when a tanh activation function is used, by
$$b_i = \tanh(a_i).$$
The LSTM unit adds several intermediate steps: After applying the activation function to $a_i$, the result is multiplied by a factor $b_\iota$. Then the inner activation value of the previous time step, multiplied by the quantity $b_\phi$, is added due to the recurrent self-connection. Finally, the result is scaled by $b_\omega$ and fed to another activation function, yielding $b_i$. The factors $b_\iota, b_\phi, b_\omega \in (0, 1)$, indicated by the small white circles, are controlled by additional units (depicted as blue circles) called input, forget, and output gate, respectively. The gating units sum the activations of the previous hidden layer and the activations of the current layer from the previous time step as well as the inner activation of the LSTM unit. The resulting value is squashed by a logistic sigmoid function which then is set to $b_\iota$, $b_\phi$, or $b_\omega$, respectively.
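As an illustration of this data flow, the following NumPy sketch implements one time step of such a unit, written directly from the verbal description above (the exact equations are given in [14]). The parameter names in `p` (input weights W_*, recurrent weights R_*, peephole vectors p_*) are our own hypothetical naming, not the paper's notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of an LSTM layer.
    x      : activations of the previous (lower) hidden layer, current time step
    h_prev : activations of this layer from the previous time step
    c_prev : inner (cell) activations from the previous time step
    p      : dict of weight matrices and peephole vectors (hypothetical names)"""
    # Input activation of the unit, squashed by tanh.
    a = np.tanh(p["W_c"] @ x + p["R_c"] @ h_prev)
    # Gating units: sum of lower-layer input, recurrent input and the inner
    # activation (peephole), squashed by a logistic sigmoid.
    b_iota = sigmoid(p["W_i"] @ x + p["R_i"] @ h_prev + p["p_i"] * c_prev)   # input gate
    b_phi  = sigmoid(p["W_f"] @ x + p["R_f"] @ h_prev + p["p_f"] * c_prev)   # forget gate
    # New inner activation: gated input plus the gated previous cell value.
    c = b_iota * a + b_phi * c_prev
    b_omega = sigmoid(p["W_o"] @ x + p["R_o"] @ h_prev + p["p_o"] * c)       # output gate
    # Output activation: cell value scaled by the output gate and fed through
    # another activation function, as described in the text.
    b = np.tanh(b_omega * c)
    return b, c
```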
For brevity, we omit the rather extensive equations describing the LSTM network. These can be found e.g. in [14]. (As opposed to our LSTM version, in [14] the gating units do not receive the activations of the previous hidden layer.)
The whole LSTM unit including the gating units may be in-
terpreted as a differentiable version of computer memory ([14]).
For this reason, LSTM units sometimes are also referred to as
LSTM memory cells. Whether one adheres to the proposed in-
terpretation of the gating units or not, the LSTM architecture
solves the vanishing gradient problem at small computational
extra-costs. In addition, it has the desirable property of includ-
ing standard recurrent neural network units as a special case.
3. Neural network language models
Although there are several differences in the neural network lan-
guage models that have been successfully applied so far, all of
them share some basic principles:
- The input words are encoded by 1-of-K coding where K is the number of words in the vocabulary.
- At the output layer, a softmax activation function is used to produce correctly normalized probability values.
- As training criterion the cross entropy error is used, which is equivalent to maximum likelihood.
We also follow this approach. It is generally advised to normal-
ize the input data of a neural network ([15]) which means that a
linear transformation is applied so that the data have zero mean
and unit variance. When using 1-of-K coding, this is obviously
not the case.
Giving up the sparseness of the input features (which is usu-
ally exploited to speed up matrix computations, cf. [16]), the
data can easily be normalized because there exist closed-form
solutions for the mean and variance of the 1-of-K encoded input
features that depend only on the unigram counts of the words
observed in the training data. Contrary to this advice, however, we observed that
convergence was considerably slowed down by normalization.
It seems that it suffices when the input data in each dimension
lie in the same [0, 1] range.
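The closed-form statistics mentioned above follow because, under 1-of-K coding, the k-th input dimension is 1 exactly when word k occurs, i.e. it is a Bernoulli variable whose parameter is the unigram relative frequency of word k. A minimal sketch (our own illustration):

```python
import numpy as np

def one_of_k_stats(unigram_counts):
    """unigram_counts: array of length K with training counts per word."""
    counts = np.asarray(unigram_counts, dtype=float)
    p = counts / counts.sum()        # unigram probabilities
    mean = p                         # E[x_k] = p_k
    var = p * (1.0 - p)              # Var[x_k] = p_k (1 - p_k)
    return mean, var

# A zero-mean, unit-variance transform would then be (x - mean) / sqrt(var),
# at the price of giving up the sparseness of the 1-of-K vectors.
```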
As the input features are highly correlated (e.g., we have $x_i = 1 - \sum_{j \neq i} x_j$ for the $i$-th dimension of an input variable $x$), applying a whitening transform to the features appears
to be more promising. Because of the high dimensionality, this
seems practically unfeasible.
Regarding the network topology, in [6] a single recurrent
hidden layer was used, while in [3] an architecture with two
hidden layers was applied, the first layer having the interpreta-
tion of projecting the input words to a continuous space. In a
similar spirit, we stick to the topology shown in Fig. 2 where
we plug LSTM units into the second, recurrent layer, combin-
ing it with different projection layers of standard neural network
units.
Figure 2: Neural network LM architecture
For large-vocabulary language modeling, training is strongly dominated by the computation of the input activations $a_i$ of the softmax output layer which, in contrast to the input layer, is not sparse:
$$a_i = \sum_{j=1}^{J} \omega_{ij}\, b_j.$$
Here, $J$ denotes the number of nodes in the last hidden layer, $\omega_{ij}$ are the weights between the last hidden layer and the output layer, and $i = 1, \ldots, V$, where $V$ is the vocabulary size.
To reduce the computational effort, in [17] (following an idea from [18]), it was proposed to split the words into a set of disjoint word classes. Then the probability $p(w_m \mid w_1^{m-1})$ can be factorized as follows:
$$p(w_m \mid w_1^{m-1}) = p\bigl(w_m \mid c(w_m), w_1^{m-1}\bigr)\, \cdot\, p\bigl(c(w_m) \mid w_1^{m-1}\bigr)$$
where $w_m \in c(w_m)$, and $c(w_m)$ is the class of word $w_m$. How
to define a reasonable set of classes is described in [19]. Using
this identity, the computational complexity can be significantly
reduced.
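A hedged sketch of this factorization (our own illustration; the weight matrices `W_class`, `W_word` and the class mappings are hypothetical): only the C class scores and the scores of the words inside the target word's class need to be computed, instead of all V output activations.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def class_factorized_prob(h, w, word2class, class2words, W_class, W_word):
    """h: last hidden layer activations; w: index of the predicted word.
    word2class maps word index -> class index; class2words maps class
    index -> list of word indices belonging to that class."""
    c = word2class[w]
    members = class2words[c]                         # word indices in class c
    p_class = softmax(W_class @ h)                   # distribution over C classes
    p_word_in_class = softmax(W_word[members] @ h)   # only over |c| words
    return p_class[c] * p_word_in_class[members.index(w)]
```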
4. Experimental results
For the experimental results, we concentrated on two corpora: the English Treebank-3 Corpus and the French corpora from the Quaero project (http://www.quaero.org). Details can be found in Table 1.
LM              train   dev1   dev2   test
Treebank        930 K   74 K   -      82 K
Quaero French   27 M    46 K   36 K   35 K

Table 1: Corpus sizes in number of running words; the vocabulary size of the Treebank corpus is 10 K, for Quaero French it is 170 K; dev1 was used as validation data for neural network training, dev2 for optimizing the LM scale
The results for the Treebank-3 corpus are summarized in
Fig. 3. First, we trained a recurrent neural network LM with the
architecture shown in Fig. 2 except that we omitted the projec-
tion layer. For the recurrent hidden layer, we chose either standard units with a sigmoid activation function or LSTM units, see Fig. 3 (a). We found that the perplexity of the LSTM models was consistently lower, by about 8 %, compared to the standard recurrent neural network. The perplexities we obtained with the
sigmoidal recurrent network closely match those obtained with
the rnnlm toolkit ([20]).
The training times for these two models are of similar or-
der. However, the corresponding model sizes actually are quite
different for a given number of hidden layer nodes: While the
LSTM version with 150 hidden nodes corresponds to 7.6 M pa-
rameters, the sigmoidal network has only 3.0 M weight param-
eters. On the other hand, when increasing the model size of the
sigmoidal network until a comparable number of parameters is
reached, no significant improvement can be obtained (350 hid-
den nodes correspond to 7.1 M parameters). In addition, when
a projection layer is used and the vocabulary size is huge, the
overhead in model size of the LSTM variant is negligible.
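These figures are consistent with a rough parameter count (our own back-of-envelope arithmetic, assuming the architecture of Fig. 2 without projection layer, vocabulary size $V = 10\,\text{K}$ and $H = 150$ hidden nodes):
$$\text{sigmoid RNN:}\quad VH + H^2 + HV \approx 1.5\,\text{M} + 0.02\,\text{M} + 1.5\,\text{M} \approx 3.0\,\text{M},$$
$$\text{LSTM:}\quad 4VH + 4H^2 + HV \approx 6.0\,\text{M} + 0.09\,\text{M} + 1.5\,\text{M} \approx 7.6\,\text{M},$$
where the factor of four reflects that the cell input and the three gating units each carry their own input and recurrent weights.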
In a second set of experiments, we tried to find out whether
an additional projection layer gives further improvements, cf.
Fig. 3 (b). Unfortunately, compared to the raw LSTM version,
neither a linear layer (where the activation function is the iden-
tity) nor a sigmoidal layer led to lower perplexities. Our in-
terpretation of the results is that such a projection layer creates
smeared input features that complicate the learning task for the
LSTM units.
For the results we have shown so far, a single input sentence
was presented to the network during training and testing. This
means that the maximum context length is limited to about 21
words which is the average sentence length in the Treebank cor-
pus. However, unlike for standard recurrent neural networks,
LSTM nets might be able to exploit even longer context sizes.
Therefore we increased the size of the input sequences by con-
catenating a fixed number of consecutive sentences. The effect
on the performance can be seen in Fig. 3 (c).
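For concreteness, a minimal sketch of this sequence construction (our own illustration; how sentence-boundary tokens are treated inside a concatenated sequence is a choice not detailed here):

```python
def concatenate_sentences(sentences, k):
    """sentences: list of token lists in corpus order.
    Yields training/test sequences made of k consecutive sentences,
    as used for Fig. 3 (c) with k = 1, 2, 4, ..."""
    for start in range(0, len(sentences), k):
        seq = []
        for sent in sentences[start:start + k]:
            seq.extend(sent)
        yield seq
```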
We observe that in case of a single hidden layer with LSTM
cells, a small improvement is possible when switching from
one to two concatenated sentences. The same holds true for
an LSTM with a sigmoidal projection layer. Interestingly, in contrast to single sentences, a linear projection layer helps for longer input sequences.
Probably the number of different words the neural network
has to distinguish at the output layer is too large for learn-
ing complex long-range dependencies. Therefore we observe
the general tendency that the perplexity significantly increases
when the context length exceeds a certain threshold, regardless
of any preprocessing.
However it seems that the smearing of the input features
introduced by the linear layer is beneficial for long input se-
quences, and we obtained best perplexities when LSTM cells
were combined with this type of projection layer.
Finally, we investigated the interaction between LSTM net-
works and a clustered output layer. For this experiment, we
used an LSTM network with 200 hidden nodes and no projec-
tion layer.
As shown in Fig. 3 (d), the impact of clustering on perplex-
ities is only moderate, while large speed ups are possible for
training (as well as for testing). In theory, the speed up should
be largest for $C = \sqrt{V}$, where $C$ denotes the number of classes,
and V is the vocabulary size. It turns out that this behaviour is
not exactly matched in practice because classes have different
sizes.
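The $\sqrt{V}$ rule follows from a standard cost argument (our own derivation of the idealized equal-class-size case): with $J$ hidden nodes, the class layer costs about $J \cdot C$ multiplications and the within-class word layer about $J \cdot V/C$ if every class contained $V/C$ words, so the total
$$J\Bigl(C + \frac{V}{C}\Bigr) \quad\text{is minimal for}\quad 1 - \frac{V}{C^2} = 0 \;\Longleftrightarrow\; C = \sqrt{V},$$
giving a best-case speed-up of roughly $JV / (2J\sqrt{V}) = \sqrt{V}/2$ over the unclustered output layer; unequal class sizes reduce this in practice, as noted above.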
Apart from the results for the comparatively small English
corpus, we also applied the LSTM networks to a large vocab-
ulary French speech recognition task. Within the Quaero re-
search project, yearly evaluations are held where speech recog-
nition systems are evaluated on broadcast conversational pod-
cast data.
We took our best French recognition system, which proved to be competitive in the 2011 evaluation. The system included state-of-the-art acoustic models with cross-adaptation, MLP features, and discriminative training. The backing-off LM
was trained on more than 4 B words.
From the lattices created by the speech recognizer, we ex-
tracted n-best lists of size n = 1000. We trained an LSTM
LM using 300 hidden nodes and 27 M running words of in-
domain training data. Although the Kneser-Ney (KN) backing-
off model was trained on more than a hundred times more data,
by interpolation, we obtained improvements in word error rate
of 0.5 % on the development and 0.3 % on the test data of the
2011 evaluation.
LM                 dev2     test
KN 4-gram          19.7 %   17.6 %
KN 4-gram + LSTM   19.2 %   17.3 %
Table 2: Word error rate results for Quaero French.
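A hedged sketch of such an n-best rescoring setup (linear interpolation of the two LM probabilities is assumed; the interpolation weight `lam`, the LM scale, and the scoring callables are hypothetical and would be tuned on dev2):

```python
import math

def rescore_nbest(nbest, kn_prob, lstm_prob, am_score, lam=0.5, lm_scale=12.0):
    """nbest: list of hypotheses (word lists); returns the best-scoring one.
    kn_prob(w, history) and lstm_prob(w, history) return p(w | history)."""
    def total_score(words):
        lm_log_prob = sum(
            math.log(lam * kn_prob(w, words[:m]) + (1.0 - lam) * lstm_prob(w, words[:m]))
            for m, w in enumerate(words))
        return am_score(words) + lm_scale * lm_log_prob
    return max(nbest, key=total_score)
```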
5. Conclusions
In this paper, we applied the LSTM neural network architecture
to two language modeling tasks. This network type is especially
well-suited to language modeling as in theory it allows the exact
modeling of the probability of a word sequence. As opposed to
previous approaches, it does not suffer from conceptual prob-
lems of standard recurrent neural network training.
We explored several different neural network topologies
and analyzed the importance of the widespread use of an ad-
ditional hidden projection layer. We showed that the LSTM
network can be combined with existing clustering techniques to
gain large speed ups in training and testing times at a small loss
in performance.

120
130
140
150
160
50 100 150 200 250 300 350
PPL
Hidden layer size
(a) Hidden layer sizes (one hidden layers)
Sigmoid
LSTM
120
130
140
150
160
50 100 150 200 250 300 350
PPL
Hidden layer size
(b) Hidden layer sizes (two hidden layers)
Linear + LSTM
Sigmoid + LSTM
120
130
140
150
160
1 2 4 8 16 32 64
PPL
Number of Sentences
(c) Sequence length
Sigmoid + LSTM
LSTM
Linear + LSTM
120
130
140
150
160
0 200 400 600 800 1000
10
12
14
16
18
20
22
PPL
Speed up factor
Number of clusters
(d) Number of clusters vs. speed up
Speed up factor
PPL
Figure 3: Experimental results on the Treebank corpus; for (c) and (d), 200 nodes were used for the hidden layers.
Experiments suggest that the performance of standard re-
current neural network architectures can be improved by about
8 % relative in terms of perplexity. Finally, comparatively large
improvements were obtained when interpolating an LSTM LM
with a huge Kneser-Ney smoothed backing-off model on top of
a state-of-the-art French recognition system.
For future work, it seems interesting to analyze the differ-
ences between standard and LSTM networks and the impact on
the recognition quality of a speech recognizer.
6. Acknowledgment
This work was partly realized as part of the Quaero programme,
funded by OSEO, French State agency for innovation.
7. References
[1] Kneser, R., and Ney, H., “Improved Backing-Off For M-Gram
Language Modeling”, Proc. of ICASSP 1995, pp. 181–184
[2] Bengio, Y., Ducharme, R., “A neural probabilistic language model”, Proc. of Advances in Neural Information Processing Systems (2001), vol. 13, pp. 932–938
[3] Schwenk, H., “Continuous space language models”, Computer
Speech and Language 21 (2007), pp. 492–518
[5] Oparin, I., Sundermeyer, M., Ney, H., Gauvain, J.-L., “Perfor-
mance Analysis of Neural Networks in Combination with n-Gram
Language Models”, Proc. of ICASSP 2012, accepted for publica-
tion
[6] Mikolov, T., Karafiát, M., Burget, L., Černocký, J. H., and Khudanpur, S., “Recurrent neural network based language model”, Proc. of Interspeech 2010, pp. 1045–1048
[7] Elman, J., “Finding Structure in Time”, Cognitive Science 14
(1990), pp. 179–211
[8] Rumelhart, D. E., Hinton, G. E., Williams, R. J., “Learning rep-
resentations by back-propagating errors”, Nature 323 (1986), pp.
533–536
[9] Bengio, Y., Simard, P., Frasconi, P., “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks 5 (1994), pp. 157–166
[10] Martens, J., Sutskever, I., “Learning Recurrent Neural Networks
with Hessian-Free Optimization”, Proc. of the 28th Int. Conf. on
Machine Learning 2011
[11] Hochreiter, S., Schmidhuber, J., “Long Short-Term Memory”,
Neural Computation 9 (8), 1997, pp. 1735–1780
[12] Gers, F. A., “Learning to Forget: Continual Prediction with
LSTM”, Proc. of the 9th Int. Conf. on Artificial Neural Networks,
1999, pp. 850–855
[13] Gers, F. A., Schraudolph, N. N., Schmidhuber, J., “Learning Pre-
cise Timing with LSTM Recurrent Networks”, Journal of Ma-
chine Learning Research 3, 2002, pp. 115–143
[14] Graves, A., Schmidhuber, J., “Framewise Phoneme Classification
with Bidirectional LSTM and Other Neural Network Architec-
tures”, Neural Networks, Vol. 18, Issue 5–6, 2005, pp. 602–610
[15] Bishop, C., “Neural Networks for Pattern Recognition”, Claren-
don Press, Oxford, 1995
[16] Le, H. S., Allauzen, A., Wisniewski, G., Yvon, F., “Training con-
tinuous space language models: some practical issues”, Proc. of
the 2010 Conf. on Emp. Methods in NLP, pp. 778–788
[17] Morin, F., Bengio, Y., “Hierarchical Probabilistic Neural Network Language Model”, Proc. of the 10th Int. Workshop on Artificial Intelligence and Statistics, 2005
[18] Goodman, J., “Classes for fast maximum entropy training”, Proc.
of the ICASSP, 2001
[19] Mikolov, T., Kombrink, S., Burget, L., Černocký, J., Khudanpur, S., “Extensions of Recurrent Neural Network Language Model”, Proc. of the ICASSP 2011, pp. 5528–5531
[20] Mikolov, T., Kombrink, S., Deoras, A., Burget, L., Černocký, J., “RNNLM – Recurrent Neural Network Language Modeling Toolkit”, Proc. of the 2011 ASRU Workshop, pp. 196–201