IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 45, NO. 11, NOVEMBER 1997
Bidirectional Recurrent Neural Networks
Mike Schuster and Kuldip K. Paliwal, Member, IEEE
Abstract: In the first part of this paper, a regular recurrent
neural network (RNN) is extended to a bidirectional recurrent
neural network (BRNN). The BRNN can be trained without
the limitation of using input information just up to a preset
future frame. This is accomplished by training it simultaneously
in positive and negative time direction. Structure and training
procedure of the proposed network are explained. In regression
and classification experiments on artificial data, the proposed
structure gives better results than other approaches. For real
data, classification experiments for phonemes from the TIMIT
database show the same tendency.
In the second part of this paper, it is shown how the proposed
bidirectional structure can be easily modified to allow efficient
estimation of the conditional posterior probability of complete
symbol sequences without making any explicit assumption about
the shape of the distribution. For this part, experiments on real
data are reported.
Index Terms: Recurrent neural networks.
I. INTRODUCTION
A. General
Many classification and regression problems of engi-
neering interest are currently solved with statistical
approaches using the principle of “learning from examples.”
For a certain model with a given structure inferred from the
prior knowledge about the problem and characterized by a
number of parameters, the aim is to estimate these parameters
accurately and reliably using a finite amount of training data.
In general, the parameters of the model are determined by a
supervised training process, whereas the structure of the model
is defined in advance. Choosing a proper structure for the
model is often the only way for the designer of the system
to put in prior knowledge about the solution of the problem.
Artificial neural networks (ANN’s) (see [2] for an excellent
introduction) are one group of models that take the principle
“infer the knowledge from the data” to an extreme. In this
paper, we are interested in studying ANN structures for one
particular class of problems that are represented by temporal
sequences of input–output data pairs. For these types of
problems, which occur, for example, in speech recognition,
time series prediction, dynamic control systems, etc., one of
the challenges is to choose an appropriate network structure
Manuscript received June 5, 1997. The associate editor coordinating the
review of this paper and approving it for publication was Prof. Jenq-Neng
Hwang.
M. Schuster is with the ATR Interpreting Telecommunications Research
Laboratory, Kyoto, Japan.
K. K. Paliwal is with the ATR Interpreting Telecommunications Research
Laboratory, Kyoto, Japan, on leave from the School of Microelectronic
Engineering, Griffith University, Brisbane, Australia.
Publisher Item Identifier S 1053-587X(97)08055-0.
that, at least theoretically, is able to use all available input
information to predict a point in the output space.
Many ANN structures have been proposed in the literature
to deal with time varying patterns. Multilayer perceptrons
(MLP’s) have the limitation that they can only deal with
static data patterns (i.e., input patterns of a predefined dimen-
sionality), which requires definition of the size of the input
window in advance. Waibel et al. [16] have pursued time delay
neural networks (TDNN’s), which have proven to be a useful
improvement over regular MLP’s in many applications. The
basic idea of a TDNN is to tie certain parameters in a regular
MLP structure without restricting the learning capability of the
ANN too much. Recurrent neural networks (RNN’s) [5], [8],
[12], [13], [15] provide another alternative for incorporating
temporal dynamics and are discussed in more detail in a later
section.
In this paper, we investigate different ANN structures for
incorporating temporal dynamics. We conduct a number of
experiments using both artificial and real-world data. We show
the superiority of RNN’s over the other structures. We then
point out some of the limitations of RNN’s and propose a
modified version of an RNN called a bidirectional recurrent
neural network, which overcomes these limitations.
B. Technical
Consider a (time) sequence of input data vectors $x_1^T = (x_1, x_2, \ldots, x_T)$ and a sequence of corresponding output data vectors $y_1^T = (y_1, y_2, \ldots, y_T)$, with neighboring data pairs (in time) being somehow statistically dependent. Given time sequences $x_1^T$ and $y_1^T$ as training
data, the aim is to learn the rules to predict the output data
given the input data. Inputs and outputs can, in general, be
continuous and/or categorical variables. When outputs are
continuous, the problem is known as a regression problem,
and when they are categorical (class labels), the problem is
known as a classification problem. In this paper, the term
prediction is used as a general term that includes regression
and classification.
1) Unimodal Regression: For unimodal regression or func-
tion approximation, the components of the output vectors are
continuous variables. The ANN parameters are estimated to
maximize some predefined objective criterion (e.g., maximize
the likelihood of the output data). When the distribution of the
errors between the desired and the estimated output vectors
is assumed to be Gaussian with zero mean and a fixed global
data-dependent variance, the likelihood criterion reduces to the

Fig. 1. General structure of a regular unidirectional RNN shown (a) with a delay line and (b) unfolded in time for two time steps.
convenient Euclidean distance measure between the desired
and the estimated output vectors or the mean-squared-error
criterion, which has to be minimized during training [2]. It
has been shown by a number of researchers [2], [9] that neural networks can estimate the conditional average of the desired output (or target) vectors at their network outputs, i.e., $E[y_t \mid x_1^T]$, where $E[\cdot]$ is the expectation operator.
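To spell out this reduction (a standard derivation written in the reconstructed notation used here, not an equation reproduced from the paper): if the target $y_t$ is modeled as the network output $\hat{y}_t$ plus zero-mean Gaussian noise with a fixed variance $\sigma^2$, the negative log-likelihood of the training sequence is

    -\log \prod_{t=1}^{T} p(y_t \mid x_1^T)
        = \frac{1}{2\sigma^2} \sum_{t=1}^{T} \| y_t - \hat{y}_t \|^2 + \text{const},

so maximizing the likelihood is equivalent to minimizing the sum (or mean) of squared errors, whose minimizer in expectation is the conditional mean $E[y_t \mid x_1^T]$.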
2) Classification: In the case of a classification problem,
one seeks the most probable class $c_t$ out of a given pool of $K$ classes for every time frame $t$, given an input vector sequence $x_1^T$. To make this kind of problem suitable to be solved by an ANN, the categorical variables are usually coded as vectors as follows. Consider that $c_t$ is the desired class label for the frame at time $t$. Then, construct an output vector $y_t$ such that its $c_t$th component is one and the other components are zero. The output vector sequence $y_1^T$ constructed in this manner, along with the input vector sequence $x_1^T$, can be
used to train the network under some optimality criterion,
usually the cross-entropy criterion [2], [9], which results from
a maximum likelihood estimation assuming a multinomial
output distribution. It has been shown [3], [6], [9] that the $k$th network output at each time point $t$ can be interpreted as an estimate of the conditional posterior probability of class membership $\Pr(C_t = k \mid x_1^T)$ for class $k$, with the quality of the estimate depending on the size of the training data and the complexity of the network.
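In the same reconstructed notation (again a standard result rather than an equation reproduced from the paper): with one-of-$K$ coded targets $y_{t,k}$ and network outputs $\hat{y}_{t,k}$ that sum to one over $k$, maximum likelihood estimation under a multinomial output model amounts to minimizing the cross-entropy

    E = -\sum_{t=1}^{T} \sum_{k=1}^{K} y_{t,k} \log \hat{y}_{t,k},

and the $k$th output of a sufficiently flexible, well-trained network approaches the posterior $\Pr(C_t = k \mid x_1^T)$.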
For some applications, it is not necessary to estimate the conditional posterior probability $\Pr(C_t = c_t \mid x_1^T)$ of a single class given the sequence of input vectors but rather the conditional posterior probability $\Pr(C_1^T = c_1^T \mid x_1^T)$ of a sequence of classes given the sequence of input vectors.¹
C. Organization of the Paper
This paper is organized in two parts. Given a series of paired input/output vectors $(x_t, y_t)$, $t = 1, 2, \ldots, T$, we want to train bidirectional recurrent neural networks to perform the following tasks.

• Unimodal regression [i.e., compute $E(y_t \mid x_1^T)$] or classification [i.e., compute $\Pr(C_t = c_t \mid x_1^T)$ for every output class and decide the class using the maximum a posteriori decision rule]. In this case, the outputs are treated as statistically independent. Experiments for this part are conducted for artificial toy data as well as for real data.
• Estimation of the conditional probability of a complete sequence of classes of length $T$ using all available input information [i.e., compute $\Pr(C_1^T = c_1^T \mid x_1^T)$]. In this case, the outputs are treated as being statistically dependent, which makes the estimation more difficult and requires a slightly different network structure than the one used in the first part. For this part, results of experiments for real data are reported.

¹ Here, we want to make a distinction between $C_t$ and $c_t$: $C_t$ is a categorical random variable, and $c_t$ is its value.
II. PREDICTION ASSUMING INDEPENDENT OUTPUTS
A. Recurrent Neural Networks
RNN’s provide a very elegant way of dealing with (time)
sequential data that embodies correlations between data points
that are close in the sequence. Fig. 1 shows a basic RNN
architecture with a delay line and unfolded in time for two
time steps. In this structure, the input vectors $x_t$ are fed one at a time into the RNN. Instead of using a fixed number of input vectors, as done in the MLP and TDNN structures, this architecture can make use of all the available input information up to the current time frame $t$ [i.e., $x_1^t = (x_1, x_2, \ldots, x_t)$] to predict $y_t$. How much of this information is captured by
a particular RNN depends on its structure and the training
algorithm. An illustration of the amount of input information
used for prediction with different kinds of NN’s is given in
Fig. 2.
Future input information coming up later than $x_t$ is usually also useful for prediction. With an RNN, this can be partially achieved by delaying the output by a certain number of time frames, say $M$, to include future information up to $x_{t+M}$ to predict $y_t$ (Fig. 2). Theoretically, $M$ could be made very large to capture all the available future information, but in practice, it is found that prediction results drop if $M$ is too large. A possible explanation for this could be that with rising $M$, the modeling power of the RNN is increasingly concentrated on remembering the input information up to $x_{t+M}$ for the prediction of $y_t$, leaving less modeling power for combining the prediction knowledge from different input vectors.
While delaying the output by some frames has been used
successfully to improve results in a practical speech recogni-
tion system [12], which was also confirmed by the experiments
conducted here, the optimal delay is task dependent and has to

Fig. 2. Visualization of the amount of input information used for prediction by different network structures.
Fig. 3. General structure of the bidirectional recurrent neural network (BRNN) shown unfolded in time for three time steps.
be found by trial and error on a validation
test set. Certainly, a more elegant approach would be desirable.
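As an illustration of the output-delay scheme discussed above (a sketch under assumptions: the helper name, the NaN masking, and the array layout are not from the paper), delaying the targets by $M$ frames means that at step $t$ the network is asked to output $y_{t-M}$, i.e., only after it has consumed the inputs up to $x_t$:

    import numpy as np

    def delay_targets(targets, m):
        # At step t the network is trained to output targets[t - m]; the first m
        # steps get NaN and are masked out of the objective function.
        assert m >= 0
        shifted = np.full(targets.shape, np.nan)
        shifted[m:] = targets[:len(targets) - m]
        return shifted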
To use all available input information, it is possible to use
two separate networks (one for each time direction) and then
somehow merge the results. Both networks can then be called
experts for the specific problem on which the networks are
trained. One way of merging the opinions of different experts
is to assume the opinions to be independent, which leads to
arithmetic averaging for regression and to geometric averaging
(or, alternatively, to an arithmetic averaging in the log domain)
for classification. These merging procedures are referred to
as linear opinion pooling and logarithmic opinion pooling,
respectively [1], [7]. Although simple merging of network
outputs has been applied successfully in practice [14], it is
generally not clear how to merge network outputs in an optimal
way since different networks trained on the same data can no
longer be regarded as independent.
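A minimal sketch of the two pooling rules just described, assuming equal weighting of the two experts (the function names and the equal weights are illustrative choices, not taken from the paper):

    import numpy as np

    def linear_opinion_pool(y_forward, y_backward):
        # Regression: arithmetic average of the two experts' predictions.
        return 0.5 * (y_forward + y_backward)

    def logarithmic_opinion_pool(p_forward, p_backward):
        # Classification: geometric average of the class posteriors
        # (arithmetic averaging in the log domain), renormalized to sum to one.
        merged = np.sqrt(p_forward * p_backward)
        return merged / merged.sum(axis=-1, keepdims=True)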
B. Bidirectional Recurrent Neural Networks
To overcome the limitations of a regular RNN outlined
in the previous section, we propose a bidirectional recurrent
neural network (BRNN) that can be trained using all available
input information in the past and future of a specific time
frame.
1) Structure: The idea is to split the state neurons of a regular RNN into a part that is responsible for the positive time direction (forward states) and a part for the negative time
direction (backward states). Outputs from forward states are
not connected to inputs of backward states, and vice versa.
This leads to the general structure that can be seen in Fig. 3,
where it is unfolded over three time steps. It is not possible to
display the BRNN structure in a figure similar to Fig. 1 with
the delay line since the delay would have to be positive and
negative in time. Note that without the backward states, this
structure simplifies to a regular unidirectional forward RNN,
as shown in Fig. 1. If the forward states are taken out, a
regular RNN with a reversed time axis results. With both time
directions taken care of in the same network, input information
in the past and the future of the currently evaluated time frame
can directly be used to minimize the objective function without
the need for delays to include future information, as for the
regular unidirectional RNN discussed above.
2) Training: The BRNN can, in principle, be trained with the same algorithms as a regular unidirectional RNN because there are no interactions between the two types of state neurons; the network can therefore be unfolded into a general feedforward network. However, if, for example, any form of
back-propagation through time (BPTT) is used, the forward
and backward pass procedure is slightly more complicated
because the update of state and output neurons can no longer
be done one at a time. If BPTT is used, the forward and
backward passes over the unfolded BRNN over time are done
almost in the same way as for a regular MLP. Some special
treatment is necessary only at the beginning and the end of
the training data. The forward state inputs at $t = 1$ and the backward state inputs at $t = T$ are not known. Setting these
could be made part of the learning process, but here, they
are set arbitrarily to a fixed value (0.5). In addition, the local
state derivatives at $t = T$ for the forward states and at $t = 1$ for the backward states are not known and are set here to
zero, assuming that the information beyond that point is not
important for the current update, which is, for the boundaries,
certainly the case. The training procedure for the unfolded
bidirectional network over time can be summarized as follows.
1) FORWARD PASS
   Run all input data for one time slice $1 \le t \le T$ through the BRNN and determine all predicted outputs.
   a) Do forward pass just for forward states (from $t = 1$ to $t = T$) and backward states (from $t = T$ to $t = 1$).
   b) Do forward pass for output neurons.
2) BACKWARD PASS
   Calculate the part of the objective function derivative for the time slice $1 \le t \le T$ used in the forward pass.
   a) Do backward pass for output neurons.
   b) Do backward pass just for forward states (from $t = T$ to $t = 1$) and backward states (from $t = 1$ to $t = T$).
3) UPDATE WEIGHTS
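To make the pass ordering above concrete, here is a minimal NumPy sketch of the BRNN forward pass for one sequence. It is not the authors' implementation: the weight names, the tanh state nonlinearity, and the linear output layer are illustrative assumptions; the 0.5 boundary value for the unknown states follows the description above.

    import numpy as np

    def brnn_forward(x, Wf, Vf, Wb, Vb, Uf, Ub, bo, boundary=0.5):
        """One forward pass of a bidirectional RNN over a sequence.

        x        : array of shape (T, n_in), the input sequence x_1 ... x_T
        Wf, Vf   : input->forward-state and forward-state recurrent weights
        Wb, Vb   : input->backward-state and backward-state recurrent weights
        Uf, Ub   : forward-state->output and backward-state->output weights
        bo       : output bias
        boundary : value used for the unknown states before t=1 and after t=T
        """
        T = x.shape[0]
        n_f, n_b = Vf.shape[0], Vb.shape[0]

        # Forward states: processed from t = 1 to t = T.
        h_f = np.zeros((T, n_f))
        prev = np.full(n_f, boundary)
        for t in range(T):
            prev = np.tanh(Wf @ x[t] + Vf @ prev)
            h_f[t] = prev

        # Backward states: processed from t = T down to t = 1.
        h_b = np.zeros((T, n_b))
        nxt = np.full(n_b, boundary)
        for t in reversed(range(T)):
            nxt = np.tanh(Wb @ x[t] + Vb @ nxt)
            h_b[t] = nxt

        # The output at time t sees both the forward and the backward state,
        # i.e., information from the whole input sequence.
        y = h_f @ Uf.T + h_b @ Ub.T + bo
        return y, h_f, h_b

For classification, the linear outputs would be passed through a softmax; training then proceeds with any BPTT variant, with the state-gradient recursions run in the reversed orders listed in the BACKWARD PASS step above.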
C. Experiments and Results
In this section, we describe a number of experiments with
the goal of comparing the performance of the BRNN structure
with that of other structures. In order to provide a fair com-
parison, we have used different structures with a comparable
number of parameters as a rough complexity measure. The
experiments are done for artificial data for both regression
and classification tasks with small networks to allow extensive
experiments and for real data for a phoneme classification task
with larger networks.
1) Experiments with Artificial Data:
a) Description of Data: In these experiments, an artifi-
cial data set is used to conduct a set of regression and
classification experiments. The artificial data is generated as
follows. First, a stream of 10 000 random numbers between
zero and one is created as the one-dimensional (1-D) input
data to the ANN. The 1-D output data (the desired output) is
obtained as the weighted sum of the inputs within a window
of 10 frames to the left and 20 frames to the right with respect
to the current frame. The weighting falls off linearly on both sides of the window.
The weighting procedure introduces correlations between
neighboring input/output data pairs that become less for
data pairs further apart. Note that the correlations are not
symmetrical: the correlated region on the right side of each frame is twice as “broad” as the one on the left side. For the classification
TABLE I: DETAILS OF REGRESSION AND CLASSIFICATION ARCHITECTURES EVALUATED IN OUR EXPERIMENTS
experiments, the output data is mapped to two classes, with
class 0 for all output values below (or equal to) 0.5 and class
1 for all output values above 0.5, giving approximately 59%
of the data to class 0 and 41% to class 1.
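The following NumPy sketch generates data in the same spirit. Only the 10 000-sample stream of uniform inputs, the 10-left/20-right window, the linear falloff, and the 0.5 class threshold come from the text; the exact weight normalization and boundary handling are assumptions, so the class proportions will not exactly match the 59%/41% split reported above.

    import numpy as np

    rng = np.random.default_rng(0)
    T, left, right = 10_000, 10, 20

    x = rng.random(T)                          # 1-D inputs in [0, 1)

    # Assumed triangular weights: fall off linearly to zero on both sides.
    offsets = np.arange(-left, right + 1)
    w = np.where(offsets < 0,
                 1.0 + offsets / (left + 1),   # left side, 10 frames
                 1.0 - offsets / (right + 1))  # right side, 20 frames
    w /= w.sum()                               # normalization is an assumption

    # Desired 1-D output: weighted sum over the window around each frame.
    x_padded = np.pad(x, (left, right), mode="edge")   # boundary handling assumed
    y = np.correlate(x_padded, w, mode="valid")

    labels = (y > 0.5).astype(int)             # class 1 above 0.5, else class 0
    # The paper's ~59%/41% split depends on its original (unreproduced) weighting.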
b) Experiments: Separate experiments are conducted for
regression and classification tasks. For each task, four different
architectures are tested (Table I). Type “MERGE” refers to the merged results of types “RNN-FOR” and “RNN-BACK,” which are regular unidirectional recurrent neural networks trained in the forward and backward time directions, respectively. The first three architecture types are also evaluated over
different shifts of the output data in the positive time direction,
allowing the RNN to use future information, as discussed
above.
Every test (ANN training/evaluation) is run 100 times with
different initializations of the ANN to get at least partially
rid of random fluctuations of the results due to convergence
to local minima of the objective function. All networks are
trained with 200 cycles of a modified version of the resilient propagation (RPROP) technique [10], extended to an “RPROP through time” variant. All weights in the structure
are initialized in the range (
) drawn from the uniform
distribution, except the output biases, which are set so that
the corresponding output gives the prior average of the output
data in case of zero input activation.
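For reference, a minimal sketch of a standard RPROP weight update, which the training scheme above builds on (the paper's specific modification and its through-time extension are not detailed here; the step-size constants are the usual defaults and are assumptions, not values from the paper):

    import numpy as np

    def rprop_update(w, grad, prev_grad, step, *,
                     eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
        # Only the sign of the gradient is used; each weight keeps its own step
        # size, grown when the gradient keeps its sign and shrunk when it flips.
        sign_change = grad * prev_grad
        step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
        step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
        # Where the sign flipped, skip this weight change (one common RPROP
        # variant) and forget the gradient so the step is not shrunk twice.
        grad = np.where(sign_change < 0, 0.0, grad)
        w = w - np.sign(grad) * step
        return w, grad, step   # pass the returned grad in as prev_grad next cycle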
For the regression experiments, the networks use the
activation function and are trained to minimize the mean-
squared-error objective function. For type “MERGE,” the
arithmetic mean of the network outputs of “RNN-FOR” and
“RNN-BACK” is taken, which assumes them to be indepen-
dent, as discussed above for the linear opinion pool.
For the classification experiments, the output layer uses the
“softmax” output function [4] so that outputs add up to one
and can be interpreted as probabilities. As commonly used for
ANN’s to be trained as classifiers, the cross-entropy objective
function is used as the optimization criterion. Because the
outputs are probabilities assumed to be generated by inde-
pendent events, for type “MERGE,” the normalized geometric
mean (logarithmic opinion pool) of the network outputs of
“RNN-FOR” and “RNN-BACK” is taken.
c) Results: The results for the regression and the classifi-
cation experiments averaged over 100 training/evaluation runs
can be seen in Figs. 4 and 5, respectively. For the regression
task, the mean squared error depending on the shift of the
output data in positive time direction seen from the time
axis of the network is shown. For the classification task, the
recognition rate, instead of the mean value of the objective
function (which would be the mean cross-entropy), is shown

Fig. 4. Averaged results (100 runs) for the regression experiment on artificial data over different shifts of the output data with respect to the input data
in future direction (viewed from the time axis of the corresponding network) for several structures.
because it is a more familiar measure to characterize results
of classification experiments.
Several interesting properties of RNN’s in general can be
directly seen from these figures. The minimum (maximum)
for the regression (classification) task should be at 20 frames
delay for the forward RNN and at 10 frames delay for the
backward RNN because at those points, all information for
a perfect regression (classification) has been fed into the
network. Neither is the case because the modeling power
of the networks given by the structure and the number of
free parameters is not sufficient for the optimal solution.
Instead, the single time direction networks try to make a
tradeoff between “remembering” the past input information,
which is useful for regression (classification), and “knowledge
combining” of currently available input information. This
results in an optimal delay of one (two) frame for the forward
RNN and five (six) frames for the backward RNN. The
optimum delay is larger for the backward RNN because the
artificially created correlations in the training data are not
symmetrical with the important information for regression
(classification) being twice as dense on the left side as on the
right side of each frame. In the case of the backward RNN,
the time series is evaluated from right to left with the denser
information coming up later. Because the denser information
can be evaluated more easily (fewer parameters are necessary for
a contribution to the objective function minimization), the
optimal delay is larger for the backward RNN. If the delay
is so large that almost no important information can be saved
over time, the network converges to the best possible solution
based only on prior information. This can be seen for the
classification task with the backward RNN, which converges
to 59% (prior of class 0) for more than 15 frames delay.
Another sign for the tradeoff between “remembering” and
“knowledge combining” is the variation in the standard devia-
tion of the results, which is only shown for the backward RNN
in the classification task. In areas where both mechanisms could be useful (a 3- to 17-frame shift), different local minima of the objective function correspond, to a certain extent, to either one of these mechanisms, which results in larger fluctuations of the results than in areas where “remembering” is not very useful (−5- to 3-frame shift) or not possible (17- to 20-frame shift).
If the outputs of forward and backward RNN’s are merged
so that all available past and future information for regression
(classification) is present, the results for the delays tested here
(−2 to 10) are, in almost all cases, better than with only one
network. This is no surprise because besides the use of more
useful input information, the number of free parameters for
the model doubled.
For the BRNN, it does not make sense to delay the output
data because the structure is already designed to cope with
all available input information on both sides of the currently
evaluated time point. Therefore, the experiments for the BRNN
are only run for SHIFT $= 0$. For the regression and classifica-
tion tasks tested here, the BRNN clearly performs better than
the network “MERGE” built out of the single time-direction
networks “RNN-FOR” and “RNN-BACK,” with a comparable
number of total free parameters.
2) Experiments with Real Data: The goal of the experi-
ments with real data is to compare different ANN structures

References
C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.: Oxford Univ. Press, 1995.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1. Cambridge, MA: MIT Press, 1986.
“Neural Networks for Pattern Recognition,” book chapter.
J. O. Berger, Statistical Decision Theory and Bayesian Analysis. New York: Springer-Verlag, 1985.