IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 45, NO. 11, NOVEMBER 1997
Bidirectional Recurrent Neural Networks
Mike Schuster and Kuldip K. Paliwal, Member, IEEE
Abstract: In the first part of this paper, a regular recurrent
neural network (RNN) is extended to a bidirectional recurrent
neural network (BRNN). The BRNN can be trained without
the limitation of using input information just up to a preset
future frame. This is accomplished by training it simultaneously
in positive and negative time direction. Structure and training
procedure of the proposed network are explained. In regression
and classification experiments on artificial data, the proposed
structure gives better results than other approaches. For real
data, classification experiments for phonemes from the TIMIT
database show the same tendency.
In the second part of this paper, it is shown how the proposed
bidirectional structure can be easily modified to allow efficient
estimation of the conditional posterior probability of complete
symbol sequences without making any explicit assumption about
the shape of the distribution. For this part, experiments on real
data are reported.
Index Terms: Recurrent neural networks.
I. INTRODUCTION
A. General
Many classification and regression problems of engi-
neering interest are currently solved with statistical
approaches using the principle of “learning from examples.”
For a certain model with a given structure inferred from the
prior knowledge about the problem and characterized by a
number of parameters, the aim is to estimate these parameters
accurately and reliably using a finite amount of training data.
In general, the parameters of the model are determined by a
supervised training process, whereas the structure of the model
is defined in advance. Choosing a proper structure for the
model is often the only way for the designer of the system
to put in prior knowledge about the solution of the problem.
Artificial neural networks (ANN’s) (see [2] for an excellent
introduction) are one group of models that take the principle
“infer the knowledge from the data” to an extreme. In this
paper, we are interested in studying ANN structures for one
particular class of problems that are represented by temporal
sequences of input–output data pairs. For these types of
problems, which occur, for example, in speech recognition,
time series prediction, dynamic control systems, etc., one of
the challenges is to choose an appropriate network structure
Manuscript received June 5, 1997. The associate editor coordinating the
review of this paper and approving it for publication was Prof. Jenq-Neng
Hwang.
M. Schuster is with the ATR Interpreting Telecommunications Research
Laboratory, Kyoto, Japan.
K. K. Paliwal is with the ATR Interpreting Telecommunications Research
Laboratory, Kyoto, Japan, on leave from the School of Microelectronic
Engineering, Griffith University, Brisbane, Australia.
Publisher Item Identifier S 1053-587X(97)08055-0.
that, at least theoretically, is able to use all available input
information to predict a point in the output space.
Many ANN structures have been proposed in the literature
to deal with time varying patterns. Multilayer perceptrons
(MLP’s) have the limitation that they can only deal with
static data patterns (i.e., input patterns of a predefined dimen-
sionality), which requires definition of the size of the input
window in advance. Waibel et al. [16] have pursued time delay
neural networks (TDNN’s), which have proven to be a useful
improvement over regular MLP’s in many applications. The
basic idea of a TDNN is to tie certain parameters in a regular
MLP structure without restricting the learning capability of the
ANN too much. Recurrent neural networks (RNN’s) [5], [8],
[12], [13], [15] provide another alternative for incorporating
temporal dynamics and are discussed in more detail in a later
section.
In this paper, we investigate different ANN structures for
incorporating temporal dynamics. We conduct a number of
experiments using both artificial and real-world data. We show
the superiority of RNN’s over the other structures. We then
point out some of the limitations of RNN’s and propose a
modified version of an RNN called a bidirectional recurrent
neural network, which overcomes these limitations.
B. Technical
Consider a (time) sequence of input data vectors $x_1^T = (x_1, x_2, \ldots, x_T)$ and a sequence of corresponding output data vectors $y_1^T = (y_1, y_2, \ldots, y_T)$, with neighboring data pairs (in time) being somehow statistically dependent. Given time sequences $x_1^T$ and $y_1^T$ as training
data, the aim is to learn the rules to predict the output data
given the input data. Inputs and outputs can, in general, be
continuous and/or categorical variables. When outputs are
continuous, the problem is known as a regression problem,
and when they are categorical (class labels), the problem is
known as a classification problem. In this paper, the term
prediction is used as a general term that includes regression
and classification.
1) Unimodal Regression: For unimodal regression or func-
tion approximation, the components of the output vectors are
continuous variables. The ANN parameters are estimated to
maximize some predefined objective criterion (e.g., maximize
the likelihood of the output data). When the distribution of the
errors between the desired and the estimated output vectors
is assumed to be Gaussian with zero mean and a fixed global
data-dependent variance, the likelihood criterion reduces to the

Fig. 1. General structure of a regular unidirectional RNN shown (a) with a delay line and (b) unfolded in time for two time steps.
convenient Euclidean distance measure between the desired
and the estimated output vectors or the mean-squared-error
criterion, which has to be minimized during training [2]. It
has been shown by a number of researchers [2], [9] that neural networks can estimate the conditional average of the desired output (or target) vectors at their network outputs, i.e., $E[y_t \mid x_1^T]$, where $E[\cdot]$ is the expectation operator.
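To spell out this reduction (a standard derivation written in the reconstructed notation used here, not an equation reproduced from the paper): if the target $y_t$ is modeled as the network output $\hat{y}_t$ plus zero-mean Gaussian noise with a fixed variance $\sigma^2$, the negative log-likelihood of the training sequence is

    -\log \prod_{t=1}^{T} p(y_t \mid x_1^T)
        = \frac{1}{2\sigma^2} \sum_{t=1}^{T} \| y_t - \hat{y}_t \|^2 + \text{const},

so maximizing the likelihood is equivalent to minimizing the sum (or mean) of squared errors, whose minimizer in expectation is the conditional mean $E[y_t \mid x_1^T]$.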
2) Classification: In the case of a classification problem,
one seeks the most probable class $c_t$ out of a given pool of $K$ classes for every time frame $t$, given an input vector sequence $x_1^T$. To make this kind of problem suitable to be solved by an ANN, the categorical variables are usually coded as vectors as follows. Consider that $c_t$ is the desired class label for the frame at time $t$. Then, construct an output vector $y_t$ such that its $c_t$th component is one and the other components are zero. The output vector sequence $y_1^T$ constructed in this manner, along with the input vector sequence $x_1^T$, can be
used to train the network under some optimality criterion,
usually the cross-entropy criterion [2], [9], which results from
a maximum likelihood estimation assuming a multinomial
output distribution. It has been shown [3], [6], [9] that the $k$th network output at each time point $t$ can be interpreted as an estimate of the conditional posterior probability of class membership $\Pr(C_t = k \mid x_1^T)$ for class $k$, with the quality of the estimate depending on the size of the training data and the complexity of the network.
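In the same reconstructed notation (again a standard result rather than an equation reproduced from the paper): with one-of-$K$ coded targets $y_{t,k}$ and network outputs $\hat{y}_{t,k}$ that sum to one over $k$, maximum likelihood estimation under a multinomial output model amounts to minimizing the cross-entropy

    E = -\sum_{t=1}^{T} \sum_{k=1}^{K} y_{t,k} \log \hat{y}_{t,k},

and the $k$th output of a sufficiently flexible, well-trained network approaches the posterior $\Pr(C_t = k \mid x_1^T)$.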
For some applications, it is not necessary to estimate the conditional posterior probability $\Pr(C_t = c_t \mid x_1^T)$ of a single class given the sequence of input vectors but rather the conditional posterior probability $\Pr(C_1^T = c_1^T \mid x_1^T)$ of a sequence of classes given the sequence of input vectors.¹
C. Organization of the Paper
This paper is organized in two parts. Given a series of paired input/output vectors $(x_t, y_t)$, $t = 1, 2, \ldots, T$, we want to train bidirectional recurrent neural networks to perform the following tasks.

• Unimodal regression [i.e., compute $E(y_t \mid x_1^T)$] or classification [i.e., compute $\Pr(C_t = c_t \mid x_1^T)$ for every output class and decide the class using the maximum a posteriori decision rule]. In this case, the outputs are treated as statistically independent. Experiments for this part are conducted for artificial toy data as well as for real data.
• Estimation of the conditional probability of a complete sequence of classes of length $T$ using all available input information [i.e., compute $\Pr(C_1^T = c_1^T \mid x_1^T)$]. In this case, the outputs are treated as being statistically dependent, which makes the estimation more difficult and requires a slightly different network structure than the one used in the first part. For this part, results of experiments for real data are reported.

¹ Here, we want to make a distinction between $C_t$ and $c_t$: $C_t$ is a categorical random variable, and $c_t$ is its value.
II. PREDICTION ASSUMING INDEPENDENT OUTPUTS
A. Recurrent Neural Networks
RNN’s provide a very elegant way of dealing with (time)
sequential data that embodies correlations between data points
that are close in the sequence. Fig. 1 shows a basic RNN
architecture with a delay line and unfolded in time for two
time steps. In this structure, the input vectors $x_t$ are fed one at a time into the RNN. Instead of using a fixed number of input vectors, as done in the MLP and TDNN structures, this architecture can make use of all the available input information up to the current time frame $t$ [i.e., $x_1^t = (x_1, x_2, \ldots, x_t)$] to predict $y_t$. How much of this information is captured by
a particular RNN depends on its structure and the training
algorithm. An illustration of the amount of input information
used for prediction with different kinds of NN’s is given in
Fig. 2.
Future input information coming up later than $x_t$ is usually also useful for prediction. With an RNN, this can be partially achieved by delaying the output by a certain number of time frames, say $M$, to include future information up to $x_{t+M}$ to predict $y_t$ (Fig. 2). Theoretically, $M$ could be made very large to capture all the available future information, but in practice, it is found that prediction results drop if $M$ is too large. A possible explanation for this could be that with rising $M$, the modeling power of the RNN is increasingly concentrated on remembering the input information up to $x_{t+M}$ for the prediction of $y_t$, leaving less modeling power for combining the prediction knowledge from different input vectors.
While delaying the output by some frames has been used
successfully to improve results in a practical speech recogni-
tion system [12], which was also confirmed by the experiments
conducted here, the optimal delay is task dependent and has to

Fig. 2. Visualization of the amount of input information used for prediction by different network structures.
Fig. 3. General structure of the bidirectional recurrent neural network (BRNN) shown unfolded in time for three time steps.
be found by trial and error on a validation
test set. Certainly, a more elegant approach would be desirable.
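As an illustration of the output-delay scheme discussed above (a sketch under assumptions: the helper name, the NaN masking, and the array layout are not from the paper), delaying the targets by $M$ frames means that at step $t$ the network is asked to output $y_{t-M}$, i.e., only after it has consumed the inputs up to $x_t$:

    import numpy as np

    def delay_targets(targets, m):
        # At step t the network is trained to output targets[t - m]; the first m
        # steps get NaN and are masked out of the objective function.
        assert m >= 0
        shifted = np.full(targets.shape, np.nan)
        shifted[m:] = targets[:len(targets) - m]
        return shifted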
To use all available input information, it is possible to use
two separate networks (one for each time direction) and then
somehow merge the results. Both networks can then be called
experts for the specific problem on which the networks are
trained. One way of merging the opinions of different experts
is to assume the opinions to be independent, which leads to
arithmetic averaging for regression and to geometric averaging
(or, alternatively, to an arithmetic averaging in the log domain)
for classification. These merging procedures are referred to
as linear opinion pooling and logarithmic opinion pooling,
respectively [1], [7]. Although simple merging of network
outputs has been applied successfully in practice [14], it is
generally not clear how to merge network outputs in an optimal
way since different networks trained on the same data can no
longer be regarded as independent.
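A minimal sketch of the two pooling rules just described, assuming equal weighting of the two experts (the function names and the equal weights are illustrative choices, not taken from the paper):

    import numpy as np

    def linear_opinion_pool(y_forward, y_backward):
        # Regression: arithmetic average of the two experts' predictions.
        return 0.5 * (y_forward + y_backward)

    def logarithmic_opinion_pool(p_forward, p_backward):
        # Classification: geometric average of the class posteriors
        # (arithmetic averaging in the log domain), renormalized to sum to one.
        merged = np.sqrt(p_forward * p_backward)
        return merged / merged.sum(axis=-1, keepdims=True)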
B. Bidirectional Recurrent Neural Networks
To overcome the limitations of a regular RNN outlined
in the previous section, we propose a bidirectional recurrent
neural network (BRNN) that can be trained using all available
input information in the past and future of a specific time
frame.
1) Structure: The idea is to split the state neurons of a regular RNN into a part that is responsible for the positive time direction (forward states) and a part for the negative time
direction (backward states). Outputs from forward states are
not connected to inputs of backward states, and vice versa.
This leads to the general structure that can be seen in Fig. 3,
where it is unfolded over three time steps. It is not possible to
display the BRNN structure in a figure similar to Fig. 1 with
the delay line since the delay would have to be positive and
negative in time. Note that without the backward states, this
structure simplifies to a regular unidirectional forward RNN,
as shown in Fig. 1. If the forward states are taken out, a
regular RNN with a reversed time axis results. With both time
directions taken care of in the same network, input information
in the past and the future of the currently evaluated time frame
can directly be used to minimize the objective function without
the need for delays to include future information, as for the
regular unidirectional RNN discussed above.
2) Training: The BRNN can, in principle, be trained with the same algorithms as a regular unidirectional RNN because there are no interactions between the two types of state neurons; the network can therefore be unfolded into a general feedforward network. However, if, for example, any form of
back-propagation through time (BPTT) is used, the forward
and backward pass procedure is slightly more complicated
because the update of state and output neurons can no longer
be done one at a time. If BPTT is used, the forward and
backward passes over the unfolded BRNN over time are done
almost in the same way as for a regular MLP. Some special
treatment is necessary only at the beginning and the end of
the training data. The forward state inputs at $t = 1$ and the backward state inputs at $t = T$ are not known. Setting these
could be made part of the learning process, but here, they
are set arbitrarily to a fixed value (0.5). In addition, the local
state derivatives at $t = T$ for the forward states and at $t = 1$ for the backward states are not known and are set here to
zero, assuming that the information beyond that point is not
important for the current update, which is, for the boundaries,
certainly the case. The training procedure for the unfolded
bidirectional network over time can be summarized as follows.
1) FORWARD PASS
   Run all input data for one time slice $1 \le t \le T$ through the BRNN and determine all predicted outputs.
   a) Do forward pass just for forward states (from $t = 1$ to $t = T$) and backward states (from $t = T$ to $t = 1$).
   b) Do forward pass for output neurons.
2) BACKWARD PASS
   Calculate the part of the objective function derivative for the time slice $1 \le t \le T$ used in the forward pass.
   a) Do backward pass for output neurons.
   b) Do backward pass just for forward states (from $t = T$ to $t = 1$) and backward states (from $t = 1$ to $t = T$).
3) UPDATE WEIGHTS
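To make the pass ordering above concrete, here is a minimal NumPy sketch of the BRNN forward pass for one sequence. It is not the authors' implementation: the weight names, the tanh state nonlinearity, and the linear output layer are illustrative assumptions; the 0.5 boundary value for the unknown states follows the description above.

    import numpy as np

    def brnn_forward(x, Wf, Vf, Wb, Vb, Uf, Ub, bo, boundary=0.5):
        """One forward pass of a bidirectional RNN over a sequence.

        x        : array of shape (T, n_in), the input sequence x_1 ... x_T
        Wf, Vf   : input->forward-state and forward-state recurrent weights
        Wb, Vb   : input->backward-state and backward-state recurrent weights
        Uf, Ub   : forward-state->output and backward-state->output weights
        bo       : output bias
        boundary : value used for the unknown states before t=1 and after t=T
        """
        T = x.shape[0]
        n_f, n_b = Vf.shape[0], Vb.shape[0]

        # Forward states: processed from t = 1 to t = T.
        h_f = np.zeros((T, n_f))
        prev = np.full(n_f, boundary)
        for t in range(T):
            prev = np.tanh(Wf @ x[t] + Vf @ prev)
            h_f[t] = prev

        # Backward states: processed from t = T down to t = 1.
        h_b = np.zeros((T, n_b))
        nxt = np.full(n_b, boundary)
        for t in reversed(range(T)):
            nxt = np.tanh(Wb @ x[t] + Vb @ nxt)
            h_b[t] = nxt

        # The output at time t sees both the forward and the backward state,
        # i.e., information from the whole input sequence.
        y = h_f @ Uf.T + h_b @ Ub.T + bo
        return y, h_f, h_b

For classification, the linear outputs would be passed through a softmax; training then proceeds with any BPTT variant, with the state-gradient recursions run in the reversed orders listed in the BACKWARD PASS step above.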
C. Experiments and Results
In this section, we describe a number of experiments with
the goal of comparing the performance of the BRNN structure
with that of other structures. In order to provide a fair com-
parison, we have used different structures with a comparable
number of parameters as a rough complexity measure. The
experiments are done for artificial data for both regression
and classification tasks with small networks to allow extensive
experiments and for real data for a phoneme classification task
with larger networks.
1) Experiments with Artificial Data:
a) Description of Data: In these experiments, an artifi-
cial data set is used to conduct a set of regression and
classification experiments. The artificial data is generated as
follows. First, a stream of 10 000 random numbers between
zero and one is created as the one-dimensional (1-D) input
data to the ANN. The 1-D output data (the desired output) is
obtained as the weighted sum of the inputs within a window
of 10 frames to the left and 20 frames to the right with respect
to the current frame. The weighting falls off linearly on both sides of the window.
The weighting procedure introduces correlations between
neighboring input/output data pairs that become less for
data pairs further apart. Note that the correlations are not
symmetrical: the correlated region on the right side of each frame is twice as “broad” as the one on the left side. For the classification
TABLE I: DETAILS OF REGRESSION AND CLASSIFICATION ARCHITECTURES EVALUATED IN OUR EXPERIMENTS
experiments, the output data is mapped to two classes, with
class 0 for all output values below (or equal to) 0.5 and class
1 for all output values above 0.5, giving approximately 59%
of the data to class 0 and 41% to class 1.
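The following NumPy sketch generates data in the same spirit. Only the 10 000-sample stream of uniform inputs, the 10-left/20-right window, the linear falloff, and the 0.5 class threshold come from the text; the exact weight normalization and boundary handling are assumptions, so the class proportions will not exactly match the 59%/41% split reported above.

    import numpy as np

    rng = np.random.default_rng(0)
    T, left, right = 10_000, 10, 20

    x = rng.random(T)                          # 1-D inputs in [0, 1)

    # Assumed triangular weights: fall off linearly to zero on both sides.
    offsets = np.arange(-left, right + 1)
    w = np.where(offsets < 0,
                 1.0 + offsets / (left + 1),   # left side, 10 frames
                 1.0 - offsets / (right + 1))  # right side, 20 frames
    w /= w.sum()                               # normalization is an assumption

    # Desired 1-D output: weighted sum over the window around each frame.
    x_padded = np.pad(x, (left, right), mode="edge")   # boundary handling assumed
    y = np.correlate(x_padded, w, mode="valid")

    labels = (y > 0.5).astype(int)             # class 1 above 0.5, else class 0
    # The paper's ~59%/41% split depends on its original (unreproduced) weighting.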
b) Experiments: Separate experiments are conducted for
regression and classification tasks. For each task, four different
architectures are tested (Table I). Type “MERGE” refers to the merged results of types “RNN-FOR” and “RNN-BACK,” which are regular unidirectional recurrent neural networks trained in the forward and backward time directions, respectively. The first three architecture types are also evaluated over
different shifts of the output data in the positive time direction,
allowing the RNN to use future information, as discussed
above.
Every test (ANN training/evaluation) is run 100 times with
different initializations of the ANN to get at least partially
rid of random fluctuations of the results due to convergence
to local minima of the objective function. All networks are
trained with 200 cycles of a modified version of the resilient propagation (RPROP) technique [10], extended to an “RPROP through time” variant. All weights in the structure
are initialized in the range (
) drawn from the uniform
distribution, except the output biases, which are set so that
the corresponding output gives the prior average of the output
data in case of zero input activation.
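For reference, a minimal sketch of a standard RPROP weight update, which the training scheme above builds on (the paper's specific modification and its through-time extension are not detailed here; the step-size constants are the usual defaults and are assumptions, not values from the paper):

    import numpy as np

    def rprop_update(w, grad, prev_grad, step, *,
                     eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
        # Only the sign of the gradient is used; each weight keeps its own step
        # size, grown when the gradient keeps its sign and shrunk when it flips.
        sign_change = grad * prev_grad
        step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
        step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
        # Where the sign flipped, skip this weight change (one common RPROP
        # variant) and forget the gradient so the step is not shrunk twice.
        grad = np.where(sign_change < 0, 0.0, grad)
        w = w - np.sign(grad) * step
        return w, grad, step   # pass the returned grad in as prev_grad next cycle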
For the regression experiments, the networks use the
activation function and are trained to minimize the mean-
squared-error objective function. For type “MERGE,” the
arithmetic mean of the network outputs of “RNN-FOR” and
“RNN-BACK” is taken, which assumes them to be indepen-
dent, as discussed above for the linear opinion pool.
For the classification experiments, the output layer uses the
“softmax” output function [4] so that outputs add up to one
and can be interpreted as probabilities. As commonly used for
ANN’s to be trained as classifiers, the cross-entropy objective
function is used as the optimization criterion. Because the
outputs are probabilities assumed to be generated by inde-
pendent events, for type “MERGE,” the normalized geometric
mean (logarithmic opinion pool) of the network outputs of
“RNN-FOR” and “RNN-BACK” is taken.
c) Results: The results for the regression and the classifi-
cation experiments averaged over 100 training/evaluation runs
can be seen in Figs. 4 and 5, respectively. For the regression
task, the mean squared error depending on the shift of the
output data in positive time direction seen from the time
axis of the network is shown. For the classification task, the
recognition rate, instead of the mean value of the objective
function (which would be the mean cross-entropy), is shown

Fig. 4. Averaged results (100 runs) for the regression experiment on artificial data over different shifts of the output data with respect to the input data
in future direction (viewed from the time axis of the corresponding network) for several structures.
because it is a more familiar measure to characterize results
of classification experiments.
Several interesting properties of RNN’s in general can be
directly seen from these figures. The minimum (maximum)
for the regression (classification) task should be at 20 frames
delay for the forward RNN and at 10 frames delay for the
backward RNN because at those points, all information for
a perfect regression (classification) has been fed into the
network. Neither is the case because the modeling power
of the networks given by the structure and the number of
free parameters is not sufficient for the optimal solution.
Instead, the single time direction networks try to make a
tradeoff between “remembering” the past input information,
which is useful for regression (classification), and “knowledge
combining” of currently available input information. This
results in an optimal delay of one (two) frame for the forward
RNN and five (six) frames for the backward RNN. The
optimum delay is larger for the backward RNN because the
artificially created correlations in the training data are not
symmetrical with the important information for regression
(classification) being twice as dense on the left side as on the
right side of each frame. In the case of the backward RNN,
the time series is evaluated from right to left with the denser
information coming up later. Because the denser information
can be evaluated more easily (fewer parameters are necessary for
a contribution to the objective function minimization), the
optimal delay is larger for the backward RNN. If the delay
is so large that almost no important information can be saved
over time, the network converges to the best possible solution
based only on prior information. This can be seen for the
classification task with the backward RNN, which converges
to 59% (prior of class 0) for more than 15 frames delay.
Another sign for the tradeoff between “remembering” and
“knowledge combining” is the variation in the standard devia-
tion of the results, which is only shown for the backward RNN
in the classification task. In areas where both mechanisms could be useful (a 3- to 17-frame shift), different local minima of the objective function correspond, to a certain extent, to either one of these mechanisms, which results in larger fluctuations of the results than in areas where “remembering” is not very useful (−5- to 3-frame shift) or not possible (17- to 20-frame shift).
If the outputs of forward and backward RNN’s are merged
so that all available past and future information for regression
(classification) is present, the results for the delays tested here
(−2 to 10) are, in almost all cases, better than with only one
network. This is no surprise because besides the use of more
useful input information, the number of free parameters for
the model doubled.
For the BRNN, it does not make sense to delay the output
data because the structure is already designed to cope with
all available input information on both sides of the currently
evaluated time point. Therefore, the experiments for the BRNN
are only run for SHIFT $= 0$. For the regression and classifica-
tion tasks tested here, the BRNN clearly performs better than
the network “MERGE” built out of the single time-direction
networks “RNN-FOR” and “RNN-BACK,” with a comparable
number of total free parameters.
2) Experiments with Real Data: The goal of the experi-
ments with real data is to compare different ANN structures

References
C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.: Oxford Univ. Press, 1995.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1. Cambridge, MA: MIT Press, 1986.
“Neural Networks for Pattern Recognition,” book chapter.
J. O. Berger, Statistical Decision Theory and Bayesian Analysis. New York: Springer-Verlag, 1985.