
Learning to Translate in Real-time with Neural Machine Translation
Jiatao Gu, Graham Neubig, Kyunghyun Cho and Victor O.K. Li
The University of Hong Kong / Carnegie Mellon University / New York University
{jiataogu, vli}@eee.hku.hk, gneubig@cs.cmu.edu, kyunghyun.cho@nyu.edu
Abstract
Translating in real-time, a.k.a. simultaneous translation, outputs translation words before the input sentence ends, which is a challenging problem for conventional machine translation methods. We propose a neural machine translation (NMT) framework for simultaneous translation in which an agent learns to make decisions on when to translate from the interaction with a pre-trained NMT environment. To trade off quality and delay, we extensively explore various targets for delay and design a method for beam-search applicable in the simultaneous MT setting. Experiments against state-of-the-art baselines on two language pairs demonstrate the efficacy of the proposed framework both quantitatively and qualitatively.¹
1 Introduction
Simultaneous translation, the task of translating content in real-time as it is produced, is an important tool for real-time understanding of spoken lectures or conversations (Fügen et al., 2007; Bangalore et al., 2012). Different from the typical machine translation (MT) task, in which translation quality is paramount, simultaneous translation requires balancing the trade-off between translation quality and time delay to ensure that users receive translated content in an expeditious manner (Mieno et al., 2015). A number of methods have been proposed to solve this problem, mostly in the context of phrase-based machine translation. These methods are based on a segmenter, which receives the input one word at a time, then decides when to send it to an MT system that translates each segment independently (Oda et al., 2014) or with a minimal amount of language model context (Bangalore et al., 2012).
¹ Code and data can be found at https://github.com/nyu-dl/dl4mt-simul-trans.
[Figure 1: Example output from the proposed framework in DE→EN simultaneous translation. The heat-map represents the soft alignment between the incoming source sentence (left, top-to-bottom) and the emitted translation (top, left-to-right). The length of each column represents the number of source words waited for before emitting the translation. Best viewed when zoomed digitally.]
Independently of simultaneous translation, the accuracy of standard MT systems has greatly improved with the introduction of neural-network-based MT systems (NMT) (Sutskever et al., 2014; Bahdanau et al., 2014). Very recently, there have been a few efforts to apply NMT to simultaneous translation, either through heuristic modifications to the decoding process (Cho and Esipova, 2016), or through the training of an independent segmentation network that chooses when to perform output using a standard NMT model (Satija and Pineau, 2016). However, the former model lacks a capability to learn the appropriate timing with which to perform translation, and the latter model uses a standard NMT model as-is, lacking a holistic design of the modeling and learning within the simultaneous MT context. In addition, neither model has demonstrated gains over previous segmentation-based baselines, leaving questions of their relative merit unresolved.
In this paper, we propose a unified design for learning to perform neural simultaneous machine translation. The proposed framework is based on formulating translation as an interleaved sequence of two actions: READ and WRITE. Based on this, we devise a model connecting the NMT system and these READ/WRITE decisions. An example of how translation is performed in this framework is shown in Fig. 1, and detailed definitions of the problem and proposed framework are described in §2 and §3. To learn which actions to take when, we propose a reinforcement-learning-based strategy with a reward function that considers both quality and delay (§4). We also develop a beam-search method that performs search within the translation segments (§5).
We evaluate the proposed method on English-Russian (EN-RU) and English-German (EN-DE) translation in both directions (§6). The quantitative results show strong improvements compared to both the NMT-based algorithm and conventional segmentation methods. We also extensively analyze the effectiveness of the learning algorithm and the influence of the trade-off in the optimization criterion by varying the target delay. Finally, qualitative visualization is utilized to discuss the potential and limitations of the framework.
2 Problem Definition
Suppose we have a buffer of input words $X = \{x_1, ..., x_{T_s}\}$ to be translated in real-time. We define the simultaneous translation task as sequentially making two interleaved decisions: READ or WRITE. More precisely, the translator READs a source word $x_\eta$ from the input buffer in chronological order as translation context, or WRITEs a translated word $y_\tau$ onto the output buffer, resulting in an output sentence $Y = \{y_1, ..., y_{T_t}\}$ and an action sequence $A = \{a_1, ..., a_T\}$ consisting of $T_s$ READs and $T_t$ WRITEs, so $T = T_s + T_t$.
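To make the buffer bookkeeping concrete, the following minimal sketch (ours, not from the paper) replays a hypothetical action sequence over the two buffers; the German tokens, the action sequence, and the echoing `translate_step` are all invented for illustration.

```python
READ, WRITE = "READ", "WRITE"

def replay(source, actions, translate_step):
    """Consume `source` on READ; emit one target word on WRITE."""
    read, written = [], []
    src = iter(source)
    for a in actions:
        if a == READ:
            read.append(next(src))      # wait for one more source word
        else:
            written.append(translate_step(read, written))
    return written

# Toy stand-in for the NMT system: echo the aligned source word.
source = ["Gestern", "Abend", "haben", "wir"]
actions = [READ, READ, WRITE, READ, WRITE, READ, WRITE, WRITE]
output = replay(source, actions, lambda read, out: read[len(out)])
assert len(actions) == len(source) + len(output)   # T = T_s + T_t
```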
Similar to standard MT, we have a measure $Q(Y)$ to evaluate the translation quality, such as the BLEU score (Papineni et al., 2002). For simultaneous translation we are also concerned with the fact that each action incurs a time delay $D(A)$. $D(A)$ will mainly be influenced by the delay caused by READ, as this entails waiting for a human speaker to continue speaking (about 0.3s per word for an average speaker), while WRITE consists of generating a few words from a machine translation system, which is possible on the order of milliseconds. Thus, our objective is finding an optimal policy that generates decision sequences with a good trade-off between higher quality $Q(Y)$ and lower delay $D(A)$. We elaborate on exactly how to define this trade-off in §4.2.

[Figure 2: Illustration of the proposed framework: at each step, the NMT environment (left) computes a candidate translation. The recurrent agent (right) takes the observation, including the candidates, and sends back a decision: READ or WRITE.]
In the following sections, we first describe how to connect the READ/WRITE actions with the NMT system (§3), and how to optimize the system to improve simultaneous MT results (§4).
3 Simultaneous Translation with Neural Machine Translation
The proposed framework is shown in Fig. 2, and can be naturally decomposed into two parts: environment (§3.1) and agent (§3.2).
3.1 Environment
Encoder: READ The first element of the NMT system is the encoder, which converts input words $X = \{x_1, ..., x_{T_s}\}$ into context vectors $H = \{h_1, ..., h_{T_s}\}$. Standard NMT uses bi-directional RNNs as encoders (Bahdanau et al., 2014), but this is not suitable for simultaneous processing, as using a reverse-order encoder requires knowing the final word of the sentence before beginning processing. Thus, we utilize a simple left-to-right uni-directional RNN as our encoder:

$$h_\eta = \phi_{\text{UNI-ENC}}(h_{\eta-1}, x_\eta) \qquad (1)$$
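Concretely, Eq. 1 means each READ appends exactly one vector to $H$, and nothing downstream ever needs the unread suffix. Below is a minimal sketch of this incremental encoding, assuming a plain tanh RNN and invented dimensions as stand-ins for the paper's GRU encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
D_EMB, D_HID = 8, 16                         # invented sizes
W_x = rng.normal(0, 0.1, (D_HID, D_EMB))     # input-to-hidden weights
W_h = rng.normal(0, 0.1, (D_HID, D_HID))     # hidden-to-hidden weights

def uni_enc_step(h_prev, x_emb):
    """One step of phi_UNI-ENC; a tanh RNN stands in for the GRU."""
    return np.tanh(W_x @ x_emb + W_h @ h_prev)

H = []                                       # H^eta grows by one per READ
h = np.zeros(D_HID)
for x_emb in rng.normal(size=(5, D_EMB)):    # five dummy word embeddings
    h = uni_enc_step(h, x_emb)
    H.append(h)                              # always a prefix of the full H
```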
Decoder: WRITE Similar to standard MT, we use an attention-based decoder. In contrast, we only reference the words that have been read from the input when generating each target word:

$$c_\tau^\eta = \phi_{\text{ATT}}(z_{\tau-1}, y_{\tau-1}, H^\eta)$$
$$z_\tau^\eta = \phi_{\text{DEC}}(z_{\tau-1}, y_{\tau-1}, c_\tau^\eta)$$
$$p(y \mid y_{<\tau}, H^\eta) \propto \exp\left[\phi_{\text{OUT}}(z_\tau^\eta)\right] \qquad (2)$$

where $z_{\tau-1}$ and $y_{\tau-1}$ represent the previous decoder state and output word, respectively, and $H^\eta$ represents the incomplete input states, i.e. $H^\eta$ is a prefix of $H$. As the WRITE action calculates the probability of the next word on the fly, we need greedy decoding for each step:

$$y_\tau^\eta = \arg\max_y \, p(y \mid y_{<\tau}, H^\eta) \qquad (3)$$

Note that $y_\tau^\eta$ and $z_\tau^\eta$ correspond to $H^\eta$ and are the candidates for $y_\tau$ and $z_\tau$. The agent described in the next section decides whether to take this candidate or wait for better predictions.
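The following sketch illustrates Eqs. 2–3 under simplifying assumptions of our own: a bilinear attention score, a tanh update that reuses the attention matrix, and omission of the previous word embedding $y_{\tau-1}$; the paper's exact parameterization differs.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def write_candidate(z_prev, H_eta, W_att, W_out):
    """Greedy candidate (Eq. 3) computed only over the read prefix H^eta."""
    scores = np.array([h @ (W_att @ z_prev) for h in H_eta])
    c = softmax(scores) @ np.stack(H_eta)     # context c_tau^eta (phi_ATT)
    z = np.tanh(W_att @ z_prev + c)           # next state z_tau^eta (phi_DEC)
    p = softmax(W_out @ z)                    # p(y | y_<tau, H^eta) (phi_OUT)
    return int(p.argmax()), z

rng = np.random.default_rng(1)
D, V = 16, 100                                # hidden size, vocabulary size
W_att = rng.normal(0, 0.1, (D, D))
W_out = rng.normal(0, 0.1, (V, D))
H_eta = [rng.normal(size=D) for _ in range(3)]   # three words read so far
y_cand, z_cand = write_candidate(np.zeros(D), H_eta, W_att, W_out)
```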
3.2 Agent
A trainable agent is designed to make decisions $A = \{a_1, ..., a_T\}$, $a_t \in \mathcal{A}$, sequentially based on observations $O = \{o_1, ..., o_T\}$, $o_t \in \mathcal{O}$, and then control the translation environment properly.
Observation As shown in Fig. 2, we concatenate the current context vector $c_\tau^\eta$, the current decoder state $z_\tau^\eta$ and the embedding vector of the candidate word $y_\tau^\eta$ into the continuous observation $o_{\tau+\eta} = [c_\tau^\eta; z_\tau^\eta; E(y_\tau^\eta)]$ to represent the current state.
Action Similarly to prior work (Grissom II et al., 2014), we define the following set of actions:
• READ: the agent rejects the candidate and waits to encode the next word from the input buffer;
• WRITE: the agent accepts the candidate and emits it as the prediction into the output buffer.
Policy How the agent chooses actions based on observations defines the policy. In our setting, we utilize a stochastic policy $\pi_\theta$ parameterized by a recurrent neural network, that is:

$$s_t = f_\theta(s_{t-1}, o_t)$$
$$\pi_\theta(a_t \mid a_{<t}, o_{\le t}) \propto g_\theta(s_t) \qquad (4)$$

where $s_t$ is the internal state of the agent, updated recurrently to yield the distribution of the action $a_t$. Based on the policy of our agent, the overall algorithm of greedy decoding is shown in Algorithm 1. The algorithm outputs the translation result and a sequence of observation-action pairs.
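Before turning to Algorithm 1, here is a minimal sketch of Eq. 4, with a vanilla RNN standing in for the 512-unit GRU used in the experiments; dimensions are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
D_OBS, D_AGENT = 24, 32                          # invented sizes
U_o = rng.normal(0, 0.1, (D_AGENT, D_OBS))
U_s = rng.normal(0, 0.1, (D_AGENT, D_AGENT))
U_a = rng.normal(0, 0.1, (2, D_AGENT))           # logits for {READ, WRITE}

def policy_step(s_prev, o_t):
    """s_t = f_theta(s_{t-1}, o_t); action distribution from g_theta(s_t)."""
    s_t = np.tanh(U_o @ o_t + U_s @ s_prev)
    logits = U_a @ s_t
    e = np.exp(logits - logits.max())
    return s_t, e / e.sum()

s, pi = policy_step(np.zeros(D_AGENT), rng.normal(size=D_OBS))
a_t = rng.choice(["READ", "WRITE"], p=pi)        # sample during training
# (at test time, the paper instead picks the higher-probability action)
```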
Algorithm 1 Simultaneous Greedy Decoding
Require: NMT system φ, policy π_θ, τ_MAX, input buffer X, output buffer Y, state buffer S.
 1: Init x_1 ⇐ X, h_1 ← φ_ENC(x_1), H^1 ← {h_1}
 2: z_0 ← φ_INIT(H^1), y_0 ← ⟨s⟩
 3: τ ← 0, η ← 1
 4: while τ < τ_MAX do
 5:     t ← τ + η
 6:     y^η_τ, z^η_τ, o_t ← φ(z_{τ−1}, y_{τ−1}, H^η)
 7:     a_t ∼ π_θ(a_t; a_{<t}, o_{≤t}), S ⇐ (o_t, a_t)
 8:     if a_t = READ and x_η ≠ ⟨/s⟩ then
 9:         x_{η+1} ⇐ X, h_{η+1} ← φ_ENC(h_η, x_{η+1})
10:         H^{η+1} ← H^η ∪ {h_{η+1}}, η ← η + 1
11:         if |Y| = 0 then z_0 ← φ_INIT(H^η)
12:     else if a_t = WRITE then
13:         z_τ ← z^η_τ, y_τ ← y^η_τ
14:         Y ⇐ y_τ, τ ← τ + 1
15:         if y_τ = ⟨/s⟩ then break
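The control flow of Algorithm 1 can be sketched compactly as below; `env_step` and `agent_step` are hypothetical callables standing in for the NMT environment (Eqs. 1–3) and the policy (Eq. 4), and the encoder/decoder state bookkeeping is elided.

```python
EOS, READ, WRITE = "</s>", "READ", "WRITE"

def simultaneous_greedy(env_step, agent_step, source, tau_max=100):
    """Sketch of Algorithm 1: returns translation Y and the log S."""
    Y, S = [], []
    eta = 1                                      # source words read so far
    while len(Y) < tau_max:
        y_cand, o_t = env_step(source[:eta], Y)  # candidate word + observation
        a_t = agent_step(o_t)
        S.append((o_t, a_t))
        if a_t == READ and source[eta - 1] != EOS:
            eta += 1                             # wait for one more source word
        elif a_t == WRITE:
            Y.append(y_cand)                     # commit the candidate
            if y_cand == EOS:
                break
    return Y, S
```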
4 Learning
The proposed framework can be trained using reinforcement learning. More precisely, we use a policy gradient algorithm together with variance reduction and regularization techniques.
4.1 Pre-training
We need an NMT environment for the agent to explore and use to generate translations. Here, we simply pre-train the NMT encoder-decoder on full sentence pairs with maximum likelihood, and assume the pre-trained model is still able to generate reasonable translations even on incomplete source sentences. Although this is likely sub-optimal, our NMT environment based on uni-directional RNNs can treat incomplete source sentences in a manner similar to shorter source sentences and has the potential to translate them more-or-less correctly.
4.2 Reward Function
The policy is learned in order to increase a reward for the translation. At each step the agent will receive a reward signal $r_t$ based on $(o_t, a_t)$. To evaluate a good simultaneous machine translation, the reward must consider both quality and delay.
Quality We evaluate the translation quality using metrics such as BLEU (Papineni et al., 2002). The BLEU score is defined as the weighted geometric average of the modified n-gram precisions $\text{BLEU}_0$, multiplied by the brevity penalty BP to punish a short translation. In practice, the vanilla BLEU score is not a good metric at the sentence level because, being a geometric average, the score reduces to zero if one of the precisions is zero. To avoid this, we used a smoothed version of BLEU in our implementation (Lin and Och, 2004):

$$\text{BLEU}(Y, Y^*) = \text{BP} \cdot \text{BLEU}_0(Y, Y^*), \qquad (5)$$

where $Y^*$ is the reference and $Y$ is the output. We decompose BLEU and use the difference of partial BLEU scores as the reward, that is:

$$r_t^Q = \begin{cases} \Delta\text{BLEU}_0(Y, Y^*, t) & t < T \\ \text{BLEU}(Y, Y^*) & t = T \end{cases} \qquad (6)$$

where $Y^t$ is the cumulative output at $t$ ($Y^0 = \emptyset$), and $\Delta\text{BLEU}_0(Y, Y^*, t) = \text{BLEU}_0(Y^t, Y^*) - \text{BLEU}_0(Y^{t-1}, Y^*)$. Obviously, if $a_t = \text{READ}$, no new words are written into $Y$, yielding $r_t^Q = 0$. Note that we do not multiply BP until the end of the sentence, as it would heavily penalize partial translation results.
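A minimal sketch of the quality reward in Eq. 6 follows. The add-one smoothing here is only a rough stand-in for the smoothed BLEU of Lin and Och (2004), and the final-step substitution of full BLEU (with BP) is left out for brevity.

```python
from collections import Counter

def ngrams(seq, n):
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def bleu0(hyp, ref, max_n=4):
    """Geometric mean of clipped n-gram precisions, without brevity penalty."""
    score = 1.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        matched = sum(min(count, r[g]) for g, count in h.items())
        score *= (matched + 1) / (max(sum(h.values()), 1) + 1)  # add-one smoothing
    return score ** (1.0 / max_n)

def quality_rewards(prefixes, ref):
    """r_t^Q = BLEU_0(Y^t, Y*) - BLEU_0(Y^{t-1}, Y*) for successive prefixes."""
    scores = [bleu0(p, ref) for p in prefixes]
    return [cur - prev for prev, cur in zip([0.0] + scores, scores)]
```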
Delay As another critical feature, delay judges how much time is wasted waiting for the translation. Ideally we would directly measure the actual time delay incurred by waiting for the next word. For simplicity, however, we suppose it consumes the same amount of time to listen for one more word. We define two measurements, global and local, respectively:

Average Proportion (AP): following the definition in (Cho and Esipova, 2016), where $X, Y$ are the source and decoded sequences respectively, and $s(\tau)$ denotes the number of source words that have been waited for when decoding word $y_\tau$:

$$0 < d(X, Y) = \frac{1}{|X||Y|} \sum_\tau s(\tau) \le 1$$
$$d_t = \begin{cases} 0 & t < T \\ d(X, Y) & t = T \end{cases} \qquad (7)$$

$d$ is a global delay metric, which defines the average waiting proportion of the source sentence when translating each word.
Consecutive Wait length (CW): in speech translation, listeners are also concerned with long silences during which no translation occurs. To capture this, we also consider how many words were waited for (READ) consecutively between translating two words. For each action, where we initially define $c_0 = 0$:

$$c_t = \begin{cases} c_{t-1} + 1 & a_t = \text{READ} \\ 0 & a_t = \text{WRITE} \end{cases} \qquad (8)$$
Target Delay: We further define "target delay" for both $d$ and $c$ as $d^*$ and $c^*$, respectively, as different simultaneous translation applications may have different requirements on delay. In our implementation, the reward function for delay is written as:

$$r_t^D = \alpha \cdot \left[\text{sgn}(c_t - c^*) + 1\right] + \beta \cdot \lfloor d_t - d^* \rfloor^+ \qquad (9)$$

where $\alpha \le 0$, $\beta \le 0$.
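Both delay measures and Eq. 9 can be computed directly from an action sequence, as in the sketch below; the values of alpha, beta, c_star and d_star are illustrative, not the paper's tuned settings.

```python
import numpy as np

def delay_rewards(actions, alpha=-0.025, beta=-1.0, c_star=2, d_star=0.5):
    """r_t^D per action; AP (Eq. 7) contributes only at the final step."""
    n_src = actions.count("READ")
    n_trg = actions.count("WRITE")
    waited = []                        # s(tau): source words seen per WRITE
    rewards, c, eta = [], 0, 0
    for t, a in enumerate(actions):
        if a == "READ":
            c, eta = c + 1, eta + 1    # consecutive wait (Eq. 8) grows
        else:
            waited.append(eta)
            c = 0
        d_t = 0.0
        if t == len(actions) - 1:      # global AP is only charged at t = T
            d_t = sum(waited) / max(n_src * n_trg, 1)
        rewards.append(alpha * (np.sign(c - c_star) + 1)
                       + beta * max(d_t - d_star, 0.0))
    return rewards
```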
Trade-off between quality and delay A good simultaneous translation system requires balancing the trade-off between translation quality and time delay. Obviously, achieving the best translation quality and the shortest translation delay is in a sense contradictory. In this paper, the trade-off is achieved by balancing the rewards $r_t = r_t^Q + r_t^D$ provided to the system, that is, by adjusting the coefficients $\alpha, \beta$ and the target delay $d^*, c^*$ in Eq. 9.
4.3 Reinforcement Learning
Policy Gradient We freeze the pre-trained parameters of the NMT model, and train the agent using the policy gradient (Williams, 1992). The policy gradient maximizes the expected cumulative future rewards, $J = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^T r_t\right]$, whose gradient is

$$\nabla_\theta J = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t) \, R_t\right] \qquad (10)$$
where $R_t = \sum_{k=t}^T \left[r_k^Q + r_k^D\right]$ is the cumulative future reward for the current observation and action. In practice, Eq. 10 is estimated by sampling multiple action trajectories from the current policy $\pi_\theta$ and collecting the corresponding rewards.
Variance Reduction Directly using the policy gradient suffers from high variance, which makes learning unstable and inefficient. We thus employ the variance reduction techniques suggested by Mnih and Gregor (2014). We subtract from $R_t$ the output of a baseline network $b_\varphi$ to obtain $\hat{R}_t = R_t - b_\varphi(o_t)$, and centre and re-scale the reward as $\tilde{R}_t = \frac{\hat{R}_t - \bar{b}}{\sqrt{\sigma^2 + \epsilon}}$, with a running average $\bar{b}$ and standard deviation $\sigma$. The baseline network is trained to minimize the squared loss:

$$L_\varphi = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^T \left\| R_t - b_\varphi(o_t) \right\|^2\right] \qquad (11)$$

We also regularize the negative entropy of the policy to facilitate exploration.
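A minimal sketch of the reward pipeline behind Eqs. 10–11: suffix-summed returns, baseline subtraction, and centred re-scaling. The baseline is a hypothetical callable, and batch statistics stand in for the paper's running mean and standard deviation.

```python
import numpy as np

def process_rewards(r_q, r_d, baseline, observations, eps=1e-8):
    """Return scaled returns R~_t that weight grad log pi in Eq. 10."""
    r = np.asarray(r_q) + np.asarray(r_d)
    R = np.cumsum(r[::-1])[::-1]            # R_t = sum_{k>=t} (r_k^Q + r_k^D)
    R_hat = R - np.array([baseline(o) for o in observations])
    mean, std = R_hat.mean(), R_hat.std()   # batch stand-ins for running stats
    return (R_hat - mean) / np.sqrt(std**2 + eps)
```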

Algorithm 2 Learning with Policy Gradient
Require: NMT system φ, agent θ, baseline ϕ
 1: Pretrain the NMT system φ using MLE;
 2: Initialize the agent θ;
 3: while stopping criterion fails do
 4:     Obtain a translation pair: {(X, Y*)};
 5:     for (Y, S) ∼ Simultaneous Decoding do
 6:         for (o_t, a_t) in S do
 7:             Compute the quality: r^Q_t;
 8:             Compute the delay: r^D_t;
 9:             Compute the baseline: b_ϕ(o_t);
10:     Collect the future rewards: {R_t};
11:     Perform variance reduction: {R̃_t};
12:     Update: θ ← θ + λ_1 ∇_θ [J − κ H(π_θ)]
13:     Update: ϕ ← ϕ − λ_2 ∇_ϕ L_ϕ
The overall learning algorithm is summarized in Algorithm 2. For efficiency, instead of updating with stochastic gradient descent (SGD) on a single sentence, both the agent and the baseline are optimized using a minibatch of multiple sentences.
5 Simultaneous Beam Search
In previous sections we described a simultaneous greedy decoding algorithm. In standard NMT it has been shown that beam search, where the decoder keeps a beam of $k$ translation trajectories, greatly improves translation quality (Sutskever et al., 2014), as shown in Fig. 3 (A).

It is non-trivial to directly apply beam search in simultaneous machine translation, as beam search waits until the last word to write down the translation. Based on our assumption that WRITE incurs no delay, we can perform a simultaneous beam search when the agent chooses to consecutively WRITE: keep multiple beams of translation trajectories in a temporary buffer and output the best path when the agent switches to READ. As shown in Fig. 3 (B) & (C), this searches for a relatively better path while keeping the delay unchanged.

Note that we do not re-train the agent for simultaneous beam search. At each step we simply feed the observation of the current best trajectory into the agent for making the next decision.
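A minimal sketch of the WRITE-segment expansion, under our own simplifications: `expand` (scored one-word extensions of a prefix) and `agent_says_write` are hypothetical callables, and length normalization is omitted. Because all candidate trajectories share the same READ positions, the delay metrics of §4.2 are unchanged.

```python
import heapq

def best(beams):
    return max(beams)[1]

def write_segment(beams, expand, agent_says_write, k=5):
    """Expand k trajectories while the agent keeps WRITE-ing; commit on READ."""
    while agent_says_write(best(beams)):
        candidates = [(score + step_score, prefix + [word])
                      for score, prefix in beams
                      for step_score, word in expand(prefix)]
        beams = heapq.nlargest(k, candidates)   # keep top-k by cumulative score
    return best(beams)                          # output best path, then READ
```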
6 Experiments
6.1 Settings
Dataset To extensively study the proposed simultaneous translation model, we train and evaluate it on two different language pairs: "English-German (EN-DE)" and "English-Russian (EN-RU)", in both directions per pair. We use the parallel corpora available from WMT'15² for both pre-training the NMT environment and learning the policy. We utilize newstest-2013 as the validation set to evaluate the proposed algorithm. Both the training set and the validation set are tokenized and segmented into sub-word units with byte-pair encoding (BPE) (Sennrich et al., 2015). We only use sentence pairs where both sides are less than 50 BPE subword symbols long for training.

[Figure 3: Illustrations of (A) beam search, (B) simultaneous greedy decoding and (C) simultaneous beam search.]
Environment & Agent Settings We pre-trained
the NMT environments for both language pairs
and both directions following the same setting
from (Cho and Esipova, 2016). We further built
our agents, using a recurrent policy with 512
GRUs and a softmax function to produce the ac-
tion distribution. All our agents are trained us-
ing policy gradient using Adam (Kingma and Ba,
2014) optimizer, with a mini-batch size of 10. For
each sentence pair in a batch, 5 trajectories are
sampled. For testing, instead of sampling we pick
the action with higher probability each step.
Baselines We compare the proposed methods against previously proposed baselines. For fair comparison, we use the same NMT environment:
• Wait-Until-End (WUE): an agent that starts to WRITE only when the last source word is seen. In general, we expect this to achieve the best quality of translation. We perform both greedy decoding and beam search with this method.
• Wait-One-Step (WOS): an agent that WRITEs after each READ. Such a policy is problematic when the source and target language pairs have different word orders or lengths (e.g. EN-DE).
² http://www.statmt.org/wmt15/

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural Machine Translation of Rare Words with Subword Units.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks.