
Learning to Translate in Real-time with Neural Machine Translation
Jiatao Gu, Graham Neubig, Kyunghyun Cho and Victor O.K. Li
The University of Hong Kong / Carnegie Mellon University / New York University
{jiataogu, vli}@eee.hku.hk, gneubig@cs.cmu.edu, kyunghyun.cho@nyu.edu
Abstract
Translating in real-time, a.k.a. simultaneous translation, outputs translation words before the input sentence ends, which is a challenging problem for conventional machine translation methods. We propose a neural machine translation (NMT) framework for simultaneous translation in which an agent learns to make decisions on when to translate from the interaction with a pre-trained NMT environment. To trade off quality and delay, we extensively explore various targets for delay and design a method for beam-search applicable in the simultaneous MT setting. Experiments against state-of-the-art baselines on two language pairs demonstrate the efficacy of the proposed framework both quantitatively and qualitatively.¹
1 Introduction
Simultaneous translation, the task of translating content in real-time as it is produced, is an important tool for real-time understanding of spoken lectures or conversations (Fügen et al., 2007; Bangalore et al., 2012). Different from the typical machine translation (MT) task, in which translation quality is paramount, simultaneous translation requires balancing the trade-off between translation quality and time delay to ensure that users receive translated content in an expeditious manner (Mieno et al., 2015). A number of methods have been proposed to solve this problem, mostly in the context of phrase-based machine translation. These methods are based on a segmenter, which receives the input one word at a time, then decides when to send it to an MT system that translates each segment independently (Oda et al., 2014) or with a minimal amount of language model context (Bangalore et al., 2012).
¹ Code and data can be found at https://github.com/nyu-dl/dl4mt-simul-trans.
[Figure 1: Example output from the proposed framework in DE→EN simultaneous translation. The heat-map represents the soft alignment between the incoming source sentence (left, top-to-bottom) and the emitted translation (top, left-to-right). The length of each column represents the number of source words waited for before emitting the translation. Best viewed when zoomed digitally.]
Independently of simultaneous translation, the accuracy of standard MT systems has greatly improved with the introduction of neural-network-based MT systems (NMT) (Sutskever et al., 2014; Bahdanau et al., 2014). Very recently, there have been a few efforts to apply NMT to simultaneous translation, either through heuristic modifications to the decoding process (Cho and Esipova, 2016), or through the training of an independent segmentation network that chooses when to perform output using a standard NMT model (Satija and Pineau, 2016). However, the former model lacks a capability to learn the appropriate timing with which to perform translation, and the latter model uses a standard NMT model as-is, lacking a holistic design of the modeling and learning within the simultaneous MT context. In addition, neither model has demonstrated gains over previous segmentation-based baselines, leaving questions of their relative merit unresolved.
In this paper, we propose a unified design for learning to perform neural simultaneous machine translation. The proposed framework is based on formulating translation as an interleaved sequence of two actions: READ and WRITE. Based on this, we devise a model connecting the NMT system and these READ/WRITE decisions. An example of how translation is performed in this framework is shown in Fig. 1, and detailed definitions of the problem and proposed framework are described in §2 and §3. To learn which actions to take when, we propose a reinforcement-learning-based strategy with a reward function that considers both quality and delay (§4). We also develop a beam-search method that performs search within the translation segments (§5).
We evaluate the proposed method on English-Russian (EN-RU) and English-German (EN-DE) translation in both directions (§6). The quantitative results show strong improvements compared to both the NMT-based algorithm and conventional segmentation methods. We also extensively analyze the effectiveness of the learning algorithm and the influence of the trade-off in the optimization criterion by varying the target delay. Finally, qualitative visualization is utilized to discuss the potential and limitations of the framework.
2 Problem Definition
Suppose we have a buffer of input words $X = \{x_1, ..., x_{T_s}\}$ to be translated in real-time. We define the simultaneous translation task as sequentially making two interleaved decisions: READ or WRITE. More precisely, the translator READs a source word $x_\eta$ from the input buffer in chronological order as translation context, or WRITEs a translated word $y_\tau$ onto the output buffer, resulting in an output sentence $Y = \{y_1, ..., y_{T_t}\}$ and an action sequence $A = \{a_1, ..., a_T\}$ consisting of $T_s$ READs and $T_t$ WRITEs, so $T = T_s + T_t$.
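To make the buffer bookkeeping concrete, the following minimal sketch (ours, not from the paper) replays a hypothetical action sequence over the two buffers; the German tokens, the action sequence, and the echoing `translate_step` are all invented for illustration.

```python
READ, WRITE = "READ", "WRITE"

def replay(source, actions, translate_step):
    """Consume `source` on READ; emit one target word on WRITE."""
    read, written = [], []
    src = iter(source)
    for a in actions:
        if a == READ:
            read.append(next(src))      # wait for one more source word
        else:
            written.append(translate_step(read, written))
    return written

# Toy stand-in for the NMT system: echo the aligned source word.
source = ["Gestern", "Abend", "haben", "wir"]
actions = [READ, READ, WRITE, READ, WRITE, READ, WRITE, WRITE]
output = replay(source, actions, lambda read, out: read[len(out)])
assert len(actions) == len(source) + len(output)   # T = T_s + T_t
```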
Similar to standard MT, we have a measure $Q(Y)$ to evaluate the translation quality, such as the BLEU score (Papineni et al., 2002). For simultaneous translation we are also concerned with the fact that each action incurs a time delay $D(A)$. $D(A)$ will mainly be influenced by the delay caused by READ, as this entails waiting for a human speaker to continue speaking (about 0.3s per word for an average speaker), while WRITE consists of generating a few words from a machine translation system, which is possible on the order of milliseconds. Thus, our objective is finding an optimal policy that generates decision sequences with a good trade-off between higher quality $Q(Y)$ and lower delay $D(A)$. We elaborate on exactly how to define this trade-off in §4.2.

[Figure 2: Illustration of the proposed framework: at each step, the NMT environment (left) computes a candidate translation. The recurrent agent (right) takes the observation, including the candidates, and sends back a decision: READ or WRITE.]
In the following sections, we first describe how to connect the READ/WRITE actions with the NMT system (§3), and how to optimize the system to improve simultaneous MT results (§4).
3 Simultaneous Translation with Neural Machine Translation
The proposed framework is shown in Fig. 2, and can be naturally decomposed into two parts: environment (§3.1) and agent (§3.2).
3.1 Environment
Encoder: READ The first element of the NMT system is the encoder, which converts input words $X = \{x_1, ..., x_{T_s}\}$ into context vectors $H = \{h_1, ..., h_{T_s}\}$. Standard NMT uses bi-directional RNNs as encoders (Bahdanau et al., 2014), but this is not suitable for simultaneous processing, as using a reverse-order encoder requires knowing the final word of the sentence before beginning processing. Thus, we utilize a simple left-to-right uni-directional RNN as our encoder:

$$h_\eta = \phi_{\text{UNI-ENC}}(h_{\eta-1}, x_\eta) \qquad (1)$$
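Concretely, Eq. 1 means each READ appends exactly one vector to $H$, and nothing downstream ever needs the unread suffix. Below is a minimal sketch of this incremental encoding, assuming a plain tanh RNN and invented dimensions as stand-ins for the paper's GRU encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
D_EMB, D_HID = 8, 16                         # invented sizes
W_x = rng.normal(0, 0.1, (D_HID, D_EMB))     # input-to-hidden weights
W_h = rng.normal(0, 0.1, (D_HID, D_HID))     # hidden-to-hidden weights

def uni_enc_step(h_prev, x_emb):
    """One step of phi_UNI-ENC; a tanh RNN stands in for the GRU."""
    return np.tanh(W_x @ x_emb + W_h @ h_prev)

H = []                                       # H^eta grows by one per READ
h = np.zeros(D_HID)
for x_emb in rng.normal(size=(5, D_EMB)):    # five dummy word embeddings
    h = uni_enc_step(h, x_emb)
    H.append(h)                              # always a prefix of the full H
```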
Decoder: WRITE Similar to standard MT, we use an attention-based decoder. In contrast, we only reference the words that have been read from the input when generating each target word:

$$c_\tau^\eta = \phi_{\text{ATT}}(z_{\tau-1}, y_{\tau-1}, H^\eta)$$
$$z_\tau^\eta = \phi_{\text{DEC}}(z_{\tau-1}, y_{\tau-1}, c_\tau^\eta)$$
$$p(y \mid y_{<\tau}, H^\eta) \propto \exp\left[\phi_{\text{OUT}}(z_\tau^\eta)\right] \qquad (2)$$

where $z_{\tau-1}$ and $y_{\tau-1}$ represent the previous decoder state and output word, respectively, and $H^\eta$ represents the incomplete input states, i.e. $H^\eta$ is a prefix of $H$. As the WRITE action calculates the probability of the next word on the fly, we need greedy decoding for each step:

$$y_\tau^\eta = \arg\max_y \, p(y \mid y_{<\tau}, H^\eta) \qquad (3)$$

Note that $y_\tau^\eta$ and $z_\tau^\eta$ correspond to $H^\eta$ and are the candidates for $y_\tau$ and $z_\tau$. The agent described in the next section decides whether to take this candidate or wait for better predictions.
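The following sketch illustrates Eqs. 2–3 under simplifying assumptions of our own: a bilinear attention score, a tanh update that reuses the attention matrix, and omission of the previous word embedding $y_{\tau-1}$; the paper's exact parameterization differs.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def write_candidate(z_prev, H_eta, W_att, W_out):
    """Greedy candidate (Eq. 3) computed only over the read prefix H^eta."""
    scores = np.array([h @ (W_att @ z_prev) for h in H_eta])
    c = softmax(scores) @ np.stack(H_eta)     # context c_tau^eta (phi_ATT)
    z = np.tanh(W_att @ z_prev + c)           # next state z_tau^eta (phi_DEC)
    p = softmax(W_out @ z)                    # p(y | y_<tau, H^eta) (phi_OUT)
    return int(p.argmax()), z

rng = np.random.default_rng(1)
D, V = 16, 100                                # hidden size, vocabulary size
W_att = rng.normal(0, 0.1, (D, D))
W_out = rng.normal(0, 0.1, (V, D))
H_eta = [rng.normal(size=D) for _ in range(3)]   # three words read so far
y_cand, z_cand = write_candidate(np.zeros(D), H_eta, W_att, W_out)
```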
3.2 Agent
A trainable agent is designed to make decisions $A = \{a_1, ..., a_T\}$, $a_t \in \mathcal{A}$, sequentially based on observations $O = \{o_1, ..., o_T\}$, $o_t \in \mathcal{O}$, and then control the translation environment properly.
Observation As shown in Fig. 2, we concatenate the current context vector $c_\tau^\eta$, the current decoder state $z_\tau^\eta$ and the embedding vector of the candidate word $y_\tau^\eta$ into the continuous observation $o_{\tau+\eta} = [c_\tau^\eta; z_\tau^\eta; E(y_\tau^\eta)]$ to represent the current state.
Action Similarly to prior work (Grissom II et al., 2014), we define the following set of actions:
• READ: the agent rejects the candidate and waits to encode the next word from the input buffer;
• WRITE: the agent accepts the candidate and emits it as the prediction into the output buffer.
Policy How the agent chooses actions based on observations defines the policy. In our setting, we utilize a stochastic policy $\pi_\theta$ parameterized by a recurrent neural network, that is:

$$s_t = f_\theta(s_{t-1}, o_t)$$
$$\pi_\theta(a_t \mid a_{<t}, o_{\le t}) \propto g_\theta(s_t) \qquad (4)$$

where $s_t$ is the internal state of the agent, updated recurrently to yield the distribution of the action $a_t$. Based on the policy of our agent, the overall algorithm of greedy decoding is shown in Algorithm 1. The algorithm outputs the translation result and a sequence of observation-action pairs.
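Before turning to Algorithm 1, here is a minimal sketch of Eq. 4, with a vanilla RNN standing in for the 512-unit GRU used in the experiments; dimensions are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
D_OBS, D_AGENT = 24, 32                          # invented sizes
U_o = rng.normal(0, 0.1, (D_AGENT, D_OBS))
U_s = rng.normal(0, 0.1, (D_AGENT, D_AGENT))
U_a = rng.normal(0, 0.1, (2, D_AGENT))           # logits for {READ, WRITE}

def policy_step(s_prev, o_t):
    """s_t = f_theta(s_{t-1}, o_t); action distribution from g_theta(s_t)."""
    s_t = np.tanh(U_o @ o_t + U_s @ s_prev)
    logits = U_a @ s_t
    e = np.exp(logits - logits.max())
    return s_t, e / e.sum()

s, pi = policy_step(np.zeros(D_AGENT), rng.normal(size=D_OBS))
a_t = rng.choice(["READ", "WRITE"], p=pi)        # sample during training
# (at test time, the paper instead picks the higher-probability action)
```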
Algorithm 1 Simultaneous Greedy Decoding
Require: NMT system φ, policy π_θ, τ_MAX, input buffer X, output buffer Y, state buffer S.
 1: Init x_1 ⇐ X, h_1 ← φ_ENC(x_1), H^1 ← {h_1}
 2: z_0 ← φ_INIT(H^1), y_0 ← ⟨s⟩
 3: τ ← 0, η ← 1
 4: while τ < τ_MAX do
 5:     t ← τ + η
 6:     y^η_τ, z^η_τ, o_t ← φ(z_{τ−1}, y_{τ−1}, H^η)
 7:     a_t ∼ π_θ(a_t; a_{<t}, o_{≤t}), S ⇐ (o_t, a_t)
 8:     if a_t = READ and x_η ≠ ⟨/s⟩ then
 9:         x_{η+1} ⇐ X, h_{η+1} ← φ_ENC(h_η, x_{η+1})
10:         H^{η+1} ← H^η ∪ {h_{η+1}}, η ← η + 1
11:         if |Y| = 0 then z_0 ← φ_INIT(H^η)
12:     else if a_t = WRITE then
13:         z_τ ← z^η_τ, y_τ ← y^η_τ
14:         Y ⇐ y_τ, τ ← τ + 1
15:         if y_τ = ⟨/s⟩ then break
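The control flow of Algorithm 1 can be sketched compactly as below; `env_step` and `agent_step` are hypothetical callables standing in for the NMT environment (Eqs. 1–3) and the policy (Eq. 4), and the encoder/decoder state bookkeeping is elided.

```python
EOS, READ, WRITE = "</s>", "READ", "WRITE"

def simultaneous_greedy(env_step, agent_step, source, tau_max=100):
    """Sketch of Algorithm 1: returns translation Y and the log S."""
    Y, S = [], []
    eta = 1                                      # source words read so far
    while len(Y) < tau_max:
        y_cand, o_t = env_step(source[:eta], Y)  # candidate word + observation
        a_t = agent_step(o_t)
        S.append((o_t, a_t))
        if a_t == READ and source[eta - 1] != EOS:
            eta += 1                             # wait for one more source word
        elif a_t == WRITE:
            Y.append(y_cand)                     # commit the candidate
            if y_cand == EOS:
                break
    return Y, S
```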
4 Learning
The proposed framework can be trained using reinforcement learning. More precisely, we use a policy gradient algorithm together with variance reduction and regularization techniques.
4.1 Pre-training
We need an NMT environment for the agent to explore and use to generate translations. Here, we simply pre-train the NMT encoder-decoder on full sentence pairs with maximum likelihood, and assume the pre-trained model is still able to generate reasonable translations even on incomplete source sentences. Although this is likely sub-optimal, our NMT environment based on uni-directional RNNs can treat incomplete source sentences in a manner similar to shorter source sentences and has the potential to translate them more-or-less correctly.
4.2 Reward Function
The policy is learned in order to increase a reward for the translation. At each step the agent will receive a reward signal $r_t$ based on $(o_t, a_t)$. To evaluate a good simultaneous machine translation, the reward must consider both quality and delay.
Quality We evaluate the translation quality using metrics such as BLEU (Papineni et al., 2002). The BLEU score is defined as the weighted geometric average of the modified n-gram precisions $\text{BLEU}_0$, multiplied by the brevity penalty BP to punish a short translation. In practice, the vanilla BLEU score is not a good metric at the sentence level because, being a geometric average, the score reduces to zero if one of the precisions is zero. To avoid this, we used a smoothed version of BLEU in our implementation (Lin and Och, 2004):

$$\text{BLEU}(Y, Y^*) = \text{BP} \cdot \text{BLEU}_0(Y, Y^*), \qquad (5)$$

where $Y^*$ is the reference and $Y$ is the output. We decompose BLEU and use the difference of partial BLEU scores as the reward, that is:

$$r_t^Q = \begin{cases} \Delta\text{BLEU}_0(Y, Y^*, t) & t < T \\ \text{BLEU}(Y, Y^*) & t = T \end{cases} \qquad (6)$$

where $Y^t$ is the cumulative output at $t$ ($Y^0 = \emptyset$), and $\Delta\text{BLEU}_0(Y, Y^*, t) = \text{BLEU}_0(Y^t, Y^*) - \text{BLEU}_0(Y^{t-1}, Y^*)$. Obviously, if $a_t = \text{READ}$, no new words are written into $Y$, yielding $r_t^Q = 0$. Note that we do not multiply BP until the end of the sentence, as it would heavily penalize partial translation results.
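A minimal sketch of the quality reward in Eq. 6 follows. The add-one smoothing here is only a rough stand-in for the smoothed BLEU of Lin and Och (2004), and the final-step substitution of full BLEU (with BP) is left out for brevity.

```python
from collections import Counter

def ngrams(seq, n):
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def bleu0(hyp, ref, max_n=4):
    """Geometric mean of clipped n-gram precisions, without brevity penalty."""
    score = 1.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        matched = sum(min(count, r[g]) for g, count in h.items())
        score *= (matched + 1) / (max(sum(h.values()), 1) + 1)  # add-one smoothing
    return score ** (1.0 / max_n)

def quality_rewards(prefixes, ref):
    """r_t^Q = BLEU_0(Y^t, Y*) - BLEU_0(Y^{t-1}, Y*) for successive prefixes."""
    scores = [bleu0(p, ref) for p in prefixes]
    return [cur - prev for prev, cur in zip([0.0] + scores, scores)]
```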
Delay As another critical feature, delay judges how much time is wasted waiting for the translation. Ideally we would directly measure the actual time delay incurred by waiting for the next word. For simplicity, however, we suppose it consumes the same amount of time to listen for one more word. We define two measurements, global and local, respectively:

Average Proportion (AP): following the definition in (Cho and Esipova, 2016), where $X, Y$ are the source and decoded sequences respectively, and $s(\tau)$ denotes the number of source words that have been waited for when decoding word $y_\tau$:

$$0 < d(X, Y) = \frac{1}{|X||Y|} \sum_\tau s(\tau) \le 1$$
$$d_t = \begin{cases} 0 & t < T \\ d(X, Y) & t = T \end{cases} \qquad (7)$$

$d$ is a global delay metric, which defines the average waiting proportion of the source sentence when translating each word.
Consecutive Wait length (CW): in speech translation, listeners are also concerned with long silences during which no translation occurs. To capture this, we also consider how many words were waited for (READ) consecutively between translating two words. For each action, where we initially define $c_0 = 0$:

$$c_t = \begin{cases} c_{t-1} + 1 & a_t = \text{READ} \\ 0 & a_t = \text{WRITE} \end{cases} \qquad (8)$$
Target Delay: We further define "target delay" for both $d$ and $c$ as $d^*$ and $c^*$, respectively, as different simultaneous translation applications may have different requirements on delay. In our implementation, the reward function for delay is written as:

$$r_t^D = \alpha \cdot \left[\text{sgn}(c_t - c^*) + 1\right] + \beta \cdot \lfloor d_t - d^* \rfloor^+ \qquad (9)$$

where $\alpha \le 0$, $\beta \le 0$.
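Both delay measures and Eq. 9 can be computed directly from an action sequence, as in the sketch below; the values of alpha, beta, c_star and d_star are illustrative, not the paper's tuned settings.

```python
import numpy as np

def delay_rewards(actions, alpha=-0.025, beta=-1.0, c_star=2, d_star=0.5):
    """r_t^D per action; AP (Eq. 7) contributes only at the final step."""
    n_src = actions.count("READ")
    n_trg = actions.count("WRITE")
    waited = []                        # s(tau): source words seen per WRITE
    rewards, c, eta = [], 0, 0
    for t, a in enumerate(actions):
        if a == "READ":
            c, eta = c + 1, eta + 1    # consecutive wait (Eq. 8) grows
        else:
            waited.append(eta)
            c = 0
        d_t = 0.0
        if t == len(actions) - 1:      # global AP is only charged at t = T
            d_t = sum(waited) / max(n_src * n_trg, 1)
        rewards.append(alpha * (np.sign(c - c_star) + 1)
                       + beta * max(d_t - d_star, 0.0))
    return rewards
```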
Trade-off between quality and delay A good simultaneous translation system requires balancing the trade-off between translation quality and time delay. Obviously, achieving the best translation quality and the shortest translation delay is in a sense contradictory. In this paper, the trade-off is achieved by balancing the rewards $r_t = r_t^Q + r_t^D$ provided to the system, that is, by adjusting the coefficients $\alpha, \beta$ and the target delay $d^*, c^*$ in Eq. 9.
4.3 Reinforcement Learning
Policy Gradient We freeze the pre-trained parameters of the NMT model, and train the agent using the policy gradient (Williams, 1992). The policy gradient maximizes the expected cumulative future rewards, $J = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^T r_t\right]$, whose gradient is

$$\nabla_\theta J = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t) \, R_t\right] \qquad (10)$$
where $R_t = \sum_{k=t}^T \left[r_k^Q + r_k^D\right]$ is the cumulative future reward for the current observation and action. In practice, Eq. 10 is estimated by sampling multiple action trajectories from the current policy $\pi_\theta$ and collecting the corresponding rewards.
Variance Reduction Directly using the policy gradient suffers from high variance, which makes learning unstable and inefficient. We thus employ the variance reduction techniques suggested by Mnih and Gregor (2014). We subtract from $R_t$ the output of a baseline network $b_\varphi$ to obtain $\hat{R}_t = R_t - b_\varphi(o_t)$, and centre and re-scale the reward as $\tilde{R}_t = \frac{\hat{R}_t - \bar{b}}{\sqrt{\sigma^2 + \epsilon}}$, with a running average $\bar{b}$ and standard deviation $\sigma$. The baseline network is trained to minimize the squared loss:

$$L_\varphi = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^T \left\| R_t - b_\varphi(o_t) \right\|^2\right] \qquad (11)$$

We also regularize the negative entropy of the policy to facilitate exploration.
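A minimal sketch of the reward pipeline behind Eqs. 10–11: suffix-summed returns, baseline subtraction, and centred re-scaling. The baseline is a hypothetical callable, and batch statistics stand in for the paper's running mean and standard deviation.

```python
import numpy as np

def process_rewards(r_q, r_d, baseline, observations, eps=1e-8):
    """Return scaled returns R~_t that weight grad log pi in Eq. 10."""
    r = np.asarray(r_q) + np.asarray(r_d)
    R = np.cumsum(r[::-1])[::-1]            # R_t = sum_{k>=t} (r_k^Q + r_k^D)
    R_hat = R - np.array([baseline(o) for o in observations])
    mean, std = R_hat.mean(), R_hat.std()   # batch stand-ins for running stats
    return (R_hat - mean) / np.sqrt(std**2 + eps)
```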

Algorithm 2 Learning with Policy Gradient
Require: NMT system φ, agent θ, baseline ϕ
 1: Pretrain the NMT system φ using MLE;
 2: Initialize the agent θ;
 3: while stopping criterion fails do
 4:     Obtain a translation pair: {(X, Y*)};
 5:     for (Y, S) ∼ Simultaneous Decoding do
 6:         for (o_t, a_t) in S do
 7:             Compute the quality: r^Q_t;
 8:             Compute the delay: r^D_t;
 9:             Compute the baseline: b_ϕ(o_t);
10:     Collect the future rewards: {R_t};
11:     Perform variance reduction: {R̃_t};
12:     Update: θ ← θ + λ_1 ∇_θ [J − κ H(π_θ)]
13:     Update: ϕ ← ϕ − λ_2 ∇_ϕ L_ϕ
The overall learning algorithm is summarized in Algorithm 2. For efficiency, instead of updating with stochastic gradient descent (SGD) on a single sentence, both the agent and the baseline are optimized using a minibatch of multiple sentences.
5 Simultaneous Beam Search
In previous sections we described a simultaneous greedy decoding algorithm. In standard NMT it has been shown that beam search, where the decoder keeps a beam of $k$ translation trajectories, greatly improves translation quality (Sutskever et al., 2014), as shown in Fig. 3 (A).

It is non-trivial to directly apply beam search in simultaneous machine translation, as beam search waits until the last word to write down the translation. Based on our assumption that WRITE incurs no delay, we can perform a simultaneous beam search when the agent chooses to consecutively WRITE: keep multiple beams of translation trajectories in a temporary buffer and output the best path when the agent switches to READ. As shown in Fig. 3 (B) & (C), this searches for a relatively better path while keeping the delay unchanged.

Note that we do not re-train the agent for simultaneous beam search. At each step we simply feed the observation of the current best trajectory into the agent for making the next decision.
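A minimal sketch of the WRITE-segment expansion, under our own simplifications: `expand` (scored one-word extensions of a prefix) and `agent_says_write` are hypothetical callables, and length normalization is omitted. Because all candidate trajectories share the same READ positions, the delay metrics of §4.2 are unchanged.

```python
import heapq

def best(beams):
    return max(beams)[1]

def write_segment(beams, expand, agent_says_write, k=5):
    """Expand k trajectories while the agent keeps WRITE-ing; commit on READ."""
    while agent_says_write(best(beams)):
        candidates = [(score + step_score, prefix + [word])
                      for score, prefix in beams
                      for step_score, word in expand(prefix)]
        beams = heapq.nlargest(k, candidates)   # keep top-k by cumulative score
    return best(beams)                          # output best path, then READ
```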
6 Experiments
6.1 Settings
Dataset To extensively study the proposed simultaneous translation model, we train and evaluate it on two different language pairs: "English-German (EN-DE)" and "English-Russian (EN-RU)", in both directions per pair. We use the parallel corpora available from WMT'15² for both pre-training the NMT environment and learning the policy. We utilize newstest-2013 as the validation set to evaluate the proposed algorithm. Both the training set and the validation set are tokenized and segmented into sub-word units with byte-pair encoding (BPE) (Sennrich et al., 2015). We only use sentence pairs where both sides are less than 50 BPE subword symbols long for training.

[Figure 3: Illustrations of (A) beam search, (B) simultaneous greedy decoding and (C) simultaneous beam search.]
Environment & Agent Settings We pre-trained
the NMT environments for both language pairs
and both directions following the same setting
from (Cho and Esipova, 2016). We further built
our agents, using a recurrent policy with 512
GRUs and a softmax function to produce the ac-
tion distribution. All our agents are trained us-
ing policy gradient using Adam (Kingma and Ba,
2014) optimizer, with a mini-batch size of 10. For
each sentence pair in a batch, 5 trajectories are
sampled. For testing, instead of sampling we pick
the action with higher probability each step.
Baselines We compare the proposed methods against previously proposed baselines. For fair comparison, we use the same NMT environment:
• Wait-Until-End (WUE): an agent that starts to WRITE only when the last source word is seen. In general, we expect this to achieve the best quality of translation. We perform both greedy decoding and beam search with this method.
• Wait-One-Step (WOS): an agent that WRITEs after each READ. Such a policy is problematic when the source and target language pairs have different word orders or lengths (e.g. EN-DE).
² http://www.statmt.org/wmt15/

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural Machine Translation of Rare Words with Subword Units.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks.