Fully Character-Level Neural Machine Translation
without Explicit Segmentation
Jason Lee*
ETH Zürich
jasonlee@inf.ethz.ch

Kyunghyun Cho
New York University
kyunghyun.cho@nyu.edu

Thomas Hofmann
ETH Zürich
thomas.hofmann@inf.ethz.ch

* The majority of this work was completed while the author was visiting New York University.

Transactions of the Association for Computational Linguistics, vol. 5, pp. 365–378, 2017. Action Editor: Adam Lopez. Submission batch: 11/2016; Revision batch: 2/2017; Published 10/2017. © 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Abstract
Most existing machine translation systems operate at the level of words, relying on explicit segmentation to extract tokens. We introduce a neural machine translation (NMT) model that maps a source character sequence to a target character sequence without any segmentation. We employ a character-level convolutional network with max-pooling at the encoder to reduce the length of the source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. Our character-to-character model outperforms a recently proposed baseline with a subword-level encoder on WMT'15 DE-EN and CS-EN, and gives comparable performance on FI-EN and RU-EN. We then demonstrate that it is possible to share a single character-level encoder across multiple languages by training a model on a many-to-one translation task. In this multilingual setting, the character-level encoder significantly outperforms the subword-level encoder on all the language pairs. We observe that on CS-EN, FI-EN and RU-EN, the quality of the multilingual character-level translation even surpasses the models specifically trained on that language pair alone, both in terms of BLEU score and human judgment.
1 Introduction
Nearly all previous work in machine translation has been at the level of words. Aside from our intuitive understanding of the word as a basic unit of meaning (Jackendoff, 1992), one reason behind this is that sequences are significantly longer when represented in characters, compounding the problem of data sparsity and modeling long-range dependencies. This has driven NMT research to be almost exclusively word-level (Bahdanau et al., 2015; Sutskever et al., 2014).

Despite their remarkable success, word-level NMT models suffer from several major weaknesses. For one, they are unable to model rare, out-of-vocabulary words, making them limited in translating languages with rich morphology such as Czech, Finnish and Turkish. If one uses a large vocabulary to combat this (Jean et al., 2015), the complexity of training and decoding grows linearly with respect to the target vocabulary size, leading to a vicious cycle.
To address this, we present a fully character-level NMT model that maps a character sequence in a source language to a character sequence in a target language. We show that our model outperforms a baseline with a subword-level encoder on DE-EN and CS-EN, and achieves a comparable result on FI-EN and RU-EN. A purely character-level NMT model with a basic encoder was proposed as a baseline by Luong and Manning (2016), but training it was prohibitively slow. We were able to train our model at a reasonable speed by drastically reducing the length of the source sentence representation using a stack of convolutional, pooling and highway layers.

One advantage of character-level models is that they are better suited for multilingual translation than their word-level counterparts, which require a separate word vocabulary for each language. We verify this by training a single model to translate four languages (German, Czech, Finnish and Russian) into English. Our multilingual character-level model outperforms the subword-level baseline by a considerable margin in all four language pairs, strongly indicating that a character-level model is more flexible in assigning its capacity to different language pairs. Furthermore, we observe that our multilingual character-level translation even exceeds the quality of bilingual translation in three out of four language pairs, both in terms of BLEU score and human evaluation. This demonstrates the excellent parameter efficiency of character-level translation in a multilingual setting. We also showcase our model's ability to handle intra-sentence code-switching while performing language identification on the fly.
The contributions of this work are twofold: we empirically show that (1) we can train a character-to-character NMT model without any explicit segmentation; and (2) we can share a single character-level encoder across multiple languages to build a multilingual translation system without increasing the model size.
2 Background: Attentional Neural Machine Translation

Neural machine translation (NMT) is a recently proposed approach to machine translation that builds a single neural network which takes as input a source sentence $X = (x_1, \ldots, x_{T_X})$ and generates its translation $Y = (y_1, \ldots, y_{T_Y})$, where $x_t$ and $y_{t'}$ are source and target symbols (Bahdanau et al., 2015; Sutskever et al., 2014; Luong et al., 2015; Cho et al., 2014a). Attentional NMT models have three components: an encoder, a decoder and an attention mechanism.
Encoder: Given a source sentence $X$, the encoder constructs a continuous representation that summarizes its meaning with a recurrent neural network (RNN). A bidirectional RNN is often implemented as proposed in Bahdanau et al. (2015). A forward encoder reads the input sentence from left to right: $\overrightarrow{h}_t = \overrightarrow{f}_{\mathrm{enc}}\big(E_x(x_t), \overrightarrow{h}_{t-1}\big)$. Similarly, a backward encoder reads it from right to left: $\overleftarrow{h}_t = \overleftarrow{f}_{\mathrm{enc}}\big(E_x(x_t), \overleftarrow{h}_{t+1}\big)$, where $E_x$ is the source embedding lookup table, and $\overrightarrow{f}_{\mathrm{enc}}$ and $\overleftarrow{f}_{\mathrm{enc}}$ are recurrent activation functions such as long short-term memory units (LSTMs) (Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRUs) (Cho et al., 2014b). The encoder constructs a set of continuous source sentence representations $C$ by concatenating the forward and backward hidden states at each timestep: $C = \{h_1, \ldots, h_{T_X}\}$, where $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.
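To make the notation concrete, the following is a minimal PyTorch-style sketch of such a bidirectional GRU encoder (not the paper's implementation; the module and dimension names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Bidirectional GRU encoder: concatenates forward and backward
    hidden states at every timestep, as in Bahdanau et al. (2015)."""

    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # E_x lookup table
        self.birnn = nn.GRU(emb_dim, hid_dim, bidirectional=True,
                            batch_first=True)

    def forward(self, src_ids):
        # src_ids: (batch, T_x) integer symbol ids
        emb = self.embed(src_ids)        # (batch, T_x, emb_dim)
        states, _ = self.birnn(emb)      # (batch, T_x, 2 * hid_dim)
        return states                    # C = {h_1, ..., h_Tx}
```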
Attention: First introduced in Bahdanau et al. (2015), the attention mechanism lets the decoder attend more to different source symbols for each target symbol. More concretely, it computes the context vector $c_{t'}$ at each decoding time step $t'$ as a weighted sum of the source hidden states: $c_{t'} = \sum_{t=1}^{T_X} \alpha_{t't} h_t$. Similarly to Chung et al. (2016) and Firat et al. (2016a), each attentional weight $\alpha_{t't}$ represents how relevant the $t$-th source token $x_t$ is to the $t'$-th target token $y_{t'}$, and is computed as:

$$\alpha_{t't} = \frac{1}{Z} \exp\Big(\mathrm{score}\big(E_y(y_{t'-1}), s_{t'-1}, h_t\big)\Big), \qquad (1)$$

where $Z = \sum_{k=1}^{T_X} \exp\big(\mathrm{score}(E_y(y_{t'-1}), s_{t'-1}, h_k)\big)$ is the normalization constant. $\mathrm{score}()$ is a feedforward neural network with a single hidden layer that scores how well the source symbol $x_t$ and the target symbol $y_{t'}$ match. $E_y$ is the target embedding lookup table, and $s_{t'}$ is the target hidden state at time $t'$.
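A minimal sketch of this attention computation, assuming a single-hidden-layer feedforward score network applied to the concatenated query and source state (an illustrative parameterization, not necessarily the authors' exact one):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """alpha_{t't} = softmax_t(score(E_y(y_{t'-1}), s_{t'-1}, h_t)),
    c_{t'} = sum_t alpha_{t't} * h_t (Eq. 1)."""

    def __init__(self, emb_dim, dec_dim, enc_dim, att_dim):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(emb_dim + dec_dim + enc_dim, att_dim),
            nn.Tanh(),
            nn.Linear(att_dim, 1),
        )

    def forward(self, prev_emb, prev_state, enc_states):
        # prev_emb: (batch, emb_dim), prev_state: (batch, dec_dim)
        # enc_states: (batch, T_x, enc_dim)
        T_x = enc_states.size(1)
        query = torch.cat([prev_emb, prev_state], dim=-1)
        query = query.unsqueeze(1).expand(-1, T_x, -1)
        scores = self.score(torch.cat([query, enc_states], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)                     # (batch, T_x)
        context = (alpha.unsqueeze(-1) * enc_states).sum(dim=1)   # (batch, enc_dim)
        return context, alpha
```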
Decoder: Given a source context vector $c_{t'}$, the decoder computes its hidden state at time $t'$ as $s_{t'} = f_{\mathrm{dec}}\big(E_y(y_{t'-1}), s_{t'-1}, c_{t'}\big)$. Then, a parametric function $\mathrm{out}_k()$ returns the conditional probability of the next target symbol being $k$:

$$p(y_{t'} = k \mid y_{<t'}, X) = \frac{1}{Z} \exp\Big(\mathrm{out}_k\big(E_y(y_{t'-1}), s_{t'}, c_{t'}\big)\Big), \qquad (2)$$

where $Z$ is again the normalization constant: $Z = \sum_j \exp\big(\mathrm{out}_j(E_y(y_{t'-1}), s_{t'}, c_{t'})\big)$.
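A sketch of one decoding step under Eq. (2), assuming a GRU cell for $f_{\mathrm{dec}}$ and a linear output layer for $\mathrm{out}_k()$; feeding the context vector by concatenation is an implementation assumption:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """s_{t'} = f_dec(E_y(y_{t'-1}), s_{t'-1}, c_{t'}), followed by a
    softmax over the target vocabulary (Eq. 2)."""

    def __init__(self, vocab_size, emb_dim, hid_dim, ctx_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)       # E_y lookup table
        self.cell = nn.GRUCell(emb_dim + ctx_dim, hid_dim)   # f_dec
        self.out = nn.Linear(emb_dim + hid_dim + ctx_dim, vocab_size)

    def forward(self, prev_id, prev_state, context):
        prev_emb = self.embed(prev_id)                       # (batch, emb_dim)
        state = self.cell(torch.cat([prev_emb, context], dim=-1), prev_state)
        logits = self.out(torch.cat([prev_emb, state, context], dim=-1))
        probs = torch.softmax(logits, dim=-1)                # p(y_{t'} = k | ...)
        return probs, state
```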
Training: The entire model can be trained end-to-end by minimizing the negative conditional log-likelihood, which is defined as:

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_Y^{(n)}} \log p\big(y_t = y_t^{(n)} \mid y_{<t}^{(n)}, X^{(n)}\big),$$

where $N$ is the number of sentence pairs, and $X^{(n)}$ and $y_t^{(n)}$ are the source sentence and the $t$-th target symbol in the $n$-th pair, respectively.
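In practice this objective is the average per-symbol cross entropy over a minibatch; a minimal sketch, assuming padded minibatches with a hypothetical `pad_id` used to mask positions beyond each target length:

```python
import torch
import torch.nn.functional as F

def nmt_loss(logits, targets, pad_id=0):
    """Negative conditional log-likelihood of the reference target symbols.

    logits:  (batch, T_y, vocab) unnormalized scores out_k(...)
    targets: (batch, T_y) gold symbol ids y_t^{(n)}
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (batch * T_y, vocab)
        targets.reshape(-1),                   # (batch * T_y,)
        ignore_index=pad_id,                   # skip padding positions
    )
```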
3 Fully Character-Level Translation
3.1 Why Character-Level?
The benefits of character-level translation over word-level translation are well known. Chung et al. (2016) present three main arguments: character-level models (1) do not suffer from out-of-vocabulary issues, (2) are able to model different, rare morphological variants of a word, and (3) do not require segmentation. Particularly, text segmentation is highly non-trivial for many languages and problematic even for English, as word tokenizers are either manually designed or trained on a corpus using an objective function that is unrelated to the translation task at hand, which makes the overall system sub-optimal.

Here we present two additional arguments for character-level translation. First, a character-level translation system can easily be applied to a multilingual translation setting. Between European languages where the majority of alphabets overlap, for instance, a character-level model may easily identify morphemes that are shared across different languages. A word-level model, however, will need a separate word vocabulary for each language, allowing no cross-lingual parameter sharing.

Also, by not segmenting source sentences into words, we no longer inject our knowledge of words and word boundaries into the system; instead, we encourage the model to discover an internal structure of a sentence by itself and learn how a sequence of symbols can be mapped to a continuous meaning representation.
3.2 Related Work
To address these limitations associated with word-level translation, a recent line of research has investigated using sub-word information.

Costa-Jussà and Fonollosa (2016) replaced the word-lookup table with convolutional and highway layers on top of character embeddings, while still segmenting source sentences into words. Target sentences were also segmented into words, and predictions were made at word-level.

Similarly, Ling et al. (2015) employed a bidirectional LSTM to compose character embeddings into word embeddings. At the target side, another LSTM takes the hidden state of the decoder and generates the target word, character by character. While this system is completely open-vocabulary, it also requires offline segmentation. Character-to-word and word-to-character LSTMs significantly slow down training, as well.

Most recently, Luong and Manning (2016) proposed a hybrid scheme that consults character-level information whenever the model encounters an out-of-vocabulary word. As a baseline, they also implemented a purely character-level NMT model with 4 layers of unidirectional LSTMs with 512 cells, with attention over each character. Despite being extremely slow (approximately 3 months to train), the character-level model gave a comparable performance to the word-level baseline. This shows the possibility of fully character-level translation.
Having a word-level decoder restricts the model to only being able to generate previously seen words. Sennrich et al. (2015) introduced a subword-level NMT model that is capable of open-vocabulary translation using subword-level segmentation based on the byte pair encoding (BPE) algorithm. Starting from a character vocabulary, the algorithm identifies frequent character n-grams in the training data and iteratively adds them to the vocabulary, ultimately giving a subword vocabulary which consists of words, subwords and characters. Once the segmentation rules have been learned, their model performs subword-to-subword translation (bpe2bpe) in the same way as word-to-word translation.
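For illustration, a minimal sketch of one BPE merge iteration over a toy word-frequency dictionary (a generic rendering of the algorithm, not the implementation of Sennrich et al.):

```python
import re
from collections import Counter

def most_frequent_pair(vocab):
    """vocab maps space-separated symbol sequences to word frequencies."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, vocab):
    """Replace every occurrence of the chosen pair with a merged symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy example: learn a handful of merges from illustrative counts.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6}
for _ in range(3):
    vocab = merge_pair(most_frequent_pair(vocab), vocab)
```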
Perhaps the work closest to our end goal is that of Chung et al. (2016), which used the subword-level encoder from Sennrich et al. (2015) and a fully character-level decoder (bpe2char). Their results show that character-level decoding performs better than subword-level decoding. Motivated by this work, we aim for fully character-level translation on both sides (char2char).
Outside NMT, our work is based on a few existing approaches that applied convolutional networks to text, most notably in text classification (Zhang et al., 2015; Xiao and Cho, 2016). We also drew inspiration for our multilingual models from previous work that showed the possibility of training a single recurrent model for multiple languages in domains other than translation (Tsvetkov et al., 2016; Gillick et al., 2015).
3.3 Challenges
Sentences are on average 6 (DE, CS and RU) to 8 (FI) times longer when represented in characters. This poses three major challenges to achieving fully character-level translation.

(1) Training/decoding latency: For the decoder, although the sequence to be generated is much longer, each character-level softmax operation costs considerably less compared to a word- or subword-level softmax. Chung et al. (2016) report that character-level decoding is only 14% slower than subword-level decoding.

On the other hand, the computational complexity of the attention mechanism grows quadratically with respect to the sentence length, as it needs to attend to every source token for every target token. This makes a naive character-level approach, such as in Luong and Manning (2016), computationally prohibitive. Consequently, reducing the length of the source sequence is key to ensuring reasonable speed in both training and decoding.

(2) Mapping character sequence to continuous representation: The arbitrary relationship between the orthography of a word and its meaning is a well-known problem in linguistics (de Saussure, 1916). Building a character-level encoder is arguably a more difficult problem, as the encoder needs to learn a highly non-linear function from a long sequence of character symbols to a meaning representation.

(3) Long-range dependencies in characters: A character-level encoder needs to model dependencies over longer timespans than a word-level encoder does.
4 Fully Character-Level NMT
4.1 Encoder
We design an encoder that addresses all the challenges discussed above by using convolutional and pooling layers aggressively to both (1) drastically shorten the input sentence and (2) efficiently capture local regularities. Inspired by the character-level language model of Kim et al. (2015), our encoder first reduces the source sentence length with a series of convolutional, pooling and highway layers. The shorter representation, instead of the full character sequence, is passed through a bidirectional GRU to (3) help it resolve long-term dependencies. We illustrate the proposed encoder in Figure 1 and discuss each layer in detail below.
Embedding: We map the sequence of source characters $(x_1, \ldots, x_{T_x})$ to a sequence of character embeddings of dimensionality $d_c$: $X = \big(C(x_1), \ldots, C(x_{T_x})\big) \in \mathbb{R}^{d_c \times T_x}$, where $T_x$ is the number of source characters and $C$ is the character embedding lookup table: $C \in \mathbb{R}^{d_c \times |C|}$.
Convolution: A one-dimensional convolution is then applied along consecutive character embeddings. Assuming we have a single filter $f \in \mathbb{R}^{d_c \times w}$ of width $w$, we first apply padding to the beginning and the end of $X$, such that the padded sentence $X' \in \mathbb{R}^{d_c \times (T_x + w - 1)}$ is $w - 1$ symbols longer. We then apply a narrow convolution between $X'$ and $f$ such that the $k$-th element of the output $Y_k$ is given as:

$$Y_k = (X' * f)_k = \sum_{i,j} \big(X'_{[:,\, k-w+1:k]} \otimes f\big)_{ij}, \qquad (3)$$

where $\otimes$ denotes elementwise matrix multiplication and $*$ is the convolution operation. $X'_{[:,\, k-w+1:k]}$ is the sliced subset of $X'$ that contains all the rows but only $w$ adjacent columns. The padding scheme employed above, commonly known as half convolution, ensures that the length of the output is identical to the length of the input, i.e., $Y \in \mathbb{R}^{1 \times T_x}$.
[Figure 1: Encoder architecture schematics. Underscore denotes padding. A dotted vertical line delimits each segment. The stride of pooling s is 5 in the diagram.]

We just illustrated how a single convolutional filter of fixed width might be applied to a sentence. In order to extract informative character patterns of different lengths, we employ a set of filters of varying widths. More concretely, we use a filter bank $F = \{f_1, \ldots, f_m\}$ where $f_i \in \mathbb{R}^{d_c \times i \times n_i}$ is a collection of $n_i$ filters of width $i$. Our model uses $m = 8$, hence extracting character n-grams up to 8 characters long. Outputs from all the filters are stacked upon each other, giving a single representation $Y \in \mathbb{R}^{N \times T_x}$, where the dimensionality of each column is given by the total number of filters $N = \sum_{i=1}^{m} n_i$. Finally, rectified linear activation (ReLU) is applied elementwise to this representation.
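A minimal sketch of such a filter bank with half-convolution padding, using PyTorch Conv1d layers; the per-width filter counts below are illustrative placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

class CharConvBank(nn.Module):
    """Applies filters of widths 1..m over character embeddings and stacks
    the outputs, preserving the input length T_x (half convolution)."""

    def __init__(self, d_c,
                 filters_per_width=(200, 200, 250, 250, 300, 300, 300, 300)):
        super().__init__()
        self.convs = nn.ModuleList()
        for width, n_i in enumerate(filters_per_width, start=1):
            # padding=width-1 plus a trailing trim keeps the output length T_x
            self.convs.append(
                nn.Conv1d(d_c, n_i, kernel_size=width, padding=width - 1))

    def forward(self, x):
        # x: (batch, d_c, T_x) character embeddings
        T_x = x.size(-1)
        outs = [torch.relu(conv(x)[:, :, :T_x]) for conv in self.convs]
        return torch.cat(outs, dim=1)   # (batch, N, T_x), N = sum of n_i
```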
Max pooling with stride: The output from the convolutional layer is first split into segments of width $s$, and max-pooling over time is applied to each segment with no overlap. This procedure selects the most salient features to give a segment embedding. Each segment embedding is a summary of meaningful character n-grams occurring in a particular (overlapping) subsequence in the source sentence. Note that the rightmost segment (above 'on') in Figure 1 may capture 'son' (the filter in green) although 's' occurs in the previous segment. In other words, our segments are overlapping, unlike in word- or subword-level models with hard segmentation.

Segments act as our internal linguistic unit from this layer and above: the attention mechanism, for instance, attends to each source segment instead of each source character. This shortens the source representation $s$-fold: $Y' \in \mathbb{R}^{N \times (T_x/s)}$. Empirically, we found that using a smaller $s$ leads to better performance at the cost of increased training time. We chose $s = 5$ in our experiments as it gives a reasonable balance between the two.
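A minimal sketch of this non-overlapping, strided max-pooling step, which turns the $T_x$ convolutional columns into roughly $T_x/s$ segment embeddings (here with the paper's $s = 5$):

```python
import torch
import torch.nn.functional as F

def segment_pool(conv_out, stride=5):
    """Max-pool over time with no overlap:
    (batch, N, T_x) -> (batch, N, T_x // stride).

    Each output column is a segment embedding summarizing the most salient
    character n-gram features within its window of `stride` characters."""
    return F.max_pool1d(conv_out, kernel_size=stride, stride=stride)
```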
Highway network: A sequence of segment embeddings from the max-pooling layer is fed into a highway network (Srivastava et al., 2015). Highway networks have been shown to significantly improve the quality of a character-level language model when used with convolutional layers (Kim et al., 2015). A highway network transforms input $x$ with a gating mechanism that adaptively regulates information flow:

$$y = g \odot \mathrm{ReLU}(W_1 x + b_1) + (1 - g) \odot x,$$

where $g = \sigma(W_2 x + b_2)$. We apply this to each segment embedding individually.
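A minimal sketch of a single highway layer applied position-wise to the segment embeddings (the number of stacked layers is a configuration choice not specified here):

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """y = g * ReLU(W1 x + b1) + (1 - g) * x,  with g = sigmoid(W2 x + b2)."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # W1, b1
        self.gate = nn.Linear(dim, dim)        # W2, b2

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        return g * torch.relu(self.transform(x)) + (1.0 - g) * x
```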
Recurrent layer: Finally, the output from the highway layer is given to a bidirectional GRU from §2, using each segment embedding as input.
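Putting the pieces together, the encoder pipeline can be chained as in the sketch below, reusing the illustrative modules defined in the previous blocks; hyperparameter values are assumptions, not the paper's settings:

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Char embedding -> multi-width conv -> strided max-pool -> highway -> bi-GRU."""

    def __init__(self, vocab_size, d_c=128, hid_dim=512, stride=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_c)
        self.bank = CharConvBank(d_c)                      # defined above
        n_total = sum(c.out_channels for c in self.bank.convs)
        self.highway = Highway(n_total)                    # defined above
        self.birnn = nn.GRU(n_total, hid_dim, bidirectional=True,
                            batch_first=True)
        self.stride = stride

    def forward(self, char_ids):
        emb = self.embed(char_ids).transpose(1, 2)         # (batch, d_c, T_x)
        segs = segment_pool(self.bank(emb), self.stride)   # (batch, N, T_x // s)
        segs = self.highway(segs.transpose(1, 2))          # (batch, T_x // s, N)
        states, _ = self.birnn(segs)                       # (batch, T_x // s, 2*hid_dim)
        return states
```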
Subword-level encoder: Unlike a subword-level encoder, our model does not commit to a specific choice of segmentation; instead it is trained to consider every possible character pattern and extract only the most meaningful ones. Therefore, the definition of segmentation in our model is dynamic, unlike subword-level encoders. During training, the model finds the most salient character patterns in a sentence via max-pooling, and the character

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of EMNLP.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.