Fully Character-Level Neural Machine Translation
without Explicit Segmentation
Jason Lee*
ETH Zürich
jasonlee@inf.ethz.ch

Kyunghyun Cho
New York University
kyunghyun.cho@nyu.edu

Thomas Hofmann
ETH Zürich
thomas.hofmann@inf.ethz.ch

* The majority of this work was completed while the author was visiting New York University.

Transactions of the Association for Computational Linguistics, vol. 5, pp. 365–378, 2017. Action Editor: Adam Lopez. Submission batch: 11/2016; Revision batch: 2/2017; Published 10/2017. © 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Abstract
Most existing machine translation systems operate at the level of words, relying on explicit segmentation to extract tokens. We introduce a neural machine translation (NMT) model that maps a source character sequence to a target character sequence without any segmentation. We employ a character-level convolutional network with max-pooling at the encoder to reduce the length of the source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. Our character-to-character model outperforms a recently proposed baseline with a subword-level encoder on WMT'15 DE-EN and CS-EN, and gives comparable performance on FI-EN and RU-EN. We then demonstrate that it is possible to share a single character-level encoder across multiple languages by training a model on a many-to-one translation task. In this multilingual setting, the character-level encoder significantly outperforms the subword-level encoder on all the language pairs. We observe that on CS-EN, FI-EN and RU-EN, the quality of the multilingual character-level translation even surpasses the models specifically trained on that language pair alone, both in terms of BLEU score and human judgment.
1 Introduction
Nearly all previous work in machine translation has been at the level of words. Aside from our intuitive understanding of the word as a basic unit of meaning (Jackendoff, 1992), one reason behind this is that sequences are significantly longer when represented in characters, compounding the problem of data sparsity and modeling long-range dependencies. This has driven NMT research to be almost exclusively word-level (Bahdanau et al., 2015; Sutskever et al., 2014).

Despite their remarkable success, word-level NMT models suffer from several major weaknesses. For one, they are unable to model rare, out-of-vocabulary words, making them limited in translating languages with rich morphology such as Czech, Finnish and Turkish. If one uses a large vocabulary to combat this (Jean et al., 2015), the complexity of training and decoding grows linearly with respect to the target vocabulary size, leading to a vicious cycle.
To address this, we present a fully character-level NMT model that maps a character sequence in a source language to a character sequence in a target language. We show that our model outperforms a baseline with a subword-level encoder on DE-EN and CS-EN, and achieves a comparable result on FI-EN and RU-EN. A purely character-level NMT model with a basic encoder was proposed as a baseline by Luong and Manning (2016), but training it was prohibitively slow. We were able to train our model at a reasonable speed by drastically reducing the length of the source sentence representation using a stack of convolutional, pooling and highway layers.

One advantage of character-level models is that they are better suited for multilingual translation than their word-level counterparts, which require a separate word vocabulary for each language. We verify this by training a single model to translate four languages (German, Czech, Finnish and Russian) into English. Our multilingual character-level model outperforms the subword-level baseline by a considerable margin in all four language pairs, strongly indicating that a character-level model is more flexible in assigning its capacity to different language pairs. Furthermore, we observe that our multilingual character-level translation even exceeds the quality of bilingual translation in three out of four language pairs, both in terms of BLEU score and human evaluation. This demonstrates the excellent parameter efficiency of character-level translation in a multilingual setting. We also showcase our model's ability to handle intra-sentence code-switching while performing language identification on the fly.
The contributions of this work are twofold: we empirically show that (1) we can train a character-to-character NMT model without any explicit segmentation; and (2) we can share a single character-level encoder across multiple languages to build a multilingual translation system without increasing the model size.
2 Background: Attentional Neural Machine Translation

Neural machine translation (NMT) is a recently proposed approach to machine translation that builds a single neural network which takes as input a source sentence $X = (x_1, \ldots, x_{T_X})$ and generates its translation $Y = (y_1, \ldots, y_{T_Y})$, where $x_t$ and $y_{t'}$ are source and target symbols (Bahdanau et al., 2015; Sutskever et al., 2014; Luong et al., 2015; Cho et al., 2014a). Attentional NMT models have three components: an encoder, a decoder and an attention mechanism.
Encoder: Given a source sentence $X$, the encoder constructs a continuous representation that summarizes its meaning with a recurrent neural network (RNN). A bidirectional RNN is often implemented as proposed in Bahdanau et al. (2015). A forward encoder reads the input sentence from left to right: $\overrightarrow{h}_t = \overrightarrow{f}_{\mathrm{enc}}\big(E_x(x_t), \overrightarrow{h}_{t-1}\big)$. Similarly, a backward encoder reads it from right to left: $\overleftarrow{h}_t = \overleftarrow{f}_{\mathrm{enc}}\big(E_x(x_t), \overleftarrow{h}_{t+1}\big)$, where $E_x$ is the source embedding lookup table, and $\overrightarrow{f}_{\mathrm{enc}}$ and $\overleftarrow{f}_{\mathrm{enc}}$ are recurrent activation functions such as long short-term memory units (LSTMs) (Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRUs) (Cho et al., 2014b). The encoder constructs a set of continuous source sentence representations $C$ by concatenating the forward and backward hidden states at each timestep: $C = \{h_1, \ldots, h_{T_X}\}$, where $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.
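To make the notation concrete, the following is a minimal PyTorch-style sketch of such a bidirectional GRU encoder (not the paper's implementation; the module and dimension names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Bidirectional GRU encoder: concatenates forward and backward
    hidden states at every timestep, as in Bahdanau et al. (2015)."""

    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # E_x lookup table
        self.birnn = nn.GRU(emb_dim, hid_dim, bidirectional=True,
                            batch_first=True)

    def forward(self, src_ids):
        # src_ids: (batch, T_x) integer symbol ids
        emb = self.embed(src_ids)        # (batch, T_x, emb_dim)
        states, _ = self.birnn(emb)      # (batch, T_x, 2 * hid_dim)
        return states                    # C = {h_1, ..., h_Tx}
```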
Attention: First introduced in Bahdanau et al. (2015), the attention mechanism lets the decoder attend more to different source symbols for each target symbol. More concretely, it computes the context vector $c_{t'}$ at each decoding time step $t'$ as a weighted sum of the source hidden states: $c_{t'} = \sum_{t=1}^{T_X} \alpha_{t't} h_t$. Similarly to Chung et al. (2016) and Firat et al. (2016a), each attentional weight $\alpha_{t't}$ represents how relevant the $t$-th source token $x_t$ is to the $t'$-th target token $y_{t'}$, and is computed as:

$$\alpha_{t't} = \frac{1}{Z} \exp\Big(\mathrm{score}\big(E_y(y_{t'-1}), s_{t'-1}, h_t\big)\Big), \qquad (1)$$

where $Z = \sum_{k=1}^{T_X} \exp\big(\mathrm{score}(E_y(y_{t'-1}), s_{t'-1}, h_k)\big)$ is the normalization constant. $\mathrm{score}()$ is a feedforward neural network with a single hidden layer that scores how well the source symbol $x_t$ and the target symbol $y_{t'}$ match. $E_y$ is the target embedding lookup table, and $s_{t'}$ is the target hidden state at time $t'$.
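A minimal sketch of this attention computation, assuming a single-hidden-layer feedforward score network applied to the concatenated query and source state (an illustrative parameterization, not necessarily the authors' exact one):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """alpha_{t't} = softmax_t(score(E_y(y_{t'-1}), s_{t'-1}, h_t)),
    c_{t'} = sum_t alpha_{t't} * h_t (Eq. 1)."""

    def __init__(self, emb_dim, dec_dim, enc_dim, att_dim):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(emb_dim + dec_dim + enc_dim, att_dim),
            nn.Tanh(),
            nn.Linear(att_dim, 1),
        )

    def forward(self, prev_emb, prev_state, enc_states):
        # prev_emb: (batch, emb_dim), prev_state: (batch, dec_dim)
        # enc_states: (batch, T_x, enc_dim)
        T_x = enc_states.size(1)
        query = torch.cat([prev_emb, prev_state], dim=-1)
        query = query.unsqueeze(1).expand(-1, T_x, -1)
        scores = self.score(torch.cat([query, enc_states], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)                     # (batch, T_x)
        context = (alpha.unsqueeze(-1) * enc_states).sum(dim=1)   # (batch, enc_dim)
        return context, alpha
```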
Decoder: Given a source context vector $c_{t'}$, the decoder computes its hidden state at time $t'$ as $s_{t'} = f_{\mathrm{dec}}\big(E_y(y_{t'-1}), s_{t'-1}, c_{t'}\big)$. Then, a parametric function $\mathrm{out}_k()$ returns the conditional probability of the next target symbol being $k$:

$$p(y_{t'} = k \mid y_{<t'}, X) = \frac{1}{Z} \exp\Big(\mathrm{out}_k\big(E_y(y_{t'-1}), s_{t'}, c_{t'}\big)\Big), \qquad (2)$$

where $Z$ is again the normalization constant: $Z = \sum_j \exp\big(\mathrm{out}_j(E_y(y_{t'-1}), s_{t'}, c_{t'})\big)$.
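A sketch of one decoding step under Eq. (2), assuming a GRU cell for $f_{\mathrm{dec}}$ and a linear output layer for $\mathrm{out}_k()$; feeding the context vector by concatenation is an implementation assumption:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """s_{t'} = f_dec(E_y(y_{t'-1}), s_{t'-1}, c_{t'}), followed by a
    softmax over the target vocabulary (Eq. 2)."""

    def __init__(self, vocab_size, emb_dim, hid_dim, ctx_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)       # E_y lookup table
        self.cell = nn.GRUCell(emb_dim + ctx_dim, hid_dim)   # f_dec
        self.out = nn.Linear(emb_dim + hid_dim + ctx_dim, vocab_size)

    def forward(self, prev_id, prev_state, context):
        prev_emb = self.embed(prev_id)                       # (batch, emb_dim)
        state = self.cell(torch.cat([prev_emb, context], dim=-1), prev_state)
        logits = self.out(torch.cat([prev_emb, state, context], dim=-1))
        probs = torch.softmax(logits, dim=-1)                # p(y_{t'} = k | ...)
        return probs, state
```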
Training: The entire model can be trained end-to-end by minimizing the negative conditional log-likelihood, which is defined as:

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_Y^{(n)}} \log p\big(y_t = y_t^{(n)} \mid y_{<t}^{(n)}, X^{(n)}\big),$$

where $N$ is the number of sentence pairs, and $X^{(n)}$ and $y_t^{(n)}$ are the source sentence and the $t$-th target symbol in the $n$-th pair, respectively.
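In practice this objective is the average per-symbol cross entropy over a minibatch; a minimal sketch, assuming padded minibatches with a hypothetical `pad_id` used to mask positions beyond each target length:

```python
import torch
import torch.nn.functional as F

def nmt_loss(logits, targets, pad_id=0):
    """Negative conditional log-likelihood of the reference target symbols.

    logits:  (batch, T_y, vocab) unnormalized scores out_k(...)
    targets: (batch, T_y) gold symbol ids y_t^{(n)}
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (batch * T_y, vocab)
        targets.reshape(-1),                   # (batch * T_y,)
        ignore_index=pad_id,                   # skip padding positions
    )
```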
3 Fully Character-Level Translation
3.1 Why Character-Level?
The benefits of character-level translation over word-level translation are well known. Chung et al. (2016) present three main arguments: character-level models (1) do not suffer from out-of-vocabulary issues, (2) are able to model different, rare morphological variants of a word, and (3) do not require segmentation. Particularly, text segmentation is highly non-trivial for many languages and problematic even for English, as word tokenizers are either manually designed or trained on a corpus using an objective function that is unrelated to the translation task at hand, which makes the overall system sub-optimal.

Here we present two additional arguments for character-level translation. First, a character-level translation system can easily be applied to a multilingual translation setting. Between European languages where the majority of alphabets overlap, for instance, a character-level model may easily identify morphemes that are shared across different languages. A word-level model, however, will need a separate word vocabulary for each language, allowing no cross-lingual parameter sharing.

Also, by not segmenting source sentences into words, we no longer inject our knowledge of words and word boundaries into the system; instead, we encourage the model to discover an internal structure of a sentence by itself and learn how a sequence of symbols can be mapped to a continuous meaning representation.
3.2 Related Work
To address these limitations associated with word-level translation, a recent line of research has investigated using sub-word information.

Costa-Jussà and Fonollosa (2016) replaced the word-lookup table with convolutional and highway layers on top of character embeddings, while still segmenting source sentences into words. Target sentences were also segmented into words, and predictions were made at word-level.

Similarly, Ling et al. (2015) employed a bidirectional LSTM to compose character embeddings into word embeddings. At the target side, another LSTM takes the hidden state of the decoder and generates the target word, character by character. While this system is completely open-vocabulary, it also requires offline segmentation. Character-to-word and word-to-character LSTMs significantly slow down training, as well.

Most recently, Luong and Manning (2016) proposed a hybrid scheme that consults character-level information whenever the model encounters an out-of-vocabulary word. As a baseline, they also implemented a purely character-level NMT model with 4 layers of unidirectional LSTMs with 512 cells, with attention over each character. Despite being extremely slow (approximately 3 months to train), the character-level model gave a comparable performance to the word-level baseline. This shows the possibility of fully character-level translation.
Having a word-level decoder restricts the model to only being able to generate previously seen words. Sennrich et al. (2015) introduced a subword-level NMT model that is capable of open-vocabulary translation using subword-level segmentation based on the byte pair encoding (BPE) algorithm. Starting from a character vocabulary, the algorithm identifies frequent character n-grams in the training data and iteratively adds them to the vocabulary, ultimately giving a subword vocabulary which consists of words, subwords and characters. Once the segmentation rules have been learned, their model performs subword-to-subword translation (bpe2bpe) in the same way as word-to-word translation.
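For illustration, a minimal sketch of one BPE merge iteration over a toy word-frequency dictionary (a generic rendering of the algorithm, not the implementation of Sennrich et al.):

```python
import re
from collections import Counter

def most_frequent_pair(vocab):
    """vocab maps space-separated symbol sequences to word frequencies."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, vocab):
    """Replace every occurrence of the chosen pair with a merged symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy example: learn a handful of merges from illustrative counts.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6}
for _ in range(3):
    vocab = merge_pair(most_frequent_pair(vocab), vocab)
```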
Perhaps the work closest to our end goal is that of Chung et al. (2016), which used the subword-level encoder from Sennrich et al. (2015) and a fully character-level decoder (bpe2char). Their results show that character-level decoding performs better than subword-level decoding. Motivated by this work, we aim for fully character-level translation on both sides (char2char).
Outside NMT, our work is based on a few existing approaches that applied convolutional networks to text, most notably in text classification (Zhang et al., 2015; Xiao and Cho, 2016). We also drew inspiration for our multilingual models from previous work that showed the possibility of training a single recurrent model for multiple languages in domains other than translation (Tsvetkov et al., 2016; Gillick et al., 2015).
3.3 Challenges
Sentences are on average 6 (DE, CS and RU) to 8 (FI) times longer when represented in characters. This poses three major challenges to achieving fully character-level translation.

(1) Training/decoding latency: For the decoder, although the sequence to be generated is much longer, each character-level softmax operation costs considerably less compared to a word- or subword-level softmax. Chung et al. (2016) report that character-level decoding is only 14% slower than subword-level decoding.

On the other hand, the computational complexity of the attention mechanism grows quadratically with respect to the sentence length, as it needs to attend to every source token for every target token. This makes a naive character-level approach, such as in Luong and Manning (2016), computationally prohibitive. Consequently, reducing the length of the source sequence is key to ensuring reasonable speed in both training and decoding.

(2) Mapping character sequence to continuous representation: The arbitrary relationship between the orthography of a word and its meaning is a well-known problem in linguistics (de Saussure, 1916). Building a character-level encoder is arguably a more difficult problem, as the encoder needs to learn a highly non-linear function from a long sequence of character symbols to a meaning representation.

(3) Long-range dependencies in characters: A character-level encoder needs to model dependencies over longer timespans than a word-level encoder does.
4 Fully Character-Level NMT
4.1 Encoder
We design an encoder that addresses all the challenges discussed above by using convolutional and pooling layers aggressively to both (1) drastically shorten the input sentence and (2) efficiently capture local regularities. Inspired by the character-level language model of Kim et al. (2015), our encoder first reduces the source sentence length with a series of convolutional, pooling and highway layers. The shorter representation, instead of the full character sequence, is passed through a bidirectional GRU to (3) help it resolve long-term dependencies. We illustrate the proposed encoder in Figure 1 and discuss each layer in detail below.
Embedding: We map the sequence of source characters $(x_1, \ldots, x_{T_x})$ to a sequence of character embeddings of dimensionality $d_c$: $X = \big(C(x_1), \ldots, C(x_{T_x})\big) \in \mathbb{R}^{d_c \times T_x}$, where $T_x$ is the number of source characters and $C$ is the character embedding lookup table: $C \in \mathbb{R}^{d_c \times |C|}$.
Convolution: A one-dimensional convolution is then applied along consecutive character embeddings. Assuming we have a single filter $f \in \mathbb{R}^{d_c \times w}$ of width $w$, we first apply padding to the beginning and the end of $X$, such that the padded sentence $X' \in \mathbb{R}^{d_c \times (T_x + w - 1)}$ is $w - 1$ symbols longer. We then apply a narrow convolution between $X'$ and $f$ such that the $k$-th element of the output $Y_k$ is given as:

$$Y_k = (X' * f)_k = \sum_{i,j} \big(X'_{[:,\, k-w+1:k]} \otimes f\big)_{ij}, \qquad (3)$$

where $\otimes$ denotes elementwise matrix multiplication and $*$ is the convolution operation. $X'_{[:,\, k-w+1:k]}$ is the sliced subset of $X'$ that contains all the rows but only $w$ adjacent columns. The padding scheme employed above, commonly known as half convolution, ensures that the length of the output is identical to the length of the input, i.e., $Y \in \mathbb{R}^{1 \times T_x}$.
[Figure 1: Encoder architecture schematics. Underscore denotes padding. A dotted vertical line delimits each segment. The stride of pooling s is 5 in the diagram.]

We just illustrated how a single convolutional filter of fixed width might be applied to a sentence. In order to extract informative character patterns of different lengths, we employ a set of filters of varying widths. More concretely, we use a filter bank $F = \{f_1, \ldots, f_m\}$ where $f_i \in \mathbb{R}^{d_c \times i \times n_i}$ is a collection of $n_i$ filters of width $i$. Our model uses $m = 8$, hence extracting character n-grams up to 8 characters long. Outputs from all the filters are stacked upon each other, giving a single representation $Y \in \mathbb{R}^{N \times T_x}$, where the dimensionality of each column is given by the total number of filters $N = \sum_{i=1}^{m} n_i$. Finally, rectified linear activation (ReLU) is applied elementwise to this representation.
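A minimal sketch of such a filter bank with half-convolution padding, using PyTorch Conv1d layers; the per-width filter counts below are illustrative placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

class CharConvBank(nn.Module):
    """Applies filters of widths 1..m over character embeddings and stacks
    the outputs, preserving the input length T_x (half convolution)."""

    def __init__(self, d_c,
                 filters_per_width=(200, 200, 250, 250, 300, 300, 300, 300)):
        super().__init__()
        self.convs = nn.ModuleList()
        for width, n_i in enumerate(filters_per_width, start=1):
            # padding=width-1 plus a trailing trim keeps the output length T_x
            self.convs.append(
                nn.Conv1d(d_c, n_i, kernel_size=width, padding=width - 1))

    def forward(self, x):
        # x: (batch, d_c, T_x) character embeddings
        T_x = x.size(-1)
        outs = [torch.relu(conv(x)[:, :, :T_x]) for conv in self.convs]
        return torch.cat(outs, dim=1)   # (batch, N, T_x), N = sum of n_i
```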
Max pooling with stride: The output from the convolutional layer is first split into segments of width $s$, and max-pooling over time is applied to each segment with no overlap. This procedure selects the most salient features to give a segment embedding. Each segment embedding is a summary of meaningful character n-grams occurring in a particular (overlapping) subsequence in the source sentence. Note that the rightmost segment (above 'on') in Figure 1 may capture 'son' (the filter in green) although 's' occurs in the previous segment. In other words, our segments are overlapping, unlike in word- or subword-level models with hard segmentation.

Segments act as our internal linguistic unit from this layer and above: the attention mechanism, for instance, attends to each source segment instead of each source character. This shortens the source representation $s$-fold: $Y' \in \mathbb{R}^{N \times (T_x/s)}$. Empirically, we found that using a smaller $s$ leads to better performance at the cost of increased training time. We chose $s = 5$ in our experiments as it gives a reasonable balance between the two.
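A minimal sketch of this non-overlapping, strided max-pooling step, which turns the $T_x$ convolutional columns into roughly $T_x/s$ segment embeddings (here with the paper's $s = 5$):

```python
import torch
import torch.nn.functional as F

def segment_pool(conv_out, stride=5):
    """Max-pool over time with no overlap:
    (batch, N, T_x) -> (batch, N, T_x // stride).

    Each output column is a segment embedding summarizing the most salient
    character n-gram features within its window of `stride` characters."""
    return F.max_pool1d(conv_out, kernel_size=stride, stride=stride)
```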
Highway network: A sequence of segment embeddings from the max-pooling layer is fed into a highway network (Srivastava et al., 2015). Highway networks have been shown to significantly improve the quality of a character-level language model when used with convolutional layers (Kim et al., 2015). A highway network transforms input $x$ with a gating mechanism that adaptively regulates information flow:

$$y = g \odot \mathrm{ReLU}(W_1 x + b_1) + (1 - g) \odot x,$$

where $g = \sigma(W_2 x + b_2)$. We apply this to each segment embedding individually.
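A minimal sketch of a single highway layer applied position-wise to the segment embeddings (the number of stacked layers is a configuration choice not specified here):

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """y = g * ReLU(W1 x + b1) + (1 - g) * x,  with g = sigmoid(W2 x + b2)."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # W1, b1
        self.gate = nn.Linear(dim, dim)        # W2, b2

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        return g * torch.relu(self.transform(x)) + (1.0 - g) * x
```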
Recurrent layer: Finally, the output from the highway layer is given to a bidirectional GRU from §2, using each segment embedding as input.
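Putting the pieces together, the encoder pipeline can be chained as in the sketch below, reusing the illustrative modules defined in the previous blocks; hyperparameter values are assumptions, not the paper's settings:

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Char embedding -> multi-width conv -> strided max-pool -> highway -> bi-GRU."""

    def __init__(self, vocab_size, d_c=128, hid_dim=512, stride=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_c)
        self.bank = CharConvBank(d_c)                      # defined above
        n_total = sum(c.out_channels for c in self.bank.convs)
        self.highway = Highway(n_total)                    # defined above
        self.birnn = nn.GRU(n_total, hid_dim, bidirectional=True,
                            batch_first=True)
        self.stride = stride

    def forward(self, char_ids):
        emb = self.embed(char_ids).transpose(1, 2)         # (batch, d_c, T_x)
        segs = segment_pool(self.bank(emb), self.stride)   # (batch, N, T_x // s)
        segs = self.highway(segs.transpose(1, 2))          # (batch, T_x // s, N)
        states, _ = self.birnn(segs)                       # (batch, T_x // s, 2*hid_dim)
        return states
```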
Subword-level encoder: Unlike a subword-level encoder, our model does not commit to a specific choice of segmentation; instead it is trained to consider every possible character pattern and extract only the most meaningful ones. Therefore, the definition of segmentation in our model is dynamic, unlike subword-level encoders. During training, the model finds the most salient character patterns in a sentence via max-pooling, and the character

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of EMNLP.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.