Open Access · Posted Content

How Much Does Tokenization Affect Neural Machine Translation?

TLDR
The conclusion is reached that tokenization significantly affects the final translation quality and that the best tokenizer differs across language pairs.
Abstract
Tokenization or segmentation is a broad concept that covers simple processes, such as separating punctuation from words, as well as more sophisticated processes, such as applying morphological knowledge. Neural Machine Translation (NMT) requires a limited-size vocabulary to keep computational cost manageable and enough examples of each token to estimate word embeddings. Separating punctuation and splitting tokens into words or subwords has proven helpful for reducing the vocabulary and increasing the number of examples of each word, which improves translation quality. Tokenization is more challenging when dealing with languages that have no separator between words. In order to assess the impact of tokenization on the quality of the final translation in NMT, we experimented with five tokenizers over ten language pairs. We conclude that tokenization significantly affects the final translation quality and that the best tokenizer differs across language pairs.
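The subword splitting the abstract refers to is what byte-pair-encoding (BPE) style tokenizers do: they merge frequent character sequences so the vocabulary stays small while rare words still decompose into known pieces. A minimal sketch only, using a toy word-frequency table rather than the tokenizers or data studied in the paper:

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the given symbol pair with its merged symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a space-separated character sequence plus an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for step in range(10):
    pairs = get_pair_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair becomes a new subword
    vocab = merge_pair(best, vocab)
    print(step, best)
```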


Citations
Posted Content

Neural Machine Translation: A Review

TL;DR: This work traces back the origins of modern NMT architectures to word and sentence embeddings and earlier examples of the encoder-decoder network family and concludes with a survey of recent trends in the field.
Proceedings ArticleDOI

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

TL;DR: The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models and their monolingual counterparts with regard to monolingual task performance, finding that while pretraining data size is an important factor in downstream performance, a dedicated monolingual tokenizer plays an equally important role.
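A rough way to see the tokenizer effect this work discusses is to compare subword fertility (subwords per whitespace word) between a multilingual and a monolingual tokenizer. The sketch below assumes the Hugging Face transformers package; the checkpoints and sentence are arbitrary examples, not the paper's setup.

```python
from transformers import AutoTokenizer

text = "Tokenization quality varies considerably across languages."

for name in ["bert-base-multilingual-cased", "bert-base-cased"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(text)
    fertility = len(pieces) / len(text.split())   # subwords per word; lower is usually better
    print(f"{name}: {len(pieces)} subwords, fertility {fertility:.2f}")
```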
Posted Content

Byte Pair Encoding is Suboptimal for Language Model Pretraining

TL;DR: Differences between BPE and unigram LM tokenization are analyzed, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE’s greedy construction procedure.
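For readers who want to reproduce such a comparison informally, SentencePiece can train both BPE and unigram LM models over the same corpus. A sketch, assuming a placeholder corpus.txt and an arbitrary vocabulary size, not the settings used in the cited work:

```python
import sentencepiece as spm

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",                 # placeholder: one sentence per line
        model_prefix=f"demo_{model_type}",
        vocab_size=8000,                    # arbitrary budget for illustration
        model_type=model_type,
    )
    sp = spm.SentencePieceProcessor(model_file=f"demo_{model_type}.model")
    # Compare how the two models segment the same (made-up) input.
    print(model_type, sp.encode("unbelievably good translations", out_type=str))
```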
Proceedings ArticleDOI

Investigating the Effectiveness of BPE: The Power of Shorter Sequences.

TL;DR: The experiments show that, given a fixed vocabulary size budget, the fewer tokens an algorithm needs to cover the test set, the better the translation (as measured by BLEU).
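That coverage criterion is easy to compute for any segmenter. A small sketch, where segment stands in for whichever tokenizer is being evaluated (plain whitespace splitting here, purely for illustration):

```python
def tokens_to_cover(sentences, segment):
    """Total and mean number of tokens a segmenter needs to cover a test set."""
    counts = [len(segment(s)) for s in sentences]
    return sum(counts), sum(counts) / len(counts)

# Whitespace splitting stands in for a real subword segmenter.
total, mean = tokens_to_cover(["a toy test set", "of two sentences"], str.split)
print(total, mean)
```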
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
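For reference, the update rule described in this TL;DR can be written in a few lines of NumPy. This is a generic sketch using the paper's default hyperparameters; the function name and calling convention are ours.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using adaptive estimates of the gradient's first and second moments."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```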
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
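A compact NumPy sketch of one LSTM step, showing the gated cell-state update behind the "constant error carousel" mentioned above; the gate ordering and weight shapes are illustrative conventions, not prescribed by the paper.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,); gate order i, f, o, g."""
    z = W @ x + U @ h_prev + b
    H = h_prev.shape[0]
    i, f, o, g = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
    i, f, o = (1 / (1 + np.exp(-a)) for a in (i, f, o))   # sigmoid gates
    g = np.tanh(g)                                        # candidate cell update
    c = f * c_prev + i * g                                # cell state carries error forward
    h = o * np.tanh(c)
    return h, c
```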
Proceedings ArticleDOI

Bleu: a Method for Automatic Evaluation of Machine Translation

TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
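In practice this metric is usually computed with a reimplementation such as sacrebleu (assumed below) rather than the original 2002 scripts; the toy hypotheses and references are ours.

```python
import sacrebleu

hypotheses = ["the cat sat on the mat", "there is a dog in the garden"]
references = [["the cat is sitting on the mat", "a dog is in the garden"]]  # one reference stream
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))   # corpus-level BLEU on the 0-100 scale
```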
Proceedings Article

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend it by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
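The (soft-)search described here is additive attention: score each encoder state against the current decoder state, softmax the scores, and take the weighted sum as the context vector. A NumPy sketch with illustrative shapes, not the exact parametrization from the paper:

```python
import numpy as np

def additive_attention(s, H, W_s, W_h, v):
    """Bahdanau-style additive attention.
    s: decoder state (d,), H: encoder states (T, d),
    W_s, W_h: (a, d) projections, v: (a,) scoring vector."""
    scores = np.tanh(H @ W_h.T + W_s @ s) @ v   # (T,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over source positions
    context = weights @ H                       # weighted sum of encoder states
    return context, weights
```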