Open Access · Proceedings ArticleDOI

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

TL;DR
The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and find that while pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in downstream performance.
Abstract
In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in the downstream performance. Our results show that languages that are adequately represented in the multilingual model’s vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.
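The comparison hinges on how well a tokenizer's vocabulary covers a given language, which can be inspected by measuring subword fertility (the average number of subword pieces per word). Below is a minimal illustrative sketch, not taken from the paper, using the Hugging Face transformers library and two example checkpoints (bert-base-multilingual-cased and bert-base-cased); substitute tokenizers for the language under study.

```python
# Minimal sketch (assumption: transformers is installed and the example
# checkpoints are available): compare how a multilingual and a monolingual
# tokenizer segment the same text.
from transformers import AutoTokenizer

text = "Tokenization quality differs between monolingual and multilingual models."

multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
monolingual = AutoTokenizer.from_pretrained("bert-base-cased")  # English-only example

for name, tok in [("multilingual", multilingual), ("monolingual", monolingual)]:
    words = text.split()
    subwords = tok.tokenize(text)
    # Fertility = average subword pieces per whitespace-separated word;
    # values close to 1.0 suggest the vocabulary represents the language well.
    fertility = len(subwords) / len(words)
    print(f"{name}: {len(subwords)} subwords, fertility = {fertility:.2f}")
    print("  ", subwords)
```

A language that is poorly represented in the multilingual vocabulary will typically show markedly higher fertility than under its dedicated monolingual tokenizer.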


Citations
Proceedings ArticleDOI

When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings

TL;DR: In this article, the authors present a new dataset called Case Holdings On Legal Decisions (CaseHOLD), which consists of over 53,000 multiple choice questions that ask for the relevant holding of a cited case.
Proceedings Article

UNKs Everywhere: Adapting Multilingual Language Models to New Scripts.

TL;DR: This paper proposes a series of data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts, relying in part on matrix factorization of the embedding layer.
Proceedings Article

MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer.

TL;DR: Training a dedicated adapter for each language is not viable for the vast majority of languages, due to limitations in corpus size or compute budget; the authors therefore propose MAD-G (Multilingual ADapter Generation), which generates language adapters from language representations based on typological features.
Posted Content

What to Pre-Train on? Efficient Intermediate Task Selection

TL;DR: This article showed that efficient embedding-based methods that rely solely on the respective datasets outperform computationally expensive few-shot fine-tuning approaches, demonstrating that they can efficiently identify the best datasets for intermediate training.
Posted Content

KLUE: Korean Language Understanding Evaluation.

TL;DR: The Korean Language Understanding Evaluation (KLUE) benchmark as mentioned in this paper is a collection of 8 Korean NLP tasks, including Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking.
References
Proceedings ArticleDOI

Improving Bi-LSTM Performance for Indonesian Sentiment Analysis Using Paragraph Vector

TL;DR: In this paper, a paragraph vector is concatenated to each word vector of the document to provide contextual information during sequence processing, which helps to disambiguate ambiguous Indonesian words.