Open Access Proceedings Article (DOI)

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

TLDR
The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and find that while pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in downstream performance.
Abstract
In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in the downstream performance. Our results show that languages that are adequately represented in the multilingual model’s vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.
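To make the tokenizer comparison concrete, here is a minimal Python sketch (not the authors' code; the checkpoint names are illustrative Hugging Face models) of the kind of segmentation statistics such a study relies on: fertility, the average number of subwords a tokenizer produces per word, and the share of words broken into multiple pieces, computed for a multilingual and a monolingual tokenizer on the same text.

from transformers import AutoTokenizer

def fertility(tokenizer, words):
    # Average number of subword tokens produced per whitespace-separated word.
    return sum(len(tokenizer.tokenize(w)) for w in words) / len(words)

def continued_word_ratio(tokenizer, words):
    # Fraction of words that the tokenizer splits into two or more pieces.
    return sum(len(tokenizer.tokenize(w)) > 1 for w in words) / len(words)

# Illustrative German sentence and checkpoints (mBERT vs. a dedicated German BERT).
sample = "Die Katze sitzt auf der Fensterbank und beobachtet die Nachbarn".split()
multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
monolingual = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

for name, tok in [("multilingual", multilingual), ("monolingual", monolingual)]:
    print(f"{name}: fertility={fertility(tok, sample):.2f}, "
          f"split words={continued_word_ratio(tok, sample):.0%}")

A markedly lower fertility for the monolingual tokenizer on text like this is the kind of signal the paper links to the downstream advantage of dedicated monolingual tokenizers.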


Citations
Proceedings Article (DOI)

When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings

TL;DR: In this article, the authors present a new dataset, Case Holdings On Legal Decisions (CaseHOLD), which consists of over 53,000 multiple-choice questions that ask for the relevant holding of a cited case.
Proceedings Article

UNKs Everywhere: Adapting Multilingual Language Models to New Scripts.

TL;DR: This paper proposes a series of data-efficient methods, built on matrix factorization of the pretrained embedding matrix, that enable quick and effective adaptation of pretrained multilingual models to low-resource languages and unseen scripts.
Proceedings Article

MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer.

TL;DR: The authors propose MAD-G (Multilingual ADapter Generation), which generates language adapters from language representations based on typological features, avoiding the need to train a separate adapter per language, which is not viable for the vast majority of languages due to limited corpus size or compute budgets.
Posted Content

What to Pre-Train on? Efficient Intermediate Task Selection

TL;DR: This article shows that efficient embedding-based methods that rely solely on the respective datasets outperform computationally expensive few-shot fine-tuning approaches, demonstrating that they can efficiently identify the best datasets for intermediate training.
Posted Content

KLUE: Korean Language Understanding Evaluation.

TL;DR: The Korean Language Understanding Evaluation (KLUE) benchmark, as presented in this paper, is a collection of 8 Korean NLP tasks: Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking.
References

AraBERT: Transformer-based Model for Arabic Language Understanding

TL;DR: This paper pre-trains BERT specifically for the Arabic language, in pursuit of the same success that BERT achieved for English, and shows that the newly developed AraBERT achieves state-of-the-art performance on most tested Arabic NLP tasks.
Proceedings Article (DOI)

Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT

TL;DR: This paper explored the broader cross-lingual potential of multilingual BERT as a zero-shot language transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification, NER, POS tagging, and dependency parsing.
Proceedings Article

UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing

TL;DR: UDPipe, a pipeline processing CoNLL-U-formatted files, performs tokenization, morphological analysis, part-of-speech tagging, lemmatization and dependency parsing for nearly all treebanks of Universal Dependencies 1.2.
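As a rough illustration of how such a pipeline is typically invoked (a sketch, not taken from the paper), the official ufal.udpipe Python bindings can run tokenization, tagging, and parsing on raw text; the model file name below is a placeholder for a locally downloaded UDPipe model.

from ufal.udpipe import Model, Pipeline, ProcessingError

# Placeholder path: a UDPipe model downloaded for the language of interest.
model = Model.load("english-ud.udpipe")

# Raw-text input ("tokenize"), default tagger and parser, CoNLL-U output.
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")

error = ProcessingError()
conllu = pipeline.process("UDPipe turns raw text into CoNLL-U annotations.", error)
if error.occurred():
    raise RuntimeError(error.message)
print(conllu)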
Proceedings Article

Cross-Lingual Ability of Multilingual BERT: An Empirical Study

TL;DR: A comprehensive study of the contribution of different components in M-BERT to its cross-lingual ability, finding that the lexical overlap between languages plays a negligible role, while the depth of the network is an integral part of it.
Posted Content

Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection

TL;DR: Universal Dependencies, as presented in this paper, is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists of a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments, and modifiers.