Open Access Proceedings Article (DOI)

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

TLDR
The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and find that while pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in downstream performance.
Abstract
In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in the downstream performance. Our results show that languages that are adequately represented in the multilingual model’s vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.
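To make the tokenizer comparison concrete, here is a minimal Python sketch (not the authors' code; the checkpoint names are illustrative Hugging Face models) of the kind of segmentation statistics such a study relies on: fertility, the average number of subwords a tokenizer produces per word, and the share of words broken into multiple pieces, computed for a multilingual and a monolingual tokenizer on the same text.

from transformers import AutoTokenizer

def fertility(tokenizer, words):
    # Average number of subword tokens produced per whitespace-separated word.
    return sum(len(tokenizer.tokenize(w)) for w in words) / len(words)

def continued_word_ratio(tokenizer, words):
    # Fraction of words that the tokenizer splits into two or more pieces.
    return sum(len(tokenizer.tokenize(w)) > 1 for w in words) / len(words)

# Illustrative German sentence and checkpoints (mBERT vs. a dedicated German BERT).
sample = "Die Katze sitzt auf der Fensterbank und beobachtet die Nachbarn".split()
multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
monolingual = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

for name, tok in [("multilingual", multilingual), ("monolingual", monolingual)]:
    print(f"{name}: fertility={fertility(tok, sample):.2f}, "
          f"split words={continued_word_ratio(tok, sample):.0%}")

A markedly lower fertility for the monolingual tokenizer on text like this is the kind of signal the paper links to the downstream advantage of dedicated monolingual tokenizers.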


Citations
Proceedings Article (DOI)

When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings

TL;DR: In this article, the authors present a new dataset, Case Holdings On Legal Decisions (CaseHOLD), which consists of over 53,000 multiple-choice questions that ask for the relevant holding of a cited case.
Proceedings Article

UNKs Everywhere: Adapting Multilingual Language Models to New Scripts.

TL;DR: This paper proposes a series of data-efficient methods, built on matrix factorization of the pretrained embedding matrix, that enable quick and effective adaptation of pretrained multilingual models to low-resource languages and unseen scripts.
Proceedings Article

MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer.

TL;DR: The authors propose MAD-G (Multilingual ADapter Generation), which generates language adapters from language representations based on typological features, avoiding the need to train a separate adapter per language, which is not viable for the vast majority of languages due to limited corpus size or compute budgets.
Posted Content

What to Pre-Train on? Efficient Intermediate Task Selection

TL;DR: This article shows that efficient embedding-based methods that rely solely on the respective datasets outperform computationally expensive few-shot fine-tuning approaches, demonstrating that they can efficiently identify the best datasets for intermediate training.
Posted Content

KLUE: Korean Language Understanding Evaluation.

TL;DR: The Korean Language Understanding Evaluation (KLUE) benchmark, as presented in this paper, is a collection of 8 Korean NLP tasks: Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking.
References

AraBERT: Transformer-based Model for Arabic Language Understanding

TL;DR: This paper pre-trains BERT specifically for the Arabic language, in pursuit of the same success that BERT achieved for English, and shows that the newly developed AraBERT achieves state-of-the-art performance on most tested Arabic NLP tasks.
Proceedings Article (DOI)

Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT

TL;DR: This paper explored the broader cross-lingual potential of multilingual BERT as a zero-shot language transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification, NER, POS tagging, and dependency parsing.
Proceedings Article

UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing

TL;DR: UDPipe, a pipeline processing CoNLL-U-formatted files, performs tokenization, morphological analysis, part-of-speech tagging, lemmatization and dependency parsing for nearly all treebanks of Universal Dependencies 1.2.
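As a rough illustration of how such a pipeline is typically invoked (a sketch, not taken from the paper), the official ufal.udpipe Python bindings can run tokenization, tagging, and parsing on raw text; the model file name below is a placeholder for a locally downloaded UDPipe model.

from ufal.udpipe import Model, Pipeline, ProcessingError

# Placeholder path: a UDPipe model downloaded for the language of interest.
model = Model.load("english-ud.udpipe")

# Raw-text input ("tokenize"), default tagger and parser, CoNLL-U output.
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")

error = ProcessingError()
conllu = pipeline.process("UDPipe turns raw text into CoNLL-U annotations.", error)
if error.occurred():
    raise RuntimeError(error.message)
print(conllu)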
Proceedings Article

Cross-Lingual Ability of Multilingual BERT: An Empirical Study

TL;DR: A comprehensive study of the contribution of different components in M-BERT to its cross-lingual ability, finding that the lexical overlap between languages plays a negligible role, while the depth of the network is an integral part of it.
Posted Content

Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection

TL;DR: Universal Dependencies, as presented in this paper, is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists of a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments, and modifiers.