Open Access · Proceedings ArticleDOI

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

TL;DR
The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and find that while pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in downstream performance.
Abstract
In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in the downstream performance. Our results show that languages that are adequately represented in the multilingual model’s vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.
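The comparison hinges on how well a tokenizer's vocabulary covers a given language, which can be inspected by measuring subword fertility (the average number of subword pieces per word). Below is a minimal illustrative sketch, not taken from the paper, using the Hugging Face transformers library and two example checkpoints (bert-base-multilingual-cased and bert-base-cased); substitute tokenizers for the language under study.

```python
# Minimal sketch (assumption: transformers is installed and the example
# checkpoints are available): compare how a multilingual and a monolingual
# tokenizer segment the same text.
from transformers import AutoTokenizer

text = "Tokenization quality differs between monolingual and multilingual models."

multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
monolingual = AutoTokenizer.from_pretrained("bert-base-cased")  # English-only example

for name, tok in [("multilingual", multilingual), ("monolingual", monolingual)]:
    words = text.split()
    subwords = tok.tokenize(text)
    # Fertility = average subword pieces per whitespace-separated word;
    # values close to 1.0 suggest the vocabulary represents the language well.
    fertility = len(subwords) / len(words)
    print(f"{name}: {len(subwords)} subwords, fertility = {fertility:.2f}")
    print("  ", subwords)
```

A language that is poorly represented in the multilingual vocabulary will typically show markedly higher fertility than under its dedicated monolingual tokenizer.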


Citations
Proceedings ArticleDOI

When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings

TL;DR: In this article, the authors present a new dataset called Case Holdings On Legal Decisions (CaseHOLD), which consists of over 53,000 multiple choice questions that ask for the relevant holding of a cited case.
Proceedings Article

UNKs Everywhere: Adapting Multilingual Language Models to New Scripts.

TL;DR: This paper proposes a series of data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts, relying in part on matrix factorization of the embedding layer.
Proceedings Article

MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer.

TL;DR: Training a dedicated adapter for each language is not viable for the vast majority of languages, due to limitations in corpus size or compute budget; the authors therefore propose MAD-G (Multilingual ADapter Generation), which generates language adapters from language representations based on typological features.
Posted Content

What to Pre-Train on? Efficient Intermediate Task Selection

TL;DR: This article showed that efficient embedding-based methods that rely solely on the respective datasets outperform computationally expensive few-shot fine-tuning approaches, demonstrating that they can efficiently identify the best datasets for intermediate training.
Posted Content

KLUE: Korean Language Understanding Evaluation.

TL;DR: The Korean Language Understanding Evaluation (KLUE) benchmark as mentioned in this paper is a collection of 8 Korean NLP tasks, including Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking.
References
Proceedings ArticleDOI

Improving Bi-LSTM Performance for Indonesian Sentiment Analysis Using Paragraph Vector

TL;DR: In this paper, a paragraph vector is concatenated to each word vector of the document to provide contextual information during sequence processing, which helps to disambiguate ambiguous Indonesian words.