Open Access · Proceedings ArticleDOI

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

TLDR
The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and find that while pretraining data size is an important factor in downstream performance, a designated monolingual tokenizer plays an equally important role.
Abstract
In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in the downstream performance. Our results show that languages that are adequately represented in the multilingual model’s vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.
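As a concrete illustration of the tokenizer effect described in the abstract, the sketch below compares how a multilingual and a dedicated monolingual tokenizer segment the same sentence, using subword fertility (subwords per whitespace-separated word) as a rough proxy for how well a vocabulary fits a language. This is a minimal example, not the authors' code, and the two checkpoints are illustrative choices.

```python
# Minimal sketch (not the paper's code): compare segmentation of the same text
# by a multilingual and a monolingual tokenizer via subword fertility.
from transformers import AutoTokenizer

multi = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mono = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")  # illustrative monolingual model

text = "Die Katze sitzt auf der Matte und beobachtet die Vögel im Garten."

def fertility(tokenizer, text):
    """Average number of subword pieces per whitespace-separated word."""
    return len(tokenizer.tokenize(text)) / len(text.split())

for name, tok in [("multilingual", multi), ("monolingual", mono)]:
    print(f"{name:13s} fertility={fertility(tok, text):.2f} tokens={tok.tokenize(text)[:8]}")
```

A noticeably higher fertility for the multilingual tokenizer indicates that the language is over-segmented relative to a vocabulary trained only on that language, which is the kind of mismatch the paper links to downstream performance drops.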


Citations
Journal ArticleDOI

Multilingual text categorization and sentiment analysis: a comparative analysis of the utilization of multilingual approaches for classifying twitter data

TL;DR: In this article, a comparative analysis of multilingual approaches for classifying both the sentiment and the text of an examined multilingual corpus was performed; four multilingual BERT-based classifiers and a zero-shot classification approach were utilized and compared in terms of their accuracy and applicability in the classification of multilingual data.
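For readers unfamiliar with the zero-shot setup compared in that article, the following is a hedged sketch using a multilingual NLI checkpoint; the model name and candidate labels are illustrative assumptions, not necessarily what the authors used.

```python
# Illustrative zero-shot sentiment classification with a multilingual NLI model;
# the checkpoint below is an assumed example, not the article's exact setup.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",  # assumed multilingual NLI checkpoint
)

result = classifier(
    "El servicio al cliente fue excelente y muy rápido.",
    candidate_labels=["positive", "negative", "neutral"],
)
print(result["labels"][0], round(result["scores"][0], 3))
```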
Posted Content

Specializing Multilingual Language Models: An Empirical Study

TL;DR: This article studies the performance, extensibility, and interaction of two such adaptations for the low-resource setting, vocabulary augmentation and script transliteration, and reports mixed results, upholding the viability of these approaches while raising new questions about how to optimally adapt multilingual models to low-resource settings.
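The vocabulary-augmentation adaptation mentioned above can be sketched in a few lines with Hugging Face Transformers; the added tokens are placeholders, and this shows only the general recipe rather than the study's exact procedure.

```python
# Sketch of vocabulary augmentation: add target-language subwords to a
# multilingual tokenizer and grow the embedding matrix accordingly.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Placeholder tokens; in practice they would come from a tokenizer trained on
# target-language text.
new_tokens = ["placeholder_subword_a", "placeholder_subword_b"]
num_added = tokenizer.add_tokens(new_tokens)

# New ids receive randomly initialised embeddings and are learned during
# continued pretraining or fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```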
Posted Content

BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding

TL;DR: The BanglaBERT work, as described in this paper, proposes a straightforward solution of transcribing languages to a common script, which can effectively improve the performance of a multilingual model for the Bangla language.
Proceedings Article

Code-switched inspired losses for spoken dialog representations.

TL;DR: The authors introduce new pretraining losses tailored to learning generic multilingual spoken dialogue representations, which expose the model to code-switched language. Their experiments show that the new losses achieve better performance in both monolingual and multilingual settings.
Proceedings ArticleDOI

Vietnamese Sentiment Analysis: An Overview and Comparative Study of Fine-tuning Pretrained Language Models

TL;DR: In this paper, a fine-tuning approach to investigating the performance of different pre-trained language models for the Vietnamese Sentiment Analysis (SA) task is presented; the experimental results show the superior performance of the monolingual PhoBERT and ViT5 models in comparison with previous studies.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
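The update rule summarized above is compact enough to write out directly; the sketch below uses NumPy and the paper's default hyperparameters, applied to a toy one-dimensional objective.

```python
# Adam update rule: adaptive step sizes from bias-corrected estimates of the
# first and second moments of the gradient.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentred variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero initialisation
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage: minimise f(x) = x^2 starting from x = 5.
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 10001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # close to 0
```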
Proceedings Article

Attention is All you Need

TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
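At the core of that architecture is scaled dot-product attention; the sketch below implements it in NumPy for a single head and a single sequence, with no masking, purely to make the mechanism concrete.

```python
# Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V for one head.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                               # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```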
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
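The "one additional output layer" setup described in that summary can be sketched as follows with Hugging Face Transformers; the checkpoint, example texts, and labels are placeholders for illustration only.

```python
# Sketch of fine-tuning BERT for classification: a pretrained encoder plus a
# single randomly initialised output layer, trained end to end.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2  # adds the new classification head
)

batch = tokenizer(["a great movie", "a dull movie"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)  # returns loss and logits
outputs.loss.backward()                  # an optimizer step would follow in training
print(outputs.logits.shape)              # torch.Size([2, 2])
```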
Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

TL;DR: It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Proceedings ArticleDOI

Deep contextualized word representations

TL;DR: This paper introduced a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts (i.e., to model polysemy).