Open Access Proceedings Article

UNKs Everywhere: Adapting Multilingual Language Models to New Scripts.

TLDR
This paper proposes a series of data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to low-resource languages and unseen scripts. The methods rely on matrix factorization, capitalizing on the latent knowledge about multiple languages already available in the pretrained model's embedding matrix.
Abstract
Massively multilingual language models such as multilingual BERT offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. However, due to limited capacity and large differences in pretraining data sizes, there is a profound performance gap between resource-rich and resource-poor target languages. The ultimate challenge is dealing with under-resourced languages not covered at all by the models and written in scripts unseen during pretraining. In this work, we propose a series of novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts. Relying on matrix factorization, our methods capitalize on the existing latent knowledge about multiple languages already available in the pretrained model’s embedding matrix. Furthermore, we show that learning of the new dedicated embedding matrix in the target language can be improved by leveraging a small number of vocabulary items (i.e., the so-called lexically overlapping tokens) shared between mBERT’s and target language vocabulary. Our adaptation techniques offer substantial performance gains for languages with unseen scripts. We also demonstrate that they can yield improvements for low-resource languages written in scripts covered by the pretrained model.
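The key step described in the abstract (factorize the pretrained embedding matrix into a shared latent basis plus per-token coefficients, then build a new target-language embedding matrix on that basis while reusing coefficients for lexically overlapping tokens) can be illustrated with a minimal NumPy sketch. The vocabulary sizes, the SVD-based factorization, and the toy overlap map below are illustrative assumptions, not the paper's exact procedure:

```python
# Hedged sketch: initialize a new target-language embedding matrix from a
# pretrained multilingual one via matrix factorization. Vocab sizes, the SVD
# rank, and the overlap map are toy assumptions, not the paper's exact recipe.
import numpy as np

rng = np.random.default_rng(0)
d, k = 768, 256                      # embedding dim (mBERT-sized) and latent rank
V_src, V_tgt = 5000, 3000            # toy source / target vocabulary sizes

E_src = rng.normal(size=(V_src, d)).astype(np.float32)   # stand-in for mBERT's embedding matrix

# 1) Factorize: E_src ~= F_src @ B, with B a language-agnostic latent basis.
U, S, Vt = np.linalg.svd(E_src, full_matrices=False)
F_src = U[:, :k] * S[:k]             # per-token coefficients, shape (V_src, k)
B = Vt[:k]                           # shared basis, shape (k, d)

# 2) Target-language coefficients: lexically overlapping tokens reuse their
#    source coefficients; all other tokens start near the mean coefficient vector.
overlap = {17: 42, 256: 9, 1024: 77}                     # toy map: target index -> source index
F_tgt = np.tile(F_src.mean(axis=0), (V_tgt, 1))
F_tgt += 0.01 * rng.normal(size=(V_tgt, k)).astype(np.float32)
for tgt_idx, src_idx in overlap.items():
    F_tgt[tgt_idx] = F_src[src_idx]

# 3) The new embedding matrix, expressed in the shared basis, replaces the
#    source embeddings before continued training on the target language.
E_tgt = F_tgt @ B
print(E_tgt.shape)                    # (3000, 768)
```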


Citations
Posted Content

AdapterFusion: Non-Destructive Task Composition for Transfer Learning

TL;DR: This work proposes AdapterFusion, a new two-stage learning algorithm that leverages knowledge from multiple tasks by separating the two stages, i.e., knowledge extraction and knowledge composition, so that the classifier can effectively exploit the representations learned from multiple tasks in a non-destructive manner.
Proceedings ArticleDOI

AdapterFusion: Non-Destructive Task Composition for Transfer Learning

TL;DR: In this paper, the authors propose a two-stage learning algorithm that leverages knowledge from multiple tasks while avoiding catastrophic forgetting and dataset-balancing difficulties, by separating the two stages, i.e., knowledge extraction and knowledge composition.
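The two stages described above can be sketched in plain PyTorch: stage one trains a small bottleneck adapter per task (knowledge extraction); stage two freezes those adapters and learns an attention module over their outputs (knowledge composition). Module names and dimensions below are illustrative, not the authors' implementation:

```python
# Hedged sketch of the AdapterFusion idea in plain PyTorch. Dimensions and
# module names are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted after a transformer sub-layer (stage 1)."""
    def __init__(self, hidden=768, bottleneck=48):
        super().__init__()
        self.down, self.up = nn.Linear(hidden, bottleneck), nn.Linear(bottleneck, hidden)
    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))    # residual connection

class AdapterFusion(nn.Module):
    """Stage 2: attend over the outputs of N frozen, task-specific adapters."""
    def __init__(self, adapters, hidden=768):
        super().__init__()
        self.adapters = nn.ModuleList(adapters)
        for p in self.adapters.parameters():
            p.requires_grad = False                      # knowledge extraction is already done
        self.q = nn.Linear(hidden, hidden)               # query from the sub-layer output
        self.k = nn.Linear(hidden, hidden)               # keys from adapter outputs
        self.v = nn.Linear(hidden, hidden)               # values from adapter outputs
    def forward(self, h):                                # h: (batch, seq, hidden)
        outs = torch.stack([a(h) for a in self.adapters], dim=2)   # (b, s, N, hidden)
        scores = torch.einsum("bsd,bsnd->bsn", self.q(h), self.k(outs))
        weights = torch.softmax(scores, dim=-1)                     # per-token mixture over tasks
        return torch.einsum("bsn,bsnd->bsd", weights, self.v(outs))

fusion = AdapterFusion([Adapter() for _ in range(3)])
print(fusion(torch.randn(2, 5, 768)).shape)              # torch.Size([2, 5, 768])
```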
Proceedings ArticleDOI

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

TL;DR: The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and find that while pretraining data size is an important factor in downstream performance, a dedicated monolingual tokenizer plays an equally important role.
Posted Content

AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages.

TL;DR: In this paper, an extension of XNLI to 10 indigenous languages of the Americas is presented, and the authors find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
References
Proceedings Article

Attention is All you Need

TL;DR: This paper proposed the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
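The building block of that architecture is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch with toy shapes (masking and the multi-head split/concat are omitted):

```python
# Minimal sketch of scaled dot-product attention, the block the Transformer
# stacks into multi-head attention. Shapes are toy values.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)           # (..., seq_q, seq_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # row-wise softmax
    return weights @ V                                         # (..., seq_q, d_v)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 64)) for _ in range(3))        # 4 positions, d_k = d_v = 64
print(scaled_dot_product_attention(Q, K, V).shape)             # (4, 64)
```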
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pretrained model can then be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
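As a concrete illustration of "one additional output layer", the snippet below loads a pretrained multilingual BERT checkpoint with a fresh classification head via the Hugging Face transformers library; the checkpoint name and label count are placeholder choices:

```python
# Hedged sketch: fine-tuning-style use of a pretrained BERT checkpoint by adding
# a single classification head. Requires `pip install transformers torch`.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-multilingual-cased"   # multilingual BERT; num_labels=2 is a placeholder
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

batch = tokenizer(["UNKs everywhere!"], return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**batch).logits      # (batch, num_labels); train head + encoder on task data
print(logits.shape)
```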
Proceedings Article

Categorical Reparameterization with Gumbel-Softmax

TL;DR: Gumbel-Softmax replaces non-differentiable samples from a categorical distribution with differentiable samples from a novel Gumbel-Softmax distribution, which has the essential property that it can be smoothly annealed into the categorical distribution.
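In code, a Gumbel-Softmax sample is just the class logits plus Gumbel(0, 1) noise passed through a temperature-controlled softmax; a minimal NumPy sketch (temperatures chosen for illustration):

```python
# Minimal sketch of Gumbel-Softmax sampling: add Gumbel(0, 1) noise to the
# class logits and take a temperature-controlled softmax. As the temperature
# tau goes to 0, the soft sample approaches a one-hot categorical sample.
import numpy as np

def gumbel_softmax_sample(logits, tau):
    rng = np.random.default_rng()
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())                      # numerically stable softmax
    return y / y.sum()

logits = np.log(np.array([0.1, 0.6, 0.3]))       # unnormalized class scores
print(gumbel_softmax_sample(logits, tau=0.1))    # nearly one-hot at low temperature
print(gumbel_softmax_sample(logits, tau=5.0))    # much smoother at high temperature
```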
Proceedings ArticleDOI

Unsupervised Cross-lingual Representation Learning at Scale

TL;DR: This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks and demonstrates, for the first time, the possibility of multilingual modeling without sacrificing per-language performance.
Posted Content

Gaussian Error Linear Units (GELUs)

TL;DR: This paper performs an empirical evaluation of the GELU nonlinearity against the ReLU and ELU activations and finds performance improvements across all considered computer vision, natural language processing, and speech tasks.
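For reference, GELU weights each input by the standard normal CDF, GELU(x) = x * Phi(x); the snippet below compares the exact form with the common tanh-based approximation (pure standard library, sample points chosen for illustration):

```python
# Exact GELU versus the widely used tanh approximation.
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Common tanh-based approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:+.1f}  exact={gelu(x):+.4f}  approx={gelu_tanh(x):+.4f}")
```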