Open Access Proceedings Article
UNKs Everywhere: Adapting Multilingual Language Models to New Scripts.
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, Sebastian Ruder
pp. 10186–10203
TLDR
This paper proposes a series of data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to low-resource languages and unseen scripts. The methods rely on matrix factorization to exploit the latent multilingual knowledge already present in the pretrained model's embedding matrix.
Abstract:
Massively multilingual language models such as multilingual BERT offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. However, due to limited capacity and large differences in pretraining data sizes, there is a profound performance gap between resource-rich and resource-poor target languages. The ultimate challenge is dealing with under-resourced languages not covered at all by the models and written in scripts unseen during pretraining. In this work, we propose a series of novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts. Relying on matrix factorization, our methods capitalize on the existing latent knowledge about multiple languages already available in the pretrained model's embedding matrix. Furthermore, we show that learning of the new dedicated embedding matrix in the target language can be improved by leveraging a small number of vocabulary items (i.e., the so-called lexically overlapping tokens) shared between mBERT's and the target language's vocabulary. Our adaptation techniques offer substantial performance gains for languages with unseen scripts. We also demonstrate that they can yield improvements for low-resource languages written in scripts covered by the pretrained model.
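As a rough illustration of the adaptation recipe sketched above, the following NumPy snippet factorizes a stand-in embedding matrix and initializes a new target-language embedding matrix from lexically overlapping tokens. All sizes, the random matrices, and the `overlap` mapping are hypothetical; this is a sketch of the general idea, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch: factorize a pretrained embedding matrix E (V x d) into
# low-dimensional token factors F (V x k) and a shared up-projection G (k x d),
# then learn only a new F' for the target-language vocabulary while keeping G fixed.
rng = np.random.default_rng(0)
V, d, k = 5_000, 768, 100            # hypothetical sizes
E = rng.normal(size=(V, d))          # stands in for mBERT's embedding matrix

# Truncated SVD: E ~ U_k S_k Vt_k, so F = U_k S_k and G = Vt_k.
U, S, Vt = np.linalg.svd(E, full_matrices=False)
F = U[:, :k] * S[:k]                 # latent token coordinates (V x k)
G = Vt[:k, :]                        # shared projection (k x d)

# New target-language vocabulary: random init, then copy latent coordinates
# for tokens that lexically overlap with the source vocabulary.
V_tgt = 2_000
F_tgt = rng.normal(scale=0.02, size=(V_tgt, k))
overlap = {5: 17, 42: 3}             # hypothetical {target_id: source_id} pairs
for tgt_id, src_id in overlap.items():
    F_tgt[tgt_id] = F[src_id]

E_tgt = F_tgt @ G                    # new embedding matrix for the target language
print(E_tgt.shape)                   # (2000, 768)
```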
Citations
Posted Content
AdapterFusion: Non-Destructive Task Composition for Transfer Learning
TL;DR: This work proposes AdapterFusion, a new two-stage learning algorithm that leverages knowledge from multiple tasks by separating the two stages, i.e., knowledge extraction and knowledge composition, so that the classifier can effectively exploit the representations learned from multiple tasks in a non-destructive manner.
Proceedings ArticleDOI
AdapterFusion: Non-destructive task composition for transfer learning
TL;DR: In this paper, the authors propose a two-stage learning algorithm that leverages knowledge from multiple tasks to solve the problem of catastrophic forgetting and difficulties in dataset balancing, by separating the two stages, i.e., knowledge extraction and knowledge composition.
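For intuition about the composition stage described in these two entries, here is a minimal PyTorch sketch of attention over the outputs of several frozen, task-specific adapters. Module names and dimensions are illustrative rather than the paper's actual code.

```python
import torch
import torch.nn as nn

class AdapterFusion(nn.Module):
    """Attention that mixes the outputs of several frozen adapters (sketch)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.key = nn.Linear(hidden_dim, hidden_dim)
        self.value = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, hidden: torch.Tensor, adapter_outputs: torch.Tensor) -> torch.Tensor:
        # hidden:          (batch, seq, dim)            transformer layer output
        # adapter_outputs: (batch, seq, n_adapters, dim) outputs of the frozen adapters
        q = self.query(hidden).unsqueeze(2)              # (b, s, 1, dim)
        k = self.key(adapter_outputs)                    # (b, s, n, dim)
        v = self.value(adapter_outputs)                  # (b, s, n, dim)
        scores = (q * k).sum(-1).softmax(dim=-1)         # (b, s, n) mixing weights
        return (scores.unsqueeze(-1) * v).sum(dim=2)     # (b, s, dim)

fusion = AdapterFusion(hidden_dim=768)
h = torch.randn(2, 16, 768)
a = torch.randn(2, 16, 3, 768)       # stand-in outputs of three pretrained adapters
print(fusion(h, a).shape)            # torch.Size([2, 16, 768])
```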
Proceedings ArticleDOI
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
TL;DR: The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and find that while the pretraining data size is an important factor in downstream performance, a designated monolingual tokenizer plays an equally important role.
Posted Content
AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages.
Abteen Ebrahimi, Manuel Mager, Arturo Oncevay, Vishrav Chaudhary, Luis Chiruzzo, Angela Fan, John Ortega, Ricardo Argenton Ramos, Annette Rios, Ivan Vladimir, Gustavo A. Giménez-Lugo, Elisabeth Mager, Graham Neubig, Alexis Palmer, Rolando A. Coto Solano, Ngoc Thang Vu, Katharina Kann
TL;DR: In this paper, an extension of XNLI to 10 indigenous languages of the Americas is presented, and the authors find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
Posted Content
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models.
TL;DR: This paper provided a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and found that while the pretraining data size is an important factor in the downstream performance of the multilingual model, a designated monolingual tokenizer plays an equally important role.
References
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
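The core operation behind this architecture is scaled dot-product attention; a minimal single-head PyTorch sketch (illustrative only, not the full multi-head implementation) follows.

```python
import torch
import torch.nn.functional as F

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_q, seq_k)
    weights = F.softmax(scores, dim=-1)             # attention distribution over keys
    return weights @ v                              # (batch, seq_q, d_v)

q = torch.randn(1, 5, 64)
k = torch.randn(1, 7, 64)
v = torch.randn(1, 7, 64)
print(attention(q, k, v).shape)  # torch.Size([1, 5, 64])
```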
Proceedings ArticleDOI
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
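To make the "one additional output layer" concrete, here is a small PyTorch sketch of a classification head over a pooled [CLS] representation; the encoder output is a stand-in random tensor rather than an actual BERT model.

```python
import torch
import torch.nn as nn

# One linear output layer on top of the [CLS] vector is enough for classification fine-tuning.
num_labels, hidden_dim = 3, 768
classifier = nn.Linear(hidden_dim, num_labels)

cls_representation = torch.randn(8, hidden_dim)            # hypothetical [CLS] vectors for a batch of 8
logits = classifier(cls_representation)                    # (8, num_labels)
loss = nn.functional.cross_entropy(logits, torch.randint(0, num_labels, (8,)))
loss.backward()                                            # gradients flow into the head (and, normally, the encoder)
```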
Proceedings Article
Categorical Reparameterization with Gumbel-Softmax
Eric Jang, Shixiang Gu, Ben Poole
TL;DR: Gumbel-Softmax as mentioned in this paper replaces non-differentiable samples from a categorical distribution with differentiable samples from a novel Gumbel-Softmax distribution, which has the essential property that it can be smoothly annealed into the categorical distribution.
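A minimal sketch of the sampling trick in plain PyTorch (the library also ships `torch.nn.functional.gumbel_softmax`, which implements the same idea):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Perturb logits with Gumbel noise and apply a temperature-scaled softmax.

    As tau -> 0 the sample approaches a one-hot categorical draw while remaining
    differentiable for any tau > 0.
    """
    gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel_noise) / tau, dim=-1)

logits = torch.tensor([[1.0, 2.0, 0.5]], requires_grad=True)
sample = gumbel_softmax_sample(logits, tau=0.5)
sample.sum().backward()              # gradients reach the logits, unlike a hard categorical sample
print(sample)
```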
Proceedings ArticleDOI
Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov
TL;DR: It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
Posted Content
Gaussian Error Linear Units (GELUs)
Dan Hendrycks, Kevin Gimpel
TL;DR: An empirical evaluation of the GELU nonlinearity against the ReLU and ELU activations is performed and performance improvements are found across all considered computer vision, natural language processing, and speech tasks.
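For reference, GELU is defined as x·Φ(x), where Φ is the standard normal CDF; a short PyTorch sketch of the exact erf form:

```python
import math
import torch

def gelu(x: torch.Tensor) -> torch.Tensor:
    """GELU nonlinearity, x * Phi(x), via the exact erf formulation."""
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

x = torch.linspace(-3, 3, 7)
print(gelu(x))                       # matches torch.nn.functional.gelu(x)
```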