English Intermediate-Task Training Improves Zero-Shot Cross-Lingual Transfer Too

Open Access Proceedings Article

- pp 557-575

TLDR

This work evaluates intermediate-task transfer in a zero-shot cross-lingual setting on the XTREME benchmark and finds that MNLI, SQuAD and HellaSwag achieve the best overall results as intermediate tasks, while multi-task intermediate training offers small additional improvements.

Abstract:

Intermediate-task training—fine-tuning a pretrained model on an intermediate task before fine-tuning again on the target task—often improves model performance substantially on language understanding tasks in monolingual English settings. We investigate whether English intermediate-task training is still helpful on non-English target tasks. Using nine intermediate language-understanding tasks, we evaluate intermediate-task transfer in a zero-shot cross-lingual setting on the XTREME benchmark. We see large improvements from intermediate training on the BUCC and Tatoeba sentence retrieval tasks and moderate improvements on question-answering target tasks. MNLI, SQuAD and HellaSwag achieve the best overall results as intermediate tasks, while multi-task intermediate offers small additional improvements. Using our best intermediate-task models for each target task, we obtain a 5.4 point improvement over XLM-R Large on the XTREME benchmark, setting the state of the art as of June 2020. We also investigate continuing multilingual MLM during intermediate-task training and using machine-translated intermediate-task data, but neither consistently outperforms simply performing English intermediate-task training.
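The abstract reports results obtained by "using our best intermediate-task models for each target task". A minimal sketch of that selection step follows, assuming a table of per-target scores indexed by intermediate task is available. The function names, the target names `target_a` and `target_b`, and every number are hypothetical placeholders and are not results from the paper; the unweighted mean is also an assumption and may differ from how XTREME aggregates scores.

```python
from typing import Dict


def best_intermediate_per_target(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each target task, return the intermediate task with the highest score."""
    return {
        target: max(by_intermediate, key=by_intermediate.get)
        for target, by_intermediate in scores.items()
    }


def mean_gain(scores: Dict[str, Dict[str, float]], baseline: Dict[str, float]) -> float:
    """Unweighted mean improvement over the baseline when each target task
    uses its own best intermediate task (assumed aggregation)."""
    best = best_intermediate_per_target(scores)
    gains = [scores[target][best[target]] - baseline[target] for target in scores]
    return sum(gains) / len(gains)


# Placeholder numbers for illustration only; NOT taken from the paper.
scores = {
    "target_a": {"MNLI": 70.0, "SQuAD": 72.0, "HellaSwag": 71.0},
    "target_b": {"MNLI": 80.0, "SQuAD": 78.0, "HellaSwag": 79.0},
}
baseline = {"target_a": 68.0, "target_b": 77.0}

print(best_intermediate_per_target(scores))
print(mean_gain(scores, baseline))
```

Choosing the intermediate task separately for each target task gives an upper envelope over single-task choices, which is why such a figure can exceed what any one intermediate task achieves across the whole benchmark.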

Citations

Proceedings Article

Neural Unsupervised Domain Adaptation in NLP—A Survey

Alan Ramponi, +1 more

TL;DR: This survey reviews neural unsupervised domain adaptation techniques that do not require labeled target-domain data, revisits the notion of domain, and uncovers a bias in the types of Natural Language Processing tasks that have received the most attention.

Posted Content

mT5: A massively multilingual pre-trained text-to-text transformer

Linting Xue, +7 more

- 22 Oct 2020 -

arXiv: Computation and Language

TL;DR: This article proposes mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages and achieves state-of-the-art performance on many multilingual benchmarks.

Proceedings Article

Continual Lifelong Learning in Natural Language Processing: A Survey

Magdalena Biesialska, +2 more

- 17 Dec 2020 -

arXiv: Computation and Language

TL;DR: This work looks at the problem of continual learning (CL) through the lens of various NLP tasks, and discusses major challenges in CL and current methods applied in neural network models.

Posted Content

Rethinking embedding coupling in pre-trained language models

Hyung Won Chung, +4 more

- 24 Oct 2020 -

arXiv: Computation and Language

TL;DR: The analysis shows that larger output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage Transformer representations to be more general and more transferable to other tasks and languages.

Posted Content

FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding

Yuwei Fang, +4 more

- 10 Sep 2020 -

arXiv: Computation and Language

TL;DR: This work proposes FILTER, an enhanced fusion method that takes cross-lingual data as input for XLM finetuning, along with an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.

References

Proceedings Article

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, +9 more

- 26 Jul 2019 -

arXiv: Computation and Language

TL;DR: This work finds that BERT was significantly undertrained and, with improved training, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

Proceedings Article

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Paszke, +20 more

TL;DR: This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

Posted Content

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, +8 more

- 23 Oct 2019 -

arXiv: Learning

TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Posted Content

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, +3 more

- 16 Jun 2016 -

arXiv: Computation and Language

TL;DR: The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.
