Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work?

doi:10.18653/V1/2020.ACL-MAIN.467

Open Access, Proceedings Article

- pp 5231-5247

TLDR

Intermediate tasks requiring high-level inference and reasoning abilities tend to work best, and target task performance is strongly correlated with higher-level abilities such as coreference resolution, but more granular correlations between probing and target task performance are not observed.

Abstract:

While pretrained models such as BERT have shown large gains across natural language understanding tasks, their performance can be improved by further training the model on a data-rich intermediate task, before fine-tuning it on a target task. However, it is still poorly understood when and why intermediate-task training is beneficial for a given target task. To investigate this, we perform a large-scale study on the pretrained RoBERTa model with 110 intermediate-target task combinations. We further evaluate all trained models with 25 probing tasks meant to reveal the specific skills that drive transfer. We observe that intermediate tasks requiring high-level inference and reasoning abilities tend to work best. We also observe that target task performance is strongly correlated with higher-level abilities such as coreference resolution. However, we fail to observe more granular correlations between probing and target task performance, highlighting the need for further work on broad-coverage probing benchmarks. We also observe evidence that the forgetting of knowledge learned during pretraining may limit our analysis, highlighting the need for further work on transfer learning methods in these settings.

Citations

Journal Article

A Primer in BERTology: What We Know About How BERT Works

Anna Rogers, +2 more

- 01 Jan 2020 -

Transactions of the Association for Comp...

TL;DR: This paper surveys over 150 studies of the BERT model, covering the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression.

Posted Content

A Primer in BERTology: What we know about how BERT works

Anna Rogers, +2 more

- 27 Feb 2020 -

arXiv: Computation and Language

TL;DR: This paper is the first survey of over 150 studies of the popular BERT model, reviewing the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression.

Posted Content

AdapterFusion: Non-Destructive Task Composition for Transfer Learning

Jonas Pfeiffer, +4 more

- 01 May 2020 -

arXiv: Computation and Language

TL;DR: This work proposes AdapterFusion, a new two-stage learning algorithm that leverages knowledge from multiple tasks by separating the two stages, i.e., knowledge extraction and knowledge composition, so that the classifier can effectively exploit the representations learned from multiple tasks in a non-destructive manner.

Posted Content

AdapterHub: A Framework for Adapting Transformers.

Jonas Pfeiffer, +7 more

- 15 Jul 2020 -

arXiv: Computation and Language

TL;DR: This work proposes AdapterHub, a framework that allows dynamic “stitching-in” of pre-trained adapters for different tasks and languages, enabling scalable and easy sharing of task-specific models, particularly in low-resource scenarios.

Proceedings Article

From zero to hero: On the limitations of zero-shot language transfer with multilingual transformers

Anne Lauscher, +3 more

TL;DR: It is demonstrated that the inexpensive few-shot transfer (i.e., additional fine-tuning on a few target-language instances) is surprisingly effective across the board, warranting more research efforts reaching beyond the limiting zero-shot conditions.

References

Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

Proceedings Article

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can then be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, +9 more

- 26 Jul 2019 -

arXiv: Computation and Language

TL;DR: This work finds that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

Proceedings Article

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Paszke, +20 more

TL;DR: This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

Posted Content

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, +8 more

- 23 Oct 2019 -

arXiv: Learning

TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Related Papers (5)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, +9 more

- 26 Jul 2019 -

arXiv: Computation and Language

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

SQuAD: 100,000+ Questions for Machine Comprehension of Text