Open Access Proceedings Article (DOI)

A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks

TL;DR
The authors introduce a joint many-task model together with a strategy for successively growing its depth to solve increasingly complex tasks, and use a simple regularization term that allows all model weights to be optimized for one task's loss without catastrophic interference with the other tasks.
Abstract
Transfer and multi-task learning have traditionally focused on either a single source-target pair or very few, similar tasks. Ideally, the linguistic levels of morphology, syntax, and semantics would benefit each other by being trained in a single model. We introduce a joint many-task model together with a strategy for successively growing its depth to solve increasingly complex tasks. Higher layers include shortcut connections to lower-level task predictions to reflect linguistic hierarchies. We use a simple regularization term to allow for optimizing all model weights to improve one task's loss without exhibiting catastrophic interference with the other tasks. Our single end-to-end model obtains state-of-the-art or competitive results on five different tasks spanning tagging, parsing, relatedness, and entailment.
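The "simple regularization term" is a successive L2 penalty: while training one task, shared weights are pulled toward their values from the previous training stage. A minimal PyTorch sketch of that idea (the helper name, `delta`, and the snapshot scheme are illustrative, not the authors' released code):

```python
import torch

def successive_l2_penalty(model, prev_params, delta=1e-2):
    """L2 penalty tying current weights to a snapshot taken before
    training the current task, so optimizing one task's loss does not
    catastrophically interfere with the others."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        penalty = penalty + (p - prev_params[name]).pow(2).sum()
    return delta * penalty

# Usage: snapshot before each training stage, then regularize the task loss.
# prev_params = {n: p.detach().clone() for n, p in model.named_parameters()}
# loss = task_loss + successive_l2_penalty(model, prev_params)
```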


Citations
Proceedings Article (DOI)

Deep contextualized word representations

TL;DR: This paper introduced a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts (i.e., to model polysemy).
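Concretely, an ELMo representation is a softmax-weighted combination of the bidirectional language model's layer activations, scaled by a learned scalar. A hedged sketch of just that mixing step (the biLM itself is assumed given; class and argument names are illustrative):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific, softmax-weighted sum of biLM layer activations,
    scaled by a learned scalar, in the spirit of ELMo's mixing step."""
    def __init__(self, num_layers):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(()))

    def forward(self, layer_states):
        # layer_states: list of (batch, seq_len, dim) tensors, one per layer
        weights = torch.softmax(self.scores, dim=0)
        return self.gamma * sum(w * h for w, h in zip(weights, layer_states))
```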
Proceedings Article (DOI)

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

TL;DR: The GLUE benchmark, as described in this paper, comprises nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models.
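As a usage note, GLUE tasks are now commonly fetched through the Hugging Face `datasets` library rather than the benchmark's original download scripts; a minimal example, assuming that package is installed:

```python
from datasets import load_dataset

# Each GLUE task is a named configuration of the "glue" dataset.
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])   # e.g. {'sentence': ..., 'label': 0 or 1, 'idx': ...}
```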
Posted Content

An Overview of Multi-Task Learning in Deep Neural Networks

Sebastian Ruder
15 Jun 2017
TL;DR: This article seeks to help ML practitioners apply MTL by shedding light on how MTL works and providing guidelines for choosing appropriate auxiliary tasks, particularly in deep neural networks.
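The overview's most common pattern, hard parameter sharing, is one shared encoder feeding task-specific output heads; a minimal PyTorch sketch (layer sizes and names are illustrative):

```python
import torch.nn as nn

class HardSharingModel(nn.Module):
    """Hard parameter sharing: one shared encoder, one head per task."""
    def __init__(self, in_dim, hidden_dim, task_output_sizes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_dim, n_out)
            for task, n_out in task_output_sizes.items()
        })

    def forward(self, x, task):
        return self.heads[task](self.encoder(x))
```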
Proceedings Article

Contextual String Embeddings for Sequence Labeling

TL;DR: This paper proposes leveraging the internal states of a trained character language model to produce a novel type of word embedding, termed contextual string embeddings, which fundamentally model words as sequences of characters and are contextualized by their surrounding text.
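Mechanically, the embedding for each word is read off a character-level LM's hidden states at the word's boundary characters; a hedged sketch under the assumption of a `char_lm` callable returning one hidden vector per character (this interface is hypothetical, not Flair's actual API):

```python
import torch

def contextual_string_embeddings(char_lm, sentence, word_spans):
    """For each word (start, end char offsets), take the forward char-LM
    hidden state just after its last character. A backward LM state just
    before the first character would be concatenated in the full method."""
    hidden = char_lm(sentence)                  # (num_chars + 1, dim), assumed
    return torch.stack([hidden[end] for _, end in word_spans])
```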
References
Journal Article (DOI)

Long short-term memory

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
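For reference, one step of the LSTM cell in NumPy; the additive cell-state update is the "constant error carousel" that keeps error flow constant (the forget gate shown here was added after the original 1997 formulation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; W, U, b stack the input, forget, output, and
    candidate transforms (each producing hidden_dim values)."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # constant error carousel
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new
```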
Journal Article

Dropout: a simple way to prevent neural networks from overfitting

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
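The mechanism fits in a few lines: during training, zero each unit with probability p and rescale the survivors so expected activations match test time (the "inverted" variant used by modern libraries; the paper instead rescales weights at test time):

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: drop units with probability p, scale the rest."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)
```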
Proceedings Article (DOI)

GloVe: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
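The model's weighted least-squares objective over the word co-occurrence matrix X, as given in the paper:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
```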
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text and shows that learning good vector representations for millions of phrases is possible; it also describes a simple alternative to the hierarchical softmax called negative sampling.
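Negative sampling replaces the full softmax with k binary logistic terms per observed (input, output) word pair; the per-pair objective from the paper:

```latex
\log \sigma\!\left( {v'_{w_O}}^{\top} v_{w_I} \right)
+ \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\!\left( -{v'_{w_i}}^{\top} v_{w_I} \right) \right]
```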
Proceedings Article

Sequence to Sequence Learning with Neural Networks

TL;DR: The authors used a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.
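Structurally, an encoder LSTM compresses the source sequence into its final (hidden, cell) state, which initializes a decoder LSTM over the target; a minimal single-layer PyTorch sketch (the paper uses four layers and reverses the source; names are illustrative):

```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder LSTM -> fixed-size state -> decoder LSTM (no attention)."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        _, state = self.encoder(self.src_emb(src))      # (h, c) summarizes src
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.proj(dec_out)                       # next-token logits
```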