Dissecting Contextual Word Embeddings: Architecture and Representation

Open Access, Posted Content

- 27 Aug 2018 -

TLDR

This article showed that the choice of neural architecture (e.g., LSTM, CNN, or self attention) influences both end task accuracy and qualitative properties of the representations that are learned.

Abstract:

Contextual word representations derived from pre-trained bidirectional language models (biLMs) have recently been shown to provide significant improvements to the state of the art for a wide range of NLP tasks. However, many questions remain as to how and why these models are so effective. In this paper, we present a detailed empirical study of how the choice of neural architecture (e.g. LSTM, CNN, or self attention) influences both end task accuracy and qualitative properties of the representations that are learned. We show there is a tradeoff between speed and accuracy, but all architectures learn high-quality contextual representations that outperform word embeddings for four challenging NLP tasks. Additionally, all architectures learn representations that vary with network depth, from exclusively morphology-based at the word embedding layer, through local-syntax-based in the lower contextual layers, to longer-range semantics such as coreference at the upper layers. Together, these results suggest that unsupervised biLMs, independent of architecture, are learning much more about the structure of language than previously appreciated.
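One common way to examine depth-dependent properties like those described in the abstract is to train a simple linear classifier (a "probe") on the frozen activations of each layer and compare held-out accuracy across layers. The following is a minimal sketch of that idea, not the authors' code: the activations are synthetic stand-ins generated with NumPy, and every name and value here (`make_synthetic_layers`, `probe_all_layers`, the per-layer signal strengths) is a hypothetical chosen for illustration.

```python
import numpy as np


def fit_linear_probe(features, labels, num_classes):
    """Fit a least-squares linear classifier on frozen features."""
    targets = np.eye(num_classes)[labels]  # one-hot targets
    design = np.hstack([features, np.ones((features.shape[0], 1))])  # add bias column
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights


def probe_accuracy(weights, features, labels):
    """Accuracy of the probe: predict the class with the highest linear score."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    predictions = (design @ weights).argmax(axis=1)
    return float((predictions == labels).mean())


def make_synthetic_layers(num_tokens=2000, dim=16, num_classes=4, seed=0):
    """Synthetic stand-in for per-layer activations (NOT real biLM output).

    Each "layer" is Gaussian noise plus a class-dependent offset whose
    strength differs by layer, so the probes have something to separate.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, num_classes, size=num_tokens)
    class_means = rng.normal(size=(num_classes, dim))
    # Assumed values: layer 0 carries no label signal, layer 1 the most.
    signal_strengths = [0.0, 2.0, 0.5]
    layers = [
        strength * class_means[labels] + rng.normal(size=(num_tokens, dim))
        for strength in signal_strengths
    ]
    return layers, labels


def probe_all_layers(layers, labels, num_classes, train_fraction=0.8):
    """Train one probe per layer and return held-out accuracy for each."""
    split = int(train_fraction * len(labels))
    accuracies = []
    for features in layers:
        weights = fit_linear_probe(features[:split], labels[:split], num_classes)
        accuracies.append(probe_accuracy(weights, features[split:], labels[split:]))
    return accuracies


if __name__ == "__main__":
    layers, labels = make_synthetic_layers()
    for index, accuracy in enumerate(probe_all_layers(layers, labels, num_classes=4)):
        print(f"layer {index}: probe accuracy = {accuracy:.3f}")
```

With real models, the synthetic arrays would be replaced by activations extracted from each layer for a labeled task such as part-of-speech tagging; the layer at which a linear probe performs best gives a rough indication of where that information is most linearly accessible.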

Citations


Proceedings Article

Language Models are Few-Shot Learners

Tom B. Brown, +30 more

TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

Proceedings Article

BERT Rediscovers the Classical NLP Pipeline

Ian Tenney, +2 more

TL;DR: This work finds that the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way, and that the regions responsible for each step appear in the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference.

Proceedings Article

What does BERT learn about the structure of language

Ganesh Jawahar, +2 more

TL;DR: This work provides novel support for the possibility that BERT networks capture structural information about language by performing a series of experiments to unpack the elements of English language structure learned by BERT.

Proceedings Article

A Structural Probe for Finding Syntax in Word Representations

John Hewitt, +1 more

TL;DR: A structural probe is proposed, which evaluates whether syntax trees are embedded in a linear transformation of a neural network’s word representation space, and shows that such transformations exist for both ELMo and BERT but not in baselines, providing evidence that entire syntax trees are embedded implicitly in deep models’ vector geometry.

Proceedings Article

How multilingual is Multilingual BERT

Telmo Pires, +2 more

TL;DR: This article showed that M-BERT is surprisingly good at zero-shot cross-lingual model transfer, in which task-specific annotations in one language are used to fine-tune the model for evaluation in another language.


References

Proceedings Article

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the resulting networks won first place in the ILSVRC 2015 classification task.

Proceedings Article

GloVe: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.


Journal Article

Visualizing Data using t-SNE

Laurens van der Maaten, +1 more

- 01 Jan 2008 -

Journal of Machine Learning Research

TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.


Posted Content

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, +3 more

- 16 Jan 2013 -

arXiv: Computation and Language

TL;DR: This paper proposed two novel model architectures for computing continuous vector representations of words from very large data sets, and the quality of these representations is measured in a word similarity task and the results are compared to the previously best performing techniques based on different types of neural networks.


Proceedings Article

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

John Lafferty, +2 more

TL;DR: This work presents iterative parameter estimation algorithms for conditional random fields and compares the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.


arXiv: Computation and Language

Multilingual Distributed Representations without Word Alignment

Karl Moritz Hermann, +1 more

- 20 Dec 2013 -

arXiv: Computation and Language


Related Papers (5)

Dissecting Contextual Word Embeddings: Architecture and Representation

Co-learning of Word Representations and Morpheme Representations

Improving Word Representations via Global Context and Multiple Word Prototypes

Mixed Membership Word Embeddings for Computational Social Science

Multilingual Distributed Representations without Word Alignment