Open Access · Journal Article · DOI

A Primer in BERTology: What We Know About How BERT Works

TLDR
This paper surveys over 150 studies of the BERT model, reviewing the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression.
Abstract
Transformer-based models have pushed the state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression. We then outline directions for future research.



Citations
Proceedings Article · DOI

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

TL;DR: The authors take a step back and ask: How big is too big? What are the possible risks associated with this technology, and what paths are available for mitigating those risks? They provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises that evaluate how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever-larger language models.
Posted Content

Pretrained Transformers for Text Ranking: BERT and Beyond

TL;DR: This tutorial provides an overview of text ranking with neural network architectures known as transformers, of which BERT (Bidirectional Encoder Representations from Transformers) is the best-known example, and covers a wide range of techniques.
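
The tutorial covers many ranking architectures; as a hedged sketch of the basic cross-encoder ("monoBERT"-style) setup that such BERT-based rankers build on, the snippet below scores a query–passage pair with a sequence classification head. The model name and the freshly initialized head are assumptions for illustration, and the score is only meaningful after fine-tuning on ranking data.

```python
# Minimal cross-encoder relevance scorer (illustrative, not the tutorial's code).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def relevance_score(query: str, passage: str) -> float:
    # Encode the pair as one sequence: [CLS] query [SEP] passage [SEP]
    inputs = tokenizer(query, passage, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Probability of the "relevant" class; meaningful only after fine-tuning.
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(relevance_score("what is BERT?", "BERT is a bidirectional transformer encoder."))
```
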
Posted Content

LEGAL-BERT: The Muppets straight out of Law School

TL;DR: In this article, the authors explore several approaches for applying BERT models to downstream legal tasks, evaluating them on multiple datasets, and propose using a broader hyper-parameter search space when fine-tuning.
Proceedings Article · DOI

Learning How to Ask: Querying LMs with Mixtures of Soft Prompts

TL;DR: This work explores learning prompts by gradient descent, either fine-tuning prompts taken from previous work or starting from random initialization, and shows that the implicit factual knowledge in language models was previously underestimated.
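
As a minimal sketch of the general soft-prompt idea (not the authors' exact method), the snippet below learns a few continuous prompt vectors by gradient descent while a frozen bert-base-uncased masked LM stays fixed. The prompt length, learning rate, and the single training fact are illustrative assumptions.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.requires_grad_(False)  # freeze the language model; only the prompt is trained

n_prompt, dim = 5, model.config.hidden_size
soft_prompt = torch.nn.Parameter(torch.randn(n_prompt, dim) * 0.02)  # trainable prompt vectors
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

# One illustrative training example: a cloze query supervised with its answer token.
enc = tokenizer("Dante was born in [MASK].", return_tensors="pt")
answer_id = tokenizer.convert_tokens_to_ids("florence")

tok_embeds = model.get_input_embeddings()(enc["input_ids"])           # (1, seq, dim)
inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), tok_embeds], 1)  # prepend soft prompts
attention_mask = torch.cat(
    [torch.ones(1, n_prompt, dtype=torch.long), enc["attention_mask"]], 1
)
mask_pos = n_prompt + (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

logits = model(inputs_embeds=inputs_embeds, attention_mask=attention_mask).logits
loss = torch.nn.functional.cross_entropy(
    logits[0, mask_pos].unsqueeze(0), torch.tensor([answer_id])
)
loss.backward()   # gradients reach only the soft prompt
optimizer.step()
```
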
Proceedings Article · DOI

Factual Probing Is [MASK]: Learning vs. Learning to Recall.

TL;DR: This work proposes OptiPrompt, a novel and efficient method that directly optimizes prompts in continuous embedding space and is able to predict an additional 6.4% of facts in the LAMA benchmark.
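
For contrast with such optimized prompts, a hand-written LAMA-style cloze probe, the baseline that OptiPrompt improves on, can be sketched as below; the model and the example fact are illustrative assumptions.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

enc = tokenizer("The capital of France is [MASK].", return_tensors="pt")
mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**enc).logits
top5 = torch.topk(logits[0, mask_pos], k=5).indices.tolist()
# The fact counts as recalled if the gold answer ("paris") is ranked first.
print(tokenizer.convert_ids_to_tokens(top5))
```
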
References
Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: This paper introduces BERT, a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
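
As a hedged illustration of what "one additional output layer" amounts to in practice, the sketch below puts a randomly initialized linear classifier on top of BERT's [CLS] representation and backpropagates through the whole encoder. The model name, label set, and toy inputs are assumptions, not the paper's setup.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)  # e.g. a binary task head

def forward(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, hidden)
    cls = hidden[:, 0]                           # [CLS] token representation
    return classifier(cls)                       # task logits

logits = forward(["a great movie", "a dull movie"])
loss = torch.nn.functional.cross_entropy(logits, torch.tensor([1, 0]))
loss.backward()  # gradients flow into both the new layer and all BERT parameters
```
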
Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

TL;DR: It is found that BERT was significantly undertrained and, with careful tuning, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
Posted Content

Distilling the Knowledge in a Neural Network

TL;DR: This work shows that distilling the knowledge in an ensemble of models into a single model can significantly improve the acoustic model of a heavily used commercial system, and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
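
A minimal sketch of the distillation objective described here: the student matches the teacher's temperature-softened output distribution in addition to the hard labels. The temperature and mixing weight below are illustrative choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft targets: KL between softened teacher and student distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 10-class problem.
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
print(distillation_loss(student, teacher, torch.randint(0, 10, (8,))))
```
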
Posted Content

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: In this paper, the Skip-gram model is used to learn high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships, and several extensions are presented that improve both the quality of the vectors and the training speed.
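
A compact sketch of the skip-gram objective with negative sampling that the paper describes: observed (center, context) pairs are pulled together while k sampled "negative" words are pushed away. The vocabulary size, dimensionality, number of negatives, and the uniform negative distribution (the paper samples from a smoothed unigram distribution) are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

vocab, dim, k = 10_000, 100, 5
in_vecs = torch.nn.Embedding(vocab, dim)   # center-word vectors
out_vecs = torch.nn.Embedding(vocab, dim)  # context-word vectors

def sgns_loss(center, context):
    v_c = in_vecs(center)                                # (batch, dim)
    v_o = out_vecs(context)                              # (batch, dim)
    neg = torch.randint(0, vocab, (center.size(0), k))   # uniform negatives for simplicity
    v_n = out_vecs(neg)                                  # (batch, k, dim)
    pos_term = F.logsigmoid((v_c * v_o).sum(-1))                        # log sigma(v_o . v_c)
    neg_term = F.logsigmoid(-(v_n * v_c.unsqueeze(1)).sum(-1)).sum(-1)  # sum over negatives
    return -(pos_term + neg_term).mean()

loss = sgns_loss(torch.randint(0, vocab, (32,)), torch.randint(0, vocab, (32,)))
loss.backward()
```
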
Posted Content

Attention Is All You Need

TL;DR: This paper proposes the Transformer, a new simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
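
A minimal sketch of the scaled dot-product attention at the core of the Transformer, softmax(QK^T / sqrt(d_k)) V, shown for a single head without masking or the multi-head projections.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., seq_q, seq_k)
    weights = torch.softmax(scores, dim=-1)            # attention distribution per query
    return weights @ v                                 # weighted sum of value vectors

q = k = v = torch.randn(2, 7, 64)  # batch of 2, sequence length 7, d_k = 64
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 7, 64])
```
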