A Primer in BERTology: What We Know About How BERT Works
TL;DR
This paper surveys over 150 studies of the BERT model, reviewing the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression.
Abstract
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
Citations
Proceedings Article
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
TL;DR: The authors take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? They provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.
Posted Content
Pretrained Transformers for Text Ranking: BERT and Beyond
TL;DR: This tutorial provides an overview of text ranking with neural network architectures known as transformers, of which BERT (Bidirectional Encoder Representations from Transformers) is the best-known example, and covers a wide range of techniques.
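A minimal sketch of the cross-encoder re-ranking setup this line of work builds on, in which a BERT classification head scores each (query, passage) pair; the checkpoint name, query, and passages below are illustrative placeholders, and the untuned head would need fine-tuning on ranking data before the scores are meaningful:

```python
# Cross-encoder re-ranking sketch: score each (query, passage) pair with a
# BERT sequence-classification head and sort passages by score.
# Assumes the Hugging Face `transformers` library; inputs are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)
model.eval()

query = "what is BERT used for"
passages = [
    "BERT is a pretrained language model used for many NLP tasks.",
    "A recipe for tomato soup with basil and garlic.",
]

with torch.no_grad():
    enc = tokenizer([query] * len(passages), passages,
                    padding=True, truncation=True, return_tensors="pt")
    scores = model(**enc).logits.squeeze(-1)   # one relevance score per pair

ranked = sorted(zip(passages, scores.tolist()), key=lambda x: x[1], reverse=True)
print(ranked)
```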
Posted Content
LEGAL-BERT: The Muppets straight out of Law School
Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, Ion Androutsopoulos
TL;DR: In this article, the authors explore several approaches for applying BERT models to downstream legal tasks, evaluate on multiple datasets, and propose a broader hyper-parameter search space for fine-tuning.
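A hedged illustration of what "a broader hyper-parameter search space" can look like in practice; the grid values and the run_finetuning entry point are hypothetical, not the search space proposed in the paper:

```python
# Illustrative fine-tuning grid; values and run_finetuning are placeholders.
from itertools import product

learning_rates = [1e-5, 2e-5, 3e-5, 5e-5]
batch_sizes = [8, 16, 32]
num_epochs = [3, 4, 10]

for lr, bs, ep in product(learning_rates, batch_sizes, num_epochs):
    config = {"learning_rate": lr, "batch_size": bs, "epochs": ep}
    print(config)   # in practice: run_finetuning(config) and keep the best dev score
```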
Proceedings Article
Learning How to Ask: Querying LMs with Mixtures of Soft Prompts
Guanghui Qin, Jason Eisner
TL;DR: This work explores learning prompts by gradient descent, either fine-tuning prompts taken from previous work or starting from random initialization, and shows that the implicit factual knowledge in language models was previously underestimated.
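A minimal sketch of the underlying idea, assuming a Hugging Face masked LM: a handful of continuous prompt vectors are prepended to the frozen model's input embeddings and updated by gradient descent; the checkpoint, prompt length, and cloze example are illustrative, not the authors' implementation:

```python
# Soft-prompt sketch: learnable prompt embeddings prepended to a frozen
# masked LM's input embeddings, trained by gradient descent on a cloze task.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
for p in model.parameters():            # freeze the language model
    p.requires_grad = False

n_prompt, dim = 5, model.config.hidden_size
soft_prompt = torch.nn.Parameter(torch.randn(n_prompt, dim) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

enc = tokenizer("Paris is the capital of [MASK].", return_tensors="pt")
target_id = tokenizer.convert_tokens_to_ids("france")
mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

word_emb = model.get_input_embeddings()(enc["input_ids"])          # (1, seq, dim)
inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), word_emb], dim=1)
attn = torch.ones(1, inputs_embeds.size(1))

logits = model(inputs_embeds=inputs_embeds, attention_mask=attn).logits
loss = torch.nn.functional.cross_entropy(
    logits[0, n_prompt + mask_pos].unsqueeze(0),   # logits at the [MASK] slot
    torch.tensor([target_id]))
loss.backward()         # only soft_prompt receives gradients
optimizer.step()
```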
Proceedings Article
Factual Probing Is [MASK]: Learning vs. Learning to Recall.
TL;DR: This work proposes OptiPrompt, a novel and efficient method that optimizes prompts directly in continuous embedding space and predicts an additional 6.4% of facts in the LAMA benchmark.
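For context, the LAMA benchmark queries facts with cloze templates; a minimal hedged illustration of such a probe with a fixed textual prompt (OptiPrompt instead learns the prompt in continuous embedding space), with an illustrative template and checkpoint:

```python
# LAMA-style cloze probe: ask a masked LM to fill in a factual slot.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The capital of France is [MASK].", top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```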
References
Posted Content
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: BERT, a new language representation model, pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pretrained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
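A minimal sketch of the masked-language-modeling objective behind this bidirectional pre-training, assuming the Hugging Face transformers API; the sentence and masked position are illustrative:

```python
# Masked-LM objective sketch: hide one token and train the model to recover
# it from context on both sides.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

enc = tokenizer("the cat sat on the mat", return_tensors="pt")
labels = enc["input_ids"].clone()
masked_ids = enc["input_ids"].clone()
masked_ids[0, 3] = tokenizer.mask_token_id              # hide one interior token
labels[masked_ids != tokenizer.mask_token_id] = -100    # loss only on the hidden position

out = model(input_ids=masked_ids, attention_mask=enc["attention_mask"], labels=labels)
out.loss.backward()     # gradients for recovering the hidden token from bidirectional context
print(out.loss.item())
```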
Posted Content
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Michael Lewis, Luke Zettlemoyer, Veselin Stoyanov
TL;DR: It is found that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
Posted Content
Distilling the Knowledge in a Neural Network
TL;DR: This work shows that the acoustic model of a heavily used commercial system can be significantly improved by distilling the knowledge in an ensemble of models into a single model, and introduces a new type of ensemble composed of one or more full models and many specialist models that learn to distinguish fine-grained classes the full models confuse.
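The core of the method is a distillation loss that matches temperature-softened teacher and student distributions; a minimal sketch under assumed shapes and temperature:

```python
# Distillation loss sketch: KL divergence between temperature-softened
# teacher and student distributions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Soften both distributions with temperature T; scale by T**2 so gradient
    # magnitudes stay comparable across temperatures.
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)

student_logits = torch.randn(4, 10, requires_grad=True)   # (batch, classes)
teacher_logits = torch.randn(4, 10)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```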
Posted Content
Distributed Representations of Words and Phrases and their Compositionality
TL;DR: In this paper, the Skip-gram model is used to learn high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships, and several extensions are presented that improve both the quality of the vectors and the training speed.
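A small illustration of training Skip-gram vectors with negative sampling, assuming gensim 4.x; the toy corpus is a placeholder:

```python
# Skip-gram with negative sampling via gensim (sg=1 selects Skip-gram).
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=5,
                 min_count=1, epochs=50)

print(model.wv.most_similar("cat", topn=3))   # nearest neighbors in vector space
```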
Posted Content
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This work proposes the Transformer, a simple new network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
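At the heart of the Transformer is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V; a minimal sketch with illustrative shapes:

```python
# Scaled dot-product attention, the core operation of the Transformer.
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, q_len, k_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # attention distribution
    return weights @ V, weights

Q = torch.randn(2, 5, 64)   # (batch, query positions, d_k)
K = torch.randn(2, 7, 64)
V = torch.randn(2, 7, 64)
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)   # torch.Size([2, 5, 64]) torch.Size([2, 5, 7])
```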