
Showing papers by "Luke Zettlemoyer published in 2021"


Posted Content
TL;DR: mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking (MEL) problem, the task of resolving language-specific mentions to a multilingual Knowledge Base (KB); it treats the target language as a latent variable that is marginalized at prediction time, leading to over 50% improvements in average accuracy.
Abstract: We present mGENRE, a sequence-to-sequence system for the Multilingual Entity Linking (MEL) problem -- the task of resolving language-specific mentions to a multilingual Knowledge Base (KB). For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token in an autoregressive fashion. The autoregressive formulation allows us to effectively cross-encode mention string and entity names to capture more interactions than the standard dot product between mention and entity vectors. It also enables fast search within a large KB even for mentions that do not appear in mention tables and with no need for large-scale vector indices. While prior MEL works use a single representation for each entity, we match against entity names of as many languages as possible, which allows exploiting language connections between source input and target name. Moreover, in a zero-shot setting on languages with no training data at all, mGENRE treats the target language as a latent variable that is marginalized at prediction time. This leads to over 50% improvements in average accuracy. We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks where mGENRE establishes new state-of-the-art results. Code and pre-trained models at this https URL.
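
A minimal sketch (not the authors' code) of mGENRE-style language marginalization: a candidate entity is scored by summing, over languages, the probability that a seq2seq model assigns to each of its language-specific names. The seq2seq_logprob interface below is a hypothetical stand-in.

import math

def score_entity(seq2seq_logprob, mention, names_by_lang):
    # names_by_lang: {"en": "Rome", "it": "Roma", ...}; seq2seq_logprob is assumed
    # to return log p(name | mention context) from an autoregressive seq2seq model.
    logps = [seq2seq_logprob(mention, name) for name in names_by_lang.values()]
    m = max(logps)
    # log-sum-exp: marginalize the latent target language at prediction time.
    return m + math.log(sum(math.exp(lp - m) for lp in logps))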

51 citations


Posted Content
TL;DR: This paper proposed pre-finetuning, an additional large-scale learning stage between language model pre-training and fine-tuning, which is designed to encourage learning of representations that generalize better to many different tasks.
Abstract: We propose pre-finetuning, an additional large-scale learning stage between language model pre-training and fine-tuning. Pre-finetuning is massively multi-task learning (around 50 datasets, over 4.8 million total labeled examples), and is designed to encourage learning of representations that generalize better to many different tasks. We show that pre-finetuning consistently improves performance for pretrained discriminators (e.g.~RoBERTa) and generation models (e.g.~BART) on a wide range of tasks (sentence prediction, commonsense reasoning, MRC, etc.), while also significantly improving sample efficiency during fine-tuning. We also show that large-scale multi-tasking is crucial; pre-finetuning can hurt performance when few tasks are used up until a critical point (usually above 15) after which performance improves linearly in the number of tasks.

50 citations


Proceedings Article
03 May 2021
TL;DR: In this article, the authors present a simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampled from either a normal or uniform distribution), thereby discouraging representation change during fine-tuning when possible without hurting performance.
Abstract: Although widely adopted, existing approaches for fine-tuning pre-trained language models have been shown to be unstable across hyper-parameter settings, motivating recent work on trust region methods. In this paper, we present a simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampling from either a normal or uniform distribution), thereby discouraging representation change during fine-tuning when possible without hurting performance. We also introduce a new analysis to motivate the use of trust region methods more generally, by studying representational collapse: the degradation of generalizable representations from pre-trained models as they are fine-tuned for a specific end task. Extensive experiments show that our fine-tuning method matches or exceeds the performance of previous trust region methods on a range of understanding and generation tasks (including DailyMail/CNN, Gigaword, Reddit TIFU, and the GLUE benchmark), while also being much faster. We also show that it is less prone to representation collapse: the pre-trained models maintain more generalizable representations every time they are fine-tuned.
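
Below is a hedged sketch of the noise-based trust-region penalty described above, assuming a HuggingFace-style model that accepts inputs_embeds and returns logits; the names and hyperparameter values are illustrative, not the authors' implementation.

import torch
import torch.nn.functional as F

def noise_trust_region_loss(model, embeds, labels, eps=1e-5, lam=1.0):
    # Task loss on the clean input embeddings.
    logits = model(inputs_embeds=embeds).logits
    task_loss = F.cross_entropy(logits, labels)
    # Perturb the embeddings with uniform noise (a normal draw also works).
    noise = torch.empty_like(embeds).uniform_(-eps, eps)
    noisy_logits = model(inputs_embeds=embeds + noise).logits
    # Symmetric KL between clean and noisy predictions discourages representation drift.
    p = F.log_softmax(logits, dim=-1)
    q = F.log_softmax(noisy_logits, dim=-1)
    sym_kl = 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                    + F.kl_div(q, p, log_target=True, reduction="batchmean"))
    return task_loss + lam * sym_kl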

46 citations


Proceedings Article
03 May 2021
TL;DR: kNN-MT, as discussed by the authors, predicts tokens with a nearest-neighbor classifier over a large datastore of cached examples, using representations from a neural translation model for similarity search.
Abstract: We introduce k-nearest-neighbor machine translation (kNN-MT), which predicts tokens with a nearest-neighbor classifier over a large datastore of cached examples, using representations from a neural translation model for similarity search. This approach requires no additional training and scales to give the decoder direct access to billions of examples at test time, resulting in a highly expressive model that consistently improves performance across many settings. Simply adding nearest-neighbor search improves a state-of-the-art German-English translation model by 1.5 BLEU. kNN-MT allows a single model to be adapted to diverse domains by using a domain-specific datastore, improving results by an average of 9.2 BLEU over zero-shot transfer, and achieving new state-of-the-art results---without training on these domains. A massively multilingual model can also be specialized for particular language pairs, with improvements of 3 BLEU for translating from English into German and Chinese. Qualitatively, kNN-MT is easily interpretable; it combines source and target context to retrieve highly relevant examples.
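
A compact sketch of the retrieve-and-interpolate step, assuming a FAISS-style nearest-neighbor index over cached decoder states; k, the softmax temperature, and the interpolation weight are illustrative hyperparameters rather than the paper's settings.

import numpy as np

def knn_mt_probs(model_probs, query, index, datastore_tokens, vocab_size,
                 k=8, temp=10.0, lam=0.5):
    # Retrieve the k cached decoder states closest to the current query state.
    dists, idxs = index.search(query[None, :].astype("float32"), k)
    # Turn neighbor distances into a distribution over their stored next tokens.
    knn = np.zeros(vocab_size)
    weights = np.exp(-dists[0] / temp)
    for w, i in zip(weights, idxs[0]):
        knn[datastore_tokens[i]] += w
    knn /= knn.sum()
    # Interpolate with the base translation model's next-token distribution.
    return lam * knn + (1 - lam) * model_probs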

39 citations


Posted Content
TL;DR: This article proposed a contrastive approach to pre-train a unified model for zero-shot video and text understanding without using any labels on downstream tasks, which achieved state-of-the-art performance.
Abstract: We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at this https URL.
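
As an illustration of the contrastive recipe, here is a generic InfoNCE-style objective with in-batch negatives; it is not the authors' exact loss, which additionally retrieves hard negatives.

import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (B, D) pooled embeddings of temporally overlapping pairs.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                   # similarities; off-diagonals act as negatives
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))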

28 citations


Posted Content
TL;DR: This article proposes a balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers, and formulates token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens.
Abstract: We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules that contain only a small fraction of the model parameters. However, it can be difficult to learn balanced routing functions that make full use of the available experts; existing approaches typically use routing heuristics or auxiliary expert-balancing loss functions. In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens. This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyperparameters or auxiliary losses. Code is publicly released at this https URL
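
A small sketch of balanced token-to-expert assignment framed as a linear assignment problem and solved with scipy; production implementations shard this computation across workers, and the interface below is illustrative.

import numpy as np
from scipy.optimize import linear_sum_assignment

def balanced_assignment(scores):
    # scores: (num_tokens, num_experts) token-to-expert affinities.
    num_tokens, num_experts = scores.shape
    assert num_tokens % num_experts == 0
    slots = num_tokens // num_experts            # each expert receives exactly this many tokens
    # Give every expert `slots` columns so the assignment problem is square.
    cost = -np.repeat(scores, slots, axis=1)     # maximize affinity = minimize negative affinity
    rows, cols = linear_sum_assignment(cost)
    return cols[np.argsort(rows)] // slots       # expert id assigned to each token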

15 citations


Proceedings ArticleDOI
04 May 2021
TL;DR: This article proposed a method for learning to model hallucination detection, based on pretrained language models fine-tuned on synthetic data that includes automatically inserted hallucinations, which achieved an average F1 of around 0.6 across all the benchmark datasets.
Abstract: Neural sequence models can generate highly fluent sentences, but recent studies have also shown that they are prone to hallucinate additional content not supported by the input, which can cause a lack of trust in the model. To better assess the faithfulness of the machine outputs, we propose a new task to predict whether each token in the output sequence is hallucinated conditioned on the source input, and collect new manually annotated evaluation sets for this task. We also introduce a novel method for learning to model hallucination detection, based on pretrained language models fine-tuned on synthetic data that includes automatically inserted hallucinations. Experiments on machine translation and abstractive text summarization demonstrate the effectiveness of our proposed approach -- we obtain an average F1 of around 0.6 across all the benchmark datasets. Furthermore, we demonstrate how to use the token-level hallucination labels to define a fine-grained loss over the target sequence in low-resource machine translation and achieve significant improvements over strong baseline methods. We will also release our annotated data and code for future research.

14 citations


Proceedings ArticleDOI
01 Aug 2021
TL;DR: The authors proposed a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks.
Abstract: We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, limiting early cross-modal fusion. We instead introduce new pretraining masking schemes that better mix across modalities (e.g. by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g. unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous methods, often outperforming task-specific pre-training.

13 citations


Posted Content
TL;DR: This article proposed Domain Conditional Pointwise Mutual Information (DPMI), which directly compensates for surface form competition by reweighing each option according to a term that is proportional to its a priori likelihood within the context of the specific zero-shot task.
Abstract: Large language models have shown promising results in zero-shot settings (Brown et al., 2020; Radford et al., 2019). For example, they can perform multiple choice tasks simply by conditioning on a question and selecting the answer with the highest probability. However, ranking by string probability can be problematic due to surface form competition, wherein different surface forms compete for probability mass, even if they represent the same underlying concept, e.g. "computer" and "PC." Since probability mass is finite, this lowers the probability of the correct answer, due to competition from other strings that are valid answers (but not one of the multiple choice options). We introduce Domain Conditional Pointwise Mutual Information, an alternative scoring function that directly compensates for surface form competition by simply reweighing each option according to a term that is proportional to its a priori likelihood within the context of the specific zero-shot task. It achieves consistent gains in zero-shot performance over both calibrated (Zhao et al., 2021) and uncalibrated scoring functions on all GPT-2 and GPT-3 models over a variety of multiple choice datasets.
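
A minimal sketch of domain-conditional PMI scoring for zero-shot multiple choice, assuming a hypothetical logprob(prompt, continuation) helper that returns the log-probability a language model assigns to the continuation given the prompt.

def dcpmi_choose(logprob, question, options, domain_premise="Answer:"):
    scores = []
    for option in options:
        conditional = logprob(question, option)        # log P(option | question)
        prior = logprob(domain_premise, option)        # log P(option | task-domain context)
        scores.append(conditional - prior)             # domain-conditional PMI
    return options[max(range(len(options)), key=scores.__getitem__)]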

12 citations


Proceedings Article
03 May 2021
TL;DR: Low-rank adaptive label smoothing (LORAS) is proposed, a simple yet novel method for training with learned soft targets that generalizes label smoothing and adapts to the latent structure of the label space in structured prediction tasks.
Abstract: Training with soft targets instead of hard targets has been shown to improve performance and calibration of deep neural networks. Label smoothing is a popular way of computing soft targets, where one-hot encoding of a class is smoothed with a uniform distribution. Owing to its simplicity, label smoothing has found wide-spread use for training deep neural networks on a wide variety of tasks, ranging from image and text classification to machine translation and semantic parsing. Complementing recent empirical justification for label smoothing, we obtain PAC-Bayesian generalization bounds for label smoothing and show that the generalization error depends on the choice of the noise (smoothing) distribution. Then we propose low-rank adaptive label smoothing (LORAS): a simple yet novel method for training with learned soft targets that generalizes label smoothing and adapts to the latent structure of the label space in structured prediction tasks. Specifically, we evaluate our method on semantic parsing tasks and show that training with appropriately smoothed soft targets can significantly improve accuracy and model calibration, especially in low-resource settings. Used in conjunction with pre-trained sequence-to-sequence models, our method achieves state of the art performance on four semantic parsing data sets. LORAS can be used with any model, improves performance and implicit model calibration without increasing the number of model parameters, and can be scaled to problems with large label spaces containing tens of thousands of labels.
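
For reference, standard label smoothing, the baseline that LORAS generalizes, computes soft targets by mixing the one-hot label with a uniform distribution; a minimal PyTorch version follows.

import torch
import torch.nn.functional as F

def smoothed_targets(labels, num_classes, alpha=0.1):
    # labels: (B,) integer class ids; returns (B, num_classes) soft targets.
    onehot = F.one_hot(labels, num_classes).float()
    return (1 - alpha) * onehot + alpha / num_classes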

12 citations


Posted Content
TL;DR: This article introduced a noisy channel approach for language model prompting in few-shot text classification, where instead of computing the likelihood of the label given the input, channel models compute the conditional probability of the input given the label, and are thereby required to explain every word in the input.
Abstract: We introduce a noisy channel approach for language model prompting in few-shot text classification. Instead of computing the likelihood of the label given the input (referred to as direct models), channel models compute the conditional probability of the input given the label, and are thereby required to explain every word in the input. We use channel models for recently proposed few-shot learning methods with no or very limited updates to the language model parameters, via either in-context demonstration or prompt tuning. Our experiments show that, for both methods, channel models significantly outperform their direct counterparts, which we attribute to their stability, i.e., lower variance and higher worst-case accuracy. We also present extensive ablations that provide recommendations for when to use channel prompt tuning instead of other competitive models (e.g., direct head tuning): channel prompt tuning is preferred when the number of training examples is small, labels in the training data are imbalanced, or generalization to unseen labels is required.
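
A short sketch contrasting channel and direct scoring for few-shot classification; the lm_logprob(prompt, continuation) helper and the verbalizer strings are illustrative assumptions, not a specific API.

def classify(lm_logprob, text, verbalizers, channel=True):
    # verbalizers: {"positive": "This review is great.", "negative": "This review is terrible."}
    scores = {}
    for label, verbalizer in verbalizers.items():
        if channel:
            scores[label] = lm_logprob(verbalizer, text)   # channel: log P(input | label)
        else:
            scores[label] = lm_logprob(text, verbalizer)   # direct: log P(label | input)
    return max(scores, key=scores.get)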

Proceedings Article
03 May 2021
TL;DR: DeLighT as mentioned in this paper is a deep and light-weight transformer that matches or improves the performance of standard transformer-based models on machine translation and language modeling tasks with 2 to 3 times fewer parameters on average.
Abstract: We introduce a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters. DeLighT more efficiently allocates parameters both (1) within each Transformer block using the DeLighT transformation, a deep and light-weight transformation and (2) across blocks using block-wise scaling, that allows for shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks are 2.5 to 4 times deeper than standard transformer models and yet have fewer parameters and operations. Experiments on benchmark machine translation and language modeling tasks show that DeLighT matches or improves the performance of baseline Transformers with 2 to 3 times fewer parameters on average.

Proceedings ArticleDOI
01 Aug 2021
TL;DR: The authors empirically show that common pre-trained models have a very low intrinsic dimension; in other words, there exists a low dimension reparameterization that is as effective for fine-tuning as the full parameter space.
Abstract: Although pretrained language models can be fine-tuned to produce state-of-the-art results for a very wide range of language understanding tasks, the dynamics of this process are not well understood, especially in the low data regime. Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples? In this paper, we argue that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions to explain this remarkable phenomenon. We empirically show that common pre-trained models have a very low intrinsic dimension; in other words, there exists a low dimension reparameterization that is as effective for fine-tuning as the full parameter space. For example, by optimizing only 200 trainable parameters randomly projected back into the full space, we can tune a RoBERTa model to achieve 90% of the full parameter performance levels on MRPC. Furthermore, we empirically show that pre-training implicitly minimizes intrinsic dimension and, perhaps surprisingly, larger models tend to have lower intrinsic dimension after a fixed number of pre-training updates, at least in part explaining their extreme effectiveness. Lastly, we connect intrinsic dimensionality with low dimensional task representations and compression based generalization bounds to provide intrinsic-dimension-based generalization bounds that are independent of the full parameter count.
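
A toy sketch of the reparameterization behind intrinsic-dimension fine-tuning: only a d-dimensional vector z is trained, and the full weights are recovered as theta0 + P z through a fixed random projection. The sizes and the objective below are placeholders, not the paper's setup.

import torch

D_full, d = 10_000, 200                  # full parameter count vs. intrinsic dimension
theta0 = torch.randn(D_full)             # frozen "pre-trained" weights
P = torch.randn(D_full, d) / d ** 0.5    # fixed random projection
z = torch.zeros(d, requires_grad=True)   # the only trainable parameters

def flat_params():
    return theta0 + P @ z                # reparameterized full weight vector

optimizer = torch.optim.SGD([z], lr=0.1)
loss = flat_params().pow(2).mean()       # placeholder objective standing in for a task loss
loss.backward()
optimizer.step()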

Posted Content
TL;DR: The authors proposed a pre-training objective based on question answering for learning general-purpose contextual representations, motivated by the intuition that the representation of a phrase in a passage should encode all questions that the phrase can answer in context.
Abstract: This paper proposes a pre-training objective based on question answering (QA) for learning general-purpose contextual representations, motivated by the intuition that the representation of a phrase in a passage should encode all questions that the phrase can answer in context. We accomplish this goal by training a bi-encoder QA model, which independently encodes passages and questions, to match the predictions of a more accurate cross-encoder model on 80 million synthesized QA pairs. By encoding QA-relevant information, the bi-encoder's token-level representations are useful for non-QA downstream tasks without extensive (or in some cases, any) fine-tuning. We show large improvements over both RoBERTa-large and previous state-of-the-art results on zero-shot and few-shot paraphrase detection on four datasets, few-shot named entity recognition on two datasets, and zero-shot sentiment analysis on three datasets.
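
A rough sketch of the distillation setup described above: a bi-encoder that encodes passages and questions independently is trained to match the predictions of a more accurate cross-encoder teacher on synthetic QA pairs. The loss choice and tensor interfaces here are illustrative.

import torch
import torch.nn.functional as F

def bi_encoder_distill_loss(bi_logits, cross_logits):
    # bi_logits, cross_logits: (B, num_answer_candidates) scores from the bi-encoder
    # student and the cross-encoder teacher for the same synthesized QA pairs.
    teacher = F.softmax(cross_logits.detach(), dim=-1)
    student = F.log_softmax(bi_logits, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")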

Proceedings ArticleDOI
01 Apr 2021
TL;DR: This article proposed FEWS (Few-shot Examples of Word Senses), a new low-shot WSD dataset automatically extracted from example sentences in Wiktionary; it offers high sense coverage across different natural language domains, a large training set that covers many more senses than previous datasets, and a comprehensive evaluation set containing few- and zero-shot examples of a wide variety of senses.
Abstract: Current models for Word Sense Disambiguation (WSD) struggle to disambiguate rare senses, despite reaching human performance on global WSD metrics. This stems from a lack of data for both modeling and evaluating rare senses in existing WSD datasets. In this paper, we introduce FEWS (Few-shot Examples of Word Senses), a new low-shot WSD dataset automatically extracted from example sentences in Wiktionary. FEWS has high sense coverage across different natural language domains and provides: (1) a large training set that covers many more senses than previous datasets and (2) a comprehensive evaluation set containing few- and zero-shot examples of a wide variety of senses. We establish baselines on FEWS with knowledge-based and neural WSD approaches and present transfer learning experiments demonstrating that models additionally trained with FEWS better capture rare senses in existing WSD datasets. Finally, we find humans outperform the best baseline models on FEWS, indicating that FEWS will support significant future work on low-shot WSD.

Proceedings ArticleDOI
01 Aug 2021
TL;DR: The authors combine unsupervised bitext mining and word alignment to improve the quality of bilingual lexicons, achieving state-of-the-art performance on the BUCC 2020 shared task.
Abstract: Bilingual lexicons map words in one language to their translations in another, and are typically induced by learning linear projections to align monolingual word embedding spaces. In this paper, we show it is possible to produce much higher quality lexicons with methods that combine (1) unsupervised bitext mining and (2) unsupervised word alignment. Directly applying a pipeline that uses recent algorithms for both subproblems significantly improves induced lexicon quality and further gains are possible by learning to filter the resulting lexical entries, with both unsupervised and semi-supervised schemes. Our final model outperforms the state of the art on the BUCC 2020 shared task by 14 F1 points averaged over 12 language pairs, while also providing a more interpretable approach that allows for rich reasoning of word meaning in context. Further analysis of our output and the standard reference lexicons suggests they are of comparable quality, and new benchmarks may be needed to measure further progress on this task.

Posted Content
TL;DR: The authors proposed HTLM, a hyper-text language model trained on a large-scale web crawl for zero-shot summarization, and showed that pretraining with a BART-style denoising loss on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels.
Abstract: We introduce HTLM, a hyper-text language model trained on a large-scale web crawl. Modeling hyper-text has a number of advantages: (1) it is easily gathered at scale, (2) it provides rich document-level and end-task-adjacent supervision (e.g. class and id attributes often encode document category information), and (3) it allows for new structured prompting that follows the established semantics of HTML (e.g. to do zero-shot summarization by infilling title tags for a webpage that contains the input text). We show that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels. HTLM matches or exceeds the performance of comparably sized text-only LMs for zero-shot prompting and fine-tuning for classification benchmarks, while also setting new state-of-the-art performance levels for zero-shot summarization. We also find that hyper-text prompts provide more value to HTLM, in terms of data efficiency, than plain text prompts do for existing LMs, and that HTLM is highly effective at auto-prompting itself, by simply generating the most likely hyper-text formatting for any available training data. We will release all code and models to support future HTLM research.
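
As a concrete illustration of the structured-prompting idea, a zero-shot summarization prompt can be phrased as HTML whose title element is left for the model to infill; the exact mask/infilling token shown here is an assumption, not the model's documented format.

def html_summarization_prompt(document_text):
    # Ask the model to infill the <title> element for a page containing the document.
    return ("<html><head><title><mask></title></head>"
            f"<body>{document_text}</body></html>")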


Posted Content
TL;DR: Luna as discussed by the authors proposes a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity.
Abstract: The quadratic computational and memory complexities of the Transformer's attention mechanism have limited its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function. As compared to a more traditional attention mechanism, Luna introduces an additional sequence with a fixed length as input and an additional corresponding output, which allows Luna to perform attention operations linearly, while also storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate both the effectiveness and efficiency of Luna compared to a variety of strong baseline methods.
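
A compact sketch of the nested pack/unpack attention pattern, using PyTorch's scaled_dot_product_attention and single-head projections for brevity; this is an illustration of the mechanism, not the paper's full architecture.

import torch
import torch.nn.functional as F

def luna_attention(x, p, wq, wk, wv):
    # x: (B, N, D) input sequence; p: (B, L, D) fixed-length extra sequence with L << N.
    packed = F.scaled_dot_product_attention(p @ wq, x @ wk, x @ wv)              # pack: (B, L, D)
    unpacked = F.scaled_dot_product_attention(x @ wq, packed @ wk, packed @ wv)  # unpack: (B, N, D)
    return unpacked, packed   # the packed sequence is carried forward as the extra output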

Proceedings Article
16 Apr 2021
TL;DR: This article proposed Domain Conditional Pointwise Mutual Information (DPMI), which directly compensates for surface form competition by reweighing each option according to its a priori likelihood within the context of a specific task.
Abstract: Large language models have shown promising results in zero-shot settings. For example, they can perform multiple choice tasks simply by conditioning on a question and selecting the answer with the highest probability. However, ranking by string probability can be problematic due to surface form competition—wherein different surface forms compete for probability mass, even if they represent the same underlying concept in a given context, e.g. “computer” and “PC.” Since probability mass is finite, this lowers the probability of the correct answer, due to competition from other strings that are valid answers (but not one of the multiple choice options). We introduce Domain Conditional Pointwise Mutual Information, an alternative scoring function that directly compensates for surface form competition by simply reweighing each option according to its a priori likelihood within the context of a specific task. It achieves consistent gains in zero-shot performance over both calibrated and uncalibrated scoring functions on all GPT-2 and GPT-3 models on a variety of multiple choice datasets.

Posted Content
TL;DR: The authors introduce a domain expert mixture (DEMix) layer that enables conditioning a language model (LM) on the domain of the input text, which makes the LM modular: experts can be mixed, added or removed after initial training.
Abstract: We introduce a new domain expert mixture (DEMix) layer that enables conditioning a language model (LM) on the domain of the input text. A DEMix layer is a collection of expert feedforward networks, each specialized to a domain, that makes the LM modular: experts can be mixed, added or removed after initial training. Extensive experiments with autoregressive transformer LMs (up to 1.3B parameters) show that DEMix layers reduce test-time perplexity, increase training efficiency, and enable rapid adaptation with little overhead. We show that mixing experts during inference, using a parameter-free weighted ensemble, allows the model to better generalize to heterogeneous or unseen domains. We also show that experts can be added to iteratively incorporate new domains without forgetting older ones, and that experts can be removed to restrict access to unwanted domains, without additional training. Overall, these results demonstrate benefits of explicitly conditioning on textual domains during language modeling.
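
The sketch below illustrates the inference-time mixing described above: run each domain expert's feedforward block and combine the outputs with a weighted ensemble over domains. Interfaces and shapes are illustrative.

import torch

def demix_ffn(hidden, experts, domain_weights):
    # hidden: (B, T, D) hidden states; experts: list of per-domain FFN modules;
    # domain_weights: mixture weights over domains, summing to 1.
    outputs = torch.stack([expert(hidden) for expert in experts], dim=0)   # (E, B, T, D)
    w = torch.as_tensor(domain_weights, dtype=hidden.dtype, device=hidden.device)
    return torch.einsum("e,ebtd->btd", w, outputs)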

Posted Content
TL;DR: Meta-training for In-Context Learning (MetaICL) as discussed by the authors is a meta-training framework for few-shot learning where a pretrained language model is tuned to do in-context learning on a large set of training tasks.
Abstract: We introduce MetaICL (Meta-training for In-Context Learning), a new meta-training framework for few-shot learning where a pretrained language model is tuned to do in-context learning on a large set of training tasks. This meta-training enables the model to more effectively learn a new task in context at test time, by simply conditioning on a few training examples with no parameter updates or task-specific templates. We experiment on a large, diverse collection of tasks consisting of 142 NLP datasets including classification, question answering, natural language inference, paraphrase detection and more, across seven different meta-training/target splits. MetaICL outperforms a range of baselines including in-context learning without meta-training and multi-task learning followed by zero-shot transfer. We find that the gains are particularly significant for target tasks that have domain shifts from the meta-training tasks, and that using a diverse set of the meta-training tasks is key to improvements. We also show that MetaICL approaches (and sometimes beats) the performance of models fully finetuned on the target task training data, and outperforms much bigger models with nearly 8x parameters.
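
A minimal illustration of how a MetaICL-style training instance can be assembled: concatenate k input-output examples from one task and train the LM to generate the held-out example's output in context. The plain-text formatting here is an assumption, not the paper's exact template.

def build_metaicl_prompt(examples, query_input):
    # examples: list of (input_text, output_text) pairs sampled from a single task.
    context = "\n\n".join(f"{x}\n{y}" for x, y in examples)
    return f"{context}\n\n{query_input}\n"   # training target: the query's output string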

Posted Content
TL;DR: In this article, the authors develop 8-bit optimizers built on block-wise dynamic quantization, which combines block-wise quantization with dynamic quantization, a form of non-linear quantization that is precise for both large and small magnitude values, and a stable embedding layer to reduce gradient variance that comes from the highly non-uniform distribution of input tokens in language models.
Abstract: Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states. To overcome the resulting computational, quantization, and stability challenges, we develop block-wise dynamic quantization. Block-wise quantization divides input tensors into smaller blocks that are independently quantized. Each block is processed in parallel across cores, yielding faster optimization and high precision quantization. To maintain stability and performance, we combine block-wise quantization with two additional changes: (1) dynamic quantization, a form of non-linear quantization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce gradient variance that comes from the highly non-uniform distribution of input tokens in language models. As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT'14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change.
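
A toy sketch of the block-wise idea using simple absmax (linear) quantization; the paper's dynamic quantization uses a non-linear code, so treat this only as an illustration of per-block normalization and int8 storage.

import torch

def blockwise_absmax_quantize(x, block_size=2048):
    flat = x.flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, block_size)
    # Each block is normalized by its own absolute maximum, then stored as int8.
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-8)
    q = torch.round(blocks / absmax * 127).to(torch.int8)
    return q, absmax            # dequantize with q.float() / 127 * absmax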


Proceedings ArticleDOI
01 Aug 2021
TL;DR: This paper introduces DESCGEN: given mentions spread over multiple documents, the goal is to generate an entity summary description; the documents were collected using a combination of entity linking and hyperlinks into the entity pages, which together provide high-quality distant supervision.
Abstract: Short textual descriptions of entities provide summaries of their key attributes and have been shown to be useful sources of background knowledge for tasks such as entity linking and question answering. However, generating entity descriptions, especially for new and long-tail entities, can be challenging since relevant information is often scattered across multiple sources with varied content and style. We introduce DESCGEN: given mentions spread over multiple documents, the goal is to generate an entity summary description. DESCGEN consists of 37K entity descriptions from Wikipedia and Fandom, each paired with nine evidence documents on average. The documents were collected using a combination of entity linking and hyperlinks into the entity pages, which together provide high-quality distant supervision. Compared to other multi-document summarization tasks, our task is entity-centric, more abstractive, and covers a wide range of domains. We also propose a two-stage extract-then-generate baseline and show that there exists a large gap (19.9% in ROUGE-L) between state-of-art models and human performance, suggesting that the data will support significant future work.

Proceedings Article
28 Sep 2021
TL;DR: VideoCLIP as discussed by the authors trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval, without using any labels on downstream tasks.
Abstract: We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.

Proceedings Article
18 Jul 2021
TL;DR: The authors propose a balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers, and formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens.
Abstract: We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules that contain only a small fraction of the model parameters. However, it can be difficult to learn balanced routing functions that make full use of the available experts; existing approaches typically use routing heuristics or auxiliary expert-balancing loss functions. In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens. This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyperparameters or auxiliary losses. Code is publicly released at this https URL

Posted Content
TL;DR: The authors used pre-trained language models to generate contrastive explanations for commonsense reasoning tasks, which are judged by humans to be more relevant for solving the task and facilitate a novel method to evaluate explanation faithfulness.
Abstract: Many commonsense reasoning NLP tasks involve choosing between one or more possible answers to a question or prompt based on knowledge that is often implicit. Large pretrained language models (PLMs) can achieve near-human performance on such tasks, while providing little human-interpretable evidence of the underlying reasoning they use. In this work, we show how to use these same models to generate such evidence: inspired by the contrastive nature of human explanations, we use PLMs to complete explanation prompts which contrast alternatives according to the key attribute(s) required to justify the correct answer (for example, peanuts are usually salty while raisins are sweet). Conditioning model decisions on these explanations improves performance on two commonsense reasoning benchmarks, as compared to previous non-contrastive alternatives. These explanations are also judged by humans to be more relevant for solving the task, and facilitate a novel method to evaluate explanation faithfulness.

Posted Content
TL;DR: In this article, the authors introduce a new reasoning task that targets both visual and non-visual language about 3D objects in the world, and find that adding view estimation to language grounding models improves accuracy both on SNARE and when identifying objects referred to in language on a robot platform.
Abstract: Seemingly simple natural language requests to a robot are generally underspecified, for example "Can you bring me the wireless mouse?" When viewing mice on the shelf, the number of buttons or presence of a wire may not be visible from certain angles or positions. Flat images of candidate mice may not provide the discriminative information needed for "wireless". The world, and objects in it, are not flat images but complex 3D shapes. If a human requests an object based on any of its basic properties, such as color, shape, or texture, robots should perform the necessary exploration to accomplish the task. In particular, while substantial effort and progress has been made on understanding explicitly visual attributes like color and category, comparatively little progress has been made on understanding language about shapes and contours. In this work, we introduce a novel reasoning task that targets both visual and non-visual language about 3D objects. Our new benchmark, ShapeNet Annotated with Referring Expressions (SNARE), requires a model to choose which of two objects is being referenced by a natural language description. We introduce several CLIP-based models for distinguishing objects and demonstrate that while recent advances in jointly modeling vision and language are useful for robotic language understanding, it is still the case that these models are weaker at understanding the 3D nature of objects -- properties which play a key role in manipulation. In particular, we find that adding view estimation to language grounding models improves accuracy on both SNARE and when identifying objects referred to in language on a robot platform.

Posted Content
TL;DR: The SILG benchmark as discussed by the authors unifies a collection of grounded language learning environments under a common interface, including grid-world environments that require generalization to new dynamics, entities, and partially observed worlds (RTFM, Messenger, NetHack), as well as symbolic counterparts of visual worlds that require interpreting rich natural language with respect to complex scenes (ALFWorld, Touchdown).
Abstract: Existing work in language grounding typically study single environments. How do we build unified models that apply across multiple environments? We propose the multi-environment Symbolic Interactive Language Grounding benchmark (SILG), which unifies a collection of diverse grounded language learning environments under a common interface. SILG consists of grid-world environments that require generalization to new dynamics, entities, and partially observed worlds (RTFM, Messenger, NetHack), as well as symbolic counterparts of visual worlds that require interpreting rich natural language with respect to complex scenes (ALFWorld, Touchdown). Together, these environments provide diverse grounding challenges in richness of observation space, action space, language specification, and plan complexity. In addition, we propose the first shared model architecture for RL on these environments, and evaluate recent advances such as egocentric local convolution, recurrent state-tracking, entity-centric attention, and pretrained LM using SILG. Our shared architecture achieves comparable performance to environment-specific architectures. Moreover, we find that many recent modelling advances do not result in significant gains on environments other than the one they were designed for. This highlights the need for a multi-environment benchmark. Finally, the best models significantly underperform humans on SILG, which suggests ample room for future work. We hope SILG enables the community to quickly identify new methodologies for language grounding that generalize to a diverse set of environments and their associated challenges.