Proceedings ArticleDOI

Designing Effective Sparse Expert Models

TLDR
This paper proposes a stable and transferable Mixture-of-Experts model (ST-MoE-32B) with 269B parameters and a computational cost comparable to a 32B dense encoder-decoder Transformer.
Abstract
Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).
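The abstract describes sparse expert scaling only at a high level. For readers unfamiliar with how a Mixture-of-Experts layer routes tokens, the following is a minimal NumPy sketch of top-2 expert routing with an auxiliary router z-loss of the kind used to stabilize training of sparse models. All names, shapes, and weight scales are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, router_w, expert_ws, k=2):
    """Route each token to its top-k experts and mix their outputs.

    tokens:    [num_tokens, d_model]
    router_w:  [d_model, num_experts] router projection (hypothetical)
    expert_ws: list of [d_model, d_model] toy expert weights (hypothetical)
    """
    logits = tokens @ router_w                     # [tokens, experts]
    probs = softmax(logits, axis=-1)
    topk = np.argsort(probs, axis=-1)[:, -k:]      # indices of top-k experts

    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        for e in topk[t]:
            # Each selected expert processes the token; outputs are
            # combined, weighted by the renormalized router probability.
            gate = probs[t, e] / probs[t, topk[t]].sum()
            out[t] += gate * (tokens[t] @ expert_ws[e])

    # Auxiliary router z-loss: penalizes large router logits,
    # a common stabilization term for sparse expert training.
    z_loss = np.mean(np.log(np.exp(logits).sum(axis=-1)) ** 2)
    return out, z_loss

# Toy usage with made-up dimensions.
rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 16, 4, 8
x = rng.normal(size=(n_tokens, d_model))
router = rng.normal(size=(d_model, n_experts)) * 0.02
experts = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(n_experts)]
y, z = moe_layer(x, router, experts)
print(y.shape, z)
```

Because only k experts run per token, the compute per token stays close to that of a dense layer of the same width while the total parameter count grows with the number of experts, which is how a 269B-parameter sparse model can match the cost of a 32B dense one.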



Citations
Journal ArticleDOI

Can language models automate data wrangling?

TL;DR: The authors apply different variants of the Generative Pre-trained Transformer (GPT) language model to five batteries covering a wide range of data wrangling problems, compare the effect of prompts and few-shot regimes on the results, and examine how the models compare with specialised data-wrangling systems and other tools.
Journal ArticleDOI

Foundation Models for Text Generation

TL;DR: The authors discuss foundation models for text generation, such as autoregressive language models, which generate text, usually starting from an initial text input, and machine translation models, which take text in one language and translate it into another.
Journal ArticleDOI

Knowledge Acquired by Foundation Models

TL;DR: In this article, the authors investigate the knowledge acquired by PLMs and the larger Foundation Models, and examine whether the benchmarks are reliable and reproducible, i.e. whether they actually test the targeted properties and yield the same performance values when repeated by other researchers.
Journal ArticleDOI

Improving Pre-trained Language Models

TL;DR: The authors describe a number of approaches to improving the performance of Pre-trained Language Models (PLMs), i.e. variants of BERT, autoregressive language models similar to GPT, and sequence-to-sequence models like Transformers.