Proceedings ArticleDOI

What Language Model to Train if You Have One Million GPU Hours?

TLDR
An ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization is performed, and the performance of a multilingual model is studied and compared to that of an English-only one.
Abstract
The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameter models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale. In the process of building BLOOM (the BigScience Large Open-science Open-access Multilingual language model), our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hour budget. Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization. In addition, we study the impact of various popular pre-training corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience.
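The abstract fixes a budget of 1,000,000 A100-GPU-hours. As a back-of-the-envelope illustration (not the paper's actual methodology), the widely used approximation C ≈ 6*N*D for the training FLOPs of a dense Transformer with N parameters on D tokens shows how such a budget trades model size against training tokens; the peak-throughput and utilization figures below are assumptions.

# Rough sketch: translate a GPU-hour budget into a parameter/token trade-off
# using the common C ~= 6*N*D approximation. Throughput and utilization are
# illustrative assumptions, not figures from the paper.
GPU_HOURS = 1_000_000        # budget stated in the abstract
PEAK_FLOPS = 312e12          # A100 peak bf16 tensor-core throughput, FLOP/s
UTILIZATION = 0.35           # assumed fraction of peak achieved in practice

total_flops = GPU_HOURS * 3600 * PEAK_FLOPS * UTILIZATION

def tokens_for(params):
    """Tokens trainable within the budget under C ~= 6*N*D."""
    return total_flops / (6 * params)

for n in (1e9, 13e9, 176e9):  # candidate model sizes (parameters)
    print(f"{n/1e9:>4.0f}B params -> ~{tokens_for(n)/1e9:,.0f}B tokens")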



Citations
Journal ArticleDOI

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Teven Le Scao, +386 more
09 Nov 2022
TL;DR: BLOOM as discussed by the authors is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total).
Journal ArticleDOI

A Survey of Large Language Models

TL;DR: A survey of large language models (LLMs), which are built by pre-training Transformer models over large-scale corpora and show strong capabilities in solving various NLP tasks.
Proceedings ArticleDOI

GLM-130B: An Open Bilingual Pre-trained Model

TL;DR: Introduces GLM-130B, an attempt to open-source a 100B-scale model at least as good as GPT-3 and to unveil how models of such a scale can be successfully pre-trained, including its design choices, training strategies for both efficiency and stability, and engineering efforts.
Journal ArticleDOI

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

TL;DR: Pythia is a suite of 16 language models trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters, with 154 checkpoints for each of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study.
Proceedings ArticleDOI

What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

TL;DR: A large-scale evaluation of modeling choices and their impact on zero-shot generalization of large pretrained Transformer language models focuses on text-to-text models and shows that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely self-supervised pretraining.
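As a concrete illustration of the winning setup above, the following minimal sketch (Python with NumPy, illustrative only and not code from the study) shows the two ingredients of causal, autoregressive language modeling: a lower-triangular attention mask and a shifted next-token cross-entropy loss.

import numpy as np

# Minimal sketch of the causal language modeling objective: positions may only
# attend to earlier positions, and each position is trained to predict the next
# token. Shapes here are illustrative.

def causal_mask(seq_len):
    # True where attention is permitted (key position <= query position)
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def next_token_loss(logits, token_ids):
    # logits: (seq_len, vocab_size); position t predicts token t+1
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    targets = token_ids[1:]                      # shift targets by one position
    return -log_probs[np.arange(len(targets)), targets].mean()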
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
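For reference, here is a minimal sketch of a single Adam update as described above: exponential moving averages of the gradient and its square with bias correction, using the paper's default hyperparameters. Variable names are illustrative.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Update biased first- and second-moment estimates of the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-correct the estimates (t is the 1-based step count)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Scaled parameter update
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v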
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
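A minimal sketch of the scaled dot-product attention at the core of this architecture, softmax(Q K^T / sqrt(d_k)) V, shown single-head without masking or learned projections, for illustration only.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values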
Posted Content

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Proceedings Article

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

TL;DR: Introduces a Sentiment Treebank that includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality, and proposes the Recursive Neural Tensor Network to address them.
Posted Content

SQuAD: 100,000+ Questions for Machine Comprehension of Text

TL;DR: The Stanford Question Answering Dataset (SQuAD) as mentioned in this paper is a reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.