Proceedings ArticleDOI

What Language Model to Train if You Have One Million GPU Hours?

TLDR
An ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization is performed, and the performance of a multilingual model is studied and compared to that of an English-only one.
Abstract
The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameter models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale. In the process of building BLOOM (the BigScience Large Open-science Open-access Multilingual language model), our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hour budget. Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization. In addition, we study the impact of various popular pre-training corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience.
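The abstract fixes a budget of 1,000,000 A100-GPU-hours. As a back-of-the-envelope illustration (not the paper's actual methodology), the widely used approximation C ≈ 6*N*D for the training FLOPs of a dense Transformer with N parameters on D tokens shows how such a budget trades model size against training tokens; the peak-throughput and utilization figures below are assumptions.

# Rough sketch: translate a GPU-hour budget into a parameter/token trade-off
# using the common C ~= 6*N*D approximation. Throughput and utilization are
# illustrative assumptions, not figures from the paper.
GPU_HOURS = 1_000_000        # budget stated in the abstract
PEAK_FLOPS = 312e12          # A100 peak bf16 tensor-core throughput, FLOP/s
UTILIZATION = 0.35           # assumed fraction of peak achieved in practice

total_flops = GPU_HOURS * 3600 * PEAK_FLOPS * UTILIZATION

def tokens_for(params):
    """Tokens trainable within the budget under C ~= 6*N*D."""
    return total_flops / (6 * params)

for n in (1e9, 13e9, 176e9):  # candidate model sizes (parameters)
    print(f"{n/1e9:>4.0f}B params -> ~{tokens_for(n)/1e9:,.0f}B tokens")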



Citations
Journal ArticleDOI

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Teven Le Scao, +386 more
09 Nov 2022
TL;DR: BLOOM as discussed by the authors is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total).
Journal ArticleDOI

A Survey of Large Language Models

TL;DR: A survey of large language models (LLMs), which are built by pre-training Transformer models over large-scale corpora and show strong capabilities in solving various NLP tasks.
Proceedings ArticleDOI

GLM-130B: An Open Bilingual Pre-trained Model

TL;DR: Introduces GLM-130B, an attempt to open-source a 100B-scale model at least as good as GPT-3 and to unveil how models of such a scale can be successfully pre-trained, including its design choices, training strategies for both efficiency and stability, and engineering efforts.
Journal ArticleDOI

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

TL;DR: Pythia is a suite of 16 language models trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters, with 154 checkpoints for each of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study.
Proceedings ArticleDOI

What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

TL;DR: A large-scale evaluation of modeling choices and their impact on zero-shot generalization of large pretrained Transformer language models focuses on text-to-text models and shows that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely self-supervised pretraining.
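As a concrete illustration of the winning setup above, the following minimal sketch (Python with NumPy, illustrative only and not code from the study) shows the two ingredients of causal, autoregressive language modeling: a lower-triangular attention mask and a shifted next-token cross-entropy loss.

import numpy as np

# Minimal sketch of the causal language modeling objective: positions may only
# attend to earlier positions, and each position is trained to predict the next
# token. Shapes here are illustrative.

def causal_mask(seq_len):
    # True where attention is permitted (key position <= query position)
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def next_token_loss(logits, token_ids):
    # logits: (seq_len, vocab_size); position t predicts token t+1
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    targets = token_ids[1:]                      # shift targets by one position
    return -log_probs[np.arange(len(targets)), targets].mean()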
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
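For reference, here is a minimal sketch of a single Adam update as described above: exponential moving averages of the gradient and its square with bias correction, using the paper's default hyperparameters. Variable names are illustrative.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Update biased first- and second-moment estimates of the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-correct the estimates (t is the 1-based step count)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Scaled parameter update
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v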
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
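A minimal sketch of the scaled dot-product attention at the core of this architecture, softmax(Q K^T / sqrt(d_k)) V, shown single-head without masking or learned projections, for illustration only.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values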
Posted Content

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Proceedings Article

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

TL;DR: Introduces a Sentiment Treebank that includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality, and proposes the Recursive Neural Tensor Network to address them.
Posted Content

SQuAD: 100,000+ Questions for Machine Comprehension of Text

TL;DR: The Stanford Question Answering Dataset (SQuAD) as mentioned in this paper is a reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.