Nicholas Joseph

Journal Article

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

- 12 Apr 2022 -

TL;DR: This paper investigates an iterated online mode of training, in which preference models and RL policies are updated on a weekly cadence with fresh human feedback data, and identifies a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization.


Journal Article

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, +50 more

- 15 Dec 2022 -

arXiv.org

TL;DR: In this article, the authors use RL from AI Feedback (RLAIF) to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them.


Journal Article

Language Models (Mostly) Know What They Know

Saurav Kadavath, +35 more

- 11 Jul 2022 -

arXiv.org

TL;DR: This article shows that large models are well calibrated on diverse multiple-choice and true/false questions when the questions are provided in the right format, and that models can be trained to predict the probability that "I know" the answer to a question, without reference to any particular proposed answer.


Journal Article

In-context Learning and Induction Heads

Catherine Anne White Olsson, +25 more

- 24 Sep 2022 -

arXiv.org

TL;DR: Induction heads are found to develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss.


Proceedings Article

Predictability and Surprise in Large Generative Models

Deep Ganguli, +29 more

TL;DR: This paper highlights a counterintuitive property of large-scale generative models: a paradoxical combination of predictable loss on a broad training distribution and unpredictable specific capabilities, inputs, and outputs. It analyzes how these conflicting properties combine to give model developers various motivations for deploying these models, as well as challenges that can hinder deployment.

