Anna Chen
Publications - 9
Citations - 596
Anna Chen is an academic researcher. The author has contributed to research in topics: Computer science & Counterintuitive. The author has co-authored 9 publications.
Papers
Journal ArticleDOI
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai,Andy Jones,Kamal K. Ndousse,Amanda Askell,Anna Chen,Nova DasSarma,Dawn Drain,Stanislav Fort,Deep Ganguli,Tom Henighan,Nicholas Joseph,Saurav Kadavath,John Kernion,Tom Conerly,Sheer El-Showk,Nelson Elhage,Zac Hatfield-Dodds,Danny Hernandez,Tristan Hume,Scott Johnston,S. M. Kravec,Liane Lovitt,Neel Nanda,Catherine Anne White Olsson,Dario Amodei,Tom B. Brown,Jack Clark,Samuel McCandlish,Chris Olah,Benjamin Mann,Jared Kaplan +30 more
TL;DR: The paper proposes an iterated online mode of training, in which preference models and RL policies are updated on a weekly cadence with fresh human feedback data, and identifies a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization.
Journal ArticleDOI
Language Models (Mostly) Know What They Know
Saurav Kadavath,Tom Conerly,Amanda Askell,Tom Henighan,Dawn Drain,Ethan Perez,Nicholas Schiefer,Zachary Dodds,Nova DasSarma,Eli Tran-Johnson,Scott Johnston,Sheer El-Showk,Andy Jones,Nelson Elhage,Tristan Hume,Anna Chen,Yuntao Bai,Sam W. Bowman,Stanislav Fort,Deep Ganguli,Danny Hernandez,Josh Jacobson,John Kernion,S. M. Kravec,Liane Lovitt,Kamal K. Ndousse,Catherine Anne White Olsson,Sam Ringer,Dario Amodei,Tom B. Brown,Jack Clark,Nicholas Joseph,Benjamin Mann,Samuel McCandlish,Chris Olah,Jared Kaplan +35 more
TL;DR: This article showed that large models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format, and that models can be trained to predict the probability that "I know" the answer to a question, without reference to any particular proposed answer.
Journal ArticleDOI
In-context Learning and Induction Heads
Catherine Anne White Olsson,Nelson Elhage,Neel Nanda,Nicholas Joseph,Nova DasSarma,Tom Henighan,Benjamin Mann,Amanda Askell,Yuntao Bai,Anna Chen,Tom Conerly,Dawn Drain,Deep Ganguli,Zac Hatfield-Dodds,Danny Hernandez,Scott Johnston,Andy Jones,John Kernion,Liane Lovitt,Kamal K. Ndousse,Dario Amodei,Tom B. Brown,Jack Clark,Jared Kaplan,Samuel McCandlish,Chris Olah +25 more
TL;DR: It is found that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss.
Proceedings ArticleDOI
Predictability and Surprise in Large Generative Models
Deep Ganguli,Danny Hernandez,Liane Lovitt,Nova DasSarma,Tom Henighan,Andy Jones,Nicholas Joseph,John Kernion,Benjamin Mann,Amanda Askell,Yuntao Bai,Anna Chen,Tom Conerly,Dawn Drain,Nelson Elhage,Sheer El Showk,Stanislav Fort,Zac Hatfield-Dodds,Scott Johnston,S. M. Kravec,Neel Nanda,Kamal K. Ndousse,Catherine Anne White Olsson,Daniela Amodei,Dario Amodei,Tom B. Brown,Jared Kaplan,Samuel McCandlish,Chris Olah,Jack Clark +29 more
TL;DR: This paper highlights a counterintuitive property of large-scale generative models: a paradoxical combination of predictable loss on a broad training distribution and unpredictable specific capabilities, inputs, and outputs. It analyzes how these conflicting properties combine to give model developers various motivations for deploying these models, as well as challenges that can hinder deployment.
Journal ArticleDOI
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli,Liane Lovitt,John Kernion,Amanda Askell,Yuntao Bai,Saurav Kadavath,Benjamin Mann,Ethan Perez,Nicholas Schiefer,Kamal K. Ndousse,Andy Jones,Sam W. Bowman,Anna Chen,Tom Conerly,Nova DasSarma,Dawn Drain,Nelson Elhage,Sheer El-Showk,Stanislav Fort,Zachary Dodds,Tom Henighan,Danny Hernandez,Tristan Hume,Josh Jacobson,Scott Johnston,S. M. Kravec,Catherine Anne White Olsson,Sam Ringer,Eli Tran-Johnson,Dario Amodei,Tom B. Brown,Nicholas Joseph,Samuel McCandlish,Chris Olah,Jared Kaplan,Jack Clark +35 more
TL;DR: It is found that the RLHF models are increasingly difficult to red team as they scale, while a trend with scale is found for the other model types. The authors also argue that transparency about red-teaming methods and results accelerates the community's ability to work together to develop shared norms, practices, and technical standards.