Anna Chen
Publications - 9
Citations - 596
Anna Chen is an academic researcher. The author has contributed to research in topics: Computer science & Counterintuitive. The author has co-authored 9 publications.
Papers
Journal ArticleDOI
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai,Andy Jones,Kamal K. Ndousse,Amanda Askell,Anna Chen,Nova DasSarma,Dawn Drain,Stanislav Fort,Deep Ganguli,Tom Henighan,Nicholas Joseph,Saurav Kadavath,John Kernion,Tom Conerly,Sheer El-Showk,Nelson Elhage,Zac Hatfield-Dodds,Danny Hernandez,Tristan Hume,Scott Johnston,S. M. Kravec,Liane Lovitt,Neel Nanda,Catherine Anne White Olsson,Dario Amodei,Tom B. Brown,Jack Clark,Samuel McCandlish,Chris Olah,Benjamin Mann,Jared Kaplan +30 more
TL;DR: The paper proposes an iterated online mode of training, in which preference models and RL policies are updated on a weekly cadence with fresh human feedback data, and identifies a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization.
Journal ArticleDOI
Language Models (Mostly) Know What They Know
Saurav Kadavath,Tom Conerly,Amanda Askell,Tom Henighan,Dawn Drain,Ethan Perez,Nicholas Schiefer,Zachary Dodds,Nova DasSarma,Eli Tran-Johnson,Scott Johnston,Sheer El-Showk,Andy Jones,Nelson Elhage,Tristan Hume,Anna Chen,Yuntao Bai,Sam W. Bowman,Stanislav Fort,Deep Ganguli,Danny Hernandez,Josh Jacobson,John Kernion,S. M. Kravec,Liane Lovitt,Kamal K. Ndousse,Catherine Anne White Olsson,Sam Ringer,Dario Amodei,Tom B. Brown,Jack Clark,Nicholas Joseph,Benjamin Mann,Samuel McCandlish,Chris Olah,Jared Kaplan +35 more
TL;DR: This article showed that large models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format, and that models can be trained to predict the probability that "I know" the answer to a question, without reference to any particular proposed answer.
Journal ArticleDOI
In-context Learning and Induction Heads
Catherine Anne White Olsson,Nelson Elhage,Neel Nanda,Nicholas Joseph,Nova DasSarma,Tom Henighan,Benjamin Mann,Amanda Askell,Yuntao Bai,Anna Chen,Tom Conerly,Dawn Drain,Deep Ganguli,Zac Hatfield-Dodds,Danny Hernandez,Scott Johnston,Andy Jones,John Kernion,Liane Lovitt,Kamal K. Ndousse,Dario Amodei,Tom B. Brown,Jack Clark,Jared Kaplan,Samuel McCandlish,Chris Olah +25 more
TL;DR: It is found that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss.
Proceedings ArticleDOI
Predictability and Surprise in Large Generative Models
Deep Ganguli,Danny Hernandez,Liane Lovitt,Nova DasSarma,Tom Henighan,Andy Jones,Nicholas Joseph,John Kernion,Benjamin Mann,Amanda Askell,Yuntao Bai,Anna Chen,Tom Conerly,Dawn Drain,Nelson Elhage,Sheer El Showk,Stanislav Fort,Zac Hatfield-Dodds,Scott Johnston,S. M. Kravec,Neel Nanda,Kamal K. Ndousse,Catherine Anne White Olsson,Daniela Amodei,Dario Amodei,Tom B. Brown,Jared Kaplan,Samuel McCandlish,Chris Olah,Jack Clark +29 more
TL;DR: This paper highlights a counterintuitive property of large-scale generative models: a paradoxical combination of predictable loss on a broad training distribution and unpredictable specific capabilities, inputs, and outputs. It analyzes how these conflicting properties combine to give model developers various motivations for deploying these models, as well as challenges that can hinder deployment.
Journal ArticleDOI
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli,Liane Lovitt,John Kernion,Amanda Askell,Yuntao Bai,Saurav Kadavath,Benjamin Mann,Ethan Perez,Nicholas Schiefer,Kamal K. Ndousse,Andy Jones,Sam W. Bowman,Anna Chen,Tom Conerly,Nova DasSarma,Dawn Drain,Nelson Elhage,Sheer El-Showk,Stanislav Fort,Zachary Dodds,Tom Henighan,Danny Hernandez,Tristan Hume,Josh Jacobson,Scott Johnston,S. M. Kravec,Catherine Anne White Olsson,Sam Ringer,Eli Tran-Johnson,Dario Amodei,Tom B. Brown,Nicholas Joseph,Samuel McCandlish,Chris Olah,Jared Kaplan,Jack Clark +35 more
TL;DR: It is found that the RLHF models are increasingly difficult to red team as they scale, while a trend with scale is found for the other model types. The authors also argue that transparency about red-teaming methods and results accelerates the community's ability to work together to develop shared norms, practices, and technical standards.