Language models generalize beyond natural proteins
Robert Verkuil, Ori Kabeli, Yilun Du, Basile I. M. Wicky, Lukas F. Milles, Justas Dauparas, David Baker, Sergey Ovchinnikov, Tom Sercu, Alexander Rives +9 more
TLDR
This paper shows that language models trained only on sequences learn a deep grammar of protein structure, enabling the design of de novo proteins that extend beyond natural protein families.
Abstract
Learning the design patterns of proteins from sequences across evolution may hold promise for generative protein design. However, it is unknown whether language models, trained on sequences of natural proteins, are capable of more than memorization of existing protein families. Here we show that language models generalize beyond natural proteins to generate de novo proteins. We focus on two protein design tasks: fixed backbone design, where the structure is specified, and unconstrained generation, where the structure is sampled from the model. Remarkably, although the models are trained only on sequences, we find that they are capable of designing structure. A total of 228 generated proteins were evaluated experimentally, with a high overall success rate (152/228, or 67%) in producing a soluble and monomeric species by size-exclusion chromatography. Of the 152 experimentally successful designs, 35 have no significant sequence match to known natural proteins. For the remaining 117, the median sequence identity to the nearest sequence match is 27%, below 20% for 6 designs, and as low as 18% for 3 designs. For fixed backbone design, the language model generates successful designs for each of eight experimentally evaluated, artificially created fixed backbone targets. For unconstrained generation, sampled proteins cover diverse topologies and secondary structure compositions and have a high experimental success rate (71/129, or 55%). The designs reflect deep patterns linking sequence and structure, including motifs that occur in related natural structures and motifs that are not observed in similar structural contexts in known protein families. The results show that language models, though trained only on sequences, learn a deep grammar that enables the design of protein structure, extending beyond natural proteins.
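The design loop the abstract describes — a language model trained only on sequences, iteratively proposing residues — can be illustrated with a toy sketch. Everything here (the `toy_lm_logits` scoring function and the Gibbs-style sweep schedule) is a hypothetical stand-in for illustration, not the paper's actual model or sampling procedure:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_lm_logits(sequence, pos):
    # Hypothetical stand-in for a protein language model's
    # per-position conditional distribution. A real model would
    # score each amino acid given the rest of the sequence; this
    # toy version just mildly favors alanine ('A').
    return {aa: (2.0 if aa == "A" else 1.0) for aa in AMINO_ACIDS}

def gibbs_sample_sequence(length, n_sweeps=5, rng=None):
    """Resample one position at a time from the model's conditional
    distribution -- the general shape of masked-LM sequence
    generation, not the paper's exact recipe."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    for _ in range(n_sweeps):
        for pos in range(length):
            weights = toy_lm_logits(seq, pos)
            aas, w = zip(*weights.items())
            seq[pos] = rng.choices(aas, weights=w, k=1)[0]
    return "".join(seq)

seq = gibbs_sample_sequence(40)
print(seq)
```

In the real setting, the per-position distribution would come from the trained language model, and for fixed backbone design the sampling would additionally be constrained toward sequences whose predicted structure matches the target backbone.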
Citations
Journal Article
ClimaX: A foundation model for weather and climate
TL;DR: ClimaX extends the Transformer architecture with novel encoding and aggregation blocks that allow effective use of available compute while maintaining general utility for modeling the non-linear dynamics and complex interactions among multiple variables.
Posted Content
Structure-informed Language Models Are Protein Designers
TL;DR: LM-Design performs a structural surgery on protein language models (pLMs): a lightweight structural adapter is implanted into the language model to endow it with structural awareness, and iterative refinement effectively optimizes the generated protein sequences.
Posted Content
Unlocking de novo antibody design with generative artificial intelligence
Amir Shanehsazzadeh, Sharrol Bachas, George Kasun, J. Mark Sutton, Andrea K. Steiger, Richard W. Shuai, Christa Kohnert, Alex Morehead, Amber Brown, Chelsea Chung, Breanna K. Luton, Nicolas Diaz, Matthew Mcpartlon, Bailey Knight, Macey Radach, Katherine B. Bateman, David A. Spencer, Jovan Cejovic, Gaelin Kopec‐Belliveau, Robel Haile, Edriss Yassine, Cailen M McCloskey, Monica Natividad, Dalton Chapman, Luka Stojanovic, Goran Rakocevic, Gregory Hannum, Engin Yapici, Katy M. Moran, Rodante Caguiat, Shaheed A. Abdulhaqq, Zheyuan Guo, Lillian R. Klug, Miles Gander, Joshua Meier +34 more
TL;DR: This paper uses generative deep learning models to design antibodies de novo against three distinct targets in a zero-shot fashion, where all designs are the result of a single round of model generation with no follow-up optimization.
Journal Article
A Text-guided Protein Design Framework
Shengchao Liu, Yutao Zhu, Jiarui Lu, Zhao Xu, Weili Nie, Anthony Gitter, Chaowei Xiao, Jia Tang, Hong Hui Guo, Animashree Anandkumar +9 more
TL;DR: ProteinDT is a multi-modal framework that leverages textual descriptions for protein design. It consists of three sequential components: ProteinCLAP, which aligns the representations of the two modalities; a facilitator, which generates the protein representation from the text modality; and a decoder, which generates protein sequences from that representation.
Journal Article
Accelerating the integration of ChatGPT and other large‐scale AI models into biomedical research and healthcare
TL;DR: The authors provide a general overview of advanced large-scale AI models, including language models, vision-language models, graph learning models, language-conditioned multi-agent models, and multimodal embodied models.
References
Posted Content
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Journal Article
BLAST+: architecture and applications.
Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason S. Papadopoulos, Kevin Bealer, Thomas L. Madden +6 more
TL;DR: The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences.
Journal Article
Highly accurate protein structure prediction with AlphaFold
John M. Jumper, Richard O. Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russell Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, R. D. Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David L. Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, Demis Hassabis +33 more
TL;DR: AlphaFold predicts protein structures with an accuracy competitive with experimental structures in the majority of cases, using a novel deep-learning architecture, even in cases where no homologous structure is available.
Proceedings Article
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Thomas Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Samuel McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei +30 more
TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Book
Accelerated Profile HMM Searches
TL;DR: An acceleration heuristic for profile HMMs, the "multiple segment Viterbi" (MSV) algorithm, computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment.
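The idea of scoring ungapped local alignment segments can be sketched in a few lines. This toy version finds only the single best-scoring ungapped segment along any diagonal (whereas MSV sums multiple segments, and HMMER implements it with striped SIMD vectors); the `match`/`mismatch` scores are illustrative assumptions, not HMMER's profile scores:

```python
def best_ungapped_segment(query, target, match=2, mismatch=-1):
    """Toy score of the best ungapped local alignment segment
    between two sequences, scanning each diagonal with a
    Kadane-style running maximum."""
    best = 0
    # Each diagonal is a fixed offset between query and target positions.
    for offset in range(-(len(query) - 1), len(target)):
        score = 0
        for qi in range(len(query)):
            ti = qi + offset
            if 0 <= ti < len(target):
                score += match if query[qi] == target[ti] else mismatch
                score = max(score, 0)  # local alignment: reset on negative
                best = max(best, score)
    return best

print(best_ungapped_segment("ACGT", "ACGT"))  # 4 matches * 2 = 8
```

Extending this to sum several disjoint segments per diagonal pair, and vectorizing the inner loop, recovers the general flavor of the MSV filter.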