Open Access · Posted Content

Language models generalize beyond natural proteins

TL;DR: This paper shows that language models learn a deep grammar that enables the design of protein structure, extending beyond natural proteins, and that this grammar can be used to generate de novo proteins.
Abstract
Learning the design patterns of proteins from sequences across evolution may hold promise for generative protein design. However, it is unknown whether language models, trained on sequences of natural proteins, will be capable of more than memorization of existing protein families. Here we show that language models generalize beyond natural proteins to generate de novo proteins. We focus on two protein design tasks: fixed backbone design, where the structure is specified, and unconstrained generation, where the structure is sampled from the model. Remarkably, although the models are trained only on sequences, we find that they are capable of designing structure. A total of 228 generated proteins are evaluated experimentally with a high overall success rate (152/228, or 67%) in producing a soluble and monomeric species by size exclusion chromatography. Out of 152 experimentally successful designs, 35 have no significant sequence match to known natural proteins. Of the remaining 117, sequence identity to the nearest sequence match is 27% at the median, below 20% for 6 designs, and as low as 18% for 3 designs. For fixed backbone design, the language model generates successful designs for each of eight experimentally evaluated, artificially created fixed backbone targets. For unconstrained generation, sampled proteins cover diverse topologies and secondary structure compositions, and have a high experimental success rate (71/129, or 55%). The designs reflect deep patterns linking sequence and structure, including motifs that occur in related natural structures, and motifs that are not observed in similar structural contexts in known protein families. The results show that language models, though only trained on sequences, learn a deep grammar that enables the design of protein structure, extending beyond natural proteins.
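The core generative mechanism described here is sampling sequences from a masked protein language model. As a rough, hypothetical illustration of one such sampling loop, the sketch below resamples one masked position at a time; the model and tokenizer interfaces are assumed placeholders rather than the paper's released code, which additionally scores candidates for structural compatibility.

    # Hypothetical sketch of iterative masked-LM sampling for sequence
    # design; `model` and `tokenizer` are assumed interfaces, not the
    # actual ESM code, and no structure-compatibility scoring is shown.
    import torch

    def sample_sequence(model, tokenizer, length, n_iters=500, temperature=1.0):
        """Iteratively resample positions of an initially all-masked sequence."""
        mask_id = tokenizer.mask_token_id
        tokens = torch.full((1, length), mask_id, dtype=torch.long)
        for _ in range(n_iters):
            pos = torch.randint(0, length, (1,)).item()  # position to resample
            tokens[0, pos] = mask_id
            with torch.no_grad():
                logits = model(tokens)  # assumed shape: (1, length, vocab)
            probs = torch.softmax(logits[0, pos] / temperature, dim=-1)
            tokens[0, pos] = torch.multinomial(probs, 1).item()
        return tokenizer.decode(tokens[0])  # assumed decode helper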


Citations
Journal Article

ClimaX: A foundation model for weather and climate

TL;DR: ClimaX extends the Transformer architecture with novel encoding and aggregation blocks that allow effective use of available compute while maintaining general utility for modeling the non-linear dynamics and complex interactions among multiple weather and climate variables.
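For intuition about what such an aggregation block might look like, here is a minimal sketch of attention over the variable axis, assuming per-variable patch embeddings of shape (batch, n_vars, n_patches, dim); the module is illustrative, not the released ClimaX code.

    # Illustrative cross-variable aggregation: a learned query attends
    # over all variables at each spatial patch, yielding one token per
    # patch. Shapes and module placement are assumptions.
    import torch
    import torch.nn as nn

    class VariableAggregation(nn.Module):
        def __init__(self, dim, n_heads=8):
            super().__init__()
            self.query = nn.Parameter(torch.randn(1, 1, dim))
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

        def forward(self, x):                      # x: (batch, vars, patches, dim)
            b, v, p, d = x.shape
            x = x.permute(0, 2, 1, 3).reshape(b * p, v, d)
            q = self.query.expand(b * p, 1, d)     # same query at every patch
            out, _ = self.attn(q, x, x)            # attend across variables
            return out.reshape(b, p, d)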
Posted Content

Structure-informed Language Models Are Protein Designers

TL;DR: LM-Design performs a structural surgery on pretrained protein language models (pLMs): a lightweight structural adapter is implanted into the language model to endow it with structural awareness, and iterative refinement then effectively optimizes the generated protein sequences.
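A minimal sketch of what such a lightweight adapter could look like, assuming a frozen pLM that exposes per-residue hidden states plus precomputed structure features; the exact LM-Design architecture may differ.

    # Bottleneck adapter that injects structure features into pLM hidden
    # states via a residual update; dimensions and fusion are assumptions.
    import torch
    import torch.nn as nn

    class StructuralAdapter(nn.Module):
        def __init__(self, dim, struct_dim, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(dim + struct_dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)
            self.act = nn.GELU()

        def forward(self, hidden, struct_feats):
            fused = torch.cat([hidden, struct_feats], dim=-1)
            return hidden + self.up(self.act(self.down(fused)))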
Journal Article

A Text-guided Protein Design Framework

TL;DR: ProteinDT is a multi-modal framework that leverages textual descriptions for protein design. It consists of three sequential steps: ProteinCLAP, which aligns the representations of the two modalities; a facilitator, which generates the protein representation from the text modality; and a decoder, which generates protein sequences from that representation.
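The ProteinCLAP step is contrastive in spirit; a common way to implement such cross-modal alignment is a symmetric CLIP-style InfoNCE loss, sketched below under the assumption that both encoders emit fixed-size embeddings (the function name and details are illustrative, not ProteinDT's code).

    # Symmetric contrastive loss over matched (text, protein) pairs in a
    # batch; a stand-in for the alignment objective, not ProteinDT's code.
    import torch
    import torch.nn.functional as F

    def clap_loss(text_emb, prot_emb, temperature=0.07):
        t = F.normalize(text_emb, dim=-1)
        p = F.normalize(prot_emb, dim=-1)
        logits = t @ p.T / temperature        # pairwise cosine similarities
        labels = torch.arange(len(t), device=t.device)  # diagonal = matches
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.T, labels)) / 2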
Journal Article

Accelerating the integration of ChatGPT and other large‐scale AI models into biomedical research and healthcare

TL;DR: The authors provide a general overview of advanced large-scale AI models, including language models, vision-language models, graph learning models, language-conditioned multi-agent models, and multimodal embodied models.
References
Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
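The masked objective at the heart of BERT fits in a few lines: hide a fraction of tokens and train the model to reconstruct them using context from both directions. The model interface below is an assumption, not Hugging Face's exact API.

    # Toy masked-LM training step; `model` is assumed to map token ids to
    # per-position vocabulary logits.
    import torch
    import torch.nn.functional as F

    def masked_lm_step(model, tokens, mask_id, mask_prob=0.15):
        mask = torch.rand(tokens.shape) < mask_prob
        inputs = tokens.clone()
        inputs[mask] = mask_id                # hide the selected tokens
        logits = model(inputs)                # (batch, seq, vocab), assumed
        return F.cross_entropy(logits[mask], tokens[mask])  # masked positions only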
Journal Article

BLAST+: architecture and applications.

TL;DR: The new BLAST+ command-line applications demonstrate substantial speed improvements over the current BLAST tools for long queries as well as for chromosome-length database sequences.
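A typical way to drive the BLAST+ command-line applications from a script is shown below; the query file and database name are placeholders, and the database is assumed to have been built beforehand with makeblastdb.

    # Run blastp against a preformatted protein database and print the
    # tabular hits; file and database names are hypothetical.
    import subprocess

    result = subprocess.run(
        ["blastp",
         "-query", "designs.fasta",   # sequences to search
         "-db", "mydb",               # database from makeblastdb
         "-outfmt", "6",              # tabular output format
         "-evalue", "1e-3"],          # significance cutoff
        capture_output=True, text=True, check=True)
    print(result.stdout)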
Book

Accelerated Profile HMM Searches

TL;DR: An acceleration heuristic for profile HMMs, the "multiple segment Viterbi" (MSV) algorithm, computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment.
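For intuition: on a single diagonal, the best ungapped local segment reduces to a maximum-subarray scan over per-position match scores, as the toy function below shows. The real MSV filter generalizes this to an optimal sum of multiple segments and vectorizes it with striped SIMD instructions.

    # Kadane-style scan for the best single ungapped segment; a didactic
    # reduction of the idea, not the MSV algorithm itself.
    def best_ungapped_segment(scores):
        best = current = 0
        for s in scores:
            current = max(0, current + s)   # restart if the segment dips below 0
            best = max(best, current)
        return best

    print(best_ungapped_segment([2, -1, 3, 4, -6, 2]))  # -> 8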