Open Access · Posted Content

Language models generalize beyond natural proteins

TL;DR: This paper shows that language models learn a deep grammar that enables the design of protein structure, extending beyond natural proteins, and that this grammar can be used to generate de novo proteins.
Abstract
Learning the design patterns of proteins from sequences across evolution may hold promise for generative protein design. However, it is unknown whether language models, trained on sequences of natural proteins, will be capable of more than memorization of existing protein families. Here we show that language models generalize beyond natural proteins to generate de novo proteins. We focus on two protein design tasks: fixed backbone design, where the structure is specified, and unconstrained generation, where the structure is sampled from the model. Remarkably, although the models are trained only on sequences, we find that they are capable of designing structure. A total of 228 generated proteins are evaluated experimentally with a high overall success rate (152/228, or 67%) in producing a soluble and monomeric species by size exclusion chromatography. Out of 152 experimentally successful designs, 35 have no significant sequence match to known natural proteins. Of the remaining 117, sequence identity to the nearest sequence match is 27% at the median, below 20% for 6 designs, and as low as 18% for 3 designs. For fixed backbone design, the language model generates successful designs for each of eight experimentally evaluated, artificially created fixed backbone targets. For unconstrained generation, sampled proteins cover diverse topologies and secondary structure compositions, and have a high experimental success rate (71/129, or 55%). The designs reflect deep patterns linking sequence and structure, including motifs that occur in related natural structures, and motifs that are not observed in similar structural contexts in known protein families. The results show that language models, though only trained on sequences, learn a deep grammar that enables the design of protein structure, extending beyond natural proteins.
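The core generative mechanism described here is sampling sequences from a masked protein language model. As a rough, hypothetical illustration of one such sampling loop, the sketch below resamples one masked position at a time; the model and tokenizer interfaces are assumed placeholders rather than the paper's released code, which additionally scores candidates for structural compatibility.

    # Hypothetical sketch of iterative masked-LM sampling for sequence
    # design; `model` and `tokenizer` are assumed interfaces, not the
    # actual ESM code, and no structure-compatibility scoring is shown.
    import torch

    def sample_sequence(model, tokenizer, length, n_iters=500, temperature=1.0):
        """Iteratively resample positions of an initially all-masked sequence."""
        mask_id = tokenizer.mask_token_id
        tokens = torch.full((1, length), mask_id, dtype=torch.long)
        for _ in range(n_iters):
            pos = torch.randint(0, length, (1,)).item()  # position to resample
            tokens[0, pos] = mask_id
            with torch.no_grad():
                logits = model(tokens)  # assumed shape: (1, length, vocab)
            probs = torch.softmax(logits[0, pos] / temperature, dim=-1)
            tokens[0, pos] = torch.multinomial(probs, 1).item()
        return tokenizer.decode(tokens[0])  # assumed decode helper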


Citations
Journal Article

ClimaX: A foundation model for weather and climate

TL;DR: ClimaX extends the Transformer architecture with novel encoding and aggregation blocks that allow effective use of available compute while maintaining general utility for modeling the non-linear dynamics and complex interactions among multiple weather and climate variables.
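For intuition about what such an aggregation block might look like, here is a minimal sketch of attention over the variable axis, assuming per-variable patch embeddings of shape (batch, n_vars, n_patches, dim); the module is illustrative, not the released ClimaX code.

    # Illustrative cross-variable aggregation: a learned query attends
    # over all variables at each spatial patch, yielding one token per
    # patch. Shapes and module placement are assumptions.
    import torch
    import torch.nn as nn

    class VariableAggregation(nn.Module):
        def __init__(self, dim, n_heads=8):
            super().__init__()
            self.query = nn.Parameter(torch.randn(1, 1, dim))
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

        def forward(self, x):                      # x: (batch, vars, patches, dim)
            b, v, p, d = x.shape
            x = x.permute(0, 2, 1, 3).reshape(b * p, v, d)
            q = self.query.expand(b * p, 1, d)     # same query at every patch
            out, _ = self.attn(q, x, x)            # attend across variables
            return out.reshape(b, p, d)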
Posted Content

Structure-informed Language Models Are Protein Designers

TL;DR: LM-Design performs a structural surgery on pretrained protein language models (pLMs): a lightweight structural adapter is implanted into the language model to endow it with structural awareness, and iterative refinement then effectively optimizes the generated protein sequences.
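A minimal sketch of what such a lightweight adapter could look like, assuming a frozen pLM that exposes per-residue hidden states plus precomputed structure features; the exact LM-Design architecture may differ.

    # Bottleneck adapter that injects structure features into pLM hidden
    # states via a residual update; dimensions and fusion are assumptions.
    import torch
    import torch.nn as nn

    class StructuralAdapter(nn.Module):
        def __init__(self, dim, struct_dim, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(dim + struct_dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)
            self.act = nn.GELU()

        def forward(self, hidden, struct_feats):
            fused = torch.cat([hidden, struct_feats], dim=-1)
            return hidden + self.up(self.act(self.down(fused)))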
Journal Article

A Text-guided Protein Design Framework

TL;DR: ProteinDT is a multi-modal framework that leverages textual descriptions for protein design. It consists of three sequential steps: ProteinCLAP, which aligns the representations of the two modalities; a facilitator, which generates the protein representation from the text modality; and a decoder, which generates protein sequences from that representation.
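The ProteinCLAP step is contrastive in spirit; a common way to implement such cross-modal alignment is a symmetric CLIP-style InfoNCE loss, sketched below under the assumption that both encoders emit fixed-size embeddings (the function name and details are illustrative, not ProteinDT's code).

    # Symmetric contrastive loss over matched (text, protein) pairs in a
    # batch; a stand-in for the alignment objective, not ProteinDT's code.
    import torch
    import torch.nn.functional as F

    def clap_loss(text_emb, prot_emb, temperature=0.07):
        t = F.normalize(text_emb, dim=-1)
        p = F.normalize(prot_emb, dim=-1)
        logits = t @ p.T / temperature        # pairwise cosine similarities
        labels = torch.arange(len(t), device=t.device)  # diagonal = matches
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.T, labels)) / 2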
Journal Article

Accelerating the integration of ChatGPT and other large‐scale AI models into biomedical research and healthcare

TL;DR: The authors provide a general overview of advanced large-scale AI models, including language models, vision-language models, graph learning models, language-conditioned multi-agent models, and multimodal embodied models.
References
Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
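The masked objective at the heart of BERT fits in a few lines: hide a fraction of tokens and train the model to reconstruct them using context from both directions. The model interface below is an assumption, not Hugging Face's exact API.

    # Toy masked-LM training step; `model` is assumed to map token ids to
    # per-position vocabulary logits.
    import torch
    import torch.nn.functional as F

    def masked_lm_step(model, tokens, mask_id, mask_prob=0.15):
        mask = torch.rand(tokens.shape) < mask_prob
        inputs = tokens.clone()
        inputs[mask] = mask_id                # hide the selected tokens
        logits = model(inputs)                # (batch, seq, vocab), assumed
        return F.cross_entropy(logits[mask], tokens[mask])  # masked positions only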
Journal Article

BLAST+: architecture and applications.

TL;DR: The new BLAST+ command-line applications demonstrate substantial speed improvements over the current BLAST tools for long queries as well as for chromosome-length database sequences.
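A typical way to drive the BLAST+ command-line applications from a script is shown below; the query file and database name are placeholders, and the database is assumed to have been built beforehand with makeblastdb.

    # Run blastp against a preformatted protein database and print the
    # tabular hits; file and database names are hypothetical.
    import subprocess

    result = subprocess.run(
        ["blastp",
         "-query", "designs.fasta",   # sequences to search
         "-db", "mydb",               # database from makeblastdb
         "-outfmt", "6",              # tabular output format
         "-evalue", "1e-3"],          # significance cutoff
        capture_output=True, text=True, check=True)
    print(result.stdout)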
Book

Accelerated Profile HMM Searches

TL;DR: An acceleration heuristic for profile HMMs, the "multiple segment Viterbi" (MSV) algorithm, computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment.
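For intuition: on a single diagonal, the best ungapped local segment reduces to a maximum-subarray scan over per-position match scores, as the toy function below shows. The real MSV filter generalizes this to an optimal sum of multiple segments and vectorizes it with striped SIMD instructions.

    # Kadane-style scan for the best single ungapped segment; a didactic
    # reduction of the idea, not the MSV algorithm itself.
    def best_ungapped_segment(scores):
        best = current = 0
        for s in scores:
            current = max(0, current + s)   # restart if the segment dips below 0
            best = max(best, current)
        return best

    print(best_ungapped_segment([2, -1, 3, 4, -6, 2]))  # -> 8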