
REVIEW
Protein sequence-to-structure learning: Is this the end(-to-end revolution)?

Elodie Laine¹ | Stephan Eismann² | Arne Elofsson³ | Sergei Grudinin⁴
(Equally contributing authors.)

¹ Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), 75005 Paris, France
² Dept. of Computer Science and Applied Physics, Stanford University, Stanford, CA 94305, USA
³ Dept. of Biochemistry and Biophysics and Science for Life Laboratory, Stockholm University, Box 1031, 171 72 Solna, Sweden
⁴ Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000 Grenoble, France

Correspondence: Sergei Grudinin, Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. Email: sergei.grudinin@univ-grenoble-alpes.fr

Funding information: EL was funded by the French national research agency grant ANR-17-CE12-0009. SE was supported by a Stanford Bio-X Bowes Fellowship. AE was funded by grants from the Swedish E-science Research Center, the Swedish National Infrastructure for Computing, and the Swedish Natural Science Research Council, No. VR-NT 2016-03798.
The potential of deep learning has been recognized in the protein structure prediction community for some time, and became indisputable after CASP13. In CASP14, deep learning has boosted the field to unanticipated levels reaching near-experimental accuracy. This success comes from advances transferred from other machine learning areas, as well as methods specifically designed to deal with protein sequences and structures, and their abstractions. Novel emerging approaches include (i) geometric learning, i.e. learning on representations such as graphs, 3D Voronoi tessellations, and point clouds; (ii) pre-trained protein language models leveraging attention; (iii) equivariant architectures preserving the symmetry of 3D space; (iv) use of large meta-genome databases; (v) combinations of protein representations; and (vi) finally, truly end-to-end architectures, i.e. differentiable models starting from a sequence and returning a 3D structure. Here, we provide an overview and our opinion of the novel deep learning approaches developed in the last two years and widely used in CASP14.
KEYWORDS: deep learning, protein structure prediction, CASP14, geometric learning, equivariance, end-to-end architectures, protein language models
1 | INTRODUCTION
In December 2020, the fourteenth edition of CASP
marked a major leap in protein three-dimensional (3D)
structure prediction. Indeed, deep learning-powered ap-
proaches have reached unprecedented levels of near-
experimental accuracy. This achievement has been
made possible thanks to the latest improvements in ge-
ometric learning and natural language processing (NLP)
techniques, and to the amounts of sequence and struc-
ture data accessible today. The fundamental basis for
the revolution in structure prediction comes from the
use of co-evolution. While traditional measures of co-variation
in natural sequences led to a few successes
[1, 2, 3], major improvements came from recasting the
problem as an inverse Potts model [4, 5]. These ideas
started to show their full potential about 10 years ago
with the development of efficient methods dealing with
large scale multiple sequence alignments [6, 7, 8]. They
enabled the modelling of 3D structures for large protein
families [9, 10, 11, 12, 13, 14].
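To make these co-variation measures concrete, here is a minimal Python sketch (ours, purely illustrative) that computes the mutual information between two columns of a toy MSA. Such raw covariation scores are the "traditional measures" referred to above; unlike the inverse Potts model, they do not disentangle direct couplings from indirect, transitive ones.

    import numpy as np

    # Toy MSA: rows are homologous sequences, columns are alignment positions.
    msa = np.array([list(s) for s in ["ACDKE", "ACEKD", "GCDRE", "GCERD"]])
    alphabet = sorted(set(msa.ravel()))

    def freqs(col):
        # Per-position amino acid frequencies (one column of a PSSM-like profile).
        return {a: np.mean(col == a) for a in alphabet}

    def mutual_information(i, j):
        # Covariation between columns i and j; a high value hints at a 3D
        # contact but conflates direct and indirect (transitive) couplings.
        fi, fj = freqs(msa[:, i]), freqs(msa[:, j])
        mi = 0.0
        for a in alphabet:
            for b in alphabet:
                pab = np.mean((msa[:, i] == a) & (msa[:, j] == b))
                if pab > 0:
                    mi += pab * np.log(pab / (fi[a] * fj[b]))
        return mi

    # Columns 2 and 4 co-vary perfectly (D <-> E swap): MI = log 2 = 0.69.
    print(round(mutual_information(2, 4), 2))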
Shifting from unsupervised statistical inference to su-
pervised deep learning further boosted the accuracy
of the predicted contacts, and extended the applicabil-
ity of this conceptual framework to families with fewer
sequences [15, 16] and to the prediction of residue-
residue distances [17, 18]. These advances have signif-
icantly increased the protein structure modelling cover-
age of genomes [19, 20, 21], and also of bacterial inter-
actomes [22, 23, 24]. Over the past years, the CASP
community has contributed to these efforts, with an in-
creasing number of teams developing and applying deep
learning approaches.
The emergence of novel deep learning techniques
has inspired a re-visit of the representations best suited
for biological objects (protein sequences and structures).
In particular, advances in the treatment of language
[25] and of 3D geometry [26, 27, 28, 29, 30] by deep
learning architectures have further benefited the field
of protein structure and function prediction. Expanding
on this progress, the DeepMind team demonstrated in
CASP14 that it is possible to produce extremely accu-
rate 3D models of proteins by learning end-to-end from
sequence alignments of related proteins [31]. This im-
plies being able to capture long-range dependencies be-
tween amino acid residues, to transform these depen-
dencies into structural constraints, and to preserve the
symmetry and properties of the 3D space when operat-
ing on protein structures.
This article is a follow-up to Kandathil et al. [32].
It aims to provide CASP participants and observers
with an overview of the recent developments in deep
learning applied to protein structure prediction, and
a comprehensive description of key concepts we
think have contributed to the formidable improvements
we have witnessed in the latest CASP edition. We
then discuss the implications of these improvements,
the next-to-solve problems, and speculate about the fu-
ture of structural (and computational) biology.
2 | END-TO-END LEARNING FOR PROTEIN STRUCTURE PREDICTION
FIGURE 1 Schematic representation of the inputs and outputs of deep learning-based methods in CASP14, excluding pipelines compiling several methods coming from different sources, and methods lacking a clear description. The blue and red lines indicate the input and output levels, respectively. Pretrained: sequence embeddings determined from NLP models pre-trained on huge amounts of sequence data. MSA: raw multiple sequence alignment. MSA-feat: MSA features (such as PSSMs, covariance and precision matrices). Contacts: contact or distance matrix. Geometry: geometrical features, typically including contacts/distances and torsion angles. Structure: 3D coordinates. QA: model quality. In the case of several inputs and/or outputs, we report those closest to the "end". BrainFold is highlighted with a star as it takes only the query sequence as input, without using pre-trained embeddings. This classification is based on available information from CASP abstracts and publications/preprints. See Supplementary Table S1 for more details.

One of the advantages of deep learning methods compared with traditional machine learning approaches is
the ability to automatically extract features from the in-
put data without the need to carefully handcraft them
(and potentially miss salient information). Assuming suf-
ficient training data is available, learned features are ex-
pected to better generalize to heterogeneous or novel
datasets. In addition, it is generally accepted that end-to-
end learning, where the network is trained to produce
the exact desired output and not some sort of heuris-
tic representation of it, is advantageous. Indeed, achiev-
ing a high accuracy on some intermediate result does
not guarantee high accuracy on the final output. For
instance, a learning algorithm may achieve a small loss
on dihedral angles, and yet computing atomic coordi-
nates from the predicted dihedral angles could lead to a
high reconstruction error [33]. Nevertheless, introduc-
ing well-chosen intermediate losses in a so-called "end-
to-end" architecture can help to produce better final
outputs [31]. These auxiliary intermediate losses pro-
vide some guarantee that the method is not only able
to produce an accurate final output (e.g. a protein 3D
structure) but also to accurately model some other prop-
erties of the object under study (e.g. secondary struc-
ture, stereo-chemical quality...), and a means to incorpo-
rate additional domain knowledge. While most protein
structure prediction methods take pre-computed fea-
tures as input and output a contact or distance map, pos-
sibly augmented with other geometrical features (Fig. 1,
see iPhord, ProSPr [34], Kiharalab_Contact [35], Phar-
mulator, DeepPotential, RaptorX [36], Galaxy, Triple-
tRes [37], A2I2Prot, DESTINI2 [38], DeepHelicon [39],
DeepHomo [40], ICOS, PrayogRealDistance [41, 42],
RBO-PSP-CP [43], DeepECA, ropius0 [44], tFOLD, plus
QUARK, Risoluto, Multicom [45] and those from the
Zhang lab), several efforts have been recently engaged
towards developing end-to-end architectures. Here, we
will shortly review these efforts and try to identify the
key components of what represents end-to-end learn-
ing in protein structure prediction (Table 1).
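The reconstruction-error problem mentioned above for dihedral angles is essentially a lever-arm effect, which a few lines of Python make concrete (a toy calculation of ours, not taken from [33]):

    import numpy as np

    # Idealized C-alpha trace: 200 residues on a straight line, 3.8 A apart.
    n = 200
    coords = np.zeros((n, 3))
    coords[:, 0] = 3.8 * np.arange(n)

    # Perturb a single internal rotation near the N-terminus by 2 degrees:
    # rotate everything beyond residue 5 about the z-axis through residue 5.
    t = np.deg2rad(2.0)
    R = np.array([[np.cos(t), -np.sin(t), 0.0],
                  [np.sin(t),  np.cos(t), 0.0],
                  [0.0,        0.0,       1.0]])
    perturbed = coords.copy()
    perturbed[5:] = (perturbed[5:] - coords[5]) @ R.T + coords[5]

    # The C-terminus is displaced by ~26 A despite the tiny angular error.
    print(np.linalg.norm(perturbed[-1] - coords[-1]))

A 2-degree error on one internal coordinate, negligible in angle space, displaces the far end of the chain by about 26 Å; this is why a small loss on dihedral angles does not guarantee a small error on the reconstructed structure.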
Ideally, the ultimate input would be the sequence
of the query protein. So far, only a couple of learning
methods have exploited this information alone and directly
to efficiently fold proteins de novo [46, 47].
They rely on differentiable [46] and neural [47] poten-
tials whose parameters are learnt from conformational
ensembles generated by Langevin dynamics simulations.
More commonly, the strategy of state-of-the-art meth-
ods is to leverage the highly degenerate nature of
the sequence-structure relationship through the use of
a multiple sequence alignment (MSA) of evolutionary-
related sequences, or a pre-trained protein language
model (see below). In this context, methods qualifying
for "end-to-X" learning should take as input raw (pos-
sibly aligned) sequence(s), as opposed to features de-
rived from them such as conservation levels (e.g. stored
in a Position-Specific Scoring Matrix or PSSM) or co-
evolution estimates (e.g. mutual information, direct pair-
wise couplings). One of the first examples of end-to-
X method was rawMSA [48], which leveraged embed-
ding techniques from the field of NLP, to map the amino
acid residues into a continuous space adaptively learned
based on the sequence context (Table 1). In DMP-
fold2 [49, 50], this idea was extended to MSAs of ar-
bitrary lengths by scanning individual columns in the
MSA with stacked Gated Recurrent Unit (GRU) layers.
CopulaNet [51] adopts a query-centered view by ex-
panding the input MSA to a set of query-homolog pair-
wise alignments prior to embedding it. In AlphaFold2
[31], the MSA embedding is obtained through several
rounds of self-attention (see below) applied to the MSA
rows and columns. Beyond computing MSA embed-
dings, rawMSA, CopulaNet and AlphaFold2 add an ex-
plicit step aimed at converting the information they con-
tain into residue-residue pairwise couplings through an
outer product operation on the embedding vectors. Re-
cently, a compromise end-to-X solution where the com-
putation of traditional hand-crafted features takes place
on the GPU and is tightly coupled to the network was im-
plemented in trRosetta [52], allowing gradients to be
backpropagated all the way to the input sequences [53].
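The outer-product step that turns per-residue embeddings into pairwise features is easy to sketch; the toy version below (dimensions arbitrary, loosely mirroring AlphaFold2's "outer product mean") averages over the sequences of the MSA:

    import numpy as np

    S, L, d = 16, 60, 8             # MSA depth, number of residues, embedding size
    emb = np.random.randn(S, L, d)  # toy per-sequence, per-residue embeddings

    # Outer products of the embedding vectors at positions i and j, averaged
    # over the MSA, give an L x L map of d x d pairwise features from which
    # residue-residue couplings can be learned.
    pair = np.einsum('sid,sje->ijde', emb, emb) / S
    print(pair.shape)               # (60, 60, 8, 8)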
At the other end of the spectrum, the ultimate output
is the 3D structure of the query protein. Thus, an "X-
to-end" deep learning architecture should directly pro-
duce 3D coordinates and not some intermediate repre-
sentation such as a contact map. M. AlQuraishi [54] was
among the first to develop such a method in 2019 (Table
1). The model takes as input a PSSM, without account-
ing for any co-evolutionary information, and outputs the
Cartesian coordinates of the protein. The torsion angles
are predicted and used to reconstruct the 3D structure.
Although novel, such an approach has so far not proven
to perform better than earlier methods in CASP. One
well-known problem is that internal coordinates are ex-
tremely sensitive to small deviations as the latter easily
propagate through the protein, generating large errors
in the reconstructed structure [33]. To overcome this
problem, it is possible to efficiently reconstruct Carte-
sian coordinates from a distance matrix by using multi-
dimensional scaling (MDS) or other optimization tech-
niques as in CUTSP [55], DMPfold2 [49], or E2E and
FALCON-geom methods of CASP14. In its classical for-
mulation, used by both DMPfold2 and E2E, MDS ex-
tracts exact 3D coordinates (provided that the distance
matrix is exact) through eigendecomposition of the
doubly centered squared-distance matrix. Nevertheless, one issue with us-
ing MDS as the final layer in the network is that the
output may be a mirror image (chiral version) of the pro-
tein. The most recent version of DMPfold2 (DMPfold2-
new in Table 1 [50]) attempted to resolve this issue by
adding an extra GRU layer. AlphaFold2 takes a differ-
ent route and elegantly solves the 3D reconstruction
and the mirror-image problems jointly by learning spatial
transformations of the local reference frames of each
of the protein residues. Computing the geometric loss
function in the local frames automatically distinguishes
the mirror images, as one of the local axes is a vector
product of the two others. Noticeably, even though X-
to-end approaches generate a 3D structure, the latter is
usually refined afterwards (for example through molec-
ular dynamics simulations). For instance, relaxation of
AlphaFold2’s output with a physical force field is neces-
sary to enforce peptide bond geometry [31].
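Classical MDS admits a compact implementation; the sketch below (ours, with random points standing in for protein atoms) recovers 3D coordinates from an exact distance matrix via eigendecomposition and shows why the mirror image cannot be told apart from distances alone:

    import numpy as np

    def mds_3d(D):
        # Recover 3D coordinates (up to rotation, translation and mirror
        # reflection) from an exact Euclidean distance matrix D.
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n   # centering operator
        G = -0.5 * J @ (D ** 2) @ J           # Gram matrix of centered coords
        w, V = np.linalg.eigh(G)
        top = np.argsort(w)[::-1][:3]         # three largest eigenvalues
        return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

    coords = np.random.randn(50, 3)
    D = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    rec = mds_3d(D)
    mirror = rec * np.array([-1.0, 1.0, 1.0])  # reflected (chiral) solution

    dist = lambda x: np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    print(np.allclose(dist(rec), D), np.allclose(dist(mirror), D))  # True True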
Although the protein 3D structure appears as an obvi-
ous and legitimate target, one may wonder whether gen-
erating 3D coordinates confers any advantage, in terms
of problem solving and performance, compared to a per-
fect 2D contact map. First, as mentioned above, effi-
cient methods to use 2D information for generating 3D
models exist [56, 52]. Further, the most popular residue-
or even atom-level loss functions used in deep neural
networks (DNNs) do not depend on the superposition of
the predicted model to the ground-truth structure and
are evaluated using the comparison of distance maps.
The most illustrative example is the local distance dif-
ference test (lDDT) [57], which has been employed as a
target function in CASP14 by some of the best perform-
ers including AlphaFold2 [31] and Rosetta. The value of
this loss would not change if we swap the 3D and 2D
representations. Nevertheless, it is not clear whether
a perfect 2D map can be reached without using some
3D knowledge about the structure. Operating on 3D
representations allows calculating global or local quality
scores reflective of the structural accuracy in a way that
2D distance maps do not, as illustrated by the mirror-
image issue mentioned above. The DNN can then learn
to regress against these quality scores, and iteratively re-
fine a first rough 3D guess by predicting (local) deforma-
tions to arrive at a better structure. However, operating
in 3D poses specific challenges related to the preserva-
tion of symmetries, which we discuss in Section 5. So
far, the only successful example of indisputable improve-
ment of 3D structure representation over 2D maps is
given by AlphaFold2 [31]. Whether similar performance
can be achieved with 2D maps and whether 2D maps
are needed at all in the predictive process remain open
questions.
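To see why such losses are blind to the 2D/3D distinction, consider a simplified global score in the spirit of lDDT [57] (the real metric is computed per residue, excludes sequence-local pairs, and defines the inclusion shell on the reference structure, but the key property is the same):

    import numpy as np

    def lddt_like(d_ref, d_mod, cutoff=15.0, tolerances=(0.5, 1.0, 2.0, 4.0)):
        # Fraction of reference distances within the inclusion radius that the
        # model reproduces within each tolerance, averaged over the four
        # standard tolerances. No superposition is ever computed.
        mask = (d_ref < cutoff) & ~np.eye(d_ref.shape[0], dtype=bool)
        diff = np.abs(d_ref - d_mod)[mask]
        return np.mean([(diff < t).mean() for t in tolerances])

Since only the distance maps d_ref and d_mod enter the computation, the score is identical whether the prediction is delivered as 3D coordinates or as a 2D distance map, and a model and its mirror image score the same.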
Being able to produce 3D models resembling experi-
mental structures implies being able to tell apart "good"
from "bad" models. Hence, protein model quality assess-
ment (MQA or QA), now referred to in CASP as estima-
tion of model accuracy (EMA), has always been an im-
portant step in protein structure prediction pipelines. It
makes it possible, in principle, to choose the best models
(in the case of global QA) and/or to spot inaccuracies in the
proposed models for subsequent refinement (in the case of local QA).
In recent years, a large number of deep learning-based
approaches have been specifically designed for this task.
Classically, they take a 3D model as input and then as-
sess its quality in a stand-alone fashion (Fig. 1). Alter-
natively, some teams proposed integrative approaches.
For example, QDeep QA predictions [58] are based on
distance estimations from DMPfold [21]. In GalaxyRe-
fine2 [59], RefineD [60], and the Baker suite [61], QA is
incorporated into a model refinement pipeline. Finally,
QA blocks may be used as an integral part of a sequence-
to-structure prediction process, as is the case in DMP-
fold2 [49] and AlphaFold2 [31].
3 | THE IMPORTANCE OF DATA AND DATA REPRESENTATIONS
The success of deep-learning methods is heavily
grounded in the availability of large amounts of data,
and the development of suitable representations struc-
turing and expressing the information they contain. The
advent of high throughput sequencing technologies has
widened the gap between the number of known protein
sequences and known protein structures. Genomics
has become pre-eminent in terms of data scale, growing
exponentially [64, 65]. These huge amounts
of data offer unprecedented opportunities to develop
high-capacity models detecting co-variation patterns
and learning the "protein language".
3.1 | Leveraging (meta-)genomics
In the last few years, the accessible resources for unan-
notated sequences coming from metagenomics exper-
iments have multiplied. They include databases like
NCBI GenBank [66], Metaclust [67], BFD [68], MetaEuk
[69], EBI MGnify [70], and IMG/M [71]. Since CASP12,
several teams attempted to exploit this type of data,
mostly to increase the depth of the MSAs and obtain a
more accurate estimation of (co-)evolutionary features.
For example, RaptorX [36], methods from the Yang and
Baker teams [72, 73], Multicom [45], and GALAXY ex-
ploited metagenome data for contact prediction and
distance estimation between residue pairs in combina-
tion with residual convolutional neural networks (resC-
NNs). The HMS-Casper [54, 74], DMPfold2 [49] and
AlphaFold2 methods [31] exploited them directly to
predict 3D structures. Regarding QA, DeepPotential
from the Zhang lab and QDeep [58] leverage MSA pro-
files generated from metagenome databases. To
gather large amounts of sequences, coming from dif-
ferent sources, many teams relied on the DeepMSA al-
gorithm [75]. Most of the time, the sequences were
integrated altogether in a single MSA. However, some
methods proposed to combine several MSAs with dif-
ferent weights (e.g. Kihara’s lab) or to select a few of
them with high depth and/or variability (e.g. DeepPo-
tential). Notably, deep learning is not only used to
exploit sequence alignments, but also to generate them.
For instance, the SAdLSA algorithm improves the qual-
ity of low-sequence identity alignments by learning the
"protein folding code" from structural alignments [76].
NDThreader [77] and ProALIGN [78] are specifically de-
signed to optimally align the query with the template
in template-based modeling. Both methods exploit pre-
dicted or observed inter-residue distances to improve
the sequence alignments, a strategy that proved power-
ful already in CASP13 [72, 79, 80].
3.2 | From MSA to query-specific embeddings
The most traditional way to extract information from an
MSA is to compute a probabilistic profile or a PSSM re-
flecting the abundance of each amino acid at each po-
sition. This type of representation has been very popu-
lar from the very first CASP editions. Over the past 10
years, direct coupling analysis (DCA)-based models [12],
including Potts models and pseudolikelihood maximiza-
tion [8, 81, 82, 83], and graphical lasso-based (low-rank)
models [84, 85, 86] became widespread in the com-
munity. These statistical methods explicitly estimate
residue pairwise couplings as proxies for 3D contacts.
More recently, some meta-models [87, 88], correlation
and precision matrix-based approaches [89, 90, 52], and

TABLE 1 Overview of X-to-end and end-to-X deep learning approaches for protein structure prediction.

End-to-end learning
  AlphaFold2 [31]: The MSA, along with templates, is fed into a translation- and rotation-equivariant transformer architecture, which outputs a 3D structural model.
  DMPfold2 (new) [49, 50]: The MSA, along with the precision matrix, is fed into a GRU, which outputs a 3D structure.

End-to-X learning
  MSA Transformer [62]: Transformer architecture.
  rawMSA [48]: The MSA is fed into a 2D CNN (the first convolutional layer creates an embedding), which outputs a contact map.
  CopulaNet [51]: Extracts all sequence pairs from the MSA and feeds them to a dilated resCNN.
  TOWER: The network is trained with a deep dilated resCNN to predict inter-residue distances directly from the raw MSA.
  trRosetta [52]: Computes traditional MSA features on the fly and passes them to dilated convolutional layers.

X-to-end learning
  NOVA [63]: Adopts DeepFragLib from the same team, which uses Long Short-Term Memory units (LSTMs), to output a 3D structure.
  DMPfold2 [49]: The MSA, along with the precision matrix, is fed into a GRU, which outputs distances and angles (version used in CASP14).
  HMS-Casper [54]: Raw sequences plus PSSMs are given to a "Recurrent Geometrical Network" comprising LSTM and geometric units and outputting a 3D structure.
a variety of deep-learning models [91, 16, 92, 93,
21, 38, 73, 42, 41, 37, 45], including generative adver-
sarial networks for contact map generation and refine-
ment [94, 35], got widely used to capture the same type
of co-evolutionary information. One limitation of these
methods is that they estimate average properties over
an ensemble of sequences representative of a protein
family. Hence, they may miss information specifically
relevant to the protein query. The DeepMind team cir-
cumvented this limitation with AlphaFold2 by comput-
ing embeddings for residue-residue relationships within
the query and sequence-residue relationships between
the sequences in the MSA and the query, and making
the information flow between these two representa-
tions. Alternatively, one may transfer the knowledge ac-
quired on hundreds of millions of natural sequences to
generate query-specific embeddings (Table 2). Several
models developed for NLP, including BERT [95], ELMo
[96], and GPT-2 [97], have been adapted to the "protein
language". During the semi-supervised training phase,
the model attempts to predict a masked token or the next
token [98]. In CASP14, EMBER directly made use of ELMo
and BERT while HMS-Casper [54] used a reformulated
version of the latter, called AminoBert. A2I2Prot and
CUTSP leveraged the TAPE initiative [98], which pro-
vides data, tasks and benchmarks to facilitate the evalu-
ation of protein transfer learning.
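The masked-token objective itself is simple to sketch in PyTorch; the toy model below stands in for the much larger BERT-style protein language models discussed here (positional encodings and other essentials are omitted for brevity):

    import torch
    import torch.nn as nn

    AA = "ACDEFGHIKLMNPQRSTVWY"
    MASK = len(AA)                             # extra token id for [MASK]

    class TinyProtLM(nn.Module):
        # Illustrative stand-in for a BERT-like protein language model.
        def __init__(self, d=64):
            super().__init__()
            self.emb = nn.Embedding(len(AA) + 1, d)
            layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(d, len(AA))

        def forward(self, tokens):
            return self.head(self.encoder(self.emb(tokens)))

    seq = torch.randint(0, len(AA), (1, 100))  # a toy "protein" of 100 residues
    masked = seq.clone()
    pos = torch.rand(seq.shape) < 0.15         # mask out 15% of the positions
    masked[pos] = MASK

    model = TinyProtLM()
    logits = model(masked)
    # Semi-supervised objective: recover the identity of the masked residues.
    loss = nn.functional.cross_entropy(logits[pos], seq[pos])

The per-residue activations of such a trained model, rather than its output predictions, are what serve as query-specific embeddings downstream.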
3.3 | Representations of protein structure
Sequence-based protein representation may be en-
riched with different levels of structural information, for
example, some prior knowledge about secondary struc-
ture (SS) elements. In principle, some of these elements,
such as alpha helices or beta strands, can be represented
with 3D primitives. An interesting idea that we saw in
CASP14 was the use of a discrete version of Frenet-
Serret frames for the protein backbone parametrization
by HMS-Casper. However, such a representation is very
complex, and a much simpler way would be to abstract
SS primitives with a hydrogen-bond (HB) 2D map. For
example, the ISSEC network was specifically trained to
segment SS elements in 2D contact maps [99]. Similarly,
the protein 3D topology may be abstracted as a 2D con-
tact map, or its probabilistic generalization, e.g. a matrix
filled with continuous probabilities or contact propen-
sities between protein atoms or residues. Beyond 2D
contact maps, richer descriptions of the 3D structures
can be achieved with 2D contact manifolds and protein
surfaces, 3D molecular graphs, point clouds, sets of ori-
ented local frames, volumetric 3D maps, or 3D tessel-
lations, e.g. through Voronoi diagrams (Table 2). These
different levels of protein representations and their ap-
plications in CASP are discussed in more detail below
and schematically shown in Fig. 2.
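As a minimal example of these abstractions, a hard contact map and a probabilistic contact-propensity map can both be derived from C-alpha coordinates in a few lines (the 8 Å cutoff and the sigmoid softening are common but arbitrary choices):

    import numpy as np

    def contact_map(ca, cutoff=8.0):
        # 2D abstraction of a 3D topology: residues i and j are "in contact"
        # if their C-alpha atoms lie within the cutoff.
        d = np.linalg.norm(ca[:, None] - ca[None, :], axis=-1)
        return d < cutoff

    def contact_propensity(ca, cutoff=8.0, slope=1.5):
        # Probabilistic generalization: replace the hard threshold with a
        # smooth propensity that decays with distance.
        d = np.linalg.norm(ca[:, None] - ca[None, :], axis=-1)
        return 1.0 / (1.0 + np.exp(slope * (d - cutoff)))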
3.3.1 | Volumetric protein representations
The first attempt to train 3D CNNs on a volumetric pro-
tein representation dates back to CASP12, with the goal
of assessing protein model quality [100]. The architec-
ture was robust but had two major limitations. Specifi-
cally, it relied on a predefined set of protein atom types, and
the orientation of the protein model given as input had
