Author

Zachary Wu

Bio: Zachary Wu is an academic researcher from the California Institute of Technology. The author has contributed to research on the topics Function (engineering) and Directed evolution. The author has an h-index of 8 and has co-authored 11 publications receiving 856 citations.

Papers

Journal Article•DOI•

Highly accurate protein structure prediction for the human proteome

Kathryn Tunyasuvunakool, Jonas Adler, Zachary Wu, Tim Green, Michal Zielinski, Augustin Žídek, Alex Bridgland, Andrew Cowie, Clemens Meyer, Agata Laydon, Sameer Velankar¹, Gerard J. Kleywegt¹, Alex Bateman¹, Richard Evans, Alexander Pritzel, Michael Figurnov, Olaf Ronneberger, Russell Bates, Simon A. A. Kohl, Anna Potapenko, Andrew J. Ballard, Bernardino Romera-Paredes, Stanislav Nikolov, R. D. Jain, Ellen Clancy, David Reiman, Stig Petersen, Andrew W. Senior, Koray Kavukcuoglu, Ewan Birney¹, Pushmeet Kohli, John M. Jumper, Demis Hassabis•Institutions (1)

European Bioinformatics Institute¹

22 Jul 2021-Nature

TL;DR: This work applies AlphaFold2 at scale to almost the entire human proteome (98.5% of human proteins), yielding confident predictions for 58% of residues, and makes the resulting dataset freely available through a public database.

Abstract: Protein structures can provide invaluable information, both for reasoning about biological processes and for enabling interventions such as structure-based drug development or targeted mutagenesis. After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally-determined structure1. Here we dramatically expand structural coverage by applying the state-of-the-art machine learning method, AlphaFold2, at scale to almost the entire human proteome (98.5% of human proteins). The resulting dataset covers 58% of residues with a confident prediction, of which a subset (36% of all residues) have very high confidence. We introduce several metrics developed by building on the AlphaFold model, and use them to interpret the dataset, identifying strong multi-domain predictions as well as regions likely to be disordered. Finally, we provide some case studies illustrating how high-quality predictions may be used to generate biological hypotheses. Importantly, we are making our predictions freely available to the community via a public database (hosted by the European Bioinformatics Institute at https://alphafold.ebi.ac.uk/ ). We anticipate that routine large-scale and high-accuracy structure prediction will become an important tool, allowing new questions to be addressed from a structural perspective.

1,238 citations

Journal Article•DOI•

Machine-learning-guided directed evolution for protein engineering.

Kevin K. Yang¹, Zachary Wu¹, Frances H. Arnold¹•Institutions (1)

California Institute of Technology¹

15 Jul 2019-Nature Methods

TL;DR: The steps required to build machine-learning sequence–function models and to use those models to guide engineering are introduced and the underlying principles of this engineering paradigm are illustrated with the help of case studies.

Abstract: Protein engineering through machine-learning-guided directed evolution enables the optimization of protein functions. Machine-learning approaches predict how sequence maps to function in a data-driven manner without requiring a detailed model of the underlying physics or biological pathways. Such methods accelerate directed evolution by learning from the properties of characterized variants and using that information to select sequences that are likely to exhibit improved properties. Here we introduce the steps required to build machine-learning sequence-function models and to use those models to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to the use of machine learning for protein engineering, as well as the current literature and applications of this engineering paradigm. We illustrate the process with two case studies. Finally, we look to future opportunities for machine learning to enable the discovery of unknown protein functions and uncover the relationship between protein sequence and function.

527 citations

Journal Article•DOI•

Machine learning-assisted directed protein evolution with combinatorial libraries.

Zachary Wu¹, S. B. Jennifer Kan¹, Russell D. Lewis¹, Bruce J. Wittmann¹, Frances H. Arnold¹•Institutions (1)

California Institute of Technology¹

30 Apr 2019-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: It is proposed that the expense of experimentally testing a large number of protein variants can be decreased and the outcome can be improved by incorporating machine learning with directed evolution, and that machine learning-guided directed evolution finds variants with higher fitness than those found by other directed evolution approaches.

Abstract: To reduce experimental effort associated with directed protein evolution and to explore the sequence space encoded by mutating multiple positions simultaneously, we incorporate machine learning into the directed evolution workflow. Combinatorial sequence space can be quite expensive to sample experimentally, but machine-learning models trained on tested variants provide a fast method for testing sequence space computationally. We validated this approach on a large published empirical fitness landscape for human GB1 binding protein, demonstrating that machine learning-guided directed evolution finds variants with higher fitness than those found by other directed evolution approaches. We then provide an example application in evolving an enzyme to produce each of the two possible product enantiomers (i.e., stereodivergence) of a new-to-nature carbene Si–H insertion reaction. The approach predicted libraries enriched in functional enzymes and fixed seven mutations in two rounds of evolution to identify variants for selective catalysis with 93% and 79% ee (enantiomeric excess). By greatly increasing throughput with in silico modeling, machine learning enhances the quality and diversity of sequence solutions for a protein engineering problem.

315 citations

Journal Article•DOI•

Learned protein embeddings for machine learning.

Kevin K. Yang¹, Zachary Wu¹, Claire N. Bedbrook¹, Frances H. Arnold¹•Institutions (1)

California Institute of Technology¹

01 Aug 2018-Bioinformatics

TL;DR: The predictive power of Gaussian process models trained using embeddings is comparable to those trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions.

Abstract: Motivation: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model’s ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling. Results: The predictive power of Gaussian process models trained using embeddings is comparable to those trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows that meaningful relationships between the embedded proteins are captured.

230 citations

Journal Article•DOI•

Stereoselective Enzymatic Synthesis of Heteroatom-Substituted Cyclopropanes

Oliver F. Brandenberg¹, Christopher K. Prier¹, Kai Chen¹, Anders M. Knight¹, Zachary Wu¹, Frances H. Arnold¹•Institutions (1)

California Institute of Technology¹

24 Feb 2018-ACS Catalysis

TL;DR: This work engineered variants of Cytochrome P450BM3 that catalyze the cyclopropanation of heteroatom-bearing alkenes, providing valuable nitrogen-, oxygen-, and sulfur-substituted cyclopropanes, and expands the catalytic functions of iron heme proteins.

Abstract: The repurposing of hemoproteins for non-natural carbene transfer activities has generated enzymes for functions previously accessible only to chemical catalysts. With activities constrained to specific substrate classes, however, the synthetic utility of these new biocatalysts has been limited. To expand the capabilities of non-natural carbene transfer biocatalysis, we engineered variants of Cytochrome P450BM3 that catalyze the cyclopropanation of heteroatom-bearing alkenes, providing valuable nitrogen-, oxygen-, and sulfur-substituted cyclopropanes. Four or five active-site mutations converted a single parent enzyme into selective catalysts for the synthesis of both cis and trans heteroatom-substituted cyclopropanes, with high diastereoselectivities and enantioselectivities and up to 40 000 total turnovers. This work highlights the ease of tuning hemoproteins by directed evolution for efficient cyclopropanation of new substrate classes and expands the catalytic functions of iron heme proteins.

78 citations

Cited by

Journal Article•DOI•

Highly accurate protein structure prediction with AlphaFold

John M. Jumper, Richard O. Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russell Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, R. D. Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger¹, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David L. Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, Demis Hassabis•Institutions (1)

Seoul National University¹

15 Jul 2021-Nature

TL;DR: AlphaFold uses a novel deep learning architecture to predict protein structures with accuracy competitive with experimental structures in the majority of cases, even when no homologous structure is available.

Abstract: Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort1–4, the structures of around 100,000 unique proteins have been determined5, but this represents a small fraction of the billions of known protein sequences6,7. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’8—has been an important open research problem for more than 50 years9. Despite recent progress10–14, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14)15, demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm. AlphaFold predicts protein structures with an accuracy competitive with experimental structures in the majority of cases using a novel deep learning architecture.

10,601 citations

Journal Article•DOI•

AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models.

Mihaly Varadi¹, Stephen Anyango¹, Mandar Deshpande¹, Sreenath Nair¹, Cindy Natassia¹, Galabina Yordanova¹, David Yu Yuan¹, Oana Stroe¹, Gemma Wood¹, Agata Laydon, Augustin Žídek, Tim Green, Kathryn Tunyasuvunakool, Stig Petersen, John M. Jumper, Ellen Clancy, Richard E. Green, Ankur Vora, Mira Lutfi, Michael Figurnov, Andrew Cowie, Nicole Hobbs, Pushmeet Kohli, Gerard J. Kleywegt¹, Ewan Birney¹, Demis Hassabis, Sameer Velankar¹•Institutions (1)

European Bioinformatics Institute¹

17 Nov 2021-Nucleic Acids Research

TL;DR: The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) is an openly accessible, extensive database of high-accuracy protein-structure predictions.

Abstract: The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) is an openly accessible, extensive database of high-accuracy protein-structure predictions. Powered by AlphaFold v2.0 of DeepMind, it has enabled an unprecedented expansion of the structural coverage of the known protein-sequence space. AlphaFold DB provides programmatic access to and interactive visualization of predicted atomic coordinates, per-residue and pairwise model-confidence estimates and predicted aligned errors. The initial release of AlphaFold DB contains over 360,000 predicted structures across 21 model-organism proteomes, which will soon be expanded to cover most of the (over 100 million) representative sequences from the UniRef90 data set.

2,008 citations

Journal Article•DOI•

ColabFold: making protein folding accessible to all

Milot Mirdita¹, Tatiana Valdez Bubnova², Oi Wah Liew³•Institutions (3)

Seoul National University¹, Harvard University², University of Göttingen³

30 May 2022-Nature Methods

TL;DR: ColabFold combines the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold, achieving a 40-60-fold faster search and optimized model utilization for protein structure prediction.

Abstract: ColabFold offers accelerated prediction of protein structures and complexes by combining the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold. ColabFold's 40-60-fold faster search and optimized model utilization enables prediction of close to 1,000 structures per day on a server with one graphics processing unit. Coupled with Google Colaboratory, ColabFold becomes a free and accessible platform for protein folding. ColabFold is open-source software available at https://github.com/sokrypton/ColabFold and its novel environmental databases are available at https://colabfold.mmseqs.com .

1,553 citations

Posted Content•DOI•

Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences

Alexander Rives¹, Siddharth Goyal², Joshua Meier², Demi Guo², Myle Ott², C. Lawrence Zitnick², Jerry Ma², Rob Fergus², Rob Fergus¹•Institutions (2)

New York University¹, Facebook²

29 Apr 2019-bioRxiv

TL;DR: This work uses unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.

Abstract: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In biology, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Learning the natural distribution of evolutionary protein sequence variation is a logical step toward predictive and generative modeling for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million sequences spanning evolutionary diversity. The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge. The learned representation space organizes sequences at multiple levels of biological granularity from the biochemical to proteomic levels. Learning recovers information about protein structure: secondary structure and residue-residue contacts can be extracted by linear projections from learned representations. With small amounts of labeled data, the ability to identify tertiary contacts is further improved. Learning on full sequence diversity rather than individual protein families increases recoverable information about secondary structure. We show the networks generalize by adapting them to variant activity prediction from sequences only, with results that are comparable to a state-of-the-art variant predictor that uses evolutionary and structurally derived features.

748 citations

Journal Article•DOI•

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

Alexander Rives¹, Alexander Rives², Joshua Meier¹, Tom Sercu¹, Siddharth Goyal¹, Zeming Lin², Jason Liu¹, Demi Guo³, Myle Ott¹, C. Lawrence Zitnick¹, Jerry Ma⁴, Jerry Ma⁵, Rob Fergus²•Institutions (5)

Facebook¹, New York University², Harvard University³, University of Chicago⁴, Yale University⁵

13 Apr 2021-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: This paper uses unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity; the resulting model contains information about biological properties in its representations.

Abstract: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.

700 citations
