Author

Louis Bergelson

Bio: Louis Bergelson is an academic researcher at the Broad Institute. The author has contributed to research on the topics Gene and Genome. The author has an h-index of 4 and has co-authored 5 publications receiving 3186 citations.

Topics: Gene, Genome, Constraint (information theory), Exome sequencing, Human genome

Papers

Journal Article•DOI•

The mutational constraint spectrum quantified from variation in 141,456 humans

Konrad J. Karczewski¹, Laurent C. Francioli¹, Grace Tiao¹, Beryl B. Cummings¹, Jessica Alföldi¹, Qingbo Wang¹, Ryan L. Collins¹, Kristen M. Laricchia¹, Andrea Ganna¹, Daniel P. Birnbaum¹, Laura D. Gauthier¹, Harrison Brand¹, Matthew Solomonson¹, Nicholas A. Watts¹, Daniel R. Rhodes², Moriel Singer-Berk¹, Eleina M. England¹, Eleanor G. Seaby¹, Jack A. Kosmicki¹, Raymond K. Walters¹, Katherine Tashman¹, Yossi Farjoun¹, Eric Banks¹, Timothy Poterba¹, Arcturus Wang¹, Cotton Seed¹, Nicola Whiffin¹, Jessica X. Chong³, Kaitlin E. Samocha⁴, Emma Pierce-Hoffman¹, Zachary Zappala¹, Anne H. O’Donnell-Luria¹, Eric Vallabh Minikel¹, Ben Weisburd¹, Monkol Lek⁵, James S. Ware¹, Christopher Vittal⁶, Irina M. Armean¹, Louis Bergelson¹, Kristian Cibulskis¹, Kristen M. Connolly¹, Miguel Covarrubias¹, Stacey Donnelly¹, Steven Ferriera¹, Stacey Gabriel¹, Jeff Gentry¹, Namrata Gupta¹, Thibault Jeandet¹, Diane Kaplan¹, Christopher Llanwarne¹, Ruchi Munshi¹, Sam Novod¹, Nikelle Petrillo¹, David Roazen¹, Valentin Ruano-Rubio¹, Andrea Saltzman¹, Molly Schleicher¹, Jose Soto¹, Kathleen Tibbetts¹, Charlotte Tolonen¹, Gordon Wade¹, Michael E. Talkowski¹, Benjamin M. Neale¹, Mark J. Daly¹, Daniel G. MacArthur¹

Broad Institute¹, Queen Mary University of London², University of Washington³, Wellcome Trust Sanger Institute⁴, Yale University⁵, Harvard University⁶

27 May 2020-Nature

TL;DR: A catalogue of predicted loss-of-function variants in 125,748 whole-exome and 15,708 whole-genome sequencing datasets from the Genome Aggregation Database (gnomAD) reveals the spectrum of mutational constraints that affect these human protein-coding genes.

Abstract: Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes¹. Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human protein-coding genes along a spectrum that represents tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve the power of gene discovery for both common and rare diseases.

4,913 citations

Posted Content•DOI•

Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes

Konrad J. Karczewski¹,², Laurent C. Francioli¹,², Grace Tiao¹,², Beryl B. Cummings¹,², Jessica Alföldi¹,², Qingbo Wang¹,², Ryan L. Collins¹,², Kristen M. Laricchia¹,², Andrea Ganna¹,²,³, Daniel P. Birnbaum², Laura D. Gauthier², Harrison Brand¹,², Matthew Solomonson¹,², Nicholas A. Watts¹,², Daniel R. Rhodes⁴, Moriel Singer-Berk², Eleanor G. Seaby¹,², Jack A. Kosmicki¹,², Raymond K. Walters¹,², Katherine Tashman¹,², Yossi Farjoun², Eric Banks², Timothy Poterba¹,², Arcturus Wang¹,², Cotton Seed¹,², Nicola Whiffin²,⁵, Jessica X. Chong⁶, Kaitlin E. Samocha⁷, Emma Pierce-Hoffman², Zachary Zappala²,⁸, Anne H. O’Donnell-Luria¹,²,⁹, Eric Vallabh Minikel², Ben Weisburd², Monkol Lek²,¹⁰, James S. Ware²,⁵, Christopher Vittal¹,², Irina M. Armean¹,²,¹¹, Louis Bergelson², Kristian Cibulskis², Kristen M. Connolly², Miguel Covarrubias², Stacey Donnelly², Steven Ferriera², Stacey Gabriel², Jeff Gentry², Namrata Gupta², Thibault Jeandet², Diane Kaplan², Christopher Llanwarne², Ruchi Munshi², Sam Novod², Nikelle Petrillo², David Roazen², Valentin Ruano-Rubio², Andrea Saltzman², Molly Schleicher², Jose Soto², Kathleen Tibbetts², Charlotte Tolonen², Gordon Wade², Michael E. Talkowski¹,², Benjamin M. Neale¹,², Mark J. Daly², Daniel G. MacArthur¹,²

Harvard University¹, Broad Institute², University of Helsinki³, Queen Mary University of London⁴, National Institutes of Health⁵, University of Washington⁶, Wellcome Trust Sanger Institute⁷, Vertex Pharmaceuticals⁸, Boston Children's Hospital⁹, Yale University¹⁰, European Bioinformatics Institute¹¹

30 Jan 2019-bioRxiv

TL;DR: Using an improved model of human mutation rates, the authors classify human protein-coding genes along a spectrum representing tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve gene discovery power for both common and rare diseases.

Abstract: Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes critical for an organism’s function will be depleted for such variants in natural populations, while non-essential genes will tolerate their accumulation. However, predicted loss-of-function (pLoF) variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes. Here, we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence pLoF variants in this cohort after filtering for sequencing and annotation artifacts. Using an improved model of human mutation, we classify human protein-coding genes along a spectrum representing intolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve gene discovery power for both common and rare diseases.

1,128 citations

Journal Article•DOI•

Author Correction: The mutational constraint spectrum quantified from variation in 141,456 humans

Konrad J. Karczewski¹,², Laurent C. Francioli¹,², Grace Tiao¹,², Beryl B. Cummings¹,², Jessica Alföldi¹,², Qingbo Wang¹,², Ryan L. Collins¹,², Kristen M. Laricchia¹,², Andrea Ganna¹,²,³, Daniel P. Birnbaum¹,², Laura D. Gauthier², Harrison Brand¹,², Matthew Solomonson¹,², Nicholas A. Watts¹,², Daniel R. Rhodes⁴, Moriel Singer-Berk¹,², Eleina M. England¹,², Eleanor G. Seaby¹,², Jack A. Kosmicki¹,², Raymond K. Walters¹,², Katherine Tashman¹,², Yossi Farjoun², Eric Banks², Timothy Poterba¹,², Arcturus Wang¹,², Cotton Seed¹,², Nicola Whiffin, Jessica X. Chong⁵, Kaitlin E. Samocha⁶, Emma Pierce-Hoffman¹,², Zachary Zappala¹,²,⁷, Anne H. O’Donnell-Luria, Eric Vallabh Minikel², Ben Weisburd², Monkol Lek⁸, James S. Ware²,⁹, Christopher Vittal¹,², Irina M. Armean¹,², Louis Bergelson², Kristian Cibulskis², Kristen M. Connolly², Miguel Covarrubias², Stacey Donnelly², Steven Ferriera², Stacey Gabriel², Jeff Gentry², Namrata Gupta², Thibault Jeandet², Diane Kaplan², Christopher Llanwarne², Ruchi Munshi², Sam Novod², Nikelle Petrillo², David Roazen², Valentin Ruano-Rubio², Andrea Saltzman², Molly Schleicher², Jose Soto², Kathleen Tibbetts², Charlotte Tolonen², Gordon Wade², Michael E. Talkowski¹,², Benjamin M. Neale¹,², Mark J. Daly, Daniel G. MacArthur

Harvard University¹, Broad Institute², University of Helsinki³, Queen Mary University of London⁴, University of Washington⁵, Wellcome Trust Sanger Institute⁶, Vertex Pharmaceuticals⁷, Yale University⁸, National Institutes of Health⁹

03 Feb 2021-Nature

56 citations

Journal Article•DOI•

Addendum: The mutational constraint spectrum quantified from variation in 141,456 humans

Sanna Gudmundsson¹,²,³, Konrad J. Karczewski¹,², Laurent C. Francioli¹,², Grace Tiao¹,², Beryl B. Cummings¹,², Jessica Alföldi¹,², Qingbo Wang¹,², Ryan L. Collins¹,², Kristen M. Laricchia¹,², Andrea Ganna¹,²,⁴, Daniel P. Birnbaum¹,², Laura D. Gauthier¹, Harrison Brand¹,², Matthew Solomonson¹,², Nicholas A. Watts¹,², Daniel R. Rhodes⁵, Moriel Singer-Berk¹,², Eleina M. England¹,², Eleanor G. Seaby¹,², Jack A. Kosmicki¹,², Raymond K. Walters¹,², Katherine Tashman¹,², Yossi Farjoun¹, Eric Banks¹, Timothy Poterba¹,², Arcturus Wang¹,², Cotton Seed¹,², Nicola Whiffin, Jessica X. Chong⁶, Kaitlin E. Samocha⁷, Emma Pierce-Hoffman¹,², Zachary Zappala¹,²,⁸, Anne H. O’Donnell-Luria, Eric Vallabh Minikel¹, Ben Weisburd¹, Monkol Lek⁹, James S. Ware¹,¹⁰, Christopher Vittal¹,², Irina M. Armean¹,², Louis Bergelson¹, Kristian Cibulskis¹, Kristen M. Connolly¹, Miguel Covarrubias¹, Stacey Donnelly¹, Steven Ferriera¹, Stacey Gabriel¹, Jeff Gentry¹, Namrata Gupta¹, Thibault Jeandet¹, Diane Kaplan¹, Christopher Llanwarne¹, Ruchi Munshi¹, Sam Novod¹, Nikelle Petrillo¹, David Roazen¹, Valentin Ruano-Rubio¹, Andrea Saltzman¹, Molly Schleicher¹, Jose Soto¹, Kathleen Tibbetts¹, Charlotte Tolonen¹, Gordon Wade¹, Michael E. Talkowski¹,², Benjamin M. Neale¹,², Mark J. Daly, Daniel G. MacArthur

Broad Institute¹, Harvard University², Boston Children's Hospital³, University of Helsinki⁴, Queen Mary University of London⁵, University of Washington⁶, Wellcome Trust Sanger Institute⁷, Vertex Pharmaceuticals⁸, Yale University⁹, National Institutes of Health¹⁰

01 Sep 2021-Nature

30 citations

Posted Content•DOI•

A genome-wide mutational constraint map quantified from variation in 76,156 human genomes

Siwei Chen, Laurent C. Francioli, Julia K. Goodrich, Ryan L. Collins, Qingbo Wang, Jessica Alföldi, Nicholas A. Watts, Christopher Vittal, Laura D. Gauthier, Timothy Poterba, Michael D. Wilson, Yekaterina Tarasova, William Phu, Mary T. Yohannes, Zan Koenig, Yossi Farjoun, Eric Banks, Stacey Donnelly, Stacey Gabriel, Namrata Gupta, S. K. Ferriera, Charlotte Tolonen, Sam Novod, Louis Bergelson, David Roazen, Valentin Ruano-Rubio, Miguel Covarrubias, Chris Llanwarne, Nikelle Petrillo, Gordon Wade, Thibault Jeandet, Ruchi Munshi, Kathleen Tibbetts, Anne H. O’Donnell-Luria, Matthew Solomonson, Cotton Seed, Alicia R. Martin, Michael E. Talkowski, Heidi L. Rehm, Mark J. Daly, Grace Tiao, Benjamin M. Neale, Daniel G. MacArthur, Konrad J. Karczewski

21 Mar 2022-bioRxiv

TL;DR: It is demonstrated that this genome-wide constraint map provides an effective approach for characterizing the non-coding genome and improving the identification and interpretation of functional human genetic variation.

Abstract: The depletion of disruptive variation caused by purifying natural selection (constraint) has been widely used to investigate protein-coding genes underlying human disorders, but attempts to assess constraint for non-protein-coding regions have proven more difficult. Here we aggregate, process, and release a dataset of 76,156 human genomes from the Genome Aggregation Database (gnomAD), the largest public open-access human genome reference dataset, and use this dataset to build a mutational constraint map for the whole genome. We present a refined mutational model that incorporates local sequence context and regional genomic features to detect depletions of variation across the genome. As expected, protein-coding sequences overall are under stronger constraint than non-coding regions. Within the non-coding genome, constrained regions are enriched for regulatory elements and variants implicated in complex human diseases and traits, facilitating the triangulation of biological annotation, disease association, and natural selection to non-coding DNA analysis. More constrained regulatory elements tend to regulate more constrained genes, while non-coding constraint captures additional functional information underrecognized by gene constraint metrics. We demonstrate that this genome-wide constraint map provides an effective approach for characterizing the non-coding genome and improving the identification and interpretation of functional human genetic variation.

25 citations

Cited by

Journal Article•DOI•

The DisGeNET knowledge platform for disease genomics: 2019 update.

Janet Piñero¹, Juan Manuel Ramírez-Anguita¹, Josep Saüch-Pitarch¹, Francesco Ronzano¹, Emilio Centeno¹, Ferran Sanz¹, Laura I. Furlong¹

Pompeu Fabra University¹

04 Nov 2019-Nucleic Acids Research

TL;DR: The DisGeNET platform, a knowledge management platform integrating and standardizing data about disease associated genes and variants from multiple sources, is an interoperable resource supporting a variety of applications in genomic medicine and drug R&D.

Abstract: One of the most pressing challenges in genomic medicine is to understand the role played by genetic variation in health and disease. Thanks to the exploration of genomic variants at large scale, hundreds of thousands of disease-associated loci have been uncovered. However, the identification of variants of clinical relevance is a significant challenge that requires comprehensive interrogation of previous knowledge and linkage to new experimental results. To assist in this complex task, we created DisGeNET (http://www.disgenet.org/), a knowledge management platform integrating and standardizing data about disease associated genes and variants from multiple sources, including the scientific literature. DisGeNET covers the full spectrum of human diseases as well as normal and abnormal traits. The current release covers more than 24 000 diseases and traits, 17 000 genes and 117 000 genomic variants. The latest developments of DisGeNET include new sources of data, novel data attributes and prioritization metrics, a redesigned web interface and recently launched APIs. Thanks to the data standardization, the combination of expert curated information with data automatically mined from the scientific literature, and a suite of tools for accessing its publicly available data, DisGeNET is an interoperable resource supporting a variety of applications in genomic medicine and drug R&D.

1,183 citations

Journal Article•DOI•

Large-Scale Exome Sequencing Study Implicates Both Developmental and Functional Changes in the Neurobiology of Autism

F. Kyle Satterstrom¹,², Jack A. Kosmicki, Jiebiao Wang³, and 198 more authors (53 institutions)

06 Feb 2020-Cell

TL;DR: The largest exome sequencing study of autism spectrum disorder (ASD) to date, using an enhanced analytical framework to integrate de novo and case-control rare variation, identifies 102 risk genes at a false discovery rate of 0.1 or less, consistent with multiple paths to an excitatory-inhibitory imbalance underlying ASD.

1,169 citations

Journal Article•DOI•

Genetic mechanisms of critical illness in Covid-19.

Erola Pairo-Castineira¹,², Sara Clohisey², Lucija Klaric¹, and 1446 more authors (27 institutions)

04 Mar 2021-Nature

TL;DR: The GenOMICC (Genetics Of Mortality In Critical Care) genome-wide association study in 2244 critically ill Covid-19 patients from 208 UK intensive care units is reported, finding evidence in support of a causal link from low expression of IFNAR2, and high expression of TYK2, to life-threatening disease.

Abstract: Host-mediated lung inflammation is present¹, and drives mortality², in the critical illness caused by coronavirus disease 2019 (COVID-19). Host genetic variants associated with critical illness may identify mechanistic targets for therapeutic development³. Here we report the results of the GenOMICC (Genetics Of Mortality In Critical Care) genome-wide association study in 2,244 critically ill patients with COVID-19 from 208 UK intensive care units. We have identified and replicated the following new genome-wide significant associations: on chromosome 12q24.13 (rs10735079, P = 1.65 × 10⁻⁸) in a gene cluster that encodes antiviral restriction enzyme activators (OAS1, OAS2 and OAS3); on chromosome 19p13.2 (rs74956615, P = 2.3 × 10⁻⁸) near the gene that encodes tyrosine kinase 2 (TYK2); on chromosome 19p13.3 (rs2109069, P = 3.98 × 10⁻¹²) within the gene that encodes dipeptidyl peptidase 9 (DPP9); and on chromosome 21q22.1 (rs2236757, P = 4.99 × 10⁻⁸) in the interferon receptor gene IFNAR2. We identified potential targets for repurposing of licensed medications: using Mendelian randomization, we found evidence that low expression of IFNAR2, or high expression of TYK2, are associated with life-threatening disease; and transcriptome-wide association in lung tissue revealed that high expression of the monocyte–macrophage chemotactic receptor CCR2 is associated with severe COVID-19. Our results identify robust genetic signals relating to key host antiviral defence mechanisms and mediators of inflammatory organ damage in COVID-19. Both mechanisms may be amenable to targeted treatment with existing drugs. However, large-scale randomized clinical trials will be essential before any change to clinical practice. A genome-wide association study of critically ill patients with COVID-19 identifies genetic signals that relate to important host antiviral defence mechanisms and mediators of inflammatory organ damage that may be targeted by repurposing drug treatments.

941 citations

Journal Article•DOI•

Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program.

Daniel Taliun¹, Daniel N. Harris², Michael D. Kessler², Jedidiah Carlson¹, and 202 more authors (61 institutions)

10 Feb 2021-Nature

TL;DR: The Trans-Omics for Precision Medicine (TOPMed) programme aims to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases.

Abstract: The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes)¹. In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%. The goals, resources and design of the NHLBI Trans-Omics for Precision Medicine (TOPMed) programme are described, and analyses of rare variants detected in the first 53,831 samples provide insights into mutational processes and recent human evolutionary history.

801 citations

Posted Content•DOI•

Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences

Alexander Rives¹, Siddharth Goyal², Joshua Meier², Demi Guo², Myle Ott², C. Lawrence Zitnick², Jerry Ma², Rob Fergus¹,²

New York University¹, Facebook²

29 Apr 2019-bioRxiv

TL;DR: This work uses unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.

Abstract: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In biology, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Learning the natural distribution of evolutionary protein sequence variation is a logical step toward predictive and generative modeling for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million sequences spanning evolutionary diversity. The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge. The learned representation space organizes sequences at multiple levels of biological granularity from the biochemical to proteomic levels. Learning recovers information about protein structure: secondary structure and residue-residue contacts can be extracted by linear projections from learned representations. With small amounts of labeled data, the ability to identify tertiary contacts is further improved. Learning on full sequence diversity rather than individual protein families increases recoverable information about secondary structure. We show the networks generalize by adapting them to variant activity prediction from sequences only, with results that are comparable to a state-of-the-art variant predictor that uses evolutionary and structurally derived features.

748 citations
