Large-scale computational discovery and analysis of virus-derived microbial nanocompartments
Michael P. Andreas and Tobias W. Giessen*
Department of Biomedical Engineering, University of Michigan Medical School, Ann Arbor, MI, USA
Department of Biological Chemistry, University of Michigan Medical School, Ann Arbor, MI, USA
*Correspondence: tgiessen@umich.edu
bioRxiv preprint doi: https://doi.org/10.1101/2021.03.18.436031; this version posted March 18, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
Abstract
Protein compartments represent an important strategy for subcellular spatial control and compartmentalization. Encapsulins are a class of microbial protein compartments defined by the viral HK97-fold of their capsid protein, self-assembly into icosahedral shells, and a dedicated cargo-loading mechanism for sequestering specific enzymes. Encapsulins are often misannotated, and traditional sequence-based searches yield many false positive hits in the form of phage capsids. This has hampered progress in understanding the distribution and functional diversity of encapsulins. Here, we develop an integrated search strategy to carry out a large-scale computational analysis of prokaryotic genomes with the goal of discovering an exhaustive and curated set of all HK97-fold encapsulin-like systems. We report the discovery and analysis of over 6,000 encapsulin-like systems in 31 bacterial and 4 archaeal phyla, including two novel encapsulin families as well as many new operon types that fall within the two already known families. We formulate hypotheses about the biological functions and biomedical relevance of newly identified operons, which range from natural product biosynthesis and stress resistance to carbon metabolism and anaerobic hydrogen production. We conduct an evolutionary analysis of encapsulins and related HK97-type virus families and show that they share a common ancestor. We conclude that encapsulins likely evolved from HK97-type bacteriophages. Our study sheds new light on the evolutionary interplay of viruses and cellular organisms, the recruitment of protein folds for novel functions, and the functional diversity of microbial protein organelles.
Introduction
Spatial compartmentalization is a ubiquitous feature of biological systems.[1] In fact, biological entities like cells and viruses only exist because of the presence of a barrier that separates their interior from the environment. This concept of creating distinct spaces separate from their surroundings extends further to intracellular organization, with many layers of sub-compartmentalization found within most cells.[2,3] Intracellular compartments with a proteomically defined interior and a discrete boundary that fulfill distinct biochemical or physiological functions are generally referred to as organelles.[4] This includes lipid-bound organelles, phase-separated structures, and protein-based compartments. Distinguishing features between eukaryotic lipid-based and prokaryotic protein-based organelles include their size range – micro vs. nano scale – and the fact that protein organelle structure is genetically encoded and thus generally more defined. Still, compartmentalization, however it is achieved, can ultimately serve four distinct functions, namely, the creation of distinct reaction spaces and environments, storage, transport, and regulation.[4] Often, compartmentalization serves several of these functions at the same time. More specifically, the functions of intracellular compartments include sequestering toxic reactions and metabolites, creating distinct biochemical environments to stimulate enzyme or pathway activity, and dynamically storing nutrients for later use, among many others.[4]
Encapsulin nanocompartments, or simply encapsulins, are one of the most widespread and diverse classes of protein-based compartments.[5-7] So far, two families of encapsulins have been reported in a variety of bacterial and archaeal phyla.[8-10] They are proposed to be involved in oxidative stress resistance,[9,11-13] iron mineralization and storage,[14,15] anaerobic ammonium oxidation,[16] and sulfur metabolism.[8] All known encapsulins self-assemble from a single capsid protein into compartments between 24 and 42 nm in diameter with either T=1, T=3 or T=4 icosahedral symmetry.[10,12,15] Their defining feature is the ability to selectively encapsulate cargo proteins, which include ferritin-like proteins, hemerythrins, peroxidases and desulfurases.[8,9] In classical encapsulins (Family 1), encapsulation is mediated by short C-terminal peptide sequences referred to as targeting peptides (TPs) or cargo-loading peptides (CLPs),[10,15,17] while for Family 2 systems, larger N-terminal protein domains are proposed to mediate encapsulation.[8] For most encapsulin systems, little is known about the specific reasons for or functional consequences of enzyme encapsulation. Suggestions include the sequestration of toxic or reactive intermediates, the enhancement of enzyme activity, and the prevention of unwanted side reactions. One of the most intriguing features of encapsulins is that, in contrast to all other known protein-based compartments or organelles, their capsid monomer shares the HK97 phage-like fold.[10,12,15] This has led to the suggestion that encapsulins are derived from or in some way connected to the world of phages and viruses.[5,9]
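As background for the T=1, T=3 and T=4 symmetries mentioned above, the short sketch below works through the standard Caspar-Klug relation for icosahedral shells, T = h² + hk + k², with 60T subunits per shell. This relation is general capsid geometry and is not stated in the text; the function names and the (h, k) lattice indices chosen for each T number are illustrative assumptions.

```python
from typing import Dict, Tuple


def triangulation_number(h: int, k: int) -> int:
    """Caspar-Klug triangulation number T = h^2 + h*k + k^2."""
    if h < 0 or k < 0 or (h == 0 and k == 0):
        raise ValueError("h and k must be non-negative and not both zero")
    return h * h + h * k + k * k


def subunits_per_shell(t: int) -> int:
    """An icosahedral shell with triangulation number T has 60*T subunits."""
    return 60 * t


# Lattice indices (h, k) that give the three T numbers named in the text.
# The indices themselves are an assumption added for this illustration.
SHELL_TYPES: Dict[str, Tuple[int, int]] = {
    "T=1": (1, 0),
    "T=3": (1, 1),
    "T=4": (2, 0),
}


def shell_table() -> Dict[str, int]:
    """Map each shell type to its expected number of capsid subunits."""
    return {
        label: subunits_per_shell(triangulation_number(h, k))
        for label, (h, k) in SHELL_TYPES.items()
    }
```

Under this relation, `shell_table()` gives 60, 180 and 240 capsid protein copies for T=1, T=3 and T=4 shells, respectively.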
Here, we carry out a large-scale, in-depth computational analysis of prokaryotic genomes with the goal of discovering and classifying an exhaustive set of all HK97-type protein organelle systems. We develop a Hidden Markov Model (HMM)-, Pfam family-, and genome neighborhood analysis (GNA)-based search strategy and substantially expand the number of identified encapsulin-like operons. We report the discovery and analysis of two novel encapsulin families (Family 3 and Family 4) as well as many new operon types that fall within Family 1 and Family 2. We formulate data-driven hypotheses about the potential biological functions of newly identified operons, which will guide future experimental studies of encapsulin-like systems. Further, we conduct a detailed evolutionary analysis of encapsulin-like systems and related HK97-type virus families and show that encapsulins and HK97-type viruses share a common ancestor and that encapsulins likely evolved from HK97-type phages. Our study sheds new light on the evolutionary interplay of viruses and cellular organisms, the recruitment of protein folds for novel functions, and the functional diversity of microbial protein organelles.
Results and Discussion
Distribution, diversity, and classification of encapsulin systems found in prokaryotes
Fig. 1. Distribution of encapsulin-like systems in prokaryotes. Left: Phylogenetic tree based on 108 of the major archaeal and bacterial phyla.[21] Phyla containing encapsulin-like systems are highlighted in blue. Differently colored dots indicate the presence of the respective encapsulin family within the phylum. Right: List of phyla discovered to encode encapsulin-like systems. The Count column shows the number of identified systems and the total number of proteomes available in UniProt (# systems identified / # UniProt proteomes). Ca. refers to candidate phyla. Phylum names colored red show new phyla or uncultured/unclassified organisms not shown in the phylogenetic tree. *Ca. Modulibacteria is not an annotated phylum in UniProt but has been proposed as a candidate phylum.[22]

All bacterial and archaeal proteomes available in the UniProtKB[18] database (Family 1, 2, and 4: March 2020; Family 3: February 2021) were analyzed for the presence of encapsulin-like proteins using an HMM-based search strategy. We found that all Pfam families associated with initial search hits belong to a single Pfam clan (CL0373)[19] encompassing the majority of HK97-fold proteins catalogued in the Pfam database. Thus, we supplemented our initial hit dataset with all sequences associated with CL0373. This was followed by GNA-based curation[20] of the expanded dataset to remove all false positives, primarily phage genomes, resulting in a curated list of 6,133 encapsulin-like proteins (Fig. 1 and Supplementary Data 1). Encapsulin-like systems can be found in 31 bacterial and 4 archaeal phyla. Based on the sequence similarity and Pfam family membership of identified capsid proteins, and the genome-neighborhood composition of associated operons, encapsulin-like systems could be classified into 4 distinct families (Fig. 2). Families 1 and 2 represent previously identified encapsulin operon types containing capsid proteins falsely annotated as bacteriocin (PF04454: Linocin_M18) and transcriptional regulator/membrane protein (no Pfam), respectively. Family 1 will be referred to as Classical Encapsulins because they were the first discovered and are the best characterized. Families 3 and 4 represent newly discovered systems. Family 3 encapsulins are falsely annotated as phage major capsid protein (PF05065: Phage_capsid) and are found embedded within large biosynthetic gene clusters (BGCs) encoding different peptide-based natural products. Therefore, Family 3 was dubbed Natural Product Encapsulins. Family 4 is characterized by a highly truncated encapsulin-like capsid protein which is generally annotated as an uncharacterized protein (PF08967: DUF1884) and arranged in conserved two-component operons with different enzymes. Family 4 proteins represent the A-domain of the canonical HK97-fold, with all other domains usually associated with this fold missing. Thus, Family 4 will be referred to as A-domain Encapsulins.
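The search and classification logic described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' pipeline: the `Candidate` record, the `PHAGE_MARKERS` rule standing in for GNA-based curation, and all accessions and neighbor annotations are hypothetical. Only the Pfam accessions and family names come from the text, and mapping "no Pfam" directly to Family 2 is a simplification, since the actual classification also draws on sequence similarity and operon composition.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional

# Pfam accessions named in the text for each family's capsid annotation.
PFAM_TO_FAMILY: Dict[str, str] = {
    "PF04454": "Family 1",  # Linocin_M18, Classical Encapsulins
    "PF05065": "Family 3",  # Phage_capsid, Natural Product Encapsulins
    "PF08967": "Family 4",  # DUF1884, A-domain Encapsulins
}

# Hypothetical stand-in for GNA curation: neighbor annotations that we
# treat as signs of a phage genome. Not the authors' actual criteria.
PHAGE_MARKERS: FrozenSet[str] = frozenset({"terminase", "portal", "tail"})


@dataclass(frozen=True)
class Candidate:
    accession: str                          # hypothetical protein identifier
    pfam: Optional[str]                     # Pfam accession of the hit, if any
    neighbors: FrozenSet[str] = frozenset() # annotations of nearby genes


def merge_hits(hmm_hits: Iterable[Candidate],
               clan_members: Iterable[Candidate]) -> List[Candidate]:
    """Union of HMM hits and clan members, deduplicated by accession."""
    merged = {c.accession: c for c in clan_members}
    merged.update({c.accession: c for c in hmm_hits})
    return list(merged.values())


def curate(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Drop candidates whose genome neighborhood looks phage-like."""
    return [c for c in candidates if not (c.neighbors & PHAGE_MARKERS)]


def classify(candidate: Candidate) -> str:
    """Assign a family from the Pfam annotation (simplified rule)."""
    if candidate.pfam is None:
        return "Family 2"  # text: Family 2 capsids carry no Pfam annotation
    return PFAM_TO_FAMILY.get(candidate.pfam, "unclassified")


def demo() -> Dict[str, str]:
    """Run the toy pipeline on made-up records."""
    hmm_hits = [
        Candidate("A1", "PF04454", frozenset({"ferritin-like protein"})),
        Candidate("B2", None, frozenset({"enzyme"})),
    ]
    clan_members = [
        Candidate("A1", "PF04454", frozenset({"ferritin-like protein"})),
        Candidate("C3", "PF05065", frozenset({"terminase", "portal"})),
        Candidate("D4", "PF08967", frozenset({"enzyme"})),
    ]
    kept = curate(merge_hits(hmm_hits, clan_members))
    return {c.accession: classify(c) for c in kept}
```

In this toy run, the record `C3` carries the same Pfam annotation as Family 3 capsids but is removed because its neighborhood contains phage-like genes, which illustrates why Pfam membership alone cannot separate encapsulins from phage capsids.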
Classical Encapsulins (Family 1) represent the most widespread family of encapsulin-like systems. They occur in 31 of the 35 prokaryotic phyla found to encode encapsulin-like operons (Fig. 1). In total, 2,383 Classical Encapsulin operons were discovered, with the phyla Proteobacteria, Actinobacteria and Firmicutes containing the majority of identified systems. However, it should be noted that these phyla
Fig. 2. Novel classification scheme for encapsulin-like operons. Shown are the 4 newly defined families of encapsulins with the respective Pfam annotations if available. Encapsulin-like capsid components are shown in red. Confirmed and proposed cargo proteins are shown in blue. Non-cargo accessory components are shown in grey. The number of identified systems of a given family is shown after the operon in red (I, # identified) and the number of distinct cargo types is shown in cyan (CT, # cargo types). Dotted lines indicate optional presence of operon components. cNMP: cyclic nucleotide-binding domain (orange). Enc: encapsulin-like capsid component. BGC: biosynthetic gene cluster.