
A new view of the tree of life

TL;DR: New genomic data from over 1,000 uncultivated and little known organisms, together with published sequences, are used to infer a dramatically expanded version of the tree of life, with Bacteria, Archaea and Eukarya included.
Abstract: The tree of life is one of the most important organizing principles in biology1. Gene surveys suggest the existence of an enormous number of branches2, but even an approximation of the full scale of the tree has remained elusive. Recent depictions of the tree of life have focused either on the nature of deep evolutionary relationships3–5 or on the known, well-classified diversity of life with an emphasis on eukaryotes6. These approaches overlook the dramatic change in our understanding of life's diversity resulting from genomic sampling of previously unexamined environments. New methods to generate genome sequences illuminate the identity of organisms and their metabolic capacities, placing them in community and ecosystem contexts7,8. Here, we use new genomic data from over 1,000 uncultivated and little known organisms, together with published sequences, to infer a dramatically expanded version of the tree of life, with Bacteria, Archaea and Eukarya included. The depiction is both a global overview and a snapshot of the diversity within each major lineage. The results reveal the dominance of bacterial diversification and underline the importance of organisms lacking isolated representatives, with substantial evolution concentrated in a major radiation of such organisms. This tree highlights major lineages currently underrepresented in biogeochemical models and identifies radiations that are probably important for future evolutionary analyses. An update to the ‘tree of life’ has revealed a dominance of bacterial diversity in many ecosystems and extensive evolution in some branches of the tree. It also highlights how few organisms we have been able to cultivate for further investigation.


UC Berkeley
UC Berkeley Previously Published Works
Title
A new view of the tree of life.
Permalink
https://escholarship.org/uc/item/65h6718x
Journal
Nature microbiology, 1(5)
ISSN
2058-5276
Authors
Hug, Laura A
Baker, Brett J
Anantharaman, Karthik
et al.
Publication Date
2016-04-01
DOI
10.1038/nmicrobiol.2016.48
Peer reviewed

A new view of the tree of life
Laura A. Hug1, Brett J. Baker2, Karthik Anantharaman1, Christopher T. Brown3, Alexander J. Probst1, Cindy J. Castelle1, Cristina N. Butterfield1, Alex W. Hernsdorf3, Yuki Amano4, Kotaro Ise4, Yohey Suzuki5, Natasha Dudek6, David A. Relman7,8, Kari M. Finstad9, Ronald Amundson9, Brian C. Thomas1 and Jillian F. Banfield1,9*
The tree of life is one of the most important organizing principles in biology1. Gene surveys suggest the existence of an enormous number of branches2, but even an approximation of the full scale of the tree has remained elusive. Recent depictions of the tree of life have focused either on the nature of deep evolutionary relationships3–5 or on the known, well-classified diversity of life with an emphasis on eukaryotes6. These approaches overlook the dramatic change in our understanding of life's diversity resulting from genomic sampling of previously unexamined environments. New methods to generate genome sequences illuminate the identity of organisms and their metabolic capacities, placing them in community and ecosystem contexts7,8. Here, we use new genomic data from over 1,000 uncultivated and little known organisms, together with published sequences, to infer a dramatically expanded version of the tree of life, with Bacteria, Archaea and Eukarya included. The depiction is both a global overview and a snapshot of the diversity within each major lineage. The results reveal the dominance of bacterial diversification and underline the importance of organisms lacking isolated representatives, with substantial evolution concentrated in a major radiation of such organisms. This tree highlights major lineages currently underrepresented in biogeochemical models and identifies radiations that are probably important for future evolutionary analyses.
Early approaches to describe the tree of life distinguished organisms based on their physical characteristics and metabolic features. Molecular methods dramatically broadened the diversity that could be included in the tree because they circumvented the need for direct observation and experimentation by relying on sequenced genes as markers for lineages. Gene surveys, typically using the small subunit ribosomal RNA (SSU rRNA) gene, provided a remarkable and novel view of the biological world1,9,10, but questions about the structure and extent of diversity remain. Organisms from novel lineages have eluded surveys, because many are invisible to these methods due to sequence divergence relative to the primers commonly used for gene amplification7,11. Furthermore, unusual sequences, including those with unexpected insertions, may be discarded as artefacts7.
Whole genome reconstruction was first accomplished in 1995 (ref. 12), with a near-exponential increase in the number of draft genomes reported each subsequent year. There are 30,437 genomes from all three domains of life (Bacteria, Archaea and Eukarya), which are currently available in the Joint Genome Institute's Integrated Microbial Genomes database (accessed 24 September 2015). Contributing to this expansion in genome numbers are single cell genomics13 and metagenomics studies. Metagenomics is a shotgun sequencing-based method in which DNA isolated directly from the environment is sequenced, and the reconstructed genome fragments are assigned to draft genomes14. New bioinformatics methods yield complete and near-complete genome sequences, without a reliance on cultivation or reference genomes7,15. These genome- (rather than gene) based approaches provide information about metabolic potential and a variety of phylogenetically informative sequences that can be used to classify organisms16. Here, we have constructed a tree of life by making use of genomes from public databases and 1,011 newly reconstructed genomes that we recovered from a variety of environments (see Methods).
To render this tree of life, we aligned and concatenated a set of 16 ribosomal protein sequences from each organism. This approach yields a higher-resolution tree than is obtained from a single gene, such as the widely used 16S rRNA gene16. The use of ribosomal proteins avoids artefacts that would arise from phylogenies constructed using genes with unrelated functions and subject to different evolutionary processes. Another important advantage of the chosen ribosomal proteins is that they tend to be syntenic and co-located in a small genomic region in Bacteria and Archaea, reducing binning errors that could substantially perturb the geometry of the tree. Included in this tree is one representative per genus for all genera for which high-quality draft and complete genomes exist (3,083 organisms in total).
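The concatenation step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `concatenate_alignments`, the toy sequences and the choice to pad a taxon that lacks a gene with gap characters are all assumptions made for illustration.

```python
from typing import Dict, List


def concatenate_alignments(gene_alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Concatenate per-gene alignments into one supermatrix.

    Each input maps taxon name -> aligned sequence; within one gene all
    sequences must have the same length. A taxon missing from a gene is
    padded with gap characters for that gene (an assumption of this sketch).
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene alignment differ in length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy data: two hypothetical genes and three hypothetical taxa.
gene_a = {"taxon1": "MKV-", "taxon2": "MRVA", "taxon3": "MKIA"}
gene_b = {"taxon1": "GGS", "taxon3": "G-S"}
combined = concatenate_alignments([gene_a, gene_b])
```

Every taxon ends up with a sequence of the same total length, which is what tree-inference software expects from a concatenated alignment.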
Despite the methodological challenges, we have included representatives of all three domains of life. Our primary focus relates to the status of Bacteria and Archaea, as these organisms have been most difficult to profile using macroscopic approaches, and substantial progress has been made recently with acquisition of new genome sequences7,8,13. The placement of Eukarya relative to Bacteria and Archaea is controversial1,4,5,17,18. Eukaryotes are believed to be evolutionary chimaeras that arose via endosymbiotic fusion, probably involving bacterial and archaeal cells19. Here, we do not attempt to confidently resolve the placement of the Eukarya. We position them using sequences of a subset of their nuclear-encoded ribosomal proteins, an approach that classifies them based on the inheritance of their information systems as opposed to lipid or other cellular structures5.
Figure 1 presents a new view of the tree of life. This is one of a relatively small number of three-domain trees constructed from molecular information so far, and the first comprehensive tree to
1Department of Earth and Planetary Science, UC Berkeley, Berkeley, California 94720, USA.
2Department of Marine Science, University of Texas Austin, Port Aransas, Texas 78373, USA.
3Department of Plant and Microbial Biology, UC Berkeley, Berkeley, California 94720, USA.
4Sector of Decommissioning and Radioactive Wastes Management, Japan Atomic Energy Agency, Ibaraki 319-1184, Japan.
5Graduate School of Science, The University of Tokyo, Tokyo 113-8654, Japan.
6Department of Ecology and Evolutionary Biology, UC Santa Cruz, Santa Cruz, California 95064, USA.
7Departments of Medicine and of Microbiology and Immunology, Stanford University, Stanford, California 94305, USA.
8Veterans Affairs Palo Alto Health Care System, Palo Alto, California 94304, USA.
9Department of Environmental Science, Policy, and Management, UC Berkeley, Berkeley, California 94720, USA.
Present address: Department of Biology, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada.
*e-mail: jbanfield@berkeley.edu
LETTERS
PUBLISHED: 11 APRIL 2016 |
ARTICLE NUMBER: 16048 | DOI: 10.1038/NMICROBIOL.2016.48
OPEN
NATURE MICROBIOLOGY | VOL 1 | MAY 2016 | www.nature.com/naturemicrobiology 1
© 2016 Macmillan Publishers Limited. All rights reserved

[Figure 1: tree image not reproduced. Scale bar: 0.4. Legend: major lineages lacking an isolated representative are marked with a red dot; major lineages with an isolated representative are named in italics. Labelled groups include Bacteria (with the Candidate Phyla Radiation, Microgenomates, Parcubacteria, Cyanobacteria, Melainabacteria and the PVC superphylum), Archaea (with DPANN and TACK) and Eukaryotes (Opisthokonta, Amoebozoa, Chromalveolata, Archaeplastida, Excavata).]
Figure 1 | A current view of the tree of life, encompassing the total diversity represented by sequenced genomes. The tree includes 92 named bacterial phyla, 26 archaeal phyla and all five of the Eukaryotic supergroups. Major lineages are assigned arbitrary colours and named, with well-characterized lineage names, in italics. Lineages lacking an isolated representative are highlighted with non-italicized names and red dots. For details on taxon sampling and tree inference, see Methods. The names Tenericutes and Thermodesulfobacteria are bracketed to indicate that these lineages branch within the Firmicutes and the Deltaproteobacteria, respectively. Eukaryotic supergroups are noted, but not otherwise delineated due to the low resolution of these lineages. The CPR phyla are assigned a single colour as they are composed entirely of organisms without isolated representatives, and are still in the process of definition at lower taxonomic levels. The complete ribosomal protein tree is available in rectangular format with full bootstrap values as Supplementary Fig. 1 and in Newick format in Supplementary Dataset 2.

[Figure 2: tree image not reproduced. Scale bar: 0.2. Node legend: bootstrap ≥ 85%; 85% > bootstrap ≥ 50%. Labelled domains: Bacteria (including the Candidate Phyla Radiation), Archaea and Eukaryotes.]
Figure 2 | A reformatted view of the tree in Fig. 1 in which each major lineage represents the same amount of evolutionary distance. The threshold for groups (coloured wedges) was an average branch length of <0.65 substitutions per site. Notably, some well-accepted phyla become single groups and others are split into multiple distinct groups. We undertook this analysis to provide perspective on the structure of the tree, and do not propose the resulting groups to have special taxonomic status. The massive scale of diversity in the CPR and the large fraction of major lineages that lack isolated representatives (red dots) are apparent from this analysis. Bootstrap support values are indicated by circles on nodes: black for support of 85% and above, grey for support from 50 to 84%. The complete ribosomal protein tree is available in rectangular format with full bootstrap values as Supplementary Fig. 1 and in Newick format in Supplementary Dataset 2.

be published since the development of genome-resolved metagenomics. We highlight all major lineages with genomic representation, most of which are phylum-level branches (see Supplementary Fig. 1 for full bootstrap support values). However, we separately identify the Classes of the Proteobacteria, because the phylum is not monophyletic (for example, the Deltaproteobacteria branch away from the other Proteobacteria, as previously reported2,20).
The tree in Fig. 1 recapitulates expected organism groupings at most taxonomic levels and is largely congruent with the tree calculated using traditional SSU rRNA gene sequence information (Supplementary Fig. 2). The support values for taxonomic groups are strong at the Species through Class levels (>85%), with moderate-to-strong support for Phyla (>75% in most cases), but the branching order of the deepest branches cannot be confidently resolved (Supplementary Fig. 1). The lower support for deep branch placements is a consequence of our prioritization of taxon sampling over number of genes used for tree construction. As proposed recently, the Eukarya, a group that includes protists, fungi, plants and animals, branches within the Archaea, specifically within the TACK superphylum21 and sibling to the Lokiarchaeota22. Interestingly, this placement is not evident in the SSU rRNA tree, which has the three-domain topology proposed by Woese and co-workers in 1990 (ref. 1) (Supplementary Fig. 2). The two-domain Eocyte tree and the three-domain tree are competing hypotheses for the origin of Eukarya5; further analyses to resolve these and other deep relationships will be strengthened with the availability of genomes for a greater diversity of organisms. Important advantages of the ribosomal protein tree compared with the SSU rRNA gene tree are that it includes organisms with incomplete or unavailable SSU rRNA gene sequences and more strongly resolves the deeper radiations. Ribosomal proteins have been shown to contain compositional biases across the three domains, driven by thermophilic, mesophilic and halophilic lifestyles as well as by a primitive genetic code23. Continued expansion of the number of genome sequences for non-extremophile Archaea, such as the DPANN lineages8,13, may allow clarification of these compositional biases.
A striking feature of this tree is the large number of major lineages without isolated representatives (red dots in Fig. 1). Many of these lineages are clustered together into discrete regions of the tree. Of particular note is the Candidate Phyla Radiation (CPR)7, highlighted in purple in Fig. 1. Based on information available from hundreds of genomes from genome-resolved metagenomics and single-cell genomics methods to date, all members have relatively small genomes and most have somewhat (if not highly) restricted metabolic capacities7,13,24. Many are inferred (and some have been shown) to be symbionts7,25,26. Thus far, all cells lack complete citric acid cycles and respiratory chains and most have limited or no ability to synthesize nucleotides and amino acids. It remains unclear whether these reduced metabolisms are a consequence of superphylum-wide loss of capacities or if these are inherited characteristics that hint at an early metabolic platform for life. If inherited, then adoption of symbiotic lifestyles may have been a later innovation by these organisms once more complex organisms appeared.
Figure 2 presents another perspective, where the major lineages of the tree are defined using evolutionary distance, so that the main groups become apparent without bias arising from historical naming conventions. This depiction uses the same inferred tree as in Fig. 1, but with groups defined on the basis of average branch length to the leaf taxa. We chose an average branch length that best recapitulated the current taxonomy (smaller values fragmented many currently accepted phyla and larger values collapsed accepted phyla into very few lineages, see Methods). Evident in Fig. 2 is the enormous extent of evolution that has occurred within the CPR. The diversity within the CPR could be a result of the early emergence of this group and/or a consequence of rapid evolution related to symbiotic lifestyles. The CPR is early-emerging on the ribosomal protein tree (Fig. 1), but not in the SSU rRNA tree (Supplementary Fig. 2). Regardless of branching order, the CPR, in combination with other lineages that lack isolated representatives (red dots in Fig. 2), clearly comprises the majority of life's current diversity.
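One plausible reading of grouping by average branch length to the leaf taxa is sketched below in Python. The text does not give the exact procedure, so the top-down traversal, the `Node` class, the function names and the toy tree are all assumptions: a clade is collapsed into a single group as soon as the mean distance from its root to its leaves falls below the threshold.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str = ""
    length: float = 0.0  # length of the branch leading to this node
    children: List["Node"] = field(default_factory=list)


def leaf_distances(node: Node) -> List[float]:
    """Distances from `node` to each descendant leaf (0.0 for a leaf itself)."""
    if not node.children:
        return [0.0]
    return [child.length + d for child in node.children for d in leaf_distances(child)]


def leaf_names(node: Node) -> List[str]:
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaf_names(child)]


def group_clades(node: Node, threshold: float) -> List[List[str]]:
    """Walk down from the root; collapse a clade once its mean
    root-to-leaf distance is below `threshold`, otherwise split it."""
    dists = leaf_distances(node)
    if not node.children or sum(dists) / len(dists) < threshold:
        return [leaf_names(node)]
    groups: List[List[str]] = []
    for child in node.children:
        groups.extend(group_clades(child, threshold))
    return groups


# Toy tree with hypothetical taxa: ((a:0.1, b:0.1):0.5, c:1.0)
toy = Node(children=[
    Node(length=0.5, children=[Node("a", 0.1), Node("b", 0.1)]),
    Node("c", 1.0),
])
groups_at_065 = group_clades(toy, threshold=0.65)
groups_at_080 = group_clades(toy, threshold=0.80)
```

In the toy tree the mean root-to-leaf distance is about 0.73, so a threshold of 0.65 splits the root into two groups while 0.80 keeps everything together. This mirrors the trade-off described in the text, where smaller values fragment accepted phyla and larger values merge them. The recursion is kept simple for clarity and would need an iterative rewrite for a tree with thousands of taxa.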
Domain Bacteria includes more major lineages of organisms than the other Domains. We do not attribute the smaller scope of the Archaea relative to Bacteria to sampling bias because metagenomics and single-cell genomics methods detect members of both domains equally well. Consistent with this view, Archaea are less prominent and less diverse in many ecosystems (for example, seawater27, hydrothermal vents28, the terrestrial subsurface15 and human-associated microbiomes29). The lower apparent phylogenetic diversity of Eukarya is fully expected, based on their comparatively recent evolution.
The tree of life as we know it has dramatically expanded due to new genomic sampling of previously enigmatic or unknown microbial lineages. This depiction of the tree captures the current genomic sampling of life, illustrating the progress that has been made in the last two decades following the first published genome. What emerges from analysis of this tree is the depth of evolutionary history that is contained within the Bacteria, in part due to the CPR, which appears to subdivide the domain. Most importantly, the analysis highlights the large fraction of diversity that is currently only accessible via cultivation-independent genome-resolved approaches.
Methods
A data set comprehensively covering the three domains of life was generated using publicly available genomes from the Joint Genome Institute's IMG-M database (img.jgi.doe.gov), a previously developed data set of eukaryotic genome information30, previously published genomes derived from metagenomic data sets7,8,31,32 and newly reconstructed genomes from current metagenome projects (see Supplementary Table 1 for NCBI accession numbers). From IMG-M, genomes were sampled such that a single representative for each defined genus was selected. For phyla and candidate phyla lacking full taxonomic definition, every member of the phylum was initially included. Subsequently, these radiations were sampled to an approximate genus level of divergence based on comparison with taxonomically described phyla, thus removing strain- and species-level overlaps. Finally, initial tree reconstructions identified aberrant long-branch attraction effects placing the Microsporidia, a group of parasitic fungi, with the Korarchaeota. The Microsporidia are known to contribute long branch attraction artefacts confounding placement of the Eukarya33, and were subsequently removed from the analysis.
This study includes 1,011 organisms from lineages for which genomes were not previously available. The organisms were present in samples collected from a shallow aquifer system, a deep subsurface research site in Japan, a salt crust in the Atacama Desert, grassland meadow soil in northern California, a CO2-rich geyser system, and two dolphin mouths. Genomes were reconstructed from metagenomes as described previously7. Genomes were only included if they were estimated to be >70% complete based on presence/absence of a suite of 51 single copy genes for Bacteria and 38 single copy genes for Archaea. Genomes were additionally required to have consistent nucleotide composition and coverage across scaffolds, as determined using the ggkbase binning software (ggkbase.berkeley.edu), and to show consistent placement across both SSU rRNA and concatenated ribosomal protein phylogenies. This contributed marker gene information for 1,011 newly sampled organisms, whose genomes were reconstructed for metabolic analyses to be published separately.
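The completeness criterion can be illustrated with a short Python sketch. It assumes that completeness is simply the fraction of marker genes detected at least once, which the text implies ("presence/absence") but does not spell out. The function names and the ten-gene toy marker set are hypothetical and stand in for the 51 bacterial or 38 archaeal single copy genes.

```python
from typing import Iterable, Set


def estimate_completeness(found_genes: Iterable[str], marker_set: Set[str]) -> float:
    """Fraction of marker genes detected at least once in a genome bin."""
    if not marker_set:
        raise ValueError("marker set must not be empty")
    present = marker_set.intersection(found_genes)
    return len(present) / len(marker_set)


def passes_completeness(found_genes: Iterable[str], marker_set: Set[str],
                        minimum: float = 0.70) -> bool:
    """Apply the '>70% complete' rule (strictly greater than the minimum)."""
    return estimate_completeness(found_genes, marker_set) > minimum


# Toy marker set of ten hypothetical genes.
markers = {f"marker{i}" for i in range(10)}
bin_with_8 = [f"marker{i}" for i in range(8)] + ["unrelated_gene"]
bin_with_7 = [f"marker{i}" for i in range(7)]
```

Genes outside the marker set are ignored, and a bin sitting exactly at 70% is rejected because the stated criterion is a strict inequality.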
The concatenated ribosomal protein alignment was constructed as described previously16. In brief, the 16 ribosomal protein data sets (ribosomal proteins L2, L3, L4, L5, L6, L14, L16, L18, L22, L24, S3, S8, S10, S17 and S19) were aligned independently using MUSCLE v. 3.8.31 (ref. 34). Alignments were trimmed to remove ambiguously aligned C and N termini as well as columns composed of more than 95% gaps. Taxa were removed if their available sequence data represented less than 50% of the expected alignment columns (90% of taxa had more than 80% of the expected alignment columns). The 16 alignments were concatenated, forming a final alignment comprising 3,083 genomes and 2,596 amino-acid positions. A maximum likelihood tree was constructed using RAxML v. 8.1.24 (ref. 35), as implemented on the CIPRES web server36, under the LG plus gamma model of evolution (PROTGAMMALG in the RAxML model section), and with the number of bootstraps automatically determined (MRE-based bootstopping criterion). A total of 156 bootstrap replicates were conducted under the rapid bootstrapping algorithm, with 100 sampled to generate proportional support values. The full tree inference required 3,840 computational hours on the CIPRES supercomputer.
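The two filtering rules stated above (dropping columns with more than 95% gaps, then dropping taxa covering less than 50% of the columns) can be sketched in Python as follows. This is an illustration, not the authors' code: the function names and toy alignment are hypothetical, the removal of ambiguously aligned termini is not modelled, and applying the taxon filter to an already trimmed alignment is an assumption about the order of operations.

```python
from typing import Dict


def drop_gappy_columns(alignment: Dict[str, str],
                       max_gap_fraction: float = 0.95) -> Dict[str, str]:
    """Remove columns whose gap fraction exceeds `max_gap_fraction`."""
    if not alignment:
        return {}
    seqs = list(alignment.values())
    n_taxa, width = len(seqs), len(seqs[0])
    keep = [
        col for col in range(width)
        if sum(seq[col] == "-" for seq in seqs) / n_taxa <= max_gap_fraction
    ]
    return {taxon: "".join(seq[c] for c in keep) for taxon, seq in alignment.items()}


def drop_sparse_taxa(alignment: Dict[str, str],
                     min_coverage: float = 0.50) -> Dict[str, str]:
    """Keep taxa whose non-gap residues cover at least `min_coverage` of columns."""
    return {
        taxon: seq for taxon, seq in alignment.items()
        if seq and (len(seq) - seq.count("-")) / len(seq) >= min_coverage
    }


# Toy alignment with four hypothetical taxa; the third column is all gaps.
toy_alignment = {
    "t1": "MK-AV",
    "t2": "MR-A-",
    "t3": "M--AV",
    "t4": "---A-",
}
trimmed = drop_gappy_columns(toy_alignment)
filtered = drop_sparse_taxa(trimmed)
```

With only four taxa, a column exceeds the 95% gap limit only when it is entirely gaps, so just the third column is removed here. Taxon `t4` then covers one of four remaining columns and is dropped.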

Citations
More filters
Journal ArticleDOI
TL;DR: This work used a concatenated protein phylogeny as the basis for a bacterial taxonomy that conservatively removes polyphyletic groups and normalizes taxonomic ranks on the basis of relative evolutionary divergence.
Abstract: Taxonomy is an organizing principle of biology and is ideally based on evolutionary relationships among organisms. Development of a robust bacterial taxonomy has been hindered by an inability to obtain most bacteria in pure culture and, to a lesser extent, by the historical use of phenotypes to guide classification. Culture-independent sequencing technologies have matured sufficiently that a comprehensive genome-based taxonomy is now possible. We used a concatenated protein phylogeny as the basis for a bacterial taxonomy that conservatively removes polyphyletic groups and normalizes taxonomic ranks on the basis of relative evolutionary divergence. Under this approach, 58% of the 94,759 genomes comprising the Genome Taxonomy Database had changes to their existing taxonomy. This result includes the description of 99 phyla, including six major monophyletic units from the subdivision of the Proteobacteria, and amalgamation of the Candidate Phyla Radiation into a single phylum. Our taxonomy should enable improved classification of uncultured bacteria and provide a sound basis for ecological and evolutionary studies.

2,098 citations

Journal ArticleDOI
01 Nov 2017-Nature
TL;DR: A meta-analysis of microbial community samples collected by hundreds of researchers for the Earth Microbiome Project is presented, creating both a reference database giving global context to DNA sequence data and a framework for incorporating data from future studies, fostering increasingly complete characterization of Earth’s microbial diversity.
Abstract: Our growing awareness of the microbial world’s importance and diversity contrasts starkly with our limited understanding of its fundamental structure. Despite recent advances in DNA sequencing, a lack of standardized protocols and common analytical frameworks impedes comparisons among studies, hindering the development of global inferences about microbial life on Earth. Here we present a meta-analysis of microbial community samples collected by hundreds of researchers for the Earth Microbiome Project. Coordinated protocols and new analytical methods, particularly the use of exact sequences instead of clustered operational taxonomic units, enable bacterial and archaeal ribosomal RNA gene sequences to be followed across multiple studies and allow us to explore patterns of diversity at an unprecedented scale. The result is both a reference database giving global context to DNA sequence data and a framework for incorporating data from future studies, fostering increasingly complete characterization of Earth’s microbial diversity.

1,676 citations

Journal ArticleDOI
TL;DR: The recovery of 7,903 bacterial and archaeal metagenome-assembled genomes increases the phylogenetic diversity represented by public genome repositories and provides the first representatives from 20 candidate phyla.
Abstract: Challenges in cultivating microorganisms have limited the phylogenetic diversity of currently available microbial genomes. This is being addressed by advances in sequencing throughput and computational techniques that allow for the cultivation-independent recovery of genomes from metagenomes. Here, we report the reconstruction of 7,903 bacterial and archaeal genomes from >1,500 public metagenomes. All genomes are estimated to be ≥50% complete and nearly half are ≥90% complete with ≤5% contamination. These genomes increase the phylogenetic diversity of bacterial and archaeal genome trees by >30% and provide the first representatives of 17 bacterial and three archaeal candidate phyla. We also recovered 245 genomes from the Patescibacteria superphylum (also known as the Candidate Phyla Radiation) and find that the relative diversity of this group varies substantially with different protein marker sets. The scale and quality of this data set demonstrate that recovering genomes from metagenomes provides an expedient path forward to exploring microbial dark matter.

1,248 citations

Posted Content
TL;DR: The OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains, ranging from social and information networks to biological networks, molecular graphs, source code ASTs, and knowledge graphs, indicating fruitful opportunities for future research.
Abstract: We present the Open Graph Benchmark (OGB), a diverse set of challenging and realistic benchmark datasets to facilitate scalable, robust, and reproducible graph machine learning (ML) research. OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains, ranging from social and information networks to biological networks, molecular graphs, source code ASTs, and knowledge graphs. For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics. In addition to building the datasets, we also perform extensive benchmark experiments for each dataset. Our experiments suggest that OGB datasets present significant challenges of scalability to large-scale graphs and out-of-distribution generalization under realistic data splits, indicating fruitful opportunities for future research. Finally, OGB provides an automated end-to-end graph ML pipeline that simplifies and standardizes the process of graph data loading, experimental setup, and model evaluation. OGB will be regularly updated and welcomes inputs from the community. OGB datasets as well as data loaders, evaluation scripts, baseline code, and leaderboards are publicly available.

1,097 citations


Cites background from "A new view of the tree of life"

  • ...…is a set of undirected protein association neighborhoods extracted from the protein-protein association networks of 1,581 different species (Szklarczyk et al., 2019) that cover 37 broad taxonomic groups (e.g., mammals, bacterial families, archaeans) and span the tree of life (Hug et al., 2016)....

Journal ArticleDOI
TL;DR: Both stochastic and deterministic components embedded in various ecological processes, including selection, dispersal, diversification, and drift are described.
Abstract: Understanding the mechanisms controlling community diversity, functions, succession, and biogeography is a central, but poorly understood, topic in ecology, particularly in microbial ecology. Although stochastic processes are believed to play nonnegligible roles in shaping community structure, their importance relative to deterministic processes is hotly debated. The importance of ecological stochasticity in shaping microbial community structure is far less appreciated. Some of the main reasons for such heavy debates are the difficulty in defining stochasticity and the diverse methods used for delineating stochasticity. Here, we provide a critical review and synthesis of data from the most recent studies on stochastic community assembly in microbial ecology. We then describe both stochastic and deterministic components embedded in various ecological processes, including selection, dispersal, diversification, and drift. We also describe different approaches for inferring stochasticity from observational diversity patterns and highlight experimental approaches for delineating ecological stochasticity in microbial communities. In addition, we highlight research challenges, gaps, and future directions for microbial community assembly research.

1,071 citations

References
Journal ArticleDOI
TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.
Abstract: We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.

37,524 citations

Journal ArticleDOI
TL;DR: RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML) that has been used to compute ML trees on two of the largest alignments to date.
Abstract: Summary: RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML). Low-level technical optimizations, a modification of the search algorithm, and the use of the GTR+CAT approximation as replacement for GTR+Γ yield a program that is between 2.7 and 52 times faster than the previous version of RAxML. A large-scale performance comparison with GARLI, PHYML, IQPNNI and MrBayes on real data containing 1000 up to 6722 taxa shows that RAxML requires at least 5.6 times less main memory and yields better trees in similar times than the best competing program (GARLI) on datasets up to 2500 taxa. On datasets ≥4000 taxa it also runs 2–3 times faster than GARLI. RAxML has been parallelized with MPI to conduct parallel multiple bootstraps and inferences on distinct starting trees. The program has been used to compute ML trees on two of the largest alignments to date containing 25 057 (1463 bp) and 2182 (51 089 bp) taxa, respectively. Availability: icwww.epfl.ch/~stamatak Contact: Alexandros.Stamatakis@epfl.ch Supplementary information: Supplementary data are available at Bioinformatics online.

14,847 citations

Proceedings ArticleDOI
23 Dec 2010
TL;DR: Development of the CIPRES Science Gateway is described, a web portal designed to provide researchers with transparent access to the fastest available community codes for inference of phylogenetic relationships, and implementation of these codes on scalable computational resources.
Abstract: Understanding the evolutionary history of living organisms is a central problem in biology. Until recently the ability to infer evolutionary relationships was limited by the amount of DNA sequence data available, but new DNA sequencing technologies have largely removed this limitation. As a result, DNA sequence data are readily available or obtainable for a wide spectrum of organisms, thus creating an unprecedented opportunity to explore evolutionary relationships broadly and deeply across the Tree of Life. Unfortunately, the algorithms used to infer evolutionary relationships are NP-hard, so the dramatic increase in available DNA sequence data has created a commensurate increase in the need for access to powerful computational resources. Local laptop or desktop machines are no longer viable for analysis of the larger data sets available today, and progress in the field relies upon access to large, scalable high-performance computing resources. This paper describes development of the CIPRES Science Gateway, a web portal designed to provide researchers with transparent access to the fastest available community codes for inference of phylogenetic relationships, and implementation of these codes on scalable computational resources. Meeting the needs of the community has included developing infrastructure to provide access, working with the community to improve existing community codes, developing infrastructure to ensure the portal is scalable to the entire systematics community, and adopting strategies that make the project sustainable by the community. The CIPRES Science Gateway has allowed more than 1800 unique users to run jobs that required 2.5 million Service Units since its release in December 2009. (A Service Unit is a CPU-hour at unit priority.)

9,117 citations

Journal ArticleDOI
28 Jul 1995-Science
TL;DR: An approach for genome analysis based on sequencing and assembly of unselected pieces of DNA from the whole chromosome has been applied to obtain the complete nucleotide sequence of the genome from the bacterium Haemophilus influenzae Rd.
Abstract: An approach for genome analysis based on sequencing and assembly of unselected pieces of DNA from the whole chromosome has been applied to obtain the complete nucleotide sequence (1,830,137 base pairs) of the genome from the bacterium Haemophilus influenzae Rd. This approach eliminates the need for initial mapping efforts and is therefore applicable to the vast array of microbial species for which genome maps are unavailable. The H. influenzae Rd genome sequence (Genome Sequence DataBase accession number L42023) represents the only complete genome sequence from a free-living organism.

5,944 citations

Journal ArticleDOI
TL;DR: SILVA (from Latin silva, forest), was implemented to provide a central comprehensive web resource for up to date, quality controlled databases of aligned rRNA sequences from the Bacteria, Archaea and Eukarya domains.
Abstract: Sequencing ribosomal RNA (rRNA) genes is currently the method of choice for phylogenetic reconstruction, nucleic acid based detection and quantification of microbial diversity. The ARB software suite with its corresponding rRNA datasets has been accepted by researchers worldwide as a standard tool for large scale rRNA analysis. However, the rapid increase of publicly available rRNA sequence data has recently hampered the maintenance of comprehensive and curated rRNA knowledge databases. A new system, SILVA (from Latin silva, forest), was implemented to provide a central comprehensive web resource for up to date, quality controlled databases of aligned rRNA sequences from the Bacteria, Archaea and Eukarya domains. All sequences are checked for anomalies, carry a rich set of sequence associated contextual information, have multiple taxonomic classifications, and the latest validly described nomenclature. Furthermore, two precompiled sequence datasets compatible with ARB are offered for download on the SILVA website: (i) the reference (Ref) datasets, comprising only high quality, nearly full length sequences suitable for in-depth phylogenetic analysis and probe design and (ii) the comprehensive Parc datasets with all publicly available rRNA sequences longer than 300 nucleotides suitable for biodiversity analyses. The latest publicly available database release 91 (August 2007) hosts 547 521 sequences split into 461 823 small subunit and 85 689 large subunit rRNAs.

5,733 citations