GATTACA: Lightweight Metagenomic Binning with Compact Indexing of Kmer Counts and MinHash-based Panel Selection

Victoria Popic^1*, Volodymyr Kuleshov^1, Michael Snyder^2, and Serafim Batzoglou^1

^1 Department of Computer Science, Stanford University, Stanford CA, USA
^2 Department of Genetics, Stanford University, Stanford CA, USA
{viq, kuleshov, mpsnyder, serafim}@stanford.edu
Abstract. We introduce GATTACA, a framework for rapid and accurate binning of metagenomic contigs from a single or multiple metagenomic samples into clusters associated with individual species. The clusters are computed using co-abundance profiles within a set of reference metagenomes; unlike previous methods, GATTACA estimates these profiles from k-mer counts stored in a highly compact index. On multiple synthetic and real benchmark datasets, GATTACA produces clusters that correspond to distinct bacterial species with an accuracy that matches earlier methods, while being up to 20× faster when the reference panel index can be computed offline and 6× faster for online co-abundance estimation. Leveraging the MinHash technique to quickly compare metagenomic samples, GATTACA also provides an efficient way to identify publicly-available metagenomic data that can be incorporated into the set of reference metagenomes to further improve binning accuracy. Thus, enabling easy indexing and reuse of publicly-available metagenomic datasets, GATTACA makes accurate metagenomic analyses accessible to a much wider range of researchers.
1 Introduction
Despite their important role, microbes constitute the dark matter of the biological universe. Thousands of
species live in the human gut, but only a small fraction can be isolated and studied in a laboratory and
very little is known about those that cannot be cultured. The short read lengths of modern sequencing
instruments, combined with various inherent difficulties associated with complex bacterial environments,
make it very difficult to perform simple tasks such as accurately identifying bacterial strains, recovering
their genomic sequences, and assessing their abundance. Many approaches have been proposed to address
these shortcomings. Specialized library preparation techniques such as Hi-C or synthetic long reads are often
very accurate, but also prohibitively complex. As a result, approaches based on contig binning are more
popular in practice. Metagenomic binning refers to the problem of grouping together partially assembled
sequence fragments (or contigs) that belong to the same species. Current binning techniques fall mainly into
two categories: (1) supervised classification of contigs into known taxa via comparisons to previously
catalogued species [13,26,30] and (2) unsupervised clustering techniques using features derived directly from
the metagenomic sample data [?, 2, 3, 16, 17, 20, 31], where unsupervised clustering has the clear advantage
of binning contigs that pertain to previously unknown species. While some unsupervised techniques [17, 28]
perform clustering based only on the contig sequence composition (the frequency of certain short motifs,
e.g. all tetra-mers), the most successful recent approaches [2, 3, 16, 20, 31] also incorporate contig coverage
profiles across multiple metagenomic samples. In brief, these techniques assemble bacterial contigs de novo
and estimate the coverage of each contig within each sample of a large metagenomic cohort using read
mapping. Naturally, contigs belonging to the same species will have similar abundances across different
samples (determined by which cohort samples the species is present in); coverage profiles can therefore
* Corresponding author.

be used to cluster related contigs. This approach is accurate but has two main limitations: it requires a
large cohort of samples, as well as sizable compute resources for read alignment. We address both of these
limitations in this work.
In particular, we present GATTACA, a lightweight framework for metagenomic binning, which (1) avoids
read alignment without loss of accuracy and (2) enables efficient stand-alone analysis of single metagenomic
samples. Both results are based on the finding that we can approximate contig coverages using kmer counts
while still achieving the same binning accuracy as leading alignment-based methods. In addition to offering
a significant speedup in coverage estimation, using kmer counts, as opposed to alignment, provides us with
the exciting ability to index offline any publicly-available metagenomic sample and incorporate it into the
coverage profile of the contigs being processed. This allows us to efficiently pull in data from large growing
repositories, such as the Human Microbiome Project (HMP) [29] or the EBI Metagenomics archive [12], into
any metagenomic study (especially one where only a single sample or a few samples are available) at almost no cost.
For example, our kmer count index for a typical HMP sample only requires 100MB on average. We achieve
the small space requirement by leveraging memory-efficient hashing with minimal perfect hash functions
(MPHFs) and the probabilistic Bloom filter data structure. In contrast, using these datasets with read
alignment would require massive downloads (for example, a single HMP sample is roughly 7GB compressed
and 30GB uncompressed) and expensive subsequent handling to map the reads. In terms of speedup, we
found our coverage estimation time to be at least an order of magnitude faster (approximately 20×) when
the index is computed offline (e.g. for recyclable public reference samples) and about 6× when the
kmers are counted on-the-fly (e.g. for private samples used only once), when compared to read mapping.
While using small indices allows us to incorporate a large number of publicly-available samples into a
given study, not all existing samples will carry content relevant to the study in question. Namely, samples
that don’t contain any of the species present in a given set of contigs cannot contribute any useful information
for grouping the contigs. The same logic applies also to samples that carry content identical to a sample
that has already been included. This motivates the need to additionally define appropriate sample selection
criteria, for which we propose two metrics: (1) relevance and (2) diversity. More specifically, we would like
to select a panel of samples which share content with the sample being analyzed (our query) but that also
differ in the content that is shared. We use locality-sensitive hashing [15] and the MinHash technique [7]
to compare the samples efficiently.
to compare the samples efficiently. At a high level, we create and index small MinHash fingerprints for each
sample in the database (offline), and then extract the appropriate samples according to the fingerprint of the
query. The resulting index can be separately downloaded and used to determine which samples to include
into the analysis; it needs to be updated only occasionally when new samples become available.
We evaluate GATTACA in clustering contigs assembled across multiple samples (co-assemblies) and
from individual samples, using both synthetic and real datasets. We compare our results with several
state-of-the-art methods in metagenomic binning: CONCOCT [3], MetaBat [16], and MaxBin [31], using
standardized cluster evaluation metrics and benchmarks (reusing evaluation scripts from existing methods
when appropriate). GATTACA was implemented in C++ and Python and is freely available at
http://viq854.github.com/gattaca.
2 Methods
2.1 Index of Kmer Counts
In order to quickly estimate contig coverages, GATTACA builds a small index of kmer counts for each
sample in the cohort. Several solutions have been proposed for exact kmer counting (e.g. using hash maps [21]
or minimal perfect hash functions [24]) and for approximate kmer counting (e.g. using the count-min sketch [32]).
Since the content of each sample in our panel is static, our index uses a minimal perfect hash function [9] to store
the kmer counts without loss of accuracy, resulting in a drastic reduction in space when compared to traditional

hash tables (we also found it to be more space-efficient than the count-min sketch solution for the same
binning accuracy). At a high level, given a set S of n keys, a minimal perfect hash function (MPHF) h
provides a mapping between the keys and the n consecutive integers from 0 to n−1; that is, h is an injection
on S, guaranteeing no collisions among its keys (for x and y in S, if x ≠ y, then h(x) ≠ h(y)) and exactly n
possible outputs from the integer set {0, 1, 2, ..., n−1}. We use the BDZ algorithm based on random r-partite
hypergraphs [6] for constructing the MPHFs.
Index Construction. To construct the index, we first generate the kmers from all the reads in the sample
(accounting for both forward and reverse complement strands) and exclude kmers that occur only once, since
these are most likely present due to sequencing errors. We use a kmer length of 31-bp in our experiments
(compacting the kmers into 64-bit integers for convenience). We then generate the MPHF, h_S, for the
resulting set of distinct kmers, S, and store their counts in an integer array A (|A| = |S|), at the indices
given by h_S; namely, A[h_S(x)] = count(x), for each kmer x ∈ S. We found 8 bits to be sufficient for storing
the kmer counts (and since many counts are small, these can be compressed even further using techniques
such as varint encoding). Finally, we need to store the elements in S to support lookups, since h_S(z) for
z ∉ S will return a valid but incorrect index into A. One direct solution for storing S would be to rely on
the MPHF, using a secondary array B and setting B[h_S(x)] = x for all kmers x ∈ S; then we could check
upon lookup of a key y whether B[h_S(y)] is equal to y, and so determine whether y was in the set. However,
this solution requires storing the array B of |S| 64-bit integers, which is 4× larger than A, and would
substantially increase the index. So instead, we store the set S in a Bloom filter, BF, a widely used
probabilistic data structure for testing set membership that offers space-efficiency at the expense of possible
false positives (no false negatives are possible). We configured the size of BF based on a false positive
probability of 0.05. As a result, our index for each sample consists of: (1) the MPHF, h_S, (2) the array of
counts, A, and (3) the Bloom filter, BF, storing the elements of S. As an example, the size of the index
constructed for an HMP sample containing 20 million 100-bp long reads was 108MB.
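The following minimal Python sketch illustrates the structure of this per-sample index. All names are illustrative rather than GATTACA's actual implementation: a plain dictionary stands in for the MPHF h_S (a real index would use a BDZ/BBHash-style MPHF so that the key set itself need not be stored), counts live in an 8-bit array, and a simple Bloom filter built on hashlib guards lookups of unseen kmers.

```python
# Minimal sketch of the per-sample kmer-count index (illustrative, not GATTACA's code).
import hashlib
import math
from collections import Counter

K = 31  # kmer length used in the paper
_COMP = str.maketrans("ACGT", "TGCA")

def kmers_of(seq):
    """Yield all K-mers of seq and of its reverse complement."""
    rc = seq.translate(_COMP)[::-1]
    for s in (seq, rc):
        for i in range(len(s) - K + 1):
            yield s[i:i + K]

class BloomFilter:
    def __init__(self, n_items, fp_rate=0.05, n_hashes=4):
        # standard sizing formula m = -n ln(p) / (ln 2)^2 for the target false positive rate
        self.m = max(8, int(-n_items * math.log(fp_rate) / (math.log(2) ** 2)))
        self.k = n_hashes
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(h[:8], "little") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))

class KmerCountIndex:
    """Index = (MPHF stand-in, 8-bit count array A, Bloom filter BF over S)."""
    def __init__(self, reads):
        counts = Counter(k for r in reads for k in kmers_of(r))
        kept = {k: c for k, c in counts.items() if c > 1}        # drop singletons (likely errors)
        self.slot = {k: i for i, k in enumerate(kept)}           # stand-in for h_S
        self.A = bytearray(min(c, 255) for c in kept.values())   # 8-bit counts
        self.BF = BloomFilter(max(len(kept), 1))
        for k in kept:
            self.BF.add(k)

    def count(self, kmer):
        # kmers absent from the Bloom filter get count 0; with a true MPHF a BF
        # false positive would return an arbitrary slot, the dict lets us return 0.
        if kmer not in self.BF:
            return 0
        i = self.slot.get(kmer)
        return self.A[i] if i is not None else 0
```

In a real deployment only the MPHF, the count array, and the Bloom filter bits are serialized, which is what keeps the per-sample index around 100MB.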
Coverage Estimation. Given a contig c and an index I of a cohort sample, we estimate the coverage of c
in this sample by performing lookups in I for each kmer in c and then computing the median of the resulting
counts. More specifically, we return the median of the set of counts C = {count(x) | kmer x ∈ c},
where

count(x) = I.A[I.h_S(x)] if x ∈ I.BF, and count(x) = 0 otherwise.    (1)
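As a sketch of this step (again with illustrative names), the median-based estimate of Equation (1) only needs a kmer-lookup callable, such as the `count` method from the index sketch above:

```python
# Sketch of coverage estimation: the coverage of contig c in one sample is the
# median of count(x) over all kmers x of c, with count(x) = 0 when x is absent
# from that sample's index (Equation (1)).
from statistics import median

K = 31  # must match the kmer length used to build the index

def estimate_coverage(contig, lookup):
    """lookup: callable mapping a kmer string to its count in the sample."""
    if len(contig) < K:
        return 0.0
    return median(lookup(contig[i:i + K]) for i in range(len(contig) - K + 1))
```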
2.2 Contig Representation
Given a set of contigs assembled from a single or multiple metagenomic samples, our goal is to bin together
the contigs that belong to the same class (e.g. species or strain). Similar to existing methods, e.g. CONCOCT,
we first represent each contig as a multi-dimensional vector using both its sequence composition and coverage
profile across multiple samples, where our coverages are approximated using kmer counts instead of read
mapping, as described above. Namely, given M reference samples (either from the same study or from a
public database), our coverage profile is the median count of the contig kmers in each sample, while the
composition profile is the normalized frequency of each possible tetra-mer in the contig and its reverse
complement (resulting in a total of F = 136 such features); the normalization of composition features is
done according to the CONCOCT procedure (please see [3] for details). Therefore, each contig is a vector
V = [c_1, ..., c_M, f_1, ..., f_F], where c_i is the median kmer count in sample i and f_j is the frequency of
tetra-mer j in the contig sequence.
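To make the feature layout concrete, here is a small sketch (illustrative names; it omits the CONCOCT-style normalization that the paper adopts) assembling V from the per-sample median coverages and the 136 canonical tetra-mer frequencies:

```python
# Sketch of the contig representation V = [c_1, ..., c_M, f_1, ..., f_F]:
# M kmer-based median coverages followed by F = 136 tetra-mer frequencies,
# where each tetra-mer is pooled with its reverse complement.
from itertools import product

_COMP = str.maketrans("ACGT", "TGCA")

def _canonical(tetramer):
    rc = tetramer.translate(_COMP)[::-1]
    return min(tetramer, rc)

# 4^4 = 256 tetra-mers collapse to 136 canonical ones (16 equal their own reverse complement)
CANONICAL_TETRAMERS = sorted({_canonical("".join(p)) for p in product("ACGT", repeat=4)})

def contig_vector(contig_seq, median_coverages):
    """median_coverages: list of M per-sample median kmer counts for this contig."""
    counts = dict.fromkeys(CANONICAL_TETRAMERS, 0)
    for i in range(len(contig_seq) - 3):
        tet = contig_seq[i:i + 4]
        if set(tet) <= set("ACGT"):          # skip tetra-mers containing N or other symbols
            counts[_canonical(tet)] += 1
    total = sum(counts.values()) or 1
    freqs = [counts[t] / total for t in CANONICAL_TETRAMERS]
    return list(median_coverages) + freqs    # length M + 136
```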

2.3 Clustering Algorithm
Given the resulting vector representations, we cluster the contigs using a Bayesian Gaussian mixture model
(GMM) with a Dirichlet prior. In brief, we define a mixture distribution p of K Gaussian components over
n data points x_i ∈ R^d and (unobserved) assignment labels z_i ∈ {0, 1}^K for i = 1, ..., n. Our model is the
product of a likelihood term

p(z, X | θ) = ∏_{i=1}^{n} ∏_{k=1}^{K} π_k^{z_ik} N(x_i | µ_k, Λ_k^{-1})^{z_ik}

and a prior term

p(θ) = Dir(π | α_0) ∏_{k=1}^{K} N(µ_k | m_0, (β_0 Λ_k)^{-1}) Wi(Λ_k | L_0, ν_0).
Here, X ∈ R^{d×n} is the matrix of data points and N(· | µ_k, Λ_k^{-1}) is a multivariate Gaussian with mean µ_k
and inverse covariance matrix Λ_k. The π ∈ ∆^{K-1} form a vector of cluster weights. Together, the µ_k, Λ_k, π
form the parameter vector θ of the likelihood. The prior over θ is a product of a Dirichlet with hyper-parameter
α_0 ∈ ∆^{K-1}, a multivariate normal with hyper-parameters m_0 ∈ R^d and β_0 > 0, and a Wishart
distribution parametrized by L_0 ∈ R^{d×d} p.s.d. and ν_0 > 0.
We perform inference by maximizing the marginal log-likelihood log p(X) using variational inference. In
brief, we maximize the evidence lower bound

log p(X) ≥ E_{q(z,θ)}[log p(z, X, θ) − log q(z, θ)]

over the set of approximating distributions q. By our choice of conjugate prior, the posterior p(θ, z | X)
and hence the optimal q have the same form, which factors as q(z | θ) q(θ). We optimize the bound using
variational expectation-maximization, which consists of repeatedly updating q(z | θ) and q(θ). Each
update has a closed-form solution by our choice of conjugate prior. We conclude the algorithm by assigning
each data point to its maximum a-posteriori label according to q(z | θ). We refer the reader to Section 21.6.1
of the standard textbook of Murphy (2012) [22] for the full derivation of this algorithm.
At a high level, the above model is very similar to automatic relevance determination (ARD), which
is used by CONCOCT. We have found our approach to perform better in practice than ARD, especially
for automatically determining the number of clusters in the data. Both algorithms are implemented in our
software package. Other clustering methods can also be easily plugged into GATTACA’s binning
pipeline.
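Since the model above is a standard finite Dirichlet-prior Gaussian mixture fit with variational EM, one way to prototype this step is scikit-learn's BayesianGaussianMixture. The sketch below is such a stand-in, not GATTACA's own implementation; the cluster upper bound and concentration value alpha0 are illustrative.

```python
# Sketch of the clustering step: a variational Bayesian GMM with a finite
# Dirichlet prior over mixture weights. max_clusters is only an upper bound on
# the number of clusters; a small alpha0 lets the model switch off unused components.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def cluster_contigs(V, max_clusters=100, alpha0=1e-3, seed=0):
    """V: (n_contigs, M + 136) array of contig feature vectors."""
    gmm = BayesianGaussianMixture(
        n_components=max_clusters,
        weight_concentration_prior_type="dirichlet_distribution",  # finite Dirichlet prior
        weight_concentration_prior=alpha0,
        covariance_type="full",
        max_iter=500,
        random_state=seed,
    )
    return gmm.fit_predict(np.asarray(V))  # maximum a-posteriori label per contig
```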
2.4 Sample Selection
Given a query sample, Q, we would like to select n samples from the public database that can provide
discriminatory features for clustering the contigs of Q (where the features represent the coverage of the
contigs in the respective samples). Intuitively, the selected samples must share some content with Q (have
relevance), as well as have pairwise diversity among themselves to guarantee coverage of different contigs of
Q. Similar relevance and diversity concepts can be found in online recommendation systems (e.g. for articles
or music [1, 8]).
By representing each sample as a set of overlapping kmers, we apply the Jaccard coefficient to measure
their similarity, where the Jaccard coefficient J(A, B) = |A ∩ B| / |A ∪ B| for two sets A and B. Then, we can consider
relevant the samples that are within a certain distance from Q under Jaccard (e.g. all samples S for which
J(S, Q) > 0). However, computing the Jaccard distances directly on the kmer sets would be inefficient for
large databases. Therefore, we apply the min-wise independent permutations (MinHash) LSH scheme [7] to
create small fingerprints for each sample set instead, defined as follows.

Let U be the ground set of all possible set items. Given a random permutation π of the indices of U and a set
X, let h_π(X) = min_{x ∈ X} π(x). The MinHash LSH family H consists of all such functions, one for each choice
of π. It can be easily shown that for a given h chosen uniformly at random, Pr[h_π(A) = h_π(B)] = J(A, B)
(see [7] for details). Due to the high variance in the probability of collision, we concatenate L different
hash functions from the family H chosen independently at random to form the fingerprint. Then, given the
number of hash collisions among the chosen L functions, c, the ratio c/L can also be used as an unbiased
estimator for J(A, B).
To summarize, given the kmer set K = {s_0, s_1, ..., s_{n-1}} of some sample S and L hash functions from H,
we construct the MinHash fingerprint vector F = [f_0, f_1, ..., f_{L-1}], such that the fingerprint entry f_i is the
minimum set element under hash function h_i:

f_i = min{h_i(s_0), h_i(s_1), ..., h_i(s_{n-1})}.
Now given the fingerprints, we can define relevance between a sample S and Q as simply the number
of entries shared by their fingerprints. By indexing the fingerprints of all the samples in the database into
L tables (based on the value of each fingerprint entry, respectively), we can find all the samples that share
at least one fingerprint entry in common with Q using simple lookups, as well as rank them according to
relevance.
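The sketch below illustrates fingerprinting, the c/L similarity estimate, and relevance ranking via L lookup tables. Salted hashes stand in for true random permutations, and the choice of L = 200 and all names are illustrative, not the paper's settings.

```python
# MinHash sketch: L salted hash functions stand in for L random permutations.
# Relevance between two samples is the number of shared fingerprint entries c,
# and c/L estimates the Jaccard coefficient J(A, B).
import hashlib
from collections import Counter

def minhash_fingerprint(kmer_set, L=200):
    fp = []
    for i in range(L):
        salt = i.to_bytes(2, "little")
        fp.append(min(
            int.from_bytes(hashlib.blake2b(k.encode(), salt=salt).digest()[:8], "little")
            for k in kmer_set
        ))  # f_i = min over the sample's kmers under hash h_i
    return fp

def build_lsh_tables(fingerprints):
    """fingerprints: dict sample_id -> fingerprint; returns one lookup table per entry."""
    L = len(next(iter(fingerprints.values())))
    tables = [dict() for _ in range(L)]
    for sid, fp in fingerprints.items():
        for i, entry in enumerate(fp):
            tables[i].setdefault(entry, []).append(sid)
    return tables

def relevant_samples(query_fp, tables):
    """Rank database samples by the number of fingerprint entries shared with the query."""
    hits = Counter()
    for i, entry in enumerate(query_fp):
        hits.update(tables[i].get(entry, []))
    return hits.most_common()   # [(sample_id, shared_entries), ...] in decreasing relevance
```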
Finally, if the number of relevant samples is too high, we can reduce our panel using the diversity criterion.
That is, given all the relevant samples, we can select n samples that maximize the diversity of the set. This
problem is known as the dispersion problem [10], where the objective is to locate k points among n such
that some function of the distances between the k points is maximized. One popular optimality criterion is
MAX-MIN, which maximizes the minimum distance between a pair of points. This problem is known to be
NP-hard; however, an efficient greedy heuristic exists for the MAX-MIN dispersion problem when
the distances satisfy the triangle inequality, with a provable performance guarantee of 2 [25]. Given two samples
A and B, we define their diversity as D(A, B) = 1 − J(A, B) and apply the greedy algorithm of [25] to find
the n samples. While this procedure is simple and can be efficiently used to detect samples with distinct kmer
sets, its main limitation is that it cannot find samples which differ only in kmer frequency (since
frequency does not affect the Jaccard distance), even though such samples could also generate discriminatory features.
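A sketch of the greedy MAX-MIN heuristic over the diversity D(A, B) = 1 − J(A, B) follows. The Jaccard values would in practice come from the fingerprint-based estimator above, and the seeding choice (e.g. the most relevant sample) is illustrative.

```python
# Greedy MAX-MIN dispersion (2-approximation under the triangle inequality):
# repeatedly add the candidate whose minimum diversity to the current panel
# is largest, until n samples are selected.
def select_diverse_panel(candidates, jaccard, n, seed):
    """candidates: sample ids; jaccard(a, b): (estimated) Jaccard similarity; seed: first pick."""
    panel = [seed]
    remaining = [s for s in candidates if s != seed]
    while remaining and len(panel) < n:
        best = max(remaining,
                   key=lambda s: min(1.0 - jaccard(s, p) for p in panel))  # D = 1 - J
        panel.append(best)
        remaining.remove(best)
    return panel
```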
3 Results
3.1 Datasets
Synthetic datasets. We used two synthetic datasets generated by Alneberg et al. [3] from the 16S rRNA
samples of the Human Microbiome Project (HMP) [29]. The first dataset ("Species-Mock") consists of 96 samples
containing a mixture of 101 different species (without strain-level variation), while the second dataset
("Strain-Mock") consists of 64 samples comprising a mixture of 20 different organisms, of which some
represent strains of the same species (e.g., this dataset includes five different E. coli strains). The relative
abundance profiles of the species and strains in each sample were assigned according to the distribution of
the 101 and 20 most abundant organisms in the original HMP samples, respectively. Reads (100-bp long)
were simulated from random positions of the genomes present in each sample based on their relative
abundance, for a total of 7.75 million reads and 11.75 million reads per "Species-Mock" and "Strain-Mock"
sample, respectively. Both datasets contain the set of contigs co-assembled across all the samples by Alneberg
et al. using the Ray assembler [5], and partitioned into fragments of 10 kilobases when appropriate. We used
the default minimum contig length of 1000-bp when running CONCOCT, MaxBin, and GATTACA; this
parameter was set to 1500-bp for MetaBat, which is the smallest length supported by this method. As a
result, the "Species-Mock" included 37,627 valid contigs and the "Strain-Mock" included 9,411 valid contigs.
