GATTACA: Lightweight Metagenomic Binning with Compact Indexing of Kmer Counts and MinHash-based Panel Selection

Victoria Popic^1*, Volodymyr Kuleshov^1, Michael Snyder^2, and Serafim Batzoglou^1

^1 Department of Computer Science, Stanford University, Stanford CA, USA
^2 Department of Genetics, Stanford University, Stanford CA, USA
{viq, kuleshov, mpsnyder, serafim}@stanford.edu
Abstract. We introduce GATTACA, a framework for rapid and accurate binning of metagenomic contigs from a single or multiple metagenomic samples into clusters associated with individual species. The clusters are computed using co-abundance profiles within a set of reference metagenomes; unlike previous methods, GATTACA estimates these profiles from k-mer counts stored in a highly compact index. On multiple synthetic and real benchmark datasets, GATTACA produces clusters that correspond to distinct bacterial species with an accuracy that matches earlier methods, while being up to 20× faster when the reference panel index can be computed offline and 6× faster for online co-abundance estimation. Leveraging the MinHash technique to quickly compare metagenomic samples, GATTACA also provides an efficient way to identify publicly-available metagenomic data that can be incorporated into the set of reference metagenomes to further improve binning accuracy. Thus, enabling easy indexing and reuse of publicly-available metagenomic datasets, GATTACA makes accurate metagenomic analyses accessible to a much wider range of researchers.
1 Introduction
Despite their important role, microbes constitute the dark matter of the biological universe. Thousands of
species live in the human gut, but only a small fraction can be isolated and studied in a laboratory and
very little is known about those that cannot be cultured. The short read lengths of modern sequencing
instruments, combined with various inherent difficulties associated with complex bacterial environments,
make it very difficult to perform simple tasks such as accurately identifying bacterial strains, recovering
their genomic sequences, and assessing their abundance. Many approaches have been proposed to address
these shortcomings. Specialized library preparation techniques such as Hi-C or synthetic long reads are often
very accurate, but also prohibitively complex. As a result, approaches based on contig binning are more
popular in practice. Metagenomic binning refers to the problem of grouping together partially assembled
sequence fragments (or contigs) that belong to the same species. Current binning techniques fall mainly into
two categories: (1) supervised classification of contigs into known taxa via comparisons to previously
catalogued species [13,26,30] and (2) unsupervised clustering techniques using features derived directly from
the metagenomic sample data [?, 2, 3, 16, 17, 20, 31], where unsupervised clustering has the clear advantage
of binning contigs that pertain to previously unknown species. While some unsupervised techniques [17, 28]
perform clustering based only on the contig sequence composition (the frequency of certain short motifs,
e.g. all tetra-mers), the most successful recent approaches [2, 3, 16, 20, 31] also incorporate contig coverage
profiles across multiple metagenomic samples. In brief, these techniques assemble bacterial contigs de novo
and estimate the coverage of each contig within each sample of a large metagenomic cohort using read
mapping. Naturally, contigs belonging to the same species will have similar abundances across different
samples (determined by which cohort samples the species is present in); coverage profiles can therefore
* Corresponding author.

be used to cluster related contigs. This approach is accurate but has two main limitations: it requires a
large cohort of samples, as well as sizable compute resources for read alignment. We address both of these
limitations in this work.
In particular, we present GATTACA, a lightweight framework for metagenomic binning, which (1) avoids
read alignment without loss of accuracy and (2) enables efficient stand-alone analysis of single metagenomic
samples. Both results are based on the finding that we can approximate contig coverages using kmer counts
while still achieving the same binning accuracy as leading alignment-based methods. In addition to offering
a significant speedup in coverage estimation, using kmer counts, as opposed to alignment, provides us with
the exciting ability to index offline any publicly-available metagenomic sample and incorporate it into the
coverage profile of the contigs being processed. This allows us to efficiently pull in data from large growing
repositories, such as the Human Microbiome Project (HMP) [29] or the EBI Metagenomics archive [12], into
any metagenomic study (especially one where only a single sample or a few samples are available) at almost no cost.
For example, our kmer count index for a typical HMP sample only requires 100MB on average. We achieve
the small space requirement by leveraging memory-efficient hashing with minimal perfect hash functions
(MPHFs) and the probabilistic Bloom filter data structure. In contrast, using these datasets with read
alignment would require massive downloads (for example, a single HMP sample is roughly 7GB compressed
and 30GB uncompressed) and expensive subsequent handling to map the reads. In terms of speedup, we
found our coverage estimation time to be at least an order of magnitude faster (approximately 20×) when
the index is computed offline (e.g. for recyclable public reference samples) and about 6× when the
kmers are counted on-the-fly (e.g. for private samples used only once), when compared to read mapping.
While using small indices allows us to incorporate a large number of publicly-available samples into a
given study, not all existing samples will carry content relevant to the study in question. Namely, samples
that don’t contain any of the species present in a given set of contigs cannot contribute any useful information
for grouping the contigs. The same logic applies also to samples that carry content identical to a sample
that has already been included. This motivates the need to additionally define appropriate sample selection
criteria, for which we propose two metrics: (1) relevance and (2) diversity. More specifically, we would like
to select a panel of samples which share content with the sample being analyzed (our query) but that also
differ in the content that is shared. We use locality-sensitive hashing [15] and the MinHash technique [7]
to compare the samples efficiently.
to compare the samples efficiently. At a high level, we create and index small MinHash fingerprints for each
sample in the database (offline), and then extract the appropriate samples according to the fingerprint of the
query. The resulting index can be separately downloaded and used to determine which samples to include
into the analysis; it needs to be updated only occasionally when new samples become available.
We evaluate GATTACA in clustering contigs assembled across multiple samples (co-assemblies) and
from individual samples, using both synthetic and real datasets. We compare our results with several
state-of-the-art methods in metagenomic binning: CONCOCT [3], MetaBat [16], and MaxBin [31], using
standardized cluster evaluation metrics and benchmarks (reusing evaluation scripts from existing methods
when appropriate). GATTACA was implemented in C++ and Python and is freely available at
http://viq854.github.com/gattaca.
2 Methods
2.1 Index of Kmer Counts
In order to quickly estimate contig coverages, GATTACA builds a small index of kmer counts for each
sample in the cohort. Several solutions have been proposed for exact kmer counting (e.g. using hash maps [21]
or minimal perfect hash functions [24]) and for approximate kmer counting (e.g. using the count-min sketch [32]).
Since the content of each sample in our panel is static, our index uses a minimal perfect hash function [9] to store
the kmer counts without loss of accuracy, resulting in a drastic reduction in space when compared to traditional

hash tables (we also found it to be more space-efficient than the count-min sketch solution for the same
binning accuracy). At a high level, given a set S of n keys, a minimal perfect hash function (MPHF) h
provides a mapping between the keys and the n consecutive integers from 0 to n−1; that is, h is an injection
on S, guaranteeing no collisions among its keys (for x and y in S, if x ≠ y, then h(x) ≠ h(y)) and exactly n
possible outputs from the integer set {0, 1, 2, ..., n−1}. We use the BDZ algorithm based on random r-partite
hypergraphs [6] for constructing the MPHFs.
Index Construction. To construct the index, we first generate the kmers from all the reads in the sample
(accounting for both forward and reverse complement strands) and exclude kmers that occur only once, since
these are most likely present due to sequencing errors. We use a kmer length of 31-bp in our experiments
(compacting the kmers into 64-bit integers for convenience). We then generate the MPHF, h_S, for the
resulting set of distinct kmers, S, and store their counts in an integer array A (|A| = |S|), at the indices
given by h_S; namely, A[h_S(x)] = count(x), for each kmer x ∈ S. We found 8 bits to be sufficient for storing
the kmer counts (and since many counts are small, these can be compressed even further using techniques
such as varint encoding). Finally, we need to store the elements in S to support lookups, since h_S(z) for
z ∉ S will return a valid but incorrect index into A. One direct solution for storing S would be to rely on
the MPHF, using a secondary array B and setting B[h_S(x)] = x for all kmers x ∈ S; then we could check
upon lookup of a key y whether B[h_S(y)] is equal to y, and so determine whether y was in the set. However,
this solution requires storing the array B of |S| 64-bit integers, which is 4× larger than A, and would
substantially increase the index. So instead, we store the set S in a Bloom filter, BF, a widely used
probabilistic data structure for testing set membership that offers space-efficiency at the expense of possible
false positives (no false negatives are possible). We configured the size of BF based on a false positive
probability of 0.05. As a result, our index for each sample consists of: (1) the MPHF, h_S, (2) the array of
counts, A, and (3) the Bloom filter, BF, storing the elements of S. As an example, the size of the index
constructed for an HMP sample containing 20 million 100-bp long reads was 108MB.
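The following minimal Python sketch illustrates the structure of this per-sample index. All names are illustrative rather than GATTACA's actual implementation: a plain dictionary stands in for the MPHF h_S (a real index would use a BDZ/BBHash-style MPHF so that the key set itself need not be stored), counts live in an 8-bit array, and a simple Bloom filter built on hashlib guards lookups of unseen kmers.

```python
# Minimal sketch of the per-sample kmer-count index (illustrative, not GATTACA's code).
import hashlib
import math
from collections import Counter

K = 31  # kmer length used in the paper
_COMP = str.maketrans("ACGT", "TGCA")

def kmers_of(seq):
    """Yield all K-mers of seq and of its reverse complement."""
    rc = seq.translate(_COMP)[::-1]
    for s in (seq, rc):
        for i in range(len(s) - K + 1):
            yield s[i:i + K]

class BloomFilter:
    def __init__(self, n_items, fp_rate=0.05, n_hashes=4):
        # standard sizing formula m = -n ln(p) / (ln 2)^2 for the target false positive rate
        self.m = max(8, int(-n_items * math.log(fp_rate) / (math.log(2) ** 2)))
        self.k = n_hashes
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(h[:8], "little") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))

class KmerCountIndex:
    """Index = (MPHF stand-in, 8-bit count array A, Bloom filter BF over S)."""
    def __init__(self, reads):
        counts = Counter(k for r in reads for k in kmers_of(r))
        kept = {k: c for k, c in counts.items() if c > 1}        # drop singletons (likely errors)
        self.slot = {k: i for i, k in enumerate(kept)}           # stand-in for h_S
        self.A = bytearray(min(c, 255) for c in kept.values())   # 8-bit counts
        self.BF = BloomFilter(max(len(kept), 1))
        for k in kept:
            self.BF.add(k)

    def count(self, kmer):
        # kmers absent from the Bloom filter get count 0; with a true MPHF a BF
        # false positive would return an arbitrary slot, the dict lets us return 0.
        if kmer not in self.BF:
            return 0
        i = self.slot.get(kmer)
        return self.A[i] if i is not None else 0
```

In a real deployment only the MPHF, the count array, and the Bloom filter bits are serialized, which is what keeps the per-sample index around 100MB.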
Coverage Estimation. Given a contig c and an index I of a cohort sample, we estimate the coverage of c
in this sample by performing lookups in I for each kmer in c and then computing the median of the resulting
counts. More specifically, we return the median of the set of counts C = {count(x) | kmer x ∈ c},
where

count(x) = I.A[I.h_S(x)] if x ∈ I.BF, and count(x) = 0 otherwise.    (1)
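As a sketch of this step (again with illustrative names), the median-based estimate of Equation (1) only needs a kmer-lookup callable, such as the `count` method from the index sketch above:

```python
# Sketch of coverage estimation: the coverage of contig c in one sample is the
# median of count(x) over all kmers x of c, with count(x) = 0 when x is absent
# from that sample's index (Equation (1)).
from statistics import median

K = 31  # must match the kmer length used to build the index

def estimate_coverage(contig, lookup):
    """lookup: callable mapping a kmer string to its count in the sample."""
    if len(contig) < K:
        return 0.0
    return median(lookup(contig[i:i + K]) for i in range(len(contig) - K + 1))
```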
2.2 Contig Representation
Given a set of contigs assembled from a single or multiple metagenomic samples, our goal is to bin together
the contigs that belong to the same class (e.g. species or strain). Similar to existing methods, e.g. CONCOCT,
we first represent each contig as a multi-dimensional vector using both its sequence composition and coverage
profile across multiple samples, where our coverages are approximated using kmer counts instead of read
mapping, as described above. Namely, given M reference samples (either from the same study or from a
public database), our coverage profile is the median count of the contig kmers in each sample, while the
composition profile is the normalized frequency of each possible tetra-mer in the contig and its reverse
complement (resulting in a total of F = 136 such features); the normalization of composition features is
done according to the CONCOCT procedure (please see [3] for details). Therefore, each contig is a vector
V = [c_1, ..., c_M, f_1, ..., f_F], where c_i is the median kmer count in sample i and f_j is the frequency of
tetra-mer j in the contig sequence.
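To make the feature layout concrete, here is a small sketch (illustrative names; it omits the CONCOCT-style normalization that the paper adopts) assembling V from the per-sample median coverages and the 136 canonical tetra-mer frequencies:

```python
# Sketch of the contig representation V = [c_1, ..., c_M, f_1, ..., f_F]:
# M kmer-based median coverages followed by F = 136 tetra-mer frequencies,
# where each tetra-mer is pooled with its reverse complement.
from itertools import product

_COMP = str.maketrans("ACGT", "TGCA")

def _canonical(tetramer):
    rc = tetramer.translate(_COMP)[::-1]
    return min(tetramer, rc)

# 4^4 = 256 tetra-mers collapse to 136 canonical ones (16 equal their own reverse complement)
CANONICAL_TETRAMERS = sorted({_canonical("".join(p)) for p in product("ACGT", repeat=4)})

def contig_vector(contig_seq, median_coverages):
    """median_coverages: list of M per-sample median kmer counts for this contig."""
    counts = dict.fromkeys(CANONICAL_TETRAMERS, 0)
    for i in range(len(contig_seq) - 3):
        tet = contig_seq[i:i + 4]
        if set(tet) <= set("ACGT"):          # skip tetra-mers containing N or other symbols
            counts[_canonical(tet)] += 1
    total = sum(counts.values()) or 1
    freqs = [counts[t] / total for t in CANONICAL_TETRAMERS]
    return list(median_coverages) + freqs    # length M + 136
```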

2.3 Clustering Algorithm
Given the resulting vector representations, we cluster the contigs using a Bayesian Gaussian mixture model
(GMM) with a Dirichlet prior. In brief, we define a mixture distribution p of K Gaussian components over
n data points x_i ∈ R^d and (unobserved) assignment labels z_i ∈ {0, 1}^K for i = 1, ..., n. Our model is the
product of a likelihood term

p(z, X | θ) = ∏_{i=1}^{n} ∏_{k=1}^{K} π_k^{z_ik} N(x_i | µ_k, Λ_k^{-1})^{z_ik}

and a prior term

p(θ) = Dir(π | α_0) ∏_{k=1}^{K} N(µ_k | m_0, (β_0 Λ_k)^{-1}) Wi(Λ_k | L_0, ν_0).
Here, X ∈ R^{d×n} is the matrix of data points and N(· | µ_k, Λ_k^{-1}) is a multivariate Gaussian with mean µ_k
and inverse covariance matrix Λ_k. The π ∈ ∆^{K-1} form a vector of cluster weights. Together, the µ_k, Λ_k, π
form the parameter vector θ of the likelihood. The prior over θ is a product of a Dirichlet with hyper-parameter
α_0 ∈ ∆^{K-1}, a multivariate normal with hyper-parameters m_0 ∈ R^d and β_0 > 0, and a Wishart
distribution parametrized by L_0 ∈ R^{d×d} p.s.d. and ν_0 > 0.
We perform inference by maximizing the marginal log-likelihood log p(X) using variational inference. In
brief, we maximize the evidence lower bound

log p(X) ≥ E_{q(z,θ)}[log p(z, X, θ) − log q(z, θ)]

over the set of approximating distributions q. By our choice of conjugate prior, the posterior p(θ, z | X)
and hence the optimal q have the same form, which factors as q(z | θ) q(θ). We optimize the bound using
variational expectation-maximization, which consists of repeatedly updating q(z | θ) and q(θ). Each
update has a closed-form solution by our choice of conjugate prior. We conclude the algorithm by assigning
each data point to its maximum a-posteriori label according to q(z | θ). We refer the reader to Section 21.6.1
of the standard textbook of Murphy (2012) [22] for the full derivation of this algorithm.
At a high level, the above model is very similar to automatic relevance determination (ARD), which
is used by CONCOCT. We have found our approach to perform better in practice than ARD, especially
for automatically determining the number of clusters in the data. Both algorithms are implemented in our
software package. Other clustering methods can also be easily plugged into GATTACA’s binning
pipeline.
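Since the model above is a standard finite Dirichlet-prior Gaussian mixture fit with variational EM, one way to prototype this step is scikit-learn's BayesianGaussianMixture. The sketch below is such a stand-in, not GATTACA's own implementation; the cluster upper bound and concentration value alpha0 are illustrative.

```python
# Sketch of the clustering step: a variational Bayesian GMM with a finite
# Dirichlet prior over mixture weights. max_clusters is only an upper bound on
# the number of clusters; a small alpha0 lets the model switch off unused components.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def cluster_contigs(V, max_clusters=100, alpha0=1e-3, seed=0):
    """V: (n_contigs, M + 136) array of contig feature vectors."""
    gmm = BayesianGaussianMixture(
        n_components=max_clusters,
        weight_concentration_prior_type="dirichlet_distribution",  # finite Dirichlet prior
        weight_concentration_prior=alpha0,
        covariance_type="full",
        max_iter=500,
        random_state=seed,
    )
    return gmm.fit_predict(np.asarray(V))  # maximum a-posteriori label per contig
```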
2.4 Sample Selection
Given a query sample, Q, we would like to select n samples from the public database that can provide
discriminatory features for clustering the contigs of Q (where the features represent the coverage of the
contigs in the respective samples). Intuitively, the selected samples must share some content with Q (have
relevance), as well as have pairwise diversity among themselves to guarantee coverage of different contigs of
Q. Similar relevance and diversity concepts can be found in online recommendation systems (e.g. for articles
or music [1, 8]).
By representing each sample as a set of overlapping kmers, we apply the Jaccard coefficient to measure
their similarity, where the Jaccard coefficient J(A, B) = |A ∩ B| / |A ∪ B| for two sets A and B. Then, we can consider
relevant the samples that are within a certain distance from Q under Jaccard (e.g. all samples S for which
J(S, Q) > 0). However, computing the Jaccard distances directly on the kmer sets would be inefficient for
large databases. Therefore, we apply the min-wise independent permutations (MinHash) LSH scheme [7] to
create small fingerprints for each sample set instead, defined as follows.

Let U be the ground set of all possible set items. Given a random permutation π of the indices of U and a set
X, let h_π(X) = min_{x ∈ X} π(x). The MinHash LSH family H consists of all such functions, one for each choice
of π. It can be easily shown that for a given h chosen uniformly at random, Pr[h_π(A) = h_π(B)] = J(A, B)
(see [7] for details). Due to the high variance in the probability of collision, we concatenate L different
hash functions from the family H chosen independently at random to form the fingerprint. Then, given the
number of hash collisions among the chosen L functions, c, the ratio c/L can also be used as an unbiased
estimator for J(A, B).
To summarize, given the kmer set K = {s_0, s_1, ..., s_{n-1}} of some sample S and L hash functions from H,
we construct the MinHash fingerprint vector F = [f_0, f_1, ..., f_{L-1}], such that the fingerprint entry f_i is the
minimum set element under hash function h_i:

f_i = min{h_i(s_0), h_i(s_1), ..., h_i(s_{n-1})}.
Now given the fingerprints, we can define relevance between a sample S and Q as simply the number
of entries shared by their fingerprints. By indexing the fingerprints of all the samples in the database into
L tables (based on the value of each fingerprint entry, respectively), we can find all the samples that share
at least one fingerprint entry in common with Q using simple lookups, as well as rank them according to
relevance.
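The sketch below illustrates fingerprinting, the c/L similarity estimate, and relevance ranking via L lookup tables. Salted hashes stand in for true random permutations, and the choice of L = 200 and all names are illustrative, not the paper's settings.

```python
# MinHash sketch: L salted hash functions stand in for L random permutations.
# Relevance between two samples is the number of shared fingerprint entries c,
# and c/L estimates the Jaccard coefficient J(A, B).
import hashlib
from collections import Counter

def minhash_fingerprint(kmer_set, L=200):
    fp = []
    for i in range(L):
        salt = i.to_bytes(2, "little")
        fp.append(min(
            int.from_bytes(hashlib.blake2b(k.encode(), salt=salt).digest()[:8], "little")
            for k in kmer_set
        ))  # f_i = min over the sample's kmers under hash h_i
    return fp

def build_lsh_tables(fingerprints):
    """fingerprints: dict sample_id -> fingerprint; returns one lookup table per entry."""
    L = len(next(iter(fingerprints.values())))
    tables = [dict() for _ in range(L)]
    for sid, fp in fingerprints.items():
        for i, entry in enumerate(fp):
            tables[i].setdefault(entry, []).append(sid)
    return tables

def relevant_samples(query_fp, tables):
    """Rank database samples by the number of fingerprint entries shared with the query."""
    hits = Counter()
    for i, entry in enumerate(query_fp):
        hits.update(tables[i].get(entry, []))
    return hits.most_common()   # [(sample_id, shared_entries), ...] in decreasing relevance
```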
Finally, if the number of relevant samples is too high, we can reduce our panel using the diversity criterion.
That is, given all the relevant samples, we can select n samples that maximize the diversity of the set. This
problem is known as the dispersion problem [10], where the objective is to locate k points among n such
that some function of the distances between the k points is maximized. One popular optimality criterion is
MAX-MIN, which maximizes the minimum distance between a pair of points. This problem is known to be
NP-hard; however, an efficient greedy heuristic exists for the MAX-MIN dispersion problem when
the distances satisfy the triangle inequality, with a provable performance guarantee of 2 [25]. Given two samples
A and B, we define their diversity as D(A, B) = 1 − J(A, B) and apply the greedy algorithm of [25] to find
the n samples. While this procedure is simple and can be efficiently used to detect samples with distinct kmer
sets, its main limitation is that it cannot find samples which differ only in kmer frequency (since
frequency does not affect the Jaccard distance), even though such samples could also generate discriminatory features.
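A sketch of the greedy MAX-MIN heuristic over the diversity D(A, B) = 1 − J(A, B) follows. The Jaccard values would in practice come from the fingerprint-based estimator above, and the seeding choice (e.g. the most relevant sample) is illustrative.

```python
# Greedy MAX-MIN dispersion (2-approximation under the triangle inequality):
# repeatedly add the candidate whose minimum diversity to the current panel
# is largest, until n samples are selected.
def select_diverse_panel(candidates, jaccard, n, seed):
    """candidates: sample ids; jaccard(a, b): (estimated) Jaccard similarity; seed: first pick."""
    panel = [seed]
    remaining = [s for s in candidates if s != seed]
    while remaining and len(panel) < n:
        best = max(remaining,
                   key=lambda s: min(1.0 - jaccard(s, p) for p in panel))  # D = 1 - J
        panel.append(best)
        remaining.remove(best)
    return panel
```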
3 Results
3.1 Datasets
Synthetic datasets. We used two synthetic datasets generated by Alneberg et al. [3] from the 16S rRNA
samples of the Human Microbiome Project (HMP) [29]. The first dataset ("Species-Mock") consists of 96 samples
containing a mixture of 101 different species (without strain-level variation), while the second dataset
("Strain-Mock") consists of 64 samples comprising a mixture of 20 different organisms, of which some
represent strains of the same species (e.g., this dataset includes five different E. coli strains). The relative
abundance profiles of the species and strains in each sample were assigned according to the distribution of
the 101 and 20 most abundant organisms in the original HMP samples, respectively. Reads (100-bp long)
were simulated from random positions of the genomes present in each sample based on their relative
abundance, for a total of 7.75 million reads and 11.75 million reads per "Species-Mock" and "Strain-Mock"
sample, respectively. Both datasets contain the set of contigs co-assembled across all the samples by Alneberg
et al. using the Ray assembler [5], and partitioned into fragments of 10 kilobases when appropriate. We used
the default minimum contig length of 1000-bp when running CONCOCT, MaxBin, and GATTACA; this
parameter was set to 1500-bp for MetaBat, which is the smallest length supported by this method. As a
result, the "Species-Mock" included 37,627 valid contigs and the "Strain-Mock" included 9,411 valid contigs.
