Fast Metagenomic Binning via Hashing and Bayesian Clustering.
Citations
SPAdes, a new genome assembly algorithm and its applications to single-cell sequencing (7th Annual SFAF Meeting, 2012)
Capturing sequence diversity in metagenomes with comprehensive and scalable probe design.
Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons
Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree
References
The Sequence Alignment/Map format and SAMtools
Fast gapped-read alignment with Bowtie 2
SPAdes, a new genome assembly algorithm and its applications to single-cell sequencing (7th Annual SFAF Meeting, 2012)
CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes
The human microbiome project.
Related Papers (5)
Low-density locality-sensitive hashing boosts metagenomic binning
Frequently Asked Questions (16)
Q2. Why do the authors concatenate different hash functions from the family H?
Due to the high variance in the probability of collision, the authors concatenate L different hash functions from the family H chosen independently at random to form the fingerprint.
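A minimal sketch of how such a fingerprint could be built, assuming H is a min-wise hashing family over kmer sets. The answer above names only the family H, so that choice, the seeded BLAKE2 hash, and the function names `kmers`, `h`, and `fingerprint` are illustrative assumptions, not the authors' implementation.

```python
import hashlib


def kmers(seq, k):
    """Return the set of overlapping kmers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def h(kmer, seed):
    """One member of the (assumed) hash family, selected by an integer seed."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def fingerprint(kmer_set, L):
    """Concatenate L min-hash values into one fingerprint.

    Each entry is the minimum hash of the set under a different seed.
    The set must be non-empty.
    """
    return tuple(min(h(x, seed) for x in kmer_set) for seed in range(L))


def matching_fraction(fp_a, fp_b):
    """Fraction of fingerprint entries that collide."""
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)
```

A single hash function collides or not, which is a very noisy signal. Averaging collisions over L independent entries lowers the variance of the estimate.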
Q3. How did the authors assemble the contigs of the "Species-Mock"?
The authors used the SPAdes [4] assembler with default parameters to assemble the contigs of individual samples (cutting contigs into 10-kilobase fragments and filtering out contigs shorter than 1000 bp, as in simulation).
Q4. How many reads were simulated from random positions of the genomes?
Reads (100-bp long) were simulated from random positions of the genomes present in the sample based on their relative abundance, for a total of 7.75 million reads and 11.75 million reads in each "Species-Mock" and "Strain-Mock" sample, respectively.
Q5. How can the authors measure the similarity of the samples?
By representing each sample as a set of overlapping kmers, the authors apply the Jaccard coefficient to measure their similarity, where the Jaccard coefficient is J(A,B) = |A∩B| / |A∪B| for two sets A and B.
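The Jaccard coefficient on kmer sets can be sketched in a few lines. The function names and the toy sequences are illustrative, and returning 0.0 for two empty sets is a convention assumed here.

```python
def kmer_set(seq, k):
    """Represent a sequence as its set of overlapping kmers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B|; 0.0 if both sets are empty (assumed convention)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


a = kmer_set("ACGTA", 3)  # {"ACG", "CGT", "GTA"}
b = kmer_set("CGTAC", 3)  # {"CGT", "GTA", "TAC"}
similarity = jaccard(a, b)  # 2 shared kmers out of 4 distinct kmers
```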
Q6. How did the authors use Prodigal to classify contigs?
In addition to CheckM, the authors also applied Prodigal [14] to predict and functionally annotate genes on their sample contigs, and then used RPS-BLAST to assign COG annotations to the protein sequences (using the NCBI COG database).
Q7. How many bits are sufficient for storing the kmer counts?
The authors found 8 bits to be sufficient for storing the kmer counts (and since many counts are small, these can be compressed even further using techniques such as varint encoding).
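A minimal sketch of storing kmer counts in 8 bits each. Capping counts at 255 (saturation) is an assumption made here, since the answer above does not say how larger counts are handled, and the function names are illustrative. The varint compression mentioned above is not shown.

```python
from collections import Counter

MAX_COUNT = 255  # largest value that fits in 8 bits


def saturating_counts(seq, k):
    """Count kmers, capping each count at 255 (assumed saturation policy)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: min(c, MAX_COUNT) for kmer, c in counts.items()}


def pack_counts(counts):
    """Store the counts as one byte per kmer, in sorted kmer order."""
    keys = sorted(counts)
    return keys, bytes(counts[x] for x in keys)
```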
Q8. How long would it take to map the samples in the cohort?
If the samples in the cohort have already been indexed (for publicly available data or multi-sample studies reusing the same sample cohort), then GATTACA would only need 4.3h to finish (resulting in a roughly 20× speedup).
Q9. How fast is the coverage estimation of a contig?
In terms of speedup, the authors found their coverage estimation time to be at least an order of magnitude faster (approximately 20×) when the index is computed offline (e.g. for recyclable public reference samples) and about 6× faster when the kmers are counted on-the-fly (e.g. for private samples used only once), when compared to read mapping.
Q10. How long would it take to map the 95 HMP samples against these contigs?
Ignoring the BWT indexing time, it would take CONCOCT roughly 81h to map the 95 HMP samples against these contigs, while GATTACA would require only 12.7h.
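The speedup figures quoted in these answers can be checked with simple arithmetic. This sketch assumes that the 4.3h figure from Q8 and the 81h and 12.7h figures above refer to the same experiment. The variable names are illustrative.

```python
read_mapping_h = 81.0  # CONCOCT read mapping, BWT indexing time ignored
on_the_fly_h = 12.7    # GATTACA, kmers counted on-the-fly
preindexed_h = 4.3     # GATTACA, cohort samples already indexed (from Q8)

speedup_on_the_fly = read_mapping_h / on_the_fly_h  # about 6.4, i.e. "about 6x"
speedup_preindexed = read_mapping_h / preindexed_h  # about 18.8, i.e. "roughly 20x"
```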
Q11. What are the main reasons for the short read lengths of modern sequencing instruments?
The short read lengths of modern sequencing instruments – combined with various inherent difficulties associated with complex bacterial environments – make it very difficult to perform simple tasks such as accurately identifying bacterial strains, recovering their genomic sequences, and assessing their abundance.
Q12. How many samples were resequenced from the original 11 samples?
The authors downloaded 1.372 × 10^8 100-bp short reads from the SRA052203 NCBI archive as 18 separate samples (of which 7 were resequenced from the original 11 samples).
Q13. What is the function of the prior term p(z, x)?
By their choice of conjugate prior, the posterior p(θ, z|X) and hence the optimal q have the same form, which factors over q(z|θ)q(θ).
Q14. How can the authors find samples that share at least one fingerprint entry in common with Q?
By indexing the fingerprints of all the samples in the database into L tables (based on the value of each fingerprint entry, respectively), the authors can find all the samples that share at least one fingerprint entry in common with Q using simple lookups, as well as rank them according to relevance.
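The table-based lookup can be sketched as follows. The data layout (one dictionary per fingerprint position) and the ranking by number of matching entries are assumptions made for illustration. The answer above says only that samples are ranked "according to relevance".

```python
from collections import Counter, defaultdict


def build_index(fingerprints):
    """Index fingerprints into L tables.

    fingerprints: dict mapping sample id -> tuple of L fingerprint entries.
    Table i maps the value of entry i to the samples that have that value.
    """
    L = len(next(iter(fingerprints.values())))
    tables = [defaultdict(set) for _ in range(L)]
    for sample_id, fp in fingerprints.items():
        for i, value in enumerate(fp):
            tables[i][value].add(sample_id)
    return tables


def query(tables, q_fp):
    """Find samples sharing at least one entry with q_fp.

    Results are ranked by the number of shared entries (assumed relevance score).
    """
    hits = Counter()
    for table, value in zip(tables, q_fp):
        for sample_id in table.get(value, ()):
            hits[sample_id] += 1
    return hits.most_common()


db = {"s1": (1, 2, 3), "s2": (1, 9, 9), "s3": (7, 8, 9)}
ranked = query(build_index(db), (1, 2, 3))
```

Each query costs L dictionary lookups, independent of the number of samples that do not match.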
Q15. What motivates the need to further define appropriate sample selection criteria?
This motivates the need to additionally define appropriate sample selection criteria, for which the authors propose two metrics: (1) relevance and (2) diversity.
Q16. What is the way to estimate the coverage of a set of counts?
Given a contig c and an index I of a cohort sample, the authors estimate the coverage of c in this sample by performing lookups in I for each kmer in c and then computing the median of the resulting counts.
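A minimal sketch of this median-based estimate, assuming the index is a plain dictionary from kmer to count and that kmers absent from the index count as zero. Both are assumptions, and `estimate_coverage` is an illustrative name.

```python
from statistics import median


def estimate_coverage(contig, index, k):
    """Estimate contig coverage as the median count of its kmers in the index."""
    counts = [
        index.get(contig[i:i + k], 0)  # absent kmers count as 0 (assumption)
        for i in range(len(contig) - k + 1)
    ]
    return median(counts) if counts else 0


sample_index = {"ACG": 4, "CGT": 6, "GTA": 50}
coverage = estimate_coverage("ACGTA", sample_index, 3)  # median of [4, 6, 50]
```

The median is less sensitive than the mean to a few kmers with inflated counts, such as the 50 in this toy index.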