
Showing papers in "bioRxiv in 2016"


Posted ContentDOI
22 Dec 2016-bioRxiv
TL;DR: Tests on both synthetic and real reads show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid assemblers, even when long read depth and accuracy are low.
Abstract: The Illumina DNA sequencing platform generates accurate but short reads, which can be used to produce accurate but fragmented genome assemblies. Pacific Biosciences and Oxford Nanopore Technologies DNA sequencing platforms generate long reads that can produce more complete genome assemblies, but the sequencing is more expensive and error prone. There is significant interest in combining data from these complementary sequencing technologies to generate more accurate "hybrid" assemblies. However, few tools exist that truly leverage the benefits of both types of data, namely the accuracy of short reads and the structural resolving power of long reads. Here we present Unicycler, a new tool for assembling bacterial genomes from a combination of short and long reads, which produces assemblies that are accurate, complete and cost-effective. Unicycler builds an initial assembly graph from short reads using the de novo assembler SPAdes and then simplifies the graph using information from short and long reads. Unicycler utilises a novel semi-global aligner, which is used to align long reads to the assembly graph. Tests on both synthetic and real reads show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid assemblers, even when long read depth and accuracy are low. Unicycler is open source (GPLv3) and available at github.com/rrwick/Unicycler.

1,750 citations


Posted ContentDOI
20 Jun 2016-bioRxiv
TL;DR: The FGSEA method is presented, which estimates arbitrarily low GSEA P-values with higher accuracy and much faster than other implementations, together with a polynomial algorithm to calculate GSEA P-values exactly.
Abstract: Preranked gene set enrichment analysis (GSEA) is a widely used method for interpretation of gene expression data in terms of biological processes. Here we present the FGSEA method, which is able to estimate arbitrarily low GSEA P-values with higher accuracy and much faster than other implementations. We also present a polynomial algorithm to calculate GSEA P-values exactly, which we use to practically confirm the accuracy of the method.

1,433 citations


Posted ContentDOI
20 Jun 2016-bioRxiv
TL;DR: It is shown that it is possible to make hundreds of thousands of permutations in a few minutes, which leads to very accurate p-values and, in turn, allows standard FDR correction procedures to be applied, which are more accurate than the ones currently used.
Abstract: Gene set enrichment analysis is a widely used tool for analyzing gene expression data. However, current implementations are slow due to the large number of samples required for the analysis to have good statistical power. In this paper we present a novel algorithm that efficiently reuses one sample multiple times and thus speeds up the analysis. We show that it is possible to make hundreds of thousands of permutations in a few minutes, which leads to very accurate p-values. This, in turn, allows applying standard FDR correction procedures, which are more accurate than the ones currently used. The method is implemented in the form of an R package and is freely available at https://github.com/ctlab/fgsea.

1,221 citations
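
The permutation-based P-value estimation described in the two FGSEA entries above can be illustrated with a naive sketch. This is not the FGSEA algorithm (which, per the abstract, reuses samples to go much faster); it is the plain baseline of drawing random gene sets of the same size and comparing their enrichment scores with the observed one. The function names, the weight exponent of 1, and the add-one correction are assumptions made for illustration.

```python
import random

def enrichment_score(ranked_stats, gene_set):
    """Classic GSEA running-sum statistic (weight exponent 1, an assumption).

    ranked_stats: list of (gene, statistic) sorted by statistic, descending.
    gene_set: set of gene names.
    """
    hits_total = sum(abs(s) for g, s in ranked_stats if g in gene_set)
    n_miss = sum(1 for g, _ in ranked_stats if g not in gene_set)
    if hits_total == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene, stat in ranked_stats:
        if gene in gene_set:
            running += abs(stat) / hits_total
        else:
            running -= 1.0 / n_miss
        # Keep the signed value with the largest absolute deviation from zero.
        if abs(running) > abs(best):
            best = running
    return best

def permutation_pvalue(ranked_stats, gene_set, n_perm=1000, seed=0):
    """Empirical P-value for a positive score, from random same-size gene sets."""
    rng = random.Random(seed)
    genes = [g for g, _ in ranked_stats]
    observed = enrichment_score(ranked_stats, gene_set)
    k = len(gene_set & set(genes))
    as_extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(genes, k))
        if enrichment_score(ranked_stats, random_set) >= observed:
            as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (as_extreme + 1) / (n_perm + 1)
```

With this baseline the smallest reportable P-value is 1 / (n_perm + 1), which is why estimating very low P-values by brute force needs very many permutations.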


Posted ContentDOI
15 Oct 2016-bioRxiv
TL;DR: UNOISE2 is described, an updated version of the UNOISE algorithm for denoising (error-correcting) Illumina amplicon reads and it is shown that it has comparable or better accuracy than DADA2.
Abstract: Amplicon sequencing of tags such as 16S and ITS ribosomal RNA is a popular method for investigating microbial populations. In such experiments, sequence errors caused by PCR and sequencing are difficult to distinguish from true biological variation. I describe UNOISE2, an updated version of the UNOISE algorithm for denoising (error-correcting) Illumina amplicon reads and show that it has comparable or better accuracy than DADA2.

1,032 citations


Posted ContentDOI
06 Jan 2016-bioRxiv
TL;DR: This analysis updates the widely-cited 10:1 ratio, showing that the number of bacteria in the authors' bodies is actually of the same order as the number of human cells.
Abstract: We critically revisit the "common knowledge" that bacteria outnumber human cells by a ratio of at least 10:1 in the human body. We found the total number of bacteria in the "reference man" to be 3.9·10^13, with an uncertainty (SEM) of 25%, and a variation over the population (CV) of 52%. For human cells we identify the dominant role of the hematopoietic lineage to the total count of body cells (≈90%), and revise past estimates to reach a total of 3.0·10^13 human cells in the 70 kg "reference man" with 2% uncertainty and 14% CV. Our analysis updates the widely-cited 10:1 ratio, showing that the number of bacteria in our bodies is actually of the same order as the number of human cells. Indeed, the numbers are similar enough that each defecation event may flip the ratio to favor human cells over bacteria.

844 citations
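
As a quick check of the arithmetic behind the entry above, the two totals quoted in the abstract can be divided directly. The variable names are illustrative; only the two counts come from the abstract.

```python
bacteria = 3.9e13      # total bacteria in the "reference man" (from the abstract)
human_cells = 3.0e13   # total human cells in the 70 kg "reference man" (from the abstract)

ratio = bacteria / human_cells
print(f"bacteria : human cells = {ratio:.1f} : 1")
```

The result is a ratio of about 1.3:1, far from the widely cited 10:1.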


Posted ContentDOI
23 Feb 2016-bioRxiv
TL;DR: A collaborative effort in which a centralized analysis pipeline is applied to a SCZ cohort, finding support at a suggestive level for nine additional candidate susceptibility and protective loci, which consist predominantly of CNVs mediated by non-allelic homologous recombination (NAHR).
Abstract: Genomic copy number variants (CNVs) have been strongly implicated in the etiology of schizophrenia (SCZ). However, apart from a small number of risk variants, elucidation of the CNV contribution to risk has been difficult due to the rarity of risk alleles, all occurring in less than 1% of cases. We sought to address this obstacle through a collaborative effort in which we applied a centralized analysis pipeline to a SCZ cohort of 21,094 cases and 20,227 controls. We observed a global enrichment of CNV burden in cases (OR=1.11, P=5.7e-15), which persisted after excluding loci implicated in previous studies (OR=1.07, P=1.7e-6). CNV burden is also enriched for genes associated with synaptic function (OR = 1.68, P = 2.8e-11) and neurobehavioral phenotypes in mouse (OR = 1.18, P= 7.3e-5). We identified genome-wide significant support for eight loci, including 1q21.1, 2p16.3 (NRXN1), 3q29, 7q11.2, 15q13.3, distal 16p11.2, proximal 16p11.2 and 22q11.2. We find support at a suggestive level for nine additional candidate susceptibility and protective loci, which consist predominantly of CNVs mediated by non-allelic homologous recombination (NAHR).

764 citations


Posted ContentDOI
30 Jun 2016-bioRxiv
TL;DR: Suite2p is introduced: a fast, accurate and complete pipeline that registers raw movies, detects active cells, extracts their calcium traces and infers their spike times, and recovers ~2 times more cells than the previous state-of-the-art method.
Abstract: Two-photon microscopy of calcium-dependent sensors has enabled unprecedented recordings from vast populations of neurons. While the sensors and microscopes have matured over several generations of development, computational methods to process the resulting movies remain inefficient and can give results that are hard to interpret. Here we introduce Suite2p: a fast, accurate and complete pipeline that registers raw movies, detects active cells, extracts their calcium traces and infers their spike times. Suite2p runs on standard workstations, operates faster than real time, and recovers ~2 times more cells than the previous state-of-the-art method. Its low computational load allows routine detection of ~10,000 cells simultaneously with standard two-photon resonant-scanning microscopes. Recordings at this scale promise to reveal the fine structure of activity in large populations of neurons or large populations of subcellular structures such as synaptic boutons.

620 citations


Posted ContentDOI
31 May 2016-bioRxiv
TL;DR: Cellular characterization of the immune infiltrates revealed a role of cancer-germline antigens in spontaneous immunity and showed that tumor genotypes determine immunophenotypes and tumor escape mechanisms; a scoring scheme for quantification, termed the immunophenoscore, was developed.
Abstract: Current major challenges in cancer immunotherapy include identification of patients likely to respond to therapy and development of strategies to treat non-responders. To address these problems and facilitate understanding of the tumor-immune cell interactions, we inferred the cellular composition and functional orientation of immune infiltrates, and characterized tumor antigens in 19 solid cancers from The Cancer Genome Atlas (TCGA). Decomposition of immune infiltrates revealed prognostic cellular profiles for distinct cancers, and showed that the tumor genotypes determine immunophenotypes and tumor escape mechanisms. The genotype-immunophenotype relationships were evident at the high-level view (mutational load, tumor heterogeneity) and at the low-level view (mutational origin) of the genomic landscapes. Using a random forest approach, we identified determinants of immunogenicity and developed an immunophenoscore based on the infiltration of immune subsets and expression of immunomodulatory molecules. The immunophenoscore predicted response to immunotherapy with anti-CTLA-4 and anti-PD-1 antibodies in two validation cohorts. Our findings and the database we developed (TCIA-The Cancer Immunome Atlas, http://tcia.at) may help inform cancer immunotherapy and facilitate the development of precision immuno-oncology.

615 citations


Posted ContentDOI
09 Sep 2016-bioRxiv
TL;DR: The SINTAX algorithm predicts taxonomy by using k-mer similarity to identify the top hit in a reference database and provides bootstrap confidence for all ranks in the prediction, achieving comparable or better accuracy to the RDP Naive Bayesian Classifier with a simpler algorithm that does not require training.
Abstract: Metagenomics experiments often characterize microbial communities by sequencing the ribosomal 16S and ITS regions. Taxonomy prediction is a fundamental step in such studies. The SINTAX algorithm predicts taxonomy by using k-mer similarity to identify the top hit in a reference database and provides bootstrap confidence for all ranks in the prediction. SINTAX achieves comparable or better accuracy to the RDP Naive Bayesian Classifier with a simpler algorithm that does not require training. Most tested methods are shown to have high rates of over-classification errors where novel taxa are incorrectly predicted to have known names.

547 citations
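
The idea in the SINTAX entry above (k-mer similarity to find a top hit, with bootstrap confidence at every rank) can be sketched as follows. This is a minimal illustration, not the published implementation: the word length of 8, the 32 subsampled k-mers, the 100 iterations, the tie-breaking rule, and all function names are assumptions chosen for the example, and the query is assumed to be at least k letters long.

```python
import random
from collections import Counter

def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify(query, references, k=8, n_boot=100, subsample=32, seed=0):
    """Predict taxonomy with per-rank bootstrap support.

    references: list of (sequence, taxonomy), where taxonomy is a tuple of
    rank names such as ("Bacteria", "Firmicutes", "Bacillus").
    """
    rng = random.Random(seed)
    ref_kmers = [(kmers(seq, k), tax) for seq, tax in references]
    query_kmers = sorted(kmers(query, k))
    top_hits = []
    for _ in range(n_boot):
        # Subsample query k-mers with replacement, then pick the reference
        # sharing the most of them (ties go to the earliest reference).
        sample = [rng.choice(query_kmers) for _ in range(subsample)]
        best = max(ref_kmers, key=lambda r: sum(w in r[0] for w in sample))
        top_hits.append(best[1])
    # Report the most frequent top hit; support at each rank is the fraction
    # of iterations whose top hit agrees with it down to that rank.
    predicted = Counter(top_hits).most_common(1)[0][0]
    support = []
    for depth in range(1, len(predicted) + 1):
        agree = sum(t[:depth] == predicted[:depth] for t in top_hits)
        support.append((predicted[depth - 1], agree / n_boot))
    return support
```

Because support is computed per rank, a query can receive high confidence at a higher rank and low confidence at a lower one, which is one way to flag the over-classification errors the abstract mentions.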


Posted ContentDOI
22 Sep 2016-bioRxiv
TL;DR: eggNOG-mapper, a tool for functional annotation of large sets of sequences based on fast orthology assignments using precomputed clusters and phylogenies from eggNOG, is presented, and its Gene Ontology predictions are benchmarked against two widely used homology-based approaches: BLAST and InterProScan.
Abstract: Orthology assignment is ideally suited for functional inference. However, because predicting orthology is computationally intensive at large scale, and most pipelines relatively inaccessible, less precise homology-based functional transfer is still the default for (meta-)genome annotation. We therefore developed eggNOG-mapper, a tool for functional annotation of large sets of sequences based on fast orthology assignments using precomputed clusters and phylogenies from eggNOG. To validate our method, we benchmarked Gene Ontology predictions against two widely used homology-based approaches: BLAST and InterProScan. Compared to BLAST, eggNOG-mapper reduced by 7% the rate of false positive assignments, and increased by 19% the ratio of curated terms recovered over all terms assigned per protein. Compared to InterProScan, eggNOG-mapper achieved similar proteome coverage and precision, while predicting on average 32 more terms per protein and increasing by 26% the rate of curated terms recovered over total term assignments per protein. Through strict orthology assignments, eggNOG-mapper further renders more specific annotations than possible from domain similarity only (e.g. predicting gene family names). eggNOG-mapper runs ~15x faster than BLAST and at least 2.5x faster than InterProScan. The tool is available standalone or as an online service at http://eggnog-mapper.embl.de.

533 citations


Posted ContentDOI
30 Jun 2016-bioRxiv
TL;DR: Kilosort models the recorded voltage as a sum of template waveforms triggered on the spike times, allowing overlapping spikes to be identified and resolved and is an important step towards fully automated spike sorting of multichannel electrode recordings.
Abstract: Advances in silicon probe technology mean that in vivo electrophysiological recordings from hundreds of channels will soon become commonplace. To interpret these recordings we need fast, scalable and accurate methods for spike sorting, whose output requires minimal time for manual curation. Here we introduce Kilosort, a spike sorting framework that meets these criteria, and show that it allows rapid and accurate sorting of large-scale in vivo data. Kilosort models the recorded voltage as a sum of template waveforms triggered on the spike times, allowing overlapping spikes to be identified and resolved. Rapid processing is achieved thanks to a novel low-dimensional approximation for the spatiotemporal distribution of each template, and to batch-based optimization on GPUs. A novel post-clustering merging step based on the continuity of the templates substantially reduces the requirement for subsequent manual curation operations. We compare Kilosort to an established algorithm on data obtained from 384-channel electrodes, and show superior performance, at much reduced processing times. Data from 384-channel electrode arrays can be processed in approximately real time. Kilosort is an important step towards fully automated spike sorting of multichannel electrode recordings, and is freely available at github.com/cortex-lab/Kilosort.
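
The generative model named in the Kilosort abstract, voltage as a sum of template waveforms triggered on spike times, can be sketched on a single channel. This is only an illustration of the model, not of Kilosort itself: the multichannel templates, the low-dimensional approximation, and the GPU optimization are all omitted, and the function name and data layout are hypothetical.

```python
def synthesize_voltage(n_samples, templates, spikes):
    """Sum template waveforms placed at their spike times.

    templates: dict mapping unit id -> list of waveform samples.
    spikes: list of (unit id, start sample, amplitude).
    """
    voltage = [0.0] * n_samples
    for unit, start, amplitude in spikes:
        for offset, value in enumerate(templates[unit]):
            t = start + offset
            if 0 <= t < n_samples:
                # Overlapping spikes simply add, sample by sample.
                voltage[t] += amplitude * value
    return voltage

templates = {"a": [0.0, -1.0, 0.5], "b": [0.0, -2.0, 1.0]}
spikes = [("a", 1, 1.0), ("b", 2, 1.0)]  # the two waveforms overlap in time
voltage = synthesize_voltage(6, templates, spikes)
```

Because the model is additive, subtracting one unit's fitted waveform from the recording leaves the other unit's waveform intact, which is what makes overlapping spikes resolvable in principle.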

Posted ContentDOI
16 Apr 2016-bioRxiv
TL;DR: This work quantified the genetic sharing of 25 brain disorders based on summary statistics from genome-wide association studies of 215,683 patients and 657,164 controls, and their relationship to 17 phenotypes from 1,191,588 individuals and performed extensive simulations to explore how power, diagnostic misclassification and phenotypic heterogeneity affect genetic correlations.
Abstract: Disorders of the brain exhibit considerable epidemiological comorbidity and frequently share symptoms, provoking debate about the extent of their etiologic overlap. Here we apply linkage disequilibrium score regression (LDSC) to quantify the extent of shared genetic contributions across 23 brain disorders (n=842,820), 11 quantitative and four dichotomous traits of interest (n=722,125) based on genome-wide association meta-analyses. Psychiatric disorders show substantial sharing of common variant risk, while many neurological disorders appear more distinct from one another, suggesting substantive differences in the specificity of the genetic etiology of these disorders. Further, we observe little evidence of widespread sharing of the common genetic risk between the neurological and psychiatric disorders studied. In addition, we identify significant sharing of genetic influences between certain quantitative measures and brain disorders, including major depressive disorder and neuroticism personality score. These results highlight the importance of common genetic variation as a source of risk for brain disorders and the potential of using heritability methods to obtain a more comprehensive view of the genetic architecture of brain phenotypes.

Posted ContentDOI
15 Aug 2016-bioRxiv
TL;DR: The R/Bioconductor package scater is developed to facilitate rigorous pre-processing, quality control, normalisation and visualisation of scRNA-seq data and provides a convenient, flexible workflow to process raw sequencing reads into a high-quality expression dataset ready for downstream analysis.
Abstract: Motivation: Single-cell RNA sequencing (scRNA-seq) is increasingly used to study gene expression at the level of individual cells. However, preparing raw sequence data for further analysis is not a straightforward process. Biases, artifacts, and other sources of unwanted variation are present in the data, requiring substantial time and effort to be spent on pre-processing, quality control (QC) and normalisation. Results: We have developed the R/Bioconductor package scater to facilitate rigorous pre-processing, quality control, normalisation and visualisation of scRNA-seq data. The package provides a convenient, flexible workflow to process raw sequencing reads into a high-quality expression dataset ready for downstream analysis. scater provides a rich suite of plotting tools for single-cell data and a flexible data structure that is compatible with existing tools and can be used as infrastructure for future software development. Availability: The open-source code, along with installation instructions, vignettes and case studies, is available through Bioconductor at http://bioconductor.org/packages/scater. Supplementary information: Supplementary material is available online at bioRxiv accompanying this manuscript, and all materials required to reproduce the results presented in this paper are available at dx.doi.org/10.5281/zenodo.60139.

Posted ContentDOI
10 Jun 2016-bioRxiv
TL;DR: A novel method for the differential analysis of RNA-Seq data that utilizes bootstrapping in conjunction with response error linear modeling to decouple biological variance from inferential variance is described.
Abstract: We describe a novel method for the differential analysis of RNA-Seq data that utilizes bootstrapping in conjunction with response error linear modeling to decouple biological variance from inferential variance. The method is implemented in an interactive shiny app called sleuth that utilizes kallisto quantifications and bootstraps for fast and accurate analysis of RNA-Seq experiments.

Posted ContentDOI
25 May 2016-bioRxiv
TL;DR: Centrifuge is a novel microbial classification engine that enables rapid, accurate and sensitive labeling of reads and quantification of species on desktop computers and makes it possible to index the entire NCBI non-redundant nucleotide sequence database with an index size of 69 GB, in contrast to k-mer based indexing schemes, which require far more extensive space.
Abstract: Centrifuge is a novel microbial classification engine that enables rapid, accurate and sensitive labeling of reads and quantification of species on desktop computers. The system uses an indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (4.2 GB for 4,078 bacterial and 200 archaeal genomes) and classifies sequences at very high speed, allowing it to process the millions of reads from a typical high-throughput DNA sequencing run within a few minutes. Together these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers. Because of its space-optimized indexing schemes, Centrifuge also makes it possible to index the entire NCBI non-redundant nucleotide sequence database (a total of 109 billion bases) with an index size of 69 GB, in contrast to k-mer based indexing schemes, which require far more extensive space. Centrifuge is available as free, open-source software from http://www.ccb.jhu.edu/software/centrifuge.
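
The indexing scheme named in the Centrifuge abstract builds on two textbook structures, the Burrows-Wheeler transform and the FM index. The sketch below shows the textbook versions only: it builds the BWT by sorting rotations (practical for short strings only) and counts pattern occurrences by backward search. It does not reproduce Centrifuge's space-optimized index, and the function names and the "$" sentinel are assumptions for illustration.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (short inputs only)."""
    text += "$"  # sentinel: assumed absent from text and smaller than any symbol
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def fm_index(bwt_text):
    """Build the first-column offsets and prefix occurrence counts."""
    first = {}
    total = 0
    for ch in sorted(set(bwt_text)):
        first[ch] = total  # number of symbols smaller than ch
        total += bwt_text.count(ch)
    occ = {ch: [0] * (len(bwt_text) + 1) for ch in first}
    for i, ch in enumerate(bwt_text):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ

def count_matches(pattern, first, occ, n):
    """Count occurrences of pattern by backward search over the index."""
    lo, hi = 0, n  # half-open range of sorted suffixes that match so far
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

transformed = bwt("banana")
first, occ = fm_index(transformed)
n_hits = count_matches("ana", first, occ, len(transformed))
```

Each pattern character costs a constant number of table lookups, so query time depends on the read length rather than on the size of the indexed text.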

Posted ContentDOI
25 Oct 2016-bioRxiv
TL;DR: How the field should evolve is described to produce the most meaningful answers to neuroscientific questions, and current and suggested best practices are outlined.
Abstract: Functional neuroimaging techniques have transformed our ability to probe the neurobiological basis of behaviour and are increasingly being applied by the wider neuroscience community. However, concerns have recently been raised that the conclusions drawn from some human neuroimaging studies are either spurious or not generalizable. Problems such as low statistical power, flexibility in data analysis, software errors, and lack of direct replication apply to many fields, but perhaps particularly to fMRI. Here we discuss these problems, outline current and suggested best practices, and describe how we think the field should evolve to produce the most meaningful answers to neuroscientific questions.

Posted ContentDOI
02 Sep 2016-bioRxiv
TL;DR: Single-Cell Consensus Clustering (SC3), a tool for unsupervised clustering of scRNA-seq data, achieves high accuracy and robustness by consistently integrating different clustering solutions through a consensus approach.
Abstract: Using single-cell RNA-seq (scRNA-seq), the full transcriptome of individual cells can be acquired, enabling a quantitative cell-type characterisation based on expression profiles. However, due to the large variability in gene expression, identifying cell types based on the transcriptome remains challenging. We present Single-Cell Consensus Clustering (SC3), a tool for unsupervised clustering of scRNA-seq data. SC3 achieves high accuracy and robustness by consistently integrating different clustering solutions through a consensus approach. Tests on twelve published datasets show that SC3 outperforms five existing methods while remaining scalable, as shown by the analysis of a large dataset containing 44,808 cells. Moreover, an interactive graphical implementation makes SC3 accessible to a wide audience of users, and SC3 aids biological interpretation by identifying marker genes, differentially expressed genes and outlier cells. We illustrate the capabilities of SC3 by characterising newly obtained transcriptomes from subclones of neoplastic cells collected from patients.
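
The consensus idea in the SC3 abstract, integrating several clustering solutions into one, can be sketched with a co-clustering matrix: for every pair of cells, the fraction of solutions that put them in the same cluster. This is a generic consensus-matrix illustration under that assumption, not SC3's actual pipeline; the function name is hypothetical and the final step of deriving clusters from the matrix is omitted.

```python
def consensus_matrix(clusterings):
    """Fraction of clusterings in which each pair of cells shares a cluster.

    clusterings: list of label lists, one label per cell, all the same length.
    """
    n_cells = len(clusterings[0])
    counts = [[0] * n_cells for _ in range(n_cells)]
    for labels in clusterings:
        for i in range(n_cells):
            for j in range(n_cells):
                # Only co-membership matters, so label names may differ
                # freely between clustering solutions.
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(clusterings) for c in row] for row in counts]

solutions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first, with labels swapped
    [0, 0, 0, 1],
]
consensus = consensus_matrix(solutions)
```

Entries near 1 mark pairs of cells that are grouped together regardless of the clustering solution, and entries near 0 mark pairs that are consistently separated.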

Posted ContentDOI
09 Nov 2016-bioRxiv
TL;DR: It is demonstrated that the coreceptor CD28 is strongly preferred over the TCR as a target for dephosphorylation by PD-1-recruited Shp2 phosphatase, suggesting that costimulatory pathways may play unexpected roles in regulating effector T cell function and therapeutic responses to anti-PD-L1/PD-1.
Abstract: Programmed death-1 (PD-1) is a co-inhibitory receptor that suppresses T cell activation and is an important cancer immunotherapy target. Upon activation by its ligand PD-L1, PD-1 is thought to suppress signaling through the T cell receptor (TCR). Here, by titrating the strength of PD-1 signaling in both biochemical reconstitution systems and in T cells, we demonstrate that the coreceptor CD28 is strongly preferred over the TCR as a target for dephosphorylation by PD-1-recruited Shp2 phosphatase. We also show that PD-1 colocalizes with the costimulatory receptor CD28 in plasma membrane microclusters but partially segregates from the TCR. These results reveal that PD-1 suppresses T cell function primarily by inactivating CD28 signaling, suggesting that costimulatory pathways may play unexpected roles in regulating effector T cell function and therapeutic responses to anti-PD-L1/PD-1.

Posted ContentDOI
21 Aug 2016-bioRxiv
TL;DR: In this paper, a convolutional neural network (CNN) was used for classification of functional MRI data for the purposes of medical image analysis and Alzheimer's disease prediction, achieving a reproducible accuracy of 99.9% and 98.84% for the fMRI and MRI pipelines, respectively.
Abstract: To extract patterns from neuroimaging data, various techniques, including statistical methods and machine learning algorithms, have been explored to ultimately aid in Alzheimer's disease diagnosis of older adults in both clinical and research applications. However, identifying the distinctions between Alzheimer's brain data and healthy brain data in older adults (age > 75) is challenging due to highly similar brain patterns and image intensities. Recently, cutting-edge deep learning technologies have been rapidly expanding into numerous fields, including medical image analysis. This work outlines state-of-the-art deep learning-based pipelines employed to distinguish Alzheimer's magnetic resonance imaging (MRI) and functional MRI data from normal healthy control data for the same age group. Using these pipelines, which were executed on a GPU-based high performance computing platform, the data were strictly and carefully preprocessed. Next, scale and shift invariant low- to high-level features were obtained from a high volume of training images using convolutional neural network (CNN) architecture. In this study, functional MRI data were used for the first time in deep learning applications for the purposes of medical image analysis and Alzheimer's disease prediction. These proposed and implemented pipelines, which demonstrate a significant improvement in classification output when compared to other studies, resulted in high and reproducible accuracy rates of 99.9% and 98.84% for the fMRI and MRI pipelines, respectively.

Posted ContentDOI
18 Jun 2016-bioRxiv
TL;DR: A protocol for genome-scale knockout and transcriptional activation screening using the CRISPR-Cas9 system is described, together with strategies for validating candidate genes from the screen by confirming the screening phenotype as well as the genetic perturbation.
Abstract: Forward genetic screens are powerful tools for the unbiased discovery and functional characterization of specific genetic elements associated with a phenotype of interest. Recently, the RNA-guided endonuclease Cas9 from the microbial immune system CRISPR (clustered regularly interspaced short palindromic repeats) has been adapted for genome-scale screening by combining Cas9 with guide RNA libraries. Here we describe a protocol for genome-scale knockout and transcriptional activation screening using the CRISPR-Cas9 system. Custom- or ready-made guide RNA libraries are constructed and packaged into lentivirus for delivery into cells for screening. As each screen is unique, we provide guidelines for determining screening parameters and maintaining sufficient coverage. To validate candidate genes identified from the screen, we further describe strategies for confirming the screening phenotype as well as genetic perturbation through analysis of indel rate and transcriptional activation. Beginning with library design, a genome-scale screen can be completed in 6-10 weeks followed by 3-4 weeks of validation.

Posted ContentDOI
16 Sep 2016-bioRxiv
TL;DR: A new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual neural networks that greatly outperforms existing methods and leads to much more accurate contact-assisted folding.
Abstract: Motivation: Protein contacts contain key information for the understanding of protein structure and function and thus, contact prediction from sequence is an important problem. Recently exciting progress has been made on this problem, but the predicted contacts for proteins without many sequence homologs are still of low quality and not very useful for de novo structure prediction. Method: This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual neural networks. The first residual network conducts a series of 1-dimensional convolutional transformations of sequential features; the second residual network conducts a series of 2-dimensional convolutional transformations of pairwise information including output of the first residual network, EC information and pairwise potential. By using very deep residual networks, we can model very complex relationships between sequence and contact map as well as long-range interdependency between contacts and thus, obtain high-quality contact prediction. Results: Our method greatly outperforms existing contact prediction methods and leads to much more accurate contact-assisted protein folding. Tested on the 105 CASP11 targets, 76 CAMEO test proteins and 398 membrane proteins, the average top L long-range prediction accuracy obtained by our method, the representative EC method CCMpred and the CASP11 winner MetaPSICOV is 0.47, 0.21 and 0.30, respectively; the average top L/10 long-range accuracy of our method, CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding using our predicted contacts as restraints can yield correct folds (i.e., TMscore>0.6) for 203 of the 579 test proteins, while that using MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 of them, respectively.
Further, our contact-assisted models also have much better quality than template-based models (especially for membrane proteins). Using our predicted contacts as restraints, we can (ab initio) fold 208 of the 398 membrane proteins with TMscore>0.5. By contrast, when the training proteins of our method are used as templates, homology modeling can only do so for 10 of them. One interesting finding is that even if we do not train our prediction models with any membrane proteins, our method works very well on membrane protein contact prediction. In the recent blind CAMEO benchmark, our method successfully folded one mainly-beta protein of 182 residues with a novel fold.
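
The "top L" and "top L/10 long-range accuracy" figures quoted in the abstract above can be read as follows: rank the predicted long-range residue pairs by score, keep the best L (or L/10), where L is the sequence length, and report the fraction that are true contacts. The sketch below assumes the common convention that "long-range" means a sequence separation of at least 24 residues; that threshold, the function name, and the data layout are assumptions, since the abstract does not define them.

```python
def top_accuracy(predictions, true_contacts, seq_len, divisor=1, min_sep=24):
    """Accuracy of the top L/divisor long-range contact predictions.

    predictions: list of (i, j, score) residue pairs.
    true_contacts: set of (i, j) pairs with i < j.
    min_sep: minimum sequence separation for "long-range" (assumed to be 24).
    """
    long_range = [(min(i, j), max(i, j), score)
                  for i, j, score in predictions
                  if abs(i - j) >= min_sep]
    long_range.sort(key=lambda p: p[2], reverse=True)
    top = long_range[:max(1, seq_len // divisor)]
    if not top:
        return 0.0
    correct = sum((i, j) in true_contacts for i, j, _ in top)
    return correct / len(top)

predictions = [
    (0, 30, 0.90), (1, 35, 0.80), (2, 10, 0.95),  # (2, 10) is short-range
    (3, 39, 0.70), (5, 29, 0.60), (0, 25, 0.10),
]
true_contacts = {(0, 30), (3, 39)}
accuracy = top_accuracy(predictions, true_contacts, seq_len=40, divisor=10)
```

In this toy case L/10 is 4, the short-range pair is excluded despite its high score, and two of the four retained pairs are true contacts.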

Posted ContentDOI
02 Nov 2016-bioRxiv
TL;DR: WhatsHap is a production-ready tool for highly accurate read-based phasing that was designed from the beginning to leverage third-generation sequencing technologies, whose long reads can span many variants and are therefore ideal for phasing.
Abstract: Read-based phasing allows the haplotype structure of a sample to be reconstructed purely from sequencing reads. While phasing is a required step for answering questions about population genetics and compound heterozygosity, and for aiding clinical decision making, there has been a lack of accurate, usable and standards-based software. WhatsHap is a production-ready tool for highly accurate read-based phasing. It was designed from the beginning to leverage third-generation sequencing technologies, whose long reads can span many variants and are therefore ideal for phasing. WhatsHap also works well with second-generation data, is easy to use and will phase not only SNVs, but also indels and other variants. It is unique in its ability to combine read-based with genetic phasing, which further improves accuracy if multiple related samples are provided.

Posted ContentDOI
15 Dec 2016-bioRxiv
TL;DR: The first algorithm for the identification of modified nucleotides without the need for prior training data is presented along with the open source software implementation, nanoraw, which accurately assigns contiguous raw nanopore signal to genomic positions, enabling novel data visualization and increasing power and accuracy for the discovery of covalently modified bases in native DNA.
Abstract: Advances in nanopore sequencing technology have enabled investigation of the full catalogue of covalent DNA modifications. We present the first algorithm for the identification of modified nucleotides without the need for prior training data along with the open source software implementation, nanoraw. Nanoraw accurately assigns contiguous raw nanopore signal to genomic positions, enabling novel data visualization, and increasing power and accuracy for the discovery of covalently modified bases in native DNA. Ground truth case studies utilizing synthetically methylated DNA show the capacity to identify three distinct methylation marks, 4mC, 5mC, and 6mA, in seven distinct sequence contexts without any changes to the algorithm. We demonstrate quantitative reproducibility simultaneously identifying 5mC and 6mA in native E. coli across biological replicates processed in different labs. Finally we propose a pipeline for the comprehensive discovery of DNA modifications in any genome without a priori knowledge of their chemical identities.

Posted ContentDOI
12 Sep 2016-bioRxiv
TL;DR: This work describes UCHIME2, an update of the popular UCHIME chimera detection algorithm with new modes optimized for high-resolution biological sequence reconstruction (“denoising”) and other applications, and shows that UCHIME2 achieves higher detection accuracy than previous methods.
Abstract: Amplicon sequencing generates chimeric reads which can cause spurious inferences of biological variation. I describe UCHIME2, an update of the popular UCHIME chimera detection algorithm with new modes optimized for high-resolution biological sequence reconstruction ("denoising") and other applications. I show that chimera frequency correlates inversely with divergence, that error-free chimera prediction from sequence is impossible in principle, and that UCHIME2 achieves higher detection accuracy than previous methods.

Posted ContentDOI
27 Oct 2016-bioRxiv
TL;DR: This work demonstrates CRISPR genome editing together with single-cell RNA sequencing as a new screening paradigm that combines key advantages of pooled and arrayed screens and links guide-RNA expression to the associated transcriptome responses in thousands of single cells.
Abstract: CRISPR-based genetic screens have revolutionized the search for new gene functions and biological mechanisms. However, widely used pooled screens are limited to simple read-outs of cell proliferation or the production of a selectable marker protein. Arrayed screens allow for more complex molecular read-outs such as transcriptome profiling, but they provide much lower throughput. Here we demonstrate CRISPR genome editing together with single-cell RNA sequencing as a new screening paradigm that combines key advantages of pooled and arrayed screens. This approach allowed us to link guide-RNA expression to the associated transcriptome responses in thousands of single cells using a straightforward and broadly applicable screening workflow.

Posted ContentDOI
26 Jul 2016-bioRxiv
TL;DR: A droplet-based system that enables 3′ mRNA counting of up to tens of thousands of single cells per sample is described and enables characterization of diverse biological systems with single cell mRNA analysis.
Abstract: Characterizing the transcriptome of individual cells is fundamental to understanding complex biological systems. We describe a droplet-based system that enables 3′ mRNA counting of up to tens of thousands of single cells per sample. Cell encapsulation in droplets takes place in ~6 minutes, with ~50% cell capture efficiency, up to 8 samples at a time. The speed and efficiency allow the processing of precious samples while minimizing stress to cells. To demonstrate the system's technical performance and its applications, we collected transcriptome data from ~¼ million single cells across 29 samples. First, we validate the sensitivity of the system and its ability to detect rare populations using cell lines and synthetic RNAs. Then, we profile 68k peripheral blood mononuclear cells (PBMCs) to demonstrate the system's ability to characterize large immune populations. Finally, we use sequence variation in the transcriptome data to determine host and donor chimerism at single cell resolution in bone marrow mononuclear cells (BMMCs) of transplant patients. This analysis enables characterization of the complex interplay between donor and host cells and monitoring of treatment response. This high-throughput system is robust and enables characterization of diverse biological systems with single cell mRNA analysis.

Posted ContentDOI
03 Jun 2016-bioRxiv
TL;DR: The FALCON-based assemblies were substantially more contiguous and complete than alternate short or long-read approaches, and enabled the study of haplotype structures and heterozygosities between the homologous chromosomes, including identifying widespread heterozygous structural variations within the coding sequences.
Abstract: While genome assembly projects have been successful in a number of haploid or inbred species, one of the current main challenges is assembling non-inbred or rearranged heterozygous genomes. To address this critical need, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble Single Molecule Real-Time (SMRT®) Sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We demonstrate the quality of this approach by assembling new reference sequences for three heterozygous samples, including an F1 hybrid of the model species Arabidopsis thaliana, the widely cultivated V. vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata, all of which have challenged short-read assembly approaches. The FALCON-based assemblies were substantially more contiguous and complete than alternate short or long-read approaches. The phased diploid assembly enabled the study of haplotype structures and heterozygosities between the homologous chromosomes, including identifying widespread heterozygous structural variations within the coding sequences.

Posted ContentDOI
19 Sep 2016-bioRxiv
TL;DR: It is demonstrated that escape from XCI results in sex biases in gene expression, thus establishing incomplete XCI as a likely mechanism introducing phenotypic diversity [6,7], and this updated catalogue of XCI across human tissues informs the understanding of the extent and impact of the incompleteness in the maintenance of XCI.
Abstract: X chromosome inactivation (XCI) silences the transcription from one of the two X chromosomes in mammalian female cells to balance expression dosage between XX females and XY males. XCI is, however, characteristically incomplete in humans: up to one third of X-chromosomal genes are expressed from both the active and inactive X chromosomes (Xa and Xi, respectively) in female cells, with the degree of "escape" from inactivation varying between genes and individuals [1,2] (Fig. 1). However, the extent to which XCI is shared between cells and tissues remains poorly characterized [3,4], as does the degree to which incomplete XCI manifests as detectable sex differences in gene expression [5] and phenotypic traits [6]. Here we report a systematic survey of XCI using a combination of over 5,500 transcriptomes from 449 individuals spanning 29 tissues, and 940 single-cell transcriptomes, integrated with genomic sequence data (Fig. 1). By combining information across these data types we show that XCI at the 683 X-chromosomal genes assessed is generally uniform across human tissues, but identify examples of heterogeneity between tissues, individuals and cells. We show that incomplete XCI affects at least 23% of X-chromosomal genes, identify seven new escape genes supported by multiple lines of evidence, and demonstrate that escape from XCI results in sex biases in gene expression, thus establishing incomplete XCI as a likely mechanism introducing phenotypic diversity [6,7]. Overall, this updated catalogue of XCI across human tissues informs our understanding of the extent and impact of the incompleteness in the maintenance of XCI.

Posted ContentDOI
08 Sep 2016-bioRxiv
TL;DR: HiChIP of cohesin reveals multi-scale genome architecture with greater signal to background than in situ Hi-C and adds to the toolbox of 3D genome structure and regulation for diverse biomedical applications.
Abstract: Genome conformation is central to gene control but challenging to interrogate. Here we present HiChIP, a protein-centric chromatin conformation method. HiChIP improves the yield of conformation-informative reads by over 10-fold and lowers input requirement over 100-fold relative to ChIA-PET. HiChIP of cohesin reveals multi-scale genome architecture with greater signal to background than in situ Hi-C. Thus, HiChIP adds to the toolbox of 3D genome structure and regulation for diverse biomedical applications.

Posted ContentDOI
30 Aug 2016-bioRxiv
TL;DR: It is asserted that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote the understanding of human biology and advance the efforts to improve health.
Abstract: The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009 and reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that while the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.