
Showing papers in "bioRxiv in 2015"


Posted ContentDOI
30 Oct 2015-bioRxiv
TL;DR: The aggregation and analysis of high-quality exome (protein-coding region) sequence data for 60,706 individuals of diverse ethnicities generated as part of the Exome Aggregation Consortium (ExAC) provides direct evidence for the presence of widespread mutational recurrence.
Abstract: Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) sequence data for 60,706 individuals of diverse ethnicities. The resulting catalogue of human genetic diversity has unprecedented resolution, with an average of one variant every eight bases of coding sequence and the presence of widespread mutational recurrence. The deep catalogue of variation provided by the Exome Aggregation Consortium (ExAC) can be used to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; we identify 3,230 genes with near-complete depletion of truncating variants, 79% of which have no currently established human disease phenotype. Finally, we show that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human knockout variants in protein-coding genes.
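
Since one of the stated uses of ExAC is the efficient filtering of candidate disease-causing variants, a minimal sketch of that filtering step is shown below: keep only variants whose population allele frequency in a reference catalogue falls below a chosen threshold. The file name, the VCF AF annotation and the 0.01% cutoff are illustrative assumptions, not values prescribed by the paper.

```python
# A minimal sketch of allele-frequency filtering against a reference catalogue
# such as ExAC. The file name and the 0.0001 (0.01%) threshold are illustrative
# assumptions, not values prescribed by the paper.
def parse_af(info_field):
    """Extract the AF (allele frequency) entry from a VCF INFO string, if present."""
    for item in info_field.split(";"):
        if item.startswith("AF="):
            return float(item.split("=")[1].split(",")[0])  # first ALT allele only
    return None

def rare_candidates(vcf_path, max_af=1e-4):
    """Yield variant records whose population allele frequency is below max_af."""
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            af = parse_af(fields[7])
            if af is None or af < max_af:
                yield fields[0], fields[1], fields[3], fields[4], af

for chrom, pos, ref, alt, af in rare_candidates("candidates.annotated.vcf"):
    print(chrom, pos, ref, alt, af)
```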

1,552 citations


Posted ContentDOI
13 May 2015-bioRxiv
TL;DR: Roary is introduced, a tool that rapidly builds large-scale pan genomes, identifying the core and dispensable accessory genes and making construction of the pan genome of thousands of prokaryote samples possible on a standard desktop without compromising on the accuracy of results.
Abstract: A typical prokaryote population sequencing study can now consist of hundreds or thousands of isolates. Interrogating these datasets can provide detailed insights into the genetic structure of prokaryotic genomes. We introduce Roary, a tool that rapidly builds large-scale pan genomes, identifying the core and dispensable accessory genes. Roary makes construction of the pan genome of thousands of prokaryote samples possible on a standard desktop without compromising on the accuracy of results. Using a single CPU Roary can produce a pan genome consisting of 1000 isolates in 4.5 hours using 13 GB of RAM, with further speedups possible using multiple processors.
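
To make the core/accessory distinction concrete, here is a minimal sketch of summarising a finished Roary run from its gene presence/absence table. It assumes the standard gene_presence_absence.csv output with a "No. isolates" column and uses an illustrative 99% core threshold; it is not part of Roary itself.

```python
# A minimal sketch of summarising a Roary pan genome from its
# gene_presence_absence.csv output. The column name "No. isolates" and the
# 99% core threshold are assumptions about the default output format.
import csv

def pan_genome_summary(path, n_isolates, core_fraction=0.99):
    core, accessory = 0, 0
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            present_in = int(row["No. isolates"])
            if present_in >= core_fraction * n_isolates:
                core += 1
            else:
                accessory += 1
    return core, accessory

core, accessory = pan_genome_summary("gene_presence_absence.csv", n_isolates=1000)
print(f"core genes: {core}, accessory genes: {accessory}")
```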

955 citations


Posted ContentDOI
14 Jul 2015-bioRxiv
TL;DR: Based on the global CTF determination, the local defocus for each particle and for single frames of movies is accurately refined, which improves CTF parameters of all particles for subsequent image processing.
Abstract: Accurate estimation of the contrast transfer function (CTF) is critical for a near-atomic resolution cryo-electron microscopy (cryoEM) reconstruction. Here, I present a GPU-accelerated computer program, Gctf, for accurate and robust, real-time CTF determination. Similar to alternative programs, the main target of Gctf is to maximize the cross-correlation of a simulated CTF with the power spectra of observed micrographs after background reduction. However, novel approaches in Gctf improve both speed and accuracy. In addition to GPU acceleration, a fast '1-dimensional search plus 2-dimensional refinement (1S2R)' procedure significantly speeds up Gctf. Based on the global CTF determination, the local defocus for each particle and for single frames of movies is accurately refined, which improves CTF parameters of all particles for subsequent image processing. Novel diagnosis methods using equiphase averaging (EFA) and self-consistency verification procedures have also been implemented in the program for practical use, especially for near-atomic resolution reconstruction. Gctf is an independent program and the outputs can be easily imported into other cryoEM software such as Relion and Frealign. The results from several representative datasets are shown and discussed in this paper.
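
The core fitting idea described above (maximising the cross-correlation between a simulated CTF and the observed, background-reduced power spectrum) can be illustrated in one dimension. The following sketch uses textbook CTF physics with illustrative microscope constants and a brute-force defocus grid; it is not Gctf's actual implementation or search strategy.

```python
# A minimal 1-D sketch of the kind of CTF fitting Gctf performs: simulate a CTF
# curve for a trial defocus and score it against an observed, background-subtracted
# power spectrum by cross-correlation. Constants and the grid search are
# illustrative assumptions, not Gctf's parameters or search strategy.
import numpy as np

LAMBDA = 0.0197   # electron wavelength in angstrom at 300 kV
CS = 2.7e7        # spherical aberration, 2.7 mm expressed in angstrom
AMP_CONTRAST = 0.07

def ctf_1d(s, defocus):
    """Simulated CTF at spatial frequencies s (1/angstrom) for a given defocus (angstrom)."""
    chi = np.pi * LAMBDA * defocus * s**2 - 0.5 * np.pi * CS * LAMBDA**3 * s**4
    return -np.sqrt(1 - AMP_CONTRAST**2) * np.sin(chi) - AMP_CONTRAST * np.cos(chi)

def best_defocus(s, observed_spectrum, defocus_grid):
    """1-D search: return the defocus whose simulated CTF^2 correlates best with the spectrum."""
    scores = []
    for df in defocus_grid:
        model = ctf_1d(s, df) ** 2
        scores.append(np.corrcoef(model, observed_spectrum)[0, 1])
    return defocus_grid[int(np.argmax(scores))]

s = np.linspace(0.01, 0.35, 500)                                   # spatial frequency grid
observed = ctf_1d(s, 15000) ** 2 + 0.05 * np.random.rand(s.size)   # toy "observed" spectrum
print(best_defocus(s, observed, np.arange(5000, 30000, 100)))
```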

744 citations


Posted ContentDOI
15 Jun 2015-bioRxiv
TL;DR: A new methodology for analyzing single-cell transcriptomic data is presented that models the bimodality of such data within a coherent generalized linear modeling framework, and the cellular detection rate, the fraction of genes turned on in a cell, is introduced.
Abstract: Single-cell transcriptomic profiling enables the unprecedented interrogation of gene expression heterogeneity in rare cell populations that would otherwise be obscured in bulk RNA sequencing experiments. The stochastic nature of transcription is revealed in the bimodality of single-cell transcriptomic data, a feature shared across single-cell expression platforms. There is, however, a paucity of computational tools that take advantage of this unique characteristic. We present a new methodology to analyze single-cell transcriptomic data that models this bimodality within a coherent generalized linear modeling framework. We propose a two-part, generalized linear model that allows one to characterize biological changes in the proportions of cells that are expressing each gene, and in the positive mean expression level of that gene. We introduce the cellular detection rate, the fraction of genes turned on in a cell, and show how it can be used to simultaneously adjust for technical variation and so-called “extrinsic noise” at the single-cell level without the use of control genes. Our model permits direct inference on statistics formed by collections of genes, facilitating gene set enrichment analysis. The residuals defined by such models can be manipulated to interrogate cellular heterogeneity and gene-gene correlation across cells and conditions, providing insights into the temporal evolution of networks of co-expressed genes at the single-cell level. Using two single-cell RNA-seq datasets, including newly generated data from Mucosal Associated Invariant T (MAIT) cells, we show how model residuals can be used to identify significant changes across biologically relevant gene sets that are missed by other methods and characterize cellular heterogeneity in response to stimulation.
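
A minimal sketch of the two-part (hurdle) idea described above, using standard Python statistical tooling rather than the authors' software: a logistic model for whether a gene is detected in a cell and a Gaussian model for its level when detected, both adjusted for the cellular detection rate. The simulated data frame and column names are assumptions for illustration.

```python
# A minimal sketch of a two-part "hurdle" model in the spirit of the paper:
# a logistic part for whether a gene is detected in a cell, and a Gaussian part
# for log-expression in the cells where it is detected, both adjusted for the
# cellular detection rate (CDR). Column names and the simulated data are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_cells = 200
cells = pd.DataFrame({
    "group": rng.integers(0, 2, n_cells),      # e.g. stimulated vs unstimulated
    "cdr": rng.uniform(0.2, 0.8, n_cells),     # fraction of genes detected per cell
})
# toy expression for one gene: detection and level both depend on group and CDR
detected = rng.random(n_cells) < 0.2 + 0.3 * cells.group + 0.4 * cells.cdr
cells["detected"] = detected.astype(int)
cells["log_expr"] = np.where(detected, 1.0 + 0.5 * cells.group + rng.normal(0, 0.3, n_cells), np.nan)

# Discrete part: probability of detection
discrete = smf.logit("detected ~ group + cdr", data=cells).fit(disp=False)
# Continuous part: expression level among detected cells only
continuous = smf.ols("log_expr ~ group + cdr", data=cells.dropna()).fit()
print(discrete.params, continuous.params, sep="\n")
```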

556 citations


Posted ContentDOI
16 Feb 2015-bioRxiv
TL;DR: A set of novel tools to solve the problem of “spike sorting” for large, dense electrode arrays are presented, implemented in a suite of practical, user-friendly, open-source software.
Abstract: Developments in microfabrication technology have enabled the production of neural electrode arrays with hundreds of closely-spaced recording sites, and electrodes with thousands of sites are currently under development. These probes will in principle allow the simultaneous recording of very large numbers of neurons. However, use of this technology requires the development of techniques for decoding the spike times of the recorded neurons, from the raw data captured from the probes. There currently exists no practical solution to this problem of “spike sorting” for large, dense electrode arrays. Here, we present a set of novel tools to solve this problem, implemented in a suite of practical, user-friendly, open-source software. We validate these methods on data from rat cortex, demonstrating error rates as low as 5%.

265 citations


Posted ContentDOI
28 Jul 2015-bioRxiv
TL;DR: Circlator is the first tool to automate assembly circularization and produce accurate linear representations of circular sequences, correctly circularizing 26 of 27 circularizable sequences.
Abstract: The assembly of DNA sequence data into finished genomes is undergoing a renaissance thanks to emerging technologies producing reads of tens of kilobases. Assembling complete bacterial and small eukaryotic genomes is now possible, but the final step of circularizing sequences remains unsolved. Here we present Circlator, the first tool to automate assembly circularization and produce accurate linear representations of circular sequences. Using Pacific Biosciences and Oxford Nanopore data, Circlator correctly circularized 26 of 27 circularizable sequences, comprising 11 chromosomes and 12 plasmids from bacteria, the apicoplast and mitochondrion of Plasmodium falciparum and a human mitochondrion. Circlator is available at http://sanger-pathogens.github.io/circlator/.
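
The circularization problem can be illustrated with a toy version of its simplest case: a contig from a circular replicon whose end duplicates its start. The sketch below only trims such a terminal self-overlap; Circlator's actual pipeline aligns corrected reads and contigs with external tools and handles merging, so this is an illustration of the idea rather than the method.

```python
# A minimal sketch of the circularisation idea: if the end of an assembled contig
# repeats its start (as long-read assemblers often produce for circular replicons),
# trim the duplicated tail so the sequence can be treated as circular. Illustration
# only, not Circlator's actual algorithm.
def trim_circular_overlap(seq, min_overlap=50):
    """Return seq with a terminal repeat of its own start removed, if one exists."""
    for length in range(len(seq) // 2, min_overlap - 1, -1):
        if seq.endswith(seq[:length]):
            return seq[:-length]          # drop the duplicated tail
    return seq                            # no overlap found; leave unchanged

contig = "ATGCCGTA" * 40 + "ATGCCGTA" * 10   # toy contig whose end repeats its start
print(len(contig), len(trim_circular_overlap(contig, min_overlap=8)))
```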

235 citations


Posted ContentDOI
26 Oct 2015-bioRxiv
TL;DR: We are entering an exciting new era in which it will be possible to build neurobiologically faithful feedforward and recurrent computational models of how biological brains perform high-level feats of intelligence, including vision.
Abstract: Recent advances in neural network modelling have enabled major strides in computer vision and other artificial intelligence applications. Human-level visual recognition abilities are coming within reach of artificial systems. Artificial neural networks are inspired by the brain and their computations could be implemented in biological neurons. Convolutional feedforward networks, which now dominate computer vision, take further inspiration from the architecture of the primate visual hierarchy. However, the current models are designed with engineering goals and not to model brain computations. Nevertheless, initial studies comparing internal representations between these models and primate brains find surprisingly similar representational spaces. With human-level performance no longer out of reach, we are entering an exciting new era, in which we will be able to build neurobiologically faithful feedforward and recurrent computational models of how biological brains perform high-level feats of intelligence, including vision.

202 citations


Posted ContentDOI
03 Aug 2015-bioRxiv
TL;DR: A novel algorithm for comparing variant call sets is developed that deals with complex call representation discrepancies through a dynamic programming method that minimizes false positives and negatives globally across the entire call sets, for accurate performance evaluation of VCFs.
Abstract: To evaluate and compare the performance of variant calling methods and their confidence scores, comparisons between a test call set and a “gold standard” need to be carried out. Unfortunately, these comparisons are not straightforward with the current Variant Call Files (VCF), which are the standard output of most variant calling algorithms for high-throughput sequencing data. Comparisons of VCFs are often confounded by the different representations of indels, MNPs, and combinations thereof with SNVs in complex regions of the genome, resulting in misleading results. A variant caller is inherently a classification method designed to score putative variants with confidence scores that could permit controlling the rate of false positives (FP) or false negatives (FN) for a given application. Receiver operating characteristic (ROC) curves and the area under the ROC (AUC) are efficient metrics to evaluate a test call set versus a gold standard. However, in the case of VCF data this also requires a special accounting to deal with discrepant representations. We developed a novel algorithm for comparing variant call sets that deals with complex call representation discrepancies through a dynamic programming method that minimizes false positives and negatives globally across the entire call sets for accurate performance evaluation of VCFs.
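
The representation problem described above can be made concrete with a small sketch: two call sets that encode the same change differently (an MNP versus two adjacent SNVs) are recognised as equivalent by applying each to the reference and comparing the resulting haplotype sequences. This is only the bare reconciliation idea; the published algorithm additionally handles genotypes, phasing and score thresholds via dynamic programming.

```python
# A minimal sketch of representation-agnostic variant comparison: call sets that
# describe the same change with different VCF records are judged equivalent if
# applying them to the reference yields the same haplotype sequence.
def apply_variants(reference, variants):
    """Apply (pos, ref, alt) variants (1-based, non-overlapping) to a reference string."""
    result, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        assert reference[start:start + len(ref)] == ref, "REF does not match reference"
        result.append(reference[cursor:start])
        result.append(alt)
        cursor = start + len(ref)
    result.append(reference[cursor:])
    return "".join(result)

reference = "ACGTACGTAC"
truth = [(3, "GT", "TT")]                 # one MNP in the gold standard
test = [(3, "G", "T"), (4, "T", "T")]     # the same change reported as two SNVs
print(apply_variants(reference, truth) == apply_variants(reference, test))  # True
```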

176 citations


Posted ContentDOI
27 May 2015-bioRxiv
TL;DR: A genome-wide scan for genetic variants that influence multiple human phenotypes by comparing large genome-wide association studies of 40 traits or diseases is performed, showing evidence that increased BMI causally increases triglyceride levels, and that increased liability to hypothyroidism causally decreases adult height.
Abstract: We performed a genome-wide scan for genetic variants that influence multiple human phenotypes by comparing large genome-wide association studies (GWAS) of 40 traits or diseases, including anthropometric traits (e.g. nose size and male pattern baldness), immune traits (e.g. susceptibility to childhood ear infections and Crohn's disease), metabolic phenotypes (e.g. type 2 diabetes and lipid levels), and psychiatric diseases (e.g. schizophrenia and Parkinson's disease). First, we identified 307 loci (at a false discovery rate of 10%) that influence multiple traits (excluding “trivial” phenotype pairs like type 2 diabetes and fasting glucose). Several loci influence a large number of phenotypes; for example, variants near the blood group gene ABO influence eleven of these traits, including risk of childhood ear infections (rs635634: log-odds ratio = 0.06, P = 1.4 × 10−8) and allergies (log-odds ratio = 0.05, P = 2.5 × 10−8), among others. Similarly, a nonsynonymous variant in the zinc transporter SLC39A8 influences seven of these traits, including risk of schizophrenia (rs13107325: log-odds ratio = 0.15, P = 2 × 10−12) and Parkinson’s disease (log-odds ratio = -0.15, P = 1.6 × 10−7), among others. Second, we used these loci to identify traits that share multiple genetic causes in common. For example, genetic variants that delay age of menarche in women also, on average, delay age of voice drop in men, decrease body mass index (BMI), increase adult height, and decrease risk of male pattern baldness. Finally, we identified four pairs of traits that show evidence of a causal relationship. For example, we show evidence that increased BMI causally increases triglyceride levels, and that increased liability to hypothyroidism causally decreases adult height.

162 citations


Posted ContentDOI
20 Dec 2015-bioRxiv
TL;DR: The DanQ model, a novel hybrid convolutional and bi-directional long short-term memory recurrent neural network framework for predicting noncoding function de novo from sequence, improves considerably upon other models across several metrics.
Abstract: Modeling the properties and functions of DNA sequences is an important, but challenging task in the broad field of genomics. This task is particularly difficult for noncoding DNA, the vast majority of which is still poorly understood in terms of function. A powerful predictive model for the function of noncoding DNA can have enormous benefit for both basic science and translational research because over 98% of the human genome is noncoding and 93% of disease-associated variants lie in these regions. To address this need, we propose DanQ, a novel hybrid convolutional and bi-directional long short-term memory recurrent neural network framework for predicting noncoding function de novo from sequence. In the DanQ model, the convolution layer captures regulatory motifs, while the recurrent layer captures long-term dependencies between the motifs in order to learn a regulatory "grammar" to improve predictions. DanQ improves considerably upon other models across several metrics. For some regulatory markers, DanQ can achieve over a 50% relative improvement in the area under the precision-recall curve metric compared to related models.
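
A minimal sketch of the hybrid architecture described above, written with the Keras API: a convolution layer to capture motifs, pooling, a bidirectional LSTM over the motif signal, and dense layers producing multi-label chromatin-feature predictions. Layer sizes here are illustrative assumptions rather than DanQ's published configuration.

```python
# A minimal sketch of a hybrid convolutional / bi-directional LSTM network of the
# kind the paper describes. Layer sizes are chosen for illustration.
from tensorflow.keras import layers, models

def build_model(sequence_length=1000, n_targets=919):
    model = models.Sequential([
        layers.Input(shape=(sequence_length, 4)),          # one-hot encoded DNA
        layers.Conv1D(filters=320, kernel_size=26, activation="relu"),
        layers.MaxPooling1D(pool_size=13),
        layers.Dropout(0.2),
        layers.Bidirectional(layers.LSTM(320, return_sequences=True)),
        layers.Dropout(0.5),
        layers.Flatten(),
        layers.Dense(925, activation="relu"),
        layers.Dense(n_targets, activation="sigmoid"),      # one output per chromatin feature
    ])
    model.compile(optimizer="rmsprop", loss="binary_crossentropy")
    return model

build_model().summary()
```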

158 citations


Posted ContentDOI
22 Sep 2015-bioRxiv
TL;DR: FINEMAP is introduced, a software package to efficiently explore a set of the most important causal configurations of the region via a shotgun stochastic search algorithm, and is therefore a promising tool for analyzing growing amounts of data produced in genome-wide association studies.
Abstract: Motivation: The goal of fine-mapping in genomic regions associated with complex diseases and traits is to identify causal variants that point to molecular mechanisms behind the associations. Recent fine-mapping methods using summary data from genome-wide association studies rely on exhaustive search through all possible causal configurations, which is computationally expensive. Results: We introduce FINEMAP, a software package to efficiently explore a set of the most important causal configurations of the region via a shotgun stochastic search algorithm. We show that FINEMAP produces accurate results in a fraction of processing time of existing approaches and is therefore a promising tool for analyzing growing amounts of data produced in genome-wide association studies. Availability: FINEMAP v1.0 is freely available for Mac OS X and Linux at http://www.christianbenner.com.
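
To illustrate what a shotgun stochastic search over causal configurations looks like, here is a toy sketch: from the current set of putative causal SNPs, score every neighbour reached by adding, deleting or swapping one SNP, move to a neighbour with probability proportional to its score, and remember the best configurations visited. The scoring function below is a stand-in; FINEMAP scores configurations with a Bayesian model built from GWAS summary statistics and LD.

```python
# A toy sketch of a shotgun stochastic search over causal configurations.
# The scoring function is a stand-in for FINEMAP's Bayesian configuration score.
import math
import random

def neighbours(config, n_snps):
    out = []
    for snp in range(n_snps):
        if snp in config:
            out.append(config - {snp})                               # delete
        else:
            out.append(config | {snp})                               # add
            out.extend((config - {old}) | {snp} for old in config)   # swap
    return [frozenset(c) for c in out if c]

def shotgun_search(score, n_snps, n_iter=500, seed=1):
    rng = random.Random(seed)
    current = frozenset({rng.randrange(n_snps)})
    visited = {current: score(current)}
    for _ in range(n_iter):
        hood = neighbours(current, n_snps)
        weights = [score(c) for c in hood]
        visited.update(zip(hood, weights))
        current = rng.choices(hood, weights=weights, k=1)[0]
    return sorted(visited.items(), key=lambda kv: kv[1], reverse=True)[:5]

# toy posterior score: configurations containing SNPs 3 and 17 score highest
toy_score = lambda cfg: math.exp(2.0 * len(cfg & {3, 17}) - 0.5 * len(cfg))
print(shotgun_search(toy_score, n_snps=50))
```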

Posted ContentDOI
30 Oct 2015-bioRxiv
TL;DR: PHYLUCE is an efficient and easy-to-install software package that accomplishes the pre-processing tasks of targeted enrichment studies: identifying targeted, enriched loci from the off-target background data; aligning enriched contigs representing conserved loci to one another; and preparing and manipulating these alignments for subsequent phylogenomic inference.
Abstract: Summary: Targeted enrichment of conserved and ultraconserved genomic elements allows universal collection of phylogenomic data from hundreds of species at multiple time scales ( 300 Ma). Prior to downstream inference, data from these types of targeted enrichment studies must undergo pre-processing to assemble contigs from sequence data; identify targeted, enriched loci from the off-target background data; align enriched contigs representing conserved loci to one another; and prepare and manipulate these alignments for subsequent phylogenomic inference. PHYLUCE is an efficient and easy-to-install software package that accomplishes these tasks across hundreds of taxa and thousands of enriched loci. Availability and Implementation: PHYLUCE is written for Python 2.7. PHYLUCE is supported on OSX and Linux (RedHat/CentOS) operating systems. PHYLUCE source code is distributed under a BSD-style license from https://www.github.com/faircloth-lab/phyluce/. PHYLUCE is also available as a package (https://binstar.org/faircloth-lab/phyluce) for the Anaconda Python distribution that installs all dependencies, and users can request a PHYLUCE instance on iPlant Atmosphere (tag: phyluce). The software manual and a tutorial are available from http://phyluce.readthedocs.org/en/latest/ and test data are available from doi: 10.6084/m9.figshare.1284521.

Posted ContentDOI
06 Jan 2015-bioRxiv
TL;DR: In this paper, the authors describe software developed to make use of these data, as existing packages were incapable of assembling long reads at such a high error rate (~35% error); with these methods they were able to error correct and assemble the nanopore reads de novo, producing an assembly that is contiguous and accurate, with a contig N50 length of 479 kb and greater than 99% consensus identity when compared to the reference.
Abstract: Monitoring the progress of DNA through a pore has been postulated as a method for sequencing DNA for several decades [1,2]. Recently, a nanopore instrument, the Oxford Nanopore MinION, has become available [3]. Here we describe our sequencing of the S. cerevisiae genome. We describe software developed to make use of these data, as existing packages were incapable of assembling long reads at such a high error rate (~35% error). With these methods we were able to error correct and assemble the nanopore reads de novo, producing an assembly that is contiguous and accurate, with a contig N50 length of 479 kb and greater than 99% consensus identity when compared to the reference. The assembly with the long nanopore reads was able to correctly assemble gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in an assembly using Illumina sequencing alone (with a contig N50 of only 59,927 bp).
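
The contig N50 quoted above is a simple statistic worth spelling out: the length L such that contigs of length at least L together contain at least half of the assembled bases. A minimal sketch with toy contig lengths:

```python
# A minimal sketch of the contig N50 statistic quoted above. The contig lengths
# are toy values for illustration.
def n50(contig_lengths):
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

print(n50([479_000, 300_000, 120_000, 60_000, 59_927, 10_000]))
```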

Posted ContentDOI
27 Jun 2015-bioRxiv
TL;DR: Salmon is introduced, a novel method and software tool for transcript quantification that exhibits state-of-the-art accuracy while being significantly faster than most other tools.
Abstract: Transcript quantification is a central task in the analysis of RNA-seq data. Accurate computational methods for the quantification of transcript abundances are essential for downstream analysis. However, most existing approaches are much slower than is necessary for their degree of accuracy. We introduce Salmon, a novel method and software tool for transcript quantification that exhibits state-of-the-art accuracy while being significantly faster than most other tools. Salmon achieves this through the combined application of a two-phase inference procedure, a reduced data representation, and a novel lightweight read alignment algorithm. Salmon is written in C++11, and is available under the GPL v3 license as open-source software at https://combine-lab.github.io/salmon.
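
As a small illustration of what transcript quantification outputs look like downstream, the sketch below converts estimated read counts and effective transcript lengths (the kinds of quantities Salmon estimates) into TPM values. The numbers are toy data and the calculation is the standard TPM definition, not Salmon's inference procedure.

```python
# A minimal sketch of computing TPM from estimated counts and effective lengths:
# normalise counts by effective length, then scale so the values sum to one million.
def tpm(counts, effective_lengths):
    rates = [c / l for c, l in zip(counts, effective_lengths)]
    scale = 1e6 / sum(rates)
    return [r * scale for r in rates]

print(tpm(counts=[1500, 300, 45], effective_lengths=[2100.0, 850.0, 400.0]))
```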

Posted ContentDOI
02 Mar 2015-bioRxiv
TL;DR: A new method is introduced, LDpred, which infers the posterior mean causal effect size of each marker using a prior on effect sizes and LD information from an external reference panel, and shows that LDpred outperforms the pruning/thresholding approach, particularly at large sample sizes.
Abstract: Polygenic risk scores have shown great promise in predicting complex disease risk, and will become more accurate as training sample sizes increase. The standard approach for calculating risk scores involves LD-pruning markers and applying a P-value threshold to association statistics, but this discards information and may reduce predictive accuracy. We introduce a new method, LDpred, which infers the posterior mean causal effect size of each marker using a prior on effect sizes and LD information from an external reference panel. Theory and simulations show that LDpred outperforms the pruning/thresholding approach, particularly at large sample sizes. Accordingly, prediction R2 increased from 20.1% to 25.3% in a large schizophrenia data set and from 9.8% to 12.0% in a large multiple sclerosis data set. A similar relative improvement in accuracy was observed for three additional large disease data sets and when predicting in non-European schizophrenia samples. The advantage of LDpred over existing methods will grow as sample sizes increase.
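
The paper's infinitesimal-prior special case (LDpred-inf) has an approximately closed-form posterior mean, which makes the role of the LD matrix easy to see. The numpy sketch below implements that closed form under the stated assumption that it takes the form beta_inf ≈ (D + (M/(N h²)) I)⁻¹ beta_hat; the full LDpred method additionally uses a point-normal prior with Gibbs sampling, and the toy values are illustrative only.

```python
# A minimal numpy sketch of the infinitesimal-prior special case (an assumption
# about its closed form, not the full LDpred method):
#   beta_inf ~= inverse(D + (M / (N * h2)) * I) @ beta_hat
# with D the LD matrix from a reference panel, beta_hat the marginal GWAS effects,
# M the number of markers, N the GWAS sample size and h2 the heritability.
import numpy as np

def ldpred_inf(beta_hat, ld, n, h2):
    m = beta_hat.size
    shrinkage = ld + (m / (n * h2)) * np.eye(m)
    return np.linalg.solve(shrinkage, beta_hat)

ld = np.array([[1.0, 0.6, 0.1],
               [0.6, 1.0, 0.2],
               [0.1, 0.2, 1.0]])
beta_hat = np.array([0.05, 0.04, -0.01])
print(ldpred_inf(beta_hat, ld, n=50_000, h2=0.3))

# The polygenic score for an individual is then the genotype-weighted sum of these
# re-weighted effects: score = genotypes @ beta_inf.
```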

Posted ContentDOI
27 Nov 2015-bioRxiv
TL;DR: This work combined a user-centric design philosophy with sustainable software development approaches to create Sequenceserver, a modern graphical user interface for BLAST that greatly facilitates BLAST analysis and interpretation and thus substantially enhances researcher productivity.
Abstract: The dramatic drop in DNA sequencing costs has created many opportunities for novel biological research. These opportunities largely rest upon the ability to effectively compare newly obtained and previously known sequences. This is commonly done with BLAST, yet using BLAST directly on new datasets requires substantial technical skills or helpful colleagues. Furthermore, graphical interfaces for BLAST are challenging to install and largely mimic underlying computational processes rather than work patterns of researchers. We combined a user-centric design philosophy with sustainable software development approaches to create Sequenceserver (http://sequenceserver.com), a modern graphical user interface for BLAST. Sequenceserver substantially increases the efficiency of researchers working with sequence data. This is due to innovations at three levels. First, our software can be installed and used on custom datasets extremely rapidly for personal and shared applications. Second, based on analysis of user input and simple algorithms, Sequenceserver reduces the number of decisions the user must make, provides interactive visual feedback, and prevents common potential errors that would otherwise cause erroneous results. Finally, Sequenceserver provides multiple highly visual and text-based output options that mirror the requirements and work patterns of researchers. Together, these features greatly facilitate BLAST analysis and interpretation and thus substantially enhance researcher productivity.

Posted ContentDOI
16 Dec 2015-bioRxiv
TL;DR: Health disparities persist across race/ethnicity for the majority of Healthy People 2010 health indicators, and biomedical research and study populations must better reflect the country’s changing demographics.
Abstract: Summary points: Health disparities persist across race/ethnicity for the majority of Healthy People 2010 health indicators. Most physicians and scientists are informed by research extrapolated from a largely homogenous population, usually White and male. A growing proportion of Americans are not fully benefiting from clinical and biomedical advances since racial and ethnic minorities make up nearly 40% of the U.S. population. Ignoring the racial/ethnic diversity of the U.S. population is a missed scientific opportunity to fully understand the factors that lead to disease or health. U.S. biomedical research and study populations must better reflect the country's changing demographics. Adequate representation of diverse populations in scientific research is imperative as a matter of social justice, economics, and science.

Posted ContentDOI
05 Feb 2015-bioRxiv
TL;DR: The dbSUPER database as mentioned in this paper is the first integrated and interactive database of super-enhancers, with the primary goal of providing a resource for further study of transcriptional control of cell identity and disease by archiving computationally produced data.
Abstract: The super-enhancer is a newly proposed concept, referring to clusters of enhancers that can drive cell-type-specific gene expression and are crucial in cell identity. Many disease-associated sequence variations are enriched in the super-enhancer regions of disease-relevant cell types. Thus, super-enhancers can be used as potential biomarkers for disease diagnosis and therapeutics. Current studies have identified super-enhancers for more than 100 cell types in human and mouse. However, no centralized resource to integrate all these findings is available yet. We developed dbSUPER (http://bioinfo.au.tsinghua.edu.cn/dbsuper/), the first integrated and interactive database of super-enhancers, with the primary goal of providing a resource for further study of transcriptional control of cell identity and disease by archiving computationally produced data. These data can be easily sent to the Galaxy, GREAT and Cistrome web servers for further downstream analysis. dbSUPER provides a responsive and user-friendly web interface to facilitate efficient and comprehensive searching and browsing. dbSUPER provides downloadable and exportable features in a variety of data formats, and its contents can be visualized in the UCSC genome browser, with custom tracks added automatically. Further, dbSUPER lists genes associated with super-enhancers and links to various databases, including GeneCards, UniProt and Entrez. Our database also provides an overlap analysis tool to check the overlap of user-defined regions with the current database. We believe dbSUPER is a valuable resource for the bioinformatics and genetics research community.

Posted ContentDOI
10 Nov 2015-bioRxiv
TL;DR: It is indicated that a substantial level of pleiotropy exists between cognitive abilities and many human mental and physical health disorders and traits and that it can be used to predict phenotypic variance across samples.
Abstract: The causes of the known associations between poorer cognitive function and many adverse neuropsychiatric outcomes, poorer physical health, and earlier death remain unknown. We used linkage disequilibrium regression and polygenic profile scoring to test for shared genetic aetiology between cognitive functions and neuropsychiatric disorders and physical health. Using information provided by many published genome-wide association study consortia, we created polygenic profile scores for 24 vascular-metabolic, neuropsychiatric, physiological-anthropometric, and cognitive traits in the participants of UK Biobank, a very large population-based sample (N = 112 151). Pleiotropy between cognitive and health traits was quantified by deriving genetic correlations using summary genome-wide association study statistics applied to the method of linkage disequilibrium regression. Substantial and significant genetic correlations were observed between cognitive test scores in the UK Biobank sample and many of the mental and physical health-related traits and disorders assessed here. In addition, highly significant associations were observed between the cognitive test scores in the UK Biobank sample and many polygenic profile scores, including coronary artery disease, stroke, Alzheimer's disease, schizophrenia, autism, major depressive disorder, BMI, intracranial volume, infant head circumference, and childhood cognitive ability. Where disease diagnosis was available for UK Biobank participants we were able to show that these results were not confounded by those who had the relevant disease. These findings indicate that a substantial level of pleiotropy exists between cognitive abilities and many human mental and physical health disorders and traits and that it can be used to predict phenotypic variance across samples.

Posted ContentDOI
27 Jun 2015-bioRxiv
TL;DR: TransRate can accurately evaluate assemblies of conserved and novel RNA molecules of any kind in any species; it is shown to be more accurate than comparable methods, and its use is demonstrated on a variety of data.
Abstract: TransRate is a tool for reference-free quality assessment of de novo transcriptome assemblies. Using only sequenced reads as the input, TransRate measures the quality of individual contigs and whole assemblies, enabling assembly optimization and comparison. TransRate can accurately evaluate assemblies of conserved and novel RNA molecules of any kind in any species. We show that it is more accurate than comparable methods and demonstrate its use on a variety of data.

Posted ContentDOI
25 Sep 2015-bioRxiv
TL;DR: Quantitative evaluation of several published microbial detection methods shows that One Codex has the highest degree of sensitivity and specificity (AUC = 0.97, compared to 0.82-0.88 for other methods), both when detecting well-characterized species and when detecting newly sequenced, “taxonomically novel” organisms.
Abstract: High-throughput sequencing (HTS) is increasingly being used for broad applications of microbial characterization, such as microbial ecology, clinical diagnosis, and outbreak epidemiology. However, the analytical task of comparing short sequence reads against the known diversity of microbial life has proved to be computationally challenging. The One Codex data platform was created with the dual goals of analyzing microbial data against the largest possible collection of microbial reference genomes, as well as presenting those results in a format that is consumable by applied end-users. One Codex identifies microbial sequences using a "k-mer based" taxonomic classification algorithm through a web-based data platform, using a reference database that currently includes approximately 40,000 bacterial, viral, fungal, and protozoan genomes. In order to evaluate whether this classification method and associated database provided quantitatively different performance for microbial identification, we created a large and diverse evaluation dataset containing 50 million reads from 10,639 genomes, as well as sequences from six organisms from novel species not included in the reference databases of any of the tested classifiers. Quantitative evaluation of several published microbial detection methods shows that One Codex has the highest degree of sensitivity and specificity (AUC = 0.97, compared to 0.82-0.88 for other methods), both when detecting well-characterized species and when detecting newly sequenced, "taxonomically novel" organisms.
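
The "k-mer based" classification idea can be sketched in a few lines: index every k-mer of each reference genome by taxon, then assign a read to the taxon whose k-mers it shares most often. Production classifiers use much larger k, huge reference sets and a taxonomy tree to resolve ambiguous k-mers; the following is only the bare idea with toy sequences.

```python
# A minimal sketch of k-mer based read classification with toy data. Real
# classifiers use larger k, far bigger references and a taxonomy tree.
from collections import Counter

K = 8  # toy k-mer size; production classifiers typically use k around 21-31

def kmers(seq, k=K):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_index(references):
    index = {}
    for taxon, genome in references.items():
        for kmer in kmers(genome):
            index.setdefault(kmer, set()).add(taxon)
    return index

def classify(read, index):
    votes = Counter()
    for kmer in kmers(read):
        for taxon in index.get(kmer, ()):
            votes[taxon] += 1
    return votes.most_common(1)[0][0] if votes else "unclassified"

refs = {"taxon_A": "ATGCGTACGTTAGCATCGATCGGCTA", "taxon_B": "TTGACCGGTAACGTTAGGCCAATCGT"}
index = build_index(refs)
print(classify("GTACGTTAGCATCG", index))   # expected: taxon_A
```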

Posted ContentDOI
01 Jun 2015-bioRxiv
TL;DR: A method to identify approximately independent blocks of linkage disequilibrium (LD) in the human genome that enable automated analysis of multiple genome-wide association studies is presented.
Abstract: We present a method to identify approximately independent blocks of linkage disequilibrium (LD) in the human genome. These blocks enable automated analysis of multiple genome-wide association studies.

Posted ContentDOI
06 Nov 2015-bioRxiv
TL;DR: What's in my Pot? (WIMP) is a laboratory and analysis workflow that combines the Oxford Nanopore MinION with real-time species identification to classify the bacteria, viruses and fungi present in an unprocessed sample to subspecies and strain level, quantitatively and without prior knowledge of the sample composition, in approximately 3.5 hours.
Abstract: Whole genome sequencing on next-generation instruments provides an unbiased way to identify the organisms present in complex metagenomic samples. However, the time-to-result can be protracted because of fixed-time sequencing runs and cumbersome bioinformatics workflows. This limits the utility of the approach in settings where rapid species identification is crucial, such as in the quality control of food-chain components, or during an outbreak of an infectious disease. Here we present What's in my Pot? (WIMP), a laboratory and analysis workflow in which, starting with an unprocessed sample, sequence data are generated and the bacteria, viruses and fungi present in the sample are classified to subspecies and strain level in a quantitative manner, without prior knowledge of the sample composition, in approximately 3.5 hours. This workflow relies on the combination of Oxford Nanopore Technologies' MinION™ sensing device with a real-time species identification bioinformatics application.

Posted ContentDOI
06 Aug 2015-bioRxiv
TL;DR: DADA2 analysis of vaginal samples revealed a diversity of Lactobacillus crispatus strains undetected by OTU methods, and identified more real variants than other methods in Illumina-sequenced mock communities, some differing by a single nucleotide, while outputting fewer spurious sequences.
Abstract: Microbial communities are commonly characterized by amplifying and sequencing target genes, but errors limit the precision of amplicon sequencing. We present DADA2, a software package that models and corrects amplicon errors. DADA2 identified more real variants than other methods in Illumina-sequenced mock communities, some differing by a single nucleotide, while outputting fewer spurious sequences. DADA2 analysis of vaginal samples revealed a diversity of Lactobacillus crispatus strains undetected by OTU methods.

Posted ContentDOI
25 Aug 2015-bioRxiv
TL;DR: It is found that the proportion of genes reported as expressed explains a substantial part of observed variability and that this quantity varies systematically across experimental batches.
Abstract: Single-cell RNA-Sequencing (scRNA-Seq) has become the most widely used high-throughput method for transcription profiling of individual cells. Systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies. Surprisingly, these issues have received minimal attention in published studies based on scRNA-Seq technology. We examined data from five published studies and found that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we found that the proportion of genes reported as expressed explains a substantial part of observed variability and that this quantity varies systematically across experimental batches. Furthermore, we found that the implemented experimental designs confounded outcomes of interest with batch effects, a design that can bring into question some of the conclusions of these studies. Finally, we propose a simple experimental design that can ameliorate the effect that these systematic errors have on downstream results.

Posted ContentDOI
15 Sep 2015-bioRxiv
TL;DR: A large, diverse set of sequencing data for seven human genomes is described; five are current or candidate NIST Reference Materials and two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry are described.
Abstract: The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Asian ancestry. The data described come from 11 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available and highly characterized. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.

Posted ContentDOI
26 Mar 2015-bioRxiv
TL;DR: These findings establish for the first time a circuit mechanism for “temporal receptive windows” that are progressively enlarged along the cortical hierarchy, extend the concept of working memory from local to large circuits, and suggest a re-interpretation of functional connectivity measures.
Abstract: We developed a large-scale dynamical model of the macaque neocortex based on recent quantitative connectivity data. A hierarchy of timescales naturally emerges from this system: sensory areas show brief, transient responses to input (appropriate for sensory processing), whereas association areas integrate inputs over time and exhibit persistent activity (suitable for decision-making and working memory). The model displays multiple temporal hierarchies, as evidenced by contrasting responses to visual and somatosensory stimulation. Moreover, slower prefrontal and temporal areas have a disproportionate impact on global brain dynamics. These findings establish for the first time a circuit mechanism for "temporal receptive windows" that are progressively enlarged along the cortical hierarchy, extend the concept of working memory from local to large circuits, and suggest a re-interpretation of functional connectivity measures.
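
The emergence of a hierarchy of timescales can be illustrated with a toy linear rate model in the same spirit: each area has the same intrinsic time constant, but areas with stronger recurrent weights integrate inputs over much longer effective windows. The three-area connectivity and parameters below are invented for illustration, not the paper's anatomical data.

```python
# A toy linear rate model illustrating how recurrent strength lengthens an area's
# effective integration window. Parameters are invented for illustration only.
import numpy as np

dt, steps = 1.0, 400                      # ms
tau = np.array([10.0, 10.0, 10.0])        # intrinsic time constants (ms)
W = np.array([[0.2, 0.0, 0.0],            # area 2 is the most strongly recurrent
              [0.4, 0.5, 0.0],
              [0.0, 0.4, 0.9]])
rates = np.zeros(3)
trace = []
for t in range(steps):
    stimulus = np.array([1.0, 0.0, 0.0]) if t < 50 else np.zeros(3)   # brief input to area 0
    drive = W @ rates + stimulus
    rates = rates + dt / tau * (-rates + drive)
    trace.append(rates.copy())

trace = np.array(trace)
print("activity ~300 ms after stimulus offset:", trace[349].round(4))
```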

Posted ContentDOI
29 Jul 2015-bioRxiv
TL;DR: An approach that utilizes genotype likelihoods rather than a single observed best genotype to estimate ϕ is described and it is demonstrated that this method can accurately infer relatedness in both simulated and real 2nd generation sequencing data from a wide variety of human populations down to at least the third degree.
Abstract: The inference of biological relatedness from DNA sequence data has a wide array of applications, such as in the study of human disease, anthropology and ecology. One of the most common analytical frameworks for performing this inference is to genotype individuals for large numbers of independent genome-wide markers and use population allele frequencies to infer the probability of identity-by-descent (IBD) given observed genotypes. Current implementations of this class of methods assume genotypes are known without error. However, with the advent of second generation sequencing data there are now an increasing number of situations where the confidence attached to a particular genotype may be poor because of low coverage. Such scenarios may lead to biased estimates of the kinship coefficient, Φ. We describe an approach that utilizes genotype likelihoods rather than a single observed best genotype to estimate Φ and demonstrate that we can accurately infer relatedness in both simulated and real second generation sequencing data from a wide variety of human populations down to at least the third degree when coverage is as low as 2x for both individuals, while other commonly used methods such as PLINK exhibit large biases in such situations. In addition, the method appears to be robust when the assumed population allele frequencies diverge from the true frequencies for realistic levels of genetic drift. This approach has been implemented in the C++ software lcmlkin.

Posted ContentDOI
15 Dec 2015-bioRxiv
TL;DR: A deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes, which shows that deep learning achieves lower error than linear regression in 99.97% of the target genes.
Abstract: Motivation: Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, etc. Although the cost of whole-genome expression profiles has been dropping steadily, generating a compendium of expression profiling over thousands of samples is still very expensive. Recognizing that gene expressions are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only 1,000 carefully selected landmark genes and relying on computational methods to infer the expression of remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression, limiting its accuracy since it does not capture complex nonlinear relationships between the expression of genes. Results: We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based GEO dataset, consisting of 111K expression profiles, to train our model and compare its performance to those from other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms linear regression with 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than linear regression in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2,921 expression profiles. Deep learning still outperforms linear regression with 6.57% relative improvement, and achieves lower error in 81.31% of the target genes. Availability: D-GEX is available at https://github.com/uci-cbcl/D-GEX.
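
A minimal sketch of the regression set-up described above, using the Keras API: a feed-forward network mapping landmark-gene expression to the remaining target genes, trained with the mean absolute error the paper uses for comparison against linear regression. Layer sizes, gene counts and the random data are illustrative assumptions, not D-GEX's published configuration.

```python
# A minimal sketch of multi-task regression from landmark genes to target genes.
# Sizes and random data are illustrative assumptions.
import numpy as np
from tensorflow.keras import layers, models

n_landmark, n_target = 1000, 2000
x = np.random.rand(1024, n_landmark).astype("float32")   # stand-in landmark expression profiles
y = np.random.rand(1024, n_target).astype("float32")     # stand-in target expression profiles

model = models.Sequential([
    layers.Input(shape=(n_landmark,)),
    layers.Dense(1500, activation="tanh"),
    layers.Dropout(0.1),
    layers.Dense(1500, activation="tanh"),
    layers.Dropout(0.1),
    layers.Dense(n_target, activation="linear"),          # one output per target gene
])
model.compile(optimizer="adam", loss="mean_absolute_error")
model.fit(x, y, epochs=2, batch_size=256, verbose=0)
print("MAE on training data:", model.evaluate(x, y, verbose=0))
```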

Posted ContentDOI
07 Aug 2015-bioRxiv
TL;DR: FastQTL is introduced: a method that implements a popular cis-QTL mapping strategy in a user- and cluster-friendly tool and proposes an efficient permutation procedure to control for multiple testing.
Abstract: Motivation: In order to discover quantitative trait loci (QTLs), multi-dimensional genomic data sets combining DNA-seq and ChIP-/RNA-seq require methods that rapidly correlate tens of thousands of molecular phenotypes with millions of genetic variants while appropriately controlling for multiple testing. Results: We have developed FastQTL, a method that implements a popular cis-QTL mapping strategy in a user- and cluster-friendly tool. FastQTL also proposes an efficient permutation procedure to control for multiple testing. The outcome of permutations is modeled using beta distributions trained from a few permutations and from which adjusted p-values can be estimated at any level of significance with little computational cost. The Geuvadis & GTEx pilot data sets can now be easily analyzed an order of magnitude faster than with previous approaches. Availability: Source code, binaries and comprehensive documentation of FastQTL are freely available to download at http://fastqtl.sourceforge.net/
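
A minimal sketch of the permutation scheme described above: recompute the smallest nominal cis p-value under a limited number of phenotype permutations, fit a beta distribution to those permutation minima, and report the fitted beta CDF at the observed minimum as the adjusted p-value. The correlation-based association test and toy data are assumptions for illustration; FastQTL's own test and fitting details differ.

```python
# A minimal sketch of beta-approximated permutation p-values for a cis window.
# The Pearson-correlation test and toy data are illustrative assumptions.
import numpy as np
from scipy import stats

def min_nominal_pvalue(genotypes, phenotype):
    """Smallest p-value over all variants, using a simple Pearson correlation test."""
    return min(stats.pearsonr(g, phenotype)[1] for g in genotypes)

def beta_adjusted_pvalue(genotypes, phenotype, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    observed = min_nominal_pvalue(genotypes, phenotype)
    perm_minima = np.array([
        min_nominal_pvalue(genotypes, rng.permutation(phenotype)) for _ in range(n_perm)
    ])
    a, b, _, _ = stats.beta.fit(perm_minima, floc=0, fscale=1)   # fit Beta(a, b) on [0, 1]
    return observed, stats.beta.cdf(observed, a, b)

rng = np.random.default_rng(1)
genotypes = rng.integers(0, 3, size=(50, 300)).astype(float)     # 50 variants x 300 samples
phenotype = 0.4 * genotypes[7] + rng.normal(0, 1, 300)           # variant 7 is truly associated
print(beta_adjusted_pvalue(genotypes, phenotype))
```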