scispace - formally typeset

Showing papers in "bioRxiv in 2017"


Posted Content · DOI
22 Jun 2017-bioRxiv
TL;DR: UFBoot2 is presented, which substantially accelerates UFBoot and reduces the risk of overestimating branch supports due to polytomies or severe model violations and provides suitable bootstrap resampling strategies for phylogenomic data.
Abstract: The standard bootstrap (SBS), despite being computationally intensive, is widely used in maximum likelihood phylogenetic analyses. We recently proposed the ultrafast bootstrap approximation (UFBoot) to reduce computing time while achieving more unbiased branch supports than SBS under mild model violations. UFBoot has been steadily adopted as an efficient alternative to SBS and other bootstrap approaches. Here, we present UFBoot2, which substantially accelerates UFBoot and reduces the risk of overestimating branch supports due to polytomies or severe model violations. Additionally, UFBoot2 provides suitable bootstrap resampling strategies for phylogenomic data. UFBoot2 is 778 and 8.4 times (median) faster than SBS and RAxML rapid bootstrap on tested datasets, respectively. UFBoot2 is implemented in the IQ-TREE software package version 1.6 and freely available at http://www.iqtree.org.
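The bootstrap that UFBoot2 approximates resamples alignment sites (columns) with replacement to generate pseudo-replicate alignments. A minimal pure-Python sketch of one replicate, with an illustrative toy alignment (function and variable names are not from IQ-TREE):

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns (sites) with replacement.

    `alignment` is a list of equal-length sequence strings; one
    bootstrap replicate draws n_sites column indices uniformly with
    replacement and rebuilds each sequence from those columns.
    """
    n_sites = len(alignment[0])
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in cols) for seq in alignment]

rng = random.Random(1)
aln = ["ACGTACGT", "ACGAACGT", "TCGTACGA"]
rep = bootstrap_replicate(aln, rng)
```

Each replicate would then be used to re-estimate the tree; the phylogenomic resampling strategies mentioned in the abstract additionally resample genes or partitions rather than only sites.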

1,742 citations


Posted Content · DOI
12 Jan 2017-bioRxiv
TL;DR: QuPath provides researchers with powerful batch-processing and scripting functionality, and an extensible platform with which to develop and share new algorithms to analyze complex tissue images, making it suitable for a wide range of additional image analysis applications across biomedical research.
Abstract: QuPath is new bioimage analysis software designed to meet the growing need for a user-friendly, extensible, open-source solution for digital pathology and whole slide image analysis. In addition to offering a comprehensive panel of tumor identification and high-throughput biomarker evaluation tools, QuPath provides researchers with powerful batch-processing and scripting functionality, and an extensible platform with which to develop and share new algorithms to analyze complex tissue images. Furthermore, QuPath's flexible design makes it suitable for a wide range of additional image analysis applications across biomedical research.

1,448 citations


Posted Content · DOI
31 May 2017-bioRxiv
TL;DR: SCENIC (Single Cell rEgulatory Network Inference and Clustering) is the first method to analyze scRNA-seq data using a network-centric, rather than cell-centric approach and allows for the simultaneous tracing of genomic regulatory programs and the mapping of cellular identities emerging from these programs.
Abstract: Single-cell RNA-seq allows the building of cell atlases of any given tissue and the inference of the dynamics of cellular state transitions during developmental or disease trajectories. Both the maintenance and transitions of cell states are encoded by regulatory programs in the genome sequence. However, this regulatory code has not yet been exploited to guide the identification of cellular states from single-cell RNA-seq data. Here we describe a computational resource, called SCENIC (Single Cell rEgulatory Network Inference and Clustering), for the simultaneous reconstruction of gene regulatory networks (GRNs) and the identification of stable cell states, using single-cell RNA-seq data. SCENIC outperforms existing approaches at the level of cell clustering and transcription factor identification. Importantly, we show that cell state identification based on GRNs is robust to batch effects and technical biases. We applied SCENIC to a compendium of single-cell data from the mouse and human brain and demonstrate that the proper combinations of transcription factors, target genes, enhancers, and cell types can be identified. Moreover, we used SCENIC to map the cell state landscape in melanoma and identified a gene regulatory network underlying a proliferative melanoma state driven by MITF and STAT and a contrasting network controlling an invasive state governed by NFATC2 and NFIB. We further validated these predictions by showing that two transcription factors are predominantly expressed in early metastatic sentinel lymph nodes. In summary, SCENIC is the first method to analyze scRNA-seq data using a network-centric, rather than cell-centric, approach. SCENIC is generic, easy to use, and flexible, and allows for the simultaneous tracing of genomic regulatory programs and the mapping of cellular identities emerging from these programs. Availability: SCENIC is available as an R workflow based on three new R/Bioconductor packages: GENIE3, RcisTarget and AUCell. As a scalable alternative to GENIE3, we also provide GRNboost, paving the way towards network analysis across millions of single cells.

1,101 citations


Posted Content · DOI
14 Nov 2017-bioRxiv
TL;DR: A novel assembly-based approach to variant calling, the GATK HaplotypeCaller and Reference Confidence Model, that determines genotype likelihoods independently per-sample but performs joint calling across all samples within a project simultaneously, showing that the accuracy of indel variant calling is superior in comparison to other algorithms.
Abstract: Comprehensive disease gene discovery in both common and rare diseases will require the efficient and accurate detection of all classes of genetic variation across tens to hundreds of thousands of human samples. We describe here a novel assembly-based approach to variant calling, the GATK HaplotypeCaller (HC) and Reference Confidence Model (RCM), that determines genotype likelihoods independently per-sample but performs joint calling across all samples within a project simultaneously. We show by calling over 90,000 samples from the Exome Aggregation Consortium (ExAC) that, in contrast to other algorithms, the HC-RCM scales efficiently to very large sample sizes without loss in accuracy; and that the accuracy of indel variant calling is superior to that of other algorithms. More importantly, the HC-RCM produces a fully squared-off matrix of genotypes across all samples at every genomic position being investigated. The HC-RCM is a novel, scalable, assembly-based algorithm with abundant applications for population genetics and clinical studies.
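The idea of per-sample genotype likelihoods can be illustrated with a toy binomial read-count model. This is a deliberate simplification for illustration only: the actual HaplotypeCaller computes likelihoods over locally assembled haplotypes, not raw allele counts, and the error rate and function names here are assumptions:

```python
from math import comb

def genotype_likelihoods(ref_reads, alt_reads, error=0.01):
    """Toy per-sample genotype posteriors under a binomial read model.

    For genotypes hom-ref, het, and hom-alt, the expected alt-allele
    fraction among reads is `error`, 0.5, and 1 - error respectively.
    Returns the three likelihoods normalized to sum to 1.
    """
    n = ref_reads + alt_reads
    likes = []
    for p_alt in (error, 0.5, 1.0 - error):
        likes.append(comb(n, alt_reads)
                     * p_alt ** alt_reads
                     * (1 - p_alt) ** ref_reads)
    total = sum(likes)
    return [l / total for l in likes]
```

Joint calling then combines such per-sample likelihoods with population-level allele frequency priors across all samples at once; the "squared-off" matrix means every sample gets a genotype (or reference-confidence) call at every site.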

1,033 citations


Posted Content · DOI
08 Mar 2017-bioRxiv
TL;DR: xCell, as presented in this paper, is a gene-signature-based method for inferring 64 immune and stroma cell types; it draws on 1,822 transcriptomic profiles of pure human cells from various sources, employs a curve fitting approach for linear comparison of cell types, and introduces a novel spillover compensation technique for separating closely related cell types.
Abstract: Tissues are a complex milieu consisting of numerous cell types. For example, understanding the cellular heterogeneity of the tumor microenvironment is an emerging field of research. Numerous methods have been published in recent years for the enumeration of cell subsets from tissue expression profiles. However, the available methods suffer from three major problems: inferring cell subsets based on gene sets learned and verified from limited sources; displaying only a partial portrayal of the full cellular heterogeneity; and insufficient validation in mixed tissues. To address these issues we developed xCell, a novel gene-signature based method for inferring 64 immune and stroma cell types. We first curated and harmonized 1,822 transcriptomic profiles of pure human cell types from various sources, employed a curve fitting approach for linear comparison of cell types, and introduced a novel spillover compensation technique for separating closely related cell types. We tested the ability of our model, learned from pure cell types, to infer enrichments of cell types in mixed tissues, using both comprehensive in silico analyses and comparison to cytometry immunophenotyping, and show that our scores outperform previously published methods. Finally, we explore the cell type enrichments in tumor samples and show that the cellular heterogeneity of the tumor microenvironment uniquely characterizes different cancer types. We provide our method for inferring cell type abundances as a public resource to allow researchers to portray the cellular heterogeneity landscape of tissue expression profiles: http://xCell.ucsf.edu/.
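A rank-based single-sample signature score of the general kind such methods build on can be sketched as follows. This is a generic illustration, not the actual xCell pipeline (which uses ssGSEA-style scoring followed by calibration and spillover compensation); the gene names and signature are hypothetical:

```python
def signature_score(expression, signature_genes):
    """Rank-based enrichment of a gene signature in one sample.

    `expression` maps gene -> expression value for a single sample.
    Genes are ranked (highest expression first) and the score is the
    mean rank of the signature genes rescaled to [0, 1], where 1
    means the signature sits at the very top of the profile.
    """
    ranked = sorted(expression, key=expression.get, reverse=True)
    rank = {gene: i + 1 for i, gene in enumerate(ranked)}
    n = len(ranked)
    mean_rank = sum(rank[g] for g in signature_genes) / len(signature_genes)
    return (n - mean_rank) / (n - 1)

# Hypothetical four-gene sample; CD8A/GZMB form a toy T-cell-like signature.
expr = {"CD8A": 10.0, "GZMB": 9.0, "ALB": 1.0, "MYH7": 0.0}
t_cell_like = signature_score(expr, ["CD8A", "GZMB"])
```

Spillover compensation then adjusts such raw scores so that closely related cell types (e.g. different T-cell subsets) do not inflate each other.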

995 citations


Posted Content · DOI
12 Jul 2017-bioRxiv
TL;DR: The integrative analysis of more than 2,600 whole cancer genomes and their matching normal tissues across 39 distinct tumour types represents the most comprehensive look at cancer whole genomes to date.
Abstract: We report the integrative analysis of more than 2,600 whole cancer genomes and their matching normal tissues across 39 distinct tumour types. By studying whole genomes we have been able to catalogue non-coding cancer driver events, study patterns of structural variation, infer tumour evolution, probe the interactions among variants in the germline genome, the tumour genome and the transcriptome, and derive an understanding of how coding and non-coding variations together contribute to driving individual patients' tumours. This work represents the most comprehensive look at cancer whole genomes to date. NOTE TO READERS: This is an incomplete draft of the marker paper for the Pan-Cancer Analysis of Whole Genomes Project, and is intended to provide the background information for a series of in-depth papers that will be posted to bioRxiv during the summer of 2017.

735 citations


Posted Content · DOI
06 Jun 2017-bioRxiv
TL;DR: The results suggest that gwMRF parcellations reveal neurobiologically meaningful features of brain organization and are potentially useful for future applications requiring dimensionality reduction of voxel-wise fMRI data.
Abstract: A central goal in systems neuroscience is the parcellation of the cerebral cortex into discrete neurobiological “atoms”. Resting-state functional magnetic resonance imaging (rs-fMRI) offers the possibility of in-vivo human cortical parcellation. Almost all previous parcellations relied on one of two approaches. The local gradient approach detects abrupt transitions in functional connectivity patterns. These transitions potentially reflect cortical areal boundaries defined by histology or visuotopic fMRI. By contrast, the global similarity approach clusters similar functional connectivity patterns regardless of spatial proximity, resulting in parcels with homogeneous (similar) rs-fMRI signals. Here we propose a gradient-weighted Markov Random Field (gwMRF) model integrating local gradient and global similarity approaches. Using task-fMRI and rs-fMRI across diverse acquisition protocols, we found gwMRF parcellations to be more homogeneous than four previously published parcellations. Furthermore, gwMRF parcellations agreed with the boundaries of certain cortical areas defined using histology and visuotopic fMRI. Some parcels captured sub-areal (somatotopic and visuotopic) features that likely reflect distinct computational units within known cortical areas. These results suggest that gwMRF parcellations reveal neurobiologically meaningful features of brain organization and are potentially useful for future applications requiring dimensionality reduction of voxel-wise fMRI data. Multi-resolution parcellations generated from 1489 participants are available at FREESURFER_WIKI LINK_TO_BE_ADDED.

698 citations


Posted Content · DOI
20 Jul 2017-bioRxiv
TL;DR: The UK Biobank project is a large prospective cohort study of ~500,000 individuals from across the United Kingdom, aged 40-69 at recruitment; the authors conduct a set of analyses that reveal properties of the genetic data – such as population structure and relatedness – that can be important for downstream analyses.
Abstract: The UK Biobank project is a large prospective cohort study of ~500,000 individuals from across the United Kingdom, aged 40-69 at recruitment. A rich variety of phenotypic and health-related information is available on each participant, making the resource unprecedented in its size and scope. Here we describe the genome-wide genotype data (~805,000 markers) collected on all individuals in the cohort and its quality control procedures. Genotype data on this scale offers novel opportunities for assessing quality issues, although the wide range of ancestries of the individuals in the cohort also creates particular challenges. We also conducted a set of analyses that reveal properties of the genetic data – such as population structure and relatedness – that can be important for downstream analyses. In addition, we phased and imputed genotypes into the dataset, using computationally efficient methods combined with the Haplotype Reference Consortium (HRC) and UK10K haplotype resource. This increases the number of testable variants by over 100-fold to ~96 million variants. We also imputed classical allelic variation at 11 human leukocyte antigen (HLA) genes, and as a quality control check of this imputation, we replicate signals of known associations between HLA alleles and many common diseases. We describe tools that allow efficient genome-wide association studies (GWAS) of multiple traits and fast phenome-wide association studies (PheWAS), which work together with a new compressed file format that has been used to distribute the dataset. As a further check of the genotyped and imputed datasets, we performed a test-case genome-wide association scan on a well-studied human trait, standing height.

659 citations


Posted Content · DOI
10 May 2017-bioRxiv
TL;DR: A new, low-cost, high throughput reduced representation expression profiling method, L1000, is shown to be highly reproducible, comparable to RNA sequencing, and suitable for computational inference of the expression levels of 81% of non-measured transcripts.
Abstract: We previously piloted the concept of a Connectivity Map (CMap), whereby genes, drugs and disease states are connected by virtue of common gene-expression signatures. Here, we report a more than 1,000-fold scale-up of the CMap as part of the NIH LINCS Consortium, made possible by a new, low-cost, high-throughput reduced representation expression profiling method that we term L1000. We show that L1000 is highly reproducible, comparable to RNA sequencing, and suitable for computational inference of the expression levels of 81% of non-measured transcripts. We further show that the expanded CMap can be used to discover mechanisms of action of small molecules, functionally annotate genetic variants of disease genes, and inform clinical trials. The 1.3 million L1000 profiles described here, as well as tools for their analysis, are available at https://clue.io.

636 citations


Posted Content · DOI
10 Jul 2017-bioRxiv
TL;DR: In this paper, R2GLMM, a version of the coefficient of determination for generalized linear mixed models, is extended to non-Gaussian distributions; the coefficient of determination quantifies the proportion of variance explained by a statistical model and is an important summary statistic of biological interest.
Abstract: The coefficient of determination R2 quantifies the proportion of variance explained by a statistical model and is an important summary statistic of biological interest. However, estimating R2 for generalized linear mixed models (GLMMs) remains challenging. We have previously introduced a version of R2 that we called R2GLMM for Poisson and binomial GLMMs, but not for other distributional families. Similarly, we earlier discussed how to estimate intra-class correlation coefficients (ICC) using Poisson and binomial GLMMs. In this article, we expand our methods to all other non-Gaussian distributions, in particular to negative binomial and gamma distributions that are commonly used for modelling biological data. While expanding our approach, we highlight two useful concepts for biologists, Jensen's inequality and the delta method, both of which help us in understanding the properties of GLMMs. Jensen's inequality has important implications for biologically meaningful interpretation of GLMMs, while the delta method allows a general derivation of variance associated with non-Gaussian distributions. We also discuss some special considerations for binomial GLMMs with binary or proportion data. We illustrate the implementation of our extension by worked examples from the field of ecology and evolution in the R environment. However, our method can be used across disciplines and regardless of statistical environments.
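The general shape of a GLMM R2 of this kind is a ratio of variance components; the distribution-specific residual variance is where the delta method enters. A sketch with made-up variance components (the authors' worked examples are in R; this Python version only illustrates the arithmetic):

```python
def r2_glmm(var_fixed, var_random, var_resid):
    """Marginal and conditional R2 for a GLMM, Nakagawa-style.

    var_fixed  : variance of the fixed-effect predictions
    var_random : summed random-effect variances
    var_resid  : distribution-specific residual variance on the
                 link scale
    Marginal R2 credits fixed effects only; conditional R2 credits
    fixed plus random effects.
    """
    total = var_fixed + var_random + var_resid
    return var_fixed / total, (var_fixed + var_random) / total

# Delta-method residual variance for a log-link Poisson model with
# mean count lambda: var(ln Y) ~ var(Y) / E[Y]^2 = 1 / lambda.
lam = 4.0
marginal, conditional = r2_glmm(0.8, 0.4, 1.0 / lam)
```

Conditional R2 is always at least as large as marginal R2, since the random-effect variance moves from the denominator-only position into the numerator.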

549 citations


Posted Content · DOI
19 Apr 2017-bioRxiv
TL;DR: A novel method, Slingshot, is introduced for inferring multiple developmental lineages from single-cell gene expression data and is described as a uniquely robust and flexible tool for lineage inference and for ordering cells to reflect continuous, branching processes.
Abstract: Single-cell transcriptomics allows researchers to investigate complex communities of heterogeneous cells. These methods can be applied to stem cells and their descendants in order to chart the progression from multipotent progenitors to fully differentiated cells. While a number of statistical and computational methods have been proposed for analyzing cell lineages, the problem of accurately characterizing multiple branching lineages remains difficult to solve. Here, we introduce a novel method, Slingshot, for inferring multiple developmental lineages from single-cell gene expression data. Slingshot is a uniquely robust and flexible tool for inferring developmental lineages and ordering cells to reflect continuous, branching processes.

Posted Content · DOI
10 Jul 2017-bioRxiv
TL;DR: CERES, a computational method to estimate gene dependency levels from CRISPR-Cas9 essentiality screens while accounting for the copy-number-specific effect as well as variable sgRNA activity, is developed; applied to sets of screens performed with different sgRNA libraries, it reduces false positive results and provides meaningful estimates of sgRNA activity.
Abstract: The CRISPR-Cas9 system has revolutionized gene editing both on single genes and in multiplexed loss-of-function screens, enabling precise genome-scale identification of genes essential to proliferation and survival of cancer cells. However, previous studies reported that an anti-proliferative effect of Cas9-mediated DNA cleavage confounds such measurement of genetic dependency, particularly in the setting of copy number gain. We performed genome-scale CRISPR-Cas9 essentiality screens on 342 cancer cell lines and found that this effect is common to all lines, leading to false positive results when targeting genes in copy number amplified regions. We developed CERES, a computational method to estimate gene dependency levels from CRISPR-Cas9 essentiality screens while accounting for the copy-number-specific effect, as well as variable sgRNA activity. We applied CERES to sets of screens performed with different sgRNA libraries and found that it reduces false positive results and provides meaningful estimates of sgRNA activity. As a result, the application of CERES improves confidence in the interpretation of genetic dependency data from CRISPR-Cas9 essentiality screens of cancer cell lines.

Posted Content · DOI
31 Mar 2017-bioRxiv
TL;DR: HiGlass, as presented in this paper, is a web-based viewer for genome interaction maps featuring synchronized navigation of multiple views as well as continuous zooming and panning for navigation across genomic loci and resolutions.
Abstract: We present HiGlass (http://higlass.io), a web-based viewer for genome interaction maps featuring synchronized navigation of multiple views as well as continuous zooming and panning for navigation across genomic loci and resolutions. We demonstrate how visual comparison of Hi-C and other genomic data from different experimental conditions can be used to efficiently identify salient outcomes of experimental perturbations, generate new hypotheses, and share the results with the community.

Posted Content · DOI
06 Dec 2017-bioRxiv
TL;DR: A combined transcriptomic and projectional taxonomy of cortical cell types from functionally distinct regions of the mouse cortex is established and correspondence between excitatory transcriptomic types and their region-specific long-range target specificity is demonstrated.
Abstract: Neocortex contains a multitude of cell types segregated into layers and functionally distinct regions. To investigate the diversity of cell types across the mouse neocortex, we analyzed 12,714 cells from the primary visual cortex (VISp), and 9,035 cells from the anterior lateral motor cortex (ALM) by deep single-cell RNA-sequencing (scRNA-seq), identifying 116 transcriptomic cell types. These two regions represent distant poles of the neocortex and perform distinct functions. We define 50 inhibitory transcriptomic cell types, all of which are shared across both cortical regions. In contrast, 49 of 52 excitatory transcriptomic types were found in either VISp or ALM, with only three present in both. By combining single cell RNA-seq and retrograde labeling, we demonstrate correspondence between excitatory transcriptomic types and their region-specific long-range target specificity. This study establishes a combined transcriptomic and projectional taxonomy of cortical cell types from functionally distinct regions of the mouse cortex.

Posted Content · DOI
17 Aug 2017-bioRxiv
TL;DR: Technical sequencing quality metrics can be complemented by quantifying completeness in terms of the expected gene content of Benchmarking Universal Single-Copy Orthologs (BUSCO), now in its third release.
Abstract: Genomics promises comprehensive surveying of genomes and metagenomes, but rapidly changing technologies and expanding data volumes make evaluation of completeness a challenging task. Technical sequencing quality metrics can be complemented by quantifying completeness in terms of the expected gene content of Benchmarking Universal Single-Copy Orthologs (BUSCO, http://busco.ezlab.org). Now in its third release, BUSCO utilities extend beyond quality control to applications in comparative genomics, gene predictor training, metagenomics, and phylogenomics.
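The completeness summary BUSCO reports reduces to percentages over per-ortholog search outcomes. A sketch of that bookkeeping only (the status names are illustrative, and this omits BUSCO's actual HMM-based search of the assembly):

```python
def busco_summary(statuses):
    """Summarize BUSCO-style completeness percentages.

    `statuses` maps each benchmarking single-copy ortholog to one of
    'complete_single', 'complete_duplicated', 'fragmented', or
    'missing'.  Returns the familiar C/S/D/F/M percentages, where
    C (complete) = S (single-copy) + D (duplicated).
    """
    n = len(statuses)
    counts = {}
    for status in statuses.values():
        counts[status] = counts.get(status, 0) + 1
    single = counts.get("complete_single", 0)
    dup = counts.get("complete_duplicated", 0)
    return {
        "C": 100.0 * (single + dup) / n,
        "S": 100.0 * single / n,
        "D": 100.0 * dup / n,
        "F": 100.0 * counts.get("fragmented", 0) / n,
        "M": 100.0 * counts.get("missing", 0) / n,
    }
```

A high C percentage with low D suggests a complete, non-redundant assembly; elevated D can indicate uncollapsed haplotypes or genuine duplication.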

Posted Content · DOI
17 Mar 2017-bioRxiv
TL;DR: This work not only improves accuracy but also broadens the scope of absolute cell fraction predictions from tumor gene expression data, and provides a unique novel experimental benchmark for immunogenomics analyses in cancer research.
Abstract: Immune cells infiltrating tumors can have important impact on tumor progression and response to therapy. We present an efficient algorithm to simultaneously estimate the fraction of cancer and immune cell types from bulk tumor gene expression data. Our method integrates novel gene expression profiles from circulating and tumor infiltrating cells for each major immune cell type, cell-type specific mRNA content and the ability to model uncharacterized, and possibly highly variable, cell types. Feasibility is demonstrated by validation with flow cytometry, immunohistochemistry and single-cell RNA-Seq analyses of human melanoma and colorectal tumor specimens. Altogether, our work not only improves accuracy but also broadens the scope of absolute cell fraction predictions from tumor gene expression data, and provides a unique novel experimental benchmark for immunogenomics analyses in cancer research.

Posted Content · DOI
20 Apr 2017-bioRxiv
TL;DR: Modelling the repeat structure of the human genome predicts extraordinarily contiguous assemblies may be possible using nanopore reads alone, and it is found that adding an additional 5×-coverage of ‘ultra-long’ reads more than doubled the assembly contiguity.
Abstract: Nanopore sequencing is a promising technique for genome sequencing due to its portability, ability to sequence long reads from single molecules, and to simultaneously assay DNA methylation. However, until recently, nanopore sequencing had mainly been applied to small genomes due to the limited output attainable. We present nanopore sequencing and assembly of the GM12878 Utah/Ceph human reference genome generated using the Oxford Nanopore MinION and R9.4 version chemistry. We generated 91.2 Gb of sequence data (~30x theoretical coverage) from 39 flowcells. De novo assembly yielded a highly complete and contiguous assembly (NG50 ~3Mb). We observed considerable variability in homopolymeric tract resolution between different basecallers. The data permitted sensitive detection of both large structural variants and epigenetic modifications. Further, we developed a new approach exploiting the long-read capability of this system and found that adding an additional 5x-coverage of "ultra-long" reads (read N50 of 99.7kb) more than doubled the assembly contiguity. Modelling the repeat structure of the human genome predicts that extraordinarily contiguous assemblies may be possible using nanopore reads alone. Portable de novo sequencing of human genomes may be important for rapid point-of-care diagnosis of rare genetic diseases and cancer, and monitoring of cancer progression. The complete dataset including raw signal is available as an Amazon Web Services Open Dataset at: https://github.com/nanopore-wgs-consortium/NA12878.
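The contiguity statistics quoted in the abstract (assembly NG50 ~3 Mb, read N50 of 99.7 kb) follow one simple definition, sketched here with toy lengths:

```python
def n50(lengths, genome_size=None):
    """Contig/read N50, or NG50 when genome_size is given.

    Sort lengths longest-first and return the length at which the
    running total first reaches half the total assembly size (N50)
    or half the supplied genome size (NG50).
    """
    target = (genome_size if genome_size is not None else sum(lengths)) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the assembly covers less than half the genome size

contigs = [50, 40, 30, 20, 10]  # toy contig lengths; total 150
```

NG50 uses an external genome size rather than the assembly's own total, so it penalizes incomplete assemblies where N50 would not.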

Posted Content · DOI
23 Sep 2017-bioRxiv
TL;DR: A suite of long- and short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms are applied to comprehensively analyze three human parent–child trios to define the full spectrum of human genetic variation in a haplotype-resolved manner.
Abstract: The incomplete identification of structural variants from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long- and short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three human parent-child trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,181 indel variants (<50 bp) and 31,599 structural variants (≥50 bp) per human genome, a sevenfold increase in structural variation compared to previous reports, including from the 1000 Genomes Project. We also discovered 156 inversions per genome, most of which previously escaped detection, as well as large unbalanced chromosomal rearrangements. We provide near-complete, haplotype-resolved structural variation for three genomes that can now be used as a gold standard for the scientific community and we make specific recommendations for maximizing structural variation sensitivity for future large-scale genome sequencing studies.

Posted Content · DOI
11 Jul 2017-bioRxiv
TL;DR: Whole-genome sequencing of 2,778 tumour samples from 2,658 donors is used to reconstruct the life history, evolution of mutational processes, and driver mutation sequences of 39 cancer types, suggesting a window of opportunity for early cancer detection.
Abstract: Cancer develops through a process of somatic evolution. Here, we reconstruct the evolutionary history of 2,778 tumour samples from 2,658 donors spanning 39 cancer types. Characteristic copy number gains, such as trisomy 7 in glioblastoma or isochromosome 17q in medulloblastoma, are found amongst the earliest events in tumour evolution. The early phases of oncogenesis are driven by point mutations in a restricted set of cancer genes, often including biallelic inactivation of tumour suppressors. By contrast, increased genomic instability, a more than three-fold diversification of driver genes, and an acceleration of mutational processes are features of later stages. Clock-like mutations yield estimates for whole genome duplications and subclonal diversification in chronological time. Our results suggest that driver mutations often precede diagnosis by many years, and in some cases decades. Taken together, these data reveal common and divergent trajectories of cancer evolution, pivotal for understanding tumour biology and guiding early cancer detection.

Posted Content · DOI
07 Feb 2017-bioRxiv
TL;DR: The mathematical models and physical concepts that underlie the latest Rosetta energy function, beta_nov15, and the latest advances in the energy function that extend capabilities from soluble proteins to also include membrane proteins, peptides containing non-canonical amino acids, carbohydrates, nucleic acids, and other macromolecules are discussed.
Abstract: Over the past decade, the Rosetta biomolecular modeling suite has informed diverse biological questions and engineering challenges ranging from interpretation of low-resolution structural data to design of nanomaterials, protein therapeutics, and vaccines. Central to Rosetta's success is the energy function: a model parameterized from small molecule and X-ray crystal structure data used to approximate the energy associated with each biomolecule conformation. This paper describes the mathematical models and physical concepts that underlie the latest Rosetta energy function, Aasgard2017. Applying these concepts, we explain how to use Rosetta energies to identify and analyze the features of biomolecular models. Finally, we discuss the latest advances in the energy function that extend capabilities from soluble proteins to also include membrane proteins, peptides containing non-canonical amino acids, carbohydrates, nucleic acids, and other macromolecules.

Posted Content · DOI
25 Jan 2017-bioRxiv
TL;DR: This work develops and applies an approach that uses stratified LD score regression to test whether disease heritability is enriched in regions surrounding genes with the highest specific expression in a given tissue and demonstrates that the polygenic approach is a powerful way to leverage gene expression data for interpreting GWAS signal.
Abstract: Genetics can provide a systematic approach to discovering the tissues and cell types relevant for a complex disease or trait. Identifying these tissues and cell types is critical for following up on non-coding allelic function, developing ex-vivo models, and identifying therapeutic targets. Here, we analyze gene expression data from several sources, including the GTEx and PsychENCODE consortia, together with genome-wide association study (GWAS) summary statistics for 48 diseases and traits with an average sample size of 86,850, to identify disease-relevant tissues and cell types. We develop and apply an approach that uses stratified LD score regression to test whether disease heritability is enriched in regions surrounding genes with the highest specific expression in a given tissue. We detect tissue-specific enrichments at FDR < 5% for 30 diseases and traits across a broad range of tissues that recapitulate known biology. In our analysis of traits with observed central nervous system enrichment, we detect an enrichment of neurons over other brain cell types for several brain-related traits, enrichment of inhibitory neurons over excitatory neurons for bipolar disorder, and enrichments in the cortex for schizophrenia and in the striatum for migraine. In our analysis of traits with observed immunological enrichment, we identify enrichments of alpha beta T cells for asthma and eczema, B cells for primary biliary cirrhosis, and myeloid cells for lupus and Alzheimer's disease. Our results demonstrate that our polygenic approach is a powerful way to leverage gene expression data for interpreting GWAS signal.

Posted Content · DOI
20 Apr 2017-bioRxiv
TL;DR: In this paper, the assembly and annotation of maize, a genetic and agricultural model species, using Single Molecule Real-Time (SMRT) sequencing and high-resolution optical mapping is reported.
Abstract: Complete and accurate reference genomes and annotations provide fundamental tools for characterization of genetic and functional variation. These resources facilitate elucidation of biological processes and support translation of research findings into improved and sustainable agricultural technologies. Many reference genomes for crop plants have been generated over the past decade, but these genomes are often fragmented and missing complex repeat regions. Here, we report the assembly and annotation of maize, a genetic and agricultural model species, using Single Molecule Real-Time (SMRT) sequencing and high-resolution optical mapping. Relative to the previous reference genome, our assembly features a 52-fold increase in contig length and significant improvements in the assembly of intergenic spaces and centromeres. Characterization of the repetitive portion of the genome revealed over 130,000 intact transposable elements (TEs), allowing us to identify TE lineage expansions unique to maize. Gene annotations were updated using 111,000 full-length transcripts obtained by SMRT sequencing. In addition, comparative optical mapping of two other inbreds revealed a prevalence of deletions in the low gene density region and maize lineage-specific genes.

Posted ContentDOI
02 May 2017-bioRxiv
TL;DR: BugBase is presented, an algorithm that predicts organism-level coverage of functional pathways as well as biologically interpretable phenotypes such as oxygen tolerance, Gram staining and pathogenic potential, within complex microbiomes using either whole-genome shotgun or marker gene sequencing data.
Abstract: Shotgun metagenomics and marker gene amplicon sequencing can be used to directly measure or predict the functional repertoire of the microbiota en masse, but current methods do not readily estimate the functional capability of individual microorganisms. Here we present BugBase, an algorithm that predicts organism-level coverage of functional pathways as well as biologically interpretable phenotypes such as oxygen tolerance, Gram staining, and pathogenic potential, within complex microbiomes using either whole-genome shotgun or marker gene sequencing data. We find the organism-level pathway coverage of BugBase predictions to be statistically higher powered than current bag-of-genes approaches for discerning functional changes in both host-associated and environmental microbiomes.
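BugBase's actual algorithm estimates per-organism pathway coverage; as a conceptual sketch only, a community-level binary phenotype (e.g. Gram staining) can be summarized as the abundance-weighted mean of per-taxon trait indicators, with made-up abundances and flags:

```python
def community_phenotype(abundances, has_trait):
    """Abundance-weighted fraction of a community predicted to carry
    a binary phenotype (conceptual sketch, not BugBase's algorithm)."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    return sum(a * t for a, t in zip(abundances, has_trait)) / total

# Hypothetical community: three taxa with relative abundances and a
# made-up Gram-positive indicator per taxon.
print(community_phenotype([0.5, 0.3, 0.2], [1, 0, 1]))  # 0.7
```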

Posted ContentDOI
12 Jun 2017-bioRxiv
TL;DR: This work leveraged the exome sequencing data of 60,706 individuals from the Exome Aggregation Consortium (ExAC) dataset to identify sub-genic regions that are depleted of missense variation and used this depletion as part of a novel missense deleteriousness metric named MPC.
Abstract: Given increasing numbers of patients who are undergoing exome or genome sequencing, it is critical to establish tools and methods to interpret the impact of genetic variation. While the ability to predict deleteriousness for any given variant is limited, missense variants remain a particularly challenging class of variation to interpret, since they can have drastically different effects depending on both the precise location and specific amino acid substitution of the variant. In order to better evaluate missense variation, we leveraged the exome sequencing data of 60,706 individuals from the Exome Aggregation Consortium (ExAC) dataset to identify sub-genic regions that are depleted of missense variation. We further used this depletion as part of a novel missense deleteriousness metric named MPC. We applied MPC to de novo missense variants and identified a category of de novo missense variants with the same impact on neurodevelopmental disorders as truncating mutations in intolerant genes, supporting the value of incorporating regional missense constraint in variant interpretation.
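Regional missense constraint rests on comparing observed to expected missense counts within a sub-genic region; a minimal illustration with hypothetical counts (not ExAC values, and not the full MPC computation):

```python
def missense_depletion(observed, expected):
    """Observed/expected missense ratio for a sub-genic region;
    values well below 1 indicate depletion (constraint)."""
    if expected <= 0:
        raise ValueError("expected missense count must be positive")
    return observed / expected

# Hypothetical counts for two regions of the same gene:
print(missense_depletion(12, 60))  # 0.2   -> strongly depleted region
print(missense_depletion(55, 60))  # ~0.92 -> region tolerant of missense variation
```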

Posted ContentDOI
05 Nov 2017-bioRxiv
TL;DR: A theory is presented that reveals conceptual insights into how task complexity governs both neural dimensionality and accurate recovery of dynamic portraits, thereby providing quantitative guidelines for future large-scale experimental design.
Abstract: In many experiments, neuroscientists tightly control behavior, record many trials, and obtain trial-averaged firing rates from hundreds of neurons in circuits containing billions of behaviorally relevant neurons. Dimensionality reduction methods reveal a striking simplicity underlying such multi-neuronal data: they can be reduced to a low-dimensional space, and the resulting neural trajectories in this space yield a remarkably insightful dynamical portrait of circuit computation. This simplicity raises profound and timely conceptual questions. What are its origins and its implications for the complexity of neural dynamics? How would the situation change if we recorded more neurons? When, if at all, can we trust dynamical portraits obtained from measuring an infinitesimal fraction of task relevant neurons? We present a theory that answers these questions, and test it using physiological recordings from reaching monkeys. This theory reveals conceptual insights into how task complexity governs both neural dimensionality and accurate recovery of dynamic portraits, thereby providing quantitative guidelines for future large-scale experimental design.
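One common way to quantify the low dimensionality of multi-neuronal data is the participation ratio of the PCA eigenvalue spectrum; a sketch on synthetic data driven by three latent signals (not the physiological recordings, and not necessarily the paper's exact measure):

```python
import numpy as np

def participation_ratio(X):
    """Dimensionality of data X (samples x neurons) from the eigenvalues
    l_i of the neuron covariance: PR = (sum l_i)^2 / sum l_i^2."""
    X = X - X.mean(axis=0)                 # center each neuron
    eig = np.linalg.eigvalsh(X.T @ X / (X.shape[0] - 1))
    eig = np.clip(eig, 0.0, None)          # guard against tiny negative eigenvalues
    return eig.sum() ** 2 / (eig ** 2).sum()

rng = np.random.default_rng(0)
# 50 synthetic "neurons" driven by 3 shared latent signals plus weak noise.
latents = rng.normal(size=(1000, 3))
weights = rng.normal(size=(3, 50))
X = latents @ weights + 0.1 * rng.normal(size=(1000, 50))
print(participation_ratio(X))  # close to 3: a few dimensions dominate
```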

Posted ContentDOI
01 May 2017-bioRxiv
TL;DR: A new R package, glmmTMB, is presented, that increases the range of models that can easily be fitted to count data using maximum likelihood estimation and is faster than packages that use Markov chain Monte Carlo sampling for estimation.
Abstract: Ecological phenomena are often measured in the form of count data. These data can be analyzed using generalized linear mixed models (GLMMs) when observations are correlated in ways that require random effects. However, count data are often zero-inflated, containing more zeros than would be expected from the standard error distributions used in GLMMs, e.g., parasite counts may be exactly zero for hosts with effective immune defenses but vary according to a negative binomial distribution for non-resistant hosts. We present a new R package, glmmTMB, that increases the range of models that can easily be fitted to count data using maximum likelihood estimation. The interface was developed to be familiar to users of the lme4 R package, a common tool for fitting GLMMs. To maximize speed and flexibility, estimation is done using Template Model Builder (TMB), utilizing automatic differentiation to estimate model gradients and the Laplace approximation for handling random effects. We demonstrate glmmTMB and compare it to other available methods using two ecological case studies. In general, glmmTMB is more flexible than other packages available for estimating zero-inflated models via maximum likelihood estimation and is faster than packages that use Markov chain Monte Carlo sampling for estimation; it is also more flexible for zero-inflated modelling than INLA, but speed comparisons vary with model and data structure. Our package can be used to fit GLMs and GLMMs with or without zero-inflation as well as hurdle models. By allowing ecologists to quickly estimate a wide variety of models using a single package, glmmTMB makes it easier to find appropriate models and test hypotheses to describe ecological processes.
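Zero inflation is easy to illustrate by simulation: mixing structural zeros (fully resistant hosts) into a negative binomial count distribution pushes the zero fraction well above what the count distribution alone produces. A sketch in Python rather than R, with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
# Plain negative binomial counts, e.g. parasite loads of non-resistant hosts.
nb = rng.negative_binomial(2, 0.2, size=n)
# Zero inflation: with probability 0.3 a host is fully resistant,
# contributing a structural zero on top of the count distribution.
structural_zero = rng.random(n) < 0.3
zinb = np.where(structural_zero, 0, nb)

print("zero fraction, plain NB:        ", (nb == 0).mean())
print("zero fraction, zero-inflated NB:", (zinb == 0).mean())
```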

Posted ContentDOI
03 Jun 2017-bioRxiv
TL;DR: The results support the hypothesis that clinical diagnosis of ADHD is an extreme expression of one or more continuous heritable traits, a conclusion reinforced by additional analyses of a self-reported ADHD sample and a study of quantitative measures of ADHD symptoms in the population.
Abstract: Attention-Deficit/Hyperactivity Disorder (ADHD) is a highly heritable childhood behavioral disorder affecting 5% of school-age children and 2.5% of adults. Common genetic variants contribute substantially to ADHD susceptibility, but no individual variants have been robustly associated with ADHD. We report a genome-wide association meta-analysis of 20,183 ADHD cases and 35,191 controls that identifies variants surpassing genome-wide significance in 12 independent loci, revealing new and important information on the underlying biology of ADHD. Associations are enriched in evolutionarily constrained genomic regions and loss-of-function intolerant genes, as well as around brain-expressed regulatory marks. These findings, based on clinical interviews and/or medical records, are supported by additional analyses of a self-reported ADHD sample and a study of quantitative measures of ADHD symptoms in the population. Meta-analyzing these data with our primary scan yielded a total of 16 genome-wide significant loci. The results support the hypothesis that clinical diagnosis of ADHD is an extreme expression of one or more continuous heritable traits.
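A GWAS meta-analysis of the kind described typically pools per-cohort effect estimates by fixed-effects inverse-variance weighting; a minimal sketch with hypothetical effect sizes (not values from the study):

```python
import math

def ivw_meta(betas, ses):
    """Fixed-effects inverse-variance-weighted meta-analysis:
    returns the pooled effect estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))

# Hypothetical per-cohort log-odds ratios and standard errors for one SNP
# (e.g. clinical, self-report, and population cohorts; made-up numbers).
beta, se = ivw_meta([0.10, 0.08, 0.12], [0.02, 0.03, 0.05])
print(round(beta, 4), round(se, 4))  # 0.0965 0.0158
```

Note that the pooled standard error is smaller than any single cohort's, which is why meta-analyzing additional samples can lift loci past genome-wide significance.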

Posted ContentDOI
17 Feb 2017-bioRxiv
TL;DR: An algorithm for fast Non-Rigid Motion Correction (NoRMCorre) based on template matching is introduced; it can be run in an online mode, registering motion on streaming data at speeds comparable to, or even faster than, real time.
Abstract: Motion correction is a challenging pre-processing problem that arises early in the analysis pipeline of calcium imaging data sequences. Here we introduce an algorithm for fast Non-Rigid Motion Correction (NoRMCorre) based on template matching. NoRMCorre operates by splitting the field of view into overlapping spatial patches that are registered for rigid translation against a continuously updated template. The estimated alignments are subsequently up-sampled to create a smooth motion field for each frame that can efficiently approximate non-rigid motion in a piecewise-rigid manner. NoRMCorre allows for subpixel registration and can be run in an online mode, registering motion on streaming data at speeds comparable to, or even faster than, real time. We evaluate the performance of the proposed method with simple yet intuitive metrics and compare against other non-rigid registration methods on two-photon calcium imaging datasets. Open source Matlab and Python code is also made available.
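The core rigid-translation step of template matching can be sketched with FFT-based cross-correlation. This is a conceptual illustration only, not the NoRMCorre implementation, which adds overlapping patches, subpixel upsampling, and continuous template updates:

```python
import numpy as np

def estimate_shift(frame, template):
    """Estimate the integer (dy, dx) translation mapping `template` onto
    `frame` via the peak of their FFT-based cross-correlation."""
    xcorr = np.fft.ifft2(np.fft.fft2(frame) * np.conj(np.fft.fft2(template))).real
    peak = np.array(np.unravel_index(np.argmax(xcorr), xcorr.shape))
    size = np.array(frame.shape)
    # Wrap shifts past the halfway point into negative displacements.
    peak[peak > size // 2] -= size[peak > size // 2]
    return tuple(int(s) for s in peak)

# Synthetic check: a circularly shifted copy of a random template.
rng = np.random.default_rng(1)
template = rng.normal(size=(64, 64))
frame = np.roll(template, (3, -2), axis=(0, 1))
print(estimate_shift(frame, template))  # (3, -2)
```

Applying this estimator independently to each spatial patch, then interpolating the per-patch shifts into a smooth field, gives the piecewise-rigid approximation described above.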

Posted ContentDOI
08 Mar 2017-bioRxiv
TL;DR: This is the first study to employ deep learning to identify multi-omics features linked to the differential survival of HCC patients, and the resulting model is expected to be useful for HCC prognosis prediction.
Abstract: Identifying robust survival subgroups of hepatocellular carcinoma (HCC) will significantly improve patient care. To date, efforts to integrate multi-omics data to explicitly predict HCC survival across multiple patient cohorts have been lacking. To fill this gap, we present a deep learning (DL) based model on HCC that robustly differentiates survival subpopulations of patients in six cohorts. We train the DL-based, survival-sensitive model on data from 360 HCC patients, using RNA-seq, miRNA-seq and methylation data from TCGA. This model provides two optimal subgroups of patients with significant survival differences (P=7.13e-6) and good model fitness (C-index=0.68). The more aggressive subtype is associated with frequent TP53 inactivation mutations, higher expression of stemness markers (KRT19, EPCAM) and the tumor marker BIRC5, and activated Wnt and Akt signaling pathways. We validated this multi-omics model on five external datasets of various omics types: LIRI-JP cohort (n=230, C-index=0.75), NCI cohort (n=221, C-index=0.67), Chinese cohort (n=166, C-index=0.69), E-TABM-36 cohort (n=40, C-index=0.77), and Hawaiian cohort (n=27, C-index=0.82). This is the first study to employ deep learning to identify multi-omics features linked to the differential survival of HCC patients. Given its robustness over multiple cohorts, we expect this model to be clinically useful for HCC prognosis prediction.
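The C-index values quoted above measure how well predicted risks order patient outcomes; a self-contained sketch of Harrell's concordance index on hypothetical survival data (not the study's cohorts):

```python
from itertools import combinations

def concordance_index(times, events, risks):
    """Harrell's C-index: among comparable patient pairs, the fraction where
    the higher predicted risk goes with the earlier observed event
    (risk ties count 1/2; ties in time are skipped for brevity)."""
    concordant = tied = comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[j] < times[i]:
            i, j = j, i               # make i the earlier patient
        if times[i] == times[j] or not events[i]:
            continue                  # comparable only if the earlier patient had the event
        comparable += 1
        if risks[i] > risks[j]:
            concordant += 1
        elif risks[i] == risks[j]:
            tied += 1
    return (concordant + 0.5 * tied) / comparable

# Hypothetical survival times (months), event indicators (1 = death observed),
# and model risk scores for four patients; perfect ordering gives 1.0.
print(concordance_index([5, 10, 20, 30], [1, 1, 0, 1], [0.9, 0.7, 0.6, 0.2]))  # 1.0
```

A value of 0.5 corresponds to random ordering, so C-indices in the 0.67 to 0.82 range indicate substantially better-than-chance risk ranking.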

Posted ContentDOI
22 May 2017-bioRxiv
TL;DR: A custom high-throughput electron microscopy platform was developed and used to image the entire adult fruit fly brain, enabling brain-spanning mapping of neuronal circuits at the synaptic level and revealing that axonal arbors providing input to the MB calyx are more tightly clustered than previously indicated by light-level data.
Abstract: Drosophila melanogaster has a rich repertoire of innate and learned behaviors. Its 100,000-neuron brain is a large but tractable target for comprehensive neural circuit mapping. Only electron microscopy (EM) enables complete, unbiased mapping of synaptic connectivity; however, the fly brain is too large for conventional EM. We developed a custom high-throughput EM platform and imaged the entire brain of an adult female fly. We validated the dataset by tracing brain-spanning circuitry involving the mushroom body (MB), intensively studied for its role in learning. Here we describe the complete set of olfactory inputs to the MB; find a new cell type providing driving input to Kenyon cells (the intrinsic MB neurons); identify neurons postsynaptic to Kenyon cell dendrites; and find that axonal arbors providing input to the MB calyx are more tightly clustered than previously indicated by light-level data. This freely available EM dataset will significantly accelerate Drosophila neuroscience.