
Showing papers in "bioRxiv in 2019"


Posted ContentDOI
09 Aug 2019-bioRxiv
TL;DR: It is shown that U1 AMO also modulates cancer cells’ phenotype, dose-dependently increasing migration and invasion in vitro by up to 500%, whereas U1 over-expression has the opposite effect.
Abstract: Stimulated cells and cancer cells have widespread shortening of mRNA 3’-untranslated regions (3’UTRs) and switches to shorter mRNA isoforms due to usage of more proximal polyadenylation signals (PASs) in the last exon and in introns. U1 snRNA (U1), vertebrates’ most abundant non-coding (spliceosomal) small nuclear RNA, silences proximal PASs and its inhibition with antisense morpholino oligonucleotides (U1 AMO) triggers widespread mRNA shortening. Here we show that U1 AMO also modulates cancer cells’ phenotype, dose-dependently increasing migration and invasion in vitro by up to 500%, whereas U1 over-expression has the opposite effect. In addition to 3’UTR length, numerous transcriptome changes that could contribute to this phenotype are observed, including alternative splicing, and mRNA expression levels of proto-oncogenes and tumor suppressors. These findings reveal an unexpected link between U1 regulation and oncogenic and activated cell states, and suggest U1 as a potential target for their modulation.

1,660 citations


Posted ContentDOI
24 Apr 2019-bioRxiv
TL;DR: This extends OrthoFinder’s high accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomic statistics.
Abstract: Here, we present a major advance of the OrthoFinder method. This extends OrthoFinder’s high accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomic statistics. Each output is benchmarked on appropriate real or simulated datasets and, where comparable methods exist, OrthoFinder is equivalent to or outperforms these methods. Furthermore, OrthoFinder is the most accurate ortholog inference method on the Quest for Orthologs benchmark test. Finally, OrthoFinder’s comprehensive phylogenetic analysis is achieved with equivalent speed and scalability to the fastest, score-based heuristic methods. OrthoFinder is available at https://github.com/davidemms/OrthoFinder.

1,366 citations


Posted ContentDOI
03 Oct 2019-bioRxiv
TL;DR: Analysis of the v8 data provides insights into the tissue-specificity of genetic effects, and shows that cell type composition is a key factor in understanding gene regulatory mechanisms in human tissues.
Abstract: The Genotype-Tissue Expression (GTEx) project was established to characterize genetic effects on the transcriptome across human tissues, and to link these regulatory mechanisms to trait and disease associations. Here, we present analyses of the v8 data, based on 17,382 RNA-sequencing samples from 54 tissues of 948 post-mortem donors. We comprehensively characterize genetic associations for gene expression and splicing in cis and trans, showing that regulatory associations are found for almost all genes, and describe the underlying molecular mechanisms and their contribution to allelic heterogeneity and pleiotropy of complex traits. Leveraging the large diversity of tissues, we provide insights into the tissue-specificity of genetic effects, and show that cell type composition is a key factor in understanding gene regulatory mechanisms in human tissues.

1,243 citations


Posted ContentDOI
14 Mar 2019-bioRxiv
TL;DR: It is proposed that the Pearson residuals from ’regularized negative binomial regression’, where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity.
Abstract: Single-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from ’regularized negative binomial regression’, where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation, and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package sctransform, with a direct interface to our single-cell toolkit Seurat.

1,175 citations
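
The Pearson residual described in the abstract above can be illustrated with a small sketch. It assumes that the per-gene parameters have already been fitted and regularized; the names `beta0`, `beta1`, and `theta`, the log10 depth covariate, and the example values are assumptions made for illustration, and this is not the sctransform implementation.

```python
import math


def nb_pearson_residual(count, depth, beta0, beta1, theta):
    """Pearson residual of one UMI count under a negative binomial model.

    Assumed (illustrative) model: log(mu) = beta0 + beta1 * log10(depth),
    with negative binomial variance mu + mu**2 / theta.
    """
    mu = math.exp(beta0 + beta1 * math.log10(depth))  # expected count
    variance = mu + mu ** 2 / theta                   # NB variance
    return (count - mu) / math.sqrt(variance)


# Hypothetical parameter values, chosen so that the expected count is 1.
residual = nb_pearson_residual(count=3, depth=10_000, beta0=-8.0, beta1=2.0, theta=10.0)
print(round(residual, 3))
```

A count equal to its expected value gives a residual of zero, so a cell's sequencing depth no longer drives the transformed value; only deviations from the depth-adjusted expectation remain.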


Posted ContentDOI
Konrad J. Karczewski, Laurent C. Francioli, Grace Tiao, Beryl B. Cummings, Jessica Alföldi, Qingbo Wang, Ryan L. Collins, Kristen M. Laricchia, Andrea Ganna, Daniel P. Birnbaum, Laura D. Gauthier, Harrison Brand, Matthew Solomonson, Nicholas A. Watts, Daniel R. Rhodes, Moriel Singer-Berk, Eleanor G. Seaby, Jack A. Kosmicki, Raymond K. Walters, Katherine Tashman, Yossi Farjoun, Eric Banks, Timothy Poterba, Arcturus Wang, Cotton Seed, Nicola Whiffin, Jessica X. Chong, Kaitlin E. Samocha, Emma Pierce-Hoffman, Zachary Zappala, Anne H. O’Donnell-Luria, Eric Vallabh Minikel, Ben Weisburd, Monkol Lek, James S. Ware, Christopher Vittal, Irina M. Armean, Louis Bergelson, Kristian Cibulskis, Kristen M. Connolly, Miguel Covarrubias, Stacey Donnelly, Steven Ferriera, Stacey Gabriel, Jeff Gentry, Namrata Gupta, Thibault Jeandet, Diane Kaplan, Christopher Llanwarne, Ruchi Munshi, Sam Novod, Nikelle Petrillo, David Roazen, Valentin Ruano-Rubio, Andrea Saltzman, Molly Schleicher, Jose Soto, Kathleen Tibbetts, Charlotte Tolonen, Gordon Wade, Michael E. Talkowski, Benjamin M. Neale, Mark J. Daly, Daniel G. MacArthur
30 Jan 2019-bioRxiv
TL;DR: Using an improved human mutation rate model, this work classifies human protein-coding genes along a spectrum representing intolerance to inactivation, validates this classification using data from model organisms and engineered human cells, and shows that it can be used to improve gene discovery power for both common and rare diseases.
Abstract: Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes critical for an organism’s function will be depleted for such variants in natural populations, while non-essential genes will tolerate their accumulation. However, predicted loss-of-function (pLoF) variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes. Here, we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence pLoF variants in this cohort after filtering for sequencing and annotation artifacts. Using an improved model of human mutation, we classify human protein-coding genes along a spectrum representing intolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve gene discovery power for both common and rare diseases.

1,128 citations


Posted ContentDOI
29 Apr 2019-bioRxiv
TL;DR: This work uses unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.
Abstract: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In biology, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Learning the natural distribution of evolutionary protein sequence variation is a logical step toward predictive and generative modeling for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million sequences spanning evolutionary diversity. The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge. The learned representation space organizes sequences at multiple levels of biological granularity from the biochemical to proteomic levels. Learning recovers information about protein structure: secondary structure and residue-residue contacts can be extracted by linear projections from learned representations. With small amounts of labeled data, the ability to identify tertiary contacts is further improved. Learning on full sequence diversity rather than individual protein families increases recoverable information about secondary structure. We show the networks generalize by adapting them to variant activity prediction from sequences only, with results that are comparable to a state-of-the-art variant predictor that uses evolutionary and structurally derived features.

748 citations


Posted ContentDOI
15 Feb 2019-bioRxiv
TL;DR: A high-level overview of the features of the MRtrix3 framework and general-purpose image processing applications provided with the software is provided.
Abstract: MRtrix3 is an open-source, cross-platform software package for medical image processing, analysis and visualization, with a particular emphasis on the investigation of the brain using diffusion MRI. It is implemented using a fast, modular and flexible general-purpose code framework for image data access and manipulation, enabling efficient development of new applications, whilst retaining high computational performance and a consistent command-line interface between applications. In this article, we provide a high-level overview of the features of the MRtrix3 framework and general-purpose image processing applications provided with the software.

728 citations


Posted ContentDOI
29 Oct 2019-bioRxiv
TL;DR: scVelo enables disentangling heterogeneous subpopulation kinetics with unprecedented resolution in hippocampal dentate gyrus neurogenesis and pancreatic endocrinogenesis, and is anticipated to greatly facilitate the study of lineage decisions, gene regulation, and pathway activity identification.
Abstract: The introduction of RNA velocity in single cells has opened up new ways of studying cellular differentiation. The originally proposed framework obtains velocities as the deviation of the observed ratio of spliced and unspliced mRNA from an inferred steady state. Errors in velocity estimates arise if the central assumptions of a common splicing rate and the observation of the full splicing dynamics with steady-state mRNA levels are violated. With scVelo (https://scvelo.org), we address these restrictions by solving the full transcriptional dynamics of splicing kinetics using a likelihood-based dynamical model. This generalizes RNA velocity to a wide variety of systems comprising transient cell states, which are common in development and in response to perturbations. We infer gene-specific rates of transcription, splicing and degradation, and recover the latent time of the underlying cellular processes. This latent time represents the cell’s internal clock and is based only on its transcriptional dynamics. Moreover, scVelo allows us to identify regimes of regulatory changes such as stages of cell fate commitment and, therein, systematically detects putative driver genes. We demonstrate that scVelo enables disentangling heterogeneous subpopulation kinetics with unprecedented resolution in hippocampal dentate gyrus neurogenesis and pancreatic endocrinogenesis. We anticipate that scVelo will greatly facilitate the study of lineage decisions, gene regulation, and pathway activity identification.

712 citations
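
The steady-state idea that the abstract above attributes to the originally proposed framework can be sketched in a few lines. This is a simplified illustration and not scVelo's likelihood-based dynamical model: it assumes one gene, fits the steady-state ratio `gamma` by least squares through the origin over all cells, and uses made-up counts.

```python
def steady_state_velocity(unspliced, spliced):
    """Velocity of one gene per cell as the deviation from a fitted steady state.

    Simplifying assumption: the steady-state ratio gamma is the least-squares
    slope through the origin of unspliced versus spliced counts.
    """
    denom = sum(s * s for s in spliced)
    if denom == 0:
        raise ValueError("spliced counts are all zero; cannot fit a ratio")
    gamma = sum(u * s for u, s in zip(unspliced, spliced)) / denom
    # Positive values: more unspliced mRNA than the steady state predicts.
    velocities = [u - gamma * s for u, s in zip(unspliced, spliced)]
    return gamma, velocities


# Hypothetical counts for three cells that lie exactly on the steady-state line.
gamma, velocities = steady_state_velocity([0.5, 1.0, 2.0], [1.0, 2.0, 4.0])
print(gamma, velocities)
```

Cells on the fitted line get zero velocity. The abstract's point is that this estimate degrades when cells have not reached steady state or when splicing rates differ between genes, which is what the dynamical model is meant to address.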


Posted ContentDOI
Daniel Taliun, Daniel N. Harris, Michael D. Kessler, Jedidiah Carlson, +191 more (61 institutions)
06 Mar 2019-bioRxiv
TL;DR: This work describes TOPMed goals and design, along with resources and early insights from the sequence data; the nearly complete catalog of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and non-coding sequence variants to phenotypic variation.
Abstract: The Trans-Omics for Precision Medicine (TOPMed) program seeks to elucidate the genetic architecture and disease biology of heart, lung, blood, and sleep disorders, with the ultimate goal of improving diagnosis, treatment, and prevention. The initial phases of the program focus on whole genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here, we describe TOPMed goals and design as well as resources and early insights from the sequence data. The resources include a variant browser, a genotype imputation panel, and sharing of genomic and phenotypic data via dbGaP. In 53,581 TOPMed samples, >400 million single-nucleotide and insertion/deletion variants were detected by alignment with the reference genome. Additional novel variants are detectable through assembly of unmapped reads and customized analysis in highly variable loci. Among the >400 million variants detected, 97% have frequency

662 citations


Posted ContentDOI
22 Apr 2019-bioRxiv
TL;DR: ModelTest-NG is a re-implementation from scratch of jModelTest and ProtTest, two popular tools for selecting the best-fit nucleotide and amino acid substitution models, respectively, and introduces several new features, such as ascertainment bias correction, mixture and FreeRate models, or the automatic processing of partitioned datasets.
Abstract: ModelTest-NG is a re-implementation from scratch of jModelTest and ProtTest, two popular tools for selecting the best-fit nucleotide and amino acid substitution models, respectively. ModelTest-NG is one to two orders of magnitude faster than jModelTest and ProtTest but equally accurate, and introduces several new features, such as ascertainment bias correction, mixture and FreeRate models, or the automatic processing of partitioned datasets. ModelTest-NG is available under a GNU GPL3 license at https://github.com/ddarriba/modeltest.

465 citations


Posted ContentDOI
08 Apr 2019-bioRxiv
TL;DR: KofamKOALA is a web server to assign KEGG Orthologs (KOs) to protein sequences by homology search against a database of profile hidden Markov models (KOfam) with pre-computed adaptive score thresholds.
Abstract: Summary: KofamKOALA is a web server to assign KEGG Orthologs (KOs) to protein sequences by homology search against a database of profile hidden Markov models (KOfam) with pre-computed adaptive score thresholds. KofamKOALA is faster than existing KO assignment tools, with accuracy comparable to the best performing tools. Function annotation by KofamKOALA helps link genes to KEGG resources such as the KEGG pathway maps and facilitates molecular network reconstruction. Availability: KofamKOALA, KofamScan, and KOfam are freely available from https://www.genome.jp/tools/kofamkoala/ Contact: ogata@kuicr.kyoto-u.ac.jp
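
The role of per-family score thresholds in the abstract above can be sketched as a simple filter. This is only an illustration of the thresholding idea, not the KofamKOALA or KofamScan code; the family identifiers, scores, and thresholds below are hypothetical placeholders.

```python
def assign_families(hit_scores, thresholds):
    """Return the families whose hit score meets that family's own threshold.

    hit_scores: mapping of family id -> search score for one protein.
    thresholds: mapping of family id -> pre-computed score cutoff.
    Families without a known threshold are skipped in this sketch.
    """
    return sorted(
        family
        for family, score in hit_scores.items()
        if family in thresholds and score >= thresholds[family]
    )


# Hypothetical scores for one query protein against three profile models.
hits = {"KO_A": 310.5, "KO_B": 95.0, "KO_C": 42.0}
cutoffs = {"KO_A": 250.0, "KO_B": 120.0, "KO_C": 40.0}
assigned = assign_families(hits, cutoffs)
print(assigned)
```

Because each family has its own cutoff, a score of 95 can be rejected for one family while a score of 42 is accepted for another, which a single global threshold could not express.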

Posted ContentDOI
26 Sep 2019-bioRxiv
TL;DR: UCSC Xena is a web-based visual exploration tool for both public and private omics data, supported through the Xena Browser and multiple turn-key Xena Hubs, which lets researchers view their own data securely using private Xena Hubs while simultaneously visualizing large public cancer genomics datasets, including TCGA and the GDC.
Abstract: UCSC Xena is a visual exploration resource for both public and private omics data, supported through the web-based Xena Browser and multiple turn-key Xena Hubs. This unique architecture allows researchers to view their own data securely, using private Xena Hubs, simultaneously visualizing large public cancer genomics datasets, including TCGA and the GDC. Data integration occurs only within the Xena Browser, keeping private data private. Xena supports virtually any functional genomics data, including SNVs, INDELs, large structural variants, CNV, expression, DNA methylation, ATAC-seq signals, and phenotypic annotations. Browser features include the Visual Spreadsheet, survival analyses, powerful filtering and subgrouping, statistical analyses, genomic signatures, and bookmarks. Xena differentiates itself from other genomics tools, including its predecessor, the UCSC Cancer Genomics Browser, by its ability to easily and securely view public and private data, its high performance, its broad data type support, and many unique features.

Posted ContentDOI
06 Mar 2019-bioRxiv
TL;DR: Cleavage Under Targets and Tagmentation (CUT&Tag), an enzyme-tethering strategy that provides efficient high-resolution sequencing libraries for profiling diverse chromatin components, is described and demonstrated by profiling histone modifications, RNA Polymerase II and transcription factors on low cell numbers and single cells.
Abstract: Many chromatin features play critical roles in regulating gene expression. A complete understanding of gene regulation will require the mapping of specific chromatin features in small samples of cells at high resolution. Here we describe Cleavage Under Targets and Tagmentation (CUT&Tag), an enzyme-tethering strategy that provides efficient high-resolution sequencing libraries for profiling diverse chromatin components. In CUT&Tag, a chromatin protein is bound in situ by a specific antibody, which then tethers a protein A-Tn5 transposase fusion protein. Activation of the transposase efficiently generates fragment libraries with high resolution and exceptionally low background. All steps from live cells to sequencing-ready libraries can be performed in a single tube on the benchtop or a microwell in a high-throughput pipeline, and the entire procedure can be performed in one day. We demonstrate the utility of CUT&Tag by profiling histone modifications, RNA Polymerase II and transcription factors on low cell numbers and single cells.

Posted ContentDOI
15 Jun 2019-bioRxiv
TL;DR: PICRUSt2 extends the capabilities of the original PICRUSt method to predict approximate functional potential of a community based on marker gene sequencing profiles, with improvements including an expanded database of gene families and reference genomes, a new approach compatible with any OTU-picking or denoising algorithm, novel phenotype predictions, and novel fungal reference databases that enable predictions from 18S rRNA gene and internal transcribed spacer amplicon data.
Abstract: One major limitation of microbial community marker gene sequencing is that it does not provide direct information on the functional composition of sampled communities. Here, we present PICRUSt2, which expands the capabilities of the original PICRUSt method to predict approximate functional potential of a community based on marker gene sequencing profiles. This updated method and implementation includes several improvements over the previous algorithm: an expanded database of gene families and reference genomes, a new approach now compatible with any OTU-picking or denoising algorithm, novel phenotype predictions, and novel fungal reference databases that enable predictions from 18S rRNA gene and internal transcribed spacer amplicon data. Upon evaluation, PICRUSt2 was more accurate than PICRUSt1 and other current approaches and also more flexible to allow the addition of custom reference databases. Last, we demonstrate the utility of PICRUSt2 by identifying potential disease-associated microbial functional signatures based on 16S rRNA gene sequencing of ileal biopsies collected from a cohort of human subjects with inflammatory bowel disease. PICRUSt2 is freely available at: https://github.com/picrust/picrust2.

Posted ContentDOI
08 Jul 2019-bioRxiv
TL;DR: StringTie2 is a reference-guided transcriptome assembler that works with both short and long reads and includes new computational methods to handle the high error rate of long-read sequencing technology, which previous assemblers could not tolerate.
Abstract: RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new computational methods to handle the high error rate of long-read sequencing technology, which previous assemblers could not tolerate. It also offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of assemblies. On 33 short-read datasets from humans and two plant species, StringTie2 is 47.3% more precise and 3.9% more sensitive than Scallop. On multiple long read datasets, StringTie2 on average correctly assembles 8.3 and 2.6 times as many transcripts as FLAIR and Traphlor, respectively, with substantially higher precision. StringTie2 is also faster and has a smaller memory footprint than all comparable tools.

Posted ContentDOI
26 Jan 2019-bioRxiv
TL;DR: A meta-analysis of genome-wide studies of anorexia nervosa, attention-deficit/hyperactivity disorder, autism spectrum disorder, bipolar disorder, major depression, obsessive-compulsive disorder, schizophrenia, and Tourette syndrome revealed a meaningful structure within the eight disorders identifying three groups of inter-related disorders.
Abstract: Genetic influences on psychiatric disorders transcend diagnostic boundaries, suggesting substantial pleiotropy of contributing loci. However, the nature and mechanisms of these pleiotropic effects remain unclear. We performed a meta-analysis of 232,964 cases and 494,162 controls from genome-wide studies of anorexia nervosa, attention-deficit/hyperactivity disorder, autism spectrum disorder, bipolar disorder, major depression, obsessive-compulsive disorder, schizophrenia, and Tourette syndrome. Genetic correlation analyses revealed a meaningful structure within the eight disorders identifying three groups of inter-related disorders. We detected 109 loci associated with at least two psychiatric disorders, including 23 loci with pleiotropic effects on four or more disorders and 11 loci with antagonistic effects on multiple disorders. The pleiotropic loci are located within genes that show heightened expression in the brain throughout the lifespan, beginning in the second trimester prenatally, and play prominent roles in a suite of neurodevelopmental processes. These findings have important implications for psychiatric nosology, drug development, and risk prediction.

Posted ContentDOI
26 Mar 2019-bioRxiv
TL;DR: This work applies deep learning to unlabelled amino acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily, and biophysically grounded.
Abstract: Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabelled amino acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily, and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach reaches near state-of-the-art or superior performance predicting stability of natural and de novo designed proteins as well as quantitative function of molecularly diverse mutants. UniRep further enables two orders of magnitude cost savings in a protein engineering task. We conclude UniRep is a versatile protein summary that can be applied across protein engineering informatics.

Posted ContentDOI
20 Jun 2019-bioRxiv
TL;DR: TAPE is a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology, with training, validation, and test splits curated so that each task tests biologically relevant generalization that transfers to real-life scenarios.
Abstract: Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.

Posted ContentDOI
25 Feb 2019-bioRxiv
TL;DR: A single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment is developed and the added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.
Abstract: Background: HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous sequences. Results: We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. This accelerated HHsearch by a factor 4 and HHblits by a factor 2 over the previous version 2.0.16. HHblits3 is ~10x faster than PSI-BLAST and ~20x faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over servers in a cluster using OpenMP and message passing interface (MPI). The free, open-source, GNU GPL(v3)-licensed software is available at https://github.com/soedinglab/hh-suite. Conclusion: The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.

Posted ContentDOI
30 Apr 2019-bioRxiv
TL;DR: Overall, tRNA detection sensitivity and specificity is improved for all isotypes, particularly those utilizing specialized models for selenocysteine and the three subtypes of tRNA genes encoding a CAU anticodon.
Abstract: tRNAscan-SE has been widely used for whole-genome transfer RNA gene prediction for nearly two decades. With the increased availability of new genomes, a vastly larger training set has enabled creation of nearly one hundred specialized isotype-specific models, greatly improving tRNAscan-SE’s ability to identify and classify both typical and atypical tRNAs. We employ a new multi-model annotation strategy where predicted tRNAs are scored against a full set of isotype-specific covariance models. A post-filtering feature also better identifies tRNA-derived SINEs that are abundant in many eukaryotic genomes, and provides a “high confidence” tRNA gene set which improves upon prior pseudogene prediction. These new enhancements of tRNAscan-SE will provide researchers more accurate detection and more comprehensive annotation for tRNA genes.

Posted ContentDOI
02 Dec 2019-bioRxiv
TL;DR: Mutect2 is a somatic variant caller that uses local assembly and realignment to detect SNVs and indels, and is based on several probabilistic models for genotyping and filtering that work well with and without a matched normal sample and for all sequencing depths.
Abstract: Mutect2 is a somatic variant caller that uses local assembly and realignment to detect SNVs and indels. Assembly implies whole haplotypes and read pairs, rather than single bases, as the atomic units of biological variation and sequencing evidence, improving variant calling. Beyond local assembly and alignment, Mutect2 is based on several probabilistic models for genotyping and filtering that work well with and without a matched normal sample and for all sequencing depths.

Posted ContentDOI
21 Aug 2019-bioRxiv
TL;DR: The analyses identify several candidate biomarkers of cellular senescence that overlap with aging markers in human plasma, including GDF15, STC1 and SERPINs, which significantly correlated with age in plasma from a human cohort, the Baltimore Longitudinal Study of Aging.
Abstract: The senescence-associated secretory phenotype (SASP) has recently emerged as both a driver of, and promising therapeutic target for, multiple age-related conditions, ranging from neurodegeneration to cancer. The complexity of the SASP, typically monitored by a few dozen secreted proteins, has been greatly underappreciated, and a small set of factors cannot explain the diverse phenotypes it produces in vivo. Here, we present ‘SASP Atlas’, a comprehensive proteomic database of soluble and exosome SASP factors originating from multiple senescence inducers and cell types. Each profile consists of hundreds of largely distinct proteins, but also includes a subset of proteins elevated in all SASPs. Based on our analyses, we propose several candidate biomarkers of cellular senescence, including GDF15, STC1 and SERPINs. This resource will facilitate identification of proteins that drive specific senescence-associated phenotypes and catalog potential senescence biomarkers to assess the burden, originating stimulus and tissue of senescent cells in vivo.

Posted ContentDOI
18 Jul 2019-bioRxiv
TL;DR: Neuralink’s approach to BMI has unprecedented packaging density and scalability in a clinically relevant package and has achieved a spiking yield of up to 85.5% in chronically implanted electrodes.
Abstract: Brain-machine interfaces (BMIs) hold promise for the restoration of sensory and motor function and the treatment of neurological disorders, but clinical BMIs have not yet been widely adopted, in part because modest channel counts have limited their potential. In this white paper, we describe Neuralink’s first steps toward a scalable high-bandwidth BMI system. We have built arrays of small and flexible electrode “threads”, with as many as 3,072 electrodes per array distributed across 96 threads. We have also built a neurosurgical robot capable of inserting six threads (192 electrodes) per minute. Each thread can be individually inserted into the brain with micron precision for avoidance of surface vasculature and targeting specific brain regions. The electrode array is packaged into a small implantable device that contains custom chips for low-power on-board amplification and digitization: the package for 3,072 channels occupies less than (23 × 18.5 × 2) mm³. A single USB-C cable provides full-bandwidth data streaming from the device, recording from all channels simultaneously. This system has achieved a spiking yield of up to 85.5% in chronically implanted electrodes. Neuralink’s approach to BMI has unprecedented packaging density and scalability in a clinically relevant package.

Posted ContentDOI
22 Feb 2019-bioRxiv
TL;DR: Cooler is a file format based on a sparse data model that supports genomically-labeled matrices at any resolution and has the flexibility to accommodate various descriptions of the data axes, resolutions, data density patterns, and metadata.
Abstract: Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate development of scalable algorithms for data analysis. We developed a file format called cooler, based on a sparse data model, that can support genomically-labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns, and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium. Cooler is cross-platform, BSD-licensed, and can be installed from the Python Package Index or the bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler.
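To make the idea of a sparse, genomically-labeled matrix concrete, the following is a minimal pure-Python sketch. It is not cooler's HDF5 layout or its Python API. It only illustrates one way to pair a table of genomic bins with a table of non-zero "pixels" (bin pair plus count). The function names, the toy chromosome sizes, and the choice to store only the upper triangle are assumptions made for illustration.

```python
from collections import Counter


def make_bins(chrom_sizes, binsize):
    """Cut each chromosome into fixed-width bins.

    Returns the bin table as (chrom, start, end) tuples and the index of the
    first bin of each chromosome.
    """
    bins, offsets = [], {}
    for chrom, size in chrom_sizes.items():
        offsets[chrom] = len(bins)
        for start in range(0, size, binsize):
            bins.append((chrom, start, min(start + binsize, size)))
    return bins, offsets


def build_pixels(contacts, offsets, binsize):
    """Aggregate pairwise contacts into sparse (bin1, bin2, count) records.

    Only bin pairs that were actually observed are stored, and each pair is
    kept once with bin1 <= bin2 (upper triangle of a symmetric matrix).
    """
    counts = Counter()
    for (chrom1, pos1), (chrom2, pos2) in contacts:
        i = offsets[chrom1] + pos1 // binsize
        j = offsets[chrom2] + pos2 // binsize
        counts[(min(i, j), max(i, j))] += 1
    return sorted((i, j, n) for (i, j), n in counts.items())


# Hypothetical toy data: two short "chromosomes" and three contacts.
chrom_sizes = {"chrA": 250, "chrB": 120}
contacts = [
    (("chrA", 10), ("chrA", 130)),
    (("chrA", 140), ("chrA", 20)),
    (("chrB", 5), ("chrA", 210)),
]

bins, offsets = make_bins(chrom_sizes, binsize=100)
pixels = build_pixels(contacts, offsets, binsize=100)
```

A dense representation of this toy map would need 5 × 5 = 25 cells, whereas the sparse table holds only the two observed bin pairs. That gap widens quickly as the bin size shrinks, which is the storage problem the abstract describes.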

Posted ContentDOI
31 Jul 2019-bioRxiv
TL;DR: The current and projected Achilles processing pipeline is presented, including recent improvements and the analyses that led to their adoption, spanning data releases from early 2018 to the first quarter of 2020.
Abstract: One of the main goals of the Cancer Dependency Map project is to systematically identify cancer vulnerabilities across cancer types to accelerate therapeutic discovery. Project Achilles serves this goal through the in vitro study of genetic dependencies in cancer cell lines using CRISPR/Cas9 (and, previously, RNAi) loss-of-function screens. The project is committed to the public release of its experimental results quarterly on the DepMap Portal (https://depmap.org), on a pre-publication basis. As the experiment has evolved, data processing procedures have changed. Here we present the current and projected Achilles processing pipeline, including recent improvements and the analyses that led us to adopt them, spanning data releases from early 2018 to the first quarter of 2020. Notable changes include quality control metrics, calculation of probabilities of dependency, and correction for screen quality and other biases. Developing and improving methods for extracting biologically-meaningful scores from Achilles experiments is an ongoing process, and we will continue to evaluate and revise data processing procedures to produce the best results.

Posted ContentDOI
06 Aug 2019-bioRxiv
TL;DR: A striking buildup of lipid droplets in microglia with aging in mouse and human brains is reported, and it is proposed that these lipid droplet-accumulating microglia (LAM) contribute to age-related and genetic forms of neurodegeneration.
Abstract: Microglia become progressively activated and seemingly dysfunctional with age, and genetic studies have linked these cells to the pathogenesis of a growing number of neurodegenerative diseases. Here we report a striking buildup of lipid droplets in microglia with aging in mouse and human brains. These cells, which we call lipid droplet-accumulating microglia (LAM), are defective in phagocytosis, produce high levels of reactive oxygen species, and secrete pro-inflammatory cytokines. RNA sequencing analysis of LAM revealed a transcriptional profile driven by innate inflammation distinct from previously reported microglial states. An unbiased CRISPR-Cas9 screen identified genetic modifiers of lipid droplet formation; surprisingly, variants of several of these genes, including progranulin, are causes of autosomal dominant forms of human neurodegenerative diseases. We thus propose that LAM contribute to age-related and genetic forms of neurodegeneration.

Posted ContentDOI
Matteo Dainese1, Emily A. Martin1, Marcelo A. Aizen2, Matthias Albrecht, Ignasi Bartomeus3, Riccardo Bommarco4, Luísa G. Carvalheiro5, Luísa G. Carvalheiro6, Rebecca Chaplin-Kramer7, Vesna Gagic8, Lucas Alejandro Garibaldi9, Jaboury Ghazoul10, Heather Grab11, Mattias Jonsson4, Daniel S. Karp12, Christina M. Kennedy13, David Kleijn14, Claire Kremen15, Douglas A. Landis16, Deborah K. Letourneau17, Lorenzo Marini18, Katja Poveda11, Romina Rader19, Henrik G. Smith20, Teja Tscharntke21, Georg K.S. Andersson20, Isabelle Badenhausser22, Isabelle Badenhausser23, Svenja Baensch21, Antonio Diego M. Bezerra24, Felix J.J.A. Bianchi14, Virginie Boreux10, Vincent Bretagnolle22, Berta Caballero-López, Pablo Cavigliasso25, Aleksandar Ćetković26, Natacha P. Chacoff27, Alice Classen1, Sarah Cusser28, Felipe D. da Silva e Silva29, G. Arjen de Groot14, Jan H. Dudenhöffer30, Johan Ekroos20, Thijs P.M. Fijen14, Pierre Franck23, Breno Magalhães Freitas24, Michael P.D. Garratt31, Claudio Gratton32, Juliana Hipólito9, Andrea Holzschuh1, Lauren Hunt33, Aaron L. Iverson11, Shalene Jha34, Tamar Keasar35, Tania N. Kim36, Miriam Kishinevsky35, Björn K. Klatt21, Björn K. Klatt20, Alexandra-Maria Klein37, Kristin M. Krewenka38, Smitha Krishnan10, Ashley E. Larsen39, Claire Lavigne23, Heidi Liere40, Bea Maas41, Rachel E. Mallinger42, Eliana Martinez Pachon, Alejandra Martínez-Salinas43, Timothy D. Meehan44, Matthew G. E. Mitchell15, Gonzalo Alberto Roman Molina45, Maike Nesper10, Lovisa Nilsson20, Megan E. O'Rourke46, Marcell K. Peters1, Milan Plećaš26, Simon G. Potts31, Davi de L. Ramos29, Jay A. Rosenheim17, Maj Rundlöf20, Adrien Rusch47, Agustín Sáez2, Jeroen Scheper14, Matthias Schleuning, Julia Schmack48, Amber R. Sciligo17, Colleen L. Seymour, Dara A. Stanley49, Rebecca Stewart20, Jane C. Stout50, Louis Sutter, Mayura B. Takada51, Hisatomo Taki, Giovanni Tamburini4, Matthias Tschumi, Blandina Felipe Viana52, Catrin Westphal21, Bryony K. Willcox19, Stephen D. Wratten53, Akira Yoshioka54, Carlos Zaragoza-Trello3, Wei Zhang55, Yi Zou56, Ingolf Steffan-Dewenter1
University of Würzburg1, National University of Comahue2, Spanish National Research Council3, Swedish University of Agricultural Sciences4, University of Lisbon5, Universidade Federal de Goiás6, Stanford University7, Commonwealth Scientific and Industrial Research Organisation8, National University of Río Negro9, ETH Zurich10, Cornell University11, University of California, Davis12, The Nature Conservancy13, Wageningen University and Research Centre14, University of British Columbia15, Great Lakes Bioenergy Research Center16, University of California, Berkeley17, University of Padua18, University of New England (United States)19, Lund University20, University of Göttingen21, University of La Rochelle22, Institut national de la recherche agronomique23, Federal University of Ceará24, Concordia University Wisconsin25, University of Belgrade26, National University of Tucumán27, Michigan State University28, University of Brasília29, University of Greenwich30, University of Reading31, University of Wisconsin-Madison32, Boise State University33, University of Texas at Austin34, University of Haifa35, Kansas State University36, University of Freiburg37, University of Hamburg38, University of California, Santa Barbara39, Seattle University40, University of Vienna41, University of Florida42, Centro Agronómico Tropical de Investigación y Enseñanza43, National Audubon Society44, University of Buenos Aires45, Virginia Tech46, University of Bordeaux47, University of Auckland48, University College Dublin49, Trinity College, Dublin50, University of Tokyo51, Federal University of Bahia52, Lincoln University (Pennsylvania)53, National Institute for Environmental Studies54, International Food Policy Research Institute55, Xi'an Jiaotong-Liverpool University56
20 Feb 2019-bioRxiv
TL;DR: A global database from 89 crop systems is used to partition the relative importance of abundance and species richness for pollination, biological pest control and final yields in the context of on-going land-use change.
Abstract: Human land use threatens global biodiversity and compromises multiple ecosystem functions critical to food production. Whether crop yield-related ecosystem services can be maintained by few abundant species or rely on high richness remains unclear. Using a global database from 89 crop systems, we partition the relative importance of abundance and species richness for pollination, biological pest control and final yields in the context of on-going land-use change. Pollinator and enemy richness directly supported ecosystem services independent of abundance. Up to 50% of the negative effects of landscape simplification on ecosystem services were due to richness losses of service-providing organisms, with negative consequences for crop yields. Maintaining the biodiversity of ecosystem service providers is therefore vital to sustain the flow of key agroecosystem benefits to society.

Posted ContentDOI
26 Nov 2019-bioRxiv
TL;DR: This new program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery, and incorporates a module for structural discovery of complete LTR retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity.
Abstract: The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a new pipeline that greatly facilitates this process. This new program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete LTR retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately three times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. The program had an extremely low false positive rate when applied to simulated genomes devoid of TEs. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, https://github.com/Dfam-consortium/TETools).

Posted ContentDOI
18 Apr 2019-bioRxiv
TL;DR: It is anticipated that droplet-based single-cell chromatin accessibility will provide a broadly applicable means of identifying regulatory factors and elements that underlie cell type and function.
Abstract: Understanding complex tissues requires single-cell deconstruction of gene regulation with precision and scale. Here we present a massively parallel droplet-based platform for mapping transposase-accessible chromatin in tens of thousands of single cells per sample (scATAC-seq). We obtain and analyze chromatin profiles of over 200,000 single cells in two primary human systems. In blood, scATAC-seq allows marker-free identification of cell type-specific cis- and trans-regulatory elements, mapping of disease-associated enhancer activity, and reconstruction of trajectories of differentiation from progenitors to diverse and rare immune cell types. In basal cell carcinoma, scATAC-seq reveals regulatory landscapes of malignant, stromal, and immune cell types in the tumor microenvironment. Moreover, scATAC-seq of serial tumor biopsies before and after PD-1 blockade allows identification of chromatin regulators and differentiation trajectories of therapy-responsive intratumoral T cell subsets, revealing a shared regulatory program driving CD8+ T cell exhaustion and CD4+ T follicular helper cell development. We anticipate that droplet-based single-cell chromatin accessibility will provide a broadly applicable means of identifying regulatory factors and elements that underlie cell type and function.

Posted ContentDOI
05 Jun 2019-bioRxiv
TL;DR: The genetic basis of 38 blood and urine laboratory tests is evaluated, showing which tissues contribute to biomarker function, the causal influences of the biomarkers, and how these results can be used to predict disease.
Abstract: Clinical laboratory tests are a critical component of the continuum of care and provide a means for rapid diagnosis and monitoring of chronic disease. In this study, we systematically evaluated the genetic basis of 38 blood and urine laboratory tests measured in 358,072 participants in the UK Biobank and identified 1,857 independent loci associated with at least one laboratory test, including 488 large-effect protein truncating, missense, and copy-number variants. We tested these loci for enrichment in specific single cell types in kidney, liver, and pancreas relevant to disease aetiology. We then causally linked the biomarkers to medically relevant phenotypes through genetic correlation and Mendelian randomization. Finally, we developed polygenic risk scores (PRS) for each biomarker and built multi-PRS models using all 38 PRSs simultaneously. We found substantially improved prediction of incidence in FinnGen (n=135,500) with the multi-PRS relative to single-disease PRSs for renal failure, myocardial infarction, liver fat percentage, and alcoholic cirrhosis. Together, our results show the genetic basis of these biomarkers, which tissues contribute to the biomarker function, the causal influences of the biomarkers, and how we can use this to predict disease.
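As a rough illustration of the polygenic risk score idea mentioned above, the sketch below computes a score as a weighted sum of allele dosages and then combines several per-biomarker scores linearly. The abstract does not specify how the multi-PRS models were fitted, so the linear combination, the function names, and all variant IDs, weights and coefficients here are hypothetical.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1 or 2 copies) for one individual.

    Variants without a weight for this biomarker are ignored.
    """
    return sum(weights[v] * d for v, d in dosages.items() if v in weights)


def multi_prs(scores, coefficients, intercept=0.0):
    """Assumed linear combination of several per-biomarker scores."""
    return intercept + sum(coefficients[name] * s for name, s in scores.items())


# Hypothetical per-variant effect sizes for two made-up biomarkers.
weights = {
    "biomarker_a": {"var1": 0.5, "var2": -0.25},
    "biomarker_b": {"var1": 0.125, "var3": 1.0},
}
# Hypothetical genotype of one individual, as effect-allele counts.
dosages = {"var1": 2, "var2": 1, "var3": 0}

scores = {name: polygenic_score(dosages, w) for name, w in weights.items()}
combined = multi_prs(scores, {"biomarker_a": 2.0, "biomarker_b": -4.0})
```

In a real analysis the per-variant weights would come from association statistics and the combining coefficients from a model fitted to disease outcomes. Both are plain placeholders in this sketch.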