Showing papers in "bioRxiv in 2019"
TL;DR: It is shown that U1 AMO also modulates cancer cells’ phenotype, dose-dependently increasing migration and invasion in vitro by up to 500%, whereas U1 over-expression has the opposite effect.
Abstract: Stimulated cells and cancer cells have widespread shortening of mRNA 3’-untranslated regions (3’UTRs) and switches to shorter mRNA isoforms due to usage of more proximal polyadenylation signals (PASs) in the last exon and in introns. U1 snRNA (U1), vertebrates’ most abundant non-coding (spliceosomal) small nuclear RNA, silences proximal PASs and its inhibition with antisense morpholino oligonucleotides (U1 AMO) triggers widespread mRNA shortening. Here we show that U1 AMO also modulates cancer cells’ phenotype, dose-dependently increasing migration and invasion in vitro by up to 500%, whereas U1 over-expression has the opposite effect. In addition to 3’UTR length, numerous transcriptome changes that could contribute to this phenotype are observed, including alternative splicing, and mRNA expression levels of proto-oncogenes and tumor suppressors. These findings reveal an unexpected link between U1 regulation and oncogenic and activated cell states, and suggest U1 as a potential target for their modulation.
TL;DR: This major advance of the OrthoFinder method extends OrthoFinder’s high accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomic statistics.
Abstract: Here, we present a major advance of the OrthoFinder method. This extends OrthoFinder’s high accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomic statistics. Each output is benchmarked on appropriate real or simulated datasets and, where comparable methods exist, OrthoFinder is equivalent to or outperforms these methods. Furthermore, OrthoFinder is the most accurate ortholog inference method on the Quest for Orthologs benchmark test. Finally, OrthoFinder’s comprehensive phylogenetic analysis is achieved with equivalent speed and scalability to the fastest, score-based heuristic methods. OrthoFinder is available at https://github.com/davidemms/OrthoFinder.
Broad Institute, University of Chicago, University of Geneva, University of Dundee, Columbia University, Princeton University, Max Planck Society, Johns Hopkins University, Stanford University, Vanderbilt University, University of Cambridge, Vanderbilt University Medical Center, Massachusetts Eye and Ear Infirmary, Harvard University, Scripps Health, Polytechnic University of Catalonia, University of Pennsylvania
TL;DR: Analysis of the v8 data provides insights into the tissue-specificity of genetic effects, and shows that cell type composition is a key factor in understanding gene regulatory mechanisms in human tissues.
Abstract: The Genotype-Tissue Expression (GTEx) project was established to characterize genetic effects on the transcriptome across human tissues, and to link these regulatory mechanisms to trait and disease associations. Here, we present analyses of the v8 data, based on 17,382 RNA-sequencing samples from 54 tissues of 948 post-mortem donors. We comprehensively characterize genetic associations for gene expression and splicing in cis and trans, showing that regulatory associations are found for almost all genes, and describe the underlying molecular mechanisms and their contribution to allelic heterogeneity and pleiotropy of complex traits. Leveraging the large diversity of tissues, we provide insights into the tissue-specificity of genetic effects, and show that cell type composition is a key factor in understanding gene regulatory mechanisms in human tissues.
TL;DR: It is proposed that the Pearson residuals from ‘regularized negative binomial regression’, where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity.
Abstract: Single-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from ‘regularized negative binomial regression’, where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation, and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package sctransform, with a direct interface to our single-cell toolkit Seurat.
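The residual at the heart of this approach can be sketched in a few lines. For a negative binomial model with mean mu and dispersion parameter theta, the Pearson residual is (x - mu) / sqrt(mu + mu^2 / theta). The sketch below is a simplified illustration and not the sctransform implementation: rather than fitting and regularizing a per-gene generalized linear model, it assumes the expected count is simply proportional to each cell's sequencing depth, and `theta` is a fixed, hypothetical value. The function names are also assumptions made for this example.

```python
import math


def pearson_residual(count, mu, theta):
    """Negative binomial Pearson residual: (x - mu) / sqrt(mu + mu^2 / theta)."""
    return (count - mu) / math.sqrt(mu + mu * mu / theta)


def residuals_for_gene(counts, depths, theta=100.0):
    """Residuals for one gene across cells.

    Simplifying assumption (not the regularized regression of the paper):
    the expected count in a cell is the gene's overall share of all
    molecules multiplied by that cell's sequencing depth.
    """
    gene_fraction = sum(counts) / sum(depths)
    return [
        pearson_residual(x, gene_fraction * depth, theta)
        for x, depth in zip(counts, depths)
    ]


if __name__ == "__main__":
    # Counts that scale exactly with depth carry no biological signal,
    # so their residuals are (numerically) zero.
    print(residuals_for_gene([1, 2, 4], [1000, 2000, 4000]))
```

Under these assumptions, a gene whose counts merely track sequencing depth yields residuals near zero, while cells with more or fewer molecules than depth alone predicts receive positive or negative residuals.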
TL;DR: Using an improved model of human mutation, human protein-coding genes are classified along a spectrum representing intolerance to inactivation; this classification is validated using data from model organisms and engineered human cells, and is shown to improve gene discovery power for both common and rare diseases.
Abstract: Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes critical for an organism’s function will be depleted for such variants in natural populations, while non-essential genes will tolerate their accumulation. However, predicted loss-of-function (pLoF) variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes. Here, we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence pLoF variants in this cohort after filtering for sequencing and annotation artifacts. Using an improved model of human mutation, we classify human protein-coding genes along a spectrum representing intolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve gene discovery power for both common and rare diseases.
TL;DR: This work uses unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.
Abstract: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In biology, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Learning the natural distribution of evolutionary protein sequence variation is a logical step toward predictive and generative modeling for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million sequences spanning evolutionary diversity. The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge. The learned representation space organizes sequences at multiple levels of biological granularity from the biochemical to proteomic levels. Learning recovers information about protein structure: secondary structure and residue-residue contacts can be extracted by linear projections from learned representations. With small amounts of labeled data, the ability to identify tertiary contacts is further improved. Learning on full sequence diversity rather than individual protein families increases recoverable information about secondary structure. We show the networks generalize by adapting them to variant activity prediction from sequences only, with results that are comparable to a state-of-the-art variant predictor that uses evolutionary and structurally derived features.
TL;DR: This article provides a high-level overview of the features of the MRtrix3 framework and of the general-purpose image processing applications provided with the software.
Abstract: MRtrix3 is an open-source, cross-platform software package for medical image processing, analysis and visualization, with a particular emphasis on the investigation of the brain using diffusion MRI. It is implemented using a fast, modular and flexible general-purpose code framework for image data access and manipulation, enabling efficient development of new applications, whilst retaining high computational performance and a consistent command-line interface between applications. In this article, we provide a high-level overview of the features of the MRtrix3 framework and general-purpose image processing applications provided with the software.
TL;DR: scVelo enables disentangling heterogeneous subpopulation kinetics with unprecedented resolution in hippocampal dentate gyrus neurogenesis and pancreatic endocrinogenesis, and it is anticipated that scVelo will greatly facilitate the study of lineage decisions, gene regulation, and pathway activity identification.
Abstract: The introduction of RNA velocity in single cells has opened up new ways of studying cellular differentiation. The originally proposed framework obtains velocities as the deviation of the observed ratio of spliced and unspliced mRNA from an inferred steady state. Errors in velocity estimates arise if the central assumptions of a common splicing rate and the observation of the full splicing dynamics with steady-state mRNA levels are violated. With scVelo (https://scvelo.org), we address these restrictions by solving the full transcriptional dynamics of splicing kinetics using a likelihood-based dynamical model. This generalizes RNA velocity to a wide variety of systems comprising transient cell states, which are common in development and in response to perturbations. We infer gene-specific rates of transcription, splicing and degradation, and recover the latent time of the underlying cellular processes. This latent time represents the cell’s internal clock and is based only on its transcriptional dynamics. Moreover, scVelo allows us to identify regimes of regulatory changes such as stages of cell fate commitment and, therein, systematically detects putative driver genes. We demonstrate that scVelo enables disentangling heterogeneous subpopulation kinetics with unprecedented resolution in hippocampal dentate gyrus neurogenesis and pancreatic endocrinogenesis. We anticipate that scVelo will greatly facilitate the study of lineage decisions, gene regulation, and pathway activity identification.
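The steady-state idea that the abstract attributes to the originally proposed framework can be sketched as follows. This is an illustration of that earlier baseline only, not of scVelo's likelihood-based dynamical model: it assumes a single gene, estimates the steady-state ratio `gamma` by least squares through the origin over all cells (a simplification chosen for this example), and defines velocity as the deviation of unspliced counts from `gamma * spliced`. The function names are hypothetical.

```python
def steady_state_gamma(unspliced, spliced):
    """Least-squares slope through the origin of unspliced on spliced counts."""
    denominator = sum(s * s for s in spliced)
    if denominator == 0:
        raise ValueError("spliced counts are all zero; gamma is undefined")
    return sum(u * s for u, s in zip(unspliced, spliced)) / denominator


def steady_state_velocity(unspliced, spliced):
    """Per-cell velocity: deviation of unspliced mRNA from the steady-state line."""
    gamma = steady_state_gamma(unspliced, spliced)
    return [u - gamma * s for u, s in zip(unspliced, spliced)]


if __name__ == "__main__":
    spliced = [1.0, 2.0, 4.0]
    unspliced = [0.5, 1.0, 3.0]  # the last cell has excess unspliced mRNA
    print(steady_state_velocity(unspliced, spliced))
```

Cells above the fitted line (excess unspliced mRNA) get a positive velocity, read as induction, and cells below it get a negative velocity, read as repression. The restrictions named in the abstract, a common splicing rate and observed steady states, are exactly what this simple fit relies on.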
TL;DR: The nearly complete catalog of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and non-coding sequence variants to phenotypic variation; the paper also describes TOPMed goals and design, as well as resources and early insights from the sequence data.
Abstract: The Trans-Omics for Precision Medicine (TOPMed) program seeks to elucidate the genetic architecture and disease biology of heart, lung, blood, and sleep disorders, with the ultimate goal of improving diagnosis, treatment, and prevention. The initial phases of the program focus on whole genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here, we describe TOPMed goals and design as well as resources and early insights from the sequence data. The resources include a variant browser, a genotype imputation panel, and sharing of genomic and phenotypic data via dbGaP. In 53,581 TOPMed samples, >400 million single-nucleotide and insertion/deletion variants were detected by alignment with the reference genome. Additional novel variants are detectable through assembly of unmapped reads and customized analysis in highly variable loci. Among the >400 million variants detected, 97% have frequency
TL;DR: ModelTest-NG is a re-implementation from scratch of jModelTest and ProtTest, two popular tools for selecting the best-fit nucleotide and amino acid substitution models, respectively, and introduces several new features, such as ascertainment bias correction, mixture and FreeRate models, or the automatic processing of partitioned datasets.
Abstract: ModelTest-NG is a re-implementation from scratch of jModelTest and ProtTest, two popular tools for selecting the best-fit nucleotide and amino acid substitution models, respectively. ModelTest-NG is one to two orders of magnitude faster than jModelTest and ProtTest but equally accurate, and introduces several new features, such as ascertainment bias correction, mixture and FreeRate models, or the automatic processing of partitioned datasets. ModelTest-NG is available under a GNU GPL3 license at https://github.com/ddarriba/modeltest.
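Best-fit model selection of this kind is commonly based on information criteria computed from each model's maximized log-likelihood. The sketch below illustrates the general idea with AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. It is not ModelTest-NG code: the log-likelihoods, free-parameter counts, and alignment length are made-up values for illustration, and the function names are hypothetical.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def best_model(candidates, criterion):
    """Return the name of the candidate with the lowest criterion score."""
    return min(candidates, key=lambda name: criterion(*candidates[name]))


if __name__ == "__main__":
    # Hypothetical (log-likelihood, free parameters) pairs for three
    # nucleotide substitution models; the numbers are invented.
    candidates = {"JC": (-1050.0, 0), "HKY": (-1000.0, 4), "GTR": (-998.0, 8)}
    n_sites = 500  # hypothetical alignment length
    print(best_model(candidates, aic))
    print(best_model(candidates, lambda lnl, k: bic(lnl, k, n_sites)))
```

Both criteria reward fit and penalize parameters, so a richer model wins only if its likelihood gain outweighs the added complexity.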
TL;DR: KofamKOALA is a web server to assign KEGG Orthologs (KOs) to protein sequences by homology search against a database of profile hidden Markov models (KOfam) with pre-computed adaptive score thresholds.
Abstract: KofamKOALA is a web server to assign KEGG Orthologs (KOs) to protein sequences by homology search against a database of profile hidden Markov models (KOfam) with pre-computed adaptive score thresholds. KofamKOALA is faster than existing KO assignment tools, with accuracy comparable to the best performing tools. Function annotation by KofamKOALA helps link genes to KEGG resources such as the KEGG pathway maps and facilitates molecular network reconstruction. Availability: KofamKOALA, KofamScan, and KOfam are freely available from https://www.genome.jp/tools/kofamkoala/.
TL;DR: UCSC Xena is a web-based visualization tool for both public and private omics data, supported through the Xena Browser and multiple turn-key Xena Hubs, allowing researchers to view their own data securely, using private Xena Hubs, while simultaneously visualizing large public cancer genomics datasets, including TCGA and the GDC.
Abstract: UCSC Xena is a visual exploration resource for both public and private omics data, supported through the web-based Xena Browser and multiple turn-key Xena Hubs. This unique architecture allows researchers to view their own data securely, using private Xena Hubs, while simultaneously visualizing large public cancer genomics datasets, including TCGA and the GDC. Data integration occurs only within the Xena Browser, keeping private data private. Xena supports virtually any functional genomics data, including SNVs, INDELs, large structural variants, CNV, expression, DNA methylation, ATAC-seq signals, and phenotypic annotations. Browser features include the Visual Spreadsheet, survival analyses, powerful filtering and subgrouping, statistical analyses, genomic signatures, and bookmarks. Xena differentiates itself from other genomics tools, including its predecessor, the UCSC Cancer Genomics Browser, by its ability to easily and securely view public and private data, its high performance, its broad data type support, and many unique features.
TL;DR: Cleavage Under Targets and Tagmentation (CUT&Tag), an enzyme-tethering strategy that provides efficient high-resolution sequencing libraries for profiling diverse chromatin components, is described and demonstrated by profiling histone modifications, RNA Polymerase II and transcription factors on low cell numbers and single cells.
Abstract: Many chromatin features play critical roles in regulating gene expression. A complete understanding of gene regulation will require the mapping of specific chromatin features in small samples of cells at high resolution. Here we describe Cleavage Under Targets and Tagmentation (CUT&Tag), an enzyme-tethering strategy that provides efficient high-resolution sequencing libraries for profiling diverse chromatin components. In CUT&Tag, a chromatin protein is bound in situ by a specific antibody, which then tethers a protein A-Tn5 transposase fusion protein. Activation of the transposase efficiently generates fragment libraries with high resolution and exceptionally low background. All steps from live cells to sequencing-ready libraries can be performed in a single tube on the benchtop or a microwell in a high-throughput pipeline, and the entire procedure can be performed in one day. We demonstrate the utility of CUT&Tag by profiling histone modifications, RNA Polymerase II and transcription factors on low cell numbers and single cells.
TL;DR: PICRUSt2 extends the capabilities of the original PICRUSt method to predict the approximate functional potential of a community based on marker gene sequencing profiles, including an expanded database of gene families and reference genomes, a new approach compatible with any OTU-picking or denoising algorithm, novel phenotype predictions, and novel fungal reference databases that enable predictions from 18S rRNA gene and internal transcribed spacer amplicon data.
Abstract: One major limitation of microbial community marker gene sequencing is that it does not provide direct information on the functional composition of sampled communities. Here, we present PICRUSt2, which expands the capabilities of the original PICRUSt method to predict approximate functional potential of a community based on marker gene sequencing profiles. This updated method and implementation includes several improvements over the previous algorithm: an expanded database of gene families and reference genomes, a new approach now compatible with any OTU-picking or denoising algorithm, novel phenotype predictions, and novel fungal reference databases that enable predictions from 18S rRNA gene and internal transcribed spacer amplicon data. Upon evaluation, PICRUSt2 was more accurate than PICRUSt1 and other current approaches and also more flexible to allow the addition of custom reference databases. Last, we demonstrate the utility of PICRUSt2 by identifying potential disease-associated microbial functional signatures based on 16S rRNA gene sequencing of ileal biopsies collected from a cohort of human subjects with inflammatory bowel disease. PICRUSt2 is freely available at: https://github.com/picrust/picrust2.
TL;DR: StringTie2 is a reference-guided transcriptome assembler that works with both short and long reads and includes new computational methods to handle the high error rate of long-read sequencing technology, which previous assemblers could not tolerate.
Abstract: RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new computational methods to handle the high error rate of long-read sequencing technology, which previous assemblers could not tolerate. It also offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of assemblies. On 33 short-read datasets from humans and two plant species, StringTie2 is 47.3% more precise and 3.9% more sensitive than Scallop. On multiple long read datasets, StringTie2 on average correctly assembles 8.3 and 2.6 times as many transcripts as FLAIR and Traphlor, respectively, with substantially higher precision. StringTie2 is also faster and has a smaller memory footprint than all comparable tools.
Harvard University, University of North Carolina at Chapel Hill, University of Texas at Austin, VU University Amsterdam, Broad Institute, Icahn School of Medicine at Mount Sinai, Cardiff University, Stanford University, Federal University of São Paulo, University of Pennsylvania, University of Helsinki, University of Illinois at Urbana–Champaign, University of California, Los Angeles, Centre for Addiction and Mental Health, University Medical Center Groningen, Universidade Federal do Rio Grande do Sul, King's College London, University of Edinburgh, University of Oslo, Lundbeck, Indiana University, Veterans Health Administration, State University of New York Upstate Medical University, Yale University, University of Florida, VA Boston Healthcare System, Virginia Commonwealth University, Maine Medical Center, University of California, Berkeley, University of Queensland
TL;DR: A meta-analysis of genome-wide studies of anorexia nervosa, attention-deficit/hyperactivity disorder, autism spectrum disorder, bipolar disorder, major depression, obsessive-compulsive disorder, schizophrenia, and Tourette syndrome revealed a meaningful structure within the eight disorders identifying three groups of inter-related disorders.
Abstract: Genetic influences on psychiatric disorders transcend diagnostic boundaries, suggesting substantial pleiotropy of contributing loci. However, the nature and mechanisms of these pleiotropic effects remain unclear. We performed a meta-analysis of 232,964 cases and 494,162 controls from genome-wide studies of anorexia nervosa, attention-deficit/hyperactivity disorder, autism spectrum disorder, bipolar disorder, major depression, obsessive-compulsive disorder, schizophrenia, and Tourette syndrome. Genetic correlation analyses revealed a meaningful structure within the eight disorders identifying three groups of inter-related disorders. We detected 109 loci associated with at least two psychiatric disorders, including 23 loci with pleiotropic effects on four or more disorders and 11 loci with antagonistic effects on multiple disorders. The pleiotropic loci are located within genes that show heightened expression in the brain throughout the lifespan, beginning in the second trimester prenatally, and play prominent roles in a suite of neurodevelopmental processes. These findings have important implications for psychiatric nosology, drug development, and risk prediction.
TL;DR: This work applies deep learning to unlabelled amino acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily, and biophysically grounded.
Abstract: Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabelled amino acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily, and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach reaches near state-of-the-art or superior performance predicting stability of natural and de novo designed proteins as well as quantitative function of molecularly diverse mutants. UniRep further enables two orders of magnitude cost savings in a protein engineering task. We conclude UniRep is a versatile protein summary that can be applied across protein engineering informatics.
TL;DR: TAPE is a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology, designed to test biologically relevant generalization that transfers to real-life scenarios.
Abstract: Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
TL;DR: A single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment is developed and the added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.
Abstract: Background: HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous sequences. Results: We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. This accelerated HHsearch by a factor of 4 and HHblits by a factor of 2 over the previous version 2.0.16. HHblits3 is ~10x faster than PSI-BLAST and ~20x faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over servers in a cluster using OpenMP and message passing interface (MPI). The free, open-source, GNU GPL(v3)-licensed software is available at https://github.com/soedinglab/hh-suite. Conclusion: The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.
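For readers unfamiliar with the Viterbi algorithm named above, the following is a minimal scalar sketch of Viterbi decoding on a toy two-state HMM. It is only meant to show the dynamic-programming recursion (best score per state, plus back-pointers). It is not HH-suite's pairwise profile-HMM alignment and contains no SIMD vectorization; the states, symbols, and probabilities are invented for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its probability.

    Scores are kept in log space; all probabilities are assumed non-zero.
    """
    scores = [{
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }]
    back_pointers = [{}]
    for obs in observations[1:]:
        previous = scores[-1]
        current, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this position.
            best = max(states, key=lambda p: previous[p] + math.log(trans_p[p][s]))
            current[s] = (previous[best] + math.log(trans_p[best][s])
                          + math.log(emit_p[s][obs]))
            pointers[s] = best
        scores.append(current)
        back_pointers.append(pointers)

    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for pointers in reversed(back_pointers[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, math.exp(scores[-1][last])


if __name__ == "__main__":
    states = ["A", "B"]
    start_p = {"A": 0.6, "B": 0.4}
    trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
    emit_p = {"A": {"x": 0.5, "y": 0.4, "z": 0.1},
              "B": {"x": 0.1, "y": 0.3, "z": 0.6}}
    print(viterbi(["x", "y", "z"], states, start_p, trans_p, emit_p))
```

The inner loop over states performs the same max-and-add operation for every cell, which is the kind of regular arithmetic that lends itself to SIMD vectorization.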
TL;DR: Overall, tRNA detection sensitivity and specificity are improved for all isotypes, particularly those utilizing specialized models for selenocysteine and the three subtypes of tRNA genes encoding a CAU anticodon.
Abstract: tRNAscan-SE has been widely used for whole-genome transfer RNA gene prediction for nearly two decades. With the increased availability of new genomes, a vastly larger training set has enabled creation of nearly one hundred specialized isotype-specific models, greatly improving tRNAscan-SE’s ability to identify and classify both typical and atypical tRNAs. We employ a new multi-model annotation strategy where predicted tRNAs are scored against a full set of isotype-specific covariance models. A post-filtering feature also better identifies tRNA-derived SINEs that are abundant in many eukaryotic genomes, and provides a “high confidence” tRNA gene set which improves upon prior pseudogene prediction. These new enhancements of tRNAscan-SE will provide researchers more accurate detection and more comprehensive annotation for tRNA genes.
TL;DR: Mutect2 is a somatic variant caller that uses local assembly and realignment to detect SNVs and indels, and is based on several probabilistic models for genotyping and filtering that work well with and without a matched normal sample and for all sequencing depths.
Abstract: Mutect2 is a somatic variant caller that uses local assembly and realignment to detect SNVs and indels. Assembly implies whole haplotypes and read pairs, rather than single bases, as the atomic units of biological variation and sequencing evidence, improving variant calling. Beyond local assembly and alignment, Mutect2 is based on several probabilistic models for genotyping and filtering that work well with and without a matched normal sample and for all sequencing depths.
TL;DR: The analyses identify several candidate biomarkers of cellular senescence that overlap with aging markers in human plasma, including GDF15, STC1 and SERPINs, which significantly correlated with age in plasma from a human cohort, the Baltimore Longitudinal Study of Aging.
Abstract: The senescence-associated secretory phenotype (SASP) has recently emerged as both a driver of, and promising therapeutic target for, multiple age-related conditions, ranging from neurodegeneration to cancer. The complexity of the SASP, typically monitored by a few dozen secreted proteins, has been greatly underappreciated, and a small set of factors cannot explain the diverse phenotypes it produces in vivo. Here, we present ‘SASP Atlas’, a comprehensive proteomic database of soluble and exosome SASP factors originating from multiple senescence inducers and cell types. Each profile consists of hundreds of largely distinct proteins, but also includes a subset of proteins elevated in all SASPs. Based on our analyses, we propose several candidate biomarkers of cellular senescence, including GDF15, STC1 and SERPINs. This resource will facilitate identification of proteins that drive specific senescence-associated phenotypes and catalog potential senescence biomarkers to assess the burden, originating stimulus and tissue of senescent cells in vivo.
TL;DR: Neuralink’s approach to BMI has unprecedented packaging density and scalability in a clinically relevant package and has achieved a spiking yield of up to 85.5% in chronically implanted electrodes.
Abstract: Brain-machine interfaces (BMIs) hold promise for the restoration of sensory and motor function and the treatment of neurological disorders, but clinical BMIs have not yet been widely adopted, in part because modest channel counts have limited their potential. In this white paper, we describe Neuralink’s first steps toward a scalable high-bandwidth BMI system. We have built arrays of small and flexible electrode “threads”, with as many as 3,072 electrodes per array distributed across 96 threads. We have also built a neurosurgical robot capable of inserting six threads (192 electrodes) per minute. Each thread can be individually inserted into the brain with micron precision for avoidance of surface vasculature and targeting specific brain regions. The electrode array is packaged into a small implantable device that contains custom chips for low-power on-board amplification and digitization: the package for 3,072 channels occupies less than (23 × 18.5 × 2) mm³. A single USB-C cable provides full-bandwidth data streaming from the device, recording from all channels simultaneously. This system has achieved a spiking yield of up to 85.5% in chronically implanted electrodes. Neuralink’s approach to BMI has unprecedented packaging density and scalability in a clinically relevant package.
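The figures quoted in this abstract can be cross-checked with simple arithmetic. The short script below uses only numbers stated in the abstract. The derived insertion time for a full array is an illustrative extrapolation that assumes the quoted per-minute rate is sustained, which the abstract does not claim.

```python
ELECTRODES_PER_ARRAY = 3072
THREADS_PER_ARRAY = 96
THREADS_PER_MINUTE = 6
PACKAGE_DIMENSIONS_MM = (23.0, 18.5, 2.0)  # stated upper bound on package size

# Electrodes carried by each thread.
electrodes_per_thread = ELECTRODES_PER_ARRAY // THREADS_PER_ARRAY

# Insertion throughput; matches the "192 electrodes" quoted in the abstract.
electrodes_per_minute = THREADS_PER_MINUTE * electrodes_per_thread

# Extrapolation (assumption): time to insert a full array at a sustained rate.
minutes_per_array = THREADS_PER_ARRAY / THREADS_PER_MINUTE

# Upper bound on package volume in cubic millimetres.
length, width, height = PACKAGE_DIMENSIONS_MM
package_volume_mm3 = length * width * height

if __name__ == "__main__":
    print(electrodes_per_thread, electrodes_per_minute,
          minutes_per_array, package_volume_mm3)
```

The check confirms that the stated insertion rate of six threads per minute and 192 electrodes per minute are consistent with 32 electrodes per thread.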
TL;DR: A file format called cooler, based on a sparse data model, that can support genomically-labeled matrices at any resolution, which has the flexibility to accommodate various descriptions of the data axes, resolutions, data density patterns, and metadata.
Abstract: Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate development of scalable algorithms for data analysis. We developed a file format called cooler, based on a sparse data model, that can support genomically-labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns, and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium. Cooler is cross-platform, BSD-licensed, and can be installed from the Python Package Index or the bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler.
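To illustrate why a sparse data model helps here, the sketch below stores a binned two-dimensional genomic map as a coordinate list of non-zero cells instead of a dense matrix. It is a generic illustration in plain Python, not the cooler library's API or its on-disk schema; the function names, chromosome sizes, and contact records are all hypothetical.

```python
from collections import Counter

def make_bin_offsets(chrom_sizes, bin_size):
    """Map each chromosome to the ID of its first bin; return (offsets, n_bins)."""
    offsets, next_id = {}, 0
    for chrom, size in chrom_sizes.items():
        offsets[chrom] = next_id
        next_id += -(-size // bin_size)  # ceiling division: bins on this chromosome
    return offsets, next_id

def to_sparse_pixels(contacts, offsets, bin_size):
    """Aggregate pairwise contacts into sorted (bin1, bin2, count) records.

    Only the upper triangle (bin1 <= bin2) is kept, because the map is symmetric.
    """
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in contacts:
        a = offsets[chrom1] + pos1 // bin_size
        b = offsets[chrom2] + pos2 // bin_size
        counts[(min(a, b), max(a, b))] += 1
    return sorted((b1, b2, n) for (b1, b2), n in counts.items())

# Hypothetical toy genome and contact records (chrom1, pos1, chrom2, pos2).
chrom_sizes = {"chr1": 1000, "chr2": 500}
contacts = [
    ("chr1", 10, "chr1", 260),
    ("chr1", 300, "chr1", 20),   # same bin pair as above, reversed order
    ("chr1", 900, "chr2", 100),
    ("chr2", 400, "chr2", 450),
]

offsets, n_bins = make_bin_offsets(chrom_sizes, bin_size=250)
pixels = to_sparse_pixels(contacts, offsets, bin_size=250)
print(pixels)
print(f"{len(pixels)} stored records versus {n_bins * n_bins} dense cells")
```

The number of stored records grows with the number of non-zero cells rather than with the square of the number of bins, which is the property that makes high resolutions tractable.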
TL;DR: The current and projected Achilles processing pipeline, including recent improvements and the analyses that led us to adopt them, is presented, spanning data releases from early 2018 to the first quarter of 2020.
Abstract: One of the main goals of the Cancer Dependency Map project is to systematically identify cancer vulnerabilities across cancer types to accelerate therapeutic discovery. Project Achilles serves this goal through the in vitro study of genetic dependencies in cancer cell lines using CRISPR/Cas9 (and, previously, RNAi) loss-of-function screens. The project is committed to the public release of its experimental results quarterly on the DepMap Portal (https://depmap.org), on a pre-publication basis. As the experiment has evolved, data processing procedures have changed. Here we present the current and projected Achilles processing pipeline, including recent improvements and the analyses that led us to adopt them, spanning data releases from early 2018 to the first quarter of 2020. Notable changes include quality control metrics, calculation of probabilities of dependency, and correction for screen quality and other biases. Developing and improving methods for extracting biologically-meaningful scores from Achilles experiments is an ongoing process, and we will continue to evaluate and revise data processing procedures to produce the best results.
TL;DR: A striking buildup of lipid droplets in microglia with aging in mouse and human brains is reported and it is proposed that LAM contribute to age-related and genetic forms of neurodegeneration.
Abstract: Microglia become progressively activated and seemingly dysfunctional with age, and genetic studies have linked these cells to the pathogenesis of a growing number of neurodegenerative diseases. Here we report a striking buildup of lipid droplets in microglia with aging in mouse and human brains. These cells, which we call lipid droplet-accumulating microglia (LAM), are defective in phagocytosis, produce high levels of reactive oxygen species, and secrete pro-inflammatory cytokines. RNA sequencing analysis of LAM revealed a transcriptional profile driven by innate inflammation distinct from previously reported microglial states. An unbiased CRISPR-Cas9 screen identified genetic modifiers of lipid droplet formation; surprisingly, variants of several of these genes, including progranulin, are causes of autosomal dominant forms of human neurodegenerative diseases. We thus propose that LAM contribute to age-related and genetic forms of neurodegeneration.
University of Würzburg, National University of Comahue, Spanish National Research Council, Swedish University of Agricultural Sciences, University of Lisbon, Universidade Federal de Goiás, Stanford University, Commonwealth Scientific and Industrial Research Organisation, National University of Río Negro, ETH Zurich, Cornell University, University of California, Davis, The Nature Conservancy, Wageningen University and Research Centre, University of British Columbia, Great Lakes Bioenergy Research Center, University of California, Berkeley, University of Padua, University of New England (United States), Lund University, University of Göttingen, University of La Rochelle, Institut national de la recherche agronomique, Federal University of Ceará, Concordia University Wisconsin, University of Belgrade, National University of Tucumán, Michigan State University, University of Brasília, University of Greenwich, University of Reading, University of Wisconsin-Madison, Boise State University, University of Texas at Austin, University of Haifa, Kansas State University, University of Freiburg, University of Hamburg, University of California, Santa Barbara, Seattle University, University of Vienna, University of Florida, Centro Agronómico Tropical de Investigación y Enseñanza, National Audubon Society, University of Buenos Aires, Virginia Tech, University of Bordeaux, University of Auckland, University College Dublin, Trinity College, Dublin, University of Tokyo, Federal University of Bahia, Lincoln University (Pennsylvania), National Institute for Environmental Studies, International Food Policy Research Institute, Xi'an Jiaotong-Liverpool University
TL;DR: Using a global database from 89 crop systems, the relative importance of abundance and species richness for pollination, biological pest control and final yields in the context of on-going land-use change is partitioned.
Abstract: Human land use threatens global biodiversity and compromises multiple ecosystem functions critical to food production. Whether crop yield-related ecosystem services can be maintained by few abundant species or rely on high richness remains unclear. Using a global database from 89 crop systems, we partition the relative importance of abundance and species richness for pollination, biological pest control and final yields in the context of on-going land-use change. Pollinator and enemy richness directly supported ecosystem services independent of abundance. Up to 50% of the negative effects of landscape simplification on ecosystem services were due to richness losses of service-providing organisms, with negative consequences for crop yields. Maintaining the biodiversity of ecosystem service providers is therefore vital to sustain the flow of key agroecosystem benefits to society.
TL;DR: This new program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery, and incorporates a module for structural discovery of complete LTR retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity.
Abstract: The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a new pipeline that greatly facilitates this process. This new program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete LTR retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately three times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. The program had an extremely low false positive rate when applied to simulated genomes devoid of TEs. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, https://github.com/Dfam-consortium/TETools).
TL;DR: It is anticipated that droplet-based single-cell chromatin accessibility will provide a broadly applicable means of identifying regulatory factors and elements that underlie cell type and function.
Abstract: Understanding complex tissues requires single-cell deconstruction of gene regulation with precision and scale. Here we present a massively parallel droplet-based platform for mapping transposase-accessible chromatin in tens of thousands of single cells per sample (scATAC-seq). We obtain and analyze chromatin profiles of over 200,000 single cells in two primary human systems. In blood, scATAC-seq allows marker-free identification of cell type-specific cis- and trans-regulatory elements, mapping of disease-associated enhancer activity, and reconstruction of trajectories of differentiation from progenitors to diverse and rare immune cell types. In basal cell carcinoma, scATAC-seq reveals regulatory landscapes of malignant, stromal, and immune cell types in the tumor microenvironment. Moreover, scATAC-seq of serial tumor biopsies before and after PD-1 blockade allows identification of chromatin regulators and differentiation trajectories of therapy-responsive intratumoral T cell subsets, revealing a shared regulatory program driving CD8+ T cell exhaustion and CD4+ T follicular helper cell development. We anticipate that droplet-based single-cell chromatin accessibility will provide a broadly applicable means of identifying regulatory factors and elements that underlie cell type and function.
TL;DR: The genetic basis of 38 blood and urine laboratory tests is evaluated, showing which tissues contribute to biomarker function, the causal influences of the biomarkers, and how these results can be used to predict disease.
Abstract: Clinical laboratory tests are a critical component of the continuum of care and provide a means for rapid diagnosis and monitoring of chronic disease. In this study, we systematically evaluated the genetic basis of 38 blood and urine laboratory tests measured in 358,072 participants in the UK Biobank and identified 1,857 independent loci associated with at least one laboratory test, including 488 large-effect protein truncating, missense, and copy-number variants. We tested these loci for enrichment in specific single cell types in kidney, liver, and pancreas relevant to disease aetiology. We then causally linked the biomarkers to medically relevant phenotypes through genetic correlation and Mendelian randomization. Finally, we developed polygenic risk scores (PRS) for each biomarker and built multi-PRS models using all 38 PRSs simultaneously. We found substantially improved prediction of incidence in FinnGen (n=135,500) with the multi-PRS relative to single-disease PRSs for renal failure, myocardial infarction, liver fat percentage, and alcoholic cirrhosis. Together, our results show the genetic basis of these biomarkers, which tissues contribute to the biomarker function, the causal influences of the biomarkers, and how we can use this to predict disease.
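As a rough illustration of the scoring idea mentioned in this abstract, a polygenic risk score is commonly computed as a weighted sum of allele dosages, and several such scores can then be combined in a single model. The sketch below assumes a plain linear combination; the abstract does not specify the authors' model, and the variant IDs, weights, and coefficients are hypothetical.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1 or 2) over variants that have a weight."""
    return sum(weights[variant] * dosage
               for variant, dosage in dosages.items()
               if variant in weights)

def combine_scores(scores, coefficients):
    """Assumed multi-PRS form: a linear combination of per-biomarker scores."""
    return sum(coefficients[name] * score for name, score in scores.items())

# Hypothetical per-variant effect weights for two biomarkers.
weights_by_biomarker = {
    "biomarker_1": {"variant_a": 0.5, "variant_b": -0.25},
    "biomarker_2": {"variant_b": 0.5, "variant_c": -1.0},
}

# Hypothetical genotype for one individual: variant -> allele dosage.
dosages = {"variant_a": 2, "variant_b": 1, "variant_c": 1}

scores = {name: polygenic_score(dosages, weights)
          for name, weights in weights_by_biomarker.items()}

# Hypothetical coefficients, as might be fitted on a training cohort.
coefficients = {"biomarker_1": 2.0, "biomarker_2": 1.0}
combined = combine_scores(scores, coefficients)

print(scores)
print(combined)
```

In practice the coefficients would be estimated by regressing disease status on the individual scores, and the combined score would be evaluated in an independent cohort.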