Showing papers in "bioRxiv in 2019"

Open Access

Posted Content•DOI•

U1 snRNP regulates cancer cell migration and invasion

Jung-Min Oh¹, Christopher C. Venters¹, Chao Di¹, Anna Maria Pinto¹, Lili Wan¹, Ihab Younis¹, Zhiqiang Cai¹, Chie Arai¹, Byung Ran So¹, Gideon Dreyfuss¹•Institutions (1)

University of Pennsylvania¹

09 Aug 2019-bioRxiv

TL;DR: It is shown that U1 AMO also modulates cancer cells’ phenotype, dose-dependently increasing migration and invasion in vitro by up to 500%, whereas U1 over-expression has the opposite effect.

Abstract: Stimulated cells and cancer cells have widespread shortening of mRNA 3’-untranslated regions (3’UTRs) and switches to shorter mRNA isoforms due to usage of more proximal polyadenylation signals (PASs) in the last exon and in introns. U1 snRNA (U1), vertebrates’ most abundant non-coding (spliceosomal) small nuclear RNA, silences proximal PASs and its inhibition with antisense morpholino oligonucleotides (U1 AMO) triggers widespread mRNA shortening. Here we show that U1 AMO also modulates cancer cells’ phenotype, dose-dependently increasing migration and invasion in vitro by up to 500%, whereas U1 over-expression has the opposite effect. In addition to 3’UTR length, numerous transcriptome changes that could contribute to this phenotype are observed, including alternative splicing, and mRNA expression levels of proto-oncogenes and tumor suppressors. These findings reveal an unexpected link between U1 regulation and oncogenic and activated cell states, and suggest U1 as a potential target for their modulation.

1,660 citations

Posted Content•DOI•

OrthoFinder: phylogenetic orthology inference for comparative genomics

David M. Emms¹, Steven L. Kelly¹•Institutions (1)

University of Oxford¹

24 Apr 2019-bioRxiv

TL;DR: This extends OrthoFinder’s high accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomic statistics.

Abstract: Here, we present a major advance of the OrthoFinder method. This extends OrthoFinder’s high accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomic statistics. Each output is benchmarked on appropriate real or simulated datasets and, where comparable methods exist, OrthoFinder is equivalent to or outperforms these methods. Furthermore, OrthoFinder is the most accurate ortholog inference method on the Quest for Orthologs benchmark test. Finally, OrthoFinder’s comprehensive phylogenetic analysis is achieved with equivalent speed and scalability to the fastest, score-based heuristic methods. OrthoFinder is available at https://github.com/davidemms/OrthoFinder.

1,366 citations

Posted Content•DOI•

The GTEx Consortium atlas of genetic regulatory effects across human tissues

François Aguet¹, Alvaro N. Barbeira², Rodrigo Bonazzola², Andrew A. Brown³, SE Castel⁴, Brian Jo, Silva Kasela⁴, Sarah Kim-Hellmuth⁵, Yanyu Liang², Meritxell Oliva², Princy E Parsana⁶, Elise Flynn⁴, Laure Fresard⁶, Eric R Gamazon⁷, Andrew R Hamel², Yuan He⁶, Farhad Hormozdiari⁸, Pejman Mohammadi⁹, Manuel Muñoz-Aguirre¹⁰, YoSon Park¹¹, Ashis Saha⁶, Ayellet V Segrè², Benjamin J Strober⁶, Xiaoquan Wen¹¹, Valentin Wucher¹⁰, Sayantan Das¹¹, D Garrido-Martin¹⁰, Robert E Handsaker¹², Paul J Hoffman¹³, Seva Kashin¹², Alan Kwong¹¹, Xiao Li⁸, Daniel G. MacArthur⁸, John M. Rouhana², Matthew Stephens², Ellen Todres¹, Ana Viñuela¹⁴, Gao Wang¹⁵, Yuxin Zou¹⁵, Christopher D. Brown¹¹, Nancy Cox⁷, Emmanouil T. Dermitzakis, Barbara E. Engelhardt, Gad Getz⁸, Roderic Guigo¹⁶, Stephen B. Montgomery¹⁷, Barbara E Stranger¹⁷, Hae Kyung Im², Alexis Battle⁶, Kristin G. Ardlie⁸, Tuuli Lappalainen⁴•Institutions (17)

Broad Institute¹, University of Chicago², University of Geneva³, University of Dundee⁴, Columbia University⁵, Princeton University⁶, Max Planck Society⁷, Johns Hopkins University⁸, Stanford University⁹, Vanderbilt University¹⁰, University of Cambridge¹¹, Vanderbilt University Medical Center¹², Massachusetts Eye and Ear Infirmary¹³, Harvard University¹⁴, Scripps Health¹⁵, Polytechnic University of Catalonia¹⁶, University of Pennsylvania¹⁷

03 Oct 2019-bioRxiv

TL;DR: Analysis of the v8 data provides insights into the tissue-specificity of genetic effects, and shows that cell type composition is a key factor in understanding gene regulatory mechanisms in human tissues.

Abstract: The Genotype-Tissue Expression (GTEx) project was established to characterize genetic effects on the transcriptome across human tissues, and to link these regulatory mechanisms to trait and disease associations. Here, we present analyses of the v8 data, based on 17,382 RNA-sequencing samples from 54 tissues of 948 post-mortem donors. We comprehensively characterize genetic associations for gene expression and splicing in cis and trans, showing that regulatory associations are found for almost all genes, and describe the underlying molecular mechanisms and their contribution to allelic heterogeneity and pleiotropy of complex traits. Leveraging the large diversity of tissues, we provide insights into the tissue-specificity of genetic effects, and show that cell type composition is a key factor in understanding gene regulatory mechanisms in human tissues.

1,243 citations

Posted Content•DOI•

Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression

Christoph Hafemeister, Rahul Satija¹•Institutions (1)

New York University¹

14 Mar 2019-bioRxiv

TL;DR: It is proposed that the Pearson residuals from ‘regularized negative binomial regression’, where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity.

Abstract: Single-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from ‘regularized negative binomial regression’, where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation, and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package sctransform, with a direct interface to our single-cell toolkit Seurat.
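
To make the idea of a negative binomial Pearson residual concrete, here is a minimal Python sketch. It is not the sctransform implementation: the function names, the log-link with log10 sequencing depth as the single covariate, and the example parameter values are all assumptions made for illustration, and the regularization step (pooling parameter estimates across genes of similar abundance) is not shown.

```python
import math


def expected_count(depth, beta0, beta1):
    """Expected UMI count for one gene in one cell.

    Assumes a log-link GLM with log10(sequencing depth) as the only
    covariate; beta0 and beta1 are hypothetical fitted coefficients.
    """
    return math.exp(beta0 + beta1 * math.log10(depth))


def nb_pearson_residual(count, mu, theta):
    """Pearson residual under a negative binomial model.

    The NB variance with mean mu and dispersion theta is mu + mu^2 / theta,
    so the residual is (count - mu) / sqrt(mu + mu^2 / theta).
    """
    variance = mu + mu ** 2 / theta
    return (count - mu) / math.sqrt(variance)


if __name__ == "__main__":
    # Hypothetical gene: observed count 10, expected 4, dispersion theta = 2.
    print(nb_pearson_residual(10, 4.0, 2.0))
```

A residual near zero means the observed count is close to what sequencing depth alone would predict; large positive or negative values flag cells where the gene deviates from that expectation.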

1,175 citations

Posted Content•DOI•

Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes

Konrad J. Karczewski¹,², Laurent C. Francioli¹,², Grace Tiao¹,², Beryl B. Cummings¹,², Jessica Alföldi¹,², Qingbo Wang¹,², Ryan L. Collins¹,², Kristen M. Laricchia¹,², Andrea Ganna¹,²,³, Daniel P. Birnbaum¹, Laura D. Gauthier¹, Harrison Brand¹,², Matthew Solomonson¹,², Nicholas A. Watts¹,², Daniel R. Rhodes⁴, Moriel Singer-Berk¹, Eleanor G. Seaby¹,², Jack A. Kosmicki¹,², Raymond K. Walters¹,², Katherine Tashman¹,², Yossi Farjoun¹, Eric Banks¹, Timothy Poterba¹,², Arcturus Wang¹,², Cotton Seed¹,², Nicola Whiffin¹,⁵, Jessica X. Chong⁶, Kaitlin E. Samocha⁷, Emma Pierce-Hoffman¹, Zachary Zappala¹,⁸, Anne H. O’Donnell-Luria¹,²,⁹, Eric Vallabh Minikel¹, Ben Weisburd¹, Monkol Lek¹,¹⁰, James S. Ware¹,⁵, Christopher Vittal¹,², Irina M. Armean¹,²,¹¹, Louis Bergelson¹, Kristian Cibulskis¹, Kristen M. Connolly¹, Miguel Covarrubias¹, Stacey Donnelly¹, Steven Ferriera¹, Stacey Gabriel¹, Jeff Gentry¹, Namrata Gupta¹, Thibault Jeandet¹, Diane Kaplan¹, Christopher Llanwarne¹, Ruchi Munshi¹, Sam Novod¹, Nikelle Petrillo¹, David Roazen¹, Valentin Ruano-Rubio¹, Andrea Saltzman¹, Molly Schleicher¹, Jose Soto¹, Kathleen Tibbetts¹, Charlotte Tolonen¹, Gordon Wade¹, Michael E. Talkowski¹,², Benjamin M. Neale¹,², Mark J. Daly¹, Daniel G. MacArthur¹,²•Institutions (11)

Broad Institute¹, Harvard University², University of Helsinki³, Queen Mary University of London⁴, National Institutes of Health⁵, University of Washington⁶, Wellcome Trust Sanger Institute⁷, Vertex Pharmaceuticals⁸, Boston Children's Hospital⁹, Yale University¹⁰, European Bioinformatics Institute¹¹

30 Jan 2019-bioRxiv

TL;DR: Using an improved human mutation rate model, the authors classify human protein-coding genes along a spectrum representing intolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve gene discovery power for both common and rare diseases.

Abstract: Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes critical for an organism’s function will be depleted for such variants in natural populations, while non-essential genes will tolerate their accumulation. However, predicted loss-of-function (pLoF) variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes. Here, we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence pLoF variants in this cohort after filtering for sequencing and annotation artifacts. Using an improved model of human mutation, we classify human protein-coding genes along a spectrum representing intolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve gene discovery power for both common and rare diseases.

1,128 citations

Posted Content•DOI•

Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences

Alexander Rives¹, Siddharth Goyal², Joshua Meier², Demi Guo², Myle Ott², C. Lawrence Zitnick², Jerry Ma², Rob Fergus¹,²•Institutions (2)

New York University¹, Facebook²

29 Apr 2019-bioRxiv

TL;DR: This work uses unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.

Abstract: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In biology, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Learning the natural distribution of evolutionary protein sequence variation is a logical step toward predictive and generative modeling for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million sequences spanning evolutionary diversity. The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge. The learned representation space organizes sequences at multiple levels of biological granularity from the biochemical to proteomic levels. Learning recovers information about protein structure: secondary structure and residue-residue contacts can be extracted by linear projections from learned representations. With small amounts of labeled data, the ability to identify tertiary contacts is further improved. Learning on full sequence diversity rather than individual protein families increases recoverable information about secondary structure. We show the networks generalize by adapting them to variant activity prediction from sequences only, with results that are comparable to a state-of-the-art variant predictor that uses evolutionary and structurally derived features.

748 citations

Posted Content•DOI•

MRtrix3: A fast, flexible and open software framework for medical image processing and visualisation

Jacques-Donald Tournier¹, Robert E. Smith²,³, David Raffelt², Rami Tabbara², Thijs Dhollander²,³, Maximilian Pietsch¹, Daan Christiaens¹, Ben Jeurissen⁴, Chun-Hung Yeh²,³, Alan Connelly²,³•Institutions (4)

King's College London¹, Florey Institute of Neuroscience and Mental Health², University of Melbourne³, University of Antwerp⁴

15 Feb 2019-bioRxiv

TL;DR: This article gives a high-level overview of the features of the MRtrix3 framework and of the general-purpose image processing applications provided with the software.

Abstract: MRtrix3 is an open-source, cross-platform software package for medical image processing, analysis and visualization, with a particular emphasis on the investigation of the brain using diffusion MRI. It is implemented using a fast, modular and flexible general-purpose code framework for image data access and manipulation, enabling efficient development of new applications, whilst retaining high computational performance and a consistent command-line interface between applications. In this article, we provide a high-level overview of the features of the MRtrix3 framework and general-purpose image processing applications provided with the software.

728 citations

Posted Content•DOI•

Generalizing RNA velocity to transient cell states through dynamical modeling

Volker Bergen¹, Marius Lange¹, Stefan Peidli¹, F. Alexander Wolf, Fabian J. Theis¹•Institutions (1)

Technische Universität München¹

29 Oct 2019-bioRxiv

TL;DR: scVelo enables disentangling heterogeneous subpopulation kinetics with unprecedented resolution in hippocampal dentate gyrus neurogenesis and pancreatic endocrinogenesis, and the authors anticipate that it will greatly facilitate the study of lineage decisions, gene regulation, and pathway activity identification.

Abstract: The introduction of RNA velocity in single cells has opened up new ways of studying cellular differentiation. The originally proposed framework obtains velocities as the deviation of the observed ratio of spliced and unspliced mRNA from an inferred steady state. Errors in velocity estimates arise if the central assumptions of a common splicing rate and the observation of the full splicing dynamics with steady-state mRNA levels are violated. With scVelo (https://scvelo.org), we address these restrictions by solving the full transcriptional dynamics of splicing kinetics using a likelihood-based dynamical model. This generalizes RNA velocity to a wide variety of systems comprising transient cell states, which are common in development and in response to perturbations. We infer gene-specific rates of transcription, splicing and degradation, and recover the latent time of the underlying cellular processes. This latent time represents the cell’s internal clock and is based only on its transcriptional dynamics. Moreover, scVelo allows us to identify regimes of regulatory changes such as stages of cell fate commitment and, therein, systematically detects putative driver genes. We demonstrate that scVelo enables disentangling heterogeneous subpopulation kinetics with unprecedented resolution in hippocampal dentate gyrus neurogenesis and pancreatic endocrinogenesis. We anticipate that scVelo will greatly facilitate the study of lineage decisions, gene regulation, and pathway activity identification.
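
The steady-state framework that the abstract describes as the original approach can be sketched in a few lines of Python. This is an illustration only, not scVelo's likelihood-based dynamical model and not its API: the function names are hypothetical, the splicing rate is assumed to be normalized to 1, and the steady-state ratio gamma is assumed to be fitted by plain least squares through the origin.

```python
def fit_steady_state_ratio(unspliced, spliced):
    """Least-squares slope through the origin of unspliced vs. spliced counts.

    Under the steady-state assumption (and a splicing rate normalized to 1),
    unspliced is approximately gamma * spliced for cells at equilibrium.
    """
    numerator = sum(u * s for u, s in zip(unspliced, spliced))
    denominator = sum(s * s for s in spliced)
    if denominator == 0:
        raise ValueError("spliced counts are all zero; gamma is undefined")
    return numerator / denominator


def steady_state_velocity(unspliced, spliced, gamma):
    """Velocity per cell as the deviation from the steady-state line.

    Positive values suggest the gene is being induced in that cell,
    negative values suggest it is being repressed.
    """
    return [u - gamma * s for u, s in zip(unspliced, spliced)]


if __name__ == "__main__":
    # Hypothetical counts for one gene across three cells.
    gamma = fit_steady_state_ratio([0.5, 1.0, 2.0], [1.0, 2.0, 4.0])
    print(gamma, steady_state_velocity([1.0, 1.0, 1.0], [1.0, 2.0, 4.0], gamma))
```

The sketch also shows why the assumptions matter: if a gene never reaches steady state in the sampled cells, the fitted ratio is biased and every velocity derived from it inherits that error, which is the limitation the dynamical model is meant to address.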

712 citations

Posted Content•DOI•

Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program

Daniel Taliun¹, Daniel N. Harris², Michael D. Kessler², Jedidiah Carlson³ +191 more•Institutions (61)

06 Mar 2019-bioRxiv

TL;DR: The nearly complete catalog of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and non-coding sequence variants to phenotypic variation; the paper also describes resources and early insights from the sequence data.

Abstract: The Trans-Omics for Precision Medicine (TOPMed) program seeks to elucidate the genetic architecture and disease biology of heart, lung, blood, and sleep disorders, with the ultimate goal of improving diagnosis, treatment, and prevention. The initial phases of the program focus on whole genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here, we describe TOPMed goals and design as well as resources and early insights from the sequence data. The resources include a variant browser, a genotype imputation panel, and sharing of genomic and phenotypic data via dbGaP. In 53,581 TOPMed samples, >400 million single-nucleotide and insertion/deletion variants were detected by alignment with the reference genome. Additional novel variants are detectable through assembly of unmapped reads and customized analysis in highly variable loci. Among the >400 million variants detected, 97% have frequency

662 citations

Posted Content•DOI•

ModelTest-NG: a new and scalable tool for the selection of DNA and protein evolutionary models

Diego Darriba¹, David Posada², Alexey M. Kozlov¹, Alexandros Stamatakis¹,³, Benoit Morel¹, Tomas Flouri⁴•Institutions (4)

Heidelberg Institute for Theoretical Studies¹, University of Vigo², Karlsruhe Institute of Technology³, University College London⁴

22 Apr 2019-bioRxiv

TL;DR: ModelTest-NG is a re-implementation from scratch of jModelTest and ProtTest, two popular tools for selecting the best-fit nucleotide and amino acid substitution models, respectively, and introduces several new features, such as ascertainment bias correction, mixture and FreeRate models, or the automatic processing of partitioned datasets.

Abstract: ModelTest-NG is a re-implementation from scratch of jModelTest and ProtTest, two popular tools for selecting the best-fit nucleotide and amino acid substitution models, respectively. ModelTest-NG is one to two orders of magnitude faster than jModelTest and ProtTest but equally accurate, and introduces several new features, such as ascertainment bias correction, mixture and FreeRate models, or the automatic processing of partitioned datasets. ModelTest-NG is available under a GNU GPL3 license at https://github.com/ddarriba/modeltest.

465 citations

Posted Content•DOI•

KofamKOALA: KEGG ortholog assignment based on profile HMM and adaptive score threshold

Takuya Aramaki¹, Romain Blanc-Mathieu¹, Hisashi Endo¹, Koichi Ohkubo¹,², Minoru Kanehisa¹, Susumu Goto, Hiroyuki Ogata¹•Institutions (2)

Kyoto University¹, Hewlett-Packard²

08 Apr 2019-bioRxiv

TL;DR: KofamKOALA is a web server to assign KEGG Orthologs (KOs) to protein sequences by homology search against a database of profile hidden Markov models (KOfam) with pre-computed adaptive score thresholds.

Abstract: Summary: KofamKOALA is a web server to assign KEGG Orthologs (KOs) to protein sequences by homology search against a database of profile hidden Markov models (KOfam) with pre-computed adaptive score thresholds. KofamKOALA is faster than existing KO assignment tools, with accuracy comparable to the best-performing tools. Function annotation by KofamKOALA helps link genes to KEGG resources such as the KEGG pathway maps and facilitates molecular network reconstruction. Availability: KofamKOALA, KofamScan, and KOfam are freely available from https://www.genome.jp/tools/kofamkoala/. Contact: ogata@kuicr.kyoto-u.ac.jp

Posted Content•DOI•

The UCSC Xena platform for public and private cancer genomics data visualization and interpretation

Mary Goldman¹, Brian Craft¹, Mim Hastie, Kristupas Repečka², Fran McDade, Akhil Kamath³, Ayan Banerjee, Yunhai Luo⁴, Dave Rogers, Angela N. Brooks⁵, Jingchun Zhu¹, David Haussler¹•Institutions (5)

University of California, Santa Cruz¹, Vilnius University², Birla Institute of Technology and Science³, Stanford University⁴, University of California, Berkeley⁵

26 Sep 2019-bioRxiv

TL;DR: UCSC Xena is a web-based visualization tool for both public and private omics data, supported through the Xena Browser and multiple turn-key Xena Hubs. It allows researchers to view their own data securely, using private Xena Hubs, while simultaneously visualizing large public cancer genomics datasets, including TCGA and the GDC.

Abstract: UCSC Xena is a visual exploration resource for both public and private omics data, supported through the web-based Xena Browser and multiple turn-key Xena Hubs. This unique architecture allows researchers to view their own data securely, using private Xena Hubs, while simultaneously visualizing large public cancer genomics datasets, including TCGA and the GDC. Data integration occurs only within the Xena Browser, keeping private data private. Xena supports virtually any functional genomics data, including SNVs, INDELs, large structural variants, CNV, expression, DNA methylation, ATAC-seq signals, and phenotypic annotations. Browser features include the Visual Spreadsheet, survival analyses, powerful filtering and subgrouping, statistical analyses, genomic signatures, and bookmarks. Xena differentiates itself from other genomics tools, including its predecessor, the UCSC Cancer Genomics Browser, by its ability to easily and securely view public and private data, its high performance, its broad data type support, and many unique features.

Posted Content•DOI•

CUT&Tag for efficient epigenomic profiling of small samples and single cells

Hatice S. Kaya-Okur¹,², Steven J. Wu¹,³, Christine A. Codomo¹,², Erica S. Pledger¹, Terri D. Bryson¹,², Jorja G. Henikoff¹, Kami Ahmad¹, Steven Henikoff¹,²•Institutions (3)

Fred Hutchinson Cancer Research Center¹, Howard Hughes Medical Institute², University of Washington³

06 Mar 2019-bioRxiv

TL;DR: Cleavage Under Targets and Tagmentation (CUT&Tag), an enzyme-tethering strategy that provides efficient high-resolution sequencing libraries for profiling diverse chromatin components, is described and demonstrated by profiling histone modifications, RNA Polymerase II and transcription factors on low cell numbers and single cells.

Abstract: Many chromatin features play critical roles in regulating gene expression. A complete understanding of gene regulation will require the mapping of specific chromatin features in small samples of cells at high resolution. Here we describe Cleavage Under Targets and Tagmentation (CUT&Tag), an enzyme-tethering strategy that provides efficient high-resolution sequencing libraries for profiling diverse chromatin components. In CUT&Tag, a chromatin protein is bound in situ by a specific antibody, which then tethers a protein A-Tn5 transposase fusion protein. Activation of the transposase efficiently generates fragment libraries with high resolution and exceptionally low background. All steps from live cells to sequencing-ready libraries can be performed in a single tube on the benchtop or a microwell in a high-throughput pipeline, and the entire procedure can be performed in one day. We demonstrate the utility of CUT&Tag by profiling histone modifications, RNA Polymerase II and transcription factors on low cell numbers and single cells.

Posted Content•DOI•

PICRUSt2: An improved and extensible approach for metagenome inference

Gavin M. Douglas¹, Vincent J. Maffei², Jesse R. Zaneveld³, Svetlana N. Yurgel¹, James R. Brown⁴, Christopher M. Taylor², Curtis Huttenhower⁵, Morgan G. I. Langille¹•Institutions (5)

Dalhousie University¹, Louisiana State University², University of Washington³, GlaxoSmithKline⁴, Harvard University⁵

15 Jun 2019-bioRxiv

TL;DR: PICRUSt2 extends the capabilities of the original PICRUSt method to predict the approximate functional potential of a community based on marker gene sequencing profiles. Improvements include an expanded database of gene families and reference genomes, a new approach compatible with any OTU-picking or denoising algorithm, novel phenotype predictions, and novel fungal reference databases that enable predictions from 18S rRNA gene and internal transcribed spacer amplicon data.

Abstract: One major limitation of microbial community marker gene sequencing is that it does not provide direct information on the functional composition of sampled communities. Here, we present PICRUSt2, which expands the capabilities of the original PICRUSt method to predict approximate functional potential of a community based on marker gene sequencing profiles. This updated method and implementation includes several improvements over the previous algorithm: an expanded database of gene families and reference genomes, a new approach now compatible with any OTU-picking or denoising algorithm, novel phenotype predictions, and novel fungal reference databases that enable predictions from 18S rRNA gene and internal transcribed spacer amplicon data. Upon evaluation, PICRUSt2 was more accurate than PICRUSt1 and other current approaches and also more flexible to allow the addition of custom reference databases. Last, we demonstrate the utility of PICRUSt2 by identifying potential disease-associated microbial functional signatures based on 16S rRNA gene sequencing of ileal biopsies collected from a cohort of human subjects with inflammatory bowel disease. PICRUSt2 is freely available at: https://github.com/picrust/picrust2.

Posted Content•DOI•

Transcriptome assembly from long-read RNA-seq alignments with StringTie2

Sam Kovaka¹, Aleksey V. Zimin¹, Geo Pertea¹, Roham Razaghi¹, Steven L. Salzberg¹, Mihaela Pertea¹•Institutions (1)

Johns Hopkins University¹

08 Jul 2019-bioRxiv

TL;DR: StringTie2 is a reference-guided transcriptome assembler that works with both short and long reads and includes new computational methods to handle the high error rate of long-read sequencing technology, which previous assemblers could not tolerate.

Abstract: RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new computational methods to handle the high error rate of long-read sequencing technology, which previous assemblers could not tolerate. It also offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of assemblies. On 33 short-read datasets from humans and two plant species, StringTie2 is 47.3% more precise and 3.9% more sensitive than Scallop. On multiple long read datasets, StringTie2 on average correctly assembles 8.3 and 2.6 times as many transcripts as FLAIR and Traphlor, respectively, with substantially higher precision. StringTie2 is also faster and has a smaller memory footprint than all comparable tools.

Posted Content•DOI•

Genome wide meta-analysis identifies genomic relationships, novel loci, and pleiotropic mechanisms across eight psychiatric disorders

Phil Lee, Anttila¹, Hyejung Won², Yen-Chen Anne Feng, Jacob Rosenthal, Zhaozhong Zhu, Elliot M. Tucker-Drob³, Michel G. Nivard⁴, Andrew D. Grotzinger³, Danielle Posthuma, Wang Mm, Dongmei Yu⁵, E Stahl⁶, Raymond K. Walters¹, Richard Anney⁷, Laramie E. Duncan⁸, Sintia Iole Belangero⁹, Jurjen J. Luykx, Henry R. Kranzler¹⁰, Anna Keski-Rahkonen¹¹, Edwin H. Cook¹², G. Kirov⁷, Giovanni Coppola¹³, J Kaprio¹¹, Clement C. Zai¹⁴, Pieter J. Hoekstra¹⁵, Tobias Banaschewski, Luis Augusto Rohde¹⁶, Patrick F. Sullivan, Barbara Franke¹, Daly Mj², Cynthia M. Bulik¹⁷, Lewis Cm¹⁸, McIntosh Am⁷, Michael Conlon O'Donovan, Amanda B Zheutlin¹⁹, Andreassen Oa²⁰, Borglum Ad¹⁷, Breen G²¹, Howard J. Edenberg²², Fanous Ah²³, Faraone Sv²⁴, Gelernter J²⁵, Carol A. Mathews, Mattheisen M²⁶, Karen S. Mitchell²⁷, Michael C. Neale, John I. Nurnberger¹, Ripke S²⁸, Santangelo Sl⁵, Jeremiah M. Scharf²⁹, Stein Mb², Thornton Lm⁷, Walters Jt³⁰, Wray Nr¹³, Geschwind Dh¹, Benjamin M. Neale²⁷, Kenneth S. Kendler¹, Smoller Jw²⁹•Institutions (30)

Harvard University¹, University of North Carolina at Chapel Hill², University of Texas at Austin³, VU University Amsterdam⁴, Broad Institute⁵, Icahn School of Medicine at Mount Sinai⁶, Cardiff University⁷, Stanford University⁸, Federal University of São Paulo⁹, University of Pennsylvania¹⁰, University of Helsinki¹¹, University of Illinois at Urbana–Champaign¹², University of California, Los Angeles¹³, Centre for Addiction and Mental Health¹⁴, University Medical Center Groningen¹⁵, Universidade Federal do Rio Grande do Sul¹⁶, King's College London¹⁷, University of Edinburgh¹⁸, University of Oslo¹⁹, Lundbeck²⁰, Indiana University²¹, Veterans Health Administration²², State University of New York Upstate Medical University²³, Yale University²⁴, University of Florida²⁵, VA Boston Healthcare System²⁶, Virginia Commonwealth University²⁷, Maine Medical Center²⁸, University of California, Berkeley²⁹, University of Queensland³⁰

26 Jan 2019-bioRxiv

TL;DR: A meta-analysis of genome-wide studies of anorexia nervosa, attention-deficit/hyperactivity disorder, autism spectrum disorder, bipolar disorder, major depression, obsessive-compulsive disorder, schizophrenia, and Tourette syndrome revealed a meaningful structure within the eight disorders identifying three groups of inter-related disorders.

Abstract: Genetic influences on psychiatric disorders transcend diagnostic boundaries, suggesting substantial pleiotropy of contributing loci. However, the nature and mechanisms of these pleiotropic effects remain unclear. We performed a meta-analysis of 232,964 cases and 494,162 controls from genome-wide studies of anorexia nervosa, attention-deficit/hyperactivity disorder, autism spectrum disorder, bipolar disorder, major depression, obsessive-compulsive disorder, schizophrenia, and Tourette syndrome. Genetic correlation analyses revealed a meaningful structure within the eight disorders identifying three groups of inter-related disorders. We detected 109 loci associated with at least two psychiatric disorders, including 23 loci with pleiotropic effects on four or more disorders and 11 loci with antagonistic effects on multiple disorders. The pleiotropic loci are located within genes that show heightened expression in the brain throughout the lifespan, beginning in the second trimester prenatally, and play prominent roles in a suite of neurodevelopmental processes. These findings have important implications for psychiatric nosology, drug development, and risk prediction.

Posted Content•DOI•

Unified rational protein engineering with sequence-only deep representation learning

Ethan C. Alley¹, Grigory Khimulya, Surojit Biswas¹, Mohammed AlQuraishi², George M. Church², George M. Church¹•Institutions (2)

Wyss Institute for Biologically Inspired Engineering¹, Harvard University²

26 Mar 2019-bioRxiv

TL;DR: This work applies deep learning to unlabelled amino acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily, and biophysically grounded.

Abstract: Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabelled amino acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily, and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach reaches near state-of-the-art or superior performance predicting stability of natural and de novo designed proteins as well as quantitative function of molecularly diverse mutants. UniRep further enables two orders of magnitude cost savings in a protein engineering task. We conclude UniRep is a versatile protein summary that can be applied across protein engineering informatics.

Posted Content•DOI•

Evaluating Protein Transfer Learning with TAPE

Roshan Rao¹, Nicholas Bhattacharya¹, Neil Thomas¹, Yan Duan, Xi Chen, John Canny¹, John Canny², Pieter Abbeel¹, Yun S. Song¹•Institutions (2)

University of California, Berkeley¹, Google²

20 Jun 2019-bioRxiv

TL;DR: TAPE is a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology, with training, validation, and test splits curated so that each task tests biologically relevant generalization that transfers to real-life scenarios.

Abstract: Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.

Posted Content•DOI•

HH-suite3 for fast remote homology detection and deep protein annotation

Martin Steinegger¹, Markus Meier¹, Milot Mirdita¹, Harald Voehringer², Stephan J. Haunsberger³, Johannes Soeding¹•Institutions (3)

Max Planck Society¹, European Bioinformatics Institute², Royal College of Surgeons in Ireland³

25 Feb 2019-bioRxiv

TL;DR: A single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment is developed and the added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.

Abstract: Background: HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous sequences. Results: We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. This accelerated HHsearch by a factor 4 and HHblits by a factor 2 over the previous version 2.0.16. HHblits3 is ~10x faster than PSI-BLAST and ~20x faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over servers in a cluster using OpenMP and message passing interface (MPI). The free, open-source, GNU GPL(v3)-licensed software is available at https://github.com/soedinglab/hh-suite. Conclusion: The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.
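As a rough illustration of the dynamic program that the abstract refers to, the following is a minimal sketch of the Viterbi algorithm for a small discrete HMM, written in plain Python. It is not HH-suite's code: HH-suite aligns pairs of profile HMMs and uses SIMD vectorization, neither of which is attempted here. The two states, the symbols, and all probabilities are hypothetical values chosen only for the example.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for a discrete HMM, working in log space."""
    # score[s] = best log-probability of any state path ending in s so far
    score = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this position
            prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path

def logs(table):
    """Convert a (possibly nested) table of probabilities to log-probabilities."""
    return {k: logs(v) if isinstance(v, dict) else math.log(v) for k, v in table.items()}

# Hypothetical toy model: two states, two symbols.
states = ["A", "B"]
log_start = logs({"A": 0.6, "B": 0.4})
log_trans = logs({"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}})
log_emit = logs({"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}})

best_logp, best_path = viterbi("xxyy", states, log_start, log_trans, log_emit)
print(best_path)
```

The inner loop over states is the part that a vectorized implementation would process several cells at a time; this sketch evaluates one cell per iteration for clarity.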

Posted Content•DOI•

tRNAscan-SE 2.0: Improved Detection and Functional Classification of Transfer RNA Genes

Patricia P. Chan¹, Brian Y. Lin¹, Allysia J. Mak¹, Todd M. Lowe¹•Institutions (1)

University of California, Santa Cruz¹

30 Apr 2019-bioRxiv

TL;DR: Overall, tRNA detection sensitivity and specificity are improved for all isotypes, particularly those utilizing specialized models for selenocysteine and the three subtypes of tRNA genes encoding a CAU anticodon.

Abstract: tRNAscan-SE has been widely used for whole-genome transfer RNA gene prediction for nearly two decades. With the increased availability of new genomes, a vastly larger training set has enabled creation of nearly one hundred specialized isotype-specific models, greatly improving tRNAscan-SE’s ability to identify and classify both typical and atypical tRNAs. We employ a new multi-model annotation strategy where predicted tRNAs are scored against a full set of isotype-specific covariance models. A post-filtering feature also better identifies tRNA-derived SINEs that are abundant in many eukaryotic genomes, and provides a “high confidence” tRNA gene set which improves upon prior pseudogene prediction. These new enhancements of tRNAscan-SE will provide researchers more accurate detection and more comprehensive annotation for tRNA genes.

Posted Content•DOI•

Calling Somatic SNVs and Indels with Mutect2

David Benjamin¹, Takuto Sato¹, Kristian Cibulskis¹, Gad Getz¹, Chip Stewart¹, Lee Lichtenstein¹•Institutions (1)

Broad Institute¹

02 Dec 2019-bioRxiv

TL;DR: Mutect2 is a somatic variant caller that uses local assembly and realignment to detect SNVs and indels, and is based on several probabilistic models for genotyping and filtering that work well with and without a matched normal sample and for all sequencing depths.

Abstract: Mutect2 is a somatic variant caller that uses local assembly and realignment to detect SNVs and indels. Assembly implies whole haplotypes and read pairs, rather than single bases, as the atomic units of biological variation and sequencing evidence, improving variant calling. Beyond local assembly and alignment, Mutect2 is based on several probabilistic models for genotyping and filtering that work well with and without a matched normal sample and for all sequencing depths.

Posted Content•DOI•

A Proteomic Atlas of Senescence-Associated Secretomes for Aging Biomarker Development

Nathan Basisty¹, Abhijit Kale¹, Ok-Hee Jeon¹, Chisaka Kuehnemann¹, Therese Payne¹, Chirag Rao¹, Anja Holtz¹, Samah Shah¹, Luigi Ferrucci², Judith Campisi¹, Judith Campisi³, Birgit Schilling¹•Institutions (3)

Buck Institute for Research on Aging¹, National Institutes of Health², Lawrence Berkeley National Laboratory³

21 Aug 2019-bioRxiv

TL;DR: The analyses identify several candidate biomarkers of cellular senescence that overlap with aging markers in human plasma, including GDF15, STC1 and SERPINs, which significantly correlated with age in plasma from a human cohort, the Baltimore Longitudinal Study of Aging.

Abstract: The senescence-associated secretory phenotype (SASP) has recently emerged as both a driver of, and promising therapeutic target for, multiple age-related conditions, ranging from neurodegeneration to cancer. The complexity of the SASP, typically monitored by a few dozen secreted proteins, has been greatly underappreciated, and a small set of factors cannot explain the diverse phenotypes it produces in vivo. Here, we present ‘SASP Atlas’, a comprehensive proteomic database of soluble and exosome SASP factors originating from multiple senescence inducers and cell types. Each profile consists of hundreds of largely distinct proteins, but also includes a subset of proteins elevated in all SASPs. Based on our analyses, we propose several candidate biomarkers of cellular senescence, including GDF15, STC1 and SERPINs. This resource will facilitate identification of proteins that drive specific senescence-associated phenotypes and catalog potential senescence biomarkers to assess the burden, originating stimulus and tissue of senescent cells in vivo.

Posted Content•DOI•

An integrated brain-machine interface platform with thousands of channels

Elon R. Musk, Neuralink

18 Jul 2019-bioRxiv

TL;DR: Neuralink’s approach to BMI has unprecedented packaging density and scalability in a clinically relevant package and has achieved a spiking yield of up to 85.5 % in chronically implanted electrodes.

Abstract: Brain-machine interfaces (BMIs) hold promise for the restoration of sensory and motor function and the treatment of neurological disorders, but clinical BMIs have not yet been widely adopted, in part because modest channel counts have limited their potential. In this white paper, we describe Neuralink’s first steps toward a scalable high-bandwidth BMI system. We have built arrays of small and flexible electrode “threads”, with as many as 3,072 electrodes per array distributed across 96 threads. We have also built a neurosurgical robot capable of inserting six threads (192 electrodes) per minute. Each thread can be individually inserted into the brain with micron precision for avoidance of surface vasculature and targeting specific brain regions. The electrode array is packaged into a small implantable device that contains custom chips for low-power on-board amplification and digitization: the package for 3,072 channels occupies less than (23 × 18.5 × 2) mm³. A single USB-C cable provides full-bandwidth data streaming from the device, recording from all channels simultaneously. This system has achieved a spiking yield of up to 85.5 % in chronically implanted electrodes. Neuralink’s approach to BMI has unprecedented packaging density and scalability in a clinically relevant package.

Posted Content•DOI•

Cooler: scalable storage for Hi-C data and other genomically-labeled arrays

Nezar Abdennur¹, Leonid A. Mirny¹•Institutions (1)

Massachusetts Institute of Technology¹

22 Feb 2019-bioRxiv

TL;DR: Cooler is a file format based on a sparse data model that supports genomically-labeled matrices at any resolution and has the flexibility to accommodate various descriptions of the data axes, resolutions, data density patterns, and metadata.

Abstract: Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate development of scalable algorithms for data analysis. We developed a file format called cooler, based on a sparse data model, that can support genomically-labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns, and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium. Cooler is cross-platform, BSD-licensed, and can be installed from the Python Package Index or the bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler.
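To make the idea of sparse storage for a two-dimensional genomic map concrete, here is a minimal sketch in plain Python. It does not use the cooler library or reproduce its HDF5 layout; it only assumes the general idea of a bin table plus a list of nonzero (bin, bin, count) entries kept in the upper triangle of a symmetric matrix. The function names, chromosome sizes, and contacts are hypothetical.

```python
from collections import defaultdict

def make_bins(chrom_sizes, binsize):
    """Tile each chromosome into fixed-width bins: a list of (chrom, start, end)."""
    bins = []
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, binsize):
            bins.append((chrom, start, min(start + binsize, size)))
    return bins

def to_sparse_pixels(contacts, bins, binsize):
    """Aggregate (chrom1, pos1, chrom2, pos2) contacts into sorted upper-triangle pixels."""
    # Index of the first bin of each chromosome within the bin table
    offsets = {chrom: i for i, (chrom, start, _end) in enumerate(bins) if start == 0}
    counts = defaultdict(int)
    for chrom1, pos1, chrom2, pos2 in contacts:
        bin1 = offsets[chrom1] + pos1 // binsize
        bin2 = offsets[chrom2] + pos2 // binsize
        if bin1 > bin2:
            # The map is symmetric, so store only the upper triangle
            bin1, bin2 = bin2, bin1
        counts[(bin1, bin2)] += 1
    return sorted((b1, b2, n) for (b1, b2), n in counts.items())

# Hypothetical toy genome and contact list
chrom_sizes = {"chr1": 250, "chr2": 120}
bins = make_bins(chrom_sizes, binsize=100)
contacts = [
    ("chr1", 10, "chr1", 220),
    ("chr1", 230, "chr1", 15),
    ("chr2", 5, "chr1", 150),
    ("chr2", 50, "chr2", 60),
]
pixels = to_sparse_pixels(contacts, bins, binsize=100)
print(len(bins) ** 2, "dense cells vs", len(pixels), "stored pixels")
```

In this toy case the dense matrix would hold 25 cells while only 3 pixels are stored; the gap widens as the bin size shrinks, which is the cost the abstract describes for dense storage at high resolution.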

Posted Content•DOI•

Extracting Biological Insights from the Project Achilles Genome-Scale CRISPR Screens in Cancer Cell Lines

Joshua M. Dempster¹, Jordan Rossen¹, Mariya Kazachkova¹, Joshua Pan¹, Guillaume Kugener¹, David E. Root¹, Aviad Tsherniak¹•Institutions (1)

Broad Institute¹

31 Jul 2019-bioRxiv

TL;DR: The current and projected Achilles processing pipeline is presented, including recent improvements and the analyses that led to their adoption, spanning data releases from early 2018 to the first quarter of 2020.

Abstract: One of the main goals of the Cancer Dependency Map project is to systematically identify cancer vulnerabilities across cancer types to accelerate therapeutic discovery. Project Achilles serves this goal through the in vitro study of genetic dependencies in cancer cell lines using CRISPR/Cas9 (and, previously, RNAi) loss-of-function screens. The project is committed to the public release of its experimental results quarterly on the DepMap Portal (https://depmap.org), on a pre-publication basis. As the experiment has evolved, data processing procedures have changed. Here we present the current and projected Achilles processing pipeline, including recent improvements and the analyses that led us to adopt them, spanning data releases from early 2018 to the first quarter of 2020. Notable changes include quality control metrics, calculation of probabilities of dependency, and correction for screen quality and other biases. Developing and improving methods for extracting biologically-meaningful scores from Achilles experiments is an ongoing process, and we will continue to evaluate and revise data processing procedures to produce the best results.

Posted Content•DOI•

Lipid droplet accumulating microglia represent a dysfunctional and pro-inflammatory state in the aging brain

Julia Marschallinger¹, Julia Marschallinger², Tal Iram¹, Macy E. Zardeneta¹, Song E. Lee¹, Benoit Lehallier¹, Michael S. Haney¹, John V. Pluvinage¹, Vidhu Mathur¹, Oliver Hahn¹, David W. Morgens¹, Justin Kim¹, Julia Tevini², Thomas K. Felder², Heimo Wolinski³, Carolyn R. Bertozzi¹, Michael C. Bassik¹, Michael C. Bassik², Ludwig Aigner², Tony Wyss-Coray•Institutions (3)

Stanford University¹, Paracelsus Private Medical University of Salzburg², University of Graz³

06 Aug 2019-bioRxiv

TL;DR: A striking buildup of lipid droplets in microglia with aging in mouse and human brains is reported and it is proposed that LAM contribute to age-related and genetic forms of neurodegeneration.

Abstract: Microglia become progressively activated and seemingly dysfunctional with age, and genetic studies have linked these cells to the pathogenesis of a growing number of neurodegenerative diseases. Here we report a striking buildup of lipid droplets in microglia with aging in mouse and human brains. These cells, which we call lipid droplet-accumulating microglia (LAM), are defective in phagocytosis, produce high levels of reactive oxygen species, and secrete pro-inflammatory cytokines. RNA sequencing analysis of LAM revealed a transcriptional profile driven by innate inflammation distinct from previously reported microglial states. An unbiased CRISPR-Cas9 screen identified genetic modifiers of lipid droplet formation; surprisingly, variants of several of these genes, including progranulin, are causes of autosomal dominant forms of human neurodegenerative diseases. We thus propose that LAM contribute to age-related and genetic forms of neurodegeneration.

Posted Content•DOI•

A global synthesis reveals biodiversity-mediated benefits for crop production

Matteo Dainese¹, Emily A. Martin¹, Marcelo A. Aizen², Matthias Albrecht, Ignasi Bartomeus³, Riccardo Bommarco⁴, Luísa G. Carvalheiro⁵, Luísa G. Carvalheiro⁶, Rebecca Chaplin-Kramer⁷, Vesna Gagic⁸, Lucas Alejandro Garibaldi⁹, Jaboury Ghazoul¹⁰, Heather Grab¹¹, Mattias Jonsson⁴, Daniel S. Karp¹², Christina M. Kennedy¹³, David Kleijn¹⁴, Claire Kremen¹⁵, Douglas A. Landis¹⁶, Deborah K. Letourneau¹⁷, Lorenzo Marini¹⁸, Katja Poveda¹¹, Romina Rader¹⁹, Henrik G. Smith²⁰, Teja Tscharntke²¹, Georg K.S. Andersson²⁰, Isabelle Badenhausser²², Isabelle Badenhausser²³, Svenja Baensch²¹, Antonio Diego M. Bezerra²⁴, Felix J.J.A. Bianchi¹⁴, Virginie Boreux¹⁰, Vincent Bretagnolle²², Berta Caballero-López, Pablo Cavigliasso²⁵, Aleksandar Ćetković²⁶, Natacha P. Chacoff²⁷, Alice Classen¹, Sarah Cusser²⁸, Felipe D. da Silva e Silva²⁹, G. Arjen de Groot¹⁴, Jan H. Dudenhöffer³⁰, Johan Ekroos²⁰, Thijs P.M. Fijen¹⁴, Pierre Franck²³, Breno Magalhães Freitas²⁴, Michael P.D. Garratt³¹, Claudio Gratton³², Juliana Hipólito⁹, Andrea Holzschuh¹, Lauren Hunt³³, Aaron L. Iverson¹¹, Shalene Jha³⁴, Tamar Keasar³⁵, Tania N. Kim³⁶, Miriam Kishinevsky³⁵, Björn K. Klatt²¹, Björn K. Klatt²⁰, Alexandra-Maria Klein³⁷, Kristin M. Krewenka³⁸, Smitha Krishnan¹⁰, Ashley E. Larsen³⁹, Claire Lavigne²³, Heidi Liere⁴⁰, Bea Maas⁴¹, Rachel E. Mallinger⁴², Eliana Martinez Pachon, Alejandra Martínez-Salinas⁴³, Timothy D. Meehan⁴⁴, Matthew G. E. Mitchell¹⁵, Gonzalo Alberto Roman Molina⁴⁵, Maike Nesper¹⁰, Lovisa Nilsson²⁰, Megan E. O'Rourke⁴⁶, Marcell K. Peters¹, Milan Plećaš²⁶, Simon G. Potts³¹, Davi de L. Ramos²⁹, Jay A. Rosenheim¹⁷, Maj Rundlöf²⁰, Adrien Rusch⁴⁷, Agustín Sáez², Jeroen Scheper¹⁴, Matthias Schleuning, Julia Schmack⁴⁸, Amber R. Sciligo¹⁷, Colleen L. Seymour, Dara A. Stanley⁴⁹, Rebecca Stewart²⁰, Jane C. Stout⁵⁰, Louis Sutter, Mayura B. Takada⁵¹, Hisatomo Taki, Giovanni Tamburini⁴, Matthias Tschumi, Blandina Felipe Viana⁵², Catrin Westphal²¹, Bryony K. Willcox¹⁹, Stephen D. Wratten⁵³, Akira Yoshioka⁵⁴, Carlos Zaragoza-Trello³, Wei Zhang⁵⁵, Yi Zou⁵⁶, Ingolf Steffan-Dewenter¹•Institutions (56)

University of Würzburg¹, National University of Comahue², Spanish National Research Council³, Swedish University of Agricultural Sciences⁴, University of Lisbon⁵, Universidade Federal de Goiás⁶, Stanford University⁷, Commonwealth Scientific and Industrial Research Organisation⁸, National University of Río Negro⁹, ETH Zurich¹⁰, Cornell University¹¹, University of California, Davis¹², The Nature Conservancy¹³, Wageningen University and Research Centre¹⁴, University of British Columbia¹⁵, Great Lakes Bioenergy Research Center¹⁶, University of California, Berkeley¹⁷, University of Padua¹⁸, University of New England (United States)¹⁹, Lund University²⁰, University of Göttingen²¹, University of La Rochelle²², Institut national de la recherche agronomique²³, Federal University of Ceará²⁴, Concordia University Wisconsin²⁵, University of Belgrade²⁶, National University of Tucumán²⁷, Michigan State University²⁸, University of Brasília²⁹, University of Greenwich³⁰, University of Reading³¹, University of Wisconsin-Madison³², Boise State University³³, University of Texas at Austin³⁴, University of Haifa³⁵, Kansas State University³⁶, University of Freiburg³⁷, University of Hamburg³⁸, University of California, Santa Barbara³⁹, Seattle University⁴⁰, University of Vienna⁴¹, University of Florida⁴², Centro Agronómico Tropical de Investigación y Enseñanza⁴³, National Audubon Society⁴⁴, University of Buenos Aires⁴⁵, Virginia Tech⁴⁶, University of Bordeaux⁴⁷, University of Auckland⁴⁸, University College Dublin⁴⁹, Trinity College, Dublin⁵⁰, University of Tokyo⁵¹, Federal University of Bahia⁵², Lincoln University (Pennsylvania)⁵³, National Institute for Environmental Studies⁵⁴, International Food Policy Research Institute⁵⁵, Xi'an Jiaotong-Liverpool University⁵⁶

20 Feb 2019-bioRxiv

TL;DR: Using a global database from 89 crop systems, the relative importance of abundance and species richness for pollination, biological pest control and final yields in the context of on-going land-use change is partitioned.

Abstract: Human land use threatens global biodiversity and compromises multiple ecosystem functions critical to food production. Whether crop yield-related ecosystem services can be maintained by few abundant species or rely on high richness remains unclear. Using a global database from 89 crop systems, we partition the relative importance of abundance and species richness for pollination, biological pest control and final yields in the context of on-going land-use change. Pollinator and enemy richness directly supported ecosystem services independent of abundance. Up to 50% of the negative effects of landscape simplification on ecosystem services was due to richness losses of service-providing organisms, with negative consequences for crop yields. Maintaining the biodiversity of ecosystem service providers is therefore vital to sustain the flow of key agroecosystem benefits to society.

Posted Content•DOI•

RepeatModeler2: automated genomic discovery of transposable element families

Jullien M. Flynn¹, Robert Hubley², Clément Goubert¹, Jeb Rosen², Andrew G. Clark¹, Cédric Feschotte¹, Arian F.A. Smit²•Institutions (2)

Cornell University¹, Institute for Systems Biology²

26 Nov 2019-bioRxiv

TL;DR: This new program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery, and incorporates a module for structural discovery of complete LTR retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity.

Abstract: The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a new pipeline that greatly facilitates this process. This new program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete LTR retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately three times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. The program had an extremely low false positive rate when applied to simulated genomes devoid of TEs. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, https://github.com/Dfam-consortium/TETools).

Posted Content•DOI•

Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion

Ansuman T. Satpathy¹, Jeffrey M. Granja¹, Kathryn E. Yost¹, Yanyan Qi¹, Francesca Meschi, Geoffrey P. McDermott, Brett N. Olsen, Maxwell R. Mumbach¹, Sarah E. Pierce¹, M. Ryan Corces¹, Preyas Shah, Jason C. Bell, Darisha Jhutty, Corey M. Nemec, Jean Wang, Li Wang, Yifeng Yin, Paul G. Giresi, Anne Lynn S. Chang¹, Grace X.Y. Zheng, William J. Greenleaf¹, Howard Y. Chang•Institutions (1)

Stanford University¹

18 Apr 2019-bioRxiv

TL;DR: It is anticipated that droplet-based single-cell chromatin accessibility will provide a broadly applicable means of identifying regulatory factors and elements that underlie cell type and function.

Abstract: Understanding complex tissues requires single-cell deconstruction of gene regulation with precision and scale. Here we present a massively parallel droplet-based platform for mapping transposase-accessible chromatin in tens of thousands of single cells per sample (scATAC-seq). We obtain and analyze chromatin profiles of over 200,000 single cells in two primary human systems. In blood, scATAC-seq allows marker-free identification of cell type-specific cis- and trans-regulatory elements, mapping of disease-associated enhancer activity, and reconstruction of trajectories of differentiation from progenitors to diverse and rare immune cell types. In basal cell carcinoma, scATAC-seq reveals regulatory landscapes of malignant, stromal, and immune cell types in the tumor microenvironment. Moreover, scATAC-seq of serial tumor biopsies before and after PD-1 blockade allows identification of chromatin regulators and differentiation trajectories of therapy-responsive intratumoral T cell subsets, revealing a shared regulatory program driving CD8+ T cell exhaustion and CD4+ T follicular helper cell development. We anticipate that droplet-based single-cell chromatin accessibility will provide a broadly applicable means of identifying regulatory factors and elements that underlie cell type and function.

Posted Content•DOI•

Genetics of 38 blood and urine biomarkers in the UK Biobank

Nasa Sinnott-Armstrong¹, Yosuke Tanigawa¹, David Amar¹, Nina Mars², Matthew Aguirre¹, Guhan Venkataraman¹, Michael Wainberg¹, Hanna Ollila¹, Hanna Ollila², Hanna Ollila³, James P. Pirruccello⁴, James P. Pirruccello³, Junyang Qian¹, Anna Shcherbina¹, Fatima Rodriguez¹, Themistocles L. Assimes⁵, Themistocles L. Assimes¹, Vineeta Agarwala¹, Robert Tibshirani¹, Trevor Hastie¹, Samuli Ripatti², Samuli Ripatti⁴, Jonathan K. Pritchard⁶, Jonathan K. Pritchard¹, Mark J. Daly³, Mark J. Daly², Mark J. Daly⁴, Manuel A. Rivas¹, FinnGen¹•Institutions (6)

Stanford University¹, University of Helsinki², Harvard University³, Broad Institute⁴, VA Palo Alto Healthcare System⁵, Howard Hughes Medical Institute⁶

05 Jun 2019-bioRxiv

TL;DR: This study evaluates the genetic basis of 38 blood and urine laboratory tests and shows which tissues contribute to biomarker function, the causal influences of the biomarkers, and how these results can be used to predict disease.

Abstract: Clinical laboratory tests are a critical component of the continuum of care and provide a means for rapid diagnosis and monitoring of chronic disease. In this study, we systematically evaluated the genetic basis of 38 blood and urine laboratory tests measured in 358,072 participants in the UK Biobank and identified 1,857 independent loci associated with at least one laboratory test, including 488 large-effect protein truncating, missense, and copy-number variants. We tested these loci for enrichment in specific single cell types in kidney, liver, and pancreas relevant to disease aetiology. We then causally linked the biomarkers to medically relevant phenotypes through genetic correlation and Mendelian randomization. Finally, we developed polygenic risk scores (PRS) for each biomarker and built multi-PRS models using all 38 PRSs simultaneously. We found substantially improved prediction of incidence in FinnGen (n=135,500) with the multi-PRS relative to single-disease PRSs for renal failure, myocardial infarction, liver fat percentage, and alcoholic cirrhosis. Together, our results show the genetic basis of these biomarkers, which tissues contribute to the biomarker function, the causal influences of the biomarkers, and how we can use this to predict disease.
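For readers unfamiliar with polygenic risk scores, the following minimal Python sketch shows the basic arithmetic: a PRS is commonly computed as a weighted sum of allele dosages, and several per-biomarker scores can then be combined linearly. The abstract does not specify how the authors fit or combine their scores, so the linear combination, the variant names, the effect sizes, and the biomarker weights below are all hypothetical and serve only as an illustration.

```python
def polygenic_score(dosages, effects):
    """Weighted sum of allele dosages (0, 1, or 2 copies) over the variants in `effects`."""
    # Variants missing from the genotype are treated as dosage 0 (an assumption of this sketch)
    return sum(effect * dosages.get(variant, 0) for variant, effect in effects.items())

def multi_prs(dosages, biomarker_effects, biomarker_weights):
    """Linear combination of per-biomarker polygenic scores."""
    return sum(
        weight * polygenic_score(dosages, biomarker_effects[name])
        for name, weight in biomarker_weights.items()
    )

# Hypothetical genotype for one individual
dosages = {"variant_1": 2, "variant_2": 1, "variant_3": 0}

# Hypothetical per-variant effect sizes for two biomarkers
biomarker_effects = {
    "biomarker_a": {"variant_1": 0.5, "variant_2": -0.25},
    "biomarker_b": {"variant_2": 1.0, "variant_3": 2.0},
}

# Hypothetical weights of each biomarker score in the combined disease model
biomarker_weights = {"biomarker_a": 2.0, "biomarker_b": -1.0}

score_a = polygenic_score(dosages, biomarker_effects["biomarker_a"])
score_b = polygenic_score(dosages, biomarker_effects["biomarker_b"])
combined = multi_prs(dosages, biomarker_effects, biomarker_weights)
print(score_a, score_b, combined)
```

In practice the effect sizes would come from association statistics and the combination weights from a model fitted against disease outcomes; this sketch fixes both by hand.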
