bioRxiv
About: bioRxiv is an academic journal that publishes mainly in the areas of Population and Gene. Over its lifetime, it has published 154,314 publications, which have received 439,493 citations. It is also known as "bioRxiv.org: the preprint server for biology" and "bioRxivorg".
Topics: Population, Gene, Genome, Chromatin, RNA
Papers
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-Seq data, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data. DESeq2 uses shrinkage estimation for dispersions and fold changes to improve stability and interpretability of the estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression and facilitates downstream tasks such as gene ranking and visualization. DESeq2 is available as an R/Bioconductor package.
2,229 citations
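The shrinkage idea in the DESeq2 abstract above can be illustrated with a toy calculation. This is a minimal sketch, assuming a zero-centered normal prior with a known standard deviation and a normally distributed estimate; it is not DESeq2's actual procedure, and the function name `shrink_lfc` and its parameters are hypothetical.

```python
def shrink_lfc(lfc, se, prior_sd):
    """Posterior mean of a log fold change under a zero-centered normal prior.

    Toy normal-normal model (an assumption for illustration only):
    the noisier the estimate (large se), the more it is pulled toward zero.
    """
    weight = prior_sd**2 / (prior_sd**2 + se**2)
    return weight * lfc


# Same raw estimate, different precision.
noisy = shrink_lfc(lfc=4.0, se=2.0, prior_sd=1.0)    # weight = 1 / (1 + 4)
precise = shrink_lfc(lfc=4.0, se=0.5, prior_sd=1.0)  # weight = 1 / (1 + 0.25)
```

In this sketch the imprecise estimate is shrunk far more strongly than the precise one, which is why shrunken fold changes are more useful for ranking genes by effect strength.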
Harvard University, Broad Institute, Cardiff University, Icahn School of Medicine at Mount Sinai, University of Michigan, University of Cambridge, Karolinska Institutet, University of Eastern Finland, University of Oxford, Cedars-Sinai Medical Center, University of Ottawa, University of Helsinki, University of Pennsylvania, University of North Carolina at Chapel Hill, University of Mississippi Medical Center
TL;DR: The aggregation and analysis of high-quality exome (protein-coding region) sequence data for 60,706 individuals of diverse ethnicities generated as part of the Exome Aggregation Consortium (ExAC) provides direct evidence for the presence of widespread mutational recurrence.
Abstract: Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) sequence data for 60,706 individuals of diverse ethnicities. The resulting catalogue of human genetic diversity has unprecedented resolution, with an average of one variant every eight bases of coding sequence and the presence of widespread mutational recurrence. The deep catalogue of variation provided by the Exome Aggregation Consortium (ExAC) can be used to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; we identify 3,230 genes with near-complete depletion of truncating variants, 79% of which have no currently established human disease phenotype. Finally, we show that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human knockout variants in protein-coding genes.
1,552 citations
TL;DR: Using an improved human mutation rate model, this work classifies human protein-coding genes along a spectrum representing intolerance to inactivation, validates this classification using data from model organisms and engineered human cells, and shows that it can be used to improve gene discovery power for both common and rare diseases.
Abstract: Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes critical for an organism’s function will be depleted for such variants in natural populations, while non-essential genes will tolerate their accumulation. However, predicted loss-of-function (pLoF) variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes. Here, we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence pLoF variants in this cohort after filtering for sequencing and annotation artifacts. Using an improved model of human mutation, we classify human protein-coding genes along a spectrum representing intolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve gene discovery power for both common and rare diseases.
1,037 citations
TL;DR: This work shows that hundreds of thousands of permutations can be made in a few minutes, which yields very accurate p-values. This in turn allows standard FDR correction procedures to be applied, which are more accurate than the ones currently used.
Abstract: Gene set enrichment analysis is a widely used tool for analyzing gene expression data. However, current implementations are slow because a large number of samples is required for the analysis to have good statistical power. In this paper we present a novel algorithm that efficiently reuses one sample multiple times and thus speeds up the analysis. We show that it is possible to make hundreds of thousands of permutations in a few minutes, which leads to very accurate p-values. This, in turn, allows applying standard FDR correction procedures, which are more accurate than the ones currently used. The method is implemented in the form of an R package and is freely available at https://github.com/ctlab/fgsea.
788 citations
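The link between permutation count, p-value accuracy, and FDR correction described in the abstract above can be sketched as follows. This is a hedged illustration, not the fgsea algorithm: it uses a plain mean-score statistic instead of the GSEA enrichment score, it does not reuse samples, and the function names and toy data are assumptions made for the example.

```python
import random


def permutation_p_value(scores, gene_set, n_perm=1000, seed=0):
    """Empirical p-value that a gene set's mean score is unusually high.

    The smallest attainable p-value is 1 / (n_perm + 1), so resolving
    small p-values requires many permutations.
    """
    rng = random.Random(seed)
    genes = list(scores)
    k = len(gene_set)
    observed = sum(scores[g] for g in gene_set) / k
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, k)  # random gene set of the same size
        if sum(scores[g] for g in sample) / k >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data (hypothetical): 100 genes scored 0..99, gene set = top 5 genes.
toy_scores = {f"g{i}": float(i) for i in range(100)}
toy_set = [f"g{i}" for i in range(95, 100)]
p = permutation_p_value(toy_scores, toy_set)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

With only 1,000 permutations the p-value cannot fall below 1/1001, which is too coarse once it is multiplied up by a correction across thousands of gene sets. That is the motivation the abstract gives for making far more permutations cheaply.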
Moscow State University, Leiden University Medical Center, Loyola University Chicago, University of North Carolina at Chapel Hill, Utrecht University, Charité, Texas A&M University–Texarkana, University of Iowa, University of Hong Kong, Spanish National Research Council, University of Giessen
TL;DR: The Coronavirus Study Group (CSG) of the International Committee on Taxonomy of Viruses assessed the novelty of the human pathogen tentatively named 2019-nCoV, formally recognizes this virus as a sister to severe acute respiratory syndrome coronaviruses (SARS-CoVs), and designates it as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
Abstract: The present outbreak of lower respiratory tract infections, including respiratory distress syndrome, is the third spillover, in only two decades, of an animal coronavirus to humans resulting in a major epidemic. Here, the Coronavirus Study Group (CSG) of the International Committee on Taxonomy of Viruses, which is responsible for developing the official classification of viruses and taxa naming (taxonomy) of the Coronaviridae family, assessed the novelty of the human pathogen tentatively named 2019-nCoV. Based on phylogeny, taxonomy and established practice, the CSG formally recognizes this virus as a sister to severe acute respiratory syndrome coronaviruses (SARS-CoVs) of the species Severe acute respiratory syndrome-related coronavirus and designates it as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). To facilitate communication, the CSG further proposes to use the following naming convention for individual isolates: SARS-CoV-2/Isolate/Host/Date/Location. The spectrum of clinical manifestations associated with SARS-CoV-2 infections in humans remains to be determined. The independent zoonotic transmission of SARS-CoV and SARS-CoV-2 highlights the need for studying the entire (virus) species to complement research focused on individual pathogenic viruses of immediate significance. This research will improve our understanding of virus-host interactions in an ever-changing environment and enhance our preparedness for future outbreaks.
781 citations
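The isolate naming convention proposed in the abstract above (SARS-CoV-2/Isolate/Host/Date/Location) can be sketched as a small helper. The function name, the validation rule, and the field values below are hypothetical; the abstract specifies only the order of the fields, not their formats.

```python
def isolate_name(isolate, host, date, location):
    """Join the fields in the order SARS-CoV-2/Isolate/Host/Date/Location."""
    fields = [isolate, host, date, location]
    for field in fields:
        # Assumption for this sketch: a field must be non-empty and must not
        # contain the separator, otherwise the name could not be split again.
        if not field or "/" in field:
            raise ValueError(f"invalid field: {field!r}")
    return "/".join(["SARS-CoV-2", *fields])


# Placeholder values, not a real isolate.
name = isolate_name("Example01", "Human", "2020-01-01", "ExampleCity")
```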