scispace - formally typeset
Journal

bioRxiv

About: bioRxiv is an academic journal that publishes mainly in the areas of Population and Gene. Over its lifetime, it has published 154,314 publications, which have received 439,493 citations. The journal is also known as "bioRxiv.org: the preprint server for biology" and "bioRxivorg".

Topics: Population, Gene, Genome
Papers

Open access | Posted Content | DOI: 10.1101/002832
17 Nov 2014 - bioRxiv
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-Seq data, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data. DESeq2 uses shrinkage estimation for dispersions and fold changes to improve stability and interpretability of the estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression and facilitates downstream tasks such as gene ranking and visualization. DESeq2 is available as an R/Bioconductor package.

  • Figure 9 Precision estimated from experimental reproducibility. Each algorithm’s precision in the evaluation set (box plots) is evaluated using the calls of each other algorithm in the verification set (panels with grey label).
  • Figure 2 Effect of shrinkage on logarithmic fold change estimates. Plots of the (A)MLE (i.e., no shrinkage) and (B)MAP estimate (i.e., with shrinkage) for the LFCs attributable to mouse strain, over the average expression strength for a ten vs eleven sample comparison of the Bottomly et al. [16] dataset. Small triangles at the top and bottom of the plots indicate points that would fall outside of the plotting window. Two genes with similar mean count and MLE logarithmic fold change are highlighted with green and purple circles. (C) The counts (normalized by size factors sj) for these genes reveal low dispersion for the gene in green and high dispersion for the gene in purple. (D) Density plots of the likelihoods (solid lines, scaled to integrate to 1) and the posteriors (dashed lines) for the green and purple genes and of the prior (solid black line): due to the higher
  • Figure 8 Sensitivity estimated from experimental reproducibility. Each algorithm’s sensitivity in the evaluation set (box plots) is evaluated using the calls of each other algorithm in the verification set (panels with grey label).
  • Figure 3 Stability of logarithmic fold changes. DESeq2 is run on equally split halves of the data of Bottomly et al. [16], and the LFCs from the halves are plotted against each other. (A)MLEs, i.e., without LFC shrinkage. (B)MAP estimates, i.e., with shrinkage. Points in the top left and bottom right quadrants indicate genes with a change of sign of LFC. Red points indicate genes with adjusted P value < 0.1. The legend displays the root-mean-square error of the estimates in group I compared to those in group II. LFC, logarithmic fold change; MAP, maximum a posteriori; MLE,
Topics: Count data (53%), Bioconductor (53%), Fold change (51%)

2,229 Citations


Open access | Posted Content | DOI: 10.1101/030338
30 Oct 2015 - bioRxiv
Abstract: Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) sequence data for 60,706 individuals of diverse ethnicities. The resulting catalogue of human genetic diversity has unprecedented resolution, with an average of one variant every eight bases of coding sequence and the presence of widespread mutational recurrence. The deep catalogue of variation provided by the Exome Aggregation Consortium (ExAC) can be used to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; we identify 3,230 genes with near-complete depletion of truncating variants, 79% of which have no currently established human disease phenotype. Finally, we show that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human knockout variants in protein-coding genes.

  • Figure 1 | Patterns of genetic variation in 60,706 humans. a, The size and diversity of public reference exome data sets. ExAC exceeds previous data sets in size for all studied populations. b, Principal component analysis (PCA) dividing ExAC individuals into five continental populations. PC2 and PC3 are shown; additional PCs are in Extended Data Fig. 5a. c, The allele frequency spectrum of ExAC highlights that the majority of genetic variants are rare and novel (absent from prior databases of genetic variation, such as dbSNP). d, The proportion of possible variation observed by mutational context and functional class. Over half of all possible CpG transitions are observed. Error bars represent standard error of the mean. e, f, The number (e), and frequency distribution (proportion singleton; f) of indels, by size. Compared to in-frame indels, frameshift variants are less common (have a higher proportion of singletons, a proxy for predicted deleteriousness on gene product). Error bars indicate 95% confidence intervals.
  • Figure 2 | Mutational recurrence at large sample sizes. a, Proportion of validated de novo variants from two external data sets that are independently found in ExAC, separated by functional class and mutational context. Error bars represent standard error of the mean. Colours are consistent in a–d. b, Number of unique variants observed, by mutational context, as a function of number of individuals (downsampled from ExAC). CpG transitions, the most likely mutational event, begin reaching saturation at ~ 20,000 individuals. c, The site frequency spectrum is shown for each mutational context. d, For doubletons (variants with an allele count (AC) of 2), mutation rate is positively correlated with the likelihood of being found in two individuals of different continental populations. e, The mutability-adjusted proportion of singletons (MAPS) is shown across functional classes. Error bars represent standard error of the mean of the proportion of singletons.
  • Figure 4 | Filtering for Mendelian variant discovery. a, Predicted missense and protein-truncating variants in 500 randomly chosen ExAC individuals were filtered based on allele frequency (AF) information from ESP, or from the remaining ExAC individuals. At a 0.1% allele frequency filter, ExAC provides greater power to remove candidate variants, leaving an average of 154 variants for analysis, compared to 1,090 after filtering against ESP. Popmax allele frequency also provides greater power than global allele frequency, particularly when populations are unequally sampled. b, Estimates of allele frequency in Europeans based on ESP are more precise at higher allele frequencies. Sampling variance and ascertainment bias make allele frequency estimates unreliable, posing problems for Mendelian variant filtration. 69% of ESP European singletons are not seen a second time in ExAC (tall bar at left), illustrating the dangers of filtering on very low allele counts. c, Allele frequency spectrum of disease-causing variants in the Human Gene Mutation Database (HGMD) and/or pathogenic or probable pathogenic variants in ClinVar for well-characterized autosomal dominant and autosomal recessive disease genes28. Most are not found in ExAC; however, many of the reportedly pathogenic variants found in ExAC are at too high a frequency to be consistent with disease prevalence and penetrance. d, Literature review of variants with > 1% global allele frequency or > 1% Latin American or South Asian population allele frequency confirmed there is insufficient evidence for pathogenicity for the majority of these variants. Variants were reclassified by American College of Medical Genetics and Genomics (ACMG) guidelines24.
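The Figure 4 caption contrasts filtering on global allele frequency with filtering on popmax allele frequency. The following is a minimal Python sketch of that idea, not the ExAC pipeline: it assumes popmax means the highest per-population allele frequency, and the function names, population labels, counts, and the strict `<` comparison are all hypothetical choices made for illustration. Only the 0.1% threshold comes from the caption.

```python
def global_af(pop_counts):
    """Allele frequency pooled over all populations.

    pop_counts maps a population label to (allele_count, allele_number).
    """
    total_ac = sum(ac for ac, _ in pop_counts.values())
    total_an = sum(an for _, an in pop_counts.values())
    return total_ac / total_an if total_an else 0.0


def popmax_af(pop_counts):
    """Highest allele frequency seen in any single population (assumed definition)."""
    return max((ac / an for ac, an in pop_counts.values() if an > 0), default=0.0)


def passes_filter(pop_counts, threshold=0.001, use_popmax=True):
    """Keep a candidate variant only if its frequency is below the threshold."""
    af = popmax_af(pop_counts) if use_popmax else global_af(pop_counts)
    return af < threshold


# Hypothetical variant: absent from a heavily sampled population,
# but at 1% frequency in a lightly sampled one.
variant = {"pop_large": (0, 60000), "pop_small": (10, 1000)}

kept_by_global = passes_filter(variant, use_popmax=False)  # pooled AF is diluted
kept_by_popmax = passes_filter(variant, use_popmax=True)   # per-population AF is not
```

In this made-up case the pooled frequency falls below 0.1% because the large population dilutes it, so the variant survives a global filter, while the popmax filter removes it. This is one way to read the caption's remark about unequally sampled populations.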
Topics: Exome (57%), Genetic variation (54%), Human genetic variation (53%)

1,552 Citations


Open access | Posted Content | DOI: 10.1101/531210
30 Jan 2019 - bioRxiv
Abstract: Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes critical for an organism’s function will be depleted for such variants in natural populations, while non-essential genes will tolerate their accumulation. However, predicted loss-of-function (pLoF) variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes. Here, we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence pLoF variants in this cohort after filtering for sequencing and annotation artifacts. Using an improved model of human mutation, we classify human protein-coding genes along a spectrum representing intolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve gene discovery power for both common and rare diseases.

  • Fig. 4 | Biological properties of constrained genes and transcripts. a, The mean number of protein–protein interactions is plotted as a function of LOEUF decile: more constrained genes have more interaction partners (LOEUF linear regression r = −0.14; P = 1.7 × 10−51). Error bars correspond to 95% confidence intervals. b, The number of tissues where a gene is expressed (transcripts per million > 0.3), binned by LOEUF decile, is shown as a violin plot with the mean number overlaid as points: more constrained genes are more likely to be expressed in several tissues (LOEUF linear regression r = −0.31; P < 1 × 10−100). c, For 1,740 genes in which there exists at least one constrained and one unconstrained transcript, the proportion of expression derived from the constrained transcript is plotted as a histogram.
  • Fig. 5 | Disease applications of constraint. a, The rate ratio is defined by the rate of de novo variants (number per patient) in 5,305 cases of intellectual disability/developmental delay (ID/DD) divided by the rate in 2,179 controls. pLoF variants in the most constrained decile of the genome are approximately 11-fold more likely to be found in cases compared to controls. Error bars represent 95% confidence intervals. b, Marginal enrichment in per-SNV heritability explained by common (minor allele frequency > 5%) variants within 100-kb of genes in each LOEUF decile, estimated by linkage disequilibrium (LD) score regression48. Enrichment is compared to the average SNV genome-wide. The results reported here are from random effects meta-analysis of 276 independent traits (subsetted from the 658 traits with UK Biobank or large-scale consortium GWAS results). Error bars represent 95% confidence intervals. c, Conditional enrichment in per-SNV common variant heritability tested using regression of linkage disequilibrium score in each of 658 common disease and trait GWAS results. P values evaluate whether per-SNV heritability is proportional to the LOEUF of the nearest gene, conditional on 75 existing functional, linkage disequilibrium, and minor-allele-frequency-related genomic annotations. Colours alternate by broad phenotype category.
Topics: Exome sequencing (55%), Genome (54%), Gene (51%)

1,037 Citations


Open access | Posted Content | DOI: 10.1101/060012
Alexey Sergushichev
20 Jun 2016 - bioRxiv
Abstract: Gene set enrichment analysis is a widely used tool for analyzing gene expression data. However, current implementations are slow due to the large number of samples required for the analysis to have good statistical power. In this paper we present a novel algorithm that efficiently reuses one sample multiple times and thus speeds up the analysis. We show that it is possible to make hundreds of thousands of permutations in a few minutes, which leads to very accurate p-values. This, in turn, allows applying standard FDR correction procedures, which are more accurate than the ones currently used. The method is implemented in the form of an R package and is freely available at https://github.com/ctlab/fgsea.
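
To illustrate why the number of permutations matters, here is a minimal Python sketch of a naive permutation-based enrichment p-value. It is not the fgsea algorithm and does not reuse samples as the abstract describes; it is the slow baseline that such methods aim to speed up. The unweighted running-sum statistic, the function names, and the toy gene labels are assumptions made for illustration.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (Kolmogorov-Smirnov style).

    Walk down the ranked list, stepping up at genes in the set and down
    otherwise, and return the running sum with the largest absolute value.
    """
    hits = set(gene_set) & set(ranked_genes)
    n, k = len(ranked_genes), len(hits)
    if k == 0 or k == n:
        return 0.0
    up, down = 1.0 / k, 1.0 / (n - k)
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += up if gene in hits else -down
        if abs(running) > abs(best):
            best = running
    return best


def permutation_pvalue(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Empirical two-sided p-value from random gene sets of the same size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    k = len(set(gene_set) & set(ranked_genes))
    extreme = 0
    for _ in range(n_perm):
        random_set = rng.sample(ranked_genes, k)
        if abs(enrichment_score(ranked_genes, random_set)) >= abs(observed):
            extreme += 1
    # Add-one correction: the smallest reportable p-value is 1 / (n_perm + 1).
    return observed, (extreme + 1) / (n_perm + 1)


# Toy input: 100 hypothetical genes, with the set placed at the top of the ranking.
ranked = [f"g{i}" for i in range(100)]
top_set = ranked[:10]
score, p_value = permutation_pvalue(ranked, top_set, n_perm=1000)
```

Because the empirical p-value cannot drop below 1 / (n_perm + 1), a small permutation count limits how finely strongly enriched sets can be distinguished, which is one reason a large number of permutations helps before applying a multiple-testing correction.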

788 Citations


Open access | Posted Content | DOI: 10.1101/2020.02.07.937862
11 Feb 2020 - bioRxiv
Abstract: The present outbreak of lower respiratory tract infections, including respiratory distress syndrome, is the third spillover, in only two decades, of an animal coronavirus to humans resulting in a major epidemic. Here, the Coronavirus Study Group (CSG) of the International Committee on Taxonomy of Viruses, which is responsible for developing the official classification of viruses and taxa naming (taxonomy) of the Coronaviridae family, assessed the novelty of the human pathogen tentatively named 2019-nCoV. Based on phylogeny, taxonomy and established practice, the CSG formally recognizes this virus as a sister to severe acute respiratory syndrome coronaviruses (SARS-CoVs) of the species Severe acute respiratory syndrome-related coronavirus and designates it as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). To facilitate communication, the CSG further proposes to use the following naming convention for individual isolates: SARS-CoV-2/Isolate/Host/Date/Location. The spectrum of clinical manifestations associated with SARS-CoV-2 infections in humans remains to be determined. The independent zoonotic transmission of SARS-CoV and SARS-CoV-2 highlights the need for studying the entire (virus) species to complement research focused on individual pathogenic viruses of immediate significance. This research will improve our understanding of virus-host interactions in an ever-changing environment and enhance our preparedness for future outbreaks.

  • Fig. 2 | Phylogeny of coronaviruses. a, Concatenated multiple sequence alignments (MSAs) of the protein domain combination44 used for phylogenetic and DEmARC analyses of the family Coronaviridae. Shown are the locations of the replicative domains conserved in the order Nidovirales in relation to several other ORF1a/b-encoded domains and other major ORFs in the SARS-CoV genome. 5d, 5 domains: nsp5A-3CLpro, two beta-barrel domains of the 3C-like protease;
Topics: Coronavirus (68%), Respiratory tract infections (55%), Virus classification (54%)

781 Citations


Performance
Metrics
No. of papers from the Journal in previous years
Year    Papers
2021    37,102
2020    42,689
2019    31,470
2018    22,828
2017    12,258
2016    5,058

Top Attributes

Journal's top 5 most impactful authors

Ian J. Deary

97 papers, 604 citations

Andrew M. McIntosh

78 papers, 652 citations

Aviv Regev

75 papers, 1K citations

Ole A. Andreassen

71 papers, 950 citations

George Davey Smith

62 papers, 642 citations

Network Information
Related Journals (5)
eLife

14.8K papers, 420.6K citations

94% related
PLOS Computational Biology

8.4K papers, 405.9K citations

92% related
PLOS Biology

5.4K papers, 547.7K citations

90% related
BMC Biology

1.9K papers, 85.6K citations

89% related
PLOS Genetics

9.2K papers, 619.8K citations

88% related