Author

Simon Anders

Bio: Simon Anders is an academic researcher at Heidelberg University who has co-authored 84 publications receiving 66,957 citations. The author has an h-index of 34. Previous affiliations of Simon Anders include Technische Universität München and the University of Helsinki. The author has done significant research on the topics Bioconductor and count data.

Topics: Bioconductor, Count data, Population
Papers

Open access | Journal Article | DOI: 10.1186/S13059-014-0550-8
05 Dec 2014 - Genome Biology
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .

Topics: mRNA Sequencing (54%), Integrator complex (51%), Count data (50%)

29,675 Citations


Open access | Journal Article | DOI: 10.1093/BIOINFORMATICS/BTU638
15 Jan 2015 - Bioinformatics
Abstract: Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability and implementation: HTSeq is released as open-source software under the GNU General Public Licence and is available from http://www-huber.embl.de/HTSeq or from the Python Package Index at https://pypi.python.org/pypi/HTSeq. Contact: sanders@fs.tum.de

11,833 Citations


Open access | Journal Article | DOI: 10.1186/GB-2010-11-10-R106
27 Oct 2010 - Genome Biology
Abstract: High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression and present an implementation, DESeq, as an R/Bioconductor package.

Topics: Count data (63%), Bioconductor (54%), Binomial distribution (52%)

11,332 Citations


Open access | Posted Content | DOI: 10.1101/002832
17 Nov 2014 - bioRxiv
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-Seq data, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data. DESeq2 uses shrinkage estimation for dispersions and fold changes to improve stability and interpretability of the estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression and facilitates downstream tasks such as gene ranking and visualization. DESeq2 is available as an R/Bioconductor package.

  • Figure 2 Effect of shrinkage on logarithmic fold change estimates. Plots of the (A) MLE (i.e., no shrinkage) and (B) MAP estimate (i.e., with shrinkage) for the LFCs attributable to mouse strain, over the average expression strength for a ten vs eleven sample comparison of the Bottomly et al. [16] dataset. Small triangles at the top and bottom of the plots indicate points that would fall outside of the plotting window. Two genes with similar mean count and MLE logarithmic fold change are highlighted with green and purple circles. (C) The counts (normalized by size factors s_j) for these genes reveal low dispersion for the gene in green and high dispersion for the gene in purple. (D) Density plots of the likelihoods (solid lines, scaled to integrate to 1) and the posteriors (dashed lines) for the green and purple genes and of the prior (solid black line): due to the higher
  • Figure 3 Stability of logarithmic fold changes. DESeq2 is run on equally split halves of the data of Bottomly et al. [16], and the LFCs from the halves are plotted against each other. (A) MLEs, i.e., without LFC shrinkage. (B) MAP estimates, i.e., with shrinkage. Points in the top left and bottom right quadrants indicate genes with a change of sign of LFC. Red points indicate genes with adjusted P value < 0.1. The legend displays the root-mean-square error of the estimates in group I compared to those in group II. LFC, logarithmic fold change; MAP, maximum a posteriori; MLE,
  • Figure 8 Sensitivity estimated from experimental reproducibility. Each algorithm’s sensitivity in the evaluation set (box plots) is evaluated using the calls of each other algorithm in the verification set (panels with grey label).
  • Figure 9 Precision estimated from experimental reproducibility. Each algorithm’s precision in the evaluation set (box plots) is evaluated using the calls of each other algorithm in the verification set (panels with grey label).
Topics: Count data (53%), Bioconductor (53%), Fold change (51%)

2,229 Citations


Open access | Journal Article | DOI: 10.1038/NMETH.3252
Wolfgang Huber, Vincent J. Carey, Robert Gentleman, Simon Anders, +22 more | Institutions (13)
01 Feb 2015 - Nature Methods
Abstract: Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology. The project aims to enable interdisciplinary research, collaboration and rapid development of scientific software. Based on the statistical programming language R, Bioconductor comprises 934 interoperable packages contributed by a large, diverse community of scientists. Packages cover a range of bioinformatic and statistical applications. They undergo formal initial review and continuous automated testing. We present an overview for prospective users and contributors.

Topics: Bioconductor (71%)

2,202 Citations


Cited by

Open access | Journal Article | DOI: 10.1186/S13059-014-0550-8
05 Dec 2014 - Genome Biology
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .

Topics: mRNA Sequencing (54%), Integrator complex (51%), Count data (50%)

29,675 Citations


Open access | Journal Article | DOI: 10.1101/GR.107524.110
Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, +7 more | Institutions (1)
01 Sep 2010 - Genome Research
Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS—the 1000 Genome pilot alone includes nearly five terabases—make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

Topics: Variant Call Format (52%), Software framework (50%)

16,404 Citations


Open access | Journal Article | DOI: 10.1093/NAR/GKV007
Matthew E. Ritchie, Belinda Phipson, Di Wu, Yifang Hu, +4 more | Institutions (5)
Abstract: limma is an R/Bioconductor software package that provides an integrated solution for analysing data from gene expression experiments. It contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. Over the past decade, limma has been a popular choice for gene discovery through differential expression analyses of microarray and high-throughput PCR data. The package contains particularly strong facilities for reading, normalizing and exploring such data. Recently, the capabilities of limma have been significantly expanded in two important directions. First, the package can now perform both differential expression and differential splicing analyses of RNA sequencing (RNA-seq) data. All the downstream analysis tools previously restricted to microarray data are now available for RNA-seq as well. These capabilities allow users to analyse both RNA-seq and microarray data with very similar pipelines. Second, the package is now able to go past the traditional gene-wise expression analyses in a variety of ways, analysing expression profiles in terms of co-regulated sets of genes or in terms of higher-order expression signatures. This provides enhanced possibilities for biological interpretation of gene expression differences. This article reviews the philosophy and design of the limma package, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.

Topics: Microarray databases (61%), Bioconductor (51%)

13,819 Citations


Open access | Journal Article | DOI: 10.1093/BIOINFORMATICS/BTU638
15 Jan 2015 - Bioinformatics
Abstract: Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability and implementation: HTSeq is released as open-source software under the GNU General Public Licence and is available from http://www-huber.embl.de/HTSeq or from the Python Package Index at https://pypi.python.org/pypi/HTSeq. Contact: sanders@fs.tum.de

11,833 Citations


Open access | Journal Article | DOI: 10.1186/GB-2010-11-10-R106
27 Oct 2010 - Genome Biology
Abstract: High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression and present an implementation, DESeq, as an R/Bioconductor package.

Topics: Count data (63%), Bioconductor (54%), Binomial distribution (52%)

11,332 Citations


Performance
Metrics

Author's H-index: 34

No. of papers from the Author in previous years
Year  Papers
2021  8
2020  15
2019  10
2018  4
2017  4
2016  5

Top Attributes

Author's top 5 most impactful journals

bioRxiv: 12 papers, 2.6K citations
Bioinformatics: 4 papers, 12.3K citations
Cell: 3 papers, 470 citations
Nature Methods: 3 papers, 3K citations
medRxiv: 3 papers, 43 citations

Network Information
Related Authors (5)
Svetlana Ovchinnikova: 8 papers, 281 citations, 91% related
Peter-Martin Bruch: 8 papers, 34 citations, 81% related
Vladislav Kim: 4 papers, 337 citations, 81% related
Wolfgang Huber: 335 papers, 106.1K citations, 79% related
Emma I. Andersson: 29 papers, 1.5K citations, 77% related