Author

Gordon K. Smyth

Bio: Gordon K. Smyth is an academic researcher from the Walter and Eliza Hall Institute of Medical Research. The author has contributed to research in the topics Cellular differentiation and Gene expression profiling. The author has an h-index of 93 and has co-authored 357 publications receiving 124,066 citations. Previous affiliations of Gordon K. Smyth include University of Queensland and Beaumont Hospital.


Papers

Open access | Journal Article | DOI: 10.1093/BIOINFORMATICS/BTP616
01 Jan 2010-Bioinformatics
Abstract: Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).


Topics: Bioconductor (64%)

21,575 Citations


Open access | Journal Article | DOI: 10.1093/NAR/GKV007
Matthew E. Ritchie (1), Belinda Phipson (2), Di Wu (3), Yifang Hu (1), +4 more | Institutions (5)
Abstract: limma is an R/Bioconductor software package that provides an integrated solution for analysing data from gene expression experiments. It contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. Over the past decade, limma has been a popular choice for gene discovery through differential expression analyses of microarray and high-throughput PCR data. The package contains particularly strong facilities for reading, normalizing and exploring such data. Recently, the capabilities of limma have been significantly expanded in two important directions. First, the package can now perform both differential expression and differential splicing analyses of RNA sequencing (RNA-seq) data. All the downstream analysis tools previously restricted to microarray data are now available for RNA-seq as well. These capabilities allow users to analyse both RNA-seq and microarray data with very similar pipelines. Second, the package is now able to go past the traditional gene-wise expression analyses in a variety of ways, analysing expression profiles in terms of co-regulated sets of genes or in terms of higher-order expression signatures. This provides enhanced possibilities for biological interpretation of gene expression differences. This article reviews the philosophy and design of the limma package, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.


Topics: Microarray databases (61%), Bioconductor (51%)

13,819 Citations


Open access | Journal Article | DOI: 10.1186/GB-2004-5-10-R80
15 Sep 2004-Genome Biology
Abstract: The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.


Topics: Bioconductor (65%)

11,488 Citations


Open access | Journal Article | DOI: 10.2202/1544-6115.1027
Gordon K. Smyth (1) | Institutions (1)
Abstract: The problem of identifying differentially expressed genes in designed microarray experiments is considered. Lonnstedt and Speed (2002) derived an expression for the posterior odds of differential expression in a replicated two-color experiment using a simple hierarchical parametric model. The purpose of this paper is to develop the hierarchical model of Lonnstedt and Speed (2002) into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples. The model is reset in the context of general linear models with arbitrary coefficients and contrasts of interest. The approach applies equally well to both single channel and two color microarray experiments. Consistent, closed form estimators are derived for the hyperparameters in the model. The estimators proposed have robust behavior even for small numbers of arrays and allow for incomplete data arising from spot filtering or spot quality weights. The posterior odds statistic is reformulated in terms of a moderated t-statistic in which posterior residual standard deviations are used in place of ordinary standard deviations. The empirical Bayes approach is equivalent to shrinkage of the estimated sample variances towards a pooled estimate, resulting in far more stable inference when the number of arrays is small. The use of moderated t-statistics has the advantage over the posterior odds that the number of hyperparameters which need to be estimated is reduced; in particular, knowledge of the non-null prior for the fold changes is not required. The moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom. The moderated t inferential approach extends to accommodate tests of composite null hypotheses through the use of moderated F-statistics. The performance of the methods is demonstrated in a simulation study. Results are presented for two publicly available data sets.
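
The variance shrinkage described in this abstract can be illustrated with a minimal sketch. It assumes the usual form of the moderated t-statistic, in which the posterior variance is a degrees-of-freedom-weighted average of a prior variance and the gene-wise sample variance. The function names and the hyperparameter values `d0` and `s0_sq` are hypothetical and supplied by hand; the paper derives closed-form estimators for the hyperparameters, which this sketch does not implement.

```python
import math


def moderated_t(beta_hat, s_sq, df_resid, v, d0, s0_sq):
    """Return (moderated t-statistic, total degrees of freedom) for one gene.

    beta_hat : estimated coefficient or contrast (e.g. a log fold change)
    s_sq     : gene-wise residual sample variance
    df_resid : residual degrees of freedom for this gene
    v        : unscaled variance factor of beta_hat from the linear model
    d0       : prior degrees of freedom (assumed known here)
    s0_sq    : prior variance (assumed known here)
    """
    # Shrink the sample variance towards the prior value, weighting each
    # by its degrees of freedom.
    s_tilde_sq = (d0 * s0_sq + df_resid * s_sq) / (d0 + df_resid)
    t_mod = beta_hat / math.sqrt(s_tilde_sq * v)
    # The moderated statistic is referred to a t-distribution with
    # augmented degrees of freedom d0 + df_resid.
    return t_mod, d0 + df_resid


def ordinary_t(beta_hat, s_sq, v):
    """Ordinary t-statistic using the unshrunk sample variance."""
    return beta_hat / math.sqrt(s_sq * v)


# A gene whose sample variance is very small by chance: the ordinary
# statistic is inflated, while the moderated one is pulled back.
t_mod, df_total = moderated_t(beta_hat=1.0, s_sq=0.01, df_resid=2,
                              v=0.5, d0=4, s0_sq=0.25)
t_ord = ordinary_t(beta_hat=1.0, s_sq=0.01, v=0.5)
```

With only two residual degrees of freedom, an accidentally tiny sample variance dominates the ordinary statistic; the shrunk variance damps that effect, which is the stabilisation the abstract refers to for small numbers of arrays.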


  • Figure 1: Example designs for two color microarrays.
  • Table 3: Top 30 genes from the Swirl data
  • Table 1: Area under the Receiver Operating Curve for five statistics and three simulation scenarios.
  • Figure 4: False discovery rates for different gene selection statistics when the true variances are somewhat different, i.e., the prior and residual degrees of freedom are balanced. The rates are means of actual false discovery rates for 100 simulated data sets.
  • Table 4: Top 15 genes from the ApoAI data

Topics: Linear model (53%), Bayes' theorem (52%), Parametric model (51%)

11,231 Citations


Open access | Journal Article | DOI: 10.1093/BIOINFORMATICS/BTT656
Yang Liao (1), Gordon K. Smyth (1), Wei Shi (1) | Institutions (1)
01 Apr 2014-Bioinformatics
Abstract: MOTIVATION: Next-generation sequencing technologies generate millions of short sequence reads, which are usually aligned to a reference genome. In many applications, the key information required for downstream analysis is the number of reads mapping to each genomic feature, for example to each exon or each gene. The process of counting reads is called read summarization. Read summarization is required for a great variety of genomic analyses but has so far received relatively little attention in the literature. RESULTS: We present featureCounts, a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments. featureCounts implements highly efficient chromosome hashing and feature blocking techniques. It is considerably faster than existing methods (by an order of magnitude for gene-level summarization) and requires far less computer memory. It works with either single or paired-end reads and provides a wide range of options appropriate for different sequencing applications. AVAILABILITY AND IMPLEMENTATION: featureCounts is available under GNU General Public License as part of the Subread (http://subread.sourceforge.net) or Rsubread (http://www.bioconductor.org) software packages.
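
Read summarization as defined in this abstract can be sketched in a few lines. The following is a simplified illustration only and is not featureCounts's actual algorithm: it assumes a per-chromosome dictionary with fixed-width bins as a stand-in for the hashing and blocking ideas the abstract mentions. The bin width, the function names, the tuple layouts, and the rule of discarding reads that overlap more than one gene are all assumptions made for this example.

```python
from collections import defaultdict

BIN = 1000  # hypothetical bin width in bases


def build_index(features):
    """Index features by chromosome and bin.

    features: iterable of (gene_id, chrom, start, end), inclusive coordinates.
    """
    index = defaultdict(lambda: defaultdict(list))
    for gene_id, chrom, start, end in features:
        # Register the feature in every bin it touches.
        for b in range(start // BIN, end // BIN + 1):
            index[chrom][b].append((start, end, gene_id))
    return index


def count_reads(index, reads):
    """Count reads per gene.

    reads: iterable of (chrom, start, end), inclusive coordinates.
    Reads overlapping no gene, or more than one gene, are not counted.
    """
    counts = defaultdict(int)
    for chrom, start, end in reads:
        hits = set()
        bins = index.get(chrom, {})
        # Only features in the bins the read touches need to be checked.
        for b in range(start // BIN, end // BIN + 1):
            for f_start, f_end, gene_id in bins.get(b, ()):
                if start <= f_end and f_start <= end:
                    hits.add(gene_id)
        if len(hits) == 1:
            counts[hits.pop()] += 1
    return dict(counts)


features = [("g1", "chr1", 100, 500),
            ("g2", "chr1", 400, 2500),
            ("g3", "chr2", 1, 50)]
reads = [("chr1", 120, 170),    # overlaps g1 only
         ("chr1", 450, 480),    # overlaps g1 and g2: ambiguous, skipped
         ("chr1", 1990, 2040),  # overlaps g2 only
         ("chr2", 60, 100),     # overlaps nothing
         ("chr3", 1, 10)]       # chromosome not in the annotation
counts = count_reads(build_index(features), reads)
```

Binning restricts each read to a handful of candidate features instead of the whole annotation, which is the general reason index-based lookups of this kind scale to millions of reads.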


8,495 Citations


Cited by

Open access | Journal Article | DOI: 10.1186/S13059-014-0550-8
05 Dec 2014-Genome Biology
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .


Topics: MRNA Sequencing (54%), Integrator complex (51%), Count data (50%)

29,675 Citations


Open access | Journal Article | DOI: 10.1093/BIOINFORMATICS/BTP616
01 Jan 2010-Bioinformatics
Abstract: Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).


Topics: Bioconductor (64%)

21,575 Citations


Open access | Journal Article | DOI: 10.1093/NAR/GKV007
Matthew E. Ritchie (1), Belinda Phipson (2), Di Wu (3), Yifang Hu (1), +4 more | Institutions (5)
Abstract: limma is an R/Bioconductor software package that provides an integrated solution for analysing data from gene expression experiments. It contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. Over the past decade, limma has been a popular choice for gene discovery through differential expression analyses of microarray and high-throughput PCR data. The package contains particularly strong facilities for reading, normalizing and exploring such data. Recently, the capabilities of limma have been significantly expanded in two important directions. First, the package can now perform both differential expression and differential splicing analyses of RNA sequencing (RNA-seq) data. All the downstream analysis tools previously restricted to microarray data are now available for RNA-seq as well. These capabilities allow users to analyse both RNA-seq and microarray data with very similar pipelines. Second, the package is now able to go past the traditional gene-wise expression analyses in a variety of ways, analysing expression profiles in terms of co-regulated sets of genes or in terms of higher-order expression signatures. This provides enhanced possibilities for biological interpretation of gene expression differences. This article reviews the philosophy and design of the limma package, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.


Topics: Microarray databases (61%), Bioconductor (51%)

13,819 Citations


Open access | Journal Article | DOI: 10.1093/BIOINFORMATICS/BTU638
15 Jan 2015-Bioinformatics
Abstract: Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability and implementation: HTSeq is released as open-source software under the GNU General Public Licence and is available from http://www-huber.embl.de/HTSeq or from the Python Package Index at https://pypi.python.org/pypi/HTSeq. Contact: sanders@fs.tum.de


11,833 Citations


Open access | Journal Article | DOI: 10.1186/GB-2010-11-10-R106
27 Oct 2010-Genome Biology
Abstract: High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression and present an implementation, DESeq, as an R/Bioconductor package.


Topics: Count data (63%), Bioconductor (54%), Binomial distribution (52%)

11,332 Citations


Performance Metrics

Author's H-index: 93

No. of papers from the Author in previous years
Year    Papers
2021    18
2020    16
2019    20
2018    28
2017    22
2016    27

Top Attributes


Author's top 5 most impactful journals

Nature Immunology

13 papers, 2.2K citations

Nucleic Acids Research

12 papers, 20.5K citations

Blood

12 papers, 662 citations

bioRxiv

12 papers, 58 citations

Bioinformatics

12 papers, 35K citations

Network Information
Related Authors (5)
Warren S. Alexander

254 papers, 28.4K citations

94% related
Delphine Merino

33 papers, 2.2K citations

94% related
Matthew E. Ritchie

135 papers, 20.9K citations

93% related
Alexandra L. Garnham

37 papers, 674 citations

93% related
Douglas J. Hilton

226 papers, 27.8K citations

93% related