scispace - formally typeset
Search or ask a question
Author

Matthew E. Ritchie

Bio: Matthew E. Ritchie is an academic researcher from Walter and Eliza Hall Institute of Medical Research. The author has contributed to research in topics: Bioconductor & Progenitor cell. The author has an hindex of 42, co-authored 135 publications receiving 20980 citations. Previous affiliations of Matthew E. Ritchie include University of Cambridge & University of Melbourne.


Papers
More filters
Journal ArticleDOI
TL;DR: The open‐source Glimma package creates interactive graphics for exploring gene expression analysis with a few simple R commands, and extends popular plots found in the limma package, to allow individual data points to be queried and additional annotation information to be displayed upon hovering or selecting particular points.
Abstract: Motivation graphics for RNA-sequencing and microarray gene expression analyses may contain upwards of tens of thousands of points. Details about certain genes or samples of interest are easily obscured in such dense summary displays. Incorporating interactivity into summary plots would enable additional information to be displayed on demand and facilitate intuitive data exploration. Results The open-source Glimma package creates interactive graphics for exploring gene expression analysis with a few simple R commands. It extends popular plots found in the limma package, such as multi-dimensional scaling plots and mean-difference plots, to allow individual data points to be queried and additional annotation information to be displayed upon hovering or selecting particular points. It also offers links between plots so that more information can be revealed on demand. Glimma is widely applicable, supporting data analyses from a number of well-established Bioconductor workflows ( limma , edgeR and DESeq2 ) and uses D3/JavaScript to produce HTML pages with interactive displays that enable more effective data exploration by end-users. Results from Glimma can be easily shared between bioinformaticians and biologists, enhancing reporting capabilities while maintaining reproducibility. Availability and implementation The Glimma R package is available from http://bioconductor.org/packages/Glimma/ . Contact su.s@wehi.edu.au , law@wehi.edu.au or mritchie@wehi.edu.au.

108 citations

01 Jan 2009
TL;DR: Swift as mentioned in this paper is a new tool for performing primary data analysis on the Illumina Solexa Sequencing Platform, which completes the full analysis process from raw images through to base calls.
Abstract: Motivation: Primary data analysis methods are of critical importance in second generation DNA sequencing. Improved methods have the potential to increase yield and reduce the error rates. Openly documented analysis tools enable the user to understand the primary data, this is important for the optimization and validity of their scientific work. Results: In this article, we describe Swift, a new tool for performing primary data analysis on the Illumina Solexa Sequencing Platform. Swift is the first tool, outside of the vendors own software, which completes the full analysis process, from raw images through to base calls. As such it provides an alternative to, and independent validation of, the vendor supplied tool. Our results show that Swift is able to increase yield by 13.8%, at comparable error rate. Availability and Implementation: Swift is implemented in C++ and supported under Linux. It is supplied under an open source license (LGPL3), allowing researchers to build upon the platform. Swift is available from http://swiftng.sourceforge.net. Contact: new@sgenomics.org; nava.whiteford@nanoporetech.com Supplementary information: Supplementary data are available at Bioinformatics online.

95 citations

Journal ArticleDOI
TL;DR: This open-source processing pipeline in edgeR provides a complete analysis solution for screen data, that begins with the raw sequence reads and ends with a ranked list of candidate genes for downstream biological validation.
Abstract: Pooled library sequencing screens that perturb gene function in a high-throughput manner are becoming increasingly popular in functional genomics research. Irrespective of the mechanism by which loss of function is achieved, via either RNA interference using short hairpin RNAs (shRNAs) or genetic mutation using single guide RNAs (sgRNAs) with the CRISPR-Cas9 system, there is a need to establish optimal analysis tools to handle such data. Our open-source processing pipeline in edgeR provides a complete analysis solution for screen data, that begins with the raw sequence reads and ends with a ranked list of candidate genes for downstream biological validation. We first summarize the raw data contained in a fastq file into a matrix of counts (samples in the columns, genes in the rows) with options for allowing mismatches and small shifts in sequence position. Diagnostic plots, normalization and differential representation analysis can then be performed using established methods to prioritize results in a statistically rigorous way, with the choice of either the classic exact testing methodology or generalized linear modeling that can handle complex experimental designs. A detailed users' guide that demonstrates how to analyze screen data in edgeR along with a point-and-click implementation of this workflow in Galaxy are also provided. The edgeR package is freely available from http://www.bioconductor.org.

95 citations

Journal ArticleDOI
TL;DR: Interestingly, the deeply quiescent MaSC subset (Lgr5+Tspan8hi) resides within the proximal region throughout life, and has a transcriptome strikingly similar to that of claudin-low tumours.
Abstract: Despite accumulating evidence for a mammary differentiation hierarchy, the basal compartment comprising stem cells remains poorly characterized. Through gene expression profiling of Lgr5+ basal epithelial cells, we identify a new marker, Tetraspanin8 (Tspan8). Fractionation based on Tspan8 and Lgr5 expression uncovered three distinct mammary stem cell (MaSC) subsets in the adult mammary gland. These exist in a largely quiescent state but differ in their reconstituting ability, spatial localization, and their molecular and epigenetic signatures. Interestingly, the deeply quiescent MaSC subset (Lgr5+Tspan8hi) resides within the proximal region throughout life, and has a transcriptome strikingly similar to that of claudin-low tumours. Lgr5+Tspan8hi cells appear to originate from the embryonic mammary primordia before switching to a quiescent state postnatally but can be activated by ovarian hormones. Our findings reveal an unexpected degree of complexity within the adult MaSC compartment and identify a dormant subset poised for activation in response to physiological stimuli.

94 citations

Journal ArticleDOI
TL;DR: Using long-read nanopore sequencing it is shown that, with an average genomic coverage of ∼10, it is possible to determine both the level of methylation of CpG sites and the haplotype from which each read arises, and successfully identifies known imprinting control regions (ICRs) as well as some novel differentially methylated regions which may represent new ICRs.
Abstract: Systematic variation in the methylation of cytosines at CpG sites plays a critical role in early development of humans and other mammals. Of particular interest are regions of differential methylation between parental alleles, as these often dictate monoallelic gene expression, resulting in parent of origin specific control of the embryonic transcriptome and subsequent development, in a phenomenon known as genomic imprinting. Using long-read nanopore sequencing we show that, with an average genomic coverage of ∼10, it is possible to determine both the level of methylation of CpG sites and the haplotype from which each read arises. The long-read property is exploited to characterize, using novel methods, both methylation and haplotype for reads that have reduced basecalling precision compared to Sanger sequencing. We validate the analysis both through comparison of nanopore-derived methylation patterns with those from Reduced Representation Bisulfite Sequencing data and through comparison with previously reported data. Our analysis successfully identifies known imprinting control regions (ICRs) as well as some novel differentially methylated regions which, due to their proximity to hitherto unknown monoallelically expressed genes, may represent new ICRs.

92 citations


Cited by
More filters
Journal ArticleDOI
TL;DR: The philosophy and design of the limma package is reviewed, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.
Abstract: limma is an R/Bioconductor software package that provides an integrated solution for analysing data from gene expression experiments. It contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. Over the past decade, limma has been a popular choice for gene discovery through differential expression analyses of microarray and high-throughput PCR data. The package contains particularly strong facilities for reading, normalizing and exploring such data. Recently, the capabilities of limma have been significantly expanded in two important directions. First, the package can now perform both differential expression and differential splicing analyses of RNA sequencing (RNA-seq) data. All the downstream analysis tools previously restricted to microarray data are now available for RNA-seq as well. These capabilities allow users to analyse both RNA-seq and microarray data with very similar pipelines. Second, the package is now able to go past the traditional gene-wise expression analyses in a variety of ways, analysing expression profiles in terms of co-regulated sets of genes or in terms of higher-order expression signatures. This provides enhanced possibilities for biological interpretation of gene expression differences. This article reviews the philosophy and design of the limma package, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.

22,147 citations

Journal ArticleDOI
TL;DR: FeatureCounts as discussed by the authors is a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments, which implements highly efficient chromosome hashing and feature blocking techniques.
Abstract: MOTIVATION: Next-generation sequencing technologies generate millions of short sequence reads, which are usually aligned to a reference genome. In many applications, the key information required for downstream analysis is the number of reads mapping to each genomic feature, for example to each exon or each gene. The process of counting reads is called read summarization. Read summarization is required for a great variety of genomic analyses but has so far received relatively little attention in the literature. RESULTS: We present featureCounts, a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments. featureCounts implements highly efficient chromosome hashing and feature blocking techniques. It is considerably faster than existing methods (by an order of magnitude for gene-level summarization) and requires far less computer memory. It works with either single or paired-end reads and provides a wide range of options appropriate for different sequencing applications. AVAILABILITY AND IMPLEMENTATION: featureCounts is available under GNU General Public License as part of the Subread (http://subread.sourceforge.net) or Rsubread (http://www.bioconductor.org) software packages.

14,103 citations

Journal ArticleDOI
Adam Auton1, Gonçalo R. Abecasis2, David Altshuler3, Richard Durbin4  +514 moreInstitutions (90)
01 Oct 2015-Nature
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-generation sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

12,661 citations

Journal ArticleDOI
TL;DR: An analytical strategy for integrating scRNA-seq data sets based on common sources of variation is introduced, enabling the identification of shared populations across data sets and downstream comparative analysis.
Abstract: Computational single-cell RNA-seq (scRNA-seq) methods have been successfully applied to experiments representing a single condition, technology, or species to discover and define cellular phenotypes. However, identifying subpopulations of cells that are present across multiple data sets remains challenging. Here, we introduce an analytical strategy for integrating scRNA-seq data sets based on common sources of variation, enabling the identification of shared populations across data sets and downstream comparative analysis. We apply this approach, implemented in our R toolkit Seurat (http://satijalab.org/seurat/), to align scRNA-seq data sets of peripheral blood mononuclear cells under resting and stimulated conditions, hematopoietic progenitors sequenced using two profiling technologies, and pancreatic cell 'atlases' generated from human and mouse islets. In each case, we learn distinct or transitional cell states jointly across data sets, while boosting statistical power through integrated analysis. Our approach facilitates general comparisons of scRNA-seq data sets, potentially deepening our understanding of how distinct cell states respond to perturbation, disease, and evolution.

7,741 citations