Journal Article

A data-driven approach to preprocessing Illumina 450K methylation array data

01 May 2013-BMC Genomics (BioMed Central)-Vol. 14, Iss: 1, pp 293-293
TL;DR: It is demonstrated that quantile normalization methods produce marked improvement, even in highly consistent data, by all three metrics, and that careful selection of preprocessing steps can minimize variance and thus improve statistical power, especially for the detection of the small absolute DNA methylation changes likely associated with complex disease phenotypes.
Abstract: As the most stable and experimentally accessible epigenetic mark, DNA methylation is of great interest to the research community. The landscape of DNA methylation across tissues, through development and in disease pathogenesis is not yet well characterized. Thus there is a need for rapid and cost-effective methods for assessing genome-wide levels of DNA methylation. The Illumina Infinium HumanMethylation450 (450K) BeadChip is a very useful addition to the available methods for DNA methylation analysis but its complex design, incorporating two different assay methods, requires careful consideration. Accordingly, several normalization schemes have been published. We have taken advantage of known DNA methylation patterns associated with genomic imprinting and X-chromosome inactivation (XCI), in addition to the performance of SNP genotyping assays present on the array, to derive three independent metrics which we use to test alternative schemes of correction and normalization. These metrics also have potential utility as quality scores for datasets. The standard index of DNA methylation at any specific CpG site is β = M/(M + U + 100) where M and U are methylated and unmethylated signal intensities, respectively. Betas (βs) calculated from raw signal intensities (the default GenomeStudio behavior) perform well, but using 11 methylomic datasets we demonstrate that quantile normalization methods produce marked improvement, even in highly consistent data, by all three metrics. The commonly used procedure of normalizing betas is inferior to the separate normalization of M and U, and it is also advantageous to normalize Type I and Type II assays separately. More elaborate manipulation of quantiles proves to be counterproductive. Careful selection of preprocessing steps can minimize variance and thus improve statistical power, especially for the detection of the small absolute DNA methylation changes likely associated with complex disease phenotypes.
For the convenience of the research community we have created a user-friendly R software package called wateRmelon, downloadable from Bioconductor and compatible with the existing methylumi, minfi and IMA packages, that allows others to apply the same normalization methods and data quality tests to 450K data.
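The preprocessing order described in the abstract can be illustrated with a minimal Python sketch: quantile-normalize M and U separately, do so within each assay type, and only then compute β = M/(M + U + 100). This is an illustration under stated assumptions, not the wateRmelon implementation. The function names, the toy intensities and the tie-handling rule are all hypothetical.

```python
OFFSET = 100  # constant from the abstract's formula: beta = M / (M + U + 100)


def beta(m, u, offset=OFFSET):
    """Methylation index for one CpG site from methylated (m) and unmethylated (u) signal."""
    return m / (m + u + offset)


def quantile_normalize(columns):
    """Quantile-normalize equal-length samples (one list of intensities per sample).

    Each value is replaced by the mean, across samples, of the values sharing
    its rank. Ties are broken by position, which is a simplification.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value over all samples.
    reference = [sum(col[k] for col in sorted_cols) / len(sorted_cols) for k in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # probe indices by ascending value
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


def normalize_then_beta(m_cols, u_cols, probe_types):
    """Normalize M and U separately within each probe type, then compute betas."""
    n_samples, n_probes = len(m_cols), len(probe_types)
    betas = [[0.0] * n_probes for _ in range(n_samples)]
    for ptype in sorted(set(probe_types)):
        idx = [i for i, t in enumerate(probe_types) if t == ptype]
        m_norm = quantile_normalize([[col[i] for i in idx] for col in m_cols])
        u_norm = quantile_normalize([[col[i] for i in idx] for col in u_cols])
        for s in range(n_samples):
            for k, i in enumerate(idx):
                betas[s][i] = beta(m_norm[s][k], u_norm[s][k])
    return betas


# Hypothetical toy data: two samples, four probes (two Type I, two Type II).
m = [[1000, 3000, 200, 5000], [1200, 2800, 400, 4600]]
u = [[4000, 500, 3800, 300], [4400, 700, 3600, 500]]
types = ["I", "I", "II", "II"]
betas = normalize_then_beta(m, u, types)
```

Normalizing the intensities before forming the ratio is the design choice the abstract argues for. Normalizing the betas directly would skip the two `quantile_normalize` calls and apply one to the ratios instead.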


Citations
Journal Article
TL;DR: A suite of computational tools that incorporate state-of-the-art statistical techniques for the analysis of DNAm data are described that include methods for preprocessing, quality assessment and detection of differentially methylated regions from the kilobase to the megabase scale.
Abstract: Motivation: The recently released Infinium HumanMethylation450 array (the '450k' array) provides a high-throughput assay to quantify DNA methylation (DNAm) at ∼450 000 loci across a range of genomic features. Although less comprehensive than high-throughput sequencing-based techniques, this product is more cost-effective and promises to be the most widely used DNAm high-throughput measurement technology over the next several years. Results: Here we describe a suite of computational tools that incorporate state-of-the-art statistical techniques for the analysis of DNAm data. The software is structured to easily adapt to future versions of the technology. We include methods for preprocessing, quality assessment and detection of differentially methylated regions from the kilobase to the megabase scale. We show how our software provides a powerful and flexible development platform for future methods. We also illustrate how our methods empower the technology to make discoveries previously thought to be possible only with sequencing-based methods. Availability and implementation: http://bioconductor.org/packages/release/bioc/html/minfi.html. Contact: khansen@jhsph.edu; rafa@jimmy.harvard.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

2,961 citations

Journal Article
TL;DR: The EPIC array is a significant improvement over the HM450 array, with increased genome coverage of regulatory regions and high reproducibility and reliability, providing a valuable tool for high-throughput human methylome analyses from diverse clinical samples.
Abstract: In recent years the Illumina HumanMethylation450 (HM450) BeadChip has provided a user-friendly platform to profile DNA methylation in human samples. However, HM450 lacked coverage of distal regulatory elements. Illumina have now released the MethylationEPIC (EPIC) BeadChip, with new content specifically designed to target these regions. We have used HM450 and whole-genome bisulphite sequencing (WGBS) to perform a critical evaluation of the new EPIC array platform. EPIC covers over 850,000 CpG sites, including >90 % of the CpGs from the HM450 and an additional 413,743 CpGs. Even though the additional probes improve the coverage of regulatory elements, including 58 % of FANTOM5 enhancers, only 7 % distal and 27 % proximal ENCODE regulatory elements are represented. Detailed comparisons of regulatory elements from EPIC and WGBS show that a single EPIC probe is not always informative for those distal regulatory elements showing variable methylation across the region. However, overall data from the EPIC array at single loci are highly reproducible across technical and biological replicates and demonstrate high correlation with HM450 and WGBS data. We show that the HM450 and EPIC arrays distinguish differentially methylated probes, but the absolute agreement depends on the threshold set for each platform. Finally, we provide an annotated list of probes whose signal could be affected by cross-hybridisation or underlying genetic variation. The EPIC array is a significant improvement over the HM450 array, with increased genome coverage of regulatory regions and high reproducibility and reliability, providing a valuable tool for high-throughput human methylome analyses from diverse clinical samples.

825 citations

Journal Article
TL;DR: An integrated analysis pipeline offering a choice of the most popular normalization methods while also introducing new methods for calling differentially methylated regions and detecting copy number aberrations is presented.
Abstract: The Illumina Infinium HumanMethylation450 BeadChip is a new platform for high-throughput DNA methylation analysis. Several methods for normalization and processing of these data have been published recently. Here we present an integrated analysis pipeline offering a choice of the most popular normalization methods while also introducing new methods for calling differentially methylated regions and detecting copy number aberrations. Availability and implementation: ChAMP is implemented as a Bioconductor package in R. The package and the vignette can be downloaded at bioconductor.org

705 citations

Journal Article
02 Nov 2017-Cell
TL;DR: This large-scale analysis of 206 adult soft tissue sarcomas reveals previously unappreciated sarcoma-type-specific changes in copy number, methylation, RNA, and protein, providing insights into refining sarcoma therapy and relationships to other cancer types.

684 citations


Cites methods from "A data-driven approach to preproces..."

  • ...For the methylation data (Illumina Infinium 450k arrays), the median absolute deviation was employed to select the top 4000 most variable CpG sites after beta-mixture quantile normalization (Pidsley et al., 2013)....

    [...]
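The selection step quoted above, ranking CpG sites by median absolute deviation (MAD) and keeping the most variable ones, can be sketched as follows. This is a minimal Python illustration, not the cited study's code. The site names and beta values are hypothetical, and `k` would be 4000 in the setting the excerpt describes.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: the median of |x - median(x)|."""
    values = list(values)
    center = median(values)
    return median([abs(x - center) for x in values])


def top_variable_sites(betas_by_site, k):
    """Return the names of the k sites whose betas have the largest MAD across samples."""
    ranked = sorted(betas_by_site, key=lambda site: mad(betas_by_site[site]), reverse=True)
    return ranked[:k]


# Hypothetical normalized betas for three sites measured in four samples.
sites = {
    "cg_a": [0.10, 0.12, 0.11, 0.09],
    "cg_b": [0.20, 0.80, 0.50, 0.90],
    "cg_c": [0.40, 0.50, 0.60, 0.45],
}
selected = top_variable_sites(sites, k=2)
```

MAD is less sensitive to single outlying samples than the variance, which is a common reason for preferring it when ranking sites.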

Journal Article
TL;DR: The algorithm, functional normalization, is adapted to the Illumina 450k methylation array and outperforms all existing normalization methods with respect to replication of results between experiments, and yields robust results even in the presence of batch effects.
Abstract: We propose an extension to quantile normalization that removes unwanted technical variation using control probes. We adapt our algorithm, functional normalization, to the Illumina 450k methylation array and address the open problem of normalizing methylation data with global epigenetic changes, such as human cancers. Using data sets from The Cancer Genome Atlas and a large case–control study, we show that our algorithm outperforms all existing normalization methods with respect to replication of results between experiments, and yields robust results even in the presence of batch effects. Functional normalization can be applied to any microarray platform, provided suitable control probes are available.

649 citations


Cites background or methods from "A data-driven approach to preproces..."

  • ..., 2013], dasen [Pidsley et al., 2013], and noob [Triche et al....

    [...]

  • ...Supplementary Figures S3a,b contains results for additional preprocessing methods: BMIQ [Teschendorff et al., 2013], SWAN [Maksimovic et al., 2012] and dasen [Pidsley et al., 2013]....

    [...]

  • ...Funnorm improves X and Y chromosomes probes prediction in blood samples. As suggested previously [Pidsley et al., 2013], one can benchmark performance by identifying DMPs associated with sex....

    [...]

  • ...Several methods have been proposed for normalization of the 450k array, including Quantile normalization [Touleimat and Tost, 2012, Aryee et al., 2014], SWAN [Maksimovic et al., 2012], BMIQ [Teschendorff et al., 2013], dasen [Pidsley et al., 2013], and noob [Triche et al., 2013]....

    [...]

References
Journal Article
TL;DR: Details of the aims and methods of Bioconductor, the collaborative creation of extensible software for computational biology and bioinformatics, and current challenges are described.
Abstract: The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.

12,142 citations

Journal Article
TL;DR: There is no obvious downside to using RMA and attaching a standard error (SE) to this quantity using a linear model which removes probe-specific affinities, and the exploratory data analyses of the probe level data motivate a new summary measure that is a robust multi-array average (RMA) of background-adjusted, normalized, and log-transformed PM values.
Abstract: In this paper we report exploratory analyses of high-density oligonucleotide array data from the Affymetrix GeneChip® system with the objective of improving upon currently used measures of gene expression. Our analyses make use of three data sets: a small experimental study consisting of five MGU74A mouse GeneChip® arrays; part of the data from an extensive spike-in study conducted by Gene Logic and Wyeth’s Genetics Institute involving 95 HG-U95A human GeneChip® arrays; and part of a dilution study conducted by Gene Logic involving 75 HG-U95A GeneChip® arrays. We display some familiar features of the perfect match and mismatch probe (PM and MM) values of these data, and examine the variance–mean relationship with probe-level data from probes believed to be defective, and so delivering noise only. We explain why we need to normalize the arrays to one another using probe level intensities. We then examine the behavior of the PM and MM using spike-in data and assess three commonly used summary measures: Affymetrix’s (i) average difference (AvDiff) and (ii) MAS 5.0 signal, and (iii) the Li and Wong multiplicative model-based expression index (MBEI). The exploratory data analyses of the probe level data motivate a new summary measure that is a robust multiarray average (RMA) of background-adjusted, normalized, and log-transformed PM values. We evaluate the four expression summary measures using the dilution study data, assessing their behavior in terms of bias, variance and (for MBEI and RMA) model fit. Finally, we evaluate the algorithms in terms of their ability to detect known levels of differential expression using the spike-in data. We conclude that there is no obvious downside to using RMA and attaching a standard error (SE) to this quantity using a linear model which removes probe-specific affinities.

10,711 citations


"A data-driven approach to preproces..." refers background or methods in this paper

  • ...Standard, specially constructed control datasets produced by spiking samples have been influential in the gene expression field [13], but would not be as suitable for analysis of DNA methylation....

    [...]

  • ...Quantile normalization (QN) is a well established technique in gene expression analysis, where it has been shown to perform well [13]....

    [...]

Book Chapter
01 Jan 2005
TL;DR: This chapter starts with the simplest replicated designs and progresses through experiments with two or more groups, direct designs, factorial designs and time course experiments with technical as well as biological replication.
Abstract: A survey is given of differential expression analyses using the linear modeling features of the limma package. The chapter starts with the simplest replicated designs and progresses through experiments with two or more groups, direct designs, factorial designs and time course experiments. Experiments with technical as well as biological replication are considered. Empirical Bayes test statistics are explained. The use of quality weights, adaptive background correction and control spots in conjunction with linear modelling is illustrated on the β7 data.

5,920 citations

Book
27 Jan 2006
TL;DR: All methods in the book are illustrated with publicly available data, a major section is devoted to fully worked case studies, and the underlying code is available on a companion website so that readers can reproduce every number, figure and table on their own computers.
Abstract: Full four-color book. Some of the editors created the Bioconductor project and Robert Gentleman is one of the two originators of R. All methods are illustrated with publicly available data, and a major section of the book is devoted to fully worked case studies. Code underlying all of the computations that are shown is made available on a companion website, and readers can reproduce every number, figure, and table on their own computers.

2,625 citations

Journal Article
TL;DR: A 2007 entry in the Journal of the American Statistical Association (Vol. 102, No. 477, pp. 388-389) on the book Bioinformatics and Computational Biology Solutions Using R and Bioconductor.
Abstract: (2007). Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Journal of the American Statistical Association: Vol. 102, No. 477, pp. 388-389.

1,743 citations