LSimpute: accurate estimation of missing values in microarray data with least squares methods.

doi:10.1093/NAR/GNH026

Open Access · Journal Article

Trond Hellem Bø, +2 more

Nucleic Acids Research, Vol. 32, Iss. 3, 01 Feb 2004

TLDR

Presents novel methods for estimating missing values in microarray data sets, based on the least squares principle and utilizing correlations between both genes and arrays.

Abstract:

Microarray experiments generate data sets with information on the expression levels of thousands of genes in a set of biological samples. Unfortunately, such experiments often produce multiple missing expression values, normally due to various experimental problems. As many algorithms for gene expression analysis require a complete data matrix as input, the missing values have to be estimated in order to analyze the available data. Alternatively, genes and arrays can be removed until no missing values remain. However, for genes or arrays with only a small number of missing values, it is desirable to impute those values. For the subsequent analysis to be as informative as possible, it is essential that the estimates for the missing gene expression values are accurate. Even a small number of badly estimated missing values in the data might be enough for clustering methods, such as hierarchical clustering or K-means clustering, to produce misleading results. Thus, accurate methods for missing value estimation are needed. We present novel methods for estimation of missing values in microarray data sets that are based on the least squares principle, and that utilize correlations between both genes and arrays. For this set of methods, we use the common reference name LSimpute. We compare the estimation accuracy of our methods with the widely used KNNimpute on three complete data matrices from public data sets by randomly knocking out data (labeling as missing). From these tests, we conclude that our LSimpute methods produce estimates that are consistently more accurate than those obtained using KNNimpute. Additionally, we examine a more classic approach to missing value estimation based on expectation maximization (EM). We refer to our EM implementations as EMimpute, and the estimation errors of the EMimpute methods are compared with those of our novel methods. The results indicate that, on average, the estimates from our best performing LSimpute method are at least as accurate as those from the best EMimpute algorithm.
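The knock-out evaluation described in the abstract can be illustrated with a small sketch. The code below is not the LSimpute algorithm itself: it is a simplified gene-based least squares imputer written for illustration, run on synthetic data rather than the public data sets the paper uses. The function names (`knock_out`, `ls_impute_genes`, `rmse`), the choice of `k`, the weighting by absolute correlation, and the restriction to complete predictor genes are all assumptions made here. Row-mean imputation serves only as a simple baseline; it stands in for neither KNNimpute nor EMimpute.

```python
import numpy as np

def knock_out(complete, rate, rng):
    """Label a random fraction of entries as missing (NaN); return data and mask."""
    mask = rng.random(complete.shape) < rate
    damaged = complete.copy()
    damaged[mask] = np.nan
    return damaged, mask

def ls_impute_genes(data, k=10):
    """Simplified gene-based least squares imputation (illustrative only).

    For each gene with missing values, fit a single-variable least squares
    regression against each of the k most correlated complete genes, then
    average the predictions weighted by |correlation| (an assumed weighting).
    """
    filled = data.copy()
    n_genes = data.shape[0]
    complete_rows = [h for h in range(n_genes) if not np.isnan(data[h]).any()]
    for g in range(n_genes):
        miss = np.isnan(data[g])
        if not miss.any():
            continue
        obs = ~miss
        if obs.sum() < 3:
            # Too few observations for a regression: fall back to the global mean.
            filled[g, miss] = np.nanmean(data)
            continue
        # Rank candidate predictor genes by absolute correlation on observed arrays.
        ranked = []
        for h in complete_rows:
            r = np.corrcoef(data[g, obs], data[h, obs])[0, 1]
            if np.isfinite(r):
                ranked.append((abs(r), h))
        ranked.sort(reverse=True)
        top = ranked[:k]
        if not top:
            filled[g, miss] = np.mean(data[g, obs])
            continue
        estimate = np.zeros(miss.sum())
        weight_sum = 0.0
        for w, h in top:
            # Least squares fit: gene g as a linear function of gene h.
            slope, intercept = np.polyfit(data[h, obs], data[g, obs], 1)
            estimate += (w + 1e-12) * (intercept + slope * data[h, miss])
            weight_sum += w + 1e-12
        filled[g, miss] = estimate / weight_sum
    return filled

def rmse(estimated, truth, mask):
    """Root mean squared error over the knocked-out entries only."""
    diff = estimated[mask] - truth[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic "expression matrix": 150 genes x 16 arrays with correlated genes.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2, 16))
loadings = rng.normal(size=(150, 2))
complete = loadings @ latent + 0.1 * rng.normal(size=(150, 16))

damaged, mask = knock_out(complete, rate=0.05, rng=rng)

# Baseline: replace each missing value with the mean of its gene (row).
row_means = np.nanmean(damaged, axis=1, keepdims=True)
mean_filled = np.where(np.isnan(damaged), row_means, damaged)

ls_filled = ls_impute_genes(damaged, k=10)

rmse_mean = rmse(mean_filled, complete, mask)
rmse_ls = rmse(ls_filled, complete, mask)
print(f"row-mean RMSE: {rmse_mean:.3f}")
print(f"least squares RMSE: {rmse_ls:.3f}")
```

Because the true values of the knocked-out entries are known, the error of any imputation method can be measured directly on them, which is the basis of the comparison the abstract describes. The synthetic matrix here has strong gene-to-gene correlation by construction, so the numbers it produces say nothing about performance on real microarray data.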

Citations

Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling

Activation of IFN pathways and plasmacytoid dendritic cell recruitment in target organs of primary Sjögren’s syndrome

Missing value estimation for DNA microarray gene expression data: local least squares imputation

Gene expression profiling of minor salivary glands clearly distinguishes primary Sjögren's syndrome patients from healthy control subjects.

Integrative analysis of transcriptomic and proteomic data: challenges, solutions and applications.

Related Papers (5)

Missing value estimation methods for DNA microarrays.

Missing value estimation for DNA microarray gene expression data: local least squares imputation

A Bayesian missing value estimation method for gene expression profile data

Comprehensive Identification of Cell Cycle–regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization

Gaussian mixture clustering and imputation of microarray data