Topic

Imputation (statistics)

About: Imputation (statistics) is a research topic. Over the lifetime, 8203 publications have been published within this topic receiving 315547 citations. The topic is also known as: data imputation.

...read moreread less

Papers published on a yearly basis

1 / 2

Papers

PDF

Open Access

More filters

Journal Article•DOI•

MissForest—non-parametric missing value imputation for mixed-type data

[...]

Daniel J. Stekhoven¹, Peter Bühlmann¹•Institutions (1)

ETH Zurich¹

01 Jan 2012-Bioinformatics

TL;DR: In this comparative study, missForest outperforms other methods of imputation especially in data settings where complex interactions and non-linear relations are suspected and the out-of-bag imputation error estimates of missForest prove to be adequate in all settings.

...read moreread less

Abstract: Motivation Modern data acquisition based on high-throughput technology is often facing the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete set. Missing value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data, the different types are usually handled separately. Therefore, these methods ignore possible relations between variable types. We propose a non-parametric method which can cope with different types of variables simultaneously. Results We compare several state of the art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need of a test set. Evaluation is performed on multiple datasets coming from a diverse selection of biological fields with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in datasets including different types of variables. In our comparative study, missForest outperforms other methods of imputation especially in data settings where complex interactions and non-linear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data. Availability The package missForest is freely available from http://stat.ethz.ch/CRAN/. Contact stekhoven@stat.math.ethz.ch; buhlmann@stat.math.ethz.ch

...read moreread less

2,928 citations

Journal Article•DOI•

Amelia II: A Program for Missing Data

[...]

James Honaker, Gary King, Matthew Blackwell

12 Dec 2011-Journal of Statistical Software

TL;DR: The Amelia II package implements a new expectation-maximization with bootstrapping algorithm that works faster, with larger numbers of variables, and is far easier to use, than various Markov chain Monte Carlo approaches, but gives essentially the same answers.

...read moreread less

Abstract: Amelia II is a complete R package for multiple imputation of missing data. The package implements a new expectation-maximization with bootstrapping algorithm that works faster, with larger numbers of variables, and is far easier to use, than various Markov chain Monte Carlo approaches, but gives essentially the same answers. The program also improves imputation models by allowing researchers to put Bayesian priors on individual cell values, thereby including a great deal of potentially valuable and extensive information. It also includes features to accurately impute cross-sectional datasets, individual time series, or sets of time series for different cross-sections. A full set of graphical diagnostics are also available. The program is easy to use, and the simplicity of the algorithm makes it far more robust; both a simple command line and extensive graphical user interface are included.

...read moreread less

2,404 citations

Book•

Flexible Imputation of Missing Data

[...]

Stef van Buuren

29 Mar 2012

TL;DR: The problem of missing data concepts of MCAR, MAR and MNAR simple solutions that do not (always) work multiple imputation in a nutshell and some dangers, some do's and some don'ts are covered.

...read moreread less

Abstract: Basics Introduction The problem of missing data Concepts of MCAR, MAR and MNAR Simple solutions that do not (always) work Multiple imputation in a nutshell Goal of the book What the book does not cover Structure of the book Exercises Multiple imputation Historic overview Incomplete data concepts Why and when multiple imputation works Statistical intervals and tests Evaluation criteria When to use multiple imputation How many imputations? Exercises Univariate missing data How to generate multiple imputations Imputation under the normal linear normal Imputation under non-normal distributions Predictive mean matching Categorical data Other data types Classification and regression trees Multilevel data Non-ignorable methods Exercises Multivariate missing data Missing data pattern Issues in multivariate imputation Monotone data imputation Joint Modeling Fully Conditional Specification FCS and JM Conclusion Exercises Imputation in practice Overview of modeling choices Ignorable or non-ignorable? Model form and predictors Derived variables Algorithmic options Diagnostics Conclusion Exercises Analysis of imputed data What to do with the imputed data? Parameter pooling Statistical tests for multiple imputation Stepwise model selection Conclusion Exercises Case studies Measurement issues Too many columns Sensitivity analysis Correct prevalence estimates from self-reported data Enhancing comparability Exercises Selection issues Correcting for selective drop-out Correcting for non-response Exercises Longitudinal data Long and wide format SE Fireworks Disaster Study Time raster imputation Conclusion Exercises Extensions Conclusion Some dangers, some do's and some don'ts Reporting Other applications Future developments Exercises Appendices: Software R S-Plus Stata SAS SPSS Other software References Author Index Subject Index

...read moreread less

2,156 citations

Journal Article•DOI•

Multiple Imputation of Missing Values

[...]

Patrick Royston

01 Aug 2004-Stata Journal

TL;DR: This article describes an implementation for Stata of the MICE method of multiple multivariate imputation, described by van Buuren, Boshuizen, and Knook (1999), and describes five ado-files, which create multiple mult variables and utilities to intercon-vert datasets created by mvis and by the miset program from John Carlin and colleagues.

...read moreread less

Abstract: Following the seminal publications of Rubin about thirty years ago, statisticians have become increasingly aware of the inadequacy of "complete-case" analysis of datasets with missing observations. In medicine, for example, observa- tions may be missing in a sporadic way for different covariates, and a complete-case analysis may omit as many as half of the available cases. Hotdeck imputation was implemented in Stata in 1999 by Mander and Clayton. However, this technique may perform poorly when many rows of data have at least one missing value. This article describes an implementation for Stata of the MICE method of multiple multivariate imputation described by van Buuren, Boshuizen, and Knook (1999). MICE stands for multivariate imputation by chained equations. The basic idea of data analysis with multiple imputation is to create a small number (e.g., 5-10) of copies of the data, each of which has the missing values suitably imputed, and analyze each complete dataset independently. Estimates of parameters of inter- est are averaged across the copies to give a single estimate. Standard errors are computed according to the "Rubin rules", devised to allow for the between- and within-imputation components of variation in the parameter estimates. This arti- cle describes five ado-files. mvis creates multiple multivariate imputations. uvis imputes missing values for a single variable as a function of several covariates, each with complete data. micombine fits a wide variety of regression models to a mul- tiply imputed dataset, combining the estimates using Rubin's rules, and supports survival analysis models (stcox and streg), categorical data models, generalized linear models, and more. Finally, misplit and mijoin are utilities to intercon- vert datasets created by mvis and by the miset program from John Carlin and colleagues. The use of the routines is illustrated with an example of prognostic modeling in breast cancer.

...read moreread less

2,132 citations

Journal Article•DOI•

Multiple imputation of discrete and continuous data by fully conditional specification

[...]

Stef van Buuren¹•Institutions (1)

Utrecht University¹

01 Jun 2007-Statistical Methods in Medical Research

TL;DR: FCS is a semi-parametric and flexible alternative that specifies the multivariate model by a series of conditional models, one for each incomplete variable, but its statistical properties are difficult to establish.

...read moreread less

Abstract: The goal of multiple imputation is to provide valid inferences for statistical estimates from incomplete data. To achieve that goal, imputed values should preserve the structure in the data, as well as the uncertainty about this structure, and include any knowledge about the process that generated the missing data. Two approaches for imputing multivariate data exist: joint modeling (JM) and fully conditional specification (FCS). JM is based on parametric statistical theory, and leads to imputation procedures whose statistical properties are known. JM is theoretically sound, but the joint model may lack flexibility needed to represent typical data features, potentially leading to bias. FCS is a semi-parametric and flexible alternative that specifies the multivariate model by a series of conditional models, one for each incomplete variable. FCS provides tremendous flexibility and is easy to apply, but its statistical properties are difficult to establish. Simulation work shows that FCS behaves very well in ...

...read moreread less

2,119 citations

Collapse

Network Information

Performance

Metrics

10,423

Papers

378,760

Citations

No. of papers in the topic in previous years
Year	Papers
2024	4
2023	740
2022	1,477
2021	852
2020	724
2019	603

Imputation (statistics)

Papers published on a yearly basis

Papers

Trending Questions (10)

Network Information

Related Topics (5)

Performance

Metrics