Journal ArticleDOI

Comparison of imputation methods for missing laboratory data in medicine

01 Aug 2013-BMJ Open (BMJ Publishing Group)-Vol. 3, Iss: 8
TL;DR: MissForest is a highly accurate method of imputation for missing laboratory data and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in two clinical predictive models.
Abstract: Objectives Missing laboratory data is a common issue, but the optimal method of imputation of missing values has not been determined. The aims of our study were to compare the accuracy of four imputation methods for missing completely at random laboratory data and to compare the effect of the imputed values on the accuracy of two clinical predictive models. Design Retrospective cohort analysis of two large data sets. Setting A tertiary level care institution in Ann Arbor, Michigan. Participants The Cirrhosis cohort had 446 patients and the Inflammatory Bowel Disease cohort had 395 patients. Methods Non-missing laboratory data were randomly removed with varying frequencies from two large data sets, and we then compared the ability of four methods—missForest, mean imputation, nearest neighbour imputation and multivariate imputation by chained equations (MICE)—to impute the simulated missing data. We characterised the accuracy of the imputation and the effect of the imputation on predictive ability in two large data sets. Results MissForest had the least imputation error for both continuous and categorical variables at each frequency of missingness, and it had the smallest prediction difference when models used imputed laboratory values. In both data sets, MICE had the second least imputation error and prediction difference, followed by the nearest neighbour and mean imputation. Conclusions MissForest is a highly accurate method of imputation for missing laboratory data and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in two clinical predictive models.
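
To make the protocol concrete, the sketch below reproduces the simulate-remove-impute-score loop in R with the CRAN packages missForest, mice, and VIM; the synthetic `labs` data frame, the 10% missingness rate, and the NRMSE scoring are illustrative assumptions rather than the study's exact setup.

```r
library(missForest)  # missForest() and prodNA() for simulating MCAR missingness
library(mice)        # multivariate imputation by chained equations
library(VIM)         # kNN() nearest-neighbour imputation (Gower distance)

set.seed(1)
# Hypothetical complete data frame of laboratory values (stand-in for real data)
labs <- data.frame(matrix(rnorm(446 * 8), ncol = 8))

# Remove 10% of values completely at random
labs_mis <- prodNA(labs, noNA = 0.10)

# 1. missForest: iterative random-forest imputation
imp_rf <- missForest(labs_mis)$ximp

# 2. MICE: a single completed data set suffices for an error comparison
imp_mice <- complete(mice(labs_mis, m = 1, printFlag = FALSE))

# 3. Nearest neighbour (Gower distance); drop the *_imp indicator columns
imp_knn <- kNN(labs_mis, imp_var = FALSE)

# 4. Mean imputation, column by column
imp_mean <- as.data.frame(lapply(labs_mis, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}))

# Normalised RMSE over the cells that were removed
nrmse <- function(imp, mis, true) {
  sq_err <- (imp[is.na(mis)] - true[is.na(mis)])^2
  sqrt(mean(sq_err)) / sd(unlist(true))
}
sapply(list(missForest = imp_rf, MICE = imp_mice,
            kNN = imp_knn, mean = imp_mean),
       nrmse, mis = labs_mis, true = labs)
```
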
Citations
Journal ArticleDOI
TL;DR: Among hospitalized patients with COVID-19 and coexisting hypertension, inpatient use of ACEI/ARB was associated with a lower risk of all-cause mortality compared with ACEI/ARB nonusers, and it is unlikely that in-hospital use of ACEI/ARB was associated with an increased mortality risk.
Abstract: Rationale: Use of ACEIs (angiotensin-converting enzyme inhibitors) and ARBs (angiotensin II receptor blockers) is a major concern for clinicians treating coronavirus disease 2019 (COVID-19) in pati...

938 citations

Journal ArticleDOI
TL;DR: This primer highlights several differences between efficacy and effectiveness studies including study design, patient populations, intervention design, data analysis, and result reporting.
Abstract: Although efficacy and effectiveness studies are both important when evaluating interventions, they serve distinct purposes and have different study designs. Unfortunately, the distinction between these two types of trials is often poorly understood. In this primer, we highlight several differences between these two types of trials including study design, patient populations, intervention design, data analysis, and result reporting.

535 citations

Journal ArticleDOI
TL;DR: A retrospective study on 13,981 patients with COVID-19 in Hubei Province, China found that the risk of 28-day all-cause mortality was 5.2% and 9.4% in the matched statin and non-statin groups, respectively, with a hazard ratio of 0.58, implying the potential benefits of statin therapy in hospitalized subjects with COVID-19.

381 citations

Journal ArticleDOI
TL;DR: RF imputation is found to be generally robust, with performance improving with increasing correlation; performance was good under moderate to high missingness, and even when data were missing not at random.
Abstract: Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, imputation performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on the fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting-the latter class representing a generalization of a new promising imputation algorithm called missForest. Our findings reveal RF imputation to be generally robust with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data was missing not at random.
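
The missForest algorithm that this class of methods generalises is easy to state: initialise missing cells with a simple guess, then cycle through the variables, regressing each on all the others with a random forest and overwriting its missing cells with the forest's predictions, until the imputations stabilise. A bare-bones R sketch for a numeric-only data frame follows; the fixed iteration cap and simple relative-change stopping rule are simplifications of the published criterion, which also handles categorical variables:

```r
library(randomForest)

rf_impute <- function(df, max_iter = 10, tol = 1e-4) {
  na_mask <- is.na(df)
  # Initialise every missing cell with its column mean
  for (j in seq_along(df)) {
    df[na_mask[, j], j] <- mean(df[[j]], na.rm = TRUE)
  }
  prev <- df
  for (iter in seq_len(max_iter)) {
    # Visit variables in increasing order of missingness, as missForest does
    for (j in order(colSums(na_mask))) {
      if (!any(na_mask[, j])) next
      obs <- !na_mask[, j]
      # Regress variable j on all others, using only rows where j is observed
      fit <- randomForest(x = df[obs, -j, drop = FALSE], y = df[obs, j])
      # Overwrite the originally missing cells with the forest's predictions
      df[!obs, j] <- predict(fit, df[!obs, -j, drop = FALSE])
    }
    # Stop once imputed values barely change between iterations
    if (sum((as.matrix(df) - as.matrix(prev))^2) /
        sum(as.matrix(df)^2) < tol) break
    prev <- df
  }
  df
}
```
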

368 citations


Cites methods from "Comparison of imputation methods for missing laboratory data in medicine"

  • ...Comparisons of RF imputation to other procedures have been considered by [21,26], and there have been studies looking at effectiveness of RF imputation when combined with other methods (for instance Shah et al....


  • ...MissForest has been shown [26] to outperform well-known methods such as k-nearest neighbors (KNN) [22] and parametric MICE [25] (multivariate imputation using chained equations)....


References
Journal Article
TL;DR: Copyright (©) 1999–2012 R Foundation for Statistical Computing; permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and permission notice are preserved on all copies.
Abstract: Copyright (©) 1999–2012 R Foundation for Statistical Computing. Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the R Core Team.

272,030 citations

Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 1996, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
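
The paper's key theoretical result ties those internal estimates together: writing $s$ for the strength of the individual tree classifiers and $\bar{\rho}$ for the mean correlation between them, the generalisation error of the forest satisfies

$$PE^{*} \le \frac{\bar{\rho}\,(1 - s^{2})}{s^{2}},$$

so anything that raises strength or lowers correlation, such as randomising the features tried at each split, tightens the bound.
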

79,257 citations

01 Jan 2007
TL;DR: Random forests add an additional layer of randomness to bagging and are robust against overfitting; the randomForest package provides an R interface to the Fortran programs by Breiman and Cutler.
Abstract: Recently there has been a lot of interest in “ensemble learning” — methods that generate many classifiers and aggregate their results. Two well-known methods are boosting (see, e.g., Schapire et al., 1998) and bagging (Breiman, 1996) of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors. In the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees — each is independently constructed using a bootstrap sample of the data set. In the end, a simple majority vote is taken for prediction. Breiman (2001) proposed random forests, which add an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines and neural networks, and is robust against overfitting (Breiman, 2001). In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values. The randomForest package provides an R interface to the Fortran programs by Breiman and Cutler (available at http://www.stat.berkeley.edu/users/breiman/). This article provides a brief introduction to the usage and features of the R functions.
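
In practice, use of the package does come down to those two parameters. A minimal sketch on R's built-in iris data; the values of ntree and mtry here are illustrative, not recommendations:

```r
library(randomForest)

set.seed(42)
# Classification forest: 500 trees, 2 candidate variables tried per split
fit <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

print(fit)                 # confusion matrix and out-of-bag error estimate
importance(fit)            # variable importance from the internal estimates
predict(fit, iris[1:3, ])  # predicted classes for new observations
```
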

14,830 citations

Journal ArticleDOI
TL;DR: Mice adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing imputed values, specialized pooling routines, model selection tools, and diagnostic graphs.
Abstract: The R package mice imputes incomplete multivariate data by chained equations. The software mice 1.0 appeared in the year 2000 as an S-PLUS library, and in 2001 as an R package. mice 1.0 introduced predictor selection, passive imputation and automatic pooling. This article documents mice, which extends the functionality of mice 1.0 in several ways. In mice, the analysis of imputed data is made completely general, whereas the range of models under which pooling works is substantially extended. mice adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing imputed values, specialized pooling routines, model selection tools, and diagnostic graphs. Imputation of categorical data is improved in order to bypass problems caused by perfect prediction. Special attention is paid to transformations, sum scores, indices and interactions using passive imputation, and to the proper setup of the predictor matrix. mice can be downloaded from the Comprehensive R Archive Network. This article provides a hands-on, stepwise approach to solve applied incomplete data problems.
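
The intended workflow is impute, analyse, pool. A minimal sketch using nhanes, an example data set shipped with the package: m completed data sets are generated, the same linear model is fitted to each, and pool() combines the estimates with Rubin's rules.

```r
library(mice)

# nhanes ships with the package and contains missing values
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Fit the same model to each of the m completed data sets
fits <- with(imp, lm(chl ~ age + bmi))

# Pool the m sets of estimates with Rubin's rules
summary(pool(fits))
```
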

10,234 citations


"Comparison of imputation methods fo..." refers methods in this paper

  • ...We implemented this in R using the package ‘mice’.(12)...


Journal ArticleDOI
TL;DR: A general coefficient measuring the similarity between two sampling units is defined and the matrix of similarities between all pairs of sample units is shown to be positive semidefinite.
Abstract: A general coefficient measuring the similarity between two sampling units is defined. The matrix of similarities between all pairs of sample units is shown to be positive semidefinite (except possibly when there are missing values). This is important for the multidimensional Euclidean representation of the sample and also establishes some inequalities amongst the similarities relating three individuals. The definition is extended to cope with a hierarchy of characters.
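
Concretely, Gower's coefficient between units $i$ and $j$ over $p$ characters is a weighted average of per-character scores: $s_{ijk}$ is 1 or 0 for a match or mismatch on a categorical character, and $s_{ijk} = 1 - |x_{ik} - x_{jk}|/R_k$ for a quantitative character with range $R_k$, while the weight $\delta_{ijk}$ is 0 whenever the comparison on character $k$ is impossible (e.g. a missing value) and 1 otherwise:

$$S_{ij} = \frac{\sum_{k=1}^{p} \delta_{ijk}\, s_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}}$$
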

4,204 citations


"Comparison of imputation methods fo..." refers methods in this paper

  • ...In order to accommodate both continuous and categorical variables, the Gower distance is used.(10) For the categorical variables, we imputed the missing values by weighted mode instead of a weighted mean as used for continuous variables.... (a sketch of this step follows below)

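A bare-bones R sketch of that imputation step, assuming Gower dissimilarities from the cluster package's daisy(); the helper name knn_impute_cell, the choice of k and the inverse-distance weights are illustrative, not the paper's exact implementation:

```r
library(cluster)  # daisy() computes Gower dissimilarities for mixed data

# Impute cell (i, j) of data frame df from its k nearest neighbours
knn_impute_cell <- function(df, i, j, k = 5) {
  d <- as.matrix(daisy(df, metric = "gower"))[i, ]    # Gower distance to row i
  donors <- order(d)[-1]                              # nearest rows, excluding i itself
  donors <- head(donors[!is.na(df[donors, j])], k)    # k nearest with variable j observed
  w <- 1 / (d[donors] + 1e-6)                         # inverse-distance weights
  if (is.numeric(df[[j]])) {
    sum(w * df[donors, j]) / sum(w)                   # weighted mean for continuous
  } else {
    names(which.max(tapply(w, df[donors, j], sum)))   # weighted mode for categorical
  }
}
```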