
Showing papers on "Imputation (statistics)" published in 2014


Journal ArticleDOI
TL;DR: The current user-friendly review provides five easy-to-understand practical guidelines, with the goal of reducing missing data bias and error in the reporting of research results.
Abstract: Missing data (a) reside at three missing data levels of analysis (item-, construct-, and person-level), (b) arise from three missing data mechanisms (missing completely at random, missing at random...

646 citations


Journal ArticleDOI
TL;DR: This work reviewed 80 articles of empirical studies published in the 2012 issues of the Journal of Pediatric Psychology to present a picture of how adequately missing data are currently handled in this field.
Abstract: We provide conceptual introductions to missingness mechanisms--missing completely at random, missing at random, and missing not at random--and state-of-the-art methods of handling missing data--full-information maximum likelihood and multiple imputation--followed by a discussion of planned missing designs: multiform questionnaire protocols, 2-method measurement models, and wave-missing longitudinal designs. We reviewed 80 articles of empirical studies published in the 2012 issues of the Journal of Pediatric Psychology to present a picture of how adequately missing data are currently handled in this field. To illustrate the benefits of using multiple imputation or full-information maximum likelihood and incorporating planned missingness into study designs, we provide example analyses of empirical data gathered using a 3-form planned missing design.

495 citations
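
The 3-form design mentioned in this abstract is a standard planned-missingness layout: a common block of items is administered to everyone, and each form omits one of three rotating blocks. The sketch below shows one way to generate such a pattern; it is an illustration under that standard layout, not code from the paper, and all item and block names are invented.

```python
# Hypothetical 3-form planned missing design: a common block X plus three
# rotating blocks A, B, C; each form omits exactly one of A, B, C, so the
# resulting missingness is missing completely at random by design.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

blocks = {
    "X": ["x1", "x2"],            # core items administered on every form
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2", "b3"],
    "C": ["c1", "c2", "c3"],
}
forms = {1: ["X", "A", "B"], 2: ["X", "A", "C"], 3: ["X", "B", "C"]}

n = 300
all_items = [item for block in blocks.values() for item in block]
full = pd.DataFrame(rng.normal(size=(n, len(all_items))), columns=all_items)
form_assignment = rng.integers(1, 4, size=n)       # random form per respondent

observed = full.copy()
for i, f in enumerate(form_assignment):
    administered = {item for b in forms[f] for item in blocks[b]}
    observed.loc[i, [c for c in all_items if c not in administered]] = np.nan

print(observed.isna().mean())   # each rotating block is missing for roughly 1/3 of respondents
```

Because the missingness is determined by the design rather than by the respondents, full-information maximum likelihood or multiple imputation can recover the complete-data covariance structure, which is the point the abstract makes.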


Journal ArticleDOI
TL;DR: Comparing parametric MICE with a random forest-based MICE algorithm suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
Abstract: Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001-2010) with complete data on all covariates. Variables were artificially made "missing at random," and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.

413 citations
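
The parametric-versus-random-forest comparison described above can be approximated with off-the-shelf tools. The sketch below is not the authors' code or data: it uses scikit-learn's IterativeImputer (a chained-equations-style imputer) with BayesianRidge as the parametric arm and RandomForestRegressor as the forest arm, on simulated data with a deliberately nonlinear dependence, and performs single rather than multiple imputation for brevity.

```python
# Chained-equations imputation with a parametric model vs. a random forest
# model, scored on artificially masked values of a nonlinearly dependent
# variable. Simulated data; illustration only.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 ** 2 + rng.normal(scale=0.5, size=n)     # nonlinear dependence on x1
X = np.column_stack([x1, x2])

X_miss = X.copy()
mar = (x1 > 0) & (rng.random(n) < 0.5)           # missingness depends on observed x1
X_miss[mar, 1] = np.nan

imputers = {
    "parametric (BayesianRidge)": IterativeImputer(estimator=BayesianRidge(),
                                                   random_state=0),
    "random forest": IterativeImputer(estimator=RandomForestRegressor(
        n_estimators=100, random_state=0), random_state=0),
}
for name, imp in imputers.items():
    X_hat = imp.fit_transform(X_miss)
    rmse = np.sqrt(np.mean((X_hat[mar, 1] - X[mar, 1]) ** 2))
    print(f"{name}: RMSE on masked values = {rmse:.3f}")
```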


Journal ArticleDOI
TL;DR: PMM and LRD may have a role for imputing covariates that are not strongly associated with the outcome, and when the imputation model is thought to be slightly but not grossly misspecified; both are compared with fully parametric imputation in simulation studies.
Abstract: Multiple imputation is a commonly used method for handling incomplete covariates as it can provide valid inference when data are missing at random. This depends on being able to correctly specify the parametric model used to impute missing values, which may be difficult in many realistic settings. Imputation by predictive mean matching (PMM) borrows an observed value from a donor with a similar predictive mean; imputation by local residual draws (LRD) instead borrows the donor’s residual. Both methods relax some assumptions of parametric imputation, promising greater robustness when the imputation model is misspecified. We review development of PMM and LRD and outline the various forms available, and aim to clarify some choices about how and when they should be used. We compare performance to fully parametric imputation in simulation studies, first when the imputation model is correctly specified and then when it is misspecified. In using PMM or LRD we strongly caution against using a single donor, the default value in some implementations, and instead advocate sampling from a pool of around 10 donors. We also clarify which matching metric is best. Among the current MI software there are several poor implementations. PMM and LRD may have a role for imputing covariates (i) which are not strongly associated with outcome, and (ii) when the imputation model is thought to be slightly but not grossly misspecified. Researchers should spend efforts on specifying the imputation model correctly, rather than expecting predictive mean matching or local residual draws to do the work.

336 citations
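
The core of predictive mean matching is simple enough to write out directly. The following sketch is a simplified single-variable, single-imputation version (a full multiple-imputation implementation would also draw the regression parameters before matching); the donor pool of 10 follows the advice above against single-donor matching.

```python
# Minimal predictive mean matching (PMM) for one incomplete variable:
# regress y on X using the observed cases, then replace each missing y by an
# observed value borrowed from one of the 10 donors whose predicted means
# are closest to that case's predicted mean.
import numpy as np
from sklearn.linear_model import LinearRegression

def pmm_impute(X, y, n_donors=10, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    obs = ~np.isnan(y)
    model = LinearRegression().fit(X[obs], y[obs])
    pred = model.predict(X)                       # predictive means for all cases
    y_imp = y.copy()
    for i in np.flatnonzero(~obs):
        dist = np.abs(pred[obs] - pred[i])        # distance in predicted-mean space
        donors = np.argsort(dist)[:n_donors]      # 10 closest observed cases
        y_imp[i] = rng.choice(y[obs][donors])     # borrow the donor's observed value
    return y_imp

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(scale=0.3, size=500)
y[rng.random(500) < 0.2] = np.nan
y_completed = pmm_impute(X, y, rng=rng)
```

Local residual draws (LRD) would instead add the chosen donor's residual to the recipient's predicted mean rather than copying the donor's observed value outright.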


Journal ArticleDOI
TL;DR: In this article, the authors evaluate the performance of four approaches for estimating missing values in trait databases (KNN, multivariate imputation by chained equations (mice), missForest and Phylopars), and test whether imputed datasets retain underlying allometric relationships among traits.
Abstract: 1. Despite efforts in data collection, missing values are commonplace in life-history trait databases. Because these values typically are not missing randomly, the common practice of removing missing data not only reduces sample size, but also introduces bias that can lead to incorrect conclusions. Imputing missing values is a potential solution to this problem. Here, we evaluate the performance of four approaches for estimating missing values in trait databases (K-nearest neighbour (kNN), multivariate imputation by chained equations (mice), missForest and Phylopars), and test whether imputed datasets retain underlying allometric relationships among traits. 2. Starting with a nearly complete trait dataset on the mammalian order Carnivora (using four traits), we artificially removed values so that the percent of missing values ranged from 10% to 80%. Using the original values as a reference, we assessed imputation performance using normalized root mean squared error. We also evaluated whether including phylogenetic information improved imputation performance in kNN, mice, and missForest (it is a required input in Phylopars). Finally, we evaluated the extent to which the allometric relationship between two traits (body mass and longevity) was conserved for imputed datasets by looking at the difference (bias) between the slope of the original and the imputed datasets or datasets with missing values removed. 3. Three of the tested approaches (mice, missForest and Phylopars) resulted in qualitatively equivalent imputation performance, and all had significantly lower errors than kNN. Adding phylogenetic information into the imputation algorithms improved estimation of missing values for all tested traits. The allometric relationship between body mass and longevity was conserved when up to 60% of data were missing, either with or without phylogenetic information, depending on the approach. This relationship was less biased in imputed datasets compared to datasets with missing values removed, especially when more than 30% of values were missing. 4. Imputations provide valuable alternatives to removing missing observations in trait databases as they produce low errors and retain relationships among traits. Although we must continue to prioritize data collection on species traits, imputations can provide a valuable solution for conducting macroecological and evolutionary studies using life-history trait databases.

264 citations
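
The evaluation design in this abstract (artificially remove values, impute, score with normalized RMSE against the held-out truth) is easy to reproduce in outline. The sketch below uses simulated trait values and a kNN imputer as a stand-in for the four methods compared in the paper, normalizes RMSE by the standard deviation of the masked true values (one common convention), and keeps at least one observed trait per "species" so every row can be imputed.

```python
# Artificial-missingness evaluation of an imputation method using normalized RMSE.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(3)
n_species, n_traits = 300, 4
traits = rng.normal(size=(n_species, n_traits))          # stand-in for (log) trait values

for frac_missing in (0.1, 0.3, 0.6, 0.8):
    mask = rng.random(traits.shape) < frac_missing
    keep = rng.integers(0, n_traits, size=n_species)     # guarantee one observed trait per row
    mask[np.arange(n_species), keep] = False

    incomplete = traits.copy()
    incomplete[mask] = np.nan
    imputed = KNNImputer(n_neighbors=5).fit_transform(incomplete)

    nrmse = (np.sqrt(np.mean((imputed[mask] - traits[mask]) ** 2))
             / np.std(traits[mask]))
    print(f"~{int(frac_missing * 100)}% missing: NRMSE = {nrmse:.2f}")
```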


Journal ArticleDOI
TL;DR: A large gap is apparent between statistical methods research related to missing data and use of these methods in application settings, including RCTs in top medical journals.
Abstract: Missing outcome data is a threat to the validity of treatment effect estimates in randomized controlled trials. We aimed to evaluate the extent, handling, and sensitivity analysis of missing data and intention-to-treat (ITT) analysis of randomized controlled trials (RCTs) in top tier medical journals, and compare our findings with previous reviews related to missing data and ITT in RCTs. Review of RCTs published between July and December 2013 in the BMJ, JAMA, Lancet, and New England Journal of Medicine, excluding cluster randomized trials and trials whose primary outcome was survival. Of the 77 identified eligible articles, 73 (95%) reported some missing outcome data. The median percentage of participants with a missing outcome was 9% (range 0 – 70%). The most commonly used method to handle missing data in the primary analysis was complete case analysis (33, 45%), while 20 (27%) performed simple imputation, 15 (19%) used model based methods, and 6 (8%) used multiple imputation. 27 (35%) trials with missing data reported a sensitivity analysis. However, most did not alter the assumptions of missing data from the primary analysis. Reports of ITT or modified ITT were found in 52 (85%) trials, with 21 (40%) of them including all randomized participants. A comparison to a review of trials reported in 2001 showed that missing data rates and approaches are similar, but the use of the term ITT has increased, as has the report of sensitivity analysis. Missing outcome data continues to be a common problem in RCTs. Definitions of the ITT approach remain inconsistent across trials. A large gap is apparent between statistical methods research related to missing data and use of these methods in application settings, including RCTs in top medical journals.

247 citations


Posted Content
TL;DR: The 1992 Survey of Consumer Finances consisted of five complete data sets because missing data are multiply imputed as mentioned in this paper, and the value of using all five data sets and the risk of using only a single data set in empirical research are explained.
Abstract: The 1992 Survey of Consumer Finances consisted of five complete data sets because missing data are multiply imputed. The incidence of missing data in the 1992 SCF is addressed and illustrates the difficulty of obtaining financial information from individuals. The value of using all five data sets and the risk of using only a single data set in empirical research are explained. Estimates derived separately from each data set are compared to results using all five data sets to illustrate the extra variability in the data due to imputation. Researchers are encouraged to use information from all five data sets in order to make valid inferences.

225 citations
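
The reason all five implicates should be analysed together is Rubin's combining rules: the five point estimates are averaged, and the spread between implicates is folded into the total variance, which an analysis of a single implicate would understate. A small numerical illustration with invented estimates:

```python
# Rubin's rules for combining estimates across m = 5 multiply imputed data sets.
# The estimates and within-imputation variances below are hypothetical.
import numpy as np

q = np.array([41.2, 39.8, 42.5, 40.7, 41.9])   # point estimates, one per implicate
u = np.array([1.10, 1.05, 1.20, 1.08, 1.15])   # squared standard errors (within-imputation variances)

m = len(q)
q_bar = q.mean()                               # combined point estimate
u_bar = u.mean()                               # average within-imputation variance
b = q.var(ddof=1)                              # between-imputation variance
total_var = u_bar + (1 + 1 / m) * b            # Rubin's total variance

print(f"combined estimate: {q_bar:.2f}")
print(f"single-implicate SE ~ {np.sqrt(u_bar):.2f} vs combined SE {np.sqrt(total_var):.2f}")
```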


Journal ArticleDOI
TL;DR: This paper considers the issues of missing data at each stage of research, and the importance of sensitivity analyses, including the role of missing not at random models, such as pattern mixture, selection, and shared parameter models.
Abstract: Patient-reported outcomes are increasingly used in health research, including randomized controlled trials and observational studies. However, the validity of results in longitudinal studies can crucially hinge on the handling of missing data. This paper considers the issues of missing data at each stage of research. Practical strategies for minimizing missingness through careful study design and conduct are given. Statistical approaches that are commonly used, but should be avoided, are discussed, including how these methods can yield biased and misleading results. Methods that are valid for data which are missing at random are outlined, including maximum likelihood methods, multiple imputation and extensions to generalized estimating equations: weighted generalized estimating equations, generalized estimating equations with multiple imputation, and doubly robust generalized estimating equations. Finally, we discuss the importance of sensitivity analyses, including the role of missing not at random models, such as pattern mixture, selection, and shared parameter models. We demonstrate many of these concepts with data from a randomized controlled clinical trial on renal cancer patients, and show that the results are dependent on missingness assumptions and the statistical approach.

218 citations


Journal ArticleDOI
TL;DR: How ‘missing at random’ differs from ‘missing completely at random’ is clarified via an imagined dialogue between a clinical researcher and a statistician.
Abstract: The terminology describing missingness mechanisms is confusing. In particular the meaning of ‘missing at random’ is often misunderstood, leading researchers faced with missing data problems away from multiple imputation, a method with considerable advantages. The purpose of this article is to clarify how ‘missing at random’ differs from ‘missing completely at random’ via an imagined dialogue between a clinical researcher and statistician.

216 citations
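
The distinction the dialogue draws can be made concrete with a toy simulation (not from the article): when missingness in the outcome depends only on a fully observed covariate (MAR), the complete cases are no longer representative, even though methods that condition on that covariate, such as multiple imputation, can still recover the truth.

```python
# MCAR vs MAR in a toy simulation: complete-case means stay unbiased under
# MCAR but become biased under MAR, where missingness depends on observed x.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
x = rng.normal(size=n)                      # always observed (e.g. baseline severity)
y = 2.0 + 1.5 * x + rng.normal(size=n)      # outcome with true mean 2.0

y_mcar = np.where(rng.random(n) < 0.4, np.nan, y)       # missing completely at random

p_miss = 1 / (1 + np.exp(-2 * x))                       # higher x -> more likely missing
y_mar = np.where(rng.random(n) < p_miss, np.nan, y)     # missing at random given x

print("true mean of y:       ", round(float(y.mean()), 3))
print("complete cases, MCAR: ", round(float(np.nanmean(y_mcar)), 3))   # approximately unbiased
print("complete cases, MAR:  ", round(float(np.nanmean(y_mar)), 3))    # biased downwards
```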


Journal ArticleDOI
TL;DR: This study explores the performance of simple and more advanced methods for handling missing data in cases when some, many, or all item scores are missing in a multi-item instrument and recommends applying MI to the item scores to get the most accurate regression model estimates.

197 citations


Journal ArticleDOI
TL;DR: This work develops a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available and increases the magnitude and statistical evidence of enrichment at genic versus non-genic loci for these traits, as compared with an analysis without 1000G imputation.
Abstract: MOTIVATION: Imputation using external reference panels (e.g. 1000 Genomes) is a widely used approach for increasing power in genome-wide association studies and meta-analysis. Existing hidden Markov model (HMM)-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available. RESULTS: In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1-5%) variants [increasing to 87% (60%) when summary linkage disequilibrium information is available from target samples] versus the gold standard of 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and it is computationally very fast. As an empirical demonstration, we apply our method to seven case-control phenotypes from the Wellcome Trust Case Control Consortium (WTCCC) data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of [Formula: see text] association statistics) compared with HMM-based imputation from individual-level genotypes at the 227 (176) published single nucleotide polymorphisms (SNPs) in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of four lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic versus non-genic loci for these traits, as compared with an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses. AVAILABILITY AND IMPLEMENTATION: Publicly available software package available at http://bogdan.bioinformatics.ucla.edu/software/. CONTACT: bpasaniuc@mednet.ucla.edu or aprice@hsph.harvard.edu. SUPPLEMENTARY INFORMATION: Supplementary materials are available at Bioinformatics online.
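
The central computation, imputing the association z-score at an untyped SNP as the conditional mean of a multivariate Gaussian whose covariance is the reference-panel LD matrix, fits in a few lines. The sketch below is a simplified illustration of that idea rather than the released software; the LD values and the ridge term (standing in for the reference-panel-size adjustment the abstract mentions) are invented.

```python
# Gaussian imputation of summary z-scores:
#   z_untyped ~ LD_ut @ inverse(LD_tt + lambda * I) @ z_typed,
# where LD_tt is LD among typed SNPs and LD_ut is LD of the untyped SNP with them.
import numpy as np

def impute_z(z_typed, ld_tt, ld_ut, lam=0.1):
    k = ld_tt.shape[0]
    return ld_ut @ np.linalg.solve(ld_tt + lam * np.eye(k), z_typed)

# Toy example: 3 typed SNPs, 1 untyped SNP, made-up LD structure.
ld_tt = np.array([[1.0, 0.6, 0.2],
                  [0.6, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])
ld_ut = np.array([[0.7, 0.5, 0.3]])
z_typed = np.array([3.1, 2.4, 1.2])

print("imputed z-score:", impute_z(z_typed, ld_tt, ld_ut))
```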

Journal ArticleDOI
TL;DR: It is concluded that, when interaction effects are present in a dataset, substantial gains are possible by using recursive partitioning for imputation compared to standard applications.

Journal ArticleDOI
TL;DR: The package MissMech implements two tests of MCAR that can be run using a function called TestMCARNormality; one of the tests is valid if data are normally distributed, and the other does not require any distributional assumptions for the data.
Abstract: Researchers are often faced with analyzing data sets that are not complete. To properly analyze such data sets requires the knowledge of the missing data mechanism. If data are missing completely at random (MCAR), then many missing data analysis techniques lead to valid inference. Thus, tests of MCAR are desirable. The package MissMech implements two tests developed by Jamshidian and Jalal (2010) for this purpose. These tests can be run using a function called TestMCARNormality. One of the tests is valid if data are normally distributed, and another test does not require any distributional assumptions for the data. In addition to testing MCAR, in some special cases, the function TestMCARNormality is also able to test whether data have a multivariate normal distribution. As a bonus, the functions in MissMech can also be used for the following additional tasks: (i) test of homoscedasticity for several groups when data are completely observed, (ii) perform the k-sample test of Anderson-Darling to determine whether k groups of univariate data come from the same distribution, (iii) impute incomplete data sets using two methods, one where normality is assumed and one where no specific distributional assumptions are made, (iv) obtain normal-theory maximum likelihood estimates for mean and covariance matrix when data are incomplete, along with their standard errors, and finally (v) perform Neyman's test of uniformity. All of these features are explained in the paper, including examples.


Journal ArticleDOI
TL;DR: The authors reviewed the current literature on missing data handling methods within the special context of education research to summarize the pros and cons of various methods and provide guidelines for future research in this area.
Abstract: Missing data are a common occurrence in survey-based research studies in education, and the way missing values are handled can significantly affect the results of analyses based on such data. Despite known problems with performance of some missing data handling methods, such as mean imputation, many researchers in education continue to use those methods as a quick fix. This study reviews the current literature on missing data handling methods within the special context of education research to summarize the pros and cons of various methods and provides guidelines for future research in this area.

01 Jan 2014
TL;DR: One outcome of the statistical analyses undertaken in this study is the formulation of easy-to-implement guidelines for educational researchers that allows one to choose one of the following factors when all others are given: sample size, proportion of missing data in the sample, method of analysis, and missing data handling method.
Abstract: The effect of a number of factors, such as the choice of analytical method, the handling method for missing data, sample size, and proportion of missing data, were examined to evaluate the effect of missing data treatment on accuracy of estimation. A methodological approach involving simulated data was adopted. One outcome of the statistical analyses undertaken in this study is the formulation of easy-to-implement guidelines for educational researchers that allows one to choose one of the following factors when all others are given: sample size, proportion of missing data in the sample, method of analysis, and missing data handling method.

Journal ArticleDOI
TL;DR: A new variable importance measure is presented that is applicable to any kind of data, whether or not it contains missing values; it takes the occurrence of missing values into account, which also makes its results differ from those obtained under multiple imputation.
Abstract: Random forests are widely used in many research fields for prediction and interpretation purposes. Their popularity is rooted in several appealing characteristics, such as their ability to deal with high dimensional data, complex interactions and correlations between variables. Another important feature is that random forests provide variable importance measures that can be used to identify the most important predictor variables. Though there are alternatives like complete case analysis and imputation, existing methods for the computation of such measures cannot be applied straightforwardly when the data contains missing values. This paper presents a solution to this pitfall by introducing a new variable importance measure that is applicable to any kind of data--whether it does or does not contain missing values. An extensive simulation study shows that the new measure meets sensible requirements and shows good variable ranking properties. An application to two real data sets also indicates that the new approach may provide a more sensible variable ranking than the widespread complete case analysis. It takes the occurrence of missing values into account, which makes results also differ from those obtained under multiple imputation.

Journal ArticleDOI
TL;DR: Different substitutes for missing values, namely zero, mean, median, k-nearest neighbours (kNN) and random forest (RF) imputation, are analysed in terms of their influence on unsupervised and supervised learning and, thus, their impact on the final output(s) of biological interpretation.
Abstract: Missing values are known to be problematic for the analysis of gas chromatography-mass spectrometry (GC-MS) metabolomics data. Typically these values cover about 10%–20% of all data and can originate from various backgrounds, including analytical, computational, as well as biological. Currently, the most well known substitute for missing values is a mean imputation. In fact, some researchers consider this aspect of data analysis in their metabolomics pipeline as so routine that they do not even mention using this replacement approach. However, this may have a significant influence on the data analysis output(s) and might be highly sensitive to the distribution of samples between different classes. Therefore, in this study we have analysed different substitutes of missing values namely: zero, mean, median, k-nearest neighbours (kNN) and random forest (RF) imputation, in terms of their influence on unsupervised and supervised learning and, thus, their impact on the final output(s) in terms of biological interpretation. These comparisons have been demonstrated both visually and computationally (classification rate) to support our findings. The results show that the selection of the replacement methods to impute missing values may have a considerable effect on the classification accuracy, if performed incorrectly this may negatively influence the biomarkers selected for an early disease diagnosis or identification of cancer related metabolites. In the case of GC-MS metabolomics data studied here our findings recommend that RF should be favored as an imputation of missing value over the other tested methods. This approach displayed excellent results in terms of classification rate for both supervised methods namely: principal components-linear discriminant analysis (PC-LDA) (98.02%) and partial least squares-discriminant analysis (PLS-DA) (97.96%) outperforming other imputation methods.
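
The comparison described above can be mimicked on simulated data with standard tooling. The sketch below is not the authors' pipeline: it swaps PC-LDA/PLS-DA for a plain logistic regression, uses scikit-learn imputers for the zero, mean, median and kNN substitutes (random forest imputation is omitted here; IterativeImputer with a forest estimator could stand in), and scores each variant by cross-validated classification accuracy.

```python
# Effect of the missing-value substitute on downstream classification accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X, y = make_classification(n_samples=400, n_features=30, n_informative=10,
                           random_state=0)
X[rng.random(X.shape) < 0.15] = np.nan          # roughly 15% missing values

imputers = {
    "zero":   SimpleImputer(strategy="constant", fill_value=0.0),
    "mean":   SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "kNN":    KNNImputer(n_neighbors=5),
}
for name, imputer in imputers.items():
    clf = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:>6}: cross-validated accuracy = {acc:.3f}")
```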

Journal ArticleDOI
TL;DR: Among the various methods, probabilistic principal component analysis (PPCA) yields the best performance in all aspects; it can be used to impute data online before further analysis and is robust to weather changes.
Abstract: Many traffic management and control applications require highly complete and accurate data of traffic flow. However, because of various reasons such as sensor failure or transmission error, it is common that some traffic flow data are lost. As a result, various methods were proposed by using a wide spectrum of techniques to estimate missing traffic data in the last two decades. Generally, these missing data imputation methods can be categorised into three kinds: prediction methods, interpolation methods and statistical learning methods. To assess their performance, these methods are compared from different aspects in this paper, including reconstruction errors, statistical behaviours and running speeds. Results show that statistical learning methods are more effective than the other two kinds of imputation methods when data of a single detector is utilised. Among various methods, the probabilistic principal component analysis (PPCA) yields best performance in all aspects. Numerical tests demonstrate that PPCA can be used to impute data online before making further analysis (e.g. make traffic prediction) and is robust to weather changes.
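
PPCA-based imputation can be approximated by an EM-style loop that alternates between fitting a low-rank model to the current completed matrix and re-filling only the missing cells from its reconstruction. The sketch below is a simplified stand-in for PPCA (ordinary PCA in place of the probabilistic model) applied to a toy matrix shaped like days-by-time-of-day traffic counts; it is not the procedure or data used in the paper.

```python
# Iterative low-rank (PCA) imputation as a simplified analogue of PPCA imputation.
import numpy as np
from sklearn.decomposition import PCA

def lowrank_impute(X, n_components=2, n_iter=50):
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)    # start from column means
    for _ in range(n_iter):
        pca = PCA(n_components=n_components)
        recon = pca.inverse_transform(pca.fit_transform(filled))
        filled[miss] = recon[miss]                       # update only the missing cells
    return filled

rng = np.random.default_rng(6)
profile = np.sin(np.linspace(0, 2 * np.pi, 96))          # daily flow profile, 96 intervals
X = 100 + 40 * np.outer(1 + 0.1 * rng.normal(size=60), profile) \
        + 5 * rng.normal(size=(60, 96))                  # 60 days of simulated counts
X[rng.random(X.shape) < 0.2] = np.nan                    # 20% of readings lost
X_hat = lowrank_impute(X)
```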

Journal ArticleDOI
TL;DR: In this paper, the problem of high-dimensional covariance matrix estimation with missing observations was studied and a simple procedure was proposed, which is computationally tractable in high-dimension and does not require imputation of the missing data.
Abstract: In this paper, we study the problem of high-dimensional covariance matrix estimation with missing observations. We propose a simple procedure computationally tractable in high-dimension and that does not require imputation of the missing data. We establish non-asymptotic sparsity oracle inequalities for the estimation of the covariance matrix involving the Frobenius and the spectral norms which are valid for any setting of the sample size, probability of a missing observation and the dimensionality of the covariance matrix. We further establish minimax lower bounds showing that our rates are minimax optimal up to a logarithmic factor.
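
To give the flavor of estimating a covariance matrix from incomplete data without imputing, the sketch below uses the standard inverse-probability rescaling of the zero-filled empirical second-moment matrix, under the simplifying assumptions that the data are zero-mean and that each entry is observed independently with a known probability delta. It is a generic illustration of this idea, not the paper's estimator or its theoretical guarantees.

```python
# Covariance estimation with missing entries via inverse-probability rescaling:
# off-diagonal entries of the zero-filled second-moment matrix are divided by
# delta^2, diagonal entries by delta (assuming zero-mean data and Bernoulli(delta)
# observation of each entry).
import numpy as np

rng = np.random.default_rng(7)
p, n, delta = 20, 5000, 0.7

A = rng.normal(size=(p, p))
sigma = A @ A.T / p                                  # true covariance
X = rng.multivariate_normal(np.zeros(p), sigma, size=n)

observed = rng.random(X.shape) < delta               # which entries were observed
Y = np.where(observed, X, 0.0)                       # zero-fill the missing entries

S = Y.T @ Y / n                                      # second moment of the zero-filled data
sigma_hat = S / delta ** 2
np.fill_diagonal(sigma_hat, np.diag(S) / delta)      # diagonal entries need only 1/delta

print("max absolute error:", np.abs(sigma_hat - sigma).max().round(3))
```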

Journal ArticleDOI
TL;DR: Predictive mean matching performance is at least as good as the investigated dedicated methods for imputing semicontinuous data and, in contrast to the other methods, it is the only one that yields plausible imputations and preserves the original data distributions.
Abstract: Multiple imputation methods properly account for the uncertainty of missing data. One of those methods for creating multiple imputations is predictive mean matching (PMM), a general purpose method. Little is known about the performance of PMM in imputing non-normal semicontinuous data (skewed data with a point mass at a certain value and otherwise continuously distributed). We investigate the performance of PMM as well as dedicated methods for imputing semicontinuous data by performing simulation studies under univariate and multivariate missingness mechanisms. We also investigate the performance on real-life datasets. We conclude that PMM performance is at least as good as the investigated dedicated methods for imputing semicontinuous data and, in contrast to other methods, is the only method that yields plausible imputations and preserves the original data distributions. © 2014 The Authors.

Journal ArticleDOI
TL;DR: Existing imputation methods for phenomic data are investigated, a novel concept of "imputability measure" (IM) is introduced to identify missing values that are fundamentally inadequate to impute, and a self-training selection (STS) scheme is proposed to select the best imputation method.
Abstract: In modern biomedical research of complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods require complete data matrix, imputation is a common and practical solution. In high-throughput experiments such as microarray experiments, continuous intensities are measured and many mature missing value imputation methods have been developed and widely applied. Numerous methods for missing data imputation of microarray data have been developed. Large phenomic data, however, contain continuous, nominal, binary and ordinal data types, which void application of most methods. Though several methods have been developed in the past few years, not a single complete guideline is proposed with respect to phenomic missing data imputation. In this paper, we investigated existing imputation methods for phenomic data, proposed a self-training selection (STS) scheme to select the best imputation method and provide a practical guideline for general applications. We introduced a novel concept of "imputability measure" (IM) to identify missing values that are fundamentally inadequate to impute. In addition, we also developed four variations of K-nearest-neighbor (KNN) methods and compared with two existing methods, multivariate imputation by chained equations (MICE) and missForest. The four variations are imputation by variables (KNN-V), by subjects (KNN-S), their weighted hybrid (KNN-H) and an adaptively weighted hybrid (KNN-A). We performed simulations and applied different imputation methods and the STS scheme to three lung disease phenomic datasets to evaluate the methods. An R package "phenomeImpute" is made publicly available. Simulations and applications to real datasets showed that MICE often did not perform well; KNN-A, KNN-H and random forest were among the top performers although no method universally performed the best. Imputation of missing values with low imputability measures increased imputation errors greatly and could potentially deteriorate downstream analyses. The STS scheme was accurate in selecting the optimal method by evaluating methods in a second layer of missingness simulation. All source files for the simulation and the real data analyses are available on the author's publication website.
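
A rough analogue of the "by subjects" versus "by variables" distinction can be obtained by running a kNN imputer on the data matrix and on its transpose. The sketch below is not the phenomeImpute package: the KNN-S/KNN-V correspondence drawn here is an interpretation, the data are simulated and purely continuous, and real phenomic data with nominal or ordinal variables would first need suitable encoding and scaling.

```python
# kNN imputation with rows as neighbours (subject-wise) vs columns as
# neighbours (variable-wise), compared on artificially masked entries.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(8)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 12)) + 0.3 * rng.normal(size=(200, 12))

X_miss = X.copy()
mask = rng.random(X.shape) < 0.1
X_miss[mask] = np.nan

by_subjects = KNNImputer(n_neighbors=5).fit_transform(X_miss)          # neighbours are similar rows
by_variables = KNNImputer(n_neighbors=5).fit_transform(X_miss.T).T     # neighbours are similar columns

for name, X_hat in (("by subjects", by_subjects), ("by variables", by_variables)):
    rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
    print(f"kNN {name}: RMSE = {rmse:.3f}")
```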

Journal ArticleDOI
TL;DR: This paper investigated the performance of classical and model-based approaches in empirical data, accounting for different kinds of missing responses simultaneously, and confirmed the existence of a unidimensional tendency to omit items.
Abstract: Data from competence tests usually show a number of missing responses on test items due to both omitted and not-reached items. Different approaches for dealing with missing responses exist, and there are no clear guidelines on which of those to use. While classical approaches rely on an ignorable missing data mechanism, the most recently developed model-based approaches account for nonignorable missing responses. Model-based approaches include the missing propensity in the measurement model. Although these models are very promising, the assumptions made in these models have not yet been tested for plausibility in empirical data. Furthermore, studies investigating the performance of different approaches have only focused on one kind of missing response at once. In this study, we investigated the performance of classical and model-based approaches in empirical data, accounting for different kinds of missing responses simultaneously. We confirmed the existence of a unidimensional tendency to omit items. Indicating nonignorability of the missing mechanism, missing tendency due to both omitted and not-reached items correlated with ability. However, results on parameter estimation showed that ignoring missing

Journal ArticleDOI
TL;DR: The method of MI is introduced, as well as a discussion surrounding when MI can be a useful method for handling missing data and the drawbacks of this approach, when exploring the association between current asthma status and forced expiratory volume using data from a population‐based longitudinal cohort study.
Abstract: Missing data are common in both observational and experimental studies. Multiple imputation (MI) is a two-stage approach where missing values are imputed a number of times using a statistical model based on the available data and then inference is combined across the completed datasets. This approach is becoming increasingly popular for handling missing data. In this paper, we introduce the method of MI, as well as a discussion surrounding when MI can be a useful method for handling missing data and the drawbacks of this approach. We illustrate MI when exploring the association between current asthma status and forced expiratory volume in 1 s after adjustment for potential confounders using data from a population-based longitudinal cohort study.

Journal ArticleDOI
TL;DR: This paper presents simple methods for missing values imputation like using most common value, mean or median, closest fit approach and methods based on data mining algorithms like k-nearest neighbor, neural networks and association rules, discusses their usability and presents issues with their applicability on examples.
Abstract: Many existing industrial and research data sets contain missing values due to various reasons, such as manual data entry procedures, equipment errors and incorrect measurements. Problems associated with missing values are loss of efficiency, complications in handling and analyzing the data and bias resulting from differences between missing and complete data. The important factor for selecting an approach to missing values is the missing data mechanism. There are various strategies for dealing with missing values. Some analytical methods have their own approach to handling missing values. Data set reduction is another option. Finally, the missing values problem can be handled by missing values imputation. This paper presents simple methods for missing values imputation such as using the most common value, mean or median, or the closest fit approach, and methods based on data mining algorithms like k-nearest neighbor, neural networks and association rules; it discusses their usability and presents issues with their applicability on examples.
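
The simplest of the substitutes listed above can be applied in a couple of lines of pandas; the sketch below (column names and values invented) fills numeric columns with their median and categorical columns with their most common value.

```python
# Median and most-common-value imputation on a small mixed-type data set.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temperature": [21.5, np.nan, 19.8, 22.1, np.nan],
    "pressure":    [1.01, 0.98, np.nan, 1.03, 1.00],
    "machine":     ["A", "B", None, "A", "A"],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

imputed = df.copy()
imputed[num_cols] = imputed[num_cols].fillna(imputed[num_cols].median())   # median for numeric columns
for c in cat_cols:
    imputed[c] = imputed[c].fillna(imputed[c].mode().iloc[0])              # most common value for categoricals

print(imputed)
```

Methods such as closest fit, k-nearest neighbor, neural networks and association rules need dedicated implementations and are not shown here.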

Journal ArticleDOI
TL;DR: Bias was greater when the match rate was low or the identifier error rate was high; in these cases, PII performed better than HW classification at reducing bias due to false matches. The study highlights the importance of evaluating the potential impact of linkage error on results.
Abstract: Background: Linkage of electronic healthcare records is becoming increasingly important for research purposes. However, linkage error due to mis-recorded or missing identifiers can lead to biased results. We evaluated the impact of linkage error on estimated infection rates using two different methods for classifying links: highest-weight (HW) classification using probabilistic match weights and prior-informed imputation (PII) using match probabilities. Methods: A gold-standard dataset was created through deterministic linkage of unique identifiers in admission data from two hospitals and infection data recorded at the hospital laboratories (original data). Unique identifiers were then removed and data were re-linked by date of birth, sex and Soundex using two classification methods: i) HW classification - accepting the candidate record with the highest weight exceeding a threshold and ii) PII–imputing values from a match probability distribution. To evaluate methods for linking data with different error rates, non-random error and different match rates, we generated simulation data. Each set of simulated files was linked using both classification methods. Infection rates in the linked data were compared with those in the gold-standard data. Results: In the original gold-standard data, 1496/20924 admissions linked to an infection. In the linked original data, PII provided least biased results: 1481 and 1457 infections (upper/lower thresholds) compared with 1316 and 1287 (HW upper/lower thresholds). In the simulated data, substantial bias (up to 112%) was introduced when linkage error varied by hospital. Bias was also greater when the match rate was low or the identifier error rate was high and in these cases, PII performed better than HW classification at reducing bias due to false-matches. Conclusions: This study highlights the importance of evaluating the potential impact of linkage error on results. PII can help incorporate linkage uncertainty into analysis and reduce bias due to linkage error, without requiring identifiers.

Journal ArticleDOI
TL;DR: In this paper, the authors characterize the stationary distributions of iterative imputations and their statistical properties, and give a set of sufficient conditions under which the imputation distribution converges in total variation to the posterior distribution of a Bayesian model.
Abstract: Iterative imputation, in which variables are imputed one at a time each given a model predicting from all the others, is a popular technique that can be convenient and flexible, as it replaces a potentially difficult multivariate modeling problem with relatively simple univariate regressions. In this paper, we begin to characterize the stationary distributions of iterative imputations and their statistical properties. More precisely, when the conditional models are compatible (defined in the text), we give a set of sufficient conditions under which the imputation distribution converges in total variation to the posterior distribution of a Bayesian model. When the conditional models are incompatible but are valid, we show that the combined imputation estimator is consistent.
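
The iterative scheme whose stationary distribution the paper studies can be written out explicitly: each variable with missing values is regressed on all the others and re-imputed in turn, sweep after sweep. The sketch below is a bare-bones illustration of that loop on simulated data, not an implementation of the paper's theory; it draws only residual noise, whereas a proper multiple-imputation implementation would also draw the regression parameters.

```python
# One chain of iterative (chained-equations style) imputation, written out
# explicitly: sweep over the incomplete variables, regress each on the others,
# and refill its missing entries with prediction plus residual noise.
import numpy as np
from sklearn.linear_model import LinearRegression

def chained_sweeps(X, n_sweeps=20, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)    # crude starting values
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(filled, j, axis=1)
            obs = ~miss[:, j]
            model = LinearRegression().fit(others[obs], filled[obs, j])
            resid_sd = np.std(filled[obs, j] - model.predict(others[obs]))
            pred = model.predict(others[~obs])
            filled[~obs, j] = pred + rng.normal(scale=resid_sd, size=pred.shape)
    return filled

rng = np.random.default_rng(9)
cov = np.array([[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]])
X = rng.multivariate_normal(np.zeros(3), cov, size=500)
X[rng.random(X.shape) < 0.2] = np.nan
X_completed = chained_sweeps(X, rng=rng)
```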

Journal ArticleDOI
TL;DR: It is demonstrated that using incomplete cases often increases the effectiveness of nearest neighbor imputation (especially at higher missingness levels), regardless of the type of missingness.

Journal ArticleDOI
TL;DR: A novel Nearest Neighbor (NN) imputation method is proposed that estimates missing data in WSNs by learning spatial and temporal correlations between sensor nodes, utilizing a kd-tree data structure, a non-parametric, data-driven binary search tree.
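
The general idea can be sketched with a kd-tree over sensor coordinates: a node's missing reading is estimated from its k nearest reporting neighbours. The code below is an illustration of that idea only (spatial neighbours, simple averaging, simulated coordinates and readings); the paper's method additionally learns temporal correlations, which are ignored here.

```python
# Nearest-neighbour imputation of missing sensor readings using a kd-tree
# built over the locations of the nodes that did report.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(10)
n_nodes = 200
coords = rng.uniform(0, 100, size=(n_nodes, 2))              # sensor (x, y) locations
readings = 20 + 0.1 * coords[:, 0] + rng.normal(scale=0.5, size=n_nodes)
readings[rng.random(n_nodes) < 0.1] = np.nan                 # some nodes failed to report

reporting = ~np.isnan(readings)
tree = cKDTree(coords[reporting])                            # index only the reporting nodes
reported_values = readings[reporting]

imputed = readings.copy()
for i in np.flatnonzero(~reporting):
    dist, idx = tree.query(coords[i], k=4)                   # 4 nearest reporting nodes
    imputed[i] = reported_values[idx].mean()                 # simple unweighted average
```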

Journal ArticleDOI
TL;DR: In this article, a framework for model selection and model averaging in the context of missing data is proposed, where the focus lies on multiple imputation as a strategy to deal with the missingness: a consequent combination with model averaging aims to incorporate both the uncertainty associated with the model selection process and the uncertainty associated with the imputation process.