
Showing papers on "Imputation (statistics)" published in 2015


Book
01 Jan 2015
TL;DR: A textbook covering data coding and exploratory analysis, imputation of missing data, measures of reliability, factor and principal components analysis, and the selection and interpretation of inferential statistics in SPSS, from multiple regression through MANOVA, repeated-measures and mixed ANOVAs, and multilevel/hierarchical linear modeling.
Abstract: 1. Introduction 2. Data Coding and Exploratory Analysis (EDA) 3. Imputation of Missing Data 4. Several Measures of Reliability 5. Exploratory Factor Analysis and Principal Components Analysis 6. Selecting and Interpreting Inferential Statistics 7. Multiple Regression 8. Mediation, Moderation, and Canonical Correlation 9. Logistic Regression and Discriminant Analysis 10. Factorial ANOVA and ANCOVA 11. Repeated-Measures and Mixed ANOVAs 12. Multivariate Analysis of Variance (MANOVA) 13. Multilevel Linear Modeling/Hierarchical Linear Modeling Appendix A. Getting Started With SPSS and Other Useful Procedures (D. Quick, M. Myers) Appendix B. Review of Basic Statistics (J.M. Cumming, A. Weinberg) Appendix C. Answers to Odd Interpretation Questions

854 citations


Journal ArticleDOI
TL;DR: zCompositions is an R package for the imputation of left-censored data under a compositional approach, used in fields such as the geochemistry of waters and sedimentary rocks, environmental studies related to air pollution, and the physicochemical analysis of glass fragments in forensic science.

551 citations


Journal ArticleDOI
TL;DR: This work demonstrates how the application of software engineering techniques can help to keep imputation broadly accessible and speed up imputation by an order of magnitude compared with the previous implementation.
Abstract: Summary: Genotype imputation is a key step in the analysis of genome-wide association studies. Upcoming very large reference panels, such as those from The 1000 Genomes Project and the Haplotype Reference Consortium, will improve imputation quality of rare and less common variants, but will also increase the computational burden. Here, we demonstrate how the application of software engineering techniques can help to keep imputation broadly accessible. Overall, these improvements speed up imputation by an order of magnitude compared with our previous implementation. Availability and implementation: minimac2, including source code, documentation, and examples, is available at http://genome.sph.umich.edu/wiki/Minimac2 Contact: cfuchsb@umich.edu, goncalo@umich.edu

454 citations


Journal ArticleDOI
TL;DR: In this paper, the authors show how imputation by fully conditional specification, a popular approach for performing multiple imputation, can be modified so that covariates are imputed from models which are compatible with the substantive model.
Abstract: Missing covariate data commonly occur in epidemiological and clinical research, and are often dealt with using multiple imputation. Imputation of partially observed covariates is complicated if the substantive model is non-linear (e.g. Cox proportional hazards model), or contains non-linear (e.g. squared) or interaction terms, and standard software implementations of multiple imputation may impute covariates from models that are incompatible with such substantive models. We show how imputation by fully conditional specification, a popular approach for performing multiple imputation, can be modified so that covariates are imputed from models which are compatible with the substantive model. We investigate through simulation the performance of this proposal, and compare it with existing approaches. Simulation results suggest our proposal gives consistent estimates for a range of common substantive models, including models which contain non-linear covariate effects or interactions, provided data are missing at random and the assumed imputation models are correctly specified and mutually compatible. Stata software implementing the approach is freely available.
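
The following minimal Python sketch only illustrates the compatibility issue the abstract describes: when the analysis model contains x and x squared, imputing the squared term as an unrelated column can produce completed data in which the two columns disagree; deriving the squared term after imputing x (passive-style imputation) keeps them consistent. It is not the authors' substantive-model-compatible FCS algorithm, and all variable names and the use of scikit-learn's IterativeImputer are illustrative assumptions.

# Illustration of the "compatibility" issue only, NOT the authors' method.
# A simple remedy shown here is to impute x and then recompute the derived
# squared term, so x and x**2 stay consistent in the completed data.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(scale=0.5, size=n)

df = pd.DataFrame({"x": x, "y": y})
df.loc[rng.random(n) < 0.3, "x"] = np.nan          # ~30% missing covariate

# Impute x from y (a linear FCS-style imputation model; a quadratic
# substantive model would ideally need a compatible, richer imputation
# model, which is exactly the paper's point).
imp = IterativeImputer(sample_posterior=True, random_state=0)
completed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)

# Derive the squared term AFTER imputation so x and x**2 remain consistent.
completed["x_sq"] = completed["x"] ** 2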

356 citations



Journal ArticleDOI
TL;DR: This review outlined deficiencies in the documenting of missing data and the details provided about imputation in medical research articles, and only a few articles performed sensitivity analyses following MI even though this is strongly recommended in guidelines.
Abstract: Missing data are common in medical research, which can lead to a loss in statistical power and potentially biased results if not handled appropriately. Multiple imputation (MI) is a statistical method, widely adopted in practice, for dealing with missing data. Many academic journals now emphasise the importance of reporting information regarding missing data and proposed guidelines for documenting the application of MI have been published. This review evaluated the reporting of missing data, the application of MI including the details provided regarding the imputation model, and the frequency of sensitivity analyses within the MI framework in medical research articles. A systematic review of articles published in the Lancet and New England Journal of Medicine between January 2008 and December 2013 in which MI was implemented was carried out. We identified 103 papers that used MI, with the number of papers increasing from 11 in 2008 to 26 in 2013. Nearly half of the papers specified the proportion of complete cases or the proportion with missing data by each variable. In the majority of the articles (86%) the imputed variables were specified. Of the 38 papers (37%) that stated the method of imputation, 20 used chained equations, 8 used multivariate normal imputation, and 10 used alternative methods. Very few articles (9%) detailed how they handled non-normally distributed variables during imputation. Thirty-nine papers (38%) stated the variables included in the imputation model. Less than half of the papers (46%) reported the number of imputations, and only two papers compared the distribution of imputed and observed data. Sixty-six papers presented the results from MI as a secondary analysis. Only three articles carried out a sensitivity analysis following MI to assess departures from the missing at random assumption, with details of the sensitivity analyses only provided by one article. This review outlined deficiencies in the documenting of missing data and the details provided about imputation. Furthermore, only a few articles performed sensitivity analyses following MI even though this is strongly recommended in guidelines. Authors are encouraged to follow the available guidelines and provide information on missing data and the imputation process.

299 citations


Journal ArticleDOI
TL;DR: This work demonstrates the application of FCS MI in support of a large epidemiologic study evaluating national blood utilization patterns in a sub-Saharan African country and offers a principled yet flexible method of addressing missing data.
Abstract: Missing data commonly occur in large epidemiologic studies. Ignoring incompleteness or handling the data inappropriately may bias study results, reduce power and efficiency, and alter important risk/benefit relationships. Standard ways of dealing with missing values, such as complete case analysis (CCA), are generally inappropriate due to the loss of precision and risk of bias. Multiple imputation by fully conditional specification (FCS MI) is a powerful and statistically valid method for creating imputations in large data sets which include both categorical and continuous variables. It specifies the multivariate imputation model on a variable-by-variable basis and offers a principled yet flexible method of addressing missing data, which is particularly useful for large data sets with complex data structures. However, FCS MI is still rarely used in epidemiology, and few practical resources exist to guide researchers in the implementation of this technique. We demonstrate the application of FCS MI in support of a large epidemiologic study evaluating national blood utilization patterns in a sub-Saharan African country. A number of practical tips and guidelines for implementing FCS MI based on this experience are described.
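
As a rough illustration of the chained-equations idea described above, the sketch below uses scikit-learn's IterativeImputer as a stand-in for FCS MI on continuous variables; it is not the software the authors used, and dedicated MI tools (e.g. R's mice or SAS/Stata FCS routines) handle mixed categorical/continuous data and Rubin's pooling rules more completely. The function name is an invented example.

# Minimal FCS-style multiple-imputation sketch (continuous columns assumed).
# IterativeImputer models each variable conditional on the others and cycles
# through them, which mirrors the variable-by-variable FCS specification.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiply_impute(df, m=5):
    """Return m completed copies of df, each from a different random draw."""
    completed = []
    for i in range(m):
        imp = IterativeImputer(sample_posterior=True, max_iter=20,
                               random_state=i)
        completed.append(pd.DataFrame(imp.fit_transform(df),
                                      columns=df.columns, index=df.index))
    return completed

# Each completed data set would then be analysed separately and the results
# combined with Rubin's rules (pooling of estimates and variances not shown).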

270 citations


Journal ArticleDOI
10 Nov 2015-JAMA
TL;DR: This issue of JAMA reports results of a cluster-randomized clinical trial designed to evaluate the effects of physician financial incentives, patient incentives, or shared physician and patient incentives on low density lipoprotein cholesterol levels among patients with high cardiovascular risk.
Abstract: In this issue of JAMA, Asch et al1 report results of a cluster-randomized clinical trial designed to evaluate the effects of physician financial incentives, patient incentives, or shared physician and patient incentives on low-density lipoprotein cholesterol (LDL-C) levels among patients with high cardiovascular risk. Because 1 or more follow-up LDL-C measurements were missing for approximately 7% of participants, Asch et al used multiple imputation (MI) to analyze their data and concluded that shared financial incentives for physicians and patients, but not incentives to physicians or patients alone, resulted in the patients having lower LDL-C levels. Imputation is the process of replacing missing data with 1 or more specific values, to allow statistical analysis that includes all participants and not just those who do not have any missing data. Missing data are common in research. In a previous JAMA Guide to Statistics and Methods, Newgard and Lewis2 reviewed the causes of missing data. These are divided into 3 classes: 1) missing completely at random, the most restrictive assumption, indicating that whether a data point is missing is completely unrelated to observed and unobserved data; 2) missing at random, a more realistic assumption than missing completely at random, indicating that whether a data point is missing can be explained by the observed data; or 3) missing not at random, meaning that the missingness is dependent on the unobserved values. Common statistical methods used for handling missing values were reviewed.2 When missing data occur, it is important not to exclude cases with missing information (analyses after such exclusion are known as complete case analyses). Single-value imputation methods are those that estimate what each missing value might have been and replace it with a single value in the data set. Single-value imputation methods include mean imputation, last observation carried forward, and random imputation. These approaches can yield biased results and are suboptimal. Multiple imputation better handles missing data by estimating and replacing missing values many times.
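
A small synthetic demonstration of the point made above: single-value mean imputation shrinks the variance of the imputed variable and so overstates precision, whereas drawing several plausible values preserves spread. The draws below come from a normal distribution fitted to the observed values only, which is a deliberate caricature of multiple imputation (proper MI also uses covariates and reflects parameter uncertainty); all numbers are invented.

# Synthetic demonstration: mean imputation shrinks variance; repeated
# plausible draws (a caricature of multiple imputation) preserve it.
import numpy as np

rng = np.random.default_rng(1)
full = rng.normal(loc=100, scale=15, size=1000)      # "true" LDL-C-like values
observed = full.copy()
observed[rng.random(1000) < 0.2] = np.nan            # 20% missing (MCAR)

obs = observed[~np.isnan(observed)]
mean_imputed = np.where(np.isnan(observed), obs.mean(), observed)

m_draws = [np.where(np.isnan(observed),
                    rng.normal(obs.mean(), obs.std(), size=observed.size),
                    observed)
           for _ in range(20)]

print("true SD          :", full.std().round(2))
print("mean-imputed SD  :", mean_imputed.std().round(2))   # too small
print("avg MI-draw SD   :", np.mean([d.std() for d in m_draws]).round(2))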

264 citations


Posted Content
TL;DR: In this paper, a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF), was proposed.
Abstract: Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.
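
The sketch below computes only the Gramian Angular Summation Field named in the abstract: the series is rescaled to [-1, 1], values are read as cosines of polar angles, and the image is cos(phi_i + phi_j). The paper's preprocessing (e.g. piecewise aggregation) and the GADF/MTF encodings are omitted, and the function name is illustrative.

# GASF sketch: rescale to [-1, 1], treat values as cos(phi), and build
# cos(phi_i + phi_j) = x_i*x_j - sqrt(1 - x_i**2)*sqrt(1 - x_j**2).
import numpy as np

def gasf(series):
    x = np.asarray(series, dtype=float)
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1      # rescale to [-1, 1]
    x = np.clip(x, -1.0, 1.0)
    comp = np.sqrt(1.0 - x**2)                           # sin(phi)
    return np.outer(x, x) - np.outer(comp, comp)         # cos(phi_i + phi_j)

img = gasf(np.sin(np.linspace(0, 4 * np.pi, 64)))        # a 64 x 64 "image"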

211 citations


Journal ArticleDOI
TL;DR: The goal for imputation in the field of computational proteomics should be to develop new approaches that work generically for this data type and new strategies to guide users in the selection of the best imputation for their dataset and analysis objectives.
Abstract: In this review, we apply selected imputation strategies to label-free liquid chromatography–mass spectrometry (LC–MS) proteomics datasets to evaluate the accuracy with respect to metrics of variance and classification. We evaluate several commonly used imputation approaches for individual merits and discuss the caveats of each approach with respect to the example LC–MS proteomics data. In general, local similarity-based approaches, such as the regularized expectation maximization and least-squares adaptive algorithms, yield the best overall performances with respect to metrics of accuracy and robustness. However, no single algorithm consistently outperforms the remaining approaches, and in some cases, performing classification without imputation yielded the most accurate classification. Thus, because of the complex mechanisms of missing data in proteomics, which also vary from peptide to protein, no individual method is a single solution for imputation. On the basis of the observations in this review...

211 citations


Journal ArticleDOI
TL;DR: A novel online (adaptive) algorithm is developed to obtain multi-way decompositions of low-rank tensors with missing entries and perform imputation as a byproduct; simulated tests confirm its superior performance relative to state-of-the-art alternatives.
Abstract: Extracting latent low-dimensional structure from high-dimensional data is of paramount importance in timely inference tasks encountered with “Big Data” analytics. However, increasingly noisy, heterogeneous, and incomplete datasets, as well as the need for real-time processing of streaming data, pose major challenges to this end. In this context, the present paper permeates benefits from rank minimization to scalable imputation of missing data, via tracking low-dimensional subspaces and unraveling latent (possibly multi-way) structure from incomplete streaming data. For low-rank matrix data, a subspace estimator is proposed based on an exponentially weighted least-squares criterion regularized with the nuclear norm. After recasting the nonseparable nuclear norm into a form amenable to online optimization, real-time algorithms with complementary strengths are developed, and their convergence is established under simplifying technical assumptions. In a stationary setting, the asymptotic estimates obtained offer the well-documented performance guarantees of the batch nuclear-norm regularized estimator. Under the same unifying framework, a novel online (adaptive) algorithm is developed to obtain multi-way decompositions of low-rank tensors with missing entries and perform imputation as a byproduct. Simulated tests with both synthetic as well as real Internet and cardiac magnetic resonance imagery (MRI) data confirm the efficacy of the proposed algorithms, and their superior performance relative to state-of-the-art alternatives.
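
The paper's online/streaming and tensor algorithms are not reproduced here; instead, the batch sketch below illustrates the underlying rank-minimization idea with a SoftImpute-style iteration, i.e. repeated soft-thresholding of singular values while keeping observed entries fixed. The regularization weight and iteration count are arbitrary choices.

# Batch nuclear-norm-style matrix completion sketch (SoftImpute-like).
import numpy as np

def soft_impute(X, lam=1.0, n_iter=100):
    mask = ~np.isnan(X)
    Z = np.where(mask, X, 0.0)                 # start with zeros for missing
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)           # soft-threshold singular values
        low_rank = (U * s) @ Vt
        Z = np.where(mask, X, low_rank)        # keep observed entries fixed
    return Z

# Example: a rank-2 matrix with ~30% of entries missing.
rng = np.random.default_rng(0)
M = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 40))
X = M.copy()
X[rng.random(M.shape) < 0.3] = np.nan
M_hat = soft_impute(X, lam=0.5)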

Journal ArticleDOI
TL;DR: This study compares 6 different imputation methods and suggests that bPCA and FKM are two imputation methods of interest which deserve further consideration in practice.
Abstract: Missing data are part of almost all research and introduce an element of ambiguity into data analysis. It follows that we need to consider them appropriately in order to provide an efficient and valid analysis. In the present study, we compare 6 different imputation methods: Mean, K-nearest neighbors (KNN), fuzzy K-means (FKM), singular value decomposition (SVD), bayesian principal component analysis (bPCA) and multiple imputations by chained equations (MICE). Comparison was performed on four real datasets of various sizes (from 4 to 65 variables), under a missing completely at random (MCAR) assumption, and based on four evaluation criteria: Root mean squared error (RMSE), unsupervised classification error (UCE), supervised classification error (SCE) and execution time. Our results suggest that bPCA and FKM are two imputation methods of interest which deserve further consideration in practice.
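
A small synthetic comparison in the spirit of the study, restricted to two of the six methods it evaluates (mean and KNN imputation) and to RMSE on entries masked completely at random; the FKM, SVD, bPCA and MICE methods and the real data sets are not reproduced, and the 20% masking rate is an arbitrary choice.

# RMSE of mean vs KNN imputation on MCAR-masked entries of a toy data set.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(0)
X = load_iris().data.astype(float)
mask = rng.random(X.shape) < 0.2                      # ~20% of cells, MCAR
mask[mask.all(axis=1), 0] = False                     # keep >= 1 value per row
X_miss = np.where(mask, np.nan, X)

def rmse(X_hat):
    return np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))

print("mean imputation RMSE:", rmse(SimpleImputer(strategy="mean").fit_transform(X_miss)))
print("5-NN imputation RMSE:", rmse(KNNImputer(n_neighbors=5).fit_transform(X_miss)))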

Journal ArticleDOI
TL;DR: A hybrid approach integrating the Fuzzy C-Means-based imputation method with the Genetic Algorithm is developed for missing traffic volume data estimation based on inductance loop detector outputs; the results show the proposed approach outperforms conventional methods under prevailing traffic conditions.
Abstract: Although various innovative traffic sensing technologies have been widely employed, incomplete sensor data is one of the major problems that significantly degrade traffic data quality and integrity. In this study, a hybrid approach integrating the Fuzzy C-Means (FCM)-based imputation method with the Genetic Algorithm (GA) is developed for missing traffic volume data estimation based on inductance loop detector outputs. By utilizing the weekly similarity among data, the conventional vector-based data structure is first transformed into a matrix-based data pattern. Then, the GA is applied to optimize the membership functions and centroids in the FCM model. Experimental tests are conducted to verify the effectiveness of the proposed approach. Traffic volume data collected at different temporal scales were used as the testing dataset, and three different indicators, including root mean square error, correlation coefficient, and relative accuracy, are utilized to quantify the imputation performance compared with some conventional methods (historical method, double exponential smoothing, and autoregressive integrated moving average model). The results show the proposed approach outperforms the conventional methods under prevailing traffic conditions.
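
The sketch below shows only the fuzzy-C-means part of the idea: fit FCM on the complete rows, then fill each missing entry with the membership-weighted average of the cluster centroids, where memberships are computed from the observed dimensions only. The GA tuning of memberships/centroids and the weekly matrix restructuring described in the abstract are not reproduced, and the number of clusters and fuzzifier are arbitrary.

# Compact FCM-based imputation sketch (no GA tuning, no weekly restructuring).
import numpy as np

def fcm(X, c=3, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy C-means on complete data; returns the c centroids."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]                 # centroids
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)) *
                   np.sum(d ** (-2 / (m - 1)), axis=1, keepdims=True))
    return V

def fcm_impute(X, c=3, m=2.0):
    X = np.asarray(X, dtype=float)
    complete = X[~np.isnan(X).any(axis=1)]
    V = fcm(complete, c=c, m=m)
    X_out = X.copy()
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        obs = ~np.isnan(X[i])
        d = np.linalg.norm(V[:, obs] - X[i, obs], axis=1) + 1e-12
        u = d ** (-2 / (m - 1))
        u /= u.sum()                          # memberships from observed dims
        X_out[i, ~obs] = u @ V[:, ~obs]       # weighted average of centroids
    return X_out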

Journal ArticleDOI
TL;DR: The authors consider the unique challenges associated with attrition, incomplete repeated measures, and unknown observations of time as well as factors responsible for differences in the value of imputation.
Abstract: This article offers an applied review of key issues and methods for the analysis of longitudinal panel data in the presence of missing values. The authors consider the unique challenges associated with attrition (survey dropout), incomplete repeated measures, and unknown observations of time. Using simulated data based on 4 waves of the Marital Instability Over the Life Course Study (n = 2,034), they applied a fixed effect regression model and an event-history analysis with time-varying covariates. They then compared results for analyses with nonimputed missing data and with imputed data both in long and in wide structures. Imputation produced improved estimates in the event-history analysis but only modest improvements in the estimates and standard errors of the fixed effects analysis. Factors responsible for differences in the value of imputation are examined, and recommendations for handling missing values in panel data are presented.

Proceedings Article
25 Jul 2015
TL;DR: This work proposes a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF), which enables the use of techniques from computer vision for time series classification and imputation.
Abstract: Inspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/ Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.

Journal ArticleDOI
TL;DR: A new cluster-based oversampling approach, robust to small and imbalanced datasets and accounting for the heterogeneity of patients with Hepatocellular Carcinoma, is proposed; coupled with neural networks, it outperformed all other tested approaches.

Journal ArticleDOI
01 Sep 2015-JAMA
TL;DR: The authors concluded that finerenone improved the UACR, a result that is consistent regardless of the method for handling missing data.
Abstract: Missing data are common in clinical research, particularly for variables requiring complex, time-sensitive, resource-intensive, or longitudinal data collection methods. However, even seemingly readily available information can be missing. There are many reasons for "missingness," including missed study visits, patients lost to follow-up, missing information in source documents, lack of availability (eg, laboratory tests that were not performed), and clinical scenarios preventing collection of certain variables (eg, missing coma scale data in sedated patients). It is particularly challenging to interpret studies when primary outcome data are missing. However, many methods commonly used for handling missing values during data analysis can yield biased results, decrease study power, or lead to underestimates of uncertainty, all reducing the chance of drawing valid conclusions. In this issue of JAMA, Bakris et al evaluated the effect of finerenone on urinary albumin-creatinine ratio (UACR) in patients with diabetic nephropathy in a randomized, phase 2B, dose-finding clinical trial conducted in 148 sites in 23 countries.1 Because of the logistical complexity of the study, it is not surprising that some of the intended data collection could not be completed, resulting in missing outcome data. Bakris et al used several analysis and imputation techniques (ie, methods for replacing missing data with specific values) to assess the effects of different approaches for handling missing data. These methods included complete case analysis (restricting the analysis to include only patients with observed 90-day UACR values); last observation carried forward (LOCF; typically this involves using the last recorded data point as the final outcome; Bakris et al used the higher of 2 UACR values and, separately, the most recent UACR obtained prior to study discontinuation); baseline observation carried forward (using the baseline UACR value as the outcome UACR value, therefore assuming no treatment effect for that patient); mean value imputation (replacing missing values with the mean of observed UACR values); and random imputation (using randomly selected UACR values to replace missing UACR values).1 Multiple imputation2 to handle missing values was also performed. With the exception of multiple imputation, each of the imputation approaches replaces a missing value with a single number (termed "single" or "simple" imputation) and can threaten the validity of study results.3,4 The authors concluded that finerenone improved the UACR, a result that was consistent regardless of the method for handling missing data.
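
To make the single-value approaches listed above concrete, the pandas sketch below applies LOCF, baseline observation carried forward, mean imputation and random imputation to an invented wide-format table of repeated UACR-like measurements; the column names and data are assumptions, and, as the article notes, these approaches can bias results.

# Single-value imputation approaches, shown only to make the definitions
# concrete (invented data; these methods can threaten validity).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.lognormal(3, 0.5, size=(6, 4)),
                  columns=["baseline", "visit_1", "visit_2", "visit_3"])
df.iloc[[1, 3], -1] = np.nan            # two patients miss the final visit

final = df["visit_3"]
locf = df.ffill(axis=1)["visit_3"]                   # last observation carried forward
bocf = final.fillna(df["baseline"])                  # baseline observation carried forward
mean_imp = final.fillna(final.mean())                # mean value imputation
random_imp = final.fillna(pd.Series(                 # random imputation
    rng.choice(final.dropna().to_numpy(), size=len(final)), index=final.index))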

Journal ArticleDOI
TL;DR: It is revealed that the field has yet to make substantial use of this technique despite common employment of quantitative analysis, and that in research where MI is used, many recommended MI reporting practices are not being followed.
Abstract: Higher education researchers using survey data often face decisions about handling missing data. Multiple imputation (MI) is considered by many statisticians to be the most appropriate technique for addressing missing data in many circumstances. In particular, it has been shown to be preferable to listwise deletion, which has historically been a commonly employed method for quantitative research. However, our analysis of a decade of higher education research literature reveals that the field has yet to make substantial use of this technique despite common employment of quantitative analysis, and that in research where MI is used, many recommended MI reporting practices are not being followed. We conclude that additional information about the technique and recommended reporting practices may help improve the quality of the research involving missing data. In an attempt to address this issue, we develop a set of reporting recommendations based on a synthesis of the MI methodological literature and offer a discussion of these recommendations oriented toward applied researchers. The recommended MI reporting practices involve describing the nature and structure of any missing data, describing the imputation model and procedures, and describing any notable imputation results.

Journal ArticleDOI
TL;DR: An imputation method suitable for multivariate time series data, based on the EM algorithm under an assumption of normality, is presented; it exhibited good accuracy and precision in different settings with respect to the patterns of missing observations.

Journal ArticleDOI
TL;DR: The case where some of the data values are missing is studied, a review of methods that accommodate PCA to missing data is proposed, and several techniques to consider or estimate (impute) missing values in PCA are presented.
Abstract: Principal component analysis (PCA) is a standard technique to summarize the main structures of a data table containing the measurements of several quantitative variables for a number of individuals. Here, we study the case where some of the data values are missing and propose a review of methods which accommodate PCA to missing data. In plant ecology, this statistical challenge relates to the current effort to compile global plant functional trait databases producing matrices with a large amount of missing values. We present several techniques to consider or estimate (impute) missing values in PCA and compare them using theoretical considerations. We carried out a simulation study to evaluate the relative merits of the different approaches in various situations (correlation structure, number of variables and individuals, and percentage of missing values) and also applied them on a real data set. Lastly, we discuss the advantages and drawbacks of these approaches, the potential pitfalls and future challenges that need to be addressed in the future.
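
One of the standard approaches reviewed in this literature is iterative (EM-style) PCA imputation; the minimal sketch below initializes missing cells with column means and then alternates between a fixed-rank SVD reconstruction and refilling the missing cells. The regularized variants and the simulation design discussed in the abstract are not reproduced, and the rank and iteration count are arbitrary.

# Minimal iterative-PCA-style imputation sketch.
import numpy as np

def iterative_pca_impute(X, k=2, n_iter=50):
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    Z = np.where(mask, col_means, X)                 # start from column means
    for _ in range(n_iter):
        mu = Z.mean(axis=0)
        U, s, Vt = np.linalg.svd(Z - mu, full_matrices=False)
        recon = (U[:, :k] * s[:k]) @ Vt[:k] + mu     # rank-k reconstruction
        Z = np.where(mask, recon, X)                 # only replace missing cells
    return Z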

Journal ArticleDOI
TL;DR: It is concluded that MLMI may substantially improve the estimation of between-study heterogeneity parameters and allow for imputation of systematically missing predictors in IPD-MA aimed at the development and validation of prediction models.
Abstract: Individual participant data meta-analyses (IPD-MA) are increasingly used for developing and validating multivariable (diagnostic or prognostic) risk prediction models. Unfortunately, some predictors or even outcomes may not have been measured in each study and are thus systematically missing in some individual studies of the IPD-MA. As a consequence, it is no longer possible to evaluate between-study heterogeneity and to estimate study-specific predictor effects, or to include all individual studies, which severely hampers the development and validation of prediction models.Here, we describe a novel approach for imputing systematically missing data and adopt a generalized linear mixed model to allow for between-study heterogeneity. This approach can be viewed as an extension of Resche-Rigon's method (Stat Med 2013), relaxing their assumptions regarding variance components and allowing imputation of linear and nonlinear predictors.We illustrate our approach using a case study with IPD-MA of 13 studies to develop and validate a diagnostic prediction model for the presence of deep venous thrombosis. We compare the results after applying four methods for dealing with systematically missing predictors in one or more individual studies: complete case analysis where studies with systematically missing predictors are removed, traditional multiple imputation ignoring heterogeneity across studies, stratified multiple imputation accounting for heterogeneity in predictor prevalence, and multilevel multiple imputation (MLMI) fully accounting for between-study heterogeneity.We conclude that MLMI may substantially improve the estimation of between-study heterogeneity parameters and allow for imputation of systematically missing predictors in IPD-MA aimed at the development and validation of prediction models. © 2015 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: The proposed HPM-MI system significantly improves data quality by using the best imputation technique, selected after quantitative analysis of eleven imputation approaches, and will be very useful for prediction in the medical domain, especially when the number of missing values in the data set is large.
Abstract: Highlights: A novel hybrid prediction model with missing value imputation (HPM-MI) is proposed. HPM-MI improves accuracy, sensitivity, specificity, kappa and ROC on three datasets. The best accuracy is achieved for the diabetes, hepatitis, and breast cancer datasets. Missing value imputation is an important step of the proposed model. Accurate prediction in the presence of a large number of missing values in the data set has always been a challenging problem. Most hybrid models addressing this challenge have either deleted the missing instances from the data set (popularly known as case deletion) or have used some default way to fill in the missing values. This paper presents a novel hybrid prediction model with missing value imputation (HPM-MI) that analyzes various imputation techniques using simple K-means clustering and applies the best one to a data set. The proposed hybrid model is the first to use a combination of K-means clustering with a Multilayer Perceptron. K-means clustering is also used to validate the class labels of the given data (incorrectly classified instances are deleted, i.e. the pattern extracted from the original data) before applying the classifier. The proposed system has significantly improved data quality by use of the best imputation technique after quantitative analysis of eleven imputation approaches. The efficiency of the proposed model as a predictive classification system is investigated on three benchmark medical data sets, namely Pima Indians Diabetes, Wisconsin Breast Cancer, and Hepatitis, from the UCI Repository of Machine Learning. In addition to accuracy, sensitivity, and specificity, the kappa statistic and the area under the ROC curve are also computed. The experimental results show HPM-MI produced accuracy, sensitivity, specificity, kappa and ROC of 99.82%, 100%, 99.74%, 0.996 and 1.0, respectively, for the Pima Indians Diabetes data set; 99.39%, 99.31%, 99.54%, 0.986 and 1.0, respectively, for the breast cancer data set; and 99.08%, 100%, 96.55%, 0.978 and 0.99, respectively, for the Hepatitis data set. Results are the best in comparison with existing methods. Further, the performance of the model is measured and analyzed as a function of missing rate and train-test ratio using a 2D synthetic data set and the Wisconsin Diagnostic Breast Cancer data set. Results are promising and therefore the proposed model will be very useful for prediction in the medical domain, especially when the number of missing values in the data set is large.

22 Nov 2015
TL;DR: Main features include plausible value imputation, multilevel imputation functions, imputation using partial least squares for high dimensional predictors and two-way imputation.
Description: Contains auxiliary functions for multiple imputation which complement existing functionality in R. In addition to some utility functions, main features include plausible value imputation, multilevel imputation functions, imputation using partial least squares (PLS) for high-dimensional predictors and two-way imputation.

Journal ArticleDOI
TL;DR: This research work analyzes a real breast cancer dataset from the Portuguese Institute of Oncology of Porto with a high percentage of unknown categorical information and constructs prediction models for breast cancer survivability using K-Nearest Neighbors, Classification Trees, Logistic Regression and Support Vector Machines.

Posted Content
TL;DR: The results show that either an interpolation with a seasonal Kalman filter from the zoo package or a linear interpolation on seasonal loess-decomposed data from the forecast package were the most effective methods for dealing with missing data in most of the scenarios assessed in this paper.
Abstract: Missing values in datasets are a well-known problem and there are quite a lot of R packages offering imputation functions. But while imputation in general is well covered within R, it is hard to find functions for the imputation of univariate time series. The problem is that most standard imputation techniques cannot be applied directly. Most algorithms rely on inter-attribute correlations, while univariate time series imputation needs to employ time dependencies. This paper provides an overview of univariate time series imputation in general and a detailed insight into the respective implementations within R packages. Furthermore, we experimentally compare the R functions on different time series using four different ratios of missing data. Our results show that either an interpolation with a seasonal Kalman filter from the zoo package or a linear interpolation on seasonal loess-decomposed data from the forecast package were the most effective methods for dealing with missing data in most of the scenarios assessed in this paper.
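
The sketch below is not a port of the R zoo/forecast routines benchmarked above; it only shows two simple univariate baselines in pandas: plain linear interpolation, and a naive "deseasonalise, interpolate, re-seasonalise" variant that subtracts per-month means estimated from the observed values. The synthetic monthly series and 20% missing rate are assumptions.

# Two univariate time-series imputation baselines in pandas (not the R
# Kalman-smoother or loess-based methods compared in the paper).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
y = pd.Series(10 + 5 * np.sin(2 * np.pi * idx.month / 12)
              + rng.normal(0, 0.5, 48), index=idx)
y[rng.random(48) < 0.2] = np.nan                     # ~20% missing at random

linear = y.interpolate(method="linear")              # plain linear interpolation

seasonal_means = y.groupby(y.index.month).transform("mean")
seasonal_lin = (y - seasonal_means).interpolate(method="linear") + seasonal_means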

Journal ArticleDOI
TL;DR: Improved versions of this nearest neighbor imputation method perform well, especially when the number of predictors is large, and are evaluated in simulation studies and with several real data sets from different fields.

Journal ArticleDOI
TL;DR: Experimental results show that feature relevance has a non-ignorable influence on missing data estimation based on grey theory; the method is considered superior to the other four estimation strategies, and classification bias can be significantly reduced by using the approach in classification tasks.
Abstract: Treatment of missing data has become increasingly significant in scientific research and engineering applications. The classic imputation strategy based on the K nearest neighbours (KNN) has been widely used to solve this problem. However, former studies do not give much attention to feature relevance, which has a significant impact on the selection of nearest neighbours. As a result, biased results may appear in similarity measurements. In this paper, we propose a novel method to impute missing data, named the feature weighted grey KNN (FWGKNN) imputation algorithm. This approach employs mutual information (MI) to measure feature relevance. We present an experimental evaluation on five UCI datasets under three missingness mechanisms with various missing rates. Experimental results show that feature relevance has a non-ignorable influence on missing data estimation based on grey theory, and our method is considered superior to the other four estimation strategies. Moreover, the classification bias can be significantly reduced by using our approach in classification tasks.
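
The sketch below captures only the core idea of feature-weighted neighbour selection: weight each feature by its mutual information with the column being imputed before measuring similarity to candidate donor rows. It is not the authors' FWGKNN algorithm, which uses grey relational analysis rather than the weighted Euclidean distance used here; the function name and the use of scikit-learn's mutual_info_regression are assumptions.

# MI-weighted KNN imputation sketch (weighted Euclidean distance, not grey
# relational analysis; illustrative only).
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi_weighted_knn_impute(X, k=5):
    X = np.asarray(X, dtype=float)
    out = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]           # candidate donor rows
    for j in range(X.shape[1]):
        rows = np.where(np.isnan(X[:, j]))[0]
        if len(rows) == 0:
            continue
        others = [c for c in range(X.shape[1]) if c != j]
        # Relevance of every other column to column j, from complete cases.
        w = mutual_info_regression(complete[:, others], complete[:, j],
                                   random_state=0)
        w = w / w.sum() if w.sum() > 0 else np.full(len(others), 1.0 / len(others))
        for i in rows:
            obs = [c for c in others if not np.isnan(X[i, c])]
            wi = np.array([w[others.index(c)] for c in obs])
            d = np.sqrt(((complete[:, obs] - X[i, obs]) ** 2 * wi).sum(axis=1))
            donors = np.argsort(d)[:k]               # k most similar donors
            out[i, j] = complete[donors, j].mean()
    return out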

Journal ArticleDOI
TL;DR: A strategy is proposed that balances the optimal imputation strategy, which was k-means nearest neighbour imputation, against the best approximation for positioning real zeros; it was observed that as little as 40% missing data could be truly missing.
Abstract: The origin of missing values can be caused by different reasons and depending on these origins missing values should be considered differently and dealt with in different ways. In this research, four methods of imputation have been compared with respect to revealing their effects on the normality and variance of data, on statistical significance and on the approximation of a suitable threshold to accept missing data as truly missing. Additionally, the effects of different strategies for controlling familywise error rate or false discovery and how they work with the different strategies for missing value imputation have been evaluated. Missing values were found to affect normality and variance of data and k-means nearest neighbour imputation was the best method tested for restoring this. Bonferroni correction was the best method for maximizing true positives and minimizing false positives and it was observed that as low as 40% missing data could be truly missing. The range between 40 and 70% missing values was defined as a "gray area" and therefore a strategy has been proposed that provides a balance between the optimal imputation strategy that was k-means nearest neighbor and the best approximation of positioning real zeros.

Journal ArticleDOI
TL;DR: This work proposes a full information maximum likelihood (FIML) approach to item-level missing data handling that mitigates the loss in power due to missing scale scores and utilizes the available item- level data without altering the substantive analysis.
Abstract: Often when participants have missing scores on one or more of the items comprising a scale, researchers compute prorated scale scores by averaging the available items. Methodologists have cautioned that proration may make strict assumptions about the mean and covariance structures of the items comprising the scale (Schafer & Graham, 2002; Graham, 2009; Enders, 2010). We investigated proration empirically and found that it resulted in bias even under a missing completely at random (MCAR) mechanism. To encourage researchers to forgo proration, we describe a full information maximum likelihood (FIML) approach to item-level missing data handling that mitigates the loss in power due to missing scale scores and utilizes the available item-level data without altering the substantive analysis. Specifically, we propose treating the scale score as missing whenever one or more of the items are missing and incorporating items as auxiliary variables. Our simulations suggest that item-level missing data handling drastically increases power relative to scale-level missing data handling. These results have important practical implications, especially when recruiting more participants is prohibitively difficult or expensive. Finally, we illustrate the proposed method with data from an online chronic pain management program.
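
A tiny pandas illustration of the contrast drawn above: proration averages whichever items happen to be observed, whereas the recommended approach treats the scale score as missing whenever any item is missing (the FIML/auxiliary-variable step that would then recover the lost information is not shown). The item values are invented.

# Proration vs. treating the scale score as missing (illustrative data).
import numpy as np
import pandas as pd

items = pd.DataFrame({"item1": [4, 5, np.nan, 2],
                      "item2": [3, np.nan, np.nan, 2],
                      "item3": [5, 4, 1, np.nan]})

prorated = items.mean(axis=1, skipna=True)            # prorated scale score
strict = items.mean(axis=1)
strict[items.isna().any(axis=1)] = np.nan             # missing if any item is missing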

Journal ArticleDOI
01 Apr 2015
TL;DR: A single imputation approach relying on a multilayer perceptron whose training is conducted with different learning rules, and a multiple imputation approach based on the combination of a multilayer perceptron and k-nearest neighbours, are proposed; both improve the automation level and data quality, offering a satisfactory performance.
Abstract: Highlights: Imputation of data for monotone patterns of missing values. An estimation model of missing data based on a multilayer perceptron. Combination of a neural network and k-nearest-neighbour-based multiple imputation. Comparison of the performance of the proposed models with three classic procedures. Three classic single imputation models: mean/mode, regression and hot-deck. The knowledge discovery process is supported by information gathered from collected data sets, which often contain errors in the form of missing values. Data imputation is the activity aimed at estimating values for missing data items. This study focuses on the development of automated data imputation models, based on artificial neural networks, for monotone patterns of missing values. The present work proposes a single imputation approach relying on a multilayer perceptron whose training is conducted with different learning rules, and a multiple imputation approach based on the combination of multilayer perceptron and k-nearest neighbours. Eighteen real and simulated databases were exposed to a perturbation experiment with random generation of monotone missing data patterns. An empirical test was carried out on these data sets, including both approaches (single and multiple imputation), and three classical single imputation procedures - mean/mode imputation, regression and hot-deck - were also considered. Therefore, the experiments involved five imputation methods. The results, considering different performance measures, demonstrated that, in comparison with traditional tools, both proposals improve the automation level and data quality, offering a satisfactory performance.
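
The sketch below shows only the single-imputation idea in the abstract: train a multilayer perceptron on the complete cases and predict the missing values of one target column from the fully observed predictors (a monotone missingness pattern). The paper's alternative learning rules, its hot-deck/regression baselines and the MLP + k-NN multiple-imputation combination are not reproduced; the data, column names and network size are assumptions.

# MLP-based single imputation of one column with monotone missingness.
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
X["target"] = X["x1"] - 2 * X["x2"] + rng.normal(0, 0.3, n)
X.loc[rng.random(n) < 0.25, "target"] = np.nan        # monotone missingness

train = X.dropna()
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                                   random_state=0))
model.fit(train[["x1", "x2", "x3"]], train["target"])

missing = X["target"].isna()
X.loc[missing, "target"] = model.predict(X.loc[missing, ["x1", "x2", "x3"]])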