Topic

Missing data

About: Missing data is a research topic. Over its lifetime, 21,363 publications have appeared within this topic, receiving 784,923 citations.


Papers
Posted Content
01 Jan 2001
TL;DR: This paper gives a lightning overview of data mining and its relation to statistics, with particular emphasis on tools for the detection of adverse drug reactions.
Abstract: The growing interest in data mining is motivated by a common problem across disciplines: how does one store, access, model, and ultimately describe and understand very large data sets? Historically, different aspects of data mining have been addressed independently by different disciplines. This is the first truly interdisciplinary text on data mining, blending the contributions of information science, computer science, and statistics. The book consists of three sections. The first, foundations, provides a tutorial overview of the principles underlying data mining algorithms and their application. The presentation emphasizes intuition rather than rigor. The second section, data mining algorithms, shows how algorithms are constructed to solve specific problems in a principled manner. The algorithms covered include trees and rules for classification and regression, association rules, belief networks, classical statistical models, nonlinear models such as neural networks, and local "memory-based" models. The third section shows how all of the preceding analysis fits together when applied to real-world data mining problems. Topics include the role of metadata, how to handle missing data, and data preprocessing.

3,765 citations

Journal Article
TL;DR: A Monte Carlo simulation examined the performance of 4 missing data methods in structural equation models and found that full information maximum likelihood (FIML) estimation was superior across all conditions of the design.
Abstract: A Monte Carlo simulation examined the performance of 4 missing data methods in structural equation models: full information maximum likelihood (FIML), listwise deletion, pairwise deletion, and similar response pattern imputation. The effects of 3 independent variables were examined (factor loading magnitude, sample size, and missing data rate) on 4 outcome measures: convergence failures, parameter estimate bias, parameter estimate efficiency, and model goodness of fit. Results indicated that FIML estimation was superior across all conditions of the design. Under ignorable missing data conditions (missing completely at random and missing at random), FIML estimates were unbiased and more efficient than the other methods. In addition, FIML yielded the lowest proportion of convergence failures and provided near-optimal Type 1 error rates across both simulations.

3,748 citations
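
Two of the methods compared in the abstract above, listwise and pairwise deletion, differ only in which rows they keep. The following is a minimal sketch of that difference, not the simulation code from the paper; the function names, the use of `None` to mark a missing value, and the toy data are all assumptions made for illustration.

```python
def listwise_rows(rows):
    """Listwise deletion: keep only rows with no missing (None) entries."""
    return [r for r in rows if all(v is not None for v in r)]


def pairwise_cov(rows, i, j):
    """Pairwise deletion: sample covariance of columns i and j,
    using every row in which both of those two columns are observed."""
    pairs = [(r[i], r[j]) for r in rows
             if r[i] is not None and r[j] is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_i = sum(a for a, _ in pairs) / n
    mean_j = sum(b for _, b in pairs) / n
    return sum((a - mean_i) * (b - mean_j) for a, b in pairs) / (n - 1)


# Hypothetical toy data: three variables, two rows with one gap each.
rows = [
    (1.0, 2.0, 3.0),
    (2.0, None, 1.0),
    (3.0, 6.0, None),
    (4.0, 8.0, 5.0),
]

complete = listwise_rows(rows)             # only fully observed rows survive
cov_listwise = pairwise_cov(complete, 0, 1)
cov_pairwise = pairwise_cov(rows, 0, 1)    # also uses the third row
```

Pairwise deletion uses more of the data for each covariance, but each entry of the resulting matrix may rest on a different subset of rows.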

Journal Article
TL;DR: It is shown that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros).
Abstract: Motivation: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. Results: We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1–20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions. Availability: The software is available at http://smi-web.

3,542 citations
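
The row average and weighted K-nearest-neighbors ideas named in the abstract above can be sketched as follows. This is an illustrative sketch only, not the authors' KNNimpute software: the inverse-distance weighting, the root-mean-square distance over shared observed columns, the `eps` guard, and the function names are assumptions.

```python
import math


def row_average_impute(matrix):
    """Fill each missing entry (None) with the mean of the observed
    values in the same row; rows with no observed values are left as-is."""
    out = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            out.append(list(row))
            continue
        mean = sum(observed) / len(observed)
        out.append([mean if v is None else v for v in row])
    return out


def knn_impute(matrix, k=2, eps=1e-9):
    """Fill each missing entry with a distance-weighted average of the
    same column taken from the k most similar rows that observe it."""
    out = [list(r) for r in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(matrix):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Root-mean-square distance over columns both rows observe.
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = candidates[:k]
            if not nearest:
                continue  # no usable neighbor: leave the gap in place
            weights = [1.0 / (d + eps) for d, _ in nearest]
            out[i][j] = (sum(w * v for w, (_, v) in zip(weights, nearest))
                         / sum(weights))
    return out


# Hypothetical toy matrix: rows are genes, columns are experiments.
matrix = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [5.0, 6.0, 7.0],
]
filled_avg = row_average_impute(matrix)
filled_knn = knn_impute(matrix, k=1)
```

In this toy case the row average ignores the other rows entirely, while the neighbor-based fill borrows the value from the row whose observed entries match most closely.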

Book
07 Mar 2008
TL;DR: Applied Survival Analysis, Second Edition is an ideal book for graduate-level courses in biostatistics, statistics, and epidemiologic methods and serves as a valuable reference for practitioners and researchers in any health-related field or for professionals in insurance and government.
Abstract: THE MOST PRACTICAL, UP-TO-DATE GUIDE TO MODELLING AND ANALYZING TIME-TO-EVENT DATA, NOW IN A VALUABLE NEW EDITION. Since publication of the first edition nearly a decade ago, analyses using time-to-event methods have increased considerably in all areas of scientific inquiry, mainly as a result of model-building methods available in modern statistical software packages. However, there has been minimal coverage in the available literature to guide researchers, practitioners, and students who wish to apply these methods to health-related areas of study. Applied Survival Analysis, Second Edition provides a comprehensive and up-to-date introduction to regression modeling for time-to-event data in medical, epidemiological, biostatistical, and other health-related research. This book places a unique emphasis on the practical and contemporary applications of regression modeling rather than the mathematical theory. It offers a clear and accessible presentation of modern modeling techniques supplemented with real-world examples and case studies. Key topics covered include: variable selection, identification of the scale of continuous covariates, the role of interactions in the model, assessment of fit and model assumptions, regression diagnostics, recurrent event models, frailty models, additive models, competing risk models, and missing data.
Features of the Second Edition include: expanded coverage of interactions and the covariate-adjusted survival functions; the use of the Worcester Heart Attack Study as the main modeling data set for illustrating discussed concepts and techniques; new discussion of variable selection with multivariable fractional polynomials; further exploration of time-varying covariates, complete with examples; additional treatment of the exponential, Weibull, and log-logistic parametric regression models; increased emphasis on interpreting and using results as well as utilizing multiple imputation methods to analyze data with missing values; and new examples and exercises at the end of each chapter. Analyses throughout the text are performed using Stata Version 9, and an accompanying FTP site contains the data sets used in the book. Applied Survival Analysis, Second Edition is an ideal book for graduate-level courses in biostatistics, statistics, and epidemiologic methods. It also serves as a valuable reference for practitioners and researchers in any health-related field or for professionals in insurance and government.

3,507 citations

Journal Article
TL;DR: Essential features of multiple imputation are reviewed, with answers to frequently asked questions about using the method in practice.
Abstract: In recent years, multiple imputation has emerged as a convenient and flexible paradigm for analysing data with missing values. Essential features of multiple imputation are reviewed, with answers to frequently asked questions about using the method in practice.

3,387 citations
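
The abstract above does not spell out the mechanics, but a central step of multiple imputation is combining the results from the separately imputed data sets. A minimal sketch of the standard combining rules (Rubin's rules) follows; the function name and the example numbers are hypothetical and are not taken from the paper.

```python
def pool_rubin(estimates, variances):
    """Combine m point estimates and their squared standard errors,
    one pair per imputed data set, into a pooled estimate and a
    total variance using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    q_bar = sum(estimates) / m                        # pooled point estimate
    within = sum(variances) / m                       # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between        # total variance
    return q_bar, total


# Hypothetical results from three imputed data sets.
pooled_estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what carries the extra uncertainty due to the missing values; analysing a single filled-in data set would omit it.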


Network Information
Related Topics (5)
Inference: 36.8K papers, 1.3M citations, 87% related
Regression analysis: 31K papers, 1.7M citations, 87% related
Estimator: 97.3K papers, 2.6M citations, 87% related
Sampling (statistics): 65.3K papers, 1.2M citations, 83% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Performance Metrics
No. of papers in the topic in previous years
Year    Papers
2025    2
2024    2
2023    931
2022    2,020
2021    1,639
2020    1,642