
Missing data

About: Missing data is a research topic. Over the lifetime, 21363 publications have been published within this topic receiving 784923 citations.


Papers
Journal ArticleDOI
TL;DR: This work presents a new framework to structure and classify missingness patterns, and benchmarks the performance of a number of state-of-the-art imputation techniques, both stochastic multiple imputation (MI) approaches and deterministic spectral decomposition techniques.
Abstract: Missing data is a problem appearing ubiquitously across many fields and needs to be dealt with systematically. For multivariate time series data, imputation can be a challenging problem. We consider the particular case of credit default swap time series, where missing data can pose a considerable problem, preventing important value-at-risk estimates. We present a new framework to structure and classify missingness patterns, and generate suitable realistic test sets. We then benchmark the performance of a number of state-of-the-art imputation techniques, both stochastic multiple imputation (MI) approaches and deterministic spectral decomposition techniques. We demonstrate that for the missingness patterns under consideration, the MI package Amelia, based on the expectation maximisation algorithm, performs most robustly and reliably; however, other techniques like multiple singular spectral analysis can also perform well. Our results can serve as a valuable guideline for researchers and practitioners working with incomplete multivariate time series.

8 citations
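The multiple-imputation idea benchmarked above can be sketched in Python. Amelia itself is an R package built on EM with bootstrapping; as a rough analogue only, scikit-learn's IterativeImputer with sample_posterior=True can draw several stochastic completions, which are then pooled downstream. The toy data and settings here are assumptions for illustration, not the paper's setup.

```python
# Sketch of multiple imputation (MI) for a multivariate series.
# NOT Amelia's algorithm: a rough scikit-learn analogue that draws
# m stochastic completions and pools them.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)).cumsum(axis=0)      # toy multivariate "time series"
mask = rng.random(X.shape) < 0.1                  # 10% missing at random
X_miss = X.copy()
X_miss[mask] = np.nan

# Draw m completed datasets; in a real MI analysis, downstream estimates
# are computed on each completion and pooled (Rubin's rules) so that
# imputation uncertainty propagates into the final estimates.
m = 5
completions = [
    IterativeImputer(sample_posterior=True, random_state=s).fit_transform(X_miss)
    for s in range(m)
]
pooled = np.mean(completions, axis=0)             # no missing entries remain
```

Averaging the completions is shown only as a sanity check; the point of MI is to analyse each completed dataset separately rather than collapse them into one.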

Journal ArticleDOI
TL;DR: This work considers the set of areal units as only partially observed, proposes fitting models of interest to a portion of the data while holding out the rest for model comparison, and investigates the performance of semiconductor chip data under the nested modeling structure.
Abstract: Areal unit or discrete spatial data is customarily modeled with the goal of spatial smoothing, typically using Markov random field models. Examples include image restoration and disease mapping. Here, we focus on a different issue for such data; we consider the set of areal units as only partially observed. One application is to learn about the smoothing behavior of various Markov random field models. That is, if two different smoothing priors are used, how can we quantify the relative smoothing that each imposes? We propose to fit models of interest to a portion of the data and hold out the rest for model comparison. A second application concerns the setting where, in fact, only a portion of the areal units have been observed, and we seek prediction of the remainder. Our motivating context investigates the performance of semiconductor chips, created as dies (the areal units) within wafers within lots, yielding nested modeling structure. Multiple tests are administered to each die involving both binary and continuous measurements. In practice, only a small subset of the dies are sampled, resulting in prediction of performance for the remaining unsampled dies. Furthermore, dies in the same locations are tested on each wafer, and the manufacturing process encourages within wafer, between wafer and between lot dependence. Other missing data applications include damaged images and small area estimation with missing observations for some units. We demonstrate prediction first with an image that is observed at several rates of missingness. Then, a well-studied Ohio lung cancer dataset is used for model comparison with regard to smoothing. Finally, examination of the nested modeling for semiconductor chip data is offered.

8 citations
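The hold-out idea for areal data can be illustrated with a small sketch: mask a fraction of lattice cells, predict them from their observed neighbours, and score on the held-out set. The neighbour-averaging "model" below is a crude stand-in for a Markov random field smoothing prior, not the paper's method; the smooth surface and masking rate are assumptions for illustration.

```python
# Sketch of hold-out evaluation on areal (lattice) data: hold out 20%
# of cells, predict each from observed 4-neighbours (a crude stand-in
# for an MRF smoothing prior), and compute held-out RMSE.
import numpy as np

rng = np.random.default_rng(1)
field = np.add.outer(np.sin(np.linspace(0, 3, 20)),
                     np.cos(np.linspace(0, 3, 20)))   # smooth spatial surface
obs = field + rng.normal(scale=0.05, size=field.shape)

hold = rng.random(field.shape) < 0.2                  # held-out areal units

def neighbour_mean(grid, mask):
    """Predict each held-out cell as the mean of its observed 4-neighbours."""
    pred = grid.copy()
    for i, j in zip(*np.where(mask)):
        nbrs = [grid[a, b]
                for a, b in [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
                if 0 <= a < grid.shape[0] and 0 <= b < grid.shape[1]
                and not mask[a, b]]
        pred[i, j] = np.mean(nbrs) if nbrs else np.mean(grid[~mask])
    return pred

pred = neighbour_mean(obs, hold)
rmse = np.sqrt(np.mean((pred[hold] - field[hold]) ** 2))
```

Comparing this held-out RMSE across candidate smoothing priors is exactly the model-comparison use the abstract describes.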

Posted Content
TL;DR: In this article, an extended stochastic gradient MCMC algorithm was proposed for large-scale Bayesian computing problems, such as those involving dimension jumping and missing data.
Abstract: Stochastic gradient Markov chain Monte Carlo (MCMC) algorithms have received much attention in Bayesian computing for big data problems, but they are only applicable to a small class of problems for which the parameter space has a fixed dimension and the log-posterior density is differentiable with respect to the parameters. This paper proposes an extended stochastic gradient MCMC algorithm which, by introducing appropriate latent variables, can be applied to more general large-scale Bayesian computing problems, such as those involving dimension jumping and missing data. Numerical studies show that the proposed algorithm is highly scalable and much more efficient than traditional MCMC algorithms. The proposed algorithm substantially alleviates the difficulty of applying Bayesian methods to big-data computing.

8 citations
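The basic stochastic gradient MCMC update the paper extends can be sketched with stochastic gradient Langevin dynamics (SGLD), the simplest member of the family: a noisy gradient step on a minibatch plus injected Gaussian noise. The Gaussian-mean model, step size, and burn-in below are assumptions for illustration; the paper's extension with latent variables is not shown.

```python
# Sketch of SGLD targeting the posterior mean of a Gaussian model
# (known unit variance, flat prior) using minibatch gradients.
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=1.0, size=10_000)   # "big" dataset, unknown mean
N, batch, eps = data.size, 100, 1e-4

theta, samples = 0.0, []
for t in range(5_000):
    mb = rng.choice(data, size=batch, replace=False)
    # Unbiased minibatch estimate of the log-posterior gradient,
    # rescaled by N / batch to account for subsampling:
    grad = (N / batch) * np.sum(mb - theta)
    # Langevin step: half a gradient step plus injected Gaussian noise.
    theta += 0.5 * eps * grad + np.sqrt(eps) * rng.normal()
    if t > 1_000:                                    # discard burn-in
        samples.append(theta)
```

Each iteration touches only 100 of the 10,000 observations, which is the scalability argument the abstract makes; the posterior mean estimate from the retained samples should sit close to the sample mean of the data.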

Journal ArticleDOI
TL;DR: A density estimation framework is presented that integrates information from empirical models, environment conditions, and satellite measurement data, based on Gaussian processes, which are nonlinear, non-parametric regression methods.

8 citations
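The Gaussian-process building block named in the TL;DR can be sketched with scikit-learn: fit a non-parametric GP regression to sparse observations and predict, with uncertainty, at unobserved locations. The 1-D toy data and RBF-plus-noise kernel are assumptions for illustration, not the paper's satellite setup.

```python
# Sketch of GP regression as a density-over-predictions tool: fit sparse
# noisy observations, then predict mean and standard deviation at
# unobserved locations.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 30))[:, None]         # sparse measurement sites
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=30)

# RBF kernel for smooth spatial structure, WhiteKernel for measurement noise.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(x, y)

x_new = np.linspace(0, 10, 200)[:, None]             # unobserved locations
mean, std = gp.predict(x_new, return_std=True)       # predictive distribution
```

The per-location predictive mean and standard deviation together define a Gaussian predictive density, which is what lets such a framework quantify uncertainty at points with no measurements.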

Journal ArticleDOI
TL;DR: By leveraging motif-based graph aggregation, a spatiotemporal imputation approach is proposed to address the issue of missing traffic data; the results showed that the proposed approach was feasible and accurate.
Abstract: Due to the incomplete coverage and failure of traffic data collectors during collection, traffic data usually suffer from missing information. Achieving accurate imputation is critical to the operation of transportation networks. Existing approaches usually focus on characteristic analysis of temporal variation and adjacent spatial representation, while higher-order spatial correlations and continuously missing data have attracted increasing attention from academia and industry. In this paper, by leveraging motif-based graph aggregation, we propose a spatiotemporal imputation approach to address the issue of missing traffic data. First, through motif discovery, a higher-order graph aggregation model was presented for traffic networks. It utilized a graph convolution network (GCN) to aggregate the correlated segment attributes of the missing data segments. Then, a multi-time-dimension imputation model based on bidirectional long short-term memory (Bi-LSTM) incorporated the recent, daily-periodic, and weekly-periodic dependencies of the historical data. Finally, the spatial aggregated values and the temporal fusion values were integrated to obtain the results. We conducted comprehensive experiments on a real-world dataset and discussed the cases of random and continuous data missing at different time intervals; the results showed that the proposed approach was feasible and accurate.

8 citations
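The spatial/temporal fusion idea above can be illustrated in a highly simplified form. The paper uses motif-based GCN aggregation and a Bi-LSTM; the sketch below merely blends (a) the mean of graph-neighbour sensors at the same time step with (b) interpolation along the sensor's own series. The toy road network, the traffic signal, and the equal fusion weights are all assumptions for illustration.

```python
# Toy stand-in for spatiotemporal traffic imputation: fill one missing
# reading from a spatial (neighbour-mean) estimate and a temporal
# (interpolation) estimate, then fuse the two.
import numpy as np

rng = np.random.default_rng(4)
T, S = 48, 4                                     # time steps x road segments
base = np.sin(np.linspace(0, 4 * np.pi, T))[:, None]
traffic = base + 0.1 * rng.normal(size=(T, S))   # correlated segment speeds
adj = np.ones((S, S)) - np.eye(S)                # toy fully connected graph

miss_t, miss_s = 20, 2                           # one missing reading
obs = traffic.copy()
obs[miss_t, miss_s] = np.nan

# (a) spatial estimate: mean of adjacent segments at the same time step
nbrs = np.where(adj[miss_s] > 0)[0]
spatial = np.nanmean(obs[miss_t, nbrs])

# (b) temporal estimate: interpolate along the segment's own history
temporal = 0.5 * (obs[miss_t - 1, miss_s] + obs[miss_t + 1, miss_s])

imputed = 0.5 * spatial + 0.5 * temporal         # fuse the two estimates
```

The GCN and Bi-LSTM in the paper play the roles of these two estimates with learned, far more expressive aggregations; the final integration step mirrors the fusion line here.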


Network Information
Related Topics (5)
- Inference: 36.8K papers, 1.3M citations (87% related)
- Regression analysis: 31K papers, 1.7M citations (87% related)
- Estimator: 97.3K papers, 2.6M citations (87% related)
- Sampling (statistics): 65.3K papers, 1.2M citations (83% related)
- Cluster analysis: 146.5K papers, 2.9M citations (81% related)
Performance Metrics
No. of papers in the topic in previous years:

Year   Papers
2025   2
2024   2
2023   931
2022   2,020
2021   1,639
2020   1,642