
Showing papers on "Resampling" published in 2018


Journal ArticleDOI
TL;DR: In this article, the authors provide an overview of variable selection methods that are based on significance or information criteria, penalized likelihood, change-in-estimate criterion, background knowledge, or combinations thereof.
Abstract: Statistical models support medical research by facilitating individualized outcome prognostication conditional on independent variables or by estimating effects of risk factors adjusted for covariates. Theory of statistical models is well-established if the set of independent variables to consider is fixed and small. Hence, we can assume that effect estimates are unbiased and the usual methods for confidence interval estimation are valid. In routine work, however, it is not known a priori which covariates should be included in a model, and often we are confronted with a number of candidate variables in the range of 10-30. This number is often too large to be considered in a statistical model. We provide an overview of various available variable selection methods that are based on significance or information criteria, penalized likelihood, the change-in-estimate criterion, background knowledge, or combinations thereof. These methods were usually developed in the context of a linear regression model and then transferred to generalized linear models or models for censored survival data. Variable selection, in particular if used in explanatory modeling where effect estimates are of central interest, can compromise the stability of a final model, the unbiasedness of regression coefficients, and the validity of p-values or confidence intervals. Therefore, we give pragmatic recommendations for the practicing statistician on the application of variable selection methods in general (low-dimensional) modeling problems and on performing stability investigations and inference. We also propose some quantities based on resampling the entire variable selection process to be routinely reported by software packages offering automated variable selection algorithms.
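
As a rough, hedged sketch of the kind of resampling-based stability investigation recommended here, the snippet below bootstraps the entire selection process and reports how often each candidate variable is selected. Backward elimination on p-values stands in for whatever selection rule is actually used; the data, threshold, and variable names are purely illustrative.

```python
# Illustrative only: bootstrap the whole variable selection process and report
# bootstrap inclusion frequencies. Backward elimination on p-values is just one
# possible selection rule; the data are synthetic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.array([1.0, 0.5, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(size=n)

def backward_eliminate(X, y, alpha=0.157):
    """Drop the least significant variable until all remaining p-values <= alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        pvals = sm.OLS(y, sm.add_constant(X[:, keep])).fit().pvalues[1:]
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:
            break
        keep.pop(worst)
    return keep

B = 200
counts = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, size=n)          # bootstrap resample of the rows
    counts[backward_eliminate(X[idx], y[idx])] += 1

for j, freq in enumerate(counts / B):
    print(f"x{j}: bootstrap inclusion frequency {freq:.2f}")
```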

783 citations


Journal ArticleDOI
TL;DR: This work uses a general feature-extraction operator to represent application-dependent features and proposes a general reconstruction error to evaluate the quality of resampling; by minimizing the error, it obtains a general form of optimal resampling distribution.
Abstract: To reduce the cost of storing, processing, and visualizing a large-scale point cloud, we propose a randomized resampling strategy that selects a representative subset of points while preserving application-dependent features. The strategy is based on graphs, which can represent underlying surfaces and lend themselves well to efficient computation. We use a general feature-extraction operator to represent application-dependent features and propose a general reconstruction error to evaluate the quality of resampling; by minimizing the error, we obtain a general form of optimal resampling distribution. The proposed resampling distribution is guaranteed to be shift-, rotation- and scale-invariant in the three-dimensional space. We then specify the feature-extraction operator to be a graph filter and study specific resampling strategies based on all-pass, low-pass, high-pass graph filtering and graph filter banks. We validate the proposed methods on three applications: large-scale visualization, accurate registration, and robust shape modeling, demonstrating the effectiveness and efficiency of the proposed resampling methods.
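
As a loose illustration of feature-sensitive resampling of a point cloud (not the authors' exact operator), the sketch below builds a k-nearest-neighbour graph and treats the difference between each point and its neighbourhood mean as a crude high-pass response that drives the sampling probabilities. All sizes and parameters are arbitrary.

```python
# Hedged sketch: a neighbour-mean difference acts as a simple high-pass filter
# response, and points are drawn with probability proportional to that response.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.uniform(size=(5000, 3))           # toy point cloud

k = 10
tree = cKDTree(points)
_, nbrs = tree.query(points, k=k + 1)          # first neighbour is the point itself

# High-pass response: distance between each point and the mean of its neighbours.
neighbour_mean = points[nbrs[:, 1:]].mean(axis=1)
response = np.linalg.norm(points - neighbour_mean, axis=1)

# Feature-sensitive sampling distribution proportional to the filter response.
prob = response / response.sum()
m = 500
keep = rng.choice(len(points), size=m, replace=False, p=prob)
resampled = points[keep]
print(resampled.shape)
```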

139 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present four methods that combine bootstrap estimation with multiple imputation to address missing data and show that three of the four approaches yield valid inference, but that the performance of the methods varies with respect to the number of imputed data sets and the extent of missingness.
Abstract: Many modern estimators require bootstrapping to calculate confidence intervals because either no analytic standard error is available or the distribution of the parameter of interest is nonsymmetric. It remains however unclear how to obtain valid bootstrap inference when dealing with multiple imputation to address missing data. We present 4 methods that are intuitively appealing, easy to implement, and combine bootstrap estimation with multiple imputation. We show that 3 of the 4 approaches yield valid inference, but that the performance of the methods varies with respect to the number of imputed data sets and the extent of missingness. Simulation studies reveal the behavior of our approaches in finite samples. A topical analysis from HIV treatment research, which determines the optimal timing of antiretroviral treatment initiation in young children, demonstrates the practical implications of the 4 methods in a sophisticated and realistic setting. This analysis suffers from missing data and uses the g-formula for inference, a method for which no standard errors are available.
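
A minimal sketch of one possible "bootstrap first, then impute" combination (not necessarily one of the paper's four variants): rows are resampled, a handful of stochastic imputations are generated inside each bootstrap sample, the point estimates are averaged over imputations, and percentile limits are taken across bootstrap replicates. scikit-learn's IterativeImputer with sample_posterior=True stands in for a full multiple-imputation procedure, and all data are synthetic.

```python
# Hedged sketch: bootstrap + multiple imputation for a slope estimate with
# missing covariate values; percentile bootstrap interval at the end.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
x_obs = x.copy()
x_obs[rng.random(n) < 0.3] = np.nan            # 30% of x missing at random

def estimate(data, m=5, seed=0):
    """Average the slope estimate over m stochastic imputations."""
    slopes = []
    for i in range(m):
        imp = IterativeImputer(sample_posterior=True, random_state=seed + i)
        filled = imp.fit_transform(data)
        slopes.append(LinearRegression().fit(filled[:, [0]], filled[:, 1]).coef_[0])
    return np.mean(slopes)

data = np.column_stack([x_obs, y])
B = 100
boot = [estimate(data[rng.integers(0, n, n)], seed=b) for b in range(B)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"point estimate {estimate(data):.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```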

131 citations


Journal ArticleDOI
TL;DR: In this paper, the authors compare the traditional approach of a single split of data into a training set (for classification) and a test set (for accuracy assessment) to a resampling framework where the classification and accuracy assessment are repeated many times.

131 citations


01 Apr 2018
TL;DR: This paper compares the traditional approach of a single split of data into a training set (for classification) and test set (for accuracy assessment) to a resampling framework where the classification and accuracy assessment are repeated many times, and shows how a resampling approach enables generation of spatially continuous maps of classification uncertainty.
Abstract: Maps that categorise the landscape into discrete units are a cornerstone of many scientific, management and conservation activities. The accuracy of these maps is often the primary piece of information used to make decisions about the mapping process or judge the quality of the final map. Variance is critical information when considering map accuracy, yet commonly reported accuracy metrics often do not provide that information. Various resampling frameworks have been proposed and shown to reconcile this issue, but have had limited uptake. In this paper, we compare the traditional approach of a single split of data into a training set (for classification) and test set (for accuracy assessment), to a resampling framework where the classification and accuracy assessment are repeated many times. Using a relatively simple vegetation mapping example and two common classifiers (maximum likelihood and random forest), we compare variance in mapped area estimates and accuracy assessment metrics (overall accuracy, kappa, user's and producer's accuracy, entropy, purity, quantity/allocation disagreement). Input field data points were repeatedly split into training and test sets via bootstrapping, Monte Carlo cross-validation (67:33 and 80:20 split ratios) and k-fold (5-fold) cross-validation. Additionally, within the cross-validation, we tested four designs: simple random, block hold-out, stratification by class, and stratification by both class and space. A classification was performed for every split of every methodological combination (hundreds of iterations each), creating sampling distributions for the mapped area of each class and the accuracy metrics. We found that regardless of resampling design, a single split of data into training and test sets results in a large variance in estimates of accuracy and mapped area. In the worst case, overall accuracy varied between ~40–80% in one resampling design, due only to random variation in partitioning into training and test sets. On the other hand, we found that all resampling procedures provided accurate estimates of error, and that they can also provide confidence intervals that are informative about the performance and uncertainty of the classifier. Importantly, we show that these confidence intervals commonly encompassed the magnitudes of increase or decrease in accuracy that are often cited in literature as justification for methodological or sampling design choices. We also show how a resampling approach enables generation of spatially continuous maps of classification uncertainty. Based on our results, we make recommendations about which resampling design to use and how it could be implemented. We also provide a fully worked mapping example, which includes traditional inference of uncertainty from the error matrix and provides examples for presenting the final map and its accuracy.
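
A compact, hedged illustration of the contrast drawn above: rather than relying on one train/test split, Monte Carlo cross-validation repeats the split many times and yields a whole sampling distribution of overall accuracy. Synthetic points and a random forest stand in for the field data and classifiers used in the paper.

```python
# Hedged sketch: 200 random 67:33 splits give a sampling distribution of
# overall accuracy instead of a single number.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=400, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)

accuracies = []
for seed in range(200):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33,
                                              stratify=y, random_state=seed)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, clf.predict(X_te)))

accuracies = np.array(accuracies)
lo, hi = np.percentile(accuracies, [2.5, 97.5])
print(f"overall accuracy: mean {accuracies.mean():.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
print(f"range across single splits: {accuracies.min():.3f}-{accuracies.max():.3f}")
```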

116 citations


Posted Content
TL;DR: This work proposes to learn independent features with adversarial objectives which optimize such measures implicitly without the need to compute any probability densities, and shows that this strategy can easily be applied to different types of model architectures and solve both linear and non-linear ICA problems.
Abstract: Reliable measures of statistical dependence could be useful tools for learning independent features and performing tasks like source separation using Independent Component Analysis (ICA). Unfortunately, many of such measures, like the mutual information, are hard to estimate and optimize directly. We propose to learn independent features with adversarial objectives which optimize such measures implicitly. These objectives compare samples from the joint distribution and the product of the marginals without the need to compute any probability densities. We also propose two methods for obtaining samples from the product of the marginals using either a simple resampling trick or a separate parametric distribution. Our experiments show that this strategy can easily be applied to different types of model architectures and solve both linear and non-linear ICA problems.
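
The "simple resampling trick" for drawing from the product of the marginals can be illustrated in a few lines: independently permuting each column of a batch breaks the dependence between dimensions while leaving every marginal untouched. The snippet below is only a toy demonstration of that shuffling step, not the adversarial training loop.

```python
# Toy demonstration: column-wise independent shuffling yields samples from the
# product of the marginals; these would feed the "independent" branch of a
# discriminator in the adversarial setup.
import numpy as np

def product_of_marginals(batch, rng):
    """Return a batch whose feature columns are independently permuted."""
    shuffled = np.empty_like(batch)
    for j in range(batch.shape[1]):
        shuffled[:, j] = rng.permutation(batch[:, j])
    return shuffled

rng = np.random.default_rng(0)
z = rng.normal(size=(10000, 1))
batch = np.hstack([z, z + 0.1 * rng.normal(size=(10000, 1))])   # strongly dependent pair

print("correlation (joint):    ", np.corrcoef(batch.T)[0, 1].round(3))
print("correlation (resampled):", np.corrcoef(product_of_marginals(batch, rng).T)[0, 1].round(3))
```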

90 citations


Posted Content
TL;DR: In this article, the authors present a scheme to obtain an inexpensive and reliable estimate of the uncertainty associated with the predictions of a machine-learning model of atomic and molecular properties, which is based on resampling, with multiple models being generated based on sub-sampling of the same training data.
Abstract: We present a scheme to obtain an inexpensive and reliable estimate of the uncertainty associated with the predictions of a machine-learning model of atomic and molecular properties. The scheme is based on resampling, with multiple models being generated based on sub-sampling of the same training data. The accuracy of the uncertainty prediction can be benchmarked by maximum likelihood estimation, which can also be used to correct for correlations between resampled models, and to improve the performance of the uncertainty estimation by a cross-validation procedure. In the case of sparse Gaussian Process Regression models, this resampled estimator can be evaluated at negligible cost. We demonstrate the reliability of these estimates for the prediction of molecular energetics, and for the estimation of nuclear chemical shieldings in molecular crystals. Extension to estimate the uncertainty in energy differences, forces, or other correlated predictions is straightforward. This method can be easily applied to other machine learning schemes, and will be beneficial to make data-driven predictions more reliable, and to facilitate training-set optimization and active-learning strategies.
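
A hedged sketch of the resampling idea: fit an ensemble of models to random sub-samples of the training set and read off the spread of their predictions as an uncertainty estimate. The maximum-likelihood calibration and correlation correction mentioned in the abstract are omitted, and kernel ridge regression stands in for the sparse Gaussian process models; all data are synthetic.

```python
# Hedged sketch: sub-sample ensemble whose prediction spread serves as an
# (uncalibrated) uncertainty estimate.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

n_models, frac = 16, 0.6
models = []
for _ in range(n_models):
    idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
    models.append(KernelRidge(kernel="rbf", alpha=1e-2, gamma=1.0).fit(X[idx], y[idx]))

X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
preds = np.stack([m.predict(X_test) for m in models])     # (n_models, n_test)
mean, std = preds.mean(axis=0), preds.std(axis=0)
for x_val, m_val, s_val in zip(X_test[:, 0], mean, std):
    print(f"x={x_val:+.2f}  prediction={m_val:+.3f}  uncertainty~{s_val:.3f}")
```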

81 citations


Journal ArticleDOI
TL;DR: In this article, a multivariate method of rank sampling for distributions and dependences (R2D2) bias correction is proposed to adjust not only the univariate distributions but also their inter-variable and inter-site dependence structures.
Abstract: Climate simulations often suffer from statistical biases with respect to observations or reanalyses. It is therefore common to correct (or adjust) those simulations before using them as inputs into impact models. However, most bias correction (BC) methods are univariate and so do not account for the statistical dependences linking the different locations and/or physical variables of interest. In addition, they are often deterministic, and stochasticity is frequently needed to investigate climate uncertainty and to add constrained randomness to climate simulations that do not possess a realistic variability. This study presents a multivariate method of rank resampling for distributions and dependences (R2D2) bias correction allowing one to adjust not only the univariate distributions but also their inter-variable and inter-site dependence structures. Moreover, the proposed R2D2 method provides some stochasticity since it can generate as many multivariate corrected outputs as the number of statistical dimensions (i.e., number of grid cells × number of climate variables) of the simulations to be corrected. It is based on an assumption of stability in time of the dependence structure – making it possible to deal with a high number of statistical dimensions – that lets the climate model drive the temporal properties and their changes in time. R2D2 is applied to temperature and precipitation reanalysis time series with respect to high-resolution reference data over the southeast of France (1506 grid cells). Bivariate, 1506-dimensional and 3012-dimensional versions of R2D2 are tested over a historical period and compared to a univariate BC. How the different BC methods behave in a climate change context is also illustrated with an application to regional climate simulations over the 2071–2100 period. The results indicate that the 1d-BC basically reproduces the climate model multivariate properties, 2d-R2D2 is only satisfying in the inter-variable context, 1506d-R2D2 strongly improves inter-site properties and 3012d-R2D2 is able to account for both. Applications of the proposed R2D2 method to various climate datasets are relevant for many impact studies. The perspectives for improvement are numerous, such as introducing stochasticity in the dependence itself, questioning its stability assumption, and accounting for temporal properties adjustment while including more physics in the adjustment procedures.
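
A deliberately tiny, two-variable illustration of the rank-reordering idea behind this kind of correction (not the full R2D2 algorithm, which handles thousands of dimensions and a conditioning dimension): each marginal is corrected by empirical quantile mapping, and the corrected values are then reordered so that their ranks follow the reference's rank sequence, which restores the reference dependence structure.

```python
# Toy sketch: univariate quantile mapping followed by rank reordering against the
# reference rank structure. In this calibration-period toy the reordered field
# reproduces the reference dependence by construction; synthetic data throughout.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
reference = rng.multivariate_normal([10, 3], [[4.0, 2.4], [2.4, 4.0]], size=n)  # "observations"
model = rng.multivariate_normal([12, 1], [[9.0, 0.5], [0.5, 1.0]], size=n)      # biased simulation

def ranks(v):
    return np.argsort(np.argsort(v))          # integer ranks 0..n-1

# Step 1: univariate bias correction (empirical quantile mapping), marginal by marginal.
corrected = np.column_stack([np.sort(reference[:, j])[ranks(model[:, j])] for j in range(2)])

# Step 2: rank reordering - shuffle each corrected column so its rank sequence
# matches that of the reference, restoring the inter-variable dependence.
reordered = np.column_stack([np.sort(corrected[:, j])[ranks(reference[:, j])] for j in range(2)])

for name, arr in [("reference", reference), ("model", model),
                  ("after 1d-BC", corrected), ("after rank reordering", reordered)]:
    print(f"{name:>22s}: correlation = {np.corrcoef(arr.T)[0, 1]:+.2f}")
```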

81 citations


Journal ArticleDOI
TL;DR: In this paper, an asymptotic framework is presented for conducting inference on parameters of the form φ(θ0), where φ is a known directionally differentiable function and θ0 is estimated by θ̂n.
Abstract: This paper studies an asymptotic framework for conducting inference on parameters of the form φ(θ0), where φ is a known directionally differentiable function and θ0 is estimated by θ̂n. In these settings, the asymptotic distribution of the plug-in estimator φ(θ̂n) can be readily derived employing existing extensions to the Delta method. We show, however, that the "standard" bootstrap is only consistent under overly stringent conditions; in particular, we establish that differentiability of φ is a necessary and sufficient condition for bootstrap consistency whenever the limiting distribution of θ̂n is Gaussian. An alternative resampling scheme is proposed which remains consistent when the bootstrap fails, and is shown to provide local size control under restrictions on the directional derivative of φ. We illustrate the utility of our results by developing a test of whether a Hilbert space valued parameter belongs to a convex set, a setting that includes moment inequality problems and certain tests of shape restrictions as special cases.
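
A small numerical illustration of the failure mode described above, under simple assumptions: take φ(t) = |t| (directionally but not fully differentiable at 0), θ0 = 0, and θ̂n the sample mean. The standard bootstrap distribution of √n(φ(θ̂n*) − φ(θ̂n)) does not mimic the true distribution of √n(φ(θ̂n) − φ(θ0)), which typically shows up as bootstrap quantiles sitting systematically below the true ones.

```python
# Illustration only: phi(t) = |t| at theta0 = 0. A Monte Carlo "truth" is compared
# with standard bootstrap quantiles averaged over many independent samples.
import numpy as np

rng = np.random.default_rng(0)
n, B, reps = 200, 500, 200
phi = np.abs

# True sampling distribution of sqrt(n) * (phi(theta_hat) - phi(0)).
true_stats = np.array([np.sqrt(n) * phi(rng.normal(size=n).mean()) for _ in range(5000)])

# Standard bootstrap quantiles, averaged over independent samples.
boot_quantiles = []
for _ in range(reps):
    x = rng.normal(size=n)
    theta_hat = x.mean()
    boot = np.array([np.sqrt(n) * (phi(x[rng.integers(0, n, n)].mean()) - phi(theta_hat))
                     for _ in range(B)])
    boot_quantiles.append(np.quantile(boot, [0.5, 0.95]))
avg_boot = np.mean(boot_quantiles, axis=0)

for q, t, b in zip((0.5, 0.95), np.quantile(true_stats, [0.5, 0.95]), avg_boot):
    print(f"{int(q * 100)}% quantile: true {t:.2f}, bootstrap (avg) {b:.2f}")
```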

80 citations


Journal ArticleDOI
TL;DR: The regression kink (RK) design is an increasingly popular empirical method for estimating causal effects of policies, such as the effect of unemployment benefits on unemployment duration as discussed by the authors, using si...
Abstract: The regression kink (RK) design is an increasingly popular empirical method for estimating causal effects of policies, such as the effect of unemployment benefits on unemployment duration. Using si...

76 citations


Journal ArticleDOI
TL;DR: In this paper, simultaneous confidence bands are constructed for a general moment condition model with high-dimensional parameters, where the Neyman orthogonality condition is assumed to be satisfied.
Abstract: In this paper, we develop procedures to construct simultaneous confidence bands for p̃ potentially infinite-dimensional parameters after model selection for general moment condition models where p̃ is potentially much larger than the sample size of available data, n. This allows us to cover settings with functional response data where each of the p̃ parameters is a function. The procedure is based on the construction of score functions that satisfy Neyman orthogonality condition approximately. The proposed simultaneous confidence bands rely on uniform central limit theorems for high-dimensional vectors (and not on Donsker arguments as we allow for p̃ ≫ n). To construct the bands, we employ a multiplier bootstrap procedure which is computationally efficient as it only involves resampling the estimated score functions (and does not require resolving the high-dimensional optimization problems). We formally apply the general theory to inference on regression coefficient process in the distribution regression model with a logistic link, where two implementations are analyzed in detail. Simulations and an application to real data are provided to help illustrate the applicability of the results.
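
A stripped-down sketch of the multiplier bootstrap step (toy OLS scores stand in for the orthogonalized score functions of the paper): once studentized score values are available, each bootstrap draw only multiplies them by i.i.d. standard normal weights and records the supremum over coordinates, with no re-estimation of the model.

```python
# Hedged sketch: sup-|t| critical value for simultaneous bands via a multiplier
# bootstrap over estimated (here: plain OLS) score functions.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                                  # true coefficients are zero

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat
Sigma_inv = np.linalg.inv(X.T @ X / n)
scores = (X * residuals[:, None]) @ Sigma_inv           # n x p influence-function values
scores /= scores.std(axis=0, ddof=1)                    # studentize each coordinate

B = 2000
sup_stats = np.empty(B)
for b in range(B):
    xi = rng.normal(size=n)                             # multiplier weights
    sup_stats[b] = np.abs(scores.T @ xi).max() / np.sqrt(n)

crit = np.quantile(sup_stats, 0.95)
print(f"simultaneous 95% critical value over {p} coordinates: {crit:.2f} "
      f"(pointwise normal value would be about 1.96)")
```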

Journal ArticleDOI
TL;DR: In this article, a Markov Chain Monte Carlo (MCMC) algorithm based on Group Importance Sampling (GIS) is proposed for the sequential importance sampling (SIS) problem.

Journal ArticleDOI
TL;DR: The study outlines the proof-of-principle that neuroimaging models for brain-age prediction can use Bayesian optimization to derive case-specific pre-processing parameters and suggests that different pre-processing parameters are selected when optimization is conducted in specific contexts.
Abstract: Neuroimaging-based age prediction using machine learning is proposed as a biomarker of brain aging, relating to cognitive performance, health outcomes and progression of neurodegenerative disease. However, even leading age-prediction algorithms contain measurement error, motivating efforts to improve experimental pipelines. T1-weighted MRI is commonly used for age prediction, and the pre-processing of these scans involves normalization to a common template and resampling to a common voxel size, followed by spatial smoothing. Resampling parameters are often selected arbitrarily. Here, we sought to improve brain-age prediction accuracy by optimizing resampling parameters using Bayesian optimization. Using data on N = 2003 healthy individuals (aged 16-90 years) we trained support vector machines to (i) distinguish between young (<50 years) and old (>50 years) brains (classification) and (ii) predict chronological age (regression). We also evaluated generalisability of the age-regression model to an independent dataset (CamCAN, N = 648, aged 18-88 years). Bayesian optimization was used to identify optimal voxel size and smoothing kernel size for each task. This procedure adaptively samples the parameter space to evaluate accuracy across a range of possible parameters, using independent sub-samples to iteratively assess different parameter combinations to arrive at optimal values. When distinguishing between young and old brains a classification accuracy of 88.1% was achieved (optimal voxel size = 1.15 mm3, smoothing kernel = 2.3 mm). For predicting chronological age, a mean absolute error (MAE) of 5.08 years was achieved (optimal voxel size = 3.73 mm3, smoothing kernel = 3.68 mm). This was compared to performance using default values of 1.5 mm3 and 4 mm respectively, resulting in MAE = 5.48 years, though this 7.3% improvement was not statistically significant. When assessing generalisability, best performance was achieved when applying the entire Bayesian optimization framework to the new dataset, out-performing the parameters optimized for the initial training dataset. Our study outlines the proof-of-principle that neuroimaging models for brain-age prediction can use Bayesian optimization to derive case-specific pre-processing parameters. Our results suggest that different pre-processing parameters are selected when optimization is conducted in specific contexts. This potentially motivates use of optimization techniques at many different points during the experimental process, which may improve statistical sensitivity and reduce opportunities for experimenter-led bias.

Journal ArticleDOI
01 Jan 2018-Test
TL;DR: In this paper, an alternative proof for exact testing with random permutations was given, viewing the test as a "conditional Monte Carlo test" as it has been called in the literature.
Abstract: When permutation methods are used in practice, often a limited number of random permutations are used to decrease the computational burden. However, most theoretical literature assumes that the whole permutation group is used, and methods based on random permutations tend to be seen as approximate. There exists a very limited amount of literature on exact testing with random permutations, and only recently a thorough proof of exactness was given. In this paper, we provide an alternative proof, viewing the test as a “conditional Monte Carlo test” as it has been called in the literature. We also provide extensions of the result. Importantly, our results can be used to prove properties of various multiple testing procedures based on random permutations.
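
For concreteness, a standard two-sample permutation test with B random permutations looks as follows; counting the observed statistic among the permuted ones (the "+1" in the numerator and denominator) is what keeps the randomized test exact, in line with the conditional Monte Carlo view. Data and statistic here are arbitrary.

```python
# Toy two-sample permutation test based on B random permutations.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.3, 1.0, size=40)
y = rng.normal(0.0, 1.0, size=50)

pooled = np.concatenate([x, y])
observed = x.mean() - y.mean()

B = 9999
count = 0
for _ in range(B):
    perm = rng.permutation(pooled)
    stat = perm[:len(x)].mean() - perm[len(x):].mean()
    if abs(stat) >= abs(observed):
        count += 1

p_value = (count + 1) / (B + 1)     # include the observed statistic for exactness
print(f"observed difference {observed:.3f}, permutation p-value {p_value:.4f}")
```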

Journal ArticleDOI
TL;DR: The proposed test statistic is modified and extended to factorial MANOVA designs, incorporating general heteroscedastic models, and the only distributional assumption is the existence of the group-wise covariance matrices, which may even be singular.

Journal ArticleDOI
TL;DR: In this article, Monte Carlo resampling is used to determine the number of components in partial least squares (PLS) regression, where the data are randomly and repeatedly divided into calibration and validation samples.
Abstract: Monte Carlo resampling is utilized to determine the number of components in partial least squares (PLS) regression. The data are randomly and repeatedly divided into calibration and validation samples. For each repetition, the root-mean-squared error (RMSE) is determined for the validation samples for a = 1, 2, …, A PLS components to provide a distribution of RMSE values for each number of PLS components. These distributions are used to determine the median RMSE for each number of PLS components. The component (A_min) having the lowest median RMSE is located. The fraction p of the RMSE values of A_min exceeding the median RMSE for the preceding component is determined. This fraction p represents a probability measure that can be used to decide if the RMSE for the A_min PLS component is significantly lower than the RMSE for the preceding component for a preselected threshold (p_upper). If so, it defines the optimum number of PLS components. If not, the process is repeated for the previous components until significance is achieved. Setting p_upper = 0.5 implies that the median is used for selecting the optimum number of components. The RMSE is approximately normally distributed for the smallest components. This can be utilized to relate p to a fraction of a standard deviation. For instance, p = 0.308 corresponds to half a standard deviation if the RMSE is normally distributed.
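
A hedged sketch of the procedure on synthetic data: repeated random calibration/validation splits give a distribution of validation RMSE for each number of PLS components, the component count with the lowest median RMSE is located, and the fraction p of its RMSE values exceeding the previous component's median decides whether it is retained. The backtracking loop is compressed to a single comparison here, with p_upper = 0.5.

```python
# Hedged sketch: Monte Carlo resampling for choosing the number of PLS components.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, n_vars, A = 150, 30, 8
X = rng.normal(size=(n, n_vars))
y = X[:, :3].sum(axis=1) + 0.5 * rng.normal(size=n)     # only 3 informative directions

n_rep = 100
rmse = np.empty((n_rep, A))
for r in range(n_rep):
    X_cal, X_val, y_cal, y_val = train_test_split(X, y, test_size=0.3, random_state=r)
    for a in range(1, A + 1):
        pred = PLSRegression(n_components=a).fit(X_cal, y_cal).predict(X_val).ravel()
        rmse[r, a - 1] = np.sqrt(np.mean((y_val - pred) ** 2))

medians = np.median(rmse, axis=0)
a_min = int(np.argmin(medians))                         # 0-based index of A_min
if a_min == 0:
    p, chosen = 0.0, 1
else:
    p = np.mean(rmse[:, a_min] > medians[a_min - 1])    # fraction above the previous median
    chosen = a_min + 1 if p <= 0.5 else a_min           # keep A_min only if clearly better
print(f"median RMSE per component: {np.round(medians, 3)}")
print(f"A_min = {a_min + 1}, p = {p:.2f}, chosen number of components = {chosen}")
```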

Journal Article
TL;DR: A software tool is presented that provides the user with a series of (semi-)automated image analysis and PET based segmentation methods to quantitatively analyse tracer uptake in oncology and lymphoma PET/CT studies, and allows for quick and reliable analysis of (FDG) PET/CT studies using state-of-the-art segmentation and image processing methods.
Abstract: Aim: Quantitative analysis of PET studies requires standardized and advanced image processing and analysis tools. The aim of this work was to develop a software tool that provides the user with a series of (semi-)automated image analysis and PET based segmentation methods to quantitatively analyse tracer uptake in oncology and lymphoma PET/CT studies. Methods and results: ACCURATE is developed in IDL version 8.4 (Harris Geospatial Solutions, Bloomfield, USA) and runs under the IDL virtual machine, which is freely available. ACCURATE allows for DICOM input of both PET and CT data. It includes several image processing steps such as rebinning/resampling, cropping and smoothing. After loading and processing PET/CT images, lesions can be delineated using a range of segmentation methods: fixed-size segmentations (squares, cubes, circles, spheres), manual free-hand segmentation, fixed SUV thresholds, % of SUVmax or SUVpeak isocontours with or without local contrast correction, a gradient-based method, and 2 majority-vote-based implementations. During the extraction of quantitative uptake features, such as SUVmax, SUVpeak, SUVmean, metabolic volume and total lesion glycolysis and various other 1st order statistics, 4 different partial volume correction methods can be applied. Moreover, the tool can provide the user with more than 250 radiomics output features, with either relative and/or fixed SUV binning. In addition, optimized semi-automated workflows for quickly and easily assessing total tumor burden are available. The tool also has functionality to extract time activity curves from dynamic PET studies, which can be used for full quantitative kinetic analysis. Finally, the tool can deal with correct SUV calculations for long-lived isotopes, i.e. it allows for decay correction over multiple days. Several data sanity checks have been incorporated, such as checks for swapped weight and height and other data plausibility checks. Liver uptake can be quickly previewed and verified. Results obtained with ACCURATE have been verified using simulations, phantom studies and a head-to-head comparison with other (commercially available) image analysis tools using clinical datasets. Conclusions: ACCURATE allows for quick and reliable analysis of (FDG) PET/CT studies using state-of-the-art segmentation and image processing methods. The tool is particularly suited for exploring the effects of image processing, segmentation strategies and partial volume corrections on the quantitative evaluation of oncology or lymphoma PET studies. New functionality beyond the state of the art will be continuously added. The tool will be made freely available (r.boellaard@vumc.nl)

Journal ArticleDOI
TL;DR: An R package is presented to facilitate estimating the ICC and its CI for binary responses using different methods, and it can be a very useful tool for researchers designing cluster randomized trials with a binary outcome.


Journal ArticleDOI
TL;DR: In this article, the authors proposed a method to estimate variances of a number of Monte Carlo approximations that particle filters deliver, by keeping track of certain key features of the genealogical structure arising from resampling operations.
Abstract: SummaryThis paper concerns numerical assessment of Monte Carlo error in particle filters. We show that by keeping track of certain key features of the genealogical structure arising from resampling operations, it is possible to estimate variances of a number of Monte Carlo approximations that particle filters deliver. All our estimators can be computed from a single run of a particle filter. We establish that, as the number of particles grows, our estimators are weakly consistent for asymptotic variances of the Monte Carlo approximations and some of them are also non-asymptotically unbiased. The asymptotic variances can be decomposed into terms corresponding to each time step of the algorithm, and we show how to estimate each of these terms consistently. When the number of particles may vary over time, this allows approximation of the asymptotically optimal allocation of particle numbers.

Journal ArticleDOI
TL;DR: In this paper, a spatially varying coefficient model is proposed to explore the spatial nonstationarity of a regression relationship for spatial data, and a roughness penalty is incorporated to balance the goodness of fit and smoothness.
Abstract: Spatially varying coefficient models are a classical tool to explore the spatial nonstationarity of a regression relationship for spatial data. In this paper, we study the estimation and inference in spatially varying coefficient models for data distributed over complex domains. We use bivariate splines over triangulations to represent the coefficient functions. The estimators of the coefficient functions are consistent, and rates of convergence of the proposed estimators are established. A penalized bivariate spline estimation method is also introduced, in which a roughness penalty is incorporated to balance the goodness of fit and smoothness. In addition, we propose hypothesis tests to examine if the coefficient function is really varying over space or admits a certain parametric form. The proposed method is much more computationally efficient than the well‐known geographically weighted regression technique and thus usable for analyzing massive data sets. The performances of the estimators and the proposed tests are evaluated by simulation experiments. An environmental data example is used to illustrate the application of the proposed method.

Posted Content
TL;DR: In order to speed up Sequential Monte Carlo (SMC) for Bayesian inference in large data problems by data subsampling, an approximately unbiased and efficient annealed likelihood estimator based on data subsampling is used.
Abstract: We show how to speed up Sequential Monte Carlo (SMC) for Bayesian inference in large data problems by data subsampling. SMC sequentially updates a cloud of particles through a sequence of distributions, beginning with a distribution that is easy to sample from such as the prior and ending with the posterior distribution. Each update of the particle cloud consists of three steps: reweighting, resampling, and moving. In the move step, each particle is moved using a Markov kernel; this is typically the most computationally expensive part, particularly when the dataset is large. It is crucial to have an efficient move step to ensure particle diversity. Our article makes two important contributions. First, in order to speed up the SMC computation, we use an approximately unbiased and efficient annealed likelihood estimator based on data subsampling. The subsampling approach is more memory efficient than the corresponding full data SMC, which is an advantage for parallel computation. Second, we use a Metropolis within Gibbs kernel with two conditional updates. A Hamiltonian Monte Carlo update makes distant moves for the model parameters, and a block pseudo-marginal proposal is used for the particles corresponding to the auxiliary variables for the data subsampling. We demonstrate both the usefulness and limitations of the methodology for estimating four generalized linear models and a generalized additive model with large datasets.
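
A bare-bones skeleton of the reweight/resample/move cycle in a likelihood-tempered SMC sampler, on a toy Gaussian model with the full-data likelihood; the paper's subsampled annealed likelihood estimator and Hamiltonian/block pseudo-marginal moves are not reproduced here.

```python
# Hedged skeleton of an SMC sampler: reweight, resample when the effective sample
# size drops, and move particles with a few random-walk Metropolis steps.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(1.5, 1.0, size=1000)                 # toy data: unknown mean, known sd = 1

def log_like(theta):
    return -0.5 * np.sum((data[None, :] - theta[:, None]) ** 2, axis=1)

def log_prior(theta):
    return -0.5 * theta ** 2 / 25.0                    # N(0, 5^2) prior

N = 500
theta = rng.normal(0.0, 5.0, size=N)                   # particles drawn from the prior
logw = np.zeros(N)
temps = np.linspace(0.0, 1.0, 21) ** 2                 # quadratic ladder: gentle early steps

for gamma_prev, gamma in zip(temps[:-1], temps[1:]):
    # Reweight: incremental weight = likelihood^(gamma - gamma_prev).
    logw += (gamma - gamma_prev) * log_like(theta)
    w = np.exp(logw - logw.max()); w /= w.sum()

    # Resample (multinomial) when the effective sample size gets small.
    if 1.0 / np.sum(w ** 2) < N / 2:
        theta = theta[rng.choice(N, size=N, p=w)]
        logw = np.zeros(N)

    # Move: random-walk Metropolis steps targeting the tempered posterior.
    for _ in range(3):
        prop = theta + 0.5 * theta.std() * rng.normal(size=N)
        log_acc = (gamma * log_like(prop) + log_prior(prop)
                   - gamma * log_like(theta) - log_prior(theta))
        accept = np.log(rng.random(N)) < log_acc
        theta = np.where(accept, prop, theta)

w = np.exp(logw - logw.max()); w /= w.sum()
print(f"posterior mean ~ {np.sum(w * theta):.3f}  (sample mean of data: {data.mean():.3f})")
```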

Journal ArticleDOI
TL;DR: A general framework for assessing and comparing the stability of results is presented and it is demonstrated that unstable algorithms can produce stable results when the functional form of the relationship between the predictors and the response matches the algorithm.
Abstract: Stability is a major requirement to draw reliable conclusions when interpreting results from supervised statistical learning. In this article, we present a general framework for assessing and compa...

Journal ArticleDOI
TL;DR: The key contribution of this paper is the generalization of centroidal Voronoi tessellation (CVT) to point cloud datasets to make point resampling practical and efficient.
Abstract: This paper presents a novel technique for resampling point clouds of a smooth surface. The key contribution of this paper is the generalization of centroidal Voronoi tessellation (CVT) to point cloud datasets to make point resampling practical and efficient. In particular, the CVT on a point cloud is efficiently computed by restricting the Voronoi cells to the underlying surface, which is locally approximated by a set of best-fitting planes. We also develop an efficient method to progressively improve the resampling quality by interleaving optimization of resampling points and update of the fitting planes. Our versatile framework is capable of generating high-quality resampling results with isotropic or anisotropic distributions from a given point cloud. We conduct extensive experiments to demonstrate the efficacy and robustness of our resampling method.

Journal ArticleDOI
TL;DR: A wild bootstrap resampling technique is suggested for nonparametric inference on transition probabilities in a general time-inhomogeneous Markov multistate model, and is used to investigate a non-standard time-to-event outcome with data from a recent study of prophylactic treatment in allogeneic transplanted leukemia patients.
Abstract: We suggest a wild bootstrap resampling technique for nonparametric inference on transition probabilities in a general time-inhomogeneous Markov multistate model. We first approximate the limiting distribution of the Nelson-Aalen estimator by repeatedly generating standard normal wild bootstrap variates, while the data is kept fixed. Next, a transformation using a functional delta method argument is applied. The approach is conceptually easier than direct resampling for the transition probabilities. It is used to investigate a non-standard time-to-event outcome, currently being alive without immunosuppressive treatment, with data from a recent study of prophylactic treatment in allogeneic transplanted leukemia patients. Due to non-monotonic outcome probabilities in time, neither standard survival nor competing risks techniques apply, which highlights the need for the present methodology. Finite sample performance of time-simultaneous confidence bands for the outcome probabilities is assessed in an extensive simulation study motivated by the clinical trial data. Example code is provided in the web-based Supplementary Materials.
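
A much-simplified sketch of the wild bootstrap idea in a plain survival setting: the data stay fixed, standard normal multipliers are attached to the jumps of the Nelson-Aalen estimator, and the resampled processes approximate its limiting distribution (summarized here by the half-width of an unweighted simultaneous band). The functional delta method step that carries this over to transition probabilities in the multistate model is omitted, and all data are synthetic.

```python
# Hedged sketch: wild bootstrap for the Nelson-Aalen estimator with N(0,1)
# multipliers on the jump increments; the data are kept fixed.
import numpy as np

rng = np.random.default_rng(0)
n = 300
event_time = rng.exponential(1.0, size=n)
censor_time = rng.exponential(1.5, size=n)
time = np.minimum(event_time, censor_time)
status = (event_time <= censor_time).astype(int)

order = np.argsort(time)
time, status = time[order], status[order]
at_risk = n - np.arange(n)                       # Y(t_i) at the sorted times
jumps = status / at_risk                         # dA(t_i) = dN(t_i) / Y(t_i)
nelson_aalen = np.cumsum(jumps)

# Wild bootstrap: multiply each jump by an independent standard normal variate.
B = 1000
sup_stats = np.empty(B)
for b in range(B):
    g = rng.normal(size=n)
    w_process = np.cumsum(g * jumps)             # resampled centered process
    sup_stats[b] = np.abs(w_process).max()

band = np.quantile(sup_stats, 0.95)
print(f"Nelson-Aalen at last event time: {nelson_aalen[-1]:.3f}")
print(f"half-width of an unweighted simultaneous 95% band: {band:.3f}")
```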

Book ChapterDOI
09 Nov 2018
TL;DR: This paper proposes a new undersampling method that eliminates negative instances from the overlapping region and hence improves the visibility of the minority instances and shows statistically significant improvements in classification performance.
Abstract: Classification of imbalanced data remains an important field in machine learning. Several methods have been proposed to address the class imbalance problem including data resampling, adaptive learning and cost adjusting algorithms. Data resampling methods are widely used due to their simplicity and flexibility. Most existing resampling techniques aim at rebalancing class distribution. However, class imbalance is not the only factor that impacts the performance of the learning algorithm. Class overlap has proved to have a higher impact on the classification of imbalanced datasets than the dominance of the negative class. In this paper, we propose a new undersampling method that eliminates negative instances from the overlapping region and hence improves the visibility of the minority instances. Testing and evaluating the proposed method using 36 public imbalanced datasets showed statistically significant improvements in classification performance.
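
One simple way to operationalize overlap-driven undersampling (not necessarily the authors' exact rule) is to flag majority-class points whose nearest neighbours include minority points and drop them, as in the hedged sketch below on synthetic data.

```python
# Hedged sketch: remove majority-class points lying in the overlapping region,
# identified here as those with at least one minority point among their k neighbours.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], n_features=6,
                           n_informative=4, class_sep=0.8, random_state=0)

k = 7
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, nbrs = nn.kneighbors(X)                         # first neighbour is the point itself

minority_label = 1
in_overlap = (y[nbrs[:, 1:]] == minority_label).any(axis=1)
drop = (y != minority_label) & in_overlap          # majority points near minority ones

X_res, y_res = X[~drop], y[~drop]
print(f"class counts before: {np.bincount(y)}  after undersampling: {np.bincount(y_res)}")
```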

Journal ArticleDOI
TL;DR: It is shown through a mix of numerical and theoretical work that the bootstrap is fraught with problems, and both of the most commonly used methods of bootstrapping for regression (residual bootstrap and pairs bootstrap) give very poor inference on β as the ratio p/n grows.
Abstract: We consider the performance of the bootstrap in high-dimensions for the setting of linear regression, where p < n but p/n is not close to zero. We consider ordinary least-squares as well as robust ...

Journal ArticleDOI
TL;DR: The results of an extensive simulation study and applications to several real data sets show that the new permutation based significance test for Kernel Change Point detection performs either on par with or better than the state-of-the-art significance tests for detecting the presence of correlation changes, implying that its use can be generally recommended.
Abstract: Detecting abrupt correlation changes in multivariate time series is crucial in many application fields such as signal processing, functional neuroimaging, climate studies, and financial analysis. To detect such changes, several promising correlation change tests exist, but they may suffer from severe loss of power when there is actually more than one change point underlying the data. To deal with this drawback, we propose a permutation based significance test for Kernel Change Point (KCP) detection on the running correlations. Given a requested number of change points K, KCP divides the time series into K + 1 phases by minimizing the within-phase variance. The new permutation test looks at how the average within-phase variance decreases when K increases and compares this to the results for permuted data. The results of an extensive simulation study and applications to several real data sets show that, depending on the setting, the new test performs either on par with or better than the state-of-the-art significance tests for detecting the presence of correlation changes, implying that its use can be generally recommended.

Journal ArticleDOI
TL;DR: In this paper, missing potential outcomes are imputed under a sharp null hypothesis compatible with the weak null hypothesis that the treatment does not affect the units on average, and a studentized statistic is advocated for the Fisher randomization test (FRT).
Abstract: The Fisher randomization test (FRT) is appropriate for any test statistic, under a sharp null hypothesis that can recover all missing potential outcomes. However, it is often of interest to test a weak null hypothesis that the treatment does not affect the units on average. To use the FRT for a weak null hypothesis, we must address two issues. First, we need to impute the missing potential outcomes although the weak null hypothesis cannot determine all of them. Second, we need to choose a proper test statistic. For a general weak null hypothesis, we propose an approach to imputing missing potential outcomes under a compatible sharp null hypothesis. Building on this imputation scheme, we advocate a studentized statistic. The resulting FRT has multiple desirable features. First, it is model-free. Second, it is finite-sample exact under the sharp null hypothesis that we use to impute the potential outcomes. Third, it conservatively controls large-sample type I errors under the weak null hypothesis of interest. Therefore, our FRT is agnostic to treatment effect heterogeneity. We establish a unified theory for general factorial experiments. We also extend it to stratified and clustered experiments.
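
A hedged sketch of the resulting procedure for the weak null of zero average effect in a two-arm experiment: the compatible sharp null of no individual effect lets re-randomization simply permute the observed outcomes, and the studentized (Welch-type) difference in means is used as the test statistic. Data here are synthetic and deliberately heteroscedastic.

```python
# Hedged sketch: Fisher randomization test with a studentized statistic,
# imputing under the sharp null of no individual effect.
import numpy as np

rng = np.random.default_rng(0)
n1, n0 = 60, 140
y1 = rng.normal(0.4, 2.0, size=n1)                 # treated outcomes
y0 = rng.normal(0.0, 0.5, size=n0)                 # control outcomes

def studentized(a, b):
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

observed = studentized(y1, y0)
pooled = np.concatenate([y1, y0])

B = 9999
count = 0
for _ in range(B):
    treated = rng.permutation(len(pooled))[:n1]    # re-randomize the treatment assignment
    mask = np.zeros(len(pooled), dtype=bool)
    mask[treated] = True
    count += abs(studentized(pooled[mask], pooled[~mask])) >= abs(observed)

print(f"observed t = {observed:.2f}, FRT p-value = {(count + 1) / (B + 1):.4f}")
```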

Posted Content
TL;DR: This article proposes an inference procedure that is robust not only to small probability weights entering the IPW estimator but also to a wide range of trimming threshold choices, by adapting to the different possible asymptotic distributions.
Abstract: Inverse Probability Weighting (IPW) is widely used in empirical work in economics and other disciplines. As Gaussian approximations perform poorly in the presence of "small denominators," trimming is routinely employed as a regularization strategy. However, ad hoc trimming of the observations renders usual inference procedures invalid for the target estimand, even in large samples. In this paper, we first show that the IPW estimator can have different (Gaussian or non-Gaussian) asymptotic distributions, depending on how "close to zero" the probability weights are and on how large the trimming threshold is. As a remedy, we propose an inference procedure that is robust not only to small probability weights entering the IPW estimator but also to a wide range of trimming threshold choices, by adapting to these different asymptotic distributions. This robustness is achieved by employing resampling techniques and by correcting a non-negligible trimming bias. We also propose an easy-to-implement method for choosing the trimming threshold by minimizing an empirical analogue of the asymptotic mean squared error. In addition, we show that our inference procedure remains valid with the use of a data-driven trimming threshold. We illustrate our method by revisiting a dataset from the National Supported Work program.
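
For contrast, the snippet below shows the kind of naive baseline the paper improves on, using made-up data: an IPW estimate of a treated-outcome mean with a fixed ad hoc trimming threshold and a plain nonparametric bootstrap interval. The bias correction, robust critical values, and data-driven threshold choice proposed in the paper are not implemented here.

```python
# Hedged, simplified illustration: trimmed IPW estimate of E[Y(1)] with a naive
# bootstrap interval; the trimming threshold is chosen ad hoc.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
p_treat = 1.0 / (1.0 + np.exp(-(1.5 * x - 1.0)))     # some propensities close to zero
d = rng.random(n) < p_treat                          # treatment indicator
y = 1.0 + 0.5 * x + rng.normal(size=n)               # outcome (estimator only uses treated units)

def ipw_mean(x, d, y, trim=0.05):
    ps = LogisticRegression().fit(x.reshape(-1, 1), d).predict_proba(x.reshape(-1, 1))[:, 1]
    keep = ps >= trim                                # ad hoc trimming of small weights
    return np.mean(d[keep] * y[keep] / ps[keep])

estimate = ipw_mean(x, d, y)
boot = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)
    boot.append(ipw_mean(x[idx], d[idx], y[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"trimmed IPW estimate of E[Y(1)]: {estimate:.3f}, naive bootstrap CI ({lo:.3f}, {hi:.3f})")
```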