
Showing papers on "Resampling published in 2008"


Journal ArticleDOI
TL;DR: The bias-corrected bootstrap had the least biased confidence intervals, the greatest power to detect nonzero effects and contrasts, and the most accurate overall Type I error rates for 2-path effects, while resampling approaches overall had somewhat greater power and might be preferable because of their ease of use and flexibility.
Abstract: Recent advances in testing mediation have found that certain resampling methods and tests based on the mathematical distribution of 2 normal random variables substantially outperform the traditiona...

1,034 citations
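As a rough, hedged illustration of the kind of resampling test discussed above, the sketch below computes a bias-corrected bootstrap confidence interval for a simple single-mediator indirect effect a*b. The simulated data, variable names, and plain OLS fits are illustrative assumptions, not the authors' models or code.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulated single-mediator data: X -> M -> Y (illustrative only).
n = 200
x = rng.normal(size=n)
m = 0.4 * x + rng.normal(size=n)
y = 0.5 * m + rng.normal(size=n)

def indirect_effect(x, m, y):
    """a*b from two OLS fits: M ~ X and Y ~ M + X."""
    a = np.polyfit(x, m, 1)[0]
    X = np.column_stack([np.ones_like(x), m, x])
    b = np.linalg.lstsq(X, y, rcond=None)[0][1]
    return a * b

est = indirect_effect(x, m, y)

# Ordinary nonparametric bootstrap of the indirect effect.
B = 2000
boot = np.empty(B)
for i in range(B):
    idx = rng.integers(0, n, n)
    boot[i] = indirect_effect(x[idx], m[idx], y[idx])

# Bias-corrected (BC) percentile interval: shift the percentiles by z0, the
# normal quantile of the proportion of bootstrap draws below the point estimate.
z0 = norm.ppf(np.mean(boot < est))
alpha = 0.05
lo, hi = norm.cdf(2 * z0 + norm.ppf([alpha / 2, 1 - alpha / 2]))
ci = np.quantile(boot, [lo, hi])
print(f"indirect effect = {est:.3f}, 95% BC bootstrap CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```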


Book
01 Jan 2008
TL;DR: In this book, the authors present statistical methods for paleontological data, from basic univariate tests through multivariate analysis, morphometrics, phylogenetic analysis, paleobiogeography and paleoecology, time series analysis, and quantitative biostratigraphy.
Abstract: Preface. Acknowledgments.
1 Introduction: 1.1 The nature of paleontological data. 1.2 Advantages and pitfalls of paleontological data analysis. 1.3 Software.
2 Basic statistical methods: 2.1 Introduction. 2.2 Statistical distributions. 2.3 Shapiro-Wilk test for normal distribution. 2.4 F test for equality of variances. 2.5 Student's t test and Welch test for equality of means. 2.6 Mann-Whitney U test for equality of medians. 2.7 Kolmogorov-Smirnov test for equality of distributions. 2.8 Permutation and resampling. 2.9 One-way ANOVA. 2.10 Kruskal-Wallis test. 2.11 Linear correlation. 2.12 Non-parametric tests for correlation. 2.13 Linear regression. 2.14 Reduced major axis regression. 2.15 Nonlinear curve fitting. 2.16 Chi-square test.
3 Introduction to multivariate data analysis: 3.1 Approaches to multivariate data analysis. 3.2 Multivariate distributions. 3.3 Parametric multivariate tests. 3.4 Non-parametric multivariate tests. 3.5 Hierarchical cluster analysis. 3.6 K-means cluster analysis.
4 Morphometrics: 4.1 Introduction. 4.2 The allometric equation. 4.3 Principal components analysis (PCA). 4.4 Multivariate allometry. 4.5 Discriminant analysis for two groups. 4.6 Canonical variate analysis (CVA). 4.7 MANOVA. 4.8 Fourier shape analysis. 4.9 Elliptic Fourier analysis. 4.10 Eigenshape analysis. 4.11 Landmarks and size measures. 4.12 Procrustean fitting. 4.13 PCA of landmark data. 4.14 Thin-plate spline deformations. 4.15 Principal and partial warps. 4.16 Relative warps. 4.17 Regression of partial warp scores. 4.18 Disparity measures. 4.19 Point distribution statistics. 4.20 Directional statistics. Case study: The ontogeny of a Silurian trilobite.
5 Phylogenetic analysis: 5.1 Introduction. 5.2 Characters. 5.3 Parsimony analysis. 5.4 Character state reconstruction. 5.5 Evaluation of characters and tree topologies. 5.6 Consensus trees. 5.7 Consistency index. 5.8 Retention index. 5.9 Bootstrapping. 5.10 Bremer support. 5.11 Stratigraphical congruency indices. 5.12 Phylogenetic analysis with Maximum Likelihood. Case study: The systematics of heterosporous ferns.
6 Paleobiogeography and paleoecology: 6.1 Introduction. 6.2 Diversity indices. 6.3 Taxonomic distinctness. 6.4 Comparison of diversity indices. 6.5 Abundance models. 6.6 Rarefaction. 6.7 Diversity curves. 6.8 Size-frequency and survivorship curves. 6.9 Association similarity indices for presence/absence data. 6.10 Association similarity indices for abundance data. 6.11 ANOSIM and NPMANOVA. 6.12 Correspondence analysis. 6.13 Principal Coordinates analysis (PCO). 6.14 Non-metric Multidimensional Scaling (NMDS). 6.15 Seriation. Case study: Ashgill brachiopod paleocommunities from East China.
7 Time series analysis: 7.1 Introduction. 7.2 Spectral analysis. 7.3 Autocorrelation. 7.4 Cross-correlation. 7.5 Wavelet analysis. 7.6 Smoothing and filtering. 7.7 Runs test. Case study: Sepkoski's generic diversity curve for the Phanerozoic.
8 Quantitative biostratigraphy: 8.1 Introduction. 8.2 Parametric confidence intervals on stratigraphic ranges. 8.3 Non-parametric confidence intervals on stratigraphic ranges. 8.4 Graphic correlation. 8.5 Constrained optimisation. 8.6 Ranking and scaling. 8.7 Unitary Associations. 8.8 Biostratigraphy by ordination. 8.9 What is the best method for quantitative biostratigraphy?
Appendix A: Plotting techniques. Appendix B: Mathematical concepts and notation. References. Index

867 citations


Journal ArticleDOI
TL;DR: A revised version of the metareg command, which performs meta-analysis regression (meta-regression) on study-level summary data, is presented; the major revisions involve improvements to the estimation methods and the addition of an option to use a permutation test to estimate p-values, including an adjustment for multiple testing.
Abstract: We present a revised version of the metareg command, which performs meta-analysis regression (meta-regression) on study-level summary data. The major revisions involve improvements to the estimation methods and the addition of an option to use a permutation test to estimate p-values, including an adjustment for multiple testing. We have also made additions to the output, added an option to produce a graph, and included support for the predict command. Stata 8.0 or above is required.

794 citations
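metareg itself is a Stata command; the Python sketch below only illustrates the idea behind its permutation option for a meta-regression slope: shuffle the study-level covariate, refit an inverse-variance weighted regression, and compare the observed slope with the permutation distribution. The simulated studies and the omission of a between-study variance component are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative study-level data: effect sizes, their variances, and one covariate.
k = 20
covariate = rng.normal(size=k)
variances = rng.uniform(0.05, 0.2, size=k)
effects = 0.3 * covariate + rng.normal(scale=np.sqrt(variances))

def wls_slope(y, x, w):
    """Weighted least-squares slope of y on x with weights w."""
    X = np.column_stack([np.ones_like(x), x])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[1]

w = 1.0 / variances                      # inverse-variance weights
observed = wls_slope(effects, covariate, w)

# Permutation test: shuffle the covariate across studies and refit.
B = 5000
perm = np.empty(B)
for i in range(B):
    perm[i] = wls_slope(effects, rng.permutation(covariate), w)

p_value = (1 + np.sum(np.abs(perm) >= np.abs(observed))) / (B + 1)
print(f"slope = {observed:.3f}, permutation p-value = {p_value:.4f}")
```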


Journal ArticleDOI
TL;DR: It is shown that interpolated signals and their derivatives contain specific detectable periodic properties, and a blind, efficient, and automatic method capable of finding traces of resampling and interpolation is proposed.
Abstract: In this paper, we analyze and analytically describe the specific statistical changes brought into the covariance structure of the signal by the interpolation process. We show that interpolated signals and their derivatives contain specific detectable periodic properties. Based on this, we propose a blind, efficient, and automatic method capable of finding traces of resampling and interpolation. The proposed method can be very useful in many areas, especially in image security and authentication. For instance, when two or more images are spliced together to create high-quality and consistent image forgeries, geometric transformations such as scaling, rotation, or skewing are almost always needed. These procedures are typically based on a resampling and interpolation step. By having a method capable of detecting the traces of resampling, we can significantly reduce the successful usage of such forgeries. Among other applications, the presented method is also very useful in the estimation of geometric transformation factors.

304 citations
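A toy sketch of the general idea described above (not the paper's derivation or implementation): after interpolation, the residue of a simple local linear predictor acquires a periodic structure that shows up as strong peaks in its spectrum. The 1-D random-walk signal, the neighbour-average predictor, and the 10x-median peak threshold are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Original signal and a linearly interpolated (resampled) version of it.
n = 1024
original = np.cumsum(rng.normal(size=n))              # smooth-ish random walk
positions = np.linspace(0, n - 1, int(n * 1.5))       # upsampling by a factor of 1.5
resampled = np.interp(positions, np.arange(n), original)

def residue_peaks(signal):
    """Count strong peaks in the spectrum of |residue| of a local linear predictor."""
    predicted = 0.5 * (signal[:-2] + signal[2:])      # predict each sample from its neighbours
    residue = np.abs(signal[1:-1] - predicted)
    spectrum = np.abs(np.fft.rfft(residue - residue.mean()))
    # Peaks far above the median spectrum level hint at the periodic correlations
    # introduced by interpolation.
    return int(np.sum(spectrum > 10 * np.median(spectrum)))

print("original :", residue_peaks(original), "strong spectral peaks")
print("resampled:", residue_peaks(resampled), "strong spectral peaks")
```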


Journal ArticleDOI
TL;DR: In this article, a new quantile regression approach for survival data subject to conditionally independent censoring is proposed, which leads to a simple algorithm that involves minimizations only of L1-type convex functions.
Abstract: Quantile regression offers great flexibility in assessing covariate effects on event times, thereby attracting considerable interest in its application to survival analysis. But currently available methods often require stringent assumptions or complex algorithms. In this article we develop a new quantile regression approach for survival data subject to conditionally independent censoring. The proposed martingale-based estimating equations naturally lead to a simple algorithm that involves minimizations only of L1-type convex functions. We establish uniform consistency and weak convergence of the resultant estimators. We develop inferences accordingly, including hypothesis testing, second-stage inference, and model diagnostics. We evaluate the finite-sample performance of the proposed methods through extensive simulation studies. An analysis of a recent dialysis study illustrates the practical utility of our proposals.

285 citations


Journal ArticleDOI
TL;DR: A systematic review of modern approaches to assessing risk prediction models, comparing measures of predictive performance derived from ROC methodology and from probability forecasting theory.
Abstract: For medical decision making and patient information, predictions of future status variables play an important role. Risk prediction models can be derived with many different statistical approaches. To compare them, measures of predictive performance are derived from ROC methodology and from probability forecasting theory. These tools can be applied to assess single markers, multivariable regression models and complex model selection algorithms. This article provides a systematic review of the modern way of assessing risk prediction models. Particular attention is put on proper benchmarks and resampling techniques that are important for the interpretation of measured performance. All methods are illustrated with data from a clinical study in head and neck cancer patients.

249 citations
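As a generic, hedged example of one resampling benchmark such a review covers, the snippet below estimates the optimism of an apparent AUC by refitting a logistic risk model on bootstrap samples (Harrell-style optimism correction). The simulated markers and the scikit-learn model are stand-ins, not the article's data or methods.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

# Simulated risk-prediction data: 5 markers, binary outcome.
n, p = 300, 5
X = rng.normal(size=(n, p))
logit = X @ np.array([0.8, 0.5, 0.0, 0.0, -0.3])
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = LogisticRegression(max_iter=1000).fit(X, y)
apparent_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])   # optimistic

# Optimism correction: in each bootstrap replicate, refit the model on the
# resample and compare its AUC on the resample with its AUC on the original
# data; the average gap estimates the optimism of the apparent AUC.
B = 200
optimism = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)
    mb = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], mb.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, mb.predict_proba(X)[:, 1])
    optimism[b] = auc_boot - auc_orig

print(f"apparent AUC = {apparent_auc:.3f}")
print(f"optimism-corrected AUC = {apparent_auc - optimism.mean():.3f}")
```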


Proceedings ArticleDOI
22 Sep 2008
TL;DR: This paper revisits the state-of-the-art resampling detector and presents an equivalent accelerated and simplified detector, which is orders of magnitude faster than the conventional scheme and experimentally shown to be comparably reliable.
Abstract: This paper revisits the state-of-the-art resampling detector, which is based on periodic artifacts in the residue of a local linear predictor. Inspired by recent findings from the literature, we take a closer look at the complex detection procedure and model the detected artifacts in the spatial and frequency domain by means of the variance of the prediction residue. We give an exact formulation on how transformation parameters influence the appearance of periodic artifacts and analytically derive the expected position of characteristic resampling peaks. We present an equivalent accelerated and simplified detector, which is orders of magnitude faster than the conventional scheme and experimentally shown to be comparably reliable.

215 citations


Journal ArticleDOI
TL;DR: A new method is discussed for efficient guided simulation of dependent data that satisfy imposed network constraints as conditional independence structures, which is useful for testing potentially new methods of π0 or FDR estimation in a dependency context.
Abstract: We consider effects of dependence among variables of high-dimensional data in multiple hypothesis testing problems, in particular the False Discovery Rate (FDR) control procedures. Recent simulation studies consider only simple correlation structures among variables, which is hardly inspired by real data features. Our aim is to systematically study effects of several network features like sparsity and correlation strength by imposing dependence structures among variables using random correlation matrices. We study the robustness against dependence of several FDR procedures that are popular in microarray studies, such as Benjamini-Hochberg FDR, Storey's q-value, SAM and resampling based FDR procedures. False Non-discovery Rates and estimates of the number of null hypotheses are computed from those methods and compared. Our simulation study shows that methods such as SAM and the q-value do not adequately control the FDR to the level claimed under dependence conditions. On the other hand, the adaptive Benjamini-Hochberg procedure seems to be most robust while remaining conservative. Finally, the estimates of the number of true null hypotheses under various dependence conditions are variable. We discuss a new method for efficient guided simulation of dependent data, which satisfy imposed network constraints as conditional independence structures. Our simulation set-up allows for a structural study of the effect of dependencies on multiple testing criteria and is useful for testing a potentially new method on π0 or FDR estimation in a dependency context.

210 citations
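A much-simplified sketch of the kind of simulation described above: draw equicorrelated test statistics (a crude stand-in for the paper's network-constrained correlation matrices), convert them to p-values, and apply the Benjamini-Hochberg step-up procedure implemented directly. The dimensions, correlation level, and effect sizes are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

m, m_alt = 1000, 100               # number of tests, number of true signals
rho = 0.4                          # equicorrelation among test statistics

# Equicorrelated multivariate normal test statistics.
cov = np.full((m, m), rho) + (1 - rho) * np.eye(m)
mean = np.zeros(m)
mean[:m_alt] = 3.0                 # shifted means for the true alternatives
z = rng.multivariate_normal(mean, cov, method="cholesky")
pvals = 2 * norm.sf(np.abs(z))

def benjamini_hochberg(p, q=0.05):
    """Boolean rejection mask for the BH step-up procedure at level q."""
    n_tests = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, n_tests + 1) / n_tests
    below = p[order] <= thresholds
    rejected = np.zeros(n_tests, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # largest index meeting its threshold
        rejected[order[: k + 1]] = True
    return rejected

rej = benjamini_hochberg(pvals)
false_disc = np.sum(rej[m_alt:])              # rejections among the true nulls
print(f"rejections: {rej.sum()}, false discoveries among them: {false_disc}")
```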


Journal ArticleDOI
TL;DR: Early findings on "counter-forensic" techniques put into question the reliability of known forensic tools against smart counterfeiters in general, and might serve as benchmarks and motivation for the development of much improved forensic techniques.
Abstract: Resampling detection has become a standard tool for forensic analyses of digital images. This paper presents new variants of image transformation operations which are undetectable by resampling detectors based on periodic variations in the residual signal of local linear predictors in the spatial domain. The effectiveness of the proposed method is supported with evidence from experiments on a large image database for various parameter settings. We benchmark detectability as well as the resulting image quality against conventional linear and bicubic interpolation and interpolation with a sinc kernel. These early findings on "counter-forensic" techniques put into question the reliability of known forensic tools against smart counterfeiters in general, and might serve as benchmarks and motivation for the development of much improved forensic techniques.

201 citations


Journal ArticleDOI
TL;DR: Yin and Cook as mentioned in this paper proposed a dimension reduction method for estimating the directions in a multiple-index regression based on information extraction, which significantly reduces the computational complexity, because the nonparametric procedure involves only a one-dimensional search at each stage.

179 citations


Proceedings ArticleDOI
20 Jul 2008
TL;DR: This paper presents a cluster-based resampling method to select better pseudo-relevant documents based on the relevance model; the method shows higher relevance density than the baseline relevance model on all collections, resulting in better retrieval accuracy in pseudo-relevance feedback.
Abstract: Typical pseudo-relevance feedback methods assume the top-retrieved documents are relevant and use these pseudo-relevant documents to expand terms. The initial retrieval set can, however, contain a great deal of noise. In this paper, we present a cluster-based resampling method to select better pseudo-relevant documents based on the relevance model. The main idea is to use document clusters to find dominant documents for the initial retrieval set, and to repeatedly feed the documents to emphasize the core topics of a query. Experimental results on large-scale web TREC collections show significant improvements over the relevance model. For justification of the resampling approach, we examine relevance density of feedback documents. A higher relevance density will result in greater retrieval accuracy, ultimately approaching true relevance feedback. The resampling approach shows higher relevance density than the baseline relevance model on all collections, resulting in better retrieval accuracy in pseudo-relevance feedback. This result indicates that the proposed method is effective for pseudo-relevance feedback.
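The snippet below is only a toy illustration of the clustering step (group the top-retrieved documents and keep those from the dominant cluster as feedback documents); the tiny corpus, TF-IDF representation, and k-means clustering are illustrative stand-ins and do not reproduce the paper's relevance-model machinery or TREC experiments.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Pretend these are the top-retrieved documents for some query (illustrative).
top_docs = [
    "resampling methods for particle filters",
    "particle filter resampling strategies and degeneracy",
    "bootstrap resampling in regression models",
    "cooking recipes for pasta and sauce",
    "sequential monte carlo resampling schemes",
    "pasta sauce and seasoning tips",
]

tfidf = TfidfVectorizer().fit_transform(top_docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)

# Keep documents from the dominant (largest) cluster as pseudo-relevant feedback.
labels, counts = np.unique(km.labels_, return_counts=True)
dominant = labels[np.argmax(counts)]
feedback = [d for d, lab in zip(top_docs, km.labels_) if lab == dominant]
print("feedback documents:")
for d in feedback:
    print(" -", d)
```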

Journal ArticleDOI
TL;DR: In this article, a new statistical method for regional climate simulations is introduced, which is constrained only by the parameters of a linear regression line for a characteristic climatological variable, and is evaluated by means of a cross validation experiment for the Elbe river basin.
Abstract: A new statistical method for regional climate simulations is introduced. Its simulations are constrained only by the parameters of a linear regression line for a characteristic climatological variable. Simulated series are generated by resampling from segments of observation series such that the resulting series comply with the prescribed regression parameters and possess realistic annual cycles and persistence. The resampling guarantees that the simulated series are physically consistent both with respect to the combinations of different meteorological variables and to their spatial distribution at each time step. The resampling approach is evaluated by means of a cross validation experiment for the Elbe river basin: Its simulations are compared both to an observed climatology and to data simulated by a dynamical RCM. This cross validation shows that the approach is able to reproduce the observed climatology with respect to statistics such as long-term means, persistence features (e.g., dry spells) and extreme events. The agreement of its simulations with the observational data is much closer than for the RCM data.

Journal ArticleDOI
TL;DR: A general strategy for variable selection in semiparametric regression models by penalizing appropriate estimating functions; a general asymptotic theory for penalized estimating functions is established and suitable numerical algorithms are presented to implement the proposed estimators.
Abstract: We propose a general strategy for variable selection in semiparametric regression models by penalizing appropriate estimating functions. Important applications include semiparametric linear regression with censored responses and semiparametric regression with missing predictors. Unlike the existing penalized maximum likelihood estimators, the proposed penalized estimating functions may not pertain to the derivatives of any objective functions and may be discrete in the regression coefficients. We establish a general asymptotic theory for penalized estimating functions and present suitable numerical algorithms to implement the proposed estimators. In addition, we develop a resampling technique to estimate the variances of the estimated regression coefficients when the asymptotic variances cannot be evaluated directly. Simulation studies demonstrate that the proposed methods perform well in variable selection and variance estimation. We illustrate our methods using data from the Paul Coverdell Stroke Registry.

Posted Content
TL;DR: The weighted ensemble method of Huber and Kim is shown to be statistically exact for a wide class of Markovian and non-Markovian dynamics, and arbitrary nonstatic binning procedures, which merely guide the resampling process, are also shown to be valid.
Abstract: The "weighted ensemble" method, introduced by Huber and Kim, [G. A. Huber and S. Kim, Biophys. J. 70, 97 (1996)], is one of a handful of rigorous approaches to path sampling of rare events. Expanding earlier discussions, we show that the technique is statistically exact for a wide class of Markovian and non-Markovian dynamics. The derivation is based on standard path-integral (path probability) ideas, but recasts the weighted-ensemble approach as simple "resampling" in path space. Similar reasoning indicates that arbitrary nonstatic binning procedures, which merely guide the resampling process, are also valid. Numerical examples confirm the claims, including the use of bins which can adaptively find the target state in a simple model.

Journal ArticleDOI
TL;DR: Analogy-X provides a sound statistical basis for analogy, removes the need for heuristic search and greatly improves its algorithmic performance.
Abstract: Data-intensive analogy has been proposed as a means of software cost estimation as an alternative to other data intensive methods such as linear regression. Unfortunately, there are drawbacks to the method. There is no mechanism to assess its appropriateness for a specific dataset. In addition, heuristic algorithms are necessary to select the best set of variables and identify abnormal project cases. We introduce a solution to these problems based upon the use of the Mantel correlation randomization test called Analogy-X. We use the strength of correlation between the distance matrix of project features and the distance matrix of known effort values of the dataset. The method is demonstrated using the Desharnais dataset and two random datasets, showing (1) the use of Mantel's correlation to identify whether analogy is appropriate, (2) a stepwise procedure for feature selection, as well as (3) the use of a leverage statistic for sensitivity analysis that detects abnormal data points. Analogy-X, thus, provides a sound statistical basis for analogy, removes the need for heuristic search and greatly improves its algorithmic performance.
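A compact sketch of the Mantel randomization test at the core of Analogy-X: correlate the upper triangles of a feature-distance matrix and an effort-distance matrix, then permute the rows and columns of one matrix together to obtain a p-value. The project data here are simulated, not the Desharnais dataset.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(6)

# Simulated software projects: feature vectors and effort values.
n = 30
features = rng.normal(size=(n, 4))
effort = 10 + features @ np.array([2.0, 1.0, 0.0, 0.0]) + rng.normal(size=n)

d_feat = squareform(pdist(features))             # project-feature distances
d_eff = squareform(pdist(effort[:, None]))       # effort distances

def mantel_r(a, b):
    """Pearson correlation between the upper triangles of two distance matrices."""
    iu = np.triu_indices_from(a, k=1)
    return np.corrcoef(a[iu], b[iu])[0, 1]

observed = mantel_r(d_feat, d_eff)

# Permute the labels of one matrix (rows and columns together) and recompute.
B = 2000
perm = np.empty(B)
for i in range(B):
    p = rng.permutation(n)
    perm[i] = mantel_r(d_feat, d_eff[np.ix_(p, p)])

p_value = (1 + np.sum(perm >= observed)) / (B + 1)
print(f"Mantel r = {observed:.3f}, one-sided permutation p-value = {p_value:.4f}")
```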

Journal ArticleDOI
TL;DR: Three alternatives to MCMC methods are reviewed, including importance sampling, the forward-backward algorithm, and sequential Monte Carlo (SMC), which are demonstrated on a range of examples, including estimating the transition density of a diffusion and of a discrete-state continuous-time Markov chain; inferring structure in population genetics; and segmenting genetic divergence data.
Abstract: We consider analysis of complex stochastic models based upon partial information. MCMC and reversible jump MCMC are often the methods of choice for such problems, but in some situations they can be difficult to implement; and suffer from problems such as poor mixing, and the difficulty of diagnosing convergence. Here we review three alternatives to MCMC methods: importance sampling, the forward-backward algorithm, and sequential Monte Carlo (SMC). We discuss how to design good proposal densities for importance sampling, show some of the range of models for which the forward-backward algorithm can be applied, and show how resampling ideas from SMC can be used to improve the efficiency of the other two methods. We demonstrate these methods on a range of examples, including estimating the transition density of a diffusion and of a discrete-state continuous-time Markov chain; inferring structure in population genetics; and segmenting genetic divergence data.
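As a small illustration of one of the resampling ideas mentioned, here is a systematic resampling step of the kind used in SMC/particle filtering; the toy weights are made up and this is not the authors' code.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Return particle indices drawn by systematic resampling.

    One uniform draw is stratified across N equally spaced points, which keeps
    the resampled counts close to N * weight for every particle.
    """
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0                       # guard against round-off
    return np.searchsorted(cumulative, positions)

rng = np.random.default_rng(7)
weights = rng.random(10)
weights /= weights.sum()

idx = systematic_resample(weights, rng)
print("weights:        ", np.round(weights, 3))
print("times resampled:", np.bincount(idx, minlength=len(weights)))
```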

Journal ArticleDOI
TL;DR: A new, simple, consistent and powerful test for independence is constructed by using symbolic dynamics and permutation entropy as a measure of serial dependence, together with a standard asymptotic distribution of an affine transformation of the permutation entropy under the null hypothesis of independence.
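Since no abstract is shown for this entry, the sketch below only illustrates the permutation-entropy statistic referred to in the title (ordinal-pattern frequencies over sliding windows), not the paper's test or its asymptotic affine transformation; the example series are simulated.

```python
import numpy as np
from math import factorial

def permutation_entropy(series, order=3):
    """Normalized permutation entropy based on ordinal patterns of sliding windows."""
    series = np.asarray(series)
    n = len(series) - order + 1
    patterns = np.array([np.argsort(series[i:i + order]) for i in range(n)])
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log(probs)) / np.log(factorial(order))

rng = np.random.default_rng(8)
iid_noise = rng.normal(size=2000)
noisy_sine = np.sin(0.2 * np.arange(2000)) + 0.05 * rng.normal(size=2000)

print(f"PE of i.i.d. noise: {permutation_entropy(iid_noise):.3f}  (close to 1 for independent data)")
print(f"PE of a noisy sine: {permutation_entropy(noisy_sine):.3f}  (lower for serially dependent data)")
```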

Journal ArticleDOI
TL;DR: In this article, a cross-sectional resampling scheme is proposed that constructs bootstrap samples by resampling whole cross-sectional units with replacement, which provides asymptotic refinements.
Abstract: This paper considers the issue of bootstrap resampling in panel datasets. The availability of datasets with large temporal and cross-sectional dimensions suggests the possibility of new resampling schemes. We suggest one possibility which has not been widely explored in the literature. It amounts to constructing bootstrap samples by resampling whole cross-sectional units with replacement. In cases where the data do not exhibit cross-sectional dependence but exhibit temporal dependence, such a resampling scheme is of great interest as it allows the application of i.i.d. bootstrap resampling rather than block bootstrap resampling. It is well known that the former enables superior approximation to distributions of statistics compared to the latter. We prove that the bootstrap based on cross-sectional resampling provides asymptotic refinements. A Monte Carlo study illustrates the superior properties of the new resampling scheme.
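A small sketch of the resampling scheme itself: draw whole cross-sectional units with replacement and keep each selected unit's entire time series, so that temporal dependence within a unit is preserved without block bootstrapping. The AR(1) panel and the pooled autocorrelation statistic are illustrative choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(9)

# Illustrative panel: N units, T periods, AR(1) dynamics within each unit.
N, T = 50, 40
panel = np.zeros((N, T))
for t in range(1, T):
    panel[:, t] = 0.6 * panel[:, t - 1] + rng.normal(size=N)

def statistic(panel):
    """Example statistic: the pooled first-order autocorrelation."""
    x, y = panel[:, :-1].ravel(), panel[:, 1:].ravel()
    return np.corrcoef(x, y)[0, 1]

observed = statistic(panel)

# Cross-sectional resampling: draw unit indices with replacement and keep each
# selected unit's entire time series intact.
B = 1000
boot = np.empty(B)
for b in range(B):
    units = rng.integers(0, N, N)
    boot[b] = statistic(panel[units])

se = boot.std(ddof=1)
print(f"pooled AR(1) coefficient = {observed:.3f}, bootstrap s.e. = {se:.3f}")
```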


Journal ArticleDOI
TL;DR: There can be a large difference in the RMSE obtained using different resampling methods, especially when the feature space dimensionality is relatively large and the sample size is small, as shown in the results of a Monte Carlo simulation study of classifier performance prediction under the constraint of a limited dataset.
Abstract: In a practical classifier design problem, the true population is generally unknown and the available sample is finite-sized. A common approach is to use a resampling technique to estimate the performance of the classifier that will be trained with the available sample. We conducted a Monte Carlo simulation study to compare the ability of the different resampling techniques in training the classifier and predicting its performance under the constraint of a finite-sized sample. The true population for the two classes was assumed to be multivariate normal distributions with known covariance matrices. Finite sets of sample vectors were drawn from the population. The true performance of the classifier is defined as the area under the receiver operating characteristic curve (AUC) when the classifier designed with the specific sample is applied to the true population. We investigated methods based on the Fukunaga-Hayes and the leave-one-out techniques, as well as three different types of bootstrap methods, namely, the ordinary, 0.632, and 0.632+ bootstrap. The Fisher's linear discriminant analysis was used as the classifier. The dimensionality of the feature space was varied from 3 to 15. The sample size n2 from the positive class was varied between 25 and 60, while the number of cases from the negative class was either equal to n2 or 3n2. Each experiment was performed with an independent dataset randomly drawn from the true population. Using a total of 1000 experiments for each simulation condition, we compared the bias, the variance, and the root-mean-squared error (RMSE) of the AUC estimated using the different resampling techniques relative to the true AUC (obtained from training on a finite dataset and testing on the population). Our results indicated that, under the study conditions, there can be a large difference in the RMSE obtained using different resampling methods, especially when the feature space dimensionality is relatively large and the sample size is small. Under this type of conditions, the 0.632 and 0.632+ bootstrap methods have the lowest RMSE, indicating that the difference between the estimated and the true performances obtained using the 0.632 and 0.632+ bootstrap will be statistically smaller than those obtained using the other three resampling methods. Of the three bootstrap methods, the 0.632+ bootstrap provides the lowest bias. Although this investigation is performed under some specific conditions, it reveals important trends for the problem of classifier performance prediction under the constraint of a limited dataset.
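A condensed, hedged sketch of one of the estimators compared in the study: the 0.632 bootstrap estimate of the AUC of a linear discriminant classifier, contrasted with the optimistic apparent (resubstitution) AUC. The Gaussian classes, sample sizes, and number of bootstrap replicates are arbitrary simulation choices, not the study's conditions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(10)

# Two multivariate normal classes with a modest mean shift (illustrative).
p, n_per_class = 5, 40
X = np.vstack([rng.normal(0, 1, (n_per_class, p)),
               rng.normal(0.8, 1, (n_per_class, p))])
y = np.repeat([0, 1], n_per_class)
n = len(y)

lda = LinearDiscriminantAnalysis().fit(X, y)
apparent = roc_auc_score(y, lda.decision_function(X))      # optimistic

# 0.632 bootstrap: combine the apparent AUC with the average AUC measured on
# the cases left out of each bootstrap sample.
B = 200
oob_aucs = []
for b in range(B):
    idx = rng.integers(0, n, n)
    oob = np.setdiff1d(np.arange(n), idx)
    if len(np.unique(y[idx])) < 2 or len(np.unique(y[oob])) < 2:
        continue                                           # skip degenerate draws
    m = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
    oob_aucs.append(roc_auc_score(y[oob], m.decision_function(X[oob])))

auc_632 = 0.368 * apparent + 0.632 * np.mean(oob_aucs)
print(f"apparent AUC = {apparent:.3f}, 0.632 bootstrap AUC = {auc_632:.3f}")
```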

Journal ArticleDOI
TL;DR: A new adaptive simulation approach that does away with likelihood ratios, while retaining the multi-level approach of the cross-entropy method and allows one to sample exactly from the target distribution rather than asymptotically as in Markov chain Monte Carlo.
Abstract: Although importance sampling is an established and effective sampling and estimation technique, it becomes unstable and unreliable for high-dimensional problems. The main reason is that the likelihood ratio in the importance sampling estimator degenerates when the dimension of the problem becomes large. Various remedies to this problem have been suggested, including heuristics such as resampling. Even so, the consensus is that for large-dimensional problems, likelihood ratios (and hence importance sampling) should be avoided. In this paper we introduce a new adaptive simulation approach that does away with likelihood ratios, while retaining the multi-level approach of the cross-entropy method. Like the latter, the method can be used for rare-event probability estimation, optimization, and counting. Moreover, the method allows one to sample exactly from the target distribution rather than asymptotically as in Markov chain Monte Carlo. Numerical examples demonstrate the effectiveness of the method for a variety of applications.

Journal ArticleDOI
TL;DR: An open-source software tool is developed that addresses the need to offer public software tools incorporating permutation tests for multiple hypotheses assessment and for controlling the rate of Type I errors, and can be applied to support a significant spectrum of hypothesis testing tasks in functional genomics.
Abstract: Genomics and proteomics analyses regularly involve the simultaneous test of hundreds of hypotheses, either on numerical or categorical data. To correct for the occurrence of false positives, validation tests based on multiple testing correction, such as Bonferroni and Benjamini and Hochberg, and re-sampling, such as permutation tests, are frequently used. Despite the known power of permutation-based tests, most available tools offer such tests for either t-test or ANOVA only. Less attention has been given to tests for categorical data, such as the Chi-square. This project takes a first step by developing an open-source software tool, Ptest, that addresses the need to offer public software tools incorporating these and other statistical tests with options for correcting for multiple hypotheses. This study developed a public-domain, user-friendly software whose purpose was twofold: first, to estimate test statistics for categorical and numerical data; and second, to validate the significance of the test statistics via Bonferroni, Benjamini and Hochberg, and a permutation test of numerical and categorical data. The tool allows the calculation of Chi-square test for categorical data, and ANOVA test, Bartlett's test and t-test for paired and unpaired data. Once a test statistic is calculated, Bonferroni, Benjamini and Hochberg, and a permutation tests are implemented, independently, to control for Type I errors. An evaluation of the software using different public data sets is reported, which illustrates the power of permutation tests for multiple hypotheses assessment and for controlling the rate of Type I errors. The analytical options offered by the software can be applied to support a significant spectrum of hypothesis testing tasks in functional genomics, using both numerical and categorical data.
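Ptest is a standalone tool; the snippet below only illustrates, generically, two of the ingredients it combines for categorical data: a permutation p-value for a chi-square statistic and a simple Bonferroni adjustment across several features. The simulated features and the number of permutations are assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(11)

def chi2_stat(groups, categories):
    """Chi-square statistic of the group-by-category contingency table."""
    gi = np.unique(groups, return_inverse=True)[1]
    ci = np.unique(categories, return_inverse=True)[1]
    table = np.zeros((gi.max() + 1, ci.max() + 1))
    np.add.at(table, (gi, ci), 1)
    return chi2_contingency(table)[0]

def permutation_pvalue(groups, categories, n_perm=999):
    """Permutation p-value: shuffle the group labels and recompute the statistic."""
    observed = chi2_stat(groups, categories)
    perm = [chi2_stat(rng.permutation(groups), categories) for _ in range(n_perm)]
    return (1 + np.sum(np.array(perm) >= observed)) / (n_perm + 1)

# Five categorical features tested against a two-group label; only feature 0
# is genuinely associated with the group (simulated data).
n, n_features = 120, 5
groups = rng.integers(0, 2, n)
pvals = []
for j in range(n_features):
    if j == 0:
        probs = np.where(groups[:, None] == 0, [0.6, 0.3, 0.1], [0.2, 0.3, 0.5])
        categories = np.array([rng.choice(3, p=pr) for pr in probs])
    else:
        categories = rng.integers(0, 3, n)
    pvals.append(permutation_pvalue(groups, categories))

pvals = np.array(pvals)
bonferroni = np.minimum(1.0, pvals * n_features)   # simple multiple-testing adjustment
for j in range(n_features):
    print(f"feature {j}: permutation p = {pvals[j]:.3f}, Bonferroni-adjusted p = {bonferroni[j]:.3f}")
```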

Journal ArticleDOI
TL;DR: A novel signed-rank test for clustered paired data is obtained using the general principle of within-cluster resampling; compared with four existing signed-rank tests, only this test is shown to maintain the correct size under a null hypothesis of marginal symmetry.
Abstract: We consider the problem of comparing two outcome measures when the pairs are clustered. Using the general principle of within-cluster resampling, we obtain a novel signed-rank test for clustered paired data. We show by a simple informative cluster size simulation model that only our test maintains the correct size under a null hypothesis of marginal symmetry compared to four other existing signed rank tests; further, our test has adequate power when cluster size is noninformative. In general, cluster size is informative if the distribution of pair-wise differences within a cluster depends on the cluster size. An application of our method to testing radiation toxicity trend is presented.
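A schematic version of the within-cluster resampling mechanics for clustered paired data: repeatedly draw one pair per cluster, apply an ordinary Wilcoxon signed-rank test to the resulting independent differences, and summarize across resamples. Averaging the resampled statistics here is purely illustrative; it does not reproduce the paper's variance calculation or its exact test.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(12)

# Clustered paired data: each cluster contributes several within-pair differences.
n_clusters = 25
cluster_sizes = rng.integers(1, 6, n_clusters)
diffs = [0.3 + rng.normal(size=s) for s in cluster_sizes]   # shifted differences

# Within-cluster resampling: draw one pair per cluster, compute an ordinary
# signed-rank statistic on the resampled (now independent) differences, repeat.
B = 1000
stats, pvals = np.empty(B), np.empty(B)
for b in range(B):
    sample = np.array([rng.choice(d) for d in diffs])
    res = wilcoxon(sample)
    stats[b], pvals[b] = res.statistic, res.pvalue

print(f"mean resampled signed-rank statistic: {stats.mean():.1f}")
print(f"mean resampled p-value (illustrative only): {pvals.mean():.4f}")
```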

Journal ArticleDOI
TL;DR: The wild bootstrap is proved to be an asymptotically valid method of resampling homogeneous panel unit root test statistics, and an empirical illustration supports the conclusion that the current account to GDP ratio is likely panel stationary.
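No abstract is shown for this entry, so the snippet below only demonstrates the generic wild bootstrap mechanic (multiplying residuals by random Rademacher signs so heteroskedasticity is preserved) on a single AR(1) series, rather than the paper's panel unit root statistics.

```python
import numpy as np

rng = np.random.default_rng(13)

# Simple heteroskedastic AR(1) series (a stand-in for one panel unit).
T = 200
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.7 * y[t - 1] + rng.normal(scale=1 + 0.5 * (t > T // 2))

x, z = y[:-1], y[1:]
rho_hat = np.dot(x, z) / np.dot(x, x)             # OLS AR(1) coefficient
resid = z - rho_hat * x

# Wild bootstrap: keep each residual in place but flip its sign at random
# (Rademacher weights), rebuild the series recursively, and re-estimate.
B = 1000
boot = np.empty(B)
for b in range(B):
    eps = resid * rng.choice([-1.0, 1.0], size=resid.size)
    yb = np.zeros(T)
    for t in range(1, T):
        yb[t] = rho_hat * yb[t - 1] + eps[t - 1]
    xb, zb = yb[:-1], yb[1:]
    boot[b] = np.dot(xb, zb) / np.dot(xb, xb)

print(f"rho_hat = {rho_hat:.3f}, wild-bootstrap s.e. = {boot.std(ddof=1):.3f}")
```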

Journal ArticleDOI
TL;DR: The study shows that those methods which produce parsimonious profiles generally result in better prediction accuracy than methods which don't include variable selection, and for very small profile sizes, the sparse penalised likelihood methods tend to result in more stable profiles than univariate filtering while maintaining similar predictive performance.
Abstract: One application of gene expression arrays is to derive molecular profiles, i.e., sets of genes, which discriminate well between two classes of samples, for example between tumour types. Users are confronted with a multitude of classification methods of varying complexity that can be applied to this task. To help decide which method to use in a given situation, we compare important characteristics of a range of classification methods, including simple univariate filtering, penalised likelihood methods and the random forest. Classification accuracy is an important characteristic, but the biological interpretability of molecular profiles is also important. This implies both parsimony and stability, in the sense that profiles should not vary much when there are slight changes in the training data. We perform a random resampling study to compare these characteristics between the methods and across a range of profile sizes. We measure stability by adopting the Jaccard index to assess the similarity of resampled molecular profiles. We carry out a case study on five well-established cancer microarray data sets, for two of which we have the benefit of being able to validate the results in an independent data set. The study shows that those methods which produce parsimonious profiles generally result in better prediction accuracy than methods which don't include variable selection. For very small profile sizes, the sparse penalised likelihood methods tend to result in more stable profiles than univariate filtering while maintaining similar predictive performance.
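A minimal sketch of the stability measurement described above: select a fixed-size gene profile by a simple t-statistic filter on two bootstrap resamples of the data and compare the two selections with the Jaccard index. The simulated expression matrix and the univariate filter are illustrative stand-ins for the study's classifiers and microarray data sets.

```python
import numpy as np

rng = np.random.default_rng(14)

# Simulated expression matrix: n samples x p genes, two classes, 20 informative genes.
n, p, profile_size = 60, 500, 20
labels = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[labels == 1, :20] += 1.0

def select_profile(X, y, size):
    """Indices of the `size` genes with the largest absolute two-sample t-statistic."""
    a, b = X[y == 0], X[y == 1]
    t = (a.mean(0) - b.mean(0)) / np.sqrt(a.var(0, ddof=1) / len(a) + b.var(0, ddof=1) / len(b))
    return set(np.argsort(-np.abs(t))[:size])

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

# Stability: compare profiles selected on pairs of bootstrap resamples.
scores = []
for _ in range(50):
    i1, i2 = rng.integers(0, n, n), rng.integers(0, n, n)
    p1 = select_profile(X[i1], labels[i1], profile_size)
    p2 = select_profile(X[i2], labels[i2], profile_size)
    scores.append(jaccard(p1, p2))

print(f"mean Jaccard stability of the selected profiles: {np.mean(scores):.3f}")
```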

Journal ArticleDOI
TL;DR: A simple approximation for the p-value of the MAX test with or without adjusting for the covariates is provided, which makes the MAX test readily applicable to GWAS.
Abstract: Summary Genome-wide association study (GWAS), typically involving 100,000 to 500,000 single-nucleotide polymorphisms (SNPs), is a powerful approach to identify disease susceptibility loci. In a GWAS, single-marker analysis, which tests one SNP at a time, is usually used as the first stage to screen SNPs across the genome in order to identify a small fraction of promising SNPs with relatively low p-values for further and more focused studies. For single-marker analysis, the trend test derived for an additive genetic model is often used. This may not be robust when the additive assumption is not appropriate for the true underlying disease model. A robust test, MAX, based on the maximum of three trend test statistics derived for recessive, additive, and dominant models, has been proposed recently for GWAS. But its p-value has to be evaluated through a resampling-based procedure, which is computationally challenging for the analysis of GWAS. Obtaining the p-value for MAX with adjustment for the covariates can be even more time-consuming. In this article, we provide a simple approximation for the p-value of the MAX test with or without adjusting for the covariates. The new method avoids resampling steps and thus makes the MAX test readily applicable to GWAS. We use simulation studies as well as real datasets on 17 confirmed disease-associated SNPs to assess the accuracy of the proposed method. We also apply the method to the GWAS of coronary artery disease.
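A small sketch of the MAX statistic itself, taken as the maximum of three Cochran-Armitage trend tests under recessive, additive, and dominant scores, together with the brute-force permutation p-value that the paper's approximation is designed to avoid. The case-control genotype data and allele frequencies are simulated assumptions.

```python
import numpy as np

rng = np.random.default_rng(15)

# Simulated case-control genotypes at one SNP (0/1/2 copies of the risk allele).
n = 600
y = np.repeat([0, 1], n // 2)                        # controls, cases
maf_controls, maf_cases = 0.30, 0.38                 # allele frequencies (illustrative)
g = np.where(y == 0,
             rng.binomial(2, maf_controls, n),
             rng.binomial(2, maf_cases, n))

def trend_z(g, y, coding):
    """Signed Cochran-Armitage trend statistic, computed as sqrt(N) * corr(score, y)."""
    x = coding(g)
    return np.sqrt(len(y)) * np.corrcoef(x, y)[0, 1]

codings = {"recessive": lambda g: (g == 2).astype(float),
           "additive":  lambda g: g.astype(float),
           "dominant":  lambda g: (g >= 1).astype(float)}

def max_stat(g, y):
    return max(abs(trend_z(g, y, c)) for c in codings.values())

observed = max_stat(g, y)

# Brute-force permutation p-value (the costly step the paper's approximation avoids).
B = 2000
perm = np.array([max_stat(g, rng.permutation(y)) for _ in range(B)])
p_value = (1 + np.sum(perm >= observed)) / (B + 1)
print(f"MAX statistic = {observed:.2f}, permutation p-value = {p_value:.4f}")
```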

Journal ArticleDOI
TL;DR: In this article, the authors study an autoregressive time series model with a possible change in the regression parameters and obtain approximate estimates to the critical values for change-point tests through various bootstrapping methods.

Journal ArticleDOI
TL;DR: A simple and general resampling strategy is proposed to estimate variances for parameter estimators derived from nonsmooth estimating functions; it applies to a wide variety of semiparametric and nonparametric problems in biostatistics, does not require solving estimating equations, and is thus much faster than existing resampling procedures.
Abstract: We propose a simple and general resampling strategy to estimate variances for parameter estimators derived from nonsmooth estimating functions. This approach applies to a wide variety of semiparametric and nonparametric problems in biostatistics. It does not require solving estimating equations and is thus much faster than the existing resampling procedures. Its usefulness is illustrated with heteroscedastic quantile regression and censored data rank regression. Numerical results based on simulated and real data are provided.

Journal ArticleDOI
Ji Meng Loh
TL;DR: In this paper, the authors examined the validity of nonparametric spatial bootstrap as a procedure to quantify errors in estimates of N-point correlation functions and found that with clustered point data sets, confidence intervals obtained using the marked point bootstrap have empirical coverage closer to the nominal level than the confidence intervals obtained using Poisson errors.
Abstract: In this paper we examine the validity of nonparametric spatial bootstrap as a procedure to quantify errors in estimates of N-point correlation functions. We do this by means of a small simulation study with simple point process models and estimating the two-point correlation functions and their errors. The coverage of confidence intervals obtained using bootstrap is compared with those obtained from assuming Poisson errors. The bootstrap procedure considered here is adapted for use with spatial (i.e., dependent) data. In particular, we describe a marked point bootstrap where, instead of resampling points or blocks of points, we resample marks assigned to the data points. These marks are numerical values that are based on the statistic of interest. We describe how the marks are defined for the two- and three-point correlation functions. By resampling marks, the bootstrap samples retain more of the dependence structure present in the data. Furthermore, this method of bootstrap can be performed much more quickly than some other bootstrap methods for spatial data, making it a more practical method with large data sets. We find that with clustered point data sets, confidence intervals obtained using the marked point bootstrap have empirical coverage closer to the nominal level than the confidence intervals obtained using Poisson errors. The bootstrap errors were also found to be closer to the true errors for the clustered point data sets.

Journal ArticleDOI
TL;DR: In this paper, a modified L2-distance between the nonparametric estimator of regression function and its counterpart under null hypothesis is used as a test statistic which delimits the contribution from areas where data are sparse.
Abstract: This paper concerns statistical tests for simple structures such as parametric models, lower order models and additivity in a general nonparametric autoregression setting. We propose to use a modified L2-distance between the nonparametric estimator of the regression function and its counterpart under the null hypothesis as our test statistic, which delimits the contribution from areas where data are sparse. The asymptotic properties of the test statistic are established, which indicates that the test statistic is asymptotically equivalent to a quadratic form of innovations. A regression type resampling scheme (i.e. wild bootstrap) is adapted to estimate the distribution of this quadratic form. Further, we have shown that asymptotically this bootstrap distribution is indeed the distribution of the test statistic under the null hypothesis. The proposed methodology has been illustrated by both simulation and application to German stock index data.