
Showing papers on "Sample size determination published in 2016"


Journal ArticleDOI
TL;DR: It is suggested that the size of a sample with sufficient information power depends on (a) the aim of the study, (b) sample specificity, (c) use of established theory, (d) quality of dialogue, and (e) analysis strategy.
Abstract: Sample sizes must be ascertained in qualitative studies, as in quantitative studies, but not by the same means. The prevailing concept for sample size in qualitative studies is "saturation." Saturation is closely tied to a specific methodology, and the term is inconsistently applied. We propose the concept "information power" to guide adequate sample size for qualitative studies. Information power indicates that the more information the sample holds that is relevant for the actual study, the fewer participants are needed. We suggest that the size of a sample with sufficient information power depends on (a) the aim of the study, (b) sample specificity, (c) use of established theory, (d) quality of dialogue, and (e) analysis strategy. We present a model where these elements of information and their relevant dimensions are related to information power. Application of this model in the planning and during data collection of a qualitative study is discussed.

3,885 citations


Journal ArticleDOI
TL;DR: Diagonally weighted least squares was less biased and more accurate than MLR in estimating the factor loadings across nearly every condition and the proposed model tended to be over-rejected by chi-square test statistics under both MLR and WLSMV in the condition of small sample size N = 200.
Abstract: In confirmatory factor analysis (CFA), the use of maximum likelihood (ML) assumes that the observed indicators follow a continuous and multivariate normal distribution, which is not appropriate for ordinal observed variables. Robust ML (MLR) has been introduced into CFA models when this normality assumption is slightly or moderately violated. Diagonally weighted least squares (WLSMV), on the other hand, is specifically designed for ordinal data. Although WLSMV makes no distributional assumptions about the observed variables, a normal latent distribution underlying each observed categorical variable is instead assumed. A Monte Carlo simulation was carried out to compare the effects of different configurations of latent response distributions, numbers of categories, and sample sizes on model parameter estimates, standard errors, and chi-square test statistics in a correlated two-factor model. The results showed that WLSMV was less biased and more accurate than MLR in estimating the factor loadings across nearly every condition. However, WLSMV yielded moderate overestimation of the interfactor correlations when the sample size was small and/or when the latent distributions were moderately nonnormal. With respect to standard error estimates of the factor loadings and the interfactor correlations, MLR outperformed WLSMV when the latent distributions were nonnormal with a small sample size of N = 200. Finally, the proposed model tended to be over-rejected by chi-square test statistics under both MLR and WLSMV in the small sample size condition of N = 200.

1,319 citations


Journal ArticleDOI
TL;DR: This paper provides a data-driven approach to partition the data into subpopulations that differ in the magnitude of their treatment effects, and proposes an “honest” approach to estimation, whereby one sample is used to construct the partition and another to estimate treatment effects for each subpopulation.
Abstract: In this paper we propose methods for estimating heterogeneity in causal effects in experimental and observational studies and for conducting hypothesis tests about the magnitude of differences in treatment effects across subsets of the population. We provide a data-driven approach to partition the data into subpopulations that differ in the magnitude of their treatment effects. The approach enables the construction of valid confidence intervals for treatment effects, even with many covariates relative to the sample size, and without “sparsity” assumptions. We propose an “honest” approach to estimation, whereby one sample is used to construct the partition and another to estimate treatment effects for each subpopulation. Our approach builds on regression tree methods, modified to optimize for goodness of fit in treatment effects and to account for honest estimation. Our model selection criterion anticipates that bias will be eliminated by honest estimation and also accounts for the effect of making additional splits on the variance of treatment effect estimates within each subpopulation. We address the challenge that the “ground truth” for a causal effect is not observed for any individual unit, so that standard approaches to cross-validation must be modified. Through a simulation study, we show that for our preferred method honest estimation results in nominal coverage for 90% confidence intervals, whereas coverage ranges between 74% and 84% for nonhonest approaches. Honest estimation requires estimating the model with a smaller sample size; the cost in terms of mean squared error of treatment effects for our preferred method ranges between 7–22%.

913 citations


Journal ArticleDOI
TL;DR: This paper looks at how to choose an external pilot trial sample size in order to minimise the sample size of the overall clinical trial programme, that is, the pilot and the main trial together, and produces a method of calculating the optimal solution.
Abstract: Sample size justification is an important consideration when planning a clinical trial, not only for the main trial but also for any preliminary pilot trial. When the outcome is a continuous variable, the sample size calculation requires an accurate estimate of the standard deviation of the outcome measure. A pilot trial can be used to get an estimate of the standard deviation, which could then be used to anticipate what may be observed in the main trial. However, an important consideration is that pilot trials often estimate the standard deviation parameter imprecisely. This paper looks at how we can choose an external pilot trial sample size in order to minimise the sample size of the overall clinical trial programme, that is, the pilot and the main trial together. We produce a method of calculating the optimal solution to the required pilot trial sample size when the standardised effect size for the main trial is known. However, as it may not be possible to know the standardised effect size to be used prior to the pilot trial, approximate rules are also presented. For a main trial designed with 90% power and two-sided 5% significance, we recommend pilot trial sample sizes per treatment arm of 75, 25, 15 and 10 for standardised effect sizes that are extra small (≤0.1), small (0.2), medium (0.5) or large (0.8), respectively.

783 citations
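
The stepped recommendations in this abstract translate directly into a small lookup helper. The sketch below is illustrative only: the per-arm sizes and effect-size classes are taken from the recommendations quoted above (main trial with 90% power, two-sided 5% significance), while the exact cut-points between classes and the function name are my own assumptions.

```python
def pilot_sample_size_per_arm(standardised_effect_size: float) -> int:
    """Rule-of-thumb pilot trial size per treatment arm, following the recommendations
    quoted above: extra small (<=0.1) -> 75, small (~0.2) -> 25, medium (~0.5) -> 15,
    large (>=0.8) -> 10. Intermediate boundaries are an assumption for illustration."""
    if standardised_effect_size <= 0.1:
        return 75
    elif standardised_effect_size < 0.5:
        return 25
    elif standardised_effect_size < 0.8:
        return 15
    return 10

print(pilot_sample_size_per_arm(0.2))  # 25 participants per arm
```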


Journal ArticleDOI
TL;DR: In qualitative research, the determination of sample size is contextual and partially dependent upon the scientific paradigm under which the investigation is taking place; for example, qualitative research oriented towards positivism requires larger samples than in-depth qualitative research does, so that a representative picture of the whole population under review can be gained.
Abstract: Purpose Qualitative researchers have been criticised for not justifying sample size decisions in their research. This short paper addresses the issue of which sample sizes are appropriate and valid within different approaches to qualitative research. Design/methodology/approach The sparse literature on sample sizes in qualitative research is reviewed and discussed. This examination is informed by the personal experience of the author in terms of assessing, as an editor, reviewer comments as they relate to sample size in qualitative research. Also, the discussion is informed by the author’s own experience of undertaking commercial and academic qualitative research over the last 31 years. Findings In qualitative research, the determination of sample size is contextual and partially dependent upon the scientific paradigm under which investigation is taking place. For example, qualitative research that is oriented towards positivism will require larger samples than in-depth qualitative research does, so that a representative picture of the whole population under review can be gained. Nonetheless, the paper also concludes that sample sizes involving one single case can be highly informative and meaningful, as demonstrated in examples from management and medical research. Unique examples of research using a single sample or case but involving new areas or findings that are potentially highly relevant can be worthy of publication. Theoretical saturation can also be useful as a guide in designing qualitative research, with practical research illustrating that data saturation may occur with samples of 12 among a relatively homogeneous population. Practical implications Sample sizes as low as one can be justified. Researchers and reviewers may find the discussion in this paper to be a useful guide to determining and critiquing sample size in qualitative research. Originality/value Sample size in qualitative research is always mentioned by reviewers of qualitative papers, but discussion tends to be simplistic and relatively uninformed. The current paper draws attention to how sample sizes, at both ends of the size continuum, can be justified by researchers. This will also aid reviewers in commenting on the appropriateness of sample sizes in qualitative research.

687 citations


Journal ArticleDOI
TL;DR: Four new supervised methods to detect the number of clusters were developed and tested and were found to outperform the existing methods using both evenly and unevenly sampled data sets and a subsampling strategy aiming to reduce sampling unevenness between subpopulations is presented and tested.
Abstract: Inferences of population structure and more precisely the identification of genetically homogeneous groups of individuals are essential to the fields of ecology, evolutionary biology and conservation biology. Such population structure inferences are routinely investigated via the program structure implementing a Bayesian algorithm to identify groups of individuals at Hardy-Weinberg and linkage equilibrium. While the method is performing relatively well under various population models with even sampling between subpopulations, the robustness of the method to uneven sample size between subpopulations and/or hierarchical levels of population structure has not yet been tested despite being commonly encountered in empirical data sets. In this study, I used simulated and empirical microsatellite data sets to investigate the impact of uneven sample size between subpopulations and/or hierarchical levels of population structure on the detected population structure. The results demonstrated that uneven sampling often leads to wrong inferences on hierarchical structure and downward-biased estimates of the true number of subpopulations. Distinct subpopulations with reduced sampling tended to be merged together, while at the same time, individuals from extensively sampled subpopulations were generally split, despite belonging to the same panmictic population. Four new supervised methods to detect the number of clusters were developed and tested as part of this study and were found to outperform the existing methods using both evenly and unevenly sampled data sets. Additionally, a subsampling strategy aiming to reduce sampling unevenness between subpopulations is presented and tested. These results altogether demonstrate that when sampling evenness is accounted for, the detection of the correct population structure is greatly improved.

631 citations


Journal ArticleDOI
TL;DR: It was recently confirmed that a split-sample approach with 50% held out leads to models with suboptimal performance, that is, models that are unstable and on average achieve the same performance as obtained with half the sample size; the authors therefore strongly advise against random split-sample approaches in small development samples.

606 citations


Journal ArticleDOI
TL;DR: A novel method using simulated species to identify the minimum number of records required to generate accurate SDMs for taxa of different pre-defined prevalence classes is presented, which is applicable to any taxonomic clade or group, study area or climate scenario.
Abstract: Species distribution models (SDMs) are widely used to predict the occurrence of species. Because SDMs generally use presence-only data, validation of the predicted distribution and assessing model accuracy is challenging. Model performance depends on both sample size and species’ prevalence, being the fraction of the study area occupied by the species. Here, we present a novel method using simulated species to identify the minimum number of records required to generate accurate SDMs for taxa of different pre-defined prevalence classes. We quantified model performance as a function of sample size and prevalence and found model performance to increase with increasing sample size under constant prevalence, and to decrease with increasing prevalence under constant sample size. The area under the curve (AUC) is commonly used as a measure of model performance. However, when applied to presence-only data it is prevalence-dependent and hence not an accurate performance index. Testing the AUC of an SDM for significant deviation from random performance provides a good alternative. We assessed the minimum number of records required to obtain good model performance for species of different prevalence classes in a virtual study area and in a real African study area. The lower limit depends on the species’ prevalence with absolute minimum sample sizes as low as 3 for narrow-ranged and 13 for widespread species for our virtual study area which represents an ideal, balanced, orthogonal world. The lower limit of 3, however, is flawed by statistical artefacts related to modelling species with a prevalence below 0.1. In our African study area lower limits are higher, ranging from 14 for narrow-ranged to 25 for widespread species. We advocate identifying the minimum sample size for any species distribution modelling by applying the novel method presented here, which is applicable to any taxonomic clade or group, study area or climate scenario.

472 citations


Journal ArticleDOI
TL;DR: This study suggests that externally validating a prognostic model requires a minimum of 100 events and ideally 200 (or more) events, and provides guidance on sample size for investigators designing an external validation study.
Abstract: After developing a prognostic model, it is essential to evaluate the performance of the model in samples independent from those used to develop the model, which is often referred to as external validation. However, despite its importance, very little is known about the sample size requirements for conducting an external validation. Using a large real data set and resampling methods, we investigate the impact of sample size on the performance of six published prognostic models. Focussing on unbiased and precise estimation of performance measures (e.g. the c-index, D statistic and calibration), we provide guidance on sample size for investigators designing an external validation study. Our study suggests that externally validating a prognostic model requires a minimum of 100 events and ideally 200 (or more) events.

428 citations


Journal ArticleDOI
TL;DR: The Pearson product-moment correlation coefficient (rp) and the Spearman rank correlation coefficient (rs) are widely used in psychological research; for normally distributed variables they have similar expected values, but rs is more variable than rp, especially when the correlation is strong.
Abstract: The Pearson product-moment correlation coefficient (rp) and the Spearman rank correlation coefficient (rs) are widely used in psychological research. We compare rp and rs on 3 criteria: variability, bias with respect to the population value, and robustness to an outlier. Using simulations across low (N = 5) to high (N = 1,000) sample sizes we show that, for normally distributed variables, rp and rs have similar expected values but rs is more variable, especially when the correlation is strong. However, when the variables have high kurtosis, rp is more variable than rs. Next, we conducted a sampling study of a psychometric dataset featuring symmetrically distributed data with light tails, and of 2 Likert-type survey datasets, 1 with light-tailed and the other with heavy-tailed distributions. Consistent with the simulations, rp had lower variability than rs in the psychometric dataset. In the survey datasets with heavy-tailed variables in particular, rs had lower variability than rp, and often corresponded more accurately to the population Pearson correlation coefficient (Rp) than rp did. The simulations and the sampling studies showed that variability in terms of standard deviations can be reduced by about 20% by choosing rs instead of rp. In comparison, increasing the sample size by a factor of 2 results in a 41% reduction of the standard deviations of rs and rp. In conclusion, rp is suitable for light-tailed distributions, whereas rs is preferable when variables feature heavy-tailed distributions or when outliers are present, as is often the case in psychological research.

428 citations
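
A minimal Monte Carlo sketch of the kind of comparison described above: it contrasts the sampling variability of rp and rs under a normal and a heavy-tailed bivariate distribution. The particular sample size, correlation, replication count, and the t(3) copula construction are illustrative choices, not the authors' settings.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_sd(n=50, rho=0.5, reps=2000, heavy_tailed=False):
    """Return (SD of Pearson r, SD of Spearman r) across `reps` samples of size n."""
    cov = [[1.0, rho], [rho, 1.0]]
    rp, rs = [], []
    for _ in range(reps):
        z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        if heavy_tailed:
            # heavy-tailed margins via a Gaussian copula with t(3) marginals
            z = stats.t.ppf(stats.norm.cdf(z), df=3)
        x, y = z[:, 0], z[:, 1]
        rp.append(stats.pearsonr(x, y)[0])
        rs.append(stats.spearmanr(x, y)[0])
    return np.std(rp), np.std(rs)

print("normal:       SD(rp), SD(rs) =", simulate_sd(heavy_tailed=False))
print("heavy-tailed: SD(rp), SD(rs) =", simulate_sd(heavy_tailed=True))
```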


Journal ArticleDOI
TL;DR: These tables were derived from the formulation of sensitivity and specificity tests using Power Analysis and Sample Size (PASS) software, based on the desired type I error, power and effect size.
Abstract: Sensitivity and specificity analysis is commonly used for screening and diagnostic tests. The main issue researchers face is determining the sufficient sample sizes for screening and diagnostic studies. Although the formula for sample size calculation is available, the majority of researchers are not mathematicians or statisticians, and hence sample size calculation might not be easy for them. This review paper provides sample size tables with regard to sensitivity and specificity analysis. These tables were derived from the formulation of sensitivity and specificity tests using Power Analysis and Sample Size (PASS) software, based on the desired type I error, power and effect size. Approaches on how to use the tables are also discussed.
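
For orientation, a commonly used normal-approximation formula estimates the number of cases needed to estimate sensitivity to within a margin d, then inflates by the disease prevalence to obtain the total screened sample. This sketch is not necessarily the exact PASS formulation behind the paper's tables, and the inputs below are hypothetical.

```python
from math import ceil
from scipy.stats import norm

def n_for_sensitivity(expected_sn, precision, prevalence, alpha=0.05):
    """Approximate total sample size to estimate sensitivity within +/- `precision`.
    Normal-approximation sketch: z^2 * Sn(1-Sn) / d^2 diseased cases, scaled up by
    prevalence so enough diseased subjects end up in the screened sample."""
    z = norm.ppf(1 - alpha / 2)
    n_cases = z**2 * expected_sn * (1 - expected_sn) / precision**2
    return ceil(n_cases / prevalence)

print(n_for_sensitivity(expected_sn=0.9, precision=0.05, prevalence=0.2))  # about 692
```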

Journal ArticleDOI
TL;DR: This review discusses selected issues in adaptive design, which obtains the most information possible in an unbiased way while putting the fewest patients at risk.
Abstract: Investigators use adaptive trial designs to alter basic features of an ongoing trial. This approach obtains the most information possible in an unbiased way while putting the fewest patients at risk. In this review, the authors discuss selected issues in adaptive design.

Proceedings ArticleDOI
Kelly Caine
07 May 2016
TL;DR: An analysis of all manuscripts published at CHI2014 and an analysis of local standards for sample size within the CHI community find that sample size ranges from 1 to 916,000 and the most common sample size is 12.
Abstract: We describe the primary ways researchers can determine the size of a sample of research participants, present the benefits and drawbacks of each of those methods, and focus on improving one method that could be useful to the CHI community: local standards. To determine local standards for sample size within the CHI community, we conducted an analysis of all manuscripts published at CHI2014. We find that sample size for manuscripts published at CHI ranges from 1 to 916,000 and the most common sample size is 12. We also find that sample size differs based on factors such as study setting and type of methodology employed. The outcome of this paper is an overview of the various ways sample size may be determined and an analysis of local standards for sample size within the CHI community. These contributions may be useful to researchers planning studies and reviewers evaluating the validity of results.

Journal ArticleDOI
TL;DR: It is shown that Firth’s correction can be used to improve the accuracy of regression coefficients and alleviate the problems associated with separation, and there is an urgent need for new research to provide guidance for supporting sample size considerations for binary logistic regression analysis.
Abstract: Ten events per variable (EPV) is a widely advocated minimal criterion for sample size considerations in logistic regression analysis. Of three previous simulation studies that examined this minimal EPV criterion, only one supports the use of a minimum of 10 EPV. In this paper, we examine the reasons for substantial differences between these extensive simulation studies. The current study uses Monte Carlo simulations to evaluate small sample bias, coverage of confidence intervals and mean square error of logit coefficients. Logistic regression models fitted by maximum likelihood and a modified estimation procedure, known as Firth’s correction, are compared. The results show that besides EPV, the problems associated with low EPV depend on other factors such as the total sample size. It is also demonstrated that simulation results can be dominated by even a few simulated data sets for which the prediction of the outcome by the covariates is perfect (‘separation’). We reveal that different approaches for identifying and handling separation lead to substantially different simulation results. We further show that Firth’s correction can be used to improve the accuracy of regression coefficients and alleviate the problems associated with separation. The current evidence supporting EPV rules for binary logistic regression is weak. Given our findings, there is an urgent need for new research to provide guidance for supporting sample size considerations for binary logistic regression analysis.
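
As a concrete illustration of the penalised estimation discussed above, here is a minimal NumPy sketch of Firth-type logistic regression (Jeffreys-prior penalty), written from the standard modified-score description rather than taken from the paper's code; in practice one would use an established implementation such as R's logistf. The simulated data and all settings are hypothetical.

```python
import numpy as np

def firth_logistic(X, y, n_iter=100, tol=1e-8):
    """Logistic regression with Firth's bias-reducing (Jeffreys-prior) penalty.
    X: (n, p) design matrix including an intercept column; y: (n,) array of 0/1.
    Newton-Raphson on the modified score U*_j = sum_i (y_i - pi_i + h_i*(0.5 - pi_i)) x_ij."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = pi * (1.0 - pi)                                 # IRLS weights
        info = (X.T * w) @ X                                # Fisher information X'WX
        info_inv = np.linalg.inv(info)
        h = w * np.einsum('ij,jk,ik->i', X, info_inv, X)    # hat-matrix diagonal
        score = X.T @ (y - pi + h * (0.5 - pi))             # Firth-modified score
        step = info_inv @ score
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# toy usage on simulated data (all settings hypothetical)
rng = np.random.default_rng(0)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * X[:, 1]))))
print(firth_logistic(X, y))   # penalised estimates of (intercept, slope)
```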

Journal ArticleDOI
TL;DR: The results indicated that an EPV rule of thumb should be data driven and that EPV ≥ 20 generally eliminates bias in regression coefficients when many low-prevalence predictors are included in a Cox model.

Posted Content
TL;DR: Compared to the basic RNN, GRAM achieved 10% higher accuracy for predicting diseases rarely observed in the training data and 3% improved area under the ROC curve for predicting heart failure using an order of magnitude less training data.
Abstract: Deep learning methods exhibit promising performance for predictive modeling in healthcare, but two important challenges remain: -Data insufficiency:Often in healthcare predictive modeling, the sample size is insufficient for deep learning methods to achieve satisfactory results. -Interpretation:The representations learned by deep learning methods should align with medical knowledge. To address these challenges, we propose a GRaph-based Attention Model, GRAM that supplements electronic health records (EHR) with hierarchical information inherent to medical ontologies. Based on the data volume and the ontology structure, GRAM represents a medical concept as a combination of its ancestors in the ontology via an attention mechanism. We compared predictive performance (i.e. accuracy, data needs, interpretability) of GRAM to various methods including the recurrent neural network (RNN) in two sequential diagnoses prediction tasks and one heart failure prediction task. Compared to the basic RNN, GRAM achieved 10% higher accuracy for predicting diseases rarely observed in the training data and 3% improved area under the ROC curve for predicting heart failure using an order of magnitude less training data. Additionally, unlike other methods, the medical concept representations learned by GRAM are well aligned with the medical ontology. Finally, GRAM exhibits intuitive attention behaviors by adaptively generalizing to higher level concepts when facing data insufficiency at the lower level concepts.

Journal ArticleDOI
TL;DR: Acute cannabis intoxication is associated with a statistically significant increase in motor vehicle crash risk, and the increase is of low to medium magnitude.
Abstract: AIMS: To determine whether and to what extent acute cannabis intoxication increases motor vehicle crash risk. DESIGN: Study 1 replicates two published meta-analyses, correcting for methodological shortcomings. Study 2 is an updated meta-analysis using 28 estimates from 21 observational studies. These included studies from three earlier reviews, supplemented by results from a structured search in Web of Science and Google Scholar and by the personal libraries of the research team. Risk estimates were combined using random effects models and meta-regression techniques. SETTING: Study 1 replicates the analysis of Asbridge et al., based on 9 studies from 5 countries, published 1982-2007; and Li et al., based on 9 studies from 6 countries, published 2001-10. Study 2 involves studies from 13 countries published in the period 1982-2015. PARTICIPANTS: In Study 1, total counts extracted totalled 50 877 (27 967 cases, 22 910 controls) for Asbridge et al. and 93 229 (4 236 cases and 88 993 controls) for Li et al. Study 2 used confounder-adjusted estimates where available (combined sample size of 222 511) and crude counts from the remainder (17 228 total counts), giving a combined sample count of 239 739. MEASUREMENTS: Odds-ratios were used from case-control studies and adjusted odds-ratio analogues from culpability studies. The impact of the substantial variation in confounder adjustment was explored in subsample analyses. FINDINGS: Study 1 substantially revises previous risk estimates downwards, with both the originally reported point estimates lying outside the revised confidence interval. Revised estimates were similar to those of Study 2, which found cannabis-impaired driving associated with a statistically significant risk increase of low-to-moderate magnitude (random effects model odds ratio 1.36 (1.15-1.61), meta-regression odds ratio 1.22 (1.1-1.36)). Subsample analyses found higher odds-ratio estimates for case control studies, low study quality, limited control of confounders, medium quality use data, and not controlling for alcohol intoxication. CONCLUSIONS: Acute cannabis intoxication is associated with a statistically significant increase in motor vehicle crash risk. The increase is of low to medium magnitude. Remaining selection effects in the studies used may limit causal interpretation of the pooled estimates.
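
A compact sketch of the kind of random-effects pooling described above, using the DerSimonian-Laird estimator on log odds ratios. This is a generic illustration of the technique, not the review's exact estimator, and the study-level numbers below are made up.

```python
import numpy as np

def random_effects_pool(log_or, se):
    """DerSimonian-Laird random-effects pooling of log odds ratios.
    Returns the pooled OR and its approximate 95% confidence interval."""
    log_or, se = np.asarray(log_or), np.asarray(se)
    w = 1.0 / se**2                              # fixed-effect (inverse-variance) weights
    fixed = np.sum(w * log_or) / np.sum(w)
    q = np.sum(w * (log_or - fixed) ** 2)        # Cochran's Q
    df = len(log_or) - 1
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                # between-study variance
    w_star = 1.0 / (se**2 + tau2)                # random-effects weights
    pooled = np.sum(w_star * log_or) / np.sum(w_star)
    se_pooled = np.sqrt(1.0 / np.sum(w_star))
    return np.exp(pooled), np.exp(pooled - 1.96 * se_pooled), np.exp(pooled + 1.96 * se_pooled)

# hypothetical study-level odds ratios and standard errors of log(OR)
print(random_effects_pool(np.log([1.2, 1.5, 1.1, 1.4]), [0.15, 0.2, 0.1, 0.25]))
```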

Journal ArticleDOI
TL;DR: The basic elements related to the selection of participants for a health research are discussed and sample representativeness, sample frame, types of sampling, as well as the impact that non-respondents may have on results of a study are described.
Abstract: Background: In this paper, the basic elements related to the selection of participants for a health research are discussed. Sample representativeness, sample frame, types of sampling, as well as the impact that non-respondents may have on results of a study are described. The whole discussion is supported by practical examples to facilitate the reader's understanding. Objective: To introduce readers to issues related to sampling.

Journal ArticleDOI
TL;DR: Using 1,000 replications of 12 conditions with varied Level 1 and Level 2 sample sizes, the author compared parameter estimates, standard errors, and statistical significance using various alternative procedures to indicate that several acceptable procedures can be used in lieu of or together with multilevel modeling.
Abstract: Multilevel modeling has grown in use over the years as a way to deal with the nonindependent nature of observations found in clustered data. However, other alternatives to multilevel modeling are available that can account for observations nested within clusters, including the use of Taylor series linearization for variance estimation, the design effect adjusted standard errors approach, and fixed effects modeling. Using 1,000 replications of 12 conditions with varied Level 1 and Level 2 sample sizes, the author compared parameter estimates, standard errors, and statistical significance using various alternative procedures. Results indicate that several acceptable procedures can be used in lieu of or together with multilevel modeling, depending on the type of research question asked and the number of clusters under investigation. Guidelines for applied researchers are discussed.
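
One of the alternatives named above, the design-effect adjusted standard errors approach, can be sketched in a few lines: a standard error from an ordinary single-level analysis is inflated by the square root of the design effect, DEFF = 1 + (m - 1) * ICC for clusters of average size m. The example values below are hypothetical.

```python
import math

def design_effect_adjusted_se(se_single_level: float, avg_cluster_size: float, icc: float) -> float:
    """Inflate a single-level standard error by sqrt(DEFF), DEFF = 1 + (m - 1) * ICC,
    to approximate the effect of clustering on sampling variability."""
    deff = 1.0 + (avg_cluster_size - 1.0) * icc
    return se_single_level * math.sqrt(deff)

print(design_effect_adjusted_se(se_single_level=0.10, avg_cluster_size=25, icc=0.05))  # ~0.148
```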

Journal ArticleDOI
26 Feb 2016-PLOS ONE
TL;DR: The apparent failure of the Reproducibility Project to replicate many target effects can be adequately explained by overestimation of effect sizes due to small sample sizes and publication bias in the psychological literature.
Abstract: We revisit the results of the recent Reproducibility Project: Psychology by the Open Science Collaboration. We compute Bayes factors—a quantity that can be used to express comparative evidence for an hypothesis but also for the null hypothesis—for a large subset (N = 72) of the original papers and their corresponding replication attempts. In our computation, we take into account the likely scenario that publication bias had distorted the originally published results. Overall, 75% of studies gave qualitatively similar results in terms of the amount of evidence provided. However, the evidence was often weak (i.e., Bayes factor < 10). The majority of the studies (64%) did not provide strong evidence for either the null or the alternative hypothesis in either the original or the replication, and no replication attempts provided strong evidence in favor of the null. In all cases where the original paper provided strong evidence but the replication did not (15%), the sample size in the replication was smaller than the original. Where the replication provided strong evidence but the original did not (10%), the replication sample size was larger. We conclude that the apparent failure of the Reproducibility Project to replicate many target effects can be adequately explained by overestimation of effect sizes (or overestimation of evidence against the null hypothesis) due to small sample sizes and publication bias in the psychological literature. We further conclude that traditional sample sizes are insufficient and that a more widespread adoption of Bayesian methods is desirable.

Journal ArticleDOI
TL;DR: Five metrics commonly used as quantitative descriptors of sample similarity in detrital geochronology are tested: the Kolmogorov-Smirnov and Kuiper tests, as well as Cross-correlation, Likeness, and Similarity coefficients of probability density plots, kernel density estimates, and locally adaptive, variable-bandwidth KDEs.
Abstract: The increase in detrital geochronological data presents challenges to existing approaches to data visualization and comparison, and highlights the need for quantitative techniques able to evaluate and compare multiple large data sets. We test five metrics commonly used as quantitative descriptors of sample similarity in detrital geochronology: the Kolmogorov-Smirnov (K-S) and Kuiper tests, as well as Cross-correlation, Likeness, and Similarity coefficients of probability density plots (PDPs), kernel density estimates (KDEs), and locally adaptive, variable-bandwidth KDEs (LA-KDEs). We assess these metrics by applying them to 20 large synthetic data sets and one large empirical data set, and evaluate their utility in terms of sample similarity based on the following three criteria. (1) Similarity of samples from the same population should systematically increase with increasing sample size. (2) Metrics should maximize sensitivity by using the full range of possible coefficients. (3) Metrics should minimize artifacts resulting from sample-specific complexity. K-S and Kuiper test p-values passed only one criterion, indicating that they are poorly suited as quantitative descriptors of sample similarity. Likeness and Similarity coefficients of PDPs, as well as K-S and Kuiper test D and V values, performed better by passing two of the criteria. Cross-correlation of PDPs passed all three criteria. All coefficients calculated from KDEs and LA-KDEs failed at least two of the criteria. As hypothesis tests of derivation from a common source, individual K-S and Kuiper p-values too frequently reject the null hypothesis that samples come from a common source when they are identical. However, mean p-values calculated by repeated subsampling and comparison (minimum of 4 trials) consistently yield a binary discrimination of identical versus different source populations. Cross-correlation and Likeness of PDPs and Cross-correlation of KDEs yield the widest divergence in coefficients and thus a consistent discrimination between identical and different source populations, with Cross-correlation of PDPs requiring the smallest sample size. In light of this, we recommend acquisition of large detrital geochronology data sets for quantitative comparison. We also recommend repeated subsampling of detrital geochronology data sets and calculation of the mean and standard deviation of the comparison metric in order to capture the variability inherent in sampling a multimodal population. These statistical tools are implemented using DZstats, a MATLAB-based code that can be accessed via an executable file graphical user interface. It implements all of the statistical tests discussed in this paper, and exports the results both as spreadsheets and as graphic files.
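
To make two of the comparison metrics concrete, the sketch below computes the two-sample K-S statistic and a cross-correlation coefficient of kernel density estimates for two simulated age samples. It follows the general definitions summarised in the abstract, not the DZstats code itself; the synthetic ages, the age grid, and the default KDE bandwidth are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
ages_a = np.concatenate([rng.normal(300, 20, 60), rng.normal(1100, 50, 40)])  # synthetic ages (Ma)
ages_b = np.concatenate([rng.normal(310, 20, 50), rng.normal(1080, 50, 50)])

# Kolmogorov-Smirnov D statistic and p-value
ks = stats.ks_2samp(ages_a, ages_b)

# cross-correlation of density estimates: squared Pearson r between the two densities
# evaluated on a common age grid (one common definition; bandwidth choice matters)
grid = np.linspace(0, 2000, 2001)
pdf_a = stats.gaussian_kde(ages_a)(grid)
pdf_b = stats.gaussian_kde(ages_b)(grid)
cross_corr = np.corrcoef(pdf_a, pdf_b)[0, 1] ** 2

print(f"K-S D = {ks.statistic:.3f} (p = {ks.pvalue:.3f}), cross-correlation = {cross_corr:.3f}")
```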

Journal ArticleDOI
TL;DR: In this article, the authors explore the properties of the RV coefficient using simulated data sets and show that it is adversely affected by attributes of the data (sample size and number of variables) that do not characterize the covariance structure between sets of variables.
Abstract: Summary Modularity describes the case where patterns of trait covariation are unevenly dispersed across traits. Specifically, trait correlations are high and concentrated within subsets of variables (modules), but the correlations between traits across modules are relatively weaker. For morphometric data sets, hypotheses of modularity are commonly evaluated using the RV coefficient, an association statistic used in a wide variety of fields. In this article, I explore the properties of the RV coefficient using simulated data sets. Using data drawn from a normal distribution where the data were neither modular nor integrated in structure, I show that the RV coefficient is adversely affected by attributes of the data (sample size and the number of variables) that do not characterize the covariance structure between sets of variables. Thus, with the RV coefficient, patterns of modularity or integration in data are confounded with trends generated by sample size and the number of variables, which limits biological interpretations and renders comparisons of RV coefficients across data sets uninformative. As an alternative, I propose the covariance ratio (CR) for quantifying modular structure and show that it is unaffected by sample size or the number of variables. Further, statistical tests based on the CR exhibit appropriate type I error rates and display higher statistical power relative to the RV coefficient when evaluating modular data. Overall, these findings demonstrate that the RV coefficient does not display statistical characteristics suitable for reliable assessment of hypotheses of modular or integrated structure and therefore should not be used to evaluate these patterns in morphological data sets. By contrast, the covariance ratio meets these criteria and provides a useful alternative method for assessing the degree of modular structure in morphological data.
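
For reference, the RV coefficient that this article critiques can be computed as follows for two blocks of variables measured on the same specimens. This is a generic sketch of the standard statistic (not the author's code), and the sample-size and variable-number dependence described above can be checked by varying n and the block sizes on random, non-modular data.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two blocks of variables on the same n specimens.
    X: (n, p1), Y: (n, p2). Standard definition:
    trace(S12 S21) / sqrt(trace(S11 S11) * trace(S22 S22))."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    s11 = Xc.T @ Xc
    s22 = Yc.T @ Yc
    s12 = Xc.T @ Yc
    num = np.trace(s12 @ s12.T)
    den = np.sqrt(np.trace(s11 @ s11) * np.trace(s22 @ s22))
    return num / den

# random, non-modular data: the RV value still drifts with n and the number of variables
rng = np.random.default_rng(0)
print(rv_coefficient(rng.normal(size=(30, 5)), rng.normal(size=(30, 8))))
```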

Journal ArticleDOI
TL;DR: The results show that efficiency can be achieved by solely balancing the covariate distributions without resorting to direct estimation of propensity score or outcome regression function, and the proposed variance estimator outperforms existing estimators that require a direct approximation of the efficient influence function.
Abstract: The estimation of average treatment effects based on observational data is extremely important in practice and has been studied by generations of statisticians under different frameworks. Existing globally efficient estimators require non-parametric estimation of a propensity score function, an outcome regression function or both, but their performance can be poor in practical sample sizes. Without explicitly estimating either function, we consider a wide class of calibration weights constructed to attain an exact three-way balance of the moments of observed covariates among the treated, the control, and the combined group. The wide class includes exponential tilting, empirical likelihood and generalized regression as important special cases, and extends survey calibration estimators to different statistical problems, with important distinctions. Global semiparametric efficiency for the estimation of average treatment effects is established for this general class of calibration estimators. The results show that efficiency can be achieved by solely balancing the covariate distributions without resorting to direct estimation of propensity score or outcome regression function. We also propose a consistent estimator for the efficient asymptotic variance, which does not involve additional functional estimation of either the propensity score or the outcome regression functions. The proposed variance estimator outperforms existing estimators that require a direct approximation of the efficient influence function.
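
A minimal sketch of one member of the calibration class mentioned above (exponential tilting, often called entropy balancing): weights on the control group are chosen so that weighted covariate means match the treated-group means. This is a generic illustration of moment balancing, not the authors' three-way estimator, and it balances only first moments of hypothetical covariates.

```python
import numpy as np
from scipy.optimize import minimize

def entropy_balance_weights(X_control, target_means):
    """Exponential-tilting weights on control units so that the weighted covariate
    means equal `target_means` (e.g. the treated-group means). Solves the convex
    dual problem: minimise log(sum_i exp(x_i' lam)) - target' lam."""
    X = np.asarray(X_control, dtype=float)
    t = np.asarray(target_means, dtype=float)

    def dual(lam):
        eta = X @ lam
        m = eta.max()
        return m + np.log(np.exp(eta - m).sum()) - t @ lam

    def grad(lam):
        eta = X @ lam
        w = np.exp(eta - eta.max())
        w /= w.sum()
        return X.T @ w - t                      # weighted control mean minus target

    res = minimize(dual, np.zeros(X.shape[1]), jac=grad, method="BFGS")
    eta = X @ res.x
    w = np.exp(eta - eta.max())
    return w / w.sum()

# hypothetical covariates for control and treated groups
rng = np.random.default_rng(0)
X_control = rng.normal(size=(200, 3))
X_treated = rng.normal(loc=0.3, size=(100, 3))
w = entropy_balance_weights(X_control, X_treated.mean(axis=0))
print(w @ X_control, X_treated.mean(axis=0))    # weighted control means match treated means
```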

Journal ArticleDOI
TL;DR: This work derives formulae for sample size for repeated cross-section and closed cohort cluster randomised trials with normally distributed outcome measures, under a multilevel model allowing for variation between clusters and between times within clusters.
Abstract: The sample size required for a cluster randomised trial is inflated compared with an individually randomised trial because outcomes of participants from the same cluster are correlated. Sample size calculations for longitudinal cluster randomised trials (including stepped wedge trials) need to take account of at least two levels of clustering: the clusters themselves and times within clusters. We derive formulae for sample size for repeated cross-section and closed cohort cluster randomised trials with normally distributed outcome measures, under a multilevel model allowing for variation between clusters and between times within clusters. Our formulae agree with those previously described for special cases such as crossover and analysis of covariance designs, although simulation suggests that the formulae could underestimate required sample size when the number of clusters is small. Whether using a formula or simulation, a sample size calculation requires estimates of nuisance parameters, which in our model include the intracluster correlation, cluster autocorrelation, and individual autocorrelation. A cluster autocorrelation less than 1 reflects a situation where individuals sampled from the same cluster at different times have less correlated outcomes than individuals sampled from the same cluster at the same time. Nuisance parameters could be estimated from time series obtained in similarly clustered settings with the same outcome measure, using analysis of variance to estimate variance components.
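
As a simpler reference point for the formulae described above, the classical parallel cluster randomised trial calculation inflates the individually randomised sample size by the design effect 1 + (m - 1)ρ; the paper's formulae extend this with cluster and individual autocorrelation terms for longitudinal designs. The sketch below is the classical version only, with illustrative inputs.

```python
import math
from scipy.stats import norm

def cluster_rct_n_per_arm(effect_size, cluster_size, icc, alpha=0.05, power=0.9):
    """Classical per-arm sample size for a parallel cluster randomised trial with a
    normally distributed outcome: the individually randomised n per arm, inflated by
    the design effect 1 + (m - 1) * ICC. Longitudinal designs need additional terms."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    n_indiv = 2 * (z_a + z_b) ** 2 / effect_size**2      # per arm, individually randomised
    n = n_indiv * (1 + (cluster_size - 1) * icc)
    return math.ceil(n), math.ceil(n / cluster_size)     # participants and clusters per arm

print(cluster_rct_n_per_arm(effect_size=0.3, cluster_size=20, icc=0.05))  # (456, 23)
```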

Journal ArticleDOI
TL;DR: A simple model is derived of how different factors, such as sample heterogeneity and study setup, determine this machine learning effect size, and this model explains the variation in prediction accuracies found in the literature, both in cross-validation and in independent-sample testing.
Abstract: Recently it was suggested that much larger cohorts are needed to prove the diagnostic value of neuroimaging biomarkers in psychiatry. While within a sample increase of diagnostic accuracy of schizophrenia with number of subjects (N) has been shown, the relationship between N and accuracy is completely different between studies. Using data from a meta-analysis of machine learning in imaging schizophrenia, we found that while low-N studies can reach 90% and higher accuracy, above N/2=50 the maximum accuracy achieved steadily drops to below 70% for N/2>150. We investigate the role N plays in the wide variability in accuracy results (63-97%). We hypothesize that the underlying cause of the decrease in accuracy with increasing N is sample heterogeneity. While smaller studies more easily include a homogeneous group of subjects (strict inclusion criteria are easily met; subjects live close to study site), larger studies inevitably need to relax the criteria / recruit from large geographic areas. A schizophrenia prediction model based on a heterogeneous group of patients with presumably a heterogeneous pattern of structural or functional brain changes will not be able to capture the whole variety of changes, thus being limited to patterns shared by most patients. In addition to heterogeneity, we investigate other factors influencing accuracy and introduce a machine learning effect size. We derive a simple model of how the different factors such as sample heterogeneity determine this effect size, and explain the variation in prediction accuracies found from the literature, both in cross-validation and independent sample testing. From this we argue that smaller-N studies may reach high prediction accuracy at the cost of lower generalizability to other samples. Higher-N studies, on the other hand, will have more generalization power, but at the cost of lower accuracy. In conclusion, when comparing results from different machine learning studies, the sample sizes should be taken into account. To assess the generalizability of the models, validation of the prediction models should be tested in independent samples. The prediction of more complex measures such as outcome, which are expected to have an underlying pattern of more subtle brain abnormalities, will require large (multicenter) studies.

Journal ArticleDOI
01 Jan 2016
TL;DR: In this article, it is shown how the distribution for the degrees of freedom is dependent on the sample sizes and the variances of the samples, and hence gives an insight into why Welch's test is Type I error robust under normality.
Abstract: The comparison of two means is one of the most commonly applied statistical procedures in psychology. The independent samples t-test corrected for unequal variances is commonly known as Welch’s test, and is widely considered to be a robust alternative to the independent samples t-test. The properties of Welch’s test that make it Type I error robust are examined. The degrees of freedom used in Welch’s test are a random variable, the distributions of which are examined using simulation. It is shown how the distribution for the degrees of freedom is dependent on the sample sizes and the variances of the samples. The impact of sample variances on the degrees of freedom, the resultant critical value and the test statistic is considered, and hence gives an insight into why Welch’s test is Type I error robust under normality.
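
The random degrees of freedom discussed above come from the Welch-Satterthwaite approximation. The short simulation below reproduces the idea that their distribution depends on the sample sizes and the sample variances; the particular sample sizes and population standard deviations are arbitrary illustrations.

```python
import numpy as np

def welch_df(s1_sq, n1, s2_sq, n2):
    """Welch-Satterthwaite approximate degrees of freedom for the two-sample t-test."""
    a, b = s1_sq / n1, s2_sq / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

rng = np.random.default_rng(7)
n1, n2, sd1, sd2 = 15, 30, 1.0, 2.0        # unequal sample sizes and unequal population SDs
dfs = [
    welch_df(rng.normal(0, sd1, n1).var(ddof=1), n1,
             rng.normal(0, sd2, n2).var(ddof=1), n2)
    for _ in range(5000)
]
print(np.percentile(dfs, [5, 50, 95]))     # the degrees of freedom are themselves a random variable
```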

Journal ArticleDOI
10 Mar 2016
TL;DR: In this article, simpler guidelines are proposed to estimate sufficient sample size requirements in different scenarios, and tables are provided that show sample size calculations based on the desired correlation coefficient, power and type I error (p-value) values.
Abstract: Correlation analysis is a common statistical analysis in various fields. The aim is usually to determine to what extent two numerical variables are correlated with each other. One of the issues that is important to consider before conducting any correlation analysis is planning for a sufficient sample size. This ensures that the results derived from the analysis can reach a desired minimum correlation coefficient value with sufficient power and the desired type I error (p-value). Sample size estimation for correlation analysis should be in line with the study objective. Researchers who are not statisticians need simpler guidelines to determine the sufficient sample size for correlation analysis. Therefore, this study aims to tabulate tables that show sample size calculations based on the desired correlation coefficient, power and type I error (p-value) values. Moving towards that, simpler guidelines are proposed to estimate sufficient sample size requirements in different scenarios.
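
A standard way to generate the kind of table described above is the Fisher z-transformation approximation. The sketch below uses that well-known formula, which may differ in detail from the paper's tabulated values.

```python
import math
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.8):
    """Approximate sample size to detect a population correlation r (two-sided test of
    rho = 0) via the Fisher z-transformation: n = ((z_a + z_b) / atanh(r))^2 + 3."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return math.ceil(((z_a + z_b) / math.atanh(r)) ** 2 + 3)

print(n_for_correlation(0.3))              # about 85
print(n_for_correlation(0.5, power=0.9))   # about 38
```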

Journal ArticleDOI
TL;DR: This work proposes to retain the likelihood ratio test in combination with decision criteria that increase with sample size, and addresses the concern that structural equation models cannot necessarily be expected to provide an exact description of real-world phenomena.
Abstract: One of the most important issues in structural equation modeling concerns testing model fit. We propose to retain the likelihood ratio test in combination with decision criteria that increase with sample size. Specifically, rooted in Neyman–Pearson hypothesis testing, we advocate balancing α- and β-error risks. This strategy has a number of desirable consequences and addresses several objections that have been raised against the likelihood ratio test in model evaluation. First, balancing error risks avoids logical problems with Fisher-type hypotheses tests when predicting the null hypothesis (i.e., model fit). Second, both types of statistical decision errors are controlled. Third, larger samples are encouraged (rather than penalized) because both error risks diminish as the sample size increases. Finally, the strategy addresses the concern that structural equation models cannot necessarily be expected to provide an exact description of real-world phenomena.
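
The balancing idea can be made concrete with a small numerical sketch: for a given model df and a hypothesized noncentrality under misspecification (which grows with N), choose the chi-square cutoff at which the α- and β-error risks are equal. This is a generic illustration under assumed inputs (df, discrepancy), not the authors' recommended calibration.

```python
from scipy.stats import chi2, ncx2
from scipy.optimize import brentq

def balanced_cutoff(df, ncp):
    """Chi-square cutoff c with equal error risks:
    P(chi2_df > c | correct model) = P(chi2_df(ncp) < c | misspecified model)."""
    f = lambda c: chi2.sf(c, df) - ncx2.cdf(c, df, ncp)
    return brentq(f, 1e-6, df + ncp + 200)

# hypothetical: df = 24; noncentrality ncp = n * F0 grows with sample size,
# so the cutoff (and the implied alpha) changes with n
for n in (100, 300, 1000):
    ncp = n * 0.05                 # assumed misspecification discrepancy F0 = 0.05
    c = balanced_cutoff(24, ncp)
    print(n, round(c, 1), round(chi2.sf(c, 24), 4))   # cutoff rises, alpha shrinks as n grows
```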

01 Jan 2016
TL;DR: In this article, the authors propose a new procedure that includes the analysis of variance F-test as well as decisions on pairwise contrasts by Scheffé's method; it is consistent with the decisions of the F-test as well as with those of Scheffé's method of judging all contrasts, and might be regarded as a practical way of applying the latter.
Abstract: contained in the set of all means involved in an analysis of variance. Significance decisions are based on sums of squares between means within sets, using the same critical value as for the overall F-test. The decisions are shown to be transitive in the sense that any set containing a significant subset is itself significant. However, decisions may be incomplete in that a set may be significant and yet none of its subsets be significant, so that the form of the heterogeneity of the set cannot be inferred with the specified degree of confidence. The new procedure includes the analysis of variance F-test as well as decisions on pairwise contrasts by Scheffé's method. It is consistent with the decisions of the F-test as well as with those of Scheffé's method of judging all contrasts, and might be regarded as a practical way of applying the latter. The probability of making at least one type I error among all the decisions does not exceed the significance level of the overall F-test. Together with Scheffé's method the new procedure may be regarded as providing detailed decisions implicit in significant F-tests. Probabilities of type I errors for sets of any given number of means are defined, and their importance in evaluating multiple comparisons methods is pointed out. Tukey's method is seen to imply a range procedure which has advantages when the sample sizes are equal. Stepwise methods such as those of Duncan and of Newman and Keuls are compared with the above procedures in terms of their error probabilities and other properties.

Journal ArticleDOI
TL;DR: In this article, the authors propose a new method of sample size estimation for Bland-Altman agreement assessment, based on the width of the confidence interval for the limits of agreement (LoAs) in comparison to a predefined clinical agreement limit.
Abstract: The Bland-Altman method has been widely used for assessing agreement between two methods of measurement. However, the question of sample size estimation for this method has remained unsolved. We propose a new method of sample size estimation for Bland-Altman agreement assessment. According to the Bland-Altman method, the conclusion on agreement is made based on the width of the confidence interval for the LoAs (limits of agreement) in comparison to a predefined clinical agreement limit. Under the theory of statistical inference, formulae for sample size estimation are derived, which depend on the pre-determined levels of α and β, the mean and standard deviation of the differences between the two measurements, and the predefined clinical limits. With this new method, the sample sizes are calculated under different parameter settings which occur frequently in method comparison studies, and Monte Carlo simulation is used to obtain the corresponding powers. The results of the Monte Carlo simulation showed that the achieved powers coincided with the pre-determined levels of power, thus validating the correctness of the method. The method of sample size estimation can be applied in the Bland-Altman method to assess agreement between two methods of measurement.
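
In the same spirit as the Monte Carlo validation described above, here is a hedged simulation sketch: for a candidate n, estimate the power that the confidence bounds of both limits of agreement fall within a predefined clinical limit, using the usual approximation SE(LoA) ≈ s√(3/n). The paper's formula-based approach will differ in detail, and all numeric inputs below are hypothetical.

```python
import numpy as np

def ba_power(n, mu_d=0.0, sd_d=1.0, clinical_limit=2.5, reps=4000, seed=1):
    """Monte Carlo power that the 95% CIs of both limits of agreement lie inside
    +/- clinical_limit, for differences ~ Normal(mu_d, sd_d)."""
    rng = np.random.default_rng(seed)
    z = 1.96
    hits = 0
    for _ in range(reps):
        d = rng.normal(mu_d, sd_d, n)
        m, s = d.mean(), d.std(ddof=1)
        se_loa = s * np.sqrt(3.0 / n)           # approximate SE of a limit of agreement
        upper_ci = (m + z * s) + z * se_loa     # upper bound of the upper LoA's CI
        lower_ci = (m - z * s) - z * se_loa     # lower bound of the lower LoA's CI
        hits += (upper_ci < clinical_limit) and (lower_ci > -clinical_limit)
    return hits / reps

for n in (30, 50, 80, 120):
    print(n, round(ba_power(n), 3))             # power increases with the sample size
```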