
Showing papers on "Sample size determination published in 2015"


Book
01 Jun 2015
TL;DR: A practical primer on how to calculate and report effect sizes for t-tests and ANOVAs such that effect sizes can be used in a priori power analyses and meta-analyses, together with a detailed overview of the similarities and differences between within- and between-subjects designs, is provided.
Abstract: Effect sizes are the most important outcome of empirical studies. Most articles on effect sizes highlight their importance in communicating the practical significance of results. For scientists themselves, effect sizes are most useful because they facilitate cumulative science. Effect sizes can be used to determine the sample size for follow-up studies, or to examine effects across studies. This article aims to provide a practical primer on how to calculate and report effect sizes for t-tests and ANOVAs such that effect sizes can be used in a priori power analyses and meta-analyses. Whereas many articles about effect sizes focus on between-subjects designs and address within-subjects designs only briefly, I provide a detailed overview of the similarities and differences between within- and between-subjects designs. I suggest that some research questions in experimental psychology examine inherently intra-individual effects, which makes effect sizes that incorporate the correlation between measures the best summary of the results. Finally, a supplementary spreadsheet is provided to make it as easy as possible for researchers to incorporate effect size calculations into their workflow.

5,374 citations
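
For orientation, a minimal sketch of the two effect-size flavours the primer contrasts: a between-subjects Cohen's d using the pooled SD, and a within-subjects d_z whose denominator, the SD of the difference scores, already incorporates the correlation between measures. The data and variable names are hypothetical, and this is not the paper's supplementary spreadsheet.

```python
import numpy as np

def cohens_d_between(x, y):
    """Cohen's d for two independent groups, using the pooled SD."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

def cohens_dz_within(x, y):
    """Cohen's d_z for paired observations: mean difference / SD of the differences."""
    diff = np.asarray(x) - np.asarray(y)
    return np.mean(diff) / np.std(diff, ddof=1)

rng = np.random.default_rng(1)
pre = rng.normal(100, 15, 40)            # hypothetical pre-test scores
post = pre + rng.normal(5, 8, 40)        # correlated post-test scores for the same people

print(f"between-subjects d (ignoring the pairing) = {cohens_d_between(post, pre):.2f}")
print(f"within-subjects d_z (uses paired diffs)   = {cohens_dz_within(post, pre):.2f}")
```

Because the two measures are correlated, d_z is noticeably larger than the between-subjects d here, which is exactly the kind of difference the primer discusses.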


Journal ArticleDOI
TL;DR: LDpred is introduced, a method that infers the posterior mean effect size of each marker by using a prior on effect sizes and LD information from an external reference panel, and outperforms the approach of pruning followed by thresholding, particularly at large sample sizes.
Abstract: Polygenic risk scores have shown great promise in predicting complex disease risk and will become more accurate as training sample sizes increase. The standard approach for calculating risk scores involves linkage disequilibrium (LD)-based marker pruning and applying a p value threshold to association statistics, but this discards information and can reduce predictive accuracy. We introduce LDpred, a method that infers the posterior mean effect size of each marker by using a prior on effect sizes and LD information from an external reference panel. Theory and simulations show that LDpred outperforms the approach of pruning followed by thresholding, particularly at large sample sizes. Accordingly, predicted R² increased from 20.1% to 25.3% in a large schizophrenia dataset and from 9.8% to 12.0% in a large multiple sclerosis dataset. A similar relative improvement in accuracy was observed for three additional large disease datasets and for non-European schizophrenia samples. The advantage of LDpred over existing methods will grow as sample sizes increase.

1,088 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a confidence interval that does not require equal variances or equal covariances and demonstrated that the proposed method performed better than alternative methods, and also presented some sample size formulas that approximate the sample size requirements for desired power or desired confidence interval precision.
Abstract: Cronbach's alpha is one of the most widely used measures of reliability in the social and organizational sciences. Current practice is to report the sample value of Cronbach's alpha reliability, but a confidence interval for the population reliability value also should be reported. The traditional confidence interval for the population value of Cronbach's alpha makes an unnecessarily restrictive assumption that the multiple measurements have equal variances and equal covariances. We propose a confidence interval that does not require equal variances or equal covariances. The results of a simulation study demonstrated that the proposed method performed better than alternative methods. We also present some sample size formulas that approximate the sample size requirements for desired power or desired confidence interval precision. R functions are provided that can be used to implement the proposed confidence interval and sample size methods.

578 citations
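
The proposed confidence interval and sample size procedures are distributed by the authors as R functions; the sketch below only computes the sample Cronbach's alpha itself, the quantity those procedures are built around, from a hypothetical respondents-by-items matrix. It is not the authors' interval method.

```python
import numpy as np

def cronbach_alpha(items):
    """Sample Cronbach's alpha.

    items: 2-D array, rows = respondents, columns = items.
    alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(0)
true_score = rng.normal(size=(200, 1))                     # hypothetical latent score
data = true_score + rng.normal(scale=1.0, size=(200, 5))   # five noisy items
print(f"alpha = {cronbach_alpha(data):.2f}")
```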


Journal ArticleDOI
TL;DR: This study extracted 147,328 correlations and developed a hierarchical taxonomy of variables reported in Journal of Applied Psychology and Personnel Psychology from 1980 to 2010 to produce empirical effect size benchmarks at the omnibus level, for 20 common research domains, and for an even finer grained level of generality.
Abstract: Effect size information is essential for the scientific enterprise and plays an increasingly central role in the scientific process. We extracted 147,328 correlations and developed a hierarchical taxonomy of variables reported in Journal of Applied Psychology and Personnel Psychology from 1980 to 2010 to produce empirical effect size benchmarks at the omnibus level, for 20 common research domains, and for an even finer grained level of generality. Results indicate that the usual interpretation and classification of effect sizes as small, medium, and large bear almost no resemblance to findings in the field, because distributions of effect sizes exhibit tertile partitions at values approximately one-half to one-third those intuited by Cohen (1988). Our results offer information that can be used for research planning and design purposes, such as producing better informed non-nil hypotheses and estimating statistical power and planning sample size accordingly. We also offer information useful for understanding the relative importance of the effect sizes found in a particular study in relationship to others and which research domains have advanced more or less, given that larger effect sizes indicate a better understanding of a phenomenon. Also, our study offers information about research domains for which the investigation of moderating effects may be more fruitful and provide information that is likely to facilitate the implementation of Bayesian analysis. Finally, our study offers information that practitioners can use to evaluate the relative effectiveness of various types of interventions.

500 citations


Journal ArticleDOI
TL;DR: The purpose of this paper is to provide insight into whether and how researchers have dealt with sample size calculations for healthcare-related DCE studies, to introduce and explain the required sample size for parameter estimates in DCEs, and to provide a step-by-step guide for the calculation of the minimum sample size requirements for DCEs in health care.
Abstract: Discrete-choice experiments (DCEs) have become a commonly used instrument in health economics and patient-preference analysis, addressing a wide range of policy questions. An important question when setting up a DCE is the size of the sample needed to answer the research question of interest. Although theory exists as to the calculation of sample size requirements for stated choice data, it does not address the issue of minimum sample size requirements in terms of the statistical power of hypothesis tests on the estimated coefficients. The purpose of this paper is threefold: (1) to provide insight into whether and how researchers have dealt with sample size calculations for healthcare-related DCE studies; (2) to introduce and explain the required sample size for parameter estimates in DCEs; and (3) to provide a step-by-step guide for the calculation of the minimum sample size requirements for DCEs in health care.

475 citations


Journal ArticleDOI
TL;DR: This study demonstrates how applying signal classification to Gaussian random signals can yield decoding accuracies of up to 70% or higher in two-class decoding with small sample sets, taking sample size into account.

452 citations
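
The accuracy thresholds this study reports follow from the binomial distribution of correct classifications under chance. The sketch below computes the smallest two-class decoding accuracy that is significant at a one-sided 5% level for a given sample size; it illustrates the general idea and is not the authors' code.

```python
from scipy.stats import binom

def accuracy_threshold(n_samples, n_classes=2, alpha=0.05):
    """Smallest decoding accuracy whose one-sided binomial p-value is below alpha.

    Under the null, correct classifications ~ Binomial(n_samples, 1/n_classes).
    """
    chance = 1.0 / n_classes
    for k in range(n_samples + 1):
        if binom.sf(k - 1, n_samples, chance) < alpha:   # P(X >= k) under chance
            return k / n_samples
    return 1.0

for n in (20, 50, 100, 500):
    print(f"n = {n:3d}: significant two-class accuracy >= {accuracy_threshold(n):.2f}")
```

For n = 20 the threshold is 75% accuracy, which is why small sample sets can produce apparently impressive accuracies on pure noise unless sample size is taken into account.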


Journal ArticleDOI
TL;DR: In this paper, the authors propose a tool to help users to decide what would be a useful sample size for their particular context when investigating patterns across participants, based on the expected population theme prevalence of the least prevalent themes.
Abstract: Thematic analysis is frequently used to analyse qualitative data in psychology, healthcare, social research and beyond. An important stage in planning a study is determining how large a sample size may be required; however, current guidelines for thematic analysis vary widely, ranging from around 2 to over 400, and it is unclear how to choose a value from the space in between. Some guidance also cannot be applied prospectively. This paper introduces a tool to help users think about what would be a useful sample size for their particular context when investigating patterns across participants. The calculation depends on (a) the expected population theme prevalence of the least prevalent theme, derived either from prior knowledge or based on the prevalence of the rarest themes considered worth uncovering, e.g. 1 in 10, 1 in 100; (b) the number of desired instances of the theme; and (c) the power of the study. An adequately powered study will have a high likelihood of finding sufficient themes of the desired prevalence. This calculation can then be used alongside other considerations. We illustrate how to use the method to calculate the required sample size before starting a study and the achieved power given a sample size, providing tables of answers and code for use in the free software R. Sample sizes are comparable to those found in the literature; for example, to have 80% power to detect two instances of a theme with 10% prevalence, 29 participants are required. Increasing power, increasing the number of instances or decreasing prevalence increases the sample size needed. We do not propose this as a ritualistic requirement for study design, but rather as a pragmatic supporting tool to help plan studies using thematic analysis.

328 citations
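
The quoted example can be reproduced with a binomial model: power is the probability of observing at least the desired number of instances of a theme with a given population prevalence in a sample of n participants. The authors provide R code; the sketch below is a Python illustration of the same logic, not their tool.

```python
from scipy.stats import binom

def theme_power(n, prevalence, instances):
    """P(at least `instances` of n participants express a theme with the given prevalence)."""
    return binom.sf(instances - 1, n, prevalence)

def required_n(prevalence, instances, power=0.80, n_max=10_000):
    """Smallest n achieving the desired power (simple linear search)."""
    for n in range(instances, n_max + 1):
        if theme_power(n, prevalence, instances) >= power:
            return n
    raise ValueError("no n up to n_max achieves the requested power")

# Example from the abstract: two instances of a 10%-prevalence theme at 80% power.
print(required_n(prevalence=0.10, instances=2, power=0.80))   # 29
print(f"{theme_power(29, 0.10, 2):.3f}")                      # ~0.80
```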


Journal ArticleDOI
TL;DR: A simple formula is presented to calculate the sample size needed to be able to identify, with a chosen level of confidence, problems that may arise with a given probability.

293 citations
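
Only the TL;DR is shown for this entry, so the paper's exact formula is not reproduced here. Formulas of this kind are typically derived from the probability of observing a problem at least once, 1 − (1 − p)^n ≥ confidence. A sketch under that assumption, which may differ in detail from the paper's formula:

```python
import math

def n_for_detection(problem_probability, confidence=0.95):
    """Sample size so that a problem occurring with the given per-participant
    probability is seen at least once with the chosen confidence:
    1 - (1 - p)**n >= confidence  =>  n >= log(1 - confidence) / log(1 - p).
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - problem_probability))

for p in (0.05, 0.10, 0.20):
    print(f"p = {p:.2f}: n = {n_for_detection(p)}")
# p = 0.05 -> 59, p = 0.10 -> 29, p = 0.20 -> 14 (at 95% confidence)
```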


Journal ArticleDOI
21 Jul 2015-PeerJ
TL;DR: Simulation results suggest that OLRE are a useful tool for modelling overdispersion in Binomial data, but that they do not perform well in all circumstances and researchers should take care to verify the robustness of parameter estimates of OLRE models.
Abstract: Overdispersion is a common feature of models of biological data, but researchers often fail to model the excess variation driving the overdispersion, resulting in biased parameter estimates and standard errors. Quantifying and modeling overdispersion when it is present is therefore critical for robust biological inference. One means to account for overdispersion is to add an observation-level random effect (OLRE) to a model, where each data point receives a unique level of a random effect that can absorb the extra-parametric variation in the data. Although some studies have investigated the utility of OLRE to model overdispersion in Poisson count data, studies doing so for Binomial proportion data are scarce. Here I use a simulation approach to investigate the ability of both OLRE models and Beta-Binomial models to recover unbiased parameter estimates in mixed effects models of Binomial data under various degrees of overdispersion. In addition, as ecologists often fit random intercept terms to models when the random effect sample size is low (<5 levels), I investigate the performance of both model types under a range of random effect sample sizes when overdispersion is present. Simulation results revealed that the efficacy of OLRE depends on the process that generated the overdispersion; OLRE failed to cope with overdispersion generated from a Beta-Binomial mixture model, leading to biased slope and intercept estimates, but performed well for overdispersion generated by adding random noise to the linear predictor. Comparison of parameter estimates from an OLRE model with those from its corresponding Beta-Binomial model readily identified when OLRE were performing poorly due to disagreement between effect sizes, and this strategy should be employed whenever OLRE are used for Binomial data to assess their reliability. Beta-Binomial models performed well across all contexts, but showed a tendency to underestimate effect sizes when modelling non-Beta-Binomial data. Finally, both OLRE and Beta-Binomial models performed poorly when models contained <5 levels of the random intercept term, especially for estimating variance components, and this effect appeared independent of total sample size. These results suggest that OLRE are a useful tool for modelling overdispersion in Binomial data, but that they do not perform well in all circumstances and researchers should take care to verify the robustness of parameter estimates of OLRE models.

287 citations


Journal ArticleDOI
TL;DR: Small samples (5–15 participants) that are common in pre-tests of questionnaires may fail to uncover even common problems, and a default sample size of 30 participants is recommended.
Abstract: Purpose: To provide guidance regarding the desirable size of pre-tests of psychometric questionnaires, when the purpose of the pre-test is to detect misunderstandings, ambiguities, or other difficulties participants may encounter with instrument items (called «problems»). Methods: We computed (a) the power to detect a problem for various levels of prevalence and various sample sizes, (b) the required sample size to detect problems for various levels of prevalence, and (c) upper confidence limits for problem prevalence in situations where no problems were detected. Results: As expected, power increased with problem prevalence and with sample size. If problem prevalence was 0.05, a sample of 10 participants had only a power of 40 % to detect the problem, and a sample of 20 achieved a power of 64 %. To achieve a power of 80 %, 32 participants were necessary if the prevalence of the problem was 0.05, 16 participants if prevalence was 0.10, and 8 if prevalence was 0.20. If no problems were observed in a given sample, the upper limit of a two-sided 90 % confidence interval reached 0.26 for a sample size of 10, 0.14 for a sample size of 20, and 0.10 for a sample of 30 participants. Conclusions: Small samples (5–15 participants) that are common in pre-tests of questionnaires may fail to uncover even common problems. A default sample size of 30 participants is recommended.

274 citations
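
The power and confidence-limit figures in the Results above follow from elementary binomial calculations; the sketch below reproduces them and is not the authors' code.

```python
def power_to_detect(n, prevalence):
    """Probability that at least one of n participants exhibits the problem."""
    return 1 - (1 - prevalence) ** n

def upper_limit_zero_events(n, two_sided_level=0.90):
    """Exact (Clopper-Pearson) upper limit for prevalence when 0 of n participants
    showed the problem; equals 1 - alpha**(1/n) with alpha = (1 - level) / 2."""
    alpha = (1 - two_sided_level) / 2
    return 1 - alpha ** (1 / n)

print(f"{power_to_detect(10, 0.05):.2f}")     # ~0.40
print(f"{power_to_detect(32, 0.05):.2f}")     # ~0.81 (>= 0.80)
print(f"{upper_limit_zero_events(10):.2f}")   # ~0.26
print(f"{upper_limit_zero_events(30):.2f}")   # ~0.10
```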


Journal ArticleDOI
TL;DR: A three-level meta-analytic model is evaluated to account for dependent effect sizes, extending the simulation results of Van den Noortgate, López-López, Marín-Martínez, and Sánchez-Meca (2013, Behavior Research Methods) by allowing for variation in the number of effect sizes per study.
Abstract: In meta-analysis, dependent effect sizes are very common. An example is where in one or more studies the effect of an intervention is evaluated on multiple outcome variables for the same sample of participants. In this paper, we evaluate a three-level meta-analytic model to account for this kind of dependence, extending the simulation results of Van den Noortgate, Lopez-Lopez, Marin-Martinez, and Sanchez-Meca (2013, Behavior Research Methods, 45, 576–594) by allowing for variation in the number of effect sizes per study, in the between-study variance, in the correlations between pairs of outcomes, and in the sample size of the studies. At the same time, we explore the performance of the approach if the outcomes used in a study can be regarded as a random sample from a population of outcomes. We conclude that although this approach is relatively simple and does not require prior estimates of the sampling covariances between effect sizes, it gives appropriate mean effect size estimates, standard error estimates, and confidence interval coverage proportions in a variety of realistic situations.

Journal ArticleDOI
TL;DR: Study sample size, which is likely a proxy for case assessment method, and the use of DSM‐IV‐TR diagnostic criteria are the major sources of heterogeneity across studies, which refines the population prevalence estimate of TS in children to be 0.3% to 0.9%.
Abstract: The aim of this study was to refine the population prevalence estimate of Tourette Syndrome (TS) in children and to investigate potential sources of heterogeneity in previously published studies. A systematic review was conducted and all qualifying published studies of TS prevalence were examined. Extracted data were subjected to a random-effects meta-analysis weighted by sample size; meta-regressions were performed to examine covariates that have previously been proposed as potential sources of heterogeneity. Twenty-six articles met study inclusion criteria. Studies derived from clinically referred cases had prevalence estimates that were significantly lower than those derived from population-based samples (P = 0.004). Among the 21 population-based prevalence studies, the pooled TS population prevalence estimate was 0.52% (95% confidence interval CI: 0.32-0.85). In univariable meta-regression analysis, study sample size (P = 0.002) and study date (P = 0.03) were significant predictors of TS prevalence. In the final multivariable model including sample size, study date, age, and diagnostic criteria, only sample size (P < 0.001) and diagnostic criteria (omnibus P = 0.003; Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Text Revision [DSM-IV-TR]: P = 0.005) were independently associated with variation in TS population prevalence across studies. This study refines the population prevalence estimate of TS in children to be 0.3% to 0.9%. Study sample size, which is likely a proxy for case assessment method, and the use of DSM-IV-TR diagnostic criteria are the major sources of heterogeneity across studies. The true TS population prevalence rate is likely at the higher end of these estimates, given the methodological limitations of most studies. Further studies in large, well-characterized samples will be helpful to determine the burden of disease in the general population.

Journal ArticleDOI
TL;DR: This paper gives the most comprehensive description of published methodology for sample size calculation for cluster randomized trials and provides an important resource for those designing these trials.
Abstract: Background: The use of cluster randomized trials (CRTs) is increasing, along with the variety in their design and analysis. The simplest approach for their sample size calculation is to calculate the sample size assuming individual randomization and inflate this by a design effect to account for randomization by cluster. The assumptions of a simple design effect may not always be met; alternative or more complicated approaches are required. Methods: We summarise a wide range of sample size methods available for cluster randomized trials. For those familiar with sample size calculations for individually randomized trials but with less experience in the clustered case, this manuscript provides formulae for a wide range of scenarios with associated explanation and recommendations. For those with more experience, comprehensive summaries are provided that allow quick identification of methods for a given design, outcome and analysis method. Results: We present first those methods applicable to the simplest two-arm, parallel group, completely randomized design followed by methods that incorporate deviations from this design such as: variability in cluster sizes; attrition; non-compliance; or the inclusion of baseline covariates or repeated measures. The paper concludes with methods for alternative designs. Conclusions: There is a large amount of methodology available for sample size calculations in CRTs. This paper gives the most comprehensive description of published methodology for sample size calculation and provides an important resource for those designing these trials.
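
The "simplest approach" described in the Background can be written down directly: compute the individually randomized sample size and inflate it by the design effect 1 + (m − 1) × ICC. A sketch assuming equal cluster sizes and a continuous outcome, with hypothetical numbers; the review itself covers many more complicated scenarios.

```python
import math
from scipy.stats import norm

def n_individual(delta, sd, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sample comparison of means under
    individual randomization (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z * sd / delta) ** 2

def clusters_per_arm(delta, sd, cluster_size, icc, alpha=0.05, power=0.80):
    """Inflate by the design effect 1 + (m - 1) * ICC and convert to clusters per arm."""
    deff = 1 + (cluster_size - 1) * icc
    n_per_arm = n_individual(delta, sd, alpha, power) * deff
    return math.ceil(n_per_arm / cluster_size)

# Hypothetical example: detect a 0.3 SD difference with 20 participants per cluster, ICC = 0.05.
print(clusters_per_arm(delta=0.3, sd=1.0, cluster_size=20, icc=0.05))
```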

Journal ArticleDOI
TL;DR: This work proposes the use of a more inclusive LCA to generate posterior probabilities; this LCA includes additional variables present in the analytic model and shows that with sufficient measurement quality or sample size, the proposed strategy reduces or eliminates bias.
Abstract: Despite recent methodological advances in latent class analysis (LCA) and a rapid increase in its application in behavioral research, complex research questions that include latent class variables often must be addressed by classifying individuals into latent classes and treating class membership as known in a subsequent analysis. Traditional approaches to classifying individuals based on posterior probabilities are known to produce attenuated estimates in the analytic model. We propose the use of a more inclusive LCA to generate posterior probabilities; this LCA includes additional variables present in the analytic model. A motivating empirical demonstration is presented, followed by a simulation study to assess the performance of the proposed strategy. Results show that with sufficient measurement quality or sample size, the proposed strategy reduces or eliminates bias.

Journal ArticleDOI
TL;DR: A more robust estimate of the relative influence of genetic effects on wellbeing is provided using a sample size weighted average heritability analysis of 30 twin-family studies on wellbeing and satisfaction with life.
Abstract: Wellbeing is a major topic of research across several disciplines, reflecting the increasing recognition of its strong value across major domains in life. Previous twin-family studies have revealed that individual differences in wellbeing are accounted for by both genetic as well as environmental factors. A systematic literature search identified 30 twin-family studies on wellbeing or a related measure such as satisfaction with life or happiness. Review of these studies showed considerable variation in heritability estimates (ranging from 0 to 64 %), which makes it difficult to draw firm conclusions regarding the genetic influences on wellbeing. For overall wellbeing twelve heritability estimates, from 10 independent studies, were meta-analyzed by computing a sample size weighted average heritability. Ten heritability estimates, derived from 9 independent samples, were used for the meta-analysis of satisfaction with life. The weighted average heritability of wellbeing, based on a sample size of 55,974 individuals, was 36 % (34–38), while the weighted average heritability for satisfaction with life was 32 % (29–35) (n = 47,750). With this result a more robust estimate of the relative influence of genetic effects on wellbeing is provided.

Journal ArticleDOI
TL;DR: The results showed that the Cox proportional hazard model exhibited a poor estimation of population means of healthcare costs and the β1 even under proportional hazard data, and increasing the sample size could improve the performance of the OLS-based model.
Abstract: Skewed data are the main issue in statistical models of healthcare costs. Data transformation is a conventional method to decrease skewness, but there are some disadvantages. Some recent studies have employed generalized linear models (GLMs) and Cox proportional hazard regression as alternative estimators. The aim of this study was to investigate how well these alternative estimators perform in terms of bias and precision when the data are skewed. The primary outcome was the estimation of population means of healthcare costs, and the secondary outcome was the impact of a covariate on healthcare cost. Alternative estimators, such as ordinary least squares (OLS) for Ln(y) or Log(y), Gamma, Weibull and Cox proportional hazard regression models, were compared using Monte Carlo simulation under different situations, which were generated from skewed distributions. We found that there was not one best model across all generated conditions. However, GLMs, especially the Gamma regression model, behaved well in the estimation of population means of healthcare costs. The results showed that the Cox proportional hazard model exhibited poor estimation of population means of healthcare costs and of the covariate effect β1, even under proportional hazards data. Results were approximately consistent across sample sizes; however, increasing the sample size could improve the performance of the OLS-based model.
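
As a sketch of the kind of model comparison described, the following fits OLS on log costs and a Gamma GLM with log link to simulated right-skewed costs using statsmodels. The data-generating values are hypothetical, and this is not the authors' simulation code.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
treated = rng.integers(0, 2, n)                      # hypothetical binary covariate
mu = np.exp(7.0 + 0.5 * treated)                     # true mean cost on the natural scale
costs = rng.gamma(shape=2.0, scale=mu / 2.0)         # right-skewed costs with E[y] = mu

X = sm.add_constant(treated.astype(float))

# OLS on log-transformed costs (needs a re-transformation / smearing step
# to recover mean costs on the original scale).
ols_log = sm.OLS(np.log(costs), X).fit()

# Gamma GLM with log link models the mean cost directly.
gamma_glm = sm.GLM(costs, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()

print(ols_log.params)       # intercept and covariate effect on the log scale
print(gamma_glm.params)     # interpretable as log mean-cost ratios
```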

Journal ArticleDOI
TL;DR: Meta-analyses often include studies that report multiple effect sizes based on a common pool of subjects, or that report effect sizes from several samples treated in very similar ways, which is a common source of dependence in meta-analyses.
Abstract: Meta-analyses often include studies that report multiple effect sizes based on a common pool of subjects or that report effect sizes from several samples that were treated with very similar researc...

Journal ArticleDOI
TL;DR: The Fragility Index is a novel metric for gauging the robustness of statistically significant results from RCTs of spine surgery interventions; a small Fragility Index indicates that the statistical significance of a result hinges on only a few events, whereas a large Fragility Index increases one's confidence in the observed treatment effects.
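
Only the TL;DR is shown for this entry. Under the usual definition, the Fragility Index is obtained by switching non-events to events in the arm with fewer events until Fisher's exact test loses significance; the sketch below implements that definition with hypothetical 2×2 counts, not data from the study.

```python
from scipy.stats import fisher_exact

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Number of non-event -> event switches, in the arm with fewer events,
    needed to make Fisher's exact test non-significant (one common definition)."""
    table = [[events_a, n_a - events_a], [events_b, n_b - events_b]]
    _, p = fisher_exact(table)
    if p >= alpha:
        return 0                                  # result not significant to begin with
    row = 0 if events_a <= events_b else 1        # arm with fewer events
    index = 0
    while p < alpha and table[row][1] > 0:
        table[row][0] += 1                        # one more event
        table[row][1] -= 1                        # one fewer non-event
        index += 1
        _, p = fisher_exact(table)
    return index

# Hypothetical trial: 5/100 events vs 18/100 events.
print(fragility_index(5, 100, 18, 100))
```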

Journal ArticleDOI
TL;DR: The purpose of this editorial is to provide guidance to prospective authors conducting precision studies in terms of basic concepts, terminology, statistical methods, sample size considerations, study design, use of 1 eye or 2 eyes, and worked examples.
Abstract: Ophthalmology is a technologically advancing field with the constant production of new instrumentation. Instruments are usually released on the market with manufacturers claiming performance in various data formats, and this is typically followed by clinical evaluation studies by independent users. Ubiquitous issues in this area include, not least, the variety of methods, both practically and statistically, and the various terminology differences. It is becoming exceedingly difficult for clinicians and researchers to make sense of study results that have used nonstandard methodology in addition to confounding terminology. Terms such as consistency, precision, reliability, accuracy, repeatability, reproducibility, and agreement are just some examples of the many terms that frequently appear in the ophthalmic literature with inconsistent definitions and synonymous usage. The purpose of this editorial is to provide guidance to prospective authors conducting precision (repeatability and reproducibility) studies in terms of basic concepts, terminology, statistical methods, sample size considerations, study design, use of 1 eye or 2 eyes, and worked examples. A number of bodies in the scientific world have devised their own terminology and statistical recommendations, some of which are similar and others markedly different. An example is the International Organization for Standardization (ISO), an independent nongovernmental membership organization and the world's largest developer of voluntary International Standards. ISO has published more than 20 500 International Standards covering almost every industry, including ophthalmology. Examples include standards relating to contact lenses, contact lens care products, intraocular lenses, intraocular implants, spectacle lenses, and ophthalmic instruments. The ISO standards might differ markedly even within ophthalmic instruments, an example being differing statistical methods for tonometry and topography. A full list is available on the ISO website (www.iso.org). Such ISO standards are developed by following a formal methodology involving a multi-stakeholder process. They are organic and dynamic, and developments might take years. This editorial is not set to replace any of the ISO standards or suggest that a single approach should be applied to every situation but rather provides a practical commentary on precision that draws

Proceedings ArticleDOI
10 Aug 2015
TL;DR: This paper proposes a novel multi-task learning framework which aims to concurrently address all the challenges of spatial event forecasting from social media by extracting and utilizing appropriate shared information that effectively increases the sample size for each location, thus improving the forecasting performance.
Abstract: Spatial event forecasting from social media is an important problem but encounters critical challenges, such as dynamic patterns of features (keywords) and geographic heterogeneity (e.g., spatial correlations, imbalanced samples, and different populations in different locations). Most existing approaches (e.g., LASSO regression, dynamic query expansion, and burst detection) are designed to address some of these challenges, but not all of them. This paper proposes a novel multi-task learning framework which aims to concurrently address all the challenges. Specifically, given a collection of locations (e.g., cities), we propose to build forecasting models for all locations simultaneously by extracting and utilizing appropriate shared information that effectively increases the sample size for each location, thus improving the forecasting performance. We combine both static features derived from a predefined vocabulary by domain experts and dynamic features generated from dynamic query expansion in a multi-task feature learning framework; we investigate different strategies to balance homogeneity and diversity between static and dynamic terms. Efficient algorithms based on Iterative Group Hard Thresholding are developed to achieve efficient and effective model training and prediction. Extensive experimental evaluations on Twitter data from four different countries in Latin America demonstrated the effectiveness of our proposed approach.

Journal ArticleDOI
TL;DR: It is demonstrated not only that bootstrapping has insufficient statistical power to provide a rigorous hypothesis test in most conditions but also that bootstrapping has a tendency to exhibit an inflated Type I error rate.
Abstract: Bootstrapping is an analytical tool commonly used in psychology to test the statistical significance of the indirect effect in mediation models. Bootstrapping proponents have particularly advocated for its use for samples of 20-80 cases. This advocacy has been heeded, especially in the Journal of Applied Psychology, as researchers are increasingly utilizing bootstrapping to test mediation with samples in this range. We discuss reasons to be concerned with this escalation, and in a simulation study focused specifically on this range of sample sizes, we demonstrate not only that bootstrapping has insufficient statistical power to provide a rigorous hypothesis test in most conditions but also that bootstrapping has a tendency to exhibit an inflated Type I error rate. We then extend our simulations to investigate an alternative empirical resampling method as well as a Bayesian approach and demonstrate that they exhibit comparable statistical power to bootstrapping in small samples without the associated inflated Type I error. Implications for researchers testing mediation hypotheses in small samples are presented. For researchers wishing to use these methods in their own research, we have provided R syntax in the online supplemental materials.
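
For reference, a compact sketch of the percentile bootstrap for the indirect effect a·b in a simple mediation model, the procedure whose small-sample behaviour the study examines. The simulated data and sample size are hypothetical, and the alternative resampling and Bayesian estimators compared in the paper are not shown.

```python
import numpy as np

def indirect_effect(x, m, y):
    """a*b from the two mediation regressions: M ~ X and Y ~ X + M."""
    a = np.polyfit(x, m, 1)[0]                                  # slope of M on X
    X = np.column_stack([np.ones_like(x), x, m])
    b = np.linalg.lstsq(X, y, rcond=None)[0][2]                 # slope of Y on M, controlling X
    return a * b

def percentile_bootstrap_ci(x, m, y, n_boot=5000, level=0.95, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)                             # resample cases with replacement
        estimates[i] = indirect_effect(x[idx], m[idx], y[idx])
    lo, hi = np.percentile(estimates, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return lo, hi

# Hypothetical small-sample example (n = 40) with a true indirect effect.
rng = np.random.default_rng(3)
n = 40
x = rng.normal(size=n)
m = 0.4 * x + rng.normal(size=n)
y = 0.4 * m + rng.normal(size=n)
print(indirect_effect(x, m, y), percentile_bootstrap_ci(x, m, y))
```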

Journal ArticleDOI
TL;DR: The modified Knapp-Hartung method (mKH) as discussed by the authors applies an ad hoc correction and has been proposed to prevent counterintuitive effects and to yield more conservative inference.
Abstract: Random-effects meta-analysis is commonly performed by first deriving an estimate of the between-study variation, the heterogeneity, and subsequently using this as the basis for combining results, i.e., for estimating the effect, the figure of primary interest. The heterogeneity variance estimate however is commonly associated with substantial uncertainty, especially in contexts where there are only few studies available, such as in small populations and rare diseases. Confidence intervals and tests for the effect may be constructed via a simple normal approximation, or via a Student-t distribution, using the Hartung-Knapp-Sidik-Jonkman (HKSJ) approach, which additionally uses a refined estimator of variance of the effect estimator. The modified Knapp-Hartung method (mKH) applies an ad hoc correction and has been proposed to prevent counterintuitive effects and to yield more conservative inference. We performed a simulation study to investigate the behaviour of the standard HKSJ and modified mKH procedures in a range of circumstances, with a focus on the common case of meta-analysis based on only a few studies. The standard HKSJ procedure works well when the treatment effect estimates to be combined are of comparable precision, but nominal error levels are exceeded when standard errors vary considerably between studies (e.g. due to variations in study size). Application of the modification on the other hand yields more conservative results with error rates closer to the nominal level. Differences are most pronounced in the common case of few studies of varying size or precision. Use of the modified mKH procedure is recommended, especially when only a few studies contribute to the meta-analysis and the involved studies’ precisions (standard errors) vary.
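
Below is a sketch of the standard HKSJ computation for a handful of studies, using the DerSimonian-Laird heterogeneity estimate and a t-based interval with k − 1 degrees of freedom. The modified mKH procedure recommended in the paper applies an additional ad hoc correction that is not implemented here, and the example effect sizes are hypothetical.

```python
import numpy as np
from scipy.stats import t

def hksj_meta(effects, std_errors, level=0.95):
    """Random-effects meta-analysis with DerSimonian-Laird tau^2 and the
    Hartung-Knapp-Sidik-Jonkman (HKSJ) variance and t-based interval."""
    y = np.asarray(effects, dtype=float)
    v = np.asarray(std_errors, dtype=float) ** 2
    k = len(y)

    # DerSimonian-Laird between-study variance
    w_fixed = 1 / v
    mu_fixed = np.sum(w_fixed * y) / np.sum(w_fixed)
    q = np.sum(w_fixed * (y - mu_fixed) ** 2)
    c = np.sum(w_fixed) - np.sum(w_fixed ** 2) / np.sum(w_fixed)
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects estimate with HKSJ variance
    w = 1 / (v + tau2)
    mu = np.sum(w * y) / np.sum(w)
    var_hksj = np.sum(w * (y - mu) ** 2) / ((k - 1) * np.sum(w))
    half_width = t.ppf(1 - (1 - level) / 2, df=k - 1) * np.sqrt(var_hksj)
    return mu, (mu - half_width, mu + half_width)

# Hypothetical example: four small studies of varying precision.
effects = [0.30, 0.10, 0.55, 0.20]
ses = [0.15, 0.20, 0.35, 0.12]
print(hksj_meta(effects, ses))
```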

Journal ArticleDOI
TL;DR: This article describes and compares approximate and exact confidence intervals that are – with one exception – easy to calculate or available in common software packages and makes recommendations for both small and moderate-to-large sample sizes.
Abstract: The relationship between two independent binomial proportions is commonly estimated and presented using the difference between proportions, the number needed to treat, the ratio of proportions or the odds ratio. Several different confidence intervals are available, but they can produce markedly different results. Some of the traditional approaches, such as the Wald interval for the difference between proportions and the Katz log interval for the ratio of proportions, do not perform well unless the sample size is large. Better intervals are available. This article describes and compares approximate and exact confidence intervals that are – with one exception – easy to calculate or available in common software packages. We illustrate the performances of the intervals and make recommendations for both small and moderate-to-large sample sizes.
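
Two of the traditional intervals named above can be written in a few lines: the Wald interval for the difference between proportions and the Katz log interval for their ratio. The sketch below is for orientation only (hypothetical counts); as the article notes, better-performing intervals exist, especially for small samples.

```python
import math
from scipy.stats import norm

def wald_difference_ci(x1, n1, x2, n2, level=0.95):
    """Wald interval for p1 - p2 (known to perform poorly in small samples)."""
    p1, p2 = x1 / n1, x2 / n2
    z = norm.ppf(1 - (1 - level) / 2)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - z * se, d + z * se

def katz_log_ratio_ci(x1, n1, x2, n2, level=0.95):
    """Katz log interval for the ratio p1 / p2."""
    z = norm.ppf(1 - (1 - level) / 2)
    log_rr = math.log((x1 / n1) / (x2 / n2))
    se = math.sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)
    return math.exp(log_rr - z * se), math.exp(log_rr + z * se)

# Hypothetical 2x2 data: 12/80 events vs 5/75 events.
print(wald_difference_ci(12, 80, 5, 75))
print(katz_log_ratio_ci(12, 80, 5, 75))
```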

Posted ContentDOI
02 Mar 2015-bioRxiv
TL;DR: A new method is introduced, LDpred, which infers the posterior mean causal effect size of each marker using a prior on effect sizes and LD information from an external reference panel, and shows that LDpred outperforms the pruning/thresholding approach, particularly at large sample sizes.
Abstract: Polygenic risk scores have shown great promise in predicting complex disease risk, and will become more accurate as training sample sizes increase. The standard approach for calculating risk scores involves LD-pruning markers and applying a P-value threshold to association statistics, but this discards information and may reduce predictive accuracy. We introduce a new method, LDpred, which infers the posterior mean causal effect size of each marker using a prior on effect sizes and LD information from an external reference panel. Theory and simulations show that LDpred outperforms the pruning/thresholding approach, particularly at large sample sizes. Accordingly, prediction R2 increased from 20.1% to 25.3% in a large schizophrenia data set and from 9.8% to 12.0% in a large multiple sclerosis data set. A similar relative improvement in accuracy was observed for three additional large disease data sets and when predicting in non-European schizophrenia samples. The advantage of LDpred over existing methods will grow as sample sizes increase.

Journal ArticleDOI
TL;DR: A shorter duration of untreated depression is associated with more favorable outcomes for major depression, including depression-related disability, which might have direct implications for both primary and secondary prevention.

Journal ArticleDOI
17 Aug 2015-Trials
TL;DR: It is found that usually the SWT is relatively insensitive to variations in the intracluster correlation, and that failure to account for a potential time effect will artificially and grossly overestimate the power of a study.
Abstract: Stepped wedge trials (SWTs) can be considered as a variant of a clustered randomised trial, although in many ways they embed additional complications from the point of view of statistical design and analysis. While the literature is rich for standard parallel or clustered randomised clinical trials (CRTs), it is much less so for SWTs. The specific features of SWTs need to be addressed properly in the sample size calculations to ensure valid estimates of the intervention effect. We critically review the available literature on analytical methods to perform sample size and power calculations in a SWT. In particular, we highlight the specific assumptions underlying currently used methods and comment on their validity and potential for extensions. Finally, we propose the use of simulation-based methods to overcome some of the limitations of analytical formulae. We performed a simulation exercise in which we compared simulation-based sample size computations with analytical methods and assessed the impact of varying the basic parameters to the resulting sample size/power, in the case of continuous and binary outcomes and assuming both cross-sectional data and the closed cohort design. We compared the sample size requirements for a SWT in comparison to CRTs based on comparable number of measurements in each cluster. In line with the existing literature, we found that when the level of correlation within the clusters is relatively high (for example, greater than 0.1), the SWT requires a smaller number of clusters. For low values of the intracluster correlation, the two designs produce more similar requirements in terms of total number of clusters. We validated our simulation-based approach and compared the results of sample size calculations to analytical methods; the simulation-based procedures perform well, producing results that are extremely similar to the analytical methods. We found that usually the SWT is relatively insensitive to variations in the intracluster correlation, and that failure to account for a potential time effect will artificially and grossly overestimate the power of a study. We provide a framework for handling the sample size and power calculations of a SWT and suggest that simulation-based procedures may be more effective, especially in dealing with the specific features of the study at hand. In selected situations and depending on the level of intracluster correlation and the cluster size, SWTs may be more efficient than comparable CRTs. However, the decision about the design to be implemented will be based on a wide range of considerations, including the cost associated with the number of clusters, number of measurements and the trial duration.

Journal ArticleDOI
TL;DR: The results suggest that the GEE Wald z-test should be avoided in the analyses of CRTs with few clusters even when bias-corrected sandwich estimators are used, and a formula is derived to calculate the power and minimum total number of clusters one needs using the t-test and KC-correction for the CRTs with binary outcomes.
Abstract: The sandwich estimator in generalized estimating equations (GEE) approach underestimates the true variance in small samples and consequently results in inflated type I error rates in hypothesis testing. This fact limits the application of the GEE in cluster-randomized trials (CRTs) with few clusters. Under various CRT scenarios with correlated binary outcomes, we evaluate the small sample properties of the GEE Wald tests using bias-corrected sandwich estimators. Our results suggest that the GEE Wald z-test should be avoided in the analyses of CRTs with few clusters even when bias-corrected sandwich estimators are used. With t-distribution approximation, the Kauermann and Carroll (KC)-correction can keep the test size to nominal levels even when the number of clusters is as low as 10 and is robust to the moderate variation of the cluster sizes. However, in cases with large variations in cluster sizes, the Fay and Graubard (FG)-correction should be used instead. Furthermore, we derive a formula to calculate the power and minimum total number of clusters one needs using the t-test and KC-correction for the CRTs with binary outcomes. The power levels as predicted by the proposed formula agree well with the empirical powers from the simulations. The proposed methods are illustrated using real CRT data. We conclude that with appropriate control of type I error rates under small sample sizes, we recommend the use of GEE approach in CRTs with binary outcomes because of fewer assumptions and robustness to the misspecification of the covariance structure.

Journal ArticleDOI
TL;DR: In this paper, a non-parametric test based on the principles of the Kolmogorov-Smirnov (KS) test, referred to as the KS Predictive Accuracy (KSPA) test, is proposed.
Abstract: This paper introduces a complementary statistical test for distinguishing between the predictive accuracy of two sets of forecasts. We propose a non-parametric test founded upon the principles of the Kolmogorov-Smirnov (KS) test, referred to as the KS Predictive Accuracy (KSPA) test. The KSPA test is able to serve two distinct purposes. Initially, the test seeks to determine whether there exists a statistically significant difference between the distributions of forecast errors, and secondly it exploits the principles of stochastic dominance to determine whether the forecasts with the lower error also report a stochastically smaller error than forecasts from a competing model, thereby enabling a distinction between the predictive accuracy of forecasts. We perform a simulation study for the size and power of the proposed test and report the results for different noise distributions, sample sizes and forecasting horizons. The simulation results indicate that the KSPA test is correctly sized, and robust in the face of varying forecasting horizons and sample sizes, along with significant accuracy gains reported especially in the case of small sample sizes. Real world applications are also considered to illustrate the applicability of the proposed KSPA test in practice.
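
The two stages described map naturally onto two-sample Kolmogorov-Smirnov tests applied to forecast errors. A rough sketch of that idea with simulated forecasts follows; the error transformation, one-sided formulation, and other details may differ from the authors' implementation.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
actual = rng.normal(size=60)
forecast_a = actual + rng.normal(scale=0.8, size=60)   # hypothetical model A forecasts
forecast_b = actual + rng.normal(scale=1.2, size=60)   # hypothetical model B forecasts

err_a = np.abs(actual - forecast_a)
err_b = np.abs(actual - forecast_b)

# Stage 1: do the two error distributions differ at all?
two_sided = ks_2samp(err_a, err_b)

# Stage 2: are model A's errors stochastically smaller than model B's?
# (alternative='greater' means the empirical CDF of the first sample lies above
#  the second, i.e. its errors tend to be smaller.)
one_sided = ks_2samp(err_a, err_b, alternative='greater')

print(f"two-sided p = {two_sided.pvalue:.3f}, one-sided p = {one_sided.pvalue:.3f}")
```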

Journal ArticleDOI
TL;DR: Whether more or less data are required according to the analytic strategy used is asked, and if the basic analytic operation is the development of categories or themes, do these operations require differences in sample size, with most other things in the design being equal?
Abstract: But there is more: It also depends on the investigator— how theoretically smart, how well these data are theoretically sampled and verified, how well funded, how much time allotted, and how patient she is and how hard he thinks. With too few participants or too little data, analysis is more difficult as patterns are more difficult to identify. With too little data, replication may not occur: Variation is scattered, and important features in these data may be missing or overlooked. So with a small sample, what have you got? Not very much interesting to write about. Previously, I discussed those design features that contribute to sample size (Morse, 1991, 2000). But less discussed in the literature are influences of the types of analytic strategies that are used. Guest, Bunce, and Johnson (2006) conducted an analysis of semi-structured interviews to see at what point saturation was obtained. These interviews appeared to be conducted with respect to a reasonably objective and stable phenomenon, which, of course, would make replication much easier to obtain and their study less useful to those researchers who are interested in more abstract and complex phenomena. Here I will try and extend the discussion even further, and ask whether more or less data are required according to the analytic strategy used. That is, if the basic analytic operation is the development of categories or themes, do these operations require differences in sample size, with most other things in the design being equal? Developing categories and themes are two different cognitive and mechanical operations used in analysis (Morse, 2008). Each of these will be discussed below.

Journal ArticleDOI
TL;DR: Extensive Monte Carlo simulations demonstrate that the methods used in this paper have desirable finite-sample properties and outperform previous proposals.