
Showing papers on "Sample size determination" published in 2017


01 Dec 2017
TL;DR: This article describes the importance and procedure of determining sample size for continuous and categorical variables using Cochran's (1977) formula, and illustrates the use of sample size formulas, including Cochran's (1977) correction for cases where the sample size exceeds 5% of the population.
Abstract: Sample size determination is often an important step and decision that educational and organizational researchers face. The quality and precision of research are influenced by inadequate, excessive or inappropriate sample sizes. Selecting the sample size for a study requires a compromise between the need for statistical power, economy and timeliness. There is a temptation for researchers to take short cuts. The paper describes the importance and procedure of determining sample size for continuous and categorical variables using Cochran’s (1977) formula. The paper illustrates the use of sample size formulas, including Cochran’s (1977) correction formula for cases where the sample size exceeds 5% of the population. Tables are included to help researchers determine the sample size for a research problem based on three alpha levels and a set of standard error rates for categorical and continuous data. Procedures for determining the appropriate sample size for multiple regression, factor analysis and structural equation modeling are discussed. Common issues in sample size determination are examined. Non-respondent sampling issues are also addressed.
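As a concrete illustration of the formulas discussed above (not code from the paper), here is a minimal Python sketch of Cochran's sample size calculations with the finite population correction applied once n0 exceeds 5% of the population; the 95% confidence level, p = 0.5, and population of 1,500 are illustrative assumptions, and the correction shown is the common n0 / (1 + n0/N) form.

```python
import math

def cochran_categorical(p=0.5, margin=0.05, z=1.96):
    """Cochran's sample size for a proportion: n0 = z^2 * p * (1 - p) / e^2."""
    return z ** 2 * p * (1 - p) / margin ** 2

def cochran_continuous(sd, margin, z=1.96):
    """Cochran's sample size for a mean: n0 = z^2 * s^2 / e^2."""
    return z ** 2 * sd ** 2 / margin ** 2

def finite_population_correction(n0, population):
    """Apply n = n0 / (1 + n0 / N) when n0 exceeds 5% of the population N."""
    if n0 / population > 0.05:
        return n0 / (1 + n0 / population)
    return n0

# Example: categorical item, 5% margin of error, alpha = .05, population of 1,500
n0 = cochran_categorical(p=0.5, margin=0.05)
print(math.ceil(n0))                                        # 385 uncorrected
print(math.ceil(finite_population_correction(n0, 1500)))    # about 306 after correction
```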

3,519 citations


Journal ArticleDOI
TL;DR: Results of simulations show that the two most common methods for evaluating significance, using likelihood ratio tests and applying the z distribution to the Wald t values from the model output (t-as-z), are somewhat anti-conservative, especially for smaller sample sizes.
Abstract: Mixed-effects models are being used ever more frequently in the analysis of experimental data. However, in the lme4 package in R the standards for evaluating significance of fixed effects in these models (i.e., obtaining p-values) are somewhat vague. There are good reasons for this, but as researchers who are using these models are required in many cases to report p-values, some method for evaluating the significance of the model output is needed. This paper reports the results of simulations showing that the two most common methods for evaluating significance, using likelihood ratio tests and applying the z distribution to the Wald t values from the model output (t-as-z), are somewhat anti-conservative, especially for smaller sample sizes. Other methods for evaluating significance, including parametric bootstrapping and the Kenward-Roger and Satterthwaite approximations for degrees of freedom, were also evaluated. The results of these simulations suggest that Type 1 error rates are closest to .05 when models are fitted using REML and p-values are derived using the Kenward-Roger or Satterthwaite approximations, as these approximations both produced acceptable Type 1 error rates even for smaller samples.

1,045 citations


Journal ArticleDOI
TL;DR: Trial Sequential Analysis represents analysis of meta-analytic data, with transparent assumptions, and better control of type I and type II errors than the traditional meta-analysis using naïve unadjusted confidence intervals.
Abstract: Most meta-analyses in systematic reviews, including Cochrane ones, do not have sufficient statistical power to detect or refute even large intervention effects. This is why a meta-analysis ought to be regarded as an interim analysis on its way towards a required information size. The results of the meta-analyses should relate the total number of randomised participants to the estimated required meta-analytic information size accounting for statistical diversity. When the number of participants and the corresponding number of trials in a meta-analysis are insufficient, the use of the traditional 95% confidence interval or the 5% statistical significance threshold will lead to too many false positive conclusions (type I errors) and too many false negative conclusions (type II errors). We developed a methodology for interpreting meta-analysis results, using generally accepted, valid evidence on how to adjust thresholds for significance in randomised clinical trials when the required sample size has not been reached. The Lan-DeMets trial sequential monitoring boundaries in Trial Sequential Analysis offer adjusted confidence intervals and restricted thresholds for statistical significance when the diversity-adjusted required information size and the corresponding number of required trials for the meta-analysis have not been reached. Trial Sequential Analysis provides a frequentistic approach to control both type I and type II errors. We define the required information size and the corresponding number of required trials in a meta-analysis and the diversity (D2) measure of heterogeneity. We explain the reasons for using Trial Sequential Analysis of meta-analysis when the actual information size fails to reach the required information size. We present examples drawn from traditional meta-analyses using unadjusted naive 95% confidence intervals and 5% thresholds for statistical significance. Spurious conclusions in systematic reviews with traditional meta-analyses can be reduced using Trial Sequential Analysis. Several empirical studies have demonstrated that the Trial Sequential Analysis provides better control of type I errors and of type II errors than the traditional naive meta-analysis. Trial Sequential Analysis represents analysis of meta-analytic data, with transparent assumptions, and better control of type I and type II errors than the traditional meta-analysis using naive unadjusted confidence intervals.

627 citations


Journal ArticleDOI
TL;DR: A new method and convenient tools for determining sample size and power in mediation models are proposed and demonstrated and will allow researchers to quickly and easily determine power and sample size for simple and complex mediation models.
Abstract: Mediation analyses abound in social and personality psychology. Current recommendations for assessing power and sample size in mediation models include using a Monte Carlo power analysis simulation and testing the indirect effect with a bootstrapped confidence interval. Unfortunately, these methods have rarely been adopted by researchers due to limited software options and the computational time needed. We propose a new method and convenient tools for determining sample size and power in mediation models. We demonstrate our new method through an easy-to-use application that implements the method. These developments will allow researchers to quickly and easily determine power and sample size for simple and complex mediation models.
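The following is a minimal sketch of a Monte Carlo power analysis for a simple X → M → Y mediation model, not the authors' application: each simulated dataset is analysed with ordinary least squares, a Monte Carlo confidence interval for the indirect effect a*b is formed from the estimated paths and their standard errors, and power is the proportion of intervals excluding zero. Path coefficients, sample size, and simulation counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def ols(X, y):
    """OLS fit returning coefficients and their standard errors (intercept first)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

def power_indirect(n, a=0.39, b=0.39, c=0.0, n_sims=500, n_mc=5000, alpha=0.05):
    """Proportion of simulated samples whose Monte Carlo CI for a*b excludes zero."""
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(size=n)
        m = a * x + rng.normal(size=n)                    # mediator
        y = b * m + c * x + rng.normal(size=n)            # outcome
        beta_a, se_a = ols(x, m)                          # a path: M ~ X
        beta_b, se_b = ols(np.column_stack([m, x]), y)    # b path: Y ~ M + X
        # Monte Carlo CI: draw a and b from their estimated sampling distributions
        draws = (rng.normal(beta_a[1], se_a[1], n_mc) *
                 rng.normal(beta_b[1], se_b[1], n_mc))
        lo, hi = np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        hits += (lo > 0) or (hi < 0)
    return hits / n_sims

print(power_indirect(n=100))   # estimated power for the assumed paths at n = 100
```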

615 citations


Journal ArticleDOI
TL;DR: In light of the findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience, and false report probability is likely to exceed 50% for the whole literature.
Abstract: We have empirically assessed the distribution of published effect sizes and estimated power by analyzing 26,841 statistical records from 3,801 cognitive neuroscience and psychology papers published recently. The reported median effect size was D = 0.93 (interquartile range: 0.64–1.46) for nominally statistically significant results and D = 0.24 (0.11–0.42) for nonsignificant results. Median power to detect small, medium, and large effects was 0.12, 0.44, and 0.73, reflecting no improvement through the past half-century. This is so because sample sizes have remained small. Assuming similar true effect sizes in both disciplines, power was lower in cognitive neuroscience than in psychology. Journal impact factors negatively correlated with power. Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.
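For context on figures like the reported medians, a short sketch of how power for small, medium, and large standardized effects can be computed for a two-sample t-test; the per-group n of 20 is an assumed, illustrative value rather than anything taken from the paper.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    power = analysis.power(effect_size=d, nobs1=20, ratio=1.0, alpha=0.05)
    print(f"{label:6s} effect (d = {d}): power = {power:.2f} at n = 20 per group")
```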

475 citations


Journal ArticleDOI
TL;DR: In this article, the authors raise awareness of the error bars of cross-validation, which are often underestimated, and propose solutions to increase sample size, tackling possible increases in heterogeneity of the data.

408 citations


Journal ArticleDOI
08 Jun 2017-PLOS ONE
TL;DR: In this article, exploratory graph analysis (EGA) was used to estimate the number of dimensions in the four-factor structure when the correlation between factors was .7, showing an accuracy of 100% for a sample size of 5,000 observations.
Abstract: The estimation of the correct number of dimensions is a long-standing problem in psychometrics. Several methods have been proposed, such as parallel analysis (PA), Kaiser-Guttman’s eigenvalue-greater-than-one rule, the multiple average partial procedure (MAP), the maximum-likelihood approaches that use fit indexes such as BIC and EBIC, and the less used and studied approach called very simple structure (VSS). In the present paper a new approach to estimate the number of dimensions is introduced and compared via simulation to the traditional techniques mentioned above. The approach proposed in the current paper is called exploratory graph analysis (EGA), since it is based on the graphical lasso with the regularization parameter specified using EBIC. The number of dimensions is verified using the walktrap, a random walk algorithm used to identify communities in networks. In total, 32,000 data sets were simulated to fit known factor structures, with the data sets varying across different criteria: number of factors (2 and 4), number of items (5 and 10), sample size (100, 500, 1000 and 5000) and correlation between factors (orthogonal, .20, .50 and .70), resulting in 64 different conditions. For each condition, 500 data sets were simulated using lavaan. The results show that EGA performs comparably to parallel analysis, BIC, EBIC and the Kaiser-Guttman rule in a number of situations, especially when the number of factors was two. However, EGA was the only technique able to correctly estimate the number of dimensions in the four-factor structure when the correlation between factors was .7, showing an accuracy of 100% for a sample size of 5,000 observations. Finally, EGA was used to estimate the number of factors in a real dataset, in order to compare its performance with the other six techniques tested in the simulation study.

337 citations


Journal ArticleDOI
TL;DR: This contribution investigates the properties of a procedure for Bayesian hypothesis testing that allows optional stopping with unlimited multiple testing, even after each participant, and investigates the long-term rate of misleading evidence, the average expected sample sizes, and the biasedness of effect size estimates when an SBF design is applied to a test of mean differences between 2 groups.
Abstract: Unplanned optional stopping rules have been criticized for inflating Type I error rates under the null hypothesis significance testing (NHST) paradigm. Despite these criticisms, this research practice is not uncommon, probably because it appeals to researchers' intuition to collect more data to push an indecisive result into a decisive region. In this contribution, we investigate the properties of a procedure for Bayesian hypothesis testing that allows optional stopping with unlimited multiple testing, even after each participant. In this procedure, which we call Sequential Bayes Factors (SBFs), Bayes factors are computed until an a priori defined level of evidence is reached. This allows flexible sampling plans and is not dependent upon correct effect size guesses in an a priori power analysis. We investigated the long-term rate of misleading evidence, the average expected sample sizes, and the biasedness of effect size estimates when an SBF design is applied to a test of mean differences between 2 groups. Compared with optimal NHST, the SBF design typically needs 50% to 70% smaller samples to reach a conclusion about the presence of an effect, while having the same or lower long-term rate of wrong inference.
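A rough sketch of the sequential idea, not the authors' exact procedure: here the Bayes factor for a two-group mean difference is approximated from BIC (a cruder default than the Bayes factors used in the paper), and sampling stops once a symmetric evidence threshold is crossed. The true effect size, thresholds, and minimum/maximum sample sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def bf10_bic(x, y):
    """BIC-approximated Bayes factor for a two-group difference in means."""
    data = np.concatenate([x, y])
    n = len(data)
    rss0 = np.sum((data - data.mean()) ** 2)                          # one common mean
    rss1 = np.sum((x - x.mean()) ** 2) + np.sum((y - y.mean()) ** 2)  # separate means
    delta_bic = n * np.log(rss0 / rss1) - np.log(n)                   # BIC0 - BIC1
    return np.exp(delta_bic / 2)

def sbf_run(true_d=0.5, threshold=10, n_min=10, n_max=500):
    """Add one observation per group at a time until BF10 crosses the threshold."""
    x = list(rng.normal(0.0, 1.0, n_min))
    y = list(rng.normal(true_d, 1.0, n_min))
    while True:
        bf = bf10_bic(np.array(x), np.array(y))
        if bf > threshold or bf < 1 / threshold or len(x) >= n_max:
            return len(x), bf
        x.append(rng.normal(0.0, 1.0))
        y.append(rng.normal(true_d, 1.0))

print(sbf_run())   # (per-group sample size at stopping, Bayes factor at stopping)
```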

327 citations


Posted Content
TL;DR: In this article, the authors raise awareness of the error bars of cross-validation, which are often underestimated, and propose solutions to increase sample size, tackling possible increases in heterogeneity of the data.
Abstract: Predictive models ground many state-of-the-art developments in statistical brain image analysis: decoding, MVPA, searchlight, or extraction of biomarkers. The principled approach to establish their validity and usefulness is cross-validation, testing prediction on unseen data. Here, I would like to raise awareness of error bars of cross-validation, which are often underestimated. Simple experiments show that sample sizes of many neuroimaging studies inherently lead to large error bars, e.g., ±10% for 100 samples. The standard error across folds strongly underestimates them. These large error bars compromise the reliability of conclusions drawn with predictive models, such as biomarkers or methods developments where, unlike with cognitive neuroimaging MVPA approaches, more samples cannot be acquired by repeating the experiment across many subjects. Solutions to increase sample size must be investigated, tackling possible increases in heterogeneity of the data.
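A small simulation in the spirit of the argument above, using synthetic data as a stand-in: many "studies" of n = 100 are drawn from one large pool, each is cross-validated, and the spread of the resulting accuracy estimates is compared with the typical standard error computed across folds within a single split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# One large pool; each "study" is a draw of 100 samples from it.
X_pool, y_pool = make_classification(n_samples=20000, n_features=20,
                                     n_informative=5, random_state=0)
clf = LogisticRegression(max_iter=1000)

cv_estimates, fold_ses = [], []
for _ in range(200):
    idx = rng.choice(len(y_pool), size=100, replace=False)
    scores = cross_val_score(clf, X_pool[idx], y_pool[idx], cv=5)
    cv_estimates.append(scores.mean())                          # the study's reported accuracy
    fold_ses.append(scores.std(ddof=1) / np.sqrt(len(scores)))  # its "SE across folds"

cv_estimates = np.array(cv_estimates)
print("2.5th-97.5th percentile of CV accuracy across n=100 studies:",
      np.percentile(cv_estimates, [2.5, 97.5]).round(3))
print("typical standard error across folds:", round(float(np.mean(fold_ses)), 3))
```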

323 citations


Journal ArticleDOI
TL;DR: Evaluating the predictive ability of history questions, self-report measures, and performance-based measures for assessing fall risk of community-dwelling older adults by calculating and comparing posttest probability (PoTP) values for individual test/measures found no single test/measure demonstrated strong PoTP values.
Abstract: BACKGROUND: Falls and their consequences are significant concerns for older adults, caregivers, and health care providers. Identification of fall risk is crucial for appropriate referral to preventive interventions. Falls are multifactorial; no single measure is an accurate diagnostic tool. There is limited information on which history question, self-report measure, or performance-based measure, or combination of measures, best predicts future falls. Purpose: First, to evaluate the predictive ability of history questions, self-report measures, and performance-based measures for assessing fall risk of community-dwelling older adults by calculating and comparing posttest probability (PoTP) values for individual test/measures. Second, to evaluate usefulness of cumulative PoTP for measures in combination. Data Sources: To be included, a study must have used fall status as an outcome or classification variable, have a sample size of at least 30 ambulatory community-living older adults (>=65 years), and track falls occurrence for a minimum of 6 months. Studies in acute or long-term care settings, as well as those including participants with significant cognitive or neuromuscular conditions related to increased fall risk, were excluded. Searches of Medline/PubMED and Cumulative Index of Nursing and Allied Health (CINAHL) from January 1990 through September 2013 identified 2294 abstracts concerned with fall risk assessment in community-dwelling older adults. Study Selection: Because the number of prospective studies of fall risk assessment was limited, retrospective studies that classified participants (fallers/nonfallers) were also included. Ninety-five full-text articles met inclusion criteria; 59 contained necessary data for calculation of PoTP. The Quality Assessment Tool for Diagnostic Accuracy Studies (QUADAS) was used to assess each study's methodological quality. Data Extraction: Study design and QUADAS score determined the level of evidence. Data for calculation of sensitivity (Sn), specificity (Sp), likelihood ratios (LR), and PoTP values were available for 21 of 46 measures used as search terms. An additional 73 history questions, self-report measures, and performance-based measures were used in included articles; PoTP values could be calculated for 35. Data Synthesis: Evidence tables including PoTP values were constructed for 15 history questions, 15 self-report measures, and 26 performance-based measures. Recommendations for clinical practice were based on consensus. Limitations: Variations in study quality, procedures, and statistical analyses challenged data extraction, interpretation, and synthesis. There was insufficient data for calculation of PoTP values for 63 of 119 tests. Conclusions: No single test/measure demonstrated strong PoTP values. Five history questions, 2 self-report measures, and 5 performance-based measures may have clinical usefulness in assessing risk of falling on the basis of cumulative PoTP. Berg Balance Scale score, Timed Up and Go time (>=12 seconds), and 5 times sit-to-stand time (>=12 seconds) are currently the most evidence-supported functional measures to determine individual risk of future falls. Shortfalls identified during review will direct researchers to address knowledge gaps.

320 citations


01 Jan 2017
TL;DR: This review paper aimed to present several tables that could illustrate the minimum sample sizes required for estimating the desired effect size of ICC, which is a measurement of the magnitude of an agreement.
Abstract: The intraclass correlation coefficient (ICC) measures the extent of agreement and consistency among raters for two or more numerical or quantitative variables. This review paper aimed to present several tables that illustrate the minimum sample sizes required for estimating the desired effect size of the ICC, which is a measurement of the magnitude of an agreement. Determination of the minimum sample size under such circumstances is based on two fundamentally important parameters, namely the actual value of the ICC and the number of observations made on each subject. The sample size calculations are derived from Power Analysis and Sample Size (PASS) software, with alpha fixed at 0.05 and the minimum required power set above 0.80. A discussion of how to use these tables for determining the sample sizes required for the various scenarios, and of the limitations associated with their use in each of these scenarios, is provided.

Journal ArticleDOI
20 Nov 2017-PLOS ONE
TL;DR: This work aimed to clarify the power problem by considering and contrasting two simulated scenarios of such possible brain-behavior correlations: weak diffuse effects and strong localized effects.
Abstract: Statistically underpowered studies can result in experimental failure even when all other experimental considerations have been addressed impeccably. In fMRI the combination of a large number of dependent variables, a relatively small number of observations (subjects), and a need to correct for multiple comparisons can decrease statistical power dramatically. This problem has been clearly addressed yet remains controversial, especially with regard to the expected effect sizes in fMRI, and especially for between-subjects effects such as group comparisons and brain-behavior correlations. We aimed to clarify the power problem by considering and contrasting two simulated scenarios of such possible brain-behavior correlations: weak diffuse effects and strong localized effects. Sampling from these scenarios shows that, particularly in the weak diffuse scenario, common sample sizes (n = 20-30) display extremely low statistical power, poorly represent the actual effects in the full sample, and show large variation on subsequent replications. Empirical data from the Human Connectome Project resembles the weak diffuse scenario much more than the localized strong scenario, which underscores the extent of the power problem for many studies. Possible solutions to the power problem include increasing the sample size, using less stringent thresholds, or focusing on a region-of-interest. However, these approaches are not always feasible and some have major drawbacks. The most prominent solutions that may help address the power problem include model-based (multivariate) prediction methods and meta-analyses with related synthesis-oriented approaches.
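As a back-of-the-envelope companion to the simulations described above, a sketch of power for a between-subjects brain-behavior correlation using the Fisher z approximation; the assumed correlation of 0.3, the sample sizes, and the Bonferroni factor of 1,000 tests are illustrative, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

def corr_power(r, n, alpha):
    """Approximate power of a two-sided test of a Pearson correlation (Fisher z)."""
    mu = np.arctanh(r) * np.sqrt(n - 3)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - mu) + norm.cdf(-z_crit - mu)

# Power for a modest brain-behavior correlation at common fMRI sample sizes,
# uncorrected and with an illustrative Bonferroni correction for 1,000 tests.
for n in (20, 30, 100):
    print(n,
          round(corr_power(r=0.3, n=n, alpha=0.05), 2),
          round(corr_power(r=0.3, n=n, alpha=0.05 / 1000), 3))
```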

01 Jan 2017
TL;DR: In this paper, the authors propose a tool to help users to decide what would be a useful sample size for their particular context when investigating patterns across participants, based on the expected population theme prevalence of the least prevalent themes.
Abstract: Thematic analysis is frequently used to analyse qualitative data in psychology, healthcare, social research and beyond. An important stage in planning a study is determining how large a sample size may be required; however, current guidelines for thematic analysis are varied, ranging from around 2 to over 400, and it is unclear how to choose a value from the space in between. Some guidance can also not be applied prospectively. This paper introduces a tool to help users think about what would be a useful sample size for their particular context when investigating patterns across participants. The calculation depends on (a) the expected population theme prevalence of the least prevalent theme, derived either from prior knowledge or based on the prevalence of the rarest themes considered worth uncovering, e.g. 1 in 10, 1 in 100; (b) the number of desired instances of the theme; and (c) the power of the study. An adequately powered study will have a high likelihood of finding sufficient themes of the desired prevalence. This calculation can then be used alongside other considerations. We illustrate how to use the method to calculate sample size before starting a study and achieved power given a sample size, providing tables of answers and code for use in the free software, R. Sample sizes are comparable to those found in the literature, for example, to have 80% power to detect two instances of a theme with 10% prevalence, 29 participants are required. Increasing power, increasing the number of instances or decreasing prevalence increases the sample size needed. We do not propose this as a ritualistic requirement for study design, but rather as a pragmatic supporting tool to help plan studies using thematic analysis.
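The power calculation described above reduces to a binomial tail probability, so the worked example (80% power to observe at least two instances of a 10%-prevalence theme needing 29 participants) can be checked directly; this sketch uses scipy rather than the authors' R code.

```python
from scipy.stats import binom

def theme_power(n, prevalence, instances):
    """P(at least `instances` of n participants express a theme with this prevalence)."""
    return binom.sf(instances - 1, n, prevalence)

def min_sample_size(prevalence, instances, power=0.80):
    """Smallest n whose power meets the target."""
    n = instances
    while theme_power(n, prevalence, instances) < power:
        n += 1
    return n

print(round(theme_power(29, 0.10, 2), 3))   # about 0.80
print(min_sample_size(0.10, 2, 0.80))       # 29
```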

Journal ArticleDOI
TL;DR: The aim of this article is to guide researchers in calculating the minimum and maximum numbers of animals required in animal research by reformulating the error DF formulas.
Abstract: Animal research plays an important role in the pre-clinical phase of clinical trials. In animal studies, the power analysis approach to sample size calculation is recommended. Whenever it is not possible to assume the standard deviation and the effect size, an alternative to the power analysis approach is the 'resource equation' approach, which sets the acceptable range of the error degrees of freedom (DF) in an analysis of variance (ANOVA). The aim of this article is to guide researchers in calculating the minimum and maximum numbers of animals required in animal research by reformulating the error DF formulas.
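A sketch of the resource equation bounds for a simple one-way (single-factor) design, where the error degrees of freedom N − k are kept between roughly 10 and 20; treating only the one-way ANOVA case is an assumption, since the cited article also reworks the formulas for more complex designs.

```python
import math

def resource_equation_bounds(k_groups, df_min=10, df_max=20):
    """Minimum and maximum animals per group that keep the error DF (N - k)
    between df_min and df_max for a one-way ANOVA with k equally sized groups."""
    n_min = math.ceil(df_min / k_groups) + 1
    n_max = math.floor(df_max / k_groups) + 1
    return n_min, n_max

for k in (2, 3, 4, 5):
    n_min, n_max = resource_equation_bounds(k)
    print(f"{k} groups: {n_min}-{n_max} animals per group "
          f"(error DF {k * n_min - k} to {k * n_max - k})")
```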

Journal ArticleDOI
TL;DR: In this article, the authors examine the effect of the number of events per variable (EPV) on the relative performance of three different methods for assessing the predictive accuracy of a logistic regression model: apparent performance in the analysis sample, split-sample validation, and optimism correction using bootstrap methods.
Abstract: We conducted an extensive set of empirical analyses to examine the effect of the number of events per variable (EPV) on the relative performance of three different methods for assessing the predictive accuracy of a logistic regression model: apparent performance in the analysis sample, split-sample validation, and optimism correction using bootstrap methods. Using a single dataset of patients hospitalized with heart failure, we compared the estimates of discriminatory performance from these methods to those for a very large independent validation sample arising from the same population. As anticipated, the apparent performance was optimistically biased, with the degree of optimism diminishing as the number of events per variable increased. Differences between the bootstrap-corrected approach and the use of an independent validation sample were minimal once the number of events per variable was at least 20. Split-sample assessment resulted in too pessimistic and highly uncertain estimates of model performance. Apparent performance estimates had lower mean squared error compared to split-sample estimates, but the lowest mean squared error was obtained by bootstrap-corrected optimism estimates. For bias, variance, and mean squared error of the performance estimates, the penalty incurred by using split-sample validation was equivalent to reducing the sample size by a proportion equivalent to the proportion of the sample that was withheld for model validation. In conclusion, split-sample validation is inefficient and apparent performance is too optimistic for internal validation of regression-based prediction models. Modern validation methods, such as bootstrap-based optimism correction, are preferable. While these findings may be unsurprising to many statisticians, the results of the current study reinforce what should be considered good statistical practice in the development and validation of clinical prediction models.
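A minimal sketch of bootstrap optimism correction for the c-statistic (AUC) of a logistic regression model, in the spirit of the comparison above; the synthetic, imbalanced dataset and the number of bootstrap replicates are assumptions standing in for the heart-failure cohort.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)
apparent_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# Optimism: refit on bootstrap resamples, compare bootstrap AUC with AUC on the original data
optimisms = []
for _ in range(200):
    idx = rng.choice(len(y), size=len(y), replace=True)
    if len(np.unique(y[idx])) < 2:
        continue                                   # skip degenerate resamples
    boot = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], boot.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, boot.predict_proba(X)[:, 1])
    optimisms.append(auc_boot - auc_orig)

corrected_auc = apparent_auc - np.mean(optimisms)
print(round(apparent_auc, 3), round(corrected_auc, 3))
```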

14 Nov 2017
TL;DR: In this article, the authors present a summary of how to calculate the survey sample size in social research and information system research and how to estimate the sample size for any empirical study in which the goal is to make inferences about a population from a sample.
Abstract: The sample size is a significant feature of any empirical study in which the goal is to make inferences about a population from a sample. In order to generalize from a random sample and avoid sampling errors or biases, a random sample needs to be of adequate size. This study presents a summary of how to calculate the survey sample size in social research and information system research.

Journal ArticleDOI
TL;DR: This work presents an alternative approach that adjusts sample effect sizes for bias and uncertainty, and demonstrates its effectiveness for several experimental designs.
Abstract: The sample size necessary to obtain a desired level of statistical power depends in part on the population value of the effect size, which is, by definition, unknown. A common approach to sample-size planning uses the sample effect size from a prior study as an estimate of the population value of the effect to be detected in the future study. Although this strategy is intuitively appealing, effect-size estimates, taken at face value, are typically not accurate estimates of the population effect size because of publication bias and uncertainty. We show that the use of this approach often results in underpowered studies, sometimes to an alarming degree. We present an alternative approach that adjusts sample effect sizes for bias and uncertainty, and we demonstrate its effectiveness for several experimental designs. Furthermore, we discuss an open-source R package, BUCSS, and user-friendly Web applications that we have made available to researchers so that they can easily implement our suggested methods.

Journal ArticleDOI
07 Jul 2017-PeerJ
TL;DR: The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process, and potential arguments against removing significance thresholds are discussed.
Abstract: The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p ≤ 0.05) is hardly replicable: at a good statistical power of 80%, two studies will be ‘conflicting’, meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.

Journal ArticleDOI
TL;DR: In this paper, the SDSS-IV MaNGA survey is described and the final properties of the main samples along with important considerations for using these samples for science, while simultaneously optimizing the size distribution of the integral field units (IFUs), the IFU allocation strategy and the target density to produce a survey defined in terms of maximizing S/N, spatial resolution, and sample size.
Abstract: We describe the sample design for the SDSS-IV MaNGA survey and present the final properties of the main samples along with important considerations for using these samples for science. Our target selection criteria were developed while simultaneously optimizing the size distribution of the MaNGA integral field units (IFUs), the IFU allocation strategy, and the target density to produce a survey defined in terms of maximizing S/N, spatial resolution, and sample size. Our selection strategy makes use of redshift limits that only depend on i-band absolute magnitude ($M_i$), or, for a small subset of our sample, $M_i$ and color (NUV-i). Such a strategy ensures that all galaxies span the same range in angular size irrespective of luminosity and are therefore covered evenly by the adopted range of IFU sizes. We define three samples: the Primary and Secondary samples are selected to have a flat number density with respect to $M_i$ and are targeted to have spectroscopic coverage to 1.5 and 2.5 effective radii (Re), respectively. The Color-Enhanced supplement increases the number of galaxies in the low-density regions of color-magnitude space by extending the redshift limits of the Primary sample in the appropriate color bins. The samples cover the stellar mass range $5\times10^8 \leq M_* \leq 3\times10^{11} M_{\odot}$ and are sampled at median physical resolutions of 1.37 kpc and 2.5 kpc for the Primary and Secondary samples respectively. We provide weights that will statistically correct for our luminosity and color-dependent selection function and IFU allocation strategy, thus correcting the observed sample to a volume limited sample.

Journal ArticleDOI
TL;DR: Current and proposed new concepts of effect size (ES) quantification and significance are illustrated and discussed, with a focus on statistical and clinical/subjective interpretation and supported by empirical examples.

Journal ArticleDOI
14 Jul 2017-BMJ
TL;DR: This work introduces several practical aids to help researchers design cluster randomised trials in which all observations make a material contribution to the study, and enables identification of the point at which observations begin to make a negligible contribution to a study for a given target difference.
Abstract: Cluster randomised trials have diminishing returns in power and precision as cluster size increases. Making the cluster a lot larger while keeping the number of clusters fixed might yield only a very small increase in power and precision, owing to the intracluster correlation. Identifying the point at which observations start making a negligible contribution to the power or precision of the study—which we call the point of diminishing returns—is important for designing efficient trials. Current methods for identifying this point are potentially useful as rules of thumb but don’t generally work well. We introduce several practical aids to help researchers design cluster randomised trials in which all observations make a material contribution to the study. Power curves enable identification of the point at which observations begin to make a negligible contribution to a study for a given target difference. Under this paradigm, the number needed per arm under individual randomisation gives an upper bound on the cluster size, which should not be exceeded. Corresponding precision curves can be useful for accommodating flexibility in the choice of target difference and show the point at which confidence intervals around the estimated effect size no longer decrease. To design efficient trials, the number of clusters and cluster size should be determined concurrently, not independently. Funders and researchers should be aware of diminishing returns in cluster trials. Researchers should routinely plot power or precision curves when performing sample size calculations so that the implications of cluster sizes can be transparent. Even when data appear to be “free,” in the sense that few resources are needed to obtain the data, excessive cluster sizes can have important ramifications
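A sketch of the diminishing-returns calculation using the standard design effect 1 + (m − 1)ρ: with the number of clusters fixed, power rises more and more slowly as cluster size m grows, approaching a ceiling set by k/ICC per arm. The target difference, ICC, and cluster counts are illustrative assumptions, and the normal approximation used here is cruder than a full trial-design calculation.

```python
import numpy as np
from scipy.stats import norm

def cluster_power(d, clusters_per_arm, cluster_size, icc, alpha=0.05):
    """Approximate power for a two-arm cluster randomised comparison of means,
    using the design effect 1 + (m - 1) * ICC and a normal approximation."""
    design_effect = 1 + (cluster_size - 1) * icc
    n_eff = clusters_per_arm * cluster_size / design_effect   # effective n per arm
    ncp = d * np.sqrt(n_eff / 2)
    return norm.sf(norm.ppf(1 - alpha / 2) - ncp)

# Fixed 10 clusters per arm, ICC = 0.05, target standardized difference 0.3:
# power plateaus because the effective n per arm can never exceed k / ICC = 200.
for m in (5, 10, 20, 50, 100, 500):
    print(m, round(cluster_power(d=0.3, clusters_per_arm=10, cluster_size=m, icc=0.05), 3))
```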

BookDOI
15 Aug 2017
TL;DR: Sample Size Calculations in Clinical Research, Third Edition presents statistical procedures for performing sample size calculations during various phases of clinical research and development, including a well-balanced summary of current and emerging clinical issues, regulatory requirements, and recently developed statistical methodologies for sample size calculation.
Abstract: Praise for the Second Edition: "… this is a useful, comprehensive compendium of almost every possible sample size formula. The strong organization and carefully defined formulae will aid any researcher designing a study." -Biometrics "This impressive book contains formulae for computing sample size in a wide range of settings. One-sample studies and two-sample comparisons for quantitative, binary, and time-to-event outcomes are covered comprehensively, with separate sample size formulae for testing equality, non-inferiority, and equivalence. Many less familiar topics are also covered …" – Journal of the Royal Statistical Society Sample Size Calculations in Clinical Research, Third Edition presents statistical procedures for performing sample size calculations during various phases of clinical research and development. A comprehensive and unified presentation of statistical concepts and practical applications, this book includes a well-balanced summary of current and emerging clinical issues, regulatory requirements, and recently developed statistical methodologies for sample size calculation. Features: Compares the relative merits and disadvantages of statistical methods for sample size calculations Explains how the formulae and procedures for sample size calculations can be used in a variety of clinical research and development stages Presents real-world examples from several therapeutic areas, including cardiovascular medicine, the central nervous system, anti-infective medicine, oncology, and women’s health Provides sample size calculations for dose response studies, microarray studies, and Bayesian approaches This new edition is updated throughout, includes many new sections, and five new chapters on emerging topics: two stage seamless adaptive designs, cluster randomized trial design, zero-inflated Poisson distribution, clinical trials with extremely low incidence rates, and clinical trial simulation.

Journal ArticleDOI
TL;DR: The present study compared the nonparametric bootstrap test with pooled resampling method corresponding to parametric, nonparametric, and permutation tests through extensive simulations under various conditions and using real data examples, to overcome the problem related to small samples in hypothesis testing.
Abstract: Experimental studies in biomedical research frequently pose analytical problems related to small sample size. In such studies, there are conflicting findings regarding the choice of parametric and nonparametric analysis, especially with non-normal data. In such instances, some methodologists questioned the validity of parametric tests and suggested nonparametric tests. In contrast, other methodologists found nonparametric tests to be too conservative and less powerful and thus preferred using parametric tests. Some researchers have recommended using a bootstrap test; however, this method also has small sample size limitation. We used a pooled method in nonparametric bootstrap test that may overcome the problem related with small samples in hypothesis testing. The present study compared nonparametric bootstrap test with pooled resampling method corresponding to parametric, nonparametric, and permutation tests through extensive simulations under various conditions and using real data examples. The nonparametric pooled bootstrap t-test provided equal or greater power for comparing two means as compared with unpaired t-test, Welch t-test, Wilcoxon rank sum test, and permutation test while maintaining type I error probability for any conditions except for Cauchy and extreme variable lognormal distributions. In such cases, we suggest using an exact Wilcoxon rank sum test. Nonparametric bootstrap paired t-test also provided better performance than other alternatives. Nonparametric bootstrap test provided benefit over exact Kruskal-Wallis test. We suggest using nonparametric bootstrap test with pooled resampling method for comparing paired or unpaired means and for validating the one way analysis of variance test results for non-normal data in small sample size studies.

Journal ArticleDOI
01 Jul 2017-Infancy
TL;DR: Examining the effect of sample size on statistical power and the conclusions drawn from infant looking time research revealed that despite clear results with the original large samples, the results with smaller subsamples were highly variable, yielding both false positive and false negative outcomes.
Abstract: Infant research is hard. It is difficult, expensive, and time consuming to identify, recruit and test infants. As a result, ours is a field of small sample sizes. Many studies using infant looking time as a measure have samples of 8 to 12 infants per cell, and studies with more than 24 infants per cell are uncommon. This paper examines the effect of such sample sizes on statistical power and the conclusions drawn from infant looking time research. An examination of the state of the current literature suggests that most published looking time studies have low power, which leads in the long run to an increase in both false positive and false negative results. Three data sets with large samples (>30 infants) were used to simulate experiments with smaller sample sizes; 1000 random subsamples of 8, 12, 16, 20, and 24 infants from the overall samples were selected, making it possible to examine the systematic effect of sample size on the results. This approach revealed that despite clear results with the original large samples, the results with smaller subsamples were highly variable, yielding both false positive and false negative outcomes. Finally, a number of emerging possible solutions are discussed.
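A generic version of the subsampling exercise described above; since the original looking-time datasets are not reproduced here, a simulated "large sample" with a true one-second difference stands in, and the code tallies how often a paired t-test on random subsamples of each size reaches significance.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

# Simulated "large sample": 40 infants, looking times (s) to novel vs. familiar events,
# with a true mean difference of about one second.
n_full = 40
familiar = rng.normal(8.0, 2.5, n_full)
novel = familiar + rng.normal(1.0, 2.5, n_full)

for n_sub in (8, 12, 16, 20, 24):
    significant = 0
    for _ in range(1000):
        idx = rng.choice(n_full, size=n_sub, replace=False)
        if ttest_rel(novel[idx], familiar[idx]).pvalue < 0.05:
            significant += 1
    print(f"n = {n_sub}: {significant / 10:.1f}% of subsamples significant")
```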

Book ChapterDOI
TL;DR: This chapter outlines the data collection procedure for the research and the information about the study area and sample size is provided.
Abstract: This chapter outlines the data collection procedure for the research. The information about the study area and sample size is provided.

Journal ArticleDOI
TL;DR: The use of the CI supplements the P value by providing an estimate of the actual clinical effect; of late, clinical trials are being designed specifically as superiority, non-inferiority or equivalence studies whose conclusions are based on CI values rather than the P value from intergroup comparison.
Abstract: Biomedical research is seldom done with entire populations but rather with samples drawn from a population. Although we work with samples, our goal is to describe and draw inferences regarding the underlying population. It is possible to use a sample statistic and estimates of error in the sample to get a fair idea of the population parameter, not as a single value, but as a range of values. This range is the confidence interval (CI) which is estimated on the basis of a desired confidence level. Calculation of the CI of a sample statistic takes the general form: CI = Point estimate ± Margin of error, where the margin of error is given by the product of a critical value (z) derived from the standard normal curve and the standard error of point estimate. Calculation of the standard error varies depending on whether the sample statistic of interest is a mean, proportion, odds ratio (OR), and so on. The factors affecting the width of the CI include the desired confidence level, the sample size and the variability in the sample. Although the 95% CI is most often used in biomedical research, a CI can be calculated for any level of confidence. A 99% CI will be wider than 95% CI for the same sample. Conflict between clinical importance and statistical significance is an important issue in biomedical research. Clinical importance is best inferred by looking at the effect size, that is how much is the actual change or difference. However, statistical significance in terms of P only suggests whether there is any difference in probability terms. Use of the CI supplements the P value by providing an estimate of actual clinical effect. Of late, clinical trials are being designed specifically as superiority, non-inferiority or equivalence studies. The conclusions from these alternative trial designs are based on CI values rather than the P value from intergroup comparison.
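A short worked example of the general form CI = point estimate ± z × SE, for a mean and for a proportion; the data are made up for illustration, and for a sample as small as the one below a t critical value would normally replace z.

```python
import numpy as np
from scipy.stats import norm

z = norm.ppf(0.975)   # critical value for a 95% confidence level

# CI for a mean: point estimate ± z * (SD / sqrt(n))
x = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9, 5.4, 5.0])
mean = x.mean()
se_mean = x.std(ddof=1) / np.sqrt(len(x))
print("mean:", round(mean, 2), "95% CI:",
      round(mean - z * se_mean, 2), "to", round(mean + z * se_mean, 2))

# CI for a proportion: p ± z * sqrt(p * (1 - p) / n)
events, n = 30, 120
p = events / n
se_p = np.sqrt(p * (1 - p) / n)
print("proportion:", round(p, 2), "95% CI:",
      round(p - z * se_p, 2), "to", round(p + z * se_p, 2))
```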

Journal ArticleDOI
08 Sep 2017-eLife
TL;DR: It is found that funnel plots of the Standardized Mean Difference plotted against the standard error (SE) are susceptible to distortion, leading to overestimation of the existence and extent of publication bias.
Abstract: Meta-analyses are increasingly used for synthesis of evidence from biomedical research, and often include an assessment of publication bias based on visual or analytical detection of asymmetry in funnel plots. We studied the influence of different normalisation approaches, sample size and intervention effects on funnel plot asymmetry, using empirical datasets and illustrative simulations. We found that funnel plots of the Standardized Mean Difference (SMD) plotted against the standard error (SE) are susceptible to distortion, leading to overestimation of the existence and extent of publication bias. Distortion was more severe when the primary studies had a small sample size and when an intervention effect was present. We show that using the Normalised Mean Difference measure as effect size (when possible), or plotting the SMD against a sample size-based precision estimate, are more reliable alternatives. We conclude that funnel plots using the SMD in combination with the SE are unsuitable for publication bias assessments and can lead to false-positive results.

Journal ArticleDOI
TL;DR: It is hypothesized that studying subsets of very large meta-analyses would allow for systematic identification of best practices to improve reproducibility; the analyses showed relatively greater reproducibility with more-stringent effect size thresholds combined with relaxed significance thresholds, and relatively lower reproducibility when imposing extraneous constraints on residual heterogeneity.
Abstract: Findings from clinical and biological studies are often not reproducible when tested in independent cohorts. Due to the testing of a large number of hypotheses and relatively small sample sizes, results from whole-genome expression studies in particular are often not reproducible. Compared to single-study analysis, gene expression meta-analysis can improve reproducibility by integrating data from multiple studies. However, there are multiple choices in designing and carrying out a meta-analysis. Yet, clear guidelines on best practices are scarce. Here, we hypothesized that studying subsets of very large meta-analyses would allow for systematic identification of best practices to improve reproducibility. We therefore constructed three very large gene expression meta-analyses from clinical samples, and then examined meta-analyses of subsets of the datasets (all combinations of datasets with up to N/2 samples and K/2 datasets) compared to a ‘silver standard’ of differentially expressed genes found in the entire cohort. We tested three random-effects meta-analysis models using this procedure. We showed relatively greater reproducibility with more-stringent effect size thresholds with relaxed significance thresholds; relatively lower reproducibility when imposing extraneous constraints on residual heterogeneity; and an underestimation of actual false positive rate by Benjamini–Hochberg correction. In addition, multivariate regression showed that the accuracy of a meta-analysis increased significantly with more included datasets even when controlling for sample size.

Journal ArticleDOI
TL;DR: It is found that properly specified CPMs generally have good finite sample performance with moderate sample sizes, but that bias may occur when the sample size is small, and these models are fairly robust to minor or moderate link function misspecification in the authors' simulations.
Abstract: We study the application of a widely used ordinal regression model, the cumulative probability model (CPM), for continuous outcomes. Such models are attractive for the analysis of continuous response variables because they are invariant to any monotonic transformation of the outcome and because they directly model the cumulative distribution function from which summaries such as expectations and quantiles can easily be derived. Such models can also readily handle mixed type distributions. We describe the motivation, estimation, inference, model assumptions, and diagnostics. We demonstrate that CPMs applied to continuous outcomes are semiparametric transformation models. Extensive simulations are performed to investigate the finite sample performance of these models. We find that properly specified CPMs generally have good finite sample performance with moderate sample sizes, but that bias may occur when the sample size is small. Cumulative probability models are fairly robust to minor or moderate link function misspecification in our simulations. For certain purposes, the CPMs are more efficient than other models. We illustrate their application, with model diagnostics, in a study of the treatment of HIV. CD4 cell count and viral load 6 months after the initiation of antiretroviral therapy are modeled using CPMs; both variables typically require transformations, and viral load has a large proportion of measurements below a detection limit.

Journal ArticleDOI
26 Jul 2017-PLOS ONE
TL;DR: The sample size required in qualitative research to reach theoretical saturation is explored, and seven guidelines for purposive sampling are formulated, recommending that researchers follow a minimum information scenario.
Abstract: I explore the sample size in qualitative research that is required to reach theoretical saturation. I conceptualize a population as consisting of sub-populations that contain different types of information sources that hold a number of codes. Theoretical saturation is reached after all the codes in the population have been observed once in the sample. I delineate three different scenarios to sample information sources: “random chance,” which is based on probability sampling, “minimal information,” which yields at least one new code per sampling step, and “maximum information,” which yields the largest number of new codes per sampling step. Next, I use simulations to assess the minimum sample size for each scenario for systematically varying hypothetical populations. I show that theoretical saturation is more dependent on the mean probability of observing codes than on the number of codes in a population. Moreover, the minimal and maximal information scenarios are significantly more efficient than random chance, but yield fewer repetitions per code to validate the findings. I formulate guidelines for purposive sampling and recommend that researchers follow a minimum information scenario.
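A small simulation in the spirit of the "random chance" scenario described above (the code probabilities, number of codes, and number of runs are illustrative assumptions): information sources are sampled at random, and each run stops once every code has been observed at least once.

```python
import numpy as np

rng = np.random.default_rng(0)

def sources_to_saturation(code_probs, max_sources=10000):
    """Number of randomly sampled sources needed until every code is seen at least once."""
    seen = np.zeros(len(code_probs), dtype=bool)
    for n in range(1, max_sources + 1):
        seen |= rng.random(len(code_probs)) < code_probs   # codes this source happens to hold
        if seen.all():
            return n
    return max_sources

# 20 codes whose probabilities of appearing in any given source range from rare to common.
code_probs = np.linspace(0.05, 0.8, 20)
runs = [sources_to_saturation(code_probs) for _ in range(1000)]
print("median sample size to saturation:", int(np.median(runs)))
print("95th percentile:", int(np.percentile(runs, 95)))
```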