
Showing papers on "Sample size determination published in 2019"


Journal ArticleDOI
07 Nov 2019-PLOS ONE
TL;DR: The authors' simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident at a sample size of 1000, while Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size.
Abstract: Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High dimensional data with a small number of samples are of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, they can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.

622 citations
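
The overfitting mechanism the authors describe is easy to reproduce. Below is a minimal scikit-learn sketch (not the authors' code; the dataset sizes, the k=20 feature filter and the SVM are arbitrary choices) contrasting feature selection on the pooled data with selection nested inside each CV fold.

```python
# Sketch: why feature selection on pooled data inflates K-fold CV estimates
# (illustrative only; dataset sizes and models are arbitrary choices).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, p = 50, 5000                      # small sample, high dimension, no real signal
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)

# Biased: select features using ALL samples, then cross-validate
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(SVC(), X_sel, y, cv=5).mean()

# Unbiased: selection happens inside each training fold via a pipeline
pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC())
unbiased = cross_val_score(pipe, X, y, cv=5).mean()

print(f"pooled selection: {biased:.2f}   selection inside CV: {unbiased:.2f}")
# With pure-noise data, the first estimate is typically far above chance (0.5),
# while the second stays near 0.5.
```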


Journal ArticleDOI
TL;DR: The minimum values of n and E (and hence the minimum number of events per predictor parameter, EPP) should be calculated to meet three criteria: small optimism in predictor effect estimates as defined by a global shrinkage factor of ≥0.9, a small absolute difference of ≤0.05 between the model's apparent and adjusted Nagelkerke's R2, and precise estimation of the overall risk in the population; the first two criteria aim to reduce overfitting conditional on a chosen p and require prespecification of the model's anticipated Cox-Snell R2.
Abstract: When designing a study to develop a new prediction model with binary or time-to-event outcomes, researchers should ensure their sample size is adequate in terms of the number of participants (n) and outcome events (E) relative to the number of predictor parameters (p) considered for inclusion. We propose that the minimum values of n and E (and subsequently the minimum number of events per predictor parameter, EPP) should be calculated to meet the following three criteria: (i) small optimism in predictor effect estimates as defined by a global shrinkage factor of ≥0.9, (ii) small absolute difference of ≤ 0.05 in the model's apparent and adjusted Nagelkerke's R2 , and (iii) precise estimation of the overall risk in the population. Criteria (i) and (ii) aim to reduce overfitting conditional on a chosen p, and require prespecification of the model's anticipated Cox-Snell R2 , which we show can be obtained from previous studies. The values of n and E that meet all three criteria provides the minimum sample size required for model development. Upon application of our approach, a new diagnostic model for Chagas disease requires an EPP of at least 4.8 and a new prognostic model for recurrent venous thromboembolism requires an EPP of at least 23. This reinforces why rules of thumb (eg, 10 EPP) should be avoided. Researchers might additionally ensure the sample size gives precise estimates of key predictor effects; this is especially important when key categorical predictors have few events in some categories, as this may substantially increase the numbers required.

425 citations
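
For readers who want a feel for the numbers, here is a sketch of criterion (i) as it is commonly implemented (for example in the R package pmsampsize); the closed-form expression is reproduced from memory and should be checked against the paper before use, and the example values for p, the anticipated Cox-Snell R2 and the outcome prevalence are hypothetical.

```python
# Sketch of criterion (i): smallest n giving expected shrinkage S >= 0.9.
# Formula as used in common implementations (e.g. the R 'pmsampsize' package),
# reproduced here from memory -- verify against the paper before relying on it.
import math

def min_n_shrinkage(p, r2_cs, S=0.9):
    """Minimum sample size so the expected global shrinkage factor is >= S.

    p      : number of candidate predictor parameters
    r2_cs  : anticipated Cox-Snell R^2 of the model
    S      : target shrinkage factor (0.9 in the paper's criterion (i))
    """
    return p / ((S - 1) * math.log(1 - r2_cs / S))

# Hypothetical example: 24 parameters, anticipated Cox-Snell R^2 of 0.25,
# outcome prevalence of 0.15.
p, r2_cs, prev = 24, 0.25, 0.15
n = math.ceil(min_n_shrinkage(p, r2_cs))
events = n * prev
print(f"n >= {n}, expected events = {events:.0f}, EPP = {events / p:.1f}")
# Criteria (ii) and (iii) in the paper add further constraints (small apparent vs
# adjusted R^2 difference, precise overall risk estimate); the final minimum n is
# the largest value implied by the three criteria.
```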


Journal ArticleDOI
TL;DR: Six parameters influencing saturation in focus group data are identified: study purpose, type of codes, group stratification, number of groups per stratum, and type and degree of saturation.
Abstract: Saturation is commonly used to determine sample sizes in qualitative research, yet there is little guidance on what influences saturation. We aimed to assess saturation and identify parameters to estimate sample sizes for focus group studies in advance of data collection. We used two approaches to assess saturation in data from 10 focus group discussions. Four focus groups were sufficient to identify a range of new issues (code saturation), but more groups were needed to fully understand these issues (meaning saturation). Group stratification influenced meaning saturation, whereby one focus group per stratum was needed to identify issues; two groups per stratum provided a more comprehensive understanding of issues, but more groups per stratum provided little additional benefit. We identify six parameters influencing saturation in focus group data: study purpose, type of codes, group stratification, number of groups per stratum, and type and degree of saturation.

349 citations


Journal ArticleDOI
TL;DR: The results showed that the effect of p on the population CFI and TLI depended on the type of specification error, whereas a higher p was associated with lower values of the population RMSEA regardless of the type of model misspecification.
Abstract: This study investigated the effect the number of observed variables (p) has on three structural equation modeling indices: the comparative fit index (CFI), the Tucker-Lewis index (TLI), and the root mean square error of approximation (RMSEA). The behaviors of the population fit indices and their sample estimates were compared under various conditions created by manipulating the number of observed variables, the types of model misspecification, the sample size, and the magnitude of factor loadings. The results showed that the effect of p on the population CFI and TLI depended on the type of specification error, whereas a higher p was associated with lower values of the population RMSEA regardless of the type of model misspecification. In finite samples, all three fit indices tended to yield estimates that suggested a worse fit than their population counterparts, which was more pronounced with a smaller sample size, higher p, and lower factor loading.

323 citations
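
The fit indices involved can be computed from the model and baseline chi-square statistics with the standard sample formulas below (a generic helper, not the authors' simulation code; conventions differ slightly across software, e.g. n versus n − 1 in the RMSEA denominator).

```python
# Standard sample formulas for CFI, TLI and RMSEA from model and baseline
# (null-model) chi-square statistics.
import math

def fit_indices(chi2_m, df_m, chi2_0, df_0, n):
    """chi2_m, df_m: hypothesized model; chi2_0, df_0: baseline model; n: sample size."""
    d_m = max(chi2_m - df_m, 0.0)             # noncentrality of the fitted model
    d_0 = max(chi2_0 - df_0, 0.0)             # noncentrality of the baseline model
    cfi = 1.0 - d_m / max(d_m, d_0, 1e-12)
    tli = ((chi2_0 / df_0) - (chi2_m / df_m)) / ((chi2_0 / df_0) - 1.0)
    rmsea = math.sqrt(d_m / (df_m * (n - 1)))  # some software divides by n instead
    return cfi, tli, rmsea

# Hypothetical values: similar misfit per degree of freedom, different model sizes (p)
print(fit_indices(chi2_m=150.0, df_m=100, chi2_0=2000.0, df_0=120, n=300))
print(fit_indices(chi2_m=600.0, df_m=400, chi2_0=8000.0, df_0=435, n=300))
```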


Journal ArticleDOI
TL;DR: It is shown that out-of-sample predictive performance can better be approximated by considering the number of predictors, the total sample size and the events fraction, and it is proposed that the development of new sample size criteria for prediction models should be based on these three parameters.
Abstract: Binary logistic regression is one of the most frequently applied statistical approaches for developing clinical prediction models. Developers of such models often rely on an Events Per Variable criterion (EPV), notably EPV ≥10, to determine the minimal sample size required and the maximum number of candidate predictors that can be examined. We present an extensive simulation study in which we studied the influence of EPV, events fraction, number of candidate predictors, the correlations and distributions of candidate predictor variables, area under the ROC curve, and predictor effects on out-of-sample predictive performance of prediction models. The out-of-sample performance (calibration, discrimination and probability prediction error) of developed prediction models was studied before and after regression shrinkage and variable selection. The results indicate that EPV does not have a strong relation with metrics of predictive performance, and is not an appropriate criterion for (binary) prediction model development studies. We show that out-of-sample predictive performance can better be approximated by considering the number of predictors, the total sample size and the events fraction. We propose that the development of new sample size criteria for prediction models should be based on these three parameters, and provide suggestions for improving sample size determination.

276 citations
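
A minimal simulation in the spirit of this study (not the authors' code, and with arbitrary parameter choices) is sketched below: develop an unpenalized logistic model at a chosen EPV, then estimate the calibration slope in a large fresh sample, where a slope below 1 signals overfitting.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
p, event_frac, epv = 12, 0.3, 5                   # candidate predictors, events fraction, EPV
n_dev = int(round(epv * p / event_frac))          # development sample size implied by the EPV
beta = np.r_[np.repeat(0.5, 6), np.zeros(p - 6)]  # some true and some noise predictors

def simulate(n):
    X = rng.normal(size=(n, p))
    lp = -1.0 + X @ beta                          # intercept chosen for roughly 30% events
    return X, rng.binomial(1, 1 / (1 + np.exp(-lp)))

X_dev, y_dev = simulate(n_dev)
X_val, y_val = simulate(50_000)

# Unpenalized ML fit on the small development set (may occasionally fail to
# converge at very low EPV; re-seed or raise the EPV if so)
fit = sm.Logit(y_dev, sm.add_constant(X_dev)).fit(disp=0)
lp_val = sm.add_constant(X_val) @ fit.params
cal = sm.Logit(y_val, sm.add_constant(lp_val)).fit(disp=0)
print(f"n_dev = {n_dev}, calibration slope = {cal.params[1]:.2f}  (<1 indicates overfitting)")
```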


Journal ArticleDOI
TL;DR: A hands-on tutorial illustrating how a priori and post hoc power analyses for the most frequently used two-level models are conducted and case-sensitive rules of thumb for deriving sufficient sample sizes as well as minimum detectable effect sizes that yield a power ≥ .80 for the effects and input parameters most frequently analyzed by psychologists are provided.
Abstract: The estimation of power in two-level models used to analyze data that are hierarchically structured is particularly complex because the outcome contains variance at two levels that is regressed on predictors at two levels. Methods for the estimation of power in two-level models have been based on formulas and Monte Carlo simulation. We provide a hands-on tutorial illustrating how a priori and post hoc power analyses for the most frequently used two-level models are conducted. We describe how a population model for the power analysis can be specified by using standardized input parameters and how the power analysis is implemented in SIMR, a very flexible power estimation method based on Monte Carlo simulation. Finally, we provide case-sensitive rules of thumb for deriving sufficient sample sizes as well as minimum detectable effect sizes that yield a power ≥ .80 for the effects and input parameters most frequently analyzed by psychologists. For medium variance components, the results indicate that with lower level (L1) sample sizes up to 30 and higher level (L2) sample sizes up to 200, medium and large fixed effects can be detected. However, small L2 direct- or cross-level interaction effects cannot be detected with up to 200 clusters. The tutorial and guidelines should be of help to researchers dealing with multilevel study designs such as individuals clustered within groups or repeated measurements clustered within individuals. (PsycINFO Database Record (c) 2019 APA, all rights reserved).

227 citations
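
The paper's tutorial is built around the R package SIMR; the sketch below is a rough Python analogue of the same Monte Carlo idea using statsmodels' MixedLM, with illustrative values only (the cluster counts, effect size and variance components are assumptions, not the paper's scenarios).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def simulate_power(n_l2=30, n_l1=20, gamma=0.3, tau=0.5, sigma=1.0,
                   n_sims=200, alpha=0.05, seed=0):
    """Power to detect a standardized L1 fixed effect `gamma` in a random-intercept model."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        g = np.repeat(np.arange(n_l2), n_l1)            # cluster ids
        x = rng.normal(size=n_l2 * n_l1)                # L1 predictor
        u = rng.normal(scale=tau, size=n_l2)[g]         # random intercepts
        y = gamma * x + u + rng.normal(scale=sigma, size=n_l2 * n_l1)
        d = pd.DataFrame({"y": y, "x": x, "g": g})
        res = smf.mixedlm("y ~ x", d, groups=d["g"]).fit(reml=True)
        hits += res.pvalues["x"] < alpha                # Wald test of the fixed effect
    return hits / n_sims

print(simulate_power())   # e.g. estimated power for 30 clusters of size 20
```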


Journal ArticleDOI
TL;DR: Student's t test (t test), analysis of variance (ANOVA), and analysis of covariance (ANCOVA) are statistical methods used in hypothesis testing for the comparison of means between groups.
Abstract: Student's t test (t test), analysis of variance (ANOVA), and analysis of covariance (ANCOVA) are statistical methods used in hypothesis testing for the comparison of means between groups. The Student's t test is used to compare the means between two groups, whereas ANOVA is used to compare the means among three or more groups. ANOVA first yields a common P value. A significant P value from the ANOVA test indicates that the mean difference is statistically significant for at least one pair of groups. To identify that significant pair (or pairs), we use multiple comparison procedures. An ANOVA with one categorical independent variable is called a one-way ANOVA, whereas one with two categorical independent variables is called a two-way ANOVA. When at least one covariate is used to adjust the dependent variable, ANOVA becomes ANCOVA. When the sample size is small, the mean is strongly affected by outliers, so a sufficient sample size is necessary when using these methods.

206 citations
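
For reference, the three procedures look like this on toy data with scipy/statsmodels (illustrative only; the simulated effect sizes and the Tukey post hoc choice are assumptions, not prescribed by the article).

```python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
d = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 40),
    "age":   rng.normal(50, 10, 120),                      # covariate
})
d["score"] = 2.0 * (d["group"] == "C") + 0.1 * d["age"] + rng.normal(0, 3, 120)

# Student's t test: two groups
a, b = d.loc[d.group == "A", "score"], d.loc[d.group == "B", "score"]
print(stats.ttest_ind(a, b))

# One-way ANOVA: three or more groups (one common P value first)
print(stats.f_oneway(*[d.loc[d.group == g, "score"] for g in "ABC"]))

# If the ANOVA is significant, pairwise multiple comparisons (here Tukey HSD)
print(pairwise_tukeyhsd(d["score"], d["group"]))

# ANCOVA: the same group comparison adjusted for the covariate 'age'
print(sm.stats.anova_lm(smf.ols("score ~ C(group) + age", data=d).fit(), typ=2))
```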


Journal ArticleDOI
TL;DR: In this article, the authors show that the maximum likelihood estimate (MLE) is biased, the variability of the MLE is far greater than classically estimated, and the likelihood-ratio test (LRT) is not distributed as a χ2; the bias of the MLE yields wrong predictions for the probability of a case based on observed values of the covariates.
Abstract: Students in statistics or data science usually learn early on that when the sample size n is large relative to the number of variables p, fitting a logistic model by the method of maximum likelihood produces estimates that are consistent and that there are well-known formulas that quantify the variability of these estimates which are used for the purpose of statistical inference. We are often told that these calculations are approximately valid if we have 5 to 10 observations per unknown parameter. This paper shows that this is far from the case, and consequently, inferences produced by common software packages are often unreliable. Consider a logistic model with independent features in which n and p become increasingly large in a fixed ratio. We prove that (i) the maximum-likelihood estimate (MLE) is biased, (ii) the variability of the MLE is far greater than classically estimated, and (iii) the likelihood-ratio test (LRT) is not distributed as a χ2. The bias of the MLE yields wrong predictions for the probability of a case based on observed values of the covariates. We present a theory that provides explicit expressions for the asymptotic bias and variance of the MLE and the asymptotic distribution of the LRT. We empirically demonstrate that these results are accurate in finite samples. Our results depend only on a single measure of signal strength, which leads to concrete proposals for obtaining accurate inference in finite samples through the estimate of this measure.

188 citations
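
The inflation of the MLE is easy to glimpse in a quick simulation. The sketch below is not the paper's exact setup (the signal strength and dimensions are arbitrary, and there is no intercept); with an unlucky seed the unpenalized fit can fail if the data happen to be separable.

```python
# Quick illustration of MLE inflation when p is a non-negligible fraction of n.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, p = 1000, 100                       # p/n = 0.1: the "5-10 observations per parameter" regime
beta = np.r_[np.repeat(0.5, 20), np.zeros(p - 20)]

ratios = []
for _ in range(20):
    X = rng.normal(size=(n, p))
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))
    mle = sm.Logit(y, X).fit(disp=0).params          # unpenalized maximum likelihood
    ratios.append(mle[:20].mean() / 0.5)             # average estimate of the nonzero coefficients
print(f"mean estimated/true coefficient ratio: {np.mean(ratios):.2f}  (unbiased would be 1.00)")
```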


Journal ArticleDOI
TL;DR: This study investigated the distribution of effect sizes in both individual differences research and group differences research in gerontology to provide field-specific estimates of effect sizes, and found that Cohen's guidelines appear to overestimate effect sizes in gerontology.
Abstract: Background and objectives Researchers typically use Cohen's guidelines of Pearson's r = .10, .30, and .50, and Cohen's d = 0.20, 0.50, and 0.80 to interpret observed effect sizes as small, medium, or large, respectively. However, these guidelines were not based on quantitative estimates and are only recommended if field-specific estimates are unknown. This study investigated the distribution of effect sizes in both individual differences research and group differences research in gerontology to provide estimates of effect sizes in the field. Research design and methods Effect sizes (Pearson's r, Cohen's d, and Hedges' g) were extracted from meta-analyses published in 10 top-ranked gerontology journals. The 25th, 50th, and 75th percentile ranks were calculated for Pearson's r (individual differences) and Cohen's d or Hedges' g (group differences) values as indicators of small, medium, and large effects. A priori power analyses were conducted for sample size calculations given the observed effect size estimates. Results Effect sizes of Pearson's r = .12, .20, and .32 for individual differences research and Hedges' g = 0.16, 0.38, and 0.76 for group differences research were interpreted as small, medium, and large effects in gerontology. Discussion and implications Cohen's guidelines appear to overestimate effect sizes in gerontology. Researchers are encouraged to use Pearson's r = .10, .20, and .30, and Cohen's d or Hedges' g = 0.15, 0.40, and 0.75 to interpret small, medium, and large effects in gerontology, and recruit larger samples.

187 citations
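
To see what the revised benchmarks imply for recruitment, the required n for a Pearson correlation can be computed with the standard Fisher-z approximation (a generic power formula, not taken from the article).

```python
# Required sample size to detect a Pearson correlation with 80% power
# (two-sided alpha = .05), via the Fisher z approximation.
import math
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80):
    z_r = math.atanh(r)                                  # Fisher z transform of r
    n = ((norm.ppf(1 - alpha / 2) + norm.ppf(power)) / z_r) ** 2 + 3
    return math.ceil(n)

# Small / medium / large per the recommended gerontology benchmarks
for r in (0.10, 0.20, 0.30):
    print(f"r = {r:.2f}: n ≈ {n_for_correlation(r)}")
```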


Posted Content
TL;DR: This work proposes a parametric Q-learning algorithm that finds an approximate-optimal policy using a sample size proportional to the feature dimension $K$ and invariant with respect to the size of the state space, and exploits the monotonicity property and intrinsic noise structure of the Bellman operator.
Abstract: Consider a Markov decision process (MDP) that admits a set of state-action features, which can linearly express the process's probabilistic transition model. We propose a parametric Q-learning algorithm that finds an approximate-optimal policy using a sample size proportional to the feature dimension $K$ and invariant with respect to the size of the state space. To further improve its sample efficiency, we exploit the monotonicity property and intrinsic noise structure of the Bellman operator, provided the existence of anchor state-actions that imply implicit non-negativity in the feature space. We augment the algorithm using techniques of variance reduction, monotonicity preservation, and confidence bounds. It is proved to find a policy which is $\epsilon$-optimal from any initial state with high probability using $\widetilde{O}\big(K/(\epsilon^2(1-\gamma)^3)\big)$ sample transitions for arbitrarily large-scale MDP with a discount factor $\gamma\in(0,1)$. A matching information-theoretical lower bound is proved, confirming the sample optimality of the proposed method with respect to all parameters (up to polylog factors).

175 citations


Journal ArticleDOI
TL;DR: This study indicates that the epigenetic clock can be improved by increasing the training sample size and that its association with mortality attenuates as the prediction accuracy for chronological age increases.
Abstract: DNA methylation changes with age. Chronological age predictors built from DNA methylation are termed ‘epigenetic clocks’. The deviation of predicted age from the actual age (‘age acceleration residual’, AAR) has been reported to be associated with death. However, it is currently unclear how a better prediction of chronological age affects such association. In this study, we build multiple predictors based on training DNA methylation samples selected from 13,661 samples (13,402 from blood and 259 from saliva). We use the Lothian Birth Cohorts of 1921 (LBC1921) and 1936 (LBC1936) to examine whether the association between AAR (from these predictors) and death is affected by (1) improving prediction accuracy of an age predictor as its training sample size increases (from 335 to 12,710) and (2) additionally correcting for confounders (i.e., cellular compositions). In addition, we investigated the performance of our predictor in non-blood tissues. We found that in principle, a near-perfect age predictor could be developed when the training sample size is sufficiently large. The association between AAR and mortality attenuates as prediction accuracy increases. AAR from our best predictor (based on Elastic Net, https://github.com/qzhang314/DNAm-based-age-predictor ) exhibits no association with mortality in both LBC1921 (hazard ratio = 1.08, 95% CI 0.91–1.27) and LBC1936 (hazard ratio = 1.00, 95% CI 0.79–1.28). Predictors based on small sample size are prone to confounding by cellular compositions relative to those from large sample size. We observed comparable performance of our predictor in non-blood tissues with a multi-tissue-based predictor. This study indicates that the epigenetic clock can be improved by increasing the training sample size and that its association with mortality attenuates with increased prediction of chronological age.
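
A sketch of the general workflow on simulated data is shown below (it is not the authors' pipeline; the dimensions, the simulated methylation model and the ElasticNetCV settings are arbitrary): fit a penalized age predictor, then compute the age-acceleration residual as the part of predicted age not explained by chronological age.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, n_cpg = 2000, 500                      # samples x CpG probes (toy dimensions)
age = rng.uniform(20, 90, n)
weights = rng.normal(0, 0.002, n_cpg)
meth = 0.5 + np.outer(age, weights) + rng.normal(0, 0.05, (n, n_cpg))  # simulated beta values

X_tr, X_te, age_tr, age_te = train_test_split(meth, age, random_state=0)
clock = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X_tr, age_tr)   # penalized age predictor
pred = clock.predict(X_te)

# Age acceleration residual: predicted age after removing its linear
# dependence on chronological age
aar = pred - LinearRegression().fit(age_te.reshape(-1, 1), pred).predict(age_te.reshape(-1, 1))
print(f"prediction r = {np.corrcoef(pred, age_te)[0, 1]:.2f}, AAR sd = {aar.std():.2f}")
# In the study, AAR from increasingly accurate clocks showed weaker association
# with mortality (assessed with Cox models in the LBC cohorts).
```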

Journal ArticleDOI
TL;DR: This work compares how accurately five study designs estimate the true effect of a simulated environmental impact that causes a step-change response in a population's density; it also proposes ‘accuracy weights’ and demonstrates how they can be used to weight studies in three recent meta-analyses by accounting for study design and sample size.
Abstract: Monitoring the impacts of anthropogenic threats and interventions to mitigate these threats is key to understanding how to best conserve biodiversity. Ecologists use many different study designs to monitor such impacts. Simpler designs lacking controls (e.g. Before–After (BA) and After) or pre-impact data (e.g. Control–Impact (CI)) are considered to be less robust than more complex designs (e.g. Before–After Control-Impact (BACI) or Randomized Controlled Trials (RCTs)). However, we lack quantitative estimates of how much less accurate simpler study designs are in ecology. Understanding this could help prioritize research and weight studies by their design's accuracy in meta-analysis and evidence assessment. We compared how accurately five study designs estimated the true effect of a simulated environmental impact that caused a step-change response in a population's density. We derived empirical estimates of several simulation parameters from 47 ecological datasets to ensure our simulations were realistic. We measured design performance by determining the percentage of simulations where: (a) the true effect fell within the 95% Confidence Intervals of effect size estimates, and (b) each design correctly estimated the true effect's direction and magnitude. We also considered how sample size affected their performance. We demonstrated that BACI designs performed: 1.3–1.8 times better than RCTs; 2.9–4.2 times versus BA; 3.2–4.6 times versus CI; and 7.1–10.1 times versus After designs (depending on sample size), when correctly estimating true effect's direction and magnitude to within ±30%. Although BACI designs suffered from low power at small sample sizes, they outperformed other designs for almost all performance measures. Increasing sample size improved BACI design accuracy, but only increased the precision of simpler designs around biased estimates. Synthesis and applications. We suggest that more investment in more robust designs is needed in ecology since inferences from simpler designs, even with large sample sizes may be misleading. Facilitating this requires longer-term funding and stronger research–practice partnerships. We also propose ‘accuracy weights’ and demonstrate how they can weight studies in three recent meta-analyses by accounting for study design and sample size. We hope these help decision-makers and meta-analysts better account for study design when assessing evidence.
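
A toy version of the design comparison, greatly simplified relative to the paper's simulations (a single impact and control series, arbitrary trend and noise values), shows why the Before-After contrast is biased while the BACI contrast is not.

```python
import numpy as np

rng = np.random.default_rng(4)
true_effect, trend = -0.5, -0.3          # impact on log-density; background decline affects all sites
n_sims, n_years = 2000, 10

ba_est, baci_est = [], []
for _ in range(n_sims):
    years = np.arange(n_years)
    after = years >= n_years // 2
    noise = lambda: rng.normal(0, 0.3, n_years)
    impact  = trend * years / n_years + true_effect * after + noise()
    control = trend * years / n_years + noise()
    ba_est.append(impact[after].mean() - impact[~after].mean())
    baci_est.append((impact[after].mean() - impact[~after].mean())
                    - (control[after].mean() - control[~after].mean()))

for name, est in [("BA", ba_est), ("BACI", baci_est)]:
    print(f"{name:5s} mean estimate = {np.mean(est):+.2f} (true {true_effect:+.2f})")
# The BA estimate absorbs the background trend and is biased; BACI removes it.
```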

Journal ArticleDOI
TL;DR: An extensive Monte Carlo simulation is presented to investigate the performance of extraction criteria under varying sample sizes, numbers of indicators per factor, loading magnitudes, and underlying multivariate distributions of observed variables, as well as how the performance of the extraction criteria is influenced by the presence of cross-loadings and minor factors for unidimensional, orthogonal, and correlated factor models.
Abstract: Exploratory factor analyses are commonly used to determine the underlying factors of multiple observed variables. Many criteria have been suggested to determine how many factors should be retained. In this study, we present an extensive Monte Carlo simulation to investigate the performance of extraction criteria under varying sample sizes, numbers of indicators per factor, loading magnitudes, underlying multivariate distributions of observed variables, as well as how the performance of the extraction criteria is influenced by the presence of cross-loadings and minor factors for unidimensional, orthogonal, and correlated factor models. We compared several variants of traditional parallel analysis (PA), the Kaiser-Guttman Criterion, and sequential χ2 model tests (SMT) with 4 recently suggested methods: revised PA, comparison data (CD), the Hull method, and the Empirical Kaiser Criterion (EKC). No single extraction criterion performed best for every factor model. In unidimensional and orthogonal models, traditional PA, EKC, and Hull consistently displayed high hit rates even in small samples. Models with correlated factors were more challenging, where CD and SMT outperformed other methods, especially for shorter scales. Whereas the presence of cross-loadings generally increased accuracy, non-normality had virtually no effect on most criteria. We suggest researchers use a combination of SMT and either Hull, the EKC, or traditional PA, because the number of factors was almost always correctly retrieved if those methods converged. When the results of this combination rule are inconclusive, traditional PA, CD, and the EKC performed comparatively well. However, disagreement also suggests that factors will be harder to detect, increasing sample size requirements to N ≥ 500. (PsycINFO Database Record (c) 2019 APA, all rights reserved).
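
As an illustration of one of the traditional criteria compared here, below is a minimal implementation of Horn's parallel analysis on simulated data; real implementations differ in details (PCA versus common-factor eigenvalues, resampling scheme, cut-off percentile), so treat this as a sketch rather than a reference implementation.

```python
import numpy as np

def parallel_analysis(X, n_draws=100, quantile=95, seed=0):
    """Retain leading eigenvalues that exceed the chosen percentile of random-data eigenvalues."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs_eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    rand_eig = np.empty((n_draws, p))
    for i in range(n_draws):
        R = rng.normal(size=(n, p))
        rand_eig[i] = np.linalg.eigvalsh(np.corrcoef(R, rowvar=False))[::-1]
    threshold = np.percentile(rand_eig, quantile, axis=0)
    retained = 0
    for o, t in zip(obs_eig, threshold):   # stop at the first eigenvalue below threshold
        if o > t:
            retained += 1
        else:
            break
    return retained

# Toy data: 3 factors, 4 indicators each (loading 0.7), n = 300
rng = np.random.default_rng(5)
F = rng.normal(size=(300, 3))
loadings = np.zeros((12, 3))
for k in range(3):
    loadings[4 * k:4 * (k + 1), k] = 0.7
X = F @ loadings.T + rng.normal(0, 0.6, (300, 12))
print("factors retained:", parallel_analysis(X))
```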

Journal ArticleDOI
TL;DR: The study highlights the scarcity of research in training set size determination methodologies applied to ML in medical imaging, emphasizes the need to standardize current reporting practices, and guides future work in development and streamlining of pre hoc and post hoc sample size approaches.
Abstract: Purpose The required training sample size for a particular machine learning (ML) model applied to medical imaging data is often unknown. The purpose of this study was to provide a descriptive review of current sample-size determination methodologies in ML applied to medical imaging and to propose recommendations for future work in the field. Methods We conducted a systematic literature search of articles using Medline and Embase with keywords including “machine learning,” “image,” and “sample size.” The search included articles published between 1946 and 2018. Data regarding the ML task, sample size, and train-test pipeline were collected. Results A total of 167 articles were identified, of which 22 were included for qualitative analysis. There were only 4 studies that discussed sample-size determination methodologies, and 18 that tested the effect of sample size on model performance as part of an exploratory analysis. The observed methods could be categorized as pre hoc model-based approaches, which relied on features of the algorithm, or post hoc curve-fitting approaches requiring empirical testing to model and extrapolate algorithm performance as a function of sample size. Between studies, we observed great variability in performance testing procedures used for curve-fitting, model assessment methods, and reporting of confidence in sample sizes. Conclusions Our study highlights the scarcity of research in training set size determination methodologies applied to ML in medical imaging, emphasizes the need to standardize current reporting practices, and guides future work in development and streamlining of pre hoc and post hoc sample size approaches.
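
The "post hoc curve-fitting" approach the review describes is often operationalized by fitting an inverse power law to accuracy measured at a few training-set sizes and extrapolating; the sketch below uses that common functional form with hypothetical accuracy values (neither the data nor the exact model are taken from the review).

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, b, c):
    return a - b * np.power(n, -c)          # accuracy approaches the plateau `a`

# Hypothetical measured points: (training size, test accuracy)
n_obs = np.array([50, 100, 200, 400, 800])
acc_obs = np.array([0.62, 0.68, 0.73, 0.76, 0.78])

params, _ = curve_fit(learning_curve, n_obs, acc_obs, p0=[0.85, 1.0, 0.5], maxfev=10000)
a, b, c = params
for n_target in (1600, 5000):
    print(f"predicted accuracy at n = {n_target}: {learning_curve(n_target, a, b, c):.3f}")
print(f"estimated plateau: {a:.3f}")
```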

Journal ArticleDOI
TL;DR: It is proposed that the minimum value of n should meet four key criteria: small optimism in predictor effect estimates as defined by a global shrinkage factor of ≥0.9, a small absolute difference between the apparent and adjusted R2, precise estimation of the model's residual standard deviation, and precise estimation of the mean predicted outcome value (model intercept).
Abstract: In the medical literature, hundreds of prediction models are being developed to predict health outcomes in individuals. For continuous outcomes, typically a linear regression model is developed to predict an individual's outcome value conditional on values of multiple predictors (covariates). To improve model development and reduce the potential for overfitting, a suitable sample size is required in terms of the number of subjects (n) relative to the number of predictor parameters (p) for potential inclusion. We propose that the minimum value of n should meet the following four key criteria: (i) small optimism in predictor effect estimates as defined by a global shrinkage factor of ≥0.9; (ii) small absolute difference of ≤ 0.05 in the apparent and adjusted R2 ; (iii) precise estimation (a margin of error ≤ 10% of the true value) of the model's residual standard deviation; and similarly, (iv) precise estimation of the mean predicted outcome value (model intercept). The criteria require prespecification of the user's chosen p and the model's anticipated R2 as informed by previous studies. The value of n that meets all four criteria provides the minimum sample size required for model development. In an applied example, a new model to predict lung function in African-American women using 25 predictor parameters requires at least 918 subjects to meet all criteria, corresponding to at least 36.7 subjects per predictor parameter. Even larger sample sizes may be needed to additionally ensure precise estimates of key predictor effects, especially when important categorical predictors have low prevalence in certain categories.

Journal ArticleDOI
TL;DR: The new method improves the specificity and sensitivity of lists of regions and accurately controls the false discovery rate; its inferential approach, based on a pooled null distribution, can be implemented even when as few as two samples per population are available.
Abstract: With recent advances in sequencing technology, it is now feasible to measure DNA methylation at tens of millions of sites across the entire genome. In most applications, biologists are interested in detecting differentially methylated regions, composed of multiple sites with differing methylation levels among populations. However, current computational approaches for detecting such regions do not provide accurate statistical inference. A major challenge in reporting uncertainty is that a genome-wide scan is involved in detecting these regions, which needs to be accounted for. A further challenge is that sample sizes are limited due to the costs associated with the technology. We have developed a new approach that overcomes these challenges and assesses uncertainty for differentially methylated regions in a rigorous manner. Region-level statistics are obtained by fitting a generalized least squares regression model with a nested autoregressive correlated error structure for the effect of interest on transformed methylation proportions. We develop an inferential approach, based on a pooled null distribution, that can be implemented even when as few as two samples per population are available. Here, we demonstrate the advantages of our method using both experimental data and Monte Carlo simulation. We find that the new method improves the specificity and sensitivity of lists of regions and accurately controls the false discovery rate.
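
The sketch below fits a single region with a generic GLS/AR(1) error model via statsmodels' GLSAR, loosely in the spirit of the described approach; it is not the authors' implementation, it applies the AR structure to the stacked samples for simplicity, and it omits the pooled-null, genome-wide inference step that is the paper's main contribution.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n_cpg, groups = 30, [0, 0, 1, 1]               # one region, 2 samples per population
group_effect = 0.4                             # difference in methylation proportion

rows = []
for g in groups:
    e = np.zeros(n_cpg)
    e[0] = rng.normal(0, 0.25)
    for j in range(1, n_cpg):                  # AR(1) noise along neighbouring CpGs
        e[j] = 0.6 * e[j - 1] + rng.normal(0, 0.2)
    zs = np.arcsin(np.sqrt(0.3 + group_effect * g)) + e   # transformed proportions
    rows += [(zj, g) for zj in zs]

z, grp = np.array(rows).T
res = sm.GLSAR(z, sm.add_constant(grp), rho=1).iterative_fit(maxiter=10)
print("group effect estimate:", res.params[1], "SE:", res.bse[1])
# Caveat: the AR structure here is applied to the stacked samples for simplicity,
# and the paper's key step -- inference against a pooled null distribution across
# regions to control the genome-wide FDR -- is not shown.
```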

Journal ArticleDOI
TL;DR: This research was an attempt to transform the classical statistical machine- learning classification method based on original samples into a deep-learning classification methodbased on data augmentation.

Journal ArticleDOI
TL;DR: In this paper, the relative efficacy of different machine learning regression algorithms for different types of neuroimaging data are evaluated with both real and simulated MRI data, and compared with standard multiple regression.

Journal ArticleDOI
TL;DR: All the evaluations including precision, coherence, stability, and clustering resolution should be taken into consideration when choosing an appropriate tool for cytometry data analysis and decision guidelines are provided for the general reader to more easily choose the most suitable clustering tools.
Abstract: With the expanding applications of mass cytometry in medical research, a wide variety of clustering methods, both semi-supervised and unsupervised, have been developed for data analysis. Selecting the optimal clustering method can accelerate the identification of meaningful cell populations. To address this issue, we compared three classes of performance measures, “precision” as external evaluation, “coherence” as internal evaluation, and stability, of nine methods based on six independent benchmark datasets. Seven unsupervised methods (Accense, Xshift, PhenoGraph, FlowSOM, flowMeans, DEPECHE, and kmeans) and two semi-supervised methods (Automated Cell-type Discovery and Classification and linear discriminant analysis (LDA)) are tested on six mass cytometry datasets. We compute and compare all defined performance measures against random subsampling, varying sample sizes, and the number of clusters for each method. LDA reproduces the manual labels most precisely but does not rank top in internal evaluation. PhenoGraph and FlowSOM perform better than other unsupervised tools in precision, coherence, and stability. PhenoGraph and Xshift are more robust when detecting refined sub-clusters, whereas DEPECHE and FlowSOM tend to group similar clusters into meta-clusters. The performances of PhenoGraph, Xshift, and flowMeans are impacted by increased sample size, but FlowSOM is relatively stable as sample size increases. All the evaluations, including precision, coherence, stability, and clustering resolution, should be considered together when choosing an appropriate tool for cytometry data analysis. Thus, we provide decision guidelines based on these characteristics for the general reader to more easily choose the most suitable clustering tools.
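
The three evaluation angles can be illustrated on toy data as below, with plain KMeans standing in for the specialised cytometry tools benchmarked in the paper: "precision" as agreement with manual labels (adjusted Rand index), "coherence" as internal quality (silhouette), and "stability" as agreement between runs on random subsamples. The metrics chosen here are common proxies, not necessarily the exact ones used in the study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, manual_labels = make_blobs(n_samples=3000, centers=6, cluster_std=1.5, random_state=0)
pred = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

precision = adjusted_rand_score(manual_labels, pred)   # external evaluation
coherence = silhouette_score(X, pred)                  # internal evaluation

# Stability: cluster two random subsamples and compare labels on their overlap
rng = np.random.default_rng(0)
idx1, idx2 = (rng.choice(len(X), 2000, replace=False) for _ in range(2))
overlap = np.intersect1d(idx1, idx2)
l1 = KMeans(n_clusters=6, n_init=10, random_state=1).fit_predict(X[idx1])
l2 = KMeans(n_clusters=6, n_init=10, random_state=2).fit_predict(X[idx2])
pos1 = {v: i for i, v in enumerate(idx1)}
pos2 = {v: i for i, v in enumerate(idx2)}
stability = adjusted_rand_score([l1[pos1[i]] for i in overlap],
                                [l2[pos2[i]] for i in overlap])

print(f"precision (ARI) = {precision:.2f}, coherence (silhouette) = {coherence:.2f}, "
      f"stability (ARI on overlap) = {stability:.2f}")
```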

Journal ArticleDOI
29 May 2019-PLOS ONE
TL;DR: The present study revealed three key findings: many of the primary studies used a small sample size; small sample size bias was pronounced in many of the analyses; and when small sample size bias was taken into account, the effect of PPIs on well-being was small but significant, whereas the effect on depression was variable, dependent on outliers, and generally not statistically significant.
Abstract: For at least four decades, researchers have studied the effectiveness of interventions designed to increase well-being. These interventions have become known as positive psychology interventions (PPIs). Two highly cited meta-analyses examined the effectiveness of PPIs on well-being and depression: Sin and Lyubomirsky (2009) and Bolier et al. (2013). Sin and Lyubomirsky reported larger effects of PPIs on well-being (r = .29) and depression (r = .31) than Bolier et al. reported for subjective well-being (r = .17), psychological well-being (r = .10), and depression (r = .11). A detailed examination of the two meta-analyses reveals that the authors employed different approaches, used different inclusion and exclusion criteria, analyzed different sets of studies, described their methods with insufficient detail to compare them clearly, and did not report or properly account for significant small sample size bias. The first objective of the current study was to reanalyze the studies selected in each of the published meta-analyses, while taking into account small sample size bias. The second objective was to replicate each meta-analysis by extracting relevant effect sizes directly from the primary studies included in the meta-analyses. The present study revealed three key findings: (1) many of the primary studies used a small sample size; (2) small sample size bias was pronounced in many of the analyses; and (3) when small sample size bias was taken into account, the effect of PPIs on well-being was small but significant (approximately r = .10), whereas the effect of PPIs on depression was variable, dependent on outliers, and generally not statistically significant. Future PPI research needs to focus on increasing sample sizes. Future meta-analyses of this research need to assess cumulative effects from a comprehensive collection of primary studies while being mindful of issues such as small sample size bias.
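
A small numeric illustration of why small-study inflation matters when pooling (toy numbers, not data from either meta-analysis): inverse-variance weighting on the Fisher-z scale already down-weights small studies, yet inflated small-study effects can still pull the pooled estimate upward.

```python
import numpy as np

# (effect size r, sample size n) for hypothetical primary studies
studies = [(0.45, 20), (0.38, 25), (0.30, 40), (0.12, 150), (0.08, 300), (0.10, 250)]

def pooled_r(studies):
    z = np.array([np.arctanh(r) for r, _ in studies])      # Fisher z transform
    w = np.array([n - 3 for _, n in studies], dtype=float) # inverse-variance weights
    return np.tanh(np.sum(w * z) / np.sum(w))

print(f"all studies:    pooled r = {pooled_r(studies):.3f}")
print(f"n >= 100 only:  pooled r = {pooled_r([s for s in studies if s[1] >= 100]):.3f}")
```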

Journal ArticleDOI
TL;DR: The empirical results suggest that the relationship between HDL-C and CAD is heterogeneous, and it may be too soon to completely dismiss the HDL hypothesis.
Abstract: BACKGROUND Summary-data Mendelian randomization (MR) has become a popular research design to estimate the causal effect of risk exposures. With the sample size of GWAS continuing to increase, it is now possible to use genetic instruments that are only weakly associated with the exposure. DEVELOPMENT We propose a three-sample genome-wide design where typically 1000 independent genetic instruments across the whole genome are used. We develop an empirical partially Bayes statistical analysis approach where instruments are weighted according to their strength; thus weak instruments bring less variation to the estimator. The estimator is highly efficient with many weak genetic instruments and is robust to balanced and/or sparse pleiotropy. APPLICATION We apply our method to estimate the causal effect of body mass index (BMI) and major blood lipids on cardiovascular disease outcomes, and obtain substantially shorter confidence intervals (CIs). In particular, the estimated causal odds ratio of BMI on ischaemic stroke is 1.19 (95% CI: 1.07-1.32, P-value <0.001); the estimated causal odds ratio of high-density lipoprotein cholesterol (HDL-C) on coronary artery disease (CAD) is 0.78 (95% CI: 0.73-0.84, P-value <0.001). However, the estimated effect of HDL-C attenuates and becomes statistically non-significant when we only use strong instruments. CONCLUSIONS A genome-wide design can greatly improve the statistical power of MR studies. Robust statistical methods may alleviate but not solve the problem of horizontal pleiotropy. Our empirical results suggest that the relationship between HDL-C and CAD is heterogeneous, and it may be too soon to completely dismiss the HDL hypothesis.

Journal ArticleDOI
26 Apr 2019
TL;DR: In this paper, the authors proposed two relaxations called the Randomized Conditional Independence Test (RCIT) and the Randomized Conditional Correlation Test (RCoT), which both approximate KCIT by utilizing random Fourier features.
Abstract: Constraint-based causal discovery (CCD) algorithms require fast and accurate conditional independence (CI) testing. The Kernel Conditional Independence Test (KCIT) is currently one of the most popular CI tests in the non-parametric setting, but many investigators cannot use KCIT with large datasets because the test scales at least quadratically with sample size. We therefore devise two relaxations called the Randomized Conditional Independence Test (RCIT) and the Randomized conditional Correlation Test (RCoT) which both approximate KCIT by utilizing random Fourier features. In practice, both of the proposed tests scale linearly with sample size and return accurate p-values much faster than KCIT in the large sample size context. CCD algorithms run with RCIT or RCoT also return graphs at least as accurate as the same algorithms run with KCIT but with large reductions in run time.

Journal ArticleDOI
TL;DR: The results indicate that in certain design configurations, including the one corresponding to the proposed trial, a correlation decay can have an important impact on variances of treatment effect estimators, and hence on sample size and power.
Abstract: Stepped wedge and cluster randomised crossover trials are examples of cluster randomised designs conducted over multiple time periods that are being used with increasing frequency in health research. Recent systematic reviews of both of these designs indicate that the within-cluster correlation is typically taken account of in the analysis of data using a random intercept mixed model, implying a constant correlation between any two individuals in the same cluster no matter how far apart in time they are measured: within-period and between-period intra-cluster correlations are assumed to be identical. Recently proposed extensions allow the within- and between-period intra-cluster correlations to differ, although these methods require that all between-period intra-cluster correlations are identical, which may not be appropriate in all situations. Motivated by a proposed intensive care cluster randomised trial, we propose an alternative correlation structure for repeated cross-sectional multiple-period cluster randomised trials in which the between-period intra-cluster correlation is allowed to decay depending on the distance between measurements. We present results for the variance of treatment effect estimators for varying amounts of decay, investigating the consequences of the variation in decay on sample size planning for stepped wedge, cluster crossover and multiple-period parallel-arm cluster randomised trials. We also investigate the impact of assuming constant between-period intra-cluster correlations instead of decaying between-period intra-cluster correlations. Our results indicate that in certain design configurations, including the one corresponding to the proposed trial, a correlation decay can have an important impact on variances of treatment effect estimators, and hence on sample size and power. An R Shiny app allows readers to interactively explore the impact of correlation decay.
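
The consequence for design calculations can be sketched with a generic GLS variance computation on cluster-period means for a standard stepped wedge (one cluster switching per period). This is not the authors' code or exact parameterisation; the ICC, cluster-period size and decay values are illustrative.

```python
import numpy as np

def trt_var(n_clusters=10, m=20, icc=0.05, decay=1.0, sigma2=1.0):
    """GLS variance of the treatment effect estimator from cluster-period means."""
    T = n_clusters + 1                                   # periods
    s2_c, s2_e = icc * sigma2, (1 - icc) * sigma2
    t = np.arange(T)
    # Covariance of cluster-period means within one cluster: between-period part decays
    V1 = s2_c * decay ** np.abs(t[:, None] - t[None, :])
    V1[np.diag_indices(T)] = s2_c + s2_e / m             # within-period variance of the mean
    Vinv = np.linalg.inv(V1)

    XtVX = np.zeros((T + 1, T + 1))                      # columns: T period effects + treatment
    for i in range(n_clusters):
        X = np.zeros((T, T + 1))
        X[:, :T] = np.eye(T)                             # period fixed effects
        X[:, T] = (t > i).astype(float)                  # cluster i switches after period i
        XtVX += X.T @ Vinv @ X
    return np.linalg.inv(XtVX)[T, T]                     # variance of the treatment effect

for decay in (1.0, 0.8, 0.5):
    print(f"decay per period = {decay:.1f}: var(treatment effect) = {trt_var(decay=decay):.4f}")
# Assuming a constant between-period correlation (decay = 1) when it actually decays
# can noticeably misstate this variance, and hence power and sample size.
```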

Journal ArticleDOI
TL;DR: Simulation studies are used to investigate the disjunctive power, marginal power and FWER obtained after applying Bonferroni, Holm, Hochberg, Dubey/Armitage-Parmar and Stepdown-minP adjustment methods.
Abstract: Multiple primary outcomes may be specified in randomised controlled trials (RCTs). When analysing multiple outcomes, it is important to control the family wise error rate (FWER). A popular approach to do this is to adjust the p-values corresponding to each statistical test used to investigate the intervention effects by using the Bonferroni correction. It is also important to consider the power of the trial to detect true intervention effects. In the context of multiple outcomes, depending on the clinical objective, the power can be defined as: ‘disjunctive power’, the probability of detecting at least one true intervention effect across all the outcomes, or ‘marginal power’, the probability of finding a true intervention effect on a nominated outcome. We provide practical recommendations on which method may be used to adjust for multiple comparisons in the sample size calculation and the analysis of RCTs with multiple primary outcomes. We also discuss the implications on the sample size for obtaining 90% disjunctive power and 90% marginal power. We use simulation studies to investigate the disjunctive power, marginal power and FWER obtained after applying Bonferroni, Holm, Hochberg, Dubey/Armitage-Parmar and Stepdown-minP adjustment methods. Different simulation scenarios were constructed by varying the number of outcomes, degree of correlation between the outcomes, intervention effect sizes and proportion of missing data. The Bonferroni and Holm methods provide the same disjunctive power. The Hochberg and Hommel methods provide power gains for the analysis, albeit small, in comparison to the Bonferroni method. The Stepdown-minP procedure performs well for complete data. However, it removes participants with missing values prior to the analysis, resulting in a loss of power when there are missing data. The sample size requirement to achieve the desired disjunctive power may be smaller than that required to achieve the desired marginal power. The choice of whether to specify disjunctive or marginal power should depend on the clinical objective.
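
A simulation sketch in the spirit of the paper is given below: disjunctive power (detecting at least one true effect) and marginal power for correlated outcomes under different p-value adjustments, using statsmodels' multipletests. The scenario (three outcomes with two true effects, correlation 0.5) is an arbitrary choice, not one of the paper's scenarios.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def power_sim(n_per_arm=100, effects=(0.3, 0.3, 0.0), rho=0.5,
              alpha=0.05, method="holm", n_sims=2000, seed=0):
    rng = np.random.default_rng(seed)
    k = len(effects)
    cov = rho * np.ones((k, k)) + (1 - rho) * np.eye(k)   # exchangeable outcome correlation
    disj = marg = 0
    for _ in range(n_sims):
        a = rng.multivariate_normal(np.zeros(k), cov, n_per_arm)       # control arm
        b = rng.multivariate_normal(np.array(effects), cov, n_per_arm) # intervention arm
        p = np.array([stats.ttest_ind(b[:, j], a[:, j]).pvalue for j in range(k)])
        rej = multipletests(p, alpha=alpha, method=method)[0]
        truth = np.array(effects) > 0
        disj += bool(np.any(rej & truth))        # at least one true effect detected
        marg += bool(rej[0])                     # the nominated first outcome detected
    return disj / n_sims, marg / n_sims

for m in ("bonferroni", "holm", "simes-hochberg"):
    d, g = power_sim(method=m)
    print(f"{m:15s} disjunctive = {d:.2f}, marginal(outcome 1) = {g:.2f}")
```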

Journal ArticleDOI
TL;DR: This meta-analysis assesses effect sizes for statistically significant group-level differences between individuals with autism and control individuals for 5 distinct psychological constructs and 2 neurologic markers.
Abstract: Importance The definition and nature of autism have been highly debated, as exemplified by several revisions of the DSM (DSM-III, DSM-IIIR, DSM-IV, and DSM-5) criteria. There has recently been a move from a categorical view toward a spectrum-based view. These changes have been accompanied by a steady increase in the prevalence of the condition. Changes in the definition of autism that may increase heterogeneity could affect the results of autism research; specifically, a broadening of the population with autism could result in decreasing effect sizes of group comparison studies. Objective To examine the correlation between publication year and effect size of autism-control group comparisons across several domains of published autism neurocognitive research. Data Sources This meta-analysis investigated 11 meta-analyses obtained through a systematic search of PubMed for meta-analyses published from January 1, 1966, through January 27, 2019, using the search string autism AND (meta-analysis OR meta-analytic). The last search was conducted on January 27, 2019. Study Selection Meta-analyses were included if they tested the significance of group differences between individuals with autism and control individuals on a neurocognitive construct. Meta-analyses were only included if the tested group difference was significant and included data with a span of at least 15 years. Data Extraction and Synthesis Data were extracted and analyzed according to the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) reporting guideline using fixed-effects models. Main Outcomes and Measures Estimated slope of the correlation between publication year and effect size, controlling for differences in methods, sample size, and study quality. Results The 11 meta-analyses included data from a total of 27 723 individuals. Demographic data such as sex and age were not available for the entire data set. Seven different psychological and neurologic constructs were analyzed based on data from these meta-analyses. Downward temporal trends for effect size were found for all constructs (slopes: –0.067 to –0.003), with the trend being significant in 5 of 7 cases: emotion recognition (slope: –0.028 [95% CI, –0.048 to –0.007]), theory of mind (–0.045 [95% CI, –0.066 to –0.024]), planning (–0.067 [95% CI, –0.125 to –0.009]), P3b amplitude (–0.048 [95% CI, –0.093 to –0.004]), and brain size (–0.047 [95% CI, –0.077 to –0.016]). In contrast, 3 analogous constructs in schizophrenia, a condition that is also heterogeneous but with no reported increase in prevalence, did not show a similar trend. Conclusions and Relevance The findings suggest that differences between individuals with autism and those without the diagnosis have decreased over time and that possible changes in the definition of autism from a narrowly defined and homogenous population toward an inclusive and heterogeneous population may reduce our capacity to build mechanistic models of the condition.

Journal ArticleDOI
TL;DR: In this paper, the authors explore the effect of the number of training presences and of random points on species distribution model performance, and find that a large number of random points (NRP) is not always an appropriate strategy.
Abstract: Most high-performing species distribution modelling techniques require both presences, and either absences or pseudo-absences or background points. In this paper, we explore the effect of sample size, towards developing improved strategies for modelling. We generated 1800 virtual species with three levels of prevalence using ten modelling techniques, while varying the number of training presences (NTP) and the number of random points (NRP representing pseudo-absences or background sites). For five of the ten modelling techniques we built two versions of models: one with an equal total weight (ETW) setting where the total weight for pseudo-absence is equivalent to the total weight for presence, and another with an unequal total weight (UTW) setting where the total weight for pseudo-absence is not required to be equal to the total weight for presence. We compared two strategies for NRP: a small multiplier strategy (i.e. setting NRP at a few times as large as NTP), and a large number strategy (i.e. using numerous random points). We produced ensemble models (by averaging the predictions from 30 models built with the same set of training presences and different sets of random points in equivalent numbers) for three NTP magnitudes and two NRP strategies. We found that model accuracy altered as NRP increased with four distinct patterns of performance: increasing, decreasing, arch-shaped and horizontal. In most cases ETW improved model performance. Ensemble models had higher accuracy than the corresponding single models, and this improvement was pronounced when NTP was low. We conclude that a large NRP is not always an appropriate strategy. The best choice for NRP will depend on the modelling techniques used, species prevalence and NTP. We recommend building ensemble models instead of single models, using the small multiplier strategy for NRP with ETW, especially when only a small number of species presence records are available.

Journal ArticleDOI
01 Jul 2019-Catena
TL;DR: In this paper, the effect of different sample sizes and raster resolutions on landslide susceptibility modeling and the prediction accuracy of shallow landslides was evaluated; the Bijar region of the Kurdistan province (Iran) was selected as a case study.
Abstract: Understanding landslide characteristics such as their locations, dimensions, and spatial distribution is of high importance in landslide modeling and prediction. The main objective of this study was to assess the effect of different sample sizes and raster resolutions on landslide susceptibility modeling and the prediction accuracy of shallow landslides. In this regard, the Bijar region of the Kurdistan province (Iran) was selected as a case study. Accordingly, a total of 20 landslide conditioning factors were considered, and six different raster resolutions (10 m, 15 m, 20 m, 30 m, 50 m, and 100 m) and four different sample sizes (60/40%, 70/30%, 80/20%, and 90/10%) were investigated. The merit of each conditioning factor was assessed using the Information Gain Ratio (IGR) technique, whereas the Alternating Decision Tree (ADTree), which has rarely been explored for landslide modeling, was used for building models. Performance of the models was assessed using the area under the ROC curve (AUROC), sensitivity, specificity, accuracy, kappa and RMSE criteria. The results show that as the number of training pixels in the modeling process increases, the accuracy increases. Findings also indicate that for the sample sizes of 60/40% (AUROC = 0.800) and 70/30% (AUROC = 0.899), the highest prediction accuracy is obtained with a raster resolution of 10 m, while with a raster resolution of 20 m, the highest prediction accuracy is obtained for the sample sizes of 80/20% (AUROC = 0.871) and 90/10% (AUROC = 0.864). These outcomes provide a guideline for future research, enabling researchers to select an optimal data resolution for landslide hazard modeling.

Journal ArticleDOI
J Uttley
25 Jan 2019-Leukos
TL;DR: Addressing issues raised in this article related to sample sizes, statistical test assumptions, and reporting of effect sizes can improve the evidential value of lighting research.
Abstract: The reporting of accurate and appropriate conclusions is an essential aspect of scientific research, and failure in this endeavor can threaten the progress of cumulative knowledge. This is highligh...

Journal ArticleDOI
TL;DR: Individual studies applying a-tDCS over the prefrontal and motor cortices either before or during dynamic muscle strength testing showed positive results, but performing a meta-analysis was not possible.