
Showing papers on "Sample size determination published in 2020"


Journal ArticleDOI
18 Mar 2020-BMJ
TL;DR: In this article, the authors provide guidance on how to calculate the sample size required to develop a clinical prediction model.
Abstract: Clinical prediction models aim to predict outcomes in individuals, to inform diagnosis or prognosis in healthcare. Hundreds of prediction models are published in the medical literature each year, yet many are developed using a dataset that is too small for the total number of participants or outcome events. This leads to inaccurate predictions and consequently incorrect healthcare decisions for some individuals. In this article, the authors provide guidance on how to calculate the sample size required to develop a clinical prediction model.
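The criteria behind this guidance revolve around limiting expected overfitting (shrinkage) for a planned number of predictor parameters. As a hedged illustration of that style of calculation, the sketch below computes the sample size needed to keep the expected global shrinkage factor at or above 0.9 for a given number of predictor parameters and an anticipated Cox-Snell R²; the formula and the function name are assumptions drawn from the wider prediction-model sample size literature, not a restatement of the article itself.

```python
import math

def min_sample_size_shrinkage(p, r2_cs, target_shrinkage=0.9):
    """Sample size so the expected global shrinkage factor is >= target.

    p: number of candidate predictor parameters
    r2_cs: anticipated Cox-Snell R-squared of the model
    target_shrinkage: desired expected shrinkage factor S (e.g. 0.9)

    Uses the commonly cited criterion n = p / ((S - 1) * ln(1 - R2_CS / S)),
    taken here as an assumption rather than quoted from the article.
    """
    s = target_shrinkage
    n = p / ((s - 1) * math.log(1 - r2_cs / s))
    return math.ceil(n)

# Example: 20 predictor parameters, anticipated Cox-Snell R^2 of 0.15
print(min_sample_size_shrinkage(p=20, r2_cs=0.15))
```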

646 citations


Journal ArticleDOI
05 May 2020-PLOS ONE
TL;DR: This work describes and validates a simple-to-apply method for assessing and reporting on saturation in the context of inductive thematic analyses and proposes a more flexible approach to reporting saturation.
Abstract: Data saturation is the most commonly employed concept for estimating sample sizes in qualitative research. Over the past 20 years, scholars using both empirical research and mathematical/statistical models have made significant contributions to the question: How many qualitative interviews are enough? This body of work has advanced the evidence base for sample size estimation in qualitative inquiry during the design phase of a study, prior to data collection, but it does not provide qualitative researchers with a simple and reliable way to determine the adequacy of sample sizes during and/or after data collection. Using the principle of saturation as a foundation, we describe and validate a simple-to-apply method for assessing and reporting on saturation in the context of inductive thematic analyses. Following a review of the empirical research on data saturation and sample size estimation in qualitative research, we propose an alternative way to evaluate saturation that overcomes the shortcomings and challenges associated with existing methods identified in our review. Our approach includes three primary elements in its calculation and assessment: Base Size, Run Length, and New Information Threshold. We additionally propose a more flexible approach to reporting saturation. To validate our method, we use a bootstrapping technique on three existing thematically coded qualitative datasets generated from in-depth interviews. Results from this analysis indicate the method we propose to assess and report on saturation is feasible and congruent with findings from earlier studies.
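To make the Base Size, Run Length, and New Information Threshold elements concrete, here is a minimal sketch (not the authors' published procedure) that treats each interview as a set of theme codes, computes the proportion of new themes contributed by each run relative to the base, and flags the point at which that proportion falls below a chosen threshold. Function names and default values are illustrative.

```python
def saturation_point(themes_per_interview, base_size=4, run_length=2, threshold=0.05):
    """Return the interview count at which new information falls below threshold.

    themes_per_interview: list of sets, unique theme codes found in each interview
    base_size: number of initial interviews used to define the base of themes
    run_length: number of subsequent interviews evaluated as one run
    threshold: new-information proportion below which saturation is declared
    """
    base = set().union(*themes_per_interview[:base_size])
    seen = set(base)
    i = base_size
    while i + run_length <= len(themes_per_interview):
        run = set().union(*themes_per_interview[i:i + run_length])
        new = run - seen
        proportion_new = len(new) / len(base) if base else 0.0
        if proportion_new <= threshold:
            return i + run_length  # saturation reached after this run
        seen |= new
        i += run_length
    return None  # threshold never reached with the data provided

# Illustrative data: each set holds the theme codes identified in one interview
interviews = [{"a", "b"}, {"b", "c"}, {"a", "d"}, {"c", "e"},
              {"a", "c"}, {"b", "e"}, {"f"}, {"a"}]
print(saturation_point(interviews))
```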

640 citations


Journal ArticleDOI
TL;DR: In some cases the comparison of two models using ICs can be viewed as equivalent to a likelihood ratio test, with the different criteria representing different alpha levels and BIC being a more conservative test than AIC.
Abstract: Choosing a model with too few parameters can involve making unrealistically simple assumptions and lead to high bias, poor prediction, and missed opportunities for insight. Such models are not flexible enough to describe the sample or the population well. A model with too many parameters can fit the observed data very well, but be too closely tailored to it. Such models may generalize poorly. Penalized-likelihood information criteria, such as Akaike’s Information Criterion (AIC), the Bayesian Information Criterion (BIC), the Consistent AIC, and the Adjusted BIC, are widely used for model selection. However, different criteria sometimes support different models, leading to uncertainty about which criterion is the most trustworthy. In some simple cases the comparison of two models using information criteria can be viewed as equivalent to a likelihood ratio test, with the different criteria representing different alpha levels (i.e., different emphases on sensitivity or specificity; Lin & Dayton 1997). This perspective may lead to insights about how to interpret the criteria in less simple situations. For example, AIC or BIC could be preferable, depending on sample size and on the relative importance one assigns to sensitivity versus specificity. Understanding the differences among the criteria may make it easier to compare their results and to use them to make informed decisions.
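As a hedged sketch of the equivalence described above, the code below considers two nested models differing by k parameters: preferring the larger model by AIC corresponds to a likelihood ratio test with critical value 2k, and by BIC with critical value k·ln(n), so the implied alpha levels follow from the chi-squared distribution with k degrees of freedom.

```python
import math
from scipy.stats import chi2

def implied_alpha(k_extra_params, n):
    """Alpha implicitly used when AIC or BIC chooses between nested models.

    For models differing by k parameters, AIC prefers the larger model when the
    likelihood-ratio statistic exceeds 2*k; BIC requires it to exceed k*ln(n).
    """
    aic_alpha = chi2.sf(2 * k_extra_params, df=k_extra_params)
    bic_alpha = chi2.sf(k_extra_params * math.log(n), df=k_extra_params)
    return aic_alpha, bic_alpha

for n in (50, 500, 5000):
    a, b = implied_alpha(k_extra_params=1, n=n)
    print(f"n={n}: AIC-implied alpha={a:.3f}, BIC-implied alpha={b:.4f}")
```

With one extra parameter, AIC behaves like a test at roughly alpha = 0.157 regardless of n, while BIC's implied alpha shrinks as n grows, which is the sense in which BIC is the more conservative criterion.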

444 citations


Journal ArticleDOI
TL;DR: This Viewpoint proposes a simple way to highlight both experimental reproducibility and cell-to-cell variation, while avoiding pitfalls common in analysis of cell biology data.
Abstract: P values and error bars help readers infer whether a reported difference would likely recur, with the sample size n used for statistical tests representing biological replicates, independent measurements of the population from separate experiments. We provide examples and practical tutorials for creating figures that communicate both the cell-level variability and the experimental reproducibility.

330 citations


Journal ArticleDOI
21 Feb 2020-PLOS ONE
TL;DR: A minimum N = 8 is informative given very little variance, but minimum N ≥ 25 is required for more variance, and alternative models are better compared using information theory indices such as AIC but not R2 or adjusted R2.
Abstract: Regressions and meta-regressions are widely used to estimate patterns and effect sizes in various disciplines. However, many biological and medical analyses use relatively low sample size (N), contributing to concerns on reproducibility. What is the minimum N to identify the most plausible data pattern using regressions? Statistical power analysis is often used to answer that question, but it has its own problems and logically should follow model selection to first identify the most plausible model. Here we make null, simple linear and quadratic data with different variances and effect sizes. We then sample and use information theoretic model selection to evaluate minimum N for regression models. We also evaluate the use of coefficient of determination (R2) for this purpose; it is widely used but not recommended. With very low variance, both false positives and false negatives occurred at N < 8, but data shape was always clearly identified at N ≥ 8. With high variance, accurate inference was stable at N ≥ 25. Those outcomes were consistent at different effect sizes. Akaike Information Criterion weights (AICc wi) were essential to clearly identify patterns (e.g., simple linear vs. null); R2 or adjusted R2 values were not useful. We conclude that a minimum N = 8 is informative given very little variance, but minimum N ≥ 25 is required for more variance. Alternative models are better compared using information theory indices such as AIC but not R2 or adjusted R2. Insufficient N and R2-based model selection apparently contribute to confusion and low reproducibility in various disciplines. To avoid those problems, we recommend that research based on regressions or meta-regressions use N ≥ 25.
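The sketch below is a minimal reconstruction, not the authors' simulation code: it fits null, linear, and quadratic regressions to one simulated dataset and compares them with AICc weights, the kind of information-theoretic comparison the paper evaluates across sample sizes.

```python
import numpy as np

def aicc(rss, n, k):
    """AICc for a least-squares fit with k parameters (including the error variance)."""
    aic = n * np.log(rss / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)

def aicc_weights(x, y):
    """Akaike weights for null (degree 0), linear, and quadratic polynomial fits."""
    n = len(y)
    scores = []
    for degree in (0, 1, 2):
        coeffs = np.polyfit(x, y, degree)
        rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
        scores.append(aicc(rss, n, k=degree + 2))  # coefficients plus error variance
    scores = np.array(scores)
    delta = scores - scores.min()
    w = np.exp(-0.5 * delta)
    return w / w.sum()

rng = np.random.default_rng(1)
n = 25
x = rng.uniform(0, 10, n)
y = 2.0 + 0.8 * x + rng.normal(scale=2.0, size=n)  # truly linear data with noise
print(aicc_weights(x, y))  # the linear model's weight should dominate at n >= 25
```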

263 citations


Journal ArticleDOI
01 Jun 2020
TL;DR: In this article, the authors discuss the factors that influence sample size decisions and present guidelines for performing power analysis using the G*Power program; they caution, however, that researchers should not blindly follow these rules.
Abstract: Determining an appropriate sample size is vital in drawing realistic conclusions from research findings. Although there are several widely adopted rules of thumb to calculate sample size, researchers remain unclear about which one to consider when determining sample size in their respective studies. ‘How large should the sample be?’ is one of the most frequently asked questions in survey research. The objective of this editorial is three-fold. First, we discuss the factors that influence sample size decisions. Second, we review existing rules of thumb related to the calculation of sample size. Third, we present guidelines for performing power analysis using the G*Power programme. There is, however, a caveat: we urge researchers not to blindly follow these rules. Such rules or guidelines should be understood in their specific contexts and under the conditions in which they were prescribed. We hope that this editorial not only provides researchers with a fundamental understanding of sample size and its associated issues, but also facilitates their consideration of sample size determination in their own studies.
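For readers who prefer a scriptable counterpart to G*Power, the sketch below runs an a priori power analysis for an independent-samples t-test with statsmodels; the effect size, alpha, and power values are illustrative choices, not recommendations.

```python
from statsmodels.stats.power import TTestIndPower

# A priori power analysis for an independent-samples t-test, analogous to
# G*Power's "t tests - Means: difference between two independent means"
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,   # Cohen's d (medium, illustrative)
                                    alpha=0.05,
                                    power=0.80,
                                    alternative='two-sided')
print(f"Required sample size per group: {n_per_group:.1f}")
```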

234 citations


Posted ContentDOI
22 Aug 2020-bioRxiv
TL;DR: It is shown that the pairing of small brain-behavioral phenotype effect sizes with sampling variability is a key element in widespread BWAS replication failure, and that large consortia are needed to usher in a new era of reproducible human brain-wide association studies.
Abstract: Magnetic resonance imaging (MRI) continues to drive many important neuroscientific advances. However, progress in uncovering reproducible associations between individual differences in brain structure/function and behavioral phenotypes (e.g., cognition, mental health) may have been undermined by typical neuroimaging sample sizes (median N=25)1,2. Leveraging the Adolescent Brain Cognitive Development (ABCD) Study3 (N=11,878), we estimated the effect sizes and reproducibility of these brain-wide association studies (BWAS) as a function of sample size. The very largest, replicable brain-wide associations for univariate and multivariate methods were r=0.14 and r=0.34, respectively. In smaller samples, typical for BWAS, irreproducible, inflated effect sizes were ubiquitous, no matter the method (univariate, multivariate). Until sample sizes started to approach consortium levels, BWAS were underpowered and statistical errors were assured. Multiple factors contribute to replication failures4–6; here, we show that the pairing of small brain-behavioral phenotype effect sizes with sampling variability is a key element in widespread BWAS replication failure. Brain-behavioral phenotype associations stabilize and become more reproducible with sample sizes of N⪆2,000. While investigator-initiated brain-behavior research continues to generate hypotheses and propel innovation, large consortia are needed to usher in a new era of reproducible human brain-wide association studies.
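The core claim, that small true brain-behavior correlations plus sampling variability yield inflated and unstable estimates at typical sample sizes, can be illustrated with a short simulation. This is a hedged toy sketch of the general phenomenon, not the authors' ABCD analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
true_r = 0.1          # small population brain-behavior correlation
cov = [[1, true_r], [true_r, 1]]

for n in (25, 200, 2000):
    estimates = []
    for _ in range(2000):
        sample = rng.multivariate_normal([0, 0], cov, size=n)
        estimates.append(np.corrcoef(sample[:, 0], sample[:, 1])[0, 1])
    estimates = np.array(estimates)
    print(f"n={n}: mean r={estimates.mean():.3f}, "
          f"95% range of sample r=({np.quantile(estimates, 0.025):.2f}, "
          f"{np.quantile(estimates, 0.975):.2f})")
```

At n=25 the sample correlation routinely lands far from 0.1 (and often flips sign), whereas the spread narrows sharply by n=2,000.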

175 citations


Journal ArticleDOI
TL;DR: The sample size of highly cited experimental fMRI studies increased at a rate of 0.74 participants per year, and this rate of increase was commensurate with the median sample sizes of neuroimaging studies published in top neuroimaging journals in 2017 and 2018.

173 citations


Journal ArticleDOI
TL;DR: This paper proposes an adjustment to the margin of error in Yamane's (1967) formula to make it applicable for use in determining optimum sample size for both continuous and categorical variables at all levels of confidence.
Abstract: Obtaining a representative sample size remains critical to survey researchers because of its implication for cost, time and precision of the sample estimate. However, the difficulty of obtaining a good estimate of population variance coupled with insufficient skills in sampling theory impede the researchers’ ability to obtain an optimum sample in survey research. This paper proposes an adjustment to the margin of error in Yamane’s (1967) formula to make it applicable for use in determining optimum sample size for both continuous and categorical variables at all levels of confidence. A minimum sample size determination table is developed for use by researchers based on the adjusted formula developed in this paper.
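For context, Yamane's (1967) formula gives the sample size from the population size N and margin of error e as n = N / (1 + N·e²). The sketch below implements that classical formula only; the paper's contribution is an adjustment to the margin of error, which is not reproduced here.

```python
import math

def yamane_sample_size(population_size, margin_of_error=0.05):
    """Classical Yamane (1967) formula: n = N / (1 + N * e^2).

    The paper's adjusted margin of error would be substituted for
    `margin_of_error`; the adjustment itself is not reproduced here.
    """
    n = population_size / (1 + population_size * margin_of_error ** 2)
    return math.ceil(n)

print(yamane_sample_size(10_000))  # roughly 385 at a 5% margin of error
```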

161 citations


Journal ArticleDOI
TL;DR: This work systematically profiled the performance of deep, kernel, and linear models as a function of sample size on UKBiobank brain images against established machine-learning references, benchmarking how prediction performance scales with increasingly sophisticated algorithms and with increasing sample size in reference machine-learning and biomedical datasets.
Abstract: Recently, deep learning has unlocked unprecedented success in various domains, especially using images, text, and speech. However, deep learning is only beneficial if the data have nonlinear relationships and if they are exploitable at available sample sizes. We systematically profiled the performance of deep, kernel, and linear models as a function of sample size on UKBiobank brain images against established machine learning references. On MNIST and Zalando Fashion, prediction accuracy consistently improves when escalating from linear models to shallow-nonlinear models, and further improves with deep-nonlinear models. In contrast, using structural or functional brain scans, simple linear models perform on par with more complex, highly parameterized models in age/sex prediction across increasing sample sizes. In sum, linear models keep improving as the sample size approaches ~10,000 subjects. Yet, nonlinearities for predicting common phenotypes from typical brain scans remain largely inaccessible to the examined kernel and deep learning methods. Schulz et al. systematically benchmark performance scaling with increasingly sophisticated prediction algorithms and with increasing sample size in reference machine-learning and biomedical datasets. Complicated nonlinear intervariable relationships remain largely inaccessible for predicting key phenotypes from typical brain scans.

150 citations


Journal ArticleDOI
TL;DR: This paper proposes to further advance the literature by developing a smoothly weighted estimator for the sample standard deviation that fully utilizes the sample size information and shows that the new estimator provides a more accurate estimate for normal data and also performs favorably for non-normal data.
Abstract: When reporting the results of clinical studies, some researchers may choose the five-number summary (including the sample median, the first and third quartiles, and the minimum and maximum values) rather than the sample mean and standard deviation (SD), particularly for skewed data. For these studies, when included in a meta-analysis, it is often desired to convert the five-number summary back to the sample mean and SD. For this purpose, several methods have been proposed in the recent literature and they are increasingly used nowadays. In this article, we propose to further advance the literature by developing a smoothly weighted estimator for the sample SD that fully utilizes the sample size information. For ease of implementation, we also derive an approximation formula for the optimal weight, as well as a shortcut formula for the sample SD. Numerical results show that our new estimator provides a more accurate estimate for normal data and also performs favorably for non-normal data. Together with the optimal sample mean estimator in Luo et al., our new methods have dramatically improved the existing methods for data transformation, and they are capable of serving as "rules of thumb" in meta-analysis for studies reported with the five-number summary. Finally, for practical use, an Excel spreadsheet and an online calculator are also provided for implementing our optimal estimators.
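To show the kind of conversion being refined, the sketch below estimates the sample SD from a five-number summary as a weighted blend of the range-based and IQR-based components commonly used in this literature (the quantile approximations follow Wan et al.); the blending weight shown is an illustrative fixed value, not the smoothly optimized, sample-size-dependent weight derived in the paper.

```python
from scipy.stats import norm

def sd_from_five_number_summary(minimum, q1, median, q3, maximum, n, weight=0.5):
    """Estimate the sample SD from a five-number summary.

    The range-based and IQR-based components use standard normal-quantile
    approximations; `weight` blends them and is illustrative only, whereas the
    paper derives a smoothly optimized, sample-size-dependent weight.
    The median is part of the reported summary but is not used by this blend.
    """
    sd_range = (maximum - minimum) / (2 * norm.ppf((n - 0.375) / (n + 0.25)))
    sd_iqr = (q3 - q1) / (2 * norm.ppf((0.75 * n - 0.125) / (n + 0.25)))
    return weight * sd_range + (1 - weight) * sd_iqr

# Illustrative skewed-looking summary from a study of n = 80
print(sd_from_five_number_summary(minimum=2, q1=10, median=15, q3=22, maximum=60, n=80))
```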

01 Apr 2020
TL;DR: The NHANES 2015-2018 sample design and the methods used to create sample weights and variance units for the public-use data files are described, including sample weights for selected subsamples, such as the fasting subsample.
Abstract: Background: The purpose of the National Health and Nutrition Examination Survey (NHANES) is to produce national estimates representative of the total noninstitutionalized civilian U.S. population. The sample for NHANES is selected using a complex, four-stage sample design. NHANES sample weights are used by analysts to produce estimates of the health-related statistics that would have been obtained if the entire sampling frame (i.e., the noninstitutionalized civilian U.S. population) had been surveyed. Sampling errors should be calculated for all survey estimates to aid in determining their statistical reliability. For complex sample surveys, exact mathematical formulas for variance estimates that fully incorporate the sample design are usually not available. Variance approximation procedures are required to provide reasonable, approximately unbiased, and design-consistent estimates of variance. Objective: This report describes the NHANES 2015-2018 sample design and the methods used to create sample weights and variance units for the public-use data files, including sample weights for selected subsamples, such as the fasting subsample. The impacts of sample design changes on estimation for NHANES 2015-2018 are described. Approaches that data users can use to modify sample weights when combining survey cycles or when combining subsamples are also included.

Journal ArticleDOI
01 Jul 2020-Chest
TL;DR: Basic statistical concepts in sample size estimation are reviewed, statistical considerations in the choice of a sample size for randomized controlled trials and observational studies are discussed, and strategies for reducing sample size when planning a study are provided.

Journal ArticleDOI
TL;DR: This article provides an overview of the types of error that occur, their impacts on analytic results, and statistical methods to mitigate the biases that they cause, and describes two of the simpler methods that adjust for bias in regression coefficients caused by measurement error in continuous covariates.
Abstract: Measurement error and misclassification of variables frequently occur in epidemiology and involve variables important to public health. Their presence can impact strongly on results of statistical analyses involving such variables. However, investigators commonly fail to pay attention to biases resulting from such mismeasurement. We provide, in two parts, an overview of the types of error that occur, their impacts on analytic results, and statistical methods to mitigate the biases that they cause. In this first part, we review different types of measurement error and misclassification, emphasizing the classical, linear, and Berkson models, and on the concepts of nondifferential and differential error. We describe the impacts of these types of error in covariates and in outcome variables on various analyses, including estimation and testing in regression models and estimating distributions. We outline types of ancillary studies required to provide information about such errors and discuss the implications of covariate measurement error for study design. Methods for ascertaining sample size requirements are outlined, both for ancillary studies designed to provide information about measurement error and for main studies where the exposure of interest is measured with error. We describe two of the simpler methods, regression calibration and simulation extrapolation (SIMEX), that adjust for bias in regression coefficients caused by measurement error in continuous covariates, and illustrate their use through examples drawn from the Observing Protein and Energy Nutrition (OPEN) dietary validation study. Finally, we review software available for implementing these methods. The second part of the article deals with more advanced topics.
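As a hedged illustration of regression calibration, the simpler of the two methods named above, the sketch below simulates a covariate observed with classical error and corrects the naive slope by the reliability ratio λ = σ²_X / (σ²_X + σ²_U); the error variance is treated as known, as it might be from an ancillary validation or replicate-measurement study.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
sigma_x, sigma_u = 1.0, 0.8          # true covariate SD and measurement-error SD
x = rng.normal(0, sigma_x, n)        # true (unobserved) exposure
w = x + rng.normal(0, sigma_u, n)    # observed, error-prone exposure
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, n)

# Naive regression of y on the mismeasured covariate w (attenuated slope)
beta_naive = np.cov(w, y)[0, 1] / np.var(w, ddof=1)

# Regression calibration: correct by the reliability ratio (assumed known here;
# in practice it is estimated from replicate measurements or a validation substudy)
reliability = sigma_x**2 / (sigma_x**2 + sigma_u**2)
beta_corrected = beta_naive / reliability

print(f"naive slope: {beta_naive:.3f}, calibrated slope: {beta_corrected:.3f}")
```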

Journal ArticleDOI
TL;DR: Experimental results show that the proposed variable sample size strategy is better suited to FMABC-FS, and that FMABC-FS can obtain better feature subsets with much less running time than the comparison algorithms.

Posted Content
TL;DR: A simulation approach to estimate power and classification accuracy for popular analysis pipelines found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure.
Abstract: Cluster algorithms are increasingly popular in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and accuracy for common analysis pipelines through simulation. We varied subgroup size, number, separation (effect size), and covariance structure. We then subjected generated datasets to dimensionality reduction (none, multidimensional scaling, or UMAP) and cluster algorithms (k-means, agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance, HDBSCAN). Finally, we compared the statistical power of discrete (k-means), "fuzzy" (c-means), and finite mixture modelling approaches (which include latent profile and latent class analysis). We found that outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N=20 per subgroup), provided cluster separation is large (Δ=4). Fuzzy clustering provided a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Δ=3). Overall, we recommend that researchers 1) only apply cluster analysis when large subgroup separation is expected, 2) aim for sample sizes of N=20 to N=30 per expected subgroup, 3) use multidimensional scaling to improve cluster separation, and 4) use fuzzy clustering or finite mixture modelling approaches that are more powerful and more parsimonious with partially overlapping multivariate normal distributions.
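A stripped-down version of this simulation logic, offered as an independent sketch rather than the authors' pipeline, is shown below: it generates two multivariate normal subgroups at a chosen centroid separation, clusters them with k-means, and scores subgroup recovery with the adjusted Rand index across repetitions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def cluster_recovery(n_per_group=20, separation=4.0, n_features=5, reps=200, seed=0):
    """Proportion of simulations in which k-means recovers two subgroups well."""
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(reps):
        g1 = rng.normal(0.0, 1.0, size=(n_per_group, n_features))
        g2 = rng.normal(0.0, 1.0, size=(n_per_group, n_features))
        g2[:, 0] += separation                      # shift one dimension by Delta
        data = np.vstack([g1, g2])
        labels = np.array([0] * n_per_group + [1] * n_per_group)
        pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
        if adjusted_rand_score(labels, pred) > 0.8:
            successes += 1
    return successes / reps

print(cluster_recovery(separation=4.0))   # large separation: recovery is reliable
print(cluster_recovery(separation=1.0))   # small separation: recovery is poor
```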

Journal ArticleDOI
TL;DR: An efficient, easy-to-compute, and robust lower-bound estimator for the number of undetected cases is provided, together with a motivating application to the Austrian situation, and compared with an independent, representative study on the prevalence of Covid-19 infection.

Journal ArticleDOI
TL;DR: The simulation results show that the median-based methods outperform the transformation- based methods when meta-analyzing studies that report the median of the outcome, especially when the outcome is skewed.
Abstract: We consider the problem of meta-analyzing two-group studies that report the median of the outcome. Often, these studies are excluded from meta-analysis because there are no well-established statistical methods to pool the difference of medians. To include these studies in meta-analysis, several authors have recently proposed methods to estimate the sample mean and standard deviation from the median, sample size, and several commonly reported measures of spread. Researchers frequently apply these methods to estimate the difference of means and its variance for each primary study and pool the difference of means using inverse variance weighting. In this work, we develop several methods to directly meta-analyze the difference of medians. We conduct a simulation study evaluating the performance of the proposed median-based methods and the competing transformation-based methods. The simulation results show that the median-based methods outperform the transformation-based methods when meta-analyzing studies that report the median of the outcome, especially when the outcome is skewed. Moreover, we illustrate the various methods on a real-life data set.

Journal ArticleDOI
01 Apr 2020-Catena
TL;DR: In this article, the authors evaluated the influence of sample size on the accuracy of different individual and hybrid models (adaptive neuro-fuzzy inference system (ANFIS), ANFIS-ICA, alternating decision tree (ADT), and random forest (RF)), considering numbers of springs ranging from 177 to 714.
Abstract: Machine learning models have attracted much research attention for groundwater potential mapping. However, the accuracy of models for groundwater potential mapping is significantly influenced by sample size and this is still a challenge. This study evaluates the influence of sample size on the accuracy of different individual and hybrid models, adaptive neuro-fuzzy inference system (ANFIS), ANFIS-imperialist competitive algorithm (ANFIS-ICA), alternating decision tree (ADT), and random forest (RF), to model groundwater potential, considering numbers of springs ranging from 177 to 714. A well-documented inventory of springs, as a natural representative of groundwater potential, was used to designate four sample data sets: 100% (D1), 75% (D2), 50% (D3), and 25% (D4) of the entire springs inventory. Each data set was randomly split into two groups of 30% (for training) and 70% (for validation). Fifteen diverse geo-environmental factors were employed as independent variables. The area under the receiver operating characteristic curve (AUROC) and the true skill statistic (TSS), as two cutoff-independent and cutoff-dependent performance metrics, were used to assess the performance of models. Results showed that the sample size influenced the performance of the four machine learning algorithms, but RF had a lower sensitivity to the reduction of sample size. In addition, validation results revealed that RF (AUROC = 90.74–96.32%, TSS = 0.79–0.85) had the best performance based on all four sample data sets, followed by ANFIS-ICA (AUROC = 81.23–91.55%, TSS = 0.74–0.81), ADT (AUROC = 79.29–88.46%, TSS = 0.59–0.74), and ANFIS (AUROC = 73.11–88.43%, TSS = 0.59–0.74). Further, the relative slope position, lithology, and distance from faults were the main spring-affecting factors contributing to groundwater potential modelling. This study can provide useful guidelines and a valuable reference for selecting machine learning models when a complete spring inventory in a watershed is unavailable.

Journal ArticleDOI
TL;DR: This article provides a tutorial on sample size calculation for cluster randomized designs with particular emphasis on designs with multiple periods of measurement and provides a web-based tool, the Shiny CRT Calculator, to allow researchers to easily conduct these sample size calculations.
Abstract: It has long been recognized that sample size calculations for cluster randomized trials require consideration of the correlation between multiple observations within the same cluster. When measurements are taken at anything other than a single point in time, these correlations depend not only on the cluster but also on the time separation between measurements and additionally, on whether different participants (cross-sectional designs) or the same participants (cohort designs) are repeatedly measured. This is particularly relevant in trials with multiple periods of measurement, such as the cluster cross-over and stepped-wedge designs, but also to some degree in parallel designs. Several papers describing sample size methodology for these designs have been published, but this methodology might not be accessible to all researchers. In this article we provide a tutorial on sample size calculation for cluster randomized designs with particular emphasis on designs with multiple periods of measurement and provide a web-based tool, the Shiny CRT Calculator, to allow researchers to easily conduct these sample size calculations. We consider both cross-sectional and cohort designs and allow for a variety of assumed within-cluster correlation structures. We consider cluster heterogeneity in treatment effects (for designs where treatment is crossed with cluster), as well as individually randomized group-treatment trials with differential clustering between arms, for example designs where clustering arises from interventions being delivered in groups. The calculator will compute power or precision, as a function of cluster size or number of clusters, for a wide variety of designs and correlation structures. We illustrate the methodology and the flexibility of the Shiny CRT Calculator using a range of examples.
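The simplest building block behind such calculations is the design effect that inflates an individually randomized sample size to account for within-cluster correlation. The sketch below covers only that single-period parallel case; it does not reproduce the multi-period correlation structures or the Shiny CRT Calculator itself.

```python
import math
from statsmodels.stats.power import TTestIndPower

def cluster_rct_sample_size(effect_size, icc, cluster_size, alpha=0.05, power=0.80):
    """Per-arm sample size for a simple parallel cluster randomized trial.

    Inflates the individually randomized requirement by the design effect
    1 + (m - 1) * ICC for clusters of size m (single measurement period).
    """
    n_individual = TTestIndPower().solve_power(effect_size=effect_size,
                                               alpha=alpha, power=power)
    design_effect = 1 + (cluster_size - 1) * icc
    n_per_arm = n_individual * design_effect
    clusters_per_arm = math.ceil(n_per_arm / cluster_size)
    return math.ceil(n_per_arm), clusters_per_arm

# Illustrative values: standardized effect 0.3, ICC 0.05, clusters of 20
print(cluster_rct_sample_size(effect_size=0.3, icc=0.05, cluster_size=20))
```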

Journal ArticleDOI
TL;DR: The majority of papers submitted to the Journal of Sports Sciences are experimental, and the data are collected from a sample of the population and then used to test hypotheses and/or make inferences about the population.
Abstract: The majority of papers submitted to the Journal of Sports Sciences are experimental. The data are collected from a sample of the population and then used to test hypotheses and/or make inferences about the population.

Journal ArticleDOI
TL;DR: It is demonstrated and illustrated that the Monte Carlo technique leads to overly precise conclusions on the values of estimated parameters, and to incorrect hypothesis tests, thus pointing out a fundamental flaw.
Abstract: The Monte Carlo technique is widely used and recommended for including uncertainties in LCA. Typically, 1000 or 10,000 runs are done, but a clear argument for that number is not available, and with the growing size of LCA databases, an excessively high number of runs may be time-consuming. We therefore investigate if a large number of runs are useful, or if it might be unnecessary or even harmful. We review the standard theory of probability distributions for describing stochastic variables, including the combination of different stochastic variables into a calculation. We also review the standard theory of inferential statistics for estimating a probability distribution, given a sample of values. For estimating the distribution of a function of probability distributions, two major techniques are available: an analytical one, applying probability theory, and a numerical one, using Monte Carlo simulation. Because the analytical technique is often unavailable, the obvious way out is Monte Carlo. However, we demonstrate and illustrate that it leads to overly precise conclusions on the values of estimated parameters, and to incorrect hypothesis tests. We demonstrate the effect for two simple cases: one system in a stand-alone analysis and a comparative analysis of two alternative systems. Both cases illustrate that statistical hypotheses that should not be rejected in fact are rejected in a highly convincing way, thus pointing out a fundamental flaw. Apart from the obvious recommendation to use larger samples for estimating input distributions, we suggest restricting the number of Monte Carlo runs to a number not greater than the sample sizes used for the input parameters. As a final note, when the input parameters are not estimated using samples, but through a procedure, such as the popular pedigree approach, the Monte Carlo approach should not be used at all.
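The central warning, that piling on Monte Carlo runs shrinks the apparent uncertainty of a comparison even though the inputs rest on small samples, can be illustrated with a hedged toy example unrelated to any specific LCA system: two nearly identical alternatives, each parameterized from only 10 observations, appear "significantly" different once enough runs are drawn.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)

# Small empirical samples used to parameterize the input distributions
sample_a = rng.normal(10.0, 2.0, size=10)
sample_b = rng.normal(10.2, 2.0, size=10)   # nearly identical alternatives

for runs in (10, 1000, 100000):
    # Monte Carlo propagation: draw from distributions fitted to the small samples
    mc_a = rng.normal(sample_a.mean(), sample_a.std(ddof=1), size=runs)
    mc_b = rng.normal(sample_b.mean(), sample_b.std(ddof=1), size=runs)
    p = ttest_ind(mc_a, mc_b).pvalue
    print(f"{runs:>6} runs: p-value for 'A differs from B' = {p:.4f}")
```

As the number of runs grows, the p-value collapses toward zero even though the evidence is still just 10 observations per alternative, which is exactly the overprecision the paper warns against.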

Journal ArticleDOI
TL;DR: It is proposed that a minimal important difference score of 4 points on the transformed 0-100 scale is clinically useful when assessing an individual patient's outcome using the reconstruction module of the BREAST-Q.
Abstract: Background The reconstruction module of the BREAST-Q patient-reported outcome measure is frequently used by investigators and in clinical practice. A minimal important difference establishes the smallest change in outcome measure score that patients perceive to be important. To enhance interpretability of the BREAST-Q reconstruction module, the authors determined minimal important difference estimates using distribution-based methods. Methods An analysis of prospectively collected data from 3052 Mastectomy Reconstruction Outcomes Consortium patients was performed. The authors used distribution-based methods to investigate the minimal important difference for the entire patient sample and three clinically relevant groups. The authors used both 0.2 SD units (effect size) and the standardized response mean value of 0.2 as distribution-based criteria. Clinical experience was used to guide and assess appropriateness of results. Results A total of 3052 patients had BREAST-Q data available for analysis. The average age and body mass index were 49.5 and 26.8, respectively. The minimal important difference estimates for each domain were 4 (Satisfaction with Breasts), 4 (Psychosocial Well-being), 3 (Physical Well-being), and 4 (Sexual Well-being). The minimal important difference estimates for each domain were similar when compared within the three clinically relevant groups. Conclusions The authors propose that a minimal important difference score of 4 points on the transformed 0 to 100 scale is clinically useful when assessing an individual patient's outcome using the reconstruction module of the BREAST-Q. When designing research studies, investigators should use the minimal important difference estimate for their domain of interest when calculating sample size. The authors acknowledge that distribution-based minimal important differences are estimates and may vary based on patient population and context.

Journal ArticleDOI
TL;DR: The results showed that structural optimization improved accuracy, and the optimal network structure was mostly determined by the data nature (photographic, calligraphic, or medical images), and less affected by the sample size, suggesting that the optimalnetwork structure is data-driven, not sample size driven.
Abstract: Deep neural networks have gained immense popularity in the Big Data problem; however, the availability of training samples can be relatively limited in specific application domains, particularly medical imaging, which consequently leads to overfitting problems. This "Small Data" challenge may need a mindset that is entirely different from the existing Big Data paradigm. Here, under the small data scenarios, we examined whether the network structure has a substantial influence on the performance and whether the optimal structure is predominantly determined by sample size or data nature. To this end, we listed all possible combinations of layers given an upper bound of the VC-dimension to study how structural hyperparameters affected the performance. Our results showed that structural optimization improved accuracy by 27.99%, 16.44%, and 13.11% over random selection for sample sizes of 100, 500, and 1,000 in the MNIST dataset, respectively, suggesting that the importance of the network structure increases as the sample size becomes smaller. Furthermore, the optimal network structure was mostly determined by the data nature (photographic, calligraphic, or medical images), and less affected by the sample size, suggesting that the optimal network structure is data-driven, not sample size driven. After network structure optimization, the convolutional neural network could achieve 91.13% accuracy with only 500 samples and 93.66% accuracy with only 1000 samples for the MNIST dataset, and 94.10% accuracy with only 3300 samples for the Mitosis (microscopic) dataset. These results indicate the primary importance of the network structure and the nature of the data in facing the Small Data challenge.

Journal ArticleDOI
TL;DR: A parameter is introduced that measures the goodness of fit of a model but does not depend on the sample size; a step-by-step illustration of the proposed method is provided using a model for post-neonatal mortality developed in a large cohort of more than 300,000 observations.
Abstract: Evaluating the goodness of fit of logistic regression models is crucial to ensure the accuracy of the estimated probabilities. Unfortunately, such evaluation is problematic in large samples. Because the power of traditional goodness of fit tests increases with the sample size, practically irrelevant discrepancies between estimated and true probabilities are increasingly likely to cause the rejection of the hypothesis of perfect fit in larger and larger samples. This phenomenon has been widely documented for popular goodness of fit tests, such as the Hosmer-Lemeshow test. To address this limitation, we propose a modification of the Hosmer-Lemeshow approach. By standardizing the noncentrality parameter that characterizes the alternative distribution of the Hosmer-Lemeshow statistic, we introduce a parameter that measures the goodness of fit of a model but does not depend on the sample size. We provide the methodology to estimate this parameter and construct confidence intervals for it. Finally, we propose a formal statistical test to rigorously assess whether the fit of a model, albeit not perfect, is acceptable for practical purposes. The proposed method is compared in a simulation study with a competing modification of the Hosmer-Lemeshow test, based on repeated subsampling. We provide a step-by-step illustration of our method using a model for postneonatal mortality developed in a large cohort of more than 300 000 observations.

Journal ArticleDOI
TL;DR: In this paper, a bottom trawl survey in the coastal waters of Shandong Peninsula, China, was used to evaluate the predictive performance of random forest (RF) models for 21 marine demersal species.

Journal ArticleDOI
TL;DR: The online tool SSizer is unique in its ability to comprehensively evaluate whether the sample size is sufficient and to determine the required number of samples for a user-input dataset, thereby facilitating comparative and OMIC-based biological studies.

Journal ArticleDOI
TL;DR: It is concluded that, despite improved performance on average, shrinkage often worked poorly in individual datasets, in particular when it was most needed, implying that shrinkage methods do not solve problems associated with small sample size or low number of events per variable.
Abstract: When developing risk prediction models on datasets with limited sample size, shrinkage methods are recommended. Earlier studies showed that shrinkage results in better predictive performance on average...

Journal ArticleDOI
Riko Kelter
TL;DR: An extensive simulation study is conducted to compare common Bayesian significance and effect measures which can be obtained from a posterior distribution for one of the most important statistical procedures in medical research and in particular clinical trials, the two-sample Student's (and Welch’s) t-test.
Abstract: The replication crisis hit the medical sciences about a decade ago, but today still most of the flaws inherent in null hypothesis significance testing (NHST) have not been solved. While the drawbacks of p-values have been detailed in endless venues, for clinical research, only a few attractive alternatives have been proposed to replace p-values and NHST. Bayesian methods are one of them, and they are gaining increasing attention in medical research, as some of their advantages include the description of model parameters in terms of probability, as well as the incorporation of prior information in contrast to the frequentist framework. While Bayesian methods are not the only remedy to the situation, there is an increasing agreement that they are an essential way to avoid common misconceptions and false interpretation of study results. The requirements necessary for applying Bayesian statistics have transitioned from detailed programming knowledge into simple point-and-click programs like JASP. Still, the multitude of Bayesian significance and effect measures which contrast the gold standard of significance in medical research, the p-value, causes a lack of agreement on which measure to report. Therefore, in this paper, we conduct an extensive simulation study to compare common Bayesian significance and effect measures which can be obtained from a posterior distribution. In it, we analyse the behaviour of these measures for one of the most important statistical procedures in medical research and in particular clinical trials, the two-sample Student’s (and Welch’s) t-test. The results show that some measures cannot state evidence for both the null and the alternative. While the different indices behave similarly regarding increasing sample size and noise, the prior modelling influences the obtained results and extreme priors allow for cherry-picking similar to p-hacking in the frequentist paradigm. The indices behave quite differently regarding their ability to control the type I error rates and regarding their ability to detect an existing effect. Based on the results, two of the commonly used indices can be recommended for more widespread use in clinical and biomedical research, as they improve the type I error control compared to the classic two-sample t-test and enjoy multiple other desirable properties.

Journal ArticleDOI
TL;DR: In this article, a covariate balancing propensity score estimator is proposed to estimate the average treatment effect in observational studies when the number of potential confounders is possibly much greater than the sample size.
Abstract: We propose a robust method to estimate the average treatment effects in observational studies when the number of potential confounders is possibly much greater than the sample size. Our method consists of three steps. We first use a class of penalized M-estimators for the propensity score and outcome models. We then calibrate the initial estimate of the propensity score by balancing a carefully selected subset of covariates that are predictive of the outcome. Finally, the estimated propensity score is used to construct the inverse probability weighting estimator. We prove that the proposed estimator, which we call the high-dimensional covariate balancing propensity score, has the sample boundedness property, is root-n consistent, asymptotically normal, and semiparametrically efficient when the propensity score model is correctly specified and the outcome model is linear in covariates. More importantly, we show that our estimator remains root-n consistent and asymptotically normal so long as either the propensity score model or the outcome model is correctly specified. We provide valid confidence intervals in both cases and further extend these results to the case where the outcome model is a generalized linear model. In simulation studies, we find that the proposed methodology often estimates the average treatment effect more accurately than existing methods. We also present an empirical application, in which we estimate the average causal effect of college attendance on adulthood political participation. An open-source software package is available for implementing the proposed methodology.
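The final inverse probability weighting step is standard, so the sketch below shows that step alone on simulated data with a plain logistic propensity model; the paper's penalized estimation and covariate-balancing calibration steps are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n, p = 2000, 10
X = rng.normal(size=(n, p))
propensity_true = 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1])))
T = rng.binomial(1, propensity_true)
Y = 1.0 * T + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # true ATE = 1.0

# Estimate the propensity score (here with a plain, unpenalized logistic model)
e_hat = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]

# Inverse probability weighting estimator of the average treatment effect
ate_ipw = np.mean(T * Y / e_hat) - np.mean((1 - T) * Y / (1 - e_hat))
print(f"IPW estimate of the ATE: {ate_ipw:.3f}")
```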