
Showing papers on "Sample size determination published in 2005"


Journal ArticleDOI
TL;DR: It is found that in most cases the estimated ‘log probability of data’ does not provide a correct estimation of the number of clusters, K, and using an ad hoc statistic ΔK based on the rate of change in the log probability between successive K values, STRUCTURE accurately detects the uppermost hierarchical level of structure for the scenarios the authors tested.
Abstract: The identification of genetically homogeneous groups of individuals is a long standing issue in population genetics. A recent Bayesian algorithm implemented in the software STRUCTURE allows the identification of such groups. However, the ability of this algorithm to detect the true number of clusters (K) in a sample of individuals when patterns of dispersal among populations are not homogeneous has not been tested. The goal of this study is to carry out such tests, using various dispersal scenarios from data generated with an individual-based model. We found that in most cases the estimated 'log probability of data' does not provide a correct estimation of the number of clusters, K. However, using an ad hoc statistic ΔK based on the rate of change in the log probability of data between successive K values, we found that STRUCTURE accurately detects the uppermost hierarchical level of structure for the scenarios we tested. As might be expected, the results are sensitive to the type of genetic marker used (AFLP vs. microsatellite), the number of loci scored, the number of populations sampled, and the number of individuals typed in each sample.
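The ΔK statistic referred to above can be computed from the log probabilities of the data across replicate STRUCTURE runs at each K. The sketch below is a minimal, hypothetical illustration (the log-probability values are invented, and it takes second differences of the per-K means rather than averaging per-run second differences as in the original definition):

```python
import numpy as np

# Hypothetical mean log P(X|K) values from three replicate STRUCTURE runs for each K = 1..6.
logprobs = {
    1: [-5400, -5410, -5395],
    2: [-4900, -4895, -4910],
    3: [-4880, -4875, -4890],
    4: [-4878, -4874, -4885],
    5: [-4879, -4876, -4888],
    6: [-4881, -4877, -4890],
}

ks = sorted(logprobs)
mean_L = np.array([np.mean(logprobs[k]) for k in ks])
sd_L = np.array([np.std(logprobs[k], ddof=1) for k in ks])

# Delta K = |L(K+1) - 2*L(K) + L(K-1)| / sd(L(K)), defined for interior K only.
delta_k = {}
for i in range(1, len(ks) - 1):
    second_diff = abs(mean_L[i + 1] - 2 * mean_L[i] + mean_L[i - 1])
    delta_k[ks[i]] = second_diff / sd_L[i]

# The K with the largest Delta K points to the uppermost hierarchical level of structure.
print(delta_k)
```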

18,572 citations


Journal ArticleDOI
TL;DR: Two simple formulas are found that estimate the mean using the values of the median, the low and high end of the range, and n (the sample size); these formulas should help meta-analysts use clinical trials in their analysis even when not all of the information is available and/or reported.
Abstract: Usually the researchers performing meta-analysis of continuous outcomes from clinical trials need their mean value and the variance (or standard deviation) in order to pool data. However, sometimes the published reports of clinical trials only report the median, range and the size of the trial. In this article we use simple and elementary inequalities and approximations in order to estimate the mean and the variance for such trials. Our estimation is distribution-free, i.e., it makes no assumption on the distribution of the underlying data. We found two simple formulas that estimate the mean using the values of the median (m), the low and high end of the range (a and b, respectively), and n (the sample size). Using simulations, we show that the median can be used to estimate the mean when the sample size is larger than 25. For smaller samples our new formula, devised in this paper, should be used. We also estimated the variance of an unknown sample using the median, the low and high end of the range, and the sample size. Our estimate performs best in our simulations for very small samples (n ≤ 15). For moderately sized samples (15 < n ≤ 70), the formula range/4 gives the best estimator for the standard deviation (variance), and for larger samples (n > 70) the formula range/6 gives the best estimator. We also include an illustrative example of the potential value of our method using reports from the Cochrane review on the role of erythropoietin in anemia due to malignancy. Using these formulas, we hope to help meta-analysts use clinical trials in their analysis even when not all of the information is available and/or reported.
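As a concrete illustration of the estimators described above, the sketch below implements the median/range-based rules by sample-size regime. The function names are ours, and the small-sample variance expression in particular should be checked against the original article before use:

```python
def estimate_mean(a, m, b, n):
    """Estimate the mean from the minimum (a), median (m), maximum (b) and sample size (n)."""
    if n > 25:
        # For larger samples the median itself is an adequate estimate of the mean.
        return m
    # Small-sample formula combining the median with the two ends of the range.
    return (a + 2 * m + b) / 4.0


def estimate_sd(a, m, b, n):
    """Estimate the standard deviation from the same summary statistics."""
    if n <= 15:
        # Median/range-based variance formula for very small samples.
        variance = ((a - 2 * m + b) ** 2 / 4.0 + (b - a) ** 2) / 12.0
        return variance ** 0.5
    if n <= 70:
        return (b - a) / 4.0   # moderately sized samples
    return (b - a) / 6.0       # large samples


# Example trial reporting only median 18, range 10-30, n = 20.
print(estimate_mean(a=10, m=18, b=30, n=20), estimate_sd(a=10, m=18, b=30, n=20))
```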

6,384 citations


Journal ArticleDOI
TL;DR: In this paper, a simulation study is used to determine the influence of different sample sizes at the group level on the accuracy of the estimates (regression coefficients and variances) and their standard errors.
Abstract: An important problem in multilevel modeling is what constitutes a sufficient sample size for accurate estimation. In multilevel analysis, the major restriction is often the higher-level sample size. In this paper, a simulation study is used to determine the influence of different sample sizes at the group level on the accuracy of the estimates (regression coefficients and variances) and their standard errors. In addition, the influence of other factors, such as the lowest-level sample size and different variance distributions between the levels (different intraclass correlations), is examined. The results show that only a small sample size at level two (meaning a sample of 50 or less) leads to biased estimates of the second-level standard errors. In all of the other simulated conditions the estimates of the regression coefficients, the variance components, and the standard errors are unbiased and accurate.

2,931 citations


Journal ArticleDOI
TL;DR: The effective sample size funnel plot and associated regression test of asymmetry should be used to detect publication bias and other sample size related effects in meta-analyses of test accuracy.

2,191 citations


Journal ArticleDOI
TL;DR: When designing a clinical trial an appropriate justification for the sample size should be provided in the protocol, but when undertaking a pilot trial there is often no prior information on which to base a sample size.
Abstract: When designing a clinical trial an appropriate justification for the sample size should be provided in the protocol. However, there are a number of settings, such as when undertaking a pilot trial, in which there is no prior information on which to base a sample size. For such pilot studies the recommendation is a sample size of 12 per group. The justifications for this sample size are based on rationale about feasibility; precision about the mean and variance; and regulatory considerations. The context of the justifications is that future studies will use the information from the pilot in their design. Copyright © 2005 John Wiley & Sons, Ltd.
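One way to see the "precision about the variance" rationale is to compute the confidence interval for a standard deviation estimated from 12 observations per group. The sketch below uses the standard chi-square interval and is our illustration, not part of the cited paper:

```python
from scipy.stats import chi2

n = 12                 # pilot group size recommended above
df = n - 1
s = 1.0                # observed SD expressed in units of itself, so the interval is a multiplier

# 95% confidence interval for the true SD based on a chi-square distribution with n-1 df.
lower = s * (df / chi2.ppf(0.975, df)) ** 0.5
upper = s * (df / chi2.ppf(0.025, df)) ** 0.5
print(f"95% CI for the SD with n={n}: ({lower:.2f}, {upper:.2f}) times the observed SD")
# Roughly (0.71, 1.70): enough precision to plan the main trial at a modest pilot cost.
```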

1,624 citations


Journal ArticleDOI
TL;DR: The D1 model best predicts the values for observed health states using the time trade-off method, and represents a significant enhancement of the EQ-5D's utility for health status assessment and economic analysis in the US.
Abstract: Purpose: The EQ-5D is a brief, multiattribute, preference-based health status measure. This article describes the development of a statistical model for generating US population-based EQ-5D preference weights. Methods: A multistage probability sample was selected from the US adult civilian noninstitutional population. Respondents valued 13 of 243 EQ-5D health states using the time trade-off (TTO) method. Data for 12 states were used in econometric modeling. The TTO valuations were linearly transformed to lie on the interval [−1, 1]. Methods were investigated to account for interaction effects caused by having problems in multiple EQ-5D dimensions. Several alternative model specifications (e.g., pooled least squares, random effects) also were considered. A modified split-sample approach was used to evaluate the predictive accuracy of the models. All statistical analyses took into account the clustering and disproportionate selection probabilities inherent in our sampling design. Results: Our D1 model for the EQ-5D included ordinal terms to capture the effect of departures from perfect health as well as interaction effects. A random effects specification of the D1 model yielded a good fit for the observed TTO data, with an overall R² of 0.38, a mean absolute error of 0.025, and 7 prediction errors exceeding 0.05 in absolute magnitude. Conclusions: The D1 model best predicts the values for observed health states. The resulting preference weight estimates represent a significant enhancement of the EQ-5D’s utility for health status assessment and economic analysis in the US.

1,247 citations


Journal ArticleDOI
TL;DR: In this article, a simulation study addressed minimum sample size requirements for 180 different population conditions that varied in the number of factors, number of variables per factor, and the level of communality.
Abstract: There is no shortage of recommendations regarding the appropriate sample size to use when conducting a factor analysis. Suggested minimums for sample size include from 3 to 20 times the number of variables and absolute ranges from 100 to over 1,000. For the most part, there is little empirical evidence to support these recommendations. This simulation study addressed minimum sample size requirements for 180 different population conditions that varied in the number of factors, the number of variables per factor, and the level of communality. Congruence coefficients were calculated to assess the agreement between population solutions and sample solutions generated from the various population conditions. Although absolute minimums are not presented, it was found that, in general, minimum sample sizes appear to be smaller for higher levels of communality; minimum sample sizes appear to be smaller for higher ratios of the number of variables to the number of factors; and when the variables-to-factors ratio exc...
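The congruence coefficients mentioned above measure how well a sample factor solution reproduces the population loadings. A minimal version of the calculation (our sketch with hypothetical loadings, not the study's code) is:

```python
import numpy as np

def congruence(pop_loadings, sample_loadings):
    """Tucker's congruence coefficient between two loading vectors for the same factor."""
    x = np.asarray(pop_loadings, dtype=float)
    y = np.asarray(sample_loadings, dtype=float)
    return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)))

# Hypothetical loadings for one factor in the population and in a sample solution.
print(congruence([0.8, 0.7, 0.6, 0.7], [0.75, 0.72, 0.55, 0.68]))  # values near 1 indicate good recovery
```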

917 citations


Journal ArticleDOI
TL;DR: Results suggest the need to minimize the influence of artifacts that produce a downward bias in the observed effect size and put into question the use of conventional definitions of moderating effect sizes.
Abstract: The authors conducted a 30-year review (1969-1998) of the size of moderating effects of categorical variables as assessed using multiple regression. The median observed effect size (f²) is only .002, but 72% of the moderator tests reviewed had power of .80 or greater to detect a targeted effect conventionally defined as small. Results suggest the need to minimize the influence of artifacts that produce a downward bias in the observed effect size and put into question the use of conventional definitions of moderating effect sizes. As long as an effect has a meaningful impact, the authors advise researchers to conduct a power analysis and plan future research designs on the basis of smaller and more realistic targeted effect sizes.
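For reference, the moderator effect size f² is computed from the increment in R² when the interaction term is added to the regression. A minimal worked example (ours, with made-up R² values) is:

```python
# R-squared without and with the categorical-moderator x predictor interaction term (hypothetical).
r2_reduced = 0.150   # main effects only
r2_full = 0.152      # plus the interaction term

# Cohen's f^2 for the moderating effect.
f2 = (r2_full - r2_reduced) / (1 - r2_full)
print(round(f2, 4))  # about 0.0024 -- close to the median of .002 observed in the 30-year review
```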

872 citations


Journal ArticleDOI
TL;DR: By combining the association results with results from linkage mapping in F2 crosses, this study identifies one previously known true positive and several promising new associations, but also demonstrates the existence of both false positives and false negatives.
Abstract: A potentially serious disadvantage of association mapping is the fact that marker-trait associations may arise from confounding population structure as well as from linkage to causative polymorphisms. Using genome-wide marker data, we have previously demonstrated that the problem can be severe in a global sample of 95 Arabidopsis thaliana accessions, and that established methods for controlling for population structure are generally insufficient. Here, we use the same sample together with a number of flowering-related phenotypes and data-perturbation simulations to evaluate a wider range of methods for controlling for population structure. We find that, in terms of reducing the false-positive rate while maintaining statistical power, a recently introduced mixed-model approach that takes genome-wide differences in relatedness into account via estimated pairwise kinship coefficients generally performs best. By combining the association results with results from linkage mapping in F2 crosses, we identify one previously known true positive and several promising new associations, but also demonstrate the existence of both false positives and false negatives. Our results illustrate the potential of genome-wide association scans as a tool for dissecting the genetics of natural variation, while at the same time highlighting the pitfalls. The importance of study design is clear; our study is severely under-powered both in terms of sample size and marker density. Our results also provide a striking demonstration of confounding by population structure. While statistical methods can be used to ameliorate this problem, they cannot always be effective and are certainly not a substitute for independent evidence, such as that obtained via crosses or transgenic experiments. Ultimately, association mapping is a powerful tool for identifying a list of candidates that is short enough to permit further genetic study.

735 citations


Journal ArticleDOI
TL;DR: In this paper, the authors used simulations to investigate the effect of sample size, number of indicators, factor loadings, and factor correlations on frequencies of the acceptance/rejection of models (true and misspecified) when selected goodness-of-fit indices were compared with prespecified cutoff values.

697 citations


Journal ArticleDOI
TL;DR: In this article, the authors reviewed decision rules used in the past to identify wearing period, minimal wear requirement for a valid day, spurious data, number of days used to calculate the outcome variables and extract bouts of moderate to vigorous physical activity (MVPA).
Abstract: Purpose: Accelerometers are recognized as a valid and objective tool to assess free-living physical activity. Despite the widespread use of accelerometers, there is no standardized way to process and summarize data from them, which limits our ability to compare results across studies. This paper a) reviews decision rules researchers have used in the past, b) compares the impact of using different decision rules on a common data set, and c) identifies issues to consider for accelerometer data reduction. Methods: The methods sections of studies published in 2003 and 2004 were reviewed to determine what decision rules previous researchers have used to identify wearing period, minimal wear requirement for a valid day, spurious data, number of days used to calculate the outcome variables, and extract bouts of moderate to vigorous physical activity (MVPA). For this study, four data reduction algorithms that employ different decision rules were used to analyze the same data set. Results: The review showed that among studies that reported their decision rules, much variability was observed. Overall, the analyses suggested that using different algorithms impacted several important outcome variables. The most stringent algorithm yielded significantly lower wearing time, the lowest activity counts per minute and counts per day, and fewer minutes of MVPA per day. An exploratory sensitivity analysis revealed that the most stringent inclusion criterion had an impact on sample size and wearing time, which in turn affected many outcome variables. Conclusions: These findings suggest that the decision rules employed to process accelerometer data have a significant impact on important outcome variables. Until guidelines are developed, it will remain difficult to compare findings across studies.
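To make the notion of "decision rules" concrete, the sketch below applies one hypothetical reduction algorithm (60 minutes of consecutive zero counts as non-wear, at least 10 valid hours for a valid day, a fixed count cutpoint for MVPA) to a vector of minute-by-minute counts. The thresholds are illustrative assumptions, not the specific rules compared in the paper:

```python
import numpy as np

def reduce_day(counts, nonwear_window=60, valid_hours=10, mvpa_cutpoint=1952):
    """Apply simple decision rules to one day of minute-level accelerometer counts."""
    counts = np.asarray(counts)
    worn = np.ones(len(counts), dtype=bool)

    # Non-wear: any run of `nonwear_window` consecutive zero counts is flagged as not worn.
    run = 0
    for i, c in enumerate(counts):
        run = run + 1 if c == 0 else 0
        if run >= nonwear_window:
            worn[i - nonwear_window + 1 : i + 1] = False

    wear_minutes = int(worn.sum())
    valid_day = wear_minutes >= valid_hours * 60
    mvpa_minutes = int(np.sum((counts >= mvpa_cutpoint) & worn))
    return {"wear_minutes": wear_minutes, "valid_day": valid_day, "mvpa_minutes": mvpa_minutes}

# Example day (1440 minutes): a long non-wear block, a 30-minute brisk walk, then light activity.
day = np.concatenate([np.zeros(420), np.full(30, 2500), np.random.default_rng(1).integers(0, 400, 990)])
print(reduce_day(day))
```

Tightening any of these thresholds changes wear time, valid-day status and MVPA minutes at once, which is exactly the sensitivity the abstract describes.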

Journal ArticleDOI
TL;DR: In this article, the robustness of structural equation modeling to different degrees of nonnormality under 2 estimation methods, generalized least squares and maximum likelihood, and 4 sample sizes, 100, 250, 500, and 1,000, was investigated.
Abstract: This simulation study investigated the robustness of structural equation modeling to different degrees of nonnormality under 2 estimation methods, generalized least squares and maximum likelihood, and 4 sample sizes, 100, 250, 500, and 1,000. Each of the slight and severe nonnormality degrees was comprised of pure skewness, pure kurtosis, and both skewness and kurtosis. Bias and standard errors of parameter estimates were analyzed. In addition, an analysis of variance was conducted to investigate the effects of the 3 factors on several goodness-of-fit indexes. The study found that standard errors of parameter estimates were not significantly affected by estimation methods and nonnormality conditions. As expected, standard errors decreased at larger sample sizes. Parameter estimates were more sensitive to nonnormality than to sample size and estimation method. Chi-square was the least robust model fit index compared with Normed Fit Index, Nonnormed Fit Index, and Comparative Fit Index. Sample sizes of 100 ...

Journal ArticleDOI
TL;DR: Brain size in autism was slightly reduced at birth, dramatically increased within the first year of life, but then plateaued so that by adulthood the majority of cases were within normal range, and study of the older autistic brain reflects the outcome rather than the process of pathology.

Journal ArticleDOI
TL;DR: In this paper, a robust version of the Dickey-Fuller t-statistic under contemporaneously correlated errors is suggested, together with a GLS t-statistic based on the t-statistic of the transformed model.
Abstract: In this paper alternative approaches for testing the unit root hypothesis in panel data are considered. First, a robust version of the Dickey-Fuller t-statistic under contemporaneously correlated errors is suggested. Second, the GLS t-statistic is considered, which is based on the t-statistic of the transformed model. The asymptotic power of both tests is compared against a sequence of local alternatives. To adjust for short-run serial correlation of the errors, we propose a pre-whitening procedure that yields a test statistic with a standard normal limiting distribution as N and T tend to infinity. The test procedure is further generalized to accommodate individual specific intercepts or linear time trends. From our Monte Carlo simulations it turns out that the robust OLS t-statistic performs well with respect to size and power, whereas the GLS t-statistic may suffer from severe size distortions in small and moderate sample sizes. The tests are applied to test for a unit root in real exchange rates.
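A much-simplified pooled Dickey-Fuller regression (without the robustness corrections, pre-whitening or GLS transformation studied in the paper) illustrates the kind of statistic being compared; this is our sketch only:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
N, T = 20, 50
# Simulate N independent random-walk series, so the unit root null holds.
y = np.cumsum(rng.standard_normal((N, T)), axis=1)

# Pooled Dickey-Fuller regression: dy_it = a_i + rho * y_i,t-1 + e_it.
dy = np.diff(y, axis=1).ravel()
ylag = y[:, :-1].ravel()
dummies = np.kron(np.eye(N), np.ones((T - 1, 1)))   # unit-specific intercepts
X = np.column_stack([ylag, dummies])
res = sm.OLS(dy, X).fit()

# Without further corrections this pooled t-statistic is not standard normal under the null,
# which is precisely why robust and GLS versions are developed in the paper.
print("pooled DF t-statistic on rho:", round(res.tvalues[0], 2))
```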

Journal ArticleDOI
TL;DR: A minimum of 100 events and 100 nonevents is suggested for external validation samples in order to detect that a model predicts probabilities that are too high, when predictions are on average 1.5 times too high on the odds scale.

Journal ArticleDOI
TL;DR: This article reviews several basic statistical tools needed for modeling data with sampling weights that are implemented in Mplus Version 3.0 and the pseudomaximum likelihood estimation method is reviewed and illustrated with stratified cluster sampling.
Abstract: This article reviews several basic statistical tools needed for modeling data with sampling weights that are implemented in Mplus Version 3. These tools are illustrated in simulation studies for several latent variable models including factor analysis with continuous and categorical indicators, latent class analysis, and growth models. The pseudomaximum likelihood estimation method is reviewed and illustrated with stratified cluster sampling. Additionally, the weighted least squares method for estimating structural equation models with categorical and continuous outcomes implemented in Mplus extended to incorporate sampling weights is also illustrated. The performance of several chi-square tests under unequal probability sampling is evaluated. Simulation studies compare the methods used in several statistical packages such as Mplus, HLM, SAS Proc Mixed, MLwiN, and the weighted sample statistics method used in other software packages.

Journal ArticleDOI
TL;DR: The branding of trials as unethical on the basis of an imprecise sample size calculation process might be acceptable if investigators use methodological rigor to eliminate bias, properly report to avoid misinterpretation, and always publish results to avert publication bias.

Journal ArticleDOI
01 Jul 2005
TL;DR: The LARS-Cox procedure, an L1-penalized Cox model estimated with least-angle regression (LARS), is proposed to select genes that are relevant to patients' survival and to build a predictive model for future prediction; it can be used for identifying important genes that are related to time to death due to cancer and for predicting the survival of future patients.
Abstract: Motivation: An important application of microarray technology is to relate gene expression profiles to various clinical phenotypes of patients. Success has been demonstrated in molecular classification of cancer in which the gene expression data serve as predictors and different types of cancer serve as a categorical outcome variable. However, there has been less research in linking gene expression profiles to the censored survival data such as patients' overall survival time or time to cancer relapse. It would be desirable to have models with good prediction accuracy and parsimony property. Results: We propose to use the L1 penalized estimation for the Cox model to select genes that are relevant to patients' survival and to build a predictive model for future prediction. The computational difficulty associated with the estimation in the high-dimensional and low-sample size settings can be efficiently solved by using the recently developed least-angle regression (LARS) method. Our simulation studies and application to real datasets on predicting survival after chemotherapy for patients with diffuse large B-cell lymphoma demonstrate that the proposed procedure, which we call the LARS--Cox procedure, can be used for identifying important genes that are related to time to death due to cancer and for building a parsimonious model for predicting the survival of future patients. The LARS--Cox regression gives better predictive performance than the L2 penalized regression and a few other dimension-reduction based methods. Conclusions: We conclude that the proposed LARS--Cox procedure can be very useful in identifying genes relevant to survival phenotypes and in building a parsimonious predictive model that can be used for classifying future patients into clinically relevant high- and low-risk groups based on the gene expression profile and survival times of previous patients. Supplementary information: http://dna.ucdavis.edu/~hli/LARSCox-Appendix.pdf Contact: hli@ucdavis.edu
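A present-day stand-in for the L1-penalized Cox estimation described above can be sketched with scikit-survival's elastic-net Cox implementation. This assumes scikit-survival is installed and is not the authors' LARS-Cox code, which solves the same penalized problem with the LARS algorithm; the data below are simulated:

```python
import numpy as np
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sksurv.util import Surv

rng = np.random.default_rng(0)
n_patients, n_genes = 80, 500                    # low sample size, high dimension
X = rng.standard_normal((n_patients, n_genes))
risk = X[:, 0] - 0.8 * X[:, 1]                   # only two genes drive survival in this toy data
time = rng.exponential(np.exp(-risk))
event = rng.random(n_patients) < 0.7
y = Surv.from_arrays(event=event, time=time)

# l1_ratio=1.0 gives the pure L1 (lasso) penalty, so most gene coefficients are shrunk to zero.
model = CoxnetSurvivalAnalysis(l1_ratio=1.0, alpha_min_ratio=0.01)
model.fit(X, y)

coefs = model.coef_                              # shape (n_genes, n_alphas) along the penalty path
mid = len(model.alphas_) // 2
print("genes selected at a mid-path penalty:", np.flatnonzero(coefs[:, mid]))
```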

Journal ArticleDOI
TL;DR: In this paper, the authors investigated the relationship between sample size and the quality of factor solutions obtained from exploratory factor analysis and found that when communalities are high, sample size tended to have less influence on the quality compared to when they were low.
Abstract: The purpose of this study was to investigate the relationship between sample size and the quality of factor solutions obtained from exploratory factor analysis. This research expanded upon the range of conditions previously examined, employing a broad selection of criteria for the evaluation of the quality of sample factor solutions. Results showed that when communalities are high, sample size tended to have less influence on the quality of factor solutions than when communalities are low. Overdetermination of factors was also shown to improve the factor analysis solution. Finally, decisions about the quality of the factor solution depended upon which criteria were examined.

Journal ArticleDOI
TL;DR: In this article, the authors find a common structure underlying many such data sets by using a non-standard type of asymptotics: the dimension tends to ∞ while the sample size is fixed.
Abstract: Summary. High dimension, low sample size data are emerging in various areas of science. We find a common structure underlying many such data sets by using a non-standard type of asymptotics: the dimension tends to ∞ while the sample size is fixed. Our analysis shows a tendency for the data to lie deterministically at the vertices of a regular simplex. Essentially all the randomness in the data appears only as a random rotation of this simplex. This geometric representation is used to obtain several new statistical insights.
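The geometric representation described above can be seen in a few lines of simulation: for fixed n and growing dimension d, the pairwise distances between standard Gaussian observations concentrate around √(2d), which is the behaviour of the vertices of a regular simplex (our illustration, not the paper's derivation):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 5  # small, fixed sample size
for d in (10, 1_000, 100_000):
    X = rng.standard_normal((n, d))
    # Scale each pairwise distance by sqrt(2d); concentration near 1 means a near-regular simplex.
    dists = [np.linalg.norm(X[i] - X[j]) / np.sqrt(2 * d) for i, j in combinations(range(n), 2)]
    print(d, round(min(dists), 3), round(max(dists), 3))
```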

Journal ArticleDOI
TL;DR: In this article, a first-order asymptotic theory for heteroskedasticity-autocorrelation (HAC) robust tests based on nonparametric covariance matrix estimators is developed.
Abstract: A new first-order asymptotic theory for heteroskedasticity-autocorrelation (HAC) robust tests based on nonparametric covariance matrix estimators is developed. The bandwidth of the covariance matrix estimator is modeled as a fixed proportion of the sample size. This leads to a distribution theory for HAC robust tests that explicitly captures the choice of bandwidth and kernel. This contrasts with the traditional asymptotics (where the bandwidth increases slower than the sample size) where the asymptotic distributions of HAC robust tests do not depend on the bandwidth or kernel. Finite sample simulations show that the new approach is more accurate than the traditional asymptotics. The impact of bandwidth and kernel choice on size and power of t-tests is analyzed. Smaller bandwidths lead to tests with higher power but greater size distortions and large bandwidths lead to tests with lower power but less size distortions. Size distortions across bandwidths increase as the serial correlation in the data becomes stronger. Overall, the results clearly indicate that for bandwidth and kernel choice
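The modelling device described above, treating the bandwidth as a fixed proportion b of the sample size, is easy to state in code. The sketch below computes a Bartlett-kernel (Newey-West type) HAC long-run variance with M = bT; it is our illustration of the fixed-b idea, not the paper's procedure:

```python
import numpy as np

def hac_variance(u, b=0.3):
    """Bartlett-kernel HAC estimate of the long-run variance of u, with bandwidth M = b*T."""
    u = np.asarray(u, dtype=float) - np.mean(u)
    T = len(u)
    M = max(1, int(b * T))            # fixed-b: bandwidth is a fixed fraction of the sample size
    lrv = np.dot(u, u) / T            # lag-0 autocovariance
    for j in range(1, M + 1):
        w = 1.0 - j / (M + 1.0)       # Bartlett weights
        gamma_j = np.dot(u[j:], u[:-j]) / T
        lrv += 2.0 * w * gamma_j
    return lrv

rng = np.random.default_rng(0)
e = rng.standard_normal(201)
x = e[1:] + 0.7 * e[:-1]              # an MA(1) series with serial correlation
# Larger b tends to reduce size distortions at the cost of power, as discussed above.
print(hac_variance(x, b=0.1), hac_variance(x, b=0.5))
```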

Journal ArticleDOI
01 May 2005-Ecology
TL;DR: Two general approaches that researchers may wish to consider that incorporate the concept of imperfect detectability are suggested: borrowing information about detectability or the other quantities of interest from other times, places, or species; and using state variables other than abundance (e.g., species richness and occupancy).
Abstract: For the vast majority of cases, it is highly unlikely that all the individuals of a population will be encountered during a study. Furthermore, it is unlikely that a constant fraction of the population is encountered over times, locations, or species to be compared. Hence, simple counts usually will not be good indices of population size. We recommend that detection probabilities (the probability of including an individual in a count) be estimated and incorporated into inference procedures. However, most techniques for estimating detection probability require moderate sample sizes, which may not be achievable when studying rare species. In order to improve the reliability of inferences from studies of rare species, we suggest two general approaches that researchers may wish to consider that incorporate the concept of imperfect detectability: (1) borrowing information about detectability or the other quantities of interest from other times, places, or species; and (2) using state variables other than abundance (e.g., species richness and occupancy). We illustrate these suggestions with examples and discuss the relative benefits and drawbacks of each approach.
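The core adjustment being recommended, dividing a count by an estimated detection probability rather than treating the raw count as an index of abundance, can be written in a few lines; the numbers below are hypothetical:

```python
count = 37          # individuals actually encountered during the survey (hypothetical)
p_hat = 0.42        # estimated detection probability, e.g. from capture-recapture or distance sampling
se_p = 0.06         # its standard error (hypothetical)

n_hat = count / p_hat
# Delta-method standard error, treating the count as fixed for simplicity.
se_n = count * se_p / p_hat ** 2
print(round(n_hat, 1), round(se_n, 1))   # about 88.1 individuals, SE about 12.6
```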

Journal ArticleDOI
TL;DR: The emphasis on pooling individual aspects of diagnostic test performance and the under-use of statistical tests and graphical approaches to identify heterogeneity perhaps reflect the uncertainty in the most appropriate methods to use and also greater familiarity with more traditional indices of test accuracy.
Abstract: Background: Systematic reviews of therapeutic interventions are now commonplace in many if not most areas of healthcare, and in recent years interest has turned to applying similar techniques to research evaluating diagnostic tests. One of the key parts of any review is to consider how similar or different the available primary studies are and what impact any differences have on studies’ results. Between-study differences or heterogeneity in results can result from chance, from errors in calculating accuracy indices or from true heterogeneity, that is, differences in design, conduct, participants, tests and reference tests. An important additional consideration for diagnostic studies is differences in results due to variations in the chosen threshold for a positive result for either the index or reference test. Dealing with heterogeneity is particularly challenging for diagnostic test reviews, not least because test accuracy is conventionally represented by a pair of statistics and not by a single measure of effect such as relative risk, and as a result a variety of statistical methods are available that differ in the way in which they tackle the bivariate nature of test accuracy data: methods that undertake independent analyses of each aspect of test performance; methods that further summarise test performance into a single summary statistic; and methods that use statistical models that simultaneously consider both dimensions of test performance. The validity of a choice of meta-analytical method depends in part on the pattern of variability (heterogeneity) observed in the study results. However, currently there is no empirical guidance to judge which methods are appropriate in which circumstances, and the degree to which different methods yield comparable results. All this adds to the complexity and difficulty of undertaking systematic reviews of diagnostic test accuracy.

Objectives: Our objective was to review how heterogeneity has been examined in systematic reviews of diagnostic test accuracy studies.

Methods: Systematic reviews that evaluated a diagnostic or screening test by including studies that compared a test with a reference test were identified from the Centre for Reviews and Dissemination’s Database of Abstracts of Reviews of Effects. Reviews for which structured abstracts had been written up to December 2002 were screened for inclusion. Data extraction was undertaken using standardised data extraction forms by one reviewer and checked by a second.

Results: A total of 189 systematic reviews met our inclusion criteria and were included in the review. The median number of studies included in the reviews was 18 [inter-quartile range (IQR) 20]. Meta-analyses (n = 133) have a higher number with a median of 22 studies (IQR 20) compared with 11 (IQR 13) for narrative reviews (n = 56).

Identification of heterogeneity: Graphical plots to demonstrate the spread in study results were provided in 56% of meta-analyses; in 79% of cases these were in the form of plots of sensitivity and specificity in the receiver operating characteristic (ROC) space (commonly termed ‘ROC plots’). Statistical tests to identify heterogeneity were used in 32% of reviews: 41% of meta-analyses and 9% of reviews using narrative syntheses. The χ² test and Fisher’s exact test to assess heterogeneity in individual aspects of test performance were most commonly used. In contrast, only 16% of meta-analyses used correlation coefficients to test for a threshold effect.

Type of syntheses used: A narrative synthesis was used in 30% of reviews. Of the meta-analyses, 52% carried out statistical pooling alone, 18% conducted only summary receiver operator characteristic (SROC) analyses and 30% used both methods of statistical synthesis. Of the reviews that pooled accuracy indices, most pooled each aspect of test performance separately with only a handful producing single summaries of test performance such as the diagnostic odds ratio. For those undertaking SROC analyses, the main differences between the models used were the weights chosen for the regression models. In fact, in 42% of cases (27/64) the use of, or choice of, weight was not provided by the review authors. The proportion of reviews using statistical pooling alone has declined over time from 67% in 1995 to 42% in 2001, with a corresponding increase in the use of SROC methods, from 33% to 58%. However, two-thirds of those using SROC methods also carried out statistical pooling rather than presenting only SROC models. Reviews using SROC analyses also tended to present their results as some combination of sensitivity and specificity rather than using alternative, perhaps less clinically meaningful, means of data presentation such as diagnostic odds ratios.

Investigation of heterogeneity sources: Three-quarters of meta-analyses attempted to investigate statistically possible sources of variation, using subgroup analysis (76) or regression analysis (44). The median number of variables investigated was four, ranging from one variable in 20% of reviews to over six in 27% of reviews. The ratio of median number of variables to median number of studies was 1:6. The impact of clinical or socio-demographic variables was investigated in 74% of these reviews and test- or threshold-related variables in 79%. At least one quality-related variable was investigated in 63% of reviews. Within this subset, the most commonly considered variables were the use of blinding (41% of reviews), sample size (33%), the reference test used (28%) and the avoidance of verification bias (25%).

Conclusions: The emphasis on pooling individual aspects of diagnostic test performance and the under-use of statistical tests and graphical approaches to identify heterogeneity perhaps reflect the uncertainty in the most appropriate methods to use and also greater familiarity with more traditional indices of test accuracy. This is an indication of the level of difficulty and complexity of carrying out these reviews. It is strongly suggested that in such reviews meta-analyses are carried out with the involvement of a statistician familiar with the field.
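One of the SROC approaches counted in the review, a Moses-Littenberg-style regression of the log diagnostic odds ratio (D) on a threshold proxy (S), can be sketched as follows; the per-study 2x2 counts are invented for illustration:

```python
import numpy as np

# Hypothetical per-study counts: (true positives, false negatives, false positives, true negatives).
studies = [(45, 5, 10, 40), (30, 10, 6, 54), (60, 15, 20, 105), (22, 3, 9, 66)]

def logit(p):
    return np.log(p / (1 - p))

D, S = [], []
for tp, fn, fp, tn in studies:
    # Continuity correction of 0.5 guards against zero cells.
    tpr = (tp + 0.5) / (tp + fn + 1)
    fpr = (fp + 0.5) / (fp + tn + 1)
    D.append(logit(tpr) - logit(fpr))   # log diagnostic odds ratio
    S.append(logit(tpr) + logit(fpr))   # proxy for the positivity threshold

# Unweighted least-squares fit D = a + b*S; the SROC curve follows from (a, b).
b, a = np.polyfit(S, D, 1)
print(round(a, 2), round(b, 2))
```

The choice of weights in this regression (or the use of an unweighted fit, as here) is exactly the detail the review found was often unreported.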

Reference EntryDOI
15 Oct 2005
TL;DR: Some formulas are given to obtain insight into the design aspects that are most influential for standard errors and power.
Abstract: Sample size determination in multilevel designs requires attention to the fact that statistical power depends on the total sample sizes for each level. It is usually desirable to have as many units as possible at the top level of the multilevel hierarchy. Some formulas are given to obtain insight into the design aspects that are most influential for standard errors and power. Keywords: power; statistical tests; design; multilevel analysis; sample size; multisite trial; cluster randomization
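A formula of the kind referred to above is the standard error of an overall mean in a two-level design, which depends on the number of groups (J), the group size (n) and the intraclass correlation (ρ) through the design effect 1 + (n − 1)ρ. The sketch below is our restatement of that standard expression, not the entry's own derivation:

```python
def se_mean_two_level(sigma2, n, J, rho):
    """Standard error of the grand mean with J groups of size n and intraclass correlation rho."""
    design_effect = 1 + (n - 1) * rho
    return (sigma2 * design_effect / (n * J)) ** 0.5

# Same total sample size (600), but more top-level units gives the smaller standard error.
print(se_mean_two_level(sigma2=1.0, n=20, J=30, rho=0.10))   # about 0.070
print(se_mean_two_level(sigma2=1.0, n=60, J=10, rho=0.10))   # about 0.107
```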

Journal ArticleDOI
TL;DR: When designing diagnostic test studies, sample size calculations should be performed in order to guarantee the required accuracy of the design, and tables for sample size determination in this context are provided.
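A standard calculation of this kind sizes the study so that sensitivity is estimated within a chosen margin d and then inflates for disease prevalence. The sketch below is our hedged illustration of that generic approach, not necessarily the exact tables in the paper:

```python
from math import ceil
from scipy.stats import norm

def n_for_sensitivity(sens, d, prevalence, alpha=0.05):
    """Total sample size so that sensitivity is estimated within +/- d with (1 - alpha) confidence."""
    z = norm.ppf(1 - alpha / 2)
    n_diseased = ceil(z ** 2 * sens * (1 - sens) / d ** 2)
    # Inflate because only a fraction of recruited subjects will have the disease.
    return ceil(n_diseased / prevalence)

# Expected sensitivity 0.90, desired precision +/- 0.05, prevalence 20%: about 139 diseased, 695 total.
print(n_for_sensitivity(sens=0.90, d=0.05, prevalence=0.20))
```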

Journal ArticleDOI
TL;DR: Properties of the commonly used model fit indices when dropping the chi-square distribution assumptions are studied, and linear approximation of the distribution of a fit index/statistic by a known distribution, or by the distribution of the same fit index/statistic under a set of different conditions, is proposed.
Abstract: Model evaluation is one of the most important aspects of structural equation modeling (SEM). Many model fit indices have been developed. It is not an exaggeration to say that nearly every publication using the SEM methodology has reported at least one fit index. Most fit indices are defined through test statistics. Studies and interpretation of fit indices commonly assume that the test statistics follow either a central chi-square distribution or a noncentral chi-square distribution. Because few statistics in practice follow a chi-square distribution, we study properties of the commonly used fit indices when dropping the chi-square distribution assumptions. The study identifies two sensible statistics for evaluating fit indices involving degrees of freedom. We also propose linearly approximating the distribution of a fit index/statistic by a known distribution or the distribution of the same fit index/statistic under a set of different conditions. The conditions include the sample size, the distribution of the data as well as the base-statistic. Results indicate that, for commonly used fit indices evaluated at sensible statistics, both the slope and the intercept in the linear relationship change substantially when conditions change. A fit index that changes the least might be due to an artificial factor. Thus, the value of a fit index is not just a measure of model fit but also of other uncontrollable factors. A discussion with conclusions is given on how to properly use fit indices.

Journal ArticleDOI
TL;DR: This study employs simulation for various feature-label distributions and classification rules, and across a wide range of sample and feature-set sizes, to achieve the desired end: finding the optimal number of features as a function of sample size.
Abstract: Motivation: Given the joint feature-label distribution, increasing the number of features always results in decreased classification error; however, this is not the case when a classifier is designed via a classification rule from sample data. Typically (but not always), for fixed sample size, the error of a designed classifier decreases and then increases as the number of features grows. The potential downside of using too many features is most critical for small samples, which are commonplace for gene-expression-based classifiers for phenotype discrimination. For fixed sample size and feature-label distribution, the issue is to find an optimal number of features. Results: Since only in rare cases is there a known distribution of the error as a function of the number of features and sample size, this study employs simulation for various feature-label distributions and classification rules, and across a wide range of sample and feature-set sizes. To achieve the desired end, finding the optimal number of features as a function of sample size, it employs massively parallel computation. Seven classifiers are treated: 3-nearest-neighbor, Gaussian kernel, linear support vector machine, polynomial support vector machine, perceptron, regular histogram and linear discriminant analysis. Three Gaussian-based models are considered: linear, nonlinear and bimodal. In addition, real patient data from a large breast-cancer study is considered. To mitigate the combinatorial search for finding optimal feature sets, and to model the situation in which subsets of genes are co-regulated and correlation is internal to these subsets, we assume that the covariance matrix of the features is blocked, with each block corresponding to a group of correlated features. Altogether there are a large number of error surfaces for the many cases. These are provided in full on a companion website, which is meant to serve as resource for those working with small-sample classification. Availability: For the companion website, please visit http://public.tgen.org/tamu/ofs/ Contact: e-dougherty@ee.tamu.edu
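The peaking phenomenon described above (designed-classifier error falling and then rising as features are added at fixed sample size) can be reproduced in a few lines with a linear discriminant classifier on synthetic Gaussian data; this is our toy illustration, not the study's simulation framework:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_train, n_test, n_features, informative = 30, 2000, 60, 5

def sample(n):
    """Two-class Gaussian data: only the first `informative` features separate the classes."""
    y = rng.integers(0, 2, n)
    X = rng.standard_normal((n, n_features))
    X[:, :informative] += 0.8 * y[:, None]
    return X, y

Xtr, ytr = sample(n_train)
Xte, yte = sample(n_test)

for d in (2, 5, 10, 20, 40, 59):
    clf = LinearDiscriminantAnalysis().fit(Xtr[:, :d], ytr)
    err = 1.0 - clf.score(Xte[:, :d], yte)
    # Error typically drops while informative features are added, then climbs as noise
    # features overwhelm the 30 training samples.
    print(d, round(err, 3))
```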

Journal ArticleDOI
TL;DR: This study compared the relative power of Mann-Whitney and ANCOVA and investigated the distribution of change scores between repeat assessments of a non-normally distributed variable, finding that ANCOVA is the preferred method of analyzing randomized trials with baseline and post-treatment measures.
Abstract: It has generally been argued that parametric statistics should not be applied to data with non-normal distributions. Empirical research has demonstrated that Mann-Whitney generally has greater power than the t-test unless data are sampled from the normal. In the case of randomized trials, we are typically interested in how an endpoint, such as blood pressure or pain, changes following treatment. Such trials should be analyzed using ANCOVA, rather than t-test. The objectives of this study were: a) to compare the relative power of Mann-Whitney and ANCOVA; b) to determine whether ANCOVA provides an unbiased estimate for the difference between groups; c) to investigate the distribution of change scores between repeat assessments of a non-normally distributed variable. Polynomials were developed to simulate five archetypal non-normal distributions for baseline and post-treatment scores in a randomized trial. Simulation studies compared the power of Mann-Whitney and ANCOVA for analyzing each distribution, varying sample size, correlation and type of treatment effect (ratio or shift). Change between skewed baseline and post-treatment data tended towards a normal distribution. ANCOVA was generally superior to Mann-Whitney in most situations, especially where log-transformed data were entered into the model. The estimate of the treatment effect from ANCOVA was not importantly biased. ANCOVA is the preferred method of analyzing randomized trials with baseline and post-treatment measures. In certain extreme cases, ANCOVA is less powerful than Mann-Whitney. Notably, in these cases, the estimate of treatment effect provided by ANCOVA is of questionable interpretability.
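The ANCOVA being recommended is simply a regression of the post-treatment score on the baseline score plus a treatment indicator. A minimal sketch with statsmodels and simulated skewed data (our example, not the paper's simulation code) is:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
baseline = rng.gamma(shape=2.0, scale=10.0, size=n)           # skewed, non-normal baseline scores
group = rng.integers(0, 2, n)                                 # 0 = control, 1 = treatment
post = 0.6 * baseline - 5.0 * group + rng.gamma(2.0, 8.0, n)  # true treatment effect of -5

df = pd.DataFrame({"post": post, "baseline": baseline, "group": group})
# ANCOVA: the between-group comparison is adjusted for the baseline measurement.
fit = smf.ols("post ~ baseline + group", data=df).fit()
print(fit.params["group"], fit.conf_int().loc["group"].values)
```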

Journal ArticleDOI
TL;DR: The relation among fit indexes, power, and sample size in structural equation modeling is examined in this article, where four fit indexes (RMSEA, CFI, McDonald's Fit Index, and Steiger's gamma) were used to compute the noncentrality parameter and sample sizes to achieve a certain level of power.
Abstract: The relation among fit indexes, power, and sample size in structural equation modeling is examined. The noncentrality parameter is required to compute power. The 2 existing methods of computing power have estimated the noncentrality parameter by specifying an alternative hypothesis or alternative fit. These methods cannot be implemented easily and reliably. In this study, 4 fit indexes (RMSEA, CFI, McDonald's Fit Index, and Steiger's gamma) were used to compute the noncentrality parameter and sample size to achieve a certain level of power. The resulting power and sample size varied as a function of (a) choice of fit index, (b) number of variables/degrees of freedom, (c) relation among the variables, and (d) value of the fit index. However, if the level of misspecification were held constant, then the resulting power and sample size would be identical.
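As an example of the noncentrality-based calculation described above, the sketch below computes power for a test of exact fit against an alternative population RMSEA, using the usual λ = (N − 1)·df·ε² relationship; it is a generic illustration, not the article's own code:

```python
from scipy.stats import chi2, ncx2

def power_from_rmsea(rmsea_alt, df, n, alpha=0.05):
    """Power to reject exact fit when the population RMSEA equals rmsea_alt."""
    crit = chi2.ppf(1 - alpha, df)          # critical value under the null of exact fit
    ncp = (n - 1) * df * rmsea_alt ** 2     # noncentrality parameter implied by the fit index
    return 1 - ncx2.cdf(crit, df, ncp)

# Power rises with sample size for a fixed degree of misspecification (RMSEA = 0.05, df = 50).
for n in (100, 200, 400, 800):
    print(n, round(power_from_rmsea(rmsea_alt=0.05, df=50, n=n), 3))
```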

Journal ArticleDOI
TL;DR: In this article, the authors compare the six lag-order selection criteria most commonly used in applied work and conclude that the Akaike Information Criterion (AIC) tends to produce the most accurate structural and semi-structural impulse response estimates for realistic sample sizes.
Abstract: It is common in empirical macroeconomics to fit vector autoregressive (VAR) models to construct estimates of impulse responses. An important preliminary step in impulse response analysis is the selection of the VAR lag order. In this paper, we compare the six lag-order selection criteria most commonly used in applied work. Our metric is the mean-squared error (MSE) of the implied pointwise impulse response estimates normalized relative to their MSE based on knowing the true lag order. Based on our simulation design we conclude that for monthly VAR models, the Akaike Information Criterion (AIC) tends to produce the most accurate structural and semi-structural impulse response estimates for realistic sample sizes. For quarterly VAR models, the Hannan-Quinn Criterion (HQC) appears to be the most accurate criterion with the exception of sample sizes smaller than 120, for which the Schwarz Information Criterion (SIC) is more accurate. For persistence profiles based on quarterly vector error correction models with known cointegrating vector, our results suggest that the SIC is the most accurate criterion for all realistic sample sizes.
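For reference, the competing criteria can be computed for each candidate lag order from a fitted VAR. A minimal sketch using statsmodels, with simulated data standing in for a macroeconomic dataset, is given below; the smallest value of each criterion indicates its selected lag order:

```python
import numpy as np
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
T, k = 160, 2                        # a quarterly-sized sample of two series
A = np.array([[0.5, 0.1], [0.0, 0.4]])
y = np.zeros((T, k))
for t in range(1, T):                # simulate a VAR(1) process
    y[t] = y[t - 1] @ A.T + rng.standard_normal(k)

model = VAR(y)
for p in range(1, 7):
    res = model.fit(p)
    # AIC, HQC (hqic) and SIC (bic) trade off fit against the number of estimated coefficients.
    print(p, round(res.aic, 3), round(res.hqic, 3), round(res.bic, 3))
```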