
Showing papers in "Statistics in Medicine in 2000"


Journal ArticleDOI
TL;DR: A joinpoint regression model is applied to describe continuous changes in the recent trend and the grid-search method is used to fit the regression function with unknown joinpoints assuming constant variance and uncorrelated errors.
Abstract: The identification of changes in the recent trend is an important issue in the analysis of cancer mortality and incidence data. We apply a joinpoint regression model to describe such continuous changes and use the grid-search method to fit the regression function with unknown joinpoints assuming constant variance and uncorrelated errors. We find the number of significant joinpoints by performing several permutation tests, each of which has a correct significance level asymptotically. Each p-value is found using Monte Carlo methods, and the overall asymptotic significance level is maintained through a Bonferroni correction. These tests are extended to the situation with non-constant variance to handle rates with Poisson variation and possibly autocorrelated errors. The performance of these tests is studied via simulations and the tests are applied to U.S. prostate cancer incidence and mortality rates.
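
A rough sketch of the grid-search idea for a single joinpoint (this is not the authors' software; the one-joinpoint restriction, constant variance and all variable names are assumptions made for illustration). Candidate joinpoints are scanned and the one minimizing the residual sum of squares is kept; the permutation tests described above would then compare models with and without the joinpoint.

import numpy as np

def fit_one_joinpoint(x, y):
    """Grid-search a single joinpoint for a continuous piecewise-linear mean:
    E[y] = b0 + b1*x + b2*(x - tau)_+ with tau restricted to interior x values."""
    best = None
    for tau in np.unique(x)[2:-2]:                      # keep a few points on each side
        X = np.column_stack([np.ones_like(x), x, np.clip(x - tau, 0, None)])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        rss = float(((y - X @ beta) ** 2).sum())
        if best is None or rss < best[0]:
            best = (rss, tau, beta)
    return best                                         # (rss, joinpoint, coefficients)

# toy annual rates with a slope change in 2005
rng = np.random.default_rng(0)
x = np.arange(1990, 2020, dtype=float)
y = 5 + 0.2 * (x - 1990) - 0.5 * np.clip(x - 2005, 0, None) + rng.normal(0, 0.3, x.size)
print(fit_one_joinpoint(x, y)[1])                       # estimated joinpoint, near 2005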

3,950 citations


Journal ArticleDOI
TL;DR: The paper considers how to validate a prognostic model, suggests that it is desirable to consider two rather different aspects - statistical and clinical validity - and examines some general approaches to validation.
Abstract: Prognostic models are used in medicine for investigating patient outcome in relation to patient and disease characteristics. Such models do not always work well in practice, so it is widely recommended that they need to be validated. The idea of validating a prognostic model is generally taken to mean establishing that it works satisfactorily for patients other than those from whose data it was derived. In this paper we examine what is meant by validation and review why it is necessary. We consider how to validate a model and suggest that it is desirable to consider two rather different aspects - statistical and clinical validity - and examine some general approaches to validation. We illustrate the issues using several case studies.

1,418 citations


Journal ArticleDOI
TL;DR: This article reviews the common algorithms for resampling and methods for constructing bootstrap confidence intervals, together with some less well known ones, highlighting their strengths and weaknesses.
Abstract: Since the early 1980s, a bewildering array of methods for constructing bootstrap confidence intervals have been proposed. In this article, we address the following questions. First, when should bootstrap confidence intervals be used? Secondly, which method should be chosen, and thirdly, how should it be implemented? In order to do this, we review the common algorithms for resampling and methods for constructing bootstrap confidence intervals, together with some less well known ones, highlighting their strengths and weaknesses. We then present a simulation study, a flow chart for choosing an appropriate method and a survival analysis example.
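
A minimal sketch of the basic resampling step and the simple percentile interval, one of the methods reviewed (the statistic, sample size and data below are placeholders; the bootstrap-t, BCa and other variants need additional steps).

import numpy as np

def percentile_ci(data, stat=np.median, n_boot=2000, alpha=0.05, seed=1):
    """Non-parametric bootstrap percentile confidence interval for a statistic."""
    rng = np.random.default_rng(seed)
    boots = np.array([stat(rng.choice(data, size=data.size, replace=True))
                      for _ in range(n_boot)])
    return np.quantile(boots, [alpha / 2, 1 - alpha / 2])

sample = np.random.default_rng(2).exponential(scale=3.0, size=60)   # skewed toy data
print(percentile_ci(sample))                                        # 95% CI for the median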

1,416 citations


Journal ArticleDOI
TL;DR: It is shown that a ln(odds ratio) can be converted to effect size by dividing by 1.81, and the validity of effect size, the estimate of interest divided by the residual standard deviation, depends on comparable variation across studies.
Abstract: A systematic review may encompass both odds ratios and mean differences in continuous outcomes. A separate meta-analysis of each type of outcome results in loss of information and may be misleading. It is shown that a ln(odds ratio) can be converted to effect size by dividing by 1.81. The validity of effect size, the estimate of interest divided by the residual standard deviation, depends on comparable variation across studies. If researchers routinely report residual standard deviation, any subsequent review can combine both odds ratios and effect sizes in a single meta-analysis when this is justified. Copyright © 2000 John Wiley & Sons, Ltd.
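
The divisor 1.81 is the standard deviation of the standard logistic distribution, pi/sqrt(3) ~= 1.8138, so dividing a log odds ratio by it puts the estimate on a standardized-mean-difference scale. A short worked illustration (the odds ratio value is arbitrary):

import math

LOGISTIC_SD = math.pi / math.sqrt(3)          # ~= 1.8138, the 1.81 used in the paper

def log_or_to_effect_size(odds_ratio):
    """Convert an odds ratio to an approximate standardized effect size."""
    return math.log(odds_ratio) / LOGISTIC_SD

print(round(log_or_to_effect_size(2.5), 2))   # ln(2.5)/1.81 ~= 0.51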

1,137 citations


Journal ArticleDOI
TL;DR: The covariance structure of repeated measures data can be modelled with the MIXED procedure of the SAS® System, and the example shows how the choice of covariance structure affects tests and standard errors of the fixed effects.
Abstract: The term 'repeated measures' refers to data with multiple observations on the same sampling unit. In most cases, the multiple observations are taken over time, but they could be over space. It is usually plausible to assume that observations on the same unit are correlated. Hence, statistical analysis of repeated measures data must address the issue of covariation between measures on the same unit. Until recently, analysis techniques available in computer software only offered the user limited and inadequate choices. One choice was to ignore covariance structure and make invalid assumptions. Another was to avoid the covariance structure issue by analysing transformed data or making adjustments to otherwise inadequate analyses. Ignoring covariance structure may result in erroneous inference, and avoiding it may result in inefficient inference. Recently available mixed model methodology permits the covariance structure to be incorporated into the statistical model. The MIXED procedure of the SAS® System provides a rich selection of covariance structures through the RANDOM and REPEATED statements. Modelling the covariance structure is a major hurdle in the use of PROC MIXED. However, once the covariance structure is modelled, inference about fixed effects proceeds essentially as when using PROC GLM. An example from the pharmaceutical industry is used to illustrate how to choose a covariance structure. The example also illustrates the effects of choice of covariance structure on tests and estimates of fixed effects. In many situations, estimates of linear combinations are invariant with respect to covariance structure, yet standard errors of the estimates may still depend on the covariance structure.
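
The point that the choice of covariance structure mainly changes standard errors rather than point estimates can be sketched outside SAS. Below, compound-symmetry and AR(1) working covariances are built explicitly and plugged into generalized least squares; everything here (data, dimensions, parameter values) is invented for illustration and is not the paper's pharmaceutical example.

import numpy as np

def cs_cov(t, sigma2=1.0, rho=0.5):
    """Compound-symmetry (exchangeable) covariance for t repeated measures."""
    return sigma2 * ((1 - rho) * np.eye(t) + rho * np.ones((t, t)))

def ar1_cov(t, sigma2=1.0, rho=0.5):
    """First-order autoregressive covariance for t equally spaced measures."""
    idx = np.arange(t)
    return sigma2 * rho ** np.abs(np.subtract.outer(idx, idx))

def gls(X, y, V):
    """GLS fixed-effect estimates and standard errors when every subject
    shares the within-subject covariance V (block-diagonal overall covariance)."""
    Vinv = np.linalg.inv(V)
    XtVX = sum(Xi.T @ Vinv @ Xi for Xi in X)
    XtVy = sum(Xi.T @ Vinv @ yi for Xi, yi in zip(X, y))
    beta = np.linalg.solve(XtVX, XtVy)
    se = np.sqrt(np.diag(np.linalg.inv(XtVX)))
    return beta, se

# toy data: 20 subjects, 4 visits, design = intercept + visit number
rng = np.random.default_rng(3)
t = 4
X = np.stack([np.column_stack([np.ones(t), np.arange(t)]) for _ in range(20)])
y = np.stack([Xi @ np.array([10.0, -0.5]) +
              rng.multivariate_normal(np.zeros(t), ar1_cov(t, 1.0, 0.6))
              for Xi in X])
for V in (cs_cov(t), ar1_cov(t)):
    print(gls(X, y, V))   # similar estimates, different standard errors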

812 citations


Journal ArticleDOI
TL;DR: It is found that stepwise selection with a low alpha led to a relatively poor model performance, when evaluated on independent data, and shrinkage methods in full models including prespecified predictors and incorporation of external information are recommended, when prognostic models are constructed in small data sets.
Abstract: Logistic regression analysis may well be used to develop a prognostic model for a dichotomous outcome. Especially when limited data are available, it is difficult to determine an appropriate selection of covariables for inclusion in such models. Also, predictions may be improved by applying some sort of shrinkage in the estimation of regression coefficients. In this study we compare the performance of several selection and shrinkage methods in small data sets of patients with acute myocardial infarction, where we aim to predict 30-day mortality. Selection methods included backward stepwise selection with significance levels alpha of 0.01, 0.05, 0.157 (the AIC criterion) or 0.50, and the use of qualitative external information on the sign of regression coefficients in the model. Estimation methods included standard maximum likelihood, the use of a linear shrinkage factor, penalized maximum likelihood, the Lasso, or quantitative external information on univariable regression coefficients. We found that stepwise selection with a low alpha (for example, 0.05) led to a relatively poor model performance, when evaluated on independent data. Substantially better performance was obtained with full models with a limited number of important predictors, where regression coefficients were reduced with any of the shrinkage methods. Incorporation of external information for selection and estimation improved the stability and quality of the prognostic models. We therefore recommend shrinkage methods in full models including prespecified predictors and incorporation of external information, when prognostic models are constructed in small data sets.
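
A minimal sketch of the contrast between unpenalized maximum likelihood and two shrinkage fits for a logistic model (scikit-learn is used purely for illustration; the paper's linear shrinkage factor and penalized maximum likelihood implementations differ, and the data below are simulated placeholders).

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n, p = 150, 10                               # deliberately small sample, many candidate predictors
X = rng.normal(size=(n, p))
true_beta = np.r_[1.0, -0.8, 0.5, np.zeros(p - 3)]
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta))))

models = {
    "maximum likelihood": LogisticRegression(penalty=None, max_iter=1000),   # penalty="none" on older scikit-learn
    "ridge-type shrinkage": LogisticRegression(penalty="l2", C=0.5, max_iter=1000),
    "lasso-type shrinkage": LogisticRegression(penalty="l1", C=0.5, solver="liblinear"),
}
for name, m in models.items():
    m.fit(X, y)
    print(name, np.round(m.coef_.ravel(), 2))   # shrunken fits pull coefficients towards zero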

720 citations


Journal ArticleDOI
TL;DR: It is shown how the non-parametric bootstrap provides a more flexible alternative for comparing arithmetic mean costs between randomized groups, avoiding the assumptions which limit other methods.
Abstract: Health economic evaluations are now more commonly being included in pragmatic randomized trials. However a variety of methods are being used for the presentation and analysis of the resulting cost data, and in many cases the approaches taken are inappropriate. In order to inform health care policy decisions, analysis needs to focus on arithmetic mean costs, since these will reflect the total cost of treating all patients with the disease. Thus, despite the often highly skewed distribution of cost data, standard non-parametric methods or use of normalizing transformations are not appropriate. Although standard parametric methods of comparing arithmetic means may be robust to non-normality for some data sets, this is not guaranteed. While the randomization test can be used to overcome assumptions of normality, its use for comparing means is still restricted by the need for similarly shaped distributions in the two groups. In this paper we show how the non-parametric bootstrap provides a more flexible alternative for comparing arithmetic mean costs between randomized groups, avoiding the assumptions which limit other methods. Details of several bootstrap methods for hypothesis tests and confidence intervals are described and applied to cost data from two randomized trials. The preferred bootstrap approaches are the bootstrap-t or variance stabilized bootstrap-t and the bias corrected and accelerated percentile methods. We conclude that such bootstrap techniques can be recommended either as a check on the robustness of standard parametric methods, or to provide the primary statistical analysis when making inferences about arithmetic means for moderately sized samples of highly skewed data such as costs.
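
A minimal sketch of a bootstrap comparison of arithmetic mean costs between two randomized groups. Only the simple percentile interval for the difference in means is shown; the bootstrap-t and bias corrected and accelerated variants preferred by the authors require extra bookkeeping. The lognormal cost data are invented placeholders.

import numpy as np

def boot_mean_diff_ci(cost_a, cost_b, n_boot=5000, alpha=0.05, seed=5):
    """Percentile bootstrap CI for the difference in arithmetic mean costs."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        ra = rng.choice(cost_a, size=cost_a.size, replace=True)
        rb = rng.choice(cost_b, size=cost_b.size, replace=True)
        diffs[b] = ra.mean() - rb.mean()
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(6)
cost_a = rng.lognormal(mean=7.0, sigma=1.2, size=120)    # highly skewed, cost-like data
cost_b = rng.lognormal(mean=7.2, sigma=1.2, size=115)
print(cost_a.mean() - cost_b.mean(), boot_mean_diff_ci(cost_a, cost_b))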

637 citations


Journal ArticleDOI
TL;DR: A unified framework for a Bayesian analysis of incidence or mortality data in space and time is proposed and an epidemiological hypothesis about the temporal development of the association between urbanization and risk factors for cancer is confirmed.
Abstract: This paper proposes a unified framework for a Bayesian analysis of incidence or mortality data in space and time. We introduce four different types of prior distributions for space x time interaction in extension of a model with only main effects. Each type implies a certain degree of prior dependence for the interaction parameters, and corresponds to the product of one of the two spatial with one of the two temporal main effects. The methodology is illustrated by an analysis of Ohio lung cancer data 1968-1988 via Markov chain Monte Carlo simulation. We compare the fit and the complexity of several models with different types of interaction by means of quantities related to the posterior deviance. Our results confirm an epidemiological hypothesis about the temporal development of the association between urbanization and risk factors for cancer.

530 citations


Journal ArticleDOI
TL;DR: The combined use of these methods is recommended, and demonstrated in the context of two cancer-related examples which highlight a variety of the issues involved in the categorization of prognostic variables.
Abstract: Categorizing prognostic variables is essential for their use in clinical decision-making. Often a single cutpoint that stratifies patients into high-risk and low-risk categories is sought. These categories may be used for making treatment recommendations, determining study eligibility, or to control for varying patient prognoses in the design of a clinical trial. Methods used to categorize variables include: biological determination (most desirable but often unavailable); arbitrary selection of a cutpoint at the median value; graphical examination of the data for a threshold effect; and exploration of all observed values for the one which best separates the risk groups according to a chi-squared test. The last method, called the minimum p-value approach, involves multiple testing which inflates the type I error rates. Several methods for adjusting the inflated p-values have been proposed but remain infrequently used. Exploratory methods for categorization and the minimum p-value approach with its various p-value corrections are reviewed, and code for their easy implementation is provided. The combined use of these methods is recommended, and demonstrated in the context of two cancer-related examples which highlight a variety of the issues involved in the categorization of prognostic variables.
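
A minimal sketch of the minimum p-value approach (the data are simulated placeholders). Note that, as the abstract stresses, the smallest p-value found by this search is anti-conservative and must be corrected before it is interpreted.

import numpy as np
from scipy.stats import chi2_contingency

def min_p_cutpoint(marker, event):
    """Scan observed marker values as cutpoints and return the cutpoint giving
    the smallest chi-squared p-value for the 2x2 table of risk group by event."""
    best_p, best_cut = 1.0, None
    for cut in np.unique(marker)[1:-1]:                  # keep both risk groups non-empty
        high = marker >= cut
        table = np.array([[np.sum(high & (event == 1)), np.sum(high & (event == 0))],
                          [np.sum(~high & (event == 1)), np.sum(~high & (event == 0))]])
        p = chi2_contingency(table)[1]
        if p < best_p:
            best_p, best_cut = p, cut
    return best_cut, best_p                              # best_p is NOT a valid p-value as it stands

rng = np.random.default_rng(7)
marker = rng.normal(size=200)
event = rng.binomial(1, 1 / (1 + np.exp(-1.5 * (marker - 0.3))))
print(min_p_cutpoint(marker, event))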

335 citations


Journal ArticleDOI
TL;DR: This work studies 125 meta-analyses representative of those performed by clinical investigators to examine empirically how assessment of treatment effect and heterogeneity may differ when different methods are utilized, and presents two exceptions to the general patterns observed, arising from the weights assigned to individual trial estimates.
Abstract: For meta-analysis, substantial uncertainty remains about the most appropriate statistical methods for combining the results of separate trials. An important issue for meta-analysis is how to incorporate heterogeneity, defined as variation among the results of individual trials beyond that expected from chance, into summary estimates of treatment effect. Another consideration is which 'metric' to use to measure treatment effect; for trials with binary outcomes, there are several possible metrics, including the odds ratio (a relative measure) and risk difference (an absolute measure). To examine empirically how assessment of treatment effect and heterogeneity may differ when different methods are utilized, we studied 125 meta-analyses representative of those performed by clinical investigators. There was no meta-analysis in which the summary risk difference and odds ratio were discrepant to the extent that one indicated significant benefit while the other indicated significant harm. Further, for most meta-analyses, summary odds ratios and risk differences agreed in statistical significance, leading to similar conclusions about whether treatments affected outcome. Heterogeneity was common regardless of whether treatment effects were measured by odds ratios or risk differences. However, risk differences usually displayed more heterogeneity than odds ratios. Random effects estimates, which incorporate heterogeneity, tended to be less precisely estimated than fixed effects estimates. We present two exceptions to these observations, which derive from the weights assigned to individual trial estimates. We discuss the implications of these findings for selection of a metric for meta-analysis and incorporation of heterogeneity into summary estimates. Published in 2000 by John Wiley & Sons, Ltd.
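
The fixed effect versus random effects contrast described above can be sketched with the usual inverse-variance and DerSimonian-Laird estimators on the log odds ratio scale; the per-trial estimates and variances below are invented for illustration.

import numpy as np

def fixed_and_random_effects(y, v):
    """Inverse-variance fixed effect and DerSimonian-Laird random effects
    summaries for trial estimates y with within-trial variances v."""
    w = 1.0 / v
    fe = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - fe) ** 2)                            # Cochran's heterogeneity statistic
    tau2 = max(0.0, (Q - (len(y) - 1)) / (w.sum() - (w ** 2).sum() / w.sum()))
    w_re = 1.0 / (v + tau2)                                  # weights incorporating heterogeneity
    re = np.sum(w_re * y) / np.sum(w_re)
    return (fe, 1 / np.sqrt(w.sum())), (re, 1 / np.sqrt(w_re.sum())), tau2

log_or = np.array([-0.4, -0.1, -0.7, 0.1, -0.3])             # per-trial log odds ratios
var_log_or = np.array([0.04, 0.09, 0.06, 0.12, 0.05])
print(fixed_and_random_effects(log_or, var_log_or))          # the random effects summary has the wider SE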

333 citations


Journal ArticleDOI
TL;DR: The paper uses the key components of the Cox proportional hazards approach to determine which models are appropriate for recurrent event data, and concludes that PWP-GT and TT-R are useful models for analysing recurrent event data.
Abstract: Many extensions of survival models based on the Cox proportional hazards approach have been proposed to handle clustered or multiple event data. Of particular note are five Cox-based models for recurrent event data: Andersen and Gill (AG); Wei, Lin and Weissfeld (WLW); Prentice, Williams and Peterson, total time (PWP-CP) and gap time (PWP-GT); and Lee, Wei and Amato (LWA). Some authors have compared these models by observing differences that arise from fitting the models to real and simulated data. However, no attempt has been made to systematically identify the components of the models that are appropriate for recurrent event data. We propose a systematic way of characterizing such Cox-based models using four key components: risk intervals; baseline hazard; risk set, and correlation adjustment. From the definitions of risk interval and risk set there are conceptually seven such Cox-based models that are permissible, five of which are those previously identified. The two new variant models are termed the 'total time - restricted' (TT-R) and 'gap time - unrestricted' (GT-UR) models. The aim of the paper is to determine which models are appropriate for recurrent event data using the key components. The models are fitted to simulated data sets and to a data set of childhood recurrent infectious diseases. The LWA model is not appropriate for recurrent event data because it allows a subject to be at risk several times for the same event. The WLW model overestimates treatment effect and is not recommended. We conclude that PWP-GT and TT-R are useful models for analysing recurrent event data, providing answers to slightly different research questions. Further, applying a robust variance to any of these models does not adequately account for within-subject correlation.
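
The difference between total-time and gap-time risk intervals, one of the four key components above, is essentially a data-restructuring choice. A minimal sketch for a single hypothetical subject (pandas is used only for display; the column names are invented):

import pandas as pd

# one subject with recurrent events at days 20 and 55, censored at day 90
events = pd.DataFrame({"id": [1, 1, 1],
                       "event_no": [1, 2, 3],
                       "stop_total": [20, 55, 90],           # days since study entry
                       "status": [1, 1, 0]})

# total-time (counting process) risk intervals: (previous stop, current stop]
events["start_total"] = events.groupby("id")["stop_total"].shift(fill_value=0)

# gap-time risk intervals: the clock resets to zero after each event
events["gap_time"] = events["stop_total"] - events["start_total"]

print(events[["id", "event_no", "start_total", "stop_total", "gap_time", "status"]])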

Journal ArticleDOI
TL;DR: This paper studies summary measures of the predictive power of a generalized linear model, paying special attention to a generalization of the multiple correlation coefficient from ordinary linear regression.
Abstract: This paper studies summary measures of the predictive power of a generalized linear model, paying special attention to a generalization of the multiple correlation coefficient from ordinary linear regression. The population value is the correlation between the response and its conditional expectation given the predictors, and the sample value is the correlation between the observed response and the model predicted value. We compare four estimators of the measure in terms of bias, mean squared error and behaviour in the presence of overparameterization. The sample estimator and a jack-knife estimator usually behave adequately, but a cross-validation estimator has a large negative bias with large mean squared error. One can use bootstrap methods to construct confidence intervals for the population value of the correlation measure and to estimate the degree to which a model selection procedure may provide an overly optimistic measure of the actual predictive power.
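
The sample version of the measure is simply the correlation between the observed response and the model-predicted value. A minimal sketch with a logistic model (statsmodels is used for illustration; the simulated data and coefficients are placeholders):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
X = sm.add_constant(rng.normal(size=(300, 3)))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0, 0.0, 0.6])))))

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
r = np.corrcoef(y, fit.fittedvalues)[0, 1]       # correlation between response and fitted value
print(round(r, 3), round(r ** 2, 3))             # r and its square, an R-squared-type summary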

Journal ArticleDOI
TL;DR: Several simple clinical examples show that the 100 log_e scale is the natural scale on which to express percentage differences and the term sympercent or s% is proposed for them.
Abstract: The results of analyses on log transformed data are usually back-transformed and interpreted on the original scale. Yet if natural logs are used this is not necessary--the log scale can be interpreted as it stands. A difference of natural logs corresponds to a fractional difference on the original scale. The agreement is exact if the fractional difference is based on the logarithmic mean. The transform y = 100 log_e(x) leads to differences, standard deviations and regression coefficients of y that are equivalent to symmetric percentage differences, standard deviations and regression coefficients of x. Several simple clinical examples show that the 100 log_e scale is the natural scale on which to express percentage differences. The term sympercent or s% is proposed for them. Sympercents should improve the presentation of log transformed data and lead to a wider understanding of the natural log transformation.
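
A small numerical illustration of the transform y = 100 log_e(x): a difference in y corresponds to a symmetric percentage (sympercent) difference on the original scale, sitting between the two one-way percentage changes. The numbers are arbitrary.

import numpy as np

def sympercent(x):
    """Cole's transform: 100 times the natural logarithm."""
    return 100.0 * np.log(x)

a, b = 120.0, 100.0
print(round(sympercent(a) - sympercent(b), 1))                     # ~= 18.2 s%
print(round(100 * (a / b - 1), 1), round(100 * (1 - b / a), 1))    # +20.0% and -16.7% one-way changes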

Journal ArticleDOI
TL;DR: There is no evidence so far that application of ANNs represents real progress in the field of diagnosis and prognosis in oncology, according to a search in the medical literature from 1991 to 1995.
Abstract: The application of artificial neural networks (ANNs) for prognostic and diagnostic classification in clinical medicine has become very popular. In particular, feed-forward neural networks have been used extensively, often accompanied by exaggerated statements of their potential. In this paper, the essentials of feed-forward neural networks and their statistical counterparts (that is, logistic regression models) are reviewed. We point out that the uncritical use of ANNs may lead to serious problems, such as the fitting of implausible functions to describe the probability of class membership and the underestimation of misclassification probabilities. In applications of ANNs to survival data, further difficulties arise. Finally, the results of a search in the medical literature from 1991 to 1995 on applications of ANNs in oncology and some important common mistakes are reported. It is concluded that there is no evidence so far that application of ANNs represents real progress in the field of diagnosis and prognosis in oncology.

Journal ArticleDOI
TL;DR: The potential of multilevel models for meta-analysis of trials with binary outcomes for both summary data, such as log-odds ratios, and individual patient data is explored, and the flexibility of multilevel modelling may be exploited in facilitating extensions to standard meta-analysis methods.
Abstract: In this paper we explore the potential of multilevel models for meta-analysis of trials with binary outcomes for both summary data, such as log-odds ratios, and individual patient data. Conventional fixed effect and random effects models are put into a multilevel model framework, which provides maximum likelihood or restricted maximum likelihood estimation. To exemplify the methods, we use the results from 22 trials to prevent respiratory tract infections; we also make comparisons with a second example data set comprising fewer trials. Within summary data methods, confidence intervals for the overall treatment effect and for the between-trial variance may be derived from likelihood based methods or a parametric bootstrap as well as from Wald methods; the bootstrap intervals are preferred because they relax the assumptions required by the other two methods. When modelling individual patient data, a bias corrected bootstrap may be used to provide unbiased estimation and correctly located confidence intervals; this method is particularly valuable for the between-trial variance. The trial effects may be modelled as either fixed or random within individual data models, and we discuss the corresponding assumptions and implications. If random trial effects are used, the covariance between these and the random treatment effects should be included; the resulting model is equivalent to a bivariate approach to meta-analysis. Having implemented these techniques, the flexibility of multilevel modelling may be exploited in facilitating extensions to standard meta-analysis methods.

Journal ArticleDOI
TL;DR: The main general results presented here show that the gamma-Poisson exchangeable model and the Besag, York and Mollié (BYM) model are most robust across a range of diverse models.
Abstract: The analysis of small area disease incidence has now developed to a degree where many methods have been proposed. However, there are few studies of the relative merits of the methods available. While many Bayesian models have been examined with respect to prior sensitivity, it is clear that wider comparisons of methods are largely missing from the literature. In this paper we present some preliminary results concerning the goodness-of-fit of a variety of disease mapping methods to simulated data for disease incidence derived from a range of models. These simulated models cover simple risk gradients to more complex true risk structures, including spatial correlation. The main general results presented here show that the gamma-Poisson exchangeable model and the Besag, York and Mollié (BYM) model are most robust across a range of diverse models. Mixture models are less robust. Non-parametric smoothing methods perform badly in general. Linear Bayes methods display behaviour similar to that of the gamma-Poisson methods. Copyright © 2000 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: A general proportional hazards model with random effects for handling clustered survival data is proposed by allowing a multivariate random effect with arbitrary design matrix in the log relative risk, in a way similar to the modelling of random effects in linear, generalized linear and non-linear mixed models.
Abstract: We propose a general proportional hazards model with random effects for handling clustered survival data. This generalizes the usual frailty model by allowing a multivariate random effect with arbitrary design matrix in the log relative risk, in a way similar to the modelling of random effects in linear, generalized linear and non-linear mixed models. The distribution of the random effects is generally assumed to be multivariate normal, but other (preferably symmetrical) distributions are also possible. Maximum likelihood estimates of the regression parameters, the variance components and the baseline hazard function are obtained via the EM algorithm. The E-step of the algorithm involves computation of the conditional expectations of functions of the random effects, for which we use Markov chain Monte Carlo (MCMC) methods. Approximate variances of the estimates are computed by Louis' formula, and posterior expectations and variances of the individual random effects can be obtained as a by-product of the estimation. The inference procedure is exemplified on two data sets.

Journal ArticleDOI
TL;DR: It is shown, using likelihood ratios and this graphic, that a test can be superior to a competitor in terms of predictive values while having either sensitivity or specificity smaller.
Abstract: The diagnostic abilities of two or more diagnostic tests are traditionally compared by their respective sensitivities and specificities, either separately or using a summary of them such as Youden's index. Several authors have argued that the likelihood ratios provide a more appropriate, if in practice a less intuitive, comparison. We present a simple graphic which incorporates all these measures and admits easily interpreted comparison of two or more diagnostic tests. We show, using likelihood ratios and this graphic, that a test can be superior to a competitor in terms of predictive values while having either sensitivity or specificity smaller. A decision theoretic basis for the interpretation of the graph is given by relating it to the tent graph of Hilden and Glasziou (Statistics in Medicine, 1996). Finally, a brief example comparing two serodiagnostic tests for Lyme disease is presented. Published in 2000 by John Wiley & Sons, Ltd.
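
A worked sketch of how sensitivity, specificity and prevalence combine into likelihood ratios and predictive values; the two hypothetical tests illustrate the abstract's point that a test can win on predictive values while losing on sensitivity.

def test_summary(sens, spec, prevalence):
    """Likelihood ratios and predictive values for a dichotomous diagnostic test."""
    lr_pos = sens / (1 - spec)                   # positive likelihood ratio
    lr_neg = (1 - sens) / spec                   # negative likelihood ratio
    pre_odds = prevalence / (1 - prevalence)
    post_odds_pos = pre_odds * lr_pos            # post-test odds after a positive result
    ppv = post_odds_pos / (1 + post_odds_pos)
    npv = 1 / (1 + pre_odds * lr_neg)
    return lr_pos, lr_neg, ppv, npv

print(test_summary(sens=0.95, spec=0.80, prevalence=0.10))   # test A: higher sensitivity
print(test_summary(sens=0.85, spec=0.95, prevalence=0.10))   # test B: lower sensitivity, yet higher PPV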

Journal ArticleDOI
TL;DR: Methods are sketched to perform validation through 'calibration', that is by embedding the literature model in a larger calibration model, for x-year survival probabilities, Cox regression and general non-proportional hazards models.
Abstract: The problem of assessing the validity and value of prognostic survival models presented in the literature for a particular population for which some data has been collected is discussed. Methods are sketched to perform validation through 'calibration', that is by embedding the literature model in a larger calibration model. This general approach is exemplified for x-year survival probabilities, Cox regression and general non-proportional hazards models. Some comments are made on basic structural changes to the model, described as 'revision'. Finally, general methods are discussed to combine models from different sources. The methods are illustrated with a model for non-Hodgkin's lymphoma validated on a Dutch data set.

Journal ArticleDOI
TL;DR: Some areas of medical statistics that have gained prominence over the last 5-10 years - meta-analysis, evidence-based medicine and cluster randomized trials - are reviewed, along with several issues relating to data analysis and interpretation, many relating to the use and misuse of hypothesis testing.
Abstract: I review some areas of medical statistics that have gained prominence over the last 5-10 years: meta-analysis, evidence-based medicine, and cluster randomized trials. I then consider several issues relating to data analysis and interpretation, many relating to the use and misuse of hypothesis testing, drawing on recent reviews of the use of statistics in medical journals. I also consider developments in the reporting of research in medical journals.

Journal ArticleDOI
TL;DR: This work models the time at which the rate of decline begins to accelerate in persons who develop dementia, relative to those who do not, by using a change point in a mixed linear model, and proposes a profile likelihood method to draw inferences about the change point.
Abstract: Dementia is characterized by accelerated cognitive decline before and after diagnosis as compared to normal ageing. Determining the time at which that rate of decline begins to accelerate in persons who will develop dementia is important both in describing the natural history of the disease process and in identifying the optimal time window for which treatments might be useful. We model that time at which the rate of decline begins to accelerate in persons who develop dementia relative to those who do not by using a change point in a mixed linear model. A profile likelihood method is proposed to draw inferences about the change point. The method is applied to data from the Bronx Ageing Study, a cohort study of 488 initially non-demented community-dwelling elderly individuals who have been examined at approximately 12-month intervals over 15 years. Cognitive function was measured using the Buschke Selective Reminding test, a memory test with high reliability and known discriminative validity for detecting dementia. We found that the rate of cognitive decline as measured by this test in this cohort increases on average 5.1 years before the diagnosis of dementia.

Journal ArticleDOI
TL;DR: It is proposed that one such model, a (possibly over-fitted) cubic smoothing spline, may be used to define a suitable reference curve against which the fit of a parametric model may be checked, and a significance test is suggested for the purpose.
Abstract: Low-dimensional parametric models are well understood, straightforward to communicate to other workers, have very smooth curves and may easily be checked for consistency with background scientific knowledge or understanding. They should therefore be ideal tools with which to represent smooth relationships between a continuous predictor and an outcome variable in medicine and epidemiology. Unfortunately, a seriously restricted set of such models is used routinely in practical data analysis - typically, linear, quadratic or occasionally cubic polynomials, or sometimes a power or logarithmic transformation of a covariate. Since their flexibility is limited, it is not surprising that the fit of such models is often poor. Royston and Altman's recent work on fractional polynomials has extended the range of available functions. It is clearly crucial that the chosen final model fits the data well. Achieving a good fit with minimal restriction on the functional form has been the motivation behind the major recent research effort on non-parametric curve-fitting techniques. Here I propose that one such model, a (possibly over-fitted) cubic smoothing spline, may be used to define a suitable reference curve against which the fit of a parametric model may be checked. I suggest a significance test for the purpose and examine its type I error and power in a small simulation study. Several families of parametric models, including some with sigmoid curves, are considered. Their suitability in fitting regression relationships found in several real data sets is investigated. With all the example data sets, a simple parametric model can be found which fits the data approximately as well as a cubic smoothing spline, but without the latter's tendency towards artefacts in the fitted curve.

Journal ArticleDOI
TL;DR: Four goodness-of-fit measures of a generalized linear model (GLM) are extended to random effects and marginal models for longitudinal data and satisfy the basic requirements for measures of association.
Abstract: This paper extends four goodness-of-fit measures of a generalized linear model (GLM) to random effects and marginal models for longitudinal data. The four measures are the proportional reduction in entropy measure, the proportional reduction in deviance measure, the concordance correlation coefficient and the concordance index. The extended measures satisfy the basic requirements for measures of association. Two examples illustrate their use in model selection.
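
One of the four measures, the concordance correlation coefficient, has a simple closed form; a minimal sketch with toy observed and fitted values (the basic sample formula, not the longitudinal extensions developed in the paper):

import numpy as np

def concordance_correlation(y, yhat):
    """Concordance correlation coefficient between observed and fitted values."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    sxy = np.cov(y, yhat, bias=True)[0, 1]
    return 2 * sxy / (y.var() + yhat.var() + (y.mean() - yhat.mean()) ** 2)

y = np.array([3.1, 4.0, 5.2, 6.1, 7.3])
yhat = np.array([3.0, 4.4, 5.0, 6.5, 7.0])
print(round(concordance_correlation(y, yhat), 3))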

Journal ArticleDOI
TL;DR: The subpopulation treatment effect pattern plot (STEPP) method is introduced, designed to facilitate the interpretation of estimates of treatment effect derived from different but potentially overlapping subsets of clinical trial data.
Abstract: We introduce the subpopulation treatment effect pattern plot (STEPP) method, designed to facilitate the interpretation of estimates of treatment effect derived from different but potentially overlapping subsets of clinical trial data. In particular, we consider sequences of subpopulations defined with respect to a covariate, and obtain confidence bands for the collection of treatment effects (here obtained from the Cox proportional hazards model) associated with the sequences. The method is aimed at determining whether the magnitude of the treatment effect changes as a function of the values of the covariate. We apply STEPP to a breast cancer clinical trial data set to evaluate the treatment effect as a function of the oestrogen receptor content of the primary tumour.

Journal ArticleDOI
TL;DR: Using the methods of modern psychometric theory (in addition to those of classical test theory), the performance of the Attention subscale of the Mattis Dementia Rating Scale was examined, along with bias in screening measures across education and ethnic and racial subgroups.
Abstract: Cognitive screening tests and items have been found to perform differently across groups that differ in terms of education, ethnicity and race. Despite the profound implications that such bias holds for studies in the epidemiology of dementia, little research has been conducted in this area. Using the methods of modern psychometric theory (in addition to those of classical test theory), we examined the performance of the Attention subscale of the Mattis Dementia Rating Scale. Several item response theory models, including the two- and three-parameter dichotomous response logistic model, as well as a polytomous response model were compared. (Log-likelihood ratio tests showed that the three-parameter model was not an improvement over the two-parameter model.) Data were collected as part of the ten-study National Institute on Aging Collaborative investigation of special dementia care in institutional settings. The subscale KR-20 estimate for this sample was 0.92. IRT model-based reliability estimates, provided at several points along the latent attribute, ranged from 0.65 to 0.97; the measure was least precise at the less disabled tail of the distribution. Most items performed in similar fashion across education groups; the item characteristic curves were almost identical, indicating little or no differential item functioning (DIF). However, four items were problematic. One item (digit span backwards) demonstrated a large error term in the confirmatory factor analysis; item-fit chi-square statistics developed using BIMAIN confirm this result for the IRT models. Further, the discrimination parameter for that item was low for all education subgroups. Generally, persons with the highest education had a greater probability of passing the item for most levels of theta. Model-based tests of DIF using MULTILOG identified three other items with significant, albeit small, DIF. One item, for example, showed non-uniform DIF in that at the impaired tail of the latent distribution, persons with higher education had a higher probability of correctly responding to the item than did lower education groups, but at less impaired levels, they had a lower probability of a correct response than did lower education groups. Another method of detection identified this item as having DIF (unsigned area statistic=3.05, p<0.01, and 2.96, p<0.01). On average, across the entire score range, the lower education group's probability of answering the item correctly was 0.11 higher than the higher education group's probability. A cross-validation with larger subgroups confirmed the overall result of little DIF for this measure. The methods used for detecting differential item functioning (which may, in turn, be indicative of bias) were applied to a neuropsychological subtest. These methods have been used previously to examine bias in screening measures across education and ethnic and racial subgroups. In addition to the important epidemiological applications of ensuring that screening measures and neuropsychological tests used in diagnoses are free of bias so that more culture-fair classifications will result, these methods are also useful for the examination of site differences in large multi-site clinical trials. It is recommended that these methods receive wider attention in the medical statistical literature.
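
A minimal sketch of the two-parameter logistic item characteristic curve and the kind of between-group comparison that underlies a DIF check; the item parameters for the two hypothetical education groups are invented, and no fitting (as done with MULTILOG or BIMAIN in the study) is attempted here.

import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve: probability of a
    correct response at latent trait level theta (a = discrimination, b = difficulty)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
p_low_edu = icc_2pl(theta, a=1.2, b=0.0)         # hypothetical parameters, lower-education group
p_high_edu = icc_2pl(theta, a=1.2, b=-0.4)       # same discrimination, easier for higher education
for t, pl, ph in zip(theta, p_low_edu, p_high_edu):
    print(f"theta={t:+.1f}  low-edu={pl:.2f}  high-edu={ph:.2f}")   # a constant gap = uniform DIF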

Journal ArticleDOI
TL;DR: Three procedures are considered, based on the distribution of the Z-scores (standardized residuals) from a model and on Pearson chi-squared statistics for observed and expected counts in groups defined by age and the estimated reference centile curves.
Abstract: The age-specific reference interval is a commonly used screening tool in medicine. It involves estimation of extreme quantile curves (such as the 5th and 95th centiles) of a reference distribution of clinically normal individuals. It is crucial that models used to estimate such intervals fit the data extremely well. However, few procedures to assess goodness-of-fit have been proposed in the literature, and even fewer have been evaluated systematically. Here we consider procedures based on the distribution of the Z-scores (standardized residuals) from a model and on Pearson chi-squared statistics for observed and expected counts in groups defined by age and the estimated reference centile curves. Two of the procedures (Q and grid tests) are mainly inferential, whereas the third (permutation bands and B-tests) is essentially graphical. We obtain approximations to the null distributions of several relevant test statistics and examine their size and power for a range of models based on real data sets. We recommend Q-tests in all situations where Z-scores are available since they are general, simple to calculate and usually have the highest power among the three classes of test considered. For the cases considered the grid tests are always inferior to the Q- and B-tests.

Journal ArticleDOI
TL;DR: The relationship among identifiability, Bayesian learning and MCMC convergence rates is investigated for a common class of spatial models, in order to provide guidance for prior selection and algorithm tuning.
Abstract: The marked increase in popularity of Bayesian methods in statistical practice over the last decade owes much to the simultaneous development of Markov chain Monte Carlo (MCMC) methods for the evaluation of requisite posterior distributions. However, along with this increase in computing power has come the temptation to fit models larger than the data can readily support, meaning that often the propriety of the posterior distributions for certain parameters depends on the propriety of the associated prior distributions. An important example arises in spatial modelling, wherein separate random effects for capturing unstructured heterogeneity and spatial clustering are of substantive interest, even though only their sum is well identified by the data. Increasing the informative content of the associated prior distributions offers an obvious remedy, but one that hampers parameter interpretability and may also significantly slow the convergence of the MCMC algorithm. In this paper we investigate the relationship among identifiability, Bayesian learning and MCMC convergence rates for a common class of spatial models, in order to provide guidance for prior selection and algorithm tuning. We are able to elucidate the key issues with relatively simple examples, and also illustrate the varying impacts of covariates, outliers and algorithm starting values on the resulting algorithms and posterior distributions.


Journal ArticleDOI
TL;DR: A macro written in the SAS macro language produces several estimates of disease incidence for use in the analysis of prospective cohort data; its use is illustrated with Alzheimer's Disease incidence data collected in the Framingham Study.
Abstract: The incidence of disease is estimated in medical and public health applications using various different techniques presented in the statistical and epidemiologic literature. Many of these methods have not yet made their way to popular statistical software packages and their application requires custom programming. We present a macro written in the SAS macro language that produces several estimates of disease incidence for use in the analysis of prospective cohort data. The development of the Practical Incidence Estimators (PIE) Macro was motivated by research in Alzheimer's Disease (AD) in the Framingham Study in which the development of AD has been prospectively assessed over an observation period of 24 years. The PIE Macro produces crude and age-specific incidence rates, overall and stratified by the levels of a grouping variable. In addition, it produces age-adjusted rates using direct standardization to the combined group. The user specifies the width of the age groups and the number of levels of the grouping variable. The PIE macro produces estimates of future risk for user-defined time periods and the remaining lifetime risk conditional on survival event-free to user-specified ages. This allows the user to investigate the impact of increasing age on the estimate of remaining lifetime risk of disease. In each case, the macro provides estimates based on traditional unadjusted cumulative incidence, and on cumulative incidence adjusted for the competing risk of death. These estimates and their respective standard errors, are provided in table form and in an output data set for graphing. The macro is designed for use with survival age as the time variable, and with age at entry into the study as the left-truncation variable; however, calendar time can be substituted for the survival time variable and the left-truncation variable can simply be set to zero. We illustrate the use of the PIE macro using Alzheimer's Disease incidence data collected in the Framingham Study.
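
A minimal sketch of crude and age-specific incidence rates from event counts and person-time, the simplest of the quantities the macro reports (this is not the SAS macro itself; the table and column names are invented, and the age-standardized and competing-risk-adjusted estimates need further steps).

import pandas as pd

cohort = pd.DataFrame({
    "age_group":    ["65-74", "65-74", "75-84", "75-84", "85+"],
    "events":       [4,        6,        12,       15,       9],
    "person_years": [2100.0,   1900.0,   1400.0,   1300.0,   450.0],
})

by_age = cohort.groupby("age_group", as_index=False).sum(numeric_only=True)
by_age["rate_per_1000py"] = 1000 * by_age["events"] / by_age["person_years"]
crude = 1000 * cohort["events"].sum() / cohort["person_years"].sum()

print(by_age)                                            # age-specific rates per 1000 person-years
print(f"crude incidence: {crude:.1f} per 1000 person-years")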

Journal ArticleDOI
TL;DR: Under the common correlation model for dichotomous data, 95 per cent lower confidence bounds constructed using four asymptotic variance expressions are evaluated by exact computation rather than simulation.
Abstract: Cohen's kappa statistic is a very well known measure of agreement between two raters with respect to a dichotomous outcome. Several expressions for its asymptotic variance have been derived and the normal approximation to its distribution has been used to construct confidence intervals. However, information on the accuracy of these normal-approximation confidence intervals is not comprehensive. Under the common correlation model for dichotomous data, we evaluate 95 per cent lower confidence bounds constructed using four asymptotic variance expressions. Exact computation, rather than simulation is employed. Specific conditions under which the use of asymptotic variance formulae is reasonable are determined.
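
Cohen's kappa itself is straightforward to compute; the sketch below pairs it with a bootstrap interval as a generic (and slower) alternative to the asymptotic variance formulae evaluated in the paper. The simulated ratings are placeholders.

import numpy as np

def cohen_kappa(r1, r2):
    """Cohen's kappa for two raters and a dichotomous (0/1) outcome."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    po = np.mean(r1 == r2)                               # observed agreement
    p1, p2 = r1.mean(), r2.mean()
    pe = p1 * p2 + (1 - p1) * (1 - p2)                   # agreement expected by chance
    return (po - pe) / (1 - pe)

rng = np.random.default_rng(9)
truth = rng.binomial(1, 0.3, size=200)
rater1 = np.where(rng.random(200) < 0.9, truth, 1 - truth)   # two imperfect raters
rater2 = np.where(rng.random(200) < 0.9, truth, 1 - truth)
kappa = cohen_kappa(rater1, rater2)
boots = [cohen_kappa(rater1[idx], rater2[idx])
         for idx in (rng.integers(0, 200, 200) for _ in range(2000))]
print(round(kappa, 2), np.round(np.quantile(boots, [0.025, 0.975]), 2))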