
Showing papers in "Statistics in Medicine in 2002"


Journal ArticleDOI
TL;DR: It is concluded that H and I2, which can usually be calculated for published meta-analyses, are particularly useful summaries of the impact of heterogeneity, and one or both should be presented in published meta-analyses in preference to the test for heterogeneity.
Abstract: The extent of heterogeneity in a meta-analysis partly determines the difficulty in drawing overall conclusions. This extent may be measured by estimating a between-study variance, but interpretation is then specific to a particular treatment effect metric. A test for the existence of heterogeneity exists, but depends on the number of studies in the meta-analysis. We develop measures of the impact of heterogeneity on a meta-analysis, from mathematical criteria, that are independent of the number of studies and the treatment effect metric. We derive and propose three suitable statistics: H is the square root of the chi2 heterogeneity statistic divided by its degrees of freedom; R is the ratio of the standard error of the underlying mean from a random effects meta-analysis to the standard error of a fixed effect meta-analytic estimate, and I2 is a transformation of H that describes the proportion of total variation in study estimates that is due to heterogeneity. We discuss interpretation, interval estimates and other properties of these measures and examine them in five example data sets showing different amounts of heterogeneity. We conclude that H and I2, which can usually be calculated for published meta-analyses, are particularly useful summaries of the impact of heterogeneity. One or both should be presented in published meta-analyses in preference to the test for heterogeneity.
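Both measures follow directly from Cochran's Q statistic and the number of studies; a minimal sketch of the calculation (function name is illustrative, not from the paper):

```python
import math

def heterogeneity_measures(Q, k):
    """H and I^2 from Cochran's Q heterogeneity statistic for k studies."""
    df = k - 1
    H = math.sqrt(Q / df)                    # H = sqrt(Q / df)
    I2 = max(0.0, (Q - df) / Q) * 100.0      # I^2 (%) = (Q - df)/Q, truncated at 0
    return H, I2

# Example: Q = 20 across 11 studies (df = 10) gives H ~ 1.41 and I^2 = 50%.
H, I2 = heterogeneity_measures(20.0, 11)
```

Unlike the chi-squared test, neither quantity depends on the number of studies except through the degrees of freedom, which is what makes them comparable across meta-analyses.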

25,460 citations


Journal ArticleDOI
TL;DR: The examples considered in this paper show the tension between the scientific rationale for using meta-regression and the difficult interpretative problems to which such analyses are prone.
Abstract: Appropriate methods for meta-regression applied to a set of clinical trials, and the limitations and pitfalls in interpretation, are insufficiently recognized. Here we summarize recent research focusing on these issues, and consider three published examples of meta-regression in the light of this work. One principal methodological issue is that meta-regression should be weighted to take account of both within-trial variances of treatment effects and the residual between-trial heterogeneity (that is, heterogeneity not explained by the covariates in the regression). This corresponds to random effects meta-regression. The associations derived from meta-regressions are observational, and have a weaker interpretation than the causal relationships derived from randomized comparisons. This applies particularly when averages of patient characteristics in each trial are used as covariates in the regression. Data dredging is the main pitfall in reaching reliable conclusions from meta-regression. It can only be avoided by prespecification of covariates that will be investigated as potential sources of heterogeneity. However, in practice this is not always easy to achieve. The examples considered in this paper show the tension between the scientific rationale for using meta-regression and the difficult interpretative problems to which such analyses are prone. Copyright © 2002 John Wiley & Sons, Ltd.
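Once the residual between-trial variance tau^2 has been estimated (for example by a moment or REML estimator), the random-effects weighting described above reduces to weighted least squares with weights 1/(v_i + tau^2); a hedged sketch assuming tau^2 is supplied:

```python
import numpy as np

def re_meta_regression(y, v, X, tau2):
    """Random-effects meta-regression point estimates by weighted least squares.

    y: trial effect estimates; v: their within-trial variances;
    X: design matrix (first column of ones); tau2: residual between-trial
    variance, assumed estimated elsewhere."""
    y, X = np.asarray(y, float), np.asarray(X, float)
    w = 1.0 / (np.asarray(v, float) + tau2)   # weights 1/(v_i + tau^2)
    Xw = X * w[:, None]
    XtWX = Xw.T @ X
    beta = np.linalg.solve(XtWX, Xw.T @ y)    # (X'WX)^{-1} X'Wy
    cov = np.linalg.inv(XtWX)                 # model-based covariance of beta
    return beta, cov
```

Setting tau2 = 0 recovers fixed-effect meta-regression; a positive tau2 flattens the weights, giving small trials relatively more influence.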

2,486 citations


Journal ArticleDOI
TL;DR: A procedure by Firth originally developed to reduce the bias of maximum likelihood estimates is shown to provide an ideal solution to separation and produces finite parameter estimates by means of penalized maximum likelihood estimation.
Abstract: The phenomenon of separation or monotone likelihood is observed in the fitting process of a logistic model if the likelihood converges while at least one parameter estimate diverges to +/- infinity. Separation primarily occurs in small samples with several unbalanced and highly predictive risk factors. A procedure by Firth originally developed to reduce the bias of maximum likelihood estimates is shown to provide an ideal solution to separation. It produces finite parameter estimates by means of penalized maximum likelihood estimation. Corresponding Wald tests and confidence intervals are available but it is shown that penalized likelihood ratio tests and profile penalized likelihood confidence intervals are often preferable. The clear advantage of the procedure over previous options of analysis is impressively demonstrated by the statistical analysis of two cancer studies.
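For logistic regression, Firth's penalty modifies the score equations using the hat-matrix diagonal, which keeps the estimates finite even under complete separation. A rough sketch of the penalized Newton iteration (a simplified illustration; production implementations such as the logistf approach add further safeguards):

```python
import numpy as np

def _penalized_loglik(X, y, beta):
    """Log-likelihood plus Firth penalty 0.5 * log|I(beta)|."""
    eta = X @ beta
    pi = 1.0 / (1.0 + np.exp(-eta))
    I = (X.T * (pi * (1.0 - pi))) @ X
    sign, logdet = np.linalg.slogdet(I)
    return np.sum(y * eta - np.log1p(np.exp(eta))) + 0.5 * logdet

def firth_logistic(X, y, max_iter=50, tol=1e-8):
    """Penalized ML for logistic regression; X must include an intercept column."""
    n, p = X.shape
    beta = np.zeros(p)
    ll = _penalized_loglik(X, y, beta)
    for _ in range(max_iter):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))
        W = pi * (1.0 - pi)
        I = (X.T * W) @ X                                  # Fisher information
        Xs = X * np.sqrt(W)[:, None]
        h = np.sum((Xs @ np.linalg.inv(I)) * Xs, axis=1)   # hat-matrix diagonal
        U = X.T @ (y - pi + h * (0.5 - pi))                # Firth-modified score
        delta = np.linalg.solve(I, U)
        step = 1.0
        while True:   # step-halving so the penalized log-likelihood never drops
            new_beta = beta + step * delta
            new_ll = _penalized_loglik(X, y, new_beta)
            if new_ll >= ll or step < 1e-6:
                break
            step /= 2.0
        beta, ll = new_beta, new_ll
        if np.max(np.abs(step * delta)) < tol:
            break
    return beta
```

On a separated data set, where ordinary maximum likelihood would push the slope to infinity, this iteration converges to finite values.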

1,628 citations


Journal ArticleDOI
TL;DR: This tutorial on advanced statistical methods for meta-analysis can be seen as a sequel to the recent Tutorial in Biostatistics on meta-analysis by Normand, which focused on elementary methods.
Abstract: This tutorial on advanced statistical methods for meta-analysis can be seen as a sequel to the recent Tutorial in Biostatistics on meta-analysis by Normand, which focused on elementary methods. Within the framework of the general linear mixed model using approximate likelihood, we discuss methods to analyse univariate as well as bivariate treatment effects in meta-analyses as well as meta-regression methods. Several extensions of the models are discussed, like exact likelihood, non-normal mixtures and multiple endpoints. We end with a discussion about the use of Bayesian methods in meta-analysis. All methods are illustrated by a meta-analysis concerning the efficacy of BCG vaccine against tuberculosis. All analyses that use approximate likelihood can be carried out by standard software. We demonstrate how the models can be fitted using SAS Proc Mixed.

1,417 citations


Journal ArticleDOI
TL;DR: Extensions of the Weibull and log-logistic models are proposed in which natural cubic splines are used to smooth the baseline log cumulative hazard and log cumulative odds of failure functions and a hypothesis test of the appropriateness of the scale chosen for covariate effects (such as of treatment) is proposed.
Abstract: Modelling of censored survival data is almost always done by Cox proportional-hazards regression. However, use of parametric models for such data may have some advantages. For example, non-proportional hazards, a potential difficulty with Cox models, may sometimes be handled in a simple way, and visualization of the hazard function is much easier. Extensions of the Weibull and log-logistic models are proposed in which natural cubic splines are used to smooth the baseline log cumulative hazard and log cumulative odds of failure functions. Further extensions to allow non-proportional effects of some or all of the covariates are introduced. A hypothesis test of the appropriateness of the scale chosen for covariate effects (such as of treatment) is proposed. The new models are applied to two data sets in cancer. The results throw interesting light on the behaviour of both the hazard function and the hazard ratio over time. The tools described here may be a step towards providing greater insight into the natural history of the disease and into possible underlying causes of clinical events. We illustrate these aspects by using the two examples in cancer.

1,142 citations


Journal ArticleDOI
TL;DR: Methods for assessing the relative effectiveness of two treatments when they have not been compared directly in a randomized trial but have each been compared to other treatments are presented.
Abstract: I present methods for assessing the relative effectiveness of two treatments when they have not been compared directly in a randomized trial but have each been compared to other treatments. These network meta-analysis techniques allow estimation of both heterogeneity in the effect of any given treatment and inconsistency ('incoherence') in the evidence from different pairs of treatments. A simple estimation procedure using linear mixed models is given and used in a meta-analysis of treatments for acute myocardial infarction.
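In the simplest network, an adjusted indirect comparison of treatments A and C through a common comparator B combines the two direct estimates; a minimal sketch of that basic Bucher-type calculation (not the paper's full mixed-model procedure, which also handles heterogeneity and incoherence):

```python
import math

def indirect_comparison(d_AB, se_AB, d_CB, se_CB):
    """Indirect estimate of A vs C from direct A-vs-B and C-vs-B contrasts.

    d_AB, d_CB: effect estimates (e.g. log odds ratios) versus the common
    comparator B; the indirect variance is the sum of the two variances."""
    d_AC = d_AB - d_CB
    se_AC = math.sqrt(se_AB ** 2 + se_CB ** 2)
    return d_AC, se_AC
```

The variance sum makes explicit why indirect evidence is weaker than a head-to-head trial: the indirect standard error is always larger than either direct one.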

1,049 citations


Journal ArticleDOI
TL;DR: Key issues include: the overuse and overinterpretation of subgroup analyses; the underuse of appropriate statistical tests for interaction; inconsistencies in the use of covariate-adjustment; the lack of clear guidelines on covariate selection; the overuse of baseline comparisons in some studies; the misuse of significance tests for baseline comparability; and the need for trials to have a predefined statistical analysis plan.
Abstract: Clinical trial investigators often record a great deal of baseline data on each patient at randomization. When reporting the trial's findings, such baseline data can be used for (i) subgroup analyses which explore whether there is evidence that the treatment difference depends on certain patient characteristics, (ii) covariate-adjusted analyses which aim to refine the analysis of the overall treatment difference by taking account of the fact that some baseline characteristics are related to outcome and may be unbalanced between treatment groups, and (iii) baseline comparisons which compare the baseline characteristics of patients in each treatment group for any possible (unlucky) differences. This paper examines how these issues are currently tackled in the medical journals, based on a recent survey of 50 trial reports in four major journals. The statistical ramifications are explored, major problems are highlighted and recommendations for future practice are proposed. Key issues include: the overuse and overinterpretation of subgroup analyses; the underuse of appropriate statistical tests for interaction; inconsistencies in the use of covariate-adjustment; the lack of clear guidelines on covariate selection; the overuse of baseline comparisons in some studies; the misuse of significance tests for baseline comparability; and the need for trials to have a predefined statistical analysis plan for all these uses of baseline data.

980 citations


Journal ArticleDOI
TL;DR: A method is developed to calculate the approximate number of subjects required to obtain an exact confidence interval of desired width for certain types of intraclass correlations in one-way and two-way ANOVA models.
Abstract: A method is developed to calculate the approximate number of subjects required to obtain an exact confidence interval of desired width for certain types of intraclass correlations in one-way and two-way ANOVA models. The sample size approximation is shown to be very accurate.

648 citations


Journal ArticleDOI
TL;DR: The non‐inferiority trial is appropriate for evaluation of the efficacy of an experimental treatment versus an active control when it is hypothesized that the experimental treatment may not be superior to a proven effective treatment, but is clinically and statistically not inferior in effectiveness.
Abstract: Placebo-controlled trials are the ideal for evaluating medical treatment efficacy. They allow for control of the placebo effect and are most efficient, requiring the smallest numbers of patients to detect a treatment effect. A placebo control is ethically justified if no standard treatment exists, if the standard treatment has not been proven efficacious, there are no risks associated with delaying treatment or escape clauses are included in the protocol. Where possible and justified, they should be the first choice for medical treatment evaluation. Given the large number of proven effective treatments, placebo-controlled trials are often unethical. In these situations active-controlled trials are generally appropriate. The non-inferiority trial is appropriate for evaluation of the efficacy of an experimental treatment versus an active control when it is hypothesized that the experimental treatment may not be superior to a proven effective treatment, but is clinically and statistically not inferior in effectiveness. These trials are not easy to design. An active control must be selected. Good historical placebo-controlled trials documenting the efficacy of the active control must exist. From these historical trials statistical analysis must be performed and clinical judgement applied in order to determine the non-inferiority margin M and to assess assay sensitivity. The latter refers to establishing that the active drug would be superior to the placebo in the setting of the present non-inferiority trial (that is, the constancy assumption). Further, a putative placebo analysis of the new treatment versus the placebo using data from the non-inferiority trial and the historical active versus placebo-controlled trials is needed. Useable placebo-controlled historical trials for the active control are often not available, and determination of assay sensitivity and an appropriate M is difficult and debatable. 
Serious consideration of expansions of, and alternatives to, non-inferiority trials is needed.
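The final decision in a non-inferiority analysis is often expressed through a confidence-interval rule; a toy sketch, assuming a difference scale on which higher is better and a two-sided 95 per cent interval:

```python
def non_inferior(diff, se, margin, z=1.96):
    """Crude non-inferiority check.

    diff: estimated (experimental - active control) effect, higher = better;
    se: its standard error; margin: pre-specified non-inferiority margin M > 0.
    Non-inferiority is claimed if the lower CI bound exceeds -M."""
    lower = diff - z * se
    return lower > -margin
```

The hard part, as the abstract stresses, is not this comparison but justifying M and assay sensitivity from historical placebo-controlled trials; the rule itself is mechanical.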

638 citations


Journal ArticleDOI
TL;DR: It is shown that AUC is maximized when the study odds ratios are homogeneous, and that it is quite robust to heterogeneity, and its standard error is derived for homogeneous studies and shown to be a reasonable approximation with heterogeneous studies.
Abstract: The summary receiver operating characteristic (SROC) curve has been recommended to represent the performance of a diagnostic test, based on data from a meta-analysis. However, little is known about the basic properties of the SROC curve or its estimate. In this paper, the position of the SROC curve is characterized in terms of the overall diagnostic odds ratio and the magnitude of inter-study heterogeneity in the odds ratio. The area under the curve (AUC) and an index Q* are discussed as potentially useful summaries of the curve. It is shown that AUC is maximized when the study odds ratios are homogeneous, and that it is quite robust to heterogeneity. An upper bound is derived for AUC based on an exact analytic expression for the homogeneous situation, and a lower bound based on the limit case Q*, defined by the point where sensitivity equals specificity: Q* is invariant to heterogeneity. The standard error of AUC is derived for homogeneous studies, and shown to be a reasonable approximation with heterogeneous studies. The expressions for AUC and its standard error are easily computed in the homogeneous case, and avoid the need for numerical integration in the more general case. SE(AUC) and SE(Q*) are found to be numerically close, with SE(Q*) being larger if the odds ratio is very large. The methods are illustrated using data for the Pap smear screening test for cervical cancer, and for three tests for the diagnosis of metastases in cervical cancer patients.
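In the homogeneous case (constant diagnostic odds ratio, symmetric SROC), both AUC and Q* have simple closed forms; a sketch obtained by integrating the curve TPR = DOR·FPR / (1 + (DOR-1)·FPR):

```python
import math

def sroc_auc(dor):
    """AUC of the symmetric SROC curve for a constant diagnostic odds ratio.

    Integrating TPR = dor*FPR / (1 + (dor-1)*FPR) over FPR in [0, 1] gives
    AUC = dor*(dor - 1 - ln(dor)) / (dor - 1)^2, with limit 0.5 at dor = 1."""
    if dor == 1.0:
        return 0.5
    return dor * (dor - 1.0 - math.log(dor)) / (dor - 1.0) ** 2

def q_star(dor):
    """Point where sensitivity equals specificity on the symmetric SROC."""
    r = math.sqrt(dor)
    return r / (1.0 + r)
```

An uninformative test (DOR = 1) gives AUC = Q* = 0.5, and both summaries increase monotonically with the odds ratio.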

606 citations


Journal ArticleDOI
TL;DR: Comparisons of required sample sizes indicate that the family-based designs (case-sibling and case-parent) generally require fewer matched sets than the case-control design to achieve the same power for detecting a GxE interaction.
Abstract: Consideration of gene-environment (GxE) interaction is becoming increasingly important in the design of new epidemiologic studies. We present a method for computing required sample size or power to detect GxE interaction in the context of three specific designs: the standard matched case-control; the case-sibling, and the case-parent designs. The method is based on computation of the expected value of the likelihood ratio test statistic, assuming that the data will be analysed using conditional logistic regression. Comparisons of required sample sizes indicate that the family-based designs (case-sibling and case-parent) generally require fewer matched sets than the case-control design to achieve the same power for detecting a GxE interaction. The case-sibling design is most efficient when studying a dominant gene, while the case-parent design is preferred for a recessive gene. Methods are also presented for computing sample size when matched sets are obtained from a stratified population, for example, when the population consists of multiple ethnic groups. A software program that implements the method is freely available, and may be downloaded from the website http://hydra.usc.edu/gxe.

Journal ArticleDOI
TL;DR: This paper reviews and comments on selected methodologic issues raised by meta-analyses that synthesize evidence from cluster randomization trials, including problems of study heterogeneity, difficulties in estimating design effects from individual trials and the choice of statistical methods.
Abstract: Meta-analyses involving the synthesis of evidence from cluster randomization trials are being increasingly reported. These analyses raise challenging methodologic issues beyond those raised by meta-analyses which include only individually randomized trials. In this paper we review and comment on a selected number of these issues, including problems of study heterogeneity, difficulties in estimating design effects from individual trials and the choice of statistical methods.

Journal ArticleDOI
TL;DR: Evidence indicates that for interventions aimed at preventing an undesirable event, the greatest absolute benefits are observed in trials with the highest baseline event rates, corresponding to the model of constant RR(H); the choice of a summary statistic should be guided by both empirical evidence and clinically informed debate.
Abstract: Meta-analysis of binary data involves the computation of a weighted average of summary statistics calculated for each trial. The selection of the appropriate summary statistic is a subject of debate due to conflicts in the relative importance of mathematical properties and the ability to intuitively interpret results. This paper explores the process of identifying a summary statistic most likely to be consistent across trials when there is variation in control group event rates. Four summary statistics are considered: odds ratios (OR); risk differences (RD) and risk ratios of beneficial (RR(B)); and harmful outcomes (RR(H)). Each summary statistic corresponds to a different pattern of predicted absolute benefit of treatment with variation in baseline risk, the greatest difference in patterns of prediction being between RR(B) and RR(H). Selection of a summary statistic solely based on identification of the best-fitting model by comparing tests of heterogeneity is problematic, principally due to low numbers of trials. It is proposed that choice of a summary statistic should be guided by both empirical evidence and clinically informed debate as to which model is likely to be closest to the expected pattern of treatment benefit across baseline risks. Empirical investigations comparing the four summary statistics on a sample of 551 systematic reviews provide evidence that the RR and OR models are on average more consistent than RD, there being no difference on average between RR and OR. From a second sample of 114 meta-analyses evidence indicates that for interventions aimed at preventing an undesirable event, greatest absolute benefits are observed in trials with the highest baseline event rates, corresponding to the model of constant RR(H). The appropriate selection for a particular meta-analysis may depend on understanding reasons for variation in control group event rates; in some situations uncertainty about the choice of summary statistic will remain.
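From a single trial's 2x2 table, the candidate summary statistics reduce to a few basic quantities; a small sketch (here RR is computed for the harmful event, so the RR of the beneficial outcome would use the non-event risks instead):

```python
def summary_statistics(a, b, c, d):
    """OR, RR and RD from one trial's 2x2 table.

    a/b: events/non-events in the treatment arm; c/d: the control arm."""
    r_t = a / (a + b)                 # event risk, treatment arm
    r_c = c / (c + d)                 # event risk, control arm
    odds_ratio = (a * d) / (b * c)
    risk_ratio = r_t / r_c            # RR of the harmful event, RR(H)
    risk_diff = r_t - r_c
    return odds_ratio, risk_ratio, risk_diff
```

The abstract's point is visible here: each statistic implies a different pattern of absolute benefit as the control risk r_c varies, so which one is "constant" across trials is an empirical and clinical question, not a mathematical one.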

Journal ArticleDOI
TL;DR: Methods are described which improve upon a previously proposed method for estimating the log(HR) from survival curves and extend to life-tables.
Abstract: In a meta-analysis of randomized controlled trials with time-to-event outcomes, an aggregate data approach may be required for some or all included studies. Variation in the reporting of survival analyses in journals suggests that no single method for extracting the log(hazard ratio) estimate will suffice. Methods are described which improve upon a previously proposed method for estimating the log(HR) from survival curves. These methods extend to life-tables. In the situation where the treatment effect varies over time and the trials in the meta-analysis have different lengths of follow-up, heterogeneity may be evident. In order to assess whether the hazard ratio changes with time, several tests are proposed and compared. A cohort study comparing life expectancy of males and females with cerebral palsy and a systematic review of five trials comparing two anti-epileptic drugs, carbamazepine and sodium valproate, are used for illustration.
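When a trial reports only the hazard ratio and its confidence interval, one standard indirect extraction recovers ln(HR) and its standard error from the interval width; a minimal sketch of that single step (one of several extraction methods of this kind, not the survival-curve methods of the paper):

```python
import math

def log_hr_from_ci(hr, lower, upper, z=1.96):
    """ln(HR) and its SE from a reported hazard ratio and two-sided 95% CI.

    The CI is symmetric on the log scale, so SE = (ln(upper) - ln(lower)) / (2z)."""
    log_hr = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return log_hr, se
```

The resulting (log HR, SE) pair is exactly what an inverse-variance meta-analysis of aggregate time-to-event data consumes.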

Journal ArticleDOI
TL;DR: Kappa coefficients are measures of correlation between categorical variables often used as reliability or validity coefficients. The development and definitions of the K (categories) by M (ratings) kappas are recapitulated, and the use of the recommended kappas is illustrated with applications in medical research.
Abstract: Kappa coefficients are measures of correlation between categorical variables often used as reliability or validity coefficients. We recapitulate development and definitions of the K (categories) by M (ratings) kappas (K x M), discuss what they are well- or ill-designed to do, and summarize where kappas now stand with regard to their application in medical research. The 2 x M(M>/=2) intraclass kappa seems the ideal measure of binary reliability; a 2 x 2 weighted kappa is an excellent choice, though not a unique one, as a validity measure. For both the intraclass and weighted kappas, we address continuing problems with kappas. There are serious problems with using the K x M intraclass (K>2) or the various K x M weighted kappas for K>2 or M>2 in any context, either because they convey incomplete and possibly misleading information, or because other approaches are preferable to their use. We illustrate the use of the recommended kappas with applications in medical research.
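The unweighted two-rater kappa follows from observed versus chance-expected agreement; a compact sketch for a K x K cross-classification:

```python
import numpy as np

def cohens_kappa(table):
    """Unweighted Cohen's kappa from a K x K table of two raters' categories."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_obs = np.trace(table) / n                             # observed agreement
    p_exp = (table.sum(axis=1) @ table.sum(axis=0)) / n**2  # chance agreement
    return (p_obs - p_exp) / (1.0 - p_exp)
```

For K > 2 categories the paper recommends caution: a single unweighted kappa can hide which categories are confused, which is part of its argument against routine use of the K x M coefficients.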

Journal ArticleDOI
TL;DR: Empirical support for the existence of different sources of variation is reviewed, and incorporation of sources of variability explicitly into systematic reviews on diagnostic accuracy is demonstrated with data from a recent review.
Abstract: It is indispensable for any meta-analysis that potential sources of heterogeneity are examined, before one considers pooling the results of primary studies into summary estimates with enhanced precision. In reviews of studies on the diagnostic accuracy of tests, variability beyond chance can be attributed to between-study differences in the selected cutpoint for positivity, in patient selection and clinical setting, in the type of test used, in the type of reference standard, or any combination of these factors. In addition, heterogeneity in study results can also be caused by flaws in study design. This paper critically examines some of the potential reasons for heterogeneity and the methods to explore them. Empirical support for the existence of different sources of variation is reviewed. Incorporation of sources of variability explicitly into systematic reviews on diagnostic accuracy is demonstrated with data from a recent review. Application of regression techniques in meta-analysis of diagnostic tests can provide relevant additional information. Results of such analyses will help understand problems with the transferability of diagnostic tests and to point out flaws in primary studies. As such, they can guide the design of future studies.

Journal ArticleDOI
TL;DR: This paper compares estimation procedures for the area under the receiver operating characteristic curve based on the Mann-Whitney statistic, kernel smoothing, normal assumptions and empirical transformations to normality, in terms of bias and root mean square error.
Abstract: The area under the receiver operating characteristic curve is frequently used as a measure for the effectiveness of diagnostic markers. In this paper we discuss and compare estimation procedures for this area. These are based on (i) the Mann-Whitney statistic; (ii) kernel smoothing; (iii) normal assumptions; (iv) empirical transformations to normality. These are compared in terms of bias and root mean square error in a large variety of situations by means of an extensive simulation study. Overall we find that transforming to normality usually is to be preferred except for bimodal cases where kernel methods can be effective.
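The Mann-Whitney estimator in (i) is simply the proportion of (diseased, non-diseased) pairs the marker ranks correctly, counting ties as half; a minimal sketch:

```python
import numpy as np

def auc_mann_whitney(x_neg, x_pos):
    """Nonparametric AUC estimate: P(X_pos > X_neg) + 0.5 * P(X_pos = X_neg)."""
    x_neg, x_pos = np.asarray(x_neg, float), np.asarray(x_pos, float)
    diffs = x_pos[:, None] - x_neg[None, :]      # all positive-negative pairs
    return ((diffs > 0).sum() + 0.5 * (diffs == 0).sum()) / diffs.size
```

This estimator is distribution-free, which is why the paper uses it as the baseline against which the smoothed and normality-based alternatives are compared.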

Journal ArticleDOI
TL;DR: It is found that treatment was significantly more effective among patients with elevated panel reactive antibodies than among patients without elevated PRA; the authors recommend using individual patient data, when feasible, to study patient characteristics, in order to avoid the potential for ecological bias introduced by group-level analyses.
Abstract: When performing a meta-analysis, interest often centres on finding explanations for heterogeneity in the data, rather than on producing a single summary estimate. Such exploratory analyses are frequently undertaken with published, study-level data, using techniques of meta-analytic regression. Our goal was to explore a real-world example for which both published, group-level and individual patient-level data were available, and to compare the substantive conclusions reached by both methods. We studied the benefits of anti-lymphocyte antibody induction therapy among renal transplant patients in five randomized trials, focusing on whether there are subgroups of patients in whom therapy might prove particularly beneficial. Allograft failure within 5 years was the endpoint studied. We used a variety of analytic approaches to the group-level data, including weighted least-squares regression (N=5 studies), logistic regression (N=628, the total number of subjects), and a hierarchical Bayesian approach. We fit logistic regression models to the patient-level data. In the patient-level analysis, we found that treatment was significantly more effective among patients with elevated (20 per cent or more) panel reactive antibodies (PRA) than among patients without elevated PRA. These patients comprise a small (about 15 per cent of patients) subgroup of patients that benefited from therapy. The group-level analyses failed to detect this interaction. We recommend using individual patient data, when feasible, to study patient characteristics, in order to avoid the potential for ecological bias introduced by group-level analyses.

Journal ArticleDOI
TL;DR: It is argued that the usual notion of product-moment correlation is well adapted in a test-retest situation, whereas the concept of intraclass correlation should be used for intrarater and interrater reliability.
Abstract: In this paper we review the problem of defining and estimating intrarater, interrater and test-retest reliability of continuous measurements. We argue that the usual notion of product-moment correlation is well adapted in a test-retest situation, whereas the concept of intraclass correlation should be used for intrarater and interrater reliability. The key difference between these two approaches is the treatment of systematic error, which is often due to a learning effect for test-retest data. We also consider the reliability of a sum and a difference of variables and illustrate the effects on components. Further, we compare these approaches of reliability with the concept of limits of agreement proposed by Bland and Altman (for evaluating the agreement between two methods of clinical measurements) and show how product-moment correlation is related to it. We then propose new kinds of limits of agreement which are related to intraclass correlation. A test battery to study the development of neuro-motor functions in children and adolescents illustrates our purpose throughout the paper.
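The one-way intraclass correlation advocated here for rater reliability is obtained directly from the ANOVA mean squares; a small sketch (msb and msw denote the between- and within-subject mean squares, with k ratings per subject):

```python
def icc_oneway(msb, msw, k):
    """One-way random-effects ICC from ANOVA mean squares.

    msb: between-subject mean square; msw: within-subject mean square;
    k: number of ratings per subject."""
    return (msb - msw) / (msb + (k - 1) * msw)
```

Unlike the product-moment correlation, this estimate is pulled down by systematic differences between raters, which is exactly the property the paper wants for intrarater and interrater reliability.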

Journal ArticleDOI
TL;DR: The proposed graphical method identifies trials that account for most of the heterogeneity without having to explore all possible sources of heterogeneity by subgroup analyses and can be applied to identify types of patients that explain heterogeneity in the treatment effect.
Abstract: Heterogeneity can be a major component of meta-analyses and by virtue of that fact warrants investigation. Classic analysis methods, such as meta-regression, are used to explore the sources of heterogeneity. However, it may be difficult to apply such a method in complex cases or in the absence of an a priori hypothesis. This paper presents a graphical method to identify trials, groups of trials or groups of patients that are sources of heterogeneity. The contribution of these trials to the overall result can also be evaluated with this method. Each trial is represented by a dot on a 2D graph. The X-axis represents the contribution of the trial to the overall Cochran Q-test for heterogeneity. The Y-axis represents the influence of the trial, defined as the standardized squared difference between the treatment effects estimated with and without the trial. This approach has been applied to data from the Meta-Analysis of Chemotherapy in Head and Neck Cancer (MACH-NC) comprising 10,850 patients in 65 randomized trials. The graphical method allowed us to identify trials that contributed considerably to the overall heterogeneity and had a strong influence on the overall result. It also provided useful information for the interpretation of heterogeneity in this meta-analysis. The proposed graphical method identifies trials that account for most of the heterogeneity without having to explore all possible sources of heterogeneity by subgroup analyses. This method can also be applied to identify types of patients that explain heterogeneity in the treatment effect.
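The X-axis quantity is each trial's share of Cochran's Q, and the Y-axis compares the pooled estimate with and without the trial; a simplified fixed-effect sketch of both quantities (the paper's influence measure is a standardized squared difference, which is omitted here):

```python
import numpy as np

def q_contributions(theta, v):
    """Each trial's contribution to Cochran's Q under a fixed-effect model.

    theta: trial effect estimates; v: their variances. Contributions sum to Q."""
    theta, w = np.asarray(theta, float), 1.0 / np.asarray(v, float)
    pooled = np.sum(w * theta) / np.sum(w)
    return w * (theta - pooled) ** 2

def influence(theta, v):
    """Shift in the pooled fixed-effect estimate when each trial is left out."""
    theta, w = np.asarray(theta, float), 1.0 / np.asarray(v, float)
    pooled = np.sum(w * theta) / np.sum(w)
    shifts = []
    for i in range(len(theta)):
        keep = np.arange(len(theta)) != i
        shifts.append(pooled - np.sum(w[keep] * theta[keep]) / np.sum(w[keep]))
    return np.array(shifts)
```

Plotting one quantity against the other reproduces the idea of the graph: trials in the upper-right corner both drive the heterogeneity and move the overall result.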

Journal ArticleDOI
TL;DR: This tutorial presents an overview of multilevel or hierarchical data modelling and its applications in medicine, describing the basic model for nested data and showing how it can be extended to fit flexible models for repeated measures data and more complex structures involving cross-classifications and multiple membership patterns.
Abstract: This tutorial presents an overview of multilevel or hierarchical data modelling and its applications in medicine. A description of the basic model for nested data is given and it is shown how this can be extended to fit flexible models for repeated measures data and more complex structures involving cross-classifications and multiple membership patterns within the software package MLwiN. A variety of response types are covered and both frequentist and Bayesian estimation methods are described.

Journal ArticleDOI
TL;DR: It is shown that both within- and between-meta-analysis heterogeneity may be of importance in the analysis of meta-epidemiological studies, and that confounding exists between the effects of publication status and trial quality.
Abstract: Biases in systematic reviews and meta-analyses may be examined in 'meta-epidemiological' studies, in which the influence of trial characteristics such as measures of study quality on treatment effect estimates is explored. Published studies to date have analysed data from collections of meta-analyses with binary outcomes, using logistic regression models that assume that there is no between- or within-meta-analysis heterogeneity. Using data from a study of publication bias (39 meta-analyses, 394 published and 88 unpublished trials) and language bias (29 meta-analyses, 297 English language trials and 52 non-English language trials), we compare results from logistic regression models, with and without robust standard errors to allow for clustering on meta-analysis, with results using a 'meta-meta-analytic' approach that can allow for between- and within-meta-analysis heterogeneity. We also consider how to allow for the confounding effects of different trial characteristics. We show that both within- and between-meta-analysis heterogeneity may be of importance in the analysis of meta-epidemiological studies, and that confounding exists between the effects of publication status and trial quality.

Journal ArticleDOI
TL;DR: An MSM for repeated measures is described that parameterizes the marginal means of counterfactual outcomes corresponding to prespecified treatment regimes and is used to estimate the effect of zidovudine therapy on mean CD4 count among HIV-infected men in the Multicenter AIDS Cohort Study.
Abstract: Even in the absence of unmeasured confounding factors or model misspecification, standard methods for estimating the causal effect of a time-varying treatment on the mean of a repeated measures outcome (for example, GEE regression) may be biased when there are time-dependent variables that are simultaneously confounders of the effect of interest and are predicted by previous treatment. In contrast, the recently developed marginal structural models (MSMs) can provide consistent estimates of causal effects when unmeasured confounding and model misspecification are absent. We describe an MSM for repeated measures that parameterizes the marginal means of counterfactual outcomes corresponding to prespecified treatment regimes. The parameters of MSMs are estimated using a new class of estimators - inverse-probability-of-treatment weighted estimators. We used an MSM to estimate the effect of zidovudine therapy on mean CD4 count among HIV-infected men in the Multicenter AIDS Cohort Study. We estimated a potential expected increase of 5.4 (95 per cent confidence interval -1.8, 12.7) CD4 lymphocytes/μl per additional study visit while on zidovudine therapy. We also explain the theory and implementation of MSMs for repeated measures data and draw upon a simple example to illustrate the basic ideas.
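The weighting idea behind these estimators can be sketched in a few lines. The tiny data set, the single binary confounder L and the stratum-proportion treatment model below are illustrative assumptions, far simpler than the time-varying treatment model used in the study:

```python
# Hedged sketch of inverse-probability-of-treatment weighting (IPTW),
# the estimator class used to fit MSMs. Each subject is weighted by the
# inverse of the probability of the treatment they actually received,
# given the confounder, which balances L across treatment groups.

# Each record: (confounder L in {0, 1}, treatment A in {0, 1}, outcome Y).
data = [(0, 0, 5), (0, 0, 6), (0, 1, 8), (1, 0, 3),
        (1, 1, 7), (1, 1, 7), (1, 1, 9), (0, 1, 9)]

def p_treat(l):
    """Treatment model P(A = 1 | L = l), here just a stratum proportion."""
    stratum = [a for (li, a, _) in data if li == l]
    return sum(stratum) / len(stratum)

def iptw_mean(treated):
    """Weighted outcome mean among subjects with A == treated."""
    num = den = 0.0
    for l, a, y in data:
        if a != treated:
            continue
        p = p_treat(l) if a == 1 else 1 - p_treat(l)
        w = 1 / p                 # inverse probability of received treatment
        num += w * y
        den += w
    return num / den

# The weighted contrast estimates the marginal (causal) effect,
# assuming no unmeasured confounding given L.
effect = iptw_mean(1) - iptw_mean(0)
print(round(effect, 3))
```

Note how the contrast differs from the crude difference of raw group means, because L is both a confounder and unequally distributed across treatment arms.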

Journal ArticleDOI
TL;DR: A study was conducted to estimate the accuracy and reliability of reviewers when screening records for relevant trials for a systematic review and found that two reviewers should screen records for eligibility, whenever possible, in order to maximize ascertainment of relevant trials.
Abstract: A study was conducted to estimate the accuracy and reliability of reviewers when screening records for relevant trials for a systematic review. A sensitive search of ten electronic bibliographic databases yielded 22 571 records of potentially relevant trials. Records were allocated to four reviewers such that two reviewers examined each record and so that identification of trials by each reviewer could be compared with those identified by each of the other reviewers. Agreement between reviewers was assessed using Cohen's kappa statistic. Ascertainment intersection methods were used to estimate the likely number of trials missed by reviewers. Full copies of reports were obtained and assessed independently by two researchers for eligibility for the review. Eligible reports formed the 'gold standard' against which an assessment was made about the accuracy of screening by reviewers. After screening, 301 of 22 571 records were identified by at least one reviewer as potentially relevant. Agreement was 'almost perfect' (kappa>0.8) within two pairs, 'substantial' (kappa>0.6) within three pairs and 'moderate' (kappa>0.4) within one pair. Of the 301 records selected, 273 complete reports were available. When pairs of reviewers agreed on the potential relevance of records, 81 per cent were eligible (range 69 to 91 per cent). If reviewers disagreed, 22 per cent were eligible (range 12 to 45 per cent). Single reviewers missed on average 8 per cent of eligible reports (range 0 to 24 per cent), whereas pairs of reviewers did not miss any (range 0 to 1 per cent). The use of two reviewers to screen records increased the number of randomized trials identified by an average of 9 per cent (range 0 to 32 per cent). Reviewers can reliably identify potentially relevant records when screening thousands of records for eligibility. Two reviewers should screen records for eligibility, whenever possible, in order to maximize ascertainment of relevant trials.
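Cohen's kappa, the agreement statistic used in this study, corrects observed agreement for the agreement expected by chance from the reviewers' marginal rates. A minimal sketch, with an illustrative 2x2 table rather than the study's actual counts:

```python
# Hedged sketch of Cohen's kappa for two screeners' include/exclude
# decisions. The counts below are made up for illustration.

def cohens_kappa(table):
    """table[i][j] = records rated category i by reviewer A, j by reviewer B."""
    total = sum(sum(row) for row in table)
    # Observed agreement: proportion on the diagonal.
    p_observed = sum(table[i][i] for i in range(len(table))) / total
    # Chance agreement: product of the two reviewers' marginal proportions.
    p_chance = sum(
        (sum(table[i]) / total) * (sum(row[i] for row in table) / total)
        for i in range(len(table))
    )
    return (p_observed - p_chance) / (1 - p_chance)

# Rows = reviewer A (include, exclude); columns = reviewer B.
table = [[90, 10],
         [5, 895]]
print(round(cohens_kappa(table), 3))
```

On the scale quoted in the abstract, a value above 0.8 counts as 'almost perfect' agreement even though 15 of 1000 records were discordant.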

Journal ArticleDOI
TL;DR: It is shown that the two aspects of early growth may have different implications for imitation and fine motor dexterity.
Abstract: Poisson regression is widely used in medical studies, and can be extended to negative binomial regression to allow for heterogeneity. When there is an excess number of zero counts, a useful approach is to use a mixture model with a proportion P of subjects not at risk, and a proportion 1-P of at-risk subjects whose outcome values follow a Poisson or negative binomial distribution. Covariate effects can be incorporated into both components of the model. In child assessment, fine motor development is often measured by test items that involve a process of imitation and a process of fine motor exercise. One such developmental milestone is ‘building a tower of cubes’. This study analyses the impact of foetal growth and postnatal somatic growth on this milestone, operationalized as the number of cubes and measured around the age of 22 months. It is shown that the two aspects of early growth may have different implications for imitation and fine motor dexterity. The usual approach of recording and analysing the milestone as a binary outcome, such as whether the child can build a tower of three cubes, may leave out important information. Copyright © 2002 John Wiley & Sons, Ltd.
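The mixture just described has a simple probability mass function: zeros come both from the not-at-risk group and from the at-risk Poisson component. A minimal sketch with illustrative parameter values:

```python
import math

# Hedged sketch of the zero-inflated Poisson (ZIP) mass function: a
# proportion p of subjects are "not at risk" and contribute only zeros;
# the remaining 1 - p follow a Poisson(lam) distribution.
# p and lam below are illustrative, not the study's estimates.

def zip_pmf(k, p, lam):
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return p + (1 - p) * poisson   # structural zeros + sampling zeros
    return (1 - p) * poisson

def zip_loglik(counts, p, lam):
    """Log-likelihood for a sample of counts; a fit would maximize this."""
    return sum(math.log(zip_pmf(k, p, lam)) for k in counts)

# With p = 0.3 and lam = 2, zeros are far more frequent than a plain
# Poisson(2) would predict:
print(round(zip_pmf(0, 0.3, 2.0), 3))   # inflated zero probability
print(round(math.exp(-2.0), 3))         # plain Poisson(2) zero probability
```

Covariates enter through a logistic model for p and a log-linear model for lam, which is how the two growth effects can act on imitation and dexterity separately.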

Journal ArticleDOI
TL;DR: This paper demonstrates how the fully Bayesian approach to meta-analysis of binary outcome data, considered on an absolute risk or relative risk scale, can be extended to perform analyses on both the absolute and relative risk scales.
Abstract: When conducting a meta-analysis of clinical trials with binary outcomes, a normal approximation for the summary treatment effect measure in each trial is inappropriate in the common situation where some of the trials in the meta-analysis are small, or the observed risks are close to 0 or 1. This problem can be avoided by making direct use of the binomial distribution within trials. A fully Bayesian method has already been developed for random effects meta-analysis on the log-odds scale using the BUGS implementation of Gibbs sampling. In this paper we demonstrate how this method can be extended to perform analyses on both the absolute and relative risk scales. Within each approach we exemplify how trial-level covariates, including underlying risk, can be considered. Data from 46 trials of the effect of single-dose ibuprofen on post-operative pain are analysed and the results contrasted with those derived from classical and Bayesian summary statistic methods. The clinical interpretation of the odds ratio scale is not straightforward. The advantages and flexibility of a fully Bayesian approach to meta-analysis of binary outcome data, considered on an absolute risk or relative risk scale, are now available.

Journal ArticleDOI
TL;DR: This paper discusses an alternative simple approach for constructing the confidence interval, based on the t-distribution, which has improved coverage probability and is easy to calculate, and unlike some methods suggested in the statistical literature, no iterative computation is required.
Abstract: In the context of a random effects model for meta-analysis, a number of methods are available to estimate confidence limits for the overall mean effect. A simple and commonly used method is the DerSimonian and Laird approach. This paper discusses an alternative simple approach for constructing the confidence interval, based on the t-distribution. This approach has improved coverage probability compared to the DerSimonian and Laird method. Moreover, it is easy to calculate, and unlike some methods suggested in the statistical literature, no iterative computation is required.
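The contrast between the two intervals can be sketched directly. The five effect estimates and within-study variances below are made up for illustration, and the t critical value is taken from tables rather than computed; this is a hedged sketch of the t-based approach, not a full implementation:

```python
import math

# Hedged sketch: DerSimonian-Laird random effects estimate, then two
# confidence intervals for the overall mean - the usual normal-based
# interval and a t-based interval with k - 1 degrees of freedom.

y = [0.55, 0.05, 0.45, -0.05, 0.20]   # illustrative study effect estimates
v = [0.04, 0.05, 0.03, 0.06, 0.04]    # illustrative within-study variances
k = len(y)

# DerSimonian-Laird moment estimate of the between-study variance tau^2.
w = [1 / vi for vi in v]
mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
Q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (Q - (k - 1)) / c)

# Random-effects pooled mean with combined weights.
w_star = [1 / (vi + tau2) for vi in v]
mu = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)

# t-based interval: weighted residual variance estimate, k - 1 df.
q2 = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w_star, y)) / ((k - 1) * sum(w_star))
t_crit = 2.776                        # t_{0.975} with 4 df, from tables
ci_t = (mu - t_crit * math.sqrt(q2), mu + t_crit * math.sqrt(q2))

# Conventional DerSimonian-Laird interval: normal quantile 1.96 and
# standard error 1 / sqrt(sum of random-effects weights).
se_dl = 1 / math.sqrt(sum(w_star))
ci_dl = (mu - 1.96 * se_dl, mu + 1.96 * se_dl)
print(ci_t, ci_dl)
```

With heterogeneous studies the t-based interval is typically wider, which is exactly how it repairs the undercoverage of the normal-based interval in small meta-analyses.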

Journal ArticleDOI
TL;DR: Non-inferiority analyses that do not involve a fixed margin are proposed; they are conditionally equivalent to two confidence interval procedures that relax the conservatism of two 95 per cent confidence interval testing procedures and preserve the type I error rate at a one-sided 0.025 level.
Abstract: The recent revision of the Declaration of Helsinki and the existence of many new therapies that affect survival or serious morbidity, and that therefore cannot be denied patients, have generated increased interest in active-control trials, particularly those intended to show equivalence or non-inferiority to the active-control. A non-inferiority hypothesis has historically been formulated in terms of a fixed margin. This margin was designed to exclude a 'clinically meaningful difference', but it has become recognized that the margin must also be no larger than the assured effect of the control in the new study. Depending on how this 'assured effect' is determined or estimated, the selected margin may be very small, leading to very large sample sizes, especially when there is an added requirement that a loss of some specified fraction of the assured effect must be ruled out. In cases where it is appropriate, this paper proposes non-inferiority analyses that do not involve a fixed margin, but can be described as a two confidence interval procedure that compares the 95 per cent two-sided CI for the difference between the treatment and the control to a confidence interval for the control effect (based on a meta-analysis of historical data comparing the control to placebo) that is chosen to preserve a study-wide type I error rate of about 0.025 (similar to the usual standard for a superiority trial) for testing for retention of a prespecified fraction of the control effect. The approach assumes that the estimate of the historical active-control effect size is applicable in the current study. If there is reason to believe that this effect size is diminished (for example, improved concomitant therapies) the estimate of this historical effect could be reduced appropriately.
The statistical methodology for testing this non-inferiority hypothesis is developed for a hazard ratio (rather than an absolute difference between treatments, because a hazard ratio seems likely to be less population dependent than the absolute difference). In the case of oncology, the hazard ratio is the usual way of comparing treatments with respect to time to event (time to progression or survival) endpoints. The proportional hazards assumption is regarded as reasonable (approximately holding). The testing procedures proposed are conditionally equivalent to two confidence interval procedures that relax the conservatism of two 95 per cent confidence interval testing procedures and preserve the type I error rate at a one-sided 0.025 level. An application of this methodology to Xeloda, a recently approved drug for the treatment of metastatic colorectal cancers, is illustrated. Other methodologies are also described and assessed - including a point estimate procedure, a Bayesian procedure and two delta-method confidence interval procedures. Published in 2003 by John Wiley & Sons, Ltd.
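The two-confidence-interval comparison can be sketched on the log-hazard-ratio scale. The function below, the fraction-retention formulation and all numbers are illustrative assumptions, not the paper's exact procedure or the Xeloda data:

```python
import math

# Hedged sketch of a fraction-retention non-inferiority check on the
# log-hazard-ratio scale. The new treatment must rule out losing more
# than (1 - f) of the historical control-vs-placebo effect.

def retains_fraction(log_hr_trt, se_trt, log_hr_hist, se_hist, f=0.5, z=1.96):
    """log_hr_trt: treatment vs active control (> 0 favours control);
    log_hr_hist: historical control vs placebo effect (as |log HR|)."""
    upper_trt = log_hr_trt + z * se_trt        # worst case for the new drug
    # Conservative (closest-to-null) bound on the historical effect:
    hist_lower = abs(log_hr_hist) - z * se_hist
    margin = (1 - f) * hist_lower              # allowed loss of control effect
    return upper_trt < margin

# Example: the new treatment looks close to the control, and the
# historical control effect is a hazard ratio of about 0.67 vs placebo.
print(retains_fraction(0.02, 0.04, math.log(1 / 0.67), 0.10, f=0.5))
```

Doubling the treatment-comparison standard error in this example flips the conclusion, which illustrates why sample sizes under such procedures can be large.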

Journal ArticleDOI
TL;DR: Conceptual issues are discussed and computational methods for statistical power and sample size in microarray studies, taking account of the multiple testing that is generic to these studies, are presented.
Abstract: A microarray study aims at having a high probability of declaring genes to be differentially expressed if they are truly differentially expressed, while keeping the probability of making false declarations of expression acceptably low. Thus, in formal terms, well-designed microarray studies will have high power while controlling type I error risk. Achieving this objective is the purpose of this paper. Here, we discuss conceptual issues and present computational methods for statistical power and sample size in microarray studies, taking account of the multiple testing that is generic to these studies. The discussion encompasses choices of experimental design and replication for a study. Practical examples are used to demonstrate the methods. The examples show forcefully that replication of a microarray experiment can yield large increases in statistical power. The paper refers to cDNA arrays in the discussion and illustrations but the proposed methodology is equally applicable to expression data from oligonucleotide arrays.
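A generic normal-approximation version of such a calculation, with a Bonferroni-style family-wise adjustment, shows the core effect; this is an illustrative sketch, not the paper's exact method, and the function name and defaults are assumptions:

```python
import math
from statistics import NormalDist

# Hedged sketch of a per-gene power calculation when many genes are
# tested at once. The per-gene significance level is divided by the
# number of genes (Bonferroni), which drives up the replication needed.

def arrays_per_group(effect_sd_units, n_genes, alpha=0.05, power=0.95):
    """Replicates per group so a truly changed gene (effect expressed in
    SD units of log-expression) is detected at family-wise level alpha."""
    z = NormalDist().inv_cdf
    alpha_per_gene = alpha / n_genes      # Bonferroni adjustment
    z_a = z(1 - alpha_per_gene / 2)
    z_b = z(power)
    n = 2 * ((z_a + z_b) / effect_sd_units) ** 2
    return math.ceil(n)

# Detecting a one-SD change among 10,000 genes needs roughly three
# times the replication of a single unadjusted test:
print(arrays_per_group(1.0, 10_000))
print(arrays_per_group(1.0, 1))
```

The quadratic dependence on the effect size also shows why modest increases in replication translate into large power gains for moderately changed genes.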

Journal ArticleDOI
TL;DR: Investigation of artefactual and true causes of heterogeneity forms an essential step in moving from a combined effect estimate to application to particular populations and individuals.
Abstract: What causes heterogeneity in systematic reviews of controlled trials? First, it may be an artefact of the summary measures used, or of study design features such as duration of follow-up or the reliability of outcome measures. Second, it may be due to real variation in the treatment effect and hence provides the opportunity to identify factors that may modify the impact of treatment. These factors may include features of the population such as: severity of illness, age and gender; intervention factors such as dose, timing or duration of treatment; and comparator factors such as the control group treatment or the co-interventions in both groups. The ideal way to study causes of true variation is within rather than between studies. In most situations, however, we will have to make do with a study-level investigation and hence need to be careful about adjusting for potential confounding by artefactual factors such as study design features. Such investigation of artefactual and true causes of heterogeneity forms an essential step in moving from a combined effect estimate to application to particular populations and individuals.