
Showing papers in "Biostatistics in 2003"


Journal ArticleDOI
TL;DR: Exploratory analyses of the probe-level data motivate a new summary measure, the robust multi-array average (RMA) of background-adjusted, normalized, and log-transformed PM values; there is no obvious downside to using RMA and attaching a standard error (SE) to this quantity via a linear model that removes probe-specific affinities.
Abstract: SUMMARY In this paper we report exploratory analyses of high-density oligonucleotide array data from the Affymetrix GeneChip® system with the objective of improving upon currently used measures of gene expression. Our analyses make use of three data sets: a small experimental study consisting of five MGU74A mouse GeneChip® arrays; part of the data from an extensive spike-in study conducted by Gene Logic and Wyeth’s Genetics Institute involving 95 HG-U95A human GeneChip® arrays; and part of a dilution study conducted by Gene Logic involving 75 HG-U95A GeneChip® arrays. We display some familiar features of the perfect match and mismatch probe (PM and MM) values of these data, and examine the variance–mean relationship with probe-level data from probes believed to be defective, and so delivering noise only. We explain why we need to normalize the arrays to one another using probe level intensities. We then examine the behavior of the PM and MM using spike-in data and assess three commonly used summary measures: Affymetrix’s (i) average difference (AvDiff) and (ii) MAS 5.0 signal, and (iii) the Li and Wong multiplicative model-based expression index (MBEI). The exploratory data analyses of the probe level data motivate a new summary measure that is a robust multi-array average (RMA) of background-adjusted, normalized, and log-transformed PM values. We evaluate the four expression summary measures using the dilution study data, assessing their behavior in terms of bias, variance and (for MBEI and RMA) model fit. Finally, we evaluate the algorithms in terms of their ability to detect known levels of differential expression using the spike-in data. We conclude that there is no obvious downside to using RMA and attaching a standard error (SE) to this quantity using a linear model which removes probe-specific affinities.
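
As a rough illustration of the RMA recipe described above (normalize probe-level PM intensities across arrays, log2-transform, then summarize each probe set with a robust multi-array fit), here is a minimal sketch. The quantile normalization applied to a single probe set, the omission of a background-adjustment step, and the toy data are simplifications of my own, not the authors' exact implementation.

```python
import numpy as np

def quantile_normalize(pm):
    """Force every array (column) to share the same intensity distribution."""
    order = np.argsort(pm, axis=0)
    ranks = np.argsort(order, axis=0)
    mean_quantiles = np.sort(pm, axis=0).mean(axis=1)
    return mean_quantiles[ranks]

def median_polish_summary(log_pm, n_iter=10):
    """Robust multi-array summary of one probe set: sweep out probe (row)
    effects by median polish and return overall + array (column) effects."""
    resid = log_pm.copy()
    overall = 0.0
    probe_eff = np.zeros(log_pm.shape[0])
    array_eff = np.zeros(log_pm.shape[1])
    for _ in range(n_iter):
        row_med = np.median(resid, axis=1)
        probe_eff += row_med
        resid -= row_med[:, None]
        m = np.median(array_eff); overall += m; array_eff -= m
        col_med = np.median(resid, axis=0)
        array_eff += col_med
        resid -= col_med[None, :]
        m = np.median(probe_eff); overall += m; probe_eff -= m
    return overall + array_eff  # one expression value per array

# Toy example: 11 PM probes for one probe set measured on 5 arrays
rng = np.random.default_rng(0)
pm = rng.lognormal(mean=6.0, sigma=0.5, size=(11, 5))
expression = median_polish_summary(np.log2(quantile_normalize(pm)))
```

The median polish step is what removes probe-specific affinities, which is also what makes attaching a standard error to the per-array effects natural.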

10,711 citations


Journal ArticleDOI
TL;DR: This work introduces spatial autoregression parameters for multivariate conditional autoregressive models and proposes to employ these models as specifications for second-stage spatial effects in hierarchical models.
Abstract: In the past decade conditional autoregressive modelling specifications have found considerable application for the analysis of spatial data. Nearly all of this work is done in the univariate case and employs an improper specification. Our contribution here is to move to multivariate conditional autoregressive models and to provide rich, flexible classes which yield proper distributions. Our approach is to introduce spatial autoregression parameters. We first clarify what classes can be developed from the family of Mardia (1988) and contrast with recent work of Kim et al. (2000). We then present a novel parametric linear transformation which provides an extension with attractive interpretation. We propose to employ these models as specifications for second-stage spatial effects in hierarchical models. Two applications are discussed; one for the two-dimensional case modelling spatial patterns of child growth, the other for a four-dimensional situation modelling spatial variation in HLA-B allele frequencies. In each case, full Bayesian inference is carried out using Markov chain Monte Carlo simulation.
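
For intuition about where the 'proper' specification comes from, here is a small univariate toy construction of my own (not the paper's multivariate model, which adds cross-variable autoregression parameters): with neighbourhood matrix W and diagonal D of neighbour counts, the precision tau*(D - rho*W) is positive definite for |rho| < 1, so the prior is a proper distribution suitable for second-stage spatial effects.

```python
import numpy as np

# Toy adjacency for 5 areas on a line (area i neighbours i-1 and i+1)
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0
D = np.diag(W.sum(axis=1))

def proper_car_precision(rho, tau=1.0):
    """Precision matrix of a proper CAR prior: tau * (D - rho * W).
    Positive definite for |rho| < 1, so the joint distribution is proper."""
    return tau * (D - rho * W)

rng = np.random.default_rng(1)
Q = proper_car_precision(rho=0.9)
# Draw one realisation of the spatial random effects phi ~ N(0, Q^{-1})
phi = rng.multivariate_normal(np.zeros(5), np.linalg.inv(Q))
```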

447 citations


Journal ArticleDOI
TL;DR: The proposed method performed nearly perfectly in distinguishing cancer and benign hyperplasia from normal, and practical issues associated with the proposed approach to the analysis of SELDI output and its application in cancer biomarker discovery are discussed.
Abstract: With recent advances in mass spectrometry techniques, it is now possible to investigate proteins over a wide range of molecular weights in small biological specimens. This advance has generated data-analytic challenges in proteomics, similar to those created by microarray technologies in genetics, namely, discovery of 'signature' protein profiles specific to each pathologic state (e.g. normal vs. cancer) or differential profiles between experimental conditions (e.g. treated by a drug of interest vs. untreated) from high-dimensional data. We propose a data-analytic strategy for discovering protein biomarkers based on such high-dimensional mass spectrometry data. A real biomarker-discovery project on prostate cancer is taken as a concrete example throughout the paper: the project aims to identify proteins in serum that distinguish cancer, benign hyperplasia, and normal states of prostate using the Surface Enhanced Laser Desorption/Ionization (SELDI) technology, a recently developed mass spectrometry technique. Our data-analytic strategy takes properties of the SELDI mass spectrometer into account: the SELDI output of a specimen contains about 48,000 (x, y) points where x is the protein mass divided by the number of charges introduced by ionization and y is the protein intensity of the corresponding mass per charge value, x, in that specimen. Given high coefficients of variation and other characteristics of protein intensity measures (y values), we reduce the measures of protein intensities to a set of binary variables that indicate peaks in the y-axis direction in the nearest neighborhoods of each mass per charge point in the x-axis direction. We then account for a shifting (measurement error) problem of the x-axis in SELDI output. After this pre-analysis processing of data, we combine the binary predictors to generate classification rules for cancer, benign hyperplasia, and normal states of prostate. Our approach is to apply the boosting algorithm to select binary predictors and construct a summary classifier. We empirically evaluate sensitivity and specificity of the resulting summary classifiers with a test dataset that is independent from the training dataset used to construct the summary classifiers. The proposed method performed nearly perfectly in distinguishing cancer and benign hyperplasia from normal. In the classification of cancer vs. benign hyperplasia, however, an appreciable proportion of the benign specimens were classified incorrectly as cancer. We discuss practical issues associated with our proposed approach to the analysis of SELDI output and its application in cancer biomarker discovery.
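
The pre-processing-plus-boosting pipeline can be sketched as follows. The windowed local-maximum rule, the window size, and the off-the-shelf AdaBoost classifier stand in for the paper's peak definition, x-axis shift correction, and boosting details, and the data are simulated placeholders.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def peak_indicators(intensities, half_window=5):
    """Binary features: 1 if the intensity at a mass/charge point is the local
    maximum within +/- half_window neighbouring points, else 0."""
    n = len(intensities)
    out = np.zeros(n, dtype=int)
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        out[i] = int(intensities[i] == intensities[lo:hi].max())
    return out

# Toy spectra: rows are specimens, columns are mass/charge points
rng = np.random.default_rng(2)
spectra = rng.gamma(shape=2.0, scale=1.0, size=(40, 500))
X = np.vstack([peak_indicators(s) for s in spectra])
y = rng.integers(0, 2, size=40)            # toy labels: cancer vs normal

# Boosting with decision stumps selects informative binary peak predictors
clf = AdaBoostClassifier(n_estimators=50).fit(X, y)
```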

281 citations


Journal ArticleDOI
TL;DR: This work considers how mRNA pooling affects expression estimates by assessing the finite-sample performance of different estimators for designs with and without pooling, and gives a formula for the total number of subjects and arrays required in a pooled experiment to obtain gene expression estimates and confidence intervals comparable to those obtained from the no-pooling case.
Abstract: In a microarray experiment, messenger RNA samples are oftentimes pooled across subjects out of necessity, or in an effort to reduce the effect of biological variation. A basic problem in such experiments is to estimate the nominal expression levels of a large number of genes. Pooling samples will affect expression estimation, but the exact effects are not yet known as the approach has not been systematically studied in this context. We consider how mRNA pooling affects expression estimates by assessing the finite-sample performance of different estimators for designs with and without pooling. Conditions under which it is advantageous to pool mRNA are defined; and general properties of estimates from both pooled and non-pooled designs are derived under these conditions. A formula is given for the total number of subjects and arrays required in a pooled experiment to obtain gene expression estimates and confidence intervals comparable to those obtained from the no-pooling case. The formula demonstrates that by pooling a perhaps increased number of subjects, one can decrease the number of arrays required in an experiment without a loss of precision. The assumptions that facilitate derivation of this formula are considered using data from a quantitative real-time PCR experiment. The calculations are not specific to one particular method of quantifying gene expression as they assume only that a single, normalized, estimate of expression is obtained for each gene. As such, the results should be generally applicable to a number of technologies provided sufficient pre-processing and normalization methods are available and applied.
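
To see the kind of precision trade-off such a formula captures, consider a hedged back-of-the-envelope decomposition (my notation and simplification, not necessarily the paper's exact expression), in which measurements carry biological variance sigma_b^2 per subject and technical variance sigma_e^2 per array, and pools average subjects' mRNA ideally:

```latex
\[
\operatorname{Var}\bigl(\hat\mu_{\text{pooled}}\bigr)
  \;\approx\; \frac{\sigma_b^2}{n_{\text{subjects}}} + \frac{\sigma_e^2}{n_{\text{arrays}}},
\qquad
\operatorname{Var}\bigl(\hat\mu_{\text{no pooling}}\bigr)
  \;=\; \frac{\sigma_b^2 + \sigma_e^2}{n}.
\]
```

Equating the two variances shows how a pooled design can substitute additional (typically cheaper) subjects for arrays without loss of precision, which is the design question the paper formalizes.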

267 citations


Journal ArticleDOI
TL;DR: The main substantive goal here is to explain the pattern of infant mortality using important covariates while accounting for possible (spatially correlated) differences in hazard among the counties, using the GIS ArcView to map resulting fitted hazard rates, to help search for possible lingering spatial correlation.
Abstract: SUMMARY The use of survival models involving a random effect or ‘frailty’ term is becoming more common. Usually the random effects are assumed to represent different clusters, and clusters are assumed to be independent. In this paper, we consider random effects corresponding to clusters that are spatially arranged, such as clinical sites or geographical regions. That is, we might suspect that random effects corresponding to strata in closer proximity to each other might also be similar in magnitude. Such spatial arrangement of the strata can be modeled in several ways, but we group these ways into two general settings: geostatistical approaches, where we use the exact geographic locations (e.g. latitude and longitude) of the strata, and lattice approaches, where we use only the positions of the strata relative to each other (e.g. which counties neighbor which others). We compare our approaches in the context of a dataset on infant mortality in Minnesota counties between 1992 and 1996. Our main substantive goal here is to explain the pattern of infant mortality using important covariates (sex, race, birth weight, age of mother, etc.) while accounting for possible (spatially correlated) differences in hazard among the counties. We use the GIS ArcView to map resulting fitted hazard rates, to help search for possible lingering spatial correlation. The DIC criterion (Spiegelhalter et al., Journal of the Royal Statistical Society, Series B, 2002, to appear) is used to choose among various competing models. We investigate the quality of fit of our chosen model, and compare its results when used to investigate neonatal versus post-neonatal mortality. We also compare use of our time-to-event outcome survival model with the simpler dichotomous outcome logistic model. Finally, we summarize our findings and suggest directions for future research.

230 citations


Journal ArticleDOI
TL;DR: A multivariate extension of family-based association tests based on generalized estimating equations is proposed; it can be applied to multiple phenotypes and to phenotypic data obtained in longitudinal studies without making any distributional assumptions for the phenotypic observations.
Abstract: In this paper we propose a multivariate extension of family-based association tests based on generalized estimating equations. The test can be applied to multiple phenotypes and to phenotypic data obtained in longitudinal studies without making any distributional assumptions for the phenotypic observations. Methods for handling missing phenotypic information are discussed. Further, we compare the power of the multivariate test with permutation tests and with using separate tests for each outcome which are adjusted for multiple testing. Application of the proposed test to an asthma study illustrates the power of the approach.

194 citations


Journal ArticleDOI
TL;DR: Methods for monitoring the value of R using surveillance data are described, based on branching processes in which R is identified with the offspring mean, and unconditional likelihoods for the offspring mean are derived using data on outbreak size and outbreak duration.
Abstract: Mass vaccination programmes aim to maintain the effective reproduction number R of an infection below unity. We describe methods for monitoring the value of R using surveillance data. The models are based on branching processes in which R is identified with the offspring mean. We derive unconditional likelihoods for the offspring mean using data on outbreak size and outbreak duration. We also discuss Bayesian methods, implemented by Metropolis–Hastings sampling. We investigate by simulation the validity of the models with respect to depletion of susceptibles and under-ascertainment of cases. The methods are illustrated using surveillance data on measles in the USA.
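
A minimal version of the outbreak-size likelihood can be sketched under the assumption of a Poisson offspring distribution, for which the final size of an outbreak started by one case follows the Borel distribution P(N = n) = e^{-Rn}(Rn)^{n-1}/n!. The toy data and the simple numerical maximization below are illustrative; the paper additionally uses outbreak duration and Bayesian (Metropolis-Hastings) fitting, and considers under-ascertainment.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

def borel_loglik(R, sizes):
    """Log-likelihood of observed final outbreak sizes under a branching process
    with Poisson(R) offspring: P(N = n) = exp(-R*n) * (R*n)**(n-1) / n!."""
    n = np.asarray(sizes, dtype=float)
    return np.sum(-R * n + (n - 1) * np.log(R * n) - gammaln(n + 1))

sizes = [1, 1, 2, 1, 5, 1, 3, 1, 1, 2]   # toy outbreak sizes from surveillance
fit = minimize_scalar(lambda R: -borel_loglik(R, sizes),
                      bounds=(1e-6, 0.999), method="bounded")
print("Estimated offspring mean R:", fit.x)
```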

144 citations


Journal ArticleDOI
TL;DR: This test provides a goodness-of-fit test for checking parametric models against nonparametric models; it is based on the mixed-model representation of the smoothing spline estimator of the nonparametric function and the variance component score test, treating the inverse of the smoothing parameter as an extra variance component.
Abstract: We consider testing whether the nonparametric function in a semiparametric additive mixed model is a simple fixed degree polynomial, for example, a simple linear function. This test provides a goodness-of-fit test for checking parametric models against nonparametric models. It is based on the mixed-model representation of the smoothing spline estimator of the nonparametric function and the variance component score test by treating the inverse of the smoothing parameter as an extra variance component. We also consider testing the equivalence of two nonparametric functions in semiparametric additive mixed models for two groups, such as treatment and placebo groups. The proposed tests are applied to data from an epidemiological study and a clinical trial and their performance is evaluated through simulations.

140 citations


Journal ArticleDOI
TL;DR: The model is applied to county-level cancer mortality data in Minnesota to find whether there exists a common spatial factor underlying the cancer mortality throughout the state.
Abstract: There are often two types of correlations in multivariate spatial data: correlations between variables measured at the same locations, and correlations of each variable across the locations. We hypothesize that these two types of correlations are caused by a common spatially correlated underlying factor. Under this hypothesis, we propose a generalized common spatial factor model. The parameters are estimated using the Bayesian method and a Markov chain Monte Carlo computing technique. Our main goals are to determine which observed variables share a common underlying spatial factor and also to predict the common spatial factor. The model is applied to county-level cancer mortality data in Minnesota to find whether there exists a common spatial factor underlying the cancer mortality throughout the state.

136 citations


Journal ArticleDOI
TL;DR: A mixed-effects varying-coefficient model is proposed based on an exploratory analysis of data from a clinical trial, and a regression spline method is developed for inference on the parameters of the model.
Abstract: SUMMARY In this article we study the relationship between virologic and immunologic responses in AIDS clinical trials. Since plasma HIV RNA copies (viral load) and CD4+ cell counts are crucial virologic and immunologic markers for HIV infection, it is important to study their relationship during HIV/AIDS treatment. We propose a mixed-effects varying-coefficient model based on an exploratory analysis of data from a clinical trial. Since both viral load and CD4+ cell counts are subject to measurement error, we also consider the measurement error problem in covariates in our model. The regression spline method is proposed for inference for parameters in the proposed model. The regression spline method transforms the unknown nonparametric components into parametric functions. It is relatively simple to implement using readily available software, and parameter inference can be developed from standard parametric models. We apply the proposed models and methods to an AIDS clinical study. From this study, we find an interesting relationship between viral load and CD4+ cell counts during antiviral treatments. Biological interpretations and clinical implications are discussed.
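
The core regression-spline idea, expanding an unknown smooth coefficient in a spline basis so that it becomes a handful of ordinary regression coefficients, can be sketched as below. The truncated-power basis, knot placement, and simulated data are illustrative, and the random effects and covariate measurement error treated in the paper are omitted.

```python
import numpy as np

def linear_spline_basis(t, knots):
    """Truncated power basis of degree 1: [1, t, (t-k1)_+, (t-k2)_+, ...].
    Expanding an unknown smooth coefficient in this basis turns the
    nonparametric component into ordinary regression coefficients."""
    cols = [np.ones_like(t), t] + [np.clip(t - k, 0, None) for k in knots]
    return np.column_stack(cols)

# Toy varying-coefficient model: y(t) = beta(t) * x(t) + error
rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0, 48, size=200))      # e.g. weeks on treatment
x = rng.normal(size=200)                        # e.g. centred CD4+ measure
beta_t = 1.0 - 0.02 * t                         # true (unknown) coefficient
y = beta_t * x + rng.normal(scale=0.3, size=200)

B = linear_spline_basis(t, knots=[12, 24, 36])
design = B * x[:, None]                         # varying coefficient enters via x * basis
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta_hat = B @ coef                             # estimated beta(t) at the observed times
```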

108 citations


Journal ArticleDOI
TL;DR: Methods are presented that address the need for a methodological approach that is statistically valid and useful in the clinical setting, including novel definitions of the ROC curve and the area under the curve (AUC) that are applicable to this class of combination tests.
Abstract: SUMMARY In early detection of disease, combinations of biomarkers promise improved discrimination over diagnostic tests based on single markers. An example of this is in prostate cancer screening, where additional markers have been sought to improve the specificity of the conventional Prostate-Specific Antigen (PSA) test. A marker of particular interest is the percent free PSA. Studies evaluating the benefits of percent free PSA reflect the need for a methodological approach that is statistically valid and useful in the clinical setting. This article presents methods that address this need. We focus on and-or combinations of biomarker results that we call logic rules and present novel definitions for the ROC curve and the area under the curve (AUC) that are applicable to this class of combination tests. Our estimates of the ROC and AUC are amenable to statistical inference including comparisons of tests and regression analysis. The methods are applied to data on free and total PSA levels among prostate cancer cases and matched controls enrolled in the Physicians’ Health Study.
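
A simplified sketch of the kind of threshold search behind an 'or' logic rule (positive if total PSA is high or percent free PSA is low) is shown below. The grid search, the toy data, and the upper-envelope reading of the operating points are my own shorthand for the paper's formal ROC and AUC definitions for logic rules.

```python
import numpy as np

def or_rule_operating_points(total_psa, pct_free, labels, n_grid=25):
    """Empirical (FPR, TPR) points for the rule
    positive if (total PSA > c1) OR (percent free PSA < c2),
    evaluated over a grid of threshold pairs (c1, c2)."""
    c1s = np.quantile(total_psa, np.linspace(0, 1, n_grid))
    c2s = np.quantile(pct_free, np.linspace(0, 1, n_grid))
    pts = []
    for c1 in c1s:
        for c2 in c2s:
            pos = (total_psa > c1) | (pct_free < c2)
            tpr = pos[labels == 1].mean()
            fpr = pos[labels == 0].mean()
            pts.append((fpr, tpr))
    return np.array(pts)

# Toy data: cases tend to have higher total PSA and lower percent free PSA
rng = np.random.default_rng(4)
labels = np.r_[np.ones(100), np.zeros(100)].astype(int)
total_psa = np.r_[rng.lognormal(1.5, 0.5, 100), rng.lognormal(1.0, 0.5, 100)]
pct_free = np.r_[rng.normal(12, 4, 100), rng.normal(18, 4, 100)]
points = or_rule_operating_points(total_psa, pct_free, labels)
# The ROC curve for the rule family is the upper envelope of these points;
# its area gives an AUC for the combination test.
```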

Journal ArticleDOI
TL;DR: This paper presents a fully Bayesian methodology that allows the investigator to draw a 'single' conclusion by formally incorporating prior beliefs about non-identifiable, yet interpretable, selection bias parameters in the distributional form of the continuous outcomes.
Abstract: SUMMARY In randomized studies with missing outcomes, non-identifiable assumptions are required to hold for valid data analysis. As a result, statisticians have been advocating the use of sensitivity analysis to evaluate the effect of varying assumptions on study conclusions. While this approach may be useful in assessing the sensitivity of treatment comparisons to missing data assumptions, it may be dissatisfying to some researchers/decision makers because a single summary is not provided. In this paper, we present a fully Bayesian methodology that allows the investigator to draw a ‘single’ conclusion by formally incorporating prior beliefs about non-identifiable, yet interpretable, selection bias parameters. Our Bayesian model provides robustness to prior specification of the distributional form of the continuous outcomes.

Journal ArticleDOI
TL;DR: A new method is proposed that uses the same likelihood formulation as Excoffier and Slatkin's EM algorithm and applies the estimating equation idea and the PL computational algorithm with some modifications; it can handle data sets with a large number of SNPs as well as a large number of subjects.
Abstract: SUMMARY Estimating haplotype frequencies becomes increasingly important in the mapping of complex disease genes, as millions of single nucleotide polymorphisms (SNPs) are being identified and genotyped. When genotypes at multiple SNP loci are gathered from unrelated individuals, haplotype frequencies can be accurately estimated using expectation-maximization (EM) algorithms (Excoffier and Slatkin, 1995; Hawley and Kidd, 1995; Long et al., 1995), with standard errors estimated using bootstraps. However, because the number of possible haplotypes increases exponentially with the number of SNPs, handling data with a large number of SNPs poses a computational challenge for the EM methods and for other haplotype inference methods. To solve this problem, Niu and colleagues, in their Bayesian haplotype inference paper (Niu et al., 2002), introduced a computational algorithm called progressive ligation (PL). But their Bayesian method has a limitation on the number of subjects (no more than 100 subjects in the current implementation of the method). In this paper, we propose a new method in which we use the same likelihood formulation as in Excoffier and Slatkin’s EM algorithm and apply the estimating equation idea and the PL computational algorithm with some modifications. Our proposed method can handle data sets with a large number of SNPs as well as a large number of subjects. Simultaneously, our method estimates standard errors efficiently, using the sandwich-estimate from the estimating equation, rather than the bootstrap method. Additionally, our method admits missing data and produces valid estimates of parameters and their standard errors under the assumption that the missing genotypes are missing at random in the sense defined by Rubin (1976).
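
For concreteness, here is a toy EM in the spirit of the Excoffier-Slatkin likelihood for just two SNPs, where only double heterozygotes are phase-ambiguous. The estimating-equation machinery, progressive ligation, sandwich standard errors, and missing-data handling that make the paper's method scale are not shown, and the simulated data are illustrative.

```python
import numpy as np
from collections import Counter

def haplotype_em(genotypes, n_iter=50):
    """EM estimate of haplotype frequencies for two biallelic SNPs.
    Each genotype is (g1, g2) with g in {0, 1, 2} = copies of allele '1'.
    Only the double heterozygote (1, 1) has ambiguous phase."""
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freq = {h: 0.25 for h in haps}
    for _ in range(n_iter):
        counts = Counter({h: 0.0 for h in haps})
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: split the subject between the two phase resolutions
                w_cis = 2 * freq[(0, 0)] * freq[(1, 1)]
                w_trans = 2 * freq[(0, 1)] * freq[(1, 0)]
                p = w_cis / (w_cis + w_trans + 1e-12)   # guard against 0/0
                counts[(0, 0)] += p; counts[(1, 1)] += p
                counts[(0, 1)] += 1 - p; counts[(1, 0)] += 1 - p
            else:
                a1 = [0, 1] if g1 == 1 else [g1 // 2, g1 // 2]
                a2 = [0, 1] if g2 == 1 else [g2 // 2, g2 // 2]
                for h in zip(a1, a2):                   # phase is determined
                    counts[h] += 1.0
        total = sum(counts.values())
        freq = {h: counts[h] / total for h in haps}     # M-step
    return freq

# Simulated genotypes from known haplotype frequencies
rng = np.random.default_rng(5)
haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
draws = rng.choice(4, size=(200, 2), p=[0.4, 0.1, 0.2, 0.3])
genotypes = [(haps[i][0] + haps[j][0], haps[i][1] + haps[j][1]) for i, j in draws]
print(haplotype_em(genotypes))
```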

Journal ArticleDOI
TL;DR: This manuscript presents a computationally simple longitudinal screening algorithm that can be implemented with data that is obtainable in a short period of time and uniformly improves the sensitivity compared with simpler screening algorithms but maintains the same specificity.
Abstract: SUMMARY A revolution in molecular technology is leading to the discovery of many biomarkers of disease. Monitoring these biomarkers in a population may lead to earlier disease detection, and may prevent death from diseases like cancer that are more curable if found early. For markers whose concentration is associated with disease progression the earliest detection is achieved by monitoring the marker with an algorithm able to detect very small changes. One strategy is to monitor the biomarkers using a longitudinal algorithm that incorporates a subject’s screening history into screening decisions. Longitudinal algorithms that have been proposed thus far rely on modeling the behavior of a biomarker from the moment of disease onset until its clinical presentation. Because the data needed to observe the early pre-clinical behavior of the biomarker may take years to accumulate, those algorithms are not appropriate for timely development using new biomarker discoveries. This manuscript presents a computationally simple longitudinal screening algorithm that can be implemented with data that is obtainable in a short period of time. For biomarkers meeting only a few modest assumptions our algorithm uniformly improves the sensitivity compared with simpler screening algorithms but maintains the same specificity. It is unclear what performance advantage more complex methods may have compared with our method, especially when there is doubt about the correct model for describing the behavior of the biomarker early in the disease process. Our method was specifically developed for use in screening for cancer with a new biomarker, but it is appropriate whenever the pre-clinical behavior of the disease and/or biomarker is uncertain.

Journal ArticleDOI
TL;DR: A statistical measure of the heterogeneity of a tissue characteristic that is based on the deviation of the distribution of the tissue characteristic from a unimodal elliptically contoured spatial pattern is developed.
Abstract: SUMMARY In vivo measurement of local tissue characteristics by modern bioimaging techniques such as positron emission tomography (PET) provides the opportunity to analyze quantitatively the role that tissue heterogeneity may play in understanding biological function. This paper develops a statistical measure of the heterogeneity of a tissue characteristic that is based on the deviation of the distribution of the tissue characteristic from a unimodal elliptically contoured spatial pattern. An efficient algorithm is developed for computation of the measure based on volumetric region of interest data. The technique is illustrated by application to data from PET imaging studies of fluorodeoxyglucose utilization in human sarcomas. A set of 74 sarcoma patients (with five-year follow-up survival information) were evaluated for heterogeneity as well as a number of other potential prognostic indicators of survival. A Cox proportional hazards analysis of these data shows that the degree of heterogeneity of the sarcoma is the major risk factor associated with patient death. Some theory is developed to analyze the asymptotic statistical behavior of the heterogeneity estimator. In the context of data arising from Poisson deconvolution (PET being the prime example), the heterogeneity estimator, which is a non-linear functional of the PET image data, is consistent and converges at a rate that is parametric in the injected dose.

Journal ArticleDOI
TL;DR: This paper develops four statistics that can be used to test hypotheses about the means and/or variances of the gene expression levels in both one- and two-sample problems, and presents the result of a simulation comparing the proposed methods to well-known statistics and multiple testing adjustments.
Abstract: The potential of microarray data is enormous. It allows us to monitor the expression of thousands of genes simultaneously. A common task with microarray data is to determine which genes are differentially expressed between two samples obtained under two different conditions. Recently, several statistical methods have been proposed to perform such a task when there are replicate samples under each condition. Two major problems arise with microarray data. The first one is that the number of replicates is very small (usually 2-10), leading to noisy point estimates. As a consequence, traditional statistics that are based on the means and standard deviations, e.g. t-statistic, are not suitable. The second problem is that the number of genes is usually very large (approximately 10,000), and one is faced with an extreme multiple testing problem. Most multiple testing adjustments are relatively conservative, especially when the number of replicates is small. In this paper we present an empirical Bayes analysis that handles both problems very well. Using different parametrizations, we develop four statistics that can be used to test hypotheses about the means and/or variances of the gene expression levels in both one- and two-sample problems. The methods are illustrated using experimental data with prior knowledge. In addition, we present the result of a simulation comparing our methods to well-known statistics and multiple testing adjustments.
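
In the same spirit (though not the paper's four empirical Bayes statistics), a minimal regularised two-sample statistic shows how borrowing strength across genes stabilises variance estimates when there are only a few replicates. The added scale constant s0 and the simulated data are illustrative choices of mine.

```python
import numpy as np

def moderated_t(x, y, s0_quantile=0.5):
    """Regularised two-sample statistic: gene-wise mean difference divided by a
    standard error to which a common scale s0 (estimated across all genes) is
    added, so genes with noisy small-sample variances are not over-ranked."""
    nx, ny = x.shape[1], y.shape[1]
    diff = x.mean(axis=1) - y.mean(axis=1)
    pooled_var = ((nx - 1) * x.var(axis=1, ddof=1)
                  + (ny - 1) * y.var(axis=1, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(pooled_var * (1 / nx + 1 / ny))
    s0 = np.quantile(se, s0_quantile)      # information shared across genes
    return diff / (se + s0)

# Toy data: 1000 genes, 3 replicates per condition, first 50 genes shifted
rng = np.random.default_rng(6)
x = rng.normal(size=(1000, 3))
y = rng.normal(size=(1000, 3))
x[:50] += 1.5
stat = moderated_t(x, y)
```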

Journal ArticleDOI
TL;DR: With the goal of aligning inference from individual versus group-level studies, the interplay between exposure and study design is discussed and the additional assumptions necessary for valid inference are specified.
Abstract: SUMMARY Ecological and aggregate data studies are examples of group-level studies. Even though the link between the predictors and outcomes is not preserved in these studies, inference about individual-level exposure effects is often a goal. The disconnection between the level of inference and the level of analysis expands the array of potential biases that can invalidate the inference from group-level studies. While several sources of bias, specifically due to measurement error and confounding, may be more complex in group-level studies, two sources of bias, cross-level and model specification bias, are a direct consequence of the disconnection. With the goal of aligning inference from individual versus group-level studies, I discuss the interplay between exposure and study design. I specify the additional assumptions necessary for valid inference, specifically that the between- and within-group exposure effects are equal. Then cross-level inference is possible. However, all the information in the group-level analysis comes from between-group comparisons. Models where the group-level analysis provides even a small percentage of information about the within-group exposure effect are most susceptible to model specification bias. Model specification bias can be even more serious when the group-level model isn’t derived from an individual-level model.

Journal ArticleDOI
TL;DR: The semiparametric efficient approach is the preferred method for prevalence estimation in two-phase studies because it is more robust and comparable in its efficiency to imputation and other re-weighting estimators, and also easy to implement.
Abstract: SUMMARY Disease prevalence is ideally estimated using a ‘gold standard’ to ascertain true disease status on all subjects in a population of interest. In practice, however, the gold standard may be too costly or invasive to be applied to all subjects, in which case a two-phase design is often employed. Phase 1 data consisting of inexpensive and non-invasive screening tests on all study subjects are used to determine the subjects that receive the gold standard in the second phase. Naive estimates of prevalence in two-phase studies can be biased (verification bias). Imputation and re-weighting estimators are often used to avoid this bias. We contrast the forms and attributes of the various prevalence estimators. Distribution theory and simulation studies are used to investigate their bias and efficiency. We conclude that the semiparametric efficient approach is the preferred method for prevalence estimation in two-phase studies. It is more robust and comparable in its efficiency to imputation and other re-weighting estimators. It is also easy to implement. We use this approach to examine the prevalence of depression in adolescents with data from the Great Smoky Mountain Study.
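
To show the verification-bias problem and the re-weighting fix in miniature, here is a toy inverse-probability-weighted (Horvitz-Thompson type) estimator. The paper's recommended estimator is the semiparametric efficient one, which this sketch does not implement, and the selection probabilities and data below are simulated.

```python
import numpy as np

def ipw_prevalence(verified, disease, verify_prob):
    """Inverse-probability-weighted prevalence: each gold-standard-verified subject
    is weighted by 1 / P(selected for verification | phase-1 screen result)."""
    w = verified / verify_prob
    return np.sum(w * disease) / np.sum(w)

# Toy two-phase design: gold standard applied to 90% of screen-positives
# but only 20% of screen-negatives (true prevalence is 10%)
rng = np.random.default_rng(7)
n = 5000
true_disease = rng.binomial(1, 0.10, n)
screen_pos = rng.binomial(1, np.where(true_disease == 1, 0.85, 0.15))
verify_prob = np.where(screen_pos == 1, 0.90, 0.20)
verified = rng.binomial(1, verify_prob)
observed_disease = true_disease * verified      # status known only if verified

naive = observed_disease[verified == 1].mean()  # biased by verification
corrected = ipw_prevalence(verified, observed_disease, verify_prob)
print(round(naive, 3), round(corrected, 3))
```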

Journal ArticleDOI
TL;DR: The assessment of the overall diagnostic accuracy of a sequence of tests (e.g. repeated screening tests) is considered, and the setting where an overall test is defined to be positive if any of the individual tests are positive ('believe the positive').
Abstract: We consider the assessment of the overall diagnostic accuracy of a sequence of tests (e.g. repeated screening tests). The complexity of diagnostic choices when two or more continuous tests are used in sequence is illustrated, and different approaches to reducing the dimensionality are presented and evaluated. For instance, in practice, when a single test is used repeatedly in routine screening, the same screening threshold is typically used at each screening visit. One possible alternative is to adjust the threshold at successive visits according to individual-specific characteristics. Such possibilities represent a particular slice of a receiver operating characteristic surface, corresponding to all possible combinations of test thresholds. We focus in the development and examples on the setting where an overall test is defined to be positive if any of the individual tests are positive ('believe the positive'). The ideas developed are illustrated by an example of application to screening for prostate cancer using prostate-specific antigen.

Journal ArticleDOI
TL;DR: The degree of attenuation depends on the type of stochastic process describing the time-dependent covariate, and attenuation may be substantial for an Ornstein-Uhlenbeck process; the simpler techniques may have advantages in larger data sets with infrequent updatings.
Abstract: This paper deals with hazard regression models for survival data with time-dependent covariates consisting of updated quantitative measurements. The main emphasis is on the Cox proportional hazards model but also additive hazard models are discussed. Attenuation of regression coefficients caused by infrequent updating of covariates is evaluated using simulated data mimicking our main example, the CSL1 liver cirrhosis trial. We conclude that the degree of attenuation depends on the type of stochastic process describing the time-dependent covariate and that attenuation may be substantial for an Ornstein-Uhlenbeck process. Also trends in the covariate combined with non-synchronous updating may cause attenuation. Simple methods to adjust for infrequent updating of covariates are proposed and compared to existing techniques using both simulations and the CSL1 data. The comparison shows that while existing, more complicated methods may work well with frequent updating of covariates the simpler techniques may have advantages in larger data sets with infrequent updatings.

Journal ArticleDOI
TL;DR: A simple test procedure is proposed that can easily handle the problem with partially missing trait values, and is applicable to the case with a mixture of qualitative and quantitative traits.
Abstract: A robust statistical method to detect linkage or association between a genetic marker and a set of distinct phenotypic traits is to combine univariate trait-specific test statistics for a more powerful overall test. This procedure does not need complex modeling assumptions, can easily handle the problem with partially missing trait values, and is applicable to the case with a mixture of qualitative and quantitative traits. In this note, we propose a simple test procedure along this line, and show its advantages over the standard combination tests for linkage or association in the literature through a data set from Genetic Analysis Workshop 12 (GAW12) and an extensive simulation study.

Journal ArticleDOI
TL;DR: Methods for estimating Re from serological survey data are discussed, semi-parametric and parametric models are described, and an upper bound on Re is obtained when vaccine coverage and efficacy are not known.
Abstract: The effective reproduction number of an infection, denoted Re, may be used to monitor the impact of a vaccination programme. If Re is maintained below 1, then sustained endemic transmission of the infection cannot occur. In this paper we discuss methods for estimating Re from serological survey data, allowing for age and individual heterogeneity. We describe semi-parametric and parametric models, and obtain an upper bound on Re when vaccine coverage and efficacy are not known. The methods are illustrated using data on mumps and rubella in England and Wales.
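
In the simplest homogeneous-mixing case (the paper's models additionally allow for age and individual heterogeneity via the serological profiles), the monitoring target can be written as follows, where s is the susceptible proportion estimated from the survey and vaccine coverage:

```latex
\[
  R_e \;=\; R_0 \, s,
  \qquad\text{so}\qquad
  R_e < 1 \;\Longleftrightarrow\; s < \tfrac{1}{R_0},
\]
```

i.e. sustained endemic transmission cannot occur while the susceptible fraction stays below the threshold 1/R_0.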

Journal ArticleDOI
TL;DR: A novel application of reversible jump Markov chain Monte Carlo simulation to estimate activation patterns from electromyographic data to estimate physiologically relevant quantities such as muscle coactivity, total integrated energy, and average burst duration is presented.
Abstract: SUMMARY Many facets of neuromuscular activation patterns and control can be assessed via electromyography and are important for understanding the control of locomotion. After spinal cord injury, muscle activation patterns can affect locomotor recovery. We present a novel application of reversible jump Markov chain Monte Carlo simulation to estimate activation patterns from electromyographic data. We assume the data to be a zero-mean, heteroscedastic process. The variance is explicitly modeled using a step function. The number and location of points of discontinuity, or change-points, in the step function, the inter-changepoint variances, and the overall mean are jointly modeled along with the mean and variance from baseline data. The number of change-points is considered a nuisance parameter and is integrated out of the posterior distribution. Whereas current methods of detecting activation patterns are deterministic or provide only point estimates, ours provides distributional estimates of muscle activation. These estimates, in turn, are used to estimate physiologically relevant quantities such as muscle coactivity, total integrated energy, and average burst duration and to draw valid statistical inferences about these quantities.

Journal ArticleDOI
TL;DR: A method is presented for finding upper and lower bounds for the covariance between effects that have been adjusted for confounding factors; homogeneity of the relative risks is initially assumed, but the method is extended to allow for heterogeneity in an overall estimate.
Abstract: SUMMARY This paper deals with the synthesis of information from different studies when there is lack of independence in some of the contrasts to be combined. This problem can arise in several different situations in both case-control studies and clinical trials. For efficient estimation we appeal to the method of generalized least squares to estimate the summary effect and its standard error. This method requires estimates of the covariances between those contrasts that are not independent. Although it is not possible to estimate the covariance between effects that have been adjusted for confounding factors we present a method for finding upper and lower bounds for this covariance. In the simplest discussion homogeneity of the relative risks is assumed but the method is then extended to allow for heterogeneity in an overall estimate. We then illustrate the method with several examples from an analysis in which case-control studies of cervical cancer and oral contraceptive use are synthesized.
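
The generalized least squares step itself is standard and can be sketched as below; the bounds on the unknown covariance are the paper's contribution and are simply plugged in here as two illustrative values, with all numbers invented for the example.

```python
import numpy as np

def gls_summary(effects, cov):
    """GLS summary of study-specific effects with covariance matrix V:
    beta_hat = (1' V^{-1} 1)^{-1} 1' V^{-1} y, with variance (1' V^{-1} 1)^{-1}."""
    y = np.asarray(effects, dtype=float)
    Vinv = np.linalg.inv(np.asarray(cov, dtype=float))
    one = np.ones_like(y)
    var = 1.0 / (one @ Vinv @ one)
    beta = var * (one @ Vinv @ y)
    return beta, np.sqrt(var)

# Toy example: three log relative risks, the first two from studies that share
# controls; the unknown covariance is varied between a lower and upper bound.
log_rr = [0.40, 0.35, 0.55]
var_rr = [0.04, 0.05, 0.09]
for c12 in (0.0, 0.03):                      # bounds on the unknown covariance
    V = np.diag(var_rr)
    V[0, 1] = V[1, 0] = c12
    print(gls_summary(log_rr, V))
```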

Journal ArticleDOI
TL;DR: The utility of a pertinent distance between random vectors and its empirical counterpart constructed from gene expression data is considered, and methods of multidimensional search for biologically significant genes are developed, considering expression signals as mutually dependent random variables.
Abstract: The ultimate success of microarray technology in basic and applied biological sciences depends critically on the development of statistical methods for gene expression data analysis. The most widely used tests for differential expression of genes are essentially univariate. Such tests disregard the multidimensional structure of microarray data. Multivariate methods are needed to utilize the information hidden in gene interactions and hence to provide more powerful and biologically meaningful methods for finding subsets of differentially expressed genes. The objective of this paper is to develop methods of multidimensional search for biologically significant genes, considering expression signals as mutually dependent random variables. To attain these ends, we consider the utility of a pertinent distance between random vectors and its empirical counterpart constructed from gene expression data. The distance furnishes exploratory procedures aimed at finding a target subset of differentially expressed genes. To determine the size of the target subset, we resort to successive elimination of smaller subsets resulting from each step of a random search algorithm based on maximization of the proposed distance. Different stopping rules associated with this procedure are evaluated. The usefulness of the proposed approach is illustrated with an application to the analysis of two sets of gene expression data.

Journal ArticleDOI
TL;DR: A scaled marginal model is proposed for testing and estimating a global exposure effect, motivated by the question of whether antiretroviral therapy affects different aspects of neurocognitive functioning to the same degree and, if so, whether the treatment effect can be tested using a more powerful one-degree-of-freedom global test.
Abstract: SUMMARY In studies that involve multivariate outcomes it is often of interest to test for a common exposure effect. For example, our research is motivated by a study of neurocognitive performance in a cohort of HIV-infected women. The goal is to determine whether highly active antiretroviral therapy affects different aspects of neurocognitive functioning to the same degree and if so, to test for the treatment effect using a more powerful one-degree-of-freedom global test. Since multivariate continuous outcomes are likely to be measured on different scales, such a common exposure effect has not been well defined. We propose the use of a scaled marginal model for testing and estimating this global effect when the outcomes are all continuous. A key feature of the model is that the effect of exposure is represented by a common effect size and hence has a well-understood, practical interpretation. Estimating equations are proposed to estimate the regression coefficients and the outcome-specific scale parameters, where the correct specification of the within-subject correlation is not required. These estimating equations can be solved by repeatedly calling standard generalized estimating equations software such as SAS PROC GENMOD. To test whether the assumption of a common exposure effect is reasonable, we propose the use of an estimating-equation-based score-type test. We study the asymptotic efficiency loss of the proposed estimators, and show that they generally have high efficiency compared to the maximum likelihood estimators. The proposed method is applied to the HIV data.

Journal ArticleDOI
TL;DR: This work considers two alternative methods for estimating the independent effects of two predictors in a hierarchical model and shows both analytically and via simulation that one of these gives essentially unbiased estimates even in the presence of measurement error, at the price of a moderate reduction in power.
Abstract: Hierarchical modeling is becoming increasingly popular in epidemiology, particularly in air pollution studies. When potential confounding exists, a multilevel model yields better power to assess the independent effects of each predictor by gathering evidence across many sub-studies. If the predictors are measured with unknown error, bias can be expected in the individual substudies, and in the combined estimates of the second-stage model. We consider two alternative methods for estimating the independent effects of two predictors in a hierarchical model. We show both analytically and via simulation that one of these gives essentially unbiased estimates even in the presence of measurement error, at the price of a moderate reduction in power. The second avoids the potential for upward bias, at the price of a smaller reduction in power. Since measurement error is endemic in epidemiology, these approaches hold considerable potential. We illustrate the two methods by applying them to two air pollution studies. In the first, we re-analyze published data to show that the estimated effect of fine particles on daily deaths, independent of coarse particles, was downwardly biased by measurement error in the original analysis. The estimated effect of coarse particles becomes more protective using the new estimates. In the second example, we use published data on the association between airborne particles and daily deaths in 10 US cities to estimate the effect of gaseous air pollutants on daily deaths. The resulting effect size estimates were very small and the confidence intervals included zero.

Journal ArticleDOI
TL;DR: It is observed here that a dose-response relationship may or may not reduce sensitivity to hidden bias, and whether it has or has not can be determined by a suitable analysis using the data at hand.
Abstract: SUMMARY It is often said that an important consideration in judging whether an association between treatment and response is causal is the presence or absence of a dose–response relationship, that is, larger ostensible treatment effects when doses of treatment are larger. This criterion is widely discussed in textbooks and is often mentioned in empirical papers. At the same time, it is well known through both important examples and elementary theory that a treatment may cause dramatic effects with no dose–response relationship, and hidden biases may produce a dose–response relationship when the treatment is without effect. What does a dose–response relationship say about causality? It is observed here that a dose–response relationship may or may not reduce sensitivity to hidden bias, and whether it has or has not can be determined by a suitable analysis using the data at hand. Moreover, a study without a dose–response relationship may or may not be less sensitive to hidden bias than another study with such a relationship, and this, too, can be determined from the data at hand. An example concerning cytogenetic damage among professional painters is used to illustrate.

Journal ArticleDOI
TL;DR: Under the Armitage-Doll model the effect of exposure is equivalent to a change in age scale, adding to age a parametric multiple of cumulative dose to the mutagen, which leads to useful formulae for the relative risk.
Abstract: SUMMARY We explore some stochastic considerations regarding accumulation of mutations in relation to carcinogenesis. In particular, we consider the effect of exposure to specific agents, especially ionizing radiation, that may increase mutation rates. The formulation and consequences are a further development of the Armitage–Doll model; both in terms of background cancer where assumptions are substantially weakened, and in terms of the effect of specific mutagenic exposures through generally increasing mutation rates. Under our model the effect of exposure is equivalent to a change in age scale, adding to age a parametric multiple of cumulative dose to the mutagen, which leads to useful formulae for the relative risk. In particular, the excess relative risk at age a behaves approximately as a parametric multiple of the mean dose over ages prior to a. These results do not require assuming that some fixed number of mutations are required for malignancy. The implications are particularly useful in providing guidance for descriptive analyses since they have characteristics largely independent of parameter values. It is indicated that the model consequences conform remarkably well to observations from cohort studies of the A-bomb survivors, miners with prolonged exposure to radon, and cigarette smokers who stopped smoking at various ages.
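
A sketch of why the age-scale shift yields simple relative-risk formulae, written for the classical k-stage special case even though the paper weakens that assumption (notation illustrative): if background incidence behaves like lambda_0(a) proportional to a^{k-1} and exposure shifts the effective age to a + beta*c(a), with c(a) the cumulative dose up to age a, then

```latex
\[
  \mathrm{RR}(a)
  \;=\; \Bigl(\frac{a + \beta\, c(a)}{a}\Bigr)^{k-1}
  \;\approx\; 1 + (k-1)\,\beta\,\frac{c(a)}{a},
\]
```

so the excess relative risk at age a is approximately a parametric multiple of c(a)/a, the average dose over ages prior to a, matching the behaviour described in the abstract.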

Journal ArticleDOI
TL;DR: A semiparametric model is proposed to consider situations where the functional form of the effect of one or more covariates is unknown, as is the case in the application presented in this work.
Abstract: SUMMARY In this work we study the effect of several covariates on a censored response variable with unknown probability distribution. A semiparametric model is proposed to consider situations where the functional form of the effect of one or more covariates is unknown, as is the case in the application presented in this work. We provide its estimation procedure and, in addition, a bootstrap technique to make inference on the parameters. A simulation study has been carried out to show the good performance of the proposed estimation process and to analyse the effect of censoring. Finally, we present the results when the methodology is applied to AIDS-diagnosed patients.