
Showing papers in "Statistical Modelling in 2005"


Journal ArticleDOI
TL;DR: In this article, random effect models for repeated measurements of zero-inflated count responses are discussed. In addition to the problem of extra zeros, the correlation between measurements upon the same subject at different occasions needs to be taken into account.
Abstract: For count responses, the situation of excess zeros (relative to what standard models allow) often occurs in biomedical and sociological applications. Modeling repeated measures of zero-inflated count data presents special challenges. This is because in addition to the problem of extra zeros, the correlation between measurements upon the same subject at different occasions needs to be taken into account. This article discusses random effect models for repeated measurements on this type of response variable. A useful model is the hurdle model with random effects, which separately handles the zero observations and the positive counts. In maximum likelihood model fitting, we consider both a normal distribution and a nonparametric approach for the random effects. A special case of the hurdle model can be used to test for zero inflation. Random effects can also be introduced in a zero-inflated Poisson or negative binomial model, but such a model may encounter fitting problems if there is zero deflation at any s...
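
The two-part structure described here can be written down explicitly. As a sketch in our own notation (not the authors'), with a zero probability π_ij and a zero-truncated count distribution f for subject i at occasion j:

```latex
% Hurdle model: the zeros and the positive counts are handled separately.
P(Y_{ij} = 0) = \pi_{ij}, \qquad
P(Y_{ij} = y) = (1 - \pi_{ij}) \, \frac{f(y; \mu_{ij})}{1 - f(0; \mu_{ij})},
\quad y = 1, 2, \ldots
```

Random effects enter both parts, for example logit(π_ij) = x_ij'α + u_i and log(μ_ij) = x_ij'β + v_i, with the distribution of (u_i, v_i) taken as normal or left nonparametric, matching the two fitting approaches mentioned in the abstract.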

330 citations


Journal ArticleDOI
TL;DR: A framework for the statistical analysis of counts from infectious disease surveillance databases is proposed, along with a multivariate formulation that is well suited to capture space-time dependence caused by the spatial spread of a disease over time.
Abstract: A framework for the statistical analysis of counts from infectious disease surveillance databases is proposed. In its simplest form, the model can be seen as a Poisson branching process model with immigration. Extensions to include seasonal effects, time trends and overdispersion are outlined. The model is shown to provide an adequate fit and reliable one-step-ahead prediction intervals for a typical infectious disease time series. In addition, a multivariate formulation is proposed, which is well suited to capture space-time dependence caused by the spatial spread of a disease over time. An analysis of two multivariate time series is described. All analyses have been done using general optimization routines, where ML estimates and corresponding standard errors are readily available.
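
The simplest form of the model lends itself to a compact simulation. A minimal sketch, assuming the count at time t is Poisson with mean equal to an immigration rate ν plus λ times the previous count (parameter values below are illustrative):

```python
import numpy as np

def simulate_counts(nu=2.0, lam=0.6, T=200, seed=1):
    """Simulate y_t | y_{t-1} ~ Poisson(nu + lam * y_{t-1}):
    a Poisson branching process with immigration rate nu."""
    rng = np.random.default_rng(seed)
    y = np.zeros(T, dtype=int)
    for t in range(1, T):
        y[t] = rng.poisson(nu + lam * y[t - 1])
    return y

print(simulate_counts()[:20])
```

For λ < 1 the process is stable; seasonal effects and time trends can be accommodated by letting ν vary with t.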

228 citations


Journal ArticleDOI
TL;DR: In this paper, the problem of analysing repeated data in the model-based cluster analysis context is considered, and maximum likelihood estimation of the resulting family of models through the EM algorithm is presented.
Abstract: Data variability can be important in microarray data analysis. Thus, when clustering gene expression profiles, it could be judicious to make use of repeated data. In this paper, the problem of anal...
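
Although the abstract is cut off, the setting it describes (EM-fitted mixture models for replicated expression profiles) has a generic likelihood of the following form, assuming Gaussian components and R repeated profiles per gene; the notation is illustrative, not taken from the paper:

```latex
% Model-based clustering with repeated measurements:
% K components with weights pi_k, replicates treated as conditionally
% independent given the component.
L(\theta) = \prod_{g=1}^{n} \sum_{k=1}^{K} \pi_k
            \prod_{r=1}^{R} \phi\!\left(y_{gr};\, \mu_k,\, \Sigma_k\right)
```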

116 citations


Journal ArticleDOI
TL;DR: An inferential strategy based on the pairwise likelihood, which requires only the computation of bivariate distributions, has the potential to handle large data sets and to improve on standard inferential procedures by means of bootstrap methods.
Abstract: Inference in generalized linear models with crossed effects is often made cumbersome by the high-dimensional intractable integrals involved in the likelihood function. We propose an inferential strategy based on the pairwise likelihood, which only requires the computation of bivariate distributions. The benefits of our approach are the simplicity of implementation and the potential to handle large data sets. The estimators based on the pairwise likelihood are generally consistent and asymptotically normally distributed. The pairwise likelihood makes it possible to improve on standard inferential procedures by means of bootstrap methods. The performance of the proposed methodology is illustrated by simulations and application to the well-known salamander mating data set.
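
The key object is the pairwise log likelihood, which replaces the full joint density by a sum over bivariate margins:

```latex
% Pairwise log likelihood: each term involves only a bivariate density,
% so only low-dimensional integrals over the crossed random effects arise.
p\ell(\theta) = \sum_{i < j} \log f\!\left(y_i, y_j;\, \theta\right)
```

The high-dimensional integral of the full likelihood never has to be evaluated, which is exactly what makes crossed-effects models tractable under this strategy.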

52 citations


Journal ArticleDOI
TL;DR: This work proposes an alternative model which is quite tractable computationally - even with large datasets or indirectly observed data - while still maintaining the flexibility and adaptiveness of traditional GP models.
Abstract: Gaussian processes (GP) have proven to be useful and versatile stochastic models in a wide variety of applications including computer experiments, environmental monitoring, hydrology and climate modeling. A GP model is determined by its mean and covariance functions. In most cases, the mean is specified to be a constant, or some other simple linear function, whereas the covariance function is governed by a few parameters. A Bayesian formulation is attractive as it allows for formal incorporation of uncertainty regarding the parameters governing the GP. However, estimation of these parameters can be problematic. Large datasets, posterior correlation and inverse problems can all lead to difficulties in exploring the posterior distribution. Here, we propose an alternative model which is quite tractable computationally - even with large datasets or indirectly observed data - while still maintaining the flexibility and adaptiveness of traditional GP models. This model is based on convolving simple Markov rando...
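
The truncated sentence points at a process-convolution construction. In generic form (our notation, not necessarily the paper's), a latent Markov random field x on a coarse grid of sites ω_1, …, ω_m is smoothed by a kernel k to give a continuously indexed process:

```latex
% Process convolution: the sparse precision matrix of the Markov random
% field x is what keeps computation tractable for large datasets.
z(s) = \sum_{j=1}^{m} k\!\left(s - \omega_j\right) x_j
```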

46 citations


Journal ArticleDOI
TL;DR: In this paper, a simple model for repeated observations of an ordered categorical response variable which is isotonic over time is introduced, where the measurements represent an irreversible process such that the response at time t is never lower than the response observed at the previous time point t-1.
Abstract: The paper introduces a simple model for repeated observations of an ordered categorical response variable which is isotonic over time. It is assumed that the measurements represent an irreversible process such that the response at time t is never lower than the response observed at the previous time point t-1. Observations of this type occur for example in treatment studies when improvement is measured on an ordinal scale. Since the response at time t depends on the previous outcome, the number of ordered response categories depends on the previous outcome leading to severe problems when simple threshold models for ordered data are used. In order to avoid these problems the isotonic sequential model is introduced. It accounts for the irreversible process by considering the binary transitions to higher scores and allows a parsimonious parameterization. It is shown how the model may easily be estimated by using existing software. Moreover, the model is extended to a random effects version which explicitly takes heterogeneity of individuals and potential correlations into account.
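
A continuation-ratio-style sketch of the binary transitions the model is built from, in our own notation (response categories 1, …, k, covariates x_it): given the previous outcome s, only moves to categories r > s are possible, and each step up is a binary event,

```latex
% Isotonic sequential model (sketch): binary transitions to higher scores,
% with category-specific thresholds theta_r and a response function F.
P\!\left(Y_{it} \ge r \mid Y_{it} \ge r - 1,\ Y_{i,t-1} = s\right)
  = F\!\left(\theta_r + x_{it}^{\top}\beta\right), \qquad r = s + 1, \ldots, k
```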

36 citations


Journal ArticleDOI
TL;DR: Two model averaging methods are discussed that yield similar results and are especially useful when there is a large number of potential prognostic factors, some of which most likely have no influence in a multivariable context.
Abstract: Predictions of disease outcome in prognostic factor models are usually based on one selected model. However, often several models fit the data equally well, but these models might differ substantially in terms of included explanatory variables and might lead to different predictions for individual patients. For survival data, we discuss two approaches to account for model selection uncertainty in two data examples, with the main emphasis on variable selection in a proportional hazard Cox model. The main aim of our investigation is to establish the ways in which either of the two approaches is useful in such prognostic models. The first approach is Bayesian model averaging (BMA) adapted for the proportional hazard model, termed ‘approx. BMA’ here. As a new approach, we propose a method which averages over a set of possible models using weights estimated from bootstrap resampling as proposed by Buckland et al., but in addition, we perform an initial screening of variables based on the inclusion frequency of...
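
Schematically, the second approach produces a weighted average of model-specific predictions; a sketch of the form (the paper additionally screens variables by their bootstrap inclusion frequency before averaging):

```latex
% Model-averaged survival prediction over K candidate Cox models,
% with weights w_k estimated from bootstrap resampling.
\hat{S}(t \mid x) = \sum_{k=1}^{K} w_k \, \hat{S}_k(t \mid x),
\qquad \sum_{k=1}^{K} w_k = 1
```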

36 citations


Journal ArticleDOI
TL;DR: In standard multivariate statistical analysis, common hypotheses of interest concern changes in mean vectors and subvectors; as mentioned in this paper, it is now well established that compositional data analysis is a special case of multivariate analysis.
Abstract: In standard multivariate statistical analysis, common hypotheses of interest concern changes in mean vectors and subvectors. In compositional data analysis it is now well established that compositi...

34 citations


Journal ArticleDOI
TL;DR: The main contribution of HMC modelling is that it highlights the existence of homogeneous periods in the debugging process, which allow one to identify major corrections or version updates.
Abstract: The purpose of this paper is to use the framework of hidden Markov chains (HMCs) for the modelling of the failure and debugging process of software, and the prediction of software reliability. The ...
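
A minimal sketch of the likelihood computation such a model requires, assuming Poisson-distributed failure counts within each hidden regime; the states, rates and transition probabilities below are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.stats import poisson

def forward_loglik(counts, pi0, A, rates):
    """Log-likelihood of failure counts under a hidden Markov chain
    with Poisson emissions, via the scaled forward algorithm."""
    alpha = pi0 * poisson.pmf(counts[0], rates)
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for y in counts[1:]:
        alpha = (alpha @ A) * poisson.pmf(y, rates)  # predict, then update
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

# Two hypothetical regimes: 'buggy' (rate 5.0) and 'stable' (rate 0.5).
counts = np.array([7, 5, 6, 1, 0, 2, 0, 1])
A = np.array([[0.90, 0.10], [0.05, 0.95]])
print(forward_loglik(counts, np.array([0.5, 0.5]), A, np.array([5.0, 0.5])))
```

Decoding the most likely state sequence (e.g., by the Viterbi algorithm) is what reveals the homogeneous debugging periods mentioned in the TL;DR.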

33 citations


Journal ArticleDOI
TL;DR: Finite mixtures of generalized linear mixed effect models are presented to handle situations where within-cluster correlation and heterogeneity (subpopulations) exist simultaneously and maximum likelihood (ML) is considered as the main approach to estimation.
Abstract: Finite mixtures of generalized linear mixed effect models are presented to handle situations where within-cluster correlation and heterogeneity (subpopulations) exist simultaneously. For this class...
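
Although the abstract is cut off, the marginal likelihood contribution of a cluster i under such a model can be sketched generically (our notation): a K-component mixture in which, within each component, cluster-level random effects are integrated out,

```latex
% Finite mixture of GLMMs: component weights pi_k, component-specific
% fixed effects beta_k, random effects b_i integrated out per component.
f(y_i) = \sum_{k=1}^{K} \pi_k \int \prod_{j=1}^{n_i}
         f_k\!\left(y_{ij} \mid b_i, \beta_k\right)
         \phi\!\left(b_i;\, 0, \Sigma_k\right) db_i
```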

27 citations


Journal ArticleDOI
TL;DR: In this paper, a latent variable model with continuous latent variables for manifest variables that are a mixture of categorical and survival outcomes is discussed, allowing for covariate effects both on the manifest variables (direct effects) and on the latent variable(s) (indirect effects).
Abstract: In this article, we discuss a latent variable model with continuous latent variables for manifest variables that are a mixture of categorical and survival outcomes. Models for censored and uncensored survival data are discussed. The model allows for covariate effects both on the manifest variables (direct effects) and on the latent variable(s) (indirect effects). The methodological developments are motivated by a demographic application: an exploration of women’s fertility preferences and family planning behaviour in Bangladesh.
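
The direct/indirect distinction can be sketched in a MIMIC-style form; the notation here is ours, not the paper's. Covariates x_i shift the latent variable η_i (indirect effects) and also enter each manifest-variable model through its link function g_j (direct effects):

```latex
% Indirect effects: covariates act on the latent variable.
\eta_i = \gamma^{\top} x_i + \delta_i
% Direct effects: covariates also enter each manifest-variable model.
g_j\!\left(E\!\left[y_{ij} \mid \eta_i, x_i\right]\right)
  = \alpha_j + \lambda_j \eta_i + \beta_j^{\top} x_i
```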

Journal ArticleDOI
TL;DR: In this article, the authors introduce an approach where direct dependence between registrations is modelled while leaving the continuous covariates on their measurement scale; not accounting for such dependence results in biased estimation of both the population size and its standard error.
Abstract: In the presence of continuous covariates, standard capture-recapture methods assume either that the registrations operate independently at the individual level or that the covariates can be stratified and log-linear models fitted, permitting the modelling of dependence between data sources. This article introduces an approach where direct dependence between registrations is modelled leaving the continuous covariates in their measurement scale. Simulations show that not accounting for possible dependence between registrations results in biased estimation of both the population size and standard error. The proposed method is applied to Dutch neural tube defect registration data.

Journal ArticleDOI
TL;DR: In this article, a model that separates the observed variables for a customer into primary characteristics on the one hand, and indicators of previous behaviour on the other, and links the two via a latent variable that we identify as "customer quality" is presented.
Abstract: The retail banking sector makes heavy use of statistical models to predict various aspects of customer behaviour. These models are built using data from earlier customers, but have several weaknesses. An alternative approach, widely used in social measurement, but apparently not yet applied in the retail banking sector, is to use latent-variable techniques to measure the underlying key aspect of customer behaviour. This paper describes such a model that separates the observed variables for a customer into primary characteristics on the one hand, and indicators of previous behaviour on the other, and links the two via a latent variable that we identify as ‘customer quality’. We describe how to estimate the conditional distribution of customer quality, given the observed values of primary characteristics and past behaviour.

Journal ArticleDOI
TL;DR: In this paper, the authors exploit exploratory analysis of city-specific time series by fitting complete dynamic regression models and highlight the common features across cities through this analysis, which might then be used to design the meta-analyses.
Abstract: In epidemiology, time-series regression models are specially suitable for evaluating short-term effects of time-varying exposures to pollution. To summarize findings from different studies on different cities, the techniques of designed meta-analyses have been employed. In this context, city-specific findings are summarized by an ‘effect size’ measured on a common scale. Such effects are then pooled together on a second hierarchy of analysis. The objective of this article is to exploit exploratory analysis of city-specific time series. In fact, when dealing with many sources of data, that is, many cities, an exploratory analysis becomes almost unaffordable. Our idea is to explore the time series by fitting complete dynamic regression models. These models are easier to fit than models usually employed and allow implementation of very fast automated model selection algorithms. The idea is to highlight the common features across cities through this analysis, which might then be used to design the meta-analys...
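
The second hierarchy of analysis referred to here is typically standard random-effects pooling of the city-specific effect sizes; a sketch of the usual estimator (not necessarily the authors' exact specification), with city estimates β̂_i, within-city variances v_i and between-city heterogeneity τ²:

```latex
% Inverse-variance random-effects pooling across cities.
\hat{\beta} = \frac{\sum_i w_i \hat{\beta}_i}{\sum_i w_i},
\qquad w_i = \frac{1}{v_i + \hat{\tau}^2}
```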

Journal ArticleDOI
TL;DR: In this article, a nonparametric regression model was used to estimate the carbon isotope levels for the Pacific, Southern and North Atlantic Oceans over the last 23 million years and to provide confidence bands.
Abstract: Oceanographers study past ocean circulations and their effect on global climate through carbon isotope records obtained from microfossils deposited on the ocean floor. An initial goal is to estimate the carbon isotope levels for the Pacific, Southern and North Atlantic Oceans over the last 23 million years and to provide confidence bands. We consider a nonparametric regression model and demonstrate how several recent developments in methodology make local linear kernel regression an attractive approach for tackling the problem. The results are used to estimate a quantity called the proportion of Northern Component Water and its effect on global climate. Several interesting and important geophysical and oceanographic conclusions are suggested by the study.
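
For reference, the local linear kernel estimator at a point x solves a kernel-weighted least squares problem with kernel K and bandwidth h:

```latex
% Local linear kernel regression: fit a line locally around x,
% weighting observations by kernel-scaled distance; the intercept
% of the local fit is the regression estimate at x.
(\hat{a}, \hat{b}) = \arg\min_{a, b} \sum_{i=1}^{n}
  K\!\left(\frac{x_i - x}{h}\right)
  \left\{ y_i - a - b \,(x_i - x) \right\}^2,
\qquad \hat{m}(x) = \hat{a}
```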

Journal ArticleDOI
TL;DR: In this article, a new two-parameter family of distributions, referred to as the logistic-sinh distribution, is presented to model highly negatively skewed data with extreme observations; it is derived from the logistic distribution by appropriately replacing an exponential term with a hyperbolic sine term.
Abstract: A new two-parameter family of distributions is presented. It is derived to model highly negatively skewed data with extreme observations. The new family is referred to as the logistic-sinh distribution, as it is derived from the logistic distribution by appropriately replacing an exponential term with a hyperbolic sine term. The resulting family provides not only negatively skewed densities with thick tails but also a variety of monotonic density shapes. The shape parameter space, λ > 0, is divided by the boundary line λ = 1 into two regions over which the hazard function is, respectively, increasing and bathtub shaped. Maximum likelihood parameter estimation techniques are discussed, and approximate coverage probabilities for uncensored samples are provided. The advantages of using the new family are demonstrated and compared by illustrating well-known examples.
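
Reading the construction off the abstract: the logistic survival function can be written S(y) = [1 + λ exp(y/σ)]^{-1}, and replacing the exponential term with a hyperbolic sine suggests the form below. This is our reconstruction of the family, not necessarily the paper's exact parameterization:

```latex
% Logistic-sinh survival function (reconstructed sketch):
% exp(y/sigma) in the logistic survival function replaced by sinh(y/sigma).
S(y) = \left[\, 1 + \lambda \sinh\!\left(\frac{y}{\sigma}\right) \right]^{-1},
\qquad y > 0, \ \lambda > 0, \ \sigma > 0
```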

Journal ArticleDOI
TL;DR: In this article, the number of recurrent adenomatous polyps is treated as a latent variable, and a mixture distribution is then used to model the number of observed recurrent adenomatous polyps.
Abstract: In this paper, we treat the number of recurrent adenomatous polyps as a latent variable and then use a mixture distribution to model the number of observed recurrent adenomatous polyps. This approach is equivalent to zero-inflated Poisson regression, which is a method used to analyse count data with excess zeros. In a zero-inflated Poisson model, a count response variable is assumed to be distributed as a mixture of a Poisson distribution and a distribution with point mass of one at zero. In many cancer studies, patients often have variable follow-up. When the disease of interest is subject to late onset, ignoring the length of follow-up will underestimate the recurrence rate. In this paper, we modify zero-inflated Poisson regression through a weight function to incorporate the length of follow-up into analysis. We motivate, develop, and illustrate the methods described here with an example from a colon cancer study.
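
The zero-inflated Poisson mixture referred to here has the standard density below; the paper's modification enters by weighting the analysis according to each patient's follow-up length:

```latex
% Zero-inflated Poisson: point mass at zero with probability pi,
% mixed with a Poisson(lambda) count distribution.
P(Y = 0) = \pi + (1 - \pi) \, e^{-\lambda}, \qquad
P(Y = y) = (1 - \pi) \, \frac{e^{-\lambda} \lambda^{y}}{y!},
\quad y = 1, 2, \ldots
```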

Journal ArticleDOI
TL;DR: This paper shows that further augmenting the missing data used by the M-step leads to a quite attractive and general slice sampler for implementing the Monte Carlo E-step, and applies this scheme to the standard EM algorithm as well as to an alternative EM algorithm which treats the variance-standardized random effects, rather than the random effects themselves, as missing data.
Abstract: The celebrated simplicity of the EM algorithm is somewhat lost in its common use for generalized linear mixed models (GLMMs) because of its analytically intractable E-step. A natural and typical st...
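
For readers unfamiliar with the key ingredient, here is a generic univariate slice sampler with stepping-out, of the kind used to draw random effects in a Monte Carlo E-step. This is a minimal sketch of the general technique, not the paper's specific data-augmented sampler:

```python
import numpy as np

def slice_sample(logpdf, x0, n_draws=1000, w=1.0, seed=0):
    """Univariate slice sampler with stepping-out and shrinkage."""
    rng = np.random.default_rng(seed)
    x, draws = x0, []
    for _ in range(n_draws):
        logy = logpdf(x) + np.log(rng.uniform())  # slice level under the curve
        left = x - w * rng.uniform()              # random initial bracket
        right = left + w
        while logpdf(left) > logy:                # step out until outside slice
            left -= w
        while logpdf(right) > logy:
            right += w
        while True:                               # sample within, shrinking
            x_new = rng.uniform(left, right)
            if logpdf(x_new) > logy:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        draws.append(x)
    return np.array(draws)

# Example: sample from a standard normal target (log-density up to a constant).
draws = slice_sample(lambda x: -0.5 * x ** 2, x0=0.0)
print(draws.mean(), draws.std())
```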

Journal ArticleDOI
TL;DR: This paper proposes graphical models as a natural tool to analyse the multifactorial structure of complex genetic diseases, and an application of the model to a primary hypertension data set is illustrated.
Abstract: A crucial task in modern genetic medicine is the understanding of complex genetic diseases. The main complicating features are that a combination of genetic and environmental risk factors is involved, and that the phenotype of interest may be complex. Traditional statistical techniques based on lod-scores fail when the disease is no longer monogenic and the underlying disease transmission model is not defined. Different kinds of association tests have proved to be an appropriate and powerful statistical tool to detect a candidate gene for a complex disorder. However, statistical techniques able to investigate direct and indirect influences among phenotypes, genotypes and environmental risk factors are required to analyse the association structure of complex diseases. In this paper we propose graphical models as a natural tool to analyse the multifactorial structure of complex genetic diseases. An application of this model to a primary hypertension data set is illustrated.

Journal ArticleDOI
TL;DR: The frailty model provides an informative summary of the data from this neonatal study; it is indicated how the findings may be presented as a scorecard for predicting frailty, and so be useful to doctors working in hospital neonatal units.
Abstract: A latent variable frailty model is built for data coming from a neonatal study conducted to investigate whether the presence of a particular hospital service given to families with premature babies has a positive effect on their care requirements within the first year of life. The predicted value of the latent frailty term from information obtained from the family in advance of the birth furnishes an overall measure of the quality of health of the baby. This identifies families at risk. Maximum likelihood and Bayesian approaches are used to estimate the effect of the variables on the value of the latent baby frailty and for prediction of health complications. It is found that these give much the same estimates of regression coefficients, but that the variance components are the more difficult to estimate. We indicate how the findings from the model may be presented as a scorecard for predicting frailty, and so be useful to doctors working in hospital neonatal units. New information about a baby is automatically combined with the current score to provide an up-to-date score, so that rapid decisions to take appropriate action become more feasible. A diagnostic procedure is proposed to assess how well the independence assumptions of the model are met in fitting to these data. It is concluded that the frailty model provides an informative summary of the data from this neonatal study.