
Showing papers in "Statistical Modelling in 2005"


Journal ArticleDOI
TL;DR: In this article, random effect models for repeated measurements of zero-inflated count responses are discussed. In addition to the problem of extra zeros, the correlation between measurements upon the same subject at different occasions needs to be taken into account.
Abstract: For count responses, the situation of excess zeros (relative to what standard models allow) often occurs in biomedical and sociological applications. Modeling repeated measures of zero-inflated count data presents special challenges. This is because in addition to the problem of extra zeros, the correlation between measurements upon the same subject at different occasions needs to be taken into account. This article discusses random effect models for repeated measurements on this type of response variable. A useful model is the hurdle model with random effects, which separately handles the zero observations and the positive counts. In maximum likelihood model fitting, we consider both a normal distribution and a nonparametric approach for the random effects. A special case of the hurdle model can be used to test for zero inflation. Random effects can also be introduced in a zero-inflated Poisson or negative binomial model, but such a model may encounter fitting problems if there is zero deflation at any s...
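
The two-part structure described here can be written down explicitly. As a sketch in our own notation (not the authors'), with a zero probability π_ij and a zero-truncated count distribution f for subject i at occasion j:

```latex
% Hurdle model: the zeros and the positive counts are handled separately.
P(Y_{ij} = 0) = \pi_{ij}, \qquad
P(Y_{ij} = y) = (1 - \pi_{ij}) \, \frac{f(y; \mu_{ij})}{1 - f(0; \mu_{ij})},
\quad y = 1, 2, \ldots
```

Random effects enter both parts, for example logit(π_ij) = x_ij'α + u_i and log(μ_ij) = x_ij'β + v_i, with the distribution of (u_i, v_i) taken as normal or left nonparametric, matching the two fitting approaches mentioned in the abstract.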

330 citations


Journal ArticleDOI
TL;DR: A framework for the statistical analysis of counts from infectious disease surveillance databases is proposed, along with a multivariate formulation that is well suited to capture space-time dependence caused by the spatial spread of a disease over time.
Abstract: A framework for the statistical analysis of counts from infectious disease surveillance databases is proposed. In its simplest form, the model can be seen as a Poisson branching process model with immigration. Extensions to include seasonal effects, time trends and overdispersion are outlined. The model is shown to provide an adequate fit and reliable one-step-ahead prediction intervals for a typical infectious disease time series. In addition, a multivariate formulation is proposed, which is well suited to capture space-time dependence caused by the spatial spread of a disease over time. An analysis of two multivariate time series is described. All analyses have been done using general optimization routines, where ML estimates and corresponding standard errors are readily available.
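
The simplest form of the model lends itself to a compact simulation. A minimal sketch, assuming the count at time t is Poisson with mean equal to an immigration rate ν plus λ times the previous count (parameter values below are illustrative):

```python
import numpy as np

def simulate_counts(nu=2.0, lam=0.6, T=200, seed=1):
    """Simulate y_t | y_{t-1} ~ Poisson(nu + lam * y_{t-1}):
    a Poisson branching process with immigration rate nu."""
    rng = np.random.default_rng(seed)
    y = np.zeros(T, dtype=int)
    for t in range(1, T):
        y[t] = rng.poisson(nu + lam * y[t - 1])
    return y

print(simulate_counts()[:20])
```

For λ < 1 the process is stable; seasonal effects and time trends can be accommodated by letting ν vary with t.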

228 citations


Journal ArticleDOI
TL;DR: In this paper, the problem of analysing repeated data in the model-based cluster analysis context is considered, and maximum likelihood estimation of the resulting family of models through the EM algorithm is presented.
Abstract: Data variability can be important in microarray data analysis. Thus, when clustering gene expression profiles, it could be judicious to make use of repeated data. In this paper, the problem of anal...
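
Although the abstract is cut off, the setting it describes (EM-fitted mixture models for replicated expression profiles) has a generic likelihood of the following form, assuming Gaussian components and R repeated profiles per gene; the notation is illustrative, not taken from the paper:

```latex
% Model-based clustering with repeated measurements:
% K components with weights pi_k, replicates treated as conditionally
% independent given the component.
L(\theta) = \prod_{g=1}^{n} \sum_{k=1}^{K} \pi_k
            \prod_{r=1}^{R} \phi\!\left(y_{gr};\, \mu_k,\, \Sigma_k\right)
```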

116 citations


Journal ArticleDOI
TL;DR: An inferential strategy based on the pairwise likelihood, which requires only the computation of bivariate distributions, has the potential to handle large data sets and to improve on standard inferential procedures by means of bootstrap methods.
Abstract: Inference in generalized linear models with crossed effects is often made cumbersome by the high-dimensional intractable integrals involved in the likelihood function. We propose an inferential strategy based on the pairwise likelihood, which only requires the computation of bivariate distributions. The benefits of our approach are the simplicity of implementation and the potential to handle large data sets. The estimators based on the pairwise likelihood are generally consistent and asymptotically normally distributed. The pairwise likelihood makes it possible to improve on standard inferential procedures by means of bootstrap methods. The performance of the proposed methodology is illustrated by simulations and application to the well-known salamander mating data set.
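
The key object is the pairwise log likelihood, which replaces the full joint density by a sum over bivariate margins:

```latex
% Pairwise log likelihood: each term involves only a bivariate density,
% so only low-dimensional integrals over the crossed random effects arise.
p\ell(\theta) = \sum_{i < j} \log f\!\left(y_i, y_j;\, \theta\right)
```

The high-dimensional integral of the full likelihood never has to be evaluated, which is exactly what makes crossed-effects models tractable under this strategy.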

52 citations


Journal ArticleDOI
TL;DR: This work proposes an alternative model which is quite tractable computationally - even with large datasets or indirectly observed data - while still maintaining the flexibility and adaptiveness of traditional GP models.
Abstract: Gaussian processes (GP) have proven to be useful and versatile stochastic models in a wide variety of applications including computer experiments, environmental monitoring, hydrology and climate modeling. A GP model is determined by its mean and covariance functions. In most cases, the mean is specified to be a constant, or some other simple linear function, whereas the covariance function is governed by a few parameters. A Bayesian formulation is attractive as it allows for formal incorporation of uncertainty regarding the parameters governing the GP. However, estimation of these parameters can be problematic. Large datasets, posterior correlation and inverse problems can all lead to difficulties in exploring the posterior distribution. Here, we propose an alternative model which is quite tractable computationally - even with large datasets or indirectly observed data - while still maintaining the flexibility and adaptiveness of traditional GP models. This model is based on convolving simple Markov rando...
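
The truncated sentence points at a process-convolution construction. In generic form (our notation, not necessarily the paper's), a latent Markov random field x on a coarse grid of sites ω_1, …, ω_m is smoothed by a kernel k to give a continuously indexed process:

```latex
% Process convolution: the sparse precision matrix of the Markov random
% field x is what keeps computation tractable for large datasets.
z(s) = \sum_{j=1}^{m} k\!\left(s - \omega_j\right) x_j
```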

46 citations


Journal ArticleDOI
TL;DR: In this paper, a simple model for repeated observations of an ordered categorical response variable which is isotonic over time is introduced, where the measurements represent an irreversible process such that the response at time t is never lower than the response observed at the previous time point t-1.
Abstract: The paper introduces a simple model for repeated observations of an ordered categorical response variable which is isotonic over time. It is assumed that the measurements represent an irreversible process such that the response at time t is never lower than the response observed at the previous time point t-1. Observations of this type occur for example in treatment studies when improvement is measured on an ordinal scale. Since the response at time t depends on the previous outcome, the number of ordered response categories depends on the previous outcome leading to severe problems when simple threshold models for ordered data are used. In order to avoid these problems the isotonic sequential model is introduced. It accounts for the irreversible process by considering the binary transitions to higher scores and allows a parsimonious parameterization. It is shown how the model may easily be estimated by using existing software. Moreover, the model is extended to a random effects version which explicitly takes heterogeneity of individuals and potential correlations into account.
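
A continuation-ratio-style sketch of the binary transitions the model is built from, in our own notation (response categories 1, …, k, covariates x_it): given the previous outcome s, only moves to categories r > s are possible, and each step up is a binary event,

```latex
% Isotonic sequential model (sketch): binary transitions to higher scores,
% with category-specific thresholds theta_r and a response function F.
P\!\left(Y_{it} \ge r \mid Y_{it} \ge r - 1,\ Y_{i,t-1} = s\right)
  = F\!\left(\theta_r + x_{it}^{\top}\beta\right), \qquad r = s + 1, \ldots, k
```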

36 citations


Journal ArticleDOI
TL;DR: Two model averaging methods are discussed that yield similar results and are especially useful when there is a large number of potential prognostic factors, some of which most likely have no influence in a multivariable context.
Abstract: Predictions of disease outcome in prognostic factor models are usually based on one selected model. However, often several models fit the data equally well, but these models might differ substantially in terms of included explanatory variables and might lead to different predictions for individual patients. For survival data, we discuss two approaches to account for model selection uncertainty in two data examples, with the main emphasis on variable selection in a proportional hazard Cox model. The main aim of our investigation is to establish the ways in which either of the two approaches is useful in such prognostic models. The first approach is Bayesian model averaging (BMA) adapted for the proportional hazard model, termed ‘approx. BMA’ here. As a new approach, we propose a method which averages over a set of possible models using weights estimated from bootstrap resampling as proposed by Buckland et al., but in addition, we perform an initial screening of variables based on the inclusion frequency of...
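
Schematically, the second approach produces a weighted average of model-specific predictions; a sketch of the form (the paper additionally screens variables by their bootstrap inclusion frequency before averaging):

```latex
% Model-averaged survival prediction over K candidate Cox models,
% with weights w_k estimated from bootstrap resampling.
\hat{S}(t \mid x) = \sum_{k=1}^{K} w_k \, \hat{S}_k(t \mid x),
\qquad \sum_{k=1}^{K} w_k = 1
```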

36 citations


Journal ArticleDOI
TL;DR: In standard multivariate statistical analysis, common hypotheses of interest concern changes in mean vectors and subvectors; as mentioned in this paper, it is now well established that compositional data analysis is a special case of multivariate analysis.
Abstract: In standard multivariate statistical analysis, common hypotheses of interest concern changes in mean vectors and subvectors. In compositional data analysis it is now well established that compositi...

34 citations


Journal ArticleDOI
TL;DR: The main contribution of HMC modelling is that it highlights the existence of homogeneous periods in the debugging process, which allow one to identify major corrections or version updates.
Abstract: The purpose of this paper is to use the framework of hidden Markov chains (HMCs) for the modelling of the failure and debugging process of software, and the prediction of software reliability. The ...
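
A minimal sketch of the likelihood computation such a model requires, assuming Poisson-distributed failure counts within each hidden regime; the states, rates and transition probabilities below are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.stats import poisson

def forward_loglik(counts, pi0, A, rates):
    """Log-likelihood of failure counts under a hidden Markov chain
    with Poisson emissions, via the scaled forward algorithm."""
    alpha = pi0 * poisson.pmf(counts[0], rates)
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for y in counts[1:]:
        alpha = (alpha @ A) * poisson.pmf(y, rates)  # predict, then update
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

# Two hypothetical regimes: 'buggy' (rate 5.0) and 'stable' (rate 0.5).
counts = np.array([7, 5, 6, 1, 0, 2, 0, 1])
A = np.array([[0.90, 0.10], [0.05, 0.95]])
print(forward_loglik(counts, np.array([0.5, 0.5]), A, np.array([5.0, 0.5])))
```

Decoding the most likely state sequence (e.g., by the Viterbi algorithm) is what reveals the homogeneous debugging periods mentioned in the TL;DR.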

33 citations


Journal ArticleDOI
TL;DR: Finite mixtures of generalized linear mixed effect models are presented to handle situations where within-cluster correlation and heterogeneity (subpopulations) exist simultaneously and maximum likelihood (ML) is considered as the main approach to estimation.
Abstract: Finite mixtures of generalized linear mixed effect models are presented to handle situations where within-cluster correlation and heterogeneity (subpopulations) exist simultaneously. For this class...
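
Although the abstract is cut off, the marginal likelihood contribution of a cluster i under such a model can be sketched generically (our notation): a K-component mixture in which, within each component, cluster-level random effects are integrated out,

```latex
% Finite mixture of GLMMs: component weights pi_k, component-specific
% fixed effects beta_k, random effects b_i integrated out per component.
f(y_i) = \sum_{k=1}^{K} \pi_k \int \prod_{j=1}^{n_i}
         f_k\!\left(y_{ij} \mid b_i, \beta_k\right)
         \phi\!\left(b_i;\, 0, \Sigma_k\right) db_i
```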

27 citations


Journal ArticleDOI
TL;DR: In this paper, a latent variable model with continuous latent variables for manifest variables that are a mixture of categorical and survival outcomes is discussed, allowing for covariate effects both on the manifest variables (direct effects) and on the latent variable(s) (indirect effects).
Abstract: In this article, we discuss a latent variable model with continuous latent variables for manifest variables that are a mixture of categorical and survival outcomes. Models for censored and uncensored survival data are discussed. The model allows for covariate effects both on the manifest variables (direct effects) and on the latent variable(s) (indirect effects). The methodological developments are motivated by a demographic application: an exploration of women’s fertility preferences and family planning behaviour in Bangladesh.
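
The direct/indirect distinction can be sketched in a MIMIC-style form; the notation here is ours, not the paper's. Covariates x_i shift the latent variable η_i (indirect effects) and also enter each manifest-variable model through its link function g_j (direct effects):

```latex
% Indirect effects: covariates act on the latent variable.
\eta_i = \gamma^{\top} x_i + \delta_i
% Direct effects: covariates also enter each manifest-variable model.
g_j\!\left(E\!\left[y_{ij} \mid \eta_i, x_i\right]\right)
  = \alpha_j + \lambda_j \eta_i + \beta_j^{\top} x_i
```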

Journal ArticleDOI
TL;DR: In this article, the authors introduce an approach where direct dependence between registrations is modelled while leaving the continuous covariates on their measurement scale; not accounting for such dependence results in biased estimation of both the population size and its standard error.
Abstract: In the presence of continuous covariates, standard capture-recapture methods assume either that the registrations operate independently at the individual level or that the covariates can be stratified and log-linear models fitted, permitting the modelling of dependence between data sources. This article introduces an approach where direct dependence between registrations is modelled leaving the continuous covariates in their measurement scale. Simulations show that not accounting for possible dependence between registrations results in biased estimation of both the population size and standard error. The proposed method is applied to Dutch neural tube defect registration data.

Journal ArticleDOI
TL;DR: In this article, a model that separates the observed variables for a customer into primary characteristics on the one hand, and indicators of previous behaviour on the other, and links the two via a latent variable that we identify as "customer quality" is presented.
Abstract: The retail banking sector makes heavy use of statistical models to predict various aspects of customer behaviour. These models are built using data from earlier customers, but have several weaknesses. An alternative approach, widely used in social measurement, but apparently not yet applied in the retail banking sector, is to use latent-variable techniques to measure the underlying key aspect of customer behaviour. This paper describes such a model that separates the observed variables for a customer into primary characteristics on the one hand, and indicators of previous behaviour on the other, and links the two via a latent variable that we identify as ‘customer quality’. We describe how to estimate the conditional distribution of customer quality, given the observed values of primary characteristics and past behaviour.

Journal ArticleDOI
TL;DR: In this paper, the authors exploit exploratory analysis of city-specific time series by fitting complete dynamic regression models and highlight the common features across cities through this analysis, which might then be used to design the meta-analyses.
Abstract: In epidemiology, time-series regression models are specially suitable for evaluating short-term effects of time-varying exposures to pollution. To summarize findings from different studies on different cities, the techniques of designed meta-analyses have been employed. In this context, city-specific findings are summarized by an ‘effect size’ measured on a common scale. Such effects are then pooled together on a second hierarchy of analysis. The objective of this article is to exploit exploratory analysis of city-specific time series. In fact, when dealing with many sources of data, that is, many cities, an exploratory analysis becomes almost unaffordable. Our idea is to explore the time series by fitting complete dynamic regression models. These models are easier to fit than models usually employed and allow implementation of very fast automated model selection algorithms. The idea is to highlight the common features across cities through this analysis, which might then be used to design the meta-analys...
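
The second hierarchy of analysis referred to here is typically standard random-effects pooling of the city-specific effect sizes; a sketch of the usual estimator (not necessarily the authors' exact specification), with city estimates β̂_i, within-city variances v_i and between-city heterogeneity τ²:

```latex
% Inverse-variance random-effects pooling across cities.
\hat{\beta} = \frac{\sum_i w_i \hat{\beta}_i}{\sum_i w_i},
\qquad w_i = \frac{1}{v_i + \hat{\tau}^2}
```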

Journal ArticleDOI
TL;DR: In this article, a nonparametric regression model was used to estimate the carbon isotope levels for the Pacific, Southern and North Atlantic Oceans over the last 23 million years and to provide confidence bands.
Abstract: Oceanographers study past ocean circulations and their effect on global climate through carbon isotope records obtained from microfossils deposited on the ocean floor. An initial goal is to estimate the carbon isotope levels for the Pacific, Southern and North Atlantic Oceans over the last 23 million years and to provide confidence bands. We consider a nonparametric regression model and demonstrate how several recent developments in methodology make local linear kernel regression an attractive approach for tackling the problem. The results are used to estimate a quantity called the proportion of Northern Component Water and its effect on global climate. Several interesting and important geophysical and oceanographic conclusions are suggested by the study.
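
For reference, the local linear kernel estimator at a point x solves a kernel-weighted least squares problem with kernel K and bandwidth h:

```latex
% Local linear kernel regression: fit a line locally around x,
% weighting observations by kernel-scaled distance; the intercept
% of the local fit is the regression estimate at x.
(\hat{a}, \hat{b}) = \arg\min_{a, b} \sum_{i=1}^{n}
  K\!\left(\frac{x_i - x}{h}\right)
  \left\{ y_i - a - b \,(x_i - x) \right\}^2,
\qquad \hat{m}(x) = \hat{a}
```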

Journal ArticleDOI
TL;DR: In this article, a new two-parameter family of distributions, referred to as the logistic-sinh distribution, is presented to model highly negatively skewed data with extreme observations; it is derived from the logistic distribution by appropriately replacing an exponential term with a hyperbolic sine term.
Abstract: A new two-parameter family of distributions is presented. It is derived to model highly negatively skewed data with extreme observations. The new family is referred to as the logistic-sinh distribution, as it is derived from the logistic distribution by appropriately replacing an exponential term with a hyperbolic sine term. The resulting family provides not only negatively skewed densities with thick tails but also a variety of monotonic density shapes. The shape parameter space, λ > 0, is divided by the boundary line λ = 1 into two regions over which the hazard function is, respectively, increasing and bathtub shaped. Maximum likelihood parameter estimation techniques are discussed, and approximate coverage probabilities for uncensored samples are provided. The advantages of using the new family are demonstrated and compared by illustrating well-known examples.
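
Reading the construction off the abstract: the logistic survival function can be written S(y) = [1 + λ exp(y/σ)]^{-1}, and replacing the exponential term with a hyperbolic sine suggests the form below. This is our reconstruction of the family, not necessarily the paper's exact parameterization:

```latex
% Logistic-sinh survival function (reconstructed sketch):
% exp(y/sigma) in the logistic survival function replaced by sinh(y/sigma).
S(y) = \left[\, 1 + \lambda \sinh\!\left(\frac{y}{\sigma}\right) \right]^{-1},
\qquad y > 0, \ \lambda > 0, \ \sigma > 0
```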

Journal ArticleDOI
TL;DR: In this article, the number of recurrent adenomatous polyps is treated as a latent variable, and a mixture distribution is then used to model the number of observed recurrent adenomatous polyps.
Abstract: In this paper, we treat the number of recurrent adenomatous polyps as a latent variable and then use a mixture distribution to model the number of observed recurrent adenomatous polyps. This approach is equivalent to zero-inflated Poisson regression, which is a method used to analyse count data with excess zeros. In a zero-inflated Poisson model, a count response variable is assumed to be distributed as a mixture of a Poisson distribution and a distribution with point mass of one at zero. In many cancer studies, patients often have variable follow-up. When the disease of interest is subject to late onset, ignoring the length of follow-up will underestimate the recurrence rate. In this paper, we modify zero-inflated Poisson regression through a weight function to incorporate the length of follow-up into analysis. We motivate, develop, and illustrate the methods described here with an example from a colon cancer study.
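
The zero-inflated Poisson mixture referred to here has the standard density below; the paper's modification enters by weighting the analysis according to each patient's follow-up length:

```latex
% Zero-inflated Poisson: point mass at zero with probability pi,
% mixed with a Poisson(lambda) count distribution.
P(Y = 0) = \pi + (1 - \pi) \, e^{-\lambda}, \qquad
P(Y = y) = (1 - \pi) \, \frac{e^{-\lambda} \lambda^{y}}{y!},
\quad y = 1, 2, \ldots
```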

Journal ArticleDOI
TL;DR: This paper shows that further augmenting the missing data used by the M-step leads to a quite attractive and general slice sampler for implementing the Monte Carlo E-step, and applies this scheme to the standard EM algorithm as well as to an alternative EM algorithm which treats the variance-standardized random effects, rather than the random effects themselves, as missing data.
Abstract: The celebrated simplicity of the EM algorithm is somewhat lost in its common use for generalized linear mixed models (GLMMs) because of its analytically intractable E-step. A natural and typical st...
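
For readers unfamiliar with the key ingredient, here is a generic univariate slice sampler with stepping-out, of the kind used to draw random effects in a Monte Carlo E-step. This is a minimal sketch of the general technique, not the paper's specific data-augmented sampler:

```python
import numpy as np

def slice_sample(logpdf, x0, n_draws=1000, w=1.0, seed=0):
    """Univariate slice sampler with stepping-out and shrinkage."""
    rng = np.random.default_rng(seed)
    x, draws = x0, []
    for _ in range(n_draws):
        logy = logpdf(x) + np.log(rng.uniform())  # slice level under the curve
        left = x - w * rng.uniform()              # random initial bracket
        right = left + w
        while logpdf(left) > logy:                # step out until outside slice
            left -= w
        while logpdf(right) > logy:
            right += w
        while True:                               # sample within, shrinking
            x_new = rng.uniform(left, right)
            if logpdf(x_new) > logy:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        draws.append(x)
    return np.array(draws)

# Example: sample from a standard normal target (log-density up to a constant).
draws = slice_sample(lambda x: -0.5 * x ** 2, x0=0.0)
print(draws.mean(), draws.std())
```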

Journal ArticleDOI
TL;DR: This paper proposes graphical models as a natural tool to analyse the multifactorial structure of complex genetic diseases, and an application of the model to a primary hypertension data set is illustrated.
Abstract: A crucial task in modern genetic medicine is the understanding of complex genetic diseases. The main complicating features are that a combination of genetic and environmental risk factors is involved, and that the phenotype of interest may be complex. Traditional statistical techniques based on lod-scores fail when the disease is no longer monogenic and the underlying disease transmission model is not defined. Different kinds of association tests have proved to be an appropriate and powerful statistical tool to detect a candidate gene for a complex disorder. However, statistical techniques able to investigate direct and indirect influences among phenotypes, genotypes and environmental risk factors are required to analyse the association structure of complex diseases. In this paper we propose graphical models as a natural tool to analyse the multifactorial structure of complex genetic diseases. An application of this model to a primary hypertension data set is illustrated.

Journal ArticleDOI
TL;DR: The frailty model provides an informative summary of the data from this neonatal study; it is indicated how the findings may be presented as a scorecard for predicting frailty, and so be useful to doctors working in hospital neonatal units.
Abstract: A latent variable frailty model is built for data coming from a neonatal study conducted to investigate whether the presence of a particular hospital service given to families with premature babies has a positive effect on their care requirements within the first year of life. The predicted value of the latent frailty term from information obtained from the family in advance of the birth furnishes an overall measure of the quality of health of the baby. This identifies families at risk. Maximum likelihood and Bayesian approaches are used to estimate the effect of the variables on the value of the latent baby frailty and for prediction of health complications. It is found that these give much the same estimates of regression coefficients, but that the variance components are the more difficult to estimate. We indicate how the findings from the model may be presented as a scorecard for predicting frailty, and so be useful to doctors working in hospital neonatal units. New information about a baby is automatically combined with the current score to provide an up-to-date score, so that rapid decisions to take appropriate action become more feasible. A diagnostic procedure is proposed to assess how well the independence assumptions of the model are met in fitting to these data. It is concluded that the frailty model provides an informative summary of the data from this neonatal study.