scispace - formally typeset
Open AccessJournal ArticleDOI

A marginalized zero-inflated Poisson regression model with overall exposure effects.

TLDR
A marginalized ZIP model approach for independent responses to model the population mean count directly is developed, allowing straightforward inference for overall exposure effects and empirical robust variance estimation for overall log-incidence density ratios.
Abstract
The zero-inflated Poisson (ZIP) regression model is often employed in public health research to examine the relationships between exposures of interest and a count outcome exhibiting many zeros, in excess of the amount expected under sampling from a Poisson distribution. The regression coefficients of the ZIP model have latent class interpretations, which correspond to a susceptible subpopulation at risk for the condition with counts generated from a Poisson distribution and a non-susceptible subpopulation that provides the extra or excess zeros. The ZIP model parameters, however, are not well suited for inference targeted at marginal means, specifically, in quantifying the effect of an explanatory variable in the overall mixture population. We develop a marginalized ZIP model approach for independent responses to model the population mean count directly, allowing straightforward inference for overall exposure effects and empirical robust variance estimation for overall log-incidence density ratios. Through simulation studies, the performance of maximum likelihood estimation of the marginalized ZIP model is assessed and compared with other methods of estimating overall exposure effects. The marginalized ZIP model is applied to a recent study of a motivational interviewing-based safer sex counseling intervention, designed to reduce unprotected sexual act counts.

read more

Content maybe subject to copyright    Report

A Marginalized Zero-Inflated Poisson Regression Model with
Overall Exposure Effects
D. Leann Long
a,*
, John S. Preisser
b
, Amy H. Herring
b,c
, and Carol E. Golin
d
a
Department of Biostatistics, West Virginia University
b
Department of Biostatistics, University of North Carolina
c
Carolina Population Center, University of North Carolina
d
Department of Health Behavior, University of North Carolina
Abstract
The zero-inflated Poisson (ZIP) regression model is often employed in public health research to
examine the relationships between exposures of interest and a count outcome exhibiting many
zeros, in excess of the amount expected under sampling from a Poisson distribution. The
regression coefficients of the ZIP model have latent class interpretations, which correspond to a
susceptible subpopulation at risk for the condition with counts generated from a Poisson
distribution and a non-susceptible subpopulation that provide the extra or excess zeros. The ZIP
model parameters, however, are not well suited for inference targeted at marginal means,
specifically, in quantifying the effect of an explanatory variable in the overall mixture population.
We develop a marginalized ZIP model approach for independent responses to model the
population mean count directly, allowing straightforward inference for overall exposure effects
and empirical robust variance estimation for overall log incidence density ratios. Through
simulation studies, the performance of maximum likelihood estimation of the marginalized ZIP
model is assessed and compared to other methods of estimating overall exposure effects. The
marginalized ZIP model is applied to a recent study of a motivational interviewing-based safer sex
counseling intervention, designed to reduce unprotected sexual act counts.
Keywords
Incidence; Marginalized Models; Unprotected intercourse; Zero-inflation
1. Introduction
Zero-inflated count data exist in many areas of medical and public health research. Because
Poisson regression is often inadequate in describing count data with many zeros [1],
Mullahy [2] proposed the zero-inflated Poisson (ZIP) regression model, based on a mixture
of a Poisson distribution and a degenerate distribution at zero. The ZIP model has two sets
Copyright © 2014 John Wiley & Sons, Ltd.
*
Correspondence to: Department of Biostatistics, West Virginia University, One Medical Center Drive, P.O. Box 9190, Morgantown,
WV 26506-9190. dllong@hsc.wvu.edu.
NIH Public Access
Author Manuscript
Stat Med. Author manuscript; available in PMC 2015 December 20.
Published in final edited form as:
Stat Med. 2014 December 20; 33(29): 5151–5165. doi:10.1002/sim.6293.
NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript

of regression parameters that have latent class interpretations, one for the Poisson mean and
the other for the probability of being an excess zero. These latent classes are often thought to
classify some at-risk and not-at-risk populations, indicating a difference in susceptibility
between the two groups. In a manufacturing example as described by Lambert [3], a
subpopulation of imperfect machines produce defective output at some rate and all other
machines produce perfect output; thus, zero observations are a mixture of perfect machines
and imperfect machines with no defective output. In contrast, hurdle models consider all
zero observations separately from positive realizations, with the notion that all zeros arise
from the same data-generating mechanism, distinct from the process producing positive
observations [2].
While the ZIP model parameters have latent class interpretations on these two
subpopulations, researchers sometimes seek to make inference on the marginal mean of the
sampled population. Examples include population-based sample surveys aimed at describing
an entire population, intervention studies that target populations where all members are
considered to have some risk for the outcome of interest or where interest is in the global
effect in the population as a whole. Albert et al. [4] argue that insufficient emphasis has
been given to the effects of risk factors on the overall population from which the study
sample was drawn and propose estimators of overall exposure effects using a causal
inference perspective under the zero-inflated modeling framework. Although such marginal
effects of predictors are commonly sought, some analysts may find estimating them to be
difficult in the traditional ZIP model framework. While transformation techniques, such as
those employing the delta method for variance estimation, may be employed to estimate
marginal effects of an exposure of interest, these can prove tedious, and the treatment of
covariates is not straightforward [5].
The search for easily implementable overall exposure effect estimation in the ZIP model
leads to the consideration of the marginalized models literature. Heagerty [6] proposed
marginalized multilevel models, which directly model the marginal means by linking
marginal and conditional models with a function of covariates, marginal parameters and
random effects specification. Lee et al. [7] explore hurdle models in the context of
marginalized models to analyze clustered data with excess zeros, marginalizing over the
random effects. Combining overdispersion, random effects and marginalized models
methods, Iddi and Molenberghs [8] obtain population-averaged interpretations for discrete
outcomes. These methods for regression of correlated outcomes combine the desire for
population average interpretations with the convenience of estimation with a likelihood
function constructed with random effects. In a comparatively simple implementation of the
principle of marginalization, the marginalized models approach can be adapted in the ZIP
model in order to achieve population-wide parameter interpretations for independent count
responses with many zeros. Instead of integrating (averaging) over mixtures of distributions
defined by random effects, our approach marginalizes over the Poisson and degenerate
components of the two-part ZIP model to obtain overall effects.
In studies of risky sexual behavior among HIV-positive individuals, one zero-inflated count
variable often studied is the Unprotected Anal and Vaginal Intercourse count (UAVI), the
number of unprotected anal or vaginal intercourse acts with any partner over a specified
Long et al.
Page 2
Stat Med. Author manuscript; available in PMC 2015 December 20.
NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript

time period. Golin et al. [9] developed the SafeTalk program, a multicomponent,
motivational interviewing-based, safer sex intervention for this at-risk population to reduce
the number of unprotected sexual acts. In several populations, sexual behavior count data
have displayed a distribution with excess zeros [10, 11], and population-averaged effects of
covariates on sexual behavior are often desired.
To obtain inference across the marginalized means of the ZIP model, this article proposes a
new method for zero-inflated counts in which overall exposure effect estimates are easily
obtained via a model for the marginal mean count. Section 2 reviews the traditional ZIP
model and outlines issues with overall exposure effect estimation in the ZIP model. Section
3 introduces the marginalized ZIP model that includes parameters with log-incidence density
ratio (IDR) interpretations which are estimated by a maximum likelihood procedure. Section
4 presents a simulation study, which examines the properties of the marginalized ZIP and
compares it to existing methods for estimating marginal effects. Section 5 presents analysis
of the SafeTalk sexual behavior data, using the marginalized ZIP model. A discussion
follows in Section 6.
2. Traditional ZIP Model
The ZIP regression model allows the count variable of interest, say Y
i
, i = 1, …, n to take the
value of zero from a Bernoulli distribution, with probability ψ
i
, or be drawn from a Poisson
distribution, with mean μ
i
, with probability 1 − ψ
i
. Thus,
The likelihood for this ZIP model is
(1)
Lambert [3] proposed models for the parameters μ
i
and ψ
i
: and
, where γ = (γ
1
, …, γ
p
1
)is a (p
1
× 1) column vector of parameters associated
with the excess zeros, β = (β
1
, …, β
p
2
)is a (p
2
× 1) vector of parameters associated with the
Poisson process, and and are the vectors of covariates for the i
th
individual
for excess zero and Poisson processes, respectively.
Importantly, the parameters γ and β have latent class interpretations; that is, γ
j
is the log-
odds ratio of a one-unit increase in the j
th
element of Z on the probability of being an excess
zero and β
j
is the log-incidence density ratio of a one-unit increase in the j
th
element of X on
the mean of the susceptible subpopulation. In general, no simple summary of the exposure
effect on the overall population mean of the outcome is directly available. Specifically,
consider the marginal mean of Y
i
, say ν
i
E[Y
i
], often the primary interest of investigators.
The relationship between ν
i
and the parameters from the ZIP model is
Long et al.
Page 3
Stat Med. Author manuscript; available in PMC 2015 December 20.
NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript

(2)
In (2), the population mean is a function of all covariates and parameters from both model
parts. For the j
th
covariate in a ZIP model where Z
i
= X
i
as is commonly specified, the ratio
of means for a one-unit increase in x
ij
is
where x
i
indicates all covariates except x
ij
and γ
is created by removing γ
j
from γ. Thus,
unless γ
j
= 0, the incidence density ratio (IDR) is not constant across various levels of the
extraneous covariates included in the logistic portion of the ZIP model. Additionally, in
order to make statements regarding the variability of any IDR estimates at fixed levels of the
non-exposure covariates, formal statistical techniques, such as the delta method or bootstrap
resampling methods, are required [4]. The computational tools needed for these
transformations are typically not readily available in standard software packages, meaning
that these calculations can be arduous for many applied analysts.
3. Marginalized ZIP Model
Because population-wide parameter interpretations are desired, the overall mean ν
i
can be
modeled directly to give overall exposure effect estimates. The marginalized ZIP model
specifies
(3)
Then,
(4)
allows log-IDR interpretations of the elements of α. Thus, exp(α
j
) is the amount by which
the mean ν
i
is multiplied per unit change in x
j
, providing the same interpretation as in
Poisson regression. In order to utilize the ZIP model likelihood framework, we redefine μ
i
=
exp(δ
i
), where δ
i
is not necessarily a linear function of model parameters. Rather, solving ν
i
= (1 − ψ
i
i
, with substitution for (3), provides
Substituting and μ
i
= exp(δ
i
) into (1), the likelihood of the
marginalized ZIP model for (γ,α) is
Long et al.
Page 4
Stat Med. Author manuscript; available in PMC 2015 December 20.
NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript

(5)
with score equations where
and ν
i
= ν
i
(α) and ψ
i
= ψ
i
(γ). Given the Fisher information I(γ,α), the model-based standard
errors of the parameter estimates are
To address possibly overdispersed counts relative to the ZIP model, the robust (empirical)
estimates of the standard errors are
with substitution of the MLE’s γ
and α
for γ and α, respectively [12].
While parameter estimation can be implemented using various techniques, such as MCMC
methods or the EM algorithm, all results herein are obtained through nonlinear optimization
by the quasi-Newton method, implemented in SAS 9.3 IML (SAS Institute, Cary, NC). SAS
NLMIXED can also be utilized to estimate parameters, and sample code has been provided
in the Appendix. As SAS NLMIXED does not readily provide robust variance estimates, the
SAS IML code to calculate the robust estimates of standard error for our motivating
example has been provided in the online supplementary material. Additionally, the
likelihood derivations, as well as those used to obtain the Fisher information, are provided in
the Appendix.
4. Simulation Study
Simulation studies were performed to examine the properties of the new marginalized ZIP
model under different scenarios, implemented in SAS 9.3 IML. Let Y
i
be the zero-inflated
Poisson outcome of interest for the i
th
participant. Also, let x
i1
be the exposure variable of
interest and let x
i2
be an additional covariate desired in a regression model. In the SafeTalk
example, Y
i
is the UAVI count, x
i1
is an indicator of randomization to SafeTalk intervention,
and the additional covariate x
i2
is the baseline UAVI count. Thus the simulated marginalized
ZIP regression model is
Long et al.
Page 5
Stat Med. Author manuscript; available in PMC 2015 December 20.
NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript

Citations
More filters
Journal ArticleDOI

Modeling zero-modified count and semicontinuous data in health services research Part 1: background and overview.

TL;DR: This tutorial describes recent modeling strategies for zero-modified count and semicontinuous data and highlights their role in health services research studies.
Journal ArticleDOI

Two-Part and Related Regression Models for Longitudinal Data.

TL;DR: Two-part models for semicontinuous and zero-heavy count data are examined, and models for count data with a two-part random effects distribution are considered.
Journal ArticleDOI

Marginalized zero-inflated negative binomial regression with application to dental caries.

TL;DR: A marginalized zero-inflated negative binomial regression model for independent responses is proposed to model the population marginal mean count directly, providing straightforward inference for overall exposure effects based on maximum likelihood estimation.
Journal ArticleDOI

A Marginalized Zero-inflated Poisson Regression Model with Random Effects

TL;DR: A marginalized ZIP model with random effects is presented to directly model the mean of the mixture distribution consisting of 'susceptible' individuals and excess zeroes, providing straightforward inference for overall exposure effects.
Journal ArticleDOI

A Randomized Trial of an Online Risk Reduction Intervention for Young Black MSM

TL;DR: It is suggested that exposure to an online intervention can reduce the rate of CAI among young BMSM, at least in the short term, and additional intervention engagement strategies are warranted.
References
More filters
Journal ArticleDOI

Maximum likelihood estimation of misspecified models

Halbert White
- 01 Jan 1982 - 
TL;DR: In this article, the consequences and detection of model misspecification when using maximum likelihood techniques for estimation and inference are examined, and the properties of the quasi-maximum likelihood estimator and the information matrix are exploited to yield several useful tests.
Journal ArticleDOI

Zero-inflated Poisson regression, with an application to defects in manufacturing

TL;DR: Zero-inflated Poisson (ZIP) regression as discussed by the authors is a model for counting data with excess zeros, which assumes that with probability p the only possible observation is 0, and with probability 1 − p, a Poisson(λ) random variable is observed.
Journal ArticleDOI

Specification and testing of some modified count data models

TL;DR: These alternatives permit more flexible specification of the data-generating process (dgp) than do familiar count data models, and provide a natural means for modeling data that are over- or underdispersed by the standards of the basic models.
Journal ArticleDOI

The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology

TL;DR: In this article, a zero-inflated Poisson (ZIP) model is proposed for frequency counts in a dental epidemiological study in Belo Horizonte, Brazil, which evaluated various programmes for reducing caries.
Journal ArticleDOI

Zero-Altered and other Regression Models for Count Data with Added Zeros

TL;DR: In this paper, three alternative types of regression models that utilize all the information and explicitly account for excess zeros are examined and given general formulations, and an original regression formulation is presented as a reduced form of the two-part model for count data, which is also discussed.
Related Papers (5)