A Marginalized Zero-Inflated Poisson Regression Model with
Overall Exposure Effects
D. Leann Long
a,*
, John S. Preisser
b
, Amy H. Herring
b,c
, and Carol E. Golin
d
a
Department of Biostatistics, West Virginia University
b
Department of Biostatistics, University of North Carolina
c
Carolina Population Center, University of North Carolina
d
Department of Health Behavior, University of North Carolina
Abstract
The zero-inflated Poisson (ZIP) regression model is often employed in public health research to
examine the relationships between exposures of interest and a count outcome exhibiting many
zeros, in excess of the amount expected under sampling from a Poisson distribution. The
regression coefficients of the ZIP model have latent class interpretations, which correspond to a
susceptible subpopulation at risk for the condition with counts generated from a Poisson
distribution and a non-susceptible subpopulation that provide the extra or excess zeros. The ZIP
model parameters, however, are not well suited for inference targeted at marginal means,
specifically, in quantifying the effect of an explanatory variable in the overall mixture population.
We develop a marginalized ZIP model approach for independent responses to model the
population mean count directly, allowing straightforward inference for overall exposure effects
and empirical robust variance estimation for overall log incidence density ratios. Through
simulation studies, the performance of maximum likelihood estimation of the marginalized ZIP
model is assessed and compared to other methods of estimating overall exposure effects. The
marginalized ZIP model is applied to a recent study of a motivational interviewing-based safer sex
counseling intervention, designed to reduce unprotected sexual act counts.
Keywords
Incidence; Marginalized Models; Unprotected intercourse; Zero-inflation
1. Introduction
Zero-inflated count data exist in many areas of medical and public health research. Because
Poisson regression is often inadequate in describing count data with many zeros [1],
Mullahy [2] proposed the zero-inflated Poisson (ZIP) regression model, based on a mixture
of a Poisson distribution and a degenerate distribution at zero. The ZIP model has two sets
Copyright © 2014 John Wiley & Sons, Ltd.
*
Correspondence to: Department of Biostatistics, West Virginia University, One Medical Center Drive, P.O. Box 9190, Morgantown,
WV 26506-9190. dllong@hsc.wvu.edu.
NIH Public Access
Author Manuscript
Stat Med. Author manuscript; available in PMC 2015 December 20.
Published in final edited form as:
Stat Med. 2014 December 20; 33(29): 5151–5165. doi:10.1002/sim.6293.
NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript
of regression parameters that have latent class interpretations, one for the Poisson mean and
the other for the probability of being an excess zero. These latent classes are often thought to
classify some at-risk and not-at-risk populations, indicating a difference in susceptibility
between the two groups. In a manufacturing example as described by Lambert [3], a
subpopulation of imperfect machines produce defective output at some rate and all other
machines produce perfect output; thus, zero observations are a mixture of perfect machines
and imperfect machines with no defective output. In contrast, hurdle models consider all
zero observations separately from positive realizations, with the notion that all zeros arise
from the same data-generating mechanism, distinct from the process producing positive
observations [2].
While the ZIP model parameters have latent class interpretations on these two
subpopulations, researchers sometimes seek to make inference on the marginal mean of the
sampled population. Examples include population-based sample surveys aimed at describing
an entire population, intervention studies that target populations where all members are
considered to have some risk for the outcome of interest or where interest is in the global
effect in the population as a whole. Albert et al. [4] argue that insufficient emphasis has
been given to the effects of risk factors on the overall population from which the study
sample was drawn and propose estimators of overall exposure effects using a causal
inference perspective under the zero-inflated modeling framework. Although such marginal
effects of predictors are commonly sought, some analysts may find estimating them to be
difficult in the traditional ZIP model framework. While transformation techniques, such as
those employing the delta method for variance estimation, may be employed to estimate
marginal effects of an exposure of interest, these can prove tedious, and the treatment of
covariates is not straightforward [5].
The search for easily implementable overall exposure effect estimation in the ZIP model
leads to the consideration of the marginalized models literature. Heagerty [6] proposed
marginalized multilevel models, which directly model the marginal means by linking
marginal and conditional models with a function of covariates, marginal parameters and
random effects specification. Lee et al. [7] explore hurdle models in the context of
marginalized models to analyze clustered data with excess zeros, marginalizing over the
random effects. Combining overdispersion, random effects and marginalized models
methods, Iddi and Molenberghs [8] obtain population-averaged interpretations for discrete
outcomes. These methods for regression of correlated outcomes combine the desire for
population average interpretations with the convenience of estimation with a likelihood
function constructed with random effects. In a comparatively simple implementation of the
principle of marginalization, the marginalized models approach can be adapted in the ZIP
model in order to achieve population-wide parameter interpretations for independent count
responses with many zeros. Instead of integrating (averaging) over mixtures of distributions
defined by random effects, our approach marginalizes over the Poisson and degenerate
components of the two-part ZIP model to obtain overall effects.
In studies of risky sexual behavior among HIV-positive individuals, one zero-inflated count
variable often studied is the Unprotected Anal and Vaginal Intercourse count (UAVI), the
number of unprotected anal or vaginal intercourse acts with any partner over a specified
Long et al.
Page 2
Stat Med. Author manuscript; available in PMC 2015 December 20.
NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript
time period. Golin et al. [9] developed the SafeTalk program, a multicomponent,
motivational interviewing-based, safer sex intervention for this at-risk population to reduce
the number of unprotected sexual acts. In several populations, sexual behavior count data
have displayed a distribution with excess zeros [10, 11], and population-averaged effects of
covariates on sexual behavior are often desired.
To obtain inference across the marginalized means of the ZIP model, this article proposes a
new method for zero-inflated counts in which overall exposure effect estimates are easily
obtained via a model for the marginal mean count. Section 2 reviews the traditional ZIP
model and outlines issues with overall exposure effect estimation in the ZIP model. Section
3 introduces the marginalized ZIP model that includes parameters with log-incidence density
ratio (IDR) interpretations which are estimated by a maximum likelihood procedure. Section
4 presents a simulation study, which examines the properties of the marginalized ZIP and
compares it to existing methods for estimating marginal effects. Section 5 presents analysis
of the SafeTalk sexual behavior data, using the marginalized ZIP model. A discussion
follows in Section 6.
2. Traditional ZIP Model
The ZIP regression model allows the count variable of interest, say Y
i
, i = 1, …, n to take the
value of zero from a Bernoulli distribution, with probability ψ
i
, or be drawn from a Poisson
distribution, with mean μ
i
, with probability 1 − ψ
i
. Thus,
The likelihood for this ZIP model is
(1)
Lambert [3] proposed models for the parameters μ
i
and ψ
i
: and
, where γ = (γ
1
, …, γ
p
1
)′ is a (p
1
× 1) column vector of parameters associated
with the excess zeros, β = (β
1
, …, β
p
2
)′ is a (p
2
× 1) vector of parameters associated with the
Poisson process, and and are the vectors of covariates for the i
th
individual
for excess zero and Poisson processes, respectively.
Importantly, the parameters γ and β have latent class interpretations; that is, γ
j
is the log-
odds ratio of a one-unit increase in the j
th
element of Z on the probability of being an excess
zero and β
j
is the log-incidence density ratio of a one-unit increase in the j
th
element of X on
the mean of the susceptible subpopulation. In general, no simple summary of the exposure
effect on the overall population mean of the outcome is directly available. Specifically,
consider the marginal mean of Y
i
, say ν
i
≡ E[Y
i
], often the primary interest of investigators.
The relationship between ν
i
and the parameters from the ZIP model is
Long et al.
Page 3
Stat Med. Author manuscript; available in PMC 2015 December 20.
NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript
(2)
In (2), the population mean is a function of all covariates and parameters from both model
parts. For the j
th
covariate in a ZIP model where Z
i
= X
i
as is commonly specified, the ratio
of means for a one-unit increase in x
ij
is
where x
i
indicates all covariates except x
ij
and γ
is created by removing γ
j
from γ. Thus,
unless γ
j
= 0, the incidence density ratio (IDR) is not constant across various levels of the
extraneous covariates included in the logistic portion of the ZIP model. Additionally, in
order to make statements regarding the variability of any IDR estimates at fixed levels of the
non-exposure covariates, formal statistical techniques, such as the delta method or bootstrap
resampling methods, are required [4]. The computational tools needed for these
transformations are typically not readily available in standard software packages, meaning
that these calculations can be arduous for many applied analysts.
3. Marginalized ZIP Model
Because population-wide parameter interpretations are desired, the overall mean ν
i
can be
modeled directly to give overall exposure effect estimates. The marginalized ZIP model
specifies
(3)
Then,
(4)
allows log-IDR interpretations of the elements of α. Thus, exp(α
j
) is the amount by which
the mean ν
i
is multiplied per unit change in x
j
, providing the same interpretation as in
Poisson regression. In order to utilize the ZIP model likelihood framework, we redefine μ
i
=
exp(δ
i
), where δ
i
is not necessarily a linear function of model parameters. Rather, solving ν
i
= (1 − ψ
i
)μ
i
, with substitution for (3), provides
Substituting and μ
i
= exp(δ
i
) into (1), the likelihood of the
marginalized ZIP model for (γ,α) is
Long et al.
Page 4
Stat Med. Author manuscript; available in PMC 2015 December 20.
NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript
(5)
with score equations where
and ν
i
= ν
i
(α) and ψ
i
= ψ
i
(γ). Given the Fisher information I(γ,α), the model-based standard
errors of the parameter estimates are
To address possibly overdispersed counts relative to the ZIP model, the robust (empirical)
estimates of the standard errors are
with substitution of the MLE’s γ
and α
for γ and α, respectively [12].
While parameter estimation can be implemented using various techniques, such as MCMC
methods or the EM algorithm, all results herein are obtained through nonlinear optimization
by the quasi-Newton method, implemented in SAS 9.3 IML (SAS Institute, Cary, NC). SAS
NLMIXED can also be utilized to estimate parameters, and sample code has been provided
in the Appendix. As SAS NLMIXED does not readily provide robust variance estimates, the
SAS IML code to calculate the robust estimates of standard error for our motivating
example has been provided in the online supplementary material. Additionally, the
likelihood derivations, as well as those used to obtain the Fisher information, are provided in
the Appendix.
4. Simulation Study
Simulation studies were performed to examine the properties of the new marginalized ZIP
model under different scenarios, implemented in SAS 9.3 IML. Let Y
i
be the zero-inflated
Poisson outcome of interest for the i
th
participant. Also, let x
i1
be the exposure variable of
interest and let x
i2
be an additional covariate desired in a regression model. In the SafeTalk
example, Y
i
is the UAVI count, x
i1
is an indicator of randomization to SafeTalk intervention,
and the additional covariate x
i2
is the baseline UAVI count. Thus the simulated marginalized
ZIP regression model is
Long et al.
Page 5
Stat Med. Author manuscript; available in PMC 2015 December 20.
NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript