A marginalized zero-inflated Poisson regression model with overall exposure effects.

doi:10.1002/SIM.6293

A Marginalized Zero-Inflated Poisson Regression Model with

Overall Exposure Effects

D. Leann Long

a,*

, John S. Preisser

b

, Amy H. Herring

b,c

, and Carol E. Golin

d

a

Department of Biostatistics, West Virginia University

b

Department of Biostatistics, University of North Carolina

c

Carolina Population Center, University of North Carolina

d

Department of Health Behavior, University of North Carolina

Abstract

The zero-inflated Poisson (ZIP) regression model is often employed in public health research to

examine the relationships between exposures of interest and a count outcome exhibiting many

zeros, in excess of the amount expected under sampling from a Poisson distribution. The

regression coefficients of the ZIP model have latent class interpretations, which correspond to a

susceptible subpopulation at risk for the condition with counts generated from a Poisson

distribution and a non-susceptible subpopulation that provide the extra or excess zeros. The ZIP

model parameters, however, are not well suited for inference targeted at marginal means,

specifically, in quantifying the effect of an explanatory variable in the overall mixture population.

We develop a marginalized ZIP model approach for independent responses to model the

population mean count directly, allowing straightforward inference for overall exposure effects

and empirical robust variance estimation for overall log incidence density ratios. Through

simulation studies, the performance of maximum likelihood estimation of the marginalized ZIP

model is assessed and compared to other methods of estimating overall exposure effects. The

marginalized ZIP model is applied to a recent study of a motivational interviewing-based safer sex

counseling intervention, designed to reduce unprotected sexual act counts.

Keywords

Incidence; Marginalized Models; Unprotected intercourse; Zero-inflation

1. Introduction

Zero-inflated count data exist in many areas of medical and public health research. Because

Poisson regression is often inadequate in describing count data with many zeros [1],

Mullahy [2] proposed the zero-inflated Poisson (ZIP) regression model, based on a mixture

of a Poisson distribution and a degenerate distribution at zero. The ZIP model has two sets

*

Correspondence to: Department of Biostatistics, West Virginia University, One Medical Center Drive, P.O. Box 9190, Morgantown,

WV 26506-9190. dllong@hsc.wvu.edu.

NIH Public Access

Author Manuscript

Stat Med. Author manuscript; available in PMC 2015 December 20.

Published in final edited form as:

Stat Med. 2014 December 20; 33(29): 5151–5165. doi:10.1002/sim.6293.

NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript

of regression parameters that have latent class interpretations, one for the Poisson mean and

the other for the probability of being an excess zero. These latent classes are often thought to

classify some at-risk and not-at-risk populations, indicating a difference in susceptibility

between the two groups. In a manufacturing example as described by Lambert [3], a

subpopulation of imperfect machines produce defective output at some rate and all other

machines produce perfect output; thus, zero observations are a mixture of perfect machines

and imperfect machines with no defective output. In contrast, hurdle models consider all

zero observations separately from positive realizations, with the notion that all zeros arise

from the same data-generating mechanism, distinct from the process producing positive

observations [2].

While the ZIP model parameters have latent class interpretations on these two

subpopulations, researchers sometimes seek to make inference on the marginal mean of the

sampled population. Examples include population-based sample surveys aimed at describing

an entire population, intervention studies that target populations where all members are

considered to have some risk for the outcome of interest or where interest is in the global

effect in the population as a whole. Albert et al. [4] argue that insufficient emphasis has

been given to the effects of risk factors on the overall population from which the study

sample was drawn and propose estimators of overall exposure effects using a causal

inference perspective under the zero-inflated modeling framework. Although such marginal

effects of predictors are commonly sought, some analysts may find estimating them to be

difficult in the traditional ZIP model framework. While transformation techniques, such as

those employing the delta method for variance estimation, may be employed to estimate

marginal effects of an exposure of interest, these can prove tedious, and the treatment of

covariates is not straightforward [5].

The search for easily implementable overall exposure effect estimation in the ZIP model

leads to the consideration of the marginalized models literature. Heagerty [6] proposed

marginalized multilevel models, which directly model the marginal means by linking

marginal and conditional models with a function of covariates, marginal parameters and

random effects specification. Lee et al. [7] explore hurdle models in the context of

marginalized models to analyze clustered data with excess zeros, marginalizing over the

random effects. Combining overdispersion, random effects and marginalized models

methods, Iddi and Molenberghs [8] obtain population-averaged interpretations for discrete

outcomes. These methods for regression of correlated outcomes combine the desire for

population average interpretations with the convenience of estimation with a likelihood

function constructed with random effects. In a comparatively simple implementation of the

principle of marginalization, the marginalized models approach can be adapted in the ZIP

model in order to achieve population-wide parameter interpretations for independent count

responses with many zeros. Instead of integrating (averaging) over mixtures of distributions

defined by random effects, our approach marginalizes over the Poisson and degenerate

components of the two-part ZIP model to obtain overall effects.

In studies of risky sexual behavior among HIV-positive individuals, one zero-inflated count

variable often studied is the Unprotected Anal and Vaginal Intercourse count (UAVI), the

number of unprotected anal or vaginal intercourse acts with any partner over a specified

Long et al.

Page 2

Stat Med. Author manuscript; available in PMC 2015 December 20.

NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript

time period. Golin et al. [9] developed the SafeTalk program, a multicomponent,

motivational interviewing-based, safer sex intervention for this at-risk population to reduce

the number of unprotected sexual acts. In several populations, sexual behavior count data

have displayed a distribution with excess zeros [10, 11], and population-averaged effects of

covariates on sexual behavior are often desired.

To obtain inference across the marginalized means of the ZIP model, this article proposes a

new method for zero-inflated counts in which overall exposure effect estimates are easily

obtained via a model for the marginal mean count. Section 2 reviews the traditional ZIP

model and outlines issues with overall exposure effect estimation in the ZIP model. Section

3 introduces the marginalized ZIP model that includes parameters with log-incidence density

ratio (IDR) interpretations which are estimated by a maximum likelihood procedure. Section

4 presents a simulation study, which examines the properties of the marginalized ZIP and

compares it to existing methods for estimating marginal effects. Section 5 presents analysis

of the SafeTalk sexual behavior data, using the marginalized ZIP model. A discussion

follows in Section 6.

2. Traditional ZIP Model

The ZIP regression model allows the count variable of interest, say Y

i

, i = 1, …, n to take the

value of zero from a Bernoulli distribution, with probability ψ

i

, or be drawn from a Poisson

distribution, with mean μ

i

, with probability 1 − ψ

i

. Thus,

The likelihood for this ZIP model is

(1)

Lambert [3] proposed models for the parameters μ

i

and ψ

i

: and

, where γ = (γ

1

, …, γ

p

1

)′ is a (p

1

× 1) column vector of parameters associated

with the excess zeros, β = (β

1

, …, β

p

2

)′ is a (p

2

× 1) vector of parameters associated with the

Poisson process, and and are the vectors of covariates for the i

th

individual

for excess zero and Poisson processes, respectively.

Importantly, the parameters γ and β have latent class interpretations; that is, γ

j

is the log-

odds ratio of a one-unit increase in the j

th

element of Z on the probability of being an excess

zero and β

j

is the log-incidence density ratio of a one-unit increase in the j

th

element of X on

the mean of the susceptible subpopulation. In general, no simple summary of the exposure

effect on the overall population mean of the outcome is directly available. Specifically,

consider the marginal mean of Y

i

, say ν

i

≡ E[Y

i

], often the primary interest of investigators.

The relationship between ν

i

and the parameters from the ZIP model is

Long et al.

Page 3

Stat Med. Author manuscript; available in PMC 2015 December 20.

NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript

(2)

In (2), the population mean is a function of all covariates and parameters from both model

parts. For the j

th

covariate in a ZIP model where Z

i

= X

i

as is commonly specified, the ratio

of means for a one-unit increase in x

ij

is

where x



i

indicates all covariates except x

ij

and γ



is created by removing γ

j

from γ. Thus,

unless γ

j

= 0, the incidence density ratio (IDR) is not constant across various levels of the

extraneous covariates included in the logistic portion of the ZIP model. Additionally, in

order to make statements regarding the variability of any IDR estimates at fixed levels of the

non-exposure covariates, formal statistical techniques, such as the delta method or bootstrap

resampling methods, are required [4]. The computational tools needed for these

transformations are typically not readily available in standard software packages, meaning

that these calculations can be arduous for many applied analysts.

3. Marginalized ZIP Model

Because population-wide parameter interpretations are desired, the overall mean ν

i

can be

modeled directly to give overall exposure effect estimates. The marginalized ZIP model

specifies

(3)

Then,

(4)

allows log-IDR interpretations of the elements of α. Thus, exp(α

j

) is the amount by which

the mean ν

i

is multiplied per unit change in x

j

, providing the same interpretation as in

Poisson regression. In order to utilize the ZIP model likelihood framework, we redefine μ

i

=

exp(δ

i

), where δ

i

is not necessarily a linear function of model parameters. Rather, solving ν

i

= (1 − ψ

i

)μ

i

, with substitution for (3), provides

Substituting and μ

i

= exp(δ

i

) into (1), the likelihood of the

marginalized ZIP model for (γ,α) is

Long et al.

Page 4

Stat Med. Author manuscript; available in PMC 2015 December 20.

NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript

(5)

with score equations where

and ν

i

= ν

i

(α) and ψ

i

= ψ

i

(γ). Given the Fisher information I(γ,α), the model-based standard

errors of the parameter estimates are

To address possibly overdispersed counts relative to the ZIP model, the robust (empirical)

estimates of the standard errors are

with substitution of the MLE’s γ



and α



for γ and α, respectively [12].

While parameter estimation can be implemented using various techniques, such as MCMC

methods or the EM algorithm, all results herein are obtained through nonlinear optimization

by the quasi-Newton method, implemented in SAS 9.3 IML (SAS Institute, Cary, NC). SAS

NLMIXED can also be utilized to estimate parameters, and sample code has been provided

in the Appendix. As SAS NLMIXED does not readily provide robust variance estimates, the

SAS IML code to calculate the robust estimates of standard error for our motivating

example has been provided in the online supplementary material. Additionally, the

likelihood derivations, as well as those used to obtain the Fisher information, are provided in

the Appendix.

4. Simulation Study

Simulation studies were performed to examine the properties of the new marginalized ZIP

model under different scenarios, implemented in SAS 9.3 IML. Let Y

i

be the zero-inflated

Poisson outcome of interest for the i

th

participant. Also, let x

i1

be the exposure variable of

interest and let x

i2

be an additional covariate desired in a regression model. In the SafeTalk

example, Y

i

is the UAVI count, x

i1

is an indicator of randomization to SafeTalk intervention,

and the additional covariate x

i2

is the baseline UAVI count. Thus the simulated marginalized

ZIP regression model is

Long et al.

Page 5

Stat Med. Author manuscript; available in PMC 2015 December 20.

NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript

A marginalized zero-inflated Poisson regression model with overall exposure effects.

Citations

Modeling zero-modified count and semicontinuous data in health services research Part 1: background and overview.

Two-Part and Related Regression Models for Longitudinal Data.

Marginalized zero-inflated negative binomial regression with application to dental caries.

A Marginalized Zero-inflated Poisson Regression Model with Random Effects

A Randomized Trial of an Online Risk Reduction Intervention for Young Black MSM

References

Maximum likelihood estimation of misspecified models

Zero-inflated Poisson regression, with an application to defects in manufacturing

Specification and testing of some modified count data models

The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology

Zero-Altered and other Regression Models for Count Data with Added Zeros

Related Papers (5)

Zero-inflated Poisson regression, with an application to defects in manufacturing

Specification and testing of some modified count data models

Marginally specified logistic-normal models for longitudinal binary data.

Zero-inflated Poisson and binomial regression with random effects: a case study.

A Score Test for Testing a Zero‐Inflated Poisson Regression Model Against Zero‐Inflated Negative Binomial Alternatives