How long did it take to fit a GLMM?

On a standard laptop computer, it took approximately five seconds to fit a zero-inflated Poisson GLMM in glmmTMB; MCMCglmm took four times as long as glmmTMB, followed by glmmADMB (eight times as long) and brms (27 or 14 times as long, depending on precompilation).

How long did it take to fit the model to the original data?

When the original data were replicated to give more observations per site, fitting the model took, on average, half as long with INLA as with glmmTMB, 22 times as long with glmmADMB, 30 times as long with lme4, and 59 times as long with brms (Fig A.2).

Why are lme4 and mgcv not included in Table 1?

lme4 and mgcv are not included because they can only estimate zero-inflation when wrapped in an iterative algorithm (Minami et al., 2007; Bolker et al., 2013).

How fast was the fit to simulated data?

Fitting this model to simulated data with the same structure as the original data was, on average, equally fast in glmmTMB and INLA, 26 times slower with glmmADMB, 30 times slower with lme4, and 274 times slower with brms (Fig A.1).

(Open Access) Modeling zero-inflated count data with glmmTMB (2017) | Mollie Elizabeth Brooks

How can pscl be used to test hypotheses about sheep fecal egg counts?

For example, pscl can be used to test the hypothesis that sheep fecal egg counts depend on age and extra zeros depend on genotype.

Modeling zero-inflated count data with glmmTMB

Mollie E. Brooks (a,b,h), Kasper Kristensen, Koen J. van Benthem, Arni Magnusson, Casper W. Berg, Anders Nielsen, Hans J. Skaug, Martin Mächler, Benjamin M. Bolker (f,g)

(a) National Institute of Aquatic Resources, Technical University of Denmark, Charlottenlund Slot, 2920 Charlottenlund, Denmark

(b) Department of Evolutionary Biology and Environmental Studies, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland

(c) International Council for the Exploration of the Sea, H.C. Andersens Boulevard 44-46, 1553 Copenhagen, Denmark

(d) Department of Mathematics, University of Bergen, P.O. Box 7803, 5020 Bergen, Norway

(e) Seminar für Statistik, ETH Zurich, 8092 Zurich, Switzerland

(f) Department of Mathematics and Statistics, McMaster University, 1280 Main St W, L8S4L8 Hamilton, Ontario, Canada

(g) Department of Biology, McMaster University, 1280 Main St W, L8S4L8 Hamilton, Ontario, Canada

(h) Corresponding author, email MollieEBrooks@gmail.com

Abstract

Ecological phenomena are often measured in the form of count data. These data can be analyzed using generalized linear mixed models (GLMMs) when observations are correlated in ways that require random effects. However, count data are often zero-inflated, containing more zeros than would be expected from the standard error distributions used in GLMMs, e.g., parasite counts may be exactly zero for hosts with effective immune defenses but vary according to a negative binomial distribution for non-resistant hosts.

We present a new R package, glmmTMB, that increases the range of models that can easily be fitted to count data using maximum likelihood estimation. The interface was developed to be familiar to users of the lme4 R package, a common tool for fitting GLMMs. To maximize speed and flexibility, estimation is done using Template Model Builder (TMB), utilizing automatic differentiation to estimate model gradients and the Laplace approximation for handling random effects. We demonstrate glmmTMB and compare it to other available methods using two ecological case studies.

In general, glmmTMB is more flexible than other packages available for estimating zero-inflated models via maximum likelihood estimation and is faster than packages that use Markov chain Monte Carlo sampling for estimation; it is also more flexible for zero-inflated modelling than INLA, but speed comparisons vary with model and data structure. Our package can be used to fit GLMs and GLMMs with or without zero-inflation as well as hurdle models. By allowing ecologists to quickly estimate a wide variety of models using a single package, glmmTMB makes it easier to find appropriate models and test hypotheses to describe ecological processes.

Preprint submitted to Ecological Modelling, May 1, 2017

bioRxiv preprint doi: https://doi.org/10.1101/132753; this version posted May 1, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

Keywords: abundance, overdispersion, negative binomial, mixed models, hurdle models

1. Introduction

Ecological phenomena are often measured in the form of discrete count data, e.g., the number of times that owl nestlings beg for food (Roulin & Bersier, 2007), counts of salamanders in streams (Price et al., 2016), or counts of parasite eggs in fecal samples of sheep (Brown et al., 2012). These counts are often analyzed using generalized linear models (GLMs) and their extensions (O’Hara & Kotze, 2010; Wilson & Grenfell, 1997). GLMs quantify how expected counts change as a function of predictor variables, e.g., nestlings change their behavior depending on which parent they interact with (Roulin & Bersier, 2007), salamander abundance decreases in streams affected by coal mining (Price et al., 2016), and helminth infection intensity in sheep varies with age and genotype (Brown et al., 2012). Repeated measurements on the same individual, the same location, or observations taken at the same point in time are often correlated; this correlation can be accounted for using random effects in generalized linear mixed models (GLMMs; Bolker et al., 2009; Bolker, 2015).
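As an illustrative sketch (the notation is assumed here for exposition and is not taken from the paper), a Poisson GLMM with a log link and a single random intercept per group, such as an individual or a site, can be written as:

```latex
\begin{align}
y_{ij} \mid b_i &\sim \mathrm{Poisson}(\mu_{ij}), \\
\log \mu_{ij} &= \mathbf{x}_{ij}^{\top} \boldsymbol{\beta} + b_i, \\
b_i &\sim \mathcal{N}(0, \sigma_b^2),
\end{align}
```

where $y_{ij}$ is the $j$-th count in group $i$, $\mathbf{x}_{ij}$ holds the predictor variables, $\boldsymbol{\beta}$ the fixed-effect coefficients, and $b_i$ the random intercept shared by all observations in group $i$. The shared $b_i$ is what induces correlation among repeated measurements within a group.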

These types of count data are commonly modeled with GLMs and GLMMs using either Poisson or negative binomial distributions. For the Poisson distribution, the variance is equal to the mean. When data are overdispersed (meaning the variance is larger than the mean), they are often instead modeled using the negative binomial distribution, which can be defined as a mixture of Poisson distributions with Gamma-distributed rates. For the Poisson and negative binomial distributions, the expected number of zeros decreases as the mean increases. However, when multiple processes underlie the observed counts, which is almost ubiquitous in biology, the counts can contain many zeros even if the mean is much greater than zero. For example, observed counts of salamanders could be zero if a stream is uninhabitable due to mining waste, or the count could be any integer from zero to infinity depending on other qualities of the stream affecting the population density and the salamanders’ ability to hide from researchers (Price et al., 2016). Zero-inflated GLMs allow us to model count data using a mixture of a Poisson or negative binomial distribution and a structural zero component, i.e., extra zeros. Models that ignore zero-inflation, or attempt to handle it in the same way as simple overdispersion, yield biased parameter estimates (Harrison, 2014).
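The effect of structural zeros can be illustrated with a small simulation. The following is a minimal sketch in Python (standard library only), not code from the paper: the function names, the mean of 4, and the structural-zero probability of 0.3 are all assumptions chosen for illustration. It draws counts from a zero-inflated Poisson distribution and compares the observed fraction of zeros with the fraction that a plain Poisson distribution with the same overall mean would imply.

```python
import math
import random


def poisson_sample(mean, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def zip_sample(mean, zero_prob, rng):
    """Draw from a zero-inflated Poisson: a structural zero occurs
    with probability zero_prob, otherwise the count is Poisson(mean)."""
    if rng.random() < zero_prob:
        return 0
    return poisson_sample(mean, rng)


def zip_zero_probability(mean, zero_prob):
    """P(Y = 0) = zero_prob + (1 - zero_prob) * exp(-mean)."""
    return zero_prob + (1.0 - zero_prob) * math.exp(-mean)


rng = random.Random(1)
# Illustrative values only (assumptions, not estimates from the paper)
mean, zero_prob, n = 4.0, 0.3, 20000
draws = [zip_sample(mean, zero_prob, rng) for _ in range(n)]

observed_zeros = sum(d == 0 for d in draws) / n
sample_mean = sum(draws) / n
# Zero fraction that a plain Poisson with the same overall mean would predict
poisson_zeros = math.exp(-sample_mean)

print(f"observed zero fraction: {observed_zeros:.3f}")
print(f"ZIP theoretical zeros:  {zip_zero_probability(mean, zero_prob):.3f}")
print(f"Poisson-implied zeros:  {poisson_zeros:.3f}")
```

Under these assumed settings the simulated data contain far more zeros than a Poisson distribution matched on the mean would allow, which is the pattern that a structural zero component is meant to capture.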

Many biologists use the statistical computing environment R and its contributed packages to organize, model, and graph their data (R Core Development Team, 2016). In R, there are five main packages available for modeling zero-inflated data: pscl, INLA, MCMCglmm, glmmADMB, and brms (Table 1; Zeileis et al., 2008; Rue et al., 2009; Hadfield, 2010; Skaug et al., 2012; Bürkner, in press). The pscl package can fit zero-inflated GLMs with predictor variables on the zero-inflation using maximum likelihood estimation (Zeileis et al., 2008). For example, pscl can be used to test the hypothesis that sheep fecal egg counts depend on age and extra zeros depend on genotype. However, pscl cannot model the correlation within individuals if they are sampled repeatedly; this phenomenon requires random effects. Omitting random effects and thereby ignoring correlation makes statistical tests anti-conservative (Bolker et al., 2009; Bolker, 2015). The glmmADMB package can fit zero-inflated GLMMs that contain random effects to account for correlation among observations (Skaug et al., 2012). However, it cannot fit models with predictor variables in the zero-inflation part of the model; thus, it is only appropriate for limited cases where all observational units (e.g., individual sheep) have an equal probability of producing a structural zero. INLA has the same limitation as glmmADMB. The MCMCglmm and brms packages can fit zero-inflated GLMMs with predictors of zero-inflation, but they are relatively slow because they require Markov chain Monte Carlo (MCMC) sampling (Bürkner, in press; Hadfield, 2010).
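To make "predictors of zero-inflation" concrete, a zero-inflated count model with separate linear predictors for the mean and for the structural zero probability can be sketched as follows. The symbols and the choice of log and logit links are assumptions made for this illustration rather than notation from the paper:

```latex
\begin{align}
\Pr(y_i = 0) &= \pi_i + (1 - \pi_i)\, f(0 \mid \mu_i), \\
\Pr(y_i = k) &= (1 - \pi_i)\, f(k \mid \mu_i), \quad k = 1, 2, \ldots, \\
\log \mu_i &= \mathbf{x}_i^{\top} \boldsymbol{\beta}, \qquad
\operatorname{logit} \pi_i = \mathbf{z}_i^{\top} \boldsymbol{\gamma},
\end{align}
```

where $f$ is the Poisson or negative binomial probability mass function, $\mu_i$ is the mean of the count component, and $\pi_i$ is the probability of a structural zero. In the sheep example, $\mathbf{x}_i$ would contain age and $\mathbf{z}_i$ would contain genotype. A package without predictors of zero-inflation corresponds to the special case $\pi_i = \pi$ for all $i$, and adding random effects to the linear predictor for $\mu_i$ gives a zero-inflated GLMM.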

Here we present a new R package, glmmTMB, that estimates GLMs, GLMMs and extensions of GLMMs including zero-inflated GLMMs using maximum likelihood. The ability to fit these types of models quickly and using a single package will make it easier for biologists to find the best model to explain patterns in their data. We demonstrate the package using two examples. We use an example of salamander abundance to show how to fit and compare zero-inflated and hurdle GLMMs and then how to extract results from a model. We use a classic example of owl nestling behavior to compare the timing and parameter estimates from glmmTMB to other R packages.

Table 1. Features implemented in glmmTMB and other packages that are used for modeling zero-inflated count data. lme4 and mgcv are not included because they can only estimate zero-inflation when wrapped in an iterative algorithm (Minami et al., 2007; Bolker et al., 2013).


Feature glmmTMB glmmADMB pscl MCMCglmm brms INLA

predictors of zero-inflation X X X X

predictors of dispersion X

zero-truncated distributions X X X X X X

nbinom2 distribution X X X X X

nbinom1 distribution X X

weights (a) X X X X

offsets X X X (b) X X

random effects (RE) X X X X X

various RE structures X (c) X X X

maximum likelihood estimation X X X

MCMC sample a fitted model X (d) X X X X

multivariate responses X X

Notes:

(a) Weights are often used to reduce the influence of some observations over others, e.g., (Gurevitch & Hedges, 1999); glmmTMB’s dispersion formula can be used to model heteroskedasticity, but might not fill other roles of weights. This feature may be added in future versions.

(b) Offsets can be implemented using priors.

(c) See vignette("covstruct") for details.

(d) See vignette("mcmc") for details.

2. Implementation of glmmTMB

The design goal of glmmTMB is to extend the flexibility of GLMMs in R while maintaining a familiar interface. To maximize flexibility and speed, glmmTMB’s estimation is done using the TMB package (Kristensen et al., 2016), but users need not be familiar with TMB. We based glmmTMB’s interface (e.g., formula syntax) on the lme4 package, one of the most widely used R packages for fitting GLMMs (Bates et al., 2015). Like lme4, glmmTMB uses maximum likelihood estimation and the Laplace approximation to integrate over random effects; unlike lme4, glmmTMB does not have the alternative options of doing restricted maximum likelihood (REML) estimation nor using Gauss-Hermite quadrature


Citations

The coefficient of determination R2 and intra-class correlation coefficient from generalized linear mixed-effects models revisited and expanded.


Accounting for individual-specific variation in habitat-selection studies: Efficient estimation of mixed-effects models using Bayesian or frequentist computation

Violating the normality assumption may be the lesser of two evils


References

R: A language and environment for statistical computing.

Fitting Linear Mixed-Effects Models Using lme4

Regularization Paths for Generalized Linear Models via Coordinate Descent

Mixed Effects Models and Extensions in Ecology with R

Generalized linear mixed models: a practical guide for ecology and evolution
