scispace - formally typeset
Open AccessPosted ContentDOI

Modeling zero-inflated count data with glmmTMB

Reads0
Chats0
TLDR
A new R package, glmmTMB, is presented, that increases the range of models that can easily be fitted to count data using maximum likelihood estimation and is faster than packages that use Markov chain Monte Carlo sampling for estimation.
Abstract
Ecological phenomena are often measured in the form of count data. These data can be analyzed using generalized linear mixed models (GLMMs) when observations are correlated in ways that require random effects. However, count data are often zero-inflated, containing more zeros than would be expected from the standard error distributions used in GLMMs, e.g., parasite counts may be exactly zero for hosts with effective immune defenses but vary according to a negative binomial distribution for non-resistant hosts. We present a new R package, glmmTMB, that increases the range of models that can easily be fitted to count data using maximum likelihood estimation. The interface was developed to be familiar to users of the lme4 R package, a common tool for fitting GLMMs. To maximize speed and flexibility, estimation is done using Template Model Builder (TMB), utilizing automatic differentiation to estimate model gradients and the Laplace approximation for handling random effects. We demonstrate glmmTMB and compare it to other available methods using two ecological case studies. In general, glmmTMB is more flexible than other packages available for estimating zero-inflated models via maximum likelihood estimation and is faster than packages that use Markov chain Monte Carlo sampling for estimation; it is also more flexible for zero-inflated modelling than INLA, but speed comparisons vary with model and data structure. Our package can be used to fit GLMs and GLMMs with or without zero-inflation as well as hurdle models. By allowing ecologists to quickly estimate a wide variety of models using a single package, glmmTMB makes it easier to find appropriate models and test hypotheses to describe ecological processes.

read more

Content maybe subject to copyright    Report

Modeling zero-inflated count data with glmmTMB
Mollie E. Brooks
a,b,h
, Kasper Kristensen
a
, Koen J. van Benthem
b
, Arni
Magnusson
c
, Casper W. Berg
a
, Anders Nielsen
a
, Hans J. Skaug
d
, Martin
achler
e
, Benjamin M. Bolker
f,g
a
National Institute of Aquatic Resources, Technical University of Denmark, Charlottenlund
Slot, 2920 Charlottenlund, Denmark
b
Department of Evolutionary Biology and Environmental Studies, University of Zurich,
Winterthurerstrasse 190, 8057 Zurich, Switzerland
c
International Council for the Exploration of the Sea, H.C. Andersens Boulevard 44-46,
1553 Copenhagen, Denmark
d
Department of Mathematics, University of Bergen, P.O. Box 7803, 5020 Bergen, Norway
e
Seminar ur Statistik, ETH Zurich, 8092 Zurich, Switzerland
f
Department of Mathematics and Statistics, McMaster University, 1280 Main St W,
L8S4L8 Hamilton, Ontario, Canada
g
Department of Biology, McMaster University, 1280 Main St W, L8S4L8 Hamilton,
Ontario, Canada
h
corresponding author, email MollieEBrooks@gmail.com
Abstract
Ecological phenomena are often measured in the form of count data. These
data can be analyzed using generalized linear mixed models (GLMMs) when
observations are correlated in ways that require random effects. However, count
data are often zero-inflated, containing more zeros than would be expected from
the standard error distributions used in GLMMs, e.g., parasite counts may be
exactly zero for hosts with effective immune defenses but vary according to a
negative binomial distribution for non-resistant hosts.
We present a new R package, glmmTMB, that increases the range of models
that can easily be fitted to count data using maximum likelihood estimation.
The interface was developed to be familiar to users of the lme4 R package, a
common tool for fitting GLMMs. To maximize speed and flexibility, estimation
is done using Template Model Builder (TMB), utilizing automatic differentiation
to estimate model gradients and the Laplace approximation for handling random
effects. We demonstrate glmmTMB and compare it to other available methods
using two ecological case studies.
In general, glmmTMB is more flexible than other packages available for esti-
Preprint submitted to Ecological Modelling May 1, 2017
.CC-BY-NC 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted May 1, 2017. ; https://doi.org/10.1101/132753doi: bioRxiv preprint

mating zero-inflated models via maximum likelihood estimation and is faster
than packages that use Markov chain Monte Carlo sampling for estimation; it is
also more flexible for zero-inflated modelling than INLA, but speed comparisons
vary with model and data structure. Our package can be used to fit GLMs and
GLMMs with or without zero-inflation as well as hurdle models. By allowing
ecologists to quickly estimate a wide variety of models using a single package,
glmmTMB makes it easier to find appropriate models and test hypotheses to de-
scribe ecological processes.
Keywords: abundance, overdispersion, negative binomial, mixed models,
hurdle models
1. Introduction
Ecological phenomena are often measured in the form of discrete count data,
e.g., the number of times that owl nestlings beg for food (Roulin & Bersier,
2007), counts of salamanders in streams (Price et al., 2016), or counts of para-
site eggs in fecal samples of sheep (Brown et al., 2012). These counts are often5
analyzed using generalized linear models (GLMs) and their extensions (O’Hara
& Kotze, 2010; Wilson & Grenfell, 1997). GLMs quantify how expected counts
change as a function of predictor variables, e.g., nestlings change their behavior
depending on which parent they interact with (Roulin & Bersier, 2007), sala-
mander abundance decreases in streams affected by coal mining (Price et al.,10
2016), and helminth infection intensity in sheep varies with age and genotype
(Brown et al., 2012). Repeated measurements on the same individual, the same
location, or observations taken at the same point in time are often correlated;
this correlation can be accounted for using random effects in generalized linear
mixed models (GLMMs; Bolker et al., 2009; Bolker, 2015).15
These types of count data are commonly modeled with GLMs and GLMMs
using either Poisson or negative binomial distributions. For the Poisson dis-
tribution, the variance is equal to the mean. When data are overdispersed
meaning the variance is larger than the mean they are often instead modeled
2
.CC-BY-NC 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted May 1, 2017. ; https://doi.org/10.1101/132753doi: bioRxiv preprint

using the negative binomial distribution, which can be defined as a mixture20
of Poisson distributions with Gamma-distributed rates. For the Poisson and
negative binomial distributions, the expected number of zeros decreases as the
mean increases. However, when multiple processes underlie the observed counts
which is almost ubiquitous in biology the counts can contain many zeros
even if the mean is much greater than zero. For example, observed counts of25
salamanders could be zero if a stream is uninhabitable due to mining waste, or
the count could be any integer from zero to infinity depending on other qualities
of the stream affecting the population density and the salamanders’ ability to
hide from researchers (Price et al., 2016). Zero-inflated GLMs allow us to model
count data using a mixture of a Poisson or negative binomial distribution and a30
structural zero component, i.e., extra zeros. Models that ignore zero-inflation,
or attempt to handle it in the same way as simple overdispersion, yield biased
parameter estimates (Harrison, 2014).
Many biologists use the statistical computing environment R and its con-
tributed packages to organize, model, and graph their data (R Core Develop-35
ment Team, 2016). In R, there are five main packages available for modeling
zero-inflated data: pscl, INLA, MCMCglmm, glmmADMB, and brms (Table 1; Zeileis
et al., 2008; Rue et al., 2009; Hadfield, 2010; Skaug et al., 2012; B¨urkner, in
press). The pscl package can fit zero-inflated GLMs with predictor variables
on the zero-inflation using maximum likelihood estimation (Zeileis et al., 2008).40
For example, pscl can be used to test the hypothesis that sheep fecal egg counts
depend on age and extra zeros depend on genotype. However, pscl cannot
model the correlation within individuals if they are sampled repeatedly; this phe-
nomenon requires random effects. Omitting random effects and thereby ignoring
correlation makes statistical tests anti-conservative (Bolker et al., 2009; Bolker,45
2015). The glmmADMB package can fit zero-inflated GLMMs that contain ran-
dom effects to account for correlation among observations (Skaug et al., 2012).
However, it cannot fit models with predictor variables in the zero-inflation part
of the model; thus, it is only appropriate for limited cases where all observa-
tional units (e.g., individual sheep) have an equal probability of producing a50
3
.CC-BY-NC 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted May 1, 2017. ; https://doi.org/10.1101/132753doi: bioRxiv preprint

structural zero. INLA has the same limitation as glmmADMB. The MCMCglmm and
brms packages can fit zero-inflated GLMMs with predictors of zero-inflation,
but they are relatively slow because they require Markov chain Monte Carlo
(MCMC) sampling (B¨urkner, in press; Hadfield, 2010).
Here we present a new R package, glmmTMB, that estimates GLMs, GLMMs55
and extensions of GLMMs including zero-inflated GLMMs using maximum like-
lihood. The ability to fit these types of models quickly and using a single package
will make it easier for biologists to find the best model to explain patterns in
their data. We demonstrate the package using two examples. We use an ex-
ample of salamander abundance to show how to fit and compare zero-inflated60
and hurdle GLMMs and then how to extract results from a model. We use a
classic example of owl nestling behavior to compare the timing and parameter
estimates from glmmTMB to other R packages.
Table 1. Features implemented in glmmTMB and other packages that are used
for modeling zero-inflated count data. lme4 and mgcv are not included because65
they can only estimate zero-inflation when wrapped in an iterative algorithm
(Minami et al., 2007; Bolker et al., 2013).
4
.CC-BY-NC 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted May 1, 2017. ; https://doi.org/10.1101/132753doi: bioRxiv preprint

Feature glmmTMB glmmADMB pscl MCMCglmm brms INLA
predictors of zero-inflation X X X X
predictors of dispersion X
zero-truncated distributions X X X X X X
nbinom2 distribution X X X X X
nbinom1 distribution X X
weights
1
X X X X
offsets X X X
2
X X
random effects (RE) X X X X X
various RE structures X
3
X X X
maximum likelihood estimation X X X
MCMC sample a fitted model X
4
X X X X
multivariate responses X X
Notes:
1
Weights are often used to reduce the influence of some observations over
others, e.g., (Gurevitch & Hedges, 1999); glmmTMB’s dispersion formula can be70
used to model heteroskedasticity, but might not fill other roles of weights. This
feature may be added in future versions.
2
Offsets can be implemented using
priors.
3
See vignette("covstruct") for details.
4
See vignette("mcmc") for
details.
2. Implementation of glmmTMB75
The design goal of glmmTMB is to extend the flexibility of GLMMs in R while
maintaining a familiar interface. To maximize flexibility and speed, glmmTMB’s
estimation is done using the TMB package (Kristensen et al., 2016), but users need
not be familiar with TMB. We based glmmTMB’s interface (e.g., formula syntax)
on the lme4 package one of the most widely used R packages for fitting80
GLMMs (Bates et al., 2015). Like lme4, glmmTMB uses maximum likelihood
estimation and the Laplace approximation to integrate over random effects;
unlike lme4, glmmTMB does not have the alternative options of doing restricted
maximum likelihood (REML) estimation nor using Gauss-Hermite quadrature
5
.CC-BY-NC 4.0 International licensea
certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under
The copyright holder for this preprint (which was notthis version posted May 1, 2017. ; https://doi.org/10.1101/132753doi: bioRxiv preprint

Citations
More filters
Journal ArticleDOI

The coefficient of determination R2 and intra-class correlation coefficient from generalized linear mixed-effects models revisited and expanded.

TL;DR: This paper generalizes the methods called for Poisson and binomial GLMMs to all other non-Gaussian distributions, in particular to negative binomial and gamma distributions that are commonly used for modelling biological data and can be used across disciplines and regardless of statistical environments.
Posted ContentDOI

The coefficient of determination R2 and intra-class correlation coefficient from generalized linear mixed-effects models revisited and expanded

TL;DR: In this paper, a non-Gaussian version of the coefficient of determination (R2GLMM) is proposed for estimating the proportion of variance explained by a statistical model and is an important summary statistic of biological interest.
Journal ArticleDOI

Accounting for individual-specific variation in habitat-selection studies: Efficient estimation of mixed-effects models using Bayesian or frequentist computation

TL;DR: By providing coded examples using integrated nested Laplace approximations and Template Model Builder for Bayesian and frequentist analysis via the R packages R-INLA and glmmTMB, it is hoped to make efficient estimation of RSFs and SSFs with random effects accessible to anyone in the field.
Posted ContentDOI

Violating the normality assumption may be the lesser of two evils

TL;DR: It is argued that violating the normality assumption bears risks that are limited and manageable, while several more sophisticated approaches are relatively error prone and difficult to check during peer review.
Journal ArticleDOI

Violating the normality assumption may be the lesser of two evils

TL;DR: In this article, Monte Carlo simulations were used to explore the pros and cons of fitting Gaussian models to non-normal data in terms of risk of type I error, power and utility for parameter estimation.
References
More filters
Journal Article

R: A language and environment for statistical computing.

R Core Team
- 01 Jan 2014 - 
TL;DR: Copyright (©) 1999–2012 R Foundation for Statistical Computing; permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and permission notice are preserved on all copies.
Journal ArticleDOI

Fitting Linear Mixed-Effects Models Using lme4

TL;DR: In this article, a model is described in an lmer call by a formula, in this case including both fixed-and random-effects terms, and the formula and data together determine a numerical representation of the model from which the profiled deviance or the profeatured REML criterion can be evaluated as a function of some of model parameters.
Journal ArticleDOI

Regularization Paths for Generalized Linear Models via Coordinate Descent

TL;DR: In comparative timings, the new algorithms are considerably faster than competing methods and can handle large problems and can also deal efficiently with sparse features.
Book

Mixed Effects Models and Extensions in Ecology with R

TL;DR: In this paper, the authors apply additive mixed modelling on phyoplankton time series data and show that the additive model can be used to estimate the age distribution of small cetaceans.
Journal ArticleDOI

Generalized linear mixed models: a practical guide for ecology and evolution

TL;DR: The use (and misuse) of GLMMs in ecology and evolution are reviewed, estimation and inference are discussed, and 'best-practice' data analysis procedures for scientists facing this challenge are summarized.
Related Papers (5)
Frequently Asked Questions (7)
Q1. How long did it take to fit a glmm?

On a standard laptop computer, it took approximately five seconds to fit a210zero-inflated Poisson GLMM in glmmTMB; MCMCglmm took four times as long as glmmTMB, followed by glmmADMB (eight times as long) and brms (27 or 14 times as long, depending on precompilation). 

Ecological phenomena are often measured in the form of discrete count data, e.g., the number of times that owl nestlings beg for food (Roulin & Bersier, 2007), counts of salamanders in streams (Price et al., 2016), or counts of parasite eggs in fecal samples of sheep (Brown et al., 2012). 

185Fitting the model to the original data replicated to have more observations per site required, on average, half the time to fit with INLA compared to glmmTMB, 22 times as long with glmmADMB, 30 times as long with lme4, and 59 times as long with brms (Fig A.2). 

lme4 and mgcv are not included because65 they can only estimate zero-inflation when wrapped in an iterative algorithm (Minami et al., 2007; Bolker et al., 2013). 

Many biologists use the statistical computing environment R and its con-tributed packages to organize, model, and graph their data (R Core Develop-35ment Team, 2016). 

40For example, pscl can be used to test the hypothesis that sheep fecal egg counts depend on age and extra zeros depend on genotype. 

Fitting this model to simulated data with the same structure as the original data was, on average, equally fast in glmmTMB and INLA, 26 times slower with glmmADMB, 30 times slower with lme4, and 274 times slower with brms (Fig A.1).