
Showing papers on "Overdispersion" published in 2016


BookDOI
19 Apr 2016
TL;DR: A textbook on generalized linear mixed models covering estimation and inference essentials: estimation, approaches to testing, and inference using model-based statistics and empirical standard errors.
Abstract:
PART I The Big Picture
Modeling Basics: What Is a Model?; Two Model Forms: Model Equation and Probability Distribution; Types of Model Effects; Writing Models in Matrix Form; Summary: Essential Elements for a Complete Statement of the Model
Design Matters: Introductory Ideas for Translating Design and Objectives into Models; Describing "Data Architecture" to Facilitate Model Specification; From Plot Plan to Linear Predictor; Distribution Matters; More Complex Example: Multiple Factors with Different Units of Replication
Setting the Stage: Goals for Inference with Models: Overview; Basic Tools of Inference; Issue I: Data Scale vs. Model Scale; Issue II: Inference Space; Issue III: Conditional and Marginal Models; Summary
PART II Estimation and Inference Essentials
Estimation: Introduction; Essential Background; Fixed Effects Only; Gaussian Mixed Models; Generalized Linear Mixed Models; Summary
Inference, Part I: Model Effects: Introduction; Essential Background; Approaches to Testing; Inference Using Model-Based Statistics; Inference Using Empirical Standard Error; Summary of Main Ideas and General Guidelines for Implementation
Inference, Part II: Covariance Components: Introduction; Formal Testing of Covariance Components; Fit Statistics to Compare Covariance Models; Interval Estimation; Summary
PART III Working with GLMMs
Treatment and Explanatory Variable Structure: Types of Treatment Structures; Types of Estimable Functions; Multiple Factor Models: Overview; Multifactor Models with All Factors Qualitative; Multifactor: Some Factors Qualitative, Some Factors Quantitative; Multifactor: All Factors Quantitative; Summary
Multilevel Models: Types of Design Structure: Single- and Multilevel Models Defined; Types of Multilevel Models and How They Arise; Role of Blocking in Multilevel Models; Working with Multilevel Designs; Marginal and Conditional Multilevel Models; Summary
Best Linear Unbiased Prediction: Review of Estimable and Predictable Functions; BLUP in Random-Effects-Only Models; Gaussian Data with Fixed and Random Effects; Advanced Applications with Complex Z Matrices; Summary
Rates and Proportions: Types of Rate and Proportion Data; Discrete Proportions: Binary and Binomial Data; Alternative Link Functions for Binomial Data; Continuous Proportions; Summary
Counts: Introduction; Overdispersion in Count Data; More on Alternative Distributions; Conditional and Marginal; Too Many Zeroes; Summary
Time-to-Event Data: Introduction: Probability Concepts for Time-to-Event Data; Gamma GLMMs; GLMMs and Survival Analysis; Summary
Multinomial Data: Overview; Multinomial Data with Ordered Categories; Nominal Categories: Generalized Logit Models; Model Comparison; Summary
Correlated Errors, Part I: Repeated Measures: Overview; Gaussian Data: Correlation and Covariance Models for LMMs; Covariance Model Selection; Non-Gaussian Case; Issues for Non-Gaussian Repeated Measures; Summary
Correlated Errors, Part II: Spatial Variability: Overview; Gaussian Case with Covariance Model; Spatial Covariance Modeling by Smoothing Spline; Non-Gaussian Case; Summary
Power, Sample Size, and Planning: Basics of GLMM-Based Power and Precision Analysis; Gaussian Example; Power for Binomial GLMMs; GLMM-Based Power Analysis for Count Data; Power and Planning for Repeated Measures; Summary
Appendices; References; Index

594 citations


Journal ArticleDOI
TL;DR: It is shown that the use of the beta-binomial model makes it possible to determine accurate credible intervals even in data which exhibit substantial overdispersion, and Bayesian inference methods are used for estimating the posterior distribution of the parameters of the psychometric function.
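The overdispersion the paper targets is easy to see by comparing the beta-binomial variance with the plain binomial variance at the same mean. A minimal sketch in Python, assuming hypothetical trial counts and Beta mixing parameters rather than anything from the paper:

```python
import numpy as np
from scipy import stats

# Simulate an overdispersed yes/no experiment: n trials per block, with the
# success probability varying across blocks via a Beta(a, b) distribution.
# All parameters here are hypothetical, chosen only for illustration.
n, a, b = 40, 8.0, 2.0
k = stats.betabinom.rvs(n, a, b, size=2000, random_state=0)

p = a / (a + b)                            # marginal success probability
var_binom = n * p * (1 - p)                # variance if responses were binomial
rho = 1.0 / (a + b + 1.0)                  # within-block correlation
var_bb = var_binom * (1 + (n - 1) * rho)   # beta-binomial variance
print(k.var(), var_binom, var_bb)          # sample variance far exceeds binomial
```

Credible intervals computed under a plain binomial model would be too narrow here; the beta-binomial widens them in proportion to the extra-binomial variance.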

275 citations


Journal ArticleDOI
TL;DR: The rootogram is a graphical tool associated with the work of J. W. Tukey that was originally used for assessing goodness of fit of univariate distributions as mentioned in this paper, and it is particularly useful for diagnosing and treating issues such as overdispersion and excess zeros in count data models.
Abstract: The rootogram is a graphical tool associated with the work of J. W. Tukey that was originally used for assessing goodness of fit of univariate distributions. Here, we extend the rootogram to regression models and show that this is particularly useful for diagnosing and treating issues such as overdispersion and/or excess zeros in count data models. We also introduce a weighted version of the rootogram that can be applied out of sample or to (weighted) subsets of the data, for example, in finite mixture models. An empirical illustration revisiting a well-known dataset from ethology is included, for which a negative binomial hurdle model is employed. Supplementary materials providing two further illustrations are available online: the first, using data from public health, employs a two-component finite mixture of negative binomial models; the second, using data from finance, involves underdispersion. An R implementation of our tools is available in the R package countreg. It also contains the data and replication code.
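A hanging rootogram is straightforward to sketch outside the countreg package. The following Python approximation, with simulated overdispersed counts standing in for real data and a constant fitted mean standing in for a regression fit, compares observed and expected frequencies on the square-root scale:

```python
import numpy as np
from scipy import stats

def rootogram(y, mu):
    """Observed vs. Poisson-expected count frequencies on the sqrt scale."""
    ks = np.arange(y.max() + 1)
    observed = np.bincount(y, minlength=ks.size)
    # Expected frequency of each count k, summed over the fitted means:
    expected = stats.poisson.pmf(ks[:, None], mu[None, :]).sum(axis=1)
    return ks, np.sqrt(observed), np.sqrt(expected)

y = stats.nbinom.rvs(2, 0.2, size=500, random_state=1)  # overdispersed counts
mu = np.full(y.size, y.mean())         # stand-in for Poisson regression fits
ks, sqrt_obs, sqrt_exp = rootogram(y, mu)
# In a hanging rootogram, bars hang from sqrt(expected) down by sqrt(observed);
# bars overshooting at zero and in the tail signal excess zeros/overdispersion.
```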

140 citations


Journal ArticleDOI
TL;DR: A marginalized zero-inflated negative binomial regression model for independent responses is proposed to model the population marginal mean count directly, providing straightforward inference for overall exposure effects based on maximum likelihood estimation.
Abstract: The zero-inflated negative binomial regression model (ZINB) is often employed in diverse fields such as dentistry, health care utilization, highway safety, and medicine to examine relationships between exposures of interest and overdispersed count outcomes exhibiting many zeros. The regression coefficients of ZINB have latent class interpretations for a susceptible subpopulation at risk for the disease/condition under study with counts generated from a negative binomial distribution and for a non-susceptible subpopulation that provides only zero counts. The ZINB parameters, however, are not well-suited for estimating overall exposure effects, specifically, in quantifying the effect of an explanatory variable in the overall mixture population. In this paper, a marginalized zero-inflated negative binomial regression (MZINB) model for independent responses is proposed to model the population marginal mean count directly, providing straightforward inference for overall exposure effects based on maximum likelihood estimation. Through simulation studies, the finite sample performance of MZINB is compared with marginalized zero-inflated Poisson, Poisson, and negative binomial regression. The MZINB model is applied in the evaluation of a school-based fluoride mouthrinse program on dental caries in 677 children.
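The distinction the paper draws is which quantity the regression targets. Under the usual ZINB latent-class formulation with zero-inflation probability ψ_i, negative binomial mean μ_i, and dispersion α,

```latex
P(Y_i = 0) = \psi_i + (1 - \psi_i)(1 + \alpha\mu_i)^{-1/\alpha},
\qquad
E[Y_i] = (1 - \psi_i)\,\mu_i .
```

Conventional ZINB links covariates to μ_i and ψ_i separately, giving latent-class effects; MZINB instead places the regression directly on the overall mean ν_i = (1 − ψ_i)μ_i, so the coefficients quantify overall exposure effects in the mixture population.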

63 citations


Journal ArticleDOI
TL;DR: This work derives the ZICMP model and illustrates its flexibility, extrapolates the corresponding likelihood ratio test for the presence of significant data dispersion, and highlights various statistical properties and model fit through several examples.
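For reference, the Conway-Maxwell-Poisson (CMP) family underlying the ZICMP model has probability mass function

```latex
P(Y = y) = \frac{\lambda^{y}}{(y!)^{\nu}\, Z(\lambda, \nu)},
\qquad
Z(\lambda, \nu) = \sum_{j=0}^{\infty} \frac{\lambda^{j}}{(j!)^{\nu}},
```

where ν < 1 gives overdispersion, ν > 1 underdispersion, and ν = 1 recovers the Poisson; ZICMP adds a point mass at zero on top of this.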

58 citations


Journal Article
TL;DR: The ANZROD model reduces variability in SMRs due to casemix, as measured by overdispersion, and facilitates more consistent identification of true outlier ICUs, compared with the APACHE III-j model.
Abstract: Objective: To compare the impact of the 2013 Australian and New Zealand Risk of Death (ANZROD) model and the 2002 Acute Physiology and Chronic Health Evaluation (APACHE) III-j model as risk-adjustment tools for benchmarking performance and detecting outliers in Australian and New Zealand intensive care units. Methods: Data were extracted from the Australian and New Zealand Intensive Care Society Adult Patient Database for all ICUs that contributed data between 1 January 2010 and 31 December 2013. Annual standardised mortality ratios (SMRs) were calculated for ICUs using the ANZROD and APACHE III-j models. They were plotted on funnel plots separately for each hospital type, with ICUs above the upper 99.8% control limit considered as potential outliers with worse performance than their peer group. Overdispersion parameters were estimated for both models. Overall fit was assessed using the Akaike information criterion (AIC) and Bayesian information criterion (BIC). Outlier association with mortality was assessed using a logistic regression model. Results: The ANZROD model identified more outliers than the APACHE III-j model during the study period. The numbers of outliers in rural, metropolitan, tertiary and private hospitals identified by the ANZROD model were 3, 2, 6 and 6, respectively; and those identified by the APACHE III-j model were 2, 0, 1 and 1, respectively. The degree of overdispersion was less for the ANZROD model compared with the APACHE III-j model in each year. The ANZROD model showed better overall fit to the data, with smaller AIC and BIC values than the APACHE III-j model. Outlier ICUs identified using the ANZROD model were more strongly associated with increased mortality. Conclusion: The ANZROD model reduces variability in SMRs due to casemix, as measured by overdispersion, and facilitates more consistent identification of true outlier ICUs, compared with the APACHE III-j model.
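Funnel-plot control limits of the kind described can be sketched as follows. This is a generic Spiegelhalter-style approximation with an assumed overdispersion factor, not the exact ANZROD or APACHE III-j computation used in the study:

```python
import numpy as np
from scipy import stats

def funnel_limits(expected_deaths, phi=1.0, level=0.998):
    """Approximate two-sided funnel limits for an SMR of 1, inflated by an
    overdispersion factor phi (phi = 1 recovers pure Poisson limits).
    Normal approximation; exact Poisson limits would be used for small units."""
    z = stats.norm.ppf(1 - (1 - level) / 2)
    se = np.sqrt(phi / expected_deaths)    # approx. SE of the SMR near 1
    return 1 - z * se, 1 + z * se

expected = np.linspace(5, 400, 50)         # hypothetical expected deaths per ICU
lo, hi = funnel_limits(expected, phi=1.4)  # phi = 1.4 is an assumed value
# An ICU whose SMR (observed/expected deaths) exceeds `hi` for its volume is
# flagged as a potential outlier performing worse than its peer group.
```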

54 citations


Journal ArticleDOI
TL;DR: In this article, the multivariate Poisson Lognormal model was used to predict crash counts for single-vehicle, same-direction, and opposite-direction crash types using three years (2009-2011) of crash data on Connecticut divided limited-access highway segments.

51 citations


BookDOI
19 Apr 2016
TL;DR: The role of statistics in transportation engineering is discussed, and the authors survey the probability distributions most commonly used in transportation data analysis, both discrete and continuous.
Abstract:
Overview: The Role of Statistics in Transportation Engineering: What Is Engineering?; What Is Transportation Engineering?; Goal of the Textbook; Overview of the Textbook; Who Is the Audience for This Textbook?; Relax, Everything Is Fine
Graphical Methods for Displaying Data: Introduction; Histogram; Box and Whisker Plot; Quantile Plot; Scatter Plot; Parallel Plot; Time Series Plot; Quality Control Plots; Concluding Remarks
Numerical Summary Measures: Introduction; Measures of Central Tendency; Measures of Relative Standing; Measures of Variability; Measures of Association; Concluding Remarks
Probability and Random Variables: Introduction; Sample Spaces and Events; Interpretation of Probability; Random Variable; Expectations of Random Variables; Covariances and Correlation of Random Variables; Computing Expected Values of Functions of Random Variables; Conditional Probability; Bayes' Theorem; Concluding Remarks
Common Probability Distributions: Introduction; Discrete Distributions; Continuous Distributions; Concluding Remarks; Appendix: Table of the Most Popular Distributions in Transportation Engineering
Sampling Distributions: Introduction; Random Sampling; Sampling Distribution of a Sample Mean; Sampling Distribution of a Sample Variance; Sampling Distribution of a Sample Proportion; Concluding Remarks
Inferences: Hypothesis Testing and Interval Estimation: Introduction; Fundamentals of Hypothesis Testing; Inferences on a Single Population Mean; Inferences about Two Population Means; Inferences about One Population Variance; Inferences about Two Population Variances; Concluding Remarks; Appendix: Welch (1938) Degrees of Freedom for the Unequal Variance t-Test
Other Inferential Procedures: ANOVA and Distribution-Free Tests: Introduction; Comparisons of More than Two Population Means; Multiple Comparisons; One- and Multiway ANOVA; Assumptions for ANOVA; Distribution-Free Tests; Conclusions
Inferences Concerning Categorical Data: Introduction; Tests and Confidence Intervals for a Single Proportion; Tests and Confidence Intervals for Two Proportions; Chi-Square Tests Concerning More Than Two Population Proportions; The Chi-Square Goodness-of-Fit Test for Checking Distributional Assumptions; Conclusions
Linear Regression: Introduction; Simple Linear Regression; Transformations; Understanding and Calculating R2; Verifying the Main Assumptions in Linear Regression; Comparing Two Regression Lines at a Point and Comparing Two Regression Parameters; The Regression Discontinuity Design (RDD); Multiple Linear Regression; Variable Selection for Regression Models; Additional Collinearity Issues; Concluding Remarks
Regression Models for Count Data: Introduction; Poisson Regression Model; Overdispersion; Assessing Goodness of Fit of Poisson Regression Models; Negative Binomial Regression Model; Concluding Remarks; Appendix: Maximum Likelihood Estimation
Experimental Design: Introduction; Comparison of Direct Observation and Designed Experiments; Motivation for Experimentation; A Three-Factor, Two Levels per Factor Experiment; Factorial Experiments; Fractional Factorial Experiments; Screening Designs; D-Optimal and I-Optimal Designs; Sample Size Determination; Field and Quasi-Experiments; Concluding Remarks; Appendix: Choice Modeling of Experiments
Cross-Validation, Jackknife, and Bootstrap Methods for Obtaining Standard Errors: Introduction; Methods for Standard Error Estimation When a Closed-Form Formula Is Not Available; Cross-Validation; The Jackknife Method for Obtaining Standard Errors; Bootstrapping; Concluding Remarks
Bayesian Approaches to Transportation Data Analysis: Introduction; Fundamentals of Bayesian Statistics; Bayesian Inference; Concluding Remarks
Microsimulation: Introduction; Overview of Traffic Microsimulation Models; Analyzing Microsimulation Output; Performance Measures; Concluding Remarks; Appendix: Soft Modeling and Nonparametric Model Building
Homework Problems and References appear at the end of each chapter.

49 citations


Journal ArticleDOI
TL;DR: Different methodologies for analyzing cytogenetic chromosomal aberrations datasets are compared, with special focus on zero-inflated Poisson and zero-inflated negative binomial models.
Abstract: Within the field of cytogenetic biodosimetry, Poisson regression is the classical approach for modeling the number of chromosome aberrations as a function of radiation dose. However, it is common to find data that exhibit overdispersion. In practice, the assumption of equidispersion may be violated due to unobserved heterogeneity in the cell population, which will render the variance of observed aberration counts larger than their mean, and/or the frequency of zero counts greater than expected for the Poisson distribution. This phenomenon is observable for both full- and partial-body exposure, but more pronounced for the latter. In this work, different methodologies for analyzing cytogenetic chromosomal aberrations datasets are compared, with special focus on zero-inflated Poisson and zero-inflated negative binomial models. A score test for testing for zero inflation in Poisson regression models under the identity link is also developed.
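Before fitting zero-inflated models, a quick informal diagnostic is to compare the observed number of zero-count cells with the Poisson expectation; the paper's score test is the formal version of this idea. A sketch with simulated data loosely mimicking partial-body exposure (all parameters hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
lam = 1.2                                  # mean aberrations per exposed cell
# 30% of cells unexposed (structural zeros), the rest Poisson:
counts = np.where(rng.random(1000) < 0.3, 0, rng.poisson(lam, 1000))

observed_zeros = (counts == 0).sum()
expected_zeros = counts.size * stats.poisson.pmf(0, counts.mean())
print(observed_zeros, expected_zeros)      # large excess suggests zero inflation
```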

40 citations


Journal ArticleDOI
TL;DR: A new classifier using the negative binomial model is developed for RNA-Seq data classification, and results show that the proposed classifier can serve as an effective tool for classifying RNA-Seq data.
Abstract: RNA-sequencing (RNA-Seq) has become a powerful technology to characterize gene expression profiles because it is more accurate and comprehensive than microarrays. Although statistical methods that have been developed for microarray data can be applied to RNA-Seq data, they are not ideal due to the discrete nature of RNA-Seq data. The Poisson distribution and negative binomial distribution are commonly used to model count data. Recently, Witten (Annals Appl Stat 5:2493–2518, 2011) proposed a Poisson linear discriminant analysis for RNA-Seq data. The Poisson assumption may not be as appropriate as the negative binomial distribution when biological replicates are available and in the presence of overdispersion (i.e., when the variance is larger than or equal to the mean). However, it is more complicated to model negative binomial variables because they involve a dispersion parameter that needs to be estimated. In this paper, we propose a negative binomial linear discriminant analysis for RNA-Seq data. By Bayes’ rule, we construct the classifier by fitting a negative binomial model, and propose some plug-in rules to estimate the unknown parameters in the classifier. The relationship between the negative binomial classifier and the Poisson classifier is explored, with a numerical investigation of the impact of dispersion on the discriminant score. Simulation results show the superiority of our proposed method. We also analyze two real RNA-Seq data sets to demonstrate the advantages of our method in real-world applications. We have developed a new classifier using the negative binomial model for RNA-seq data classification. Our simulation results show that our proposed classifier has a better performance than existing works. The proposed classifier can serve as an effective tool for classifying RNA-seq data. Based on the comparison results, we have provided some guidelines for scientists to decide which method should be used in the discriminant analysis of RNA-Seq data. R code is available at http://www.comp.hkbu.edu.hk/~xwan/NBLDA.R or https://github.com/yangchadam/NBLDA
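The classifier's core logic, Bayes' rule applied with a fitted negative binomial likelihood per class, can be sketched as below. The moment-based dispersion estimate is a simple stand-in for the paper's plug-in rules, and no sequencing-depth normalization is included:

```python
import numpy as np
from scipy import stats

def fit_nb_classes(X, y, eps=1e-8):
    """Estimate per-class, per-gene NB(mean mu, dispersion alpha) parameters."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0) + eps
        var = Xc.var(axis=0) + eps
        alpha = np.maximum((var - mu) / mu**2, eps)  # method-of-moments dispersion
        params[c] = (mu, alpha)
    return params

def predict_nb(X, params):
    """Assign each sample to the class maximizing the NB log-likelihood."""
    scores, classes = [], list(params)
    for c in classes:
        mu, alpha = params[c]
        r, p = 1.0 / alpha, 1.0 / (1.0 + alpha * mu)  # scipy's NB(r, p) form
        scores.append(stats.nbinom.logpmf(X, r, p).sum(axis=1))
    return np.array(classes)[np.argmax(scores, axis=0)]
```

As alpha shrinks toward zero this score approaches the Poisson discriminant, mirroring the relationship between the two classifiers explored in the paper.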

39 citations


Journal ArticleDOI
TL;DR: Novel modeling incorporating zero inflation, clustering, and overdispersion sheds some new light on the effect of community water fluoridation and other factors.
Abstract: Community water fluoridation is an important public health measure to prevent dental caries, but it continues to be somewhat controversial. The Iowa Fluoride Study (IFS) is a longitudinal study on a cohort of Iowa children that began in 1991. The main purposes of this study (http://www.dentistry.uiowa.edu/preventive-fluoride-study) were to quantify fluoride exposures from both dietary and nondietary sources and to associate longitudinal fluoride exposures with dental fluorosis (spots on teeth) and dental caries (cavities). We analyze a subset of the IFS data by a marginal regression model with a zero-inflated version of the Conway-Maxwell-Poisson distribution for count data exhibiting excessive zeros and a wide range of dispersion patterns. In general, we introduce two estimation methods for fitting a ZICMP marginal regression model. Finite sample behaviors of the estimators and the resulting confidence intervals are studied using extensive simulation studies. We apply our methodologies to the dental caries data. Our novel modeling incorporating zero inflation, clustering, and overdispersion sheds some new light on the effect of community water fluoridation and other factors. We also include a second application of our methodology to a genomic (next-generation sequencing) dataset that exhibits underdispersion.

Journal ArticleDOI
TL;DR: Logistic regression performs much better than LDA and seems to be more attractive for the prediction of the more toxic compounds, i.e. compounds that exhibit excess toxicity versus non-polar narcotic compounds and more reactive compounds versus less reactive compounds.
Abstract: The paper highlights the use of the logistic regression (LR) method in the construction of acceptable, statistically significant, robust, and predictive models for the classification of chemicals according to their aquatic toxic modes of action. All the essentials of a reliable model were considered carefully. The model predictors were selected by stepwise forward discriminant analysis (LDA) from a combined pool of experimental data and chemical structure-based descriptors calculated by the CODESSA and DRAGON software packages. Model predictive ability was validated both internally and externally. The applicability domain was checked by the leverage approach to verify prediction reliability. The obtained models are simple and easy to interpret. In general, LR performs much better than LDA and seems to be more attractive for the prediction of the more toxic compounds, i.e. compounds that exhibit excess toxicity versus non-polar narcotic compounds and more reactive compounds versus less reactive compounds. In addition, model fit and regression diagnostics were assessed through the influence plot, which reflects the hat-values, studentized residuals, and Cook's distance statistics of each sample. Overdispersion was also checked for the LR model. The relationships between the descriptors and the aquatic toxic behaviour of compounds are also discussed.
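The overdispersion check mentioned for the LR model is commonly done with the Pearson dispersion statistic; a minimal sketch (variable names hypothetical):

```python
import numpy as np

def pearson_dispersion(y, p_hat, n_params):
    """Pearson chi-square divided by residual degrees of freedom for a fitted
    logistic regression; values well above 1 point to overdispersion."""
    resid = (y - p_hat) / np.sqrt(p_hat * (1 - p_hat))
    return (resid**2).sum() / (len(y) - n_params)
```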

DOI
12 Dec 2016
TL;DR: Application to two real datasets indicates that the proposed zero-inflated negative binomial regression for identifying differentially abundant taxa between two or more populations is capable of detecting biologically meaningful taxa, consistent with previous studies.
Abstract: Motivation: The human microbiome plays an important role in human health and disease. The composition of the human microbiome is influenced by multiple factors, and understanding these factors is critical for elucidating the role of the microbiome in health and disease and for developing new diagnostics or therapeutic targets based on the microbiome. 16S ribosomal RNA (rRNA) gene targeted amplicon sequencing is a commonly used approach to determine the taxonomic composition of the bacterial community. Operational taxonomic units (OTUs) are clustered based on generated sequence reads and used to determine whether and how microbiome abundance is correlated with some characteristics of the samples, such as health/disease status, smoking status, or dietary habit. However, OTU count data is not only overdispersed but also contains an excess number of zero counts due to undersampling. Efficient analytical tools are therefore needed for downstream statistical analysis that can simultaneously account for overdispersion and sparsity in microbiome data. Results: In this paper, we propose a Zero-inflated Negative Binomial (ZINB) regression for identifying differentially abundant taxa between two or more populations. The proposed method utilizes an Expectation Maximization (EM) algorithm, by incorporating a two-part mixture model consisting of (i) a negative binomial model to account for over-dispersion and (ii) a logistic regression model to account for excessive zero counts. Extensive simulation studies are conducted which indicate that ZINB demonstrates better performance than several state-of-the-art approaches, as measured by the area under the curve (AUC). Application to two real datasets indicates that the proposed method is capable of detecting biologically meaningful taxa, consistent with previous studies. Availability: The software implementation of ZINB is available at: http://www.ssg.uab.edu/bhglm/. Supplementary information: Supplementary data are available at Journal of Bioinformatics and Genomics online.
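The same two-part structure (a negative binomial count component plus a logistic zero component) can also be fitted by off-the-shelf maximum likelihood, for example in statsmodels; this is not the paper's EM implementation, and the simulated data below are purely illustrative:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

rng = np.random.default_rng(3)
n = 300
x = rng.normal(size=n)                      # hypothetical covariate
mu = np.exp(0.5 + 0.8 * x)                  # NB mean for the count component
counts = np.where(rng.random(n) < 0.4, 0,   # 40% excess (structural) zeros
                  rng.negative_binomial(2, 2 / (2 + mu)))

X = sm.add_constant(x)
model = ZeroInflatedNegativeBinomialP(counts, X, exog_infl=X, p=2)
result = model.fit(maxiter=200, disp=0)     # joint MLE of both components
print(result.summary())
```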

Journal ArticleDOI
TL;DR: In this article, the authors examined the relationship between several seasonal and weather factors and bicycle ridership from 2 years of automated bicycle counts at a location in Seattle, Washington, using a negative binomial model and counterfactual simulation.
Abstract: This paper examines the relationship between several seasonal and weather factors and bicycle ridership from 2 years of automated bicycle counts at a location in Seattle, Washington. The authors fitted a negative binomial model and then estimated quantities of interest using counterfactual simulation. The findings confirm the significance of season (+), temperature (+), precipitation (−), as well as holidays (−), day of the week (+ for Monday through Saturday, relative to Sunday), and an overall trend (+). This paper improves on prior work by demonstrating the use of the negative binomial instead of a Poisson model, which is appropriate given the potential for overdispersion, as observed in these data. In addition to validating the significance of factors identified from the literature, this paper contributes methodologically through its intuitive visualization of effect sizes to nonstatistical audiences. The authors believe that the combination of model type and counterfactual simulation and visualizatio...
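The fit-then-simulate workflow the authors describe can be sketched as follows, with made-up weather covariates standing in for the Seattle counter data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 730                                        # two years of daily counts
temp = rng.normal(15, 8, n)                    # daily temperature (hypothetical)
rain = (rng.random(n) < 0.4).astype(float)     # precipitation indicator
mu = np.exp(5.0 + 0.03 * temp - 0.5 * rain)    # assumed true mean structure
y = rng.negative_binomial(5, 5 / (5 + mu))     # overdispersed daily counts

X = sm.add_constant(np.column_stack([temp, rain]))
res = sm.NegativeBinomial(y, X).fit(disp=0)

X_dry = X.copy()
X_dry[:, 2] = 0.0                              # counterfactual: no rainy days
extra = res.predict(X_dry).sum() - res.predict(X).sum()
print(f"predicted additional riders with no precipitation: {extra:.0f}")
```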

Journal ArticleDOI
TL;DR: A class of two-part hurdle models for the analysis of zero-inflated areal count data is developed, demonstrating that overdispersed hurdle models provide a useful approach to analyzing zero-inflated spatiotemporal data.
Abstract: Motivated by a study exploring spatiotemporal trends in emergency department use, we develop a class of two-part hurdle models for the analysis of zero-inflated areal count data. The models consist...
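A two-part hurdle model of the kind developed here separates the zero process from the positive counts:

```latex
P(Y_i = 0) = \pi_i,
\qquad
P(Y_i = y) = (1 - \pi_i)\,\frac{f(y;\mu_i)}{1 - f(0;\mu_i)}, \quad y = 1, 2, \ldots,
```

typically with logit(π_i) and log(μ_i) regressed on covariates (here with spatiotemporal random effects), and f an overdispersed count density such as the negative binomial.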

Journal ArticleDOI
TL;DR: Using a case study in chronic heart failure, it is shown that model fit can be improved, even resulting in impact on significance tests, by switching to the extended framework, which can be estimated easily by maximum likelihood in standard software.
Abstract: We combine conjugate and normal random effects in a joint model for outcomes, at least one of which is non-Gaussian, with particular emphasis on cases in which one of the outcomes is of survival type. Conjugate random effects are used to relax the often-restrictive mean-variance prescription in the non-Gaussian outcome, while normal random effects account for not only the correlation induced by repeated measurements from the same subject but also the association between the different outcomes. Using a case study in chronic heart failure, we show that model fit can be improved, even resulting in impact on significance tests, by switching to our extended framework. By first taking advantage of the ease of analytical integration over conjugate random effects, we easily estimate our framework, by maximum likelihood, in standard software.

Journal ArticleDOI
TL;DR: An integer-valued ARCH model which can be used for modeling time series of counts with under-, equi-, or overdispersion is presented, and a generalization of the introduced model is considered by introducing an integer-valued GARCH model.
Abstract: We present an integer-valued ARCH model which can be used for modeling time series of counts with under-, equi-, or overdispersion. The introduced model has a conditional binomial distribution, and it is shown to be strictly stationary and ergodic. The unknown parameters are estimated by three methods: conditional maximum likelihood, conditional least squares and maximum likelihood type penalty function estimation. The asymptotic distributions of the estimators are derived. A real application of the novel model to epidemic surveillance is briefly discussed. Finally, a generalization of the introduced model is considered by introducing an integer-valued GARCH model.

Journal ArticleDOI
TL;DR: In this paper, a wide class of integer-valued stochastic processes is introduced that makes it possible to account simultaneously for relevant characteristics observed in count data, namely zero inflation, overdispersion, and conditional heteroscedasticity.
Abstract: In this paper we introduce a wide class of integer-valued stochastic processes that allows one to take into consideration, simultaneously, relevant characteristics observed in count data, namely zero inflation, overdispersion and conditional heteroscedasticity. This class includes, in particular, the compound Poisson, the zero-inflated Poisson and the zero-inflated negative binomial INGARCH models, recently proposed in the literature. The main probabilistic analysis of this class of processes is developed here. Precisely, first- and second-order stationarity conditions are derived, the autocorrelation function is deduced and the strict stationarity is established in a large subclass. For a particular model, we also analyse the existence of higher-order moments and deduce the explicit form of the first four cumulants, as well as the skewness and kurtosis.
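The INGARCH(1,1) recursion underlying this class keeps a GARCH-like feedback in the conditional mean:

```latex
Y_t \mid \mathcal{F}_{t-1} \sim F(\lambda_t),
\qquad
\lambda_t = \omega + \alpha\, Y_{t-1} + \beta\, \lambda_{t-1},
```

where F is Poisson in the basic model and is replaced by a compound Poisson, zero-inflated Poisson, or zero-inflated negative binomial distribution in the extensions the paper considers.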

22 Sep 2016
TL;DR: In the early 20th century, only a few count distributions (binomial and Poisson distributions) were commonly used in modeling, and these distributions failed to model bimodal or overdispersed data, especially data related to phenomena for which the occurrence of a given event increases the chance of additional events occurring, as discussed by the authors.
Abstract: In the early twentieth century, only a few count distributions (binomial and Poisson distributions) were commonly used in modeling. These distributions fail to model bimodal or overdispersed data, especially data related to phenomena for which the occurrence of a given event increases the chance of additional events occurring. New count distributions have since been introduced to address such phenomena; they are named "contagious" distributions. This group of distributions, which includes the negative binomial, Neyman, Thomas and Polya-Aeppli distributions, can be expressed as mixture distributions or as stopped-sum distributions. They take into account bimodality and overdispersion, and show a greater flexibility with regards to value distributions. The aim of this literature review is to 1) explain the introduction of these distributions, 2) describe each of these overdispersed distributions, focusing in particular on their definitions, their basic properties, and their practical utility, and 3) compare their strengths and weaknesses by modeling overdispersed real count data (bovine tuberculosis cases).
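The mixture representation mentioned is explicit for the negative binomial: a Poisson count whose rate is itself gamma-distributed,

```latex
Y \mid \Lambda \sim \mathrm{Poisson}(\Lambda), \quad
\Lambda \sim \mathrm{Gamma}(r, \beta)
\;\Longrightarrow\;
P(Y = y) = \frac{\Gamma(y + r)}{y!\,\Gamma(r)}
\left(\frac{\beta}{1+\beta}\right)^{r}\left(\frac{1}{1+\beta}\right)^{y},
```

with Var(Y) = E[Y] + E[Y]²/r > E[Y], which is exactly the overdispersion these contagious distributions are designed to capture.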

Journal ArticleDOI
TL;DR: Two approaches are proposed that circumvent a key limitation of zero-inflated regression models in real applications: the first estimates the effect of covariates on the overall mean from the assumed latent class model, and the second formulates a model that directly relates the overall mean to covariates.
Abstract: Zero-inflated regression models have emerged as a popular tool within the parametric framework to characterize count data with excess zeros. Despite their increasing popularity, much of the literature on real applications of these models has centered around the latent class formulation where the mean response of the so-called at-risk or susceptible population and the susceptibility probability are both related to covariates. While this formulation in some instances provides an interesting representation of the data, it often fails to produce easily interpretable covariate effects on the overall mean response. In this article, we propose two approaches that circumvent this limitation. The first approach consists of estimating the effect of covariates on the overall mean from the assumed latent class models, while the second approach formulates a model that directly relates the overall mean to covariates. Our results are illustrated by extensive numerical simulations and an application to an oral health study on low income African-American children, where the overall mean model is used to evaluate the effect of sugar consumption on caries indices.

Journal ArticleDOI
TL;DR: In this article, the authors proposed and applied a Cox process approach to model the arrival process and reporting pattern of insurance claims, which allows for over-dispersion and serial dependency in claim counts.
Abstract: The accurate estimation of outstanding liabilities of an insurance company is an essential task. This is to meet regulatory requirements, but also to achieve efficient internal capital management. Over the recent years, there has been increasing interest in the utilisation of insurance data at a more granular level, and to model claims using stochastic processes. So far, this so-called ‘micro-level reserving’ approach has mainly focused on the Poisson process. In this paper, we propose and apply a Cox process approach to model the arrival process and reporting pattern of insurance claims. This allows for over-dispersion and serial dependency in claim counts, which are typical features in real data. We explicitly consider risk exposure and reporting delays, and show how to use our model to predict the numbers of Incurred-But-Not-Reported (IBNR) claims. The model is calibrated and illustrated using real data from the AUSI data set.
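The over-dispersion in claim counts follows directly from the Cox construction: conditional on the (random) integrated intensity Λ of the period, counts are Poisson, so

```latex
N \mid \Lambda \sim \mathrm{Poisson}(\Lambda)
\;\Longrightarrow\;
E[N] = E[\Lambda], \qquad
\mathrm{Var}(N) = E[\Lambda] + \mathrm{Var}(\Lambda) \ge E[N],
```

with strict inequality (over-dispersion) whenever the intensity is genuinely stochastic, and serial dependence induced when Λ is correlated over time.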

Journal ArticleDOI
TL;DR: In this article, a detailed analysis of the temporal transferability of heterogeneous overdispersion parameter negative binomial models for crash severity types in California was presented, where the Furnival-Wilson leaps and bounds algorithm was used to identify optimal safety performance function specifications.
Abstract: This paper presents a detailed analysis of the temporal transferability of heterogeneous overdispersion parameter negative binomial models for crash severity types in California. The Furnival–Wilson leaps and bounds algorithm was used to identify optimal safety performance function specifications. The overdispersion parameter was allowed to vary across roadway segments as a function of roadway geometrics. Sixty models were developed for five major severity outcomes for homogeneous roadway segments for each of the periods 2005 to 2010, 2011 to 2012, and 2005 to 2012. Model transferability tests were conducted with likelihood ratio tests, and it was determined that temporal transferability rates (from the 2005–2010 period to the 2011–2012 period) were poor. The findings indicate the potential time instability of safety performance function parameters. The analysis found a higher rate of transferability for rural safety performance functions compared with urban safety performance functions. The rate of trans...
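The transferability test described can be sketched as a standard likelihood ratio comparison of a pooled fit against period-specific fits; the log-likelihood values below are hypothetical placeholders:

```python
from scipy import stats

def transferability_lr(ll_pooled, ll_period_a, ll_period_b, k):
    """LR test of one pooled 2005-2012 model against separate 2005-2010 and
    2011-2012 models; k = number of parameters per model (the restriction
    imposes one shared parameter set instead of two)."""
    lr = -2.0 * (ll_pooled - (ll_period_a + ll_period_b))
    return lr, stats.chi2.sf(lr, df=k)

lr, pval = transferability_lr(-4210.3, -3105.8, -1089.2, k=12)  # made-up values
print(lr, pval)   # a small p-value indicates poor temporal transferability
```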

Journal ArticleDOI
TL;DR: In this paper, the authors extended the discussion on NB-L and NB-GE models by focusing on their capability for modeling crash data as well as quantifying the safety impact of crash contributing factors.
Abstract: A challenge in modeling crash frequency is an excess of sites with no crashes, few sites with a large number of crashes, or both. When there are excess zeros in the data or when the variance of the response is greater than the mean, the data are overdispersed. Recently, a few promising modeling techniques, such as the negative binomial–Lindley (NB-L) and negative binomial–generalized exponential (NB-GE) mixed distribution generalized linear models (GLMs), have been developed to handle count data overdispersion while keeping the core strength of the NB model. This study expanded the discussion on NB-L and NB-GE GLMs by focusing on their capability for modeling crash data as well as quantifying the safety impact of crash contributing factors. The mixed distribution models along with the conventional NB model were applied to a rural two-lane, two-way highway data set. The results showed that both NB-L and NB-GE GLMs could yield results similar to those of the NB model in addition to having mixed distribution...

Journal ArticleDOI
TL;DR: This paper proposes to control the autocorrelated count data based on a new geometric INAR (NGINAR) process, which is an alternative to the Poisson one and uses the combined jumps chart, the cumulative sum chart, and the combined exponentially weighted moving average chart to detect the shift of parameters in the process.
Abstract: In recent years, there has been a growing interest in the control of autocorrelated count data. Existing results focus on the Poisson integer-valued autoregressive (INAR) process, but this process cannot deal with overdispersion (variance is greater than mean), which is a common phenomenon in count data. We propose to control the autocorrelated count data based on a new geometric INAR (NGINAR) process, which is an alternative to the Poisson one. In this paper, we use the combined jumps chart, the cumulative sum chart, and the combined exponentially weighted moving average chart to detect the shift of parameters in the process. We compare the performance of these charts for the case of an underlying NGINAR(1) process in terms of the average run lengths. One real example is presented to demonstrate good performances of the charts. Copyright © 2015 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: In this paper, a new stationary first-order non-negative integer-valued autoregressive [INAR(1)] process with geometric marginals based on a modified version of the binomial thinning operator is proposed.
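For context, the standard INAR(1) recursion with binomial thinning, which the paper modifies, is

```latex
X_t = \alpha \circ X_{t-1} + \varepsilon_t,
\qquad
\alpha \circ X = \sum_{i=1}^{X} B_i, \quad B_i \overset{iid}{\sim} \mathrm{Bernoulli}(\alpha),
```

with the innovation distribution of ε_t chosen so that the process has the desired (here geometric, hence overdispersed) marginal distribution.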

Journal ArticleDOI
TL;DR: In this paper, a new claim number distribution is obtained by mixing the negative binomial parameter p, reparameterized as p = exp(−λ), with a gamma distribution, and the maximum likelihood estimators of the parameters are calculated using the Newton-Raphson method and a genetic algorithm (GA).
Abstract: In actuarial applications, mixed Poisson distributions are widely used for modelling claim counts, as observed data on the number of claims often exhibit a variance noticeably exceeding the mean. In this study, a new claim number distribution is obtained by mixing the negative binomial parameter p, which is reparameterized as p = exp(−λ), with a gamma distribution. Basic properties of this new distribution are given. Maximum likelihood estimators of the parameters are calculated using the Newton-Raphson method and a genetic algorithm (GA). We compared the performance of these methods in terms of efficiency by simulation. A numerical example is provided.

Journal ArticleDOI
TL;DR: In this article, the authors introduced several forms of bivariate generalized Poisson regression model (BGPR) which can be fitted to bivariate and correlated count data with covariates, which allows likelihood ratio tests to be performed to choose the best model.
Abstract: This paper introduces several forms of bivariate generalized Poisson regression model (BGPR) which can be fitted to bivariate and correlated count data with covariates. The main advantage of these forms of BGPR is that they are nested and thus they allow likelihood ratio tests to be performed to choose the best model. The BGPR can be fitted not only to bivariate count data with positive, zero, or negative correlations, but also to under- or overdispersed bivariate count data with flexible form of mean–variance relationship. Applications of several forms of the BGPR are illustrated on two sets of count data: the Australian health survey data and the US National Medical Expenditure Survey data.

Posted Content
TL;DR: A flexible estimation framework for CMP regression based on iteratively reweighted least squares (IRLS) is proposed, and this model is extended to allow for additive components using a penalized splines approach.
Abstract: The Conway-Maxwell-Poisson (CMP) or COM-Poisson regression is a popular model for count data due to its ability to capture both underdispersion and overdispersion. However, CMP regression is limited when dealing with complex nonlinear relationships. With today's wide availability of count data, especially due to the growing collection of data on human and social behavior, there is a need for count data models that can capture complex nonlinear relationships. One useful approach is additive models; but there has been no additive model implementation for the CMP distribution. To fill this void, we first propose a flexible estimation framework for CMP regression based on iteratively reweighted least squares (IRLS) and then extend this model to allow for additive components using a penalized splines approach. Because the CMP distribution belongs to the exponential family, convergence of IRLS is guaranteed under some regularity conditions. Further, it is also known that IRLS provides smaller standard errors compared to gradient-based methods. We illustrate the usefulness of this approach through extensive simulation studies and using real data from a bike sharing system in Washington, DC.
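For intuition, here is the generic IRLS loop for a log-link count GLM, written for the Poisson case where the working weights are simple; the paper's estimator embeds the CMP likelihood (and penalized splines) in this same scheme, which the sketch does not attempt:

```python
import numpy as np

def irls_poisson(X, y, tol=1e-8, max_iter=50):
    """Iteratively reweighted least squares for Poisson regression, log link."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu              # working response
        w = mu                               # working weights (Var(Y) = mu)
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```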

Journal ArticleDOI
TL;DR: The theoretically predicted effect of spatial clustering in conventional "single-hit" dose-response models is investigated by employing the stuttering Poisson distribution, a very general family of count distributions that naturally models pathogen clustering and contains the Poisson and negative binomial distributions as special cases.
Abstract: Spatial and/or temporal clustering of pathogens will invalidate the commonly used assumption of Poisson-distributed pathogen counts (doses) in quantitative microbial risk assessment. In this work, the theoretically predicted effect of spatial clustering in conventional "single-hit" dose-response models is investigated by employing the stuttering Poisson distribution, a very general family of count distributions that naturally models pathogen clustering and contains the Poisson and negative binomial distributions as special cases. The analysis is facilitated by formulating the dose-response models in terms of probability generating functions. It is shown formally that the theoretical single-hit risk obtained with a stuttering Poisson distribution is lower than that obtained with a Poisson distribution, assuming identical mean doses. A similar result holds for mixed Poisson distributions. Numerical examples indicate that the theoretical single-hit risk is fairly insensitive to moderate clustering, though the effect tends to be more pronounced for low mean doses. Furthermore, using Jensen's inequality, an upper bound on risk is derived that tends to better approximate the exact theoretical single-hit risk for highly overdispersed dose distributions. The bound holds with any dose distribution (characterized by its mean and zero inflation index) and any conditional dose-response model that is concave in the dose variable. Its application is exemplified with published data from Norovirus feeding trials, for which some of the administered doses were prepared from an inoculum of aggregated viruses. The potential implications of clustering for dose-response assessment as well as practical risk characterization are discussed.
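The probability-generating-function formulation makes the clustering comparison transparent: with per-organism infection probability r and random dose N, the single-hit risk is

```latex
P_{\mathrm{inf}}(r) = 1 - E\!\left[(1 - r)^{N}\right] = 1 - G_N(1 - r),
```

where G_N is the PGF of the dose distribution; for Poisson doses with mean μ this reduces to 1 − e^{−rμ}, and the paper shows the risk is lower for stuttering Poisson (clustered) doses with the same mean.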