
Showing papers on "Mixed model published in 2017"


Journal ArticleDOI
TL;DR: The lmerTest package extends the 'lmerMod' class of the lme4 package, overloading the anova and summary functions to provide p values for tests of fixed effects, and implements Satterthwaite's method for approximating degrees of freedom for the t and F tests.
Abstract: One of the most frequent questions from users of the mixed model function lmer of the lme4 package has been: how can I get p values for the F and t tests for objects returned by lmer? The lmerTest package extends the 'lmerMod' class of the lme4 package, overloading the anova and summary functions to provide p values for tests of fixed effects. We have implemented Satterthwaite's method for approximating degrees of freedom for the t and F tests, as well as the construction of Type I-III ANOVA tables. Furthermore, the summary and the anova table may also be obtained using the Kenward-Roger approximation for denominator degrees of freedom (based on the KRmodcomp function from the pbkrtest package). The package also provides other convenient mixed model analysis tools, such as a step method that performs backward elimination of non-significant effects (both random and fixed), calculation of population means, and multiple comparison tests together with plotting facilities.
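A minimal sketch of the workflow described above, on simulated data (the data frame and variable names here are invented for illustration):

```r
library(lmerTest)  # loading lmerTest makes lmer() return an 'lmerModLmerTest' object

set.seed(1)
d <- data.frame(
  y     = rnorm(60),
  x     = factor(rep(c("a", "b", "c"), times = 20)),
  group = factor(rep(1:10, each = 6))
)

fit <- lmer(y ~ x + (1 | group), data = d)

summary(fit)                        # t tests with Satterthwaite df and p values
anova(fit)                          # Type III ANOVA table, Satterthwaite df
anova(fit, ddf = "Kenward-Roger")   # Kenward-Roger df via the pbkrtest package
step(fit)                           # backward elimination of random, then fixed effects
```

Note that `step()` here dispatches to lmerTest's method for fitted mixed models, not the base-R stepwise-selection function.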

12,305 citations


Book
30 May 2017
TL;DR: A book-length introduction to linear models, generalized linear models, and generalized additive models, covering their theory and geometry as well as their practical use in R, including mixed models and GAMMs via the mgcv package.
Abstract: Contents: LINEAR MODELS: A simple linear model; Linear models in general; The theory of linear models; The geometry of linear modelling; Practical linear models; Practical modelling with factors; General linear model specification in R; Further linear modelling theory; Exercises. GENERALIZED LINEAR MODELS: The theory of GLMs; Geometry of GLMs; GLMs with R; Likelihood; Exercises. INTRODUCING GAMS: Introduction; Univariate smooth functions; Additive models; Generalized additive models; Summary; Exercises. SOME GAM THEORY: Smoothing bases; Setting up GAMs as penalized GLMs; Justifying P-IRLS; Degrees of freedom and residual variance estimation; Smoothing parameter estimation criteria; Numerical GCV/UBRE: performance iteration; Numerical GCV/UBRE optimization by outer iteration; Distributional results; Confidence interval performance; Further GAM theory; Other approaches to GAMs; Exercises. GAMs IN PRACTICE: mgcv: Cherry trees again; Brain imaging example; Air pollution in Chicago example; Mackerel egg survey example; Portuguese larks example; Other packages; Exercises. MIXED MODELS AND GAMMs: Mixed models for balanced data; Linear mixed models in general; Linear mixed models in R; Generalized linear mixed models; GLMMs with R; Generalized additive mixed models; GAMMs with R; Exercises. APPENDICES: A, Some matrix algebra; B, Solutions to exercises. Bibliography. Index.
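The book's progression from GAMs to generalized additive mixed models can be sketched with the mgcv package (simulated data; the grouping variable is invented for illustration):

```r
library(mgcv)

set.seed(2)
dat <- gamSim(eg = 1, n = 200)   # simulated example data shipped with mgcv

# A generalized additive model: smooth functions of several covariates
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
summary(b)

# A generalized additive mixed model: a smooth plus a random intercept,
# fitted via gamm(), which calls nlme::lme under the hood
dat$g <- factor(rep(1:20, each = 10))
bm <- gamm(y ~ s(x1), random = list(g = ~ 1), data = dat)
summary(bm$gam)   # the smooth (GAM) part
summary(bm$lme)   # the mixed-model part
```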

8,393 citations


Journal ArticleDOI
TL;DR: The glmmTMB package fits many types of GLMMs and extensions, including models with continuously distributed responses, but here the authors focus on count responses; among packages that fit zero-inflated mixed models, its ability to estimate the Conway-Maxwell-Poisson distribution parameterized by the mean is unique.
Abstract: Count data can be analyzed using generalized linear mixed models when observations are correlated in ways that require random effects. However, count data are often zero-inflated, containing more zeros than would be expected from the typical error distributions. We present a new package, glmmTMB, and compare it to other R packages that fit zero-inflated mixed models. The glmmTMB package fits many types of GLMMs and extensions, including models with continuously distributed responses, but here we focus on count responses. glmmTMB is faster than glmmADMB, MCMCglmm, and brms, and more flexible than INLA and mgcv for zero-inflated modeling. One unique feature of glmmTMB (among packages that fit zero-inflated mixed models) is its ability to estimate the Conway-Maxwell-Poisson distribution parameterized by the mean. Overall, its most appealing features for new users may be the combination of speed, flexibility, and its interface's similarity to lme4.
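A sketch of a zero-inflated count GLMM in glmmTMB, on simulated data (the site/count setup is invented for illustration):

```r
library(glmmTMB)

set.seed(3)
d <- data.frame(site = factor(rep(1:15, each = 10)), x = rnorm(150))
d$count <- rpois(150, exp(0.5 + 0.3 * d$x))
d$count[rbinom(150, 1, 0.2) == 1] <- 0   # inject excess zeros

# Zero-inflated Poisson GLMM: ziformula = ~ 1 adds a constant
# zero-inflation probability on the logit scale
fit <- glmmTMB(count ~ x + (1 | site), ziformula = ~ 1,
               family = poisson, data = d)
summary(fit)
```

Using `family = compois()` instead would fit the mean-parameterized Conway-Maxwell-Poisson distribution highlighted in the abstract.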

4,497 citations


Journal ArticleDOI
TL;DR: It is recommended that block cross-validation be used wherever dependence structures exist in a dataset, even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.
Abstract: Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches are increasingly accounting for such dependencies. However, when performing cross-validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause of the poor performance of uncorrected (random) cross-validation, often noted by modellers, is dependence structures in the data that persist as dependence structures in model residuals, violating the assumption of independence. Even more concerning, because often overlooked, is that structured data also provide ample opportunity for overfitting with non-causal predictors. This problem can persist even if remedies such as autoregressive models, generalized least squares, or mixed models are used. Block cross-validation, where data are split strategically rather than randomly, can address these issues. However, the blocking strategy must be carefully considered. Blocking in space, time, random effects or phylogenetic distance, while accounting for dependencies in the data, may also unwittingly induce extrapolations by restricting the ranges or combinations of predictor variables available for model training, thus overestimating interpolation errors. On the other hand, deliberate blocking in predictor space may also improve error estimates when extrapolation is the modelling goal. Here, we review the ecological literature on non-random and blocked cross-validation approaches. We also provide a series of simulations and case studies, in which we show that, for all instances tested, block cross-validation is nearly universally more appropriate than random cross-validation if the goal is predicting to new data or predictor space, or for selecting causal predictors.
We recommend that block cross-validation be used wherever dependence structures exist in a dataset, even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.
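A base-R sketch of the core point (the data-generating process and all names are invented): when a non-causal block-level predictor is available, random cross-validation can reward overfitting that blocked cross-validation exposes.

```r
set.seed(4)
n_block <- 10
g <- rep(1:n_block, each = 20)       # block membership (e.g. sites or years)
u <- rnorm(n_block)[g]               # latent block effect
z <- rnorm(n_block)[g]               # non-causal block-level predictor
x <- rnorm(200)
y <- 1 + 2 * x + u + rnorm(200)      # z plays no role in the truth
d <- data.frame(y, x, z, g)

cv_rmse <- function(folds) {
  errs <- sapply(unique(folds), function(k) {
    fit <- lm(y ~ x + z, data = d[folds != k, ])
    mean((d$y[folds == k] - predict(fit, d[folds == k, ]))^2)
  })
  sqrt(mean(errs))
}

random_folds <- sample(rep(1:10, length.out = 200))  # ignores block structure
block_folds  <- g                                    # leave-one-block-out

cv_rmse(random_folds)   # typically optimistic: z can absorb block effects
                        # shared between training and test rows
cv_rmse(block_folds)    # honest estimate for predicting to unseen blocks
```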

998 citations


Journal ArticleDOI
TL;DR: This paper constitutes a companion paper to the R package lcmm by introducing each family of models, the estimation technique, some implementation details and giving examples through a dataset on cognitive aging.
Abstract: The R package lcmm provides a series of functions to estimate statistical models based on linear mixed model theory. It includes the estimation of mixed models and latent class mixed models for Gaussian longitudinal outcomes (hlme), curvilinear and ordinal univariate longitudinal outcomes (lcmm) and curvilinear multivariate outcomes (multlcmm), as well as joint latent class mixed models (Jointlcmm) for a (Gaussian or curvilinear) longitudinal outcome and a time-to-event outcome that may be left-truncated and right-censored and defined in a competing risks setting. Maximum likelihood estimators are obtained using a modified Marquardt algorithm with strict convergence criteria based on the stability of the parameters and the likelihood, and on the negativity of the second derivatives. The package also provides various post-fit functions including goodness-of-fit analyses, classification, plots, predicted trajectories, individual dynamic prediction of the event and predictive accuracy assessment. This paper constitutes a companion paper to the package, introducing each family of models, the estimation technique, some implementation details, and examples based on a dataset on cognitive aging.
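A sketch of the hlme workflow on simulated longitudinal data (subject IDs, variable names, and the two-class structure are invented; `B = m1` is a common idiom that supplies the one-class fit as initial values for the multi-class model):

```r
library(lcmm)

set.seed(5)
n  <- 50
d  <- data.frame(ID = rep(1:n, each = 5), Time = rep(0:4, times = n))
cl <- rep(rbinom(n, 1, 0.5), each = 5)                 # latent class labels
d$Y <- 10 + ifelse(cl == 1, 1.5, -0.5) * d$Time + rnorm(nrow(d))

# ng = 1: an ordinary linear mixed model
m1 <- hlme(Y ~ Time, random = ~ Time, subject = "ID", ng = 1, data = d)

# ng = 2: a latent class mixed model; mixture = ~ Time lets the slope
# differ between classes, and B = m1 provides starting values
m2 <- hlme(Y ~ Time, random = ~ Time, subject = "ID", ng = 2,
           mixture = ~ Time, data = d, B = m1)

summarytable(m1, m2)   # log-likelihood, BIC, and class proportions
postprob(m2)           # posterior classification of subjects
```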

470 citations


Journal ArticleDOI
TL;DR: A fast multi‐locus random‐SNP‐effect EMMA (FASTmrEMMA) model for GWAS, built on random single nucleotide polymorphism (SNP) effects and a new algorithm that whitens the covariance matrix of the polygenic matrix K and environmental noise, and specifies the number of nonzero eigenvalues as one.
Abstract: The mixed linear model has been widely used in genome-wide association studies (GWAS), but its application to multi-locus GWAS analysis has not been explored and assessed. Here, we implemented a fast multi-locus random-SNP-effect EMMA (FASTmrEMMA) model for GWAS. The model is built on random single nucleotide polymorphism (SNP) effects and a new algorithm. This algorithm whitens the covariance matrix of the polygenic matrix K and environmental noise, and specifies the number of nonzero eigenvalues as one. The model first chooses all putative quantitative trait nucleotides (QTNs) with P-values ≤ 0.005 and then includes them in a multi-locus model for true QTN detection. Owing to the multi-locus feature, the Bonferroni correction is replaced by a less stringent selection criterion. Results from analyses of both simulated and real data showed that FASTmrEMMA is more powerful in QTN detection and model fit, has less bias in QTN effect estimation and requires less running time than existing single- and multi-locus methods, such as empirical Bayes, settlement of mixed linear model under progressively exclusive relationship (SUPER), efficient mixed model association (EMMA), compressed MLM (CMLM) and enriched CMLM (ECMLM). FASTmrEMMA provides an alternative for multi-locus GWAS.

201 citations


Journal ArticleDOI
TL;DR: MQS is based on the method of moments and the minimal norm quadratic unbiased estimation (MINQUE) criterion, and brings two seemingly unrelated methods-the renowned Haseman-Elston (HE) regression and the recent LD score regression (LDSC)-into the same unified statistical framework.
Abstract: Linear mixed models (LMMs) are among the most commonly used tools for genetic association studies. However, the standard method for estimating variance components in LMMs-the restricted maximum likelihood estimation method (REML)-suffers from several important drawbacks: REML requires individual-level genotypes and phenotypes from all samples in the study, is computationally slow, and produces downward-biased estimates in case control studies. To remedy these drawbacks, we present an alternative framework for variance component estimation, which we refer to as MQS. MQS is based on the method of moments (MoM) and the minimal norm quadratic unbiased estimation (MINQUE) criterion, and brings two seemingly unrelated methods-the renowned Haseman-Elston (HE) regression and the recent LD score regression (LDSC)-into the same unified statistical framework. With this new framework, we provide an alternative but mathematically equivalent form of HE that allows for the use of summary statistics. We provide an exact estimation form of LDSC to yield unbiased and statistically more efficient estimates. A key feature of our method is its ability to pair marginal z-scores computed using all samples with SNP correlation information computed using a small random subset of individuals (or individuals from a proper reference panel), while producing estimates that can be almost as accurate as if both quantities were computed using the full data. As a result, our method produces unbiased and statistically efficient estimates, and makes use of summary statistics, while remaining computationally efficient for large data sets. Using simulations and applications to 37 phenotypes from 8 real data sets, we illustrate the benefits of our method for estimating and partitioning SNP heritability in population studies as well as for heritability estimation in family studies. Our method is implemented in the GEMMA software package, freely available at www.xzlab.org/software.html.

118 citations


Journal ArticleDOI
TL;DR: The results demonstrate that each method leads to unbiased treatment effect estimates; based on precision of estimates, 95% coverage probability, and power, ANCOVA modeling of either the change score or the post-treatment score as the outcome proves to be the most effective.
Abstract: Often repeated measures data are summarized into pre- and post-treatment measurements. Various methods exist in the literature for estimating and testing the treatment effect, including ANOVA, analysis of covariance (ANCOVA), and linear mixed modeling (LMM). Under the first two methods, the outcome can be modeled either as the post-treatment measurement (ANOVA-POST or ANCOVA-POST) or as a change score between pre and post measurements (ANOVA-CHANGE or ANCOVA-CHANGE). In LMM, the outcome is modeled as a vector of responses with or without the Kenward-Roger adjustment. We consider five methods common in the literature, and discuss them in terms of supporting simulations and theoretical derivations of variance. Consistent with existing literature, our results demonstrate that each method leads to unbiased treatment effect estimates; based on precision of estimates, 95% coverage probability, and power, ANCOVA modeling of either the change score or the post-treatment score as the outcome proves to be the most effective. We further demonstrate each method on a real data example to exemplify comparisons in a real clinical context.
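The five estimators compared above can be sketched in R on simulated pre/post data (all names and the true effect size of 5 are invented for illustration):

```r
set.seed(6)
n     <- 100
treat <- factor(rep(0:1, each = n / 2))
pre   <- rnorm(n, 50, 10)
post  <- 0.8 * pre + 5 * (treat == 1) + rnorm(n, 0, 5)   # true effect = 5
d     <- data.frame(id = 1:n, treat, pre, post, change = post - pre)

coef(lm(post   ~ treat,       data = d))["treat1"]  # ANOVA-POST
coef(lm(change ~ treat,       data = d))["treat1"]  # ANOVA-CHANGE
coef(lm(post   ~ pre + treat, data = d))["treat1"]  # ANCOVA-POST
coef(lm(change ~ pre + treat, data = d))["treat1"]  # ANCOVA-CHANGE

# LMM: stack the two measurements and read the effect off the
# time-by-treatment interaction
long <- reshape(d, direction = "long", varying = c("pre", "post"),
                v.names = "y", times = c(0, 1), timevar = "time", idvar = "id")
library(nlme)
fit <- lme(y ~ time * treat, random = ~ 1 | id, data = long)
fixef(fit)["time:treat1"]
```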

103 citations


Journal ArticleDOI
TL;DR: In this article, the authors use least squares-based linear models (LMs) together with restricted maximum likelihood-based mixed models (MMs) for the analysis of hierarchical data.
Abstract: Aims: The aim of this guide is to provide practical help for ecologists who analyze data from biodiversity–ecosystem functioning experiments. Our approach differs from others in the use of least squares-based linear models (LMs) together with restricted maximum likelihood-based mixed models (MMs) for the analysis of hierarchical data. An original data set containing diameter and height of young trees grown in monocultures, 2- or 4-species mixtures under ambient light or shade is used as an example. Methods: Starting with a simple LM, basic features of model fitting and the subsequent analysis of variance (ANOVA) for significance tests are summarized. From this, more complex models are developed. We use the statistical software R for model fitting and to demonstrate similarities and complementarities between LMs and MMs. The formation of contrasts and the use of error (LMs) or random-effects (MMs) terms to account for hierarchical data structure in ANOVAs are explained. Important Findings: Data from biodiversity experiments can be analyzed at the level of entire plant communities (plots) and plant individuals. The basic explanatory term is species composition, which can be divided into contrasts in many ways depending on specific biological hypotheses. Typically, these contrasts code for aspects of species richness or the presence of particular species. For significance tests in ANOVAs, contrast terms generally are compared with remaining variation of the explanatory terms from which they have been ‘carved out’. Once a final model has been selected, parameters (e.g. means or slopes for fixed-effects terms and variance components for error or random-effects terms) can be estimated to indicate the direction and size of effects.
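The correspondence between error terms in LMs and random-effects terms in MMs described above can be sketched in R (the plot layout and richness levels are invented for illustration):

```r
library(lme4)

set.seed(7)
d <- expand.grid(plot = factor(1:24), tree = 1:5)
d$richness <- factor(rep(c(1, 2, 4), each = 8))[d$plot]   # species richness per plot
plot_eff   <- rnorm(24, sd = 0.5)[d$plot]
d$height   <- 2 + 0.3 * as.numeric(d$richness) + plot_eff +
              rnorm(nrow(d), sd = 0.3)

# LM route: aov() with an Error() stratum, so richness is tested
# against between-plot variation
summary(aov(height ~ richness + Error(plot), data = d))

# MM route: the same structure expressed as a random intercept for plot
fit <- lmer(height ~ richness + (1 | plot), data = d)
anova(fit)
```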

82 citations


Journal ArticleDOI
TL;DR: A flexible and user-friendly spatial method called SpATS performed comparably to more elaborate and trial-specific spatial models in a series of sorghum breeding trials, and should be considered as an efficient and easy-to-use alternative for routine analyses of plant breeding trials.
Abstract: A flexible and user-friendly spatial method called SpATS performed comparably to more elaborate and trial-specific spatial models in a series of sorghum breeding trials. Adjustment for spatial trends in plant breeding field trials is essential for efficient evaluation and selection of genotypes. Current mixed model methods of spatial analysis are based on a multi-step modelling process where global and local trends are fitted after trying several candidate spatial models. This paper reports the application of a novel spatial method that accounts for all types of continuous field variation in a single modelling step by fitting a smooth surface. The method uses two-dimensional P-splines with anisotropic smoothing formulated in the mixed model framework, referred to as SpATS model. We applied this methodology to a series of large and partially replicated sorghum breeding trials. The new model was assessed in comparison with the more elaborate standard spatial models that use autoregressive correlation of residuals. The improvements in precision and the predictions of genotypic values produced by the SpATS model were equivalent to those obtained using the best fitting standard spatial models for each trial. One advantage of the approach with SpATS is that all patterns of spatial trend and genetic effects were modelled simultaneously by fitting a single model. Furthermore, we used a flexible model to adequately adjust for field trends. This strategy reduces potential parameter identification problems and simplifies the model selection process. Therefore, the new method should be considered as an efficient and easy-to-use alternative for routine analyses of plant breeding trials.
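A hedged sketch of a SpATS call on an invented field layout (the trait, genotype labels, and grid dimensions are made up; argument names should be checked against the package help):

```r
library(SpATS)

set.seed(8)
field <- expand.grid(col = 1:10, row = 1:20)
field$geno  <- factor(sample(paste0("G", 1:50), nrow(field), replace = TRUE))
trend       <- outer(1:10, 1:20, function(c, r) 0.05 * c + 0.03 * r)
field$yield <- 10 + rnorm(50)[field$geno] + as.vector(trend) + rnorm(nrow(field))

# One-step spatial adjustment: a smooth anisotropic 2D P-spline surface
# plus random genotype effects
fit <- SpATS(response = "yield", genotype = "geno",
             genotype.as.random = TRUE,
             spatial = ~ PSANOVA(col, row, nseg = c(10, 20)),
             data = field)
plot(fit)   # fitted spatial trend, residuals, and genotype predictions
```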

81 citations


Journal ArticleDOI
TL;DR: A Poisson mixed model with two random effects terms that account for both independent over-dispersion and sample non-independence is presented and a scalable sampling-based inference algorithm using a latent variable representation of the Poisson distribution is developed.
Abstract: Identifying differentially expressed (DE) genes from RNA sequencing (RNAseq) studies is among the most common analyses in genomics. However, RNAseq DE analysis presents several statistical and computational challenges, including over-dispersed read counts and, in some settings, sample non-independence. Previous count-based methods rely on simple hierarchical Poisson models (e.g. negative binomial) to model independent over-dispersion, but do not account for sample non-independence due to relatedness, population structure and/or hidden confounders. Here, we present a Poisson mixed model with two random effects terms that account for both independent over-dispersion and sample non-independence. We also develop a scalable sampling-based inference algorithm using a latent variable representation of the Poisson distribution. With simulations, we show that our method properly controls type I error and is generally more powerful than other widely used approaches, except in small samples (n < 15) with other unfavorable properties (e.g. small effect sizes). We also apply our method to three real datasets that contain related individuals, population stratification or hidden confounders. Our results show that our method increases power in all three datasets compared to other approaches, though the power gain is smallest in the smallest sample (n = 6). Our method is implemented in MACAU, freely available at www.xzlab.org/software.html.

Journal ArticleDOI
TL;DR: A system of multiphase non-linear mixed-effects models is presented to model temporal patterns of longitudinal continuous measurements, with temporal decomposition to identify the phases and the risk factors within each phase.
Abstract: In the medical sciences, we often encounter longitudinal temporal relationships that are non-linear in nature. The influence of risk factors may also change across longitudinal follow-up. A system of multiphase non-linear mixed-effects models is presented to model temporal patterns of longitudinal continuous measurements, with temporal decomposition to identify the phases and the risk factors within each phase. Application of this model is illustrated using spirometry data after lung transplantation, using readily available statistical software. This application illustrates the usefulness of our flexible model when dealing with complex non-linear patterns and time-varying coefficients.

Journal ArticleDOI
TL;DR: Mixed-effects models account for the intra-class correlation in Sholl analysis of data sampled from multiple neurons per animal, leading to correct inference where treating the measurements as independent would not.

Journal ArticleDOI
TL;DR: This study develops landscape-directed simulations and tests a series of replicates that emulate independent empirical datasets of two species with different life history characteristics, helping to establish methods for using linear mixed models to identify the features underlying patterns of dispersal across a variety of landscapes.
Abstract: Dispersal can impact population dynamics and geographic variation, and thus, genetic approaches that can establish which landscape factors influence population connectivity have ecological and evolutionary importance. Mixed models that account for the error structure of pairwise datasets are increasingly used to compare models relating genetic differentiation to pairwise measures of landscape resistance. A model selection framework based on information criteria metrics or explained variance may help disentangle the ecological and landscape factors influencing genetic structure, yet there is currently no consensus on the best protocols. Here, we develop landscape-directed simulations and test a series of replicates that emulate independent empirical datasets of two species with different life history characteristics (greater sage-grouse; eastern foxsnake). We determined that in our simulated scenarios, AIC and BIC were the best model selection indices and that marginal R2 values were biased toward more complex models. The model coefficients for landscape variables generally reflected the underlying dispersal model with confidence intervals that did not overlap with zero across the entire model set. When we controlled for geographic distance, variables not in the underlying dispersal models (i.e., nontrue) typically overlapped zero. Our study helps establish methods for using linear mixed models to identify the features underlying patterns of dispersal across a variety of landscapes.

Journal ArticleDOI
TL;DR: In the SWTs simulated here, mixed‐effect models were highly sensitive to departures from the model assumptions, which can be explained by the high dependence on within‐cluster comparisons.
Abstract: Many stepped wedge trials (SWTs) are analysed by using a mixed-effect model with a random intercept and fixed effects for the intervention and time periods (referred to here as the standard model). However, it is not known whether this model is robust to misspecification. We simulated SWTs with three groups of clusters and two time periods; one group received the intervention during the first period and two groups in the second period. We simulated period and intervention effects that were either common-to-all or varied-between clusters. Data were analysed with the standard model or with additional random effects for period effect or intervention effect. In a second simulation study, we explored the weight given to within-cluster comparisons by simulating a larger intervention effect in the group of the trial that experienced both the control and intervention conditions and applying the three analysis models described previously. Across 500 simulations, we computed bias and confidence interval coverage of the estimated intervention effect. We found up to 50% bias in intervention effect estimates when period or intervention effects varied between clusters and were treated as fixed effects in the analysis. All misspecified models showed undercoverage of 95% confidence intervals, particularly the standard model. A large weight was given to within-cluster comparisons in the standard model. In the SWTs simulated here, mixed-effect models were highly sensitive to departures from the model assumptions, which can be explained by the high dependence on within-cluster comparisons. Trialists should consider including a random effect for time period in their SWT analysis model. © 2017 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
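The recommendation above, adding a random time-period effect to the standard SWT model, can be sketched with lme4 on simulated data (cluster counts, switch pattern, and effect sizes are invented):

```r
library(lme4)

set.seed(9)
d <- expand.grid(cluster = factor(1:12), period = factor(1:2), subj = 1:20)
# one group of clusters switches to intervention in period 1, the rest in period 2
d$treat <- as.integer(as.integer(d$cluster) <= 4 | d$period == "2")
u  <- rnorm(12, sd = 0.5)[d$cluster]            # cluster effect
cp <- interaction(d$cluster, d$period)
up <- rnorm(nlevels(cp), sd = 0.5)[cp]          # cluster-by-period effect
d$y <- 0.3 * d$treat + u + up + rnorm(nrow(d))

# "standard model": random intercept only
standard <- lmer(y ~ treat + period + (1 | cluster), data = d)

# recommended: additionally a random cluster-by-period effect
robust <- lmer(y ~ treat + period + (1 | cluster) + (1 | cluster:period),
               data = d)
fixef(standard)["treat"]
fixef(robust)["treat"]
```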

Journal ArticleDOI
TL;DR: It was concluded that INLA-SPDE had the potential to map the spatial distribution of environmental variables along with their posterior marginal distributions for environmental management and some drawbacks were identified, including artefacts of model response due to the use of triangle meshes and a longer computational time when dealing with non-Gaussian likelihood families.

Journal Article
TL;DR: A generic Bayesian mixed-effects model to estimate the temporal progression of a biological phenomenon from observations obtained at multiple time points for a group of individuals and shows that the estimated spatiotemporal transformations effectively put into correspondence significant events in the progression of individuals.
Abstract: We propose a generic Bayesian mixed-effects model to estimate the temporal progression of a biological phenomenon from observations obtained at multiple time points for a group of individuals. The progression is modeled by continuous trajectories in the space of measurements. Individual trajectories of progression result from spatiotemporal transformations of an average trajectory. These transformations make it possible to quantify the changes in direction and pace at which the trajectories are followed. The framework of Riemannian geometry allows the model to be used with any kind of measurements with smooth constraints. A stochastic version of the Expectation-Maximization algorithm is used to produce maximum a posteriori estimates of the parameters. We evaluate our method using series of neuropsychological test scores from patients with mild cognitive impairments later diagnosed with Alzheimer's disease, and simulated evolutions of symmetric positive definite matrices. The data-driven model of the impairment of cognitive functions shows the variability in the ordering and timing of the decline of these functions in the population. We also show that the estimated spatiotemporal transformations effectively put into correspondence significant events in the progression of individuals.

Journal ArticleDOI
TL;DR: The results suggested that similar performance can be expected as long as there are at least 20 studies and these are approximately balanced across categories, unless the residual between-studies variances are clearly different and there are enough studies in each category to obtain precise separate estimates.
Abstract: Subgroup analyses allow us to examine the influence of a categorical moderator on the effect size in meta-analysis. We conducted a simulation study using a dichotomous moderator, and compared the impact of pooled versus separate estimates of the residual between-studies variance on the statistical performance of the Q B(P) and Q B(S) tests for subgroup analyses assuming a mixed-effects model. Our results suggested that similar performance can be expected as long as there are at least 20 studies and these are approximately balanced across categories. Conversely, when subgroups were unbalanced, the practical consequences of having heterogeneous residual between-studies variances were more evident, with both tests leading to the wrong statistical conclusion more often than in the conditions with balanced subgroups. A pooled estimate should be preferred for most scenarios, unless the residual between-studies variances are clearly different and there are enough studies in each category to obtain precise separate estimates.
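The pooled-versus-separate contrast above can be sketched with the metafor package (the effect sizes, subgroup labels, and variances are simulated for illustration):

```r
library(metafor)

set.seed(10)
k   <- 24
grp <- factor(rep(c("A", "B"), each = k / 2))
vi  <- runif(k, 0.01, 0.1)                      # within-study variances
yi  <- ifelse(grp == "A", 0.2, 0.5) + rnorm(k, sd = sqrt(0.05 + vi))

# Pooled residual tau^2: one mixed-effects meta-regression with the
# moderator; the QM test plays the role of the Q_B(P) subgroup test
pooled <- rma(yi, vi, mods = ~ grp)
pooled

# Separate tau^2 per subgroup: fit each category on its own
sepA <- rma(yi[grp == "A"], vi[grp == "A"])
sepB <- rma(yi[grp == "B"], vi[grp == "B"])
c(A = sepA$tau2, B = sepB$tau2)
```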

Journal ArticleDOI
TL;DR: Simulations demonstrate regularized PQL outperforms several currently employed methods for joint selection even if the cluster size is small compared to the number of clusters, while also offering dramatic reductions in computation time.
Abstract: The application of generalized linear mixed models presents some major challenges for both estimation, due to the intractable marginal likelihood, and model selection, as we usually want to jointly select over both fixed and random effects. We propose to overcome these challenges by combining penalized quasi-likelihood (PQL) estimation with sparsity inducing penalties on the fixed and random coefficients. The resulting approach, referred to as regularized PQL, is a computationally efficient method for performing joint selection in mixed models. A key aspect of regularized PQL involves the use of a group based penalty for the random effects: sparsity is induced such that all the coefficients for a random effect are shrunk to zero simultaneously, which in turn leads to the random effect being removed from the model. Despite being a quasi-likelihood approach, we show that regularized PQL is selection consistent, that is, it asymptotically selects the true set of fixed and random effects, in the setti...

Journal ArticleDOI
TL;DR: In this paper, a quantile parametric mixed regression model for bounded response variables is presented, based on the distribution introduced by [27]; a Bayesian approach is adopted for inference using Markov Chain Monte Carlo (MCMC) methods.
Abstract: Bounded response variables are common in many applications where the responses are percentages, proportions, or rates. New regression models have been proposed recently to model the relationship among one or more covariates and the conditional mean of a response variable based on the beta distribution or a mixture of beta distributions. However, when we are interested in knowing how covariates impact different levels of the response variable, quantile regression models play an important role. A new quantile parametric mixed regression model for bounded response variables is presented by considering the distribution introduced by [27]. A Bayesian approach is adopted for inference using Markov Chain Monte Carlo (MCMC) methods. Model comparison criteria are also discussed. The inferential methods can be easily programmed and then easily used for data modeling. Results from a simulation study are reported showing the good performance of the proposed inferential methods. Furthermore, results from data analyses using regression models with fixed and mixed effects are given. Specifically, we show that the quantile parametric model proposed here is an alternative and complementary modeling tool for bounded response variables such as the poverty index in Brazilian municipalities, which is linked to the Gini coefficient and the human development index.

Journal ArticleDOI
TL;DR: A novel diagnostic test based on the so-called gradient function proposed by Verbeke and Molenberghs (2013) is introduced to assess the random-effects distribution and can be used to check the adequacy of any distribution for random effects in a wide class of mixed models.
Abstract: It is traditionally assumed that the random effects in mixed models follow a multivariate normal distribution, making likelihood-based inferences more feasible theoretically and computationally. However, this assumption does not necessarily hold in practice which may lead to biased and unreliable results. We introduce a novel diagnostic test based on the so-called gradient function proposed by Verbeke and Molenberghs (2013) to assess the random-effects distribution. We establish asymptotic properties of our test and show that, under a correctly specified model, the proposed test statistic converges to a weighted sum of independent chi-squared random variables each with one degree of freedom. The weights, which are eigenvalues of a square matrix, can be easily calculated. We also develop a parametric bootstrap algorithm for small samples. Our strategy can be used to check the adequacy of any distribution for random effects in a wide class of mixed models, including linear mixed models, generalized linear mixed models, and non-linear mixed models, with univariate as well as multivariate random effects. Both asymptotic and bootstrap proposals are evaluated via simulations and a real data analysis of a randomized multicenter study on toenail dermatophyte onychomycosis.

Journal ArticleDOI
TL;DR: The developed methodology is applied to estimate the proportion of people under the poverty line by counties and sex in Galicia (a region in the north-west of Spain).

Journal ArticleDOI
TL;DR: The aim is to demonstrate the value of using mixture models to describe variation in individual life‐history tactics within a population, and to promote the use of these models by ecologists and evolutionary ecologists.
Abstract: Mixed models are now well-established methods in ecology and evolution because they allow accounting for and quantifying within- and between-individual variation. However, the required normal distribution of the random effects can often be violated by the presence of clusters among subjects, which leads to multi-modal distributions. In such cases, using what is known as mixture regression models might offer a more appropriate approach. These models are widely used in psychology, sociology, and medicine to describe the diversity of trajectories occurring within a population over time (e.g. psychological development, growth). In ecology and evolution, however, these models are seldom used even though understanding changes in individual trajectories is an active area of research in life-history studies. Our aim is to demonstrate the value of using mixture models to describe variation in individual life-history tactics within a population, and hence to promote the use of these models by ecologists and evolutionary ecologists. We first ran a set of simulations to determine whether and when a mixture model allows teasing apart latent clustering, and to contrast the precision and accuracy of estimates obtained from mixture models versus mixed models under a wide range of ecological contexts. We then used empirical data from long-term studies of large mammals to illustrate the potential of using mixture models for assessing within-population variation in life-history tactics. Mixture models performed well in most cases, except for variables following a Bernoulli distribution and when sample size was small. The four selection criteria we evaluated [Akaike information criterion (AIC), Bayesian information criterion (BIC), and two bootstrap methods] performed similarly well, selecting the right number of clusters in most ecological situations. 
We then showed that the normality of random effects implicitly assumed by evolutionary ecologists when using mixed models was often violated in life-history data. Mixed models were quite robust to this violation in the sense that fixed effects were unbiased at the population level. However, fixed effects at the cluster level and random effects were better estimated using mixture models. Our empirical analyses demonstrated that using mixture models facilitates the identification of the diversity of growth and reproductive tactics occurring within a population. Therefore, using this modelling framework allows testing for the presence of clusters and, when clusters occur, provides reliable estimates of fixed and random effects for each cluster of the population. In the presence or expectation of clusters, using mixture models offers a suitable extension of mixed models, particularly when evolutionary ecologists aim at identifying how ecological and evolutionary processes change within a population. Mixture regression models therefore provide a valuable addition to the statistical toolbox of evolutionary ecologists. As these models are complex and have their own limitations, we provide recommendations to guide future users.
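The paper's models are fit to longitudinal life-history data, but the core idea of selecting the number of latent clusters with an information criterion can be sketched with a minimal univariate Gaussian mixture fit by EM. Everything below (data, starting values, the EM loop) is an illustrative stand-in, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two latent clusters of individual effects (e.g. two life-history tactics).
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 300)])

def fit_gmm(y, k, n_iter=200):
    """Minimal EM for a univariate Gaussian mixture; returns (loglik, BIC)."""
    n = y.size
    mu = np.quantile(y, np.linspace(0.1, 0.9, k))    # spread-out starting means
    sigma = np.full(k, y.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        dens = pi * np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of mixture weights, means, and variances.
        nk = resp.sum(axis=0)
        pi, mu = nk / n, (resp * y[:, None]).sum(axis=0) / nk
        sigma = np.maximum(np.sqrt((resp * (y[:, None] - mu) ** 2).sum(axis=0) / nk), 1e-6)
    loglik = np.log(dens.sum(axis=1)).sum()
    n_params = 3 * k - 1                             # k means, k sds, k-1 free weights
    return loglik, n_params * np.log(n) - 2 * loglik

# BIC should prefer the true two-cluster structure over a single Gaussian.
bics = {k: fit_gmm(y, k)[1] for k in (1, 2, 3)}
print(bics)
```

The same logic extends to mixtures of regressions: only the component densities change, while the E-step/M-step alternation and the information-criterion comparison stay the same.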

Journal ArticleDOI
TL;DR: This paper examines the identification problem in age-period-cohort models that use either linear or categorically coded ages, periods, and cohorts or combinations of these parameterizations and shows how statistical model identification comes about in mixed models and why which effects are treated as fixed and which are treated as random can substantially change the estimates of the age, period, and cohort effects.
Abstract: This paper examines the identification problem in age-period-cohort models that use either linear or categorically coded ages, periods, and cohorts or combinations of these parameterizations. These models are not identified using the traditional fixed effect regression model approach because of a linear dependency between the ages, periods, and cohorts. However, these models can be identified if the researcher introduces a single just identifying constraint on the model coefficients. The problem with such constraints is that the results can differ substantially depending on the constraint chosen. Somewhat surprisingly, age-period-cohort models that specify one or more of ages and/or periods and/or cohorts as random effects are identified. This is the case without introducing an additional constraint. I label this identification as statistical model identification and show how statistical model identification comes about in mixed models and why which effects are treated as fixed and which are treated as random can substantially change the estimates of the age, period, and cohort effects. Copyright © 2017 John Wiley & Sons, Ltd.
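The linear dependency that breaks identification in the fixed-effects specification is easy to demonstrate: with linearly coded effects, cohort = period - age, so the design matrix is rank-deficient. A small numpy illustration with made-up ages and periods:

```python
import numpy as np

# Linearly coded age, period, and cohort for a synthetic panel.
age = np.repeat(np.arange(20, 60), 10)
period = np.tile(np.arange(2000, 2010), 40)
cohort = period - age                      # exact linear dependency

# Fixed-effects design matrix with intercept: one column is redundant.
X = np.column_stack([np.ones(age.size), age, period, cohort])
print(np.linalg.matrix_rank(X))            # 3, not 4: the model is not identified
```

Any single additional constraint on the coefficients restores full rank, which is why the estimates depend so heavily on which constraint the researcher chooses.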

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This paper generalizes the non-linear mixed effects model to the regime where the response variable is manifold-valued, i.e., f: R^d → M, derives the underlying model and estimation schemes, and demonstrates the immediate benefits such a model can provide for both group-level and individual-level analysis on longitudinal brain imaging data.
Abstract: Statistical machine learning models that operate on manifold-valued data are being extensively studied in vision, motivated by applications in activity recognition, feature tracking and medical imaging. While non-parametric methods have been relatively well studied in the literature, efficient formulations for parametric models (which may offer benefits in small sample size regimes) have only emerged recently. So far, manifold-valued regression models (such as geodesic regression) are restricted to the analysis of cross-sectional data, i.e., the so-called fixed effects in statistics. But in most longitudinal analysis (e.g., when a participant provides multiple measurements, over time) the application of fixed effects models is problematic. In an effort to answer this need, this paper generalizes the non-linear mixed effects model to the regime where the response variable is manifold-valued, i.e., f: R^d → M. We derive the underlying model and estimation schemes and demonstrate the immediate benefits such a model can provide — both for group level and individual level analysis — on longitudinal brain imaging data. The direct consequence of our results is that longitudinal analysis of manifold-valued measurements (especially, the symmetric positive definite manifold) can be conducted in a computationally tractable manner.

Journal ArticleDOI
27 Jul 2017-Forests
TL;DR: The stem wood, branches, stem bark, needles, roots and total biomass models for larch were developed at the regional level, using a general allometric equation, a dummy variable model, a mixed effects model, and a Bayesian hierarchical model to select the most effective method for predicting large-scale forest biomass.
Abstract: With the development of national-scale forest biomass monitoring work, accurate estimation of forest biomass on a large scale is becoming an important research topic in forestry. In this study, the stem wood, branch, stem bark, needle, root, and total biomass models for larch were developed at the regional level, using a general allometric equation, a dummy variable model, a mixed effects model, and a Bayesian hierarchical model, to select the most effective method for predicting large-scale forest biomass. Results showed that the total biomass of trees with the same diameter gradually decreased from southern to northern regions in China, except in Hebei province. We found that the stem wood, branch, stem bark, needle, root, and total biomass model relationships were statistically significant (p-values < 0.01) for the general allometric equation, linear mixed model, dummy variable model, and Bayesian hierarchical model, but the linear mixed, dummy variable, and Bayesian hierarchical models showed better performance than the general allometric equation. An F-test also showed significant differences between the models. The average R2 values of the linear mixed model, dummy variable model, and Bayesian hierarchical model were higher than those of the general allometric equation by 0.007, 0.018, 0.015, 0.004, 0.09, and 0.117 for the total tree, root, stem wood, stem bark, branch, and needle models, respectively. However, there were no significant differences between the linear mixed model, dummy variable model, and Bayesian hierarchical model. When the number of categories was increased, the linear mixed model and Bayesian hierarchical model were more flexible and applicable than the dummy variable model for the construction of regional biomass models.
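The general allometric equation mentioned above, B = a·D^b, is typically linearized as ln B = ln a + b·ln D and fit by least squares. A minimal sketch on simulated diameters and biomass (the parameter values are invented, not the paper's estimates):

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic diameters (cm) and biomass from B = a * D^b with multiplicative error.
a_true, b_true = 0.12, 2.4
D = rng.uniform(5, 40, 200)
B = a_true * D ** b_true * np.exp(rng.normal(0, 0.1, 200))

# Log-log linearization: ln B = ln a + b * ln D, fit by ordinary least squares.
b_hat, log_a_hat = np.polyfit(np.log(D), np.log(B), 1)
a_hat = np.exp(log_a_hat)
print(a_hat, b_hat)
```

The mixed-model and hierarchical variants compared in the paper extend exactly this equation by letting a and b (or their logs) vary by region.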

Journal ArticleDOI
TL;DR: In this paper, a mixed generalized Akaike information criterion xGAIC is introduced and validated, derived from a quasi-log-likelihood that focuses on the random effect and the variability between the areas, and from a generalized degree-of-freedom measure, as a model complexity penalty, which is calculated by the bootstrap.
Abstract: A mixed generalized Akaike information criterion xGAIC is introduced and validated. It is derived from a quasi-log-likelihood that focuses on the random effect and the variability between the areas, and from a generalized degree-of-freedom measure, as a model complexity penalty, which is calculated by the bootstrap. To study the performance of xGAIC, we consider three popular mixed models in small area inference: a Fay–Herriot model, a monotone model and a penalized spline model. A simulation study shows the good performance of xGAIC. Moreover, we show its relevance in practice, with two real applications: the estimation of employed people by economic activity and the prevalence of smokers in Galician counties. In the second case, where it is unclear which explanatory variables should be included in the model, the problem of selection between these explanatory variables is solved simultaneously with the problem of the specification of the functional form between the linear, monotone or spline options.
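The generalized degrees-of-freedom penalty in xGAIC is computed by the bootstrap; a closely related Monte Carlo perturbation scheme (in the spirit of Ye's generalized degrees of freedom) can be sketched for a simple linear smoother, where the exact answer is the trace of the hat matrix. The smoother and all settings below are illustrative assumptions, not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(3)

n, p, lam, tau, T = 60, 5, 2.0, 0.5, 400
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def fit(y_):
    """Ridge-type linear smoother standing in for a fitted mixed model."""
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y_)

# Monte Carlo generalized degrees of freedom: perturb y with noise delta and
# measure how much the fitted values co-move with the perturbations.
base = fit(y)
deltas = rng.normal(0, tau, size=(T, n))
fits = np.array([fit(y + d) for d in deltas])
gdf = np.mean(deltas * (fits - base)) * n / tau ** 2

# For this linear smoother the exact value is the trace of the hat matrix.
exact = np.trace(X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T))
print(gdf, exact)
```

The appeal of the perturbation/bootstrap route is that it needs only the fitting procedure as a black box, so the same recipe applies to monotone or spline fits where no closed-form hat matrix exists.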

Journal ArticleDOI
TL;DR: A penalized likelihood method is proposed for ridge estimation of fixed and random effects in the context of Henderson's mixed model equations in the linear mixed model.
Abstract: This paper is concerned with the ridge estimation of fixed and random effects in the context of Henderson's mixed model equations in the linear mixed model. For this purpose, a penalized likelihood method is proposed. A linear combination of the ridge estimators of the fixed and random effects is compared to a linear combination of the best linear unbiased estimators under the mean-square error (MSE) matrix criterion. Additionally, for choosing the biasing parameter, a method based on the MSE of the ridge estimator is given. A real data analysis is provided to illustrate the theoretical results, and a simulation study is conducted to characterize the performance of the ridge and best linear unbiased estimator approaches in the linear mixed model.
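Henderson's mixed model equations with a ridge term added to the fixed-effect block can be sketched directly in numpy. This is a minimal illustration under homoscedastic errors and i.i.d. random intercepts; the penalty constant k and the variance ratio are invented, and this is not the paper's exact penalized-likelihood estimator:

```python
import numpy as np

rng = np.random.default_rng(5)

# Small simulated design: fixed effects X, random group intercepts Z.
n, p, q = 90, 3, 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
groups = rng.integers(0, q, n)
Z = np.eye(q)[groups]
beta, u = np.array([1.0, 2.0, -1.0]), rng.normal(0, 0.8, q)
y = X @ beta + Z @ u + rng.normal(0, 1.0, n)

# Henderson's equations with a ridge term k*I on the fixed-effect block;
# lam = sigma_e^2 / sigma_u^2 is the usual random-effect shrinkage factor.
k, lam = 0.5, 1.0 / 0.64
lhs = np.block([[X.T @ X + k * np.eye(p), X.T @ Z],
                [Z.T @ X, Z.T @ Z + lam * np.eye(q)]])
rhs = np.concatenate([X.T @ y, Z.T @ y])
sol = np.linalg.solve(lhs, rhs)
beta_hat, u_hat = sol[:p], sol[p:]
print(beta_hat, u_hat)
```

Setting k = 0 recovers the standard mixed model equations, so the ridge version is a one-line change to the coefficient matrix.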

Journal ArticleDOI
TL;DR: The structural assumption made here is that there are clusters of units that share the same effects; it is shown how clusters can be identified by tailored regularized estimators, which compete well with random effects estimates even when the random effects model is the data-generating model.
Abstract: Although each statistical unit on which measurements are taken is unique, typically there is not enough information available to fully account for its uniqueness. Therefore, heterogeneity among units has to be limited by structural assumptions. One classical approach is to use random effects models, which assume that heterogeneity can be described by distributional assumptions. However, inference may depend on the assumed mixing distribution, and it is assumed that the random effects and the observed covariates are independent. An alternative considered here is fixed effects models, which let each unit have its own parameter. They are quite flexible but suffer from the large number of parameters. The structural assumption made here is that there are clusters of units that share the same effects. It is shown how clusters can be identified by tailored regularized estimators. Moreover, it is shown that the regularized estimates compete well with estimates for the random effects model, even if the latter is the data generating model. They dominate if clusters are present.
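The paper identifies clusters with tailored regularized estimators; as a crude stand-in for that machinery, one can estimate each unit's own effect and then fuse effects whose estimates are close (here, by splitting at the largest gap). A toy numpy sketch, with simulated units falling into two latent clusters:

```python
import numpy as np

rng = np.random.default_rng(9)

# 30 units, each with 20 observations; unit effects fall into 2 latent clusters.
effects = np.repeat([0.0, 2.5], 15)
y = effects[:, None] + rng.normal(0, 0.5, (30, 20))

# Crude fusion heuristic: estimate each unit's fixed effect by its mean,
# then split the sorted estimates at the largest gap.
unit_means = y.mean(axis=1)
order = np.argsort(unit_means)
gaps = np.diff(unit_means[order])
split = np.argmax(gaps) + 1
clusters = np.zeros(30, dtype=int)
clusters[order[split:]] = 1
print(np.bincount(clusters))
```

A fused-penalty estimator performs this grouping and the effect estimation jointly, but the payoff is the same: far fewer free parameters than one per unit, without assuming a mixing distribution.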

Journal ArticleDOI
TL;DR: The authors proposed a regularization method that can deal with large numbers of candidate generalized linear mixed models (GLMMs) while preserving a hierarchical structure in the effects that needs to be taken into account when performing variable selection.
Abstract: In many applications of generalized linear mixed models (GLMMs), there is a hierarchical structure in the effects that needs to be taken into account when performing variable selection. A prime example of this is when fitting mixed models to longitudinal data, where it is usual for covariates to be included as only fixed effects or as composite (fixed and random) effects. In this article, we propose the first regularization method that can deal with large numbers of candidate GLMMs while preserving this hierarchical structure: CREPE (Composite Random Effects PEnalty) for joint selection in mixed models. CREPE induces sparsity in a hierarchical manner, as the fixed effect for a covariate is shrunk to zero only if the corresponding random effect is or has already been shrunk to zero. In the setting where the number of fixed effects grows at a slower rate than the number of clusters, we show that CREPE is selection consistent for both fixed and random effects, and attains the oracle property. Simulations show that CREPE outperforms some currently available penalized methods for mixed models.