
Showing papers on "Random effects model published in 2017"


Journal ArticleDOI
TL;DR: It is recommended that block cross-validation be used wherever dependence structures exist in a dataset, even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.
Abstract: Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches are increasingly accounting for such dependencies. However, when performing cross-validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause for the poor performance of uncorrected (random) cross-validation, noted often by modellers, is dependence structures in the data that persist as dependence structures in model residuals, violating the assumption of independence. Even more concerning, because often overlooked, is that structured data also provide ample opportunity for overfitting with non-causal predictors. This problem can persist even if remedies such as autoregressive models, generalized least squares, or mixed models are used. Block cross-validation, where data are split strategically rather than randomly, can address these issues. However, the blocking strategy must be carefully considered. Blocking in space, time, random effects or phylogenetic distance, while accounting for dependencies in the data, may also unwittingly induce extrapolations by restricting the ranges or combinations of predictor variables available for model training, thus overestimating interpolation errors. On the other hand, deliberate blocking in predictor space may also improve error estimates when extrapolation is the modelling goal. Here, we review the ecological literature on non-random and blocked cross-validation approaches. We also provide a series of simulations and case studies, in which we show that, for all instances tested, block cross-validation is nearly universally more appropriate than random cross-validation if the goal is predicting to new data or predictor space, or for selecting causal predictors.
We recommend that block cross-validation be used wherever dependence structures exist in a dataset, even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.
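The blocking idea can be sketched in a few lines (an illustration only, not the authors' code; the site labels and fold count below are invented): whole blocks are assigned to cross-validation folds, so that no block ever contributes to both training and validation.

```python
import random
from collections import defaultdict

def blocked_folds(groups, k, seed=0):
    """Assign whole blocks (e.g. sites or years) to k folds, so that
    observations from one block never span training and validation."""
    blocks = sorted(set(groups))
    random.Random(seed).shuffle(blocks)
    fold_of_block = {b: i % k for i, b in enumerate(blocks)}
    folds = defaultdict(list)
    for idx, g in enumerate(groups):
        folds[fold_of_block[g]].append(idx)
    return [sorted(folds[i]) for i in range(k)]

# four sites with four observations each, split into two folds;
# each site's observations land entirely inside one fold
groups = ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4
folds = blocked_folds(groups, 2)
```

Contrast this with a random split, which would scatter each site's correlated observations across training and validation, understating predictive error.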

998 citations


Journal ArticleDOI
TL;DR: This paper showed that for typical psychological and psycholinguistic data, higher power is achieved without inflating Type I error rate if a model selection criterion is used to select a random effect structure that is supported by the data.

928 citations


Journal ArticleDOI
TL;DR: This article compares and contrasts HLM with alternative methods including generalized estimating equations and cluster-robust standard errors and demonstrates the advantages of the alternative methods and also when HLM would be the preferred method.
Abstract: In psychology and the behavioral sciences generally, the hierarchical linear model (HLM) and its extensions for discrete outcomes are popular methods for modeling clustered data. HLM and its discrete outcome extensions, however, are certainly not the only methods available to model clustered data. Although other methods exist and are widely implemented in other disciplines, it seems that psychologists have yet to consider these methods in substantive studies. This article compares and contrasts HLM with alternative methods including generalized estimating equations and cluster-robust standard errors. These alternative methods do not model random effects and thus make a smaller number of assumptions and are interpreted identically to single-level methods with the benefit that estimates are adjusted to reflect clustering of observations. Situations where these alternative methods may be advantageous are discussed including research questions where random effects are and are not required, when random effects can change the interpretation of regression coefficients, challenges of modeling with random effects with discrete outcomes, and examples of published psychology articles that use HLM that may have benefitted from using alternative methods. Illustrative examples are provided and discussed to demonstrate the advantages of the alternative methods and also when HLM would be the preferred method.
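The cluster-robust idea can be sketched for the simplest estimator, a sample mean (a toy illustration; the data and cluster labels are invented): residuals are summed within clusters before squaring, so within-cluster correlation inflates the standard error rather than being assumed away.

```python
from collections import defaultdict
from math import sqrt

def cluster_robust_se_of_mean(y, cluster):
    """Cluster-robust standard error of the sample mean: residuals are
    summed within each cluster before squaring (a sandwich estimator)."""
    n = len(y)
    m = sum(y) / n
    cluster_sums = defaultdict(float)
    for yi, g in zip(y, cluster):
        cluster_sums[g] += yi - m
    return sqrt(sum(s * s for s in cluster_sums.values())) / n

# perfectly duplicated observations within clusters inflate the robust SE
# relative to treating every observation as its own cluster
y = [1.0, 1.0, 4.0, 4.0, 2.0, 2.0]
se_clustered = cluster_robust_se_of_mean(y, ["a", "a", "b", "b", "c", "c"])
se_independent = cluster_robust_se_of_mean(y, list(range(6)))
```

With one observation per cluster the formula reduces to the usual heteroskedasticity-robust standard error; correlated clusters make it larger.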

428 citations


Journal ArticleDOI
TL;DR: It is found that, in practice, 5 or more studies are needed to reasonably consistently achieve powers from random‐effects meta‐analyses that are greater than the studies that contribute to them.
Abstract: One of the reasons for the popularity of meta-analysis is the notion that these analyses will possess more power to detect effects than individual studies. This is inevitably the case under a fixed-effect model. However, the inclusion of the between-study variance in the random-effects model, and the need to estimate this parameter, can have unfortunate implications for this power. We develop methods for assessing the power of random-effects meta-analyses, and the average power of the individual studies that contribute to meta-analyses, so that these powers can be compared. In addition to deriving new analytical results and methods, we apply our methods to 1991 meta-analyses taken from the Cochrane Database of Systematic Reviews to retrospectively calculate their powers. We find that, in practice, 5 or more studies are needed to reasonably consistently achieve powers from random-effects meta-analyses that are greater than the studies that contribute to them. Not only is statistical inference under the random-effects model challenging when there are very few studies but also less worthwhile in such cases. The assumption that meta-analysis will result in an increase in power is challenged by our findings.
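The power comparison can be sketched with the usual Wald approximation (a simplified illustration, not the authors' exact method; the effect size and within-study variances below are invented, and tau2 is treated as known).

```python
from math import sqrt
from statistics import NormalDist

def re_meta_power(theta, within_vars, tau2, alpha=0.05):
    """Approximate power of the Wald test for the overall effect in a
    random-effects meta-analysis with between-study variance tau2."""
    weights = [1.0 / (v + tau2) for v in within_vars]
    se = sqrt(1.0 / sum(weights))
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    return (1 - nd.cdf(z - theta / se)) + nd.cdf(-z - theta / se)

# adding between-study variance erodes the power of the pooled test
p_fixed = re_meta_power(0.3, [0.05] * 5, tau2=0.0)
p_random = re_meta_power(0.3, [0.05] * 5, tau2=0.05)
```

The random-effects weights 1/(v_i + tau2) are smaller than the fixed-effect weights 1/v_i, so the pooled standard error grows and power falls, which is the phenomenon the paper quantifies.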

344 citations


Journal ArticleDOI
Peter C. Austin1
TL;DR: Three families of regression models for the analysis of multilevel survival data are described; each incorporates cluster-specific random effects to account for within-cluster homogeneity in outcomes.
Abstract: Data that have a multilevel structure occur frequently across a range of disciplines, including epidemiology, health services research, public health, education and sociology. We describe three families of regression models for the analysis of multilevel survival data. First, Cox proportional hazards models with mixed effects incorporate cluster-specific random effects that modify the baseline hazard function. Second, piecewise exponential survival models partition the duration of follow-up into mutually exclusive intervals and fit a model that assumes that the hazard function is constant within each interval. This is equivalent to a Poisson regression model that incorporates the duration of exposure within each interval. By incorporating cluster-specific random effects, generalised linear mixed models can be used to analyse these data. Third, after partitioning the duration of follow-up into mutually exclusive intervals, one can use discrete time survival models that use a complementary log-log generalised linear model to model the occurrence of the outcome of interest within each interval. Random effects can be incorporated to account for within-cluster homogeneity in outcomes. We illustrate the application of these methods using data consisting of patients hospitalised with a heart attack, and provide implementations in three statistical programming languages (R, SAS and Stata).
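The piecewise exponential setup can be sketched by expanding each subject's follow-up into interval records, the "person-period" data that a Poisson model with an exposure offset is then fitted to (a generic illustration; the cut points and follow-up times are invented).

```python
def split_followup(time, event, cuts):
    """Split one subject's follow-up at interval cut points.
    Returns (interval_index, exposure, event_in_interval) rows, the
    person-period format used for piecewise exponential models."""
    rows = []
    start = 0.0
    bounds = list(cuts) + [float("inf")]
    for j, end in enumerate(bounds):
        if time <= start:
            break  # follow-up ended before this interval
        exposure = min(time, end) - start
        died_here = 1 if (event == 1 and time <= end) else 0
        rows.append((j, exposure, died_here))
        start = end
    return rows

# subject followed for 3.5 units with an event, intervals cut at 1, 2, 3:
rows = split_followup(3.5, 1, [1, 2, 3])
# -> [(0, 1.0, 0), (1, 1.0, 0), (2, 1.0, 0), (3, 0.5, 1)]
```

Each row contributes a Poisson observation with log(exposure) as an offset, which is the equivalence the abstract describes.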

283 citations


Journal ArticleDOI
TL;DR: A Bayes factor approach to multiway analysis of variance (ANOVA) that allows researchers to state graded evidence for effects or invariances as determined by the data is provided.
Abstract: This article provides a Bayes factor approach to multiway analysis of variance (ANOVA) that allows researchers to state graded evidence for effects or invariances as determined by the data. ANOVA is conceptualized as a hierarchical model where levels are clustered within factors. The development is comprehensive in that it includes Bayes factors for fixed and random effects and for within-subjects, between-subjects, and mixed designs. Different model construction and comparison strategies are discussed, and an example is provided. We show how Bayes factors may be computed with the BayesFactor package in R and with the JASP statistical package.

274 citations


Journal ArticleDOI
TL;DR: This article discusses designs with multiple sources of nonindependence, for example, studies in which the same subjects rate the same set of items or in which students nested in classrooms provide multiple answers and provides clear guidelines about the types of random effects that should be included in the analysis of such designs.
Abstract: In this article we address a number of important issues that arise in the analysis of nonindependent data. Such data are common in studies in which predictors vary within "units" (e.g., within-subjects, within-classrooms). Most researchers analyze categorical within-unit predictors with repeated-measures ANOVAs, but continuous within-unit predictors with linear mixed-effects models (LMEMs). We show that both types of predictor variables can be analyzed within the LMEM framework. We discuss designs with multiple sources of nonindependence, for example, studies in which the same subjects rate the same set of items or in which students nested in classrooms provide multiple answers. We provide clear guidelines about the types of random effects that should be included in the analysis of such designs. We also present a number of corrective steps that researchers can take when convergence fails in LMEMs with too many parameters. We end with a brief discussion on the trade-off between power and generalizability in designs with "within-unit" predictors.

236 citations


Posted ContentDOI
01 May 2017-bioRxiv
TL;DR: A new R package, glmmTMB, is presented, that increases the range of models that can easily be fitted to count data using maximum likelihood estimation and is faster than packages that use Markov chain Monte Carlo sampling for estimation.
Abstract: Ecological phenomena are often measured in the form of count data. These data can be analyzed using generalized linear mixed models (GLMMs) when observations are correlated in ways that require random effects. However, count data are often zero-inflated, containing more zeros than would be expected from the standard error distributions used in GLMMs, e.g., parasite counts may be exactly zero for hosts with effective immune defenses but vary according to a negative binomial distribution for non-resistant hosts. We present a new R package, glmmTMB, that increases the range of models that can easily be fitted to count data using maximum likelihood estimation. The interface was developed to be familiar to users of the lme4 R package, a common tool for fitting GLMMs. To maximize speed and flexibility, estimation is done using Template Model Builder (TMB), utilizing automatic differentiation to estimate model gradients and the Laplace approximation for handling random effects. We demonstrate glmmTMB and compare it to other available methods using two ecological case studies. In general, glmmTMB is more flexible than other packages available for estimating zero-inflated models via maximum likelihood estimation and is faster than packages that use Markov chain Monte Carlo sampling for estimation; it is also more flexible for zero-inflated modelling than INLA, but speed comparisons vary with model and data structure. Our package can be used to fit GLMs and GLMMs with or without zero-inflation as well as hurdle models. By allowing ecologists to quickly estimate a wide variety of models using a single package, glmmTMB makes it easier to find appropriate models and test hypotheses to describe ecological processes.
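The excess-zero mechanism glmmTMB models can be illustrated with the marginal probability of a zero count under a zero-inflated negative binomial (a generic textbook formula, not glmmTMB code; the parameter values are invented).

```python
def zinb_zero_prob(mu, size, pi):
    """P(Y = 0) for a zero-inflated negative binomial with mean mu,
    dispersion parameter `size`, and structural-zero probability pi."""
    nb_zero = (size / (size + mu)) ** size  # P(Y = 0) under a plain NB
    return pi + (1 - pi) * nb_zero

# structural zeros (e.g. resistant hosts) push P(Y = 0) above what the
# negative binomial alone predicts
p_plain = zinb_zero_prob(mu=3.0, size=1.0, pi=0.0)      # 0.25
p_inflated = zinb_zero_prob(mu=3.0, size=1.0, pi=0.3)   # 0.475
```

When the observed zero fraction clearly exceeds the plain-NB prediction, a zero-inflated or hurdle specification of the kind glmmTMB fits is the natural next step.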

231 citations


Journal ArticleDOI
TL;DR: The results indicate that simpler hierarchical models are valid in situations with few studies or sparse data, and univariate random effects logistic regression models are appropriate when a bivariate model cannot be fitted.
Abstract: Hierarchical models such as the bivariate and hierarchical summary receiver operating characteristic (HSROC) models are recommended for meta-analysis of test accuracy studies. These models are challenging to fit when there are few studies and/or sparse data (for example zero cells in contingency tables due to studies reporting 100% sensitivity or specificity); the models may not converge, or give unreliable parameter estimates. Using simulation, we investigated the performance of seven hierarchical models incorporating increasing simplifications in scenarios designed to replicate realistic situations for meta-analysis of test accuracy studies. Performance of the models was assessed in terms of estimability (percentage of meta-analyses that successfully converged and percentage where the between study correlation was estimable), bias, mean square error and coverage of the 95% confidence intervals. Our results indicate that simpler hierarchical models are valid in situations with few studies or sparse data. For synthesis of sensitivity and specificity, univariate random effects logistic regression models are appropriate when a bivariate model cannot be fitted. Alternatively, an HSROC model that assumes a symmetric SROC curve (by excluding the shape parameter) can be used if the HSROC model is the chosen meta-analytic approach. In the absence of heterogeneity, fixed-effect equivalents of the models can be applied.

168 citations


Journal ArticleDOI
TL;DR: In this paper, three methods to handle dependency among effect size estimates in meta-analysis arising from studies reporting multiple outcome measures taken on the same sample are compared with the method of robust variance estimation and with averaging effects within studies.
Abstract: This study investigates three methods to handle dependency among effect size estimates in meta-analysis arising from studies reporting multiple outcome measures taken on the same sample. The three-level approach is compared with the method of robust variance estimation, and with averaging effects within studies. A simulation study is performed, and the fixed and random effect estimates of the three methods are compared with each other. Both the robust variance estimation and three-level approach result in unbiased estimates of the fixed effects, corresponding standard errors and variances. Averaging effect sizes results in overestimated standard errors when the effect sizes within studies are truly independent. Although the robust variance and three-level approach are more complicated to use, they have the advantage that they do not require an estimate of the correlation between outcomes, and they still result in unbiased parameter estimates.

163 citations


Journal ArticleDOI
TL;DR: It is shown how and why an unrestricted weighted least squares MRA (WLS-MRA) estimator is superior to conventional random-effects (or mixed-effects) meta-regression when there is publication (or small-sample) bias, is as good as FE-MRA in all cases, and is better than fixed effects in most practical applications.
Abstract: Our study revisits and challenges two core conventional meta-regression estimators: the prevalent use of 'mixed-effects' or random-effects meta-regression analysis and the correction of standard errors that defines fixed-effects meta-regression analysis (FE-MRA). We show how and explain why an unrestricted weighted least squares MRA (WLS-MRA) estimator is superior to conventional random-effects (or mixed-effects) meta-regression when there is publication (or small-sample) bias, is as good as FE-MRA in all cases, and is better than fixed effects in most practical applications. Simulations and statistical theory show that WLS-MRA provides satisfactory estimates of meta-regression coefficients that are practically equivalent to mixed effects or random effects when there is no publication bias. When there is publication selection bias, WLS-MRA always has smaller bias than mixed effects or random effects. In practical applications, an unrestricted WLS meta-regression is likely to give practically equivalent or superior estimates to fixed-effects, random-effects, and mixed-effects meta-regression approaches. However, random-effects meta-regression remains viable and perhaps somewhat preferable if selection for statistical significance (publication bias) can be ruled out and when random, additive normal heterogeneity is known to directly affect the 'true' regression coefficient. Copyright © 2016 John Wiley & Sons, Ltd.
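The unrestricted WLS idea can be sketched for a single moderator (a simplified reading of the estimator: inverse-variance weights with no additive between-study variance term; the data below are invented).

```python
def wls_meta_regression(y, se, x):
    """Weighted least squares meta-regression of effect y on moderator x,
    with inverse-variance weights 1/se^2 and no additive tau^2 term.
    Solves the 2x2 normal equations for (intercept, slope)."""
    w = [1.0 / s ** 2 for s in se]
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx * swx
    b0 = (swxx * swy - swx * swxy) / det
    b1 = (sw * swxy - swx * swy) / det
    return b0, b1

# exactly linear toy data: the fit recovers intercept 0.5 and slope 2
x = [0.0, 1.0, 2.0, 3.0]
se = [0.1, 0.2, 0.1, 0.3]
y = [0.5 + 2.0 * xi for xi in x]
b0, b1 = wls_meta_regression(y, se, x)
```

A random-effects MRA would instead use weights 1/(se_i^2 + tau^2); the paper's argument is that dropping the additive tau^2 term is safer under publication selection.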

Journal ArticleDOI
TL;DR: A linear mixed effect model is considered to describe the responses over age with random effects for intercept and slope parameters; for designs of the same duration, a cutoff point in measurement costs relative to recruitment costs determines the preferred frequency of measurements.
Abstract: Longitudinal studies are often used to investigate age-related developmental change. Whereas a single cohort design takes a group of individuals at the same initial age and follows them over time, an accelerated longitudinal design takes multiple single cohorts, each one starting at a different age. The main advantage of an accelerated longitudinal design is its ability to span the age range of interest in a shorter period of time than would be possible with a single cohort longitudinal design. This paper considers design issues for accelerated longitudinal studies. A linear mixed effect model is considered to describe the responses over age with random effects for intercept and slope parameters. Random and fixed cohort effects are used to cope with the potential bias accelerated longitudinal designs have due to multiple cohorts. The impact of other factors, such as costs and dropouts, on the power of tests and the precision of parameter estimates is examined. As duration-related costs increase relative to recruitment costs, the best designs shift towards shorter durations, with a cross-sectional design eventually being best. For designs with the same duration but differing intervals between measurements, we found a cutoff point in measurement costs relative to recruitment costs that determines the preferred frequency of measurements. Under our model of 30% dropout there was a maximum power loss of 7%.

Journal ArticleDOI
TL;DR: In this paper, the authors introduce xthybrid, a shell for the meglm command that can fit a variety of hybrid and correlated random-effects models, including linear, logit, probit, ordered probit and Poisson and negative binomial models.
Abstract: One typically analyzes clustered data using random- or fixed-effects models. Fixed-effects models allow consistent estimation of the effects of level-one variables, even if there is unobserved heterogeneity at level two. However, these models cannot estimate the effects of level-two variables. Hybrid and correlated random-effects models are flexible modeling specifications that separate within- and between-cluster effects and allow for both consistent estimation of level-one effects and inclusion of level-two variables. In this article, we elaborate on the separation of within- and between-cluster effects in generalized linear mixed models. These models present a unifying framework for an entire class of models whose response variables follow a distribution from the exponential family (for example, linear, logit, probit, ordered probit and logit, Poisson, and negative binomial models). We introduce the user-written command xthybrid, a shell for the meglm command. xthybrid can fit a variety of hybrid and correlated random-effects models.

Journal ArticleDOI
TL;DR: In this article, a random thresholds hierarchical ordered probit model with random parameters is proposed to account for the fixed thresholds limitation of the traditional ordered probability models, which typically leads to incorrect estimation of outcome probabilities for the intermediate categories, and for the possibility of unobserved factors systematically varying across the observations.

Journal ArticleDOI
TL;DR: Two multi-environment Bayesian genomic models are proposed: one considers genetic effects (u) that can be assessed by the Kronecker product of variance–covariance matrices of genetic correlations between environments and genomic kernels through markers under two linear kernel methods, linear (genomic best linear unbiased predictors, GBLUP) and Gaussian (Gaussian kernel, GK).
Abstract: The phenomenon of genotype × environment (G × E) interaction in plant breeding decreases selection accuracy, thereby negatively affecting genetic gains. Several genomic prediction models incorporating G × E have been recently developed and used in genomic selection of plant breeding programs. Genomic prediction models for assessing multi-environment G × E interaction are extensions of a single-environment model, and have advantages and limitations. In this study, we propose two multi-environment Bayesian genomic models: the first model considers genetic effects (u) that can be assessed by the Kronecker product of variance-covariance matrices of genetic correlations between environments and genomic kernels through markers under two linear kernel methods, linear (genomic best linear unbiased predictors, GBLUP) and Gaussian (Gaussian kernel, GK). The other model has the same genetic component as the first model (u) plus an extra component, f, that captures random effects between environments that were not captured by the random effects u. We used five CIMMYT data sets (one maize and four wheat) that were previously used in different studies. Results show that models with G × E always have superior prediction ability compared with single-environment models, and the higher prediction ability of multi-environment models with u and f over the multi-environment model with only u occurred 85% of the time with GBLUP and 45% of the time with GK across the five data sets. The latter result indicated that including the random effect f is still beneficial for increasing prediction ability after adjusting by the random effect u.
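The Kronecker construction of the G × E covariance can be sketched with plain lists (illustrative only; real applications use dense linear-algebra libraries, and the small correlation matrices below are invented).

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows:
    kron(A, B)[i*p + k][j*q + l] == A[i][j] * B[k][l]."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

# 2 environments x 3 genotypes -> a 6 x 6 covariance structure
E = [[1.0, 0.5],
     [0.5, 1.0]]           # genetic correlation between environments
G = [[1.0, 0.2, 0.1],
     [0.2, 1.0, 0.3],
     [0.1, 0.3, 1.0]]      # genomic kernel built from markers
K = kron(E, G)             # covariance of the stacked genetic effects
```

Entry K[i*3 + k][j*3 + l] couples genotype k in environment i with genotype l in environment j, which is how the correlation between environments and the marker-based kernel are combined into a single random-effects covariance.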

Journal ArticleDOI
TL;DR: Researchers should be cautious in deriving 95% prediction intervals following a frequentist random‐effects meta‐analysis until a more reliable solution is identified, especially when there are few studies.
Abstract: A random effects meta-analysis combines the results of several independent studies to summarise the evidence about a particular measure of interest, such as a treatment effect. The approach allows for unexplained between-study heterogeneity in the true treatment effect by incorporating random study effects about the overall mean. The variance of the mean effect estimate is conventionally calculated by assuming that the between study variance is known; however, it has been demonstrated that this approach may be inappropriate, especially when there are few studies. Alternative methods that aim to account for this uncertainty, such as Hartung-Knapp, Sidik-Jonkman and Kenward-Roger, have been proposed and shown to improve upon the conventional approach in some situations. In this paper, we use a simulation study to examine the performance of several of these methods in terms of the coverage of the 95% confidence and prediction intervals derived from a random effects meta-analysis estimated using restricted maximum likelihood. We show that, in terms of the confidence intervals, the Hartung-Knapp correction performs well across a wide range of scenarios and outperforms other methods when heterogeneity was large and/or study sizes were similar. However, the coverage of the Hartung-Knapp method is slightly too low when the heterogeneity is low (I2 < 30%) and study sizes are similar. In other situations, especially when heterogeneity is small and the study sizes are quite varied, the coverage is far too low and could not be consistently improved by either increasing the number of studies, altering the degrees of freedom or using variance inflation methods. Therefore, researchers should be cautious in deriving 95% prediction intervals following a frequentist random-effects meta-analysis until a more reliable solution is identified. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
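The Hartung-Knapp correction itself is short (a sketch under simplifying assumptions: tau2 is taken as already estimated, and the caller supplies the t critical value for k-1 degrees of freedom, e.g. from scipy.stats.t.ppf; the data are invented).

```python
from math import sqrt

def hartung_knapp_ci(effects, within_vars, tau2, t_crit):
    """Hartung-Knapp confidence interval for the random-effects mean:
    the conventional variance 1/sum(w) is rescaled by the weighted
    residual mean square q, and a t (not normal) quantile is used."""
    w = [1.0 / (v + tau2) for v in within_vars]
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    k = len(effects)
    q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, effects)) / (k - 1)
    se = sqrt(q / sw)
    return mu - t_crit * se, mu + t_crit * se

# three studies; 4.303 is the 97.5% quantile of t with 2 df
effects = [0.1, 0.2, 0.3]
lo, hi = hartung_knapp_ci(effects, [0.01] * 3, tau2=0.0, t_crit=4.303)
```

The rescaling by q and the heavier-tailed t quantile are what widen the interval relative to the conventional Wald approach when between-study variability is underestimated.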

Journal ArticleDOI
TL;DR: In this paper, a multivariate space-time model is proposed for predicting crash frequencies of different injury severity levels, including spatial correlation and/or heterogeneity, temporal correlation and or heterogeneity, and correlations between crash frequencies.

Journal ArticleDOI
TL;DR: In this paper, the authors used driving simulation data and surveys conducted in 2014 and 2015 in Buffalo, NY, to study the factors that affect perceived (self-reported, based on surveys) and observed (as measured based on driving simulation experiments) aggressive driving behavior.

Journal ArticleDOI
TL;DR: In this article, the authors present a systematic review of simulation studies comparing the performance of different estimation methods for this parameter, and summarise the performance in relation to estimation of heterogeneity and the overall effect estimate, and of confidence intervals for the latter.
Abstract: Random-effects meta-analysis methods include an estimate of between-study heterogeneity variance. We present a systematic review of simulation studies comparing the performance of different estimation methods for this parameter. We summarise the performance of methods in relation to estimation of heterogeneity and of the overall effect estimate, and of confidence intervals for the latter. Among the twelve included simulation studies, the DerSimonian and Laird method was most commonly evaluated. This estimate is negatively biased when heterogeneity is moderate to high and therefore most studies recommended alternatives. The Paule-Mandel method was recommended by three studies: it is simple to implement, is less biased than DerSimonian and Laird and performs well in meta-analyses with dichotomous and continuous outcomes. In many of the included simulation studies, results were based on data that do not represent meta-analyses observed in practice, and only small selections of methods were compared. Furthermore, potential conflicts of interest were present when authors of novel methods interpreted their results. On the basis of current evidence, we provisionally recommend the Paule-Mandel method for estimating the heterogeneity variance, and using this estimate to calculate the mean effect and its 95% confidence interval. However, further simulation studies are required to draw firm conclusions. Copyright © 2016 John Wiley & Sons, Ltd.
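The DerSimonian and Laird estimator evaluated above is a method-of-moments formula and fits in a few lines (a minimal sketch; the effect sizes and variances below are invented).

```python
def dersimonian_laird_tau2(effects, within_vars):
    """Method-of-moments (DerSimonian-Laird) estimate of the
    between-study variance tau^2, truncated at zero."""
    w = [1.0 / v for v in within_vars]
    sw = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    Q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, effects))
    k = len(effects)
    c = sw - sum(wi * wi for wi in w) / sw
    return max(0.0, (Q - (k - 1)) / c)

# identical effects -> no between-study heterogeneity is detected
tau2_homog = dersimonian_laird_tau2([0.2, 0.2, 0.2], [0.01, 0.02, 0.03])
tau2_heterog = dersimonian_laird_tau2([0.0, 0.5, 1.0], [0.01, 0.02, 0.03])
```

The Paule-Mandel method recommended by the review replaces this closed form with an iterative solution of the same Q-statistic equation, using weights 1/(v_i + tau^2).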

Journal ArticleDOI
TL;DR: Empirical analysis confirmed the presence of both unstructured and spatially correlated variations in the effects of contributory factors on severe crash occurrences and suggested that ignoring spatially structured heterogeneity may result in biased parameter estimates and incorrect inferences.

Journal ArticleDOI
TL;DR: In this article, a spatially varying coefficient model was developed by extending the random effects eigenvector spatial filtering model; its spatially varying coefficients are defined by a linear combination of the eigenvectors describing the Moran coefficient, and each coefficient can have a different degree of spatial smoothness.
Abstract: This study develops a spatially varying coefficient model by extending the random effects eigenvector spatial filtering model. The developed model has the following properties: its spatially varying coefficients are defined by a linear combination of the eigenvectors describing the Moran coefficient; each of its coefficients can have a different degree of spatial smoothness; and it yields a variant of a Bayesian spatially varying coefficient model. Moreover, parameter estimation of the model can be executed with a relatively small computational burden. Results of a Monte Carlo simulation reveal that our model outperforms a conventional eigenvector spatial filtering (ESF) model and geographically weighted regression (GWR) models in terms of the accuracy of the coefficient estimates and computational time. We empirically apply our model to the hedonic land price analysis of flood hazards in Japan.

Journal ArticleDOI
TL;DR: Efficient leave-one-out cross validation strategies are presented, requiring little more effort than a single analysis; their efficiency relative to the naive approach using the same model increases with the number of observations.
Abstract: A random multiple-regression model that simultaneously fits all allele substitution effects for additive markers or haplotypes as uncorrelated random effects was proposed for Best Linear Unbiased Prediction using whole-genome data. Leave-one-out cross validation can be used to quantify the predictive ability of a statistical model. Naive application of leave-one-out cross validation is computationally intensive because the training and validation analyses need to be repeated n times, once for each observation. Efficient leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis. The efficient strategy is 786 times faster than the naive application for a simulated dataset with 1,000 observations and 10,000 markers, and 99 times faster with 1,000 observations and 100 markers. These efficiencies relative to the naive approach using the same model will increase with the number of observations.
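The same shortcut is familiar from ordinary least squares, where the leave-one-out residual is e_i / (1 - h_ii) and no refitting is needed (a toy illustration for simple linear regression, not the genomic BLUP implementation; the data are invented).

```python
def loo_residuals(x, y):
    """Leave-one-out residuals for simple linear regression in one pass:
    e_i / (1 - h_ii), where h_ii is the leverage, instead of refitting
    the model n times."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    out = []
    for xi, yi in zip(x, y):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx  # leverage of observation i
        e = yi - (a + b * xi)                 # ordinary residual
        out.append(e / (1 - h))               # exact LOO residual
    return out

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
fast = loo_residuals(x, y)
```

Each entry of `fast` equals the prediction error at observation i from a model fitted without observation i, exactly, which is why the one-pass strategy matches n refits.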

Journal ArticleDOI
TL;DR: Two-part models for semicontinuous and zero-heavy count data are examined, and models for count data with a two-part random effects distribution are considered.
Abstract: Statistical models that involve a two-part mixture distribution are applicable in a variety of situations. Frequently, the two parts are a model for the binary response variable and a model for the outcome variable that is conditioned on the binary response. Two common examples are zero-inflated or hurdle models for count data and two-part models for semicontinuous data. Recently, there has been particular interest in the use of these models for the analysis of repeated measures of an outcome variable over time. The aim of this review is to consider motivations for the use of such models in this context and to highlight the central issues that arise with their use. We examine two-part models for semicontinuous and zero-heavy count data, and we also consider models for count data with a two-part random effects distribution.

Journal ArticleDOI
TL;DR: The Median Hazard Ratio (MHR) is a useful and intuitive measure for expressing cluster heterogeneity in the outcome and, thereby, estimating general contextual effects in multilevel survival analysis.
Abstract: Multilevel data occurs frequently in many research areas like health services research and epidemiology. A suitable way to analyze such data is through the use of multilevel regression models (MLRM). MLRM incorporate cluster-specific random effects which allow one to partition the total individual variance into between-cluster variation and between-individual variation. Statistically, MLRM account for the dependency of the data within clusters and provide correct estimates of uncertainty around regression coefficients. Substantively, the magnitude of the effect of clustering provides a measure of the General Contextual Effect (GCE). When outcomes are binary, the GCE can also be quantified by measures of heterogeneity like the Median Odds Ratio (MOR) calculated from a multilevel logistic regression model. Time-to-event outcomes within a multilevel structure occur commonly in epidemiological and medical research. However, the Median Hazard Ratio (MHR) that corresponds to the MOR in multilevel (i.e., 'frailty') Cox proportional hazards regression is rarely used. Analogously to the MOR, the MHR is the median relative change in the hazard of the occurrence of the outcome when comparing identical subjects from two randomly selected different clusters that are ordered by risk. We illustrate the application and interpretation of the MHR in a case study analyzing the hazard of mortality in patients hospitalized for acute myocardial infarction at hospitals in Ontario, Canada. We provide R code for computing the MHR. The MHR is a useful and intuitive measure for expressing cluster heterogeneity in the outcome and, thereby, estimating general contextual effects in multilevel survival analysis. © 2016 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
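For reference, the MHR has a closed form in the cluster random-effect variance, with the same shape as the Median Odds Ratio (the tau2 value below is invented for illustration).

```python
from math import exp, sqrt
from statistics import NormalDist

def median_hazard_ratio(tau2):
    """MHR = exp(sqrt(2 * tau2) * Phi^-1(0.75)), where tau2 is the
    variance of the normally distributed cluster random effects
    (log-frailties) in a multilevel Cox model."""
    return exp(sqrt(2.0 * tau2) * NormalDist().inv_cdf(0.75))

mhr = median_hazard_ratio(0.25)  # exceeds 1 whenever clusters differ
```

An MHR of 1 means clusters are interchangeable; larger values quantify the median hazard increase when moving an identical subject from a lower-risk to a higher-risk cluster.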

Journal ArticleDOI
TL;DR: A Poisson mixed model with two random effects terms that account for both independent over-dispersion and sample non-independence is presented and a scalable sampling-based inference algorithm using a latent variable representation of the Poisson distribution is developed.
Abstract: Identifying differentially expressed (DE) genes from RNA sequencing (RNAseq) studies is among the most common analyses in genomics. However, RNAseq DE analysis presents several statistical and computational challenges, including over-dispersed read counts and, in some settings, sample non-independence. Previous count-based methods rely on simple hierarchical Poisson models (e.g. negative binomial) to model independent over-dispersion, but do not account for sample non-independence due to relatedness, population structure and/or hidden confounders. Here, we present a Poisson mixed model with two random effects terms that account for both independent over-dispersion and sample non-independence. We also develop a scalable sampling-based inference algorithm using a latent variable representation of the Poisson distribution. With simulations, we show that our method properly controls for type I error and is generally more powerful than other widely used approaches, except in small samples (n < 15) with other unfavorable properties (e.g. small effect sizes). We also apply our method to three real datasets that contain related individuals, population stratification or hidden confounders. Our results show that our method increases power in all three data sets compared to other approaches, though the power gain is smallest in the smallest sample (n = 6). Our method is implemented in MACAU, freely available at www.xzlab.org/software.html.
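To illustrate why the abstract's two random-effects terms are needed, the sketch below simulates counts whose log-rate carries both a correlated "relatedness" term and an independent over-dispersion term; the resulting variance far exceeds the mean, which a plain Poisson model (variance equal to mean) cannot accommodate. This is a hypothetical generative sketch, not MACAU's inference algorithm: the kinship structure and variance components are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400                                   # samples: 100 "families" of 4 relatives
# Hypothetical kinship-like covariance: relatives share a correlation of 0.5.
K = np.kron(np.eye(n // 4), np.full((4, 4), 0.5)) + 0.5 * np.eye(n)
g = rng.multivariate_normal(np.zeros(n), 0.2 * K)   # relatedness random effect
e = rng.normal(0.0, 0.4, size=n)                    # independent over-dispersion
lam = np.exp(np.log(10.0) + g + e)                  # per-sample Poisson rate
counts = rng.poisson(lam)

# The sample variance is well above the sample mean (over-dispersion).
print(counts.mean() < counts.var())
```

Fitting such a model requires integrating over the latent log-rates, which is what the paper's sampling-based algorithm addresses.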

Journal ArticleDOI
TL;DR: The results provide some evidence that a smaller number of neighbours used in defining the spatial weights matrix yields a better model fit, and may provide a more accurate representation of the underlying spatial random field.
Abstract: When analysing spatial data, it is important to account for spatial autocorrelation. In Bayesian statistics, spatial autocorrelation is commonly modelled by the intrinsic conditional autoregressive prior distribution. At the heart of this model is a spatial weights matrix which controls the behaviour and degree of spatial smoothing. The purpose of this study is to review the main specifications of the spatial weights matrix found in the literature, and together with some new and less common specifications, compare the effect that they have on smoothing and model performance. The popular BYM model is described, and a simple solution for addressing the identifiability issue among the spatial random effects is provided. Seventeen different definitions of the spatial weights matrix are defined, which are classified into four classes: adjacency-based weights, and weights based on geographic distance, distance between covariate values, and a hybrid of geographic and covariate distances. These last two definitions embody the main novelty of this research. Three synthetic data sets are generated, each representing a different underlying spatial structure. These data sets together with a real spatial data set from the literature are analysed using the models. The models are evaluated using the deviance information criterion and Moran’s I statistic. The deviance information criterion indicated that the model which uses binary, first-order adjacency weights to perform spatial smoothing is generally an optimal choice for achieving a good model fit. Distance-based weights also generally perform quite well and offer similar parameter interpretations. The less commonly explored options for performing spatial smoothing generally provided a worse model fit than models with more traditional approaches to smoothing, but usually outperformed the benchmark model which did not conduct spatial smoothing. 
The specification of the spatial weights matrix can have a substantial impact on model fit and parameter estimation. The results provide some evidence that using a smaller number of neighbours to define the spatial weights matrix yields a better model fit, and may provide a more accurate representation of the underlying spatial random field. The commonly used binary, first-order adjacency weights still appear to be a good choice for implementing spatial smoothing.
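As a small illustration of the "binary, first-order adjacency weights" the study favours, the sketch below builds a rook-adjacency weight matrix for a regular grid and evaluates Moran's I (one of the paper's evaluation statistics) on a smooth spatial gradient. The grid and data are hypothetical, not the paper's synthetic data sets.

```python
import numpy as np

def grid_adjacency(nrow, ncol):
    """Binary, first-order (rook) adjacency weight matrix for a regular grid."""
    n = nrow * ncol
    W = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            for dr, dc in ((1, 0), (0, 1)):       # south and east neighbours
                rr, cc = r + dr, c + dc
                if rr < nrow and cc < ncol:
                    j = rr * ncol + cc
                    W[i, j] = W[j, i] = 1.0       # symmetric binary weights
    return W

def morans_i(y, W):
    """Moran's I: n / S0 * (z' W z) / (z' z), with z the centred data."""
    z = np.asarray(y, dtype=float) - np.mean(y)
    return len(z) / W.sum() * (z @ W @ z) / (z @ z)

W = grid_adjacency(10, 10)
# A smooth east-west gradient is strongly positively autocorrelated.
y = [c for r in range(10) for c in range(10)]
print(round(morans_i(y, W), 3))   # → 0.889
```

Values of Moran's I near +1 indicate strong positive spatial autocorrelation; values near 0 indicate spatial randomness.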

Journal ArticleDOI
TL;DR: This work develops and applies multilevel, multinomial logistic regression models for analyzing behavioral datasets, which can potentially be applied to a broad class of statistical analyses by behavioral ecologists, focusing on other polytomous response variables, such as behavior, habitat choice, or emotional states.
Abstract: Behavioral ecologists frequently use observational methods, such as instantaneous scan sampling, to record the behavior of animals at discrete moments in time. We develop and apply multilevel, multinomial logistic regression models for analyzing such data. These statistical methods correspond to the multinomial character of the response variable while also accounting for the repeated observations of individuals that characterize behavioral datasets. Correlated random effects potentially reveal individual-level trade-offs across behaviors, allowing for models that reveal the extent to which individuals who regularly engage in one behavior also exhibit relatively more or less of another behavior. Using an example dataset, we demonstrate the estimation of these models using Hamiltonian Monte Carlo algorithms, as implemented in the RStan package in the R statistical environment. The supplemental files include a coding script and data that demonstrate auxiliary functions to prepare the data, estimate the models, summarize the posterior samples, and generate figures that display model predictions. We discuss possible extensions to our approach, including models with random slopes to allow individual-level behavioral strategies to vary over time and the need for models that account for temporal autocorrelation. These models can potentially be applied to a broad class of statistical analyses by behavioral ecologists, focusing on other polytomous response variables, such as behavior, habitat choice, or emotional states.
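A minimal generative sketch of the model class described above: multinomial scan samples with correlated individual-level random intercepts, where a negative covariance between two behavior categories encodes the individual-level trade-off the abstract mentions. All dimensions and parameter values are hypothetical, and this simulates from the model rather than fitting it with RStan.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3                  # behavior categories (the last is the reference)
n_ind, n_obs = 20, 50  # individuals, scan samples per individual

# Correlated random intercepts: the negative covariance means individuals
# high on category 0 tend to be low on category 1 (a behavioral trade-off).
Sigma = np.array([[1.0, -0.6],
                  [-0.6, 1.0]])
alpha = np.array([0.2, -0.4])                         # fixed intercepts
u = rng.multivariate_normal(np.zeros(2), Sigma, size=n_ind)

def softmax_probs(eta):
    """Category probabilities, with the reference category's linear
    predictor fixed at zero for identifiability."""
    full = np.concatenate([eta, [0.0]])
    e = np.exp(full - full.max())                     # numerically stable
    return e / e.sum()

# Simulate instantaneous scan samples for each individual.
scans = np.array([rng.choice(K, size=n_obs, p=softmax_probs(alpha + u[i]))
                  for i in range(n_ind)])
print(scans.shape)
```

The repeated rows per individual are exactly the dependence structure that the multilevel model accounts for, and that naive multinomial regression would ignore.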

Journal ArticleDOI
TL;DR: Vn is applied to two published meta‐analyses and is concluded that it usefully augments standard methods when deciding upon the likely validity of summary meta‐analysis estimates in clinical practice and the link between statistical validity and homogeneity is demonstrated.
Abstract: An important question for clinicians appraising a meta-analysis is: are the findings likely to be valid in their own practice? Does the reported effect accurately represent the effect that would occur in their own clinical population? To this end we advance the concept of statistical validity, where the parameter being estimated equals the corresponding parameter for a new independent study. Using a simple ('leave-one-out') cross-validation technique, we demonstrate how we may test meta-analysis estimates for statistical validity using a new validation statistic, Vn, and derive its distribution. We compare this with the usual approach of investigating heterogeneity in meta-analyses and demonstrate the link between statistical validity and homogeneity. Using a simulation study, the properties of Vn and the Q statistic are compared for univariate random effects meta-analysis and a tailored meta-regression model, where information from the setting (included as model covariates) is used to calibrate the summary estimate to the setting of application. Their properties are found to be similar when there are 50 studies or more, but for fewer studies Vn has greater power but a higher type I error rate than Q. The power and type I error rate of Vn are also shown to depend on the within-study variance, between-study variance, study sample size, and the number of studies in the meta-analysis. Finally, we apply Vn to two published meta-analyses and conclude that it usefully augments standard methods when deciding upon the likely validity of summary meta-analysis estimates in clinical practice. © 2017 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
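The Vn statistic itself is defined in the paper; the sketch below only illustrates the leave-one-out machinery it builds on, comparing each held-out study with the inverse-variance pooled estimate of the remaining studies under a common-effect model. The study effects and variances are hypothetical.

```python
import numpy as np

def pooled(yi, vi):
    """Inverse-variance (common-effect) pooled estimate and its variance."""
    w = 1.0 / np.asarray(vi, dtype=float)
    return float((w * yi).sum() / w.sum()), float(1.0 / w.sum())

def leave_one_out(yi, vi):
    """For each study i: the pooled estimate and variance from the other
    studies, plus the held-out (yi, vi) -- the ingredients of a
    leave-one-out validation comparison."""
    yi, vi = np.asarray(yi, float), np.asarray(vi, float)
    out = []
    for i in range(len(yi)):
        mask = np.arange(len(yi)) != i
        m, v = pooled(yi[mask], vi[mask])
        out.append((m, v, yi[i], vi[i]))
    return out

yi = [0.30, 0.10, 0.25, 0.40, 0.15]   # hypothetical study effect estimates
vi = [0.02, 0.03, 0.02, 0.05, 0.04]   # hypothetical within-study variances
for m, v, y, s in leave_one_out(yi, vi):
    # Standardized discrepancy between the held-out study and the rest:
    # large values suggest the pooled estimate may not transfer to new settings.
    print(round((y - m) / (v + s) ** 0.5, 2))
```

A validation statistic along these lines aggregates such discrepancies across all held-out studies, which is how homogeneity and statistical validity become linked.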

Journal ArticleDOI
TL;DR: Evidence is provided that school-based physical activity interventions can be effective in increasing physical activity enjoyment in children and adolescents, however, the magnitude of the pooled effect was small-to-moderate and there was evidence for publication bias and large between-study heterogeneity.

Journal ArticleDOI
TL;DR: Wang et al. developed a generalized nonlinear mixed-effects individual tree HCB model using data from 3133 Mongolian oak (Quercus mongolica) trees on 112 sample plots in the Wangqing Forest Bureau of northeastern China.