
Showing papers on "Mixed model published in 2020"


Journal ArticleDOI
TL;DR: In this article, the authors find that mixing existing forecasting models can significantly improve the prediction of stock returns, with three proposed mixed models obtaining superior out-of-sample forecasting performance.
Abstract: We find that mixing existing forecasting models can significantly improve the prediction of stock returns. Empirical results suggest that stock return forecasts from the three proposed mixed models are more significant, in both statistical and economic terms, than those from the corresponding models in Campbell and Thompson (2008), Wang et al. (2018) and Zhang et al. (2019). This improvement in predictability is also remarkable when we employ multivariate information to predict stock returns. The prediction performance of the mixed models is robust to a series of robustness tests. In particular, the three proposed mixed models obtain superior out-of-sample forecasting performance across business cycles, rolling-window predictions and different out-of-sample periods.
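A minimal R sketch of the mixing idea, assuming an equal-weight combination (the simplest possible mixing rule) and the standard Campbell–Thompson out-of-sample R² against the historical-mean benchmark. The returns `r` and stand-alone forecasts `m1`–`m3` are simulated placeholders, not the paper's data or its weighting scheme.

```r
# Hypothetical data: realized excess returns and three stand-alone forecasts
set.seed(1)
n  <- 240
r  <- rnorm(n, 0.005, 0.04)
fc <- cbind(m1 = r + rnorm(n, 0, 0.05),
            m2 = r + rnorm(n, 0, 0.06),
            m3 = r + rnorm(n, 0, 0.07))

fc_mix   <- rowMeans(fc)                              # equal-weight mixture
fc_bench <- c(0, head(cumsum(r) / seq_along(r), -1))  # lagged expanding mean

# Campbell-Thompson out-of-sample R^2 of the mixture vs the benchmark
1 - sum((r - fc_mix)^2) / sum((r - fc_bench)^2)
```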

42 citations


Journal ArticleDOI
16 Oct 2020
TL;DR: In this paper, the state of the art of statistical modelling as applied to plant breeding is presented, emphasizing model selection and parameter estimation in a practical way.
Abstract: This paper presents the state of the art of statistical modelling as applied to plant breeding. Classes of inference, statistical models, estimation methods and model selection are emphasized in a practical way. Restricted Maximum Likelihood (REML), Hierarchical Maximum Likelihood (HIML) and Bayesian (BAYES) inference are highlighted. Distributions of data and effects, and the dimension and structure of the models, are considered for model selection and parameter estimation. Theory and practical examples of selecting between models with different fixed-effects factors are given using Full Maximum Likelihood (FML). An analytical FML way of defining random or fixed effects is presented to avoid the usual subjective or conceptual definitions. Examples of applications of the Hierarchical Maximum Likelihood/Hierarchical Generalized Best Linear Unbiased Prediction (HIML/HG-BLUP) procedure are also presented. Sample sizes for achieving high experimental quality and accuracy are indicated, and simple interpretations of the estimates of key genetic parameters are given. Phenomics and genomics are also addressed. Maximum accuracy under the truest model is the key to achieving efficacy in plant breeding programs.
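A small R sketch of the fit-by-ML-then-refit-by-REML workflow described above: fixed-effects factors are compared under full ML, and the chosen model is refitted by REML for the variance components. The data set and factor names (`yield`, `geno`, `block`) are simulated stand-ins, not the paper's examples.

```r
library(lme4)

# Simulated stand-in data: 5 genotypes evaluated in 8 blocks
set.seed(42)
dat <- data.frame(block = gl(8, 25), geno = gl(5, 5, 200))
dat$yield <- 10 + as.numeric(dat$geno) + rnorm(8)[dat$block] + rnorm(200)

# Fixed-effects selection under full ML (REML likelihoods are not comparable
# across models with different fixed effects)
m1 <- lmer(yield ~ geno + (1 | block), data = dat, REML = FALSE)
m0 <- lmer(yield ~ 1    + (1 | block), data = dat, REML = FALSE)
anova(m0, m1)                    # likelihood-ratio test for the genotype factor

# Refit the selected model by REML for less biased variance components
m_final <- update(m1, REML = TRUE)
VarCorr(m_final)
```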

39 citations


Journal ArticleDOI
TL;DR: In this article, mixed-effects models are used to examine the association between a categorical moderator and the magnitude of the effect size, and two approaches are available to estimate the residual between-studies...
Abstract: Mixed-effects models can be used to examine the association between a categorical moderator and the magnitude of the effect size. Two approaches are available to estimate the residual between-studies...

38 citations


Journal ArticleDOI
TL;DR: Using carefully selected, data-driven transformations can improve small area estimation; this paper proposes to tackle the potential lack of validity of the model assumptions by using data-driven scaled transformations as opposed to ad-hoc chosen transformations.
Abstract: Small area models typically depend on the validity of model assumptions. For example, a commonly used version of the Empirical Best Predictor relies on the Gaussian assumptions of the error terms of the linear mixed model, a feature rarely observed in applications with real data. The present paper proposes to tackle the potential lack of validity of the model assumptions by using data-driven scaled transformations as opposed to ad-hoc chosen transformations. Different types of transformations are explored, the estimation of the transformation parameters is studied in detail under a linear mixed model and transformations are used in small area prediction of linear and non-linear parameters. The use of scaled transformations is crucial as it allows for fitting the linear mixed model with standard software and hence it simplifies the work of the data analyst. Mean squared error estimation that accounts for the uncertainty due to the estimation of the transformation parameters is explored using parametric and semi-parametric (wild) bootstrap. The proposed methods are illustrated using real survey and census data for estimating income deprivation parameters for municipalities in the Mexican state of Guerrero. Extensive simulation studies and the results from the application show that using carefully selected, data-driven transformations can improve small area estimation.
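A hedged R sketch of one data-driven transformation of the kind discussed: a log-shift transformation whose shift parameter is profiled by maximum likelihood (with the log-Jacobian correction) over a grid, after which the mixed model is fitted on the transformed scale with standard software, as the paper advocates. The variable names and simulated data are illustrative; the paper's full set of transformation families and its MSE bootstrap are not reproduced.

```r
library(lme4)

# Simulated skewed outcome with an area-level random effect
set.seed(7)
dat <- data.frame(area = gl(20, 30), x1 = rnorm(600))
dat$income <- exp(1 + 0.5 * dat$x1 + rnorm(20, 0, 0.3)[dat$area] +
                  rnorm(600, 0, 0.5))

# Profile ML log-likelihood of the log-shift parameter s (with log-Jacobian)
loglik_s <- function(s) {
  m <- lmer(log(income + s) ~ x1 + (1 | area), data = dat, REML = FALSE)
  as.numeric(logLik(m)) - sum(log(dat$income + s))
}
grid  <- exp(seq(log(0.01), log(100), length.out = 25))
s_hat <- grid[which.max(vapply(grid, loglik_s, numeric(1)))]

# Final fit on the data-driven transformed scale, using standard software
fit <- lmer(log(income + s_hat) ~ x1 + (1 | area), data = dat)
```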

33 citations


Journal ArticleDOI
TL;DR: A fast zero-inflated negative binomial mixed modeling (FZINBMM) approach to analyze high-dimensional longitudinal metagenomic count data; it markedly outperformed, in computational efficiency, two R packages that use numerical integration to fit ZINBMMs, while remaining statistically comparable to them.
Abstract: Motivation Longitudinal metagenomics data, including both 16S rRNA and whole-metagenome shotgun sequencing data, have enhanced our ability to understand the dynamic associations between the human microbiome and various diseases. However, analytic tools have not been fully developed to simultaneously address the main challenges of longitudinal metagenomics data, i.e. high-dimensionality, dependence among samples and zero-inflation of observed counts. Results We propose a fast zero-inflated negative binomial mixed modeling (FZINBMM) approach to analyze high-dimensional longitudinal metagenomic count data. The FZINBMM approach is based on zero-inflated negative binomial mixed models (ZINBMMs) for modeling longitudinal metagenomic count data and a fast EM-IWLS algorithm for fitting ZINBMMs. FZINBMM takes advantage of a commonly used procedure for fitting linear mixed models, which allows us to include various types of fixed and random effects and within-subject correlation structures and to quickly analyze many taxa. We found that FZINBMM markedly outperformed, in computational efficiency, two R packages that use numerical integration to fit ZINBMMs, GLMMadaptive and glmmTMB, while being statistically comparable to them. Extensive simulations and real data applications showed that FZINBMM outperformed previous methods, including linear mixed models, negative binomial mixed models and zero-inflated Gaussian mixed models. Availability and implementation FZINBMM has been implemented in the R package NBZIMM, available in the public GitHub repository https://github.com/nyiuab/NBZIMM. Supplementary information Supplementary data are available at Bioinformatics online.
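For contrast, a short sketch of fitting a ZINBMM with glmmTMB, one of the two comparison packages named in the abstract (the NBZIMM interface itself is not shown here, so its call is omitted rather than guessed). The data are simulated, and `offset(log(depth))` is a common way to adjust for sequencing depth.

```r
library(glmmTMB)

# Simulated longitudinal taxon counts with excess zeros
set.seed(3)
d <- data.frame(subject = gl(30, 5), time = rep(0:4, 30),
                group = gl(2, 75), depth = rpois(150, 5e4))
d$count <- rnbinom(150, mu = d$depth * 1e-4, size = 1) *
           rbinom(150, 1, 0.7)                  # inject structural zeros

fit <- glmmTMB(count ~ time * group + offset(log(depth)) + (1 | subject),
               ziformula = ~ 1,                 # zero-inflation component
               family    = nbinom2,             # negative binomial
               data      = d)
summary(fit)
```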

27 citations


Journal ArticleDOI
07 Feb 2020-Trials
TL;DR: The mixed model for repeated measures extended to CRTs is a good analytic choice for cluster randomized trials with a continuous outcome measured longitudinally, and evidence was found contrary to assertions that this model is inappropriate when there are more than two repeated measures on subjects.
Abstract: Cluster randomized trials (CRTs) are a design used to test interventions where individual randomization is not appropriate. The mixed model for repeated measures (MMRM) is a popular choice for individually randomized trials with longitudinal continuous outcomes. This model’s appeal is due to avoidance of model misspecification and its unbiasedness for data missing completely at random or at random. We extended the MMRM to cluster randomized trials by adding a random intercept for the cluster and undertook a simulation experiment to investigate statistical properties when data are missing at random. We simulated cluster randomized trial data where the outcome was continuous and measured at baseline and three post-intervention time points. We varied the number of clusters, the cluster size, the intra-cluster correlation, missingness and the data-generation models. We demonstrate the MMRM-CRT with an example of a cluster randomized trial on cardiovascular disease prevention among diabetics. When simulating a treatment effect at the final time point we found that estimates were unbiased when data were complete and when data were missing at random. Variance components were also largely unbiased. When simulating under the null, we found that type I error was largely nominal, although for a few specific cases it was as high as 0.081. Although there have been assertions that this model is inappropriate when there are more than two repeated measures on subjects, we found evidence to the contrary. We conclude that the MMRM for CRTs is a good analytic choice for cluster randomized trials with a continuous outcome measured longitudinally. ClinicalTrials.gov, ID: NCT02804698.
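A hedged nlme sketch of one common way to express the MMRM with an added cluster intercept: a random effect for cluster, unstructured within-subject correlation and visit-specific variances. The simulated data and variable names are placeholders, and this is not the authors' exact specification.

```r
library(nlme)

# Simulated CRT: 8 clusters, 10 subjects each, 3 visits, cluster-level treatment
set.seed(11)
crt <- expand.grid(visit = factor(1:3), id = 1:10, cluster = 1:8)
crt$treat <- as.numeric(crt$cluster <= 4)
crt$y <- 5 + 0.5 * crt$treat * as.numeric(crt$visit) +
         rnorm(8, 0, 0.5)[crt$cluster] + rnorm(nrow(crt))

# MMRM-CRT: random cluster intercept, unstructured within-subject correlation,
# visit-specific residual variances
fit <- lme(y ~ treat * visit,
           random      = ~ 1 | cluster,
           correlation = corSymm(form = ~ as.numeric(visit) | cluster/id),
           weights     = varIdent(form = ~ 1 | visit),
           data        = crt, na.action = na.omit)
summary(fit)
```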

26 citations


Journal ArticleDOI
TL;DR: Predictors for the concentration of Se in the respective grains were developed; the predicted values, along with the observed concentrations in the two grains, were represented by a multivariate linear mixed model in which selected covariates, derived from remote-sensor observations and a digital elevation model, were included as fixed effects.

25 citations


Journal ArticleDOI
TL;DR: ClusterBootstrap, an R package for the analysis of hierarchical data using generalized linear models with the cluster bootstrap (GLMCB), is introduced; the GLMCB proves a promising alternative to mixed models, with the ClusterBootstrap package an easy-to-use R implementation of the technique.
Abstract: In the analysis of clustered or hierarchical data, a variety of statistical techniques can be applied. Most of these techniques have assumptions that are crucial to the validity of their outcome. Mixed models rely on the correct specification of the random effects structure. Generalized estimating equations are most efficient when the working correlation form is chosen correctly and are not feasible when the within-subject variable is non-factorial. Assumptions and limitations of another common approach, ANOVA for repeated measurements, are even more worrisome: listwise deletion when data are missing, the sphericity assumption, inability to model an unevenly spaced time variable and time-varying covariates, and the limitation to normally distributed dependent variables. This paper introduces ClusterBootstrap, an R package for the analysis of hierarchical data using generalized linear models with the cluster bootstrap (GLMCB). Being a bootstrap method, the technique is relatively assumption-free, and it has already been shown to be comparable, if not superior, to GEE in its performance. The paper has three goals. First, GLMCB will be introduced. Second, there will be an empirical example, using the ClusterBootstrap package for a Gaussian and a dichotomous dependent variable. Third, GLMCB will be compared to mixed models in a Monte Carlo experiment. Although GLMCB can be applied to a multitude of hierarchical data forms, this paper discusses it in the context of the analysis of repeated measurements or longitudinal data. It will become clear that the GLMCB is a promising alternative to mixed models and the ClusterBootstrap package an easy-to-use R implementation of the technique.
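The cluster bootstrap itself is simple enough to write out in base R, which may clarify what the ClusterBootstrap package automates: resample whole subjects with replacement, refit the GLM, and take percentile intervals. This sketch uses simulated data and is not the package's implementation.

```r
# Simulated longitudinal binary outcome, 40 subjects x 4 occasions
set.seed(5)
long <- data.frame(id = rep(1:40, each = 4), time = rep(0:3, 40))
long$y <- rbinom(160, 1, plogis(-0.5 + 0.3 * long$time + rnorm(40)[long$id]))

ids <- unique(long$id)
boot_coefs <- replicate(1000, {
  take <- sample(ids, replace = TRUE)            # resample whole clusters
  bd <- do.call(rbind, lapply(seq_along(take), function(k) {
    d <- long[long$id == take[k], ]
    d$id <- k                                    # duplicated clusters get new ids
    d
  }))
  coef(glm(y ~ time, family = binomial, data = bd))
})
apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975))  # percentile 95% CIs
```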

24 citations


Journal ArticleDOI
01 Sep 2020-Test
TL;DR: In this paper, area-level compositional mixed models are introduced by applying transformations to a multivariate Fay-Herriot model; small area estimators of category proportions are derived from the new model, and the corresponding mean squared errors are estimated by parametric bootstrap.
Abstract: This paper introduces area-level compositional mixed models by applying transformations to a multivariate Fay–Herriot model. Small area estimators of the proportions of the categories of a classification variable are derived from the new model, and the corresponding mean squared errors are estimated by parametric bootstrap. Several simulation experiments designed to analyse the behaviour of the introduced estimators are carried out. An application to real data from the Spanish Labour Force Survey of Galicia (north-west of Spain), in the first quarter of 2017, is given. The target is the estimation of domain proportions of people in the four categories of the variable labour status: under 16 years, employed, unemployed and inactive.

22 citations


Journal ArticleDOI
TL;DR: The performance of various MI methods for imputing three-level incomplete data when the target analysis model is a three-level random effects model with a random intercept for each level is investigated.
Abstract: Three-level data arising from repeated measures on individuals who are clustered within larger units are common in health research studies. Missing data are prominent in such longitudinal studies and multiple imputation (MI) is a popular approach for handling missing data. Extensions of joint modelling and fully conditional specification MI approaches based on multilevel models have been developed for imputing three-level data. Alternatively, it is possible to extend single- and two-level MI methods to impute three-level data using dummy indicators (DI) and/or by analysing repeated measures in wide format. However, most implementations, evaluations and applications of these approaches focus on the context of incomplete two-level data. It is currently unclear which approach is preferable for imputing three-level data. In this study, we investigated the performance of various MI methods for imputing three-level incomplete data when the target analysis model is a three-level random effects model with a random intercept for each level. The MI methods were evaluated via simulations and illustrated using empirical data, based on a case study from the Childhood to Adolescence Transition Study, a longitudinal cohort collecting repeated measures on students who were clustered within schools. In our simulations we considered a number of different scenarios covering a range of different missing data mechanisms, missing data proportions and strengths of level-2 and level-3 intra-cluster correlations. We found that all of the approaches considered produced valid inferences about both the regression coefficient corresponding to the exposure of interest and the variance components under the various scenarios within the simulation study. In the case study, all approaches led to similar results. Researchers may use extensions to the single- and two-level approaches, or the three-level approaches, to adequately handle incomplete three-level data. The two-level MI approaches with the DI extension or the MI approaches based on three-level models will be required in certain circumstances, such as when there are longitudinal data measured at irregular time intervals. However, the single- and two-level approaches with the DI extension should be used with caution, as the DI approach has been shown to produce biased parameter estimates in certain scenarios.
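A hedged sketch of the single-level "wide format" strategy discussed above: reshape the repeated measures to one row per student, impute with default mice, reshape back, fit the three-level model on each completed data set, and pool. The names and data are simulated, and pooling lmer fits with mice additionally assumes the broom.mixed package is installed.

```r
library(mice)
library(lme4)

# Simulated wide data: 10 schools x 8 students, 3 waves, missingness in wave 3
set.seed(9)
wide <- data.frame(school = rep(1:10, each = 8), student = rep(1:8, 10),
                   exposure = rbinom(80, 1, 0.5))
for (w in 1:3) wide[[paste0("y.", w)]] <-
  1 + 0.5 * wide$exposure + rnorm(10, 0, 0.4)[wide$school] + rnorm(80)
wide$y.3[sample(80, 20)] <- NA

imp <- mice(wide, m = 10, printFlag = FALSE)     # default single-level FCS

fits <- lapply(complete(imp, "all"), function(d) {
  long <- reshape(d, varying = paste0("y.", 1:3), v.names = "y",
                  timevar = "wave", idvar = c("school", "student"),
                  direction = "long")
  lmer(y ~ wave + exposure + (1 | school/student), data = long)
})
summary(pool(as.mira(fits)))          # Rubin's rules (needs broom.mixed)
```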

21 citations


Journal ArticleDOI
01 Mar 2020-Test
TL;DR: In this article, a variant of the Fay-Herriot model is proposed that estimates domain means while taking into account the measurement error in the auxiliary variables, together with two fitting algorithms that calculate maximum and residual maximum likelihood estimates of the model parameters.
Abstract: The Fay–Herriot model is an area-level linear mixed model that is widely used for estimating the domain means of a given target variable. Under this model, the dependent variable is a direct estimator calculated by using the survey data and the auxiliary variables are true domain means obtained from external data sources. Administrative registers do not always give good auxiliary variables so that statisticians sometimes take them from alternative surveys and therefore they are measured with error. We introduce a variant of the Fay–Herriot model that takes into account the measurement error of the auxiliary variables and give two fitting algorithms that calculate maximum and residual maximum likelihood estimates of the model parameters. Based on the new model, empirical best predictors of domain means are introduced and an approximation of its mean squared error is derived. We finally give an application to estimate poverty proportions in the Spanish Living Condition Survey, with auxiliary information from the Spanish Labour Force Survey.
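For orientation, a sketch of the standard Fay-Herriot fit that the proposed measurement-error variant extends, using the sae package (which implements the classical model, not the new one); `dir_est` is the direct estimator, `vardir` its sampling variance, and the data are simulated.

```r
library(sae)

# Simulated domain-level data: direct estimates, sampling variances, auxiliary mean
set.seed(2)
D <- 30
dom <- data.frame(aux = rnorm(D, 10, 2))
dom$dir_est <- 2 + 0.5 * dom$aux + rnorm(D, 0, 0.3) + rnorm(D, 0, 0.3)
dom$vardir  <- rep(0.09, D)

fh <- mseFH(dir_est ~ aux, vardir, data = dom)   # FH model + parametric MSE
head(cbind(eblup = fh$est$eblup, rmse = sqrt(fh$mse)))
```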

Posted Content
TL;DR: A general, efficient approach to Bayesian SEM estimation in Stan is described, contrasting it with previous implementations in R package blavaan (Merkle & Rosseel, 2018); a practical comparison shows that the new approach is clearly better.
Abstract: Structural equation models comprise a large class of popular statistical models, including factor analysis models, certain mixed models, and extensions thereof. Model estimation is complicated by the fact that we typically have multiple interdependent response variables and multiple latent variables (which may also be called random effects or hidden variables), often leading to slow and inefficient MCMC samples. In this paper, we describe and illustrate a general, efficient approach to Bayesian SEM estimation in Stan, contrasting it with previous implementations in R package blavaan (Merkle & Rosseel, 2018). After describing the approaches in detail, we conduct a practical comparison under multiple scenarios. The comparisons show that the new approach is clearly better. We also discuss ways that the approach may be extended to other models that are of interest to psychometricians.
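A minimal blavaan sketch of the kind of model being compared, using the Stan target; the two-factor CFA on the classic HolzingerSwineford1939 data is a generic illustration, not the paper's benchmark set.

```r
library(blavaan)  # loads lavaan, which provides HolzingerSwineford1939

model <- ' visual  =~ x1 + x2 + x3
           textual =~ x4 + x5 + x6 '

fit <- bsem(model, data = HolzingerSwineford1939,
            target = "stan",                    # the precompiled Stan approach
            burnin = 1000, sample = 1000, n.chains = 3)
summary(fit)
```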

Journal ArticleDOI
TL;DR: A novel, computationally efficient technique based on smoothing and extraction of traits (SET), which can be used in any situation in which a traditional longitudinal analysis might be applied, especially when there are many observed time points.
Abstract: Non-destructive high-throughput plant phenotyping is becoming increasingly used and various methods for growth analysis have been proposed. Traditional longitudinal or repeated measures analyses that model growth using statistical models are common. However, often the variation in the data is inappropriately modelled, in part because the required models are complicated and difficult to fit. We provide a novel, computationally efficient technique that is based on smoothing and extraction of traits (SET), which we compare with the alternative traditional longitudinal analysis methods. The SET-based and longitudinal analyses were applied to a tomato experiment to investigate the effects on plant growth of zinc (Zn) addition and growing plants in soil inoculated with arbuscular mycorrhizal fungi (AMF). Conclusions from the SET-based and longitudinal analyses are similar, although the former analysis results in more significant differences. They showed that added Zn had little effect on plants grown in inoculated soils, but that growth depended on the amount of added Zn for plants grown in uninoculated soils. The longitudinal analysis of the unsmoothed data fitted a mixed model that involved both fixed and random regression modelling with splines, as well as allowing for unequal variances and autocorrelation between time points. A SET-based analysis can be used in any situation in which a traditional longitudinal analysis might be applied, especially when there are many observed time points. Two reasons for deploying the SET-based method are (i) biologically relevant growth parameters are required that parsimoniously describe growth, usually focussing on a small number of intervals, and/or (ii) a computationally efficient method is required for which a valid analysis is easier to achieve, while still capturing the essential features of the exhibited growth dynamics. Also discussed are the statistical models that need to be considered for traditional longitudinal analyses and it is demonstrated that the oft-omitted unequal variances and autocorrelation may be required for a valid longitudinal analysis. With respect to the separate issue of the subjective choice of mathematical growth functions or splines to characterize growth, it is recommended that, for both SET-based and longitudinal analyses, an evidence-based procedure is adopted.
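A small base-R sketch of the SET idea: smooth each plant's growth curve with a spline, then extract interval-based traits such as the average growth rate (AGR) from the smooth's first derivative. The logistic-growth data and trait intervals are made up for illustration.

```r
# Simulated logistic growth for 12 plants, measured every 5 days
set.seed(4)
pheno <- expand.grid(day = seq(5, 40, by = 5), plant = 1:12)
pheno$area <- 100 / (1 + exp(-(pheno$day - 20) / 4)) *
              runif(12, 0.8, 1.2)[pheno$plant] + rnorm(nrow(pheno), 0, 2)

traits <- do.call(rbind, lapply(split(pheno, pheno$plant), function(d) {
  sm  <- smooth.spline(d$day, d$area, df = 5)    # smooth one plant's curve
  agr <- predict(sm, d$day, deriv = 1)$y         # first derivative = AGR
  data.frame(plant     = d$plant[1],
             agr_early = mean(agr[d$day <= 20]), # interval-based traits
             agr_late  = mean(agr[d$day >  20]))
}))
head(traits)
```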

Journal ArticleDOI
TL;DR: It is concluded that using sparse statistical models and the development of large reference panels across multiple ethnicities and tissues will lead to better prediction of gene expression, and thus may improve TWAS power.
Abstract: In transcriptome-wide association studies (TWAS), gene expression values are predicted using genotype data and tested for association with a phenotype. The power of this approach to detect associations relies, at least in part, on the accuracy of the prediction. Here we compare the prediction accuracy of six different methods (LASSO, Ridge regression, Elastic net, Best Linear Unbiased Predictor, Bayesian Sparse Linear Mixed Model, and Random Forests) by performing cross-validation using data from the Geuvadis Project. We also examine prediction accuracy (a) at different sample sizes, (b) when the ancestry of the prediction model training and testing populations is different, and (c) when the tissue used to train the model is different from the tissue to be predicted. We find that, for most genes, expression cannot be accurately predicted, but in general sparse statistical models tend to outperform polygenic models at prediction. Average prediction accuracy is reduced when the model training set size is reduced or when predicting across ancestries, and is marginally reduced when predicting across tissues. We conclude that using sparse statistical models and the development of large reference panels across multiple ethnicities and tissues will lead to better prediction of gene expression, and thus may improve TWAS power.
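One of the compared predictors, the elastic net, is easy to sketch with glmnet; the simulated genotype matrix and train/test split below stand in for the paper's cross-validated Geuvadis analysis.

```r
library(glmnet)

# Simulated genotypes (SNP dosages 0/1/2) and expression driven by 5 causal SNPs
set.seed(6)
n <- 400; p <- 200
geno <- matrix(rbinom(n * p, 2, 0.3), n, p)
expr <- drop(geno[, 1:5] %*% rnorm(5, 0, 0.4)) + rnorm(n)

train <- 1:300; test <- 301:400
cvfit <- cv.glmnet(geno[train, ], expr[train], alpha = 0.5, nfolds = 5)
pred  <- predict(cvfit, newx = geno[test, ], s = "lambda.min")
cor(pred, expr[test])^2                          # held-out prediction R^2
```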

Journal ArticleDOI
TL;DR: A mixed‐effects model for spatial cluster detection that takes spatial correlation into account is proposed, and the introduced random effects explain extra variability among the spatial responses beyond the cluster effect, thus reducing the false positive rate.
Abstract: Identifying spatial clusters of different regression coefficients is a useful tool for discerning the distinctive relationship between a response and covariates in space. Most of the existing cluster detection methods aim to identify the spatial similarity in responses, and the standard cluster detection algorithm assumes independent spatial units. However, the response variables are spatially correlated in many environmental applications. We propose a mixed‐effects model for spatial cluster detection that takes spatial correlation into account. Compared to a fixed‐effects model, the introduced random effects explain extra variability among the spatial responses beyond the cluster effect, thus reducing the false positive rate. The developed method exploits a sequential searching scheme and is able to identify multiple potentially overlapping clusters. We use simulation studies to evaluate the performance of our proposed method in terms of the true and false positive rates of a known cluster and the identification of multiple known clusters. We apply our proposed methodology to particulate matter (PM2.5) concentration data from the Northeastern United States in order to study the weather effect on PM2.5 and to investigate the association between the simulations from a numerical model and the satellite‐derived aerosol optical depth data. We find geographical hot spots that show distinct features, comparing to the background.

Journal ArticleDOI
TL;DR: MixWILD is generalizable to a variety of data collection strategies and serves as a robust and reproducible method to test predictors of variability in level 1 outcomes and the associations between subject-level parameters (variances and slopes) and level 2 outcomes.
Abstract: The use of intensive sampling methods, such as ecological momentary assessment (EMA), is increasingly prominent in medical research. However, inferences from such data are often limited to the subject-specific mean of the outcome and between-subject variance (i.e., random intercept), despite the capability to examine within-subject variance (i.e., random scale) and associations between covariates and subject-specific mean (i.e., random slope). MixWILD (Mixed model analysis With Intensive Longitudinal Data) is statistical software that tests the effects of subject-level parameters (variance and slope) of time-varying variables, specifically in the context of studies using intensive sampling methods, such as ecological momentary assessment. MixWILD combines estimation of a stage 1 mixed-effects location-scale (MELS) model, including estimation of the subject-specific random effects, with a subsequent stage 2 linear or binary/ordinal logistic regression in which values sampled from each subject's random effect distributions can be used as regressors (and then the results are aggregated across replications). Computations within MixWILD were written in FORTRAN and use maximum likelihood estimation, utilizing both the expectation-maximization (EM) algorithm and a Newton-Raphson solution. The mean and variance of each individual's random effects used in the sampling are estimated using empirical Bayes equations. This manuscript details the underlying procedures and provides examples illustrating standalone usage and features of MixWILD and its GUI. MixWILD is generalizable to a variety of data collection strategies (i.e., EMA, sensors) as a robust and reproducible method to test predictors of variability in level 1 outcomes and the associations between subject-level parameters (variances and slopes) and level 2 outcomes.

Journal ArticleDOI
TL;DR: In this article, three typical spatiotemporal LUR models, based on parametric, semi-parametric and nonparametric statistical methods, respectively, are used to predict daily ground-level ozone (O3) in the megacity of Tianjin, China.

Journal ArticleDOI
TL;DR: A model that allows multiple random effects per subject in the mean model (eg, random location intercept and slopes) as well as random scale in the error variance model is described.
Abstract: Ecological Momentary Assessment data present some new modeling opportunities. Typically, there are sufficient data to explicitly model the within-subject (WS) variance, and in many applications, it is of interest to allow the WS variance to depend on covariates as well as random subject effects. We describe a model that allows multiple random effects per subject in the mean model (eg, random location intercept and slopes), as well as random scale in the error variance model. We present an example of the use of this model on a real dataset and a simulation study that shows the benefit of this model, relative to simpler approaches.

Journal ArticleDOI
TL;DR: In this paper, the authors develop a linear mixed effects model for continuous repeated measurement outcomes that allows any combination of its stochastic components to be non-Gaussian, using multivariate normal variance-mean mixtures.
Abstract: We consider the analysis of continuous repeated measurement outcomes that are collected longitudinally. A standard framework for analysing data of this kind is a linear Gaussian mixed effects model within which the outcome variable can be decomposed into fixed effects, time invariant and time-varying random effects, and measurement noise. We develop methodology that, for the first time, allows any combination of these stochastic components to be non-Gaussian, using multivariate normal variance-mean mixtures. To meet the computational challenges that are presented by large data sets, i.e. in the current context, data sets with many subjects and/or many repeated measurements per subject, we propose a novel implementation of maximum likelihood estimation using a computationally efficient subsampling-based stochastic gradient algorithm. We obtain standard error estimates by inverting the observed Fisher information matrix and obtain the predictive distributions for the random effects in both filtering (conditioning on past and current data) and smoothing (conditioning on all data) contexts. To implement these procedures, we introduce an R package: ngme. We reanalyse two data sets, from cystic fibrosis and nephrology research, that were previously analysed by using Gaussian linear mixed effects models.

Journal ArticleDOI
12 May 2020-Cerne
TL;DR: In this article, the quality of the volumetric estimation of Eucalyptus spp. trees was assessed using a mixed-effects model, an artificial neural network (ANN) and a support-vector machine (SVM); the best variance-covariance structure was selected using the Akaike Information Criterion, the Maximum Likelihood Ratio Test and Vuong's Closeness Test.
Abstract: Volumetric equations are one of the main tools for quantifying forest stand production, and they are the basis for sustainable management of forest plantations. This study aimed to assess the quality of the volumetric estimation of Eucalyptus spp. trees using a mixed-effects model, an artificial neural network (ANN) and a support-vector machine (SVM). The database was derived from a forest stand located in the municipalities of Bom Jardim de Minas, Lima Duarte and Arantina in Minas Gerais state, Brazil. The volume of 818 trees was accurately estimated using Smalian's formula. The Schumacher and Hall model was fitted by fixed-effects regression and by including multilevel random effects. The mixed model was fitted by adopting 14 different structures for the variance and covariance matrix. The best structure was selected based on the Akaike Information Criterion, the Maximum Likelihood Ratio Test and Vuong's Closeness Test. The SVM and ANN training process considered diameter at breast height and total tree height to be the independent variables. The techniques performed satisfactorily in modeling, with homogeneous distributions and low dispersion of residuals. The quality-analysis criteria indicated the superior performance of the mixed model with a Huynh-Feldt structure of the variance and covariance matrix, which showed a decrease in mean relative error from 13.52% to 2.80%, whereas the machine learning techniques had error values of 6.77% (SVM) and 5.81% (ANN). This study confirms that although fixed-effects models are widely used in the Brazilian forest sector, there are more effective methods for modeling dendrometric variables.
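The log-linear form of the Schumacher and Hall model with a stand-level random intercept can be written in one lme4 line; the simulated trees below are placeholders, and the paper's 14 candidate covariance structures are not reproduced.

```r
library(lme4)

# Simulated tree data: 10 stands x 40 trees
set.seed(8)
trees <- data.frame(stand = gl(10, 40))
trees$dbh <- runif(400, 5, 30)
trees$h   <- 1.3 + 1.1 * trees$dbh^0.8 * runif(400, 0.9, 1.1)
trees$vol <- exp(-9.5 + 1.8 * log(trees$dbh) + 1.1 * log(trees$h) +
                 rnorm(10, 0, 0.05)[trees$stand] + rnorm(400, 0, 0.1))

# Schumacher-Hall in log-linear form with a stand-level random intercept
fit <- lmer(log(vol) ~ log(dbh) + log(h) + (1 | stand), data = trees)
summary(fit)
```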

Journal ArticleDOI
TL;DR: An additional assumption is provided that allows scientists to use standard software to fit linear mixed models with endogenous covariates, and person-specific predictions of effects can be provided.
Abstract: Mobile health is a rapidly developing field in which behavioral treatments are delivered to individuals via wearables or smartphones to facilitate health-related behavior change. Micro-randomized trials (MRT) are an experimental design for developing mobile health interventions. In an MRT, the treatments are randomized numerous times for each individual over the course of the trial. Along with assessing treatment effects, behavioral scientists aim to understand between-person heterogeneity in the treatment effect. A natural approach is the familiar linear mixed model. However, directly applying linear mixed models is problematic because potential moderators of the treatment effect are frequently endogenous, that is, they may depend on prior treatment. We discuss model interpretation and biases that arise in the absence of additional assumptions when endogenous covariates are included in a linear mixed model. In particular, when there are endogenous covariates, the coefficients no longer have the customary marginal interpretation. However, these coefficients still have a conditional-on-the-random-effect interpretation. We provide an additional assumption that, if true, allows scientists to use standard software to fit linear mixed models with endogenous covariates, and person-specific predictions of effects can be provided. As an illustration, we assess the effect of activity suggestions in the HeartSteps MRT and analyze the between-person treatment effect heterogeneity.

Journal ArticleDOI
06 May 2020-PLOS ONE
TL;DR: The merits of the Bayesian model including prior information are demonstrated by analyzing data from an empirical study on weight loss; this model allows the inclusion of prior knowledge and offers potential for population-based and personalized inference.
Abstract: Background. N-of-1 designs are gaining popularity in nutritional research because of improving technological possibilities, practical applicability and the promise of increased accuracy and sensitivity, especially in the field of personalized nutrition. This move calls for a search for applicable statistical methods. Objective. To demonstrate the differences between three popular statistical methods in analyzing treatment effects on data obtained in N-of-1 designs. Method. We compare individual-participant data meta-analysis and frequentist and Bayesian linear mixed-effect models using a simulation experiment. Furthermore, we demonstrate the merits of the Bayesian model including prior information by analyzing data from an empirical study on weight loss. Results. The linear mixed-effect models are to be preferred over the meta-analysis method, since the individual effects are estimated more accurately, as evidenced by the lower errors, especially with lower sample sizes. Differences between the Bayesian and frequentist mixed models were found to be small, indicating that they will lead to the same results when no informative prior is included. Conclusion. For empirical data, the Bayesian mixed model allows the inclusion of prior knowledge and gives potential for population-based and personalized inference.
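A hedged sketch of the frequentist/Bayesian pair being compared, for a series of N-of-1 trials with subject-specific treatment effects; the simulated data, the informative prior and the variable names are illustrative only.

```r
library(lme4)
library(brms)

# Simulated series of N-of-1 trials: 15 subjects, 8 periods, within-person
# randomized treatment, heterogeneous subject-specific effects
set.seed(10)
n1 <- expand.grid(period = 1:8, id = 1:15)
n1$treatment <- rep(rep(0:1, 4), 15)
n1$weight <- 80 - (1 + rnorm(15, 0, 0.5)[n1$id]) * n1$treatment +
             rnorm(120, 0, 1)

f_freq  <- lmer(weight ~ treatment + (treatment | id), data = n1)

f_bayes <- brm(weight ~ treatment + (treatment | id), data = n1,
               prior  = prior(normal(-1, 1), class = "b"),  # informative prior
               chains = 4, iter = 2000, refresh = 0)
coef(f_bayes)$id[, , "treatment"]   # personalized (per-subject) effects
```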

Journal ArticleDOI
TL;DR: Monte Carlo simulations and an empirical analysis of regional unemployment in Italy show that the proposed semiparametric P-Spline model represents a valid alternative to parametric methods aimed at disentangling strong and weak cross-sectional dependence when both spatial and temporal heterogeneity are smoothly distributed.
Abstract: We propose a semiparametric P-Spline model to deal with spatial panel data. This model includes a non-parametric spatio-temporal trend, a spatial lag of the dependent variable, and a time series autoregressive noise. Specifically, we consider a spatio-temporal ANOVA model, disaggregating the trend into spatial and temporal main effects, as well as second- and third-order interactions between them. Algorithms based on spatial anisotropic penalties are used to estimate all the parameters in a closed form without the need for multidimensional optimization. Monte Carlo simulations and an empirical analysis of regional unemployment in Italy show that our model represents a valid alternative to parametric methods aimed at disentangling strong and weak cross-sectional dependence when both spatial and temporal heterogeneity are smoothly distributed.
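In the same spirit (though without the paper's spatial-lag and autoregressive-noise terms), an ANOVA-type spatio-temporal P-spline decomposition can be sketched with mgcv: spatial and temporal main effects plus a space-time interaction, all with P-spline bases. The gridded panel is simulated.

```r
library(mgcv)

# Simulated spatial panel on a 10 x 10 grid over 8 years
set.seed(12)
panel <- expand.grid(lon = seq(0, 1, length.out = 10),
                     lat = seq(0, 1, length.out = 10), year = 1:8)
panel$rate <- with(panel, sin(2 * pi * lon) * cos(pi * lat) + 0.1 * year +
                          0.3 * lon * year / 8) + rnorm(nrow(panel), 0, 0.2)

# ANOVA-type decomposition: spatial and temporal main effects + interaction
fit <- gam(rate ~ ti(lon, lat, bs = "ps") +
                  ti(year, bs = "ps", k = 5) +
                  ti(lon, lat, year, bs = "ps", k = c(5, 5, 4)),
           data = panel, method = "REML")
summary(fit)
```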

Journal ArticleDOI
TL;DR: In this paper, K-fold cross-validation (CV) with squared error loss is used for evaluating predictive models, especially when strong distributional assumptions cannot be taken, and the results show that CV with sq...
Abstract: K-fold cross-validation (CV) with squared error loss is widely used for evaluating predictive models, especially when strong distributional assumptions cannot be taken. However, CV with sq...

Journal ArticleDOI
TL;DR: In this paper, the authors outline all approaches focusing on the part of the model subject to selection, the dimensionality of models and the structure of variance and covariance matrices, and also, wherever possible, the existence of an implemented application of the methodologies set out.
Abstract: Linear mixed-effects models are a class of models widely used for analyzing different types of data: longitudinal, clustered and panel data. Many fields in which a statistical methodology is required, such as biology, chemistry, medicine and finance, involve the employment of linear mixed models. One of the most important steps in a statistical analysis is model selection. Since there are a large number of linear mixed model selection procedures available in the literature, a pressing issue is how to identify the best approach to adopt in a specific case. We outline the main approaches, focusing on the part of the model subject to selection (fixed and/or random), the dimensionality of models and the structure of variance and covariance matrices, and also, wherever possible, the existence of an implemented application of the methodologies set out.

Journal ArticleDOI
09 Nov 2020-PLOS ONE
TL;DR: ZIGMMs are a robust and flexible method, applicable to longitudinal microbiome proportion or count data generated with either 16S rRNA or shotgun sequencing technologies, that can effectively handle zero-inflation.
Abstract: Motivation The human microbiome is variable and dynamic in nature. Longitudinal studies could explain the mechanisms in maintaining the microbiome in health or causing dysbiosis in disease. However, it remains challenging to properly analyze longitudinal microbiome data from either 16S rRNA or metagenome shotgun sequencing studies, output as proportions or counts. Most microbiome data are sparse, requiring statistical models to handle zero-inflation. Moreover, longitudinal designs induce correlation among the samples and thus further complicate the analysis and interpretation of the microbiome data. Results In this article, we propose zero-inflated Gaussian mixed models (ZIGMMs) to analyze longitudinal microbiome data. ZIGMMs are a robust and flexible method applicable to longitudinal microbiome proportion data or count data generated with either 16S rRNA or shotgun sequencing technologies. They can include various types of fixed effects and random effects, account for various within-subject correlation structures, and effectively handle zero-inflation. We developed an efficient Expectation-Maximization (EM) algorithm to fit the ZIGMMs by taking advantage of the standard procedure for fitting linear mixed models. We demonstrate the computational efficiency of our EM algorithm by comparing it with two other zero-inflated methods. We show that ZIGMMs outperform the previously used linear mixed models (LMMs), negative binomial mixed models (NBMMs) and the zero-inflated Beta regression mixed model (ZIBR) in detecting associated effects in longitudinal microbiome data through extensive simulations. We also apply our method to two public longitudinal microbiome datasets and compare it with LMMs and NBMMs in detecting dynamic effects of associated taxa.

Journal ArticleDOI
TL;DR: The findings show that the estimates of fixed-effects parameters in nonlinear mixed-effects models are generally robust to deviations from normality of the random-effects distribution, but the estimates of variance components are very sensitive to the distributional assumption of random effects.
Abstract: Nonlinear mixed-effects models are being widely used for the analysis of longitudinal data, especially from pharmaceutical research. They use random effects which are latent and unobservable variables so the random-effects distribution is subject to misspecification in practice. In this paper, we first study the consequences of misspecifying the random-effects distribution in nonlinear mixed-effects models. Our study is focused on Gauss-Hermite quadrature, which is now the routine method for calculation of the marginal likelihood in mixed models. We then present a formal diagnostic test to check the appropriateness of the assumed random-effects distribution in nonlinear mixed-effects models, which is very useful for real data analysis. Our findings show that the estimates of fixed-effects parameters in nonlinear mixed-effects models are generally robust to deviations from normality of the random-effects distribution, but the estimates of variance components are very sensitive to the distributional assumption of random effects. Furthermore, a misspecified random-effects distribution will either overestimate or underestimate the predictions of random effects. We illustrate the results using a real data application from an intensive pharmacokinetic study.

Journal ArticleDOI
TL;DR: Results show a lower rate of nonpositive definiteness with the factor analytic structure than with the Cholesky decomposition, suggesting that a factor analytic covariance structure may be useful for combating nonpositive definiteness, especially in models with many random effects.
Abstract: Deciding which random effects to retain is a central decision in mixed effect models. Recent recommendations advise a maximal structure whereby all theoretically relevant random effects are retained...

Journal ArticleDOI
01 Apr 2020-Genetics
TL;DR: This work proposes an alternative strategy, where genetic effects are formally included in the graph, which has important advantages: genetic effects can be directly incorporated in causal inference, implemented via the PCgen algorithm, which can analyze many more traits; and it is shown that reconstruction is much more accurate if individual plant or plot data are used, instead of genotypic means.
Abstract: Genetic variance of a phenotypic trait can originate from direct genetic effects, or from indirect effects, i.e., through genetic effects on other traits, affecting the trait of interest. This distinction is often of great importance, for example, when trying to improve crop yield and simultaneously control plant height. As suggested by Sewall Wright, assessing contributions of direct and indirect effects requires knowledge of (1) the presence or absence of direct genetic effects on each trait, and (2) the functional relationships between the traits. Because experimental validation of such relationships is often unfeasible, it is increasingly common to reconstruct them using causal inference methods. However, most current methods require all genetic variance to be explained by a small number of quantitative trait loci (QTL) with fixed effects. Only a few authors have considered the "missing heritability" case, where contributions of many undetectable QTL are modeled with random effects. Usually, these are treated as nuisance terms that need to be eliminated by taking residuals from a multi-trait mixed model (MTM). But fitting such an MTM is challenging, and it is impossible to infer the presence of direct genetic effects. Here, we propose an alternative strategy, where genetic effects are formally included in the graph. This has important advantages: (1) genetic effects can be directly incorporated in causal inference, implemented via our PCgen algorithm, which can analyze many more traits; and (2) we can test the existence of direct genetic effects, and improve the orientation of edges between traits. Finally, we show that reconstruction is much more accurate if individual plant or plot data are used, instead of genotypic means. We have implemented the PCgen algorithm in the R package pcgen.

Journal ArticleDOI
TL;DR: A hierarchical Bayesian non-parametric model of population growth that identifies the latent growth behavior and response to perturbation, while simultaneously correcting for random effects in the data is developed.
Abstract: Substantive changes in gene expression, metabolism, and the proteome are manifested in overall changes in microbial population growth. Quantifying how microbes grow is therefore fundamental to areas such as genetics, bioengineering, and food safety. Traditional parametric growth curve models capture the population growth behavior through a set of summarizing parameters. However, estimation of these parameters from data is confounded by random effects such as experimental variability, batch effects or differences in experimental material. A systematic statistical method to identify and correct for such confounding effects in population growth data is not currently available. Further, our previous work has demonstrated that parametric models are insufficient to explain and predict microbial response under non-standard growth conditions. Here we develop a hierarchical Bayesian non-parametric model of population growth that identifies the latent growth behavior and response to perturbation, while simultaneously correcting for random effects in the data. This model enables more accurate estimates of the biological effect of interest, while better accounting for the uncertainty due to technical variation. Additionally, modeling hierarchical variation provides estimates of the relative impact of various confounding effects on measured population growth.