
Showing papers on "Mixed model published in 2021"


Journal ArticleDOI
TL;DR: In this article, the authors relax the normality assumption for the random effects and model errors by assuming a skew-normal distribution, which includes normality as a special case and provides flexibility in capturing a broad range of non-normal behavior.
Abstract: Normality (symmetry) of the random effects and the within-subject errors is a routine assumption for the linear mixed model, but it may be unrealistic, obscuring important features of among- and within-subject variation. We relax this assumption by considering that the random effects and model errors follow skew-normal distributions, which include normality as a special case and provide flexibility in capturing a broad range of non-normal behavior. The marginal distribution of the observed quantity is derived in closed form, so inference may be carried out using existing statistical software and standard optimization techniques. We also implement an EM-type algorithm, which seems to provide some advantages over direct maximization of the likelihood. Results of simulation studies and applications to real data sets are reported.
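The nesting claim in this abstract, that the skew-normal family includes normality as a special case, can be checked directly with SciPy's skewnorm distribution (a sketch of the distributional claim only, not of the authors' EM algorithm):

```python
import numpy as np
from scipy.stats import norm, skewnorm

x = np.linspace(-3, 3, 7)
# Shape parameter a = 0 recovers the standard normal density exactly.
assert np.allclose(skewnorm.pdf(x, a=0), norm.pdf(x))

# Increasing |a| skews the distribution, pulling the mean away from 0,
# which is how the family captures non-normal behavior of random effects.
print(skewnorm.mean(a=0), skewnorm.mean(a=5))
```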

193 citations


Journal ArticleDOI
15 Nov 2021-Neuron
TL;DR: The authors introduce linear and generalized mixed-effects models that account for data dependence and provide clear instruction on how to recognize when they are needed and how to apply them. The most widely used methods, such as the t test and ANOVA, do not take data dependence into account and thus are often misused.

104 citations


Journal ArticleDOI
TL;DR: In this paper, a generalized linear mixed model with a random effect for individual, accounting for within-sample correlation, is proposed to compute differential expression within a specific cell type across treatment groups, properly accounting for both zero inflation and the correlation structure among measures from cells within an individual.
Abstract: Cells from the same individual share common genetic and environmental backgrounds and are not statistically independent; therefore, they are subsamples or pseudoreplicates. Thus, single-cell data have a hierarchical structure that many current single-cell methods do not address, leading to biased inference, highly inflated type 1 error rates, and reduced robustness and reproducibility. This includes methods that use a batch effect correction for individual as a means of accounting for within-sample correlation. Here, we document this dependence across a range of cell types and show that pseudo-bulk aggregation methods are conservative and underpowered relative to mixed models. To compute differential expression within a specific cell type across treatment groups, we propose applying generalized linear mixed models with a random effect for individual, to properly account for both zero inflation and the correlation structure among measures from cells within an individual. Finally, we provide power estimates across a range of experimental conditions to assist researchers in designing appropriately powered studies. Single-cell genomics uses cells from the same individual, or pseudoreplicates, which can introduce biases and inflate type 1 error rates. Here the authors apply generalized linear mixed models with a random effect for individual, to properly account for both zero inflation and the correlation structure among cells within an individual.
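The inflated type 1 error described above can be reproduced with a small simulation (a hedged sketch with made-up variance components, not the authors' analysis): cells are generated within individuals under a null treatment effect, and a naive cell-level t test that ignores the within-individual correlation rejects far more often than the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_ind, n_cells = 500, 10, 50   # individuals per group, cells per individual
sigma_b, sigma_e = 1.0, 1.0            # individual-level and cell-level SDs

rejections = 0
for _ in range(n_sims):
    # Two groups of individuals; NO true treatment effect (null is true).
    b = rng.normal(0, sigma_b, 2 * n_ind)            # random effect per individual
    y = b.repeat(n_cells) + rng.normal(0, sigma_e, 2 * n_ind * n_cells)
    g1, g2 = y[: n_ind * n_cells], y[n_ind * n_cells :]
    # Naive cell-level t test treats all cells as independent observations.
    _, p = stats.ttest_ind(g1, g2)
    rejections += p < 0.05

print("naive type 1 error:", rejections / n_sims)    # far above the nominal 0.05
```

Pseudoreplication makes the effective sample size the number of individuals, not the number of cells, which is exactly what the random effect for individual restores.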

81 citations


Journal ArticleDOI
TL;DR: In this article, the authors reviewed and summarized four types of gap-filling strategies, and applied them to a random forest PM2.5 prediction model that incorporated ground observations, chemical transport model (CTM) simulations, and satellite AOD for predicting daily PM 2.5 concentrations at a 1-km resolution in 2013 in the Beijing-Tianjin-Hebei region and the Yangtze River Delta.

58 citations


Journal ArticleDOI
TL;DR: In this article, a general approach of random forests for high-dimensional longitudinal data is proposed, which includes a flexible stochastic model which allows the covariance structure to vary over time.
Abstract: Random forests are one of the state-of-the-art supervised machine learning methods and achieve good performance in high-dimensional settings where p, the number of predictors, is much larger than n, the number of observations. Repeated measurements provide, in general, additional information; hence they are worth accounting for, especially when analyzing high-dimensional data. Tree-based methods have already been adapted to clustered and longitudinal data by using a semi-parametric mixed effects model, in which the non-parametric part is estimated using regression trees or random forests. We propose a general approach of random forests for high-dimensional longitudinal data. It includes a flexible stochastic model which allows the covariance structure to vary over time. Furthermore, we introduce a new method which takes intra-individual covariance into consideration to build random forests. Through simulation experiments, we then study the behavior of different estimation methods, especially in the context of high-dimensional data. Finally, the proposed method has been applied to an HIV vaccine trial including 17 HIV-infected patients with 10 repeated measurements of 20,000 gene transcripts and blood concentration of human immunodeficiency virus RNA. The approach selected 21 gene transcripts for which the association with HIV viral load was fully relevant and consistent with results observed during primary infection.

30 citations


Journal ArticleDOI
TL;DR: In this paper, the R package cAIC4 is introduced, which allows for the computation of the conditional Akaike information criterion (cAIC); this computation must take into account the uncertainty of the random-effects variance and is therefore not straightforward.
Abstract: Model selection in mixed models based on the conditional distribution is appropriate for many practical applications and has been a focus of recent statistical research. In this paper we introduce the R package cAIC4 that allows for the computation of the conditional Akaike information criterion (cAIC). Computation of the conditional AIC needs to take into account the uncertainty of the random effects variance and is therefore not straightforward. We introduce a fast and stable implementation for the calculation of the cAIC for (generalized) linear mixed models estimated with lme4 and (generalized) additive mixed models estimated with gamm4. Furthermore, cAIC4 offers a stepwise function that allows for an automated stepwise selection scheme for mixed models based on the cAIC. Examples of many possible applications are presented to illustrate the practical impact and easy handling of the package.
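For orientation, one common form of the conditional AIC (due to Vaida and Blanchard, 2005) penalizes the conditional log-likelihood by the effective degrees of freedom; the abstract notes that properly accounting for uncertainty in the random-effects variance refines this penalty and is what makes the computation non-trivial:

```latex
\mathrm{cAIC} = -2\,\log f\bigl(y \mid \hat\beta, \hat b\bigr) + 2\rho,
\qquad
\rho = \operatorname{tr}(H), \quad \hat y = H y,
```

where H is the hat matrix mapping the observations y to the fitted values of the mixed model.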

21 citations


Journal ArticleDOI
TL;DR: Wang et al. developed a mixed effect model (MEM) to estimate ground-level NO2 in China from January 1, 2014 to June 30, 2020, using multivariate auxiliary data such as meteorological elements and terrain elevation.

20 citations


Journal ArticleDOI
TL;DR: In this paper, a general regression framework for the analysis of network data is reviewed that combines additive and multiplicative effects and accommodates a variety of network data types, including continuous, binary and ordinal network relations.
Abstract: Network datasets typically exhibit certain types of statistical patterns, such as within-dyad correlation, degree heterogeneity, and triadic patterns such as transitivity and clustering. The first two of these can be well represented with a social relations model, a type of additive effects model originally developed for continuous dyadic data. Higher-order patterns can be represented with multiplicative effects models, which are related to matrix decompositions that are commonly used for matrix-variate data analysis. Additionally, these multiplicative effects models generalize other popular latent feature network models, such as the stochastic blockmodel and the latent space model. In this article, we review a general regression framework for the analysis of network data that combines these two types of effects, and accommodates a variety of network data types, including continuous, binary and ordinal network relations.

17 citations


Journal ArticleDOI
06 Jul 2021
TL;DR: In this paper, the authors outline the common approaches and focus on the impact of aggregation, the effect of measurement error, the choice of prior distribution, and the detection of interactions.
Abstract: Although Bayesian linear mixed effects models are increasingly popular for analysis of within-subject designs in psychology and other fields, there remains considerable ambiguity on the most appropriate Bayes factor hypothesis test to quantify the degree to which the data support the presence or absence of an experimental effect. Specifically, different choices for both the null model and the alternative model are possible, and each choice constitutes a different definition of an effect resulting in a different test outcome. We outline the common approaches and focus on the impact of aggregation, the effect of measurement error, the choice of prior distribution, and the detection of interactions. For concreteness, three example scenarios showcase how seemingly innocuous choices can lead to dramatic differences in statistical evidence. We hope this work will facilitate a more explicit discussion about best practices in Bayes factor hypothesis testing in mixed models.

16 citations


Journal ArticleDOI
TL;DR: In this paper, the authors describe and illustrate a general, efficient approach to Bayesian SEM estimation in Stan, contrasting it with previous implementations in R package blavaan (Merkle and Rosseel 2018).
Abstract: Structural equation models comprise a large class of popular statistical models, including factor analysis models, certain mixed models, and extensions thereof. Model estimation is complicated by the fact that we typically have multiple interdependent response variables and multiple latent variables (which may also be called random effects or hidden variables), often leading to slow and inefficient posterior sampling. In this paper, we describe and illustrate a general, efficient approach to Bayesian SEM estimation in Stan, contrasting it with previous implementations in R package blavaan (Merkle and Rosseel 2018). After describing the approaches in detail, we conduct a practical comparison under multiple scenarios. The comparisons show that the new approach is clearly better. We also discuss ways that the approach may be extended to other models that are of interest to psychometricians.

15 citations


Journal ArticleDOI
TL;DR: In this article, the authors show that the collinearity between fixed effects and random effects in a spatial generalized linear mixed model can adversely affect estimates of the fixed effects, and that restricted spatio-temporal information can affect the estimation of fixed effects.
Abstract: Spatial confounding, that is, collinearity between fixed effects and random effects in a spatial generalized linear mixed model, can adversely affect estimates of the fixed effects. Restricted spat...

Journal ArticleDOI
TL;DR: Longitudinal models using the eye as the unit of analysis can be implemented using available statistical software to account for both inter-eye and longitudinal correlations.
Abstract: Purpose: To describe and demonstrate methods for analyzing longitudinal correlated eye data with a continuous outcome measure. Methods: We described fixed effects, mixed effects and generalized est...

Journal ArticleDOI
TL;DR: In this paper, robust designs for generalized linear mixed models (GLMMs) with protections against possible departures from underlying model assumptions are studied, where the authors develop methods for constructing adaptive sequential designs when the fitted mean response or the link function is possibly of an incorrect parametric form.

Journal ArticleDOI
13 May 2021
TL;DR: This article explores an underused mathematical analytical methodology in the social sciences and presents detailed guidelines regarding the estimation of models where the data for the outcome variable includes an excess number of zeros, and the dataset has a natural nested structure.
Abstract: Our article explores an underused mathematical analytical methodology in the social sciences. In addition to describing the method and its advantages, we extend a previously reported application of mixed models to a well-known database about corruption in 149 countries. The dataset in the mentioned study included a considerable proportion of zeros (13.19%) in the outcome variable, which is typical of this type of research, as well as of much social sciences research. In our paper, we present detailed guidelines regarding the estimation of models where the data for the outcome variable include an excess number of zeros and the dataset has a natural nested structure. We believe our research is unlikely to reject the hypothesis favoring the adoption of mixed modeling and zero inflation over the original, simpler framework. Instead, our results demonstrate the importance of considering random effects at the country level and the zero-inflated nature of the outcome variable.

Journal ArticleDOI
TL;DR: In this article, the authors propose a seamless step-wise procedure that allows for carry on of estimated means and variances from stage to stage, based on the extraction of three intermediate traits; (1) timing of key stages, (2) quantities at defined time points or periods, and (3) dose-response curves.

Journal ArticleDOI
TL;DR: A generalized linear model is developed that allows for non-Gaussian distributions of the genotypic response variables and for the treatment of multiallelic nucleotide polymorphisms; this model is combined with an admixture-based model or principal components analysis to correct for population structure (MLR-ADM and MLR-PC).
Abstract: To understand how organisms adapt to their environment, a gene-environmental association (GEA) analysis is commonly conducted. GEA methods based on mixed models, such as linear latent factor mixed models (LFMM) and LFMM2, have grown in popularity for their robust performance in terms of power and computational speed. However, it is unclear how the assumption of a Gaussian distribution for the response variables influences model performance. In this paper, we develop a generalized linear model (GLM) that allows for non-Gaussian distributions in the genotypic response variables and for the treatment of multiallelic nucleotide polymorphisms. Moreover, this multinomial logistic regression model (MLR) is combined with an admixture-based model or principal components analysis to correct for population structure (MLR-ADM and MLR-PC). Using simulations, we evaluate the type 1 error, false discovery rates (FDR), and power to detect selected SNPs, to guide model choice and best practices. With genomic control, MLR-PC and LFMM2 have similar type 1 error, FDRs, and power when analysing biallelic SNPs, while dramatically outperforming models not accounting for population structure. Differences in performance occur under continuous population structure, where MLR-PC outperforms LFMM/LFMM2, especially when a larger number of clusters or triallelic SNPs are analysed. The Human Genome Diversity Project (HGDP) data set shows that both MLR-PC and LFMM2 control the inflation of P-values. Analysis of the 1000 Genomes Project Phase 3 data set illustrates that MLR-PC and LFMM2 produce consistent results for most significant SNPs, while MLR-PC discovered additional SNPs corresponding to certain genes, suggesting MLR-PC may be a useful alternative for GEA inference.

Journal ArticleDOI
TL;DR: In this paper, a linear mixed effects model is used to fit longitudinal data in the presence of non-random dropout, and parameter estimates of the dropout model have been obtained.
Abstract: Longitudinal studies represent one of the principal research strategies employed in medical and social research. These studies are the most appropriate for studying individual change over time. The premature withdrawal of some subjects from the study (dropout) is termed nonrandom when the probability of missingness depends on the missing value. Nonrandom dropout is a common phenomenon associated with longitudinal data, and it complicates statistical inference. A linear mixed effects model is used to fit longitudinal data in the presence of nonrandom dropout. A stochastic EM algorithm is developed to obtain the model parameter estimates. Parameter estimates of the dropout model are also obtained. Standard errors of the estimates are calculated using the developed Monte Carlo method. All these methods are applied to two data sets.

Journal ArticleDOI
TL;DR: In this paper, the benefits of integrating geostatistical covariance structures and ANOVA procedures into a linear mixed modelling framework are demonstrated, accounting for spatial correlation among observations in designed experiments.
Abstract: Soil scientists are accustomed to geostatistical methods and tools such as semivariograms and kriging for analysis of observational data. Such methods assume and exploit that observations are spatially correlated. Conversely, analysis of variance (ANOVA) of designed experiments assumes that observations from different experimental units are independent, an assumption that is justified based on randomization. It may be beneficial, however, to perform an ANOVA assuming a geostatistical covariance model. Also, it is increasingly common to have multiple observations per experimental unit. Simple ANOVA assuming independence of observations is not appropriate for such data. Instead, a linear mixed model accounting for correlation among observations made on the same plot is required for proper analysis. The purpose of this paper is to demonstrate the benefits of integrating geostatistical covariance structures and ANOVA procedures into a linear mixed modelling framework. Two examples from designed experiments are considered in detail, making a link between terminologies and jargon used in geostatistical analysis on the one hand and linear mixed modelling on the other hand. We provide code in R and SAS for both examples in two supporting companion documents. HIGHLIGHTS: Analysis of variance and geostatistical analysis can be joined in a mixed model. Randomization justifies the independence assumption in analysis of variance. Geostatistical models imply a correlation of errors and can improve efficiency. Lacking randomization, spatial correlation can be accounted for in a mixed model.
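A minimal sketch of the geostatistical side of such a mixed model: an exponential covariance function over plot locations (one-dimensional here, with illustrative sill, range, and nugget values that are not taken from the paper) produces errors whose correlation decays with distance while remaining a valid covariance matrix.

```python
import numpy as np

def exp_cov(coords, sill=1.0, range_=2.0, nugget=0.1):
    """Exponential geostatistical covariance: C(h) = sill*exp(-h/range) + nugget*1(h=0)."""
    h = np.abs(coords[:, None] - coords[None, :])   # pairwise distances (1-D transect)
    return sill * np.exp(-h / range_) + nugget * np.eye(len(coords))

coords = np.arange(6, dtype=float)   # six plots along a transect
V = exp_cov(coords)

# Correlation decays with distance between plots ...
assert V[0, 1] > V[0, 5]
# ... and V is positive definite, so it can serve as the error covariance
# of a linear mixed model in place of the independence assumption of ANOVA.
np.linalg.cholesky(V)   # succeeds only for a positive-definite matrix
```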

Journal ArticleDOI
TL;DR: In this paper, the authors developed an efficient leave-one-out cross-validation (LOOCV) method for prediction of breeding values and other random effects under a general mixed linear model with multiple random effects.
Abstract: Empirical estimates of the accuracy of estimates of breeding values (EBV) can be obtained by cross-validation. Leave-one-out cross-validation (LOOCV) is an extreme case of k-fold cross-validation. Efficient strategies for LOOCV of predictions of phenotypes have been developed for a simple model with an overall mean and random marker or animal genetic effects. The objective here was to develop and evaluate an efficient LOOCV method for prediction of breeding values and other random effects under a general mixed linear model with multiple random effects. Conventional LOOCV of EBV requires inverting an (n-1)×(n-1) covariance matrix for each of n (= number of observations) data sets. Our efficient LOOCV obtains the required inverses from the inverse of the covariance matrix for all n observations. The efficient method can be applied to complex models with multiple fixed and random effects, but requires fixed effects to be treated as random, with large variances. An alternative is to precorrect observations using estimates of fixed effects obtained from the complete data, but this can lead to biases. The efficient LOOCV method was compared to conventional LOOCV of predictions of breeding values in terms of computational demands and accuracy. For a data set with 3,205 observations and a model with multiple random and fixed effects, the efficient LOOCV method was 962 times faster than the conventional LOOCV with precorrection for fixed effects based on each training data set but resulted in identical EBV. A computationally efficient LOOCV for prediction of breeding values for single- and multiple-trait mixed models with multiple fixed and random effects was successfully developed. The method enables cross-validation of predictions of breeding values and of any linear combination of random and/or fixed effects, along with leave-one-out precorrection of validation phenotypes.
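The core identity behind this kind of efficient LOOCV can be checked numerically in a simplified zero-mean setting (no fixed effects, a hypothetical covariance matrix): every leave-one-out residual is available from a single inverse of the full n x n covariance matrix, matching the brute-force computation that inverts an (n-1) x (n-1) matrix per observation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# A random positive-definite covariance matrix V and one data vector y.
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)
y = np.linalg.cholesky(V) @ rng.normal(size=n)

def loo_resid_brute(V, y, i):
    """Predict y_i from y_{-i}, solving an (n-1)x(n-1) system for each i."""
    m = np.arange(n) != i
    pred = V[i, m] @ np.linalg.solve(V[np.ix_(m, m)], y[m])
    return y[i] - pred

brute = np.array([loo_resid_brute(V, y, i) for i in range(n)])

# Efficient: all n leave-one-out residuals from ONE inverse of V.
Q = np.linalg.inv(V)
efficient = (Q @ y) / np.diag(Q)

assert np.allclose(brute, efficient)
```

The identity follows from block-matrix inversion, which is why the required (n-1) x (n-1) inverses can be read off the inverse of the full covariance matrix, as the abstract describes.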

Journal ArticleDOI
TL;DR: In this article, the authors developed a multi-stage ensemble model that estimates daily mean PM2.5 and PM10 at 1-km spatial resolution across France from 2000 to 2019.

Journal ArticleDOI
TL;DR: In this article, a temporal bivariate area-level linear mixed model with independent time effects for estimating small area socioeconomic indicators is introduced, and the model is fitted by using the residual maximum likelihood method.
Abstract: This paper introduces a temporal bivariate area-level linear mixed model with independent time effects for estimating small area socioeconomic indicators. The model is fitted by using the residual maximum likelihood method. Empirical best linear unbiased predictors of these indicators are derived. An approximation to the matrix of mean squared errors (MSE) is given and four MSE estimators are proposed. The first MSE estimator is a plug-in version of the MSE approximation. The remaining MSE estimators rely on parametric bootstrap procedures. Three simulation experiments designed to analyze the behavior of the fitting algorithm, the predictors and the MSE estimators are carried out. An application to real data from the 2005 and 2006 Spanish living conditions survey illustrate the introduced statistical methodology. The target is the estimation of 2006 poverty proportions and gaps by provinces and sex.
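In the simpler univariate Fay-Herriot setting that underlies such area-level models, the empirical best linear unbiased predictor is a shrinkage compromise between the direct survey estimate and the regression-synthetic estimate; the numbers below are purely illustrative, not from the Spanish survey.

```python
import numpy as np

y     = np.array([0.22, 0.35, 0.18, 0.41])        # direct area estimates
D     = np.array([0.010, 0.002, 0.020, 0.001])    # known sampling variances
synth = np.array([0.25, 0.30, 0.25, 0.38])        # X @ beta from the linking model
sigma_u2 = 0.004                                  # area-level model variance

gamma = sigma_u2 / (sigma_u2 + D)                 # shrinkage weights in [0, 1]
eblup = gamma * y + (1 - gamma) * synth

# Areas with precise direct estimates (small D) keep more weight on them ...
assert gamma[3] > gamma[2]
# ... and every EBLUP lies between the direct and synthetic estimates.
assert np.all((eblup >= np.minimum(y, synth)) & (eblup <= np.maximum(y, synth)))
```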

Journal ArticleDOI
TL;DR: This study proposes to model the mixture and Poisson parameters hierarchically, each as a function of two random effects representing the genetic and environmental sources of variability, respectively; the approach yields posterior distributions useful for studying environmental and genetic variability, as well as genetic correlation.
Abstract: Response variables that are scored as counts, for example, number of mastitis cases in dairy cattle, often arise in quantitative genetic analysis. When the number of zeros exceeds the amount expected under, for example, the Poisson density, the zero-inflated Poisson (ZIP) model is more appropriate. In using the ZIP model in animal breeding studies, it is necessary to accommodate genetic and environmental covariances. For that, this study proposes to model the mixture and Poisson parameters hierarchically, each as a function of two random effects, representing the genetic and environmental sources of variability, respectively. The genetic random effects are allowed to be correlated, leading to a correlation within and between clusters. The environmental effects are introduced by independent residual terms, accounting for overdispersion above that caused by extra zeros. In addition, an intercorrelation structure between random genetic effects affecting the mixture and Poisson parameters is used to infer pleiotropy, an expression of the extent to which these parameters are influenced by common genes. The methods described here are illustrated with data on the number of mastitis cases from Norwegian Red cows. Bayesian analysis yields posterior distributions useful for studying environmental and genetic variability, as well as genetic correlation.
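The extra-zero mechanism itself is easy to state: a ZIP variable is zero with probability pi, plus whatever zeros the Poisson component contributes. A minimal sketch with illustrative parameter values (none taken from the study):

```python
import numpy as np
from scipy.stats import poisson

def zip_pmf(k, pi, lam):
    """Zero-inflated Poisson: extra mass pi at zero, (1-pi) x Poisson(lam) otherwise."""
    base = (1 - pi) * poisson.pmf(k, lam)
    return base + pi * (np.asarray(k) == 0)

pi, lam = 0.3, 2.0
k = np.arange(50)
p = zip_pmf(k, pi, lam)

assert np.isclose(p.sum(), 1.0)        # a valid probability mass function
assert p[0] > poisson.pmf(0, lam)      # more zeros than a plain Poisson allows
```

In the hierarchical model of the abstract, pi and lam would each depend on genetic and environmental random effects rather than being fixed constants as here.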

Journal ArticleDOI
TL;DR: This work proposes a new boosting algorithm which explicitly accounts for the random structure by excluding it from the selection procedure, properly correcting the random effects estimates and, in addition, providing likelihood-based estimation of the random effects variance structure.
Abstract: Gradient boosting from the field of statistical learning is widely known as a powerful framework for estimation and selection of predictor effects in various regression models by adapting concepts from classification theory. Current boosting approaches also offer methods accounting for random effects and thus enable prediction of mixed models for longitudinal and clustered data. However, these approaches include several flaws resulting in unbalanced effect selection with falsely induced shrinkage and a low convergence rate on the one hand and biased estimates of the random effects on the other hand. We therefore propose a new boosting algorithm which explicitly accounts for the random structure by excluding it from the selection procedure, properly correcting the random effects estimates and in addition providing likelihood-based estimation of the random effects variance structure. The new algorithm offers an organic and unbiased fitting approach, which is shown via simulations and data examples.

Journal ArticleDOI
08 Feb 2021-Forests
TL;DR: In this article, Li et al. used TLS data to construct a mixed model of the taper function based on the tree effect, and the TLS data extraction and model prediction effects were evaluated to derive the stem diameter and volume.
Abstract: Terrestrial laser scanning (TLS) plays a significant role in forest resource investigation, forest parameter inversion and tree 3D model reconstruction. TLS can accurately, quickly and nondestructively obtain 3D structural information of standing trees. TLS data, rather than felled wood data, were used to construct a mixed model of the taper function based on the tree effect, and the TLS data extraction and model prediction effects were evaluated to derive the stem diameter and volume. TLS was applied to a total of 580 trees in nine larch (Larix olgensis) forest plots, and another 30 trees were used for stem analysis in Mengjiagang. First, the diameter accuracies at different heights of the stem analysis were analyzed against the TLS data. Then, the stem analysis data and TLS data were used to establish the stem taper function and select the optimal basic model, from which a mixed model based on the tree effect was determined. Six basic models were fitted, and the taper equation was comprehensively evaluated by various statistical metrics. Finally, the optimal mixed model of the plot was used to derive stem diameters and trunk volumes. The stem diameter accuracy obtained by TLS was >98%. The taper function fitting results of these data were approximately the same, and the optimal basic model was Kozak (2002)-II. For the tree effect, with a6 and a9 as the mixed parameters, the mixed model showed the best fit, and the accuracy of the optimal mixed model reached 99.72%. The mixed model accuracy for predicting the tree diameter was between 74.22% and 97.68%, with a volume estimation accuracy of 96.38%. Relative height 70 (RH70) was the optimum height for extraction, and the fitting accuracy of the mixed model was higher than that of the basic model.

Journal ArticleDOI
TL;DR: In this paper, negative clustering effects are studied in the context of the linear mixed effects model and the authors highlight the importance of understanding these phenomena through analysis of the data from Lamers, Bohlmeijer, Korte, and Westerhof.
Abstract: The linear mixed effects model is an often used tool for the analysis of multilevel data. However, this model has an ill-understood shortcoming: it assumes that observations within clusters are always positively correlated. This assumption is not always true: individuals competing in a cluster for scarce resources are negatively correlated. Random effects in a mixed effects model can model a positive correlation among clustered observations but not a negative correlation. As negative clustering effects are largely unknown to the sheer majority of the research community, we conducted a simulation study to detail the bias that occurs when analysing negative clustering effects with the linear mixed effects model. We also demonstrate that ignoring a small negative correlation leads to deflated Type-I errors, invalid standard errors and confidence intervals in regression analysis. When negative clustering effects are ignored, mixed effects models incorrectly assume that observations are independently distributed. We highlight the importance of understanding these phenomena through analysis of the data from Lamers, Bohlmeijer, Korte, and Westerhof (2015). We conclude with a reflection on well-known multilevel modelling rules when dealing with negative dependencies in a cluster: negative clustering effects can, do and will occur and these effects cannot be ignored.
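The constraint discussed above can be made concrete: a compound-symmetry correlation with m members per cluster is positive definite only for rho > -1/(m-1), yet the correlation implied by a random intercept, sigma_b^2 / (sigma_b^2 + sigma_e^2), is never negative. A small check with illustrative values:

```python
import numpy as np

def cs_matrix(m, rho):
    """Compound-symmetry correlation: 1 on the diagonal, rho off-diagonal."""
    return (1 - rho) * np.eye(m) + rho * np.ones((m, m))

def is_pd(M):
    return np.all(np.linalg.eigvalsh(M) > 0)

m = 5
# Negative within-cluster correlation is valid only down to -1/(m-1) = -0.25 ...
assert is_pd(cs_matrix(m, -0.2))       # -0.2 > -0.25: a valid correlation matrix
assert not is_pd(cs_matrix(m, -0.3))   # -0.3 < -0.25: not positive definite

# ... but a random-intercept model can never produce it: its implied
# within-cluster correlation sigma_b^2 / (sigma_b^2 + sigma_e^2) is >= 0.
sigma_b2, sigma_e2 = 1.0, 2.0
assert sigma_b2 / (sigma_b2 + sigma_e2) >= 0
```

This is why fitting a random-intercept model to competitively generated (negatively correlated) clusters forces the variance component toward zero and yields the biases the abstract describes.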

Journal ArticleDOI
TL;DR: An integrated hybrid approach combining dispersion modeling and land use regression to predict daily NO2 concentrations at a high spatial resolution (e.g., 50 m) in the New York tri-state area is demonstrated.

Journal ArticleDOI
TL;DR: In this paper, the authors investigated the use of remotely sensed phenology, climate data and machine learning models for estimating yield at a resolution suitable for optimising crop management in fields.
Abstract: Satellite remote sensing offers a cost-effective means of generating long-term hindcasts of yield that can be used to understand how yield varies in time and space. This study investigated the use of remotely sensed phenology, climate data and machine learning for estimating yield at a resolution suitable for optimising crop management in fields. We used spatially weighted growth curve estimation to identify the timing of phenological events from sequences of Landsat NDVI and derive phenological and seasonal climate metrics. Using data from a 17,000 ha study area, we investigated the relationships between the metrics and yield over 17 years from 2003 to 2019. We compared six statistical and machine learning models for estimating yield: multiple linear regression, mixed effects models, generalised additive models, random forests, support vector regression using radial basis functions and deep learning neural networks. We used a 50-50 train-test split on paddock-years, where 50% of paddock-year combinations were randomly selected and used to train each model and the remaining 50% of paddock-years were used to assess the model accuracy. Using only phenological metrics, accuracy was highest using a linear mixed model with a random effect that allowed the relationship between integrated NDVI and yield to vary by year (R2 = 0.67, MAE = 0.25 t ha−1, RMSE = 0.33 t ha−1, NRMSE = 0.25). We quantified the improvements in accuracy when seasonal climate metrics were also used as predictors. We identified two optimal models using the combined phenological and seasonal climate metrics: support vector regression and deep learning models (R2 = 0.68, MAE = 0.25 t ha−1, RMSE = 0.32 t ha−1, NRMSE = 0.25). While the linear mixed model using only phenological metrics performed similarly to the nonlinear models that also used seasonal climate metrics, the nonlinear models can be more easily generalised to estimate yield in years for which training data are unavailable.
We conclude that long-term hindcasts of wheat yield in fields, at 30 m spatial resolution, can be produced using remotely sensed phenology from Landsat NDVI, climate data and machine learning.

Journal ArticleDOI
TL;DR: In this article, the authors propose an extension of these models that, in addition to a random effect for the mean level, also includes a random effect for the within-subject variance and a random effect for the autocorrelation.
Abstract: Research in psychology is experiencing a rapid increase in the availability of intensive longitudinal data. To use such data for predicting feelings, beliefs, and behavior, recent methodological work suggested combinations of the longitudinal mixed-effect model with Lasso regression or with regression trees. The present article adds to this literature by suggesting an extension of these models that, in addition to a random effect for the mean level, also includes a random effect for the within-subject variance and a random effect for the autocorrelation. After introducing the extended mixed-effect location scale (E-MELS), the extended mixed-effect location-scale Lasso model (Lasso E-MELS), and the extended mixed-effect location-scale tree model (E-MELS trees), we show how their parameters can be estimated using a marginal maximum likelihood approach. Using real and simulated example data, we illustrate how to use E-MELS, Lasso E-MELS, and E-MELS trees for building prediction models to forecast individuals' daily nervousness. The article is accompanied by an R package (called mels) and functions that support users in the application of the suggested models.
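The three random effects described above can be made concrete with a small simulation: each subject receives a random mean level, a random (log) within-subject variance, and a random AR(1) autocorrelation. All parameter values below are illustrative assumptions, not estimates from the paper, and this is a data-generating sketch rather than the paper's estimation procedure:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate nervousness-like daily series under the E-MELS structure:
# (1) random location mu_i, (2) random log within-subject SD, (3) random AR(1)
# coefficient phi_i. Values are illustrative assumptions only.
n_subjects, n_days = 200, 50
mu_i = 3.0 + rng.normal(0.0, 0.8, n_subjects)                 # random mean level
log_sd_i = np.log(0.6) + rng.normal(0.0, 0.3, n_subjects)     # random scale
phi_i = np.clip(0.3 + rng.normal(0.0, 0.15, n_subjects), -0.9, 0.9)  # random AR(1)

y = np.empty((n_subjects, n_days))
for i in range(n_subjects):
    sd = np.exp(log_sd_i[i])
    e = np.empty(n_days)
    e[0] = rng.normal(0.0, sd)
    for t in range(1, n_days):
        # Innovations scaled so the marginal SD of the process stays at sd.
        e[t] = phi_i[i] * e[t - 1] + rng.normal(0.0, sd * np.sqrt(1 - phi_i[i] ** 2))
    y[i] = mu_i[i] + e

# Subjects now differ not only in mean level but also in day-to-day
# variability and inertia - the heterogeneity the three random effects capture.
print(np.std(y.mean(axis=1)), np.std(y.std(axis=1)))
```

A conventional mixed model with only a random intercept would treat the between-subject differences in variability and autocorrelation as noise; E-MELS models them explicitly, which is what makes person-specific forecasts of volatility possible.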

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a hybrid prediction model based on data decomposition, choosing wavelet decomposition (WD) to generate high-frequency detail sequences WD(D) and low-frequency approximate sequences WD(A).
Abstract: Accurate and reliable air quality predictions are critical to the ecological environment and public health. Traditional models fail to make full use of the high- and low-frequency information obtained after wavelet decomposition, which easily leads to poor prediction performance. This paper proposes a hybrid prediction model based on data decomposition, choosing wavelet decomposition (WD) to generate high-frequency detail sequences WD(D) and low-frequency approximate sequences WD(A), using a sliding window to reconstruct the high-frequency detail sequences WD(D), and using a long short-term memory (LSTM) neural network and an autoregressive moving average (ARMA) model to predict the WD(D) and WD(A) sequences, respectively. The final air quality prediction is obtained by accumulating the predicted values of each sub-sequence, which reduces the root mean square error (RMSE) by 52% and the mean absolute error (MAE) by 47%, and increases the goodness of fit (R2) by 18%, compared with a single prediction model. Compared with the mixed model, it reduced the RMSE by 3%, reduced the MAE by 3%, and increased the R2 by 0.5%. Experimental verification found that the proposed model solves the problem of lagging prediction results from a single prediction model, making it a feasible air quality prediction method.
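The decompose-predict-accumulate scheme above works because wavelet reconstruction is linear: the approximation and detail components sum back to the original series exactly, so per-component forecasts can simply be added. A hand-rolled one-level Haar transform illustrates this; the paper does not specify which wavelet it uses, and Haar is chosen here purely for simplicity:

```python
import numpy as np

# One-level Haar decomposition into a low-frequency approximation WD(A)
# and a high-frequency detail WD(D), for an even-length series.
def haar_split(x):
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # WD(A): smooth trend
    detail = (even - odd) / np.sqrt(2.0)   # WD(D): fast fluctuations
    return approx, detail

def haar_reconstruct(approx, detail):
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

# Toy "air quality" series: a slow trend plus fast noise.
t = np.arange(256)
x = 50 + 10 * np.sin(2 * np.pi * t / 128) + np.random.default_rng(0).normal(0, 2, 256)

a, d = haar_split(x)
# Linearity of the reconstruction means recombining components recovers the
# series exactly - which is why per-component predictions can be summed.
x_hat = haar_reconstruct(a, d)
print(np.max(np.abs(x_hat - x)))  # numerically zero
```

In the hybrid model, the slowly varying WD(A) is handled by the linear ARMA model while the irregular WD(D) goes to the LSTM, matching each model class to the frequency band it fits best.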

Posted ContentDOI
02 May 2021-bioRxiv
TL;DR: In this paper, the authors proposed a seamless stage-wise process that allows estimated means and variances to be carried forward from stage to stage and approximates the gold standard of a single-stage analysis.
Abstract: Decision-making in breeding increasingly depends on the ability to capture and predict crop responses to changing environmental factors. Advances in crop modeling as well as high-throughput field phenotyping (HTFP) hold promise to provide such insights. Processing HTFP data is an interdisciplinary task that requires broad knowledge of experimental design, measurement techniques, feature extraction, dynamic trait modeling, and prediction of genotypic values using statistical models. To get an overview of sources of variation in HTFP, we develop a general plot-level model for repeated measurements. Based on this model, we propose a seamless stage-wise process that allows estimated means and variances to be carried forward from stage to stage and approximates the gold standard of a single-stage analysis. The process builds on the extraction of three intermediate trait categories: (1) timing of key stages, (2) quantities at defined time points or periods, and (3) dose-response curves. In a first stage, these intermediate traits are extracted from low-level trait time series (e.g., canopy height) using P-splines and the quarter of maximum elongation rate (QMER) method, as well as final height percentiles. In a second and third stage, the extracted traits are further processed using a stage-wise linear mixed model analysis. Using a wheat canopy growth simulation to generate canopy height time series, we demonstrate the suitability of the stage-wise process for traits of the first two categories. Results indicate that, for the first stage, the P-spline/QMER method was more robust than the percentile method. In the subsequent two-stage linear mixed model processing, weighting the second and third stages with error variance estimates from the previous stages improved the root mean squared error. We conclude that processing phenomics data in stages is a feasible approach, provided appropriate weighting is used through all stages.
P-splines in combination with the QMER method are suitable tools to extract the timing of key stages and quantities at defined time points from HTFP data.
Highlights:
- General plot-level model for repeated high-throughput field phenotyping measurements
- Three main intermediate trait categories for dynamic modeling
- Seamless stage-wise process that allows estimated means and variances to be carried forward
- Phenomics data processing cheatsheet
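The first intermediate trait category, timing of key stages, can be sketched in the spirit of the QMER idea: smooth the canopy height series, compute the elongation rate, and locate where the rate crosses a quarter of its maximum. The logistic growth curve and the finite-difference rate below are illustrative assumptions, not the paper's P-spline implementation:

```python
import numpy as np

# Simulated canopy height time series: logistic growth to 0.9 m,
# steepest elongation around day 60 (illustrative parameters).
days = np.arange(0.0, 120.0, 1.0)
height = 0.9 / (1.0 + np.exp(-0.12 * (days - 60.0)))

# Elongation rate (m/day) via finite differences; the paper uses
# P-splines for smoothing before differentiation.
rate = np.gradient(height, days)
threshold = rate.max() / 4.0

# Timing traits: first and last crossing of a quarter of the maximum rate.
start = days[np.argmax(rate >= threshold)]
stop = days[len(rate) - 1 - np.argmax(rate[::-1] >= threshold)]
print(f"growth window ~ day {start:.0f} to day {stop:.0f}, "
      f"peak elongation at day {days[np.argmax(rate)]:.0f}")
```

Per-plot timings extracted this way become the response variable of the second-stage linear mixed model, with the first-stage error variances carried forward as weights.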