
Showing papers on "Linear model published in 2010"


Book
29 Nov 2010
TL;DR: This tutorial jumps right into the power of R without dragging you through the basic concepts of the programming language.
Abstract: Preface 1. Getting Started With R 2. Reading and Manipulating Data 3. Exploring and Transforming Data 4. Fitting Linear Models 5. Fitting Generalized Linear Models 6. Diagnosing Problems in Linear and Generalized Linear Models 7. Drawing Graphs 8. Writing Programs References Author Index Subject Index Command Index Data Set Index Package Index About the Authors

9,947 citations


Journal ArticleDOI
TL;DR: The approach is general because it offers the definition, identification, estimation, and sensitivity analysis of causal mediation effects without reference to any specific statistical model and can accommodate linear and nonlinear relationships, parametric and nonparametric models, continuous and discrete mediators, and various types of outcome variables.
Abstract: Traditionally in the social sciences, causal mediation analysis has been formulated, understood, and implemented within the framework of linear structural equation models. We argue and demonstrate that this is problematic for 3 reasons: the lack of a general definition of causal mediation effects independent of a particular statistical model, the inability to specify the key identification assumption, and the difficulty of extending the framework to nonlinear models. In this article, we propose an alternative approach that overcomes these limitations. Our approach is general because it offers the definition, identification, estimation, and sensitivity analysis of causal mediation effects without reference to any specific statistical model. Further, our approach explicitly links these 4 elements closely together within a single framework. As a result, the proposed framework can accommodate linear and nonlinear relationships, parametric and nonparametric models, continuous and discrete mediators, and various types of outcome variables. The general definition and identification result also allow us to develop sensitivity analysis in the context of commonly used models, which enables applied researchers to formally assess the robustness of their empirical conclusions to violations of the key assumption. We illustrate our approach by applying it to the Job Search Intervention Study. We also offer easy-to-use software that implements all our proposed methods.
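As a concrete illustration of the quantities involved, here is a minimal sketch of the linear special case that the authors' framework generalizes, where the average causal mediation effect reduces to the product of coefficients from the mediator and outcome regressions. It uses simulated data and hypothetical variable names, not the Job Search Intervention Study or the authors' software.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
t = rng.binomial(1, 0.5, n)                   # treatment (e.g., job-search training)
m = 0.4 * t + rng.normal(size=n)              # mediator model: M = a*T + error
y = 0.3 * m + 0.2 * t + rng.normal(size=n)    # outcome model: Y = b*M + c'*T + error

# Fit the two linear models by least squares (intercepts included).
Xm = np.column_stack([np.ones(n), t])
a = np.linalg.lstsq(Xm, m, rcond=None)[0][1]            # effect of T on M
Xy = np.column_stack([np.ones(n), t, m])
c_prime, b = np.linalg.lstsq(Xy, y, rcond=None)[0][1:]  # direct effect, effect of M on Y

acme = a * b                 # average causal mediation effect (linear special case only)
total = acme + c_prime
print(f"ACME ~ {acme:.3f}, direct ~ {c_prime:.3f}, total ~ {total:.3f}")
```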

2,393 citations


Journal ArticleDOI
TL;DR: In this article, the authors focus on parameter estimation (point estimates as well as confidence intervals) rather than on significance thresholds for linear regression models and propose a simple alternative to the more complicated calculation of standard errors from contrasts and main effects.
Abstract: Summary 1. Linear regression models are an important statistical tool in evolutionary and ecological studies. Unfortunately, these models often yield some uninterpretable estimates and hypothesis tests, especially when models contain interactions or polynomial terms. Furthermore, the standard errors for treatment groups, although often of interest for including in a publication, are not directly available in a standard linear model. 2. Centring and standardization of input variables are simple means to improve the interpretability of regression coefficients. Further, refitting the model with a slightly modified model structure allows extracting the appropriate standard errors for treatment groups directly from the model. 3. Centring will make main effects biologically interpretable even when involved in interactions and thus avoids the potential misinterpretation of main effects. This also applies to the estimation of linear effects in the presence of polynomials. Categorical input variables can also be centred and this sometimes assists interpretation. 4. Standardization (z-transformation) of input variables results in the estimation of standardized slopes or standardized partial regression coefficients. Standardized slopes are comparable in magnitude within models as well as between studies. They have some advantages over partial correlation coefficients and are often the more interesting standardized effect size. 5. The thoughtful removal of intercepts or main effects allows extracting treatment means or treatment slopes and their appropriate standard errors directly from a linear model. This provides a simple alternative to the more complicated calculation of standard errors from contrasts and main effects. 6. The simple methods presented here put the focus on parameter estimation (point estimates as well as confidence intervals) rather than on significance thresholds. They allow fitting complex, but meaningful models that can be concisely presented and interpreted. The presented methods can also be applied to generalised linear models (GLM) and linear mixed models.
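The snippet below is a minimal illustration of points 3 and 4: fitting the same interaction model on raw, centred, and z-standardized inputs and comparing the resulting main-effect slopes. It is written in Python with simulated data and hypothetical variable names; the paper's intended audience would more likely work in R.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(10, 2, n)                 # e.g., body size
x2 = rng.normal(5, 1, n)                  # e.g., temperature
y = 1.0 + 0.5 * x1 + 0.8 * x2 + 0.3 * x1 * x2 + rng.normal(size=n)

def fit(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    return np.linalg.lstsq(np.column_stack([np.ones(len(y)), X]), y, rcond=None)[0]

# Raw inputs: the 'main effects' are slopes at x1 = 0, x2 = 0 (often meaningless).
raw = fit(np.column_stack([x1, x2, x1 * x2]), y)

# Centred inputs: main effects become slopes at the average of the other variable.
c1, c2 = x1 - x1.mean(), x2 - x2.mean()
centred = fit(np.column_stack([c1, c2, c1 * c2]), y)

# z-standardized inputs give standardized (comparable) slopes.
z1, z2 = c1 / x1.std(), c2 / x2.std()
standardized = fit(np.column_stack([z1, z2, z1 * z2]), y)

print(raw[1:3], centred[1:3], standardized[1:3])
```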

2,065 citations


Journal ArticleDOI
TL;DR: A compression approach is reported, called 'compressed MLM', that decreases the effective sample size of such datasets by clustering individuals into groups and a complementary approach, 'population parameters previously determined' (P3D), that eliminates the need to re-compute variance components.
Abstract: Mixed linear model (MLM) methods have proven useful in controlling for population structure and relatedness within genome-wide association studies. However, MLM-based methods can be computationally challenging for large datasets. We report a compression approach, called ‘compressed MLM’, that decreases the effective sample size of such datasets by clustering individuals into groups. We also present a complementary approach, ‘population parameters previously determined’ (P3D), that eliminates the need to re-compute variance components. We applied these two methods both independently and combined in selected genetic association datasets from human, dog and maize. The joint implementation of these two methods markedly reduced computing time and either maintained or improved statistical power. We used simulations to demonstrate the usefulness in controlling for substructure in genetic association datasets for a range of species and genetic architectures. We have made these methods available within an implementation of the software program TASSEL.

1,687 citations


Journal ArticleDOI
TL;DR: An SAS macro is developed and presented here that creates an RCS function of continuous exposures, displays graphs showing the dose‐response association with 95 per cent confidence interval between one main continuous exposure and an outcome when performing linear, logistic, or Cox models, as well as linear and logistic‐generalized estimating equations.
Abstract: Taking into account a continuous exposure in regression models by using categorization, when non-linear dose-response associations are expected, has been widely criticized. As one alternative, restricted cubic spline (RCS) functions are powerful tools (i) to characterize a dose-response association between a continuous exposure and an outcome, (ii) to visually and/or statistically check the assumption of linearity of the association, and (iii) to minimize residual confounding when adjusting for a continuous exposure. Because their implementation with SAS® software is limited, we developed and present here an SAS macro that (i) creates an RCS function of continuous exposures, (ii) displays graphs showing the dose-response association with 95 per cent confidence interval between one main continuous exposure and an outcome when performing linear, logistic, or Cox models, as well as linear and logistic-generalized estimating equations, and (iii) provides statistical tests for overall and non-linear associations. We illustrate the SAS macro using the third National Health and Nutrition Examination Survey data to investigate adjusted dose-response associations (with different models) between calcium intake and bone mineral density (linear regression), folate intake and hyperhomocysteinemia (logistic regression), and serum high-density lipoprotein cholesterol and cardiovascular mortality (Cox model).

1,185 citations


Journal ArticleDOI
TL;DR: A novel approach to face identification by formulating the pattern recognition problem in terms of linear regression, using a fundamental concept that patterns from a single-object class lie on a linear subspace, and introducing a novel Distance-based Evidence Fusion (DEF) algorithm.
Abstract: In this paper, we present a novel approach to face identification by formulating the pattern recognition problem in terms of linear regression. Using a fundamental concept that patterns from a single-object class lie on a linear subspace, we develop a linear model representing a probe image as a linear combination of class-specific galleries. The inverse problem is solved using the least-squares method and the decision is ruled in favor of the class with the minimum reconstruction error. The proposed Linear Regression Classification (LRC) algorithm falls in the category of nearest subspace classification. The algorithm is extensively evaluated on several standard databases under a number of exemplary evaluation protocols reported in the face recognition literature. A comparative study with state-of-the-art algorithms clearly reflects the efficacy of the proposed approach. For the problem of contiguous occlusion, we propose a Modular LRC approach, introducing a novel Distance-based Evidence Fusion (DEF) algorithm. The proposed methodology achieves the best results ever reported for the challenging problem of scarf occlusion.
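The following sketch implements the nearest-subspace decision rule described above: reconstruct the probe by least squares from each class-specific gallery and classify to the class with minimum reconstruction error. Random vectors stand in for downsampled face images, and the Modular LRC/DEF extension is not shown.

```python
import numpy as np

def lrc_predict(probe, galleries):
    """Linear Regression Classification: represent the probe as a linear
    combination of each class's gallery images and pick the class with the
    smallest reconstruction error (nearest-subspace rule)."""
    best_class, best_err = None, np.inf
    for label, X in galleries.items():            # X: (n_pixels, n_images) per class
        beta, *_ = np.linalg.lstsq(X, probe, rcond=None)
        err = np.linalg.norm(probe - X @ beta)    # reconstruction error
        if err < best_err:
            best_class, best_err = label, err
    return best_class

# Toy stand-in for downsampled face images (columns = gallery images of one person).
rng = np.random.default_rng(2)
galleries = {c: rng.normal(size=(100, 8)) for c in ["person_A", "person_B"]}
probe = galleries["person_A"] @ rng.normal(size=8)    # lies in person_A's subspace
print(lrc_predict(probe, galleries))                  # -> person_A
```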

972 citations


Journal ArticleDOI
TL;DR: Altmann et al. as discussed by the authors introduced a heuristic for normalizing feature importance measures that can correct the feature importance bias, based on repeated permutations of the outcome vector for estimating the distribution of measured importance for each variable in a non-informative setting.
Abstract: Motivation: In life sciences, interpretability of machine learning models is as important as their prediction accuracy. Linear models are probably the most frequently used methods for assessing feature relevance, despite their relative inflexibility. However, in the past years effective estimators of feature relevance have been derived for highly complex or non-parametric models such as support vector machines and RandomForest (RF) models. Recently, it has been observed that RF models are biased in such a way that categorical variables with a large number of categories are preferred. Results: In this work, we introduce a heuristic for normalizing feature importance measures that can correct the feature importance bias. The method is based on repeated permutations of the outcome vector for estimating the distribution of measured importance for each variable in a non-informative setting. The P-value of the observed importance provides a corrected measure of feature importance. We apply our method to simulated data and demonstrate that (i) non-informative predictors do not receive significant P-values, (ii) informative variables can successfully be recovered among non-informative variables and (iii) P-values computed with permutation importance (PIMP) are very helpful for deciding the significance of variables, and therefore improve model interpretability. Furthermore, PIMP was used to correct RF-based importance measures for two real-world case studies. We propose an improved RF model that uses the significant variables with respect to the PIMP measure and show that its prediction accuracy is superior to that of other existing models. Availability: R code for the method presented in this article is available at http://www.mpi-inf.mpg.de/~altmann/download/PIMP.R Contact: altmann@mpi-inf.mpg.de, laura.tolosi@mpi-inf.mpg.de Supplementary information: Supplementary data are available at Bioinformatics online.
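A minimal sketch of the permutation-of-outcome idea behind PIMP: refit the random forest on permuted outcomes to build a null distribution of each feature's importance, then report the resulting P-value. This is an illustrative reimplementation with simulated data and small settings for speed, not the authors' R code linked above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] + rng.normal(size=n)      # only feature 0 is informative

def importances(X, y):
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    return rf.feature_importances_

observed = importances(X, y)

# Null distribution: importance when the outcome vector is randomly permuted.
S = 50
null = np.array([importances(X, rng.permutation(y)) for _ in range(S)])

# PIMP-style P-value: fraction of permuted importances >= the observed one.
p_values = (1 + (null >= observed).sum(axis=0)) / (S + 1)
print(np.round(p_values, 3))   # small only for the informative feature
```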

925 citations


Journal ArticleDOI
TL;DR: It is argued in general that mixed models involve unverifiable assumptions on the data-generating distribution, which lead to potentially misleading estimates and biased inference, and it is concluded that the estimation-equation approach of population average models provides a more useful approximation of the truth.
Abstract: Two modeling approaches are commonly used to estimate the associations between neighborhood characteristics and individual-level health outcomes in multilevel studies (subjects within neighborhoods). Random effects models (or mixed models) use maximum likelihood estimation. Population average models typically use a generalized estimating equation (GEE) approach. These methods are used in place of basic regression approaches because the health of residents in the same neighborhood may be correlated, thus violating independence assumptions made by traditional regression procedures. This violation is particularly relevant to estimates of the variability of estimates. Though the literature appears to favor the mixed-model approach, little theoretical guidance has been offered to justify this choice. In this paper, we review the assumptions behind the estimates and inference provided by these 2 approaches. We propose a perspective that treats regression models for what they are in most circumstances: reasonable approximations of some true underlying relationship. We argue in general that mixed models involve unverifiable assumptions on the data-generating distribution, which lead to potentially misleading estimates and biased inference. We conclude that the estimation-equation approach of population average models provides a more useful approximation of the truth.
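For readers who want to see what the recommended population-average analysis looks like in practice, below is a hedged sketch using a logistic GEE with an exchangeable working correlation on simulated neighborhood-clustered data; the data frame, column names, and effect sizes are all hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical multilevel data: subjects nested in neighborhoods.
rng = np.random.default_rng(4)
n_hoods, n_per = 40, 25
hood = np.repeat(np.arange(n_hoods), n_per)
deprivation = np.repeat(rng.normal(size=n_hoods), n_per)    # neighborhood exposure
u = np.repeat(rng.normal(scale=0.5, size=n_hoods), n_per)   # within-neighborhood correlation
prob = 1 / (1 + np.exp(-(-1 + 0.6 * deprivation + u)))
poor_health = rng.binomial(1, prob)
df = pd.DataFrame({"poor_health": poor_health, "deprivation": deprivation, "hood": hood})

# Population-average model: logistic GEE with an exchangeable working correlation.
gee = sm.GEE.from_formula("poor_health ~ deprivation", groups="hood", data=df,
                          family=sm.families.Binomial(),
                          cov_struct=sm.cov_struct.Exchangeable()).fit()
print(gee.summary())
```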

906 citations


Journal ArticleDOI
TL;DR: It is shown that problems can be overcome in most cases occurring in practice by replacing the approximate normal within-study likelihood by the appropriate exact likelihood, which leads to a generalized linear mixed model that can be fitted in standard statistical software.
Abstract: We consider random effects meta-analysis where the outcome variable is the occurrence of some event of interest. The data structures handled are where one has one or more groups in each study, and in each group either the number of subjects with and without the event, or the number of events and the total duration of follow-up is available. Traditionally, the meta-analysis follows the summary measures approach based on the estimates of the outcome measure(s) and the corresponding standard error(s). This approach assumes an approximate normal within-study likelihood and treats the standard errors as known. This approach has several potential disadvantages, such as not accounting for the standard errors being estimated, not accounting for correlation between the estimate and the standard error, the use of an (arbitrary) continuity correction in case of zero events, and the normal approximation being bad in studies with few events. We show that these problems can be overcome in most cases occurring in practice by replacing the approximate normal within-study likelihood by the appropriate exact likelihood. This leads to a generalized linear mixed model that can be fitted in standard statistical software. For instance, in the case of odds ratio meta-analysis, one can use the non-central hypergeometric distribution likelihood leading to mixed-effects conditional logistic regression. For incidence rate ratio meta-analysis, it leads to random effects logistic regression with an offset variable. We also present bivariate and multivariate extensions. We present a number of examples, especially with rare events, among which an example of network meta-analysis.

492 citations


Journal ArticleDOI
TL;DR: In this paper, the effects of placing an absolutely continuous prior distribution on the regression coefficients of a linear model was considered and it was shown that the posterior expectation is a matrix-shrunken version of the least squares estimate where the shrinkage matrix depends on the derivatives of the prior predictive density.
Abstract: This paper considers the effects of placing an absolutely continuous prior distribution on the regression coefficients of a linear model. We show that the posterior expectation is a matrix-shrunken version of the least squares estimate where the shrinkage matrix depends on the derivatives of the prior predictive density of the least squares estimate. The special case of the normal-gamma prior, which generalizes the Bayesian Lasso (Park and Casella 2008), is studied in depth. We discuss the prior interpretation and the posterior effects of hyperparameter choice and suggest a data-dependent default prior. Simulations and a chemometric example are used to compare the performance of the normal-gamma and the Bayesian Lasso in terms of out-of-sample predictive performance.
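The matrix-shrinkage idea is easiest to see in the simplest conjugate case. The sketch below uses an ordinary normal (ridge-type) prior rather than the normal-gamma prior studied in the paper, and shows the posterior mean as a shrinkage matrix applied to the least-squares estimate; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
sigma2 = 1.0
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

XtX = X.T @ X
beta_ls = np.linalg.solve(XtX, X.T @ y)          # least-squares estimate

# Normal prior beta ~ N(0, tau2 * I): the posterior mean is a matrix-shrunken
# version of the least-squares estimate (ridge regression).
tau2 = 0.5
S = np.linalg.solve(XtX + (sigma2 / tau2) * np.eye(p), XtX)   # shrinkage matrix
beta_post = S @ beta_ls

print(np.round(beta_ls, 2))
print(np.round(beta_post, 2))    # shrunk toward zero
```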

490 citations


Book
19 Nov 2010
TL;DR: This book introduces Bayesian statistics and modern model estimation via Gibbs and Metropolis-Hastings sampling, covers the evaluation of Markov chain Monte Carlo algorithms and model fit, and treats linear, generalized linear, hierarchical, and multivariate regression models.
Abstract: Probability Theory and Classical Statistics.- Basics of Bayesian Statistics.- Modern Model Estimation Part 1: Gibbs Sampling.- Modern Model Estimation Part 2: Metropolis-Hastings Sampling.- Evaluating Markov Chain Monte Carlo Algorithms and Model Fit.- The Linear Regression Model.- Generalized Linear Models.- Introduction to Hierarchical Models.- Introduction to Multivariate Regression Models.- Conclusion.
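As a taste of the estimation machinery covered in the Gibbs sampling and linear regression chapters, here is a minimal Gibbs sampler for the normal linear regression model with conjugate priors; the priors, data, and tuning choices are illustrative assumptions, not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# Gibbs sampling for y = X beta + e with beta ~ N(0, 100 I) and sigma^2 ~ IG(a0, b0).
a0, b0, B0_inv = 0.01, 0.01, np.eye(p) / 100.0
beta, sigma2 = np.zeros(p), 1.0
draws = []
for it in range(5000):
    # beta | sigma2, y ~ N(m, V)
    V = np.linalg.inv(X.T @ X / sigma2 + B0_inv)
    m = V @ (X.T @ y / sigma2)
    beta = rng.multivariate_normal(m, V)
    # sigma2 | beta, y ~ Inverse-Gamma(a0 + n/2, b0 + RSS/2)
    resid = y - X @ beta
    sigma2 = 1.0 / rng.gamma(a0 + n / 2, 1.0 / (b0 + resid @ resid / 2))
    if it >= 1000:                       # discard burn-in draws
        draws.append(np.append(beta, sigma2))

print(np.mean(draws, axis=0))            # posterior means near (1, 2, -0.5, 2.25)
```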

Journal ArticleDOI
TL;DR: Song et al. as discussed by the authors proposed a more general version of independent learning which ranks the maximum marginal likelihood estimates in generalized linear models and showed that the proposed methods also possess the sure screening property with vanishing false selection rate, which justifies the applicability of such a simple method in a wide spectrum.
Abstract: Ultrahigh dimensional variable selection plays an increasingly important role in contemporary scientific discoveries and statistical research. Among others, Fan and Lv (2008) propose an independent screening framework by ranking the marginal correlations. They showed that the correlation ranking procedure possesses a sure independence screening property within the context of the linear model with Gaussian covariates and responses. In this paper, we propose a more general version of independent learning which ranks the maximum marginal likelihood estimates or the maximum marginal likelihood itself in generalized linear models. We show that the proposed methods, with Fan and Lv (2008) as a very special case, also possess the sure screening property with vanishing false selection rate. The conditions under which the independence learning possesses the sure screening property are surprisingly simple. This justifies the applicability of such a simple method in a wide spectrum. We quantify explicitly the extent to which the dimensionality can be reduced by independence screening, which depends on the interactions of the covariance matrix of covariates and true parameters. Simulation studies are used to illustrate the utility of the proposed approaches.
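The sketch below illustrates the simplest member of this family, marginal correlation ranking (the Fan and Lv special case); for generalized linear models one would rank marginal maximum likelihood estimates or marginal likelihoods instead, as the paper proposes. The data and the screened model size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 2000                       # ultrahigh-dimensional: p >> n
X = rng.normal(size=(n, p))
y = 3 * X[:, 5] - 2 * X[:, 42] + rng.normal(size=n)

# Marginal screening: rank predictors by |marginal correlation| with y.
# (For generalized linear models, rank marginal MLEs / marginal likelihoods instead.)
Xs = (X - X.mean(0)) / X.std(0)
ys = (y - y.mean()) / y.std()
marginal = np.abs(Xs.T @ ys) / n

d = int(n / np.log(n))                 # one common choice of screened model size
keep = np.argsort(marginal)[::-1][:d]
print(d, sorted(keep[:5]), {5, 42} <= set(keep))   # true signals survive screening
```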

Journal ArticleDOI
TL;DR: This paper demonstrates how SPM can be used to analyze both experimental and simulated biomechanical field data of arbitrary spatiotemporal dimensionality and suggests that SPM may be suitable for a wide variety of mechanical field applications.

Journal ArticleDOI
TL;DR: In this article, a flexible parametric family of matrix-valued covariance functions for multivariate spatial random fields, where each constituent component is a Matérn process, is introduced, with model parameters interpretable in terms of process variance, smoothness, correlation length, and colocated correlation coefficients.
Abstract: We introduce a flexible parametric family of matrix-valued covariance functions for multivariate spatial random fields, where each constituent component is a Matérn process. The model parameters are interpretable in terms of process variance, smoothness, correlation length, and colocated correlation coefficients, which can be positive or negative. Both the marginal and the cross-covariance functions are of the Matérn type. In a data example on error fields for numerical predictions of surface pressure and temperature over the North American Pacific Northwest, we compare the bivariate Matérn model to the traditional linear model of coregionalization.
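For reference, the sketch below evaluates the univariate Matérn covariance that each marginal and cross-covariance in the proposed family takes; the bivariate model additionally involves colocated correlation coefficients and validity constraints on the parameters, which are not reproduced here.

```python
import numpy as np
from scipy.special import gamma, kv

def matern(h, sigma2=1.0, nu=1.5, rho=1.0):
    """Matern covariance C(h) with variance sigma2, smoothness nu, and range rho."""
    h = np.asarray(h, dtype=float)
    c = np.full(h.shape, sigma2)                    # C(0) = sigma2
    pos = h > 0
    u = np.sqrt(2 * nu) * h[pos] / rho
    c[pos] = sigma2 * (2 ** (1 - nu) / gamma(nu)) * u ** nu * kv(nu, u)
    return c

h = np.linspace(0.0, 3.0, 7)
print(np.round(matern(h, sigma2=2.0, nu=0.5, rho=1.0), 3))   # nu = 0.5 gives the exponential model
# In the bivariate Matern family of the paper, C_11, C_22, and the cross-covariance C_12
# are all of this form, with C_12 scaled by a colocated correlation coefficient.
```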

Book
02 Mar 2010
TL;DR: Introduction What is measurement error?
Abstract: Introduction What is measurement error? Some examples The main ingredients Some terminology A look ahead Misclassification in Estimating a Proportion Motivating examples A model for the true values Misclassification models and naive analyses Correcting for misclassification Finite populations Multiple measures with no direct validation The multinomial case Mathematical developments Misclassification in Two-Way Tables Introduction Models for true values Misclassification models and naive estimators Behavior of naive analyses Correcting using external validation data Correcting using internal validation data General two-way tables Mathematical developments Simple Linear Regression Introduction The additive Berkson model and consequences The additive measurement error model The behavior of naive analyses Correcting for additive measurement error Examples Residual analysis Prediction Mathematical developments Multiple Linear Regression Introduction Model for true values Models and bias in naive estimators Correcting for measurement error Weighted and other estimators Examples Instrumental variables Mathematical developments Measurement Error in Regression: A General Overview Introduction Models for true values Analyses without measurement error Measurement error models Extra data Assessing bias in naive estimators Assessing bias using induced models Assessing bias via estimating equations Moment based and direct bias corrections Regression calibration and quasi-likelihood methods Simulation extrapolation (SIMEX) Correcting using likelihood methods Modified estimating equation approaches Correcting for misclassification Overview on use of validation data Bootstrapping Mathematical developments Binary Regression Introduction Additive measurement error Using validation data Misclassification of predictors Linear Models with Nonadditive Error Introduction Quadratic regression First-order models with interaction General nonlinear functions of the predictors Linear measurement error with validation data Misclassification of a categorical predictor Miscellaneous Nonlinear Regression Poisson regression: Cigarettes and cancer rates General nonlinear models Error in the Response Introduction Additive error in a single sample Linear measurement error in the one-way setting Measurement error in the response in linear models Mixed/Longitudinal Models Introduction, overview, and some examples Berkson error in designed repeated measures Additive error in the linear mixed model Time Series Introduction Random walk/population viability models Linear autoregressive models Background Material Notation for vectors, covariance matrices, etc. Double expectations Approximate Wald inferences The delta-method: approximate moments of nonlinear functions Fieller's method for ratios References Author Index Subject Index
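A small numerical illustration of the book's starting point, classical additive measurement error in simple linear regression: the naive slope is attenuated by the reliability ratio, and a moment-based correction recovers the true slope when the error variance is known. All quantities and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 2000
x_true = rng.normal(size=n)                  # unobserved true covariate
sigma_u2 = 0.5
w = x_true + rng.normal(scale=np.sqrt(sigma_u2), size=n)   # observed proxy with error
y = 1.0 + 2.0 * x_true + rng.normal(size=n)

def slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

naive = slope(w, y)                           # attenuated toward zero
lam = (np.var(w, ddof=1) - sigma_u2) / np.var(w, ddof=1)    # reliability ratio
corrected = naive / lam                       # moment-based bias correction
print(round(naive, 2), round(corrected, 2))   # roughly 1.33 vs 2.0
```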

Journal ArticleDOI
TL;DR: The performance and interpretation of linear regression analysis are subject to a variety of pitfalls, which are discussed here in detail.
Abstract: SUMMARY Background: Regression analysis is an important statistical method for the analysis of medical data. It enables the identification and characterization of relationships among multiple factors. It also enables the identification of prognostically relevant risk factors and the calculation of risk scores for individual prognostication. Methods: This article is based on selected textbooks of statistics, a selective review of the literature, and our own experience. Results: After a brief introduction of the uni- and multivariable regression models, illustrative examples are given to explain what the important considerations are before a regression analysis is performed, and how the results should be interpreted. The reader should then be able to judge whether the method has been used correctly and interpret the results appropriately. Conclusion: The performance and interpretation of linear regression analysis are subject to a variety of pitfalls, which are discussed here in detail. The reader is made aware of common errors of interpretation through practical examples. Both the opportunities for applying linear regression analysis and its limitations are presented.

Journal ArticleDOI
TL;DR: The authors show that adding a spatially-correlated error term to a linear model is equivalent to adding a saturated collection of canonical regressors, the coefficients of which are shrunk toward zero.
Abstract: Many statisticians have had the experience of fitting a linear model with uncorrelated errors, then adding a spatially-correlated error term (random effect) and finding that the estimates of the fixed-effect coefficients have changed substantially. We show that adding a spatially-correlated error term to a linear model is equivalent to adding a saturated collection of canonical regressors, the coefficients of which are shrunk toward zero, where the spatial map determines both the canonical regressors and the relative extent of the coefficients’ shrinkage. Adding a spatially-correlated error term can also be seen as inflating the error variances associated with specific contrasts of the data, where the spatial map determines the contrasts and the extent of error-variance inflation. We show how to avoid this spatial confounding by restricting the spatial random effect to the orthogonal complement (residual space) of the fixed effects, which we call restricted spatial regression. We consider five proposed in...
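The key projection step of restricted spatial regression can be written in a few lines: project the spatial basis onto the orthogonal complement (residual space) of the fixed-effects design matrix. The sketch below shows only this step, with an assumed exponential spatial covariance standing in for the spatial map; fitting the full model (variance components, inference) is omitted.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # fixed-effects design matrix

# Spatial basis: eigenvectors of a spatial covariance matrix built from coordinates.
coords = rng.uniform(size=(n, 2))
D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
Sigma = np.exp(-D / 0.3)                                  # assumed exponential covariance
eigvals, M = np.linalg.eigh(Sigma)                        # columns of M span the random effect

# Restricted spatial regression: project the spatial basis onto the residual
# space of X, so the random effect cannot compete with the fixed effects.
P_res = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)     # I - X (X'X)^{-1} X'
M_restricted = P_res @ M

print(np.abs(X.T @ M_restricted).max())   # ~0: restricted basis is orthogonal to X
```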

Journal ArticleDOI
TL;DR: In this paper, the authors compared five different non-parametric and machine learning methods, i.e., multiple linear regression (MLR), classification and regression trees (CART), boosted regression tree (BRT), generalized additive models (GAM), and artificial neural networks (ANN), to model site index of homogeneous stands of three important tree species of the Taurus Mountains (Turkey).

Book
09 Aug 2010
TL;DR: This chapter discusses the need for Statistics in Experimental Planning and Analysis, and some Basic Properties of a Distribution (Mean, Variance and Standard Deviation) and the importance of Relationships between Two or More Variables.
Abstract: Preface. Acknowledgements. 1 Introduction. 1.1 The Distinction between Trained Sensory Panels and Consumer Panels. 1.2 The Need for Statistics in Experimental Planning and Analysis. 1.3 Scales and Data Types. 1.4 Organisation of the Book. 2 Important Data Collection Techniques for Sensory and Consumer Studies. 2.1 Sensory Panel Methodologies. 2.2 Consumer Tests. PART I PROBLEM DRIVEN. 3 Quality Control of Sensory Profile Data. 3.1 General Introduction. 3.2 Visual Inspection of Raw Data. 3.3 Mixed Model ANOVA for Assessing the Importance of the Sensory Attributes. 3.4 Overall Assessment of Assessor Differences Using All Variables Simultaneously. 3.5 Methods for Detecting Differences in Use of the Scale. 3.6 Comparing the Assessors Ability to Detect Differences between the Products. 3.7 Relations between Individual Assessor Ratings and the Panel Average. 3.8 Individual Line Plots for Detailed Inspection of Assessors. 3.9 Miscellaneous Methods.- 4 Correction Methods and Other Remedies for Improving Sensory Profile Data. 4.1 Introduction. 4.2 Correcting for Different Use of the Scale. 4.3 Computing Improved Panel Averages. 4.4 Pre-processing of Data for Three-Way Analysis. 5 Detecting and Studying Sensory Differences and Similarities between Products. 5.1 Introduction. 5.2 Analysing Sensory Profile Data: Univariate Case. 5.3 Analysing Sensory Profile Data: Multivariate Case. 6 Relating Sensory Data to Other Measurements. 6.1 Introduction. 6.2 Estimating Relations between Consensus Profiles and External Data. 6.3 Estimating Relations between Individual Sensory Profiles and External Data. 7 Discrimination and Similarity Testing. 7.1 Introduction. 7.2 Analysis of Data from Basic Sensory Discrimination Tests. 7.3 Examples of Basic Discrimination Testing. 7.4 Power Calculations in Discrimination Tests. 7.5 Thurstonian Modelling: What Is It Really? 7.6 Similarity versus Difference Testing. 7.7 Replications: What to Do? 7.8 Designed Experiments, Extended Analysis and Other Test Protocols. 8 Investigating Important Factors Influencing Food Acceptance and Choice. 8.1 Introduction. 8.2 Preliminary Analysis of Consumer Data Sets (Raw Data Overview). 8.3 Experimental Designs for Rating Based Consumer Studies. 8.4 Analysis of Categorical Effect Variables. 8.5 Incorporating Additional Information about Consumers. 8.6 Modelling of Factors as Continuous Variables. 8.7 Reliability/Validity Testing for Rating Based Methods. 8.8 Rank Based Methodology. 8.9 Choice Based Conjoint Analysis. 8.10 Market Share Simulation. 9 Preference Mapping for Understanding Relations between Sensory Product Attributes and Consumer Acceptance. 9.1 Introduction. 9.2 External and Internal Preference Mapping. 9.3 Examples of Linear Preference Mapping. 9.4 Ideal Point Preference Mapping. 9.5 Selecting Samples for Preference Mapping. 9.6 Incorporating Additional Consumer Attributes. 9.7 Combining Preference Mapping with Additional Information about the Samples. 10 Segmentation of Consumer Data. 10.1 Introduction. 10.2 Segmentation of Rating Data. 10.3 Relating Segments to Consumer Attributes. PART II METHOD ORIENTED. 11 Basic Statistics. 11.1 Basic Concepts and Principles. 11.2 Histogram, Frequency and Probability. 11.3 Some Basic Properties of a Distribution (Mean, Variance and Standard Deviation). 11.4 Hypothesis Testing and Confidence Intervals for the Mean . 11.5 Statistical Process Control. 11.6 Relationships between Two or More Variables. 11.7 Simple Linear Regression. 11.8 Binomial Distribution and Tests. 
11.9 Contingency Tables and Homogeneity Testing. 12 Design of Experiments for Sensory and Consumer Data. 12.1 Introduction. 12.2 Important Concepts and Distinctions. 12.3 Full Factorial Designs. 12.4 Fractional Factorial Designs: Screening Designs. 12.5 Randomised Blocks and Incomplete Block Designs. 12.6 Split-Plot and Nested Designs. 12.7 Power of Experiments. 13 ANOVA for Sensory and Consumer Data. 13.1 Introduction. 13.2 One-Way ANOVA. 13.3 Single Replicate Two-Way ANOVA. 13.4 Two-Way ANOVA with Randomised Replications. 13.5 Multi-Way ANOVA. 13.6 ANOVA for Fractional Factorial Designs. 13.7 Fixed and Random Effects in ANOVA: Mixed Models. 13.8 Nested and Split-Plot Models. 13.9 Post Hoc Testing. 14 Principal Component Analysis. 14.1 Interpretation of Complex Data Sets by PCA. 14.2 Data Structures for the PCA. 14.3 PCA: Description of the Method. 14.4 Projections and Linear Combinations. 14.5 The Scores and Loadings Plots. 14.6 Correlation Loadings Plot. 14.7 Standardisation. 14.8 Calculations and Missing Values. 14.9 Validation. 14.10 Outlier Diagnostics. 14.11 Tucker-1. 14.12 The Relation between PCA and Factor Analysis (FA). 15 Multiple Regression, Principal Components Regression and Partial Least Squares Regression. 15.1 Introduction. 15.2 Multivariate Linear Regression. 15.3 The Relation between ANOVA and Regression Analysis. 15.4 Linear Regression Used for Estimating Polynomial Models. 15.5 Combining Continuous and Categorical Variables. 15.6 Variable Selection for Multiple Linear Regression. 15.7 Principal Components Regression (PCR). 15.8 Partial Least Squares (PLS) Regression. 15.9 Model Validation: Prediction Performance. 15.10 Model Diagnostics and Outlier Detection. 15.11 Discriminant Analysis. 15.12 Generalised Linear Models, Logistic Regression and Multinomial Regression. 16 Cluster Analysis: Unsupervised Classification. 16.1 Introduction. 16.2 Hierarchical Clustering. 16.3 Partitioning Methods. 16.4 Cluster Analysis for Matrices. 17 Miscellaneous Methodologies. 17.1 Three-Way Analysis of Sensory Data. 17.2 Relating Three-Way Data to Two-Way Data. 17.3 Path Modelling. 17.4 MDS-Multidimensional Scaling. 17.5 Analysing Rank Data. 17.6 The L-PLS Method. 17.7 Missing Value Estimation. Nomenclature, Symbols and Abbreviations. Index.

Book ChapterDOI
01 Jan 2010
TL;DR: In this article, the authors studied the effect of variance-stabilizing transformations on the error structure of a Gaussian model, and showed that a transformation of the problem may help to correct some departure from the standard model assumptions.
Abstract: In previous chapters, we have studied the model $y = A\beta + \epsilon$, where the mean $Ey = A\beta$ depends linearly on the parameters $\beta$, the errors are normal (Gaussian), and the errors are additive. We have also seen (Chapter 7) that in some situations, a transformation of the problem may help to correct some departure from our standard model assumptions. For example, in §7.3 on variance-stabilising transformations, we transformed our data from $y$ to some function $g(y)$, to make the variance constant (at least approximately). We did not there address the effect on the error structure of so doing. Of course, $g(y) = g(A\beta + \epsilon)$ as above will not have an additive Gaussian error structure any more, even approximately, in general.
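A quick numerical check of the chapter's point, under the assumption of Poisson counts and a square-root transformation: the transformed variance is approximately constant (delta method), but the transformed errors are only approximately additive Gaussian, especially for small means.

```python
import numpy as np

rng = np.random.default_rng(10)
means = np.array([2.0, 5.0, 10.0, 20.0, 50.0])

for mu in means:
    y = rng.poisson(mu, size=20000)
    g = np.sqrt(y)                       # variance-stabilising transformation
    # Delta method: Var g(y) ~ g'(mu)^2 Var(y) = (1 / (2 sqrt(mu)))^2 * mu = 1/4
    print(f"mu={mu:5.1f}  var(y)={y.var():6.2f}  var(sqrt y)={g.var():.3f}")

# The transformed variance sits near 0.25 across means, but sqrt(y) is skewed for
# small mu, so the error structure is only approximately additive Gaussian.
```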

Journal ArticleDOI
TL;DR: Several common applications of variation partitioning in ecology now appear inappropriate, and the appropriate uses of these analyses in research programmes are clarified, and potential steps to improve them are outlined.
Abstract: Summary 1. Statistical tests partitioning community variation into environmental and spatial components have been widely used to test ecological theories and explore the determinants of community structure for applied conservation questions. Despite the wide use of these tests, there is considerable debate about their relative effectiveness. 2. We used simulated communities to evaluate the most commonly employed tests that partition community variation: regression on distance matrices and canonical ordination using a third-order polynomial, principal coordinates of neighbour matrices (PCNM) or Moran’s eigenvector maps (MEM) to model spatial components. Each test was evaluated under a variety of realistic sampling scenarios. 3. All tests failed to correctly model spatial and environmental components of variation, and in some cases produced biased estimates of the relative importance of components. Regression on distance matrices under-fit the spatial component, and ordination models consistently under-fit the environmental component. The PCNM and MEM approaches often produced inflated R2 statistics, apparently as a result of statistical artefacts involving selection of superfluous axes. This problem occurred regardless of the forward-selection technique used. 4. Both sample configuration and the underlying linear model used to analyse species–environment relationships also revealed strong potential to bias results. 5. Synthesis and applications. Several common applications of variation partitioning in ecology now appear inappropriate. These potentially include decisions for community conservation based on inferred relative strengths of niche and dispersal processes, inferred community responses to climate change, and numerous additional analyses that depend on precise results from multivariate variation-partitioning techniques. We clarify the appropriate uses of these analyses in research programmes, and outline potential steps to improve them.
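For orientation, the snippet below shows the basic variation-partitioning arithmetic (adjusted R-squared fractions for the pure environmental, shared, and pure spatial components) that the evaluated tests build on; the PCNM/MEM basis construction and forward selection, where the paper locates the problems, are not reproduced, and all data are simulated placeholders.

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted R-squared of an OLS fit with intercept (Ezekiel adjustment)."""
    n, p = X.shape
    Xc = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    r2 = 1 - (y - Xc @ beta).var() / y.var()
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(11)
n = 200
env = rng.normal(size=(n, 2))                   # environmental predictors
space = rng.normal(size=(n, 3))                 # spatial predictors (e.g., MEM axes)
y = env @ np.array([1.0, 0.5]) + 0.8 * space[:, 0] + rng.normal(size=n)

ab = adj_r2(env, y)                              # fractions [a] + [b]
bc = adj_r2(space, y)                            # fractions [b] + [c]
abc = adj_r2(np.column_stack([env, space]), y)   # fractions [a] + [b] + [c]
pure_env, pure_space = abc - bc, abc - ab
shared = ab + bc - abc
print(round(pure_env, 2), round(shared, 2), round(pure_space, 2))
```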

Journal ArticleDOI
TL;DR: It is concluded that Bayesian inference is now practically feasible for GLMMs and provides an attractive alternative to likelihood-based approaches such as penalized quasi-likelihood.
Abstract: Generalized linear mixed models (GLMMs) continue to grow in popularity due to their ability to directly acknowledge multiple levels of dependency and model different data types. For small sample sizes especially, likelihood-based inference can be unreliable with variance components being particularly difficult to estimate. A Bayesian approach is appealing but has been hampered by the lack of a fast implementation, and the difficulty in specifying prior distributions with variance components again being particularly problematic. Here, we briefly review previous approaches to computation in Bayesian implementations of GLMMs and illustrate in detail, the use of integrated nested Laplace approximations in this context. We consider a number of examples, carefully specifying prior distributions on meaningful quantities in each case. The examples cover a wide range of data types including those requiring smoothing over time and a relatively complicated spline model for which we examine our prior specification in terms of the implied degrees of freedom. We conclude that Bayesian inference is now practically feasible for GLMMs and provides an attractive alternative to likelihood-based approaches such as penalized quasi-likelihood. As with likelihood-based approaches, great care is required in the analysis of clustered binary data since approximation strategies may be less accurate for such data.

Book
27 Oct 2010
TL;DR: A range of methods and tools to design observers for nonlinear systems represented by a special type of dynamic nonlinear model, the Takagi-Sugeno (TS) fuzzy model, are provided.
Abstract: Many problems in decision making, monitoring, fault detection, and control require the knowledge of state variables and time-varying parameters that are not directly measured by sensors. In such situations, observers, or estimators, can be employed that use the measured input and output signals along with a dynamic model of the system in order to estimate the unknown states or parameters. An essential requirement in designing an observer is to guarantee the convergence of the estimates to the true values or at least to a small neighborhood around the true values. However, for nonlinear, large-scale, or time-varying systems, the design and tuning of an observer is generally complicated and involves large computational costs. This book provides a range of methods and tools to design observers for nonlinear systems represented by a special type of dynamic nonlinear model, the Takagi-Sugeno (TS) fuzzy model. The TS model is a convex combination of affine linear models, which facilitates its stability analysis and observer design by using effective algorithms based on Lyapunov functions and linear matrix inequalities. Takagi-Sugeno models are known to be universal approximators and, in addition, a broad class of nonlinear systems can be exactly represented as a TS system. Three particular structures of large-scale TS models are considered: cascaded systems, distributed systems, and systems affected by unknown disturbances. The reader will find in-depth theoretic analysis accompanied by illustrative examples and simulations of real-world systems. Stability analysis of TS fuzzy systems is addressed in detail. The intended audience is graduate students and researchers both from academia and industry. For newcomers to the field, the book provides a concise introduction to dynamic TS fuzzy models along with two methods to construct TS models for a given nonlinear system.

Journal ArticleDOI
TL;DR: In this article, a moment-based notion of dependence for functional time series which involves $m$-dependence is introduced, and the impact of dependence on several important statistical procedures for functional data is investigated.
Abstract: Functional data often arise from measurements on fine time grids and are obtained by separating an almost continuous time record into natural consecutive intervals, for example, days. The functions thus obtained form a functional time series, and the central issue in the analysis of such data consists in taking into account the temporal dependence of these functional observations. Examples include daily curves of financial transaction data and daily patterns of geophysical and environmental data. For scalar and vector valued stochastic processes, a large number of dependence notions have been proposed, mostly involving mixing type distances between $\sigma$-algebras. In time series analysis, measures of dependence based on moments have proven most useful (autocovariances and cumulants). We introduce a moment-based notion of dependence for functional time series which involves $m$-dependence. We show that it is applicable to linear as well as nonlinear functional time series. Then we investigate the impact of dependence thus quantified on several important statistical procedures for functional data. We study the estimation of the functional principal components, the long-run covariance matrix, change point detection and the functional linear model. We explain when temporal dependence affects the results obtained for i.i.d. functional observations and when these results are robust to weak dependence.
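One of the procedures studied, estimation of the long-run covariance matrix, can be sketched for score vectors of the functional observations using kernel-weighted autocovariances; the Bartlett weights and bandwidth below are illustrative choices, not the paper's, and the projection onto functional principal components is assumed to have already been done.

```python
import numpy as np

def long_run_cov(scores, q):
    """Bartlett-kernel estimate of the long-run covariance matrix of a
    weakly dependent multivariate time series of score vectors (T x d)."""
    Z = scores - scores.mean(axis=0)
    T = Z.shape[0]
    lrc = Z.T @ Z / T                              # lag-0 autocovariance
    for h in range(1, q + 1):
        w = 1 - h / (q + 1)                        # Bartlett weight
        C_h = Z[h:].T @ Z[:-h] / T                 # lag-h autocovariance
        lrc += w * (C_h + C_h.T)
    return lrc

# Toy example: score vectors of a weakly dependent (here AR(1)) functional time series.
rng = np.random.default_rng(12)
T, d = 500, 3
scores = rng.normal(size=(T, d))
for t in range(1, T):
    scores[t] += 0.5 * scores[t - 1]               # temporal dependence between curves
print(np.round(long_run_cov(scores, q=int(T ** (1 / 3))), 2))
```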

Journal ArticleDOI
TL;DR: Simultaneous accelerometer-GPS monitoring shows promise as a method to improve understanding of how the built environment influences physical activity behaviors by allowing activity to be quantified in a range of physical contexts and thereby provide a more explicit link between physical activity outcomes and built environment exposures.

Posted Content
TL;DR: In this article, the authors discuss the pitfalls involved in using reduced chi-squared for model assessment, model comparison, convergence diagnostic, and error estimation in astronomy, and recommend more sophisticated and reliable methods, which are also applicable to nonlinear models.
Abstract: Reduced chi-squared is a very popular method for model assessment, model comparison, convergence diagnostic, and error estimation in astronomy. In this manuscript, we discuss the pitfalls involved in using reduced chi-squared. There are two independent problems: (a) The number of degrees of freedom can only be estimated for linear models. Concerning nonlinear models, the number of degrees of freedom is unknown, i.e., it is not possible to compute the value of reduced chi-squared. (b) Due to random noise in the data, the value of reduced chi-squared is itself subject to noise, i.e., the value is uncertain. This uncertainty impairs the usefulness of reduced chi-squared for differentiating between models or assessing convergence of a minimisation procedure. The impact of noise on the value of reduced chi-squared is surprisingly large, in particular for small data sets, which are very common in astrophysical problems. We conclude that reduced chi-squared can only be used with due caution for linear models, whereas it must not be used for nonlinear models at all. Finally, we recommend more sophisticated and reliable methods, which are also applicable to nonlinear models.
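A short simulation of point (b): even when a linear model is correct and the error bars are known, the reduced chi-squared value scatters around 1 with standard deviation of roughly sqrt(2/dof), which is substantial for small data sets. The model and sample sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(13)

def reduced_chi2(n_points, n_params=2, n_sim=5000):
    """Simulate reduced chi-squared for a correct straight-line model with unit errors."""
    dof = n_points - n_params
    x = np.linspace(0, 1, n_points)
    chi2 = []
    for _ in range(n_sim):
        y = 1.0 + 2.0 * x + rng.normal(size=n_points)       # true sigma = 1
        A = np.column_stack([np.ones(n_points), x])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        chi2.append(np.sum((y - A @ beta) ** 2) / dof)
    chi2 = np.array(chi2)
    return chi2.mean(), chi2.std(), np.sqrt(2 / dof)

for n in (10, 30, 100):
    print(n, np.round(reduced_chi2(n), 3))   # observed scatter is close to sqrt(2/dof)
```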

Journal ArticleDOI
TL;DR: A systematic strategy for addressing the challenge of how to build a good enough mixed effects model is suggested and easily implemented practical advice to build mixed effects models is introduced.
Abstract: Mixed effects models have become very popular, especially for the analysis of longitudinal data. One challenge is how to build a good enough mixed effects model. In this paper, we suggest a systematic strategy for addressing this challenge and introduce easily implemented practical advice to build mixed effects models. A general discussion of the scientific strategies motivates the recommended five-step procedure for model fitting. The need to model both the mean structure (the fixed effects) and the covariance structure (the random effects and residual error) creates the fundamental flexibility and complexity. Some very practical recommendations help to conquer the complexity. Centering, scaling, and full-rank coding of all the predictor variables radically improve the chances of convergence, computing speed, and numerical accuracy. Applying computational and assumption diagnostics from univariate linear models to mixed model data greatly helps to detect and solve the related computational problems. The approach helps to fit more general covariance models, a crucial step in selecting a credible covariance model needed for defensible inference. A detailed demonstration of the recommended strategy is based on data from a published study of a randomized trial of a multicomponent intervention to prevent young adolescents' alcohol use. The discussion highlights a need for additional covariance and inference tools for mixed models. The discussion also highlights the need for improving how scientists and statisticians teach and review the process of finding a good enough mixed model.
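Below is a hedged sketch of two of the recommended steps, centring/scaling a predictor and fitting a random-intercept mixed model, using statsmodels on simulated longitudinal data; the variable names echo the adolescent alcohol-use example, but the data, effect sizes, and model are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(14)
n_subj, n_obs = 60, 5
subj = np.repeat(np.arange(n_subj), n_obs)
age = np.tile(np.arange(11, 16), n_subj).astype(float)          # ages 11-15 per subject
u = np.repeat(rng.normal(scale=0.8, size=n_subj), n_obs)        # random intercepts
alcohol_use = 0.5 + 0.3 * (age - age.mean()) + u + rng.normal(scale=0.5, size=n_subj * n_obs)
df = pd.DataFrame({"alcohol_use": alcohol_use, "age": age, "subj": subj})

# Recommended step: centre and scale the predictor to aid convergence and interpretation.
df["age_z"] = (df["age"] - df["age"].mean()) / df["age"].std()

# Random-intercept mixed model: mean structure (fixed effects) plus covariance structure.
model = sm.MixedLM.from_formula("alcohol_use ~ age_z", data=df, groups=df["subj"])
print(model.fit().summary())
```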

Journal ArticleDOI
TL;DR: This method is based on a penalized joint log likelihood with an adaptive penalty for the selection and estimation of both the fixed and random effects and enjoys the Oracle property, in that it asymptotically performs as well as if the true model were known beforehand.
Abstract: It is of great practical interest to simultaneously identify the important predictors that correspond to both the fixed and random effects components in a linear mixed-effects (LME) model. Typical approaches perform selection separately on each of the fixed and random effect components. However, changing the structure of one set of effects can lead to different choices of variables for the other set of effects. We propose simultaneous selection of the fixed and random factors in an LME model using a modified Cholesky decomposition. Our method is based on a penalized joint log likelihood with an adaptive penalty for the selection and estimation of both the fixed and random effects. It performs model selection by allowing fixed effects or standard deviations of random effects to be exactly zero. A constrained expectation-maximization algorithm is then used to obtain the final estimates. It is further shown that the proposed penalized estimator enjoys the Oracle property, in that it asymptotically performs as well as if the true model were known beforehand. We demonstrate the performance of our method based on a simulation study and a real data example.

Journal ArticleDOI
TL;DR: In this paper, a short review of the aggregation problem is followed by an analysis of the specific effect of proximity aggregation on the slope coefficient of a bivariate linear model using data drawn from the Los Angeles Metropolitan region.
Abstract: The problem of ecological correlation is now widely recognized but detailed analyses of the effects of aggregation on correlation and regression coefficients are rare. A short review of the aggregation problem is followed by an analysis of the specific effect of proximity aggregation on the slope coefficient of a bivariate linear model using data drawn from the Los Angeles Metropolitan region. The evidence suggests that changes in the slope coefficient are best related to the manner in which the covariation between the independent and dependent variables changes with increased aggregation.
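The effect described can be reproduced in a small simulation: when the covariation between the variables has a spatially structured component, averaging contiguous units changes how the variables covary relative to how the independent variable varies, and the slope coefficient shifts. This is a stylized illustration with made-up data, not the Los Angeles Metropolitan data analysed in the paper.

```python
import numpy as np

rng = np.random.default_rng(15)
n, block = 1000, 20                       # e.g., tracts aggregated into districts
loc = np.arange(n)

shared = np.sin(loc / 100)                # spatially structured component
x = shared + rng.normal(size=n)           # independent variable
e = 2 * shared + rng.normal(size=n)       # spatially structured 'residual'
y = 1.0 + 0.5 * x + e

def slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Proximity aggregation: average contiguous blocks of units.
xg = x.reshape(-1, block).mean(axis=1)
yg = y.reshape(-1, block).mean(axis=1)

# Aggregation averages out local variation, so the spatially structured
# covariation dominates and the estimated slope changes.
print(round(slope(x, y), 2), round(slope(xg, yg), 2))
```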