
Showing papers on "Proper linear model published in 2007"


Journal ArticleDOI
TL;DR: A brief tutorial introduction to the R package relaimpo, which implements six different metrics for assessing the relative importance of regressors in the linear model, among them averaging over orderings of regressors and a newly proposed metric (Feldman 2005) called pmvd.
Abstract: Relative importance is a topic that has seen a lot of interest in recent years, particularly in applied work. The R package relaimpo implements six different metrics for assessing relative importance of regressors in the linear model, two of which are recommended - averaging over orderings of regressors and a newly proposed metric (Feldman 2005) called pmvd. Apart from delivering the metrics themselves, relaimpo also provides (exploratory) bootstrap confidence intervals. This paper offers a brief tutorial introduction to the package. The methods and relaimpo's functionality are illustrated using the data set swiss that is generally available in R. The paper targets readers who have a basic understanding of multiple linear regression. For the background of more advanced aspects, references are provided.
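
The averaging-over-orderings metric that relaimpo recommends can be sketched outside R. The following Python sketch (helper names are mine, and the brute-force enumeration is only practical for a handful of regressors) decomposes R² by averaging each regressor's sequential contribution over all orderings:

```python
import numpy as np
from itertools import permutations

def r_squared(X, y, cols):
    """R^2 of regressing y on an intercept plus the given columns of X."""
    if not cols:
        return 0.0
    Z = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1.0 - resid @ resid / tss

def lmg(X, y):
    """Relative importance by averaging each regressor's sequential
    R^2 contribution over all orderings (brute force, small p only)."""
    p = X.shape[1]
    imp = np.zeros(p)
    orders = list(permutations(range(p)))
    for order in orders:
        entered = []
        for k in order:
            before = r_squared(X, y, entered)
            entered.append(k)
            imp[k] += r_squared(X, y, entered) - before
    return imp / len(orders)
```

By construction the shares are nonnegative and sum to the full-model R², one of the properties that makes this metric attractive.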

1,908 citations



Book
06 Sep 2007
TL;DR: A book-length treatment of robust regression: it defines robustness and resistance, examines the implications of unusual cases for OLS estimates, surveys robust estimators for the linear model (L-, R-, M-, GM-, S-, generalized S-, and MM-estimators) and compares them, and revisits diagnostics with robust-regression-related methods for detecting outliers.
Abstract: List of Figures List of Tables Series Editor's Introduction Acknowledgments 1. Introduction Defining Robustness Defining Robust Regression A Real-World Example: Coital Frequency of Married Couples in the 1970s 2. Important Background Bias and Consistency Breakdown Point Influence Function Relative Efficiency Measures of Location Measures of Scale M-Estimation Comparing Various Estimates Notes 3. Robustness, Resistance, and Ordinary Least Squares Regression Ordinary Least Squares Regression Implications of Unusual Cases for OLS Estimates and Standard Errors Detecting Problematic Observations in OLS Regression Notes 4. Robust Regression for the Linear Model L-Estimators R-Estimators M-Estimators GM-Estimators S-Estimators Generalized S-Estimators MM-Estimators Comparing the Various Estimators Diagnostics Revisited: Robust Regression-Related Methods for Detecting Outliers Notes 5. Standard Errors for Robust Regression Asymptotic Standard Errors for Robust Regression Estimators Bootstrapped Standard Errors Notes 6. Influential Cases in Generalized Linear Models The Generalized Linear Model Detecting Unusual Cases in Generalized Linear Models Robust Generalized Linear Models Notes 7. Conclusions Appendix: Software Considerations for Robust Regression References Index About the Author
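
The M-estimation machinery covered in the book is commonly computed by iteratively reweighted least squares. A minimal Python sketch of the Huber M-estimator (function name and defaults are illustrative, not taken from the book):

```python
import numpy as np

def huber_irls(X, y, c=1.345, tol=1e-8, max_iter=200):
    """Huber M-estimator for linear regression via iteratively
    reweighted least squares. `c` is the usual tuning constant in
    units of a MAD-based robust residual scale."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # OLS start
    for _ in range(max_iter):
        r = y - X1 @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745
        scale = scale if scale > 0 else 1.0
        u = np.abs(r) / scale
        w = np.where(u <= c, 1.0, c / u)            # Huber weights
        sw = np.sqrt(w)
        beta_new, *_ = np.linalg.lstsq(X1 * sw[:, None], y * sw, rcond=None)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

Unlike OLS, the downweighting bounds the influence of gross outliers on the fitted coefficients, which is the resistance property the book emphasizes.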

322 citations


Journal ArticleDOI
TL;DR: A change point approach based on the segmented regression technique for testing the constancy of the regression parameters in a linear profile data set, demonstrated using data from a calibration application at the National Aeronautics and Space Administration (NASA) Langley Research Center.
Abstract: We propose a change point approach based on the segmented regression technique for testing the constancy of the regression parameters in a linear profile data set. Each sample collected over time in the historical data set consists of several bivariate observations for which a simple linear regression model is appropriate. The change point approach is based on the likelihood ratio test for a change in one or more regression parameters. We compare the performance of this method to that of the most effective Phase I linear profile control chart approaches using a simulation study. The advantages of the change point method over the existing methods are greatly improved detection of sustained step changes in the process parameters and improved diagnostic tools to determine the sources of profile variation and the location(s) of the change point(s). Also, we give an approximation for appropriate thresholds for the test statistic. The use of the change point method is demonstrated using a data set from a calibration application at the National Aeronautics and Space Administration (NASA) Langley Research Center. Copyright © 2006 John Wiley & Sons, Ltd.
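
The likelihood-ratio change point idea can be illustrated with a simplified sketch: concatenate the historical samples and, for each candidate split, fit separate simple linear regressions before and after it. This Python sketch (not the authors' implementation, and it ignores the threshold approximation they derive) scores splits by n·log(RSS0/RSS1):

```python
import numpy as np

def _rss(x, y):
    """Residual sum of squares of a simple linear regression fit."""
    Z = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return r @ r

def change_point_lrt(samples):
    """samples: list of (x, y) pairs collected over time, each fit by
    a simple linear regression. Returns (tau, stat): the split 'first
    tau samples vs the rest' maximizing n*log(RSS_null/RSS_split)."""
    xs = np.concatenate([x for x, _ in samples])
    ys = np.concatenate([y for _, y in samples])
    n = len(ys)
    rss_null = _rss(xs, ys)
    cuts = np.cumsum([len(y) for _, y in samples])
    best_tau, best_stat = None, -np.inf
    for tau in range(1, len(samples)):
        c = cuts[tau - 1]
        rss_split = _rss(xs[:c], ys[:c]) + _rss(xs[c:], ys[c:])
        stat = n * np.log(rss_null / rss_split)
        if stat > best_stat:
            best_tau, best_stat = tau, stat
    return best_tau, best_stat
```

The location of the maximizing split doubles as the diagnostic for where the sustained step change occurred.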

297 citations


Proceedings ArticleDOI
20 Jun 2007
TL;DR: The equivalence relationship between the proposed least squares formulation and LDA for multi-class classifications is rigorously established under a mild condition, which is shown empirically to hold in many applications involving high-dimensional data.
Abstract: Linear Discriminant Analysis (LDA) is a well-known method for dimensionality reduction and classification. LDA in the binary-class case has been shown to be equivalent to linear regression with the class label as the output. This implies that LDA for binary-class classifications can be formulated as a least squares problem. Previous studies have shown a certain relationship between multivariate linear regression and LDA for the multi-class case. Many of these studies show that multivariate linear regression with a specific class indicator matrix as the output can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case. In this paper, a novel formulation for multivariate linear regression is proposed. The equivalence relationship between the proposed least squares formulation and LDA for multi-class classifications is rigorously established under a mild condition, which is shown empirically to hold in many applications involving high-dimensional data. Several LDA extensions based on the equivalence relationship are discussed.
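
The binary-class equivalence cited above is easy to verify numerically: the LDA direction S_w^{-1}(mu1 - mu0) is proportional to the least squares coefficient vector from regressing the centered class label on the centered features. A small numpy check on simulated data (the data-generating choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, p = 60, 40, 3
X0 = rng.normal(0.0, 1.0, (n0, p))       # class 0
X1 = rng.normal(1.0, 1.0, (n1, p))       # class 1, shifted mean
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n0), np.ones(n1)]

# LDA direction: inverse pooled within-class scatter times mean gap
Sw = ((X0 - X0.mean(0)).T @ (X0 - X0.mean(0))
      + (X1 - X1.mean(0)).T @ (X1 - X1.mean(0)))
w_lda = np.linalg.solve(Sw, X1.mean(0) - X0.mean(0))

# Least squares direction: regress centered labels on centered features
Xc, yc = X - X.mean(0), y - y.mean()
w_ols, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

# The two directions coincide up to a positive scalar
cosine = w_lda @ w_ols / (np.linalg.norm(w_lda) * np.linalg.norm(w_ols))
```

The proportionality follows from the Sherman–Morrison identity applied to the total scatter matrix, which is why the cosine is 1 up to floating-point error rather than merely close.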

291 citations


Journal ArticleDOI
TL;DR: The use of the T2 control chart is extended to monitor the coefficients resulting from a parametric nonlinear regression model fit to profile data, and three general approaches to the formulation of the T2 statistics and determination of the associated upper control limits for Phase I applications are given.
Abstract: In many quality control applications, use of a single (or several distinct) quality characteristic(s) is insufficient to characterize the quality of a produced item. In an increasing number of cases, a response curve (profile) is required. Such profiles can frequently be modeled using linear or nonlinear regression models. In recent research others have developed multivariate T2 control charts and other methods for monitoring the coefficients in a simple linear regression model of a profile. However, little work has been done to address the monitoring of profiles that can be represented by a parametric nonlinear regression model. Here we extend the use of the T2 control chart to monitor the coefficients resulting from a parametric nonlinear regression model fit to profile data. We give three general approaches to the formulation of the T2 statistics and determination of the associated upper control limits for Phase I applications. We also consider the use of non-parametric regression methods and the use of metrics to measure deviations from a baseline profile. These approaches are illustrated using the vertical board density profile data presented in Walker and Wright (Comparing curves using additive models. Journal of Quality Technology 2002; 34:118–129). Copyright © 2007 John Wiley & Sons, Ltd.
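
The Phase I T2 statistic itself is straightforward once per-profile coefficient vectors are in hand. The sketch below substitutes a quadratic (linear-in-parameters) profile model for the paper's general nonlinear fit, injects a curvature shift into one profile, and computes T2_i = (b_i - bbar)' S^{-1} (b_i - bbar) using the sample covariance of the coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 25)
D = np.column_stack([np.ones_like(x), x, x ** 2])   # quadratic profile model
m = 30                                              # Phase I profiles
coefs = np.empty((m, 3))
for i in range(m):
    y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(0, 0.1, x.size)
    if i == m - 1:
        y += 5.0 * x ** 2        # inject a curvature shift in one profile
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    coefs[i] = beta

bbar = coefs.mean(axis=0)
S = np.cov(coefs.T)              # sample covariance of coefficient vectors
Sinv = np.linalg.inv(S)
t2 = np.array([(b - bbar) @ Sinv @ (b - bbar) for b in coefs])
```

The shifted profile stands out with by far the largest T2; the paper's three formulations differ mainly in how the covariance matrix and control limits are estimated.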

285 citations


Book
10 Dec 2007
TL;DR: A textbook treatment of the simple and multiple linear regression models and their extensions, the generalized linear regression model, linear restrictions, prediction, sensitivity analysis, incomplete data, robust regression, and models for categorical response variables.
Abstract: The Simple Linear Regression Model.- The Multiple Linear Regression Model and Its Extensions.- The Generalized Linear Regression Model.- Exact and Stochastic Linear Restrictions.- Prediction in the Generalized Regression Model.- Sensitivity Analysis.- Analysis of Incomplete Data Sets.- Robust Regression.- Models for Categorical Response Variables.

268 citations


Journal ArticleDOI
TL;DR: In this article, the authors present diagnostic tools for local collinearity in geographically weighted regression (GWR), integrate ridge regression into GWR to constrain and stabilize regression coefficients and lower prediction error, and demonstrate the utility of these techniques with an example using the Columbu...
Abstract: Geographically weighted regression (GWR) is drawing attention as a statistical method to estimate regression models with spatially varying relationships between explanatory variables and a response variable. Local collinearity in weighted explanatory variables leads to GWR coefficient estimates that are correlated locally and across space, have inflated variances, and are at times counterintuitive and contradictory in sign to the global regression estimates. The presence of local collinearity in the absence of global collinearity necessitates the use of diagnostic tools in the local regression model building process to highlight areas in which the results are not reliable for statistical inference. The method of ridge regression can also be integrated into the GWR framework to constrain and stabilize regression coefficients and lower prediction error. This paper presents numerous diagnostic tools and ridge regression in GWR and demonstrates the utility of these techniques with an example using the Columbu...
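
Integrating a ridge penalty into GWR amounts to adding lam*I to each locally weighted normal-equations matrix. A minimal sketch with a Gaussian kernel (function and parameter names are illustrative; real GWR software also selects the bandwidth and penalty by cross-validation):

```python
import numpy as np

def gwr_ridge(coords, X, y, bandwidth, lam=0.0):
    """Geographically weighted regression with an optional ridge
    penalty `lam`. Gaussian kernel weights by distance to each
    calibration site; returns one coefficient vector per site."""
    X1 = np.column_stack([np.ones(len(y)), X])
    p = X1.shape[1]
    betas = np.empty((len(y), p))
    for i, site in enumerate(coords):
        d = np.linalg.norm(coords - site, axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)
        XtW = X1.T * w
        betas[i] = np.linalg.solve(XtW @ X1 + lam * np.eye(p), XtW @ y)
    return betas
```

As the bandwidth grows the local fits collapse to the global OLS fit; shrinking it (or raising `lam`) trades variance for bias, which is how the ridge term stabilizes locally collinear coefficient estimates.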

241 citations


Journal ArticleDOI
TL;DR: It is shown that case-crossover using conditional logistic regression is a special case of time series analysis when there is a common exposure, as in air pollution studies; this equivalence provides computational convenience for case-crossover analyses and a better understanding of time series models.
Abstract: The case-crossover design was introduced in epidemiology 15 years ago as a method for studying the effects of a risk factor on a health event using only cases. The idea is to compare a case's exposure immediately prior to or during the case-defining event with that same person's exposure at otherwise similar "reference" times. An alternative approach to the analysis of daily exposure and case-only data is time series analysis. Here, log-linear regression models express the expected total number of events on each day as a function of the exposure level and potential confounding variables. In time series analyses of air pollution, smooth functions of time and weather are the main confounders. Time series and case-crossover methods are often viewed as competing methods. In this paper, we show that case-crossover using conditional logistic regression is a special case of time series analysis when there is a common exposure such as in air pollution studies. This equivalence provides computational convenience for case-crossover analyses and a better understanding of time series models. Time series log-linear regression accounts for overdispersion of the Poisson variance, while case-crossover analyses typically do not. This equivalence also permits model checking for case-crossover data using standard log-linear model diagnostics.

221 citations


Journal ArticleDOI
TL;DR: The R package flexmix provides flexible modelling of finite mixtures of regression models using the EM algorithm; several new features of the software are introduced, such as fixed and nested varying effects for mixtures of generalized linear models and multinomial regression for a priori probabilities given concomitant variables.

171 citations


Journal ArticleDOI
TL;DR: A partial least squares (PLS) approach is proposed for linear discriminant analysis (LDA) with functional predictors (curves), based on the equivalence between LDA and multiple linear regression (binary response) and between LDA and canonical correlation analysis (more than two groups).
Abstract: A partial least squares (PLS) approach is proposed for linear discriminant analysis (LDA) when predictors are data of functional type (curves). Based on the equivalence between LDA and multiple linear regression (binary response) and between LDA and canonical correlation analysis (more than two groups), PLS regression on functional data is used to estimate the discriminant coefficient functions. A simulation study as well as an application to kneading data compare the PLS model results with those given by other methods.
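
For finite-dimensional data the underlying PLS regression step can be sketched with the NIPALS algorithm; the functional case replaces these vectors with basis-expanded curves. Illustrative Python (a handy sanity check: with k equal to the number of predictors and a full-rank design, PLS1 reproduces the OLS fit):

```python
import numpy as np

def pls1(X, y, k):
    """PLS1 regression via NIPALS with k components.
    Returns (beta, intercept) on the original variable scale."""
    xm, ym = X.mean(0), y.mean()
    Xk, yc = X - xm, y - ym
    W, P, q = [], [], []
    for _ in range(k):
        w = Xk.T @ yc
        w = w / np.linalg.norm(w)          # weight vector
        t = Xk @ w                         # component score
        tt = t @ t
        P.append(Xk.T @ t / tt)            # X loading
        q.append(yc @ t / tt)              # y loading
        W.append(w)
        Xk = Xk - np.outer(t, P[-1])       # deflate X
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, ym - xm @ beta
```

With fewer components than predictors, PLS regularizes the fit, which is what makes it usable when the "predictors" are discretized curves with far more dimensions than observations.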

Proceedings ArticleDOI
Deepak Agarwal1, Srujana Merugu1
12 Aug 2007
TL;DR: A novel statistical method to predict large scale dyadic response variables in the presence of covariate information that simultaneously incorporates the effect of covariates and estimates local structure that is induced by interactions among the dyads through a discrete latent factor model.
Abstract: We propose a novel statistical method to predict large scale dyadic response variables in the presence of covariate information. Our approach simultaneously incorporates the effect of covariates and estimates local structure that is induced by interactions among the dyads through a discrete latent factor model. The discovered latent factors provide a predictive model that is both accurate and interpretable. We illustrate our method by working in a framework of generalized linear models, which include commonly used regression techniques like linear regression, logistic regression and Poisson regression as special cases. We also provide scalable generalized EM-based algorithms for model fitting using both "hard" and "soft" cluster assignments. We demonstrate the generality and efficacy of our approach through large scale simulation studies and analysis of datasets obtained from certain real-world movie recommendation and internet advertising applications.

Journal ArticleDOI
TL;DR: In this paper, the authors investigated the rate of convergence of estimating the regression weight function in a functional linear regression model, where the predictor and the weight function are smooth and periodic in the sense that the derivatives are equal at the boundary points.

Journal ArticleDOI
TL;DR: An iterative algorithm for multiple regression with fuzzy variables is proposed using the standard least-squares criterion as a performance index and the regression problem is posed as a gradient-descent optimisation.

Journal ArticleDOI
TL;DR: The neuro-fuzzy model is recommended as an alternative tool for modeling of flow dynamics in the study area and was able to improve the root mean square error (RMSE) and mean absolute percentage error (MAPE) values of the multiple linear regression forecasts by about 13.52% and 10.73%, respectively.

Book ChapterDOI
30 Nov 2007

Journal ArticleDOI
TL;DR: It turns out that recovery of regression weights in situations with collinearity is often very poor by all methods, unless the regression weights lie in the subspace spanned by the first few principal components of the predictor variables; in those cases, typically PLS and PCR give the best recoveries of regression weights.
Abstract: Regression tends to give very unstable and unreliable regression weights when predictors are highly collinear. Several methods have been proposed to counter this problem. A subset of these do so by finding components that summarize the information in the predictors and the criterion variables. The present paper compares six such methods (two of which are almost completely new) to ordinary regression: partial least squares (PLS), principal component regression (PCR), principal covariates regression, reduced-rank regression, and two variants of what is called power regression. The comparison is mainly done by means of a series of simulation studies, in which data are constructed in various ways, with different degrees of collinearity and noise, and the methods are compared in terms of their capability of recovering the population regression weights, as well as their prediction quality for the complete population. It turns out that recovery of regression weights in situations with collinearity is often very poor by all methods, unless the regression weights lie in the subspace spanned by the first few principal components of the predictor variables. In those cases, typically PLS and PCR give the best recoveries of regression weights. The picture is inconclusive, however, because, especially in the study with more real-life-like simulated data, PLS and PCR gave the poorest recoveries of regression weights in conditions with relatively low noise and collinearity. It seems that PLS and PCR are particularly indicated in cases with much collinearity, whereas in other cases it is better to use ordinary regression. As far as prediction is concerned, prediction suffers far less from collinearity than recovery of the regression weights does.
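
Of the component methods compared, PCR is the simplest to sketch: regress the criterion on the first k principal components of the predictors, then map the coefficients back to the original variables. Illustrative numpy (with k equal to the number of predictors it reduces to ordinary least squares, which the test below exploits):

```python
import numpy as np

def pcr(X, y, k):
    """Principal component regression: regress centered y on the first
    k principal components of centered X, then map the coefficients
    back to the original predictor space."""
    xm, ym = X.mean(0), y.mean()
    Xc, yc = X - xm, y - ym
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vk = Vt[:k].T                      # loadings of the first k PCs
    scores = Xc @ Vk                   # component scores
    gamma, *_ = np.linalg.lstsq(scores, yc, rcond=None)
    beta = Vk @ gamma                  # back to predictor space
    return ym - xm @ beta, beta        # (intercept, slopes)
```

Choosing k < p discards the low-variance principal directions, which is precisely where collinear designs make the OLS weights unstable; the paper's finding is that this helps only when the true weights live in the retained subspace.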

Journal ArticleDOI
TL;DR: In this paper, asymptotic properties of M-estimates of regression parameters in linear models with dependent errors are studied; weak and strong Bahadur representations are derived and a central limit theorem is established.
Abstract: We study asymptotic properties of M-estimates of regression parameters in linear models in which errors are dependent. Weak and strong Bahadur representations of the M-estimates are derived and a central limit theorem is established. The results are applied to linear models with errors being short-range dependent linear processes, heavy-tailed linear processes and some widely used nonlinear time series.

Book ChapterDOI
TL;DR: This chapter describes multiple linear regression, a statistical approach used to describe the simultaneous associations of several variables with one continuous outcome.
Abstract: This chapter describes multiple linear regression, a statistical approach used to describe the simultaneous associations of several variables with one continuous outcome. Important steps in using this approach include estimation and inference, variable selection in model building, and assessing model fit. The special cases of regression with interactions among the variables, polynomial regression, regressions with categorical (grouping) variables, and separate slopes models are also covered. Examples in microbiology are used throughout.
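
A minimal worked example of the chapter's themes — estimation with an interaction term and a grouping variable — on simulated data (the coefficient values and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
g = rng.integers(0, 2, n)                       # grouping variable
y = (1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + 1.5 * g
     + rng.normal(0, 0.5, n))

# Design: intercept, main effects, interaction, group indicator
X = np.column_stack([np.ones(n), x1, x2, x1 * x2, g])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The fitted coefficients recover the generating values to within sampling error; inference and variable selection, which the chapter covers next, would then decide which of these terms to keep.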

Journal ArticleDOI
TL;DR: It is demonstrated that in specific circumstances the propensity score estimator is identical to the effect estimated from a full linear model, even if it is built on coarser covariate strata than the linear model.
Abstract: Stratifying and matching by the propensity score are increasingly popular approaches to deal with confounding in medical studies investigating effects of a treatment or exposure. A more traditional alternative technique is the direct adjustment for confounding in regression models. This paper discusses fundamental differences between the two approaches, with a focus on linear regression and propensity score stratification, and identifies points to be considered for an adequate comparison. The treatment estimators are examined for unbiasedness and efficiency. This is illustrated in an application to real data and supplemented by an investigation on properties of the estimators for a range of underlying linear models. We demonstrate that in specific circumstances the propensity score estimator is identical to the effect estimated from a full linear model, even if it is built on coarser covariate strata than the linear model. As a consequence the coarsening property of the propensity score-adjustment for a one-dimensional confounder instead of a high-dimensional covariate-may be viewed as a way to implement a pre-specified, richly parametrized linear model. We conclude that the propensity score estimator inherits the potential for overfitting and that care should be taken to restrict covariates to those relevant for outcome.
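
The comparison can be sketched on simulated data: stratify on propensity-score quintiles and pool the within-stratum treated-control differences, versus adjusting for the confounder directly in a linear model. Illustrative Python using the true propensity score for simplicity (a real analysis would estimate it, e.g. by logistic regression):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)                    # confounder
ps = 1.0 / (1.0 + np.exp(-x))             # true propensity score
t = rng.binomial(1, ps)                   # treatment assignment
y = 2.0 * t + 1.0 * x + rng.normal(size=n)

# (a) Stratify on propensity-score quintiles; pool the within-stratum
#     treated-minus-control mean differences, weighted by stratum size
edges = np.quantile(ps, [0.2, 0.4, 0.6, 0.8])
strata = np.digitize(ps, edges)
num, den = 0.0, 0
for s in range(5):
    m = strata == s
    diff = y[m & (t == 1)].mean() - y[m & (t == 0)].mean()
    num += m.sum() * diff
    den += m.sum()
est_ps = num / den

# (b) Direct linear-model adjustment for the confounder
Z = np.column_stack([np.ones(n), t, x])
est_lm = np.linalg.lstsq(Z, y, rcond=None)[0][1]
```

Both recover the true effect of 2; the stratified estimate carries a small residual bias from the coarsening into five strata, illustrating the paper's point that propensity stratification acts like a coarsened adjustment model.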

Journal ArticleDOI
TL;DR: This work derives least squares estimators for the simple linear regression model and examines them from a theoretical perspective; for the multiple linear regression model, a stepwise algorithm is developed to find the estimates.
Abstract: Simple and multiple linear regression models are considered between variables whose "values" are convex compact random sets in R^p (that is, hypercubes, spheres, and so on). We analyze such models within a set-arithmetic approach. Contrary to what happens for random variables, the least squares optimal solutions for the basic affine transformation model do not produce suitable estimates for the linear regression model. First, we derive least squares estimators for the simple linear regression model and examine them from a theoretical perspective. Moreover, the multiple linear regression model is dealt with and a stepwise algorithm is developed in order to find the estimates in this case. The particular problem of linear regression with interval-valued data is also considered and illustrated by means of a real-life example.

Journal ArticleDOI
TL;DR: The authors introduce a computationally simple estimator that uses linear regression to estimate the distribution of random coefficients in discrete choice demand models, compare it to several alternatives in a Monte Carlo exercise, and find that it predicts out-of-sample market shares well.
Abstract: Random coefficient discrete choice models are a popular method for estimating demand in differentiated product markets. We introduce a computationally simple estimator that uses linear regression to estimate the distribution of random coefficients. The estimator is nonparametric for the distribution of the random coefficients. We compare our estimator to several alternatives in a Monte Carlo exercise, and find the estimator predicts out-of-sample market shares well. We discuss extensions to panel data and dynamic programming.

Journal ArticleDOI
TL;DR: A fuzzy nonparametric model with crisp input and LR fuzzy output is considered and the local linear smoothing technique in statistics with the cross-validation procedure for selecting the optimal value of the smoothing parameter is fuzzified to fit this model.

Journal ArticleDOI
TL;DR: Methods are constructed to test whether there is some 'linear' relationship between imprecise predictor and response variables in a regression analysis and a suitable equivalence for the hypothesis of linear independence in this model is obtained in terms of the mid-spread representations of the interval-valued variables.

Journal ArticleDOI
TL;DR: The proposed statistical scheme is demonstrated by the analysis of experimental data on internal waves, in which the results illustrate what has been investigated in laboratory experiments and may be applicable to the naturally occurring reflection of internal waves from sloping b...
Abstract: Purpose – This study seeks to develop a systematic means of identifying regression models using a complex regression model with a statistical method. Design/methodology/approach – As a widely adopted statistical scheme for analyzing multifactor data, regression analysis provides a conceptually simple algorithm for examining functional relationships among variables. This investigation assesses the proposed relationship using a sample of data in regression analysis and then estimates the fit using statistics. Furthermore, several algorithms and added-variable plots are presented to obtain an appropriate regression model and the relationship between the response variable y and explanatory variables x0, x1, x2, …, xp. Findings – The proposed statistical scheme is demonstrated by the analysis of experimental data on internal waves, in which the results illustrate what has been investigated in laboratory experiments and may be applicable to the naturally occurring reflection of internal waves from sloping b...

DOI
10 Dec 2007
TL;DR: The present article applies the condition number as a collinearity diagnostic to linear regression models with categorical explanatory variables and analyzes how the dummy variables and the choice of reference category can affect the degree of multicollinearity.
Abstract: The present article discusses the role of categorical variables in the problem of multicollinearity in the linear regression model. It applies the condition number as a diagnostic tool to linear regression models with categorical explanatory variables and analyzes how the dummy variables and the choice of reference category can affect the degree of multicollinearity. This effect is analyzed analytically as well as numerically through simulation and a real data application.
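
The reference-category effect is easy to reproduce: with unbalanced category frequencies, choosing a rare level as the reference makes the dummy for the common level nearly collinear with the intercept, inflating the condition number. A small numpy sketch (the scaling convention and category frequencies are illustrative):

```python
import numpy as np

def condition_number(X):
    """Condition number of the column-scaled design matrix, a
    standard collinearity diagnostic."""
    Xs = X / np.linalg.norm(X, axis=0)
    s = np.linalg.svd(Xs, compute_uv=False)
    return s[0] / s[-1]

rng = np.random.default_rng(3)
n = 300
cat = rng.choice(3, size=n, p=[0.8, 0.1, 0.1])   # unbalanced 3-level factor
z = rng.normal(size=n)                           # a continuous regressor

def design(ref):
    """Intercept + dummy coding with `ref` as the reference category + z."""
    dummies = [(cat == k).astype(float) for k in range(3) if k != ref]
    return np.column_stack([np.ones(n)] + dummies + [z])

kappas = [condition_number(design(ref)) for ref in range(3)]
```

Using the common level (category 0) as the reference gives the smallest condition number; referencing either rare level leaves an 80%-ones dummy in the design, which is nearly parallel to the intercept column.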

Journal ArticleDOI
TL;DR: A tree-structured method that fits a simple but nontrivial model to each partition of the variable space that ensures that each piece of the fitted regression function can be visualized with a graph or a contour plot.
Abstract: Many methods can fit models with a higher prediction accuracy, on average, than the least squares linear regression technique. But the models, including linear regression, are typically impossible to interpret or visualize. We describe a tree-structured method that fits a simple but nontrivial model to each partition of the variable space. This ensures that each piece of the fitted regression function can be visualized with a graph or a contour plot. For maximum interpretability, our models are constructed with negligible variable selection bias and the tree structures are much more compact than piecewise-constant regression trees. We demonstrate, by means of a large empirical study involving 27 methods, that the average prediction accuracy of our models is almost as high as that of the most accurate “black-box” methods from the statistics and machine learning literature.

Journal ArticleDOI
TL;DR: In this paper, a change point detection approach for multivariate linear regression models is presented, which can account for missing data in the response variables and/or in the explicative variables; it also improves on recently published change point detection methodologies by allowing a more flexible, and thus more realistic, prior specification for the existence of a change and the date of change, as well as for the regression parameters.
Abstract: Multivariate linear regression is one of the most popular modeling tools in hydrology and climate sciences for explaining the link between key variables. Piecewise linear regression is not always appropriate since the relationship may experience sudden changes due to climatic, environmental, or anthropogenic perturbations. To address this issue, a practical and general approach to the Bayesian analysis of the multivariate regression model is presented. The approach allows simultaneous single change point detection in a multivariate sample and can account for missing data in the response variables and/or in the explicative variables. It also improves on recently published change point detection methodologies by allowing a more flexible and thus more realistic prior specification for the existence of a change and the date of change, as well as for the regression parameters. The estimation of all unknown parameters is achieved by Markov chain Monte Carlo simulations. It is shown that the developed approach is able to reproduce the results of Rasmussen (2001) as well as those of Perreault et al. (2000a, 2000b). Furthermore, two of the examples provided in the paper show that the proposed methodology can readily be applied to some problems that cannot be addressed by any of the above-mentioned approaches because of limiting model structure and/or restrictive prior assumptions. The first of these examples deals with single change point detection in the multivariate linear relationship between mean basin-scale precipitation at different periods of the year and the summer–autumn flood peaks of the Broadback River located in northern Quebec, Canada. The second one addresses the problem of missing data estimation with uncertainty assessment in multisite streamflow records with a possible simultaneous shift in mean streamflow values that occurred at an unknown date.

Book ChapterDOI
TL;DR: This chapter presents the general linear model as an extension to the two-sample t-test, analysis of variance (ANOVA), and linear regression, and the F test is introduced as a means to test for the strength of group effect.
Abstract: This chapter presents the general linear model as an extension to the two-sample t-test, analysis of variance (ANOVA), and linear regression. We illustrate the general linear model using two-way ANOVA as a prime example. The underlying principle of ANOVA, which is based on the decomposition of the value of an observed variable into grand mean, group effect and random noise, is emphasized. Further into this chapter, the F test is introduced as a means to test for the strength of group effect. The procedure of F test for identifying a parsimonious set of factors in explaining an outcome of interest is also described.
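
The F test for a group effect compares nested linear models: drop the factor's dummy columns and measure how much the residual sum of squares grows. A Python sketch (the two-way layout and effect sizes are invented for illustration):

```python
import numpy as np

def f_test(X_reduced, X_full, y):
    """F statistic for comparing nested linear models: do the extra
    columns of X_full explain significantly more variation?"""
    def rss(Z):
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r = y - Z @ beta
        return r @ r
    n = len(y)
    df1 = X_full.shape[1] - X_reduced.shape[1]
    df2 = n - X_full.shape[1]
    return ((rss(X_reduced) - rss(X_full)) / df1) / (rss(X_full) / df2)

# Two-way layout: factors A (3 levels) and B (2 levels), no interaction
rng = np.random.default_rng(7)
n = 120
a = rng.integers(0, 3, n)
b = rng.integers(0, 2, n)
y = 1.0 + np.array([0.0, 1.5, -1.0])[a] + 0.5 * b + rng.normal(0, 1.0, n)

ones = np.ones((n, 1))
A = np.column_stack([(a == 1), (a == 2)]).astype(float)   # dummies for A
B = (b == 1).astype(float)[:, None]
F_A = f_test(np.hstack([ones, B]), np.hstack([ones, A, B]), y)
```

Comparing `F_A` against the F distribution with (df1, df2) degrees of freedom gives the p-value; the large simulated group effect here yields an F statistic far above any conventional critical value.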

Journal ArticleDOI
TL;DR: In this paper, a stochastic expansion for a residual-based estimator of the error distribution function in a partially linear regression model is proved, which implies a functional central limit theorem.
Abstract: We prove a stochastic expansion for a residual-based estimator of the error distribution function in a partly linear regression model. It implies a functional central limit theorem. As special cases we cover nonparametric, nonlinear and linear regression models.
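
A residual-based estimate of the error distribution function is simply the empirical CDF of the fitted residuals. A small numpy sketch using a plain linear model as a stand-in for the paper's partly linear setting (the sup-distance to the true error CDF shrinks at roughly the n^(-1/2) rate the expansion implies):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(4)
n = 2000
x = rng.uniform(-1.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)   # N(0, 1) errors

# Fit the regression, then estimate the error distribution function
# by the empirical CDF of the residuals
Z = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ beta

grid = np.linspace(-3.0, 3.0, 121)
ecdf = (resid[:, None] <= grid).mean(axis=0)
true_cdf = np.array([0.5 * (1.0 + erf(t / np.sqrt(2.0))) for t in grid])
sup_dist = np.max(np.abs(ecdf - true_cdf))
```

The paper's stochastic expansion controls the extra error introduced by using residuals rather than the unobservable true errors, which is what justifies the functional central limit theorem for this estimator.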