
Showing papers on "Unit-weighted regression" published in 2007


Book
06 Sep 2007
TL;DR: The book defines robustness and robust regression, examines the implications of unusual cases for OLS estimates and standard errors, and surveys robust regression estimators for the linear model (L-, R-, M-, GM-, S-, generalized S-, and MM-estimators), comparing the various estimators and revisiting diagnostics with robust-regression-based methods for detecting outliers.
Abstract (contents): List of Figures. List of Tables. Series Editor's Introduction. Acknowledgments.
1. Introduction: Defining Robustness. Defining Robust Regression. A Real-World Example: Coital Frequency of Married Couples in the 1970s.
2. Important Background: Bias and Consistency. Breakdown Point. Influence Function. Relative Efficiency. Measures of Location. Measures of Scale. M-Estimation. Comparing Various Estimates. Notes.
3. Robustness, Resistance, and Ordinary Least Squares Regression: Ordinary Least Squares Regression. Implications of Unusual Cases for OLS Estimates and Standard Errors. Detecting Problematic Observations in OLS Regression. Notes.
4. Robust Regression for the Linear Model: L-Estimators. R-Estimators. M-Estimators. GM-Estimators. S-Estimators. Generalized S-Estimators. MM-Estimators. Comparing the Various Estimators. Diagnostics Revisited: Robust Regression-Related Methods for Detecting Outliers. Notes.
5. Standard Errors for Robust Regression: Asymptotic Standard Errors for Robust Regression Estimators. Bootstrapped Standard Errors. Notes.
6. Influential Cases in Generalized Linear Models: The Generalized Linear Model. Detecting Unusual Cases in Generalized Linear Models. Robust Generalized Linear Models. Notes.
7. Conclusions.
Appendix: Software Considerations for Robust Regression. References. Index. About the Author.

322 citations


Proceedings ArticleDOI
10 Apr 2007
TL;DR: A Bayesian way of dealing with outlier-infested sensory data is introduced and a "black box" approach to removing outliers in real-time and expressing confidence in the estimated data is developed.
Abstract: In order to achieve reliable autonomous control in advanced robotic systems like entertainment robots, assistive robots, humanoid robots and autonomous vehicles, sensory data needs to be absolutely reliable, or some measure of reliability must be available. Bayesian statistics can offer favorable ways of accomplishing such robust sensory data pre-processing. In this paper, we introduce a Bayesian way of dealing with outlier-infested sensory data and develop a "black box" approach to removing outliers in real time and expressing confidence in the estimated data. We develop our approach in the framework of Bayesian linear regression with heteroscedastic noise. Essentially, every measured data point is assumed to have its individual variance, and the final estimate is achieved by a weighted regression over the observed data. An expectation-maximization algorithm allows us to estimate the variance of each data point in an incremental algorithm. With the exception of a time horizon (window size) over which the estimation process is averaged, no open parameters need to be tuned, and no special assumption about the generative structure of the data is required. The algorithm works efficiently in real time. We evaluate our method on synthetic data and on a pose estimation problem of a quadruped robot, demonstrating its ease of use, its competitiveness with well-tuned alternative algorithms, and its advantages in terms of robust outlier removal.
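
The per-point-variance idea translates directly into code. Below is a minimal batch sketch of the approach (the paper's algorithm is incremental with a sliding window, and the function name and the inverse-gamma prior parameters a0, b0 are illustrative assumptions, not taken from the paper): the E-step re-estimates each point's precision from its residual, and the M-step refits by weighted least squares.

```python
import numpy as np

def em_weighted_regression(X, y, a0=1.0, b0=1.0, n_iter=50, tol=1e-8):
    """Robust weighted regression via EM (batch sketch).

    Each observation is assumed to carry its own noise variance with an
    inverse-gamma prior IG(a0, b0).  The E-step replaces each precision
    1/s_i^2 by its posterior mean given the current residual; the M-step
    refits beta by weighted least squares.  Outliers end up with large
    estimated variance and therefore negligible weight.
    """
    n, d = X.shape
    w = np.ones(n)                                  # current precisions
    beta = np.zeros(d)
    for _ in range(n_iter):
        WX = X * w[:, None]                         # M-step: WLS refit
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ y)
        r = y - X @ beta_new                        # E-step: update precisions
        w = (a0 + 0.5) / (b0 + 0.5 * r**2)
        if np.linalg.norm(beta_new - beta) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta, w

# Toy usage: a clean linear trend with a few gross outliers injected.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=200)
y[:10] += 5.0
beta_hat, weights = em_weighted_regression(X, y)    # outliers get tiny weights
```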

50 citations


Book
01 Jan 2007
TL;DR: A textbook treatment of simple and multiple linear regression, including special problems in simple linear regression such as determining x from y, serial correlation, and curve fitting.
Abstract (contents): Basic Statistical Concepts. Simple Linear Regression. Special Problems in Simple Linear Regression: Determining x from y, Serial Correlation, and Curve Fitting. Some Aspects and Examples in Constructing a Valid Simple Regression Study. Multiple Linear Regression. Correlation in Multiple Regression. Issues in Multiple Linear Regression. Polynomial Regression. Special Topics in Multiple Regression. Indicator (Dummy) Variable Regression. Model Building/Model Selection. Analysis of Covariance. Logistic Regression. Appendices.

9 citations


Journal ArticleDOI
TL;DR: The idea behind regression weighting is to include in the regression model all the variables and interactions that are related to the outcome values and that affect the sample selection and the response probabilities, so that the sampling and response mechanisms become ignorable under the model fitted to the observed data.
Abstract: This is an intriguing paper that raises important questions, and I feel privileged to have been invited to discuss it. The paper deals with a very basic problem of sample surveys: how to weight the survey data in order to estimate finite population quantities of interest like means, differences of means or regression coefficients. The paper focuses for the most part on the common estimator of a population mean, $\bar{y}_w = \sum_{i=1}^{n} w_i y_i / \sum_{i=1}^{n} w_i$, and discusses different approaches to constructing the weights by use of linear regression models. These models vary in terms of the number and nature of the regressors in the model and in the assumptions regarding the regression coefficients, whether fixed or random with prespecified distributions. The idea behind regression weighting is to include in the regression model all the variables and interactions that are related to the outcome values and affect the sample selection and the response probabilities, such that the sampling and response mechanisms are ignorable in the sense that the model fitted to the observed data is …
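
For concreteness, here is how the estimator and one common regression-based weighting scheme look in code. This is a minimal sketch: calibrated_weights implements generic GREG-style linear calibration as an illustration, not any of the specific weighting models the paper compares, and the function names are assumptions.

```python
import numpy as np

def weighted_mean(y, w):
    """The common survey estimator of a population mean:
    y_bar_w = sum_i(w_i * y_i) / sum_i(w_i)."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    return np.sum(w * y) / np.sum(w)

def calibrated_weights(Z, t_z, w_base):
    """GREG-style linear calibration (illustrative): tilt the base
    design weights so the weighted sample totals of the auxiliary
    variables Z hit their known population totals t_z.  The adjusted
    weights have the form w_i * (1 + z_i' lambda)."""
    Z, w = np.asarray(Z, float), np.asarray(w_base, float)
    A = (w[:, None] * Z).T @ Z              # sum_i w_i z_i z_i'
    lam = np.linalg.solve(A, t_z - Z.T @ w)
    return w * (1.0 + Z @ lam)
```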

6 citations


Journal ArticleDOI
TL;DR: The authors used regression methods to predict the expected monthly return on stocks and the covariance matrix of returns, the predictor variables being a company's "fundamentals" such as dividend yield and the history of previous returns.
Abstract: We use regression methods to predict the expected monthly return on stocks and the covariance matrix of returns, the predictor variables being a company's ‘fundamentals’, such as dividend yield and the history of previous returns. Predictions are evaluated out of sample for shares traded on the London Stock Exchange from 1976 to 2005. We explore and evaluate many modelling and inferential approaches, including the use of weighted regression, discounted regression, shrinkage of regression coefficients and the transformation to normality of predictor variables. We also investigate alternative covariance matrix models, such as a two-index model and a shrinkage model. Using suitable statistics to enable the out-of-sample performance of competing methodologies to be compared is crucial, and we develop some new statistics and a graphical aid for this purpose. What is original in this paper is an evaluation of many modelling and inferential procedures for which conflicting claims have been made in the literature and the development of new measures of portfolio performance.
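
Among the approaches compared, discounted regression is the most direct to illustrate: past observations are downweighted geometrically with age before an ordinary weighted least-squares fit. A minimal sketch follows (the discount factor and the function name are illustrative assumptions):

```python
import numpy as np

def discounted_regression(X, y, delta=0.99):
    """Discounted least squares: an observation t periods older than the
    most recent one receives weight delta**t, so recent months dominate
    the fit.  Rows of X and y must be in time order, oldest first."""
    n = len(y)
    w = delta ** np.arange(n - 1, -1, -1)   # oldest row gets delta**(n-1)
    WX = X * w[:, None]
    return np.linalg.solve(X.T @ WX, WX.T @ y)
```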

5 citations


Journal ArticleDOI
TL;DR: This paper demonstrates that asymptotic estimates of standard errors provided by multiple regression are not always accurate, and a resampling permutation procedure is used to estimate the standard errors.
Abstract: In the vast majority of psychological research utilizing multiple regression analysis, asymptotic probability values are reported. This paper demonstrates that asymptotic estimates of standard errors provided by multiple regression are not always accurate. A resampling permutation procedure is used to estimate the standard errors. In some cases the results differ substantially from the traditional least squares regression estimates.
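
A resampling scheme of this kind is straightforward to sketch. The version below resamples cases with replacement (a nonparametric bootstrap), which is one standard way to obtain empirical standard errors; it is an illustration, not necessarily the exact permutation procedure the paper uses.

```python
import numpy as np

def resampled_standard_errors(X, y, n_boot=2000, seed=0):
    """Empirical standard errors for regression coefficients obtained by
    resampling cases with replacement and refitting.  The spread of the
    refitted coefficients can then be compared with the asymptotic
    standard errors reported by ordinary least squares."""
    rng = np.random.default_rng(seed)
    n = len(y)
    betas = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample rows
        betas[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    return betas.std(axis=0, ddof=1)
```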

4 citations



06 Dec 2007
TL;DR: This article analyses the stability of parameter coefficient estimates for Geographically Weighted Regression (GWR) models and concludes that results from GWR must be carefully considered in terms of the form of the data, the assumed coefficient surface being modelled, and the confidence of the resulting parameter estimates.
Abstract: This paper describes preliminary work analysing the stability of parameter coefficient estimates for Geographically Weighted Regression (GWR). Based on a large dataset (35,721 points), various random samplings of this data were performed and models built using GWR. An analysis of the coefficient values for the independent variables showed that these values could vary significantly both between runs and between sampling sizes. This suggests that the results from GWR must be carefully considered in terms of the form of the data, the assumed coefficient surface being modelled, and the confidence of the resulting parameter estimates.
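
At a single target location, a GWR fit is just kernel-weighted least squares, which makes the paper's experiment easy to picture: refit something like the sketch below on random subsamples and examine how much the returned coefficients move. This is a minimal sketch with a Gaussian kernel; the function name and a fixed bandwidth are assumptions.

```python
import numpy as np

def gwr_coefficients(coords, X, y, target, bandwidth):
    """Local GWR coefficients at one target location: a Gaussian kernel
    on geographic distance downweights far-away observations, then
    weighted least squares gives the location-specific coefficients."""
    d = np.linalg.norm(coords - target, axis=1)     # distances to target
    w = np.exp(-0.5 * (d / bandwidth) ** 2)         # Gaussian kernel weights
    WX = X * w[:, None]
    return np.linalg.solve(X.T @ WX, WX.T @ y)
```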

3 citations


Proceedings ArticleDOI
11 May 2007
TL;DR: A recursive method is presented for estimating a parameterised form of the cross-correlation between the regression model errors, the variance of these errors, and the regression model parameters.
Abstract: The use of the generalised least squares (GLS) technique for estimation of hydrological regression models has become good practice in hydrology. Through a regression model, a simple link between a particular hydrological variable and a set of catchment descriptors can be established. The regression residuals can be treated as the sum of sampling errors in the hydrological variable and errors in the regression model. This paper presents a recursive method for estimating a parameterised form of the cross-correlation between the regression model errors, the variance of these errors and the regression model parameters. A re-weighted set of regression residuals can be defined such that the covariance of these residuals is essentially similar to that of the model error. The cross products of the re-weighted regression residuals, pooled within bins, can be used to identify a structure and to fit a parameterised form for the cross-correlations of the regression errors. The procedure has been tested successfully on annual maximum flow data from 602 catchments located throughout the UK.
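
The GLS step at the heart of such schemes is standard, and a minimal sketch is shown below. It covers only the estimation of the regression parameters for a given residual covariance; the paper's contribution, the recursive estimation of a parameterised form for that covariance from binned cross products of re-weighted residuals, is not reproduced here.

```python
import numpy as np

def gls(X, y, Sigma):
    """Generalised least squares: with residual covariance Sigma
    (sampling-error covariance plus model-error covariance),
    beta_hat = (X' Sigma^-1 X)^-1 X' Sigma^-1 y."""
    Si = np.linalg.inv(Sigma)
    XtSi = X.T @ Si
    beta = np.linalg.solve(XtSi @ X, XtSi @ y)
    return beta, y - X @ beta               # coefficients and residuals
```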

1 citation


Journal ArticleDOI
TL;DR: Cook's distance is generalized to multiple linear regression with linear constraints on the regression coefficients and is used for identifying influential observations in constrained regression models; a numerical example is provided for illustration.
Abstract: Cook's distance is generalized to the multiple linear regression with linear constraints on regression coefficients. It is used for identifying influential observations in constrained regression models. A numerical example is provided for illustration.
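
For reference, the ordinary Cook's distance that the paper generalises can be computed as follows (a minimal sketch of the unconstrained case only; the constrained version requires the restricted fit and is not reproduced here).

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for unconstrained least squares:
    D_i = r_i^2 * h_ii / (p * s^2 * (1 - h_ii)^2),
    where h_ii are leverages and s^2 the residual variance estimate."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat (projection) matrix
    h = np.diag(H)                          # leverages
    r = y - H @ y                           # residuals
    s2 = r @ r / (n - p)
    return (r**2 * h) / (p * s2 * (1 - h) ** 2)
```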

1 citation


Posted Content
TL;DR: The authors employ Geographical Information Systems and a spatial econometric technique, Geographically Weighted Regression, integrated in a dichotomous-choice CV, in order to improve both the sampling design and the econometric analysis of a CV survey, by fitting local changes and highlighting spatial non-stationarity in the relationships between estimated WTP and explanatory variables.
Abstract: The paper uses Contingent Valuation to investigate the externalities from linear infrastructures, with a particular concern for their dependence on characteristics of the local context within which they are perceived. We employ Geographical Information Systems and a spatial econometric technique, Geographically Weighted Regression, integrated in a dichotomous-choice CV, in order to improve both the sampling design and the econometric analysis of a CV survey. These tools are helpful when local factors with an important spatial variability may have a crucial explanatory role in the structure of individual preferences. Geographically Weighted Regression is introduced, alongside GIS, as a way to enhance the flexibility of a stated preference analysis, by fitting local changes and highlighting spatial non-stationarity in the relationships between estimated WTP and explanatory variables. This local approach is compared with a standard double-bounded contingent valuation through an empirical study about high-voltage transmission lines. The GWR methodology has not been applied before in environmental economics. The paper shows its significance in testing the consistency of the standard approach by monitoring the spatial patterns in the distribution of the WTP and the spatial stability of the estimated parameters, in order to compute the conditional WTPs.

Dissertation
01 Mar 2007
TL;DR: A holistic cluster analysis is presented: clusters are accurately unearthed within large datasets; an estimate of the natural number of clusters is obtained; and the variables important in defining the clusters are also established.
Abstract: The increasing size of datasets is particularly evident in the field of bioinformatics. It is unlikely that analyzing these large datasets with a single model will produce an accurate solution. This has led to the ensemble approach, where many models are averaged to give a consensus representation of the data. Taking a weighted average of the individual models has improved the accuracy of both classification and regression ensembles. However, weighting models within a cluster ensemble has remained relatively undeveloped because there is no gold standard available for comparison. This thesis explores a technique of weighting cluster ensembles. A regression technique, multivariate regression trees, is shown to produce an accurate clustering solution. Each solution (tree) is then weighted purely in terms of its predictive accuracy. Various weighting strategies are trialed to determine the superior technique. After each individual tree is assigned a weight, the trees’ co-occurrence matrices are obtained. The co-occurrence matrices are then aggregated together, weighted according to the trees’ predictive weights. The final result is a single weighted co-occurrence matrix. A new technique, similarity-based k-means, is developed in order to partition the weighted co-occurrence matrix. Similarity-based k-means is demonstrated to produce accurate partitions of similarity matrices. The resulting clusters agree with the known groups in the investigated datasets. Furthermore, this thesis develops two other techniques so that maximal information can be obtained in conjunction with the weighted cluster ensemble. The first method suggests an estimate of the natural number of clusters in a dataset, by assessing the predictive performance and variability of similarity-based k-means for various numbers of clusters. The estimates agree with the known numbers of groups within the investigated datasets. The second method elucidates the variables that define the clusters. These variables have high classification power within the studied datasets. Therefore, this thesis presents a holistic cluster analysis: clusters are accurately unearthed within large datasets; an estimate of the natural number of clusters is obtained; and the variables important in defining the clusters are also established. The weighted cluster ensemble technique is applied to a variety of small and large datasets. All results demonstrate the power of weighting the individual models within the ensemble: the developed weighted cluster ensemble technique consistently outperforms the other techniques. The results of analyzing two DNA microarray datasets are particularly promising. The discovered clusters overlap with the known diagnoses in the datasets, and the variables deemed important in defining the clusters have previously been suggested as biomarkers. Whilst the size of contemporary datasets presents unique statistical challenges, the potential information within them is immense. Statistical techniques must be developed in order to accurately analyze these datasets. Motivated by the success of weighted regression and classification ensembles applied to large datasets, this thesis suggests a technique of weighting models within a cluster ensemble. The results highlight the potential of weighted cluster ensembles in high dimensional settings, such as the analysis of DNA microarrays.
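
The aggregation step at the centre of the thesis, combining each tree's co-occurrence matrix weighted by its predictive accuracy, can be sketched as follows (an illustrative reconstruction; the function name and the normalisation of the weights are assumptions). The resulting matrix would then be partitioned, for example by the thesis's similarity-based k-means.

```python
import numpy as np

def weighted_cooccurrence(labelings, weights):
    """Weighted co-occurrence matrix for a cluster ensemble: entry
    (i, j) is the weighted fraction of ensemble members that place
    points i and j in the same cluster."""
    labelings = np.asarray(labelings)       # shape (n_models, n_points)
    weights = np.asarray(weights, float)
    weights = weights / weights.sum()       # normalise predictive weights
    n = labelings.shape[1]
    C = np.zeros((n, n))
    for lab, w in zip(labelings, weights):
        C += w * (lab[:, None] == lab[None, :])
    return C
```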

01 Jan 2007
TL;DR: A procedure based on M-estimation is proposed to determine the number of regression models for the problem of regression clustering; the true classification is attained as n increases to infinity under certain mild conditions, for instance without assuming normality of the random errors in each regression model.
Abstract: In this paper, a procedure based on M-estimation to determine the number of regression models for the problem of regression clustering is proposed. We show that the true classification is attained as n increases to infinity under certain mild conditions, for instance without assuming normality of the distribution of the random errors in each regression model.
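
One concrete way to picture regression clustering with M-estimation is an alternating scheme: assign each point to the regression model that fits it best, then refit each model robustly. The sketch below uses Huber-loss IRLS for the refit step; it is an illustrative reconstruction for a fixed number of models k, not the paper's procedure for selecting k.

```python
import numpy as np

def huber_weights(r, c=1.345):
    """IRLS weights for the Huber loss: 1 inside the threshold, c/|r| outside."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

def regression_clustering(X, y, k, n_iter=100, seed=0):
    """Alternating regression clustering with M-estimation (sketch):
    assign each point to the component whose fitted line explains it
    best, then refit each component by IRLS with Huber weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    labels = rng.integers(0, k, size=n)     # random initial assignment
    betas = np.zeros((k, d))
    for _ in range(n_iter):
        for j in range(k):
            idx = labels == j
            if idx.sum() < d:               # too few points to fit
                continue
            Xj, yj = X[idx], y[idx]
            beta = np.linalg.lstsq(Xj, yj, rcond=None)[0]
            for _ in range(10):             # IRLS refinement
                r = yj - Xj @ beta
                s = np.median(np.abs(r)) / 0.6745 + 1e-12  # robust scale
                w = huber_weights(r / s)
                WX = Xj * w[:, None]
                beta = np.linalg.solve(Xj.T @ WX, WX.T @ yj)
            betas[j] = beta
        # Reassign each point to its best-fitting component
        new_labels = np.argmin(np.abs(y[:, None] - X @ betas.T), axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, betas
```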