scispace - formally typeset
Open AccessJournal ArticleDOI

Rfit: Rank-based Estimation for Linear Models

John Kloke, +1 more
- 01 Jan 2012 - 
- Vol. 4, Iss: 2, pp 57-64
Reads0
Chats0
TLDR
An R package, Rfit, is developed that uses standard linear model syntax and includes many of the main inference and diagnostic functions for rank-based estimators and their associated inference.
Abstract
In the nineteen seventies, Jureckova and Jaeckel proposed rank estimation for linear models. Since that time, several authors have developed inference and diagnostic methods for these estimators. These rank-based estimators and their associated inference are highly efficient and are robust to outliers in response space. The methods include estimation of standard errors, tests of general linear hypotheses, confidence intervals, diagnostic procedures including studentized residuals, and measures of influential cases. We have developed an R package, Rfit, for computing of these robust procedures. In this paper we highlight the main features of the package. The package uses standard linear model syntax and includes many of the main inference and diagnostic functions.

read more

Content maybe subject to copyright    Report

CONTRIBUTED RESEARCH ARTICLES 57
Rfit: Rank-based Estimation for Linear
Models
by John D. Kloke and Joseph W. McKean
Abstract In the nineteen seventies, Jureˇcková
and Jaeckel proposed rank estimation for linear
models. Since that time, several authors have
developed inference and diagnostic methods for
these estimators. These rank-based estimators
and their associated inference are highly efficient
and are robust to outliers in response space. The
methods include estimation of standard errors,
tests of general linear hypotheses, confidence
intervals, diagnostic procedures including stu-
dentized residuals, and measures of influential
cases. We have developed an R package, Rfit,
for computing of these robust procedures. In this
paper we highlight the main features of the pack-
age. The package uses standard linear model
syntax and includes many of the main inference
and diagnostic functions.
Introduction
Rank-based estimators were developed as a robust,
nonparametric alternative to traditional likelihood
or least squares estimators. Rank-based regression
was first introduced by Jureˇcková (1971) and Jaeckel
(1972). McKean and Hettmansperger (1978) devel-
oped a Newton step algorithm that led to feasible
computation of these rank-based estimates. Since
then a complete rank-based inference for linear mod-
els has been developed that is based on rank-based
estimation analogous to the way that traditional
analysis is based on least squares (LS) estimation; see
Chapters 3-5 of the monograph by Hettmansperger
and McKean (2011) and Chapter 9 of Hollander and
Wolfe (1999). Furthermore, robust diagnostic proce-
dures have been developed with which to ascertain
quality of fit and to locate outliers in the data; see
McKean and Sheather (2009) for a recent discussion.
Kloke et al. (2009) extended this rank-based inference
to mixed models. Thus rank-based analysis is a com-
plete analysis analogous to the traditional LS analy-
sis for general linear models. This rank-based analy-
sis generalizes Wilcoxon procedures for simple loca-
tion models and, further, it inherits the same high ef-
ficiency that these simple nonparametric procedures
possess. In addition, weighted versions can have
high breakdown (up to 50%) in factor space (Chang
et al., 1999). In this paper, we discuss the Rfit pack-
age that we have developed for rank-based (R) esti-
mation and inference for linear models. We illustrate
its use on examples from simple regression to k-way
factorial designs.
The geometry of rank-based estimation is simi-
lar to that of LS. In rank-based regression, however,
we replace Euclidean distance with another measure
of distance which we refer to as Jaeckel’s dispersion
function; see Hettmansperger and McKean (2011) for
details. For a brief overview see McKean (2004).
Jaeckel’s dispersion function depends on the
choice of a score function. As discussed in
Hettmansperger and McKean (2011), the rank-based
fit and associated analysis can be optimized by a pru-
dent choice of scores. If the form of the error distri-
bution is known and the associated scores are used,
then the the analysis is fully efficient. In Rfit we have
included a library of score functions. The default
option is to use Wilcoxon (linear) scores, however it
is straightforward to create user-defined score func-
tions. We discuss score functions further in a later
section.
Others have developed software for rank-based
estimation. Kapenga et al. (1995) developed a For-
tran package and Crimin et al. (2008) developed a
web interface (cgi) with Perl for this Fortran pro-
gram. Terpstra and McKean (2005) developed a
set of R functions to compute weighted Wilcoxon
(WW) estimates including the high breakdown point
rank-based (HBR) estimate proposed by Chang et al.
(1999). See McKean et al. (2009) for a recent review.
Rfit differs from the WW estimates in that its estima-
tion algorithms are available for general scores and it
uses a standard linear models interface.
The package Rfit allows the user to implement
rank-based estimation and inference described in
Chapters 3-5 of Hettmansperger and McKean (2011)
and Chapter 9 of Hollander and Wolfe (1999). There
are other robust packages in R. For example, the R
function rlm of the R package MASS (Venables and
Ripley, 2002) computes M estimates for linear mod-
els based on the ψ functions of Huber, Hampel, and
Tukey (bisquare). The CRAN Task View on robust
statistical methods offers robust procedures for lin-
ear and nonlinear models including methods based
on M, M-S, and MM estimators. These procedures,
though, do not obtain rank-based estimates and as-
sociated inference as do the procedures in Rfit.
Rfit uses standard linear model syntax so that
those familiar with traditional parametric analysis
can easily begin running robust analyses. In this pa-
per, discussion of the assumptions are kept to a min-
imum and we refer the interested reader to the liter-
ature. All data sets used in demonstrating Rfit are
included in the package.
The rest of the paper is organized as follows.
The next section discusses the general linear model
and rank-based fitting and associated inference. The
The R Journal Vol. 4/2, December 2012 ISSN 2073-4859

58 CONTRIBUTED RESEARCH ARTICLES
following section provides examples illustrating the
computation of Rfit for linear regression. Later sec-
tions discuss Rfit’s computation of one-way and
multi-way ANOVA as well as general scores and
writing user-defined score functions for computation
in Rfit. The final section discusses the future imple-
mentation in Rfit of rank-based procedures for mod-
els beyond the general linear model.
Rank-regression
As with least squares, the goal of rank-based regres-
sion is to estimate the vector of coefficients, β, of a
general linear model of the form:
y
i
= α + x
T
i
β + e
i
for i = 1, .. . n (1)
where y
i
is the response variable, x
i
is the vector of
explanatory variables, α is the intercept parameter,
and e
i
is the error term. We assume that the errors
are iid with probability density function (pdf) f (t).
For convenience, we write (1) in matrix notation as
follows
y = α1 + Xβ + e (2)
where y = [y
1
,. .., y
n
]
T
is the n × 1 vector of re-
sponses, X = [x
1
,. .., x
n
]
T
is the n × p design matrix,
and e = [e
1
,. .., e
n
]
T
is the n ×1 vector of error terms.
The only assumption on the distribution of the er-
rors is that it is continuous; in that sense the model
is general. Recall that the least squares estimator is
the minimizor of Euclidean distance between y and
ˆy
LS
= X
ˆ
β
LS
. To obtain the R estimator, a different
measure of distance is used that is based on the dis-
persion function of Jaeckel (1972). Jaeckel’s disper-
sion function is given by
D(β) = ky Xβk
ϕ
(3)
where k· k
ϕ
is a pseudo-norm defined as
kuk
ϕ
=
n
i=1
a(R(u
i
))u
i
,
where R denotes rank, a(t) = ϕ
t
n+1
, and ϕ is a
non-decreasing, square-integrable score function de-
fined on the interval (0,1). Assume without loss of
generality that it is standardized, so that
R
ϕ(u) du =
0 and
R
ϕ
2
(u) du = 1. Score functions are discussed
further in a later section.
The R estimator of β is defined as
ˆ
β
ϕ
= Argminky X βk
ϕ
. (4)
This estimator is a highly efficient estimator which is
robust in the Y-space. A weighted version can attain
50% breakdown in the X-space at the expense of a
loss in efficiency (Chang et al., 1999).
Inference
Under assumptions outlined in the previous sec-
tion, it can be shown that the solution to (4) is con-
sistent and asymptotically normal (Hettmansperger
and McKean, 2011). We summarize this result as fol-
lows:
ˆ
β
ϕ
˙N
β, τ
2
ϕ
(X
T
X)
1
where τ
ϕ
is a scale parameter which depends on f
and the score function ϕ. An estimate of τ
ϕ
is nec-
essary to conduct inference and Rfit implements the
consistent estimator proposed by Koul et al. (1987).
Denote this estimator by
ˆ
τ
ϕ
. Then Wald tests and
confidence regions/intervals can be calculated. Let
se(
ˆ
β
j
) =
ˆ
τ
ϕ
X
T
X
1
jj
where
X
T
X
1
jj
is the jth di-
agonal element of
X
T
X
1
. Then an approximate
(1 α) 100% confidence interval for β
j
is
ˆ
β
j
±t
1α/2,np1
se(
ˆ
β
j
).
A Wald test of the general linear hypothesis
H
0
: Mβ = 0 versus H
A
: Mβ 6= 0
is to reject H
0
if
(M
ˆ
β
ϕ
)
T
[M(X
T
X)
1
)M
T
]
1
(M
ˆ
β)/q
ˆ
τ
2
ϕ
> F
1α,q,np1
where q = dim(M). Similar to the reduced model
test of classical regression rank-based regression of-
fers a drop in dispersion test which is implemented in
the R function drop.test. For the above hypothe-
ses let
ˆ
θ
ϕ
be the rank-based coefficient estimate of
the reduced model [Model (1) constrained by H
0
].
As discussed in Theorem 3.7.2 of Hettmansperger
and McKean (2011), the reduced model design ma-
trix is easily obtained using a QR-decomposition on
M
T
. We have implemented this methodology in Rfit.
Similar to the LS reduction in sums of squares, the
rank-based test is based on a reduction of dispersion
from the reduced to the full model. Let D(
ˆ
θ
ϕ
) and
D(
ˆ
β
ϕ
) denote the reduced and full model minimum
dispersions, then the test is to reject H
0
if
[D(
ˆ
θ
ϕ
) D(
ˆ
β
ϕ
)]/q
ˆ
τ
ϕ
/2
> F
1α,q,np1
Computation
It can be shown that Jaeckel’s dispersion function (3)
is a convex function of β (Hettmansperger and McK-
ean, 2011). Rfit uses optim with option `BFGS' to
minimize the dispersion function. We investigated
other minimization methods (e.g. Nelder-Mead or
CG), however the quasi-Newton method works well
in terms of speed and convergence. An orthonormal
basis matrix, for the space spanned by the columns
The R Journal Vol. 4/2, December 2012 ISSN 2073-4859

CONTRIBUTED RESEARCH ARTICLES 59
of X, is first calculated using qr which leads to bet-
ter performance in examples. The default initial fit is
based on an L
1
fit using quantreg (Koenker, 2011).
Computations by Rfit of rank-based estimation
and associated inference are illustrated in the exam-
ples of the next section.
Rfit computations of rank-based fit-
ting of linear models
For the general linear model (1) the package Rfit ob-
tains the rank-based estimates and inference, as de-
scribed in the previous section. In this section we il-
lustrate this computation for two examples. The first
is for a simple linear model while the second is for a
multiple regression model.
Example 1 (Telephone data). We begin with a sim-
ple linear regression example using the telephone
data discussed in Rousseuw and Leroy (1987) These
data represent the number of telephone calls (in tens
of millions) placed in Belgium over the years 1950–
1973. The data are plotted in Figure 1. There are sev-
eral noticeable outliers which are due to a mistake
in the recording units for the years 1964–1969. This
is a simple dataset, containing only one explanatory
variable, however it allows us to easily highlight the
package and also demonstrate the robustness to out-
liers of the procedure. The main function of the pack-
age Rfit is rfit which, as the following code segment
illustrates, uses syntax similar to lm.
> library(Rfit)
> data(telephone)
> fit <- rfit(calls ~ year, data = telephone)
> summary(fit)
Call:
rfit(formula = calls ~ year, data = telephone)
Coefficients:
Estimate Std. Error t.value p.value
-284.313842 152.687751 -1.8621 0.07665 .
year 0.145861 0.077842 1.8738 0.07494 .
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 1
Multiple R-squared (Robust): 0.3543158
Reduction in Dispersion Test:
12.07238 p-value: 0.00215
> plot(telephone)
> abline(fit)
> abline(lm(calls ~ year, data = telephone),
+ col = 2, lty = 2)
> legend("topleft", legend = c("R", "LS"),
+ col = 1:2, lty = 1:2)
1950 1955 1960 1965 1970
0 5 10 15 20
year
calls
R
LS
Figure 1: Scatter plot of the telephone data with over-
laid regression lines.
Further, the output is similar to that of lm and
can be interpreted in the same way. The estimate of
slope is 0.146 (tens of millions of calls per year) with
a standard error of 0.078. The t-statistic is the ratio
of the two and the p-value is calculated using a t-
distribution with n 2 degrees of freedom. Hence
one could conclude that year is a marginally signifi-
cant predictor of the number of telephone calls.
The overlaid fitted regression lines in the scatter
plot in Figure 1 demonstrate the robustness of the
Wilcoxon fit and the lack of robustness of the least
squares fit.
Example 2 (Free fatty acid data). This is a data set
from Morrison (1983, p. 64) (c.f. Example 3.9.4 of
Hettmansperger and McKean (2011)). The response
variable is level of free fatty acid in a sample of pre-
pubescent boys. The explanatory variables are age
(in months), weight (in lbs), and skin fold thickness.
For this discussion, we chose the Wilcoxon (default)
scores for Rfit. Based on the residual and Q-Q plots
below, however, the underlying error distribution
appears to be right-skewed. In a later section we
analyze this data set using more appropriate (bent)
scores for a right-skewed distribution.
To begin with we demonstrate the reduction in
dispersion test discussed in the previous section.
> fitF <- rfit(ffa ~ age + weight + skin,
+ data = ffa)
> fitR <- rfit(ffa ~ skin, data = ffa)
> drop.test(fitF, fitR)
Drop in Dispersion Test
F-Statistic p-value
1.0754e+01 2.0811e-04
The R Journal Vol. 4/2, December 2012 ISSN 2073-4859

60 CONTRIBUTED RESEARCH ARTICLES
As the code segment shows, the syntax is similar to
that of the anova function used for reduced model
testing in many of the parametric packages.
0.3 0.4 0.5 0.6 0.7 0.8
0 1 2 3
fitted.values(fitF)
rstudent(fitF)
Figure 2: Studentized residuals versus fitted values
for the free fatty acid data.
−2 −1 0 1 2
−0.2 0.0 0.2 0.4 0.6
Normal Q−Q Plot
Theoretical Quantiles
Sample Quantiles
Figure 3: Normal Q-Q plot of the studentized resid-
uals for the free fatty acid data.
Studentized residuals for rank-based fits are cal-
culated in a way similar to the LS studentized resid-
uals (see Chapter 3 of Hettmansperger and McK-
ean, 2011). We have implemented these residuals
in Rfit and demonstrate their use next. These are
the residuals plotted in the residual and Q-Q plots
in Figure 2 and Figure 3 respectively. The code is
similar to that of least squares analysis. The func-
tion fitted.values returns the fitted values and
residuals returns the residuals from the full model
fit. The function rstudent calculates the studentized
residuals.
Common diagnostic tools are the residual plot
(Figure 2)
> plot(fitted.values(fitF), rstudent(fitF))
> abline(h = c(-2, 2))
and normal probability plot of the studentized resid-
uals (Figure 3).
> qqnorm(residuals(fitF))
As is shown in the plots, there are several outliers
and perhaps the errors are from a right skewed dis-
tribution. We revist this example in a later section.
One-way ANOVA
Suppose we want to determine the effect that a sin-
gle factor A has on a response of interest over a spec-
ified population. Assume that A consists of k levels
or treatments. In a completely randomized design
(CRD), n subjects are randomly selected from the ref-
erence population and n
i
of them are randomly as-
signed to level i, i = 1, . .. k. Let the jth response in
the ith level be denoted by Y
ij
, j = 1, . .., i = 1,. . .,k.
We assume that the responses are independent of one
another and that the distributions among levels dif-
fer by at most shifts in location.
Under these assumptions, the full model can be
written as
Y
ij
= µ
i
+ e
ij
j = 1, . .. , n
i
i = 1, . . . , k , (5)
where the e
ij
s are iid random variables with density
f (x) and the parameter µ
i
is a convenient location
parameter for the ith level, (for example, the mean or
median of the ith level). Generally, the parameters
of interest are the effects (contrasts),
ii
0
= µ
i
0
µ
i
,
i 6= i
0
,1,. . .,k. Upon fitting the model a residual anal-
ysis should be conducted to check these model as-
sumptions.
Observational studies can also be modeled this
way. Suppose k independent samples are drawn
from k different populations. If we assume further
that the distributions for the different populations
differ by at most a shift in locations then Model (5)
is appropriate.
The analysis for this design is usually a test of the
hypothesis that all the effects are 0, followed by indi-
vidual comparisons of levels. The hypothesis can be
written as
H
0
:µ
1
= ··· = µ
k
versus (6)
H
A
:µ
i
6= µ
i
0
for some i 6= i
0
.
Confidence intervals for the simple contrasts
ii
0
are
generally used to handle the comparisons. Rfit offers
The R Journal Vol. 4/2, December 2012 ISSN 2073-4859

CONTRIBUTED RESEARCH ARTICLES 61
a reduction in dispersion test for testing (6) as well as
pairwise p-values adjusted for multiple testing. The
function oneway.rfit is illustrated in the following
example.
Example 3 (LDL cholesterol of quail).
Hettmansperger and McKean (2011, p. 295) dis-
cuss a study that investigated the effect of four
drug compounds on low density lipid (LDL) choles-
terol in quail. The drug compounds are labeled
as I, II, III, and IV. The sample size for each
of the first three levels is 10 while 9 quail re-
ceived compound IV. The boxplots shown in Fig-
ure 4 attest to a difference in the LDL levels.
I II III IV
50 100 150
Drug Compound
LDL Cholesterol
Figure 4: Comparison boxplots for quail data.
Using Wilcoxon scores, we fit the full model. The
summary of the test of hypotheses (6) as computed
by the Rfit function oneway.rfit follows. The result-
ing Q-Q plot (Figure 5) of the studentized residuals
indicates that the random errors e
ij
have a skewed
distribution.
> data(quail)
> oneway.rfit(quail$ldl, quail$treat)
Call:
oneway.rfit(y = quail$ldl, g = quail$treat)
Overall Test of All Locations Equal
Drop in Dispersion Test
F-Statistic p-value
3.916404 0.016403
Pairwise comparisons using Rfit
data: quail$ldl and quail$treat
2 3 4
1 - - -
2 1.00 - -
3 0.68 0.99 -
4 0.72 0.99 0.55
P value adjustment method: none
Robust fits based on scores more appropriate than
the Wilcoxon for skewed errors are discussed later.
Note that the results from a call to oneway.rfit in-
clude the results from the call to rfit.
> anovafit <- oneway.rfit(quail$ldl, quail$treat)
Which may then be used for diagnostic procedures,
such as the Q-Q plot of the studentized residuals in
Figure 5.
> qqnorm(rstudent(anovafit$fit))
−2 −1 0 1 2
0 2 4 6 8
Normal Q−Q Plot
Theoretical Quantiles
Sample Quantiles
Figure 5: Normal Q-Q plot of the studentized resid-
uals for the quail data.
With a p-value of 0.0164, generally, the null hypothe-
sis would be rejected and the inference would pass to
the comparisons of interest. Finally, we note that, the
LS test of the null hypothesis has p-value 0.35; hence,
with the LS analysis H
0
would not be rejected. In
practice, one would not proceed with comparisons
with such a large p-value. Thus, for this data set
the robust and LS analyses have different interpre-
tations.
Multiple comparisons
The second stage of an analysis of a one-way design
usually consists of pairwise comparisons of the treat-
ments. The robust (1 α)100% confidence interval to
compare the ith and i
0
th treatments is given by
b
ii
0
±t
α/2,n1
b
τ
ϕ
s
1
n
i
+
1
n
i
0
. (7)
Often there are many comparisons of interest. For ex-
ample, in the case of all pairwise comparisons there
The R Journal Vol. 4/2, December 2012 ISSN 2073-4859

Citations
More filters
Journal ArticleDOI

Long-term decline of the Amazon carbon sink

Roel J. W. Brienen, +101 more
- 19 Mar 2015 - 
TL;DR: It is confirmed that Amazon forests have acted as a long-term net biomass sink, but the observed decline of the Amazon sink diverges markedly from the recent increase in terrestrial carbon uptake at the global scale, and is contrary to expectations based on models
Journal ArticleDOI

Amazon forest response to repeated droughts

Ted R. Feldpausch, +58 more
TL;DR: In this article, the authors examined the impact of the 2010 Amazon drought on forest dynamics using ground-based observations of mortality and growth from an extensive forest plot network and found that during the 2010 drought interval, forests did not gain biomass (net change: −0.43
Proceedings ArticleDOI

Using bad learners to find good configurations

TL;DR: In this paper, the authors propose a rank-based approach to find the optimal configuration of a software system for a given setting, where exact performance values are not required to rank configurations and to identify the optimal one.
Journal ArticleDOI

Prediction uncertainty of density functional approximations for properties of crystals with cubic symmetry.

TL;DR: An a posteriori calibration method to estimate the prediction uncertainty of DFAs for properties of solids is investigated and a linear model is shown to be adequate to address the systematic trend in the errors.
References
More filters
BookDOI

Modern Applied Statistics with S

TL;DR: A guide to using S environments to perform statistical analyses providing both an introduction to the use of S and a course in modern statistical methods.
Journal ArticleDOI

An Analysis of Transformations

TL;DR: In this article, Lindley et al. make the less restrictive assumption that such a normal, homoscedastic, linear model is appropriate after some suitable transformation has been applied to the y's.
Book

Applied Linear Statistical Models

TL;DR: Applied Linear Statistical Models 5e as discussed by the authors is the leading authoritative text and reference on statistical modeling, which includes brief introductory and review material, and then proceeds through regression and modeling for the first half, and through ANOVA and Experimental Design in the second half.
Book

Robust Regression and Outlier Detection

TL;DR: This paper presents the results of a two-year study of the statistical treatment of outliers in the context of one-Dimensional Location and its applications to discrete-time reinforcement learning.
Related Papers (5)