Rfit: Rank-based Estimation for Linear Models

doi:10.32614/RJ-2012-014

CONTRIBUTED RESEARCH ARTICLES 57

Rﬁt: Rank-based Estimation for Linear

Models

by John D. Kloke and Joseph W. McKean

Abstract In the nineteen seventies, Jureˇcková

and Jaeckel proposed rank estimation for linear

models. Since that time, several authors have

developed inference and diagnostic methods for

these estimators. These rank-based estimators

and their associated inference are highly efﬁcient

and are robust to outliers in response space. The

methods include estimation of standard errors,

tests of general linear hypotheses, conﬁdence

intervals, diagnostic procedures including stu-

dentized residuals, and measures of inﬂuential

cases. We have developed an R package, Rﬁt,

for computing of these robust procedures. In this

paper we highlight the main features of the pack-

age. The package uses standard linear model

syntax and includes many of the main inference

and diagnostic functions.

Introduction

Rank-based estimators were developed as a robust,

nonparametric alternative to traditional likelihood

or least squares estimators. Rank-based regression

was ﬁrst introduced by Jureˇcková (1971) and Jaeckel

(1972). McKean and Hettmansperger (1978) devel-

oped a Newton step algorithm that led to feasible

computation of these rank-based estimates. Since

then a complete rank-based inference for linear mod-

els has been developed that is based on rank-based

estimation analogous to the way that traditional

analysis is based on least squares (LS) estimation; see

Chapters 3-5 of the monograph by Hettmansperger

and McKean (2011) and Chapter 9 of Hollander and

Wolfe (1999). Furthermore, robust diagnostic proce-

dures have been developed with which to ascertain

quality of ﬁt and to locate outliers in the data; see

McKean and Sheather (2009) for a recent discussion.

Kloke et al. (2009) extended this rank-based inference

to mixed models. Thus rank-based analysis is a com-

plete analysis analogous to the traditional LS analy-

sis for general linear models. This rank-based analy-

sis generalizes Wilcoxon procedures for simple loca-

tion models and, further, it inherits the same high ef-

ﬁciency that these simple nonparametric procedures

possess. In addition, weighted versions can have

high breakdown (up to 50%) in factor space (Chang

et al., 1999). In this paper, we discuss the Rﬁt pack-

age that we have developed for rank-based (R) esti-

mation and inference for linear models. We illustrate

its use on examples from simple regression to k-way

factorial designs.

The geometry of rank-based estimation is simi-

lar to that of LS. In rank-based regression, however,

we replace Euclidean distance with another measure

of distance which we refer to as Jaeckel’s dispersion

function; see Hettmansperger and McKean (2011) for

details. For a brief overview see McKean (2004).

Jaeckel’s dispersion function depends on the

choice of a score function. As discussed in

Hettmansperger and McKean (2011), the rank-based

ﬁt and associated analysis can be optimized by a pru-

dent choice of scores. If the form of the error distri-

bution is known and the associated scores are used,

then the the analysis is fully efﬁcient. In Rﬁt we have

included a library of score functions. The default

option is to use Wilcoxon (linear) scores, however it

is straightforward to create user-deﬁned score func-

tions. We discuss score functions further in a later

section.

Others have developed software for rank-based

estimation. Kapenga et al. (1995) developed a For-

tran package and Crimin et al. (2008) developed a

web interface (cgi) with Perl for this Fortran pro-

gram. Terpstra and McKean (2005) developed a

set of R functions to compute weighted Wilcoxon

(WW) estimates including the high breakdown point

rank-based (HBR) estimate proposed by Chang et al.

(1999). See McKean et al. (2009) for a recent review.

Rﬁt differs from the WW estimates in that its estima-

tion algorithms are available for general scores and it

uses a standard linear models interface.

The package Rﬁt allows the user to implement

rank-based estimation and inference described in

Chapters 3-5 of Hettmansperger and McKean (2011)

and Chapter 9 of Hollander and Wolfe (1999). There

are other robust packages in R. For example, the R

function rlm of the R package MASS (Venables and

Ripley, 2002) computes M estimates for linear mod-

els based on the ψ functions of Huber, Hampel, and

Tukey (bisquare). The CRAN Task View on robust

statistical methods offers robust procedures for lin-

ear and nonlinear models including methods based

on M, M-S, and MM estimators. These procedures,

though, do not obtain rank-based estimates and as-

sociated inference as do the procedures in Rﬁt.

Rﬁt uses standard linear model syntax so that

those familiar with traditional parametric analysis

can easily begin running robust analyses. In this pa-

per, discussion of the assumptions are kept to a min-

imum and we refer the interested reader to the liter-

ature. All data sets used in demonstrating Rﬁt are

included in the package.

The rest of the paper is organized as follows.

The next section discusses the general linear model

and rank-based ﬁtting and associated inference. The

The R Journal Vol. 4/2, December 2012 ISSN 2073-4859

58 CONTRIBUTED RESEARCH ARTICLES

following section provides examples illustrating the

computation of Rﬁt for linear regression. Later sec-

tions discuss Rﬁt’s computation of one-way and

multi-way ANOVA as well as general scores and

writing user-deﬁned score functions for computation

in Rﬁt. The ﬁnal section discusses the future imple-

mentation in Rﬁt of rank-based procedures for mod-

els beyond the general linear model.

Rank-regression

As with least squares, the goal of rank-based regres-

sion is to estimate the vector of coefﬁcients, β, of a

general linear model of the form:

y

i

= α + x

T

i

β + e

i

for i = 1, .. . n (1)

where y

i

is the response variable, x

i

is the vector of

explanatory variables, α is the intercept parameter,

and e

i

is the error term. We assume that the errors

are iid with probability density function (pdf) f (t).

For convenience, we write (1) in matrix notation as

follows

y = α1 + Xβ + e (2)

where y = [y

1

,. .., y

n

]

T

is the n × 1 vector of re-

sponses, X = [x

1

,. .., x

n

]

T

is the n × p design matrix,

and e = [e

1

,. .., e

n

]

T

is the n ×1 vector of error terms.

The only assumption on the distribution of the er-

rors is that it is continuous; in that sense the model

is general. Recall that the least squares estimator is

the minimizor of Euclidean distance between y and

ˆy

LS

= X

ˆ

β

LS

. To obtain the R estimator, a different

measure of distance is used that is based on the dis-

persion function of Jaeckel (1972). Jaeckel’s disper-

sion function is given by

D(β) = ky − Xβk

ϕ

(3)

where k· k

ϕ

is a pseudo-norm deﬁned as

kuk

ϕ

=

n

∑

i=1

a(R(u

i

))u

i

,

where R denotes rank, a(t) = ϕ



t

n+1



, and ϕ is a

non-decreasing, square-integrable score function de-

ﬁned on the interval (0,1). Assume without loss of

generality that it is standardized, so that

R

ϕ(u) du =

0 and

R

ϕ

2

(u) du = 1. Score functions are discussed

further in a later section.

The R estimator of β is deﬁned as

ˆ

β

ϕ

= Argminky − X βk

ϕ

. (4)

This estimator is a highly efﬁcient estimator which is

robust in the Y-space. A weighted version can attain

50% breakdown in the X-space at the expense of a

loss in efﬁciency (Chang et al., 1999).

Inference

Under assumptions outlined in the previous sec-

tion, it can be shown that the solution to (4) is con-

sistent and asymptotically normal (Hettmansperger

and McKean, 2011). We summarize this result as fol-

lows:

ˆ

β

ϕ

˙∼N



β, τ

2

ϕ

(X

T

X)

−1



where τ

ϕ

is a scale parameter which depends on f

and the score function ϕ. An estimate of τ

ϕ

is nec-

essary to conduct inference and Rﬁt implements the

consistent estimator proposed by Koul et al. (1987).

Denote this estimator by

ˆ

τ

ϕ

. Then Wald tests and

conﬁdence regions/intervals can be calculated. Let

se(

ˆ

β

j

) =

ˆ

τ

ϕ



X

T

X



−1

jj

where



X

T

X



−1

jj

is the jth di-

agonal element of



X

T

X



−1

. Then an approximate

(1 −α) ∗100% conﬁdence interval for β

j

is

ˆ

β

j

±t

1−α/2,n−p−1

se(

ˆ

β

j

).

A Wald test of the general linear hypothesis

H

0

: Mβ = 0 versus H

A

: Mβ 6= 0

is to reject H

0

if

(M

ˆ

β

ϕ

)

T

[M(X

T

X)

−1

)M

T

]

−1

(M

ˆ

β)/q

ˆ

τ

2

ϕ

> F

1−α,q,n−p−1

where q = dim(M). Similar to the reduced model

test of classical regression rank-based regression of-

fers a drop in dispersion test which is implemented in

the R function drop.test. For the above hypothe-

ses let

ˆ

θ

ϕ

be the rank-based coefﬁcient estimate of

the reduced model [Model (1) constrained by H

0

].

As discussed in Theorem 3.7.2 of Hettmansperger

and McKean (2011), the reduced model design ma-

trix is easily obtained using a QR-decomposition on

M

T

. We have implemented this methodology in Rﬁt.

Similar to the LS reduction in sums of squares, the

rank-based test is based on a reduction of dispersion

from the reduced to the full model. Let D(

ˆ

θ

ϕ

) and

D(

ˆ

β

ϕ

) denote the reduced and full model minimum

dispersions, then the test is to reject H

0

if

[D(

ˆ

θ

ϕ

) − D(

ˆ

β

ϕ

)]/q

ˆ

τ

ϕ

/2

> F

1−α,q,n−p−1

Computation

It can be shown that Jaeckel’s dispersion function (3)

is a convex function of β (Hettmansperger and McK-

ean, 2011). Rﬁt uses optim with option `BFGS' to

minimize the dispersion function. We investigated

other minimization methods (e.g. Nelder-Mead or

CG), however the quasi-Newton method works well

in terms of speed and convergence. An orthonormal

basis matrix, for the space spanned by the columns

The R Journal Vol. 4/2, December 2012 ISSN 2073-4859

CONTRIBUTED RESEARCH ARTICLES 59

of X, is ﬁrst calculated using qr which leads to bet-

ter performance in examples. The default initial ﬁt is

based on an L

1

ﬁt using quantreg (Koenker, 2011).

Computations by Rﬁt of rank-based estimation

and associated inference are illustrated in the exam-

ples of the next section.

Rﬁt computations of rank-based ﬁt-

ting of linear models

For the general linear model (1) the package Rﬁt ob-

tains the rank-based estimates and inference, as de-

scribed in the previous section. In this section we il-

lustrate this computation for two examples. The ﬁrst

is for a simple linear model while the second is for a

multiple regression model.

Example 1 (Telephone data). We begin with a sim-

ple linear regression example using the telephone

data discussed in Rousseuw and Leroy (1987) These

data represent the number of telephone calls (in tens

of millions) placed in Belgium over the years 1950–

1973. The data are plotted in Figure 1. There are sev-

eral noticeable outliers which are due to a mistake

in the recording units for the years 1964–1969. This

is a simple dataset, containing only one explanatory

variable, however it allows us to easily highlight the

package and also demonstrate the robustness to out-

liers of the procedure. The main function of the pack-

age Rﬁt is rfit which, as the following code segment

illustrates, uses syntax similar to lm.

> library(Rfit)

> data(telephone)

> fit <- rfit(calls ~ year, data = telephone)

> summary(fit)

Call:

rfit(formula = calls ~ year, data = telephone)

Coefficients:

Estimate Std. Error t.value p.value

-284.313842 152.687751 -1.8621 0.07665 .

year 0.145861 0.077842 1.8738 0.07494 .

---

Signif. codes:

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Multiple R-squared (Robust): 0.3543158

Reduction in Dispersion Test:

12.07238 p-value: 0.00215

> plot(telephone)

> abline(fit)

> abline(lm(calls ~ year, data = telephone),

+ col = 2, lty = 2)

> legend("topleft", legend = c("R", "LS"),

+ col = 1:2, lty = 1:2)

●

● ●

●

1950 1955 1960 1965 1970

0 5 10 15 20

year

calls

R

LS

Figure 1: Scatter plot of the telephone data with over-

laid regression lines.

Further, the output is similar to that of lm and

can be interpreted in the same way. The estimate of

slope is 0.146 (tens of millions of calls per year) with

a standard error of 0.078. The t-statistic is the ratio

of the two and the p-value is calculated using a t-

distribution with n − 2 degrees of freedom. Hence

one could conclude that year is a marginally signiﬁ-

cant predictor of the number of telephone calls.

The overlaid ﬁtted regression lines in the scatter

plot in Figure 1 demonstrate the robustness of the

Wilcoxon ﬁt and the lack of robustness of the least

squares ﬁt.

Example 2 (Free fatty acid data). This is a data set

from Morrison (1983, p. 64) (c.f. Example 3.9.4 of

Hettmansperger and McKean (2011)). The response

variable is level of free fatty acid in a sample of pre-

pubescent boys. The explanatory variables are age

(in months), weight (in lbs), and skin fold thickness.

For this discussion, we chose the Wilcoxon (default)

scores for Rﬁt. Based on the residual and Q-Q plots

below, however, the underlying error distribution

appears to be right-skewed. In a later section we

analyze this data set using more appropriate (bent)

scores for a right-skewed distribution.

To begin with we demonstrate the reduction in

dispersion test discussed in the previous section.

> fitF <- rfit(ffa ~ age + weight + skin,

+ data = ffa)

> fitR <- rfit(ffa ~ skin, data = ffa)

> drop.test(fitF, fitR)

Drop in Dispersion Test

F-Statistic p-value

1.0754e+01 2.0811e-04

The R Journal Vol. 4/2, December 2012 ISSN 2073-4859

60 CONTRIBUTED RESEARCH ARTICLES

As the code segment shows, the syntax is similar to

that of the anova function used for reduced model

testing in many of the parametric packages.

●

0.3 0.4 0.5 0.6 0.7 0.8

0 1 2 3

fitted.values(fitF)

rstudent(fitF)

Figure 2: Studentized residuals versus ﬁtted values

for the free fatty acid data.

●

−2 −1 0 1 2

−0.2 0.0 0.2 0.4 0.6

Normal Q−Q Plot

Theoretical Quantiles

Sample Quantiles

Figure 3: Normal Q-Q plot of the studentized resid-

uals for the free fatty acid data.

Studentized residuals for rank-based ﬁts are cal-

culated in a way similar to the LS studentized resid-

uals (see Chapter 3 of Hettmansperger and McK-

ean, 2011). We have implemented these residuals

in Rﬁt and demonstrate their use next. These are

the residuals plotted in the residual and Q-Q plots

in Figure 2 and Figure 3 respectively. The code is

similar to that of least squares analysis. The func-

tion fitted.values returns the ﬁtted values and

residuals returns the residuals from the full model

ﬁt. The function rstudent calculates the studentized

residuals.

Common diagnostic tools are the residual plot

(Figure 2)

> plot(fitted.values(fitF), rstudent(fitF))

> abline(h = c(-2, 2))

and normal probability plot of the studentized resid-

uals (Figure 3).

> qqnorm(residuals(fitF))

As is shown in the plots, there are several outliers

and perhaps the errors are from a right skewed dis-

tribution. We revist this example in a later section.

One-way ANOVA

Suppose we want to determine the effect that a sin-

gle factor A has on a response of interest over a spec-

iﬁed population. Assume that A consists of k levels

or treatments. In a completely randomized design

(CRD), n subjects are randomly selected from the ref-

erence population and n

i

of them are randomly as-

signed to level i, i = 1, . .. k. Let the jth response in

the ith level be denoted by Y

ij

, j = 1, . .., i = 1,. . .,k.

We assume that the responses are independent of one

another and that the distributions among levels dif-

fer by at most shifts in location.

Under these assumptions, the full model can be

written as

Y

ij

= µ

i

+ e

ij

j = 1, . .. , n

i

i = 1, . . . , k , (5)

where the e

ij

s are iid random variables with density

f (x) and the parameter µ

i

is a convenient location

parameter for the ith level, (for example, the mean or

median of the ith level). Generally, the parameters

of interest are the effects (contrasts), ∆

ii

0

= µ

i

0

− µ

i

,

i 6= i

0

,1,. . .,k. Upon ﬁtting the model a residual anal-

ysis should be conducted to check these model as-

sumptions.

Observational studies can also be modeled this

way. Suppose k independent samples are drawn

from k different populations. If we assume further

that the distributions for the different populations

differ by at most a shift in locations then Model (5)

is appropriate.

The analysis for this design is usually a test of the

hypothesis that all the effects are 0, followed by indi-

vidual comparisons of levels. The hypothesis can be

written as

H

0

:µ

1

= ··· = µ

k

versus (6)

H

A

:µ

i

6= µ

i

0

for some i 6= i

0

.

Conﬁdence intervals for the simple contrasts ∆

ii

0

are

generally used to handle the comparisons. Rﬁt offers

The R Journal Vol. 4/2, December 2012 ISSN 2073-4859

CONTRIBUTED RESEARCH ARTICLES 61

a reduction in dispersion test for testing (6) as well as

pairwise p-values adjusted for multiple testing. The

function oneway.rfit is illustrated in the following

example.

Example 3 (LDL cholesterol of quail).

Hettmansperger and McKean (2011, p. 295) dis-

cuss a study that investigated the effect of four

drug compounds on low density lipid (LDL) choles-

terol in quail. The drug compounds are labeled

as I, II, III, and IV. The sample size for each

of the ﬁrst three levels is 10 while 9 quail re-

ceived compound IV. The boxplots shown in Fig-

ure 4 attest to a difference in the LDL levels.

●

I II III IV

50 100 150

Drug Compound

LDL Cholesterol

Figure 4: Comparison boxplots for quail data.

Using Wilcoxon scores, we ﬁt the full model. The

summary of the test of hypotheses (6) as computed

by the Rﬁt function oneway.rfit follows. The result-

ing Q-Q plot (Figure 5) of the studentized residuals

indicates that the random errors e

ij

have a skewed

distribution.

> data(quail)

> oneway.rfit(quail$ldl, quail$treat)

Call:

oneway.rfit(y = quail$ldl, g = quail$treat)

Overall Test of All Locations Equal

Drop in Dispersion Test

F-Statistic p-value

3.916404 0.016403

Pairwise comparisons using Rfit

data: quail$ldl and quail$treat

2 3 4

1 - - -

2 1.00 - -

3 0.68 0.99 -

4 0.72 0.99 0.55

P value adjustment method: none

Robust ﬁts based on scores more appropriate than

the Wilcoxon for skewed errors are discussed later.

Note that the results from a call to oneway.rfit in-

clude the results from the call to rfit.

> anovafit <- oneway.rfit(quail$ldl, quail$treat)

Which may then be used for diagnostic procedures,

such as the Q-Q plot of the studentized residuals in

Figure 5.

> qqnorm(rstudent(anovafit$fit))

●

● ●

●

−2 −1 0 1 2

0 2 4 6 8

Normal Q−Q Plot

Theoretical Quantiles

Sample Quantiles

Figure 5: Normal Q-Q plot of the studentized resid-

uals for the quail data.

With a p-value of 0.0164, generally, the null hypothe-

sis would be rejected and the inference would pass to

the comparisons of interest. Finally, we note that, the

LS test of the null hypothesis has p-value 0.35; hence,

with the LS analysis H

0

would not be rejected. In

practice, one would not proceed with comparisons

with such a large p-value. Thus, for this data set

the robust and LS analyses have different interpre-

tations.

Multiple comparisons

The second stage of an analysis of a one-way design

usually consists of pairwise comparisons of the treat-

ments. The robust (1 −α)100% conﬁdence interval to

compare the ith and i

0

th treatments is given by

b

∆

ii

0

±t

α/2,n−1

b

τ

ϕ

s

1

n

i

+

1

n

i

0

. (7)

Often there are many comparisons of interest. For ex-

ample, in the case of all pairwise comparisons there

The R Journal Vol. 4/2, December 2012 ISSN 2073-4859

Rfit: Rank-based Estimation for Linear Models

Citations

Long-term decline of the Amazon carbon sink

Host and Environmental Factors Influencing Individual Human Cytokine Responses

Amazon forest response to repeated droughts

Using bad learners to find good configurations

Prediction uncertainty of density functional approximations for properties of crystals with cubic symmetry.

References

Modern Applied Statistics with S

An Analysis of Transformations

Applied Linear Statistical Models

Applied Linear Statistical Models

Robust Regression and Outlier Detection

Related Papers (5)

R: A language and environment for statistical computing.

Fitting Linear Mixed-Effects Models Using lme4

Controlling the false discovery rate: a practical and powerful approach to multiple testing

DADA2: High-resolution sample inference from Illumina amplicon data

Multiple Comparisons Using Rank Sums