A Comparison of Parametric and Nonparametric Approaches to ROC Analysis of Quantitative Diagnostic Tests
KARIM O. HAJIAN-TILAKI, PhD, JAMES A. HANLEY, PhD, LAWRENCE JOSEPH, PhD, JEAN-PAUL COLLET, PhD
Receiver operating characteristic (ROC) analysis, which yields indices of accuracy such as the area under the curve (AUC), is increasingly being used to evaluate the performances of diagnostic tests that produce results on continuous scales. Both parametric and nonparametric ROC approaches are available to assess the discriminant capacity of such tests, but there are no clear guidelines as to the merits of each, particularly with non-binormal data. Investigators may worry that when data are non-Gaussian, estimates of diagnostic accuracy based on a binormal model may be distorted. The authors conducted a Monte Carlo simulation study to compare the bias and sampling variability in the estimates of the AUCs derived from parametric and nonparametric procedures. Each approach was assessed in data sets generated from various configurations of pairs of overlapping distributions; these included the binormal model and non-binormal pairs of distributions where one or both pair members were mixtures of Gaussian (MG) distributions with different degrees of departure from binormality. The biases in the estimates of the AUCs were found to be very small for both parametric and nonparametric procedures. The two approaches yielded very close estimates of the AUCs and of the corresponding sampling variability even when data were generated from non-binormal models. Thus, for a wide range of distributions, concern about bias or imprecision of the estimates of the AUC should not be a major factor in choosing between the nonparametric and parametric approaches.
Key words: ROC analysis; quantitative diagnostic test; comparison, parametric; binormal model; LABROC; nonparametric procedure; area under the curve (AUC). (Med Decis Making 1997;17:94-102)
During the past ten years, receiver operator characteristic (ROC) analysis has become a popular method for evaluating the accuracy/performance of medical diagnostic tests.1-3 The most attractive property of ROC analysis is that the accuracy indices derived from this technique are not distorted by fluctuations caused by the use of an arbitrarily chosen decision “criterion” or “cutoff.”4-8 One index available from an ROC analysis, the area under the curve (AUC), measures the ability of a diagnostic test to discriminate between two patient states, often labelled “diseased” and “non-diseased.” The AUC has been of considerable interest as a summary measure of accuracy because of its meaningful interpretation.
Received February 17, 1995, from the Department of Epidemiology and Biostatistics, McGill University (KOH-T, JAH, LJ, J-PC); the Division of Clinical Epidemiology, Royal Victoria Hospital (JAH); the Division of Clinical Epidemiology, Montreal General Hospital (LJ); and the Division of Clinical Epidemiology, Jewish General Hospital (PC); all in Montreal, Quebec, Canada. Revision accepted for publication July 17, 1995. Supported by an operating grant from the Natural Sciences and Engineering Research Council of Canada and the Fonds de la recherche en Sante du Quebec.
Address correspondence and reprint requests to Dr. Hanley: Department of Epidemiology and Biostatistics, McGill University, 1020 Pine Avenue West, Montreal, PQ Canada H3A 1A2. e-mail: (Jimh@epid.lan.mcgill.ca).
Initially, ROC methods were confined to tests interpreted on rating scales and analysis was typically carried out using the binormal model.9,10 However, they are now becoming increasingly popular for evaluating the performances of quantitative diagnostic tests with numerical results recorded directly on continuous scales.1,2,3,11 Both parametric and nonparametric procedures can be used to derive an AUC index of accuracy for such diagnostic tests. However, Goddard and Hinberg12 warned that if the distribution of raw data from a quantitative test is far from Gaussian, the AUC [and corresponding standard error (SE)] derived from a directly fitted binormal model can be seriously distorted. This occurs because one fits a mean and standard deviation to the raw data for the diseased and non-diseased patients separately. One way to avoid the possible distortion is to use Metz’s adaptation of the binormal model, previously used with rating data,9,13-15 with laboratory-type data. Metz et al. implemented the binormal model in the LABROC software.16 The procedure first discretizes the continuous data and then uses the categories as ratings in the ROCFIT procedure16 to obtain the maximum-likelihood estimates (MLE) of the two relevant parameters of the binormal model. From them it calculates the AUC and the corresponding SE.
VOL 17/NO 1, JAN-MAR 1997 Parametric and Nonparametric ROC Analysis • 95
When the data are continuous and appear to be non-Gaussian, many users will find the nonparametric approach17-19 to estimating the AUC more appealing than using the binormal model, since they may worry that estimates of diagnostic accuracy based on a binormal model will be distorted. However, nonparametric area estimates will tend to underestimate the AUCs for rating data,7,20 in particular when ROC operating points are not well spread out along the ROC curve. Moreover, this method does not yield a smooth estimate of the entire ROC curve.21 The variance of the nonparametric estimates of the AUC can be estimated entirely nonparametrically18,19 or using an exponential approximation. Recently, Obuchowski22 found that the exponential approximation underestimates the empirical standard error of the nonparametric AUC for rating data when the “ratings” begin as continuous data with a binormal distribution and when the ratio of the standard deviations (SDs) of the two distributions is greater than 2. However, in practice the data might arise from a non-binormal model.
In summary, the statistical behaviors of the AUC estimates derived from the parametric and nonparametric approaches have not been investigated for quantitative diagnostic test results, and there are no general guidelines for choosing one approach over the other. Thus, we conducted a broad numerical investigation to compare the statistical behaviors of the estimates of the AUC derived from parametric and nonparametric procedures.
Methods
DATA GENERATION
As is shown in the leftmost columns of tables 1 and 2, we generated continuously distributed data with sample sizes of n = 40 for diseased and n = 40 for non-diseased from various pairs of overlapping distributions with various degrees of separation; sample sizes of n = 100/100 were also investigated. Overall, 1,000 data sets were generated for each configuration studied.
Binormal data. First, we generated continuously distributed data from two overlapping Gaussian distributions, i.e., {G, G} pairs for the “non-diseased” and “diseased” patients with different degrees of separation (AUC = 0.60, AUC = 0.75, AUC = 0.90) and with various ratios of SDs of distributions for the non-diseased to diseased (1:1, 1:1.4, 1:2 and 1:3), yielding in all 12 configurations of pairs.
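A minimal sketch of how one such binormal configuration could be generated. It assumes the non-diseased distribution is standard normal and uses the standard binormal identity AUC = Φ(Δ/√(σ_ND² + σ_D²)) to choose the mean separation Δ. The function names, the seed, and the standard-normal reference are illustrative assumptions; the text does not state which means and SDs were actually used.

```python
import math
import random
from statistics import NormalDist

def binormal_shift(auc, sd_ratio):
    """Mean separation that gives the target AUC when the non-diseased SD is 1
    and the diseased SD is sd_ratio (assumed parameterization)."""
    # AUC = Phi(delta / sqrt(1 + sd_ratio^2))  =>  delta = Phi^-1(AUC) * sqrt(1 + sd_ratio^2)
    return NormalDist().inv_cdf(auc) * math.sqrt(1.0 + sd_ratio ** 2)

def generate_binormal(auc, sd_ratio, n_nd=40, n_d=40, seed=None):
    """Return (non_diseased, diseased) samples from a {G, G} pair."""
    rng = random.Random(seed)
    delta = binormal_shift(auc, sd_ratio)
    non_diseased = [rng.gauss(0.0, 1.0) for _ in range(n_nd)]
    diseased = [rng.gauss(delta, sd_ratio) for _ in range(n_d)]
    return non_diseased, diseased

# One data set for the moderate-accuracy, 1:2 SD-ratio configuration.
nd, d = generate_binormal(auc=0.75, sd_ratio=2.0, seed=1997)
```

Repeating the call with 1,000 different seeds would give the replicate data sets for one configuration.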
Non-binormal data. Data were also generated from various configurations of non-binormal pairs, where one or both members of the pairs were mixtures of Gaussian (MG) distributions: {G, MG-skewed or bimodal} pairs or {MG-skewed, MG-skewed} pairs. In all, as is shown in figures 1 and 2, 18 configurations of non-binormal pairs with various degrees of skewness and separation were used to generate data. We calculated how often the hypothesis of normality would be rejected with such distributions. For sample sizes of 40, the hypothesis was rejected by the Wilks’ test employed by SAS in 34% of the data sets from the moderate-skew distributions and 67% of those from high-skew distributions. For sample sizes of 100, the corresponding percentages were 59 and 97.
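As a rough illustration of how a mixture of two Gaussians produces skew, the sketch below samples from a two-component mixture and reports the sample skewness. The mixing weight, means, and SDs are hypothetical values chosen only for illustration; the text does not list the mixture parameters used in the study, and no normality test is run here.

```python
import random

def sample_mixture(n, weight, mean1, sd1, mean2, sd2, rng):
    """Draw n values: component 1 with probability `weight`, otherwise component 2."""
    return [rng.gauss(mean1, sd1) if rng.random() < weight else rng.gauss(mean2, sd2)
            for _ in range(n)]

def sample_skewness(x):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    return m3 / m2 ** 1.5

rng = random.Random(0)
# Hypothetical right-skewed "diseased" distribution: a main cluster plus a
# smaller, more variable cluster at higher values.
diseased = sample_mixture(5000, 0.7, 1.0, 1.0, 3.5, 2.0, rng)
print(round(sample_skewness(diseased), 2))
```

Moving the second component further from the first, or widening it, increases the skew; swapping the sign of the shift gives left skew.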
To some, the range of distributions shown in figures 1 and 2 may seem limited. However, one can apply many monotonic transformations to the separator axis, thereby effectively covering a broader range of possibilities of non-binormal data. An example of how both distributions are converted to non-normal pairs is shown in the last row of figure 1. Each pair was generated by mapping the (-∞, +∞) scale used in row 2 into the (0, 1) scale by applying the transformation exp(X)/[1 + exp(X)]. Notice, however, that although such monotonic transformations may radically change the shapes of the distributions, they do not change the ROC curve when applied to both distributions.2,5,13
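The invariance can be checked numerically. The sketch below computes the rank-based (Mann-Whitney) AUC before and after applying the logistic map exp(X)/[1 + exp(X)] to both samples; the sample parameters and seed are arbitrary illustrative choices.

```python
import math
import random

def mann_whitney_auc(non_diseased, diseased):
    """Proportion of (non-diseased, diseased) pairs ranked correctly; ties count 1/2."""
    total = 0.0
    for x in non_diseased:
        for y in diseased:
            total += 1.0 if y > x else 0.5 if y == x else 0.0
    return total / (len(non_diseased) * len(diseased))

def logistic(x):
    # Algebraically equal to exp(x) / (1 + exp(x)); maps the real line into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

rng = random.Random(42)
nd = [rng.gauss(0.0, 1.0) for _ in range(40)]
d = [rng.gauss(1.0, 2.0) for _ in range(40)]

auc_raw = mann_whitney_auc(nd, d)
auc_mapped = mann_whitney_auc([logistic(v) for v in nd], [logistic(v) for v in d])
print(auc_raw == auc_mapped)
```

Because the estimate depends only on the ordering of the pooled values, any strictly increasing transformation applied to both groups leaves it unchanged.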
Few of the articles in the quantitative diagnostic test literature show the distributions of raw data. In those that do, the distributions of biomarkers for diseased patients are often positively skewed. For example, Goddard and Hinberg12 reported the histograms of different biomarkers for five types of cancer; the distributions for cancer patients were skewed or bimodal. Linnet11 also showed examples where the distributions of serum bilirubin and fasting serum bile acids for diseased patients were positively skewed while the reference distribution was approximately normal. Empirically, there is considerable evidence that the binormal model used for rating data needs to include more variation for diseased patients, i.e., the ratio of SDs of distribution for diseased to non-diseased patients is higher than 1. Based on this empirical evidence, we included the range of ratios of SDs from 1:1 to 1:3. We chose mixtures of Gaussian distributions for diseased patients since the distribution may contain unidentified disease subtypes. Thus, we allow for more variation for the diseased than the non-diseased patients.
96 • Hajian-Tilaki, Hanley, Joseph, Collet MEDICAL DECISION MAKING
STATISTICAL ANALYSIS
Each generated data set underwent two analyses: 1) nonparametric ROC analysis using the raw data, and 2) parametric ROC analysis of the categorized data via the LABROC approach.
Nonparametric approach. The nonparametric estimate of the AUC was calculated directly from the raw data using the Wilcoxon-Mann-Whitney two-sample statistic; the SE of the AUC was calculated by DeLong et al.’s method.
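The nonparametric calculation can be sketched as follows. This is an illustrative implementation of the Mann-Whitney AUC together with the structural-components variance estimator commonly attributed to DeLong et al.; it is written from the standard description of that estimator and is not the authors' own program. Function and variable names are assumptions.

```python
import math

def psi(x_diseased, y_non_diseased):
    """Pair kernel: 1 if the diseased value is larger, 1/2 if tied, 0 otherwise."""
    if x_diseased > y_non_diseased:
        return 1.0
    if x_diseased == y_non_diseased:
        return 0.5
    return 0.0

def auc_delong(non_diseased, diseased):
    """Return (AUC estimate, estimated SE) from raw continuous test results."""
    m, n = len(diseased), len(non_diseased)
    # Structural components: one placement value per diseased and per non-diseased subject.
    v10 = [sum(psi(x, y) for y in non_diseased) / n for x in diseased]
    v01 = [sum(psi(x, y) for x in diseased) / m for y in non_diseased]
    auc = sum(v10) / m
    s10 = sum((v - auc) ** 2 for v in v10) / (m - 1)
    s01 = sum((v - auc) ** 2 for v in v01) / (n - 1)
    return auc, math.sqrt(s10 / m + s01 / n)

# Tiny hypothetical data set, for illustration only.
auc, se = auc_delong([1.0, 2.0, 3.0, 4.0], [2.5, 3.5, 4.5, 5.5])
```

The double loop is quadratic in the sample size, which is unproblematic for the 40/40 and 100/100 designs considered here.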
Parametric ROC analysis. Each data set was analyzed via Metz’s LABROC procedure.16 The program categorizes the data according to a data-dependent rule that tries to ensure the greatest possible uniformity of spread of ROC operating points. We stipulated a maximum of ten data categories for sample sizes of 40/40 and 20 data categories for 100/100 (these are the default numbers of data categories used in the LABROC software). The program then fits a two-parameter binormal ROC curve by the method of maximum-likelihood estimation (MLE) using the categories as ratings. From the two parameters of this binormal model, it calculates an estimate of the AUC and its SE, which we call the “calculated” SE.
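For reference, the link between the two binormal parameters and the AUC can be written out as follows. The symbols a and b follow the conventional binormal parameterization and are assumptions here, since the text does not name them. The delta-method expression for the variance is a standard approximation and is not claimed to be the exact computation that LABROC performs.

```latex
% Binormal ROC curve with intercept a and slope b (conventional notation, assumed):
%   a = (\mu_D - \mu_{ND}) / \sigma_D, \qquad b = \sigma_{ND} / \sigma_D
\mathrm{TPF} = \Phi\!\left(a + b\,\Phi^{-1}(\mathrm{FPF})\right),
\qquad
\mathrm{AUC} = \Phi\!\left(\frac{a}{\sqrt{1 + b^{2}}}\right)

% Delta-method approximation to the variance of the estimated AUC,
% with z = a / \sqrt{1 + b^{2}} and \varphi the standard normal density:
\widehat{\mathrm{Var}}\bigl(\widehat{\mathrm{AUC}}\bigr) \approx
  \left(\frac{\partial A}{\partial a}\right)^{2}\widehat{\mathrm{Var}}(\hat a)
+ \left(\frac{\partial A}{\partial b}\right)^{2}\widehat{\mathrm{Var}}(\hat b)
+ 2\,\frac{\partial A}{\partial a}\,\frac{\partial A}{\partial b}\,
  \widehat{\mathrm{Cov}}(\hat a, \hat b),
\qquad
\frac{\partial A}{\partial a} = \frac{\varphi(z)}{\sqrt{1 + b^{2}}},
\quad
\frac{\partial A}{\partial b} = -\frac{a\,b\,\varphi(z)}{(1 + b^{2})^{3/2}}
```

With equal SDs (b = 1) the AUC reduces to Φ(a/√2), so a separation of one SD corresponds to an AUC of about 0.76.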
FIGURE 1. Non-binormal distributions used to generate data sets, with the distribution for non-diseased (broken lines) chosen to be Gaussian. The distributions for diseased (solid lines) were formed from mixtures of two Gaussian distributions to create moderate right skew (top row), very right skew (second row), a bimodal distribution (third row), and moderate left skew (fourth row). Each pair of the last row was generated by mapping the (-∞, +∞) scale used in row 2 into the (0, 1) scale by applying the transformation exp(X)/[1 + exp(X)]. The distributions in this last row were not used in the simulations because they would give the same results as the distributions in the second row. The degrees of separation were low AUC = 0.60 (leftmost column), moderate AUC = 0.75 (middle column), and high AUC = 0.90 (rightmost column).
IGURE
2.
Additional non-binormal distributions used to gen-
erate data sets with non-Gaussian distributions for both
non-
diseased and diseased. Distributions for both non-diseased
(bro-
ken lines) and diseased (solid lines) were formed from mixtures
of two Gaussian distributions to create moderate left skew
(first
row) and moderate right skew (second row). The degrees of sep-
aration were: low AUC = 0.60 (leftmost column), moderate AUC
= 0.75 (middle column), and high AUC = 0.90
(right
most col-
umn).
Table 1a. Comparison of Parametric and Nonparametric Approaches with Respect to Bias of the Estimates of AUC and the Corresponding Standard Errors in 1,000 Data Sets Generated from Various Configurations of the Binormal Model, N = 40/40
Parametric*: (A) 100 x Bias of Est AUC; (B) 100 x SE (Est AUC), Empir; (C) 100 x SE (Est AUC), Ave Est. Nonparametric: (D) 100 x Bias of Est AUC; (E) 100 x SE (Est AUC), Empir; (F) 100 x SE (Est AUC), Ave Est DeLong. Ratio of SEs: (C)/(B), (F)/(E), (E)/(B), (F)/(C).
Degree of Accuracy      Ratio of     (A)     (B)     (C)     (D)     (E)     (F) (C)/(B) (F)/(E) (E)/(B) (F)/(C)
(True Index)            SDs D:ND†
Low, AUC = 0.60         1            0.5    6.69    6.22    -0.1    6.42    6.38    0.93    0.99    0.98    1.02
                        1.4         -0.2    6.95    6.30    -0.1    6.47    6.42    0.91    0.99    0.93    1.02
                        2           -0.3    6.86    6.51    -0.1    6.59    6.59    0.96    1.00    0.96    1.01
                        3           -0.3    6.71    6.77    -0.2    6.82    6.85    1.01    1.00    1.02    1.01
Moderate, AUC = 0.75    1            0.9    5.68    5.26     0.1    5.53    5.42    0.93    0.98    0.97    1.03
                        1.4          0.8    5.89    5.33    -0.0    5.60    5.49    0.90    0.98    0.95    1.03
                        2            0.3    6.11    5.58     0.0    5.78    5.67    0.91    0.98    0.95    1.02
                        3           -0.1    6.15    5.89     0.1    6.03    5.93    0.96    0.98    0.98    1.01
High, AUC = 0.90        1            0.7    3.35    3.24     0.0    3.51    3.34    0.97    0.95    1.05    1.03
                        1.4          0.6    3.46    3.31    -0.1    3.62    3.42    0.96    0.94    1.05    1.03
                        2            0.6    3.75    3.46     0.0    3.83    3.59    0.93    0.94    1.02    1.03
                        3            0.4    4.19    3.76     0.1    4.10    3.83    0.90    0.93    0.96    1.02
*Ten data categories were used in fitting the binormal model.
†D = diseased; ND = non-diseased; Est = estimate; Ave = average; Empir = empirical; SE = standard error; SD = standard deviation.
Table 1b. Comparison of Parametric and Nonparametric Approaches with Respect to Bias of the Estimates of AUC and the Corresponding Standard Errors in 1,000 Data Sets Generated from Various Configurations of the Binormal Model, N = 100/100
Parametric*: (A) 100 x Bias of Est AUC; (B) 100 x SE (Est AUC), Empir; (C) 100 x SE (Est AUC), Ave Est. Nonparametric: (D) 100 x Bias of Est AUC; (E) 100 x SE (Est AUC), Empir; (F) 100 x SE (Est AUC), Ave Est DeLong. Ratio of SEs: (C)/(B), (F)/(E), (E)/(B), (F)/(C).
Degree of Accuracy      Ratio of     (A)     (B)     (C)     (D)     (E)     (F) (C)/(B) (F)/(E) (E)/(B) (F)/(C)
(True Index)            SDs D:ND†
Low, AUC = 0.60         1            0.2    3.91    3.91    -0.1    3.89    4.00    1.00    1.03    0.99    1.02
                        1.4          0.0    3.97    3.96    -0.1    3.94    4.04    1.00    1.03    0.99    1.02
                        2           -0.1    3.97    4.07    -0.1    4.05    4.15    1.03    1.02    1.02    1.02
                        3           -0.2    4.08    4.21    -0.1    4.22    4.32    1.04    1.02    1.04    1.03
Moderate, AUC = 0.75    1            0.5    3.29    3.34     0.0    3.31    3.41    1.02    1.03    1.01    1.02
                        1.4          0.3    3.39    3.39    -0.1    3.37    3.48    1.00    1.03    0.99    1.02
                        2            0.0    3.50    3.52     0.0    3.50    3.58    1.01    1.02    1.00    1.02
                        3           -0.2    3.60    3.68     0.0    3.71    3.75    1.02    1.01    1.03    1.02
High, AUC = 0.90        1            0.2    1.99    2.10    -0.1    2.07    2.13    1.06    1.03    1.04    1.01
                        1.4          0.1    2.03    2.16    -0.1    2.15    2.19    1.06    1.02    1.08    1.01
                        2            0.0    2.23    2.28    -0.1    2.28    2.31    1.02    1.01    1.02    1.01
                        3           -0.1    2.50    2.44    -0.0    2.47    2.48    0.98    1.00    0.99    1.02
*20 data categories were used in fitting the binormal model.
†D = diseased; ND = non-diseased; Est = estimate; Ave = average; Empir = empirical; SE = standard error; SD = standard deviation.
COMPARISON OF STATISTICAL BEHAVIORS OF PARAMETRIC AND NONPARAMETRIC ESTIMATES
The biases in the estimates of the AUC (i.e., the difference between the average of the 1,000 estimates of the AUC and the true value) from the parametric and nonparametric approaches were calculated and compared. The magnitude of the bias in the estimates of the AUC and the absolute discrepancy between individual estimates of the AUC from the two approaches were used to judge the impact of model mis-specification.
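The bias and discrepancy summaries can be sketched as below. The helper names and the three replicate estimates are hypothetical and serve only to show the arithmetic; averaging the absolute discrepancies is one possible summary, assumed here, since the text does not say how the individual discrepancies were aggregated.

```python
from statistics import mean

def bias_x100(estimates, true_auc):
    """100 x (average of the replicate AUC estimates minus the true AUC)."""
    return 100.0 * (mean(estimates) - true_auc)

def mean_abs_discrepancy(parametric, nonparametric):
    """Average absolute difference between paired estimates from the same data sets."""
    return mean([abs(p - q) for p, q in zip(parametric, nonparametric)])

# Hypothetical estimates from three replicate data sets (illustration only).
parametric = [0.74, 0.76, 0.77]
nonparametric = [0.73, 0.76, 0.78]
print(bias_x100(parametric, 0.75), mean_abs_discrepancy(parametric, nonparametric))
```

In the study itself each list would hold the 1,000 estimates obtained for one configuration.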
The SD of the 1,000 estimates of the AUC derived from one approach was compared with the corresponding SD of the 1,000 AUC estimates from the other approach. For each approach, this SD (which we call the empirical SE) was also compared with the average of the 1,000 calculated SEs. In addition,
we compared the average calculated SE of the AUC
derived from the binormal model with that
calcu-
Table 2a. Comparison of Parametric and Nonparametric Approaches with Respect to Bias of the Estimates of AUC and the Corresponding Standard Errors in 1,000 Data Sets Generated from Various Configurations of Non-binormal Models, N = 40/40
Parametric*: (A) 100 x Bias of Est AUC; (B) 100 x SE (Est AUC), Empir; (C) 100 x SE (Est AUC), Ave Est. Nonparametric: (D) 100 x Bias of Est AUC; (E) 100 x SE (Est AUC), Empir; (F) 100 x SE (Est AUC), Ave Est DeLong. Ratio of SEs: (C)/(B), (F)/(E), (E)/(B), (F)/(C).
Distributions for ND & D†
  Degree of Accuracy         (A)     (B)     (C)     (D)     (E)     (F) (C)/(B) (F)/(E) (E)/(B) (F)/(C)
  (True Index)
ND: G; D: MG & moderate skew (right)
  Low, AUC = 0.605           1.2    6.22    6.37     0.2    5.87    6.48    1.02    1.10    0.94    1.02
  Moderate, AUC = 0.753      1.5    5.24    5.29     0.3    5.15    5.45    1.01    1.08    0.98    1.03
  High, AUC = 0.907          0.8    3.11    3.18     0.2    3.29    3.21    1.02    0.98    1.06    [illegible]
ND: G; D: MG & very skew (right)
  Low, AUC = 0.606           0.2    4.93    6.63     0.3    4.77    6.78    1.34    1.42    0.97    1.02
  Moderate, AUC = 0.752      2.1    4.50    5.38     0.3    4.48    5.59    1.20    1.25    1.00    1.04
  High, AUC = 0.898  [illegible]    3.04    3.35     0.2    3.27    3.41    1.10    1.04    1.08    1.02
ND: G; D: MG & bimodal
  Low, AUC = 0.605           1.1    4.58    6.68     0.2    3.95    6.82    1.48    1.73    0.87    1.02
  Moderate, AUC = 0.751      2.2    3.38    5.48     0.1    3.73    5.74    1.62    1.54    1.10    1.05
  High, AUC = 0.900          0.9    2.57    3.34     0.1    2.93    3.41    1.30    1.18    1.14    1.02
ND: G; D: MG & left skew
  Low, AUC = 0.607          -1.3    6.32    6.48     0.0    5.91    6.57    1.03    1.11    0.94    1.01
  Moderate, AUC = 0.741     -0.5    5.59    5.81    -0.2    5.21    5.88    1.04    1.13    0.93    1.01
  High, AUC = 0.891          0.8    3.71    3.72     0.2    3.69    3.84    1.00    1.04    0.99    1.03
ND: MG; D: MG, both left skew
  Low, AUC = 0.609           0.4    6.07    6.24     0.0    5.53    6.38    1.03    1.15    0.91    1.02
  Moderate, AUC = 0.750      1.3    5.22    5.32     0.1    4.84    5.53    1.02    1.14    0.93    1.04
  High, AUC = 0.885          1.1    3.23    3.61     0.0    3.54    3.79    1.12    1.07    1.10    1.05
ND: MG; D: MG, both right skew
  Low, AUC = 0.607           0.5    6.10    6.21     0.2    5.61    6.35    1.02    1.13    0.92    1.02
  Moderate, AUC = 0.755      1.1    5.16    5.22     0.2    4.88    5.39    1.01    1.11    0.94    1.03
  High, AUC = 0.898          0.9    3.14    3.34     0.1    3.28    3.44    1.08    1.05    1.04    1.03
*The binormal model was fitted using ten data categories.
†ND = non-diseased; D = diseased; G = Gaussian; MG = mixture of Gaussian; SE = standard error; Empir = empirical.
lated for the nonparametric estimate using DeLong’s method.
Results
Table 1 compares the results from the parametric and nonparametric approaches when data are generated from the binormal model, while table 2 compares the results for non-binormal data. When fitting the binormal model, there were no degenerate23,24 data sets. In other words, the MLE iteration procedure converged for all data sets.
PERFORMANCE WITH BINORMAL DATA
Columns (A) and (D) in tables 1a and 1b show that when data were generated from a pair of Gaussian distributions both the parametric and the nonparametric approaches yielded close to unbiased estimates of the AUC. The biases were ≤0.9% and ≤0.2%, respectively, for the sample sizes of 40/40;