(Open Access) A comparison of parametric and nonparametric approaches to ROC analysis of quantitative diagnostic tests. (1997) | Karim O. Hajian-Tilaki

A Comparison of Parametric and Nonparametric Approaches to ROC Analysis of Quantitative Diagnostic Tests

KARIM O. HAJIAN-TILAKI, PhD, JAMES A. HANLEY, PhD, LAWRENCE JOSEPH, PhD, JEAN-PAUL COLLET, PhD

Receiver operating characteristic (ROC) analysis, which yields indices of accuracy such as the area under the curve (AUC), is increasingly being used to evaluate the performances of diagnostic tests that produce results on continuous scales. Both parametric and nonparametric ROC approaches are available to assess the discriminant capacity of such tests, but there are no clear guidelines as to the merits of each, particularly with non-binormal data. Investigators may worry that when data are non-Gaussian, estimates of diagnostic accuracy based on a binormal model may be distorted. The authors conducted a Monte Carlo simulation study to compare the bias and sampling variability in the estimates of the AUCs derived from parametric and nonparametric procedures. Each approach was assessed in data sets generated from various configurations of pairs of overlapping distributions; these included the binormal model and non-binormal pairs of distributions where one or both pair members were mixtures of Gaussian (MG) distributions with different degrees of departures from binormality. The biases in the estimates of the AUCs were found to be very small for both parametric and nonparametric procedures. The two approaches yielded very close estimates of the AUCs and of the corresponding sampling variability even when data were generated from non-binormal models. Thus, for a wide range of distributions, concern about bias or imprecision of the estimates of the AUC should not be a major factor in choosing between the nonparametric and parametric approaches.

Key words: ROC analysis; quantitative diagnostic test; comparison, parametric; binormal model; LABROC; nonparametric procedure; area under the curve (AUC).

(Med Decis Making 1997;17:94-102)

Received February 17, 1995, from the Department of Epidemiology and Biostatistics, McGill University (KOH-T, JAH, LJ, J-PC); the Division of Clinical Epidemiology, Royal Victoria Hospital (JAH); the Division of Clinical Epidemiology, Montreal General Hospital (LJ); and the Division of Clinical Epidemiology, Jewish General Hospital (J-PC); all in Montreal, Quebec, Canada. Revision accepted for publication July 17, 1995. Supported by an operating grant from the Natural Sciences and Engineering Research Council of Canada and the Fonds de la recherche en Santé du Québec.

Address correspondence and reprint requests to Dr. Hanley: Department of Epidemiology and Biostatistics, McGill University, 1020 Pine Avenue West, Montreal, Canada H3A 1A2. e-mail: (Jimh@epid.lan.mcgill.ca).

During the past ten years, receiver operating characteristic (ROC) analysis has become a popular method for evaluating the accuracy/performance of medical diagnostic tests.¹⁻³ The most attractive property of ROC analysis is that the accuracy indices derived from this technique are not distorted by fluctuations caused by the use of an arbitrarily chosen decision "criterion" or "cutoff."⁴⁻⁸ One index available from an ROC analysis, the area under the curve (AUC), measures the ability of a diagnostic test to discriminate between two patient states, often labelled "diseased" and "non-diseased." The AUC has been of considerable interest as a summary measure of accuracy because of its meaningful interpretation.

Initially, ROC methods were confined to tests interpreted on rating scales and analysis was typically carried out using the binormal model.⁹,¹⁰ However, they are now becoming increasingly popular for evaluating the performances of quantitative diagnostic tests with numerical results recorded directly on continuous scales.¹,²,³,¹¹ Both parametric and nonparametric procedures can be used to derive an AUC index of accuracy for such diagnostic tests. However, Goddard and Hinberg warned that if the distribution of raw data from a quantitative test is far from Gaussian, the AUC [and corresponding standard error (SE)] derived from a directly fitted binormal model can be seriously distorted. This occurs because one fits a mean and standard deviation to the raw data for the diseased and non-diseased patients separately. One way to avoid the possible distortion is to use Metz's adaptation of the binormal model, previously used with rating data,⁹,¹³⁻¹⁵ with laboratory-type data. Metz et al. implemented the binormal model in the LABROC software. The procedure first discretizes the continuous data and then uses the categories as ratings in the ROCFIT procedure to obtain the maximum-likelihood estimates (MLE) of the two relevant parameters of the binormal model. From them it calculates the AUC and the corresponding SE.

VOL 17/NO 1, JAN-MAR 1997, Parametric and Nonparametric ROC Analysis

When the data are continuous and appear to be non-Gaussian, many users will find the nonparametric approach¹⁷⁻¹⁹ to estimating the AUC more appealing than using the binormal model, since they may worry that estimates of diagnostic accuracy based on a binormal model will be distorted. However, nonparametric area estimates will tend to underestimate the AUCs for rating data,⁷,²⁰ in particular when ROC operating points are not well spread out along the ROC curve. Moreover, this method does not yield a smooth estimate of the entire ROC curve. The variance of the nonparametric estimates of the AUC can be estimated entirely nonparametrically¹⁸,¹⁹ or using an exponential approximation. Recently, Obuchowski found that the exponential approximation underestimates the empirical standard error of the nonparametric AUC for rating data when the "ratings" begin as continuous data with a binormal distribution and when the ratio of the standard deviations (SDs) of the two distributions is greater than 2. However, in practice the data might arise from a non-binormal model.

In summary, the statistical behaviors of the AUC estimates derived from the parametric and nonparametric approaches have not been investigated for quantitative diagnostic test results, and there are no general guidelines for choosing one approach over the other. Thus, we conducted a broad numerical investigation to compare the statistical behaviors of the estimates of the AUC derived from parametric and nonparametric procedures.

Methods

DATA GENERATION

As is shown in the leftmost columns of tables 1 and 2, we generated continuously distributed data with sample sizes of n = 40 for diseased and n = 40 for non-diseased from various pairs of overlapping distributions with various degrees of separation; sample sizes of 100/100 were also investigated. Overall, 1,000 data sets were generated for each configuration studied.

Binormal data. First, we generated continuously distributed data from two overlapping Gaussian distributions, i.e., {G, G} pairs for the "non-diseased" and "diseased" patients with different degrees of separation (AUC = 0.60, AUC = 0.75, AUC = 0.90) and with various ratios of SDs of distributions for the non-diseased to diseased (1:1, 1:1.4, 1:2 and 1:3), yielding in all 12 configurations of pairs.
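The binormal configurations described above can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the non-diseased distribution is taken as N(0, 1), the diseased SD equals the stated ratio, and the diseased mean is solved from the standard binormal identity AUC = Φ(μ / sqrt(1 + σ²)). The function names `diseased_mean` and `generate_binormal` are hypothetical, and the paper does not state which parametrization or random number generator was actually used.

```python
import math
import random
from statistics import NormalDist


def diseased_mean(target_auc, sd_ratio):
    """Mean of the diseased Gaussian giving the target AUC.

    Assumes non-diseased ~ N(0, 1) and diseased ~ N(mu, sd_ratio**2),
    so that AUC = Phi(mu / sqrt(1 + sd_ratio**2)).
    """
    return NormalDist().inv_cdf(target_auc) * math.sqrt(1.0 + sd_ratio ** 2)


def generate_binormal(target_auc, sd_ratio, n=40, seed=None):
    """Draw one data set of n non-diseased and n diseased test results."""
    rng = random.Random(seed)
    mu = diseased_mean(target_auc, sd_ratio)
    non_diseased = [rng.gauss(0.0, 1.0) for _ in range(n)]
    diseased = [rng.gauss(mu, sd_ratio) for _ in range(n)]
    return non_diseased, diseased


# 3 degrees of separation x 4 SD ratios = 12 binormal configurations
configs = [(auc, ratio)
           for auc in (0.60, 0.75, 0.90)
           for ratio in (1.0, 1.4, 2.0, 3.0)]
```

Repeating `generate_binormal` 1,000 times per entry of `configs` (with different seeds) would mimic the simulation design at sample sizes of 40/40; passing `n=100` gives the 100/100 design.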

Non-binormal data. Data were also generated from various configurations of non-binormal pairs, where one or both members of the pairs were mixtures of Gaussian (MG) distributions: {G, MG-skewed or bimodal} pairs or {MG-skewed, MG-skewed} pairs. In all, as is shown in figures 1 and 2, 18 configurations of non-binormal pairs with various degrees of skewness and separation were used to generate data. We calculated how often the hypothesis of normality would be rejected with such distributions. For sample sizes of 40, the hypothesis was rejected by the Wilks' test employed by SAS in 34% of the data sets from the moderate-skew distributions and 67% of those from high-skew distributions. For sample sizes of 100, the corresponding percentages were 59 and 97.

To some, the range of distributions shown in figures 1 and 2 may seem limited. However, one can apply many monotonic transformations to the separator axis, thereby effectively covering a broader range of possibilities of non-binormal data. An example of how both distributions are converted to non-normal pairs is shown in the last row of figure 1. Each pair was generated by mapping the (-∞, +∞) scale used in row 2 into the (0, 1) scale by applying the transformation exp(X)/[1 + exp(X)]. Notice, however, that although such monotonic transformations may radically change the shapes of the distributions, they do not change the ROC curve when applied to both distributions.²,⁵,¹³
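The invariance under monotonic transformations can be checked numerically. The following is a minimal sketch, not the authors' code: it draws one illustrative pair of samples (the means, SDs, and seed are arbitrary assumptions), computes the rank-based AUC before and after applying exp(X)/[1 + exp(X)] to both samples, and compares the two values. The helper names are hypothetical.

```python
import math
import random


def auc_mann_whitney(non_diseased, diseased):
    """Proportion of (diseased, non-diseased) pairs ordered correctly; ties count 1/2."""
    wins = 0.0
    for d in diseased:
        for nd in non_diseased:
            if d > nd:
                wins += 1.0
            elif d == nd:
                wins += 0.5
    return wins / (len(diseased) * len(non_diseased))


def logistic(x):
    """Strictly increasing map from (-inf, +inf) to (0, 1)."""
    return math.exp(x) / (1.0 + math.exp(x))


rng = random.Random(0)                                 # arbitrary seed
non_diseased = [rng.gauss(0.0, 1.0) for _ in range(40)]
diseased = [rng.gauss(1.0, 2.0) for _ in range(40)]    # illustrative parameters

auc_raw = auc_mann_whitney(non_diseased, diseased)
auc_transformed = auc_mann_whitney([logistic(x) for x in non_diseased],
                                   [logistic(x) for x in diseased])
```

Because the transformation preserves the ordering of every pair of observations, `auc_raw` and `auc_transformed` agree, even though histograms of the transformed values look very different from the originals.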

Few of the articles in the quantitative diagnostic test literature show the distributions of raw data. In those that do, the distributions of biomarkers for diseased patients are often positively skewed. For example, Goddard and Hinberg reported the histograms of different biomarkers for five types of cancer; the distributions for cancer patients were skewed or bimodal. Linnet also showed examples where the distributions of serum bilirubin and fasting serum bile acids for diseased patients were positively skewed while the reference distribution was approximately normal. Empirically, there is considerable evidence that the binormal model used for rating data needs to include more variation for diseased patients, i.e., the ratio of SDs of distribution for diseased to non-diseased patients is higher than 1. Based on this empirical evidence, we included the range of ratios of SDs from 1:1 to 1:3. We chose mixtures of Gaussian distributions for diseased patients since the distribution may contain unidentified disease subtypes. Thus, we allow for more variation for the diseased than the non-diseased patients.

Hajian-Tilaki, Hanley, Joseph, Collet, MEDICAL DECISION MAKING

STATISTICAL ANALYSIS

Each generated data set underwent two analyses: nonparametric ROC analysis using the raw data, and parametric ROC analysis of the categorized data via the LABROC approach.

Nonparametric approach. The nonparametric estimate of the AUC was calculated directly from the raw data using the Wilcoxon-Mann-Whitney two-sample statistic; the SE of the AUC was calculated by DeLong et al.'s method.
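For readers unfamiliar with the two calculations named above, the following sketch shows one common way to write them. It is an illustration, not the code used in the study: the AUC is the Wilcoxon-Mann-Whitney proportion of correctly ordered pairs (ties counted as 1/2), and the SE follows the usual single-curve form of DeLong et al.'s variance estimate based on per-observation "placement" values. The function names are hypothetical.

```python
import math


def psi(x, y):
    """Mann-Whitney kernel: 1 if diseased x exceeds non-diseased y, 1/2 if tied, else 0."""
    if x > y:
        return 1.0
    if x == y:
        return 0.5
    return 0.0


def auc_and_delong_se(non_diseased, diseased):
    """Nonparametric AUC and its SE for a single ROC curve (needs >= 2 cases per group)."""
    m, n = len(diseased), len(non_diseased)
    # Placement values: one per diseased subject and one per non-diseased subject
    v10 = [sum(psi(x, y) for y in non_diseased) / n for x in diseased]
    v01 = [sum(psi(x, y) for x in diseased) / m for y in non_diseased]
    auc = sum(v10) / m
    s10 = sum((v - auc) ** 2 for v in v10) / (m - 1)
    s01 = sum((v - auc) ** 2 for v in v01) / (n - 1)
    return auc, math.sqrt(s10 / m + s01 / n)


auc, se = auc_and_delong_se([1.0, 2.0, 3.5], [3.0, 4.0, 5.0])
```

In this toy example 8 of the 9 diseased/non-diseased pairs are correctly ordered, so the estimated AUC is 8/9. The double loop is quadratic in the sample size, which is unproblematic at 40/40 or 100/100 but would be replaced by a rank-based computation for large data sets.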

Parametric ROC analysis. Each data set was analyzed via Metz's LABROC procedure. The program categorizes the data according to a data-dependent rule that tries to ensure the greatest possible uniformity of spread of ROC operating points. We stipulated a maximum of ten data categories for sample sizes of 40/40 and 20 data categories for 100/100 (these are the default numbers of data categories used in the LABROC software). The program then fits a two-parameter binormal ROC curve by the method of maximum-likelihood estimation (MLE) using the categories as ratings. From the two parameters of this binormal model, it calculates an estimate of the AUC and its SE, which we call the "calculated" SE.
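The final step, going from the two binormal parameters to an AUC, has a simple closed form. The sketch below assumes the conventional parametrization a = (mean of diseased - mean of non-diseased) / SD of diseased and b = SD of non-diseased / SD of diseased, for which AUC = Φ(a / sqrt(1 + b²)) and the ROC curve is TPF = Φ(a + b Φ⁻¹(FPF)). It does not reproduce the LABROC categorization or maximum-likelihood fit, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def binormal_auc(a, b):
    """Area under the binormal ROC curve with intercept a and slope b."""
    return _STD_NORMAL.cdf(a / math.sqrt(1.0 + b * b))


def binormal_tpf(a, b, fpf):
    """True-positive fraction on the binormal ROC curve at a given FPF (0 < fpf < 1)."""
    return _STD_NORMAL.cdf(a + b * _STD_NORMAL.inv_cdf(fpf))


# Equal-variance example (b = 1) constructed to have AUC = 0.90
a_example = math.sqrt(2.0) * _STD_NORMAL.inv_cdf(0.90)
auc_example = binormal_auc(a_example, 1.0)
```

An SE for this AUC would additionally require the estimated variances and covariance of a and b from the likelihood fit (for instance via the delta method), which is outside the scope of this sketch.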

FIGURE 1. Non-binormal distributions used to generate data sets, with the distribution for non-diseased (broken lines) chosen to be Gaussian. The distributions for diseased (solid lines) were formed from mixtures of two Gaussian distributions to create moderate right skew (top row), very right skew (second row), a bimodal distribution (third row), and moderate left skew (fourth row). Each pair of the last row was generated by mapping the (-∞, +∞) scale used in row 2 into the (0, 1) scale by applying the transformation exp(X)/[1 + exp(X)]. The distributions in this last row were not used in the simulations because they would give the same results as the distributions in the second row. The degrees of separation were low AUC = 0.60 (leftmost column), moderate AUC = 0.75 (middle column), and high AUC = 0.90 (rightmost column).

FIGURE 2. Additional non-binormal distributions used to generate data sets with non-Gaussian distributions for both non-diseased and diseased. Distributions for both non-diseased (broken lines) and diseased (solid lines) were formed from mixtures of two Gaussian distributions to create moderate left skew (first row) and moderate right skew (second row). The degrees of separation were: low AUC = 0.60 (leftmost column), moderate AUC = 0.75 (middle column), and high AUC = 0.90 (rightmost column).


Table 1a. Comparison of Parametric and Nonparametric Approaches with Respect to Bias of the Estimates of AUC and the Corresponding Standard Errors in 1,000 Data Sets Generated from Various Configurations of the Binormal Model, N = 40/40

Columns: (A) 100 x bias of Est AUC, parametric*; (B) 100 x Empir SE (Est AUC), parametric; (C) 100 x Ave Est SE (Est AUC), parametric; (D) 100 x bias of Est AUC, nonparametric; (E) 100 x Empir SE (Est AUC), nonparametric; (F) 100 x Ave Est SE (Est AUC) by DeLong's method, nonparametric.

Degree of accuracy    Ratio of     Parametric*              Nonparametric            Ratio of SEs
(true index)          SDs, D:ND†   (A)    (B)    (C)        (D)    (E)    (F)        (C)/(B)  (F)/(E)  (E)/(B)  (F)/(C)
Low, AUC = 0.60       1             0.5   6.69   6.22       -0.1   6.42   6.38       0.93     0.99     0.96     1.02
                      1.4          -0.2   6.95   6.30       -0.1   6.47   6.42       0.91     0.99     0.93     1.02
                      2            -0.3   6.86   6.51       -0.1   6.59   6.59       0.96     1.00     0.96     1.01
                      3            -0.3   6.71   6.77       -0.2   6.82   6.85       1.01     1.00     1.02     1.01
Moderate, AUC = 0.75  1             0.9   5.68   5.26        0.1   5.53   5.42       0.93     0.98     0.97     1.03
                      1.4           0.8   5.89   5.33       -0.0   5.60   5.49       0.90     0.98     0.95     1.03
                      2             0.3   6.11   5.58        0.0   5.78   5.67       0.91     0.98     0.95     1.02
                      3            -0.1   6.15   5.89        0.1   6.03   5.93       0.96     0.98     0.98     1.01
High, AUC = 0.90      1             0.7   3.35   3.24        0.0   3.51   3.34       0.97     0.95     1.05     1.03
                      1.4           0.6   3.46   3.31       -0.1   3.62   3.42       0.96     0.94     1.05     1.03
                      2             0.6   3.75   3.46        0.0   3.83   3.59       0.93     0.94     1.02     1.03
                      3             0.4   4.19   3.76        0.1   4.10   3.83       0.90     0.93     0.98     1.02

*Ten data categories were used in fitting the binormal model.
†D = diseased; ND = non-diseased; Est = estimate; Ave = average; Empir = empirical; SE = standard error; SD = standard deviation.

Table 1b. Comparison of Parametric and Nonparametric Approaches with Respect to Bias of the Estimates of AUC and the Corresponding Standard Errors in 1,000 Data Sets Generated from Various Configurations of the Binormal Model, N = 100/100

Columns: (A) 100 x bias of Est AUC, parametric*; (B) 100 x Empir SE (Est AUC), parametric; (C) 100 x Ave Est SE (Est AUC), parametric; (D) 100 x bias of Est AUC, nonparametric; (E) 100 x Empir SE (Est AUC), nonparametric; (F) 100 x Ave Est SE (Est AUC) by DeLong's method, nonparametric.

Degree of accuracy    Ratio of     Parametric*              Nonparametric            Ratio of SEs
(true index)          SDs, D:ND†   (A)    (B)    (C)        (D)    (E)    (F)        (C)/(B)  (F)/(E)  (E)/(B)  (F)/(C)
Low, AUC = 0.60       1             0.2   3.91   3.91       -0.1   3.89   4.00       1.00     1.03     0.99     1.02
                      1.4           0.0   3.97   3.96       -0.1   3.94   4.04       1.00     1.03     0.99     1.02
                      2            -0.1   3.97   4.07       -0.1   4.05   4.15       1.03     1.02     1.02     1.02
                      3            -0.2   4.08   4.21       -0.1   4.22   4.32       1.04     1.02     1.04     1.03
Moderate, AUC = 0.75  1             0.5   3.29   3.34        0.0   3.31   3.41       1.02     1.03     1.01     1.02
                      1.4           0.3   3.39   3.39       -0.1   3.37   3.48       1.00     1.03     0.99     1.02
                      2             0.0   3.50   3.52        0.0   3.50   3.58       1.01     1.02     1.00     1.02
                      3            -0.2   3.60   3.68        0.0   3.71   3.75       1.02     1.01     1.03     1.02
High, AUC = 0.90      1             0.2   1.99   2.10       -0.1   2.07   2.13       1.06     1.03     1.04     1.01
                      1.4           0.1   2.03   2.16       -0.1   2.15   2.19       1.06     1.02     1.06     1.01
                      2             0.0   2.23   2.28       -0.1   2.28   2.31       1.02     1.01     1.02     1.01
                      3            -0.1   2.50   2.44       -0.0   2.47   2.48       0.98     1.00     0.99     1.02

*20 data categories were used in fitting the binormal model.
†D = diseased; ND = non-diseased; Est = estimate; Ave = average; Empir = empirical; SE = standard error; SD = standard deviation.

COMPARISON OF STATISTICAL BEHAVIORS OF PARAMETRIC AND NONPARAMETRIC ESTIMATES

The biases in the estimates of the AUC (i.e., the difference between the average of the 1,000 estimates of the AUC and the true value) from the parametric and nonparametric approaches were calculated and compared. The magnitude of the bias in the estimates of the AUC and the absolute discrepancy between individual estimates of the AUC from the two approaches were used to judge the impact of model mis-specification.
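The two summary quantities defined above are simple to compute once the per-data-set estimates are in hand. The sketch below is illustrative only: the function names are hypothetical, the bias is scaled by 100 to match the tables, and averaging the absolute discrepancies across data sets is an assumption, since the text does not say how the individual discrepancies were summarized.

```python
from statistics import mean


def bias_times_100(auc_estimates, true_auc):
    """100 x (average of the AUC estimates minus the true AUC)."""
    return 100.0 * (mean(auc_estimates) - true_auc)


def mean_absolute_discrepancy(parametric, nonparametric):
    """Average absolute difference between paired AUC estimates from the same data sets."""
    if len(parametric) != len(nonparametric):
        raise ValueError("estimates must be paired by data set")
    return mean(abs(p - q) for p, q in zip(parametric, nonparametric))


# Toy example with three simulated data sets (made-up numbers, true AUC = 0.60)
parametric_estimates = [0.61, 0.59, 0.63]
nonparametric_estimates = [0.60, 0.60, 0.62]
example_bias = bias_times_100(parametric_estimates, 0.60)
example_gap = mean_absolute_discrepancy(parametric_estimates, nonparametric_estimates)
```

In the simulation study each list would hold 1,000 estimates for a given configuration rather than three.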

The SD of the 1,000 estimates of the AUC derived from one approach was compared with the corresponding SD of the 1,000 AUC estimates from the other approach. For each approach, this SD (which we call the empirical SE) was also compared with the average of the 1,000 calculated SEs. In addition, we compared the average calculated SE of the AUC derived from the binormal model with that calculated for the nonparametric estimate using DeLong's method.

Table 2a. Comparison of Parametric and Nonparametric Approaches with Respect to Bias of the Estimates of AUC and the Corresponding Standard Errors in 1,000 Data Sets Generated from Various Configurations of Non-binormal Models, N = 40/40

Columns: (A) 100 x bias of Est AUC, parametric*; (B) 100 x Empir SE (Est AUC), parametric; (C) 100 x Ave Est SE (Est AUC), parametric; (D) 100 x bias of Est AUC, nonparametric; (E) 100 x Empir SE (Est AUC), nonparametric; (F) 100 x Ave Est SE (Est AUC) by DeLong's method, nonparametric.

Distributions for ND and D†           Degree of accuracy      Parametric*                 Nonparametric            Ratio of SEs
                                      (true index)            (A)           (B)    (C)    (D)    (E)    (F)        (C)/(B)  (F)/(E)  (E)/(B)  (F)/(C)
ND: G; D: MG, moderate skew (right)   Low, AUC = 0.605         1.2          6.22   6.37    0.2   5.87   6.48       1.02     1.10     0.94     1.02
                                      Moderate, AUC = 0.753    1.5          5.24   5.29    0.3   5.15   5.45       1.01     1.06     0.98     1.03
                                      High, AUC = 0.907        0.8          3.11   3.18    0.2   3.29   3.21       1.02     0.98     1.06     1.01
ND: G; D: MG, very skew (right)       Low, AUC = 0.608         0.2          4.93   6.63    0.3   4.77   6.78       1.34     1.42     0.97     1.02
                                      Moderate, AUC = 0.752    2.1          4.50   5.38    0.3   4.48   5.59       1.20     1.25     1.00     1.04
                                      High, AUC = 0.898        [illegible]  3.04   3.35    0.2   3.27   3.41       1.10     1.04     1.08     1.02
ND: G; D: MG, bimodal                 Low, AUC = 0.605         1.1          4.58   6.68    0.2   3.95   6.82       1.46     1.73     0.87     1.02
                                      Moderate, AUC = 0.751    2.2          3.38   5.48    0.1   3.73   5.74       1.62     1.54     1.10     1.05
                                      High, AUC = 0.900        0.9          2.57   3.34    0.1   2.93   3.41       1.30     1.16     1.14     1.02
ND: G; D: MG, left skew               Low, AUC = 0.607        -1.3          6.32   6.48    0.0   5.91   6.57       1.03     1.11     0.94     1.01
                                      Moderate, AUC = 0.741   -0.5          5.59   5.81   -0.2   5.21   5.88       1.04     1.13     0.93     1.01
                                      High, AUC = 0.891        0.8          3.71   3.72    0.2   3.69   3.84       1.00     1.04     0.99     1.03
ND: MG; D: MG, both left skew         Low, AUC = 0.609         0.4          6.07   6.24    0.0   5.53   6.38       1.03     1.15     0.91     1.02
                                      Moderate, AUC = 0.750    1.3          5.22   5.32    0.1   4.84   5.53       1.02     1.14     0.93     1.04
                                      High, AUC = 0.885        1.1          3.23   3.61    0.0   3.54   3.79       1.12     1.07     1.10     1.05
ND: MG; D: MG, both right skew        Low, AUC = 0.607         0.5          6.10   6.21    0.2   5.61   6.35       1.02     1.13     0.92     1.02
                                      Moderate, AUC = 0.755    1.1          5.16   5.22    0.2   4.88   5.39       1.01     1.11     0.94     1.03
                                      High, AUC = 0.898        0.9          3.14   3.34    0.1   3.28   3.44       1.06     1.05     1.04     1.03

*The binormal model was fitted using ten data categories.
†ND = non-diseased; D = diseased; G = Gaussian; MG = mixture of Gaussian; SE = standard error; Empir = empirical.

Results

Table 1 compares the results from the parametric and nonparametric approaches when data are generated from the binormal model, while table 2 compares the results for non-binormal data. When fitting the binormal model, there were no degenerate²³,²⁴ data sets. In other words, the MLE iteration procedure converged for all data sets.

PERFORMANCE WITH BINORMAL DATA

Columns (A) and (D) in tables 1a and 1b show that when data were generated from a pair of Gaussian distributions both the parametric and the nonparametric approaches yielded close to unbiased estimates of the AUC. The biases were ≤0.9% and ≤0.2%, respectively, for the sample sizes of 40/40;


References

The meaning and use of the area under a receiver operating characteristic (ROC) curve.

Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach.

Signal detection theory and psychophysics

Rank Transformations as a Bridge between Parametric and Nonparametric Statistics

ROC methodology in radiologic imaging
