
Testing the Normal Approximation and Minimal Sample Size Requirements of Weighted Kappa When the Number of Categories is Large

Domenic V. Cicchetti
West Haven VA Medical Center and Yale University

Applied Psychological Measurement, Vol. 5, No. 1, Winter 1981, pp. 101-104
© Copyright 1981 Applied Psychological Measurement Inc.
The results of this computer simulation study indicate that the weighted kappa statistic, employing a standard error developed by Fleiss, Cohen, and Everitt (1969), holds for a large number of k categories of classification (e.g., 8 ≤ k ≤ 10). These data are entirely consistent with an earlier study (Cicchetti & Fleiss, 1977), which showed the same results for 3 ≤ k ≤ 7. The two studies also indicate that the minimal N required for the valid application of weighted kappa can be easily approximated by the simple formula 2k². This produces sample sizes that vary from a low of about 20 (when k = 3) to a high of about 200 (when k = 10). Finally, the range 3 ≤ k ≤ 10 should encompass most extant clinical scales of classification.
In a previous Monte Carlo (computer simulation) study, Cicchetti and Fleiss (1977) demonstrated that the normal approximation of the distribution of weighted kappa (κw), based upon a standard error proposed earlier by Fleiss, Cohen, and Everitt (1969), is valid for 3 ≤ k ≤ 7 ordinal categories (k) of classification, even under conditions in which sets of rater marginals differ markedly one from the other. Also, the minimal sample sizes for the valid application of κw are closely approximated by the formula 2k², in which k, once again, denotes the number of ordinal categories of classification. Specifically, this formula yields the following approximate minimal sample sizes (N) for ordinal scales ranging between 3 and 7 categories: for k = 3, N = 20; for k = 4, N = 30; for k = 5, N = 50; for k = 6, N = 75; and for k = 7, N = 100.
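To make the rule concrete, the short Python sketch below (not part of the original paper) tabulates the exact 2k² products next to the rounded minimal sample sizes quoted in this article; the values for k = 8 through 10 are those reported in the Results section.

```python
# The paper approximates the minimal N needed for a valid normal approximation
# as 2k^2 and then rounds to convenient values; "reported" holds the figures
# quoted in the text for k = 3 through 10.
reported = {3: 20, 4: 30, 5: 50, 6: 75, 7: 100, 8: 125, 9: 160, 10: 200}

for k in range(3, 11):
    exact = 2 * k * k
    print(f"k = {k:2d}   2k^2 = {exact:3d}   reported minimal N = {reported[k]:3d}")
```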
In this report, the same type of Monte Carlo research is extended to 8 ≤ k ≤ 10 categories of ordinal classification in order to encompass those clinical scales composed of more than 7 categories, e.g., neuropsychiatric symptom scales developed for assessing the extent of phobic reactions (Gelder & Marks, 1966; Watson, Gaind, & Marks, 1971) and for assessing various types of personality disorders (Tyrer & Alexander, 1979; Tyrer, Alexander, Cicchetti, Cohen, & Remington, 1979).
Method

The computer simulation technique was identical to that used in the previous Cicchetti and Fleiss (1977) study. The following parameters were systematically varied:
1. The number of scale points, or k categories of ordinal classification, which ranged between 8 and 10.
2. The number of subjects, N, which ranged between approximately 2k² and 16k².
3. The quantities πi. and π.j (i, j = 1, ..., k) once again denoted the underlying simulated rater marginal probabilities used to generate each set of tables. As previously, for each value of k and N, three pairs of marginal probabilities were studied: (a) uniform marginals (πi. = π.j = 1/k for all i and j); (b) moderately different marginals, with values ranging between .0375 and .0444, depending on the value of k; and (c) markedly different marginals, with values derived from Equation 1 and ranging between .15 and .16. In this condition the underlying marginal probabilities for Rater 1 were taken to be the exact reverse of those for Rater 2. For the 9-point ordinal scale, for example, the simulated (on-the-average) Rater 1 marginal proportions were .30, .25, .15, .10, .08, .03, .03, .03, and .03, while the corresponding proportions (on the average) for the Rater 2 marginals became .03, .03, .03, .03, .08, .10, .15, .25, and .30. A sketch of this sampling scheme is given after this list.
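The following minimal Python sketch (using numpy; the variable names and seed are illustrative, and the original program was written for the IBM 360) generates one such table for the 9-point "markedly different" condition. Under the null hypothesis of chance agreement the two raters' assignments are drawn independently from their respective marginals.

```python
import numpy as np

# Simulated 9-point "markedly different" marginals from the text; Rater 2's
# underlying probabilities are the exact reverse of Rater 1's.
p_rater1 = np.array([.30, .25, .15, .10, .08, .03, .03, .03, .03])
p_rater2 = p_rater1[::-1]
k = len(p_rater1)
n_subjects = 2 * k * k              # the 2k^2 minimal sample size (162 when k = 9)

rng = np.random.default_rng(seed=0)

# Under the null hypothesis the raters are independent, so each subject's two
# ratings are sampled separately from the two marginal distributions.
r1 = rng.choice(k, size=n_subjects, p=p_rater1)
r2 = rng.choice(k, size=n_subjects, p=p_rater2)

# Cross-tabulate into a k x k agreement table (rows: Rater 1, columns: Rater 2).
table = np.zeros((k, k), dtype=int)
np.add.at(table, (r1, r2), 1)
print(table)
```

Repeating this draw 8,000 times for each combination of k, N, and marginal configuration reproduces the design described above.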
For each combination of N, k, and marginal configuration (as defined above), 8,000 tables (or runs) were generated at random by a program written for the IBM 360. Finally, the formulae for the rater agreement weights were the same as those utilized in the earlier Cicchetti and Fleiss (1977) research. These ranged between 1 (complete rater agreement) and 0 (complete rater disagreement, or being as far apart as the range of scale points will allow, e.g., 1-9 or 9-1 pairings on a 9-category ordinal scale of classification). These linear agreement weighting systems were derived earlier by Cicchetti (1976) and are given by the formula 1 - |i - j|/(k - 1), where i and j are the categories of assignment by Raters 1 and 2.
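For each simulated table the study evaluates weighted kappa against its large-sample null standard error. The Python sketch below implements the linear weights 1 - |i - j|/(k - 1) and the Fleiss, Cohen, and Everitt (1969) null variance in its commonly cited form; it is an illustration, not the code used in the original study.

```python
import numpy as np

def linear_weights(k):
    # Cicchetti (1976) linear agreement weights: w_ij = 1 - |i - j| / (k - 1),
    # i.e., 1 for exact agreement and 0 for the most extreme disagreement.
    i, j = np.indices((k, k))
    return 1.0 - np.abs(i - j) / (k - 1)

def weighted_kappa_null_test(table):
    """table: k x k array of joint counts (rows: Rater 1, columns: Rater 2)."""
    counts = np.asarray(table, dtype=float)
    n = counts.sum()
    p = counts / n                          # joint proportions
    pi = p.sum(axis=1)                      # Rater 1 marginal proportions
    pj = p.sum(axis=0)                      # Rater 2 marginal proportions
    w = linear_weights(p.shape[0])

    po = (p * w).sum()                      # weighted observed agreement
    pe = (np.outer(pi, pj) * w).sum()       # weighted chance-expected agreement
    kappa_w = (po - pe) / (1.0 - pe)

    # Large-sample null variance (Fleiss, Cohen, & Everitt, 1969):
    # Var0 = [ sum_ij p_i. p_.j (w_ij - (wbar_i. + wbar_.j))^2 - pe^2 ] / [ n (1 - pe)^2 ]
    wbar_i = w @ pj                         # wbar_i. = sum_j p_.j w_ij
    wbar_j = w.T @ pi                       # wbar_.j = sum_i p_i. w_ij
    num = (np.outer(pi, pj) * (w - (wbar_i[:, None] + wbar_j[None, :])) ** 2).sum()
    se0 = np.sqrt((num - pe ** 2) / (n * (1.0 - pe) ** 2))
    return kappa_w, se0
```

Applied to tables generated as above, the estimate of κw should fluctuate around zero with a spread close to SE0 whenever N is at least about 2k², which is what the study verifies.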
Results

The findings of this follow-up Monte Carlo investigation confirmed that (1) the normal approximation to the null distribution of weighted kappa is valid for 8 ≤ k ≤ 10 categories of ordinal classification; (2) the minimal number of cases required for the valid application of weighted kappa is still well approximated by the formula 2k²; and (3) the above results hold well even under the condition of markedly different simulated rater marginals. This means that the approximate minimal sample sizes required for the valid application of the weighted kappa statistic become, respectively: for k = 8, N = 125; for k = 9, N = 160; and for k = 10, N = 200.
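In application, the test implied by these results refers z = κw / SE0 to the standard normal distribution; the empirical tail areas in Table 2 assess how well that approximation holds. The small self-contained Python fragment below (an illustration, not code from the paper) shows this final step, with kappa_w and se0 as produced by the earlier sketch.

```python
import math

def normal_approximation_test(kappa_w, se0, alpha=0.05):
    # Refer z = kappa_w / SE0 to the standard normal distribution (null case).
    z = kappa_w / se0
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided, p_two_sided < alpha

# With N of at least about 2k^2, the simulations indicate that the empirical
# rejection rate under the null stays close to the nominal alpha level.
```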
Table 1
Central Moments of Null Distribution of κw for a 10-Category Ordinal Scale With Marked Differences in Rater Marginals
Note. Underlying marginal probabilities were .25, .25, .20, .15, .05, .02, .02, .02, .02, and .02 for Rater 1; and .02, .02, .02, .02, .02, .05, .15, .20, .25, and .25 for Rater 2.
Table 2
Empirical Tail Areas of Null Distribution of κw for a 10-Category Ordinal Scale With Marked Differences in Rater Marginals, for One-Sided and Two-Sided Intervals
Note. Underlying marginal probabilities were .25, .25, .20, .15, .05, .02, .02, .02, .02, and .02 for Rater 1; and .02, .02, .02, .02, .02, .05, .15, .20, .25, and .25 for Rater 2.
Since the findings of this computer simulation held for each category (8 ≤ k ≤ 10) and each condition of rater marginals (uniform, moderately different, and markedly different), the results are presented only for the 10-category ordinal scale under the stringent condition of markedly different rater marginals (Tables 1 and 2).
Discussion and Conclusions

The results of this follow-up investigation are quite straightforward. Viewed in conjunction with the results previously published by Cicchetti and Fleiss (1977), it can be concluded that the weighted kappa statistic (due to Cohen, 1968) can be validly applied in the null case, for scales of 3 ≤ k ≤ 10 categories, even under conditions in which the underlying rater marginals are quite markedly different, providing only that the minimal number of cases evaluated by any given pair of raters is at least of the order of about 2k². This produces approximate N's ranging from about 20 for three categories of classification to about 200 when the number of ordinal categories is 10. Thus, the implied conservative minimal N of 200 cases, irrespective of the number of k categories of classification (see Fleiss, Cohen, & Everitt, 1969), is only required when k = 10. As noted elsewhere (Cicchetti & Fleiss, 1977), this finding should be of comfort to research investigators utilizing the kappa statistics, since it is often difficult to obtain sample sizes of more than 200 cases.
References

Cicchetti, D. V. Assessing inter-rater reliability for rating scales: Resolving some basic issues. British Journal of Psychiatry, 1976, 129, 452-456.

Cicchetti, D. V., & Fleiss, J. L. Comparison of the null distribution of weighted kappa and the C ordinal statistic. Applied Psychological Measurement, 1977, 1, 195-201.

Cohen, J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 1968, 70, 213-220.

Fleiss, J. L., Cohen, J., & Everitt, B. S. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 1969, 72, 323-327.

Gelder, M. G., & Marks, I. M. Severe agoraphobia: A controlled prospective trial of behavior therapy. British Journal of Psychiatry, 1966, 112, 309-319.

Tyrer, P., & Alexander, J. Classification of personality disorder. British Journal of Psychiatry, 1979, 135, 163-167.
Tyrer, P., Alexander, M., Cicchetti, D. V., Cohen, M., & Remington, M. Reliability of a schedule for rating personality disorders. British Journal of Psychiatry, 1979, 135, 168-174.

Watson, J. P., Gaind, R., & Marks, I. M. Prolonged exposure: A rapid treatment for phobias. British Medical Journal, 1971, 1, 13-15.
Acknowledgments

This research was supported by the West Haven VA Medical Center (MRIS 1416). The author acknowledges the contributions of Joseph Vitale and Sandra Aivano, Yale University, in developing the computer programs used in this research, and Professor Joseph L. Fleiss for his collaboration in the preceding report, as well as his helpful critique of the present investigation.
Author's Address

Send requests for reprints or further information to Domenic V. Cicchetti, Ph.D., Senior Research Psychologist and Biostatistician, VA Medical Center, West Haven, CT 06516.