
Testing the Normal Approximation and Minimal Sample Size Requirements of Weighted Kappa When the Number of Categories is Large

Domenic V. Cicchetti
West Haven VA Medical Center and Yale University

Applied Psychological Measurement, Vol. 5, No. 1, Winter 1981, pp. 101-104
© Copyright 1981 Applied Psychological Measurement Inc.
The results of this computer simulation study indicate that the weighted kappa statistic, employing a standard error developed by Fleiss, Cohen, and Everitt (1969), holds for a large number of k categories of classification (e.g., 8 ≤ k ≤ 10). These data are entirely consistent with an earlier study (Cicchetti & Fleiss, 1977), which showed the same results for 3 ≤ k ≤ 7. The two studies also indicate that the minimal N required for the valid application of weighted kappa can be easily approximated by the simple formula 2k². This produces sample sizes that vary from a low of about 20 (when k = 3) to a high of about 200 (when k = 10). Finally, the range 3 ≤ k ≤ 10 should encompass most extant clinical scales of classification.
In a previous Monte Carlo (computer simulation) study, Cicchetti and Fleiss (1977) demonstrated that the normal approximation of the distribution of weighted kappa (κw), based upon a standard error proposed earlier by Fleiss, Cohen, and Everitt (1969), is valid for 3 ≤ k ≤ 7 ordinal categories (k) of classification, even under conditions in which sets of rater marginals differ markedly one from the other. Also, the minimal sample sizes for the valid application of κw are closely approximated by the formula 2k², in which k, once again, denotes the number of ordinal categories of classification. Specifically, this formula yields the following approximate minimal sample sizes (N) for ordinal scales ranging between 3 and 7 categories: for k = 3, N = 20; for k = 4, N = 30; for k = 5, N = 50; for k = 6, N = 75; and for k = 7, N = 100.
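To make the rule concrete, the short Python sketch below (not part of the original paper) tabulates the exact 2k² products next to the rounded minimal sample sizes quoted in this article; the values for k = 8 through 10 are those reported in the Results section.

```python
# The paper approximates the minimal N needed for a valid normal approximation
# as 2k^2 and then rounds to convenient values; "reported" holds the figures
# quoted in the text for k = 3 through 10.
reported = {3: 20, 4: 30, 5: 50, 6: 75, 7: 100, 8: 125, 9: 160, 10: 200}

for k in range(3, 11):
    exact = 2 * k * k
    print(f"k = {k:2d}   2k^2 = {exact:3d}   reported minimal N = {reported[k]:3d}")
```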
In this report, the same type of Monte Carlo research is extended to 8 ≤ k ≤ 10 categories of ordinal classification in order to encompass those clinical scales composed of more than 7 categories, e.g., neuropsychiatric symptom scales developed for assessing the extent of phobic reactions (Gelder & Marks, 1966; Watson, Gaind, & Marks, 1971) and for assessing various types of personality disorders (Tyrer & Alexander, 1979; Tyrer, Alexander, Cicchetti, Cohen, & Remington, 1979).
Method

The computer simulation technique was identical to that used in the previous Cicchetti and Fleiss (1977) study. The following parameters were systematically varied:
1. The number of scale points, or k categories of ordinal classification, which ranged between 8 and 10.
2. The number of subjects, N, which ranged between approximately 2k² and 16k².
3. The quantities πi. and π.j (i, j = 1, ..., k) once again denoted the underlying simulated rater marginal probabilities used to generate each set of tables. As previously, for each value of k and N, three pairs of marginal probabilities were studied: (a) uniform marginals (πi. = π.j = 1/k for all i and j); (b) moderately different marginals, with values ranging between .0375 and .0444, depending on the value of k; and (c) markedly different marginals, with values derived from Equation 1 and ranging between .15 and .16. In this condition the underlying marginal probabilities for Rater 1 were taken to be the exact reverse of those for Rater 2. For the 9-point ordinal scale, for example, the simulated (on-the-average) Rater 1 marginal proportions were .30, .25, .15, .10, .08, .03, .03, .03, and .03, while the corresponding proportions (on the average) for the Rater 2 marginals became .03, .03, .03, .03, .08, .10, .15, .25, and .30. A sketch of this sampling scheme is given after this list.
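The following minimal Python sketch (using numpy; the variable names and seed are illustrative, and the original program was written for the IBM 360) generates one such table for the 9-point "markedly different" condition. Under the null hypothesis of chance agreement the two raters' assignments are drawn independently from their respective marginals.

```python
import numpy as np

# Simulated 9-point "markedly different" marginals from the text; Rater 2's
# underlying probabilities are the exact reverse of Rater 1's.
p_rater1 = np.array([.30, .25, .15, .10, .08, .03, .03, .03, .03])
p_rater2 = p_rater1[::-1]
k = len(p_rater1)
n_subjects = 2 * k * k              # the 2k^2 minimal sample size (162 when k = 9)

rng = np.random.default_rng(seed=0)

# Under the null hypothesis the raters are independent, so each subject's two
# ratings are sampled separately from the two marginal distributions.
r1 = rng.choice(k, size=n_subjects, p=p_rater1)
r2 = rng.choice(k, size=n_subjects, p=p_rater2)

# Cross-tabulate into a k x k agreement table (rows: Rater 1, columns: Rater 2).
table = np.zeros((k, k), dtype=int)
np.add.at(table, (r1, r2), 1)
print(table)
```

Repeating this draw 8,000 times for each combination of k, N, and marginal configuration reproduces the design described above.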
For each combination of N, k, and marginal configuration (as defined above), 8,000 tables (or runs) were generated at random by a program written for the IBM 360. Finally, the formulae for the rater agreement weights were the same as those utilized in the earlier Cicchetti and Fleiss (1977) research. These ranged between 1 (complete rater agreement) and 0 (complete rater disagreement, or being as far apart as the range of scale points will allow, e.g., 1-9 or 9-1 pairings on a 9-category ordinal scale of classification). These linear agreement weighting systems were derived earlier by Cicchetti (1976) and are given by the formula 1 - |i - j|/(k - 1), where i and j are the categories of assignment by Raters 1 and 2.
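For each simulated table the study evaluates weighted kappa against its large-sample null standard error. The Python sketch below implements the linear weights 1 - |i - j|/(k - 1) and the Fleiss, Cohen, and Everitt (1969) null variance in its commonly cited form; it is an illustration, not the code used in the original study.

```python
import numpy as np

def linear_weights(k):
    # Cicchetti (1976) linear agreement weights: w_ij = 1 - |i - j| / (k - 1),
    # i.e., 1 for exact agreement and 0 for the most extreme disagreement.
    i, j = np.indices((k, k))
    return 1.0 - np.abs(i - j) / (k - 1)

def weighted_kappa_null_test(table):
    """table: k x k array of joint counts (rows: Rater 1, columns: Rater 2)."""
    counts = np.asarray(table, dtype=float)
    n = counts.sum()
    p = counts / n                          # joint proportions
    pi = p.sum(axis=1)                      # Rater 1 marginal proportions
    pj = p.sum(axis=0)                      # Rater 2 marginal proportions
    w = linear_weights(p.shape[0])

    po = (p * w).sum()                      # weighted observed agreement
    pe = (np.outer(pi, pj) * w).sum()       # weighted chance-expected agreement
    kappa_w = (po - pe) / (1.0 - pe)

    # Large-sample null variance (Fleiss, Cohen, & Everitt, 1969):
    # Var0 = [ sum_ij p_i. p_.j (w_ij - (wbar_i. + wbar_.j))^2 - pe^2 ] / [ n (1 - pe)^2 ]
    wbar_i = w @ pj                         # wbar_i. = sum_j p_.j w_ij
    wbar_j = w.T @ pi                       # wbar_.j = sum_i p_i. w_ij
    num = (np.outer(pi, pj) * (w - (wbar_i[:, None] + wbar_j[None, :])) ** 2).sum()
    se0 = np.sqrt((num - pe ** 2) / (n * (1.0 - pe) ** 2))
    return kappa_w, se0
```

Applied to tables generated as above, the estimate of κw should fluctuate around zero with a spread close to SE0 whenever N is at least about 2k², which is what the study verifies.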
Results

The findings of this follow-up Monte Carlo investigation confirmed that (1) the normal approximation to the null distribution of weighted kappa is valid for 8 ≤ k ≤ 10 categories of ordinal classification; (2) the minimal number of cases required for the valid application of weighted kappa is still well approximated by the formula 2k²; and (3) the above results hold well even under the condition of markedly different simulated rater marginals. This means that the approximate minimal sample sizes required for the valid application of the weighted kappa statistic become, respectively: for k = 8, N = 125; for k = 9, N = 160; and for k = 10, N = 200.
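In application, the test implied by these results refers z = κw / SE0 to the standard normal distribution; the empirical tail areas in Table 2 assess how well that approximation holds. The small self-contained Python fragment below (an illustration, not code from the paper) shows this final step, with kappa_w and se0 as produced by the earlier sketch.

```python
import math

def normal_approximation_test(kappa_w, se0, alpha=0.05):
    # Refer z = kappa_w / SE0 to the standard normal distribution (null case).
    z = kappa_w / se0
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided, p_two_sided < alpha

# With N of at least about 2k^2, the simulations indicate that the empirical
# rejection rate under the null stays close to the nominal alpha level.
```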
Table 1
Central Moments of Null Distribution of κw for a 10-Category Ordinal Scale With Marked Differences in Rater Marginals
Note. Underlying marginal probabilities were .25, .25, .20, .15, .05, .02, .02, .02, .02, and .02 for Rater 1; and .02, .02, .02, .02, .02, .05, .15, .20, .25, and .25 for Rater 2.
Table 2
Empirical Tail Areas of Null Distribution of κw for a 10-Category Ordinal Scale With Marked Differences in Rater Marginals, for One-Sided and Two-Sided Intervals
Note. Underlying marginal probabilities were .25, .25, .20, .15, .05, .02, .02, .02, .02, and .02 for Rater 1; and .02, .02, .02, .02, .02, .05, .15, .20, .25, and .25 for Rater 2.
Since the findings of this computer simulation held for each category (8 ≤ k ≤ 10) and each condition of rater marginals (uniform, moderately different, and markedly different), the results are presented only for the 10-category ordinal scale under the stringent condition of markedly different rater marginals (Tables 1 and 2).
Discussion and Conclusions

The results of this follow-up investigation are quite straightforward. Viewed in conjunction with the results previously published by Cicchetti and Fleiss (1977), it can be concluded that the weighted kappa statistic (due to Cohen, 1968) can be validly applied in the null case, for scales of 3 ≤ k ≤ 10 categories, even under conditions in which the underlying rater marginals are quite markedly different, providing only that the minimal number of cases evaluated by any given pair of raters is at least of the order of about 2k². This produces approximate N's ranging from about 20 for three categories of classification to about 200 when the number of ordinal categories is 10. Thus, the implied conservative minimal N of 200 cases, irrespective of the number of k categories of classification (see Fleiss, Cohen, & Everitt, 1969), is only required when k = 10. As noted elsewhere (Cicchetti & Fleiss, 1977), this finding should be of comfort to research investigators utilizing the kappa statistics, since it is often difficult to obtain sample sizes of more than 200 cases.
References

Cicchetti, D. V. Assessing inter-rater reliability for rating scales: Resolving some basic issues. British Journal of Psychiatry, 1976, 129, 452-456.

Cicchetti, D. V., & Fleiss, J. L. Comparison of the null distribution of weighted kappa and the C ordinal statistic. Applied Psychological Measurement, 1977, 1, 195-201.

Cohen, J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 1968, 70, 213-220.

Fleiss, J. L., Cohen, J., & Everitt, B. S. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 1969, 72, 323-327.

Gelder, M. G., & Marks, I. M. Severe agoraphobia: A controlled prospective trial of behavior therapy. British Journal of Psychiatry, 1966, 112, 309-319.

Tyrer, P., & Alexander, J. Classification of personality disorder. British Journal of Psychiatry, 1979, 135, 163-167.
Tyrer, P., Alexander, M., Cicchetti, D. V., Cohen, M., & Remington, M. Reliability of a schedule for rating personality disorders. British Journal of Psychiatry, 1979, 135, 168-174.

Watson, J. P., Gaind, R., & Marks, I. M. Prolonged exposure: A rapid treatment for phobias. British Medical Journal, 1971, 1, 13-15.
Acknowledgments

This research was supported by the West Haven VA Medical Center (MRIS 1416). The author acknowledges the contributions of Joseph Vitale and Sandra Aivano, Yale University, in developing the computer programs used in this research, and Professor Joseph L. Fleiss for his collaboration in the preceding report, as well as his helpful critique of the present investigation.
Author's Address

Send requests for reprints or further information to Domenic V. Cicchetti, Ph.D., Senior Research Psychologist and Biostatistician, VA Medical Center, West Haven, CT 06516.